WEBVTT

00:01.690 --> 00:10.750
Till now we have discussed structured data. This type of data had different columns, and these

00:10.750 --> 00:20.260
columns contain categorical, numerical and ordinal values. Along with that, we used to have either

00:20.410 --> 00:27.790
a target value which was in class form, or one in numerical form, which would make it a

00:27.790 --> 00:28.360
regression problem.

00:29.560 --> 00:39.220
Now, let's see, instead of having a regression problem and instead of having a structured data, let's

00:39.230 --> 00:43.440
say we're working on a classification problem. Let's say

00:43.630 --> 00:46.210
we did not have such data.

00:46.630 --> 00:50.480
We had data which had a lot of text present in it.

00:51.230 --> 00:56.410
Now, we cannot really convert this entire text into different columns.

00:57.070 --> 01:03.880
And even if we were able to convert this into different columns, the machine would not be able to find

01:03.880 --> 01:06.750
out the meaning of different words in the text.

01:07.630 --> 01:15.160
So how do we actually handle this particular data, and how can we transform this particular data such

01:15.160 --> 01:17.990
that we can apply models on top of it?

01:18.670 --> 01:21.580
This is known as natural language processing.

01:21.850 --> 01:26.660
How we will process this entire text is what we will learn next.

01:26.920 --> 01:27.910
So let's go ahead.

01:31.520 --> 01:35.240
Now, what are the benefits of natural language processing?

01:36.350 --> 01:45.590
The benefits of natural language processing are these: natural language processing can be leveraged by companies

01:45.770 --> 01:54.920
to improve the efficiency of documentation processes, to improve the accuracy of documentation, and to identify

01:55.130 --> 02:04.370
the most pertinent information from large databases. Let's say there is an organization that wanted to find out

02:04.550 --> 02:10.980
which customers were satisfied and what number of customers were not really satisfied with the services.

02:11.920 --> 02:16.020
Now for that, we wanted to have a look at.

02:17.120 --> 02:25.670
So what we do is we take a complete collection of the tweets from our users and try to find out which

02:25.670 --> 02:32.570
user has tweeted something positive about the organization and which one has tweeted something negative

02:32.570 --> 02:33.810
about the organization.

02:34.730 --> 02:44.750
Then after that, from the negative feedback, what we try to do is we can seek out those specific keywords

02:44.870 --> 02:50.600
which were related to some products or services which our organization was providing.

02:51.380 --> 02:59.750
Let us say we make shoes and shirts so we can find out the negative comments from the tweets.

03:00.680 --> 03:05.140
Now, let's say we found out that all the shoes have some negative tweets.

03:05.630 --> 03:08.570
So from that particular text, we can

03:09.800 --> 03:18.330
extract those specific words which would shed light on what is actually wrong with our shoes.

03:19.310 --> 03:25.670
So this is something we can achieve with natural language processing, using machines which

03:25.670 --> 03:28.850
do not really understand what language is.

03:30.290 --> 03:38.990
So let us work further on it. Now, what is text mining? Text mining refers to the process of deriving

03:38.990 --> 03:41.600
high quality information from text.

03:42.330 --> 03:51.530
The overarching goal is essentially to turn text into data for analysis via application of natural

03:51.530 --> 03:52.430
language processing.

03:53.440 --> 03:59.860
So instead of having that structured data, we now have textual data.

04:00.280 --> 04:08.980
Now our task is to convert the textual data into that structured data, which we have been working along

04:08.980 --> 04:09.250
with.

04:12.480 --> 04:19.740
Let us talk about what structured data is and what unstructured data is. Structured data is any

04:19.740 --> 04:28.380
data which resides in a fixed field within a record or a file, for example, data in a database table,

04:28.920 --> 04:32.550
which is easy to enter, easy to store and analyze.

04:33.000 --> 04:36.630
We have been working with this kind of data for some time now.

04:37.530 --> 04:41.790
The unstructured data does not really reside in a traditional database.

04:42.240 --> 04:47.070
For example, images, video or audio files, web pages, presentations.

04:47.430 --> 04:50.310
These are all difficult and costly to analyze.

04:52.450 --> 04:59.500
So to our rescue, we have natural language processing. Now, there are a few basic concepts which I want

04:59.500 --> 05:00.320
you to understand.

05:01.120 --> 05:03.430
First one is tokenization.

05:04.300 --> 05:08.850
Tokenization is the process of converting text into tokens.

05:09.580 --> 05:12.190
So let us say I have a sentence.

05:12.400 --> 05:16.300
My house is very beautiful.

05:17.150 --> 05:26.720
So out of this sentence, I have to create several tokens. So the first token would be my, the next token

05:26.720 --> 05:31.230
would be house, then is, then very, then beautiful.

05:31.790 --> 05:36.350
So each and every word in this text is actually a token.

05:37.070 --> 05:45.710
So tokens are words or entities present in the text and tokenization is the process of converting the

05:45.710 --> 05:48.290
entire text into small tokens.

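NOTE
A minimal sketch of the tokenization step described above, assuming simple splitting on letters with Python's standard re module. The function name tokenize is hypothetical, and real NLP libraries apply more careful rules (contractions, numbers, hyphens).
```python
import re
def tokenize(text):
    # Lowercase the text and keep each run of letters or apostrophes as one token.
    return re.findall(r"[a-z']+", text.lower())
tokens = tokenize("My house is very beautiful.")
print(tokens)
```
Each word of the sentence becomes one token, and the trailing full stop is dropped.
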
05:49.830 --> 05:55.320
Different text objects could be a sentence, a phrase, a word or an entire article.

05:56.130 --> 06:04.170
Now, what we actually do in NLP is segment the words and sentences into different buckets.

06:04.740 --> 06:07.830
And after that, we pre-process this text.

06:08.400 --> 06:17.640
The preprocessing of this text includes removal of stop words, then stemming or lemmatization, and finally

06:17.760 --> 06:22.260
encoding these text words into different vectors.

06:22.710 --> 06:24.180
Now, let us see what this is.

06:24.750 --> 06:27.420
So let's say we have this particular text.

06:28.860 --> 06:34.110
The text is: my red socks are the prettiest socks in the country.

06:34.410 --> 06:37.760
No other red socks are prettier in the nation.

06:38.430 --> 06:47.010
Now, here, if you consider, we have several quotes, commas and full stops present. And, you know,

06:47.130 --> 06:51.630
if we consider these sentences, these sentences also have words

06:51.630 --> 06:54.450
which are not really of importance.

06:54.810 --> 07:04.800
For example, in, the, are: these are different words which we don't really need, because

07:05.400 --> 07:15.870
even if we say my red socks prettiest socks country, no other red socks nation, now, this would kind of

07:15.870 --> 07:22.890
convey the same thing, only we would be rejecting certain words which are not of importance.

07:25.540 --> 07:34.900
So the first step is to clean these sentences and remove those unwanted things, so we remove the stop words and

07:34.900 --> 07:35.790
punctuations.

07:36.130 --> 07:42.880
So now we have my red socks prettiest socks country, no other red socks nation.

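NOTE
A small sketch of the cleaning step just described: punctuation is dropped and stop words are filtered out. The STOP_WORDS set and the function name remove_stop_words are assumptions made for this example, not a standard list.
```python
import re
# Hypothetical stop-word list, just large enough for this example.
STOP_WORDS = {"are", "the", "in"}
def remove_stop_words(text):
    # Keeping only runs of letters removes commas, full stops and quotes.
    words = re.findall(r"[a-z]+", text.lower())
    # Keep every word that is not in the stop-word list.
    return [w for w in words if w not in STOP_WORDS]
text = "My red socks are the prettiest socks in the country. No other red socks are prettier in the nation."
cleaned = remove_stop_words(text)
print(cleaned)
```
Real stop-word lists are much longer and usually come from an NLP library.
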
07:45.530 --> 07:49.750
Now, the first process which we apply is stemming. Now,

07:49.850 --> 07:58.160
what is stemming? Stemming is the process of reducing words, generally inflected or derived, to their

07:58.160 --> 08:00.460
word stem or root form.

08:00.980 --> 08:09.680
The objective of stemming is to reduce related words to the same stem, even if the stem is not a dictionary

08:09.680 --> 08:10.010
word.

08:10.260 --> 08:17.690
Now we are converting different words into some form which is not actually a dictionary word, but

08:17.690 --> 08:20.550
something which is a root word.

08:20.780 --> 08:23.570
We are removing the suffix from this word.

08:23.920 --> 08:26.160
OK, so let us see an example.

08:26.300 --> 08:31.420
So here we have words from the English language, beautiful and beautifully.

08:32.210 --> 08:40.640
Now if you see, the meaning of beautiful and beautifully is a little different, but still, because

08:40.790 --> 08:47.120
the root word, that is b-e-a-u-t-i, is common in both of these words,

08:47.300 --> 08:58.760
these are stemmed to beauti, while the words good, better and best have the same meaning, but because

08:58.970 --> 09:07.520
they do not have a suffix kind of structure, where the same word is extended by a suffix, these are

09:07.520 --> 09:12.710
kept as good, better, best only. They are not changed by stemming.

09:13.280 --> 09:20.660
So stemming has no impact on these kinds of words, which do not share the same word stem.

09:23.310 --> 09:30.510
So in this particular example, my red blue socks, prettiest and prettier, both of these have the

09:30.510 --> 09:36.270
same root, so they will be converted into that root, that is pretti.

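NOTE
A toy suffix-stripping stemmer that reproduces the examples above. The SUFFIX_RULES list and the function name stem are hypothetical and much simpler than the rules in a real stemmer such as Porter's. It only illustrates that stems like beauti need not be dictionary words.
```python
# Hypothetical (suffix, replacement) rules, checked in order. Not the Porter rules.
SUFFIX_RULES = [("fully", ""), ("iest", "i"), ("ier", "i"), ("ful", ""), ("ly", "")]
def stem(word):
    for suffix, replacement in SUFFIX_RULES:
        # Only strip when at least three characters remain before the suffix.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    # No rule matched, so the word is returned unchanged.
    return word
words = ["beautiful", "beautifully", "prettiest", "prettier", "good", "better", "best"]
stems = [stem(w) for w in words]
print(stems)
```
Good, better and best pass through unchanged because no suffix rule applies to them.
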
09:40.620 --> 09:48.390
Now, lemmatization is the process of reducing a group of words into their lemma or dictionary

09:48.390 --> 09:56.250
form. It takes into account things like the part of speech of the words, and it also considers

09:56.250 --> 09:58.530
the context behind a particular word.

09:58.740 --> 10:03.330
So, for example, we have a sentence with beautiful and beautifully.

10:03.540 --> 10:09.930
So they will be lemmatized to beautiful and beautifully respectively, because the meaning behind them

10:09.930 --> 10:13.720
is different, because the part of speech is different.

10:14.270 --> 10:17.070
Whereas for good, better and best,

10:17.220 --> 10:24.390
they will be lemmatized to good, good and good, because the context behind good, better, best is the

10:24.390 --> 10:24.750
same.

10:25.050 --> 10:31.810
All three words actually mean good, just with different intensities.

10:31.950 --> 10:36.180
So it simply converts all three words into just one single word.

10:38.080 --> 10:46.180
So here, when we apply lemmatization to this particular sentence, prettiest and prettier, because

10:46.180 --> 10:54.010
both of these words are similar in context and are just differing in intensity, are converted

10:54.010 --> 11:02.500
into pretty, while the words country and nation, because they have similar meaning, will be converted

11:02.500 --> 11:03.430
into country.

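NOTE
A toy dictionary-lookup lemmatizer matching the examples above. The LEMMAS and SYNONYMS tables and the function name lemmatize are hypothetical. Merging nation into country is modeled here as a separate synonym mapping, which is an assumption of this sketch: standard lemmatizers return the dictionary form of the same word and do not merge synonyms on their own.
```python
# Hypothetical lookup table standing in for a real dictionary of lemmas.
LEMMAS = {"better": "good", "best": "good", "prettier": "pretty", "prettiest": "pretty"}
# Assumption: synonym merging is an extra, optional mapping applied after lemmatizing.
SYNONYMS = {"nation": "country"}
def lemmatize(word):
    # Words missing from the table are already in dictionary form.
    lemma = LEMMAS.get(word, word)
    return SYNONYMS.get(lemma, lemma)
words = ["beautiful", "beautifully", "good", "better", "best", "prettiest", "prettier", "nation"]
lemmas = [lemmatize(w) for w in words]
print(lemmas)
```
Unlike the stemmer, every output here is a real dictionary word, and beautiful and beautifully stay distinct.
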
11:06.090 --> 11:13.200
So this is what lemmatization and stemming are. Now, stemmers are typically easier to implement

11:14.460 --> 11:20.920
and run faster, and the reduced accuracy may not matter for some applications.

11:21.180 --> 11:29.880
So if you want to apply lemmatization or stemming, it is preferable to apply lemmatization if you

11:29.880 --> 11:32.330
are looking for a better result.

11:32.910 --> 11:42.630
But in case we are not looking for better results, but for a faster implementation or a simpler implementation,

11:42.900 --> 11:45.060
then we can go for stemming.

11:47.830 --> 11:56.680
Now, after we have applied stemming or lemmatization, what we do is convert these words

11:56.830 --> 11:58.360
into a vector form.

11:59.290 --> 12:03.650
So there are two ways of doing this.

12:03.910 --> 12:07.000
One is converting it into count features.

12:07.240 --> 12:08.870
So what are count features?

12:09.040 --> 12:16.840
So let's say we have those words, so we can create these kinds of features, these kinds of columns, where

12:16.840 --> 12:20.710
for each word we have the frequency of its occurrence.

12:21.830 --> 12:30.100
So here, in the sentence which we were discussing, my came once, red was occurring twice, blue twice,

12:30.100 --> 12:34.300
socks twice, pretty occurred twice, country twice,

12:34.520 --> 12:37.730
no once and other occurred once.

12:39.800 --> 12:49.400
Here in these sentences, the occurs once, red once, dog once and all the others are zero. So you

12:49.400 --> 12:52.590
can understand, like here, red, cat and eat.

12:52.830 --> 13:01.070
So here red has frequency one, cat has frequency one, eat has frequency one, and all other values

13:01.070 --> 13:02.370
are zero.

13:02.780 --> 13:05.060
So this kind of a structure would be created.

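NOTE
A minimal sketch of count features built with Python's collections.Counter. The four short documents are hypothetical stand-ins for the example on the slide, which is not fully recoverable from the audio. Each column is one vocabulary word and each row is one document.
```python
from collections import Counter
# Hypothetical, already preprocessed documents (one token list per document).
docs = [["the", "red", "dog"], ["cat", "eat", "dog"], ["dog", "eat", "food"], ["red", "cat", "eat"]]
# One column per distinct word, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc})
def count_vector(doc):
    counts = Counter(doc)
    # Counter returns 0 for words that do not occur in this document.
    return [counts[word] for word in vocabulary]
matrix = [count_vector(doc) for doc in docs]
print(vocabulary)
for row in matrix:
    print(row)
```
With only six distinct words the matrix already has six columns, so the width grows with the vocabulary size.
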
13:05.390 --> 13:13.600
Now, let us try to understand: this is a scenario where we have just a few words like the, red, dog, cat

13:13.700 --> 13:18.110
and eat, only such a small number of words.

13:18.110 --> 13:22.100
Only six words created a huge matrix like this.

13:22.730 --> 13:30.710
Now, if we are working with huge texts, then the matrix which would be created would be very large.

13:32.280 --> 13:41.760
So that is the reason why the first step we did was to remove the stop words and the punctuations, so that

13:41.760 --> 13:45.180
we could reduce the size of the matrix which would be created.

13:46.150 --> 13:54.660
Now, another thing is that when we are creating this matrix, we do not want words which occur very frequently

13:54.840 --> 13:58.820
or words which occur only a very small number of times.

13:59.550 --> 14:08.170
So for that, instead of creating a simple count vector, what we can do is create the TF-IDF

14:08.180 --> 14:08.720
vector,

14:09.630 --> 14:13.820
which is the term frequency-inverse document frequency.

14:15.100 --> 14:24.510
Now, what this does is give more weight to those words which are not occurring very frequently across all the documents

14:24.720 --> 14:26.790
and so are more informative about a particular document.

14:29.240 --> 14:37.610
So it will basically keep in consideration the term frequency. The term frequency is the number of occurrences

14:37.610 --> 14:38.410
of the term

14:39.510 --> 14:43.830
in one particular document. We take the product

14:44.740 --> 14:53.650
of it with the inverse document frequency, which is the total number of documents divided by the number of documents

14:53.650 --> 14:55.510
containing that specific term.

14:56.080 --> 15:04.180
So this is the formula associated with TF-IDF, which is applied so that our feature space can

15:04.180 --> 15:05.020
be created.

15:05.740 --> 15:14.710
And once we have either our count features or the TF-IDF features, we can attach the target column to them

15:14.860 --> 15:22.000
and use them with a simple classification model, or in whatever way we want to use them.

15:22.210 --> 15:29.080
Just as we have been doing with structured data, the same things can be

15:29.080 --> 15:35.920
implemented with this unstructured data, which has now been transformed into a structured form.
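NOTE
A minimal sketch of the TF-IDF weighting described above, on the same kind of hypothetical documents. The spoken formula gives the inverse document frequency as total documents divided by documents containing the term. Taking the logarithm of that ratio, as done here, is a common convention and an assumption of this sketch. Libraries also differ in smoothing and normalization.
```python
import math
from collections import Counter
# Hypothetical, already preprocessed documents (one token list per document).
docs = [["the", "red", "dog"], ["cat", "eat", "dog"], ["dog", "eat", "food"], ["red", "cat", "eat"]]
vocabulary = sorted({word for doc in docs for word in doc})
n_docs = len(docs)
# Document frequency: the number of documents that contain each word.
doc_freq = {word: sum(1 for doc in docs if word in doc) for word in vocabulary}
def tfidf_vector(doc):
    counts = Counter(doc)
    # tf is the raw count in this document, idf is log(N / df).
    return [counts[word] * math.log(n_docs / doc_freq[word]) for word in vocabulary]
tfidf = [tfidf_vector(doc) for doc in docs]
for row in tfidf:
    print([round(value, 3) for value in row])
```
In the first document, dog appears in three of the four documents and gets a lower weight than the, which appears in only one.
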

15:44.540 --> 15:48.130
Let us have a look at the code walkthrough in the next session.