WEBVTT

00:00.680 --> 00:05.250
In this session, we will discuss the Naive Bayes implementation.

00:05.510 --> 00:12.390
So for this particular implementation, I have chosen the Women's Clothing E-Commerce Reviews dataset.

00:12.950 --> 00:20.060
So for this particular dataset, again, I would be exploring the age group of females, the age group

00:20.420 --> 00:22.080
which bought what kind of clothes.

00:22.370 --> 00:26.330
The age group which bought what kind of product based on class name.

00:26.630 --> 00:32.300
Then what percentage of class names each department has, then the divisions,

00:32.540 --> 00:38.210
the different department names, and then the frequency of the words.

00:38.450 --> 00:41.240
And all of these details would be discussed inside this.

00:41.420 --> 00:48.640
So I hope you will understand all the code, as we have already discussed NLP.

00:48.660 --> 00:57.110
So this will give you more insight into how we implement data transformations and analysis

00:57.350 --> 00:59.230
on the actual data.

01:00.300 --> 01:07.680
So let us go ahead. The first thing which we will be working on is the importing of the

01:07.950 --> 01:08.690
libraries.

01:08.880 --> 01:17.400
So the libraries which we will be working with are pandas, NumPy, Matplotlib and pandas

01:17.400 --> 01:18.030
profiling.

01:19.860 --> 01:27.600
Now, the data which we will be importing is, again, the women's clothing data, and it does not have

01:27.600 --> 01:28.820
the index column.

01:29.130 --> 01:32.190
So now we have imported the data.

01:33.120 --> 01:39.630
We used pd.read_csv, and this is the data frame which we have created out of it.

01:41.190 --> 01:45.590
Now we have created a list of columns.

01:45.810 --> 01:50.520
So this is the column list, which contains the different column names,

01:50.520 --> 01:58.560
that is, clothing ID, age, title, review text, rating, recommended index, positive feedback, division name, department

01:58.560 --> 02:00.100
name and class name.

02:00.120 --> 02:05.830
So these are the different column names which we are going to assign to this particular dataset.

02:06.240 --> 02:12.890
So how we provide the column names is by using pd.DataFrame.

02:13.410 --> 02:15.560
In the DataFrame call,

02:15.570 --> 02:18.150
we give the data under data.

02:19.080 --> 02:26.730
Then we give the column names under columns. Then we can use info to view

02:27.030 --> 02:29.180
different details about these columns.

02:29.580 --> 02:37.050
So we have a total of twenty-three thousand four hundred and eighty-six rows of data.

02:39.180 --> 02:39.810
And.

02:41.140 --> 02:46.510
The first one, the clothing ID, is int64.

02:46.900 --> 02:52.140
Then we have age, which is again int64. Then we have title, which is object.

02:52.990 --> 02:55.590
We have review text, which is object.

02:56.290 --> 03:05.830
Then we have rating, which is int64, recommended index, int64, positive feedback, int64,

03:05.830 --> 03:11.860
division name, object, department name, object, and class name, object.

03:13.150 --> 03:21.580
The entire memory used for this data is more than 1 MB. And as we saw, it has twenty-three

03:21.580 --> 03:28.990
thousand four hundred and eighty-six rows of data, and there are nine columns in total.

03:29.170 --> 03:34.480
Now some of these entries are missing, like title, division name, department name and class name.

03:35.560 --> 03:39.700
So let's, first of all, view the data using the head method.

03:41.400 --> 03:46.270
So this is the data which we have. It has the clothing ID, which is the ID of the cloth,

03:46.740 --> 03:53.550
then the age or the age group of the person, then the title, which is

03:53.820 --> 03:56.670
the title of this particular review.

03:57.550 --> 04:06.220
Then the text of the review, then the rating that the person has given, and whether someone has recommended

04:06.220 --> 04:12.520
this particular cloth or not. If they have recommended, it says one; if someone has not recommended,

04:12.790 --> 04:14.500
then it says zero.

04:15.160 --> 04:21.660
Next is the positive feedback, that is, how many positive feedbacks this review has received.

04:22.300 --> 04:24.640
Then we have the division name.

04:24.880 --> 04:27.790
That is, what kind of cloth is it?

04:28.000 --> 04:31.640
Is it Intimates, General or General Petite?

04:32.230 --> 04:34.360
What type is it? Then

04:34.360 --> 04:38.260
the department name, that is, is it Intimate?

04:38.260 --> 04:39.310
Is it Dresses?

04:39.310 --> 04:40.390
Is it Bottoms?

04:40.390 --> 04:41.270
Is it Tops?

04:41.590 --> 04:46.950
Then there is the class name, which is again dresses, pants or blouses.

04:46.990 --> 04:47.710
What is it.

04:51.300 --> 04:59.630
Now, let us have a look at the numerical columns. The numerical columns are clothing ID, age,

05:00.150 --> 05:08.160
rating, recommended index and positive feedback. For clothing ID, we can see that we have the seventy-fifth percentile

05:08.220 --> 05:15.200
value at one zero eight and the maximum at one two zero five, but it doesn't really matter.

05:15.930 --> 05:17.520
Then we have age.

05:18.700 --> 05:26.500
Age has data from eighteen, then the twenty-fifth percentile at thirty-four, the fiftieth percentile value at forty-one,

05:26.500 --> 05:31.220
the seventy-fifth percentile value at fifty-two, and the maximum at ninety-nine.

05:31.240 --> 05:38.580
So there seems to be some kind of outliers present in the age of the people who are reviewing.

05:39.820 --> 05:44.410
Then we have the rating which ranges from one to five.

05:45.340 --> 05:50.960
And you can see that the twenty-fifth percentile rating is actually four.

05:51.130 --> 05:56.380
So this means that most of the people actually give a high rating.

05:56.410 --> 06:00.220
Most of the people are giving the rating, which is four.

06:00.460 --> 06:05.820
And there are very few people who are giving a rating of less than four.

06:05.830 --> 06:13.510
That is, at most twenty-five percent of people are giving a rating of less than four, since the twenty-fifth

06:14.110 --> 06:18.340
percentile already has the value four.

06:19.630 --> 06:27.410
Next is the recommended index. You can see the minimum value is zero and the twenty-fifth percentile

06:27.430 --> 06:32.300
is one, which again means that most of the clothes are actually recommended here.

06:32.920 --> 06:40.510
Then we have positive feedback. We can see that at the fiftieth percentile, the positive feedback

06:40.510 --> 06:41.320
is one.

06:41.320 --> 06:43.150
And then it goes above that.

06:43.660 --> 06:45.230
So below the fiftieth percentile,

06:45.250 --> 06:46.950
the value is usually zero.

06:48.320 --> 06:55.710
Now, let us check if there is any type of correlation between the user's rating and the review length

06:56.030 --> 06:58.000
of the reviews.

06:58.490 --> 07:07.460
So what we have done is we have taken the data frame's review text and we have created a new column, which

07:07.460 --> 07:08.390
is review text.

07:08.840 --> 07:12.460
And on it we have applied astype with string.

07:12.470 --> 07:16.880
So we have converted the review text to string.

07:17.660 --> 07:24.410
And for the review length, we have just applied len on this review text so that we can get the

07:24.530 --> 07:25.760
length of the review.

07:27.460 --> 07:36.040
Then we are just creating the facet grid, applying map on it and plotting the histogram of the

07:36.130 --> 07:39.610
review length with the bins set to 50.

07:40.730 --> 07:42.920
So let us have a look at this.

07:44.110 --> 07:49.120
So here we have the ratings for rating one.

07:50.290 --> 07:59.800
And we have the review length here. So you can see that for rating one here, the review length

07:59.890 --> 08:01.690
does not really matter much.

08:03.070 --> 08:04.660
It is almost the same.

08:05.760 --> 08:14.340
And if you go further, there is one peak, if you see here, and in all the plots there are these big peaks

08:14.340 --> 08:16.050
which are available now.

08:16.050 --> 08:19.570
These peaks are usually available.

08:20.370 --> 08:26.060
You can see that for ratings four and five, the peaks are higher.

08:26.310 --> 08:31.350
So it means that there are longer reviews

08:31.350 --> 08:34.460
if the ratings are higher.

08:34.860 --> 08:39.300
Now, from the above chart, we can see that users mostly gave a rating of five.

08:39.510 --> 08:42.000
There were more five ratings.

08:42.510 --> 08:47.760
And in fact, there are fewer users who gave ratings one and two.

08:48.180 --> 08:50.550
You can see there is very little amount of data.

08:51.030 --> 08:51.510
And you.

08:53.300 --> 09:01.300
Now, let us plot another plot. This plot we are creating between the rating and the review

09:01.310 --> 09:01.640
length.

09:02.730 --> 09:05.970
So if you compare, this is the review length.

09:07.040 --> 09:10.340
And these are the ratings, so you can see that the.

09:12.630 --> 09:21.180
There is not much difference, the review length is higher for ratings three and four.

09:23.350 --> 09:28.650
So from the above, we can conclude that ratings four and three have more lengthy reviews.

09:29.920 --> 09:36.130
These are higher in value, so these have more number of reviews, more lengthy reviews provided.

09:37.970 --> 09:47.210
Next, let us have a look at the ratings. We will group by rating and take the means, so we can see here.

09:48.580 --> 09:56.350
Here we will be finding out the correlation across the different ratings. So we have clothing

09:56.350 --> 10:02.510
ID, in which, for clothing ID, there is a more specific correlation with age.

10:03.190 --> 10:07.320
There is a high negative correlation with the clothing ID.

10:08.460 --> 10:17.880
Then for the recommended index, there is a low correlation. The review length has a high correlation. Here we have

10:17.880 --> 10:24.690
age and the review length, which have a strong negative correlation.

10:27.070 --> 10:36.490
So here we have clothing ID and age. Then after that, we have positive feedback and the recommended

10:36.490 --> 10:40.090
index, which have a high negative correlation.

10:41.160 --> 10:43.140
Then after that, we have.

10:44.890 --> 10:47.380
These are the only important ones in this.

10:49.230 --> 10:52.410
Now, let us have a look at this. This is the correlation chart.

10:55.640 --> 11:02.890
Now, here we can see that there is not much high correlation among the columns. The columns like

11:02.910 --> 11:15.110
review length and the columns like age and positive feedback have a strong correlation, and the numbers

11:15.110 --> 11:21.070
are zero point three or nine three in negative, which indicates that it is negatively correlated with the review length.

11:21.890 --> 11:26.210
So as the age grows, the length of the review decreases.

11:27.280 --> 11:31.270
So this is the data set, which we have again.

11:33.380 --> 11:45.230
Now, we will group the data with respect to the rating and the data frame column age, and we are

11:45.230 --> 11:49.010
just creating the bin values from zero to a hundred

11:50.880 --> 11:56.980
with a gap of 10, and we are just unstacking and plotting the bar chart for this.
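
NOTE
The grouping step described here can be sketched without pandas. This is a minimal standard-library illustration, not the notebook's own code; age_bin, rating_counts_by_age and the sample rows are hypothetical names and data, and the bins are assumed to be right-inclusive with a width of 10.
```python
from collections import Counter
def age_bin(age, width=10):
    # Right-inclusive bins such as (30, 40], so age 40 lands in "30-40".
    low = ((age - 1) // width) * width
    return f"{low}-{low + width}"
def rating_counts_by_age(rows, width=10):
    # rows is an iterable of (age, rating) pairs.
    # Returns a dict mapping each age bin to a Counter of ratings.
    table = {}
    for age, rating in rows:
        table.setdefault(age_bin(age, width), Counter())[rating] += 1
    return table
# Hypothetical sample rows, not taken from the dataset.
sample = [(34, 5), (40, 5), (41, 4), (18, 3), (35, 1)]
counts = rating_counts_by_age(sample)
```
Each Counter in counts corresponds to one stacked bar: the bin is the position on the x axis and the per-rating counts are the stacked segments.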

11:57.420 --> 12:04.230
So here you can see that for these different age groups, we have compared the ratings.

12:04.650 --> 12:08.010
So you can see that the blue one being the rating.

12:08.220 --> 12:19.320
So you can see that the five rating is given by most of the people. People aged between 30 and 40 are giving

12:19.320 --> 12:20.790
very high ratings.

12:20.970 --> 12:27.870
And usually most of the people are giving high ratings, rating five and rating four. There are very few

12:27.870 --> 12:30.420
people who are actually giving lower ratings.

12:32.340 --> 12:39.150
So from the above bar plot, we can see that the age group 10 to 20 gave fewer ratings; a smaller number

12:39.150 --> 12:40.970
of ratings were provided by them.

12:41.250 --> 12:46.430
And it is obvious that in this age group, teenagers generally don't care about online shopping and reviews.

12:46.710 --> 12:52.290
The age group 30 to 40 gave more five ratings as compared to all other age groups.

12:52.860 --> 12:56.850
And in fact, this is the age group who gave most reviews and ratings.

12:57.010 --> 13:01.470
And similarly, the age group of 70 did not care about the online shopping stuff.

13:01.770 --> 13:04.380
So this is what this plot displays.

13:04.590 --> 13:13.410
And as told earlier, the number of reviews, the positive reviews are higher in all the age groups.

13:16.350 --> 13:23.850
Now we will plot another chart between the department name and the age with another bar chart.

13:24.150 --> 13:32.970
So here we have created it, and we can see that the number of reviews is less for these

13:32.970 --> 13:33.450
age groups.

13:33.460 --> 13:35.690
So we don't have much information about them.

13:36.150 --> 13:41.090
But for the age groups from 20 to 70, we do have visibility.

13:41.490 --> 13:49.500
So you can see that Tops has a high number. Tops is something

13:49.500 --> 13:54.050
which has been sold the most and reviewed the most.

13:54.330 --> 13:58.110
And then at the second number, we have Dresses.

13:59.360 --> 14:06.020
This again has a large number, and after that we have the blue ones, that is, the Bottoms.

14:06.240 --> 14:14.730
So the most reviews were given about the Tops, followed by the Dresses and then followed

14:14.730 --> 14:15.270
by the

14:16.420 --> 14:17.110
Bottoms.

14:18.430 --> 14:23.770
So in the above bar plot, I want to concentrate on the department and the age group. The females

14:23.770 --> 14:30.640
aged from 20 to 70 were more active and bought more of the stuff online. From the above

14:30.680 --> 14:37.600
plot, we conclude that females were more focused on the Tops and Dresses departments and somewhat focused on

14:37.600 --> 14:39.760
Bottoms too, but not that much.

14:40.120 --> 14:43.060
They were less concentrated on the Trend department.

14:44.650 --> 14:46.920
So because that's not really visible here.

14:47.560 --> 14:48.130
So.

14:49.310 --> 14:57.740
Next, we will try to find a relationship between class name and the age, so we will have another

14:58.430 --> 14:59.390
plot for that.

14:59.810 --> 15:08.240
So here you can see that the Knits have the most sales.

15:10.240 --> 15:14.260
And similarly, the red one, which is the Dresses.

15:16.710 --> 15:21.780
So these are the most popular class names.

15:23.060 --> 15:29.330
Then we are creating a plot between the department name

15:31.260 --> 15:39.660
and the counts. So here you can see that the top-selling department is Tops, followed by

15:39.660 --> 15:45.690
Dresses, followed by Bottoms, then followed by Intimate, then followed by Jackets, and Trend has the minimum

15:45.690 --> 15:46.080
amount.

15:46.950 --> 15:52.740
So the above shows that the maximum entries are for Tops, which is around ten thousand five hundred, and then

15:52.740 --> 15:55.570
the Dresses department has around six thousand entries.

15:58.430 --> 16:04.580
So the above plot shows that the maximum number of entries is from Tops.

16:05.790 --> 16:12.390
Now, we will compare the division names with respect to the counts. The code which I have written

16:12.390 --> 16:15.990
for this is again the simple one, which we have already discussed.

16:16.260 --> 16:19.560
We simply use the seaborn library for this.

16:20.220 --> 16:24.780
So it is the title, the x label, the y label and the seaborn count plot.

16:24.840 --> 16:28.980
So I don't think there would be much problem understanding this.

16:32.260 --> 16:39.600
So let's go ahead, and you can see the count for General is very high, that is around fourteen thousand.

16:39.880 --> 16:47.530
Then the count for the General Petite division is around eight thousand, and the lowest one is for Intimates.

16:47.830 --> 16:50.450
That is around two thousand.

16:51.070 --> 16:57.850
So in our data, there are three divisions, that is, General, General Petite and Intimates, and the General division products

16:57.850 --> 17:02.620
were sold the most as compared to General Petite and Intimates.

17:03.710 --> 17:12.920
Now we will start working with the textual data. So from collections I have imported Counter, and from

17:12.920 --> 17:16.660
the NLTK library, I am importing the corpus.

17:16.940 --> 17:20.300
Now this corpus has the stop words in it.

17:20.600 --> 17:27.860
So again, repeating what stop words are: stop words are the words which don't really provide much meaning

17:27.860 --> 17:31.080
to our textual data.

17:31.100 --> 17:38.630
So we will try to remove them because once we remove the stop words, then a huge chunk of words get

17:38.630 --> 17:42.710
removed from the set of tokens, which we will be creating.

17:43.780 --> 17:49.390
So next is the tokenizer, which we will be importing from nltk.tokenize.

17:49.650 --> 17:52.510
Now, this is a regular expression tokenizer.

17:53.170 --> 17:57.590
And next, we will import the sentence tokenizer and the word tokenizer.

17:57.650 --> 18:02.510
So these are different types of tokenizers which we will be using.

18:02.830 --> 18:06.520
And finally, we are importing the word cloud

18:07.380 --> 18:10.620
and the word cloud stop words from the wordcloud library.

18:10.910 --> 18:12.960
You will see what the word cloud does for us.

18:14.410 --> 18:20.800
So now what we are doing is we are creating a variable, that is, top N.

18:22.400 --> 18:26.540
And we are converting the list of lists into text.

18:27.480 --> 18:36.420
So we are creating a particular variable which will hold the column review text.

18:37.290 --> 18:37.800
And.

18:38.890 --> 18:39.640
It will.

18:41.390 --> 18:51.350
contain the values separated by a space. Then after that, we will be removing the punctuation and numbers

18:51.350 --> 18:56.270
and returning the list of only words. Basically, we are selecting anything

18:57.280 --> 19:04.270
which has letters in it. And after that, we are removing all the stop words by simply getting

19:04.270 --> 19:09.340
the list of stop words from the English language.

19:09.610 --> 19:13.770
So stopwords.words gets all the stop words of the English language here.

19:14.940 --> 19:22.320
And we have got the English stop words. Now, what we will do is tokenize the words which we

19:22.320 --> 19:32.400
get. So we apply the word tokenizer and pass b, the cleaned text, into the word tokenizer, so we will

19:32.400 --> 19:34.000
get the word tokens here.

19:34.260 --> 19:42.180
And after that, we are just applying the loop, which will take each word from the word tokens

19:43.300 --> 19:51.580
and check if the word belongs to the stop words. If not, then it will go on and put it into the filtered sentence.

19:53.790 --> 20:01.800
After that, we are simply checking the words and appending the different words from the word tokens

20:02.540 --> 20:03.980
to the sentence.

20:04.620 --> 20:09.940
And here we are removing the words which have a length of less than two.

20:10.320 --> 20:18.810
So whatever words we have with a length of less than two, we are removing them and

20:18.810 --> 20:22.100
keeping only the words which have a length of more than two.

20:22.230 --> 20:27.110
So here we have just filtered on the length of the word being greater than two

20:27.490 --> 20:30.120
and added all those to the without-single-character list.

20:32.500 --> 20:41.440
Then we are removing the numbers from this particular data. So we are simply going through each word in

20:41.440 --> 20:47.350
the without-single-character list and checking if the word is not numeric; then we are putting the

20:47.350 --> 20:49.420
word into the clean data.

20:51.600 --> 20:57.620
Now, after this, we are simply calculating the frequency distribution. How we are doing this:

20:57.630 --> 21:05.910
we are using the frequency distribution function, which gets the clean data passed into it, and this gives the

21:06.270 --> 21:07.500
word distribution.

21:09.560 --> 21:16.760
Finally, here we are creating a data frame with the word distribution and getting the top hundred

21:16.760 --> 21:21.560
words from the most common words in it. Here we have defined top N to be a hundred.

21:21.740 --> 21:29.330
So here we are getting the top hundred words from this and creating the columns to be word and frequency.
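
NOTE
The cleaning and counting steps walked through here can be sketched with the standard library alone. This is a hedged illustration, not the notebook's code: Counter stands in for the NLTK frequency distribution, str.split stands in for the word tokenizer, and STOP_WORDS is a tiny hypothetical set instead of the full English stop word list. The name top_words and the sample reviews are also made up for the example.
```python
import re
from collections import Counter
# Tiny illustrative stop-word set; the lecture uses a full English list.
STOP_WORDS = {"the", "is", "and", "it", "this", "so", "i", "a", "in"}
def top_words(reviews, top_n=100):
    # Join all reviews into one lowercase text (the "list of lists into text" step).
    text = " ".join(str(r).lower() for r in reviews)
    # Keep letters only, which drops punctuation and digits.
    letters_only = re.sub(r"[^a-z]+", " ", text)
    tokens = letters_only.split()
    # Drop stop words, then words of length two or less.
    filtered = [w for w in tokens if w not in STOP_WORDS]
    long_enough = [w for w in filtered if len(w) > 2]
    # Numeric check kept to mirror the lecture; the regex above already removed digits.
    cleaned = [w for w in long_enough if not w.isnumeric()]
    return Counter(cleaned).most_common(top_n)
sample_reviews = ["I love this dress, it is so pretty!", "The dress fit is great.", "Love the size and the fit"]
result = top_words(sample_reviews, top_n=3)
```
The returned list of (word, frequency) pairs has the same shape as the two-column word and frequency table described in the session.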

21:30.400 --> 21:38.740
Now we are just plotting this using the bar plot. So here we have the words, and the top words are dress,

21:39.190 --> 21:44.320
love, size, top, fit, like and so on.

21:45.880 --> 21:52.510
So this above bar plot shows the frequency of the words in the review column, and the word dress

21:52.510 --> 21:57.040
appears the most in the text. Next to this, the word love comes second.

21:57.430 --> 22:03.520
And it is an indicator of positivity, that is, most of the reviews are positive, which we already

22:03.520 --> 22:11.140
saw: most of the reviews which have been given have a high rating, four or more, from

22:11.140 --> 22:12.770
the twenty-fifth percentile.

22:13.030 --> 22:18.690
So from that alone we can see that there are a lot of positive comments coming in.

22:20.000 --> 22:26.630
So here what we are doing is we are generating the word cloud. So for the word cloud, we are just defining

22:26.630 --> 22:28.670
the details of the word cloud here.

22:28.880 --> 22:32.140
So we are saying we want wc, and for the wc

22:32.150 --> 22:35.610
we are saying we want the word cloud object.

22:35.630 --> 22:38.210
Now we want to give the background color

22:39.420 --> 22:48.540
and the maximum number of words as a thousand and the maximum font size as 50, and then we are just joining

22:48.780 --> 22:56.640
the words, that is, we are just combining the data with a space, and then we are generating the word

22:56.640 --> 22:57.090
cloud.

22:58.940 --> 23:04.180
So this is the word cloud which we have generated, with the black background, for the most used words.

23:04.820 --> 23:07.330
So these are the word clouds.

23:07.520 --> 23:13.070
So here you can see what a word cloud is. A word cloud is basically a combination of all the words which

23:13.070 --> 23:15.300
are present in the data.

23:15.440 --> 23:25.160
So the most frequently occurring words will be placed in this chart and this horizontal and vertical

23:25.160 --> 23:26.570
words would be written here.

23:26.840 --> 23:31.540
The size of the word actually shows how frequently the word has been used.

23:31.760 --> 23:36.510
So you can see that the dress, the word dress is very large in size.

23:36.530 --> 23:38.780
So this is the most frequently used word.

23:39.080 --> 23:46.370
Then we have the word look. We have the word love, which is larger in size. Then we have fabric,

23:46.700 --> 23:48.230
pink color.

23:48.470 --> 23:49.160
Great.

23:49.400 --> 23:50.560
So sweaters.

23:51.080 --> 23:54.910
So these are the few words which are very frequently used in.

23:56.190 --> 24:06.990
Most used. Meanwhile, we have other words like nice, tried, fit, tight, blouse, work, design,

24:07.350 --> 24:12.230
day, love, true, size, but, front.

24:12.420 --> 24:18.620
So these are other words which have been used in the reviews but occur less frequently.

24:18.630 --> 24:20.600
That is why the sizes are so small.

24:22.140 --> 24:22.740
So.

24:24.220 --> 24:27.270
This is what is the biggest volume now.

24:27.290 --> 24:35.210
Next, what we will be doing is analyzing the sentiment of the data which we have. So for

24:35.210 --> 24:37.910
analyzing the sentiments, we are importing

24:38.660 --> 24:41.960
TextBlob from the textblob library.

24:43.320 --> 24:54.830
So we are creating a blank list, that is, the blob list description. Then there is another variable, which is the data

24:54.840 --> 24:59.780
frame review string, which contains the data frame's review text.

25:00.280 --> 25:03.810
So all the review text would be saved as string.

25:05.240 --> 25:13.310
Then we are going through the entire review list, one by one, row by row, and

25:14.280 --> 25:23.220
creating the TextBlob of each row, and this is the blob value. Now to the blob list, we are appending;

25:23.220 --> 25:26.010
this is the list which we have created, and to this

25:26.010 --> 25:29.220
we are appending the sentiment polarity.

25:29.460 --> 25:37.290
So the sentiment polarity and sentiment subjectivity will actually have the details of the sentiments

25:37.290 --> 25:38.340
of a particular row.

25:38.550 --> 25:45.960
That is, how positive the comment is, how negative the comment is, and how much neutral

25:45.960 --> 25:49.550
value is present in this particular review.

25:49.590 --> 25:54.090
So those things will be given by the sentiment, polarity and subjectivity.

25:54.510 --> 25:58.410
So we have this data frame.

25:58.410 --> 26:04.200
So in the data frame, we have given the blob list description and given the review, the sentiment

26:04.200 --> 26:05.910
and the polarity columns.

26:07.420 --> 26:15.420
So we have created this one function, which is checking the sentiment value in the polarity description.

26:15.760 --> 26:22.290
If this value is greater than zero, then we label it as a positive review.

26:22.600 --> 26:27.070
If the polarity is equal to zero, then we label it as a neutral review.

26:27.290 --> 26:32.810
If the value is neither greater than zero nor equal to zero, then it means that it is less than zero.

26:32.980 --> 26:35.410
So that means it is a negative review.
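
NOTE
The three-way rule just described can be written as a small function. This is a minimal sketch; label_review and the sample scores are hypothetical, and the polarity values are assumed to be floats in the range -1 to 1 as produced by a sentiment scorer.
```python
def label_review(polarity):
    # Greater than zero is positive, exactly zero is neutral, otherwise negative.
    if polarity > 0:
        return "Positive Review"
    if polarity == 0:
        return "Neutral Review"
    return "Negative Review"
# Hypothetical polarity scores, not computed from the dataset.
labels = [label_review(p) for p in (0.6, 0.0, -0.25)]
```
Applying such a function to every row gives the label column that the count plot is then drawn from.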

26:35.740 --> 26:38.620
And based on that, we are simply

26:40.160 --> 26:46.880
creating a plot for that, that is, a count plot of this. So this is the plot which shows that there

26:46.880 --> 26:54.800
are more than twenty thousand positive reviews and less than twenty-five hundred negative and neutral

26:54.800 --> 26:55.690
reviews present.

26:56.090 --> 27:05.030
Now, what you can do is you can explore the TextBlob library and see how polarity works, how subjectivity

27:05.030 --> 27:05.390
works.

27:05.630 --> 27:10.610
So these are different things which are present in sentiment analysis.

27:10.820 --> 27:12.500
So you can use these.

27:12.650 --> 27:16.310
and try to understand how these can be used.

27:16.520 --> 27:19.640
This is one example of using sentiment analysis.

27:19.820 --> 27:24.620
You can use several other libraries and make your own implementations.

27:28.750 --> 27:37.270
Now, here what we are doing is we are just getting the details and plotting the word cloud for the

27:37.510 --> 27:38.680
positive reviews.

27:38.860 --> 27:49.480
So for positive reviews, the word cloud has words like love, dress, fabric, ball, fit, look, really,

27:49.810 --> 27:50.100
true,

27:50.210 --> 27:55.420
size, well. So these are some positive words from the positive reviews.

27:56.660 --> 28:04.880
Next, we will test for the negative reviews. So here we have the negative reviews, and the cloud has little,

28:06.310 --> 28:08.560
quality, disappointed,

28:09.640 --> 28:14.410
look, small, material.

28:15.540 --> 28:26.580
Then we have length, want, at least. So these are different words which we can find from the negative

28:26.580 --> 28:33.110
reviews so you can similarly plot a word cloud for the neutral reviews.

28:33.300 --> 28:37.440
So there are different ways how you can use the sentiment analysis.

28:39.040 --> 28:39.940
Next is.

28:41.560 --> 28:48.300
We are creating another function, that is, text process. So here we are

28:48.300 --> 28:53.550
simply processing the data by removing the punctuation.

28:53.760 --> 29:02.550
So we are simply getting each word in the review and checking if the word is not in string

29:02.550 --> 29:03.370
punctuation.

29:03.600 --> 29:09.580
So string.punctuation is used to check if the word is a punctuation mark or not.

29:09.840 --> 29:10.230
So.

29:11.440 --> 29:19.240
It will just check and put all the non-punctuation in no-punc and join it with the space,

29:19.960 --> 29:29.290
then we are just returning the split one with the non-punctuation and with the stop words removed

29:29.290 --> 29:29.700
from it.

29:29.710 --> 29:33.670
So we are just removing the punctuation and stop words from this data

29:33.700 --> 29:36.900
set. Now we have the data here.

29:36.910 --> 29:39.250
These are the words which we have received.

29:39.610 --> 29:42.220
So the words are absolutely, wonderful,

29:42.230 --> 29:45.770
silky, sexy, comfortable, love, so pretty.

29:45.940 --> 29:51.730
So these are different words which we have in the review text now, from which the punctuation has been

29:51.730 --> 29:55.360
removed and the English stop words have been removed.
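
NOTE
A minimal sketch of a text-processing function of the kind described here, assuming a tiny hypothetical STOP_WORDS set instead of the full NLTK English list. The punctuation check is done character by character and the kept characters are joined back without a separator, which is one common way to write this step; the sample sentence is made up.
```python
import string
# Tiny illustrative stop-word set; the lecture uses the English stop word list.
STOP_WORDS = {"this", "its", "so", "and", "the", "is", "it"}
def text_process(review):
    # Keep every character that is not in string.punctuation.
    no_punc = "".join(ch for ch in review if ch not in string.punctuation)
    # Split on whitespace and drop stop words, compared in lowercase.
    return [w for w in no_punc.split() if w.lower() not in STOP_WORDS]
tokens = text_process("Love this dress! It's so pretty.")
```
Note that the original casing of the kept words is preserved; only the stop word comparison is lowercased.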

29:58.150 --> 30:07.060
So this is one type of implementation. We had done another type of implementation during the end of the

30:07.240 --> 30:10.430
session when we discussed how we can process the data.

30:10.630 --> 30:13.830
So I'm just showing different types of implementations.

30:13.830 --> 30:19.180
So again, this is so that you can try different implementations and decide which one you like.

30:19.210 --> 30:20.260
What would you like to do?

30:20.290 --> 30:25.080
And based on the problem you have, you can decide what implementations you want to do.

30:26.920 --> 30:29.880
So once we have this data,

30:31.210 --> 30:33.340
we will vectorize the data.

30:33.340 --> 30:42.880
So here we are using the vectorizer. In the earlier implementation, we had used the TF-IDF vectorizer.

30:43.120 --> 30:46.290
So for this one, I've chosen the count vectorizer.

30:46.600 --> 30:49.200
So let us use the count vectorizer now.

30:49.210 --> 30:57.760
So we are just simply separating the data and creating the X review data frame and the y data frame for

30:57.760 --> 30:58.030
this.

30:58.270 --> 31:07.060
So the X review data frame contains the review text and the y data frame contains the

31:07.150 --> 31:07.600
ratings.

31:08.740 --> 31:15.340
So that is what we want to predict from the review, because we want to classify the ratings here, right?

31:15.640 --> 31:23.040
So from sklearn feature extraction, we get the count vectorizer.

31:23.320 --> 31:28.150
So the first thing we will be doing is to fit the count vectorizer on the

31:29.940 --> 31:37.320
training data, right? The first thing which we do is to fit the vectorizer. No matter which vectorizer

31:37.320 --> 31:45.180
it is, whether it is a count vectorizer or a TF-IDF vectorizer, we will first fit the vectorizer

31:45.540 --> 31:49.260
on the input data, on the training data.

31:49.410 --> 31:51.930
And then we will use the

31:53.060 --> 31:57.090
transform function to actually transform the data.

31:58.620 --> 32:04.550
This is what we do because for any future data also, we will have to transform the data.

32:04.770 --> 32:10.510
This we have already discussed: because we have different domains, we have different types of

32:10.530 --> 32:12.460
words based on the requirement.

32:12.630 --> 32:18.450
So if we are talking about e-mail spam, then we will have different type of words which will need to

32:18.450 --> 32:26.910
be transformed, and there will be a different type of word list in, you know, a chemistry project, or there

32:26.910 --> 32:30.280
would be a different type of word in the banking domain.

32:30.540 --> 32:36.540
So based on the words which are actually used, we will be, first of all, fitting the transformer

32:36.540 --> 32:39.910
that is draining the transformer and then actually transforming.

32:41.040 --> 32:42.770
So that's what we do here.

32:43.050 --> 32:51.930
We create this BoW (bag-of-words) transformer and then use this BoW transformer on the vocabulary.

32:52.890 --> 32:54.200
So how would we do that?

32:54.570 --> 33:02.690
We will simply call the BoW transformer's transform to transform the X review data.
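
The fit-then-transform step described here can be sketched as follows. This is a minimal illustration on a made-up toy list, not the notebook's actual code; the names `X_review`, `bow_transformer` and `X_bow` are assumptions chosen to match the narration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the review text column (hypothetical data).
X_review = [
    "love this dress",
    "the fabric is itchy",
    "love the fabric",
]

# fit() learns the vocabulary from the text;
# transform() turns each text into a vector of word counts.
bow_transformer = CountVectorizer().fit(X_review)
X_bow = bow_transformer.transform(X_review)

print(sorted(bow_transformer.vocabulary_))  # the learned words
print(X_bow.shape)  # (number of reviews, vocabulary size)
```

Words that were not seen during fit are simply ignored at transform time, which is why the vectorizer has to be fitted on text from the same domain.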

33:04.150 --> 33:11.500
Now, we will be splitting the data. How do we split the data? We can use any of the methods, the

33:11.500 --> 33:15.040
one being train_test_split; we can use this also.

33:16.550 --> 33:18.890
And another method is

33:19.970 --> 33:28.940
cross-validation, that is, GridSearchCV or RandomizedSearchCV, so we could have used any method.

33:30.080 --> 33:31.860
I am using one method.

33:32.240 --> 33:41.720
Your task is, while practicing, to try GridSearchCV or cross-validation, so you have to try all the different

33:41.720 --> 33:45.340
variants so that you learn all the ways of doing it.

33:46.650 --> 33:54.570
So we have imported train_test_split from sklearn.model_selection.

33:55.840 --> 34:03.700
Now, what we will be doing is we will split the data X and y. Here, the test size is zero

34:03.700 --> 34:12.130
point three, which means that there will be 30 percent data in my testing dataset and 70 percent data

34:12.130 --> 34:13.570
in my training dataset.

34:15.040 --> 34:20.650
And from this I am getting X_train, X_test, y_train and y_test.
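
A small sketch of the 70/30 split, using placeholder data rather than the review vectors. The `random_state` value is an arbitrary assumption, added only to make the split repeatable.

```python
from sklearn.model_selection import train_test_split

# Ten placeholder samples with alternating labels (hypothetical data).
X = list(range(10))
y = [i % 2 for i in X]

# test_size=0.3 keeps 30 percent for testing and 70 percent for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

print(len(X_train), len(X_test))
```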

34:22.330 --> 34:23.200
Now, to

34:24.220 --> 34:29.650
predict the rating of the review, we will use the Naive Bayes machine learning algorithm.

34:30.970 --> 34:37.210
So for that, we will be importing MultinomialNB from the naive_bayes

34:38.330 --> 34:40.310
module of the sklearn library.

34:40.550 --> 34:48.470
So we are creating the object of Naive Bayes and fitting the training data into it. Once we have fitted the

34:48.620 --> 34:49.520
data into it,

34:49.880 --> 34:56.090
we will use nb.predict to predict the values of the test data.

34:56.360 --> 34:58.180
Now we get the predicted values.

34:58.370 --> 35:02.730
We will compare these predicted values with the actual y test values.

35:03.020 --> 35:08.420
So what we are doing, we are importing confusion_matrix and classification_report, and we are

35:08.420 --> 35:13.970
comparing y_test and the predictions using confusion_matrix and classification_report so

35:13.970 --> 35:15.410
that these could be generated.

35:15.800 --> 35:16.910
So these are the

35:18.130 --> 35:22.250
confusion matrix and classification report which we have got.
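
The train, predict and evaluate steps above can be sketched end to end like this. The reviews and ratings are a made-up toy set, so the scores it prints say nothing about the real dataset; the names `nb` and `predictions` are assumptions that follow the narration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy reviews with a rating of 5 (positive) or 1 (negative); hypothetical data.
reviews = [
    "love it great fit",
    "great dress love it",
    "itchy and cheap",
    "cheap fabric itchy tags",
] * 5
ratings = [5, 5, 1, 1] * 5

# Turn the text into word-count vectors, then split 70/30.
X = CountVectorizer().fit_transform(reviews)
X_train, X_test, y_train, y_test = train_test_split(
    X, ratings, test_size=0.3, random_state=101
)

# Fit Naive Bayes on the training part and predict the test part.
nb = MultinomialNB()
nb.fit(X_train, y_train)
predictions = nb.predict(X_test)

# Compare the predictions with the actual test ratings.
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
```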

35:23.140 --> 35:25.060
So here you can see that.

35:26.650 --> 35:34.900
The precision for the first class is zero point eight six one, the fifth one is zero point nine six

35:35.260 --> 35:36.120
and so on.

35:36.310 --> 35:41.350
So our model has achieved around ninety-five percent efficiency here.

35:41.900 --> 35:46.340
It means that the business can predict whether the users like the product or not.

35:47.020 --> 35:50.390
Now we will test the model with the data.

35:50.650 --> 35:53.730
So this is the review text the third one.

35:54.040 --> 35:56.980
So we are checking if the rating is positive or not.

35:57.250 --> 35:58.650
So let us check.

35:59.230 --> 36:02.390
So it says, "I love, love, love this jumpsuit.

36:02.410 --> 36:04.180
It is fun, flirty and fabulous.

36:04.400 --> 36:06.860
Every time I wear it, I get nothing but great compliments."

36:07.810 --> 36:11.340
So first I want to test with the positive review.

36:11.350 --> 36:13.650
So I have chosen the above rating.

36:13.840 --> 36:15.690
So this one has a rating five.

36:15.970 --> 36:16.980
So let us see.

36:17.590 --> 36:21.860
So we will transform this using the transformer.

36:22.420 --> 36:31.680
Then after transforming, we get the rating positive transformed, and we will now predict it using

36:31.690 --> 36:35.410
nb.predict. What is the value of the predicted one?

36:35.410 --> 36:36.340
It is five.

36:36.610 --> 36:39.300
So it is giving the correct predicted value.
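
Predicting the rating of a single new review can be sketched as below. The key point is that the new text goes through the same fitted vectorizer (transform only, no refit) before it reaches the model. The training data is a made-up toy set and `predict_rating` is a hypothetical helper, not something from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training set (hypothetical data): rating 5 is positive, 1 is negative.
train_reviews = [
    "love it great fit",
    "great dress love it",
    "itchy and cheap",
    "cheap fabric itchy tags",
]
train_ratings = [5, 5, 1, 1]

bow_transformer = CountVectorizer().fit(train_reviews)
nb = MultinomialNB().fit(bow_transformer.transform(train_reviews), train_ratings)

def predict_rating(text):
    """Vectorize one review with the already fitted transformer and predict its rating."""
    vector = bow_transformer.transform([text])
    return nb.predict(vector)[0]

print(predict_rating("love this jumpsuit"))
print(predict_rating("very itchy tags"))
```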

36:39.490 --> 36:41.520
Now we will test a negative comment.

36:41.740 --> 36:45.740
So this is a negative comment which we have got and we will predict this.

36:45.850 --> 36:51.580
This says, "Three tags sewn in, two small and one huge.

36:52.130 --> 36:53.230
Very itchy,

36:53.230 --> 36:54.430
so I cut them out.

36:54.430 --> 36:58.330
Then the thread left behind was plasticky and even more

36:58.330 --> 37:05.620
itchy. How can you make an intimate with such tags? Not comfortable," and different views.

37:05.860 --> 37:12.260
So we get the rating negative, and we just transform the data and then predict the value.

37:12.280 --> 37:14.050
And it comes out to be one.

37:15.000 --> 37:20.650
So now we will predict if the item will be recommended or not.

37:20.970 --> 37:27.390
So what we are doing, we are getting the X recommend and y recommend, and again, we will do the

37:27.390 --> 37:27.980
same thing.

37:27.990 --> 37:37.680
We will fit a vectorizer on top of it because, again, for this predicted

37:37.680 --> 37:40.500
value, these will now have a different type of words.

37:40.500 --> 37:41.310
But as in trade.

37:41.550 --> 37:51.360
So we will again fit another vectorizer on it and then transform using this

37:51.570 --> 37:52.440
vectorizer.

37:53.990 --> 38:02.810
I do the same thing, that is, split the data using train_test_split, fit the Naive Bayes, and we will call

38:02.870 --> 38:09.790
nb.predict to find out the predicted values.

38:10.010 --> 38:15.020
And then we will get the confusion matrix and classification report out of this.

38:15.230 --> 38:19.090
Now, for this one, the model achieved eighty-seven percent efficiency.

38:19.280 --> 38:23.570
So now we test again for the rating positive and negative.

38:24.500 --> 38:29.600
So for a positive rating, it gives one, that is, it is recommended.

38:30.870 --> 38:38.460
So the user has actually recommended this. Now, in the above block, the model predicted correctly, that

38:38.460 --> 38:44.190
is, it was a positive review, so it was recommended. Now, for the negative review,

38:44.430 --> 38:45.450
it is giving zero.

38:45.450 --> 38:47.500
That is, it will not be recommended.

38:47.730 --> 38:57.240
So like this, you can use any classification algorithm to implement actual data classification.

38:57.990 --> 39:01.860
And Naive Bayes is a very good algorithm to implement this.

39:02.040 --> 39:05.500
You can implement this using SVM also. While

39:05.500 --> 39:07.320
we will be working with SVM,

39:07.320 --> 39:10.500
we will again implement it on a different kind of data,

39:11.370 --> 39:16.940
so that you can get hold of how we shall transform the data and learn more from it.
