WEBVTT

00:01.140 --> 00:05.590
Hi all, let us have a look at the solution for this particular problem.

00:06.150 --> 00:12.670
So first of all, we have imported the required libraries and we have imported the data set as well.

00:13.110 --> 00:20.850
So the dataset contains two columns: one is the review and the other one is the sentiment related to

00:20.850 --> 00:20.990
it.

00:21.480 --> 00:25.310
Now, the sentiment is currently written as positive and negative.

00:25.320 --> 00:28.620
We need to convert that into a form that is 0 or 1.

00:31.100 --> 00:41.360
So what we can see here is that there are a total of 50,000 rows of data. When we talk about the reviews, there are

00:41.870 --> 00:50.560
around 49,500 unique reviews, and the top review, the one with the top number of occurrences,

00:50.570 --> 00:57.340
is this one, starting with "Loved", and the sentiment related to the top one is positive.

00:58.980 --> 01:01.240
But that sentiment is positive.

01:03.360 --> 01:06.760
And as you can see, the frequency of this is 25000.

01:06.780 --> 01:10.200
So you can see that there is an equal distribution of the labels.

01:10.220 --> 01:11.850
So this is not imbalanced data.

01:12.100 --> 01:13.540
It is completely balanced.

01:13.560 --> 01:18.660
For the denoising, we are using BeautifulSoup here.

01:19.200 --> 01:21.240
You can use any other parser.

01:21.240 --> 01:22.860
You can use any other method.

01:23.790 --> 01:28.170
What I have done here is I have used the HTML parser.

01:30.220 --> 01:33.590
This will read the text in HTML form.

01:34.780 --> 01:44.260
So what I have done is I will be removing all the square brackets from my text, and after removing them, I

01:44.260 --> 01:52.690
will be removing all the noisy text, and I will apply this particular function

01:55.210 --> 01:58.030
on this data.

01:59.200 --> 02:07.010
So I will apply the denoising, and it will apply this remove between square brackets and everything.

02:07.360 --> 02:09.870
So this is the final text I have.

02:11.710 --> 02:18.010
And later, I have also removed special characters from my reviews because I don't want to keep any

02:18.010 --> 02:19.120
special characters.
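
NOTE
A minimal sketch of the cleaning steps described above, under stated assumptions.
The helper names are hypothetical, and a plain regular expression stands in for
the BeautifulSoup HTML parser so that the example needs only the standard library.
```python
import re
# Hypothetical helper names; the regex below is a stand-in for BeautifulSoup's HTML parser.
def strip_html(text: str) -> str:
    """Replace anything that looks like an HTML tag (e.g. <br />) with a space."""
    return re.sub(r"<[^>]+>", " ", text)
def remove_between_square_brackets(text: str) -> str:
    """Remove square brackets together with whatever they enclose."""
    return re.sub(r"\[[^\]]*\]", "", text)
def remove_special_characters(text: str) -> str:
    """Keep only ASCII letters, digits and whitespace."""
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)
def denoise_text(text: str) -> str:
    """Apply the three cleaning steps, then collapse repeated whitespace."""
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    text = remove_special_characters(text)
    return re.sub(r"\s+", " ", text).strip()
print(denoise_text("Great movie!<br /><br />[spoiler] Loved it :)"))
```
On a pandas DataFrame this kind of function would typically be applied to the
review column with something like df["review"].apply(denoise_text).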

02:20.260 --> 02:27.750
Next, what we do is we will be using LabelBinarizer, which will basically create labels for us.

02:28.300 --> 02:31.480
So it will give positive and negative labels to us.

02:34.680 --> 02:43.110
So we have used fit_transform because we want to simply fit this LabelBinarizer and apply

02:43.110 --> 02:43.380
it.

02:43.410 --> 02:45.540
We don't want to keep it for further use.

02:45.720 --> 02:52.590
If you want to keep it for further use, you will use fit and then you will use transform for each transformation

02:52.590 --> 02:53.300
you want to have.
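
NOTE
A small sketch of the two options just described, assuming scikit-learn is
installed. The toy sentiment list is made up for illustration only.
```python
from sklearn.preprocessing import LabelBinarizer
# Toy data standing in for the sentiment column.
sentiments = ["positive", "negative", "negative", "positive"]
# Option 1: fit and apply in a single step.
lb = LabelBinarizer()
labels = lb.fit_transform(sentiments)  # column vector of 0/1 values
print(labels.ravel())
# Option 2: fit once, keep the fitted object, and call transform whenever needed.
lb2 = LabelBinarizer().fit(sentiments)
new_labels = lb2.transform(["negative", "positive"])
print(new_labels.ravel())
```
Classes are ordered alphabetically, so "negative" maps to 0 and "positive" to 1.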

02:54.970 --> 02:58.450
Next, what we do is we get the sentiments.

03:01.380 --> 03:14.060
So 40,000 sentiments go into the train sentiments, and then the respective reviews, then the next 10,000 into the test

03:14.220 --> 03:14.940
sentiments.

03:16.200 --> 03:20.870
This is basically my data splitting into train and test.
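
NOTE
The split described above can be sketched with plain slicing. The lists below
are toy stand-ins for the cleaned reviews and their 0/1 labels, not the real data.
```python
# Toy stand-ins for the 50,000 cleaned reviews and their 0/1 labels.
reviews = [f"review {i}" for i in range(50_000)]
sentiments = [i % 2 for i in range(50_000)]
# First 40,000 rows for training, the remaining 10,000 for testing.
train_reviews, test_reviews = reviews[:40_000], reviews[40_000:]
train_sentiments, test_sentiments = sentiments[:40_000], sentiments[40_000:]
print(len(train_reviews), len(test_reviews))
```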

03:22.940 --> 03:31.280
Next, what I do is I create this TF-IDF vectorizer. You could use any other vectorizer; a count vectorizer

03:31.920 --> 03:33.270
you can use as well.

03:34.280 --> 03:37.210
What we do is we use the TF-IDF vectorizer.

03:37.250 --> 03:39.740
The minimum document frequency is 20.

03:40.040 --> 03:42.770
Maximum document frequency is zero point five.

03:43.010 --> 03:43.850
That means 50 percent

03:44.020 --> 03:52.130
of the documents: at most, a word can occur in 50 percent of the entire reviews.

03:53.630 --> 03:55.440
Next is the n-gram range.

03:56.270 --> 03:59.060
So this is the range which I want to keep.

03:59.090 --> 04:02.200
So I will be involved in parallel.

04:02.210 --> 04:03.740
So a unigram is a single word.

04:06.060 --> 04:15.700
Next, we do the fit_transform of this particular TF-IDF vectorizer on the train.

04:16.920 --> 04:23.730
So on the training reviews we use fit_transform and we use transform on the test reviews, so we

04:23.730 --> 04:25.170
do the transformations.

04:25.170 --> 04:28.410
We have converted them into TF-IDF vectors.

04:30.060 --> 04:34.260
Now you can see the details.

04:34.330 --> 04:43.140
The shape of the train reviews is 40,000 by 60,356, and the test reviews also have 60,356 columns.

04:43.500 --> 04:47.190
This means that you have around 60,000

04:49.760 --> 05:00.410
columns after the TF-IDF, because the size of the TF-IDF matrix is 60,356 columns and 40,000

05:00.710 --> 05:01.090
rows.

05:01.380 --> 05:07.810
Now, again, this is in the form of a sparse matrix, because when we use any TF-IDF vectorizer or

05:07.820 --> 05:10.730
count vectorizer, the result it gives us is a sparse matrix.
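
NOTE
A hedged sketch of the vectorizing step, assuming scikit-learn and SciPy are
installed. The talk uses min_df=20 and max_df=0.5 on 40,000 reviews; min_df=1 is
used here only because the toy corpus is tiny, and ngram_range=(1, 2) is an
assumption, since the exact range is not clear from the recording.
```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
# Toy corpora standing in for the 40,000 train and 10,000 test reviews.
train_reviews = ["great movie", "boring plot", "great acting", "awful and boring"]
test_reviews = ["great plot", "awful movie"]
# max_df=0.5 drops terms that occur in more than 50 percent of the documents.
tv = TfidfVectorizer(min_df=1, max_df=0.5, ngram_range=(1, 2))
tv_train = tv.fit_transform(train_reviews)  # learn the vocabulary on train only
tv_test = tv.transform(test_reviews)  # reuse that vocabulary on test
print(tv_train.shape, tv_test.shape, issparse(tv_train))
```
Both matrices share the same number of columns because the test reviews are
transformed with the vocabulary learned from the training reviews.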

05:12.360 --> 05:18.150
Now we will be applying logistic regression. You can apply any other algorithm you want; you can

05:18.150 --> 05:21.790
use naive Bayes, you can use SVM, it's completely up to you.

05:22.140 --> 05:23.750
This also performs well.

05:24.030 --> 05:25.470
We can use any algorithm.

05:26.040 --> 05:29.400
So I'm simply applying logistic regression and fitting it.

05:29.400 --> 05:34.020
And then I'm finding out the model score on the train; it comes out to be 0.94.

05:34.290 --> 05:42.420
Now, this is a very simplistic implementation, and the model score on the test comes out to be 90

05:42.420 --> 05:42.960
percent.
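
NOTE
A minimal sketch of fitting and scoring the model, assuming scikit-learn is
installed. The reviews, labels and hyperparameters are toy values chosen for
illustration, so the scores will not match the 0.94 and 0.90 quoted in the talk.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# Toy data: 1 = positive, 0 = negative.
train_reviews = ["great movie", "excellent acting", "wonderful story", "boring plot", "awful acting", "terrible story"]
train_labels = [1, 1, 1, 0, 0, 0]
test_reviews = ["great story", "boring movie"]
test_labels = [1, 0]
tv = TfidfVectorizer()
X_train = tv.fit_transform(train_reviews)
X_test = tv.transform(test_reviews)
# max_iter and random_state are illustrative choices, not values from the talk.
lr = LogisticRegression(max_iter=500, random_state=42)
lr.fit(X_train, train_labels)
print("train score:", lr.score(X_train, train_labels))
print("test score:", lr.score(X_test, test_labels))
```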

05:45.550 --> 05:54.280
Next, I print the confusion matrix for this and the classification report, which shows that it is performing

05:54.280 --> 06:01.500
pretty well for us. Next, what we're doing is we are mapping these features to their coefficients.

06:01.690 --> 06:04.960
So these are the different features that we have.

06:04.960 --> 06:14.230
And we have also got the best words, which we have in the descriptions in these columns.

06:14.530 --> 06:17.870
So they are great, excellent, perfect, wonderful, best.

06:17.890 --> 06:22.550
So these are the most exciting, positive words which we have in our reviews.

06:23.500 --> 06:29.480
The most negative words are bad, worst, awful, waste, boring.

06:30.760 --> 06:41.950
Now we are also getting the part of speech of the words and the positive and negative words.

06:42.880 --> 06:49.840
So we have basically got the words and sorted them, and got the features.

06:49.840 --> 06:51.220
That is the column names.

06:51.700 --> 06:54.970
And we have simply got the details from this.

06:55.540 --> 06:58.280
And we have also created a word cloud here.

06:58.280 --> 07:01.930
And this is the word cloud, which I have created from the words.

07:04.670 --> 07:08.840
From all the positive and all the negative words, we have just concatenated all the words

07:08.840 --> 07:15.260
here, and we have been concatenating all the words because a word cloud you can create only on an entire sentence

07:15.260 --> 07:16.730
or a continuous text.

07:17.570 --> 07:19.400
So we have created a word cloud.

07:19.460 --> 07:27.410
Actually, here you can see how this entire text is present and what are the most occurring words in

07:27.410 --> 07:28.850
the positive reviews.

07:28.850 --> 07:37.370
So in the positive reviews, the most occurring ones are excellent, well, best, great,

07:38.060 --> 07:40.130
brilliant, wonderful, funniest.

07:40.640 --> 07:43.220
See, these are the good ones which you find in the reviews.

07:43.710 --> 07:52.040
When we talk about the bad reviews or the negative reviews, you can see boring, waste, terrible, worst,

07:52.040 --> 07:59.690
ridiculous, dull, bad, boring, awful, poor quality.

07:59.930 --> 08:01.760
These are the words which are getting highlighted.

08:02.520 --> 08:08.000
You can see what are the major words which are actually giving this result to us.

08:10.000 --> 08:11.740
So this is the implementation.

08:12.160 --> 08:18.730
This is a very simple implementation, and you can use many different methods for it and this can be

08:18.730 --> 08:21.900
up to your choice and how you want to implement.

08:21.910 --> 08:28.090
This is one of the most simplistic implementations which we have in text classification.

08:28.390 --> 08:28.780
Thank you.