WEBVTT

00:00.780 --> 00:03.600
One obvious thing you might like to do when using ChatGPT

00:03.600 --> 00:05.580
is to get it to classify data for you

00:05.580 --> 00:07.770
into specific types of labels.

00:07.770 --> 00:09.720
For example, if you're looking at reviews,

00:09.720 --> 00:11.760
maybe that's whether the review was positive

00:11.760 --> 00:13.530
or whether the review was negative.

00:13.530 --> 00:15.060
Let's have a look at how we can do this

00:15.060 --> 00:18.420
using LangChain and the ChatGPT API.

00:18.420 --> 00:20.336
First, you're gonna have to install LangChain

00:20.336 --> 00:21.750
and LangChain OpenAI.

00:21.750 --> 00:24.390
And then what we're gonna do is import the chat models,

00:24.390 --> 00:28.020
ChatOpenAI, SystemMessage, and HumanMessage.

00:28.020 --> 00:29.340
After running that as well,

00:29.340 --> 00:32.220
you'll also need to set your OPENAI_API_KEY

00:32.220 --> 00:34.170
if you haven't done so already.

00:34.170 --> 00:35.003
As you'll see here,

00:35.003 --> 00:36.960
we've got a list of different types of reviews

00:36.960 --> 00:39.240
where each review is based on a movie.

00:39.240 --> 00:41.640
It says things like I love or I hate this movie,

00:41.640 --> 00:43.350
or this movie's a waste of time.

00:43.350 --> 00:48.000
And so what we wanna do is build a classifier using ChatGPT

00:48.000 --> 00:50.940
that allows us to get structured labeled data

00:50.940 --> 00:52.530
from this review text.

00:52.530 --> 00:53.790
So we've got our Python list,

00:53.790 --> 00:55.380
which is a list of dictionaries.

00:55.380 --> 00:57.960
We can instantiate our ChatOpenAI model,

00:57.960 --> 01:01.050
and then we can store our classifications in a Python list.

01:01.050 --> 01:03.390
We'll loop over each individual review

01:03.390 --> 01:04.290
and then we'll just print

01:04.290 --> 01:06.300
that we're gonna classify that review.

01:06.300 --> 01:08.490
And then for each individual review,

01:08.490 --> 01:10.657
we're gonna have a system prompt which says,

01:10.657 --> 01:13.770
"You're responsible for the classification of movie reviews.

01:13.770 --> 01:15.540
Please classify the following review

01:15.540 --> 01:17.340
as positive or negative.

01:17.340 --> 01:19.590
You must use only the following words:

01:19.590 --> 01:21.540
negative or positive."

01:21.540 --> 01:22.890
And then we've got the HumanMessage

01:22.890 --> 01:26.610
which contains the content of that specific review.

01:26.610 --> 01:29.220
We're also saying if the content

01:29.220 --> 01:31.980
that we get back from the LLM is not negative or positive,

01:31.980 --> 01:32.850
then we're throwing an error

01:32.850 --> 01:34.590
because we've got an invalid classification.

01:34.590 --> 01:36.840
And that's something we'd look at fixing

01:36.840 --> 01:39.120
on an individual basis.
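
NOTE
A minimal sketch of the loop described above, not the notebook code from the video,
which this transcript does not show. The "content" dictionary key, the helper names,
and temperature=0 are assumptions. Tolerating stray whitespace, capitalisation, or a
trailing period in the reply is also an added convenience rather than something the
video states. The model call is passed in as a function so the loop can be tried
without an API key.
```python
# Sketch only: the "content" key, helper names, and temperature=0 are assumptions.
VALID_LABELS = {"positive", "negative"}
SYSTEM_PROMPT = (
    "You're responsible for the classification of movie reviews. "
    "Please classify the following review as positive or negative. "
    "You must use only the following words: negative or positive."
)
#
def normalize_label(raw):
    """Trim and lowercase the reply; raise if it is not an allowed label."""
    label = raw.strip().lower().rstrip(".")
    if label not in VALID_LABELS:
        raise ValueError(f"Invalid classification: {raw!r}")
    return label
#
def classify_reviews(reviews, ask_model):
    """Classify each review dict using ask_model(system_prompt, review_text)."""
    classifications = []
    for review in reviews:
        print(f"Classifying review: {review['content']}")
        reply = ask_model(SYSTEM_PROMPT, review["content"])
        classifications.append(normalize_label(reply))
    return classifications
#
def make_openai_asker():
    """Build an ask_model function backed by ChatOpenAI (needs OPENAI_API_KEY)."""
    # Imported lazily so the rest of the sketch runs without LangChain installed.
    from langchain_core.messages import HumanMessage, SystemMessage
    from langchain_openai import ChatOpenAI
    chat = ChatOpenAI(temperature=0)  # assumption: low temperature for repeatable labels
    def ask(system_prompt, review_text):
        response = chat.invoke([
            SystemMessage(content=system_prompt),
            HumanMessage(content=review_text),
        ])
        return response.content
    return ask
```
With an API key set, classify_reviews(reviews, make_openai_asker()) would run the
real model. Any function taking the system prompt and the review text works in its place.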

01:39.120 --> 01:41.100
And so I can run this

01:41.100 --> 01:42.570
and you'll see that we're actually generating

01:42.570 --> 01:43.950
different types of classifications

01:43.950 --> 01:45.750
for all of these types of reviews.

01:45.750 --> 01:47.580
And then we can go and have a look

01:47.580 --> 01:49.140
at the response of this afterwards.

01:49.140 --> 01:50.940
So it's gonna print this out so you can see

01:50.940 --> 01:53.610
it's looping through each individual classification.

01:53.610 --> 01:55.800
And then what we're gonna have at the end of this

01:55.800 --> 01:58.503
is a way for us to classify different reviews.

01:59.580 --> 02:00.780
So we finally finished

02:00.780 --> 02:02.730
and we've got all of our different reviews.

02:02.730 --> 02:06.734
One thing we could also do is we could import pandas

02:06.734 --> 02:08.820
and then we could have a data frame

02:08.820 --> 02:13.820
and we could zip up the reviews with the classifications.

02:14.940 --> 02:16.230
And so what you'll see here now

02:16.230 --> 02:18.090
is we are just gonna import pandas

02:18.090 --> 02:19.650
and this is gonna make a data frame

02:19.650 --> 02:21.330
where for each of the reviews,

02:21.330 --> 02:23.520
we're then gonna zip that up with the classifications

02:23.520 --> 02:24.450
that came back.

02:24.450 --> 02:26.130
And now you can see we've got an easy

02:26.130 --> 02:28.740
nice pandas data frame that we can save,

02:28.740 --> 02:30.030
so I'm gonna save that,

02:30.030 --> 02:31.290
and then we've got this data frame

02:31.290 --> 02:32.820
that we can then manipulate.

02:32.820 --> 02:34.530
And then we can say, this is the review

02:34.530 --> 02:35.940
and this is the classification.

02:35.940 --> 02:38.340
And we could do things like

02:39.360 --> 02:41.040
count the number of these classifications,

02:41.040 --> 02:44.040
so we can see that we had a total of 32 classifications.
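
NOTE
A rough sketch of the pandas step described above. The three reviews and their
labels are made-up placeholder values, not the 32 results from the video, and the
"content" key and column names are assumptions.
```python
import pandas as pd
# Placeholder data for illustration only (hypothetical reviews and labels).
reviews = [
    {"content": "I love this movie"},
    {"content": "I hate this movie"},
    {"content": "This movie is a waste of time"},
]
classifications = ["positive", "negative", "negative"]
# Zip each review with the label that came back for it, one row per review.
df = pd.DataFrame(
    [
        {"review": review["content"], "classification": label}
        for review, label in zip(reviews, classifications)
    ]
)
# Count how many reviews received each label.
counts = df["classification"].value_counts()
print(len(df))
print(counts)
# To save the result, something like df.to_csv("classified_reviews.csv", index=False) would work.
```
Because zip pairs items by position, this relies on the classifications list being
in the same order as the reviews list.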

02:44.040 --> 02:46.500
We could do various different types of operation in pandas.

02:46.500 --> 02:49.110
But the point I want to stress to you is

02:49.110 --> 02:51.540
think about the way that you can use ChatGPT

02:51.540 --> 02:53.250
to label some of your data.

02:53.250 --> 02:57.090
And you can use this on text, you can use this on JSON data,

02:57.090 --> 02:59.310
anything that you want.

02:59.310 --> 03:01.140
And it's also possible to use the Vision API

03:01.140 --> 03:02.610
with this type of information as well.

03:02.610 --> 03:06.780
So think about using ChatGPT as a classification engine,

03:06.780 --> 03:09.303
which you can use to get structured labeled data.
