WEBVTT

00:00.510 --> 00:01.680
-: Hey, I'm gonna walk you through

00:01.680 --> 00:05.347
how to do qualitative analysis using an LLM,

00:05.347 --> 00:08.370
and we're gonna use GPT-4o for this.

00:08.370 --> 00:11.280
We're gonna use the API specifically

00:11.280 --> 00:14.688
so we can do this without a lot of copy and pasting,

00:14.688 --> 00:16.680
and so here's a script.

00:16.680 --> 00:17.850
You can just run each line,

00:17.850 --> 00:19.170
but I'm gonna walk you through it.

00:19.170 --> 00:22.980
First you want to install OpenAI, and then if you run this,

00:22.980 --> 00:25.110
it's gonna ask you for your API key.

00:25.110 --> 00:27.870
You'll paste it in, and then you're ready to go.
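
NOTE
A minimal sketch of the key-entry step, not the script shown in the video. The helper name load_api_key is hypothetical. It assumes the key lives in the OPENAI_API_KEY environment variable, which the official openai Python client reads by default, and only prompts when that variable is unset.
```python
import os
from getpass import getpass
def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    # Reuse a key that is already set so the prompt only appears once.
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the key while you paste it in.
        key = getpass("Paste your OpenAI API key: ")
        os.environ[env_var] = key
    return key
```
Calling load_api_key() once at the top of a notebook should be enough for later cells, under the assumption above.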

00:27.870 --> 00:30.900
If you don't know where to get your API key from,

00:30.900 --> 00:33.630
go to platform.openai.com

00:33.630 --> 00:37.710
and then go into API keys here.

00:37.710 --> 00:39.750
You can click Create a new secret key,

00:39.750 --> 00:42.120
and then obviously don't share it with anyone. (laughing)

00:42.120 --> 00:43.950
You know, it's like a password,

00:43.950 --> 00:45.720
and people can spend money on your account

00:45.720 --> 00:46.620
if they get access to it.

00:46.620 --> 00:49.110
So I'm gonna read in this data,

00:49.110 --> 00:51.000
and you can take a look at this data,

00:51.000 --> 00:54.180
but it's basically just a bunch of interviews

00:54.180 --> 00:57.780
from a paper that I found online,

00:57.780 --> 01:00.973
and it was just interviews on like what young people think,

01:00.973 --> 01:04.597
and so it's a perfect type of transcript to do,

01:04.597 --> 01:07.230
you know, qualitative analysis on,

01:07.230 --> 01:09.630
and all I've done really here,

01:09.630 --> 01:12.210
just to be able to use this later,

01:12.210 --> 01:15.960
is to create an id column, which is just the index plus 1.

01:15.960 --> 01:19.590
So this is id 1, id 2, id 3, et cetera.

01:19.590 --> 01:21.120
Gonna use that later.

01:21.120 --> 01:23.040
Now if we run this,

01:23.040 --> 01:26.340
what this does is just some formatting for our prompts.

01:26.340 --> 01:28.560
So you can see there's 10 interviews here,

01:28.560 --> 01:32.940
and I've just basically just stitched them together

01:32.940 --> 01:35.040
and called them Sample 1, Sample 2,

01:35.040 --> 01:37.080
Sample 3 using that id column,

01:37.080 --> 01:41.280
and that's how the AI is gonna know which labels appear

01:41.280 --> 01:43.890
and which ones are related to which document.

01:43.890 --> 01:45.450
When it comes up with a new label,

01:45.450 --> 01:49.140
it's going to give us the ids that label applies to

01:49.140 --> 01:51.480
because we're giving the ids here.
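
NOTE
A rough sketch of the id and formatting steps described here, using plain Python lists rather than the data frame from the video. The function name format_samples, the "id" and "text" keys, and the two interview snippets are made up for illustration.
```python
def format_samples(rows: list[dict]) -> str:
    # Each row is assumed to look like {"id": 1, "text": "..."}.
    return "\n\n".join(f"Sample {row['id']}: {row['text']}" for row in rows)
# Placeholder interview texts, invented for this example.
interviews = ["I get angry quickly.", "I love my dog."]
# The id is just the index plus 1, so ids start at 1.
rows = [{"id": i + 1, "text": t} for i, t in enumerate(interviews)]
prompt_block = format_samples(rows)
```
Because every sample carries its id in the prompt, the model can refer back to samples by number when it assigns labels.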

01:51.480 --> 01:53.190
All right, so here's the prompt.

01:53.190 --> 01:54.870
I've kept it relatively simple.

01:54.870 --> 01:58.140
I'm actually building a tool that does this.

01:58.140 --> 01:59.760
So I've made this like a toy version,

01:59.760 --> 02:03.450
but it's the same sort of structure in the backend.

02:03.450 --> 02:04.530
I've just simplified it.

02:04.530 --> 02:06.990
We're analyzing the given text samples

02:06.990 --> 02:10.110
for thematic labels based on similarities and differences,

02:10.110 --> 02:12.606
and I'm getting it back in JSONL.

02:12.606 --> 02:17.313
JSONL is just JSON, but every line is a JSON object.

02:18.180 --> 02:20.610
So this is what it's gonna come back with.

02:20.610 --> 02:22.890
Okay, gonna run this just to see.

02:22.890 --> 02:26.040
This is on the first 10 samples,

02:26.040 --> 02:29.370
and we're gonna get back, here we go, some labels.

02:29.370 --> 02:31.440
So there's a label called "Self-perception",

02:31.440 --> 02:33.600
which seems to apply to everything,

02:33.600 --> 02:36.180
and then there's one called "Anger/temper",

02:36.180 --> 02:40.710
which only applies to sample 1, 3, 4 and 7, and so on.

02:40.710 --> 02:44.010
Okay, then we're gonna be able to parse that stuff.

02:44.010 --> 02:48.030
So here, I've just gone through each line and split it out.

02:48.030 --> 02:48.863
I stripped different things from it.

02:48.863 --> 02:51.750
I got rid of the beginning line and the end line

02:51.750 --> 02:55.397
because those were the jsonl marker and the backticks here

02:55.397 --> 02:58.260
'cause that would cause errors otherwise,

02:58.260 --> 03:01.503
and now I've got like a list of these labels

03:01.503 --> 03:02.760
that we can use.
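
NOTE
One way the parsing step could look, as a sketch rather than the code from the video. It assumes the model wraps its answer in a code fence tagged jsonl and that each line is an object with "label" and "ids" fields; those field names and the function name parse_jsonl_response are assumptions.
```python
import json
FENCE = "`" * 3  # three backticks, built this way to keep this example readable
def parse_jsonl_response(text: str) -> list[dict]:
    labels = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the opening and closing fence lines.
        if not line or line.startswith(FENCE):
            continue
        try:
            labels.append(json.loads(line))
        except json.JSONDecodeError:
            # Skip anything that is not valid JSON rather than crash.
            continue
    return labels
sample_response = "\n".join([
    FENCE + "jsonl",
    '{"label": "Self-perception", "ids": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}',
    '{"label": "Anger/temper", "ids": [1, 3, 4, 7]}',
    FENCE,
])
parsed = parse_jsonl_response(sample_response)
```
Skipping bad lines keeps one malformed line from losing the whole batch, at the cost of silently dropping it.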

03:02.760 --> 03:04.830
Okay, cool. So that's how it works so far.

03:04.830 --> 03:06.630
Now, obviously we want to be able to process

03:06.630 --> 03:08.250
all of the things in the spreadsheet.

03:08.250 --> 03:10.980
Sometimes they don't all fit into one prompt,

03:10.980 --> 03:12.697
or you want to split them up a little bit.

03:12.697 --> 03:14.970
So I made this little function.

03:14.970 --> 03:17.800
It basically just shuffles the data frame

03:19.320 --> 03:22.650
and then returns it in batches, and then for every batch,

03:22.650 --> 03:25.830
it just creates a task with asyncio.

03:25.830 --> 03:30.830
Asyncio is a way to run multiple API calls at the same time.

03:31.500 --> 03:33.780
So we don't have to wait for a really long time

03:33.780 --> 03:34.783
for OpenAI to come back.

03:34.783 --> 03:37.020
We are not sending it and then coming back,

03:37.020 --> 03:38.250
sending it, coming back.

03:38.250 --> 03:40.710
Instead we're sending all the batches at once,

03:40.710 --> 03:42.030
all those different API calls,

03:42.030 --> 03:44.400
and then we get them all back at the end.

03:44.400 --> 03:47.130
Okay, all the results gather here,

03:47.130 --> 03:49.560
and then we can go through and parse them.
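
NOTE
A small sketch of the shuffle, batch and gather pattern described here. The real API call is replaced by a stand-in coroutine, fake_call_openai, so this runs without a key or network access. All function names, the batch size of 10 and the 25 dummy ids are assumptions for illustration.
```python
import asyncio
import random
def make_batches(rows: list, batch_size: int, seed: int = 0) -> list[list]:
    # Shuffle a copy so the caller's list is left untouched.
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]
async def fake_call_openai(batch: list) -> str:
    # Stand-in for the real API call: wait briefly, then report the batch size.
    await asyncio.sleep(0.01)
    return f"{len(batch)} samples processed"
async def run_all(batches: list[list]) -> list[str]:
    # One task per batch, all in flight at once, gathered in batch order.
    tasks = [asyncio.create_task(fake_call_openai(b)) for b in batches]
    return await asyncio.gather(*tasks)
batches = make_batches(list(range(1, 26)), batch_size=10)
results = asyncio.run(run_all(batches))
```
In a notebook, where an event loop is usually already running, asyncio.run raises an error; using "await run_all(batches)" in a cell is the usual substitute.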

03:49.560 --> 03:50.393
So here we go.

03:50.393 --> 03:53.230
So here's a big list of all the labels we got back

03:55.619 --> 03:58.200
from OpenAI, lots of interesting stuff,

03:58.200 --> 04:02.970
but right now, we only got labels per batch,

04:02.970 --> 04:07.050
and each batch has a different set of labels,

04:07.050 --> 04:08.880
and ideally what we wanna do is go through

04:08.880 --> 04:13.080
and check if animal lovers applies to other batches as well,

04:13.080 --> 04:16.650
check if cultural heritage applies to other batches

04:16.650 --> 04:18.750
and dual identity, right?

04:18.750 --> 04:20.550
So these are all the themes that are being discussed

04:20.550 --> 04:21.600
in these interviews.

04:21.600 --> 04:25.140
We wanna see which ones are more or less popular.

04:25.140 --> 04:27.030
You can also see some errors here,

04:27.030 --> 04:29.880
so like unique-to-sample-58, you know?

04:29.880 --> 04:31.770
It doesn't always give you good results.

04:31.770 --> 04:33.390
That's where prompt engineering comes in.

04:33.390 --> 04:34.740
You can kinda mess around a little bit

04:34.740 --> 04:37.080
and see what you can do to decrease that.

04:37.080 --> 04:38.130
All right, and the next thing we need to do

04:38.130 --> 04:41.190
is just modify the call_openai function.

04:41.190 --> 04:44.220
We're just gonna add this labels=None,

04:44.220 --> 04:45.960
and if we do have labels,

04:45.960 --> 04:48.780
then we're just gonna add this extra piece to the prompt.

04:48.780 --> 04:51.270
So it still works the same way for the previous stuff,

04:51.270 --> 04:52.530
but we're just adding this piece here

04:52.530 --> 04:54.650
that says you must only apply the following labels,

04:54.650 --> 04:58.560
and then we just join the list together here.
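
NOTE
A sketch of how the optional labels argument might change the prompt. The prompt wording is paraphrased from the walkthrough, not copied from the actual script, and build_prompt and BASE_PROMPT are hypothetical names.
```python
from typing import Optional
BASE_PROMPT = (
    "Analyze the given text samples for thematic labels "
    "based on similarities and differences. Respond in JSONL."
)
def build_prompt(samples: str, labels: Optional[list[str]] = None) -> str:
    prompt = BASE_PROMPT
    if labels:
        # Second pass: restrict the model to the labels found earlier.
        prompt += "\nYou must only apply the following labels: " + ", ".join(labels)
    return prompt + "\n\n" + samples
first_pass = build_prompt("Sample 1: ...")
second_pass = build_prompt("Sample 1: ...", ["Self-perception", "Anger/temper"])
```
With labels left as None the prompt is unchanged, so the first pass keeps working the same way.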

04:58.560 --> 05:00.540
There we go, labels_list,

05:00.540 --> 05:05.540
and actually we're gonna put the labels_list in here.

05:08.490 --> 05:09.646
Yeah, here we go.

05:09.646 --> 05:13.320
So that's gonna give you the documents

05:13.320 --> 05:15.123
with the label partial as well.

05:16.470 --> 05:18.690
All right, I'm just gonna run that,

05:18.690 --> 05:22.470
and then this is gonna go through every batch.

05:22.470 --> 05:26.730
It's gonna create a task for every batch,

05:26.730 --> 05:29.040
and then we're gonna gather all those tasks,

05:29.040 --> 05:30.960
and then we're gonna parse the responses here.

05:30.960 --> 05:32.040
We're gonna go through.

05:32.040 --> 05:33.900
We're splitting the data frame again,

05:33.900 --> 05:35.160
and then we're calling OpenAI,

05:35.160 --> 05:37.410
this time with the labels_list,

05:37.410 --> 05:40.230
and then it's gonna give us back all those results

05:40.230 --> 05:41.220
just like it did before.

05:41.220 --> 05:43.233
So see if this works.

05:47.610 --> 05:49.980
You know, this time, we've passed in the actual labels.

05:49.980 --> 05:53.550
So we should get, you know, just those labels back,

05:53.550 --> 05:55.290
but it should be for everything now.

05:55.290 --> 05:58.110
So we'll get a big list of everything.

05:58.110 --> 06:00.530
Okay, here we go. Yeah, there's a huge list now.

06:00.530 --> 06:03.780
So we have like "self vs. perception" for this batch,

06:03.780 --> 06:06.123
and then we'll have the same thing later on,

06:08.010 --> 06:09.060
and so on and so forth.

06:09.060 --> 06:11.160
So, you know, there's a big list of these,

06:11.160 --> 06:14.340
and then this code basically

06:14.340 --> 06:16.200
just merges everything together.

06:16.200 --> 06:19.200
So what we get is, you can see here,

06:19.200 --> 06:23.790
a big list of the labels for each of the rows,

06:23.790 --> 06:28.080
and if I one-hot encode the tags,

06:28.080 --> 06:29.910
that kinda just pivots them up here.

06:29.910 --> 06:32.640
Now we can see, like, we can filter in the spreadsheet

06:32.640 --> 06:35.370
for self-description and then see, okay,

06:35.370 --> 06:36.630
this one has self-description,

06:36.630 --> 06:38.550
this one has self-description and this one does,

06:38.550 --> 06:41.040
and we could just filter for how many we had.
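
NOTE
A plain-Python sketch of the merge and one-hot step, standing in for whatever data frame code the video uses. It assumes each parsed result has "label" and "ids" fields; the function name one_hot_labels and the toy data are made up for illustration.
```python
def one_hot_labels(row_ids: list[int], parsed: list[dict]) -> dict[int, dict[str, int]]:
    label_names = sorted({item["label"] for item in parsed})
    # Start every row with 0 for every label, then flip matches to 1.
    table = {rid: {name: 0 for name in label_names} for rid in row_ids}
    for item in parsed:
        for rid in item["ids"]:
            if rid in table:  # ignore ids that are not in the data
                table[rid][item["label"]] = 1
    return table
# Toy results, invented for this example.
parsed = [
    {"label": "Self-perception", "ids": [1, 2, 3]},
    {"label": "Anger/temper", "ids": [1, 3]},
]
table = one_hot_labels([1, 2, 3], parsed)
# Counting a label is just summing its column.
anger_count = sum(row["Anger/temper"] for row in table.values())
```
Each label becomes its own 0/1 column, which is what makes filtering and counting in a spreadsheet straightforward.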

06:41.040 --> 06:42.360
If we look at the final shape,

06:42.360 --> 06:46.410
we can see that we had 65 rows and we have 49 columns.

06:46.410 --> 06:48.390
So quite a few of those if you, you know,

06:48.390 --> 06:51.720
take off these six columns.

06:51.720 --> 06:56.190
So yeah, 43 labels that we've found, which is pretty cool,

06:56.190 --> 06:58.590
and you can export that to a CSV

06:58.590 --> 07:00.390
and then upload that into Google Sheets

07:00.390 --> 07:02.711
and do any other kind of analysis you want.

07:02.711 --> 07:04.590
Okay, there's other stuff I didn't discuss here,

07:04.590 --> 07:06.750
like you might want to de-duplicate these labels.

07:06.750 --> 07:09.630
You might wanna check if they're any good or not or,

07:09.630 --> 07:11.580
like, how interesting they are,

07:11.580 --> 07:14.160
but this is the basic mechanism that I use

07:14.160 --> 07:17.130
to do qualitative analysis with LLMs,

07:17.130 --> 07:18.993
and it's pretty good at it.
