WEBVTT

00:00.480 --> 00:03.450
All right, this is how to use the Whisper API

00:03.450 --> 00:06.270
to transcribe a YouTube video.

00:06.270 --> 00:09.177
We're going to just run this.

00:09.177 --> 00:12.930
I'm just gonna hit command shift

00:12.930 --> 00:14.850
or control shift if you're on a PC.

00:14.850 --> 00:17.283
And this is gonna install the openai package.

00:18.180 --> 00:19.860
That's one of the things we need.

00:19.860 --> 00:22.770
We're also going to run this as well.

00:22.770 --> 00:24.810
This is yt-dlp.

00:24.810 --> 00:26.880
This is like youtube-dl,

00:26.880 --> 00:29.010
if you guys have used that before.

00:29.010 --> 00:32.160
But I found that it's updated more frequently,

00:32.160 --> 00:35.850
and YouTube occasionally changes the way that they stream

00:35.850 --> 00:37.620
and that breaks youtube-dl.

00:37.620 --> 00:41.310
I found this works, but you can also try youtube-dl,

00:41.310 --> 00:43.470
works in a very similar way.
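
NOTE
A rough sketch of the two installs described above, written as a Python helper instead of a notebook cell. The PyPI package names openai and yt-dlp are real; the helper names pip_install_command and install are made up for illustration, and the exact cells run in the video are not shown in this transcript.
```python
import subprocess
import sys
def pip_install_command(packages):
    """Build a pip command that installs into the current interpreter."""
    return [sys.executable, "-m", "pip", "install", *packages]
def install(packages):
    """Run the install; needs network access, so it is not called here."""
    subprocess.run(pip_install_command(packages), check=True)
# Usage: install(["openai", "yt-dlp"])
```
Building the command from sys.executable keeps the install tied to the interpreter the notebook is actually running.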

00:43.470 --> 00:45.300
So I'm just gonna download this YouTube video.

00:45.300 --> 00:46.740
This one is mine.

00:46.740 --> 00:47.970
I own the copyright for this.

00:47.970 --> 00:52.710
So I would say if you don't own the copyright for your video

00:52.710 --> 00:54.150
that you're downloading, just be careful,

00:54.150 --> 00:57.270
because you might be restricted in what you can do.

00:57.270 --> 01:00.870
But it depends on the use case you're using the transcripts for.

01:00.870 --> 01:03.450
In our case, what we're doing is

01:03.450 --> 01:05.820
we're taking the YouTube video that I have

01:05.820 --> 01:07.830
and we're just creating a transcript.

01:07.830 --> 01:10.230
And we can use that transcript to post

01:10.230 --> 01:12.990
and promote the YouTube video itself.

01:12.990 --> 01:14.460
So while this is downloading,

01:14.460 --> 01:16.110
this is basically just streaming

01:16.110 --> 01:18.210
and then capturing the fragments.
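
NOTE
A minimal sketch of the download step, assuming the yt-dlp command-line tool is installed and on the PATH. The function names and the output template input.%(ext)s are assumptions for illustration; the video URL is whatever video you have the rights to use.
```python
import subprocess
def ytdlp_command(url, output_template="input.%(ext)s"):
    """Build a yt-dlp command that saves the video under a fixed base name."""
    return ["yt-dlp", "-o", output_template, url]
def download(url):
    """Run the download; needs network access, so it is not called here."""
    subprocess.run(ytdlp_command(url), check=True)
# Usage: download("<your video URL>")
```
A fixed output name sidesteps titles that contain spaces or punctuation, which can be awkward to pass to later commands.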

01:18.210 --> 01:22.290
I'm just gonna click here to go to OpenAI.

01:22.290 --> 01:24.420
We need to get an API key.

01:24.420 --> 01:26.820
So we're gonna create a new secret key.

01:26.820 --> 01:29.730
Just gonna call this whisper.

01:29.730 --> 01:31.950
I'm gonna delete this afterwards,

01:31.950 --> 01:35.220
because you shouldn't share your API keys.

01:35.220 --> 01:36.053
Cool.

01:36.053 --> 01:36.886
So,

01:39.330 --> 01:41.550
it looks like this is downloaded.

01:41.550 --> 01:43.650
And actually if we look in our file system, yeah,

01:43.650 --> 01:45.990
we can see this, the video here.

01:45.990 --> 01:49.710
So now let's set up our API key.

01:49.710 --> 01:53.190
Just gonna run this and then we can just paste it in there.

01:53.190 --> 01:56.400
And that kind of saves it in the OpenAI library.
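
NOTE
One possible way to handle the key without pasting it into the notebook source, sketched under assumptions: load_api_key is a hypothetical helper, and the cell used in the video is not shown here. OPENAI_API_KEY is the environment variable the openai package reads by default.
```python
import getpass
import os
def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the key from the environment, prompting once if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        key = getpass.getpass("OpenAI API key: ")  # input is not echoed
        os.environ[env_var] = key  # set for this process only
    return key
```
Prompting with getpass keeps the key out of the saved notebook and out of any screen recording.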

01:56.400 --> 01:59.520
And now we need to, so we have this video,

01:59.520 --> 02:01.200
and it's WebM format.

02:01.200 --> 02:03.660
So we're gonna use ffmpeg,

02:03.660 --> 02:08.070
which is a tool for basically just changing

02:08.070 --> 02:10.830
or manipulating video and audio files.

02:10.830 --> 02:11.663
And we're gonna use this,

02:11.663 --> 02:13.770
we're gonna take in this video,

02:13.770 --> 02:16.200
and then output it as like an MP3.

02:16.200 --> 02:18.603
So you can get this file name.

02:19.770 --> 02:21.130
Just paste this in here

02:22.214 --> 02:23.943
and hopefully that runs.

02:25.690 --> 02:27.930
Okay. Yeah, so that's not running, I think,

02:27.930 --> 02:29.700
because this name is a little bit funny.

02:29.700 --> 02:33.753
So let's just call this input.webm,

02:34.680 --> 02:36.530
and then we'll just change this here.

02:39.042 --> 02:39.959
Input.webm.

02:44.367 --> 02:45.200
All right.

02:46.074 --> 02:49.560
And this is basically just converting the video format

02:49.560 --> 02:52.833
and extracting just the audio as an MP3.
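
NOTE
A sketch of the conversion step, assuming ffmpeg is installed and on the PATH. The file names input.webm and output.mp3 come from the video; the function names are made up. The flags are standard ffmpeg options: -i names the input, -vn drops the video stream, and -y overwrites an existing output file.
```python
import subprocess
def ffmpeg_command(src="input.webm", dst="output.mp3"):
    """Build an ffmpeg command that keeps only the audio track."""
    return ["ffmpeg", "-y", "-i", src, "-vn", dst]
def extract_audio(src="input.webm", dst="output.mp3"):
    """Run the conversion; requires ffmpeg, so it is not called here."""
    subprocess.run(ffmpeg_command(src, dst), check=True)
# Usage: extract_audio()
```
Passing the arguments as a list avoids shell quoting, which is one likely reason a file name with spaces or symbols fails in a shell command.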

02:56.550 --> 02:57.480
All right, so that's done.

02:57.480 --> 02:59.460
And actually if we refresh we can see there's

02:59.460 --> 03:01.500
output.mp3 here.

03:01.500 --> 03:04.590
And now we're going to call the Whisper API.

03:04.590 --> 03:06.540
So this is the easy part. (laughs)

03:06.540 --> 03:08.730
I've asked it for the verbose JSON.

03:08.730 --> 03:12.030
By default, it just gives you back the transcript,

03:12.030 --> 03:13.800
but this will give you all the different

03:13.800 --> 03:15.240
segments of the transcript.

03:15.240 --> 03:17.823
So just gonna run this, you'll see what I mean.
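
NOTE
A hedged sketch of the transcription call using the client interface of the openai package, version 1 or later. The call in the video may have used an older interface, so treat this as an equivalent and not as the code on screen. The model name whisper-1 and response_format verbose_json are real API values; the wrapper function transcribe is hypothetical.
```python
def transcribe(path, client):
    """Send one audio file to the transcription endpoint and return the response."""
    with open(path, "rb") as audio_file:
        return client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",  # includes per-segment timing
        )
# Usage, assuming the openai package and OPENAI_API_KEY are set up:
#   from openai import OpenAI
#   result = transcribe("output.mp3", OpenAI())
```
Taking the client as a parameter makes the function easy to try with a stand-in object before spending API credit.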

03:21.948 --> 03:24.900
And there is a cost to this. It's very cheap, though,

03:24.900 --> 03:28.380
negligible, like pennies, even for relatively long audio.

03:28.380 --> 03:30.330
It shouldn't be too much of an issue.

03:30.330 --> 03:32.580
But keep on top of your cost caps. (laughs)

03:32.580 --> 03:35.580
Make sure that you don't rack up too much

03:35.580 --> 03:37.530
of a fee if you're doing this for lots

03:37.530 --> 03:39.213
of different podcasts or audio.
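
NOTE
A back-of-the-envelope cost estimate, assuming billing is per minute of audio. The rate is passed in on purpose: the 0.006 dollars per minute in the usage line is an assumption based on the price published for whisper-1 and should be checked against the current pricing page. estimate_cost is a made-up helper.
```python
def estimate_cost(duration_seconds, usd_per_minute):
    """Approximate transcription cost in US dollars for one audio file."""
    return duration_seconds / 60 * usd_per_minute
# Usage: estimate_cost(3600, 0.006) gives roughly 0.36 for an hour of audio.
```
Summing this over a batch of files before you start gives a quick sanity check against your spending limit.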

03:40.620 --> 03:41.733
Wait for this to run.

03:47.458 --> 03:50.541
(instructor sniffs)

04:02.580 --> 04:03.510
Oh, here we go.

04:03.510 --> 04:04.650
Now it's finished.

04:04.650 --> 04:06.570
And look, it's a pretty big file.

04:06.570 --> 04:10.530
So you can see it's broken into these chunks, right?

04:10.530 --> 04:15.450
We have the start time, the end time, the actual text,

04:15.450 --> 04:19.440
and then even the tokens and the log probability.

04:19.440 --> 04:21.240
This is like a lot of information.

04:21.240 --> 04:23.340
We don't need all of this information.

04:23.340 --> 04:25.410
We can actually just get the text itself.

04:25.410 --> 04:27.840
So if we just try this.

04:27.840 --> 04:29.010
Yeah, here we go.

04:29.010 --> 04:30.570
So this is the full text

04:30.570 --> 04:32.250
and I can click there to expand it.

04:32.250 --> 04:33.540
It's the full text of the video.

04:33.540 --> 04:34.860
That's the full transcript,

04:34.860 --> 04:36.570
but this doesn't have any timestamps.
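
NOTE
A small illustration of the verbose JSON shape described above, assuming the response has been converted to a plain dict. The sample values are invented; the field names (text, segments, start, end, tokens, avg_logprob) match the fields mentioned in the video. Both helper functions are hypothetical.
```python
sample = {
    "text": "Hello there. Welcome back.",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello there.", "tokens": [1, 2], "avg_logprob": -0.2},
        {"start": 1.5, "end": 3.0, "text": " Welcome back.", "tokens": [3, 4], "avg_logprob": -0.3},
    ],
}
def full_text(transcript):
    """Return the whole transcript as one string, without timestamps."""
    return transcript["text"]
def segment_texts(transcript):
    """Return the text of each segment, with surrounding whitespace removed."""
    return [segment["text"].strip() for segment in transcript["segments"]]
```
The top-level text field is the quick answer; the segments list is what you need once timestamps matter.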

04:36.570 --> 04:39.180
What I wanted to show you is how to kinda

04:39.180 --> 04:41.280
split out the timestamps themselves.

04:41.280 --> 04:43.200
So I've gone through the segments.

04:43.200 --> 04:46.350
So it's for segment in transcript segments,

04:46.350 --> 04:49.110
and then I'm unbundling the start minutes and start seconds.

04:49.110 --> 04:50.520
I'm using this divmod.

04:50.520 --> 04:54.240
It basically just divides the start number by 60

04:54.240 --> 04:55.890
to get the number of minutes

04:55.890 --> 04:58.650
and then the number of seconds afterwards.

04:58.650 --> 04:59.790
So that'd be how many seconds are

04:59.790 --> 05:02.040
left over after you divide.

05:02.040 --> 05:03.810
And then I'm creating a timestamp.

05:03.810 --> 05:06.120
So if I just run this,

05:06.120 --> 05:08.643
you'll be able to see, here we go.

05:11.160 --> 05:13.470
So the timestamp is formatted,

05:13.470 --> 05:15.960
so there's always like a leading zero,

05:15.960 --> 05:17.280
even if the minutes,

05:17.280 --> 05:20.010
even if it's less than one minute, for example.

05:20.010 --> 05:22.290
And same thing with the seconds as well.

05:22.290 --> 05:24.240
Just to make the formatting more consistent.

05:24.240 --> 05:25.073
And there we go.

05:25.073 --> 05:27.120
We have kind of a line-by-line transcript

05:27.120 --> 05:29.223
with the seconds and the minutes.
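
NOTE
A sketch of the timestamp loop described above: divmod splits each segment's start time into minutes and leftover seconds, and the format pads both to two digits. The function names are made up, and segments are assumed to be dicts with start and text keys.
```python
def format_timestamp(start_seconds):
    """Turn a start time in seconds into a zero-padded MM:SS string."""
    minutes, seconds = divmod(int(start_seconds), 60)
    return f"{minutes:02d}:{seconds:02d}"
def timestamped_lines(segments):
    """Return one 'MM:SS text' line per transcript segment."""
    return [f"{format_timestamp(s['start'])} {s['text'].strip()}" for s in segments]
# Usage: print("\n".join(timestamped_lines(transcript["segments"])))
```
Fractions of a second are dropped here with int(); videos longer than an hour simply show minutes above 59.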

05:31.500 --> 05:32.880
All right, there's a lot you can do with this.

05:32.880 --> 05:35.520
I think you can do real-time streaming as well.

05:35.520 --> 05:39.200
It is open source as well as available via API.

05:39.200 --> 05:42.240
So you could run this locally if you have this

05:42.240 --> 05:44.310
set up on your server with a GPU.

05:44.310 --> 05:45.990
But I think it's easier

05:45.990 --> 05:48.360
and relatively cheap to use the API.
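
NOTE
For the local route, a minimal sketch assuming the open-source openai-whisper package and ffmpeg are installed. whisper.load_model and model.transcribe are that package's real entry points; the wrapper transcribe_locally and its load_model parameter are hypothetical, added so the function can be tried without downloading a model.
```python
def transcribe_locally(path, model_name="base", load_model=None):
    """Transcribe an audio file on this machine and return the result dict."""
    if load_model is None:
        import whisper  # provided by the openai-whisper package
        load_model = whisper.load_model
    model = load_model(model_name)
    return model.transcribe(path)
# Usage: result = transcribe_locally("output.mp3"); print(result["text"])
```
Larger model names trade speed for accuracy, which is where the GPU mentioned above starts to matter.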

05:48.360 --> 05:51.360
And now you can transcribe any audio you want.

05:51.360 --> 05:54.120
I would say that one thing I would watch out for is

05:54.120 --> 05:55.440
that it's not particularly good

05:55.440 --> 05:57.180
when you have more than one speaker.

05:57.180 --> 05:58.100
That's what I've found.

05:58.100 --> 06:03.100
It tends to work best for like tutorials or podcasts,

06:03.390 --> 06:06.303
things where there's just one speaker at a time typically.