WEBVTT

00:00.450 --> 00:02.070
-: What is OpenAI Whisper?

00:02.070 --> 00:04.140
It's an open-source audio transcription model.

00:04.140 --> 00:06.120
It was made by OpenAI,

00:06.120 --> 00:08.520
and it was based on the Transformer model.

00:08.520 --> 00:12.690
So, very similar to what powers ChatGPT,

00:12.690 --> 00:16.560
but it's specifically built for transcription,

00:16.560 --> 00:17.943
so audio transcription.

00:19.618 --> 00:22.320
It's actually available as open-source code.

00:22.320 --> 00:24.690
You can also use the API,

00:24.690 --> 00:28.440
so they have access through the OpenAI API

00:28.440 --> 00:30.930
at a very, very low cost actually.

00:30.930 --> 00:32.760
And it's used for audio transcription.

00:32.760 --> 00:34.920
It's also used for translation,

00:34.920 --> 00:36.600
so you can actually get the transcription

00:36.600 --> 00:39.510
in English from audio in another language.

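NOTE
The two uses just described, transcription and translation, could look
roughly like this with the open-source code. This is only a sketch, not
the speaker's own setup: it assumes the `openai-whisper` Python package is
installed, and the helper names `transcribe` and `main`, the "base" model
size, and the file "talk.mp3" are illustrative choices.
```python
# Sketch only: assumes the open-source `openai-whisper` package is installed
# and that "talk.mp3" is a hypothetical local audio file.
def transcribe(model, audio_path, to_english=False):
    """Return the recognized text; with to_english=True, translate into English."""
    # Whisper's `task` option is "transcribe" (keep the spoken language)
    # or "translate" (produce English text).
    task = "translate" if to_english else "transcribe"
    result = model.transcribe(audio_path, task=task)
    return result["text"].strip()
#
def main():
    import whisper  # imported lazily so the helper above has no hard dependency
    model = whisper.load_model("base")  # "base" is one of the published model sizes
    print(transcribe(model, "talk.mp3"))                   # text in the spoken language
    print(transcribe(model, "talk.mp3", to_english=True))  # English translation
```
Calling main() would load the model and print both versions. Passing the
model in as an argument keeps the helper easy to try with a stand-in object.
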
00:39.510 --> 00:41.580
So what are the features?

00:41.580 --> 00:42.840
Really straightforward.

00:42.840 --> 00:45.990
It's not a, you know, kind of fully fledged LLM.

00:45.990 --> 00:48.330
It's just specifically for this one use case.

00:48.330 --> 00:51.063
So, transcription and translation.

00:52.500 --> 00:54.090
What have I been using it for?

00:54.090 --> 00:56.700
I've mostly been using it for YouTube transcripts,

00:56.700 --> 01:00.600
so downloading a YouTube video and then transcribing it.

01:00.600 --> 01:03.090
Also, podcasts, you can do the same for that,

01:03.090 --> 01:06.810
like some people have built tools that transcribe podcasts.

01:06.810 --> 01:11.280
And then it's also being used by OpenAI in the mobile app,

01:11.280 --> 01:13.860
the iOS app, that they've released.

01:13.860 --> 01:16.290
So it gets access to your microphone

01:16.290 --> 01:20.130
and it uses Whisper to transcribe the audio into text.

01:20.130 --> 01:21.900
So you can actually speak to the app,

01:21.900 --> 01:26.283
and then the text appears and is fed into ChatGPT.

01:27.150 --> 01:31.680
It's a really useful model, it performs unreasonably well,

01:31.680 --> 01:34.050
and I would say it's probably the default now

01:34.050 --> 01:35.700
for audio transcription.

01:35.700 --> 01:38.283
So try it and see how you get on.