WEBVTT

00:00.480 --> 00:02.340
-: Hey, in this video, what we're gonna cover

00:02.340 --> 00:04.710
is how you can use OpenAI's package

00:04.710 --> 00:09.710
to convert text into real-sounding voices and audio files.

00:09.840 --> 00:10.770
Now, there's a couple

00:10.770 --> 00:12.240
of different packages you're gonna need for this.

00:12.240 --> 00:14.340
You're gonna need langchain, langchain_openai

00:14.340 --> 00:16.050
and the openai package.

00:16.050 --> 00:18.150
You'll need to update your API key.

00:18.150 --> 00:21.990
And then what we can do is have a look at OpenAI's package.

00:21.990 --> 00:24.420
Now, you can specify the file path

00:24.420 --> 00:26.790
and you'll see I've got this speech.mp3

00:26.790 --> 00:29.580
and you can also choose the model that you're interested in

00:29.580 --> 00:31.590
and there's lots of different voices.

00:31.590 --> 00:34.500
Now, the way this works is you put in your input

00:34.500 --> 00:38.370
and then you would use response.stream_to_file.

00:38.370 --> 00:39.510
And then after that,

00:39.510 --> 00:41.954
then what you'll see is on the left-hand side,

00:41.954 --> 00:45.690
we've got this speech.mp3 file that's been created,

00:45.690 --> 00:47.820
which is a couple of seconds long here.

00:47.820 --> 00:50.760
But what that will do is it will take the textual input

00:50.760 --> 00:52.770
that you're putting in and it'll convert that

00:52.770 --> 00:54.030
into an audio file.
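
NOTE
A minimal sketch of the text-to-speech step described above, assuming the
official openai Python SDK. The helper name text_to_speech is hypothetical,
and model="tts-1" and voice="alloy" are assumed defaults, since the video
does not say which model or voice was picked.
```python
from pathlib import Path
def text_to_speech(client, text, file_path="speech.mp3", model="tts-1", voice="alloy"):
    """Convert text to an audio file using an OpenAI-style client."""
    # `client` is expected to be an openai.OpenAI() instance (needs an API key).
    # The model and voice defaults are assumptions, not taken from the video.
    response = client.audio.speech.create(model=model, voice=voice, input=text)
    # stream_to_file writes the returned audio bytes to disk.
    response.stream_to_file(file_path)
    return Path(file_path)
# Usage (requires the openai package and OPENAI_API_KEY in the environment):
#   from openai import OpenAI
#   text_to_speech(OpenAI(), "Hello from a generated voice!")
```
Passing the client in as a parameter keeps the helper easy to try out with a
stand-in object before spending API credits.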

00:54.030 --> 00:56.400
Now, let's take this a step further

00:56.400 --> 00:58.890
and figure out what is, you know, maybe a good use case

00:58.890 --> 01:01.380
for this and like how could you integrate this

01:01.380 --> 01:02.610
into a different flow?

01:02.610 --> 01:04.200
And so we're gonna use one

01:04.200 --> 01:06.420
of LangChain's community document loaders

01:06.420 --> 01:08.010
called a WebBaseLoader.

01:08.010 --> 01:10.230
And what you'll be able to do is you'll load a URL

01:10.230 --> 01:12.690
and we'll look at the BBC News article

01:12.690 --> 01:15.990
that recently came out in UK politics.

01:15.990 --> 01:18.150
And after this, what happens here

01:18.150 --> 01:21.150
is we've got this WebBaseLoader that can take in a URL.

01:21.150 --> 01:23.940
So I can take in, for example, this URL here.

01:23.940 --> 01:26.610
And then after that, we get some data

01:26.610 --> 01:27.840
and we can have a look at this.

01:27.840 --> 01:29.937
And you can see we've got some documents

01:29.937 --> 01:32.490
and we can see how many of these documents that we have.

01:32.490 --> 01:34.530
So we've got one document in here

01:34.530 --> 01:36.960
and if we want to have a look at the page content

01:36.960 --> 01:40.080
of that document, you could just do the page_content

01:40.080 --> 01:41.940
and you can see this is all the information

01:41.940 --> 01:43.050
that we've got here.

01:43.050 --> 01:47.280
Now what I suggest doing is taking all of that data

01:47.280 --> 01:50.490
and then basically looping over all the documents

01:50.490 --> 01:52.740
and replacing any new line characters.

01:52.740 --> 01:53.910
And then you should end up with something

01:53.910 --> 01:54.990
that looks a little bit like this

01:54.990 --> 01:57.540
where we're checking the character length.
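
NOTE
A rough sketch of the load-and-clean step, assuming WebBaseLoader from
langchain_community. The helper names load_documents and join_page_text are
hypothetical. Replacing newlines with spaces and collapsing the leftover
whitespace are assumptions too, as the video only says the newline
characters get replaced.
```python
def load_documents(url):
    """Fetch a web page as a list of LangChain documents."""
    # Imported lazily so join_page_text works without LangChain installed.
    from langchain_community.document_loaders import WebBaseLoader
    return WebBaseLoader(url).load()
def join_page_text(docs):
    """Join page_content from every document, replacing newlines with spaces."""
    parts = [doc.page_content.replace("\n", " ") for doc in docs]
    # split/join collapses the runs of whitespace left behind by the page layout.
    return " ".join(" ".join(parts).split())
# Usage (the URL is a placeholder, not the article from the video):
#   web_text = join_page_text(load_documents("https://example.com/article"))
#   print(len(web_text))  # character count
```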

01:57.540 --> 02:00.060
And we can see this is the number of characters

02:00.060 --> 02:02.820
and we could have a look at the webpage text if we wanted.

02:02.820 --> 02:05.280
But basically, you've got a lot of webpage text.

02:05.280 --> 02:07.950
Now, interestingly, I don't think we would want to put all

02:07.950 --> 02:10.230
of this webpage text directly in

02:10.230 --> 02:12.000
because you can see we've got other things

02:12.000 --> 02:16.050
like the navigation has somehow managed to make it in here,

02:16.050 --> 02:18.390
and that's not really what we're looking for.

02:18.390 --> 02:20.100
One thing that can be quite good to do,

02:20.100 --> 02:23.880
especially since if we're over a certain character limit,

02:23.880 --> 02:25.680
we won't be able to convert text

02:25.680 --> 02:27.360
to speech using OpenAI's package.

02:27.360 --> 02:32.360
It does have a character limit of around 4,096 characters,

02:32.490 --> 02:35.610
but I've set it to 4,090 just to make sure.

02:35.610 --> 02:36.750
Now, I've decided

02:36.750 --> 02:38.910
to use something called the load_summarize_chain

02:38.910 --> 02:41.850
that allows us to take in a bunch of documents.

02:41.850 --> 02:43.110
And once we have those documents,

02:43.110 --> 02:46.710
then what we can do is we can basically map over them

02:46.710 --> 02:47.880
and reduce them all down

02:47.880 --> 02:52.080
and get some sort of final output from a bunch of documents.

02:52.080 --> 02:53.760
And what I've said is I want the prompt

02:53.760 --> 02:55.320
to say the following is a set of documents

02:55.320 --> 02:57.120
from a BBC News article.

02:57.120 --> 02:59.310
Based on this list of docs, please extract the story

02:59.310 --> 03:00.510
and make it sound engaging.

03:00.510 --> 03:02.670
We've set up a ChatPromptTemplate for that.

03:02.670 --> 03:05.220
And we've got our load_summarize_chain

03:05.220 --> 03:08.370
that's taking in the model, the map_reduce type of chain,

03:08.370 --> 03:10.140
and the map_prompt and the combine_prompt.

03:10.140 --> 03:13.200
And at the moment, I'm not really specifying what I want

03:13.200 --> 03:16.860
to happen at every single step as well as I could,

03:16.860 --> 03:19.470
but I found this prompt seems to work fine for now.

03:19.470 --> 03:23.100
We invoke the chain, passing in the input_documents in here.

03:23.100 --> 03:25.260
You can see I've just made a new LangChain document

03:25.260 --> 03:26.670
with the web_text here.

03:26.670 --> 03:28.200
It's a single document.
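
NOTE
A hedged sketch of the summarize chain described above. The prompt wording
follows the video; the helper name summarize_story, the {text} placeholder
layout, and the model name are assumptions. The import paths match
LangChain 0.1 to 0.3 and may differ in other releases.
```python
PROMPT_TEXT = (
    "The following is a set of documents from a BBC News article:\n"
    "{text}\n"
    "Based on this list of docs, please extract the story and make it sound engaging."
)
def summarize_story(web_text, model_name="gpt-4o-mini"):
    """Run a map_reduce summarize chain over one document built from web_text."""
    # Lazy imports, so defining this function does not require LangChain.
    from langchain.chains.summarize import load_summarize_chain
    from langchain_core.documents import Document
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_openai import ChatOpenAI
    prompt = ChatPromptTemplate.from_template(PROMPT_TEXT)
    chain = load_summarize_chain(
        ChatOpenAI(model=model_name),  # the model name is an assumption
        chain_type="map_reduce",
        map_prompt=prompt,
        combine_prompt=prompt,
    )
    result = chain.invoke({"input_documents": [Document(page_content=web_text)]})
    return result["output_text"]
```
The same prompt is reused for the map and combine steps here, which mirrors
the "not specifying every single step" caveat rather than a tuned setup.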

03:28.200 --> 03:31.680
Now, what will happen is after this runs,

03:31.680 --> 03:34.950
if the character count's above 4,090,

03:34.950 --> 03:36.030
we'll get the output_text

03:36.030 --> 03:38.430
and we'll save that back to the web_text.
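
NOTE
The length check can be sketched as a small guard. The names TTS_SAFE_LIMIT
and shorten_if_needed are hypothetical, and summarize stands in for whatever
wraps the chain call; only the 4,090 threshold comes from the video.
```python
TTS_SAFE_LIMIT = 4090  # threshold mentioned in the video
def shorten_if_needed(web_text, summarize, limit=TTS_SAFE_LIMIT):
    """Return web_text unchanged if it fits, otherwise the summarizer's output."""
    if len(web_text) <= limit:
        return web_text
    # `summarize` is any callable taking and returning a string.
    return summarize(web_text)
# Usage with a stand-in summarizer that simply truncates:
#   web_text = shorten_if_needed(web_text, lambda text: text[:TTS_SAFE_LIMIT])
```
This guard does not re-check the summary length, so a very long summary
could still need another pass before it goes to text-to-speech.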

03:38.430 --> 03:39.900
And the idea behind this

03:39.900 --> 03:41.430
is that what you really wanna be doing

03:41.430 --> 03:44.760
is not just scraping this data from the webpage,

03:44.760 --> 03:47.850
but actually, you want to be filtering it,

03:47.850 --> 03:49.200
maybe having some prompts

03:49.200 --> 03:51.300
that make it into a more interesting story.

03:51.300 --> 03:53.070
And you can see here we've got some information

03:53.070 --> 03:55.410
about the Labour deputy leader.

03:55.410 --> 03:57.090
Angela Rayner has accused the government

03:57.090 --> 03:58.740
of reneging on its promise.

03:58.740 --> 04:01.650
And so we've got a much better story here

04:01.650 --> 04:03.270
than the original webpage content.

04:03.270 --> 04:06.060
So we've really got the LLM to clean up the text

04:06.060 --> 04:08.910
before we expose that to downstream LLMs

04:08.910 --> 04:10.530
or downstream systems.

04:10.530 --> 04:13.290
And then again, we can just run the speech.mp3

04:13.290 --> 04:14.940
as the file_path.

04:14.940 --> 04:17.790
We're using the client.audio.speech.create,

04:17.790 --> 04:21.450
and then we're passing in a model, the voice

04:21.450 --> 04:24.623
and the input text, which is now the web_text.

04:24.623 --> 04:28.860
And then we're doing the response.stream_to_file.

04:28.860 --> 04:31.350
Now, I know this does give a warning.

04:31.350 --> 04:32.640
It has recommended using

04:32.640 --> 04:35.310
this .with_streaming_response method,

04:35.310 --> 04:37.560
but I found that this method doesn't actually work.

04:37.560 --> 04:39.450
So you can ignore this warning for now.
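
NOTE
For reference, a sketch of the pattern the warning points to, assuming a
recent openai SDK where with_streaming_response.create is used as a context
manager. Results may vary by SDK version. The helper name
stream_speech_to_file and the model and voice defaults are assumptions.
```python
def stream_speech_to_file(client, text, file_path="speech.mp3", model="tts-1", voice="alloy"):
    """Write synthesized speech to file_path via the streaming response helper."""
    # The context manager closes the HTTP response once the audio is written.
    with client.audio.speech.with_streaming_response.create(
        model=model, voice=voice, input=text
    ) as response:
        response.stream_to_file(file_path)
    return file_path
# Usage (requires the openai package and an API key):
#   from openai import OpenAI
#   stream_speech_to_file(OpenAI(), web_text)
```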

04:39.450 --> 04:42.300
And then that will save the audio directly

04:42.300 --> 04:44.190
into this speech.mp3 file.

04:44.190 --> 04:45.180
And you can see we've got

04:45.180 --> 04:48.420
about 1 minute and 46 seconds of speech.

04:48.420 --> 04:50.520
And so this can be a really great way for you

04:50.520 --> 04:54.060
to transform different types of content

04:54.060 --> 04:58.560
that was primarily text-based into an audio narrative.

04:58.560 --> 05:00.310
Cool, I'll see you in the next one.