WEBVTT

00:00.000 --> 00:00.833
Hey, welcome.

00:00.833 --> 00:01.770
And in this video we're gonna have a look

00:01.770 --> 00:04.470
at how you can use Python WebSockets

00:04.470 --> 00:06.510
via the OpenAI Realtime API,

00:06.510 --> 00:08.580
to make an AI chatbot,

00:08.580 --> 00:09.930
which you can communicate with

00:09.930 --> 00:12.300
by giving it a call.

00:12.300 --> 00:13.770
Now there's a couple of prerequisites.

00:13.770 --> 00:15.180
You'll need a Twilio account,

00:15.180 --> 00:17.430
a Twilio number, an OpenAI account.

00:17.430 --> 00:19.200
You'll need OpenAI Realtime access,

00:19.200 --> 00:21.660
and you'll also need a tool called ngrok.

00:21.660 --> 00:22.980
And ngrok is a useful tool

00:22.980 --> 00:23.813
because it allows you

00:23.813 --> 00:26.850
to forward the webhook requests

00:26.850 --> 00:27.750
that you're gonna get

00:27.750 --> 00:29.880
from Twilio to a local port

00:29.880 --> 00:31.260
for development testing.

00:31.260 --> 00:33.750
So go to ngrok.com and make sure

00:33.750 --> 00:34.583
that you download this.

00:34.583 --> 00:36.030
It's actually free to use,

00:36.030 --> 00:37.650
so you can just use that for free.

00:37.650 --> 00:38.520
And after you've got

00:38.520 --> 00:40.770
that binary installed on your machine,

00:40.770 --> 00:41.820
then what you need to do is

00:41.820 --> 00:43.260
go into the repo that will be

00:43.260 --> 00:45.240
inside of our GitHub repository.

00:45.240 --> 00:46.680
And then what you're gonna need to do is

00:46.680 --> 00:50.370
go into the .env.example

00:50.370 --> 00:53.520
and make a copy of that and make a .env file.

00:53.520 --> 00:56.640
And then make sure that you add your OpenAI API key.

00:56.640 --> 00:57.780
I'm not gonna add one

00:57.780 --> 01:00.120
because mine already technically exists.

01:00.120 --> 01:01.980
I've already got one on the terminal.

01:01.980 --> 01:04.680
Now this will load the environment variable

01:04.680 --> 01:06.690
and it will use the OpenAI API key

01:06.690 --> 01:11.430
and we are going to load a FastAPI server on port 5050.
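
NOTE
A minimal sketch (not the repo's actual code) of the step just described: reading
OPENAI_API_KEY from .env-style text and picking the server port. The helper name
parse_env_text and the PORT fallback of 5050 are assumptions for illustration; a real
project would more likely rely on the python-dotenv package.
```python
import os
def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
# Hypothetical .env contents; the key below is a placeholder, not a real key.
env = parse_env_text('OPENAI_API_KEY="sk-placeholder"\n# comment\nPORT=5050\n')
# A variable already exported in the terminal wins over the file, as in the video.
api_key = os.environ.get("OPENAI_API_KEY", env["OPENAI_API_KEY"])
port = int(env.get("PORT", "5050"))
```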

01:11.430 --> 01:13.470
Now I've put a system message here that is

01:13.470 --> 01:16.110
for a chartered surveying practice

01:16.110 --> 01:18.150
and you could change the system prompt depending upon

01:18.150 --> 01:20.490
what you want the chat agent to do.

01:20.490 --> 01:22.200
You can also change the voice

01:22.200 --> 01:25.590
and the types of events that we listen to and respond to.

01:25.590 --> 01:27.810
Now the impactful thing happens

01:27.810 --> 01:29.760
inside of this /incoming-call,

01:29.760 --> 01:34.760
we actually have a request, we make a voice response

01:34.890 --> 01:36.150
and then what we then do is

01:36.150 --> 01:39.000
we then connect to the media stream.
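
NOTE
A hedged sketch of the kind of voice response the /incoming-call route could return:
TwiML that says a greeting and then connects the call audio to a WebSocket stream.
The function name, the greeting text, the example host and the "/media-stream" path
are assumptions for illustration, and the standard library is used here in place of
the Twilio helper library.
```python
import xml.etree.ElementTree as ET
def build_incoming_call_twiml(host: str, greeting: str) -> str:
    """Return TwiML that speaks a greeting, then streams call audio over a WebSocket."""
    response = ET.Element("Response")
    ET.SubElement(response, "Say").text = greeting
    connect = ET.SubElement(response, "Connect")
    # "/media-stream" is an assumed path for the WebSocket route on the same server.
    ET.SubElement(connect, "Stream", url=f"wss://{host}/media-stream")
    return ET.tostring(response, encoding="unicode")
twiml = build_incoming_call_twiml("example.ngrok.app", "Please wait while we connect you.")
```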

01:39.000 --> 01:42.000
So we've almost got a web socket connection

01:42.000 --> 01:43.560
directly to the media stream.

01:43.560 --> 01:46.388
This will then accept that WebSocket connection

01:46.388 --> 01:48.000
to the client.

01:48.000 --> 01:51.660
Then we will set a new web socket connection up to OpenAI.

01:51.660 --> 01:53.820
So you've got two web socket connections here.

01:53.820 --> 01:56.820
We'll then initialize an OpenAI session

01:56.820 --> 01:59.340
and then we've got two functions here

01:59.340 --> 02:03.057
that will receive from Twilio and send to Twilio.
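
NOTE
A rough sketch of what initializing the OpenAI session might look like: one
session.update event sent over the OpenAI WebSocket. The field names follow the
Realtime API as it looked in beta, so treat them as assumptions and check the current
reference. The prompt text, the voice name and the function name are placeholders.
```python
import json
SYSTEM_MESSAGE = "You are a helpful receptionist for a surveying practice."  # placeholder prompt
VOICE = "alloy"  # assumed voice name
def build_session_update(instructions: str, voice: str) -> str:
    """Build the JSON text for a session.update event."""
    event = {
        "type": "session.update",
        "session": {
            "turn_detection": {"type": "server_vad"},
            # Twilio media streams carry 8 kHz mu-law audio, so both directions use it.
            "input_audio_format": "g711_ulaw",
            "output_audio_format": "g711_ulaw",
            "voice": voice,
            "instructions": instructions,
            "modalities": ["text", "audio"],
        },
    }
    return json.dumps(event)
payload = build_session_update(SYSTEM_MESSAGE, VOICE)
```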

02:03.057 --> 02:04.890
And so the send to Twilio will actually

02:04.890 --> 02:08.340
be getting the information from OpenAI's WebSocket

02:08.340 --> 02:11.010
and it will be getting the audio deltas

02:11.010 --> 02:13.710
and sending that directly to the web socket.

02:13.710 --> 02:15.960
And the web socket will then receive that

02:15.960 --> 02:20.400
and it will then send that directly to Twilio.
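
NOTE
The forwarding step can be sketched as a pure function: take one event from the
OpenAI WebSocket and, if it is an audio delta, wrap it as a Twilio "media" message.
The function name and the stream SID are hypothetical; the event and field names
reflect my understanding of the two APIs and should be checked against their docs.
```python
import base64
import json
from typing import Optional
def to_twilio_media(openai_event: dict, stream_sid: str) -> Optional[str]:
    """Turn an OpenAI audio delta event into a Twilio "media" message, else None."""
    if openai_event.get("type") != "response.audio.delta" or "delta" not in openai_event:
        return None
    # Decode and re-encode so only valid base64 audio is forwarded.
    audio = base64.b64decode(openai_event["delta"])
    message = {
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio).decode("utf-8")},
    }
    return json.dumps(message)
chunk = base64.b64encode(b"\xff\x7f\xff").decode("utf-8")
out = to_twilio_media({"type": "response.audio.delta", "delta": chunk}, "MZ-hypothetical")
```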

02:20.400 --> 02:22.950
And the way that that works is basically we are gonna be

02:22.950 --> 02:24.870
adding in these audio chunks

02:24.870 --> 02:26.640
and then we're sending them back.

02:26.640 --> 02:29.130
And then that's basically how it works.

02:29.130 --> 02:31.080
And so you've also got an ability

02:31.080 --> 02:33.780
to truncate some of the messages if someone speaks.

02:33.780 --> 02:37.800
And we send these via the web socket using Twilio

02:37.800 --> 02:40.140
and then we basically clear those as well.
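
NOTE
A sketch of the interruption handling described here, assuming it runs when OpenAI
reports that the caller started speaking: tell OpenAI to truncate the assistant
message at the point the caller had heard, and tell Twilio to clear buffered audio.
All names and timestamps are hypothetical, and the message shapes are my best
understanding of the two APIs rather than a copy of the repo's code.
```python
import json
from typing import Optional
def build_interruption_messages(
    stream_sid: str,
    last_item_id: Optional[str],
    response_start_ms: Optional[int],
    latest_media_ms: int,
) -> tuple[Optional[str], str]:
    """Return (truncate event for OpenAI or None, clear message for Twilio)."""
    truncate = None
    if last_item_id is not None and response_start_ms is not None:
        truncate = json.dumps({
            "type": "conversation.item.truncate",
            "item_id": last_item_id,
            "content_index": 0,
            # How much of the assistant audio had played before the interruption.
            "audio_end_ms": max(0, latest_media_ms - response_start_ms),
        })
    clear = json.dumps({"event": "clear", "streamSid": stream_sid})
    return truncate, clear
truncate, clear = build_interruption_messages("MZ-hypothetical", "item_1", 1200, 2000)
```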

02:40.140 --> 02:41.910
Now the send mark is quite important.

02:41.910 --> 02:45.330
The send mark is how we send information directly to Twilio.
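
NOTE
For context, a small sketch of a mark helper. As I understand Twilio media streams,
a "mark" is a label sent after audio, and Twilio echoes it back once that audio has
played, which lets the server track what the caller has heard. The queue, the
function names and the mark name "responsePart" are assumptions for illustration.
```python
import json
from collections import deque
mark_queue = deque()  # names of marks sent but not yet echoed back by Twilio
def build_mark(stream_sid: str, name: str = "responsePart") -> str:
    """Build a Twilio "mark" message and remember it as outstanding."""
    mark_queue.append(name)
    return json.dumps({"event": "mark", "streamSid": stream_sid, "mark": {"name": name}})
def on_twilio_event(event: dict) -> None:
    """Drop one outstanding mark when Twilio echoes a mark back."""
    if event.get("event") == "mark" and mark_queue:
        mark_queue.popleft()
msg = build_mark("MZ-hypothetical")
on_twilio_event({"event": "mark", "mark": {"name": "responsePart"}})
```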

02:45.330 --> 02:46.500
So if I scroll up here,

02:46.500 --> 02:49.320
you'll see they're actually sending some information

02:49.320 --> 02:51.210
and the information that we have here

02:51.210 --> 02:53.433
also contains the audio deltas.

02:54.300 --> 02:57.150
So that marks the end of the conversation.

02:57.150 --> 02:58.650
And that's basically it.

02:58.650 --> 02:59.700
So how do we run this?

02:59.700 --> 03:01.110
First you need a Twilio number.

03:01.110 --> 03:02.340
You need a Twilio account.

03:02.340 --> 03:03.900
And then you will also then need to

03:03.900 --> 03:05.970
basically go to this active number section

03:05.970 --> 03:07.230
and buy a number.

03:07.230 --> 03:08.700
Once you have a number,

03:08.700 --> 03:11.040
it will appear in the active number section.

03:11.040 --> 03:13.380
Then after that, then what you can do is

03:13.380 --> 03:15.090
you can click into your number.

03:15.090 --> 03:16.170
And this is where we then need

03:16.170 --> 03:18.933
to configure the number using webhooks.

03:22.080 --> 03:26.010
So for example, when I do ngrok on the terminal

03:26.010 --> 03:29.070
and I do this and I'm doing http 5050,

03:29.070 --> 03:32.760
it will give me a public endpoint that I can then expose

03:32.760 --> 03:35.040
so that we can share that with the webhook.

03:35.040 --> 03:37.170
And then we're basically gonna put that there

03:37.170 --> 03:40.440
and then we're just gonna put the /incoming-call,

03:40.440 --> 03:43.200
which will basically accept the call connection.
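
NOTE
The webhook value is just the public ngrok URL plus the route. A tiny sketch, with
a hypothetical forwarding URL and helper name, of building the string that goes
into the number's voice webhook field:
```python
from urllib.parse import urljoin
def webhook_url(ngrok_forwarding_url: str, path: str = "/incoming-call") -> str:
    """Join the public ngrok URL with the route that answers the call."""
    return urljoin(ngrok_forwarding_url.rstrip("/") + "/", path.lstrip("/"))
# Hypothetical forwarding URL of the kind printed by `ngrok http 5050`.
url = webhook_url("https://example.ngrok.app")
```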

03:43.200 --> 03:45.180
Then we're gonna click save configuration.

03:45.180 --> 03:46.650
Then the next thing you now need to do

03:46.650 --> 03:48.630
is you just need to start your Python server.

03:48.630 --> 03:53.630
So make sure you go into the Python and type python main.py.

03:53.970 --> 03:57.510
And then what you now have is a FastAPI server

03:57.510 --> 04:00.210
that's running on port 5050 locally,

04:00.210 --> 04:04.650
but we've actually proxied that using ngrok.

04:04.650 --> 04:06.990
So we've got this publicly accessible URL,

04:06.990 --> 04:10.800
which is then gonna forward to port 5050 on localhost.

04:10.800 --> 04:13.380
So now when we actually do phone conversations

04:13.380 --> 04:14.880
and we call this number,

04:14.880 --> 04:17.220
we're actually gonna be interacting

04:17.220 --> 04:20.700
directly with the FastAPI server that you have locally.

04:20.700 --> 04:22.650
Now this is a really great experience.

04:22.650 --> 04:24.150
I really recommend trying this out

04:24.150 --> 04:26.280
and giving the number a couple of calls

04:26.280 --> 04:28.620
and I've done this in the past with my uncle

04:28.620 --> 04:29.910
and we were sort of playing around with

04:29.910 --> 04:32.190
how we could improve the prompt.

04:32.190 --> 04:34.980
And we found that basically by listening to the calls

04:34.980 --> 04:37.184
and like interacting with it, we were just iterating

04:37.184 --> 04:39.210
through on this system message prompt

04:39.210 --> 04:40.950
and trying to figure out all of the different ways

04:40.950 --> 04:42.060
that it was going wrong.

04:42.060 --> 04:44.940
Just be aware that there is a tokens-per-minute limit

04:44.940 --> 04:46.620
of around 20,000

04:46.620 --> 04:49.020
so you might hit rate limits if you're on the usage tier

04:49.020 --> 04:50.730
of one at this point in time.

04:50.730 --> 04:52.890
But hopefully this gives you a good indication

04:52.890 --> 04:54.360
as to how you can start using this.

04:54.360 --> 04:56.430
So you've got your FastAPI set up,

04:56.430 --> 04:58.890
we have a proxy with ngrok

04:58.890 --> 05:01.590
and we've added that to the phone configuration.

05:01.590 --> 05:03.600
And then we've forwarded on the...

05:03.600 --> 05:05.640
Added on the extra /incoming-call

05:05.640 --> 05:08.010
so that this will then receive the call.

05:08.010 --> 05:09.600
We get a voice response,

05:09.600 --> 05:12.660
and then we also connect to the media stream.

05:12.660 --> 05:15.420
And the media stream then handles the connections

05:15.420 --> 05:17.370
between OpenAI and Twilio.

05:17.370 --> 05:18.970
Cool, I'll see you in the next one.
