WEBVTT

00:00.060 --> 00:01.650
-: Hey, welcome, and in this video you're gonna learn,

00:01.650 --> 00:02.880
what are embeddings,

00:02.880 --> 00:05.580
how to use them, and their various use cases.

00:05.580 --> 00:08.580
Embeddings are basically a numerical representation

00:08.580 --> 00:11.700
of a word, or an image, audio, or a video.

00:11.700 --> 00:14.638
You can think of an embedding as a representation

00:14.638 --> 00:16.740
of that piece of data

00:16.740 --> 00:18.810
in a high-dimensional vector space.

00:18.810 --> 00:20.908
Vectors are able to capture semantic meaning,

00:20.908 --> 00:24.120
making similar concepts closer together mathematically.

00:24.120 --> 00:25.620
They're a core piece of technology

00:25.620 --> 00:27.930
that enables lots of different AI applications.

00:27.930 --> 00:29.310
Embeddings work by breaking text

00:29.310 --> 00:30.990
into tokens, which are words or subwords.

00:30.990 --> 00:33.390
You can then also convert tokens or images

00:33.390 --> 00:35.257
into these high-dimensional vectors,

00:35.257 --> 00:38.310
and that gives them a position in space.

00:38.310 --> 00:41.130
So even though they might have a thousand different columns

00:41.130 --> 00:45.030
or dimensions, they have a position in a mathematical space.

00:45.030 --> 00:46.890
You can then compare different types of vectors.

00:46.890 --> 00:47.760
So on the left here,

00:47.760 --> 00:50.520
you can see we've got queen being compared with king,

00:50.520 --> 00:51.930
or woman being compared with man.

00:51.930 --> 00:54.690
And we can compute mathematical similarities and differences

00:54.690 --> 00:56.430
between these different types of vectors.

00:56.430 --> 00:57.990
There's lots of different embedding models

00:57.990 --> 00:58.981
that you might want to use.

00:58.981 --> 01:01.740
OpenAI has some which we'll be covering.

01:01.740 --> 01:04.830
You also have Cohere, Sentence Transformers, and Google.

01:04.830 --> 01:07.260
You can also get open-source embedding models,

01:07.260 --> 01:08.487
so you don't have to pay for them

01:08.487 --> 01:10.470
apart from the computation it takes

01:10.470 --> 01:11.670
to generate the embedding.

01:11.670 --> 01:13.536
There's lots of different types of applications,

01:13.536 --> 01:16.620
so one of the main ones is semantic search.

01:16.620 --> 01:18.690
So on the right here, you can see we have a query

01:18.690 --> 01:20.543
and we will then use an embedding model

01:20.543 --> 01:23.810
and we will basically query a bunch of vectors

01:23.810 --> 01:26.430
and find vectors that are similar

01:26.430 --> 01:28.470
to our original user query.
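
NOTE
A minimal sketch of the semantic search step described above, not the presenter's code.
The function name top_k_similar and the toy 3-dimensional vectors are assumptions for illustration;
real embeddings would come from an embedding model and have far more dimensions.
```python
import numpy as np
def top_k_similar(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k documents most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q  # cosine similarity of each document against the query
    best = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in best]
# Toy 3-dimensional vectors standing in for real embeddings (made up for illustration).
docs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]])
query = np.array([1.0, 0.0, 0.0])
results = top_k_similar(query, docs, k=2)
```
Here the first and third documents point in nearly the same direction as the query, so they rank ahead of the second.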

01:28.470 --> 01:30.690
This brings us onto this idea of RAG,

01:30.690 --> 01:33.946
where we can take a query, do an embedding on that query,

01:33.946 --> 01:36.210
then we query a bunch of vectors

01:36.210 --> 01:37.800
that are stored in a database.

01:37.800 --> 01:41.550
We bring back retrieved contexts into the LLM's prompt

01:41.550 --> 01:44.640
or chat messages, and then we return a response.

01:44.640 --> 01:45.720
And the idea behind this

01:45.720 --> 01:47.940
is you do retrieval augmented generation,

01:47.940 --> 01:49.814
so doing a retrieval step

01:49.814 --> 01:52.140
before you actually generate content

01:52.140 --> 01:53.490
using a large language model.
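
NOTE
A rough sketch of the retrieval step in RAG as described above, under stated assumptions.
The helper names retrieve and build_prompt, the prompt wording, and the toy vectors are all hypothetical;
a real pipeline would embed the query with a model, query a vector database, and send the prompt to an LLM.
```python
import numpy as np
def retrieve(query_vec, chunk_vecs, chunks, k=1):
    """Return the k stored text chunks whose vectors are closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    order = np.argsort(c @ q)[::-1][:k]
    return [chunks[i] for i in order]
def build_prompt(question, contexts):
    """Place the retrieved context ahead of the question in the prompt text."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{joined}\nQuestion: {question}"
# Made-up chunks and 2-dimensional vectors standing in for a vector database.
chunks = ["Refunds take five days.", "Support is open on weekdays."]
chunk_vecs = np.array([[1.0, 0.1], [0.1, 1.0]])
query_vec = np.array([0.9, 0.2])  # pretend embedding of the question
prompt = build_prompt("How long do refunds take?", retrieve(query_vec, chunk_vecs, chunks))
```
The generation step, which is not shown, would pass prompt to the language model.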

01:53.490 --> 01:55.560
There's also other ways that you can leverage embeddings.

01:55.560 --> 01:57.390
So for example, text classification,

01:57.390 --> 02:00.180
where you can have different types of categories,

02:00.180 --> 02:01.800
either positive or negative.

02:01.800 --> 02:04.290
You can generate different types of embeddings,

02:04.290 --> 02:05.160
and you can use those

02:05.160 --> 02:07.620
to power content recommendation systems.

02:07.620 --> 02:09.990
So if we're looking at embeddings for OpenAI,

02:09.990 --> 02:11.231
you just import the package,

02:11.231 --> 02:13.770
and you just do client.embeddings.create,

02:13.770 --> 02:14.640
you have your input

02:14.640 --> 02:16.890
and your embedding model that you choose.

02:16.890 --> 02:17.760
Now, once you get that out,

02:17.760 --> 02:20.580
you'll see you get back a structured dictionary

02:20.580 --> 02:21.750
with some data.

02:21.750 --> 02:25.245
And this is the embedding here, this embedding here, right?

02:25.245 --> 02:26.781
And you can see it's just a bunch of numbers.

02:26.781 --> 02:30.696
And basically, that is your numerical representation

02:30.696 --> 02:33.300
of this input text on the left.

02:33.300 --> 02:35.430
There's different ways you can measure similarity

02:35.430 --> 02:37.590
when it comes to embeddings.

02:37.590 --> 02:39.900
The main three are cosine similarity,

02:39.900 --> 02:42.120
which measures the angle between vectors,

02:42.120 --> 02:44.734
Euclidean distance, which measures straight-line distance,

02:44.734 --> 02:45.990
and dot product,

02:45.990 --> 02:48.810
which multiplies the vectors element by element and sums the result.

02:48.810 --> 02:50.850
The main one is cosine similarity,

02:50.850 --> 02:52.860
so the equation is on the right here,

02:52.860 --> 02:56.280
but basically, cosine similarity is the main way

02:56.280 --> 02:58.800
that we do embedding similarity metrics,

02:58.800 --> 03:02.310
and it ranges from negative one to one,

03:02.310 --> 03:04.380
one being completely in the same direction,

03:04.380 --> 03:08.820
and negative one being in the opposite direction.
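
NOTE
The equation shown on the slide is not part of this transcript.
For reference, the standard definition of cosine similarity for two n-dimensional vectors a and b is:
```latex
\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
```
The numerator is the dot product and the denominator is the product of the two vector lengths, which is why the result stays between negative one and one.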

03:08.820 --> 03:10.110
You've got a coding example here,

03:10.110 --> 03:12.240
so you can take two NumPy arrays

03:12.240 --> 03:15.510
and calculate the cosine similarity between those two.

03:15.510 --> 03:17.880
And you can see that these NumPy arrays are quite similar,

03:17.880 --> 03:20.460
so you end up with a cosine similarity score,

03:20.460 --> 03:21.780
which is close to one.
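
NOTE
A small sketch of the kind of NumPy calculation described above.
The two arrays are chosen for illustration and are not the ones shown on the slide;
Euclidean distance and dot product are included only for comparison with the other two metrics mentioned earlier.
```python
import numpy as np
def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# Example arrays chosen for illustration; they are not the ones on the slide.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 7.0])
cos = cosine_similarity(a, b)
euclidean = float(np.linalg.norm(a - b))  # straight-line distance
dot = float(np.dot(a, b))  # sum of element-wise products
```
Because b points in almost the same direction as a, cos comes out just below one even though the two vectors differ in length.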

03:21.780 --> 03:23.370
There's also lots of advanced techniques

03:23.370 --> 03:24.270
and future directions,

03:24.270 --> 03:27.000
so people are doing things like fine-tuning embeddings

03:27.000 --> 03:30.030
for specific languages, applications, and industries.

03:30.030 --> 03:32.640
You also have multimodal embeddings like CLIP

03:32.640 --> 03:35.250
and ImageBind that can create unified vector spaces

03:35.250 --> 03:38.070
for text, images, and audio.

03:38.070 --> 03:40.310
And you've also got hybrid approaches combining embeddings

03:40.310 --> 03:44.250
with traditional methods like BM25 or TF-IDF,

03:44.250 --> 03:47.340
which is term frequency inverse document frequency,

03:47.340 --> 03:50.340
so traditional keyword search with embeddings.
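
NOTE
One common way to combine a keyword ranking with an embedding ranking is reciprocal rank fusion.
This is a swapped-in illustration, not a method named in the video.
The document ids and both rankings below are hypothetical, and k=60 is simply the conventional constant.
```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document ids; ids ranked highly in several lists score best."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
# Hypothetical results for one query from a keyword search and an embedding search.
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
embedding_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([keyword_ranking, embedding_ranking])
```
Documents that appear near the top of both lists, here doc_b and doc_a, end up ahead of documents found by only one method.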

03:50.340 --> 03:53.640
And the key challenge is keeping embeddings updated

03:53.640 --> 03:56.970
with new knowledge without completely retraining them.

03:56.970 --> 03:59.190
Okay, great, so you now know about embeddings,

03:59.190 --> 04:00.753
and a couple of practical use cases.

04:00.753 --> 04:02.820
The next thing we're gonna do is get hands-on

04:02.820 --> 04:04.303
and go through an embeddings notebook

04:04.303 --> 04:06.487
and see how we can create embeddings

04:06.487 --> 04:09.109
as well as how you can also use cosine similarity

04:09.109 --> 04:10.710
to compare different embeddings

04:10.710 --> 04:12.210
and how similar those embeddings are.

04:12.210 --> 04:13.860
So the notebook I'd like you to open

04:13.860 --> 04:16.890
is this embeddings notebook in the OpenAI features

04:16.890 --> 04:18.120
and functionality folder.

04:18.120 --> 04:19.740
Open that, and then we're gonna go through this,

04:19.740 --> 04:21.000
and we're gonna run through the code.

04:21.000 --> 04:22.500
So the first thing that we're doing

04:22.500 --> 04:23.333
is we're installing

04:23.333 --> 04:27.496
OpenAI, tiktoken, pandas, scikit-learn, matplotlib.

04:27.496 --> 04:30.750
And we're then doing all these imports.

04:30.750 --> 04:31.920
We're then gonna go and learn

04:31.920 --> 04:33.540
how to generate an embedding in OpenAI.

04:33.540 --> 04:36.420
So the first thing you wanna do is replace this section

04:36.420 --> 04:38.130
with your OpenAI key.

04:38.130 --> 04:41.010
Next, we're gonna create a function called get_embedding.

04:41.010 --> 04:42.700
It has a default parameter

04:42.700 --> 04:46.050
for the model called text-embedding-3-small.

04:46.050 --> 04:48.180
We replace all of the new line characters,

04:48.180 --> 04:50.649
and we then run client.embeddings.create

04:50.649 --> 04:54.330
with our input text and the model that we're using.

04:54.330 --> 04:56.280
We're then returning a NumPy array

04:56.280 --> 05:00.060
with resp.data[0].embedding.

05:00.060 --> 05:02.056
This gives you your embedding vector.
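
NOTE
A sketch of what the function described above might look like; the notebook itself is not shown in this transcript.
Assumptions: newlines are replaced with spaces, the optional client parameter is added here so the function can be tried with a stand-in client,
and the API key is read from the OPENAI_API_KEY environment variable rather than pasted into the code.
```python
import numpy as np
def get_embedding(text, model="text-embedding-3-small", client=None):
    """Embed one piece of text and return the vector as a NumPy array."""
    if client is None:
        from openai import OpenAI  # imported lazily; needs the openai package
        client = OpenAI()  # assumption: key comes from the OPENAI_API_KEY environment variable
    text = text.replace("\n", " ")  # assumption: newlines are replaced with spaces
    resp = client.embeddings.create(input=text, model=model)
    return np.array(resp.data[0].embedding)
```
Calling get_embedding with a sentence and no client makes a real API request, so it needs a valid key and network access.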

05:02.056 --> 05:03.750
We can then put a piece of text in.

05:03.750 --> 05:04.583
So for example,

05:04.583 --> 05:07.500
"The quick brown fox jumps over the lazy dog."

05:07.500 --> 05:09.589
We can then run the get_embedding function,

05:09.589 --> 05:11.880
and you'll see that we have an embedding

05:11.880 --> 05:15.178
with 1,536 dimensions.

05:15.178 --> 05:18.540
And we've also printed out the first five dimensions.

05:18.540 --> 05:21.360
Now, I will say that the larger the vector length,

05:21.360 --> 05:24.030
i.e., the more dimensions it has, generally speaking,

05:24.030 --> 05:25.497
the richer it is,

05:25.497 --> 05:28.020
and the more context about that specific word,

05:28.020 --> 05:31.733
or phrase, or paragraph, or document it has.

05:31.733 --> 05:33.570
It does take more memory

05:33.570 --> 05:34.980
to generate those types of embeddings,

05:34.980 --> 05:38.160
so they cost more and they also cost more to store as well.

05:38.160 --> 05:39.450
We can also use tiktoken

05:39.450 --> 05:41.820
to count the number of tokens before embedding.

05:41.820 --> 05:45.180
So for example, this specific sample token count

05:45.180 --> 05:47.011
had around 10 tokens.
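
NOTE
A minimal sketch of counting tokens with tiktoken before embedding.
The helper name count_tokens is hypothetical, and cl100k_base is assumed to be the encoding that matches the embedding model;
the notebook's exact choice is not stated in the video.
```python
def count_tokens(text, encoding_name="cl100k_base"):
    """Count the tokens in text using a tiktoken encoding."""
    import tiktoken  # imported lazily; needs the tiktoken package
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))
```
Checking the count first is a cheap way to spot inputs that would exceed the model's token limit.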

05:47.011 --> 05:49.650
You can then also generate multiple embeddings.

05:49.650 --> 05:51.420
So we have our sentences here.

05:51.420 --> 05:55.306
We're then using numpy.vstack to vertically stack these.

05:55.306 --> 05:57.900
And we're just doing a for loop here

05:57.900 --> 05:58.950
in a list comprehension

05:58.950 --> 06:01.200
where we're going over every item

06:01.200 --> 06:02.940
in the sentences Python list.

06:02.940 --> 06:05.670
We're running the get_embedding function against it.

06:05.670 --> 06:08.820
And then we are using something called PCA, which allows us

06:08.820 --> 06:12.030
to shrink the embedding space to only two dimensions

06:12.030 --> 06:14.760
whilst retaining a lot of the variance.
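
NOTE
The video uses scikit-learn's PCA; as a swapped-in illustration, the same projection can be sketched with plain NumPy.
The function name pca_2d is hypothetical, and the random vectors stand in for stacked embeddings so the example runs without an API call.
```python
import numpy as np
def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by variance explained.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
# Random stand-ins for stacked embeddings: 5 vectors with 16 dimensions each.
rng = np.random.default_rng(0)
vectors = np.vstack([rng.normal(size=16) for _ in range(5)])
points = pca_2d(vectors)
```
Each row of points is an (x, y) pair that can be passed to a scatter plot, one point per sentence.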

06:14.760 --> 06:16.860
When we plot this, you'll actually see,

06:16.860 --> 06:18.570
so we're gonna run this here.

06:18.570 --> 06:21.240
We basically are calculating the embeddings,

06:21.240 --> 06:22.485
and then you can see here,

06:22.485 --> 06:25.241
we have all of these embeddings.

06:25.241 --> 06:27.210
And notice how OpenAI

06:27.210 --> 06:29.880
and machine learning are actually closer together

06:29.880 --> 06:33.720
than the sky, hiking, and restaurant sentences over here.

06:33.720 --> 06:36.660
So you can see that the embeddings actually do have some

06:36.660 --> 06:39.030
sort of spatial awareness of themselves.

06:39.030 --> 06:40.590
Now, there's a quick exercise for you.

06:40.590 --> 06:43.875
I want you to pick five of your own short sentences,

06:43.875 --> 06:46.920
embed them using the get_embedding function,

06:46.920 --> 06:49.350
compute pairwise cosine similarities,

06:49.350 --> 06:52.740
and identify the two most similar sentences.

06:52.740 --> 06:54.900
Now, to do the cosine similarity one,

06:54.900 --> 06:58.710
what you're gonna use is this from sklearn.metrics.pairwise,

06:58.710 --> 07:01.080
import the cosine similarity metric.

07:01.080 --> 07:02.490
And then after you've imported that,

07:02.490 --> 07:04.965
then we're gonna run similarity.

07:04.965 --> 07:06.600
(keyboard keys clacking)

07:06.600 --> 07:09.270
And then we'll just do cosine similarity of the vectors.

07:09.270 --> 07:10.140
(keyboard keys clacking)

07:10.140 --> 07:13.140
And what you can see here is you end up with a matrix

07:13.140 --> 07:14.820
of all of the different similarities

07:14.820 --> 07:18.450
of each individual vector against every other vector.

07:18.450 --> 07:20.370
Now, notice there's this value of one going down here

07:20.370 --> 07:21.330
in the diagonal.

07:21.330 --> 07:22.740
That is the metric

07:22.740 --> 07:26.940
of the vector's cosine similarity against itself.
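
NOTE
A NumPy sketch of the pairwise matrix described above, offered as an equivalent to the scikit-learn call used in the video rather than a copy of it.
The helper names and the toy vectors are assumptions for illustration.
```python
import numpy as np
def pairwise_cosine(vectors):
    """Cosine similarity of every row against every other row."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T
def most_similar_pair(sim):
    """Indices (i, j) of the highest similarity, ignoring the diagonal."""
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)  # a vector always matches itself with a score of 1
    i, j = np.unravel_index(np.argmax(masked), masked.shape)
    return int(i), int(j)
# Made-up 3-dimensional vectors standing in for sentence embeddings.
vectors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.8, 0.6, 0.0], [0.9, 0.1, 0.0]])
sim = pairwise_cosine(vectors)
pair = most_similar_pair(sim)
```
Masking the diagonal matters for the exercise, because otherwise every sentence would be reported as most similar to itself.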

07:26.940 --> 07:29.252
You can also do the similarity

07:29.252 --> 07:31.650
by just comparing two vectors,

07:31.650 --> 07:33.940
so I can do vectors zero

07:34.800 --> 07:37.413
and I can do cosine similarity.

07:39.660 --> 07:42.210
And then we have to reshape these to be the same

07:42.210 --> 07:45.600
in terms of a negative one and a one.

07:45.600 --> 07:48.113
And I'm just gonna make sure we get the right function here.

07:51.690 --> 07:54.570
And you can see vector zero against vector one

07:54.570 --> 07:58.260
had a cosine similarity of 0.46.
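
NOTE
scikit-learn's cosine_similarity expects 2D arrays of shape (n_samples, n_features), so a single vector has to be reshaped into one row first.
This sketch assumes the reshape in question is reshape(1, -1) and reproduces the calculation in NumPy with made-up values, not the vectors from the notebook.
```python
import numpy as np
v0 = np.array([0.2, 0.5, 0.1])  # made-up values for illustration
v1 = np.array([0.4, 0.1, 0.3])
row0 = v0.reshape(1, -1)  # shape (1, 3): one sample with three features
row1 = v1.reshape(1, -1)
# Same calculation a pairwise routine performs, giving a 1 x 1 result matrix.
sim = (row0 @ row1.T) / (np.linalg.norm(row0) * np.linalg.norm(row1))
score = float(sim[0, 0])
```
The result is a matrix with a single entry, which is why the score is read out with an index.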

07:58.260 --> 08:00.990
Cool, so have a go at doing these four steps.

08:00.990 --> 08:02.280
Pick some short sentences,

08:02.280 --> 08:04.560
embed them using the get_embedding function,

08:04.560 --> 08:06.930
compute the pairwise cosine similarities,

08:06.930 --> 08:09.600
and identify the two most similar sentences.

08:09.600 --> 08:11.070
And a bonus as well,

08:11.070 --> 08:12.150
visualize the embeddings

08:12.150 --> 08:15.900
in 2D using the PCA (Principal Component Analysis)

08:15.900 --> 08:17.350
machine learning model above.
