WEBVTT

00:00.184 --> 00:02.790
Instructor: First, it's important to start with

00:02.790 --> 00:04.620
what is language modeling?

00:04.620 --> 00:07.893
So this is the formal definition from Wikipedia,

00:08.820 --> 00:11.970
which talks about a probability distribution

00:11.970 --> 00:13.890
over a sequence of words.

00:13.890 --> 00:17.340
So if I'm going to bring this down to earth,

00:17.340 --> 00:20.910
language modeling is the task of predicting

00:20.910 --> 00:23.460
what word will come up next.

00:23.460 --> 00:24.750
You can think about it

00:24.750 --> 00:28.140
like a super, super smart autocomplete.

00:28.140 --> 00:31.800
So if we have a sentence like "the dog wagged its,"

00:31.800 --> 00:34.230
and then a collection of words,

00:34.230 --> 00:36.300
then according to this sentence,

00:36.300 --> 00:39.960
the language model assigns each word that you see right now

00:39.960 --> 00:41.640
a probability.

00:41.640 --> 00:43.800
And what it would output for you

00:43.800 --> 00:47.174
is the word with the highest probability of being correct.
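
NOTE
A minimal sketch of the autocomplete idea described above, assuming a made-up
table of candidate words. The probabilities and the name most_likely_word are
hypothetical and were not produced by a real language model.
```python
# Hypothetical probabilities for the word after "the dog wagged its";
# the numbers are invented for illustration only.
next_word_probs = {"tail": 0.90, "head": 0.05, "paw": 0.04, "car": 0.01}
def most_likely_word(probs: dict[str, float]) -> str:
    """Return the candidate word with the highest probability."""
    return max(probs, key=probs.get)
print(most_likely_word(next_word_probs))
```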

00:47.174 --> 00:50.340
So now let's review the formal definition

00:50.340 --> 00:52.590
of a language model.

00:52.590 --> 00:56.070
So for that, you don't really need to know computer

00:56.070 --> 00:59.103
science; you'll see it's very, very intuitive.

00:59.940 --> 01:03.330
So given a sequence of words,

01:03.330 --> 01:07.260
we'll mark the words X1, X2, up to Xt.

01:07.260 --> 01:09.990
So you can simply map each X

01:09.990 --> 01:12.990
to a word in the sentence from before.

01:12.990 --> 01:15.330
We want to compute the probability distribution

01:15.330 --> 01:18.270
of the next word, which is Xt+1.

01:18.270 --> 01:21.720
Because so far we have X1 through Xt,

01:21.720 --> 01:23.640
those are the words in our sentence,

01:23.640 --> 01:26.040
and we want to know what is the probability

01:26.040 --> 01:28.950
of the word Xt+1.

01:28.950 --> 01:29.783
Cool.

01:29.783 --> 01:32.493
So P is the notation for probability.

01:33.360 --> 01:35.977
So in this notation, you see right now we want to ask,

01:35.977 --> 01:40.230
"Hey, what is the probability of the next word,

01:40.230 --> 01:43.980
which is Xt+1, given (that's the vertical bar

01:43.980 --> 01:45.060
that you see right here)

01:45.060 --> 01:49.950
that we had the sentence X1 through Xt before?"

01:49.950 --> 01:54.180
And Xt+1 needs to be a word in our vocabulary.

01:54.180 --> 01:56.790
Now, our vocabulary is an object,

01:56.790 --> 02:00.450
which is denoted V, for vocabulary.
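
NOTE
The definition just described can be written out as follows. This is a sketch
in LaTeX: lowercase x stands for the words the instructor calls X1 through Xt,
and w is an assumed name for an arbitrary word in the vocabulary V. The second
line only states that the probabilities over the vocabulary sum to one.
```latex
P(x_{t+1} \mid x_1, x_2, \dots, x_t), \qquad x_{t+1} \in V
\sum_{w \in V} P(x_{t+1} = w \mid x_1, x_2, \dots, x_t) = 1
```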

02:00.450 --> 02:04.110
So if we just summarize what a language model is,

02:04.110 --> 02:08.431
then it's simply the idea that we have a sentence,

02:08.431 --> 02:11.760
a couple of words that come together, one after another,

02:11.760 --> 02:14.790
and we want to guess what will be the next word.

02:14.790 --> 02:17.883
So it's super, super similar to autocomplete,

02:18.780 --> 02:21.450
just like we know from our day-to-day lives,

02:21.450 --> 02:25.440
like when we write a text message and we have a suggestion,

02:25.440 --> 02:27.630
that suggestion is based on a language model.

02:27.630 --> 02:30.137
The same thing when we search for something

02:30.137 --> 02:31.743
in our search engine.

02:33.330 --> 02:35.880
So what's a large language model

02:35.880 --> 02:37.950
which is all the hype right now?

02:37.950 --> 02:42.750
So a large language model, abbreviated LLM,

02:42.750 --> 02:45.870
is simply a language model like we saw before

02:45.870 --> 02:50.870
that was trained on a huge amount of data.

02:52.740 --> 02:55.620
So you can think about it as a model,

02:55.620 --> 02:58.920
a language model that was trained on so much data

02:58.920 --> 03:01.980
that it is very, very good

03:01.980 --> 03:04.383
at calculating those probabilities.

03:06.060 --> 03:08.250
So every time you write a prompt,

03:08.250 --> 03:11.550
then what you basically do is give an input of words

03:11.550 --> 03:16.410
to the LLM, and the LLM simply guesses

03:16.410 --> 03:19.980
and makes its best effort to output

03:19.980 --> 03:22.140
the next word, one after another.

03:22.140 --> 03:23.820
So it calculates the probability

03:23.820 --> 03:26.700
and gives you the word with the highest probability

03:26.700 --> 03:29.400
in the context of the input you provided.

03:29.400 --> 03:33.480
So it simply guesses word after word after word,

03:33.480 --> 03:34.950
what you are expecting.
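
NOTE
A rough sketch of the word-after-word guessing described above, using a toy
bigram model as a stand-in for an LLM. This is a deliberate simplification: it
looks only at the last word, while an LLM conditions on the whole input. The
corpus and the names train_bigram_model and generate are hypothetical.
```python
from collections import Counter, defaultdict
# Toy training text, invented for illustration; a real LLM sees far more data.
corpus = "the dog wagged its tail . the dog chased its tail . the cat licked its paw ."
def train_bigram_model(text: str) -> dict[str, Counter]:
    """Count which word follows each word in the training text."""
    counts: dict[str, Counter] = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts
def generate(model: dict[str, Counter], prompt: str, max_words: int = 5) -> str:
    """Greedy decoding: repeatedly append the most frequent next word."""
    words = prompt.split()
    for _ in range(max_words):
        followers = model.get(words[-1])
        if not followers:
            break  # the last word was never followed by anything in training
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)
model = train_bigram_model(corpus)
print(generate(model, "the dog wagged", max_words=2))
```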

03:34.950 --> 03:38.812
Now, this is why LLMs can sometimes output something

03:38.812 --> 03:43.110
that is so far-fetched and simply not true,

03:43.110 --> 03:46.020
because the model is simply estimating probabilities

03:46.020 --> 03:48.480
and simply relying on that.

03:48.480 --> 03:51.063
So that's the entire concept of an LLM.