WEBVTT

00:00.040 --> 00:06.480
Large language models are among the most sophisticated machine learning systems ever built, and their

00:06.480 --> 00:14.640
training process is a major reason why. Training an LLM is not a single step, but a multi-stage journey

00:14.640 --> 00:19.160
that transforms raw text into a conversational AI system.

00:19.640 --> 00:26.800
Rather than being explicitly taught facts, models learn by observing patterns in language at an enormous

00:26.800 --> 00:27.440
scale.

00:28.000 --> 00:34.320
At the core of this process is a deceptively simple idea: predicting the next token.

00:34.320 --> 00:41.600
By repeating this task billions of times across massive datasets, models gradually learn grammar,

00:41.600 --> 00:46.080
structure, associations, and even reasoning-like behavior.

00:46.680 --> 00:54.760
This requires vast amounts of data, specialized hardware such as GPUs and TPUs, and careful optimization

00:54.760 --> 00:55.520
techniques.

00:55.680 --> 01:01.670
Understanding how LLMs are trained helps explain both their strengths and weaknesses.

01:01.990 --> 01:06.910
It clarifies why models can sound intelligent yet still hallucinate,

01:07.150 --> 01:13.070
why fine-tuning changes behavior dramatically, and why alignment techniques are necessary.

01:13.510 --> 01:19.910
In this section, we'll break down each training phase so you can see how raw text becomes a powerful

01:19.910 --> 01:23.630
generative system and where its limitations come from.

01:23.630 --> 01:29.950
Pre-training is the foundation of every large language model and by far the largest training phase.

01:30.350 --> 01:37.190
During pre-training, the model is exposed to enormous volumes of text from diverse sources such as

01:37.190 --> 01:43.070
books, scientific papers, news articles, websites, and programming code.

01:43.630 --> 01:49.230
The goal is not to memorize facts, but to learn how language works in general.

01:49.670 --> 01:52.710
The learning task during pre-training is simple.

01:53.150 --> 01:56.790
Given a sequence of tokens, predict the next token.

01:56.790 --> 02:02.170
This is a self-supervised process, meaning no human-labeled data is required.
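
NOTE
Illustrative sketch, not part of the spoken talk. It shows one way raw text can
supply its own labels: every prefix of a sentence becomes an input, and the token
that follows it becomes the target. The function name make_training_pairs and the
whitespace "tokenizer" are assumptions made for this example; real systems use
subword tokenizers and far longer sequences.
```python
# Toy illustration: turn raw text into (context, next-token) training pairs.
# Assumption: whitespace splitting stands in for a real subword tokenizer.
def make_training_pairs(text):
    tokens = text.split()
    # Each prefix of the sequence is an input; the token after it is the label.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
pairs = make_training_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "=>", target)
```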

02:02.570 --> 02:09.090
The text itself provides the supervision. Over billions or trillions of prediction attempts,

02:09.370 --> 02:16.770
the model begins to internalize grammar, syntax, factual associations, and common reasoning patterns

02:16.770 --> 02:18.290
embedded in language.

02:18.450 --> 02:24.970
Because the data is so diverse, the model develops broad knowledge across many domains.

02:25.450 --> 02:29.810
However, this knowledge is statistical rather than factual.

02:30.370 --> 02:34.410
The model does not know information in a human sense.

02:34.810 --> 02:39.810
It learns what is likely to come next based on patterns it has seen before.

02:40.330 --> 02:47.690
Pre-training creates a powerful general-purpose language model, but it is still raw and unaligned at

02:47.690 --> 02:48.530
this stage.

02:48.650 --> 02:53.170
At the heart of LLM training lies a surprisingly simple objective:

02:53.450 --> 02:55.290
next-token prediction.

02:55.730 --> 03:02.640
The model is trained to assign the highest possible probability to the correct next token in a sequence.

03:03.040 --> 03:10.120
This is typically optimized using a cross-entropy loss function, which penalizes incorrect predictions

03:10.120 --> 03:12.120
and rewards accurate ones.
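
NOTE
Illustrative sketch, not part of the spoken talk. For a single prediction, the
cross-entropy loss reduces to the negative log of the probability the model gave
to the token that actually came next. The probabilities below are made up for
illustration, and cross_entropy is a hypothetical helper name.
```python
import math
# Hypothetical model output: probabilities for the token after "the cat sat on the".
predicted = {"mat": 0.70, "sofa": 0.20, "moon": 0.10}
def cross_entropy(probs, correct_token):
    # Loss is the negative log of the probability given to the correct token.
    return -math.log(probs[correct_token])
confident_loss = cross_entropy(predicted, "mat")   # correct token was likely: small loss
surprised_loss = cross_entropy(predicted, "moon")  # correct token was unlikely: large loss
print(round(confident_loss, 3), round(surprised_loss, 3))
```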

03:12.600 --> 03:14.680
The training loop is straightforward.

03:15.040 --> 03:21.560
The model receives an input sequence of tokens, generates a probability distribution over possible

03:21.560 --> 03:28.640
next tokens, compares its prediction to the actual next token, calculates the loss, and updates its

03:28.640 --> 03:31.360
parameters to improve future predictions.
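
NOTE
Illustrative sketch, not part of the spoken talk. The loop described above
(predict a distribution, compare with the actual next token, compute the loss,
update the parameters) can be shown on a tiny scale. Everything here is an
assumption chosen for brevity: a bigram model that looks only at the previous
token, a table of logits as the parameters, whitespace tokens, and plain
gradient descent. Real LLMs use transformer networks and far larger data.
```python
import math
# Toy next-token training loop under the assumptions stated in the note above.
tokens = "the cat sat on the mat and the cat sat on the rug".split()
vocab = sorted(set(tokens))
index = {tok: i for i, tok in enumerate(vocab)}
# logits[prev][nxt] is the model's score for nxt following prev; start at zero.
logits = [[0.0] * len(vocab) for _ in vocab]
def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
def train_epoch(learning_rate=0.5):
    total_loss = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        row = logits[index[prev]]
        probs = softmax(row)                        # distribution over next tokens
        total_loss += -math.log(probs[index[nxt]])  # cross-entropy vs. the actual token
        for j in range(len(vocab)):
            # Gradient of the loss with respect to each logit in this row.
            grad = probs[j] - (1.0 if j == index[nxt] else 0.0)
            row[j] -= learning_rate * grad          # parameter update
    return total_loss / (len(tokens) - 1)
losses = [train_epoch() for _ in range(50)]
print(f"first epoch loss: {losses[0]:.3f}, last epoch loss: {losses[-1]:.3f}")
```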

03:31.760 --> 03:34.720
This process is repeated billions of times.

03:35.160 --> 03:40.040
What makes this objective so powerful is the nature of language itself.

03:40.360 --> 03:47.320
Human knowledge, reasoning, and communication are encoded in language. By learning to predict language

03:47.360 --> 03:48.240
accurately,

03:48.360 --> 03:55.000
the model implicitly learns concepts, relationships, and patterns that underlie human thought.

03:55.280 --> 04:00.280
This is why a simple prediction task leads to complex emergent behavior.

04:00.790 --> 04:08.150
Reasoning, summarization, translation, and even problem solving arise naturally from mastering language

04:08.150 --> 04:08.830
patterns.

04:09.230 --> 04:15.750
However, it's important to remember that the model is still optimizing probabilities, not verifying

04:15.790 --> 04:19.270
truth or reasoning logically in a human way.

04:19.390 --> 04:26.350
While pre-training produces a powerful general-purpose language model, it does not make the model immediately

04:26.350 --> 04:28.350
useful for real applications.

04:28.830 --> 04:35.110
Fine-tuning is the stage where the model is adapted for specific tasks and use cases.

04:35.590 --> 04:42.350
During this phase, the pre-trained model is trained further on smaller, carefully curated data sets.

04:42.870 --> 04:50.030
Fine-tuning might focus on conversational data to create chat-based assistants, programming data sets

04:50.030 --> 04:57.230
to improve code generation, or domain-specific text for fields like medicine, law, or finance.

04:57.630 --> 05:03.700
These data sets are much smaller than pre-training data, but are higher quality and more targeted.

05:04.220 --> 05:08.220
This phase aligns the model's behavior with practical needs.

05:08.620 --> 05:13.460
It teaches the model how to respond appropriately in specific contexts,

05:13.620 --> 05:17.740
how to structure answers, and what types of outputs are expected.

05:18.140 --> 05:24.860
Fine-tuning does not fundamentally change the model's knowledge base, but it reshapes how that knowledge

05:24.860 --> 05:25.900
is expressed.

05:26.660 --> 05:33.460
For engineers, fine-tuning is critical because it's where models become specialized tools rather than

05:33.460 --> 05:35.220
generic text predictors.

05:35.620 --> 05:41.220
It's also where performance improvements for specific tasks are often most noticeable.

05:41.660 --> 05:48.820
Instruction tuning is a specialized form of fine-tuning that teaches models how to follow instructions

05:48.820 --> 05:51.660
and respond helpfully to user requests.

05:52.100 --> 05:57.180
Without this step, a language model would behave like a raw text generator,

05:57.340 --> 06:02.600
continuing sentences rather than answering questions. In instruction

06:02.600 --> 06:03.120
tuning,

06:03.360 --> 06:10.040
human experts create data sets consisting of instructions paired with high-quality responses.

06:10.680 --> 06:18.360
These instructions may ask the model to explain a concept, summarize text, solve a problem, or provide

06:18.360 --> 06:19.160
guidance.

06:19.520 --> 06:25.120
The model is trained to map user requests to appropriate structured responses.

06:25.640 --> 06:29.760
This process significantly improves output quality.

06:30.200 --> 06:34.440
Responses become clearer, more direct, and more useful.

06:34.920 --> 06:39.240
Instruction tuning also introduces safety considerations,

06:39.520 --> 06:45.400
teaching models to avoid harmful content and to decline inappropriate requests.

06:45.440 --> 06:49.720
The result is a model that feels more conversational and cooperative.

06:50.240 --> 06:56.800
Instead of merely predicting what comes next in text, the model learns patterns of helpful interaction.

06:57.320 --> 07:05.590
Instruction tuning is a major reason modern LLMs feel usable and intuitive rather than chaotic or confusing.

07:05.790 --> 07:13.830
Reinforcement learning from human feedback, or RLHF, is often the final step in training modern conversational

07:13.830 --> 07:14.510
models.

07:14.710 --> 07:22.750
While instruction tuning teaches models how to respond, RLHF teaches them which responses humans prefer.

07:23.230 --> 07:28.390
In this process, the model generates multiple responses to the same prompt.

07:28.750 --> 07:35.390
Human evaluators then rank these responses based on qualities such as helpfulness, accuracy, tone,

07:35.390 --> 07:36.270
and safety.

07:36.310 --> 07:41.790
These rankings are used to train a reward model that learns to predict human preferences.
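
NOTE
Illustrative sketch, not part of the spoken talk. One common way to train a
reward model on rankings is to split each ranking into (preferred, rejected)
pairs and penalize the model whenever the rejected response scores higher. The
talk does not name a specific objective, so the Bradley-Terry-style loss below
is an assumption, and the reward scores are made-up numbers.
```python
import math
# Sketch of a pairwise preference loss for a reward model.
# Assumption: a Bradley-Terry-style objective; the talk does not specify one.
def preference_loss(reward_preferred, reward_rejected):
    # Small when the preferred response scores higher than the rejected one.
    margin = reward_preferred - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
# Hypothetical reward scores for two responses to the same prompt.
agrees_with_humans = preference_loss(2.0, -1.0)
disagrees_with_humans = preference_loss(-1.0, 2.0)
print(round(agrees_with_humans, 3), round(disagrees_with_humans, 3))
```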

07:42.230 --> 07:46.230
The language model is then optimized to maximize this reward

07:46.230 --> 07:46.790
signal.

07:47.310 --> 07:54.550
Responses that align with human values are reinforced, while harmful, misleading, or unhelpful outputs

07:54.550 --> 07:55.590
are penalized.

07:56.030 --> 08:00.940
Over time, this shapes the model's behavior in subtle but important ways.

08:01.300 --> 08:06.300
RLHF is what makes modern LLMs feel polite, cautious, and conversational.

08:06.700 --> 08:13.980
It encourages honesty about limitations, discourages unsafe content, and improves overall usability.

08:14.340 --> 08:18.700
However, it does not make the model truthful or unbiased by default.

08:18.820 --> 08:21.700
It simply aligns outputs with human judgments.

08:22.100 --> 08:29.380
Understanding RLHF is key to understanding why models respond the way they do in practice.

08:29.420 --> 08:34.580
Understanding how LLMs are trained also explains their limitations.

08:35.060 --> 08:37.860
These models do not reason like humans.

08:38.380 --> 08:43.620
They identify patterns and make statistical inferences based on training data.

08:44.220 --> 08:50.220
This distinction is critical for anyone deploying or building systems with LLMs.

08:50.740 --> 08:58.020
Hallucinations occur because the model is optimized to produce plausible language, not verified facts.

08:58.500 --> 09:01.160
If patterns suggest a confident answer,

09:01.520 --> 09:05.120
the model may generate one even when it is incorrect.

09:05.120 --> 09:13.000
Biases arise because training data reflects human society, including its prejudices and imbalances.

09:13.560 --> 09:18.520
These biases can surface in model outputs if not carefully mitigated.

09:19.000 --> 09:21.680
LLMs also have knowledge cutoffs.

09:22.040 --> 09:28.240
They do not know about events that occurred after their training period, unless supplemented with external

09:28.240 --> 09:28.840
data.

09:29.240 --> 09:34.960
Finally, models rely on statistical mimicry rather than true understanding.

09:35.360 --> 09:40.200
They recombine learned patterns rather than reasoning from first principles.

09:40.480 --> 09:48.560
The key takeaway is this: LLMs are powerful language systems, not reasoning engines or truth machines.

09:49.160 --> 09:53.600
Understanding their training process allows you to use them responsibly,

09:53.800 --> 10:00.920
design safeguards, and build systems that complement their strengths while compensating for their weaknesses.