WEBVTT

00:00.120 --> 00:07.040
This opening slide introduces one of the most critical challenges in production AI systems: performance.

00:07.600 --> 00:15.280
No matter how intelligent an LLM system is, users judge it primarily by how fast and reliable it feels.

00:15.720 --> 00:19.560
Latency and scalability are therefore not secondary concerns.

00:19.880 --> 00:23.120
They define whether users trust and adopt the system.

00:23.520 --> 00:30.880
The visual on this slide highlights interconnected infrastructure, reinforcing the idea that LLM applications

00:30.880 --> 00:32.960
are distributed systems by nature.

00:33.560 --> 00:40.560
Requests flow through APIs, retrieval pipelines, model inference, and post-processing layers, each

00:40.560 --> 00:42.840
contributing to overall response time.

00:43.240 --> 00:48.720
This section focuses on engineering systems that feel responsive under real-world conditions.

00:49.160 --> 00:55.680
We are not chasing microsecond optimizations, but designing architectures that remain stable as traffic

00:55.680 --> 00:58.440
grows and workloads become more complex.

00:58.640 --> 01:06.830
By the end of this section, you will understand how to reason about latency holistically, how to choose

01:06.830 --> 01:14.230
the right inference strategies, and how to scale LLM applications without sacrificing reliability or

01:14.230 --> 01:15.550
user experience.

01:16.470 --> 01:24.150
This slide explains why latency is fundamentally a user experience problem, not just a technical metric.

01:24.710 --> 01:32.590
Modern users expect near-instant feedback. When response times become inconsistent or exceed a few seconds,

01:32.870 --> 01:38.710
users perceive the system as unreliable, even if it eventually returns a correct answer.

01:39.430 --> 01:45.550
The slide emphasizes an important psychological insight: perceived performance matters more than raw

01:45.550 --> 01:46.470
milliseconds.

01:46.870 --> 01:53.750
A system that responds consistently feels trustworthy, while one that is occasionally slow erodes confidence.

01:54.350 --> 01:57.510
Variability is more damaging than absolute delay.

01:58.070 --> 02:01.990
Scaling compounds these challenges. As traffic increases,

02:02.110 --> 02:08.300
infrastructure stress, tool integrations, and concurrency introduce additional latency layers.

02:08.820 --> 02:15.180
Systems that succeed are designed for scale from day one rather than retrofitted after problems appear.

02:15.500 --> 02:20.860
The key takeaway is clear: trust is built through predictability.

02:21.420 --> 02:25.020
Engineering for low and consistent latency is not optional.

02:25.380 --> 02:29.980
It is essential for adoption, retention, and long-term success.

02:30.220 --> 02:36.220
This slide breaks down where latency actually comes from in LLM applications.

02:36.820 --> 02:44.020
Importantly, latency is not caused by a single bottleneck; it compounds across the entire request lifecycle.

02:44.580 --> 02:51.500
The first contributor is network calls to LLM providers, often adding hundreds of milliseconds before

02:51.500 --> 02:53.060
inference even begins.

02:53.700 --> 03:00.980
Next comes context assembly, where retrieval-augmented generation, memory lookups, and prompt construction

03:00.980 --> 03:02.260
can add seconds.

03:02.860 --> 03:09.260
Model inference time varies depending on model size, context length, and provider load.

03:09.610 --> 03:16.570
More advanced systems introduce tool execution latency, where external APIs or databases are called

03:16.570 --> 03:17.850
during agent loops.

03:18.370 --> 03:24.850
After generation, post-processing adds additional overhead for parsing, formatting, and validation.

03:25.370 --> 03:31.770
Finally, cold starts from serverless or containerized environments introduce unpredictable delays.

03:32.210 --> 03:39.290
The critical insight highlighted at the bottom of the slide is that these latencies stack in complex

03:39.290 --> 03:40.570
agent workflows.

03:40.610 --> 03:45.850
What starts as seconds can easily become minutes if not carefully managed.
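
NOTE
Editor's sketch, not part of the narration: one way to see how per-stage latencies
stack is to time each stage of a request separately. The stage names and the sleep
durations below are assumptions for illustration, not measurements from the course.
```python
import time
from contextlib import contextmanager
# Hypothetical stage names; the sleeps stand in for real work.
timings: dict[str, float] = {}
@contextmanager
def timed(stage: str):
    """Record how long the wrapped block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start
def handle_request() -> None:
    with timed("context_assembly"):
        time.sleep(0.02)
    with timed("model_inference"):
        time.sleep(0.05)
    with timed("post_processing"):
        time.sleep(0.01)
handle_request()
total = sum(timings.values())
for stage, seconds in timings.items():
    print(f"{stage}: {seconds * 1000:.0f} ms ({seconds / total:.0%} of total)")
```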

03:47.450 --> 03:54.810
This slide introduces a key architectural decision: whether to use streaming or batch inference.

03:55.290 --> 04:01.410
While both approaches generate the same content, they fundamentally change how users experience the

04:01.410 --> 04:04.050
system. With streaming inference,

04:04.170 --> 04:07.250
tokens are sent incrementally as they are generated.

04:07.570 --> 04:15.560
This dramatically reduces perceived latency, often by 60 to 80%, because users see progress immediately.

04:16.120 --> 04:22.160
Streaming is ideal for chat interfaces and interactive tools where responsiveness matters.

04:22.640 --> 04:28.080
However, it requires infrastructure support such as WebSockets or Server-Sent Events.

04:28.720 --> 04:34.560
Batch inference, on the other hand, returns the full response only after generation completes.

04:35.040 --> 04:41.120
This maximizes throughput and simplifies error handling, making it ideal for offline jobs,

04:41.120 --> 04:43.520
analytics, and background processing.

04:44.040 --> 04:48.200
Batch responses are also easier to cache and post-process.

04:48.400 --> 04:55.840
The golden rule on this slide is critical: optimize for user experience first, not raw speed.

04:56.280 --> 05:03.480
A slower system that streams feels faster than a slightly quicker system that stays silent until completion.
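
NOTE
Editor's sketch, not part of the narration: a small simulation of why streaming
feels faster. It compares time to first token (streaming) with time to the full
response (batch). TOKEN_DELAY and the token list are assumptions; no real model
or provider API is called.
```python
import asyncio
import time
from typing import AsyncIterator
TOKEN_DELAY = 0.01  # assumed per-token generation time, for illustration only
async def generate_stream(tokens: list[str]) -> AsyncIterator[str]:
    """Yield each token as soon as it is 'generated' (streaming)."""
    for token in tokens:
        await asyncio.sleep(TOKEN_DELAY)
        yield token
async def generate_batch(tokens: list[str]) -> str:
    """Return the full text only after every token is ready (batch)."""
    return "".join([t async for t in generate_stream(tokens)])
async def measure(tokens: list[str]) -> tuple[float, float]:
    start = time.perf_counter()
    first_token_at = None
    async for _ in generate_stream(tokens):
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
    start = time.perf_counter()
    await generate_batch(tokens)
    batch_done_at = time.perf_counter() - start
    return first_token_at, batch_done_at
first_token, full_batch = asyncio.run(measure(["Hello", ",", " ", "world"] * 5))
print(f"streaming: first token after {first_token:.3f}s")
print(f"batch: first visible output after {full_batch:.3f}s")
```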

05:03.920 --> 05:10.080
This slide provides a practical framework for deciding between streaming and batch inference.

05:10.600 --> 05:13.360
The first question is user context.

05:13.800 --> 05:15.790
Is a human actively waiting,

05:16.070 --> 05:17.870
or is this a background task?

05:18.390 --> 05:23.630
If a person is watching the screen, streaming almost always improves the experience.

05:24.150 --> 05:26.510
Next is latency sensitivity.

05:27.070 --> 05:33.710
Interactive applications benefit from immediate feedback, while backend pipelines often tolerate delays

05:33.710 --> 05:34.870
measured in minutes.

05:35.550 --> 05:37.710
The third factor is system design.

05:38.110 --> 05:43.990
Streaming requires connection management and stateful infrastructure, which may not be feasible in

05:43.990 --> 05:45.110
all environments.

05:45.750 --> 05:48.990
The slide clearly outlines when to choose each approach.

05:49.510 --> 05:55.830
Streaming is best for chat assistants and tools, where users may interrupt or provide feedback.

05:56.430 --> 06:01.870
Batch processing is ideal for document ingestion, ETL pipelines, and analytics.

06:01.910 --> 06:06.750
The best practice at the bottom reinforces architectural flexibility.

06:07.030 --> 06:08.750
Support both modes.

06:09.350 --> 06:13.550
User-facing endpoints stream, while background workers batch.

06:14.350 --> 06:18.580
This separation allows you to optimize each workload independently.

06:19.020 --> 06:26.500
This slide explains why asynchronous processing is not an optimization, but a requirement for scalable

06:26.540 --> 06:27.700
LLM systems.

06:28.220 --> 06:35.380
LLM calls are slow, often taking seconds, and they are I/O-bound, meaning the server spends most of

06:35.380 --> 06:36.500
its time waiting.

06:36.940 --> 06:43.620
Blocking on these calls wastes resources and limits concurrency. With blocking architectures,

06:43.780 --> 06:50.300
systems often fail at surprisingly low user counts because each request monopolizes a thread.

06:50.940 --> 06:56.180
Async processing solves this by freeing request threads while waiting for responses.

06:56.740 --> 07:03.540
Modern async runtimes can handle context switching with minimal overhead, enabling orders of magnitude

07:03.540 --> 07:06.340
more concurrent requests on the same hardware.

07:07.020 --> 07:12.940
The metrics on this slide highlight the impact: up to 100 times more concurrency and

07:12.940 --> 07:15.380
dramatically better resource utilization.

07:15.940 --> 07:18.620
The quote at the bottom captures the mindset shift.

07:19.060 --> 07:26.210
Async is not a performance tweak; it is the foundation of any LLM application expecting real traffic.
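
NOTE
Editor's sketch, not part of the narration: a minimal illustration of I/O-bound
calls running concurrently on a single thread with asyncio. fake_llm_call is a
hypothetical stand-in; the sleep models time spent waiting on a provider.
```python
import asyncio
import time
async def fake_llm_call(prompt: str) -> str:
    """Stand-in for an I/O-bound provider call; the sleep models network wait."""
    await asyncio.sleep(0.1)
    return f"response to {prompt!r}"
async def main() -> tuple[list[str], float]:
    prompts = [f"prompt {i}" for i in range(50)]
    start = time.perf_counter()
    # All 50 calls wait concurrently on one thread instead of one after another.
    results = await asyncio.gather(*(fake_llm_call(p) for p in prompts))
    return list(results), time.perf_counter() - start
results, elapsed = asyncio.run(main())
print(f"{len(results)} calls finished in {elapsed:.2f}s")
```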

07:27.330 --> 07:33.450
This slide explores practical async patterns used in real-world LLM systems.

07:34.250 --> 07:42.210
Async HTTP calls allow concurrent requests to LLM providers and are essential for all real-time applications.

07:42.850 --> 07:43.490
Background

07:43.490 --> 07:50.250
workers decouple request handling from processing, enabling immediate user feedback while work continues

07:50.250 --> 07:51.290
asynchronously.

07:51.930 --> 07:59.210
Task queues like Celery, Sidekiq, or BullMQ distribute work across worker pools and provide retry logic and

07:59.210 --> 07:59.970
monitoring.

08:00.810 --> 08:05.690
These patterns are especially valuable for long-running or failure-prone tasks.

08:06.210 --> 08:09.650
The real world use cases reinforce these ideas.

08:10.210 --> 08:15.530
Long-running agents with tool calls should return task IDs rather than blocking users.

08:16.250 --> 08:20.890
Document ingestion and embedding generation belong entirely in background jobs.

08:21.530 --> 08:26.520
Batch embeddings and scheduled analytics benefit from parallelized async execution.

08:27.360 --> 08:31.280
The key message is that different workloads require different async patterns.

08:31.800 --> 08:37.360
Choosing the right one is an architectural decision that directly affects scalability and reliability.
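
NOTE
Editor's sketch, not part of the narration: the "return a task ID instead of
blocking" pattern, using an in-process asyncio.Queue. This is an assumption made
to keep the example self-contained; a production system would more likely use a
task queue such as the ones named in the narration, with persistent result storage.
```python
import asyncio
import uuid
results: dict[str, str] = {}  # in-memory store; a real system would persist this
async def worker(queue: asyncio.Queue) -> None:
    while True:
        task_id, document = await queue.get()
        await asyncio.sleep(0.05)  # stands in for slow ingestion or agent work
        results[task_id] = f"processed {len(document)} characters"
        queue.task_done()
async def submit(queue: asyncio.Queue, document: str) -> str:
    """Enqueue the work and return a task ID immediately instead of blocking."""
    task_id = str(uuid.uuid4())
    results[task_id] = "pending"
    await queue.put((task_id, document))
    return task_id
async def main() -> tuple[str, str, str]:
    queue: asyncio.Queue = asyncio.Queue()
    worker_task = asyncio.create_task(worker(queue))
    task_id = await submit(queue, "some long document")
    status_at_submit = results[task_id]
    await queue.join()  # a client would poll by task ID instead of waiting here
    worker_task.cancel()
    return task_id, status_at_submit, results[task_id]
task_id, before, after = asyncio.run(main())
print(before, "then", after)
```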

08:37.560 --> 08:40.960
This slide focuses on reliability under pressure.

08:41.440 --> 08:45.680
Even perfectly async systems can fail without proper load management.

08:46.080 --> 08:52.840
Traffic spikes, abuse, or downstream failures can trigger cascading outages if left unchecked.

08:53.440 --> 08:56.360
The design philosophy presented here is important.

08:56.680 --> 09:00.800
The goal is not infinite scalability, but graceful degradation.

09:01.120 --> 09:04.960
Serving most users well is better than failing for everyone.

09:05.520 --> 09:12.000
The slide outlines defensive techniques. Rate limiting controls request velocity and prevents abuse.

09:12.400 --> 09:16.760
Request queuing buffers spikes instead of rejecting traffic outright.

09:17.160 --> 09:21.040
Backpressure signals upstream systems to slow down before queues

09:21.040 --> 09:22.880
overflow. Timeouts

09:22.880 --> 09:26.320
free resources from stuck requests, and circuit breakers

09:26.320 --> 09:29.150
prevent repeated calls to failing services.

09:29.630 --> 09:34.590
Together, these mechanisms protect both infrastructure and user experience.

09:34.950 --> 09:38.790
Load management is about resilience, not just performance.
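
NOTE
Editor's sketch, not part of the narration: two of the defensive techniques named
above, a timeout and a circuit breaker, combined in one guarded call. The class,
thresholds, and stuck_service are hypothetical; real systems would usually rely on
a maintained resilience library rather than this minimal version.
```python
import asyncio
import time
from typing import Optional
class CircuitBreaker:
    """Open after max_failures consecutive failures; allow a retry after reset_after seconds."""
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: Optional[float] = None
    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.reset_after
    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()
async def guarded_call(breaker: CircuitBreaker, call, timeout: float) -> str:
    if not breaker.allow():
        return "rejected: circuit open"  # fail fast instead of piling onto a failing service
    try:
        result = await asyncio.wait_for(call(), timeout)  # the timeout frees the stuck request
    except asyncio.TimeoutError:
        breaker.record(ok=False)
        return "failed: timeout"
    breaker.record(ok=True)
    return result
async def stuck_service() -> str:
    await asyncio.sleep(1.0)  # hypothetical downstream call that never answers in time
    return "ok"
async def main() -> list[str]:
    breaker = CircuitBreaker(max_failures=2)
    return [await guarded_call(breaker, stuck_service, timeout=0.01) for _ in range(4)]
outcomes = asyncio.run(main())
print(outcomes)
```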

09:38.830 --> 09:45.510
The final slide emphasizes that true scaling is architectural, not just infrastructural.

09:46.190 --> 09:51.710
Simply adding servers without redesigning system structure only scales problems.

09:52.230 --> 09:54.710
The slide breaks scaling into layers.

09:55.110 --> 10:00.390
The API layer should be stateless and horizontally scalable behind load balancers.

10:00.870 --> 10:06.230
Worker pools should be separated by workload type and autoscaled based on queue depth.

10:06.710 --> 10:11.270
Caching layers reduce repeated computation and improve response times.

10:11.790 --> 10:15.910
Model routing strategies match request complexity to appropriate models.
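
NOTE
Editor's sketch, not part of the narration: a toy version of the caching and model
routing layers. The model names and the word-count heuristic are assumptions for
illustration; real routing would use a better complexity signal, and real caching
usually needs expiry and a shared store rather than an in-process lru_cache.
```python
from functools import lru_cache
# Hypothetical model names and a deliberately naive complexity heuristic.
SMALL_MODEL = "small-fast-model"
LARGE_MODEL = "large-capable-model"
def route(prompt: str) -> str:
    """Send short prompts to the cheaper model and long ones to the larger model."""
    return SMALL_MODEL if len(prompt.split()) < 50 else LARGE_MODEL
calls = 0
@lru_cache(maxsize=1024)
def answer(prompt: str) -> str:
    """Cache by exact prompt text so repeated requests skip inference."""
    global calls
    calls += 1
    return f"[{route(prompt)}] answer to: {prompt[:30]}"
answer("What is latency?")
answer("What is latency?")  # served from the cache, so no second 'inference' happens
print(calls, route("word " * 100))
```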

10:16.430 --> 10:20.630
The critical insight at the bottom is the most important takeaway of this section.

10:20.990 --> 10:24.430
Scaling requires thoughtful design across every layer.

10:24.950 --> 10:30.990
When architecture is sound, adding infrastructure amplifies success rather than chaos.
