WEBVTT

00:00.110 --> 00:01.190
Welcome back, everybody.

00:01.190 --> 00:05.690
In this section, I want to discuss what's arguably the most powerful feature that Kubernetes provides.

00:05.690 --> 00:08.690
And that is horizontal pod auto scaling.

00:08.720 --> 00:09.380
Okay.

00:09.380 --> 00:15.800
So this is the ability to automatically scale your pods based on the amount of traffic, based on the

00:15.800 --> 00:18.680
amount of load that they experience.

00:18.770 --> 00:25.730
And in order to truly weigh how important this feature is, let's imagine an e-commerce application

00:25.730 --> 00:31.370
that's been regularly serving traffic to its customers all throughout the year, and then suddenly they

00:31.370 --> 00:37.460
get promoted by a YouTuber or something, and they start getting a very sudden spike in traffic that

00:37.460 --> 00:40.040
their application simply isn't ready for.

00:40.040 --> 00:45.890
So the app becomes really slow and unresponsive, which hinders the user experience.

00:46.040 --> 00:50.690
Fewer customers are purchasing through the website, so the business is losing a lot of money.

00:50.690 --> 00:56.240
So the operations team and all the IT guys, they get woken up in the middle of the night to scale the

00:56.240 --> 01:03.000
deployments up in a very hurried panic. But by the time they did scale the deployment up,

01:03.540 --> 01:06.990
many customers had already abandoned their purchases.

01:07.050 --> 01:12.090
And then after this big sale is over, they need to scale the deployment back down.

01:12.090 --> 01:16.380
Otherwise, they'd be wasting very valuable computing resources.

01:17.130 --> 01:19.470
And that takes a lot of time as well.

01:19.470 --> 01:21.840
But also,

01:22.140 --> 01:27.900
by manually scaling your deployments back down, you risk potential future disruptions in the event

01:27.900 --> 01:31.620
that there is another unexpected promotion.

01:31.620 --> 01:39.330
So all that is to say that manually scaling your deployments up and down is a very tedious, stressful,

01:39.330 --> 01:41.370
and inefficient process.

01:41.370 --> 01:47.670
And the fact that Kubernetes can automatically scale application pods based on the amount of load that

01:47.670 --> 01:52.470
they experience is really, really powerful and really, really important.

01:52.470 --> 01:55.800
So we're going to discuss what it is in this lesson.

01:55.800 --> 02:00.930
And I'm going to show you a demo of horizontal pod auto scaling in action.

02:01.850 --> 02:06.050
But first and foremost, why is it called horizontal auto scaling?

02:06.350 --> 02:11.510
Well, let's imagine that you're this magical construction worker that can create buildings at the snap

02:11.510 --> 02:12.200
of a finger.

02:12.230 --> 02:14.600
Okay, there are hundreds of people outside.

02:14.600 --> 02:17.420
It starts raining and they need to get into the building for cover.

02:17.420 --> 02:19.850
But the building can only accommodate 20 people.

02:19.850 --> 02:26.930
And so you've got two options. You can, at the snap of a finger, make the building a lot bigger.

02:26.930 --> 02:34.610
So grow it vertically in order to accommodate more people. Or, at the snap of your finger, you can create

02:34.610 --> 02:41.090
many more buildings alongside the original one in order to accommodate all of that traffic.

02:41.120 --> 02:44.630
Okay, so creating more buildings would be considered horizontal scaling.

02:44.630 --> 02:46.430
Making the one building bigger is vertical scaling.

02:46.430 --> 02:54.830
If we relate this back to pods, vertically scaling a pod would just mean giving it more resources,

02:54.830 --> 02:58.970
more CPU in order to handle all of that traffic.

02:59.180 --> 03:06.240
Whereas horizontal pod autoscaling means creating more pods and distributing that traffic

03:06.240 --> 03:08.190
across all of these pods.

03:08.220 --> 03:14.340
This is the preferred scaling approach, and it's what we're going to be discussing in this video.

03:15.270 --> 03:16.680
So how does it work?

03:16.680 --> 03:21.720
You've got this horizontal pod autoscaler abstraction that's provided by Kubernetes.

03:21.720 --> 03:28.320
This horizontal pod autoscaler object where you can define your minimum number of replicas, your maximum

03:28.320 --> 03:35.970
number of pod replicas, and the target by which the HPA should scale your application pods.

03:36.000 --> 03:45.480
Okay, so here we're saying if the average CPU utilization of all the pods ever exceeds 50%, scale

03:45.480 --> 03:46.230
up.

03:46.260 --> 03:52.350
If the average CPU utilization is way below 50%, then scale down.

03:52.350 --> 03:59.040
So here we've got a pod that's been allocated 200 millicores of CPU, and it's only using up five,

03:59.040 --> 04:02.190
which means the average utilization is about 3% (5 out of 200 is 2.5%).

04:02.220 --> 04:03.060
All right.

04:03.080 --> 04:12.020
But then suddenly, our pod starts getting so much traffic that it's using up 55% of its allocated

04:12.020 --> 04:12.710
CPU.

04:12.710 --> 04:19.610
So the horizontal pod autoscaler decides to create another pod and distribute the load.

04:19.640 --> 04:23.810
Now the average CPU utilization is 19%.

04:23.810 --> 04:26.780
And now at this point, traffic is starting to decrease.

04:26.780 --> 04:31.070
So the autoscaler decides to scale back down to one pod.

04:31.100 --> 04:34.580
Now the average utilization is only 16%.

04:34.640 --> 04:41.150
This one pod is able to handle things perfectly and then there is no traffic at all.

04:41.150 --> 04:43.970
We're down to 1%, only one pod.

04:43.970 --> 04:46.790
But then we get this sudden burst in traffic.

04:46.820 --> 04:52.490
110 out of 200 millicores of CPU are being used: 55%.

04:52.490 --> 04:58.970
So the horizontal pod autoscaler scales back up, distributes the load, but the traffic is so high

04:59.000 --> 05:01.460
that even two pods are using up

05:01.460 --> 05:06.300
most of their CPU, at an average CPU utilization of 60%.

05:06.300 --> 05:09.060
So the autoscaler creates another pod.

05:09.090 --> 05:18.930
Now the load has been distributed enough such that the average utilization is 40%, and all three pods

05:18.930 --> 05:23.610
were able to handle the traffic with ease until there was no more traffic.

05:23.640 --> 05:29.010
They all came back down to an average CPU utilization of 1%, which is way below the target.

05:29.010 --> 05:33.870
So the autoscaler scales back down to a single pod.
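
NOTE
The scale-up and scale-down decisions in the walkthrough above can be sketched with the replica
formula from the Kubernetes documentation: desired = ceil(current * currentMetric / target).
This is a simplified sketch, not the controller's implementation. The function name and defaults are
hypothetical; the 0.1 tolerance is the documented default, and the real autoscaler also waits out a
stabilization window before scaling down.
```python
import math
def desired_replicas(current_replicas: int, current_utilization: float, target_utilization: float, min_replicas: int = 1, max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Hypothetical helper: replica count the documented HPA formula would ask for."""
    ratio = current_utilization / target_utilization
    # Inside the tolerance band around 1.0 the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Never go below minReplicas or above maxReplicas.
    return max(min_replicas, min(max_replicas, wanted))
# Two pods averaging 60% against a 50% target: a third pod is requested.
print(desired_replicas(2, 60, 50))  # prints 3
# Three pods averaging 1% against a 50% target: back down to the minimum.
print(desired_replicas(3, 1, 50))  # prints 1
```
Both results match the lesson: two pods at 60% grow to three, and three idle pods shrink to one.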

05:33.900 --> 05:34.230
All right.

05:34.260 --> 05:42.540
So it's simply an object that scales your pods up and down with the amount of traffic, with the amount

05:42.540 --> 05:44.910
of workload that's being experienced.

05:46.290 --> 05:52.890
Now, CPU is the preferred metric for autoscaling, not memory. When there is a lot of traffic,

05:52.890 --> 05:56.190
we want the autoscaler to scale our pods up.

05:56.220 --> 05:57.690
CPU and memory

05:57.690 --> 06:04.840
both usually increase as traffic goes up, but then when traffic reduces, we want the auto scaler

06:04.840 --> 06:08.920
to scale our pods back down because resources, they cost money.

06:09.070 --> 06:16.990
Now CPU because it's compressible, it will usually quickly decrease as the workload decreases.

06:17.020 --> 06:23.950
Memory, on the other hand, is incompressible, so it will not reliably decrease as workload decreases.

06:23.950 --> 06:30.490
So if our auto scaler were based on memory, then we would end up with one that only ever goes up,

06:30.490 --> 06:33.730
but rarely ever scales our pods back down.

06:33.760 --> 06:36.850
CPU usage, on the other hand, is more volatile.

06:36.850 --> 06:43.390
It goes up and down very quickly with increasing and decreasing traffic, which allows the auto scaler

06:43.390 --> 06:47.740
to move in both directions to scale our pods up and down.

06:47.740 --> 06:55.330
So obviously in this course, we're going to be using CPU usage as the metric for auto scaling.

06:55.360 --> 06:56.830
All right that's enough theory.

06:56.830 --> 06:59.050
Let's go ahead and implement it.

06:59.080 --> 07:03.760
So inside of the grade submission portal... actually, you know what?

07:03.790 --> 07:05.640
Let me clear the output first.

07:07.440 --> 07:10.920
I'm going to copy this.

07:11.130 --> 07:12.870
Call it section nine.

07:17.550 --> 07:22.680
And one thing I forgot to mention is that in Kubernetes, there's something called a metrics

07:22.680 --> 07:33.090
server that collects CPU and memory usage metrics about all pods and nodes in your cluster. Without metrics

07:33.090 --> 07:34.350
being collected,

07:34.350 --> 07:37.920
things like the horizontal pod autoscaler wouldn't work.

07:37.920 --> 07:45.780
So we need to install the metrics server in our Kubernetes cluster before we can perform horizontal

07:45.780 --> 07:47.130
pod auto scaling.

07:47.160 --> 07:52.410
All right, let me go and look for the command that I'm going to give you right now.

07:52.860 --> 08:00.840
So if you go to your resources folder of this lecture, I've left you a file that you can download.

08:10.120 --> 08:14.320
And that file has the following kubectl command.

08:14.560 --> 08:15.340
Okay.

08:15.370 --> 08:24.160
Enter this command and it should install everything needed to set up the metrics server deployment.

08:24.370 --> 08:31.660
I'm going to clear the output, cd out of section eight, and cd into section nine.

08:31.780 --> 08:32.830
All right.

08:33.220 --> 08:40.000
Now if you say kubectl top pods dash n grade submission,

08:40.000 --> 08:49.810
this displays the memory and CPU usage metrics of all the pods in the grade submission namespace.

08:49.930 --> 08:56.680
You should get something like metrics API not available because it takes time for the metrics server

08:56.680 --> 08:57.670
to install.

08:57.700 --> 09:04.030
I already had it installed like a couple of weeks ago, so it automatically displays the metrics because

09:04.030 --> 09:06.070
they were always being collected.

09:06.080 --> 09:10.520
So I've got my grade submission API consuming five millicores of CPU and

09:10.550 --> 09:16.220
54 MB of memory, and MongoDB using up nine millicores of CPU and

09:16.250 --> 09:18.710
222 MB of memory.

09:18.890 --> 09:19.910
All right.

09:19.910 --> 09:28.220
So I want you to pause this lesson and only resume when your kubectl top pods command actually displays

09:28.250 --> 09:32.450
CPU and memory data for all of your pods.

09:32.480 --> 09:37.190
Okay, once you've got that set up, we can clear the output.

09:38.390 --> 09:44.090
And then inside of the grade submission portal I'm going to create a grade submission portal autoscaler.

09:44.210 --> 09:48.320
I'm only going to demo auto scaling for the grade submission portal.

09:48.350 --> 09:48.860
All right.

09:48.860 --> 09:55.310
Just to keep things simple, because the auto scaler is not something that we're going to carry over

09:55.310 --> 09:56.960
in the remainder of this course.

09:56.960 --> 09:59.720
We don't actually need it because we're prototyping.

09:59.750 --> 10:00.080
All right.

10:00.080 --> 10:06.830
But in an enterprise environment, an auto scaler would be very, very necessary for reasons that were

10:06.850 --> 10:07.960
mentioned earlier.

10:07.960 --> 10:15.130
So if I want to set up an auto scaler using the Kubernetes extension, I can just type HPA and auto-generate

10:15.160 --> 10:18.550
a skeleton upon which we can create an auto scaler.

10:19.330 --> 10:20.980
I'm going to call the auto scaler

10:21.010 --> 10:24.550
grade submission portal

10:25.540 --> 10:26.770
HPA.

10:28.060 --> 10:29.050
All right.

10:29.380 --> 10:39.850
And this auto scaler is going to scale all of the pods being managed by the grade submission portal

10:39.850 --> 10:40.870
deployment.

10:42.280 --> 10:52.420
The HPA is going to be scaling these pod replicas up and down based on CPU usage, based on average CPU

10:52.450 --> 10:53.560
utilization.

10:53.560 --> 11:02.500
We can actually leave this as it is. Scaling based on 50% average CPU utilization is very common.

11:02.500 --> 11:04.630
It's on the lower end, for sure.

11:04.660 --> 11:10.430
People use 50 to 60% because that kind of provides a buffer for sudden spikes in traffic.

11:10.460 --> 11:16.130
It allows time for new pods to start up before the system becomes overloaded.

11:16.130 --> 11:16.640
So,

11:16.640 --> 11:19.520
the minimum number of replicas will be one.

11:19.550 --> 11:26.570
The maximum number of replicas is ten, and we're going to scale from 1 to 10 based on the average CPU

11:26.600 --> 11:28.610
utilization of 50%.
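
NOTE
As a rough illustration of the autoscaler object described above, here is a sketch that builds the
manifest as a Python dict and prints it as JSON, which kubectl apply also accepts. It uses the
autoscaling/v2 schema, so it may differ from the skeleton the editor extension generates. The
hyphenated names (grade-submission-portal-hpa, grade-submission-portal, grade-submission) are
assumptions guessed from the spoken names, not values confirmed by the lesson.
```python
import json
# Hypothetical object names: the exact spellings are guesses from the spoken names.
hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "grade-submission-portal-hpa", "namespace": "grade-submission"},
    "spec": {
        # The workload whose pod replicas get scaled.
        "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "grade-submission-portal"},
        "minReplicas": 1,
        "maxReplicas": 10,
        # Scale on average CPU utilization, measured against each pod's CPU request.
        "metrics": [{"type": "Resource", "resource": {"name": "cpu", "target": {"type": "Utilization", "averageUtilization": 50}}}],
    },
}
print(json.dumps(hpa, indent=2))
```
Setting metadata.namespace explicitly keeps the object out of the default namespace.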

11:28.670 --> 11:35.630
Let's go ahead and say kubectl apply dash f everything inside of grade submission portal.

11:35.630 --> 11:39.590
And it should simply create a horizontal pod autoscaler.

11:39.620 --> 11:41.720
Everything else should remain unchanged.

11:42.200 --> 11:48.020
I'm going to say kubectl get HPA dash n grade submission.

11:50.720 --> 11:54.920
No resources found. I always do this.

11:55.580 --> 12:00.170
Namespace: grade submission. Kubectl...

12:00.200 --> 12:02.180
Let's delete the horizontal pod

12:02.210 --> 12:03.440
autoscaler

12:04.910 --> 12:07.880
that's inside of the default namespace.

12:09.450 --> 12:13.710
Okay, let's redeploy it in the grade submission namespace.

12:15.390 --> 12:16.470
All right.

12:16.680 --> 12:19.140
Kubectl get pods...

12:19.140 --> 12:20.700
no, not pods.

12:20.730 --> 12:24.060
Get HPA dash n grade submission.

12:26.310 --> 12:27.330
All right.

12:27.330 --> 12:39.000
So we've got one autoscaler that's going to be scaling our pods based on an average utilization of 50%.

12:39.300 --> 12:40.320
Okay.

12:41.520 --> 12:50.490
Here we see unknown out of 50% because it's still monitoring the metrics for the grade submission portal

12:50.490 --> 12:53.190
that are being collected by the metrics server.

12:53.190 --> 12:55.260
We just deployed the autoscaler.

12:55.260 --> 12:58.440
So it's going to take time for this value to show.

12:58.590 --> 13:00.660
Let me get a drink of water real quick.

13:03.600 --> 13:13.880
So hopefully it's done it by now. Let's say kubectl top pods dash n

13:13.880 --> 13:15.320
grade submission.

13:15.860 --> 13:21.080
So that displays the CPU and memory for the grade submission portal.

13:21.080 --> 13:32.900
So the grade submission portal is using up 16 millicores of CPU out of its assigned 200 millicores.

13:32.900 --> 13:35.420
So 16 divided by 200.

13:35.450 --> 13:37.100
Let me calculate that real quick.

13:37.460 --> 13:39.350
That is 8%.

13:39.350 --> 13:45.020
So we expect to see an average CPU utilization of 8% out of 50.
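
NOTE
The percentage worked out above can be checked with a few lines of Python. This is a sketch that
assumes the "assigned" 200 millicores is the container's CPU request, since the autoscaler measures
utilization against requests. The function name is hypothetical.
```python
def average_cpu_utilization(used_millicores_per_pod, requested_millicores):
    """Hypothetical helper: average CPU utilization across pods, as a percentage of the request."""
    total_requested = requested_millicores * len(used_millicores_per_pod)
    return 100 * sum(used_millicores_per_pod) / total_requested
# One pod using 16m of a 200m request, as in the demo.
print(average_cpu_utilization([16], 200))  # prints 8.0
# One pod using 5m of a 200m request, as on the earlier slide.
print(average_cpu_utilization([5], 200))  # prints 2.5
```
Passing several usage values gives the average across replicas, which is what the target is compared to.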

13:45.680 --> 13:46.220
All right.

13:46.220 --> 13:48.950
Please show me something beautiful.

13:49.550 --> 13:58.370
And so now what I'm going to do is I'm going to flood this pod with requests and have its CPU go above

13:58.370 --> 14:00.950
the 50% target.

14:00.950 --> 14:08.270
Once that happens, the horizontal pod autoscaler is going to increase the number of pods in order to

14:08.400 --> 14:14.310
distribute that load and try to bring the average CPU utilization back down.

14:14.310 --> 14:19.860
And it's ultimately just going to scale our pods up with increasing traffic and scale our pods back

14:19.860 --> 14:21.630
down with decreasing traffic.

14:21.630 --> 14:24.330
So I'll go ahead and demo this right now.

14:24.360 --> 14:26.400
I should have Postman somewhere.

14:29.460 --> 14:33.990
Oh, let me bring it up so you can see it right over here.

14:33.990 --> 14:36.660
So I've already set up my Postman runner.

14:36.660 --> 14:39.720
This is not a Postman tutorial, so just enjoy the demo.

14:39.750 --> 14:44.040
I'm just going to demo to you how horizontal pod auto scaling works.

14:44.040 --> 14:49.980
Ultimately, you should just observe how the horizontal pod autoscaler reacts to a lot of traffic.

14:49.980 --> 14:58.890
So I'm going to simulate real world traffic from my local machine, and I'm going to have 20 users sending

14:58.920 --> 15:01.860
traffic to my app at a time.

15:02.040 --> 15:03.510
I'm going to click run.

15:05.610 --> 15:12.290
So this is going to make consistent requests to localhost, port 32000.

15:21.440 --> 15:25.850
I can already see that responses are starting to become really slow.

15:25.880 --> 15:28.520
I'm assuming the CPU has already gone up a lot.

15:28.520 --> 15:35.240
So let me say kubectl get pods dash n grade submission.

15:38.060 --> 15:38.600
Wow.

15:38.600 --> 15:43.700
The horizontal pod autoscaler has already scaled up to three grade submission portal pods.

15:43.700 --> 15:48.440
We started with one pod replica, and the grade submission portal

15:48.440 --> 15:55.040
autoscaler has already created two more in order to handle all of the load, all of the burden that's

15:55.040 --> 15:56.480
being placed on it.

15:56.510 --> 15:56.870
All right.

15:56.870 --> 15:57.920
And that's pretty much it.

15:57.950 --> 16:04.820
If you have an application that's bound to receive a lot of traffic, you should consider giving it

16:04.820 --> 16:07.220
a horizontal pod autoscaler.

16:07.250 --> 16:07.820
Okay.

16:07.820 --> 16:09.770
That's all we're going to cover for this section.

16:09.770 --> 16:11.240
I will see you in the next one.