1
00:00:00,630 --> 00:00:04,830
Hi, and welcome back to the lesson on L1 and L2 regularization.

2
00:00:05,310 --> 00:00:10,350
Let's take a look at what these methods do, and we'll explore and contrast them a bit.

3
00:00:10,860 --> 00:00:15,270
So firstly, they're a type of weight-constraining regularization method.

4
00:00:15,720 --> 00:00:17,880
They both work by forcing all parameters,

5
00:00:17,880 --> 00:00:24,660
that is, the weights and biases of our model, in the feature maps, max pooling, and fully connected layers,

6
00:00:24,840 --> 00:00:26,370
all those weights and biases,

7
00:00:26,670 --> 00:00:29,250
to take smaller values.

8
00:00:29,730 --> 00:00:35,850
And the reason this reduces overfitting is that by reducing the weights, our network reduces the effect

9
00:00:35,850 --> 00:00:37,980
of those weights on the activation functions.

10
00:00:38,400 --> 00:00:43,740
So you don't have certain filters overpowering other filters when they get activated.

11
00:00:44,940 --> 00:00:53,260
So L1, or the L1 norm, the penalty used in lasso regression, is basically added to the loss function.

12
00:00:53,290 --> 00:00:53,580
Yeah.

13
00:00:54,150 --> 00:00:57,400
So you can see there are a few parameters that go into this formula here.

14
00:00:57,420 --> 00:00:58,590
Let's take a look at what they are.

15
00:00:59,160 --> 00:01:05,720
So beta j, where we take the modulus, so the absolute value of beta j, is all the weights, and lambda

16
00:01:05,730 --> 00:01:10,080
is a constant that controls the effect of the penalty applied, usually less than one.

17
00:01:10,740 --> 00:01:13,910
So a large lambda means that we're making the penalty term quite large.

18
00:01:13,920 --> 00:01:20,010
You can see this because it scales proportionally to lambda, and L1 uses the absolute value as the penalty.

19
00:01:20,490 --> 00:01:27,290
So what it does here, basically: by putting the weights into this formula here

20
00:01:27,300 --> 00:01:32,880
in the loss function, we're effectively adding that to our loss function and using that as a metric to

21
00:01:33,180 --> 00:01:34,650
update our weights in the network.
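
As a rough sketch of the idea just described, the L1 penalty can be added to a base loss like this. The function name, the example numbers, and the choice of mean squared error as the base loss are assumptions for illustration, not taken from the lesson's slides.

```python
import numpy as np

def l1_regularized_loss(y_true, y_pred, weights, lam):
    """Base loss plus lam * sum(|w|).

    Mean squared error is an assumed base loss; any loss could be used.
    """
    base_loss = np.mean((y_true - y_pred) ** 2)
    penalty = lam * np.sum(np.abs(weights))  # L1: absolute value of each weight
    return base_loss + penalty

# Made-up example values
weights = np.array([0.5, -2.0, 0.0, 1.5])
y_true = np.array([1.0, 0.0])
y_pred = np.array([0.5, 0.0])

loss = l1_regularized_loss(y_true, y_pred, weights, lam=0.1)
```

Here the penalty contributes 0.1 times the sum of absolute weights, so larger weights or a larger lambda both raise the total that the optimizer tries to reduce.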

22
00:01:35,160 --> 00:01:40,380
Now let's take a look at L2 regularization, and you can immediately see it's no longer the absolute

23
00:01:40,380 --> 00:01:40,860
value.

24
00:01:40,860 --> 00:01:45,930
Unlike L1, it's now the squared value, and beta j, again,

25
00:01:45,960 --> 00:01:50,340
is all the weights and biases. Lambda controls the effect of the penalty again.

26
00:01:50,820 --> 00:01:55,500
And similarly, lambda controls how big we want our penalty to be.

27
00:01:56,160 --> 00:02:01,170
And importantly, here, L2 uses the squared magnitude of beta as the penalty.

28
00:02:01,860 --> 00:02:07,230
This immediately tells you that, because it's using a squared value, it's going to put a big

29
00:02:07,230 --> 00:02:11,130
emphasis on bigger weights, making them smaller in the end.
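
To make the point about squaring concrete, here is a small sketch, assuming a made-up weight vector with one large entry. The helper name `l2_penalty` is hypothetical.

```python
import numpy as np

def l2_penalty(weights, lam):
    """L2 penalty: lam * sum(w^2)."""
    return lam * np.sum(weights ** 2)

# Made-up weights: two small values and one large one
weights = np.array([0.1, 0.1, 3.0])

per_weight = weights ** 2                      # squared magnitude of each weight
share_of_largest = per_weight[-1] / per_weight.sum()
total = l2_penalty(weights, lam=0.5)
```

In this example the single large weight accounts for nearly all of the penalty, which is why L2 puts most of its pressure on the biggest weights.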

30
00:02:11,760 --> 00:02:14,400
So, the differences between them, with both formulas here

31
00:02:14,820 --> 00:02:21,930
so you can contrast them: L1 shrinks the weights by a constant amount towards zero, shrinking

32
00:02:21,930 --> 00:02:24,300
less important weights all the way to zero.

33
00:02:24,660 --> 00:02:30,060
This sort of acts like a feature selection algorithm, which is quite good, quite desirable, yielding

34
00:02:30,060 --> 00:02:36,750
sparse models. L2, in contrast, shrinks each weight by an amount proportional to the weight itself, and L2,

35
00:02:36,750 --> 00:02:42,450
as I said, because of the square, penalises large weights more than smaller weights, and smaller

36
00:02:42,450 --> 00:02:42,720
weights

37
00:02:42,720 --> 00:02:49,830
less so. L1 tends to concentrate the weight of the network in a relatively small number of highly

38
00:02:49,830 --> 00:02:50,970
important connections.

39
00:02:51,420 --> 00:02:54,750
So you can see that they both work in different ways.
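
One way to see the "constant amount" versus "proportional amount" contrast is to look at the gradient each penalty adds to a weight update. This is a minimal sketch with hypothetical helper names and made-up weights, not the lesson's own code.

```python
import numpy as np

def l1_penalty_grad(weights, lam):
    # d/dw of lam * |w| is lam * sign(w): the same size for every nonzero weight
    return lam * np.sign(weights)

def l2_penalty_grad(weights, lam):
    # d/dw of lam * w^2 is 2 * lam * w: grows with the weight itself
    return 2.0 * lam * weights

# Made-up weights: tiny, medium, large
weights = np.array([0.05, -1.0, 4.0])
lam = 0.1

g1 = l1_penalty_grad(weights, lam)   # equal-magnitude pull on all three weights
g2 = l2_penalty_grad(weights, lam)   # pull scales with each weight's size
```

The L1 pull is just as strong on the tiny weight as on the large one, which is what can drive small weights to zero, while the L2 pull fades as a weight gets smaller.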

40
00:02:55,410 --> 00:03:02,250
So let's summarize L1 and L2. L1 and L2 regularization both make our network prefer

41
00:03:02,250 --> 00:03:06,180
to learn smaller weights, which tends to push the weights towards smaller values.

42
00:03:06,750 --> 00:03:12,540
Large weights are actually allowed sometimes, but only if they considerably improve the first part of our

43
00:03:12,540 --> 00:03:14,160
cost function, or loss function.

44
00:03:14,910 --> 00:03:21,270
This can be interpreted as a way of compromising between finding small weights and minimizing the original

45
00:03:21,270 --> 00:03:22,020
loss function.

46
00:03:22,050 --> 00:03:26,130
So it's basically an added term in our loss function that we have to consider now.

47
00:03:26,850 --> 00:03:29,400
And the degree of compromise depends on lambda.

48
00:03:30,000 --> 00:03:33,900
A small lambda means that we prefer to minimize the original loss function.

49
00:03:34,410 --> 00:03:37,590
A larger lambda means we're putting more emphasis on the penalty.

50
00:03:37,590 --> 00:03:38,660
Let's go back to the formula.

51
00:03:39,300 --> 00:03:42,060
A large lambda means that we're putting more emphasis on this penalty term,

52
00:03:42,060 --> 00:03:45,590
whereas a small lambda means you're putting more emphasis on the loss function itself.
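
The trade-off controlled by lambda can be sketched on a toy problem with a single weight. The quadratic base loss, the target value, and the function name are all assumptions chosen so the minimizer has a simple closed form.

```python
def ridge_1d_minimizer(target, lam):
    """Weight w that minimizes (w - target)**2 + lam * w**2.

    Setting the derivative 2*(w - target) + 2*lam*w to zero
    gives w = target / (1 + lam).
    """
    return target / (1.0 + lam)

target = 3.0  # made-up value the unregularized loss would choose

w_no_penalty = ridge_1d_minimizer(target, lam=0.0)   # original loss only
w_small_lam = ridge_1d_minimizer(target, lam=0.1)    # stays close to the target
w_large_lam = ridge_1d_minimizer(target, lam=9.0)    # pulled strongly toward zero
```

With lambda at zero the weight sits at the value the original loss prefers, and as lambda grows the best weight moves steadily toward zero.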

53
00:03:47,580 --> 00:03:48,900
That's it for this chapter.

54
00:03:49,470 --> 00:03:54,780
Next, we'll take a look at the dropout method, which is another very useful regularisation method,

55
00:03:55,290 --> 00:03:56,910
so I'll see you in the next section.

56
00:03:57,090 --> 00:03:57,600
Thank you.
