1
00:00:11,720 --> 00:00:17,060
In this lecture we are going to discuss the mean squared error in depth. The goal of this lecture is

2
00:00:17,060 --> 00:00:22,560
to ensure that you understand the mean squared error from a probabilistic perspective.

3
00:00:22,580 --> 00:00:27,320
This helps to give some theoretical backing for the mean squared error and it will help prepare us to

4
00:00:27,320 --> 00:00:31,130
discuss the cross-entropy loss as well. As a side note,

5
00:00:31,400 --> 00:00:35,460
I just want to remind you that there are multiple names that we use interchangeably.

6
00:00:35,600 --> 00:00:39,200
So if you hear error, or loss, or cost, or objective,

7
00:00:39,200 --> 00:00:42,070
these are all generally referring to the same thing.

8
00:00:42,200 --> 00:00:47,960
So we might say cross-entropy error, but we also might say cross-entropy loss; they mean exactly the same

9
00:00:47,960 --> 00:00:53,510
thing.

10
00:00:53,640 --> 00:00:58,300
First let's just make sure the mean squared error makes sense in the first place.

11
00:00:58,320 --> 00:00:59,950
Why is it squared?

12
00:00:59,970 --> 00:01:03,960
Well the idea is that we would like the error to always be positive.

13
00:01:03,960 --> 00:01:08,300
Imagine this scenario: one of your predictions has error plus 2.

14
00:01:08,460 --> 00:01:11,510
The other prediction you've made has error minus 2.

15
00:01:11,580 --> 00:01:17,050
If you add these two errors together, you would get plus 2 minus 2, which is, yes, zero.

16
00:01:17,100 --> 00:01:22,330
Of course this is not zero error because both of your predictions are off by 2.

17
00:01:22,410 --> 00:01:24,420
They should not cancel each other out.

18
00:01:24,750 --> 00:01:27,780
So squaring the error is a nice way to make it positive.

19
00:01:32,890 --> 00:01:37,930
But this still doesn't really explain why we want the square. Why not the absolute value?

20
00:01:37,930 --> 00:01:41,140
Why not some other method of making the difference positive?

21
00:01:41,140 --> 00:01:45,010
In fact it's perfectly OK to use the mean absolute error.

22
00:01:45,010 --> 00:01:47,710
We just don't normally refer to that as linear regression.

23
00:01:48,250 --> 00:01:53,530
So when we say linear regression usually we are talking about a linear model using the mean squared

24
00:01:53,530 --> 00:01:59,100
error.

25
00:01:59,140 --> 00:02:04,710
OK, so let's dig a little deeper into why we would want to square the error. To understand this discussion,

26
00:02:04,720 --> 00:02:10,600
you have to be familiar with maximum likelihood estimation, which you would normally learn in an introductory

27
00:02:10,600 --> 00:02:12,730
probability course.

28
00:02:13,030 --> 00:02:17,830
Let's say for example we would like to measure the heights of all the students in our class and we decide

29
00:02:17,830 --> 00:02:21,040
to model this as a Gaussian distribution.

30
00:02:21,040 --> 00:02:25,450
As you recall the Gaussian distribution is the typical bell curve.

31
00:02:25,450 --> 00:02:35,810
It has the probability density function that you see here which will become important momentarily.

32
00:02:35,890 --> 00:02:42,070
We know intuitively that the best way to estimate mu, let's call that mu hat, is the sample mean of

33
00:02:42,070 --> 00:02:43,870
all the heights we have measured.

34
00:02:43,870 --> 00:02:45,820
The question is, why is this so?

35
00:02:50,950 --> 00:02:53,020
Here's how we solve this problem.

36
00:02:53,020 --> 00:02:55,500
There is actually a method to find this mu.

37
00:02:55,600 --> 00:02:59,130
This method is called maximum likelihood estimation,

38
00:02:59,170 --> 00:03:00,860
as I mentioned earlier.

39
00:03:01,000 --> 00:03:03,030
So what are the steps?

40
00:03:03,040 --> 00:03:06,160
First, we create the likelihood function; let's call that L.

41
00:03:06,970 --> 00:03:10,200
Unfortunately, we use the letter L for a lot of things in deep learning.

42
00:03:10,210 --> 00:03:14,650
So please don't confuse this L with anything else we've discussed.

43
00:03:14,800 --> 00:03:20,710
The likelihood is the multiplication of the probability density function at each of the x values we've

44
00:03:20,710 --> 00:03:21,810
collected.

45
00:03:22,000 --> 00:03:24,600
We consider this to be a function of mu.

46
00:03:24,610 --> 00:03:29,510
So here mu is the variable and the x's are constants, because they are just numbers.

47
00:03:29,620 --> 00:03:37,820
They are the heights of the students that we have measured in our class.

48
00:03:37,840 --> 00:03:42,940
The next thing we would like to do is maximize L with respect to mu.

49
00:03:42,970 --> 00:03:46,840
Let's think about why we would want to do this just for intuition.

50
00:03:46,840 --> 00:03:49,560
Imagine that you measure someone's height x.

51
00:03:49,600 --> 00:03:56,010
Now imagine mu is very far away from x. In this case, the probability of x is very tiny.

52
00:03:56,110 --> 00:04:02,250
We might think that x is not from this probability distribution, because p of x yields a very small value.

53
00:04:02,950 --> 00:04:05,280
But what if mu is equal to x?

54
00:04:05,590 --> 00:04:08,140
In this case p of x yields a very large value.

55
00:04:08,650 --> 00:04:13,690
Thus, we want p of x to be big if it is to be representative of the data we've collected.

56
00:04:18,850 --> 00:04:25,000
So it's the same idea for L: we want L to be big, so that mu is close to all of the data points x that

57
00:04:25,000 --> 00:04:26,530
we have collected.

58
00:04:26,680 --> 00:04:30,070
So how do we maximize L with respect to mu?

59
00:04:30,340 --> 00:04:33,490
Luckily we can just ask our old friend calculus.

60
00:04:33,820 --> 00:04:39,550
Normally the process would be to take the derivative of L with respect to mu, then set this derivative

61
00:04:39,550 --> 00:04:41,890
to zero and solve for mu.

62
00:04:41,890 --> 00:04:43,380
But for probabilities,

63
00:04:43,420 --> 00:04:48,840
taking the log of L first usually leads to a more easily solvable derivative.

64
00:04:48,850 --> 00:04:54,180
The reason why this is allowed is because the log is a monotonically increasing function.

65
00:04:54,550 --> 00:05:02,710
Maximizing L is the same as maximizing the log of L: the value of mu that maximizes L is the same as

66
00:05:02,710 --> 00:05:06,090
the value of mu that maximizes log L.

67
00:05:06,230 --> 00:05:12,630
You can prove this to yourself intuitively by taking any two numbers a and b: if a is bigger than b,

68
00:05:12,830 --> 00:05:15,860
it is also the case that log a is bigger than log b.

69
00:05:21,080 --> 00:05:25,700
Okay so I'm not going to bore you with the calculation but as always you are welcome to do it yourself

70
00:05:25,730 --> 00:05:29,030
on paper using your trusty calculus skills.

71
00:05:29,240 --> 00:05:32,210
You should end up with the derivative we see here.

72
00:05:32,210 --> 00:05:36,990
So if you want please pause the video and make sure you can take this derivative yourself.

73
00:05:42,120 --> 00:05:46,100
OK, so what do we get if we set this to zero and solve for mu?

74
00:05:46,110 --> 00:05:51,400
Well, you can see that we get back our original answer, that mu is the sample mean of x.

75
00:05:51,420 --> 00:05:57,480
This is called the maximum likelihood estimate of mu, because in order to find it we maximize the likelihood.

76
00:06:02,580 --> 00:06:08,220
Here's what's important to notice about the log likelihood. If you notice that the log likelihood has the

77
00:06:08,220 --> 00:06:15,890
same form as this expression, where I've replaced the constant terms with C1 and C2, you'll realize that

78
00:06:16,070 --> 00:06:21,470
it actually doesn't matter what values C1 and C2 have. They don't affect the answer we get eventually,

79
00:06:21,890 --> 00:06:30,190
that mu is the sample mean of x. Although we do recognize that C2 is positive, C1 is also positive

80
00:06:30,200 --> 00:06:34,100
given our old expression, but it actually doesn't matter what it is at all.

81
00:06:34,100 --> 00:06:36,960
In other words they are just superfluous constants.

82
00:06:37,160 --> 00:06:42,710
So what happens if we set these constants to convenient values, such as 0 and 1 over N?

83
00:06:47,880 --> 00:06:50,740
Well, we get an even more interesting expression.

84
00:06:50,940 --> 00:06:56,820
Little l, the thing we want to maximize, is the negative sum of x i minus mu, squared.

85
00:07:01,940 --> 00:07:07,940
One thing you may recall from your study of calculus is that maximizing something is the same as minimizing

86
00:07:07,940 --> 00:07:09,920
the negative of that thing.

87
00:07:09,950 --> 00:07:16,130
In other words, if I want to maximize negative x squared, I should get the same answer as minimizing x

88
00:07:16,130 --> 00:07:21,280
squared. The value x equals zero is the answer in both cases.

89
00:07:21,350 --> 00:07:23,830
Hopefully you can see right away that this is obvious.

90
00:07:28,980 --> 00:07:35,100
At this point you probably get the gist of what I'm trying to say: maximizing the likelihood is the same

91
00:07:35,190 --> 00:07:40,530
as minimizing the mean squared error, where the error is the difference between x i and mu.

92
00:07:45,640 --> 00:07:49,480
So how can we relate this back to problems such as linear regression?

93
00:07:49,480 --> 00:07:52,000
Take a look at these two error functions.

94
00:07:52,060 --> 00:07:57,280
The first error function comes from our earlier problem: finding the best mu to model the heights of

95
00:07:57,280 --> 00:08:04,150
the students in our class as a Gaussian. The second error comes from linear regression:

96
00:08:04,170 --> 00:08:06,630
make the y hats close to the y's.

97
00:08:06,810 --> 00:08:10,250
How can we relate the second problem to the first problem?

98
00:08:10,260 --> 00:08:17,190
Well, if we say that the y i's are samples drawn from a Gaussian distribution whose mean is y hat i,

99
00:08:17,640 --> 00:08:24,000
then minimizing this mean squared error is the same as minimizing the negative log likelihood, and that

100
00:08:24,000 --> 00:08:29,420
of course is the same as maximizing the log likelihood, which is the same as maximizing the likelihood.

101
00:08:34,550 --> 00:08:35,110
Finally,

102
00:08:35,120 --> 00:08:41,060
one thing we can do is, instead of saying that the mean of the distribution is y hat, we can simply

103
00:08:41,060 --> 00:08:46,760
say that y minus y hat is a Gaussian centered at zero. Equivalently,

104
00:08:46,790 --> 00:08:52,460
this is saying that when we use the squared error, we are making the assumption that the error of our

105
00:08:52,460 --> 00:08:57,840
model with respect to the data is Gaussian distributed with mean zero.

106
00:08:57,860 --> 00:09:03,530
This formulation helps us to put a probabilistic backing on the error function and it will also help

107
00:09:03,530 --> 00:09:06,850
us make sense of the cross-entropy loss for classification.