1
00:00:11,680 --> 00:00:16,840
In this lecture, we are going to look at a slightly more advanced topic in PyTorch, which is creating

2
00:00:16,870 --> 00:00:19,000
custom loss functions.

3
00:00:19,030 --> 00:00:24,730
In addition, we're going to apply this new knowledge to build a new kind of neural network, one that not

4
00:00:24,730 --> 00:00:32,270
only estimates a function but also estimates the uncertainty in the function prediction. Previously,

5
00:00:32,370 --> 00:00:37,860
we spent the entire course only looking at neural networks that act as function approximators. For each

6
00:00:37,890 --> 00:00:38,360
input,

7
00:00:38,400 --> 00:00:43,590
we get some output prediction. But you may have noticed something curious.

8
00:00:43,590 --> 00:00:48,750
There is, in fact, a discrepancy between regression and classification. With regression,

9
00:00:48,750 --> 00:00:50,390
we only get one answer.

10
00:00:50,490 --> 00:00:54,740
The model's output is our best guess for the target given that input.

11
00:00:55,020 --> 00:00:58,810
But with classification we get a probability distribution.

12
00:00:58,980 --> 00:01:04,120
This probability distribution gives us some sense of the confidence in that prediction.

13
00:01:04,140 --> 00:01:09,720
For example, if we're doing binary classification and the output is 0.99, then we might

14
00:01:09,720 --> 00:01:13,560
conclude that our model is very confident in that prediction.

15
00:01:13,560 --> 00:01:18,030
If the model outputs 0.55, then it would still make the same prediction.

16
00:01:18,330 --> 00:01:22,930
But intuitively we know that we are less confident in that prediction.

17
00:01:23,000 --> 00:01:26,480
You can apply the same reasoning to softmax outputs.

18
00:01:26,480 --> 00:01:31,730
So given this knowledge, you might wonder: why don't we output probability distributions when we're doing

19
00:01:31,730 --> 00:01:32,390
regression?

20
00:01:37,470 --> 00:01:42,900
So if you've studied regression and the mean squared error in more depth, then you know that doing regression

21
00:01:42,900 --> 00:01:50,090
with the squared error is nothing more than maximum likelihood estimation. With maximum likelihood estimation,

22
00:01:50,100 --> 00:01:51,490
we start with the likelihood,

23
00:01:51,630 --> 00:01:56,970
take the log of the likelihood, and then negate it to make it a loss to minimize rather than

24
00:01:56,970 --> 00:02:01,450
a utility to maximize. In this form,

25
00:02:01,460 --> 00:02:07,920
we say that t, which is the target, is the random variable, which we assume comes from a Gaussian distribution.

26
00:02:07,940 --> 00:02:14,090
y is the model prediction, which is the mean of the distribution, and x is the input to the model. We

27
00:02:14,090 --> 00:02:14,680
call theta

28
00:02:14,690 --> 00:02:16,390
the collection of model parameters.

29
00:02:16,760 --> 00:02:21,110
So this can represent a neural network, linear regression, or any other regression model.

30
00:02:21,110 --> 00:02:21,680
It doesn't matter.

31
00:02:26,810 --> 00:02:33,160
As you've seen, the usual first step is to take the log of this function. At this point, your gut instinct

32
00:02:33,160 --> 00:02:34,260
should be this:

33
00:02:34,570 --> 00:02:39,880
we know that we want to end up with the (t - y) squared term, and all the other terms should drop off.

34
00:02:39,880 --> 00:02:44,740
That's what gives us the sum of squared errors or the mean squared error. And we know that we can do this

35
00:02:44,770 --> 00:02:50,910
because the log of 2 pi sigma squared is constant, and constants don't affect the gradient of the loss.

36
00:02:50,920 --> 00:02:56,920
So whatever model parameters maximize this log likelihood would also maximize the log likelihood minus

37
00:02:56,920 --> 00:02:58,360
or plus any constants.
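
The formula being referred to is on a slide that this transcript does not capture. As a sketch, assuming the standard univariate Gaussian with a constant variance sigma squared, and writing y(x; theta) for the model prediction (this notation is an assumption), the likelihood and its negative log would be:

```latex
p(t \mid x; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\left( -\frac{\bigl(t - y(x;\theta)\bigr)^2}{2\sigma^2} \right)

-\log p(t \mid x; \theta) = \frac{1}{2}\log\left(2\pi\sigma^2\right)
    + \frac{\bigl(t - y(x;\theta)\bigr)^2}{2\sigma^2}
```

With sigma constant, the first term does not depend on theta, and 1 / (2 sigma squared) is only a positive scale factor, so minimizing this expression picks the same parameters as minimizing the squared error.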

38
00:03:03,360 --> 00:03:06,850
But here's the major difference in this lecture: in this lecture,

39
00:03:06,850 --> 00:03:09,670
we are not going to make sigma a constant.

40
00:03:09,670 --> 00:03:14,350
Instead, we can ask: what if sigma is also a function of x?

41
00:03:14,530 --> 00:03:19,060
So now this term can't be dropped from the objective function because it is not constant.

42
00:03:19,840 --> 00:03:24,910
What this means is that in this scenario, this whole thing will be our loss function instead of just

43
00:03:24,910 --> 00:03:25,660
the squared error.

44
00:03:26,290 --> 00:03:31,570
And as usual, we can take the mean instead of the sum, which is equivalent to having a Monte Carlo estimate

45
00:03:31,840 --> 00:03:38,150
of the expected value.
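
As a sketch of what "this whole thing" looks like once sigma depends on x, assuming the standard Gaussian negative log likelihood and N training samples indexed by i (the indexing notation is an assumption, not taken from the slide), the mean loss would be:

```latex
L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[
    \frac{1}{2}\log\left(2\pi\,\sigma^2(x_i)\right)
    + \frac{\bigl(t_i - y(x_i)\bigr)^2}{2\,\sigma^2(x_i)}
\right]
```

Here both y(x_i) and sigma squared of x_i are produced by the model, so neither term can be dropped. The first term penalizes large predicted variance, and the second penalizes squared error more heavily wherever the predicted variance is small.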

46
00:03:38,150 --> 00:03:43,550
The next question you should have is: OK, now we're trying to predict two things, the target and its corresponding

47
00:03:43,550 --> 00:03:44,570
variance.

48
00:03:44,600 --> 00:03:46,070
How can we do something like that?

49
00:03:47,270 --> 00:03:50,830
Well, why not just have a neural network with two output nodes?

50
00:03:50,840 --> 00:03:55,370
Let's remember that it doesn't matter what the meaning of these two output nodes is.

51
00:03:55,400 --> 00:04:00,220
We've seen that a neural network can end with a linear layer with any number of output nodes.

52
00:04:00,290 --> 00:04:06,020
That's the case whether we're doing regression, binary classification, or multiclass classification.

53
00:04:06,020 --> 00:04:09,260
So then what's the difference between these three cases?

54
00:04:09,320 --> 00:04:15,080
The difference is that the meaning is assigned by us, the designers of the program, and we assign that

55
00:04:15,080 --> 00:04:17,180
meaning through the loss function.

56
00:04:17,180 --> 00:04:21,980
In other words, it's up to us to come up with the appropriate loss function such that the meaning of

57
00:04:21,980 --> 00:04:25,160
these two outputs is the Gaussian mean and variance.

58
00:04:30,200 --> 00:04:32,840
But you should be suspicious at this point.

59
00:04:32,840 --> 00:04:38,770
A neural network layer can output any value, but don't we have a constraint on the value of a variance?

60
00:04:38,960 --> 00:04:45,260
In fact, we require that a variance is positive. In order to force the variance to be positive,

61
00:04:45,320 --> 00:04:49,940
we can make the assumption that the neural network doesn't output the variance directly but rather the

62
00:04:49,940 --> 00:04:56,510
log of the variance. If you think hard, this is just like the classification case.

63
00:04:56,770 --> 00:04:59,530
The neural network doesn't output probabilities directly.

64
00:04:59,530 --> 00:05:04,390
Instead, it outputs logits, which are related to log probabilities.

65
00:05:04,390 --> 00:05:09,520
In fact, that case has more constraints, because not only do probability distributions have to have positive

66
00:05:09,520 --> 00:05:11,590
values, they also have to sum to one.

67
00:05:16,690 --> 00:05:21,580
Using this information, we can determine what the loss function should look like with respect to the

68
00:05:21,580 --> 00:05:23,770
neural network's outputs.

69
00:05:23,950 --> 00:05:29,380
First, let's assume that instead of having a neural network with two output nodes, we simply have two

70
00:05:29,380 --> 00:05:34,660
neural networks in our model, and then the forward function returns the outputs from these individual

71
00:05:34,660 --> 00:05:41,110
neural networks. Then the output of our model from the forward function will be a tuple where the

72
00:05:41,110 --> 00:05:51,110
first element is the mean predictions and the second element is the log variance predictions.
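
The lecture's own model code is not in this transcript, so here is only a minimal sketch of the idea, assuming two small fully connected subnetworks. The class name `MeanVarianceModel`, the attribute names, the layer sizes, and the tanh activation are all hypothetical choices for illustration.

```python
import torch
import torch.nn as nn


class MeanVarianceModel(nn.Module):
    """Hypothetical model holding two subnetworks: one for the mean,
    one for the log variance. Sizes and activations are assumptions."""

    def __init__(self, in_features=1, hidden=32):
        super().__init__()
        # Subnetwork that predicts the Gaussian mean.
        self.mean_net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )
        # Subnetwork that predicts the log of the Gaussian variance.
        self.log_var_net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        # Return a tuple: (mean predictions, log variance predictions).
        return self.mean_net(x), self.log_var_net(x)


model = MeanVarianceModel()
x = torch.randn(8, 1)        # batch of 8 one-dimensional inputs
mean, log_var = model(x)     # unpack the tuple from forward
variance = torch.exp(log_var)  # always positive, whatever log_var is
```

Because the second output is treated as a log variance, the final linear layer can emit any real number and the exponential still yields a valid positive variance.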

73
00:05:51,120 --> 00:05:57,220
Now we get back to our original question, which was: how do we create a custom loss function in PyTorch?

74
00:05:57,360 --> 00:06:02,250
As always, it's easiest to start with what we're familiar with: the mean squared error.

75
00:06:02,250 --> 00:06:07,170
If we had a regular regression neural network that only outputs the mean prediction, then here's

76
00:06:07,170 --> 00:06:08,750
how you would do it.

77
00:06:08,760 --> 00:06:10,760
The trick is there is no trick.

78
00:06:10,950 --> 00:06:16,180
The loss function is merely a regular Python function that takes in any arguments you want.

79
00:06:16,190 --> 00:06:22,440
Obviously, some of them have to be PyTorch variables. For us, this means the predictions and the targets.

80
00:06:24,800 --> 00:06:29,900
Then, inside this function, we find the element-wise difference between the predictions and the targets,

81
00:06:30,200 --> 00:06:32,140
square them and then take the mean.

82
00:06:32,300 --> 00:06:39,060
Pretty straightforward. So a custom loss function is just a regular function that does mathematical operations

83
00:06:39,270 --> 00:06:45,400
using PyTorch.
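
The code shown on screen is not part of this transcript, so the following is only a sketch of the steps just described. The function name `mse_loss` and the example tensors are assumptions made for illustration.

```python
import torch


def mse_loss(predictions, targets):
    """Custom mean squared error: an ordinary Python function
    built from PyTorch tensor operations."""
    diff = predictions - targets   # element-wise difference
    return (diff ** 2).mean()      # square, then take the mean


# Hypothetical data, just to show the function in use.
predictions = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
targets = torch.tensor([1.5, 2.0, 2.0])

loss = mse_loss(predictions, targets)
loss.backward()  # autograd differentiates through the custom function
```

Since every operation inside the function is a PyTorch tensor operation, autograd can compute gradients through it without any extra work, which is why no special loss class is required.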

84
00:06:45,440 --> 00:06:46,610
So how do we do this

85
00:06:46,610 --> 00:06:52,820
when it comes to the full negative log likelihood? First, we still take in the same arguments: the predictions

86
00:06:52,820 --> 00:06:54,500
and the targets.

87
00:06:54,500 --> 00:06:57,840
This time, the predictions are a tuple, so we can split them up.

88
00:06:58,010 --> 00:07:03,750
The first element becomes the mean, which I'll call mu, and the second element becomes the log variance.

89
00:07:03,950 --> 00:07:10,070
Since the exp function is the inverse of the log function, I can take the exp of the log to get the

90
00:07:10,070 --> 00:07:18,340
variance itself, which I'll call v. Then we simply need to plug and chug into our formula for the negative

91
00:07:18,340 --> 00:07:19,870
log likelihood.

92
00:07:19,870 --> 00:07:24,730
This consists of the coefficient term and the exponent term. Since we're in log space,

93
00:07:24,730 --> 00:07:30,040
we can just add these two together, and this gives us the element-wise negative log likelihood, after which

94
00:07:30,040 --> 00:07:31,000
we can take the mean.
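
Again, the on-screen code is not in this transcript, so this is only a sketch of the steps just described, assuming the standard Gaussian negative log likelihood. The function name `gaussian_nll` and the example tensors are hypothetical.

```python
import math

import torch


def gaussian_nll(outputs, targets):
    """Mean Gaussian negative log likelihood.

    `outputs` is assumed to be a tuple (mean, log variance), matching
    what the model's forward function returns in this lecture.
    """
    mu, log_var = outputs              # split the tuple
    v = torch.exp(log_var)             # variance, always positive
    # Coefficient term: 0.5 * log(2 * pi * variance).
    coefficient = 0.5 * torch.log(2 * math.pi * v)
    # Exponent term: squared error scaled by the variance.
    exponent = 0.5 * (targets - mu) ** 2 / v
    # In log space the two terms are added, then averaged.
    return (coefficient + exponent).mean()


# Hypothetical check: zero error and unit variance (log variance of 0)
# leaves only the coefficient term, 0.5 * log(2 * pi).
mu = torch.zeros(4)
log_var = torch.zeros(4)
targets = torch.zeros(4)
loss = gaussian_nll((mu, log_var), targets)
```

PyTorch also ships a built-in `torch.nn.GaussianNLLLoss`, but it takes the variance rather than the log variance, so writing the function by hand keeps it aligned with the log variance output used here.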

95
00:07:36,120 --> 00:07:40,350
So at this point, you may be wondering: how is a model like this useful?

96
00:07:40,350 --> 00:07:46,530
And naturally, you can apply this technique to any problem that contains uncertainty in the target function.

97
00:07:47,930 --> 00:07:50,300
If we recall the bias-variance tradeoff,

98
00:07:50,300 --> 00:07:56,480
this is the statement that the error of your model is the bias squared plus the variance plus an irreducible

99
00:07:56,510 --> 00:08:02,960
error. The irreducible error is called that because it's a characteristic of the data itself, and your

100
00:08:02,960 --> 00:08:06,440
model can't make a perfect prediction because of it.

101
00:08:06,500 --> 00:08:08,240
This arises naturally.

102
00:08:08,390 --> 00:08:13,640
For example, suppose you're measuring the relationship between how long students study for a test and

103
00:08:13,640 --> 00:08:20,650
the grade. Let's say you have one student who studied 10 hours and got a 95 percent.

104
00:08:20,670 --> 00:08:24,830
And you have another student who studied 10 hours and got an 85 percent.

105
00:08:24,960 --> 00:08:29,010
In this case the input is the same but the target is different.

106
00:08:29,010 --> 00:08:34,110
And since your model is only a function of the inputs, it can't make a perfect prediction

107
00:08:34,110 --> 00:08:40,500
if there are two different possible target values for that single input. As you can imagine, this kind

108
00:08:40,500 --> 00:08:42,660
of noise can appear everywhere.

109
00:08:42,660 --> 00:08:48,390
So instead of ending with a generic statement like "my model has a low R-squared", you can quantify the

110
00:08:48,390 --> 00:08:49,980
uncertainty of your model.

111
00:08:55,150 --> 00:08:56,740
When the variability of the target

112
00:08:56,740 --> 00:08:59,470
with respect to the input is not constant,

113
00:08:59,470 --> 00:09:05,410
we say that it's heteroscedastic. We can think of other situations where heteroscedasticity

114
00:09:05,410 --> 00:09:06,720
can arise.

115
00:09:06,730 --> 00:09:11,110
One popular application is with stock returns and stock prices.

116
00:09:11,110 --> 00:09:15,890
These are well known to be heteroscedastic. In finance,

117
00:09:15,880 --> 00:09:20,360
we often want to avoid high variance because we think of variance as risk.

118
00:09:20,360 --> 00:09:24,150
And so, in order to minimize risk, we should minimize variance.

119
00:09:24,380 --> 00:09:29,810
And in order to minimize variance, it would be useful to model not just the prices or the returns but

120
00:09:29,810 --> 00:09:30,980
the variance itself.