1
00:00:11,690 --> 00:00:17,830
In this lecture we are going to look at the code to demonstrate uncertainty estimation using PyTorch.

2
00:00:18,380 --> 00:00:21,430
This lecture is going to walk you through a prepared Colab notebook.

3
00:00:21,770 --> 00:00:27,620
Although a very good exercise, which I always recommend, is, once you know how this is done, to try and

4
00:00:27,620 --> 00:00:33,830
recreate it yourself with as few references as possible. As usual, you can look at the title of the notebook

5
00:00:34,100 --> 00:00:36,440
to determine what notebook we are currently looking at.

6
00:00:40,970 --> 00:00:46,750
So let's begin by writing a function that generates data from a known synthetic distribution.

7
00:00:46,760 --> 00:00:53,300
This might seem weird, but in fact this is exactly what we do when we run an experiment or do data collection.

8
00:00:53,300 --> 00:00:58,670
It's like we have some function called generate batch, and that function is a black box which involves

9
00:00:58,670 --> 00:01:04,370
whatever data generating process happens to result in the data you're measuring.

10
00:01:04,370 --> 00:01:09,890
Of course the difference is that you don't ever know the true distribution of your data nor the underlying

11
00:01:09,890 --> 00:01:11,660
data generating process.

12
00:01:11,660 --> 00:01:13,760
And that's why we need machine learning.

13
00:01:13,760 --> 00:01:19,040
But for this example we do know the true distribution and the data generating process and hopefully

14
00:01:19,340 --> 00:01:23,940
that will give you some insight on what machine learning is trying to do.

15
00:01:23,960 --> 00:01:26,640
So what does this function look like?

16
00:01:26,690 --> 00:01:32,780
First we generate a bunch of random points for the input x from the uniform distribution between minus

17
00:01:32,780 --> 00:01:34,850
five and plus five.

18
00:01:34,850 --> 00:01:39,610
Next we write an expression for the standard deviation as a function of x.

19
00:01:39,890 --> 00:01:42,840
As you can see it's a linear function of x.

20
00:01:43,220 --> 00:01:50,590
So when x is small, the standard deviation is small, and when x is large, the standard deviation is large.

21
00:01:50,590 --> 00:01:57,280
Next we get the target. We assign the target to be the cosine of x minus some linear function of x

22
00:01:57,460 --> 00:02:02,540
plus Gaussian noise centered at zero with a standard deviation of sd.

23
00:02:02,710 --> 00:02:04,150
Then we return x and y.
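As a rough sketch of what such a function might look like: the name generate_batch comes from the lecture, but the batch size, the rng argument, and the specific coefficients (0.05, 0.1, 0.3) are assumptions chosen for illustration, not the values used in the notebook.

```python
import numpy as np

def generate_batch(batch_size=32, rng=None):
    """Sample (x, y) pairs from a known synthetic distribution.

    The coefficients below are illustrative assumptions only.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Inputs are uniform between -5 and +5.
    x = rng.uniform(-5.0, 5.0, size=batch_size)

    # Noise standard deviation is a linear function of x
    # (kept positive over the whole input range).
    sd = 0.05 + 0.1 * (x + 5.0)

    # Target: cosine minus a linear trend, plus zero-mean Gaussian noise
    # whose standard deviation is sd.
    y = np.cos(x) - 0.3 * x + rng.normal(size=batch_size) * sd
    return x, y

x, y = generate_batch(1000, np.random.default_rng(0))
```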

24
00:02:09,330 --> 00:02:13,580
If we plot the data, you should be able to recognize each of these elements.

25
00:02:13,890 --> 00:02:17,460
First we can see the periodic component due to the cosine.

26
00:02:17,460 --> 00:02:20,270
That's why this thing appears like a wave.

27
00:02:20,280 --> 00:02:22,500
Second, we can see a downward trend,

28
00:02:22,500 --> 00:02:24,040
thanks to the linear component.

29
00:02:24,960 --> 00:02:27,480
Finally, we see heteroskedasticity.

30
00:02:27,600 --> 00:02:31,260
The noise seems to increase as X increases.

31
00:02:31,260 --> 00:02:36,150
In fact we know that the noise standard deviation increases linearly with X.

32
00:02:36,420 --> 00:02:42,030
Importantly we can see here that given the same X it's possible to get many different values for the

33
00:02:42,030 --> 00:02:42,540
target.

34
00:02:47,710 --> 00:02:49,540
Next we create our model.

35
00:02:49,570 --> 00:02:53,930
This is a pretty basic custom model which contains two simple ANNs.

36
00:02:54,130 --> 00:02:59,020
The first ANN will be used to predict the mean of the output and the second ANN will be used to

37
00:02:59,020 --> 00:03:00,490
predict the log variance.

38
00:03:06,580 --> 00:03:07,780
In the forward function,

39
00:03:07,780 --> 00:03:11,400
you can see how this is kind of the opposite of a recommender system.

40
00:03:11,530 --> 00:03:17,710
A recommender system takes in two inputs, a user and an item, and returns one output, a predicted rating.

41
00:03:18,460 --> 00:03:19,500
On the other hand,

42
00:03:19,510 --> 00:03:25,480
this takes in one input and produces two outputs.
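A minimal sketch of a model with this shape, assuming two small fully connected networks. The class name MeanVarianceModel, the attribute names, the hidden size, and the Tanh activation are all hypothetical choices, not necessarily what the notebook uses.

```python
import torch
import torch.nn as nn

class MeanVarianceModel(nn.Module):
    """One input, two outputs: predicted mean and predicted log variance."""

    def __init__(self, hidden=64):
        super().__init__()
        # First ANN: predicts the mean of the output.
        self.mean_net = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )
        # Second ANN: predicts the log variance of the output.
        self.log_var_net = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, x):
        # Same input goes through both networks; return a tuple.
        return self.mean_net(x), self.log_var_net(x)

model = MeanVarianceModel()
mu, log_var = model(torch.zeros(8, 1))  # batch of 8 one-dimensional inputs
```

Predicting the log variance rather than the variance itself means the network output can be any real number, while the variance obtained by exponentiating it is always positive.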

43
00:03:25,480 --> 00:03:29,740
Next we instantiate the model and then we create our custom loss.

44
00:03:30,880 --> 00:03:36,250
So in this function we'll stick with the convention of calling the first argument outputs and the second

45
00:03:36,250 --> 00:03:39,000
argument targets. Inside the function,

46
00:03:39,010 --> 00:03:42,670
we know that outputs is a tuple with two components.

47
00:03:42,670 --> 00:03:46,530
So we take the first component and assign that to a variable called mu.

48
00:03:46,660 --> 00:03:53,170
We take the second component, exponentiate it, and assign that to a variable called v.

49
00:03:53,200 --> 00:03:57,040
Next we calculate the coefficient term of the Gaussian distribution.

50
00:03:57,190 --> 00:04:00,970
That's the log of the square root of two pi times v.

51
00:04:00,970 --> 00:04:04,720
Next we calculate the exponent term of the Gaussian distribution.

52
00:04:04,780 --> 00:04:11,820
That's zero point five divided by v times the squared difference between the targets and mu.

53
00:04:12,030 --> 00:04:21,670
Lastly we add the coefficient term and the exponent term together and take the mean.
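The steps just described can be sketched as follows. The function name gaussian_nll and the local names c and e are hypothetical; the argument order (outputs, targets) and the names mu and v follow the lecture.

```python
import math
import torch

def gaussian_nll(outputs, targets):
    """Gaussian negative log likelihood, averaged over the batch."""
    # outputs is a tuple: (predicted mean, predicted log variance).
    mu = outputs[0]
    v = torch.exp(outputs[1])  # exponentiate the log variance

    # Coefficient term: log of the square root of 2 * pi * v.
    c = torch.log(torch.sqrt(2 * math.pi * v))

    # Exponent term: 0.5 / v times the squared difference.
    e = 0.5 / v * (targets - mu) ** 2

    return torch.mean(c + e)

# With mu = 0, log variance = 0 and target = 0, only the coefficient
# term remains: log(sqrt(2 * pi)).
loss = gaussian_nll((torch.zeros(4, 1), torch.zeros(4, 1)), torch.zeros(4, 1))
```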

54
00:04:21,720 --> 00:04:24,360
Next we have our optimizer and our training function.

55
00:04:26,000 --> 00:04:29,650
So this is nothing really new.

56
00:04:38,130 --> 00:04:38,440
All right.

57
00:04:38,470 --> 00:04:41,000
So here's our loss per iteration.

58
00:04:41,080 --> 00:04:46,060
This should seem a little strange, since what we're used to when we look at a loss per iteration is

59
00:04:46,060 --> 00:04:49,450
that the loss usually has a steady decrease downward.

60
00:04:49,450 --> 00:04:51,370
This on the other hand seems to be very noisy.

61
00:04:52,150 --> 00:04:54,880
So what is the cause of all this noise?

62
00:04:54,880 --> 00:05:00,900
Well, remember that our data itself has noise and the loss function is the negative log likelihood.

63
00:05:01,090 --> 00:05:05,060
Remember that the loss doesn't contain just the squared error anymore.

64
00:05:05,230 --> 00:05:10,060
So we shouldn't expect the loss to go down to zero since that would mean the variance is zero which

65
00:05:10,060 --> 00:05:17,900
we know isn't true.

66
00:05:17,920 --> 00:05:21,370
Next we plot the model predictions. In order to do this,

67
00:05:21,370 --> 00:05:27,210
we're going to generate a large batch of 1024 points. After doing a scatter plot,

68
00:05:27,220 --> 00:05:32,960
we convert these points into torch tensors, reshape them, and pass them through our model.

69
00:05:33,100 --> 00:05:37,290
The model returns two things: our y hats and the log variance.

70
00:05:37,600 --> 00:05:42,700
Since our model outputs the log of the variance, we can exponentiate it to get the variance, but that's

71
00:05:42,700 --> 00:05:43,960
not actually what we want.

72
00:05:44,860 --> 00:05:49,810
If you recall, the variance doesn't have the same units as the original random variable, so it doesn't

73
00:05:49,810 --> 00:05:51,270
make sense to plot them together.

74
00:05:52,370 --> 00:05:58,960
Rather, we would like to plot the standard deviation, which is the square root of the variance. Equivalently,

75
00:05:58,970 --> 00:06:04,160
we can just divide by 2 before taking the exponent, since the square root is the same thing as a power

76
00:06:04,160 --> 00:06:09,440
of one half.
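Written out, with v for the variance as in the loss function and s as a label (introduced here only for illustration) for the log variance the model outputs:

```latex
s = \log v
\quad\Longrightarrow\quad
\sigma = \sqrt{v} = v^{1/2} = \left(e^{s}\right)^{1/2} = e^{s/2}
```

So exponentiating half of the log variance gives the standard deviation directly, without computing the variance first.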

77
00:06:09,470 --> 00:06:14,930
Next we plot our prediction along with the corresponding uncertainty on top of our scatter plot of the

78
00:06:14,930 --> 00:06:20,450
data. In order for the plot to look like it should, we have to sort the data points first.

79
00:06:20,540 --> 00:06:25,940
This is because Matplotlib joins the points you pass in by a line in whatever order you pass them

80
00:06:25,940 --> 00:06:30,930
in. We can get the sorted indices by using the argsort function.

81
00:06:31,040 --> 00:06:33,670
Next we plot X versus Y.

82
00:06:33,770 --> 00:06:37,640
This should follow the center of the data points for each X.

83
00:06:37,670 --> 00:06:43,700
Next we use the fill_between function to draw a transparent band with width equal to the standard deviation,

84
00:06:46,230 --> 00:06:54,280
so that's why we have y hat minus SD and then y hat plus SD.
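A small sketch of the sort-then-plot step, assuming NumPy arrays. The helper name plot_with_band, the alpha value, and the stand-in arrays at the bottom are hypothetical; in the notebook, y_hat and sd would come from the trained model rather than from a formula.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

def plot_with_band(ax, x, y_hat, sd):
    """Plot predictions with a one-standard-deviation band around them."""
    # Sort by x so the line is drawn left to right instead of zigzagging.
    idx = np.argsort(x)
    xs, ys, sds = x[idx], y_hat[idx], sd[idx]

    ax.plot(xs, ys)
    # Transparent band from y_hat - sd to y_hat + sd.
    ax.fill_between(xs, ys - sds, ys + sds, alpha=0.3)
    return xs, ys, sds

# Stand-in values for illustration only (not model outputs).
rng = np.random.default_rng(0)
x = rng.uniform(-5.0, 5.0, size=1024)
y_hat = np.cos(x) - 0.3 * x
sd = 0.05 + 0.1 * (x + 5.0)

fig, ax = plt.subplots()
xs, ys, sds = plot_with_band(ax, x, y_hat, sd)
```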

85
00:06:54,380 --> 00:07:03,560
So here's the plot, and as you can see, the band increases as we go from left to right, and our model accurately

86
00:07:03,560 --> 00:07:07,490
predicts that the variance should increase as X increases.