1
00:00:11,670 --> 00:00:16,410
In this lecture, we are going to look at a notebook which demonstrates the use of a feedforward neural

2
00:00:16,410 --> 00:00:18,630
network for regression.

3
00:00:18,630 --> 00:00:22,180
This lecture is going to walk you through a prepared Colab notebook.

4
00:00:22,350 --> 00:00:28,320
Although a very good exercise, which I always recommend, is, once you know how this is done, to try and

5
00:00:28,320 --> 00:00:35,070
recreate it yourself with as few references as possible. As usual, you can look at the title of the notebook

6
00:00:35,400 --> 00:00:37,920
to determine what notebook we are currently looking at.

7
00:00:39,480 --> 00:00:44,910
The important thing about this notebook is that, unlike the other examples we did, this one is going to

8
00:00:44,910 --> 00:00:47,340
involve synthetic data.

9
00:00:47,820 --> 00:00:53,490
That means data that we are going to create rather than data we've collected from an experiment or a

10
00:00:53,490 --> 00:00:54,960
database.

11
00:00:54,990 --> 00:00:57,550
Now you might ask, why should we do this?

12
00:00:57,570 --> 00:00:58,970
Is it even practical?

13
00:00:58,980 --> 00:01:02,240
Is this just the Lazy Programmer being a science geek?

14
00:01:02,310 --> 00:01:08,640
And of course, if you think any of these things, you are wrong. Using synthetic data is very important

15
00:01:08,640 --> 00:01:15,540
for understanding the behavior and quirks of any machine learning algorithm. You want to be able to see

16
00:01:15,540 --> 00:01:18,270
where it succeeds and where it fails.

17
00:01:18,570 --> 00:01:20,930
The important keyword there is "see".

18
00:01:21,390 --> 00:01:25,240
We'll be creating a dataset which we can actually visualize and look at.

19
00:01:25,410 --> 00:01:31,380
So we can observe that the neural network will create a prediction surface that corresponds to the true

20
00:01:31,380 --> 00:01:34,880
function that we generated the data from. Of course,

21
00:01:34,910 --> 00:01:38,100
Don't take my word for it that this is important.

22
00:01:38,100 --> 00:01:41,160
One great example of this is in the field of clustering.

23
00:01:41,160 --> 00:01:47,280
So if I go over to scikit-learn's web page on different clustering algorithms, you can see a ton of examples

24
00:01:47,280 --> 00:01:48,480
of this.

25
00:01:48,540 --> 00:01:54,600
In particular, it shows a set of some pretty contrived two-dimensional data, and then it compares how

26
00:01:54,600 --> 00:01:57,270
each algorithm clusters this data.

27
00:01:57,270 --> 00:02:02,600
By doing this we can see where each algorithm is successful and where it is not.

28
00:02:02,700 --> 00:02:08,280
In fact, this is the same thing you saw in the TensorFlow Playground. Without being able to see the decision

29
00:02:08,280 --> 00:02:12,150
boundary and the corresponding data with your own eyes,

30
00:02:12,180 --> 00:02:16,240
You can imagine that this website would be much less engaging.

31
00:02:16,300 --> 00:02:24,990
Okay, so now you know why visualizing synthetic data is important. On to the code. As usual,

32
00:02:25,000 --> 00:02:28,500
We're going to start by importing PyTorch, NumPy, and Matplotlib.

33
00:02:29,020 --> 00:02:33,990
We'll also need this 3D thing to help us make three dimensional plots.

34
00:02:34,000 --> 00:02:36,240
Next we're going to create the dataset.

35
00:02:36,730 --> 00:02:42,880
We'll start by creating the inputs X, which will just be two-dimensional data points uniformly distributed

36
00:02:43,120 --> 00:02:45,880
between minus three and plus three.

37
00:02:45,910 --> 00:02:51,520
As you know, the random function returns data uniformly distributed between 0 and 1.

38
00:02:51,580 --> 00:02:59,640
So if we multiply by six and subtract three, this scales the data to be between minus three and plus three.

39
00:02:59,890 --> 00:03:05,800
Next, we calculate the targets Y, which is just the cosine of two times the first feature plus the cosine

40
00:03:05,800 --> 00:03:08,040
of three times the second feature.
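
As a rough sketch, the data generation described here might look like the following. The sample count N, the seed, and the variable names are assumptions for illustration; the lecture does not state them.

```python
import numpy as np

# Assumed sample count; the lecture does not say how many points are generated.
N = 1000

np.random.seed(0)  # seeded only so this sketch is reproducible

# np.random.random returns values uniform in [0, 1);
# multiplying by 6 and subtracting 3 rescales them to [-3, 3).
X = np.random.random((N, 2)) * 6 - 3

# Targets: cos(2 * first feature) + cos(3 * second feature)
Y = np.cos(2 * X[:, 0]) + np.cos(3 * X[:, 1])
```

Since each cosine lies between minus one and plus one, every target falls between minus two and plus two.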

41
00:03:08,050 --> 00:03:10,480
Now you might ask, why am I using this function?

42
00:03:10,540 --> 00:03:12,940
Is it some kind of special function?

43
00:03:12,940 --> 00:03:14,130
The answer is no.

44
00:03:14,230 --> 00:03:22,440
I just wanted a nonlinear function with a few bumps and curves, and the cosine does just that.

45
00:03:22,450 --> 00:03:27,870
Next we're going to plot this data on a 3D scatter plot so that you can see what it looks like.

46
00:03:28,870 --> 00:03:29,980
So let's run this

47
00:03:35,940 --> 00:03:41,580
Now, one thing I don't like about notebooks compared to running a real Python script is that you can't

48
00:03:41,580 --> 00:03:43,500
play around with the plots.

49
00:03:43,620 --> 00:03:49,260
So for stuff like this, I like to run the actual Python script on my own computer, which shows you the

50
00:03:49,260 --> 00:03:51,330
plot in a separate window.

51
00:03:51,330 --> 00:03:54,930
Then you can zoom into the plot, rotate it, and so forth.

52
00:03:54,930 --> 00:04:09,090
This makes it a lot easier to get a good feel for the data.

53
00:04:09,310 --> 00:04:17,990
So if you run this on your own machine you'll be able to rotate the plot and look at it from different

54
00:04:17,990 --> 00:04:18,750
angles.

55
00:04:19,820 --> 00:04:24,350
So you can really see what the underlying function should look like.

56
00:04:28,990 --> 00:04:35,310
So next we're going to build the model. Since the data is in the form N by D, where D equals 2,

57
00:04:35,320 --> 00:04:42,800
There's no real preprocessing to do, so this is almost the same architecture as the previous example.

58
00:04:42,800 --> 00:04:47,980
We have one hidden layer with 128 hidden units and a ReLU activation.

59
00:04:47,990 --> 00:04:54,100
The difference is that now we have two inputs and one output instead of 784 inputs

60
00:04:54,280 --> 00:04:56,070
and 10 outputs.

61
00:04:56,140 --> 00:05:01,300
And since this is regression, there is no output activation function, and we're regressing on a single scalar

62
00:05:01,300 --> 00:05:02,050
value.

63
00:05:02,050 --> 00:05:03,550
So the output dimension is 1.
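
A minimal sketch of the architecture just described, assuming it is built with nn.Sequential (the lecture does not show the exact construction):

```python
import torch
import torch.nn as nn

# Two inputs -> 128 hidden units with ReLU -> one output.
# There is no activation after the last layer, since this is regression.
model = nn.Sequential(
    nn.Linear(2, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
)

# Quick shape check on a dummy batch of five 2-D points.
out = model(torch.zeros(5, 2))
```

Here out has shape 5 by 1: one scalar prediction per input row.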

64
00:05:07,980 --> 00:05:12,120
Next, we're going to create the loss and optimizer. Since we're doing regression,

65
00:05:12,120 --> 00:05:14,310
We'll use the MSE loss.

66
00:05:14,310 --> 00:05:19,850
You'll see that instead of using the default Adam, I've set the learning rate to 0.01.

67
00:05:19,860 --> 00:05:24,390
Now you might be wondering, where can I learn about all these different optimizer objects and what their

68
00:05:24,390 --> 00:05:25,890
arguments are.

69
00:05:25,890 --> 00:05:31,020
The answer is, as always, the PyTorch documentation.

70
00:05:31,080 --> 00:05:35,970
Your next question might be, how do I know that a learning rate of 0.01 is good?

71
00:05:36,150 --> 00:05:39,620
And remember that this is because I actually sat down and tried it.

72
00:05:39,630 --> 00:05:41,360
This is the only way.

73
00:05:41,460 --> 00:05:45,780
Again, if you're ever wondering how many epochs or what learning rate you should use,

74
00:05:45,780 --> 00:05:47,760
The answer is to try it and see

75
00:05:52,340 --> 00:05:55,460
Next, we have our training loop, which I've put into a function.

76
00:05:55,550 --> 00:05:59,310
This may or may not be useful in the future for copying and pasting.

77
00:05:59,390 --> 00:06:04,250
In any case, I just wanted to show you an alternative way of writing this. Other than the fact that the

78
00:06:04,250 --> 00:06:05,510
loop is a function,

79
00:06:05,600 --> 00:06:13,380
Nothing else here is new.
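
One way the loss, optimizer, and training-loop function might be sketched is shown here. The function name full_gd, the epoch count, and the small stand-in dataset are assumptions for illustration; only the MSE loss and Adam with a learning rate of 0.01 come from the lecture.

```python
import numpy as np
import torch
import torch.nn as nn

def full_gd(model, criterion, optimizer, X_train, y_train, epochs=1000):
    """Full-batch gradient descent; returns the loss recorded at each epoch."""
    train_losses = np.zeros(epochs)
    for it in range(epochs):
        optimizer.zero_grad()               # clear gradients from the last step
        outputs = model(X_train)            # forward pass
        loss = criterion(outputs, y_train)
        loss.backward()                     # backward pass
        optimizer.step()                    # parameter update
        train_losses[it] = loss.item()
    return train_losses

# Small stand-in dataset drawn from the same function as in the lecture.
torch.manual_seed(0)
X_train = torch.rand(200, 2) * 6 - 3
y_train = (torch.cos(2 * X_train[:, 0]) + torch.cos(3 * X_train[:, 1])).reshape(-1, 1)

model = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# 200 epochs is an arbitrary choice to keep this sketch quick.
losses = full_gd(model, criterion, optimizer, X_train, y_train, epochs=200)
```

The targets are reshaped to N by 1 so that they match the shape of the model output when the MSE loss is computed.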

80
00:06:13,450 --> 00:06:18,850
Next, we're going to plot the loss per iteration to confirm that the training process converged nicely.

81
00:06:21,440 --> 00:06:22,130
If it doesn't,

82
00:06:22,160 --> 00:06:28,420
That means you have to go back and modify your hyperparameters. So this appears to look OK.

83
00:06:33,410 --> 00:06:40,030
Now we have the next step, which is to make predictions using the data. Since this data is two-dimensional

84
00:06:40,040 --> 00:06:45,970
and visualizable, we have a very nice situation: we can actually plot the entire prediction surface

85
00:06:46,010 --> 00:06:49,970
as a function along with the original data points.

86
00:06:50,000 --> 00:06:55,250
This will confirm to us that the neural network, even though it's just a bunch of linear equations with

87
00:06:55,250 --> 00:06:59,670
ReLUs, can approximate this sum of cosines.

88
00:06:59,720 --> 00:07:01,860
That's pretty amazing if you think about it.

89
00:07:02,030 --> 00:07:10,540
We can approximate a cosine equation without having any cosine.

90
00:07:10,610 --> 00:07:15,500
Now, some of you might be wondering how we actually make this 3D surface plot, so let's go through that.

91
00:07:17,270 --> 00:07:24,340
The first thing we need to do is create a mesh grid. We can start by choosing points along the x1 axis

92
00:07:24,370 --> 00:07:31,230
and the x2 axis, which I've set to be just fifty evenly spaced points between minus three and plus three.

93
00:07:31,330 --> 00:07:33,320
So that's what linspace does.

94
00:07:33,700 --> 00:07:38,980
Next, we call the meshgrid function, which basically does the Cartesian product between these two sets of

95
00:07:38,980 --> 00:07:40,000
points.

96
00:07:40,120 --> 00:07:47,890
So every x1 and every x2 is paired up with each other. We assign these to the variables xx and yy, which

97
00:07:47,890 --> 00:07:55,220
represent the first axis and the second axis, respectively. Unfortunately, this data is not in the right format

98
00:07:55,250 --> 00:08:02,700
for our machine learning model, which expects an N by 2 array as input. In order to convert this

99
00:08:02,700 --> 00:08:04,660
into an N by 2 array,

100
00:08:04,770 --> 00:08:11,730
We're going to flatten xx and yy, pass them into the vstack function, which stacks these two arrays vertically,

101
00:08:12,150 --> 00:08:19,410
and then transpose the result. You can do some simpler examples on smaller arrays to confirm to yourself

102
00:08:19,410 --> 00:08:20,630
that this does the right thing.
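
The steps above can be sketched as follows, along with the kind of small example suggested here. The variable names line, Xgrid, a, b, and small are assumptions for illustration.

```python
import numpy as np

# Fifty evenly spaced points between -3 and +3 along each axis.
line = np.linspace(-3, 3, 50)

# meshgrid pairs every x1 with every x2; xx and yy each have shape (50, 50).
xx, yy = np.meshgrid(line, line)

# Flatten both, stack them as two rows, then transpose to get an N-by-2 array.
Xgrid = np.vstack((xx.flatten(), yy.flatten())).T   # shape (2500, 2)

# A smaller example to confirm the pairing by eye.
a, b = np.meshgrid([0, 1], [10, 20])
small = np.vstack((a.flatten(), b.flatten())).T
# small is [[0, 10], [1, 10], [0, 20], [1, 20]]
```

Each row of the result is one (x1, x2) pair, which is the layout the model expects as input.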

103
00:08:23,670 --> 00:08:29,100
Next, we use our model to make a prediction on Xgrid and then flatten the result, since the result is

104
00:08:29,160 --> 00:08:31,140
N by 1.

105
00:08:31,170 --> 00:08:36,360
Finally, we call the plot_trisurf function, which plots the 3D surface.

106
00:08:36,360 --> 00:08:44,100
The first three arguments are the x1 axis, the x2 axis, and then the function value, which is Y. The

107
00:08:44,100 --> 00:08:45,780
other arguments are not so important.

108
00:08:45,960 --> 00:08:47,820
But you can play around with them if you want.
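
A rough sketch of the surface plot call is shown here. To keep it self-contained, the true function stands in for the model's flattened predictions; that substitution, the variable names, the headless backend, and the linewidth and antialiased values are all assumptions for illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs anywhere; remove for an interactive window
import matplotlib.pyplot as plt

# Rebuild the N-by-2 grid of (x1, x2) points.
line = np.linspace(-3, 3, 50)
xx, yy = np.meshgrid(line, line)
Xgrid = np.vstack((xx.flatten(), yy.flatten())).T

# Placeholder for the flattened model predictions (assumption: in the
# notebook this would come from the trained network, not the true function).
Yhat = np.cos(2 * Xgrid[:, 0]) + np.cos(3 * Xgrid[:, 1])

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")

# First three arguments: x1 values, x2 values, and the function value.
surf = ax.plot_trisurf(Xgrid[:, 0], Xgrid[:, 1], Yhat, linewidth=0.2, antialiased=True)

# plt.show()  # uncomment, with an interactive backend, to rotate and zoom
```

All three positional arguments are one-dimensional arrays of the same length, which is why the grid and the predictions are flattened first.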

109
00:08:48,620 --> 00:08:49,760
So let's run this

110
00:08:55,470 --> 00:09:00,270
Okay, so we can see that the neural network manages to approximate this function quite well.

111
00:09:01,260 --> 00:09:07,350
As mentioned previously, you get much more out of being able to manipulate the plot by rotating and zooming

112
00:09:07,350 --> 00:09:14,450
in. So as an exercise, you might want to try exporting this script as a Python file.

113
00:09:14,750 --> 00:09:24,260
So you do that via File, Download .py, and then run this on your local machine. So I've now finished running

114
00:09:24,260 --> 00:09:25,550
this on my machine.

115
00:09:29,000 --> 00:09:30,170
So here is the loss

116
00:09:38,690 --> 00:09:47,020
Here are the predictions along with the original data points. So it really helps to be able to rotate it

117
00:09:47,290 --> 00:09:48,310
and spin it around

118
00:09:55,010 --> 00:09:55,330
All right.

119
00:09:55,320 --> 00:10:02,060
And the last thing we're going to do here is try to test out if the neural network can extrapolate. So

120
00:10:02,060 --> 00:10:09,710
we know that the cosine function is going to repeat periodically from minus infinity to plus infinity.

121
00:10:09,710 --> 00:10:12,880
So you might wonder, can the neural network figure that out?

122
00:10:13,940 --> 00:10:18,590
So what I've done here is I've changed linspace to go from minus five to plus five.

123
00:10:18,620 --> 00:10:21,570
So just to make it a little bigger.

124
00:10:21,570 --> 00:10:22,550
And if we run this

125
00:10:26,900 --> 00:10:28,540
we see that it doesn't quite work.

126
00:10:28,910 --> 00:10:34,220
So it just assumes that the pattern keeps going in the direction that it was going at the edges.

127
00:10:38,930 --> 00:10:43,990
And of course, this is because the neural network doesn't use a periodic activation function.

128
00:10:44,000 --> 00:10:49,010
So we wouldn't expect it to be periodic with respect to the inputs.