1
00:00:11,080 --> 00:00:16,300
So in this lecture, we'll be looking at a notebook where we use TensorFlow to find a line of best fit.

2
00:00:17,080 --> 00:00:21,460
Remember that the goal of this notebook is to learn the basics of TensorFlow syntax.

3
00:00:22,000 --> 00:00:27,520
These include how to build a model, how to pass in data to fit the model, and how to use the model

4
00:00:27,520 --> 00:00:28,660
to make predictions.

5
00:00:29,410 --> 00:00:31,810
So let's begin by importing everything we need.

6
00:00:32,470 --> 00:00:38,110
You can see that I've imported a few layers, the Model class, and the Adam optimizer from

7
00:00:38,110 --> 00:00:44,530
tensorflow.keras. Some students aren't aware of this, but Google, who invented TensorFlow, actually recommends

8
00:00:44,530 --> 00:00:47,650
using the Keras API unless it is not possible.

9
00:00:48,280 --> 00:00:52,330
So some students always ask, why are we using Keras instead of TensorFlow?

10
00:00:52,780 --> 00:00:57,130
And this is because Google has stated that this is the correct way to use TensorFlow.

11
00:00:57,460 --> 00:01:00,250
So basically, you listen to the people who invented this.

12
00:01:06,920 --> 00:01:09,750
The next step is to create a linear regression dataset.

13
00:01:10,400 --> 00:01:13,880
We'll begin by setting N, the number of data points, to 100.

14
00:01:14,540 --> 00:01:20,330
The next step is to create our inputs, which we'll call X, to be uniform random data points between

15
00:01:20,330 --> 00:01:21,710
minus three and plus three.

16
00:01:22,730 --> 00:01:28,670
The next step is to set the targets y to be a linear function of X, along with some Gaussian distributed

17
00:01:28,670 --> 00:01:29,300
noise.

18
00:01:30,140 --> 00:01:35,600
As you recall, the linear regression model actually assumes that the noise comes from a Gaussian distribution.

19
00:01:36,410 --> 00:01:38,390
For reasons outside the scope of this course,

20
00:01:38,720 --> 00:01:42,170
this is connected to the mean squared error, which you'll see shortly.

21
00:01:43,430 --> 00:01:46,820
You'll want to make note of both the slope and intercept of this line.

22
00:01:47,510 --> 00:01:51,260
As you recall, the equation for a line is Y equals m x plus b.

23
00:01:51,890 --> 00:01:56,420
Here we can see the slope is zero point five, while the intercept is minus one.

24
00:01:57,260 --> 00:02:01,190
One of our goals in this script will be to see if we can recover these values.
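
As a rough sketch of the dataset step just described, assuming NumPy is used: N, X, and Y follow the names in the lecture, while the seed and the noise scale of 0.1 are my own assumptions, since the lecture doesn't state them.

```python
import numpy as np

# Seeded generator so the sketch is repeatable (the seed is an assumption).
rng = np.random.default_rng(0)

N = 100                                 # number of data points
X = rng.uniform(-3, 3, size=N)          # uniform random inputs between -3 and +3

# Targets: a line with slope 0.5 and intercept -1, plus Gaussian noise.
# The noise scale of 0.1 is an assumed value.
Y = 0.5 * X - 1 + 0.1 * rng.standard_normal(N)

print(X.shape, Y.shape)
```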

25
00:02:05,700 --> 00:02:09,509
The next step is to draw a scatterplot of our data to see what it looks like.

26
00:02:14,470 --> 00:02:17,680
So it's a line with some added noise as expected.

27
00:02:21,440 --> 00:02:23,420
So the next step is to build our model.

28
00:02:24,320 --> 00:02:30,440
As you can see, this consists of three parts: the input, the dense layer, and the model class.

29
00:02:31,190 --> 00:02:36,440
For the input, we've specified that its shape is a tuple containing a single value of one.

30
00:02:37,490 --> 00:02:41,270
This is because, as you recall, our data is one dimensional.

31
00:02:41,720 --> 00:02:46,430
If we had D dimensions, then we would set this value to D. Later

32
00:02:46,430 --> 00:02:49,400
in this course, you'll see how this can even be multidimensional.

33
00:02:49,730 --> 00:02:53,300
For example, number of time steps by number of input features.

34
00:02:54,170 --> 00:02:56,000
The next step is to create a dense layer.

35
00:02:56,600 --> 00:03:02,120
Basically, a dense layer implements an affine transformation, also somewhat incorrectly known

36
00:03:02,120 --> 00:03:03,710
as a linear transformation.

37
00:03:04,340 --> 00:03:08,780
Basically, this is doing the mx plus b computation I told you about earlier.

38
00:03:09,500 --> 00:03:15,530
So it takes the input, multiplies it by some internal parameter called M, and adds another internal

39
00:03:15,530 --> 00:03:16,760
parameter called B.

40
00:03:17,510 --> 00:03:21,320
This all happens inside this layer, so you don't have to worry about the details.

41
00:03:22,340 --> 00:03:27,950
Note that in general, this can multiply an input vector by a matrix and add another vector.

42
00:03:28,880 --> 00:03:34,520
It just so happens that in this case, the matrix is one by one, and the bias vector is just a vector

43
00:03:34,520 --> 00:03:35,450
of size one.

44
00:03:36,200 --> 00:03:42,500
This is because, as you can see, our input has dimension one and the output size, which is the argument

45
00:03:42,500 --> 00:03:45,590
into the Dense object constructor, is also one.
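
To make the affine transformation concrete, here is a minimal NumPy sketch of what a dense layer computes, not the Keras implementation itself. The function name affine and the example values are hypothetical, chosen to match the slope and intercept from earlier.

```python
import numpy as np

def affine(x, W, b):
    """Multiply a batch of inputs by a matrix W and add a bias vector b."""
    return x @ W + b

W = np.array([[0.5]])   # 1 by 1 matrix: one input feature, one output
b = np.array([-1.0])    # bias vector of size one

x = np.array([[-3.0], [0.0], [3.0]])   # three samples, one feature each
y = affine(x, W, b)
print(y)
```

With a 1 by 1 matrix this reduces to y = mx + b applied to each sample.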

46
00:03:46,640 --> 00:03:51,050
Another detail to notice is that this uses the Keras functional API.

47
00:03:51,680 --> 00:03:57,980
In simple terms, this means that although the dense object is a Python object, it can also be treated

48
00:03:57,980 --> 00:03:58,850
like a function.

49
00:03:59,450 --> 00:04:05,900
So when we say Dense(1), this creates an object, but we can have additional parentheses after that

50
00:04:06,200 --> 00:04:09,530
to call this object, since it behaves just like a function.

51
00:04:10,580 --> 00:04:13,970
Note that this returns an output which we simply call X.

52
00:04:14,810 --> 00:04:20,180
Now, please note that some of the more naive beginners will ask, why do we use all these single-letter

53
00:04:20,180 --> 00:04:21,260
variable names?

54
00:04:21,890 --> 00:04:24,350
This is simply because this is the convention

55
00:04:24,650 --> 00:04:28,650
when using TensorFlow and Keras. I've written an article about this.

56
00:04:28,670 --> 00:04:34,220
So if you have this question yourself, please contact me and I will send you the article, which contains

57
00:04:34,220 --> 00:04:36,680
a more in-depth discussion about this topic.

58
00:04:38,760 --> 00:04:41,670
The final step here is to create the model object.

59
00:04:42,270 --> 00:04:45,000
You can see that the constructor takes in two arguments.

60
00:04:45,570 --> 00:04:50,550
First, the input, which we defined above, and the output, also defined above.
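
The three parts described above could be sketched as follows, assuming the imports come from tensorflow.keras as mentioned at the start. The variable names i, x, and model are my guesses at the single-letter convention, not necessarily the notebook's exact names.

```python
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Input whose shape is a tuple containing a single value of one (1-D data).
i = Input(shape=(1,))

# Dense(1) creates the layer object; the second pair of parentheses calls it
# on the input, functional-API style, and returns the output.
x = Dense(1)(i)

# The Model constructor takes the input and the output.
model = Model(i, x)

print(model.count_params())
```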

61
00:04:55,460 --> 00:05:00,440
The next step is to print out a model summary, which we accomplish by calling model.summary().

62
00:05:06,080 --> 00:05:11,780
As you can see, this prints out some information about the model and the data that passes through it.

63
00:05:12,470 --> 00:05:15,740
In the first column, we can see each layer along with its type.

64
00:05:16,250 --> 00:05:18,050
So we have an input layer followed by a

65
00:05:18,050 --> 00:05:20,390
dense layer. In the second column,

66
00:05:20,390 --> 00:05:23,960
we can see the shape of the data after going through each layer.

67
00:05:24,590 --> 00:05:28,010
As you can see, this is None by one in both cases.

68
00:05:28,670 --> 00:05:29,900
So why is it None?

69
00:05:30,680 --> 00:05:36,110
As you recall, when we do machine learning, we can have any number of samples. For this notebook,

70
00:05:36,110 --> 00:05:41,840
we've created 100 samples, but obviously we want our model to work with other numbers as well.

71
00:05:42,290 --> 00:05:45,930
We might want to pass in just one sample or one million samples.

72
00:05:46,550 --> 00:05:52,400
So the None is like a wildcard that says this dimension does not have to take on any specific number.

73
00:05:54,510 --> 00:06:00,570
Finally, in the third column, we can see the number of parameters in each layer. The input has zero

74
00:06:00,570 --> 00:06:05,490
parameters because it's just an input, and the dense layer has two parameters, as promised.

75
00:06:06,000 --> 00:06:09,360
As you recall, these are the slope and intercept of our line.

76
00:06:13,290 --> 00:06:18,270
The next step is to call the compile function, which allows us to specify some important information

77
00:06:18,720 --> 00:06:20,550
about how our model will be trained.

78
00:06:21,360 --> 00:06:27,180
As you recall, we've only just defined our model, so it has a slope and intercept, but we don't know

79
00:06:27,180 --> 00:06:29,010
what these values should actually be.

80
00:06:29,550 --> 00:06:31,560
Our model has not yet seen our data.

81
00:06:32,820 --> 00:06:39,150
The first argument is the loss, where we pass in MSE, which stands for mean squared error, as promised.

82
00:06:39,720 --> 00:06:44,550
If you don't know what this is or you need to review, please contact me or use the Q&A.

83
00:06:45,950 --> 00:06:48,320
The next step is to specify the optimizer.

84
00:06:48,890 --> 00:06:54,320
Basically, the method of training a neural network involves a process called gradient descent, which

85
00:06:54,320 --> 00:06:56,660
you can visualize like a ball rolling down a hill.

86
00:06:57,410 --> 00:07:00,310
This will be discussed a bit more in depth later in the section,

87
00:07:00,320 --> 00:07:04,880
but for now, all you need to know is that there are different flavors of gradient descent.

88
00:07:05,450 --> 00:07:09,530
The most popular is called Adam, and these days it's often the default choice.

89
00:07:11,160 --> 00:07:16,200
You'll notice that I have two Adams here, with one commented out. The first Adam is a string.

90
00:07:16,770 --> 00:07:22,410
This is convenient because it takes less typing and you can use this if you're OK with the default values.

91
00:07:23,100 --> 00:07:28,230
For this notebook, we'll be creating an Adam object explicitly, which allows us to set the learning rate

92
00:07:28,230 --> 00:07:32,790
to zero point one, which is a better value for this data set than the default.

93
00:07:33,360 --> 00:07:36,690
We'll discuss more about how to choose these values later in the course.

94
00:07:38,460 --> 00:07:43,860
As a side note, recognize that the same principle applies to the loss as well. Although we've passed

95
00:07:43,860 --> 00:07:44,580
in a string,

96
00:07:44,910 --> 00:07:50,790
we could also pass in a Keras object or even write a loss function ourselves in the case where we don't

97
00:07:50,790 --> 00:07:52,500
want to use the default values.

98
00:07:53,670 --> 00:07:59,100
Finally, we have the metrics argument, which allows us to specify more metrics other than the loss,

99
00:07:59,400 --> 00:08:02,070
which will be computed and displayed on each step.

100
00:08:02,850 --> 00:08:07,980
For this example, it's kind of pointless because, for regression, we typically care about the MSE, which

101
00:08:07,980 --> 00:08:09,630
also happens to be the loss.

102
00:08:10,260 --> 00:08:15,750
You can see that I've specified the MAE, which stands for mean absolute error, which is another possible

103
00:08:15,750 --> 00:08:16,380
metric.

104
00:08:17,010 --> 00:08:22,080
This will be more relevant in the classification example, where the loss is not the same as the metric

105
00:08:22,080 --> 00:08:22,950
we care about.
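
A hedged sketch of the compile step as described: the string form of Adam is left commented out, and an explicit Adam object sets the learning rate to 0.1. The model is rebuilt here only so the snippet runs on its own; the names are assumptions.

```python
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Same one-input, one-output model as before, rebuilt to keep this self-contained.
i = Input(shape=(1,))
x = Dense(1)(i)
model = Model(i, x)

model.compile(
    loss="mse",                         # mean squared error, passed as a string
    # optimizer="adam",                 # string form: fine if the defaults are OK
    optimizer=Adam(learning_rate=0.1),  # explicit object so we can set the learning rate
    metrics=["mae"],                    # mean absolute error as an extra metric
)
```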

106
00:08:28,850 --> 00:08:30,590
The next step is to call the fit method.

107
00:08:31,250 --> 00:08:37,100
So this is where we actually perform the gradient descent process to find the slope and intercept. The

108
00:08:37,100 --> 00:08:39,830
previous step was only to set up various training parameters.

109
00:08:40,159 --> 00:08:43,309
But now we will actually do this training on our dataset.

110
00:08:44,059 --> 00:08:47,240
Note that the first two arguments are the inputs and the targets.

111
00:08:47,840 --> 00:08:51,180
You'll notice that I've reshaped the targets to be minus one by one.

112
00:08:51,860 --> 00:08:54,350
As before, minus one is a wildcard.

113
00:08:54,830 --> 00:09:00,350
So this value should just become whatever is left after setting the second dimension to be of size one.

114
00:09:01,340 --> 00:09:07,670
The reason we need to do this is because, as you recall in machine learning, our data sets in general

115
00:09:07,670 --> 00:09:12,980
are of size n by D, where N is the number of samples, and D is the number of features.

116
00:09:13,490 --> 00:09:14,990
In this case, D is one.

117
00:09:15,710 --> 00:09:22,160
But what we currently have is only a one-dimensional array of size N. In order to convert this into the

118
00:09:22,160 --> 00:09:23,420
N by D format,

119
00:09:23,720 --> 00:09:26,840
we need to make it a two-dimensional array of size N by one.

120
00:09:27,200 --> 00:09:29,150
So that's effectively what this does.
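
To illustrate the minus one wildcard, here is a small NumPy sketch. The array named data is a hypothetical stand-in for any one-dimensional array of size N.

```python
import numpy as np

data = np.arange(5.0)            # one-dimensional array of size N = 5
print(data.shape)

# -1 tells NumPy to infer that dimension from whatever is left over
# once the second dimension is fixed at size one.
data_2d = data.reshape(-1, 1)    # two-dimensional array of size N by 1
print(data_2d.shape)
```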

121
00:09:30,780 --> 00:09:34,050
The next argument is epochs, which I've set to 200.

122
00:09:34,680 --> 00:09:39,480
This sets the number of steps of gradient descent through the dataset that you want to do.

123
00:09:40,140 --> 00:09:43,470
Again, this will be discussed in more detail later in the course.

124
00:09:44,370 --> 00:09:50,010
The final argument is batch size, which allows us to specify how many data points our model will

125
00:09:50,010 --> 00:09:52,470
look at on each gradient descent step.

126
00:09:53,220 --> 00:09:59,190
The reason we need this is sometimes our dataset will simply be too large to fit into memory, so it's

127
00:09:59,190 --> 00:10:02,910
not possible to do gradient descent on the whole dataset at once.

128
00:10:03,630 --> 00:10:08,940
Generally speaking, we try to make this value as large as possible without degrading performance.

129
00:10:11,140 --> 00:10:16,450
Finally, note that the fit method returns something, which I've simply called r in this notebook.

130
00:10:17,080 --> 00:10:18,280
So let's run this block.

131
00:10:22,610 --> 00:10:28,040
OK, so as you can see, when we call the fit method, this will print out some information after each

132
00:10:28,040 --> 00:10:33,650
epoch, like the amount of time it took, the value of the loss and the value of any metrics you want

133
00:10:33,650 --> 00:10:34,280
it to compute.

134
00:10:34,730 --> 00:10:36,430
In this case, that's the MAE.

135
00:10:40,260 --> 00:10:45,870
OK, so once our training is complete, there are some things we should always do since, as mentioned

136
00:10:45,870 --> 00:10:50,520
in deep learning, we cannot trust that the process simply worked as expected.

137
00:10:51,210 --> 00:10:55,230
As you recall, we got back this variable called R from the previous step.

138
00:10:55,830 --> 00:10:59,880
This is an object which contains a history of the training process.

139
00:11:00,510 --> 00:11:06,180
You can see here that the history attribute contains a key called loss, and we can plot these values

140
00:11:06,180 --> 00:11:07,710
to see the loss per epoch.
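
Putting the fit call and the history object together, a minimal sketch might look like this. The batch size of 32, the noise scale, the seeds, and reshaping both arrays to N by one are assumptions on my part; only epochs=200 and the learning rate of 0.1 come from the lecture.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

np.random.seed(0)       # seeds are assumptions, for repeatability only
tf.random.set_seed(0)

# Synthetic data as described earlier (noise scale of 0.1 is assumed).
N = 100
X = np.random.uniform(-3, 3, size=N)
Y = 0.5 * X - 1 + 0.1 * np.random.randn(N)

i = Input(shape=(1,))
x = Dense(1)(i)
model = Model(i, x)
model.compile(loss="mse", optimizer=Adam(learning_rate=0.1), metrics=["mae"])

# fit returns an object whose history attribute maps names to per-epoch values.
r = model.fit(X.reshape(-1, 1), Y.reshape(-1, 1),
              epochs=200, batch_size=32, verbose=0)

losses = r.history["loss"]
print(len(losses), losses[0], losses[-1])
```

The list in r.history["loss"] is what gets plotted to see the loss per epoch.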

141
00:11:13,690 --> 00:11:19,270
OK, so you can see that we get a chart where the loss starts out very high and gets smaller overall

142
00:11:19,270 --> 00:11:20,140
on each step.

143
00:11:20,650 --> 00:11:22,000
This is what we like to see.

144
00:11:22,600 --> 00:11:25,930
But again, this will be discussed in more detail later in the course.

145
00:11:29,750 --> 00:11:32,870
Note that any metrics you pass in can also be plotted.

146
00:11:33,470 --> 00:11:35,630
So in this step, we'll also plot the MAE.

147
00:11:41,320 --> 00:11:47,380
OK, so as you can see, the MAE per epoch looks pretty much the same as the MSE, which was the loss.

148
00:11:51,690 --> 00:11:55,140
The next step in this notebook will be to see how to make predictions.

149
00:11:55,890 --> 00:12:01,860
To do this, we'll create a new set of inputs called X test, which are just 20 evenly spaced points

150
00:12:01,860 --> 00:12:03,660
between minus three and plus three.

151
00:12:04,410 --> 00:12:10,440
As you recall, we should reshape this to be of size N by one. The next step is to call model.predict,

152
00:12:10,440 --> 00:12:15,570
passing in X test to get the predictions, which we'll call P test.

153
00:12:21,590 --> 00:12:27,320
The next step will be to plot the original training points, along with our test predictions. For the

154
00:12:27,320 --> 00:12:29,450
training points, we'll make these a scatterplot,

155
00:12:29,690 --> 00:12:32,510
but for the predictions, we'll want to have a line chart.

156
00:12:33,170 --> 00:12:37,430
As you recall, our model is Y equals m x plus b, which is a line.

157
00:12:43,070 --> 00:12:48,320
OK, so as you can see, we have, in fact, found the line of best fit, which passes nicely through

158
00:12:48,320 --> 00:12:49,310
the training points.

159
00:12:54,240 --> 00:13:00,490
The next step in this notebook will be to check the slope and intercept that our model found. To do this,

160
00:13:00,510 --> 00:13:05,160
we'll first note that we can access the layers of our model by calling model.layers.

161
00:13:09,720 --> 00:13:12,960
As you can see, we get a list of all the layers we created above.

162
00:13:16,340 --> 00:13:21,410
The next step will be to demonstrate how to get the parameters that were learned for our dense layer,

163
00:13:21,410 --> 00:13:23,270
which is the layer at position one.

164
00:13:24,170 --> 00:13:26,990
Note that this is done by calling the method get_weights.

165
00:13:32,050 --> 00:13:38,620
So as you can see, this returns a list of two NumPy arrays. Notice that the first NumPy array

166
00:13:38,620 --> 00:13:42,010
is a two-dimensional one by one array, as promised.

167
00:13:42,580 --> 00:13:48,640
As you recall, this is because in general, this parameter can be a matrix with multiple inputs and

168
00:13:48,640 --> 00:13:49,690
multiple outputs.

169
00:13:50,320 --> 00:13:55,270
In any case, you can see that the value here is close to zero point five, which corresponds to the

170
00:13:55,270 --> 00:13:56,680
slope we defined above.

171
00:13:58,600 --> 00:14:04,000
Now, notice that the second NumPy array is a one-dimensional array, which in general can be a bias

172
00:14:04,000 --> 00:14:06,970
vector in the case where you have multiple outputs.

173
00:14:07,660 --> 00:14:12,700
Again, we see that the value is close to minus one, which corresponds to the intercept we defined

174
00:14:12,700 --> 00:14:13,240
above.

175
00:14:14,560 --> 00:14:17,980
In other words, we have recovered the hidden parameters of our dataset.
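
As a sketch of this last check, the snippet below trains the same kind of model on synthetic data and reads the learned slope and intercept from the dense layer at position one. The seeds, noise scale, and batch size are assumptions, and the exact values you get will vary from run to run.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

np.random.seed(0)       # seeds are assumptions, for repeatability only
tf.random.set_seed(0)

# True slope 0.5 and intercept -1; the noise scale of 0.1 is assumed.
N = 100
X = np.random.uniform(-3, 3, size=N)
Y = 0.5 * X - 1 + 0.1 * np.random.randn(N)

i = Input(shape=(1,))
x = Dense(1)(i)
model = Model(i, x)
model.compile(loss="mse", optimizer=Adam(learning_rate=0.1))
model.fit(X.reshape(-1, 1), Y.reshape(-1, 1),
          epochs=200, batch_size=32, verbose=0)

# model.layers[0] is the input layer; model.layers[1] is the dense layer.
weights = model.layers[1].get_weights()
slope = float(weights[0][0, 0])     # first array: 1 by 1 weight matrix
intercept = float(weights[1][0])    # second array: bias vector of size one
print("slope:", slope, "intercept:", intercept)
```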

