1
00:00:11,200 --> 00:00:17,170
OK, so in this video, we're going to discuss how to actually build a neural network using TensorFlow.

2
00:00:18,100 --> 00:00:22,540
At this point, you know the theory behind neural networks, but you do not know how to practically

3
00:00:22,540 --> 00:00:23,300
implement them.

4
00:00:24,070 --> 00:00:26,320
So this lecture will go over those details.

5
00:00:27,280 --> 00:00:31,780
Now, obviously, you're going to see this in the lectures also, but by seeing everything together

6
00:00:31,780 --> 00:00:37,450
now, it'll give you two exposures, which hopefully improves your ability to retain this information.

7
00:00:42,130 --> 00:00:49,210
So let's start with the fact that for TensorFlow 2, Google recommends using the Keras API. I sometimes encounter

8
00:00:49,210 --> 00:00:54,100
students who wonder why we are using Keras when we are supposed to be using TensorFlow 2.

9
00:00:54,400 --> 00:00:56,170
But this question is misinformed.

10
00:00:56,920 --> 00:01:02,110
TensorFlow is designed such that the proper thing to do is to use the Keras API.

11
00:01:02,650 --> 00:01:08,620
The only time you should not be using the Keras API is if you cannot use the Keras API to do what

12
00:01:08,620 --> 00:01:09,560
you want to do.

13
00:01:10,120 --> 00:01:14,980
And remember, this is advice from the developers at Google who invented TensorFlow.

14
00:01:15,340 --> 00:01:18,310
If anyone knows how to use TensorFlow, it would be them.

15
00:01:19,730 --> 00:01:26,810
OK, so with that said, at a high level, the Keras API is very similar to scikit-learn. The basic steps are

16
00:01:26,810 --> 00:01:27,590
as follows.

17
00:01:28,580 --> 00:01:30,290
The first step is to create the model.

18
00:01:30,710 --> 00:01:32,970
We'll go through the details of how this is done soon.

19
00:01:33,800 --> 00:01:36,440
The second step is to fit the model to a data set.

20
00:01:37,220 --> 00:01:40,340
The third step is to make predictions using the model.

21
00:01:40,970 --> 00:01:45,830
OK, so this is just like scikit-learn, in that you call the fit function when you want to train the model

22
00:01:46,040 --> 00:01:48,900
and you call the predict function when you want to make predictions.

23
00:01:49,460 --> 00:01:53,290
So it's almost exactly the same as before, except with some minor differences.

24
00:01:58,060 --> 00:02:03,570
OK, so for reasons which may not be clear yet, we are going to save how to create the model for last,

25
00:02:04,000 --> 00:02:07,090
and this is despite the fact that it will be the first step in the code.

26
00:02:08,140 --> 00:02:10,700
What we will discuss first is how to train the model.

27
00:02:11,380 --> 00:02:16,560
So imagine that you've just created your model instance and this is saved to a variable called model.

28
00:02:17,170 --> 00:02:19,780
The next step is to call a function called compile.

29
00:02:20,350 --> 00:02:21,670
This does a few things.

30
00:02:22,480 --> 00:02:25,990
Firstly, it tells your model which loss function to optimize.

31
00:02:26,440 --> 00:02:32,200
As you know, the standard for regression is the squared error or, equivalently, the mean squared error.

32
00:02:32,800 --> 00:02:37,000
Basically, the act of learning in the neural network is really just doing calculus.

33
00:02:37,300 --> 00:02:40,060
Given some loss function, we want to find its minimum.

34
00:02:41,170 --> 00:02:45,580
Now, it's not my goal in this lecture to discuss this in any detail, but you might want to check the

35
00:02:45,580 --> 00:02:46,080
lecture

36
00:02:46,090 --> 00:02:49,120
"How does a neural network learn?" if you want to learn more.

37
00:02:49,990 --> 00:02:54,510
That being said, the method of minimizing this loss function must also be specified.

38
00:02:55,180 --> 00:02:58,420
The typical default in deep learning nowadays is called Adam.

39
00:02:59,230 --> 00:03:03,670
We won't explore the different kinds of optimization methods in this course, but you're encouraged

40
00:03:03,670 --> 00:03:05,750
to check out extra_reading.txt

41
00:03:06,010 --> 00:03:09,820
if you want to learn more. Specifically, look under the heading

42
00:03:09,970 --> 00:03:11,890
"How does backpropagation work?"

43
00:03:13,030 --> 00:03:16,810
Another option we can specify in this function is a list of metrics.

44
00:03:17,290 --> 00:03:22,300
Basically, TensorFlow will track these metrics during training so that you can plot them at each step

45
00:03:22,300 --> 00:03:23,380
of the training loop.

46
00:03:23,980 --> 00:03:29,260
For example, if you're doing classification, you might want to track the accuracy rate over time.

47
00:03:30,160 --> 00:03:35,500
If you want to see an exhaustive list of possible metrics, I'd encourage you to check out the documentation.

48
00:03:36,760 --> 00:03:38,630
OK, so that's the compile function.

49
00:03:39,310 --> 00:03:42,700
Now, the reason this function is called compile is really historical.

50
00:03:42,890 --> 00:03:48,670
But again, I'd encourage you to check out extra_reading.txt if you're curious about learning more.
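
To make the compile step concrete, here is a minimal sketch. The tiny model, its input size of 10, and the choice of mean absolute error as the tracked metric are placeholders assumed purely for illustration; only the loss, optimizer, and metrics arguments come from the discussion above.

```python
import tensorflow as tf

# Placeholder model (assumed for illustration): 10 input features, 1 output.
i = tf.keras.layers.Input(shape=(10,))
x = tf.keras.layers.Dense(1)(i)
model = tf.keras.Model(i, x)

model.compile(
    loss="mse",        # which loss function to optimize (mean squared error)
    optimizer="adam",  # how to minimize that loss
    metrics=[tf.keras.metrics.MeanAbsoluteError(name="mae")],  # tracked during training
)
```

Nothing is trained at this point; compile only records the loss, optimizer, and metrics for the later call to fit.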

51
00:03:53,370 --> 00:03:58,100
OK, so what happens after you call compile? The next step is to call fit.

52
00:03:59,160 --> 00:04:03,230
So normally with scikit-learn, you pass in your training data and that's it.

53
00:04:03,690 --> 00:04:06,210
But for TensorFlow, you have a few more options.

54
00:04:06,630 --> 00:04:10,080
As mentioned, training neural networks is an iterative process.

55
00:04:10,500 --> 00:04:15,210
That means something is going to happen inside a loop, specifically, backpropagation.

56
00:04:15,990 --> 00:04:19,890
Because of this, we would like to specify how many times that loop iterates.

57
00:04:20,250 --> 00:04:22,030
We call this the number of epochs.

58
00:04:22,530 --> 00:04:27,180
So by specifying this number, you can control how many iterations you train for.

59
00:04:28,110 --> 00:04:31,050
Now, typically, you can't guess this value beforehand.

60
00:04:31,350 --> 00:04:36,830
You simply look at the loss per iteration after you are done training to check if training has converged.

61
00:04:37,380 --> 00:04:41,610
If not, you go back and tweak your settings until you get a nicer looking curve.

62
00:04:42,750 --> 00:04:47,740
Alternatively, if you want to be more fancy, you can follow the loss per iteration during training.

63
00:04:48,750 --> 00:04:51,600
Another thing you can pass in is your validation data.

64
00:04:52,020 --> 00:04:53,800
In other words, out-of-sample data.

65
00:04:54,810 --> 00:05:00,030
So when you plot your metrics after training, it will show both the train metrics and the test metrics

66
00:05:00,030 --> 00:05:00,710
over time.

67
00:05:01,290 --> 00:05:04,410
This will let you decide whether or not your model is overfitting.

68
00:05:05,820 --> 00:05:11,130
Now, please note that this list of arguments is not exhaustive, but these are the basics. If there

69
00:05:11,160 --> 00:05:15,080
are any more that we use in the code, you can simply learn about them on the fly.

70
00:05:19,630 --> 00:05:25,510
OK, so after training, it's common to plot your loss per epoch. As mentioned, this is so that you

71
00:05:25,510 --> 00:05:31,240
can confirm everything worked as expected, and if it didn't, you can go back and tune things further.

72
00:05:32,080 --> 00:05:35,290
Basically, your fit function is going to return an object.

73
00:05:35,740 --> 00:05:38,290
This object will have an attribute called history.

74
00:05:38,890 --> 00:05:44,270
The history attribute is basically a dictionary that stores your metrics at each iteration of training.

75
00:05:45,130 --> 00:05:50,260
So, for example, I can access the train loss per iteration by indexing the dictionary

76
00:05:50,380 --> 00:05:56,620
with the string "loss". I can access the validation loss per iteration by indexing the dictionary with

77
00:05:56,620 --> 00:05:58,270
the string "val_loss".

78
00:05:59,050 --> 00:06:04,150
And also, if I passed in accuracy as a metric to the compile function, I can do a similar thing

79
00:06:04,150 --> 00:06:06,600
with that metric or any other metric as well.
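
The fit call and the history dictionary described above can be sketched as follows. The synthetic regression data, the train/validation split, the five epochs, and the metric name "mae" are all assumptions made just so the example runs on its own.

```python
import numpy as np
import tensorflow as tf

# Synthetic regression data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)).astype("float32")
Y = X.sum(axis=1, keepdims=True)
X_train, Y_train = X[:150], Y[:150]
X_val, Y_val = X[150:], Y[150:]

i = tf.keras.layers.Input(shape=(5,))
x = tf.keras.layers.Dense(1)(i)
model = tf.keras.Model(i, x)
model.compile(
    loss="mse",
    optimizer="adam",
    metrics=[tf.keras.metrics.MeanAbsoluteError(name="mae")],
)

# epochs controls how many passes over the training data are made;
# validation_data is the out-of-sample data evaluated after each epoch.
r = model.fit(
    X_train, Y_train,
    epochs=5,
    validation_data=(X_val, Y_val),
    verbose=0,
)

# r.history is a dictionary with one list per tracked quantity,
# and one entry in each list per epoch.
train_loss = r.history["loss"]
val_loss = r.history["val_loss"]
train_mae = r.history["mae"]
val_mae = r.history["val_mae"]
```

Each of these lists can be handed to a plotting library to draw the train and validation curves on the same axes.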

80
00:06:11,350 --> 00:06:16,630
OK, and so after training your model, you can call the predict function. The predict function

81
00:06:16,630 --> 00:06:20,710
is unremarkable because it works in exactly the same way as in scikit-learn.

82
00:06:25,460 --> 00:06:30,530
So now that we understand the whole workflow of building a model, compiling it, training it and making

83
00:06:30,530 --> 00:06:34,290
predictions, we can begin to look at how models are actually built.

84
00:06:34,910 --> 00:06:39,200
In fact, this part is what differentiates each architecture from the others.

85
00:06:39,710 --> 00:06:45,230
So when we discuss CNNs and RNNs, all the steps we previously discussed still apply.

86
00:06:45,260 --> 00:06:46,400
Those won't change.

87
00:06:46,790 --> 00:06:49,590
The only thing that will change is how the model is built.

88
00:06:50,090 --> 00:06:54,770
In fact, in this section alone, we will discuss several different ANN models.

89
00:06:59,310 --> 00:07:05,400
So the first model we will discuss is the basic feedforward ANN, which works for tabular data and

90
00:07:05,400 --> 00:07:06,760
univariate time series.

91
00:07:07,110 --> 00:07:08,400
This is thanks to our rule,

92
00:07:08,430 --> 00:07:12,090
"all data is the same," which you already observed in the previous section.

93
00:07:12,810 --> 00:07:15,630
So basically, this is just four lines of code.

94
00:07:16,050 --> 00:07:18,420
In the first line, we create our input layer.

95
00:07:19,260 --> 00:07:23,070
This isn't really a layer per se, but it defines your input shape.

96
00:07:24,600 --> 00:07:30,690
As you can see, we pass in one argument called shape, and the value is a tuple that specifies the

97
00:07:30,690 --> 00:07:32,830
shape of each dimension of your input.

98
00:07:33,510 --> 00:07:36,760
Note that this must be an iterable, like a tuple or a list.

99
00:07:37,110 --> 00:07:41,930
So even if your input only has a single dimension, you cannot pass that in by itself.

100
00:07:43,790 --> 00:07:49,010
The second step is to create a dense layer, which is basically the linear expression we keep seeing.

101
00:07:49,670 --> 00:07:49,900
Here,

102
00:07:49,940 --> 00:07:53,530
we specify the number of hidden units and the activation function.

103
00:07:54,770 --> 00:07:59,790
Note that at this point you can chain more dense layers if you like to make a deeper network.

104
00:08:00,440 --> 00:08:05,300
Again, I'll remind you that choosing the number of hidden units and the number of hidden layers is a

105
00:08:05,300 --> 00:08:06,660
matter of trial and error.

106
00:08:06,710 --> 00:08:08,200
There is no magic formula.

107
00:08:09,440 --> 00:08:15,050
Notice this convention that we simply use the variable x to represent the output of each layer.

108
00:08:15,590 --> 00:08:19,010
There's no need to give them special names unless you have some reason to.

109
00:08:20,390 --> 00:08:25,310
The next step is to define our final layer, where we specify the number of outputs, which we have

110
00:08:25,310 --> 00:08:26,610
called K on the slide.

111
00:08:27,350 --> 00:08:30,770
So, for example, you'll have K equals one for a one step forecast.

112
00:08:31,040 --> 00:08:33,350
K equals three for a three-step forecast.

113
00:08:34,460 --> 00:08:40,100
And note that for all three cases, regression problems, binary classification problems, and

114
00:08:40,100 --> 00:08:42,590
multiclass problems, they will all have this structure.

115
00:08:43,610 --> 00:08:48,470
You may have expected to see the sigmoid or softmax activation, but we'll see a better way of dealing

116
00:08:48,470 --> 00:08:49,280
with that soon.

117
00:08:50,830 --> 00:08:56,500
The final step is to construct our model using the layers we just created. The interface to the model

118
00:08:56,500 --> 00:08:58,160
constructor is pretty simple.

119
00:08:58,630 --> 00:09:02,350
The first argument is the input and the second argument is the output.

120
00:09:03,250 --> 00:09:08,380
And as a side note, notice that if you wanted to build just a regular linear model, all you would

121
00:09:08,380 --> 00:09:13,510
have to do is simply remove the first dense layer, although that is not of interest at this time.
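
The four lines described above might look like the sketch below. The slide itself is not in this transcript, so the concrete sizes D, M, and K and the ReLU activation are assumptions chosen for illustration.

```python
import tensorflow as tf

D = 8   # number of input features (assumed)
M = 32  # number of hidden units (assumed)
K = 3   # number of outputs, for example a three-step forecast

i = tf.keras.layers.Input(shape=(D,))               # input shape, passed as a tuple
x = tf.keras.layers.Dense(M, activation="relu")(i)  # hidden layer
x = tf.keras.layers.Dense(K)(x)                     # final layer, no activation
model = tf.keras.Model(i, x)                        # first argument input, second output
```

Deleting the hidden layer line and applying the final Dense layer to i instead of x would leave a plain linear model, as mentioned in the side note.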

122
00:09:18,190 --> 00:09:23,140
Another thing I want to make a note of here is that this is called the Keras functional API.

123
00:09:23,770 --> 00:09:27,290
To see why it's called this, notice that the syntax is a bit strange.

124
00:09:27,670 --> 00:09:33,880
I'm creating objects like Input, Dense, and so forth, but I'm also able to use these objects as if they

125
00:09:33,880 --> 00:09:34,740
were functions.

126
00:09:35,230 --> 00:09:40,660
So you can see here that after creating a dense object, I can follow it with round brackets and pass

127
00:09:40,660 --> 00:09:43,000
in its input as if it were a function.

128
00:09:43,720 --> 00:09:45,970
So that's what we mean by functional API.

129
00:09:46,450 --> 00:09:51,910
This method is very useful, especially when you want to make a neural network with branches, for example,

130
00:09:51,910 --> 00:09:54,340
with multiple inputs or multiple outputs.
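
As a small illustration of a network with branches, here is a sketch of a model with two inputs that are processed separately and then merged. This particular architecture is not from the lecture; the input sizes, layer widths, and the use of Concatenate are assumptions picked only to show the calling syntax.

```python
import tensorflow as tf

# Two separate inputs (sizes assumed for illustration).
i1 = tf.keras.layers.Input(shape=(4,))
i2 = tf.keras.layers.Input(shape=(6,))

# Each branch has its own dense layer. Note the pattern:
# create the layer object, then call it on its input like a function.
x1 = tf.keras.layers.Dense(8, activation="relu")(i1)
x2 = tf.keras.layers.Dense(8, activation="relu")(i2)

# Merge the branches and produce a single output.
x = tf.keras.layers.Concatenate()([x1, x2])
x = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model([i1, i2], x)
```

Because every layer is just called on the output of another, wiring up a graph like this takes no new concepts beyond the four-line model.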

131
00:09:59,030 --> 00:10:04,940
OK, so at this point, you must be wondering, how can it be that we do not need the sigmoid or a softmax

132
00:10:04,940 --> 00:10:05,740
activation?

133
00:10:06,410 --> 00:10:09,340
The reasoning for this is far beyond the scope of this course.

134
00:10:09,350 --> 00:10:14,510
But again, you're encouraged to check out extra_reading.txt if you want to learn more.

135
00:10:15,380 --> 00:10:21,350
Basically, there is a mathematical reason that we do not want to apply the sigmoid or softmax directly.

136
00:10:22,010 --> 00:10:27,730
As you recall, taking exponents is numerically unstable because the values will get very large.

137
00:10:28,220 --> 00:10:30,770
So we want to avoid that as much as possible.
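
The instability is easy to reproduce with plain NumPy. The sketch below is only meant to illustrate the general idea with made-up logit values; it is not a claim about how TensorFlow implements its losses internally. It compares a naive softmax with a log-softmax computed using the standard log-sum-exp shift.

```python
import numpy as np

# Large logits (values assumed for illustration).
logits = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax: exp overflows to inf, and inf / inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.exp(logits).sum()

# Stable alternative: subtract the max before exponentiating and
# work with log-probabilities, which is what a cross-entropy needs.
shifted = logits - logits.max()
log_softmax = shifted - np.log(np.exp(shifted).sum())
```

Subtracting the maximum does not change the result mathematically, but it keeps every exponent at or below zero, so nothing overflows.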

138
00:10:32,060 --> 00:10:37,310
What you effectively end up doing is combining the final activation with the loss function.

139
00:10:37,760 --> 00:10:41,650
So for regression, nothing needs to change since it's just linear regression.

140
00:10:41,810 --> 00:10:47,870
You have your linear equation and then you apply the mean squared error. For binary classification,

141
00:10:47,870 --> 00:10:51,140
you'll have one output and you'll use the binary cross-entropy.

142
00:10:52,250 --> 00:10:57,530
You then pass in from_logits=True so that the model knows you have not yet applied the sigmoid.

143
00:10:58,160 --> 00:11:05,690
Basically, logit is a fancy word for the value you have before you apply the logistic function. For multiclass

144
00:11:05,690 --> 00:11:06,230
problems,

145
00:11:06,230 --> 00:11:11,210
you'll have K outputs and you'll use the sparse categorical cross-entropy. Again,

146
00:11:11,210 --> 00:11:14,960
you'll pass in from_logits=True for the same reason as before.
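
The three cases can be put side by side as in the sketch below. The helper build_model, the layer sizes, and the random input are hypothetical and exist only to keep the example self-contained. One consequence worth seeing in code: since the final layer has no activation, predict returns logits, so a softmax (or a sigmoid in the binary case) has to be applied afterwards to get probabilities.

```python
import numpy as np
import tensorflow as tf

def build_model(D, K):
    """Hypothetical helper: a feedforward ANN with K linear outputs."""
    i = tf.keras.layers.Input(shape=(D,))
    x = tf.keras.layers.Dense(16, activation="relu")(i)
    x = tf.keras.layers.Dense(K)(x)  # no sigmoid or softmax here
    return tf.keras.Model(i, x)

# Regression: one or more outputs with the mean squared error.
reg = build_model(4, 1)
reg.compile(loss="mse", optimizer="adam")

# Binary classification: one output, the sigmoid is folded into the loss.
binary = build_model(4, 1)
binary.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer="adam",
)

# Multiclass classification: K outputs, the softmax is folded into the loss.
multi = build_model(4, 3)
multi.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer="adam",
)

# predict returns logits, so convert them to probabilities explicitly.
X = np.random.randn(5, 4).astype("float32")
logits = multi.predict(X, verbose=0)
probs = tf.nn.softmax(logits, axis=-1).numpy()
labels = probs.argmax(axis=1)
```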

147
00:11:19,730 --> 00:11:24,540
OK, so now you know how to build a feedforward neural network for three different tasks.

148
00:11:25,190 --> 00:11:31,010
One important lesson to remember is that the main steps do not change, even though the model is different.

149
00:11:31,460 --> 00:11:36,560
So later on, when you learn about different models, we still have the same compile, fit, and predict

150
00:11:36,560 --> 00:11:37,110
steps.

151
00:11:37,520 --> 00:11:42,440
This is helpful to know so that writing the code is not a matter of learning everything from scratch

152
00:11:42,440 --> 00:11:43,190
each time.

153
00:11:44,090 --> 00:11:49,910
Only the few lines in the middle will change after you declare your input and before you finish instantiating

154
00:11:49,910 --> 00:11:50,930
the model object.
