1
00:00:02,170 --> 00:00:04,420
Hello everyone, and welcome back to this class.

2
00:00:06,390 --> 00:00:10,620
This lecture is about how to write code independently when you're doing machine learning.

3
00:00:11,850 --> 00:00:16,379
You always hear me say that the first thing you should try to do when you learn an algorithm is to go

4
00:00:16,379 --> 00:00:20,130
and write it up in code by yourself before looking at someone else's code.

5
00:00:21,000 --> 00:00:25,050
For some students who are new to machine learning, it's not quite clear how to do that.

6
00:00:25,710 --> 00:00:30,960
So this lecture is going to explain the thought process you should go through in hopes that it encourages

7
00:00:30,960 --> 00:00:33,090
more people to try to code by themselves.

8
00:00:36,570 --> 00:00:41,340
I like to encapsulate this in the phrase: "When it's time to code, you must code."

9
00:00:42,030 --> 00:00:46,800
So remember this whenever you see some code that demonstrates the concepts we learned in this course.

10
00:00:47,430 --> 00:00:53,040
Even if you don't fully understand what's going on, coding and copying examples from others helps you

11
00:00:53,040 --> 00:00:54,060
build muscle memory.

12
00:00:58,500 --> 00:01:02,910
Sometimes people act like they are going to learn tennis by reading a book about tennis.

13
00:01:03,570 --> 00:01:05,790
This is, of course, obviously not possible.

14
00:01:06,690 --> 00:01:11,280
You need to learn some things about tennis, which are pure fact and reside in your brain as conscious

15
00:01:11,280 --> 00:01:11,820
facts.

16
00:01:12,150 --> 00:01:15,510
But then you have to go out and practice those concepts on the tennis court.

17
00:01:16,290 --> 00:01:21,420
Once you've exhausted and mastered everything you know so far, then you can learn new techniques.

18
00:01:21,990 --> 00:01:26,240
So you have this continuous cycle of learning new techniques and then practicing them.

19
00:01:29,640 --> 00:01:35,370
Why is this? Because learning the technique as pure fact doesn't mean you truly understand it.

20
00:01:36,030 --> 00:01:40,020
Practicing the technique helps you think about the technique from a different perspective.

21
00:01:40,590 --> 00:01:46,170
Practicing again and again leads to new understanding, to learning in the subconscious, which is muscle

22
00:01:46,170 --> 00:01:46,590
memory.

23
00:01:47,100 --> 00:01:48,570
You must go through this cycle.

24
00:01:49,440 --> 00:01:54,330
But some students naively think they are going to read an entire book about tennis and then be a tennis

25
00:01:54,330 --> 00:01:56,700
master the first time they try playing.

26
00:01:57,330 --> 00:02:02,640
This is very common, especially when all you really need to do is sit there and watch a video and

27
00:02:02,640 --> 00:02:03,960
no one can force you to type.

28
00:02:04,740 --> 00:02:06,810
A lot of people don't even think to try.

29
00:02:07,770 --> 00:02:13,080
So just remember this whenever you see code: when it's time to code, you must code.

30
00:02:16,640 --> 00:02:20,270
First, we are going to talk a little bit about why you want to code by yourself.

31
00:02:21,170 --> 00:02:26,270
A lot of the time what an algorithm does or how it works isn't quite clear the first time you learn

32
00:02:26,270 --> 00:02:26,630
about it.

33
00:02:27,440 --> 00:02:32,450
Every individual will have their own unique background, so some might be familiar with the patterns

34
00:02:32,450 --> 00:02:33,740
and know exactly what to do,

35
00:02:34,130 --> 00:02:37,100
while other individuals might have questions that no one else has.

36
00:02:37,940 --> 00:02:43,190
Sometimes you gloss over details because you assume you know what's going on when really there are important

37
00:02:43,190 --> 00:02:44,480
things you're not considering.

38
00:02:45,020 --> 00:02:47,930
So this is how trying to code by yourself can help you.

39
00:02:48,770 --> 00:02:52,100
Coding by yourself forces you to think about each and every detail.

40
00:02:52,520 --> 00:02:54,440
It forces you to think line by line.

41
00:02:55,820 --> 00:03:00,500
You have to be familiar with the data types and the shapes of all your variables, and they all have

42
00:03:00,500 --> 00:03:01,790
to fit together properly.

43
00:03:02,120 --> 00:03:03,530
Kind of like Lego blocks.

44
00:03:07,380 --> 00:03:13,110
As an example, you know that in order to do element-wise matrix addition, both matrices have to have

45
00:03:13,110 --> 00:03:14,310
the exact same shape.

46
00:03:15,060 --> 00:03:19,980
So if you find that they are not the same shape, then one of your previous assumptions was incorrect.

47
00:03:21,000 --> 00:03:22,710
So you should go back and correct it.
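
As a small illustration of that shape rule, here is a minimal sketch assuming NumPy. The array names and sizes are made up for the example. One caveat worth knowing: NumPy will also broadcast some differently shaped arrays together without complaint, so the sketch uses a hypothetical helper, `checked_add`, that insists on identical shapes and fails loudly when an assumption is wrong.

```python
import numpy as np

# Hypothetical shapes, chosen only for illustration.
A = np.ones((3, 4))
B = np.ones((3, 4))
C = np.ones((4, 3))

# Same shape: element-wise addition works and keeps the shape.
total = A + B


def checked_add(P, Q):
    """Add two arrays element-wise, insisting on identical shapes."""
    if P.shape != Q.shape:
        raise ValueError(f"shape mismatch: {P.shape} vs {Q.shape}")
    return P + Q


# Different shapes: the check tells us an earlier assumption was wrong.
try:
    checked_add(A, C)
    mismatch_caught = False
except ValueError:
    mismatch_caught = True
```

When the check fails, the useful move is to trace back to where each shape came from rather than to reshape until the error disappears.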

48
00:03:23,760 --> 00:03:28,260
Lego blocks have to follow a specific set of rules in order to fit together properly.

49
00:03:29,040 --> 00:03:34,110
If you make incorrect assumptions about how Lego blocks fit together and you try to join them, it's

50
00:03:34,110 --> 00:03:37,800
not going to work. Without trying to build things by yourself,

51
00:03:38,160 --> 00:03:40,020
you will never discover these details.

52
00:03:43,850 --> 00:03:46,280
Let's now talk about how to code by yourself.

53
00:03:47,000 --> 00:03:49,520
Consider the supervised machine learning scenario.

54
00:03:50,210 --> 00:03:55,130
We know that we're going to have some inputs and some targets, and we want to try to make predictions

55
00:03:55,370 --> 00:03:57,980
from the inputs that are very close to the targets.

56
00:03:58,730 --> 00:04:01,760
We typically call the inputs X and the targets Y.

57
00:04:02,450 --> 00:04:06,980
Sometimes we call the targets T, but for the purpose of this lecture, we'll call the targets Y.

58
00:04:09,950 --> 00:04:11,750
Now, this next point is a key point.

59
00:04:12,440 --> 00:04:16,070
It doesn't actually matter what X is and what Y is. It

60
00:04:16,070 --> 00:04:16,670
doesn't matter

61
00:04:16,670 --> 00:04:22,160
if you're looking at an e-commerce data set and the columns of X might be time on site, time of day,

62
00:04:22,400 --> 00:04:24,620
how many pages the user looked at, and so on.

63
00:04:25,220 --> 00:04:30,740
You could just as well be looking at a data set of X-ray images where each column is the pixel intensity

64
00:04:30,740 --> 00:04:31,340
of an image.

65
00:04:32,000 --> 00:04:33,140
This is the key point.

66
00:04:33,470 --> 00:04:37,620
We say "all data is the same." When we do linear regression,

67
00:04:37,640 --> 00:04:41,630
we don't have different kinds of linear regression for e-commerce and X-ray images.

68
00:04:42,170 --> 00:04:45,920
Linear regression is the same algorithm, no matter what your data set is.

69
00:04:46,340 --> 00:04:48,290
So we say all data is the same.

70
00:04:49,370 --> 00:04:54,800
Theoretically, I could give you a data set of X's and Y's, and you could train a classification algorithm

71
00:04:54,800 --> 00:04:58,460
on it without me even telling you what the X's and Y's mean.

72
00:04:59,120 --> 00:05:01,190
You should get very comfortable with this idea.

73
00:05:04,520 --> 00:05:10,010
Sometimes people say, well, this isn't practical because I want to do concrete examples, but that's

74
00:05:10,010 --> 00:05:11,960
because they're not thinking intelligently.

75
00:05:12,560 --> 00:05:17,390
In fact, this is the most practical thing we can do because it means everything we learn

76
00:05:17,630 --> 00:05:20,180
we can apply to any data set that exists.

77
00:05:20,870 --> 00:05:26,270
It means that I can train a model on an e-commerce dataset, but I can also train a model on an online

78
00:05:26,270 --> 00:05:29,330
advertising dataset without learning anything new at all.

79
00:05:29,900 --> 00:05:32,330
It's truly the greatest form of lazy programming.

80
00:05:32,750 --> 00:05:35,480
Learn something once and apply it to every industry.

81
00:05:39,070 --> 00:05:45,070
One great consequence of "all data is the same" is that there is an unlimited amount of practice opportunity

82
00:05:45,070 --> 00:05:45,460
for you.

83
00:05:46,270 --> 00:05:51,670
You can download data sets from Kaggle, from Google, from Wikipedia, from Amazon or from anywhere

84
00:05:51,670 --> 00:05:53,830
else and try the algorithms you learned.

85
00:05:54,640 --> 00:05:57,010
What this means is, let's say you have a data set

86
00:05:57,280 --> 00:06:00,580
you care about more than something we do in class. See,

87
00:06:00,580 --> 00:06:03,640
in class, we have to use data sets that everybody can understand.

88
00:06:04,180 --> 00:06:08,440
For example, text and images. Everybody should know what text and images are.

89
00:06:09,220 --> 00:06:12,430
Text and images are just fundamental data types on the web.

90
00:06:13,000 --> 00:06:17,830
You probably don't even think about the fact that text and images are data because it's so trivial.

91
00:06:18,820 --> 00:06:23,080
In any case, we work with text and images a lot because everybody understands them.

92
00:06:23,860 --> 00:06:29,710
But let's say you're a biologist and you know about DNA. You find DNA and genomics really interesting.

93
00:06:30,370 --> 00:06:32,590
So you want to use the algorithms we learn on that.

94
00:06:33,190 --> 00:06:33,880
Well, that's great.

95
00:06:33,880 --> 00:06:34,990
And you should totally do it.

96
00:06:35,680 --> 00:06:41,410
Remember, all data is the same, so all you need to do is convert your data into the appropriate X's

97
00:06:41,410 --> 00:06:44,410
and Y's to plug into our machine learning algorithms.

98
00:06:45,700 --> 00:06:52,060
The downside is we most likely won't talk about DNA in class unless it's a very simple example, because

99
00:06:52,060 --> 00:06:57,240
most computer scientists are not biologists too, so they don't know what DNA is.

100
00:06:57,280 --> 00:06:58,960
They don't understand the specifics.

101
00:06:59,470 --> 00:07:01,570
So an example wouldn't make much sense to them.

102
00:07:02,320 --> 00:07:05,530
This is as opposed to text and images, which make sense to everybody.

103
00:07:06,370 --> 00:07:12,070
Same goes with any other specialized field like finance, computer networking, cosmology and so on.

104
00:07:16,870 --> 00:07:20,410
So now we can start off from the right place to talk about supervised learning.

105
00:07:20,830 --> 00:07:22,750
We have our X's and we have our Y's.

106
00:07:22,840 --> 00:07:26,040
What do we do with them? In supervised learning,

107
00:07:26,050 --> 00:07:28,300
we know that there are two main things we want to do.

108
00:07:28,960 --> 00:07:33,070
We want to do training and we want to do prediction. In scikit-learn,

109
00:07:33,070 --> 00:07:34,900
all the models have the same API.

110
00:07:35,350 --> 00:07:39,820
It doesn't matter if you're doing logistic regression or decision trees or random forest.

111
00:07:40,270 --> 00:07:45,220
All supervised models in scikit-learn have the same two functions: fit and predict.

112
00:07:45,880 --> 00:07:48,550
Fitting a model is just a synonym for training a model.

113
00:07:49,090 --> 00:07:53,020
If you don't believe me, you can go look at the scikit-learn documentation yourself.
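
To see the shared API in practice, here is a minimal sketch that assumes scikit-learn and NumPy are installed. The data is synthetic and carries no real-world meaning, which also illustrates the earlier point that the models do not need to know what X and Y represent. The variable names, sample sizes, and settings such as `n_estimators=10` are choices made for this example only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 100 samples, 5 features, binary targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = (X[:, 0] + X[:, 1] > 0).astype(int)

models = [
    LogisticRegression(),
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(n_estimators=10, random_state=0),
]

# Three different algorithms, the same two calls.
predictions = {}
for model in models:
    model.fit(X, Y)                                         # training
    predictions[type(model).__name__] = model.predict(X)   # prediction
```

Swapping one model for another changes a single line; the calls to `fit` and `predict` stay the same.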

114
00:07:56,570 --> 00:08:01,670
When you're learning about supervised machine learning, what you're really learning is what goes inside

115
00:08:01,670 --> 00:08:02,630
these two functions.

116
00:08:03,110 --> 00:08:05,840
What are the parameters and how are those parameters learned?

117
00:08:06,590 --> 00:08:10,790
That's really what all supervised machine learning is: filling in these two functions.

118
00:08:11,480 --> 00:08:16,280
Taking this perspective is going to give you a code structure, and it's going to make things much easier

119
00:08:16,550 --> 00:08:17,870
to visualize in your head.

120
00:08:20,950 --> 00:08:24,100
Let's now do a simple example to solidify this idea.

121
00:08:24,970 --> 00:08:29,290
In this example, I'm going to give you an algorithm and you will implement it in code.

122
00:08:30,040 --> 00:08:33,280
There are two key points I want to make before I give you the algorithm.

123
00:08:34,000 --> 00:08:37,690
First, I'm not going to give you much intuition about how it works.

124
00:08:38,320 --> 00:08:41,710
Second, I'm not going to derive the theory behind why it works.

125
00:08:42,460 --> 00:08:47,620
The reason I want to mention these two key points is that you should realize you don't need these two

126
00:08:47,620 --> 00:08:51,610
pieces of information in order to translate pseudocode into code.

127
00:08:52,330 --> 00:08:58,870
A lot of the time, these three concepts, intuition, theory, and implementation, work to reinforce each other.

128
00:08:59,680 --> 00:09:05,270
The reason I'm focusing on implementation right now is because it's the one people most often miss or

129
00:09:05,290 --> 00:09:10,180
think they don't need at all when really it's an extremely important part of the learning process.

130
00:09:13,610 --> 00:09:18,770
OK, so the pseudocode is as follows. In my predict function, as you already know,

131
00:09:18,800 --> 00:09:24,830
I'm going to take in some input data X. My predictions will be Y hat equals X times W.

132
00:09:25,700 --> 00:09:28,400
You might already recognize this as linear regression.

133
00:09:29,090 --> 00:09:34,220
However, forget what you know about linear regression and pretend that this is just some formula I

134
00:09:34,220 --> 00:09:34,700
gave you.

135
00:09:35,510 --> 00:09:38,960
What I can tell you is that we are creating some kind of regression model.

136
00:09:39,710 --> 00:09:44,210
It should be clear from this equation that the parameters of the model are contained in this vector

137
00:09:44,210 --> 00:09:44,630
of weights, W.

138
00:09:44,630 --> 00:09:50,990
I would probably also tell you that W is a vector that's the same size as the number of features

139
00:09:50,990 --> 00:09:51,560
in X.

140
00:09:52,250 --> 00:09:56,000
Of course, this must be true in order for the matrix multiplication to be valid.

141
00:09:56,870 --> 00:10:01,100
This acts as a kind of sanity check, so you can make sure everything is making sense.
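
The predict step and its shape check can be sketched in a few lines, assuming NumPy. The function name `predict`, the explicit `assert`, and the small numbers are illustrative choices for this example, not taken from the lecture slides.

```python
import numpy as np


def predict(X, w):
    """Return y_hat = X w for an N x D matrix X and a length-D vector w."""
    N, D = X.shape
    # Sanity check: the matrix multiply is only valid if w has D entries.
    assert w.shape == (D,), f"expected w of shape ({D},), got {w.shape}"
    return X @ w  # result has shape (N,)


X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # N = 3 samples, D = 2 features
w = np.array([0.5, -1.0])    # one weight per feature
y_hat = predict(X, w)        # one prediction per sample
```

Here `y_hat` has one entry per row of `X`, which is exactly what the shapes require.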

142
00:10:04,460 --> 00:10:10,700
In my fit function, I want to perform this loop some number of times. Since we know from the previous

143
00:10:10,700 --> 00:10:15,710
slide that the parameters of the model are contained in W, it should be clear that inside the

144
00:10:15,710 --> 00:10:18,380
fit function, what we want to do is update W.

145
00:10:19,100 --> 00:10:21,440
Not surprisingly, that's what's happening here.

146
00:10:21,710 --> 00:10:27,090
Another good sanity check. Now, because you're already familiar with linear regression,

147
00:10:27,140 --> 00:10:30,110
you probably already recognize this as gradient descent.

148
00:10:30,740 --> 00:10:34,700
Again, I want you to pretend that you don't know that. But what do you know?

149
00:10:38,540 --> 00:10:43,490
You know that iterative algorithms are common in machine learning and that gradient descent is among

150
00:10:43,490 --> 00:10:44,330
the most common.

151
00:10:44,930 --> 00:10:49,790
You could probably infer that this is some form of gradient descent, but you don't need to know what

152
00:10:49,790 --> 00:10:51,560
it is in order to write it in code.

153
00:10:52,310 --> 00:10:53,270
What else do we know?

154
00:10:54,170 --> 00:10:59,300
You know what a for loop is and you know how to write one in Python. You know that X is an N by

155
00:10:59,300 --> 00:11:04,520
D matrix containing the input data, and that Y is an N-sized vector containing the targets.

156
00:11:05,480 --> 00:11:11,270
You know that Y hat is an N-sized vector containing the predictions. You know that W is a D-sized

157
00:11:11,270 --> 00:11:13,220
vector containing the model parameters.

158
00:11:13,730 --> 00:11:19,070
And you know how to add, subtract and multiply these objects. So we can write this algorithm without

159
00:11:19,070 --> 00:11:20,420
even knowing how it works.

160
00:11:21,320 --> 00:11:24,680
Of course, a lot about how it works can just be inferred from this formula.

161
00:11:28,790 --> 00:11:30,050
Here is what you don't know.

162
00:11:30,410 --> 00:11:33,410
You don't know the number of times the loop is supposed to run.

163
00:11:33,890 --> 00:11:35,540
You don't know eta, the learning rate.

164
00:11:36,110 --> 00:11:37,430
This is completely fine.

165
00:11:38,000 --> 00:11:40,970
You can't let this lack of knowledge stop you in your tracks.

166
00:11:41,750 --> 00:11:46,550
The truth of the matter is there are going to be situations where you're not told the exact numbers

167
00:11:46,550 --> 00:11:47,210
to plug in.

168
00:11:47,810 --> 00:11:51,530
In many cases, the answer is going to be "it depends on the problem."

169
00:11:52,400 --> 00:11:58,460
The important thing is you have to get used to the idea of trial and error, and you have to make educated

170
00:11:58,460 --> 00:12:00,560
guesses based on what you already know.

171
00:12:04,130 --> 00:12:08,900
So what can you do? Well, you know that the learning rate is supposed to be a small number.

172
00:12:09,260 --> 00:12:09,950
How small?

173
00:12:10,160 --> 00:12:11,510
It depends on the problem.

174
00:12:12,290 --> 00:12:15,290
Typically, it's a number less than one, like 0.1.

175
00:12:15,860 --> 00:12:17,900
If that doesn't work, you might try lowering it.

176
00:12:18,500 --> 00:12:20,660
Typically, we lower it by factors of 10.

177
00:12:21,290 --> 00:12:26,240
So, for example, the next learning rates we would try are 10 to the minus 2, 10 to the minus 3, and

178
00:12:26,240 --> 00:12:26,780
so on.

179
00:12:30,930 --> 00:12:36,330
Typically in machine learning, whether you're looking at a supervised or unsupervised algorithm, there

180
00:12:36,330 --> 00:12:39,930
is an associated cost or objective function that you're trying to minimize.

181
00:12:40,770 --> 00:12:45,270
Let's suppose I tell you it's squared error, which is typically what we would use for regression.

182
00:12:46,020 --> 00:12:50,370
Well, now you have a way to choose the number of iterations of the loop and the learning rate.

183
00:12:50,910 --> 00:12:51,840
How do we do this?

184
00:12:56,190 --> 00:13:00,570
So inside your fit function, you would plot the cost as a function of iteration.

185
00:13:01,320 --> 00:13:04,650
I always recommend doing this no matter what algorithm you're looking at.

186
00:13:05,550 --> 00:13:08,070
In general, you always want the cost to converge.

187
00:13:08,790 --> 00:13:13,980
The pattern you typically see is that there is a steep drop at the beginning, and it slowly flattens

188
00:13:13,980 --> 00:13:19,740
out as the number of iterations increases. To choose the number of iterations, you want to stop when the

189
00:13:19,740 --> 00:13:21,330
curve is sufficiently flat.

190
00:13:22,080 --> 00:13:25,320
You might recognize this as a diminishing returns scenario.

191
00:13:26,420 --> 00:13:32,180
The returns diminish because at the beginning, we reduce the cost by a lot in just a few iterations.

192
00:13:32,720 --> 00:13:36,800
At the end, we can do a lot of iterations, but we reduce the cost by almost nothing.
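
Recording the cost at every iteration takes only one extra line in the loop. The sketch below assumes NumPy, synthetic data, the mean of the squared errors as the cost, and the standard gradient of that cost for the update. The update rule on the slide is not reproduced in the transcript, so treat this one as an assumption. The threshold `tol` for "sufficiently flat" is a hypothetical value and, like the learning rate, depends on the problem.

```python
import numpy as np

# Synthetic regression data, made up for this demo.
rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))
true_w = np.array([1.0, -2.0, 0.5])
Y = X @ true_w + 0.1 * rng.normal(size=N)

w = np.zeros(D)
learning_rate = 0.1   # an educated guess
max_iters = 1000
tol = 1e-8            # hypothetical threshold for "sufficiently flat"
costs = []

for i in range(max_iters):
    y_hat = X @ w
    costs.append(np.mean((Y - y_hat) ** 2))   # cost at this iteration
    # Stop once an iteration improves the cost by less than tol.
    if i > 0 and costs[-2] - costs[-1] < tol:
        break
    # Assumed update: gradient of the mean squared error with respect to w.
    w = w - learning_rate * (2.0 / N) * X.T @ (y_hat - Y)

# To look at the curve: import matplotlib.pyplot as plt; plt.plot(costs); plt.show()
```

Plotting `costs` should show the steep drop followed by the flat tail described above, and the loop stops well before `max_iters` once the tail is flat enough.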

193
00:13:39,920 --> 00:13:45,920
An alternative is you could use a separate validation set and stop when the validation cost increases.

194
00:13:46,040 --> 00:13:48,110
But that is not the focus of this lecture.

195
00:13:48,740 --> 00:13:54,200
Unlike the training cost, the validation cost is not guaranteed to decrease at every round because

196
00:13:54,200 --> 00:13:56,900
that is not what we are minimizing with respect to.

197
00:14:00,720 --> 00:14:06,450
You can also use the same plot to help improve your learning rate. If the cost explodes or becomes not

198
00:14:06,450 --> 00:14:06,990
a number (NaN),

199
00:14:07,230 --> 00:14:11,670
your learning rate is probably too high. If your cost converges too slowly,

200
00:14:11,700 --> 00:14:13,620
you can try increasing your learning rate.
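
The trial-and-error search over learning rates can be sketched as a loop over candidates that differ by factors of 10. This assumes NumPy, synthetic data, and a mean squared error cost with its standard gradient. The helper `run_gradient_descent`, the fixed 100 iterations, and the 1% cutoff used to label a run as slow are all hypothetical choices for this example.

```python
import numpy as np


def run_gradient_descent(X, Y, learning_rate, n_iters=100):
    """Return the list of mean squared error costs, one per iteration."""
    N, D = X.shape
    w = np.zeros(D)
    costs = []
    for _ in range(n_iters):
        y_hat = X @ w
        costs.append(np.mean((Y - y_hat) ** 2))
        w = w - learning_rate * (2.0 / N) * X.T @ (y_hat - Y)
    return costs


# Synthetic, noise-free data, made up for this demo.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
Y = X @ np.array([3.0, -1.0])

results = {}
for learning_rate in [10.0, 1.0, 0.1, 0.01, 0.001]:
    # Silence overflow warnings: a diverging run is expected for large rates.
    with np.errstate(over="ignore", invalid="ignore"):
        costs = run_gradient_descent(X, Y, learning_rate)
    final = costs[-1]
    if not np.isfinite(final) or final > costs[0]:
        results[learning_rate] = "too high (cost exploded)"
    elif final > 0.01 * costs[0]:
        results[learning_rate] = "slow (cost still far from flat)"
    else:
        results[learning_rate] = "ok"
```

On this data the largest rate blows up and the smallest barely moves the cost in 100 iterations, while 0.1 converges. Which rates land in which bucket will differ from one data set to another.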

201
00:14:17,930 --> 00:14:19,910
So what does your final code look like?

202
00:14:21,810 --> 00:14:25,680
Well, we know we want to create a class with at least the predict and fit functions.

203
00:14:26,160 --> 00:14:30,870
You don't have to use a class, but I find that it encapsulates the code nicely and provides useful

204
00:14:30,870 --> 00:14:31,440
structure.

205
00:14:33,120 --> 00:14:38,250
Notice that once you have the algorithm in math, there isn't much work to convert it into NumPy code.

206
00:14:38,880 --> 00:14:43,470
You know how to do matrix multiplication, you know how to add, you know how to subtract.

207
00:14:43,980 --> 00:14:47,340
These are basic arithmetic operations that I'm sure you know already.

208
00:14:48,150 --> 00:14:53,100
If you don't know how to do this in NumPy, then I have a completely free NumPy course on the NumPy

209
00:14:53,100 --> 00:14:54,270
stack that you can take.

210
00:14:57,980 --> 00:15:02,690
Here is the second part of the code where we actually create an instance of the class and then use it.

211
00:15:03,350 --> 00:15:07,070
I've also included the squared error cost function just for completeness' sake.
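
The slides with the code are not part of the transcript, so what follows is a reconstruction under stated assumptions rather than the code shown in the lecture. It assumes NumPy, a mean squared error cost, and the standard gradient descent update for that cost. The class name `LinearRegressionGD`, the default hyperparameters, and the synthetic data are all made up for this example.

```python
import numpy as np


class LinearRegressionGD:
    """Regression model with y_hat = X w, trained by gradient descent."""

    def __init__(self, learning_rate=0.1, n_iters=200):
        self.learning_rate = learning_rate   # eta: start with an educated guess
        self.n_iters = n_iters               # number of passes through the loop
        self.w = None
        self.costs = []

    def predict(self, X):
        return X @ self.w                    # (N, D) @ (D,) -> (N,)

    def fit(self, X, Y):
        N, D = X.shape
        self.w = np.zeros(D)                 # one weight per feature
        self.costs = []
        for _ in range(self.n_iters):
            y_hat = self.predict(X)
            self.costs.append(squared_error(Y, y_hat))
            # Assumed update: gradient of the mean squared error w.r.t. w.
            self.w = self.w - self.learning_rate * (2.0 / N) * X.T @ (y_hat - Y)
        return self


def squared_error(Y, y_hat):
    """Mean squared error between targets and predictions."""
    return np.mean((Y - y_hat) ** 2)


# Create an instance and use it on synthetic data (made up for this demo).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
true_w = np.array([2.0, -1.0, 0.5])
Y = X @ true_w + 0.05 * rng.normal(size=500)

model = LinearRegressionGD(learning_rate=0.1, n_iters=200)
model.fit(X, Y)
final_cost = squared_error(Y, model.predict(X))
```

Because the data was generated from known weights, comparing `model.w` with `true_w` gives a quick check that the loop is doing what it should.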

212
00:15:12,200 --> 00:15:16,910
What's cool about this template is that it doesn't really change from one algorithm to the next.

213
00:15:17,420 --> 00:15:22,640
The structure is basically always going to be this way, at least for supervised learning and unsupervised

214
00:15:22,640 --> 00:15:23,030
learning.

215
00:15:23,870 --> 00:15:28,310
So as we discussed before, it doesn't matter what algorithm you're implementing, whether it be linear

216
00:15:28,310 --> 00:15:31,720
regression, logistic regression, a neural network and so on.

217
00:15:31,940 --> 00:15:37,550
They always have the predict function and the fit function, and learning what goes inside these functions

218
00:15:37,790 --> 00:15:39,860
is equivalent to learning the algorithm.

219
00:15:41,450 --> 00:15:47,000
So in fact, one way to begin coding is to start with just this boilerplate code and then fill in the

220
00:15:47,000 --> 00:15:47,540
blanks.
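
One possible shape for that boilerplate is sketched below in plain Python. The class names `Model` and `MeanModel` are hypothetical, and the trivial "always predict the mean target" fill-in is there only to show the blanks being filled, not as an algorithm from the lecture.

```python
class Model:
    """Boilerplate to copy for each new algorithm; fill in the two blanks."""

    def fit(self, X, Y):
        # Blank 1: learn the parameters from inputs X and targets Y.
        raise NotImplementedError("fill in the training procedure")

    def predict(self, X):
        # Blank 2: turn inputs X into predictions using the learned parameters.
        raise NotImplementedError("fill in the prediction rule")


class MeanModel(Model):
    """A deliberately trivial fill-in: always predict the mean target."""

    def fit(self, X, Y):
        self.mean_ = sum(Y) / len(Y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


model = MeanModel().fit([[1], [2], [3]], [2.0, 4.0, 6.0])
print(model.predict([[10], [20]]))   # prints [4.0, 4.0]
```

Starting from the empty `Model` class and replacing each `NotImplementedError` is one way to make "fill in the blanks" concrete.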