1
00:00:02,220 --> 00:00:04,480
Hey, everyone, and welcome back to this class.

2
00:00:06,450 --> 00:00:10,620
This lecture is about how to write code independently when you're doing machine learning.

3
00:00:11,910 --> 00:00:16,410
You always hear me say that the first thing you should try to do when you learn an algorithm is to go

4
00:00:16,410 --> 00:00:20,130
and write it up in code by yourself before looking at someone else's code.

5
00:00:21,060 --> 00:00:25,020
For some students who are new to machine learning, it's not quite clear how to do that.

6
00:00:25,770 --> 00:00:30,960
So this lecture is going to explain the thought process you should go through in hopes that it encourages

7
00:00:30,960 --> 00:00:33,120
more people to try to code by themselves.

8
00:00:36,660 --> 00:00:41,370
I like to encapsulate this in the phrase, When it's time to code, you must code.

9
00:00:42,060 --> 00:00:46,830
So remember this whenever you see some code that demonstrates the concepts we learned in this course,

10
00:00:47,520 --> 00:00:53,040
even if you don't fully understand what's going on, coding and copying examples from others helps you

11
00:00:53,040 --> 00:00:54,060
build muscle memory.

12
00:00:58,530 --> 00:01:02,970
Sometimes people act like they are going to learn tennis by reading a book about tennis.

13
00:01:03,630 --> 00:01:05,820
This is, of course, obviously not possible.

14
00:01:06,780 --> 00:01:11,280
You need to learn some things about tennis, which are pure fact and reside in your brain as conscious

15
00:01:11,280 --> 00:01:11,850
facts.

16
00:01:12,210 --> 00:01:15,510
But then you have to go out and practice those concepts on the tennis court.

17
00:01:16,380 --> 00:01:21,450
Once you've exhausted and mastered everything you know so far, then you can learn new techniques.

18
00:01:22,050 --> 00:01:26,270
So you have this continuous cycle of learning new techniques and then practicing them.

19
00:01:29,730 --> 00:01:30,510
Why is this?

20
00:01:31,260 --> 00:01:35,400
Because learning the technique as pure fact doesn't mean you truly understand it.

21
00:01:36,090 --> 00:01:40,050
Practicing the technique helps you think about the technique from a different perspective.

22
00:01:40,650 --> 00:01:46,200
Practicing again and again leads to new understanding, to learning in the subconscious, which is muscle

23
00:01:46,200 --> 00:01:46,590
memory.

24
00:01:47,190 --> 00:01:48,600
You must go through the cycle.

25
00:01:49,500 --> 00:01:54,360
But some students naively think they are going to read an entire book about tennis and then be a tennis

26
00:01:54,360 --> 00:01:54,990
master

27
00:01:55,260 --> 00:02:01,080
the first time they try playing. This is very common, especially when all you really need to do is sit

28
00:02:01,080 --> 00:02:04,050
there and watch a video and no one can force you to type.

29
00:02:04,830 --> 00:02:06,810
A lot of people don't even think to try.

30
00:02:07,830 --> 00:02:09,000
So just remember this.

31
00:02:09,000 --> 00:02:13,110
Whenever you see code, when it's time to code, you must code.

32
00:02:16,670 --> 00:02:22,400
First, we are going to talk a little bit about why you want to code by yourself. A lot of the time, what

33
00:02:22,400 --> 00:02:26,630
an algorithm does or how it works isn't quite clear the first time you learn about it.

34
00:02:27,530 --> 00:02:30,160
Every individual will have their own unique background.

35
00:02:30,560 --> 00:02:33,740
So some might be familiar with the patterns and know exactly what to do,

36
00:02:34,250 --> 00:02:37,160
while some individuals might have questions that no one else has.

37
00:02:37,970 --> 00:02:43,220
Sometimes you gloss over details because you assume you know what's going on when really there are important

38
00:02:43,220 --> 00:02:44,480
things you're not considering.

39
00:02:45,050 --> 00:02:47,930
So this is how trying to code by yourself can help you.

40
00:02:48,830 --> 00:02:49,740
Coding by yourself

41
00:02:49,760 --> 00:02:52,070
forces you to think about each and every detail.

42
00:02:52,580 --> 00:02:54,430
It forces you to think line by line.

43
00:02:55,910 --> 00:03:00,530
You have to be familiar with the data types and the shapes of all your variables, and they all have

44
00:03:00,530 --> 00:03:01,820
to fit together properly.

45
00:03:02,180 --> 00:03:03,590
Kind of like Lego blocks.

46
00:03:07,440 --> 00:03:13,140
As an example, you know that in order to do element-wise matrix addition, both matrices have to have

47
00:03:13,140 --> 00:03:14,370
the exact same shape.

48
00:03:15,120 --> 00:03:20,050
So if you find that they are not the same shape, then one of your previous assumptions was incorrect.

49
00:03:21,060 --> 00:03:22,710
So you should go back and correct it.

50
00:03:23,850 --> 00:03:28,260
Lego blocks have to follow a specific set of rules in order to fit together properly.

51
00:03:29,100 --> 00:03:34,140
If you make incorrect assumptions about how Lego blocks fit together and you try to join them, it's

52
00:03:34,140 --> 00:03:37,830
not going to work. Without trying to build things by yourself,

53
00:03:38,220 --> 00:03:40,050
you will never discover these details.

54
00:03:43,940 --> 00:03:46,340
Let's now talk about how to code by yourself.

55
00:03:47,030 --> 00:03:49,520
Consider this supervised machine learning scenario.

56
00:03:50,300 --> 00:03:55,130
We know that we're going to have some inputs and some targets, and we want to try to make predictions

57
00:03:55,400 --> 00:03:58,000
from the inputs that are very close to the targets.

58
00:03:58,850 --> 00:04:01,750
We typically call the inputs X and the targets Y.

59
00:04:02,510 --> 00:04:04,100
Sometimes we call the targets T.

60
00:04:04,160 --> 00:04:06,980
But for the purpose of this lecture, we'll call the targets Y.

61
00:04:10,010 --> 00:04:11,780
Now, this next point is a key point.

62
00:04:12,530 --> 00:04:18,140
It doesn't actually matter what X is and what Y is. It doesn't matter if you're looking at an e-commerce

63
00:04:18,220 --> 00:04:22,190
dataset and the columns of X might be time on site, time of day,

64
00:04:22,460 --> 00:04:24,650
how many pages the user looked at, and so on.

65
00:04:25,310 --> 00:04:30,740
You could just as well be looking at a dataset of X-ray images where each column is the pixel intensity

66
00:04:30,740 --> 00:04:31,400
of an image.

67
00:04:32,060 --> 00:04:33,170
This is the key point.

68
00:04:33,560 --> 00:04:35,300
We say all data is the same.

69
00:04:36,140 --> 00:04:40,790
When we do linear regression, we don't have different kinds of linear regression for e-commerce and

70
00:04:40,790 --> 00:04:41,660
X-ray images.

71
00:04:42,260 --> 00:04:45,980
Linear regression is the same algorithm no matter what your dataset is.

72
00:04:46,400 --> 00:04:48,350
So we say all data is the same.

73
00:04:49,400 --> 00:04:54,800
Theoretically, I could give you a dataset of Xs and Ys and you could train a classification algorithm

74
00:04:54,800 --> 00:04:58,490
on it without me even telling you what the Xs and Ys mean.

75
00:04:59,210 --> 00:05:01,190
You should get very comfortable with this idea.

76
00:05:04,550 --> 00:05:10,040
Sometimes people say, well, this isn't practical because I want to do concrete examples, but that's

77
00:05:10,040 --> 00:05:11,990
because they're not thinking intelligently.

78
00:05:12,650 --> 00:05:17,420
In fact, this is the most practical thing we can do because it means everything we learn,

79
00:05:17,690 --> 00:05:20,240
we can apply to any dataset that exists.

80
00:05:20,960 --> 00:05:26,270
It means that I can train a model on an e-commerce dataset, but I can also train a model on an online

81
00:05:26,270 --> 00:05:29,330
advertising dataset without learning anything new at all.

82
00:05:29,960 --> 00:05:32,360
It's truly the greatest form of lazy programming.

83
00:05:32,840 --> 00:05:35,480
Learn something once and apply it to every industry.

84
00:05:39,190 --> 00:05:45,070
One great consequence of "all data is the same" is that there is an unlimited amount of practice opportunity

85
00:05:45,070 --> 00:05:45,460
for you.

86
00:05:46,330 --> 00:05:51,700
You can download data sets from Kaggle, from Google, from Wikipedia, from Amazon or from anywhere

87
00:05:51,700 --> 00:05:53,860
else and try the algorithms you learned.

88
00:05:54,700 --> 00:05:57,170
What this means is, let's say you have a dataset

89
00:05:57,340 --> 00:06:01,020
you care about more than something we do in class. See, in class,

90
00:06:01,030 --> 00:06:03,670
we have to use datasets that everybody can understand.

91
00:06:04,210 --> 00:06:05,980
For example, text and images.

92
00:06:06,310 --> 00:06:08,440
Everybody should know what text and images are.

93
00:06:09,280 --> 00:06:12,460
Text and images are just fundamental data types on the web.

94
00:06:13,090 --> 00:06:17,830
You probably don't even think about the fact that text and images are data because it's so trivial.

95
00:06:18,910 --> 00:06:23,080
In any case, we work with text and images a lot because everybody understands them.

96
00:06:23,920 --> 00:06:26,800
But let's say you're a biologist and you know about DNA.

97
00:06:27,130 --> 00:06:29,710
You find DNA and genomics really interesting.

98
00:06:30,430 --> 00:06:31,840
So you want to use the algorithms

99
00:06:31,840 --> 00:06:32,590
we learn on that.

100
00:06:33,280 --> 00:06:33,910
Well, that's great.

101
00:06:33,910 --> 00:06:35,020
And you should totally do it.

102
00:06:35,740 --> 00:06:37,390
Remember, all data is the same.

103
00:06:37,720 --> 00:06:43,810
So all you need to do is convert your data into the appropriate Xs and Ys to plug into our machine learning

104
00:06:43,810 --> 00:06:44,470
algorithms.

105
00:06:45,760 --> 00:06:52,090
The downside is we most likely won't talk about DNA in class unless it's a very simple example because

106
00:06:52,090 --> 00:06:54,660
most computer scientists are not biologists, too.

107
00:06:55,540 --> 00:06:57,220
So they don't know what DNA is.

108
00:06:57,370 --> 00:06:58,990
They don't understand the specifics.

109
00:06:59,530 --> 00:07:01,600
So an example wouldn't make much sense to them.

110
00:07:02,350 --> 00:07:05,530
This is opposed to text and images, which make sense to everybody.

111
00:07:06,430 --> 00:07:12,120
Same goes with any other specialized field like finance, computer networking, cosmology and so on.

112
00:07:16,930 --> 00:07:20,380
So now we can start from the right place to talk about supervised learning.

113
00:07:20,920 --> 00:07:22,800
We have our Xs and we have our Ys.

114
00:07:22,930 --> 00:07:26,050
What do we do with them? In supervised learning,

115
00:07:26,110 --> 00:07:28,300
we know that there are two main things we want to do.

116
00:07:29,050 --> 00:07:33,070
We want to do training and we want to do prediction. In scikit-learn,

117
00:07:33,100 --> 00:07:34,900
all the models have the same API.

118
00:07:35,440 --> 00:07:41,070
It doesn't matter if you're doing logistic regression or decision trees or random forest, all supervised

119
00:07:41,080 --> 00:07:45,250
models in scikit-learn have the same two functions, fit and predict.

120
00:07:45,910 --> 00:07:48,550
Fitting a model is just a synonym for training a model.

121
00:07:49,180 --> 00:07:53,080
If you don't believe me, you can go look at the scikit-learn documentation yourself.

122
00:07:56,690 --> 00:08:01,690
When you're learning about supervised machine learning, what you're really learning is what goes inside

123
00:08:01,700 --> 00:08:02,660
these two functions.

124
00:08:03,230 --> 00:08:05,870
What are the parameters and how are those parameters learned?

125
00:08:06,650 --> 00:08:08,000
That's really what all supervised

126
00:08:08,000 --> 00:08:09,740
machine learning is: filling in

127
00:08:09,790 --> 00:08:15,050
these two functions. Taking this perspective is going to give you a code structure, and it's going to

128
00:08:15,050 --> 00:08:17,900
make things much easier to visualize in your head.

129
00:08:21,040 --> 00:08:27,580
Let's now do a simple example to solidify this idea. In this example, I'm going to give you an algorithm

130
00:08:27,640 --> 00:08:29,320
and you will implement it in code.

131
00:08:30,100 --> 00:08:33,310
There are two key points I want to make before I give you the algorithm.

132
00:08:34,030 --> 00:08:37,750
First, I'm not going to give you much intuition about how it works.

133
00:08:38,380 --> 00:08:41,740
Second, I'm not going to derive the theory behind why it works.

134
00:08:42,520 --> 00:08:47,620
The reason I want to mention these two key points is that you should realize you don't need these two

135
00:08:47,620 --> 00:08:51,640
pieces of information in order to translate pseudocode into code.

136
00:08:52,420 --> 00:08:58,570
A lot of the time, these three concepts, intuition, theory, and implementation, work to reinforce each

137
00:08:58,570 --> 00:08:58,870
other.

138
00:08:59,740 --> 00:09:05,280
The reason I'm focusing on implementation right now is because it's the one people most often miss or

139
00:09:05,290 --> 00:09:10,240
think they don't need at all, when really it's an extremely important part of the learning process.

140
00:09:13,680 --> 00:09:14,000
OK.

141
00:09:14,090 --> 00:09:15,980
So the pseudocode is as follows.

142
00:09:16,490 --> 00:09:18,770
In my predict function, as you already know,

143
00:09:18,890 --> 00:09:24,830
I'm going to take in some input data X. My predictions will be Y hat equals X times W.

144
00:09:25,790 --> 00:09:28,430
You might already recognize this as linear regression.

145
00:09:29,120 --> 00:09:34,250
However, forget what you know about linear regression and pretend that this is just some formula I

146
00:09:34,250 --> 00:09:34,730
gave you.

147
00:09:35,570 --> 00:09:38,960
What I can tell you is that we are creating some kind of regression model.

148
00:09:39,770 --> 00:09:44,240
It should be clear from this equation that the parameters of the model are contained in this vector

149
00:09:44,240 --> 00:09:44,620
of weights, W.

150
00:09:44,630 --> 00:09:51,020
I would probably also tell you that W is a vector that's the same size as the number of features

151
00:09:51,020 --> 00:09:51,610
in X.

152
00:09:52,340 --> 00:09:56,060
Of course, this must be true in order for the matrix multiply to be valid.

153
00:09:56,930 --> 00:09:58,670
This acts as kind of a sanity check,

154
00:09:58,700 --> 00:10:01,130
so you can make sure everything is making sense.

155
00:10:04,550 --> 00:10:10,700
In my fit function, I want to perform this loop some number of times. Since we know from the previous

156
00:10:10,700 --> 00:10:15,950
slide that the parameters of the model are contained in W, then it should be clear that inside the fit

157
00:10:15,950 --> 00:10:18,410
function, what we want to do is update W.

158
00:10:19,160 --> 00:10:21,440
Not surprisingly, that's what's happening here.

159
00:10:21,800 --> 00:10:23,270
Another good sanity check.

160
00:10:24,290 --> 00:10:29,300
Now, because you're already familiar with linear regression, you probably already recognize this as

161
00:10:29,300 --> 00:10:30,110
gradient descent.

162
00:10:30,830 --> 00:10:34,700
Again, I want you to pretend that you don't know that. But what do you know?

163
00:10:38,630 --> 00:10:43,490
You know that iterative algorithms are common in machine learning and that gradient descent is among

164
00:10:43,490 --> 00:10:44,360
the most common.

165
00:10:45,020 --> 00:10:49,820
You could probably infer that this is some form of gradient descent, but you don't need to know what

166
00:10:49,820 --> 00:10:51,590
it is in order to write it in code.

167
00:10:52,430 --> 00:10:53,270
What else do we know?

168
00:10:54,260 --> 00:10:55,520
You know what a for loop is.

169
00:10:55,540 --> 00:10:57,290
And you know how to write one in Python.

170
00:10:58,040 --> 00:11:01,310
You know that X is an N by D matrix containing the input data.

171
00:11:01,670 --> 00:11:04,550
And that Y is an N-size vector containing the targets.

172
00:11:05,570 --> 00:11:08,900
You know that Y hat is an N-size vector containing the predictions.

173
00:11:09,560 --> 00:11:13,250
You know that W is a D-size vector containing the model parameters.

174
00:11:13,790 --> 00:11:19,070
And you know how to add, subtract, and multiply these objects. So we can write this algorithm without

175
00:11:19,100 --> 00:11:20,480
even knowing how it works.

176
00:11:21,410 --> 00:11:24,650
Of course, a lot about how it works can just be inferred from this formula.

177
00:11:28,790 --> 00:11:30,110
Here is what you don't know.

178
00:11:30,470 --> 00:11:33,410
You don't know T, the number of times the loop is supposed to run.

179
00:11:33,950 --> 00:11:35,540
You don't know eta, the learning rate.

180
00:11:36,170 --> 00:11:37,490
This is completely fine.

181
00:11:38,090 --> 00:11:41,030
You can't let this lack of knowledge stop you in your tracks.

182
00:11:41,810 --> 00:11:46,580
The truth of the matter is there are going to be situations where you're not told the exact numbers

183
00:11:46,580 --> 00:11:47,240
to plug in.

184
00:11:47,900 --> 00:11:51,560
In many cases, the answer is going to be, "it depends on the problem."

185
00:11:52,430 --> 00:11:58,460
The important thing is you have to get used to the idea of trial and error and you have to make educated

186
00:11:58,460 --> 00:12:00,590
guesses based on what you already know.

187
00:12:04,160 --> 00:12:08,900
So what can you do? Well, you know that the learning rate is supposed to be a small number.

188
00:12:09,320 --> 00:12:09,950
How small?

189
00:12:10,250 --> 00:12:11,540
It depends on the problem.

190
00:12:12,350 --> 00:12:15,350
Typically, it's a number less than one, like 0.1.

191
00:12:15,950 --> 00:12:17,900
If that doesn't work, you might try lowering it.

192
00:12:18,560 --> 00:12:20,660
Typically, we lower it by factors of 10.

193
00:12:21,320 --> 00:12:25,050
So, for example, the next learning rates we would try are ten to the minus two,

194
00:12:25,100 --> 00:12:26,810
ten to the minus three, and so on.

195
00:12:30,990 --> 00:12:36,330
Typically in machine learning, whether you're looking at a supervised or unsupervised algorithm, there

196
00:12:36,330 --> 00:12:39,990
is an associated cost or objective function that you're trying to minimize.

197
00:12:40,830 --> 00:12:45,300
Let's suppose I tell you it's the squared error, which is typically what we would use for regression.

198
00:12:46,120 --> 00:12:50,370
Well, now you have a way to choose the number of iterations of the loop and the learning rate.

199
00:12:50,970 --> 00:12:51,870
How do we do this?

200
00:12:56,220 --> 00:13:00,600
So inside your fit function, you would plot the cost as a function of iteration.

201
00:13:01,410 --> 00:13:04,710
I always recommend doing this no matter what algorithm you're looking at.

202
00:13:05,640 --> 00:13:08,130
In general, you always want the cost to converge.

203
00:13:08,850 --> 00:13:14,010
The pattern you typically see is that there is a steep drop at the beginning and it slowly flattens

204
00:13:14,010 --> 00:13:19,770
out as the number of iterations increases. To choose the number of iterations, you want to stop when the

205
00:13:19,770 --> 00:13:21,300
curve is sufficiently flat.

206
00:13:22,170 --> 00:13:25,320
You might recognize this as a diminishing returns scenario.

207
00:13:26,480 --> 00:13:32,240
The returns diminish because at the beginning, we reduce the cost by a lot in just a few iterations.

208
00:13:32,810 --> 00:13:36,810
At the end we can do a lot of iterations, but we reduce the cost by almost nothing.

209
00:13:40,010 --> 00:13:45,950
An alternative is you could use a separate validation set and stop when the validation cost increases.

210
00:13:46,070 --> 00:13:48,110
But that is not the focus of this lecture.

211
00:13:48,800 --> 00:13:54,200
Unlike the training cost, the validation cost is not guaranteed to decrease at every round because

212
00:13:54,200 --> 00:13:55,660
that is not what we are minimizing

213
00:13:56,000 --> 00:13:56,960
with respect to.

214
00:14:00,810 --> 00:14:06,450
You can also use the same plot to help improve your learning rate. If the cost explodes or becomes not

215
00:14:06,450 --> 00:14:09,210
a number, your learning rate is probably too high.

216
00:14:09,840 --> 00:14:13,620
If your cost converges too slowly, you can try increasing your learning rate.

217
00:14:17,990 --> 00:14:19,940
So what does your final code look like?

218
00:14:21,870 --> 00:14:25,710
Well, we know we want to create a class with at least the predict and fit functions.

219
00:14:26,220 --> 00:14:30,870
You don't have to use a class, but I find that it encapsulates the code nicely and provides useful

220
00:14:30,870 --> 00:14:31,470
structure.

221
00:14:33,210 --> 00:14:38,280
Notice that once you have the algorithm in math, there isn't much work to convert it into NumPy code.

222
00:14:38,970 --> 00:14:40,950
You know how to do matrix multiplication.

223
00:14:41,310 --> 00:14:42,210
You know how to add.

224
00:14:42,420 --> 00:14:43,530
You know how to subtract.

225
00:14:44,040 --> 00:14:47,370
These are basic arithmetic operations that I'm sure you know already.

226
00:14:48,240 --> 00:14:52,620
If you don't know how to do these in NumPy, then I have a completely free course on

227
00:14:52,620 --> 00:14:54,300
the NumPy stack that you can take.

228
00:14:58,010 --> 00:15:02,690
Here is the second part of the code where we actually create an instance of the class and then use it.

229
00:15:03,410 --> 00:15:07,100
I've also included the squared error cost function just for completeness' sake.

230
00:15:12,320 --> 00:15:16,970
What's cool about this template is that it doesn't really change from one algorithm to the next.

231
00:15:17,480 --> 00:15:22,670
The structure is basically always going to be this way, at least for supervised learning and unsupervised

232
00:15:22,670 --> 00:15:23,030
learning.

233
00:15:23,930 --> 00:15:28,310
So as we discussed before, it doesn't matter what algorithm you're implementing, whether it be linear

234
00:15:28,310 --> 00:15:31,700
regression, logistic regression, a neural network and so on.

235
00:15:32,000 --> 00:15:37,550
They always have the predict function and the fit function, and learning what goes inside these functions

236
00:15:37,850 --> 00:15:39,880
is equivalent to learning the algorithm.

237
00:15:41,510 --> 00:15:46,910
So, in fact, one way to begin coding is to start with just this boilerplate code and then fill in

238
00:15:46,910 --> 00:15:47,600
the blanks.