1
00:00:02,190 --> 00:00:04,440
Hey everyone, and welcome back to this class.

2
00:00:06,410 --> 00:00:10,640
This lecture is about how to write code independently when you're doing machine learning.

3
00:00:11,840 --> 00:00:16,400
You always hear me say that the first thing you should try to do when you learn an algorithm is to go

4
00:00:16,400 --> 00:00:20,150
and write it up in code by yourself before looking at someone else's code.

5
00:00:20,990 --> 00:00:25,040
For some students who are new to machine learning, it's not quite clear how to do that.

6
00:00:25,700 --> 00:00:30,950
So this lecture is going to explain the thought process you should go through in hopes that it encourages

7
00:00:30,950 --> 00:00:33,130
more people to try to code by themselves.

8
00:00:36,530 --> 00:00:41,350
I like to encapsulate this in the phrase: "When it's time to code, you must code."

9
00:00:41,990 --> 00:00:46,820
So remember this whenever you see some code that demonstrates the concepts we learned in this course.

10
00:00:47,420 --> 00:00:53,030
Even if you don't fully understand what's going on, coding and copying examples from others helps you

11
00:00:53,030 --> 00:00:54,050
build muscle memory.

12
00:00:58,460 --> 00:01:02,900
Sometimes people act like they are going to learn tennis by reading a book about tennis.

13
00:01:03,560 --> 00:01:05,790
This is, of course, obviously not possible.

14
00:01:06,680 --> 00:01:11,300
You need to learn some things about tennis, which are pure fact and reside in your brain as conscious

15
00:01:11,300 --> 00:01:11,840
facts.

16
00:01:12,140 --> 00:01:15,520
But then you have to go out and practice those concepts on the tennis court.

17
00:01:16,310 --> 00:01:21,440
Once you've exhausted and mastered everything you know so far, then you can learn new techniques.

18
00:01:21,980 --> 00:01:26,240
So you have this continuous cycle of learning new techniques and then practicing them.

19
00:01:29,640 --> 00:01:36,630
Why is this? Because learning the technique as pure fact doesn't mean you truly understand it. Practicing

20
00:01:36,630 --> 00:01:40,050
the technique helps you think about the technique from a different perspective.

21
00:01:40,560 --> 00:01:46,200
Practicing again and again leads to new understanding, to learning in the subconscious, which is muscle

22
00:01:46,200 --> 00:01:46,580
memory.

23
00:01:47,100 --> 00:01:48,560
You must go through the cycle.

24
00:01:49,440 --> 00:01:54,360
But some students naively think they're going to read an entire book about tennis and then be a tennis

25
00:01:54,360 --> 00:01:56,690
master the first time they try playing.

26
00:01:57,330 --> 00:02:02,820
This is very common, especially when all you really need to do is sit there and watch a video and no

27
00:02:02,820 --> 00:02:03,990
one can force you to type.

28
00:02:04,710 --> 00:02:06,820
A lot of people don't even think to try.

29
00:02:07,740 --> 00:02:09,000
So just remember this

30
00:02:09,000 --> 00:02:13,110
whenever you see code: when it's time to code, you must code.

31
00:02:16,630 --> 00:02:22,180
First, we are going to talk a little bit about why you want to code by yourself. A lot of the time,

32
00:02:22,180 --> 00:02:25,360
what an algorithm does or how it works isn't quite clear

33
00:02:25,360 --> 00:02:30,160
the first time you learn about it. Every individual will have their own unique background.

34
00:02:30,490 --> 00:02:35,440
So some might be familiar with the patterns and know exactly what to do, while other individuals might

35
00:02:35,440 --> 00:02:37,120
have questions that no one else has.

36
00:02:37,900 --> 00:02:43,210
Sometimes you gloss over details because you assume you know what's going on when really there are important

37
00:02:43,210 --> 00:02:44,440
things you're not considering.

38
00:02:44,980 --> 00:02:47,920
So this is how trying to code by yourself can help you.

39
00:02:48,760 --> 00:02:52,070
Coding by yourself forces you to think about each and every detail.

40
00:02:52,480 --> 00:02:54,460
It forces you to think line by line.

41
00:02:55,810 --> 00:03:00,520
You have to be familiar with the data types and the shapes of all your variables, and they all have

42
00:03:00,520 --> 00:03:03,550
to fit together properly, kind of like Lego blocks.

43
00:03:07,400 --> 00:03:13,130
As an example, you know that in order to do element-wise matrix addition, both matrices have to have

44
00:03:13,130 --> 00:03:14,330
the exact same shape.

45
00:03:15,050 --> 00:03:20,030
So if you find that they are not the same shape, then one of your previous assumptions was incorrect.

46
00:03:20,990 --> 00:03:22,730
So you should go back and correct it.

47
00:03:23,780 --> 00:03:28,240
Lego blocks have to follow a specific set of rules in order to fit together properly.

48
00:03:29,000 --> 00:03:33,900
If you make incorrect assumptions about how Lego blocks fit together and you try to join them,

49
00:03:33,920 --> 00:03:37,840
it's not going to work. Without trying to build things by yourself,

50
00:03:38,120 --> 00:03:40,040
you will never discover these details.
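
To make the Lego block idea concrete, here is a minimal sketch, assuming NumPy. The function name add_elementwise and the arrays A and B are made up for illustration. Keep in mind that NumPy will sometimes broadcast arrays of different shapes on its own, so an explicit check like this is stricter than a plain A + B.

```python
import numpy as np

def add_elementwise(A, B):
    """Add two arrays, insisting that their shapes match exactly."""
    if A.shape != B.shape:
        # A mismatch usually means an earlier assumption about the data was wrong.
        raise ValueError(f"shape mismatch: {A.shape} vs {B.shape}")
    return A + B

A = np.ones((3, 2))
B = np.full((3, 2), 2.0)
C = add_elementwise(A, B)  # works: both arrays are 3 x 2
```

If the shapes disagree, the error tells you to go back and find the incorrect assumption rather than silently producing a result.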

51
00:03:43,840 --> 00:03:46,310
Let's now talk about how to code by yourself.

52
00:03:46,960 --> 00:03:49,510
Consider the supervised machine learning scenario.

53
00:03:50,230 --> 00:03:55,150
We know that we're going to have some inputs and some targets, and we want to try to make predictions

54
00:03:55,360 --> 00:03:58,000
from the inputs that are very close to the targets.

55
00:03:58,750 --> 00:04:01,390
We typically call the inputs X and the targets

56
00:04:01,390 --> 00:04:01,780
Y.

57
00:04:02,440 --> 00:04:07,000
Sometimes we call the targets T, but for the purpose of this lecture, we'll call the targets Y.

58
00:04:09,910 --> 00:04:16,360
Now, this next point is a key point: it doesn't actually matter what X is and what Y is. It doesn't

59
00:04:16,360 --> 00:04:21,730
matter if you're looking at an e-commerce data set where the columns of X might be time on site, time

60
00:04:21,730 --> 00:04:24,640
of day, how many pages the user looked at and so on.

61
00:04:25,210 --> 00:04:30,730
You could just as well be looking at a data set of X-ray images where each column is the pixel intensity

62
00:04:30,730 --> 00:04:31,360
of an image.

63
00:04:31,990 --> 00:04:33,190
This is the key point.

64
00:04:33,460 --> 00:04:37,650
We say all data is the same. When we do linear regression,

65
00:04:37,660 --> 00:04:41,650
we don't have different kinds of linear regression for e-commerce and X-ray images.

66
00:04:42,190 --> 00:04:45,930
Linear regression is the same algorithm no matter what your data set is.

67
00:04:46,300 --> 00:04:48,310
So we say all data is the same.

68
00:04:49,350 --> 00:04:54,810
Theoretically, I could give you a data set of Xs and Ys and you could train a classification algorithm

69
00:04:54,810 --> 00:05:00,720
on it without me even telling you what the Xs and Ys mean. You should get very comfortable with this

70
00:05:00,720 --> 00:05:01,200
idea.

71
00:05:04,470 --> 00:05:10,020
Sometimes people say, well, this isn't practical because I want to do concrete examples, but that's

72
00:05:10,020 --> 00:05:11,940
because they're not thinking intelligently.

73
00:05:12,540 --> 00:05:17,400
In fact, this is the most practical thing we can do because it means everything we learn,

74
00:05:17,610 --> 00:05:20,190
we can apply to any data set that exists.

75
00:05:20,820 --> 00:05:26,280
It means that I can train a model on an e-commerce data set, but I can also train a model on an online

76
00:05:26,280 --> 00:05:29,320
advertising data set without learning anything new at all.

77
00:05:29,850 --> 00:05:32,330
It's truly the greatest form of lazy programming.

78
00:05:32,760 --> 00:05:35,460
Learn something once and apply it to every industry.

79
00:05:39,090 --> 00:05:45,090
One great consequence of "all data is the same" is that there is an unlimited amount of practice opportunity

80
00:05:45,090 --> 00:05:45,470
for you.

81
00:05:46,230 --> 00:05:51,690
You can download data sets from Kaggle, from Google, from Wikipedia, from Amazon or from anywhere

82
00:05:51,690 --> 00:05:53,850
else and try the algorithms you learned.

83
00:05:54,630 --> 00:06:00,560
What this means is, let's say you have a data set you care about more than something we do in class. See,

84
00:06:00,600 --> 00:06:01,040
in class,

85
00:06:01,050 --> 00:06:05,940
we have to use data sets that everybody can understand, for example, text and images.

86
00:06:06,210 --> 00:06:08,460
Everybody should know what text and images are.

87
00:06:09,240 --> 00:06:12,450
Text and images are just fundamental data types on the web.

88
00:06:12,990 --> 00:06:17,820
You probably don't even think about the fact that text and images are data because it's so trivial.

89
00:06:18,780 --> 00:06:23,100
In any case, we work with text and images a lot because everybody understands them.

90
00:06:23,850 --> 00:06:26,820
But let's say you're a biologist and you know about DNA.

91
00:06:27,060 --> 00:06:29,720
You find DNA and genomics really interesting.

92
00:06:30,360 --> 00:06:32,600
So you want to use the algorithms we learn on that.

93
00:06:33,210 --> 00:06:33,900
Well, that's great.

94
00:06:33,900 --> 00:06:35,030
And you should totally do it.

95
00:06:35,640 --> 00:06:37,410
Remember, all data is the same.

96
00:06:37,680 --> 00:06:43,530
So all you need to do is convert your data into the appropriate Xs and Ys to plug into our machine

97
00:06:43,530 --> 00:06:44,440
learning algorithms.

98
00:06:45,690 --> 00:06:52,080
The downside is we most likely won't talk about DNA in class unless it's a very simple example because

99
00:06:52,080 --> 00:06:54,720
most computer scientists are not biologists too.

100
00:06:55,440 --> 00:06:57,300
So they don't know what DNA is.

101
00:06:57,300 --> 00:06:58,970
They don't understand the specifics.

102
00:06:59,430 --> 00:07:01,600
So an example wouldn't make much sense to them.

103
00:07:02,310 --> 00:07:05,520
This is opposed to text and images, which make sense to everybody.

104
00:07:06,360 --> 00:07:12,090
Same goes with any other specialized field like finance, computer networking, cosmology and so on.

105
00:07:16,860 --> 00:07:21,840
So now we can start off from the right place to talk about supervised learning. We have our Xs and

106
00:07:21,840 --> 00:07:26,060
we have our Ys. What do we do with them? In supervised learning,

107
00:07:26,070 --> 00:07:28,290
we know that there are two main things we want to do.

108
00:07:28,960 --> 00:07:33,090
We want to do training and we want to do prediction. In scikit-learn,

109
00:07:33,090 --> 00:07:34,910
all the models have the same API.

110
00:07:35,310 --> 00:07:41,070
It doesn't matter if you're doing logistic regression or decision trees or random forest. All supervised

111
00:07:41,070 --> 00:07:45,240
models in scikit-learn have the same two functions: fit and predict.

112
00:07:45,840 --> 00:07:48,570
Fitting a model is just a synonym for training a model.

113
00:07:49,080 --> 00:07:53,070
If you don't believe me, you can go look at the scikit-learn documentation yourself.

114
00:07:56,590 --> 00:08:01,690
When you're learning about supervised machine learning, what you're really learning is what goes inside

115
00:08:01,690 --> 00:08:05,890
these two functions: what are the parameters, and how are those parameters learned?

116
00:08:06,580 --> 00:08:07,990
That's really what all supervised

117
00:08:07,990 --> 00:08:10,810
machine learning is: filling in these two functions.

118
00:08:11,470 --> 00:08:16,300
Taking this perspective is going to give your code structure and it's going to make things much easier

119
00:08:16,510 --> 00:08:17,890
to visualize in your head.

120
00:08:20,970 --> 00:08:27,570
Let's now do a simple example to solidify this idea. In this example, I'm going to give you an algorithm

121
00:08:27,570 --> 00:08:29,300
and you will implement it in code.

122
00:08:30,030 --> 00:08:33,290
There are two key points I want to make before I give you the algorithm.

123
00:08:33,990 --> 00:08:37,710
First, I'm not going to give you much intuition about how it works.

124
00:08:38,280 --> 00:08:41,710
Second, I'm not going to derive the theory behind why it works.

125
00:08:42,450 --> 00:08:47,640
The reason I want to mention these two key points is that you should realize you don't need these two

126
00:08:47,640 --> 00:08:51,650
pieces of information in order to translate pseudocode into code.

127
00:08:52,290 --> 00:08:58,560
A lot of the time, these three concepts (intuition, theory, and implementation) work to reinforce each

128
00:08:58,560 --> 00:08:58,860
other.

129
00:08:59,670 --> 00:09:05,310
The reason I'm focusing on implementation right now is because it's the one people most often miss or

130
00:09:05,310 --> 00:09:10,200
think they don't need at all, when really it's an extremely important part of the learning process.

131
00:09:13,540 --> 00:09:19,630
OK, so the pseudocode is as follows. In my predict function, as you already know, I'm going to take

132
00:09:19,630 --> 00:09:24,850
in some input data X. My predictions will be Y hat equals X times W.

133
00:09:25,690 --> 00:09:28,420
You might already recognize this as linear regression.

134
00:09:29,050 --> 00:09:34,240
However, forget what you know about linear regression and pretend that this is just some formula I

135
00:09:34,240 --> 00:09:34,720
gave you.

136
00:09:35,500 --> 00:09:38,950
What I can tell you is that we are creating some kind of regression model.

137
00:09:39,650 --> 00:09:44,230
It should be clear from this equation that the parameters of the model are contained in this vector

138
00:09:44,230 --> 00:09:44,650
of weights,

139
00:09:44,650 --> 00:09:51,010
W. I would probably also tell you that W is a vector that's the same size as the number of features

140
00:09:51,010 --> 00:09:51,580
in X.

141
00:09:52,210 --> 00:09:58,360
Of course, this must be true in order for the matrix multiply to be valid. This acts as kind of a sanity

142
00:09:58,360 --> 00:10:01,120
check so you can make sure everything is making sense.
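
Here is a rough sketch of that predict step with the sanity check written out, assuming NumPy. The names N and D (number of samples and number of features) and the example numbers are my own choices for illustration.

```python
import numpy as np

def predict(X, w):
    """Return y_hat = X w for an N x D matrix X and a length-D vector w."""
    N, D = X.shape
    # Sanity check: the matrix multiply is only valid if w has D entries.
    assert w.shape == (D,), f"expected w of shape ({D},), got {w.shape}"
    return X @ w

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])  # N = 3 samples, D = 2 features
w = np.array([0.5, -1.0])
y_hat = predict(X, w)       # one prediction per row of X
```

The result has one entry per sample, so its shape is another thing you can check against what you expect.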

143
00:10:04,450 --> 00:10:10,720
In my fit function, I want to perform this loop some number of times. Since we know from the previous

144
00:10:10,720 --> 00:10:15,970
slide that the parameters of the model are contained in W, then it should be clear that inside the fit

145
00:10:15,970 --> 00:10:18,400
function, what we want to do is update W.

146
00:10:19,060 --> 00:10:21,460
Not surprisingly, that's what's happening here.

147
00:10:21,640 --> 00:10:23,260
Another good sanity check.

148
00:10:24,180 --> 00:10:29,310
Now, because you're already familiar with linear regression, you probably already recognize this as

149
00:10:29,310 --> 00:10:30,120
gradient descent.

150
00:10:30,690 --> 00:10:34,710
Again, I want you to pretend that you don't know that. But what do you know?

151
00:10:38,530 --> 00:10:43,150
You know that iterative algorithms are common in machine learning, and that gradient descent is

152
00:10:43,150 --> 00:10:48,880
among the most common. You could probably infer that this is some form of gradient descent, but you

153
00:10:48,880 --> 00:10:51,560
don't need to know what it is in order to write it in code.

154
00:10:52,330 --> 00:10:53,280
What else do we know?

155
00:10:54,130 --> 00:10:59,500
You know what a for loop is and you know how to write one in Python. You know that X is an N by D

156
00:10:59,500 --> 00:11:04,540
matrix containing the input data and that Y is an N-sized vector containing the targets.

157
00:11:05,470 --> 00:11:11,290
You know that Y hat is an N-sized vector containing the predictions. You know that W is a D-sized

158
00:11:11,290 --> 00:11:16,810
vector containing the model parameters, and you know how to add, subtract, and multiply these objects.

159
00:11:17,260 --> 00:11:20,450
So we can write this algorithm without even knowing how it works.

160
00:11:21,250 --> 00:11:24,690
Of course, a lot about how it works can just be inferred from this formula.
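
The slide with the pseudocode isn't reproduced in this transcript, so here is a sketch under an assumption: that the update inside the loop is W = W - eta * X^T (Y hat - Y), which is the usual gradient descent step for squared error. The default values for eta and n_iters are guesses, not numbers from the lecture.

```python
import numpy as np

def fit(X, y, eta=0.1, n_iters=1000):
    """Learn w by repeating the (assumed) update w <- w - eta * X^T (y_hat - y)."""
    N, D = X.shape
    w = np.zeros(D)                            # one parameter per feature
    for _ in range(n_iters):
        y_hat = X @ w                          # length-N predictions
        w = w - eta * (X.T @ (y_hat - y))      # length-D update, same shape as w
    return w

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = X @ np.array([2.0, -3.0])                  # targets generated from known weights
w = fit(X, y)
```

Notice that every line only needs the shapes to fit together: X.T is D by N, the difference is N-sized, so the product is D-sized and can be subtracted from w.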

161
00:11:28,740 --> 00:11:33,420
Here is what you don't know. You don't know the number of times the loop is supposed to run.

162
00:11:33,870 --> 00:11:35,530
You don't know eta, the learning rate.

163
00:11:36,090 --> 00:11:37,430
This is completely fine.

164
00:11:37,980 --> 00:11:41,010
You can't let this lack of knowledge stop you in your tracks.

165
00:11:41,760 --> 00:11:46,680
The truth of the matter is there are going to be situations where you're not told the exact numbers to

166
00:11:46,680 --> 00:11:47,240
plug in.

167
00:11:47,760 --> 00:11:51,570
In many cases, the answer is going to be, "it depends on the problem."

168
00:11:52,410 --> 00:11:58,470
The important thing is you have to get used to the idea of trial and error and you have to make educated

169
00:11:58,470 --> 00:12:00,570
guesses based on what you already know.

170
00:12:04,100 --> 00:12:09,950
So what can you do? Well, you know that the learning rate is supposed to be a small number. How small?

171
00:12:10,130 --> 00:12:11,520
It depends on the problem.

172
00:12:12,320 --> 00:12:15,320
Typically, it's a number less than one, like 0.1.

173
00:12:15,830 --> 00:12:17,930
If that doesn't work, you might try lowering it.

174
00:12:18,500 --> 00:12:20,650
Typically, we lower it by factors of 10.

175
00:12:21,260 --> 00:12:26,120
So, for example, the next learning rate we would try is ten to the minus two, ten to the minus three,

176
00:12:26,120 --> 00:12:26,810
and so on.

177
00:12:30,950 --> 00:12:36,320
Typically in machine learning, whether you're looking at a supervised or unsupervised algorithm, there

178
00:12:36,320 --> 00:12:39,940
is an associated cost or objective function that you're trying to minimize.

179
00:12:40,730 --> 00:12:45,260
Let's suppose I tell you it's the squared error, which is typically what we would use for regression.

180
00:12:46,010 --> 00:12:50,380
Well, now you have a way to choose the number of iterations of the loop and the learning rate.

181
00:12:50,870 --> 00:12:51,860
How do we do this?

182
00:12:56,140 --> 00:13:02,380
So inside your fit function, you would plot the cost as a function of iteration. I always recommend

183
00:13:02,380 --> 00:13:02,890
doing this,

184
00:13:02,890 --> 00:13:08,110
no matter what algorithm you're looking at. In general, you always want the cost to converge.

185
00:13:08,770 --> 00:13:13,990
The pattern you typically see is that there is a steep drop at the beginning and it slowly flattens

186
00:13:13,990 --> 00:13:19,540
out as the number of iterations increases. To choose the number of iterations, you want to stop

187
00:13:19,540 --> 00:13:25,350
when the curve is sufficiently flat. You might recognize this as a diminishing returns scenario.

188
00:13:26,410 --> 00:13:32,230
The returns diminish because at the beginning, we reduce the cost by a lot in just a few iterations.

189
00:13:32,680 --> 00:13:36,820
At the end, we can do a lot of iterations, but we reduce the cost by almost nothing.
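
As a sketch of how you might record the cost per iteration and stop once the curve is flat, assuming NumPy: the update rule, the tolerance tol, and the function names here are my own assumptions, not something fixed by the lecture.

```python
import numpy as np

def squared_error(y, y_hat):
    """Sum of squared differences between targets and predictions."""
    return np.sum((y - y_hat) ** 2)

def fit_with_history(X, y, eta=0.1, max_iters=1000, tol=1e-8):
    """Run the loop, saving the cost at each iteration; stop when it flattens."""
    w = np.zeros(X.shape[1])
    costs = []
    for _ in range(max_iters):
        y_hat = X @ w
        costs.append(squared_error(y, y_hat))
        # Diminishing returns: stop once the latest improvement is tiny.
        if len(costs) > 1 and costs[-2] - costs[-1] < tol:
            break
        w = w - eta * (X.T @ (y_hat - y))  # assumed gradient descent update
    return w, costs

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([2.0, -3.0, -1.0])
w, costs = fit_with_history(X, y)
```

The costs list is exactly what you would hand to a plotting library, for example matplotlib's plt.plot(costs), to see the steep drop followed by the flat tail.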

190
00:13:39,870 --> 00:13:45,930
An alternative is you could use a separate validation set and stop when the validation cost increases,

191
00:13:46,050 --> 00:13:48,120
but that is not the focus of this lecture.

192
00:13:48,690 --> 00:13:54,210
Unlike the training cost, the validation cost is not guaranteed to decrease at every round because

193
00:13:54,210 --> 00:13:56,910
that is not what we are minimizing.

194
00:14:00,710 --> 00:14:06,440
You can also use the same plot to help improve your learning rate. If the cost explodes or becomes not

195
00:14:06,440 --> 00:14:11,670
a number, your learning rate is probably too high. If your cost converges too slowly,

196
00:14:11,690 --> 00:14:13,640
you can try increasing your learning rate.
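
Here is one way this trial and error could look in code, assuming NumPy. The helper names, the candidate learning rates, and the rule "worse than the starting cost means diverged" are all my own choices for illustration.

```python
import numpy as np

def final_cost(X, y, eta, n_iters=100):
    """Squared error after n_iters of the (assumed) gradient descent update."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w = w - eta * (X.T @ (X @ w - y))
    return np.sum((y - X @ w) ** 2)

def try_learning_rates(X, y, etas, n_iters=100):
    """Map each learning rate to its final cost, or None if it diverged."""
    start = np.sum(y ** 2)  # cost at the starting point w = 0
    results = {}
    for eta in etas:
        cost = final_cost(X, y, eta, n_iters)
        # Exploding or not-a-number cost: the learning rate is too high.
        diverged = (not np.isfinite(cost)) or cost > start
        results[eta] = None if diverged else cost
    return results

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([2.0, -3.0, -1.0])
results = try_learning_rates(X, y, [1.0, 0.1, 0.01, 0.001])  # factors of 10
```

On this toy data the largest rate blows up, and among the rest the smaller rates leave a higher cost after the same number of iterations, which is the "converges too slowly" case.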

197
00:14:17,890 --> 00:14:19,950
So what does your final code look like?

198
00:14:21,790 --> 00:14:26,740
Well, we know we want to create a class with at least the predict and fit functions. You don't have

199
00:14:26,740 --> 00:14:31,480
to use a class, but I find that it encapsulates the code nicely and provides useful structure.

200
00:14:33,120 --> 00:14:37,910
Notice that once you have the algorithm in math, there isn't much work to convert it into NumPy

201
00:14:37,920 --> 00:14:38,260
code.

203
00:14:41,010 --> 00:14:43,100
You know how to do matrix multiplication.

204
00:14:43,380 --> 00:14:45,660
You know how to add. You know how to subtract.

205
00:14:46,180 --> 00:14:49,530
These are basic arithmetic operations that I'm sure you know already.

206
00:14:50,280 --> 00:14:51,600
If you don't know how to do these

207
00:14:51,600 --> 00:14:56,460
in NumPy, then I have a completely free NumPy course on the NumPy stack that you can take.

208
00:15:00,110 --> 00:15:04,860
Here is the second part of the code where we actually create an instance of the class and then use it.

209
00:15:05,480 --> 00:15:09,260
I've also included the squared error cost function just for completeness' sake.
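
The code on the slides isn't part of this transcript, so the following is only a sketch of what such a class and its usage could look like, assuming NumPy. The class name, the default hyperparameters, the toy data, and the exact update rule are assumptions on my part.

```python
import numpy as np

class GradientDescentRegressor:  # hypothetical name
    def __init__(self, eta=0.1, n_iters=100):
        self.eta = eta          # learning rate
        self.n_iters = n_iters  # number of times the loop runs
        self.w = None           # model parameters, set by fit

    def fit(self, X, y):
        N, D = X.shape
        self.w = np.zeros(D)
        for _ in range(self.n_iters):
            y_hat = self.predict(X)
            # Assumed update: w <- w - eta * X^T (y_hat - y)
            self.w = self.w - self.eta * (X.T @ (y_hat - y))
        return self

    def predict(self, X):
        return X @ self.w

def squared_error(y, y_hat):
    return np.sum((y - y_hat) ** 2)

# Second part: create an instance of the class and use it.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([2.0, -3.0, -1.0])
model = GradientDescentRegressor(eta=0.1, n_iters=200)
model.fit(X, y)
cost = squared_error(y, model.predict(X))
```

Once the math is written down, the body of each method is only a line or two of matrix multiplication, addition, and subtraction.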

210
00:15:14,380 --> 00:15:19,100
What's cool about this template is that it doesn't really change from one algorithm to the next.

211
00:15:19,600 --> 00:15:24,820
The structure is basically always going to be this way, at least for supervised learning and unsupervised

212
00:15:24,820 --> 00:15:25,200
learning.

213
00:15:26,020 --> 00:15:30,490
So as we discussed before, it doesn't matter what algorithm you're implementing, whether it be linear

214
00:15:30,490 --> 00:15:35,830
regression, logistic regression, a neural network and so on, they always have the predict function

215
00:15:35,830 --> 00:15:42,040
and the fit function and learning what goes inside these functions is equivalent to learning the algorithm.

216
00:15:43,580 --> 00:15:49,100
So, in fact, one way to begin coding is to start with just this boilerplate code and then fill in

217
00:15:49,100 --> 00:15:49,730
the blanks.
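
As a sketch of what that boilerplate might look like before the blanks are filled in, here is a bare template in plain Python. The class names are hypothetical, and MeanModel is just a toy I made up to show one way of filling in the two functions.

```python
class Model:
    """Boilerplate: every supervised model exposes fit and predict."""

    def fit(self, X, y):
        # Blank to fill in: how are the parameters learned?
        raise NotImplementedError

    def predict(self, X):
        # Blank to fill in: how are predictions computed from the parameters?
        raise NotImplementedError

class MeanModel(Model):
    """Toy fill-in: always predict the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)  # the only learned parameter
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]  # one prediction per input row

model = MeanModel().fit([[1], [2], [3]], [2.0, 4.0, 6.0])
predictions = model.predict([[10], [20]])
```

Swapping in a different algorithm means rewriting only the bodies of fit and predict, while the code that calls them stays the same.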