1
00:00:11,570 --> 00:00:16,510
In this lecture we are going to start looking at machine learning from a new perspective.

2
00:00:16,580 --> 00:00:21,650
While this example might seem overly simplistic you might be surprised to learn that every example that

3
00:00:21,650 --> 00:00:26,810
we do in this course will rely on these basic steps which you are about to learn.

4
00:00:26,810 --> 00:00:29,880
First let's start by describing the basic task.

5
00:00:30,020 --> 00:00:34,940
I'm sure you're all familiar with this example from your grade school science and math classes.

6
00:00:34,970 --> 00:00:38,260
Let's suppose we've done some kind of scientific experiment.

7
00:00:38,390 --> 00:00:44,120
For example, we measured the voltage in a circuit with different resistors, or we asked everyone in our

8
00:00:44,120 --> 00:00:51,230
class what their height and weight is.

9
00:00:51,240 --> 00:00:56,610
This gives us a dataset with two columns of numbers, one that we can call x and the other that we can

10
00:00:56,610 --> 00:00:57,980
call y.

11
00:00:57,990 --> 00:01:03,150
Now imagine that we plot these points on a grid and we get something that looks pretty close to a line.

12
00:01:03,300 --> 00:01:09,450
We would like to draw a line on this grid that closely passes through these points. In your grade school

13
00:01:09,450 --> 00:01:09,870
class,

14
00:01:09,870 --> 00:01:14,820
maybe you did something like take a ruler and move it around a little bit until you found the best

15
00:01:14,820 --> 00:01:15,670
possible line.

16
00:01:16,830 --> 00:01:21,150
But now that we're doing machine learning we need more sophisticated methods than just drawing dots

17
00:01:21,150 --> 00:01:25,440
on paper and using a ruler to estimate what the best line should be.

18
00:01:25,440 --> 00:01:29,340
The problem is, what happens when we have thousands or millions of dots?

19
00:01:29,340 --> 00:01:34,680
What if our data is more than two-dimensional? Then this approach is no longer possible.

20
00:01:34,770 --> 00:01:37,710
So perhaps you might say, aha, I know what to do.

21
00:01:37,740 --> 00:01:39,870
Let's use our trusty tool, scikit-learn.

22
00:01:45,170 --> 00:01:46,330
As you know by now,

23
00:01:46,430 --> 00:01:51,700
this is what a basic regression script might look like if we were to use scikit-learn.

24
00:01:51,770 --> 00:01:54,890
First we load in some data; call that x and y.

25
00:01:54,890 --> 00:01:58,790
These would just be the two columns of data that we were looking at earlier.

26
00:01:58,790 --> 00:02:03,630
Next we create a model. For the coming lectures, we'll be interested in linear regression.

27
00:02:03,680 --> 00:02:06,430
So let's create a linear regression model.

28
00:02:06,440 --> 00:02:10,280
Next we call the fit function to fit our model to the data.

29
00:02:10,310 --> 00:02:15,500
Next we can do multiple things with our fitted model, such as making predictions with model.predict.

30
00:02:17,210 --> 00:02:23,120
The problem with this is that it is just using an API; it doesn't tell us how any of this actually works.

31
00:02:23,600 --> 00:02:29,050
What actually happens inside these functions, and why is that important to know? At this stage,

32
00:02:29,060 --> 00:02:32,900
what I want to highlight on this screen is what we should expect to be different

33
00:02:32,930 --> 00:02:38,210
when we look at PyTorch. First, model creation is going to be different.

34
00:02:38,300 --> 00:02:43,220
As I mentioned earlier, PyTorch is going to allow us to build all kinds of different models using

35
00:02:43,220 --> 00:02:45,710
composable building blocks.

36
00:02:45,710 --> 00:02:50,430
So it's not going to be some predefined model as in scikit-learn. Next,

37
00:02:50,570 --> 00:02:57,290
I want to emphasize that in PyTorch there is no such thing as a fit function or a predict function.

38
00:02:57,290 --> 00:03:01,550
This is unlike TensorFlow and Keras, which do provide you with those functions.

39
00:03:01,940 --> 00:03:08,480
So if you're looking for something simple, then TensorFlow or Keras may be more your pace. In PyTorch, however,

40
00:03:08,510 --> 00:03:10,250
that will not be the case.

41
00:03:10,550 --> 00:03:15,530
Therefore, we're going to have to find out how to do all these steps in PyTorch using our knowledge of

42
00:03:15,530 --> 00:03:16,970
the concepts.

43
00:03:16,970 --> 00:03:23,390
Importantly, however, this lecture will focus only on the concepts and not on any PyTorch. In the next

44
00:03:23,390 --> 00:03:23,770
lecture,

45
00:03:23,770 --> 00:03:27,530
we'll look at how to actually implement these concepts in PyTorch syntax.

46
00:03:32,800 --> 00:03:35,120
So let's first look at what these concepts are.

47
00:03:35,230 --> 00:03:40,270
Before we discuss the details I'm going to go out of order in some sense but you'll see why that has

48
00:03:40,270 --> 00:03:41,600
to be the case.

49
00:03:41,680 --> 00:03:47,320
The three main concepts we want to discuss are: number one, what is the model architecture, and how do

50
00:03:47,320 --> 00:03:48,760
we actually build a model

51
00:03:48,760 --> 00:03:52,770
if it's not defined for us like it is in scikit-learn?

52
00:03:52,900 --> 00:03:55,960
Number two, how do we make predictions with the model?

53
00:03:55,960 --> 00:04:00,290
In fact, this is closely tied to number one. And number three,

54
00:04:00,370 --> 00:04:06,310
how do we fit the model? That is, how do we find the slope and intercept parameters? As you'll see,

55
00:04:06,310 --> 00:04:08,400
this is the most complicated step.

56
00:04:08,410 --> 00:04:12,790
It also depends on number one and number two, which is why we have to discuss it last.

57
00:04:18,150 --> 00:04:19,120
So, number one:

58
00:04:19,160 --> 00:04:21,030
what is the model architecture?

59
00:04:21,050 --> 00:04:22,330
Well you already know this.

60
00:04:22,340 --> 00:04:24,940
It's y hat equals mx plus b.

61
00:04:25,100 --> 00:04:27,590
Notice our use of the hat symbol here.

62
00:04:27,590 --> 00:04:33,250
Typically we'll use y for the true value, also known as the target, which is given in the dataset, and

63
00:04:33,290 --> 00:04:36,250
y hat for the corresponding prediction.

64
00:04:36,260 --> 00:04:40,940
Fortunately, this actually solves both of the first two problems for us.

65
00:04:40,970 --> 00:04:47,060
This not only defines the model architecture, which is a line, but it also tells us how to make predictions.

66
00:04:47,540 --> 00:04:48,760
Given some input x,

67
00:04:48,800 --> 00:04:51,660
I can see exactly how to calculate y hat.

68
00:04:51,830 --> 00:04:52,970
Pretty simple.

69
00:04:53,060 --> 00:04:55,760
So this is our model: y hat equals mx plus b.
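As a minimal sketch of this model in plain Python (the function name predict and the slope and intercept values are assumptions chosen only for illustration, not taken from the lecture):

```python
def predict(x, m, b):
    """Return the line's prediction y_hat = m * x + b for a single input x."""
    return m * x + b


# Hypothetical slope and intercept, picked only to show the calculation.
m, b = 2.0, 1.0
y_hat = predict(3.0, m, b)
print(y_hat)  # 7.0
```

The same formula both defines the model (a line) and tells us how to compute a prediction for any input x.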

70
00:05:01,030 --> 00:05:02,770
The hard part is number three.

71
00:05:02,890 --> 00:05:07,470
How do we actually find good values of the slope m and the intercept b?

72
00:05:07,810 --> 00:05:11,140
The key is to define what is called a loss function.

73
00:05:11,140 --> 00:05:15,310
I warn you that there are many synonyms for the loss function, and you'll see them all.

74
00:05:15,370 --> 00:05:19,940
So don't be afraid if you see these different terms and think one of them might be wrong.

75
00:05:19,990 --> 00:05:21,040
That's not the case.

76
00:05:21,070 --> 00:05:26,770
These words are all used equivalently, so you might see this referred to as a cost function or an objective

77
00:05:26,770 --> 00:05:28,630
function or an error function.

78
00:05:28,630 --> 00:05:29,950
These all mean the same thing.

79
00:05:32,950 --> 00:05:40,330
The basic idea is this: if we pass in our dataset x and y along with some setting of m and b, the loss function

80
00:05:40,360 --> 00:05:44,110
will tell us how good that setting of m and b is.

81
00:05:44,110 --> 00:05:50,230
Let's see how.

82
00:05:50,260 --> 00:05:56,770
First let's assume that we're given a set of pairs of data points (x_1, y_1), (x_2, y_2), all the way up to

83
00:05:56,770 --> 00:06:03,550
(x_N, y_N). Make a note to yourself now that it will be convention for us and many others to use the capital

84
00:06:03,550 --> 00:06:07,600
letter N to represent the number of samples in your dataset.

85
00:06:07,600 --> 00:06:12,730
This is not always the case, but it's at the very least the convention I use and the convention that many

86
00:06:12,730 --> 00:06:13,880
others use.

87
00:06:13,900 --> 00:06:18,130
So if you see something different in the future or you've seen something different in the past don't

88
00:06:18,130 --> 00:06:19,720
freak out. Believe it or not,

89
00:06:19,720 --> 00:06:22,020
this happens sometimes. Anyway,

90
00:06:22,030 --> 00:06:27,400
the idea behind linear regression is that we will not find a line that perfectly passes through all the points.

91
00:06:29,030 --> 00:06:34,250
That's impossible since, as you can see, there is no such line that can pass through all the points.

92
00:06:34,250 --> 00:06:37,490
All we can do is try to find a line that is a good fit.

93
00:06:37,580 --> 00:06:39,940
So how do we define a good fit?

94
00:06:40,100 --> 00:06:43,280
As I mentioned, this is called a loss function.

95
00:06:43,280 --> 00:06:48,680
I'm going to start by telling you what it is and then describing to you why it makes sense. For some

96
00:06:48,680 --> 00:06:51,940
of you it'll make sense immediately, and for others it might not.

97
00:06:52,400 --> 00:06:59,810
But hopefully by the end of this I will have convinced you that it does make sense. So the loss function

98
00:06:59,810 --> 00:07:04,820
that we use in regression is called the mean squared error, or MSE for short.

99
00:07:04,820 --> 00:07:10,430
Basically you take all the y_i's and you find the squared difference between them and the line itself,

100
00:07:10,430 --> 00:07:17,840
the y hat_i's. You can think of this as the squared vertical distance from each y_i to the line. Each

101
00:07:17,840 --> 00:07:22,580
corresponding point on the line is a y hat_i, the prediction given x_i.

102
00:07:23,030 --> 00:07:29,030
Once we have all the squared differences, we take their average. You all know how to find averages, so

103
00:07:29,030 --> 00:07:33,920
you know that it is just the sum of the squared differences divided by the total number of points, N.

104
00:07:39,100 --> 00:07:42,350
The question is, why does this make sense?

105
00:07:42,370 --> 00:07:47,100
I think this is a great exercise to try for yourself if you haven't done so before.

106
00:07:47,200 --> 00:07:52,690
Take a piece of graph paper and draw some dots and a line just like what you see here and calculate

107
00:07:52,690 --> 00:07:54,680
the squared error by hand.

108
00:07:54,760 --> 00:08:00,160
What you should discover is that if the line is not a good fit then the error will be larger.

109
00:08:00,160 --> 00:08:03,550
If the line is a good fit then the error will be smaller.

110
00:08:03,550 --> 00:08:05,650
I think that's pretty intuitive.

111
00:08:05,650 --> 00:08:11,410
In addition you should recognize that if the data points perfectly lie on the line then the mean squared

112
00:08:11,410 --> 00:08:17,620
error will be 0.
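If you'd rather check this on a computer than on graph paper, here is a minimal sketch of the mean squared error in plain Python. The function name mse and the toy data are assumptions made up for illustration; the data was chosen to lie exactly on the line y = 2x + 1.

```python
def mse(xs, ys, m, b):
    """Mean squared error between the targets ys and the predictions m * x + b."""
    n = len(xs)
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / n


# Toy dataset (hypothetical): every point lies exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

print(mse(xs, ys, 2.0, 1.0))  # perfect fit: 0.0
print(mse(xs, ys, 2.0, 1.5))  # slightly off: 0.25
print(mse(xs, ys, 1.0, 0.0))  # poor fit: 7.5
```

The worse the line fits, the larger the number, and a line through every point gives exactly zero.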

113
00:08:17,650 --> 00:08:24,180
Next question: how do we use this loss function and our model to actually find the values of m and b?

114
00:08:24,820 --> 00:08:29,650
Let's see what happens if we plug in our expression for y hat into the loss.

115
00:08:29,650 --> 00:08:32,260
This is clearly the equation we would get.
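Written out, and assuming the notation used so far (L for the loss, N for the number of samples, y hat_i = m x_i + b for the prediction), the substitution would look something like this:

```latex
L(m, b) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
        = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (m x_i + b) \right)^2
```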

116
00:08:32,260 --> 00:08:34,330
Now I have a quiz question for you.

117
00:08:34,330 --> 00:08:37,020
What are the variables in this equation?

118
00:08:37,030 --> 00:08:38,590
I'll give you a second to think about it.

119
00:08:38,770 --> 00:08:43,030
So please take a moment to come up with the answer before moving on to the next slide.

120
00:08:48,360 --> 00:08:53,660
Ok so I hope you thought about the answer to what the variables are in this equation.

121
00:08:53,760 --> 00:08:55,620
If you said x and y,

122
00:08:55,680 --> 00:08:57,380
this is not correct.

123
00:08:57,420 --> 00:09:03,120
Many beginners make this mistake because in a typical math course the variable is X and you're always

124
00:09:03,120 --> 00:09:05,760
trying to do something like solve for x.

125
00:09:05,760 --> 00:09:12,540
However, it's important to recognize that in machine learning x is normally not a variable. The x_i and y_i

126
00:09:12,620 --> 00:09:14,630
are just values from your dataset.

127
00:09:14,760 --> 00:09:18,090
You can think of them as numbers from your Excel spreadsheet.

128
00:09:18,150 --> 00:09:21,960
In fact, in our case the variables are m and b.

129
00:09:21,960 --> 00:09:25,270
Remember that these are the values that we are currently trying to find.

130
00:09:25,350 --> 00:09:28,290
They are unknown, and therefore they are variables.

131
00:09:33,480 --> 00:09:34,170
At this point,

132
00:09:34,170 --> 00:09:40,020
we know that our goal is to minimize the loss with respect to m and b. This is how we would write that

133
00:09:40,050 --> 00:09:41,700
mathematically.

134
00:09:41,700 --> 00:09:45,690
If you're afraid of math don't worry about this equation too much because it doesn't give us anything

135
00:09:45,690 --> 00:09:46,450
new.

136
00:09:46,500 --> 00:09:51,840
It's just a way of saying what we already said, but using math. We would like to find the best values

137
00:09:51,840 --> 00:09:58,530
of m and b, called m star and b star, where these values of m and b are the ones that minimize the loss

138
00:09:58,560 --> 00:10:04,290
L out of all possible values of m and b. How do we solve such a problem?
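For reference, one common way to write this minimization (assuming the same symbols, with m star and b star as the best slope and intercept) is:

```latex
(m^*, b^*) = \arg\min_{m,\, b} L(m, b)
           = \arg\min_{m,\, b} \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (m x_i + b) \right)^2
```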

139
00:10:09,470 --> 00:10:15,640
It might help to consider a similar but simpler problem. Say we have a function f of x equals x squared,

140
00:10:15,680 --> 00:10:17,030
a quadratic.

141
00:10:17,030 --> 00:10:18,890
We all know this from high school.

142
00:10:19,010 --> 00:10:25,410
You might say well I can just look at the graph and see that the minimum value is zero when x is zero.

143
00:10:25,760 --> 00:10:30,770
But remember we don't do that in machine learning because in machine learning we work in multiple dimensions.

144
00:10:30,770 --> 00:10:36,230
You can't just look at it. At this point we have to introduce our old friend, calculus.

145
00:10:36,230 --> 00:10:42,740
If you recall, the way that we solve this is you find the derivative df/dx, set it equal to zero,

146
00:10:42,830 --> 00:10:44,510
and solve for x.

147
00:10:44,510 --> 00:10:51,450
This is because, as you can see, the derivative at the minimum is equal to zero. So if you draw a tangent

148
00:10:51,450 --> 00:10:55,650
line at the minimum point it should be a horizontal line.

149
00:10:55,680 --> 00:10:59,850
There are of course a lot more details that I'm not including here but you should at least remember

150
00:10:59,850 --> 00:11:01,650
this from your high school math studies.
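For the quadratic example, the steps just described work out as follows (a short worked sketch, not shown in this form in the lecture):

```latex
f(x) = x^2, \qquad
\frac{df}{dx} = 2x = 0 \;\Longrightarrow\; x = 0, \qquad
f(0) = 0
```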

151
00:11:06,870 --> 00:11:14,060
So how do we find the solution in our case? Since we have two variables, m and b, we have to take partial

152
00:11:14,060 --> 00:11:20,990
derivatives, but the idea is the same. Find the partial derivative of L with respect to m and set that

153
00:11:20,990 --> 00:11:22,040
to zero.

154
00:11:22,100 --> 00:11:26,690
Find the partial derivative of L with respect to b and set that to zero.

155
00:11:26,690 --> 00:11:33,960
This will give us two equations with two unknowns, which are m and b, at which point we can solve for m and b.

156
00:11:34,100 --> 00:11:39,170
In fact if you've never done this before I would encourage you to do this calculation and solve for

157
00:11:39,230 --> 00:11:41,360
m and b as an exercise.

158
00:11:41,360 --> 00:11:48,200
This can be done for any arbitrary dataset (x_1, y_1) up to (x_N, y_N), so your solution should be in terms

159
00:11:48,200 --> 00:11:54,380
of those values.
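If you want to check your answer to the exercise, here is one possible sketch in plain Python. It assumes the standard result of setting both partial derivatives of the MSE to zero: m = (mean(xy) - mean(x) mean(y)) / (mean(x^2) - mean(x)^2) and b = mean(y) - m mean(x). The function name fit_line and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form slope and intercept that minimize the MSE for 1-D data."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    xy_mean = sum(x * y for x, y in zip(xs, ys)) / n
    xx_mean = sum(x * x for x in xs) / n

    denom = xx_mean - x_mean ** 2
    if denom == 0:
        # All x values are the same, so no unique slope exists.
        raise ValueError("all x values are identical; slope is undefined")

    m = (xy_mean - x_mean * y_mean) / denom
    b = y_mean - m * x_mean
    return m, b


# Toy dataset (hypothetical): generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, b = fit_line(xs, ys)
print(m, b)  # 2.0 1.0
```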

160
00:11:54,440 --> 00:12:00,270
There is a problem with this approach, however. Unfortunately, finding the derivatives and setting them

161
00:12:00,270 --> 00:12:03,450
to zero only works for linear regression.

162
00:12:03,510 --> 00:12:07,570
It does not work for any other model we will discuss in this course.

163
00:12:07,590 --> 00:12:13,740
In fact the very next model we discuss for linear classification cannot be solved in this way.

164
00:12:13,740 --> 00:12:17,550
Instead we need to use a method called gradient descent.

165
00:12:17,550 --> 00:12:22,890
Very briefly the gradient is the multi-dimensional analogue of the derivative.

166
00:12:22,890 --> 00:12:28,830
So basically, when you're working in multiple dimensions you have gradients in place of scalar derivatives.

167
00:12:30,780 --> 00:12:34,980
The idea behind gradient descent can be understood with this simple picture.

168
00:12:35,100 --> 00:12:36,590
It's an iterative algorithm.

169
00:12:36,620 --> 00:12:40,480
So we're going to go in a loop. On each iteration of the loop,

170
00:12:40,570 --> 00:12:44,410
we're going to take a small step in the direction of the negative gradient.

171
00:12:44,490 --> 00:12:50,010
It can be shown that, for a convex loss like ours and a small enough step size, if we keep doing this we will eventually end up at the minimum

172
00:12:50,010 --> 00:12:52,930
point. In this lecture,

173
00:12:52,960 --> 00:12:55,330
I don't want to get into detail about gradient descent.

174
00:12:55,600 --> 00:13:00,610
However, you are encouraged to check out the lectures in the in-depth section of this course

175
00:13:00,670 --> 00:13:02,310
if you are interested.
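To make the loop concrete, here is a minimal sketch of gradient descent for our line in plain Python. The function name, the learning rate, the number of steps, and the toy data are all assumptions chosen for illustration; note that the update subtracts the gradient, so each step moves downhill on the loss.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y_hat = m * x + b by repeatedly stepping against the MSE gradient."""
    n = len(xs)
    m, b = 0.0, 0.0  # arbitrary starting point
    for _ in range(n_steps):
        # Partial derivatives of the mean squared error with respect to m and b.
        grad_m = sum(-2 * x * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
        grad_b = sum(-2 * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
        # Take a small step in the direction that decreases the loss.
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b


# Toy dataset (hypothetical): generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, b = gradient_descent(xs, ys)
print(round(m, 4), round(b, 4))  # 2.0 1.0
```

With a learning rate that is too large the loop can diverge instead of converging, so in practice this value usually has to be tuned.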

176
00:13:02,620 --> 00:13:06,780
Our goal is to get to how we do this in PyTorch as quickly as possible.

177
00:13:07,090 --> 00:13:11,760
Gradient descent is a really cool algorithm because it works in pretty much all cases.

178
00:13:11,770 --> 00:13:17,770
That's why libraries that have automatic differentiation, such as Theano, TensorFlow, and PyTorch, have

179
00:13:17,770 --> 00:13:22,220
been so popular for machine learning. Aside from deep learning,

180
00:13:22,260 --> 00:13:26,270
you can use the exact same techniques for things like hidden Markov models,

181
00:13:26,340 --> 00:13:29,340
k-means clustering, matrix factorization, and more.

182
00:13:34,480 --> 00:13:34,830
All right.

183
00:13:34,860 --> 00:13:38,510
So since this has been a pretty long lecture let's recap the steps.

184
00:13:38,640 --> 00:13:44,640
Our goal in this lecture was to take you from the three steps of the scikit-learn API to what actually happens

185
00:13:44,640 --> 00:13:51,510
conceptually. As you recall, the three steps are creating the model, fitting the model, and making predictions

186
00:13:51,510 --> 00:13:52,910
with the model.

187
00:13:52,980 --> 00:13:57,900
After this lecture you now understand at a high level what actually happens during these three function

188
00:13:57,900 --> 00:13:58,520
calls.

189
00:13:59,100 --> 00:14:05,440
However the concepts don't provide the full picture because we still need to implement this in code.

190
00:14:05,460 --> 00:14:10,770
In fact I have a course that goes very in-depth into all these concepts and how you can take them and

191
00:14:10,770 --> 00:14:16,680
put them into a computer program using basic libraries like NumPy in the Python programming language.

192
00:14:17,130 --> 00:14:19,400
But remember, this course is about PyTorch.

193
00:14:19,740 --> 00:14:24,270
So the next question we have to ask is, how is this done in PyTorch?

194
00:14:24,360 --> 00:14:27,530
Concepts are one thing but syntax is another.

195
00:14:27,570 --> 00:14:32,580
For example, the syntax for PyTorch will be different from the syntax for TensorFlow,

196
00:14:32,580 --> 00:14:34,650
even if we are doing the exact same thing.