1
00:00:05,970 --> 00:00:09,060
Welcome back, everyone, to this section on linear regression.

2
00:00:10,100 --> 00:00:12,380
You've made it to the last section of the course.

3
00:00:12,380 --> 00:00:17,060
So a huge congratulations for all the learning and work you've done so far.

4
00:00:17,300 --> 00:00:22,940
In this section, we're going to focus on directly applying statistics to data sets to create models,

5
00:00:23,090 --> 00:00:27,530
and we're going to be using one of the earliest and most common techniques known as regression.

6
00:00:28,890 --> 00:00:33,540
So we're going to be covering a lot of different topics that in some form, one way or another, are

7
00:00:33,540 --> 00:00:38,910
related to regression, such as scatter plots, correlation coefficients and how they relate to the

8
00:00:38,910 --> 00:00:44,280
residual and the coefficient of determination, root mean squared error and more.

9
00:00:45,030 --> 00:00:50,010
But for right now, let's focus this introduction on regression and why it could be useful.

10
00:00:50,580 --> 00:00:56,790
Formally speaking, regression is a statistical technique that uses data sets to create explanatory

11
00:00:56,790 --> 00:00:57,510
models.

12
00:00:57,690 --> 00:01:03,300
These explanatory models can then be used to help predict future outputs of dependent variables based

13
00:01:03,300 --> 00:01:05,459
on inputs of independent variables.

14
00:01:06,910 --> 00:01:11,830
Even though regression models have been studied since the 1700s in astronomy, they are actually still

15
00:01:11,830 --> 00:01:17,380
extremely useful today, and they're often easier to understand than more complex statistical models

16
00:01:17,380 --> 00:01:20,770
that fall under the umbrella of more advanced machine learning.

17
00:01:21,980 --> 00:01:27,170
We can think of regression as the connecting piece between the world of classical statistics that we

18
00:01:27,170 --> 00:01:30,230
just learned about and modern machine learning techniques.

19
00:01:30,380 --> 00:01:36,080
It's often one of the first models applied because it directly produces coefficients that can be interpreted,

20
00:01:36,080 --> 00:01:41,180
which is an extremely useful feature that not every machine learning technique actually provides for

21
00:01:41,180 --> 00:01:41,560
you.

22
00:01:43,140 --> 00:01:47,600
Now, you've probably already seen the capabilities of regression models in practice.

23
00:01:47,610 --> 00:01:54,690
For example, imagine you started a company to buy or sell houses based on features or variables such

24
00:01:54,690 --> 00:02:00,420
as the area of the house, the number of bedrooms, the number of bathrooms, etc. So you have a bunch

25
00:02:00,420 --> 00:02:05,130
of features and variables about the house, and then you're trying to predict what's the appropriate

26
00:02:05,130 --> 00:02:06,390
price for the house.

27
00:02:08,000 --> 00:02:14,000
Regression models not only provide those direct prediction properties, such as modeling the approximate

28
00:02:14,000 --> 00:02:20,750
price a house should sell at given its features or variables, but a regression model also informs

29
00:02:20,750 --> 00:02:25,610
you of the effect of each variable through the use of what are known as coefficients.

30
00:02:25,610 --> 00:02:30,950
So that way you can actually understand which features, like the number of bedrooms, are most important

31
00:02:30,950 --> 00:02:34,190
for predicting that dependent variable, the price of the house.

32
00:02:35,950 --> 00:02:40,840
And you're probably already familiar with a very simple linear regression function such as Y equals

33
00:02:40,840 --> 00:02:48,010
mx plus b, where m is the slope and b is the y-intercept, the point where the line crosses the y-axis.

34
00:02:48,430 --> 00:02:54,400
Let's start with this very basic concept of regression and a linear fit and then expand its use to a

35
00:02:54,400 --> 00:02:55,570
real world data set.

36
00:02:57,430 --> 00:03:02,830
Put simply, a linear relationship implies some constant straight line relationship.

37
00:03:02,980 --> 00:03:06,490
The simplest possible being Y is equal to X.

38
00:03:06,700 --> 00:03:10,840
Technically, you could also say y is equal to a constant, but let's just keep Y is equal to X for

39
00:03:10,840 --> 00:03:11,530
right now.

40
00:03:11,620 --> 00:03:14,980
So you notice I have three data points and they all fit

41
00:03:15,010 --> 00:03:20,520
y is equal to x: I have x going from 1 to 3 and y going from 1 to 3.

42
00:03:20,530 --> 00:03:24,430
So if I were to pair them up, it'd be (1, 1), (2, 2) and (3, 3).

43
00:03:25,730 --> 00:03:30,650
Now, what I could also do then is based on these data points, I can build out a relationship known

44
00:03:30,650 --> 00:03:32,250
as a fitted line.

45
00:03:32,270 --> 00:03:35,210
I'm fitting a line to these data points.

46
00:03:36,680 --> 00:03:44,140
Now, what this implies is if I get some new incoming data like X, I can predict its related y value.

47
00:03:44,150 --> 00:03:52,040
So if x is equal to 1.5, then because my fitted line is y equals x, for the input

48
00:03:52,040 --> 00:03:56,750
x equal to 1.5 I get the output y is equal to 1.5.
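
As a minimal sketch of the fitted-line idea above, the three points (1, 1), (2, 2) and (3, 3) can be fitted with NumPy and then used to predict y for x = 1.5. The use of `np.polyfit` and the helper name `predict` are illustrative assumptions, not something shown in the lecture.

```python
import numpy as np

# The three data points from the example: (1, 1), (2, 2), (3, 3).
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

# Fit a degree-1 polynomial (a straight line).
# np.polyfit returns the coefficients highest power first: [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)


def predict(new_x):
    """Hypothetical helper: evaluate the fitted line at new_x."""
    return slope * new_x + intercept


prediction = predict(1.5)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, prediction={prediction:.3f}")
```

Because the points lie exactly on y = x, the fitted slope is 1 and the intercept is 0 up to floating-point rounding, so the prediction for 1.5 comes out as 1.5.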

49
00:03:58,510 --> 00:04:00,450
Now, that was a very simple example.

50
00:04:00,460 --> 00:04:05,950
But what happens with real world data that doesn't actually fit nicely all along a straight line?

51
00:04:06,130 --> 00:04:07,910
Where do we actually draw this line?

52
00:04:07,930 --> 00:04:11,980
Do I draw it up here or do I draw it somewhere here in the middle?

53
00:04:12,010 --> 00:04:18,430
We need to figure out some sort of metric that allows me to understand which line best fits the data.

54
00:04:19,790 --> 00:04:26,150
Fundamentally, we understand that we want to minimize the overall distance from the points to the line.

55
00:04:26,850 --> 00:04:34,290
So if I draw a line like this, I can actually measure the distance from those points to the line.

56
00:04:34,380 --> 00:04:37,110
And this is known as the residual error.

57
00:04:39,110 --> 00:04:43,400
That means some lines are going to clearly be better fits than others.

58
00:04:43,400 --> 00:04:48,620
You can see for this line, it looks like the distance between the line and the other points is less

59
00:04:48,620 --> 00:04:50,480
than that previous line I showed.

60
00:04:51,730 --> 00:04:57,310
We can see as well that the residuals can be both positive and negative, since the line itself can

61
00:04:57,310 --> 00:05:00,220
be below some points but above others.

62
00:05:00,220 --> 00:05:05,350
And that idea of being positive and negative for the error term is going to be important when we try

63
00:05:05,350 --> 00:05:08,860
to evaluate the actual overall error of the line.
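
To make the sign of the residuals concrete, here is a small sketch that computes them for one candidate line. The four data points and the candidate line y = x are invented for illustration only.

```python
import numpy as np

# Invented data that does not sit exactly on a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.9, 3.4, 3.8])

# Candidate line (an assumption for this sketch): y_hat = 1 * x + 0.
y_hat = 1.0 * x + 0.0

# Residual = observed value minus predicted value.
# Positive: the point is above the line. Negative: the point is below it.
residuals = y - y_hat
print(residuals)
print("sum of raw residuals:", residuals.sum())
```

Notice that positive and negative residuals partly cancel when they are simply added up, which is why the raw sum is a poor measure of the overall error.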

64
00:05:10,690 --> 00:05:15,640
So what kind of mathematics can we use to try to figure out the line of best fit?

65
00:05:16,240 --> 00:05:23,110
We can use ordinary least squares, which works by minimizing the sum of the squares of the differences

66
00:05:23,110 --> 00:05:28,600
between the observed dependent variable that is, the values of the variable being observed in the given

67
00:05:28,600 --> 00:05:33,580
data set and those predicted by the linear function, which is essentially your y output.

68
00:05:34,860 --> 00:05:40,200
So let me show you how you can visualize the squared error that you're trying to minimize.

69
00:05:40,590 --> 00:05:43,720
Notice that I have my real data points as blue points.

70
00:05:43,740 --> 00:05:46,680
Then I have the line that

71
00:05:46,680 --> 00:05:51,840
I'm trying to fit in orange, and the distance between each point and that line as a dashed red line.

72
00:05:52,110 --> 00:05:55,050
I'm trying to minimize the squared error.

73
00:05:55,650 --> 00:06:00,000
Which means I can just square the distance from each of my points to the line.

74
00:06:00,000 --> 00:06:04,320
And the total area of all those squares is what I'm trying to minimize.

75
00:06:05,540 --> 00:06:09,770
Having a squared error will help us simplify our calculations later on

76
00:06:09,770 --> 00:06:15,830
when setting up a derivative. Recall that I mentioned the error could be positive or negative, but

77
00:06:15,830 --> 00:06:21,200
if you square that then everything becomes positive, which is very convenient mathematically.

78
00:06:21,200 --> 00:06:25,460
So now it doesn't really matter whether the line is above or below a point.

79
00:06:25,460 --> 00:06:29,150
The act of squaring it is going to make it positive.
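
The squared-error idea can be sketched by comparing two candidate lines on the same points. The data and both candidate lines are made up for illustration, and `sum_squared_error` is a hypothetical helper name.

```python
import numpy as np

# Invented data points.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.9, 3.4, 3.8])


def sum_squared_error(slope, intercept):
    """Hypothetical helper: total of the squared residuals for one line."""
    residuals = y - (slope * x + intercept)
    # Squaring makes every term non-negative, so errors cannot cancel out.
    return float(np.sum(residuals ** 2))


sse_a = sum_squared_error(1.0, 0.0)  # candidate line A: y = x
sse_b = sum_squared_error(0.5, 2.0)  # candidate line B: y = 0.5x + 2
print("SSE for line A:", sse_a)
print("SSE for line B:", sse_b)
```

The line with the smaller total is the better fit under this metric; here that is line A.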

80
00:06:30,140 --> 00:06:36,200
Let's continue exploring ordinary least squares by converting a real data set into mathematical notation,

81
00:06:36,200 --> 00:06:40,370
then working to solve a linear relationship between features and a variable.

82
00:06:41,930 --> 00:06:48,080
So right now we know the equation of a simple straight line, y equals mx plus b: m is the slope, b is

83
00:06:48,080 --> 00:06:49,760
the intercept with the Y axis.

84
00:06:50,870 --> 00:06:53,010
We can see that for y equals mx plus b,

85
00:06:53,030 --> 00:06:59,900
there's only room for one possible feature x. Ordinary least squares will allow us to directly solve for the

86
00:06:59,900 --> 00:07:02,420
slope M and the intercept B.
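
For the single-feature case, the slope and intercept that minimize the squared error have a well-known closed form: m is the sum of (x - mean of x) times (y - mean of y), divided by the sum of (x - mean of x) squared, and b is the mean of y minus m times the mean of x. The sketch below applies those formulas to invented data and checks them against `np.polyfit`.

```python
import numpy as np

# Invented data points.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.9, 3.4, 3.8])

x_mean = x.mean()
y_mean = y.mean()

# Closed-form ordinary least squares solution for one feature.
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean
print(f"m={m:.4f}, b={b:.4f}")

# Cross-check against NumPy's own least-squares line fit.
check_m, check_b = np.polyfit(x, y, 1)
print(f"polyfit: m={check_m:.4f}, b={check_b:.4f}")
```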

87
00:07:03,750 --> 00:07:08,670
Let's explore how we could translate a real data set into mathematical notation for linear regression.

88
00:07:10,330 --> 00:07:16,040
Let's imagine that I'm actually trying to build out a linear regression model for predicting the pricing

89
00:07:16,060 --> 00:07:16,900
of a house.

90
00:07:16,930 --> 00:07:22,630
Well, I have a bunch of features that is multiple features of the house shown here in blue, like the

91
00:07:22,630 --> 00:07:26,740
area in square meters, the number of bedrooms and the number of bathrooms.

92
00:07:26,740 --> 00:07:34,060
And what I'm trying to do is estimate a target output, which is the Y output of my linear model function.

93
00:07:34,060 --> 00:07:36,760
And in this case it happens to be the price of the house.

94
00:07:38,400 --> 00:07:44,910
So I can begin to translate this into generalized mathematical notation by thinking of those independent

95
00:07:44,910 --> 00:07:53,340
variables or features as X, a matrix, and y, which is just that single vector or column of the price,

96
00:07:53,340 --> 00:07:55,260
that is the output I'm trying to predict.

97
00:07:56,690 --> 00:08:01,450
So I could then start using more mathematical notation and actually build this out.

98
00:08:01,460 --> 00:08:04,160
So I just have a bunch of X's.

99
00:08:04,830 --> 00:08:11,730
And then y's, and this is known as linear algebra notation, where you have a notation indicating what

100
00:08:11,730 --> 00:08:15,340
column and row you belong to for the original data set.

101
00:08:15,360 --> 00:08:19,620
We don't really need to worry too much about that specific notation right now, but I do want you to

102
00:08:19,620 --> 00:08:20,430
be aware of it.

103
00:08:22,120 --> 00:08:27,250
So now what we can do is build out a linear relationship between the features X and the label Y.

104
00:08:30,480 --> 00:08:34,880
So I'm going to reformat this to look like a Y equals X equation.

105
00:08:34,890 --> 00:08:40,980
Notice I basically just swap the order from the X column being first to the Y column being first.

106
00:08:42,490 --> 00:08:46,600
Now each feature should have some beta coefficient associated with it.

107
00:08:46,750 --> 00:08:52,420
So what I'm going to do is for every general feature, remember it was area, number of bedrooms and number

108
00:08:52,420 --> 00:08:53,300
of bathrooms.

109
00:08:53,320 --> 00:09:01,360
I'm going to attach some beta coefficient, so I will have beta naught, or beta zero, attached to x zero, plus

110
00:09:01,360 --> 00:09:04,720
beta one attached to x one and so on and so on.

111
00:09:05,390 --> 00:09:11,060
As a quick note, I formatted my notation to start at one, but realistically, if you're starting to

112
00:09:11,060 --> 00:09:16,070
dive much deeper into regression, you're probably going to start your notation at zero, because that's

113
00:09:16,070 --> 00:09:18,620
how a lot of programming languages work internally.

114
00:09:20,330 --> 00:09:26,440
So I should point out that this is pretty much the same thing as the common notation for a simple line.

115
00:09:26,450 --> 00:09:28,790
I just have more X's that I'm dealing with.

116
00:09:30,430 --> 00:09:36,250
So this is stating that there is some beta coefficient for each feature to minimize the error.

117
00:09:38,120 --> 00:09:41,360
So I can also express this equation as a sum.

118
00:09:41,630 --> 00:09:47,660
Basically, what I'm going to say is y hat, which is another way of saying my predicted y value or

119
00:09:47,660 --> 00:09:54,380
predicted price of a house is equal to a particular beta coefficient multiplied by the first feature

120
00:09:54,380 --> 00:10:00,290
value plus another particular beta coefficient multiplied by the next x value, and so on and so on

121
00:10:00,290 --> 00:10:03,560
for any number of features.

122
00:10:03,560 --> 00:10:09,140
And I can then reduce that notation for y hat to just be a sum of beta times x.
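
Written out, the sum described here might look like the following. The zero-based indexing and the convention that x_0 is fixed at 1 (so that beta_0 acts as the intercept) are assumptions of this sketch, since the slide itself is not visible in the transcript.

```latex
\hat{y} = \beta_0 x_0 + \beta_1 x_1 + \dots + \beta_n x_n
        = \sum_{i=0}^{n} \beta_i x_i ,
\qquad \text{assuming } x_0 = 1 .
```

With a single feature this reduces to \hat{y} = \beta_0 + \beta_1 x_1, which is y = mx + b with b = \beta_0 and m = \beta_1.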

123
00:10:11,090 --> 00:10:14,350
And again, note that the y hat symbol denotes a prediction.

124
00:10:14,360 --> 00:10:17,870
There is usually no set of betas to create a perfect fit to Y.

125
00:10:17,900 --> 00:10:23,480
That's why we're actually saying y hat instead of just y, because I'm not going to get a perfect fit.

126
00:10:23,480 --> 00:10:30,920
And I don't want to imply by stating y that it's going to perfectly predict all the values. Y hat basically

127
00:10:30,920 --> 00:10:34,430
tells you, hey, this is an approximation of what y could be.

128
00:10:34,430 --> 00:10:36,580
It's not going to be the exact value.

129
00:10:36,590 --> 00:10:40,520
Otherwise that would imply that the line is going to touch every single point.

130
00:10:42,320 --> 00:10:47,720
So again, I have my line and I'm trying to minimize the distance between the line and the points.

131
00:10:47,720 --> 00:10:52,760
And y hat is going to essentially be an estimation of the output.

132
00:10:54,370 --> 00:10:57,780
So let's imagine that I have a feature X.

133
00:10:57,790 --> 00:11:00,250
What does the beta coefficient actually do?

134
00:11:00,490 --> 00:11:07,120
If I think of this in terms of y equals mx plus b, where m is now really just a beta coefficient,

135
00:11:07,120 --> 00:11:12,070
all this is really doing is acting as a multiplying factor for a slope.

136
00:11:12,070 --> 00:11:18,430
So in this very simple case where I have one feature X and then I'm multiplying it by a beta coefficient,

137
00:11:18,430 --> 00:11:24,340
then I'm really just moving that to some particular slope and we're going to do this across multiple

138
00:11:24,340 --> 00:11:25,270
dimensions.

139
00:11:25,300 --> 00:11:32,710
Obviously I can't draw more than two dimensions on a 2D slide, but hopefully you get the idea that it's

140
00:11:32,710 --> 00:11:37,180
just adjusting that slope across each dimensional feature.

141
00:11:38,780 --> 00:11:42,080
And then y hat is going to be your actual predicted value.

142
00:11:42,110 --> 00:11:46,790
Remember, y hat itself is not the precise value of Y.

143
00:11:46,790 --> 00:11:48,290
That would be the actual blue point.

144
00:11:48,320 --> 00:11:55,340
Y hat is our predicted value of y given an x input multiplied by the beta coefficient that we calculated.

145
00:11:57,020 --> 00:12:02,050
So we have now created a simple model that can help predict the price a house should be bought or sold

146
00:12:02,060 --> 00:12:04,040
at given a historical data set.

147
00:12:05,490 --> 00:12:11,460
We simply need to solve for the beta coefficients which can be done by hand for only very simple situations.

148
00:12:11,670 --> 00:12:16,620
Technically speaking, this is usually done with computational techniques, using something known as stochastic

149
00:12:16,620 --> 00:12:17,630
gradient descent.
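
To give a feel for the iterative approach, here is a rough sketch of plain (full-batch) gradient descent on the three-point y = x example. It is not the stochastic variant mentioned above, which would update on one sample or a small batch at a time, and the learning rate and iteration count are arbitrary choices for this sketch.

```python
import numpy as np

# The three points from the y = x example.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

m, b = 0.0, 0.0        # start from a flat line through the origin
learning_rate = 0.05   # arbitrary step size chosen for this sketch

for _ in range(2000):
    error = (m * x + b) - y             # predicted minus observed
    grad_m = 2.0 * np.mean(error * x)   # d(mean squared error)/dm
    grad_b = 2.0 * np.mean(error)       # d(mean squared error)/db
    m -= learning_rate * grad_m         # step against the gradient
    b -= learning_rate * grad_b

print(f"m={m:.4f}, b={b:.4f}")
```

Each pass nudges the slope and intercept in the direction that lowers the mean squared error, so the line drifts toward y = x.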

150
00:12:17,640 --> 00:12:22,560
So I should point out at a certain point regression just really can't be done by hand.

151
00:12:22,560 --> 00:12:26,670
But we're going to try to keep it as simple as possible for this section of the course so that we can really

152
00:12:26,670 --> 00:12:28,290
get an intuition of what's happening.

153
00:12:30,150 --> 00:12:36,300
The important thing to realize here is that we would have a set of n equations with known x values

154
00:12:36,300 --> 00:12:38,400
and known y values for each row

155
00:12:38,430 --> 00:12:39,000
i.

156
00:12:40,430 --> 00:12:46,670
The beta coefficients can then directly inform us of the strength of each variable in causing the output

157
00:12:46,670 --> 00:12:47,240
y.
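
As a rough sketch of solving for several beta coefficients at once, the house example can be set up as a matrix of features and a vector of prices and handed to NumPy's least-squares solver. Every number below is invented for illustration (prices in thousands), and the column of ones added for the intercept is an assumption of this sketch.

```python
import numpy as np

# Invented rows: [area in square meters, bedrooms, bathrooms].
features = np.array([
    [50.0, 1.0, 1.0],
    [80.0, 2.0, 1.0],
    [100.0, 3.0, 2.0],
    [120.0, 3.0, 2.0],
    [150.0, 4.0, 3.0],
])
# Invented sale prices, in thousands.
y = np.array([150.0, 230.0, 310.0, 345.0, 440.0])

# Prepend a column of ones so the first beta plays the role of the intercept.
X = np.column_stack([np.ones(len(features)), features])

# Least-squares solution of X @ betas ~= y.
betas, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ betas        # predicted prices
residuals = y - y_hat    # what the linear model fails to explain

print("betas (intercept, area, bedrooms, bathrooms):", betas)
print("residuals:", residuals)
```

Each row of X together with its y value is one of the equations mentioned above. With more rows than coefficients there is generally no exact solution, so the solver returns the betas that minimize the squared error.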

158
00:12:49,010 --> 00:12:54,350
Let's continue to learn and explore some of the basic concepts of actually solving for those coefficients

159
00:12:54,350 --> 00:12:55,670
for regression.

160
00:12:55,940 --> 00:12:57,380
We'll see you at the next lecture.