1
00:00:11,730 --> 00:00:15,600
In this lecture we are going to answer the question what is machine learning.

2
00:00:16,530 --> 00:00:21,330
I always like to discuss this first because I take a very different approach compared to most other

3
00:00:21,330 --> 00:00:27,840
resources you'll find. In general, I think the approach most other resources take is that they like to

4
00:00:28,110 --> 00:00:35,770
frame machine learning as something complex and something fancy and something futuristic and exciting.

5
00:00:35,850 --> 00:00:41,310
My approach instead is rather kind of depressing. How I want you to think of machine learning is that

6
00:00:41,310 --> 00:00:47,130
it's something not fancy at all and something that only requires spatial reasoning.

7
00:00:47,130 --> 00:00:55,670
My goal is to make machine learning sound as dumb as possible.

8
00:00:55,790 --> 00:00:59,970
Here are some anonymous definitions of machine learning that I found on the Internet.

9
00:01:00,790 --> 00:01:02,800
Let me read this one out to you.

10
00:01:02,800 --> 00:01:09,820
Machine learning is an application of artificial intelligence that provides systems the ability to automatically

11
00:01:09,820 --> 00:01:14,740
learn and improve from experience without being explicitly programmed.

12
00:01:14,740 --> 00:01:20,320
Machine learning focuses on the development of computer programs that can access data and use it to

13
00:01:20,320 --> 00:01:22,270
learn for themselves.

14
00:01:22,270 --> 00:01:29,590
The process of learning begins with observations or data, such as examples, direct experience, or instruction,

15
00:01:29,590 --> 00:01:35,260
in order to look for patterns in data and make better decisions in the future based on the examples

16
00:01:35,260 --> 00:01:42,220
that we provide. The primary aim is to allow the computers to learn automatically without human intervention

17
00:01:42,520 --> 00:01:46,150
or assistance and adjust actions accordingly.

18
00:01:46,150 --> 00:01:47,740
Wow, that seems pretty magical.

19
00:01:52,890 --> 00:01:54,610
Here's another one.

20
00:01:54,630 --> 00:02:00,690
Machine learning can be explained as automating and improving the learning process of computers based

21
00:02:00,690 --> 00:02:07,250
on their experiences without actually being programmed, i.e., without any human assistance.

22
00:02:07,380 --> 00:02:14,340
The process starts with feeding good quality data and then training our machines (computers) by building

23
00:02:14,340 --> 00:02:18,930
machine learning models using the data and different algorithms.

24
00:02:18,930 --> 00:02:26,350
Again, very generic, but no substance.

25
00:02:26,400 --> 00:02:29,150
In fact, these two definitions were so poorly written

26
00:02:29,160 --> 00:02:33,680
that I had to fix them up a little bit due to some obvious grammatical errors.

27
00:02:33,750 --> 00:02:38,430
I didn't edit them too heavily since I didn't want to lose the gist of what they were saying, but you

28
00:02:38,430 --> 00:02:43,250
can tell that these were not written by people who actually know about machine learning.

29
00:02:43,260 --> 00:02:46,450
None of this tells me how machine learning actually works.

30
00:02:46,500 --> 00:02:49,790
This is what I would expect if I were not a student of machine learning,

31
00:02:49,800 --> 00:02:55,080
in other words, just a layman, and I wanted to know how someone completely ignorant of machine

32
00:02:55,080 --> 00:02:57,590
learning would view it at a high level.

33
00:02:57,630 --> 00:03:02,070
In other words, as students of machine learning, this is not really how you want to think of what we are

34
00:03:02,070 --> 00:03:02,490
doing.

35
00:03:03,360 --> 00:03:06,840
So starting now, you are no longer just a layman.

36
00:03:06,840 --> 00:03:08,400
You are a machine learning student.

37
00:03:13,510 --> 00:03:20,380
So instead, I'm going to teach you my dumb-as-possible approach, which is encapsulated by the motto: machine

38
00:03:20,380 --> 00:03:23,630
learning is nothing but a geometry problem.

39
00:03:23,680 --> 00:03:29,590
Let's repeat that: machine learning is nothing but a geometry problem.

40
00:03:29,630 --> 00:03:31,760
The best way to illustrate this is by example.

41
00:03:37,000 --> 00:03:42,280
Let's start with an example of regression. Regression basically means you're trying to fit a line or

42
00:03:42,280 --> 00:03:49,420
a curve, so right away you understand that this is geometry. Geometry is just lines, planes, curves, circles,

43
00:03:49,420 --> 00:03:50,710
and so forth.

44
00:03:50,770 --> 00:03:55,750
That's why you'll sometimes hear grumpy old statisticians say things like "machine learning is nothing

45
00:03:55,750 --> 00:03:58,470
but glorified curve fitting." By the way,

46
00:03:58,470 --> 00:04:03,790
if you are a grumpy old statistician, then you're probably going to be offended by everything we do.

47
00:04:03,820 --> 00:04:06,290
You can take this up with Geoffrey Hinton.

48
00:04:06,580 --> 00:04:12,920
So let's say you're a data analyst and you want to know how salary is related to years of experience.

49
00:04:13,000 --> 00:04:19,420
You might model this as a straight line where the input x is years of experience and the output y hat

50
00:04:19,660 --> 00:04:25,060
is the predicted salary. As you may recall from your high school math studies,

51
00:04:25,130 --> 00:04:29,650
the equation of the line is y hat equals m x plus b.

52
00:04:29,960 --> 00:04:33,980
Here, m is the slope and b is called the y-intercept.

53
00:04:33,980 --> 00:04:36,980
Your job, of course, is to find these values of m and b.

54
00:04:42,070 --> 00:04:43,420
To expand on this a little bit,

55
00:04:43,750 --> 00:04:47,950
here's how you would do this in the quote-unquote real world.

56
00:04:47,950 --> 00:04:53,470
Let's say, for example, you are a data scientist at LinkedIn or Glassdoor, so you have access to their

57
00:04:53,470 --> 00:04:59,410
database and you can see for each user how many years of experience they have and what their current

58
00:04:59,410 --> 00:05:00,970
salary is.

59
00:05:00,970 --> 00:05:06,120
Let's call these data points x 1 up to x N and y 1 up to y N.

60
00:05:06,250 --> 00:05:12,910
As mentioned previously, x is the years of experience and y represents the salary. We use y to represent

61
00:05:12,910 --> 00:05:16,960
the true salary, whereas y hat represents the predicted salary.

62
00:05:18,870 --> 00:05:26,020
Both x and y are indexed by the numbers 1 to N, so there are N people in our database. Then what

63
00:05:26,020 --> 00:05:31,720
you're going to do is you're going to take a big piece of graph paper and plot each of these x y data

64
00:05:31,720 --> 00:05:38,170
points. You're physically going to draw a dot for each data point. Then, once you're finished drawing all

65
00:05:38,170 --> 00:05:43,090
your dots, you're going to draw a line that goes as close as possible to all these data points.

66
00:05:43,360 --> 00:05:48,400
Pretty simple, I think, and believe it or not, this is exactly what machine learning is.
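
To make the line-fitting idea concrete, here is a minimal sketch in Python. The lecture does not say how m and b are found, so this assumes the ordinary least-squares formulas, and the data points are made up purely for illustration (the names `xs`, `ys`, `fit_line`, and `predict_salary` are hypothetical).

```python
# Hypothetical data: years of experience (x) and salary in thousands (y).
# These numbers are invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [45.0, 50.0, 60.0, 65.0, 80.0]


def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y_hat = m*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels out).
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    b = y_mean - m * x_mean
    return m, b


m, b = fit_line(xs, ys)


def predict_salary(years):
    """Predicted salary y_hat for a given number of years of experience."""
    return m * years + b


print(m, b)                 # slope and y-intercept of the fitted line
print(predict_salary(6.0))  # prediction for someone with 6 years of experience
```

The fitted line does not pass through every dot; it is simply the straight line that keeps the squared vertical distances to the dots as small as possible.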

67
00:05:53,580 --> 00:05:57,450
So, as mentioned, the example we just discussed is called regression.

68
00:05:57,450 --> 00:06:01,510
This is where you're trying to fit a line or a curve to some data points.

69
00:06:01,530 --> 00:06:06,630
Now there are two ways we can make this more complicated and more like the data sets we might encounter

70
00:06:06,630 --> 00:06:08,210
in the real world.

71
00:06:08,520 --> 00:06:12,990
First, we might have more than one input feature. In the previous example,

72
00:06:13,170 --> 00:06:18,660
the number of years of experience was the only input feature, but realistically you might measure more

73
00:06:18,660 --> 00:06:19,760
things.

74
00:06:19,800 --> 00:06:25,290
For example, you might record things like what kind of degree the person has, like a bachelor's, master's,

75
00:06:25,290 --> 00:06:26,320
or PhD.

76
00:06:26,610 --> 00:06:32,170
You might record what school they went to, what country they're from, what their age is, and their gender.

77
00:06:32,190 --> 00:06:37,500
All of these factors, or features, might affect the salary. When you do this,

78
00:06:37,540 --> 00:06:40,150
the object you're trying to fit is no longer a line.

79
00:06:40,870 --> 00:06:46,420
If you have two input features, then you have three dimensions in total, so you get a plane.

80
00:06:46,420 --> 00:06:52,000
If you have more than two input features, then you get a hyperplane. By the way, in case it's not obvious,

81
00:06:52,240 --> 00:07:03,450
you can't visualize anything beyond three dimensions because the physical world itself is three-dimensional.
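
Even though a hyperplane cannot be drawn, it is easy to write down. The sketch below assumes the usual linear form y hat = w1*x1 + ... + wD*xD + b; the function name `predict` and all weights and feature values are hypothetical, chosen only to show that a line and a hyperplane are the same formula with a different number of inputs.

```python
def predict(features, weights, bias):
    """Linear prediction: y_hat = w1*x1 + ... + wD*xD + b for any number of features D."""
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature")
    return sum(w * x for w, x in zip(weights, features)) + bias


# One input feature (years of experience): the model is a line.
line_prediction = predict([3.0], [8.5], 34.5)

# Three input features (hypothetical: years of experience, degree level, age):
# the model is a hyperplane, but the code is unchanged.
hyperplane_prediction = predict([3.0, 1.0, 30.0], [8.0, 5.0, 0.5], 20.0)

print(line_prediction, hyperplane_prediction)
```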

82
00:07:03,520 --> 00:07:08,140
The second way we can make the regression problem more complicated is that instead of trying to fit

83
00:07:08,170 --> 00:07:15,190
a straight or non-curved object like a line or a plane, we can fit a curve. Many real-world data sets

84
00:07:15,190 --> 00:07:16,860
are nonlinear.

85
00:07:16,990 --> 00:07:19,840
Think of something simple like exercise.

86
00:07:19,840 --> 00:07:26,350
If I do 20 push-ups, will I gain twice as much muscle as I would have if I did 10 push-ups?

87
00:07:26,350 --> 00:07:30,090
If I do 30 push-ups, will I gain three times as much muscle?

88
00:07:30,130 --> 00:07:31,820
Of course, the answer is no.

89
00:07:31,840 --> 00:07:37,020
Otherwise, we'd have people doing push-ups all the time and becoming very large people.

90
00:07:37,510 --> 00:07:46,390
At some point, the benefit of doing push-ups is going to taper off.

91
00:07:46,410 --> 00:07:52,570
Now let's turn our attention to another kind of machine learning problem known as classification. Both

92
00:07:52,570 --> 00:07:58,000
classification and regression are examples of supervised learning, which is going to be the main focus

93
00:07:58,000 --> 00:08:05,250
of this course. We will sometimes discuss unsupervised learning as well, but the main focus will be supervised

94
00:08:05,250 --> 00:08:09,900
learning. So whereas regression is concerned with predicting a real-valued

95
00:08:09,930 --> 00:08:17,220
number, classification is concerned with predicting a category. One popular example is image classification,

96
00:08:17,790 --> 00:08:23,790
where your model accepts as input an image of a dog or a cat and tries to predict whether the label

97
00:08:24,000 --> 00:08:25,390
should be dog or cat.

98
00:08:26,220 --> 00:08:32,240
However, in this lecture, we are going to look at an example that helps us build this geometrical intuition

99
00:08:32,250 --> 00:08:33,230
we've been talking about.

100
00:08:38,400 --> 00:08:44,490
So let's say I want to predict the risk of cardiovascular disease given a patient's height and weight.

101
00:08:44,550 --> 00:08:50,920
This is a categorical problem because my prediction is either going to be at risk or not at risk.

102
00:08:50,940 --> 00:08:54,210
Again this comes down to a data collection experiment.

103
00:08:54,300 --> 00:08:59,100
I'm going to look at all my hospital records and I'm going to write down all of this information in

104
00:08:59,100 --> 00:09:00,960
an Excel spreadsheet.

105
00:09:00,960 --> 00:09:07,590
I'll have two columns to represent my x and one column to represent my target y. By the way,

106
00:09:07,590 --> 00:09:12,410
note that it's customary to represent binary targets as the integers 0 and 1.

107
00:09:13,050 --> 00:09:17,480
So we would say not at risk is 0 and at risk is 1.

108
00:09:17,720 --> 00:09:20,940
Note that there is no reason that at risk should be 1.

109
00:09:20,990 --> 00:09:26,720
Just like the dogs and cats example, it doesn't matter if dogs are 1 and cats are 0 or dogs are 0

110
00:09:26,720 --> 00:09:27,700
and cats are 1.

111
00:09:27,710 --> 00:09:29,240
This assignment is just arbitrary.

112
00:09:34,320 --> 00:09:34,770
OK.

113
00:09:34,800 --> 00:09:38,290
So now that I've collected all my data, what am I going to do?

114
00:09:38,310 --> 00:09:42,350
Well, again, I'm going to plot these on a grid. On the horizontal axis,

115
00:09:42,360 --> 00:09:46,970
I'm going to have my first feature, the patient's height. On the vertical axis,

116
00:09:46,980 --> 00:09:52,680
I'm going to have my second feature, the patient's weight. Now, importantly, unlike regression,

117
00:09:52,680 --> 00:09:59,430
the target does not get its own axis. Instead, because it's categorical, we would represent it with color.

118
00:10:00,270 --> 00:10:10,210
So I might color the at-risk patients as blue and the not-at-risk patients as green.

119
00:10:10,430 --> 00:10:13,900
So what is my model going to be for classification?

120
00:10:13,900 --> 00:10:16,190
Well, the simplest model, again, is a line.

121
00:10:17,020 --> 00:10:21,670
However, this line is a little bit different from our regression line from earlier.

122
00:10:21,670 --> 00:10:25,740
The regression line, if you recall, was the line of best fit.

123
00:10:25,900 --> 00:10:30,190
We were trying to get the line to be close to all the data points.

124
00:10:30,340 --> 00:10:33,180
Now we are no longer trying to do that.

125
00:10:33,190 --> 00:10:39,070
Instead, we are trying to get the line to separate the two groups of data points, or in other words, separate

126
00:10:39,070 --> 00:10:45,770
the different colors. As you can see, this again boils down to a geometry problem.

127
00:10:45,950 --> 00:10:49,910
How do we find this line that can indeed separate these categories?
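
One classic way to find such a separating line is the perceptron update rule. The lecture does not name a specific algorithm, so this is only a sketch under that assumption. The data points are hypothetical and assumed to be already centered (think of standardized height and weight), and the names `train_perceptron` and `classify` are made up for this example.

```python
# Hypothetical, already-centered features (e.g. standardized height and weight).
# Label 0 = not at risk, label 1 = at risk. Values are invented for illustration.
points = [(-1.0, -1.2), (-0.5, -0.8), (-1.5, -0.3), (-0.2, -1.0),
          (1.0, 0.8), (0.6, 1.2), (1.4, 0.2), (0.3, 1.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def train_perceptron(points, labels, max_epochs=100):
    """Find w1, w2, b so that the line w1*x1 + w2*x2 + b = 0 separates the two classes."""
    w1, w2, b = 0.0, 0.0, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), label in zip(points, labels):
            t = 1 if label == 1 else -1  # map labels {0, 1} to {-1, +1}
            if t * (w1 * x1 + w2 * x2 + b) <= 0:
                # The point is on the wrong side (or on the line): nudge the line.
                w1 += t * x1
                w2 += t * x2
                b += t
                mistakes += 1
        if mistakes == 0:  # every point is on its correct side, so stop
            break
    return w1, w2, b


def classify(point, w1, w2, b):
    """Return 1 if the point lies on the positive side of the line, else 0."""
    x1, x2 = point
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0


w1, w2, b = train_perceptron(points, labels)
predictions = [classify(p, w1, w2, b) for p in points]
print(predictions == labels)
```

This only works when a straight line can actually separate the two colors, which is true for the toy data above.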

128
00:10:55,050 --> 00:10:59,910
One lesson that goes hand-in-hand with the geometrical perspective is another motto of mine:

129
00:10:59,970 --> 00:11:01,110
all data is the same.

130
00:11:02,040 --> 00:11:08,090
Let's use another example to illustrate the point. Suppose instead of working at a hospital, you now work

131
00:11:08,120 --> 00:11:15,440
at an insurance company. Your job is to do fraud detection, or in other words, classify instances of fraud.

132
00:11:15,440 --> 00:11:22,040
So again, you have two categories: fraud or not fraud. You're trying to predict whether or not a given

133
00:11:22,040 --> 00:11:23,740
claim is fraud.

134
00:11:23,870 --> 00:11:25,870
Let's say, again, you've collected some data points,

135
00:11:25,880 --> 00:11:28,370
again with two input features.

136
00:11:28,370 --> 00:11:32,640
Suppose these are, number one, the amount of debt the claimant has,

137
00:11:32,780 --> 00:11:36,320
and number two, the total amount of past insurance claims.

138
00:11:41,440 --> 00:11:43,400
And again, it's the same story.

139
00:11:43,540 --> 00:11:48,670
Once you've collected all your data points, you're going to plot them and color them on a grid. Your job,

140
00:11:48,670 --> 00:11:52,020
again, is to separate these data points with a line or a curve.

141
00:11:52,750 --> 00:11:58,360
So you see that just because we changed the meaning of the problem, we haven't changed what our actual

142
00:11:58,360 --> 00:11:59,250
task is.

143
00:11:59,770 --> 00:12:05,380
It's still to plot these points and separate them with some kind of decision boundary. A line is the

144
00:12:05,380 --> 00:12:06,280
simplest,

145
00:12:06,400 --> 00:12:15,330
but just like with regression, there are two ways we can make this more complicated.

146
00:12:15,410 --> 00:12:20,420
The first way is that we can have more input features so that it's not possible to plot them on a grid

147
00:12:20,420 --> 00:12:22,270
that we can visualize.

148
00:12:22,370 --> 00:12:28,430
The second way is that the decision boundary may not be linear, in which case the model will be an equation

149
00:12:28,430 --> 00:12:32,390
that's more complicated than a line or hyperplane.

150
00:12:32,410 --> 00:12:35,970
Importantly, however, the moral of the story remains the same.

151
00:12:35,980 --> 00:12:37,890
This is still a geometry problem.

152
00:12:43,000 --> 00:12:48,360
And in fact, for our two classification examples, the geometry problem was the same.

153
00:12:48,490 --> 00:12:53,440
The only thing that changed was the meaning of the numbers. Of course, machine learning algorithms don't

154
00:12:53,440 --> 00:12:58,390
care about these meanings, and so they are essentially irrelevant in the eyes of the machine learning

155
00:12:58,390 --> 00:12:59,160
model.

156
00:12:59,200 --> 00:13:02,210
That's why we say all data is the same.

157
00:13:02,230 --> 00:13:05,080
This becomes a very powerful concept in the future.

158
00:13:05,080 --> 00:13:10,980
For example, the same kind of model used for neural machine translation can also be used for a chatbot.

159
00:13:10,980 --> 00:13:17,710
The same kind of model used for sentiment analysis can also be used for spam detection. Thinking

160
00:13:17,710 --> 00:13:26,450
in this way essentially gives you machine learning superpowers.

161
00:13:26,500 --> 00:13:32,050
To summarize this lecture, the goal was to take the magic away from machine learning. You learned a very

162
00:13:32,050 --> 00:13:35,430
important lesson, which is that machine learning isn't magic.

163
00:13:35,440 --> 00:13:37,610
In fact, it's just geometry.

164
00:13:37,900 --> 00:13:43,150
We learned about the two different kinds of geometry problems that supervised learning can solve.

165
00:13:43,240 --> 00:13:48,910
The first kind is regression, where we try to get a line, plane, hyperplane, or curve to be as close

166
00:13:48,910 --> 00:13:52,470
as possible to the data points from a given dataset.

167
00:13:52,510 --> 00:13:57,970
The second kind is classification, where instead of trying to get the curve as close as possible to the

168
00:13:57,970 --> 00:14:02,680
data points, we try to separate data points belonging to different categories.

169
00:14:05,160 --> 00:14:09,180
One important lesson that goes along with this is that all data is the same.

170
00:14:09,360 --> 00:14:14,370
It doesn't matter if we're trying to classify risk of cardiovascular disease or if we're trying to classify

171
00:14:14,370 --> 00:14:16,440
fraud in an insurance company.

172
00:14:16,440 --> 00:14:20,960
The general problem remains the same, and it is a problem of geometry.
