1
00:00:11,620 --> 00:00:15,600
In this lecture, we are going to answer the question, what is machine learning?

2
00:00:16,420 --> 00:00:21,340
I always like to discuss this first because I take a very different approach compared to most other

3
00:00:21,340 --> 00:00:24,250
resources you'll find in general.

4
00:00:24,250 --> 00:00:30,130
I think the approach most other resources take is that they like to frame machine learning as something

5
00:00:30,130 --> 00:00:34,870
complex and something fancy and something futuristic and exciting.

6
00:00:35,800 --> 00:00:38,680
My approach instead is rather kind of depressing.

7
00:00:39,220 --> 00:00:44,230
How I want you to think of machine learning is that it's something not fancy at all and something that

8
00:00:44,230 --> 00:00:46,400
only requires spatial reasoning.

9
00:00:47,050 --> 00:00:50,710
My goal is to make machine learning sound as dumb as possible.

10
00:00:55,860 --> 00:00:59,980
Here are some anonymous definitions of machine learning that I found on the Internet.

11
00:01:00,720 --> 00:01:01,980
Let me read this one out to you.

12
00:01:02,760 --> 00:01:09,840
Machine learning is an application of artificial intelligence that provides systems the ability to automatically

13
00:01:09,840 --> 00:01:13,950
learn and improve from experience without being explicitly programmed.

14
00:01:14,670 --> 00:01:20,310
Machine learning focuses on the development of computer programs that can access data and use it to

15
00:01:20,310 --> 00:01:21,540
learn for themselves.

16
00:01:22,230 --> 00:01:29,160
The process of learning begins with observations or data such as examples, direct experience or instruction

17
00:01:29,490 --> 00:01:35,250
in order to look for patterns in data and make better decisions in the future based on the examples

18
00:01:35,250 --> 00:01:36,180
that we provide.

19
00:01:36,810 --> 00:01:43,380
The primary aim is to allow the computers to learn automatically without human intervention or assistance

20
00:01:43,380 --> 00:01:45,120
and adjust actions accordingly.

21
00:01:46,050 --> 00:01:47,730
Wow, that seems pretty magical.

22
00:01:52,800 --> 00:01:58,680
Here's another one: machine learning can be explained as automating and improving the learning process

23
00:01:58,830 --> 00:02:06,790
of computers based on their experiences without actually being programmed, i.e. without any human assistance.

24
00:02:07,320 --> 00:02:14,340
The process starts with feeding good quality data and then training our machines (computers) by building

25
00:02:14,340 --> 00:02:17,400
machine learning models, using the data and different algorithms.

26
00:02:18,810 --> 00:02:21,270
Again, very generic, but no substance.

27
00:02:26,290 --> 00:02:31,660
In fact, these two definitions were so poorly written, I had to fix them up a little bit due to some

28
00:02:31,660 --> 00:02:36,970
obvious grammatical errors. I didn't edit them too heavily since I didn't want to lose the gist of

29
00:02:36,970 --> 00:02:37,780
what they were saying.

30
00:02:38,110 --> 00:02:42,380
But you can tell that these were not written by people who actually know about machine learning.

31
00:02:43,180 --> 00:02:45,790
None of this tells me how machine learning actually works.

32
00:02:46,450 --> 00:02:49,670
This is what I would expect if I were not a student of machine learning.

33
00:02:49,690 --> 00:02:51,460
In other words, just a layman.

34
00:02:51,880 --> 00:02:56,500
And I wanted to know about how someone completely ignorant of machine learning would view it at a high

35
00:02:56,500 --> 00:02:56,890
level.

36
00:02:57,550 --> 00:03:01,870
In other words, as students of machine learning, this is not really how you want to think of what

37
00:03:01,870 --> 00:03:02,480
we are doing.

38
00:03:03,310 --> 00:03:06,490
So starting now, you are no longer just a layman.

39
00:03:06,760 --> 00:03:08,410
You are a machine learning student.

40
00:03:13,450 --> 00:03:19,540
So instead, I'm going to teach you my dumb-as-possible approach, which is encapsulated by the motto:

41
00:03:19,930 --> 00:03:22,930
machine learning is nothing but a geometry problem.

42
00:03:23,590 --> 00:03:28,620
Let's repeat that: machine learning is nothing but a geometry problem.

43
00:03:29,560 --> 00:03:31,750
The best way to illustrate this is by example.

44
00:03:36,940 --> 00:03:42,290
Let's start with an example of regression. Regression basically means you're trying to fit a line or

45
00:03:42,290 --> 00:03:46,060
a curve, so right away you understand that this is geometry.

46
00:03:46,570 --> 00:03:50,120
Geometry is just lines, planes, curves, circles and so forth.

47
00:03:50,740 --> 00:03:52,330
That's why sometimes your grumpy

48
00:03:52,330 --> 00:03:57,160
old statisticians say things like machine learning is nothing but glorified curve fitting.

49
00:03:57,910 --> 00:04:02,950
By the way, if you are a grumpy old statistician, then you're probably going to be offended by everything

50
00:04:02,950 --> 00:04:03,330
we do.

51
00:04:03,760 --> 00:04:05,670
You can take this up with Geoffrey Hinton.

52
00:04:06,520 --> 00:04:12,280
So let's say you're a data analyst and you want to know how salary is related to years of experience.

53
00:04:12,910 --> 00:04:19,420
You might model this as a straight line where the input X is years of experience and the output Y hat

54
00:04:19,570 --> 00:04:20,860
is the predicted salary.

55
00:04:22,390 --> 00:04:28,630
As you may recall from your high school math studies, the equation of a line is Y hat equals M X

56
00:04:28,630 --> 00:04:33,370
plus B. Here, M is the slope and B is called the Y-intercept.

57
00:04:33,910 --> 00:04:36,970
Your job, of course, is to find these values of M and B.

58
00:04:42,060 --> 00:04:47,300
To expand on this a little bit, here's how you would do this in the quote unquote real world.

59
00:04:47,910 --> 00:04:53,490
Let's say, for example, you're a data scientist at LinkedIn or Glassdoor, so you have access to their

60
00:04:53,490 --> 00:04:54,180
database.

61
00:04:54,390 --> 00:05:00,260
And you can see for each user how many years of experience they have and what their current salary is.

62
00:05:00,900 --> 00:05:05,040
Let's call these data points X one up to X N and Y one up to Y N.

63
00:05:06,150 --> 00:05:11,100
As mentioned previously, X is the years of experience and Y represents the salary.

64
00:05:11,700 --> 00:05:16,950
We use Y to represent the true salary, whereas Y hat represents the predicted salary.

65
00:05:18,810 --> 00:05:24,450
Both X and Y are indexed by the numbers one to N, so there are N people in our database.

66
00:05:25,500 --> 00:05:31,080
Then what you're going to do is you're going to take a big piece of graph paper and plot each of these

67
00:05:31,080 --> 00:05:32,250
x y data points.

68
00:05:32,700 --> 00:05:35,790
You're physically going to draw a dot for each data point.

69
00:05:36,660 --> 00:05:41,670
Then once you're finished drawing all your dots, you're going to draw a line that goes through all

70
00:05:41,670 --> 00:05:42,590
these data points.

71
00:05:43,320 --> 00:05:44,460
Pretty simple, I think.

72
00:05:44,760 --> 00:05:48,360
And believe it or not, this is exactly what machine learning is.
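
To make the graph paper exercise concrete, here is a minimal sketch in Python of finding M and B with the standard least-squares formulas. The lecture does not say which fitting method is used, so least squares is an assumption here, and the `fit_line` helper and the years and salary numbers are made up purely for illustration. With noisy data the line will not pass exactly through every dot; it gets as close to them as it can.

```python
def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y_hat = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (not divided by n, since the ratio is the same).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Hypothetical data: years of experience and salary in thousands.
years = [1, 2, 3, 4, 5]
salary = [45, 50, 60, 65, 70]

m, b = fit_line(years, salary)
print(f"y_hat = {m:.1f} * x + {b:.1f}")  # prints "y_hat = 6.5 * x + 38.5"
```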

73
00:05:53,530 --> 00:05:56,870
So, as mentioned, the example we just discussed is called a regression.

74
00:05:57,430 --> 00:06:00,760
This is where you're trying to fit a line or a curve to some data points.

75
00:06:01,450 --> 00:06:06,640
Now, there are two ways we can make this more complicated and more like the data sets we might encounter

76
00:06:06,640 --> 00:06:07,550
in the real world.

77
00:06:08,470 --> 00:06:10,720
First, we might have more than one input feature.

78
00:06:11,350 --> 00:06:16,100
In the previous example, the number of years of experience was the only input feature.

79
00:06:16,570 --> 00:06:19,300
But realistically, you might measure more things.

80
00:06:19,760 --> 00:06:24,850
For example, you might record things like what kind of degree the person has, like a bachelor's,

81
00:06:24,850 --> 00:06:30,130
master's or Ph.D. You might record what school they went to, what country they're from, what their

82
00:06:30,130 --> 00:06:31,470
age is, and their gender.

83
00:06:32,110 --> 00:06:35,410
All of these factors or features might affect the salary.

84
00:06:36,370 --> 00:06:40,170
When you do this, the object you're trying to fit is no longer a line.

85
00:06:40,780 --> 00:06:44,500
If you have two input features, then you have three dimensions in total.

86
00:06:44,650 --> 00:06:45,580
So you get a plane.

87
00:06:46,300 --> 00:06:49,180
If you have more than two input features, then you get a hyperplane.

88
00:06:50,080 --> 00:06:55,330
By the way, in case it's not obvious, you can't visualize anything beyond three dimensions because

89
00:06:55,330 --> 00:06:58,420
the physical world itself is three dimensions.

90
00:07:03,460 --> 00:07:08,170
The second way we can make the regression problem more complicated is that instead of trying to fit

91
00:07:08,170 --> 00:07:13,060
a straight or non curved object like a line or a plane, we can fit a curve.

92
00:07:13,840 --> 00:07:15,940
Many real world data sets are non-linear.

93
00:07:16,960 --> 00:07:19,410
Think of something simple like exercise.

94
00:07:19,780 --> 00:07:25,690
If I do 20 pushups, will I gain twice as much muscle as I would have if I did 10 pushups?

95
00:07:26,230 --> 00:07:29,530
If I do 30 pushups, will I gain three times as much muscle?

96
00:07:30,040 --> 00:07:31,430
Of course, the answer is no.

97
00:07:31,750 --> 00:07:36,490
Otherwise, we'd have people doing pushups all the time and becoming very large people.

98
00:07:37,390 --> 00:07:41,130
At some point, the benefit of doing pushups is going to taper off.
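
As a rough illustration of fitting a curve instead of a straight line, the sketch below fits Y hat equals A times log(X) plus B, which rises quickly at first and then tapers off. The choice of a logarithmic curve, the helper names `fit_line` and `fit_log_curve`, and the pushup numbers are all assumptions made up for this example; the lecture does not specify a particular curve.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept for y_hat = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


def fit_log_curve(xs, ys):
    """Fit y_hat = a*log(x) + b by fitting a straight line to (log(x), y)."""
    return fit_line([math.log(x) for x in xs], ys)


# Hypothetical data: pushups per day and muscle gained (arbitrary units).
pushups = [10, 20, 30, 40]
muscle_gain = [1.0, 1.5, 1.8, 2.0]

a, b = fit_log_curve(pushups, muscle_gain)


def predict(x):
    """Predicted muscle gain for x pushups under the fitted curve."""
    return a * math.log(x) + b


# Doubling or tripling the pushups gives less than double or triple the gain.
print(predict(10), predict(20), predict(30))
```

The curve is still found with the same straight-line machinery; only the input was transformed first, which is one simple way a linear method can produce a non-linear fit.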

99
00:07:46,290 --> 00:07:51,390
Now, let's turn our attention to another kind of machine learning problem known as classification.

100
00:07:52,260 --> 00:07:57,630
Both classification and regression are examples of supervised learning, which is going to be the main

101
00:07:57,630 --> 00:07:58,890
focus of this course.

102
00:08:00,300 --> 00:08:05,610
We will sometimes discuss unsupervised learning as well, but the main focus will be supervised learning.

103
00:08:06,420 --> 00:08:12,390
So whereas regression is concerned with predicting a real-valued number, classification is concerned with

104
00:08:12,390 --> 00:08:19,800
predicting a category. One popular example is image classification, where your model accepts as input

105
00:08:19,800 --> 00:08:25,360
an image of a dog or a cat and tries to predict whether the label should be dog or cat.

106
00:08:26,130 --> 00:08:32,250
However, in this lecture, we're going to look at an example that helps us build this geometrical intuition

107
00:08:32,250 --> 00:08:33,270
we've been talking about.

108
00:08:38,320 --> 00:08:43,690
So let's say I want to predict the risk of cardiovascular disease given a patient's height and weight.

109
00:08:44,530 --> 00:08:50,050
This is a categorical problem because my prediction is either going to be at risk or not at risk.

110
00:08:50,830 --> 00:08:53,580
Again, this comes down to a data collection experiment.

111
00:08:54,190 --> 00:08:59,080
I'm going to look at all my hospital records and I'm going to write down all of this information in

112
00:08:59,080 --> 00:09:00,140
an Excel spreadsheet.

113
00:09:00,850 --> 00:09:05,350
I'll have two columns to represent my X and one column to represent my target,

114
00:09:05,470 --> 00:09:12,400
Y. By the way, note that it's customary to represent binary targets as the integers zero and one.

115
00:09:13,000 --> 00:09:15,190
So we would say not at risk is zero

116
00:09:15,190 --> 00:09:20,400
and at risk is one. Note that there's no reason that at risk should be one.

117
00:09:20,950 --> 00:09:22,840
Just like the dogs and cats example.

118
00:09:22,900 --> 00:09:27,590
It doesn't matter if dogs are one and cats are zero or dogs are zero and cats are one.

119
00:09:27,640 --> 00:09:29,260
This assignment is just arbitrary.

120
00:09:34,310 --> 00:09:37,520
OK, so now that I've collected all my data, what am I going to do?

121
00:09:38,240 --> 00:09:42,340
Well, again, I'm going to plot these on a grid. On the horizontal axis,

122
00:09:42,350 --> 00:09:46,940
I'm going to have my first feature, the patient's height. On the vertical axis,

123
00:09:46,950 --> 00:09:49,540
I'm going to have my second feature, the patient's weight.

124
00:09:50,420 --> 00:09:57,530
Now, importantly, unlike regression, the target does not get its own axis. Instead, because it's categorical,

125
00:09:57,920 --> 00:09:59,440
we would represent it with color.

126
00:10:00,230 --> 00:10:04,910
So I might color the at risk patients as blue and the not at risk patients as green.

127
00:10:09,990 --> 00:10:12,990
OK, so what is my model going to be for classification?

128
00:10:13,830 --> 00:10:16,190
Well, the simplest model, again, is a line.

129
00:10:16,950 --> 00:10:20,700
However, this line is a little bit different from our regression line from earlier.

130
00:10:21,630 --> 00:10:25,220
The regression line, if you recall, was the line of best fit.

131
00:10:25,800 --> 00:10:29,470
We were trying to get the line to be close to all the data points.

132
00:10:30,270 --> 00:10:32,180
Now, we are no longer trying to do that.

133
00:10:33,090 --> 00:10:38,280
Instead, we are trying to get the line to separate the two groups of data points or in other words,

134
00:10:38,490 --> 00:10:40,080
separate the different colors.

135
00:10:41,790 --> 00:10:47,610
As you can see, this, again, boils down to a geometry problem: how do we find this line that can

136
00:10:47,610 --> 00:10:49,910
indeed separate these categories?
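
One way to picture finding such a line is the classic perceptron update rule, sketched below in Python. This is only one of several methods for finding a separating line and is not necessarily the one this course uses later. The function names and the height and weight values (rescaled to small numbers around zero) are hypothetical and chosen so that the two groups can be separated by a straight line.

```python
def train_perceptron(points, labels, epochs=100):
    """Find w1, w2, b so that the line w1*x1 + w2*x2 + b = 0 separates the two classes."""
    w1, w2, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            pred = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
            if pred != y:
                step = y - pred  # +1 if we missed an at-risk point, -1 otherwise
                w1 += step * x1
                w2 += step * x2
                b += step
                mistakes += 1
        if mistakes == 0:  # every point is on the correct side of the line
            break
    return w1, w2, b


def predict(point, w1, w2, b):
    """Return 1 (at risk) or 0 (not at risk) depending on the side of the line."""
    x1, x2 = point
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0


# Hypothetical, rescaled (height, weight) pairs; 0 = not at risk, 1 = at risk.
patients = [(-1.0, -1.2), (-0.5, -0.8), (0.2, -0.9),
            (0.8, 1.0), (0.1, 1.1), (-0.3, 0.9)]
at_risk = [0, 0, 0, 1, 1, 1]

w1, w2, b = train_perceptron(patients, at_risk)
print([predict(p, w1, w2, b) for p in patients])  # prints [0, 0, 0, 1, 1, 1]
```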

137
00:10:54,960 --> 00:11:00,480
One lesson that goes hand-in-hand with the geometrical perspective is another motto of mine: all data

138
00:11:00,480 --> 00:11:01,100
is the same.

139
00:11:01,980 --> 00:11:04,440
Let's use another example to illustrate the point.

140
00:11:05,370 --> 00:11:10,950
Suppose instead of working at a hospital, you now work at an insurance company. Your job is to do

141
00:11:10,950 --> 00:11:14,760
fraud detection or in other words, classify instances of fraud.

142
00:11:15,360 --> 00:11:17,250
So again, you have two categories:

143
00:11:17,260 --> 00:11:18,620
fraud or not fraud.

144
00:11:19,880 --> 00:11:25,100
You're trying to predict whether or not a given claim is fraud. Let's say, again, you've collected

145
00:11:25,100 --> 00:11:27,860
some data points, again with two input features.

146
00:11:28,340 --> 00:11:32,660
Suppose these are, number one, the amount of debt the claimant has.

147
00:11:32,690 --> 00:11:36,290
And number two, the total amount of past insurance claims.

148
00:11:41,310 --> 00:11:46,290
And again, it's the same story. Once you've collected all your data points, you're going to plot them

149
00:11:46,290 --> 00:11:47,600
and color them on a grid.

150
00:11:48,180 --> 00:11:52,030
Your job, again, is to separate these data points with a line or a curve.

151
00:11:52,710 --> 00:11:58,350
So you see that just because we change the meaning of the problem, we haven't changed what our actual

152
00:11:58,350 --> 00:11:59,240
task is.

153
00:11:59,670 --> 00:12:04,110
It's still to plot these points and separate them with some kind of decision boundary.

154
00:12:04,710 --> 00:12:10,260
A line is the simplest, but just like with regression, there are two ways we can make this more complicated.

155
00:12:15,390 --> 00:12:20,430
The first way is that we can have more input features so that it's not possible to plot them on a grid

156
00:12:20,430 --> 00:12:21,660
that we can visualize.

157
00:12:22,350 --> 00:12:27,900
The second way is that the decision boundary may not be linear, in which case the model will be an

158
00:12:27,900 --> 00:12:31,230
equation that's more complicated than a line or hyperplane.

159
00:12:32,310 --> 00:12:35,650
Importantly, however, the moral of the story remains the same.

160
00:12:35,940 --> 00:12:37,890
This is still a geometry problem.

161
00:12:42,930 --> 00:12:48,990
And in fact, for our two classification examples, the geometry problem was the same. The only thing

162
00:12:48,990 --> 00:12:51,210
that changed was the meaning of the numbers.

163
00:12:51,720 --> 00:12:54,570
Of course, machine learning algorithms don't care about these meanings.

164
00:12:54,570 --> 00:12:58,720
And so they are essentially irrelevant in the eyes of the machine learning model.

165
00:12:59,130 --> 00:13:01,300
That's why we say all data is the same.

166
00:13:02,160 --> 00:13:04,630
This becomes a very powerful concept in the future.

167
00:13:05,070 --> 00:13:11,130
For example, the same kind of model used for neural machine translation can also be used for a chatbot.

168
00:13:11,970 --> 00:13:16,500
The same kind of model used for sentiment analysis can also be used for spam detection.

169
00:13:17,250 --> 00:13:21,330
Thinking in this way essentially gives you machine learning superpowers.

170
00:13:26,480 --> 00:13:30,540
To summarize this lecture, the goal was to take the magic away from machine learning.

171
00:13:31,250 --> 00:13:35,180
You learned a very important lesson, which is that machine learning isn't magic.

172
00:13:35,360 --> 00:13:37,120
In fact, it's just geometry.

173
00:13:37,820 --> 00:13:42,560
We learned about the two different kinds of geometry problems that supervised learning can solve.

174
00:13:43,190 --> 00:13:48,920
The first kind is regression, where we try to get a line, plane, hyperplane or curve to be as close

175
00:13:48,920 --> 00:13:51,560
as possible to the data points from a given data set.

176
00:13:52,460 --> 00:13:57,890
The second kind is classification, where instead of trying to get the curve as close as possible to

177
00:13:57,890 --> 00:14:02,660
the data points, we try to separate data points belonging to different categories.

178
00:14:05,070 --> 00:14:10,200
One important lesson that goes along with this is that all data is the same. It doesn't matter if we're

179
00:14:10,200 --> 00:14:15,360
trying to classify risk of cardiovascular disease or if we're trying to classify fraud in an insurance

180
00:14:15,360 --> 00:14:18,330
company. The general problem remains the same.

181
00:14:18,690 --> 00:14:20,970
And it is a problem of geometry.
