1
00:00:11,570 --> 00:00:16,910
At this point, you're probably sick of me saying machine learning is nothing but a geometry problem.

2
00:00:17,390 --> 00:00:19,400
But here's where it gets really important.

3
00:00:20,150 --> 00:00:24,260
The question we want to answer right now is why are neural networks so important?

4
00:00:24,590 --> 00:00:26,450
Why can't we just use a single neuron?

5
00:00:26,990 --> 00:00:28,970
I mean, it seems to be a pretty good model.

6
00:00:29,480 --> 00:00:30,860
It's nice and interpretable.

7
00:00:31,490 --> 00:00:37,040
A large weight means the corresponding input is important, and a very small or zero weight means the

8
00:00:37,040 --> 00:00:38,420
input is not important.

9
00:00:39,050 --> 00:00:42,320
Unfortunately, this type of model only gets us so far.

10
00:00:47,400 --> 00:00:52,350
You'll recall that when we discussed linear regression and logistic regression, I said there are two

11
00:00:52,350 --> 00:00:54,600
different ways we can make things more complicated.

12
00:00:55,230 --> 00:00:59,550
The first way we can make things more complicated is to add multiple inputs.

13
00:00:59,850 --> 00:01:01,500
That's something we encountered already.

14
00:01:02,220 --> 00:01:07,770
This is still pretty simplistic because you can always picture a plane in your head, a straight, non-curving

15
00:01:07,770 --> 00:01:10,350
surface that separates data between classes.

16
00:01:11,160 --> 00:01:16,640
The second way we can make things more complicated is that the decision boundary or regression function

17
00:01:16,770 --> 00:01:19,770
we are looking for is not a straight line or plane.

18
00:01:20,460 --> 00:01:23,880
This is what we are interested in when we discuss neural networks.

19
00:01:29,030 --> 00:01:36,080
The reason the equation w transpose x plus b equals zero gives us a hyperplane is because this is the

20
00:01:36,080 --> 00:01:38,330
actual definition of a hyperplane.

21
00:01:38,960 --> 00:01:40,250
There's no way around this.

22
00:01:40,740 --> 00:01:45,680
There's no possible setting of W and B that could give you a curved surface.

23
00:01:46,340 --> 00:01:50,930
So how can we get a curved surface so that we can solve more complicated problems?

24
00:01:56,020 --> 00:01:59,260
One method you might have thought of is to use feature engineering.

25
00:01:59,980 --> 00:02:02,110
For example, let's take a linear regression.

26
00:02:02,920 --> 00:02:09,520
Suppose that for whatever reason, salary is proportional to the square of the years of experience.

27
00:02:09,669 --> 00:02:14,350
Or put another way that salary is a quadratic function of years of experience.

28
00:02:15,100 --> 00:02:19,840
Then we could say that y hat equals a x squared plus b x plus c.

29
00:02:20,980 --> 00:02:24,520
Believe it or not, this is still just linear regression in disguise.

30
00:02:25,240 --> 00:02:31,480
It's easy to see this by doing the following: take x, the years of experience, and call that the input

31
00:02:31,480 --> 00:02:39,340
feature x1. Then take x squared, the years of experience squared, and call that the input feature x2.

32
00:02:40,000 --> 00:02:47,950
Then my y hat becomes an equation of the form y hat equals w1 x1 plus w2 x2 plus b.

33
00:02:48,760 --> 00:02:52,150
Of course, we've already learned that this is just a linear regression.
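
As a rough sketch of that idea, here is the quadratic salary example fitted as an ordinary linear regression on the two engineered features. It assumes NumPy, and the data and the coefficients 2, 3 and 40 are made up for illustration; they are not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: salary as a quadratic function of years of experience.
x = rng.uniform(0, 20, size=200)
y = 2.0 * x**2 + 3.0 * x + 40.0

# Feature engineering: x1 = x, x2 = x squared, plus a column of ones for the bias b.
X = np.column_stack([x, x**2, np.ones_like(x)])

# Ordinary least squares: plain linear regression on the engineered features.
w1, w2, b = np.linalg.lstsq(X, y, rcond=None)[0]
print(w1, w2, b)
```

Because the synthetic data has no noise, the fitted w1, w2 and b should come out very close to 3, 2 and 40, even though the model is linear in its inputs x1 and x2.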

34
00:02:57,410 --> 00:03:02,840
The problem with feature engineering is that there are many different possible features. Squaring the

35
00:03:02,840 --> 00:03:06,170
inputs is common, but so is combining the inputs.

36
00:03:06,470 --> 00:03:12,200
So if I have the inputs x1 and x2, then I could make one of the features x1 times x2.

37
00:03:12,650 --> 00:03:14,510
We call these interaction terms.

38
00:03:15,380 --> 00:03:17,990
But what happens if I have three input features?

39
00:03:18,530 --> 00:03:22,430
Now I have to consider x1 squared, x2 squared, and x3 squared.

40
00:03:22,850 --> 00:03:27,650
But we also have to consider x1 x2, x1 x3, and x2 x3.

41
00:03:28,310 --> 00:03:30,980
In total, there are six quadratic terms.

42
00:03:31,910 --> 00:03:37,910
As an exercise, you might want to list out all the possible combinations we could have if we had four

43
00:03:37,910 --> 00:03:39,500
inputs or five inputs.

44
00:03:40,160 --> 00:03:46,910
You should discover that the number of terms we have to consider grows quite fast and thus feature engineering,

45
00:03:47,270 --> 00:03:51,500
even in this simple form seems to be a somewhat clumsy solution.
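
If you want to check your answer to that exercise, here is a small sketch that lists the quadratic terms for a given number of inputs. The helper name quadratic_terms is just something made up for this example.

```python
from itertools import combinations_with_replacement


def quadratic_terms(n_inputs):
    """List every degree-2 term (squares and interaction terms) for n_inputs features."""
    names = [f"x{i}" for i in range(1, n_inputs + 1)]
    # Pairs with replacement give both the squares (x1*x1) and the interactions (x1*x2).
    return [a + "*" + b for a, b in combinations_with_replacement(names, 2)]


for n in (3, 4, 5):
    terms = quadratic_terms(n)
    print(n, len(terms), terms)
```

The count follows n times (n plus 1) over 2, so it is 6 for three inputs, 10 for four and 15 for five, and that is before considering cubic or higher-order terms.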

46
00:03:56,590 --> 00:04:03,280
But now let's remember, if we consider just the first layer of a neural network, we can see that there

47
00:04:03,280 --> 00:04:04,540
are multiple neurons.

48
00:04:05,170 --> 00:04:08,260
Importantly, there are multiple different neurons.

49
00:04:08,800 --> 00:04:15,640
Each neuron is a different nonlinear feature derived from all the inputs because we've applied the sigmoid

50
00:04:15,640 --> 00:04:16,779
activation function.

51
00:04:17,200 --> 00:04:21,670
The feature is not just a simple, linear combination of input features.

52
00:04:22,240 --> 00:04:27,340
So you can think of these as feature one, feature two, feature three, all the way up to feature M.

53
00:04:32,510 --> 00:04:36,590
Now, a lot of people wonder, how does this make a neural network non-linear?

54
00:04:37,850 --> 00:04:42,320
Remember that a linear function always takes the form w transpose x plus b.

55
00:04:43,100 --> 00:04:48,230
Our neural network takes the form of the equation you see here for a two layer neural network.

56
00:04:50,050 --> 00:04:56,620
This is what we get if we drop at the bottom and just make the whole thing into one equation: W2 transpose

57
00:04:56,620 --> 00:05:04,900
sigmoid of W1 transpose x plus b1, and then plus b2. The important thing to notice about this

58
00:05:04,900 --> 00:05:08,710
is you cannot simplify this into a linear function.

59
00:05:09,310 --> 00:05:12,610
If you could, then we would have a linear decision boundary.

60
00:05:17,710 --> 00:05:20,770
So let's see what would happen if we had no sigmoid at all.

61
00:05:21,520 --> 00:05:28,750
Let's suppose we have a regression network with no sigmoid. Then it's just W2 transpose times W1 transpose

62
00:05:28,750 --> 00:05:31,210
X plus B1 plus B2.

63
00:05:32,050 --> 00:05:37,480
Of course, using basic arithmetic, we can expand the product due to the fact that there's no sigmoid;

64
00:05:37,510 --> 00:05:39,370
it's just straightforward multiplication.

65
00:05:40,090 --> 00:05:43,810
Then we just get W prime transpose x plus b prime.

66
00:05:44,230 --> 00:05:52,420
We have substituted W prime transpose equals W2 transpose times W1 transpose, and b prime equals W2 transpose

67
00:05:52,420 --> 00:05:54,610
times b1 plus b2.

68
00:05:55,540 --> 00:06:01,150
And so you can see that if we have no sigmoid, this just reduces to a linear equation.
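
Here is a quick numerical sketch of that collapse. It assumes NumPy, and the layer sizes and random weights are arbitrary choices for illustration. Note that W_prime is built as W1 times W2, so that its transpose equals W2 transpose times W1 transpose.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 3, 4                      # input size and number of hidden units (arbitrary)
W1 = rng.normal(size=(D, M))
b1 = rng.normal(size=M)
W2 = rng.normal(size=(M, 1))
b2 = rng.normal(size=1)
x = rng.normal(size=D)


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


# Two-layer regression network with the sigmoid removed.
y_no_sigmoid = W2.T @ (W1.T @ x + b1) + b2

# The same computation written as a single linear model.
W_prime = W1 @ W2                # so that W_prime.T == W2.T @ W1.T
b_prime = W2.T @ b1 + b2
y_linear = W_prime.T @ x + b_prime

print(np.allclose(y_no_sigmoid, y_linear))

# With the sigmoid in place, the output generally differs from the linear model.
y_sigmoid = W2.T @ sigmoid(W1.T @ x + b1) + b2
print(y_sigmoid, y_linear)
```

The first print should show that the network without a sigmoid and the single linear model agree, which is the reduction described above.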

69
00:06:06,260 --> 00:06:12,110
There is an added bonus to using neurons for our features. If you recall, all the weights and biases

70
00:06:12,110 --> 00:06:17,060
are randomly initialized and found using an iterative algorithm called gradient descent

71
00:06:17,420 --> 00:06:19,430
when we call the model.fit function.

72
00:06:20,270 --> 00:06:26,090
What this means is that instead of having to do manual feature engineering to try and guess what features

73
00:06:26,090 --> 00:06:29,330
might be good, a neural network will do this automatically.

74
00:06:30,230 --> 00:06:36,620
In the olden days of machine learning, feature engineering at times used to be the only method of really

75
00:06:36,620 --> 00:06:38,360
applying machine learning successfully.

76
00:06:39,020 --> 00:06:41,660
And usually this would require domain knowledge.

77
00:06:42,320 --> 00:06:48,020
So if you wanted to build a good image classifier for, say, lung cancer detection, you would have

78
00:06:48,020 --> 00:06:51,560
to be very knowledgeable about different computer vision techniques.

79
00:06:52,100 --> 00:06:57,500
But not only that, you would also have to be very good at looking at X-rays of lung cancer.

80
00:06:58,160 --> 00:06:58,700
By the way,

81
00:06:58,730 --> 00:07:03,080
note that I'm not a doctor, so I don't know if they actually look at X-rays, but you get the idea.

82
00:07:03,860 --> 00:07:08,720
If you wanted to build a good stock predictor, you would have to be very knowledgeable about finance.

83
00:07:10,320 --> 00:07:14,820
If you wanted to build a good fraud detector, you would have to be very knowledgeable about the insurance

84
00:07:14,820 --> 00:07:15,360
industry.

85
00:07:16,840 --> 00:07:22,630
And as you may have heard, deep learning is so popular because it allows people who are not domain

86
00:07:22,630 --> 00:07:25,480
experts to build very good machine learning models.

87
00:07:26,200 --> 00:07:31,960
Nowadays, we have people who are not expert radiologists building state-of-the-art image classifiers

88
00:07:31,960 --> 00:07:33,370
for medical diagnosis.

89
00:07:34,240 --> 00:07:36,700
All they are experts in is machine learning itself.

90
00:07:37,540 --> 00:07:38,710
That is pretty powerful.

91
00:07:43,910 --> 00:07:46,670
One tool I really like is the TensorFlow playground.

92
00:07:47,150 --> 00:07:50,450
You can find it at playground.tensorflow.org.

93
00:07:51,920 --> 00:07:56,360
Basically, this lets you train a neural network on some synthetic data right in your browser.

94
00:07:57,050 --> 00:08:02,000
It lets you see for yourself that the neural network is learning a nonlinear decision boundary.

95
00:08:05,580 --> 00:08:11,280
So if you go there, it looks like the default is the doughnut problem where you have one circle inside

96
00:08:11,280 --> 00:08:12,180
another circle.

97
00:08:12,450 --> 00:08:15,480
And of course, the decision boundary should be a circle.

98
00:08:16,170 --> 00:08:23,190
Your inputs are only the raw data X1 and X2, but you could also use feature engineering if you so choose.

99
00:08:23,640 --> 00:08:30,150
So you could select x1 squared, x2 squared, x1 x2, and even the sine of x1 and x2.

100
00:08:30,750 --> 00:08:37,110
But we're not going to do that because we want to show that neural networks are able to learn nonlinear

101
00:08:37,110 --> 00:08:43,740
features using only the original inputs, so you never have to manually create things like this yourself.

102
00:08:45,120 --> 00:08:52,010
And so by default, we have a two-hidden-layer neural network with four hidden units, so there's one,

103
00:08:52,020 --> 00:08:53,010
two, three, four.

104
00:08:54,750 --> 00:08:56,760
So let's just run this and see what happens.

105
00:09:00,590 --> 00:09:05,990
All right, so as you can see, it learns the nonlinear decision boundary quite efficiently without

106
00:09:05,990 --> 00:09:08,270
the need for any manual feature engineering.

107
00:09:09,440 --> 00:09:15,470
You're encouraged to play around with this yourself, so try switching datasets so you can pick, let's

108
00:09:15,470 --> 00:09:18,650
say, this one and then run it again.

109
00:09:24,040 --> 00:09:28,840
Also, try different numbers of hidden layers and different numbers of units per hidden layer.

110
00:09:29,590 --> 00:09:35,230
This should help give you a stronger intuition for how neural networks work and lets you observe for

111
00:09:35,230 --> 00:09:38,710
yourself that they are finding non-linear decision boundaries.

