1
00:00:11,570 --> 00:00:16,930
At this point, you are probably sick of me saying machine learning is nothing but a geometry problem,

2
00:00:17,390 --> 00:00:19,420
but here is where it gets really important.

3
00:00:20,150 --> 00:00:24,290
The question we want to answer right now is why are neural networks so important?

4
00:00:24,590 --> 00:00:26,520
Why can't we just use a single neuron?

5
00:00:26,960 --> 00:00:28,960
I mean, it seems to be a pretty good model.

6
00:00:29,480 --> 00:00:30,860
It's nice and interpretable.

7
00:00:31,520 --> 00:00:37,490
A large weight means the corresponding input is important, and a very small or zero weight means the input

8
00:00:37,490 --> 00:00:38,450
is not important.

9
00:00:39,020 --> 00:00:42,320
Unfortunately, this type of model only gets us so far.

10
00:00:47,430 --> 00:00:52,380
You'll recall that when we discussed linear regression and logistic regression, I said there are two

11
00:00:52,380 --> 00:00:54,620
different ways we can make things more complicated.

12
00:00:55,260 --> 00:00:59,580
The first way we can make things more complicated is to add multiple inputs.

13
00:00:59,910 --> 00:01:01,530
That's something we encountered already.

14
00:01:02,250 --> 00:01:07,770
This is still pretty simplistic because you can always picture a plane in your head, a straight, non-curved

15
00:01:07,770 --> 00:01:10,390
surface that separates data between classes.

16
00:01:11,160 --> 00:01:16,650
The second way we can make things more complicated is that the decision boundary or regression function

17
00:01:16,800 --> 00:01:19,320
we are looking for is not a straight line or plane.

18
00:01:20,490 --> 00:01:23,900
This is what we are interested in when we discuss neural networks.

19
00:01:29,060 --> 00:01:36,110
The reason the equation w transpose x plus b equals zero gives us a hyperplane is because this is the

20
00:01:36,110 --> 00:01:38,330
actual definition of a hyperplane.

21
00:01:38,960 --> 00:01:40,280
There's no way around this.

22
00:01:40,740 --> 00:01:45,710
There's no possible setting of W and B that could give you a curved surface.

23
00:01:46,340 --> 00:01:50,950
So how can we get a curved surface so that we can solve more complicated problems?

24
00:01:56,050 --> 00:02:01,570
One method you might have thought of is to use feature engineering. For example, let's take linear

25
00:02:01,570 --> 00:02:02,140
regression.

26
00:02:02,920 --> 00:02:09,660
Suppose that for whatever reason, salary is proportional to the square of the years of experience.

27
00:02:09,670 --> 00:02:14,400
Or put another way, that salary is a quadratic function of years of experience.

28
00:02:15,100 --> 00:02:22,540
Then we could say that y hat equals a x squared plus b x plus c. Believe it or not, this is still

29
00:02:22,540 --> 00:02:24,530
just linear regression in disguise.

30
00:02:25,240 --> 00:02:31,480
It's easy to see this by doing the following: take x, the years of experience, and call that the input

31
00:02:31,480 --> 00:02:39,310
feature x1. Then take x squared, the years of experience squared, and call that the input feature x2.

32
00:02:40,000 --> 00:02:47,950
Then my y hat becomes an equation of the form y hat equals w1 x1 plus w2 x2 plus b.

33
00:02:48,730 --> 00:02:52,180
Of course, we've already learned that this is just linear regression.
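
To make the "linear regression in disguise" point concrete, here is a minimal sketch using NumPy. The data, the coefficient values, and the variable names are all made up for illustration; the only idea taken from the lecture is treating x and x squared as two separate input features.

```python
import numpy as np

# Hypothetical synthetic data: salary as a quadratic function of experience.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)            # years of experience
a_true, b_true, c_true = 2.0, 3.0, 5.0      # made-up coefficients
y = a_true * x**2 + b_true * x + c_true + rng.normal(0, 0.1, size=200)

# Feature engineering: x1 = x, x2 = x squared, plus a column of ones for the bias.
X = np.column_stack([x, x**2, np.ones_like(x)])

# Ordinary least squares: still linear regression, just on engineered features.
w1, w2, b = np.linalg.lstsq(X, y, rcond=None)[0]
print(w1, w2, b)  # w1 should land near b_true, w2 near a_true, b near c_true
```

The model is linear in the weights w1, w2, and b, even though it is quadratic in the original input x.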

34
00:02:57,440 --> 00:03:02,870
The problem with feature engineering is that there are many different possible features. Squaring the

35
00:03:02,870 --> 00:03:06,210
inputs is common, but so is combining the inputs.

36
00:03:06,440 --> 00:03:12,230
So if I have the inputs x1 and x2, I could make one of the features x1 times x2.

37
00:03:12,680 --> 00:03:14,560
We call these interaction terms.

38
00:03:15,380 --> 00:03:18,030
But what happens if I have three input features?

39
00:03:18,530 --> 00:03:22,470
Now I have to consider x1 squared, x2 squared, and x3 squared.

40
00:03:22,850 --> 00:03:28,940
But we also have to consider x1 x2, x1 x3, and x2 x3. In total,

41
00:03:28,970 --> 00:03:32,990
there are six quadratic terms. As an exercise,

42
00:03:32,990 --> 00:03:38,930
you might want to list out all the possible combinations we could have if we had four inputs or five

43
00:03:38,930 --> 00:03:39,550
inputs.

44
00:03:40,190 --> 00:03:46,940
You should discover that the number of terms we have to consider grows quite fast and thus feature engineering,

45
00:03:47,270 --> 00:03:51,520
even in this simple form, seems to be a somewhat clumsy solution.
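
As a quick check on that exercise, here is a small sketch that lists the quadratic terms for a given number of inputs. The function name quadratic_terms is hypothetical, and it counts only degree-two terms (squares and pairwise products), matching the six terms mentioned for three inputs.

```python
from itertools import combinations_with_replacement

def quadratic_terms(n_inputs):
    """List every degree-2 term (squares and pairwise products) for n inputs."""
    names = [f"x{i}" for i in range(1, n_inputs + 1)]
    return ["*".join(pair) for pair in combinations_with_replacement(names, 2)]

for n in (2, 3, 4, 5):
    terms = quadratic_terms(n)
    print(n, len(terms), terms)
```

The count works out to n(n + 1) / 2, and it grows faster still once you allow cubic or higher-order terms.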

46
00:03:56,590 --> 00:04:02,800
But now, let's remember our neuron. If we consider just the first layer of a neural network, we can

47
00:04:02,800 --> 00:04:04,570
see that there are multiple neurons.

48
00:04:05,140 --> 00:04:08,320
Importantly, there are multiple different neurons.

49
00:04:08,800 --> 00:04:15,640
Each neuron is a different nonlinear feature derived from all the inputs. Because we've applied the sigmoid

50
00:04:15,640 --> 00:04:16,790
activation function,

51
00:04:17,200 --> 00:04:21,660
the feature is not just a simple linear combination of input features.

52
00:04:22,240 --> 00:04:27,400
So you can think of these as feature one, feature two, feature three, all the way up to feature M.
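
Here is a minimal sketch of that first layer in NumPy. The sizes (2 inputs, 4 hidden units), the random weights, and the names W1, b1, and z are assumptions chosen for illustration, not values from the lecture.

```python
import numpy as np

def sigmoid(a):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
D, M = 2, 4                     # hypothetical sizes: 2 inputs, 4 hidden units
W1 = rng.normal(size=(D, M))    # one column of weights per hidden neuron
b1 = rng.normal(size=M)         # one bias per hidden neuron
x = np.array([0.5, -1.0])       # a single made-up input vector

# Each entry of z is one nonlinear feature computed from all the inputs.
z = sigmoid(W1.T @ x + b1)      # feature 1, feature 2, ..., feature M
print(z)
```

Every feature depends on every input, and the sigmoid is what keeps each one from being a plain linear combination.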

53
00:04:32,540 --> 00:04:36,600
Now, a lot of people wonder, how does this make a neural network non-linear?

54
00:04:37,880 --> 00:04:44,420
Remember that a linear function always takes the form w transpose x plus b. Our neural network takes

55
00:04:44,420 --> 00:04:48,260
the form of the equation you see here for a two-layer neural network.

56
00:04:50,080 --> 00:04:56,620
This is what we get if we drop this item and just make the whole thing into one equation: W2 transpose

57
00:04:56,620 --> 00:05:05,380
sigmoid of W1 transpose x plus b1, and then plus b2. The important thing to notice about this is

58
00:05:05,770 --> 00:05:08,730
you cannot simplify this into a linear function.

59
00:05:09,280 --> 00:05:12,640
If you could, then we would have a linear decision boundary.

60
00:05:17,710 --> 00:05:23,500
So let's see what would happen if we had no sigmoid at all. Let's suppose we have a regression network

61
00:05:23,500 --> 00:05:31,210
with no sigmoid. Then it's just W2 transpose times the quantity W1 transpose x plus b1, plus b2.

62
00:05:32,020 --> 00:05:37,540
Of course, using basic arithmetic, we can expand the product. Due to the fact that there's no sigmoid,

63
00:05:37,550 --> 00:05:39,400
it's just straightforward multiplication.

64
00:05:40,090 --> 00:05:43,810
Then we just get w prime transpose x plus b prime.

65
00:05:44,230 --> 00:05:52,420
We have substituted w prime transpose equals W2 transpose times W1 transpose, and b prime equals W2 transpose

66
00:05:52,420 --> 00:05:54,610
times b1 plus b2.

67
00:05:55,510 --> 00:06:01,180
And so you can see that if we have no sigmoid, this just reduces to a linear equation.
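
You can check this collapse numerically. The sketch below assumes small random weights and made-up sizes (3 inputs, 4 hidden units); it only verifies that the network without a sigmoid gives the same output as a single linear function.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 3, 4                      # hypothetical sizes: 3 inputs, 4 hidden units
W1 = rng.normal(size=(D, M))     # first-layer weights
b1 = rng.normal(size=M)          # first-layer biases
W2 = rng.normal(size=M)          # second-layer weights (single output)
b2 = rng.normal()                # second-layer bias
x = rng.normal(size=D)           # one made-up input vector

# Two-layer regression network with the sigmoid removed.
y_two_layer = W2.T @ (W1.T @ x + b1) + b2

# Collapse it: w_prime transpose equals W2 transpose times W1 transpose.
w_prime = W1 @ W2
b_prime = W2.T @ b1 + b2
y_linear = w_prime.T @ x + b_prime

print(np.isclose(y_two_layer, y_linear))
```

Putting the sigmoid back between the two layers is what prevents this simplification.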

68
00:06:06,260 --> 00:06:12,140
There is an added bonus to using neurons for our features. If you recall, all the weights and biases

69
00:06:12,140 --> 00:06:18,050
are randomly initialized and found using an iterative algorithm called gradient descent when we call

70
00:06:18,050 --> 00:06:19,440
the model's fit function.

71
00:06:20,270 --> 00:06:26,120
What this means is that instead of having to do manual feature engineering to try and guess what features

72
00:06:26,120 --> 00:06:29,360
might be good, a neural network will do this automatically.

73
00:06:30,230 --> 00:06:36,650
In the olden days of machine learning, feature engineering at times used to be the only method of really

74
00:06:36,650 --> 00:06:38,390
applying machine learning successfully.

75
00:06:39,020 --> 00:06:41,700
And usually this would require domain knowledge.

76
00:06:42,320 --> 00:06:48,050
So if you wanted to build a good image classifier for, say, lung cancer detection, you would have

77
00:06:48,050 --> 00:06:51,610
to be very knowledgeable about different computer vision techniques.

78
00:06:52,100 --> 00:06:57,500
But not only that, you would also have to be very good at looking at X-rays of lung cancer.

79
00:06:58,190 --> 00:07:02,000
By the way, note that I'm not a doctor, so I don't know if they actually look at X-rays.

80
00:07:02,000 --> 00:07:07,730
But you get the idea. If you wanted to build a good stock predictor, you would have to be very knowledgeable

81
00:07:07,730 --> 00:07:08,780
about finance.

82
00:07:10,290 --> 00:07:14,850
If you wanted to build a good fraud detector, you would have to be very knowledgeable about the insurance

83
00:07:14,850 --> 00:07:15,360
industry.

84
00:07:16,840 --> 00:07:22,660
And as you may have heard, deep learning is so popular because it allows people who are not domain

85
00:07:22,660 --> 00:07:25,520
experts to build very good machine learning models.

86
00:07:26,230 --> 00:07:31,990
Nowadays, we have people who are not expert radiologists building state-of-the-art image classifiers

87
00:07:31,990 --> 00:07:33,390
for medical diagnosis.

88
00:07:34,210 --> 00:07:36,760
All they are experts in is machine learning itself.

89
00:07:37,540 --> 00:07:38,740
That is pretty powerful.

90
00:07:43,910 --> 00:07:46,710
One tool I really like is the TensorFlow Playground.

91
00:07:47,150 --> 00:07:50,470
You can find it at playground.tensorflow.org.

92
00:07:51,950 --> 00:07:56,390
Basically this lets you train a neural network on some synthetic data right in your browser.

93
00:07:57,020 --> 00:08:02,000
It lets you see for yourself that the neural network is learning a nonlinear decision boundary.

94
00:08:05,550 --> 00:08:11,310
So if you go there, it looks like the default is the donut problem, where you have one circle inside

95
00:08:11,310 --> 00:08:15,510
another circle, and of course the decision boundary should be a circle.

96
00:08:16,170 --> 00:08:23,200
Your inputs are only the raw data x1 and x2, but you could also use feature engineering if you so choose.

97
00:08:23,640 --> 00:08:30,150
So you could select x1 squared, x2 squared, x1 x2, and even the sine of x1 and x2.

98
00:08:30,750 --> 00:08:37,140
But we're not going to do that because we want to show that neural networks are able to learn nonlinear

99
00:08:37,140 --> 00:08:40,130
features using only the original inputs.

100
00:08:40,130 --> 00:08:43,800
So you never have to manually create things like this yourself.

101
00:08:45,120 --> 00:08:51,230
And so by default, we have a two-hidden-layer neural network with four hidden units.

102
00:08:51,240 --> 00:08:53,010
So there's one, two, three, four.

103
00:08:54,750 --> 00:08:56,780
So let's just run this and see what happens.

104
00:09:00,530 --> 00:09:06,020
All right, so as you can see, it learns the nonlinear decision boundary quite efficiently without

105
00:09:06,020 --> 00:09:08,280
the need for any manual feature engineering.

106
00:09:09,440 --> 00:09:11,770
You're encouraged to play around with this yourself.

107
00:09:12,350 --> 00:09:18,640
So try switching data sets so you can pick, let's say, this one and then run it again.

108
00:09:24,040 --> 00:09:28,810
Also, try different numbers of hidden layers and different numbers of units per hidden layer.

109
00:09:29,620 --> 00:09:35,230
This should help give you a stronger intuition for how neural networks work and let you observe for

110
00:09:35,230 --> 00:09:38,740
yourself that they are finding non-linear decision boundaries.
