1
00:00:11,670 --> 00:00:17,400
At this point you're probably sick of me saying machine learning is nothing but a geometry problem.

2
00:00:17,400 --> 00:00:20,160
But here is where it gets really important.

3
00:00:20,190 --> 00:00:25,410
The question we want to answer right now is: why are neural networks so important? Why can't we just

4
00:00:25,410 --> 00:00:27,050
use a single neuron?

5
00:00:27,060 --> 00:00:29,520
I mean, it seems to be a pretty good model.

6
00:00:29,550 --> 00:00:31,270
It's nice and interpretable.

7
00:00:31,590 --> 00:00:37,470
A large weight means the corresponding input is important, and a very small or zero weight means the input

8
00:00:37,470 --> 00:00:38,460
is not important.

9
00:00:39,120 --> 00:00:42,330
Unfortunately, this type of model only gets us so far.

10
00:00:47,490 --> 00:00:52,650
You'll recall that when we discussed linear regression and logistic regression, I said there are two different

11
00:00:52,650 --> 00:00:55,170
ways we can make things more complicated.

12
00:00:55,260 --> 00:00:59,940
The first way we can make things more complicated is to add multiple inputs.

13
00:00:59,940 --> 00:01:02,150
That's something we encountered already.

14
00:01:02,280 --> 00:01:07,770
This is still pretty simplistic because you can always picture a plane in your head: a straight, non-curving

15
00:01:07,770 --> 00:01:11,090
surface that separates data between classes.

16
00:01:11,220 --> 00:01:16,650
The second way we can make things more complicated is that the decision boundary or regression function

17
00:01:16,860 --> 00:01:20,490
we are looking for is not a straight line or plane.

18
00:01:20,490 --> 00:01:29,010
This is what we are interested in when we discuss neural networks.

19
00:01:29,100 --> 00:01:35,970
The reason the equation w transpose x plus b equals zero gives us a hyperplane is because this is

20
00:01:35,970 --> 00:01:39,020
the actual definition of a hyperplane.

21
00:01:39,020 --> 00:01:40,800
There is no way around this.

22
00:01:40,800 --> 00:01:45,750
There's no possible setting of w and b that could give you a curved surface.

23
00:01:46,380 --> 00:01:51,000
So how can we get a curved surface so that we can solve more complicated problems?

24
00:01:56,080 --> 00:01:59,740
One method you might have thought of is to use feature engineering.

25
00:02:00,010 --> 00:02:02,980
For example, let's take linear regression.

26
00:02:02,980 --> 00:02:09,660
Suppose that for whatever reason salary is proportional to the square of the years of experience.

27
00:02:09,730 --> 00:02:14,800
Or put another way that salary is a quadratic function of years of experience.

28
00:02:15,130 --> 00:02:21,910
Then we could say that y hat equals a x squared plus b x plus c. Believe it or not,

29
00:02:21,970 --> 00:02:25,330
this is still just linear regression in disguise.

30
00:02:25,330 --> 00:02:28,290
It's easy to see this by doing the following.

31
00:02:28,450 --> 00:02:35,100
Take x, the years of experience, and call that the input feature x1. Then take x squared,

32
00:02:35,200 --> 00:02:39,720
the years of experience squared, and call that the input feature x2.

33
00:02:40,060 --> 00:02:48,830
Then my y hat becomes an equation of the form y hat equals w1 x1 plus w2 x2 plus b.

34
00:02:48,850 --> 00:02:52,150
Of course, we've already learned that this is just linear regression.
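
To make that substitution concrete, here is a minimal sketch using NumPy. The salary data and the coefficients 2, 3, and 40 are made up purely for illustration, and plain least squares (`np.linalg.lstsq`) stands in for whatever fitting method you would normally use.

```python
import numpy as np

# Hypothetical data: salary as an exact quadratic function of years of experience.
# The coefficients 2.0, 3.0 and 40.0 are invented for this illustration.
years = np.arange(0.0, 11.0)  # x = 0, 1, ..., 10
salary = 2.0 * years**2 + 3.0 * years + 40.0

# Feature engineering: x1 = x and x2 = x squared.
# The column of ones lets the same solver find the bias term b.
X = np.column_stack([years, years**2, np.ones_like(years)])

# Ordinary linear least squares on the engineered features.
(w1, w2, b), *_ = np.linalg.lstsq(X, salary, rcond=None)

print(w1, w2, b)  # approximately 3, 2 and 40
```

The model is still linear in w1, w2, and b. Only the inputs were transformed before fitting.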

35
00:02:57,430 --> 00:03:02,840
The problem with feature engineering is that there are many different possible features. Squaring the

36
00:03:02,840 --> 00:03:06,450
inputs is common, but so is combining the inputs.

37
00:03:06,500 --> 00:03:13,280
So if I have the inputs x1 and x2, then I could make one of the features x1 times x2. We call these

38
00:03:13,310 --> 00:03:18,510
interaction terms. But what happens if I have 3 input features?

39
00:03:18,590 --> 00:03:22,450
Now I have to consider x1 squared, x2 squared, and x3 squared.

40
00:03:22,910 --> 00:03:28,960
But we also have to consider x1 x2, x1 x3, and x2 x3. In total,

41
00:03:28,970 --> 00:03:36,050
there are 6 quadratic terms. As an exercise, you might want to list out all the possible combinations

42
00:03:36,050 --> 00:03:39,920
we could have if we had 4 inputs or 5 inputs.

43
00:03:40,250 --> 00:03:47,330
You should discover that the number of terms we have to consider grows quite fast, and thus feature engineering,

44
00:03:47,330 --> 00:03:51,530
even in this simple form, seems to be a somewhat clumsy solution.
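
If you want to check your answer to that exercise, here is a small sketch that lists the quadratic terms for a given number of inputs. The helper name `quadratic_terms` is just something I made up for this example. It uses only the Python standard library.

```python
from itertools import combinations_with_replacement


def quadratic_terms(n_inputs):
    """List every degree-2 term (squares and pairwise products) for n_inputs features."""
    names = [f"x{i}" for i in range(1, n_inputs + 1)]
    # Pairs with replacement give both the squares (x1*x1) and the interactions (x1*x2).
    return ["*".join(pair) for pair in combinations_with_replacement(names, 2)]


for n in (3, 4, 5):
    terms = quadratic_terms(n)
    print(n, len(terms), terms)
```

For n inputs this gives n(n+1)/2 quadratic terms, so 6, 10, and 15 for 3, 4, and 5 inputs. And that is before you consider cubic or higher-order terms.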

45
00:03:56,650 --> 00:03:58,900
But now let's remember our neuron.

46
00:03:59,290 --> 00:04:05,250
If we consider just the first layer of a neural network, we can see that there are multiple neurons.

47
00:04:05,260 --> 00:04:08,790
Importantly, they are all different neurons.

48
00:04:08,860 --> 00:04:15,100
Each neuron is a different nonlinear feature derived from all the inputs. Because we've applied the

49
00:04:15,100 --> 00:04:17,160
sigmoid activation function,

50
00:04:17,230 --> 00:04:24,040
the feature is not just a simple linear combination of input features. So you can think of these as feature

51
00:04:24,040 --> 00:04:32,420
1, feature 2, feature 3, all the way up to feature M.
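
Here is a rough NumPy sketch of what "each neuron is a feature" means. The sizes (2 inputs, 4 hidden neurons), the random weights, and the helper name `hidden_features` are all assumptions for illustration. In a real network the weights would be learned.

```python
import numpy as np


def sigmoid(a):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))


rng = np.random.default_rng(0)
D, M = 2, 4                    # hypothetical sizes: 2 inputs, 4 hidden neurons
W1 = rng.normal(size=(D, M))   # one column of weights per hidden neuron
b1 = rng.normal(size=M)        # one bias per hidden neuron


def hidden_features(x):
    """Return feature 1 through feature M: z = sigmoid(W1 transpose x + b1)."""
    return sigmoid(W1.T @ x + b1)


x = np.array([0.5, -1.0])
z = hidden_features(x)
print(z)  # M values, each between 0 and 1
```

Every entry of z depends on all the inputs, and the sigmoid is what stops it from being a plain linear combination.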

52
00:04:32,600 --> 00:04:37,790
Now, a lot of people wonder: how does this make a neural network nonlinear?

53
00:04:37,910 --> 00:04:44,420
Remember that a linear function always takes the form w transpose x plus b. A neural network takes

54
00:04:44,420 --> 00:04:51,510
the form of the equation you see here. For a two-layer neural network, this is what we get if we drop

55
00:04:51,560 --> 00:04:59,870
the z term and just make the whole thing into one equation: W2 transpose sigmoid of (W1 transpose x plus

56
00:04:59,980 --> 00:05:03,170
b1), and then plus b2.

57
00:05:03,230 --> 00:05:09,210
The important thing to notice about this is you cannot simplify this into a linear function.

58
00:05:09,380 --> 00:05:12,620
If you could, then we would have a linear decision boundary.

59
00:05:17,750 --> 00:05:21,590
So let's see what would happen if we had no sigmoid at all.

60
00:05:21,590 --> 00:05:28,790
Let's suppose we have a regression network with no sigmoid. Then it's just W2 transpose times (W1 transpose

61
00:05:28,790 --> 00:05:34,000
x plus b1) plus b2. Of course, using basic arithmetic,

62
00:05:34,010 --> 00:05:37,490
we can expand the product due to the fact that there is no sigmoid.

63
00:05:37,610 --> 00:05:39,850
It's just straightforward multiplication.

64
00:05:40,130 --> 00:05:47,960
Then we just get w prime transpose x plus b prime, where I've substituted w prime transpose equals W2 transpose

65
00:05:47,990 --> 00:05:55,220
times W1 transpose, and b prime equals W2 transpose times b1, plus b2.

66
00:05:55,610 --> 00:06:06,110
And so you can see that if we have no sigmoid this just reduces to a linear equation.
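
You can verify this collapse numerically. The following is a minimal NumPy sketch with made-up sizes and random weights. W2 is stored as a plain vector because I am assuming a single regression output.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 3, 5                    # hypothetical sizes: 3 inputs, 5 hidden units
W1 = rng.normal(size=(D, M))
b1 = rng.normal(size=M)
W2 = rng.normal(size=M)        # single output, so W2 is a vector of length M
b2 = rng.normal()


def two_layer_no_sigmoid(x):
    """W2 transpose (W1 transpose x + b1) + b2, with no activation in between."""
    return W2 @ (W1.T @ x + b1) + b2


# Collapse the two layers into a single linear model.
w_prime = W1 @ W2              # chosen so that w_prime transpose = W2 transpose W1 transpose
b_prime = W2 @ b1 + b2

x = rng.normal(size=D)
print(two_layer_no_sigmoid(x), w_prime @ x + b_prime)  # the two values agree
```

However many layers you stack, without a nonlinearity in between you can always fold them into one w prime and one b prime like this.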

67
00:06:06,330 --> 00:06:09,930
There is an added bonus to using neurons for our features.

68
00:06:09,930 --> 00:06:15,840
If you recall, all the weights and biases are randomly initialized and then found using an iterative algorithm

69
00:06:15,840 --> 00:06:17,490
called gradient descent

70
00:06:17,490 --> 00:06:20,250
when we call the model.fit function.

71
00:06:20,340 --> 00:06:26,100
What this means is that instead of having to do manual feature engineering to try and guess what features

72
00:06:26,100 --> 00:06:30,160
might be good, a neural network will do this automatically.

73
00:06:30,330 --> 00:06:36,660
In the olden days of machine learning, feature engineering at times used to be the only method of really

74
00:06:36,660 --> 00:06:41,820
applying machine learning successfully, and usually this would require domain knowledge.

75
00:06:42,360 --> 00:06:48,270
So if you wanted to build a good image classifier for, say, lung cancer detection, you would have to be

76
00:06:48,270 --> 00:06:54,390
very knowledgeable about different computer vision techniques. But not only that, you would also have

77
00:06:54,390 --> 00:06:57,860
to be very good at looking at X-rays of lung cancer.

78
00:06:58,230 --> 00:06:58,700
By the way,

79
00:06:58,710 --> 00:07:03,530
note that I'm not a doctor, so I don't know if they actually look at X-rays, but you get the idea.

80
00:07:03,930 --> 00:07:08,820
If you wanted to build a good stock predictor, you would have to be very knowledgeable about finance.

81
00:07:10,380 --> 00:07:14,850
If you wanted to build a good fraud detector, you would have to be very knowledgeable about the insurance

82
00:07:14,850 --> 00:07:22,260
industry. And as you may have heard, deep learning is so popular because it allows people who are not

83
00:07:22,260 --> 00:07:26,190
domain experts to build very good machine learning models.

84
00:07:26,280 --> 00:07:31,980
Nowadays, we have people who are not expert radiologists building state-of-the-art image classifiers

85
00:07:31,980 --> 00:07:37,280
for medical diagnosis. All they are experts in is machine learning itself.

86
00:07:37,590 --> 00:07:38,850
That is pretty powerful.

87
00:07:43,970 --> 00:07:47,110
One tool I really like is the TensorFlow Playground.

88
00:07:47,210 --> 00:07:53,890
You can find it at playground.tensorflow.org. Basically, this lets you train a neural network

89
00:07:53,950 --> 00:07:57,130
on some synthetic data right in your browser.

90
00:07:57,130 --> 00:08:01,990
It lets you see for yourself that the neural network is learning a nonlinear decision boundary.

91
00:08:05,640 --> 00:08:11,280
So if you go there, it looks like the default is the donut problem, where you have one circle inside

92
00:08:11,280 --> 00:08:12,200
another circle.

93
00:08:12,570 --> 00:08:16,020
And of course, the decision boundary should be a circle.

94
00:08:16,260 --> 00:08:23,470
Your inputs are only the raw data, x1 and x2, but you could also use feature engineering if you so choose.
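
To see why engineered features would make the donut problem easy, here is a small NumPy sketch. The data is synthetic and generated here, not taken from the playground, and the weights are picked by hand instead of being learned. Both are assumptions made only to illustrate the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200  # points per class

# Hypothetical donut data: class 0 near radius 1, class 1 near radius 3.
angles = rng.uniform(0.0, 2.0 * np.pi, size=2 * N)
radii = np.concatenate([
    rng.uniform(0.5, 1.5, size=N),   # inner circle, class 0
    rng.uniform(2.5, 3.5, size=N),   # outer circle, class 1
])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.concatenate([np.zeros(N), np.ones(N)])

# Engineered features: x1 squared and x2 squared.
F = X**2

# A hand-picked linear boundary in feature space: x1^2 + x2^2 - 4 = 0.
# In the original input space this is a circle of radius 2.
w = np.array([1.0, 1.0])
b = -4.0
pred = (F @ w + b > 0).astype(float)

accuracy = np.mean(pred == y)
print(accuracy)
```

A boundary that is a straight line in the squared features is a circle in the raw inputs. The point of the demo that follows is that the network finds such a boundary from x1 and x2 alone.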

95
00:08:23,700 --> 00:08:31,950
So you could select x1 squared, x2 squared, x1 x2, and even the sine of x1 and x2. But we're not going

96
00:08:31,950 --> 00:08:38,310
to do that because we want to show that neural networks are able to learn non-linear features using

97
00:08:38,370 --> 00:08:40,150
only the original inputs.

98
00:08:40,170 --> 00:08:48,250
So you never have to manually create things like this yourself. And so by default, we have a two hidden

99
00:08:48,250 --> 00:08:55,920
layer neural network with four hidden units, so there's one, two, three, four. So let's just run this and

100
00:08:55,920 --> 00:08:56,820
see what happens.

101
00:09:00,630 --> 00:09:00,890
All right.

102
00:09:00,920 --> 00:09:06,830
So as you can see, it learns the nonlinear decision boundary quite efficiently, without the need for any

103
00:09:06,830 --> 00:09:12,260
manual feature engineering. You're encouraged to play around with this yourself.

104
00:09:12,410 --> 00:09:18,680
So try switching datasets. You can pick, let's say, this one, and then run it again.

105
00:09:24,110 --> 00:09:29,520
Also try different numbers of hidden layers and different numbers of units per hidden layer.

106
00:09:29,630 --> 00:09:35,240
This should help give you a stronger intuition for how neural networks work and let you observe for

107
00:09:35,240 --> 00:09:38,780
yourself that they are finding nonlinear decision boundaries.
