1
00:00:11,760 --> 00:00:18,450
In this lecture, we are going to discuss forward propagation as it relates to neural networks. Neural

2
00:00:18,450 --> 00:00:24,630
networks are all about making predictions, and so making a prediction in a neural network is what we

3
00:00:24,630 --> 00:00:27,660
refer to as going in the forward direction.

4
00:00:32,790 --> 00:00:38,340
Previously, we looked at logistic regression and how this might be seen as a computer simulation of

5
00:00:38,340 --> 00:00:38,890
a neuron.

6
00:00:39,510 --> 00:00:42,990
Well, let's consider an interesting scenario here.

7
00:00:42,990 --> 00:00:48,990
We have one neuron, but let's say these same inputs propagate to another output.

8
00:00:49,620 --> 00:00:51,300
Now we have two neurons.

9
00:00:51,930 --> 00:00:57,330
Each of them could be calculating something different, given the same inputs. If you want a sort of

10
00:00:57,330 --> 00:00:58,750
real-world analogy of that,

11
00:00:58,950 --> 00:01:01,650
suppose we're looking at a face. One neuron

12
00:01:01,650 --> 00:01:06,550
might be looking for the presence of an eye and another might be looking for the presence of a nose.

13
00:01:07,020 --> 00:01:11,220
In other words, different neurons are finding different features from the input.

14
00:01:16,210 --> 00:01:21,280
Well, how about we add another neuron? Now we have a whole layer of neurons.

15
00:01:26,350 --> 00:01:32,410
The key insight in deep learning made decades ago by researchers such as Geoffrey Hinton was that we

16
00:01:32,410 --> 00:01:35,770
could stack more neurons on top of the existing neurons.

17
00:01:36,220 --> 00:01:38,250
Remember, that's kind of like how our brain is.

18
00:01:38,530 --> 00:01:41,120
We don't just have one neuron in our brain, we have many.

19
00:01:41,590 --> 00:01:46,660
So one neuron might be connected to another neuron, which might then be further connected to another

20
00:01:46,660 --> 00:01:47,840
neuron and so forth.

21
00:01:48,280 --> 00:01:50,830
So you end up with this long chain of neurons.

22
00:01:51,670 --> 00:01:57,400
The key idea is that we can think of neurons as uniform, although these days we know a lot about the

23
00:01:57,400 --> 00:01:57,750
brain,

24
00:01:57,760 --> 00:02:00,050
and so we know this is not physically the case.

25
00:02:00,580 --> 00:02:06,070
Some parts of the brain are different than others, but for our simple computational model, we can

26
00:02:06,070 --> 00:02:07,340
assume the brain is uniform.

27
00:02:07,810 --> 00:02:12,190
There's no difference between a neuron in one part of the brain compared to another part.

28
00:02:13,090 --> 00:02:18,430
As you might expect, we can just keep doing this, so we have one layer going to another layer, going

29
00:02:18,430 --> 00:02:19,870
to another layer and so on.

30
00:02:24,940 --> 00:02:25,840
So what have we done?

31
00:02:26,560 --> 00:02:30,200
Well, we expanded our concept of neurons in two important ways.

32
00:02:30,820 --> 00:02:36,460
First, we said that the same inputs could be attached to multiple different neurons, each calculating

33
00:02:36,460 --> 00:02:37,310
something different.

34
00:02:37,840 --> 00:02:40,040
This means more neurons per layer.

35
00:02:40,840 --> 00:02:46,550
Second, we said that neurons in one layer could act as inputs to another layer of neurons.

36
00:02:47,200 --> 00:02:50,320
So the first concept is to make the neural network wider.

37
00:02:50,890 --> 00:02:54,000
The second concept is to make the neural network deeper.

38
00:02:54,490 --> 00:02:55,730
And believe it or not,

39
00:02:56,050 --> 00:02:58,490
this is all it takes to build a neural network.

40
00:02:59,020 --> 00:03:04,960
Take a neuron, add more neurons across the same layer to make it wide, and then add more layers to

41
00:03:04,960 --> 00:03:05,680
make it deep.

42
00:03:10,780 --> 00:03:14,860
Previously, we looked at the equation for a line and that became our model.

43
00:03:15,310 --> 00:03:22,060
So if you recall, that was ax plus b. We then said, in order to make this into a neuron, we're going

44
00:03:22,060 --> 00:03:24,710
to have multiple inputs and multiple weights.

45
00:03:25,360 --> 00:03:31,060
We're also going to attach a sigmoid at the end in order to map the output to a value between zero and

46
00:03:31,060 --> 00:03:38,230
one, representing something like the probability of an action potential. That became the sigmoid of w transpose

47
00:03:38,230 --> 00:03:39,040
x plus b.
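
As a rough illustration of that single-neuron formula, here is a minimal Python sketch. The function names `sigmoid` and `neuron` and the numbers used are made up for this example and are not from the lecture.

```python
import math

def sigmoid(a):
    """Map any real number to a value between zero and one."""
    return 1.0 / (1.0 + math.exp(-a))

def neuron(w, x, b):
    """Single neuron: sigmoid(w^T x + b), with w and x as equal-length lists."""
    activation = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return sigmoid(activation)

# Hypothetical weights, inputs, and bias, chosen only for illustration.
w = [0.5, -0.25]
x = [2.0, 4.0]
b = 0.0

# w^T x + b = 1.0 - 1.0 + 0.0 = 0.0, and sigmoid(0) is exactly 0.5.
p = neuron(w, x, b)
```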

48
00:03:40,650 --> 00:03:46,050
The next question to consider is, what shall we do if we have multiple neurons per layer?

49
00:03:51,130 --> 00:03:53,710
Well, suppose I have some output at the next layer.

50
00:03:53,740 --> 00:04:01,090
Let's call that z sub j to represent the jth neuron in that layer. Then z sub j would just be the

51
00:04:01,090 --> 00:04:05,600
sigmoid of w sub j transpose x plus b sub j.

52
00:04:06,310 --> 00:04:10,800
Let's say that j goes from one to M, so that we have M neurons in this layer.

53
00:04:11,500 --> 00:04:15,550
That also means we'll have M w sub j's and b sub j's.
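
One way to picture a layer of M neurons is a loop that gives each neuron its own weight vector and bias. The sketch below assumes plain Python lists, and the name `layer_outputs` and the example values are hypothetical.

```python
import math

def sigmoid(a):
    """Map any real number to a value between zero and one."""
    return 1.0 / (1.0 + math.exp(-a))

def layer_outputs(weights, biases, x):
    """Return [z_1, ..., z_M], where z_j = sigmoid(w_j^T x + b_j)."""
    return [
        sigmoid(sum(w_i * x_i for w_i, x_i in zip(w_j, x)) + b_j)
        for w_j, b_j in zip(weights, biases)
    ]

# Hypothetical layer with M = 3 neurons looking at the same D = 2 inputs.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # one w_j per neuron
biases = [0.0, 0.0, 0.0]                         # one b_j per neuron
x = [2.0, 2.0]

# Three outputs, one per neuron; the third neuron sees 2 - 2 = 0.
z = layer_outputs(weights, biases, x)
```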

54
00:04:20,600 --> 00:04:26,750
It turns out that we can make this calculation more efficient if we consider it to be a vector of neuron

55
00:04:26,750 --> 00:04:30,990
values instead of just a scalar. Then we can write this more compactly.

56
00:04:31,940 --> 00:04:36,890
Now, you might find this a little confusing if you're not used to working with matrices and functions

57
00:04:36,890 --> 00:04:37,680
of matrices.

58
00:04:38,060 --> 00:04:40,100
So we have to clarify a few things.

59
00:04:40,820 --> 00:04:48,950
First, z is a vector of size M. x is, as before, a vector of size D. Or equivalently, you can think

60
00:04:48,950 --> 00:04:55,010
of z as a column vector of size M by one and x as a column vector of size D by one.

61
00:04:55,790 --> 00:05:02,660
In order to make this a valid matrix multiplication, we have the following: W is a matrix of size D

62
00:05:02,660 --> 00:05:03,240
by M.

63
00:05:03,590 --> 00:05:09,590
So when you transpose it, you get M by D, and then when you multiply that by x, which is of size D,

64
00:05:09,860 --> 00:05:18,140
you get out a vector of size M. In order to add the bias term b, it must also be a vector of size M. The sigmoid

65
00:05:18,140 --> 00:05:20,650
here is meant to be an element-wise operation.

66
00:05:21,080 --> 00:05:25,010
So whatever the size of the input is, it has the same size output.

67
00:05:25,670 --> 00:05:31,280
In other words, it would be as if you apply the sigmoid to each component of its argument individually

68
00:05:31,400 --> 00:05:35,030
and then organize them into a vector or matrix of the same shape.
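
To make the shapes concrete, here is a small sketch of the vectorized layer, sigmoid of W transpose x plus b, written with plain Python lists so it runs without extra libraries. The helper names (`transpose`, `matvec`, `dense_layer`) are invented for this example; in practice a numerical library would usually perform the matrix product.

```python
import math

def transpose(matrix):
    """Turn a D x M list of rows into an M x D list of rows."""
    return [list(row) for row in zip(*matrix)]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

def sigmoid_elementwise(values):
    """Apply the sigmoid to each component; the output has the same size."""
    return [1.0 / (1.0 + math.exp(-a)) for a in values]

def dense_layer(W, b, x):
    """z = sigmoid(W^T x + b), with W of size D x M, x of size D, b of size M."""
    D, M = len(W), len(W[0])
    assert len(x) == D and len(b) == M, "shape mismatch"
    a = [a_j + b_j for a_j, b_j in zip(matvec(transpose(W), x), b)]
    return sigmoid_elementwise(a)

# Hypothetical sizes: D = 2 inputs, M = 3 neurons.
W = [[1.0, 0.0, 1.0],
     [0.0, 1.0, -1.0]]
b = [0.0, 0.0, 0.0]
x = [2.0, 2.0]

z = dense_layer(W, b, x)  # a vector of size M
```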

69
00:05:40,150 --> 00:05:46,060
Using these concepts, let's think about how we might mathematically express going from the input to

70
00:05:46,060 --> 00:05:47,690
the output of a neural network.

71
00:05:48,670 --> 00:05:53,260
First, let's consider that we are going to have multiple layers, so it's not enough to call our weights

72
00:05:53,260 --> 00:05:54,560
just W and b.

73
00:05:55,180 --> 00:05:57,360
Instead, we're going to give them indices.

74
00:05:57,910 --> 00:06:04,630
So let's say W superscript one and b superscript one are the weight and bias of the first layer.

75
00:06:05,440 --> 00:06:11,590
W superscript two and B superscript two are the weight and bias of the second layer and so forth.

76
00:06:12,280 --> 00:06:13,240
Z superscript

77
00:06:13,240 --> 00:06:18,970
one will be the output of the first layer, z superscript two will be the output of the second layer,

78
00:06:18,970 --> 00:06:19,840
and so forth.

79
00:06:21,280 --> 00:06:24,220
Now, there are two things to notice about the final layer here.

80
00:06:25,120 --> 00:06:30,160
First, you'll notice that I've used the capital letter L to denote the total number of layers.

81
00:06:30,880 --> 00:06:32,800
The letter L is common in deep learning.

82
00:06:32,810 --> 00:06:36,520
So if you see the letter L elsewhere, don't be surprised.

83
00:06:36,520 --> 00:06:38,440
Just pay attention to the context.

84
00:06:38,860 --> 00:06:41,620
In this context, we mean the number of layers.

85
00:06:42,280 --> 00:06:47,210
And of course, the input to the last layer will be the output of the second-to-last layer.

86
00:06:47,740 --> 00:06:50,290
So we have this nice repeating pattern at each layer.

87
00:06:50,980 --> 00:06:54,060
Each layer is just the same equation over and over again.

88
00:06:56,040 --> 00:06:58,920
So that's what we mean when we say the neural network is uniform.

89
00:06:59,970 --> 00:07:05,440
The second thing to notice is that for now, we'll assume that we are still doing binary classification.

90
00:07:06,180 --> 00:07:11,460
Therefore, the final sigmoid maps the output of the neural network to a single number between

91
00:07:11,460 --> 00:07:12,290
zero and one.
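
The repeating pattern can be sketched as a loop that feeds each layer's output into the next layer. This is only an illustration: the names `dense_layer` and `forward`, the layer sizes, and the weights below are assumptions, not values from the lecture.

```python
import math

def dense_layer(W, b, x):
    """One layer: sigmoid(W^T x + b), with W stored as a D x M list of rows."""
    return [
        1.0 / (1.0 + math.exp(-(sum(W[i][j] * x[i] for i in range(len(x))) + b[j])))
        for j in range(len(b))
    ]

def forward(layers, x):
    """Apply the same equation at every layer, from the input to the output."""
    z = x
    for W, b in layers:  # (W1, b1), (W2, b2), ..., (WL, bL)
        z = dense_layer(W, b, z)
    return z

# Hypothetical network with L = 2 layers: 2 inputs -> 3 neurons -> 1 output.
layers = [
    ([[1.0, -1.0, 0.5], [0.5, 1.0, -0.5]], [0.0, 0.1, -0.1]),
    ([[1.0], [-1.0], [0.5]], [0.0]),
]

# For binary classification, the final sigmoid gives one number between zero and one.
p = forward(layers, [1.0, 2.0])[0]
```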

92
00:07:13,140 --> 00:07:17,820
You can imagine that if we are doing regression, we would not want to have the sigmoid there.

93
00:07:22,900 --> 00:07:27,490
So, of course, this brings us to the next question: what does a neural network for regression look

94
00:07:27,490 --> 00:07:27,910
like?

95
00:07:28,390 --> 00:07:33,500
Surely we are not limited to just binary classification, and luckily, it's very similar.

96
00:07:33,700 --> 00:07:36,640
All we do is simply remove the final sigmoid.
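
Here is a hedged sketch of the regression case: the hidden layers still use the sigmoid, but the final layer is left as a plain weighted sum plus bias. The function names and the numbers are hypothetical, picked so the arithmetic is easy to follow.

```python
import math

def affine(W, b, x):
    """Compute W^T x + b, with W stored as a D x M list of rows."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) + b[j] for j in range(len(b))]

def sigmoid_elementwise(values):
    """Apply the sigmoid to each component."""
    return [1.0 / (1.0 + math.exp(-a)) for a in values]

def forward_regression(hidden_layers, output_layer, x):
    """Sigmoid hidden layers, then a final layer with no sigmoid."""
    z = x
    for W, b in hidden_layers:
        z = sigmoid_elementwise(affine(W, b, z))
    W_out, b_out = output_layer
    return affine(W_out, b_out, z)[0]  # unbounded real-valued prediction

# Hypothetical weights: zero hidden weights make both hidden outputs 0.5.
hidden = [([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])]
output = ([[4.0], [6.0]], [-10.0])

# 4 * 0.5 + 6 * 0.5 - 10 = -5.0, a value a final sigmoid could never produce.
y = forward_regression(hidden, output, [3.0, 7.0])
```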

97
00:07:37,900 --> 00:07:43,900
One interesting thing we can see from this, if we look at only the final layer here, is that it appears

98
00:07:43,900 --> 00:07:46,240
just to be regular old linear regression.

99
00:07:51,220 --> 00:07:56,410
And that brings us to our next major point. This is a very helpful perspective when you're thinking

100
00:07:56,410 --> 00:07:57,360
about neural networks.

101
00:07:57,370 --> 00:08:02,860
So I think it's important to learn about early on. If you think about what we just looked at, neural

102
00:08:02,880 --> 00:08:07,870
networks for binary classification and neural networks for regression, they are not that different

103
00:08:07,870 --> 00:08:09,310
from what we looked at previously.

104
00:08:10,210 --> 00:08:15,550
If we look at just the final layers, they look just like regular old linear regression and logistic

105
00:08:15,550 --> 00:08:16,060
regression.

106
00:08:16,360 --> 00:08:18,280
That's what we studied in the previous section.

107
00:08:19,180 --> 00:08:25,670
So here's a new way to think about neural networks: instead of merely a network of layers of neurons,

108
00:08:25,990 --> 00:08:28,840
really, we're just doing a series of feature transformations.

109
00:08:29,290 --> 00:08:31,460
So we go from feature to feature to feature.

110
00:08:31,480 --> 00:08:36,940
And then finally, we just have a plain linear regression for regression or logistic regression for

111
00:08:36,940 --> 00:08:37,810
classification.

112
00:08:42,900 --> 00:08:48,480
Using these uniform layers, researchers noticed that neural networks were able to learn hierarchies

113
00:08:48,480 --> 00:08:53,470
of features. Each layer seemed to be learning something more complex than the last.

114
00:08:54,060 --> 00:08:57,840
And this was the motivation for making neural networks deep.

115
00:08:58,080 --> 00:09:00,840
And that's why this field is called deep learning.

116
00:09:02,190 --> 00:09:08,460
One of my favorite examples of this is with facial recognition networks. By visualizing what each layer

117
00:09:08,460 --> 00:09:09,150
has learned,

118
00:09:09,600 --> 00:09:14,250
researchers found that the initial layers just learned some basic lines and strokes.

119
00:09:14,760 --> 00:09:20,520
And in the middle, the neural networks started to learn about specific features of faces such as eyes,

120
00:09:20,520 --> 00:09:21,910
nose, mouth and so on.

121
00:09:22,650 --> 00:09:27,120
Finally, at the final layers, the neural networks started to see entire faces.

122
00:09:27,510 --> 00:09:33,750
In other words, neural networks are able to break down a problem into smaller sub problems or, in

123
00:09:33,750 --> 00:09:35,010
other words, a hierarchy.
