1
00:00:11,820 --> 00:00:17,310
In this lecture, we are going to discuss forward propagation as it relates to neural networks.

2
00:00:18,180 --> 00:00:24,330
Neural networks are all about making predictions, and so making a prediction in a neural network is

3
00:00:24,330 --> 00:00:27,630
what we refer to as going in the forward direction.

4
00:00:32,790 --> 00:00:38,340
Previously, we looked at logistic regression and how this might be seen as a computer simulation of

5
00:00:38,340 --> 00:00:38,850
a neuron.

6
00:00:39,480 --> 00:00:42,960
Well, let's consider an interesting scenario here.

7
00:00:42,960 --> 00:00:44,070
We have one neuron.

8
00:00:44,820 --> 00:00:48,990
But let's say these same inputs propagate to another output.

9
00:00:49,590 --> 00:00:55,650
Now we have two neurons, and each of them could be calculating something different given the same inputs.

10
00:00:56,370 --> 00:01:01,290
If you want a sort of real-world analogy of that, suppose we're looking at a face. One

11
00:01:01,290 --> 00:01:05,850
neuron might be looking for the presence of an eye and another might be looking for the presence of

12
00:01:05,850 --> 00:01:06,540
a nose.

13
00:01:07,050 --> 00:01:11,190
In other words, different neurons are finding different features from the input.

14
00:01:16,180 --> 00:01:18,250
Well, how about we add another neuron?

15
00:01:19,030 --> 00:01:21,220
Now we have a whole layer of neurons.

16
00:01:26,350 --> 00:01:32,260
The key insight in deep learning, made decades ago by researchers such as Geoffrey Hinton, was that

17
00:01:32,260 --> 00:01:35,770
we could stack more neurons on top of the existing neurons.

18
00:01:36,220 --> 00:01:38,230
Remember, that's kind of like how our brain is.

19
00:01:38,530 --> 00:01:40,360
We don't just have one neuron in our brain.

20
00:01:40,360 --> 00:01:41,140
We have many.

21
00:01:41,590 --> 00:01:46,630
So one neuron might be connected to another neuron, which might then be further connected to another

22
00:01:46,630 --> 00:01:47,830
neuron and so forth.

23
00:01:48,280 --> 00:01:50,830
So you end up with this long chain of neurons.

24
00:01:51,670 --> 00:01:57,370
The key idea is that we can think of neurons as uniform, although these days we know a lot about the

25
00:01:57,370 --> 00:01:57,730
brain,

26
00:01:57,730 --> 00:02:00,040
and so this is not physically the case.

27
00:02:00,580 --> 00:02:06,040
Some parts of the brain are different than others, but for our simple computational model, we can

28
00:02:06,040 --> 00:02:07,330
assume the brain is uniform.

29
00:02:07,780 --> 00:02:13,570
There's no difference between a neuron in one part of the brain compared to another part. As you might

30
00:02:13,570 --> 00:02:14,170
expect,

31
00:02:14,440 --> 00:02:15,780
we can just keep doing this.

32
00:02:15,790 --> 00:02:19,840
So we have one layer going to another layer, going to another layer and so on.

33
00:02:24,940 --> 00:02:25,840
So what have we done?

34
00:02:26,560 --> 00:02:30,190
Well, we expanded our concept of neurons in two important ways.

35
00:02:30,820 --> 00:02:36,460
First, we said that the same inputs could be attached to multiple different neurons, each calculating

36
00:02:36,460 --> 00:02:37,270
something different.

37
00:02:37,810 --> 00:02:40,030
This means more neurons per layer.

38
00:02:40,840 --> 00:02:46,540
Second, we said that neurons in one layer could act as inputs to another layer of neurons.

39
00:02:47,200 --> 00:02:50,320
So the first concept is to make the neural network more wide.

40
00:02:50,890 --> 00:02:55,720
The second concept is to make the neural network more deep. And believe it or not,

41
00:02:55,990 --> 00:02:58,450
this is all it takes to build a neural network.

42
00:02:58,990 --> 00:02:59,890
Take a neuron.

43
00:03:00,130 --> 00:03:05,620
Add more neurons across the same layer to make it wide and then add more layers to make it deep.

44
00:03:10,750 --> 00:03:14,860
Previously, we looked at the equation for a line and that became our model.

45
00:03:15,310 --> 00:03:22,030
So if you recall, that was wx plus b. We then said, in order to make this into a neuron, we're going

46
00:03:22,030 --> 00:03:24,700
to have multiple inputs and multiple weights.

47
00:03:25,330 --> 00:03:31,030
We're also going to attach a sigmoid at the end in order to map the output to a value between zero and

48
00:03:31,030 --> 00:03:37,340
one, representing something like the probability of an action potential. That became the sigmoid of

49
00:03:37,360 --> 00:03:39,040
W transpose x plus b.

50
00:03:40,650 --> 00:03:46,050
The next question to consider is what should we do if we have multiple neurons per layer?

51
00:03:51,130 --> 00:03:57,400
Well, suppose I have some output at the next layer. Let's call that z sub j to represent the jth neuron

52
00:03:57,400 --> 00:04:05,590
in that layer. Then z sub j would just be the sigmoid of w sub j transpose x plus b sub j.

53
00:04:06,280 --> 00:04:10,780
Let's say that J goes from one to M so that we have M neurons in this layer.

54
00:04:11,500 --> 00:04:14,710
That also means we'll have M w sub j's and M b sub j's.

55
00:04:20,630 --> 00:04:26,720
It turns out that we can make this calculation more efficient if we consider Z to be a vector of neuron

56
00:04:26,720 --> 00:04:28,790
values instead of just a scalar.

57
00:04:29,150 --> 00:04:31,000
Then we can write this more compactly.

58
00:04:31,940 --> 00:04:36,950
Now you may find this a little confusing if you're not used to working with matrices and functions of

59
00:04:36,950 --> 00:04:37,670
matrices.

60
00:04:38,060 --> 00:04:40,040
So we have to clarify a few things.

61
00:04:40,820 --> 00:04:42,890
First, z is a vector of size

62
00:04:42,890 --> 00:04:48,410
M. x is, as before, a vector of size D. Or equivalently,

63
00:04:48,410 --> 00:04:54,950
you can think of z as a column vector of size M by one and x as a column vector of size D by one.

64
00:04:55,820 --> 00:05:02,630
In order to make this a valid matrix multiplication, we have the following: W is a matrix of size D

65
00:05:02,630 --> 00:05:03,230
by M.

66
00:05:03,620 --> 00:05:09,140
So when you transpose it, you get M by D, and then when you multiply that by x, which is of size

67
00:05:09,140 --> 00:05:14,620
D, you get out a vector of size M. In order to add the bias term b,

68
00:05:14,630 --> 00:05:16,700
it is also a vector of size M.

69
00:05:17,540 --> 00:05:20,600
The sigmoid here is meant to be an element-wise operation.

70
00:05:21,080 --> 00:05:24,950
So whatever the size of the input is, it has the same size output.

71
00:05:25,700 --> 00:05:31,280
In other words, it would be as if you apply the sigmoid to each component of its argument individually

72
00:05:31,400 --> 00:05:35,000
and then organize them into a vector or a matrix of the same shape.

73
00:05:40,120 --> 00:05:41,440
Using these concepts,

74
00:05:41,680 --> 00:05:47,140
let's think about how we might mathematically express going from the input to the output of a neural

75
00:05:47,140 --> 00:05:47,650
network.

76
00:05:48,670 --> 00:05:53,230
First, let's consider that we are going to have multiple layers, so it's not enough to call our weights

77
00:05:53,230 --> 00:05:57,340
just W and b. Instead, we're going to give them indices.

78
00:05:57,910 --> 00:06:04,600
So let's say W superscript one and b superscript one are the weight and bias of the first layer.

79
00:06:05,410 --> 00:06:11,560
W superscript two and b superscript two are the weight and bias of the second layer and so forth.

80
00:06:12,280 --> 00:06:15,700
Z superscript one will be the output of the first layer.

81
00:06:16,120 --> 00:06:19,810
Z Superscript two will be the output of the second layer and so forth.

82
00:06:21,280 --> 00:06:24,190
Now, there are two things to notice about the final layer here.

83
00:06:25,120 --> 00:06:30,130
First, you'll notice that I've used the capital letter L to denote the total number of layers.

84
00:06:30,850 --> 00:06:36,470
The letter L is common in deep learning, so if you see the letter L elsewhere, don't be surprised.

85
00:06:36,490 --> 00:06:38,380
Just pay attention to the context.

86
00:06:38,890 --> 00:06:45,190
In this context, we mean the number of layers. And of course, the input to the last layer will be the

87
00:06:45,190 --> 00:06:47,200
output of the second-to-last layer.

88
00:06:47,740 --> 00:06:50,260
So we have this nice repeating pattern at each layer.

89
00:06:50,980 --> 00:06:54,040
Each layer is just the same equation over and over again.

90
00:06:56,070 --> 00:06:58,860
So that's what we mean when we say the neural network is uniform.

91
00:06:59,940 --> 00:07:05,400
The second thing to notice is that for now, we'll assume that we are still doing binary classification.

92
00:07:06,150 --> 00:07:11,450
Therefore, the final sigmoid maps the output of the neural network to a single number between

93
00:07:11,490 --> 00:07:12,250
zero and one.

94
00:07:13,140 --> 00:07:17,790
You can imagine that if we are doing regression, we would not want to have the sigmoid there.

95
00:07:22,900 --> 00:07:27,490
So, of course, this brings us to the next question, what does a neural network for regression look

96
00:07:27,490 --> 00:07:27,910
like?

97
00:07:28,390 --> 00:07:31,360
Surely we are not limited to just binary classification.

98
00:07:31,930 --> 00:07:33,490
And luckily, it's very similar.

99
00:07:33,730 --> 00:07:36,610
All we do is simply remove the final sigmoid.

100
00:07:37,870 --> 00:07:43,870
One interesting thing we can see from this, if we look at only the final layer here, is that it appears

101
00:07:43,870 --> 00:07:46,210
to be just regular old linear regression.

102
00:07:51,250 --> 00:07:53,410
And that brings us to our next major point.

103
00:07:54,070 --> 00:07:57,320
This is a very helpful perspective when you're thinking about neural networks.

104
00:07:57,340 --> 00:07:59,800
So I think it's important to learn about early on.

105
00:08:00,640 --> 00:08:05,620
If you think about what we just looked at, neural networks for binary classification and neural networks

106
00:08:05,620 --> 00:08:09,280
for regression, they are not that different from what we looked at previously.

107
00:08:10,180 --> 00:08:15,520
If we look at just the final layers, they look just like regular old linear regression and logistic

108
00:08:15,520 --> 00:08:16,030
regression.

109
00:08:16,330 --> 00:08:18,280
That's what we studied in the previous section.

110
00:08:19,180 --> 00:08:25,630
So here's a new way to think about neural networks. Instead of merely a network of layers of neurons,

111
00:08:25,990 --> 00:08:28,810
really, we're just doing a series of feature transformations.

112
00:08:29,290 --> 00:08:31,450
So we go from feature to feature to feature.

113
00:08:31,450 --> 00:08:36,940
And then finally, we just have a plain linear regression for regression or logistic regression for

114
00:08:36,940 --> 00:08:37,780
classification.

115
00:08:42,870 --> 00:08:48,480
Using these uniform layers, researchers noticed that neural networks were able to learn hierarchies

116
00:08:48,480 --> 00:08:49,260
of features.

117
00:08:49,890 --> 00:08:53,430
Each layer seemed to be learning something more complex than the last.

118
00:08:54,090 --> 00:08:57,810
And this was the motivation for making neural networks deep.

119
00:08:58,140 --> 00:09:00,840
And that's why this field is called deep learning.

120
00:09:02,160 --> 00:09:08,430
One of my favorite examples of this is with facial recognition networks. By visualizing what each layer

121
00:09:08,430 --> 00:09:09,150
has learned,

122
00:09:09,570 --> 00:09:15,630
researchers found that the initial layers just learn some basic lines and strokes, and in the middle,

123
00:09:15,960 --> 00:09:21,330
the neural networks started to learn about specific features of faces such as eyes, nose, mouth and

124
00:09:21,330 --> 00:09:21,900
so on.

125
00:09:22,650 --> 00:09:27,060
Finally, at the final layers, the neural networks started to see entire faces.

126
00:09:27,510 --> 00:09:33,750
In other words, neural networks are able to break down a problem into smaller subproblems or, in

127
00:09:33,750 --> 00:09:35,010
other words, a hierarchy.