1
00:00:11,910 --> 00:00:17,980
In this lecture we are going to discuss forward propagation as it relates to neural networks.

2
00:00:18,210 --> 00:00:24,510
Neural networks are all about making predictions and so making a prediction in a neural network is what

3
00:00:24,510 --> 00:00:27,890
we refer to as going in the forward direction.

4
00:00:32,810 --> 00:00:38,390
Previously we looked at logistic regression and how this might be seen as a computer simulation of a

5
00:00:38,390 --> 00:00:39,520
neuron.

6
00:00:39,560 --> 00:00:42,800
Well let's consider an interesting scenario.

7
00:00:42,800 --> 00:00:49,660
Here we have one neuron but let's say these same inputs propagate to another output.

8
00:00:49,670 --> 00:00:56,380
Now we have two neurons. Each of them could be calculating something different given the same inputs.

9
00:00:56,450 --> 00:01:01,940
If you want a sort of real-world analogy of that, suppose we're looking at a face. One neuron might be

10
00:01:01,940 --> 00:01:03,510
looking for the presence of an eye.

11
00:01:03,950 --> 00:01:07,110
And another might be looking for the presence of a nose.

12
00:01:07,130 --> 00:01:11,180
In other words, different neurons are finding different features from the input.

13
00:01:16,270 --> 00:01:18,910
Well, how about we add another neuron?

14
00:01:19,090 --> 00:01:21,280
Now we have a whole layer of neurons.

15
00:01:26,380 --> 00:01:32,410
The key insight in deep learning, made decades ago by researchers such as Geoffrey Hinton, was that we

16
00:01:32,410 --> 00:01:36,250
could stack more neurons on top of the existing neurons.

17
00:01:36,280 --> 00:01:38,470
Remember that's kind of like how our brain is.

18
00:01:38,590 --> 00:01:41,500
We don't just have one neuron in our brain we have many.

19
00:01:41,620 --> 00:01:46,630
So one neuron might be connected to another neuron which might then be further connected to another

20
00:01:46,630 --> 00:01:48,310
neuron and so forth.

21
00:01:48,310 --> 00:01:51,700
So you end up with this long chain of neurons.

22
00:01:51,730 --> 00:01:57,370
The key idea is that we can think of neurons as uniform although these days we know a lot about the

23
00:01:57,370 --> 00:01:57,760
brain.

24
00:01:57,760 --> 00:02:00,640
And so this is not physically the case.

25
00:02:00,640 --> 00:02:06,370
Some parts of the brain are different than others but for our simple computational model we can assume

26
00:02:06,370 --> 00:02:07,840
the brain is uniform.

27
00:02:07,840 --> 00:02:13,570
There is no difference between a neuron in one part of the brain compared to another part. As you might

28
00:02:13,570 --> 00:02:14,470
expect,

29
00:02:14,530 --> 00:02:15,830
we can just keep doing this.

30
00:02:15,850 --> 00:02:19,830
So we have one layer going to another layer going to another layer, and so on.

31
00:02:25,000 --> 00:02:25,870
So what have we done?

32
00:02:26,650 --> 00:02:30,850
Well we expanded our concept of neurons in two important ways.

33
00:02:30,850 --> 00:02:36,460
First we said that the same inputs could be attached to multiple different neurons each calculating

34
00:02:36,460 --> 00:02:37,850
something different.

35
00:02:37,870 --> 00:02:40,890
This means more neurons per layer.

36
00:02:40,900 --> 00:02:47,150
Second we said that neurons in one layer could act as inputs to another layer of neurons.

37
00:02:47,260 --> 00:02:50,940
So the first concept is to make the neural network wider.

38
00:02:50,950 --> 00:02:55,740
The second concept is to make the neural network deeper. And believe it or not,

39
00:02:56,050 --> 00:02:58,970
this is all it takes to build a neural network.

40
00:02:59,020 --> 00:03:05,170
Take a neuron add more neurons across the same layer to make it wide and then add more layers to make

41
00:03:05,170 --> 00:03:05,580
it deep.

42
00:03:10,790 --> 00:03:11,540
Previously,

43
00:03:11,540 --> 00:03:14,910
we looked at the equation for a line, and that became our model.

44
00:03:15,380 --> 00:03:22,220
So if you recall, that was a x plus b. We then said, in order to make this into a neuron, we're going to

45
00:03:22,220 --> 00:03:25,400
have multiple inputs and multiple weights.

46
00:03:25,430 --> 00:03:31,040
We're also going to attach a sigmoid at the end in order to map the output to a value between 0 and

47
00:03:31,040 --> 00:03:35,900
1 representing something like the probability of an action potential.

48
00:03:35,900 --> 00:03:43,830
That became the sigmoid of w transpose x plus b. The next question to consider is what should we do if

49
00:03:43,830 --> 00:03:46,080
we have multiple neurons per layer?

50
00:03:51,220 --> 00:03:57,400
Well, suppose I have some output at the next layer. Let's call that z sub j, to represent the jth neuron

51
00:03:57,430 --> 00:03:58,620
in that layer.

52
00:03:58,930 --> 00:04:06,230
Then z sub j would just be the sigmoid of w sub j transpose x plus b sub j.

53
00:04:06,340 --> 00:04:11,390
Let's say that j goes from 1 to M, so that we have M neurons in this layer.

54
00:04:11,560 --> 00:04:15,580
That also means we'll have M w sub j's and M b sub j's.
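
Written out, the per-neuron calculation just described might look like the following. This is only a notational sketch: it assumes x is the D-dimensional input, each neuron j has its own weight vector w_j and scalar bias b_j, and the symbol sigma stands for the sigmoid.

```latex
z_j = \sigma\left(w_j^{\top} x + b_j\right), \qquad j = 1, \dots, M,
\qquad \text{where } \sigma(a) = \frac{1}{1 + e^{-a}}
```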

55
00:04:20,690 --> 00:04:26,750
It turns out that we can make this calculation more efficient if we consider Z to be a vector of neuron

56
00:04:26,750 --> 00:04:31,830
values instead of just a scalar. Then we can write this more compactly.

57
00:04:32,000 --> 00:04:36,890
Now you might find this a little confusing if you're not used to working with matrices and functions

58
00:04:36,890 --> 00:04:38,030
of matrices.

59
00:04:38,120 --> 00:04:40,850
So we have to clarify a few things.

60
00:04:40,850 --> 00:04:49,640
First, z is a vector of size M. x is, as before, a vector of size D. Or equivalently, you can think of z as

61
00:04:49,640 --> 00:04:57,500
a column vector of size M by 1 and x to be a column vector of size D by 1. In order to make this a valid

62
00:04:57,500 --> 00:04:58,970
matrix multiplication,

63
00:04:58,970 --> 00:05:00,740
we have the following.

64
00:05:00,740 --> 00:05:03,580
W is a matrix of size D by M.

65
00:05:03,650 --> 00:05:10,070
So when you transpose it you get M by D, and then when you multiply that by x, which is of size D, you

66
00:05:10,070 --> 00:05:14,640
get out a vector of size M. In order to add the bias term b,

67
00:05:14,720 --> 00:05:21,830
it is also a vector of size M. The sigmoid here is meant to be an element-wise operation, so whatever the

68
00:05:21,830 --> 00:05:25,790
size of the input is it has the same size output.

69
00:05:25,790 --> 00:05:31,280
In other words it would be as if you apply the sigmoid to each component of its argument individually

70
00:05:31,520 --> 00:05:35,050
and then organize them into a vector or a matrix of the same shape.
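
To make the shapes concrete, here is a minimal NumPy sketch of one layer computed as the sigmoid of W transpose x plus b. The function names (`sigmoid`, `layer_forward`), the sizes D = 3 and M = 4, and the random values are illustrative assumptions, not something taken from the lecture. It also checks that the vectorized form matches computing each neuron one at a time.

```python
import numpy as np

def sigmoid(a):
    # Element-wise sigmoid: the output has the same shape as the input.
    return 1.0 / (1.0 + np.exp(-a))

def layer_forward(W, b, x):
    # W has shape (D, M), b has shape (M,), x has shape (D,).
    # W.T @ x has shape (M,), so adding b and applying sigmoid gives shape (M,).
    return sigmoid(W.T @ x + b)

# Illustrative sizes and random parameters (assumptions for this sketch).
D, M = 3, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((D, M))
b = rng.standard_normal(M)
x = rng.standard_normal(D)

# Vectorized: all M neurons at once.
z = layer_forward(W, b, x)

# Same calculation one neuron at a time, using column j of W as w_j.
z_loop = np.array([sigmoid(W[:, j] @ x + b[j]) for j in range(M)])
```

Both `z` and `z_loop` hold M values between 0 and 1; the vectorized version simply replaces the loop over neurons with a single matrix multiplication.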

71
00:05:40,220 --> 00:05:46,190
Using these concepts, let's think about how we might mathematically express going from the input to the

72
00:05:46,190 --> 00:05:48,710
output of a neural network.

73
00:05:48,710 --> 00:05:51,560
First let's consider that we are going to have multiple layers.

74
00:05:51,560 --> 00:05:57,770
So it's not enough to call our weights just W and b. Instead, we're going to give them indices.

75
00:05:57,950 --> 00:06:06,410
So let's say W superscript 1 and b superscript 1 are the weight and bias of the first layer, W superscript

76
00:06:06,410 --> 00:06:12,320
2 and b superscript 2 are the weight and bias of the second layer, and so forth.

77
00:06:12,320 --> 00:06:18,650
z superscript 1 will be the output of the first layer, z superscript 2 will be the output of the second

78
00:06:18,650 --> 00:06:19,750
layer, and so forth.

79
00:06:21,340 --> 00:06:25,140
Now there are two things to notice about the final layer here.

80
00:06:25,150 --> 00:06:31,300
First, you'll notice that I've used the capital letter L to denote the total number of layers. The letter

81
00:06:31,300 --> 00:06:32,830
L is common in deep learning.

82
00:06:32,830 --> 00:06:36,550
So if you see the letter L elsewhere don't be surprised.

83
00:06:36,550 --> 00:06:43,630
Just pay attention to the context. In this context, we mean the number of layers. And of course, the input

84
00:06:43,630 --> 00:06:47,200
to the last layer will be the output of the second last layer.

85
00:06:47,800 --> 00:06:53,680
So we have this nice repeating pattern at each layer. Each layer is just the same equation over and over

86
00:06:53,680 --> 00:06:54,040
again.

87
00:06:56,110 --> 00:06:59,980
So that's what we mean when we say that a neural network is uniform.

88
00:07:00,010 --> 00:07:06,220
The second thing to notice is that for now we'll assume that we are still doing binary classification.

89
00:07:06,220 --> 00:07:11,950
Therefore, the final sigmoid maps the output of the neural network to be a single number between 0 and

90
00:07:11,950 --> 00:07:13,190
1.
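
The repeating pattern can be sketched as a simple loop in NumPy, where each layer's output becomes the next layer's input. The function name `forward`, the layer sizes, and the random parameters are assumptions made for illustration; the last layer has a single neuron, so the output is one number between 0 and 1, as in binary classification.

```python
import numpy as np

def sigmoid(a):
    # Element-wise sigmoid.
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights, biases):
    # Apply the same equation at every layer:
    # z(l) = sigmoid(W(l).T @ z(l-1) + b(l)), starting from z(0) = x.
    z = x
    for W, b in zip(weights, biases):
        z = sigmoid(W.T @ z + b)
    return z

# Illustrative architecture (an assumption): D = 3 inputs, two hidden layers
# of 5 and 4 neurons, and 1 output neuron, so L = 3 layers of weights.
sizes = [3, 5, 4, 1]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((n_in, n_out))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]

x = rng.standard_normal(sizes[0])
p = forward(x, weights, biases)  # array of shape (1,), a value in (0, 1)
```

Here `len(weights)` plays the role of L, and the input to the last layer is whatever the second-last layer produced.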

91
00:07:13,210 --> 00:07:17,800
You can imagine that if we are doing regression, we would not want to have the sigmoid there.

92
00:07:22,940 --> 00:07:25,460
So of course, this brings us to the next question.

93
00:07:25,580 --> 00:07:28,430
What does a neural network for regression look like?

94
00:07:28,430 --> 00:07:33,790
Surely we are not limited to just binary classification and luckily it's very similar.

95
00:07:33,800 --> 00:07:37,780
All we do is simply remove the final sigmoid.

96
00:07:37,960 --> 00:07:43,870
One interesting thing we can see from this if we look at only the final layer here is that it appears

97
00:07:43,870 --> 00:07:51,070
just to be regular old linear regression.

98
00:07:51,320 --> 00:07:54,040
And that brings us to our next major point.

99
00:07:54,140 --> 00:07:57,350
This is a very helpful perspective when you're thinking about neural networks.

100
00:07:57,380 --> 00:07:59,750
So I think it's important to learn about early on.

101
00:08:00,710 --> 00:08:05,630
If you think about what we just looked at, neural networks for binary classification and neural networks

102
00:08:05,630 --> 00:08:10,250
for regression they are not that different from what we looked at previously.

103
00:08:10,250 --> 00:08:16,320
If we look at just the final layers, they look just like regular old linear regression and logistic regression.

104
00:08:16,400 --> 00:08:18,320
That's what we studied in the previous session.

105
00:08:19,220 --> 00:08:26,050
So here's a new way to think about neural networks. Instead of merely a network of layers of neurons,

106
00:08:26,060 --> 00:08:31,470
really we're just doing a series of feature transformations, so we go from feature to feature to feature.

107
00:08:31,520 --> 00:08:37,790
And then finally we just have a plain linear regression for regression, or logistic regression for classification.
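
One way to see this perspective in code is to separate the hidden layers, which transform the input into features, from the final layer, which is just linear or logistic regression on those features. This is a sketch under assumptions: the helper names (`hidden_features`, `predict_regression`, `predict_binary`), the layer sizes, and the random parameters are all hypothetical.

```python
import numpy as np

def sigmoid(a):
    # Element-wise sigmoid.
    return 1.0 / (1.0 + np.exp(-a))

def hidden_features(x, weights, biases):
    # Series of feature transformations: feature to feature to feature.
    z = x
    for W, b in zip(weights, biases):
        z = sigmoid(W.T @ z + b)
    return z

def predict_regression(x, weights, biases, w_out, b_out):
    # Final layer with no sigmoid: plain linear regression on the features.
    phi = hidden_features(x, weights, biases)
    return w_out @ phi + b_out

def predict_binary(x, weights, biases, w_out, b_out):
    # Final layer with a sigmoid: logistic regression on the same features.
    phi = hidden_features(x, weights, biases)
    return sigmoid(w_out @ phi + b_out)

# Illustrative sizes (assumptions): 3 inputs, hidden layers of 5 and 4 neurons.
sizes = [3, 5, 4]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((n_in, n_out))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]
w_out = rng.standard_normal(sizes[-1])
b_out = 0.5

x = rng.standard_normal(sizes[0])
y_hat = predict_regression(x, weights, biases, w_out, b_out)  # any real number
p_hat = predict_binary(x, weights, biases, w_out, b_out)      # between 0 and 1
```

The only difference between the two predictors is the final sigmoid, so `p_hat` equals `sigmoid(y_hat)` when the same parameters are used.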

108
00:08:42,960 --> 00:08:44,630
Using these uniform layers,

109
00:08:44,640 --> 00:08:50,790
researchers noticed that neural networks were able to learn hierarchies of features. Each layer seemed

110
00:08:50,790 --> 00:08:54,100
to be learning something more complex than the last.

111
00:08:54,180 --> 00:08:58,090
And this was the motivation for making neural networks deep.

112
00:08:58,200 --> 00:09:04,930
And that's why this field is called deep learning. One of my favorite examples of this is with facial

113
00:09:04,930 --> 00:09:09,640
recognition networks. By visualizing what each layer has learned,

114
00:09:09,640 --> 00:09:15,640
researchers found that the initial layers just learned some basic lines and strokes, and in the middle

115
00:09:16,000 --> 00:09:21,640
the neural network started to learn about specific features of faces, such as eyes, nose, mouth, and so

116
00:09:21,640 --> 00:09:22,690
on.

117
00:09:22,690 --> 00:09:27,480
Finally at the final layers the neural networks started to see entire faces.

118
00:09:27,610 --> 00:09:33,910
In other words, neural networks are able to break down a problem into smaller subproblems, or in other

119
00:09:33,910 --> 00:09:35,020
words a hierarchy.