1
00:00:00,000 --> 00:00:03,580
Before we can start learning about how

2
00:00:03,580 --> 00:00:06,020
neural networks learn and optimize their

3
00:00:06,020 --> 00:00:08,560
weights and biases, let's learn about

4
00:00:08,560 --> 00:00:11,000
gradient descent. If you already know it,

5
00:00:11,010 --> 00:00:12,990
please bear with me, as I want to make

6
00:00:12,990 --> 00:00:14,549
sure that everyone who's taking this

7
00:00:14,549 --> 00:00:18,449
course is on the same page. Let's say

8
00:00:18,449 --> 00:00:20,460
that we have some data, and here's a

9
00:00:20,460 --> 00:00:23,609
scatter plot of the data. For simplicity,

10
00:00:23,609 --> 00:00:27,210
I have generated data where z is twice x.

11
00:00:27,210 --> 00:00:30,929
And say we want to find the value of w

12
00:00:30,929 --> 00:00:33,600
that would generate a line that best

13
00:00:33,600 --> 00:00:36,870
fits this data. To do that, we define a

14
00:00:36,870 --> 00:00:38,840
cost or a loss function.

15
00:00:38,840 --> 00:00:42,600
One common cost or loss function is the

16
00:00:42,600 --> 00:00:45,539
one shown here as J, where we take the

17
00:00:45,539 --> 00:00:47,850
difference between the z values and the

18
00:00:47,850 --> 00:00:51,149
product of w and x's. We square that and

19
00:00:51,149 --> 00:00:53,550
we sum the squared difference across all

20
00:00:53,550 --> 00:00:57,629
values of z and x. The best value of w

21
00:00:57,629 --> 00:00:59,579
would then be the value that results in

22
00:00:59,579 --> 00:01:02,039
the minimum value of this cost or loss

23
00:01:02,039 --> 00:01:04,830
function. Let's take a look at how this

24
00:01:04,830 --> 00:01:07,250
cost function looks like.

25
00:01:07,250 --> 00:01:10,140
What makes this loss or cost function

26
00:01:10,140 --> 00:01:12,119
attractive is that it is a parabola, and

27
00:01:12,119 --> 00:01:15,030
has one global minimum or one unique

28
00:01:15,030 --> 00:01:18,030
solution. For the given data, the value of

29
00:01:18,030 --> 00:01:20,670
w that makes this cost function minimum,

30
00:01:20,670 --> 00:01:25,799
is w equals 2, meaning z equals 2x, which

31
00:01:25,799 --> 00:01:27,689
would result in a line that fits the

32
00:01:27,689 --> 00:01:30,060
points perfectly. This is a very

33
00:01:30,060 --> 00:01:32,320
simplified example as in real world datasets,

34
00:01:32,320 --> 00:01:34,829
the target variable z would be

35
00:01:34,829 --> 00:01:36,930
dependent on more than one variable and

36
00:01:36,930 --> 00:01:39,090
we can't just simply plot the cost

37
00:01:39,090 --> 00:01:41,520
function and visually determine the best

38
00:01:41,520 --> 00:01:43,829
value of the weights. So how do we

39
00:01:43,829 --> 00:01:47,520
determine the best value of w. The best

40
00:01:47,520 --> 00:01:50,850
value of w, or w's in case you have many

41
00:01:50,850 --> 00:01:53,399
weights to optimize, is determined through

42
00:01:53,399 --> 00:01:55,700
an algorithm called gradient descent.

43
00:01:55,700 --> 00:01:58,020
Gradient descent is an iterative

44
00:01:58,020 --> 00:02:00,930
optimization algorithm for finding the

45
00:02:00,930 --> 00:02:03,659
minimum of a function. To find the

46
00:02:03,659 --> 00:02:05,430
minimum of a function using gradient

47
00:02:05,430 --> 00:02:08,280
descent, one takes steps proportional to

48
00:02:08,280 --> 00:02:10,560
the negative of the gradient of the

49
00:02:10,560 --> 00:02:12,790
function at the current point.

50
00:02:12,790 --> 00:02:15,760
What does that mean? We start at a random

51
00:02:15,760 --> 00:02:18,609
initial value of w. Let's call it  w0,

52
00:02:18,609 --> 00:02:21,189
and say it's equal to 0.2, and

53
00:02:21,189 --> 00:02:23,890
we start taking steps towards the green

54
00:02:23,890 --> 00:02:27,280
dot which is w = 2. To determine

55
00:02:27,280 --> 00:02:29,829
in which direction to move, we compute

56
00:02:29,829 --> 00:02:32,439
the gradient of the loss function at the

57
00:02:32,439 --> 00:02:35,170
current value of w, which is 0.2.

58
00:02:35,170 --> 00:02:37,720
The gradient is given by the slope

59
00:02:37,720 --> 00:02:40,299
of the tangent at w = 0.2,

60
00:02:40,299 --> 00:02:43,269
and then the magnitude of the step

61
00:02:43,269 --> 00:02:46,209
is controlled by a parameter called the

62
00:02:46,209 --> 00:02:48,519
learning rate. The larger the learning

63
00:02:48,519 --> 00:02:51,639
rate, the bigger the step we take, and the

64
00:02:51,639 --> 00:02:53,919
smaller the learning rate, the smaller

65
00:02:53,919 --> 00:02:56,260
the step we take. Then we take the step

66
00:02:56,260 --> 00:02:59,950
and we move to w1. w1 is essentially

67
00:02:59,950 --> 00:03:04,359
computed as w0 minus the learning rate

68
00:03:04,359 --> 00:03:07,810
times the gradient at w0. This

69
00:03:07,810 --> 00:03:10,120
represents the first iteration of the

70
00:03:10,120 --> 00:03:13,659
algorithm. At w1, we repeat the same

71
00:03:13,659 --> 00:03:16,659
process of computing the gradient at w1

72
00:03:16,659 --> 00:03:20,919
and using the same learning rate to

73
00:03:20,919 --> 00:03:22,629
control the magnitude of the step

74
00:03:22,629 --> 00:03:24,849
towards the minimum. We keep repeating

75
00:03:24,849 --> 00:03:27,220
this step again and again until we hit

76
00:03:27,220 --> 00:03:29,889
the minimum or a value of the cost

77
00:03:29,889 --> 00:03:31,900
function that is very close to the

78
00:03:31,900 --> 00:03:34,479
minimum, within a very small predefined

79
00:03:34,479 --> 00:03:36,879
threshold. Now when choosing the learning

80
00:03:36,879 --> 00:03:38,829
rate, we have to be very careful as a

81
00:03:38,829 --> 00:03:41,560
large learning rate can lead to big

82
00:03:41,560 --> 00:03:43,959
steps and eventually missing the minimum.

83
00:03:43,959 --> 00:03:47,079
On the other hand, a small learning rate

84
00:03:47,079 --> 00:03:49,299
can result in very small steps and

85
00:03:49,299 --> 00:03:51,879
therefore causing the algorithm to take

86
00:03:51,879 --> 00:03:54,329
a long time to find the minimum point.

87
00:03:54,329 --> 00:03:57,129
Now let's see how each iteration with a

88
00:03:57,129 --> 00:04:00,159
learning rate of 0.4 affects the way the

89
00:04:00,159 --> 00:04:02,430
resulting line fits the data on the left.

90
00:04:02,430 --> 00:04:06,189
We initialize w to 0, which means z

91
00:04:06,189 --> 00:04:07,209
equals 0.

92
00:04:07,209 --> 00:04:09,310
It is a horizontal line and therefore the

93
00:04:09,310 --> 00:04:11,829
cost is high and the line as you can

94
00:04:11,829 --> 00:04:14,379
see is a bad fit. After the 1st

95
00:04:14,379 --> 00:04:17,380
iteration, w moves closer to 2 and

96
00:04:17,380 --> 00:04:20,289
because the slope is very steep at w=0,

97
00:04:20,289 --> 00:04:23,950
the new value of w results in a

98
00:04:23,950 --> 00:04:26,590
big drop in the loss function.

99
00:04:26,590 --> 00:04:28,600
The resulting line fits better than the

100
00:04:28,600 --> 00:04:30,580
initial one but there is still room for

101
00:04:30,580 --> 00:04:36,430
improvement. After the 2nd iteration, w

102
00:04:36,430 --> 00:04:38,699
continues moving toward w = 2.

103
00:04:38,699 --> 00:04:41,350
Because the slope is not as steep as

104
00:04:41,350 --> 00:04:44,290
before, the step is not as big but the

105
00:04:44,290 --> 00:04:46,510
cost function still drops in value and

106
00:04:46,510 --> 00:04:49,090
the resulting line is moving closer to

107
00:04:49,090 --> 00:04:51,940
the ideal best fit line. The same

108
00:04:51,940 --> 00:04:53,680
observation is noted with the 3rd

109
00:04:53,680 --> 00:04:57,900
iteration, and the 4th iteration. After

110
00:04:57,900 --> 00:05:01,060
4 iterations, you can see how we are

111
00:05:01,060 --> 00:05:03,669
almost there at w = 2, and the

112
00:05:03,669 --> 00:05:05,470
resulting line almost fits the

113
00:05:05,470 --> 00:05:08,020
scatterplot perfectly. With each

114
00:05:08,020 --> 00:05:10,780
iteration, the weight is updated in a way

115
00:05:10,780 --> 00:05:13,030
that's proportional to the negative of

116
00:05:13,030 --> 00:05:15,250
the gradient of the function at the

117
00:05:15,250 --> 00:05:17,800
current point. Therefore, if you

118
00:05:17,800 --> 00:05:19,810
initialize the weight to a value that is to

119
00:05:19,810 --> 00:05:22,570
the right of the minimum, then the

120
00:05:22,570 --> 00:05:24,910
positive gradient will result in w

121
00:05:24,910 --> 00:05:27,150
moving to the left towards the minimum.

122
00:05:27,150 --> 00:05:29,979
Now that we understand how to optimize a

123
00:05:29,979 --> 00:05:32,050
parameter that a function depends on,

124
00:05:32,050 --> 00:05:34,300
we are now ready to start learning about

125
00:05:34,300 --> 00:05:37,090
backpropagation and how neural networks

126
00:05:37,090 --> 00:05:39,000
learn and optimize their weights and biases.