1
00:00:00,060 --> 00:00:11,300
Deep learning is one of the machine learning algorithms that uses artificial neurons

2
00:00:11,300 --> 00:00:23,690
that are connected to each other to recognize or extract patterns and make decisions.

3
00:00:23,840 --> 00:00:27,680
Of course, we have to construct these neurons.

4
00:00:27,680 --> 00:00:32,290
However, we don't have to explicitly program every one of them.

5
00:00:32,300 --> 00:00:38,210
Tuning this neural network is a process that we call training.

6
00:00:38,330 --> 00:00:39,260
So.

7
00:00:40,160 --> 00:00:45,230
So how do we tune this neural network?

8
00:00:45,260 --> 00:00:50,560
Well, every neuron is simply a value holder.

9
00:00:50,570 --> 00:00:55,610
Every neuron holds a single value, only a value.

10
00:00:55,610 --> 00:01:06,720
And each neuron will pass its value on to the next neuron

11
00:01:06,720 --> 00:01:08,280
or the connected neuron.

12
00:01:08,280 --> 00:01:17,790
As the value passes along this connection, it will be multiplied by a weight and added to a value

13
00:01:17,790 --> 00:01:18,810
called the bias.

14
00:01:18,990 --> 00:01:25,080
We don't directly control the value of the actual neuron.

15
00:01:25,260 --> 00:01:34,530
However, we directly control the weight that the value is going to be multiplied by in order to pass it on

16
00:01:34,530 --> 00:01:36,240
to the other neuron.

17
00:01:36,450 --> 00:01:44,490
And we also control the bias, which is simply the term added to the weight multiplied

18
00:01:44,490 --> 00:01:45,870
by the value of this neuron.
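
The weighted sum described above can be sketched in a few lines of Python. This is only an illustration: the function name `neuron_output` and all the numbers are made up for the example, not taken from the lecture.

```python
def neuron_output(inputs, weights, bias):
    """Multiply each incoming value by its weight, sum the results, add the bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


# Hypothetical values coming from four connected neurons, with made-up parameters.
inputs = [1.0, 2.0, 3.0, 4.0]
weights = [0.5, -1.0, 0.25, 0.0]
bias = 0.5

z = neuron_output(inputs, weights, bias)
print(z)
```

During training, only `weights` and `bias` are adjusted; the neuron's value `z` follows from them.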

19
00:01:46,570 --> 00:01:48,070
So what does it look like?

20
00:01:48,190 --> 00:01:56,920
Well, suppose we have data with input values, for example x1, x2, x3, x4, and a target value. Let's say

21
00:01:56,920 --> 00:01:58,330
we try to classify something,

22
00:01:58,330 --> 00:02:00,610
so it's 0 or 1.

23
00:02:00,790 --> 00:02:11,440
We simply feed every value, x1, x2, x3 and x4, into an input layer of specific

24
00:02:11,440 --> 00:02:12,010
values.

25
00:02:12,010 --> 00:02:15,700
And these values will be passed on to some layers.

26
00:02:15,700 --> 00:02:17,800
Here we call them hidden layers.

27
00:02:18,100 --> 00:02:25,720
And after that, of course, depending on the architecture, they will be passed on to the resulting layer,

28
00:02:25,720 --> 00:02:29,620
which is in this case the output layer of zero and one.

29
00:02:30,280 --> 00:02:31,240
The.

30
00:02:31,980 --> 00:02:35,700
The hidden layers will get their values from the input layer.

31
00:02:35,700 --> 00:02:42,030
And we start by initializing these weights and biases.

32
00:02:43,140 --> 00:02:47,220
After we initialize them, we will check

33
00:02:47,220 --> 00:02:53,940
whether the network gives us the zero or one we want, or whether there is some error.

34
00:02:53,970 --> 00:02:59,490
This error, in this terminology, is what we call the loss.

35
00:02:59,490 --> 00:03:07,740
And based on this loss, we will update our weights and biases until we get as good as possible, or

36
00:03:07,740 --> 00:03:14,700
as near as possible, a value to zero or one for the data we have.

37
00:03:16,390 --> 00:03:28,180
Now, as we said, we will multiply every neuron's value by the weight, and we sum it with

38
00:03:28,180 --> 00:03:32,080
the bias, and we will pass it into a function.

39
00:03:32,500 --> 00:03:35,500
These functions we call activation functions.

40
00:03:35,500 --> 00:03:42,610
And in our neural networks, we can choose any function we want.

41
00:03:42,640 --> 00:03:49,520
However, we need to test some functions in order to get a good answer.

42
00:03:49,540 --> 00:03:58,090
One of the most popular activation functions is the ReLU function, in which, if the value

43
00:03:58,270 --> 00:04:02,170
of that neuron is negative, it will just return zero.

44
00:04:02,170 --> 00:04:06,850
And if the value is positive, it will return that value.

45
00:04:07,420 --> 00:04:16,270
Now, another popular function is the sigmoid function, which has this graph, so that in the end

46
00:04:16,450 --> 00:04:22,240
the output of every neuron is between 0 and 1.

47
00:04:22,860 --> 00:04:25,350
As you can see, it is bounded by zero here

48
00:04:25,350 --> 00:04:32,040
and by one here. Another one is leaky ReLU, in which, if we have a negative value, it will provide

49
00:04:32,040 --> 00:04:34,200
a very small, leaky value,

50
00:04:34,230 --> 00:04:35,410
a very small value.

51
00:04:35,430 --> 00:04:37,650
This is why we call it leaky ReLU.

52
00:04:37,650 --> 00:04:40,650
And if it's positive,

53
00:04:40,680 --> 00:04:43,500
it will provide the same value.

54
00:04:44,120 --> 00:04:49,340
Of these functions, I would say the most commonly used now is ReLU.

55
00:04:49,370 --> 00:04:55,130
However, you can see different architectures of neural networks that use different activation

56
00:04:55,130 --> 00:04:55,850
functions.
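
The three activation functions mentioned here can be sketched as follows. The slope of 0.01 for leaky ReLU is a commonly used default and is an assumption here; the lecture only says "a very small value".

```python
import math


def relu(z):
    """Return zero for negative inputs and the input itself otherwise."""
    return max(0.0, z)


def leaky_relu(z, slope=0.01):
    """Like ReLU, but negative inputs leak through scaled by a small slope (assumed 0.01)."""
    return z if z > 0 else slope * z


def sigmoid(z):
    """Squash any input into the open interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # this form avoids overflow for very negative z
    return e / (1.0 + e)


for z in (-2.0, 0.0, 3.0):
    print(z, relu(z), leaky_relu(z), sigmoid(z))
```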

57
00:04:56,820 --> 00:04:58,500
Now cost functions.

58
00:04:58,710 --> 00:05:06,930
Again, as we said, we will have an input layer and then an output layer, and the output layer is made

59
00:05:06,930 --> 00:05:07,890
of values

60
00:05:07,890 --> 00:05:16,380
that we predicted, based on the weights and biases of the whole network, and we need to compare

61
00:05:16,380 --> 00:05:19,320
them with the real values.

62
00:05:19,320 --> 00:05:21,360
These are in our training data.

63
00:05:21,360 --> 00:05:23,580
We always have cases:

64
00:05:23,580 --> 00:05:30,930
we try the input with our current weights and biases, and then we compare the result with the actual

65
00:05:30,930 --> 00:05:32,850
output that we have.

66
00:05:32,850 --> 00:05:35,070
And based on that, we will calculate the loss.

67
00:05:35,070 --> 00:05:41,310
And for the loss, in the simplest way, we can use, for example, the mean squared cost function, which

68
00:05:41,310 --> 00:05:45,420
is the same mean squared error

69
00:05:45,420 --> 00:05:54,630
we already know from previous lectures. It is as simple as taking the value we predicted minus our real

70
00:05:54,630 --> 00:05:55,200
value,

71
00:05:55,200 --> 00:05:57,310
and it's going to be squared and averaged over the examples.

72
00:05:58,170 --> 00:05:59,040
In.

73
00:06:00,380 --> 00:06:06,340
Another one is what we call the cross-entropy cost function.

74
00:06:06,340 --> 00:06:16,300
And it has a different formula, which is based on the idea of comparing the

75
00:06:16,300 --> 00:06:21,220
expected value from our network,

76
00:06:21,220 --> 00:06:27,070
that is, the one we predicted based on our input and the current weights and biases of the network,

77
00:06:27,070 --> 00:06:30,400
with the actual value in the training data.

78
00:06:30,400 --> 00:06:40,060
For example, suppose we have an input of an image of a dog, and the result should be that

79
00:06:40,540 --> 00:06:47,500
a dog is one and a cat is zero. If we get the opposite,

80
00:06:47,500 --> 00:06:55,200
that means the weights have to be updated in order to match this output, which should give us a

81
00:06:55,210 --> 00:06:56,680
dog because the image is a dog.
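
As a rough sketch of the two cost functions, here is mean squared error next to binary cross-entropy. The lecture does not give the cross-entropy formula, so the standard binary form is assumed, and the predictions below are invented for the dog (1) versus cat (0) example.

```python
import math


def mean_squared_error(predicted, actual):
    """Average of the squared differences between predictions and real values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def binary_cross_entropy(predicted, actual, eps=1e-12):
    """Standard binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for p, a in zip(predicted, actual):
        p = min(max(p, eps), 1.0 - eps)
        total += -(a * math.log(p) + (1 - a) * math.log(1.0 - p))
    return total / len(actual)


# Hypothetical labels: first image is a dog (1), second is a cat (0).
actual = [1.0, 0.0]
good_prediction = [0.9, 0.1]   # network is mostly right
bad_prediction = [0.1, 0.9]    # network has it the opposite way round

for name, pred in (("good", good_prediction), ("bad", bad_prediction)):
    print(name, mean_squared_error(pred, actual), binary_cross_entropy(pred, actual))
```

Both losses are small for the good prediction and large for the opposite one, which is the signal used to update the weights.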

82
00:06:58,390 --> 00:07:00,610
Now, how do we do that?

83
00:07:00,610 --> 00:07:08,140
We can understand that in a big network there is a huge number of weights and biases, and, well,

84
00:07:08,140 --> 00:07:09,490
we need to tune them.

85
00:07:09,490 --> 00:07:16,450
And for tuning, we need an optimizer, because of course we could tune them ourselves, by

86
00:07:16,450 --> 00:07:16,930
hand,

87
00:07:16,930 --> 00:07:21,670
but it's not very practical, and I'm not sure if it's even possible.

88
00:07:21,670 --> 00:07:32,410
So to do that, we need an optimizer, which will take these parameters and try to

89
00:07:32,410 --> 00:07:35,290
reduce the loss as much as possible.

90
00:07:36,090 --> 00:07:43,770
One of the earliest and still widely used optimizers is gradient descent.

91
00:07:44,500 --> 00:07:48,880
And gradient descent is about a simple idea.

92
00:07:48,880 --> 00:07:58,540
That is, we have a function with all these parameters, and we need to check how to change these parameters

93
00:07:58,540 --> 00:08:01,330
so that we always reduce the loss.

94
00:08:01,330 --> 00:08:07,260
And as we reduce the loss, we change our step size.

95
00:08:07,270 --> 00:08:15,280
For example, we can start at the beginning with somewhat big step sizes, and then we start going

96
00:08:15,280 --> 00:08:16,810
smaller as we go.

97
00:08:16,840 --> 00:08:26,260
Of course, if we have a bigger step size, we will get to the optimal

98
00:08:26,260 --> 00:08:27,730
value quicker.

99
00:08:27,730 --> 00:08:35,230
However, if the step size is too big, we might never get to the actual

100
00:08:35,230 --> 00:08:40,030
value, because we can go from here to here, and

101
00:08:40,870 --> 00:08:46,130
we will just miss the minimum that we are looking for to reduce the loss.

102
00:08:46,940 --> 00:08:52,280
And we also cannot go very small, because it will simply take too much time.
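
The step-size trade-off can be seen on a toy problem. The sketch below runs gradient descent on a made-up one-parameter loss, (w - 3)^2, whose minimum is at w = 3. The loss, the starting point and the three step sizes are all assumptions chosen for illustration, and the step size is kept fixed during each run for simplicity.

```python
def loss(w):
    """Toy loss with its minimum at w = 3 (hypothetical, for illustration only)."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(start, step_size, steps):
    """Repeatedly move w against the gradient by a fixed step size."""
    w = start
    for _ in range(steps):
        w = w - step_size * gradient(w)
    return w


balanced = gradient_descent(start=0.0, step_size=0.1, steps=50)
too_small = gradient_descent(start=0.0, step_size=0.001, steps=50)
too_big = gradient_descent(start=0.0, step_size=1.1, steps=50)

print("balanced :", balanced, loss(balanced))
print("too small:", too_small, loss(too_small))
print("too big  :", too_big, loss(too_big))
```

With the same number of steps, the balanced run ends very close to 3, the tiny step size has barely moved from the start, and the oversized one jumps back and forth over the minimum and ends up further away than where it began.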

103
00:08:52,280 --> 00:09:02,090
So we always need to balance these two things with the step size. Another widely used

104
00:09:02,090 --> 00:09:13,730
method, which works together with gradient descent, is backpropagation, and backpropagation is based on this idea:

105
00:09:13,730 --> 00:09:19,850
of course, we will have the network with our initial weights and biases, and we start from the output

106
00:09:19,850 --> 00:09:26,180
of this network and go in the opposite direction.

107
00:09:26,180 --> 00:09:29,150
This is why we call it backpropagation.

108
00:09:29,150 --> 00:09:34,700
We start from the back of the network, and we can see the expected value we have

109
00:09:34,700 --> 00:09:43,910
and what the effect of changing these weights is on that expected value. Depending on the difference,

110
00:09:43,910 --> 00:09:46,250
on what we can change,

111
00:09:46,840 --> 00:09:52,090
and on how we can get closer to the output value,

112
00:09:52,120 --> 00:10:00,010
we will change the weights and biases of this network.

113
00:10:00,160 --> 00:10:12,850
And after that we will work out the values of the layer just behind the output layer.

114
00:10:13,000 --> 00:10:20,800
And we keep doing the same to the layer behind it, and the layer behind it, and so on, until we get to

115
00:10:21,110 --> 00:10:23,080
the first layer.

116
00:10:23,080 --> 00:10:26,680
So this is basically the idea of backpropagation.

117
00:10:26,680 --> 00:10:31,780
Of course, we have to do it for a lot of data, or batches of data.

118
00:10:31,780 --> 00:10:37,630
And with these batches of data, for every batch, depending on the loss value,

119
00:10:37,660 --> 00:10:42,460
we will update the weights a little bit more, a little bit less, and so on.

120
00:10:42,490 --> 00:10:49,640
Overall, we should have a model that will work well for that batch, and then we will get another

121
00:10:49,640 --> 00:10:54,500
batch that we test, and then we update the weights and biases, and so on and so on.

122
00:10:54,500 --> 00:11:00,560
So of course we are not going to train on the data as a whole; we will split the data into batches,

123
00:11:00,560 --> 00:11:08,870
and these batches will go into the network and it gets trained, and with time we will keep getting closer

124
00:11:08,870 --> 00:11:12,770
and closer to the lowest possible loss.
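
To make the batch loop concrete, here is a minimal sketch that trains a single sigmoid neuron, not a full multi-layer network, on a tiny invented dataset. The data, batch size, step size and number of epochs are all assumptions. The update uses the known result that, for a sigmoid output with cross-entropy loss, the gradient with respect to the neuron's weighted sum is simply prediction minus target.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(data, batch_size=2, step_size=0.5, epochs=200):
    """Mini-batch gradient descent for one neuron: prediction = sigmoid(w * x + b)."""
    w, b = 0.0, 0.0  # initial weight and bias
    for _ in range(epochs):
        # Split the data into batches and update after each batch.
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w = 0.0
            grad_b = 0.0
            for x, target in batch:
                prediction = sigmoid(w * x + b)
                error = prediction - target  # gradient of the loss w.r.t. the weighted sum
                grad_w += error * x          # chain rule back to the weight
                grad_b += error              # chain rule back to the bias
            w -= step_size * grad_w / len(batch)
            b -= step_size * grad_b / len(batch)
    return w, b


# Hypothetical training data: negative inputs are class 0, positive inputs are class 1.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = train(data)
predictions = [round(sigmoid(w * x + b)) for x, _ in data]
print(w, b, predictions)
```

In a real network the same error signal is passed backwards through every layer, which is the part that backpropagation automates.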
