1
00:00:01,050 --> 00:00:01,220
Hey.

2
00:00:01,380 --> 00:00:06,990
Welcome back to the course. In this section, we'll take a look at the all-important technique of transfer

3
00:00:06,990 --> 00:00:10,700
learning and why you don't always need to train your network from scratch.

4
00:00:10,800 --> 00:00:11,760
So let's begin.

5
00:00:12,330 --> 00:00:13,770
So what is transfer learning?

6
00:00:13,780 --> 00:00:15,180
Well, in practice,

7
00:00:15,330 --> 00:00:18,480
very few people train an entire CNN from scratch.

8
00:00:18,840 --> 00:00:24,390
That means with random initialization. We don't always need to start with random weights when training,

9
00:00:24,810 --> 00:00:31,050
because honestly, it's relatively rare to find a dataset that is of sufficient size to train these very

10
00:00:31,050 --> 00:00:32,190
big networks.

11
00:00:33,060 --> 00:00:39,390
So instead, it's common to take a pre-trained network that was trained on a very large

12
00:00:39,390 --> 00:00:45,240
dataset, something like ImageNet, which contains 1.2 million images in a thousand different categories,

13
00:00:45,870 --> 00:00:53,400
and then use the weights from ImageNet as the initial weights for our own network.

14
00:00:53,940 --> 00:00:58,470
So we take a network that has been trained on ImageNet and then we can do several things.

15
00:00:58,470 --> 00:01:05,010
We can apply transfer learning, we can apply fine-tuning, and we can use it as a fixed feature

16
00:01:05,010 --> 00:01:05,640
extractor.

17
00:01:06,090 --> 00:01:08,560
Those are the three major scenarios of transfer learning.

18
00:01:08,580 --> 00:01:09,810
So let's take a look at this.

19
00:01:10,410 --> 00:01:14,310
So let's look at the rationale behind transfer learning and the way it works.

20
00:01:14,880 --> 00:01:20,970
So training extensive models on vast image datasets like ImageNet can take weeks.

21
00:01:21,210 --> 00:01:21,570
OK?

22
00:01:22,290 --> 00:01:27,300
However, a model trained on so much data is going to have some very useful embeddings.

23
00:01:27,990 --> 00:01:32,970
That is, the features those filters learn are going to be applicable to many different

24
00:01:32,970 --> 00:01:37,320
domains, because imagine you've trained on a thousand different image classes.

25
00:01:37,650 --> 00:01:42,990
The features that have been learned on those classes can be transferred to

26
00:01:42,990 --> 00:01:43,680
other tasks.

27
00:01:44,400 --> 00:01:49,050
For example, ImageNet has categories like tigers, lions and horses.

28
00:01:49,470 --> 00:01:55,170
So it can be useful, given that it can detect those features here and in many other mammals as

29
00:01:55,170 --> 00:01:55,440
well.

30
00:01:56,040 --> 00:01:59,100
It can be used to detect and distinguish between other four-legged mammals

31
00:01:59,100 --> 00:02:04,450
if you have a dataset with, say, different types of deer or buffalo as well.

32
00:02:04,500 --> 00:02:05,520
So that's pretty cool.

33
00:02:06,570 --> 00:02:13,680
So transfer learning is the concept where we utilize already-trained models in other domains. So

34
00:02:13,680 --> 00:02:20,520
effectively, a model that has been trained on ImageNet is now applicable to achieve very good accuracy

35
00:02:20,880 --> 00:02:24,360
and much faster training times in other domains with other datasets.

36
00:02:25,260 --> 00:02:27,480
So there are two major types of transfer learning.

37
00:02:27,510 --> 00:02:29,190
The first is feature extraction.

38
00:02:29,670 --> 00:02:33,000
The second is called fine tuning, and they're both very similar.

39
00:02:33,030 --> 00:02:37,830
However, there's a distinct difference which we'll discuss, and a lot of people actually confuse them

40
00:02:37,830 --> 00:02:38,510
quite a bit.

41
00:02:38,520 --> 00:02:44,910
And it's not that important to know the exact difference between them, because most times researchers

42
00:02:44,910 --> 00:02:48,920
are effectively doing the same thing, just in different ways.

43
00:02:48,930 --> 00:02:50,070
But I'll explain it shortly.

44
00:02:50,370 --> 00:02:56,070
So how do we use a pre-trained CNN as a feature extractor?

45
00:02:56,550 --> 00:03:04,110
Firstly, let's imagine this is a very deep CNN, something like a ResNet-101, and it's

46
00:03:04,290 --> 00:03:05,610
been trained on ImageNet.

47
00:03:05,910 --> 00:03:09,300
So these weights on these filters here are quite valuable.

48
00:03:09,570 --> 00:03:14,910
These filters have learned so many different patterns: different sorts of detectors, color

49
00:03:14,910 --> 00:03:16,860
blobs, all sorts of patterns.

50
00:03:16,860 --> 00:03:19,080
It's learned on a thousand image classes.

51
00:03:19,530 --> 00:03:23,010
So these are quite valuable to us, these layers here.

52
00:03:23,250 --> 00:03:26,370
So what do we do when we're using this in transfer learning?

53
00:03:26,800 --> 00:03:32,730
Well, since these features, these weights, are so valuable, we freeze them, freezing them

54
00:03:32,730 --> 00:03:35,850
meaning that we don't train them anymore.

55
00:03:36,210 --> 00:03:37,650
We keep those exact weights.

56
00:03:38,160 --> 00:03:38,580
All right.

57
00:03:38,880 --> 00:03:41,850
But what we do train is the fully connected layers here.

58
00:03:42,480 --> 00:03:46,980
And that's because we're using this ImageNet-trained network on our new classes.

59
00:03:47,400 --> 00:03:54,180
So let's imagine that we've taken a pre-trained ImageNet network that outputs a thousand classes

60
00:03:54,690 --> 00:03:56,790
and we replace the head of the network,

61
00:03:56,790 --> 00:04:02,660
that's the fully connected layers here, and change the outputs from a thousand to just three: panda,

62
00:04:02,670 --> 00:04:03,390
cat and dog.

63
00:04:03,990 --> 00:04:09,480
And we use the weights from ImageNet as our feature extractors.

64
00:04:10,050 --> 00:04:14,430
And the only thing that's different now is the fully connected layer, the head, which we've changed.

65
00:04:14,430 --> 00:04:18,540
And now, let's go over the steps of transfer learning by feature extraction.

66
00:04:19,110 --> 00:04:23,320
So imagine this is our network here, where we have a pre-trained model.

67
00:04:23,340 --> 00:04:26,850
These are the convolutional layers here, and this is the fully connected head.

68
00:04:26,970 --> 00:04:27,690
That's trainable.

69
00:04:27,840 --> 00:04:28,630
That's what we do.

70
00:04:28,650 --> 00:04:29,970
So let's go over these steps.

71
00:04:30,390 --> 00:04:36,540
So firstly, as I said, we freeze the bottom layers of the pre-trained network that we have, that's the convolutional

72
00:04:36,540 --> 00:04:40,140
layers in most cases. Then we replace the top half,

73
00:04:40,440 --> 00:04:45,340
well, not actually half, but the top piece of the network, with our own top.

74
00:04:45,360 --> 00:04:52,740
So we just create a few layers of fully connected nodes, set our own number of nodes, and definitely

75
00:04:52,740 --> 00:04:57,390
set the number of output nodes to be the number of classes in our new dataset.

76
00:04:57,570 --> 00:04:59,910
So if you're training on a dataset that

77
00:04:59,970 --> 00:05:05,820
has cats, dogs and pandas, we set the output to three, as opposed to what it would have been before

78
00:05:05,820 --> 00:05:07,350
on ImageNet, which is a thousand.

79
00:05:07,950 --> 00:05:12,310
And then we just train the model, train this model here with these weights being frozen,

80
00:05:12,770 --> 00:05:14,880
and we're just training the top layers alone.
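
The steps just described (freeze the convolutional layers, swap in a new head with one output per class, then train only the head) can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the course's own code: `TinyCNN` is a hypothetical stand-in for a real pre-trained network, its weights are random rather than ImageNet weights, and the batch of images is fake data.

```python
import torch
import torch.nn as nn


# Hypothetical stand-in for a pre-trained CNN: a small convolutional
# "backbone" plus a 1000-way head, mimicking an ImageNet classifier.
# In practice the weights would be loaded from a real pre-trained model.
class TinyCNN(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16, num_classes)

    def forward(self, x):
        return self.head(self.features(x))


torch.manual_seed(0)
model = TinyCNN()

# Step 1: freeze the convolutional layers so they are not trained.
for p in model.features.parameters():
    p.requires_grad = False

# Step 2: replace the head; 3 outputs for cat, dog and panda.
# The hidden size of 32 is an arbitrary choice for illustration.
model.head = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))

# Step 3: give the optimizer only the parameters that are still trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Keep copies so we can check what one training step actually changed.
frozen_before = [p.detach().clone() for p in model.features.parameters()]
head_before = [p.detach().clone() for p in model.head.parameters()]

images = torch.randn(4, 3, 32, 32)   # fake batch, for illustration only
labels = torch.tensor([0, 1, 2, 0])
optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()

frozen_unchanged = all(
    torch.equal(a, b) for a, b in zip(frozen_before, model.features.parameters()))
head_changed = any(
    not torch.equal(a, b) for a, b in zip(head_before, model.head.parameters()))
```

After one step, the frozen convolutional weights are identical to what they were, and only the new head has moved.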

81
00:05:15,420 --> 00:05:19,410
So now let's take a look at fine-tuning, the other method of transfer learning.

82
00:05:19,830 --> 00:05:24,900
So immediately from this diagram, you can tell there's something a bit different now. You can tell what

83
00:05:24,900 --> 00:05:30,030
we're doing here is that we're keeping some of the weights here frozen and making some of them trainable

84
00:05:30,030 --> 00:05:30,300
here.

85
00:05:30,450 --> 00:05:31,230
So let's take a look.

86
00:05:31,260 --> 00:05:32,610
That's exactly what we're doing.

87
00:05:33,030 --> 00:05:39,720
So the intuition behind fine-tuning here is that instead of keeping all of the

88
00:05:39,720 --> 00:05:49,040
convolutional filters frozen, we leave the lower filters frozen, but the more upper-end convolutional

89
00:05:49,040 --> 00:05:49,440
filters,

90
00:05:49,440 --> 00:05:52,290
we unfreeze those and we train those as well.

91
00:05:52,710 --> 00:05:58,200
That's basically what fine-tuning is, and the reason we call it fine-tuning is that we also

92
00:05:58,200 --> 00:06:01,920
use very small learning rates here because we don't want to make big changes.

93
00:06:02,400 --> 00:06:08,880
So like I said, the intuition behind this is that these lower filters here, early in the network, they

94
00:06:08,880 --> 00:06:14,890
tend to learn generic features, things like edge detectors, simple patterns and all of those things.

95
00:06:14,910 --> 00:06:18,180
However, the higher layers learn more intricate patterns.

96
00:06:18,480 --> 00:06:24,480
So you're going to want to actually unfreeze those so that you can actually generalize and learn

97
00:06:24,480 --> 00:06:28,710
the stuff in your new dataset so that you get a very good performing network.
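
As a rough PyTorch sketch of fine-tuning as described here: the lower convolutional layers stay frozen, the upper ones are unfrozen, and the unfrozen pre-trained layers get a very small learning rate. The network below is a hypothetical stand-in with random weights, the data is fake, and the learning-rate values (1e-4 and 1e-2) are illustrative assumptions, not values from the lecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained CNN, split into the parts the
# lecture describes: lower conv layers, upper conv layers, and a new head.
lower = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU())
upper = nn.Sequential(
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(16, 3)   # 3 classes, e.g. cat, dog, panda
model = nn.Sequential(lower, upper, head)

# Lower layers learn generic features (edges, simple patterns): keep frozen.
for p in lower.parameters():
    p.requires_grad = False

# Upper layers are trained with a very small learning rate so the
# pre-trained weights only change a little; the new head may move faster.
optimizer = torch.optim.SGD(
    [
        {"params": upper.parameters()},             # uses the default lr below
        {"params": head.parameters(), "lr": 1e-2},  # illustrative value
    ],
    lr=1e-4,        # illustrative "very small" learning rate
    momentum=0.9,
)
loss_fn = nn.CrossEntropyLoss()

lower_before = [p.detach().clone() for p in lower.parameters()]
upper_before = [p.detach().clone() for p in upper.parameters()]

images = torch.randn(4, 3, 32, 32)   # fake batch, for illustration only
labels = torch.tensor([0, 1, 2, 0])
optimizer.zero_grad()
loss_fn(model(images), labels).backward()
optimizer.step()

lower_unchanged = all(
    torch.equal(a, b) for a, b in zip(lower_before, lower.parameters()))
upper_changed = any(
    not torch.equal(a, b) for a, b in zip(upper_before, upper.parameters()))
```

The only difference from plain feature extraction is that some of the convolutional layers are also handed to the optimizer.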

98
00:06:29,280 --> 00:06:31,140
So when do we use transfer learning?

99
00:06:31,140 --> 00:06:34,290
So when is the ideal time to use transfer learning?

100
00:06:34,800 --> 00:06:41,700
Well, the ideal scenario is if you have a very large new dataset and it's somewhat similar to the original

101
00:06:41,700 --> 00:06:44,220
dataset, that's ImageNet in most cases here.

102
00:06:44,760 --> 00:06:48,450
And in that case, your model should not overfit and you should get pretty good performance.

103
00:06:48,960 --> 00:06:55,540
Less ideal but still recommended is if your new dataset is large but slightly different, or a

104
00:06:55,540 --> 00:06:56,250
bit different.

105
00:06:56,610 --> 00:07:02,880
In that case, I would recommend you use fine-tuning as well, instead of the feature extraction method

106
00:07:02,880 --> 00:07:06,120
of transfer learning. Next,

107
00:07:06,360 --> 00:07:11,250
when your new dataset is small, even if it's similar to the ImageNet dataset or whatever pre-training dataset

108
00:07:11,250 --> 00:07:11,770
you use,

109
00:07:12,300 --> 00:07:15,420
it's not always ideal for transfer learning or fine-tuning.

110
00:07:15,810 --> 00:07:23,790
However, a useful idea is often to just extract the CNN outputs and then, instead of making the head

111
00:07:23,790 --> 00:07:29,550
a neural network, the fully connected layers, use a linear classifier, something like logistic

112
00:07:29,550 --> 00:07:33,090
regression or an SVM, which we will be doing in one of our projects shortly.

113
00:07:33,630 --> 00:07:38,160
Use that so your model gets better generalization performance.

114
00:07:38,640 --> 00:07:44,040
So some more advice here: always use small learning rates, especially when doing fine-tuning.

115
00:07:44,460 --> 00:07:50,100
You don't want to change those valuable ImageNet pre-trained weights too much because, I mean, your

116
00:07:50,100 --> 00:07:51,510
GPUs, or someone else's

117
00:07:51,510 --> 00:07:56,400
GPUs, have spent so much time getting to those good weights, so don't mess that up.

118
00:07:56,700 --> 00:08:03,030
So use small learning rates so that you don't change those weights too much. And also, due

119
00:08:03,030 --> 00:08:04,110
to parameter sharing,

120
00:08:04,890 --> 00:08:10,740
you can actually use different input size images when using a pre-trained network, and it doesn't affect your model

121
00:08:10,740 --> 00:08:11,390
very much.

122
00:08:11,460 --> 00:08:12,840
In fact, it doesn't affect it at all.
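
To illustrate the point about input sizes: convolutional filters are shared across every image position, so the same weights can be applied to images of different sizes. The sketch below checks only that the shapes work out. It assumes a hypothetical backbone that ends in adaptive average pooling (a network that flattens a fixed-size feature map into its fully connected layers would not accept this), and it says nothing about how accuracy changes with input size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical convolutional backbone with random weights. The adaptive
# pooling layer reduces any spatial size to 1x1, so the output is always
# a vector of 8 features per image.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

with torch.no_grad():
    small = backbone(torch.randn(1, 3, 64, 64))     # 64x64 input
    large = backbone(torch.randn(1, 3, 224, 224))   # 224x224 input

# Same filters, different input sizes, same feature vector length.
same_feature_length = small.shape == large.shape
```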

123
00:08:13,500 --> 00:08:19,030
So what we'll do now, we'll actually go back into the Colab code, and I'll teach you how to

124
00:08:19,030 --> 00:08:23,390
do transfer learning and fine-tuning in both Keras and PyTorch.

125
00:08:23,460 --> 00:08:24,880
So stay tuned for that.
