1
00:00:11,680 --> 00:00:16,960
In this section of the course, we are going to cover one very important topic in modern deep learning,

2
00:00:17,050 --> 00:00:19,520
which is called transfer learning.

3
00:00:19,540 --> 00:00:24,220
This lecture will introduce you to both the intuition and implementation of transfer learning.

4
00:00:24,310 --> 00:00:28,970
So by the end of this, you'll be able to understand the subsequent code.

5
00:00:29,170 --> 00:00:34,270
The nice thing about transfer learning is that it allows you to obtain very good, often state-of-the-art,

6
00:00:34,270 --> 00:00:38,050
results with less effort and fewer resources spent.

7
00:00:43,190 --> 00:00:46,400
The intuition behind transfer learning is this.

8
00:00:46,400 --> 00:00:52,070
Let's start with the basic observation that in a CNN the features that are learned at each layer are

9
00:00:52,070 --> 00:00:53,320
hierarchical.

10
00:00:53,630 --> 00:00:59,180
In the beginning layers, the CNN only learns some basic lines and strokes. In the later layers,

11
00:00:59,210 --> 00:01:05,540
the features get progressively more complex, from lines and strokes to parts of faces, eventually to full

12
00:01:05,540 --> 00:01:06,910
faces.

13
00:01:06,920 --> 00:01:10,750
This is of course if you train your CNN on images of faces.

14
00:01:10,970 --> 00:01:14,390
You can train your CNN on other types of images as well,

15
00:01:14,390 --> 00:01:15,050
of course.

16
00:01:20,180 --> 00:01:22,520
But here's another basic observation.

17
00:01:22,760 --> 00:01:29,420
We recognize that the initial layers learn to recognize just simple lines and strokes. Is it plausible

18
00:01:29,420 --> 00:01:35,540
that other tasks that don't involve looking at images of faces would still find line and stroke features

19
00:01:35,570 --> 00:01:36,680
useful?

20
00:01:36,680 --> 00:01:42,050
The answer is yes, because probably all pictures will have lines and strokes.

21
00:01:42,050 --> 00:01:47,000
The difference between a car and a face is that the way you put together those lines and strokes

22
00:01:47,000 --> 00:01:49,460
will be different. In this way,

23
00:01:49,460 --> 00:01:56,450
we can imagine that the first few layers of a CNN might be useful for a wide variety of tasks, from facial

24
00:01:56,450 --> 00:02:07,100
recognition to vehicle classification to recognizing anomalies in medical images.

25
00:02:07,110 --> 00:02:10,140
This is the basic idea behind transfer learning.

26
00:02:10,290 --> 00:02:16,830
It's the idea that the features found for one task could be useful for some other task. Transfer

27
00:02:16,830 --> 00:02:21,390
learning really took off in the field of computer vision, so that's what we are going to focus on in

28
00:02:21,390 --> 00:02:22,200
this course.

29
00:02:23,510 --> 00:02:28,420
Probably the most widely known large-scale image dataset is called ImageNet.

30
00:02:29,090 --> 00:02:35,530
This is a dataset that contains millions of images and 1000 categories, meaning a classifier trained on it can recognize 1000

31
00:02:35,540 --> 00:02:36,830
different objects.

32
00:02:37,670 --> 00:02:44,760
Every year there is a contest to see who can build the best image classifier on this dataset.

33
00:02:44,770 --> 00:02:50,110
The nice thing about the ImageNet dataset is that, because it's such a large dataset containing so many

34
00:02:50,110 --> 00:02:55,900
different kinds of images, we can transfer the features learned from this dataset to a wide variety of

35
00:02:55,900 --> 00:03:00,800
tasks. Suppose we wanted to build a dog versus cat classifier.

36
00:03:01,060 --> 00:03:06,390
That's easy, because a network trained on ImageNet already knows how to classify dogs and cats.

37
00:03:06,400 --> 00:03:09,970
Suppose we wanted to build a car versus truck classifier.

38
00:03:09,970 --> 00:03:17,030
That's also easy, because a network trained on ImageNet already knows how to classify cars and trucks. It is even possible for

39
00:03:17,030 --> 00:03:19,220
us to use the same features for things

40
00:03:19,220 --> 00:03:22,670
ImageNet has not seen before, like, say, images from a microscope.

41
00:03:28,110 --> 00:03:28,400
All right.

42
00:03:28,410 --> 00:03:30,480
So that's just the high level intuition.

43
00:03:30,480 --> 00:03:34,860
Let's go deeper and discuss how this will actually work conceptually.

44
00:03:34,990 --> 00:03:41,080
First, we recognize that it's not feasible for us to train CNNs on ImageNet ourselves, although

45
00:03:41,080 --> 00:03:45,760
it's a lot less expensive and time-consuming these days compared to the earlier days of deep learning.

46
00:03:46,780 --> 00:03:52,860
In the old days, it was common for CNN training to take on the order of days, weeks, or even months.

47
00:03:52,870 --> 00:03:58,720
Nowadays we have distributed training, multi-GPU clusters, and so forth, but at the same time it's still

48
00:03:58,720 --> 00:04:00,730
not really something you want to do on your own.

49
00:04:01,660 --> 00:04:07,330
Luckily, many of the research groups whose CNNs have won the ImageNet contest in the past have released their

50
00:04:07,330 --> 00:04:12,580
pre-trained CNNs to the public, so there's no need for you to do any hyperparameter tuning,

51
00:04:12,580 --> 00:04:15,100
try out different learning rates, and so forth.

52
00:04:15,160 --> 00:04:21,520
All that has been done for you, and the results are open source. Doubly awesome is that these pre-trained

53
00:04:21,520 --> 00:04:24,410
networks are already included in PyTorch.

54
00:04:24,430 --> 00:04:26,710
We'll discuss what they are and how to get them later.

55
00:04:31,450 --> 00:04:37,870
Let's say you found a super powerful, state-of-the-art CNN trained on the ImageNet dataset, and now you

56
00:04:37,870 --> 00:04:41,240
want to use it to do transfer learning for your specific task.

57
00:04:41,380 --> 00:04:43,140
How can you do that?

58
00:04:43,150 --> 00:04:44,970
Well, let's start with a picture.

59
00:04:45,040 --> 00:04:47,170
This is a picture of a typical CNN.

60
00:04:47,500 --> 00:04:53,110
It's a series of convolutions and poolings followed by a series of dense layers.

61
00:04:53,110 --> 00:04:59,320
One perspective on the CNN is that it's made up of two parts. Part 1, which involves convolutions and

62
00:04:59,320 --> 00:05:03,620
pooling, makes up a feature extractor or a feature transformer.

63
00:05:03,790 --> 00:05:10,170
It takes your raw image and converts it into a feature vector, which is a useful representation for

64
00:05:10,170 --> 00:05:13,030
ANNs and other machine learning models.

65
00:05:13,050 --> 00:05:19,000
Part 2 is just a feedforward neural network, or ANN, which we learned about earlier in this course.

66
00:05:19,140 --> 00:05:27,910
We can think of Part 1 as the body and Part 2 as the head.

67
00:05:27,950 --> 00:05:30,080
So here's how transfer learning works.

68
00:05:30,350 --> 00:05:36,680
What we do is we chop off the head, keep the body, which is a useful feature transformer, and then attach

69
00:05:36,680 --> 00:05:38,070
our own head.

70
00:05:38,090 --> 00:05:44,260
This head can be used for whatever task we want to perform, for example, identifying cars and trucks.

71
00:05:44,510 --> 00:05:50,000
In that case, we might use a logistic regression with a sigmoid output since, as you recall, the sigmoid

72
00:05:50,000 --> 00:05:56,840
activation is the appropriate activation function for binary classification. Optionally, you could add

73
00:05:56,840 --> 00:05:59,060
more layers to make it a feedforward ANN.
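
The body-and-head view described here can be sketched in PyTorch roughly as follows. This is only an illustration under stated assumptions: the small `body` below is a hypothetical stand-in with random weights, not a real pre-trained network, and the layer sizes, the variable names, and the fake image batch are all made up for the example.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for Part 1 (the body): convolutions and pooling.
# In real transfer learning this would be a network pre-trained on ImageNet.
body = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # collapse each feature map to a single number
    nn.Flatten(),             # result: one feature vector of length 16 per image
)

# Our own Part 2 (the head): logistic regression with a sigmoid output,
# e.g. for a binary task such as car vs. truck.
head = nn.Sequential(nn.Linear(16, 1), nn.Sigmoid())

# Attach the new head to the body.
model = nn.Sequential(body, head)

images = torch.randn(4, 3, 32, 32)  # fake batch of 4 RGB images (assumed size)
features = body(images)             # feature vectors, shape (4, 16)
probs = model(images)               # predicted probabilities, shape (4, 1)
print(features.shape, probs.shape)
```

Adding more `nn.Linear` layers with activations in front of the final sigmoid layer would turn the head into a small feedforward ANN instead of a plain logistic regression.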

74
00:06:04,210 --> 00:06:07,470
Here's another important detail about transfer learning.

75
00:06:07,540 --> 00:06:11,800
There are a few approaches to this and I'm going to teach you about the most important and the most

76
00:06:11,800 --> 00:06:13,400
basic version.

77
00:06:13,480 --> 00:06:18,530
The idea is we're going to freeze the weights of the body and only train the head.

78
00:06:18,550 --> 00:06:21,820
In other words only train the logistic regression layer.

79
00:06:22,210 --> 00:06:28,600
You can imagine that if we used a gigantic pre-trained CNN with 100 layers, it would train very slowly

80
00:06:28,660 --> 00:06:32,700
because we would have to update all the weights of a 100-layer CNN.

81
00:06:33,160 --> 00:06:39,160
But by freezing the body we can simply ignore those weights and only train the weights of our own linear

82
00:06:39,160 --> 00:06:40,530
model.

83
00:06:40,540 --> 00:06:45,610
The important point is we don't need to train the convolutional layers because they have already been

84
00:06:45,610 --> 00:06:51,400
trained on a data set much larger than anything we could have gathered ourselves.

85
00:06:51,490 --> 00:06:55,170
The weights we already have for those convolutions are very good.

86
00:06:55,170 --> 00:06:57,760
There is no need to try to make them better.

87
00:06:57,760 --> 00:07:02,380
Instead, we can assume that they give us good features and worry only about training the top.
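
A minimal sketch of "freeze the body, train only the head" in PyTorch might look like this. The assumptions: `body` is again a tiny hypothetical stand-in rather than a real pre-trained CNN, the data is random, and the learning rate and number of steps are arbitrary. The head here outputs a raw score and `nn.BCEWithLogitsLoss` applies the sigmoid inside the loss, which is a substitution for an explicit sigmoid layer followed by binary cross-entropy.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the pre-trained body (feature transformer).
body = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),  # feature vector of length 8 per image
)
# Logistic regression head; the sigmoid is applied inside the loss below.
head = nn.Linear(8, 1)

# Freeze the body: no gradients are computed or stored for its weights.
for p in body.parameters():
    p.requires_grad = False

model = nn.Sequential(body, head)

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()

images = torch.randn(8, 3, 32, 32)            # fake images (assumed size)
labels = torch.randint(0, 2, (8, 1)).float()  # fake binary labels

# Snapshots so we can check afterwards which weights actually moved.
body_before = [p.detach().clone() for p in body.parameters()]
head_before = head.weight.detach().clone()

for _ in range(3):  # a few training steps
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

body_unchanged = all(
    torch.equal(before, after) for before, after in zip(body_before, body.parameters())
)
head_changed = not torch.equal(head_before, head.weight)
print(body_unchanged, head_changed)
```

Because the body's weights never receive gradients, each training step only has to update the handful of weights in the linear head.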

88
00:07:07,650 --> 00:07:12,330
One major advantage of using transfer learning is that you don't need a lot of data to build a good

89
00:07:12,330 --> 00:07:13,430
model.

90
00:07:13,590 --> 00:07:18,990
You might recognize this graph which basically says that the more data you have the better your deep

91
00:07:18,990 --> 00:07:20,670
learning model is.

92
00:07:20,700 --> 00:07:26,110
This is opposed to classic machine learning methods which tend to plateau after a certain point.

93
00:07:26,490 --> 00:07:31,920
So you might think that in order to build a good deep learning model you must have a lot of data, which

94
00:07:31,920 --> 00:07:37,530
will in turn make your neural network take a long time to train. But with transfer learning,

95
00:07:37,530 --> 00:07:39,590
this work has been done for us.

96
00:07:39,660 --> 00:07:45,570
We don't need a lot of data because the earlier features were already trained on a lot of data, more

97
00:07:45,570 --> 00:07:51,150
data than we could have practically handled ourselves. And so transfer learning allows us to get the

98
00:07:51,150 --> 00:07:55,490
benefits of deep learning without needing to have a ton of data ourselves.

99
00:07:55,530 --> 00:07:59,710
And on top of that it means we can train our model a lot faster.

100
00:07:59,790 --> 00:08:05,280
Not only are we just training a logistic regression, but we also have a smaller dataset compared to what we

101
00:08:05,280 --> 00:08:07,380
would need if we trained our model from scratch.
