1
00:00:11,720 --> 00:00:16,550
In this lecture, I want to discuss two different approaches to transfer learning, both of which we will

2
00:00:16,550 --> 00:00:19,600
cover in the subsequent coding lectures.

3
00:00:19,610 --> 00:00:25,260
The main issue I want to cover is this: imagine that the body of our neural network is very large.

4
00:00:25,280 --> 00:00:32,030
Let's say hundreds of layers. The head, as discussed previously, can just be a single logistic regression

5
00:00:32,060 --> 00:00:33,880
dense layer.

6
00:00:33,920 --> 00:00:39,170
The issue is, even though we are not training any of the weights in the body, it still takes time to compute

7
00:00:39,230 --> 00:00:41,030
an output prediction.

8
00:00:41,030 --> 00:00:45,290
Just imagine a really big, long nested function. At each of these layers,

9
00:00:45,290 --> 00:00:51,080
we have to do a convolution, which is faster than matrix multiplication but can still take a non-trivial

10
00:00:51,080 --> 00:00:51,830
amount of time.

11
00:00:56,970 --> 00:01:01,860
Imagine this computation in two parts. In order to calculate the output prediction,

12
00:01:01,890 --> 00:01:06,680
we first calculate a feature vector Z using the body of the neural network.

13
00:01:06,990 --> 00:01:13,140
Then, in order to calculate the output, we take Z and pass it through a logistic regression.

14
00:01:13,140 --> 00:01:18,430
The issue is that the first part, calculating Z, is what takes a long time.

15
00:01:18,480 --> 00:01:22,770
Now consider that our gradient descent loop works as follows.

16
00:01:22,770 --> 00:01:25,730
We loop through each batch of data. For each batch,

17
00:01:25,740 --> 00:01:27,910
we calculate the output prediction Y.

18
00:01:28,860 --> 00:01:34,100
Then we calculate the gradient of the error with respect to the logistic regression parameters W and

19
00:01:34,100 --> 00:01:40,380
B. Then we update W and B using those gradients.

20
00:01:40,740 --> 00:01:47,700
The thing we want to think about is this: does the calculation of Z ever actually change? Since Z is

21
00:01:47,700 --> 00:01:49,550
the output of the VGG network

22
00:01:49,560 --> 00:01:55,020
after passing in our input data, it doesn't make sense to put this inside our loop, because it's going

23
00:01:55,020 --> 00:01:56,990
to be the same every time.

24
00:01:57,000 --> 00:02:04,680
Remember, the VGG weights are not going to be trained.

25
00:02:04,800 --> 00:02:06,990
So here's what I propose.

26
00:02:07,020 --> 00:02:09,870
Remember my rule: all data is the same.

27
00:02:09,930 --> 00:02:16,110
The idea is, before we even begin training, let's convert our image data into a tabular matrix of

28
00:02:16,110 --> 00:02:23,520
feature vectors Z. Then all we need to do is run logistic regression on Z without ever looking at the

29
00:02:23,520 --> 00:02:25,500
VGG network again.

30
00:02:25,950 --> 00:02:31,290
By doing this, we avoid repeatedly passing our data through the pre-trained network, which can take a

31
00:02:31,290 --> 00:02:38,880
lot of time if it has a lot of parameters.
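
As a rough sketch of this idea, the example below precomputes the feature matrix Z once and then trains only the logistic regression head on it. It is pure Python on toy data; `make_frozen_body` is a hypothetical stand-in for a pre-trained network like VGG (fixed random ReLU layers), and the data, labels, layer sizes and learning rate are all made-up assumptions, not anything taken from the course code.

```python
import math
import random

random.seed(0)

def make_frozen_body(n_in, n_hidden, n_layers):
    """Hypothetical stand-in for a pre-trained body: fixed random ReLU layers."""
    layers, size = [], n_in
    for _ in range(n_layers):
        scale = math.sqrt(2.0 / size)
        layers.append([[random.gauss(0.0, scale) for _ in range(size)]
                       for _ in range(n_hidden)])
        size = n_hidden

    def body(x):
        # One nested function call per layer; the weights are never updated.
        for weights in layers:
            x = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]
        return x

    return body

def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def predict(z, w, b):
    return sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)

def train_head(Z, Y, epochs=50, batch_size=16, lr=0.05):
    """Mini-batch gradient descent on the logistic regression parameters only."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        for start in range(0, len(Z), batch_size):
            zb, yb = Z[start:start + batch_size], Y[start:start + batch_size]
            grad_w, grad_b = [0.0] * len(w), 0.0
            for z, y in zip(zb, yb):
                err = predict(z, w, b) - y  # gradient of cross-entropy w.r.t. the logit
                grad_w = [g + err * zi for g, zi in zip(grad_w, z)]
                grad_b += err
            w = [wi - lr * g / len(zb) for wi, g in zip(w, grad_w)]
            b -= lr * grad_b / len(zb)
    return w, b

# Toy data: 200 "images" flattened to 8 numbers, with a made-up binary label.
X = [[random.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(200)]
Y = [1.0 if x[0] + x[1] > 0 else 0.0 for x in X]

body = make_frozen_body(n_in=8, n_hidden=16, n_layers=3)

# The expensive step runs exactly once per sample, before training starts.
Z = [body(x) for x in X]

w, b = train_head(Z, Y)
accuracy = sum((predict(z, w, b) > 0.5) == (y == 1.0) for z, y in zip(Z, Y)) / len(Y)
print(f"training accuracy of the head on precomputed features: {accuracy:.2f}")
```

Note that `body` is never called inside `train_head`: however many epochs we run, each sample passes through the body only once.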

32
00:02:38,910 --> 00:02:41,690
All right, so there is a problem with what I just proposed.

33
00:02:41,700 --> 00:02:42,210
What is it?

34
00:02:43,110 --> 00:02:48,900
Well, remember that when we create our image data generator, it gives us the option to do data augmentation.

35
00:02:49,800 --> 00:02:51,310
With data augmentation,

36
00:02:51,360 --> 00:02:57,510
every time we loop over the dataset, the generator modifies the original images just a little bit to

37
00:02:57,510 --> 00:03:00,420
help our neural network generalize better.

38
00:03:00,420 --> 00:03:06,360
Of course, if X is different on every iteration of the loop, then Z will also be different on every iteration

39
00:03:06,360 --> 00:03:07,840
of the loop.

40
00:03:07,920 --> 00:03:12,720
In this case, we cannot precompute Z before the training loop begins.

41
00:03:12,780 --> 00:03:17,580
The only time it makes sense to do that is if we don't care about doing data augmentation.
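
To make this concrete, here is a small sketch of why augmentation rules out precomputing. Both `frozen_body` and `augment` are hypothetical toy functions chosen only for illustration (a fixed deterministic map, and a random flip plus pixel jitter); they are not the real network or the real image data generator.

```python
import random

random.seed(1)

def frozen_body(x):
    """Hypothetical stand-in for the frozen body: any fixed, deterministic function of x."""
    return [max(0.0, x[i] - 0.5 * x[i - 1]) for i in range(len(x))]

def augment(x, max_shift=0.1):
    """Toy augmentation: randomly flip the 'image' and jitter each pixel slightly."""
    out = list(reversed(x)) if random.random() < 0.5 else list(x)
    return [v + random.uniform(-max_shift, max_shift) for v in out]

x = [0.2, 0.9, 0.4, 0.7]  # one flattened toy image

# Without augmentation the input never changes, so Z can be cached.
z_first = frozen_body(x)
z_second = frozen_body(x)
print("same Z without augmentation:", z_first == z_second)

# With augmentation each epoch sees a different X, so Z must be recomputed.
z_per_epoch = [frozen_body(augment(x)) for _ in range(3)]
print("Z differs across epochs:", z_per_epoch[0] != z_per_epoch[1])
```

In the second case, a cached Z from the first epoch would no longer match the input the head is supposed to be trained on, which is why the body has to stay inside the loop.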

42
00:03:22,760 --> 00:03:30,400
So these are the two approaches. Approach number one will be to use the image data generator with data augmentation.

43
00:03:30,410 --> 00:03:36,300
This means we have to put the entire CNN computation inside the training loop.

44
00:03:36,320 --> 00:03:43,790
Approach number two will be to precompute the feature vector Z on the original data X without data augmentation.

45
00:03:43,790 --> 00:03:53,990
This means that after transforming the data all we need to do is train a logistic regression model.

46
00:03:54,000 --> 00:03:56,040
There are pros and cons to each approach.

47
00:03:56,040 --> 00:03:59,440
Let's list them out.

48
00:03:59,510 --> 00:04:05,900
As discussed earlier, when we do use data augmentation, that means we have to recompute the features for every batch we see.

49
00:04:05,900 --> 00:04:08,560
This is slow, especially if you have a large network.

50
00:04:09,440 --> 00:04:14,750
Yes, it's faster than training the entire network, but still slower than only training a logistic regression.

51
00:04:17,110 --> 00:04:17,910
On the other hand,

52
00:04:17,920 --> 00:04:23,440
data augmentation may help your model generalize better.

53
00:04:23,470 --> 00:04:25,840
Without data augmentation, it's basically the opposite situation.

54
00:04:26,110 --> 00:04:31,720
You can precompute the features before training, which will save you a lot of time. Training logistic

55
00:04:31,720 --> 00:04:34,370
regression is extremely fast.

56
00:04:34,480 --> 00:04:40,950
Yet this also means that you cannot use data augmentation, which may cause your model to perform suboptimally.

57
00:04:40,960 --> 00:04:46,300
However, it's always worth trying both to see which method gives you the best results for your specific

58
00:04:46,350 --> 00:04:46,840
dataset.
