1
00:00:11,670 --> 00:00:16,680
In this lecture, we are going to look at a modified version of the recommender code.

2
00:00:16,680 --> 00:00:21,690
This notebook makes some major improvements on the previous code, and I will also talk about how

3
00:00:21,690 --> 00:00:24,460
I discovered what modifications to make.

4
00:00:24,690 --> 00:00:27,720
This lecture is going to walk you through a prepared Colab notebook.

5
00:00:27,810 --> 00:00:33,450
Although a very good exercise, which I always recommend, is, once you know how this is done, to try and

6
00:00:33,450 --> 00:00:39,210
recreate it yourself with as few references as possible. As usual, you can look at the title of the notebook

7
00:00:39,540 --> 00:00:41,790
to determine what notebook we are currently looking at.

8
00:00:46,080 --> 00:00:51,450
Okay, so before we look at the code, I want to make a special note about why it was rewritten and why

9
00:00:51,450 --> 00:00:53,850
it might confuse a lot of beginners.

10
00:00:53,970 --> 00:01:00,120
So from a general software engineering perspective, everything we did in the old script was totally right.

11
00:01:00,210 --> 00:01:05,830
We were using the functions that PyTorch had given us, and we were doing things the PyTorch way.

12
00:01:05,880 --> 00:01:11,580
One thing I really advocate against is when students come to me and say, hey Lazy Programmer, why did

13
00:01:11,580 --> 00:01:13,080
you write your code this way?

14
00:01:13,080 --> 00:01:15,120
Why aren't you using this built-in function

15
00:01:15,210 --> 00:01:18,650
instead of writing this complicated code that's hard for me to understand?

16
00:01:18,930 --> 00:01:20,600
Why are you reinventing the wheel?

17
00:01:21,360 --> 00:01:27,180
And if you're a beginner, or maybe you don't have that much experience yet but you are a newly minted

18
00:01:27,180 --> 00:01:33,420
software engineer who is very intent on following the rules and being a good, well-behaved software engineer,

19
00:01:33,600 --> 00:01:40,310
then you would tend to agree. But I hope to teach you why this is a bad idea, and I hope that you will

20
00:01:40,310 --> 00:01:44,900
learn that being a good, well-behaved software engineer is not always optimal.

21
00:01:45,880 --> 00:01:51,070
So first of all, if you're taking a course because you're trying to learn how something works, then you

22
00:01:51,070 --> 00:01:53,900
should expect to implement some things yourself.

23
00:01:53,950 --> 00:01:55,870
That's what learning is.

24
00:01:55,870 --> 00:02:00,560
As always, my rule goes: if you can't implement it, then you don't understand it.

25
00:02:00,920 --> 00:02:01,300
Okay.

26
00:02:01,330 --> 00:02:06,840
So maybe it's hard for you to understand, but this is not a valid reason for you to not do something.

27
00:02:07,570 --> 00:02:12,140
If it were easy, then you wouldn't need to take a course on how to do it in the first place.

28
00:02:12,280 --> 00:02:15,800
If it were easy, you should be able to figure it out yourself.

29
00:02:15,850 --> 00:02:20,620
So if you're looking for things to be easy, then you're probably taking the wrong approach.

30
00:02:20,620 --> 00:02:26,940
I'm not a fan of programmers who excel at reading documentation but are poor at problem solving.

31
00:02:27,130 --> 00:02:33,550
Programming isn't an exercise in memorization; rather, it's a task that involves building and putting

32
00:02:33,550 --> 00:02:41,680
things together from basic building blocks. It's better to ask, how can I build this, rather than asking,

33
00:02:41,980 --> 00:02:48,970
how do I spell some function name that does this? One is a spelling exercise and one is a thinking exercise.

34
00:02:48,970 --> 00:02:54,490
Some people say, hey, you're the Lazy Programmer, aren't you supposed to advocate the lazy approach?

35
00:02:54,670 --> 00:02:57,580
And to that I say, being lazy is good

36
00:02:57,580 --> 00:03:02,390
if you're the good kind of lazy. Good lazy is when you're being more efficient.

37
00:03:02,560 --> 00:03:07,660
Bad lazy is when you're trying to be more efficient, but that leads to you being less effective and less

38
00:03:07,660 --> 00:03:09,700
skilled and less knowledgeable.

39
00:03:09,700 --> 00:03:12,130
So keep these ideas in mind as we go through the lecture.

40
00:03:16,870 --> 00:03:17,220
Okay.

41
00:03:17,240 --> 00:03:21,710
So in this script, since it's so similar to the previous one, I'm just going to go over the things that

42
00:03:21,710 --> 00:03:25,280
are different. So let's go down to where we define the model.

43
00:03:32,780 --> 00:03:36,670
Now, it's pretty obvious that I've trained this model many, many times in the past.

44
00:03:36,680 --> 00:03:42,470
In fact, I have an entire course on recommender systems. So I noticed that when I implemented this model,

45
00:03:42,950 --> 00:03:47,600
aside from the fact that it was training very slowly, it was also not getting the mean squared error

46
00:03:47,750 --> 00:03:49,190
I expected.

47
00:03:49,190 --> 00:03:54,740
So you might wonder, how do I debug something like that? If I know my TensorFlow model performs very

48
00:03:54,740 --> 00:03:55,360
well

49
00:03:55,520 --> 00:03:57,440
but my PyTorch model does not,

50
00:03:57,440 --> 00:03:59,000
what can I do?

51
00:03:59,000 --> 00:04:01,580
Well, here's one thing you can do.

52
00:04:01,750 --> 00:04:04,420
First, you want to make sure that both models are equivalent.

53
00:04:05,080 --> 00:04:11,200
So if I create both models side by side and I copy the weights from one model to the other, I should

54
00:04:11,200 --> 00:04:14,580
get the same model predictions given the same inputs.

55
00:04:14,590 --> 00:04:20,480
If I don't, then something is wrong. So you check that, and it's a good sanity check.
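
A minimal sketch of that first sanity check. This is an illustration under assumptions, not the notebook's code: a plain NumPy function stands in for the second framework's model, and the layer sizes and names (`numpy_dense`, `models_match`) are hypothetical. The idea is only to copy the weights across and confirm that the same inputs give the same predictions.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)

# Model A: a PyTorch dense layer (stands in for the PyTorch model).
layer = nn.Linear(4, 3)

# Model B: a NumPy version of the same layer (stands in for the other framework).
def numpy_dense(x, W, b):
    # nn.Linear stores W with shape (out_features, in_features), hence the transpose.
    return x @ W.T + b

# Copy the weights from model A into model B.
W = layer.weight.detach().numpy().copy()
b = layer.bias.detach().numpy().copy()

# Feed the same inputs to both models.
x = np.random.RandomState(0).randn(5, 4).astype(np.float32)
with torch.no_grad():
    out_a = layer(torch.from_numpy(x)).numpy()
out_b = numpy_dense(x, W, b)

# Equivalent models should agree up to floating point error.
models_match = np.allclose(out_a, out_b, atol=1e-5)
print("models match:", models_match)
```

If the check fails with identical weights, the two models are not computing the same function, and that has to be fixed before comparing optimizers.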

56
00:04:20,500 --> 00:04:22,590
And let's say that it works.

57
00:04:22,690 --> 00:04:27,700
The next thing to ask is, well, maybe the optimizers are implemented a little differently, or they have

58
00:04:27,700 --> 00:04:29,140
different hyperparameters.

59
00:04:29,140 --> 00:04:30,690
This is totally possible.

60
00:04:30,850 --> 00:04:32,680
So you can test that as well.

61
00:04:32,680 --> 00:04:37,750
Basically, you write your own custom training loop, and for both models, look at the data at the same time

62
00:04:37,780 --> 00:04:39,980
and do the exact same updates.

63
00:04:40,120 --> 00:04:45,940
If both optimizers are the same, then both the loss per iteration and the model parameters should be the

64
00:04:45,940 --> 00:04:48,010
same after each update.

65
00:04:48,130 --> 00:04:53,660
So you can plot the loss per iteration for both models, and if they're the same, then that means the optimizer

66
00:04:53,660 --> 00:04:54,700
is the same.

67
00:04:54,710 --> 00:04:56,430
So that's a good sanity check as well.

68
00:04:58,990 --> 00:05:03,490
Of course, before you start the training process, you have to make sure that both models start with the

69
00:05:03,490 --> 00:05:05,240
exact same weights.

70
00:05:05,290 --> 00:05:09,910
Luckily, you already know how to copy weights from one model to the other, since that was your first sanity

71
00:05:09,910 --> 00:05:11,190
check.

72
00:05:11,260 --> 00:05:15,310
Okay, so let's say you do that and the loss per iteration is the same for both models.

73
00:05:15,310 --> 00:05:15,760
Now what?

74
00:05:16,690 --> 00:05:21,850
Well, you have to ask: are you getting the worse mean squared error or the better mean squared error?

75
00:05:22,060 --> 00:05:27,760
If you copy the PyTorch weights to the TensorFlow model and both optimizers are the same, then you

76
00:05:27,760 --> 00:05:29,830
should see the worse mean squared error.

77
00:05:30,100 --> 00:05:35,380
If you copy the TensorFlow weights to the PyTorch model and both optimizers are the same, then you

78
00:05:35,380 --> 00:05:37,610
see the better mean squared error.

79
00:05:37,630 --> 00:05:41,290
So at this point you're narrowing down the source of the problem.

80
00:05:41,350 --> 00:05:43,750
Basically, it has to do with weight initialization.

81
00:05:47,420 --> 00:05:49,960
This is a topic that's outside the scope of this course.

82
00:05:49,970 --> 00:05:53,100
So if you're not familiar with it please don't worry about it.

83
00:05:53,150 --> 00:05:59,660
You can learn more in my in-depth series, but basically, there are different kinds of weight initialization.

84
00:05:59,660 --> 00:06:04,820
For example, you might sample from a uniform distribution or a Gaussian distribution.

85
00:06:04,820 --> 00:06:12,850
From there, you can choose the limits on the uniform distribution or the variance of the Gaussian. Sometimes

86
00:06:12,860 --> 00:06:17,480
programmers will make this depend on the dimensionality of the input or the output.

87
00:06:18,200 --> 00:06:23,660
Well, long story short, if you plot a histogram of the initial weights of your PyTorch model and your

88
00:06:23,660 --> 00:06:27,670
TensorFlow model, you'll notice that they are not the same.

89
00:06:27,790 --> 00:06:33,200
In other words, the source of the discrepancy is that PyTorch initializes its weights differently

90
00:06:33,210 --> 00:06:34,080
than TensorFlow.

91
00:06:36,780 --> 00:06:42,150
Luckily, we have only two different kinds of layers in our network: embedding and dense layers.

92
00:06:42,300 --> 00:06:46,430
So what we have to figure out is, where are these differences coming from?

93
00:06:46,440 --> 00:06:49,310
How does PyTorch initialize those layers by default?

94
00:06:49,320 --> 00:06:53,040
And how does TensorFlow initialize those layers by default?

95
00:06:53,040 --> 00:06:55,840
And if they're both different, do they both matter?

96
00:06:55,860 --> 00:06:58,970
Or maybe it's that only one of them matters.

97
00:06:59,010 --> 00:07:01,250
Lucky for you, I've done all this work.

98
00:07:01,590 --> 00:07:07,620
So what you get after doing a little digging is that, although both the linear layers and the embedding

99
00:07:07,620 --> 00:07:13,260
layers are initialized differently between PyTorch and TensorFlow, it's really the embedding layer that

100
00:07:13,260 --> 00:07:14,540
makes a huge difference.

101
00:07:17,060 --> 00:07:22,880
Specifically, if you check the PyTorch documentation, you'll see that PyTorch initializes the embedding

102
00:07:22,880 --> 00:07:27,690
matrix to come from the standard normal regardless of the dimensionality.

103
00:07:27,830 --> 00:07:32,960
It's actually very surprising that this makes a huge difference since, as you know, we like to standardize

104
00:07:32,960 --> 00:07:36,050
our inputs before putting them into a neural network.

105
00:07:36,080 --> 00:07:41,300
Well, if you sample from N(0, 1), then our inputs will be exactly standardized.

106
00:07:41,300 --> 00:07:45,500
So it's kind of surprising that this leads to poor results.

107
00:07:45,860 --> 00:07:51,200
In any case, what I ended up doing was manually initializing the embedding layers to come from a normal

108
00:07:51,200 --> 00:07:54,640
distribution with a standard deviation of my choice.

109
00:07:54,680 --> 00:07:56,110
So this is how you would do that.
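
A minimal sketch of that manual initialization, assuming a single `nn.Embedding` layer. The sizes `N` and `D`, the name `u_emb`, and the standard deviation `0.01` are hypothetical placeholders; the lecture does not state the value actually used. It generates a random matrix in NumPy, wraps it in a tensor and a `Parameter`, and assigns it to the `data` attribute of the embedding weight.

```python
import numpy as np
import torch
import torch.nn as nn

N, D = 1000, 10    # hypothetical: number of rows and embedding dimension
init_std = 0.01    # hypothetical standard deviation, not the lecture's value

u_emb = nn.Embedding(N, D)
# PyTorch's default embedding init is N(0, 1), so this should be close to 1.
default_std = u_emb.weight.std().item()

# Random matrix generated in NumPy, wrapped in a tensor, wrapped in a Parameter,
# then assigned to the data attribute of the embedding's weight.
random_matrix = (np.random.randn(N, D) * init_std).astype(np.float32)
u_emb.weight.data = nn.Parameter(torch.from_numpy(random_matrix))

new_std = u_emb.weight.std().item()
print(f"std before: {default_std:.3f}, std after: {new_std:.3f}")
```

The cast to `float32` matters because NumPy defaults to `float64`, while the embedding weight is `float32`.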

110
00:08:00,240 --> 00:08:05,820
So for each of the embeddings, we can access the weight attribute, which then has a data attribute, which

111
00:08:05,820 --> 00:08:10,170
we can set simply by using the equals operation. On the right side,

112
00:08:10,170 --> 00:08:16,040
we create a Parameter object, which wraps a Tensor object, which wraps a NumPy array, and that NumPy

113
00:08:16,050 --> 00:08:23,780
array is a random matrix I generated myself. All right, so that takes care of one part, which is the accuracy

114
00:08:23,780 --> 00:08:31,020
of the model. But we also have this problem that the model trains extremely slowly. So you'll notice that

115
00:08:31,020 --> 00:08:34,410
in the script, there are no more DataLoader or Dataset objects.

116
00:08:41,460 --> 00:08:47,310
So we just go straight to the training function, which takes in two new inputs: the train data and the

117
00:08:47,310 --> 00:08:48,660
test data.

118
00:08:48,660 --> 00:08:55,140
These are just tuples that store the NumPy array versions of the data. And if you've taken my in-depth

119
00:08:55,170 --> 00:09:00,840
deep learning courses, then you're in luck, because we've written this loop many times for ANNs, CNNs,

120
00:09:00,840 --> 00:09:06,710
RNNs, and so forth. And back in the old days, you would actually have to write this kind of thing yourself.

121
00:09:06,900 --> 00:09:09,500
And by the old days, I really mean like two or three years ago.

122
00:09:09,510 --> 00:09:13,280
So I don't think what I'm doing is some kind of ancient secret or something.

123
00:09:13,320 --> 00:09:15,010
This is still in the deep learning era.

124
00:09:17,700 --> 00:09:18,080
OK.

125
00:09:18,090 --> 00:09:23,000
So basically, we're going to iterate through the batches manually.

126
00:09:23,200 --> 00:09:26,600
Now, most of this is the same, but some of it is a little different.

127
00:09:26,670 --> 00:09:28,890
So I'll walk you through the different parts.

128
00:09:29,130 --> 00:09:33,630
And by the way, if you already have some idea of how you would do this yourself, you should try to do

129
00:09:33,630 --> 00:09:36,730
that before looking at this.

130
00:09:37,040 --> 00:09:41,560
So one thing that we need to do is calculate how long the inner loop will even be.

131
00:09:42,110 --> 00:09:46,640
It's going to be the number of batches, but how many batches do we have?

132
00:09:46,640 --> 00:09:50,630
Well, that's roughly the number of samples divided by the batch size.

133
00:09:50,840 --> 00:09:52,610
Of course they may not divide evenly.

134
00:09:52,610 --> 00:09:56,090
So we need to take the ceiling and then cast the result to an integer.
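
As a small worked example of that calculation (the function name `batches_per_epoch` and the sample counts are hypothetical, chosen only for illustration):

```python
import math

def batches_per_epoch(n_samples, batch_size):
    # Ceiling division: a final partial batch still counts as a batch.
    return int(math.ceil(n_samples / batch_size))

print(batches_per_epoch(1000, 512))  # 1000 / 512 is about 1.95, rounded up
print(batches_per_epoch(1024, 512))  # divides evenly, nothing to round
```

Using the floor here would silently drop the last partial batch from every epoch.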

135
00:10:01,680 --> 00:10:05,450
So inside the main training loop, we first shuffle the data.

136
00:10:05,520 --> 00:10:10,180
We do this on every iteration to avoid unwanted correlations.

137
00:10:10,200 --> 00:10:14,550
Then we do our inner loop through each batch. Inside the loop,

138
00:10:14,550 --> 00:10:16,350
we grab the current batch.

139
00:10:16,790 --> 00:10:21,930
The index range is j times batch size up to j plus 1 times batch size.

140
00:10:22,050 --> 00:10:26,320
You want to verify this for yourself by writing it out on paper.

141
00:10:26,340 --> 00:10:33,410
Next, we convert the batch to tensors and move the data to the GPU, and we do our usual gradient descent

142
00:10:33,410 --> 00:10:33,860
step.
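
A rough sketch of the batching logic just described, in pure NumPy. It is not the notebook's training function: the array names (`users`, `movies`, `ratings`) and the helper `iterate_batches` are hypothetical, and the tensor conversion and gradient step are only marked by a comment.

```python
import numpy as np

def iterate_batches(users, movies, ratings, batch_size, rng):
    # Shuffle all three arrays with the same permutation so rows stay aligned.
    n = len(ratings)
    perm = rng.permutation(n)
    users, movies, ratings = users[perm], movies[perm], ratings[perm]

    n_batches = int(np.ceil(n / batch_size))
    for j in range(n_batches):
        # Slicing past the end is safe: the last batch is simply shorter.
        lo, hi = j * batch_size, (j + 1) * batch_size
        yield users[lo:hi], movies[lo:hi], ratings[lo:hi]

rng = np.random.default_rng(0)
users = np.arange(10)
movies = np.arange(10) + 100
ratings = np.arange(10, dtype=np.float32)

batch_sizes = []
seen = []
for u, m, r in iterate_batches(users, movies, ratings, batch_size=4, rng=rng):
    # In a real training loop, this is where the batch would be converted to
    # tensors, moved to the GPU, and used for one gradient descent step.
    batch_sizes.append(len(r))
    seen.extend(u.tolist())

print(batch_sizes)  # [4, 4, 2]
```

With 10 samples and a batch size of 4, the index ranges are 0 to 4, 4 to 8, and 8 to 12, and the last one is cut off at 10.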

143
00:10:38,050 --> 00:10:43,990
And down below, we do a similar loop over the test set, since this also may be too large to fit into memory.

144
00:10:44,050 --> 00:10:45,700
You should confirm that for yourself.

145
00:10:51,590 --> 00:10:51,890
OK.

146
00:10:51,900 --> 00:10:56,000
So next, we call our training function after splitting up the data.

147
00:10:56,040 --> 00:11:01,200
This is almost the same as what we did earlier, except that this is in pure NumPy, and we only need to

148
00:11:01,350 --> 00:11:12,150
convert to tensors when we're inside the training function.

149
00:11:12,220 --> 00:11:12,490
All right.

150
00:11:12,520 --> 00:11:14,470
So here are the results.

151
00:11:14,470 --> 00:11:19,210
So if we look at the output from our training function, we notice two things.

152
00:11:19,210 --> 00:11:21,700
First, the accuracy is much better.

153
00:11:21,700 --> 00:11:25,250
These results are on par with what you would get with TensorFlow.

154
00:11:25,630 --> 00:11:30,670
But note that TensorFlow works right out of the box, whereas you had to do some customization in

155
00:11:30,670 --> 00:11:33,580
PyTorch to achieve the same results.

156
00:11:33,580 --> 00:11:40,150
Second, we notice that the duration per epoch is much lower than before. Before, it took over five minutes

157
00:11:40,150 --> 00:11:40,900
per epoch.

158
00:11:40,900 --> 00:11:47,440
And this takes just under one minute per epoch, so that's over a 5x speedup just from writing our

159
00:11:47,440 --> 00:11:52,900
own training loop and not using the built-in Datasets or DataLoaders, which is very interesting.

160
00:11:52,900 --> 00:11:54,700
So let that be a lesson to you.

161
00:11:54,880 --> 00:12:00,190
Don't blindly take the approach of everyone must use libraries in order to be a proper software engineer

162
00:12:00,230 --> 00:12:02,280
and don't reinvent the wheel.

163
00:12:02,290 --> 00:12:07,630
Most people who say these kinds of things do so because they can't code, and because they can't code,

164
00:12:07,930 --> 00:12:10,490
they don't want anyone else acting like they can code.

165
00:12:10,660 --> 00:12:14,890
They want everyone else to be just like them, to appear that they have the same skill level.

166
00:12:14,980 --> 00:12:21,130
So don't be like these people. Instead, know how to implement things yourself, know how to investigate

167
00:12:21,280 --> 00:12:23,380
and debug and things like that.

168
00:12:23,380 --> 00:12:26,380
Otherwise you'll just be a slave to APIs and libraries.