1
00:00:11,690 --> 00:00:17,780
In this lecture, we are going to go through a Colab notebook that emphasizes the importance of shapes

2
00:00:17,780 --> 00:00:19,430
in RNNs.

3
00:00:19,430 --> 00:00:24,950
Remember that whenever you hear me say something like N by T by D, you should be automatically thinking

4
00:00:24,950 --> 00:00:28,840
about a box without me explicitly showing you a box.

5
00:00:28,940 --> 00:00:34,940
If you don't have this automatic visualization reflex you'll be at a disadvantage in trying to learn

6
00:00:34,940 --> 00:00:36,400
this stuff.

7
00:00:36,410 --> 00:00:41,650
This lecture is all about tracking the shapes in an RNN. Also, we are going to go through the RNN

8
00:00:41,650 --> 00:00:48,430
calculation manually to reinforce our understanding of how an RNN works. As usual,

9
00:00:48,530 --> 00:00:53,200
you can look at the title of the notebook to determine what notebook we are currently looking at.

10
00:00:56,450 --> 00:01:02,000
So the first thing we have here are just some comments where I list out all the important size variables

11
00:01:02,000 --> 00:01:03,800
we have to pay attention to.

12
00:01:04,220 --> 00:01:07,470
These things should be permanently stored in your memory.

13
00:01:07,490 --> 00:01:11,870
You should never be asking, "What does M mean again?" If you are,

14
00:01:11,870 --> 00:01:18,350
that will significantly slow down your learning, so take notes and write things down if you have to.

15
00:01:18,350 --> 00:01:22,280
So just to recap, N is the number of samples in your dataset.

16
00:01:22,280 --> 00:01:25,580
This has been the case since the beginning of this course.

17
00:01:25,640 --> 00:01:27,530
T is the sequence length.

18
00:01:27,530 --> 00:01:34,550
Remember that in PyTorch we assume constant-size sequences. D is the input feature dimensionality.

19
00:01:35,120 --> 00:01:39,400
We've gone through many examples of this where you might have a D bigger than one.

20
00:01:39,950 --> 00:01:42,080
M is the number of hidden units.

21
00:01:42,080 --> 00:01:48,790
This is the same as we have in a regular feedforward ANN, so it's a hyperparameter which you can choose.

22
00:01:48,830 --> 00:01:56,120
Finally, K is the number of output nodes. As a side note, K being bigger than one does not automatically imply

23
00:01:56,210 --> 00:01:59,240
you are doing classification with a softmax.

24
00:01:59,240 --> 00:02:05,870
You can do multi-dimensional regression too. Imagine, for instance, you are trying to predict lat-long coordinates.

25
00:02:06,350 --> 00:02:07,190
In that scenario,

26
00:02:07,190 --> 00:02:10,460
K would be two, but it would still be a regression problem.

27
00:02:14,280 --> 00:02:22,100
So next, we're going to make some dummy data. We're also going to set our size variables, so we're setting N,

28
00:02:22,160 --> 00:02:31,980
T, D, M, and K, and then we make X to be just a random array of size N by T by D.

29
00:02:32,290 --> 00:02:33,480
So N is one.

30
00:02:33,490 --> 00:02:38,740
So for this example, we're only going to be working with one sample. T is 10.

31
00:02:38,740 --> 00:02:42,370
So our sequence length is 10. D equals 3.

32
00:02:42,610 --> 00:02:47,130
So our feature dimensionality is 3. M is equal to 5.

33
00:02:47,140 --> 00:02:54,780
So our hidden feature size is 5. Finally, K equals 2, so we have two output nodes.

34
00:02:55,000 --> 00:02:57,120
And as you know, our input X will be of shape

35
00:02:57,130 --> 00:02:58,240
N by T by D.
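
The size variables and dummy input described above can be sketched as follows. This is a minimal NumPy sketch, not the notebook's exact code: the lecture only says "a random array", so the use of `np.random.randn` here is an assumption.

```python
import numpy as np

# Size variables, with the values stated in the lecture
N = 1   # number of samples
T = 10  # sequence length
D = 3   # input feature dimensionality
M = 5   # number of hidden units
K = 2   # number of output nodes

# Random dummy input of shape N x T x D
# (standard normal values are an assumption; any random values would do)
X = np.random.randn(N, T, D)
print(X.shape)
```

Picture `X` as a box: N slices, each slice a T by D table with one row per time step.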

36
00:03:07,100 --> 00:03:12,910
Next, we're going to create our model. As before, we're going to create a custom class called SimpleRNN.

37
00:03:12,910 --> 00:03:14,650
This time,

38
00:03:14,650 --> 00:03:19,780
I'm not going to have an input argument for the number of hidden layers since the default is 1 and so

39
00:03:19,780 --> 00:03:23,340
it does not need to be specified inside the constructor.

40
00:03:23,350 --> 00:03:29,110
We have the same steps as before: we set all the arguments to be instance variables, we instantiate the

41
00:03:29,140 --> 00:03:35,920
RNN module, and we create a dense layer with the number of output units being K. And just for fun, I'll use

42
00:03:35,920 --> 00:03:45,530
the tanh activation as the RNN nonlinearity.

43
00:03:45,570 --> 00:03:51,660
Next we have the forward function which is essentially the same as before but with two crucial differences.

44
00:03:51,660 --> 00:03:54,690
First, we're not going to use the GPU in this notebook,

45
00:03:54,690 --> 00:04:00,600
since we're not training anything. And second, instead of just taking the final hidden state and passing

46
00:04:00,600 --> 00:04:05,130
it through the dense layer, I'm going to take all the hidden states and pass them through the final

47
00:04:05,130 --> 00:04:06,180
dense layer.

48
00:04:06,240 --> 00:04:11,820
So now we'll expect our output to be of size N by T by K instead of just N by K.
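
A model along these lines might look like the sketch below. It is a reconstruction from the description, not the notebook's exact code: the constructor argument names and the attribute names `rnn` and `fc` are assumptions. It runs on the CPU and passes every hidden state through the dense layer.

```python
import torch
import torch.nn as nn

N, T, D, M, K = 1, 10, 3, 5, 2

class SimpleRNN(nn.Module):
    # Argument names are illustrative, not taken from the notebook
    def __init__(self, n_inputs, n_hidden, n_outputs):
        super().__init__()
        self.D = n_inputs
        self.M = n_hidden
        self.K = n_outputs
        # One hidden layer (the default), tanh nonlinearity, batch dimension first
        self.rnn = nn.RNN(
            input_size=self.D,
            hidden_size=self.M,
            nonlinearity='tanh',
            batch_first=True)
        self.fc = nn.Linear(self.M, self.K)

    def forward(self, X):
        # Initial hidden state: (num_layers, N, M), all zeros, on the CPU
        h0 = torch.zeros(1, X.size(0), self.M)
        out, _ = self.rnn(X, h0)  # out holds every hidden state: N x T x M
        # nn.Linear acts on the last dimension, so this gives N x T x K
        return self.fc(out)

model = SimpleRNN(D, M, K)
X = torch.randn(N, T, D)
out = model(X)
print(out.shape)
```

Taking only the final state would mean indexing `out[:, -1, :]` before the dense layer, which gives N by K instead.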

49
00:04:18,710 --> 00:04:23,210
Next, we're going to instantiate our model and then pass it X to get the model's output.

50
00:04:30,340 --> 00:04:33,130
So obviously, both our data and weights are random.

51
00:04:33,160 --> 00:04:35,340
So this prediction is not meaningful.

52
00:04:35,500 --> 00:04:37,240
These numbers are just for sanity checking.

53
00:04:41,880 --> 00:04:46,850
As you can see, the output shape is, as expected, one by ten by two.

54
00:04:46,860 --> 00:04:54,230
This is because we have one sample, a sequence length of 10, and two outputs. Take note of these numbers,

55
00:04:54,350 --> 00:04:59,620
as this is what we want to compare with later on.

56
00:04:59,900 --> 00:05:08,590
Next, we're going to detach the output and save it as a NumPy array so we can use it later.

57
00:05:08,610 --> 00:05:11,460
Next, we're going to grab the parameters of the RNN module.

58
00:05:14,560 --> 00:05:19,060
As you can see, when I call the parameters function, it returns four things.

59
00:05:19,060 --> 00:05:23,650
Now you might ask, how do I know that the parameters will be returned in this order?

60
00:05:23,650 --> 00:05:24,850
Well I don't.

61
00:05:24,880 --> 00:05:30,040
I had to figure it out by running this script and making sure the outputs were the same as my manual

62
00:05:30,040 --> 00:05:31,330
calculation.

63
00:05:31,420 --> 00:05:37,030
So as an exercise if you want to close this notebook and try to do it yourself you're strongly encouraged

64
00:05:37,030 --> 00:05:37,590
to do so

65
00:05:40,820 --> 00:05:46,180
If we inspect W_xh by checking its shape and printing out its values,

66
00:05:46,370 --> 00:05:50,190
we can verify that it really refers to the input-to-hidden weight.

67
00:05:50,390 --> 00:05:55,250
If you recall, the input dimensionality is three and the hidden dimensionality is five.

68
00:06:01,710 --> 00:06:02,190
Next,

69
00:06:02,250 --> 00:06:16,430
we just have some code to copy each of the parameters to NumPy arrays.

70
00:06:16,800 --> 00:06:23,670
Now, if we check the shape of all the parameters, our ordering seems to make sense. W_xh we already confirmed.

71
00:06:23,970 --> 00:06:27,460
W_hh should be five by five, which it is.

72
00:06:27,480 --> 00:06:33,240
What's weird is that PyTorch separates the input-to-hidden bias and the hidden-to-hidden bias. That

73
00:06:33,240 --> 00:06:37,940
won't be a problem as long as we follow the principles.

74
00:06:38,190 --> 00:06:45,930
Next, we grab the parameters of our final fully connected layer, W_o and b_o. If we check the shape, the

75
00:06:45,930 --> 00:06:47,610
ordering appears to be correct.
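
One way to take the guesswork out of the parameter ordering is to ask PyTorch for the names as well. The sketch below uses `named_parameters()` and the documented `nn.RNN` attribute names (`weight_ih_l0`, `weight_hh_l0`, `bias_ih_l0`, `bias_hh_l0`). The variable names `W_xh`, `W_hh`, and so on follow the lecture's notation; the standalone `rnn` and `fc` modules stand in for the ones inside the model. Note that PyTorch stores the input-to-hidden weight as hidden by input, so it is 5 by 3 here, not 3 by 5.

```python
import torch.nn as nn

D, M, K = 3, 5, 2

# Stand-ins for the RNN module and dense layer inside the model
rnn = nn.RNN(input_size=D, hidden_size=M, nonlinearity='tanh', batch_first=True)
fc = nn.Linear(M, K)

# named_parameters() yields (name, tensor) pairs, so the order is explicit
rnn_shapes = {name: tuple(p.shape) for name, p in rnn.named_parameters()}
fc_shapes = {name: tuple(p.shape) for name, p in fc.named_parameters()}
for name, shape in {**rnn_shapes, **fc_shapes}.items():
    print(name, shape)

# Copy each parameter to a NumPy array, using the lecture's notation
W_xh = rnn.weight_ih_l0.detach().numpy()  # M x D
W_hh = rnn.weight_hh_l0.detach().numpy()  # M x M
b_xh = rnn.bias_ih_l0.detach().numpy()    # M
b_hh = rnn.bias_hh_l0.detach().numpy()    # M
W_o = fc.weight.detach().numpy()          # K x M
b_o = fc.bias.detach().numpy()            # K
```

Because the weights are stored as output by input, a manual calculation either multiplies on the left (`W_xh @ x`) or transposes (`x @ W_xh.T`).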

76
00:06:51,890 --> 00:06:55,490
The last step is to do our manual RNN calculation.

77
00:06:55,580 --> 00:06:58,160
This just follows the pseudocode we discussed earlier.

78
00:06:58,160 --> 00:07:04,310
So hopefully you were taking notes. To start, we're going to initialize the initial hidden state to

79
00:07:04,310 --> 00:07:06,320
a vector of zeros.

80
00:07:06,320 --> 00:07:11,100
Next we get X at index 0 which is our one and only sample.

81
00:07:11,120 --> 00:07:14,790
Next we initialize an array of zeros for our y hats.

82
00:07:15,050 --> 00:07:18,470
As you know in this example we only have one sample.

83
00:07:18,470 --> 00:07:23,820
So for simplicity I'm going to make it an array of size t by k.

84
00:07:23,840 --> 00:07:29,320
Next, we enter a loop where little t counts up from zero up to big T. Inside the loop,

85
00:07:29,330 --> 00:07:31,070
we first calculate h.

86
00:07:31,070 --> 00:07:34,130
That's the hidden value at the hidden layer.

87
00:07:34,130 --> 00:07:39,520
This calculation is a little bit different from the theory lecture, since the RNN layer has two bias

88
00:07:39,530 --> 00:07:40,060
terms.

89
00:07:40,580 --> 00:07:45,710
Nevertheless, we can follow the usual pattern. In order to transform x of t,

90
00:07:45,710 --> 00:07:52,200
we're going to multiply by W_xh and add the bias term b_xh.

91
00:07:52,300 --> 00:07:58,750
Next, we're going to take the previous hidden state, h_last, multiply it by W_hh, and add the bias term

92
00:07:58,810 --> 00:08:02,100
b_hh, so that transforms h_last.

93
00:08:02,170 --> 00:08:06,220
So these are both affine transformations, on x of t and h_last.

94
00:08:08,230 --> 00:08:13,680
Next, we add both these linear transformations together and apply the tanh activation.

95
00:08:13,930 --> 00:08:20,080
Once we have h, we can calculate y, which is just the usual equation: multiply by W_o and add

96
00:08:20,080 --> 00:08:24,930
b_o. We also store the y that we get into our y_hats array.

97
00:08:26,990 --> 00:08:33,310
Finally, we assign h to h_last, so that h_last has the correct value for the next iteration of the loop.
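
The loop just described can be sketched in plain NumPy. To keep the sketch self-contained, the weights here are random stand-ins (an assumption) rather than the ones copied out of the PyTorch model, but they use the same layout PyTorch uses: `W_xh` is M by D, `W_hh` is M by M, and `W_o` is K by M.

```python
import numpy as np

T, D, M, K = 10, 3, 5, 2
rng = np.random.default_rng(0)

# Hypothetical random weights, in PyTorch's (output x input) layout
W_xh = rng.standard_normal((M, D))
b_xh = rng.standard_normal(M)
W_hh = rng.standard_normal((M, M))
b_hh = rng.standard_normal(M)
W_o = rng.standard_normal((K, M))
b_o = rng.standard_normal(K)

x = rng.standard_normal((T, D))  # the one and only sample, T x D

h_last = np.zeros(M)             # initial hidden state
y_hats = np.zeros((T, K))        # one K-sized output per time step

for t in range(T):
    # Two affine transformations (note the two bias terms), then tanh
    h = np.tanh(W_xh @ x[t] + b_xh + W_hh @ h_last + b_hh)
    # Output layer: the usual affine transformation of h
    y = W_o @ h + b_o
    y_hats[t] = y
    # Carry h forward to the next iteration
    h_last = h

print(y_hats.shape)
```

Shape check for one step: `W_xh @ x[t]` is (M x D) times (D), giving M; `W_hh @ h_last` is (M x M) times (M), giving M; `W_o @ h` is (K x M) times (M), giving K.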

98
00:08:35,450 --> 00:08:40,580
Once we're outside the loop, we can print the final value of the y_hats array, and hopefully this is

99
00:08:40,580 --> 00:08:43,700
equal to what we calculated before with the PyTorch model.

100
00:08:50,240 --> 00:08:50,520
Now,

101
00:08:50,540 --> 00:08:55,880
since this is just a big list of numbers and most of us won't have memorized what we saw before, it would be

102
00:08:55,880 --> 00:08:59,260
better to check this programmatically by using the allclose function.
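
Here is a hedged end-to-end sketch of that check: run PyTorch's `nn.RNN` plus a dense layer, redo the calculation by hand with the same weights, and compare with `np.allclose`. The standalone `rnn` and `fc` modules are stand-ins for the ones inside the model. The explicit `atol=1e-6` is my choice, not necessarily the notebook's: the numbers are float32, so a small absolute tolerance guards against rounding differences on outputs that happen to land near zero.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
T, D, M, K = 10, 3, 5, 2

rnn = nn.RNN(input_size=D, hidden_size=M, nonlinearity='tanh', batch_first=True)
fc = nn.Linear(M, K)

X = torch.randn(1, T, D)
with torch.no_grad():
    hidden, _ = rnn(X)                 # 1 x T x M; initial state defaults to zeros
    torch_out = fc(hidden)[0].numpy()  # T x K

# Pull the weights out by name so the order cannot be mixed up
W_xh = rnn.weight_ih_l0.detach().numpy()
W_hh = rnn.weight_hh_l0.detach().numpy()
b_xh = rnn.bias_ih_l0.detach().numpy()
b_hh = rnn.bias_hh_l0.detach().numpy()
W_o = fc.weight.detach().numpy()
b_o = fc.bias.detach().numpy()

# Manual RNN calculation on the single sample
x = X[0].numpy()
h_last = np.zeros(M, dtype=np.float32)
y_hats = np.zeros((T, K), dtype=np.float32)
for t in range(T):
    h = np.tanh(W_xh @ x[t] + b_xh + W_hh @ h_last + b_hh)
    y_hats[t] = W_o @ h + b_o
    h_last = h

# Compare element-wise within a tolerance instead of eyeballing the numbers
match = np.allclose(y_hats, torch_out, atol=1e-6)
print(match)
```

If the weights were assigned in the wrong order, the shapes would either fail to multiply or the comparison would come back False, which is exactly how the ordering can be worked out by trial.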

103
00:09:05,230 --> 00:09:05,700
Awesome.

104
00:09:05,740 --> 00:09:10,360
So we've confirmed that these are indeed the calculations that are done by a simple RNN.

105
00:09:13,350 --> 00:09:17,330
One thing that made this exercise simpler was that we only had one sample.

106
00:09:17,670 --> 00:09:20,490
As a bonus exercise here's what you can do.

107
00:09:20,540 --> 00:09:23,000
Use an N bigger than one.

108
00:09:23,400 --> 00:09:28,680
Modify this code so that it still produces the same result even when you have multiple samples.
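
If you want to check your answer after trying the exercise yourself, here is one possible approach, sketched in NumPy with random stand-in weights (an assumption, using PyTorch's output by input layout). The hidden state becomes N by M, each time step is handled for all samples at once, and the result is compared against running the single-sample loop once per sample. The function names are hypothetical.

```python
import numpy as np

N, T, D, M, K = 4, 10, 3, 5, 2
rng = np.random.default_rng(0)

# Hypothetical random weights in PyTorch's (output x input) layout
W_xh = rng.standard_normal((M, D))
b_xh = rng.standard_normal(M)
W_hh = rng.standard_normal((M, M))
b_hh = rng.standard_normal(M)
W_o = rng.standard_normal((K, M))
b_o = rng.standard_normal(K)

X = rng.standard_normal((N, T, D))

def rnn_forward_single(x):
    """Manual RNN for one sample x of shape T x D; returns T x K."""
    h_last = np.zeros(M)
    y_hats = np.zeros((x.shape[0], K))
    for t in range(x.shape[0]):
        h = np.tanh(W_xh @ x[t] + b_xh + W_hh @ h_last + b_hh)
        y_hats[t] = W_o @ h + b_o
        h_last = h
    return y_hats

def rnn_forward_batch(X):
    """Manual RNN for a batch X of shape N x T x D; returns N x T x K."""
    n, n_steps = X.shape[0], X.shape[1]
    h_last = np.zeros((n, M))          # one hidden state per sample
    y_hats = np.zeros((n, n_steps, K))
    for t in range(n_steps):
        # X[:, t] is N x D; transposing the weights keeps samples on the rows
        h = np.tanh(X[:, t] @ W_xh.T + b_xh + h_last @ W_hh.T + b_hh)
        y_hats[:, t] = h @ W_o.T + b_o
        h_last = h
    return y_hats

batch_out = rnn_forward_batch(X)
single_out = np.stack([rnn_forward_single(x) for x in X])
print(batch_out.shape, np.allclose(batch_out, single_out))
```

The loop over time stays, because each step depends on the previous hidden state; only the loop over samples is replaced by matrix multiplication.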
