1
00:00:11,010 --> 00:00:16,650
In this lecture, I hope to provide you with a very intuitive introduction to recurrent neural networks.

2
00:00:17,490 --> 00:00:21,690
This lecture will focus on the simple RNN, also known as the Elman unit.

3
00:00:22,710 --> 00:00:27,480
So let's begin this lecture by thinking about why we might want to use RNNs.

4
00:00:27,990 --> 00:00:30,540
Suppose that our task is to classify words.

5
00:00:31,350 --> 00:00:36,450
The label doesn't really matter, but you could think of it maybe as parts of speech, for example,

6
00:00:36,450 --> 00:00:38,910
whether the word is a noun, a verb and so forth.

7
00:00:39,420 --> 00:00:45,780
Now suppose that we use a regular feedforward ANN to solve this task. Our ANN simply takes

8
00:00:45,780 --> 00:00:49,920
in a single word and outputs a prediction for the label.

9
00:00:50,460 --> 00:00:52,380
OK, so nothing special so far.

10
00:00:57,070 --> 00:00:58,990
But what's the problem with our ANN?

11
00:00:59,770 --> 00:01:04,450
Well, the problem is there are multiple possible answers which depend on context.

12
00:01:04,870 --> 00:01:10,990
For example, suppose the word is bank. We know that bank has several meanings. One obvious

13
00:01:10,990 --> 00:01:13,920
example of a bank is a financial institution.

14
00:01:14,380 --> 00:01:18,600
If this appeared in your e-mail, it might have a higher chance of being classified as spam.

15
00:01:19,090 --> 00:01:21,970
But we know that the word bank can mean other things.

16
00:01:22,360 --> 00:01:27,010
For example, a riverbank. Suppose your friend invites you on a hike along a riverbank.

17
00:01:27,220 --> 00:01:29,530
This clearly has a low chance of being spam.

18
00:01:30,520 --> 00:01:32,550
The word bank can also be a verb.

19
00:01:32,860 --> 00:01:37,390
So if you're doing part-of-speech tagging, then your tag could be noun or verb, depending on the

20
00:01:37,390 --> 00:01:38,260
context.

21
00:01:38,800 --> 00:01:43,400
Thus, what we are saying is the target is not determined by the input alone.

22
00:01:43,660 --> 00:01:46,110
It also depends on the surrounding context.

23
00:01:50,560 --> 00:01:56,380
Now, recall that we interpret ANNs as models that can find hidden features in data. We might think

24
00:01:56,380 --> 00:01:58,450
of these as hidden representations.

25
00:01:58,870 --> 00:02:04,270
For example, if we're looking at faces, one of these units might represent eyes, while another might

26
00:02:04,270 --> 00:02:05,450
represent ears.

27
00:02:05,890 --> 00:02:10,720
So this is a critical point to understand for this lecture, which is just a review from the previous

28
00:02:10,720 --> 00:02:16,540
sections: the hidden layers of a neural network output hidden representations of the input.

29
00:02:21,370 --> 00:02:24,470
Now, let's go back to our problem of word classification.

30
00:02:25,150 --> 00:02:30,640
Suppose that for the entire sequence of inputs, we apply the same neural network to make a prediction

31
00:02:30,820 --> 00:02:31,970
as we normally would.

32
00:02:32,620 --> 00:02:39,550
But now let's turn this into an RNN, a recurrent neural network. For each neural network, instead

33
00:02:39,550 --> 00:02:42,980
of only taking in a single input at the current time step,

34
00:02:43,180 --> 00:02:47,730
it also takes in the hidden representation from the previous time step.

35
00:02:48,340 --> 00:02:54,610
Note that by doing this, there is a pathway from all the previous inputs to the current hidden representation.

36
00:02:55,150 --> 00:03:01,180
So it not only depends on the previous time step, but effectively on all past time steps through the previous

37
00:03:01,300 --> 00:03:02,350
hidden representation.

38
00:03:03,010 --> 00:03:05,620
So this is called a recurrent neural network.

39
00:03:10,250 --> 00:03:12,860
So why is this called a recurrent neural network?

40
00:03:13,520 --> 00:03:19,150
Well, recognize that a much more compact way of representing this network would be to draw a self loop.

41
00:03:19,700 --> 00:03:20,910
That is, the hidden layer's

42
00:03:20,930 --> 00:03:26,490
output goes back to its input with a time delay of one. In previous years,

43
00:03:26,510 --> 00:03:32,150
I always started my explanation of RNNs with this diagram, but I found that some students didn't have

44
00:03:32,150 --> 00:03:33,240
the intuition for it.

45
00:03:33,770 --> 00:03:38,970
Possibly this is because diagrams like these are common in engineering, but not in other fields.

46
00:03:39,380 --> 00:03:42,500
So if you are an engineer, then you should find this very intuitive.

47
00:03:42,830 --> 00:03:47,780
But if you've never seen anything like this before, take note of how it relates to the previous diagram,

48
00:03:47,990 --> 00:03:49,880
which we call an unrolled RNN.

49
00:03:50,720 --> 00:03:56,270
We call it unrolled because this version is like a rolled-up version, with each time step compacted into

50
00:03:56,270 --> 00:03:57,380
a single diagram.

51
00:04:01,990 --> 00:04:08,080
So let's discuss more about why this is a recurrent neural network. Now that we understand the flow

52
00:04:08,080 --> 00:04:11,160
of information, we can start thinking in terms of neurons.

53
00:04:11,590 --> 00:04:17,020
As you recall, one of the most important expressions you'll learn in machine learning is W transpose

54
00:04:17,020 --> 00:04:20,540
x plus b, or a weight times input plus bias.

55
00:04:21,010 --> 00:04:22,690
Well, it's going to appear again.

56
00:04:23,320 --> 00:04:29,950
Basically, every arrow you see on this diagram mathematically means: take the input, multiply it by W,

57
00:04:29,950 --> 00:04:35,740
and add b. Again, remember that you shouldn't think of this as math, but merely as a pattern of symbols

58
00:04:35,890 --> 00:04:37,200
that you'll see again and again.

59
00:04:37,900 --> 00:04:41,540
Don't think of it as math, but rather just as a familiar concept.

60
00:04:42,220 --> 00:04:46,810
Now recognize that the W's and b's are different depending on the role of the arrow.

61
00:04:47,680 --> 00:04:54,160
If it's an arrow going from one hidden representation to another, we'll call the weights W_h and b_h. If

62
00:04:54,160 --> 00:05:00,130
it's an arrow going from the input to the hidden representation, we'll call the weights W_x and b_x.

63
00:05:00,970 --> 00:05:05,340
Note that from this point forward we will call the hidden representations hidden states.

64
00:05:05,770 --> 00:05:08,650
This is common terminology when we're dealing with sequences.

65
00:05:13,330 --> 00:05:17,200
OK, so what happens when we try to turn this diagram into math?

66
00:05:17,800 --> 00:05:22,630
Well, suppose that we're given some prior hidden state h_0. From this,

67
00:05:22,630 --> 00:05:29,660
we can compute the linear transformation W_h transpose times h_0 plus b_h. At this point,

68
00:05:29,680 --> 00:05:33,820
we also want to transform the input x_1. From this,

69
00:05:33,820 --> 00:05:37,660
we get W_x transpose times x_1 plus b_x.

70
00:05:38,440 --> 00:05:43,630
As per our diagram, we then add these two together and apply some activation function like the ReLU

71
00:05:43,630 --> 00:05:44,710
or the tanh.

72
00:05:45,250 --> 00:05:47,470
This gives us a new hidden state, h_1.

73
00:05:52,210 --> 00:05:53,920
So what happens at the next step?

74
00:05:54,430 --> 00:06:01,030
Well, we do the same thing again. We transform h_1 using W_h transpose times h_1 plus b_h.

75
00:06:01,030 --> 00:06:09,910
We also transform x_2 by using W_x transpose times x_2 plus b_x. We again add these together

76
00:06:10,030 --> 00:06:11,830
and apply the activation function.

77
00:06:12,460 --> 00:06:13,810
This gives us h_2.
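Written out, the two steps just described look roughly like this. This is only a transcription of the spoken formulas, with sigma standing for whichever activation function is chosen:

```latex
\begin{aligned}
h_1 &= \sigma\left(W_h^\top h_0 + b_h + W_x^\top x_1 + b_x\right) \\
h_2 &= \sigma\left(W_h^\top h_1 + b_h + W_x^\top x_2 + b_x\right)
\end{aligned}
```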

78
00:06:18,600 --> 00:06:23,960
Now, of course, we can write this more compactly by recognizing that all the steps look the same.

79
00:06:24,570 --> 00:06:33,510
We can say h of t is equal to sigma of W_h transpose times h of t minus one, plus b_h, plus W_x transpose

80
00:06:33,510 --> 00:06:35,670
times x of t, plus b_x.
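As a rough illustration of this recurrence, here is a minimal NumPy sketch. The function name `simple_rnn_forward`, the sizes T, D (input size) and M (hidden size), the random weights, and the choice of tanh as the activation are all assumptions made for the example, not something fixed by the lecture.

```python
import numpy as np

def simple_rnn_forward(X, h0, Wx, Wh, bx, bh, activation=np.tanh):
    """Apply h_t = activation(Wh^T h_{t-1} + bh + Wx^T x_t + bx) over a sequence.

    X has shape (T, D), h0 has shape (M,); returns all hidden states, shape (T, M).
    """
    h = h0
    states = []
    for x_t in X:  # one time step per row of X
        h = activation(Wh.T @ h + bh + Wx.T @ x_t + bx)
        states.append(h)
    return np.stack(states)

# Hypothetical sizes and random parameters, purely for illustration.
rng = np.random.default_rng(0)
T, D, M = 5, 3, 4
X = rng.normal(size=(T, D))
Wx = rng.normal(size=(D, M))
Wh = rng.normal(size=(M, M))
bx = np.zeros(M)
bh = np.zeros(M)

H = simple_rnn_forward(X, np.zeros(M), Wx, Wh, bx, bh)
print(H.shape)  # (5, 4)
```

Note that the same Wx, Wh, bx and bh are reused at every time step; only the hidden state changes as the loop moves through the sequence.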

81
00:06:36,270 --> 00:06:39,690
So now it should be clear why these are called recurrent neural networks.

82
00:06:40,110 --> 00:06:45,030
This equation is a recurrence. As a simple example of a recurrence,

83
00:06:45,390 --> 00:06:47,190
think of the Fibonacci sequence.

84
00:06:47,460 --> 00:06:52,450
It says x of n is equal to x of n minus one plus x of n minus two.

85
00:06:52,980 --> 00:06:58,860
So it's a similar idea where the current value is computed by applying some transformation on the previous

86
00:06:58,860 --> 00:06:59,580
values.
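To make the Fibonacci comparison concrete, here is a small sketch. The function name and the starting values 0 and 1 are assumptions for the example; the lecture only states the recurrence itself.

```python
def fibonacci(n, x0=0, x1=1):
    """Return the first n terms of x_n = x_{n-1} + x_{n-2}."""
    terms = [x0, x1]
    while len(terms) < n:
        # Each new value is computed from the two previous values.
        terms.append(terms[-1] + terms[-2])
    return terms[:n]

print(fibonacci(8))  # [0, 1, 1, 2, 3, 5, 8, 13]
```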

87
00:07:04,220 --> 00:07:10,250
As usual, it's always important to think about shapes. So given this recurrence, can we infer what

88
00:07:10,250 --> 00:07:12,930
the shapes of our weights and bias vectors should be?

89
00:07:14,150 --> 00:07:18,800
Let's start by recalling our convention that we normally use M for the size of the hidden vector

90
00:07:18,980 --> 00:07:21,200
and D for the size of the input vector.

91
00:07:22,160 --> 00:07:28,660
Since W_h has to map a size M vector to a size M vector, it must have the size M by M.

92
00:07:29,660 --> 00:07:37,520
Since W_x has to map a size D vector to a size M vector, it must have the size D by M. Now, as with

93
00:07:37,520 --> 00:07:42,680
the ANN, many students often ask: why do we use D by M and not M by D?

94
00:07:43,190 --> 00:07:44,920
Surely this would be more convenient,

95
00:07:45,110 --> 00:07:48,060
since then we don't have to transpose our weight matrices.

96
00:07:48,440 --> 00:07:52,790
However, keep in mind that this is simply the convention when we write things mathematically.

97
00:07:53,810 --> 00:08:00,830
Again, one way to see this is to think of a single value in the matrix, W i j. If we use the convention

98
00:08:00,830 --> 00:08:07,070
input size by output size, then this value always represents the weight going from input i to output

99
00:08:07,070 --> 00:08:07,610
j.

100
00:08:08,120 --> 00:08:12,530
In other words, the input comes before the output, which is more intuitive.

101
00:08:13,960 --> 00:08:18,220
Finally, it should be clear that both b_h and b_x must be vectors of size M.
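One way to check the input-size-by-output-size convention is to set a single weight and see where it shows up. In this sketch the sizes D = 3 and M = 4, the indices i and j, and the value 5.0 are arbitrary choices for illustration.

```python
import numpy as np

D, M = 3, 4              # hypothetical input size and hidden size
Wx = np.zeros((D, M))    # input size by output size
i, j = 2, 1
Wx[i, j] = 5.0           # the weight going from input i to output j

x = np.zeros(D)
x[i] = 1.0               # only input i is active

out = Wx.T @ x           # W_x transpose times x, a vector of size M
print(out.shape)         # (4,)
print(out[j])            # 5.0
```

Because the matrix is stored as D by M, we need the transpose to map a size D vector to a size M vector, which is exactly the W transpose x pattern from earlier.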

102
00:08:23,470 --> 00:08:29,320
You should also recognize that there are multiple ways of representing this recurrence. For one, it

103
00:08:29,320 --> 00:08:32,100
seems that having multiple bias terms is redundant.

104
00:08:32,500 --> 00:08:34,540
It would be simpler if we just had one.

105
00:08:35,050 --> 00:08:37,480
So that's one way of simplifying this recurrence.

106
00:08:42,110 --> 00:08:48,890
Another way to simplify this recurrence is by combining H of T minus one and X of T into a single vector

107
00:08:48,920 --> 00:08:52,850
and combining W_h and W_x into a single matrix.

108
00:08:53,930 --> 00:09:01,370
In this case, the combined input vector has the shape D plus M. Since this needs to map to h of t,

109
00:09:01,370 --> 00:09:08,750
which is a vector of size M, the combined weight matrix has the shape D plus M by M. And of course, an

110
00:09:08,750 --> 00:09:15,020
equivalent way of looking at this is that it's just W_x and W_h concatenated into a single matrix.
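Here is a minimal NumPy sketch suggesting that both simplifications give the same hidden state as the original recurrence. The sizes, the random values, and the ordering (x first, then h) in the concatenation are assumptions for the example; what matters is that the vector and the matrix are stacked in the same order.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 3, 4  # hypothetical input size and hidden size
Wx = rng.normal(size=(D, M))
Wh = rng.normal(size=(M, M))
bx = rng.normal(size=M)
bh = rng.normal(size=M)
x_t = rng.normal(size=D)
h_prev = rng.normal(size=M)

# Original form: two weight matrices and two bias vectors.
separate = np.tanh(Wh.T @ h_prev + bh + Wx.T @ x_t + bx)

# Simplified form: one concatenated matrix, one concatenated vector, one bias.
W = np.concatenate([Wx, Wh], axis=0)   # shape (D + M, M)
z = np.concatenate([x_t, h_prev])      # shape (D + M,)
b = bx + bh                            # the two biases merge into one
combined = np.tanh(W.T @ z + b)

print(W.shape)                         # (7, 4)
print(np.allclose(separate, combined)) # True
```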