1
00:00:11,040 --> 00:00:16,620
In this lecture, I hope to provide you with a very intuitive introduction to recurrent neural networks.

2
00:00:17,460 --> 00:00:21,660
This lecture will focus on the simple RNN, also known as the Elman unit.

3
00:00:22,710 --> 00:00:27,450
So let's begin this lecture by thinking about why we might want to use RNNs.

4
00:00:27,990 --> 00:00:30,510
Suppose that our task is to classify words.

5
00:00:31,350 --> 00:00:35,370
The label doesn't really matter, but you could think of it maybe as parts of speech.

6
00:00:35,730 --> 00:00:38,880
For example, whether the word is a noun, a verb and so forth.

7
00:00:39,420 --> 00:00:45,870
Now suppose that we use a regular feedforward ANN to solve this task. Our ANN simply takes in

8
00:00:45,870 --> 00:00:49,890
a single word and outputs y-hat, which is a prediction for the label.

9
00:00:50,460 --> 00:00:52,380
OK, so nothing special so far.

10
00:00:57,070 --> 00:00:59,020
But what's the problem with our ANN?

11
00:00:59,740 --> 00:01:04,420
Well, the problem is there are multiple possible answers which depend on context.

12
00:01:04,840 --> 00:01:09,700
For example, if the previous word was "bank", then we know that "bank" has several meanings.

13
00:01:10,270 --> 00:01:13,900
One obvious example of a bank is a financial institution.

14
00:01:14,380 --> 00:01:18,580
If this appeared in your email, it might have a higher chance of being classified as spam.

15
00:01:19,120 --> 00:01:21,970
But we know that the word "bank" can mean other things.

16
00:01:22,360 --> 00:01:26,980
For example, a river bank. If your friend invites you on a hike along a river bank,

17
00:01:27,190 --> 00:01:29,560
this clearly has a low chance of being spam.

18
00:01:30,520 --> 00:01:32,500
The word bank can also be a verb.

19
00:01:32,860 --> 00:01:37,390
So if you're doing part-of-speech tagging, then your tag could be noun or verb, depending on the

20
00:01:37,390 --> 00:01:38,230
context.

21
00:01:38,800 --> 00:01:43,450
Thus, what we are saying is the target is not determined by the input alone.

22
00:01:43,690 --> 00:01:46,030
It also depends on the surrounding context.

23
00:01:50,590 --> 00:01:56,140
Now, recall that we interpret ANNs as models that can find hidden features in data. We might

24
00:01:56,140 --> 00:01:58,390
think of these as hidden representations.

25
00:01:58,870 --> 00:02:04,270
For example, if we're looking at faces, one of these units might represent eyes while another might

26
00:02:04,270 --> 00:02:05,440
represent ears.

27
00:02:05,890 --> 00:02:10,690
So this is a critical point to understand for this lecture, which is just a review from the previous

28
00:02:10,690 --> 00:02:11,500
sections.

29
00:02:11,950 --> 00:02:16,510
The hidden layers of a neural network output hidden representations of the input.

30
00:02:21,400 --> 00:02:27,220
Now, let's go back to our problem of word classification. Suppose that for the entire sequence of

31
00:02:27,220 --> 00:02:31,960
inputs, we apply the same neural network to make a prediction as we normally would.

32
00:02:32,620 --> 00:02:39,520
But now let's turn this into an RNN, a recurrent neural network. For each neural network, instead

33
00:02:39,520 --> 00:02:42,970
of only taking in a single input at the current time step,

34
00:02:43,210 --> 00:02:47,620
it also takes in the hidden representation from the previous time step.

35
00:02:48,340 --> 00:02:54,580
Note that by doing this, there is a pathway from all the previous inputs to the current hidden representation.

36
00:02:55,150 --> 00:03:01,150
So it not only depends on the past time step, but effectively all past time steps through the previous

37
00:03:01,150 --> 00:03:02,350
hidden representation.

38
00:03:03,040 --> 00:03:05,530
So this is called a recurrent neural network.

39
00:03:10,280 --> 00:03:12,830
So why is this called a recurrent neural network?

40
00:03:13,520 --> 00:03:19,040
Well, recognize that a much more compact way of representing this network would be to draw a self-loop.

41
00:03:19,700 --> 00:03:24,410
That is, the hidden layer's output goes back to its input with a time delay of one.

42
00:03:25,400 --> 00:03:30,920
In previous years, I always started my explanation of RNNs with this diagram, but I found that some

43
00:03:30,920 --> 00:03:33,170
students didn't have the intuition for it.

44
00:03:33,740 --> 00:03:38,960
Possibly this is because diagrams like these are common in engineering, but not in other fields.

45
00:03:39,380 --> 00:03:42,470
So if you are an engineer, then you should find this very intuitive.

46
00:03:42,800 --> 00:03:44,990
But if you've never seen anything like this before,

47
00:03:45,260 --> 00:03:49,820
take note of how it relates to the previous diagram, which we call an unrolled RNN.

48
00:03:50,720 --> 00:03:56,240
We call it unrolled because this version is like a rolled-up version of each time step compacted into

49
00:03:56,240 --> 00:03:57,380
a single diagram.

50
00:04:01,990 --> 00:04:08,170
So let's discuss more about why this is a recurrent neural network. Now that we understand the flow of

51
00:04:08,170 --> 00:04:13,750
information, we can start thinking in terms of neurons. As you recall, one of the most important

52
00:04:13,750 --> 00:04:20,500
expressions you learn in machine learning is W transpose x plus b, or weight times input plus bias.

53
00:04:20,980 --> 00:04:22,660
Well, it's going to appear again.

54
00:04:23,320 --> 00:04:29,920
Basically, every arrow you see on this diagram mathematically means take the input, multiply by W,

55
00:04:29,920 --> 00:04:35,710
and add b. Again, remember that you shouldn't think of this as math, but merely a pattern of symbols

56
00:04:35,890 --> 00:04:37,150
that you'll see again and again.

57
00:04:37,870 --> 00:04:41,540
Don't think of it as math, but rather just as a familiar concept.

58
00:04:42,220 --> 00:04:46,840
Now, recognize that the Ws and bs are different depending on the role of the arrow.

59
00:04:47,710 --> 00:04:53,410
If it's an arrow going from one hidden representation to another, we'll call the weights W_h and b_h.

60
00:04:53,920 --> 00:04:59,350
If it's an arrow going from the input to the hidden representation, we'll call the weights W_x and

61
00:04:59,350 --> 00:05:00,070
b_x.

62
00:05:00,970 --> 00:05:05,320
Note that from this point forward, we will call the hidden representations hidden states.

63
00:05:05,740 --> 00:05:08,590
This is common terminology when we're dealing with sequences.

64
00:05:13,390 --> 00:05:17,170
OK, so what happens when we try to turn this diagram into math?

65
00:05:17,800 --> 00:05:21,370
Well, suppose that we're given some prior hidden state h_0.

66
00:05:21,970 --> 00:05:28,300
From this, we can compute the linear transformation W_h transpose times h_0 plus b_h.

67
00:05:28,900 --> 00:05:32,320
At this point, we also want to transform the input x_1.

68
00:05:33,190 --> 00:05:39,640
From this, we get W_x transpose times x_1 plus b_x. As per our diagram,

69
00:05:39,640 --> 00:05:44,650
we then add these two together and apply some activation function, like the ReLU or the tanh.

70
00:05:45,250 --> 00:05:47,470
This gives us a new hidden state h_1.

71
00:05:52,240 --> 00:05:53,890
So what happens at the next step?

72
00:05:54,400 --> 00:05:55,990
Well, we do the same thing again.

73
00:05:56,230 --> 00:06:01,480
We transform h_1 using W_h transpose times h_1 plus b_h.

74
00:06:01,990 --> 00:06:08,590
We also transform x_2 by using W_x transpose times x_2 plus b_x. We

75
00:06:08,590 --> 00:06:11,830
again add these together and apply the activation function.

76
00:06:12,430 --> 00:06:13,810
This gives us h_2.

77
00:06:18,600 --> 00:06:23,970
Now, of course, we can write this more compactly by recognizing that all the steps look the same.

78
00:06:24,570 --> 00:06:33,480
We can say h_t is equal to sigma of W_h transpose times h_{t-1} plus b_h plus W_x transpose

79
00:06:33,480 --> 00:06:39,690
times x_t plus b_x. So now it should be clear why these are called recurrent neural networks.

80
00:06:40,080 --> 00:06:45,030
This equation is a recurrence. As a simple example of a recurrence,

81
00:06:45,360 --> 00:06:52,440
think of the Fibonacci sequence. It says x_n is equal to x_{n-1} plus x_{n-2}.

82
00:06:52,980 --> 00:06:58,830
So it's a similar idea where the current value is computed by applying some transformation on the previous

83
00:06:58,830 --> 00:06:59,550
values.
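
The recurrence described above can be sketched in NumPy. This is only an illustrative sketch, not code from the lecture: the function name `simple_rnn_forward`, the sizes (5 time steps, input size 3, hidden size 4), the random weights, and the choice of tanh as the activation are all assumptions made for the example.

```python
import numpy as np

def simple_rnn_forward(xs, h0, W_h, b_h, W_x, b_x):
    """Apply h_t = tanh(W_h^T h_{t-1} + b_h + W_x^T x_t + b_x) to each x_t in turn."""
    h = h0
    hs = []
    for x in xs:
        # The new hidden state depends on the current input and the previous hidden state.
        h = np.tanh(W_h.T @ h + b_h + W_x.T @ x + b_x)
        hs.append(h)
    return np.stack(hs)

# Hypothetical sizes and random parameters, chosen only for illustration.
rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))      # 5 time steps, each input of size 3
h0 = np.zeros(4)                  # prior hidden state h_0 of size 4
W_h = rng.normal(size=(4, 4))     # hidden-to-hidden weights
W_x = rng.normal(size=(3, 4))     # input-to-hidden weights
b_h = np.zeros(4)
b_x = np.zeros(4)

hs = simple_rnn_forward(xs, h0, W_h, b_h, W_x, b_x)
print(hs.shape)  # one hidden state per time step
```

Because each hidden state is fed into the next step, the last row of `hs` depends on every input in `xs`, not just the final one.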

84
00:07:04,220 --> 00:07:06,980
As usual, it's always important to think about shapes.

85
00:07:07,400 --> 00:07:12,920
So given this recurrence, can we infer what the shapes of our weights and bias vectors should be?

86
00:07:14,150 --> 00:07:18,800
Let's start by recalling our convention that we normally use M for the size of the hidden vector

87
00:07:19,010 --> 00:07:21,200
and D for the size of the input vector.

88
00:07:22,160 --> 00:07:28,640
Since W_h has to map a size M vector to a size M vector, it must have the size M by M.

89
00:07:29,690 --> 00:07:34,070
Since W_x has to map a size D vector to a size M vector,

90
00:07:34,250 --> 00:07:41,060
it must have the size D by M. Now, as with ANNs, many students often ask: why do we use D by

91
00:07:41,060 --> 00:07:42,640
M and not M by D?

92
00:07:43,190 --> 00:07:44,890
Surely this would be more convenient,

93
00:07:45,110 --> 00:07:48,020
since then we don't have to transpose our weight matrices.

94
00:07:48,470 --> 00:07:52,780
However, keep in mind that this is simply the convention when we write things mathematically.

95
00:07:53,840 --> 00:07:58,780
Again, one way to see this is to think of a single value in the matrix, W_ij.

96
00:07:59,450 --> 00:08:05,270
If we use the convention input size by output size, then this value always represents the weight going

97
00:08:05,270 --> 00:08:07,600
from input i to output j.

98
00:08:08,120 --> 00:08:12,440
In other words, the input comes before the output, which is more intuitive.

99
00:08:13,990 --> 00:08:18,570
Finally, it should be clear that both b_h and b_x must be vectors of size M.
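
As a rough check of the shape convention, the snippet below builds arrays with the shapes just described and sets a single weight to show what the "input i to output j" indexing means. The concrete sizes D = 3 and M = 4 and the value 5.0 are assumptions picked for illustration only.

```python
import numpy as np

D, M = 3, 4                 # hypothetical input size and hidden size

W_x = np.zeros((D, M))      # input-to-hidden weights: input size by output size
W_h = np.zeros((M, M))      # hidden-to-hidden weights
b_x = np.zeros(M)
b_h = np.zeros(M)

# W_x[i, j] is the weight going from input i to output j.
W_x[2, 1] = 5.0

x = np.array([0.0, 0.0, 1.0])   # only input 2 is active
out = W_x.T @ x + b_x           # the transpose maps a size D vector to a size M vector

print(out)  # only output 1 is nonzero
```

With the input-by-output convention, the transpose is what makes the matrix-vector product land in the hidden space, which is why the lecture writes W transpose x rather than W x.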

100
00:08:23,410 --> 00:08:27,760
You should also recognize that there are multiple ways of representing this recurrence.

101
00:08:28,480 --> 00:08:32,110
For one, it seems that having multiple bias terms is redundant.

102
00:08:32,530 --> 00:08:34,510
It would be simpler if we just had one.

103
00:08:35,049 --> 00:08:37,419
So that's one way of simplifying this recurrence.

104
00:08:42,140 --> 00:08:47,270
Another way to simplify this recurrence is by combining h_{t-1} and x_t

105
00:08:47,570 --> 00:08:52,820
into a single vector and combining W_h and W_x into a single matrix.

106
00:08:53,960 --> 00:08:58,310
In this case, the combined input vector has the shape D plus M.

107
00:08:59,450 --> 00:09:04,700
Since this needs to map to h_t, which is a vector of size M, the combined weight matrix has the

108
00:09:04,700 --> 00:09:06,710
shape D plus M by M.

109
00:09:07,850 --> 00:09:13,880
And of course, an equivalent way of looking at this is that it's just W_x and W_h concatenated into

110
00:09:13,880 --> 00:09:14,990
a single matrix.
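
The two simplifications above (a single bias, and a single concatenated weight matrix) can be checked numerically. The sketch below is an assumption-laden illustration rather than the lecture's own code: the sizes, the random values, the tanh activation, and the choice to place x_t before h_{t-1} in the combined vector are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M = 3, 4                           # hypothetical input and hidden sizes

W_x = rng.normal(size=(D, M))
W_h = rng.normal(size=(M, M))
b_x = rng.normal(size=M)
b_h = rng.normal(size=M)
x_t = rng.normal(size=D)
h_prev = rng.normal(size=M)

# Original form: two weight matrices and two bias vectors.
separate = np.tanh(W_h.T @ h_prev + b_h + W_x.T @ x_t + b_x)

# Simplified form: stack W_x on top of W_h, concatenate the inputs in the
# same order, and fold the two biases into one.
W = np.concatenate([W_x, W_h], axis=0)   # shape (D + M, M)
v = np.concatenate([x_t, h_prev])        # shape (D + M,)
b = b_h + b_x
combined = np.tanh(W.T @ v + b)

print(W.shape, np.allclose(separate, combined))
```

The order of concatenation does not matter as long as the rows of the combined matrix are stacked in the same order as the entries of the combined vector.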