1
00:00:11,700 --> 00:00:17,820
In this section of the course, we are going to focus on a new topic: sequence data. Although deep learning

2
00:00:17,850 --> 00:00:22,100
initially became popular for being really great at working with images,

3
00:00:22,230 --> 00:00:27,350
it was not long until deep learning also proved its dominance with sequences.

4
00:00:27,420 --> 00:00:33,690
Some examples of that are text, speech, and even financial data such as stock returns.

5
00:00:33,810 --> 00:00:39,390
In this lecture, we are going to discuss why we should model sequences and also look at some examples.

6
00:00:44,530 --> 00:00:46,990
So let's start with the simplest kind of sequence data,

7
00:00:46,990 --> 00:00:52,960
or at least what most people think of when they think of a sequence: a time series signal.

8
00:00:53,610 --> 00:00:58,960
A time series signal can be any continuous-valued measurement taken periodically.

9
00:00:58,960 --> 00:01:02,920
For example, a company's stock price is probably the go-to example.

10
00:01:08,050 --> 00:01:11,810
Another famous example is the airline passengers data set.

11
00:01:11,830 --> 00:01:14,870
This is actually more practical than you might imagine.

12
00:01:14,950 --> 00:01:20,020
If you can forecast the number of passengers you will have on your airline, you can do all sorts of things,

13
00:01:20,020 --> 00:01:25,050
such as making sure the airport has enough workers to handle all the passengers.

14
00:01:25,210 --> 00:01:28,100
You can modify your prices to reflect demand.

15
00:01:28,150 --> 00:01:30,180
In fact that's what people do.

16
00:01:30,220 --> 00:01:33,190
Uber's surge pricing is an example of that.

17
00:01:33,250 --> 00:01:39,730
It could also help airlines decide where to put their resources and marketing. Actually, forecasting airline

18
00:01:39,730 --> 00:01:42,350
passengers is a surprisingly big deal.

19
00:01:42,370 --> 00:01:45,350
There are all sorts of articles and papers written on this topic.

20
00:01:45,790 --> 00:01:51,670
I just found out that China will soon surpass the United States as the world's largest aviation market.

21
00:01:56,760 --> 00:02:00,110
Another example of a time series is weather tracking.

22
00:02:00,210 --> 00:02:04,070
This is something we all probably make use of on a daily basis.

23
00:02:04,140 --> 00:02:06,690
You don't want to go outside without your umbrella

24
00:02:06,690 --> 00:02:08,710
if it's going to rain.

25
00:02:08,790 --> 00:02:14,550
One interesting fact about weather is that it's a dynamical system. That probably sounds complicated,

26
00:02:14,850 --> 00:02:20,230
but some terms you have probably heard of are chaos theory and the butterfly effect.

27
00:02:20,520 --> 00:02:27,330
The idea is that even if we have the exact deterministic equations to describe a weather system, our forecast

28
00:02:27,360 --> 00:02:30,000
will still become more and more incorrect

29
00:02:30,030 --> 00:02:36,910
the further into the future we try to predict. That's pretty counterintuitive, because you would think

30
00:02:37,270 --> 00:02:42,940
that if you have the exact equation for something, then you should be able to calculate all the future values

31
00:02:42,940 --> 00:02:48,100
precisely. But due to the butterfly effect, this is not actually true.

32
00:02:48,100 --> 00:02:55,780
As the saying goes, a butterfly flapping its wings in Tokyo can cause a tornado in America. Small imprecisions

33
00:02:55,780 --> 00:03:01,480
like numerical round-off error in your computer will ultimately lead to your weather forecast being

34
00:03:01,480 --> 00:03:03,490
completely wrong.

35
00:03:04,060 --> 00:03:09,520
This is actually really relevant to us, because when you think of time series and RNNs, you automatically

36
00:03:09,520 --> 00:03:14,890
think: how can I use a powerful model like an RNN to do time series forecasting?

37
00:03:20,140 --> 00:03:26,890
Another great example is speech or audio, for which the obvious application would be speech recognition.

38
00:03:26,980 --> 00:03:32,500
Now, I'm not sure how many of you are into audio engineering or music production, but if you are, then

39
00:03:32,500 --> 00:03:35,770
you should definitely recognize this as a time series.

40
00:03:35,770 --> 00:03:41,630
Just open up your favorite audio file in the open-source program Audacity, and you'll see how it's just

41
00:03:41,630 --> 00:03:43,780
a time series of sound amplitude.

42
00:03:44,020 --> 00:03:46,600
So all in all, time series are everywhere.

43
00:03:51,670 --> 00:03:57,490
Another type of sequential data is text, which, of course, deep learning also excels at.

44
00:03:57,850 --> 00:04:03,550
The interesting thing about text data is that it can be treated as a sequence, but in machine learning,

45
00:04:03,550 --> 00:04:05,040
you don't have to.

46
00:04:05,320 --> 00:04:09,620
In the olden days of machine learning, we couldn't deal with sequences very well,

47
00:04:09,730 --> 00:04:15,820
and so we resorted to more naive models. The typical way of doing this would be to create what is called

48
00:04:15,820 --> 00:04:17,650
a bag-of-words feature vector.

49
00:04:22,820 --> 00:04:24,970
Basically, it works like this.

50
00:04:24,980 --> 00:04:30,470
Suppose you're doing document classification. So, for example, you take an email and you want to know: is

51
00:04:30,470 --> 00:04:31,640
it spam or not spam?

52
00:04:32,090 --> 00:04:33,710
So what do you do?

53
00:04:33,710 --> 00:04:39,170
Well, first, let's take a big, long feature vector with length equal to the number of words in the English

54
00:04:39,170 --> 00:04:40,360
language.

55
00:04:40,550 --> 00:04:46,730
Then, for the word that each component of the feature vector represents, we insert the count of how many

56
00:04:46,730 --> 00:04:48,590
times that word appeared in our email.

57
00:04:53,730 --> 00:04:55,610
So suppose, for simplicity's sake,

58
00:04:55,620 --> 00:04:59,050
there are only five words in the English language: insurance,

59
00:04:59,060 --> 00:05:02,520
loan, pickles, backpack, and football.

60
00:05:02,520 --> 00:05:06,840
So now let's say our email contained the word "insurance" three times,

61
00:05:06,840 --> 00:05:14,010
the word "loan" one time, and the rest of the words zero times. Then our feature vector would be [3, 1, 0,

62
00:05:14,020 --> 00:05:15,370
0, 0].

63
00:05:15,420 --> 00:05:20,760
In this way, we can convert every email in our dataset into a feature vector,

64
00:05:25,910 --> 00:05:32,630
and then we get back to the situation where we can say all data is the same. As our input X,

65
00:05:32,660 --> 00:05:34,610
we again have a table of numbers.

66
00:05:34,610 --> 00:05:41,530
They just happen to represent word counts. For the labels, it's just 0 and 1 for spam or not spam.

67
00:05:41,810 --> 00:05:47,450
As usual, you can run any binary classifier on this, such as logistic regression or a neural network.
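As a rough sketch of the counting step just described, here is one way to build the vector in plain Python. The five-word vocabulary comes from the lecture; the `bag_of_words` helper, the whitespace tokenization, and the sample email text are assumptions made only for illustration.

```python
from collections import Counter

# Toy five-word vocabulary from the lecture; its order fixes each word's position.
VOCAB = ["insurance", "loan", "pickles", "backpack", "football"]

def bag_of_words(text, vocab=VOCAB):
    """Return the count of each vocabulary word in `text`, in vocabulary order."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[word] for word in vocab]

# Hypothetical email: "insurance" three times, "loan" once.
email = "insurance loan insurance insurance"
features = bag_of_words(email)
print(features)  # [3, 1, 0, 0, 0]
```

Stacking one such vector per email gives the table of numbers X, which can then be passed to any binary classifier.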

68
00:05:52,760 --> 00:05:59,720
But there's a problem with these bag-of-words representations. When we only use word counts, we lose information

69
00:05:59,720 --> 00:06:01,980
about the order of the words.

70
00:06:02,030 --> 00:06:04,700
Here's a simple example of why it matters.

71
00:06:04,790 --> 00:06:06,680
Consider the phrase dog toy.

72
00:06:07,490 --> 00:06:10,310
Now consider the phrase toy dog.

73
00:06:10,310 --> 00:06:14,060
Clearly these two phrases mean something completely different.

74
00:06:14,060 --> 00:06:21,110
And yet a bag-of-words model would treat them the same. If you want to learn about how to deal with text,

75
00:06:21,360 --> 00:06:23,780
that will be a major topic of the next section,

76
00:06:23,840 --> 00:06:28,790
but this section will focus mostly on modeling continuous-valued sequences of data.

77
00:06:33,920 --> 00:06:37,900
Usually, when we think of a sequence, we think of something like what you see here.

78
00:06:38,000 --> 00:06:43,370
This is just like how we started with linear regression, where the simplest way to look at it is a line

79
00:06:43,370 --> 00:06:44,270
of best fit.

80
00:06:44,270 --> 00:06:50,320
This is the most basic thing we can do because we can see it. So analogously, for sequences, we're going

81
00:06:50,320 --> 00:06:54,690
to think of one-dimensional sequences because we can see them.

82
00:06:54,820 --> 00:06:59,470
And while this is technically a sequence, we can extend this concept a little bit.

83
00:06:59,470 --> 00:07:04,450
Recall that when we were working with non-sequential data, like what you saw with linear regression,

84
00:07:04,840 --> 00:07:07,960
in general the data was of shape N by D.

85
00:07:08,440 --> 00:07:14,960
N is the number of samples and D is the number of features, as usual. Even when we are doing one-dimensional

86
00:07:14,960 --> 00:07:20,500
linear regression, we still use a two-dimensional array to represent the input data.

87
00:07:20,810 --> 00:07:22,450
We still say it's N by D.

88
00:07:22,580 --> 00:07:23,870
It's just that D is 1.

89
00:07:29,030 --> 00:07:31,240
So, for time series or sequential data,

90
00:07:31,250 --> 00:07:34,550
let's think about what the shape of the data should be.

91
00:07:34,550 --> 00:07:37,940
First, consider just a basic sequence, like a stock.

92
00:07:37,940 --> 00:07:40,770
How can we represent the length of the sequence?

93
00:07:40,850 --> 00:07:43,110
Should it be N? Should it be D?

94
00:07:43,220 --> 00:07:48,020
In fact, it's not really either of these. We don't want to consider each point in the sequence to be a

95
00:07:48,020 --> 00:07:53,860
sample, nor do we want each point in the sequence to be considered a feature. As a side note,

96
00:07:53,870 --> 00:07:57,420
we could do that, but that's not general enough for RNNs.

97
00:07:57,650 --> 00:08:01,870
So it's clear we need a new letter to represent the length of a sequence.

98
00:08:01,910 --> 00:08:04,400
How about the letter capital T?

99
00:08:04,400 --> 00:08:09,910
That seems pretty intuitive because the word "time" starts with the letter T. So it makes sense,

100
00:08:09,920 --> 00:08:15,470
if we're dealing with a time series or any other continuous-valued signal, that the length of the signal

101
00:08:15,470 --> 00:08:17,720
should be denoted with a new letter, T.

102
00:08:22,810 --> 00:08:24,360
Considering what you just learned,

103
00:08:24,430 --> 00:08:29,380
I'm going to tell you that our data, when we're dealing with RNNs, will have shape

104
00:08:29,460 --> 00:08:30,730
N by T by D.

105
00:08:31,420 --> 00:08:33,760
Let's break down what this means.

106
00:08:34,100 --> 00:08:38,720
We know that N is the number of samples and D is the number of features, and you've just learned that

107
00:08:38,720 --> 00:08:46,430
T is the number of steps in a sequence. As you can see, this is a three-dimensional signal: N is the first

108
00:08:46,450 --> 00:08:50,820
dimension, T is the second dimension, and D is the third dimension.

109
00:08:51,170 --> 00:08:56,450
When you're dealing with sequence data, sometimes students can get confused about which thing is N, which

110
00:08:56,450 --> 00:08:59,120
thing is T, and which thing is D.

111
00:08:59,120 --> 00:09:01,820
So I think the best way to understand this is by example.

112
00:09:06,910 --> 00:09:12,790
Let's suppose we want to model the path that people take to get to work, so you record data from

113
00:09:12,790 --> 00:09:15,130
the GPS in everyone's car.

114
00:09:15,250 --> 00:09:19,860
In this scenario, what is N, what is T, and what is D?

115
00:09:19,870 --> 00:09:25,810
Well, since we're modeling people's trips to work, one sample will consist of one person's single trip

116
00:09:25,810 --> 00:09:26,390
to work.

117
00:09:27,130 --> 00:09:32,830
You might record data over multiple days, in which case that person could have multiple trips to

118
00:09:32,830 --> 00:09:36,260
work. Those would count as multiple samples.

119
00:09:36,340 --> 00:09:36,600
All right.

120
00:09:36,610 --> 00:09:37,830
So what's D?

121
00:09:38,170 --> 00:09:45,720
In this case, D is probably two, because the GPS will give you both latitude and longitude measurements.

122
00:09:45,730 --> 00:09:52,780
Okay, so what's T? T is the number of lat-long measurements taken from the time the person embarks on

123
00:09:52,780 --> 00:09:56,880
their journey to work until they arrive at work.

124
00:09:57,070 --> 00:10:02,440
So let's say it takes a person 30 minutes to get to work, and you record lat-long coordinates every

125
00:10:02,440 --> 00:10:09,360
second. In that case, the number of steps you would have is 30 times 60, which is 1800.

126
00:10:10,000 --> 00:10:13,080
So the length of your sequence, T, would be 1800.
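To make the N by T by D layout concrete, here is a small sketch of the GPS example using plain nested lists. The choice of three trips and the all-zero placeholder coordinates are assumptions for illustration only; real data would hold actual latitude and longitude readings.

```python
# Hypothetical setup: 3 recorded trips, 30 minutes each, one GPS fix per second.
N = 3          # samples: one per trip to work
T = 30 * 60    # time steps per trip
D = 2          # features per time step: latitude and longitude

# Placeholder coordinates (all zeros) arranged as an N x T x D nested list.
trips = [[[0.0 for _ in range(D)] for _ in range(T)] for _ in range(N)]

shape = (len(trips), len(trips[0]), len(trips[0][0]))
print(shape)  # (3, 1800, 2)

# trips[n][t] is the [latitude, longitude] pair for trip n at second t.
print(trips[0][0])  # [0.0, 0.0]
```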

127
00:10:18,230 --> 00:10:23,750
Now, you might be wondering: doesn't it take each person a different amount of time to get to work? And

128
00:10:23,750 --> 00:10:25,800
indeed, you would be correct.

129
00:10:25,970 --> 00:10:31,550
In TensorFlow and Keras, we generally work with equal-length sequences, so just assume that this is

130
00:10:31,550 --> 00:10:32,820
the case for now.

131
00:10:32,990 --> 00:10:34,920
We'll discuss this in more detail later.

132
00:10:35,000 --> 00:10:36,920
For now, let's think of more examples.

133
00:10:41,990 --> 00:10:44,240
Here's an example where D equals 1.

134
00:10:44,360 --> 00:10:50,780
So it's kind of like the simple case, analogous to our one-dimensional linear regression. A simple scenario

135
00:10:50,780 --> 00:10:56,840
for this would be stock prices. A stock price is a one-dimensional signal: the vertical axis is price

136
00:10:57,140 --> 00:10:59,490
and the horizontal axis is time.

137
00:10:59,630 --> 00:11:05,180
In other words, to have a signal where D is bigger than one, we would need more measurements per unit

138
00:11:05,180 --> 00:11:07,430
time. For a single stock price,

139
00:11:07,430 --> 00:11:09,860
we just have one: the price.

140
00:11:09,860 --> 00:11:13,360
So what would N be in this case?

141
00:11:13,550 --> 00:11:18,020
This is a complicated question, actually, but let's again consider something simple.

142
00:11:18,200 --> 00:11:24,460
Let's say we want to break up our stock price signal into windows of length 10. So our task might be

143
00:11:24,460 --> 00:11:28,300
something like: using 10 sequential measurements of the stock price,

144
00:11:28,450 --> 00:11:31,240
try to predict the next stock price.

145
00:11:31,240 --> 00:11:37,060
In this case, N would just be the total number of windows of length 10 that we have. By the way, this

146
00:11:37,060 --> 00:11:40,340
should remind you of some of the convolution arithmetic we did.

147
00:11:40,960 --> 00:11:47,290
If your total sequence of stock prices has length 100, and you want to know how many windows of length

148
00:11:47,290 --> 00:11:50,030
10 can be taken from the sequence,

149
00:11:50,050 --> 00:11:54,460
that's just 100 minus 10 plus 1, which is 91.

150
00:11:54,520 --> 00:12:00,010
In other words, if you have L consecutive measurements and you have a window of length T, then the number

151
00:12:00,010 --> 00:12:02,740
of windows is L minus T plus 1.
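The window count can be checked with a short sketch. The `sliding_windows` helper and the stand-in price series (just the integers 0 to 99) are assumptions for illustration, not code from the course.

```python
def sliding_windows(series, T):
    """Return every run of T consecutive values from `series`."""
    return [series[i:i + T] for i in range(len(series) - T + 1)]

prices = list(range(100))              # stand-in for L = 100 stock prices
windows = sliding_windows(prices, 10)  # window length T = 10

print(len(windows))      # 91, i.e. L - T + 1
print(windows[0])        # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(windows[-1][-1])   # 99
```

One detail worth keeping in mind: if every window must also be paired with the value that follows it as a prediction target, the final window has no target, which leaves L minus T usable input-target pairs.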

152
00:12:07,960 --> 00:12:09,610
Of course, with stock prices,

153
00:12:09,610 --> 00:12:12,040
you might not want to measure just one stock.

154
00:12:12,040 --> 00:12:13,890
There could be multiple stocks.

155
00:12:14,020 --> 00:12:18,080
So here's an example with stock prices where D is bigger than one.

156
00:12:18,160 --> 00:12:24,910
Imagine, for example, that you've taken measurements for 500 stocks, just like the S&P 500.

157
00:12:24,970 --> 00:12:29,140
Now you have the stock price for 500 different stocks over time.

158
00:12:29,140 --> 00:12:36,130
That means D equals 500. You can again artificially make windows of length T, and then the

159
00:12:36,130 --> 00:12:39,220
number of samples you have is just the number of these windows.

160
00:12:44,330 --> 00:12:47,660
Here's another example, which is somewhat futuristic.

161
00:12:47,720 --> 00:12:53,480
Imagine you are working with Elon Musk at Neuralink, and you record voltages from electrodes placed

162
00:12:53,480 --> 00:12:54,930
in the brain.

163
00:12:55,010 --> 00:13:01,040
In this scenario, D is equal to the number of electrodes. At each point in time, you're going to have measured

164
00:13:01,340 --> 00:13:04,300
D different voltages at different locations in the brain.

165
00:13:09,410 --> 00:13:15,110
One of the early applications of brain-computer interfaces was moving a mouse cursor on a screen and

166
00:13:15,110 --> 00:13:16,130
typing,

167
00:13:16,490 --> 00:13:19,520
in other words, everything you need to use a computer.

168
00:13:19,520 --> 00:13:24,310
So suppose we are trying to predict the letter or number you are thinking of or want to type,

169
00:13:24,530 --> 00:13:28,430
and this is based on the voltages recorded from your brain for one second

170
00:13:28,730 --> 00:13:31,180
as you think of that letter or number.

171
00:13:31,670 --> 00:13:38,360
Now let's say measurements are taken from the electrodes at a sampling rate of one sample per millisecond.

172
00:13:38,360 --> 00:13:41,300
That's 1000 samples per second.

173
00:13:41,300 --> 00:13:47,030
That means for the one second in which we make the recording of you thinking of the letter, we've collected

174
00:13:47,030 --> 00:13:49,380
1000 measurements at each electrode.

175
00:13:49,610 --> 00:13:57,100
So T equals 1000, and N, as usual, would be the number of letters that you thought of.

176
00:13:57,180 --> 00:14:02,010
So if we did a test where you had to think of a series of letters, and collected ten thousand

177
00:14:02,010 --> 00:14:03,270
one-second recordings,

178
00:14:03,270 --> 00:14:05,250
that means N equals ten thousand.

179
00:14:10,430 --> 00:14:14,240
One question you might be pondering is: why is our data of shape

180
00:14:14,240 --> 00:14:14,780
N by T

181
00:14:14,780 --> 00:14:15,670
by D?

182
00:14:15,860 --> 00:14:16,420
Why not

183
00:14:16,430 --> 00:14:22,850
N by D by T, or some other order? In fact, there is no particular reason it needs to be this way,

184
00:14:23,300 --> 00:14:28,760
except for the fact that many Python libraries for machine learning follow this convention.

185
00:14:28,760 --> 00:14:32,490
This is just like our tabular X data, where we have N by D,

186
00:14:32,750 --> 00:14:36,840
or our image data, where we have N by H by W by C.

187
00:14:36,890 --> 00:14:42,050
This could just as easily be D by N, or N by C by H by W.

188
00:14:42,080 --> 00:14:44,900
In fact, we've already seen that in PyTorch CNNs,

189
00:14:44,900 --> 00:14:47,920
the convention is N by C by H by W.

190
00:14:47,960 --> 00:14:50,310
So this stuff is not set in stone.

191
00:14:50,330 --> 00:14:56,690
My goal in this course is to show you how we do things in PyTorch. In Python, the general pattern is

192
00:14:56,690 --> 00:14:58,190
that we always put N first,

193
00:14:58,190 --> 00:15:00,100
so that should go without saying.

194
00:15:00,410 --> 00:15:07,800
There are certainly other libraries which, by convention, do not put N first. The other convention is

195
00:15:07,800 --> 00:15:10,080
that we put the number of features last.

196
00:15:10,170 --> 00:15:12,050
So that's why we have N by D,

197
00:15:12,060 --> 00:15:14,260
where D is the number of features.

198
00:15:14,280 --> 00:15:19,710
This also makes sense for images, where we have N by H by W by C, because C is the number of feature

199
00:15:19,710 --> 00:15:24,420
maps, which is the analogous version of features for image data.

200
00:15:24,420 --> 00:15:29,580
And this also makes sense for sequence data, because this D still represents the number of features.

201
00:15:34,680 --> 00:15:35,270
By the way,

202
00:15:35,280 --> 00:15:42,660
one important thing to do is visualize the shapes of the data in your mind. Before, where we had N by

203
00:15:42,660 --> 00:15:49,000
D inputs, we visualized that as a rectangle. Rectangles are two-dimensional, so that makes sense.

204
00:15:49,050 --> 00:15:53,680
Now we have a three-dimensional object, so it is no longer a rectangle; it's a box.

205
00:15:54,180 --> 00:16:00,420
So when you hear me say N by T by D, you should automatically think of a box without needing me to

206
00:16:00,420 --> 00:16:01,980
tell you to do so.

207
00:16:02,100 --> 00:16:06,780
This should be very helpful as you write your code and reason about how RNNs work.

208
00:16:11,940 --> 00:16:12,720
As promised,

209
00:16:12,720 --> 00:16:17,670
one thing we have to consider is: what if we have variable-length sequences?

210
00:16:17,670 --> 00:16:19,850
It's clear that this is a pretty common scenario.

211
00:16:19,860 --> 00:16:26,040
Just think of sentences. Not all sentences have the same number of words: some sentences have one or two

212
00:16:26,040 --> 00:16:29,810
words, and other sentences have one hundred words.

213
00:16:29,850 --> 00:16:32,540
This is actually a very interesting question.

214
00:16:32,670 --> 00:16:37,250
In the past I've written code that can handle sequences of variable length.

215
00:16:37,260 --> 00:16:40,630
The problem is this gets complicated fast.

216
00:16:40,820 --> 00:16:45,680
Another problem with this is that it means you'll be using inefficient data structures.

217
00:16:45,680 --> 00:16:51,280
If you have an N by T by D box where T is constant, then we can store this in a single NumPy

218
00:16:51,270 --> 00:16:55,880
array. That is efficient because NumPy was written to be efficient.

219
00:16:56,720 --> 00:17:02,870
But if T is variable and depends on which sample we are looking at, we might use the notation T of n

220
00:17:02,900 --> 00:17:03,560
instead,

221
00:17:03,650 --> 00:17:10,260
since T depends on the sample n. In this case, we cannot store this in a single NumPy array.

222
00:17:10,980 --> 00:17:19,020
Instead, we would have to use a list, which is inefficient.

223
00:17:19,190 --> 00:17:23,920
On the other hand, there are some inefficiencies with constant-length sequences as well.

224
00:17:23,910 --> 00:17:28,670
In PyTorch, it's easiest to start with constant-length sequences, so that's what we'll be doing in

225
00:17:28,670 --> 00:17:29,970
this course.

226
00:17:29,990 --> 00:17:35,210
This is the case with other modern deep learning libraries as well, such as TensorFlow and Keras.

227
00:17:35,630 --> 00:17:41,430
If we do end up having variable-length sequences as input, what we can do is pad the sequences with zeros

228
00:17:41,750 --> 00:17:48,390
so that they all have the same length as the longest sequence. For example, if your longest sequence has

229
00:17:48,390 --> 00:17:54,180
one hundred words, then even sentences with one or two words would be the first one or two words and

230
00:17:54,180 --> 00:17:55,900
then 99 or 98

231
00:17:55,950 --> 00:17:58,650
null values. As you can imagine,

232
00:17:58,680 --> 00:18:04,200
this is going to take up a lot of unnecessary space and processing time, because the neural network still

233
00:18:04,200 --> 00:18:05,540
thinks your entire

234
00:18:05,610 --> 00:18:10,450
N by T by D box is full of legitimate data. At the same time,

235
00:18:10,470 --> 00:18:12,690
it does make coding RNNs much easier.

236
00:18:12,690 --> 00:18:14,340
So it's a tradeoff.

237
00:18:14,370 --> 00:18:17,220
Ultimately, this is the way we do things in PyTorch,

238
00:18:17,250 --> 00:18:22,320
so unless you want to write your own custom code, this will be our data format for sequences.

239
00:18:27,470 --> 00:18:33,230
Luckily, what PyTorch does is a kind of compromise between constant-length sequences and variable-length

240
00:18:33,230 --> 00:18:34,780
sequences.

241
00:18:34,790 --> 00:18:40,550
The idea is, instead of having all your sequences padded to be the same length as the longest sequence in the dataset,

242
00:18:40,550 --> 00:18:43,010
just do that across single batches.

243
00:18:43,190 --> 00:18:48,560
As you know, our method of training in deep learning is batch gradient descent, so we're only going to

244
00:18:48,560 --> 00:18:53,360
look at a subset of our training data at any given step. During that step,

245
00:18:53,360 --> 00:18:59,510
we can pad the sequences in the batch on the fly so that they are only as long as the longest sequence

246
00:18:59,510 --> 00:19:02,420
in that batch, instead of the entire training set.
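The difference between padding the whole dataset and padding one batch at a time can be sketched in plain Python. The `pad_batch` helper, the toy integer sequences, and the batch size of two are all assumptions for illustration; this is not how PyTorch implements padding internally.

```python
def pad_batch(batch, pad_value=0):
    """Right-pad every sequence in `batch` to the length of its longest member."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (longest - len(seq)) for seq in batch]

# Hypothetical dataset of integer-encoded sequences with lengths 2, 3, 1, and 6.
data = [[5, 2], [7, 1, 4], [9], [3, 8, 6, 2, 1, 4]]

# Padding the whole dataset at once: every sequence grows to length 6.
padded_all = pad_batch(data)
print([len(seq) for seq in padded_all])  # [6, 6, 6, 6]

# Padding per batch of two: the first batch only grows to length 3.
batches = [data[i:i + 2] for i in range(0, len(data), 2)]
padded_batches = [pad_batch(batch) for batch in batches]
print([len(batch[0]) for batch in padded_batches])  # [3, 6]
print(padded_batches[0])  # [[5, 2, 0], [7, 1, 4]]
```

In this toy case the first batch carries one padded value instead of the five it would need under dataset-wide padding.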

247
00:19:04,620 --> 00:19:09,630
Now, you might think this is still going to be a problem: if you have one sequence in your batch that

248
00:19:09,630 --> 00:19:12,720
has length one hundred and all the rest are length one,

249
00:19:12,900 --> 00:19:19,440
then you'll have ninety-nine padded values in all those shorter sequences. But what PyTorch does,

250
00:19:19,500 --> 00:19:20,940
for certain generators,

251
00:19:21,090 --> 00:19:26,460
is try to group together sequences with similar lengths. So the batches you see will still

252
00:19:26,460 --> 00:19:32,310
be random, but not as random as they would be if you didn't care about the sequence length.

253
00:19:32,310 --> 00:19:36,840
Of course, if you're creating your own data generator, you would have to write that kind of custom code

254
00:19:36,840 --> 00:19:37,600
yourself.
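For a sense of what such custom code might look like, here is a deliberately simplified sketch: sort by length, cut into batches, then shuffle the order of the batches. The `bucketed_batches` name, the fixed seed, and the toy data are assumptions, and real generators typically add more randomness than this.

```python
import random

def bucketed_batches(sequences, batch_size, seed=0):
    """Group similar-length sequences into batches, then shuffle the batch order."""
    ordered = sorted(sequences, key=len)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    random.Random(seed).shuffle(batches)  # batch order is random, contents are not
    return batches

# Hypothetical data: two long sequences mixed in with several short ones.
data = [[0] * n for n in [1, 100, 2, 1, 98, 2]]
for batch in bucketed_batches(data, batch_size=2):
    print(sorted(len(seq) for seq in batch))
```

Each batch printed here pairs sequences of nearly equal length (1 with 1, 2 with 2, 98 with 100), so very little padding is needed within any batch.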

255
00:19:42,730 --> 00:19:47,890
Finally, note that it's possible to take things one step further and actually deal with variable-length

256
00:19:47,890 --> 00:19:51,610
sequences almost without dealing with padding.

257
00:19:51,610 --> 00:19:55,150
This part of the lecture is optional and outside the scope of this course.

258
00:19:55,270 --> 00:20:01,280
So don't feel like you have to understand this to progress through this section. The idea with this is

259
00:20:01,280 --> 00:20:04,740
that although your batch will be of size N by T by D,

260
00:20:04,790 --> 00:20:07,440
where T is the length of the longest sequence in the batch,

261
00:20:07,760 --> 00:20:16,040
if we know the length of each sequence in the batch, then we can be intelligent about which data we process.

262
00:20:16,210 --> 00:20:22,750
Ultimately, it looks kind of weird, because your batch of data will still be a padded tensor, but then, using

263
00:20:22,750 --> 00:20:27,340
the length of each sequence, you'll convert it into an unpadded tensor.

264
00:20:27,340 --> 00:20:32,200
Then you'll pass it through the RNN unit and get back an unpadded output.

265
00:20:32,200 --> 00:20:37,210
And finally, you still have to reapply the padding and then pass your data through the rest of the neural

266
00:20:37,210 --> 00:20:37,770
network.

267
00:20:38,530 --> 00:20:43,990
Unfortunately, doing this requires you to keep track of the lengths of each of your data sequences, even

268
00:20:43,990 --> 00:20:47,740
while shuffling and iterating over randomized batches.

269
00:20:47,770 --> 00:20:49,070
This approach is more involved.

270
00:20:49,090 --> 00:20:52,930
So at this time it's going to be considered to be outside the scope of this course.

271
00:20:58,050 --> 00:20:58,370
All right.

272
00:20:58,400 --> 00:21:01,110
So what are the main takeaways from this lecture?

273
00:21:01,160 --> 00:21:06,320
Number one: when you have sequence data, that means your data isn't just a single feature vector, but a

274
00:21:06,320 --> 00:21:09,660
value or a set of values that can vary with time.

275
00:21:09,710 --> 00:21:15,050
Number two: lots of people think of time series, which are one-dimensional values that vary with time,

276
00:21:15,300 --> 00:21:19,330
but we can also have D-dimensional vectors that vary with time.

277
00:21:19,580 --> 00:21:24,470
Some examples of that are weather readings from multiple weather towers or brain signals from multiple

278
00:21:24,470 --> 00:21:26,060
electrodes on the brain.

279
00:21:27,900 --> 00:21:32,980
Number three: in modern deep learning libraries, you generally work with constant-length sequences.

280
00:21:33,090 --> 00:21:39,040
If you want to do better, you can use constant-length sequences per batch instead of across the whole dataset.

281
00:21:39,240 --> 00:21:44,400
If you want to do even better, you can convert those constant-length batches back into variable-length

282
00:21:44,400 --> 00:21:48,360
sequences, but that's outside the scope of this course.

283
00:21:48,450 --> 00:21:54,960
Number four: by convention, we'll organize our sequence data to be of shape N by T by D, where N is the number

284
00:21:54,960 --> 00:21:59,770
of samples, T is the sequence length, and D is the observation dimensionality.

285
00:22:00,000 --> 00:22:04,950
You should always be visualizing these shapes in your mind, which will help your understanding immensely,

286
00:22:04,950 --> 00:22:09,760
but I won't prompt you to do so, and therefore you'll just have to remember to do it yourself.