1
00:00:10,690 --> 00:00:15,220
So this lecture is going to answer the question, why do we care about shapes?

2
00:00:15,430 --> 00:00:21,280
Well, this is an important question because all of the libraries we will use work under the assumption

3
00:00:21,280 --> 00:00:23,680
that your data has a specific shape.

4
00:00:23,710 --> 00:00:29,770
If you store your data in any way you want without conforming to the API of those libraries, well then

5
00:00:29,770 --> 00:00:31,520
they just won't work as expected.

6
00:00:31,540 --> 00:00:34,180
So getting your shapes right is a must.

7
00:00:34,480 --> 00:00:38,950
Furthermore, I find that understanding shapes helps you visualize the data.

8
00:00:39,100 --> 00:00:42,340
Time series are generally very easy to visualize.

9
00:00:42,370 --> 00:00:47,500
In fact, you are probably visualizing a time series right now, such as the power of words.

10
00:00:47,650 --> 00:00:52,240
However, this becomes a bit more difficult when the time series is multidimensional.

11
00:00:52,360 --> 00:00:58,480
So considering the shapes of things helps you think about the data in a more real and physical way.

12
00:01:02,830 --> 00:01:08,830
So the simplest time series is a one-dimensional time series, for example, the daily temperature over

13
00:01:08,830 --> 00:01:09,520
time.

14
00:01:09,670 --> 00:01:12,940
In this case, your data would be a one dimensional array.

15
00:01:13,120 --> 00:01:17,620
So if you have a time series of length T, then you would have an array of size T.

16
00:01:17,890 --> 00:01:22,030
You can think of this as an array that goes from left to right or from top to bottom.

17
00:01:22,450 --> 00:01:28,150
However, this is actually an illusion because in one dimension there is no concept of left and right

18
00:01:28,150 --> 00:01:29,450
or top and bottom.

19
00:01:29,470 --> 00:01:34,480
This only happens because we're representing a one dimensional thing on a two dimensional screen.
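
To make the 1D point concrete, here is a minimal sketch assuming NumPy is the array library (the lecture does not name one here) and using made-up temperature values. It shows that a 1D array has a single axis, and that "row" or "column" only appears once you reshape it into 2D.

```python
import numpy as np

# Hypothetical daily temperatures, so T = 5 (values are made up).
temps = np.array([21.0, 22.5, 19.8, 20.1, 23.4])

print(temps.shape)  # (5,)  a single axis of length T
print(temps.ndim)   # 1

# "Column" and "row" only exist after adding a second axis.
as_column = temps.reshape(-1, 1)  # T by 1
as_row = temps.reshape(1, -1)     # 1 by T
print(as_column.shape)  # (5, 1)
print(as_row.shape)     # (1, 5)
```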

20
00:01:39,100 --> 00:01:44,140
In Pandas, data frames are two dimensional, while series objects are one dimensional.

21
00:01:44,470 --> 00:01:48,280
And either of these can be used to store a one-dimensional time series.

22
00:01:48,430 --> 00:01:54,520
If you use a series, then your series would be a series of length T, but if you use a data frame,

23
00:01:54,520 --> 00:01:57,560
then it would be a table with T rows and one column.

24
00:01:57,580 --> 00:01:59,560
In other words, T by one.

25
00:02:00,460 --> 00:02:05,320
Now, this is where things might get tricky because technically this isn't required.

26
00:02:05,350 --> 00:02:09,730
You might ask why not store it as a one by T instead of a T by one?

27
00:02:10,060 --> 00:02:15,820
The answer is because it's typically our convention and it works nicely with the libraries we will use.
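
As a rough sketch of the two options just described, assuming pandas and NumPy are installed and using made-up temperature values, the same length-T time series can be held as a Series or as a T by 1 DataFrame. The column name "temperature" is only an illustrative choice.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures, so T = 5 (values are made up).
temps = np.array([21.0, 22.5, 19.8, 20.1, 23.4])
T = len(temps)

series = pd.Series(temps, name="temperature")  # one-dimensional, length T
frame = series.to_frame()                      # two-dimensional, T rows and 1 column

print(series.shape)   # (5,)
print(frame.shape)    # (5, 1)

# The 1 by T layout is possible, but it is not the convention used here.
print(frame.T.shape)  # (1, 5)
```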

28
00:02:19,980 --> 00:02:23,970
Now, of course, the data frame is capable of holding multiple columns.

29
00:02:24,060 --> 00:02:29,850
So now suppose that you have a multidimensional time series, for example, the daily temperature in

30
00:02:29,850 --> 00:02:31,020
multiple cities.

31
00:02:31,200 --> 00:02:36,360
And suppose that you have the temperature for D cities over a period of T days.

32
00:02:36,390 --> 00:02:38,940
Then your data frame would have the shape T by D.

33
00:02:39,660 --> 00:02:42,660
So again, why T by D and not D by T?

34
00:02:43,320 --> 00:02:48,540
And again, the answer is because this is what works nicely with the libraries we use.

35
00:02:48,750 --> 00:02:53,970
And note that this might actually go against your intuition because when we plot a time series, we

36
00:02:54,000 --> 00:02:56,820
typically think of it as going from left to right.

37
00:02:56,910 --> 00:03:00,570
But in this case, we store the time series going from top to bottom.

38
00:03:00,690 --> 00:03:04,200
So just remember that T by D is the convention.
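
Here is a small sketch of the T by D layout, assuming pandas and NumPy. The city names, the dates, and the random values are all placeholders for illustration. Time runs down the rows, and each city gets its own column.

```python
import numpy as np
import pandas as pd

T, D = 4, 3  # hypothetical: 4 days, 3 cities

# Made-up temperatures; a seeded generator keeps the sketch reproducible.
rng = np.random.default_rng(0)
data = rng.normal(loc=20.0, scale=3.0, size=(T, D))

dates = pd.date_range("2024-01-01", periods=T, freq="D")
frame = pd.DataFrame(data, index=dates, columns=["city_a", "city_b", "city_c"])

print(frame.shape)  # (4, 3), that is T by D

one_city = frame["city_a"]  # one time series: a Series of length T
one_day = frame.iloc[0]     # one time step: a Series of length D
print(one_city.shape)  # (4,)
print(one_day.shape)   # (3,)
```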

39
00:03:08,570 --> 00:03:11,210
Now things can become even more tricky.

40
00:03:11,540 --> 00:03:16,010
Suppose that we have multiple time series, each with multiple dimensions.

41
00:03:16,040 --> 00:03:20,390
For example, suppose we're reading accelerometer data from your smartphone.

42
00:03:20,420 --> 00:03:25,400
You'd like to use this data to predict whether the user is walking, sitting or lying down.

43
00:03:25,760 --> 00:03:30,170
In this case, you might record acceleration in the X, Y, and Z dimensions.

44
00:03:30,170 --> 00:03:32,000
So you have D equals three.

45
00:03:32,930 --> 00:03:38,450
And suppose that each recording you take lasts two seconds and you sample at 50Hz.

46
00:03:38,540 --> 00:03:44,030
So each recording has 100 samples in total, which means T equals 100.

47
00:03:44,390 --> 00:03:46,340
And of course, this is machine learning.

48
00:03:46,340 --> 00:03:49,040
So you're going to have multiple training samples.

49
00:03:49,310 --> 00:03:54,800
Suppose that you collect 1 million samples of users walking, sitting or lying down.

50
00:03:55,310 --> 00:04:01,010
In this case, we introduce a new letter N which by convention we will use for the number of samples

51
00:04:01,010 --> 00:04:02,090
in a dataset.

52
00:04:02,120 --> 00:04:04,250
So N equals 1 million.

53
00:04:04,280 --> 00:04:08,540
We have 1 million samples of users walking, sitting or lying down.

54
00:04:09,260 --> 00:04:13,790
In this case, our convention is to organize our data in an array of size

55
00:04:13,790 --> 00:04:14,520
N by T by D.

56
00:04:15,830 --> 00:04:21,110
So it's number of samples by number of time steps by number of dimensions.
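
The accelerometer example can be sketched as follows, assuming NumPy. The array is filled with zeros as a placeholder for real readings, and N is set to a small stand-in value instead of 1 million so the sketch stays light on memory.

```python
import numpy as np

sample_rate_hz = 50
duration_s = 2
T = sample_rate_hz * duration_s  # 100 time steps per recording
D = 3                            # x, y, z acceleration
N = 8                            # stand-in for the 1 million recordings in the lecture

# Placeholder data: N recordings, each T by D.
X = np.zeros((N, T, D))

print(X.shape)           # (8, 100, 3), that is N by T by D
print(X[0].shape)        # (100, 3): one recording, T by D
print(X[:, :, 0].shape)  # (8, 100): the x-axis signal of every recording
```

Indexing the first axis pulls out a single sample, which is one reason putting N first is convenient.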

57
00:04:21,470 --> 00:04:26,870
Now, again, you might be wondering why not D by T by N or T by N by D?

58
00:04:27,200 --> 00:04:33,800
And again, the answer is because this is what our libraries expect. In other libraries and other languages,

59
00:04:33,800 --> 00:04:35,330
things might work a different way.

60
00:04:35,360 --> 00:04:39,070
So it's always good to be cognizant of what convention is being used.

61
00:04:39,080 --> 00:04:41,900
And that's just part of your job as a data scientist.

62
00:04:45,940 --> 00:04:51,310
So when we say N by T by D, recognize that this has three dimensions.

63
00:04:51,340 --> 00:04:54,850
In other words, you can think of it like a box in 3D space.

64
00:04:55,090 --> 00:05:00,700
I always try to encourage students to automatically think of a box whenever they hear N by T by D.

65
00:05:01,210 --> 00:05:08,290
This is because just saying N by T by D can be a bit abstract, but a box is clearly very physical and

66
00:05:08,290 --> 00:05:09,720
easy to visualize.

67
00:05:09,730 --> 00:05:13,570
So try to make this a reflex and your life will become much easier.

68
00:05:17,900 --> 00:05:23,060
So another reason this is useful is because your computer screen is only two dimensional.

69
00:05:23,180 --> 00:05:28,200
So suppose you had a three dimensional array in your computer and then you printed it out.

70
00:05:28,220 --> 00:05:29,690
Well, what would happen?

71
00:05:29,960 --> 00:05:34,820
The answer is that you would lose this 3D structure because your screen is only 2D.

72
00:05:35,000 --> 00:05:38,270
So you can see here that it just looks like nonsense.

73
00:05:38,300 --> 00:05:44,210
The printout doesn't give me any insight because it flattens a 3D object onto a 2D screen.

74
00:05:44,360 --> 00:05:49,340
This is unlike a 1D array and a 2D array where printing things out can be very useful.
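
As a hedged illustration, assuming NumPy and a tiny made-up array, printing a 3D array shows a stack of separate 2D tables, so checking the shape is often more informative than reading the printout.

```python
import numpy as np

# Tiny made-up array: N = 2 samples, T = 4 time steps, D = 3 dimensions.
X = np.arange(2 * 4 * 3).reshape(2, 4, 3)

print(X)        # displayed as 2 separate 4 by 3 tables, one after the other
print(X.shape)  # (2, 4, 3)
print(X.ndim)   # 3

# Looking at one sample at a time brings back a readable 2D table.
for i, recording in enumerate(X):
    print(f"sample {i}: shape {recording.shape}")
```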

75
00:05:50,180 --> 00:05:54,300
And I always have this motto that I tell students: when in doubt, print it out.

76
00:05:54,320 --> 00:05:58,430
This usually helps, but when you have a 3D array, not so much.

77
00:05:58,790 --> 00:06:05,300
But in any case, I hope you agree that thinking of shapes is very helpful both for intuition and also

78
00:06:05,300 --> 00:06:08,000
to prevent you from making errors when writing code.
