1
00:00:12,120 --> 00:00:17,850
OK, so in this lecture, we are going to investigate the Box-Cox and other time series transformations

2
00:00:17,850 --> 00:00:18,450
in code.

3
00:00:19,110 --> 00:00:23,880
So we'll start by downloading a famous time series data set called Airline Passengers.

4
00:00:27,330 --> 00:00:33,510
Next, we're going to import NumPy, pandas, and Matplotlib, nothing you haven't seen before, and

5
00:00:33,510 --> 00:00:35,760
also the boxcox function from SciPy.

6
00:00:40,400 --> 00:00:44,790
Next, we're going to read in the CSV using pandas' read_csv.

7
00:00:45,020 --> 00:00:47,270
So, again, nothing too surprising here.

8
00:00:51,000 --> 00:00:56,400
Now, one thing I always like to do whenever I load in some data is to just see what it looks like.

9
00:00:56,760 --> 00:01:01,410
This always makes me more confident in the code that I wrote, and it lets me know that my code makes

10
00:01:01,410 --> 00:01:01,940
sense.

11
00:01:02,370 --> 00:01:06,920
So we'll do a df.head() and this will print out the first few rows of the data.

12
00:01:09,550 --> 00:01:14,860
OK, so we can see that for the index, we have the date, which is monthly, and we have one integer

13
00:01:14,860 --> 00:01:16,420
column called Passengers.
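
As a rough sketch of the load-and-inspect step described above: the file name is not given here, so a few hand-typed sample rows stand in for the CSV, and the column names `Month` and `Passengers` are assumptions.

```python
import io

import pandas as pd

# Stand-in for the CSV file on disk. The column names "Month" and
# "Passengers" are assumptions made for this illustration.
csv_text = """Month,Passengers
1949-01,112
1949-02,118
1949-03,132
1949-04,129
1949-05,121
"""

# Use the date column as the index and let pandas parse it as dates.
df = pd.read_csv(io.StringIO(csv_text), index_col="Month", parse_dates=True)

# Print the first few rows to sanity-check what was loaded.
print(df.head())
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the path to the CSV.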

14
00:01:19,650 --> 00:01:23,330
The next step is to plot our data set just to see what we're dealing with.

15
00:01:28,820 --> 00:01:34,550
So there are several characteristics of this data that I want you to notice. Number one, notice that it

16
00:01:34,550 --> 00:01:35,360
has a trend.

17
00:01:35,810 --> 00:01:41,900
This time series is going upward and to the right. Number two, notice that it has some seasonality.

18
00:01:42,230 --> 00:01:44,630
That is, there is a repeating pattern in time.

19
00:01:45,530 --> 00:01:50,730
Number three, notice that the amplitude of this seasonal pattern increases over time.

20
00:01:51,170 --> 00:01:56,100
So at the beginning, the amplitude is pretty small, but at the end it gets larger and larger.

21
00:01:56,870 --> 00:02:01,910
So these are all characteristics of time series that we will explicitly model in this course.

22
00:02:02,150 --> 00:02:05,180
And you'll learn how each algorithm handles these characteristics.

23
00:02:06,680 --> 00:02:11,750
One thing we would like to see, which will make more sense later, is for things to not change over

24
00:02:11,750 --> 00:02:12,270
time.

25
00:02:12,560 --> 00:02:15,530
So, for example, this amplitude increasing over time.

26
00:02:15,710 --> 00:02:17,510
It would be nice if that went away.

27
00:02:21,160 --> 00:02:24,740
OK, so the next step will be to try the square root transform.

28
00:02:25,270 --> 00:02:28,990
So here we call the sqrt function on the Passengers column.

29
00:02:32,990 --> 00:02:35,030
The next step is to plot our new column.

30
00:02:38,900 --> 00:02:44,300
OK, so we can see that the time series has been squashed down slightly, but the amplitude of the seasonal

31
00:02:44,300 --> 00:02:46,900
pattern still seems to increase over time.

32
00:02:49,710 --> 00:02:52,000
The next step is to try the log transform.

33
00:02:52,560 --> 00:02:55,590
So here we call the log function on the passengers column.

34
00:02:58,970 --> 00:03:00,830
The next step is to plot our new column.

35
00:03:06,510 --> 00:03:12,360
OK, so this log transform seems to do a pretty good job at squashing down the data to make it look

36
00:03:12,360 --> 00:03:13,620
more uniform in time.
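
The square root and log steps can be sketched roughly as follows. This uses a synthetic series (not the airline data) whose seasonal swing grows with its level, and the new column names `SqrtPassengers` and `LogPassengers` are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (NOT the airline data): a linear trend multiplied
# by a seasonal factor, so the seasonal swing grows with the level.
t = np.arange(144)
values = (100 + 3 * t) * (1 + 0.2 * np.sin(2 * np.pi * t / 12))
df = pd.DataFrame(
    {"Passengers": values},
    index=pd.date_range("2000-01", periods=144, freq="MS"),
)

# The two transforms, each stored in a new column.
df["SqrtPassengers"] = np.sqrt(df["Passengers"])
df["LogPassengers"] = np.log(df["Passengers"])


def swing(series):
    """Peak-to-trough range of a series."""
    return series.max() - series.min()


# Compare the swing in the last 12 months with the swing in the first 12.
ratios = {}
for col in ["Passengers", "SqrtPassengers", "LogPassengers"]:
    ratios[col] = swing(df[col].iloc[-12:]) / swing(df[col].iloc[:12])
    print(f"{col}: last-year swing / first-year swing = {ratios[col]:.2f}")
```

A ratio closer to 1 means the seasonal amplitude is more uniform in time; on this synthetic series the log column comes closest, with the square root column in between.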

37
00:03:17,550 --> 00:03:23,550
The final step will be to do a Box-Cox transform. This function takes in a one-dimensional data

38
00:03:23,550 --> 00:03:28,800
set as input and it returns the transformed data along with the optimal value of lambda.

39
00:03:29,370 --> 00:03:33,330
So you can see we've assigned the result to the variables called data and lam.

40
00:03:36,560 --> 00:03:39,770
The next step is to just print out lam to see what value we got.
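
A minimal sketch of this call, assuming `scipy.stats.boxcox` and a synthetic positive series in place of the passengers column (so the printed lambda will not match the value from the lecture):

```python
import numpy as np
from scipy.stats import boxcox

# Synthetic, strictly positive series; Box-Cox requires positive inputs.
t = np.arange(144)
x = (100 + 3 * t) * (1 + 0.2 * np.sin(2 * np.pi * t / 12))

# With no lambda supplied, boxcox returns the transformed data together with
# the lambda that maximizes the log-likelihood.
data, lam = boxcox(x)

print(lam)
```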

41
00:03:42,740 --> 00:03:47,990
So we get about 0.15, which is kind of in between the log transform and the square root

42
00:03:47,990 --> 00:03:56,270
transform. The next step is to assign our Box-Cox transformed data to a new column in our DataFrame

43
00:03:56,270 --> 00:03:57,320
and make a new plot.

44
00:03:57,590 --> 00:03:58,550
So let's try that.

45
00:04:03,030 --> 00:04:08,910
OK, and we see that it definitely looks like something in between the square root and the log transforms.
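
To see why a lambda between 0 and 0.5 lands between the two earlier transforms, here is a small sketch of the standard Box-Cox formula. The helper `boxcox_manual` is a hypothetical name written for this illustration, and the input values are arbitrary.

```python
import numpy as np
from scipy.stats import boxcox


def boxcox_manual(x, lam):
    """Box-Cox formula: log(x) if lam == 0, else (x**lam - 1) / lam."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1) / lam


# Arbitrary positive values chosen for illustration.
x = np.array([112.0, 118.0, 132.0, 129.0, 121.0])

# lam = 0 is exactly the log transform.
is_log = np.allclose(boxcox_manual(x, 0.0), np.log(x))

# lam = 0.5 is the square root transform, shifted and scaled.
is_sqrt = np.allclose(boxcox_manual(x, 0.5), 2 * (np.sqrt(x) - 1))

# When lmbda is supplied, SciPy's boxcox returns only the transformed array.
matches_scipy = np.allclose(boxcox_manual(x, 0.15), boxcox(x, lmbda=0.15))

print(is_log, is_sqrt, matches_scipy)
```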

46
00:04:13,170 --> 00:04:16,770
The next step is to visualize our data in the form of a histogram.

47
00:04:17,370 --> 00:04:21,940
This should give us some insight into what the Box-Cox transform actually does.

48
00:04:22,350 --> 00:04:27,600
But as mentioned in the theory lecture, keep in mind that this kind of plot doesn't really make sense

49
00:04:27,780 --> 00:04:30,060
in terms of the distribution of the data.

50
00:04:30,480 --> 00:04:35,930
We can't really talk about the distribution when the distribution is dynamic and changing in time.

51
00:04:40,150 --> 00:04:45,640
OK, so for the raw passengers data, we see that most of the values are concentrated in the lower

52
00:04:45,640 --> 00:04:46,360
hundreds.

53
00:04:51,260 --> 00:04:57,050
Now for the square root data, we see that the distribution has been pushed further to the right, so

54
00:04:57,050 --> 00:05:01,070
it's more flat than before and less concentrated on the lower values.

55
00:05:06,500 --> 00:05:12,290
For the log data, we see that the distribution now kind of resembles a mountain where it's more evenly

56
00:05:12,290 --> 00:05:15,250
spaced out in the center instead of off to one side.

57
00:05:20,720 --> 00:05:26,390
And for the Box-Cox data, we see almost the same pattern, except the largest peak is now closer to

58
00:05:26,390 --> 00:05:27,050
the center.

59
00:05:28,430 --> 00:05:30,460
OK, so that's pretty much it for this code.

60
00:05:30,710 --> 00:05:35,630
I think this is enough to give you some sense of what effect these different transformations have.

61
00:05:37,460 --> 00:05:43,310
Note that in this course, we won't be applying the Box-Cox transform very often. The reason for this

62
00:05:43,310 --> 00:05:47,540
is there's going to be a combinatorial explosion of techniques for us to try.

63
00:05:47,900 --> 00:05:52,250
And if we tried them all, not only would you get very bored and think this course was very tedious,

64
00:05:52,430 --> 00:05:53,840
you wouldn't gain very much.

65
00:05:54,200 --> 00:05:59,240
But I want to make you aware of these tools so that you can apply them in your work if you think they

66
00:05:59,270 --> 00:05:59,990
would be useful.
