1
00:00:11,080 --> 00:00:15,130
In this video, we will apply VARMA using statsmodels in Python.

2
00:00:16,360 --> 00:00:20,560
We'll begin this lecture by updating statsmodels to make sure we have the current version.

3
00:00:26,520 --> 00:00:29,800
The next step is to import everything we need for this notebook.

4
00:00:30,330 --> 00:00:34,290
Most of this you've probably seen before, except for VARMAX and VAR.

5
00:00:39,230 --> 00:00:45,350
The next step is to download our dataset. This is a dataset of temperatures in multiple

6
00:00:45,350 --> 00:00:46,040
cities.

7
00:00:50,550 --> 00:00:54,510
The next step is to load in our dataset using pd.read_csv.

8
00:00:57,660 --> 00:01:01,410
The next step is to call head to see what our dataset looks like.

9
00:01:05,400 --> 00:01:10,680
OK, so notice that this data is not formatted in a way that's easy for us to process.

10
00:01:11,160 --> 00:01:17,010
The first column is just a record ID, which we probably don't need. The next three columns are month,

11
00:01:17,010 --> 00:01:20,610
day, and year, which we would prefer to have as a single datetime.

12
00:01:25,060 --> 00:01:28,780
The next two columns are for the temperature and some measure of uncertainty.

13
00:01:32,360 --> 00:01:38,630
The next three columns are for the city, country ID, and country name. The final two columns are for

14
00:01:38,630 --> 00:01:43,460
latitude and longitude, which we won't use, but you might find useful in other models.

15
00:01:45,310 --> 00:01:50,310
OK, so before we do anything else, let's think about the data format that we would like to have.

16
00:01:50,740 --> 00:01:58,180
As you recall, we like our multivariate time series to have the shape T by D, that is, T rows and D columns.

17
00:01:58,810 --> 00:02:03,990
For this example, we'll use D equals two for simplicity; that is, we'll choose two cities.

18
00:02:05,410 --> 00:02:10,450
Note that our dataset is not in this format, since the city is actually a column itself.

19
00:02:10,900 --> 00:02:14,620
So all the time series for each city are stacked on top of one another.

20
00:02:15,250 --> 00:02:19,050
Now there are multiple ways to parse this data into the format we need.

21
00:02:19,390 --> 00:02:23,110
I'll show you one way, but keep in mind that there are many ways to do this.

22
00:02:28,460 --> 00:02:34,310
So the first thing we'll need to do is figure out how to parse the date. To do this, we'll write a function

23
00:02:34,310 --> 00:02:37,760
called parse date, which takes in a whole row of the data frame.

24
00:02:38,930 --> 00:02:43,940
The idea is we're going to call the apply function on our data frame, passing in this function.

25
00:02:44,600 --> 00:02:48,170
Note that the apply function is analogous to the map function in Python.

26
00:02:48,890 --> 00:02:52,190
Basically, it means apply the same function to every row.

27
00:02:52,970 --> 00:02:56,350
So inside the function, we're going to use string interpolation.

28
00:02:56,780 --> 00:03:01,970
We grab the year, month and day from the row and put them into a string with dashes to separate each

29
00:03:01,970 --> 00:03:02,390
item.

30
00:03:03,530 --> 00:03:08,480
The next step is to call the datetime.strptime function, passing in our string along with the

31
00:03:08,480 --> 00:03:09,620
corresponding format.

32
00:03:13,880 --> 00:03:19,490
The next step is to call the apply function, passing in the parse date function we just created. We'll assign

33
00:03:19,490 --> 00:03:21,200
the output to a column called Date.

34
00:03:25,810 --> 00:03:31,150
The next step is to extract the temperatures for two different cities. You can see that I've chosen

35
00:03:31,150 --> 00:03:33,650
Auckland and Stockholm for no particular reason.

36
00:03:34,180 --> 00:03:36,190
Feel free to choose whatever cities you like.

37
00:03:37,360 --> 00:03:42,280
Note that there's probably no relationship between the temperatures in these two cities because they

38
00:03:42,280 --> 00:03:43,260
are so far apart.

39
00:03:43,810 --> 00:03:46,800
So we'll see what this means for our model's predictive ability.

40
00:03:48,250 --> 00:03:52,960
So to get this data, we say df city equals equals, then the city name.

41
00:03:53,470 --> 00:03:57,540
We make a copy, and then we call dropna to remove any missing values.

42
00:04:02,260 --> 00:04:07,420
Now, recall that our data frame has a bunch of columns we're not going to use, that is, data which

43
00:04:07,420 --> 00:04:09,070
is not the time series itself.

44
00:04:09,640 --> 00:04:13,750
So in this next block, we're going to select only the temperature and the date.

45
00:04:18,410 --> 00:04:20,700
Now, we don't actually want the date to be a column.

46
00:04:21,150 --> 00:04:23,590
Instead, we would like it to be the index.

47
00:04:24,050 --> 00:04:28,490
However, in the previous data frame, the date was not unique, since you could have the same date

48
00:04:28,490 --> 00:04:29,600
for multiple cities.

49
00:04:30,410 --> 00:04:35,800
Now that we have only one city for our two new data frames, we can set the date as an index.

50
00:04:37,160 --> 00:04:40,790
So we call the set_index function, passing in the date column.

51
00:04:41,570 --> 00:04:44,920
The next step is to drop the date column since it's no longer needed.

52
00:04:45,950 --> 00:04:48,290
The next step is to rename the remaining column.

53
00:04:49,070 --> 00:04:52,330
Keep in mind that we'll need to join our two data frames together eventually.

54
00:04:52,700 --> 00:04:55,870
So we would like each data frame to have a unique column name.

55
00:04:56,450 --> 00:05:00,050
In this case, it's just the city name concatenated with the word temp.

56
00:05:04,230 --> 00:05:07,980
The next step is to call the head method to see what our new data frame looks like.

57
00:05:11,760 --> 00:05:16,980
OK, so we can confirm that it's now in the format of a time series like the ones we've seen before.

58
00:05:20,400 --> 00:05:24,680
The next step is to perform the same set of operations on our Stockholm data frame.

59
00:05:28,970 --> 00:05:33,230
The next step is to call the head method again to confirm that what we've done is correct.

60
00:05:38,260 --> 00:05:40,930
The next step is to join our two data frames together.

61
00:05:45,350 --> 00:05:48,110
The next step is to check the shape of our new data frame.

62
00:05:51,180 --> 00:05:56,880
So as you can see, it's about three thousand rows and two columns. However, note that we have some

63
00:05:56,880 --> 00:05:59,960
missing data, specifically at the start of the table.

64
00:06:02,270 --> 00:06:07,490
Now, as you'll see very soon, three thousand is actually a pretty large number of values for VARMA

65
00:06:07,490 --> 00:06:08,170
to handle.

66
00:06:08,810 --> 00:06:12,350
So the next step will be to just select the last five hundred rows.

67
00:06:12,620 --> 00:06:13,910
We'll call this joined part.

68
00:06:14,660 --> 00:06:18,800
We'll also take this opportunity to set the index frequency, which is monthly.

69
00:06:22,580 --> 00:06:26,810
The next step is to call the isna function to check how many missing values remain.

70
00:06:30,060 --> 00:06:35,130
So as you can see, we have one missing value in the first column and four missing values in the second

71
00:06:35,130 --> 00:06:36,980
column, which is not too bad.

72
00:06:40,000 --> 00:06:43,900
The next step is to call the interpolate function to fill in the missing values.

73
00:06:47,830 --> 00:06:52,840
The next step is to call the isna function again to confirm that the missing values have been filled

74
00:06:52,840 --> 00:06:53,200
in.

75
00:06:56,180 --> 00:07:01,630
OK, so since the sum of missing values is zero, we can confirm that the missing values are gone.

76
00:07:03,980 --> 00:07:06,920
The next step is to plot our data, to see what it looks like.

77
00:07:11,010 --> 00:07:16,950
OK, so notice that the temperatures for the two cities are not on the same scale, so perhaps it may

78
00:07:16,950 --> 00:07:18,630
be useful to scale this data.

79
00:07:20,110 --> 00:07:26,110
OK, so at this point, our data is now finally in the format we need, and we've also seen how it looks.

80
00:07:26,530 --> 00:07:28,250
Since this lecture has been pretty long,

81
00:07:28,300 --> 00:07:30,990
we'll stop now and continue in the next video.