1
00:00:11,090 --> 00:00:16,850
OK, so in this lecture, we will be looking at the Colab notebook for Prophet. The first step in our

2
00:00:16,850 --> 00:00:21,560
notebook will be to install Prophet, since it does not come preinstalled in Colab.

3
00:00:32,560 --> 00:00:36,490
The next step is to import our usual libraries, this time including Prophet.

4
00:00:42,490 --> 00:00:48,610
The next step is to download our data. This data comes from a Kaggle dataset, which is for Rossmann

5
00:00:48,610 --> 00:00:51,170
store sales. According to Kaggle,

6
00:00:51,190 --> 00:00:57,280
this is a drug store chain in Europe with over three thousand locations. We'll be looking at one, although

7
00:00:57,280 --> 00:01:00,620
you could do the same analysis on the total sales if you wanted.

8
00:01:06,460 --> 00:01:11,350
OK, so the next step is to read in our CSV using pd.read_csv.
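A minimal sketch of this step, assuming the column names match the ones described in this lecture. The rows below are hypothetical sample values, and an in-memory string stands in for the downloaded file so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical sample rows; the column names are assumed from the lecture.
csv_text = """Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
1,5,2015-07-31,5263,555,1,1,0,1
2,5,2015-07-31,6064,625,1,1,0,1
1,4,2015-07-30,5020,546,1,1,0,1
"""

# In the notebook, the argument would be the path of the downloaded CSV file.
df = pd.read_csv(io.StringIO(csv_text))

# head() shows the first few rows so we can inspect the columns.
print(df.head())
```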

9
00:01:15,960 --> 00:01:19,850
The next step is to call head to see what our data looks like.

10
00:01:24,830 --> 00:01:31,820
OK, so let's look at these columns. First, we have store, which you can see is an ID, so you'll have

11
00:01:31,820 --> 00:01:34,010
a unique integer for every store.

12
00:01:34,520 --> 00:01:36,800
We'll be focusing on the store with ID one.

13
00:01:37,040 --> 00:01:39,500
But of course, you can choose your own if you like.

14
00:01:40,640 --> 00:01:42,440
The next column is the day of week.

15
00:01:42,890 --> 00:01:47,750
We won't be using this column since pandas can already infer the day of week from the date.

16
00:01:49,190 --> 00:01:50,670
The next column is sales.

17
00:01:51,080 --> 00:01:53,900
So this is the column that represents our time series.

18
00:01:55,160 --> 00:02:00,580
The next column is customers. Of course, you could potentially use this as a time series as well,

19
00:02:01,220 --> 00:02:04,940
or you could even combine these somehow, like the sales per customer.

20
00:02:06,530 --> 00:02:10,310
The next column is open, which tells us whether or not the store was open.

21
00:02:11,660 --> 00:02:15,880
The next column is promo, which tells us whether or not the store was having a promo.

22
00:02:17,180 --> 00:02:22,130
The next two columns are for state holiday and school holiday, which will tell us whether or not these

23
00:02:22,130 --> 00:02:23,840
were happening on those days.

24
00:02:27,790 --> 00:02:30,310
The next step is to plot the sales for store one.

25
00:02:35,380 --> 00:02:41,500
OK, so this time series is kind of interesting. It seems to have several seasonal components at different

26
00:02:41,500 --> 00:02:42,250
scales.

27
00:02:42,780 --> 00:02:47,770
In addition, there seems to be this weird thing where the sales go down to zero very frequently.

28
00:02:51,780 --> 00:02:54,950
So the next step is to count how many of these values are zero.

29
00:02:58,930 --> 00:03:02,570
So as you can see, we have one hundred sixty one zeros.

30
00:03:05,230 --> 00:03:09,150
The next step is to select only the rows of the data frame for store one.

31
00:03:09,950 --> 00:03:15,090
We'll make a copy, which will make it easier for us to update this data frame later in this notebook.
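As a rough sketch of why the copy matters, the example below uses a small hypothetical data frame with assumed column names `Store` and `Sales`. Filtering with a boolean mask and calling `.copy()` gives an independent data frame, so later column assignments do not trigger pandas' chained-assignment warning or touch the original.

```python
import pandas as pd

# Hypothetical rows for two stores; column names are assumed.
df = pd.DataFrame({
    "Store": [1, 2, 1, 1, 2],
    "Sales": [5263, 6064, 0, 5020, 0],
})

# Keep only store 1 and copy so that later updates are safe.
store1 = df[df["Store"] == 1].copy()

# Count how many sales values are exactly zero for this store.
n_zero = int((store1["Sales"] == 0).sum())

# Adding a column to the copy leaves the original data frame unchanged.
store1["SalesThousands"] = store1["Sales"] / 1000
```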

32
00:03:19,300 --> 00:03:24,700
Now, your suspicion may be that the reason the sales are going down to zero, clearly following some

33
00:03:24,700 --> 00:03:28,390
pattern, is that the store is not actually open on these days.

34
00:03:28,900 --> 00:03:33,910
So in the next block, we're going to plot the sales, but only for the days the store was open.

35
00:03:39,390 --> 00:03:45,510
OK, so as you can see, the time series now looks much more reasonable. Note that we don't particularly

36
00:03:45,510 --> 00:03:48,780
want to model the sales for the days when the store is not open.

37
00:03:49,530 --> 00:03:54,480
We already know when the store is not open and we already know that the sales will be zero on these

38
00:03:54,480 --> 00:03:55,170
days.

39
00:03:55,590 --> 00:03:59,500
So there is no need to predict the time series when the store is closed.

40
00:03:59,760 --> 00:04:02,490
We already know precisely what the value will be.

41
00:04:05,920 --> 00:04:11,420
Note that there are several ways to check that the sales will be zero, although these are not necessarily

42
00:04:11,420 --> 00:04:12,180
equivalent.

43
00:04:12,830 --> 00:04:18,020
What we will show is that when the sales are zero, this is the same as when the customers are zero,

44
00:04:18,290 --> 00:04:20,360
which is the same as when the store is not open.

45
00:04:21,050 --> 00:04:26,480
Of course, this is not necessarily true, but it allows us to be lazy when we choose the relevant values

46
00:04:26,480 --> 00:04:27,630
for our time series.

47
00:04:28,610 --> 00:04:34,370
For example, the store could be open, but we simply have no customers or we might count customers

48
00:04:34,370 --> 00:04:39,770
even though they do not result in sales, although it's not clear how exactly the customers were counted.

49
00:04:42,780 --> 00:04:45,820
In any case, we can see that all these signals correspond.

50
00:04:46,440 --> 00:04:52,890
What this also means is that every day the store was open, we had nonzero sales and nonzero customers.
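One way this check could look, as a sketch on hypothetical rows with assumed column names `Sales`, `Customers`, and `Open`. It compares the three boolean masks row by row, then keeps only the open days.

```python
import pandas as pd

# Hypothetical store-1 rows; two closed days and two open days.
store1 = pd.DataFrame({
    "Sales": [0, 5263, 5020, 0],
    "Customers": [0, 555, 546, 0],
    "Open": [0, 1, 1, 0],
})

# One boolean mask per signal.
zero_sales = store1["Sales"] == 0
zero_customers = store1["Customers"] == 0
closed = store1["Open"] == 0

# The signals correspond only if the masks match on every row.
signals_agree = bool(
    (zero_sales == zero_customers).all() and (zero_sales == closed).all()
)
print(signals_agree)

# Rows for the days the store was open, e.g. for plotting.
open_days = store1[store1["Open"] == 1]
```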

51
00:04:56,370 --> 00:05:00,080
So the next step is to call head on the store one data frame to check it.

52
00:05:03,340 --> 00:05:08,260
Of course, we see that it has the same format as before, but the store is always one.

53
00:05:12,910 --> 00:05:17,860
OK, so the next step is not necessary, but recall that in pandas, we normally like to have the date

54
00:05:17,860 --> 00:05:20,920
as the index when we are looking at a time series.

55
00:05:21,550 --> 00:05:27,370
Previously, this wasn't possible because we had multiple rows with the same date, since we had multiple stores.

56
00:05:27,820 --> 00:05:31,410
But since we now only have one store, all the dates should be unique.

57
00:05:31,960 --> 00:05:35,980
So the next step is to convert the date column into datetime format.

58
00:05:40,720 --> 00:05:44,050
The next step is to set the index of the data frame to date.

59
00:05:47,900 --> 00:05:52,570
The next step is to call head on the store one data frame again to see how it has changed.

60
00:05:55,920 --> 00:06:00,760
OK, so notice that our date column is now gone and the date is now just an index.

61
00:06:01,230 --> 00:06:05,940
This actually means we'll have more work to do later since, as you recall, Prophet needs the date

62
00:06:05,940 --> 00:06:07,770
as a column, not an index.
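The conversion and indexing steps can be sketched as follows, on hypothetical rows with an assumed `Date` column stored as strings. `pd.to_datetime` parses the strings, and `set_index` moves the column into the index, which is why `Date` no longer appears among the columns afterwards.

```python
import pandas as pd

# Hypothetical store-1 rows with dates still stored as strings.
store1 = pd.DataFrame({
    "Date": ["2015-07-31", "2015-07-30", "2015-07-29"],
    "Sales": [5263, 5020, 4782],
})

# Parse the strings into datetime values.
store1["Date"] = pd.to_datetime(store1["Date"])

# With a single store, every date should appear only once.
dates_unique = store1["Date"].is_unique

# Move the date from a column into the index.
store1 = store1.set_index("Date")
print(store1.head())
```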

63
00:06:11,840 --> 00:06:17,120
However, in the next block of code, you can see that this makes the plot of the time series more reasonable.

64
00:06:20,160 --> 00:06:25,020
Specifically, on the x-axis, we now have the date instead of just random integers.

65
00:06:29,220 --> 00:06:33,640
OK, so the next step is to convert our data frame into the format we need for Prophet.

66
00:06:34,470 --> 00:06:37,860
We'll start by making a copy of the sales column from store one.

67
00:06:42,140 --> 00:06:45,400
As you recall, we would also like to have the date as a column.

68
00:06:49,160 --> 00:06:51,650
The next step is to call head on this new data frame.

69
00:06:55,720 --> 00:07:01,000
One thing to notice about this data frame is that the dates go in reverse chronological order.

70
00:07:01,480 --> 00:07:06,730
This is not a problem since Prophet actually reads the dates, but it's easier to have them in the usual

71
00:07:06,730 --> 00:07:07,240
format.

72
00:07:08,080 --> 00:07:13,570
Also, notice that the final date of our time series is 2015-07-31.

73
00:07:16,600 --> 00:07:18,880
The next step is to call tail on the same data frame.

74
00:07:22,220 --> 00:07:27,920
So as you can see, the first date in our time series is January 1st, 2013.

75
00:07:30,850 --> 00:07:36,700
So since we'd like to have our data frame in chronological order, the next step is to call sort_index.

76
00:07:39,710 --> 00:07:41,420
Now, let's call the head function again.

77
00:07:45,850 --> 00:07:51,430
By the way, we can see here where the zeros are. The first zero is on New Year's Day, which, of

78
00:07:51,430 --> 00:07:52,570
course is a holiday.

79
00:07:53,140 --> 00:07:56,140
After this, we can see zeros at regular intervals.

80
00:07:56,620 --> 00:07:58,850
Basically, the store is closed every Sunday.

81
00:07:59,710 --> 00:08:05,110
So, again, this is something you can try to incorporate into your model or we can simply remove these

82
00:08:05,110 --> 00:08:05,940
data points.

83
00:08:06,430 --> 00:08:11,540
As you recall, Prophet has no problem with missing data because the only regressor is time.

84
00:08:12,130 --> 00:08:17,380
This also means that there is no need for your time series to be recorded at regularly spaced intervals

85
00:08:17,380 --> 00:08:17,820
either.
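A small sketch of the "simply remove these data points" option, using hypothetical values and assumed column names `Date` and `Sales`. After dropping the zero-sales days, the remaining dates are no longer evenly spaced, which is the situation described above.

```python
import pandas as pd

# Hypothetical rows: a holiday (2013-01-01) and a Sunday (2013-01-06) with zero sales.
df_p = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-06", "2013-01-07"]
    ),
    "Sales": [0, 5530, 4327, 0, 7176],
})

# Keep only the days with nonzero sales.
df_open = df_p[df_p["Sales"] > 0]

# Gaps between consecutive remaining dates are now irregular.
gaps = df_open["Date"].diff().dropna()
print(gaps)
```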

86
00:08:21,240 --> 00:08:24,720
The next step is to call tail on the data frame just to sanity check.

87
00:08:28,180 --> 00:08:31,510
So, as expected, we see the final dates of our time series.

88
00:08:34,460 --> 00:08:39,800
OK, so the next step is to rename our columns as Prophet expects, which is to call the time series

89
00:08:39,800 --> 00:08:42,290
y and the time index ds.
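Putting the last few steps together, a minimal sketch might look like this. The data frame name `df_p`, the column names `Sales` and `Date`, and the sample values are assumptions for illustration. The target names `ds` and `y` are the ones Prophet expects.

```python
import pandas as pd

# Hypothetical store-1 frame indexed by date, in reverse chronological order.
store1 = pd.DataFrame(
    {"Sales": [5263, 5020, 0]},
    index=pd.to_datetime(["2015-07-31", "2015-07-30", "2015-07-26"]),
)
store1.index.name = "Date"

# Copy just the sales column into a new data frame.
df_p = store1[["Sales"]].copy()

# Prophet needs the date as a regular column, not only as the index.
df_p["Date"] = df_p.index

# Put the rows in chronological order.
df_p = df_p.sort_index()

# Rename to the column names Prophet expects.
df_p = df_p.rename(columns={"Date": "ds", "Sales": "y"})
print(df_p.head())
```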

90
00:08:45,320 --> 00:08:50,480
At this point, our data frame is now ready to pass into Prophet, but since this lecture has already

91
00:08:50,480 --> 00:08:54,470
been pretty long, we're going to continue this notebook in the next lecture.
