1
00:00:11,150 --> 00:00:15,020
OK, so in this lecture, we are going to continue looking at our notebook.

2
00:00:15,710 --> 00:00:21,290
Previously we defined our data and transformed it so that it's in the right format to be ingested by

3
00:00:21,620 --> 00:00:22,670
AWS Forecast.

4
00:00:23,180 --> 00:00:27,380
In this lecture, we're going to focus on uploading this data to S3.

5
00:00:29,350 --> 00:00:34,960
Let's start by defining our bucket name and our region. As you recall, bucket names must be globally

6
00:00:34,960 --> 00:00:39,950
unique, so you can't choose the same bucket name as me or anyone else in the world.

7
00:00:40,600 --> 00:00:45,730
By the way, I have assumed that you know how to create a bucket inside of AWS and that you've already

8
00:00:45,730 --> 00:00:46,300
done so.

9
00:00:47,230 --> 00:00:53,010
If you haven't, then you can either use Boto3 or the AWS console to get that done first.

10
00:00:53,560 --> 00:00:59,950
If you do not know how to do this and you need help, please post a question on the Q&A. For the region,

11
00:00:59,950 --> 00:01:05,950
you should use the same region for all your services to minimize network traffic and latency and especially

12
00:01:05,950 --> 00:01:06,590
costs.

13
00:01:07,090 --> 00:01:11,440
So basically, just choose whichever region is close to you that you normally use.

14
00:01:14,250 --> 00:01:20,070
The next step is to set your role ARN. As you recall, we created this in an earlier lecture.

15
00:01:20,550 --> 00:01:23,550
In that lecture, I mentioned that you should copy down this value.

16
00:01:23,730 --> 00:01:26,490
And so this is where that value is used.

17
00:01:29,900 --> 00:01:33,660
The next step is to create an S3 client using Boto3.

18
00:01:34,220 --> 00:01:39,980
So we call boto3.client, passing in the string 's3' and also the region name we defined

19
00:01:39,980 --> 00:01:40,490
earlier.

20
00:01:41,300 --> 00:01:44,840
Optionally, you can also pass in your credentials at this point.

21
00:01:48,070 --> 00:01:54,340
The next step is to set a few more variables. Our time series frequency is daily, so we create a variable

22
00:01:54,340 --> 00:01:57,400
called data set frequency and set that to D.

23
00:01:58,180 --> 00:02:00,690
Our timestamp format is year, month, day.

24
00:02:00,910 --> 00:02:05,500
And so we set a variable called timestamp format, setting it to what you see here.

25
00:02:06,550 --> 00:02:12,260
We also set a name for our data set group, which I'm going to call Daily Forecast Data Set Group.

26
00:02:12,970 --> 00:02:18,010
So just to give you a picture of what we are doing, you know that our data can consist of many different

27
00:02:18,010 --> 00:02:18,720
files.

28
00:02:19,120 --> 00:02:23,410
So far, we have two: one for the target series and one for the related series.

29
00:02:24,160 --> 00:02:29,920
But as you recall, we can also have item metadata and we can have different item IDs, all of which

30
00:02:29,920 --> 00:02:31,530
can come from different files.

31
00:02:32,230 --> 00:02:34,580
We can even have different files for different days.

32
00:02:34,600 --> 00:02:39,160
So if you have some time series where you're uploading data every day from different log files, then

33
00:02:39,160 --> 00:02:41,410
you would have a different file for every single day.

34
00:02:42,040 --> 00:02:48,430
So a data set group is essentially a group of data sets, each of which can consist of one or more files

35
00:02:48,430 --> 00:02:49,240
in S3.

36
00:02:52,040 --> 00:02:57,440
The next step is to create more Boto3 clients for the forecast service and the forecast query

37
00:02:57,440 --> 00:03:00,860
service. For some reason, these are two separate services.

38
00:03:02,030 --> 00:03:07,130
As a side note, I'm going to assume that you're familiar with the Boto3 documentation where you

39
00:03:07,130 --> 00:03:12,080
can look up all the functions for each of the services we are using, or at least you know how to read

40
00:03:12,080 --> 00:03:12,290
it

41
00:03:12,440 --> 00:03:17,780
if you're seeing it for the first time. Since we're only going to be using a subset of features in

42
00:03:17,780 --> 00:03:23,360
these lectures, you are strongly encouraged to check the documentation for yourself to see what's available.

43
00:03:26,580 --> 00:03:31,630
The next step is to create a data set group using the function create_dataset_group.

44
00:03:32,160 --> 00:03:38,040
Here we have two arguments, Domain and DatasetGroupName, where we pass in the group name we previously

45
00:03:38,040 --> 00:03:40,070
defined. For Domain,

46
00:03:40,080 --> 00:03:41,290
I'm setting this to CUSTOM.

47
00:03:42,150 --> 00:03:43,290
So what does this mean?

48
00:03:44,340 --> 00:03:50,520
So domain is actually really interesting because it allows you to specify what you're trying to forecast.

49
00:03:51,240 --> 00:03:56,800
Presumably Amazon has some secret domain knowledge that would be helpful for these different scenarios.

50
00:03:57,360 --> 00:04:02,550
For example, it's possible to set this value to RETAIL for retail demand forecasting.

51
00:04:03,390 --> 00:04:06,900
You can also set this to WEB_TRAFFIC to estimate web traffic.

52
00:04:07,470 --> 00:04:12,300
So you see that this service is really geared towards businesses who typically have to deal with these

53
00:04:12,300 --> 00:04:13,470
sorts of measurements.

54
00:04:16,240 --> 00:04:22,390
Earlier, you'll recall that I said everything we do in AWS Forecast is a job. So this means when

55
00:04:22,390 --> 00:04:27,490
we call these functions, what we're really doing is sending some commands to a bunch of computers to

56
00:04:27,490 --> 00:04:32,980
do a bunch of work. Believe it or not, simply declaring that you have a data set group takes some

57
00:04:32,980 --> 00:04:33,490
time.
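
To make that call concrete, here is a minimal sketch. It assumes a Boto3 Forecast client created elsewhere with boto3.client('forecast', region_name=...), and the helper name create_custom_dataset_group is made up for illustration. It is not necessarily the exact notebook code.

```python
def create_custom_dataset_group(forecast_client, group_name):
    """Create a dataset group in the CUSTOM domain and return its ARN.

    forecast_client is assumed to be boto3.client("forecast", ...).
    """
    response = forecast_client.create_dataset_group(
        DatasetGroupName=group_name,
        Domain="CUSTOM",
    )
    # The response dictionary carries the ARN of the new dataset group.
    return response["DatasetGroupArn"]
```

Swapping "CUSTOM" for another domain such as "RETAIL" or "WEB_TRAFFIC" changes which attributes the service expects in your schemas.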

58
00:04:35,870 --> 00:04:42,320
The next step is to get the data set group ARN from the response of our last function call. Recall,

59
00:04:42,320 --> 00:04:46,070
I said earlier that everything we do in AWS comes with an ARN.

60
00:04:47,030 --> 00:04:49,610
So now we have an ARN for our data set group.

61
00:04:50,240 --> 00:04:53,930
Once we have that, we can call the function describe_dataset_group.

62
00:04:54,530 --> 00:05:00,230
This will bring back a dictionary of which the relevant keys are status, creation time, and last modification

63
00:05:00,230 --> 00:05:00,680
time.

64
00:05:01,430 --> 00:05:06,110
I encourage you to print out the whole dictionary for yourself just to see what it contains.

65
00:05:07,590 --> 00:05:13,230
Now, because I've run this entire notebook, the status is already ACTIVE, but if you're running this

66
00:05:13,230 --> 00:05:17,100
live, it will probably say CREATE_IN_PROGRESS for some time.

67
00:05:17,820 --> 00:05:21,680
As you can see, the last time I ran this, it took about seven minutes.
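
If you would rather not rerun the describe cell by hand, one option is a small polling loop. This is only a sketch: the helper name wait_until_active and its defaults are assumptions, and it relies on the describe response having a 'Status' key as discussed above.

```python
import time


def wait_until_active(describe, poll_seconds=30, max_polls=60):
    """Call describe() repeatedly until its 'Status' is ACTIVE.

    describe is any zero-argument callable returning a dict with a
    'Status' key, for example:
        lambda: forecast.describe_dataset_group(DatasetGroupArn=arn)
    """
    for _ in range(max_polls):
        status = describe()["Status"]
        if status == "ACTIVE":
            return status
        if status.endswith("FAILED"):
            raise RuntimeError(f"resource entered status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("resource did not become ACTIVE in time")
```

The same helper works for data sets and import jobs, since their describe calls also report a status.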

68
00:05:25,760 --> 00:05:29,150
OK, so the next step is to create a schema for our data set.

69
00:05:29,810 --> 00:05:31,310
We'll start with the target data set.

70
00:05:32,000 --> 00:05:38,030
As you can see, it contains three attributes, one for the timestamp, one for the target value and

71
00:05:38,030 --> 00:05:39,320
one for the item ID.

72
00:05:40,010 --> 00:05:42,260
The type for the timestamp is timestamp.

73
00:05:42,470 --> 00:05:46,910
The type for the target value is float, and the type for the item ID is string.
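
As a sketch, the schema just described can be written as a plain Python dictionary. The attribute names timestamp, target_value, and item_id are the ones the CUSTOM domain expects for a target time series. The attribute order here follows the order described above and is assumed to match the column order of the CSV.

```python
# Schema for the target time series: one entry per CSV column, in column order.
target_schema = {
    "Attributes": [
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "target_value", "AttributeType": "float"},
        {"AttributeName": "item_id", "AttributeType": "string"},
    ]
}
```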

74
00:05:51,170 --> 00:05:57,170
The next step is to give our data set a name. Since our target data consists of close prices, we'll

75
00:05:57,170 --> 00:05:58,640
call it close prices.

76
00:06:01,290 --> 00:06:05,340
The next step is to create our data set using the create_dataset function.

77
00:06:06,180 --> 00:06:12,390
So just to remind you, we previously created a data set group and the data set group contains data

78
00:06:12,390 --> 00:06:13,010
sets.

79
00:06:13,350 --> 00:06:15,040
So now we are creating a data set.

80
00:06:16,110 --> 00:06:19,160
So again, we specify the domain, which is CUSTOM.

81
00:06:19,710 --> 00:06:23,480
We also specify the data set type, which is TARGET_TIME_SERIES.

82
00:06:24,000 --> 00:06:27,900
We specify the data set name using the name we created earlier.

83
00:06:28,740 --> 00:06:33,940
We also set the data set frequency, which is daily, and the schema which we were just looking at.

84
00:06:34,950 --> 00:06:38,880
Again, this launches a job which can take several minutes to complete.
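
Putting those arguments together, a minimal sketch of the call might look like this. The helper name create_forecast_dataset is hypothetical, and forecast_client is assumed to be a Boto3 Forecast client created elsewhere.

```python
def create_forecast_dataset(forecast_client, name, schema,
                            dataset_type="TARGET_TIME_SERIES",
                            frequency="D"):
    """Create a CUSTOM-domain dataset and return its ARN.

    forecast_client is assumed to be boto3.client("forecast", ...).
    frequency "D" means daily data.
    """
    response = forecast_client.create_dataset(
        Domain="CUSTOM",
        DatasetType=dataset_type,
        DatasetName=name,
        DataFrequency=frequency,
        Schema=schema,
    )
    return response["DatasetArn"]
```

Passing dataset_type="RELATED_TIME_SERIES" with a different name and schema covers the related series as well.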

85
00:06:43,090 --> 00:06:49,210
So in the next block of code, we grab the ARN for our target data set from the response to the

86
00:06:49,210 --> 00:06:50,220
create_dataset function.

87
00:06:51,490 --> 00:06:57,670
Next, we call the describe_dataset function, passing in the ARN we just retrieved. From this, we

88
00:06:57,670 --> 00:07:03,550
get back another dictionary where we can look at the status, creation time, and last modification time.

89
00:07:05,070 --> 00:07:10,890
Note that the status is UPDATE_IN_PROGRESS. This is because after running more code further down in this

90
00:07:10,890 --> 00:07:13,770
script, I scrolled back up and ran this block again.

91
00:07:14,670 --> 00:07:16,410
So it says UPDATE_IN_PROGRESS

92
00:07:16,410 --> 00:07:20,550
since, after running this, I modified the data set by adding data to it.

93
00:07:21,180 --> 00:07:24,680
Otherwise, the status should say ACTIVE when it is ready.

94
00:07:28,160 --> 00:07:32,150
The next step is to follow all the same steps for the related time series.

95
00:07:32,690 --> 00:07:35,390
I want to make a note here that this is actually optional.

96
00:07:35,840 --> 00:07:40,780
If you want to create a forecast from the original time series alone, that's perfectly fine.

97
00:07:41,240 --> 00:07:43,830
There's no requirement to pass in related data.

98
00:07:44,480 --> 00:07:49,550
So if you only have one time series, then you can just pass that in as your target time series.

99
00:07:50,060 --> 00:07:54,440
You might want to try that on your own if you're curious about how that will perform.

100
00:07:55,960 --> 00:08:02,380
So we'll start by defining our schema. Note that this now has more attributes, but timestamp and item

101
00:08:02,380 --> 00:08:04,370
ID must still be present.

102
00:08:04,990 --> 00:08:09,130
The new attributes are what you expect: open, high, low, and volume.

103
00:08:09,850 --> 00:08:15,340
The reason I've appended the word value to these attribute names is because the word open is a reserved

104
00:08:15,340 --> 00:08:15,790
word.

105
00:08:15,790 --> 00:08:17,860
So it's not an acceptable attribute name.
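
Here is a rough sketch of what such a related schema could look like. The exact attribute spellings (open_value and so on), their float type, and the column order are assumptions for illustration, and the helper make_schema is made up. The attributes need to be listed in the same order as the columns of your CSV.

```python
def make_schema(columns):
    """Build a Forecast schema dict from (name, type) pairs in CSV column order."""
    return {
        "Attributes": [
            {"AttributeName": name, "AttributeType": attr_type}
            for name, attr_type in columns
        ]
    }


# Assumed names and order: "_value" is appended to avoid reserved words.
related_schema = make_schema([
    ("timestamp", "timestamp"),
    ("open_value", "float"),
    ("high_value", "float"),
    ("low_value", "float"),
    ("volume_value", "float"),
    ("item_id", "string"),
])
```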

106
00:08:22,470 --> 00:08:28,470
The next step is to give our new data set a name. Since this data set is for the related time series,

107
00:08:28,590 --> 00:08:30,570
I'm going to call this related data.

108
00:08:33,800 --> 00:08:38,960
The next step is to call the create_dataset function again, but for our new related time series.

109
00:08:39,500 --> 00:08:40,650
So what's different here?

110
00:08:41,390 --> 00:08:43,690
First, the data set type is different.

111
00:08:44,030 --> 00:08:47,850
It now says RELATED_TIME_SERIES instead of TARGET_TIME_SERIES.

112
00:08:48,230 --> 00:08:53,150
And as expected, the data set name and schema correspond to the related series.

113
00:08:57,630 --> 00:09:03,270
And again, since this launches a job, the next thing we have to do is check the status of our job.

114
00:09:04,590 --> 00:09:09,900
Again, you can see that the last time I ran this, it said UPDATE_IN_PROGRESS because I reran this

115
00:09:09,900 --> 00:09:13,500
line after I launched a new job to add the actual data.

116
00:09:13,980 --> 00:09:16,740
Otherwise, this should say ACTIVE when it's complete.

117
00:09:21,340 --> 00:09:27,140
The next step is to add the data sets we just defined to our data set group, which we created earlier.

118
00:09:28,270 --> 00:09:31,630
The first thing we do is create a list of our data set ARNs.

119
00:09:32,590 --> 00:09:38,110
From there, we call the function update_dataset_group, where we pass in the ARN for our data set

120
00:09:38,110 --> 00:09:40,630
group and the ARNs for our data sets.
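
In code, attaching the data sets to the group could be sketched like this. The helper name attach_datasets is hypothetical, and forecast_client is again assumed to be a Boto3 Forecast client.

```python
def attach_datasets(forecast_client, dataset_group_arn, dataset_arns):
    """Replace the group's dataset list with the given dataset ARNs."""
    return forecast_client.update_dataset_group(
        DatasetGroupArn=dataset_group_arn,
        DatasetArns=list(dataset_arns),
    )
```

Here dataset_arns would be the list holding the target data set ARN and the related data set ARN.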

121
00:09:44,580 --> 00:09:47,050
The next step is to upload our data to S3.

122
00:09:47,900 --> 00:09:54,270
We'll start by creating an S3 resource. For each of our CSVs, we'll access our bucket and then create

123
00:09:54,270 --> 00:09:58,230
an object with an S3 key equal to our CSV filename.

124
00:09:59,740 --> 00:10:04,630
Once we have the objects, we can call the upload_file function to upload the CSVs that we have

125
00:10:04,630 --> 00:10:05,250
locally.
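
The upload step can be sketched as follows. It assumes s3_resource was created with boto3.resource('s3'), and the helper name upload_csvs is made up. Each file is stored under a key equal to its filename, as described above.

```python
import os


def upload_csvs(s3_resource, bucket_name, local_paths):
    """Upload each local CSV to the bucket, keyed by its filename."""
    bucket = s3_resource.Bucket(bucket_name)
    keys = []
    for path in local_paths:
        key = os.path.basename(path)  # S3 key equals the CSV filename
        bucket.Object(key).upload_file(path)
        keys.append(key)
    return keys
```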

126
00:10:09,870 --> 00:10:13,560
The next step is to define the paths to our data in S3.

127
00:10:14,130 --> 00:10:15,540
This is just the bucket name,

128
00:10:15,660 --> 00:10:17,460
slash the CSV file name.

129
00:10:18,830 --> 00:10:23,960
As a side note, although we're not doing it in this lecture, it's possible to have multiple files

130
00:10:23,990 --> 00:10:27,080
as part of a data set. In order to do that,

131
00:10:27,110 --> 00:10:33,800
you put all your CSVs into the same folder in S3 and then specify the path here as the path

132
00:10:33,800 --> 00:10:36,710
to the folder instead of being a path to a file.

133
00:10:37,610 --> 00:10:43,650
The only caveat is that the CSVs have to be directly in the folder, so you can't have nested folders.

134
00:10:44,420 --> 00:10:49,010
For example, you can't have different files for each day and then have folders for the year, month

135
00:10:49,010 --> 00:10:51,830
and day and then specify only the year folder.
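
As a small sketch of how such a path can be built: the import job expects an S3 URI, so the bucket name and key are joined behind an s3:// prefix. The helper name s3_path and the bucket and file names in the comment are placeholders.

```python
def s3_path(bucket_name, key):
    """Return the S3 URI for a file, or for a folder when key ends with '/'."""
    return f"s3://{bucket_name}/{key}"


# Hypothetical usage: s3_path("my-bucket", "target.csv") for a single file,
# or s3_path("my-bucket", "target/") to point at a folder of CSVs.
```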

136
00:10:55,260 --> 00:11:00,760
The next step is to launch an import job to import our data into AWS Forecast.

137
00:11:01,560 --> 00:11:04,170
Like I said earlier, everything we do is a job.

138
00:11:04,680 --> 00:11:08,660
So to do this, we call the create_dataset_import_job function.

139
00:11:09,450 --> 00:11:11,860
The arguments to this function should be straightforward.

140
00:11:12,510 --> 00:11:16,870
First, we give our import job a name, which I've just set to our data set group name.

141
00:11:17,830 --> 00:11:21,500
Next, we set the data set ARN, which is the target data set.

142
00:11:21,600 --> 00:11:27,840
And next we set the data source, which basically points to our S3 target path.

143
00:11:28,770 --> 00:11:31,550
Notice we also pass in the ARN for our role.

144
00:11:32,130 --> 00:11:37,110
As you recall, our role is what allows AWS Forecast to access S3.

145
00:11:37,980 --> 00:11:40,110
Finally, we pass in the timestamp format.
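
Those arguments fit together roughly like this. It is a sketch under assumptions: the helper name start_import_job is made up, forecast_client is a Boto3 Forecast client, and 'yyyy-MM-dd' is the year, month, day format assumed for daily data.

```python
def start_import_job(forecast_client, job_name, dataset_arn, s3_uri,
                     role_arn, timestamp_format="yyyy-MM-dd"):
    """Launch a dataset import job and return its ARN."""
    response = forecast_client.create_dataset_import_job(
        DatasetImportJobName=job_name,
        DatasetArn=dataset_arn,
        # The role ARN is what lets the service read from the S3 path.
        DataSource={"S3Config": {"Path": s3_uri, "RoleArn": role_arn}},
        TimestampFormat=timestamp_format,
    )
    return response["DatasetImportJobArn"]
```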

146
00:11:44,330 --> 00:11:50,000
OK, so, again, we have to wait for this job to finish. From the response to the previous function,

147
00:11:50,240 --> 00:11:54,620
we can retrieve the ARN for the import job. Using this ARN,

148
00:11:54,670 --> 00:11:57,800
we call the describe_dataset_import_job function.

149
00:11:59,350 --> 00:12:03,760
Again, we check the status, creation time, and last modification time.

150
00:12:04,540 --> 00:12:09,820
So if you're running this for the first time, it should say ACTIVE within a few minutes. You can see

151
00:12:09,820 --> 00:12:12,640
that it took about seven minutes the last time I ran it.

152
00:12:16,400 --> 00:12:22,010
Next, we're going to do the exact same thing, but for the related time series. So we're going to

153
00:12:22,010 --> 00:12:28,270
call create_dataset_import_job again, but this time we'll pass in the ARN for the related data set

154
00:12:28,490 --> 00:12:31,190
and the S3 path for the related data set.

155
00:12:34,070 --> 00:12:39,920
OK, so, again, we wait for this job to finish, and we can call the function

156
00:12:39,920 --> 00:12:42,640
describe_dataset_import_job in order to check the status of our job.

157
00:12:43,250 --> 00:12:47,510
As you can see, it again took about seven minutes to become active.
