1
00:00:00,280 --> 00:00:03,780
Last week you looked at creating
a synthetic seasonal data set that

2
00:00:03,780 --> 00:00:07,310
contained trend, seasonality,
and a bit of noise.

3
00:00:07,310 --> 00:00:11,310
You also looked at some statistical
methods for analyzing the data set and

4
00:00:11,310 --> 00:00:13,320
making predictions from it.

5
00:00:13,320 --> 00:00:15,140
Some of the results you got
were actually quite good,

6
00:00:15,140 --> 00:00:18,020
but there was no machine
learning applied yet.

7
00:00:18,020 --> 00:00:21,180
This week, you're going to look at using
some machine learning methods with

8
00:00:21,180 --> 00:00:22,200
the same data.

9
00:00:22,200 --> 00:00:23,465
Let's see where machine
learning can take us.

10
00:00:25,872 --> 00:00:28,278
First of all,
as with any other ML problem,

11
00:00:28,278 --> 00:00:31,760
we have to divide our data
into features and labels.

12
00:00:31,760 --> 00:00:35,757
In this case our feature is effectively
a number of values in the series,

13
00:00:35,757 --> 00:00:37,580
with our label being the next value.

14
00:00:37,580 --> 00:00:41,545
We'll call that number of values
that we'll treat as our feature,

15
00:00:41,545 --> 00:00:45,545
the window size, where we're
taking a window of the data and

16
00:00:45,545 --> 00:00:48,645
training an ML model to
predict the next value.

17
00:00:48,645 --> 00:00:53,085
So for example, if we take our time
series data, say, 30 days at a time,

18
00:00:53,085 --> 00:00:57,630
we'll use 30 values as the feature and
the next value is the label.

19
00:00:57,630 --> 00:00:59,000
Then over time,

20
00:00:59,000 --> 00:01:03,470
we'll train a neural network to match
the 30 features to the single label.
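The windowing idea described above can be sketched in plain Python. This is only an illustration: the helper name `make_windows` and the 40-value stand-in series are assumptions, not part of the lecture.

```python
def make_windows(series, window_size):
    """Pair each run of `window_size` values with the value that follows it."""
    features, labels = [], []
    for start in range(len(series) - window_size):
        # The window is the feature; the value right after it is the label.
        features.append(series[start:start + window_size])
        labels.append(series[start + window_size])
    return features, labels


series = list(range(40))  # hypothetical stand-in for 40 daily readings
features, labels = make_windows(series, window_size=30)
print(len(features), len(features[0]), labels[0])
```

Each entry in `features` holds 30 consecutive values, and the matching entry in `labels` holds the single value that comes next.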

21
00:01:05,000 --> 00:01:07,930
So let's, for example,
use the tf.data.Dataset

22
00:01:07,930 --> 00:01:12,100
class to create some data for us,
we'll make a range of 10 values.

23
00:01:13,320 --> 00:01:16,219
When we print them we'll see
a series of data from 0 to 9.
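A minimal sketch of this step, assuming TensorFlow 2.x with eager execution enabled. The variable name `values` is chosen here for illustration.

```python
import tensorflow as tf

# Build a dataset holding the integers 0 through 9.
dataset = tf.data.Dataset.range(10)

# Each element is a scalar tensor; .numpy() converts it to a plain number.
values = [int(v.numpy()) for v in dataset]
print(values)
```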

24
00:01:17,580 --> 00:01:19,710
So now let's make it a little
bit more interesting.

25
00:01:19,710 --> 00:01:24,150
We'll use dataset.window to
expand our data set using windowing.

26
00:01:25,220 --> 00:01:27,620
Its parameters are the size
of the window and

27
00:01:27,620 --> 00:01:30,380
how much we want to shift by each time.

28
00:01:30,380 --> 00:01:35,175
So if we set a window size of 5 with
a shift of 1 when we print it we'll

29
00:01:35,175 --> 00:01:39,637
see something like this,
0 1 2 3 4, which just stops there

30
00:01:39,637 --> 00:01:44,283
because it's five values,
then we see 1 2 3 4 5, and so on.

31
00:01:44,283 --> 00:01:49,122
Once we get towards the end of
the data set we'll have fewer values

32
00:01:49,122 --> 00:01:51,800
because they just don't exist.

33
00:01:51,800 --> 00:01:54,390
So we'll get 6 7 8 9, and
then 7 8 9, and so on.
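A minimal sketch of windowing without dropping remainders, assuming TensorFlow 2.x with eager execution. Each element produced by `window` is itself a small dataset, so the sketch iterates over it to collect the values; the name `windows` is illustrative.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)
# Windows of up to 5 values, moving forward by 1 each time.
dataset = dataset.window(5, shift=1)

windows = []
for window_dataset in dataset:
    # Each window is a nested dataset of scalar tensors.
    windows.append([int(v.numpy()) for v in window_dataset])

for w in windows:
    print(*w)
```

The windows near the end are shorter because there are no more values left to fill them.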

34
00:01:56,130 --> 00:02:01,100
So let's edit our window a little bit,
so that we have regularly sized data.

35
00:02:01,100 --> 00:02:04,125
We can do that with an additional
parameter on the window called

36
00:02:04,125 --> 00:02:06,300
drop_remainder.

37
00:02:06,300 --> 00:02:07,790
And if we set this to true,

38
00:02:07,790 --> 00:02:11,230
it will truncate the data by
dropping all of the remainders.

39
00:02:11,230 --> 00:02:15,110
Namely, this means it will only
give us windows of five items.

40
00:02:15,110 --> 00:02:19,700
So when we print it,
it will now look like this,

41
00:02:19,700 --> 00:02:23,340
starting at 0 1 2 3 4 and ending at 5 6 7 8 9.
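A minimal sketch of the same windowing with `drop_remainder=True`, assuming TensorFlow 2.x with eager execution. The name `windows` is illustrative.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)
# drop_remainder=True discards any window with fewer than 5 values.
dataset = dataset.window(5, shift=1, drop_remainder=True)

windows = [[int(v.numpy()) for v in window_dataset] for window_dataset in dataset]
for w in windows:
    print(*w)
```

With 10 values and a window size of 5, only six full windows remain.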

42
00:02:23,340 --> 00:02:27,600
Great, now let's put
these into NumPy arrays so

43
00:02:27,600 --> 00:02:30,130
that we can start using
them with machine learning.

44
00:02:30,130 --> 00:02:35,150
The good news is that this is super easy,
we just call the .numpy method on each

45
00:02:35,150 --> 00:02:40,260
item in the data set, and when we print
we now see that we have NumPy arrays.
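A minimal sketch of converting the windows to NumPy arrays, assuming TensorFlow 2.x with eager execution. The narration does not mention it, but because each window is a nested dataset, this sketch first flattens it into a single tensor with `flat_map` and `batch`; that step and the name `arrays` are assumptions made here so the example runs.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)
dataset = dataset.window(5, shift=1, drop_remainder=True)
# Assumption: turn each nested window dataset into one 5-element tensor.
dataset = dataset.flat_map(lambda window: window.batch(5))

# .numpy() converts each tensor to a NumPy array.
arrays = [window.numpy() for window in dataset]
for a in arrays:
    print(a)
```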

46
00:02:41,630 --> 00:02:45,830
Okay, next up is to split the data
into features and labels.

47
00:02:45,830 --> 00:02:49,980
For each item in the list it kind of
makes sense to have all of the values but

48
00:02:49,980 --> 00:02:54,980
the last one to be the feature, and
then the last one can be the label.

49
00:02:54,980 --> 00:02:59,880
And this can be achieved with mapping,
like this, where we split into everything

50
00:02:59,880 --> 00:03:06,120
but the last one with :-1, and
then just the last one itself with -1:.

51
00:03:06,120 --> 00:03:08,860
Which gives us this output when we print,

52
00:03:08,860 --> 00:03:12,960
which now looks like a nice
set of features and labels.
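A minimal sketch of the feature and label split, assuming TensorFlow 2.x with eager execution and windows that have already been flattened into tensors with `flat_map` (an assumption, as is the name `pairs`).

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)
dataset = dataset.window(5, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(5))
# window[:-1] is everything but the last value; window[-1:] is the last value.
dataset = dataset.map(lambda window: (window[:-1], window[-1:]))

pairs = [(x.numpy().tolist(), y.numpy().tolist()) for x, y in dataset]
for x, y in pairs:
    print(x, y)
```

Slicing with `-1:` rather than `-1` keeps the label as a one-element array instead of a scalar.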

53
00:03:12,960 --> 00:03:15,760
Typically, you would shuffle
your data before training.

54
00:03:15,760 --> 00:03:19,440
And this is possible
using the shuffle method.

55
00:03:19,440 --> 00:03:21,670
We call it with a buffer size of ten,

56
00:03:21,670 --> 00:03:23,830
because that's the number
of data items that we have.

57
00:03:24,980 --> 00:03:26,300
And when we print the results,

58
00:03:26,300 --> 00:03:29,060
we'll see our features and
label sets have been shuffled.
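A minimal sketch of shuffling, assuming TensorFlow 2.x with eager execution and the same flattening assumption (`flat_map`) as a setup step. The order differs from run to run, so no expected output is shown.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)
dataset = dataset.window(5, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(5))
dataset = dataset.map(lambda window: (window[:-1], window[-1:]))
# A buffer at least as large as the data gives a full shuffle.
dataset = dataset.shuffle(buffer_size=10)

pairs = [(x.numpy().tolist(), y.numpy().tolist()) for x, y in dataset]
for x, y in pairs:
    print(x, y)
```

Shuffling reorders the feature and label pairs together, so each feature window stays attached to its own label.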

59
00:03:30,130 --> 00:03:34,360
Finally, we can look at batching the data,
and this is done with the batch method.

60
00:03:35,830 --> 00:03:38,259
It'll take a size parameter,
and in this case it's 2.

61
00:03:38,259 --> 00:03:42,369
So what we'll do is we'll batch
the data into sets of two, and

62
00:03:42,369 --> 00:03:45,550
if we print them out, we'll see this.

63
00:03:45,550 --> 00:03:49,070
We now have three batches
of two data items each.

64
00:03:49,070 --> 00:03:52,960
And if you look at the first set,
you'll see the corresponding x and y.

65
00:03:52,960 --> 00:03:57,076
So when x is four, five, six and
seven, our y is eight, or

66
00:03:57,076 --> 00:04:01,460
when x is zero, one, two,
three, you'll see our y is four.
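A minimal sketch of the full pipeline ending in batching, assuming TensorFlow 2.x with eager execution. The `flat_map` step and the name `batches` are assumptions made for this sketch; the batch contents vary between runs because of the shuffle.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)
dataset = dataset.window(5, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(5))
dataset = dataset.map(lambda window: (window[:-1], window[-1:]))
dataset = dataset.shuffle(buffer_size=10)
# Group the six feature/label pairs into batches of two.
dataset = dataset.batch(2)

batches = [(x.numpy(), y.numpy()) for x, y in dataset]
for x, y in batches:
    print("x =", x.tolist())
    print("y =", y.tolist())
```

Each batch holds an `x` of shape (2, 4) and a `y` of shape (2, 1), with every row of `x` lined up against its label in `y`.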

67
00:04:02,950 --> 00:04:07,930
Okay, now that you've seen the tools that
let us create a series of x and y's, or

68
00:04:07,930 --> 00:04:10,995
features and labels,
you have everything you need to work on

69
00:04:10,995 --> 00:04:13,980
a data set in order to
get predictions from it.

70
00:04:13,980 --> 00:04:18,010
We'll take a look at a screencast of this
code next, before moving on to creating

71
00:04:18,010 --> 00:04:20,590
our first neural networks to
run predictions on this data.