1
00:00:00,000 --> 00:00:01,605
Over the last few weeks,

2
00:00:01,605 --> 00:00:03,300
you looked at time-series data

3
00:00:03,300 --> 00:00:04,830
and examined a few techniques for

4
00:00:04,830 --> 00:00:08,370
forecasting that data including
statistical analysis,

5
00:00:08,370 --> 00:00:11,100
linear regression, and
machine learning with

6
00:00:11,100 --> 00:00:12,750
both deep learning networks

7
00:00:12,750 --> 00:00:14,865
and recurrent neural networks.

8
00:00:14,865 --> 00:00:17,850
But now we're going to move
beyond the synthetic data to

9
00:00:17,850 --> 00:00:19,950
some real-world
data and apply what

10
00:00:19,950 --> 00:00:22,800
we've learned to creating
forecasts for it.

11
00:00:22,800 --> 00:00:25,515
Let's start with
this dataset from Kaggle,

12
00:00:25,515 --> 00:00:28,155
which tracks sunspots
on a monthly basis

13
00:00:28,155 --> 00:00:31,275
from 1749 until 2018.

14
00:00:31,275 --> 00:00:33,680
Sunspots do have seasonal cycles

15
00:00:33,680 --> 00:00:35,960
approximately every 11 years.

16
00:00:35,960 --> 00:00:39,065
So let's try this out to see
if we can predict from it.

17
00:00:39,065 --> 00:00:43,040
It's a CSV dataset with
the first column being an index,

18
00:00:43,040 --> 00:00:46,190
the second being a date in
the format year, month, day,

19
00:00:46,190 --> 00:00:47,840
and the third being the sunspot

20
00:00:47,840 --> 00:00:49,970
measurement taken
for that month.

21
00:00:49,970 --> 00:00:52,010
It's an average monthly amount

22
00:00:52,010 --> 00:00:54,590
recorded against
the end of that month.

23
00:00:54,590 --> 00:00:56,944
You can download it from Kaggle

24
00:00:56,944 --> 00:00:59,330
or if you're using
the notebook in this lesson,

25
00:00:59,330 --> 00:01:02,405
I've conveniently hosted it
for you on my Cloud Storage.

26
00:01:02,405 --> 00:01:03,979
It's a pretty simple dataset,

27
00:01:03,979 --> 00:01:06,050
but it does help us
understand a little bit

28
00:01:06,050 --> 00:01:08,330
more about how to
optimize our code

29
00:01:08,330 --> 00:01:10,370
to predict the dataset based on

30
00:01:10,370 --> 00:01:12,770
the nature of
its underlying data.

31
00:01:12,770 --> 00:01:15,455
Of course, one size
does not fit all

32
00:01:15,455 --> 00:01:18,575
particularly when it comes to
data that has seasonality.

33
00:01:18,575 --> 00:01:20,910
So let's take a look at the code.

34
00:01:21,070 --> 00:01:24,559
Okay, first of all, if
you're using Colab,

35
00:01:24,559 --> 00:01:26,000
then you'll need to get the data

36
00:01:26,000 --> 00:01:27,740
into your Colab instance.

37
00:01:27,740 --> 00:01:29,090
This code will download

38
00:01:29,090 --> 00:01:30,680
the file that I've
stored for you.

39
00:01:30,680 --> 00:01:33,260
You should really get it
from Kaggle and store it on

40
00:01:33,260 --> 00:01:36,500
your own server or even
manually upload it to Colab,

41
00:01:36,500 --> 00:01:38,930
but for convenience,
I've stored it here.

42
00:01:38,930 --> 00:01:42,140
Here's the code to read
the CSV file and get

43
00:01:42,140 --> 00:01:45,845
its data into a list of
sunspots and timestamps.

44
00:01:45,845 --> 00:01:49,205
We'll start by importing
the CSV library.

45
00:01:49,205 --> 00:01:51,350
Then we'll open the file.

46
00:01:51,350 --> 00:01:53,220
If you're using Colab,

47
00:01:53,220 --> 00:01:55,485
the wget code that
you saw earlier

48
00:01:55,485 --> 00:01:58,550
downloads the CSV and
puts it into /tmp.

49
00:01:58,550 --> 00:02:00,935
So this code just
reads it out of there.

50
00:02:00,935 --> 00:02:03,320
This line, next(reader),

51
00:02:03,320 --> 00:02:06,540
is called before we loop through
the rows in the reader,

52
00:02:06,540 --> 00:02:08,510
and it simply
reads the first line,

53
00:02:08,510 --> 00:02:09,920
and we end up throwing it away.

54
00:02:09,920 --> 00:02:12,020
That's because the
column titles are in

55
00:02:12,020 --> 00:02:14,750
the first line of the file
as you can see here.

56
00:02:14,750 --> 00:02:16,700
Then, we will loop through

57
00:02:16,700 --> 00:02:19,780
the reader reading
the file line by line.

58
00:02:19,780 --> 00:02:21,960
Our sunspots are
actually in column

59
00:02:21,960 --> 00:02:25,070
2 and we want them to be
converted into a float.

60
00:02:25,070 --> 00:02:27,200
As the file is read, every item

61
00:02:27,200 --> 00:02:29,330
will be read as a string
so we may as well

62
00:02:29,330 --> 00:02:31,310
convert them now instead
of iterating through

63
00:02:31,310 --> 00:02:34,225
the list later and then
converting all the datatypes.

64
00:02:34,225 --> 00:02:37,875
Similarly, we'll read the
time steps as integers.
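
The reading steps just described can be sketched roughly like this. It is a minimal sketch, not the notebook's exact code: the function name read_sunspots, the header names, and the sample rows are made up for illustration so the snippet runs on its own.

```python
import csv
import io


def read_sunspots(csv_file):
    """Return (time_step, sunspots) lists read from an open CSV file object."""
    time_step = []
    sunspots = []
    reader = csv.reader(csv_file, delimiter=",")
    next(reader)  # read the header row with the column titles and throw it away
    for row in reader:
        time_step.append(int(row[0]))   # column 0: index, kept as an integer time step
        sunspots.append(float(row[2]))  # column 2: sunspot value, converted to float now
    return time_step, sunspots


# Illustrative rows only: hypothetical header names and made-up values.
sample = io.StringIO(
    "index,date,sunspots\n"
    "0,1749-01-31,10.5\n"
    "1,1749-02-28,12.0\n"
)
time_step, sunspots = read_sunspots(sample)
print(time_step, sunspots)
```

With the real file you would pass an opened file instead of the sample, for example the CSV that was downloaded earlier; the exact path depends on where you saved it.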

65
00:02:37,875 --> 00:02:40,540
As much of the code
we'll be using to

66
00:02:40,540 --> 00:02:43,275
process this deals
with NumPy arrays,

67
00:02:43,275 --> 00:02:46,150
we may as well now convert
the lists to NumPy arrays.

68
00:02:46,150 --> 00:02:48,310
It's more efficient
to do it this way,

69
00:02:48,310 --> 00:02:49,750
build up your data in a throwaway

70
00:02:49,750 --> 00:02:50,890
list and then convert it to

71
00:02:50,890 --> 00:02:54,220
NumPy, than it would have been
to start with NumPy arrays,

72
00:02:54,220 --> 00:02:56,620
because every time you append
an item to a NumPy array,

73
00:02:56,620 --> 00:02:57,820
there's a lot of

74
00:02:57,820 --> 00:03:00,130
memory management going
on to clone the array,

75
00:03:00,130 --> 00:03:02,710
and with a lot of data
that can get slow.
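
To make that point concrete, here is a small sketch comparing the two approaches. The variable names and the tiny loop are only for illustration; the relevant fact is that np.append returns a new array each time rather than growing the existing one in place.

```python
import numpy as np

# Preferred: build up a plain Python list, then convert it once at the end.
values = []
for i in range(5):
    values.append(float(i))
series = np.array(values)

# Slower pattern: np.append copies the whole array into a new one on every call.
slow = np.array([])
for i in range(5):
    slow = np.append(slow, float(i))

print(series)
print(np.array_equal(series, slow))
```

Both give the same array; the difference only starts to matter as the amount of data grows.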

76
00:03:02,710 --> 00:03:05,965
If we plot our data
it looks like this.

77
00:03:05,965 --> 00:03:07,780
Note that we have seasonality,

78
00:03:07,780 --> 00:03:09,190
but it's not very regular with

79
00:03:09,190 --> 00:03:10,840
some peaks much
higher than others.

80
00:03:10,840 --> 00:03:12,850
We also have
quite a bit of noise,

81
00:03:12,850 --> 00:03:14,645
but there's no general trend.

82
00:03:14,645 --> 00:03:16,930
As before, let's split our series

83
00:03:16,930 --> 00:03:19,135
into training and
validation datasets.

84
00:03:19,135 --> 00:03:20,860
We'll split at time 1,000.

85
00:03:20,860 --> 00:03:22,480
We'll have a window size of 20,

86
00:03:22,480 --> 00:03:26,345
batch size of 32, and
a shuffle buffer of 1,000.
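
As a rough sketch, the split and those settings might look like this. The stand-in time and series arrays are hypothetical so the snippet runs by itself; in the lesson they would be the arrays built from the CSV, and the variable names here are assumptions.

```python
import numpy as np

# Stand-in data (hypothetical values) in place of the real sunspot arrays.
time = np.arange(3000)
series = np.sin(time / 20.0)

split_time = 1000  # everything before this index is used for training

time_train, x_train = time[:split_time], series[:split_time]
time_valid, x_valid = time[split_time:], series[split_time:]

# Settings mentioned in the lesson.
window_size = 20
batch_size = 32
shuffle_buffer_size = 1000

print(len(x_train), len(x_valid))
```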

87
00:03:26,345 --> 00:03:28,160
We'll use the same window

88
00:03:28,160 --> 00:03:29,600
dataset code that
we've been using

89
00:03:29,600 --> 00:03:30,980
all week to turn a series

90
00:03:30,980 --> 00:03:33,630
into a dataset which
we can train on.
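
For readers who want to see the windowing idea itself, here is a plain NumPy sketch. It is not the window dataset code used in the course, which is not shown in this transcript; it only illustrates turning a series into windows of inputs with the next value as the label. The function name make_windows is hypothetical, and shuffling and batching are left out.

```python
import numpy as np


def make_windows(series, window_size):
    """Return (features, labels): each feature row holds window_size
    consecutive values, and its label is the value that follows them."""
    series = np.asarray(series)
    n = len(series) - window_size
    if n <= 0:
        raise ValueError("series must be longer than window_size")
    features = np.stack([series[i:i + window_size] for i in range(n)])
    labels = series[window_size:]
    return features, labels


# Tiny example: the values 0..9 with a window of 3.
features, labels = make_windows(np.arange(10.0), window_size=3)
print(features.shape, labels.shape)
print(features[0], labels[0])
```

The first window is the values 0, 1, 2 and its label is 3, so a model trained on these pairs learns to predict the next value from the previous window.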