1
00:00:11,100 --> 00:00:17,070
OK, so in this lecture, we are going to look at the notebook for human activity recognition. Since

2
00:00:17,070 --> 00:00:21,720
you may not have seen this data set before, we are going to take things slowly at first and look at

3
00:00:21,720 --> 00:00:23,530
each part of the code step by step.

4
00:00:24,210 --> 00:00:25,800
So let's start with the imports.

5
00:00:26,790 --> 00:00:28,920
So there are some new things to pay attention to.

6
00:00:29,610 --> 00:00:34,380
We'll need to import the Concatenate layer, since that is needed to join multiple features together.

7
00:00:35,130 --> 00:00:40,200
We'll need to import the sparse categorical cross-entropy, since we need to pass in custom arguments

8
00:00:40,200 --> 00:00:41,190
into the loss.

9
00:00:41,970 --> 00:00:47,970
As you recall, when you just want to use the defaults, you can specify the loss as a string. We'll also

10
00:00:47,970 --> 00:00:50,370
need to import the ModelCheckpoint callback.

11
00:00:50,880 --> 00:00:55,160
Basically, we're going to use this to save the best model we encounter during training.

12
00:00:56,100 --> 00:01:02,100
This isn't necessary, but it's useful since the validation loss can start to increase due to overfitting.

13
00:01:08,830 --> 00:01:14,450
OK, so the next step is to download our data. This data is currently posted on my website.

14
00:01:14,950 --> 00:01:18,700
However, you're encouraged to check out the original sources to learn more.

15
00:01:24,790 --> 00:01:28,060
The next step is to unzip the zip file that we just downloaded.

16
00:01:34,680 --> 00:01:38,430
The next step is to run the ls command to see what we just unzipped.

17
00:01:42,040 --> 00:01:46,600
As you can see, we now have a folder called UCI Data Set.

18
00:01:49,310 --> 00:01:53,290
The next step is to run that command again to see what's inside this folder.

19
00:01:56,880 --> 00:02:01,550
OK, so it looks like we have a few files along with two folders, test and train.

20
00:02:02,190 --> 00:02:07,050
Now these files are basically just information about the data set if you want to read about it.

21
00:02:07,680 --> 00:02:10,860
The actual data is stored inside the train and test folders.

22
00:02:13,930 --> 00:02:17,750
So let's run the ls command again to see what's inside the train folder.

23
00:02:18,460 --> 00:02:21,970
Note that analogous files can also be found in the test folder.

24
00:02:25,260 --> 00:02:31,770
OK, so we have three files along with one folder. The subjects file is not too important unless

25
00:02:31,770 --> 00:02:36,000
you want to be able to identify which subject corresponds to which sample.

26
00:02:36,750 --> 00:02:42,630
Since we don't really care about that for this task, it's not really important for us. As an exercise,

27
00:02:42,750 --> 00:02:47,190
you might want to check whether or not the same subjects appear in both the train and test sets.

28
00:02:47,670 --> 00:02:52,050
That would indicate how much we can trust our model to generalize to new subjects.
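As a rough sketch of that exercise, the snippet below compares the subject IDs from the two splits. The helper name subject_overlap and the sample IDs are made up for illustration; in the notebook, the IDs would be read from the subject files in the train and test folders.

```python
def subject_overlap(train_ids, test_ids):
    """Return the sorted subject IDs that appear in both splits."""
    return sorted(set(train_ids) & set(test_ids))


# Hypothetical IDs, one per sample, standing in for the two subject files.
train_ids = [1, 1, 3, 5, 5, 6]
test_ids = [2, 2, 4, 4]

shared = subject_overlap(train_ids, test_ids)
print(shared)
```

An empty result would mean every test subject is unseen during training, so the test score says something about generalizing to new people.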

29
00:02:53,550 --> 00:02:55,740
The next file to look at is X_train.txt.

30
00:02:57,390 --> 00:03:03,060
So, as you recall, although the data is a time series, the researchers also created features from

31
00:03:03,060 --> 00:03:05,100
each time series. X_train.txt

32
00:03:05,100 --> 00:03:07,860
is where those features are stored.

33
00:03:08,490 --> 00:03:13,500
As usual, it's formatted with each row being one sample and each column being one feature.

34
00:03:15,310 --> 00:03:18,000
The next step is to look at y_train.txt.

35
00:03:18,510 --> 00:03:20,220
This is where the labels are stored.

36
00:03:21,630 --> 00:03:24,460
And lastly, we have the inertial signals folder.

37
00:03:25,020 --> 00:03:27,090
So this is where the time series are stored.

38
00:03:27,480 --> 00:03:29,370
We'll look inside this folder in a moment.

39
00:03:32,620 --> 00:03:37,150
The next step is to check what's inside the subject_train.txt file.

40
00:03:40,360 --> 00:03:43,070
OK, so basically it's one integer per row.

41
00:03:43,510 --> 00:03:47,230
These are the subject IDs, as you may be able to surmise.

42
00:03:50,160 --> 00:03:52,940
The next step is to check out what's inside X_train.txt.

43
00:03:57,460 --> 00:04:03,070
So basically, it's just a table of numbers. The samples go along the rows and the features go along

44
00:04:03,070 --> 00:04:04,870
the columns as promised.

45
00:04:08,540 --> 00:04:11,000
The next step is to check out what's inside y_train.txt.

46
00:04:15,260 --> 00:04:18,300
So, again, we have a list of integers, one per row.

47
00:04:18,890 --> 00:04:24,140
These are the class labels for each sample. You might want to verify that these labels start from one,

48
00:04:24,440 --> 00:04:27,380
which means we'll need to subtract one from them in the code.

49
00:04:28,330 --> 00:04:32,540
This is because, as you recall, in Python, we start counting from zero.
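A minimal sketch of that shift, using made-up labels rather than the real label file (the array name y_raw is hypothetical):

```python
import numpy as np

# Hypothetical labels in the 1-based form described above.
y_raw = np.array([5, 5, 4, 1, 6, 2, 3])

# Shift to 0-based class indices so the smallest label becomes zero.
y = y_raw - 1
print(y_raw.min(), y.min(), y.max())
```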

50
00:04:35,630 --> 00:04:39,350
OK, so the next step is to look inside of the inertial signals folder.

51
00:04:43,460 --> 00:04:48,590
So you can see that there are nine files corresponding to the nine different time series measurements.

52
00:04:49,250 --> 00:04:54,590
Based on this, we can tell it's going to take some work in order to convert these files into the format

53
00:04:54,590 --> 00:04:55,220
we need.

54
00:04:55,730 --> 00:04:59,870
As you recall, we'd like to have a data set of shape N by T by D.
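One way to picture that target shape is the sketch below, which stacks nine per-signal tables into a single array. The random arrays are stand-ins for the nine files, and the sizes N = 4 and T = 128 are assumptions chosen for illustration; this is not necessarily how the notebook does the loading.

```python
import numpy as np

N, T, D = 4, 128, 9  # small N for illustration; D matches the nine files

# Stand-ins for the nine files, each an N x T table (one row per sample).
signals = [np.random.randn(N, T) for _ in range(D)]

# Stack along a new last axis, so X[n, t, d] is signal d at time t for sample n.
X = np.stack(signals, axis=-1)
print(X.shape)
```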

55
00:05:02,770 --> 00:05:05,830
Note that we can also check what's inside the test folder.

56
00:05:08,990 --> 00:05:13,850
So you can see that we have all the same files except that they end in the word test, which is kind

57
00:05:13,850 --> 00:05:14,660
of redundant.

58
00:05:17,710 --> 00:05:23,260
The next step is to actually look at what's inside one of these time series files. I've chosen

59
00:05:23,260 --> 00:05:26,050
this one randomly, but they all basically look like this.

60
00:05:29,110 --> 00:05:33,550
So, again, it's just a table of numbers. Now, since each row is a sample,

61
00:05:33,850 --> 00:05:36,370
this means that each column is a different point in time.

62
00:05:36,910 --> 00:05:42,040
This is somewhat unconventional since earlier in the course time series like stock prices and airline

63
00:05:42,040 --> 00:05:44,920
passengers had each row as a different point in time.

64
00:05:48,210 --> 00:05:54,570
OK, so now just to show you how this works, I'm going to read in one of these CSV files using pandas.

65
00:05:55,140 --> 00:05:56,910
Since the files don't have any headers,

66
00:05:57,060 --> 00:05:58,680
I'm going to set header to None.

67
00:05:59,400 --> 00:06:04,960
And since each data point is separated by whitespace, I'm going to set delim_whitespace to True.

68
00:06:05,670 --> 00:06:09,180
This is unlike a regular CSV which uses commas.
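Here is a small sketch of that kind of read, using an in-memory string in place of the real file (the sample values are made up). It passes sep=r"\s+", which also splits on runs of whitespace; that substitution is an assumption on my part, made because newer pandas releases mark the whitespace flag as deprecated.

```python
import io

import pandas as pd

# Hypothetical stand-in for one of the files: whitespace-separated, no header row.
raw = io.StringIO(
    "1.0e-01 2.0e-01   3.0e-01\n"
    "4.0e-01 5.0e-01   6.0e-01\n"
)

# sep=r"\s+" splits on runs of whitespace; header=None keeps the first row as data.
df = pd.read_csv(raw, header=None, sep=r"\s+")
print(df.shape)
```

With header set to None, pandas numbers the columns 0, 1, 2 and so on, which here correspond to time steps.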

69
00:06:14,120 --> 00:06:18,790
OK, so the next step is to do a df.head() to see what's inside our data frame.

70
00:06:22,780 --> 00:06:25,630
So basically, it's a table of numbers as expected.

71
00:06:26,260 --> 00:06:30,250
Now, keep in mind that this is just a convenient way to read in this data.

72
00:06:30,730 --> 00:06:34,900
What we eventually want is to store all this data inside a NumPy array.

73
00:06:39,100 --> 00:06:44,010
OK, so let's do a df.info() to see some basic information about our data.

74
00:06:47,230 --> 00:06:50,720
The important things to notice here are the range index and the columns.

75
00:06:51,610 --> 00:06:57,100
Basically, this says that we have seven thousand three hundred fifty two rows and one hundred twenty

76
00:06:57,100 --> 00:06:57,980
eight columns.

77
00:06:58,540 --> 00:07:01,930
So this means that we have seven thousand three hundred fifty two samples.

78
00:07:02,110 --> 00:07:06,010
And each time series has one hundred twenty eight measurements, as we discussed.

79
00:07:08,740 --> 00:07:12,390
The next step is to call plot on one of these time series.

80
00:07:13,180 --> 00:07:16,210
So this should give you a sense of what these time series look like.

81
00:07:20,750 --> 00:07:26,840
OK, so at this point, I feel like we understand how our data is organized, so we should now be ready

82
00:07:26,840 --> 00:07:30,680
to actually load it into a format that we can use with TensorFlow.
