1
00:00:11,650 --> 00:00:17,680
In this lecture, we're going to do code preparation for our linear classification script. This example

2
00:00:17,680 --> 00:00:19,840
will look at a new classification dataset.

3
00:00:20,170 --> 00:00:22,880
So this lecture will go over loading the data as well.

4
00:00:24,230 --> 00:00:27,770
To recap, here are the general tasks that we need to complete.

5
00:00:27,800 --> 00:00:33,380
Number one, load the data. Number two, create the model. Number three, train the model.

6
00:00:33,450 --> 00:00:35,510
And number four, evaluate the model.

7
00:00:40,650 --> 00:00:42,780
Let's start with loading in the data.

8
00:00:42,780 --> 00:00:46,740
For this example we'll be looking at the famous breast cancer dataset.

9
00:00:46,860 --> 00:00:52,380
This dataset happens to be included as part of the scikit-learn API.

10
00:00:52,470 --> 00:00:57,170
You'll notice that in this course we'll be looking at quite a few pretty famous datasets.

11
00:00:57,300 --> 00:01:02,780
These datasets are so famous that they are often included in the various machine learning libraries.

12
00:01:02,910 --> 00:01:08,130
So there's no need to download the dataset as we did in our previous example.

13
00:01:08,300 --> 00:01:12,820
Of course there will be plenty of examples in this course where the data do not come from libraries.

14
00:01:12,860 --> 00:01:19,270
So you'll get experience with each. Loading in data from scikit-learn is super easy.

15
00:01:19,270 --> 00:01:23,490
We just call the function load_breast_cancer, which returns a data object.

16
00:01:23,500 --> 00:01:29,590
This object contains the X's and Y's, so we'll need to access them using the object's attributes. In particular,

17
00:01:29,860 --> 00:01:34,900
the inputs can be accessed using the data attribute, and the targets can be accessed using the target

18
00:01:34,900 --> 00:01:40,560
attribute.
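The loading step just described might look like the following minimal sketch. The variable names X, y, N and D are illustrative choices, not taken from the lecture's script.

```python
from sklearn.datasets import load_breast_cancer

# Load the dataset bundled with scikit-learn; no download is needed.
data = load_breast_cancer()

# Inputs live in the `data` attribute, targets in the `target` attribute.
X, y = data.data, data.target

# N is the number of samples, D is the number of features.
N, D = X.shape
print(X.shape, y.shape)
```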

19
00:01:40,580 --> 00:01:44,460
There are two ways we'll want to preprocess the data before using it.

20
00:01:44,600 --> 00:01:47,120
First the data isn't normalized.

21
00:01:47,120 --> 00:01:50,390
You already learned from our earlier example why normalizing is a good idea,

22
00:01:50,660 --> 00:01:56,750
if you did the exercise I suggested. Therefore, we will use the scikit-learn class StandardScaler to

23
00:01:56,750 --> 00:01:58,970
normalize it.

24
00:01:59,120 --> 00:02:03,020
Second, we want to split the data into train and test sets.

25
00:02:03,110 --> 00:02:08,210
Intuitively, this is because we want to get a good idea of how the model will perform on data

26
00:02:08,210 --> 00:02:09,490
it hasn't seen before,

27
00:02:09,650 --> 00:02:14,290
not on data it has already seen. For the data we've already seen,

28
00:02:14,290 --> 00:02:17,890
we already know the answer, so machine learning isn't necessary.

29
00:02:17,920 --> 00:02:21,180
The data we really care about is the data we have not seen.

30
00:02:21,190 --> 00:02:26,920
For example, if you build a fraud detector, you want to be able to ask your model whether a new transaction

31
00:02:26,920 --> 00:02:28,260
is fraudulent.

32
00:02:28,270 --> 00:02:31,710
This is important because your model might do very well on data

33
00:02:31,720 --> 00:02:35,430
it's already seen but poorly on data it hasn't seen.

34
00:02:35,500 --> 00:02:38,810
We'll discuss this more in a later lecture.

35
00:02:38,870 --> 00:02:43,600
In any case this code shows you how to do both of these steps.

36
00:02:43,610 --> 00:02:49,220
Notice how we already use this idea of train-test splits with the StandardScaler. We fit the

37
00:02:49,220 --> 00:02:55,130
scaler on the training data only, and we apply the standardization to the test data using the fitted

38
00:02:55,130 --> 00:02:57,070
mean and variance of the training data.
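Both preprocessing steps could be sketched roughly as follows. The test_size and random_state values are assumptions chosen for illustration. The lecture does not state which values it uses.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Hold out part of the data so we can measure performance on unseen samples.
# test_size=0.33 and random_state=0 are assumed values, not from the lecture.
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=0
)

scaler = StandardScaler()
# Fit on the training data only: this learns the train mean and variance.
X_train = scaler.fit_transform(X_train)
# Reuse the fitted train statistics on the test data; do not refit.
X_test = scaler.transform(X_test)
```

After this step the training columns have mean 0 and variance 1, while the test columns are only approximately standardized, because they are scaled with the training statistics.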

39
00:03:02,200 --> 00:03:05,350
Once we've prepared our data, it's time to build a model.

40
00:03:05,560 --> 00:03:11,530
As you know, the linear classifier is almost the same as linear regression, just with one extra step:

41
00:03:11,530 --> 00:03:17,780
the sigmoid. This should be a hint that we're still going to have a Linear object somewhere in the model.

42
00:03:17,800 --> 00:03:23,080
Well, it just so happens that the sigmoid is also represented in PyTorch with an object, the Sigmoid

43
00:03:23,080 --> 00:03:29,320
object. It should make complete sense, then, if we combine the Linear object with the Sigmoid object in

44
00:03:29,320 --> 00:03:32,890
sequence, and in fact that's exactly what we've done.

45
00:03:33,160 --> 00:03:38,950
PyTorch allows you to easily stack these computation steps in a sort of wrapper object called Sequential.

46
00:03:40,850 --> 00:03:46,190
What you're telling PyTorch is: I want my model to apply these functions in this order. And the functions

47
00:03:46,190 --> 00:03:53,120
we want to apply in this case are the linear model, then the sigmoid. There's one detail here that I want

48
00:03:53,120 --> 00:03:59,030
you to notice, and this is that when we create the Linear layer, the input size is D while the output size

49
00:03:59,030 --> 00:04:00,320
is 1.

50
00:04:00,410 --> 00:04:06,200
As you recall, our data matrix is of shape N by D, where N is the number of samples and D is the number

51
00:04:06,200 --> 00:04:07,270
of features.

52
00:04:07,430 --> 00:04:11,660
So we'll have one input for each feature column in the dataset.

53
00:04:11,660 --> 00:04:16,580
In addition, we have one output, which represents the probability that the input should be classified

54
00:04:16,610 --> 00:04:22,230
as a 1.
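A minimal sketch of the model described here, assuming D = 30 features (the column count of the breast cancer dataset); the random batch x is only there to show the output shape.

```python
import torch
import torch.nn as nn

D = 30  # assumed number of input features

# Apply the layers in order: linear model first, then the sigmoid.
model = nn.Sequential(
    nn.Linear(D, 1),  # D inputs, 1 output
    nn.Sigmoid(),     # squashes the output into a probability
)

# A dummy batch of 5 samples, just to check shapes.
x = torch.randn(5, D)
p = model(x)
print(p.shape)  # torch.Size([5, 1])
```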

55
00:04:22,320 --> 00:04:28,020
The next step is to train the model. As promised, we are still doing gradient descent, which means nothing

56
00:04:28,020 --> 00:04:31,000
about our gradient descent loop is going to change.

57
00:04:31,020 --> 00:04:37,260
We still do zero_grad, get the output, calculate the loss, call backward, and do one step of a gradient update.

58
00:04:37,410 --> 00:04:43,710
What's different is that we'll be using a different cost function and also a different optimizer. As

59
00:04:43,710 --> 00:04:44,860
mentioned previously,

60
00:04:44,940 --> 00:04:50,880
our loss function for binary classification is the binary cross entropy, which is implemented in the object

61
00:04:50,940 --> 00:04:53,130
BCELoss.
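To make the binary cross entropy concrete, here is a plain-Python sketch of the quantity being averaged. The function name and the sample values are hypothetical, and this is an illustration of the formula rather than PyTorch's own implementation.

```python
import math

def binary_cross_entropy(probabilities, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for p, y in zip(probabilities, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(targets)

# Confident and correct predictions give a small loss...
confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
# ...while confident and wrong predictions give a large one.
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
print(confident_right, confident_wrong)
```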

62
00:04:53,130 --> 00:04:58,320
In addition, we'll be using the Adam optimizer, which has become the go-to default in deep learning in

63
00:04:58,320 --> 00:05:00,840
recent years.
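Put together, the training loop might be sketched like this. The synthetic data, the epoch count and the default Adam settings are all assumptions for illustration. The lecture's script trains on the breast cancer data instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in data (an assumption, not the lecture's dataset).
N, D = 200, 5
X = torch.randn(N, D)
# Targets must be floats of shape (N, 1) to match the model output.
y = (X[:, 0] + X[:, 1] > 0).float().reshape(-1, 1)

model = nn.Sequential(nn.Linear(D, 1), nn.Sigmoid())
criterion = nn.BCELoss()                          # binary cross entropy
optimizer = torch.optim.Adam(model.parameters())  # default learning rate

losses = []
for epoch in range(200):  # epoch count is an arbitrary choice
    optimizer.zero_grad()            # clear old gradients
    outputs = model(X)               # get the output
    loss = criterion(outputs, y)     # calculate the loss
    loss.backward()                  # compute gradients
    optimizer.step()                 # one gradient update
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Swapping Adam for another optimizer, such as torch.optim.SGD, only changes the optimizer line, which makes it easy to compare results.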

64
00:05:00,900 --> 00:05:05,580
Normally I would say that the gradient optimizer you choose is like a hyperparameter, so you should

65
00:05:05,580 --> 00:05:09,340
always experiment to see what works best. In practice,

66
00:05:09,360 --> 00:05:12,350
many people simply choose Adam by default.

67
00:05:12,360 --> 00:05:15,880
This does not mean that Adam is actually guaranteed to work best.

68
00:05:16,020 --> 00:05:19,530
So you should still try other methods and observe the results yourself.

69
00:05:21,410 --> 00:05:26,410
If you want to learn more about Adam, you're encouraged to check the in-depth section of this course.

70
00:05:26,570 --> 00:05:30,800
And if that's still not enough then you'll want to go through the in-depth course where I discuss this

71
00:05:30,800 --> 00:05:33,440
algorithm in detail and build it up from scratch.

72
00:05:38,530 --> 00:05:42,230
The last thing we're going to do in our script is evaluate the model.

73
00:05:42,310 --> 00:05:46,720
This is different in classification compared to regression. In regression,

74
00:05:46,720 --> 00:05:52,330
we use the mean squared error as our loss, but we also use it as our evaluation metric.

75
00:05:52,330 --> 00:05:57,890
It kind of makes sense since I don't think any other metric would be significantly more advantageous.

76
00:05:57,910 --> 00:06:03,680
The mean squared error is kind of a natural way to look at the regression error. Alternatively, you could

77
00:06:03,680 --> 00:06:08,630
look at the root mean squared error, which is just the square root of the MSE, so that it's in the same

78
00:06:08,630 --> 00:06:15,730
units as the target. But still, taking the square root is kind of a trivial transformation. With classification,

79
00:06:15,740 --> 00:06:16,700
it's a different story.
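For reference, the two regression metrics mentioned above can be sketched in plain Python. The function name and the sample numbers are made up for illustration.

```python
import math

def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

preds = [2.0, 4.0, 6.0]
targets = [1.0, 4.0, 8.0]

mse = mean_squared_error(preds, targets)  # squared units of the target
rmse = math.sqrt(mse)                     # same units as the target
print(mse, rmse)
```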

80
00:06:17,770 --> 00:06:22,620
We have the cross entropy as a loss which is not a natural or intuitive measurement.

81
00:06:22,660 --> 00:06:25,150
What we really care about is accuracy.

82
00:06:25,150 --> 00:06:27,910
Out of all my predictions, how many did I get right?

83
00:06:28,030 --> 00:06:29,930
And how many did I get wrong?

84
00:06:29,950 --> 00:06:35,470
So in order to evaluate our model we're going to calculate both the train and test accuracy.

85
00:06:35,470 --> 00:06:40,420
This necessarily also involves making predictions with the trained model, so we'll learn how to

86
00:06:40,420 --> 00:06:42,110
do that too.

87
00:06:42,220 --> 00:06:46,220
Of course it's not so different from how we make predictions with our regression model.

88
00:06:46,300 --> 00:06:51,850
We just need to round our prediction to 0 or 1 since the targets are encoded as 0 or 1.
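The rounding and accuracy calculation can be sketched in plain Python as follows. The function name and the sample values are hypothetical, and an explicit 0.5 threshold stands in for rounding so that the behavior at exactly 0.5 is unambiguous.

```python
def accuracy(probabilities, targets):
    """Fraction of samples whose thresholded prediction matches the target."""
    # Turn each predicted probability into a hard 0/1 prediction.
    predictions = [1 if p >= 0.5 else 0 for p in probabilities]
    correct = sum(int(pred == t) for pred, t in zip(predictions, targets))
    return correct / len(targets)

probs = [0.9, 0.2, 0.6, 0.4]   # made-up model outputs
targets = [1, 0, 0, 0]         # made-up true labels
print(accuracy(probs, targets))  # 0.75
```

Running the same function on the train predictions and on the test predictions gives the two accuracies the lecture refers to.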