1
00:00:11,060 --> 00:00:17,180
So in this lecture, we will be looking at the notebook to implement binary classification using TensorFlow.

2
00:00:18,110 --> 00:00:23,690
Note that the main outline of this notebook will be the same as our previous sentiment analysis example.

3
00:00:24,290 --> 00:00:27,970
The only difference is that we'll be using TensorFlow instead of scikit-learn.

4
00:00:28,850 --> 00:00:33,590
So let's begin by downloading our data set, which is a set of tweets about airlines.

5
00:00:41,710 --> 00:00:44,860
The next step is to import everything we need for this script.

6
00:00:45,460 --> 00:00:48,490
Note that we've imported the binary cross-entropy loss.

7
00:00:56,010 --> 00:01:01,950
The next step is to set random seeds for NumPy and TensorFlow so that we all get consistent results.

8
00:01:07,940 --> 00:01:12,020
The next step is to read in our data frame using pd.read_csv.

9
00:01:17,540 --> 00:01:21,740
The next step is to call the head method to remind ourselves what this data looks like.

10
00:01:26,870 --> 00:01:31,970
OK, so as you recall, there are many columns in this data frame, but the ones we are interested

11
00:01:31,970 --> 00:01:35,210
in are the airline sentiment and the text of the tweet.

12
00:01:40,110 --> 00:01:44,880
The next step is to select the two relevant columns and to reassign this to df.

13
00:01:50,440 --> 00:01:53,080
The next step is to draw a histogram of our labels.

14
00:01:58,680 --> 00:02:04,470
So as you recall, this is a three-class data set with the classes positive, negative and neutral.

15
00:02:05,070 --> 00:02:06,840
We'll be ignoring the neutral class.

16
00:02:07,590 --> 00:02:13,620
Also recall that the data set is imbalanced, with negative tweets far more common than positive tweets.

17
00:02:17,660 --> 00:02:23,000
The next step is to filter out all of the neutral tweets, since we only want positive and negative

18
00:02:23,000 --> 00:02:24,170
for this analysis.

19
00:02:24,770 --> 00:02:29,840
We'll also want to make a copy at this point, since we'll be assigning new columns to this data frame.

20
00:02:34,670 --> 00:02:41,300
The next step is to create a mapping from the label to an integer representation. As you recall, deep

21
00:02:41,300 --> 00:02:46,690
neural networks work with numbers, and therefore this will be needed. Unlike scikit-learn,

22
00:02:46,730 --> 00:02:50,240
there is no functionality to do this kind of conversion internally.
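
As a rough illustration of the label-to-integer mapping just described, here is a plain-Python sketch. The dictionary name `target_map` and the sample labels are assumptions made up for this example; the notebook applies the same idea to a column of the data frame.

```python
# Hypothetical mapping from sentiment label to integer target.
# The name `target_map` and the sample labels are illustrative assumptions.
target_map = {"positive": 1, "negative": 0}

labels = ["negative", "positive", "neutral", "negative"]

# Drop the neutral class first, then map the remaining labels to integers.
binary_labels = [label for label in labels if label != "neutral"]
targets = [target_map[label] for label in binary_labels]

print(targets)  # [0, 1, 0]
```

Mapping positive to 1 means that words with large positive weights push the prediction toward the positive class.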

23
00:02:55,350 --> 00:02:59,660
The next step is to call the head method once again to see what our new data frame looks like.

24
00:03:04,120 --> 00:03:09,610
OK, so now our data frame has three columns: the sentiment, the text, and the binary target.

25
00:03:13,430 --> 00:03:16,460
The next step is to split our data into train and test.

26
00:03:21,890 --> 00:03:25,550
The next step is to convert our data into TF-IDF format.

27
00:03:31,180 --> 00:03:36,700
Now, as you recall, the TF-IDF matrices returned by scikit-learn are sparse matrices.

28
00:03:37,270 --> 00:03:40,180
Unfortunately, TensorFlow doesn't know how to deal with these.

29
00:03:40,540 --> 00:03:46,450
So we'll call the toarray method to convert them to regular NumPy arrays. Since we're only keeping

30
00:03:46,450 --> 00:03:47,530
2,000 features,

31
00:03:47,770 --> 00:03:48,790
this is not an issue.

32
00:03:53,810 --> 00:03:58,970
The next step is to assign our targets to convenient variable names, Y train and Y test.

33
00:04:03,230 --> 00:04:08,390
The next step is to get the number of input dimensions for our model, which is the number of columns

34
00:04:08,390 --> 00:04:09,110
in X train.

35
00:04:14,210 --> 00:04:15,860
The next step is to build our model.

36
00:04:16,610 --> 00:04:22,100
Note that it includes a single dense layer, which represents the expression w transpose x plus b.

37
00:04:23,300 --> 00:04:24,150
As you recall,

38
00:04:24,170 --> 00:04:27,050
this is a model with D inputs and one output.

39
00:04:27,800 --> 00:04:34,100
Also notice the lack of use of the sigmoid, since this will be accounted for in the binary cross entropy

40
00:04:34,100 --> 00:04:34,700
loss.

41
00:04:40,970 --> 00:04:44,990
The next step is to call model.summary to get a sense of our model structure.

42
00:04:49,970 --> 00:04:56,270
As you can see, our dense layer has 2,001 parameters. As an exercise, please pause this

43
00:04:56,270 --> 00:04:58,640
video and make sure that this makes sense to you.

44
00:05:01,880 --> 00:05:07,730
OK, so hopefully you were able to think about why this makes sense. Since we have 2,000 inputs,

45
00:05:08,030 --> 00:05:10,730
we should also have 2,000 corresponding weights.

46
00:05:11,330 --> 00:05:14,870
On top of that, we have one bias term since there is one output.
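
To make the parameter count concrete, here is a minimal sketch in plain Python. The helper name `dense_param_count` is hypothetical and not part of TensorFlow; it only reproduces the arithmetic for a fully connected layer.

```python
def dense_param_count(n_inputs: int, n_outputs: int) -> int:
    """Number of trainable parameters in a fully connected (dense) layer."""
    weights = n_inputs * n_outputs  # one weight per input-output pair
    biases = n_outputs              # one bias term per output unit
    return weights + biases


print(dense_param_count(2000, 1))  # 2001
```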

47
00:05:18,750 --> 00:05:20,760
The next step is to call the compile method.

48
00:05:21,480 --> 00:05:27,840
As mentioned, the correct loss for binary classification is the binary cross-entropy, and we set from_logits

49
00:05:27,840 --> 00:05:28,860
to True,

50
00:05:29,190 --> 00:05:31,350
since our model does not apply a sigmoid.

51
00:05:32,070 --> 00:05:37,440
Otherwise, the default for from_logits is False, and you would have to apply the sigmoid explicitly.
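
To see why the sigmoid can be folded into the loss, here is a small sketch in plain Python rather than TensorFlow. It compares the cross-entropy computed from a probability with a commonly used numerically stable form computed directly from the logit. The function names are assumptions for this example, and this is not a claim about how TensorFlow implements the loss internally.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_probability(y: int, p: float) -> float:
    # Standard binary cross-entropy, given a probability p = sigmoid(z).
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def bce_from_logit(y: int, z: float) -> float:
    # Same quantity computed directly from the logit z, in a stable form:
    # max(z, 0) - z * y + log(1 + exp(-|z|))
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# Illustrative (target, logit) pairs; the two columns should agree.
for y, z in [(1, 2.0), (0, 2.0), (1, -1.5), (0, -1.5)]:
    print(y, z, bce_from_logit(y, z), bce_from_probability(y, sigmoid(z)))
```

Working from the logit avoids taking the log of a probability that has been rounded to exactly 0 or 1, which is one reason to leave the sigmoid out of the model.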

52
00:05:39,030 --> 00:05:43,950
Again, we'll use the Adam optimizer with a non-default learning rate of 0.01,

53
00:05:44,310 --> 00:05:46,560
since this helps training proceed more quickly.

54
00:05:47,730 --> 00:05:50,730
Finally, note that we also want to track the accuracy metric

55
00:05:51,000 --> 00:05:52,440
in addition to the loss.

56
00:05:57,890 --> 00:06:00,080
So the next step is to call the fit method.

57
00:06:00,770 --> 00:06:05,060
Most of these arguments you've seen before, except for perhaps validation_data.

58
00:06:05,690 --> 00:06:11,450
This is where you pass in your out-of-sample data so that we can compute out-of-sample metrics at the

59
00:06:11,450 --> 00:06:12,530
end of each epoch.

60
00:06:18,290 --> 00:06:24,110
So notice that during this process, we get information about the training loss, the training accuracy

61
00:06:24,470 --> 00:06:27,200
and the validation loss and the validation accuracy.

62
00:06:31,590 --> 00:06:33,900
The next step is to plot the loss per epoch.

63
00:06:40,010 --> 00:06:46,010
As you can see, we get a nice decrease on each step, but notice how the validation loss does not go

64
00:06:46,010 --> 00:06:48,000
as low. As usual,

65
00:06:48,020 --> 00:06:51,290
machine learning models tend to perform better on the train set.

66
00:06:52,160 --> 00:06:56,660
Also notice how the validation loss increases a bit at the later epochs.

67
00:06:57,140 --> 00:07:00,950
This could suggest a bit of overfitting. In order to rectify this,

68
00:07:00,980 --> 00:07:06,200
we could use a callback to save the best model during training, according to the validation loss.
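
The idea behind such a callback can be sketched in plain Python: keep track of the epoch with the lowest validation loss. The loss values below are made up purely for illustration, and `best_epoch` is a hypothetical helper. In Keras, the `ModelCheckpoint` callback with `monitor='val_loss'` and `save_best_only=True` does this bookkeeping and saves the model for you.

```python
def best_epoch(val_losses):
    """Return (index, loss) of the epoch with the lowest validation loss."""
    best_idx = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_idx, val_losses[best_idx]


# Made-up validation losses that dip and then rise again (mild overfitting).
val_losses = [0.60, 0.45, 0.38, 0.36, 0.37, 0.39]

print(best_epoch(val_losses))  # (3, 0.36)
```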

69
00:07:10,550 --> 00:07:13,310
The next step is to look at the accuracy on each epoch.

70
00:07:19,470 --> 00:07:20,970
So again, we see the same pattern.

71
00:07:21,510 --> 00:07:27,390
There is a nice increase on each step, and the validation accuracy is a bit lower. Note again how the

72
00:07:27,390 --> 00:07:31,100
accuracy for the validation set decreases a bit near the end.

73
00:07:35,890 --> 00:07:42,070
The next step is to get our model's predictions. Now, because our model outputs logits and not probabilities,

74
00:07:42,370 --> 00:07:46,510
we won't round the outputs, but rather check whether or not they are greater than zero.

75
00:07:47,320 --> 00:07:52,540
After doing this, we get an array of Trues and Falses, which we would like to convert into numbers,

76
00:07:52,840 --> 00:07:54,310
so we multiply by one.

77
00:07:55,180 --> 00:08:00,010
Finally, we call flatten so that p train and p test become one dimensional arrays.
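
Here is a small plain-Python sketch of that prediction step, using made-up logits in place of real model outputs. It thresholds at zero, multiplies by one to turn booleans into integers, and flattens the N-by-1 output. It also checks that this matches thresholding the sigmoid at 0.5.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# Made-up logits with shape (N, 1), like the raw outputs of a one-unit model.
logits = [[-2.3], [0.7], [0.0], [4.1]]

# Compare each logit to zero, multiply the boolean by 1 to get an integer,
# and take row[0] to flatten the (N, 1) structure into a flat list.
predictions = [(row[0] > 0) * 1 for row in logits]
print(predictions)  # [0, 1, 0, 1]

# Equivalent check on probabilities: sigmoid(z) > 0.5 exactly when z > 0.
by_probability = [int(sigmoid(row[0]) > 0.5) for row in logits]
print(by_probability)  # [0, 1, 0, 1]
```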

78
00:08:06,950 --> 00:08:10,160
The next step is to compute the confusion matrix for the train set.

79
00:08:17,140 --> 00:08:20,320
The next step is to plot the confusion matrix for the train set.

80
00:08:27,350 --> 00:08:32,860
As expected, we see better performance for the negative class, which is the over-represented class.

81
00:08:37,020 --> 00:08:40,020
The next step is to plot the confusion matrix for the test set.

82
00:08:45,630 --> 00:08:48,480
Again, we see better performance for the negative class.

83
00:08:53,740 --> 00:08:58,660
Now, since this is an imbalanced dataset, we should compute other metrics like the AUC.

84
00:09:04,950 --> 00:09:09,810
In this case, we get an AUC in the 90s for both training and test, which is pretty good.

85
00:09:14,120 --> 00:09:16,340
The next step is to check the F1 score.

86
00:09:20,790 --> 00:09:23,670
In this case, our train F1 is over 90 percent,

87
00:09:24,030 --> 00:09:26,670
but our test F1 is about 79 percent.
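
To show how the F1 score relates to the confusion matrix, here is a plain-Python sketch on a tiny made-up imbalanced example. The labels are not taken from the airline data, and the helper names are hypothetical; the notebook relies on library functions for these metrics.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


def f1_from_labels(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up imbalanced example: six negatives, four positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

print(confusion_counts(y_true, y_pred))  # (5, 1, 2, 2)
print(f1_from_labels(y_true, y_pred))    # 4/7, about 0.571
```

In this toy example the accuracy is 70 percent while the F1 score is only about 57 percent, which is why it is worth checking metrics beyond accuracy on imbalanced data.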

88
00:09:30,620 --> 00:09:33,440
The next step is to check the layers attribute of our model.

89
00:09:37,860 --> 00:09:41,580
Again, we have our input layer and the dense layer at position two.

90
00:09:45,720 --> 00:09:48,420
The next step will be to get the weights of the dense layer.

91
00:09:53,700 --> 00:10:00,330
So again, note that this returns a list of two NumPy arrays containing the W matrix and the b vector.

92
00:10:04,370 --> 00:10:08,150
The next step will be to assign the weights of our model to a variable called W.

93
00:10:13,380 --> 00:10:17,640
The next step is to grab the word-to-index mapping. As you recall,

94
00:10:17,670 --> 00:10:22,780
this is useful because we want to know which words have the most positive and negative weights.

95
00:10:23,370 --> 00:10:27,720
This should help us interpret the model and check whether or not it's doing something that makes

96
00:10:27,720 --> 00:10:28,290
sense.

97
00:10:37,240 --> 00:10:41,080
The next step is to print the most positive words, according to our weights.

98
00:10:41,740 --> 00:10:45,910
Now you'll notice that this code is a bit strange, but this is due to legacy reasons.

99
00:10:46,300 --> 00:10:50,950
And by that, I mean the previous iteration of this notebook, which was written in scikit-learn.

100
00:10:51,790 --> 00:10:56,740
Instead of printing all the words that have a weight above some threshold, I've simply decided to print

101
00:10:56,740 --> 00:10:57,970
the top 10 words.

102
00:10:58,630 --> 00:11:03,790
What we could have done is just sort all the words and print the top 10, and not check the threshold

103
00:11:03,790 --> 00:11:04,240
at all.

104
00:11:05,200 --> 00:11:07,540
In any case, this has no effect on the results.
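
The simpler approach just mentioned, sorting the words by weight and taking the top few, might look like the following plain-Python sketch. The vocabulary, the weights, and the function name `top_words` are all made up for illustration and are not the notebook's actual values or code.

```python
# Made-up word-to-index mapping and weights, for illustration only.
word_index_map = {"thanks": 0, "delayed": 1, "great": 2, "the": 3, "worst": 4}
weights = [2.1, -1.7, 1.8, 0.05, -2.4]


def top_words(word_index_map, weights, k, most_positive=True):
    """Return the k words with the largest (or smallest) weights."""
    ranked = sorted(
        word_index_map.items(),
        key=lambda item: weights[item[1]],  # look up each word's weight by index
        reverse=most_positive,
    )
    return [word for word, _ in ranked[:k]]


print(top_words(word_index_map, weights, 2))                       # ['thanks', 'great']
print(top_words(word_index_map, weights, 2, most_positive=False))  # ['worst', 'delayed']
```

Sorting once and slicing removes the need for a threshold, and flipping the sort order gives the most negative words with the same function.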

105
00:11:14,350 --> 00:11:19,700
OK, so the top positive words are thank, thanks, worries, great, awesome, love,

106
00:11:19,720 --> 00:11:21,250
excellent, kudos,

107
00:11:21,250 --> 00:11:23,860
amazing, and best, which all makes sense.

108
00:11:28,550 --> 00:11:33,110
Now, let's go to the negative words, which basically have the same code, but in reverse order.

109
00:11:39,670 --> 00:11:46,930
So the top negative words are worst, paid, not, rude, disappointed, nothing, website, hung, instead,

110
00:11:46,930 --> 00:11:49,690
and lists, most of which make a lot of sense.

111
00:11:54,700 --> 00:12:00,250
OK, so that's logistic regression in TensorFlow. As a final exercise for this lecture,

112
00:12:00,640 --> 00:12:04,450
please take this code and adapt it for a spam detection dataset.