1
00:00:11,030 --> 00:00:17,010
So in this lecture, we'll be continuing our discussion about TensorFlow syntax. Specifically,

2
00:00:17,030 --> 00:00:22,430
we'll be looking at classification with real data instead of regression with synthetic data.

3
00:00:23,240 --> 00:00:28,910
Now, just in case you're curious, students taking a more naive approach sometimes ask me, why do

4
00:00:28,910 --> 00:00:30,200
we use synthetic data?

5
00:00:30,860 --> 00:00:36,260
The answer is that this is how we check and verify that our model is working as intended.

6
00:00:36,860 --> 00:00:42,290
As you saw, we were able to check whether or not our model came up with the right answer because we

7
00:00:42,290 --> 00:00:43,550
knew the right answer.

8
00:00:44,120 --> 00:00:49,550
This is unlike real world data where you do not know the answer, and therefore you cannot check whether

9
00:00:49,550 --> 00:00:50,720
or not you are correct.

10
00:00:51,410 --> 00:00:54,800
In any case, the next notebook will look at binary classification,

11
00:00:55,190 --> 00:00:59,120
some examples of which are spam detection and sentiment analysis.

12
00:00:59,780 --> 00:01:06,800
As you recall, our data will be in the form of an input X matrix of shape N by D and a target y vector

13
00:01:06,800 --> 00:01:07,160
of length

14
00:01:07,160 --> 00:01:11,750
N. Note that because we'll be discussing binary classification,

15
00:01:12,110 --> 00:01:15,440
this target vector will contain only zeros and ones.

16
00:01:16,010 --> 00:01:21,470
The X matrix will be a TF-IDF matrix for now, but you'll see how we can get rid of this later in the

17
00:01:21,470 --> 00:01:22,070
course.

18
00:01:26,640 --> 00:01:32,220
So let's discuss the differences and similarities between this model and the previous model we looked

19
00:01:32,220 --> 00:01:32,430
at.

20
00:01:33,300 --> 00:01:38,610
Firstly, notice how they both start with an input layer. Because their inputs have dimensionality

21
00:01:38,620 --> 00:01:44,220
D, we pass in a tuple with one element containing D into the argument for shape.

22
00:01:45,090 --> 00:01:50,730
The next layer is again a dense layer, which, as you recall, represents an affine transformation.

23
00:01:51,360 --> 00:01:55,860
Simply put, it represents W transpose x plus b.

24
00:01:55,860 --> 00:01:58,950
Here, W is a weight matrix and b is a bias vector.

25
00:01:59,850 --> 00:02:06,030
But again, note that this is the generic case. Because the output of this dense layer only has one element,

26
00:02:06,480 --> 00:02:12,990
W can effectively be thought of as a D by one vector, and b can effectively be thought of as a scalar,

27
00:02:13,320 --> 00:02:16,170
even though in practice it will be a vector with one element.

28
00:02:16,950 --> 00:02:23,370
Now, as you recall for logistic regression, we also apply a sigmoid function after this transformation,

29
00:02:23,730 --> 00:02:28,710
which maps the output to a probability between zero and one.

30
00:02:28,740 --> 00:02:34,200
In TensorFlow, you don't have to write this function yourself. Instead, you simply pass in the string "sigmoid"

31
00:02:34,560 --> 00:02:40,620
as the argument for activation, which automatically applies the sigmoid after doing W transpose

32
00:02:40,620 --> 00:02:41,400
x plus b.

33
00:02:42,360 --> 00:02:47,250
Note that in deep learning, we call these functions activation functions, which explains the name

34
00:02:47,250 --> 00:02:48,060
of the argument.

35
00:02:49,560 --> 00:02:56,040
Finally, note that our instantiation of the model object is the same as before. We pass in the input

36
00:02:56,040 --> 00:02:59,310
as the first argument and the output as the second argument.
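
The model just described can be sketched in a few lines of Keras. This is a minimal sketch rather than the exact notebook code: the value of D and the random demo inputs are assumptions made up for illustration.

```python
import numpy as np
import tensorflow as tf

D = 30  # hypothetical number of input features (assumption, not from the lecture)

# Input layer: a tuple with one element containing D goes into `shape`
i = tf.keras.layers.Input(shape=(D,))

# Dense layer with one output unit, followed by a sigmoid activation
x = tf.keras.layers.Dense(1, activation='sigmoid')(i)

# Model object: input as the first argument, output as the second
model = tf.keras.models.Model(i, x)

# Made-up data, only to show that the outputs are probabilities
X_demo = np.random.randn(5, D).astype(np.float32)
p = model.predict(X_demo, verbose=0)
print(p.shape)  # one probability per sample
```

With untrained weights the probabilities are meaningless, but each one already lies between zero and one because of the sigmoid.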

37
00:03:04,080 --> 00:03:06,480
The next step is to consider the compile function.

38
00:03:07,260 --> 00:03:12,570
The major difference here is that for binary classification, we use a loss function called the binary

39
00:03:12,570 --> 00:03:13,380
cross entropy.

40
00:03:14,130 --> 00:03:17,580
In this course, we won't discuss the details for why this is the case.

41
00:03:18,000 --> 00:03:22,640
But note that if you'd like to learn more, you can check out the resources in

42
00:03:22,780 --> 00:03:23,280
extra_reading.txt.

43
00:03:24,910 --> 00:03:31,270
I've also taught this in my in-depth series of courses, where such detail would be more appropriate.

44
00:03:31,270 --> 00:03:32,190
For the optimizer,

45
00:03:32,290 --> 00:03:38,140
note that we again choose Adam as a default, which seems to work just fine.

46
00:03:38,170 --> 00:03:42,820
As with the loss, there is more in-depth knowledge about this that I've taught in the past, so you can check that out

47
00:03:42,820 --> 00:03:43,630
if you're curious.

48
00:03:43,990 --> 00:03:46,240
But it would be inappropriate at this point.

49
00:03:47,290 --> 00:03:52,520
The final argument is called metrics. As you recall, when we're doing classification,

50
00:03:52,570 --> 00:03:55,030
what we're often interested in is the accuracy.

51
00:03:55,630 --> 00:04:00,310
We want to know: how many predictions did I get right out of all the predictions I made?

52
00:04:00,880 --> 00:04:05,770
This is unlike the binary cross entropy, which is kind of a weird value in contrast.

53
00:04:06,370 --> 00:04:12,360
So when you pass in accuracy as a metric, this means that later on, when you call the fit method,

54
00:04:12,370 --> 00:04:15,880
both the cross entropy and the accuracy will be computed.
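
As a rough illustration of the compile step, the sketch below uses the string forms of the loss, optimizer, and metric. The random data, the value of D, and the number of epochs are assumptions chosen only so that the snippet runs on its own.

```python
import numpy as np
import tensorflow as tf

D = 30  # hypothetical number of input features (assumption)

i = tf.keras.layers.Input(shape=(D,))
x = tf.keras.layers.Dense(1, activation='sigmoid')(i)
model = tf.keras.models.Model(i, x)

# Binary cross entropy as the loss, Adam as the optimizer,
# and accuracy as an extra metric to track
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Made-up data: random inputs with random 0/1 targets
np.random.seed(0)
X = np.random.randn(100, D).astype(np.float32)
y = np.random.randint(0, 2, size=100).astype(np.float32)

# fit computes both the loss and the accuracy at every epoch
r = model.fit(X, y, epochs=2, verbose=0)
print(r.history.keys())
```

The returned history holds one value per epoch for each tracked quantity, which is what you would plot later on.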

55
00:04:20,560 --> 00:04:26,020
As a side note, for those who are curious, I'll show you the expression for the binary cross entropy.

56
00:04:26,920 --> 00:04:29,560
In this case, y is the target and y

57
00:04:29,600 --> 00:04:34,510
hat is the model output, which, as you recall, is a probability between zero and one.

58
00:04:35,440 --> 00:04:42,010
You should convince yourself that when y is equal to y hat, this loss function equals zero, and when

59
00:04:42,010 --> 00:04:43,510
y and y hat are opposite,

60
00:04:43,840 --> 00:04:45,640
this loss function approaches infinity.

61
00:04:46,510 --> 00:04:52,690
What I mean by opposite is that y is one and y hat approaches zero, or y is zero and y hat approaches

62
00:04:52,690 --> 00:04:53,080
one.

63
00:04:53,740 --> 00:04:56,830
So plug in some numbers and verify that this is the case.
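
One way to plug in some numbers is the short NumPy sketch below. It assumes the standard per-sample form of the binary cross entropy, -(y log(y_hat) + (1 - y) log(1 - y_hat)); the function name is made up for illustration.

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    """Per-sample binary cross entropy (standard form, for intuition only)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# y_hat stays strictly between 0 and 1 here, because log(0) is undefined

# Prediction agrees with the target: loss is close to zero
agree = binary_cross_entropy(1.0, 0.999)

# Prediction is the opposite of the target: loss is large
opposite = binary_cross_entropy(1.0, 0.001)

# Pushing y_hat closer to the wrong answer makes the loss grow without bound
very_opposite = binary_cross_entropy(0.0, 1 - 1e-9)

print(agree, opposite, very_opposite)
```

The first value is tiny while the other two keep growing as y hat moves toward the wrong extreme.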

64
00:05:01,540 --> 00:05:07,120
Now, because of numerical stability, we will actually not use the syntax I just showed you. That was

65
00:05:07,120 --> 00:05:08,350
for intuition only.

66
00:05:09,070 --> 00:05:12,190
In practice, we're going to do something a bit more complex.

67
00:05:12,940 --> 00:05:18,430
Intuitively, we know that functions like the exponential and the log are unstable because they make

68
00:05:18,430 --> 00:05:20,410
things explode or shrink to zero.

69
00:05:21,280 --> 00:05:27,460
It turns out that the sigmoid involves exponentials, while the binary cross entropy involves logs,

70
00:05:27,460 --> 00:05:28,450
as you just saw.

71
00:05:29,380 --> 00:05:35,410
Now, as you may recall, the exponential is the inverse function of the log, so they effectively cancel

72
00:05:35,410 --> 00:05:36,220
each other out.

73
00:05:36,970 --> 00:05:42,910
In effect, there is a way to combine the sigmoid and the subsequent loss function that makes the computations

74
00:05:42,910 --> 00:05:44,200
more numerically stable.

75
00:05:45,070 --> 00:05:47,290
Now, luckily, you don't have to understand this.

76
00:05:47,590 --> 00:05:49,690
You just have to understand the syntax.
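
For those who do want to see the idea, here is a NumPy sketch comparing the naive two-step computation with one common combined formulation, max(z, 0) - z*y + log(1 + exp(-|z|)), where z is the value before the sigmoid. This is an illustration under that assumption and not necessarily the exact computation TensorFlow performs internally; both function names are made up.

```python
import numpy as np

def naive_loss(z, y):
    """Sigmoid first, then binary cross entropy (can blow up)."""
    with np.errstate(over='ignore', divide='ignore', invalid='ignore'):
        y_hat = 1.0 / (1.0 + np.exp(-z))
        return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def stable_loss(z, y):
    """Sigmoid and cross entropy combined algebraically into one expression."""
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

# For moderate values the two versions agree
print(naive_loss(2.0, 1.0), stable_loss(2.0, 1.0))

# For an extreme value the sigmoid rounds to exactly 1, so the naive
# version takes log(0), while the combined version stays finite
print(naive_loss(800.0, 0.0), stable_loss(800.0, 0.0))
```

The combined version never takes the log of a value that has already been rounded to zero, which is where the stability comes from.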

77
00:05:54,270 --> 00:06:00,870
So here's how the syntax changes in this form. When we create the model, it will look exactly like linear

78
00:06:00,870 --> 00:06:03,150
regression without any sigmoid.

79
00:06:03,720 --> 00:06:07,260
In other words, the model is now just W transpose X plus b.

80
00:06:07,920 --> 00:06:13,110
We call these logits, which is connected with the fact that the sigmoid is sometimes called the logistic

81
00:06:13,110 --> 00:06:13,590
function.

82
00:06:18,390 --> 00:06:24,330
The next change comes from how we call compile. As you recall, when we specify the loss function,

83
00:06:24,330 --> 00:06:28,560
we can use a string, but only when we're OK with the default values.

84
00:06:29,070 --> 00:06:34,350
Now we are not OK with the default values because we have to tell TensorFlow that the loss function

85
00:06:34,350 --> 00:06:36,120
should be combined with the sigmoid.

86
00:06:36,930 --> 00:06:41,790
In order to do this, we create an object of type BinaryCrossentropy and we pass in

87
00:06:41,790 --> 00:06:43,920
the argument from_logits equals True.

88
00:06:45,680 --> 00:06:51,560
So when we use this method, one minor difference is that when we call model.predict, we will get

89
00:06:51,560 --> 00:06:57,080
these logits, which can be any value, instead of probabilities, which can only be between zero and

90
00:06:57,080 --> 00:06:57,530
one.
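
The logits version can be sketched as follows. The value of D and the demo inputs are made up for illustration, and applying the sigmoid to the predictions afterwards is one possible way to recover probabilities, shown here as an assumption rather than a step taken from the lecture.

```python
import numpy as np
import tensorflow as tf

D = 30  # hypothetical number of input features (assumption)

# No activation: the model outputs W transpose x plus b, i.e. the logits
i = tf.keras.layers.Input(shape=(D,))
x = tf.keras.layers.Dense(1)(i)
model = tf.keras.models.Model(i, x)

# A loss object instead of a string, so that from_logits can be set
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))

# Made-up inputs, only to inspect the outputs
X_demo = np.random.randn(5, D).astype(np.float32)

# predict now returns logits, which can be any real value
logits = model.predict(X_demo, verbose=0)

# Applying the sigmoid afterwards maps them back to probabilities
probs = tf.math.sigmoid(logits).numpy()
print(logits.ravel(), probs.ravel())
```

A logit above zero corresponds to a probability above 0.5, so thresholding the logits at zero gives the same class predictions as thresholding the probabilities at 0.5.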

91
00:07:02,330 --> 00:07:07,640
Finally, it's worth mentioning that all the subsequent steps, such as calling the fit method, plotting

92
00:07:07,640 --> 00:07:12,560
the metrics over each epoch and making predictions all remain the same as before.

93
00:07:13,070 --> 00:07:15,290
So these steps require no changes.

