1
00:00:11,110 --> 00:00:16,720
So in this lecture, we'll be doing code preparation for ANNs, which will introduce the syntax we'll

2
00:00:16,720 --> 00:00:22,630
be using in the next lecture. To recap the previous lectures, recall that a neural network is simply

3
00:00:22,630 --> 00:00:29,920
a composite function where we just repeatedly apply W transpose times the input plus b, with some activation function

4
00:00:29,920 --> 00:00:30,880
f on top.
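
To make the recap concrete, here is a minimal pure-Python sketch of that composite function. The helper names (`relu`, `identity`, `dense`) and all of the numbers are made up for illustration; in practice TensorFlow's dense layer does this work for you.

```python
def relu(z):
    # Element-wise max(0, z).
    return [max(0.0, v) for v in z]

def identity(z):
    return list(z)

def dense(x, W, b, f):
    # W has shape (D, M), so W^T x + b has length M; f is the activation.
    z = [sum(W[i][j] * x[i] for i in range(len(x))) + b[j]
         for j in range(len(b))]
    return f(z)

# Hypothetical input and weights: D = 2 inputs, M = 3 hidden units, 1 output.
x = [1.0, 2.0]
W1 = [[1.0, 0.0, -1.0],
      [0.5, 1.0, 0.0]]
b1 = [0.0, 1.0, 0.0]
W2 = [[1.0], [1.0], [1.0]]
b2 = [0.5]

h = dense(x, W1, b1, relu)      # first application of f(W^T x + b)
y = dense(h, W2, b2, identity)  # second application, composed on top
```

Stacking more calls to `dense` gives a deeper network; each call plays the role of one dense layer.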

5
00:00:31,660 --> 00:00:36,910
As you recall, we already know how to implement this function in TensorFlow, which is just a dense

6
00:00:36,910 --> 00:00:37,360
layer.

7
00:00:38,290 --> 00:00:41,240
And as such, building an ANN model is very easy.

8
00:00:41,800 --> 00:00:46,000
It's just like the previous section, except that we have more dense layers.

9
00:00:46,780 --> 00:00:51,970
You'll notice that in this case, we also specify the activation function, which is most commonly the

10
00:00:51,970 --> 00:00:52,650
ReLU.

11
00:00:53,500 --> 00:00:58,660
Furthermore, you'll notice that for each of these dense layers, we need to specify their output size.

12
00:00:59,230 --> 00:01:04,269
This is what we've been calling the number of hidden units, and this is called a hyperparameter,

13
00:01:04,720 --> 00:01:06,880
meaning that you don't compute this value.

14
00:01:07,150 --> 00:01:09,550
You just choose this value to suit your needs.

15
00:01:10,090 --> 00:01:15,460
For example, by picking the value that leads to the best out-of-sample performance. We'll discuss

16
00:01:15,460 --> 00:01:18,310
how to choose hyperparameters elsewhere in this course.

17
00:01:20,130 --> 00:01:25,530
Finally, note that because we'll be doing multiclass classification, the final dense layer will have

18
00:01:25,530 --> 00:01:30,690
K outputs, where K is the number of classes, along with a softmax activation.
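
As a rough sketch of the model described so far, assuming the `tf.keras` Sequential API. The sizes `D`, `M`, and `K` and the choice of two hidden layers are hypothetical placeholders, not values from the lecture.

```python
import tensorflow as tf

# Hypothetical sizes chosen only for illustration.
D = 20  # number of input features
M = 64  # number of hidden units (a hyperparameter you choose)
K = 10  # number of classes

model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(D,)),
    # Hidden dense layers: output size M, ReLU activation.
    tf.keras.layers.Dense(M, activation='relu'),
    tf.keras.layers.Dense(M, activation='relu'),
    # Final layer: K outputs with softmax for multiclass classification.
    tf.keras.layers.Dense(K, activation='softmax'),
])

# Each row of the output is a probability distribution over the K classes.
probs = model(tf.zeros((2, D)))
```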

19
00:01:35,390 --> 00:01:37,850
The next step is to discuss the compile method.

20
00:01:38,920 --> 00:01:44,080
In this case, we will still pretty much always use the Adam optimizer, and we will still always want

21
00:01:44,090 --> 00:01:45,700
to know the accuracy metric.

22
00:01:46,300 --> 00:01:52,570
What changes, as you recall, is the loss, which is now the categorical cross-entropy or the sparse

23
00:01:52,570 --> 00:01:53,980
categorical cross-entropy.

24
00:01:54,820 --> 00:01:58,270
The difference between these is in how your targets are represented.

25
00:01:58,900 --> 00:02:04,150
If you want to use the regular categorical cross-entropy, then your targets will be a matrix of shape

26
00:02:04,150 --> 00:02:10,330
N by K, where N is the number of samples and K is the number of classes. Each row will contain

27
00:02:10,330 --> 00:02:11,170
a single one,

28
00:02:11,500 --> 00:02:14,410
and the rest all zeros, denoting the target class.

29
00:02:14,890 --> 00:02:18,250
So, for example, if y(3, 5) is equal to one,

30
00:02:18,640 --> 00:02:21,940
this means that the third sample belongs to class five.

31
00:02:22,540 --> 00:02:28,240
This is assuming that indexing starts from one, which is not the case in Python, but is the case in

32
00:02:28,240 --> 00:02:29,350
conventional math.

33
00:02:30,840 --> 00:02:35,850
On the other hand, if you use the sparse categorical cross-entropy, then your targets will be a

34
00:02:35,850 --> 00:02:41,040
one-dimensional array of length N, containing the integer representation of each class.

35
00:02:41,730 --> 00:02:44,670
So, for example, if y(3) is equal to five,

36
00:02:45,060 --> 00:02:48,300
this means that the third sample belongs to class five.

37
00:02:48,960 --> 00:02:51,960
Again, this assumes that indexing starts from one.

38
00:02:53,010 --> 00:02:56,820
Now, clearly, the sparse categorical cross-entropy is more efficient.

39
00:02:57,240 --> 00:02:59,280
So unless you have some reason not to,

40
00:02:59,580 --> 00:03:01,500
the sparse version should be used.

41
00:03:02,250 --> 00:03:07,800
The reason we like to think of the non-sparse version is that it's more intuitive in how it works.

42
00:03:08,460 --> 00:03:14,520
Note that when we use this, the target matrix has the same shape as the neural network output, which

43
00:03:14,520 --> 00:03:15,960
again is N by K.

44
00:03:17,040 --> 00:03:23,640
As a side note, recognize that the N-by-K target matrix is called a one-hot encoding of the N-length

45
00:03:23,640 --> 00:03:24,570
target vector.

46
00:03:25,230 --> 00:03:29,550
This idea of one-hot encoding will be helpful when we discuss embeddings as well.
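
The two target representations can be compared with a small pure-Python sketch. The `one_hot` helper and the label values are hypothetical, and the indexing is zero-based as in Python, so class index 4 corresponds to "class five" in the one-based convention used above.

```python
def one_hot(labels, num_classes):
    # Turn a length-N list of integer labels into an N-by-K matrix of 0s and 1s.
    return [[1 if k == y else 0 for k in range(num_classes)] for y in labels]

K = 5
sparse_targets = [2, 0, 4]  # length N = 3, one integer class index per sample
one_hot_targets = one_hot(sparse_targets, K)  # N rows, K columns

# The one-hot matrix carries the same information as the integer vector:
# the position of the single 1 in each row recovers the class index.
recovered = [row.index(1) for row in one_hot_targets]
```

The sparse form stores N integers instead of N times K entries, which is why it is the more efficient choice when there are many classes.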

47
00:03:34,330 --> 00:03:40,600
Now, as with the binary cross-entropy, it's possible to combine the final softmax activation with

48
00:03:40,600 --> 00:03:43,720
the loss so that we only need to compute the logits.

49
00:03:44,410 --> 00:03:47,260
Again, we like to do this because it's more numerically stable.

50
00:03:48,070 --> 00:03:50,900
The softmax, as you've seen, contains exponentials,

51
00:03:51,340 --> 00:03:56,620
and the cross-entropy contains logs, both of which can make things either explode or go to zero.

52
00:03:57,190 --> 00:04:01,330
But if we combine logs with exponentials, then they cancel each other out.
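
To illustrate why combining the two helps, here is a pure-Python sketch for a single sample. It uses the log-sum-exp trick, which is one common way to get this cancellation; it is not necessarily how TensorFlow implements it internally, and the function names are made up for this example.

```python
import math

def naive_cross_entropy(logits, target):
    # Softmax first, then log: math.exp overflows for large logits.
    exps = [math.exp(z) for z in logits]
    p = exps[target] / sum(exps)
    return -math.log(p)

def cross_entropy_from_logits(logits, target):
    # -log softmax(z)[target] = logsumexp(z) - z[target].
    # Subtracting the max keeps every exponent at or below zero.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

small = [2.0, 1.0, 0.1]         # both versions agree on moderate logits
large = [1000.0, 0.0, -1000.0]  # the naive version raises OverflowError here

stable_loss = cross_entropy_from_logits(large, 0)
```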

53
00:04:02,230 --> 00:04:07,450
In this form, our final dense layer has no activation, but still has output size K.

54
00:04:08,820 --> 00:04:14,430
Our loss function now cannot be specified as a string; instead, we'll create an object, passing in

55
00:04:14,430 --> 00:04:16,740
the argument from_logits=True.
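
A sketch of what this could look like with `tf.keras`, assuming integer (sparse) targets. The sizes `D`, `M`, and `K` are hypothetical; with one-hot targets, `CategoricalCrossentropy(from_logits=True)` would be used instead.

```python
import tensorflow as tf

D, M, K = 20, 64, 10  # hypothetical sizes, for illustration only

model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(D,)),
    tf.keras.layers.Dense(M, activation='relu'),
    # No activation on the final layer: it outputs K logits.
    tf.keras.layers.Dense(K),
])

model.compile(
    optimizer='adam',
    # The loss object applies the softmax internally because from_logits=True.
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

# The model now returns logits; apply softmax yourself if you need probabilities.
logits = model(tf.zeros((3, D)))
probs = tf.nn.softmax(logits, axis=-1)
```

Taking the argmax over the logits gives the same predicted class as taking it over the probabilities, since softmax preserves ordering.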

56
00:04:17,850 --> 00:04:23,250
Finally, note that just as with our previous examples, the rest of the code will not change.

57
00:04:23,670 --> 00:04:29,040
So calling fit is the same, plotting the loss per epoch is the same, and making predictions is still

58
00:04:29,040 --> 00:04:29,670
the same.

