1
00:00:01,170 --> 00:00:02,310
Hi, welcome back.

2
00:00:02,580 --> 00:00:05,700
Now let's talk about maximizing class activations.

3
00:00:06,240 --> 00:00:08,190
This is a really cool topic, in my opinion.

4
00:00:08,700 --> 00:00:15,780
This basically tries to figure out what input image maximizes the class output representation of that

5
00:00:15,780 --> 00:00:16,260
CNN.

6
00:00:16,650 --> 00:00:18,570
So let's talk about this a bit more.

7
00:00:18,630 --> 00:00:19,890
What does that actually mean?

8
00:00:20,370 --> 00:00:24,930
Well, let's say we've trained a CNN to classify between cats and dogs.

9
00:00:25,470 --> 00:00:28,800
But does a CNN actually know what a cat looks like?

10
00:00:29,490 --> 00:00:35,490
What input makes a CNN output with 100 percent certainty that it's seeing a cat?

11
00:00:36,000 --> 00:00:39,330
That's the question we seek to answer with class maximization.

12
00:00:39,690 --> 00:00:46,650
What input image is going to make our CNN give a 100 percent probability output that this is a cat or

13
00:00:46,650 --> 00:00:47,310
this is a dog?

14
00:00:47,430 --> 00:00:53,540
So let's take a look at a pre-trained CNN, and that CNN is VGG, which you'll learn about shortly.

15
00:00:53,550 --> 00:00:57,540
It's one of the classic CNN architectures.

16
00:00:58,080 --> 00:00:59,730
So let's take a look at this.

17
00:01:00,150 --> 00:01:08,730
So apparently, this image here on the right is the input that maximizes the output probability of

18
00:01:08,880 --> 00:01:10,770
this class being a bald eagle.

19
00:01:11,760 --> 00:01:15,630
Now let's go back, so take a look at this.

20
00:01:15,840 --> 00:01:19,440
Why would this maximize that

21
00:01:19,440 --> 00:01:24,270
CNN? Why would that input maximize it? As in, when I say maximize it,

22
00:01:24,270 --> 00:01:28,260
I mean, cause that CNN to give an output with 100 percent certainty:

23
00:01:28,680 --> 00:01:29,940
This is a bald eagle.

24
00:01:30,450 --> 00:01:35,250
Well, if you look carefully, you can see it is an amalgamation of beaks at different angles.

25
00:01:35,250 --> 00:01:39,120
A bird head here, a bird head here, a different beak here.

26
00:01:39,600 --> 00:01:46,020
So you can see there are features that belong to the bald eagle that are showing up in this

27
00:01:46,020 --> 00:01:46,350
image.

28
00:01:46,530 --> 00:01:49,380
It's just not the way our brains would perceive it.

29
00:01:50,280 --> 00:01:52,260
So let's take a look at some takeaways here.

30
00:01:52,950 --> 00:01:55,410
CNNs internalize local features.

31
00:01:55,530 --> 00:02:01,110
These are things like feathers, beaks and eyes that will bear a resemblance to the class that it's

32
00:02:01,110 --> 00:02:02,430
been trained to recognize.

33
00:02:02,910 --> 00:02:07,310
However, this shows that CNNs learn very differently from us.

34
00:02:07,800 --> 00:02:10,020
At least we think so.

35
00:02:10,140 --> 00:02:16,830
We don't actually know too much yet about how our brains actually learn and process images, because

36
00:02:17,040 --> 00:02:19,950
that's all behind the scenes in our brain, hopefully.

37
00:02:20,670 --> 00:02:28,890
So it basically shows that CNNs do something called decomposition of the visual input space,

38
00:02:29,400 --> 00:02:35,390
and it's a hierarchical, modular network of convolutional filters that builds up that intuition that

39
00:02:35,430 --> 00:02:39,720
these features belong to that specific class.

40
00:02:39,930 --> 00:02:46,470
So the internal network is then a probabilistic mapping between combinations of these filters and their

41
00:02:46,470 --> 00:02:47,280
class labels.

42
00:02:47,610 --> 00:02:53,060
So the fact that CNNs learn very differently tells us something.

43
00:02:53,070 --> 00:02:56,290
It tells us that CNNs can often be fooled.

44
00:02:57,120 --> 00:02:58,150
So look at this.

45
00:02:58,170 --> 00:03:03,690
I mean, this is a funny example here: this looks like a raisin cookie or chocolate chip cookie,

46
00:03:03,690 --> 00:03:09,210
you can't be sure, next to a chihuahua's face, and you can see that they look quite similar.

47
00:03:09,360 --> 00:03:10,350
We can tell the difference.

48
00:03:10,770 --> 00:03:16,950
But a CNN may actually think it's the same class, either a chihuahua or a muffin, or whatever that

49
00:03:16,950 --> 00:03:17,550
class is.

50
00:03:18,360 --> 00:03:24,150
So similarly, you can do something called an adversarial attack where you just apply some noise to

51
00:03:24,150 --> 00:03:27,180
the image so that the image looks like this.

52
00:03:27,180 --> 00:03:31,530
You apply some random-looking noise to it, and it effectively looks the same to the naked eye.

53
00:03:31,950 --> 00:03:37,320
However, the CNN is going to predict an entirely different class because all of these little pixels that

54
00:03:37,320 --> 00:03:41,930
we pushed in made the filters for another class fire,

55
00:03:42,690 --> 00:03:46,780
the filters that correspond to that other class.

56
00:03:47,310 --> 00:03:53,760
It made that CNN trigger and give a higher probability that this image belongs to that class,

57
00:03:54,060 --> 00:03:55,050
which is clearly wrong.

58
00:03:55,290 --> 00:04:01,200
And then there's another way we can do it by pasting maybe a feature of another image or the entire

59
00:04:01,200 --> 00:04:05,980
image itself into the scene so that it no longer predicts banana, when it is clearly a banana.

60
00:04:05,980 --> 00:04:06,870
That's the main focus.

61
00:04:07,290 --> 00:04:12,240
It would now predict toaster, so it's not wrong that it's predicting toaster, but you would want

62
00:04:12,240 --> 00:04:15,540
it to be maybe 50-50 probability in this case.

63
00:04:16,800 --> 00:04:24,840
So there's even something called a one-pixel attack, where just by changing one pixel here in the image,

64
00:04:24,840 --> 00:04:27,900
you can see the pixels are quite small, but maybe you can see them.

65
00:04:28,560 --> 00:04:32,610
But these are popular CNN architectures here, pre-trained models.

66
00:04:33,060 --> 00:04:39,870
You can see that by changing just one pixel, it changes the category from ship to car, and at a 99 percent

67
00:04:39,870 --> 00:04:41,280
probability, which is not good.

68
00:04:41,700 --> 00:04:44,160
Similarly, from horse to frog, and another to airplane.

69
00:04:44,550 --> 00:04:46,000
So you can see what's going on here.

70
00:04:46,020 --> 00:04:49,560
You can see that CNNs can be fooled because of how they learn.

71
00:04:49,680 --> 00:04:55,860
And our job as computer vision engineers or practitioners or deep learning experts is to basically make

72
00:04:55,860 --> 00:04:59,490
models that are more resilient against these types of attacks.

73
00:05:00,180 --> 00:05:04,680
So in the next section, we'll take a look at Grad-CAM.

74
00:05:05,130 --> 00:05:11,940
That's an algorithm that basically gives us a visual explanation of the areas of an image that the CNN

75
00:05:11,940 --> 00:05:12,990
is responding to.

76
00:05:13,110 --> 00:05:18,180
It uses gradients as well, which we've mentioned previously, to do this.

77
00:05:18,210 --> 00:05:19,290
So stay tuned.

78
00:05:19,310 --> 00:05:20,640
I'll see you in the next section.

79
00:05:20,820 --> 00:05:21,300
Thank you.