1
00:00:00,760 --> 00:00:05,440
‫Now we are going to start with convolutional neural networks.

2
00:00:05,920 --> 00:00:11,830
‫And in this section, we are going to understand some of the building blocks of CNNs.

3
00:00:13,540 --> 00:00:17,560
‫First of all, let us understand the motivation behind CNN.

4
00:00:18,160 --> 00:00:24,010
‫What is it that CNNs do better than a normal artificial neural network?

5
00:00:25,190 --> 00:00:31,070
‫CNNs are mostly used for problems like image recognition or speech recognition.

6
00:00:32,100 --> 00:00:40,050
‫The reason for this is that CNNs outperform normal artificial neural networks in these types of problems.

7
00:00:41,340 --> 00:00:47,940
‫In fact, on some image recognition benchmarks, CNN models are even more accurate than humans.

8
00:00:50,850 --> 00:00:55,620
‫Let us try to see the limitation in a normal artificial neural network.

9
00:00:57,240 --> 00:01:05,340
‫As we understand, an artificial neural network gets the data of each pixel as input into the first

10
00:01:05,340 --> 00:01:06,360
‫layer of neurons.

11
00:01:06,660 --> 00:01:16,560
‫So if I have a 16 by 16 pixel image and I want to find what is in that image, I will feed the information

12
00:01:16,560 --> 00:01:20,460
‫of all these 256 pixels into the first layer.
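
As a rough sketch of what feeding all 256 pixels into the first layer means, the snippet below flattens a 16 by 16 grid into one list. The pixel values are made up for illustration, and the names `image` and `flat_input` are assumptions, not something from the slides.

```python
# Hypothetical 16 x 16 grayscale image: 16 rows of 16 made-up pixel values.
HEIGHT, WIDTH = 16, 16
image = [[row * WIDTH + col for col in range(WIDTH)] for row in range(HEIGHT)]

# A fully connected first layer receives the pixels as one flat list,
# so the 2D layout of the image is no longer explicit.
flat_input = [pixel for row in image for pixel in row]

print(len(flat_input))  # 256 inputs, one per pixel

# The pixel at (row, col) lands at index row * WIDTH + col, so the pixel
# directly below it ends up 16 positions later in the flat list.
row, col = 3, 7
print(flat_input[row * WIDTH + col] == image[row][col])  # True
```

Two pixels that touch vertically in the image are 16 positions apart in the flat list, so nothing in the input tells the network they are neighbors.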

13
00:01:21,580 --> 00:01:23,170
‫But here's the problem.

14
00:01:23,800 --> 00:01:30,760
‫If you see the pixels randomly with no order, can you identify what this image is?

15
00:01:32,410 --> 00:01:39,130
‫For example, in this slide, I have an image of our favorite video game character, Mario.

16
00:01:40,360 --> 00:01:42,730
‫It's a 16 by 16 pixel image.

17
00:01:45,370 --> 00:01:53,380
‫Now, to demonstrate how our neural network sees this image, I have covered all of the pixels with

18
00:01:53,380 --> 00:01:59,530
‫blue squares and have revealed 25 random pixels in each of the four images.

19
00:02:01,060 --> 00:02:03,760
‫This is similar to what our neural network sees.
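
To try the same experiment yourself, here is a minimal sketch that picks 25 of the 256 pixel positions at random and prints which ones are revealed. The fixed seed and the `#` and `.` characters are arbitrary choices made for this example.

```python
import random

HEIGHT, WIDTH = 16, 16
REVEALED = 25  # pixels left uncovered in each masked image

rng = random.Random(0)  # fixed seed so the sketch is repeatable
all_positions = [(row, col) for row in range(HEIGHT) for col in range(WIDTH)]
visible = set(rng.sample(all_positions, REVEALED))

# Print the mask: '#' marks a revealed pixel, '.' a covered one.
for row in range(HEIGHT):
    print("".join("#" if (row, col) in visible else "." for col in range(WIDTH)))
```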

20
00:02:04,660 --> 00:02:12,370
‫So if our neural network sees this first image, do you think it will be able to identify the character

21
00:02:12,370 --> 00:02:15,820
‫in this image? Or in this second image?

22
00:02:16,870 --> 00:02:18,820
‫The same goes for the third and fourth images.

23
00:02:19,570 --> 00:02:25,990
‫The point that I'm trying to make here is we are not considering the effect of neighboring pixels.

24
00:02:27,110 --> 00:02:33,020
‫If we randomly pick pixels, we cannot tell what the image behind them is.

25
00:02:33,560 --> 00:02:41,480
‫Only if we consider the order of the pixels are we able to identify the object in that image.

26
00:02:42,560 --> 00:02:50,180
‫So identifying the object by looking at such images is a very difficult task because this is not how

27
00:02:50,180 --> 00:02:51,560
‫the human brain works.

28
00:02:53,110 --> 00:02:56,320
‫We do not look at individual points or pixels.

29
00:02:56,470 --> 00:03:00,460
‫We recognize patterns in groups of points or pixels.

30
00:03:01,300 --> 00:03:06,280
‫In fact, cells in our visual cortex respond to different patterns.

31
00:03:07,420 --> 00:03:09,140
‫Some respond to horizontal lines.

32
00:03:09,160 --> 00:03:10,860
‫Some respond to vertical lines.

33
00:03:10,870 --> 00:03:13,300
‫Some respond to other complex patterns.

34
00:03:13,720 --> 00:03:20,380
‫The output of the lower level neurons is then processed by higher level neurons to identify objects

35
00:03:20,380 --> 00:03:21,670
‫in our visual field.

36
00:03:23,390 --> 00:03:27,260
‫Convolutional neural networks are inspired by this concept.

37
00:03:28,660 --> 00:03:29,890
‫In CNNs,

38
00:03:30,010 --> 00:03:35,350
‫instead of looking at each individual pixel, we look at a group of pixels.

39
00:03:38,140 --> 00:03:44,440
‫If we look at a group of pixels, we are more likely to pick up different features of the objects in

40
00:03:44,440 --> 00:03:45,100
‫the image.

41
00:03:45,730 --> 00:03:51,580
‫And once we know the features, it is more likely that we can predict the object in the image.

42
00:03:53,000 --> 00:04:00,650
‫So in this slide you can see that by using a window on the image or by looking at a group of pixels,

43
00:04:01,610 --> 00:04:03,830
‫we can identify certain features.

44
00:04:05,740 --> 00:04:10,210
‫So you can see Mario's ear here in the first image.

45
00:04:11,480 --> 00:04:17,000
‫In this image, you can see Mario's red collared shirt and probably a shoulder.

46
00:04:18,090 --> 00:04:24,750
‫In the next image, you find some very important features of the character: eyes, nose and a

47
00:04:24,750 --> 00:04:25,530
‫mustache.

48
00:04:27,960 --> 00:04:33,630
‫The last window tells us that the character is wearing blue colored pants on the legs.

49
00:04:34,650 --> 00:04:42,780
‫With these features identified, it is easier for our network to identify the object in our image.

50
00:04:43,740 --> 00:04:52,440
‫So if you compare it with the previous slide in which we randomly showed you 25 pixels out of 256 pixels,

51
00:04:53,010 --> 00:05:01,470
‫you can easily see that in the second slide you can identify features because the pixels are grouped

52
00:05:01,470 --> 00:05:08,340
‫together, unlike in the previous slide, where the pixels were picked at random.

53
00:05:10,170 --> 00:05:12,360
‫So this is the main idea here.

54
00:05:13,050 --> 00:05:18,630
‫Instead of looking at each individual pixel, we will look at a group of pixels.

55
00:05:20,130 --> 00:05:22,590
‫Now let us see how this is implemented.

56
00:05:24,030 --> 00:05:26,400
‫So here is that image at the bottom.

57
00:05:27,810 --> 00:05:30,780
‫This is the input image to our network.

58
00:05:32,280 --> 00:05:36,030
‫Now, on top of it, we will have a convolutional layer.

59
00:05:37,310 --> 00:05:40,070
‫This is the most important concept in CNNs.

60
00:05:40,610 --> 00:05:42,620
‫We have a convolutional layer.

61
00:05:43,530 --> 00:05:50,430
‫A convolutional layer consists of neurons which take in information from a group of pixels in the previous

62
00:05:50,430 --> 00:05:50,940
‫layer.
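
Here is a minimal sketch of one such neuron. It assumes the neuron computes a weighted sum of the pixels in its window, with bias and activation left out for brevity. The image, the weights and the name `neuron_output` are all made up for this example.

```python
def neuron_output(image, weights, top, left):
    """Weighted sum of the pixels in the window whose top-left corner is (top, left)."""
    total = 0.0
    for i, weight_row in enumerate(weights):
        for j, weight in enumerate(weight_row):
            total += weight * image[top + i][left + j]
    return total

# Hypothetical 16 x 16 image: left half dark (0.0), right half bright (1.0).
image = [[1.0 if col >= 8 else 0.0 for col in range(16)] for _ in range(16)]

# Hypothetical 3 x 3 weights that respond to a dark-to-bright vertical edge.
weights = [[-1.0, 0.0, 1.0] for _ in range(3)]

print(neuron_output(image, weights, 0, 0))   # window in the flat dark area
print(neuron_output(image, weights, 0, 6))   # window straddling the edge
print(neuron_output(image, weights, 0, 10))  # window in the flat bright area
```

Only the window that straddles the edge gives a nonzero response, which is the sense in which a neuron that looks at a group of pixels can pick up a feature.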

63
00:05:52,500 --> 00:05:54,180
‫So in this first layer.

64
00:05:56,030 --> 00:06:02,990
‫this neuron gets the information stored in the pixels within this rectangular box only.

65
00:06:03,020 --> 00:06:11,210
‫This other neuron gets information from the pixels of this rectangle only.

66
00:06:12,170 --> 00:06:19,070
‫Similarly, in the second convolutional layer, the information of all the neurons in this small rectangle

67
00:06:20,460 --> 00:06:24,900
‫of the first layer is taken as input by this neuron.

68
00:06:27,690 --> 00:06:34,770
‫This architecture allows the network to concentrate on lower level features in the first layer and then

69
00:06:34,770 --> 00:06:41,160
‫assemble these features into larger, higher level features in the next hidden layer and so on.
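
One way to see why stacked layers capture larger features is to count how many input pixels a neuron ends up depending on. This sketch assumes every window moves one pixel at a time, which the lecture has not specified at this point, so each layer with a window of size k widens the view by k - 1 pixels. The function name `receptive_field` is hypothetical.

```python
def receptive_field(window_sizes):
    """Width in input pixels seen by one neuron after stacking the given layers.

    Assumes each window moves one pixel at a time (stride 1).
    """
    size = 1
    for k in window_sizes:
        size += k - 1
    return size

print(receptive_field([5]))     # 5: a first-layer neuron sees a 5-pixel-wide patch
print(receptive_field([5, 5]))  # 9: a second-layer neuron sees a 9-pixel-wide patch
print(receptive_field([3, 3]))  # 5: two 3-wide layers cover as much as one 5-wide layer
```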

70
00:06:42,940 --> 00:06:49,330
‫Now let us focus more on this window or the receptive field of these neurons.

71
00:06:51,220 --> 00:06:56,470
‫So this window is also known as the receptive field of that particular neuron.

72
00:06:58,330 --> 00:07:01,600
‫This window has two dimensions: height and width.

73
00:07:03,230 --> 00:07:11,120
‫In this image, you can see that the height of the window we have taken is five pixels and the width is

74
00:07:11,120 --> 00:07:12,350
‫also five pixels.

75
00:07:13,190 --> 00:07:16,550
‫So we say that this is a five cross five window.

76
00:07:18,200 --> 00:07:24,320
‫We can also have a three cross three window or a two cross three window or any such dimension.

77
00:07:25,520 --> 00:07:30,110
‫The most commonly used dimensions are three cross three or five cross five.

78
00:07:31,460 --> 00:07:39,500
‫So this particular window, the information stored in all the pixels of this window, will go into one

79
00:07:39,500 --> 00:07:42,410
‫particular neuron in the upper convolutional layer.
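
As a small sketch of this, the code below lists the 25 pixel positions that a five cross five window feeds into one neuron. It also counts how many window positions fit on a 16 by 16 image, assuming the window moves one pixel at a time and never hangs over the border. Both assumptions go beyond what has been said so far, and `window_pixels` is a hypothetical helper name.

```python
IMAGE_SIZE = 16
WINDOW = 5

def window_pixels(top, left, height=WINDOW, width=WINDOW):
    """Coordinates of every pixel inside the window with top-left corner (top, left)."""
    return [(top + i, left + j) for i in range(height) for j in range(width)]

# One neuron in the upper layer receives all 25 pixels of its window.
print(len(window_pixels(0, 0)))  # 25

# Assumption: the window moves one pixel at a time and stays inside the image.
positions_per_axis = IMAGE_SIZE - WINDOW + 1
print(positions_per_axis)       # 12 window positions along each axis
print(positions_per_axis ** 2)  # 144 neurons, one per window position
```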