1
00:00:01,170 --> 00:00:07,800
So in order for us to understand the limitations of computer vision, it's important to understand what

2
00:00:07,800 --> 00:00:14,850
makes it so hard, because with computer vision or deep learning, even though a task seems easy

3
00:00:14,850 --> 00:00:15,600
to us,

4
00:00:16,020 --> 00:00:18,870
it's not easy for an algorithm or computer or model.

5
00:00:19,530 --> 00:00:20,960
So let's take a look at this.

6
00:00:20,970 --> 00:00:25,850
So firstly, like I just said, our brains make things easy for us.

7
00:00:25,860 --> 00:00:32,580
Our brains are amazing at vision and so many other tasks like speech, reading comprehension, strategic

8
00:00:32,580 --> 00:00:33,110
thinking.

9
00:00:33,120 --> 00:00:38,910
Well, some people at least, but generally everyone can do vision quite well.

10
00:00:39,300 --> 00:00:45,600
And that's because we have a very well-developed visual cortex at the back of our brain here that basically

11
00:00:46,230 --> 00:00:53,220
can see and understand the visual information that's being fed to our eyes very, very well.

12
00:00:53,220 --> 00:00:55,020
It's amazing, actually, what it can do.

13
00:00:55,530 --> 00:00:59,190
It beats robots any day, at least currently right now.

14
00:01:00,000 --> 00:01:02,100
And there's a reason for that, actually.

15
00:01:02,550 --> 00:01:08,100
And that's because our eyes are very good at general-purpose vision, that is.

16
00:01:08,940 --> 00:01:11,340
Let's assume you're watching this slide right now.

17
00:01:11,730 --> 00:01:15,690
You're understanding that you're watching a screen, a laptop screen or a monitor.

18
00:01:16,020 --> 00:01:19,040
You understand that there's like a background behind it.

19
00:01:19,050 --> 00:01:20,550
You understand that it has depth.

20
00:01:20,850 --> 00:01:22,110
There's so many different things.

21
00:01:22,110 --> 00:01:24,450
You're understanding your room, you know, whether it's well-lit.

22
00:01:24,780 --> 00:01:26,070
You know what a desk is.

23
00:01:26,070 --> 00:01:28,950
You know the objects that are on your desk, keyboards, everything.

24
00:01:29,310 --> 00:01:33,930
So there's a lot of information being fed into your eyes and you just naturally understand it.

25
00:01:34,290 --> 00:01:38,670
However, it isn't so for computers. They have to basically understand every object.

26
00:01:39,030 --> 00:01:44,580
And then we actually have background knowledge, like what an object is, like what a cup is: a cup stores

27
00:01:44,580 --> 00:01:48,480
water or other drinks, but robots don't have that.

28
00:01:48,480 --> 00:01:53,370
So you have to kind of build all of that general knowledge into it, which is a different topic, though.

29
00:01:53,520 --> 00:02:00,420
I mean, that's eventually what computer vision will lead to, however, because robotics is a huge area where

30
00:02:00,420 --> 00:02:08,610
computer vision can be applied, but as of now, it's more in niche applications like understanding

31
00:02:08,700 --> 00:02:17,040
one type of medical detail or understanding faces very well, or understanding how to transform your

32
00:02:17,040 --> 00:02:18,750
image into artistic styles.

33
00:02:19,140 --> 00:02:25,680
Basically, we've trained these models to do niche things, but unlike our brains, which are very

34
00:02:25,680 --> 00:02:28,230
general, computer vision isn't there yet.

35
00:02:28,290 --> 00:02:29,530
So it might excel.

36
00:02:29,530 --> 00:02:33,570
It might even beat humans in one task, but it sucks at everything else.

37
00:02:34,200 --> 00:02:38,410
And now let's take a look at what actually makes computer vision so hard.

38
00:02:39,330 --> 00:02:40,530
Now there are a number of things.

39
00:02:40,530 --> 00:02:43,700
So the first one is you're limited by your cameras.

40
00:02:43,700 --> 00:02:46,200
So eyes are very good camera systems.

41
00:02:46,200 --> 00:02:46,980
If you think about it.

42
00:02:47,400 --> 00:02:53,640
However, physical cameras often have limitations with noise, with granularity or resolution.

43
00:02:54,000 --> 00:02:58,470
Things far away, if it's a teeny tiny camera, are going to be quite blurry compared to someone who has

44
00:02:58,470 --> 00:02:59,400
20/20 vision.

45
00:02:59,910 --> 00:03:04,380
So there's those limitations. Then there's viewpoint variation.

46
00:03:04,380 --> 00:03:09,210
We naturally understand that this is the Statue of Liberty at different angles, or this is a

47
00:03:09,210 --> 00:03:13,650
house that's just rotated, but a computer vision model might not.

48
00:03:13,860 --> 00:03:15,990
It might think it's looking at different objects.

49
00:03:17,280 --> 00:03:19,200
There's also changing lighting conditions.

50
00:03:19,200 --> 00:03:25,860
You can see how drastically this light being here or this light not being there changes the scene. Then there's

51
00:03:25,860 --> 00:03:26,820
scaling issues.

52
00:03:26,820 --> 00:03:32,610
So you can see the Taj Mahal at this level, at this scale looks completely different, at least to

53
00:03:32,610 --> 00:03:33,660
a computer vision model.

54
00:03:33,660 --> 00:03:35,010
It might hopefully not.

55
00:03:35,520 --> 00:03:43,260
But you can see when you scale back out how different it looks, as well as in these comparisons here. Then

56
00:03:43,260 --> 00:03:50,130
there's natural non-rigid deformations: a dog or a horse has many different poses, similar to a

57
00:03:50,130 --> 00:03:50,490
human.

58
00:03:50,490 --> 00:03:52,920
You can be sitting, standing, crouched.

59
00:03:53,520 --> 00:03:59,400
So there's that to consider: the deformation that an object like this can undergo.

60
00:03:59,430 --> 00:04:04,770
So you're going to have to use other characteristics to identify it, which we naturally learn in our

61
00:04:04,770 --> 00:04:05,220
brains.

62
00:04:05,220 --> 00:04:13,020
But a computer vision algorithm has to be fed many different instances of a dog or a horse to understand

63
00:04:13,440 --> 00:04:14,640
that it's not

64
00:04:14,640 --> 00:04:17,940
the fact that the animal is standing this way that makes it a dog.

65
00:04:18,330 --> 00:04:21,540
It's because of its fur or its facial shape.

66
00:04:21,540 --> 00:04:23,970
That's what makes it a dog.

67
00:04:25,080 --> 00:04:31,380
There's also occlusion, which basically means that part of the object is blocked by another object.

68
00:04:31,380 --> 00:04:34,680
So it's a form of clutter, which we'll take a look at next.

69
00:04:34,680 --> 00:04:36,630
So this isn't technically a form of clutter.

70
00:04:36,630 --> 00:04:41,960
It's more camouflage, but in a way it is clutter because it's a scene where it's hard to detect the

71
00:04:41,970 --> 00:04:43,800
octopus. Next:

72
00:04:43,800 --> 00:04:51,090
This is a scene, possibly somewhere in China, that just shows you how much clutter, how many different

73
00:04:51,090 --> 00:04:52,530
objects are around.

74
00:04:53,580 --> 00:04:58,560
So it's quite hard to make out anything in this picture for us, let alone for a computer vision algorithm trying

75
00:04:58,560 --> 00:04:59,400
to make sense of it.

76
00:05:00,060 --> 00:05:02,190
Then there's object class variation.

77
00:05:02,640 --> 00:05:05,580
Look how many different types of beds there are in this scene.

78
00:05:06,000 --> 00:05:08,940
We all know they're beds, except this is like a sofa bed.

79
00:05:09,330 --> 00:05:10,680
But we all know they're beds.

80
00:05:11,040 --> 00:05:13,440
But a computer vision model might not.

81
00:05:14,550 --> 00:05:17,340
Then there's ambiguous images.

82
00:05:17,340 --> 00:05:19,290
An object, optical illusion.

83
00:05:19,290 --> 00:05:19,650
Sorry.

84
00:05:20,280 --> 00:05:22,890
So you can see this is actually a flat 2D image.

85
00:05:22,920 --> 00:05:25,200
However, it does look like a 3D image.

86
00:05:25,740 --> 00:05:26,900
Is this.

87
00:05:26,910 --> 00:05:28,770
Is this a vase or two faces?

88
00:05:28,800 --> 00:05:29,400
Who knows?

89
00:05:30,210 --> 00:05:34,700
So as you can see, there's a number of ways we can trick vision systems.

90
00:05:34,710 --> 00:05:40,770
And actually there's a whole field of computer vision, an antagonistic (adversarial) type of training, where you create

91
00:05:40,770 --> 00:05:43,110
models to beat other models effectively.

92
00:05:43,650 --> 00:05:45,510
So it's not foolproof.

93
00:05:45,750 --> 00:05:48,050
It's very well developed right now.

94
00:05:48,060 --> 00:05:55,440
But as you can see, naturally, vision is basically a messy field, and by that I mean messy

95
00:05:55,980 --> 00:05:57,090
images coming in.

96
00:05:57,240 --> 00:05:59,760
So there's a lot of ambiguity inside of it.

97
00:06:00,300 --> 00:06:06,750
So in the next section, we'll take a look at what exactly are images, because this is the foundational

98
00:06:06,750 --> 00:06:08,520
knowledge of computer vision.

99
00:06:08,880 --> 00:06:10,950
Understanding what images actually are.

100
00:06:11,580 --> 00:06:15,060
So I'll see you in the next lesson where we take a look at this topic.

101
00:06:15,330 --> 00:06:15,500
Thank you.