1
00:00:11,700 --> 00:00:16,650
In this section of the course we are going to discuss the theory behind reinforcement learning.

2
00:00:17,680 --> 00:00:18,460
In this lecture,

3
00:00:18,460 --> 00:00:23,710
we'll give you an introduction to reinforcement learning and we will talk about it in general terms

4
00:00:23,710 --> 00:00:30,040
without any math or terminology. One thing you have to brace yourself for when you're learning about

5
00:00:30,040 --> 00:00:35,530
reinforcement learning is how different it is from supervised and unsupervised learning.

6
00:00:35,530 --> 00:00:40,580
So if you just took an introduction to machine learning where you learned about models such as Naive

7
00:00:40,590 --> 00:00:47,230
Bayes and k-means clustering, or you come from a statistics background, you will be surprised at how different

8
00:00:47,230 --> 00:00:50,870
reinforcement learning is compared to what you are used to.

9
00:00:50,920 --> 00:00:57,340
So try to take some time to soak up these concepts and don't feel intimidated by this new and different

10
00:00:57,340 --> 00:01:03,210
way of thinking.

11
00:01:03,230 --> 00:01:09,620
I want to begin by thinking about supervised learning. When we think about, say, an image classifier,

12
00:01:09,620 --> 00:01:12,130
you can think of it as a static function.

13
00:01:12,220 --> 00:01:15,050
I pass in an image and I get a prediction.

14
00:01:15,050 --> 00:01:17,450
It tells me what kind of object is in the image,

15
00:01:17,450 --> 00:01:22,170
for example. There is no notion of time. I pass in another image,

16
00:01:22,190 --> 00:01:23,860
I get another prediction.

17
00:01:24,140 --> 00:01:29,240
The image classifier is just a function: I give it an input and it produces an output.

18
00:01:34,340 --> 00:01:36,070
So what do I mean by static?

19
00:01:36,080 --> 00:01:37,930
And what do I mean by time?

20
00:01:38,240 --> 00:01:43,190
You might immediately think of recurrent neural networks, which are neural networks that can handle sequences,

21
00:01:43,550 --> 00:01:45,870
inputs that vary with time.

22
00:01:45,890 --> 00:01:51,590
However, this is not the kind of time I am thinking about. If I pass in some stock prices for a given

23
00:01:51,590 --> 00:01:55,930
time period and my model predicts whether the stock will go up or down tomorrow,

24
00:01:56,000 --> 00:01:57,380
that's still a static function.

25
00:02:02,490 --> 00:02:04,390
Here's what I mean by time.

26
00:02:04,650 --> 00:02:11,040
Imagine you are building a self-driving car simulation. At each moment in time, your neural network can

27
00:02:11,040 --> 00:02:14,730
take a snapshot of the screen and decide what to do next.

28
00:02:14,730 --> 00:02:15,650
Should I steer left?

29
00:02:15,660 --> 00:02:23,170
Should I steer right, accelerate, or brake?

30
00:02:23,260 --> 00:02:29,140
So this is the difference between supervised and reinforcement learning. Supervised learning is just

31
00:02:29,140 --> 00:02:35,290
a function. You can call this function repeatedly, but it's still just a function. You pass in an image

32
00:02:35,320 --> 00:02:39,970
and it produces an output. Reinforcement learning is more like a loop.

33
00:02:40,180 --> 00:02:42,190
It exists to achieve some goal.

34
00:02:42,520 --> 00:02:49,540
For example, driving you to your desired destination. Inside the loop, yes, it still takes in an image and

35
00:02:49,540 --> 00:02:56,050
produces an output that specifies how to control the car, but importantly, this reinforcement learning

36
00:02:56,050 --> 00:02:58,660
program has the concept of time.

37
00:02:58,840 --> 00:03:01,200
It doesn't just think: what is this image?

38
00:03:01,210 --> 00:03:03,970
How do I translate this image into an output prediction?

39
00:03:04,450 --> 00:03:12,400
Instead, it has the capacity to plan for the future. Even though at this moment the car may only see where

40
00:03:12,400 --> 00:03:14,000
it is on the road right now,

41
00:03:14,110 --> 00:03:19,690
it knows that there is some sequence of actions it must take in the future that will lead it towards

42
00:03:19,690 --> 00:03:20,260
its goal.
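
To make the function-versus-loop idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made up for illustration: `classify` is a stand-in for a supervised model, and `drive` is a toy one-dimensional "road" with a hand-written policy, not a learned one. The only point is that `classify` maps one input to one output, while `drive` runs a loop in which each action changes what is observed next, until a goal is reached.

```python
from typing import Callable, List

GOAL_POSITION = 3  # hypothetical goal on a toy 1-D road


def classify(image: List[float]) -> str:
    """Stand-in for a supervised model: one input in, one prediction out."""
    return "bright" if sum(image) / len(image) > 0.5 else "dark"


def toward_goal_policy(position: int) -> str:
    """Hand-written policy (not learned): steer toward GOAL_POSITION."""
    return "right" if position < GOAL_POSITION else "left"


def drive(policy: Callable[[int], str], start: int, max_steps: int = 20) -> List[int]:
    """Control loop: observe, act, and let the action change the next observation."""
    position = start
    trajectory = [position]
    for _ in range(max_steps):
        if position == GOAL_POSITION:  # the loop exists to reach a goal
            break
        action = policy(position)
        position += 1 if action == "right" else -1
        trajectory.append(position)
    return trajectory


# Calling the static function twice: the calls do not affect each other.
print(classify([0.9, 0.8]), classify([0.1, 0.2]))

# Running the loop: each step depends on where the previous action left us.
print(drive(toward_goal_policy, start=0))
```

In a real reinforcement learning setup the policy would be learned rather than written by hand. This sketch only shows the shape of the loop.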

43
00:03:21,610 --> 00:03:26,960
All right, so that's the major difference between supervised learning and reinforcement learning. With

44
00:03:26,990 --> 00:03:27,890
supervised learning,

45
00:03:27,890 --> 00:03:31,850
we have no concept of a goal or the future or planning.

46
00:03:31,850 --> 00:03:34,280
We just take an input and produce an output.

47
00:03:34,280 --> 00:03:35,840
It's a static function.

48
00:03:36,170 --> 00:03:42,590
With reinforcement learning, we have a plan, and that plan can be carried out in the future to reach some

49
00:03:42,590 --> 00:03:43,580
predefined goal.

50
00:03:48,670 --> 00:03:51,790
Here's another way to think about reinforcement learning.

51
00:03:51,790 --> 00:03:55,180
Think about the data. With supervised learning,

52
00:03:55,180 --> 00:03:58,620
again, let's use image classification as our example,

53
00:03:58,720 --> 00:04:02,560
we must have a label for every input in our training set.

54
00:04:02,620 --> 00:04:07,990
So if we have an image of a dog we must have another item specifying the class dog.

55
00:04:08,320 --> 00:04:13,860
If we have an image of a cat we must have another item specifying the class cat.

56
00:04:13,870 --> 00:04:16,360
In other words, for every x we must have a y.

57
00:04:21,480 --> 00:04:26,170
It's important to remember that labeled datasets must be made by humans.

58
00:04:26,220 --> 00:04:30,840
Sometimes students have this really funny idea that we should just automate the creation of labeled

59
00:04:30,870 --> 00:04:32,630
datasets. Guys,

60
00:04:32,700 --> 00:04:37,970
if we already had computers that could perfectly label data, then that would mean we have already solved

61
00:04:37,980 --> 00:04:44,240
machine learning. Computers that can perfectly label data are actually what we are trying to build.

62
00:04:44,280 --> 00:04:48,840
If such computers already existed then we would not need to build them.

63
00:04:48,840 --> 00:04:54,090
All right, so hopefully you are convinced that labeled data comes from humans and not some super smart

64
00:04:54,090 --> 00:04:59,660
computer.

65
00:04:59,680 --> 00:05:00,730
Why is this important?

66
00:05:01,420 --> 00:05:04,210
Well, think about our self-driving car again.

67
00:05:04,300 --> 00:05:06,260
Take this image of the road.

68
00:05:06,520 --> 00:05:11,470
If we were to use supervised learning for this data point we would need to give it a target.

69
00:05:11,590 --> 00:05:13,060
But what's the target?

70
00:05:13,060 --> 00:05:14,110
Should I steer left?

71
00:05:14,110 --> 00:05:15,310
Should I steer right?

72
00:05:15,310 --> 00:05:16,210
Should I accelerate?

73
00:05:16,210 --> 00:05:17,640
Should I brake?

74
00:05:17,650 --> 00:05:20,550
In fact, you probably cannot give this image a target.

75
00:05:23,400 --> 00:05:29,040
And even if you could, how could you possibly label every single frame that the car will encounter along

76
00:05:29,040 --> 00:05:30,660
its journey?

77
00:05:30,660 --> 00:05:36,960
Imagine if you have a standard camera that captures images at 30 frames per second and you have a one

78
00:05:36,960 --> 00:05:39,130
hour drive as your dataset.

79
00:05:39,390 --> 00:05:45,630
One hour is 3,600 seconds, which means you would have to label 108,000

80
00:05:45,630 --> 00:05:47,420
images from just one trip.
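
The arithmetic behind that number can be checked in a few lines of Python. The frame rate and trip length come from the lecture. The `labeling_hours` helper and its 5 seconds per label are purely hypothetical, added only to give a feel for the human effort involved.

```python
FRAMES_PER_SECOND = 30        # from the lecture: a standard camera
SECONDS_PER_HOUR = 60 * 60    # 3,600 seconds in a one-hour drive

frames_to_label = FRAMES_PER_SECOND * SECONDS_PER_HOUR
print(frames_to_label)  # 108000


def labeling_hours(frames: int, seconds_per_label: float) -> float:
    """Human time needed to label every frame, in hours.

    seconds_per_label is an assumed figure, not a measured one.
    """
    return frames * seconds_per_label / SECONDS_PER_HOUR


# Hypothetical: if a person needed 5 seconds per frame.
print(labeling_hours(frames_to_label, seconds_per_label=5))  # 150.0
```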

81
00:05:52,630 --> 00:05:53,270
Instead,

82
00:05:53,290 --> 00:05:57,330
reinforcement learning learns using goals rather than targets.

83
00:05:57,340 --> 00:06:03,020
For example, suppose you wanted to teach a reinforcement learning algorithm to solve a maze.

84
00:06:03,070 --> 00:06:06,490
In this scenario, the goal would be finding the maze exit.

85
00:06:06,790 --> 00:06:12,150
You do not need to tell the algorithm what the correct thing to do is for each position in the maze,

86
00:06:12,250 --> 00:06:17,920
since that would be supervised learning. Instead, the only thing the reinforcement learning algorithm

87
00:06:17,920 --> 00:06:21,720
needs to know is what the goal is. From there,

88
00:06:21,730 --> 00:06:27,300
what to do in each position of the maze can be figured out by reinforcement learning.

89
00:06:27,700 --> 00:06:29,530
That is the power of this new paradigm.
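
As a rough illustration of learning from a goal rather than from per-position labels, here is a small Python sketch. It uses tabular Q-learning, a specific algorithm this lecture has not introduced, purely as one possible stand-in. The maze layout, the reward of 1 at the exit, and all the numeric settings are assumptions chosen for the example. Notice that nowhere do we tell the program which move is correct in any cell; the only signal is whether the exit was reached.

```python
import random

# Hypothetical 3x3 maze: S = start, G = exit (the goal), # = wall, . = open.
MAZE = ["S.#",
        ".##",
        "..G"]
START, GOAL = (0, 0), (2, 2)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action):
    """Apply a move; walls and edges leave the position unchanged."""
    row, col = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    if 0 <= row < 3 and 0 <= col < 3 and MAZE[row][col] != "#":
        state = (row, col)
    reward = 1.0 if state == GOAL else 0.0  # the only feedback: did we reach the exit?
    return state, reward, state == GOAL


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning with purely random exploration (assumed settings)."""
    rng = random.Random(seed)
    q = {}  # (state, action) -> estimated value, missing entries count as 0
    for _ in range(episodes):
        state = START
        for _ in range(100):  # cap on steps per episode
            action = rng.choice(list(MOVES))
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q.get((next_state, a), 0.0) for a in MOVES)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
            state = next_state
            if done:
                break
    return q


def greedy_path(q, max_steps=20):
    """Follow the highest-valued move from each position, starting at START."""
    state, path = START, [START]
    for _ in range(max_steps):
        if state == GOAL:
            break
        action = max(MOVES, key=lambda a: q.get((state, a), 0.0))
        state, _, _ = step(state, action)
        path.append(state)
    return path


print(greedy_path(train()))
```

The printed path is the route the program settled on after training. We only ever specified the goal, and the per-position decisions came out of the learning loop.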