1
00:00:11,670 --> 00:00:17,170
In this lecture, we are going to define one more term, which deserves its own lecture.

2
00:00:17,190 --> 00:00:21,450
So far, I told you that the goal of the agent is to maximize the reward it gets.

3
00:00:21,720 --> 00:00:26,150
But as you recall, the reward may be structured differently in different games.

4
00:00:26,220 --> 00:00:33,210
For example, in tic-tac-toe, you may receive a +1 for winning or a -1 for losing. On the other

5
00:00:33,210 --> 00:00:33,530
hand,

6
00:00:33,540 --> 00:00:36,960
if you're solving a maze, you might get a -1 at every step.

7
00:00:37,620 --> 00:00:40,610
So what does it mean to maximize the reward?

8
00:00:40,620 --> 00:00:43,460
Does it mean to maximize the reward at the next step?

9
00:00:43,560 --> 00:00:47,280
Or does it mean to maximize the total reward over an entire episode?

10
00:00:52,290 --> 00:01:00,060
Here's the answer. Stated more accurately, the goal of the agent is to maximize the sum of future rewards.

11
00:01:00,060 --> 00:01:00,660
Why is that?

12
00:01:01,350 --> 00:01:03,930
Well, it can't maximize the rewards it already got.

13
00:01:03,930 --> 00:01:05,140
Those are in the past.

14
00:01:05,160 --> 00:01:06,240
They cannot be changed.

15
00:01:06,960 --> 00:01:11,680
Furthermore, we don't want to maximize only the reward on the next step.

16
00:01:11,730 --> 00:01:15,930
What if we are solving a maze and we get a -1 for every step we take?

17
00:01:16,200 --> 00:01:22,260
In that case, the agent isn't incentivized to do anything useful, because no matter what it does, the immediate

18
00:01:22,260 --> 00:01:24,490
reward is still -1.

19
00:01:24,630 --> 00:01:31,280
Thus, the agent's true goal is to maximize the sum of future rewards until the episode is over.

20
00:01:31,500 --> 00:01:35,690
In this way, the agent is planning its future steps as well.

21
00:01:35,910 --> 00:01:41,520
It must have some concept of where it will end up, because that's the only way it will know how to maximize

22
00:01:41,520 --> 00:01:42,650
rewards in the future.

23
00:01:47,730 --> 00:01:49,330
To take a real-world example,

24
00:01:49,380 --> 00:01:54,020
consider again the idea of preparing for a math exam. For a math exam,

25
00:01:54,030 --> 00:01:58,290
you don't receive any reward until you've completed the exam.

26
00:01:58,290 --> 00:02:00,870
The reward signal is your grade on the exam.

27
00:02:01,590 --> 00:02:05,730
But imagine all the actions it will take to actually maximize that reward.

28
00:02:05,730 --> 00:02:08,370
You'll have to study. You'll have to do homework.

29
00:02:08,490 --> 00:02:11,560
You'll have to forgo socializing with your friends.

30
00:02:12,030 --> 00:02:18,420
In fact, all of those actions do not sound very rewarding at all. Thus, if your only incentive is immediate

31
00:02:18,420 --> 00:02:19,720
gratification,

32
00:02:19,830 --> 00:02:22,550
in other words, the rewards you receive immediately,

33
00:02:22,560 --> 00:02:25,540
then you will not do well on your math exam.

34
00:02:25,560 --> 00:02:28,830
Instead, you must make use of long-term planning.

35
00:02:28,830 --> 00:02:30,950
Sure, you may not want to study today.

36
00:02:31,110 --> 00:02:34,580
It may be very annoying, and you will miss your favorite TV show.

37
00:02:34,830 --> 00:02:39,450
But because you are planning long term, you're not thinking only about today.

38
00:02:39,450 --> 00:02:42,350
You're thinking about the results of your math exam.

39
00:02:42,660 --> 00:02:47,700
The desire to maximize the total future reward is necessary for long-term planning.

40
00:02:52,860 --> 00:02:57,980
We call the sum of future rewards the return. We describe the return mathematically

41
00:02:57,980 --> 00:03:04,260
using the symbol G. Because it depends on future rewards only, it is time dependent,

42
00:03:04,380 --> 00:03:06,930
so we index it with a t.

43
00:03:06,930 --> 00:03:13,050
We can say the return at time t is the sum of rewards from time t+1 up to the terminal state at

44
00:03:13,050 --> 00:03:14,070
time big T.
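
Written out, the definition just described looks roughly like this. The notation is an assumption, since the slide itself is not part of the transcript: R with a subscript is taken to mean the reward received at that time step, and T is the time of the terminal state.

```latex
G_t = R_{t+1} + R_{t+2} + \cdots + R_T = \sum_{\tau = t+1}^{T} R_\tau
```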

45
00:03:19,200 --> 00:03:25,810
Now you might be wondering what happens if we have an infinite-horizon MDP, a game that never ends.

46
00:03:25,810 --> 00:03:28,580
In this case, your return might be infinite.

47
00:03:28,660 --> 00:03:34,930
Therefore, we introduce a concept known as discounting. Discounting is used for infinitely long tasks,

48
00:03:35,140 --> 00:03:38,230
but it's used for episodic tasks as well.

49
00:03:38,230 --> 00:03:41,440
We introduce a discount factor called gamma.

50
00:03:41,440 --> 00:03:48,710
Each future reward is weighted by gamma to some power. Gamma is usually a number close to 1, like

51
00:03:48,710 --> 00:03:52,460
0.9, 0.99, or 0.999.

52
00:03:52,460 --> 00:03:57,980
It's a hyperparameter, so you'll have to choose its value based on the performance of your agent.

53
00:03:57,980 --> 00:04:02,780
The idea is that the further you go into the future, the harder it is to predict.

54
00:04:02,810 --> 00:04:07,840
Therefore, we care a little more about getting rewards now than we do later.

55
00:04:07,850 --> 00:04:10,280
Intuitively, this works just like money.

56
00:04:10,280 --> 00:04:15,950
I would rather receive $100 today than receive $100 ten years from now.

57
00:04:15,950 --> 00:04:20,690
$100 received in ten years is worth much less than $100 today, because money held today can earn interest.
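
As a minimal sketch of the discounting idea, the snippet below weights each future reward by gamma raised to how far away it is. The function name `discounted_return` and the example rewards are hypothetical, and the convention that the very next reward gets weight gamma to the power 0 is an assumption chosen to match the standard definition.

```python
def discounted_return(rewards, gamma):
    """Sum of future rewards, each weighted by gamma raised to its delay.

    Assumes rewards[k] is the reward received k + 1 steps from now, so the
    first future reward gets weight gamma ** 0 == 1.
    """
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# Hypothetical maze episode: -1 on each of the 3 remaining steps.
maze_rewards = [-1.0, -1.0, -1.0]
print(discounted_return(maze_rewards, gamma=1.0))  # no discounting
print(discounted_return(maze_rewards, gamma=0.9))  # later steps count less
```

With gamma equal to 1 this reduces to the plain sum of future rewards. With gamma below 1, rewards further in the future contribute less to the total.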

58
00:04:25,790 --> 00:04:30,800
One important feature of the return, which we will make use of throughout the rest of the section, is

59
00:04:30,800 --> 00:04:32,990
that it can be defined recursively,

60
00:04:33,260 --> 00:04:40,130
in other words, in terms of itself. Specifically, the return at time t is equal to the reward at time

61
00:04:40,130 --> 00:04:45,500
t+1 plus gamma times the return at time t+1.
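
The substitution can be written out as follows. This is a sketch that assumes the usual convention in which the first future reward has weight 1, the next has weight gamma, and so on, up to the terminal time T.

```latex
\begin{align*}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T \\
    &= R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \cdots + \gamma^{T-t-2} R_T \right) \\
    &= R_{t+1} + \gamma \, G_{t+1}
\end{align*}
```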

62
00:04:45,500 --> 00:04:50,390
This may not seem like much more than a simple math substitution now, but you'll see how it will become

63
00:04:50,390 --> 00:04:51,980
very useful later on.
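
One place the recursive form helps is in computing the return at every time step of a finished episode in a single backward pass. The sketch below is illustrative only: the function name `returns_from_rewards`, the list layout, and the example rewards are assumptions, not something taken from the lecture.

```python
def returns_from_rewards(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] given rewards = [R_1, ..., R_T].

    Applies G_t = R_{t+1} + gamma * G_{t+1}, starting from G_T = 0
    at the terminal state and working backward through the episode.
    """
    returns = [0.0] * len(rewards)
    g = 0.0  # G_T: no rewards remain once the episode is over
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # rewards[t] holds R_{t+1}
        returns[t] = g
    return returns


# Hypothetical episode: two steps costing -1, then +1 at the end.
episode_rewards = [-1.0, -1.0, 1.0]
print(returns_from_rewards(episode_rewards, gamma=0.9))
```

Working backward means each return reuses the one after it, so the whole list costs one pass over the rewards instead of a fresh sum at every time step.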
