1
00:00:11,130 --> 00:00:16,350
So in this lecture, I'm going to give you an official exercise prompt in order to prepare you for the

2
00:00:16,350 --> 00:00:18,450
next lecture where we look at the code.

3
00:00:19,170 --> 00:00:23,280
So to start, let's go for a basic outline of what the code should look like.

4
00:00:24,180 --> 00:00:25,680
Firstly, you'll want to download.

5
00:00:25,680 --> 00:00:28,830
The data will be using poems by Robert Frost.

6
00:00:29,520 --> 00:00:33,510
Note that the yoro for this text corpus will be given in the coming notebook.

7
00:00:33,870 --> 00:00:38,880
So feel free to grab it from there, but do not cheat by looking at the rest of the code.

8
00:00:39,870 --> 00:00:45,330
The second step will be to read in the data and to use that data to form our second order Markov model.

9
00:00:46,110 --> 00:00:50,970
Remember that there are many ways to do this, although we've spoken about Markov models in terms of

10
00:00:50,970 --> 00:00:51,720
matrices.

11
00:00:51,960 --> 00:00:57,300
The next quote example will allow you to explore another way to do this without using matrices.

12
00:00:57,720 --> 00:01:00,850
So if you'd like to use an umpire raise, you can feel free to do so.

13
00:01:01,140 --> 00:01:03,930
But please be aware that this is not the only way.

14
00:01:05,069 --> 00:01:11,280
The third step is, once you have the Markov model, use the Markov model to generate poems in the coming

15
00:01:11,280 --> 00:01:16,290
example will be generating four lines at a time, but you don't need to follow this constraint.

16
00:01:20,940 --> 00:01:24,510
Now, I want to make note that this exercise is very open-ended.

17
00:01:25,140 --> 00:01:28,950
As mentioned, we'll be using dictionaries to store word probabilities.

18
00:01:29,460 --> 00:01:34,080
So for the rest of this lecture, I want to give you a brief outline of how that will work and why it's

19
00:01:34,080 --> 00:01:34,590
useful.

20
00:01:36,090 --> 00:01:40,650
So suppose I'd like to store the initial word distribution, which we've been calling PI.

21
00:01:41,370 --> 00:01:46,920
In practice, all we really need is a dictionary where the key is the word and the value is the corresponding

22
00:01:46,920 --> 00:01:48,540
probability for that word.

23
00:01:50,610 --> 00:01:54,270
Note that in the coming example, we will not be using AD one smoothing.

24
00:01:54,540 --> 00:01:59,850
So the number of key value pairs in the dictionary is only the number of words that are actually used

25
00:02:00,030 --> 00:02:01,500
in the first line of a poem.

26
00:02:02,190 --> 00:02:07,290
That is, although there may be very possible words in the vocabulary, the dictionary will only store

27
00:02:07,290 --> 00:02:08,910
a subset of those words.

28
00:02:09,930 --> 00:02:13,560
This is opposed to the full PI vector, which always has size V.

29
00:02:18,190 --> 00:02:23,590
Now this brings us to the next important point, which is why might we want to use dictionaries instead

30
00:02:23,590 --> 00:02:24,820
of an umpire raise?

31
00:02:25,390 --> 00:02:27,880
This introduces us to the idea of sparsity.

32
00:02:28,660 --> 00:02:34,810
As you recall, one reason we need to use add one smoothing is because many transitions simply do not

33
00:02:34,810 --> 00:02:36,640
occur in the training corpus.

34
00:02:37,090 --> 00:02:41,170
In fact, this gets worse and worse as the order of the model gets larger.

35
00:02:41,860 --> 00:02:44,500
We refer to this as the curse of dimensionality.

36
00:02:45,460 --> 00:02:50,590
The basic idea is, although the number of values grows exponentially with the number of words in the

37
00:02:50,590 --> 00:02:55,720
sequence, we want to consider the amount of data we actually have gets smaller in comparison.

38
00:02:56,500 --> 00:03:00,280
Put another way, our transition arrays will be mostly zeroes.

39
00:03:00,820 --> 00:03:05,740
So although our second order mark of a ray, technically you should have size v by v by V.

40
00:03:06,160 --> 00:03:08,140
Most of those values will just be zero.

41
00:03:08,740 --> 00:03:13,840
But if we use a dictionary that only stores the words that we have actually seen in a sequence, then

42
00:03:13,840 --> 00:03:16,690
the amount of values we need to store will be much smaller.

43
00:03:21,560 --> 00:03:27,320
OK, so how should we saw the first order transitions, as you recall, these pertain to the second

44
00:03:27,320 --> 00:03:28,340
word in each line.

45
00:03:29,150 --> 00:03:32,930
So for this, what we will use is a dictionary of dictionaries.

46
00:03:33,440 --> 00:03:35,570
Luckily, it's simpler than it sounds.

47
00:03:35,930 --> 00:03:37,730
It helps to just see what it looks like.

48
00:03:38,360 --> 00:03:44,180
So the key in the first dictionary will be the previous word that is the word at time T minus one.

49
00:03:44,930 --> 00:03:49,220
The value this time will not be a probability, but instead another dictionary.

50
00:03:49,850 --> 00:03:53,300
This nested dictionary will also store words as keys.

51
00:03:53,750 --> 00:03:58,740
But this time, these keys correspond to the second word in a two word sequence.

52
00:03:59,450 --> 00:04:05,030
Then the corresponding value will be the probability of that two words sequence occurring, given that

53
00:04:05,030 --> 00:04:06,710
the first word already occurred.

54
00:04:07,400 --> 00:04:11,450
So as an example, the first word here is the and the second word is cat.

55
00:04:12,020 --> 00:04:14,990
This has the corresponding value zero point zero five.

56
00:04:15,530 --> 00:04:22,700
Therefore, we can say that p of cat given there is zero point zero five as an exercise.

57
00:04:22,940 --> 00:04:28,250
I'll leave you to think about how to store the second order transition probabilities using a similar

58
00:04:28,250 --> 00:04:28,730
scheme.

59
00:04:33,450 --> 00:04:39,210
Now, the next issue I want to consider in this lecture is given a dictionary with these word probability

60
00:04:39,210 --> 00:04:39,840
pairs.

61
00:04:40,230 --> 00:04:44,010
How can we sample these words using the corresponding probabilities?

62
00:04:44,730 --> 00:04:48,030
Figuring out how to do this is part of your exercise for this course.

63
00:04:48,300 --> 00:04:53,220
So I'm not going to fully explain how to do it before I show you the solution to the exercise.

64
00:04:53,760 --> 00:04:59,010
What I will give you is the basic principle, which if you follow it, you should be able to implement

65
00:04:59,010 --> 00:04:59,940
this successfully.

66
00:05:04,630 --> 00:05:11,110
OK, so suppose I have three items A, B and C with probabilities zero point two, zero point five and

67
00:05:11,110 --> 00:05:12,120
zero point three.

68
00:05:13,510 --> 00:05:18,730
Now, suppose I draw a random number from the uniform distribution that is a number between a zero and

69
00:05:18,730 --> 00:05:21,640
a one with equal probability for each outcome.

70
00:05:22,720 --> 00:05:27,490
Now suppose that I bucket these outcomes according to the probabilities for A, B and C.

71
00:05:28,360 --> 00:05:33,860
So if the number a sample is between a zero and zero point two, then I will produce an A..

72
00:05:34,450 --> 00:05:41,860
If the number a sample is between 0.2 and 0.7, then I will produce a B. If the number a sample is between

73
00:05:41,860 --> 00:05:44,890
0.7 and one, then I will produce a C.

74
00:05:45,700 --> 00:05:52,030
You can see that that area covered by each of these ranges corresponds exactly to the given probabilities.

75
00:05:52,960 --> 00:05:58,480
One hints that you're being given is that we are computing the cumulative some of these probabilities.

76
00:05:59,020 --> 00:06:01,810
So zero point two is the first probability.

77
00:06:02,050 --> 00:06:07,870
Therefore, that becomes the first boundary, zero point five is the second probability, and we add

78
00:06:07,870 --> 00:06:10,150
that to the first one, which is zero point two.

79
00:06:10,810 --> 00:06:12,730
So in total, we get zero point seven.

80
00:06:13,330 --> 00:06:16,570
Therefore, zero point seven becomes the second boundary.

81
00:06:17,380 --> 00:06:19,480
OK, so hopefully you understand the pattern.

82
00:06:24,120 --> 00:06:30,030
One alternative, but less ideal way to do this exercise is to simply convert the values in the dictionary

83
00:06:30,030 --> 00:06:34,410
into two lists, then use NPD at random, not choice as normal.

84
00:06:35,010 --> 00:06:39,900
I consider this less ideal since it avoids the issue and doesn't give you practice with this kind of

85
00:06:39,900 --> 00:06:42,900
thinking that I would consider crucial for data science.

86
00:06:43,590 --> 00:06:46,980
OK, so please try to exercise and I will see you in the next lecture.

