1
00:00:02,230 --> 00:00:06,790
Hey everyone, and welcome back to this class, Natural Language Processing in Python.

2
00:00:11,670 --> 00:00:16,530
In this lecture, we will continue looking at our code to implement a cipher decryption algorithm.

3
00:00:17,160 --> 00:00:21,570
This lecture will focus on creating the language model and the associated functions.

4
00:00:22,290 --> 00:00:24,780
Let's start by defining the data structures we need.

5
00:00:30,580 --> 00:00:34,480
First, we need a Markov matrix to store all the bigram probabilities.

6
00:00:34,990 --> 00:00:41,500
As stated earlier, since there are 26 starting letters and 26 ending letters, we must have a 26 by

7
00:00:41,500 --> 00:00:42,640
26 matrix.

8
00:00:42,940 --> 00:00:43,870
We'll call this M.

9
00:00:44,560 --> 00:00:49,540
We also need a vector to store all the unigram probabilities. But to be more specific,

10
00:00:49,840 --> 00:00:52,540
this will represent the initial state distribution.

11
00:00:52,750 --> 00:00:56,590
In other words, we don't care about unigrams that appear later in a word; we just care about

12
00:00:56,590 --> 00:00:57,550
the first letter.

13
00:00:58,270 --> 00:01:01,950
So we'll call this pi, and it has 26 entries, one for each letter.

14
00:01:05,080 --> 00:01:09,670
Next, we need some functions to help us actually populate these data structures, pi and M.

15
00:01:10,510 --> 00:01:14,590
First, we'll have a function to update M, and we'll call it update_transition.

16
00:01:15,250 --> 00:01:16,760
It takes in two arguments.

17
00:01:17,170 --> 00:01:24,670
ch1 and ch2: ch1 represents the starting character and ch2 represents the ending character.

18
00:01:25,450 --> 00:01:31,930
The important thing to remember is that in order to index a matrix, we must use integers. Since the

19
00:01:31,930 --> 00:01:35,530
number of rows and columns of the matrix are both 26,

20
00:01:35,830 --> 00:01:38,980
the indices will be from 0 to 25 inclusive.

21
00:01:39,730 --> 00:01:45,040
Luckily, we have a useful Python function that can convert characters into integers.

22
00:01:45,370 --> 00:01:46,390
It's called ord.

23
00:01:48,310 --> 00:01:53,260
Just as a side note, if you are not a computer scientist, you may have never seen anything like this

24
00:01:53,260 --> 00:01:55,900
before, so let me describe what's happening.

25
00:01:56,170 --> 00:02:02,020
Remember that everything stored in your computer is stored in binary code, just zeros and ones.

26
00:02:02,530 --> 00:02:09,139
A series of zeros and ones is called a base-2 number system, just like how the system we as humans use

27
00:02:09,139 --> 00:02:11,560
to count is a base-10 number system.

28
00:02:12,250 --> 00:02:18,280
In any case, the important thing is that everything stored inside a computer is represented by a number.

29
00:02:18,850 --> 00:02:23,650
Those numbers happen to be zeros and ones, but they are still just numbers at the end of the day.

30
00:02:24,130 --> 00:02:29,890
So you have 0000, 0001, 0010, 0011, and so on.

31
00:02:32,010 --> 00:02:38,280
Now, ASCII characters, because they are also just things stored in a computer, are also represented

32
00:02:38,280 --> 00:02:39,000
by numbers.

33
00:02:39,540 --> 00:02:46,260
It just so happens that this function, ord, takes in a character and returns its integer representation.

34
00:02:46,710 --> 00:02:49,170
In other words, the number that it's represented by.

35
00:02:49,980 --> 00:02:55,950
So a long time ago, some computer scientists decided which characters should be represented by which

36
00:02:55,950 --> 00:02:56,580
numbers.

37
00:02:56,880 --> 00:03:03,120
So a lowercase a is 97, a lowercase b is 98, a lowercase c is 99, and so on.

38
00:03:03,900 --> 00:03:08,760
Luckily, these are all in order, so there's no need to worry about the letters not being contiguous.

39
00:03:10,350 --> 00:03:16,140
In other words, all we need to do in order to get integers that are appropriate for indexing our matrix

40
00:03:16,590 --> 00:03:21,330
is just call the ord function and then subtract 97. By doing this,

41
00:03:21,360 --> 00:03:27,240
a will get mapped to 0, b will get mapped to 1, and so on up to z, which gets mapped to 25.

42
00:03:27,960 --> 00:03:30,990
And remember, that's because all the letters are contiguous.
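
As a rough sketch of the mapping just described: the helper name char_to_index is made up for this illustration and is not from the lecture code. It assumes the input is a single lowercase ASCII letter.

```python
def char_to_index(ch):
    # ord('a') is 97, so subtracting 97 maps 'a'..'z' onto 0..25.
    # char_to_index is a hypothetical name used only in this sketch.
    return ord(ch) - 97

print(char_to_index('a'))  # 0
print(char_to_index('b'))  # 1
print(char_to_index('z'))  # 25
```

Uppercase letters, spaces, or punctuation would fall outside 0 to 25, so text would need to be lowercased and filtered before using this.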

43
00:03:31,800 --> 00:03:37,260
Finally, once we have i and j, we add 1 to M at position i, j.

44
00:03:37,980 --> 00:03:42,480
Now you might be wondering: if all we're doing is counting, then these are not proper probabilities.

45
00:03:42,840 --> 00:03:43,800
We'll deal with that later.

46
00:03:45,710 --> 00:03:50,870
Also, notice that I've initialized the matrix M to start with ones, not zeros.

47
00:03:51,350 --> 00:03:55,370
This is because we want to use add-one smoothing, as described in the lectures.
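
A minimal sketch of what this part might look like. It uses plain nested lists so that it runs without extra libraries, which is an assumption of this sketch; the actual lecture code may use a different array type. The argument names ch1 and ch2 are as heard in the audio.

```python
# 26x26 table of bigram counts. Every entry starts at 1 rather than 0,
# which is the add-one smoothing mentioned above.
# Nested lists are an assumption of this sketch.
M = [[1] * 26 for _ in range(26)]

def update_transition(ch1, ch2):
    # Map the starting and ending characters to row and column indices.
    i = ord(ch1) - 97
    j = ord(ch2) - 97
    # Count one more occurrence of the bigram (ch1, ch2).
    M[i][j] += 1

update_transition('t', 'h')
update_transition('t', 'h')
print(M[ord('t') - 97][ord('h') - 97])  # 3: the initial 1 plus two updates
```

These are still counts, not probabilities; normalizing them is handled separately.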

48
00:04:00,490 --> 00:04:06,370
Next, we have a function to update pi, which uses the same principle. We take in one argument, which

49
00:04:06,370 --> 00:04:08,320
represents the first character in a word.

50
00:04:08,710 --> 00:04:11,240
And then we map that character to an integer called i.

51
00:04:11,980 --> 00:04:15,730
Then we index pi by i and add 1. Again,

52
00:04:15,730 --> 00:04:18,519
it's not a probability yet, but we'll deal with that later.
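
A sketch of the same idea for pi. The function name update_pi and starting the counts at zero are assumptions of this sketch, since the audio does not state either; the small word list is made up for illustration.

```python
# Counts of how often each letter starts a word.
# Starting at 0 is an assumption; the lecture does not say how pi is initialized.
pi = [0] * 26

def update_pi(ch):
    # ch is the first character of a word; map it to an index and count it.
    i = ord(ch) - 97
    pi[i] += 1

# Hypothetical example words, used only to exercise the function.
for word in ["the", "cat", "took"]:
    update_pi(word[0])

print(pi[ord('t') - 97])  # 2
print(pi[ord('c') - 97])  # 1
```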

53
00:04:24,830 --> 00:04:30,470
Next, we have a function to get the log probability of a single word. The function we look at after

54
00:04:30,470 --> 00:04:35,990
this one will help us get the log probability of an entire sentence. That should be easy to write

55
00:04:36,230 --> 00:04:42,830
once we have this function. So this function takes in one argument, which is a single word. Inside

56
00:04:42,830 --> 00:04:43,480
the function,

57
00:04:43,490 --> 00:04:48,680
we get the unigram probability by taking the first letter in the word, word at index 0.

58
00:04:49,400 --> 00:04:54,110
We convert that into an index, and then we find its probability by indexing pi.

59
00:04:54,950 --> 00:05:00,710
Next, we take the log, since we would prefer to work in log space, and assign this to a variable called

60
00:05:00,710 --> 00:05:01,460
logp.

61
00:05:04,000 --> 00:05:06,700
Next, we loop through the rest of the characters in the word.

62
00:05:07,540 --> 00:05:10,930
Notice how we index the word variable with one and a colon.

63
00:05:11,410 --> 00:05:14,500
This means start at index 1 and go all the way to the end.

64
00:05:15,280 --> 00:05:22,060
Inside the loop, we convert the current character, ch, into an integer j using the same method as before.

65
00:05:23,020 --> 00:05:29,800
Notice that the variable i still represents the previous letter, so now we can index M at position

66
00:05:29,810 --> 00:05:32,590
i, j to get the next bigram probability.

67
00:05:33,160 --> 00:05:35,830
We take the log of this and then add it to logp.

68
00:05:36,850 --> 00:05:41,980
Notice how we're using the plus equals operator here, so it just adds whatever is on the right to the

69
00:05:41,980 --> 00:05:42,940
existing value.

70
00:05:44,080 --> 00:05:49,750
Next, we have to remember to update i, which is supposed to represent the previous letter, to j,

71
00:05:49,810 --> 00:05:50,620
the current letter.

72
00:05:51,400 --> 00:05:55,990
This is because the current letter becomes the previous letter on the next iteration of this loop.

73
00:05:56,830 --> 00:06:01,600
Once we're done looping through all the letters, we've found the log probability of the given word, so

74
00:06:01,600 --> 00:06:02,830
we return logp.
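
The steps above could be sketched roughly as follows. The uniform values in pi and M are placeholders so the example runs on its own; in the real program they would be the normalized counts. The name get_word_prob is taken from how the function is referred to later in the audio.

```python
import math

# Placeholder probabilities (uniform), assumed only so this sketch is runnable.
pi = [1.0 / 26] * 26
M = [[1.0 / 26] * 26 for _ in range(26)]

def get_word_prob(word):
    # Log probability of the first letter, from the initial state distribution.
    i = ord(word[0]) - 97
    logp = math.log(pi[i])

    # Add the log probability of each bigram (previous letter i, current letter j).
    for ch in word[1:]:
        j = ord(ch) - 97
        logp += math.log(M[i][j])
        i = j  # the current letter becomes the previous letter next time around

    return logp

print(get_word_prob("cat"))
```

With the uniform placeholders, a three-letter word scores 3 * log(1/26): one unigram term plus two bigram terms.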

75
00:06:09,240 --> 00:06:15,390
Finally, we have a function called get_sequence_prob, which returns the log probability of an entire

76
00:06:15,390 --> 00:06:17,640
sequence. As input,

77
00:06:17,670 --> 00:06:21,510
it takes an argument called words, which can be one of two things.

78
00:06:22,080 --> 00:06:28,770
First, it can be a string containing multiple words, i.e. a sentence, or it can be a list containing

79
00:06:28,770 --> 00:06:31,290
multiple words, each stored as a string.

80
00:06:32,010 --> 00:06:37,830
What we'll do is, if the input argument is a string, then we'll call words.split() in order to convert

81
00:06:37,830 --> 00:06:39,720
it into a list of strings.

82
00:06:40,350 --> 00:06:46,350
So after this first block of code, we can be sure that the words variable is a list containing strings,

83
00:06:46,680 --> 00:06:48,690
each of which is an individual word.

84
00:06:51,310 --> 00:06:54,790
Next, we initialize a variable called logp to zero.

85
00:06:56,680 --> 00:07:00,730
Then we loop through each word in our list of words. Inside the loop,

86
00:07:00,760 --> 00:07:05,350
we call the previous function, get_word_prob, to get the log probability of the current word.

87
00:07:06,100 --> 00:07:10,900
Then we add the result to logp, and we keep doing this until we've seen all the words.

88
00:07:11,840 --> 00:07:15,530
Once we're done, we have the log probability of the entire sentence.
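
Putting this last function into code, a sketch might look like the following. The per-word function is repeated here so the block runs on its own, and pi and M again hold uniform placeholder values, which is an assumption. Using isinstance for the string check is also a choice made for this sketch.

```python
import math

# Placeholder probabilities (uniform), assumed only so this sketch is runnable.
pi = [1.0 / 26] * 26
M = [[1.0 / 26] * 26 for _ in range(26)]

def get_word_prob(word):
    # Same per-word log probability as described earlier.
    i = ord(word[0]) - 97
    logp = math.log(pi[i])
    for ch in word[1:]:
        j = ord(ch) - 97
        logp += math.log(M[i][j])
        i = j
    return logp

def get_sequence_prob(words):
    # Accept either a sentence string or a list of word strings.
    if isinstance(words, str):
        words = words.split()

    # Sum the log probabilities of the individual words.
    logp = 0
    for word in words:
        logp += get_word_prob(word)
    return logp

# Both input forms give the same result.
print(get_sequence_prob("the cat sat"))
print(get_sequence_prob(["the", "cat", "sat"]))
```

Adding log probabilities corresponds to multiplying the word probabilities, which is why working in log space keeps the numbers manageable for long sentences.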