1
00:00:02,160 --> 00:00:06,660
Hey, everyone, and welcome back to this class, Natural Language Processing in Python.

2
00:00:11,620 --> 00:00:17,110
In this lecture, we are going to continue looking at our cipher decryption script. The previous

3
00:00:17,110 --> 00:00:22,150
lecture looked at creating some functions that would help us to build our language model and then use

4
00:00:22,150 --> 00:00:25,120
it to evaluate the likelihood of a word or sentence.

5
00:00:25,810 --> 00:00:28,480
This lecture will focus on training the language model.

6
00:00:29,050 --> 00:00:31,120
This will involve working with actual data.

7
00:00:31,390 --> 00:00:34,330
So the first thing we need to do is download that data.

8
00:00:40,250 --> 00:00:45,860
So in this first block of code, we first check to see if we have a file called moby_dick.txt.

9
00:00:46,790 --> 00:00:50,060
If we don't, then we'll download it using the requests library.

10
00:00:51,020 --> 00:00:53,360
Notice that the file is just stored on my website.

11
00:00:53,510 --> 00:00:57,320
So if you're having any problems downloading this file, please let me know.

12
00:00:58,100 --> 00:01:04,280
So we call requests.get, and this stores the response in a variable called r. At this point,

13
00:01:04,280 --> 00:01:06,320
the response is not yet stored in a file.

14
00:01:06,950 --> 00:01:13,610
So next we open a new file called Moby Dick in write mode, and then we write the contents of the

15
00:01:13,610 --> 00:01:15,020
response to this file.

16
00:01:16,710 --> 00:01:18,060
So we call f.write.

17
00:01:18,210 --> 00:01:21,480
And as input, we pass in r.content.decode().

18
00:01:22,110 --> 00:01:26,940
We need to call the decode function because r.content is binary and not a string.

19
00:01:27,390 --> 00:01:30,060
So the decode function turns it into a string.
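
Here is a minimal sketch of that download step, just to make it concrete. The URL below is a placeholder and not the real address of the file, and the helper name download_if_missing and its fetch parameter are my own assumptions for illustration; only requests.get, the content attribute, and decode come from the lecture.

```python
import os


def download_if_missing(path, url, fetch=None):
    """Download url to path unless path already exists.

    Returns True if a download happened, False if the file was already there.
    fetch is a hypothetical hook (url -> bytes) so the sketch can be tried offline.
    """
    if os.path.exists(path):
        return False
    if fetch is None:
        import requests  # third-party library, imported only when needed
        fetch = lambda u: requests.get(u).content  # .content is bytes
    raw = fetch(url)
    with open(path, "w", encoding="utf-8") as f:
        # decode() turns the bytes into a string so it can be written in text mode
        f.write(raw.decode("utf-8"))
    return True


# Example call (placeholder URL, assumed for illustration only):
# download_if_missing("moby_dick.txt", "https://example.com/moby_dick.txt")
```

Writing in text mode is why the decode call is needed; opening the file in binary mode and writing the raw bytes would be another reasonable choice.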

20
00:01:37,170 --> 00:01:41,010
Next, we're going to populate our bigram and unigram data structures.

21
00:01:41,760 --> 00:01:46,140
First, we need a regex that will help us remove any non-alphabet characters.

22
00:01:46,650 --> 00:01:48,210
That's anything that's not a letter.

23
00:01:48,900 --> 00:01:50,650
We'll call that variable regex.

24
00:01:51,360 --> 00:01:54,040
Next, we enter a loop that reads the Moby Dick file

25
00:01:54,060 --> 00:01:56,700
line by line. Inside the loop,

26
00:01:56,730 --> 00:02:00,630
the first thing we do is strip out any whitespace by calling rstrip.

27
00:02:01,170 --> 00:02:02,550
Next, we say if line.

28
00:02:03,960 --> 00:02:09,570
The reason for this is that some lines in the file are just blank. Since we called the strip function previously,

29
00:02:09,900 --> 00:02:14,670
if this line is blank, it'll now be an empty string, which would fail this condition.

30
00:02:15,240 --> 00:02:19,320
So we only want to enter this if statement if there are actually words in this line.

31
00:02:20,340 --> 00:02:26,790
Next, we call regex.sub and replace any non-alpha characters in the line with a space.

32
00:02:27,540 --> 00:02:32,880
Technically, you wouldn't have to do this if you wanted to also encode and decode non-alpha characters

33
00:02:33,120 --> 00:02:35,310
or otherwise write code to deal with them.

34
00:02:35,550 --> 00:02:36,810
But let's keep this simple.

35
00:02:38,690 --> 00:02:40,950
Next, we lowercase all the letters in the line.

36
00:02:41,300 --> 00:02:44,570
And then we call the split function to get individual words.
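
To make those cleaning steps concrete, here is a small sketch. The function name tokenize is my own assumption for illustration, and the exact regex pattern is a guess at "anything that's not a letter"; the steps themselves (rstrip, the if check, regex.sub, lowercasing, split) follow what was just described.

```python
import re

# Matches anything that is not an ASCII letter (assumed pattern)
regex = re.compile('[^a-zA-Z]')


def tokenize(line):
    """Return the lowercase words in a line, or an empty list for a blank line."""
    line = line.rstrip()            # strip trailing whitespace, including the newline
    if not line:                    # blank lines become '' and are skipped
        return []
    line = regex.sub(' ', line)     # replace each non-letter with a space
    return line.lower().split()     # lowercase, then split on whitespace


print(tokenize("Call me Ishmael.\n"))  # ['call', 'me', 'ishmael']
print(tokenize("   \n"))               # []
```

Notice that punctuation inside a word, like an apostrophe or a hyphen, splits that word into two tokens, which is the price of keeping this simple.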

37
00:02:50,530 --> 00:02:57,610
As per conventional NLP terminology, I'm going to call these tokens. Next, we loop through each token

38
00:02:57,610 --> 00:02:58,420
one by one.

39
00:02:59,960 --> 00:03:05,080
Inside this loop, we're going to call the two functions we defined earlier for updating pi and M.

40
00:03:06,440 --> 00:03:12,530
Remember that these only update the counts. First, we grab the first letter, token at index zero.

41
00:03:13,040 --> 00:03:15,260
We pass this into the update_pi function.

42
00:03:16,070 --> 00:03:19,340
Next, we loop through all the other letters starting from the second letter.

43
00:03:20,030 --> 00:03:22,910
So that's why we index token by one and then a colon.

44
00:03:25,140 --> 00:03:30,540
Inside this loop, we call the function update_transition, passing in ch0 and ch1.

45
00:03:31,140 --> 00:03:35,610
We also have to remember to update ch0 to be the current character ch1.
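
As a rough sketch of that counting loop: the versions of update_pi and update_transition below are simple stand-ins I am assuming for illustration, not the functions from the previous lecture, and starting M at zero is also an assumption. The variable names ch0 and ch1 and the helper names idx and count_token are likewise assumed.

```python
import numpy as np

pi = np.zeros(26)        # counts of each letter appearing first in a word
M = np.zeros((26, 26))   # counts of transitions from one letter to the next


def idx(ch):
    """Map 'a'..'z' to 0..25."""
    return ord(ch) - ord('a')


def update_pi(ch):
    pi[idx(ch)] += 1


def update_transition(ch0, ch1):
    M[idx(ch0), idx(ch1)] += 1


def count_token(token):
    ch0 = token[0]
    update_pi(ch0)                   # first letter goes into pi
    for ch1 in token[1:]:            # every other letter, from the second onward
        update_transition(ch0, ch1)
        ch0 = ch1                    # the current letter becomes the previous one


for token in ['call', 'me', 'ishmael']:
    count_token(token)
```

A word with n letters contributes one count to pi and n minus one counts to M, so these three tokens add 3 to pi and 10 to M in total.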

46
00:03:38,930 --> 00:03:43,160
Finally, when we're finished looping through each line of the book, there is one more thing we need

47
00:03:43,160 --> 00:03:43,610
to do.

48
00:03:44,360 --> 00:03:47,570
Remember that pi and M currently only contain counts.

49
00:03:47,930 --> 00:03:49,970
They are not yet true probabilities.

50
00:03:50,570 --> 00:03:56,660
In order to make them into true probabilities, pi must sum to one and each row of M must sum to one.

51
00:03:58,220 --> 00:04:04,310
We can accomplish that simply by dividing pi by its sum and dividing each row of M by its row sum.

52
00:04:04,760 --> 00:04:06,380
So that's what these two lines do.

53
00:04:11,880 --> 00:04:17,279
If you want, I would encourage you to do a simpler example to prove to yourself that this code actually

54
00:04:17,279 --> 00:04:18,660
does what we think it does.
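
One such simpler example might look like the sketch below. It uses a made-up 3-letter alphabet with made-up counts, purely to check the arithmetic; the real pi and M would have 26 entries and 26 rows.

```python
import numpy as np

# Made-up counts for a 3-letter alphabet (illustration only)
pi = np.array([2.0, 1.0, 1.0])
M = np.array([
    [1.0, 1.0, 2.0],
    [3.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
])

pi /= pi.sum()                        # pi now sums to one
M /= M.sum(axis=1, keepdims=True)     # each row of M now sums to one

print(pi)             # [0.5  0.25 0.25]
print(M.sum(axis=1))  # every row sum should be 1 (up to floating-point rounding)
```

The keepdims=True argument keeps the row sums as a column, so the division broadcasts across each row rather than down each column. Note that a row made up entirely of zero counts would cause a division by zero here.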

55
00:04:19,750 --> 00:04:25,720
Alternatively, another option you could take here is to compute the log right now. If you recall, our

56
00:04:25,720 --> 00:04:30,730
other functions, which calculate the log-likelihoods, are currently responsible for taking the log.

57
00:04:31,180 --> 00:04:32,470
However, that's inefficient.

58
00:04:32,980 --> 00:04:36,050
There's no need to take the log of the same elements again and again.

59
00:04:36,670 --> 00:04:40,690
Instead, you could just store the log probabilities directly in variables.

60
00:04:42,060 --> 00:04:47,730
So you might want to call them, say, log_pi and log_M and then use those directly in the other functions.
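
A minimal sketch of that idea, again on a made-up 3-letter alphabet. The function word_log_prob is a hypothetical stand-in for the log-likelihood functions from the previous lecture, and it takes letter indices rather than characters to keep the example short.

```python
import numpy as np

# Made-up probabilities (illustration only); no zeros, so every log is finite
pi = np.array([0.5, 0.25, 0.25])
M = np.array([
    [0.25, 0.25, 0.50],
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
])

# Take the logs once, up front
log_pi = np.log(pi)
log_M = np.log(M)


def word_log_prob(indices):
    """Log-probability of a word given as a list of letter indices."""
    logp = log_pi[indices[0]]
    for i, j in zip(indices[:-1], indices[1:]):
        logp += log_M[i, j]          # just a lookup, no log call here
    return logp


print(word_log_prob([0, 2, 1]))
```

This should give the same number as taking the log of 0.5 times 0.5 times 0.5, since the word starts at letter 0, moves to letter 2, and then to letter 1.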

