1
00:00:02,230 --> 00:00:06,939
Hey, everyone, and welcome back to this class natural language processing in Python.

2
00:00:11,590 --> 00:00:16,360
In this lecture, we are going to outline all of the pieces of the code that we need to complete in

3
00:00:16,360 --> 00:00:19,350
order to have a fully functioning message decryption script.

4
00:00:20,080 --> 00:00:25,690
As usual, you're strongly encouraged to code by yourself first, or at least try to write as much of

5
00:00:25,690 --> 00:00:27,610
it as possible by yourself.

6
00:00:28,420 --> 00:00:33,070
So in this lecture, we're going to list out all of the pieces and also give you a general overview

7
00:00:33,070 --> 00:00:34,360
of the structure of the code.

8
00:00:34,990 --> 00:00:39,580
Then you can pick and choose which parts of the code you think you'll be able to write on your own and

9
00:00:39,590 --> 00:00:41,050
in which ones you might need help with.

10
00:00:41,860 --> 00:00:46,720
If you notice that this is just way over your head and you have no idea how you will write any of this

11
00:00:46,720 --> 00:00:47,170
code.

12
00:00:47,560 --> 00:00:50,770
Let me remind you that this section of the course is considered optional.

13
00:00:51,190 --> 00:00:53,560
It should be very interesting to those who are interested.

14
00:00:53,830 --> 00:00:57,640
But if the concepts are too overwhelming for you, please feel free to move on.

15
00:01:02,740 --> 00:01:08,230
Let's start with the data that we're going to use, thanks to Project Gutenberg, we have access to

16
00:01:08,230 --> 00:01:13,300
many classic books, which you can download in plain text, which require very little pre-processing.

17
00:01:13,930 --> 00:01:17,650
I provided you with a starter file called Cipher a placeholder dot pi.

18
00:01:18,130 --> 00:01:19,480
This gives you two things.

19
00:01:19,960 --> 00:01:25,120
First, it gives you a link to download Moby Dick in plain text, which is the book I use to build a

20
00:01:25,120 --> 00:01:26,590
character level language model.

21
00:01:27,820 --> 00:01:31,420
If you recall, that's just the unit graham and bigram possibilities.

22
00:01:32,230 --> 00:01:37,630
Second, I provide you with another snippet of text, this time from a different book The Adventures

23
00:01:37,630 --> 00:01:38,800
of Sherlock Holmes.

24
00:01:39,400 --> 00:01:44,740
This snippet of text will serve as the message that we want to encrypt and then decrypt using our algorithm.

25
00:01:45,490 --> 00:01:48,280
Of course, if you would like to choose a different message to send.

26
00:01:48,310 --> 00:01:49,240
That's OK, too.

27
00:01:49,840 --> 00:01:54,130
You can even choose your own text corpus to train your language model on if you want.

28
00:01:54,940 --> 00:01:59,770
If you plan on trying to write the code by yourself, then you should only look at this file and not

29
00:01:59,770 --> 00:02:01,360
at any other Python files.

30
00:02:06,510 --> 00:02:09,810
Let's now look at a high level outline of what we want our script to do.

31
00:02:10,560 --> 00:02:14,430
The first major task will be to create a random substitution cipher.

32
00:02:14,940 --> 00:02:19,950
Recall that this is just a dictionary that maps the unencrypted letter to the encrypted letter.

33
00:02:20,640 --> 00:02:24,630
So this serves as our ground truth, and this is what our algorithm is trying to find.

34
00:02:25,800 --> 00:02:31,080
The second major task will be to read in the Moby Dick text and create a character level language model.

35
00:02:31,710 --> 00:02:34,110
These are the unique graham and bigram probabilities.

36
00:02:34,590 --> 00:02:40,020
Remember that if we have letters in our alphabet, then we'll have you anagram probabilities and a v

37
00:02:40,020 --> 00:02:41,820
squared bigram probabilities.

38
00:02:43,230 --> 00:02:48,150
It would also be useful to have a function that will take in a message represented by a string or a

39
00:02:48,150 --> 00:02:50,880
list of strings and return its log likelihood.

40
00:02:51,480 --> 00:02:55,740
This is the function that will need to evaluate the fitness of a decrypted message.

41
00:02:56,610 --> 00:03:00,360
The next major task will be to create encoding and decoding functions.

42
00:03:00,870 --> 00:03:06,450
We'll use the encoding function to encrypt our message using the true substitution cipher, which we

43
00:03:06,450 --> 00:03:09,210
will assume is unknown to the decryption algorithm.

44
00:03:09,870 --> 00:03:16,020
We'll use the decoding function to decrypt an encrypted message using a guess for the substitution cipher.

45
00:03:17,500 --> 00:03:19,960
So that's the main difference between these two functions.

46
00:03:20,380 --> 00:03:25,990
The encoding function uses the true substitution cipher and the decoding function uses our gas.

47
00:03:26,770 --> 00:03:29,020
Finally, we'll need to run our genetic algorithm.

48
00:03:29,680 --> 00:03:35,170
This will involve initializing the first generation of the population and then going into a loop where

49
00:03:35,170 --> 00:03:40,780
we continually make new offspring, evaluate each member of the population and keep only the fittest

50
00:03:40,780 --> 00:03:41,680
individuals.

51
00:03:42,130 --> 00:03:47,470
We repeat that process until we've reached a maximum number of iterations, at which point we've hopefully

52
00:03:47,470 --> 00:03:49,360
decrypted the hidden message.

53
00:03:51,040 --> 00:03:55,450
Afterwards, we would, of course, like to print out the results and compare it to the true message.

54
00:04:00,660 --> 00:04:04,410
To summarize all the steps we need to do, let's reiterate them one more time.

55
00:04:05,280 --> 00:04:07,650
First, we need to create the substitution cipher.

56
00:04:08,370 --> 00:04:11,970
Next, we need to create our language model using the book Moby Dick.

57
00:04:13,040 --> 00:04:17,240
This should include a function that can evaluate the log likelihood of a given message.

58
00:04:18,110 --> 00:04:21,890
Next, we're going to encode our message using the true substitution cipher.

59
00:04:22,430 --> 00:04:25,400
We'll also need a decoding function that we make use of later.

60
00:04:26,330 --> 00:04:32,150
Next, we run through the genetic algorithm described previously in order to find the substitution cipher

61
00:04:32,420 --> 00:04:35,090
that yields a message with the maximum log likelihood.

62
00:04:35,720 --> 00:04:38,480
If you decide to code this yourself, which I hope you do.

63
00:04:38,750 --> 00:04:41,000
Then good luck and I will see you in the next lecture.