1
00:00:02,190 --> 00:00:06,810
Everyone and welcome back to this class natural language processing in Python.

2
00:00:11,560 --> 00:00:15,160
In this lecture, we are going to summarize everything we learn in this section.

3
00:00:15,820 --> 00:00:19,240
This section was really a combination of three interesting topics.

4
00:00:19,660 --> 00:00:22,180
Number one, ciphers and coded messages.

5
00:00:22,510 --> 00:00:25,330
Number two, probabilistic language models.

6
00:00:25,330 --> 00:00:27,790
And number three genetic algorithms.

7
00:00:28,210 --> 00:00:32,650
When we put these three things together, we get a working solution for cypher decryption.

8
00:00:37,660 --> 00:00:41,080
We started off this section by describing the substitution cipher.

9
00:00:41,590 --> 00:00:46,480
This means that we encode a message by substituting one letter with some other letter.

10
00:00:47,050 --> 00:00:51,040
The idea is we have a mapping, one unique letter to one unique letter.

11
00:00:51,610 --> 00:00:55,240
The uniqueness is, of course, necessary in order to reverse the mapping.

12
00:00:55,900 --> 00:01:01,510
We did a simple example of how we might encode and decode a message using a substitution cipher.

13
00:01:01,810 --> 00:01:05,620
So hopefully that gave you an idea of how you might implement this in code.

14
00:01:10,830 --> 00:01:14,100
Next, we looked at how to build a character level language model.

15
00:01:14,610 --> 00:01:19,530
The key idea behind building this model is the Markov assumption, where we assume that the current

16
00:01:19,530 --> 00:01:24,330
letter depends only on the previous letter and none of the other letters in the word.

17
00:01:24,930 --> 00:01:29,010
This is, of course, a faulty assumption, but as you saw, it works decently well.

18
00:01:29,580 --> 00:01:34,140
It's certainly possible to extend this model by making less restrictive assumptions.

19
00:01:34,800 --> 00:01:39,780
The way we estimate the probabilities in the language model starts from pretty much the simplest idea

20
00:01:39,780 --> 00:01:41,280
possible counting.

21
00:01:41,850 --> 00:01:46,860
I assume everybody knows how to count, and therefore we should have no trouble building these probabilities.

22
00:01:47,310 --> 00:01:48,870
This is just like a coin toss.

23
00:01:49,200 --> 00:01:54,900
The probability of heads would be estimated as the number of heads divided by the total number of coin

24
00:01:54,900 --> 00:01:55,530
tosses.

25
00:01:57,000 --> 00:02:01,050
Next, we decided this model would have to be extended by using add one smoothing.

26
00:02:01,770 --> 00:02:06,630
This is because we would like to give rare transitions that may not have occurred in our training set

27
00:02:06,900 --> 00:02:08,520
non-zero probabilities.

28
00:02:08,910 --> 00:02:14,370
If we don't and we find such a rare transition in the test set, then the entire likelihood would be

29
00:02:14,370 --> 00:02:14,940
zero.

30
00:02:15,150 --> 00:02:17,700
Since anything multiplied by zero is zero.

31
00:02:18,180 --> 00:02:22,860
And of course, we can't solve the problem if the correct answer has a probability of zero.

32
00:02:24,000 --> 00:02:29,130
Another modification we made was to use the log likelihood rather than just the likelihood.

33
00:02:29,760 --> 00:02:34,530
This is because the likelihood ends up being the multiplication of many, many small terms.

34
00:02:35,010 --> 00:02:40,680
If we multiply many small numbers together, this approach is zero quite fast and due to the finite

35
00:02:40,680 --> 00:02:42,720
precision of numbers stored in a computer.

36
00:02:43,020 --> 00:02:45,360
This will eventually just round down to zero.

37
00:02:46,020 --> 00:02:48,270
So again, we have this problem of zeros.

38
00:02:48,630 --> 00:02:50,760
We can't compare one likelihoods to another.

39
00:02:50,940 --> 00:02:57,270
If they both return zero, taking the log solves this problem since the log is a monotonically increasing

40
00:02:57,270 --> 00:03:01,740
function and therefore preserves the relative fitness of each parameter setting.

41
00:03:06,810 --> 00:03:12,690
Finally, we talked about genetic algorithms, genetic algorithms are useful when we can't differentiate

42
00:03:12,690 --> 00:03:18,600
the objective with respect to the model parameters, the thing we want to optimize in our case, the

43
00:03:18,600 --> 00:03:23,340
thing we want to optimize is more like a discrete string rather than a continuous number.

44
00:03:23,940 --> 00:03:27,120
Therefore, genetic algorithms are well-suited to this use case.

45
00:03:28,110 --> 00:03:34,710
The intuition behind genetic algorithms follows biological evolution, and each generation the genes

46
00:03:34,710 --> 00:03:37,140
mutate and these can be better or worse.

47
00:03:37,650 --> 00:03:43,230
Nature itself picks which genes are better because the better genes tend to propagate more, and the

48
00:03:43,230 --> 00:03:44,520
worst genes tend to not.

49
00:03:45,120 --> 00:03:46,950
We call this natural selection.

50
00:03:48,180 --> 00:03:52,710
In our case, the selection mechanism is the log likelihood of the decrypted message.

51
00:03:53,190 --> 00:03:57,420
We use this as our measure of fitness for each of the DNA strings that we try.

52
00:03:58,230 --> 00:04:04,500
Eventually, after many generations of DNA and mutations and offspring, we arrive at the maximum likelihood

53
00:04:04,500 --> 00:04:05,100
solution.

54
00:04:10,130 --> 00:04:15,950
One interesting result that we saw is that the DNA that gives us the true maximum likelihood can still

55
00:04:15,950 --> 00:04:17,720
be a slightly wrong answer.

56
00:04:18,320 --> 00:04:21,260
We saw that the real answer can give us a smaller likelihood.

57
00:04:21,829 --> 00:04:24,920
In this case, the genetic algorithm is behaving as it should.

58
00:04:25,220 --> 00:04:27,230
Finding the maximum likelihood solution.

59
00:04:27,680 --> 00:04:30,260
So the genetic algorithm is working as intended.

60
00:04:30,800 --> 00:04:32,540
In fact, it's the model that's wrong.

61
00:04:33,110 --> 00:04:38,960
Specifically, using Vikram's and making the mark of assumption is probably too restrictive, although

62
00:04:38,960 --> 00:04:42,470
it's worth noting that this type of model is very common in NLP.

63
00:04:42,800 --> 00:04:44,750
It's always a good first approach to try.

64
00:04:45,380 --> 00:04:47,030
So how can we improve this model?

65
00:04:52,120 --> 00:04:54,700
Here's some ideas for how you can extend this project.

66
00:04:55,300 --> 00:05:00,160
First, try using programs or even higher end grabs if you use try grams.

67
00:05:00,370 --> 00:05:04,960
You still need a unit gram and bigram probabilities, since you still need to model the first letter

68
00:05:04,960 --> 00:05:07,540
of each word and the first two letters of each word.

69
00:05:08,290 --> 00:05:12,970
You can extend what we learned previously to infer that the number of triggers and probabilities you'll

70
00:05:12,970 --> 00:05:15,040
have to estimate is VQ.

71
00:05:15,730 --> 00:05:18,160
And this is assuming V is the length of your alphabet.

72
00:05:20,040 --> 00:05:25,710
If you want to get really advanced, you could even replace the Markov model entirely with a deep learning

73
00:05:25,740 --> 00:05:26,940
type of model like an art.

74
00:05:28,140 --> 00:05:32,850
In general, I'm always going to try to give you a few ways to extend all of the projects we do in this

75
00:05:32,850 --> 00:05:33,420
course.

76
00:05:33,900 --> 00:05:37,560
The best way to make sure you've really understood what you've learned is to apply it.

77
00:05:38,310 --> 00:05:43,350
One way of doing that is to take the same problem, but try to modify and improve the approach.

78
00:05:43,920 --> 00:05:48,090
This will give you a better understanding of the code and of the concepts behind the code.

79
00:05:48,660 --> 00:05:52,170
So hopefully you can think of a few ways you might want to extend what we did.

80
00:05:52,680 --> 00:05:55,020
So give that a try, and I'll see you in the next lecture.

