1
00:00:02,180 --> 00:00:06,800
Hey, everyone, and welcome back to this class natural language processing in Python.

2
00:00:11,600 --> 00:00:16,760
In this lecture, we will continue looking at our code to implement a cipher, a decryption algorithm.

3
00:00:17,510 --> 00:00:23,210
This particular lecture will focus on implementing the evolutionary algorithm we discussed previously.

4
00:00:29,930 --> 00:00:35,060
To initialize the algorithm, we must create a DNA pool containing 20 DNA strings.

5
00:00:35,540 --> 00:00:40,370
So first, we initialize a variable called DNA pool and make that an empty list.

6
00:00:41,000 --> 00:00:45,580
Then we do a loop that iterates at 20 times inside the loop.

7
00:00:45,590 --> 00:00:50,090
We create a DNA string, which is just a list containing the lowercase alphabet.

8
00:00:52,520 --> 00:00:57,860
Then we shuffle the DNA string to get a random mapping, and then we append this to our DNA pool.

9
00:00:58,340 --> 00:01:01,940
So by the end of this loop will have 20 random DNA strings.

10
00:01:09,800 --> 00:01:12,200
Next, we have a function to evolve offspring.

11
00:01:12,860 --> 00:01:19,010
This takes in two arguments an existing DNA pool and the number of children that each individual will

12
00:01:19,010 --> 00:01:24,290
create inside the function will initialize the offspring list to be an empty list.

13
00:01:25,970 --> 00:01:32,060
Then we'll do a nested loop on the outer loop we lived through each parent in the DNA pool, on the

14
00:01:32,060 --> 00:01:32,610
inner loop.

15
00:01:32,630 --> 00:01:37,740
We'll iterate through the number of children that we need to create in order to create the child.

16
00:01:37,760 --> 00:01:39,770
Well, we want to do is a random swap.

17
00:01:40,250 --> 00:01:44,180
So first, we need to make a copy of the parent so that we don't overwrite it.

18
00:01:45,050 --> 00:01:50,810
Next, we choose to random iness J and K, which will be the two positions that we want to swap.

19
00:01:51,500 --> 00:01:55,370
These are just two random numbers between zero and 25 inclusive.

20
00:01:56,330 --> 00:01:57,680
Next, we do this which.

21
00:01:59,270 --> 00:02:06,050
First of all, a signed copy at Position J to a variable called Temp Next will assign copy at Position

22
00:02:06,050 --> 00:02:14,900
K two Position J, then will assign temp to copy at Position K finally will append this mutated copy

23
00:02:15,140 --> 00:02:16,580
into our list of offspring.

24
00:02:17,270 --> 00:02:22,280
When we've completed the loop, we combine the offspring list with the parent list and then return that

25
00:02:22,280 --> 00:02:23,750
as the final population.

26
00:02:30,800 --> 00:02:34,250
Next, we have our main loop that actually runs the genetic algorithm.

27
00:02:35,660 --> 00:02:41,210
There are some initialization variables before we enter the loop, so let's look at these numbers will

28
00:02:41,210 --> 00:02:43,850
be the number of iterations of the loop to do so.

29
00:02:43,850 --> 00:02:45,320
That's currently 1000.

30
00:02:46,400 --> 00:02:49,610
That's the number of generations of offspring that we want to evolve.

31
00:02:50,510 --> 00:02:55,820
Next, we initialize an array called scores, which will store the average score at each iteration of

32
00:02:55,820 --> 00:02:56,210
the loop.

33
00:02:56,930 --> 00:03:00,680
Alternatively, you could store the maximum score, but I didn't decide to do that.

34
00:03:02,210 --> 00:03:08,060
Next, we want to keep track of some things the best DNA, which corresponds to the best map, which

35
00:03:08,060 --> 00:03:09,530
corresponds to the best score.

36
00:03:10,160 --> 00:03:14,570
Best score is initialized to minus infinity, which is the minimum possible score.

37
00:03:17,770 --> 00:03:22,570
Next, we enter the main loop, which runs nanometers times inside the loop.

38
00:03:22,600 --> 00:03:25,090
We first check if it's the first iteration or not.

39
00:03:25,810 --> 00:03:31,150
If it is, then I will be zero and there's no need to create new offspring yet because we haven't yet

40
00:03:31,150 --> 00:03:32,080
evaluated them.

41
00:03:32,920 --> 00:03:37,900
If it's not the first iteration, then we need to create new offspring by calling the function we defined

42
00:03:37,900 --> 00:03:38,380
earlier.

43
00:03:38,590 --> 00:03:39,670
Evolve offspring.

44
00:03:44,840 --> 00:03:49,520
Next, we want to calculate the score for each DNA string in our DNA pool.

45
00:03:50,120 --> 00:03:54,590
So we start by initializing a dictionary called DNA to score.

46
00:03:55,430 --> 00:03:59,900
The key to this dictionary will be the DNA string and the value will be at score.

47
00:04:00,740 --> 00:04:03,770
Then we lived through each DNA string in the DNA pool.

48
00:04:04,490 --> 00:04:10,130
Inside the loop, we convert our DNA string into a mapping that we can use to decode the ciphertext.

49
00:04:12,000 --> 00:04:17,070
So this just loops through letters one and the DNA string in corresponding order, building a map the

50
00:04:17,070 --> 00:04:19,230
same way we did at the beginning of this script.

51
00:04:19,769 --> 00:04:23,310
The only difference now is that this is supposed to be a reverse mapping.

52
00:04:23,860 --> 00:04:26,130
In fact, it doesn't really matter what the keys are.

53
00:04:26,340 --> 00:04:27,930
As long as they stay consistent.

54
00:04:30,600 --> 00:04:35,520
Next, we call the function and decode message to decode the ciphertext using the current map.

55
00:04:36,210 --> 00:04:42,450
Then we pass this decoded message into the function and get sequence probe, which returns the log likelihood

56
00:04:42,660 --> 00:04:44,130
of this decoded message.

57
00:04:45,890 --> 00:04:49,730
Next, we store the score in our dictionary DNA to score.

58
00:04:50,570 --> 00:04:56,570
Recall that DNA is currently a list of characters, but lists are not allowed as dictionary keys.

59
00:04:57,680 --> 00:05:01,310
Therefore, we have to convert it to a string by calling the joint function.

60
00:05:08,090 --> 00:05:12,290
Next, we check of the score we just found is better than our best score so far.

61
00:05:12,890 --> 00:05:16,870
If it is, then we assign all the current values to be the best values.

62
00:05:18,500 --> 00:05:24,170
So best DNA becomes current DNA, best map becomes current map, and best score becomes the current

63
00:05:24,170 --> 00:05:24,620
score.

64
00:05:25,460 --> 00:05:30,680
When we're outside the loop, we can calculate the average score of the population and store that in

65
00:05:30,680 --> 00:05:31,970
our array of scores.

66
00:05:32,990 --> 00:05:36,050
We would like to plot this later to see how the algorithm progressed.

67
00:05:41,330 --> 00:05:47,120
Next, remember our principal survival of the fittest in this section of the code, we would like to

68
00:05:47,120 --> 00:05:49,580
sort our current population by their score.

69
00:05:50,180 --> 00:05:52,940
We only want to keep the most fit individuals.

70
00:05:53,420 --> 00:05:55,910
In this case, I've chosen to keep the top five.

71
00:05:56,630 --> 00:06:02,060
We can do this by using the sorted function in the sorted function allows you to sort a dictionary by

72
00:06:02,060 --> 00:06:02,810
its values.

73
00:06:03,320 --> 00:06:10,160
We pass in the key argument as X1, meaning that we care about the value and not the key of the dictionary

74
00:06:10,160 --> 00:06:11,240
as the sorting key.

75
00:06:12,800 --> 00:06:17,450
So the word key here is kind of an overloaded term because it really means two different things.

76
00:06:18,480 --> 00:06:23,670
So whether you mean the sort key for the sorting algorithm or the key of the dictionary, so pay attention

77
00:06:23,670 --> 00:06:24,510
to context there.

78
00:06:26,790 --> 00:06:32,010
We also pass in reverse equals true because we want to sort by score in descending order.

79
00:06:33,910 --> 00:06:39,700
What we get out of this is a list of two polls where each tuple contains a DNA string and its corresponding

80
00:06:39,700 --> 00:06:44,920
score, and they are all sorted then to find the parents of the next generation.

81
00:06:45,310 --> 00:06:50,740
We take the first five elements of this sorted list and keep only the keys which correspond to the DNA

82
00:06:50,740 --> 00:06:51,400
strings.

83
00:06:53,330 --> 00:06:59,090
Lastly, we have some debug stuff, so every few steps we print out the iteration number, the current

84
00:06:59,090 --> 00:07:01,070
score and the best score so far.

85
00:07:03,300 --> 00:07:06,990
And so this is a printout of the last time I ran this algorithm.