1
00:00:11,060 --> 00:00:16,730
So in this lecture, we'll be looking at the notebook to demonstrate NMF. Note that the format of

2
00:00:16,730 --> 00:00:19,220
this notebook will be very similar to LDA.

3
00:00:19,250 --> 00:00:25,580
So we'll go through most of it fairly quickly. As before, we'll begin by downloading the BBC News dataset.

4
00:00:34,530 --> 00:00:36,480
The next step is to do our imports.

5
00:00:37,080 --> 00:00:40,320
Note that this time we'll be using NMF with TF-IDF.

6
00:00:41,280 --> 00:00:44,160
So why is it OK to use TF-IDF with NMF?

7
00:00:44,910 --> 00:00:50,240
As you recall, the reason we use simple counts with LDA is that they correspond to the multinomial

8
00:00:50,580 --> 00:00:53,460
distribution, which is part of the LDA model.

9
00:00:59,090 --> 00:01:01,130
The next step is to download our stop words.

10
00:01:06,720 --> 00:01:09,930
The next step is to convert our list of stop words into a set.

11
00:01:14,780 --> 00:01:17,150
The next step is to add additional stop words.

12
00:01:22,450 --> 00:01:25,810
The next step is to load in our data using pd.read_csv.

13
00:01:30,340 --> 00:01:33,250
The next step is to remind ourselves what this data looks like.

14
00:01:40,030 --> 00:01:45,430
The next step is to create our TF-IDF vectorizer object, passing in the stop words we defined above.

15
00:01:50,260 --> 00:01:54,040
The next step is to transform our text into a TF-IDF matrix.

16
00:01:59,630 --> 00:02:06,320
The next step is to create our NMF object. Again, we'll choose 10 components. Note that for the loss,

17
00:02:06,320 --> 00:02:12,440
we use the KL divergence, which, when you do, corresponds to probabilistic latent semantic analysis.

18
00:02:13,220 --> 00:02:18,830
Another way of looking at this is that the paper on probabilistic latent semantic analysis really just describes

19
00:02:18,830 --> 00:02:22,820
non-negative matrix factorization with KL divergence as the loss.

20
00:02:23,840 --> 00:02:29,270
Note that when you use this loss, it's required that you specify the solver as MU, which stands for

21
00:02:29,270 --> 00:02:30,650
multiplicative update.

22
00:02:31,880 --> 00:02:36,740
Note that it's also possible to regularize the weights, but I've commented this out since it's not

23
00:02:36,740 --> 00:02:37,250
needed.

24
00:02:38,900 --> 00:02:43,580
Finally, note that I've set the random state to zero so that we obtain consistent results.

25
00:02:49,010 --> 00:02:50,570
The next step is to fit our model.

26
00:03:00,530 --> 00:03:03,710
The next step is to define the plot_top_words function once again.

27
00:03:09,460 --> 00:03:12,550
The next step is to call the plot_top_words function once again.

28
00:03:20,520 --> 00:03:24,120
OK, so let's have a look at these topics to make sure that they make sense.

29
00:03:25,350 --> 00:03:31,260
So the top words for the first topic are people, UK, mobile, US, music, and so forth.

30
00:03:31,740 --> 00:03:35,520
So probably mobile technology is a good way to summarize this topic.

31
00:03:36,780 --> 00:03:41,250
The top words for the second topic are Labour, election, Blair, party, and so forth.

32
00:03:41,760 --> 00:03:43,620
So clearly, this is about politics.

33
00:03:45,580 --> 00:03:51,130
For topic three, we have England, win, Wales, Ireland, injury, rugby, coach, and so forth.

34
00:03:51,670 --> 00:03:54,430
Clearly, this is about sports and specifically rugby.

35
00:03:56,160 --> 00:03:59,670
For topic four, we have film, best, awards, and so forth.

36
00:03:59,790 --> 00:04:05,760
So this is entertainment. For topic five, we have growth, economy, bank, and so forth.

37
00:04:06,000 --> 00:04:07,320
So this is economics.

38
00:04:07,950 --> 00:04:11,010
And again, check out all the other topics as an exercise.

39
00:04:16,880 --> 00:04:21,470
The next step is to transform our data to get back the documents by topics matrix.

40
00:04:28,250 --> 00:04:31,790
The next step is to plot the topics for a randomly chosen input sample.

41
00:04:38,770 --> 00:04:44,770
As you can see, the true label is sport, and the strongest topic is topic three. By the way,

42
00:04:44,800 --> 00:04:48,190
Note that unlike LDA, these values do not sum to one.

43
00:04:49,840 --> 00:04:53,680
So let's scroll back up and remind ourselves what topic three represents.

44
00:05:00,150 --> 00:05:03,930
OK, so topic three is related to sports and specifically rugby.

45
00:05:09,520 --> 00:05:13,000
So let's print out this article to make sure that our topic makes sense.

46
00:05:19,350 --> 00:05:22,350
So the article is "Charvis set to lose fitness bid".

47
00:05:23,010 --> 00:05:27,840
So these results make perfect sense, since Charvis is, in fact, a former rugby player.