1
00:00:11,310 --> 00:00:16,440
So in this lecture, we'll be introducing the next section of this course, which is on topic modeling.

2
00:00:17,220 --> 00:00:22,050
This lecture will outline what this section is about, and we'll also talk about why this subject is

3
00:00:22,050 --> 00:00:23,070
relevant and useful.

4
00:00:23,820 --> 00:00:25,650
So let's begin with a quick outline.

5
00:00:26,970 --> 00:00:32,729
This section will cover two popular approaches to topic modeling: latent Dirichlet allocation, also known

6
00:00:32,729 --> 00:00:35,970
as LDA, and non-negative matrix factorization,

7
00:00:36,060 --> 00:00:42,150
also known as NMF. LDA is a very interesting machine learning method because, unlike the other

8
00:00:42,150 --> 00:00:48,000
methods we've discussed, this is what we call a fully Bayesian model. Of all the algorithms we learn

9
00:00:48,000 --> 00:00:49,050
about in this course,

10
00:00:49,380 --> 00:00:54,360
LDA is the most complex and requires the most background knowledge to fully understand.

11
00:00:55,560 --> 00:00:59,190
Lucky for us, only intuition is required to understand the code.

12
00:00:59,550 --> 00:01:04,230
But if you're interested in a full course that dives deep into this topic, I'd be happy to hear your

13
00:01:04,230 --> 00:01:04,890
requests.

14
00:01:06,180 --> 00:01:11,220
The second approach we'll discuss is NMF, which originates from the field of recommender systems.

15
00:01:11,730 --> 00:01:16,290
It turns out that because of the way it works, it can also be applied to topic modeling.

16
00:01:16,980 --> 00:01:21,030
In fact, it's possible to sort of mix and match algorithms in applications.

17
00:01:21,510 --> 00:01:26,850
So, for example, although LDA was initially built for topic modeling, we can go in the other direction

18
00:01:26,850 --> 00:01:29,190
and apply this to recommender systems as well.

19
00:01:33,900 --> 00:01:36,470
So let's talk about why topic modeling is useful.

20
00:01:37,710 --> 00:01:43,500
One simple reason why it's useful in the context of this course is because it's an example of unsupervised

21
00:01:43,500 --> 00:01:48,600
learning. Tasks like spam detection and sentiment analysis require labeled data sets.

22
00:01:48,990 --> 00:01:51,180
But you'll see that topic modeling does not.

23
00:01:51,810 --> 00:01:56,550
Thus, this presents a different paradigm for machine learning that we haven't yet encountered in this

24
00:01:56,550 --> 00:02:00,810
course, and obviously not needing labels is pretty useful too.

25
00:02:02,360 --> 00:02:07,640
Another reason it's useful is because it's almost like a more powerful version of clustering. As you

26
00:02:07,640 --> 00:02:13,100
recall, clustering allows us to assign categories to our input documents, which is obviously useful

27
00:02:13,100 --> 00:02:18,320
if you wanted to do something like organize your documents without having to manually label them yourself.

28
00:02:19,640 --> 00:02:24,380
If you're a business that has to deal with many documents, this can save both time and money.

29
00:02:25,160 --> 00:02:30,110
But unlike the discrete categories you get with clustering, you'll see that topics can be more richly

30
00:02:30,110 --> 00:02:30,800
expressed.

31
00:02:32,090 --> 00:02:36,200
Another application of topic modeling is document retrieval and search engines.

32
00:02:36,860 --> 00:02:42,350
As you recall, one simple and straightforward method of doing this is to simply convert your documents

33
00:02:42,350 --> 00:02:47,600
into TF-IDF vectors and then do a nearest neighbor search to find the closest documents to your query.

34
00:02:48,500 --> 00:02:50,230
However, this can be problematic,

35
00:02:50,240 --> 00:02:53,510
for instance, if your TF-IDF vectors are very sparse.

36
00:02:54,170 --> 00:02:59,660
Topic modeling is a method of reducing a document into a small set of topics so that documents can be more

37
00:02:59,660 --> 00:03:01,670
easily and more accurately searched.