1
00:00:11,120 --> 00:00:16,940
So in this lecture, we're going to discuss the problem of spam detection, as with the other sections

2
00:00:16,940 --> 00:00:19,580
of this course, which are focused on applications.

3
00:00:19,940 --> 00:00:22,730
We will split this up into two distinct lectures.

4
00:00:23,360 --> 00:00:28,760
This lecture will focus on describing the problem without discussing any solution.

5
00:00:28,760 --> 00:00:34,100
That is, our only goal in this lecture is to understand what spam detection is and why you would want to do it.

6
00:00:34,700 --> 00:00:40,010
The next lecture will focus on one specific solution, and after that we will look at how to implement

7
00:00:40,010 --> 00:00:41,240
that solution in Python.

8
00:00:45,890 --> 00:00:51,860
Now, one surprising fact I learned in V1 of this course is that not everybody knows what spam

9
00:00:51,860 --> 00:00:52,820
detection is.

10
00:00:53,270 --> 00:00:58,190
I personally found this very strange, but I guess my audience is more diverse than I assumed.

11
00:00:58,790 --> 00:01:01,490
So let me turn this over to you, the students.

12
00:01:01,970 --> 00:01:08,030
If you have never heard of spam detection before, please let me know either using the Q&A or emailing

13
00:01:08,030 --> 00:01:09,560
me directly from my website.

14
00:01:09,890 --> 00:01:10,770
lazyprogrammer.me.

15
00:01:11,040 --> 00:01:16,280
If, after watching this lecture, you still do not know what spam detection is,

16
00:01:16,580 --> 00:01:20,660
then I invite you to tell me which parts you were having the most trouble understanding.

17
00:01:21,770 --> 00:01:25,310
That being said, let's move on to discussing spam detection.

18
00:01:29,940 --> 00:01:31,830
OK, so what is spam detection?

19
00:01:32,670 --> 00:01:38,430
Well, in order to best understand spam detection, you must have some experience using email, text

20
00:01:38,430 --> 00:01:41,040
message or some other kind of messaging service.

21
00:01:41,610 --> 00:01:44,490
So this lecture will assume you have such experience.

22
00:01:45,090 --> 00:01:50,040
If you do not, then it would be best to perhaps just spend some more time around your computer and

23
00:01:50,040 --> 00:01:52,020
experience these services for yourself.

24
00:01:53,310 --> 00:01:58,560
OK, so what happens when you have some experience with email, text and other messengers?

25
00:01:59,220 --> 00:02:02,610
Well, you learn that not every message you get is legitimate.

26
00:02:03,150 --> 00:02:09,030
A majority of the messages you receive should be from friends, family, colleagues and other people

27
00:02:09,030 --> 00:02:10,500
you expect to hear from.

28
00:02:11,160 --> 00:02:14,520
But sometimes you will get strange messages from unknown people.

29
00:02:15,000 --> 00:02:17,040
Perhaps they are trying to sell you something.

30
00:02:17,550 --> 00:02:22,020
Perhaps they are trying to install malware on your machine by getting you to click a mysterious

31
00:02:22,020 --> 00:02:22,500
link.

32
00:02:23,100 --> 00:02:27,900
Perhaps they are trying to steal your credentials by pretending they are your bank or some other service

33
00:02:27,910 --> 00:02:29,670
you frequently use, such as Facebook.

34
00:02:34,470 --> 00:02:40,410
One classic scam is called the Nigerian Prince Scam, where the sender pretends to be a Nigerian prince.

35
00:02:40,890 --> 00:02:45,210
They tell you they have a large sum of money, which they need to move urgently out of their country,

36
00:02:45,450 --> 00:02:47,610
and for some reason, they need your help.

37
00:02:48,240 --> 00:02:53,040
They might ask for your bank details so that they can deposit their money into your bank account.

38
00:02:53,550 --> 00:02:57,840
Or they might ask you for a small advance payment. In exchange for your help,

39
00:02:57,870 --> 00:03:00,540
they will offer to let you keep a small part of their fortune.

40
00:03:01,290 --> 00:03:06,690
Of course, there is no money to be made because none exists since, as mentioned, it is a scam.

41
00:03:07,620 --> 00:03:10,260
OK, so I hope the premise of spam is clear.

42
00:03:10,740 --> 00:03:14,070
These are messages that you receive, but you do not want.

43
00:03:14,580 --> 00:03:18,570
Typically, they are from people who are trying to scam you or get something from you.

44
00:03:23,260 --> 00:03:28,780
So now that we know what spam is, I hope it's clear why we would want to detect spam. Although what

45
00:03:28,780 --> 00:03:30,800
constitutes spam seems obvious,

46
00:03:30,820 --> 00:03:36,670
remember that the reason that spammers and scammers even bother to send these messages is because some

47
00:03:36,670 --> 00:03:38,200
people do fall for it.

48
00:03:38,800 --> 00:03:45,040
Because of that, companies who run email services and email clients try to filter spam from your inbox.

49
00:03:45,760 --> 00:03:51,090
Of course, another obvious reason is that we simply do not want to see spam. Even though we can detect

50
00:03:51,100 --> 00:03:51,940
it ourselves,

51
00:03:52,420 --> 00:03:54,970
we would rather not be bothered by it in the first place.

52
00:03:56,200 --> 00:04:01,900
And just as a side note, this is an excellent example of how machine learning is used for automation.

53
00:04:02,440 --> 00:04:07,420
Sure, you could delete all your spam messages by yourself, but imagine how much time that would take.

54
00:04:07,840 --> 00:04:12,850
It would be better if a machine did that for us, and today that largely is the case.

55
00:04:17,430 --> 00:04:22,410
So at this point, I want to point out another common beginner mistake, which I saw in V1 of

56
00:04:22,410 --> 00:04:23,160
this course.

57
00:04:24,270 --> 00:04:31,410
So some students, for some reason, were confused about where spam detection fits into an email application.

58
00:04:32,250 --> 00:04:37,860
Remember that the goal of this section is not to build a whole email service like Gmail or a whole email

59
00:04:37,860 --> 00:04:39,120
client like Thunderbird.

60
00:04:39,660 --> 00:04:41,820
That would be a pretty monumental task.

61
00:04:42,180 --> 00:04:47,760
You would need a whole team of people and probably months or years of time to do that with any success.

62
00:04:48,450 --> 00:04:53,910
Building an email service requires lots of other work, which is not relevant to us in a machine learning

63
00:04:53,910 --> 00:04:58,800
course, such as building the user interface, designing the database and so forth.

64
00:05:00,180 --> 00:05:06,120
Basically, what you have to be comfortable with is building only a small part of a big computer program

65
00:05:06,510 --> 00:05:11,400
and knowing where it fits into that program without having to build the whole thing yourself.

66
00:05:12,090 --> 00:05:16,170
Of course, we do that every day in the real world if you work as a software engineer.

67
00:05:16,500 --> 00:05:19,410
But I realize some of you may not have that experience.

68
00:05:20,010 --> 00:05:25,620
So as always, if there is something you don't understand here, please use the Q&A to get that sorted.

69
00:05:30,180 --> 00:05:34,950
So for this class, what you can do is pretend you are a software engineer at Gmail.

70
00:05:35,460 --> 00:05:37,200
Your job is to write a function.

71
00:05:37,920 --> 00:05:39,660
Remember that this is just a function.

72
00:05:39,670 --> 00:05:41,910
You don't have to write all of Gmail yourself.

73
00:05:42,030 --> 00:05:45,210
There are other people at Google working on Gmail with you.

74
00:05:46,110 --> 00:05:48,480
The function has a very simple interface.

75
00:05:49,020 --> 00:05:51,240
The function is called detect spam.

76
00:05:51,870 --> 00:05:58,170
The input into this function is a document which represents the text of an email, SMS message or any

77
00:05:58,170 --> 00:05:59,490
other kind of message.

78
00:06:00,240 --> 00:06:03,180
The output from this function is just a binary value.

79
00:06:03,930 --> 00:06:08,460
Suppose we return one if the input was spam and zero if it was not spam.

80
00:06:09,450 --> 00:06:15,180
In the rest of this section, we will look at how to write such a function. And remember, every other

81
00:06:15,180 --> 00:06:20,760
part of the Gmail program, like the user interface, the web server and so forth, will be written by

82
00:06:20,760 --> 00:06:21,750
your team members.

83
00:06:22,200 --> 00:06:27,330
So your job is only to write this one small function, which is part of a bigger code base.
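
The interface described above might look roughly like the following sketch. The name `detect_spam`, the type hints, the keyword list, and the `route_message` caller are all assumptions made for illustration. The keyword check is only a placeholder so the example runs; it is not the solution covered in the next lecture.

```python
def detect_spam(document: str) -> int:
    """Return 1 if the message text looks like spam, 0 if it does not.

    `document` is the raw text of an email, SMS, or any other message.
    """
    # Placeholder rule (an assumption for this sketch only): flag a message
    # if it contains any of a few hand-picked phrases. A real detector would
    # replace this body while keeping the same input and output.
    suspicious_phrases = ("nigerian prince", "advance payment", "bank details")
    text = document.lower()
    return int(any(phrase in text for phrase in suspicious_phrases))


def route_message(document: str) -> str:
    """Hypothetical caller written by a teammate: decide which folder to use."""
    return "spam" if detect_spam(document) == 1 else "inbox"


if __name__ == "__main__":
    print(route_message("I am a Nigerian prince and I need your bank details."))
    print(route_message("Are we still on for lunch tomorrow?"))
```

The point of the sketch is the boundary: the rest of the code base only needs to know that text goes in and a 0 or 1 comes out, so the body of the function can change without anyone else's code changing.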

