1
00:00:11,070 --> 00:00:16,710
So in this lecture, we are going to discuss the topic of article spinning. This is meant to be a high

2
00:00:16,710 --> 00:00:20,880
level lecture to describe what this problem is and why you might care.

3
00:00:21,780 --> 00:00:26,910
Note that there are many possible approaches to this problem, and this section will discuss one.

4
00:00:27,570 --> 00:00:30,900
So I hope that helps to distinguish this lecture from the next.

5
00:00:31,410 --> 00:00:36,300
This lecture is meant to describe the problem, while the next lecture is meant to describe one possible

6
00:00:36,300 --> 00:00:36,990
solution.

7
00:00:41,600 --> 00:00:43,460
OK, so what is article spinning?

8
00:00:44,480 --> 00:00:47,390
So imagine you've just decided to start your own blog.

9
00:00:47,780 --> 00:00:50,810
It may be helpful if you've had experience with this in the past.

10
00:00:51,740 --> 00:00:57,710
One fact that becomes painfully apparent to all bloggers once they've started blogging is that just

11
00:00:57,710 --> 00:01:01,460
because you've written something does not mean other people will read it.

12
00:01:02,000 --> 00:01:05,360
And of course, you didn't just start your blog to sit there and collect dust.

13
00:01:05,870 --> 00:01:11,420
Perhaps your goal is to get lots of readers simply because it is satisfying to know that others have

14
00:01:11,420 --> 00:01:12,350
read what you've written.

15
00:01:13,100 --> 00:01:15,530
Perhaps your goal is to sell products on your blog.

16
00:01:16,160 --> 00:01:18,980
Perhaps your goal is to make money from ads on your blog.

17
00:01:19,550 --> 00:01:24,860
In all of these cases, the common theme is that you need people to actually visit your blog.

18
00:01:25,400 --> 00:01:26,800
So how might that be done?

19
00:01:31,480 --> 00:01:36,070
Well, in today's era, perhaps you might start by sharing your blog with your friends and family.

20
00:01:36,730 --> 00:01:39,550
That's OK, but there are two problems with this approach.

21
00:01:40,150 --> 00:01:42,250
Firstly is that it does not scale.

22
00:01:42,760 --> 00:01:47,110
You want thousands of visitors and you surely do not have thousands of friends and family.

23
00:01:47,980 --> 00:01:50,020
Secondly is that it's not targeted.

24
00:01:50,500 --> 00:01:55,390
If your blog is about machine learning and data science, it is unlikely that your friends and family

25
00:01:55,390 --> 00:01:58,600
really care, even if they may pretend to at first.

26
00:01:59,140 --> 00:02:05,290
In fact, one of the major ways you're going to get people to your blog is via search, and by search

27
00:02:05,290 --> 00:02:08,650
I mean Google, DuckDuckGo, Bing and so forth.

28
00:02:09,130 --> 00:02:12,640
I assume that because you're on this website, you know what these are.

29
00:02:17,390 --> 00:02:23,000
Now, again, just because you've created a website does not imply that it will get a good listing on

30
00:02:23,000 --> 00:02:24,110
these search engines.

31
00:02:24,650 --> 00:02:29,480
Search engines tend to rank websites higher if they are more popular and more authoritative.

32
00:02:30,290 --> 00:02:36,350
One way to accomplish this is to simply create lots of content. By creating more content about different

33
00:02:36,350 --> 00:02:36,950
topics,

34
00:02:37,220 --> 00:02:39,380
you have a better chance of attracting more people.

35
00:02:40,220 --> 00:02:42,380
So how do you create more content?

36
00:02:43,070 --> 00:02:46,130
Well, the simple answer is you just have to write it yourself.

37
00:02:46,850 --> 00:02:47,930
Now, take it from me:

38
00:02:47,930 --> 00:02:49,810
writing content is not easy.

39
00:02:49,850 --> 00:02:51,920
It takes lots of time and effort.

40
00:02:52,400 --> 00:02:55,310
But what if there was a way to write content automatically?

41
00:02:55,580 --> 00:02:58,790
That is, to have a machine generate content for you?

42
00:03:03,490 --> 00:03:06,280
So this is where article spinning enters the picture.

43
00:03:06,940 --> 00:03:11,830
Perhaps you may have thought that you could simply copy and paste articles from other websites onto

44
00:03:11,830 --> 00:03:12,610
your blog.

45
00:03:13,300 --> 00:03:18,490
Note that this does not work, since search engines are smart enough to know that your site has simply

46
00:03:18,490 --> 00:03:20,410
duplicated the content of another.

47
00:03:21,130 --> 00:03:22,750
In fact, this will lead to your site

48
00:03:22,750 --> 00:03:25,960
being penalized, and your search rankings will get worse.

49
00:03:26,440 --> 00:03:28,450
So this is not something you want to do.

50
00:03:30,030 --> 00:03:37,590
One option that was popular in the early 2000s and 2010s was to spin the content, that is, take an existing

51
00:03:37,590 --> 00:03:42,930
article and replace enough words so that it becomes sufficiently different from the original.

52
00:03:43,740 --> 00:03:48,330
The trick is to replace words in such a manner that the result still makes sense.

53
00:03:49,020 --> 00:03:53,100
Now, of course, you could just do this manually, but that still takes time and effort.

54
00:03:53,550 --> 00:03:59,430
Ideally, the whole process would be automatic, and perhaps one way to accomplish that would be through

55
00:03:59,430 --> 00:04:02,910
building a language model based on probabilistic patterns.

56
00:04:07,470 --> 00:04:12,750
Now, to give you some sense of the real world application of this, I have seen actual software that

57
00:04:12,750 --> 00:04:14,040
does this exact thing.

58
00:04:14,700 --> 00:04:19,800
I personally haven't used it, but I've watched others use it long ago with decent success.

59
00:04:21,130 --> 00:04:26,230
Note that this was before machine learning became popular, so the techniques being used weren't as

60
00:04:26,230 --> 00:04:28,630
advanced as deep learning and neural networks.

61
00:04:29,380 --> 00:04:34,330
Basically, what you would do is select a word you want to replace and the software would give you a

62
00:04:34,330 --> 00:04:36,490
dropdown list of suggestions.

63
00:04:36,910 --> 00:04:40,990
You could then click on one of those suggestions to replace the original word.

64
00:04:41,950 --> 00:04:47,320
So it wasn't completely automated, but in some sense it at least augmented the capabilities of the

65
00:04:47,320 --> 00:04:47,770
human.

66
00:04:49,400 --> 00:04:54,980
Now, the reason why this still requires a human in the loop is because, as mentioned, machine learning

67
00:04:54,980 --> 00:05:00,230
hadn't really taken off at this time, and the people who did understand the state of the art in machine

68
00:05:00,230 --> 00:05:00,650
learning

69
00:05:00,980 --> 00:05:06,560
weren't applying it to applications like these. So we will start by applying Markov models, which

70
00:05:06,560 --> 00:05:08,960
have been around for about 100 years.

71
00:05:09,320 --> 00:05:11,120
In other words, not so advanced.

72
00:05:15,710 --> 00:05:20,330
However, remember that the goal of this isn't to build a modern article spinning product.

73
00:05:20,750 --> 00:05:23,810
If I were to do that, I would charge you thousands of dollars.

74
00:05:24,350 --> 00:05:29,360
The goal of this is to learn about Markov models, which is the current section of this course.

75
00:05:31,430 --> 00:05:34,970
Many beginners mix this up, which is what I saw often in V1.

76
00:05:35,300 --> 00:05:38,450
And so it's important to go into this with the right mindset.

77
00:05:39,170 --> 00:05:45,020
Furthermore, note that in more advanced sections and courses, we will learn about Transformers, which

78
00:05:45,020 --> 00:05:47,180
will do a much better job at this task.

79
00:05:47,780 --> 00:05:52,940
So if your goal is to simply get the state of the art in one line of code, that would be a much better

80
00:05:52,940 --> 00:05:53,750
place to go.

81
00:05:54,890 --> 00:06:00,680
One fact which I find quite amusing is that when version one of this course was released, Transformers

82
00:06:00,680 --> 00:06:05,810
did not even exist, so we wouldn't even be having this discussion five years ago.

83
00:06:06,740 --> 00:06:12,140
What was funny was a lot of beginners came to that course thinking you could actually build a human

84
00:06:12,140 --> 00:06:16,310
level article spinner that would just magically write an article out of thin air.

85
00:06:17,000 --> 00:06:23,000
Of course, I had to tell them that no such magic actually existed, at which point they got very angry.

86
00:06:23,540 --> 00:06:26,180
Funny to me, but frustrating for them, I'm sure.

87
00:06:27,700 --> 00:06:32,830
In any case, nowadays, there are many ways to improve upon Markov models, including RNNs and

88
00:06:32,830 --> 00:06:39,230
Transformers, but these are far beyond the technical level at this point in the course. At this point,

89
00:06:39,250 --> 00:06:43,120
your goal should be exercising your understanding of Markov models.

90
00:06:43,840 --> 00:06:49,240
Now, if you are interested in these more advanced techniques, don't despair since they are actually

91
00:06:49,240 --> 00:06:50,770
based on what you learn here.

92
00:06:51,280 --> 00:06:55,750
So in fact, these are a prerequisite to understanding these new advancements.

93
00:07:00,370 --> 00:07:06,400
And as a final note on this topic, please be aware that article spinning is a black hat SEO technique.

94
00:07:07,000 --> 00:07:11,020
Generally speaking, these techniques are not ethical and, more practically,

95
00:07:11,290 --> 00:07:12,910
they never work for very long,

96
00:07:12,940 --> 00:07:13,960
if they do at all.

97
00:07:14,740 --> 00:07:20,410
Thus, if your goal is to actually build a good and sustainable website that people enjoy, this is

98
00:07:20,410 --> 00:07:21,610
likely not the way.

99
00:07:22,450 --> 00:07:26,560
Furthermore, you also have to remember that Google invented Transformers.

100
00:07:27,010 --> 00:07:32,230
So if you're going to try and use Transformers to fool Google, realize that they are way ahead of you.

101
00:07:32,770 --> 00:07:38,080
There is simply no way you can trick them using their own technology, which by the time you see it,

102
00:07:38,110 --> 00:07:39,550
is already old to them.

103
00:07:40,390 --> 00:07:46,180
OK, so just remember you are in this course, not because you're trying to fool Google, because practically

104
00:07:46,180 --> 00:07:47,020
you cannot.

105
00:07:47,410 --> 00:07:50,320
You are in this course because you want to learn NLP.

