1
00:00:11,050 --> 00:00:16,990
So in this lecture, we are going to look at how to do named entity recognition, also known as NER,

2
00:00:16,990 --> 00:00:18,010
in TensorFlow.

3
00:00:18,790 --> 00:00:20,020
So what is NER?

4
00:00:20,680 --> 00:00:27,130
Well, basically the idea is you have a document and you want to identify all the people, places and

5
00:00:27,130 --> 00:00:29,140
companies in that document.

6
00:00:29,680 --> 00:00:33,850
This data might be used as a preprocessing step in a larger piece of code.

7
00:00:34,540 --> 00:00:38,230
So, for example, if I see Apple, I would mark that as a company.

8
00:00:38,530 --> 00:00:41,620
If I see Steve Jobs, I would mark that as a person.

9
00:00:41,980 --> 00:00:44,950
And if I see California, I would mark that as a place.

10
00:00:49,550 --> 00:00:52,010
In any case, suppose you have some use for it.

11
00:00:52,340 --> 00:00:53,900
How does this actually work?

12
00:00:54,620 --> 00:01:00,740
Well, it turns out that NER is essentially exactly the same as parts-of-speech tagging, which you

13
00:01:00,740 --> 00:01:01,880
already know how to do.

14
00:01:02,750 --> 00:01:09,140
As you recall, that was a many-to-many task where we assign each word to a tag, and we do the same thing

15
00:01:09,140 --> 00:01:09,910
with NER.

16
00:01:10,490 --> 00:01:16,970
So every place has the tag LOC, every person has the tag PER, and every company has the tag ORG.

17
00:01:17,780 --> 00:01:23,120
Now, one challenging aspect of NER is that some entities span multiple tokens.

18
00:01:23,630 --> 00:01:26,060
So, for example, consider Steve Jobs.

19
00:01:26,570 --> 00:01:30,110
This has two tokens, but both of them are part of a name.

20
00:01:30,920 --> 00:01:35,780
If you consider "jobs" by itself, that would be a regular word, meaning occupations.

21
00:01:37,190 --> 00:01:41,840
Another challenging aspect of NER is that this data is highly imbalanced.

22
00:01:42,560 --> 00:01:44,900
You can see that most words just get the tag

23
00:01:44,900 --> 00:01:48,500
O, which means that it's not a place, person, or company.

24
00:01:49,310 --> 00:01:51,410
Consider this sentence from our dataset.

25
00:01:52,760 --> 00:01:59,480
Sheep have long been known to contract scrapie, a brain-wasting disease similar to BSE, which is believed

26
00:01:59,480 --> 00:02:03,170
to have been transferred to cattle through feed containing animal waste.

27
00:02:03,740 --> 00:02:07,190
This is a very long sentence and it has no named entities.

28
00:02:07,520 --> 00:02:09,110
All the labels are just O.

29
00:02:13,840 --> 00:02:19,270
Let's briefly discuss the format of NER tags, since you're probably wondering what all these I's

30
00:02:19,270 --> 00:02:20,470
and B's signify.

31
00:02:21,160 --> 00:02:27,400
This is called IOB format since obviously we use the letters I, O and B in our tags.

32
00:02:28,210 --> 00:02:34,720
The idea is that each entity potentially appears as a chunk. A chunk is a sequence of multiple tokens.

33
00:02:35,380 --> 00:02:37,490
An example of that is Steve Jobs.

34
00:02:38,080 --> 00:02:44,500
In this case, we would use B-PER to represent the fact that Steve is a person tag and it's the beginning

35
00:02:44,500 --> 00:02:46,690
of a chunk. For Jobs,

36
00:02:46,690 --> 00:02:53,140
we would use I-PER to represent the fact that Jobs is also a person tag and it's inside a chunk.

37
00:02:53,800 --> 00:03:00,520
So B stands for beginning, I stands for inside, and O stands for outside, meaning outside of any chunk.
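
As a rough sketch of the IOB scheme just described, here is how tagged tokens could be grouped back into chunks. This is not code from the lecture notebook; the function name `iob_to_chunks` and the example sentence are assumptions made for illustration. It also tolerates an I- tag with no preceding B- tag by treating it as the start of a new chunk.

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) chunks."""
    chunks = []
    current_tokens, current_type = [], None

    def flush():
        # Store the chunk collected so far, if there is one.
        if current_tokens:
            chunks.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # B- always begins a new chunk.
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # I- of the same type continues the current chunk.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the current chunk.
            flush()
            if tag.startswith("I-"):
                current_tokens, current_type = [token], tag[2:]
            else:
                current_tokens, current_type = [], None
    flush()
    return chunks


# Hypothetical example sentence, not taken from the dataset.
tokens = ["Steve", "Jobs", "founded", "Apple", "in", "California"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]
chunks = iob_to_chunks(tokens, tags)
print(chunks)
```

Note that "Steve" and "Jobs" come back as a single PER chunk, which is exactly what the B and I prefixes are for.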

38
00:03:01,420 --> 00:03:07,600
Luckily, these details don't really concern us, since ultimately our data set looks exactly like POS

39
00:03:07,600 --> 00:03:09,820
tagging from the previous lecture.

40
00:03:10,810 --> 00:03:14,260
We have a sequence of words and each word is mapped to a tag.

41
00:03:18,990 --> 00:03:24,360
Now, it turns out that the code for this is so simple that we are not even going to bother stepping

42
00:03:24,360 --> 00:03:25,890
through it like we normally would.

43
00:03:26,580 --> 00:03:32,850
I hope you remember my famous rule that makes this possible, and that rule is: all data is the same.

44
00:03:33,720 --> 00:03:36,300
Thanks to this rule, no code needs to be written.

45
00:03:36,780 --> 00:03:41,010
We simply change the data set and the previous code still works.

46
00:03:42,660 --> 00:03:47,490
So at the top here you can see that I've imported pickle and I've downloaded the train set and test

47
00:03:47,490 --> 00:03:49,410
set, which are pickle files.

48
00:03:50,100 --> 00:03:55,110
This is because the data format from the original data was a bit different than what we needed.

49
00:03:55,470 --> 00:03:59,520
So I did some processing for the course and uploaded the results to my website.

50
00:04:00,210 --> 00:04:06,240
This dataset is called CoNLL-2003, which is probably the most popular NER dataset

51
00:04:06,240 --> 00:04:07,020
that exists.

52
00:04:07,770 --> 00:04:09,930
In any case, let's scroll down to the results.

53
00:04:23,860 --> 00:04:30,400
As you can see, we get a very high accuracy, and the F1 is also high but a bit lower as expected,

54
00:04:30,400 --> 00:04:31,960
especially for the test set.

55
00:04:32,650 --> 00:04:38,200
This may be due to a large class imbalance where the model performs poorly on some under-represented

56
00:04:38,200 --> 00:04:38,800
class.

57
00:04:39,310 --> 00:04:41,110
I'll let you investigate that yourself.
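
To see why accuracy and F1 can diverge on imbalanced tags, here is a small illustrative sketch. It is not the lecture's evaluation code: the toy labels are made up, and macro averaging is an assumption, since the lecture does not say which F1 averaging the notebook uses.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted tag matches the true tag."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-tag F1 scores (macro averaging, an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the tag never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: 18 of 20 tokens are "O", and the model predicts "O" everywhere.
y_true = ["O"] * 18 + ["B-PER", "B-ORG"]
y_pred = ["O"] * 20

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}  macro F1={f1:.3f}")
```

A model that never predicts an entity still gets 90% accuracy here, while the macro F1 drops sharply because the two entity tags each score zero.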

58
00:04:48,030 --> 00:04:53,520
Importantly, we must again compare it to our baseline, which is to simply memorize the tags from the

59
00:04:53,520 --> 00:04:54,240
train set.

60
00:04:54,990 --> 00:05:00,870
As you can see, we get worse performance using the baseline and the test F1 is now even worse.

61
00:05:01,500 --> 00:05:07,200
Therefore, we can conclude that our model was successful in making use of context to predict named

62
00:05:07,200 --> 00:05:07,860
entities.
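
For reference, a baseline that simply memorizes tags from the train set could look roughly like the sketch below. This is an assumed reconstruction, not the lecture's actual baseline code: the function names, the toy sentences, and the choice to fall back to O for unseen words are all hypothetical.

```python
from collections import Counter, defaultdict


def fit_baseline(train_sentences):
    """Map each word to the tag it received most often in the train set."""
    counts = defaultdict(Counter)
    for tokens, tags in train_sentences:
        for token, tag in zip(tokens, tags):
            counts[token][tag] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def predict_baseline(word_to_tag, tokens, default_tag="O"):
    """Look up each word independently; unseen words get default_tag (an assumption)."""
    return [word_to_tag.get(token, default_tag) for token in tokens]


# Hypothetical toy train set, not taken from CoNLL-2003.
train = [
    (["Steve", "Jobs", "founded", "Apple"], ["B-PER", "I-PER", "O", "B-ORG"]),
    (["Many", "jobs", "were", "created"], ["O", "O", "O", "O"]),
]
word_to_tag = fit_baseline(train)
pred = predict_baseline(word_to_tag, ["Jobs", "left", "Apple"])
print(pred)
```

Because every word is looked up on its own, "Jobs" is tagged I-PER even with no "Steve" before it. A lookup table like this cannot use the surrounding words, which is the difference the comparison with the model is meant to bring out.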