1
00:00:00,000 --> 00:00:02,790
Last week, you looked
at tokenizing text.

2
00:00:02,790 --> 00:00:06,270
There, you turned text into
sequences of numbers, where

3
00:00:06,270 --> 00:00:07,680
each number was the value of

4
00:00:07,680 --> 00:00:10,860
a key-value pair, with
the key being the word.

5
00:00:10,860 --> 00:00:13,230
So for example, you
could represent

6
00:00:13,230 --> 00:00:15,570
the word TensorFlow
with the value nine,

7
00:00:15,570 --> 00:00:17,400
and then replace
every instance of

8
00:00:17,400 --> 00:00:20,400
the word with a
nine in a sequence.

9
00:00:20,400 --> 00:00:22,470
Using tools in TensorFlow,

10
00:00:22,470 --> 00:00:24,420
you are able to
process strings to

11
00:00:24,420 --> 00:00:26,520
get indices of all the words in

12
00:00:26,520 --> 00:00:28,740
a corpus of strings
and then convert

13
00:00:28,740 --> 00:00:31,575
the strings into
matrices of numbers.
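
To make that concrete, here's a minimal sketch of the idea in plain Python. It is not TensorFlow's own tokenizer. The helper names build_word_index and texts_to_matrix and the two sample sentences are made up for illustration, and padding with zeros at the end of each row is just one possible choice.

```python
# Illustrative sketch only: plain Python, not TensorFlow's tokenizer.

def build_word_index(sentences):
    """Give each distinct word an integer, starting at 1 (0 is kept for padding)."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index


def texts_to_matrix(sentences, word_index):
    """Replace each word with its number, then pad rows with 0 to equal length."""
    sequences = [
        [word_index[w] for w in s.lower().split() if w in word_index]
        for s in sentences
    ]
    max_len = max((len(seq) for seq in sequences), default=0)
    return [seq + [0] * (max_len - len(seq)) for seq in sequences]


# Hypothetical sample corpus, chosen only for this example.
corpus = ["I love TensorFlow", "TensorFlow is fun to learn"]
word_index = build_word_index(corpus)
matrix = texts_to_matrix(corpus, word_index)
```

Every occurrence of a word maps to the same number, so both rows of the matrix share the value assigned to tensorflow.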

14
00:00:31,575 --> 00:00:33,480
This is the start of getting

15
00:00:33,480 --> 00:00:35,615
sentiment out of your sentences.

16
00:00:35,615 --> 00:00:37,520
But right now, it's
still just a string

17
00:00:37,520 --> 00:00:39,755
of numbers representing words.

18
00:00:39,755 --> 00:00:42,790
So from there, how would
one actually get sentiment?

19
00:00:42,790 --> 00:00:44,360
Well, that's
something that can be

20
00:00:44,360 --> 00:00:46,250
learned from a corpus of words

21
00:00:46,250 --> 00:00:47,510
in much the same way as

22
00:00:47,510 --> 00:00:50,060
features were
extracted from images.

23
00:00:50,060 --> 00:00:52,580
This process is called embedding,

24
00:00:52,580 --> 00:00:54,950
with the idea being
that words and

25
00:00:54,950 --> 00:00:57,320
associated words are clustered

26
00:00:57,320 --> 00:01:00,200
as vectors in
a multi-dimensional space.
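
As a rough sketch of what clustering as vectors means, the snippet below compares a few word vectors with cosine similarity. The two-dimensional vectors are hand-picked for illustration and are an assumption, not learned values. Real embeddings come out of training and usually have many more dimensions. The helper names cosine_similarity and nearest are hypothetical too.

```python
import math

# Hypothetical, hand-picked 2-D vectors for illustration only.
# Real embeddings are learned during training, not written by hand.
embeddings = {
    "boring": [-0.9, 0.1],
    "unwatchable": [-0.8, 0.3],
    "fun": [0.9, 0.2],
    "funny": [0.8, 0.4],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word):
    """Return the other word whose vector points most nearly the same way."""
    return max(
        (w for w in embeddings if w != word),
        key=lambda w: cosine_similarity(embeddings[word], embeddings[w]),
    )


closest_to_boring = nearest("boring")
closest_to_fun = nearest("fun")
```

With these made-up vectors, words of similar sentiment point in similar directions, which is the kind of grouping a projector view makes visible.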

27
00:01:00,200 --> 00:01:02,630
Here, I'm showing
an embedding projector

28
00:01:02,630 --> 00:01:05,060
with classifications
of movie reviews.

29
00:01:05,060 --> 00:01:07,160
This week, you'll learn
how to build that.

30
00:01:07,160 --> 00:01:09,500
The reviews are in
two main categories:

31
00:01:09,500 --> 00:01:10,790
positive and negative.

32
00:01:10,790 --> 00:01:12,560
So together with the labels,

33
00:01:12,560 --> 00:01:15,980
TensorFlow was able to build
these embeddings showing

34
00:01:15,980 --> 00:01:18,380
a clear clustering
of words that are

35
00:01:18,380 --> 00:01:21,305
distinct to each of
these review types.

36
00:01:21,305 --> 00:01:23,360
I can actually search for words

37
00:01:23,360 --> 00:01:25,535
to see which ones match
a classification.

38
00:01:25,535 --> 00:01:27,605
So for example, if I
search for boring,

39
00:01:27,605 --> 00:01:29,240
we can see that it lights up in

40
00:01:29,240 --> 00:01:31,010
one of the clusters and that

41
00:01:31,010 --> 00:01:32,690
associated words are clearly

42
00:01:32,690 --> 00:01:35,830
negative, such as unwatchable.

43
00:01:35,830 --> 00:01:40,265
Similarly, if I search for
a negative word like annoying,

44
00:01:40,265 --> 00:01:42,860
I'll find it along
with annoyingly in

45
00:01:42,860 --> 00:01:46,235
the cluster that clearly belongs
to the negative reviews.

46
00:01:46,235 --> 00:01:48,560
Or if I search for fun,

47
00:01:48,560 --> 00:01:51,449
I'll find that fun and
funny are positive,

48
00:01:51,449 --> 00:01:53,310
fundamental is neutral,

49
00:01:53,310 --> 00:01:56,205
and unfunny is, of
course, negative.

50
00:01:56,205 --> 00:01:59,570
This week, you'll learn how
to use embeddings and how

51
00:01:59,570 --> 00:02:02,765
to build a classifier that
produced that visualization.

52
00:02:02,765 --> 00:02:05,060
You're most of the way there
already with the work that

53
00:02:05,060 --> 00:02:07,325
you've been doing with
string tokenization.

54
00:02:07,325 --> 00:02:10,220
We'll get back to that later
but first let's look at

55
00:02:10,220 --> 00:02:12,325
building the IMDB classifier

56
00:02:12,325 --> 00:02:14,480
that you just saw visualized.