1
00:00:00,150 --> 00:00:02,590
So let's start looking at it.

2
00:00:02,590 --> 00:00:04,090
There are a couple
of things that you

3
00:00:04,090 --> 00:00:05,350
need to take into account

4
00:00:05,350 --> 00:00:06,775
before you start working with

5
00:00:06,775 --> 00:00:08,665
this week's code in TensorFlow.

6
00:00:08,665 --> 00:00:11,635
The first is the version of
TensorFlow you're using.

7
00:00:11,635 --> 00:00:13,970
Use this code to determine it.

8
00:00:13,970 --> 00:00:16,450
Also, do note that
all the code I'm

9
00:00:16,450 --> 00:00:18,695
using here is in Python 3.

10
00:00:18,695 --> 00:00:21,730
There are some differences
if you use Python 2.

11
00:00:21,730 --> 00:00:23,304
So if you're using Colab,

12
00:00:23,304 --> 00:00:25,645
you can set the
environment to Python 3.

13
00:00:25,645 --> 00:00:27,880
If you're doing this in
your own environment,

14
00:00:27,880 --> 00:00:30,145
you may need to
make some changes.

15
00:00:30,145 --> 00:00:33,940
If the previous code
gave you TensorFlow 1.x,

16
00:00:33,940 --> 00:00:35,260
you'll need this line of

17
00:00:35,260 --> 00:00:37,240
code before you can
go any further.

18
00:00:37,240 --> 00:00:40,285
If it gave you 2.x, then you
won't need anything because

19
00:00:40,285 --> 00:00:44,540
eager execution is enabled by
default in TensorFlow 2.0.
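As a rough illustration of the version check and the eager-execution switch described above, here is a minimal sketch. The helper name `needs_eager_enable` is hypothetical and introduced only for this example; `tf.__version__` and the TensorFlow 1.x call `tf.enable_eager_execution()` are the real TensorFlow names. The import is guarded so the snippet also runs where TensorFlow is not installed.

```python
def needs_eager_enable(version):
    """Return True for TensorFlow 1.x version strings such as '1.14.0'."""
    major = int(version.split(".")[0])
    return major < 2


try:
    import tensorflow as tf
except ImportError:
    tf = None  # TensorFlow is not available in this environment

if tf is not None:
    # Step 1: find out which version of TensorFlow is installed.
    print(tf.__version__)
    # Step 2: only TensorFlow 1.x needs eager execution switched on;
    # in 2.x it is already enabled by default.
    if needs_eager_enable(tf.__version__):
        tf.enable_eager_execution()
```

The eager-execution call has to run at program startup, before any other TensorFlow operations are created.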

20
00:00:44,540 --> 00:00:46,875
If you're using Google Colab,

21
00:00:46,875 --> 00:00:48,410
then you should have TensorFlow

22
00:00:48,410 --> 00:00:50,585
datasets already installed.

23
00:00:50,585 --> 00:00:52,010
Should you not have them,

24
00:00:52,010 --> 00:00:55,325
they're easily installed
with this line of code.

25
00:00:55,325 --> 00:00:59,030
Now, you can import
TensorFlow datasets,

26
00:00:59,030 --> 00:01:02,155
and in this case
I call them tfds.

27
00:01:02,155 --> 00:01:06,735
With imdb_reviews, I
can now call tfds.load,

28
00:01:06,735 --> 00:01:09,430
pass it the string "imdb_reviews",

29
00:01:09,430 --> 00:01:11,845
and it will return
the data from IMDB,

30
00:01:11,845 --> 00:01:15,650
and metadata about
it with this code.

31
00:01:15,980 --> 00:01:19,860
The data is split
into 25,000 samples

32
00:01:19,860 --> 00:01:23,070
for training and 25,000
samples for testing.

33
00:01:23,070 --> 00:01:25,350
I can split them out like this.

34
00:01:25,350 --> 00:01:29,140
Each of these is an iterable
containing the 25,000

35
00:01:29,140 --> 00:01:33,235
respective sentences
and labels as tensors.

36
00:01:33,235 --> 00:01:35,020
Up to this point,
we've been using

37
00:01:35,020 --> 00:01:36,520
the Keras tokenizer and

38
00:01:36,520 --> 00:01:38,890
padding tools on
arrays of sentences,

39
00:01:38,890 --> 00:01:40,705
so we need to do
a little converting.

40
00:01:40,705 --> 00:01:42,770
We'll do it like this.

41
00:01:42,810 --> 00:01:46,150
First of all, let's define
the lists containing

42
00:01:46,150 --> 00:01:47,740
the sentences and labels

43
00:01:47,740 --> 00:01:50,225
for both training
and testing data.

44
00:01:50,225 --> 00:01:52,450
Now, I can iterate over

45
00:01:52,450 --> 00:01:56,155
training data extracting
the sentences and the labels.

46
00:01:56,155 --> 00:01:59,395
The values for s
and l are tensors,

47
00:01:59,395 --> 00:02:01,480
so by calling their numpy() method,

48
00:02:01,480 --> 00:02:03,955
I'll actually
extract their value.

49
00:02:03,955 --> 00:02:07,315
Then I'll do the same
for the test set.
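The loading and conversion steps just described can be sketched roughly as follows. The `tfds.load` call in the comments is the real TensorFlow Datasets API; the function name `dataset_to_lists` and the `fake_tensor` stand-in are hypothetical, added only so the sketch runs without TensorFlow or a dataset download.

```python
from types import SimpleNamespace


def dataset_to_lists(dataset):
    """Collect (sentence, label) tensor pairs into two plain Python lists."""
    sentences, labels = [], []
    for s, l in dataset:
        # .numpy() pulls the raw value out of an eager tensor.
        # String tensors come back as bytes, so decode them to str.
        sentences.append(s.numpy().decode("utf8"))
        labels.append(l.numpy())
    return sentences, labels


# With TensorFlow Datasets installed, the real data would come from:
#   import tensorflow_datasets as tfds
#   imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)
#   train_data, test_data = imdb["train"], imdb["test"]
#   training_sentences, training_labels = dataset_to_lists(train_data)
#   testing_sentences, testing_labels = dataset_to_lists(test_data)


def fake_tensor(value):
    """Hypothetical stand-in exposing only the .numpy() method of a tensor."""
    return SimpleNamespace(numpy=lambda: value)


demo_data = [
    (fake_tensor(b"A wonderful little film."), fake_tensor(1)),
    (fake_tensor(b"Dull and far too long."), fake_tensor(0)),
]
training_sentences, training_labels = dataset_to_lists(demo_data)
print(training_sentences, training_labels)
```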

50
00:02:07,315 --> 00:02:09,850
Here's an example of a review.

51
00:02:09,850 --> 00:02:12,850
I've truncated it to
fit it on this slide,

52
00:02:12,850 --> 00:02:16,405
but you can see how it is
stored as a tf.Tensor.

53
00:02:16,405 --> 00:02:18,040
Similarly, here's a bunch of

54
00:02:18,040 --> 00:02:20,155
labels also stored as tensors.

55
00:02:20,155 --> 00:02:21,700
The value 1 indicates

56
00:02:21,700 --> 00:02:24,585
a positive review and
0 a negative one.

57
00:02:24,585 --> 00:02:28,720
When training, my labels are
expected to be NumPy arrays.

58
00:02:28,720 --> 00:02:30,970
So I'll turn the list of
labels that I've just

59
00:02:30,970 --> 00:02:34,330
created into NumPy arrays
with this code.
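A minimal sketch of that conversion, assuming the labels were collected into a plain Python list. The variable names and the placeholder values are illustrative, not taken from the slide.

```python
import numpy as np

# Placeholder labels; in the lecture these come from the IMDB dataset.
training_labels = [1, 0, 0, 1]

# np.array turns the Python list into the NumPy array used for training.
training_labels_final = np.array(training_labels)
print(training_labels_final.shape, training_labels_final.dtype)
```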

60
00:02:34,330 --> 00:02:38,755
Next up, we'll tokenize
our sentences. Here's the code.

61
00:02:38,755 --> 00:02:41,860
I've put the hyperparameters
at the top like this

62
00:02:41,860 --> 00:02:43,225
because that makes it

63
00:02:43,225 --> 00:02:45,370
easier to change and edit them,

64
00:02:45,370 --> 00:02:46,390
instead of fishing through

65
00:02:46,390 --> 00:02:48,775
function sequences
for the literals

66
00:02:48,775 --> 00:02:50,870
and then changing those.

67
00:02:51,530 --> 00:02:54,680
Now, as before, we import

68
00:02:54,680 --> 00:02:57,800
the Tokenizer and
pad_sequences.

69
00:02:57,800 --> 00:03:00,665
We'll create an instance
of tokenizer,

70
00:03:00,665 --> 00:03:02,540
giving it our vocab size and our

71
00:03:02,540 --> 00:03:05,380
desired out-of-vocabulary token.

72
00:03:05,380 --> 00:03:09,905
We'll now fit the tokenizer
on our training set of data.

73
00:03:09,905 --> 00:03:12,245
Once we have our word index,

74
00:03:12,245 --> 00:03:14,420
we can now replace
the strings containing

75
00:03:14,420 --> 00:03:17,630
the words with the token value
we created for them.

76
00:03:17,630 --> 00:03:20,825
This will be the list
called sequences.

77
00:03:20,825 --> 00:03:24,095
As before, the sentences
will have varying lengths.

78
00:03:24,095 --> 00:03:26,270
So we'll pad and/or truncate

79
00:03:26,270 --> 00:03:28,580
the sequenced sentences until

80
00:03:28,580 --> 00:03:29,990
they're all the same length,

81
00:03:29,990 --> 00:03:32,800
determined by
the maxlen parameter.

82
00:03:32,800 --> 00:03:36,125
Then we'll do the same for
the testing sequences.

83
00:03:36,125 --> 00:03:37,940
Note that the word index is

84
00:03:37,940 --> 00:03:40,280
built only from words
in the training set,

85
00:03:40,280 --> 00:03:42,410
so you should expect to
see a lot more out of

86
00:03:42,410 --> 00:03:45,655
vocabulary tokens
in the test set.
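In Keras these steps correspond to `Tokenizer(num_words=..., oov_token=...)`, `fit_on_texts`, `texts_to_sequences`, and `pad_sequences(..., maxlen=...)`. The block below is not that implementation. It is a simplified pure-Python sketch of the same idea, with hypothetical helper names and made-up hyperparameter values, meant to show why words seen only in the test data turn into the out-of-vocabulary token and how padding or truncating yields equal lengths.

```python
from collections import Counter

# Hyperparameters kept together at the top so they are easy to change.
vocab_size = 10
max_length = 6
oov_tok = "<OOV>"


def build_word_index(sentences, oov_token):
    """Map each training word to an integer, most frequent first."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    index = {oov_token: 1}  # 0 is reserved for padding, 1 for unknown words
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index


def to_sequence(sentence, word_index, num_words, oov_token):
    """Replace each word with its token, or the OOV token if unknown."""
    oov_id = word_index[oov_token]
    ids = []
    for w in sentence.lower().split():
        i = word_index.get(w, oov_id)
        ids.append(i if i < num_words else oov_id)
    return ids


def pad(seq, maxlen):
    """Cut off anything past maxlen, then left-pad with zeros."""
    seq = seq[:maxlen]
    return [0] * (maxlen - len(seq)) + seq


training_sentences = ["the movie was great", "the movie was bad"]
testing_sentences = ["the film was great"]

# The word index is built from the training sentences only.
word_index = build_word_index(training_sentences, oov_tok)

padded = [pad(to_sequence(s, word_index, vocab_size, oov_tok), max_length)
          for s in training_sentences]
testing_padded = [pad(to_sequence(s, word_index, vocab_size, oov_tok), max_length)
                  for s in testing_sentences]
print(testing_padded)
```

Here "film" never appears in the training sentences, so it is encoded with the out-of-vocabulary token's index.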

87
00:03:45,655 --> 00:03:48,770
Now it's time to define
our neural network.

88
00:03:48,770 --> 00:03:50,885
This should look very
familiar by now,

89
00:03:50,885 --> 00:03:54,350
except for maybe this line,
the embedding.

90
00:03:54,350 --> 00:03:58,280
This is the key to text sentiment
analysis in TensorFlow,

91
00:03:58,280 --> 00:04:00,870
and this is where
the magic really happens.