Here you can see
the tokenizer from the Keras
pre-processing library. The tokenizer is
your friend when it comes to doing natural
language processing. It does all the heavy
lifting of managing tokens, turning your text into
streams of tokens, et cetera. Now the reason why
you would need this is that when it
comes to training neural networks, you're
going to be doing a lot of math, and math deals with numbers. Instead of having the words themselves being trained
in a neural network, you can actually have
a number representing each word, and it just makes
your life a lot easier. Here you can see
I have a body of text with two sentences, "I love my dog" and "I love my cat". I'm going to tokenize
those using the tokenizer. Now one note: you'll often create the tokenizer using
the num_words parameter. In this case, what
it's going to do is, in the body of text
that it's tokenizing, it will take the 100
most common words, or whatever value
you actually put in here. I have a lot fewer than
100 unique words here, so it's not really
going to have any effect. What fit_on_texts will then
do is it will go through the entire body of text and it will create a dictionary
with the key being the word and the value being
the token for that word. If I run this, we'll actually
see that in action. Here you can see now it's
created a word index for me. In the word index, "i"
is number 1, "love" is number 2, "my" is number 3, "dog" is number 4, and "cat" is number 5. Those are the unique words that are actually in this
corpus of text. A few things to take note of. Number 1 is that
punctuation, like spaces and the comma, has
actually been removed. It cleans up my text for me in that way, to just
pull out the words. Number 2, you may have
noticed that I have a lowercase i here
and an uppercase I here and, as you can see, to
make it case insensitive, it's just using the lowercase "i".
It's giving the same
token for both of these. Now let me
change this a little bit by adding some
new words to it. For example, here:
"You love my dog!" Notice that "You" is capitalized and "dog" has
an exclamation mark after it, but it's not going to confuse
that with the previous dog. If I run it, we'll see now that I have a
whole new set of tokens. I have one new one: I have six now instead of five, and that's because
the word "you" is the only unique new word in this corpus, because "love", "my", and "dog" were
there previously, but you'll see the exclamation mark
from dog was removed. That's a basic introduction to how the tokenizer
actually works, and you'll be using that
a lot in this course.
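
As a rough sketch of what the walkthrough above describes, here is a plain-Python stand-in for the word index step. It is not the Keras implementation. The function name build_word_index, the punctuation set, and the exact example strings (a lowercase "i" in the first sentence, a comma in the second) are assumptions for illustration. In Keras itself these steps would go through the tokenizer's num_words parameter, fit_on_texts, and its word index, which this sketch only imitates.

```python
import string
from collections import Counter

# Characters treated as punctuation. This is an assumption for the sketch;
# the Keras tokenizer uses its own default filter set.
PUNCTUATION = string.punctuation


def build_word_index(sentences):
    """Hypothetical helper: map each unique lowercased word to an integer token.

    Tokens start at 1. More frequent words get lower tokens, and ties keep
    the order in which the words were first seen.
    """
    strip_table = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))
    counts = Counter()
    for sentence in sentences:
        # Lowercase so "I" and "i" share one token, then replace
        # punctuation with spaces so "dog!" becomes "dog".
        cleaned = sentence.lower().translate(strip_table)
        counts.update(cleaned.split())
    # sorted() is stable, so equal counts stay in first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: token for token, (word, _) in enumerate(ranked, start=1)}


# Example strings assumed from the description in the transcript.
sentences = ["i love my dog", "I, love my cat"]
print(build_word_index(sentences))
# {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}

# Adding a third sentence introduces exactly one new word, "you".
sentences.append("You love my dog!")
print(build_word_index(sentences))
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

Notice that in the second run the tokens get reassigned, because "love" and "my" are now the most frequent words. That's the "whole new set of tokens" from the demo: six entries instead of five, with the comma and the exclamation mark gone.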