1
00:00:11,120 --> 00:00:16,700
So in this lecture, we'll be doing a very brief introduction to word2vec, which is what we'll implement

2
00:00:16,700 --> 00:00:17,600
in the next lecture.

3
00:00:18,410 --> 00:00:23,000
One of the reasons I want to do this is that it allows us to use embeddings with ANNs.

4
00:00:23,510 --> 00:00:28,610
Otherwise we would have to wait until we studied CNNs and RNNs, which will take quite some time.

5
00:00:29,390 --> 00:00:35,170
This lecture will look at one specific variant of word2vec, specifically CBOW, which stands for

6
00:00:35,180 --> 00:00:36,890
continuous bag of words.

7
00:00:37,550 --> 00:00:42,620
The other major variant is called skip-gram, which is, in my opinion, a little more complex.

8
00:00:43,070 --> 00:00:46,040
So to keep things simple, we'll be implementing CBOW.

9
00:00:50,720 --> 00:00:56,660
So now that you know about ANNs, the design of the CBOW network will be very simple. At a high level,

10
00:00:56,660 --> 00:01:02,300
what we are trying to do can be described in a single sentence: given a string of multiple words,

11
00:01:02,660 --> 00:01:08,030
we're going to remove the middle word and use the surrounding words to predict the missing middle

12
00:01:08,030 --> 00:01:08,510
word.

13
00:01:09,110 --> 00:01:14,060
This should sound very familiar if you remember the other things we've studied in this class.

14
00:01:14,780 --> 00:01:16,370
So there are two ways to look at this.

15
00:01:21,060 --> 00:01:27,120
The first way, which appears in the word2vec paper, shows a neural network with multiple inputs all

16
00:01:27,120 --> 00:01:28,920
going through the same embedding layer.

17
00:01:29,460 --> 00:01:32,760
This converts each of the context words into a word vector.

18
00:01:33,600 --> 00:01:39,060
Once we have the vectors for all the context words, we take the average, which gives us a single vector

19
00:01:39,300 --> 00:01:40,590
with the same dimension.

20
00:01:41,490 --> 00:01:45,960
Using this vector, we go through a final dense layer to get the predicted word.

21
00:01:47,750 --> 00:01:52,010
So this is like an ANN with multiple inputs and a single output.

22
00:01:56,650 --> 00:02:01,780
The second way to look at this is that it's a sequence of context words being passed through an embedding

23
00:02:01,780 --> 00:02:02,290
layer.

24
00:02:02,770 --> 00:02:05,020
This gives us a sequence of embeddings.

25
00:02:05,650 --> 00:02:08,770
We then take the average of all the embeddings in that sequence.

26
00:02:09,310 --> 00:02:11,500
We've not yet seen how to do this in TensorFlow.

27
00:02:11,830 --> 00:02:15,760
But note that the average is a pretty simple operation, so it's nothing to be concerned about.

28
00:02:16,480 --> 00:02:21,640
This gives us a single vector, which we can then pass through a final dense layer to get the predicted

29
00:02:21,640 --> 00:02:22,120
word.

30
00:02:22,990 --> 00:02:29,050
So from this perspective, the input is a sequence and the architecture is an ANN which handles sequences

31
00:02:29,050 --> 00:02:29,860
as input.

32
00:02:34,560 --> 00:02:36,480
So there are two notes to make about this.

33
00:02:37,410 --> 00:02:42,090
Firstly, note how this is just multiclass classification, which you know how to do.

34
00:02:43,200 --> 00:02:46,680
Secondly, note that there are some small details that aren't obvious.

35
00:02:47,490 --> 00:02:53,370
The first detail is that although this is an ANN, there is no activation function in the middle. It's

36
00:02:53,370 --> 00:02:55,650
just two weight matrices one after another.

37
00:02:56,220 --> 00:02:58,980
Therefore, this is effectively a linear network.

38
00:03:00,510 --> 00:03:06,120
But unlike a regular linear model, there is what we call an information bottleneck, which forces data

39
00:03:06,120 --> 00:03:09,960
of large dimension to fit into a vector space of small dimension.

40
00:03:10,650 --> 00:03:13,650
You can think of our neural network as having an hourglass shape.

41
00:03:14,280 --> 00:03:17,220
The input has size V, which is our vocab size.

42
00:03:17,610 --> 00:03:20,940
But the middle only has size D, which is the embedding dimension.

43
00:03:21,750 --> 00:03:27,270
The output once again has size V since we need to be able to predict every possible word.

44
00:03:27,930 --> 00:03:30,870
So the network goes from big to small, back to big.

45
00:03:32,250 --> 00:03:37,710
This is a common theme in deep learning, which usually results in useful representations being learned.

46
00:03:38,910 --> 00:03:44,460
In fact, we saw something similar when we studied LSA: when you reduce dimensionality,

47
00:03:44,490 --> 00:03:45,720
the result is useful.

48
00:03:46,830 --> 00:03:52,470
The second detail I want to mention is that neither layer of the neural network has a bias term.

49
00:03:52,950 --> 00:03:54,930
The embedding layer doesn't have a bias term.

50
00:03:55,260 --> 00:03:58,200
And the final dense layer also does not have a bias term.

51
00:03:58,770 --> 00:04:02,220
But essentially, this is all you need to know to implement CBOW.
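
To make the steps above concrete, here is a minimal sketch of the CBOW forward pass in plain Python. It is an illustration only, not the implementation from the next lecture. The function name `cbow_forward`, the matrix names `W_embed` and `W_out`, the toy sizes, and the use of a plain softmax for the output are all assumptions made for this example.

```python
import math
import random


def cbow_forward(context_ids, W_embed, W_out):
    """Return a probability for every vocabulary word being the middle word.

    W_embed is a V x D embedding matrix and W_out is a D x V output matrix
    (both hypothetical names). Neither layer has a bias term.
    """
    D = len(W_embed[0])
    V = len(W_out[0])
    # Embedding lookup: one D-dimensional vector per context word.
    vectors = [W_embed[i] for i in context_ids]
    # Average the context vectors into a single D-dimensional vector.
    h = [sum(v[d] for v in vectors) / len(vectors) for d in range(D)]
    # Final dense layer, D -> V, with no activation function in between.
    logits = [sum(h[d] * W_out[d][j] for d in range(D)) for j in range(V)]
    # Softmax (an assumption here) turns the logits into class probabilities.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
V, D = 10, 3  # toy vocab size and embedding dimension: big -> small -> big
W_embed = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(V)]
W_out = [[random.gauss(0, 0.1) for _ in range(V)] for _ in range(D)]

# Two words on each side of the removed middle word, given as word indices.
probs = cbow_forward([1, 2, 4, 5], W_embed, W_out)
```

Because the context vectors are averaged, the order of the context words does not affect the prediction, which is where the "bag of words" in the name comes from.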