﻿1
00:00:01,180 --> 00:00:02,990
‫One of the most important steps

2
00:00:02,990 --> 00:00:06,250
‫in building data intensive apps is to actually model

3
00:00:06,250 --> 00:00:08,700
‫all this data in MongoDB.

4
00:00:08,700 --> 00:00:12,300
‫And so that's what we're gonna talk about in this lecture.

5
00:00:12,300 --> 00:00:14,710
‫So it's really crucial that you follow it

6
00:00:14,710 --> 00:00:19,710
‫through, even if at first it's a lot to take in. All right.

7
00:00:19,810 --> 00:00:22,013
‫Anyway, let's now get started.

8
00:00:23,530 --> 00:00:27,530
‫Now, data modeling is probably a very new concept to you.

9
00:00:27,530 --> 00:00:28,920
‫So before we start,

10
00:00:28,920 --> 00:00:32,070
‫let's make clear what we're actually gonna talk about.

11
00:00:32,070 --> 00:00:35,656
‫So, data modeling is the process of taking unstructured data

12
00:00:35,656 --> 00:00:38,770
‫generated by a real world scenario

13
00:00:38,770 --> 00:00:42,090
‫and then structuring it into a logical data model

14
00:00:42,090 --> 00:00:43,410
‫in a database.

15
00:00:43,410 --> 00:00:46,300
‫And we do that according to a set of criteria

16
00:00:46,300 --> 00:00:49,330
‫which we're gonna learn about in this video.

17
00:00:49,330 --> 00:00:51,980
‫For example, let's say that we want to design

18
00:00:51,980 --> 00:00:54,120
‫an online shop data model.

19
00:00:54,120 --> 00:00:57,040
‫There will be initially a ton of unstructured data

20
00:00:57,040 --> 00:00:58,130
‫that we know we need.

21
00:00:58,130 --> 00:00:58,980
‫Right.

22
00:00:58,980 --> 00:01:00,900
‫Stuff like products, categories,

23
00:01:00,900 --> 00:01:03,875
‫customers, orders, shopping carts, suppliers.

24
00:01:03,875 --> 00:01:06,300
‫And so on and so forth.

25
00:01:06,300 --> 00:01:09,240
‫Our goal with data modeling is to then structure

26
00:01:09,240 --> 00:01:11,450
‫this data in a logical way,

27
00:01:11,450 --> 00:01:14,090
‫reflecting the real-world relationships

28
00:01:14,090 --> 00:01:16,920
‫that exist between some of these data sets.

29
00:01:16,920 --> 00:01:19,670
‫A bit like you can see in this example.

30
00:01:19,670 --> 00:01:23,110
‫And this is of course just a kind of imaginary situation

31
00:01:23,110 --> 00:01:24,320
‫but you get the idea.

32
00:01:24,320 --> 00:01:25,600
‫Right.

33
00:01:25,600 --> 00:01:28,940
‫Now, many backend developers say that data modeling

34
00:01:28,940 --> 00:01:30,930
‫is where we have to think the most.

35
00:01:30,930 --> 00:01:33,670
‫That it's the most demanding part of building

36
00:01:33,670 --> 00:01:35,310
‫an entire application.

37
00:01:35,310 --> 00:01:38,100
‫Because it really is not always straightforward.

38
00:01:38,100 --> 00:01:41,070
‫And sometimes there are simply no right answers.

39
00:01:41,070 --> 00:01:45,500
‫So not just one unique correct way of structuring the data.

40
00:01:45,500 --> 00:01:48,420
‫But anyway I will do my best to lay down the process

41
00:01:48,420 --> 00:01:49,510
‫in this video.

42
00:01:49,510 --> 00:01:52,367
‫And for that we're gonna go through four steps.

43
00:01:52,367 --> 00:01:56,200
‫So in the first step, we'll learn how to identify

44
00:01:56,200 --> 00:01:59,340
‫different types of relationships between data.

45
00:01:59,340 --> 00:02:00,360
‫Then we're gonna understand the difference

46
00:02:00,360 --> 00:02:03,019
‫between referencing or normalization

47
00:02:03,019 --> 00:02:07,163
‫and embedding or denormalization.

48
00:02:07,163 --> 00:02:09,030
‫In the next and most important step,

49
00:02:09,030 --> 00:02:11,660
‫I will show you my framework for deciding

50
00:02:11,660 --> 00:02:13,560
‫whether we should embed documents

51
00:02:13,560 --> 00:02:15,750
‫or reference other documents

52
00:02:15,750 --> 00:02:18,690
‫based on a couple of different factors.

53
00:02:18,690 --> 00:02:20,810
‫Also, we have to quickly talk about

54
00:02:20,810 --> 00:02:22,680
‫different types of referencing.

55
00:02:22,680 --> 00:02:25,920
‫Because that's important if that is the type of design

56
00:02:25,920 --> 00:02:28,220
‫that we choose for our data.

57
00:02:28,220 --> 00:02:32,290
‫So this is gonna be in fact a quite theoretical lecture.

58
00:02:32,290 --> 00:02:35,940
‫But also an absolutely essential one for your progress

59
00:02:35,940 --> 00:02:37,660
‫as a back-end developer.

60
00:02:37,660 --> 00:02:41,553
‫Because the way we design data, so the way we model our data,

61
00:02:41,553 --> 00:02:45,180
‫can make or break our entire application.

62
00:02:45,180 --> 00:02:47,950
‫And there will be a lot of examples along the way

63
00:02:47,950 --> 00:02:49,510
‫to make this process easier.

64
00:02:49,510 --> 00:02:50,343
‫All right.

65
00:02:51,320 --> 00:02:53,440
‫And the first thing that we are gonna talk about

66
00:02:53,440 --> 00:02:55,780
‫is the different types of relationships

67
00:02:55,780 --> 00:02:58,210
‫that can exist between data.

68
00:02:58,210 --> 00:03:00,780
‫So there are three big types of relationships.

69
00:03:00,780 --> 00:03:05,150
‫One to one, one to many, and many to many.

70
00:03:05,150 --> 00:03:06,990
‫And I'm gonna use a movie application

71
00:03:06,990 --> 00:03:08,890
‫as an example in this slide.

72
00:03:08,890 --> 00:03:10,000
‫Okay?

73
00:03:10,000 --> 00:03:12,440
‫So first a one to one relationship

74
00:03:12,440 --> 00:03:14,140
‫between data is basically

75
00:03:14,140 --> 00:03:17,370
‫when one field can only have one value.

76
00:03:17,370 --> 00:03:21,550
‫So in our movie application example, one movie will only ever

77
00:03:21,550 --> 00:03:22,990
‫have one name.

78
00:03:22,990 --> 00:03:24,910
‫And so this is a simple example

79
00:03:24,910 --> 00:03:27,160
‫of a one to one relationship.

80
00:03:27,160 --> 00:03:29,690
‫But these relationships are not really that important

81
00:03:29,690 --> 00:03:31,363
‫in terms of data modeling.

82
00:03:32,330 --> 00:03:34,430
‫Now the most important relationships

83
00:03:34,430 --> 00:03:37,210
‫are the one to many relationships.

84
00:03:37,210 --> 00:03:39,770
‫And they are so important that in MongoDB

85
00:03:39,770 --> 00:03:42,510
‫we actually distinguish between three types

86
00:03:42,510 --> 00:03:44,540
‫of one to many relationships.

87
00:03:44,540 --> 00:03:49,540
‫One to a few, one to many, and one to a ton or to a million

88
00:03:49,910 --> 00:03:53,230
‫or something like that. So the difference here is based

89
00:03:53,230 --> 00:03:56,893
‫on the relative amount of the many. All right.

90
00:03:57,840 --> 00:04:00,969
‫So an example of a one to few relationship is that

91
00:04:00,969 --> 00:04:05,967
‫one movie can win many awards but actually just a few.

92
00:04:05,967 --> 00:04:09,630
‫So a movie is not gonna win a thousand awards

93
00:04:09,630 --> 00:04:11,220
‫but it can win some.

94
00:04:11,220 --> 00:04:14,930
‫And so this is a typical one to few relationship.

95
00:04:14,930 --> 00:04:18,710
‫So you see that in general a one to many relationship

96
00:04:18,710 --> 00:04:23,210
‫means that one document can relate to many other documents.

97
00:04:23,210 --> 00:04:26,680
‫Now this might look a bit abstract without the JSON data

98
00:04:26,680 --> 00:04:28,480
‫but that's actually the purpose here.

99
00:04:28,480 --> 00:04:31,040
‫I just wanna show you a conceptual overview

100
00:04:31,040 --> 00:04:33,759
‫of these different types of relationships.

101
00:04:33,759 --> 00:04:36,872
‫Anyway, in a one to many relationship,

102
00:04:36,872 --> 00:04:40,600
‫one document can relate to hundreds or thousands

103
00:04:40,600 --> 00:04:42,070
‫of other documents.

104
00:04:42,070 --> 00:04:44,788
‫For example, one movie can have thousands of reviews

105
00:04:44,788 --> 00:04:46,710
‫in our application.

106
00:04:46,710 --> 00:04:49,380
‫And so this is not really a one to few

107
00:04:49,380 --> 00:04:51,524
‫but a one to many relationship. Okay?

108
00:04:51,524 --> 00:04:55,616
‫And finally we have the one to a ton relationship.

109
00:04:55,616 --> 00:04:59,720
‫Imagine we wanted to implement some logging functionality

110
00:04:59,720 --> 00:05:03,110
‫in our app. So basically to know exactly what's going on

111
00:05:03,110 --> 00:05:04,870
‫on our server.

112
00:05:04,870 --> 00:05:08,770
‫These logs can then easily grow to millions of documents.

113
00:05:08,770 --> 00:05:11,270
‫And so this is a very typical example

114
00:05:11,270 --> 00:05:14,200
‫of a one to a ton relationship.

115
00:05:14,200 --> 00:05:17,100
‫And the difference between many and a ton is of course

116
00:05:17,100 --> 00:05:20,730
‫a bit fuzzy but just think that if something can grow

117
00:05:20,730 --> 00:05:23,360
‫almost to infinity then it's definitely

118
00:05:23,360 --> 00:05:25,532
‫a one to a ton relationship.

119
00:05:25,532 --> 00:05:28,763
‫So again the one to many relationships

120
00:05:28,763 --> 00:05:31,650
‫are the most important ones to know.

121
00:05:31,650 --> 00:05:34,050
‫By the way, in relational databases

122
00:05:34,050 --> 00:05:37,061
‫there is just one to many without quantifying

123
00:05:37,061 --> 00:05:39,800
‫how much that many actually is.

124
00:05:39,800 --> 00:05:41,800
‫In MongoDB databases though

125
00:05:41,800 --> 00:05:44,010
‫it is an extremely important difference.

126
00:05:44,010 --> 00:05:47,150
‫Because it's one of the factors that we're gonna use

127
00:05:47,150 --> 00:05:49,891
‫to decide if we should denormalize or normalize data

128
00:05:49,891 --> 00:05:53,340
‫as you will learn a bit later.

129
00:05:53,340 --> 00:05:57,181
‫Anyway, the last type of relationship is the many to many

130
00:05:57,181 --> 00:06:00,149
‫where one movie can have many actors.

131
00:06:00,149 --> 00:06:04,876
‫But at the same time one actor can play in many movies.

132
00:06:04,876 --> 00:06:07,910
‫And so here the relationship basically

133
00:06:07,910 --> 00:06:09,630
‫goes in both directions.

134
00:06:09,630 --> 00:06:11,800
‫Where before in the other types

135
00:06:11,800 --> 00:06:13,939
‫it was only in one direction.

136
00:06:13,939 --> 00:06:17,470
‫For example one movie can have many reviews

137
00:06:17,470 --> 00:06:22,450
‫but one specific review is only for that one movie. Right.

138
00:06:22,450 --> 00:06:24,560
‫And the same goes for the awards.

139
00:06:24,560 --> 00:06:27,506
‫So one specific award like for the best actor

140
00:06:27,506 --> 00:06:30,914
‫goes to only one movie not multiple ones.

141
00:06:30,914 --> 00:06:35,580
‫But with movies and actors it is indeed different.

142
00:06:35,580 --> 00:06:39,250
‫So again one movie stars many actors

143
00:06:39,250 --> 00:06:41,920
‫but one actor plays in many movies

144
00:06:41,920 --> 00:06:45,020
‫and so it's a many to many relationship.

145
00:06:45,020 --> 00:06:46,170
‫Okay.

146
00:06:46,170 --> 00:06:49,060
‫So keep all this in mind as we now move forward

147
00:06:49,060 --> 00:06:50,063
‫in this lecture.

148
00:06:51,760 --> 00:06:54,870
‫And probably the most important aspect that we need to learn

149
00:06:54,870 --> 00:06:57,900
‫about MongoDB databases is referencing

150
00:06:57,900 --> 00:07:00,340
‫and embedding two datasets.

151
00:07:00,340 --> 00:07:02,350
‫And we actually already talked a little bit

152
00:07:02,350 --> 00:07:05,050
‫about this before but let's review it here

153
00:07:05,050 --> 00:07:07,311
‫and go a bit deeper also.

154
00:07:07,311 --> 00:07:09,962
‫So each time we have two related datasets

155
00:07:09,962 --> 00:07:13,829
‫we can either represent that related data in a referenced

156
00:07:13,829 --> 00:07:18,829
‫or normalized form or in an embedded or denormalized form.

157
00:07:18,842 --> 00:07:22,190
‫And I keep using the two related terms together

158
00:07:22,190 --> 00:07:24,340
‫like referencing and normalizing

159
00:07:24,340 --> 00:07:26,460
‫because you will see them both being used

160
00:07:26,460 --> 00:07:29,510
‫and so it's important that you know all of them.

161
00:07:29,510 --> 00:07:33,070
‫Anyway, in the referenced form we keep two related

162
00:07:33,070 --> 00:07:35,826
‫datasets and all the documents separated.

163
00:07:35,826 --> 00:07:39,589
‫So again all the data is nicely separated

164
00:07:39,589 --> 00:07:43,275
‫which is exactly what normalized means.

165
00:07:43,275 --> 00:07:47,110
‫So continuing the movie database example from before,

166
00:07:47,110 --> 00:07:50,750
‫we would have one movie document and one actor document

167
00:07:50,750 --> 00:07:54,870
‫for each actor. Now how would we then make the connection

168
00:07:54,870 --> 00:07:58,710
‫between the movie and the actors so that later in our app

169
00:07:58,710 --> 00:08:02,150
‫we can show which actors played in a particular movie?

170
00:08:02,150 --> 00:08:05,210
‫Because if they are all completely different documents,

171
00:08:05,210 --> 00:08:09,438
‫the movie has no way of knowing about the actors. Right.

172
00:08:09,438 --> 00:08:12,253
‫Well that's where the IDs come in.

173
00:08:12,253 --> 00:08:16,460
‫So we use the actor IDs in order to create references

174
00:08:16,460 --> 00:08:18,020
‫on the movie document.

175
00:08:18,020 --> 00:08:20,981
‫Effectively connecting movies with actors.

176
00:08:20,981 --> 00:08:24,760
‫So you see that in a movie document we have an array

177
00:08:24,760 --> 00:08:27,198
‫where we store the IDs of all the actors

178
00:08:27,198 --> 00:08:30,760
‫so that when we request data about a certain movie

179
00:08:30,760 --> 00:08:34,553
‫we can easily identify its actors. Does that make sense?

180
00:08:34,553 --> 00:08:38,830
‫Now this type of referencing is called child referencing

181
00:08:38,830 --> 00:08:41,480
‫because it's the parent, in this case the movie,

182
00:08:41,480 --> 00:08:45,104
‫that references its children, in this case the actors.

183
00:08:45,104 --> 00:08:48,841
‫So we're really creating some sort of hierarchy here. Right.

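To make the referenced form a bit more concrete, here is a minimal sketch that uses plain Python dictionaries in place of MongoDB documents. The field names (`_id`, `name`, `actors`) and the sample values are assumptions for illustration, not the data shown on the slide.

```python
# Hypothetical actor documents, keyed by ID (plain dicts, no database needed).
actors = {
    "a1": {"_id": "a1", "name": "Actor One"},
    "a2": {"_id": "a2", "name": "Actor Two"},
}

# Child referencing: the parent (movie) stores the IDs of its children (actors).
movie = {"_id": "m1", "name": "Example Movie", "actors": ["a1", "a2"]}


def actors_of(movie_doc, actor_collection):
    """Resolve the ID references on a movie into full actor documents."""
    return [actor_collection[actor_id] for actor_id in movie_doc["actors"]]


cast_names = [actor["name"] for actor in actors_of(movie, actors)]
```

The movie itself holds no actor data, only the IDs, so the actor documents stay separate and can be looked up on their own.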
184
00:08:48,841 --> 00:08:51,870
‫Now there is also parent referencing

185
00:08:51,870 --> 00:08:54,390
‫and we are gonna talk about that a bit later.

186
00:08:54,390 --> 00:08:58,710
‫And by the way, in relational databases, data is usually

187
00:08:58,710 --> 00:09:01,958
‫represented in normalized form like this.

188
00:09:01,958 --> 00:09:05,490
‫But in a NoSQL database like MongoDB

189
00:09:05,490 --> 00:09:09,700
‫we can denormalize data into a denormalized form

190
00:09:09,700 --> 00:09:12,450
‫simply by embedding the related document

191
00:09:12,450 --> 00:09:15,330
‫right into the main document.

192
00:09:15,330 --> 00:09:18,330
‫So now we have all the relevant data about actors

193
00:09:18,330 --> 00:09:22,060
‫right inside one main movie document, without the need

194
00:09:22,060 --> 00:09:25,700
‫for separate documents, collections, and IDs.

195
00:09:25,700 --> 00:09:30,088
‫So again, if we choose to denormalize or to embed our data

196
00:09:30,088 --> 00:09:34,280
‫we will have one main document containing all the main data

197
00:09:34,280 --> 00:09:37,197
‫as well as the related data. All right.

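As a sketch of the embedded form, again with plain Python dictionaries standing in for MongoDB documents. All field names and values here are made up for illustration.

```python
# Embedded (denormalized) form: the actor data lives inside the movie document.
movie_embedded = {
    "_id": "m1",
    "name": "Example Movie",
    "actors": [
        {"name": "Actor One", "age": 40},
        {"name": "Actor Two", "age": 35},
    ],
}

# Reading the one movie document already yields everything about its actors.
cast_names = [actor["name"] for actor in movie_embedded["actors"]]
```

There are no actor IDs and no separate actor documents in this variant, which is exactly the trade-off discussed in this part of the lecture.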
198
00:09:37,197 --> 00:09:40,340
‫And the result of this is that our application

199
00:09:40,340 --> 00:09:43,330
‫will need fewer queries to the database.

200
00:09:43,330 --> 00:09:45,000
‫Because we can get all the data

201
00:09:45,000 --> 00:09:48,074
‫about movies and actors all at the same time

202
00:09:48,074 --> 00:09:51,650
‫which will of course increase our performance.

203
00:09:51,650 --> 00:09:53,840
‫Now the downside here is of course

204
00:09:53,840 --> 00:09:57,530
‫that we can't really query the embedded data on its own.

205
00:09:57,530 --> 00:10:00,810
‫And so if that's a requirement for the application

206
00:10:00,810 --> 00:10:03,790
‫you would have to choose a normalized design.

207
00:10:03,790 --> 00:10:06,280
‫And since we're talking about pros and cons

208
00:10:06,280 --> 00:10:09,030
‫of the denormalized form, let's do the same

209
00:10:09,030 --> 00:10:11,490
‫for the normalized design.

210
00:10:11,490 --> 00:10:13,920
‫And basically it's kind of the opposite

211
00:10:13,920 --> 00:10:15,770
‫of what we just talked about.

212
00:10:15,770 --> 00:10:18,319
‫So there is an improvement in performance

213
00:10:18,319 --> 00:10:22,390
‫when we often need to query the related data on its own

214
00:10:22,390 --> 00:10:25,740
‫because we then can just query the data that we need

215
00:10:25,740 --> 00:10:28,490
‫and not always movies and actors together.

216
00:10:28,490 --> 00:10:31,640
‫But on the other hand, when we need to actually query

217
00:10:31,640 --> 00:10:33,906
‫movies and actors together, we then are gonna need

218
00:10:33,906 --> 00:10:36,396
‫more queries to the database.

219
00:10:36,396 --> 00:10:40,010
‫So first the query for the movie and then from there

220
00:10:40,010 --> 00:10:42,610
‫we will also need a query for the actor

221
00:10:42,610 --> 00:10:44,989
‫and that is of course worse for performance.

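The query-count trade-off can be sketched with a tiny stand-in for a collection that simply counts how often it is queried. `CountingCollection` and its `find_one` / `find_many` methods are hypothetical helpers written for this example, not a real MongoDB driver API.

```python
class CountingCollection:
    """In-memory stand-in for a collection that counts the queries it receives."""

    def __init__(self, docs):
        self.docs = {doc["_id"]: doc for doc in docs}
        self.queries = 0

    def find_one(self, doc_id):
        self.queries += 1
        return self.docs[doc_id]

    def find_many(self, doc_ids):
        self.queries += 1  # one query that fetches several documents by ID
        return [self.docs[doc_id] for doc_id in doc_ids]


# Referenced design: one query for the movie, then one for its actors.
movies_ref = CountingCollection(
    [{"_id": "m1", "name": "Example Movie", "actors": ["a1", "a2"]}]
)
actors = CountingCollection(
    [{"_id": "a1", "name": "Actor One"}, {"_id": "a2", "name": "Actor Two"}]
)
movie = movies_ref.find_one("m1")
cast = actors.find_many(movie["actors"])
referenced_trips = movies_ref.queries + actors.queries

# Embedded design: the single movie query already contains the actors.
movies_emb = CountingCollection(
    [{"_id": "m1", "name": "Example Movie",
      "actors": [{"name": "Actor One"}, {"name": "Actor Two"}]}]
)
movie_with_cast = movies_emb.find_one("m1")
embedded_trips = movies_emb.queries
```

In this toy setup the referenced design takes two trips and the embedded design takes one, but only the referenced design lets us fetch an actor without touching a movie.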
222
00:10:44,989 --> 00:10:48,328
‫So when designing your database, this is the kind of stuff

223
00:10:48,328 --> 00:10:50,569
‫that you need to keep in mind. All right.

224
00:10:50,569 --> 00:10:54,900
‫And now just as a side note, we could of course begin

225
00:10:54,900 --> 00:10:56,994
‫our thought process with denormalized data

226
00:10:56,994 --> 00:10:59,670
‫and then come to the conclusion

227
00:10:59,670 --> 00:11:01,692
‫that it's best to actually normalize the data.

228
00:11:01,692 --> 00:11:05,043
‫So when thinking about our data model

229
00:11:05,043 --> 00:11:08,378
‫this way of organizing data works of course in both ways.

230
00:11:08,378 --> 00:11:12,570
‫Now, how do we actually decide if we should

231
00:11:12,570 --> 00:11:15,330
‫normalize or denormalize the data?

232
00:11:15,330 --> 00:11:18,033
‫Well that's exactly what we're gonna learn next.

233
00:11:19,690 --> 00:11:22,974
‫So when we have two related datasets, we have to decide

234
00:11:22,974 --> 00:11:26,180
‫if we're gonna embed the datasets or if we're gonna

235
00:11:26,180 --> 00:11:27,693
‫keep them separated and reference them

236
00:11:27,693 --> 00:11:30,400
‫from one dataset to the other.

237
00:11:30,400 --> 00:11:32,730
‫And I kind of developed this decision framework

238
00:11:32,730 --> 00:11:36,070
‫which I'm gonna show you where we use three criteria

239
00:11:36,070 --> 00:11:37,770
‫to take that decision.

240
00:11:37,770 --> 00:11:40,450
‫First we look at the type of relationship

241
00:11:40,450 --> 00:11:42,800
‫that exists between datasets.

242
00:11:42,800 --> 00:11:45,856
‫Second we try to determine the data access pattern

243
00:11:45,856 --> 00:11:50,150
‫of the dataset that we want to either embed or reference.

244
00:11:50,150 --> 00:11:53,320
‫And this just means to analyze how often data is read

245
00:11:53,320 --> 00:11:55,282
‫and written in that dataset.

246
00:11:55,282 --> 00:11:59,025
‫Then we also look at something that I call data closeness.

247
00:11:59,025 --> 00:12:02,940
‫And data closeness is a term that I actually just made up

248
00:12:02,940 --> 00:12:06,870
‫but what it means is how much the data is really related

249
00:12:06,870 --> 00:12:10,109
‫and how we want to query the data from the database.

250
00:12:10,109 --> 00:12:11,850
‫And this will make more sense

251
00:12:11,850 --> 00:12:14,180
‫when we talk about it in a moment.

252
00:12:14,180 --> 00:12:17,330
‫Now to actually take the decision, we need to combine

253
00:12:17,330 --> 00:12:19,350
‫all of these three criteria

254
00:12:19,350 --> 00:12:21,792
‫and not just use one of them in isolation.

255
00:12:21,792 --> 00:12:25,230
‫So for example, just because criterion number one

256
00:12:25,230 --> 00:12:28,380
‫says to embed, it doesn't mean that we don't need to look

257
00:12:28,380 --> 00:12:30,425
‫at the other two criteria.

258
00:12:30,425 --> 00:12:34,124
‫All right, and let's start with the relationship type.

259
00:12:34,124 --> 00:12:37,968
‫So usually when we have a one to few relationship,

260
00:12:37,968 --> 00:12:40,700
‫we will embed the related dataset

261
00:12:40,700 --> 00:12:43,430
‫into the main dataset just like we learned

262
00:12:43,430 --> 00:12:45,860
‫in the last slide. Right.

263
00:12:45,860 --> 00:12:49,110
‫Now in a one to many relationship, things are a bit

264
00:12:49,110 --> 00:12:52,880
‫more fuzzy, so it's okay to either embed or reference.

265
00:12:52,880 --> 00:12:55,140
‫In that case we will have to decide

266
00:12:55,140 --> 00:12:57,304
‫according to the other two criteria.

267
00:12:57,304 --> 00:12:59,825
‫Now on the other hand, in a one to a ton

268
00:12:59,825 --> 00:13:03,894
‫or a many to many relationship, we usually reference

269
00:13:03,894 --> 00:13:06,811
‫the data. That's because if we actually did embed

270
00:13:06,811 --> 00:13:10,004
‫in this case, we could quickly create way too large documents,

271
00:13:10,004 --> 00:13:14,902
‫even potentially surpassing the maximum of 16 megabytes.

272
00:13:14,902 --> 00:13:18,214
‫And so the solution for that is of course referencing

273
00:13:18,214 --> 00:13:22,090
‫or normalizing the data. And as a quick example,

274
00:13:22,090 --> 00:13:24,142
‫let's say that in our movie database example

275
00:13:24,142 --> 00:13:27,830
‫we have around 100 images associated with each movie.

276
00:13:27,830 --> 00:13:30,874
‫So we could say it's a one to many relationship,

277
00:13:30,874 --> 00:13:34,230
‫but are we gonna embed the dataset or should we rather

278
00:13:34,230 --> 00:13:37,523
‫reference it here? Well, we don't really know.

279
00:13:37,523 --> 00:13:40,571
‫So let's take a look at the other two criteria.

280
00:13:40,571 --> 00:13:44,420
‫So the second one is about data access patterns

281
00:13:44,420 --> 00:13:46,290
‫which is just a fancy description

282
00:13:46,290 --> 00:13:48,242
‫for evaluating whether a certain dataset

283
00:13:48,242 --> 00:13:51,559
‫is mostly written to or mostly read from.

284
00:13:51,559 --> 00:13:55,760
‫So if the dataset that we're deciding about is mostly read

285
00:13:55,760 --> 00:13:58,179
‫and the data is not updated a lot

286
00:13:58,179 --> 00:14:01,620
‫then we should probably embed that dataset.

287
00:14:01,620 --> 00:14:04,690
‫So a high read/write ratio just means

288
00:14:04,690 --> 00:14:07,100
‫that there is a lot more reading than writing.

289
00:14:07,100 --> 00:14:11,100
‫And again, a dataset like that is a good candidate

290
00:14:11,100 --> 00:14:11,983
‫for embedding.

291
00:14:12,830 --> 00:14:15,980
‫The reason for this is that by embedding we only need

292
00:14:15,980 --> 00:14:18,379
‫one trip to the database per query.

293
00:14:18,379 --> 00:14:22,197
‫While for referencing we need two trips. Right.

294
00:14:22,197 --> 00:14:25,660
‫So if we embed data that is read a lot,

295
00:14:25,660 --> 00:14:28,383
‫in each query we save one trip to the database

296
00:14:28,383 --> 00:14:32,147
‫making the entire process way more performant.

297
00:14:32,147 --> 00:14:35,260
‫So I think that our movie image example

298
00:14:35,260 --> 00:14:38,320
‫would actually be a good candidate for embedding.

299
00:14:38,320 --> 00:14:41,543
‫Because once the 100 images are saved to the database

300
00:14:41,543 --> 00:14:43,920
‫they are not really updated anymore

301
00:14:43,920 --> 00:14:46,930
‫because there is not really anything to update

302
00:14:46,930 --> 00:14:50,057
‫about an image. Right, so it's all about reading

303
00:14:50,057 --> 00:14:52,563
‫and therefore based on this criterion

304
00:14:52,563 --> 00:14:55,501
‫we would embed the image documents.

305
00:14:55,501 --> 00:14:59,092
‫Now on the other hand, if our data is updated a lot

306
00:14:59,092 --> 00:15:03,118
‫then we should consider referencing or normalizing the data.

307
00:15:03,118 --> 00:15:06,700
‫That's because it's more work for the database engine

308
00:15:06,700 --> 00:15:08,870
‫to update an embedded document

309
00:15:08,870 --> 00:15:11,600
‫than a simpler standalone document.

310
00:15:11,600 --> 00:15:13,980
‫And since our main goal is performance,

311
00:15:13,980 --> 00:15:15,917
‫we just normalize the dataset.

312
00:15:15,917 --> 00:15:19,653
‫In our example, let's say each movie has many reviews

313
00:15:19,653 --> 00:15:23,284
‫and each review can be marked as helpful by the user.

314
00:15:23,284 --> 00:15:27,560
‫So each time someone clicks on "this review was helpful"

315
00:15:27,560 --> 00:15:29,780
‫in our application, we need to update

316
00:15:29,780 --> 00:15:31,740
‫the corresponding document.

317
00:15:31,740 --> 00:15:35,030
‫And this means that the data can change all the time

318
00:15:35,030 --> 00:15:38,520
‫and so this is a great candidate for normalizing.

319
00:15:38,520 --> 00:15:41,420
‫Again because we don't want to be querying the movies

320
00:15:41,420 --> 00:15:45,190
‫all the time if all we really wanna update is the reviews

321
00:15:45,190 --> 00:15:47,230
‫by marking them as helpful.

322
00:15:47,230 --> 00:15:49,464
‫Okay, does that make sense?

323
00:15:49,464 --> 00:15:53,500
‫And finally, the last criterion I call data closeness,

324
00:15:53,500 --> 00:15:56,320
‫which is just like a measure for how much the data

325
00:15:56,320 --> 00:15:59,469
‫is related. So if the two datasets really

326
00:15:59,469 --> 00:16:02,890
‫intrinsically belong together then they should

327
00:16:02,890 --> 00:16:05,880
‫probably be embedded into one another.

328
00:16:05,880 --> 00:16:10,440
‫In our example, all users can have many email addresses

329
00:16:10,440 --> 00:16:13,780
‫on their account and since they are so intrinsically

330
00:16:13,780 --> 00:16:17,190
‫connected to the user, there is no doubt that emails

331
00:16:17,190 --> 00:16:19,920
‫should be embedded into the user document.

332
00:16:19,920 --> 00:16:23,830
‫Now if we frequently need to query both datasets

333
00:16:23,830 --> 00:16:26,388
‫on their own then that's a very good reason

334
00:16:26,388 --> 00:16:29,696
‫to normalize the data into two separate datasets.

335
00:16:29,696 --> 00:16:32,790
‫Even if they are closely related.

336
00:16:32,790 --> 00:16:35,227
‫So imagine that in our app we have a quiz

337
00:16:35,227 --> 00:16:40,227
‫where users have to identify a movie based on images.

338
00:16:40,440 --> 00:16:43,080
‫This means that we're gonna query a lot of images

339
00:16:43,080 --> 00:16:44,180
‫on their own.

340
00:16:44,180 --> 00:16:47,756
‫So without necessarily querying for the movies themselves.

341
00:16:47,756 --> 00:16:50,640
‫And so if we apply this third criterion,

342
00:16:50,640 --> 00:16:54,137
‫we come to the conclusion that we should actually normalize

343
00:16:54,137 --> 00:16:56,759
‫the image dataset. All right.

344
00:16:56,759 --> 00:17:00,770
‫Because again, if we implement this quiz functionality,

345
00:17:00,770 --> 00:17:04,057
‫images are gonna be queried on their own all the time.

346
00:17:04,057 --> 00:17:07,422
‫So, all of this shows that we should really look

347
00:17:07,422 --> 00:17:09,850
‫at all three criteria together

348
00:17:09,850 --> 00:17:12,700
‫rather than just one of them in isolation.

349
00:17:12,700 --> 00:17:15,841
‫Because that might lead to less optimal decisions.

350
00:17:15,841 --> 00:17:18,908
‫And I say less optimal instead of wrong

351
00:17:18,908 --> 00:17:21,766
‫because there are not really completely right

352
00:17:21,766 --> 00:17:25,262
‫or completely wrong ways of modeling our data.

353
00:17:25,262 --> 00:17:28,970
‫There are no hard rules; these are just like guidelines

354
00:17:28,970 --> 00:17:31,380
‫that you can follow to find the probably

355
00:17:31,380 --> 00:17:33,860
‫most correct way of structuring your data.

356
00:17:33,860 --> 00:17:37,077
‫But again, it's hard to be really really wrong.

357
00:17:37,077 --> 00:17:38,253
‫Okay?

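One possible way to write these guidelines down is the small sketch below. The function name, its parameters, and especially the order in which the three criteria are checked are assumptions made for this example; the lecture only says to weigh all three together, so treat this as a rough heuristic rather than a rule.

```python
def suggest_design(relationship, mostly_read, queried_on_its_own):
    """Suggest "embed" or "reference" for a related dataset.

    relationship: "few", "many", "ton", or "many-to-many"
    mostly_read: True if the data is read far more often than it is written
    queried_on_its_own: True if the data is often needed without its parent
    """
    # Criterion 1 (relationship type): huge or two-way relationships are referenced.
    if relationship in ("ton", "many-to-many"):
        return "reference"
    # Criterion 3 (data closeness): data often queried alone is kept separate.
    if queried_on_its_own:
        return "reference"
    # Criterion 2 (access pattern): frequently updated data is kept separate.
    if not mostly_read:
        return "reference"
    return "embed"


# The examples from the lecture, expressed with these hypothetical parameters.
images_plain = suggest_design("many", mostly_read=True, queried_on_its_own=False)
images_quiz = suggest_design("many", mostly_read=True, queried_on_its_own=True)
reviews = suggest_design("many", mostly_read=False, queried_on_its_own=False)
emails = suggest_design("few", mostly_read=True, queried_on_its_own=False)
logs = suggest_design("ton", mostly_read=True, queried_on_its_own=False)
```

With these inputs the movie images come out as embed until the quiz feature makes them queried on their own, which mirrors how the third criterion changed the decision in the lecture.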
358
00:17:39,740 --> 00:17:43,110
‫Now, let's say that we have chosen to normalize

359
00:17:43,110 --> 00:17:44,270
‫our datasets.

360
00:17:44,270 --> 00:17:46,653
‫So in other words to reference data.

361
00:17:46,653 --> 00:17:49,380
‫Then after that we still have to choose

362
00:17:49,380 --> 00:17:52,840
‫between three different types of referencing.

363
00:17:52,840 --> 00:17:55,460
‫Child referencing, parent referencing

364
00:17:55,460 --> 00:17:57,540
‫and two-way referencing.

365
00:17:57,540 --> 00:18:00,767
‫So the first type is child referencing.

366
00:18:00,767 --> 00:18:04,440
‫Which is the referencing type I actually showed you before.

367
00:18:04,440 --> 00:18:05,470
‫Okay?

368
00:18:05,470 --> 00:18:07,850
‫And let's now take the error logging example

369
00:18:07,850 --> 00:18:10,128
‫that I mentioned earlier, where we could potentially

370
00:18:10,128 --> 00:18:13,021
‫have millions of log documents.

371
00:18:13,021 --> 00:18:17,300
‫So in child referencing, we basically keep references

372
00:18:17,300 --> 00:18:20,460
‫to the related child documents in a parent document.

373
00:18:20,460 --> 00:18:22,941
‫And they are usually stored in an array.

374
00:18:22,941 --> 00:18:25,735
‫So you see that each log has an ID

375
00:18:25,735 --> 00:18:29,040
‫and then in the app document there is that array

376
00:18:29,040 --> 00:18:31,358
‫with all of these IDs. Right?

377
00:18:31,358 --> 00:18:34,400
‫However, the problem here is that this array

378
00:18:34,400 --> 00:18:39,320
‫of IDs can become very large if there are lots of children.

379
00:18:39,320 --> 00:18:42,230
‫And this is an anti-pattern in MongoDB.

380
00:18:42,230 --> 00:18:45,156
‫So something that we should avoid at all costs.

381
00:18:45,156 --> 00:18:47,660
‫Also, child referencing makes it

382
00:18:47,660 --> 00:18:51,410
‫so that parents and children are very tightly coupled.

383
00:18:51,410 --> 00:18:54,840
‫Which is not always ideal. But that's exactly

384
00:18:54,840 --> 00:18:57,020
‫why we have parent referencing.

385
00:18:57,020 --> 00:19:00,300
‫So in parent referencing, it actually works

386
00:19:00,300 --> 00:19:01,870
‫the other way around.

387
00:19:01,870 --> 00:19:05,570
‫So here in each child document we keep a reference

388
00:19:05,570 --> 00:19:07,430
‫to the parent element.

389
00:19:07,430 --> 00:19:10,267
‫Hence the name parent referencing.

390
00:19:10,267 --> 00:19:13,890
‫In this example the app ID is 23

391
00:19:13,890 --> 00:19:16,640
‫and so in each log there is the app field

392
00:19:16,640 --> 00:19:18,990
‫with the 23 ID in it.

393
00:19:18,990 --> 00:19:21,660
‫So that the child always knows its parent.

394
00:19:21,660 --> 00:19:24,920
‫And so in this case the parent actually knows nothing

395
00:19:24,920 --> 00:19:26,080
‫about the children.

396
00:19:26,080 --> 00:19:28,768
‫Not who they are and not how many they are.

397
00:19:28,768 --> 00:19:32,890
‫So, they are way more isolated and more standalone.

398
00:19:32,890 --> 00:19:35,326
‫And that can sometimes be beneficial.
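
To make the two shapes a bit more concrete, here is a minimal sketch of the same app and logs modeled both ways. It uses plain in-memory objects rather than a real MongoDB driver, and the field names (`logs`, `app`, `message`), the string IDs, and the helper functions are assumptions made up for illustration; only the app ID 23 comes from the example above.

```typescript
// Child referencing: the parent (app) holds an array of child (log) IDs.
interface AppWithChildRefs { _id: string; name: string; logs: string[] }
interface Log { _id: string; message: string }

// Parent referencing: each child (log) holds the ID of its parent (app).
interface App { _id: string; name: string }
interface LogWithParentRef { _id: string; app: string; message: string }

// Hypothetical sample data for the child-referencing shape.
const appChildRefs: AppWithChildRefs = { _id: "23", name: "My app", logs: ["1", "2"] };
const logs: Log[] = [
  { _id: "1", message: "error A" },
  { _id: "2", message: "error B" },
  { _id: "3", message: "belongs to another app" },
];

// Hypothetical sample data for the parent-referencing shape.
const appParentRefs: App = { _id: "23", name: "My app" };
const logsWithParent: LogWithParentRef[] = [
  { _id: "1", app: "23", message: "error A" },
  { _id: "2", app: "23", message: "error B" },
  { _id: "3", app: "42", message: "belongs to another app" },
];

// With child referencing we start from the parent's ID array.
function logsViaChildRefs(app: AppWithChildRefs, all: Log[]): Log[] {
  return all.filter((log) => app.logs.includes(log._id));
}

// With parent referencing the parent knows nothing; we ask the children.
function logsViaParentRefs(appId: string, all: LogWithParentRef[]): LogWithParentRef[] {
  return all.filter((log) => log.app === appId);
}

const viaChildRefs = logsViaChildRefs(appChildRefs, logs);
const viaParentRefs = logsViaParentRefs(appParentRefs._id, logsWithParent);
console.log(viaChildRefs.length, viaParentRefs.length);
```

Notice that in the second shape the app document stays the same size no matter how many logs exist, which is the property the rest of this discussion relies on.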

399
00:19:35,326 --> 00:19:38,880
‫So which of these two types is actually better

400
00:19:38,880 --> 00:19:40,527
‫for this data relationship?

401
00:19:40,527 --> 00:19:42,820
‫And remember how I said that there

402
00:19:42,820 --> 00:19:45,860
‫could be millions of logs? So let's suppose

403
00:19:45,860 --> 00:19:47,652
‫there are two million log documents.

404
00:19:47,652 --> 00:19:51,340
‫In the case of child referencing, that would mean

405
00:19:51,340 --> 00:19:53,209
‫that there are two million ID references

406
00:19:53,209 --> 00:19:55,091
‫in the app document.

407
00:19:55,091 --> 00:19:58,300
‫Right? Now also remember how I said that

408
00:19:58,300 --> 00:20:00,545
‫there is a 16 megabyte limit on documents.

409
00:20:00,545 --> 00:20:04,302
‫So if we kept adding and adding these child IDs

410
00:20:04,302 --> 00:20:06,716
‫into the array on the parent, then we would

411
00:20:06,716 --> 00:20:09,575
‫pretty quickly hit that 16 megabyte limit

412
00:20:09,575 --> 00:20:11,772
‫that each BSON document can hold.

413
00:20:11,772 --> 00:20:14,702
‫Simply because that array will grow so much.

414
00:20:14,702 --> 00:20:17,210
‫So that's not really gonna work.

415
00:20:17,210 --> 00:20:18,510
‫Is it?
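
As a rough back-of-the-envelope check, here is a small sketch that estimates how many bytes an array of two million ID references would take. It assumes each reference is a 12-byte ObjectId and that the limit is 16 * 1024 * 1024 bytes. It also assumes the BSON layout where an array is stored like a document keyed by the index written as a string ("0", "1", ...), so each element costs a type byte, the key plus a terminating zero byte, and the value. The function name and constants are made up for this illustration, and the rest of the parent document is ignored.

```typescript
const OBJECT_ID_BYTES = 12;                       // size of one ObjectId value
const MAX_BSON_DOCUMENT_BYTES = 16 * 1024 * 1024; // assumed 16 megabyte limit

// Estimate the encoded size of a BSON array holding `count` ObjectIds.
function estimateIdArrayBytes(count: number): number {
  // Bytes spent on the index keys "0", "1", ... including the trailing NUL.
  let keyBytes = 0;
  let lower = 0;
  for (let digits = 1, upper = 10; lower < count; digits++, upper *= 10) {
    const indicesWithThisManyDigits = Math.min(count, upper) - lower;
    keyBytes += indicesWithThisManyDigits * (digits + 1);
    lower = upper;
  }
  const typeBytes = count;                    // one type byte per element
  const valueBytes = count * OBJECT_ID_BYTES; // the ObjectIds themselves
  const framingBytes = 4 + 1;                 // int32 length prefix + final NUL
  return typeBytes + keyBytes + valueBytes + framingBytes;
}

const estimated = estimateIdArrayBytes(2_000_000);
console.log(estimated, estimated > MAX_BSON_DOCUMENT_BYTES);
```

Under these assumptions the array alone comes out at around 40 million bytes, so it is well over the limit before the parent document stores anything else.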

416
00:20:18,510 --> 00:20:20,590
‫On the other hand with parent referencing

417
00:20:20,590 --> 00:20:22,990
‫that problem is not gonna happen.

418
00:20:22,990 --> 00:20:25,570
‫We will simply have two million log documents

419
00:20:25,570 --> 00:20:30,540
‫just like before, but each of them holds the ID of its parent.

420
00:20:30,540 --> 00:20:33,098
‫But there is no array that will grow indefinitely

421
00:20:33,098 --> 00:20:35,740
‫and therefore parent referencing

422
00:20:35,740 --> 00:20:38,443
‫would be the best solution here.

423
00:20:39,380 --> 00:20:41,901
‫So the conclusion of all this is that in general

424
00:20:41,901 --> 00:20:44,385
‫child referencing is best used

425
00:20:44,385 --> 00:20:48,008
‫for one to a few relationships, where we know beforehand

426
00:20:48,008 --> 00:20:51,118
‫that the array of child references won't grow that much.

427
00:20:51,118 --> 00:20:54,573
‫On the other hand, parent referencing is best used

428
00:20:54,573 --> 00:20:58,690
‫for one to many and one to a ton relationships

429
00:20:58,690 --> 00:21:00,927
‫like this one. Okay?

430
00:21:00,927 --> 00:21:04,610
‫So again always keep in mind that one of the most

431
00:21:04,610 --> 00:21:07,920
‫important principles of MongoDB data modeling

432
00:21:07,920 --> 00:21:11,900
‫is that arrays should never be allowed to grow indefinitely.

433
00:21:11,900 --> 00:21:15,420
‫In order to never break that 16 megabyte limit.

434
00:21:15,420 --> 00:21:18,170
‫We also don't want to send our users an array

435
00:21:18,170 --> 00:21:20,730
‫with thousands of IDs each time

436
00:21:20,730 --> 00:21:24,340
‫they request a parent dataset. Okay?

437
00:21:24,340 --> 00:21:26,900
‫So did this logic make sense to you?

438
00:21:26,900 --> 00:21:29,660
‫Then let's move on to the third type of referencing

439
00:21:29,660 --> 00:21:31,870
‫which is two-way referencing.

440
00:21:31,870 --> 00:21:34,395
‫And this time with the movie and actor example

441
00:21:34,395 --> 00:21:36,380
‫I showed you when we talked about

442
00:21:36,380 --> 00:21:39,364
‫many to many relationships. Remember that?

443
00:21:39,364 --> 00:21:42,229
‫So again, each movie has many actors

444
00:21:42,229 --> 00:21:44,880
‫and each actor plays in many movies.

445
00:21:44,880 --> 00:21:48,464
‫And so that's a typical many to many relationship.

446
00:21:48,464 --> 00:21:52,100
‫And we usually use this two-way referencing to design

447
00:21:52,100 --> 00:21:55,346
‫many to many relationships. And it works like this:

448
00:21:55,346 --> 00:21:59,370
‫in each movie we will keep references to all the actors

449
00:21:59,370 --> 00:22:03,980
‫that star in that movie. So a bit like in child referencing.

450
00:22:03,980 --> 00:22:07,000
‫However, at the same time, in each actor

451
00:22:07,000 --> 00:22:09,570
‫we also keep references to all the movies

452
00:22:09,570 --> 00:22:11,660
‫that the actor played in.

453
00:22:11,660 --> 00:22:15,120
‫So movies and actors are connected in both directions.

454
00:22:15,120 --> 00:22:17,900
‫And therefore the name two-way referencing.

455
00:22:17,900 --> 00:22:19,950
‫And this makes it really easy to search

456
00:22:19,950 --> 00:22:23,290
‫for both movies and actors completely independently.

457
00:22:23,290 --> 00:22:25,910
‫While also making it easy to find the actors

458
00:22:25,910 --> 00:22:29,029
‫associated with each movie and the movies associated

459
00:22:29,029 --> 00:22:30,383
‫with each actor.
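
Here is a minimal sketch of what two-way referencing could look like, again with plain in-memory objects instead of a real database. The field names (`actors`, `movies`), the placeholder titles and names, and the `link` helper are assumptions invented for this illustration.

```typescript
// Each side keeps an array of IDs pointing at the other side.
interface Movie { _id: string; title: string; actors: string[] }
interface Actor { _id: string; name: string; movies: string[] }

// Hypothetical helper: connect a movie and an actor in both directions,
// skipping IDs that are already present so repeated calls are harmless.
function link(movie: Movie, actor: Actor): void {
  if (!movie.actors.includes(actor._id)) movie.actors.push(actor._id);
  if (!actor.movies.includes(movie._id)) actor.movies.push(movie._id);
}

// Placeholder data, not taken from the lecture slides.
const movieA: Movie = { _id: "m1", title: "Movie A", actors: [] };
const movieB: Movie = { _id: "m2", title: "Movie B", actors: [] };
const actorX: Actor = { _id: "a1", name: "Actor X", movies: [] };
const actorY: Actor = { _id: "a2", name: "Actor Y", movies: [] };

link(movieA, actorX);
link(movieA, actorY);
link(movieB, actorX);

// Both directions can now be answered from a single document.
console.log(movieA.actors); // IDs of the actors in Movie A
console.log(actorX.movies); // IDs of the movies Actor X played in
```

The trade-off in this sketch is that every new connection has to be written in two places, which is why the helper updates both documents together.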

460
00:22:31,623 --> 00:22:32,560
‫(deep breath)

461
00:22:32,560 --> 00:22:34,747
‫This was quite a long lecture indeed.

462
00:22:34,747 --> 00:22:38,030
‫With a lot of new concepts and principles

463
00:22:38,030 --> 00:22:40,220
‫and guidelines to remember.

464
00:22:40,220 --> 00:22:43,460
‫So in order to help you with that, here goes a quick

465
00:22:43,460 --> 00:22:46,650
‫summary and some more general guidelines that you can

466
00:22:46,650 --> 00:22:48,423
‫take a look at when you need it.

467
00:22:49,260 --> 00:22:52,753
‫So the most important principle is: structure your data

468
00:22:52,753 --> 00:22:56,120
‫to match the ways that your application queries

469
00:22:56,120 --> 00:22:57,436
‫and updates data.

470
00:22:57,436 --> 00:23:01,400
‫Or in other words: identify the questions that arise

471
00:23:01,400 --> 00:23:03,784
‫from your application's use cases first, and then model

472
00:23:03,784 --> 00:23:06,634
‫your data so that the questions can get answered

473
00:23:06,634 --> 00:23:08,995
‫in the most efficient way.

474
00:23:08,995 --> 00:23:12,610
‫For example, will I need to query movies and actors

475
00:23:12,610 --> 00:23:16,130
‫always together, or are there scenarios where I only

476
00:23:16,130 --> 00:23:18,041
‫query movies or only actors?

477
00:23:18,041 --> 00:23:20,528
‫Those kinds of questions are what your data model

478
00:23:20,528 --> 00:23:22,930
‫will be based on.

479
00:23:22,930 --> 00:23:26,730
‫In general, always favor embedding unless there is a good

480
00:23:26,730 --> 00:23:28,440
‫reason not to embed.

481
00:23:28,440 --> 00:23:32,513
‫Especially on one to a few and one to many relationships.

482
00:23:33,370 --> 00:23:37,713
‫Next up, a one to a ton or a many to many relationship

483
00:23:37,713 --> 00:23:41,543
‫is usually a good reason to reference instead of embedding.

484
00:23:41,543 --> 00:23:45,734
‫Also, favor referencing when data is updated a lot

485
00:23:45,734 --> 00:23:50,717
‫and if you need to frequently access a dataset on its own.

486
00:23:50,717 --> 00:23:55,340
‫Use embedding when data is mostly read but rarely updated

487
00:23:55,340 --> 00:23:58,469
‫and when two datasets belong intrinsically together.

488
00:23:58,469 --> 00:24:02,840
‫Don't allow arrays to grow indefinitely.

489
00:24:02,840 --> 00:24:05,982
‫Therefore, if you want to normalize, use child referencing

490
00:24:05,982 --> 00:24:09,680
‫for one to many relationships and parent referencing

491
00:24:09,680 --> 00:24:11,856
‫for one to a ton relationships.

492
00:24:11,856 --> 00:24:15,160
‫And finally use two-way referencing

493
00:24:15,160 --> 00:24:17,520
‫for many to many relationships.
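
If it helps to see these guidelines in one place, here is a small sketch that encodes them as a function. This is only a simplification of the summary above, not an official rule set: the type names, the `suggestModel` function, and the choice to treat either "updated a lot" or "accessed on its own" as enough of a reason to reference are all assumptions made for this illustration.

```typescript
type Relationship = "one-to-few" | "one-to-many" | "one-to-ton" | "many-to-many";
type Suggestion = "embed" | "child referencing" | "parent referencing" | "two-way referencing";

interface Usage {
  updatedOften: boolean;    // is the related data updated a lot?
  queriedOnItsOwn: boolean; // do we frequently need it without its parent?
}

function suggestModel(relationship: Relationship, usage: Usage): Suggestion {
  // Many to many: reference in both directions.
  if (relationship === "many-to-many") return "two-way referencing";
  // One to a ton: an ID array on the parent would grow without bound.
  if (relationship === "one-to-ton") return "parent referencing";
  // One to a few / one to many: favor embedding unless there is a good reason not to.
  const goodReasonToReference = usage.updatedOften || usage.queriedOnItsOwn;
  return goodReasonToReference ? "child referencing" : "embed";
}

console.log(suggestModel("one-to-few", { updatedOften: false, queriedOnItsOwn: false }));
console.log(suggestModel("one-to-ton", { updatedOften: true, queriedOnItsOwn: true }));
```

In a real project the answer depends on the actual queries of your application, so treat this as a memory aid rather than something to apply mechanically.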

494
00:24:17,520 --> 00:24:18,720
‫All right?

495
00:24:18,720 --> 00:24:21,202
‫And that pretty much sums it up.

496
00:24:21,202 --> 00:24:23,970
‫I would actually recommend that you watch this video

497
00:24:23,970 --> 00:24:27,144
‫twice if you can, just because of how important

498
00:24:27,144 --> 00:24:30,091
‫this material really is. All right?

499
00:24:30,091 --> 00:24:33,363
‫Anyway, see you in the next video.

