1
00:00:00,090 --> 00:00:07,170
We recently finished talking about how to transform a random variable by shifting it or scaling it by

2
00:00:07,170 --> 00:00:08,430
some constant.

3
00:00:08,430 --> 00:00:13,970
But now we want to talk about how to combine multiple random variables together.

4
00:00:13,980 --> 00:00:20,670
It's a similar concept in the sense that we are applying a kind of transformation to the random variables

5
00:00:20,670 --> 00:00:29,310
by combining them, but instead of applying a simple shift or scale to one random variable, here we

6
00:00:29,310 --> 00:00:36,060
have two different random variables, and we're combining them together as either a sum or a difference.

7
00:00:36,060 --> 00:00:39,150
And this has all kinds of real world applications.

8
00:00:39,150 --> 00:00:44,190
If we know how to do this, we can save ourselves some time, especially if we already have particular

9
00:00:44,190 --> 00:00:47,580
statistics about the individual random variables.

10
00:00:47,790 --> 00:00:52,860
So let's go through an example as we do this so we can see what's actually

11
00:00:52,860 --> 00:00:53,580
going on.

12
00:00:53,820 --> 00:01:00,270
Let's say that we work for a company and we're trying to get an understanding of the total time that

13
00:01:00,270 --> 00:01:04,349
it takes for our company to get our employees paid.

14
00:01:04,349 --> 00:01:09,150
And we realize that what we really have here is two random variables.

15
00:01:09,150 --> 00:01:17,620
We have a set of data about how long it takes the manager of each department to approve the time cards

16
00:01:17,620 --> 00:01:19,440
or the timesheets of our employees.

17
00:01:19,620 --> 00:01:25,890
And then we have a separate set of data about how long it takes our payroll department to process those

18
00:01:25,890 --> 00:01:28,020
approved time cards or timesheets.

19
00:01:28,020 --> 00:01:33,690
But let's say that these two activities in total constitute all of the time that the company spends

20
00:01:33,690 --> 00:01:38,010
to actually get over the finish line of issuing paychecks to our employees.

21
00:01:38,010 --> 00:01:45,330
What we want to do is think about this set of data as the random variable X and this set of data as

22
00:01:45,330 --> 00:01:47,100
the random variable Y.

23
00:01:47,130 --> 00:01:53,640
Of course, in this particular example, what we're interested in is a sum of these two totally separate

24
00:01:53,640 --> 00:01:56,790
random variables because we want to know the total time,

25
00:01:56,790 --> 00:02:02,190
if we add these together, that it takes our company to pay employees, which means we're

26
00:02:02,190 --> 00:02:07,200
going to be looking here, of course, at the sum of two random variables instead of a difference.

27
00:02:07,200 --> 00:02:11,880
Now, before we go forward with actually talking about the math of these calculations, the first thing

28
00:02:11,880 --> 00:02:17,160
we want to say is that it's important that our two variables be independent of one another.

29
00:02:17,160 --> 00:02:23,670
And the reason for that is because our standard deviation value, the standard deviation of the sum

30
00:02:23,670 --> 00:02:26,970
is only going to make sense when we have independent variables.

31
00:02:26,970 --> 00:02:31,200
So these two data sets can't be affected by one another.

32
00:02:31,200 --> 00:02:37,560
But if we're confident that these two random variables X and Y are independent of each other, then

33
00:02:37,560 --> 00:02:44,160
we can find their combination and calculate a mean, variance, and standard deviation for that combination.

34
00:02:44,160 --> 00:02:50,340
We also want to say that it's important that our two different random variables have matching units

35
00:02:50,340 --> 00:02:52,230
or units that make sense together.

36
00:02:52,230 --> 00:02:59,130
For example, if our first random variable is given in units of hours and our second random variable

37
00:02:59,130 --> 00:03:05,820
is given in units of square meters, so we have a measure of time and a measure of area, it doesn't really

38
00:03:05,820 --> 00:03:09,900
make sense to add those two things together or to find their difference.

39
00:03:09,900 --> 00:03:16,560
It's nonsensical to think about adding time to some area measurement. That's probably obvious, but we

40
00:03:16,560 --> 00:03:18,930
just want to make sure that our units are matching.

41
00:03:18,930 --> 00:03:24,720
So if our random variable X here is in hours, we want to make sure that our random variable Y is also

42
00:03:24,720 --> 00:03:26,550
in hours so we can get total hours.

43
00:03:26,550 --> 00:03:31,410
And of course if we were given two different data sets and maybe the first one is in terms of hours

44
00:03:31,410 --> 00:03:36,870
and the second one is in terms of minutes, we would want to convert the second data set from minutes

45
00:03:36,870 --> 00:03:42,120
to hours or the first data set from hours to minutes so that we have matching units and it makes sense

46
00:03:42,120 --> 00:03:44,940
to add the two together or to find their difference.
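
If we wanted to sketch that conversion in Python, it might look something like this. The minute values here are hypothetical, made up only to illustrate the idea, and the variable names are our own assumptions rather than anything from a real data set.

```python
# Hypothetical recordings: approval times in hours, processing times in minutes.
approval_hours = [1, 2, 2, 3]
processing_minutes = [120, 180, 300, 360]  # assumed raw units, illustration only

MINUTES_PER_HOUR = 60

# Convert to hours so both variables share a unit before we combine them.
processing_hours = [m / MINUTES_PER_HOUR for m in processing_minutes]

print(processing_hours)  # [2.0, 3.0, 5.0, 6.0]
```

We could just as well convert the hours to minutes instead; what matters is that both variables end up in the same unit.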

47
00:03:45,150 --> 00:03:51,000
But assuming that we have independence and that our units are matching, let's go ahead now and talk

48
00:03:51,000 --> 00:03:55,290
about the idea of combining these two random variables.

49
00:03:55,290 --> 00:04:01,860
Before we do that, let's look at the data sets individually, starting with the mean of each data set.

50
00:04:02,070 --> 00:04:08,790
We're going to say for now that this is the entire population of data instead of a sample, which means

51
00:04:08,790 --> 00:04:14,400
that the mean is going to be given by the sum of these hourly values here.

52
00:04:14,400 --> 00:04:18,570
One plus two is three, plus two is five, plus three is eight.

53
00:04:18,600 --> 00:04:21,839
Eight divided by the number of data points, four, is two.

54
00:04:21,839 --> 00:04:26,310
So we have a mean of eight over four or two.

55
00:04:26,550 --> 00:04:31,410
And then over here we'll say this is the mean with respect to the variable X, and then over here,

56
00:04:31,410 --> 00:04:38,580
the mean of the variable Y here will be two plus three is five, plus five is ten, plus six is 16.

57
00:04:38,580 --> 00:04:42,390
So we have 16 divided by four or four.

58
00:04:42,390 --> 00:04:48,990
So what we're saying is that maybe our company has four departments in total and the manager of the

59
00:04:48,990 --> 00:04:53,800
first department takes one hour to approve timesheets, the manager of the second department takes 2

60
00:04:53,800 --> 00:04:58,410
hours, the manager of the third department takes 2 hours, and the manager of the fourth department

61
00:04:58,410 --> 00:04:59,880
takes 3 hours

62
00:04:59,950 --> 00:05:01,390
to approve timesheets.

63
00:05:01,390 --> 00:05:08,620
And then our payroll team takes 2 hours, 3 hours, 5 hours and 6 hours to process payroll for departments

64
00:05:08,620 --> 00:05:10,950
one, two, three and four, respectively.

65
00:05:10,960 --> 00:05:18,040
So we can say that on average, managers take 2 hours to approve timesheets and on average payroll takes

66
00:05:18,040 --> 00:05:22,110
4 hours to process payroll for each department.

67
00:05:22,120 --> 00:05:24,910
So we have a mean for each random variable.
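
As a quick sketch, here is the same mean calculation in Python, using the four department values from the example. The list names are our own, chosen for illustration.

```python
x_hours = [1, 2, 2, 3]  # manager approval time per department
y_hours = [2, 3, 5, 6]  # payroll processing time per department

# Population mean: add up the values and divide by the number of data points.
mean_x = sum(x_hours) / len(x_hours)  # 8 / 4
mean_y = sum(y_hours) / len(y_hours)  # 16 / 4

print(mean_x, mean_y)  # 2.0 4.0
```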

68
00:05:24,910 --> 00:05:28,180
And then of course, we could also calculate variance.

69
00:05:28,180 --> 00:05:33,820
So our variance calculation, remember, we take each data point and subtract the mean.

70
00:05:33,820 --> 00:05:36,220
So here we start with this data point of one.

71
00:05:36,220 --> 00:05:38,440
So we say one minus two.

72
00:05:38,470 --> 00:05:44,590
We square that value and then we add to that the same calculation for all of the other departments.

73
00:05:44,590 --> 00:05:54,100
So here we get two minus two quantity squared plus two minus two quantity squared plus three minus two

74
00:05:54,100 --> 00:05:55,180
quantity squared.

75
00:05:55,180 --> 00:06:03,250
And we divide this whole thing by the number of data points that we have, four. And we'll divide by four

76
00:06:03,250 --> 00:06:08,080
because we're going to say that this data represents the entire population of our departments.

77
00:06:08,080 --> 00:06:14,110
If this was a sample of a larger number of departments in our company, we would divide by n minus one

78
00:06:14,110 --> 00:06:16,570
or in this case four, minus one or three.

79
00:06:16,570 --> 00:06:18,880
So we'd be dividing by three instead of by four.

80
00:06:19,000 --> 00:06:23,950
But since we're going to say that this represents population data, we'll divide by n, the total number

81
00:06:23,950 --> 00:06:24,940
of departments here.

82
00:06:25,030 --> 00:06:31,660
So the result then is one fourth times: one minus two is negative one, quantity squared is positive one.

83
00:06:31,660 --> 00:06:32,770
So we have one here.

84
00:06:32,770 --> 00:06:37,000
This will be zero, zero, and then three minus two is one, quantity squared is one.

85
00:06:37,240 --> 00:06:40,360
So we get two divided by four, or

86
00:06:41,150 --> 00:06:42,040
one half.

87
00:06:42,050 --> 00:06:43,970
So our variance then

88
00:06:45,290 --> 00:06:46,760
is one half.

89
00:06:46,910 --> 00:06:54,050
And then standard deviation is the square root of that or one over square root two.

90
00:06:54,080 --> 00:07:00,410
If we do a variance and standard deviation calculation for the variable Y, we start with each data point,

91
00:07:00,410 --> 00:07:08,090
subtract the mean, so we get two minus four quantity squared plus three minus four

92
00:07:08,950 --> 00:07:10,840
quantity squared plus

93
00:07:11,750 --> 00:07:14,330
five minus four quantity squared.

94
00:07:14,330 --> 00:07:18,890
And then finally the last data point six minus four quantity squared.

95
00:07:18,890 --> 00:07:25,310
And then again, we're doing a population calculation, assuming that this is the entire population.

96
00:07:25,310 --> 00:07:30,140
So we divide by the number of data points, which is four, or multiply that by one fourth.

97
00:07:30,140 --> 00:07:34,550
And if we simplify here, we get one fourth, and then we'll get: two

98
00:07:34,550 --> 00:07:39,800
minus four is negative two, quantity squared is positive four; three minus four is negative one,

99
00:07:39,800 --> 00:07:47,480
quantity squared is one; five minus four is one, squared is one; and six minus four is two, squared

100
00:07:47,480 --> 00:07:48,620
is four.

101
00:07:48,710 --> 00:07:52,700
And so we get four plus one plus one plus four is ten.

102
00:07:52,700 --> 00:07:56,840
Ten divided by four is five halves or two and one half.

103
00:07:56,840 --> 00:08:05,870
So we can say then that the variance of Y is five halves, which means that the standard deviation of

104
00:08:05,870 --> 00:08:12,080
Y is the square root of that, or the square root of five halves, the square root of two and one half.
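
To double-check that arithmetic, we could sketch both variance calculations in Python. The `variance` helper and its `population` flag are our own illustrative names, not a library function; the flag just switches the divisor between n and n minus one.

```python
import math

def variance(data, population=True):
    """Average squared deviation from the mean; divide by n or by n - 1."""
    n = len(data)
    mu = sum(data) / n
    squared_devs = sum((v - mu) ** 2 for v in data)
    return squared_devs / (n if population else n - 1)

x_hours = [1, 2, 2, 3]
y_hours = [2, 3, 5, 6]

var_x = variance(x_hours)  # 2 / 4 = 0.5
var_y = variance(y_hours)  # 10 / 4 = 2.5
sd_x = math.sqrt(var_x)    # 1 / sqrt(2), about 0.707
sd_y = math.sqrt(var_y)    # sqrt(5 / 2), about 1.581

# If the four departments were only a sample, the divisor would be n - 1 = 3.
sample_var_x = variance(x_hours, population=False)  # 2 / 3
```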

105
00:08:12,110 --> 00:08:17,330
Now, here's where the idea of combining random variables really comes in handy.

106
00:08:17,330 --> 00:08:22,670
Let's say that we've already been collecting this data about the amount of time that our managers spend

107
00:08:22,670 --> 00:08:28,490
approving timesheets and the amount of time that our payroll team spends processing payroll for each

108
00:08:28,490 --> 00:08:29,150
department.

109
00:08:29,420 --> 00:08:35,870
So assuming we already have all of this information, if we now want to calculate the sum or difference

110
00:08:35,870 --> 00:08:42,440
of these two data sets, we don't have to start over from this raw data and create a whole new data

111
00:08:42,440 --> 00:08:47,660
set for the sum or the difference and then go through the process again of calculating mean, variance,

112
00:08:47,660 --> 00:08:49,880
and standard deviation for the new dataset.

113
00:08:49,880 --> 00:08:56,180
If we already have these summary statistics, then we can go directly to these formulas and use them

114
00:08:56,180 --> 00:08:59,420
to find the mean and variance of the sum or difference of the two variables.

115
00:08:59,420 --> 00:09:05,660
So for instance, we can see here that the mean of the sum is equal to the sum of the means.

116
00:09:05,660 --> 00:09:14,180
In other words, to find the mean of the sum, we simply add the two means together, two plus four, and

117
00:09:14,180 --> 00:09:20,300
we get 6 hours, which means that total payroll processing time between the managers and the payroll

118
00:09:20,300 --> 00:09:25,850
department, if we add up all that data together, our mean is 6 hours per department.

119
00:09:25,850 --> 00:09:32,420
And similarly, if we wanted to find the difference, the mean of the difference, we would simply take

120
00:09:32,420 --> 00:09:34,190
the difference of the means.

121
00:09:34,190 --> 00:09:36,500
So we could take four minus two

122
00:09:37,670 --> 00:09:39,070
to get two.

123
00:09:39,080 --> 00:09:45,980
And we could say that on average, the payroll department spends 2 hours extra per department than the

124
00:09:45,980 --> 00:09:47,820
managers spend per department

125
00:09:47,840 --> 00:09:51,870
on the process of getting the employees paid.

126
00:09:51,890 --> 00:09:56,360
Realize here that there are a few different ways that we can prove this to ourselves, one of which

127
00:09:56,360 --> 00:09:59,990
being that we can actually create a new data set.

128
00:09:59,990 --> 00:10:07,250
We'll call it the sum data set, and we find the new data set for the sum by adding data points from

129
00:10:07,250 --> 00:10:09,600
these original data sets, X and Y.

130
00:10:09,620 --> 00:10:15,800
So if we take one plus two, we get the new data point three in the data set for the sum.

131
00:10:15,800 --> 00:10:24,280
If we take two plus three, we get five, two plus five, we get seven and three plus six, we get nine.

132
00:10:24,290 --> 00:10:30,770
This is now a new data set where we've summed the data points in sets X and Y.

133
00:10:30,770 --> 00:10:36,410
And what we realize here is that if we take the mean of this new data set, we would add 3 to 5 to get

134
00:10:36,410 --> 00:10:40,160
eight plus seven is 15, plus nine is 24.

135
00:10:40,160 --> 00:10:48,230
We get 24 divided by the number of data points in the population, four. 24 divided by four is six, and we

136
00:10:48,230 --> 00:10:50,750
see that we get back to this value

137
00:10:50,750 --> 00:10:52,490
we already found. Same thing here.

138
00:10:52,490 --> 00:10:58,430
If we create a new data set for the difference, we'll say a new data set

139
00:10:59,150 --> 00:10:59,540
for

140
00:10:59,690 --> 00:11:07,010
the difference, if we take two minus one, we get one; three minus two, we get one; five minus two,

141
00:11:07,040 --> 00:11:10,970
we get three; and six minus three, we get three.

142
00:11:10,970 --> 00:11:13,130
Now we have a new data set for the difference.

143
00:11:13,130 --> 00:11:18,440
And if we take the mean here, we get one plus one is two, plus three is five, plus three is eight,

144
00:11:18,440 --> 00:11:20,660
eight divided by four is two.

145
00:11:20,660 --> 00:11:24,800
And we get back to the mean for the difference that we already found.
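
Here is a small Python sketch of that same check, comparing the shortcut with the long way of building the new data sets. The helper and variable names are our own, picked for illustration.

```python
x_hours = [1, 2, 2, 3]  # manager approval time per department
y_hours = [2, 3, 5, 6]  # payroll processing time per department

def mean(data):
    return sum(data) / len(data)

# Shortcut: combine the means that are already known.
mean_sum_shortcut = mean(x_hours) + mean(y_hours)    # 2 + 4 = 6
mean_diff_shortcut = mean(y_hours) - mean(x_hours)   # 4 - 2 = 2

# Long way: build the new data sets department by department.
sum_set = [x + y for x, y in zip(x_hours, y_hours)]   # [3, 5, 7, 9]
diff_set = [y - x for x, y in zip(x_hours, y_hours)]  # [1, 1, 3, 3]

# Both routes agree for the mean.
assert mean(sum_set) == mean_sum_shortcut == 6
assert mean(diff_set) == mean_diff_shortcut == 2
```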

146
00:11:24,800 --> 00:11:30,560
Now, in the same way, if we want to make a variance calculation for the sum or the difference of X

147
00:11:30,560 --> 00:11:37,850
and Y, we don't have to build this new data set for the sum and then start fresh with a variance calculation

148
00:11:37,850 --> 00:11:39,050
for this new data set.

149
00:11:39,050 --> 00:11:44,120
We don't have to find the data set for the difference and then start fresh with a variance calculation

150
00:11:44,120 --> 00:11:45,860
for this data set for the difference.

151
00:11:45,860 --> 00:11:52,100
Instead, we can use the fact that we already have variance values for both X and Y and use these formulas

152
00:11:52,100 --> 00:11:54,230
here to calculate variance.

153
00:11:54,230 --> 00:12:00,440
So the variance for the sum is equal to the sum of the variances.

154
00:12:00,440 --> 00:12:07,940
So we take one half plus five halves is six over two or three, so we get a variance of three.

155
00:12:07,940 --> 00:12:14,330
And then similarly here for the variance of the difference, to find the variance for the difference,

156
00:12:14,330 --> 00:12:16,460
we take the sum of the variances.

157
00:12:16,460 --> 00:12:18,470
So again we get three.
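
We can sketch this in Python too, and it is a good place to see why the independence requirement from earlier matters. As an assumption of this sketch, we model independence by letting every approval time be equally likely to pair with every processing time, which gives 16 pairings; that model reproduces the variance of three for both the sum and the difference. If we instead pair the values department by department, the way the mean check did, the larger X values line up with the larger Y values in this small table, so that pairing is not independent and the directly computed variances come out to five and one instead.

```python
from fractions import Fraction
from itertools import product

x_hours = [1, 2, 2, 3]
y_hours = [2, 3, 5, 6]

def pvariance(data):
    """Population variance, computed exactly with fractions."""
    n = len(data)
    mu = Fraction(sum(data), n)
    return sum((v - mu) ** 2 for v in data) / n

var_x, var_y = pvariance(x_hours), pvariance(y_hours)  # 1/2 and 5/2

# Shortcut from the formulas: add the variances for the sum AND the difference.
var_combined = var_x + var_y  # 3

# Independence model (an assumption of this sketch): every approval time is
# equally likely to be paired with every processing time, 16 pairings in all.
all_sums = [x + y for x, y in product(x_hours, y_hours)]
all_diffs = [y - x for x, y in product(x_hours, y_hours)]
assert pvariance(all_sums) == pvariance(all_diffs) == var_combined == 3

# Department-by-department pairing, where larger X goes with larger Y.
paired_sums = [x + y for x, y in zip(x_hours, y_hours)]   # [3, 5, 7, 9]
paired_diffs = [y - x for x, y in zip(x_hours, y_hours)]  # [1, 1, 3, 3]
print(pvariance(paired_sums), pvariance(paired_diffs))  # 5 1
```

So the variance shortcut is only as good as the independence assumption behind it, while the shortcut for the mean holds either way.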

158
00:12:18,470 --> 00:12:24,440
Now while we're here, it's important to mention what we just said here: the variance formula that

159
00:12:24,440 --> 00:12:29,090
we used to find the variance for the sum and the variance for the difference was identical.

160
00:12:29,090 --> 00:12:32,810
When we find the variance for the sum, we add the variances.

161
00:12:32,810 --> 00:12:37,100
And you would think that when we find the variance for the difference, we take the difference of the

162
00:12:37,100 --> 00:12:44,510
variances like we did up here with the formulas for the mean, but instead we find the sum of the variances

163
00:12:44,510 --> 00:12:45,050
again.

164
00:12:45,050 --> 00:12:50,030
And that's because both data sets bring their own variances to the table.

165
00:12:50,030 --> 00:12:57,290
And even when we combine them into a new data set for the sum or a new data set for the difference,

166
00:12:57,290 --> 00:13:03,110
we're not eliminating any of the spread of the data set when we do that.

167
00:13:03,110 --> 00:13:09,680
And so the variance for the new data set, whether we're finding the sum or the difference as our combination,

168
00:13:09,680 --> 00:13:15,620
the variance is still going to be found by summing the two variances from the original data sets X and

169
00:13:15,620 --> 00:13:21,020
Y, which means then of course, that our standard deviations for both the sum and difference are also

170
00:13:21,020 --> 00:13:21,950
going to be the same.

171
00:13:21,980 --> 00:13:25,670
The standard deviations will be, of course, the square root of the variance.

172
00:13:25,670 --> 00:13:34,160
So the standard deviation of the sum will be square root 3 hours and the standard deviation of the difference

173
00:13:34,160 --> 00:13:36,770
will be square root 3 hours.

174
00:13:36,800 --> 00:13:44,600
Keep in mind also that it's invalid for us to find the standard deviation of the sum or difference directly.

175
00:13:44,600 --> 00:13:51,440
In other words, notice here in our table of formulas that we are combining the means and the variances,

176
00:13:51,440 --> 00:13:56,600
but we don't have formulas for the standard deviation, and that's because we can't find the standard

177
00:13:56,600 --> 00:13:59,930
deviation of the sum by summing the standard deviations.

178
00:13:59,930 --> 00:14:05,000
And we can't find the standard deviation of the difference by summing the standard deviations or taking

179
00:14:05,000 --> 00:14:06,710
the difference of the standard deviations.

180
00:14:06,710 --> 00:14:13,310
The only way to get to standard deviation for the combination is to first find the variance of the combination

181
00:14:13,310 --> 00:14:19,310
by summing the variances of the original data sets and then to take the square root of variance in order

182
00:14:19,310 --> 00:14:20,660
to get standard deviation.
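
A minimal Python sketch makes the point concrete, assuming the variances of one half and five halves that we found earlier. The variable names are ours, and `sd_naive` is there only to show what not to do.

```python
import math

var_x, var_y = 0.5, 2.5  # variances found earlier for X and Y

# Valid route: add the variances first, then take the square root.
sd_combined = math.sqrt(var_x + var_y)          # sqrt(3), about 1.732 hours

# Invalid route: adding the standard deviations gives a different number.
sd_naive = math.sqrt(var_x) + math.sqrt(var_y)  # about 2.288 hours

print(round(sd_combined, 3), round(sd_naive, 3))  # 1.732 2.288
```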

183
00:14:20,660 --> 00:14:27,110
So the whole takeaway here is that if we find ourselves in a scenario where we already have summary

184
00:14:27,110 --> 00:14:32,360
statistics like these ones, we've been maybe tracking this information across two data sets, X and

185
00:14:32,360 --> 00:14:32,870
Y.

186
00:14:32,870 --> 00:14:38,900
So in this case we've been keeping track of how many hours it takes managers to approve timesheets for

187
00:14:38,900 --> 00:14:44,690
each department, and we've been keeping track of how many hours it takes our payroll team to process

188
00:14:44,690 --> 00:14:50,720
payroll for each department, such that we already have a mean, variance, and standard deviation value

189
00:14:50,720 --> 00:14:54,410
for both of these independent random variables X and Y.

190
00:14:54,440 --> 00:15:01,040
Then when we want to find the sum or difference of these two variables, when we want to find the combination,

191
00:15:01,040 --> 00:15:08,120
what we realize is that we do not have to start from the beginning and combine these two data sets into

192
00:15:08,120 --> 00:15:15,950
a new data set for the sum or a new data set for the difference and start completely from scratch with

193
00:15:15,950 --> 00:15:22,190
a mean, variance, and standard deviation calculation for this new data set or this new data set.

194
00:15:22,190 --> 00:15:28,700
Instead, if we already have this information, we can go straight to these formulas and immediately

195
00:15:28,700 --> 00:15:36,800
and directly make these calculations here to see the mean, variance, and standard deviation of both the

196
00:15:36,800 --> 00:15:43,370
sum of these random variables and the difference of these random variables, which just means that we

197
00:15:43,370 --> 00:15:47,900
want to keep in mind that when we already have this information, we can move quickly to the combination

198
00:15:47,900 --> 00:15:49,430
and save ourselves a lot of time

199
00:15:49,430 --> 00:15:54,380
redoing the math from scratch with a new combined data set.

200
00:15:54,380 --> 00:15:58,400
And of course that can be really helpful when we're trying to answer real world questions

201
00:15:58,740 --> 00:16:04,290
like the one we had here, where we were trying to get a total for the complete number of hours spent

202
00:16:04,290 --> 00:16:10,170
across our entire company collecting all departments together in order to get payroll processed from

203
00:16:10,170 --> 00:16:11,160
beginning to end,

204
00:16:11,160 --> 00:16:18,030
or the difference between these two departments in the amount of time spent on payroll.

