1
00:00:11,110 --> 00:00:17,340
So in this lecture, we will be looking at the intuition behind naive Bayes. Now, in order to understand

2
00:00:17,340 --> 00:00:20,440
naive Bayes, we must first understand Bayes rule.

3
00:00:20,950 --> 00:00:25,930
So this lecture will assume that you have at least a passing knowledge of probability, which you should

4
00:00:25,930 --> 00:00:28,030
as per the instructions for this course.

5
00:00:28,630 --> 00:00:31,030
So let's begin by reviewing Bayes rule.

6
00:00:32,020 --> 00:00:34,900
Suppose that we have two random variables X and Y.

7
00:00:35,260 --> 00:00:39,580
Now, if you're concerned that these are just abstract variables, please do not worry as we will be

8
00:00:39,580 --> 00:00:41,470
doing examples very shortly.

9
00:00:42,310 --> 00:00:45,940
So let's also suppose that we want to know P of Y given X.

10
00:00:46,570 --> 00:00:50,950
Note that this is a conditional distribution because Y is conditioned on X.

11
00:00:51,760 --> 00:00:57,490
Now suppose that we are given P of X given Y and also the marginal distribution P of Y.

12
00:00:58,270 --> 00:01:04,450
If this is the case, then we can express what we want to know in terms of what we do know using Bayes

13
00:01:04,450 --> 00:01:04,959
rule.

14
00:01:05,560 --> 00:01:12,310
So in the numerator we have P of X given Y times P of Y, and in the denominator we have the same thing, except

15
00:01:12,520 --> 00:01:14,770
summed over all possible values of Y.

16
00:01:16,060 --> 00:01:22,300
And just as a reminder, recall that the numerator can be simplified to P of X and Y, while the denominator

17
00:01:22,300 --> 00:01:24,100
can be simplified to P of X.

18
00:01:24,610 --> 00:01:28,720
So that should give you some intuition about why Bayes rule is the way it is.
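
Written out, the rule just described looks like the following. The notation p(y | x) for "P of Y given X" and the primed summation index y' are assumptions about typesetting; the terms themselves come straight from the description above.

```latex
p(y \mid x)
  = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}
  = \frac{p(x, y)}{p(x)}
```

The second equality is the simplification mentioned above: the numerator is the joint p(x, y), and summing the joint over all values of Y leaves the marginal p(x).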

19
00:01:33,350 --> 00:01:37,530
Now, let's discuss Bayes rule in terms of classification for machine learning.

20
00:01:38,360 --> 00:01:42,380
In this case, the input to the model is X, while the target is Y.

21
00:01:42,950 --> 00:01:47,480
So these variables are no longer abstract but represent something we want to measure.

22
00:01:48,200 --> 00:01:51,340
For example, Y could be whether or not an email is spam,

23
00:01:51,380 --> 00:01:53,300
while X could be the email itself.

24
00:01:54,080 --> 00:01:59,210
Note that in this context, because we are not doing a regression, Y is a discrete or categorical

25
00:01:59,210 --> 00:02:00,020
random variable.

26
00:02:00,740 --> 00:02:03,410
As such, although we write P of Y given X,

27
00:02:03,680 --> 00:02:06,770
note that this is not one value, but a whole distribution.

28
00:02:11,440 --> 00:02:17,140
So, for example, if Y can take on the values spam and not spam, then we might find that P of Y

29
00:02:17,140 --> 00:02:23,290
equals spam, given X is equal to zero point three, while P of Y equals not spam, given X is equal

30
00:02:23,290 --> 00:02:24,250
to zero point seven.

31
00:02:24,970 --> 00:02:29,530
So the number of probabilities in the distribution is the number of classes we have.

32
00:02:30,370 --> 00:02:34,150
Furthermore, note that this makes it easy to formulate a decision rule.

33
00:02:34,570 --> 00:02:37,390
We simply pick the class that has the highest probability.

34
00:02:38,170 --> 00:02:44,860
Mathematically, we choose the class k star, which is the argmax of P of Y given X over all values of

35
00:02:44,860 --> 00:02:45,400
Y.

36
00:02:46,930 --> 00:02:52,420
So in our example above, we would choose not spam, since zero point seven is bigger than zero point

37
00:02:52,420 --> 00:02:52,960
three.
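
As a minimal sketch of this decision rule, assuming the posterior is stored in a plain dictionary (the function name `decide` and the variable names are hypothetical, chosen only for illustration), the argmax can be written in Python as:

```python
# Posterior probabilities taken from the spam example in the lecture.
posterior = {"spam": 0.3, "not spam": 0.7}


def decide(probabilities):
    """Return the class with the highest probability (the argmax)."""
    return max(probabilities, key=probabilities.get)


prediction = decide(posterior)
print(prediction)
```

The same function works for any number of classes, since it only compares the probabilities it is given.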

38
00:02:57,630 --> 00:03:01,230
Now, it's important to remember not to be overwhelmed by math symbols,

39
00:03:01,530 --> 00:03:07,170
if you happen to have any math phobias. Intuitively, this rule makes perfect sense.

40
00:03:07,680 --> 00:03:13,200
If the probability of an email being not spam is bigger than the probability of that email being spam,

41
00:03:13,530 --> 00:03:16,050
of course, I'm going to classify it as not spam.

42
00:03:16,770 --> 00:03:19,920
In fact, we can think of an even more intuitive example.

43
00:03:20,610 --> 00:03:24,030
Suppose that you're about to play a game, say, betting on a horse race.

44
00:03:24,690 --> 00:03:29,610
Luckily, I have insider knowledge, and I can tell you the probability that each horse will win.

45
00:03:30,810 --> 00:03:32,820
So suppose that there are six horses.

46
00:03:33,210 --> 00:03:35,910
Horse number one has a 50 percent chance of winning,

47
00:03:36,180 --> 00:03:38,910
while all the other horses have a 10 percent chance.

48
00:03:39,510 --> 00:03:43,950
Note that you only get to play once and you must bet all your money on a single horse.

49
00:03:44,760 --> 00:03:49,770
In this case, you should pick horse number one instead of any of the other horses because it has the

50
00:03:49,770 --> 00:03:51,180
best chance to win.

51
00:03:51,900 --> 00:03:53,640
So hopefully that's pretty intuitive.

52
00:03:58,340 --> 00:04:03,530
OK, so now that you understand the basics of the decision rule, let's go through a full example where

53
00:04:03,530 --> 00:04:06,470
both X and Y represent concrete things.

54
00:04:07,280 --> 00:04:12,560
Suppose that Y represents whether or not a patient will have a severe immune reaction to COVID.

55
00:04:13,130 --> 00:04:16,190
We can think of that as whether or not they need to go to the hospital.

56
00:04:16,850 --> 00:04:21,290
So I hope that you'll find this example very intuitive as it is pretty contemporary.

57
00:04:21,290 --> 00:04:24,200
And hopefully you've read about these factors in the news.

58
00:04:24,680 --> 00:04:29,300
If you haven't, please let me know in the Q&A, and I'll be happy to share some news articles with

59
00:04:29,300 --> 00:04:29,580
you.

60
00:04:30,440 --> 00:04:32,750
In any case, note that this is categorical.

61
00:04:33,110 --> 00:04:37,730
We can denote the classes as severe and mild. For this initial example,

62
00:04:37,730 --> 00:04:42,980
suppose that X is just one measurement, which is the patient's BMI. As you recall,

63
00:04:43,010 --> 00:04:47,960
BMI stands for body mass index, which is your weight divided by the square of your height.

64
00:04:48,650 --> 00:04:51,710
It is a common but flawed measurement of one's body fat.

65
00:04:52,250 --> 00:04:55,460
However, for this example, I think it's the easiest to understand.

66
00:04:56,210 --> 00:05:02,360
So given this information, we can now write down how to compute the probabilities we previously discussed.

67
00:05:02,990 --> 00:05:07,850
Remember that we know all the values on the right side, and we want to find the value on the left side

68
00:05:08,150 --> 00:05:11,210
after which we can use those values to make a prediction.

69
00:05:12,020 --> 00:05:16,010
Note that we call the distribution on the left side the posterior.

70
00:05:20,770 --> 00:05:26,830
So it may be helpful to think about how each of the values on the right side will be measured. For P of

71
00:05:26,830 --> 00:05:29,980
severe and P of mild, note that these are called priors.

72
00:05:30,310 --> 00:05:34,870
They represent the rate of severe or mild reactions, given no other information.

73
00:05:35,440 --> 00:05:37,390
That is, they are conditioned on nothing.

74
00:05:38,200 --> 00:05:41,110
In practice, they would simply be computed by counting.

75
00:05:41,740 --> 00:05:48,130
So, for example, if you have 1000 COVID patients and 100 of them had severe reactions, then that means

76
00:05:48,130 --> 00:05:52,590
P of severe is 10 percent and thus P of mild would be 90 percent.

77
00:05:57,220 --> 00:06:01,810
Now, let's consider P of BMI given severe and P of BMI given mild.

78
00:06:02,470 --> 00:06:04,420
Note that these are called likelihoods.

79
00:06:05,470 --> 00:06:09,070
So one plausible solution is to model these as Gaussians.

80
00:06:09,580 --> 00:06:14,230
As you recall, the Gaussian, or normal distribution, is the famous bell curve.

81
00:06:14,800 --> 00:06:18,040
Recall that it is fully specified by its mean and variance.

82
00:06:18,670 --> 00:06:24,070
Thus, what you would want to do is collect all the severe patients and compute the mean and variance

83
00:06:24,400 --> 00:06:25,420
of their BMI.

84
00:06:26,080 --> 00:06:30,790
This would fully specify the Gaussian distribution P of BMI given severe.

85
00:06:31,660 --> 00:06:36,190
Of course, we could do a similar thing for the mild patients as well, and that would give us P of

86
00:06:36,220 --> 00:06:37,320
BMI given mild.

87
00:06:38,410 --> 00:06:43,480
Now, because this is just an intuition lecture, we're not going to go through any calculations, but

88
00:06:43,480 --> 00:06:47,110
feel free to do that on your own if you feel the need to try it out.
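
For anyone who does want to try it out, here is a rough sketch of the whole single-feature calculation: priors by counting, Gaussian likelihoods from the per-class mean and variance, and the posterior from Bayes rule. The BMI values are made up purely for illustration, and the names `fit`, `posterior`, and `gaussian_pdf` are hypothetical helpers, not part of any library.

```python
import math
from statistics import mean, pvariance

# Made-up BMI values, purely for illustration; this is not real patient data.
bmi_by_class = {
    "severe": [31.0, 34.5, 29.0, 36.0],
    "mild": [22.0, 24.5, 27.0, 21.0, 25.5, 23.0],
}


def gaussian_pdf(x, mu, var):
    """Density of a Gaussian with mean mu and variance var, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit(data):
    """Estimate the prior (by counting) and the Gaussian mean/variance per class."""
    n_total = sum(len(values) for values in data.values())
    return {
        label: {
            "prior": len(values) / n_total,
            "mean": mean(values),
            "var": pvariance(values),
        }
        for label, values in data.items()
    }


def posterior(x, params):
    """Apply Bayes rule: likelihood times prior, normalized over all classes."""
    joint = {
        label: gaussian_pdf(x, p["mean"], p["var"]) * p["prior"]
        for label, p in params.items()
    }
    evidence = sum(joint.values())  # the denominator, p(x)
    return {label: value / evidence for label, value in joint.items()}


params = fit(bmi_by_class)
post = posterior(33.0, params)
print(post)
```

Picking the class with the largest value in `post` is then exactly the decision rule from earlier in the lecture.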

89
00:06:51,630 --> 00:06:57,480
Now, typically in machine learning, we don't just have one input. In the previous example, our only

90
00:06:57,480 --> 00:06:58,830
input was BMI.

91
00:06:59,460 --> 00:07:04,860
However, it may be the case that we could make more accurate predictions by using more information

92
00:07:04,860 --> 00:07:07,680
about the patient. As you recall,

93
00:07:07,800 --> 00:07:09,720
age is a major factor as well.

94
00:07:10,530 --> 00:07:16,590
So suppose X is now a vector with multiple components, one for BMI and one for age.

95
00:07:17,250 --> 00:07:21,930
Of course, the computations we would do are basically the same as what I've shown you so far.

96
00:07:22,680 --> 00:07:26,520
The only difference is that everywhere you previously saw only BMI,

97
00:07:26,880 --> 00:07:29,010
now you see BMI and age.

98
00:07:33,470 --> 00:07:37,940
So what is naive Bayes and how is that different from simply using Bayes rule?

99
00:07:38,660 --> 00:07:44,360
This all has to do with the form of the likelihood. Continuing on with our example, that would be P

100
00:07:44,360 --> 00:07:51,260
of BMI and age given Y. Note that we're just using Y as a generic variable, which in this example

101
00:07:51,290 --> 00:07:53,270
could represent severe or mild.

102
00:07:54,200 --> 00:07:58,490
So as we've discussed it so far, this is a very general way of looking at things.

103
00:07:58,910 --> 00:08:02,090
We haven't said anything about the structure of this distribution.

104
00:08:02,660 --> 00:08:08,780
It could be Gaussian, exponential, or any other kind of multivariate distribution, even one that we do

105
00:08:08,780 --> 00:08:10,010
not know how to compute.

106
00:08:10,700 --> 00:08:17,300
The most important fact to pay attention to is that BMI and age do not have to be independent, which

107
00:08:17,300 --> 00:08:18,020
makes sense.

108
00:08:18,590 --> 00:08:21,710
It could very well be that BMI is affected by age.

109
00:08:22,040 --> 00:08:23,560
That is, as one ages,

110
00:08:23,570 --> 00:08:26,450
perhaps it becomes more difficult to control BMI.

111
00:08:27,170 --> 00:08:31,790
Well, what makes naive Bayes naive is that we make the naive Bayes assumption.

112
00:08:32,450 --> 00:08:36,470
The naive Bayes assumption is that all the inputs are independent given the class.

113
00:08:36,950 --> 00:08:39,559
Essentially, this makes everything easier to compute.

114
00:08:39,830 --> 00:08:45,770
And as engineers, we have no problem making potentially unrealistic assumptions if they make things

115
00:08:45,770 --> 00:08:47,030
easier to compute.

116
00:08:51,730 --> 00:08:56,440
Now, one common way to visualize the naive Bayes model is with a graphical model.

117
00:08:57,370 --> 00:09:00,640
Personally, I don't find this that useful, but perhaps you might.

118
00:09:01,240 --> 00:09:07,240
So in this case, each circle represents a random variable and each arrow represents a dependency.

119
00:09:07,990 --> 00:09:14,230
In this case, we can see that all of the X's are dependent on Y. This makes sense.

120
00:09:14,530 --> 00:09:20,440
For example, if Y is severe, then you could imagine that the input for age would be larger on average,

121
00:09:20,890 --> 00:09:24,160
thus age is affected by Y taking on a certain value.

122
00:09:25,240 --> 00:09:28,720
Importantly, note that all the X's are independent of each other.

123
00:09:29,290 --> 00:09:33,280
That is, there are no arrows going from any X to any other X.

124
00:09:33,910 --> 00:09:38,590
In this diagram, the X's represent each individual component of our input.

125
00:09:39,070 --> 00:09:45,310
So in our COVID example, we would have two X's, one representing BMI and one representing age.

126
00:09:45,880 --> 00:09:52,030
In general, we will just have X1, X2, and so forth up to XD, where D is the number of input features

127
00:09:52,030 --> 00:09:53,050
of our data set.
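
In symbols, the independence structure shown in the diagram amounts to the following factorization of the likelihood. The subscript notation x_1, ..., x_D is an assumed typesetting of the "X1 up to XD" just described.

```latex
p(x_1, x_2, \ldots, x_D \mid y) = \prod_{d=1}^{D} p(x_d \mid y)
\qquad\Longrightarrow\qquad
p(y \mid x_1, \ldots, x_D) \propto p(y) \prod_{d=1}^{D} p(x_d \mid y)
```

For the COVID example with D = 2, this reads p(BMI, age | y) = p(BMI | y) p(age | y), so each feature's distribution can be estimated separately within each class.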

128
00:09:57,680 --> 00:10:04,250
OK, so as mentioned, what makes naive Bayes naive is that it assumes that all the inputs are independent.

129
00:10:04,850 --> 00:10:10,250
However, this still doesn't say anything about what kind of distribution we should use for P of X given

130
00:10:10,250 --> 00:10:13,490
Y. In fact, this is totally up to you.

131
00:10:14,120 --> 00:10:19,190
However, if you choose something exotic or unconventional, note that you would have to implement it

132
00:10:19,190 --> 00:10:19,850
yourself.

133
00:10:20,510 --> 00:10:25,130
In fact, it's not even required that all the X's come from the same kind of distribution.

134
00:10:25,670 --> 00:10:32,070
Perhaps it's the case that X1 is Gaussian, but X2 is exponential and X3 is Bernoulli and so forth.

135
00:10:32,900 --> 00:10:39,140
But in scikit-learn there are predefined naive Bayes models which use specific likelihood distributions.

136
00:10:39,530 --> 00:10:42,980
These are the Gaussian, the multinomial, and the Bernoulli.

137
00:10:43,820 --> 00:10:48,890
Note that for these pre-built models, all the X's come from the same kind of distribution.

138
00:10:49,460 --> 00:10:54,740
For example, if you choose Gaussian naive Bayes, that will mean all your X's will be Gaussian.

139
00:10:55,760 --> 00:11:00,650
Of course, your next question will be, well, which one do I choose for my specific problem?

140
00:11:01,190 --> 00:11:03,980
And of course, that depends on the distribution of your data.

141
00:11:04,610 --> 00:11:10,040
If your data is continuous and looks like a bell curve, then the Gaussian would be a good choice.

142
00:11:10,580 --> 00:11:15,920
If you have count data that comes from a categorical distribution, then the multinomial would be a good choice.

143
00:11:16,400 --> 00:11:21,950
Note that this is typically the correct option for NLP and specifically count vectors or

144
00:11:21,950 --> 00:11:22,340
TF-IDFs.

145
00:11:23,180 --> 00:11:26,630
And if your data is binary, then Bernoulli would be a good choice.

146
00:11:27,080 --> 00:11:31,940
This might be applicable if you choose the binary version of count vectors for NLP.
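
To make the three choices concrete, here is a small sketch using scikit-learn's `GaussianNB`, `MultinomialNB`, and `BernoulliNB` classes together with `CountVectorizer`. The patient measurements and the example messages are made up purely for illustration, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Continuous inputs (BMI, age): Gaussian naive Bayes.
# These measurements are invented for illustration only.
X_cont = np.array(
    [[31.0, 70], [35.0, 64], [29.5, 81], [22.0, 30], [24.0, 41], [23.5, 25]]
)
y_cont = np.array(["severe", "severe", "severe", "mild", "mild", "mild"])
gnb = GaussianNB().fit(X_cont, y_cont)
print(gnb.predict([[33.0, 75]]))

# Toy text data, also invented, for the two count-based models.
docs = ["win money now", "free money offer", "meeting at noon", "lunch at noon tomorrow"]
labels = ["spam", "spam", "not spam", "not spam"]

# Word counts: multinomial naive Bayes.
counts = CountVectorizer().fit(docs)
mnb = MultinomialNB().fit(counts.transform(docs), labels)
print(mnb.predict(counts.transform(["free money"])))

# Binary word presence (binary=True): Bernoulli naive Bayes.
binary = CountVectorizer(binary=True).fit(docs)
bnb = BernoulliNB().fit(binary.transform(docs), labels)
print(bnb.predict(binary.transform(["meeting tomorrow"])))
```

All three models share the same `fit` and `predict` interface, so switching between them is mostly a matter of matching the model to how the features are distributed.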