1
00:00:11,160 --> 00:00:16,860
So in this video, we are going to discuss one of the most popular and powerful machine learning methods,

2
00:00:16,980 --> 00:00:23,340
the random forest. The random forest and other methods like it are today's go-to methods when it comes

3
00:00:23,340 --> 00:00:27,970
to machine learning for most typical supervised tabular data.

4
00:00:28,170 --> 00:00:31,680
The random forest tends to work very well without much tuning.

5
00:00:32,490 --> 00:00:37,620
This is unlike methods such as deep neural networks, which usually take some manual tuning to get just

6
00:00:37,620 --> 00:00:38,010
right.

7
00:00:42,740 --> 00:00:47,790
So in order to understand the random forest, you first need to understand the decision tree.

8
00:00:48,380 --> 00:00:50,810
So what's the intuition behind the trees?

9
00:00:51,830 --> 00:00:57,740
Well, imagine that instead of machine learning, you wanted to build your own computer program to classify

10
00:00:57,740 --> 00:00:58,610
tabular data.

11
00:00:59,630 --> 00:01:04,780
One common construct we use for building computer programs is conditional expressions.

12
00:01:05,420 --> 00:01:07,210
In other words, if-then-else.

13
00:01:07,850 --> 00:01:13,130
So let's consider our favorite example, predicting exam grades from the number of hours studied and

14
00:01:13,130 --> 00:01:13,790
the number of hours slept.

15
00:01:13,920 --> 00:01:17,570
Let's suppose you were to write such a program by hand.

16
00:01:18,020 --> 00:01:23,440
You might do something like: if hours studied is greater than five, then pass; otherwise, fail.

17
00:01:24,350 --> 00:01:27,410
But of course we could make use of the other data point too.

18
00:01:27,800 --> 00:01:29,690
So you might have a nested if statement.

19
00:01:30,200 --> 00:01:36,620
If hours studied is greater than five and number of hours slept is greater than seven, then pass; otherwise,

20
00:01:36,620 --> 00:01:38,690
fail and so on and so forth.
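Written out as code, the hand-made rule might look like the following minimal sketch. The function name `predict_exam` and its argument names are made up for illustration; the thresholds of five and seven hours are the ones from the example.

```python
def predict_exam(hours_studied: float, hours_slept: float) -> str:
    """Hand-written classifier: nothing but nested if statements."""
    if hours_studied > 5:
        # Second question is only asked on the "studied enough" branch.
        if hours_slept > 7:
            return "pass"
        return "fail"
    return "fail"


print(predict_exam(6, 8))
print(predict_exam(6, 6))
print(predict_exam(4, 9))
```

Each additional variable would add another level of nesting, which is why a hand-written version gets complicated quickly.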

21
00:01:39,380 --> 00:01:43,520
Of course, the more variables you have, the more complicated your program will get.

22
00:01:44,300 --> 00:01:47,210
So how is any of this related to decision trees?

23
00:01:51,900 --> 00:01:58,980
Well, let's recognize this fact: if statements are trees. If statements create branches, so if you

24
00:01:58,980 --> 00:02:01,680
draw your logic on paper, you will get a tree.

25
00:02:02,580 --> 00:02:07,920
Now, your next question might be: sure, but how do you actually build a tree automatically from a data

26
00:02:07,920 --> 00:02:08,260
set?

27
00:02:08,850 --> 00:02:13,780
The answer to this question turns out to be pretty complex, but you're welcome to check out

28
00:02:13,780 --> 00:02:15,970
extra_reading.txt if you want to know.

29
00:02:17,040 --> 00:02:22,940
OK, so believe it or not, that is all there is to understanding the intuition behind decision trees.

30
00:02:23,280 --> 00:02:25,590
They are simply nested if statements.

31
00:02:30,230 --> 00:02:35,480
Knowing what we know so far, there are some interesting questions we can ask. As you recall, one

32
00:02:35,480 --> 00:02:40,500
of the ideas you learned about in this section is that machine learning is nothing but geometry.

33
00:02:41,180 --> 00:02:45,830
So let's consider what a decision tree's decision boundary actually looks like.

34
00:02:47,060 --> 00:02:49,930
Note that at this point we are considering classification.

35
00:02:50,660 --> 00:02:52,320
So here's an example of a tree.

36
00:02:52,940 --> 00:02:58,430
This tree says if you studied more than five hours and you slept more than seven hours, you will pass

37
00:02:58,430 --> 00:03:00,440
your exam, otherwise you will fail.

38
00:03:01,400 --> 00:03:06,110
So you can see that this tree is characterized by horizontal and vertical lines.

39
00:03:06,590 --> 00:03:11,790
This is because a decision tree node can only split based on one attribute at a time.

40
00:03:12,560 --> 00:03:16,720
So every statement in the tree is going to be a greater-than expression.

41
00:03:17,390 --> 00:03:22,520
The way this translates to paper is that one side of the space goes one way and the other side of the

42
00:03:22,520 --> 00:03:23,780
space goes the other way.

43
00:03:24,770 --> 00:03:26,210
So you always get straight lines

44
00:03:26,210 --> 00:03:30,380
or planes perpendicular to the axis whose variable is being split.
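One way to picture this is as a small data structure in which every internal node tests exactly one feature against one threshold. The sketch below is only an illustration: the class names `Split` and `Leaf` and the function `classify` are hypothetical, and the tree is the hours-studied / hours-slept example from above.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    label: str  # class assigned to every point that lands here


@dataclass
class Split:
    feature: int      # index of the single feature this node tests
    threshold: float  # boundary is perpendicular to that feature's axis
    if_greater: Union["Split", Leaf]
    otherwise: Union["Split", Leaf]


def classify(node: Union[Split, Leaf], x: Sequence[float]) -> str:
    """Walk down the tree, answering one greater-than question per node."""
    while isinstance(node, Split):
        node = node.if_greater if x[node.feature] > node.threshold else node.otherwise
    return node.label


# Feature 0 = hours studied, feature 1 = hours slept.
tree = Split(0, 5.0, Split(1, 7.0, Leaf("pass"), Leaf("fail")), Leaf("fail"))

print(classify(tree, (6, 8)))
print(classify(tree, (6, 6)))
```

Because a `Split` only ever looks at one coordinate of `x`, the boundary it draws can only be a vertical or horizontal line in this two-feature example.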

45
00:03:35,240 --> 00:03:41,030
One important fact to note is that trees can be of arbitrary depth, so you can split on the same variable

46
00:03:41,030 --> 00:03:46,570
more than once. Because of this, it's very easy to get good accuracy on your train set.

47
00:03:47,030 --> 00:03:51,890
In fact, if none of the training points for opposing classes overlap, then you should get one hundred

48
00:03:51,890 --> 00:03:52,850
percent accuracy.

49
00:03:53,450 --> 00:03:57,260
Of course, this is clearly not good because this is what we call overfitting.

50
00:04:02,010 --> 00:04:06,810
Now, since we'll be using decision trees for time series forecasting, we'll need to know how they

51
00:04:06,810 --> 00:04:07,620
do regression.

52
00:04:08,640 --> 00:04:14,190
Basically, this comes down to the question: how do decision trees make decisions? For classification,

53
00:04:14,200 --> 00:04:19,290
if you imagine that we split up the space of x vectors, the decision made by the tree is simply the

54
00:04:19,290 --> 00:04:21,570
most common class among the points in the partition.

55
00:04:22,410 --> 00:04:27,510
So, for example, in the space where hours studied is greater than five and hours slept is greater

56
00:04:27,510 --> 00:04:30,510
than seven, most of those students pass the exam.

57
00:04:30,840 --> 00:04:33,020
In other words, most of the points there are yellow.

58
00:04:33,990 --> 00:04:37,400
Therefore the label assigned to that area would be pass.

59
00:04:37,890 --> 00:04:39,710
So you can think of this like voting.

60
00:04:39,990 --> 00:04:46,890
The majority vote wins. But for regression, if you imagine again that the space is split up by the tree,

61
00:04:47,220 --> 00:04:48,830
voting no longer makes sense.

62
00:04:49,350 --> 00:04:54,610
However, what would make sense is if you simply took the average of the points in the partition.

63
00:04:55,500 --> 00:05:00,420
So suppose you're trying to approximate some function, but you only have samples from the function.

64
00:05:01,170 --> 00:05:04,750
Since you only have a limited number of splits, in each partition

65
00:05:04,860 --> 00:05:08,190
you'll just take the average value of the points in that partition.
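The two rules can be put side by side in a few lines. This is a minimal sketch using only the Python standard library; the helper names `leaf_class` and `leaf_value` and the sample numbers are made up for illustration.

```python
from collections import Counter
from statistics import mean


def leaf_class(labels):
    """Classification: the majority vote among the labels in one partition."""
    return Counter(labels).most_common(1)[0][0]


def leaf_value(targets):
    """Regression: the average of the target values in one partition."""
    return mean(targets)


# Hypothetical contents of a single partition.
print(leaf_class(["pass", "pass", "fail"]))
print(leaf_value([70.0, 80.0, 90.0]))
```

In both cases the tree itself is the same; only the rule applied inside each partition changes.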

66
00:05:09,240 --> 00:05:14,380
What this ends up looking like is a bunch of horizontal lines. So hopefully that makes sense.

67
00:05:14,730 --> 00:05:20,010
The reason they are horizontal is because you've simply taken the average of the points in the partition

68
00:05:20,010 --> 00:05:21,150
defined by the tree.

69
00:05:21,810 --> 00:05:25,940
And consider what will happen if you overfit, or in other words, make the tree too deep.

70
00:05:26,550 --> 00:05:31,650
If you overfit, then your tree will just put each data point into its own partition and the result

71
00:05:31,650 --> 00:05:33,420
will be a very weird looking function.
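To make the effect of depth concrete, here is a rough sketch of a one-dimensional regression tree in plain Python. It is an assumption-laden toy, not how any particular library builds trees: the names `build_tree` and `predict`, the greedy squared-error split rule, and the sample points are all chosen for illustration.

```python
from statistics import mean


def sse(ys):
    """Sum of squared errors around the mean of ys."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def build_tree(points, max_depth):
    """Greedy 1-D regression tree over (x, y) pairs, stored as nested dicts."""
    xs = sorted({x for x, _ in points})
    if max_depth == 0 or len(xs) < 2:
        # Leaf: predict the average target of the points in this partition.
        return {"value": mean([y for _, y in points])}

    def cost(t):
        left = [y for x, y in points if x <= t]
        right = [y for x, y in points if x > t]
        return sse(left) + sse(right)

    # Candidate thresholds are midpoints between neighbouring x values.
    t = min(((a + b) / 2 for a, b in zip(xs, xs[1:])), key=cost)
    return {
        "threshold": t,
        "left": build_tree([p for p in points if p[0] <= t], max_depth - 1),
        "right": build_tree([p for p in points if p[0] > t], max_depth - 1),
    }


def predict(tree, x):
    while "value" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["value"]


points = [(0.0, 1.0), (1.0, 3.0), (2.0, 2.0), (3.0, 5.0),
          (4.0, 4.0), (5.0, 6.0), (6.0, 5.5), (7.0, 8.0)]

shallow = build_tree(points, max_depth=1)  # two horizontal segments
deep = build_tree(points, max_depth=10)    # one partition per data point

print([predict(shallow, x) for x, _ in points])
print([predict(deep, x) for x, _ in points])
```

With `max_depth=1` the prediction is two flat segments; with a depth large enough to isolate every point, the tree simply reproduces the training targets, noise included.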

72
00:05:38,250 --> 00:05:44,370
OK, so now you know how decision trees work for both classification and regression. As you recall,

73
00:05:44,370 --> 00:05:47,860
there is one problem with these models, which is that they easily overfit.

74
00:05:49,050 --> 00:05:53,060
So the way to solve this problem is to combine multiple trees together.

75
00:05:54,270 --> 00:05:58,830
So imagine that you take hundreds of trees which are all trained on a different subset of the train

76
00:05:58,830 --> 00:06:01,800
set with a different subset of the input features.

77
00:06:02,340 --> 00:06:05,890
Then you combine the predictions of these hundreds of trees.

78
00:06:06,330 --> 00:06:09,970
This is basically what a random forest is. Intuitively,

79
00:06:10,080 --> 00:06:15,960
each tree will make perfect predictions on its own subset of the train set, but because they

80
00:06:15,960 --> 00:06:20,260
are all using a slightly different train set, they will make slightly different predictions.

81
00:06:20,940 --> 00:06:25,730
It turns out that if you average out those predictions, the result is a much smoother function.

82
00:06:26,310 --> 00:06:30,500
So they smooth out all the unnecessary random variations due to noise.

83
00:06:31,080 --> 00:06:36,060
On the other hand, because decision trees can be so accurate, they don't lose much accuracy when you

84
00:06:36,060 --> 00:06:36,790
combine them.

85
00:06:37,200 --> 00:06:40,200
It's only those tiny pockets of overfitting that go away.

86
00:06:41,490 --> 00:06:46,710
OK, so that's the intuition behind the random forest: it's just a bunch of trees voting together.
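As a rough illustration of the averaging idea, the sketch below grows many fully grown one-dimensional regression trees, each on a random resample of the train set, and averages their predictions. Several things here are assumptions rather than part of the lecture: the resampling is done with replacement (a bootstrap sample), the feature-subset step is left out because there is only one input feature, and the names `grow`, `grow_forest`, and `forest_predict` as well as the noisy sine data are made up.

```python
import math
import random
from statistics import mean


def sse(ys):
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def grow(points):
    """Fully grown 1-D regression tree: (threshold, left, right) or a leaf value."""
    xs = sorted({x for x, _ in points})
    if len(xs) < 2:
        return mean([y for _, y in points])  # leaf

    def cost(t):
        return (sse([y for x, y in points if x <= t])
                + sse([y for x, y in points if x > t]))

    t = min(((a + b) / 2 for a, b in zip(xs, xs[1:])), key=cost)
    return (t,
            grow([p for p in points if p[0] <= t]),
            grow([p for p in points if p[0] > t]))


def predict(tree, x):
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x <= t else right
    return tree


def grow_forest(points, n_trees, rng):
    """Each tree sees its own random resample of the train set."""
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(points) for _ in points]  # sampled with replacement
        forest.append(grow(sample))
    return forest


def forest_predict(forest, x):
    """Combine the trees by averaging their individual predictions."""
    return mean([predict(tree, x) for tree in forest])


rng = random.Random(0)
points = [(x / 2, math.sin(x / 2) + rng.gauss(0, 0.3)) for x in range(20)]

single_tree = grow(points)
forest = grow_forest(points, n_trees=50, rng=rng)

print(predict(single_tree, 2.0), forest_predict(forest, 2.0))
```

A single fully grown tree reproduces every noisy training point exactly, while the forest's output at any given input is an average over trees that each saw slightly different data.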