1
00:00:00,150 --> 00:00:04,410
Remember earlier in the course that we looked at scatter plots when we talked through different types

2
00:00:04,410 --> 00:00:05,920
of charts and graphs.

3
00:00:05,939 --> 00:00:10,970
Now we want to revisit scatter plots and specifically this idea of regression.

4
00:00:10,980 --> 00:00:16,980
So earlier when we talked about scatter plots, remember that we talked about this least squares line

5
00:00:16,980 --> 00:00:18,550
or the line of best fit.

6
00:00:18,570 --> 00:00:24,420
We also called this the regression line, and we learned to calculate the equation of the regression

7
00:00:24,420 --> 00:00:29,400
line or determine the equation of the regression line by finding m, the slope, which was found with

8
00:00:29,400 --> 00:00:37,650
this formula, and b, the y-intercept, which we found using this formula. And we indicated the equation of

9
00:00:37,650 --> 00:00:43,800
the regression line with this y hat symbol to make the specific point that this is sort of a line of

10
00:00:43,800 --> 00:00:48,410
estimation and it doesn't actually run through all of the data points in the scatterplot.

11
00:00:48,420 --> 00:00:54,900
Instead, it charts a course through the scatterplot that minimizes the error or that minimizes the

12
00:00:54,900 --> 00:00:55,590
residuals.

13
00:00:55,590 --> 00:00:57,620
And so we indicate it with y hat.
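
The formulas shown on screen are not captured in this transcript, so the following is only a minimal sketch that assumes the standard least-squares formulas for the slope m and the intercept b. The function name `least_squares_line` and the data are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (m, b) for the line y_hat = m*x + b that minimizes the squared residuals."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Standard least-squares slope and intercept (assumed to match the course formulas).
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, b = least_squares_line(xs, ys)

# y_hat is the estimate on the line; the residual is actual minus estimate.
y_hat = [m * x + b for x in xs]
residuals = [y - yh for y, yh in zip(ys, y_hat)]
```

Notice that the line does not pass through every data point, so most residuals are nonzero, but for a least-squares line they add up to zero.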

14
00:00:57,630 --> 00:01:03,780
So technically regression is just the process of estimating the value of the dependent variable for

15
00:01:03,780 --> 00:01:06,450
some given value of the independent variable.

16
00:01:06,450 --> 00:01:12,540
And in order to estimate the value of the dependent variable, we're going to go through the process

17
00:01:12,540 --> 00:01:13,650
of curve fitting.

18
00:01:13,650 --> 00:01:20,490
Now this equation is the regression equation for a line, but the curve that we use to approximate the

19
00:01:20,490 --> 00:01:27,720
data does not necessarily have to be a line, which is why we'll often refer to the regression equation

20
00:01:27,720 --> 00:01:29,610
as just the approximating curve.

21
00:01:29,610 --> 00:01:34,980
Or instead of saying that we're always finding the linear regression equation, we'll instead just say

22
00:01:34,980 --> 00:01:36,960
that we're going through the curve fitting process.

23
00:01:36,990 --> 00:01:42,540
That being said, as we're getting into this idea of regression, we really want to talk through the

24
00:01:42,540 --> 00:01:46,700
four different ways that we identify the regression curve.

25
00:01:46,710 --> 00:01:53,130
In other words, there are four different ways that we normally describe the trend, and those four

26
00:01:53,130 --> 00:01:59,520
ways are the form, the direction, the strength, and whether or not we have outliers in our data.

27
00:01:59,520 --> 00:02:03,900
So let's talk first about the form of the trend

28
00:02:03,900 --> 00:02:05,460
through the scatterplot.

29
00:02:05,490 --> 00:02:11,009
We can think about three different examples here, and I've made these examples pretty extreme so that

30
00:02:11,009 --> 00:02:13,020
the trends are obvious.

31
00:02:13,020 --> 00:02:19,380
But in this first example here, we can clearly see that the trend is linear.

32
00:02:19,410 --> 00:02:26,610
We have linear correlation in the data because the shape of the curve that best fits this data is a

33
00:02:26,610 --> 00:02:27,090
line.

34
00:02:27,090 --> 00:02:31,620
And so in this case, if we were going to find the trend through the data, we would find the trend

35
00:02:31,620 --> 00:02:35,940
line and we would find this linear regression equation.

36
00:02:35,940 --> 00:02:37,440
But that's not always the case.

37
00:02:37,440 --> 00:02:43,080
Sometimes the trend through the data follows a parabolic shape or an exponential shape like this one.

38
00:02:43,080 --> 00:02:52,800
We might describe this trend or this correlation as parabolic or exponential correlation because of

39
00:02:52,800 --> 00:02:54,930
the shape of this approximating curve.

40
00:02:54,930 --> 00:03:01,380
And we can see clearly here that this exponential or parabolic shape does a better job of fitting the data

41
00:03:01,380 --> 00:03:03,540
than just using a line.
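
One hedged way to check the idea that a curve can fit better than a line is to compare the sum of squared residuals for each fit. This sketch assumes an exponential curve y = a * e^(k*x), fitted by running an ordinary line fit on ln(y), which minimizes error in log space rather than in the original units. The helper names and the data are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept for ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


def sse(ys, preds):
    """Sum of squared residuals between actual and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))


# Made-up data that roughly doubles at each step.
xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 2.1, 3.9, 8.2, 15.8, 32.5]

# Straight-line fit in the original units.
m, b = fit_line(xs, ys)
linear_sse = sse(ys, [m * x + b for x in xs])

# Exponential fit: a line through (x, ln y), then transformed back.
k, ln_a = fit_line(xs, [math.log(y) for y in ys])
a = math.exp(ln_a)
exp_sse = sse(ys, [a * math.exp(k * x) for x in xs])
```

For data like this, the exponential curve leaves a much smaller squared error than the line, which is the numerical version of what we see by eye.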

42
00:03:03,570 --> 00:03:06,750
These are certainly not the only two shapes.

43
00:03:06,750 --> 00:03:13,620
For instance, the data points might follow a trend similar to a sine curve, something like this.

44
00:03:13,620 --> 00:03:18,600
And so maybe our approximating curve looks like this and we would call the trend sinusoidal because

45
00:03:18,600 --> 00:03:20,280
it follows a sine curve.

46
00:03:20,280 --> 00:03:26,970
And maybe that curve does a better job than specifically a line or this parabolic shape at approximating

47
00:03:26,970 --> 00:03:28,320
the trend in the data.

48
00:03:28,320 --> 00:03:32,040
So we're looking for the form of the trend.

49
00:03:32,040 --> 00:03:40,470
And for a scatterplot like this one, we might say here that we see no correlation because the data

50
00:03:40,470 --> 00:03:47,880
is so scattered in such a randomized way that it's difficult for us to see any kind of trend whatsoever

51
00:03:47,880 --> 00:03:50,100
in the data, at least visually.

52
00:03:50,100 --> 00:03:55,830
It almost doesn't make sense to fit this with a linear trend line or a parabolic trend line or some

53
00:03:55,830 --> 00:03:57,810
other approximating curve.

54
00:03:57,810 --> 00:03:59,130
And that can be the case.

55
00:03:59,130 --> 00:04:04,710
We can just have a data set where there's really no correlation at all. So we can talk through different

56
00:04:04,710 --> 00:04:06,540
forms that approximate the trend.

57
00:04:06,540 --> 00:04:10,050
We can also talk about the direction of the trend.

58
00:04:10,050 --> 00:04:19,560
So to take two obvious examples here: in this first example, the trend is positive because as we

59
00:04:19,560 --> 00:04:21,870
move to the right, the graph moves up.

60
00:04:21,870 --> 00:04:25,830
Whereas with this second example here the trend is

61
00:04:26,520 --> 00:04:32,550
negative, because as we move to the right, the graph moves down. And that holds not just for linear

62
00:04:32,550 --> 00:04:35,660
relationships, but also for relationships of other shapes.

63
00:04:35,670 --> 00:04:42,900
So instead of this parabolic curve here, if we had something that looked like this, that approximating

64
00:04:42,900 --> 00:04:48,180
curve could be parabolic or exponential, but the trend would be negative. We would describe the direction

65
00:04:48,180 --> 00:04:49,350
as negative.

66
00:04:49,560 --> 00:04:54,600
So we talk about direction, and we also talk about the strength of the relationship.

67
00:04:54,600 --> 00:05:01,410
So in these two examples here, this first one, we might call this a strong relationship, whereas

68
00:05:01,410 --> 00:05:07,800
the relationship among this data here we might call either moderate or maybe even

69
00:05:08,560 --> 00:05:09,340
weak.

70
00:05:09,340 --> 00:05:15,190
And really the judgment we're making here is based on how tightly clustered the data points are around

71
00:05:15,190 --> 00:05:16,730
the approximating curve,

72
00:05:16,750 --> 00:05:21,110
in these cases, the trend line. In this first scatterplot,

73
00:05:21,130 --> 00:05:25,360
all of the data points are very, very close to this trend line.

74
00:05:25,360 --> 00:05:28,560
And so the relationship in the data is strong.

75
00:05:28,570 --> 00:05:33,520
But in this example, the data points are spread further away from the trend line.

76
00:05:33,550 --> 00:05:36,130
They're not all packed in close to the line.

77
00:05:36,130 --> 00:05:39,190
And instead we see many data points that are further from the line.

78
00:05:39,190 --> 00:05:45,340
And so we would maybe say that this relationship is a moderate relationship or maybe even a weak relationship,

79
00:05:45,340 --> 00:05:51,310
but we would certainly say that it is not as strong of a relationship as the relationship we see in

80
00:05:51,310 --> 00:05:53,180
this first scatterplot.
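
The lesson judges direction and strength by eye. As a hedged numerical companion for linear trends, one common choice (an assumption here, not something stated in this lesson) is the Pearson correlation coefficient r: its sign gives the direction, and how close its absolute value is to 1 reflects how tightly the points cluster around a line. The data sets below are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5, 6]
tight = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]  # points hugging a rising line
loose = [12.0, 6.5, 11.0, 4.0, 7.5, 2.0]  # falling overall, but scattered

r_tight = pearson_r(xs, tight)  # positive, close to 1
r_loose = pearson_r(xs, loose)  # negative, further from -1
```

Any cutoff between strong, moderate, and weak is a convention rather than a rule, so the sketch only compares the two values instead of labeling them.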

81
00:05:53,200 --> 00:05:56,310
And then the last thing we always want to describe is outliers.

82
00:05:56,320 --> 00:06:02,530
So, for instance, in a graph like this one, once we have the approximating curve, if it's linear,

83
00:06:02,530 --> 00:06:07,750
once we have the line of best fit or the linear regression line here, we want to be able to identify

84
00:06:07,750 --> 00:06:11,800
what appear to be outliers in the data. In this particular scatterplot,

85
00:06:12,280 --> 00:06:15,420
this point right here appears to be an outlier.

86
00:06:15,430 --> 00:06:22,180
It is by far the furthest point from the regression line, and so it's the biggest outlier in the data

87
00:06:22,180 --> 00:06:22,630
set.
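
Finding the point furthest from the regression line can be sketched as follows, assuming that "furthest" means the largest vertical distance, that is, the largest absolute residual. The data is made up, and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


# Made-up data: every point follows y = 2x except the one at x = 6.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 10, 19, 14, 16]

m, b = fit_line(xs, ys)

# Absolute residual = vertical distance from each point to the line.
distances = [abs(y - (m * x + b)) for x, y in zip(xs, ys)]

# The candidate outlier is the point with the largest distance.
worst = max(range(len(xs)), key=lambda i: distances[i])
outlier = (xs[worst], ys[worst])
```

The outlier also pulls the fitted line toward itself, which is why the remaining points end up with small nonzero residuals even though they lie exactly on y = 2x.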

88
00:06:22,660 --> 00:06:28,660
Obviously, that being said, the more outliers there are in the data set, the weaker the relationship

89
00:06:28,660 --> 00:06:29,170
is.

90
00:06:29,170 --> 00:06:33,820
The fewer outliers we have, then clearly the stronger the relationship.

91
00:06:33,850 --> 00:06:41,380
So taking account of all four of these characteristics or descriptors, whenever we have the scatterplot

92
00:06:41,410 --> 00:06:47,500
of a data set, we want to be able to describe the trend in the data using these four characteristics.

93
00:06:47,500 --> 00:06:55,060
So for example, if we're looking at this scatterplot right here, we might say that the data displays

94
00:06:55,060 --> 00:07:03,550
a moderate, positive linear trend where the most significant outlier is this point here, which appears

95
00:07:03,550 --> 00:07:12,430
to be at maybe the point, let's say, (6, 18) roughly. That would be a fairly comprehensive way to describe

96
00:07:12,430 --> 00:07:19,420
the trend in this data as a moderate, positive linear relationship with an outlier at this point.

97
00:07:19,420 --> 00:07:29,440
Whereas this scatterplot here we might describe as a strong negative linear relationship with no noticeable

98
00:07:29,440 --> 00:07:30,220
outliers.

99
00:07:30,220 --> 00:07:37,060
So the takeaway here is just that the purpose of regression is to be able to estimate the value of the

100
00:07:37,060 --> 00:07:42,010
dependent variable y for any value of the independent variable x.

101
00:07:42,010 --> 00:07:46,350
And the way that we do that is by finding a trend through the data.

102
00:07:46,360 --> 00:07:51,280
Earlier when we talked about scatter plots, we learned how to use these equations to find the equation

103
00:07:51,280 --> 00:07:56,890
of the least squares line or the line of best fit, that regression line that's given by this equation.

104
00:07:56,890 --> 00:08:03,220
So we know how to find this simple equation of the regression line, and now we know how to describe

105
00:08:03,220 --> 00:08:09,010
that regression line based on its form, direction, strength and any outliers in the data set.

106
00:08:09,010 --> 00:08:14,560
With that foundation out of the way, we'll take the rest of this section to talk about some more advanced

107
00:08:14,560 --> 00:08:16,630
regression calculations.