Welcome back. Well, we’re down to our last lecture of the semester, on correlation and regression. Now, your textbook covers this topic very early on. I choose to cover it last for various reasons. It really can be covered either way, but I think there is a logical progression from the earlier topics that I’ve covered, you know, z to t to F. I really think those follow a certain logic, and it makes sense to keep the concepts of correlation and regression for last. Although, as your book describes quite well, ANOVA and correlation and regression all fall within the general concept of general linear modelling, which is something important to remember. A lot of these statistics do have an underlying relationship based on an underlying model.
But let’s go ahead and dive in here. Correlation and regression. They both examine linear, or straight-line, relationships. Both work with a pair of scores, one on each of two variables, x and y. So we’re going to be seeing lots of graphics here today, scatter plots as they’re called, where you have your x-axis and y-axis both depicted. So correlation and regression are very related subjects. That’s why I’m covering them both in the same lecture here, although I do have it broken down a bit in terms of the semester so you have enough time to absorb all the information about these.
Correlation is the degree of linear relationship between x and y. It’s measured or described by the statistic r. Regression is a little different. It has to do with prediction instead of description. So correlation describes a relationship; regression is all about prediction. So regression is concerned with predicting y from x, all right?
So the example I’ll be giving is quite a common one, where college admission boards look at ACT scores and SAT scores, and one goal of theirs might be to try to predict a student’s college GPA based on their ACT scores, let’s say. And so if you’re doing that, well, you’re going to be using regression. Regression forms a prediction equation to predict y from x. And it’s an equation you will have to memorize, a very simple equation, OK?
But first let’s get into correlation. Correlation is the aspect of the data that we want to describe or measure, and it’s the degree of linear relationship between x and y. The statistic r describes the degree of linear relationship between x and y, as I already mentioned. And here’s the formula: r equals the summation of the z scores for x times the z scores for y, divided by n, that is, r = Σ(z_x · z_y) / n. OK? In other words, r is simply the average product of z scores for x and y.
So x and y are typically going to be different measures. So you’re taking the ACT measure and you’re trying to predict the GPA measure. Maybe you’re arguing that they’re both measures of intelligence or academic success, and so one should predict the other. But they’re on different scales, OK?
So the ACT scale is quite different from the GPA scale. It goes from 1 to 36, while GPA typically goes from 0 to 4.0. Because they are different scales, you want to eliminate that distraction by putting them on the same scale, the z scale. Converting to z scores is the common way of doing that: it equates the measures by expressing every score as a number of standard deviations from its own mean. And that is what z scores do, as we’ve already learned in this course.
So r works with two variables, x and y. r ranges from negative 1 to positive 1, so it measures both positive and negative relationships. We’ll see examples of that in a moment here. It measures only the degree of linear relationships. So you have to have a linear relationship between variables. It can’t be non-linear, and I’ll give you an example of what that looks like. A variety of relationships can exist. You can have a curvilinear relationship, which can take different forms. I’ll give you an example of that.
r squared is also very important to remember. It’s simply squaring the r value. It gives you the proportion of variance, or variability, in y that is explained by x, OK? So if you find a nice positive correlation between ACT score and college GPA, let’s say you get a correlation of 0.6. That’s pretty good.
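A minimal sketch of the calculation described so far, in Python. The ACT and GPA values are made up for illustration, and the function names `z_scores` and `pearson_r` are hypothetical, not from the lecture or the textbook. It assumes the z scores use the population standard deviation (dividing by n), which is what makes "average product of z scores" come out between negative 1 and positive 1.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Convert raw scores to z scores: distance from the mean in SD units."""
    m = mean(values)
    sd = pstdev(values)  # population SD (divides by n), assumed to match the /n in r
    return [(v - m) / sd for v in values]

def pearson_r(xs, ys):
    """r = average product of the paired z scores for x and y."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# Invented data: six students' ACT scores (x) and freshman GPAs (y).
act = [18, 21, 24, 27, 30, 33]
gpa = [2.4, 3.1, 2.8, 3.0, 3.6, 3.4]

r = pearson_r(act, gpa)
r_squared = r ** 2  # proportion of variability in GPA explained by ACT

print("r =", round(r, 3))
print("r squared =", round(r_squared, 3))
```

Because both variables are converted to z scores first, the fact that ACT and GPA sit on very different raw scales has no effect on r.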

r ranges from negative 1 to positive 1 and you have a 0.6, so that’s pretty close to positive 1. That’s a pretty strong relationship there. Well, you take 0.6, you square it, and you get 0.36. That tells you that ACT scores are accounting for 36% of the variance in GPA. I don’t think that is actually the relationship between the two. I would be surprised if it were that strong, but that would be very powerful. It would say, essentially, that 36% of the variance in GPA that you get among all your students can be accounted for, can be predicted, by their ACT scores, OK? That would be quite powerful.
r is undefined if x or y has 0 spread. OK, so you’ve got to have some spread. You have to have some standard deviation for x and y in order for this to make any sense, really.
OK, so let’s look at some scatter plots here to picture this. The sign of r shows the type of linear relationship between x and y. We can use the definition formula for r and the scatter plots to see positive, negative, and 0 relationships. So here is exactly what it would look like for an r of 1, where you have x, of course, down here, and y on the vertical axis. And that’s always how it is: x is on the horizontal axis, y is on the vertical axis. And here’s a perfect linear relationship, where for every increase in y you have a concurrent increase in x, and vice versa. That’s a perfect one-to-one relationship, and it’s positive. The negative one is also a very powerful relationship, but it’s negative in this case. So for every increase in x you actually have a decrease in y, OK?
So that’s called a negative relationship: where one moves up, the other one moves down. Positive simply means as one moves up, the other one moves up, and as one moves down, the other one moves down. So if they’re moving in the same direction, it’s a positive relationship. If they’re moving in opposite directions, it is a negative relationship. And this is just a depiction of an r of 0, where you don’t have any predictive ability from x to y or vice versa.
OK, so this is just to point out that, again, correlation is all about linear relationships, not curvilinear or other types of relationships. So here is what looks to be an interesting curvilinear relationship. One example of this is the Yerkes-Dodson Law. Let’s say you have stress levels on the x-axis, and on the y-axis you have memory ability, OK?
So let’s say it’s an eyewitness to a crime, and lower on the x-axis you have hardly any stress. Let’s say they’re just sleepwalking, hardly awake at all when a crime takes place. Well, their memory for what the perpetrator looked like will be quite low. And then you get to the middle and, hey, they’re awake, they’re alert, they’re absorbing what’s going on, and they’re going to have a pretty decent memory of what the perpetrator looked like. But then as you get to higher levels, the perpetrator is waving a gun in their face, they’re going to be really stressed out, highly anxious, and sure enough, based on the Yerkes-Dodson Law, you find their memory for the perp’s face actually gets lower there. So that is a curvilinear relationship between stress or anxiety and memory. There are all kinds of other curvilinear relationships out there, but that’s the one I’m going to use as an example.
Well, you don’t want to use correlation in that case. It’s not going to be revealing. It could give you a 0, OK? It could give you the same r as this kind of amorphous mass of data points here, where it doesn’t reveal anything. It can completely miss a curvilinear relationship. So you don’t want to use r in that case. Rather, you want to use r whenever you have any kind of linear relationship, which can look like any one of these. And these are all positive relationships, but you can have these negatively as well.
r squared, as I pointed out already, is the proportion of variability in y that is explained by x. So if r is 0.5, which is shown here, then r squared is 0.25, and that’s the proportion of variability in y that is explained by x. In other words, 25% of the variability in y can be explained by x, which still leaves 75% unexplained. But still, if a single factor x explains 25% of the variance in y, that’s pretty good. That’s pretty powerful, OK? Here’s an even more powerful correlation, 0.7, where r squared, or variance accounted for, is 49%. And here’s 0.9. As r gets close to that upper limit of 1, r squared is also going to approach 1: 81% of the variance is explained by this relationship. Very, very powerful relationship here. You don’t often see that powerful of a correlation. But to give you an example of one, you might see that height and weight are related pretty strongly in this way. I would say it’s probably pretty high, like a 0.9 relationship there. As height goes up, weight tends to go up as well.
You can illustrate r squared really nicely with Venn diagrams, as the proportion of overlap between y and x, OK? What x accounts for is the amount of overlap with y. Here about 25% of y is accounted for by x. Here it’s about 50%, 49% of y. And here a whopping 81% of y is accounted for by, or overlapped with, x, OK? But you still have this amount unaccounted for here, 19%.
OK, so when is r undefined? This is just some additional information on the fact that you have to have some variance. You can sometimes find a relationship like this, but you can’t use r. You can’t use correlation to define this relationship if you have 0 variance in y, which is what I’m depicting here, or 0 variance in x, which is this right scatter plot here. You have to have some variance. That’s the only way you can use z scores and find this correlation r, because a z score divides by the standard deviation and you can’t divide by 0.
And I think this is my last slide on correlation, just pointing out the population correlation coefficient. There’s always a parameter to go with each statistic, so the statistic r goes with rho from the population, OK?
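Two of the cautions above can be checked numerically. The following is a rough Python sketch with invented numbers: an inverted-U "stress versus memory" pattern built from a simple quadratic (an assumption made for illustration, not real Yerkes-Dodson data), and a variable with 0 spread. The function name `pearson_r` and the choice to return `None` when r is undefined are hypothetical.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Average product of z scores; returns None when r is undefined."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return None  # 0 spread in x or y: z scores cannot be computed
    n = len(xs)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / n

# Invented inverted-U data: memory is best at moderate stress.
stress = [1, 2, 3, 4, 5, 6, 7, 8, 9]
memory = [20 - (s - 5) ** 2 for s in stress]

r_curve = pearson_r(stress, memory)
print("r for the inverted U:", r_curve)  # essentially 0 despite a clear pattern

# Every y score identical: 0 variance in y, so r is undefined.
r_flat = pearson_r(stress, [3.0] * len(stress))
print("r with 0 spread in y:", r_flat)
```

The first result shows how r can miss a strong but curvilinear relationship entirely, and the second shows why some spread in both variables is required.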
So you’re trying to estimate rho based on the r value from your sample.
And of course, correlation does not imply causation. You’ve probably heard that before. If you find that A and B are correlated, let’s say that ACT scores for your incoming freshman college students end up being related to their GPA their freshman year. They’re correlated. It doesn’t mean, however, that A is causing B, that their ACT score is somehow causing their GPA. Well, this example doesn’t work so well, because they differ in time: one comes before the other. So let me use a different example here. Let’s say violent television, or violent movies, violent media, and aggression in grade schoolers. There is a relationship there. Clearly there is a positive correlation, such that the amount of violent TV watching is positively related to the amount of aggression. As one goes up, the other one goes up, and as one goes down, the other one goes down.
But does that mean that violent television, violent media viewing, is causing higher aggression? Well, that’s just one of many possibilities. You don’t know until you do experiments and you control for these things. If you only have correlational studies, then it could be that, yeah, violent TV watching is causing aggression. It could just as easily be the case that children who were already higher in aggression are just more likely to watch violent TV, OK? Or it could be some other factor, C, that’s causing both A and B. Maybe it’s just how their parents are raising them, or maybe it’s genetics. Parents with more aggressive types of genes, if you will, pass those on to their children, and so they get children who want to watch violent TV like their parents did and also have more aggressive genes or something, OK?
You never know what it is. You just know, strictly speaking, there is a positive correlation or a negative correlation, and that’s all you know. You’ve got to stop there. People often want to jump the gun and make arguments about causality just based on correlational relationships, and that’s not appropriate.
OK, so moving on to regression. Regression is concerned with forming a prediction equation to predict y from x. So really this makes regression more interesting and practical than correlation. Correlation is just descriptive, whereas regression is more predictive.

This is where the ACT example comes in: you’re trying to use ACT scores to predict something, college GPA for example. Regression uses the following formula: y prime equals bx plus a, or y′ = bx + a.
y prime is the predicted y score on your criterion variable. In my example that is college GPA. That is your criterion variable. The y score is the actual GPA that your student gets at the end of their freshman year. y prime is what you’re predicting that their GPA is going to be. It’s unlikely those are going to be identical, but you’re trying to get the predicted score as close as you can to their actual score. But you’ve got to have that little prime there, y prime, because it is only a prediction of the y score, not the y score itself.
b is the slope. OK, we’ve all seen this before. Rise over run: the change in y divided by the change in x is your b, or slope. x is the score on the predictor variable. So in my example, the predictor variable is ACT score, OK? You’re trying to use that variable to predict your criterion variable. And a is the y-intercept, which is where this line over here crosses the y-axis, the vertical axis. It’s the value of y prime when x is 0, in other words.
And again, just like correlation, regression is strictly dealing with linear data. Now, you can generalize only for x values in your sample. So if you’re trying to predict GPA based on ACT scores, well, then you have to acknowledge right off the fact that you’re only using ACT scores. You’re not using anything else, OK?
Even if something else might be relevant, you’re just saying, well, I’m just trying to do what I can with ACT scores. Let’s say that the admission folder comes in for a student and everything gets lost except for their ACT score. So that’s all you have to work with. Well, you have to acknowledge that. You’re only dealing with x, in this case ACT scores, to predict y, the student’s GPA after their freshman year.
So as I pointed out already, the actual observed y is going to be different from what you’re predicting, but you’re trying to minimize that difference as much as possible. In other words, you’re trying to minimize error. Error is the difference between y and y prime. So y is y prime plus error. In other words, error in regression is y minus y prime, OK? So let’s say the student had great ACT scores, so you’re predicting that they’ll have a 4.0 GPA. Well, your 4.0 is your y prime. But let’s say they end up with a 3.9, OK? Well, then your error is the difference between those, 3.9 minus 4.0, or negative 0.1, OK?
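The prediction equation and the error from the example above can be sketched in a few lines of Python. The slope and intercept values here are invented purely so that an ACT score of 35 predicts a 4.0 GPA; they are not estimates from any real data, and the function name `predict` is hypothetical.

```python
def predict(x, b, a):
    """Prediction equation: y prime = b * x + a."""
    return b * x + a

# Hypothetical slope and intercept, chosen only to match the lecture's example.
b = 0.1   # predicted GPA rises 0.1 for each additional ACT point (assumed)
a = 0.5   # predicted GPA when ACT is 0, the y-intercept (assumed)

act_score = 35
y_prime = predict(act_score, b, a)   # predicted GPA, 4.0 with these values
y = 3.9                              # the GPA the student actually earned

error = y - y_prime                  # error = y minus y prime, about -0.1
print("predicted:", y_prime, "actual:", y, "error:", round(error, 2))
```

A negative error means the prediction was too high; a positive error means it was too low.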
It gives you an idea of what the error is, the fact that your prediction is never going to be perfect.
In regression it’s important to realize that there are an infinite number of potential regression lines, but there’s only going to be one best-fitting line. So this is the last slide I have for you today. The best-fitting line, depicted here as the yellow line, is the one that minimizes the squared differences between your y’s, your actual scores, and what you’re predicting them to be. And of course, you square the differences because you’re trying to get rid of negatives and positives. You want to make everything positive so the errors don’t cancel each other out. That’s why you square it. OK, so the sum of your squared errors equals the sum of all the squared differences between actual and predicted scores. Minimizing that gives you this best predicting line here. It’s the one that gets as close as you can to defining this linear relationship. I mean, ideally, if you had a perfect relationship, a one-to-one relationship, then all the data points would fall exactly on this line. But that’s not realistic. There’s always going to be some variance, there’s always going to be some error, and so there’s going to be some scatter of your points, your actual scores, around the line, but you’re trying to minimize the difference between predicted and actual scores. This is called the Least Squares Criterion.
That’s where I want to leave you. This is the last lecture for the course. As usual, if you have any questions about correlation or regression after you watch this lecture a couple of times, read the relevant chapter in your textbook. Try to get a solid grasp on it. But if you still have questions after that, feel free to go to the discussion board for this topic. And take your time on this. This is the last topic we’re going to cover, and it is definitely going to be featured on the final exam. So study hard.
Well, it’s been a pleasure having you in the course and providing these lectures to you. I’m looking forward to seeing how you do on the final exam, and of course, come to me with any questions you might have.
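As a closing illustration of the Least Squares Criterion, here is a minimal Python sketch. The lecture does not give formulas for b and a; the ones used below are the standard closed-form least squares solutions (slope as the sum of cross-products of deviations over the sum of squared x deviations, intercept as mean of y minus b times mean of x). The data are invented and the function names `fit_line` and `sum_squared_errors` are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (b, a) for the line y' = b*x + a that minimizes squared error."""
    mx, my = mean(xs), mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return b, a

def sum_squared_errors(xs, ys, b, a):
    """Sum of (y - y prime) squared across all data points."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

# Invented data: ACT scores (x) and freshman GPAs (y).
act = [18, 21, 24, 27, 30, 33]
gpa = [2.4, 3.1, 2.8, 3.0, 3.6, 3.4]

b, a = fit_line(act, gpa)
best = sum_squared_errors(act, gpa, b, a)

# Any other line, such as these nudged versions, has a larger squared error.
steeper = sum_squared_errors(act, gpa, b + 0.02, a)
higher = sum_squared_errors(act, gpa, b, a + 0.2)

print("slope:", round(b, 4), "intercept:", round(a, 4))
print("best:", round(best, 4), "steeper:", round(steeper, 4), "higher:", round(higher, 4))
```

Comparing `best` against the two nudged lines shows the idea behind "one best-fitting line": of the infinitely many candidate lines, the fitted one has the smallest sum of squared errors.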