Attribution
By Lindsay Plater
Time commitment
5 - 10 minutes
Description
The purpose of this video is to explain how to conduct a Pearson correlation test using SPSS (requires two continuous variables). This tutorial is designed to help students and researchers understand: the data type required for the test, the assumptions of the test, the data set-up for the test, and how to run and interpret the test.
Video
Transcript
What is a Pearson correlation? A Pearson correlation is used to determine the strength and direction of a relationship between two continuous variables. We'll talk about this quite a bit: you have to make sure you've got two continuous variables in order to run this test. A Pearson correlation is what we call a parametric test, which means this test assumes normally distributed data. So, this assumes normality. If you're like “Hey, what the heck does that mean?”, it means the data have to follow the typical bell-shaped curve, which I've got on the slide for you there. We'll review this in a little more detail in a couple of slides.
I've left you some links on the slide if you need some extra help after this workshop. We've got the SPSS LibGuide, which I've written; the video for this will end up living in that section. We've got a Laerd Statistics guide; we haven't written this one, but I think it's a really good resource. And then there's also the SPSS documentation; SPSS literally publishes their own documentation on what you need to run a test and the steps for doing that. So if you need some more help, there's help available.
Our assumptions. These are the pieces that should be met in order for the results of your test to be valid. If you don't meet the assumptions, you maybe can't use the test you're trying to run. So, we're going to look at four different assumptions before we actually run this test.
The first is that you must have two continuous variables, no categorical variables here.
There has to be a linear relationship between the variables. So, when you plot them together, the points should look like a straight line.
Each of your variables (so, both of the two continuous variables you're using) needs to be approximately normally distributed (so, looking like that standard bell-shaped curve).
And there should be no significant outliers in either of your variables. And we're going to cover all of these on the next few slides. So here we go, we're going to start with number one.
[Slide contains a screenshot of a table in SPSS within Data View. The table’s column headers are as follows: Gender, Fake_Data1, Fake_Data2, Fake_Data3, Fake_Data4, Colour, and Group.]
So, if you're running a Pearson correlation, the first assumption you need to check is whether you have continuous data for your two variables. For our Pearson correlation today, we are going to be using “Fake_Data1” and “Fake_Data2” from the data set I've emailed you.
[Fake_Data1 and Fake_Data2 columns are highlighted.]
And if we look at these, we can look at Fake_Data1 first and then Fake_Data2 second, and what we see is we've got a range of values and a lot of decimals. Decimals are a pretty big giveaway that you probably have continuous data, so this data meets the continuous-data requirement. If you were looking at, for example, our Gender column, that's not continuous, that's categorical; we couldn't use our Gender column for something like a Pearson correlation.
[Table in SPSS now has the Graphs menu open with Chart Builder highlighted.]
Our next assumption is that the Pearson correlation requires linear data, so the relationship between Fake_Data1 and Fake_Data2 should be linear.
There's a really easy way to check this in SPSS, and if you have yours open you're welcome to follow along. If you click the “Graphs” button and you select “Chart Builder”, we can make ourselves a little graph of our two fake data variables to see whether it looks like we have a linear relationship between them.
When you click Graph > Chart Builder, the first thing that probably pops up is going to be a dialog box, essentially saying: “Hey, have you set up your data correctly?”. Because you got the data from me, you can say “Yep, I've got it, everything's good. OK. Move on.”.
If you do that, you'll get the Chart Builder dialog box which looks like this. It's a little unintuitive the first time you use it.
[SPSS Chart Builder dialog box is open. The left panel lists available variables such as "Gender," "Fake_Data1," "Fake_Data2," and others. The center panel has a small “Chart preview” window. Below, the Gallery tab provides chart type options, with "Scatter/Dot" selected. Beside the Gallery tab you can select Basic Elements, Groups/Point ID, or Titles/Footnotes. The right panel has the Element Properties tab selected, but you can also select the Chart Appearance or Options tabs. Buttons for OK, Paste, Reset, Cancel, and Help appear at the bottom.]
What you have to remember is the first thing you pick is what kind of graph you want. So in the bottom left corner, you will click the “Scatter/Dot” button. Then you'll grab the first graph option, which is these dots here, and you'll click and drag that up to the preview area (you've got some blue text here if you've opened this for the first time); you drop it on the blue text, and it'll come up with a little fake data plot for you.
[Scatter/Dot options selected in the Gallery section. A red arrow connects Scatter Dot icon with the Chart Preview button section above.]
So, pick the graph you want, literally drag and drop the graph you want, and then select your variables. Your variables are on your left-hand side, you grab Fake_Data1 and put it either on the X or the Y, here I've put it on the Y-axis, and you grab Fake_Data2 and you put it on the other one, either X or Y, here I've put it on the X-axis. When you do that, it will generate just like a fake little piece of data for you to look at [in the “Chart preview” window] to be like “Is that the kind of graph you want?”. Yep, that one looks right. We want a scatterplot to look for a linear relationship, so we click “OK”.
So, Graph > Chart Builder, say “OK” because you know you've set the data up properly, grab the plot that you want, put the variables where you want them (so variable one on one axis, variable two on the other), and then click “OK”, and you'll get something that looks like this. Yours is probably blue; I made mine red for this.
[SPSS scatter plot displaying the relationship between Fake_Data1 (Y-axis) and Fake_Data2 (X-axis). The chart title reads "Scatterplot of Fake_Data1 by Fake_Data2." The plot consists of multiple red data points scattered across the graph.]
What you are looking for here is a linear relationship between your two continuous variables. So, something that generally looks like it's increasing, or something that generally looks like it's decreasing. And I've made this one a tricky example on purpose: if you were to cut this plot in half and look at the left side first, the left side actually looks like it's decreasing, and the right side looks like it's increasing.
If we look at this as a whole, it's potentially increasing overall, but someone might be able to argue you've actually got a bit of a U-shape here, so maybe you can't run this test because your data might not be linear.
Like I said, I've made this one tricky on purpose. You should be thinking to yourself: “Hm, can I really run this test? Is this appropriate?”. And you always want to be asking that when you're looking at your assumptions. So, did we pass linearity? Maybe? It looks like in general it's kind of going up, but maybe we didn't. So put that in your back pocket.
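If you ever want to double-check that “left half vs. right half” intuition outside SPSS, here's a minimal Python sketch. The data are made up to mimic the tricky U-shaped scatterplot (nothing here comes from the workshop data set): correlating each half separately shows the left half trending down and the right half trending up, even though a single overall line would hide that.

```python
import numpy as np

# Invented U-shaped data, loosely mimicking the tricky scatterplot:
# y = (x - 5)^2 plus a little noise, so the left half decreases
# and the right half increases.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = (x - 5) ** 2 + rng.normal(0, 1, size=30)

# Correlate each half of the data separately.
mid = len(x) // 2
r_left = np.corrcoef(x[:mid], y[:mid])[0, 1]
r_right = np.corrcoef(x[mid:], y[mid:])[0, 1]

print(r_left)   # negative: the left half is decreasing
print(r_right)  # positive: the right half is increasing
```

A single Pearson r over all 30 points would wash these two opposite trends together, which is exactly why the scatterplot check matters before you run the test.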
[Table is now open in Data View with the Analyze menu open and Descriptive Statistics selected. From the Descriptive Statistics sub-menu Explore is highlighted.]
Alright, two more assumptions, normality and outliers, and we check these assumptions at the same time; we use the same procedure to check these, generally, within SPSS.
We're looking, again, to have approximately normal data in both of our two variables, and we want to check the data for both of these two variables to say, are there any outliers (i.e., any data points that are far away from all of the other data points that might be skewing our data).
So, to check normality and outliers, we click Analyze > Descriptive Statistics > Explore, which I haven't shown you before. This is our first time doing this if you're in the workshop series. So again: Analyze > Descriptive Statistics > Explore. And we'll check normality and outliers at the same time.
[SPSS Explore dialog box with "Fake_Data1" and "Fake_Data2" selected in the Dependent List field. The left panel lists available variables, including "Gender," "Fake_Data3," "Fake_Data4," "Colour," and "Group." The Display section at the bottom allows choosing between Both, Statistics, or Plots, with "Both" selected. To the right, the Explore: Plots dialog is open, showing options for Boxplots, Descriptive plots (Stem-and-leaf, Histogram), and Normality plots with tests (which is checked). Additional settings for Spread vs Level with Levene are visible but not selected.]
You will open the Explore dialog box, which looks like this. You take your two continuous variables (they should have a little yellow ruler next to them to say that they're continuous), you take them from the left side, you click the first blue arrow, and you put them where it says “Dependent List:”. So you take your two variables and put them in the Dependent List box.
The next thing you want to do when checking for normality and outliers is click the Plots button. There are a few selections in here that will make our output a little bit more useful. Where it says “Descriptive”, you want to uncheck “Stem-and-leaf” (folks don't really use that very much), and you want to check where it says “Histogram”; you're asking SPSS to generate a histogram for you. You also want to click where it says “Normality plots with tests”; this will give you some actual statistics as well. If you forget this part, you can always come back to it again, but you want to make sure you've got your statistics and your plots. Once you've done that, you can click “Continue” [in the Explore: Plots dialog] and then you can click “OK” [in the Explore dialog], and that should give you everything you need for normality and outliers.
When you are checking the assumption of normality, SPSS will generate some output for you if you've made the selections on the previous screen. The first thing you can look at is a statistic that tells you, statistically, do we think this has passed or failed?
[SPSS Tests of Normality table displaying results for two datasets: "Fake data: =85 + 5*rand()" and "Fake data: =90 + 5*rand()". The table includes results from Kolmogorov-Smirnov and Shapiro-Wilk tests, showing Statistic, df (degrees of freedom), and Significance (Sig.) values for each dataset.]
If you scroll down to the box in your output that says “Tests of Normality”, there's two pieces here; it gives you two different kinds of tests. On the right-hand side, we have the Shapiro-Wilk statistic. You generally read the Shapiro-Wilk statistic if you have 50 or fewer observations in each of your groups. So here we have 30 and 30 [in the df column], so we'll be interpreting the Shapiro-Wilk. But I also wanted to point out that if you're working with your own data and you've got more than 50 (roughly) observations in each of your groups, you should be looking at the version on the left, which is the Kolmogorov-Smirnov statistic. Today, like I said, we've got fewer than 50 per group, so we're actually going to look at Shapiro-Wilk.
It gives you your statistic, but the part you're going to look at is where it says “Sig.”. Sig, or significance, is where your p-value is listed. In your Sig. column, if you have p less than (<) .05, this means you have failed normality, there is a problem, you have broken this assumption (i.e., stop now, you can't use the Pearson correlation or you shouldn't use the Pearson correlation). If you have p greater than (>) .05, it means you're OK, you've passed normality.
And an important thing to remember for this test is that you need two continuous variables, and if EITHER of them has failed normality, you should not be using the Pearson correlation. So we can see here for Fake_Data1, our first line, our p-value is .455, which means we have passed normality for Fake_Data1. But for Fake_Data2, we have a p-value of .003, which has failed normality, so we should not be using the Pearson correlation.
I'm going to keep going and show you how to run this test anyway, just so that you know how to do it, and then I'm going to show you what to do if you're thinking “Hey, I've actually failed normality on my own data, what do I do now?”. Well, you switch to the Spearman correlation, which I'll show you in a few minutes. So this is our statistic that shows us whether we have passed or failed normality.
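For anyone curious what this workflow looks like outside SPSS, here is a hedged Python sketch using scipy. The two variables are invented stand-ins (not the workshop data): one roughly normal, one deliberately bimodal so it fails Shapiro-Wilk, which then triggers the switch to Spearman.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Invented stand-ins: one roughly normal variable, and one bimodal
# variable (two tight clusters), which should clearly fail normality.
fake1 = rng.normal(87, 1, size=30)
fake2 = np.concatenate([rng.normal(90.5, 0.3, size=15),
                        rng.normal(93.5, 0.3, size=15)])

# Shapiro-Wilk is the appropriate test here because n = 30 per variable
# (50 or fewer observations).
_, p1 = stats.shapiro(fake1)
_, p2 = stats.shapiro(fake2)  # p2 < .05: fails normality

# Because at least one variable fails normality, fall back to the
# rank-based Spearman correlation instead of Pearson.
rho, p_spearman = stats.spearmanr(fake1, fake2)
```

The decision rule mirrors the transcript: check each variable's Sig. (p) value, and if either one comes in below .05, don't run Pearson.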
Some folks also like to do what's called visual inspection, so they'll create some graphs and look at some graphs and say, does it look like it passed or failed?
[Two SPSS histograms displaying the distribution of Fake_Data1 and Fake_Data2. The left histogram represents Fake_Data1, with values ranging from 85.0 to 89.0, and frequencies peaking around 87.0. The right histogram represents Fake_Data2, with values ranging from 90.0 to 94.0, showing an uneven distribution with a noticeable peak near 94.0. Both histograms use red bars to indicate frequency counts on the y-axis.]
So, in your output, since we've asked for histograms, you should also have two histograms. My first histogram here on the left is Fake_Data1. If we look, it roughly approximates that normal distribution, so that would mean it's a pass; it's highest in the middle and lowest-ish at the ends. So that matches: the statistics say we passed, and the graph says we passed.
But if we look at Fake_Data2, it's really highest at the ends, there's not much data in the middle at all, that does not pass normality. It does not look like our standard bell-shaped curve, so visual inspection would fail for Fake_Data2.
So visual inspection, you can use a histogram. You can also use a Q-Q plot which is also generated for you in your output.
[Two SPSS Normal Q-Q Plots comparing the observed and expected normal values for Fake_Data1 (left) and Fake_Data2 (right). Both plots have Observed Values on the x-axis and Expected Normal Values on the y-axis, with red dots representing data points.]
With your Q-Q plot, you've got a straight line going across the page. Your data points (I've made my dots red; yours are probably blue) should roughly follow the line. The graph on the left is Fake_Data1; there's a couple of data points not quite touching the line, but for the most part the points follow the line, which means we pass visual inspection on the Q-Q plot for Fake_Data1.
But if we look at Fake_Data2, we can see a lot of these data points are falling away from the line. Most of the data points aren't even on the line at all, so we have failed visual inspection for Fake_Data2.
So our statistics and our graphs are showing us: hey, Fake_Data1 is good, Fake_Data2 is not. Because at least one of our two variables has failed normality, we should not be using this test. But again, I said I'm going to show you how to do it anyway, so we're going to move on and look at outliers next.
So, the same Explore procedure you've just run will show you both normality and outliers. To look at outliers, this is one way to do it: SPSS uses what we call the boxplot method, so it generates some boxplots.
[Two SPSS boxplots comparing the distributions of Fake_Data1 (left) and Fake_Data2 (right). Both boxplots are red, with the median represented by a central line inside each box.]
There are other ways to check for outliers; SPSS doesn't make them quite as easy, so if you want to do something like means and standard deviations, come and make an appointment with me and we'll chat about that instead. But if you're okay with the boxplot method, SPSS makes that super easy.
So, a boxplot. What this is: you've got your median value in the middle [the line inside each boxplot], and you've got your interquartile range with these whiskers [lines extending above and below the red boxes]. What this is essentially showing you is that the bulk of your data is probably in this chunk here [indicating the area of the boxplot and whiskers]. An outlier would be any data point that falls quite a bit away from the bulk of your data; in this case, any data point beyond those whiskers. A circle with a number next to it would indicate that that participant number, that row in your data set, is an outlier. A star would indicate that that row of your data is an extreme outlier. And we want to be careful with outliers because outliers can skew the result of our analysis.
So here we actually see no outliers at all in Fake_Data1 or Fake_Data2; we've got no circles, we've got no stars, so we're good to go, thumbs up.
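As an aside, the boxplot rule SPSS is applying here can be sketched in a few lines of Python. The numbers below are invented for illustration, and note one caveat: SPSS builds the box from Tukey's hinges, which can differ slightly from numpy's percentiles. The idea is the same either way: points beyond 1.5 interquartile ranges from the box are outliers (circles), and beyond 3 are extreme outliers (stars).

```python
import numpy as np

# Invented data with one obvious high value.
data = np.array([85.2, 86.1, 86.8, 87.0, 87.4, 88.1, 88.5, 99.0])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1  # interquartile range (the height of the box)

# Tukey's boxplot rule: circles beyond 1.5*IQR from the box edges,
# stars beyond 3*IQR.
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
extreme = data[(data < q1 - 3.0 * iqr) | (data > q3 + 3.0 * iqr)]

print(outliers)  # [99.] -- flagged as an outlier (in fact extreme)
```

With our actual Fake_Data1 and Fake_Data2, nothing falls beyond the whiskers, which is why the SPSS boxplots show no circles or stars.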
[Table is now open in Data View with the Analyze menu open and Correlate selected. From the Correlate sub-menu Bivariate is highlighted.]
If we pretend we have met all of the assumptions for the Pearson correlation, I'll show you how to run it now. You don't want to move on to this step if you failed one or more of these assumptions. So we've actually failed a few, normally we wouldn't do this step, but for demonstration I'll show you how to do it today.
To run a Pearson correlation, you click Analyze > Correlate > Bivariate. So again, to run a Pearson correlation, if you've passed all your assumptions: Analyze > Correlate > Bivariate. That will open the Bivariate Correlations dialog box.
[SPSS Bivariate Correlations dialog box displaying selected variables Fake_Data1 and Fake_Data2 for correlation analysis. The left panel lists available variables, including "Gender," "Fake_Data3," "Colour," and "Group." The Correlation Coefficients section offers options for Pearson, Kendall’s tau-b, and Spearman, none of which are selected. The Test of Significance section allows for Two-tailed or One-tailed tests, with Two-tailed selected. Additional options at the bottom include flagging significant correlations, showing only the lower triangle, and displaying the diagonal.]
You take your two variables that you would like to check to see whether they are correlated and you move them to the “Variables:” box. And then you make sure down here it says Pearson; you want to make sure the Pearson box is checked. I believe it's checked by default, but you want to make sure you've got the correct thing checked and then you click OK. So you pick your two variables and you say these are the ones I want to check.
If you have done that, you will get an output that looks something like this.
[SPSS Correlations output in the Statistics Viewer window. The left panel displays the Output Navigator, showing a hierarchy under "Correlations" with sections for Title, Notes, and Correlations. The right panel presents the Pearson correlation results for two datasets. The table has three rows: Pearson Correlation, Sig. (2-tailed), and N.]
It's small, so if you're just doing the Pearson correlation without checking your assumptions, you'll only have this one piece of information. What you should have is a big output that has all of your assumption checks, and then this at the bottom if you've done it properly.
So, to read the Pearson correlation output, there's three main pieces here:
The first one is your Pearson's r value [in the Pearson Correlation row]. This tells you the strength and direction of the relationship. If it's positive, it means “as one thing increases, the other thing increases”, a positive relationship. If there were a negative sign in front, if this were negative .09, it would say “as one thing increases, the other thing decreases”, a negative relationship. A value closer to 1 is a strong relationship; a value closer to 0 is a pretty weak relationship. So here, our value is .09: a weak positive relationship.
This “Sig.” value, your significance value, tells you your p-value; it says whether this correlation is statistically significant. A value less than (<) .05 in the Sig. column indicates you've got a significant relationship between these variables. A value greater than (>) .05 indicates we can't say whether there is a significant relationship between these variables.
And then, as always, your “N” is just your number of observations: how many participants or observations are you looking at in this relationship? Here we're looking at 30.
So we have a weak, positive, non-significant relationship based on 30 observations.
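To connect those three numbers in the SPSS table to something concrete, here is a small Python sketch with scipy, using invented paired data built to correlate positively. scipy's pearsonr returns the same two quantities the SPSS table reports, Pearson's r and the two-tailed p-value, and N is just the number of pairs.

```python
import numpy as np
from scipy import stats

# Invented paired data, constructed to have a positive relationship.
rng = np.random.default_rng(2)
x = rng.normal(87, 1, size=30)
y = 0.8 * x + rng.normal(0, 0.5, size=30)

r, p = stats.pearsonr(x, y)  # Pearson's r and two-tailed p-value
n = len(x)                   # the "N" row in the SPSS table

direction = "positive" if r > 0 else "negative"
significant = p < 0.05
print(f"r = {r:.2f} ({direction}), p = {p:.3f}, N = {n}")
```

Reading it off the same way as the SPSS output: the sign of r gives the direction, its distance from 0 gives the strength, and p < .05 marks the correlation as statistically significant.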
[Questions? Contact us. UG Library. Website: lib.uoguelph.ca. Email: library@uoguelph.ca.]
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.