Time commitment
5–10 minutes
Description
RStudio workshop series: Pearson correlation covers the process of computing a Pearson correlation in the RStudio software, including assumption checks and graphing.
Video
Transcript
Welcome, everyone, to RStudio workshop series #4. Today we are working on correlation, and I'm going to show you two kinds of correlation. A little bit of backstory before we get started: a lot of the stats you might have been taught, if you've ever taken a statistics class, require what's called “normality”, which is a fancy way of saying the data has to look a certain way for us to be able to use that test. So if you've ever done a correlational test before, it might have been the parametric test (i.e., the test that requires normality): that's the Pearson correlation. If you do not meet normality – we'll cover this as well today – we can instead do what's called the Spearman correlation.
So let us begin. Pearson correlation: a Pearson correlation is used to determine the strength and direction of a relationship between two continuous variables. And like I've already hinted at, this test is parametric, which means it assumes the normal distribution. It assumes normality: biggest amount of data in the middle, smallest amount of data at the ends. I've left you some links in the document today to help you run a Pearson correlation. I'm going to show you how to do it start to finish.
The first thing we have to remember to do, before we open anything on our computer, is to set our working directory; we have to tell the computer where on our computer we are working. We can do that by clicking Session > Set Working Directory > Choose Directory. This will open your File Explorer so you can point to where you have put the downloaded files for today's workshop. So if I do this, it will show me my File Explorer. I'm in Documents > Classes and Workshops > Lindsay's workshops > MICRO WORKSHOPS > RStudio. I don't see anything in this file folder, but I know I'm in the correct location, so I can click “Open”. This gives me my code in the bottom left-hand corner, in our console. It says setwd with an open parenthesis to say this is a function – it's the set-working-directory function – and it has the very long path to where I am working on my computer. Hopefully yours is a little shorter. What you have to remember to do, whether you're here at today's workshop or watching the recording, is take this function – NOT including the blue pac-man at the front, that greater-than (>) symbol – and copy and paste it onto line 23, so that if you come back, let's say tomorrow or next week, and you want to run this code again, it will point to YOUR folder structure on YOUR computer.

So we've set our working directory. Now we would like to open our file so we can work in it. We can open many different file types in RStudio. The file type we're going to open today is what we call a .csv file, or comma-delimited file. To open a .csv file in RStudio, we can use the read.csv() function. Within that function, we have to remember to give the exact name of the file as written, with the file extension, within quotation marks. So in my part of the code, I've given it a name: I've called it Fake_Data = read.csv("Fake_Data.csv").
You have to spell the file exactly right, and you have to have remembered to set your working directory, or else the file will not open. So if I highlight line 25 and I click Run, I get blue text down in the bottom left in my console showing that this worked. In the top right, I now have a data set called “Fake_Data” which has 30 observations of 7 variables.
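The file-opening steps above can be sketched as a short, self-contained R snippet. The workshop's Fake_Data.csv and folder paths aren't reproduced here, so this sketch first writes a tiny stand-in file (the name Fake_Data_demo.csv and its values are my own) and then opens it with read.csv(), just as in the transcript:

```r
# setwd() normally points at wherever you saved the workshop files, e.g.:
# setwd("C:/Users/you/Documents/RStudio")   # your path will differ

# Write a tiny stand-in CSV so the example runs anywhere
# (the real workshop file has 30 observations of 7 variables)
write.csv(data.frame(Fake_Data1 = c(2.31, 4.87, 3.02),
                     Fake_Data2 = c(1.15, 0.94, 2.67)),
          "Fake_Data_demo.csv", row.names = FALSE)

# Open the .csv exactly as described: exact file name,
# file extension included, inside quotation marks
Fake_Data <- read.csv("Fake_Data_demo.csv")
str(Fake_Data)   # confirms what loaded: 3 obs. of 2 variables
```

If the spelling or the working directory is wrong, read.csv() stops with a "cannot open file" error rather than opening anything.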
So we've opened our file; that's step one. Now we can actually start to do things with this file. Every test that we try to do has certain assumptions, or certain rules, that must be met in order for the results of the test to be valid. There are different rules for different kinds of tests, and we're going to cover all of them today. The first assumption for running a Pearson correlation is that you have two continuous variables. And we talked about continuous variables a few weeks ago. There's a common package we can use to review our data to make sure we actually have continuous variables. So if you have not come to one of my workshops before, you will need to highlight and run line 30, which is: install.packages("tidyverse"). What this does is take the tidyverse package – someone else has created this package with different functions – from the Internet and put it on your computer so that you can use the functions within that package. I already have tidyverse on my computer, so I don't need to run line 30. You can run it again if you already have it, but you don't need to. I'm instead going to highlight line 31, which is: library(tidyverse). When I run this, I get some text down in my console, and it's worked. What this means is it has loaded tidyverse for me. I don't have red text saying that there's a problem, so this has worked appropriately; I can now use the functions within tidyverse to look at my data. So let's do that! We're going to take a look at, or take a “glimpse” of, the dataset that we have open in RStudio. We can use the glimpse() function: glimpse(Fake_Data), because that's what we called our data set. So I can highlight this and click Run, and it will show me my data in the bottom left, in our console. It says we have 30 rows and 7 columns, which is good because that matches what's in the Global Environment in the top right. And it's got a dollar sign to say “here's a different column for you”.
So we've got $Gender, which is listed as <int> or integer data. The data we're going to use is $Fake_Data1, which is listed as <dbl> or double, which is RStudio's way of saying continuous, and we're also going to use $Fake_Data2, which is a different column in our dataset, which is also listed as <dbl> or double, which is RStudio's way of saying continuous data. So what this means is we have met this assumption; our data is considered continuous because we're going to use the Fake_Data1 column and the Fake_Data2 column. Another way to do this, which I don't have in the slides or in the script for you, is you can just double-click on “Fake_Data” and it will open it like it's an Excel spreadsheet directly within RStudio. And you can look at your variables and see what's going on here. So you could do the glimpse() function to look across and it will tell you the data types it's listed. You could also look at your data and say: “Hmm, what kind of data do I have here? In Fake_Data1, I've got a range of values, I've got a bunch of decimals, that looks continuous”. The glimpse() function helps us double-check that RStudio has appropriately put the correct data type on that column of data. So because we've got <dbl>, which is RStudio's way of saying continuous data, we get our check mark, we're good to go.
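Here is a minimal sketch of that data-type check, on simulated stand-in columns (the real Fake_Data.csv isn't reproduced here). glimpse() needs the tidyverse installed, so it is shown commented out; base R's str() reports the same column types without any packages:

```r
# Simulated stand-in for the workshop data set
Fake_Data <- data.frame(Gender     = c(1L, 2L, 1L),          # <int>
                        Fake_Data1 = c(2.31, 4.87, 3.02),    # <dbl>
                        Fake_Data2 = c(1.15, 0.94, 2.67))    # <dbl>

# With tidyverse loaded, glimpse() lists each column with its type:
# library(tidyverse)
# glimpse(Fake_Data)

# Base-R equivalent, no packages needed:
str(Fake_Data)

# Assumption 1: both correlation variables are continuous (double)
is.double(Fake_Data$Fake_Data1)   # TRUE
is.double(Fake_Data$Fake_Data2)   # TRUE
```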
We've just finished assumption one: we checked to make sure that we've got the correct data type for our variables, and we've got two continuous variables. That means we've passed that assumption and we're allowed to move on. Assumption number two: we should have what's called a linear relationship between our two continuous variables. There are many ways to check linearity. The easiest way is to just make a scatter plot. We can do what's called “visual inspection”; we can plot one thing on the X[-axis], one thing on the Y[-axis], and just have a look at it and say: “Does it look kind of like a line?”. There are lots of ways to make plots – really fancy, really pretty plots – in RStudio; I'm just showing you the most basic way to do this, on line 39. We're literally going to use the plot() function: we say "Fake_Data$Fake_Data2" to say plot Fake_Data2, that variable, then give it a comma and say "Fake_Data$Fake_Data1" to say we'd also like to plot Fake_Data1, that variable. So if we highlight and Run line 39…mine gave me some red text in the console, because my plot [window] is itty bitty right now – I was working on something else. So if I just make this bigger and run this again, it should show up. Boom. Sometimes the graphs down here are a little hard to see, so you can click the “Zoom” button, which will make the plot full size so you can look at things a little better. And what we're looking for here – it doesn't have to be perfect, doesn't have to be exactly a straight line – is: “Does this kind of look like a straight line?” Is it going flat across the page, up to the right, or down to the right? The direction doesn't really matter, as long as it looks roughly like a straight line. Here, we might actually have a little bit of a problem.
Because if I sort of cut this graph in half, what we see is the left half of the graph looks like it's going down to the right, and the right half of the graph looks like it's going up to the right. Which is a fancy way of saying we kind of have this “V” or “U” shape going on. And if you've ever taken like a high school math class, you might know that that's generally not considered a straight line; that's some kind of quadratic. So in this case we might not have passed this assumption. Just by visual inspection alone, it looks like we might not have a straight line going on. So it's important to always check your assumptions, because if we didn't check this assumption, we wouldn't know that maybe it's inappropriate to use this test, because it doesn't look like we have a line here. Today, we're just going to pretend it's a perfectly straight line, we're going to do the test anyway. But it's a reminder to always check your assumptions.
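The linearity check can be sketched with simulated data that deliberately has the same “U” shape problem described above (the simulated values and the seed are my own, not the workshop's):

```r
# Simulate 30 observations where Fake_Data1 relates to Fake_Data2
# quadratically, producing a "U"-shaped scatter rather than a line
set.seed(42)                                   # arbitrary seed, for reproducibility
x <- seq(-3, 3, length.out = 30)
Fake_Data <- data.frame(Fake_Data1 = x,
                        Fake_Data2 = x^2 + rnorm(30, sd = 0.5))

# Basic scatter plot, as on line 39 of the workshop script
plot(Fake_Data$Fake_Data2, Fake_Data$Fake_Data1)
```

Visual inspection of this plot shows the halves heading in opposite directions, the quadratic pattern that would fail the linearity assumption.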
All right, I'll make this so we can see things again.
So we finished assumption two. Our data doesn't look fully linear, but we're going to keep going today and pretend it looks okay. The next thing we have to do is check assumption three: each variable, so both Fake_Data1 and Fake_Data2, should be approximately normally distributed (i.e., this is the “parametric” part of this test; this is normality). There are several ways to check normality. I'm going to show you the statistics version. Depending on how big your dataset is – how many observations you have per condition – there are two different statistics that we generally use. One is called the Shapiro-Wilk statistic, for fewer than 50 observations. If we have 50 or more observations per group, we would normally use what's called the Kolmogorov-Smirnov statistic. We have pretty small groups today – we've got 30 per group – so we're going to use the Shapiro-Wilk statistic to check normality. The way we do this is we use the shapiro.test() function, and within that function we have to give it the name of the data set, so Fake_Data. We give it a dollar sign ($), and then we give it the name of one of the variables. And we have to run this twice, because both variables need to pass normality or we can't use this test. So I can actually highlight both of these at the same time [shapiro.test(Fake_Data$Fake_Data1) and shapiro.test(Fake_Data$Fake_Data2)] and click Run, and get the output from both at the same time. So for Fake_Data1, it says “Shapiro-Wilk normality test”, it tells you which data you used, it gives you your statistic, and it gives you a p-value. The p-value for Fake_Data1 is .4551. Because this is an assumption check, if p is greater than (>) .05, we get our check mark or green light; we have passed normality for Fake_Data1. We also need to check Fake_Data2, so that's just a little bit lower. It says Shapiro-Wilk, it tells you which data set you used, it gives you your statistic, and it gives you your p-value.
And here we have p equal to .002933. This is less than (<) .05, which means we have failed normality. We get a big red X, we hit a stop light; we say something is wrong. We cannot use this test [the Pearson correlation]; it would be inappropriate because we did not pass the normality assumption. That means you would then switch to the non-parametric correlation, which is called the Spearman correlation. We're going to run both today: we're going to pretend that it worked and run the Pearson, and then we'll also run the Spearman. So here, because one of our two variables has failed normality, we say the entire thing has failed. Both need to pass in order for us to be able to run this test. The next assumption, assumption four, is no significant outliers for each variable, so for both Fake_Data1 and Fake_Data2. I haven't left us any code for this today, because different fields or areas of study have different ways of checking for outliers, including things like visual inspection using box plots, or doing things with the mean and standard deviation. So I've left you some text there if you are running this on your own data; if you are part of a research team and you need to remove outliers, you definitely want to check for those. You want to chat with your research team about how that is done in your lab or in your field. If you have removed outliers, you can then re-run your other assumptions, things like linearity and normality. Sometimes removing outliers can actually fix normality. In our example, normality is broken for one of our two variables; if we were to remove outliers and check again, we might find normality is okay, which would mean we're allowed to use this test. So always check that.
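A runnable sketch of the Shapiro-Wilk check, using simulated columns chosen to mirror the pass/fail pattern described above (one roughly normal variable and one strongly skewed one; both are my own stand-ins, not the workshop data):

```r
set.seed(1)                                        # arbitrary seed, for reproducibility
Fake_Data <- data.frame(Fake_Data1 = rnorm(30),    # drawn from a normal distribution
                        Fake_Data2 = rexp(30)^2)   # strongly right-skewed

# Run the test once per variable; both must come out p > .05 to pass
shapiro.test(Fake_Data$Fake_Data1)   # p should land above .05 here
shapiro.test(Fake_Data$Fake_Data2)   # p should land far below .05: normality fails
```

Because the skewed variable fails, the overall normality assumption fails, which is exactly the situation that sends you to the Spearman correlation.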
All right, so we've finished checking our assumptions; now we can actually run the test. This is the part that actually doesn't take very long. The assumptions generally take a little longer than running the test, but it's really important that we check those assumptions, because like I just said, if you've failed one or more assumptions, you probably can't use the test you're trying to run, and you might have to switch to a different test. So if we're actually running the Pearson correlation today, we're going to check the strength and direction of the relationship between Fake_Data1 and Fake_Data2 – it's all fake data. They could be whatever you want! It could be…I don't know…let's see if height impacts your happiness. That would be a silly question, but we can ask different questions to see: “What's the strength and direction of that relationship?” That's what we're doing with the Pearson correlation. We're going to use the cor.test() function. And if we don't know what that does, we can highlight line 68, which is ?cor.test(), and what this will do is give us, down in the Help window, the R documentation for what this test actually does. So cor.test() is from the stats package, which comes built into R, and it says: “Test for Association / Correlation Between Paired Samples. Description: Test for association between paired samples, using one of Pearson's product moment correlation coefficient, Kendall's tau, or Spearman's rho”. Those are all very stats-y symbols; you don't have to know those. It's a fancy way of saying you can use cor.test() for three different things.
Today we're going to use it for the Pearson correlation. So how do we actually do that? All we have to do is give it the name of the test we want to run, the Pearson, and give it the two variables we're trying to use. So let's review this together. On line 70, we've got: cor.test(Fake_Data$Fake_Data1, Fake_Data$Fake_Data2, method = "pearson"). That's the function we're using, which is to say we're going to do a correlation test in our fake data set: we're going to compare Fake_Data1 to Fake_Data2, using a Pearson correlation. So we can click Run, and in the bottom left, it will tell us exactly what we just did. It gives us in blue text the code we ran, and then in black text the actual result of that test. So we've got Pearson's product-moment correlation. It gives you the data, so we've run Fake_Data1 versus Fake_Data2. It gives you a t-value, your degrees of freedom, and a p-value. Our p-value here is .6373; non-significant, because this is greater than (>) .05. It says “alternative hypothesis: true correlation is not equal to 0”. It gives us a 95% confidence interval, and it gives us some sample estimates as well. In inferential statistics, the main thing most people tend to care about is the p-value. So here we have a non-significant correlation, which is a fancy way of saying “we failed to be able to say whether there is some sort of relationship or association here”. And if we go back to our plots for a second, because we've already plotted our scatterplot (I can zoom in on this…), that kind of makes sense! We don't have a very tight cluster of dots along a very obvious line. And – reminder – it doesn't really make sense to run this test in the first place, because the data doesn't look particularly linear; we've got more of a “U” or “V” shape, which is quadratic, not a line. So it makes sense that this test isn't really working as expected, because we failed that assumption. It's saying we can't really say that there's an association here.
We don't have a tight cluster of [dots] going up in one direction, for example. So it makes sense. And that is how you run the Pearson correlation, which is the parametric correlation. It assumes normality, and we broke normality for one of our two pieces.
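The full cor.test() call can be run end-to-end on simulated stand-in data (two independent normal variables, my own values, so the correlation should come out weak, much like the non-significant result above); the last line shows the one-word switch to the non-parametric Spearman correlation mentioned in the transcript:

```r
set.seed(123)                                   # arbitrary seed, for reproducibility
Fake_Data <- data.frame(Fake_Data1 = rnorm(30),
                        Fake_Data2 = rnorm(30))

# Pearson correlation, as on line 70 of the workshop script
result <- cor.test(Fake_Data$Fake_Data1, Fake_Data$Fake_Data2,
                   method = "pearson")
result            # prints t, df, p-value, 95% CI, and the sample estimate r
result$p.value    # the number most people look at first

# If normality fails, the non-parametric alternative is one word away:
cor.test(Fake_Data$Fake_Data1, Fake_Data$Fake_Data2, method = "spearman")
```

The Spearman version reports rho instead of r and makes no normality assumption, which is why it is the fallback when Shapiro-Wilk fails.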
License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.