RStudio Workshop Series: Spearman Correlation

Transcript

The next thing we can do is run the Spearman correlation. We've already talked about this: we broke a few of the assumptions of the Pearson correlation, so what can we do instead? We can try running the Spearman correlation. A Spearman correlation is used to determine the strength and direction of the relationship between the rankings of two variables; the variables can be ordinal or continuous. It's a non-parametric test, which means we don't have to meet the normality assumption here. And I've left you some links in case you're looking for help with the Spearman correlation.

Every test we do has certain assumptions that must be met in order for the results of the test to be valid. Assumption one: our two variables for the Spearman correlation should be ordinal or continuous. Continuous data in RStudio is generally <int> (integer), <dbl> (double), or <num> (numeric). Ordinal data is the “ordered factor” type, so it has to be set as a factor, specifically a factor with an order.

So we've already covered this part if you were just here for Pearson correlation; there's a common package for reviewing data. If you needed to install tidyverse, you can run line 93. If you're just starting for the first time today and you haven't just done Pearson correlation, you would need to run line 94. And then what we can do is, on line 95, we can take a glimpse of our data. And this shows us we have 30 rows with 7 columns. And if we're using Fake_Data1 and Fake_Data2, those are both listed as <dbl> or double format, which is RStudio’s way of saying “this is continuous data”.
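As a rough sketch of this type check: the data frame below is a hypothetical stand-in for Fake_Data (the workshop's real file is not reproduced here, and only three of its seven columns are mocked up). glimpse() comes from dplyr, which tidyverse loads; the sketch falls back to base R's str() if dplyr is not installed.

```r
# Hypothetical stand-in for the workshop's Fake_Data (values are made up).
set.seed(1)
Fake_Data <- data.frame(
  Group      = sample(1:3, 30, replace = TRUE),          # stored as <int>
  Fake_Data1 = round(rnorm(30, mean = 50, sd = 10), 1),  # stored as <dbl>
  Fake_Data2 = round(rnorm(30, mean = 20, sd = 5), 1)    # stored as <dbl>
)

# install.packages("tidyverse")  # only needed once per computer
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(Fake_Data)  # rows, columns, and the type of each column
} else {
  str(Fake_Data)             # base R fallback showing similar information
}

# Both variables should report "numeric", i.e. continuous data
sapply(Fake_Data[c("Fake_Data1", "Fake_Data2")], class)
```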

So we have met assumption one because we're going to be using two continuous variables. I've also left you some code (because we haven't talked about ordered factors yet, we haven't really discussed ordinal data) for if you were using a different variable. So on line 99, if you wanted to use the “Group” variable which is currently listed as integer (<int>) but has the options 1, 2, and 3…if you wanted to treat those as an ordered factor (1 being you like a small drink, 2 being you like a medium drink, 3 being you like a large drink), there's a certain order there, it has to go 1-2-3 or 3-2-1. If you wanted to set that to factor, so that you could use this with an ordinal data type, you can do that using line 99. We don't need to do this right now because we're not using this group, but it uses the factor() function, and what you do is you say which variable, so Fake_Data$Group to say “I want to use the Group variable”. You say “I want there to be a certain order here”, so ordered = TRUE; this sets it to an ordered factor. And then you tell it what levels in which order, so it goes: 1-2-3. You can't do 2-1-3, that doesn't make sense. It has to be small, then medium, then large. So if we wanted to run this, we can. And then on line 101, we just check to make sure it worked; we ask for the class() of the Group variable, and it says this is an “ordered” “factor”, which means it's an ordinal data type, we can now use it. So this is helpful code IF you're working on your own data and you're trying to make sure it's set properly in RStudio to ordinal data, which we're not doing right now. So it's just for if you need it later.
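The conversion described above might look something like this. The six Group values are invented for illustration; only the factor() call and the class() check mirror what the transcript describes.

```r
# Hypothetical stand-in for the Group column (values 1, 2, 3)
Fake_Data <- data.frame(Group = c(1L, 3L, 2L, 1L, 2L, 3L))

# Convert to an ordered factor: 1 (small) < 2 (medium) < 3 (large)
Fake_Data$Group <- factor(Fake_Data$Group, ordered = TRUE, levels = c(1, 2, 3))

# Check that the conversion worked
class(Fake_Data$Group)   # "ordered" "factor"
levels(Fake_Data$Group)  # "1" "2" "3"
```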

So we've passed assumption one, we're going to be using two continuous variables. Assumption two, is you have to have paired observations. Which can be a little tricky if you didn't collect the data yourself. So paired means if I was a participant in the study, I have a Fake_Data1 variable and a Fake_Data2 variable; I've got 2 observations. We can look at the data using the view() function, just to see whether it looks like things are correct. So if we do this, it's not super clear if you didn't collect this data. It's fake data, so today it works fine and I can tell you it's fine. So generally we would want something like a participant column to say I would be participant one and I would have a gender item, a Fake_Data1 item, a Fake_Data2 item, et cetera. So today it's fake data, we're going to say it's okay. It's important that you know your data and you've got it set up properly. You've got paired data so you've got a score for both pieces that you're working on. So today we're going to say we've met assumption two.
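If your own data does have a participant column, a quick check along these lines can back up what you see in the viewer. This is only a sketch: the Participant column and the values are hypothetical, and one score is deliberately set to missing to show what an unpaired row looks like.

```r
# Hypothetical data: participant 3 is missing a Fake_Data1 score
Fake_Data <- data.frame(
  Participant = 1:5,
  Fake_Data1  = c(12.1, 15.3, NA, 9.8, 11.0),
  Fake_Data2  = c(3.2, 4.1, 5.0, 2.7, 3.9)
)

# View(Fake_Data)  # opens the spreadsheet-style viewer in RStudio (interactive only)

# TRUE for rows that have a score on BOTH variables
paired <- complete.cases(Fake_Data[c("Fake_Data1", "Fake_Data2")])

sum(paired)                           # number of complete pairs
Fake_Data$Participant[!paired]        # participants missing one of the two scores
anyDuplicated(Fake_Data$Participant)  # 0 means each participant appears only once
```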

Assumption three, we need a monotonic relationship, which is another fancy stats term. It just says your relationship is either generally increasing or generally decreasing. You don't want it to be doing both. The easiest way to check to see if you have a monotonic relationship, no switchbacks, no “S”s, no “U”s, nothing weird going on, just kind of a straight line going up or down…is we can plot it. If you were just here for Pearson correlation, we've already done this, but we can do it again. On line 113, we've got just the most basic version of a scatterplot. So it's: plot(Fake_Data$Fake_Data2, Fake_Data$Fake_Data1). And if we run this, we get our plot in the bottom right, which we can zoom in on. Does this look like a monotonic relationship? Does it look like it's just increasing or just decreasing?
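Here is a minimal sketch of that scatterplot check. The data are simulated to have a deliberate U shape (they are not the workshop's values), and the lowess() trend line is an addition of mine, not something the workshop script is known to include; it just makes the direction of the trend easier to see.

```r
# Simulated U-shaped data standing in for Fake_Data (values are made up)
set.seed(2)
x <- seq(-3, 3, length.out = 30)
Fake_Data <- data.frame(
  Fake_Data2 = x,
  Fake_Data1 = x^2 + rnorm(30, sd = 0.5)
)

# Basic scatterplot, same form as in the workshop script
plot(Fake_Data$Fake_Data2, Fake_Data$Fake_Data1,
     xlab = "Fake_Data2", ylab = "Fake_Data1")

# Optional: add a smoothed trend line to judge whether the trend only goes one way
trend <- lowess(Fake_Data$Fake_Data2, Fake_Data$Fake_Data1)
lines(trend, col = "blue")
```

With these simulated values the trend line falls and then rises, which is the kind of shape that would not count as monotonic.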

If I were to draw a line kind of in the middle of this plot, the left side of this plot is decreasing, generally, and the right side of this plot is increasing. So we've got some kind of “U” or “V” shape going on. It doesn't really look like it's just increasing or decreasing; it looks like it's kind of doing both. Which might be considered a problem for this test. We might have just failed this assumption. It's a fancy way of saying, maybe we can't use this test because the relationship isn't going in just one direction. So we can close that. It's a reminder to check your assumptions. The Spearman test here might not be appropriate because we might not have a monotonic relationship. Today we're going to pretend it's totally fine, we're just going to do the test anyway. I also have left you an expert tip: non-parametric tests do not care much about outliers. It's just how the tests work: they're using the ranks of the values, not the raw values themselves. So you don't have to find and remove outliers if you are doing the Spearman correlation.
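To see why outliers matter so little here, this small sketch (with invented numbers) shows that the Spearman coefficient is the Pearson coefficient calculated on ranks. An extreme value only counts through its rank, so making the largest value far larger leaves rho unchanged while the Pearson r moves.

```r
# Invented example values
x <- c(2, 4, 5, 7, 9, 11)
y <- c(1, 3, 2, 6, 8, 10)

rho_original <- cor(x, y, method = "spearman")
rho_by_hand  <- cor(rank(x), rank(y))  # Pearson correlation of the ranks

# Turn the largest y value into an extreme outlier; it is still ranked highest
y_outlier <- y
y_outlier[6] <- 1000

rho_outlier <- cor(x, y_outlier, method = "spearman")
r_original  <- cor(x, y)          # Pearson r, original data
r_outlier   <- cor(x, y_outlier)  # Pearson r, with the outlier

all.equal(rho_original, rho_by_hand)  # TRUE: Spearman is Pearson on ranks
all.equal(rho_original, rho_outlier)  # TRUE: the ranks did not change
c(r_original, r_outlier)              # Pearson r shifts noticeably
```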

How do we actually run the Spearman correlation? Well, we can double-check the function that we're trying to use, which is cor.test(), so I can run line 125 which is: ?cor.test() and it tells us (if you were just here for Pearson correlation, it's the same thing): “Test for Association / Correlation Between Paired Samples. Description: Test for association between paired samples, using one of Pearson's product moment correlation coefficient, Kendall's tau, or Spearman's rho”. So we could use this for Pearson's correlation, or instead, we could say we're trying to run a Spearman test. So what we give this test, is we say: cor.test(Fake_Data$Fake_Data1, Fake_Data$Fake_Data2, method = "spearman"). So what this is doing, is we're using the cor.test() function, and we're saying in our fake dataset, compare Fake_Data1 to Fake_Data2 using a Spearman correlation. So I can highlight this and I can click Run. And in the bottom left, in our console, we get in blue text, the thing we actually ran: the cor.test(). And it says Spearman's rank correlation rho. “Data” tells you the two variables you asked for. We have the test value, our p-value, and some text that says: “alternative hypothesis: true rho is not equal to 0”. And it gives us a sample estimate. Again, in inferential statistics, the thing you generally care about the most is the p-value. So if we look at our p-value, we've got .9214, which is greater than (>) .05, which means we can't say there is a relationship between these two variables. If p was less than (<) .05, we would have found statistical significance and we'd be able to say there's a relationship. If I go to my Plots window, I can look at this because the scatterplot's a great way to visualize any type of correlation. And it kind of makes sense that we don't have anything significant here; we don't have a tight cluster of data going up or down in a certain direction. We don't really have a relationship here, because we're using a test for monotonic relationships on something that doesn't really look monotonic. That is how you run a Spearman correlation. Do folks have questions?

We went through Pearson correlation and Spearman correlation pretty quickly. The one thing I haven't covered with you too strongly yet, is you could use your rho value – generally you do this if your p-value is statistically significant; you use the rho value if you're using Spearman correlation, you use your Pearson r value if you're doing a Pearson correlation. And that will tell you about the strength and direction of the relationship. So here, because our p-value is non-significant, we would just kind of stop; we wouldn't talk more about what's going on. But if this was statistically significant, or if it was important to talk about even if it was not statistically significant, we could use the Pearson r value, or the Spearman rho value, to talk about the strength and direction of the relationship. These values range from -1 to +1. So if you had -1 it would be a perfect line going from the top left to the bottom right of the page, and every dot would be on that perfect straight line. If you had a +1, it would be a perfect positive correlation, and every dot would be exactly on the line going from the bottom left to the top right.
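As a sketch of the call described above, the example below runs cor.test() on simulated data (a hypothetical stand-in, so the p-value and rho will not match the .9214 from the workshop). Saving the result to an object, here called result, is my own addition; it makes it easy to pull out the p-value and rho separately.

```r
# Simulated stand-in for Fake_Data (two unrelated continuous variables)
set.seed(3)
Fake_Data <- data.frame(
  Fake_Data1 = rnorm(30),
  Fake_Data2 = rnorm(30)
)

# Run the Spearman correlation and keep the result
result <- cor.test(Fake_Data$Fake_Data1, Fake_Data$Fake_Data2,
                   method = "spearman")

result            # full printout, as seen in the console
result$p.value    # compare against .05
result$estimate   # rho: strength and direction, between -1 and +1
result$statistic  # S, the test value
```

If your data contain tied values, cor.test() will warn that it cannot compute an exact p-value; adding exact = FALSE requests the approximate p-value explicitly.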

Here, our rho value is .01-something, or if we round, .02. Which is saying there's not much going on. It's technically a positive relationship (i.e., things are going up to the right) because the value is greater than 0, but it's not a very strong value. The relationship is not a strong relationship, because the value is only .02. And again, a perfect positive relationship would be +1.0. So again, we wouldn't really talk too much about this here, because we don't have anything significant happening. But you could use the Pearson r value or the Spearman rho value to talk about the strength and direction of the relationship if it was statistically significant.
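To make the two ends of the scale concrete, here is a small sketch with invented values. One detail worth noting for Spearman: because it works on ranks, rho reaches +1 or -1 whenever the data always move in one direction, even if the dots do not fall on a straight line.

```r
x <- 1:10

# Always increasing (but curved): a perfect positive Spearman correlation
rho_up <- cor(x, x^3, method = "spearman")

# Always decreasing (but curved): a perfect negative Spearman correlation
rho_down <- cor(x, -sqrt(x), method = "spearman")

# The Pearson r for the same curved data falls short of 1
r_up <- cor(x, x^3)

c(rho_up = rho_up, rho_down = rho_down, r_up = r_up)
```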

Time commitment

10 - 15 minutes

Description

RStudio workshop series: Spearman correlation covers the process of conducting a Spearman correlation in the RStudio software, including assumption checks and graphing.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

The library is committed to ensuring that members of our user community with disabilities have equal access to our services and resources and that their dignity and independence is always respected. If you encounter a barrier and/or need an alternate format, please fill out our Library Print and Multimedia Alternate-Format Request Form. Contact us if you’d like to provide feedback: lib.a11y@uoguelph.ca
