RStudio Workshop Series: Kruskal Wallis H Test

Transcript

What if you failed some of the assumptions for the one-way ANOVA? Or what if you don't have the correct data type for a one-way ANOVA? You do have an option instead: you could run the non-parametric version of the one-way ANOVA, which is called the Kruskal-Wallis H test.
A Kruskal-Wallis H test is used to determine whether three or more groups’ medians on the same continuous variable differ. So this is similar to a one-way ANOVA. The trick is (I said continuous variable) you can also use this for ordinal data as well.
So if you don't have the right data type for a one-way ANOVA, you have ordinal data, you could instead use the Kruskal-Wallis H test for that. Or if you failed some of your assumptions for your [ANOVA], you could use the Kruskal-Wallis H test for that as well. This test is non-parametric, which means we don't assume the data must be normal. We don't assume that bell-shaped curve with the highest data in the middle and the lowest at the ends. And I've left you some help guides for if you're trying to run a Kruskal-Wallis H test on your own, if you wanted those links.
Every test that we run has certain assumptions that must be met in order for the results of the test to be valid. Let's jump into our assumptions. Assumption #1 for the Kruskal-Wallis H test is that the dependent variable must be ordinal or continuous. Continuous data in RStudio is listed as <int>, <dbl>, or <num> data types.
Ordinal data in RStudio is listed as “factored” – or sorry – “ordered factor”; it literally will say the words “ordered factor”. There's a common package that we can use to review our data. If you were just here for one-way ANOVA, we already did this, so I don't need to run these two lines here (152 or 153). But I can take a glimpse() of my Fake_Data. We've already adjusted some of this using the one-way ANOVA, but if we're using Fake_Data1 today, this is listed as <dbl> or double data type, which is RStudio’s way of saying continuous data. It's got some decimals. So we've already met the assumption of continuous data.
If you were using a different column…we're going to pick on the Group column here for a minute. We're not doing this right now, but if you were using a different column of data, let's say you were looking to make this ordinal data. Right now, Group is listed as <int>, which is integer. So it's got the numbers 1, 2, and 3. But if we wanted to use this as our dependent variable today, we would need to change this data type to say: “ordered factor”. And how you do this is using the factor() function. So: factor(Fake_Data$Group, order = TRUE [to set it as ordinal data], levels [to say which order they go in: is it 1, 2, 3? Is it 3, 2, 1? Maybe it's 2, 3, 1; maybe you coded it kind of funny and you need to adjust your levels] = c [for the concatenate function] (“1”, “2”, “3”)).
So reading this left to right, if I was using a different variable and it was the wrong data type and I needed to set it to order factor, it could look something like this: Fake_Data$Group = factor(Fake_Data$Group, order = TRUE, levels = c(“1”, “2”, “3”)). If I Run this…it spits it out in blue, so it looks like it worked. There's a bunch of different ways to check the class of your data type. I'm showing you a new function just for fun. We say class(Fake_Data$Group) to just get the one. And it says: “ordered factor”.
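As a minimal sketch of the ordered-factor step described above: the small data frame here is a hypothetical stand-in for the workshop's Fake_Data (the values are invented), and the argument is spelled out in full as ordered = TRUE. The shorter order = TRUE used in the workshop is accepted by R's partial argument matching.

```r
# Hypothetical stand-in for the workshop's Fake_Data (values invented for illustration)
Fake_Data <- data.frame(Group = c(1, 3, 2, 1, 2, 3))

# Convert the numeric codes to an ordered factor, with the order 1 < 2 < 3
Fake_Data$Group <- factor(Fake_Data$Group, ordered = TRUE, levels = c("1", "2", "3"))

class(Fake_Data$Group)   # check the class of just this one column
levels(Fake_Data$Group)  # confirm the order the levels go in
```

Changing the levels argument (for example to c("3", "2", "1")) is how you would reverse the order if your coding ran the other way.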
So if you're working on your own data and you're like: “Lindsay, I'm not working with continuous data, how do I make it ordinal?”, that's how you can set your stuff as ordinal. The order matters; in this case, we're saying it goes “1, 2, 3”. Or you could say “3, 2, 1”. But here it would not make sense to say “2, 1, 3”, because in this fake data, “1” means small, “2” means medium, “3” means large. It has to go in that order to make sense; that's what we mean when we say ordinal data. Anyway, that was an aside. Today, our dependent variable is continuous; we're going to use Fake_Data1 that's listed right now as <dbl>. Which means we're in the correct data type, we've passed this assumption.
Assumption #2: the independent variable must be categorical with three or more independent groups. This means if I was a participant in the study, I am only in group 1, or group 2, or group 3; I can only be in one of the options. We're going to use the Colour variable today as our grouping variable. If you were just here for one-way ANOVA, you already did all of this, so some of this actually won't work. Line 175 will break if you have already done this line. But we'll get to that. So we're going to check our data type: we can run glimpse() of Fake_Data. We're going to use the Colour group. If you were just here, this is already listed as <fct>, which is factor, which means categorical, which is RStudio's way of saying: “This is the correct data type, we've passed this assumption”. If you've just opened the dataset for the first time, you will need to run line 170, which sets the Colour group to “factor” data type.
I've already done this, I can Run it again, it won't break, it'll just say it worked. And when I take a glimpse(), it will still list it as “factor”, which is fine. If I try to Run line 175, which is taking the numbers and making them into words…if I try to Run this, I'm going to get some red text because I've already done this. And it says: “Error in recode, unused arguments” (i.e., it can't do this; I don't have any 1s, I can't set the 1s to blue because I've already done this). But if I View() the data, we can see that it is already set up properly. In the Colour column, everyone has one word listed for their Colour. We've got options: purple, blue, yellow, and red. So it looks like it worked.
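The workshop script's own lines 170 and 175 are not shown in this transcript, so the following is only a sketch of one way to turn number codes into colour words so that it is safe to Run twice. It uses base R's factor() with labels rather than the recode step the workshop uses. The data frame is a hypothetical stand-in, and the mapping beyond 1 = blue is an assumption made for illustration.

```r
# Hypothetical stand-in for the workshop's Fake_Data (values invented for illustration)
Fake_Data <- data.frame(Colour = c(1, 2, 3, 4, 1, 2))

# Assumed mapping: only "1 means blue" comes from the transcript; the rest is illustrative
colour_labels <- c("blue", "purple", "yellow", "red")

# Only convert if the column still holds numbers, so running this twice does no harm
if (is.numeric(Fake_Data$Colour)) {
  Fake_Data$Colour <- factor(Fake_Data$Colour, levels = 1:4, labels = colour_labels)
}

class(Fake_Data$Colour)  # "factor"
table(Fake_Data$Colour)  # how many participants chose each colour
```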
So we've passed this assumption: our independent variable is categorical. We have 4 different groups (which is more than three), and they're independent because you only pick one of those Colour options as your answer. So we pass that assumption.
Assumption #3: the data must be independent. Independence means different things depending on what level of stats you're talking about. In this case, it generally means each row should be one unique participant, and they should not be related in some way. Doesn't mean blood relation, it means you don't want all of your best friends to answer the survey; maybe you don't want everyone in the same class to answer the survey. You potentially want to do more random sampling so you get a bunch of different people across the university, for example. Hard to check this one after-the-fact, but today it's all fake data. I can tell you we've met the assumption of independence because it's not actually real data. But this is something you should be thinking about if you're conducting your own experiments.
Assumption #4: all groups’ distributions should have the same shape. So this means we have 4 colour groups, we should plot the four colour groups and say: “Do they look like they have the same shape?”. This is a bit of a new assumption for us, if you've come to the workshops here before. There aren't that many tests that have this as an assumption. So we've actually already split our Fake_Data dataset with our different Colours into the different subsets. So the “blue” group, for example, only has the blue participants. So we can use that to plot our different groups. We can say hist(blue$Fake_Data1), and if we Run that, we get a histogram of all the folks who entered blue as their favourite colour. So it looks…you know, pretty flat. We could Zoom in if we wanted. The first chunk is kind of low, then we got kind of high, then we got kind of in the middle, and we got kind of high. Looks overall pretty flat. What about red? Well, red is higher on the left, and lower on the right. What about yellow? Yellow's pretty low, and then there's one point that's a little bit higher. What about purple? Kind of high on the left, and still kind of high on the right. You can flip back and forth using these arrows to say: “Do they look like the same shape?” We need to be careful here, because our y-axis is changing a little bit. So here we have zero to three, zero to four, zero to two, zero to three. We don't have that much data for this specific dataset because it's a fake dataset. We've got 9, 8, 6, and 7, so it's hard to say that they're a perfect match. We're not necessarily looking for it to be perfect, we just want it to be kind of close. This one sticks out to me as being the weirdest of the bunch; the yellow data set has one bar that is much higher than the others.
The other ones, I would say they're all pretty close. So this is a warning to always check your assumptions; you don't know if you've passed your assumptions unless you actually check them. Today, it's a little tricky to say whether we passed or not. They're not super similar, but they're not all that different because we don't actually have that much data. So maybe it's appropriate to use the test, maybe it's not. You have to make a judgment call about what is “similar enough”.
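Because the changing y-axis makes the four histograms harder to compare, one option is to draw them with the same bins and the same y-axis. This is a sketch on simulated stand-in data, not the workshop's actual dataset: the group sizes 9, 8, 6, and 7 come from the transcript, but which colour gets which size and all of the values are invented.

```r
set.seed(1)
# Simulated stand-in data; values and the size-to-colour assignment are invented
Fake_Data <- data.frame(
  Colour = factor(rep(c("blue", "red", "yellow", "purple"), times = c(9, 8, 6, 7))),
  Fake_Data1 = round(rnorm(30, mean = 87, sd = 3), 1)
)

# Shared bin edges that cover the full range of the data
breaks <- pretty(range(Fake_Data$Fake_Data1), n = 6)

# One numeric vector per colour group
groups <- split(Fake_Data$Fake_Data1, Fake_Data$Colour)

# Tallest bar across all four groups, used as the common y-axis limit
max_count <- max(sapply(groups, function(x) {
  max(hist(x, breaks = breaks, plot = FALSE)$counts)
}))

old_par <- par(mfrow = c(2, 2))  # show the four plots in a 2 x 2 grid
for (g in names(groups)) {
  hist(groups[[g]], breaks = breaks, ylim = c(0, max_count),
       main = g, xlab = "Fake_Data1")
}
par(old_par)  # restore the previous plot layout
```

With all four panels on one screen and one scale, the judgment call about "similar enough" no longer depends on flipping between plots.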
And the last thing I have here for you is an expert tip: non-parametric tests, because they're not using means, don't care about outliers. So if you took the outliers out for the one-way ANOVA, you can actually put them back in before you do the Kruskal-Wallis H test because you don't care about outliers. For this test, you should still remove impossible values, but outliers don't matter.
Okay, we've checked our assumptions. We think we've probably met the assumptions, it's a little hard to say with that last one. But it looks like we've met our assumptions. So let's actually run the Kruskal-Wallis H test. Let's see what we say.
So what we're doing here, is we're looking at Fake_Data1’s medians and saying: “Does the median differ based on the participants’ favourite colour?” What is Fake_Data1? Well, it's a fake dataset; fill it in with what you want. It could be how much fish oil you had this morning. It could be how much snowfall there was in that person's backyard. Doesn't matter. Pick your favourite thing, and say we're going to check the medians. And say: “Does that differ based on the participants’ favourite colour?”
If you were just here, you've already Run lines 202 and 203, so you might not need to do them again. If you get an error code asking you to install the “coin” package, you will need to Run line 205. But if you've already run all that, the next thing we're going to do is line 207. We want to use the kruskal_test() function, and if you don't know how to do that, you can ask for help. So it's: ?kruskal_test(). That will open up the Help window, with the documentation in R to say: “What is this test?”. So kruskal_test {rstatix}; it's in the “rstatix” package. It says: “Kruskal-Wallis Test. Description: Provides a pipe-friendly framework to perform Kruskal-Wallis rank sum test. Wrapper around the function kruskal.test(). Usage: kruskal_test(data, formula, …)”. It gives you the arguments that you can use, it tells you stuff about the different values…might be a little confusing if you're not used to reading documentation.
But I will show you what we do. So we can go over here. We need to use the same format we used a minute ago for the one-way ANOVA. We say kruskal_test(Fake_Data1 ~ Colour, data = Fake_Data). So we're using our full dataset, not the smaller versions. And we're saying the dependent variable is Fake_Data1 (that's the continuous or ordinal data we want to look at). And we're going to split it up based on Colour and say: “Are there differences in the median?”.
If we click Run, we get some output. It's listed, it says: “A Tibble: 1 x 6”. Don't worry about a tibble, it's just a data type in RStudio. We've got a value here saying here's Fake_Data1, our “n” or number of observations total is 30 (so if I add up all my groups I have 30 total), my statistic is .539, my degrees of freedom is 3, my p-value is .1 or sorry .91, and the method (what did it use?) it used the Kruskal-Wallis test. So our p-value is greater than (>) .05, which means we cannot say that the median of Fake_Data1 differed significantly between our four Colour groups.
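Since the help page describes kruskal_test() as a wrapper around kruskal.test(), the same test can be sketched in base R with no extra packages. The data below is simulated stand-in data (group sizes from the transcript, values invented), so the numbers it prints will not match the workshop's .539 and .91.

```r
set.seed(42)
# Simulated stand-in data; values and the size-to-colour assignment are invented
Fake_Data <- data.frame(
  Colour = factor(rep(c("blue", "purple", "yellow", "red"), times = c(9, 8, 6, 7))),
  Fake_Data1 = round(rnorm(30, mean = 87, sd = 3), 1)
)

# Same formula layout as in the workshop: dependent variable ~ grouping variable
result <- kruskal.test(Fake_Data1 ~ Colour, data = Fake_Data)

result$statistic  # the H statistic (labelled "Kruskal-Wallis chi-squared")
result$parameter  # degrees of freedom: number of groups minus 1
result$p.value    # compare this against .05
```

The degrees of freedom are 3 here for the same reason as in the workshop output: there are four Colour groups.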
If it was less than (<) .05, we'd be able to say yes, there was a difference somewhere between the Colour groups. But that's not what we have today.
We might also want an effect size. The effect size for a Kruskal-Wallis H test is the eta-squared effect size. And we can run this with kruskal_effsize(Fake_Data1 ~ Colour, data = Fake_Data). So it's the exact same text as inside the Kruskal test, but now we're doing the Kruskal effect size function instead. We can Run this…it pops out again another tibble. It says: you used Fake_Data1, our n is 30, this is our effect size, it's eta2[H], and it says it's a moderate effect size.
How do you report this now that you've Run it? You need to give it your statistics, so it's capital italicized H (for Kruskal-Wallis H test); that value, the .539, comes from the Kruskal-Wallis H test that you ran right here. You give it your p-value, which is this .91. You might also (I didn't write this in) want to put in your degrees of freedom, so you might do something like this and put a 3 here. And your eta-squared value, you'll want to report this value here. You don't have to keep the negative sign, that's up to you whether you want to. But you could report it as a positive so it's .0947.
You would also probably want the medians of the four groups, because this is a test that uses medians, so you can use: median(red$Fake_Data1). Do the same thing for blue, for yellow, for purple. If I Run all of these…our median for red is 86.6. Our median for blue is 87.6. Our median for yellow is 87.3. And our median for purple is 87.5.
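To see where the negative eta-squared comes from, here is a short worked calculation. It assumes the formula given in the rstatix documentation for kruskal_effsize(), eta2[H] = (H - k + 1) / (n - k), where H is the test statistic, k is the number of groups, and n is the total number of observations. The three input values are the ones read out in the workshop.

```r
H <- 0.539  # Kruskal-Wallis H statistic from the workshop output
k <- 4      # number of Colour groups
n <- 30     # total number of observations

# eta-squared based on H (formula as documented for rstatix::kruskal_effsize)
eta2_H <- (H - k + 1) / (n - k)

round(eta2_H, 4)  # -0.0947
```

The value is negative whenever H is smaller than k - 1, which is the case here (.539 is less than 3).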
Those are all pretty similar, so it kind of makes sense that we maybe didn't find anything statistically significantly different here. The last thing we might want to do is graph this to say: “What does this actually look like? Let's have a visual of what we're doing.”. It's not a graphing workshop today, but we can quickly run this boxplot. The best visualization when you're doing stuff with medians is a boxplot. So we're going to plot Fake_Data1 and our four Colour groups [boxplot(Fake_Data$Fake_Data1 ~ Fake_Data$Colour, ylab = “Fake_Data1 Median”, xlab = “Colour”, names = c(“blue”, “purple”, “yellow”, “red”))]. And if we Run this, it'll spit out a plot for us. It's a little small, I can click the Zoom button in the Plots window to make this a little bit bigger.
We've got our four Colour groups: we've got our blue group, our purple group, our yellow group with an outlier (which we don't care about because in non-parametric tests we don't care about outliers, so that's fine that that's there), and we have our red group. And we see that the medians (these thick black lines) are pretty similar between our four groups. This one's a little bit lower, but we're also seeing a lot of overlap between our inter-quartile ranges of our different groups. So these are very similar boxes; they're overlapping a lot. And we also see quite a lot of overlap in our whiskers. Which means it probably makes sense that our statistic says we can't say there are differences here. And that is how you run a Kruskal-Wallis H test.
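As a sketch of the medians and the boxplot together: the data below is simulated stand-in data (values invented), tapply() gets all four medians in one call instead of four separate median() lines, and the boxplot takes its labels from the factor levels rather than from a names argument, so the labels cannot end up in a different order than the boxes.

```r
set.seed(7)
# Simulated stand-in data; values and the size-to-colour assignment are invented
Fake_Data <- data.frame(
  Colour = factor(rep(c("blue", "purple", "yellow", "red"), times = c(9, 8, 6, 7))),
  Fake_Data1 = round(rnorm(30, mean = 87, sd = 3), 1)
)

# Median of Fake_Data1 within each Colour group, in factor-level order
group_medians <- tapply(Fake_Data$Fake_Data1, Fake_Data$Colour, median)
group_medians

# Boxplot using the formula and data arguments; labels come from the factor levels
bp <- boxplot(Fake_Data1 ~ Colour, data = Fake_Data,
              ylab = "Fake_Data1", xlab = "Colour")

# The third row of bp$stats holds the thick middle line (the median) of each box
bp$stats[3, ]
```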

By Lindsay Plater

Time commitment

10 - 15 minutes

Description

RStudio Workshop Series: Kruskal Wallis H Test shows the process of completing the non-parametric Kruskal-Wallis H test in the RStudio software (including assumption checks and graphing).

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

The library is committed to ensuring that members of our user community with disabilities have equal access to our services and resources and that their dignity and independence is always respected. If you encounter a barrier and/or need an alternate format, please fill out our Library Print and Multimedia Alternate-Format Request Form. Contact us if you’d like to provide feedback: lib.a11y@uoguelph.ca
