Video
Transcript
Welcome, everyone, to RStudio Micro Workshop series lesson #6. Today we're covering the independent samples t-test and the Mann-Whitney U test. A little bit of background, starting with the independent samples t-test: an independent samples t-test is used to determine whether two groups’ means on the same continuous variable differ. It's a parametric test, which means it assumes normality. If we think of a standard histogram, it means the highest amount of data in the middle, the lowest amount of data at the ends. Our typical bell-shaped curve; the data must look approximately normal. If you need a little bit of additional help running the independent samples t-test, I've left you a few links in the code to some guides that can help you through that as well.
The first thing we want to do before we can do any work today, is we want to set our working directory and open our file. So on line 23, you have MY working directory. The easiest way for you to get YOUR working directory here is to go to the top of the screen and click “Session”. Click the button that says “Set Working Directory”. And then click the button that says “Choose Directory”. This will open up your File Explorer so you can go through your computer folder structure to tell the computer where is the file that you're working with today. So if you click: Session > Set Working Directory > Choose Directory, it will look something like this. I'm in: Documents > Classes and Workshops > Lindsay’s workshops > MICRO WORKSHOPS > RStudio. Hopefully yours is shorter than that. On my computer, it doesn't actually look like there's anything here. But I know this is the correct location, so I can click “Open”. In the bottom right in the Files window, it will then show you the files in the folder that you are working in. So here, I see the different workshop series workshops that we're doing for the entire semester and next semester, so I know I'm in the right spot. The other way I know it works is I get some blue text in the bottom left in our Console. You can actually copy this, it starts with “setwd()” or “set working directory”. You can copy this function and replace what's on line 23 so that the next time you open this file, it points to YOUR folder structure and where the folder is on YOUR computer. So I've set my working directory; I've told the computer: “This is where my file is, the file I would like to open in about 3 seconds so that I can actually do work with that file”. If you have appropriately set your working directory, the next thing you want to do is Run line 25. This is using the “read.csv()” function. So we know it's a function because it's some text and then some round brackets. 
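The working-directory steps described above can be sketched in code. The path is the only thing you need to change: in practice you'd paste the setwd() line that RStudio prints in the Console after Session > Set Working Directory > Choose Directory. This sketch uses a temporary folder purely so it runs anywhere:

```r
# Stand-in for your own folder path; in real use, copy the setwd() line
# RStudio prints in the Console after Session > Set Working Directory >
# Choose Directory.
my_folder <- tempdir()   # hypothetical location, just so this sketch runs
setwd(my_folder)         # point R at that folder
getwd()                  # confirm where R is currently looking
list.files()             # the same file list you see in the Files window
```

Once the working directory is set, you can refer to files in that folder by name alone, with no path in front.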
So it's: read.csv(“___”) and then the exact name of the file you're trying to open with the file extension [read.csv(“Fake_Data.csv”)]. We can use the read.csv() function today because the file we're trying to open is a .csv or comma-delimited file. So this is in base R, it's really easy to use. The other thing you want to remember to do is give this a name, so I've just called mine: Fake_Data. You could call yours something else; very standard if you're just working on something for practice, you can just call it “df” for dataframe. So we can Run line 25: we can highlight it or leave our cursor somewhere on it, and we can click the Run button. We get some blue text in the bottom left in the Console showing that this worked. And in the top right in our Global Environment we now have a dataset called Fake_Data that has 30 observations of 7 variables. If I wanted to, I could double-click this in the Global Environment, and it will actually open like it's an Excel file within RStudio. So we can see what we've actually got for our data file today, so that's kind of nice. So we know that this worked; we've opened our file, we're ready to do stuff in that file. Let's get started. Every test we do has certain assumptions that must be met in order for the results of the test to be valid. One of my favorite sayings. We're going to check the assumptions of the test, and if we meet all the assumptions, it means we're allowed to use the independent samples t-test today. So assumption number one: our dependent variable must be continuous. We just looked at our file, we could see that a bunch of the different columns had some decimals, which is a giveaway that it's continuous data. But we also can use RStudio to tell us: “Is this actually continuous?”. It looks continuous to us, what does RStudio think the data type is? If you have not come to a workshop before, you'll need to run line 30, which is install.packages(“tidyverse”).
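A minimal, runnable sketch of the import step. Since Fake_Data.csv isn't bundled with this transcript, the sketch first writes a tiny made-up stand-in file to disk (the column names and values here are invented):

```r
# Write a tiny stand-in file so read.csv() has something to open.
writeLines(c("Gender,Fake_Data1",
             "0,87.2",
             "1,86.5",
             "0,90.1"),
           "Fake_Data.csv")

# Name on the left, exact file name (with the .csv extension) in quotes.
Fake_Data <- read.csv("Fake_Data.csv")
Fake_Data   # print it; in RStudio it also appears in the Global Environment
```

The name on the left of the assignment is up to you; the file name in quotes has to match the file on disk exactly, extension included.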
We've already done this if you've come to any of my workshops before, this is a very common package to use to look at data and work with data. So line 30 is taking this package from the Internet and putting it on your computer so you can use it. I don't need to run line 30, I've already installed this. I WILL need to run line 31; we need to run line 31 anytime we're trying to use this package: library(tidyverse). What does this line do for us? It lets us actually use the tidyverse package, or the tidyverse library, today. So I want to Run line 31. I get some blue text. And I get a little bit of output in my Console. I don't get any red text, no error code; it looks like everything worked. On line 32, what we can do, is we can ask for a glimpse() of our data set (i.e., we can look at our data and say: “How does it look? Does it look correct? Are the data types being put in properly?”). So: glimpse(Fake_Data). And if I highlight this and click Run, I get some output in my Console. It tells me the line that I just ran. It says I have 30 rows with 7 columns, which matches what's in our Global Environment, so that's good. And then it has dollar sign ($) to say: “Which column in that data set are we looking at?”. We've got $Gender that's listed as <int> or integer. We've got $Fake_Data1 which is listed as <dbl> or double, which is RStudio's way of saying continuous data. We're actually going to be looking at Fake_Data1 today, so for assumption number one, our dependent variable must be continuous: Fake_Data1 is listed as <dbl> or double, which is RStudio's way of saying continuous, which means we pass this assumption. Check mark, we're good to go. The next thing we want to do is check assumption 2. Assumption 2 says the independent variable must be categorical with two independent groups. It means you can either be in Group A or Group B, you can't be in both. And there's only two groups, there's no Group C. You can only be in group 1 or group 2.
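As a sketch of what glimpse() reports, here it is on a small made-up data frame (the real workshop file has 30 rows and 7 columns, but the shape of the output is the same):

```r
library(dplyr)  # glimpse() is loaded as part of the tidyverse

# Made-up stand-in: Gender coded 0/1 as integers, one decimal score column.
df <- data.frame(Gender = c(0L, 1L, 0L), Fake_Data1 = c(87.2, 86.5, 90.1))
glimpse(df)
# Rows: 3
# Columns: 2
# $ Gender     <int> 0, 1, 0
# $ Fake_Data1 <dbl> 87.2, 86.5, 90.1
```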
We're going to be using our Gender variable today, or our Gender column, because in this fake dataset we only have two options for gender. We just ran glimpse() and we saw that this is listed as <int> or integer. This is generally NOT what we want if we're doing an independent samples t-test. We probably want it to be set to something else. We're going to want it to be set to a categorical data type. <int> is a numeric type, so R would treat those 0s and 1s as numbers (continuous data); we're going to change that.
I've also left you a little bit of code on your computer for just in case your file has imported something incorrectly. So if I double-click in the Global Environment, sometimes different columns might not import properly. So if yours looks like mine, where it's got “Gender”, you're fine, you don't need to run anything else. But if you get a weird “i” with two dots over it and then a dot dot before Gender [ï..Gender], (i.e., something's gone weird, something's gone funky and it's not working properly), you can run line 44 to rename() that so it just says “Gender”, so you don't have to write that weird symbol every time you do it. So we can skip that today because mine looks fine. If yours looks weird, just make sure you run line 44, and if it's not working let me know. Okay. So, we just said we've run glimpse() a moment ago, Gender’s listed as <int>; that's not right, we want to make sure it's set as categorical. We have to change it to what's called “factor” type. So we can do this on line 50. On line 50, we're going to use the as.factor() function. And we're going to say, give it Fake_Data, our data set. Give it a dollar sign ($) to say within that data set, this is the column I want. And give it the Gender column. What this is doing is saying: even though Gender is currently listed as 0 and 1, set them as two separate groups. Don't treat them as numbers, treat them as two separate groups like you would if they were the words “male” and “female”, for example. And then you have to set that equal to the data set dollar sign ($) Gender, because that will save it to our data set. And it will overwrite what is currently there, it will overwrite that it's integer, and set it to factor instead. So we can highlight line 50 and we can click Run. So we've got: Fake_Data$Gender = as.factor(Fake_Data$Gender). And remember everything's case sensitive, so it has to have the exact same capitalization as what's actually there.
We get some blue text in our Console, which means it looks like it worked. We want to make sure it actually worked, so we can re-run our glimpse() line. We're going to re-check our dataset using glimpse() to say: “Did it change the data type?” And if I click Run…we still have blue text, which means it worked. We have 30 rows, 7 columns. But $Gender is now listed as <fct>, which is factor, which is RStudio's way of saying categorical. This means we have set it to the correct data type. So right now, we have currently passed the assumption 2, that your independent variable has two independent groups. You can either be in the “Male” group or the “Female” group. And that it's independent: you're either one or the other, but it's two groups total. And it's set to categorical. I've left a little bit more code here. This part's not necessary, but it's nice for us so that we don't have to try to remember everything. So we don't know right now what “0” means and what “1” means. Maybe 0 is Men and 1 is Women, or maybe 1 is Men and 0 is Women. We don't know. So to make stuff a little bit easier for us as we're reading through what we're doing, we can run what's on line 55. What line 55 is doing, is taking those 0s and 1s, and putting words on it so that we can understand what's happening. And we don't have to have another file to figure out what this is, we don't have to remember stuff. We don't want to get anything wrong. So we can actually put the words on top of the numbers instead. So this line is Fake_Data$Gender = recode(Fake_Data$Gender, “0” = “Men”, “1” = “Women”). What this line is doing is saying: “We're going to use the recode() function on our data set on the Gender column, and we're going to set the 0s equal to the word ‘Men’, and we're going to set the 1s equal to the word ‘Women’”. Because in our fake data set, that's how it's set up; if it was the other way, you'd just need to switch the numbers or switch the words. 
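The type conversion and the optional relabeling described above can be sketched together. The little data frame here is a made-up stand-in for Fake_Data:

```r
library(dplyr)  # recode() comes from dplyr (part of the tidyverse)

# Made-up stand-in: Gender coded as 0/1, one continuous score column.
Fake_Data <- data.frame(Gender = c(0, 1, 0, 1),
                        Fake_Data1 = c(87.2, 86.5, 90.1, 84.3))

# Step 1: numbers -> factor, so 0 and 1 are treated as two separate groups.
Fake_Data$Gender <- as.factor(Fake_Data$Gender)

# Step 2 (optional): put words on top of the numbers.
Fake_Data$Gender <- recode(Fake_Data$Gender, "0" = "Men", "1" = "Women")

levels(Fake_Data$Gender)  # "Men" "Women"
```

If your file coded the groups the other way around, you'd swap the words (or the numbers) in the recode() call.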
So if we Run this line and we look at our data really quickly, we can now see that Gender is listed as “Men” or “Women”. It's no longer 0s and 1s. And we know that it's the correct data type because it's set to <fct>, which is factor. So this line is not necessary – line 55 – you don't have to put the words on if you don't want to. You could leave things even if we're like doing ANOVA later, you could leave Groups 1, 2, 3, 4. But sometimes it's nice to not have to try to remember what everything is, so we'll put words on instead. All right. Our next assumption, assumption 3: the data must be independent. Independence means certain things in statistics. Generally, when we say independent, it indicates there should be no relationship between your observations. So for example, if I look at my data. It means the participant in row one is different than the participant in row two, and they're not related in some way. So this would be a problem…let's say I sent out a survey, but I only sent it to my best friends. Well, maybe they're all related in some way, because they all know me. Maybe I have a very specific social circle and they all actually know each other, and we're like a unique group of people, for example. So what makes more sense if you're looking for independence is to do what's called “random sampling”. So instead of just sending it to only your best friends, or only the people in your class, or something like that, maybe you send it to anyone at the University of Guelph, for them to sign up for your study. So independence is a little tricky to determine after-the-fact, but you can generally look at the data and everyone should have only one row, and hopefully the people are not related to each other in any way. Not…by related, we don't necessarily mean like siblings or something like that, but like they know each other in some way that might mean that their answers are not independent of each other. So assumption 3: data must be independent. 
It's a little tricky to check if you haven't collected the data. Today we're working on a fake data set, so we'll say we meet the assumption of independence. I created this data set, it's fake. All right, assumption #4: the dependent variable must be approximately normally distributed for each independent group. What that means is we've got Men and we've got Women in our groups. We're going to look at the groups separately and look at them and say: “Hm, does it look like we've met that assumption of normality? Does it look like we've got the highest amount of data in the middle, lowest amount of data at the ends?” We could do this using visual inspection (i.e., something like a histogram to actually like visualize it), or we could also do a statistic. We could instead run a statistic to say: “Do we meet this assumption? Yes or no.” Today, we're going to use the statistic. But some fields do use the visual inspection (histogram) option, so that is an option if you so choose. All right. We have less than 50 observations per group, so we're going to run what's called the Shapiro-Wilk statistic. If you have more than 50 observations per group, you would generally use the Kolmogorov-Smirnov statistic instead. The first thing you're going to want to do, is right now our data is not split into groups. We've got all the ‘Men’ and all the ‘Women’ in the same data set; what we want to do is filter() the dataset into our two separate groups. We want to have a mini data set for the Men and a mini data set for the Women to treat them separately. So on line 70, we have: men = filter(Fake_Data, Gender == “Men”). There's something weird here, and I tried to emphasize it: we've got a double equal sign (==). Previously, when we said like “0 = Men” and “1 = Women”, that's setting what we call an “assignment”; that's saying take the ‘0’ and assign it the value ‘Men’, overwrite it. 
The double equal sign (==) is “check something”; so in the Gender column, check and see if that value is equal to (==) “Men”. If yes, then we're going to put them in this filtered data set that we have called ‘men’. If they're not equal to “Men”, it will not do that. Similarly, on line 71, we have: women = filter(Fake_Data, Gender == “Women”). What line 71 is doing, is saying: check the Gender column, and if it is equal to the value “Women”, we're going to put them into this data set that we're calling ‘women’. Let's Run this and see what it looks like. So I can actually Run both of these at the same time, I can highlight them and click Run. I get some blue text down in the bottom left, in the Console, showing that this worked. And in the top right, in my Global Environment, I now have two new data sets. And they both have 15 observations, because we have 15 men and 15 women. If I click on the ‘men’ dataset, here I can see in our Gender column, everyone is listed as a man. And then they've got all the other columns as well. But this is only the men in this data set, because if Gender was equal to “Men”, they were put in this dataset. Similarly, if I open my ‘women’ dataset, we only have women in this dataset, because if Gender was equal to “Women”, then they got put in this dataset instead. So we've got two smaller datasets; we've taken our big dataset and cut it in half and made two smaller datasets so we can check normality. So we're going to use the shapiro.test() function. We're going to use our ‘men’ dataset and our ‘women’ dataset to check those groups separately. So on line 74, we have: shapiro.test(men$Fake_Data1). We're using the smaller filtered dataset, and checking Fake_Data1: do we meet normality on Fake_Data1 only for the men participants? Similarly, on line 75, we have: shapiro.test(women$Fake_Data1). We're also going to check just the women version, just the smaller filtered dataset, and say: do we meet normality for Fake_Data1?
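The filtering step, sketched with a made-up stand-in dataset. The key detail from above: == checks a value, while = (or <-) assigns one.

```r
library(dplyr)

# Made-up stand-in with 4 men and 4 women.
Fake_Data <- data.frame(
  Gender     = rep(c("Men", "Women"), each = 4),
  Fake_Data1 = c(88, 85, 91, 86, 84, 87, 83, 89)
)

men   <- filter(Fake_Data, Gender == "Men")    # keep only rows where the check is TRUE
women <- filter(Fake_Data, Gender == "Women")

nrow(men)    # 4
nrow(women)  # 4
```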
Let's run these one at a time; let's Run the men first. So when I Run this, it says: “Shapiro-Wilk normality test. data: men$Fake_Data1”. It tells you which dataset and which column you were using. We've got W- and p-value. Our p-value is .1541. This is greater than (>) .05, which is our threshold for statistical significance, which means we have passed the assumption of normality for the men. If p was less than (<) .05, that's when you would have a problem, and you would have failed normality. So here, we pass normality for the men. Let's check the women; let's Run line 75. It says: “Shapiro-Wilk normality test. data: women$Fake_Data1”. It tells you which dataset and which column you've used. We've got our W statistic and our p-value, and here our p-value is .6814. Again, we have passed normality for the women, because our p-value is greater than (>) .05. So we've passed this assumption for both of our two independent groups, which is great. All right, moving right along. Assumption 5. I haven't left you any code here, just some notes. Assumption 5 says: no significant outliers for each independent group. So just like we check normality, if you're checking for outliers on your own data, you'll want to make sure you're checking the men and the women separately. Or whatever your two groups are; you want to treat them as their own individual group to check for outliers. The reason there's no code here is because different fields or different areas of study have different opinions on the best way to check for outliers. Some fields use visual inspection, so they'll make a histogram and they'll say “Is anything [data point] really far away?”, or they'll make a boxplot and say “Is there any data point really far away from the bulk of the data?”. Other fields use something a little bit more rigorous, so they'll use statistics or they'll use like the mean of a group, and then certain standard deviations away from the mean. 
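A runnable sketch of the per-group normality check, using simulated scores in place of the workshop file (the group means and standard deviations here are invented):

```r
set.seed(1)  # so the simulated scores are reproducible
men_scores   <- rnorm(15, mean = 87, sd = 5)  # 15 made-up "men" scores
women_scores <- rnorm(15, mean = 86, sd = 5)  # 15 made-up "women" scores

# shapiro.test() is in base R; run it once per group.
shapiro.test(men_scores)    # pass normality if the p-value is > .05
shapiro.test(women_scores)  # check each group separately
```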
That one's really common (mean and three standard deviations). So I haven't left any code because different fields do this differently. If you're on a research team, chat with the research lead to see which way they would like you to find (i.e., identify) and remove outliers. If you've broken other assumptions, let's say you broke normality…if you remove outliers and check those assumptions again, you might find that you pass. Because maybe you broke normality in the previous step on your own dataset because an outlier was pulling the data set in a certain way. If you remove the outlier, well now it might look like a standard bell-shaped curve, and then maybe you've passed that assumption. So if you remove outliers, re-check your assumptions again. All right. Today, we will say we're fine, no outliers. We can move to assumption 6: it's called homogeneity of variances. It's a $5 word. This means that your two groups, the men and the women in this case, should have equal variance. We can test this by running what's called the Levene's test; this is in the ‘car’ package, so if you've never done this before, you'll have to run line 90. We haven't done this so far in our workshop series, so Run line 90 if you don't think you have ‘car’. It's install.packages(“car”). I already have car on my computer, I don't need to Run line 90. But I will Run line 91; I have to do this anytime I open RStudio. This is library(car). So I can Run line 91. I get some red text here. It says: “Loading required package: carData. Attaching package: ‘car’. The following object is masked from ‘package:dplyr’”, and I also had: “The following object is masked from ‘package:purrr’”. This is not an error code we need to worry about necessarily, because it's not saying it didn't work. This is saying I have other packages that have the same functions, so the different packages are essentially fighting to say “Which one should we be using? When should I use recode()? Should I use the one from dplyr?
Or should I use the one from car?” So if you get this kind of red text, this is totally fine, you don't have to worry about this. If you get red text, something along the lines of “There is no package called ‘car’”, it means you have to run line 90 because you haven't taken it from the Internet to put it on your computer yet. So we're okay, library has loaded just fine. What we want to do now is run Levene's test to check homogeneity. That's on line 94. This is: leveneTest(Fake_Data$Fake_Data1, Fake_Data$Gender, center = mean). So we're using the leveneTest() function, and the “T” in Test has to be capitalized. This is a function, and we're saying: we're going to check in our fake data set, not the smaller two filtered data sets, but the big overall dataset. We're going to run Fake_Data1, we're going to break it down based on Gender, and “center = mean” is important because if you forget to say “center = mean”, the default is using medians. Which is fine, it shouldn't change your test too much, but generally we want to use “center = mean” if we're using the independent samples t-test. So we can highlight line 94, we can click Run. It says: “Levene's Test for Homogeneity of Variance (center = mean)”. So it tells you which thing you just ran. It says “group” (i.e., it's checking the men versus the women here). It's giving you your degrees of freedom (df): 1 and 28. It's giving you an F statistic. And it's giving you a p-value. The thing you probably care about the most here is the p-value; as this is an assumption check, if p is less than (<) .05 it means you have broken the assumption, there's a problem, you have failed the assumption check. Here, our p-value is .9288; this is greater (>) than .05, which means we've passed this assumption. Our two groups have approximately equal variance. So we've met the assumption of homogeneity of variances. Check mark. We just spent a whole bunch of time on assumptions…now, we can actually run the test.
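Levene's test, sketched with simulated data (this assumes the ‘car’ package is installed). One note on that default: center = mean gives the classic Levene's test, while the function's default, center = median, is the Brown-Forsythe variant.

```r
library(car)  # install.packages("car") first if you don't have it

set.seed(2)
# Made-up stand-in: 15 men and 15 women with simulated scores.
Fake_Data <- data.frame(
  Gender     = factor(rep(c("Men", "Women"), each = 15)),
  Fake_Data1 = c(rnorm(15, 87, 5), rnorm(15, 86, 5))
)

# First argument: the continuous variable; second: the grouping factor.
leveneTest(Fake_Data$Fake_Data1, Fake_Data$Gender, center = mean)
```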
We have actually passed all of our assumptions today, which means we are allowed to use the independent samples t-test. How do you actually run the independent samples t-test? We want to see whether Fake_Data1’s mean differs between men and women. This could be a bunch of different things: this could be exam scores, this could be happiness scores, this could be self-reported number of cookies eaten. This could be pretty much anything. But today, we're going to see: does the mean differ between the men and the women in our sample? On line 103, I've left you: ?t.test(). If you're trying to figure out what a certain function does, you can always ask for help by using the question mark (?). If I click Run on this, what happens is…I'll move this to make it bigger…down in the bottom right, in the Help window, it will show me this test. It shows me the documentation to say: “Here's how you run this”. So we've got “t.test {stats}”, meaning this is from the ‘stats’ package. We don't have to install this because ‘stats’ comes with base R, so it's always available. What does this do? “Student’s t-Test”, because the independent samples t-test has a lot of different names. “Student’s t-Test. Description: Performs one and two sample t-tests on vectors of data. Usage: t.test (x, …)”. You need an X. Well, what's “X”? Well, we can scroll down. Arguments: “x: a (non-empty) numeric vector of data values”. That sounds a little confusing. Let's break it down and go see what we actually need. So I can make this smaller, and we can go back and look at our test. For an independent samples t-test, you give it the dependent variable, you give it the independent variable. So the dependent variable here is Fake_Data1; the independent variable here is Gender. We've passed our homog – bleh bleh bleh, everyone struggles sometimes – we passed our homogeneity of variance assumption, so we set the argument “var.equal” (i.e., are your variances equal?) to capital letters “TRUE”.
If we had failed homogeneity, we would set “var.equal = FALSE”. So here, we're going to use “TRUE”. There are a few different ways to write this; I've left the easiest way to write it on line 106. So you could write it: t.test(men$Fake_Data1, women$Fake_Data1, alternative = “two.sided”, var.equal = TRUE). You could run it this way. You could also write it as: t.test(Fake_Data1~Gender, data = Fake_Data, alternative = “two.sided”, var.equal = TRUE). So there's two ways to write the same thing, you can use whichever one speaks to you. Let me Run both and see what they do. We’ve got 106…we run the version on 106 where we're using the ‘men’ dataset comparing to the ‘women’ dataset. We get: “Two Sample t-test”, and it says we used the ‘men’ dataset and the ‘women’ dataset, and we're comparing Fake_Data1 for both of those. We get a t-value, a degrees of freedom (df) value, a p-value. It says: “alternative hypothesis: true difference in means is not equal to 0”. We get a 95% confidence interval. And we get our sample estimates, so we have a value for ‘x’ and a value for ‘y’; x is 87.4, and y is 86.8, if we round.
You could run it that way, or you could run the version on 108. Let's see how that's different. This one says “Two Sample t-test”. You get a t-value, you get your degrees of freedom (df), you get your p-value. It's not men versus women here, it's saying instead we use the data Fake_Data1, and we're splitting it by Gender. So it's doing the same thing, it's just writing it slightly differently. It says: “alternative hypothesis: true difference in means between group Men and group Women is not equal to 0”. We get a 95% confidence interval. And we get our sample estimates, where it says the mean for the men is 87.4 and the mean for the women is 86.8. Same test, two ways. You can pick whichever way; either works for how you set up your data, and you can pick whichever output you like more. The second way might be a little bit easier to read, because instead of ‘x’ and ‘y’, it tells you this one is the men and this one is the women. So it's up to you which way you like. Regardless of which way you run it, our p-value is .1586. If the p-value was less than (<) .05 here, this would mean we are allowed to say there is a difference between the groups, and one group is higher on this score than the other group. Or you can flip it and say one group is lower on this score than the other group; it means the same thing. Here, our p-value is greater than (>) .05, and the wording is very critical: we fail to be able to say whether there is a difference between the groups. You'll note I did not say there is “NO difference” between the groups, you're not allowed to say that with p-values. We fail to say whether there is a difference between the groups. Even though men [mean] is 87.4 and women [mean] is 86.8, it's not a big enough difference for us to be able to say they're different. That's what our p-value tells us here. The p-value doesn't tell you how big the effect is, it just says: “Is it statistically significant? If yes, there's a difference.
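Both ways of writing the test, sketched on simulated stand-in data (the scores are invented, so the t and p values will differ from the workshop's):

```r
set.seed(3)
# Made-up stand-in: 15 men and 15 women with simulated scores.
Fake_Data <- data.frame(
  Gender     = factor(rep(c("Men", "Women"), each = 15)),
  Fake_Data1 = c(rnorm(15, 87, 5), rnorm(15, 86, 5))
)
men   <- subset(Fake_Data, Gender == "Men")
women <- subset(Fake_Data, Gender == "Women")

# Way 1: two separate vectors.
t.test(men$Fake_Data1, women$Fake_Data1,
       alternative = "two.sided", var.equal = TRUE)

# Way 2: formula interface -- identical test, more readable group labels.
t.test(Fake_Data1 ~ Gender, data = Fake_Data,
       alternative = "two.sided", var.equal = TRUE)
```

Because both calls run the same pooled-variance two-sample t-test on the same data, their t, df, and p values match exactly; only the labeling of the output differs.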
If no, fail to say”. We might want to run an effect size to be able to say: “How big is this effect? Small, medium, large?”. If you've come to one of my workshops before, you've probably run what's on line 112, it's: install.packages(“effectsize”). I feel like a robot saying this, but here we go. So you might not need to run this. I can run just line 113, because I've already got that package on my computer. We're going to run what's called the Cohen's d function; Cohen's d is the effect size that we use when we're looking for an independent samples t-test effect size. And I've written this out two ways. So just like how there's two ways to do the t-test, you could either do the equivalent ‘men’ versus ‘women’ version, or you can do Fake_Data1 split by Gender. They do the same thing. So if I Run this, what do I get? If I Run the first way, where it's comparing the ‘men’ dataset to the ‘women’ dataset, our Cohen’s d is .53, and it gives you a 95% confidence interval on that if you want it. .53 is considered a medium effect. Generally, by Cohen's conventions, about .2 is small, .5 is medium, and .8 or above is considered a large effect size. And if I Run line 116 to see how it's different…it looks exactly the same. It's a different function setup – it's the same formula, but we set up the function a little differently – but it gives us the same answer. It still says Cohen’s d is .53, so it's up to you which one you'd like to Run. And I've left you a little bit of help if you're trying to Run it or write this. If you're writing this in a paper, you would say t(28) [28 is for your degrees of freedom] = 1.4485, and I can scroll up and show you…that's this t-value right here. So t(your degrees of freedom) = 1.4485, p = 0.1586, d [lowercase d for Cohen’s d] = .53. And if you wanted, you could also give the means of your different groups: men is 87.41, women is 86.79. That's how you run an independent samples t-test. Importantly, though, you might also want to graph this.
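Cohen's d via the effectsize package, sketched on simulated stand-in data (assumes effectsize is installed; the d you get from random data will differ from the workshop's .53):

```r
library(effectsize)  # install.packages("effectsize") first if needed

set.seed(4)
# Made-up stand-in: 15 men and 15 women with simulated scores.
Fake_Data <- data.frame(
  Gender     = factor(rep(c("Men", "Women"), each = 15)),
  Fake_Data1 = c(rnorm(15, 87, 5), rnorm(15, 86, 5))
)

# Formula version; cohens_d(men$Fake_Data1, women$Fake_Data1) on the
# filtered datasets would give the same answer.
cohens_d(Fake_Data1 ~ Gender, data = Fake_Data)
```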
You might want to do a visual so you can see very clearly: if there is a difference, how big is the difference? And if you fail to say if there's a difference, why? How close are these things? We're not going to go too in-depth about different types of graphs, so what we're going to do just now is just highlight lines 123 to 126. We're going to be using ggplot(), which allows us to make fancier, prettier graphs. This graph is not super fancy, but we're going to use ggplot(), and we're going to create a little graph here. So we're going to make ourselves a bar graph. So if I click Run, in our Plots window in the bottom right, we should now have a graph. And I can click the “Zoom” button so we can see this a little better. On our x-axis on the bottom, we have Gender; on the left we have men, on the right we have women. On our y-axis along our left-hand side, we have Fake_Data1 – the mean of Fake_Data1 – and it goes from 0 to about 100. And if you look at these bars, they don't look super different. We can't say there's NO difference, because if you look, there is still a tiny difference, it's just not very big. Which is why we fail to find a difference here. Our p-value is not statistically significant, and that kind of makes sense if you look at the graph, because this difference is not very big. I can close this. Just glossing over ggplot(), we're not going to go too in-depth, but that's just a super duper basic bar graph for you. And that's all the steps you'd probably want to run if you're doing an independent samples t-test.
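One reasonable way to build that bar graph, again on simulated stand-in data (the workshop's own lines 123 to 126 may be written differently). stat_summary() computes each group's mean for us, so we don't have to pre-aggregate:

```r
library(ggplot2)

set.seed(5)
# Made-up stand-in: 15 men and 15 women with simulated scores.
Fake_Data <- data.frame(
  Gender     = factor(rep(c("Men", "Women"), each = 15)),
  Fake_Data1 = c(rnorm(15, 87, 5), rnorm(15, 86, 5))
)

# One bar per group; bar height = that group's mean.
p <- ggplot(Fake_Data, aes(x = Gender, y = Fake_Data1)) +
  stat_summary(fun = mean, geom = "bar") +
  labs(y = "Mean of Fake_Data1")
p  # draws in the RStudio Plots pane
```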
Time commitment
5 - 10 minutes
License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.