Time commitment
5 - 10 minutes
Video
Transcript
Welcome, everyone, to RStudio Micro Workshop series lesson #5. Today we are covering the one sample t-test. As in our previous session, we're going to be covering the parametric test (i.e., the test that requires normality) and the non-parametric test (the test that does not require normality). So today we're going to be doing the one sample t-test and the Wilcoxon signed-rank test. A little background about the one sample t-test: a one sample t-test is used to determine whether the mean of a single continuous variable differs from a specified constant. For example, through the literature, if you're working in research, there might be some hypothesis or some expectation that a variable sits around a certain number. That's our “specified constant”. In the fake example we're going to work on today, our constant is equal to the value 85. This is the one thing, if you're coming to see me for help on this test, that I generally can't help you too much with: figuring out what your specified constant is. But if you're familiar with your literature, there might be an instance where you're like: “I have…” let's say happiness scores. “I have a happiness score from everybody in the class that I'm teaching, and I want to compare it to the average happiness score at the University of Guelph, which let's say is 85. Is my happiness score for my class higher or lower than that score of 85 at the university?” So that's when you might use this kind of test. This is a parametric test, which means it assumes normality; it assumes the normal distribution. The variable you are using to compare to the specified constant must look like a normal bell-shaped curve, which means the highest amounts of data in the middle and the lowest amounts of data at the ends. I've also left you some resources in the R script for getting a little bit of extra help with this test if you're getting stuck.
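As a minimal sketch of the happiness example just described (the scores here are simulated, not the workshop's actual data, and `happiness` is a made-up name):

```r
# Hypothetical happiness example: compare a class's scores to the
# known university average of 85 (the "specified constant").
set.seed(42)                          # simulated data for illustration only
happiness <- rnorm(30, mean = 87, sd = 5)

# One-sample t-test against the specified constant (mu = 85);
# t.test() comes with base R's stats package.
result <- t.test(happiness, mu = 85)
result$p.value                        # is the class mean different from 85?
```

The only decision R cannot make for you is the value of `mu`; that has to come from your literature or research question.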
The first thing we're going to want to do today, since we have a fake data set we're going to use, is set our working directory to tell the computer where the file we're working with today is. The easiest way to do that is at the top: you should have a bar with a bunch of different buttons; you're going to click the button that says “Session”, click the button that says “Set Working Directory”, and then click the button that says “Choose Directory”. If you do this, if you click Session > Set Working Directory > Choose Directory, that will open up a dialog box that will let you go through your computer folder structure to tell the computer where the file you are working with today is; that is, the file you've downloaded from me through the e-mails.
Assumption number one: we're looking for one continuous variable. Because the one sample t-test compares essentially a column of data to a specified constant, to one certain number, we only need one variable for this. To run the one sample t-test, we need one continuous variable, and we can use RStudio to ask: “What type of variable do we actually have, and is it the right variable type?”. In RStudio, <int> (integer), <dbl> (double), and <num> (numeric) are our standard continuous variable data types, so we want to check to make sure that we've got this right. There is a common package we can use to review the data types. If you've not come to a workshop with me before, you'll need to run line 30, which is install.packages(“tidyverse”). You only need to do this one time; if you've come to any of my workshops before, you already have this on your computer. You've taken the package from the Internet and stored it on your computer, so we won't need to rerun this again, and I will not run line 30 right now. Even if you've come to a workshop before, you still need to run line 31: library(tidyverse). What this line of code is doing is saying: “I would like to use tidyverse today, right now”. If you leave your RStudio running perpetually, which I don't recommend, you would only need to run this the first time you've loaded RStudio. So if you're like me, and you open up RStudio once or twice a day, you'd need to run this line the first time you've opened RStudio that day. So I'm going to click “Run” on line 31. It takes a moment. We get a little bit of information; it's saying I have some conflicts because I have some other packages installed. That's okay, we're not worried about that; that's not an error code. So we have loaded, or run, the library(tidyverse) line. What we can do on line 32 is look at our data. We can take a glimpse() of our data.
We can use the glimpse(Fake_Data) line to say: “Show me what my data types are”. And if I do this, in the bottom left, it says glimpse(Fake_Data). I have 30 rows and 7 columns, and that's good because that matches what's in the Global Environment, so we know it's doing the right thing.
And then it says dollar sign ($) to say which column we're in, and then the name of the column. So today we're going to be using Fake_Data1, so that's this line here. And $Fake_Data1 is listed as <dbl> or double type, which is RStudio’s way of saying: “This is continuous data, it's got some decimals”. So we get a green checkmark for assumption one, we get our green light, we're good to go. We have passed assumption one, because we have a continuous variable. Excellent.
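If you prefer not to load tidyverse just for this check, a base-R sketch of the same assumption check might look like this (the Fake_Data names mirror the workshop's dataset, but the values here are invented):

```r
# Recreate a tiny stand-in for the workshop's Fake_Data1 column (invented values)
Fake_Data <- data.frame(Fake_Data1 = c(86.2, 88.1, 84.9, 90.3, 87.5))

# Base-R alternatives to glimpse() for checking one column's type
class(Fake_Data$Fake_Data1)       # "numeric" -> continuous, assumption one met
is.numeric(Fake_Data$Fake_Data1)  # TRUE for both double and integer columns
```

Either route answers the same question: is the column a continuous (numeric/double/integer) type?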
We can then move on to assumption 2: the data must be independent.
This can be tricky to assess after the fact. If you're getting this data from somewhere else, let's say…I don't know, you pulled some data from Statistics Canada, it might be hard to say whether the data are truly independent. What we mean when we say “independence”, a lot of the time in statistics, is that each participant only has one row of data, and each participant is independent of the other participants in the dataset. So for example, if I went and sent a survey only to my best-est of friends, that's probably not independent, because maybe they're all connected in some way, maybe they all know me. Things like random sampling will help you ensure that you have independent data. So it's tricky to determine after the fact whether we truly have independence, but we can use the View() function (which has a capital V; there are very few functions with capitals, and this one has a capital V). We can View(Fake_Data), and if I run this, what this will do…the other way to do this is you can just double-click on the data in your Global Environment.
This will open the data set for you to have a look at it. Generally, you would want a participant column; I didn't include that in my fake dataset, but it looks like we have 30 rows of different participants here, and they each have something for Gender, Fake_Data1, Fake_Data2, Fake_Data3, et cetera. So we can close that. It looks like we have independence. It's a fake dataset, I created it, we can say we've passed independence today. We've met that assumption.
Assumption 3 is about this test being a parametric test: each variable should be approximately normally distributed. Here we have one variable; we're going to look at Fake_Data1 and ask: “Does it look like that standard bell-shaped curve, highest data in the middle, lowest data at the ends?” You could look at that using something like a histogram and squint at it. That's called visual inspection: you squint at a graph and ask: “Does it look approximately normal?”. So we could do a histogram. The other way, which I recommend, is doing a statistical test to say: “Does it look approximately normal?”. If you have fewer than 50 observations, you would generally use the Shapiro-Wilk statistic. If you had more than 50 observations, you would generally use the Kolmogorov–Smirnov statistic. Today we have fewer than 50 observations, so we're going to run the Shapiro test. We do this with shapiro.test(Fake_Data$Fake_Data1). What this function is doing is saying: “Run the Shapiro statistic, the Shapiro-Wilk stat, in the Fake_Data set, in the Fake_Data1 column”. So we can highlight this and click “Run”. And what does this give us?
It says shapiro.test and then it tells you exactly what you ran. Shapiro-Wilk normality test; it tells you the data that you ran, W for your statistic, and a p-value for “is this statistically significant?”. Because this is an assumption check, if your p-value is less than (<) .05, that means you have failed the assumption and something is wrong; you get a red light, you should stop, and you might need to make a change. Here, our p-value is greater than (>) .05; we have .4551, which means we get a green check mark, we're good to go. We have passed the normality assumption, so we are allowed to use the one sample t-test; because we have passed normality, we can use the parametric statistic.
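A self-contained sketch of this normality check on simulated data (shapiro.test() is in base R's stats package, so no extra library is needed; the data here are made up, so the exact p-value will differ from the workshop's .4551):

```r
# Simulate 30 observations, matching the workshop's sample size
set.seed(1)
x <- rnorm(30, mean = 87, sd = 5)

# Shapiro-Wilk test: p < .05 would mean the normality assumption failed
res <- shapiro.test(x)
res$statistic   # W
res$p.value     # compare this to .05
```

Remember the direction of the decision rule here: unlike the t-test itself, for this assumption check a *small* p-value is bad news.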
The next assumption, assumption 4, is about outliers: no significant outliers. I haven't left you any code for this, but I've left you a note. Different fields or areas of study have different opinions on the best way to check for outliers. There are many methods: you might calculate the mean and standard deviation for each variable or each group, or you might do visual inspection, creating a box plot or a histogram and looking for any data points that are really far away. You should always remove impossible values; if you're measuring something like glucose and someone has a score of 15,000, that's not possible, so always remove scores that are impossible. But you should also be looking for any extreme outliers, any value that's really far away from the rest of the data. It's up to you as the researcher, or you as part of a research team, to decide how you're going to find and remove those outliers and explain that to other experts. And as a side note here, sometimes removing outliers can also fix your normality problems. Because if you think about it, let's say you've got a histogram: you've got the bulk of your data in the middle, a little bit of data at the ends…you've got your standard normal distribution, and a data point really far away. A lot of the parametric statistics are using means.
Because you have an outlier, your mean is going to be pulled towards that outlier. So if you remove outliers, sometimes that fixes your normality problems. So if you do remove outliers, and you had a problem with normality, double-check normality again after you've removed those outliers.
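As a sketch of that recheck (the data, the impossible value of 250, and the cutoff of 200 are all invented for illustration; your field's outlier rule will differ):

```r
# One impossible value pulls the mean toward it and can break normality
set.seed(7)
scores  <- c(rnorm(29, mean = 87, sd = 3), 250)   # 250 is an impossible score

cleaned <- scores[scores < 200]    # remove the impossible value (29 rows remain)

mean(scores)                       # inflated by the outlier
mean(cleaned)                      # back near the bulk of the data
shapiro.test(cleaned)              # re-run the normality check after removal
```

The key habit is the last line: any time you remove outliers, run your normality test again on the cleaned variable before moving on to the t-test.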
Alrighty, we have checked our four assumptions. We didn't check for outliers today, but it's a fake data set, so we're going to pretend that that's okay, knowing that different fields deal with this in different ways. How do you actually run a one sample t-test? We're going to see whether Fake_Data1’s mean differs from a pre-specified constant, and today our constant is going to be 85. We'll say I pulled that from the literature. Where did we get the constant value? It might be from your research question, it might be from a previous paper, it might be something that is just known in your field. Today, I pulled it out of thin air because it's just a fake example. We're going to start by reviewing the t.test() function, which is on line 62. We can highlight this, it's ?t.test(), and if we click “Run”, down in our Help window it will open up the help page for this specific function. It says t.test {stats}, which means it's from the stats package, which comes built into R. And what does it say? “Student’s t-Test” (the one sample t-test has a lot of different names; it's also called Student’s t-test). It says: “Description: Performs one and two sample t-tests on vectors of data”. And it says: “Usage: t.test…you have to give it x”. Well, what's “x”? “x, a (non-empty) numeric vector of data values”. That might sound confusing, but all you need to tell it is: which is your variable, which column are you checking? We also need this other argument down here, which is called “mu”. It says: “a number indicating the true value of the mean (or difference in means if you are performing a two sample test)”. “mu” is going to be our value of 85, our pre-specified constant.
Let's see what this looks like in action. We can run several different tests, but we want the one sample version. To run the one sample version, you just tell it the column or the variable you're looking at, and the pre-specified constant you're comparing it to. That's on line 65; we've got t.test(Fake_Data$Fake_Data1, mu = 85). We've told it the column we would like to compare, and the number we would like to compare it to. That's it. If we run this, in our bottom left, in our Console, we get some output. One sample t-test. It tells you the data that you used. It gives you a t-value (this is your t-test value). It gives you the degrees of freedom for that t-value. It gives you a p-value. It says “alternative hypothesis: true mean is not equal to 85”. It gives you a 95% confidence interval. And it gives you your sample estimate (what is the actual mean of x? 87.10212). So the mean of this column is 87.1, and we've compared it to the value 85. It's about a two-point difference. Is that statistically significant? To answer that question, we can look at our p-value and ask: “Is this less than (<) .05?” Here, it's actually in scientific notation because there are so many zeros, so I've written it out for you a little lower down. It's p equals…let's see if I can count this right: .0000000001424. That is less than (<) .05, which means we are allowed to say “yes”, our variable Fake_Data1 is statistically significantly different from our specified constant of 85. And we know that the mean of our Fake_Data1 variable is actually 87, so we know that our variable is higher than the specified constant. And we know it's statistically significantly higher because our p-value is less than (<) .05. So our p-value is less than (<) .05; we found statistical significance. The one other thing we might want to report is an effect size. The p-value doesn't tell you about the size of this difference; it just says “there is a difference”.
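One handy detail when you're writing up: the t.test() result is a list, so every number it prints can be pulled out by name instead of copied from the Console. A sketch on simulated data (the names mirror the workshop's column, but the values, and therefore the exact output, are invented):

```r
# Simulated stand-in for the workshop's Fake_Data1 column
set.seed(3)
Fake_Data1 <- rnorm(30, mean = 87, sd = 1.2)

res <- t.test(Fake_Data1, mu = 85)

res$statistic   # the t-value
res$parameter   # degrees of freedom (n - 1 = 29 here)
res$p.value     # the full p-value, no squinting at scientific notation
res$conf.int    # the 95% confidence interval
res$estimate    # the sample mean ("mean of x")
```

Extracting `res$p.value` directly is the easiest way to write out all those zeros correctly, rather than counting digits off the scientific notation by hand.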
So we might want to say: “How big is this difference?”; we can do this using the “effectsize” package. So you might need to run line 69, which is install.packages(“effectsize”). I already have this on my computer, so I don't need to run the install.packages line. But I will run line 70, which is library(effectsize). I can run this, and I get some blue text in the Console showing that this worked. The reason I'm installing this package is that we need it to run the cohens_d() function. Cohen's d is our effect size for a one sample t-test, so we want to say: “How big is this?” Line 71: cohens_d(Fake_Data$Fake_Data1, mu = 85). We've given cohens_d() the exact same arguments that we used to calculate the one sample t-test; we're saying we want to compare the Fake_Data1 column to 85. What is the Cohen's d effect size estimate for our one sample t-test? We can click “Run”, and it spits out a number for you. It's this number in the bottom left: 1.76. And I've left you how to actually write about this, if you were writing a paper or something, on line 73: you would say t(29) = 9.6644. I can scroll up to show you where that comes from; that's this t-value here. In the brackets, the “29” is your degrees of freedom. We can give our p-value; some folks like scientific notation, some folks don't, so I've written it out with all the decimals for you. And d, for Cohen's d, is 1.76. And if you wanted to report the mean of the sample, it's 87.1. And the pre-specified constant was 85. That is how you write up a one sample t-test.
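If you ever want to sanity-check the effectsize output, for a one sample test Cohen's d is simply the sample mean minus mu, divided by the sample standard deviation. A tiny hand-computed sketch (the eight scores here are invented for illustration):

```r
# Made-up scores and the same kind of pre-specified constant as the workshop
x  <- c(86, 88, 85, 90, 87, 89, 86, 88)
mu <- 85

# One-sample Cohen's d by hand: (mean - mu) / sd
d <- (mean(x) - mu) / sd(x)
round(d, 2)   # 1.41 for these invented scores
```

This is also why d and the t-value are tied together: for a one sample test, d equals t divided by the square root of the sample size, which is how the workshop's t of 9.6644 with n = 30 lines up with a d of about 1.76.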
License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.