Transcript
Welcome everyone to RStudio workshop #3. Today we are learning about the Chi-square test in RStudio. I've left you some background on lines 12 to 18, if you've got our script open today. So the Chi-square test is used to test whether two categorical variables are associated with each other. What this means is you have two different groups of information, and you're looking to see: are they associated? So we could do things like men versus women, what are their favourite colours? We could do things like men versus women, what's their favourite drink order, a small, a medium or large? Things like that. We're looking to see whether one thing is associated with something else. This is what we call a non-parametric statistic, which is a fancy way of saying it does NOT assume a normal distribution. If you've ever taken any statistics class before, you might be familiar with the normal distribution; it's essentially, if you draw it as a graph or a histogram: largest chunk of data in the middle, smallest chunks of data at the ends. So it's that typical bell-shaped curve. So this test does not assume that the data meet normality, which is another way of saying it's a non-parametric test.
I've left you additional links in the script if you wanted to copy-paste those into a web browser to get some more information. So I've left you a link to the RStudio LibGuide, which I have open, so I can show you. So on the RStudio LibGuide, it's in the non-parametric statistics section, it's the first thing, so we've got some steps there for how to do a Chi-square test. And I've also left you links for the R documentation for using the chisq.test() function, and also the Laerd statistics guide because they do a decent job of explaining why you might use this test and the assumptions of the test.
So if you have a fresh RStudio open, you might notice your global environment in the top-right is empty. You might not have anything open yet. I don't have anything open, it's my first time opening R today. So we can start by setting our working directory and opening our data set so we can use it to do work. So we set our working directory. There are many ways to do this, but we can click the “Session” button at the top of the screen. We can click the “Set Working Directory” button. And then we can click the “Choose Directory” button. So we will click: Session > Set Working Directory > Choose Directory. That should open your File Explorer so you can go through your file structure on your computer to where you have saved today's fake data set. So on my computer this is Documents > Classes and Workshops > Lindsay's workshops > Micro workshops > RStudio. My File Explorer shows that there is nothing here, but I know that this is the correct location, so I can say “Open”. If you have clicked: Session > Set Working Directory > Choose Directory, and you've located where you have saved the file for today, you should get some blue text down in the bottom left, in our console, that starts with setwd().
The [round brackets] denote that it's a function. And then it's got your specific location of where you have your file saved. So what you want to do is copy all of this blue text, but not the greater than symbol, not that blue Pac-Man…so we'll copy the blue text and we can paste it on line 23. That means the next time you run the script, it will point to YOUR directory, because right now your script points to my directory, which will not work on your computer because your file structure is different. So you copy the blue text, and you paste it on line 23. Another good tip to make sure that this worked, is in the bottom right, in the Files window, you should be able to see the data and anything else that you have in the folder that you have pointed to. So I have Fake_Data, I have Rstudio micro 1, Rstudio micro 2, Rstudio micro 3. So that's another tip that you're in the right spot, is if in the Files window you see the location you asked it to point to.
So if you have clicked: Session > Set Working Directory > Choose Directory, and you've copied that line of code onto line 23, and you can see the files in your Files window, you're in the right spot! You have appropriately set your working directory; you have told the computer where is the file you want to work with.
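As a rough sketch of what that pasted line does, the snippet below calls setwd() and then checks the result. The folder used here is a temporary stand-in chosen only so the example runs anywhere (an assumption, not the workshop's path); in your own script you would use the path RStudio printed in the console.

```r
# Stand-in folder so the sketch runs on any computer (assumption);
# replace it with the path from your own setwd() line.
workshop_dir <- tempdir()

old_dir <- setwd(workshop_dir)  # setwd() returns the previous directory
getwd()                         # confirm where R is pointing now
list.files()                    # same check as looking at the Files window

setwd(old_dir)                  # go back to where we started
```

If list.files() shows the data file you expect, the working directory is set correctly.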
The next step, is once you've told the computer where the file is, you have to actually tell it to open the file. So that's going to be on our line 25. We're going to use the read.csv() function, because I sent you a .CSV file, it's the easiest file type to open in RStudio, it's in base R. We can use this read.csv() function, give it our brackets because it's a function, give it some quotation marks, and give it the exact name of the file exactly as written, so: Fake_Data.csv. And we're going to give that a name, so we're calling this Fake_Data, because I'm very original. You can call this whatever you want, people use “df” for data frame a lot. It's good to call this a name; so you could call this chi_data because you know you're about to do a Chi-square test, you can name it whatever you want. But if you're following along in the script I wrote, I just called it Fake_Data. So we're going to do the read.csv() function, and it has to be spelled exactly correctly, or else you'll get an error code. So if I highlight line 25, or place my cursor somewhere on line 25, I can click this “Run” button in the top right (kind of in the middle top right). I will get some blue text down in the bottom left, in our Console, showing the line of code that I ran. And in the top right, in our Global Environment, it's no longer empty; I should now have something that says Data, Fake_Data. And if I wanted, I could open this to make sure it downloaded correctly, or opened correctly from my computer. So I can double-click the title Fake_Data, and this will open up my sheet, kind of like it's in Excel, but it's in RStudio. So it shows me: a Gender column, a Fake_Data1 column, a Fake_Data2 column, a Fake_Data3 column, a Fake_Data4 column, a Colour column, and a Group column. We don't need to keep this open, so we can close this if we want, and it'll pop us back to our script.
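A minimal sketch of the read.csv() step. Because the workshop file isn't reproduced here, the snippet first writes a tiny stand-in Fake_Data.csv with made-up values (a few column names are borrowed from the data set described above; the numbers are invented for illustration).

```r
# Write a tiny stand-in CSV (invented values) so the example is self-contained
csv_path <- file.path(tempdir(), "Fake_Data.csv")
writeLines(c(
  "Gender,Fake_Data1,Colour,Group",
  "0,4.2,2,1",
  "1,3.9,1,3",
  "0,5.1,3,2"
), csv_path)

# Open the file and give the result a name
Fake_Data <- read.csv(csv_path)

names(Fake_Data)  # column headers
nrow(Fake_Data)   # number of rows
head(Fake_Data)   # first few rows, like peeking at the sheet
```

In the workshop script the file sits in the working directory, so the call would simply be read.csv("Fake_Data.csv").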
So we have set our working directory. We've opened our file. The next thing we want to do is start checking what we call our assumptions. I'll actually take a second to show you: if we go to line 18, I can grab the Laerd statistics guide…I did not write this guide, but I do recommend looking at it, because it's useful. And if I open my Laerd statistics guide, it looks something like this. The reason why I send people to these guides, is it gives you a little intro of why might you use the specific test, and it gives you a really quick, easy breakdown of what are the assumptions of this test. Assumptions are just a fancy way of saying: there are certain rules for each test that we must follow, and if you break the rules, you might not be able to use the test. So you want to check the rules to make sure you're passing all the rules that you have to in order to actually use the test appropriately. So in the RStudio script I have written down that we're going to check some of our assumptions.
So assumption one: we must have two categorical variables. Fancy way of saying those different buckets of information. Not continuous, we're looking for categorical; something like: what's your favourite colour? Blue, purple, green. So to run a Chi-square test, we need to have two categorical variables; this is often denoted as <fct> or factor data type within RStudio. There are other ways to write a categorical variable, but factor type is very common. There's a common package we can use for reviewing data, to make sure we've got the right data type. We can install and load what is called tidyverse. If you have come to any of my former workshops, you probably have already run what's on line 30, which is install.packages("tidyverse").
If you already have tidyverse installed, which is to say you took it from the Internet and you put it on your computer, you do not need to re-run line 30, and what you can do is you could actually put the comment, the hashtag, the pound sign, at the front of that line, so if you come back to the script, you don't run that line by accident if you have already run it. If you haven't run this line yet, it does take a couple minutes, so feel free to either highlight the line and click Run, or put your cursor anywhere on the line and click Run. I already have tidyverse installed, I will not install it right now. Once you have used the install.packages() function, you've taken tidyverse from the Internet, you've put it on your computer, the next step is telling RStudio you would like to use this package right now. You want to use some of the functions from the package. So that's on line 31. I can either highlight this and click Run, or I can put my cursor anywhere on that line and click Run. And it's pretty quick. It will show me in blue text in the bottom left exactly the line I just ran, and it will give me a list of the different packages that tidyverse has loaded. And mine also tells me it has some conflicts, I've got other packages that conflict with what's going on here. That's okay, it's not an error code, that's okay.
If you have tidyverse installed, the next thing we want to do is look at our data and see: what data types do we have right now? Reminder, we're in assumption one, we're checking our data types. So what data types does it think we have? We can do this by running what's on line 33, which is glimpse, and then the name of your data set inside those brackets. So I have: glimpse(Fake_Data). And if I run that, in the bottom left, in the Console, it tells me a bunch of different stuff about my data. It says I have a Gender column which is <int>, or integer type. I have Fake_Data1, which is <dbl>, or double type; it means that’s continuous data. And my Colour and Group columns are also <int>, which are integer type. It thinks they're numbers. And if we look, they kind of look like numbers! We've currently got 0s and 1s. We've got 1,2,3,4, we've got 1,2,3.
If something has opened incorrectly on your computer, I've left an example for if Gender is mislabeled. If we double-click on Fake_Data, we might get our headers not opening correctly. So sometimes you'll get like an “i” with two dots over it, a dot dot, and then the word Gender [e.g., ï..Gender]. If something has imported incorrectly, I left you some code to help fix that. If your Gender column and all your other columns have imported properly, you don't have to worry about it, but if you ended up with something wonky, you can run line 37, or you can ask for help and I'll make sure that we've got it working for you, because we're about to use some of these columns. So if I run line 39, it's the same as the line we just ran a moment ago. It just shows me what the names of each of my columns are, kind of on this left-hand side after the dollar signs. My columns all look okay, so I don't have to run anything else.
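The transcript doesn't show what line 37 of the script contains, so the following is only one possible fix, sketched on a small invented data frame: overwrite the garbled first header with the name you expect.

```r
# Stand-in data frame (invented values) whose first header came in garbled
Fake_Data <- data.frame(x = c(0, 1, 0), Colour = c(2, 1, 3))
names(Fake_Data)[1] <- "\u00ef..Gender"  # mimics the mangled header
names(Fake_Data)

# Overwrite the first column name with the intended one
names(Fake_Data)[1] <- "Gender"
names(Fake_Data)

# An alternative that is often suggested for this symptom is to declare
# the encoding when opening the file:
# Fake_Data <- read.csv("Fake_Data.csv", fileEncoding = "UTF-8-BOM")
```

Either way, re-check the column names afterwards before using the columns in a test.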
All right, I think I saw something in the chat. Let me check. Ohh, this is an excellent question. So the question is: “How do we know which library is needed for us while analyzing the data?” So I'll answer that one first. Um, if you try to use something, so let's say you found a function online, or you got a script from someone else, and you're trying to use a certain function. If you try to run it and it says this function doesn't exist, for example, or cannot find this function, you know that you probably haven't installed or loaded the library that you need. Might be both steps, it might be just one. It might be that you forgot to load the library, so for example, if I had not run line 31 library(tidyverse) and I tried to use the glimpse() function, it will break and essentially tell me: “You can't do that! Like, what is this glimpse() function?”. So if you're trying to run something and you've already got a script, for example from me, everything in the script is there so that it's going to work properly for you. But that's not always the case! Sometimes you might write your own script, or sometimes you might get something else. So to know which library a function comes from, I can show you an example. Let's say I wanted to use the glimpse() function. I could do “?glimpse”. So I can write this down in the console. If I know this is a function I'm trying to use, and I don't remember what library it's from, I can ask RStudio. And I can click enter, and it will open up in the Help window in the bottom-right, showing where that function comes from. So ours is actually from this middle option: dplyr. This is from tidyverse, so if you're trying to get a glimpse of your data, I can click this option, and it says glimpse is from dplyr.
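Besides the Help window, R can also report from code where a function lives. This small sketch uses chisq.test() as the example because it ships with R; the same calls would work for glimpse() once its package is loaded.

```r
# Which attached package provides this function?
where_found <- find("chisq.test")
where_found

# The package (namespace) a function belongs to
home_package <- environmentName(environment(chisq.test))
home_package

# For a function from a package that is installed but not yet loaded,
# a help search across installed packages is another option:
# ??glimpse
```

If find() returns nothing for a function, that is a hint the package has not been loaded with library() yet.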
You might not be familiar enough to know all of the functions that are within tidyverse, but if you put in the name of your function and you get in the squiggly brackets here, the name of the function and the name of the library, the package, the piece of the package (because tidyverse is the whole package, dplyr is part of that package, it's one of the chunks in there), you could Google this to be like: “Well, where is this even from?” And if you Google this, it will tell you it's from tidyverse. So then you'll be like, “Oh okay, I need tidyverse”. The other half of this question is: “Also, how do we know which data need which type of calculation, like Chi-square?” Ha-ha, excellent question! So in statistics…generally, not always, but generally…the type of data you have determines which statistic is actually appropriate to run.
So if you were in workshop one, I can open something to help us out. If you were in workshop one, you would have gotten a Word document from me as well as a script. If you open the Word document, I called it “RStudio Micro 1: Introduction”. If you scroll all the way to the bottom of this Word document, I left you a handy flow chart that I created to help you determine which data type and which test. So in our case, we have what are called nominal variables (i.e., just categorical variables). If we look at our three bubbles, we've got one for categorical, one for continuous, and another for continuous. But if we have what we call categorical variables, your only real option is the Chi-square test. If you only have categorical and you're trying to do an inferential statistic. So you could use something like the flow chart to help you decide which kind of test is appropriate. I also have this on the RStudio LibGuide, so if you…I'll move this over here as well. If you were on my LibGuide and you're on the parametric inferential statistics page, it will show you the same flow chart so you don't have to open the Word document if you don't want to. If you were like: “Oh it, it might be on the guide somewhere”, I have also put this on the guide to help you kind of pick which test might be appropriate.
This is also why checking your assumptions is really important and why I'm starting each of these little workshops by showing you what the assumptions are. The assumptions are really important because if you're trying to do a test that needs continuous data, for example, and you only have categorical data, the system might let you do it, or it might be smart enough to tell you it won't work. But you need to know, as the researcher, which thing is actually appropriate, so double-checking your assumptions is always wise. Excellent questions today, folks. You're thinking like researchers. Okay, I'll close this now. All right.
So we had just double-checked that our data imported correctly and we're still checking our data types. So I mentioned Gender, Colour, and Group: these are categorical variables. You aren't using colour as a continuum, for example. You're either going to like red or you're going to like blue, for example. So we have to change the data types because it's listing them as integer because right now we've imported them as numbers as 0 and 1. So RStudio is treating them as numbers, when really that's not the case. We want to treat these as what we call “factors”. So on lines 43, 44, and 45, you can highlight all three at the same time if you want, and click Run.
What these are doing…on line 43 for example, I'm using the as.factor() function. We know it's a function because it's got those left and right brackets. And then within those brackets, we have the name of the data set, so Fake_Data, then the dollar sign, to say which column I'm trying to do something to, and then the name of the column as written, so if I go to my Fake_Data set, it's Gender. So I'm saying within this fake data set, treat the Gender column like it's factor data.
Line 44: in the Colour column in the fake data set, treat that as a factor. And line 45: in the Group column in our fake data set, treat that as a factor. So now we're going to double-check that this worked; but on line 47, if we run our glimpse() function, we now see Gender is <fct>, factor type (i.e., categorical), Colour is <fct>, and Group is <fct>. So we have changed our three categorical variables to factor type, which means RStudio will treat them like they are categorical variables. So that's our assumption one: make sure your data type is assigned correctly.
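The conversion can be sketched on a small invented data frame as follows. The snippet uses base R's str() for the before-and-after check instead of glimpse(), purely so it runs without tidyverse; that swap is an assumption of this sketch, not what the workshop script does.

```r
# Small stand-in data set (invented values), numbers stored as integers
Fake_Data <- data.frame(
  Gender = c(0L, 1L, 1L, 0L),
  Colour = c(2L, 1L, 4L, 3L),
  Group  = c(1L, 3L, 2L, 2L)
)
str(Fake_Data)  # every column is listed as int

# Tell R to treat each column as categorical
Fake_Data$Gender <- as.factor(Fake_Data$Gender)
Fake_Data$Colour <- as.factor(Fake_Data$Colour)
Fake_Data$Group  <- as.factor(Fake_Data$Group)

str(Fake_Data)            # every column is now listed as Factor
sapply(Fake_Data, class)  # compact check of the data types
```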
The other thing I've left for you is let's say you don't want to have to open up another document to figure out what 0 means and what 1 means. That's a little confusing, we don't always want to try to remember what 0 and 1 mean. In our colour group: What's 1? What's 2? What's 3? What's 4? We could use numbers, or we could add text. And if we add text, it might be a little bit easier for us as humans to read what's going on. So lines 51, 52, and 53, we're going to use the recode() function. And the recode() function will let us take those numbers and switch them to text to make our lives a little easier. So let me read you through what's happening in the recode() function on line 51. We're going to use our Fake_Data data set, dollar sign, to tell us which column we're working in; we're going to change the Gender column first.
We're going to say “0” = “Men”, and “1” = “Women”. So what we're doing there, is we're just recoding the data to make sure it's got the words instead of the numbers. Same thing on line 52: we're going to say “1” = “Blue”, “2” = “Pink”, “3” = “Green”, and “4” = “Orange”. That's our Colour column. And then on line 53 with our Group column: “1” = “Small” drink, “2” = “Medium” drink, “3” = “Large” drink. So if you want, you can highlight all three of these at the same time and click Run, or you can do them one at a time. It will give us our blue text down at the bottom (i.e., everything worked). If you have red text, let me know; you shouldn't have an error code right now. And this is just going to recode them so they're equal to the words. If we want, we could re-run line 47, and what we can see now is that instead of 0s and 1s, we actually have our words instead. So it's just a way to make your life a little bit easier. And it's still listed as factor type, so we've met assumption one: two categorical variables, and the data type is listed as factor [<fct>] within RStudio.
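The same number-to-word mapping can be sketched in base R with factor() and its levels and labels arguments. This is a swapped-in alternative to the recode() calls in the workshop script, shown on invented values so that it runs without any extra packages.

```r
# Small stand-in data set (invented values)
Fake_Data <- data.frame(
  Gender = c(0, 1, 1, 0),
  Colour = c(2, 1, 4, 3),
  Group  = c(1, 3, 2, 2)
)

# levels = the codes in the data, labels = the words to show instead
Fake_Data$Gender <- factor(Fake_Data$Gender, levels = c(0, 1),
                           labels = c("Men", "Women"))
Fake_Data$Colour <- factor(Fake_Data$Colour, levels = 1:4,
                           labels = c("Blue", "Pink", "Green", "Orange"))
Fake_Data$Group  <- factor(Fake_Data$Group, levels = 1:3,
                           labels = c("Small", "Medium", "Large"))

Fake_Data  # words instead of numbers, still factor type
```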
There's one other assumption for this test, assumption two: each categorical variable must have two or more independent groups. So we can look at our…we're going to use Gender and Colour…we're going to look at our data set real quick. It's hard sometimes to double-check whether something meets what we call independence. Independence means something quite specific in statistics-land, so it's difficult if you already have the data set (i.e., it's secondary data, you got it from somewhere else). If you're running something like an experiment or giving out a survey, it's a little bit easier for you as a researcher to know whether something is truly independent. It's a fancy way of saying one row per participant, you're only allowed to participate one time, and hopefully you're doing something like random sampling, so getting a whole bunch of different people; if you were to ask only your friends, is that truly independent? Probably not. So we're not going to cover that too much in the series, but independence means something very specific. What we can do is we can just kind of look at our data and say: “Do we think that each of these are unique individual people?” So, for example, row one, they identify as a man, they've got a score for Fake_Data1, Fake_Data2, Fake_Data3, Fake_Data4, they identified the Colour pink, and for Group they picked small. If we squint at this a little bit, it looks like they're probably all unique individual people. We don't know for sure, so this is why it's very important, if you're running an experiment, to make sure you're checking things like independence in advance. Okay. So we're going to say that this meets the assumption. Everyone's got…they've all listed a Gender, and they've all listed a Colour. So we're going to say everything's okay.
Now we're at the really fun, easy part. This is actually running our Chi-square test. So we've checked our assumptions, that's the part that generally will actually take the most time. We've checked our assumptions, now we want to say: “Are Gender and Colour associated?”. Maybe, if we're being stereotypical, more women said they liked pink and more men said they like blue, for example. Maybe we want to check to see whether that's actually true in our data set, so we can do this. So actually running the Chi-square test; let's say you have the function, but you don't actually remember what to do with the function. So that's on line 65, we've got: “?chisq.test”. If we run this, in the bottom right, in our Help window, it will take us exactly to the R documentation about the Chi-square test, and what we need. So it says: “Pearson Chi-squared Test for Count Data. chisq.test performs chi-squared contingency table tests and goodness-of-fit tests.” We're going to take two categorical things, we're going to cross them and say: “Are they associated?” There are multiple arguments to this test! Sometimes, even though I've done stats for a long time and coding for a long time, I find the arguments are a little bit dense and hard to read. So if some of this doesn't make sense to you, that's okay. I'm going to show you the easiest, fastest way to do this, and it will probably work for most of the things you're trying to do a Chi-square test for. So we can use the Chi-square test function, and there are only two things we need to give it; we have to give it our first column (so Fake_Data$Gender to say this column in this data set), and we're going to check whether it is associated with our Colour column (we're going to say Fake_Data$Colour to say I want to check this other column). So you're going to give it your two columns. You could run just that, but you also sometimes want to save it as a variable. So here I've called it “my_chi”. 
So: my_chi = chisq.test(Fake_Data$Gender, Fake_Data$Colour). What this will do, is it will save our result to our global environment. And it's called my_chi, and it's a list of nine. You'll notice two things. We have red text in the console (i.e., it's warning us that something might be wrong); I'll read it, it says: “Warning message: In chisq.test(Fake_Data$Gender, Fake_Data$Colour) : Chi-squared approximation may be incorrect”. This is a warning that you might not have enough data in each of your groups. In our fake data set, we don't have very much data; we'll cover that in a second. But there's some red text, that's one thing I want you to notice. The other thing I want you to notice, is it didn't actually print the result of your test. You would expect, for example a p-value if you ran an inferential statistic. So we don't have a p-value because we stored it in what's called my_chi. We didn't actually ask it to return the value of my_chi; we didn't ask it to tell us the result of the test. We just said “store the result in my_chi”. So we can run line 68, which is literally just “my_chi” to say, well, what is the result? Give me the result of what I asked for. And if we click Run, we get a little bit more information, so now we get the result of the test. It tells us the name of the test: Pearson's Chi-squared test. It tells us which data you crossed: here, we crossed the Gender column and the Colour column. It gives you your χ² value. It gives you your degrees of freedom. And it gives you your p-value.
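To make the store-then-print step concrete, here is a sketch that rebuilds a stand-in Fake_Data from the Gender by Colour counts reported for this fake data set (men: 3, 5, 4, 3; women: 6, 3, 3, 3). The row order is an assumption; only the counts come from the session.

```r
colours <- c("Blue", "Pink", "Green", "Orange")

# One row per participant, rebuilt from the reported counts
Fake_Data <- data.frame(
  Gender = factor(rep(c("Men", "Women"), each = 15)),
  Colour = factor(c(rep(colours, times = c(3, 5, 4, 3)),
                    rep(colours, times = c(6, 3, 3, 3))),
                  levels = colours)
)

# Store the result; nothing is printed apart from the warning about small counts
my_chi <- chisq.test(Fake_Data$Gender, Fake_Data$Colour)

# Ask for the stored result
my_chi

# Individual pieces can be pulled out with the dollar sign
my_chi$statistic  # the chi-square value
my_chi$parameter  # degrees of freedom
my_chi$p.value    # the p-value
```

The printed statistic, degrees of freedom and p-value should line up with the values quoted when the results are written up at the end of the session.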
In inferential statistics, if your p-value is less than (<) .05, this means you have found statistical significance, and you can say there is an association. Or there is a difference, if you're doing a t-test. Or there is a correlation, if you're doing a correlational test. Or you know there's a relationship of some sort, something has happened. So if p was less than (<) .05, we'd be able to say we found an association.
If you were in my statistics workshop two weeks ago, you'll remember if p is greater than (>) .05, we're not allowed to say there's NO association! We are allowed to say “we failed to find an association”. So the language there is very precise and very strange because of p-values. We're allowed to say we failed to find an association.
Well, that's useful, but we might also want to be able to see what's going on with our data. Why can we not say anything? Why can't we say there's an association? Let's look at the data and see what's happening. So in the my_chi data, because it's in our global environment, it's got some information stored. You can actually click on it and see what's here, there's a lot of additional information you could ask for. We're going to ask for two things. On line 71, we're going to say my_chi$observed (i.e., give us the breakdown of what data was actually put into this [test]). So if we click Run, it will give us a little table, which is kind of neat. We did this last week when we did some descriptive statistics; we used the table() function. So this is just another way to get a table. So we've got our Colour along the top: blue, pink, green, and orange. And we've got Gender along the side: we've got men and women. And it tells us how many of each are in each group. So we've got (men) 3, 5, 4, 3, (women) 6, 3, 3, 3.
So we've got all of our different data here. Not very much data! Generally, with Chi-square, you want five or more in each group. So we got red text a moment ago saying essentially your Chi-square approximation might be incorrect because you don't have very much data. Most of our groups here are actually the number three, which is not enough data to really run this test super appropriately. I gave this as an example; you want enough data, so make sure you've got at least five in most of your groups. So this is the actual data that we gave the test. This is the breakdown of our data, as if we had done this as a table. Why can we not say these are associated? Well, there's not very much data here, so that might be part of it. And there doesn't seem to be any obvious pattern; in our fake example, it doesn't seem that, for example, women like pink more and men like blue more. It's a pretty equal split, most of the numbers are like three or four, so we can't say they're associated because there's not one group that is sticking out liking something a lot more than another one. Another way to say that: it doesn't seem like there's a very clear distinction in Gender or in Colour. On line 72, we can also ask for what's called the “expected”. So we can say “my_chi$expected”. This will give you the counts that the test expects in each cell if there were no association between the two variables. If you're reporting a Chi-square, sometimes you'll need these numbers, sometimes not, but this is what the test would expect the breakdown to be. So the numbers are just a little bit different. If you need the actual numbers that you gave the test (i.e., what's actually in your fake data set), you're going to want the observed numbers. That's the first thing we ran there. I've left you a warning here. If more than 20% of your cells have expected counts of fewer than five, you may not be able to trust the results of the test. And that's the warning message we got earlier.
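The observed and expected tables, and the 20% rule of thumb, can be checked with a few lines. This sketch enters the observed counts read out above directly as a table; the by-hand expected counts use the usual row total times column total divided by grand total, which is standard for this test rather than something shown in the workshop script.

```r
# Observed counts as read out in the session
observed <- matrix(c(3, 5, 4, 3,
                     6, 3, 3, 3),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Men", "Women"),
                                   Colour = c("Blue", "Pink", "Green", "Orange")))

my_chi <- chisq.test(observed)  # gives the same small-counts warning

my_chi$observed  # what we gave the test
my_chi$expected  # what the test expects if there is no association

# Expected counts by hand: row total * column total / grand total
expected_by_hand <- outer(rowSums(observed), colSums(observed)) / sum(observed)
all.equal(unname(expected_by_hand), unname(my_chi$expected))

# Share of cells with an expected count below five (rule of thumb: keep under 20%)
share_small <- mean(my_chi$expected < 5)
share_small
```

A share well above 0.2 is consistent with the warning message about the approximation.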
All right. I've left you one more thing here about running the test. We've run the test, we've got our p-value, we can't say whether these are associated (Gender and Colour). But there's something else you might want, too. p-values are one piece of the story, but more and more in statistics-land, we're moving towards estimation.
We're moving towards adding additional information, so something like a confidence interval. So what we can get here is what we call an “effect size”; it's an additional piece of information that says: our p-value is one piece of information, but how large is this effect really? Not just is the effect statistically significant, but how big is it? Because something can be statistically significant, and be an itty bitty tiny effect, or really big effect. So we want to know how big is this effect. To calculate our effect size with a Chi-square test, the effect size is what's called Cramer's V. We're first going to need to run line 82, and I will run this with you because I don't have this package on this computer. We're going to run: install.packages("confintr"). We're going to get the confidence interval package. So it's going to take a second, and then you'll get a little bit of red text, but it'll tell you where it stored the package, no problems there, everything ran. We're going to run line 83, because we took the package from the internet, but we still have to tell RStudio we would like to use this package right now. So we say library(confintr). I do get a warning message, it says: “‘confintr’ was built under R version 4.3.3”. It didn't say it didn't work, it didn't say there's a problem with it, it's just warning me that it was built under a different version of R compared to the version I'm using. And that's okay. Depending which version of R you have, if you're using 4.3.3, you might not get this message.
To actually calculate Cramer's V, it's the Cramer's V function: cramersv, left bracket, and then you give it the name of what you called your Chi-square test. I called mine my_chi, so cramersv(my_chi). If I run this, it will give me just a number: so it’s 0.23.
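For readers who want to see where that number comes from, Cramer's V can also be computed by hand from the stored test result with the textbook formula V = sqrt(chi-square / (n * (k - 1))), where n is the sample size and k is the smaller of the number of rows and columns. This sketch is a manual cross-check in base R, not the confintr function itself, and it re-enters the observed counts reported for this fake data set.

```r
# Observed counts reported for this fake data set
observed <- matrix(c(3, 5, 4, 3,
                     6, 3, 3, 3),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Men", "Women"),
                                   Colour = c("Blue", "Pink", "Green", "Orange")))

# Small-counts warning silenced here because it is not the point of this sketch
my_chi <- suppressWarnings(chisq.test(observed))

chi_sq <- unname(my_chi$statistic)  # the chi-square value
n      <- sum(observed)             # total number of participants
k      <- min(dim(observed))        # smaller of rows and columns

cramers_v <- sqrt(chi_sq / (n * (k - 1)))
round(cramers_v, 3)
```

The rounded result should agree with the value the cramersv() call returns in the session.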
Different fields have different ways of explaining how big or small a confidence interval is, sorry, I said confidence interval, how big or small an effect size is. This is your effect size. A .2 is considered a small effect size. It can be hard to find statistical significance if your effect size is small, and if your sample is pretty small, so it makes sense that our p-value was non-significant here, and that's okay, we would still probably report this. So I've left you: How do you write your results? We've got a fancy squiggly X, it's not just a capital X, it's the Chi-square symbol. So we've got a fancy squiggly X, some brackets to say 3, which is your degrees of freedom. So when we ran the test earlier, it said degrees of freedom is equal to three, so χ²(3) = 1.643, p = .650, and Cramer's V = .234. So that's your statistic with your degrees of freedom, your p-value, and your effect size. And that is how you run a Chi-square test.
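If you would rather have R assemble that results line than type the numbers by hand, something like the sketch below could work. The helper drop_zero() is a made-up name for this illustration, the effect size is computed by hand rather than with confintr, and the text writes "chi-square" in plain letters instead of the symbol.

```r
# Observed counts reported for this fake data set
observed <- matrix(c(3, 5, 4, 3,
                     6, 3, 3, 3),
                   nrow = 2, byrow = TRUE)

# Small-counts warning silenced here because it is not the point of this sketch
my_chi <- suppressWarnings(chisq.test(observed))

# Effect size by hand: sqrt(chi-square / (n * (k - 1)))
cramers_v <- sqrt(unname(my_chi$statistic) /
                    (sum(observed) * (min(dim(observed)) - 1)))

# Hypothetical helper: three decimals, no leading zero (".650" not "0.650")
drop_zero <- function(x) sub("^0", "", sprintf("%.3f", x))

report <- sprintf("chi-square(%d) = %.3f, p = %s, Cramer's V = %s",
                  as.integer(my_chi$parameter),
                  unname(my_chi$statistic),
                  drop_zero(my_chi$p.value),
                  drop_zero(cramers_v))
report
```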
Time commitment
5 - 10 minutes
Description
RStudio workshop series: Chi-square test covers how to conduct a Chi-square test (including all assumptions) in the RStudio software.
License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.