Chi-Square

By Lindsay Plater

Time commitment

5 - 10 minutes

Description

The purpose of this video is to explain how to conduct a Chi-square test using SPSS (requires two categorical variables). This tutorial is designed to help students and researchers understand: the data type required for the test, the assumptions of the test, the data set-up for the test, and how to run and interpret the test.

Transcript

So what is a Chi-square test? A Chi-square test is used to test whether two categorical variables are associated with each other. So an important distinction here, compared to a lot of the other tests we’ll be doing this semester, is that you need two categorical variables. And again, categorical means buckets. Male, female, nonbinary: each of those would be a bucket. If you're asking someone's favourite colour, blue, green, purple; those would be buckets. They must be categorical, and you don't want a lot of different options. You don't want like 100 buckets; you normally want a smaller number of buckets.

As per our introduction, this is a non-parametric test. This means it does not need to assume normality, i.e., your data don't have to follow that typical bell-shaped curve, that normal distribution. You don't need normal data to run this test, which kind of makes sense! Because if you've got buckets, buckets don't really follow a bell-shaped curve, you probably just have some folks who like the colour blue, some folks who like the colour purple. It's not going to follow that perfect normal distribution, you don't need a normal distribution for this.

I've left you some resources on the slide here as well if you need some additional help running a Chi-square test. So I've linked to the SPSS LibGuide, where this video will eventually live. I've linked to the Laerd statistics guide, because they give a pretty good overview. And I've also linked to the SPSS formal documentation, so their documentation of how you run this test in the SPSS software. 

So really the only thing you need is two categorical variables. 

The assumptions of the Chi-square test:

you need two categorical variables, they need to be in those buckets, and

each categorical variable must contain two or more independent groups.

So I’ll say that again: two categorical variables and those variables must contain two or more independent groups. 

You can have, for example, male and female participants. You could ask for their favourite colour; maybe it's blue, maybe it's green, maybe it's purple. What does this look like in SPSS? How do you set this up?

[Slide contains a screenshot of a table in SPSS within Data View. The table’s column headers are as follows: Gender, Fake_Data1, Fake_Data2, Fake_Data3, Fake_Data4, Colour, and Group.]

In your SPSS, when we're checking these assumptions, we have to make sure we've set up our data properly so that we can run the test. 
Our first assumption is that you need two categorical variables. In our fake data set today, we're going to be using the “Gender” variable and the “Colour” variable.

[Gender and Colour columns are highlighted in SPSS.]

So here we've got “Gender”, you could be male or female. And we've got “Colour”, you might like pink, blue, green, or orange. 

This first assumption, two categorical variables, passes; Gender and Colour are both categorical, they have separate groups, separate buckets. 

Our second assumption is that our categorical variables need at least two independent groups. If we look at our Gender variable here, in our fake data set, we have male participants and we have female participants. Within the Gender category or the Gender group, we have two different buckets, so two different independent groups. So the assumption passes for Gender. 

And within Colour, we have four different buckets or categories: we have pink, blue, green, and orange. 

So our assumption passes for both of our two categorical variables because they each have two or more independent groups. You can't be pink and blue and green all at once; you're either pink or blue or green or orange. Independent groups means you only get one of the options.
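As a rough illustration outside SPSS, the two assumptions can be checked in a few lines of Python. This is only a sketch: the `responses` list is invented for the example (it is not the data set from the video), and the helper name `has_enough_groups` is hypothetical.

```python
# Hypothetical responses: one (gender, colour) pair per participant.
# Each participant appears once, so they sit in exactly one bucket per variable.
responses = [
    ("male", "pink"), ("female", "blue"), ("male", "green"),
    ("female", "orange"), ("female", "blue"), ("male", "pink"),
]

def has_enough_groups(values, minimum=2):
    """Return True if a categorical variable has at least `minimum` distinct groups."""
    return len(set(values)) >= minimum

genders = [gender for gender, _ in responses]
colours = [colour for _, colour in responses]

print("Gender groups:", sorted(set(genders)), has_enough_groups(genders))
print("Colour groups:", sorted(set(colours)), has_enough_groups(colours))
```

A variable where everyone falls into the same bucket (for example, only male participants) would fail this check, and there would be nothing to cross-tabulate.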

[Screenshot shows the same table with the Analyze menu open in SPSS. From there the Descriptive Statistics submenu selected followed by Crosstabs.]

How do you actually run the Chi-square analysis? If you have the data set open in SPSS, you're welcome to follow along. If you don't have SPSS or if you're just not feeling it today, you're welcome to just watch, you can always come back to the slides later.  

To open this on your own, or to run the Chi-square test on your own, you would click Analyze > Descriptive Statistics > Crosstabs. 
So again, to run this on your own computer: Analyze > Descriptive Statistics > Crosstabs. That will open a dialog box that looks something like this.

[SPSS Crosstabs dialog box overlaid on the table in Data View. Variable options are listed on the left. The dialog shows "0 = Male, 1 = Female [Gender]" in the Rows field and "1 = small drink, 2 = medium drink, 3 = large drink" in the Columns field. All variables are preceded by an icon.]

And what you do, is you take your two categorical variables with independent groups, and you take them from the left side, and you put one of them where it says “Row(s):” and you put one of them where it says “Column(s):”. 

And if you have set up the “Measure” properly in variable view, the little symbol associated with your categorical groups should be either those three coloured circles or those three coloured bars. If you have a little yellow ruler, SPSS thinks you have scale data – or continuous data – so you should back out of the dialog box, go to variable view, find this variable, and in the Measure column, tell it you either have circles [nominal data] or you have bars [ordinal data]. This is telling SPSS you've got a categorical variable. So again, if you've got a yellow ruler, that's a continuous variable; your “Measure” might not be set up properly, or you shouldn't be using that variable for this analysis. 

So you take your two different variables that have two or more independent buckets, put one on “Row(s):”, you put one on “Column(s):”.

[SPSS Crosstabs dialog box with the Statistics button selected, opening the Crosstabs: Statistics dialog. In this secondary window, the "Chi-square" test is checked, along with other nominal, ordinal, and interval-based test options such as Contingency coefficient, and Phi and Cramer’s V.]

If you click the Statistics button, it will pop out another little dialog box and you want it to say “Chi-square” (so you want to click the top left button) and you might also want to click the next two buttons on the left hand side: one says “Contingency coefficient”, and one says “Phi and Cramer's V”; we'll cover those in the output in a little bit. 
So there are three boxes you want to check: Chi-square, Contingency coefficient, and Phi and Cramer's V. And then you'll click Continue, and then you'll click OK. 

Oh, I lied, don't click OK yet, just click continue.

[SPSS Crosstabs dialog box with the "Cells" button selected, opening the Crosstabs: Cell Display dialog. In this window, checkboxes for Observed and Expected are selected under Counts. There are additional options for Z-test, Percentages, Residuals and Noninteger Weights. Round cell counts is selected under Noninteger Weights.]

We also want to click the Cells button. When you click the Cells button, it will pop a “Crosstabs: Cell Display” as another dialog box. Here, you want the top two left options: you want where it says “Counts: Observed” and “Counts: Expected” and then you click Continue.

Once you've done those, make sure in the bottom left corner of the Crosstabs dialog box you've selected “Display clustered bar charts”.  

So you've made some selections in Statistics. You've made some selections where it says Cells. And on the main page, you've said “Display clustered bar charts”. And then you're ready to click OK. 
If you have done those 3 little steps (technically it says five steps on the slide, but they're pretty easy steps), you'll get some output that looks like this.

[SPSS Chi-square output displaying a bar chart on the left and statistical output on the right. The bar chart compares counts of different colors (Blue, Pink, Green, Orange) across Male and Female categories. The right side shows SPSS Crosstabs output, including the Case Processing Summary, Crosstabulation table, Chi-Square Tests table, and Symmetric Measures.]

So let me grab my cursor. It's always a good idea to visualize the data to see what's going on. 

If you scroll to the very bottom of your output (I generally recommend doing this first), if you scroll to the bottom of your output, you'll get a graph that looks something like the one on my screen here. 

See, we've got our male participants, we've got our female participants, and we've got essentially the count of each of these groups, who likes which colour. Who likes blue, who likes pink, who likes green, and who likes orange. And in our little fake data set, it looks like [many] of the females actually like blue; we've got six females who picked blue as their favourite. And we've got 5 males who picked pink as their favourite. So the graph is just a nice little visual to help you see what's going on with your data. 

If you scroll back up to the top of your output, you'll see it says Crosstabs because we've run the crosstabs procedure. The first table you get is called the “Case Processing Summary”, it gives you your “N” or your number of observations; here we have 30 participants in our fake data set and it says 100% valid. We've got no missing data, every row was used in this specific analysis.

So that tells you your number of valid or missing data points. Our second table is not the Chi-square tests table yet; that one comes third. The first table is labeled Case Processing Summary, and the second one is labeled based on the labels that you used for your different groups, so mine says “0 = male, 1 = female” times (the star “*” means times in SPSS) “1 = blue, 2 = pink, 3 = green, 4 = orange” Crosstabulation. This is your crosstabulation table, so this will give you your counts.

So how many participants said they were male and liked blue, how many participants said they were male and said they liked pink, how many participants said they were male and liked green, how many participants said they were male and liked orange. So it gives you the actual counts of who said what, and it also gives you your expected count. So the expected count is a calculation SPSS is doing behind-the-scenes to tell you what counts it would expect in each cell if Gender and Colour were not associated with each other (the row total times the column total, divided by the overall N). And in our case, it's actually giving some half responses, which wouldn't make sense for real participants (you can't be half a participant, you can only be a full participant), but it's fine for an expected value.

So this is showing you your actual data, your observed data, and your expected data. And it gives you a little breakdown. 
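To make the Expected Count row less mysterious, here is a minimal Python sketch of the arithmetic (expected count = row total × column total ÷ N). It is not part of the video or of SPSS. The counts below are made up; they were chosen only to be consistent with the figures mentioned in this transcript (30 participants, six females picking blue, five males picking pink), and the real table on the slide may differ.

```python
# Hypothetical observed counts (rows: gender, columns: colour).
colours = ["blue", "pink", "green", "orange"]
observed = {
    "male":   [3, 5, 4, 3],
    "female": [6, 3, 3, 3],
}

n = sum(sum(row) for row in observed.values())              # overall N
col_totals = [sum(col) for col in zip(*observed.values())]  # total per colour

# Expected count for a cell = row total * column total / N
expected = {
    gender: [sum(row) * col_total / n for col_total in col_totals]
    for gender, row in observed.items()
}

for gender in observed:
    print(gender, "observed:", observed[gender], "expected:", expected[gender])
```

With these made-up counts, the expected values for blue and green come out as 4.5 and 3.5, which is the kind of "half response" described above.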

The third table here is our “Chi-square tests” table. This table is going to show you the value of the Pearson Chi-Square test statistic (χ²). So it's essentially a little χ (it looks like an X) to the power of two; that would be the symbol we would use for the Chi-square test. 
It tells you the degrees of freedom (df) for the test, and the p-value associated with that test. Long story short: your p-value is in the Pearson Chi-Square row, in the column that says “Asymptotic Significance (2-sided)”; here, it's .650. If your p-value is less than (<) our golden threshold of .05, we would say there is a statistically significant association between the variables, so Gender and Colour would be associated with each other. 

Here, we have p is greater than (>) .05, and we would say we failed to find a significant association between the variables. So I’ll say that part again, because this is kind of the meat and potatoes, it's the important part: if p is less than (<) .05, these variables are statistically associated with each other. If p is greater than (>) .05, which is what we have, we say we failed to find an association. 
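For anyone curious about what sits behind the Pearson Chi-Square row, the sketch below computes the statistic, the degrees of freedom, and the p-value in plain Python. Treat it as an illustration only: the counts are hypothetical (picked to be consistent with the numbers quoted in this transcript), the function names are my own, and the p-value routine is a simple series approximation rather than whatever SPSS uses internally.

```python
import math

# Hypothetical observed counts: rows are male / female,
# columns are blue / pink / green / orange.
observed = [
    [3, 5, 4, 3],
    [6, 3, 3, 3],
]

def chi_square(table):
    """Return the Pearson chi-square statistic and its degrees of freedom."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (count - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def chi2_p_value(stat, df):
    """Upper-tail chi-square probability via the incomplete gamma series."""
    if stat <= 0:
        return 1.0
    a, t = df / 2.0, stat / 2.0
    term = total = 1.0 / a
    k = 1
    while term > 1e-15 * total:
        term *= t / (a + k)
        total += term
        k += 1
    return 1.0 - total * math.exp(a * math.log(t) - t - math.lgamma(a))

stat, df = chi_square(observed)
p = chi2_p_value(stat, df)
print(f"chi-square = {stat:.3f}, df = {df}, p = {p:.3f}")
print("significant association" if p < 0.05 else "failed to find an association")
```

With these made-up counts, the degrees of freedom are (2 − 1) × (4 − 1) = 3 and the p-value lands at roughly .65, so the decision is the same as in the video: we failed to find an association.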

And I see something in the chat, let me go check. Oh, I will come back to this one, someone was asking me to re-run it. We have so much time, I will re-run it with you so that you know you're clicking the right boxes. And Heather says, how do we know what's expected for our second table here? So, expected is a calculation that's happening behind-the-scenes. Expected is what the counts would look like if there were no association: SPSS takes the row totals and the column totals and works out how the data would break down if these variables had nothing to do with each other. The “Count” is what you actually have in your data. Here, I've actually got it on my next button here, I’ll click my button, there we go.

[A red annotation and arrow highlight the “Male” row in the Crosstabulation table and connects it with the Male category in the bar chart.]

So the actual count gives you the real breakdown of the data (and I think I've got that for the female participants as well, yep).

[A red annotation and arrow highlight the “Female” row in the Crosstabulation table and connects it with the Female category in the bar chart.]

So that's the actual count of the data, whereas the “Expected Count” is the calculation of how the data would break down if the two variables were not associated with each other.

In our case, the Count and the Expected Count (what we'd see with no association) do not match exactly, but they are fairly close to each other, and that lines up with our test result: we failed to find an association between these variables. 

If we played with the data (it's a fake data set, you can change the data, run it again), you could try something like a gender stereotype: let's say all of the men really like blue and orange, whereas all of the women really like pink and green. That might mean that they’re associated with each other, and when you re-run the test, you might see your Count and your Expected Count are much further apart, and maybe then your p-value would be significant. Good questions, pals.
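The "play with the data" idea can be sketched in Python as well. Both tables below are invented: the first is a mixed pattern consistent with the numbers quoted in this transcript, and the second is an extreme version of the stereotype just described. Rather than computing a p-value, the sketch compares each statistic with 7.815, the standard chi-square cut-off for p = .05 with 3 degrees of freedom.

```python
def chi_square(table):
    """Return the Pearson chi-square statistic for a table of counts."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (count - expected) ** 2 / expected
    return stat

CRITICAL_05_DF3 = 7.815  # chi-square cut-off for p = .05 with 3 degrees of freedom

# Both tables are hypothetical; columns are blue / pink / green / orange.
mixed = [[3, 5, 4, 3],         # male
         [6, 3, 3, 3]]         # female
stereotyped = [[8, 0, 0, 7],   # male: only blue and orange
               [0, 8, 7, 0]]   # female: only pink and green

for name, table in [("mixed", mixed), ("stereotyped", stereotyped)]:
    stat = chi_square(table)
    verdict = "significant" if stat > CRITICAL_05_DF3 else "not significant"
    print(f"{name}: chi-square = {stat:.2f} ({verdict} at .05)")
```

The further the observed counts sit from the expected counts, the larger the statistic becomes, which is why the stereotyped table crosses the threshold and the mixed one does not.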

So that's our Chi-Square Tests table. That's this third table; the Pearson Chi-square line gives you the value (your Chi-square value), your degrees of freedom, and your p-value. 
If p is less than (<) .05, you can say we found a significant association between Gender and Colour. Here, we found p value is equal to .65 which is greater than (>) .05, so we failed to find an association between Gender and Colour.  

And our last table down here: I made you click a button that said “Phi and Cramer's V”; this is our effect size. So you could use either Phi or Cramer’s V (Phi is intended for 2 x 2 tables, so for a bigger table like ours Cramer's V is the usual choice); these are essentially different effect sizes to say “how big is this effect?”, and a larger number would indicate a larger effect. Here, we've got a value of .234, which is not a very large effect; we've actually got like a small or medium effect, depending what threshold you're using. And the significance, again, says that it's non-significant. 

So it's a small effect, it's a non-significant effect. We wouldn't actually generally report much going on here, because we've got non-significant findings. 
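As a rough check on the effect size, Cramer's V can be computed from the Chi-square value as the square root of χ² ÷ (N × (k − 1)), where k is the smaller of the number of rows and columns. The Python sketch below assumes a made-up table of counts chosen to be consistent with the numbers quoted in this transcript; the function name `cramers_v` is my own and not an SPSS term.

```python
import math

def cramers_v(table):
    """Cramer's V = sqrt(chi-square / (N * (smaller table dimension - 1)))."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi_sq = sum(
        (count - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, count in enumerate(row)
    )
    smaller_dim = min(len(table), len(table[0]))
    return math.sqrt(chi_sq / (n * (smaller_dim - 1)))

# Hypothetical counts: rows are male / female,
# columns are blue / pink / green / orange.
observed = [
    [3, 5, 4, 3],
    [6, 3, 3, 3],
]

v = cramers_v(observed)
print(f"Cramer's V = {v:.3f}")
```

V runs from 0 (no association) to 1 (perfect association), so a value in the low .2 range reads as a small-to-medium effect, in line with the interpretation above.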

[Questions? Contact us. UG Library. Website: lib.uoguelph.ca. Email: library@uoguelph.ca.]

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

The library is committed to ensuring that members of our user community with disabilities have equal access to our services and resources and that their dignity and independence is always respected. If you encounter a barrier and/or need an alternate format, please fill out our Library Print and Multimedia Alternate-Format Request Form. Contact us if you’d like to provide feedback: lib.a11y@uoguelph.ca
