Time commitment
5 - 10 minutes
Description
The purpose of this video is to explain the assumptions necessary for conducting a one-way ANOVA, including the need for a continuous dependent variable, a categorical independent variable with three or more groups, independence of data, and normality with no significant outliers. It also covers performing post hoc multiple comparisons and setting additional options in SPSS before running the analysis and interpreting the output.
Video
Transcript
A one-way ANalysis Of VAriance (also called an ANOVA) is used to determine whether the means of the same continuous variable from three or more groups differ. So you're going to look at the same variable; maybe you're going to look at happiness; you're going to look at happiness across three different groups. Maybe it's participants in three different classes. But they're all going to be measured on the same variable.
This is a parametric test; this means that the data must look approximately normal, which means it has to look like this graph here: highest amount of data in the middle, lowest towards the end. This is what we call our typical bell-shaped curve.
And if you need some additional help running – oh, problem on the slide; it says independent samples t-test. If you need help running an ANOVA, a one-way ANOVA, the links should be correct. This is the U of G SPSS LibGuide, we've got a chunk on there. We've got the Laerd statistics guide to give some help, we did not write the Laerd statistics guide, but I think it's a really good resource, so I always tell people to look at it. And we also have some SPSS documentation, so this is literally by the people who designed the SPSS software. They've got documentation to help you walk through how to do this. Let us begin.
So when I say assumptions for one-way ANOVA, there are quite a few of them, we're going to cover all of these today.
The first is that the dependent variable must be continuous. The second assumption is that the independent variable must be categorical and have three or more independent groups. The third is that the data must be independent. The fourth is that the dependent variable is approximately normally distributed for each independent variable group; so, if you have four groups, you need to check normality for each of those four groups. The fifth is that you should have no significant outliers, and again that is for each of those groups. And the sixth has a little star, it means we actually are going to check this one when we run the test; our sixth assumption is homogeneity of variances.
And we're going to talk about all of these on the next few slides, so let us begin.
[Slide contains a screenshot of a table in SPSS within Data View. The table’s column headers are as follows: Gender, Fake_Data1, Fake_Data2, Fake_Data3, Fake_Data4, Colour, and Group.]
Our first assumption for one-way ANOVA is that the dependent variable we are using must be continuous. This could be a happiness score: rate your happiness 0 to 100. This could be… I worked a lot with reaction times, so maybe it's reaction time data. But your single dependent variable must be continuous.
So here, if we look at Fake_Data1, we can look in this variable and say: “Hmm, does this look like a continuous variable?”
[Fake_Data1 column is highlighted.]
Here we have a range of values, we've got a bunch of decimals. So, this one passes the assumption of a continuous dependent variable.
The second assumption is that the independent variable you are using must be categorical, and there must be three or more independent groups. So we're going to pick on a new column this time; I don't think we've used this column yet so far in our workshop series. We're going to pick on the Colour column.
[Colour column is highlighted.]
So, in this fake example, maybe you ask participants “What's your favorite colour?”. The options are pink, blue, green, or orange. We have 4 different groups, and they are independent groups; participants can only list one thing, so we've got a categorical independent variable, and they are independent groups because you can only pick one option. One option per person. So, we pass our second assumption.
Our third assumption is independence, so the data must be independent. This one's a little tricky to check after-the-fact, generally we set this up before we run the study, so if you're doing a survey or an experiment, for example, you might make sure that each participant only answers the survey one time. That would be independent data.
If you send out a survey and someone answers it three times and they have three rows in the data set, their answers are probably not independent of each other, so you might have failed this assumption. So, it's tricky to check this assumption after-the-fact, but we can kind of look at the data and eyeball it.
[Column headers and first row are highlighted.]
So, for row 1, this participant identifies as male, they have data for Fake_Data1/2/3/4, they've got their favorite colour listed, and they've got their group listed. It looks like this is potentially an independent dataset because as you go through the rows, we've got different answers for everybody. Hard to tell after-the-fact, but it looks like this might have passed. It's a fake dataset; we're going to say it passed today.
Check assumptions (normality and outliers)
[Slide shows the table with the Analyze menu open and Descriptive Statistics selected. From the Descriptive Statistics sub-menu Explore is highlighted.]
The next two assumptions, normality and outliers; each variable should be approximately normally distributed, in each group. So, each group – if you have four groups, each of those four groups – should be approximately normally distributed, and for each of those four groups, you should have no significant outliers in each.
We're going to click: Analyze > Descriptive Statistics > Explore. We're going to check normality (for that bell-shaped curve) and we're going to check for outliers (any points that are really far away from the rest of the data points) at the same time, so we do this in the same dialog box. So again, we're going to click: Analyze > Descriptive Statistics > Explore. Do that, it will open up the Explore dialog box.
[SPSS Explore dialog box with the Explore: Plots sub-dialog box open. In the main Explore dialog, Fake_Data1 is selected as the Dependent Variable, and Colour is chosen in the Factor List. The remaining variables are listed on the left. Under Display, there are radio buttons for Both, Statistics, and Plots, with Both selected.
In the Explore: Plots sub-dialog, the Factor levels together option is selected for Boxplots, and Histogram is checked under Descriptive plots while Stem-and-Leaf remains unchecked. Normality plots with tests is enabled, and the Spread vs. Level with Levene Test is set to "None."]
You're going to take your continuous dependent variable and put it where it says “Dependent List:”. So you take it from the left and you put it in the Dependent List box. And your continuous dependent variable should have a little yellow ruler, which means you set the variable Measure properly in Variable View. If you don't have a little yellow ruler, your data is either the incorrect type, or you need to go and fix that in Variable View and say “oops, my bad, this is scale data, it’s continuous data”.
You also from the left-hand side will take your categorical independent variable and drop it where it says, “Factor List:”, kind of in the middle of the box. So here we take the Colour column, and we put it in Factor List. If it's a categorical variable, it should either have three coloured circles or those three coloured bars to indicate nominal or ordinal data.
Once you've done that, you want to click the Plots button. This will open up the Plots dialog box. There's a few options you want to pick here: you want to uncheck Stem-and-leaf (we don't really do those much anymore), and check where it says Histogram. You also want to click where it says “Normality plots with tests”; the normality plots button is really important because this will give you a statistic to be able to say whether you have passed or failed normality. Once you've done that, you click Continue and you click OK.
So, move your options where they belong, click a few buttons in Plots and then you're ready to go.
If you have clicked everything, you will get a fairly long output, there's a lot of information here. I've broken this down into two slides because there's a lot of stuff we're going to look at. The first thing you want to look at for normality, is you're going to scroll down a little to where it says: “Tests of Normality”.
[Tests of Normality, comparing four color-coded groups (Blue, Pink, Green, Orange). It includes results from both the Kolmogorov-Smirnov Test and the Shapiro-Wilk Test, each displaying values for the test statistic, degrees of freedom (df), and significance level (Sig.). A note at the bottom clarifies that the significance values are based on the Lilliefors Significance Correction.]
The Tests of Normality box, if you have set this up properly, will give you one row for each group. So here we have a row for blue, a row for pink, a row for green, and a row for orange. If you have about 50 or more observations in each of your groups, you're going to look in the middle of the table where it says Kolmogorov-Smirnov.
That's the statistic you'll use to check for normality. Because we have fewer than 50 observations in each of our groups, we're going to look on the right-hand side where it says Shapiro-Wilk. The Shapiro-Wilk statistic (I'll move the cursor so we can see it), the Shapiro-Wilk statistic, here we're going to look in the Sig. [Significance] column. If Sig. is less than (<) .05 for any of these groups, even just one, it means you have failed the assumption of normality and you cannot use the one-way ANOVA; or, it would be inappropriate for you to use the one-way ANOVA because your data do not meet this assumption. Here we can see that all of our p-values are actually greater than (>) .05, so we're okay; we can say we've passed normality, or we've met the assumption of normality.
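If you ever want to reproduce this check outside SPSS, the same per-group Shapiro-Wilk test is available in Python's SciPy library. A minimal sketch, assuming your data live in plain lists rather than an SPSS file (the colour groups and values here are made up for illustration):

```python
# Per-group Shapiro-Wilk normality check, mirroring SPSS's
# Tests of Normality table. The groups below are made-up data.
from scipy import stats

groups = {
    "Blue":   [85.2, 87.1, 86.4, 88.0, 87.5, 86.9, 85.8, 87.3],
    "Pink":   [86.0, 87.2, 85.9, 88.1, 86.5, 87.8, 86.2, 87.0],
    "Green":  [87.1, 86.3, 88.2, 85.7, 87.4, 86.8, 87.9, 86.1],
    "Orange": [85.5, 87.6, 86.7, 88.3, 86.0, 87.2, 85.9, 86.6],
}

for name, values in groups.items():
    w, p = stats.shapiro(values)
    # If Sig. (p) < .05 for ANY group, the normality assumption fails.
    verdict = "fail" if p < 0.05 else "pass"
    print(f"{name}: W = {w:.3f}, Sig. = {p:.3f} -> {verdict}")
```

Just like in the SPSS output, one failing group is enough to fail the assumption overall.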
This is a statistic that helps you look to see whether you've passed normality. Some fields also like to do what's called visual inspection, so they're going to look at some graphs to determine whether they've met normality. So, if you scroll down in your output window a little, you'll also see some histograms.
[Slide shows a histogram for each of the colours showing the distribution of diff_score values, with Frequency on the y-axis and diff_score values on the x-axis.]
I've made mine red, yours will be blue. What we're looking for here is that each of these graphs should approximate our standard bell-shaped curve: highest data in the middle, lowest points at the ends. So, if we look, for example, at the histogram for the colour blue, it's not perfect [the frequency values from left to right are 1, 3, 2, and 3]. The right-hand side is a little high, but it roughly follows the right shape.
If we look at histogram for colour pink, this one’s a little wonky as well, but it's not terrible [the frequency values from left to right are 3, 0, 3, 2, and 0]. Histogram green, that looks pretty good, highest in the middle, lowest at the end [the frequency values from left to right are 1, 1, 4, 1, and 0]. And histogram orange, it's a little bit high on the left-hand side, a little bit low on the right [the frequency values from left to right are 0, 2, 2, 1, 0, and 1]. But if we do visual inspection of our histograms in conjunction with our p-value, our statistical test, we can say no, I think we still pass normality. The graphs aren't perfect, but they're okay, they're good enough.
If you scroll down just a little bit more in your output, you can also do visual inspection using what's called a Q-Q plot. So here you have a line going across your page and you've got a bunch of dots (yours will be blue, I made mine red).
[The Normal Q-Q Plot of diff_score compares the observed values (x-axis) to the expected normal values (y-axis). The data points are represented as red dots and are shown in addition to the black diagonal reference line.]
You're looking to see: “Do these data points fall on or pretty close to the line?” If they fall really far away from the line or are pulling away from the line, it means you've probably failed normality. But here, we can look at these and say no, those data points look pretty close to the line for each of these. So, when we use our statistic, our histogram, and our Q-Q plot, we can say no, it looks good. I think we've passed normality. Excellent.
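That "pretty close to the line" judgment has a numeric counterpart: SciPy's `probplot` computes the same ordered-data-versus-normal-quantiles points and also reports r, the correlation between your data and the fitted reference line. A small sketch with made-up, roughly normal data:

```python
# Quantifying the Q-Q plot: probplot fits the reference line and
# reports r, the correlation between ordered data and normal quantiles.
from scipy import stats

data = [85.2, 87.1, 86.4, 88.0, 87.5, 86.9, 85.8, 87.3,
        86.0, 87.2, 85.9, 88.1, 86.5, 87.8, 86.2, 87.0]

(osm, osr), (slope, intercept, r) = stats.probplot(data, dist="norm")
# r close to 1 means the points hug the line; points pulling away
# from the line drag r down.
print(f"Q-Q correlation r = {r:.3f}")
```

This is only a supplement to the visual check, not a formal test on its own.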
[Slide shows a boxplot visualizing the distribution of Fake_Data1 across the four colour-coded groups: Blue, Pink, Green, and Orange. Each box represents the interquartile range (IQR), with the median marked by a horizontal line. The whiskers extend to the minimum and maximum values within 1.5 times the IQR, while any outliers beyond this range are marked separately, such as an outlier in the Green category labeled as 30.]
In the same output as the normality check, we also have asked for some information on outliers, so this is how you check outliers. SPSS uses what's called the boxplot method, so it makes you a boxplot; it's got a median, it's got an interquartile range, it's got some whiskers on it. There are other ways to check for outliers, so in your field it might be common to use something like the mean and standard deviation, or maybe you look at the histogram and just eyeball it and flag anything that is really far away. But for SPSS, the easiest way to do this is to use the boxplot.
So, if we look at the boxplot, what we're looking for here is any data points really beyond those whiskers. If you have a circle with a number next to it, it means you have an outlier, and the number tells you which row of your data set is an outlier.
If you have a star with a number next to it, so for example, in this output we have a star with the number 30 in the green condition, that means that row #30 in our dataset is an extreme outlier, which is potentially problematic. We have to be cautious of outliers, and the reason this is one of the assumptions is: if you have data points that are really far away from the rest of your data, they might actually skew your analysis, so it might look like you found something when really there's nothing, or it might look like you didn't find something when maybe there should have been something. Hard to say.
So, in this case it might be appropriate, depending on your field, to remove observation 30 because they have been identified as an extreme outlier. This varies a little bit field-to-field. For today's purposes, we will pretend we had no outliers and just keep going, but depending on your field, you might actually remove participant 30 before you do the analysis at all.
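The boxplot rule SPSS applies can be written down directly: points beyond 1.5×IQR from the quartiles get a circle (outlier), and points beyond 3×IQR get a star (extreme outlier). A sketch using NumPy percentiles with made-up data (note that SPSS computes quartiles with Tukey's hinges, which can differ slightly from NumPy's default interpolation):

```python
# Boxplot-method outlier detection: beyond 1.5*IQR = outlier (circle),
# beyond 3*IQR = extreme outlier (star in the SPSS boxplot).
import numpy as np

data = np.array([10, 11, 12, 13, 14, 100])  # made-up values

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mild = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
extreme = (data < q1 - 3.0 * iqr) | (data > q3 + 3.0 * iqr)

# Report 1-based row numbers, matching how SPSS labels boxplot points.
print("outliers (rows):", list(np.flatnonzero(mild) + 1))
print("extreme (rows): ", list(np.flatnonzero(extreme) + 1))
```

Here the value 100 in row 6 lands beyond the 3×IQR fence, so it would get a star, just like row 30 did in the workshop dataset.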
Okay, this is your reminder: Always check your assumptions! You don't know whether you've met the assumptions unless you actually check them. In this case, you might think everything's fine, but if you check for outliers, for example, you might consider row 30 an outlier and have to do something about that.
Reminder, always check your assumptions before you run your test. Let us continue.
[Slide shows the table in Data View with the Analyze menu open and Compare Means and Proportions is selected. From the Compare Means and Proportions sub-menu One-Way ANOVA is highlighted.]
We've checked our assumptions! If you have passed all assumptions, you can proceed to conducting the one-way ANOVA. How do you actually do this? You click: Analyze > Compare Means and Proportions > One-Way ANOVA.
Here we're going to say we've passed all assumptions. Again, we might consider removing that one outlier, and we still need to check homogeneity, because assumption six (if you remember, I had a star next to it) is the one we actually check when we run the test.
So one-way ANOVA may or may not be appropriate; you might need to remove that outlier before you do the test. But how to run it? You click: Analyze > Compare Means and Proportions > One-Way ANOVA.
[One-Way ANOVA dialog box is shown. There are a list of variables listed with Fake_Data1 moved to the Dependent List, while Colour is selected as the Factor. Options on the right allow for additional settings, including Contrasts, Post Hoc tests, Options, and Bootstrapping. The effect size estimation checkbox is selected.]
If you do that, that will open the One-Way ANOVA dialog box. And how you run this test is you take your continuous dependent variable from the left-hand side, and you put it in the box that says, “Dependent List:”. So here we take our Fake_Data1 variable, and we put it where it says Dependent List. It's got a little yellow ruler to tell us it's a continuous variable. If you don't have a yellow ruler, you'll have to exit out of this box, go to Variable View, look at Measure, and change it to be continuous data [if appropriate to do so].
You then take your categorical independent variable, and you take it from the left-hand side, and you put it where it says “Factor:”.
You have a few more options you have to click.
You're going to click the Post Hoc button to open up the Post Hoc Multiple Comparisons dialog box.
[One-Way ANOVA: Post Hoc Multiple Comparisons dialog box appears on the slide beside the One-Way ANOVA dialog box. It is divided into two sections: Equal Variances Assumed and Equal Variances Not Assumed. Users can also set the Null Hypothesis Test significance level, either using the default alpha level from the Options menu or specifying a custom level.]
Multiple comparisons, or post hoc comparisons, or pairwise tests, or EM means: bunch of different words for the kinds of tests you might do afterwards. Here we're going to do a post hoc multiple comparison check. This will tell you what differences are statistically significant if you find something going on in your data set. We're going to click the options Bonferroni and Tukey [under Equal Variances Assumed]; we're going to apply correction; we'll talk about this a little bit more. We're also going to click where it says Games-Howell [under Equal Variances Not Assumed], this is for our homogeneity check, if we fail homogeneity. Then we click Continue.
Once you have made your selections – in your field, you might use other options, these are pretty standard options, so these are the ones we're going to go for today.
Once you've made these [selections], you're going to click the Options button, which will open the Options dialog box.
[The One-Way ANOVA: Options dialog box appears on the slide beside the One-Way ANOVA dialog box. The two main sections are for Statistics and Missing Values. There is also a check box for Means plot. Lastly, the Confidence Intervals field allows users to set the significance level by percentage, with 0.95 shown.]
And you're going to click Descriptives, you're going to click Homogeneity of variance test (to check for homogeneity), you can also click Welch test, and you also want to make sure you say Means plot (you want to graph of what's happening with your data). And when you do that, you can click Continue and then that's everything you need to click.
If you've made all your selections, some in post hoc, some in options, you click OK.
[One-Way ANOVA output in SPSS. The left panel shows the Output Navigator. The main panel presents the following tables: Descriptives, Tests of Homogeneity of Variances, ANOVA, and ANOVA Effect Sizes.]
If you have done all of that, you will get a pretty chunky output, it's pretty big.
The descriptives table will give you an overview of things like your mean, your standard deviation, how many observations per group.
The Test of Homogeneity of Variances table is the first table you're going to want to look at. If we look at this specific Sig. value, where it says “Based on Mean” [row header], “Sig.” [column header], if your p-value is less than (<) .05 you have failed the assumption of homogeneity of variances, and there's something you have to do about that in a minute. Here, we've passed the assumption (i.e., our p-value is greater than (>) .05) [the value is .406], so we're okay, we don't have to do anything else. But you want to make sure you check this assumption because there's something you have to change if you fail.
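For reference, the "Based on Mean" row of that table is Levene's test centred on the group means, and SciPy exposes the same test. A hedged sketch with made-up groups:

```python
# Levene's test for homogeneity of variances, centred on the mean
# (SPSS's "Based on Mean" row). Sig. < .05 would mean the assumption
# fails and you should read Welch's test instead of the classic ANOVA.
from scipy import stats

blue   = [85.2, 87.1, 86.4, 88.0, 87.5, 86.9]
pink   = [86.0, 87.2, 85.9, 88.1, 86.5, 87.8]
green  = [87.1, 86.3, 88.2, 85.7, 87.4, 86.8]
orange = [85.5, 87.6, 86.7, 88.3, 86.0, 87.2]

stat, p = stats.levene(blue, pink, green, orange, center="mean")
print(f"Levene statistic = {stat:.3f}, Sig. = {p:.3f}")
```

SciPy's default is `center="median"` (the more robust Brown-Forsythe variant, SPSS's "Based on Median" row); `center="mean"` matches the row read here.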
Our next table is our ANOVA table; this is the table that actually tells us whether there is a difference between our groups. In this case we've got 4 colours: is there a difference in Fake_Data1 depending on whether you said you liked pink, green, orange, or blue best? We can tell this by looking in the Sig. column. If the p-value here is less than (<) .05, it means you have found a statistically significant difference between your groups. Here, our p-value is .923, which is quite a bit above (>) .05; we can say we failed to find a difference.
What does this actually look like? Well, if you have clicked all the buttons with me today, you probably have a graph that looks something like this.
[Four rows (Blue, Pink, Green, and Orange) and two columns (N and Mean) are highlighted in the Descriptives table. A graph also appears on the slide displaying the mean values across the four colour categories (Blue – 87.34, Pink – 87.00, Green – 87.01, and Orange – 86.98). The y-axis represents the mean values while the x-axis lists the colour categories.]
This is literally just a graph that shows you the means of your different groups, and you might look at this and be like: “Yeah, that's kind of weird that we didn't find anything because blue sure seems a lot higher than pink, green, or orange, something looks like it's going on here". But I want to caution you to always look critically at your graphs. If I look at this graph and I look at my y-axis (so that vertical axis), it ranges from 86.9 to 87.4, which is not a very big difference. If I were to change the [y-]axis of my graphs – you can actually double click on it and change values like this, if you want – I can change my graph [new graph appears]. So, for example, if I put it between 80 and 90, it suddenly looks like a pretty flat straight line, and it should be pretty obvious that maybe there's not actually differences between these different groups. We can't say there are no differences, we can say we failed to find a difference! But if you ever see a graph that doesn't match your stats, check the y-axis.
We also have the ANOVA Effect Sizes table. Eta-squared [row header] is your best effect size for an ANOVA, so we would look here; our value is .018 [under the Point Estimate column header]. It's saying: hey, that's a really, really tiny effect, so it makes sense, then, that we probably didn't find anything.
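Putting those two tables together outside SPSS: the omnibus F-test is SciPy's `f_oneway`, and eta-squared is just the between-groups sum of squares over the total sum of squares. A sketch with made-up groups whose means clearly differ, so the arithmetic is easy to follow:

```python
# One-way ANOVA via SciPy, plus eta-squared (SSB / SST) by hand.
import numpy as np
from scipy import stats

groups = [np.array([1.0, 2.0, 3.0]),
          np.array([11.0, 12.0, 13.0]),
          np.array([21.0, 22.0, 23.0])]

f, p = stats.f_oneway(*groups)

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
sst = ((all_values - grand_mean) ** 2).sum()
eta_squared = ssb / sst  # proportion of total variance between groups

print(f"F = {f:.1f}, Sig. = {p:.4g}, eta-squared = {eta_squared:.3f}")
```

With these made-up groups almost all the variance is between groups, so eta-squared is near 1; the workshop's .018 is at the opposite, near-zero end of that scale.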
I have a note here… in the Tests of Homogeneity of Variances table, so this one here,
if your p-value is less than (<) .05, you'll want to interpret the Welch’s test instead of what's in the ANOVA table. We don't have to worry about that today because we have not failed this assumption, so we're okay.
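SPSS computes Welch's test for you when you tick that box; if you ever need it outside SPSS, the formula can also be written out directly. A sketch (the function name and the example groups are mine, not SPSS's):

```python
# Welch's one-way ANOVA (robust to unequal variances), written out
# from the standard formula; this is what you read when Levene fails.
import numpy as np
from scipy import stats

def welch_anova(*groups):
    """Return (F, df1, df2, p) for Welch's one-way ANOVA."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])
    w = n / variances                       # precision weights
    big_w = w.sum()
    grand = (w * means).sum() / big_w       # variance-weighted grand mean
    lam = (((1 - w / big_w) ** 2) / (n - 1)).sum()
    f = ((w * (means - grand) ** 2).sum() / (k - 1)) / (
        1 + 2 * (k - 2) / (k ** 2 - 1) * lam)
    df1, df2 = k - 1, (k ** 2 - 1) / (3 * lam)
    return f, df1, df2, stats.f.sf(f, df1, df2)

# Made-up groups with visibly unequal spreads:
a = [1.1, 2.3, 3.1, 4.8, 5.6]
b = [2.0, 4.1, 6.3, 8.2, 9.9]
c = [3.0, 3.2, 2.9, 3.1, 3.3]
f, df1, df2, p = welch_anova(a, b, c)
print(f"Welch F({df1}, {df2:.2f}) = {f:.3f}, Sig. = {p:.4f}")
```

A useful sanity check on this formula: with exactly two groups it reduces to Welch's independent-samples t-test (F equals t squared, with the same p-value).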
The one other trick with interpreting this: let's pretend, for example, that this was actually a real difference [referencing the first graph], the blue group was way higher than our other groups and our p-value was, I don't know, .0009 or something. So, let's say we DID find some difference here. The p-value from the ANOVA – even if you have a graph!! – does not tell you where that difference is. You can guess, so I would probably guess like, oh, let's say this is a lot higher [the blue data point on graph one] than pink, green, and orange; the blue group is probably higher based on the graph, but you don't know for sure.
Further down in your output, you will have the Post Hoc Tests > Multiple Comparisons table. This table will tell you where your differences are, and you only look at this table if your ANOVA is significant.
[The Multiple Comparisons table presents multiple comparisons for the dependent variable Fake_Data1, categorized by color groups (Blue, Pink, Green, and Orange). It includes results from different post hoc tests (Tukey HSD, Bonferroni, and Games-Howell) and displays comparisons between each pair of color groups. The columns contain mean differences, standard errors, significance values, and 95% confidence intervals (lower and upper bounds) for each test.]
So today we wouldn't actually normally look at this, but for practice: we would look at the top half of the table, depending on whether you want the Tukey correction or the Bonferroni correction; you can look at either. But for example, we would say: “Okay, blue LOOKS different than the others”, so I would check blue versus pink (what's my p-value for that difference?), blue versus green (what's my p-value for that difference?), and blue versus orange (what's my p-value for that difference?).
So today this is all non-significant because the ANOVA was non-significant, so we wouldn't even look at this table. But you can look at this table if you found a significant ANOVA.
The last piece of this table is the Games-Howell. You would look at this if you failed homogeneity of variances. So, if the test on the previous screen was p less than (<) .05, you would read this for your pairwise comparison instead of the top half.
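For completeness, the Tukey HSD half of that table has a direct SciPy counterpart (`scipy.stats.tukey_hsd`, available since SciPy 1.8); Games-Howell is not in SciPy, so this sketch covers only the equal-variances half, with made-up groups where one group clearly sits above the others:

```python
# Tukey HSD pairwise comparisons: the "Equal Variances Assumed"
# half of SPSS's Multiple Comparisons table. Made-up data, with
# the "blue" group deliberately far above the other two.
from scipy import stats

blue  = [21.0, 22.5, 23.1, 21.8, 22.2]
pink  = [11.0, 12.4, 11.7, 12.9, 11.3]
green = [12.1, 11.6, 12.8, 11.9, 12.3]

res = stats.tukey_hsd(blue, pink, green)
names = ["Blue", "Pink", "Green"]
for i in range(3):
    for j in range(i + 1, 3):
        print(f"{names[i]} vs {names[j]}: "
              f"mean diff = {res.statistic[i, j]:.2f}, "
              f"Sig. = {res.pvalue[i, j]:.4f}")
```

As in the SPSS workflow, you would only read these pairwise p-values after a significant omnibus ANOVA.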
And that is how you run, really quickly, a one-way ANOVA.
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.