Attribution
By Lindsay Plater
Time commitment
5 - 10 minutes
Description
The purpose of this video is to explain how to conduct a simple linear regression using SPSS (requires a continuous dependent variable and only one independent variable). This tutorial is designed to help students and researchers understand: the data type required for the test, the assumptions of the test, the data set-up for the test, and how to run and interpret the test.
Video
Transcript
[Lindsay Plater, PhD, Data Analyst II]
So what is a simple linear regression? Simple linear regression is used when we want to make predictions about a continuous dependent variable (which in regression is also called an outcome variable) based on one independent variable (which in regression is also called a predictor variable).
This is a parametric test and so this means that the residuals assume normality. This is a little bit new for our workshop series; normally when we say normality, we're not talking about the residuals, but we'll cover this in more detail. So we're looking at our residuals and they should form this typical bell-shaped curve.
If you are looking for additional help running a simple linear regression, there is the University of Guelph SPSS LibGuide, the Laerd Statistics guide, or the SPSS documentation where you can get a little bit more support.
What are the assumptions of a simple linear regression? We have seven, and we're going to cover all of them today.
The first is that your outcome or dependent variable must be continuous. The second is that your predictor or independent variable should be: continuous, approximately continuous (which is some ordinal data, if you do some data wrangling), or it must be dummy coded if you've got a categorical variable. The third assumption is independence of observations. The fourth assumption is that there needs to be a linear relationship between your variables; this is sometimes called linearity. And then our last three assumptions all have little stars, which means we do the regression first and then we check these after we've built our regression model: our fifth assumption is homoscedasticity (a five-dollar word), our sixth assumption is normality of the residuals, and our seventh assumption is no significant outliers in the residuals.
Okay. Side-note: it IS possible to do a simple linear regression with a categorical predictor variable; it's just not as easy to accomplish. So I've left you a resource on the slides about dummy coding your variable. Our example today is a continuous example, but next week's example, I think, has one continuous and one categorical predictor. So if you're confused about dummy coding, come to the workshop next week where we're talking about multiple linear regression. All right, let us begin.
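As a rough illustration of what dummy coding means outside of SPSS, here is a minimal Python sketch. The colour data and the `dummy_code` helper are made up for the example; they are not part of the workshop's data set.

```python
# Illustrative sketch (not SPSS): dummy coding a categorical predictor so
# it can enter a regression as 0/1 columns. With k categories you create
# k-1 dummy columns; the omitted category becomes the reference level.
def dummy_code(values, reference):
    """Return {category: [0/1, ...]} columns for every non-reference category."""
    categories = sorted(set(values) - {reference})
    return {c: [1 if v == c else 0 for v in values] for c in categories}

colours = ["red", "blue", "green", "blue", "red"]
dummies = dummy_code(colours, reference="red")
print(dummies)
# {'blue': [0, 1, 0, 1, 0], 'green': [0, 0, 1, 0, 0]}
```

A row that is 0 in every dummy column (here, "red") is the reference level that the other categories are compared against.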
Check assumption (continuous outcome)
[Slide contains a screenshot of a table in SPSS within Data View. The table’s column headers are as follows: Gender, Fake_Data1, Fake_Data2, Fake_Data3, Fake_Data4, Colour, and Group.]
Our first assumption is that your outcome or dependent variable must be continuous. So if we use, for example, Fake_Data1 in our data set [Fake_Data1 column is highlighted], we can look at this; we see a range of values, we see a bunch of decimals, we can say “Yep, this looks like it's probably continuous data”. So we have passed this assumption.
Our second assumption is that the predictor variable should be continuous, approximately continuous, or dummy coded if it is categorical. Here, we're going to be looking at Fake_Data2 [Fake_Data2 column is highlighted]. We can look: we see a range of values, we see some decimals; it's a giveaway that this is probably continuous data. Which means we've passed our second assumption: the predictor variable in this case is continuous.
Our third assumption is independence of observations; we normally set this up before we actually get to this stage.
So if you're doing a survey or an experiment, you normally make sure your data is independent at that stage. But we can look in each row of the data and say: “Does this look like a unique participant?”.
[First row is highlighted.]
So for example, row one, we've got someone who identifies as male, they have data for Fake_Data1, Fake_Data2, Fake_Data3, and Fake_Data4. They've got a Colour piece, and they've got a Group piece. It LOOKS like each person here is a unique participant. It's hard to tell after-the-fact, but it's all fake data. We can say that this assumption passes today.
Reminder you normally set this up before you do the study, you want to make sure you've got unique data for each participant.
[Slide shows the table with the Graphs menu open, and Chart Builder selected.]
Our next assumption is called linearity. You must have a linear relationship between your variables. So here, if we've got a continuous predictor variable and a continuous outcome variable, we want to make sure that they look like they have a linear relationship.
To check for linearity, the easiest way to do this is to actually click: Graphs > Chart Builder. So what we're going to do is we're going to create a scatterplot of our two variables to say: “Does it look like a straight line?”.
If you click Graphs > Chart Builder, you will likely get a pop-up. It's asking to make sure you've ensured your data types are set up properly. Because you've got the fake data from me, they're both set up as continuous or “Scale” data in our Variable View under the Measure column. So we can say OK or Continue, we're good, our data is set up properly.
So, Chart Builder, say OK, and then you'll get your Chart Builder dialog box which looks something like this.
[SPSS Chart Builder dialog with three panels: on the left, a list of available variables (Gender, Fake_Data1–4, Colour, Group); in the center, a live preview of a scatter plot (Fake_Data1 vs Fake_Data2) with drag-and-drop drop zones for the x and y axes, color, size, and filter; and on the right, the panel has three tabs (Element Properties, which is selected, Chart Appearance, and Options). Below the preview, the pane has four tabs (Gallery, which is selected, Basic Elements, Groups/Point ID, and Titles/Footnotes).]
The easiest way to look for linearity is a scatterplot. So in the bottom left, we're going to click where it says “Scatter/Dot”. The dialog box is a little weird to use if you haven't used it very much before. You grab the first plot option (it looks like a bunch of blue data points), and you drag and drop it to the blue text in the middle of the screen. And then what happens is it pops out like some fake data points, so it makes it look kind of like a scatterplot. That means you've picked the right one. What you do next is you grab your two variables, Fake_Data1 and Fake_Data2, and you put them on your X and Y axes.
I think I put Fake_Data1 on the Y and Fake_Data2 on the X here. And if you've done that, it populates some fake data. Like, more fake data, not our exact fake data. It populates just like what this might look like if you built this as a graph.
It should look like just a bunch of random scatterplot points. If it looks like what mine looks like on the screen, you're good to go, you can click OK. So you just have to make sure you click Scatter/Dot, you move the graph that you want, you select your axis options, and then you're good to go.
If you do all that, you'll get a scatterplot that looks something like this.
[A scatter plot titled “Scatterplot of Fake_Data1 by Fake_Data2” with red circular markers. The X-axis (“Fake_Data2”) ranges from 90 to 95, and the Y-axis (“Fake_Data1”) ranges from 85 to 90. Data points are broadly dispersed.]
You'll have blue points; I made mine red to match the presentation. So we have Fake_Data1 on our Y-axis, and we have Fake_Data2 on our X-axis, and what we're doing here is we're looking at this critically and saying: “Hmm, does this look like it forms a straight-ish line? If I was to draw a line through my data points, to try to touch as many of those data points as possible, does it look like a straight line?” It can go up, it can go down, it can be going kind of flat, any of those are fine, but it should be a straight line.
This is a tricky example, because if you were to split this plot in half, it looks kind of like the left side is going down and it looks kind of like the right side is going up. So it almost looks a bit like a “V”, so it's a little tricky. There's something that we can do to help us visualize “Is it a line?”. If you double-click on your plot, it will open up your Chart Editor dialog box. So it should look something like this.
[Two side-by-side SPSS scatterplot windows of Fake_Data1 (y-axis) by Fake_Data2 (x-axis). The left pane shows the original GGRAPH output with a grey, hatched background and x-axis ticks from 90.0 to 94.0; the right pane shows the Chart Editor view on a plain white background.]
So if I double-click on the plot, it'll open up another box for us here, and there's a button that we can click to add a line to our plot for us. So it's this button right here. There's two rows of buttons in SPSS in the Chart Editor, it's on the bottom row, it's one of our last buttons. It's literally a little X/Y axis with a straight line through it. If you click that, it gives you a few options that you could select to add a line to your plot. The default is, I think it says linear or line; that would always give you a straight line. So every single time you do it, it gives you a straight line and says: “Here's where your best fit is”, it's the line of best fit, a straight line. You could select that; the option I've selected is actually called a Loess (l-o-e-s-s). Some guides will refer to this as Lowess (l-o-w-e-s-s), they mean the same thing.
So we can actually add a loess curve; I'll break it down and explain what it is. It's essentially, for example, taking your first five data points and drawing where the line of best fit would be for them. Then it moves over by one point, so it would take points 2 – 6, and it would draw where the line would be. Then it moves the window again, so it would do points 3 – 7 and draw where the line of best fit is. So it is essentially doing your line of best fit for different chunks of your data all the way through, and saying “Here's what it would look like for each of those chunks.”
So the loess curve here, it's not just a straight line, it's using pieces of the data to say for JUST these pieces, what's happening with my line? And as you move that window, the line could change; it might not always be a straight line. So if we had a loess curve, we can look at this and say: “Mmm, does this look like a straight line?”. Well, on the left-hand side here, it looks like it's mostly going down. And on the right-hand side here, it looks like it's mostly going up. This part [the right-hand side] is quite straight, like very surprisingly straight, but the stuff on the left is not really a straight line.
So we could use something like a loess curve to help us determine is this a straight line, do we meet the assumption of linearity. In this case…eeh? It's a little rough. I would – if I looked at this on some real data – I'd say you may be okay, but you would maybe be concerned that you've not met this assumption. So it's a reminder to always check your assumptions; you don't know if you've met these assumptions until you check them. So: tricky to say! It's possible we passed, or in some people's eyes, maybe we failed this one. Okay.
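To make the sliding-window idea concrete, here is a toy Python sketch. This is not SPSS, and it is simplified: real loess also distance-weights the points in each window, whereas this version just refits an ordinary least-squares line on each window of five consecutive points. The data are made up to mimic the "V" shape in the video.

```python
# Toy sketch of the sliding-window idea behind a loess curve: fit a line
# of best fit to each window of 5 consecutive points (sorted by x) and
# record the smoothed value at the window's centre.
def ols(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def windowed_fits(points, width=5):
    pts = sorted(points)  # sort by x
    fits = []
    for i in range(len(pts) - width + 1):
        window = pts[i:i + width]
        xs = [p[0] for p in window]
        ys = [p[1] for p in window]
        a, b = ols(xs, ys)
        cx = xs[width // 2]
        fits.append((cx, a + b * cx))  # smoothed value at the window centre
    return fits

# A "V"-shaped pattern like the one in the video: the local slope flips
# sign halfway through, so the smoothed curve is clearly not one line.
data = [(x, abs(x - 5)) for x in range(11)]
print(windowed_fits(data))
```

Because the left-hand windows slope down and the right-hand windows slope up, the smoothed values trace the "V" rather than a single straight line, which is exactly the warning sign for the linearity assumption.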
[Slide shows the table with the Analyze menu open and Regression selected. From the Regression sub-menu Linear is highlighted.]
What we do next is, assuming you've passed all your assumptions, we can proceed to conducting the simple linear regression and you might be going: “Lindsay, what's going on? There were seven assumptions, we have not covered seven assumptions yet!” Three of our assumptions, you build your regression first, and then you check the assumptions. So if you've passed the assumptions so-far, now is when you would do the regression and you do that by clicking: Analyze > Regression > Linear.
This is an example where SPSS has made it super easy to find what you're looking for, because it's actually using the name of the thing we're trying to do.
So there's three assumptions we still have to check, so what we're going to do is run the regression as normal and then check our last three assumptions.
So if you click: Analyze > Regression > Linear, you will get the Linear Regression dialog box.
[Linear Regression dialog shows a list of variables on the left (Gender; Fake_Data2–4; Colour; Group), with Fake_Data1 placed in the Dependent field and Fake_Data2 in Independent(s) under Block 1. To the right are buttons labeled Statistics (which is selected), Plots, Save, Options, Style…, and Bootstrap, plus a Method dropdown set to “Enter.” At the bottom are boxes for Selection Variable, Case Labels, and WLS Weight.]
What you do is you take your continuous dependent variable, also called your outcome variable, and you put it in the box that says “Dependent:”. And it should have a little yellow ruler next to it saying that it's continuous data. You can also take your predictor or independent variable from the left-hand side and move it to where it says “Independent(s):”. The trick here is this might be a little yellow ruler if it's continuous or approximately continuous, or you might have that little bar plot (it's got a blue line, a green line, and a red line) if you had ordinal data. But today, we're going to use continuous data points, we're going to use Fake_Data1 for dependent, Fake_Data2 for independent.
There's a few other things you have to do, so you're going to click the Statistics button. That will open the Statistics dialog box, where you're going to make a few selections.
[Linear Regression: Statistics dialog, which contains groups of checkboxes for Estimates (selected) and Confidence intervals (selected and set to Level: 95%), Model fit (selected), R squared change, Descriptives (selected), Part and partial correlations, Collinearity diagnostics, and Selection criteria. In the second grouping “Residuals” there are checkboxes for PRESS, Durbin-Watson, and Casewise diagnostics.]
You want it to say: Estimates, Confidence intervals (at 95%, that's default), Model fit, and Descriptives. When you've done all that, you click the Continue button. Just pausing for a second in case anyone's doing that, there's a few buttons here. Once you've made your Statistics selections, you're going to click where it says Plots [in the Linear Regression dialog]; this is to check one of our assumptions.
[Linear Regression: Plots dialog box lists available variables (Dependent, *ZPRED, *ZRESID, *DRESID, *ADJPRED, *SRESID, *SDRESID) in a panel on the left. In the Scatter 1 of 1 section, Y is set to *ZRESID and X to *ZPRED, with arrow buttons for reassignment. Below, Standardized Residual Plots has checkboxes for Histogram and Normal probability plot (both checked) and an option to Produce all partial plots (unchecked).]
You're going to take where it says “ZPRED”, Z predicted, and you're going to drag and drop that to where it says “X:”. You're going to take “ZRESID”, also known as Z residuals, and put it where it says “Y:”. So Z residual on the Y, Z predicted on the X. You should also have “Histogram” and “Normal probability plot” selected in the bottom left-hand corner.
Once you've done that, you can say Continue. This will make a plot for us to look at.
You also want to click the Save button [in the Linear Regression dialog box]; this will open up our Save dialog box.
[Linear Regression: Save dialog box is divided into sections (Predicted Values, Residuals, Distances, Influence Statistics, Prediction Intervals, and Coefficient statistics) of checkboxes for output options. At the bottom is an option to Export model information to XML and Include the covariance matrix.]
It's important that we do this because this is for our assumption of normality, and I think also outliers. So we want to say “Standardized (Residuals).” It's kind of near the top right. What “Save” does in SPSS, anytime you click the Save button, it means you want to save a column of data (or more than one if you select more than one option) into your data window. So it's going to create a new column for us, and we're going to use that column in a few minutes. Once we've said Standardized Residuals, we can say Continue. And that's everything we need to do: if you've made your selections in Statistics, Plots, and Save, you can click the OK button [in the Linear Regression dialog box], you're ready to run the regression.
Okay. If you have made all of those selections, you should get a pretty big output file. There should be quite a bit going on. You're going to scroll down to the scatterplot it just generated for you.
[A scatter plot titled “Scatterplot” with the subtitle “Dependent Variable: Fake_Data1.” The Xaxis is labeled “Regression Standardized Predicted Value” (ranging roughly from –1.5 to 1.5), and the Yaxis is labeled “Regression Standardized Residual” (ranging roughly from –2 to 2). Red circular markers represent each case’s standardized residual plotted against its standardized predicted value.]
Hey, this should look a little bit familiar! It made you a scatterplot with your standardized residual and your standardized predicted value, because we said in the Plots dialog that this is what we wanted to generate. So now that we've built our model, we can check our assumption of homoscedasticity.
What we're looking for here – this is called a “Residual versus Fitted” plot, it’s a scatterplot – we want to see no obvious pattern. There should be an approximately equal number of points on the top half and the bottom half. There should be an approximately equal number of points on the left half and the right half. There should be no obvious cone or fan shape; so if you were to like draw a sideways “V”, you don't want to see the points following a sideways V going in one direction or the other.
If I look at this, it's got a little bit of a “U” going on, but approximately equal top and bottom, approximately equal left and right, no obvious fan or cone shape, we get a thumbs up. We have passed the assumption of homoscedasticity.
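As a rough numeric companion to eyeballing the plot, here is a small Python sketch: fit a line, compute the residuals, and count how many land above versus below zero. The data are made up for the demo, and in practice the residual-versus-fitted plot itself is the real check; balanced counts are merely consistent with "no obvious pattern".

```python
# Rough numeric stand-in for eyeballing the residual-vs-fitted plot:
# fit a line, compute residuals, and count the points in each half.
def fit_line(xs, ys):
    """Least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 2.9, 4.2, 3.8, 5.1, 5.9, 7.2, 7.8]  # made-up data

a, b = fit_line(xs, ys)
fitted = [a + b * x for x in xs]
resid = [y - f for y, f in zip(ys, fitted)]

above = sum(r > 0 for r in resid)  # points in the top half of the plot
below = sum(r < 0 for r in resid)  # points in the bottom half
print(above, below)  # roughly equal counts suggest no obvious pattern
```

A strong imbalance, or residuals that grow steadily with the fitted values (the cone/fan shape), would be the red flag for heteroscedasticity.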
[Slide shows the table with the Analyze menu open and Descriptive Statistics selected. From the Descriptive Statistics sub-menu Explore is highlighted.]
Two more assumptions! We have to check for normality and outliers. You've seen this before if you've come to any of my workshops, but now we have a new column of data that we're going to use. So we have to click: Analyze > Descriptive Statistics > Explore. In our data window, we have a new column because we built the model first, we got our standardized values, and now we use those standardized values for this check. So Analyze > Descriptive Statistics > Explore.
That will open up the Explore dialog box.
[Explore dialog box shows a left‐hand list of variables (Gender, Fake_Data1-4, Colour, Group) and a Dependent List on the right containing ZRE_1. The Factor List and Label Cases by fields are empty. The Plots button has been selected.]
We're going to take this “ZRE_1” option and put it where it says “Dependent List:”. If you were doing multiple regressions on the same day in the same window, you might not have ZRE_1; it'll just increase the number. So if I do 3 regressions in the same day, I'll have ZRE_1, ZRE_2, ZRE_3. You want to pick the one that is for the regression you just ran. Today we only have one of them, so it's really easy: it's ZRE_1.
So we're going to take this and put it in dependent list and then we have to click the plots button.
[Explore: Plots subdialog, where Boxplots is set to Factor levels together, Histogram is checked under Descriptive, and Normality plots with tests is enabled. The Spread vs Level with Levene Test is set to None.]
We want to uncheck Stem-and-leaf, check Histogram, and check where it says Normality plots with tests, to make sure we're checking for normality and outliers. Once we've done all that, we can click the Continue button, and then we're good to go; we can click OK [in the Explore dialog box]. So if we make these selections in the Explore dialog box, we can check for normality and outliers in our output.
We're checking our residuals; ZRE_1 is our standardized residuals. We can scroll down to where it says Tests of Normality.
[Tests of Normality table presents a single row for the Standardized Residual, with two side-by-side test sections. Under Kolmogorov-Smirnov, the columns are Statistic, df, and Sig.; under Shapiro-Wilk, they are likewise Statistic, df, and Sig. A footnote notes use of the Lilliefors Significance Correction.]
If you have fifty or more observations, you're going to look where it says Kolmogorov-Smirnov.
Here we have fewer than fifty observations [in the df column], so we're going to look on the right-hand side where it says Shapiro-Wilk. If p is less than .05 [in the Sig. column], it means you have a problem with normality: you have failed the normality assumption, and it might not be appropriate for you to run this test. Here we have a p-value greater than .05, which means we're OK, we've passed normality.
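Outside of SPSS, the same decision rule can be sketched in Python using scipy's Shapiro-Wilk test (this assumes scipy is available; the residual values below are made up for the demo, so the exact W and p will not match the video's output):

```python
# Sketch of the Shapiro-Wilk normality check on standardized residuals,
# using scipy rather than SPSS. The residuals below are made-up demo data.
from scipy.stats import shapiro

residuals = [-1.2, -0.8, -0.5, -0.3, -0.1, 0.0, 0.2, 0.4,
             0.7, 1.1, 1.4, -0.6, 0.9, 0.3, -0.2]

stat, p = shapiro(residuals)
print(f"W = {stat:.3f}, p = {p:.3f}")

# Same rule as the Tests of Normality table: p < .05 means a problem.
if p < 0.05:
    print("Failed the normality assumption")
else:
    print("Passed the normality assumption")
```

As in SPSS, Shapiro-Wilk is the appropriate choice at small sample sizes; for fifty or more observations you would look at Kolmogorov-Smirnov instead.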
This is how to check normality with a statistic. We can also check normality using some histograms.
We actually have two versions of this histogram in our output right now.
[Two side-by-side histograms of regression standardized residuals. The left plot shows red bars only, with the X-axis labeled “Standardized Residual” (from –2 to 2) and the Y-axis labeled “Frequency” (from 0 to 6). The right plot similarly displays red bars under a smooth black normal-curve overlay, with the X-axis labeled “Regression Standardized Residual,” the Y-axis “Frequency,” and the subtitle “Dependent Variable: Fake_Data1.”]
If you've selected everything I've selected, you should get one from the Explore procedure and one, I think, from actually just running the regression as well, so they both look about the same. One of them has the literal line for a bell-shaped curve imposed on top, but for both of these we're looking to say: does it look like our typical bell-shaped curve? And for both of these, I squint and say yeah, that looks pretty good. We've passed normality.
For visual inspection, we can also look at our P-P plot.
[Normal P-P Plot of regression standardized residuals for Fake_Data1, with red data points plotting Observed Cum Prob on the x-axis versus Expected Cum Prob on the y-axis, and a black diagonal reference line.]
It's got a straight line going across the page and a bunch of data points, and we're looking to see: are the points on, or pretty close to, the line? I would say yeah, that's a pretty good P-P plot; everything is close to or on the line. I would say, with our statistic and our two kinds of plot, we have passed normality. Excellent.
At the bottom of the output file we should see our box plot.
[A vertical boxplot labeled “Standardized Residual” showing values from –2.0 to +2.0 on the y-axis. The box is centered around 0, with the median line near the middle of the box (slightly above 0). Whiskers extend approximately to –1.8 and +1.8.]
There are multiple ways to check for outliers; SPSS makes the box plot method really easy. We've got our median, our interquartile range, and our whiskers, and we're looking to say: are there any data points outside of those whiskers that might be considered outliers? If there's a circle with a number, it means that data point is an outlier, and the number tells you which row in your data is the outlier. If it's a star with a number next to it, it's an extreme outlier, and the number tells you which row is the extreme outlier. Here we see no outliers, so we've also passed our assumption for outliers.
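For anyone curious what rule the boxplot is applying, here is a plain-Python sketch of the usual 1.5 × IQR / 3 × IQR convention. The residual values are made up, and the quartile method below is only a rough approximation of the hinges SPSS uses, so results can differ slightly at small sample sizes.

```python
# Sketch of the boxplot outlier rule: points beyond 1.5 * IQR past the
# quartiles are outliers (SPSS marks these with a circle); points beyond
# 3 * IQR are extreme outliers (SPSS marks these with a star).
def quartiles(data):
    """Approximate Q1 and Q3 as medians of the lower and upper halves."""
    s = sorted(data)
    def median(v):
        n = len(v)
        mid = n // 2
        return v[mid] if n % 2 else (v[mid - 1] + v[mid]) / 2
    half = len(s) // 2
    return median(s[:half]), median(s[-half:])

def classify_outliers(data):
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    outliers, extreme = [], []
    for i, v in enumerate(data):
        if v < q1 - 3 * iqr or v > q3 + 3 * iqr:
            extreme.append(i)    # star in the SPSS boxplot
        elif v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr:
            outliers.append(i)   # circle in the SPSS boxplot
    return outliers, extreme

resid = [-1.8, -1.1, -0.6, -0.2, 0.0, 0.3, 0.7, 1.2, 1.8]
print(classify_outliers(resid))
# ([], []) -- no outliers, like the boxplot in the video
```

Appending a wild value such as 10 to this list would land it past the 3 × IQR fence, so it would come back in the extreme-outlier list with its row index.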
We have checked all of our assumptions. We've maybe failed linearity, but everything else looks OK, so let's proceed and actually look at our regression, which you would do if you've passed all your assumptions.
OK, the regression output.
[SPSS output showing three tables. Model Summary table – one row (Model 1) with columns R, R Square, Adjusted R Square, and Std. Error of the Estimate. ANOVA table – three rows (Regression, Residual, Total) with columns Sum of Squares, df, Mean Square, F, and Sig. Coefficients table – two rows (Constant and the predictor variable) under Model, with column groups Unstandardized Coefficients (B, Std. Error), Standardized Coefficients (Beta), then t, Sig., and 95% Confidence Interval for B (Lower Bound, Upper Bound).]
There's quite a bit here. I'm going to show you just the most important pieces. So in our model summary table, this is telling us how good of a job are we doing? How good does our predictor do in actually saying what is happening with our outcome variable?
So here we look at our R Square value and we can see a value of .008. If you multiply this by 100, it will give you a percentage. So here we are, sitting at 0.8% of the variance of our outcome measure being explained by our predictor measure. Not very good, less than 1%.
We can also look at the ANOVA table, and the Sig. column for the ANOVA table will tell you, essentially: is this doing a good enough job? Our R Square value is very low, our predictive power is pretty low, and our p-value for our ANOVA is nonsignificant, p greater than .05. This is saying we're not doing a super great job, so we would actually stop here; we would not move on to interpret our linear regression because our ANOVA is nonsignificant. If your ANOVA p-value here was less than .05, that'd be: “Yep, you're doing a good job. You're allowed to interpret your regression.” For practice today, we're going to interpret our regression anyway, but normally we would stop right here. Oh, I've got some circles: the R Square value [column in Model Summary table] and this Sig. value [Sig. column in ANOVA table].
The coefficients table, which is your next table here, is your actual regression proper. So this is your linear regression. There are two super important pieces of this table. The rest of it, it's important to you, you would probably report it in a paper, but less important if you're like, “I just want the quick and easy version.”
So the constant line, if you remember from like high school math, if you ever did Y equals MX plus B, the constant line is your plus B. This is your intercept. You generally ignore the constant line unless you were trying to build a regression equation to help you predict something. Most people, when they come see me, they actually just ignore the constant line because they are not building an equation, so that's OK.
We want to look at the Fake_Data2 row [in the Coefficients table]; this is our predictor variable. We want to see: how well is our predictor variable doing at explaining our outcome variable? We can look where it says Sig. [column]; this is our p-value. Hey, these two match [Sig. columns in ANOVA table and Coefficients table] because it's a simple linear regression. It's not doing a good job; our Fake_Data2 predictor variable is not doing a good job explaining what's happening in our Fake_Data1 variable, and that's OK. That's real data, it happens sometimes. So we would not continue to interpret this, again because it's nonsignificant, and it matches the ANOVA; we already would have stopped because it's not significant.
If our p-value was less than .05 here [Sig. column of the Coefficients table], we'd be able to say that our predictor variable, Fake_Data2, is doing a good job at explaining something that's happening in Fake_Data1, and we'd be able to interpret how big the effect is. We can quantify the size of the effect by looking in our Unstandardized Coefficients section [of the Coefficients table], in the B column. So again, we wouldn't interpret this today, but if this was significant, we'd be able to say that for each one-unit increase in Fake_Data2, our predictor column, we're seeing a .061-unit increase in our Fake_Data1 column. It just so happens that this effect is not significant today, and that's OK; with all of the data on the screen, less than 1% of our variance is being explained by our predictor variable. It's not doing a very good job, so we would stop. But if it was doing a good job, we would look in our single row here [Fake_Data2 row] to see the significance for that variable and how much change in the outcome each unit of the predictor predicts.
In other words, if the result were significant: for each one-unit increase in our predictor variable, we'd see a .061-unit increase in our outcome variable.
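For reference, the key numbers in the Model Summary and Coefficients tables (the unstandardized coefficient B, the constant, and R Square) can be recomputed by hand. Here is a small Python sketch on made-up numbers; the workshop's fake data set is not reproduced here, so these figures will not match the .061 and .008 from the video.

```python
# What the Coefficients and Model Summary tables report, recomputed by
# hand on made-up data for one predictor and one outcome.
def simple_linear_regression(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    b1 = sxy / sxx                  # unstandardized coefficient B (slope)
    b0 = my - b1 * mx               # the constant (intercept)
    r2 = (sxy * sxy) / (sxx * syy)  # R Square: proportion of variance explained
    return b0, b1, r2

xs = [91.0, 92.0, 93.0, 94.0, 95.0]   # made-up predictor values
ys = [86.0, 87.5, 87.0, 88.5, 89.0]   # made-up outcome values

b0, b1, r2 = simple_linear_regression(xs, ys)
print(f"Y = {b0:.2f} + {b1:.2f} * X, R Square = {r2:.3f}")
# Interpretation: each 1-unit increase in X predicts a b1-unit change in Y,
# and r2 * 100 gives the percentage of outcome variance explained.
```

This mirrors the interpretation in the transcript: B is the per-unit change in the outcome, the constant is only needed if you are building a prediction equation, and R Square times 100 is the percentage of variance explained.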
That is how you run a simple linear regression.
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.