Multiple Linear Regression

Presenter(s): Mark Tranmer, Jen Murphy, Mark Elliot, Maria Pampaka



Regression is a statistical modelling technique used for analysing and predicting the relationship between variables. Multiple Linear Regression (MLR) is used to understand the relationship between one dependent variable and two or more independent variables. By modelling the linear relationship between the variables, MLR allows for the prediction of outcomes and provides insights into the significance of the predictors. 

This tutorial will explain the application of MLR, drawing on examples from the social sciences to demonstrate its utility in real-world scenarios. A downloadable worksheet and a handbook are also included.

Foundations and Applications of Multiple Linear Regression

Multiple Linear Regression (MLR) extends simple linear regression, marking a significant advancement in statistical analysis. It introduces the ability to examine the influence of multiple independent variables on a single dependent variable, broadening the scope of research and analysis across various fields. MLR is a critical tool for predictive modelling and data interpretation.



   Download transcript    |   Download slides

Simple Linear Regression

A simple linear regression estimates the relationship between a response variable, and a single explanatory variable, given a set of data that includes observations for both of these variables for a particular sample. For example, we may ask: how can exam performance at age 16, the response variable, be predicted from exam results at age 11, the explanatory variable? So, we are interested in the relationship between age 11 and age 16 scores. We hypothesise that the age 16 score can be predicted from the age 11 score; that there is an association between the two.
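A simple linear regression of this kind can be sketched in a few lines of Python. The scores below are made up for illustration (they are not from the tutorial's dataset); the fitted line predicts the age 16 score from the age 11 score.

```python
import numpy as np

# Hypothetical exam scores for a small sample of pupils (illustrative data).
age11 = np.array([45.0, 52.0, 60.0, 58.0, 70.0, 75.0, 80.0, 66.0])
age16 = np.array([50.0, 55.0, 63.0, 60.0, 72.0, 78.0, 85.0, 68.0])

# Fit age16 = intercept + slope * age11 by least squares.
slope, intercept = np.polyfit(age11, age16, deg=1)

# Predicted age 16 scores from the fitted line.
predicted = intercept + slope * age11
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A positive slope is consistent with the hypothesised association: pupils who score higher at age 11 tend to score higher at age 16.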


Multiple Linear Regression

Multiple linear regression extends simple linear regression to include more than one explanatory variable. In both cases, we still use the term ‘linear’ because we assume that the response variable is directly related to a linear combination of the explanatory variables.

MLR can be applied to a variety of problems, such as predicting an individual’s income given several socio-economic characteristics, or estimating systolic or diastolic blood pressure given a variety of socio-economic and behavioural characteristics (occupation, drinking, smoking, age, etc.).
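The income example can be sketched with simulated data. Everything below is hypothetical: the predictors (education, age, hours worked) and their true coefficients are chosen purely for illustration. The design matrix gains one column per explanatory variable, plus a column of ones for the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors of income (illustrative only).
education = rng.uniform(8, 20, n)   # years of education
age = rng.uniform(20, 65, n)        # age in years
hours = rng.uniform(10, 60, n)      # weekly hours worked

# Simulated response with known coefficients plus random noise.
income = 5.0 + 2.0 * education + 0.3 * age + 0.5 * hours + rng.normal(0, 2.0, n)

# Design matrix: a column of ones for the intercept, then the predictors.
X = np.column_stack([np.ones(n), education, age, hours])

# Least-squares fit: one coefficient per column of X.
coef, *_ = np.linalg.lstsq(X, income, rcond=None)
print("estimated coefficients:", np.round(coef, 2))
```

Because the data were simulated with known coefficients, the estimates should land close to the true values (2.0, 0.3, 0.5), which is a useful sanity check when first trying the method.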


Assumptions when using Linear Regression

When we use linear regression to build a model, we assume that:

  • The response variable is continuous and the explanatory variables are either continuous or binary

  • The relationship between outcome and explanatory variables is linear

  • The residuals are homoscedastic

  • The residuals are normally distributed

  • There is no more than limited multicollinearity

  • There are no omitted variables, i.e. no variables excluded from the model have strong relationships with the response variable (after controlling for the variables that are in the model)

  • The errors are independent

  • The observations are independent.
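The multicollinearity assumption can be checked numerically with the variance inflation factor (VIF): regress each explanatory variable on the others and compute 1 / (1 − R²). The sketch below uses made-up data in which one predictor is deliberately built to be nearly collinear with another, so its VIF comes out high.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Hypothetical predictors: x2 is constructed to be nearly collinear with x1.
x1 = rng.normal(0, 1, n)
x2 = 0.9 * x1 + 0.1 * rng.normal(0, 1, n)
x3 = rng.normal(0, 1, n)   # independent of the other two

def vif(target, others):
    """Variance inflation factor: 1 / (1 - R^2) from regressing one
    predictor on the remaining predictors (with an intercept)."""
    X = np.column_stack([np.ones(len(target))] + others)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    r2 = 1 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

vif_x1 = vif(x1, [x2, x3])   # large: x1 and x2 are nearly collinear
vif_x3 = vif(x3, [x1, x2])   # close to 1: x3 is unrelated to the others
print(f"VIF x1: {vif_x1:.1f}, VIF x3: {vif_x3:.1f}")
```

A common rule of thumb treats VIF values above about 5 or 10 as signalling problematic multicollinearity; here x1 far exceeds that, while x3 does not.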

 

For most of these assumptions, if they are violated then it does not necessarily mean we cannot use a linear regression method, simply that we may need to acknowledge some limitations, adapt the interpretation or transform the data to make it more suitable for modelling.

 

> Download a worksheet (with answers).

> Download handbook: Multiple Linear Regression (2nd Edition). The handbook expands on simple and multiple linear regression, hypothesis testing, using SPSS, assumptions, and more complex models.

 

Steps for MLR Analysis Using SPSS:

SPSS is a statistical software package widely used for analysing data in the social sciences. It enables researchers to perform multiple linear regression (MLR) and other analyses.

The steps below provide a structured approach to conducting MLR in SPSS, from data preparation to assumption testing and interpretation of results.

  1. Prepare Data: ensure your dataset is correctly formatted, with each row representing a case and columns representing variables.
  2. Variable Selection: identify your dependent variable and multiple independent variables. Use theoretical knowledge and exploratory data analysis to select relevant predictors.
  3. Check Assumptions: before running the MLR, assess the data for multicollinearity using correlation matrices or Variance Inflation Factor (VIF), ensure linearity between independent variables and the dependent variable, and check residuals for normality.
  4. Run the Regression: navigate to Analyze > Regression > Linear in SPSS. Assign your dependent variable and independent variables to the respective fields. For multicollinearity diagnostics, include tolerance and VIF in the output.
  5. Interpret Results: examine the coefficients to understand the impact of each independent variable on the dependent variable. Evaluate the R-squared value to assess the model's overall explanatory power. Check the significance levels (p-values) to determine which variables significantly contribute to the model.
  6. Assumption Validation Post-hoc: After running the MLR model, validate the assumptions by analysing the residuals for homoscedasticity and normality.
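The fitting, interpretation and post-hoc steps above (steps 4–6) have direct analogues outside SPSS. The sketch below mirrors them in Python with simulated data: fit the model, read off the coefficients and R-squared, then inspect the residuals. All variable names and true coefficients are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 150

# Hypothetical data: two predictors and a response (illustrative only).
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 1.0, n)

# Step 4 analogue: run the regression (least squares with an intercept).
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 5 analogue: coefficients and R-squared (explanatory power).
resid = y - X @ coef
r_squared = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
print("coefficients:", np.round(coef, 2))
print(f"R-squared: {r_squared:.3f}")

# Step 6 analogue: basic residual check (mean is ~0 for OLS with intercept).
print(f"residual mean: {resid.mean():.2e}")
```

In SPSS the same quantities appear in the Coefficients and Model Summary tables; a full post-hoc check would also plot the residuals against the fitted values to look for non-constant variance.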


Terms and definitions

categorical: A variable where the responses are categories. For example, ethnicity.
collinear: When one variable can be used to predict another; when two variables are closely linearly associated.
continuous: A variable which takes a numeric form and can take any value. For example, distance in miles to the nearest shop.
Cook's distance: A measure of whether or not observations are outliers, used to assess whether an observation within a dataset should be removed to improve the fit of the model. A common threshold for further consideration is three times the mean Cook's distance; the creator of the measure regards any point with a Cook's distance of 1 as a concern.
correlated: Two continuous variables are said to be correlated if a change in one variable is associated with a measurable change in the other. The correlation coefficient is a measure of the strength of this association or relationship.
response variable: The outcome we want to predict. The value of this variable is predicted to be dependent on the other terms in the model. Sometimes referred to as the dependent variable or the Y variable.
error term: See residual.
explanatory variable: A variable which we use to predict the outcome variable. Also referred to as an independent variable or X variable.
homoscedastic: One of the key assumptions for a linear regression model. If residuals are homoscedastic, they have constant variance regardless of the values of any explanatory variables.
linear regression: A method where a line of best fit is estimated by minimising the sum of the squared differences between the actual and predicted observations.
multicollinearity: When two or more variables are closely linearly associated, or can be used to predict each other.
multiple linear regression: Linear regression with more than one explanatory variable.
ordinal variable: A variable where the responses are categories which can be put in an order. For example, the highest level of education achieved by a respondent. Remember that the possible responses may not be evenly spaced.
Pearson's coefficient: A measure of correlation.
population: The whole group we are interested in.
positive correlation: A situation where, if one variable increases in value, the other variable also tends to increase in value.
R2: A measure of model fit; the percentage of variance explained by the model.
representative: A sample is representative when it has the same statistical properties as the population as a whole, so that the results of a statistical analysis of the sample can be inferred to hold for the population. To be representative, a sample needs to be of sufficient size and the correct composition to reflect the groups within the underlying population.
residual: The difference between the value predicted by the model and the actual value of the observation. When texts refer to "the residuals", they mean the data generated by calculating the residual for every observation in the dataset.
sample: The subsection of the population which we are studying; a smaller number of units drawn from the population. For example, if we are interested in menu choices in a school canteen, our population of interest is everyone in the school, and a survey of 5 students from each year group would be our sample.
simple linear regression: Linear regression with one explanatory variable.
skew: A measure of the symmetry of a distribution. A symmetrical distribution has a skew of 0. Positive skew means more of the values are at the lower end of the distribution; negative skew means more of the values are at the higher end.
statistically significant: A result is statistically significant when it meets our criteria for the hypothesis test. "Statistically significant" is not the same as "important" or "interesting"; it has a specific technical meaning.


Our thanks to Dr. Gil Dekel for his support in preparing this tutorial.




About the author

Mark Elliot is Professor of Data Science at the University of Manchester. He specialises in confidentiality and privacy, and in social data science.

Mark Tranmer is Professor of Quantitative Social Science at the University of Glasgow. Research interests include Multilevel Modelling and Social Network Analysis, with applications in social science, public health and animal behaviour. 

Maria Pampaka is a Professor at the Social Statistics department and the Manchester Institute of Education, at the University of Manchester. Her expertise and interests lie within evaluation and measurement, systematic reviews and meta-analyses, and advanced quantitative methods, including complex survey design, longitudinal data analysis, and dealing with missing data. She usually applies these in educational research around learners’ (maths) attitudes and dispositions and teaching practices.



