Linear modelling is a powerful data-analysis tool that uses regression to describe the relationship between a dependent variable and one or more independent variables. Simple regression involves a single x and a single y, while multiple regression allows any number of x's to predict one y. Multiple regression is especially useful for analyzing large data sets with many variables. Often a data set contains more variables than are actually predictive of the dependent variable, and there are methods for reducing a linear model to a smaller, more optimal form.

To begin, let's look at the following data set:

Data=read.table("ZarEX7.4.txt",header=T)

Note that there are many different variables, and that some of them are factors: treatment classes with multiple levels. (A key to the factor codes accompanies the original data set.) For this example Salary will be the dependent variable, while the other variables will be used as independent variables to predict Salary.

Simple Regression:

To begin, let us look at how a simple regression on this data set might appear. Salary will be the dependent variable and Gender the independent variable. Here we ask whether Gender shows a significant relationship to Salary. Such a relationship would indicate a disparity in pay between genders, a question often asked as a way to demonstrate inequality in the workplace. Worded as hypotheses:

Ho: beta = 0, Gender has no relationship to Salary
H1: beta =/= 0, there is a relationship between Gender and Salary

The first step is to generate the linear model itself. This is done very simply in R via the lm function:

attach(Data)
X=factor(Gender)
Y=Salary
LM=lm(Y~X) #The variable LM is now assigned to our linear model
LM

To analyze the linear model we can do any of three things, and it is generally useful to do all three. The first is a graphical representation, but since Gender is a factor with only two levels this is not very informative here, and plots are also harder to apply to multiple regression. The second is the summary function:

summary(LM)

The summary function generates several useful values. It runs two basic t-tests: one for whether the intercept (alpha) is 0 and one for whether the slope of the regression line (beta) is 0. It also reports an r-squared value, the coefficient of determination. For these t-tests we follow the standard decision rule: if p is less than the type I error rate of alpha = .05 we reject the null hypothesis; otherwise we fail to reject it. For the null hypothesis above we are concerned with the t-test for beta = 0, whose values are found in the row labelled X1. The t-value was .303 and the p-value was .764. The p-value is greater than .05, so we cannot reject the null hypothesis; beta is roughly equivalent to 0. In terms of the question asked, we conclude that Gender does not demonstrate a significant impact on Salary. Taken on its own, then, the pay in this workplace appears to be fair between genders.

The third method of analysis is to use the anova function:

anova(LM)

This generates an ANOVA table and runs an ANOVA F-test for beta = 0. For this test we obtained an F-value of .0916 and a p-value of .7644. Note that this p-value is effectively identical to the p-value from the t-test in the summary table; in simple regression the F statistic is the square of the t statistic, so the two tests ask the same question and give the same result. All the conclusions drawn from the t-test above can also be drawn from this ANOVA table.
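If we want to pull these numbers out of R directly rather than reading them off the printed tables, the test statistics can be extracted from the summary and anova objects. The following is a minimal sketch (the row and column labels assume the simple-regression model LM fit above) that also verifies the F = t^2 relationship:

coefs = coef(summary(LM))   #matrix of estimates, standard errors, t values and p values
coefs["X1","t value"]       #t statistic for the slope
coefs["X1","Pr(>|t|)"]      #p value for the slope
aov.tab = anova(LM)
aov.tab["X","F value"]      #F statistic for the same hypothesis
coefs["X1","t value"]^2     #equals the F statistic, up to rounding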
Both the summary function and the anova function are just as useful for analyzing multiple regressions.

Multiple Regression:

In this data set there are many other variables besides Gender that can be used to help predict Salary. To analyze the data in terms of all of these variables we can generate a linear model that uses all of them as independent variables. Before we generate this model, however, there is one problem to consider. A linear model can use a two-level factor as an independent variable with no special handling, because the two levels reduce to 0 and 1; factors with more than two levels require some additional care. Fortunately, R has a very convenient way of doing all the dummy coding required to analyze these factors properly: the factor() function. This function declares that a variable is a factor and, in doing so, handles the required coding. As long as you are careful to specify variables as factors when necessary this will not be a problem. With the factor() function in mind we can generate a linear model for a multiple regression:

Y=Salary
X1=Gender
X2=factor(Rank)
X3=factor(Dept)
X4=factor(Years)
X5=Merit
LM=lm(Y~X1+X2+X3+X4+X5) #to specify multiple x's in the lm function we just add a + for each additional X
LM

The first and broadest question to ask in multiple regression is whether any of the variables have an effect on y. This is the overall F-test for all beta = 0:

Ho: all slopes, all beta = 0
H1: at least one beta for any variable =/= 0

To run this test in R we once again use the summary function:

summary(LM)

For multiple regression the summary function reports far more information than for simple regression. For this particular test we are only interested in the very last line, which reports an F statistic of 4.933 and a p-value of .002261; those values belong to the overall test. The p-value is less than .05, so we reject the null hypothesis that all beta = 0.

Having determined that at least one variable has an effect, we can now try to determine which specific variables are responsible. The summary is complex because many of the variables were specified as factors, so R reports a separate dummy-coded coefficient for each level of each factor beyond its first (reference) level. For each coefficient a partial t-test is run. This partial t-test is roughly equivalent to the t-test for beta = 0 seen in the simple regression, but in multiple regression it is marginal: each coefficient is tested given that all of the other variables are already in the model. The null hypothesis is the same as in the simple regression, namely that the coefficient in question equals 0. We can therefore look through the summary table and flag as significant the variables whose marginal tests produced a p-value less than .05. From our summary table it looks as though the most important variables are Rank and Dept, and within these variables the differences between treatment classes are also significant.
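To see exactly what the factor() coding produces, and to pick out the significant marginal tests without scanning the whole table by eye, a short sketch like the following can be used (the column names assume the multiple-regression model LM fit above):

head(model.matrix(LM))        #the dummy-coded design matrix that R actually fits
ptab = coef(summary(LM))      #estimates, standard errors, t values and p values
ptab[ptab[,"Pr(>|t|)"]<.05,]  #rows whose marginal test is significant at alpha = .05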
The data for a multiple regression can also be analyzed via the anova function:

anova(LM)

For multiple regression R reports a very different set of results than for simple regression. Here the anova function runs a serial test, in which variables are added one at a time and the linear models with and without each variable are compared. The serial test first compares the intercept-only model to Y~X1, then Y~X1 to Y~X1+X2, then Y~X1+X2 to Y~X1+X2+X3, and so on until the last variable is added. The anova function runs the serial test in the order in which the variables were specified when the linear model was created. The F and p-values reflect whether adding that variable changed the model in a significant way. From this anova we can determine that Years and Merit may be unimportant to the prediction of Salary.

There is also a marginal alternative to the serial test of the anova function: the Anova function. This function is not pre-loaded in R; it is necessary to install and load the car package before it can be used.

library(car)
Anova(LM)

The marginal test compares the full linear model we specified to reduced versions of the model. The first row, labeled X1, is the comparison of the full model to the same model with the variable X1 removed; every other row is the comparison for when that variable is removed and all of the others remain. From this table we reach conclusions similar to those from the anova table: Years and Merit appear not to be useful in the model.

Choosing an Optimal Model:

The anova function and the Anova function are examples of methods by which the full linear model is compared to reduced versions to determine whether some of its variables may be unnecessary. The optimal linear model is one in which only the necessary variables remain; it is the simplest model possible that still retains the most predictive power. For this example the goal is to generate an optimal model that predicts Salary without unnecessary variables. As outlined above, Years and Merit look as though they may be unnecessary.

There are also more quantitative ways to decide which model is more optimal. One of these is Akaike's Information Criterion (AIC), which balances how well a model fits against how many parameters it uses; by computing an AIC value for a full model and for a reduced model it is possible to say which of the two is more optimal. AIC values can be calculated long-hand in R, but R also has three very useful functions for generating them quickly. When interpreting AIC values, the best model is the one with the lowest (at times the most negative) AIC value. The AIC value can be calculated explicitly for any model via the extractAIC function:

extractAIC(LM)

We can compare this value to the AIC of any reduced model we wish to specify. For example, consider two reduced models:

RM1=lm(Y~X1+X2+X3+X5) #full model minus Years
extractAIC(RM1)
RM2=lm(Y~X1+X2+X3) #full model minus Years and Merit
extractAIC(RM2)

extractAIC reports two numbers, the equivalent degrees of freedom and the AIC; the AIC is the value on the right. From this comparison we can see that removing Years gives a lower AIC than the full model, but removing both Years and Merit together gives a higher AIC than the full model. In this case we might say that one of Years or Merit carries useful information, but we do not need both in the model.
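For reference, the long-hand calculation mentioned above can be done directly from a fitted model. The sketch below reproduces what extractAIC reports for a linear model (it drops an additive constant, which is why it differs from the separate AIC() function); it assumes the full model LM fit above.

n = length(residuals(LM))
rss = sum(residuals(LM)^2)  #residual sum of squares
k = length(coef(LM))        #number of estimated coefficients
n*log(rss/n) + 2*k          #should match the second value returned by extractAIC(LM)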
The drop1 function is a marginal test that generates an AIC value for the full model, then removes the variables one at a time, in the same manner as the Anova function, and generates an AIC value for the reduced model corresponding to each variable.

drop1(LM)

In this case we can see that the lowest AIC values come from removing Years and Merit, which agrees with the Anova function tests run earlier. There is also an automated version of the drop1 function, the step function:

step(LM)

The step function runs a drop1, chooses the reduced model with the lowest AIC value, and then runs a drop1 on that reduced model, treating it as a new full model. It continues doing this until the 'full' model for a given step has the lowest AIC value. In this example the step function removed X4 (Years) based on the first drop1; in the second step no further removal lowered the AIC, so the reduced model Y~X1+X2+X3+X5 is kept as the most optimal model. This agrees with the conclusions drawn from calculating the AIC values explicitly above.

Another method for comparing a full model to a reduced model, without using AIC values, is the GLM F-test run with the anova function. This test compares a full model to a reduced model under the following hypotheses:

Ho: the coefficients (betas) in the full model but NOT INCLUDED in the reduced model all equal 0
H1: at least one coefficient in the full model but not in the reduced model =/= 0

In simpler terms, the null hypothesis is that the variables excluded from the reduced model have no significant effect on the prediction of the dependent variable, as defined by their betas being 0. Failing to reject the null hypothesis is effectively accepting the reduced model as the more optimal model. To run this test we specify a full model and a reduced model and then call the anova function on both:

FullModel=LM
ReducedModel=lm(Y~X1+X2+X3+X5)
anova(ReducedModel,FullModel)

This test produced a p-value greater than .05, signifying that we cannot reject the null hypothesis: the reduced model is the more optimal model. (A compact script collecting this whole workflow appears after the conclusion.)

Conclusion:

Using several different tests to determine which model is more optimal, we were able to show that a reduced model using only the variables Gender, Rank, Dept, and Merit has effectively the same predictive power as the full model given at the start, which also included the variable Years. Multiple regression is a very powerful tool for taking a large data set and simplifying it into a more parsimonious model. It can also reveal significance in variables that do not appear significant on their own but do have an impact on the response variable in the context of a larger set of data. In the simple regression we could not demonstrate a significant impact of Gender alone on Salary; in the multiple regression, however, Gender consistently proved to be a variable the analysis needed to keep. Variables that appear unimportant can prove to have a large effect in the context of more data, so it is generally better to use multiple regression to determine an optimal model than to run many simple regressions and throw out variables one by one.
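For convenience, the whole workflow from this example can be collected into one short script. This is only a sketch; it assumes the data file ZarEX7.4.txt and the column names used throughout, and the numerical results it prints are those discussed above.

Data=read.table("ZarEX7.4.txt",header=T)
attach(Data)
Y=Salary
X1=Gender
X2=factor(Rank)
X3=factor(Dept)
X4=factor(Years)
X5=Merit
FullModel=lm(Y~X1+X2+X3+X4+X5)  #full multiple regression
step(FullModel)                 #AIC-based selection; drops X4 (Years) here
ReducedModel=lm(Y~X1+X2+X3+X5)  #the optimal model chosen above
anova(ReducedModel,FullModel)   #GLM F-test: a large p-value favors the reduced model
summary(ReducedModel)           #final optimal model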