College Admissions vs Graduation Rate

Regression


Methodology

Linear regression is a common and relatively simple method for understanding how a particular variable (the response variable) responds to changes in other variables (the predictor variables). In addition to describing this response, linear regression helps identify the statistical significance of the relationship between variables, as well as the strength of their correlation.

Several questions can be answered with this method. First, is there a relationship between each of the predictor variables and the response variable? If so, how strong is this relationship? Beyond showing whether a relationship exists, a linear regression model also estimates the size and direction of each predictor variable's effect on the response variable.

By modeling the data using linear regression, an equation is formulated where the response variable equals a linear combination of all predictor variables. For example, given a response variable (Y) and n predictor variables (X1, X2, ... Xn), the linear regression model produces a formula similar to the following:




Y = B0 + B1X1 + B2X2 + ... + BnXn




where the coefficients B1, B2, ... Bn represent the change in Y given a unit change in each respective variable (assuming that all other variables stay constant), and B0 represents the intercept, which is the value of Y when every predictor variable equals zero.

This is a multiple linear regression equation, as there are multiple predictor variables considered. While the formula differs somewhat depending on the types of variables considered as predictor variables (quantitative or qualitative), this general equation exhibits the typical structure of a linear regression model.
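As a rough illustration of how such coefficients can be estimated, the sketch below fits B0, B1, ... Bn by ordinary least squares using the normal equations. It is a minimal standard-library example on made-up data, not the code used for this analysis; the function names `solve` and `fit_linear` are hypothetical.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Move the largest remaining entry of this column to the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_linear(rows, y):
    """Return least-squares coefficients [B0, B1, ..., Bn] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # the leading 1 carries the intercept B0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up data generated exactly from Y = 1 + 2*X1 - 3*X2 (no noise),
# so the fit should recover coefficients close to 1, 2 and -3.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (3, 5)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
coefficients = fit_linear(rows, y)
print([round(b, 6) for b in coefficients])
```

Statistical packages typically use more numerically stable decompositions than the normal equations; this form is shown only because it maps directly onto the formula above.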

Regression is not limited to linear forms. Polynomial, logistic, exponential, and other model forms produce different regression equations. For this analysis, the data are examined in the Results section to determine whether linear regression is an appropriate model.

Results

There are many factors that impact university student success. A linear regression model is constructed to determine which of these factors are most significant in impacting completion rate. For this linear regression, the variables admission rate, average SAT scores, average ACT scores, undergraduate enrollment, average cost per academic year, out-of-state tuition, and region are included as predictor variables, with the response variable being completion rate. As discussed above, only data from the 2014-2015 year will be considered for all analyses. The results of the model are shown in Tables 1 and 3:


Table 1: Linear Regression Results, Part 1
Term                            Estimate     Std. Error   t value   Pr(>|t|)   SC
Intercept                       -4.541e-01   5.534e-02    -8.206    8.17e-16   ***
Admission.Rate                  -2.073e-05   2.347e-04    -0.088    0.929647
SAT.Average                      7.328e-04   1.740e-04     4.212    2.80e-05   ***
ACT.Cumulative.Average           3.373e-03   6.676e-03     0.505    0.613530
Undergraduate.Enrollment         3.478e-06   6.466e-07     5.378    9.65e-08   ***
Average.Cost.per.Academic.Year   9.766e-07   7.536e-07     1.296    0.195320
Out.of.State.Tuition.and.Fees    3.210e-06   1.085e-06     2.958    0.003175   **
Region.Northwest                 4.278e-02   1.150e-02     3.721    0.000211   ***
Region.South                    -2.022e-02   1.074e-02    -1.883    0.060004   .
Region.West                      1.136e-02   1.304e-02     0.871    0.383791


Table 2: Significance Codes
Significance Code   p-value
***                 [0, 0.001]
**                  (0.001, 0.01]
*                   (0.01, 0.05]
.                   (0.05, 0.1]
(none)              (0.1, 1]
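The mapping in Table 2 can be written as a small lookup. The sketch below is illustrative only; the function name `significance_code` is hypothetical, and the p-values in the example are copied from Table 1.

```python
def significance_code(p_value):
    """Return the significance code for a p-value, using the intervals in Table 2."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p_value <= 0.001:
        return "***"
    if p_value <= 0.01:
        return "**"
    if p_value <= 0.05:
        return "*"
    if p_value <= 0.1:
        return "."
    return ""  # (0.1, 1]: no code is shown


# p-values copied from Table 1
examples = [
    ("SAT.Average", 2.80e-05),
    ("Out.of.State.Tuition.and.Fees", 0.003175),
    ("Region.South", 0.060004),
    ("Admission.Rate", 0.929647),
]
for name, p in examples:
    print(f"{name}: {significance_code(p)!r}")
```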


Table 3: Linear Regression Results, Part 2
Residual Standard Error   0.1161 on 874 degrees of freedom
Multiple R-squared        0.6378
Adjusted R-squared        0.6341
F-statistic               171 on 9 and 874 degrees of freedom
p-value                   < 2.2e-16
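The summary statistics in Table 3 are related by standard formulas, so they can be cross-checked against each other. The sketch below assumes an ordinary least-squares fit with an intercept, in which case the degrees of freedom imply the number of observations used; the variable names are illustrative.

```python
r_squared = 0.6378   # Multiple R-squared from Table 3
predictors = 9       # numerator degrees of freedom of the F-statistic
residual_df = 874    # residual degrees of freedom

# Observations implied by the degrees of freedom (one extra for the intercept).
n = residual_df + predictors + 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adjusted = 1 - (1 - r_squared) * (n - 1) / residual_df

# F-statistic: explained variation per predictor over unexplained variation per residual df.
f_stat = (r_squared / predictors) / ((1 - r_squared) / residual_df)

print(n, round(adjusted, 4), round(f_stat))
```

Under these assumptions the computed values agree with the reported adjusted R-squared of 0.6341 and F-statistic of 171, up to rounding of the inputs.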

As seen in Tables 1 and 3, the model as a whole is highly significant (p-value < 2.2e-16), and the most significant variables are the average SAT score, undergraduate enrollment, out-of-state tuition and fees, and one of the region indicators (Region.Northwest). Although out-of-state tuition is shown to be more significant than the average cost per academic year, this analysis chooses to analyze average cost instead, as it is a more encompassing metric for the general student experience rather than just the experience of out-of-state students.

The results of the linear regression also provide insight into how the completion rate changes with variations in each of the variables. A complete formula is created using the coefficients shown in the Estimate column of the linear regression output. This formula is shown as follows, with coefficients rounded to two significant figures for readability:

Completion.Rate = -0.45 - 0.000021 * Admission.Rate + 0.00073 * Average.SAT + 0.0034 * ACT.Average + 0.0000035 * Undergrad.Enrollment + 0.00000098 * Average.Cost + 0.0000032 * Out.Of.State.Tuition + 0.043 * Region.NE - 0.020 * Region.S + 0.011 * Region.W

Where:
Admission.Rate, Average.SAT, ACT.Average, Undergrad.Enrollment, Average.Cost, and Out.Of.State.Tuition = the quantitative value for these variables
Region.NE = {1 if state is in NE region, 0 if not}
Region.S = {1 if state is in S region, 0 if not}
Region.W = {1 if state is in W region, 0 if not}
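To make the role of the region indicator variables concrete, the following sketch evaluates the fitted equation for a single school. It uses the unrounded estimates from Table 1; the function name, the dictionary keys, and the example school's values (including their units) are hypothetical and chosen only for illustration.

```python
# Estimates copied from Table 1 (unrounded).
INTERCEPT = -4.541e-01
SLOPES = {
    "admission_rate": -2.073e-05,
    "sat_average": 7.328e-04,
    "act_average": 3.373e-03,
    "undergrad_enrollment": 3.478e-06,
    "average_cost": 9.766e-07,
    "out_of_state_tuition": 3.210e-06,
}
# Indicator (dummy) variable estimates. "NE" uses the row that Table 1 labels
# Region.Northwest, whose estimate matches the Region.NE term in the equation.
REGION_EFFECTS = {"NE": 4.278e-02, "S": -2.022e-02, "W": 1.136e-02}


def predict_completion_rate(values, region=None):
    """Evaluate the fitted equation; region=None means the baseline region."""
    if region is not None and region not in REGION_EFFECTS:
        raise ValueError(f"unknown region: {region!r}")
    total = INTERCEPT + sum(SLOPES[name] * values[name] for name in SLOPES)
    # At most one indicator equals 1; for the baseline region all are 0.
    return total + REGION_EFFECTS.get(region, 0.0)


# Hypothetical school, for illustration only (not a row from the data set).
school = {
    "admission_rate": 60,
    "sat_average": 1100,
    "act_average": 24,
    "undergrad_enrollment": 10000,
    "average_cost": 25000,
    "out_of_state_tuition": 20000,
}
baseline = predict_completion_rate(school)
northeast = predict_completion_rate(school, region="NE")
print(round(baseline, 4), round(northeast - baseline, 5))
```

Switching the region changes the prediction by exactly that region's coefficient, which is how the indicator terms in the equation should be read: as shifts relative to the baseline region.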

This equation is used to understand how much the completion rate is predicted to change given a change in each of the variables. For example, according to the model, an increase in the average SAT score of 1 point produces a predicted increase of 0.00073 in completion rate when holding all other variables constant. The scale can be adjusted for better contextual understanding: for every 100-point increase in average SAT score, the predicted completion rate is 0.073 higher, or 7.3 percentage points (given all other variables stay the same). Similar interpretations can be made for all of the variables' coefficients, yielding insight into each variable's impact on completion rate.
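The same rescaling can be applied to the other quantitative coefficients. The sketch below uses the rounded coefficients from the equation; the step sizes are arbitrary choices for readability, not values from the analysis, and the conversion to percentage points assumes completion rate is expressed as a proportion between 0 and 1.

```python
# (rounded coefficient, illustrative step size) for selected predictors
steps = {
    "Average.SAT": (0.00073, 100),             # per 100 SAT points
    "Undergrad.Enrollment": (0.0000035, 1000), # per 1,000 students
    "Average.Cost": (0.00000098, 10000),       # per 10,000 in cost
}


def effect_in_percentage_points(coefficient, step):
    """Predicted change in completion rate, in percentage points, for one step."""
    return coefficient * step * 100


effects = {
    name: effect_in_percentage_points(coef, step)
    for name, (coef, step) in steps.items()
}
for name, points in effects.items():
    print(f"{name}: {points:.2f} percentage points per {steps[name][1]} units")
```

Each figure holds only with all other variables fixed, as in the SAT example above.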

Validation of Linear Regression Model

To validate linear regression as an appropriate methodology for this analysis, each of the quantitative variables found to be statistically significant is plotted against the completion rate. Linear, quadratic, and cubic regression curves are fit to each plot, allowing a visual comparison of how well each curve describes the data.
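The visual comparison can be complemented numerically by fitting each polynomial degree and comparing R-squared. The sketch below does this on synthetic data with a roughly linear trend; it is not the study's data or code, and the helper names are hypothetical. Because the models are nested, R-squared cannot decrease as the degree rises, so what matters is whether the gain from the higher-degree terms is large enough to justify them.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def polyfit_r2(x, y, degree):
    """R-squared of a least-squares polynomial fit of the given degree."""
    design = [[xi ** p for p in range(degree + 1)] for xi in x]
    k = degree + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * d for c, d in zip(coef, row)) for row in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Synthetic predictor (rescaled to a small range to keep the powers well conditioned)
# and a response that is linear plus a small deterministic wiggle.
x = [i / 10 for i in range(20)]
wiggle = [0.01 * ((7 * i) % 5 - 2) for i in range(20)]
y = [0.2 + 0.3 * xi + w for xi, w in zip(x, wiggle)]

r2 = {degree: polyfit_r2(x, y, degree) for degree in (1, 2, 3)}
for degree, value in r2.items():
    print(f"degree {degree}: R-squared = {value:.4f}")
```

When the quadratic and cubic fits add almost nothing over the linear fit, as in this synthetic case, the simpler linear form is the reasonable choice.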

First, the average SAT scores are shown versus the completion rate in the plot below.

[Figure: average SAT score versus completion rate, with linear, quadratic, and cubic regression curves]

The above figure shows that as the average SAT score increases, the completion rate rises in a fairly linear fashion, supporting the use of linear regression for this variable.

Next, the average cost is plotted against completion rate. As noted above, although the out-of-state tuition is found to be more statistically significant than the total cost of attendance, the total cost is used in this study as a better representation of the student population as a whole. The regression validation for average cost is seen in the figure below.

[Figure: average cost versus completion rate, with linear, quadratic, and cubic regression curves]

While the above figure does not clearly indicate that linear regression is the most appropriate choice, it also does not show the polynomial fits performing significantly better. Therefore, for simplicity, it is sufficient to accept linear regression as appropriate.

Lastly, the figure below plots the undergraduate enrollment versus the completion rate, with the linear and polynomial regression lines included.

[Figure: undergraduate enrollment versus completion rate, with linear, quadratic, and cubic regression curves]

As seen in the above figure, there is no clear visual evidence of a correlation between enrollment and completion rate. However, the linear regression line appears to fit the trend best.

Considering the plots shown in the above figures, there does not appear to be an obvious polynomial relationship between the variables considered and completion rate. Therefore, a linear model without polynomial terms is sufficient.