Exploratory Data Analysis
To better understand the cleaned dataset, several variables of interest are visualized. Although the full dataset spans 1996 to 2020, the visualizations focus only on trends over the most recent six years (2014 to 2020).
Completion Rate
First, the response variable (completion rate) is analyzed. Figure 1 shows the distribution of completion rates for the different schools for each school year from 2014 to 2020.
The plots in Figure 1 indicate that completion rates are approximately normally distributed and that the average completion rate increases over time, particularly from 2018 to 2020.
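The year-over-year trend described above can be checked numerically by averaging completion rates within each school year. The following is a minimal sketch using only the Python standard library; the record layout (a school year label paired with a completion rate) and the sample values are assumptions made for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean


def mean_completion_by_year(records):
    """Return {school_year: mean completion rate}, ordered by school year.

    `records` is an iterable of (school_year, completion_rate) pairs.
    This layout is hypothetical; adapt it to the cleaned dataset's columns.
    """
    by_year = defaultdict(list)
    for school_year, completion_rate in records:
        # Skip schools that did not report a completion rate for that year.
        if completion_rate is not None:
            by_year[school_year].append(completion_rate)
    return {year: mean(rates) for year, rates in sorted(by_year.items())}


# Made-up illustrative values, not real data.
sample = [
    ("2019-2020", 0.50), ("2019-2020", None), ("2019-2020", 0.70),
    ("2014-2015", 0.40), ("2014-2015", 0.60),
]
yearly_means = mean_completion_by_year(sample)
for year, rate in yearly_means.items():
    print(f"{year}: {rate:.3f}")
```

Comparing the resulting means across years gives a quick numeric counterpart to the visual impression from the plots. It does not test normality.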
Average Cost
Another important variable in this dataset is the average cost of attendance. This variable includes tuition, fees, and other costs that a student must pay to attend the institution. The figure below illustrates the average cost over a period of six years.
Many people believe that school costs have risen quickly in recent years. The figure above offers some evidence for that belief: the average cost rose by almost $5,000 over the six-year period shown.
Undergraduate Enrollment (Size of School)
Another variable reviewed is undergraduate enrollment. School size is an important factor when considering different colleges. The figure below provides insight into how school sizes have evolved over time.
Per the figure above, enrollments have not changed significantly over time. Most schools are small, with only a handful enrolling more than 20,000 students. In this research, the total undergraduate enrollment of each school is discretized prior to statistical analysis. Discretization accounts for outliers, which take the form of very large universities. As discussed above, the discretization categories for school size are: Very Large, Large, Medium, Small, and Very Small. The figure below shows the results of this discretization for the 2014-2015 school year.
The figure above indicates that the Small, Medium, Large, and Very Large categories contain roughly equal numbers of schools, while the Very Small category contains a larger number.
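The discretization of enrollment into the five size categories can be sketched as follows. The category names come from the text. The cut points in SIZE_EDGES are hypothetical placeholders, since the actual thresholds are defined elsewhere in this research and should be substituted here; the sample enrollments are also made up.

```python
from bisect import bisect_right
from collections import Counter

# HYPOTHETICAL cut points for illustration only; replace them with the
# thresholds defined earlier in the paper.
SIZE_EDGES = [500, 2000, 5000, 10000]
SIZE_LABELS = ["Very Small", "Small", "Medium", "Large", "Very Large"]


def size_category(enrollment):
    """Map an undergraduate enrollment count to a size category.

    With the placeholder edges above: below 500 is Very Small,
    500 to 1,999 is Small, 2,000 to 4,999 is Medium,
    5,000 to 9,999 is Large, and 10,000 or more is Very Large.
    """
    return SIZE_LABELS[bisect_right(SIZE_EDGES, enrollment)]


# Made-up enrollments, including one very large outlier.
sample_enrollments = [120, 450, 500, 1800, 4200, 9999, 10000, 35000]
counts = Counter(size_category(n) for n in sample_enrollments)
for label in SIZE_LABELS:
    print(f"{label}: {counts[label]}")
```

Because the outlier of 35,000 students falls into the same bin as a school of 10,000, its extreme value no longer dominates the analysis, which is the motivation for discretizing described above.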
Region
Region is another important factor when analyzing completion rates. The figure below shows the number of colleges in each region. Because colleges do not change location from year to year, only one graph is shown.
The figure above shows that the distribution of colleges by region is relatively even.
Average SAT Scores
Finally, average SAT scores are analyzed. This variable is the average total SAT score of all students entering the university in a given school year. The figure below shows the average SAT scores and how they change over the six-year period.
The figure above shows a marked increase in SAT scores from the 2016-2017 school year to the 2017-2018 school year. However, the SAT's scoring criteria and scale were restructured in March 2016, so scores reported before and after that change are not directly comparable. Because this research seeks to understand the effect of variables such as SAT scores on completion rates, only entrance years prior to 2016 can be considered for analysis. Students typically graduate four years after submitting entrance scores, so students graduating in 2018 correspond to entrance exam scores from 2014. For this reason, the statistical analysis uses only data from the 2014-2015 entrance year and the corresponding 2017-2018 completion rates.
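The pairing of entrance-year data with later completion rates can be sketched as a join on school and year. In this minimal example, each school year is identified by its starting calendar year (2014 for 2014-2015, 2017 for 2017-2018), so the lag between the two is three. The field names (school_id, year, avg_sat, completion_rate) and all sample values are assumptions for illustration, not the dataset's actual schema.

```python
def pair_entry_with_completion(entry_rows, completion_rows,
                               entry_year=2014, lag_years=3):
    """Join entrance-year predictors to completion rates `lag_years` later.

    Years are the starting calendar year of a school year, so the defaults
    pair the 2014-2015 entrance year with 2017-2018 completion rates.
    All field names are hypothetical.
    """
    completion_year = entry_year + lag_years
    # Index the usable completion rates for the target year by school.
    completion = {
        row["school_id"]: row["completion_rate"]
        for row in completion_rows
        if row["year"] == completion_year and row["completion_rate"] is not None
    }
    paired = []
    for row in entry_rows:
        # Keep only entrance-year rows that have a matching completion rate.
        if row["year"] != entry_year or row["school_id"] not in completion:
            continue
        paired.append({
            "school_id": row["school_id"],
            "entry_year": entry_year,
            "avg_sat": row["avg_sat"],
            "completion_year": completion_year,
            "completion_rate": completion[row["school_id"]],
        })
    return paired


# Made-up rows for two schools.
entry_rows = [
    {"school_id": "A", "year": 2014, "avg_sat": 1100},
    {"school_id": "B", "year": 2014, "avg_sat": 1250},
    {"school_id": "A", "year": 2016, "avg_sat": 1180},
]
completion_rows = [
    {"school_id": "A", "year": 2017, "completion_rate": 0.58},
    {"school_id": "B", "year": 2017, "completion_rate": None},
    {"school_id": "A", "year": 2014, "completion_rate": 0.52},
]
paired = pair_entry_with_completion(entry_rows, completion_rows)
print(paired)
```

Schools without a reported completion rate in the target year are dropped here. That is one possible design choice rather than the handling used in this research.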