College Admissions vs Graduation Rate

College Admissions vs Graduation Rate

Data Cleaning


The data cleaning process focuses on reducing the dimensionality of the raw data by creating a smaller dataset that is used for statistical analysis. First, a column “Class Year” is created. As mentioned previously, the raw data is a series of CSV files, one for each school year. The school year in each CSV file name is placed in this “Class Year” column which allows for concatenation of several years of data. Having several years of data in one dataset allows for the analysis of metrics for specific institutions over a period of time, which is important for future statistical analysis.

Next, dimensionality is reduced for specific variables of interest. For example, one of these groups of variables provides seven unique identifiers of secondary institutions: Unit Id, OPE Id, 6 digit OPE Id, Institution Name, City, State, and Zip Code. These seven identifiers are reduced to just three (Institution Name, City, and State), allowing for enough combinations to determine uniqueness without adding superfluous factors in the smaller dataset. Similar to this, the number of pre-entrance exam metrics is reduced to only include those most relevant to this analysis: average SAT scores and average cumulative ACT scores. In the raw dataset, there are many instances similar to these, where not all of the metrics included are relevant for the type of analysis performed herein. Therefore, the dimensionality is reduced to include only those metrics important to the research.

Of the schools included in the original dataset, this analysis looks only at institutions that award predominantly bachelor's degrees. The raw data containing this information is encoded as a numeric variable ranging from 1 to 4 with the following associations: 1 is a predominantly certificate awarding institution, 2 is a predominantly associates degree awarding institution, 3 is a predominantly bachelor's degree awarding institution, and 4 is a predominantly master's degree-awarding institution. To get the data such that only predominantly bachelor's degree institutions are included, each of these numeric variables is encoded to the corresponding string of the name of the type of degree. Then, the data is subset to only include those institutions that awarded predominantly bachelor's degrees.

Next, the variable “completion rate” is obtained from the raw data. To best represent the model of student success, the completion rate metric is mapped to correspond to four years following the school year used for pre-entrance exam metrics. This allows for proper analysis of students as they go through their four-year program, accounting for the period in which they are in school. These completion rates are then discretized into three categories: low, medium, and high. The “low” category corresponds to completion rates less than 40%, the “medium” category” corresponds to those between 40% and 60%, and the “high” category corresponds to those greater than 60%. These bins are used in the statistical analysis for this research.

Similar discretization is then performed for two of the other variables of interest. First, a “school size” metric is created in which university undergraduate enrollment is discretized into five bins: very small, small, medium, large and very large. Schools with less than 1,000 students are defined as “very small”, schools with a student count between 1,000 and 1,800 as “small”, schools with a student count between 1,800 and 3,000 as “medium”, schools with a student count between 3,000 and 9,000 as “large”, and schools with greater than 9,000 students as “very large”. Second, discretization is used to create a university “region” metric. In R, there is a “state.region” function that provides the region for each state in the United States. Therefore, universities are binned corresponding to the following regions: North East, North Central, South, and West. With all of the data appropriately binned, the final step in the data cleaning process is the removal of all rows containing null values.

Following all of these data cleaning steps, the final dataset is significantly smaller than the raw data and contains institution-specific data for the following metrics: university name, university city, university state, admission rate, average SAT score, average cumulative ACT score, undergraduate enrollment, school size, average cost, in-state tuition, out-of-state tuition, completion rate, completion rate bin, and school region.