Sam Pastoriza

College Admissions vs Graduation Rate

Hypothesis Testing

Methodology

Hypothesis testing is a statistical method for drawing conclusions about a population of interest from sample data. Conclusions are drawn at a chosen significance level, denoted alpha (α). A hypothesis test establishes two hypotheses: a null hypothesis, which is typically thought of as the default case, and an alternative hypothesis. The goal of hypothesis testing is to determine whether there is significant evidence to reject the null hypothesis. The metric for this evidence is the p-value: if the p-value is less than α, there is statistical evidence to reject the null hypothesis; if the p-value is greater than α, the null hypothesis cannot be rejected.

From this result, a conclusion is made regarding the population parameter of interest. These conclusions can be made with different levels of confidence depending on the results of the hypothesis test.
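The decision rule described above can be written as a short Python sketch. The function name `decide` and its default `alpha` of 0.05 are illustrative choices, not part of the study's code.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when the p-value is below alpha."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"

# A very small p-value leads to rejection at the 5% level; a large one does not.
small_p_decision = decide(2.079e-11)
large_p_decision = decide(0.20)
```

Lowering `alpha` (for example to 0.01) makes the test more conservative, since a smaller p-value is then required to reject the null hypothesis.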

For this study, two different types of hypothesis testing are utilized to understand different types of population parameters: the two-sample t-test, and the Chi-square test for independence.

Two-Sample T-Test

A two-sample t-test is used to compare two population means. This allows for a greater understanding of the difference or similarity between the two populations. This hypothesis test considers two hypotheses as illustrated below, where H₀ represents the null hypothesis, H_a represents the alternative hypothesis, μ₁ represents the mean of population 1, and μ₂ represents the mean of population 2:

H₀: μ₁ = μ₂ (i.e. the means are the same)

H_a: μ₁ ≠ μ₂ (i.e. the means are not the same)
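As a rough illustration of what such a test computes, the sketch below calculates the test statistic and approximate degrees of freedom using only the Python standard library. It assumes the unequal-variance (Welch) form of the two-sample t-test, which this page does not confirm was the variant used, and the sample values are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_1, sample_2):
    """Return (t statistic, approximate degrees of freedom) for Welch's two-sample t-test."""
    n1, n2 = len(sample_1), len(sample_2)
    # Variance of each sample mean: sample variance divided by sample size.
    v1 = variance(sample_1) / n1
    v2 = variance(sample_2) / n2
    t = (mean(sample_1) - mean(sample_2)) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical average SAT scores, invented purely for illustration.
sat_low = [950, 980, 1010, 990, 1020]
sat_medium = [1080, 1100, 1060, 1120, 1090]
t_stat, dof = welch_t(sat_low, sat_medium)
```

The p-value is then the two-sided tail probability of a t distribution with `dof` degrees of freedom at `t_stat`, which statistical packages report directly. A large absolute `t_stat` corresponds to a small p-value.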

Two-sample t-tests are used for this study to understand how different predictor variables vary for three different populations: low-completion rate group, medium-completion rate group, and high-completion rate group. This will result in four separate t-tests: two considering average SAT scores and completion rates, and two considering average costs and completion rates.

Average SAT Scores

First, the mean average SAT scores are compared for the low-completion rate group and the medium-completion rate group. Then, a complementary test is used to compare the mean average SAT scores for the medium-completion rate group and the high-completion rate group. The null and alternative hypotheses for these two tests are as follows.

Comparing Mean Average SAT Scores for: Low- vs Medium-Completion Rate Schools

H₀: μ_(SAT.low) = μ_(SAT.medium) → The mean average SAT score of low-completion rate schools is the same as that of medium-completion rate schools.

H_a: μ_(SAT.low) ≠ μ_(SAT.medium) → The mean average SAT scores of low- and medium-completion rate schools are different.

Comparing Mean Average SAT Scores for: Medium- vs High-Completion Rate Schools

H₀: μ_(SAT.medium) = μ_(SAT.high) → The mean average SAT score of medium-completion rate schools is the same as that of high-completion rate schools.

H_a: μ_(SAT.medium) ≠ μ_(SAT.high) → The mean average SAT scores of medium- and high-completion rate schools are different.

Average Cost of Attendance

Second, the mean average cost of attendance is compared for the low-completion rate group and the medium-completion rate group. Then, a complementary test is used to compare the mean average cost of attendance for the medium-completion rate group and the high-completion rate group. The null and alternative hypotheses for these two tests are as follows.

Comparing Mean Average Cost of Attendance for: Low- vs Medium-Completion Rate Schools

H₀: μ_(cost.low) = μ_(cost.medium) → The mean average cost of attendance of low-completion rate schools is the same as that of medium-completion rate schools.

H_a: μ_(cost.low) ≠ μ_(cost.medium) → The mean average costs of attendance of low- and medium-completion rate schools are different.

Comparing Mean Average Cost of Attendance for: Medium- vs High-Completion Rate Schools

H₀: μ_(cost.medium) = μ_(cost.high) → The mean average cost of attendance of medium-completion rate schools is the same as that of high-completion rate schools.

H_a: μ_(cost.medium) ≠ μ_(cost.high) → The mean average costs of attendance of medium- and high-completion rate schools are different.

The results of all four of these tests along with the conclusions drawn from these results are discussed in depth below in the Results section.

Chi-Square Test for Independence

A Chi-square test for independence is used to analyze categorical variables. The test uses a contingency table, or two-way table, of two categorical variables to determine whether the variables are independent or dependent. One of the variables can be treated as defining different populations, so that the test reveals whether the levels of the other variable are distributed in the same manner within each 'population' of the first variable. The hypotheses for this test are described as follows, where H₀ represents the null hypothesis, H_a represents the alternative hypothesis, and the variable considered for two or more populations is called B and contains several 'levels' B₁,...,B_n:

H₀: the levels of B are distributed the same way in all populations / the variables are independent

H_a: the levels of B are not distributed the same way in all populations / the variables are dependent
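To make the mechanics concrete, the following standard-library Python sketch computes the Chi-square statistic from a contingency table by comparing each observed count with the count expected under independence. The table values are invented, and the p-value helper relies on a closed form that holds only when the degrees of freedom are even; it is an illustration, not the study's code.

```python
from math import exp

def chi_square_independence(observed):
    """Return (statistic, degrees of freedom) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return statistic, df

def chi_square_p_value_even_df(statistic, df):
    """Upper-tail Chi-square probability; this closed form is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = statistic / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return exp(-half) * total

# Hypothetical counts: rows are four populations, columns are Low/Medium/High levels.
observed = [
    [30, 40, 30],
    [20, 40, 40],
    [45, 35, 20],
    [25, 45, 30],
]
stat, df = chi_square_independence(observed)
p_value = chi_square_p_value_even_df(stat, df)
```

With these invented counts the p-value falls below 0.05, so the null hypothesis of independence would be rejected at the 5% level. For odd degrees of freedom, a statistical package should be used to obtain the p-value instead.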

Chi-square tests for independence are used for this study to understand the independence or dependence of two categorical variables (school size and region) and completion rate.

Undergraduate Enrollment (School Size)

First, school size, as defined by the undergraduate enrollment, is compared with completion rate. Prior to this analysis, the school sizes are discretized into five bins: Very Small, Small, Medium, Large, and Very Large. This discretization is an important step, as it makes the results easier to interpret. For example, it is more meaningful to conclude that small schools have lower completion rates than to conclude that schools with fewer than 1,800 undergraduate students do. The null and alternative hypotheses for this test are as follows.

H₀: School size and completion rates are independent.

H_a: Completion rates vary by school size.
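The discretization step can be sketched in Python as follows. The cut points in `SIZE_EDGES` are hypothetical, since the study's actual bin boundaries are not listed in this section; only the five labels come from the text.

```python
from bisect import bisect_right

# Hypothetical enrollment cut points; each value is the lower bound of the next bin.
SIZE_EDGES = [1000, 5000, 10000, 20000]
SIZE_LABELS = ["Very Small", "Small", "Medium", "Large", "Very Large"]

def size_bin(enrollment):
    """Map an undergraduate enrollment count to one of the five size labels."""
    # bisect_right counts how many cut points are less than or equal to the enrollment.
    return SIZE_LABELS[bisect_right(SIZE_EDGES, enrollment)]

# Invented enrollments for illustration.
bins = [size_bin(n) for n in (800, 1000, 7500, 35000)]
```

Once every school carries a size label, the labels can be cross-tabulated against the completion rate groups to form the contingency table used by the test.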

Geographic Region

Second, the school's geographic region (North Central, Northeast, South, or West) is compared with completion rate. The null and alternative hypotheses for this test are as follows.

H₀: Region and completion rates are independent.

H_a: Completion rates vary by region.

The results of both of these tests along with the conclusions drawn from these results are discussed in depth in the Results section below.

Results

Hypothesis testing is used to analyze the relationships of each of the four main factors (average SAT score, undergraduate enrollment/size of the school, the average cost of attendance, and school region) with completion rate, where completion rate is binned into three categories: Low, Medium, and High.
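The binning of completion rate, and the two-way table it feeds, can be sketched in Python. The cut points of 0.4 and 0.6 and the school records below are invented for illustration; the study's actual thresholds are not stated in this section.

```python
from collections import Counter

def completion_bin(rate, low_cutoff=0.4, high_cutoff=0.6):
    """Assign a completion rate on a 0-1 scale to Low, Medium, or High (hypothetical cut points)."""
    if rate < low_cutoff:
        return "Low"
    if rate < high_cutoff:
        return "Medium"
    return "High"

# Invented (region, completion rate) records for illustration only.
schools = [
    ("South", 0.35), ("South", 0.55), ("Northeast", 0.72),
    ("Northeast", 0.65), ("West", 0.45), ("North Central", 0.38),
]

# Count schools in each (region, completion group) cell of the two-way table.
table = Counter((region, completion_bin(rate)) for region, rate in schools)
```

The same grouping also defines the three populations compared by the t-tests, since each school falls into exactly one completion rate group.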

Two-Sample T-Tests

As discussed in the Methodology section, two-sample t-tests are used to understand the difference in means of two variables (average SAT scores and cost of attendance) across the three different populations (low-, medium-, and high-completion rate schools).

Average SAT Scores

The first variable considered for hypothesis testing is average SAT score. To examine this factor's relationship with completion rate, two two-sample t-tests are constructed: 1) comparing average SAT scores of low-completion rate schools and medium-completion rate schools, and 2) comparing average SAT scores of medium-completion rate schools and high-completion rate schools. For each test, the null hypothesis states that the mean SAT scores are the same for the two groups being compared, and the alternative hypothesis states that they differ.

Based on the figure above, the null hypothesis is expected to be rejected for both tests: the average SAT score of low-completion rate schools (shown in red) appears lower than that of medium-completion rate schools (shown in green), which in turn appears lower than that of high-completion rate schools (shown in blue). As expected, both t-tests produce very small p-values (2.079e-11 and < 2.2e-16, respectively). At the 5% significance level, the null hypotheses for both tests are therefore rejected, indicating that average SAT scores differ between low- and medium-completion rate schools, and between medium- and high-completion rate schools.

Average Cost of Attendance

The second variable considered for hypothesis testing is average cost of attendance. To examine this factor's relationship with completion rate, two two-sample t-tests are constructed: 1) comparing the average cost of attendance of low-completion rate schools and medium-completion rate schools, and 2) comparing the average cost of attendance of medium-completion rate schools and high-completion rate schools. For each test, the null hypothesis states that the mean cost of attendance is the same for the two groups being compared, and the alternative hypothesis states that they differ.

Average Cost vs Completion Rate (Boxplot)

Based on the figure above, the null hypothesis is expected to be rejected for both tests: the average cost of attendance of low-completion rate schools (shown in red) appears lower than that of medium-completion rate schools (shown in green), which in turn appears lower than that of high-completion rate schools (shown in blue). As expected, both t-tests produce very small p-values (both < 2.2e-16). At the 5% significance level, the null hypotheses for both tests are therefore rejected, indicating that the average cost of attendance differs between low- and medium-completion rate schools, and between medium- and high-completion rate schools.

Chi-Square Test for Independence

As discussed in the Methodology section, Chi-square tests for independence are used to determine whether each categorical variable is independent of completion rate. Two tests are performed to study the relationship between completion rate and 1) undergraduate enrollment, and 2) geographic region.

Undergraduate Enrollment (School Size)

The first variable used in Chi-square testing is school size, which is discretized into five bins as discussed in the section on data cleaning. This test determines whether size and completion rate are independent of or dependent on each other. For this test, the null hypothesis states that size and completion rate are independent of each other.

Based on the figure above, the null hypothesis is expected to be rejected, as the size of the school seems to be related to the completion rate. For example, according to the figure, a low percentage of smaller schools have high completion rates, while a higher percentage of large schools do. The Chi-square test for independence confirms these observations, producing a very small p-value (7.211e-14). Therefore, at the 5% significance level, the null hypothesis is rejected, indicating that school size and completion rate are dependent on each other.

Geographic Region

Next, the association between region and completion rate is analyzed. Region is also discretized into four bins: North Central, Northeast, South, and West. Following this discretization, a Chi-square test for independence is constructed to determine whether region and completion rate are independent of or dependent on one another. For this test, the null hypothesis states that region and completion rate are independent of each other.

Based on the figure above, the null hypothesis is expected to be rejected, as completion rates appear to differ by region. For example, the figure illustrates that a larger percentage of schools in the South have low completion rates, while a larger percentage of schools in the Northeast have high completion rates. The Chi-square test for independence confirms these observations, producing a very small p-value (7.211e-14). Therefore, at the 5% significance level, the null hypothesis is rejected, indicating that region and completion rate are dependent on each other.