Create and Improve Metrics with R

Joseph Scheidt

6/8/2019

Problem

Sometimes the available data offer a poor basis for comparison because the performance we care about is clouded by non-relevant factors.

Examples:

Solution

Linear regression

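The goal is to remove the variance caused by non-relevant factors, giving us a better way to judge performance. We fit a model that predicts the metric from those factors alone; the part the model cannot explain is what we attribute to performance.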

Example problem

Comparing Washington State elementary schools

Linear Regression Code

# Fit a linear model with lm()
# General form: lm(formula = my_metric ~ nonrelevant_factors, data = my_data)
model <- lm(formula = avg_test_score ~ white + asian + free_disc_lunch, 
            data = school_data)

# Check the coefficient p-values (R-squared doesn't need to be high)
summary(model)

Output


Call:
lm(formula = avg_test_score ~ white + asian + free_disc_lunch, 
    data = school_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.72201 -0.09447  0.00962  0.11337  0.51785 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.60201    0.03333  18.063   <2e-16 ***
white            0.28315    0.03292   8.602   <2e-16 ***
asian            0.59982    0.06869   8.733   <2e-16 ***
free_disc_lunch -0.61315    0.02997 -20.461   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1663 on 1058 degrees of freedom
Multiple R-squared:  0.623, Adjusted R-squared:  0.6219 
F-statistic: 582.7 on 3 and 1058 DF,  p-value: < 2.2e-16
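
The same checks can be done in code rather than by reading the printout. The following is a minimal sketch, not the presenter's code: because school_data is not reproduced here, it uses R's built-in mtcars data as a stand-in, with mpg playing the role of the metric and wt and hp the role of the non-relevant factors. The names demo_model, coef_table, p_values, and significant are illustrative.

```r
# Stand-in data: mtcars ships with base R (assumption for illustration only)
demo_model <- lm(mpg ~ wt + hp, data = mtcars)

# coef(summary(...)) returns the coefficient table as a numeric matrix
coef_table <- coef(summary(demo_model))

# Pull the p-value column and list the terms significant at the 0.05 level
p_values <- coef_table[, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]
print(significant)

# R-squared is stored on the summary object as well
r_squared <- summary(demo_model)$r.squared
print(r_squared)
```

Checking p_values this way makes it easy to confirm that every factor kept in the model has a measurable relationship with the metric before the predictions are used.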

Add predictions and new metric to data frame

# Add the predicted score as a column of the data frame
school_data$expected_score <- predict(model, school_data, type = "response")

# Create the new metric: actual score minus predicted score
# (positive values mean the school scored better than expected)
school_data$new_metric <- school_data$avg_test_score - school_data$expected_score
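
A metric built as actual minus predicted is the residual of the linear model. The sketch below illustrates that on stand-in data (R's built-in mtcars, since school_data is not available here); demo_data, demo_model, and the column names are assumptions for illustration. It also checks that the new metric is uncorrelated with a factor that was included in the model, which is the sense in which that factor's influence has been removed.

```r
# Stand-in data: mpg is the metric, wt and hp are the non-relevant factors
demo_data <- mtcars
demo_model <- lm(mpg ~ wt + hp, data = demo_data)

# Same two steps as above: predicted value, then actual minus predicted
demo_data$expected <- predict(demo_model, demo_data)
demo_data$new_metric <- demo_data$mpg - demo_data$expected

# The new metric matches the model residuals
same_as_residuals <- all.equal(unname(demo_data$new_metric),
                               unname(residuals(demo_model)))
print(same_as_residuals)

# With an intercept in the model, residuals are uncorrelated with each predictor
cor_with_wt <- cor(demo_data$new_metric, demo_data$wt)
print(cor_with_wt)
```

One practical difference: with R's default handling of missing values, residuals(model) drops rows that contain NA in any model variable, while the predict() approach keeps every row of the data frame and yields NA for those rows.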

Results

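library(dplyr)  # provides top_n(), select(), and %>%
# Ten schools with the highest new_metric (top_n() keeps the original row order)
top_n(school_data, 10, new_metric) %>%
    select(school, district, avg_test_score, new_metric)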
# A tibble: 10 x 4
   school               district                  avg_test_score new_metric
   <chr>                <chr>                              <dbl>      <dbl>
 1 Nooksack Elementary  Nooksack Valley School D…          0.863      0.415
 2 Hamilton Elementary  Port Angeles School Dist…          0.846      0.371
 3 Evergreen Elementary Bethel School District             0.839      0.518
 4 Moxee Elementary     East Valley School Distr…          0.741      0.400
 5 Chester H Thompson … Bethel School District             0.734      0.511
 6 Gildo Rey Elementar… Auburn School District             0.729      0.510
 7 Garfield Elementary… Everett School District            0.722      0.411
 8 Paterson Elementary… Paterson School District           0.706      0.515
 9 Stanley              Tacoma Public Schools              0.65       0.423
10 Union Gap School     Union Gap School District          0.48       0.390

Results

District Level Performance
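
One way to get from school-level scores to a district-level view is to average the new metric within each district. The sketch below is an assumption about how that could be done, not the method used for this slide. The toy_schools data frame and its values are made up for illustration; only the district and new_metric column names come from the code above.

```r
library(dplyr)

# Hypothetical toy data standing in for school_data (values are invented)
toy_schools <- tibble(
  school     = c("A", "B", "C", "D", "E"),
  district   = c("North", "North", "South", "South", "South"),
  new_metric = c(0.10, 0.30, -0.20, 0.00, 0.05)
)

# Average the new metric per district and rank districts by that average
district_summary <- toy_schools %>%
  group_by(district) %>%
  summarise(
    n_schools  = n(),
    avg_metric = mean(new_metric)
  ) %>%
  arrange(desc(avg_metric))

print(district_summary)
```

Keeping n_schools next to the average helps when reading the ranking, since a district with only one or two schools can land at either extreme by chance.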