Master of Urban Spatial Analytics (MUSA)
Stuart Weitzman School of Design
University of Pennsylvania

To: Tax Credit Program Manager
Department of Housing and Community Development (HCD)
Emil City

From: Jun, Youngsang
Student of Master of Urban Spatial Analytics (MUSA)
Stuart Weitzman School of Design
University of Pennsylvania

Date of Memo: November 1, 2024

SUBJECT: Targeting a Housing Subsidy

1. Background

Emil City has conducted marketing campaigns targeting homeowners who qualify for a home repair tax credit program. However, because conversion rates have been low and outreach to eligible homeowners has been effectively random, a more proactive approach is required. To improve the efficiency of the program, this research uses the client-level data from previous campaigns to build a model that can better target HCD's limited outreach resources. This memo presents the trained classifier and a cost-benefit analysis based on its classifications.

2. Data Interpretation and Visualization

The given dataset contains the results of a survey with nine continuous variables (age, campaign, pdays, previous, unemploy_rate, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs), ten categorical variables (job, marital, education, taxLien, mortgage, taxbill_in_phl, contact, month, day_of_week, poutcome), and an indicator of whether the individual entered the home repair tax credit program. In this analysis, a logistic regression was conducted with entering status as the dependent variable and the 19 variables listed above as independent variables.

  1. Continuous Variables

To determine whether entering status varies with continuous variables, mean values were compared across entering statuses. The results show significant differences between the entered and not-entered groups for age, campaign, inflation_rate, cons.conf.idx, previous, unemploy_rate, and pdays, as shown in the figure.

The following plots show the mean of each of the nine continuous features, grouped by entered vs. not entered. On average, the likelihood of entering the home repair tax credit program increases with older age (age), fewer contacts during this campaign (campaign), a lower U.S. inflation rate (inflation_rate), a lower consumer confidence index at the time of the campaign (cons.conf.idx), more contacts before this campaign (previous), a lower unemployment rate at the time of the campaign (unemploy_rate), and fewer days since the last contact from a previous program (pdays, for clients who were previously contacted).

housingSubsidy %>%
  dplyr::select(y,age, campaign, previous, unemploy_rate, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs) %>%
  gather(Variable, value, -y) %>%
    ggplot(aes(factor(y, levels = c("no", "yes")), value, fill = y)) + 
      geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
      facet_wrap(~Variable, scales = "free", ncol=4) +
      scale_fill_manual(values = palette2) +
      labs(x="Entered", y="Value", 
           title = "Feature associations with the likelihood of entering the home repair tax credit program",
           subtitle = "(continous features)") +
      theme(legend.position = "bottom")

housingSubsidy %>%
  dplyr::select(y,pdays) %>%
  subset(pdays < 999) %>%
  gather(Variable, value, -y) %>%
    ggplot(aes(factor(y, levels = c("no", "yes")), value, fill = y)) +
      geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
      facet_wrap(~Variable, scales = "free", ncol=4) +
      scale_fill_manual(values = palette2) +
      labs(x="Entered", y="Value", 
           title = "Feature associations with the likelihood of entering the home repair tax credit program",
           subtitle = "(continous features)") +
      theme(legend.position = "bottom")

The following plots show the distributions of the nine continuous features, grouped by entered vs. not entered. For age, the entered group was dominant under age 50, while the not-entered group was dominant over 60. For the unemployment rate, the entered group was dominant above -0.5 and the not-entered group below -0.5. For previous, the entered group was dominant below 1, whereas the not-entered group was dominant above 2. For the inflation rate, the not-entered group was dominant below 4.2, while the entered group was dominant above 4.2.

housingSubsidy %>%
    dplyr::select(y,age, campaign, previous, unemploy_rate, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs) %>%
    gather(Variable, value, -y) %>%
    ggplot() + 
    geom_density(aes(value, color=y), fill = "transparent") + 
    facet_wrap(~Variable, scales = "free", ncol=4) +
    scale_colour_manual(values = palette2) +
    labs(title = "Feature distributions entered vs. not entered",
         subtitle = "(continuous outcomes)")+
    theme(legend.position = "bottom")

housingSubsidy %>%
    dplyr::select(y,pdays)%>%
  subset(pdays < 999) %>%
    gather(Variable, value, -y) %>%
    ggplot() + 
    geom_density(aes(value, color=y), fill = "transparent") + 
    facet_wrap(~Variable, scales = "free") +
    scale_colour_manual(values = palette2) +
    labs(title = "Feature distributions entered vs. not entered",
         subtitle = "(continuous outcomes)")+
    theme(legend.position = "bottom")

  2. Categorical Variables

The plots below illustrate whether differences across the ten categorical features are associated with the likelihood that homeowners entered the credit program. The entered group outnumbers the not-entered group when poutcome is success and for contacts made in December and March, but further insight is limited until the regression analysis is conducted.

housingSubsidy %>%
    dplyr::select(y, job, marital, education) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free", ncol=3) +
        scale_fill_manual(values = palette2) +
        labs(x="Entered", y="Value",
             title = "Feature associations with the likelihood of entering the home repair tax credit program",
             subtitle = "Categorical features") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),
              legend.position = "bottom")

housingSubsidy %>%
    dplyr::select(y, taxLien, mortgage, taxbill_in_phl) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free", ncol=3) +
        scale_fill_manual(values = palette2) +
        labs(x="Entered", y="Value",
             title = "Feature associations with the likelihood of entering the home repair tax credit program",
             subtitle = "Categorical features") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),
              legend.position = "bottom")

housingSubsidy %>%
    dplyr::select(y, contact, month, day_of_week, poutcome) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free", ncol=4) +
        scale_fill_manual(values = palette2) +
        labs(x="Entered", y="Value",
             title = "Feature associations with the likelihood of entering the home repair tax credit program",
             subtitle = "Categorical features") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1),
              legend.position = "bottom")

3. Interpretation of the Original Regression Model

  1. Logistic Regression Model Creation

A logistic regression model predicts a binary outcome (1 or 0, entered or not entered) by modeling the log-odds of the outcome as a linear function of the predictors, log(p/(1-p)) = β0 + β1x1 + ... + βkxk, so each coefficient describes the change in the log-odds (and hence the probability) of entering associated with a change in that independent variable. The data is partitioned into a 65/35 training and test set (p = 0.65). These sets are named housingSubsidyTrain and housingSubsidyTest.

set.seed(3456)
trainIndex <- createDataPartition(housingSubsidy$y, p = .65, 
                                  list = FALSE,
                                  times = 1)
housingSubsidyTrain <- housingSubsidy[ trainIndex,]
housingSubsidyTest  <- housingSubsidy[-trainIndex,]

Then a kitchen-sink regression (the original model), which includes all of the independent variables, is fitted. However, X, which is simply the row index, is excluded, as is taxLien, which has only a single "yes" observation and therefore cannot be estimated reliably across the training and test sets. The results show that contacttelephone, monthdec, monthmar, campaign, and poutcomenonexistent are statistically significant. The AIC of this training model is 1589.

housingSubsidyModel <- glm(y_numeric ~ .,
                  data=housingSubsidyTrain %>% 
                    dplyr::select(-X, -y, -taxLien),
                  family="binomial" (link="logit"))

summary(housingSubsidyModel)
## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = housingSubsidyTrain %>% dplyr::select(-X, -y, -taxLien))
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                  -64.456982 131.586619   -0.49    0.624  
## age                            0.004040   0.008501    0.48    0.635  
## jobblue-collar                -0.201978   0.269534   -0.75    0.454  
## jobentrepreneur               -0.439881   0.457493   -0.96    0.336  
## jobhousemaid                  -0.608137   0.553335   -1.10    0.272  
## jobmanagement                 -0.295884   0.305810   -0.97    0.333  
## jobretired                    -0.077069   0.370302   -0.21    0.835  
## jobself-employed              -0.720168   0.458890   -1.57    0.117  
## jobservices                   -0.378531   0.299237   -1.26    0.206  
## jobstudent                    -0.582800   0.474902   -1.23    0.220  
## jobtechnician                 -0.081523   0.233211   -0.35    0.727  
## jobunemployed                 -0.003613   0.428222   -0.01    0.993  
## jobunknown                    -0.344911   0.699671   -0.49    0.622  
## maritalmarried                 0.002554   0.239055    0.01    0.991  
## maritalsingle                 -0.027233   0.274389   -0.10    0.921  
## maritalunknown               -11.795497 329.214840   -0.04    0.971  
## educationbasic.6y              0.367033   0.399940    0.92    0.359  
## educationbasic.9y             -0.075444   0.338510   -0.22    0.824  
## educationhigh.school           0.133463   0.322580    0.41    0.679  
## educationilliterate          -13.185612 882.743557   -0.01    0.988  
## educationprofessional.course   0.179134   0.350260    0.51    0.609  
## educationuniversity.degree     0.287589   0.323026    0.89    0.373  
## educationunknown               0.320349   0.412583    0.78    0.437  
## mortgageunknown               -0.110573   0.567267   -0.19    0.845  
## mortgageyes                   -0.208863   0.142825   -1.46    0.144  
## taxbill_in_phlyes              0.181712   0.200887    0.90    0.366  
## contacttelephone              -0.641718   0.285012   -2.25    0.024 *
## monthaug                       0.011917   0.443599    0.03    0.979  
## monthdec                       1.818817   0.779553    2.33    0.020 *
## monthjul                       0.098704   0.373905    0.26    0.792  
## monthjun                       0.060152   0.471982    0.13    0.899  
## monthmar                       1.391007   0.569001    2.44    0.014 *
## monthmay                      -0.410096   0.315927   -1.30    0.194  
## monthnov                      -0.615462   0.424909   -1.45    0.147  
## monthoct                       0.070484   0.547844    0.13    0.898  
## monthsep                      -0.926398   0.682643   -1.36    0.175  
## day_of_weekmon                -0.317249   0.221360   -1.43    0.152  
## day_of_weekthu                -0.140555   0.215032   -0.65    0.513  
## day_of_weektue                -0.157101   0.216687   -0.73    0.468  
## day_of_weekwed                -0.205492   0.231250   -0.89    0.374  
## campaign                      -0.108050   0.043424   -2.49    0.013 *
## pdays                         -0.000735   0.000751   -0.98    0.328  
## previous                       0.242960   0.213056    1.14    0.254  
## poutcomenonexistent            0.742538   0.350073    2.12    0.034 *
## poutcomesuccess                1.109515   0.747728    1.48    0.138  
## unemploy_rate                 -0.939155   0.486603   -1.93    0.054 .
## cons.price.idx                 0.907770   0.861427    1.05    0.292  
## cons.conf.idx                  0.022924   0.029270    0.78    0.434  
## inflation_rate                 0.499742   0.447979    1.12    0.265  
## spent_on_repairs              -0.004471   0.010731   -0.42    0.677  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1853.7  on 2678  degrees of freedom
## Residual deviance: 1489.0  on 2629  degrees of freedom
## AIC: 1589
## 
## Number of Fisher Scoring iterations: 13
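
For interpretation, each coefficient can be converted to an odds ratio by exponentiating it. A minimal sketch using the summary output above (monthdec is chosen only as an illustration):

# exp(1.8188) ≈ 6.2: a contact in December multiplies the odds of entering
# by roughly six relative to the reference month, holding other predictors constant
exp(coef(housingSubsidyModel)["monthdec"])
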
  2. Pseudo-R² Values

The McFadden R-squared of the model is 0.197; values closer to 1 generally indicate a better fit.

pseudo_r2 <- pR2(housingSubsidyModel)
## fitting null model for pseudo-r2
pseudo_r2_df <- as.data.frame(t(pseudo_r2))
kable(pseudo_r2_df, "html", caption = "Pseudo-R² Values for `housingSubsidyModel`") %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
Pseudo-R² Values for housingSubsidyModel

   llh   llhNull    G2   McFadden   r2ML   r2CU
  -744      -927   365      0.197  0.127  0.255
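
As a quick check (a minimal sketch, not part of the original pipeline), McFadden's pseudo-R² can be recomputed directly from the log-likelihoods in the table above:

# McFadden pseudo-R² = 1 - llh / llhNull, using the values reported above
1 - (-744 / -927)   # ≈ 0.197, matching the pR2() output
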
  3. Predictions

The following is the distribution of predicted probabilities by observed outcome and a Confusion Matrix. In the original prediction model, the sensitivity is 0.2420, and the specificity is 0.9805.

testProbs <- data.frame(Outcome = as.factor(housingSubsidyTest$y_numeric),
                        Probs = predict(housingSubsidyModel, housingSubsidyTest, type= "response"))

ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Entered", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

testProbs <- 
  testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs$Probs > 0.5 , 1, 0)))

caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1258  119
##          1   25   38
##                                              
##                Accuracy : 0.9                
##                  95% CI : (0.883, 0.915)     
##     No Information Rate : 0.891              
##     P-Value [Acc > NIR] : 0.145              
##                                              
##                   Kappa : 0.302              
##                                              
##  Mcnemar's Test P-Value : 0.00000000000000919
##                                              
##             Sensitivity : 0.2420             
##             Specificity : 0.9805             
##          Pos Pred Value : 0.6032             
##          Neg Pred Value : 0.9136             
##              Prevalence : 0.1090             
##          Detection Rate : 0.0264             
##    Detection Prevalence : 0.0437             
##       Balanced Accuracy : 0.6113             
##                                              
##        'Positive' Class : 1                  
## 
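
As a sanity check (a sketch using the cell counts from the confusion matrix above, not part of the original workflow), the reported rates follow directly from the counts:

38 / (38 + 119)      # sensitivity = TP / (TP + FN) ≈ 0.2420
1258 / (1258 + 25)   # specificity = TN / (TN + FP) ≈ 0.9805
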
  4. Cross Validation

The cross-validation (CV) goodness of fit metrics of the original model are as follows. The Sensitivity is 0.196 and the Specificity is 0.988; these are the results after the reference level of y was changed from "no" to "yes". The cross-validated Area Under the Receiver Operating Characteristic (ROC) Curve, or AUC, of the original model is 0.771.

housingSubsidy$y <- fct_relevel(housingSubsidy$y, "yes")

ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFit <- train(relevel(y, ref = "yes") ~ .,
                  data=housingSubsidy %>% 
                    dplyr::select(-X, -job, -age, -marital, -taxLien, -education, -month, -day_of_week, -pdays, -previous, -inflation_rate, -y_numeric), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

cvFit
## Generalized Linear Model 
## 
## 4119 samples
##    9 predictor
##    2 classes: 'yes', 'no' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 4077, 4078, 4077, 4077, 4077, 4078, ... 
## Resampling results:
## 
##   ROC    Sens   Spec 
##   0.771  0.196  0.988
dplyr::select(cvFit$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines")

4. Improved Regression Model Suggestion

  1. Variable Transformations to Engineer New Features that Significantly Increase the Sensitivity

To improve the model’s performance, the following transformations were applied to the variables, guided by feature importance and correlation:

  1. From categorical to continuous: job and education are replaced by the entry rate (percent entered) within each category (jobEnterAvg, educationEnterAvg).
  2. From continuous to categorical: age, pdays, inflation_rate, and previous are binned into age_cat, pdays_cat, inflation_cat, and previous_cat.
  3. From categorical to categorical (re-binning): month, day_of_week, and marital are collapsed into month_con, day_con, and marital_cat.
housingSubsidy2 <- 
  housingSubsidy %>% 
  group_by(job) %>% 
  summarize(totEnter = sum(y_numeric), 
            n = n(), 
            jobEnterAvg = 100*(totEnter/n)) %>%
  dplyr::select(-n, -totEnter) %>%
  right_join(housingSubsidy, .) %>%
  mutate(age_cat = case_when(age <= 50  ~ "50 and below",
                              age <= 60 & age >50 ~ "50-60",
                             TRUE  ~ "others")) %>%
  mutate(pdays_cat = case_when(pdays == 999 ~ "client not previously contacted",
                               TRUE ~ "contacted")) %>%
  mutate(inflation_cat = case_when(inflation_rate <= 4.2  ~ "inflation below 4.2",
                                   inflation_rate >4.2 ~ "inflation above 4.2")) %>%
  mutate(previous_cat = case_when(previous <= 1  ~ "previous call 1 or 0",
                                  previous > 1  ~ "previous call 2 and above")) %>%
  mutate(month_con = case_when(month == "mar" ~ "March or December",
                               month == "apr" ~ "4-11",
                               month == "may" ~ "4-11",
                               month == "jun" ~ "4-11",
                               month == "jul" ~ "4-11",
                               month == "aug" ~ "4-11",
                               month == "sep" ~ "4-11",
                               month == "oct" ~ "4-11",
                               month == "nov" ~ "4-11",
                               month == "dec" ~ "March or December")) %>%
  mutate(day_con = case_when(day_of_week == "mon" ~ "Monfri",
                             day_of_week == "tue" ~ "TWR",
                             day_of_week == "wed" ~ "TWR",
                             day_of_week == "thu" ~ "TWR",
                             day_of_week == "fri" ~ "Monfri")) %>%
  mutate(marital_cat = case_when(marital == "married"  ~ "married",
                                TRUE ~ "others")) 

housingSubsidy2 <- 
  housingSubsidy2 %>% 
 group_by(education) %>% 
  summarize(totEnter = sum(y_numeric), 
            n = n(), 
            educationEnterAvg = 100*(totEnter/n)) %>%
    dplyr::select(-n, -totEnter) %>%
  right_join(housingSubsidy2, .)

The data is partitioned into a 65/35 training and test set (p = 0.65). These sets are named housingSubsidy2Train and housingSubsidy2Test.

set.seed(3456)
trainIndex2 <- createDataPartition(housingSubsidy2$y, p = .65, 
                                  list = FALSE,
                                  times = 1)
housingSubsidy2Train <- housingSubsidy2[ trainIndex2,]
housingSubsidy2Test  <- housingSubsidy2[-trainIndex2,]

The improved model is then fitted with y_numeric as the dependent variable. X and taxLien are excluded for the same reasons as above, and the original variables that were replaced by engineered features (job, age, marital, education, month, day_of_week, pdays, previous, inflation_rate) are dropped in favor of their transformed versions. The results show that contacttelephone, poutcome, unemploy_rate, cons.price.idx, jobEnterAvg, and month_con are statistically significant (with mortgage, cons.conf.idx, and previous_cat marginally so). The AIC of this training model is 1521, slightly better than that of the original housingSubsidyModel (1589).

housingSubsidyModel2 <- glm(y_numeric ~ .,
                  data=housingSubsidy2Train %>% 
                    dplyr::select(-X, -job, -age, -marital, -taxLien, -education, -month, -day_of_week, -pdays, -previous, -inflation_rate, -y),
                  family="binomial" (link="logit"))

summary(housingSubsidyModel2)
## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = housingSubsidy2Train %>% dplyr::select(-X, -job, -age, 
##         -marital, -taxLien, -education, -month, -day_of_week, 
##         -pdays, -previous, -inflation_rate, -y))
## 
## Coefficients:
##                                         Estimate Std. Error z value   Pr(>|z|)
## (Intercept)                           -159.58455   46.82405   -3.41    0.00065
## mortgageunknown                          0.20630    0.50516    0.41    0.68299
## mortgageyes                             -0.27119    0.14253   -1.90    0.05709
## taxbill_in_phlyes                        0.18255    0.20370    0.90    0.37016
## contacttelephone                        -1.03652    0.22378   -4.63 0.00000363
## campaign                                -0.05674    0.03891   -1.46    0.14474
## poutcomenonexistent                      0.51401    0.23187    2.22    0.02663
## poutcomesuccess                          1.82859    0.69435    2.63    0.00845
## unemploy_rate                           -1.03046    0.21015   -4.90 0.00000094
## cons.price.idx                           1.45237    0.34125    4.26 0.00002081
## cons.conf.idx                            0.03297    0.01861    1.77    0.07649
## spent_on_repairs                         0.00428    0.00328    1.31    0.19165
## jobEnterAvg                              0.03462    0.01664    2.08    0.03743
## age_cat50 and below                     -0.01146    0.19491   -0.06    0.95309
## age_catothers                            0.25462    0.36603    0.70    0.48667
## pdays_catcontacted                       0.01176    0.68418    0.02    0.98629
## inflation_catinflation below 4.2        -0.58449    0.48633   -1.20    0.22942
## previous_catprevious call 2 and above    0.62995    0.34554    1.82    0.06829
## month_conMarch or December               1.77383    0.35356    5.02 0.00000052
## day_conTWR                               0.05807    0.14402    0.40    0.68678
## marital_catothers                       -0.07767    0.14908   -0.52    0.60239
## educationEnterAvg                        0.02596    0.03270    0.79    0.42729
##                                          
## (Intercept)                           ***
## mortgageunknown                          
## mortgageyes                           .  
## taxbill_in_phlyes                        
## contacttelephone                      ***
## campaign                                 
## poutcomenonexistent                   *  
## poutcomesuccess                       ** 
## unemploy_rate                         ***
## cons.price.idx                        ***
## cons.conf.idx                         .  
## spent_on_repairs                         
## jobEnterAvg                           *  
## age_cat50 and below                      
## age_catothers                            
## pdays_catcontacted                       
## inflation_catinflation below 4.2         
## previous_catprevious call 2 and above .  
## month_conMarch or December            ***
## day_conTWR                               
## marital_catothers                        
## educationEnterAvg                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1853.7  on 2678  degrees of freedom
## Residual deviance: 1477.1  on 2657  degrees of freedom
## AIC: 1521
## 
## Number of Fisher Scoring iterations: 5

Additionally, the correlation across the numeric variables is examined to check for redundant information. It shows that unemploy_rate is strongly correlated with inflation_rate, cons.price.idx, and spent_on_repairs, while pdays is strongly correlated with previous. Statistically insignificant independent variables were also removed; however, variables whose inclusion lowered the AIC or raised sensitivity were retained in the model.

numericVars <- 
  select_if(housingSubsidy2, is.numeric) %>% na.omit()

ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  insig = "blank") +  
    labs(title = "Correlation across numeric variables") 

The final improved model is as follows. Its AIC is 1526, an improvement over the original model (1589).

housingSubsidyModel2 <- glm(y_numeric ~ .,
                  data = housingSubsidy2Train %>% 
                    dplyr::select(-X, -y, -age, -age_cat, -job, -jobEnterAvg, -marital, -marital_cat,
                                  -education, -taxLien, -month, -cons.conf.idx, -poutcome, -day_of_week,
                                  -pdays, -previous, -spent_on_repairs, -inflation_rate, -previous_cat),
                  # retained predictors: mortgage, taxbill_in_phl, contact, campaign, unemploy_rate,
                  # cons.price.idx, pdays_cat, inflation_cat, month_con, day_con, educationEnterAvg
                  family = "binomial" (link = "logit"))

summary(housingSubsidyModel2)
## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = housingSubsidy2Train %>% dplyr::select(-X, -y, -age, 
##         -age_cat, -job, -jobEnterAvg, -marital, -marital_cat, 
##         -education, -taxLien, -month, -cons.conf.idx, -poutcome, 
##         -day_of_week, -pdays, -previous, -spent_on_repairs, -inflation_rate, 
##         -previous_cat))
## 
## Coefficients:
##                                   Estimate Std. Error z value          Pr(>|z|)
## (Intercept)                      -107.0439    15.8575   -6.75 0.000000000014746
## mortgageunknown                     0.0831     0.4999    0.17           0.86796
## mortgageyes                        -0.2718     0.1408   -1.93           0.05351
## taxbill_in_phlyes                   0.1610     0.1998    0.81           0.42045
## contacttelephone                   -0.9078     0.2092   -4.34 0.000014328702302
## campaign                           -0.0617     0.0391   -1.58           0.11480
## unemploy_rate                      -0.9902     0.1323   -7.49 0.000000000000071
## cons.price.idx                      1.1231     0.1704    6.59 0.000000000044094
## pdays_catcontacted                  1.5124     0.2517    6.01 0.000000001876734
## inflation_catinflation below 4.2   -1.4260     0.4183   -3.41           0.00065
## month_conMarch or December          1.6508     0.3383    4.88 0.000001064307914
## day_conTWR                          0.0572     0.1417    0.40           0.68651
## educationEnterAvg                   0.0446     0.0309    1.44           0.14883
##                                     
## (Intercept)                      ***
## mortgageunknown                     
## mortgageyes                      .  
## taxbill_in_phlyes                   
## contacttelephone                 ***
## campaign                            
## unemploy_rate                    ***
## cons.price.idx                   ***
## pdays_catcontacted               ***
## inflation_catinflation below 4.2 ***
## month_conMarch or December       ***
## day_conTWR                          
## educationEnterAvg                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1853.7  on 2678  degrees of freedom
## Residual deviance: 1499.6  on 2666  degrees of freedom
## AIC: 1526
## 
## Number of Fisher Scoring iterations: 5
  2. Pseudo-R² Values

The new model’s McFadden R-squared is 0.191, slightly lower than the original model’s 0.197.

pseudo_r2 <- pR2(housingSubsidyModel)
## fitting null model for pseudo-r2
pseudo_r22 <- pR2(housingSubsidyModel2)
## fitting null model for pseudo-r2
pseudo_r2_df <- as.data.frame(t(pseudo_r2))
pseudo_r22_df <- as.data.frame(t(pseudo_r22))
pseudo_r2_df <- cbind(Model = "Original Model", pseudo_r2_df)
pseudo_r22_df <- cbind(Model = "Improved Model", pseudo_r22_df)
rbind(pseudo_r2_df, pseudo_r22_df) %>%
  kable( "html", caption = "Pseudo-R² Values for `housingSubsidyModel`") %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
Pseudo-R² Values by Model

  Model             llh   llhNull    G2   McFadden    r2ML    r2CU
  Original Model   -744      -927   365      0.197   0.127   0.255
  Improved Model   -750      -927   354      0.191   0.124   0.248
  3. Predictions

A dataframe of predictions for the 1,440 observations in the test set, called testProbs2, is created. The following shows the distribution of predicted probabilities by observed outcome and the confusion matrix. In the improved model, the sensitivity is 0.2166 and the specificity is 0.9875. In the distribution plot, the "not entered" hump is lower than in the original model.

testProbs2 <- data.frame(Outcome = as.factor(housingSubsidy2Test$y_numeric),
                        Probs = predict(housingSubsidyModel2, housingSubsidy2Test, type= "response"))

ggplot(testProbs2, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Entered", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

testProbs2 <- 
  testProbs2 %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs2$Probs > 0.5 , 1, 0)))

caret::confusionMatrix(testProbs2$predOutcome, testProbs2$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1267  123
##          1   16   34
##                                              
##                Accuracy : 0.903              
##                  95% CI : (0.887, 0.918)     
##     No Information Rate : 0.891              
##     P-Value [Acc > NIR] : 0.0676             
##                                              
##                   Kappa : 0.291              
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.2166             
##             Specificity : 0.9875             
##          Pos Pred Value : 0.6800             
##          Neg Pred Value : 0.9115             
##              Prevalence : 0.1090             
##          Detection Rate : 0.0236             
##    Detection Prevalence : 0.0347             
##       Balanced Accuracy : 0.6020             
##                                              
##        'Positive' Class : 1                  
## 
  4. ROC Curve (Comparison with the Original Model)

The ROC curve is a graphical representation of the trade-off between the true positive rate and the false positive rate across a series of thresholds. The area under the curve (AUC) summarizes performance across all thresholds in a single number; it can be read as the probability that a randomly chosen entered homeowner receives a higher predicted probability than a randomly chosen not-entered homeowner. Both curves lie above the y = x line, the region where predictions of positives and negatives are better than random. The ROC curve of the improved model is more "square" than that of the original, and its AUC of 0.810 is greater than the original model's 0.802.

auc_testProbs <- auc(testProbs$Outcome, testProbs$Probs)
auc_testProbs2 <- auc(testProbs2$Outcome, testProbs2$Probs)

auc_df <- data.frame(
  Model = c("Original Model", "Improved Model"),
  AUC = c(auc_testProbs, auc_testProbs2)
)

kable(auc_df, "html", caption = "AUC Values by Models") %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
AUC Values by Models

  Model            AUC
  Original Model   0.802
  Improved Model   0.810

ggplot() +
  geom_roc(data= testProbs, aes(d = as.numeric(Outcome), m = Probs, colour = "Original Model"), n.cuts = 50, labels = FALSE) +
  geom_roc(data= testProbs2, aes(d = as.numeric(Outcome), m = Probs, colour = "Improved Model"), n.cuts = 50, labels = FALSE) +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  scale_colour_manual(values = c("Improved Model" = "#FE9900", "Original Model" = "#335566")) +
  labs(title = "ROC Curve - enterModel", colour = "Models") +
  theme(legend.position = "bottom")

  5. Cross Validation

The CV goodness of fit metrics of the improved model are as follows. The Sensitivity is 0.209 and the Specificity is 0.988. The cross-validated AUC of the improved model is 0.786.

housingSubsidy2$y <- fct_relevel(housingSubsidy2$y, "yes")

ctrl2 <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFit2 <- train(relevel(y, ref = "yes") ~ .,
                  data=housingSubsidy2 %>% 
                    dplyr::select(-X, -y_numeric,   -age,-age_cat,-job, -jobEnterAvg, -marital,-marital_cat,  -education, -taxLien, -month, -cons.conf.idx, -poutcome,  -day_of_week, -pdays, -previous,-spent_on_repairs,      -inflation_rate, -previous_cat ), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl2)

cvFit2
## Generalized Linear Model 
## 
## 4119 samples
##   11 predictor
##    2 classes: 'yes', 'no' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 4078, 4078, 4078, 4078, 4078, 4078, ... 
## Resampling results:
## 
##   ROC    Sens   Spec 
##   0.786  0.209  0.988
dplyr::select(cvFit2$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit2$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines")

5. Cost-Benefit Calculation Using the Improved Regression Model

  1. Cost/Benefit Equation for Each Confusion Matrix Outcome

This section estimates the revenues associated with using the improved model under the following scenario. The cost/benefit assumptions for each confusion matrix outcome are as follows.

(1) True Positive: We correctly predicted that a homeowner would enter the credit program and allocated marketing resources, and 25% of these homeowners ultimately take the credit. The marketing cost of $2,850 per person applies to everyone contacted. Because only 25% are expected to take the $5,000 credit, both the $5,000 credit cost and the $66,000 ($10,000 + $56,000) benefit are multiplied by 0.25 (see the worked arithmetic after this list).
   - Cost: (-$2,850 - $5,000 * 0.25) * count
   - Benefit: ($10,000 + $56,000) * 0.25 * count
(2) True Negative: We correctly predicted that a homeowner would not enter the credit program. Since no marketing resources and no credit were allocated, the cost is $0. The benefit is also $0 because no one in this group received the credit, so the $10,000 + $56,000 benefit is not generated.
   - Cost: $0
   - Benefit: $0
(3) False Positive: We incorrectly predicted that a homeowner would enter the credit program and allocated marketing resources, but no credit was allocated. The marketing cost of $2,850 per person applies, the credit cost is $0, and the benefit is $0 because no one in this group received the credit.
   - Cost: -$2,850 * count
   - Benefit: $0
(4) False Negative: We predicted that a homeowner would not enter the credit program, but they did. These are likely homeowners who signed up for reasons unrelated to the marketing campaign, so this category is "zeroed out", assuming its cost and benefit are both $0.
   - Cost: $0
   - Benefit: $0
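As a quick check of the figures used in the table below (a sketch using the memo's own numbers), the net revenue per true positive works out as follows:

# Net revenue per true positive: the marketing cost applies to everyone contacted,
# while the credit cost and the benefit apply only to the 25% who take the credit
per_tp_revenue <- -2850 - 5000 * 0.25 + 66000 * 0.25   # = 12,400
per_tp_revenue * 34                                    # = 421,600, matching True_Positive revenue at the 0.50 threshold
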
cost_benefit_table <-
   testProbs2 %>%
      count(predOutcome, Outcome) %>%
      summarize(
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),True_Negative = sum(n[predOutcome==0 & Outcome==0]),False_Positive = sum(n[predOutcome==1 & Outcome==0]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1])
                ) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               ifelse(Variable == "True_Negative", (Count * 0) ,
               ifelse(Variable == "True_Positive",((-2850-5000*0.25+66000*0.25) * Count),
               ifelse(Variable == "False_Positive", (-2850) * Count, 0)))) %>%
    bind_cols(data.frame(Description = c(
              "We correctly predicted enter",
              "We correctly predicted no enter",
              "We predicted enter and customer did not enter",
              "We predicted no enter and customer entered")))

kable(cost_benefit_table,
       caption = "Cost/Benefit Table (threshold=0.50)") %>% kable_styling()
Cost/Benefit Table (threshold=0.50)

  Variable         Count   Revenue   Description
  True_Positive       34    421600   We correctly predicted enter
  True_Negative     1267         0   We correctly predicted no enter
  False_Positive      16    -45600   We predicted enter and customer did not enter
  False_Negative     123         0   We predicted no enter and customer entered
  2. Plotting the Confusion Matrix Outcomes for Each Threshold to Optimize the Threshold

The last step in tuning the model is to run it for each threshold value. In the previous step, the threshold was set to 0.50; however, this may not yield the maximum revenue. By calculating the total credit count and revenue at each threshold, the threshold that maximizes revenue can be determined.

iterateThresholds <- function(data) {
  x = .01
  all_prediction <- data.frame()
  while (x <= 1) {
  
  this_prediction <-
      testProbs2 %>%
      mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
     gather(Variable, Count) %>%
     mutate(Revenue =
            ifelse(Variable == "True_Negative", (Count * 0) ,
               ifelse(Variable == "True_Positive",((-2850-5000*0.25+66000*0.25) * Count),
               ifelse(Variable == "False_Positive", (-2850) * Count, 0))),
            Threshold = x)
  
  all_prediction <- rbind(all_prediction, this_prediction)
  x <- x + .01
  }
return(all_prediction)
}
whichThreshold <- iterateThresholds(testProbs2)

whichThreshold_revenue <- 
whichThreshold %>% 
    group_by(Threshold) %>% 
    summarize(Revenue = sum(Revenue))

whichThreshold_count <- 
whichThreshold %>% 
    subset(Variable == "True_Positive") %>%
    group_by(Threshold) %>% 
    summarize(Count)

whichThreshold %>%
  ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette5[c(5, 1:3)]) +    
  labs(title = "Revenue by confusion matrix type and threshold",
       y = "Revenue") +
  guides(colour=guide_legend(title = "Confusion Matrix")) 

whichThreshold %>%
  ggplot(.,aes(Threshold, Count, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette5[c(5, 1:3)]) +    
  labs(title = "Count by confusion matrix type and threshold",
       y = "Count") +
  guides(colour=guide_legend(title = "Confusion Matrix")) 

A threshold of 0.15 is optimal and yields the greatest revenue, $792,100. Beyond that mark, revenue declines as true positives are lost to false negatives, which generate no revenue. The revenue at the 0.50 threshold is $376,000.
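
The revenue-maximizing threshold can also be read directly from whichThreshold_revenue; a minimal sketch (not part of the original workflow):

# Threshold with the highest total revenue across the test set
arrange(whichThreshold_revenue, -Revenue) %>% head(1)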

ggplot(whichThreshold_revenue)+
  geom_line(aes(x = Threshold, y = Revenue))+
  geom_vline(xintercept =  pull(arrange(whichThreshold_revenue, -Revenue)[1,1]))+
     ylim(-200000,1000000) +
    labs(title = "Model Revenues By Threshold For Test Sample",
         subtitle = "Vertical Line Denotes Optimal Threshold")

Since credits are issued only for true positives, the total credit count equals the number of true positives; at a threshold of 0.01 it reaches 157.

ggplot(whichThreshold_count)+
  geom_line(aes(x = Threshold, y = Count))+
  geom_vline(xintercept =  pull(arrange(whichThreshold_count, -Count)[1,1]))+
     ylim(0,200) +
    labs(title = "Model Counts By Threshold For Test Sample",
         subtitle = "Vertical Line Denotes Optimal Threshold")

testProbs2_15 <- 
  testProbs2 %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs2$Probs > 0.15 , 1, 0)))

cost_benefit_table_15 <-
   testProbs2_15 %>%
      count(predOutcome, Outcome) %>%
      summarize(
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),True_Negative = sum(n[predOutcome==0 & Outcome==0]),False_Positive = sum(n[predOutcome==1 & Outcome==0]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1])
                ) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               ifelse(Variable == "True_Negative", (Count * 0) ,
               ifelse(Variable == "True_Positive",((-2850-5000*0.25+66000*0.25) * Count),
               ifelse(Variable == "False_Positive", (-2850) * Count, 0)))) %>%
    bind_cols(data.frame(Description = c(
              "We correctly predicted enter",
              "We correctly predicted no enter",
              "We predicted enter and customer did not enter",
              "We predicted no enter and customer entered")))

kable(cost_benefit_table_15,
       caption = "Cost/Benefit Table (threshold=0.15)") %>% kable_styling()
Cost/Benefit Table (threshold=0.15)

  Variable         Count   Revenue   Description
  True_Positive       91   1128400   We correctly predicted enter
  True_Negative     1165         0   We correctly predicted no enter
  False_Positive     118   -336300   We predicted enter and customer did not enter
  False_Negative      66         0   We predicted no enter and customer entered

testProbs2_01 <- 
  testProbs2 %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs2$Probs > 0.01 , 1, 0)))

cost_benefit_table_01 <-
   testProbs2_01 %>%
      count(predOutcome, Outcome) %>%
      summarize(
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),True_Negative = sum(n[predOutcome==0 & Outcome==0]),False_Positive = sum(n[predOutcome==1 & Outcome==0]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1])
                ) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               ifelse(Variable == "True_Negative", (Count * 0) ,
               ifelse(Variable == "True_Positive",((-2850-5000*0.25+66000*0.25) * Count),
               ifelse(Variable == "False_Positive", (-2850) * Count, 0)))) %>%
    bind_cols(data.frame(Description = c(
              "We correctly predicted enter",
              "We correctly predicted no enter",
              "We predicted enter and customer did not enter",
              "We predicted no enter and customer entered")))

kable(cost_benefit_table_01,
       caption = "Cost/Benefit Table (threshold=0.01)") %>% kable_styling()
Cost/Benefit Table (threshold=0.01)

  Variable         Count    Revenue   Description
  True_Positive      157    1946800   We correctly predicted enter
  True_Negative        1          0   We correctly predicted no enter
  False_Positive    1282   -3653700   We predicted enter and customer did not enter
  False_Negative       0          0   We predicted no enter and customer entered

  3. Table of Total_Revenue and Total_Count_of_Credits by Threshold (0.01, the optimal 0.15, and 0.50)
whichThreshold_total <- left_join(whichThreshold_revenue, whichThreshold_count, by = "Threshold") %>%
  filter(Threshold == 0.15 | (Threshold > 0.495 & Threshold <0.51) | Threshold == 0.01)

kable(whichThreshold_total, "html", caption = "Table of the Total_Revenue and Total_Count_of_Credits by threshold") %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
Table of the Total_Revenue and Total_Count_of_Credits by threshold

  Threshold    Revenue   Count
       0.01   -1706900     157
       0.15     792100      91
       0.50     376000      34

6. Recommendation

Both the original and the improved model demonstrate good predictive performance for the "not entered" group (high specificity) but very low sensitivity for the "entered" group, which limits their usefulness for targeting. The recommendations to address this limitation are as follows:

First, as in the last step, we could appropriately adjust the threshold to an optimized level rather than using 0.5. This would help reduce false negatives and increase true positives, thereby improving sensitivity to some extent.

Second, the models in this research produce far more false negatives than true positives. We should investigate how the false-negative group learned about the home repair tax credit program outside of the marketing campaign. Additionally, since this analysis was conducted without spatial data, incorporating spatial variables to account for spatial correlation could improve the model.

Finally, following the given instructions, this model assumes the cost and benefit of false negatives are both zero, so both true negatives and false negatives contribute zero revenue. However, since false negatives represent homeowners who entered the program and received the credit, we could instead assign them a cost of -$5,000 and a benefit of $10,000 + $56,000. Alternatively, for false positives and true negatives, a negative benefit could represent the opportunity cost of unrealized benefits. A sketch of the first alternative appears below.
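
The sketch below is an illustration only, reusing the cost_benefit_table built earlier; the object name cost_benefit_alt is hypothetical and this rule is not applied anywhere in the analysis above:

# Alternative revenue rule for false negatives: treat them as credit recipients
# with no marketing cost (a cost of -$5,000 and a benefit of $10,000 + $56,000 each)
cost_benefit_alt <-
  cost_benefit_table %>%
  mutate(Revenue = ifelse(Variable == "False_Negative",
                          (-5000 + 10000 + 56000) * Count,
                          Revenue))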

Jun, Youngsang
Student of Master of Urban Spatial Analytics (MUSA)
Stuart Weitzman School of Design
University of Pennsylvania