Master of Urban Spatial Analytics (MUSA)
Stuart Weitzman School of Design
University of Pennsylvania
To: Tax Credit Program Manager
Department of Housing and Community Development (HCD)
Emil City
From: Jun, Youngsang
Student of Master of Urban Spatial Analytics (MUSA)
Stuart Weitzman School of Design
University of Pennsylvania
Date of Memo: November 1, 2024
SUBJECT: Targeting a Housing Subsidy
1. Background
Emil City has conducted marketing campaigns targeting homeowners who qualify for a home repair tax credit program. However, due to a low conversion rate and random outreach to eligible homeowners, a more proactive approach is required. To improve the efficiency of the program, research was conducted to convert all the client-level data from previous campaigns into an improved model that can better target their limited outreach resources. This memo presents the results of the trained classifier and a cost-benefit analysis by using the results of the classification.
2. Data Interpretation and Visualization
The given dataset contains the results of a survey with nine
continuous variables (age
, campaign
,
pdays
, previous
, unemploy_rate
,
cons.price.idx
, cons.conf.idx
,
inflation_rate
, spent_on_repairs
), ten
categorical variables (job
, marital
,
education
, taxLien
, mortgage
,
taxbill_in_phl
, contact
, month
,
day_of_week
, poutcome
), and the indicate that
represents whether the individual enter the home repair tax credit
program. In this analysis, a logistic regression was conducted with
entering status as the dependent variable and the 19 variables listed
above as independent variables.
To determine whether entering status varies with continuous
variables, mean values were compared across entering statuses. The
results show significant differences between the entered and not-entered
groups for age
, campaign
,
inflation_rate
, cons.conf.idx
,
previous
, unemploy_rate
, and
pdays
, as shown in the figure.
The following plots the mean for nine continuous features grouped by
enter or non-enter. The likelihood of entering the home repair tax
credit program, on average, increases with: Older age
(age
), Fewer contacts for this individual for this campaign
(campaign
), Lower U.S. inflation rate
(inflation_rate
), Lower consumer confidence index at time
of campaign (cons.conf.idx
), More contacts before this
campaign for this individual (previous
), Lower unemployment
rate at time of campaign (unemploy_rate
), and Fewer days
since last contact from a previous program (pdays
, in case
that client previously contacted).
housingSubsidy %>%
dplyr::select(y,age, campaign, previous, unemploy_rate, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs) %>%
gather(Variable, value, -y) %>%
ggplot(aes(factor(y, levels = c("no", "yes")), value, fill = y)) +
geom_bar(position = "dodge", stat = "summary", fun = "mean") +
facet_wrap(~Variable, scales = "free", ncol=4) +
scale_fill_manual(values = palette2) +
labs(x="Entered", y="Value",
title = "Feature associations with the likelihood of entering the home repair tax credit program",
subtitle = "(continous features)") +
theme(legend.position = "bottom")
housingSubsidy %>%
dplyr::select(y,pdays) %>%
subset(pdays < 999) %>%
gather(Variable, value, -y) %>%
ggplot(aes(factor(y, levels = c("no", "yes")), value, fill = y)) +
geom_bar(position = "dodge", stat = "summary", fun = "mean") +
facet_wrap(~Variable, scales = "free", ncol=4) +
scale_fill_manual(values = palette2) +
labs(x="Entered", y="Value",
title = "Feature associations with the likelihood of entering the home repair tax credit program",
subtitle = "(continous features)") +
theme(legend.position = "bottom")
The following plots the distributions of nine continuous features grouped by entered or not-entered. For the age, entered was dominant under age 50, while not-entered was dominant over 60. For the unemployment rate, entered was dominant at values above -0.5, while not-entered was dominant at values below -0.5. For previous, entered was dominant at values below 1, whereas not-entered was dominant at values above 2. For the inflation rate, not-entered was dominant at values below 4.2, while entered was dominant at values above 4.2.
housingSubsidy %>%
dplyr::select(y,age, campaign, previous, unemploy_rate, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs) %>%
gather(Variable, value, -y) %>%
ggplot() +
geom_density(aes(value, color=y), fill = "transparent") +
facet_wrap(~Variable, scales = "free", ncol=4) +
scale_fill_manual(values = palette2) +
labs(title = "Feature distributions entered vs. not entered",
subtitle = "(continous outcomes)")+
theme(legend.position = "bottom")
housingSubsidy %>%
dplyr::select(y,pdays)%>%
subset(pdays < 999) %>%
gather(Variable, value, -y) %>%
ggplot() +
geom_density(aes(value, color=y), fill = "transparent") +
facet_wrap(~Variable, scales = "free") +
scale_fill_manual(values = palette2) +
labs(title = "Feature distributions entered vs. not entered",
subtitle = "(continous outcomes)")+
theme(legend.position = "bottom")
The plots below illustrate whether differences in ten categorical
features associate with the likelihood that homeowners entered the
credit program. We can see that entered
was more often than
not-entered
when poutcome
is
success
and in December and March, but further insights are
limited until the regression analysis is conducted.
housingSubsidy %>%
dplyr::select(y, job, marital, education) %>%
gather(Variable, value, -y) %>%
count(Variable, value, y) %>%
ggplot(., aes(value, n, fill = y)) +
geom_bar(position = "dodge", stat="identity") +
facet_wrap(~Variable, scales="free", ncol=3) +
scale_fill_manual(values = palette2) +
labs(x="Entered", y="Value",
title = "Feature associations with the likelihood of entering the home repair tax credit program",
subtitle = "Categorical features") +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "bottom")
housingSubsidy %>%
dplyr::select(y, taxLien, mortgage, taxbill_in_phl) %>%
gather(Variable, value, -y) %>%
count(Variable, value, y) %>%
ggplot(., aes(value, n, fill = y)) +
geom_bar(position = "dodge", stat="identity") +
facet_wrap(~Variable, scales="free", ncol=3) +
scale_fill_manual(values = palette2) +
labs(x="Entered", y="Value",
title = "Feature associations with the likelihood of entering the home repair tax credit program",
subtitle = "Categorical features") +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "bottom")
housingSubsidy %>%
dplyr::select(y, contact, month, day_of_week, poutcome) %>%
gather(Variable, value, -y) %>%
count(Variable, value, y) %>%
ggplot(., aes(value, n, fill = y)) +
geom_bar(position = "dodge", stat="identity") +
facet_wrap(~Variable, scales="free", ncol=4) +
scale_fill_manual(values = palette2) +
labs(x="Entered", y="Value",
title = "Feature associations with the likelihood of entering the home repair tax credit program",
subtitle = "Categorical features") +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "bottom")
3. The Original Regression Model Interpretation
A logistic regression model predicts a binary outcome - a
1
or a 0
- an Entered
or a
Not-entered
and associates a coefficient that describes the
change in the probability of the outcome given some change in the
independent variable. The data is partitioned into a 65/35 training and
test set (p = 0.65
). These sets are named
housingSubsidyTrain
and
housingSubsidyTest
.
set.seed(3456)
trainIndex <- createDataPartition(housingSubsidy$y, p = .65,
list = FALSE,
times = 1)
housingSubsidyTrain <- housingSubsidy[ trainIndex,]
housingSubsidyTest <- housingSubsidy[-trainIndex,]
Then the model of kitchen sink regression (original regression),
which includes the whole of independent variables, is created. However,
X
, which represents the order, is excluded as well as
taxLien
, which has only one yes
value that
makes training and test set interpretation impossible. The results of
regression that includes the whole independent variables show that
contacttelephone
, monthdec
,
monthmar
, campaign
, and
poutcomenonexistent
are statistically significant. The AIC
of this training model is 1589.
housingSubsidyModel <- glm(y_numeric ~ .,
data=housingSubsidyTrain %>%
dplyr::select(-X, -y, -taxLien),
family="binomial" (link="logit"))
summary(housingSubsidyModel)
##
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"),
## data = housingSubsidyTrain %>% dplyr::select(-X, -y, -taxLien))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -64.456982 131.586619 -0.49 0.624
## age 0.004040 0.008501 0.48 0.635
## jobblue-collar -0.201978 0.269534 -0.75 0.454
## jobentrepreneur -0.439881 0.457493 -0.96 0.336
## jobhousemaid -0.608137 0.553335 -1.10 0.272
## jobmanagement -0.295884 0.305810 -0.97 0.333
## jobretired -0.077069 0.370302 -0.21 0.835
## jobself-employed -0.720168 0.458890 -1.57 0.117
## jobservices -0.378531 0.299237 -1.26 0.206
## jobstudent -0.582800 0.474902 -1.23 0.220
## jobtechnician -0.081523 0.233211 -0.35 0.727
## jobunemployed -0.003613 0.428222 -0.01 0.993
## jobunknown -0.344911 0.699671 -0.49 0.622
## maritalmarried 0.002554 0.239055 0.01 0.991
## maritalsingle -0.027233 0.274389 -0.10 0.921
## maritalunknown -11.795497 329.214840 -0.04 0.971
## educationbasic.6y 0.367033 0.399940 0.92 0.359
## educationbasic.9y -0.075444 0.338510 -0.22 0.824
## educationhigh.school 0.133463 0.322580 0.41 0.679
## educationilliterate -13.185612 882.743557 -0.01 0.988
## educationprofessional.course 0.179134 0.350260 0.51 0.609
## educationuniversity.degree 0.287589 0.323026 0.89 0.373
## educationunknown 0.320349 0.412583 0.78 0.437
## mortgageunknown -0.110573 0.567267 -0.19 0.845
## mortgageyes -0.208863 0.142825 -1.46 0.144
## taxbill_in_phlyes 0.181712 0.200887 0.90 0.366
## contacttelephone -0.641718 0.285012 -2.25 0.024 *
## monthaug 0.011917 0.443599 0.03 0.979
## monthdec 1.818817 0.779553 2.33 0.020 *
## monthjul 0.098704 0.373905 0.26 0.792
## monthjun 0.060152 0.471982 0.13 0.899
## monthmar 1.391007 0.569001 2.44 0.014 *
## monthmay -0.410096 0.315927 -1.30 0.194
## monthnov -0.615462 0.424909 -1.45 0.147
## monthoct 0.070484 0.547844 0.13 0.898
## monthsep -0.926398 0.682643 -1.36 0.175
## day_of_weekmon -0.317249 0.221360 -1.43 0.152
## day_of_weekthu -0.140555 0.215032 -0.65 0.513
## day_of_weektue -0.157101 0.216687 -0.73 0.468
## day_of_weekwed -0.205492 0.231250 -0.89 0.374
## campaign -0.108050 0.043424 -2.49 0.013 *
## pdays -0.000735 0.000751 -0.98 0.328
## previous 0.242960 0.213056 1.14 0.254
## poutcomenonexistent 0.742538 0.350073 2.12 0.034 *
## poutcomesuccess 1.109515 0.747728 1.48 0.138
## unemploy_rate -0.939155 0.486603 -1.93 0.054 .
## cons.price.idx 0.907770 0.861427 1.05 0.292
## cons.conf.idx 0.022924 0.029270 0.78 0.434
## inflation_rate 0.499742 0.447979 1.12 0.265
## spent_on_repairs -0.004471 0.010731 -0.42 0.677
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1853.7 on 2678 degrees of freedom
## Residual deviance: 1489.0 on 2629 degrees of freedom
## AIC: 1589
##
## Number of Fisher Scoring iterations: 13
The value of McFadden R-squared is 0.197. The closer to 1, generally the better.
pseudo_r2 <- pR2(housingSubsidyModel)
## fitting null model for pseudo-r2
pseudo_r2_df <- as.data.frame(t(pseudo_r2))
kable(pseudo_r2_df, "html", caption = "Pseudo-R² Values for `housingSubsidyModel`") %>%
kable_styling(bootstrap_options = c("hover", "striped"))
llh | llhNull | G2 | McFadden | r2ML | r2CU |
---|---|---|---|---|---|
-744 | -927 | 365 | 0.197 | 0.127 | 0.255 |
The following is the distribution of predicted probabilities by observed outcome and a Confusion Matrix. In the original prediction model, the sensitivity is 0.2420, and the specificity is 0.9805.
testProbs <- data.frame(Outcome = as.factor(housingSubsidyTest$y_numeric),
Probs = predict(housingSubsidyModel, housingSubsidyTest, type= "response"))
ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) +
geom_density() +
facet_grid(Outcome ~ .) +
scale_fill_manual(values = palette2) +
labs(x = "Entered", y = "Density of probabilities",
title = "Distribution of predicted probabilities by observed outcome") +
theme(strip.text.x = element_text(size = 18),
legend.position = "none")
testProbs <-
testProbs %>%
mutate(predOutcome = as.factor(ifelse(testProbs$Probs > 0.5 , 1, 0)))
caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1258 119
## 1 25 38
##
## Accuracy : 0.9
## 95% CI : (0.883, 0.915)
## No Information Rate : 0.891
## P-Value [Acc > NIR] : 0.145
##
## Kappa : 0.302
##
## Mcnemar's Test P-Value : 0.00000000000000919
##
## Sensitivity : 0.2420
## Specificity : 0.9805
## Pos Pred Value : 0.6032
## Neg Pred Value : 0.9136
## Prevalence : 0.1090
## Detection Rate : 0.0264
## Detection Prevalence : 0.0437
## Balanced Accuracy : 0.6113
##
## 'Positive' Class : 1
##
The cross validation (CV) goodness of fit metrics of the original
prediction model is as follows. The value of Sensitivity is 0.196, and
Specificity is 0.988, which is the results after changed the baseline y
value from no
to yes
. The Area Under the Curve
(AUC) of the original model is 0.772, which represents the area under
the Receiver Operating Characteristic Curve (ROC) curve.
housingSubsidy$y <- fct_relevel(housingSubsidy$y, "yes")
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)
cvFit <- train(relevel(y, ref = "yes") ~ .,
data=housingSubsidy %>%
dplyr::select(-X, -job, -age, -marital, -taxLien, -education, -month, -day_of_week, -pdays, -previous, -inflation_rate, -y_numeric),
method="glm", family="binomial",
metric="ROC", trControl = ctrl)
cvFit
## Generalized Linear Model
##
## 4119 samples
## 9 predictor
## 2 classes: 'yes', 'no'
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 4077, 4078, 4077, 4077, 4077, 4078, ...
## Resampling results:
##
## ROC Sens Spec
## 0.771 0.196 0.988
dplyr::select(cvFit$resample, -Resample) %>%
gather(metric, value) %>%
left_join(gather(cvFit$results[2:4], metric, mean)) %>%
ggplot(aes(value)) +
geom_histogram(bins=35, fill = "#FF006A") +
facet_wrap(~metric) +
geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
subtitle = "Across-fold mean reprented as dotted lines")
4. Improved Regression Model Suggestion
To improve the model’s performance, the following transformations were applied to the variables by feature importance/correlation.
education
: a new variable
educationEnterAvg
was created to represent the average
likelihood of entering the credit program by education level.
job
: a new variable jobEnterAvg
was
created to represent the average likelihood of entering the credit
program by job.
age
: categorized into three groups:
50 and below
, 50-60
, and
others
.
pdays
: categorized into two groups:
Client not previously contacted
and
Contacted
.
inflation_rate
: categorized into two groups:
Inflation below 4.2%
and
Inflation above 4.2%
.
previous
: categorized into two groups:
Previous call was 1 or 0
and
Previous call was 2 and above
.
marital
: categorized into two groups:
Married
and Others
.
month
: categorized into two groups:
March and December
and Others
.
day_of_week
: categorized into two groups:
Monday or Friday
and
Tuesday, Wednesday, or Thursday
.
housingSubsidy2 <-
housingSubsidy %>%
group_by(job) %>%
summarize(totEnter = sum(y_numeric),
n = n(),
jobEnterAvg = 100*(totEnter/n)) %>%
dplyr::select(-n, -totEnter) %>%
right_join(housingSubsidy, .) %>%
mutate(age_cat = case_when(age <= 50 ~ "50 and below",
age <= 60 & age >50 ~ "50-60",
TRUE ~ "others")) %>%
mutate(pdays_cat = case_when(pdays == 999 ~ "client not previously contacted",
TRUE ~ "contacted")) %>%
mutate(inflation_cat = case_when(inflation_rate <= 4.2 ~ "inflation below 4.2",
inflation_rate >4.2 ~ "inflation above 4.2")) %>%
mutate(previous_cat = case_when(previous <= 1 ~ "previous call 1 or 0",
previous > 1 ~ "previous call 2 and above")) %>%
mutate(month_con = case_when(month == "mar" ~ "March or December",
month == "apr" ~ "4-11",
month == "may" ~ "4-11",
month == "jun" ~ "4-11",
month == "jul" ~ "4-11",
month == "aug" ~ "4-11",
month == "sep" ~ "4-11",
month == "oct" ~ "4-11",
month == "nov" ~ "4-11",
month == "dec" ~ "March or December")) %>%
mutate(day_con = case_when(day_of_week == "mon" ~ "Monfri",
day_of_week == "tue" ~ "TWR",
day_of_week == "wed" ~ "TWR",
day_of_week == "thu" ~ "TWR",
day_of_week == "fri" ~ "Monfri")) %>%
mutate(marital_cat = case_when(marital == "married" ~ "married",
TRUE ~ "others"))
housingSubsidy2 <-
housingSubsidy2 %>%
group_by(education) %>%
summarize(totEnter = sum(y_numeric),
n = n(),
educationEnterAvg = 100*(totEnter/n)) %>%
dplyr::select(-n, -totEnter) %>%
right_join(housingSubsidy2, .)
The data is partitioned into a 65/35 training and test set
(p = 0.65
). These sets are named
housingSubsidy2Train
and
housingSubsidy2Test
.
set.seed(3456)
trainIndex2 <- createDataPartition(housingSubsidy2$y, p = .65,
list = FALSE,
times = 1)
housingSubsidy2Train <- housingSubsidy2[ trainIndex2,]
housingSubsidy2Test <- housingSubsidy2[-trainIndex2,]
The improved model with the whole variables is run with the dependent
variable y_numeric
and without X
and
taxLien
for the same reason above. The results of the
modeling show that contacttelephone
, campaign
,
poutcome
, unemploy_rate
,
cons.price.idx
, cons.conf.idx
,
inflation_cat
, previous_cat
, and
month_con
are statistically significant. The AIC of this
training model is 1547, which is a bit improved than the original
housingSubsidyModel
.
housingSubsidyModel2 <- glm(y_numeric ~ .,
data=housingSubsidy2Train %>%
dplyr::select(-X, -job, -age, -marital, -taxLien, -education, -month, -day_of_week, -pdays, -previous, -inflation_rate, -y),
family="binomial" (link="logit"))
summary(housingSubsidyModel2)
##
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"),
## data = housingSubsidy2Train %>% dplyr::select(-X, -job, -age,
## -marital, -taxLien, -education, -month, -day_of_week,
## -pdays, -previous, -inflation_rate, -y))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -159.58455 46.82405 -3.41 0.00065
## mortgageunknown 0.20630 0.50516 0.41 0.68299
## mortgageyes -0.27119 0.14253 -1.90 0.05709
## taxbill_in_phlyes 0.18255 0.20370 0.90 0.37016
## contacttelephone -1.03652 0.22378 -4.63 0.00000363
## campaign -0.05674 0.03891 -1.46 0.14474
## poutcomenonexistent 0.51401 0.23187 2.22 0.02663
## poutcomesuccess 1.82859 0.69435 2.63 0.00845
## unemploy_rate -1.03046 0.21015 -4.90 0.00000094
## cons.price.idx 1.45237 0.34125 4.26 0.00002081
## cons.conf.idx 0.03297 0.01861 1.77 0.07649
## spent_on_repairs 0.00428 0.00328 1.31 0.19165
## jobEnterAvg 0.03462 0.01664 2.08 0.03743
## age_cat50 and below -0.01146 0.19491 -0.06 0.95309
## age_catothers 0.25462 0.36603 0.70 0.48667
## pdays_catcontacted 0.01176 0.68418 0.02 0.98629
## inflation_catinflation below 4.2 -0.58449 0.48633 -1.20 0.22942
## previous_catprevious call 2 and above 0.62995 0.34554 1.82 0.06829
## month_conMarch or December 1.77383 0.35356 5.02 0.00000052
## day_conTWR 0.05807 0.14402 0.40 0.68678
## marital_catothers -0.07767 0.14908 -0.52 0.60239
## educationEnterAvg 0.02596 0.03270 0.79 0.42729
##
## (Intercept) ***
## mortgageunknown
## mortgageyes .
## taxbill_in_phlyes
## contacttelephone ***
## campaign
## poutcomenonexistent *
## poutcomesuccess **
## unemploy_rate ***
## cons.price.idx ***
## cons.conf.idx .
## spent_on_repairs
## jobEnterAvg *
## age_cat50 and below
## age_catothers
## pdays_catcontacted
## inflation_catinflation below 4.2
## previous_catprevious call 2 and above .
## month_conMarch or December ***
## day_conTWR
## marital_catothers
## educationEnterAvg
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1853.7 on 2678 degrees of freedom
## Residual deviance: 1477.1 on 2657 degrees of freedom
## AIC: 1521
##
## Number of Fisher Scoring iterations: 5
Additionally, the correlation across numeric variables is examined to
see if there exists duplicated information. It shows that
unemploy_rate
has a strong correlation with
inflation_rate
, cons.price.idx
, and
spent_on_repairs
, while pdays
has a strong
correlation with previous
. We also eliminated the
independent variables that were statistically insignificant.
Nonetheless, if the AIC decreases or increases Sensitivity when specific
variables are added, those variables are retained in the model.
numericVars <-
select_if(housingSubsidy2, is.numeric) %>% na.omit()
ggcorrplot(
round(cor(numericVars), 1),
p.mat = cor_pmat(numericVars),
colors = c("#25CB10", "white", "#FA7800"),
type="lower",
insig = "blank") +
labs(title = "Correlation across numeric variables")
The final improved model result is as follows. The AIC is 1546, which is more improved than the original model.
housingSubsidyModel2 <- glm(y_numeric ~ .,
data=housingSubsidy2Train %>%
dplyr::select(-X, -y, -age,-age_cat,-job, -jobEnterAvg, -marital,-marital_cat, -education, -taxLien, -month, -cons.conf.idx, -poutcome, -day_of_week, -pdays, -previous,-spent_on_repairs, -inflation_rate, -previous_cat ),
# -mortgage, -contact, -campaign, -unemploy_rate, -cons.price.idx, -inflation_cat -month_con, -educationEnterAvg -pdays_cat,-taxbill_in_phl,-day_con,
family="binomial" (link="logit"))
summary(housingSubsidyModel2)
##
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"),
## data = housingSubsidy2Train %>% dplyr::select(-X, -y, -age,
## -age_cat, -job, -jobEnterAvg, -marital, -marital_cat,
## -education, -taxLien, -month, -cons.conf.idx, -poutcome,
## -day_of_week, -pdays, -previous, -spent_on_repairs, -inflation_rate,
## -previous_cat))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -107.0439 15.8575 -6.75 0.000000000014746
## mortgageunknown 0.0831 0.4999 0.17 0.86796
## mortgageyes -0.2718 0.1408 -1.93 0.05351
## taxbill_in_phlyes 0.1610 0.1998 0.81 0.42045
## contacttelephone -0.9078 0.2092 -4.34 0.000014328702302
## campaign -0.0617 0.0391 -1.58 0.11480
## unemploy_rate -0.9902 0.1323 -7.49 0.000000000000071
## cons.price.idx 1.1231 0.1704 6.59 0.000000000044094
## pdays_catcontacted 1.5124 0.2517 6.01 0.000000001876734
## inflation_catinflation below 4.2 -1.4260 0.4183 -3.41 0.00065
## month_conMarch or December 1.6508 0.3383 4.88 0.000001064307914
## day_conTWR 0.0572 0.1417 0.40 0.68651
## educationEnterAvg 0.0446 0.0309 1.44 0.14883
##
## (Intercept) ***
## mortgageunknown
## mortgageyes .
## taxbill_in_phlyes
## contacttelephone ***
## campaign
## unemploy_rate ***
## cons.price.idx ***
## pdays_catcontacted ***
## inflation_catinflation below 4.2 ***
## month_conMarch or December ***
## day_conTWR
## educationEnterAvg
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1853.7 on 2678 degrees of freedom
## Residual deviance: 1499.6 on 2666 degrees of freedom
## AIC: 1526
##
## Number of Fisher Scoring iterations: 5
The value of new model’s McFadden R-squared is 0.180, which is lower than the original.
pseudo_r2 <- pR2(housingSubsidyModel)
## fitting null model for pseudo-r2
pseudo_r22 <- pR2(housingSubsidyModel2)
## fitting null model for pseudo-r2
pseudo_r2_df <- as.data.frame(t(pseudo_r2))
pseudo_r22_df <- as.data.frame(t(pseudo_r22))
pseudo_r2_df <- cbind(Model = "Original Model", pseudo_r2_df)
pseudo_r22_df <- cbind(Model = "Improved Model", pseudo_r22_df)
rbind(pseudo_r2_df, pseudo_r22_df) %>%
kable( "html", caption = "Pseudo-R² Values for `housingSubsidyModel`") %>%
kable_styling(bootstrap_options = c("hover", "striped"))
Model | llh | llhNull | G2 | McFadden | r2ML | r2CU |
---|---|---|---|---|---|---|
Original Model | -744 | -927 | 365 | 0.197 | 0.127 | 0.255 |
Improved Model | -750 | -927 | 354 | 0.191 | 0.124 | 0.248 |
A dataframe of predictions for the 1,440 observations in the test set
is created, called testProbs2
. The following is the
distribution of predicted probabilities by observed outcome and a
Confusion Matrix. In the improved prediction model, the sensitivity is
0.2548, and the specificity is 0.9875. In the distribution, the height
of “not entered” hump get lower.
testProbs2 <- data.frame(Outcome = as.factor(housingSubsidy2Test$y_numeric),
Probs = predict(housingSubsidyModel2, housingSubsidy2Test, type= "response"))
ggplot(testProbs2, aes(x = Probs, fill = as.factor(Outcome))) +
geom_density() +
facet_grid(Outcome ~ .) +
scale_fill_manual(values = palette2) +
labs(x = "Entered", y = "Density of probabilities",
title = "Distribution of predicted probabilities by observed outcome") +
theme(strip.text.x = element_text(size = 18),
legend.position = "none")
testProbs2 <-
testProbs2 %>%
mutate(predOutcome = as.factor(ifelse(testProbs2$Probs > 0.5 , 1, 0)))
caret::confusionMatrix(testProbs2$predOutcome, testProbs2$Outcome,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1267 123
## 1 16 34
##
## Accuracy : 0.903
## 95% CI : (0.887, 0.918)
## No Information Rate : 0.891
## P-Value [Acc > NIR] : 0.0676
##
## Kappa : 0.291
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.2166
## Specificity : 0.9875
## Pos Pred Value : 0.6800
## Neg Pred Value : 0.9115
## Prevalence : 0.1090
## Detection Rate : 0.0236
## Detection Prevalence : 0.0347
## Balanced Accuracy : 0.6020
##
## 'Positive' Class : 1
##
The ROC curve is a graphical representation of the trade-off between the true positive rate and false positive rate across a series of thresholds. The area under the curve (AUC) is a single number that summarizes the performance of the model across all thresholds. Curves are “above” the y=x line, which is where the prediction rates for positives and negatives are better than random. The ROC of the improved model is more “square” than the original one. The AUC of the improved model is 0.826, which is greater than the original model’s AUC of 0.802.
auc_testProbs <- auc(testProbs$Outcome, testProbs$Probs)
auc_testProbs2 <- auc(testProbs2$Outcome, testProbs2$Probs)
auc_df <- data.frame(
Model = c("Original Model", "Improved Model"),
AUC = c(auc_testProbs, auc_testProbs2)
)
kable(auc_df, "html", caption = "AUC Values by Models") %>%
kable_styling(bootstrap_options = c("hover", "striped"))
Model | AUC |
---|---|
Original Model | 0.802 |
Improved Model | 0.810 |
ggplot() +
geom_roc(data= testProbs, aes(d = as.numeric(Outcome), m = Probs, colour = "Original Model"), n.cuts = 50, labels = FALSE) +
geom_roc(data= testProbs2, aes(d = as.numeric(Outcome), m = Probs, colour = "Improved Model"), n.cuts = 50, labels = FALSE) +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
scale_colour_manual(values = c("Improved Model" = "#FE9900", "Original Model" = "#335566")) +
labs(title = "ROC Curve - enterModel", colour = "Models") +
theme(legend.position = "bottom")
The CV goodness of fit metrics of the improved prediction model is as follows. The value of Sensitivity is 0.212, and Specificity is 0.987. The AUC of the improved model is 0.788.
housingSubsidy2$y <- fct_relevel(housingSubsidy2$y, "yes")
ctrl2 <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)
cvFit2 <- train(relevel(y, ref = "yes") ~ .,
data=housingSubsidy2 %>%
dplyr::select(-X, -y_numeric, -age,-age_cat,-job, -jobEnterAvg, -marital,-marital_cat, -education, -taxLien, -month, -cons.conf.idx, -poutcome, -day_of_week, -pdays, -previous,-spent_on_repairs, -inflation_rate, -previous_cat ),
method="glm", family="binomial",
metric="ROC", trControl = ctrl2)
cvFit2
## Generalized Linear Model
##
## 4119 samples
## 11 predictor
## 2 classes: 'yes', 'no'
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 4078, 4078, 4078, 4078, 4078, 4078, ...
## Resampling results:
##
## ROC Sens Spec
## 0.786 0.209 0.988
dplyr::select(cvFit2$resample, -Resample) %>%
gather(metric, value) %>%
left_join(gather(cvFit2$results[2:4], metric, mean)) %>%
ggplot(aes(value)) +
geom_histogram(bins=35, fill = "#FF006A") +
facet_wrap(~metric) +
geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
subtitle = "Across-fold mean reprented as dotted lines")
5. Cost-Benefit Calculation using Improved Regression Model
This chapter estimates the revenues associated with using the improved model under the following scenario. The cost-benefit table is as follows.
(1) True Positive: Predicted correctly a homeowner would enter the credit program; allocated the marketing resources, and 25% ultimately achieved the credit. The marketing cost of $2,850 per person is included in the cost. Only 25% of eligible individuals are expected to receive the $5,000 credit, generating benefits only for this 25%, while the remaining 75% will not receive the credit. Therefore, both the $5,000 cost and the $66,000 benefit are multiplied by 25%.
- Cost: -$2,850 -$5,000*0.25*count
- Benefit: ($10,000+$56,000)*0.25*count
(2) True Negative: Predicted correctly a homeowner would not enter the credit program. Since no marketing resources were allocated and no credit was allocated, cost would be $0. Benefit would also be $0 because no one in this group received credit, the benefit $10,000+$56,000 will not be generated.
- Cost: $0
- Benefit: $0
(3) False Positive: Predicted incorrectly a homeowner would enter the credit program; allocated marketing resources; no credit allocated. The marketing cost of $2,850 per person is included in the cost, and credit cost would be $0 since no credit was allocated. Benefit would also be $0 because no one in this group received credit, the benefit $10,000+$56,000 will not be generated.
- Cost: -$2,850*count
- Benefit: $0
(4) False Negative: We predicted that a homeowner would not enter the credit program but they did. These are likely homeowners who signed up for reasons unrelated to the marketing campaign. Thus, we ‘0 out’ this category, assuming the cost/benefit of this is $0.
- Cost: $0
- Benefit: $0
cost_benefit_table <-
testProbs2 %>%
count(predOutcome, Outcome) %>%
summarize(
True_Positive = sum(n[predOutcome==1 & Outcome==1]),True_Negative = sum(n[predOutcome==0 & Outcome==0]),False_Positive = sum(n[predOutcome==1 & Outcome==0]),
False_Negative = sum(n[predOutcome==0 & Outcome==1])
) %>%
gather(Variable, Count) %>%
mutate(Revenue =
ifelse(Variable == "True_Negative", (Count * 0) ,
ifelse(Variable == "True_Positive",((-2850-5000*0.25+66000*0.25) * Count),
ifelse(Variable == "False_Positive", (-2850) * Count, 0)))) %>%
bind_cols(data.frame(Description = c(
"We correctly predicted enter",
"We correctly predicted no enter",
"We predicted enter and customer did not enter",
"We predicted no enter and customer entered")))
kable(cost_benefit_table,
caption = "Cost/Benefit Table (threshold=0.50)") %>% kable_styling()
Variable | Count | Revenue | Description |
---|---|---|---|
True_Positive | 34 | 421600 | We correctly predicted enter |
True_Negative | 1267 | 0 | We correctly predicted no enter |
False_Positive | 16 | -45600 | We predicted enter and customer did not enter |
False_Negative | 123 | 0 | We predicted no enter and customer entered |
The last step to tuning the model is to run it for each threshold value. In the previous step, the threshold was set to 0.5; however, this may not yield the maximum revenue. By calculating the total credit count and revenue at each threshold, the optimze threshold that maximizes revenue can be determined.
iterateThresholds <- function(data) {
x = .01
all_prediction <- data.frame()
while (x <= 1) {
this_prediction <-
testProbs2 %>%
mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
count(predOutcome, Outcome) %>%
summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
True_Positive = sum(n[predOutcome==1 & Outcome==1]),
False_Negative = sum(n[predOutcome==0 & Outcome==1]),
False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
gather(Variable, Count) %>%
mutate(Revenue =
ifelse(Variable == "True_Negative", (Count * 0) ,
ifelse(Variable == "True_Positive",((-2850-5000*0.25+66000*0.25) * Count),
ifelse(Variable == "False_Positive", (-2850) * Count, 0))),
Threshold = x)
all_prediction <- rbind(all_prediction, this_prediction)
x <- x + .01
}
return(all_prediction)
}
whichThreshold <- iterateThresholds(testProbs2)
whichThreshold_revenue <-
whichThreshold %>%
group_by(Threshold) %>%
summarize(Revenue = sum(Revenue))
whichThreshold_count <-
whichThreshold %>%
subset(Variable == "True_Positive") %>%
group_by(Threshold) %>%
summarize(Count)
whichThreshold %>%
ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
geom_point() +
scale_colour_manual(values = palette5[c(5, 1:3)]) +
labs(title = "Revenue by confusion matrix type and threshold",
y = "Revenue") +
guides(colour=guide_legend(title = "Confusion Matrix"))
whichThreshold %>%
ggplot(.,aes(Threshold, Count, colour = Variable)) +
geom_point() +
scale_colour_manual(values = palette5[c(5, 1:3)]) +
labs(title = "Count by confusion matrix type and threshold",
y = "Count") +
guides(colour=guide_legend(title = "Confusion Matrix"))
A threshold of 15% is optimal and yields the greatest revenue at $928,500. After that mark, losses associated with False Negatives begin to mount. The revenue at 50% of threshold is $450,400.
ggplot(whichThreshold_revenue)+
geom_line(aes(x = Threshold, y = Revenue))+
geom_vline(xintercept = pull(arrange(whichThreshold_revenue, -Revenue)[1,1]))+
ylim(-200000,1000000) +
labs(title = "Model Revenues By Threshold For Test Sample",
subtitle = "Vertical Line Denotes Optimal Threshold")
Since the credit count only occurs for True Positives, the total credit is the count of true positives. When Threshold=0.01, it becomes 157.
ggplot(whichThreshold_count)+
geom_line(aes(x = Threshold, y = Count))+
geom_vline(xintercept = pull(arrange(whichThreshold_count, -Count)[1,1]))+
ylim(0,200) +
labs(title = "Model Counts By Threshold For Test Sample",
subtitle = "Vertical Line Denotes Optimal Threshold")
testProbs2_15 <-
testProbs2 %>%
mutate(predOutcome = as.factor(ifelse(testProbs2$Probs > 0.15 , 1, 0)))
cost_benefit_table_15 <-
testProbs2_15 %>%
count(predOutcome, Outcome) %>%
summarize(
True_Positive = sum(n[predOutcome==1 & Outcome==1]),True_Negative = sum(n[predOutcome==0 & Outcome==0]),False_Positive = sum(n[predOutcome==1 & Outcome==0]),
False_Negative = sum(n[predOutcome==0 & Outcome==1])
) %>%
gather(Variable, Count) %>%
mutate(Revenue =
ifelse(Variable == "True_Negative", (Count * 0) ,
ifelse(Variable == "True_Positive",((-2850-5000*0.25+66000*0.25) * Count),
ifelse(Variable == "False_Positive", (-2850) * Count, 0)))) %>%
bind_cols(data.frame(Description = c(
"We correctly predicted enter",
"We correctly predicted no enter",
"We predicted enter and customer did not enter",
"We predicted no enter and customer entered")))
kable(cost_benefit_table_15,
caption = "Cost/Benefit Table (threshold=0.15)") %>% kable_styling()
Variable | Count | Revenue | Description |
---|---|---|---|
True_Positive | 91 | 1128400 | We correctly predicted enter |
True_Negative | 1165 | 0 | We correctly predicted no enter |
False_Positive | 118 | -336300 | We predicted enter and customer did not enter |
False_Negative | 66 | 0 | We predicted no enter and customer entered |
testProbs2_01 <-
testProbs2 %>%
mutate(predOutcome = as.factor(ifelse(testProbs2$Probs > 0.01 , 1, 0)))
cost_benefit_table_01 <-
testProbs2_01 %>%
count(predOutcome, Outcome) %>%
summarize(
True_Positive = sum(n[predOutcome==1 & Outcome==1]),True_Negative = sum(n[predOutcome==0 & Outcome==0]),False_Positive = sum(n[predOutcome==1 & Outcome==0]),
False_Negative = sum(n[predOutcome==0 & Outcome==1])
) %>%
gather(Variable, Count) %>%
mutate(Revenue =
ifelse(Variable == "True_Negative", (Count * 0) ,
ifelse(Variable == "True_Positive",((-2850-5000*0.25+66000*0.25) * Count),
ifelse(Variable == "False_Positive", (-2850) * Count, 0)))) %>%
bind_cols(data.frame(Description = c(
"We correctly predicted enter",
"We correctly predicted no enter",
"We predicted enter and customer did not enter",
"We predicted no enter and customer entered")))
kable(cost_benefit_table_01,
caption = "Cost/Benefit Table (threshold=0.01)") %>% kable_styling()
Variable | Count | Revenue | Description |
---|---|---|---|
True_Positive | 157 | 1946800 | We correctly predicted enter |
True_Negative | 1 | 0 | We correctly predicted no enter |
False_Positive | 1282 | -3653700 | We predicted enter and customer did not enter |
False_Negative | 0 | 0 | We predicted no enter and customer entered |
whichThreshold_total <- left_join(whichThreshold_revenue, whichThreshold_count, by = "Threshold") %>%
filter(Threshold == 0.15 | (Threshold > 0.495 & Threshold <0.51) | Threshold == 0.01)
kable(whichThreshold_total, "html", caption = "Table of the Total_Revenue and Total_Count_of_Credits by threshold") %>%
kable_styling(bootstrap_options = c("hover", "striped"))
Threshold | Revenue | Count |
---|---|---|
0.01 | -1706900 | 157 |
0.15 | 792100 | 91 |
0.50 | 376000 | 34 |
6. Recommendation
Both the original and improved model in this research demonstrate good predictive for the “not entered” group (specificity), but shows very low sensitivity for the “entered” group, so it has limitations in targeting. The recommendation to address this limitation are the following:
First, as in the last step, we could appropriately adjust the threshold to an optimized level rather than using 0.5. This would help reduce false negatives and increase true positives, thereby improving sensitivity to some extent.
Second, the models in this research show a significantly higher false negative compared to true positive. We should investigate how the false negative group learned about the home repair tax credit program outside of marketing. Additionally, since this analysis was conducted without spatial data, incorporating spatial variables to account for spatial correlation could improve the model.
Finally, this model currently assumes cost and benefit for false negatives are both zero, following the given instructions, which results in zero revenue for both true negatives and false negatives. However, since false negatives represent homeowners who received the credit, we could assign a cost of -$5,000 and a benefit of ($10,000 + $56,000) to them. Alternatively, for false positives and true negatives, we could consider a negative benefit to represent the opportunity cost of unrealized benefits.
Jun, Youngsang
Student of Master of Urban Spatial Analytics (MUSA)
Stuart Weitzman School of Design
University of Pennsylvania