Zillow has struggled to accurately estimate home prices, partly because it is a national platform that often lacks detailed, localized information. Urban dynamics have become increasingly complex, and understanding housing markets requires a more granular approach that incorporates local factors like government policies, crime, gentrification, pollution levels, and access to public transit. Our objective is to use local data to create a model that predicts Philadelphia home prices with greater accuracy.
This markdown consists of two parts. First, we analyze the current house prices prediction model. We determine the correlation and spatial autocorrelation between house prices and variables in the current model, and derive the problems of the house prices prediction model. Second, we propose an improved model. After adding six continuous and categorical variables that affect house prices but are not considered in the current model to the model, we will predict house prices and verify the accuracy and generalization of the improved model.
Our overall modeling strategy focuses on using open-source datasets, such as those available from the Philadelphia Open Data Portal, to enhance our understanding of the local housing market. By incorporating data on crime, zoning, pollution, and transit-oriented development, we aim to build a more contextually rich model that reflects the true dynamics of the city.
Our approach to developing a more accurate home price prediction model for Philadelphia involved three main phases: data collection and preparation, analysis of the existing OLS model, and proposal of an improved model.
1. Data Collection and Preparation We gathered localized datasets from the Philadelphia Open Data Portal, aiming to capture the unique characteristics influencing home prices within the city. These datasets included:
Property Attributes: sale price, total area, livable area, age, and internal/external quality assessments. Neighborhood Characteristics: crime rates, public amenities, and zoning classifications.
Accessibility Indicators: transit-oriented development zones, proximity to subway stations, and public parks.
Socioeconomic Indicators: demographic trends and school quality. To ensure consistent spatial alignment, we mapped all spatial datasets to Philadelphia’s local coordinate system (ESRI:102728). Outliers were identified and removed, such as properties with sale prices exceeding $5 million, to focus the analysis on a more representative price range.
2. Analysis of Current OLS Model
We analyzed Zillow’s existing OLS model by examining correlations between sale price and both continuous (e.g., age, total area) and categorical variables (e.g., quality grade, view type). Our analysis showed:
Negative correlation between age and sale price, suggesting that older properties tend to sell for less.
Positive correlation between total area, livable area, and sale price, indicating larger properties command higher prices.
We evaluated these correlations using a correlation matrix, allowing us to select non-collinear predictors for regression. In this phase, we tested multiple regression models using variables like total livable area, number of bathrooms, and internal/external condition, observing that livable area was the most significant predictor.
3. Proposed Model
We sought to improve prediction accuracy by integrating six new variables beyond the original model:
Crime Data: Frequency of incidents per neighborhood to capture safety perception.
Shooting Incidents: High-density crime areas, often perceived as riskier. Public Transit Proximity: Impact of TOD zones on home values.
School Access: Number of schools per census tract as an indicator of educational access.
Environmental Quality: Pollution and green space proximity.
Gentrification Indicators: Presence of rapidly changing areas.
Using multiple linear regression, we compared our model’s Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) against Zillow’s baseline. We implemented 100-fold cross-validation to assess model generalizability, allowing systematic testing across varied training and testing splits. We also applied spatial analysis techniques, such as Moran’s I, to evaluate spatial clustering in model errors, highlighting areas for improvement.
4. Statistical and Spatial Analysis
We assessed spatial lags for both price and model errors to understand the clustering of predictions and identify areas where neighborhood effects were inadequately captured. Additionally, we tested spatial autocorrelation in errors using Moran’s I, finding positive autocorrelation that suggested the model could benefit from further integration of neighborhood-specific factors.
This approach leverages spatial and socioeconomic variables to capture Philadelphia’s local housing dynamics, ultimately reducing MAE and MAPE compared to Zillow’s model. Further refinements, such as geographically weighted regression and enhanced socio-demographic data, could improve predictive accuracy across diverse neighborhoods and price ranges.
# Install and Load Libraries
# install.packages('mapview')
library(tidyverse)
library(tidycensus)
library(sf)
library(knitr)
library(kableExtra)
library(mapview)
library(dplyr)
library(scales)
library(viridis)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(jtools) # for regression model plots
library(broom)
library(tufte)
library(rmarkdown)
library(pander)
library(classInt)
library(ggplot2)
library(units)
# functions and data directory
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
# Source for Multiple Ring Buffer
source(
paste0("https://raw.githubusercontent.com/urbanSpatial/",
"Public-Policy-Analytics-Landing/master/functions.r"))
# Set Color Table
palette5 <- c("#25CB10", "#5AB60C", "#8FA108", "#C48C04", "#FA7800")
# Disable Scientific Notation
options(scipen=999)
options(tigris_class = "sf")
# Set ACS API Key
census_api_key("b2835f54d89a4499ba29829c908967b86765b345", overwrite = TRUE)
We used spatial data techniques to ensure all data were correctly mapped to Philadelphia’s local coordinate system (ESRI:102728).
# Load Philadelphia Zillow Data
philly.sf <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/studentData.geojson") %>%
# as.data.frame() %>%
st_as_sf(crs = 4326)%>%
st_transform(crs = 'ESRI:102728')
# Get ACS data
nhoods <-
get_acs(geography = "tract",
variables = c("B01003_001E","B02001_002E",
"B06011_001E"),
year=2020, state="PA",
county="Philadelphia", geometry=TRUE) %>%
st_transform(crs = 'ESRI:102728') %>%
separate(NAME, into = c("Census_Tract", "City_State"), sep = ", ", extra = "merge") %>%
mutate(
Census_Tract = gsub("Census Tract ", "", Census_Tract),
City_State = gsub(" County, Pennsylvania", "", City_State)) %>%
dplyr::select( -moe)%>%
spread(key = variable, value = estimate) %>%
rename(TotalPop = B01003_001,
NumberWhites = B02001_002,
Median_Income = B06011_001) %>%
mutate(percentWhite = NumberWhites / TotalPop,
raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
incomeContext = ifelse(Median_Income > 35322, "High Income", "Low Income"))
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/philadelphia.geojson") %>%
st_transform(crs = 'ESRI:102728')
# Filter to eliminate outlier
philly.sf_filtered <- philly.sf %>%
filter(sale_price < 5000000 & toPredict == "MODELLING") %>%
mutate(pricepersq = ifelse(total_area > 10, sale_price / total_area, 0))
philly.sf_filtered <- philly.sf_filtered[!is.na(as.numeric(philly.sf_filtered$pricepersq)), ]
The following chart shows House Price per square foot in Philadelphia. Center City and the Northeast are mostly in the 1st quintile for house prices per square foot, while Southwest, Northwest, and Northeast Philadelphia are around in the 3rd quintile. By mapping these prices, we can pinpoint high-demand neighborhoods and observe how features like proximity to parks or specific zoning affect property desirability. In addition, it helps us understand the spatial distribution of property values and reveals which areas are more desirable based on pricing trends.
ggplot() +
geom_sf(data=nhoods, fill="#DDD", color="white")+
geom_sf(data = philly.sf_filtered, aes(color = q5(pricepersq)),
show.legend = "point", size = 0.5, alpha = 1) +
scale_color_manual(values = palette5,
labels=qBr(philly.sf_filtered, "pricepersq"),
name="Quintile\nBreaks($/sqft)") +
labs(
title = "House Price per square foot in Philadelphia",
subtitle = "Assessment date: May 24, 2022-August 14, 2023",
caption = "Data: U.S. Census Bureau, Zillow") +
theme_void()
#mapview(philly.sf_filtered)
# Drop geometry and calculate summary statistics
data_for_summary <- philly.sf_filtered %>%
st_drop_geometry() %>%
select(
number_of_bathrooms,
number_of_bedrooms,
total_area,
total_livable_area,
year_built,
exterior_condition,
interior_condition,
sale_price
) %>%
mutate(
property_age = 2024 - year_built # Assuming '2024' as the current year for age
)
# Calculate the summary statistics
sum_stats_combined <- data_for_summary %>%
summarize(
Mean = round(c(
mean(number_of_bathrooms, na.rm = TRUE),
mean(number_of_bedrooms, na.rm = TRUE),
mean(total_area, na.rm = TRUE),
mean(total_livable_area, na.rm = TRUE),
mean(property_age, na.rm = TRUE),
mean(exterior_condition, na.rm = TRUE),
mean(interior_condition, na.rm = TRUE),
mean(sale_price, na.rm = TRUE)), 2),
Median = round(c(
median(number_of_bathrooms, na.rm = TRUE),
median(number_of_bedrooms, na.rm = TRUE),
median(total_area, na.rm = TRUE),
median(total_livable_area, na.rm = TRUE),
median(property_age, na.rm = TRUE),
median(exterior_condition, na.rm = TRUE),
median(interior_condition, na.rm = TRUE),
median(sale_price, na.rm = TRUE)), 2),
Min = round(c(
min(number_of_bathrooms, na.rm = TRUE),
min(number_of_bedrooms, na.rm = TRUE),
min(total_area, na.rm = TRUE),
min(total_livable_area, na.rm = TRUE),
min(property_age, na.rm = TRUE),
min(exterior_condition, na.rm = TRUE),
min(interior_condition, na.rm = TRUE),
min(sale_price, na.rm = TRUE)), 2),
Max = round(c(
max(number_of_bathrooms, na.rm = TRUE),
max(number_of_bedrooms, na.rm = TRUE),
max(total_area, na.rm = TRUE),
max(total_livable_area, na.rm = TRUE),
max(property_age, na.rm = TRUE),
max(exterior_condition, na.rm = TRUE),
max(interior_condition, na.rm = TRUE),
max(sale_price, na.rm = TRUE)), 2),
SD = round(c(
sd(number_of_bathrooms, na.rm = TRUE),
sd(number_of_bedrooms, na.rm = TRUE),
sd(total_area, na.rm = TRUE),
sd(total_livable_area, na.rm = TRUE),
sd(property_age, na.rm = TRUE),
sd(exterior_condition, na.rm = TRUE),
sd(interior_condition, na.rm = TRUE),
sd(sale_price, na.rm = TRUE)), 2)
) %>%
mutate(
`Variable Name` = c("number_of_bathrooms", "number_of_bedrooms",
"total_area", "total_livable_area",
"property_age", "exterior_condition",
"interior_condition", "sale_price"),
Description = c("Number of Bathrooms", "Number of Bedrooms",
"Total Area of Property", "Total Livable Area of Property",
"Property Age", "Exterior Condition Rating",
"Interior Condition Rating", "Sale Price of Property"),
Category = c(rep("Physical characteristics", 4),
"Age", rep("Condition", 2), "Financial")
) %>%
select(Category, `Variable Name`, Description, everything())
# Display the summary statistics table with formatting
sum_stats_combined %>%
kable(format = "html", caption = "Table: Summary statistics for numeric variables by category") %>%
row_spec(1:4, background = "#ffffff50") %>%
row_spec(5, background = "#f2f2f250") %>%
row_spec(6:7, background = "#e6e6e680") %>%
kable_styling(bootstrap_options = c("hover"), fixed_thead = TRUE)
Category | Variable Name | Description | Mean | Median | Min | Max | SD |
---|---|---|---|---|---|---|---|
Physical characteristics | number_of_bathrooms | Number of Bathrooms | 1.07 | 1 | 0 | 8 | 0.67 |
Physical characteristics | number_of_bedrooms | Number of Bedrooms | 2.58 | 3 | 0 | 31 | 1.27 |
Physical characteristics | total_area | Total Area of Property | 1828.80 | 1179 | 0 | 226512 | 3824.23 |
Physical characteristics | total_livable_area | Total Livable Area of Property | 1333.19 | 1208 | 0 | 15246 | 546.51 |
Age | property_age | Property Age | 87.46 | 99 | 0 | 2024 | 49.43 |
Condition | exterior_condition | Exterior Condition Rating | 3.77 | 4 | 1 | 7 | 0.87 |
Condition | interior_condition | Interior Condition Rating | 3.65 | 4 | 0 | 7 | 0.88 |
Financial | sale_price | Sale Price of Property | 271929.21 | 229000 | 10100 | 4900000 | 234193.33 |
We looked at the relationship between House Price
i.e. sale_price
and Age
,
total_area
, and total_livable_area
, which are
continuous variables currently used by Zillow. Age
was
calculated by subtracting the year of completion,
year_built
, from this year, 2024. As a result,
Age
and sale_price
showed a negative
correlation, while total_area
and sale_price
and total_livable_area
and sale_price
showed a
positive correlation. We filtered our datasets to remove outliers and
errors, such as excluding properties with sale prices above $5 million
to focus on the majority market segment.
# Price as a function of continuous variables
philly.sf_filtered %>%
st_drop_geometry() %>%
mutate(Age = 2024 - year_built) %>%
dplyr::select(sale_price, total_area, Age, total_livable_area) %>%
filter(sale_price <= 5000000, total_area <= 10000, total_livable_area <= 5000,
Age < 500) %>% # Filter out crazy outliers
gather(Variable, Value, -sale_price) %>%
ggplot(aes(Value, sale_price)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se=F, colour = "#FA7800") +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Price as a function of continuous variables",
subtitle="Sale Price: $, Age: year(s), Area: sqft, Total Livable Area: sqft") +
theme_minimal()
We also examine the relationship between sale_price
and
quality_grade
, view_type
,
exterior_condition
, and interior_condition
,
which are categorical variables currently used by Zillow.
Quality_grade
has two types of grading systems, so we
changed the 1-6 system to an S-A-B-C-D-E system. The following
histograms show that as the condition or quality grade goes high, the
sale price tends to increase.
# Price as a function of categorical variables
philly.sf_filtered <- philly.sf_filtered %>%
mutate(
quality_grade_new = case_when(
quality_grade == "1 " ~ "S ",
quality_grade == "2 " ~ "A ",
quality_grade == "3 " ~ "B ",
quality_grade == "4 " ~ "C ",
quality_grade == "5 " ~ "D ",
quality_grade == "6 " ~ "E ",
TRUE ~ as.character(quality_grade)
)
)
philly.sf_filtered$quality_grade_new <- factor(philly.sf_filtered$quality_grade_new, levels = c("S+", "S ", "A+", "A ", "A-", "B+", "B ", "B-", "C+", "C ", "C-", "D+", "D ", "D-", "E+", "E ", "E-"))
# levels(philly.sf_filtered$quality_grade_new)
philly.sf_filtered %>%
st_drop_geometry() %>%
dplyr::select(sale_price, quality_grade_new, view_type, exterior_condition, interior_condition) %>%
filter(sale_price <= 1000000, !is.na(interior_condition), interior_condition != 0, view_type != "", view_type != 0, quality_grade_new != "") %>% # Filter out crazy outliers
gather(Variable, Value, -sale_price) %>%
ggplot(aes(Value, sale_price)) +
geom_bar(position = "dodge", stat="summary", fun="mean")+
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Price as a function of categorical variables", y="Mean_Price") +
plotTheme() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
theme_minimal()
The following diagram shows correlation matrix across numeric variables, which helps us decide to choose which variable for regression. It shows how different property and neighborhood numeric factors relate to each other and to home prices. It uses values between -1 and 1 to indicate the strength and direction of these relationships: - Positive values mean that as one variable increases, so does the other (e.g., property size and price). - Negative values show that as one variable increases, the other decreases (e.g., distance from the city center and price). - Values near zero suggest no clear relationship.
Since it is helpful for analysis to perform regression by selecting
variables that are not correlated with each other,
year_built
, total_area
, and
number_stories
were chosen for the subsequent regression
analysis.
numericVars <-
select_if(st_drop_geometry(philly.sf_filtered), is.numeric) %>% na.omit()
ggcorrplot(
round(cor(numericVars), 1),
p.mat = cor_pmat(numericVars),
colors = c("#25CB10", "white", "#FA7800"),
type="lower",
insig = "blank") +
labs(title = "Correlation across numeric variables")
The following is the result of linear regression analysis of
sale_price
and total_livable_area
. If this
result is statistically significant, we can predict
sale_price
from total_livable_area
using the
formula SalePrice_i = -65,358 + 252.99Xi + ei. The example below is
statistically significant because the p-value is small, and if
total_livable_area
are 2,262 sqft, we can say that the
predicted sale_price
is -65,358+252.99x2,262=$507,000.
livingReg <- lm(sale_price ~ total_livable_area, data = philly.sf_filtered)
summary(livingReg)
##
## Call:
## lm(formula = sale_price ~ total_livable_area, data = philly.sf_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2963434 -100606 -19246 69176 3911388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -65357.845 3236.965 -20.19 <0.0000000000000002 ***
## total_livable_area 252.993 2.247 112.61 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 189000 on 23703 degrees of freedom
## Multiple R-squared: 0.3485, Adjusted R-squared: 0.3485
## F-statistic: 1.268e+04 on 1 and 23703 DF, p-value: < 0.00000000000000022
This regression analysis can be extended to multiple variables. The
following denotes that sale_price
,
exterior_condition
, interior_condition
,
number_of_bathrooms
, number_of_bedrooms
,
total_livable_area
, type_heater
,
year_built
, and number_stories
were found to
be correlated with sale price
, while
total_area
and view_type
showed no clear
evidence of correlation with sale price
. We removed
parcel_shape
because it contains only one ‘D’ value, which
makes dividing the data into test and training sets impossible.
reg1 = lm(sale_price ~ ., data =
st_drop_geometry(philly.sf_filtered) %>%
filter(number_of_bedrooms <13) %>%
dplyr::select(sale_price, exterior_condition, interior_condition,
number_of_bathrooms, number_of_bedrooms,
total_area, total_livable_area, type_heater,
view_type, year_built, number_stories))
summary(reg1)
##
## Call:
## lm(formula = sale_price ~ ., data = st_drop_geometry(philly.sf_filtered) %>%
## filter(number_of_bedrooms < 13) %>% dplyr::select(sale_price,
## exterior_condition, interior_condition, number_of_bathrooms,
## number_of_bedrooms, total_area, total_livable_area, type_heater,
## view_type, year_built, number_stories))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2521256 -82578 -12655 60090 3825388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 367739.59845 63229.63736 5.816 0.0000000061079 ***
## exterior_condition -37451.65930 1887.05912 -19.847 < 0.0000000000000002 ***
## interior_condition -26394.16316 1976.31186 -13.355 < 0.0000000000000002 ***
## number_of_bathrooms 69992.38887 2449.30053 28.576 < 0.0000000000000002 ***
## number_of_bedrooms -41559.64236 1200.83142 -34.609 < 0.0000000000000002 ***
## total_area -0.05222 0.33045 -0.158 0.874435
## total_livable_area 216.49972 2.51828 85.971 < 0.0000000000000002 ***
## type_heater0 36254.74058 18521.03755 1.957 0.050302 .
## type_heaterA 38553.45983 3173.42475 12.149 < 0.0000000000000002 ***
## type_heaterB 13393.14301 3739.78342 3.581 0.000343 ***
## type_heaterC 25736.27846 11129.93332 2.312 0.020767 *
## type_heaterD 278963.64750 23014.73383 12.121 < 0.0000000000000002 ***
## type_heaterE 86043.92597 23535.90821 3.656 0.000257 ***
## type_heaterG -2073.41856 6689.82181 -0.310 0.756612
## type_heaterH 2302.29790 3131.02207 0.735 0.462153
## view_type0 -84520.30980 26618.55376 -3.175 0.001499 **
## view_typeA 29405.09534 22013.74444 1.336 0.181640
## view_typeB 63892.75561 31807.47220 2.009 0.044577 *
## view_typeC 210729.11889 22714.28495 9.277 < 0.0000000000000002 ***
## view_typeD -20698.92314 26335.83105 -0.786 0.431899
## view_typeE -46473.14493 26580.12568 -1.748 0.080405 .
## view_typeH 16053.34952 25087.63073 0.640 0.522250
## view_typeI 11350.53124 20898.37655 0.543 0.587046
## year_built -89.20768 29.89523 -2.984 0.002848 **
## number_stories 14605.88394 2212.10634 6.603 0.0000000000412 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 165600 on 23411 degrees of freedom
## (251 observations deleted due to missingness)
## Multiple R-squared: 0.4957, Adjusted R-squared: 0.4952
## F-statistic: 959 on 24 and 23411 DF, p-value: < 0.00000000000000022
We divided number_stories
into three groups (Up to 2
Floors, 3 Floors, 4+ Floors) and performed regression analysis. The
results are as follows.
# Categorize Number of Stories
philly.sf_filtered <-
philly.sf_filtered %>%
mutate(number_stories.cat = case_when(
number_stories >= 0 & number_stories < 3 ~ "Up to 2 Floors",
number_stories >= 3 & number_stories < 4 ~ "3 Floors",
number_stories > 4 ~ "4+ Floors"
))
reg2 <- lm(sale_price ~ ., data =
st_drop_geometry(philly.sf_filtered) %>%
dplyr::select(sale_price, exterior_condition, interior_condition,
number_of_bathrooms, number_of_bedrooms,
total_area, total_livable_area, type_heater,
view_type, year_built, number_stories.cat))
summary(reg2)
##
## Call:
## lm(formula = sale_price ~ ., data = st_drop_geometry(philly.sf_filtered) %>%
## dplyr::select(sale_price, exterior_condition, interior_condition,
## number_of_bathrooms, number_of_bedrooms, total_area,
## total_livable_area, type_heater, view_type, year_built,
## number_stories.cat))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2360228 -80731 -11461 59821 3870160
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 239825.2385 61007.0510 3.931
## exterior_condition -36339.7202 1847.4969 -19.670
## interior_condition -26039.5099 1929.7698 -13.494
## number_of_bathrooms 59553.8245 2349.8048 25.344
## number_of_bedrooms -32765.3007 1112.4259 -29.454
## total_area 0.7790 0.3256 2.392
## total_livable_area 201.6932 2.6339 76.576
## type_heater0 37546.8341 18181.0871 2.065
## type_heaterA 38583.0065 3100.0228 12.446
## type_heaterB 8422.3315 3646.0934 2.310
## type_heaterC 29722.5686 10827.0012 2.745
## type_heaterD 280510.5222 22446.7370 12.497
## type_heaterE 87731.8702 23197.4335 3.782
## type_heaterG -835.0819 6526.1774 -0.128
## type_heaterH 3139.2226 3039.3621 1.033
## view_type0 -78211.5136 25977.4159 -3.011
## view_typeA 36045.5687 21502.6498 1.676
## view_typeB 70821.8996 31033.5056 2.282
## view_typeC 218674.2867 22174.7120 9.861
## view_typeD -13321.0688 25705.3411 -0.518
## view_typeE -38733.7420 25945.9282 -1.493
## view_typeH 21977.7716 24493.9795 0.897
## view_typeI 15120.9048 20404.5295 0.741
## year_built 13.7549 29.3130 0.469
## number_stories.cat4+ Floors 677963.5782 93438.7191 7.256
## number_stories.catUp to 2 Floors -53050.6479 3789.9301 -13.998
## Pr(>|t|)
## (Intercept) 0.000084799494045 ***
## exterior_condition < 0.0000000000000002 ***
## interior_condition < 0.0000000000000002 ***
## number_of_bathrooms < 0.0000000000000002 ***
## number_of_bedrooms < 0.0000000000000002 ***
## total_area 0.016753 *
## total_livable_area < 0.0000000000000002 ***
## type_heater0 0.038919 *
## type_heaterA < 0.0000000000000002 ***
## type_heaterB 0.020899 *
## type_heaterC 0.006052 **
## type_heaterD < 0.0000000000000002 ***
## type_heaterE 0.000156 ***
## type_heaterG 0.898183
## type_heaterH 0.301682
## view_type0 0.002609 **
## view_typeA 0.093687 .
## view_typeB 0.022492 *
## view_typeC < 0.0000000000000002 ***
## view_typeD 0.604308
## view_typeE 0.135486
## view_typeH 0.369583
## view_typeI 0.458667
## year_built 0.638901
## number_stories.cat4+ Floors 0.000000000000412 ***
## number_stories.catUp to 2 Floors < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 161600 on 23351 degrees of freedom
## (328 observations deleted due to missingness)
## Multiple R-squared: 0.4998, Adjusted R-squared: 0.4992
## F-statistic: 933.2 on 25 and 23351 DF, p-value: < 0.00000000000000022
To assess the accuracy of the current model, we calculated the Mean Absolute Error (MAE) and Mean Absolute Percent Error (MAPE). To do this, the dataset was randomly divided into training data and testing data in a ratio of 6:4, and then the difference between the observed value and the actual value was calculated.
The MAE is the average of the absolute differences between predicted and observed values, while the MAPE is the average of the absolute percentage differences between predicted and observed values. The MAE and MAPE are useful metrics for evaluating the accuracy of regression models, as they provide a measure of how well the model predicts actual values.
inTrain <- createDataPartition(
y = paste(philly.sf_filtered$number_stories.cat,
philly.sf_filtered$view_type, philly.sf_filtered$heat_type),
p = .60, list = FALSE)
philly.training <- philly.sf_filtered[inTrain,]
philly.test <- philly.sf_filtered[-inTrain,]
reg.training <-
lm(sale_price ~ ., data = as.data.frame(philly.training) %>%
dplyr::select(sale_price, exterior_condition, interior_condition, number_of_bathrooms, number_of_bedrooms, total_area, total_livable_area, type_heater, view_type, year_built, number_stories.cat))
summary(reg.training)
##
## Call:
## lm(formula = sale_price ~ ., data = as.data.frame(philly.training) %>%
## dplyr::select(sale_price, exterior_condition, interior_condition,
## number_of_bathrooms, number_of_bedrooms, total_area,
## total_livable_area, type_heater, view_type, year_built,
## number_stories.cat))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2263573 -81244 -11784 59685 3058212
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 221253.3658 94284.9960 2.347
## exterior_condition -36972.9817 2409.3830 -15.345
## interior_condition -25774.9487 2520.4327 -10.226
## number_of_bathrooms 68178.1442 3026.0574 22.530
## number_of_bedrooms -33208.3228 1441.2109 -23.042
## total_area 1.8720 0.4726 3.961
## total_livable_area 191.9226 3.5233 54.472
## type_heater0 48055.3735 25517.0851 1.883
## type_heaterA 34606.7146 4023.5892 8.601
## type_heaterB 6748.6386 4689.9764 1.439
## type_heaterC 20372.3193 14261.6151 1.428
## type_heaterD 328732.5501 32862.8599 10.003
## type_heaterE 85431.9992 26204.1878 3.260
## type_heaterG -519.8221 8490.7987 -0.061
## type_heaterH 341.6382 3950.8861 0.086
## view_type0 -57844.7989 32805.1269 -1.763
## view_typeA 51438.6471 26920.0604 1.911
## view_typeB 70498.8011 39118.7987 1.802
## view_typeC 250486.9406 27834.1494 8.999
## view_typeD -203.5163 32517.9642 -0.006
## view_typeE -11670.7026 32814.1360 -0.356
## view_typeH 30549.0509 30913.5975 0.988
## view_typeI 32568.7706 25453.8469 1.280
## year_built 16.5537 45.5838 0.363
## number_stories.cat4+ Floors 958613.1237 115017.1006 8.335
## number_stories.catUp to 2 Floors -50325.0509 4959.2903 -10.148
## Pr(>|t|)
## (Intercept) 0.01896 *
## exterior_condition < 0.0000000000000002 ***
## interior_condition < 0.0000000000000002 ***
## number_of_bathrooms < 0.0000000000000002 ***
## number_of_bedrooms < 0.0000000000000002 ***
## total_area 0.000075 ***
## total_livable_area < 0.0000000000000002 ***
## type_heater0 0.05969 .
## type_heaterA < 0.0000000000000002 ***
## type_heaterB 0.15019
## type_heaterC 0.15318
## type_heaterD < 0.0000000000000002 ***
## type_heaterE 0.00112 **
## type_heaterG 0.95118
## type_heaterH 0.93109
## view_type0 0.07787 .
## view_typeA 0.05605 .
## view_typeB 0.07154 .
## view_typeC < 0.0000000000000002 ***
## view_typeD 0.99501
## view_typeE 0.72210
## view_typeH 0.32307
## view_typeI 0.20073
## year_built 0.71650
## number_stories.cat4+ Floors < 0.0000000000000002 ***
## number_stories.catUp to 2 Floors < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 162500 on 14008 degrees of freedom
## (195 observations deleted due to missingness)
## Multiple R-squared: 0.4965, Adjusted R-squared: 0.4956
## F-statistic: 552.5 on 25 and 14008 DF, p-value: < 0.00000000000000022
philly.test <-
philly.test %>%
mutate(Regression = "Baseline Regression",
sale_price.Predict = predict(reg.training, philly.test),
sale_price.Error = sale_price.Predict - sale_price,
sale_price.AbsError = abs(sale_price.Predict - sale_price),
sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price) %>%
filter(sale_price < 5000000)
ggplot()+
geom_histogram(data = philly.test, aes(x = sale_price.AbsError), binwidth = 10000, fill = "orange") +
scale_x_continuous(breaks=seq(0, 1000000, by = 100000), )+
coord_cartesian(xlim = c(0, 1000000)) +
labs(title = "Distribution of Prediction Errors",
subtitle = "",
caption = "") +
theme_minimal()
The result of the current model is as follows:
- Mean Absolute Error (MAE): $100,278.2
- Mean Absolute Percent Error (MAPE): 63.5%
* The result of knit can be different.
The error is not trivial given the mean sale_price
of
$271,929. This suggests that the current model may not be accurately
capturing the underlying patterns in the data.
#Mean Absolute Error (MAE)
mean(philly.test$sale_price.AbsError, na.rm = T)
## [1] 100010.9
# Mean Absolute Percent Error (MAPE)
mean(philly.test$sale_price.APE, na.rm = T)
## [1] 0.6384693
The following plot shows that the model and perfect prediction. The orange line represents a perfect prediction (y=x), and the green line represents average prediction. This plot shows that predicted sale price was a little over estimated.
ggplot(data=philly.test, aes(sale_price.Predict, sale_price)) +
geom_point(size = .5) +
geom_abline(intercept = 0, slope = 1, color = "#FA7800") +
geom_smooth(method = "lm", se=F, colour = "#25CB10") +
coord_cartesian(xlim = c(0, 4000000), ylim = c(0, 4000000)) +
labs(title = "Price as a function of continuous variables",
subtitle = "Orange line represents a perfect prrediction (y=x)\nGreen line represents average prediction") +
theme_minimal()
To assess the generalizability of the current model, we performed k-fold cross-validation with k=100. This algorithm involves splitting the dataset into 100 subsets, training the model on 99 subsets, and testing it on the remaining subset. This process is repeated 100 times, with each subset used as the test set once. The results are then averaged to provide an estimate of the model’s performance on new data.
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv <-
train(sale_price ~ ., data = st_drop_geometry(philly.sf_filtered) %>%
dplyr::select(sale_price, exterior_condition,
interior_condition,
number_of_bathrooms, number_of_bedrooms,
total_area, total_livable_area, type_heater,
view_type, year_built,
number_stories.cat),
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv
## Linear Regression
##
## 23705 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 23467, 23467, 23468, 23468, 23469, 23467, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 159484.7 0.5090659 99752.15
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The distribution of MAE values is shown in the histogram below.
ggplot()+
geom_histogram(data = reg.cv$resample, aes(x = MAE), binwidth = 10000, fill = "orange") +
scale_x_continuous(breaks=seq(0, 600000, by = 100000), )+
coord_cartesian(xlim = c(0, 600000)) +
labs(title = "Distribution of MAE",
subtitle = "k-fold cross-validation; k=100",
caption = "") +
theme_minimal()
The first plot shows the predicted sale price in the test
set(philly.test
), and the second plot shows the absolute
sale price errors. No matter how accurate this model may be, we can see
that some of the errors still remain clustered on the second map. It
suggests that other factors with spatial correlations remain unmodeled.
We will explore this further in the next step by applying spatial lag
analysis and Moran’s I.
philly.test <- philly.test[!is.na(philly.test$sale_price.Predict),]
ggplot() +
geom_sf(data=nhoods, fill="#DDD", color="white")+
geom_sf(data = philly.test, aes(color = q5(sale_price.Predict)),
show.legend = "point", size = 0.5, alpha = 1) +
scale_color_manual(values = palette5,
labels=qBr(philly.test, "sale_price.Predict"),
name="Quintile\nBreaks($/sqft)") +
labs(
title = "Predicted Sale Price in Philadelphia",
subtitle = "Assessment date: May 24, 2022-August 14, 2023",
caption = "Data: U.S. Census Bureau, Zillow") +
theme_void()
ggplot() +
geom_sf(data=nhoods, fill="#DDD", color="white")+
geom_sf(data = philly.test, aes(color = q5(sale_price.AbsError)),
show.legend = "point", size = 0.5, alpha = 1) +
scale_color_manual(values = palette5,
labels=qBr(philly.test, "sale_price.AbsError"),
name="Quintile\nBreaks($/sqft)") +
labs(
title = "Test Set Absolute Sale Price Errors in Philadelphia",
subtitle = "Assessment date: May 24, 2022-August 14, 2023",
caption = "Data: U.S. Census Bureau, Zillow") +
theme_void()
In this step, we calculated for each home sale, the average sale
price of its k=5 nearest neighbors. This spatial lag variable,
lagPrice
, represents the average sale price of the five
nearest neighbors of each home sale. We then used this variable to
assess the spatial autocorrelation of sale prices and errors.
coords <- st_coordinates(philly.sf_filtered)
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")
philly.sf_filtered$lagPrice <- lag.listw(spatialWeights, philly.sf_filtered$sale_price)
philly.test <- philly.test[!is.na(philly.test$sale_price.Error),]
coords.test <- st_coordinates(philly.test)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
There is still positive correlation between sale_price
and lagPrice
and between sale_price.Error
and
lagPriceError
. This suggests that the model’s errors are
clustered spatially, indicating that the model is not adequately
capturing localized factors.
# Sale Price as a function of the Spatial Lag of Price
philly.test %>%
mutate(lagPrice = lag.listw(spatialWeights.test, sale_price)) %>%
ggplot(aes(lagPrice, sale_price))+
geom_point(color = "orange") +
geom_smooth(method = "lm", se=F, colour = "#25CB10") +
labs(title = "Price as a function of the Spatial Lag of Price",
x = "Spatial Lag of Price (mean price of 5 nearest neighbors)",
y = "Sale Price") +
theme_minimal()
# Error as a Function of the Spatial Lag of Price
philly.test %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)) %>%
ggplot(aes(lagPriceError, sale_price.Error))+
geom_point(color = "orange") +
geom_smooth(method = "lm", se=F, colour = "#25CB10") +
labs(title = "Error as a Function of the Spatial Lag of Price",
x = "Spatial Lag of Errors (mean error of 5 nearest neighbors)",
y = "Sale Price Error") +
theme_minimal()
We can calculate the Pearson’s R coefficient to test this a different way.
pearsons_test <-
philly.test %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error, NAOK=TRUE))
cor.test(pearsons_test$lagPriceError,
pearsons_test$sale_price.Error,
method = "pearson")
##
## Pearson's product-moment correlation
##
## data: pearsons_test$lagPriceError and pearsons_test$sale_price.Error
## t = 71.832, df = 9341, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5832941 0.6094222
## sample estimates:
## cor
## 0.5965162
By using the Moran’s I statistic, we can assess the global degree of clustering or dispersion of sales price values. The result shows that Moran’s I = 0.396, which indicates positive spatial autocorrelation. It suggests that the model’s errors are clustered, often occurring in specific neighborhoods or regions. If Moran’s I is close to zero, it implies that errors are more randomly distributed, indicating the model might be performing well without significant spatial bias. To improve this prediction model, we need to find more spatial variables that can reduce the spatial autocorrelation of the errors.
moranTest <- moran.mc(philly.test$sale_price.Error,
spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "orange",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in orange",
x="Moran's I",
y="Count") +
theme_minimal()
In Chapter 2, we discovered that more spatial variables need to be incorporated on the current Zillow model so that we can reduce the spatial autocorrelation of the errors, and finally improve prediction quality in accuracy and generalization. Therefore, We collected data from open sources such as the Philadelphia Open Data Portal, focusing on datasets that provide insight into the local character of neighborhoods.
We intentionally excluded factors like walkability because, although important in urban housing preferences, it may not be universally desirable, especially for families who might prefer less densely populated areas. The following table shows the data types we included and excluded, along with the reasons for our decisions.
# Create the data frame
data <- data.frame(
Category = c("Safety & Crime", "Urban Planning & Zoning", "Gentrification Indicators", "Accessibility & Transit", "Environmental Quality", "Education", "Quality of Life", "Public Spaces", "Government & Infrastructure"),
`Data Type` = c("Crime Data", "Zoning Regulations", "Gentrification Indicators", "Transit Accessibility", "Pollution Levels", "School Quality", "Walkability Index", "Access to Parks", "Government Investments"),
`Included/Excluded` = c("Included", "Included", "Included", "Included", "Included", "Included", "Excluded", "Excluded", "Excluded"),
Reason = c(
"Safety is a critical factor for homeowners.",
"Impacts property types, uses, and value.",
"Gentrification trends affect property values.",
"Commute to work.",
"Pollution impacts health and desirability.",
"The quality of local schools is a major consideration for families and directly impacts property values.",
"Walkability is valued in urban settings but may not be a priority for all buyers, such as families preferring suburban environments.",
"Access to parks and recreational spaces may have variable impact depending on demographics, making it less reliable for general predictions.",
"Government resources and investments are often dispersed and long-term, making their direct impact on property values harder to measure."
)
)
# Create the kable table
kable(data, "html", col.names = c("Category", "Data Type", "Included/Excluded", "Reason")) %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
column_spec(1:4, width = "20em") # Adjust column width for better readability
Category | Data Type | Included/Excluded | Reason |
---|---|---|---|
Safety & Crime | Crime Data | Included | Safety is a critical factor for homeowners. |
Urban Planning & Zoning | Zoning Regulations | Included | Impacts property types, uses, and value. |
Gentrification Indicators | Gentrification Indicators | Included | Gentrification trends affect property values. |
Accessibility & Transit | Transit Accessibility | Included | Commute to work. |
Environmental Quality | Pollution Levels | Included | Pollution impacts health and desirability. |
Education | School Quality | Included | The quality of local schools is a major consideration for families and directly impacts property values. |
Quality of Life | Walkability Index | Excluded | Walkability is valued in urban settings but may not be a priority for all buyers, such as families preferring suburban environments. |
Public Spaces | Access to Parks | Excluded | Access to parks and recreational spaces may have variable impact depending on demographics, making it less reliable for general predictions. |
Government & Infrastructure | Government Investments | Excluded | Government resources and investments are often dispersed and long-term, making their direct impact on property values harder to measure. |
This table gives an overview of different features related to properties and their surroundings, grouped into categories like Property Details, Local Amenities, Neighborhood Layout, Crime Data, and Transit/Demographics.
The table breaks down various features of properties and neighborhoods into five main groups:
Property Details: How many bedrooms and bathrooms they have, and the year they were built.
Local Amenities: Features and services that might impact a property’s appeal, such as whether a property has central air or a garage.
Neighborhood Layout: Zoning rules and the shape of the property lots, giving insight into how the area is organized.
Crime Data: Frequency of crime incidents in different areas, which can help gauge the safety of a neighborhood.
Transit/Demographics: Access to public transportation and basic demographic trends like population levels.
file_path <- "~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/Cleaned_Comprehensive_Summary_Statistics.csv"
summary_stats <- read_csv(file_path)
summary_stats_cleaned <- summary_stats %>%
drop_na()
summary_stats_cleaned %>%
kable("html", caption = "Summary Statistics") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Variable | Category | Description | count |
---|---|---|---|
total_livable_area | Internal Characteristics | Total livable area of the property (sqft) | 23881 |
number_of_bathrooms | Internal Characteristics | Number of bathrooms | 23622 |
number_of_bedrooms | Internal Characteristics | Number of bedrooms | 23865 |
number_of_rooms | Internal Characteristics | Total number of rooms | 666 |
number_stories | Internal Characteristics | Number of stories | 23873 |
year_built | Internal Characteristics | Year the property was built | 23881 |
zoning | Spatial Structure | Zoning classification of the parcel | 23881 |
parcel_shape | Spatial Structure | Shape type of the parcel (e.g., rectangular, irregular) | 23881 |
Thefts | Crime | Frequency of reported incidents for each crime type | 29413 |
Other Assaults | Crime | Frequency of reported incidents for each crime type | 20172 |
All Other Offenses | Crime | Frequency of reported incidents for each crime type | 10846 |
Motor Vehicle Theft | Crime | Frequency of reported incidents for each crime type | 10468 |
Vandalism/Criminal Mischief | Crime | Frequency of reported incidents for each crime type | 10424 |
Theft from Vehicle | Crime | Frequency of reported incidents for each crime type | 7892 |
Aggravated Assault No Firearm | Crime | Frequency of reported incidents for each crime type | 4159 |
Burglary Residential | Crime | Frequency of reported incidents for each crime type | 2521 |
Narcotic / Drug Law Violations | Crime | Frequency of reported incidents for each crime type | 2262 |
Aggravated Assault Firearm | Crime | Frequency of reported incidents for each crime type | 2117 |
Weapon Violations | Crime | Frequency of reported incidents for each crime type | 2040 |
We collected 6 additional datasets including crime data in 2018-2023, crime dispatch incidents in 2024, location of subway stations, schools, public parks and recreation areas from the Philadelphia Open Data portal the U.S. Census Bureau (ACS data) or other data source. We also cleaned data to exclude records with missing or erroneous geographic points. We created new variables, such as price per square foot, number of schools per tract, which are crucial for our model.
# Load Philadelphia Zillow Data
philly.sf <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/studentData.geojson") %>%
# as.data.frame() %>%
st_as_sf(crs = 4326)%>%
st_transform(crs = 'ESRI:102728')
nhoods <-
get_acs(geography = "tract",
variables = c("B01003_001E","B02001_002E",
"B06011_001E"),
year=2020, state="PA",
county="Philadelphia", geometry=TRUE) %>%
st_transform(crs = 'ESRI:102728') %>%
separate(NAME, into = c("Census_Tract", "City_State"), sep = ", ", extra = "merge") %>%
mutate(
Census_Tract = gsub("Census Tract ", "", Census_Tract),
City_State = gsub(" County, Pennsylvania", "", City_State)) %>%
dplyr::select( -moe)%>%
spread(key = variable, value = estimate) %>%
rename(TotalPop = B01003_001,
NumberWhites = B02001_002,
Median_Income = B06011_001) %>%
mutate(percentWhite = NumberWhites / TotalPop,
raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
incomeContext = ifelse(Median_Income > 35322, "High Income", "Low Income"))
## Getting data from the 2016-2020 5-year ACS
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
# Load Police District Data in Philadelphia
philly.police <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/police_districts.geojson") %>%
# as.data.frame() %>%
st_as_sf(crs = 4326)%>%
st_transform(crs = 'ESRI:102728')
# Load School Data in Philadelphia
philly.school <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/Schools.geojson") %>%
# as.data.frame() %>%
st_as_sf(crs = 4326)%>%
st_transform(crs = 'ESRI:102728')
# Load Green Area (Public Parks Recreation) Data in Philadelphia
philly.PPR <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/PPR_Properties.geojson") %>%
# as.data.frame() %>%
st_as_sf(crs = 4326)%>%
st_transform(crs = 'ESRI:102728')
# Load Shooting Incident Data in Philadelphia
philly.shooting <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/shootings.geojson") %>%
# as.data.frame() %>%
st_as_sf(crs = 4326)%>%
st_transform(crs = 'ESRI:102728')
ppr_properties_sf <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/PPR_Properties.geojson") %>%
st_as_sf(crs = 4326)%>%
st_transform(crs = 'ESRI:102728')
police_districts_sf <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/police_districts.geojson") %>%
st_as_sf(crs = 4326)%>%
st_transform(crs = 'ESRI:102728')
shootings_sf <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/shootings.geojson")%>%
st_as_sf(crs = 4326)%>%
st_transform(crs = 'ESRI:102728')
# Shooting Incident Data
shootingPol <- st_join(philly.shooting, nhoods, left = FALSE)
shootingPol_count <- shootingPol %>%
group_by(Census_Tract) %>%
summarise(shooting_count = n())
nhoods_b <- left_join(nhoods, st_drop_geometry(shootingPol_count), by = "Census_Tract") %>%
st_sf() %>%
mutate(shooting_count = ifelse(is.na(shooting_count), 0, shooting_count))
#ggplot()+
# geom_sf(data=nhoods_b, aes(fill=shooting_count))
# Green area (Public Parks Recreation) data
PPRPol <- st_join(philly.PPR, nhoods_b, left = FALSE)
PPRPol_count <- PPRPol %>%
group_by(Census_Tract) %>%
summarise(PPR_count = n())
nhoods_c <- left_join(nhoods_b, st_drop_geometry(PPRPol_count), by = "Census_Tract") %>%
st_sf() %>%
mutate(PPR_count = ifelse(is.na(PPR_count), 0, PPR_count))
#ggplot()+
# geom_sf(data=nhoods_c, aes(fill=PPR_count))
# Number of school in each tract data
schoolPol <- st_join(philly.school, nhoods_c, left = FALSE)
schoolPol_count <- schoolPol %>%
group_by(Census_Tract) %>%
summarise(school_count = n())
nhoods_d <- left_join(nhoods_c, st_drop_geometry(schoolPol_count), by = "Census_Tract") %>%
st_sf() %>%
mutate(PPR_count = ifelse(is.na(school_count), 0, school_count))
#ggplot()+
# geom_sf(data=nhoods_d, aes(fill=school_count))
# Load Crime Data in Philadelphia
philly.crime <- read.csv("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/CrimeData.csv")
philly.crime <- subset(philly.crime, the_geom!='0101000020E6100000A5A31CCC262054C0A8BE77C4B61C4540') #error
philly.crime <- subset(philly.crime, the_geom!='0101000020E6100000E3D840DB262054C0CCE8EC09B71C4540') #error
philly.crime <- st_as_sf(philly.crime, coords = c("lng", "lat"), crs = 4326) %>%
st_transform(crs = 'ESRI:102728')
# Crime Data
crimePol <- st_join(philly.crime, nhoods_d, left = FALSE)
crimePol_count <- crimePol %>%
group_by(Census_Tract) %>%
summarise(crime_count = n())
nhoods_e <- left_join(nhoods_d, st_drop_geometry(crimePol_count), by = "Census_Tract") %>%
st_sf() %>%
mutate(crime_count = ifelse(is.na(crime_count), 0, crime_count))
#ggplot()+
# geom_sf(data=nhoods_e, aes(fill=crime_count))
The map shows different types of alongside public parks and recreational areas within the city. Access to green spaces and recreational amenities is a significant driver of housing desirability and value.
ggplot() +
geom_sf(data=st_union(nhoods), fill="#DDD", color="white")+
geom_sf(data = ppr_properties_sf, fill = "green", color = "darkgreen", shape = 21, size = 1, alpha = 0.8) + # PPR properties
labs(
title = "Public Parks/Recreation (PPR) Areas in Philadelphia",
subtitle = "Visualizing the distribution of PPR properties within city limits",
caption = "Data: Philadelphia Planning Department, PPR, and City Limits") +
theme_void() +
theme(
legend.position = "right",
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)
)
This map overlays police districts with population density information for each census tract, illustrating how densely populated areas align with police resources. Population density often correlates with housing demand, as denser areas tend to have higher property values due to limited space. Understanding how law enforcement resources align with population clusters can also offer insights into perceptions of safety and how these perceptions might impact local housing prices.
philly_population <- get_acs(geography = "tract",
variables = "B01003_001E", # Total population variable
year = 2020, state = "PA",
county = "Philadelphia", geometry = TRUE) %>%
st_transform(crs = 'ESRI:102728')
philly_population$area_sq_km <- set_units(st_area(philly_population), "km^2")
# Calculate population density (people per square kilometer) and ensure consistency in units
philly_population$population_density <- philly_population$estimate / drop_units(philly_population$area_sq_km)
# Plot the map showing police districts and population density
ggplot() +
geom_sf(data = philly_population, aes(fill = population_density), color = NA) +
scale_fill_viridis_c(name = "Population Density\n(per sq km)", option = "inferno") +
geom_sf(data = police_districts_sf, fill = NA, color = "red", size = 0.8) +
labs(
title = "Police Districts and Population Density in Philadelphia",
subtitle = "Population Density per Census Tract with Police District Boundaries",
caption = "Data: U.S. Census Bureau, Philadelphia Open Data Portal") +
theme_void()
This map illustrates the concentration of shooting incidents and overlays police district boundaries to see how shootings align with law enforcement zones. Identifying high-risk areas helps us understand which neighborhoods might need more intervention or are perceived as less safe, directly influencing property values.
# Create the density map with police districts
ggplot() +
stat_density2d(data = data.frame(st_coordinates(shootings_sf)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
geom_sf(data = police_districts_sf, fill = NA, color = "black", linetype = "solid", size = 0.5) + # Police districts overlay
scale_fill_gradient(low = "lightblue", high = "darkblue",
breaks = c(0.000000002, 0.00000002),
labels = c("Minimum", "Maximum"),
name = "Density") +
scale_alpha(range = c(0.00, 0.50), guide = "none") +
labs(title = "Density of Shooting Incidents in Philadelphia",
subtitle = "Overlaid with Police District Boundaries",
caption = "Data: Philadelphia Open Data Portal") +
theme_void()
This map shows the areas within 0.5 miles (2,641 ft) of the Market-Frankford Line (MFL) and Broad Street Line, which are key public transit routes in Philadelphia. As we discovered on the last assignment, properties near public transit are often more desirable, as they offer convenient access to transportation and amenities, which can drive up property values.
# TOD Data
MFL <- st_read("https://opendata.arcgis.com/datasets/8c6e2575c8ad46eb887e6bb35825e1a6_0.geojson")
## Reading layer `SEPTA_-_Market-Frankford_Line_Stations' from data source
## `https://opendata.arcgis.com/datasets/8c6e2575c8ad46eb887e6bb35825e1a6_0.geojson'
## using driver `GeoJSON'
## Simple feature collection with 28 features and 45 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -75.25973 ymin: 39.94989 xmax: -75.07788 ymax: 40.02299
## Geodetic CRS: WGS 84
Broad_St <- st_read("https://opendata.arcgis.com/datasets/2e9037fd5bef406488ffe5bb67d21312_0.geojson")
## Reading layer `SEPTA_-_Broad_Street_Line_Stations' from data source
## `https://opendata.arcgis.com/datasets/2e9037fd5bef406488ffe5bb67d21312_0.geojson'
## using driver `GeoJSON'
## Simple feature collection with 24 features and 45 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -75.17394 ymin: 39.90543 xmax: -75.13679 ymax: 40.04185
## Geodetic CRS: WGS 84
septaStops <-
rbind(
MFL %>%
mutate(Line = "MFL") %>%
dplyr::select(Station, Line),
Broad_St %>%
mutate(Line ="Broad_St") %>%
dplyr::select(Station, Line)) %>%
st_transform(st_crs(nhoods_e))
stopBuffer <- st_buffer(septaStops, 2641)
stopUnion <- st_union(st_buffer(septaStops, 2641))
buffer <- stopUnion %>%
st_sf()
selectCentroids <-
st_centroid(nhoods_e)[buffer,] %>%
st_drop_geometry() %>%
left_join(., dplyr::select(nhoods_e, GEOID), by = "GEOID") %>%
st_sf()
selectCentroids_union <- st_union(selectCentroids)%>%
st_sf()
nhoods_f <-
rbind(
st_centroid(nhoods_e)[buffer,] %>%
st_drop_geometry() %>%
left_join(nhoods_e) %>%
st_sf() %>%
mutate(TOD = "TOD"),
st_centroid(nhoods_e)[buffer, op = st_disjoint] %>%
st_drop_geometry() %>%
left_join(nhoods_e) %>%
st_sf() %>%
mutate(TOD = "Non-TOD"))
## Plot TOD
ggplot() +
geom_sf(data=nhoods_f, fill="#DDD", color="white")+
geom_sf(data=septaStops,
aes(colour = Line),
show.legend = "point", size= 1.5) +
scale_colour_manual(values = c("orange","blue")) +
geom_sf(data=selectCentroids_union, fill='transparent', color='red')+
labs(title = "TOD area in Philadelphia",
subtitle = "0.5 miles from the MFL and Broad St Line",
caption = "Data: U.S. Census Bureau, SEPTA") +
theme_void()
This map highlights areas where crime incidents are more frequent, providing a visual representation of crime density across the city. Safety is a critical factor in homebuyer decisions—people prefer safer neighborhoods, which drives up property values there. By mapping crime density, we can factor safety into our model, predicting how values might increase in safer areas or decrease in higher-crime zones.
## Plot Density of Crime Dispatch
ggplot() +
geom_sf(data=nhoods, fill="#DDD", color="white")+
stat_density2d(data = data.frame(st_coordinates(philly.crime)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_gradient(low = "#25CB10", high = "#FA7800",
breaks=c(0.000000002,0.00000002),
labels=c("Minimum", "Maximum"),
name = "Density") +
scale_alpha(range = c(0.00, 0.50), guide = "none") +
labs(title = "Density of Crime Dispach in Philadelphia",
subtitle = "Date: January 1, 2024-October 1, 2024",
caption = "Data: U.S. Census Bureau, Open Data Philly") +
theme_void()
This map displays how close properties are to schools, which is a key factor for families seeking homes. Properties closer to schools often become more desirable, increasing demand. Families place a premium on properties near quality schools, making school proximity a crucial factor in price determination.
## Plot School
ggplot() +
geom_sf(data=nhoods_f, aes(fill=school_count), color="white")+
scale_fill_viridis_c(name = "School Count", option = "inferno") +
labs(title = "Number of Schools by Tract in Philadelphia",
caption = "Data: U.S. Census Bureau, OpenData Philly") +
theme_void()
The following plots show the relationship between sale price and various continuous variables. New data includes crime count, school count, shooting count, and public park and recreation data. The plots show that sale price is positively correlated with total area, while it is negatively correlated with age, shooting count, public parks and recreation areas and crime count. The plots also show that sale price is not significantly correlated with school count.
philly.sf_New <- philly.sf %>%
filter(sale_price < 5000000 & toPredict == "MODELLING") %>%
mutate(pricepersq = ifelse(total_area > 10, sale_price / total_area, 0))
philly.sf_New <- philly.sf_New[!is.na(as.numeric(philly.sf_New$pricepersq)), ]
philly.sf_New <- st_join(philly.sf_New, nhoods_f) # Crime Data, School Data, Green Area Data, Shooting Incident Data, TOD Data
philly.sf_New <- st_join(philly.sf_New, philly.police) # Police District Data
philly.sf_New <- st_join(philly.sf_New, neighborhoods) # Neighborhood Data
philly.sf_New <- philly.sf_New %>%
filter(name != "East Park" & name != "Wissahickon Park" & name != "Chinatown" & name != "Mechanicsville" & name != "University City" & name != "Pennypack Park")
# Price as a function of continuous variables
philly.sf_New %>%
st_drop_geometry() %>%
mutate(Age = 2024 - year_built) %>%
dplyr::select(sale_price, total_area, Age, shooting_count, PPR_count, school_count, crime_count) %>%
filter(sale_price <= 5000000, total_area <= 10000,
Age < 500) %>% # Filter out crazy outliers
gather(Variable, Value, -sale_price) %>%
ggplot(aes(Value, sale_price)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se=F, colour = "#FA7800") +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Price as a function of continuous variables") +
theme_minimal()
The following plots represents the difference in sale price by race, income, and whether it is a TOD area. The mean sale price in majority white was significantly higher than that in majority non-white, and the mean sale price in high-income areas was significantly higher than that in low-income areas. Meanwhile, the difference between TOD and non-TOD areas was not large, but the mean sale price in TOD areas was slightly higher.
# Price as a function of categorical variables
philly.sf_New <- philly.sf_New %>%
mutate(
quality_grade_new = case_when(
quality_grade == "1 " ~ "S ",
quality_grade == "2 " ~ "A ",
quality_grade == "3 " ~ "B ",
quality_grade == "4 " ~ "C ",
quality_grade == "5 " ~ "D ",
quality_grade == "6 " ~ "E ",
TRUE ~ as.character(quality_grade)
)
)
philly.sf_New$quality_grade_new <- factor(philly.sf_New$quality_grade_new, levels = c("S+", "S ", "A+", "A ", "A-", "B+", "B ", "B-", "C+", "C ", "C-", "D+", "D ", "D-", "E+", "E ", "E-"))
# levels(philly.sf_filtered$quality_grade_new)
philly.sf_New %>%
st_drop_geometry() %>%
dplyr::select(sale_price, incomeContext, raceContext, TOD) %>%
filter(sale_price <= 5000000, !is.na(incomeContext), !is.na(raceContext)) %>% # Filter out crazy outliers
gather(Variable, Value, -sale_price) %>%
ggplot(aes(Value, sale_price)) +
geom_bar(position = "dodge", stat="summary", fun="mean")+
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Price as a function of categorical variables", y="Mean_Price") +
plotTheme() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
theme_minimal()
The following chart is correlation plot that shows the relationship between numeric variables.
numericVars <-
select_if(st_drop_geometry(philly.sf_New), is.numeric) %>% na.omit()
ggcorrplot(
round(cor(numericVars), 1),
p.mat = cor_pmat(numericVars),
colors = c("#25CB10", "white", "#FA7800"),
type="lower",
insig = "blank") +
labs(title = "Correlation across numeric variables")
The following regression denotes that sale_price
,
exterior_condition
, interior_condition
,
number_of_bathrooms
, number_of_bedrooms
,
total_area
, total_livable_area
,
year_built
, number_stories
shooting_count
, crime_count
, and
TOD
were found to be correlated with
sale_price
, while
type_heater
,view_type
, and
PPR_count
showed no clear evidence of correlation with
sale price
. Because there were many missing values in
school_count
, regression values could not be obtained.
philly.sf_New$school_count <- ifelse(is.na(philly.sf_New$school_count), 0, philly.sf_New$school_count)
# is.na(philly.sf_New$school_count)
reg1 = lm(sale_price ~ ., data =
st_drop_geometry(philly.sf_New) %>%
dplyr::select(sale_price, exterior_condition, interior_condition,
number_of_bathrooms, number_of_bedrooms,
total_area, total_livable_area, type_heater,
view_type, year_built, number_stories, shooting_count, PPR_count, school_count, crime_count, TOD))
summary(reg1)
##
## Call:
## lm(formula = sale_price ~ ., data = st_drop_geometry(philly.sf_New) %>%
## dplyr::select(sale_price, exterior_condition, interior_condition,
## number_of_bathrooms, number_of_bedrooms, total_area,
## total_livable_area, type_heater, view_type, year_built,
## number_stories, shooting_count, PPR_count, school_count,
## crime_count, TOD))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2252262 -70998 -11252 49989 3810993
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 553722.0670 59464.2513 9.312 < 0.0000000000000002 ***
## exterior_condition -26700.2433 1766.2094 -15.117 < 0.0000000000000002 ***
## interior_condition -28613.4919 1838.5652 -15.563 < 0.0000000000000002 ***
## number_of_bathrooms 60913.8737 2227.6278 27.345 < 0.0000000000000002 ***
## number_of_bedrooms -26451.6802 1066.2799 -24.807 < 0.0000000000000002 ***
## total_area -0.3317 0.3102 -1.069 0.284954
## total_livable_area 202.2441 2.3572 85.799 < 0.0000000000000002 ***
## type_heater0 4063.2541 17236.1603 0.236 0.813636
## type_heaterA 12119.8467 2988.9752 4.055 0.00005032877 ***
## type_heaterB -9275.0159 3500.9291 -2.649 0.008071 **
## type_heaterC -6897.3368 10363.6578 -0.666 0.505717
## type_heaterD 231329.2464 21428.4686 10.795 < 0.0000000000000002 ***
## type_heaterE 45462.8442 21902.9804 2.076 0.037938 *
## type_heaterG -7624.3958 6226.8943 -1.224 0.220802
## type_heaterH 3397.9300 2917.0122 1.165 0.244085
## view_type0 -62224.7060 24799.2999 -2.509 0.012110 *
## view_typeA 4966.6924 20513.1679 0.242 0.808688
## view_typeB 20229.3844 29639.8403 0.683 0.494925
## view_typeC 182173.5961 21143.7392 8.616 < 0.0000000000000002 ***
## view_typeD -20369.5147 24504.0534 -0.831 0.405829
## view_typeE -50831.8708 24740.2644 -2.055 0.039927 *
## view_typeH 7967.1794 23334.9414 0.341 0.732785
## view_typeI 3073.2911 19439.1740 0.158 0.874381
## year_built -167.2654 28.1106 -5.950 0.00000000272 ***
## number_stories 11842.0226 2062.0367 5.743 0.00000000942 ***
## shooting_count -1121.5567 27.1865 -41.254 < 0.0000000000000002 ***
## PPR_count -3983.8069 904.6838 -4.404 0.00001069728 ***
## school_count NA NA NA NA
## crime_count -28.8719 8.0289 -3.596 0.000324 ***
## TODTOD 61117.0594 2509.8704 24.351 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 154000 on 23403 degrees of freedom
## (266 observations deleted due to missingness)
## Multiple R-squared: 0.5633, Adjusted R-squared: 0.5628
## F-statistic: 1078 on 28 and 23403 DF, p-value: < 0.00000000000000022
We divided number_stories
into three groups (Up to 2
Floors, 3 Floors, 4+ Floors) and performed regression analysis. The
results are as follows.
# Categorize Number of Stories
philly.sf_New <-
philly.sf_New %>%
mutate(number_stories.cat = case_when(
number_stories >= 0 & number_stories < 3 ~ "Up to 2 Floors",
number_stories >= 3 & number_stories < 4 ~ "3 Floors",
number_stories > 4 ~ "4+ Floors"
))
reg2 <- lm(sale_price ~ ., data =
st_drop_geometry(philly.sf_New) %>%
dplyr::select(sale_price, exterior_condition, interior_condition,
number_of_bathrooms, number_of_bedrooms,
total_area, total_livable_area, type_heater,
view_type, year_built, number_stories.cat, shooting_count, PPR_count, school_count, crime_count, TOD))
summary(reg2)
##
## Call:
## lm(formula = sale_price ~ ., data = st_drop_geometry(philly.sf_New) %>%
## dplyr::select(sale_price, exterior_condition, interior_condition,
## number_of_bathrooms, number_of_bedrooms, total_area,
## total_livable_area, type_heater, view_type, year_built,
## number_stories.cat, shooting_count, PPR_count, school_count,
## crime_count, TOD))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2151905 -68687 -10403 49506 3840631
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 463466.3695 57331.6626 8.084
## exterior_condition -25926.5733 1725.1513 -15.029
## interior_condition -27917.1888 1792.3469 -15.576
## number_of_bathrooms 57670.1015 2183.3574 26.413
## number_of_bedrooms -23579.4317 1043.7718 -22.591
## total_area 0.1719 0.3048 0.564
## total_livable_area 192.3956 2.4512 78.491
## type_heater0 7083.0815 16883.1429 0.420
## type_heaterA 12617.4026 2913.1631 4.331
## type_heaterB -11872.8520 3404.3831 -3.488
## type_heaterC -6986.3475 10068.7071 -0.694
## type_heaterD 232163.4261 20855.9064 11.132
## type_heaterE 49071.3715 21540.8822 2.278
## type_heaterG -5429.5187 6062.0646 -0.896
## type_heaterH 5355.2005 2826.8743 1.894
## view_type0 -59252.5331 24150.1624 -2.454
## view_typeA 9903.6023 19994.8658 0.495
## view_typeB 22838.0764 28859.1103 0.791
## view_typeC 186366.8527 20598.7134 9.047
## view_typeD -14437.6299 23866.2115 -0.605
## view_typeE -44611.4126 24097.9268 -1.851
## view_typeH 12723.9461 22734.7448 0.560
## view_typeI 5967.5524 18939.7381 0.315
## year_built -93.2800 27.5116 -3.391
## number_stories.cat4+ Floors 734862.5603 86734.7612 8.473
## number_stories.catUp to 2 Floors -36718.5589 3541.2890 -10.369
## shooting_count -1092.3269 26.5004 -41.219
## PPR_count -3603.6422 881.5378 -4.088
## school_count NA NA NA
## crime_count -28.7629 7.8262 -3.675
## TODTOD 55536.9422 2458.9433 22.586
## Pr(>|t|)
## (Intercept) 0.000000000000000657 ***
## exterior_condition < 0.0000000000000002 ***
## interior_condition < 0.0000000000000002 ***
## number_of_bathrooms < 0.0000000000000002 ***
## number_of_bedrooms < 0.0000000000000002 ***
## total_area 0.572825
## total_livable_area < 0.0000000000000002 ***
## type_heater0 0.674829
## type_heaterA 0.000014893813424129 ***
## type_heaterB 0.000488 ***
## type_heaterC 0.487772
## type_heaterD < 0.0000000000000002 ***
## type_heaterE 0.022732 *
## type_heaterG 0.370446
## type_heaterH 0.058186 .
## view_type0 0.014154 *
## view_typeA 0.620388
## view_typeB 0.428739
## view_typeC < 0.0000000000000002 ***
## view_typeD 0.545225
## view_typeE 0.064145 .
## view_typeH 0.575710
## view_typeI 0.752703
## year_built 0.000699 ***
## number_stories.cat4+ Floors < 0.0000000000000002 ***
## number_stories.catUp to 2 Floors < 0.0000000000000002 ***
## shooting_count < 0.0000000000000002 ***
## PPR_count 0.000043674366054245 ***
## school_count NA
## crime_count 0.000238 ***
## TODTOD < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 150000 on 23340 degrees of freedom
## (328 observations deleted due to missingness)
## Multiple R-squared: 0.5687, Adjusted R-squared: 0.5682
## F-statistic: 1061 on 29 and 23340 DF, p-value: < 0.00000000000000022
As the same method in Chapter 2, we split our dataset into training and testing sets to evaluate our model’s performance. We used a 60/40 split for training and testing respectively.
The result of the proposed model is as follows:
- Mean Absolute Error (MAE): $86,664.03
- Mean Absolute Percent Error (MAPE): 53.3%
* The result of knit can be different.
M In Chapter 2, MAE of the current model was $100,278 and MAPE of the current model was 63.5%. We can say that the proposed model is more accurate than the current model.
inTrain <- createDataPartition(
y = paste(philly.sf_New$number_stories.cat,
philly.sf_New$view_type, philly.sf_New$heat_type),
p = .60, list = FALSE)
philly.training <- philly.sf_New[inTrain,]
philly.test <- philly.sf_New[-inTrain,]
reg.training <-
lm(sale_price ~ ., data = as.data.frame(philly.training) %>%
dplyr::select(sale_price, exterior_condition, interior_condition,
number_of_bathrooms, number_of_bedrooms,
total_area, total_livable_area, type_heater,
view_type, year_built, number_stories.cat, shooting_count, PPR_count, school_count, crime_count, TOD))
summary(reg.training)
##
## Call:
## lm(formula = sale_price ~ ., data = as.data.frame(philly.training) %>%
## dplyr::select(sale_price, exterior_condition, interior_condition,
## number_of_bathrooms, number_of_bedrooms, total_area,
## total_livable_area, type_heater, view_type, year_built,
## number_stories.cat, shooting_count, PPR_count, school_count,
## crime_count, TOD))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1074673 -68580 -9827 50952 3812301
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 490137.34729 79557.55234 6.161
## exterior_condition -28919.35845 2256.62974 -12.815
## interior_condition -24841.55802 2366.35275 -10.498
## number_of_bathrooms 56226.61177 2821.43041 19.928
## number_of_bedrooms -23902.67515 1349.35375 -17.714
## total_area 0.07366 0.36953 0.199
## total_livable_area 196.93821 3.18435 61.846
## type_heater0 6551.31164 22476.65009 0.291
## type_heaterA 14042.50489 3791.35342 3.704
## type_heaterB -11951.99850 4395.60560 -2.719
## type_heaterC 4951.23661 13903.87959 0.356
## type_heaterD 203546.11393 27558.43175 7.386
## type_heaterE 15365.67297 25743.32118 0.597
## type_heaterG -74.21530 7947.24310 -0.009
## type_heaterH 8117.54188 3682.87989 2.204
## view_type0 -55009.72618 31875.84085 -1.726
## view_typeA 8006.77022 26563.89203 0.301
## view_typeB 18499.71463 37468.51683 0.494
## view_typeC 175299.09257 27330.83944 6.414
## view_typeD -25057.70347 31466.92885 -0.796
## view_typeE -46509.41187 31751.29592 -1.465
## view_typeH -11260.21426 29976.84517 -0.376
## view_typeI -5480.26176 25189.57255 -0.218
## year_built -103.97253 37.75948 -2.754
## number_stories.cat4+ Floors 970997.60613 107180.06165 9.059
## number_stories.catUp to 2 Floors -37169.59740 4623.52985 -8.039
## shooting_count -1081.11325 34.30396 -31.516
## PPR_count -4548.45351 1158.22264 -3.927
## school_count NA NA NA
## crime_count -27.71435 10.11622 -2.740
## TODTOD 59215.81273 3206.96427 18.465
## Pr(>|t|)
## (Intercept) 0.000000000743659306 ***
## exterior_condition < 0.0000000000000002 ***
## interior_condition < 0.0000000000000002 ***
## number_of_bathrooms < 0.0000000000000002 ***
## number_of_bedrooms < 0.0000000000000002 ***
## total_area 0.842013
## total_livable_area < 0.0000000000000002 ***
## type_heater0 0.770695
## type_heaterA 0.000213 ***
## type_heaterB 0.006554 **
## type_heaterC 0.721768
## type_heaterD 0.000000000000159879 ***
## type_heaterE 0.550597
## type_heaterG 0.992549
## type_heaterH 0.027531 *
## view_type0 0.084415 .
## view_typeA 0.763102
## view_typeB 0.621497
## view_typeC 0.000000000146336550 ***
## view_typeD 0.425860
## view_typeE 0.142997
## view_typeH 0.707197
## view_typeI 0.827775
## year_built 0.005903 **
## number_stories.cat4+ Floors < 0.0000000000000002 ***
## number_stories.catUp to 2 Floors 0.000000000000000976 ***
## shooting_count < 0.0000000000000002 ***
## PPR_count 0.000086389376607873 ***
## school_count NA
## crime_count 0.006159 **
## TODTOD < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 151400 on 13996 degrees of freedom
## (199 observations deleted due to missingness)
## Multiple R-squared: 0.5695, Adjusted R-squared: 0.5686
## F-statistic: 638.4 on 29 and 13996 DF, p-value: < 0.00000000000000022
philly.test <-
philly.test %>%
mutate(Regression = "Baseline Regression",
sale_price.Predict = predict(reg.training, philly.test),
sale_price.Error = sale_price.Predict - sale_price,
sale_price.AbsError = abs(sale_price.Predict - sale_price),
sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price) %>%
filter(sale_price < 5000000)
ggplot()+
geom_histogram(data = philly.test, aes(x = sale_price.AbsError), binwidth = 10000, fill = "orange") +
scale_x_continuous(breaks=seq(0, 1000000, by = 100000), )+
coord_cartesian(xlim = c(0, 1000000)) +
labs(title = "Distribution of Prediction Errors",
subtitle = "",
caption = "") +
theme_minimal()
#Mean Absolute Error (MAE)
mean(philly.test$sale_price.AbsError, na.rm = T)
## [1] 87459.48
# Mean Absolute Percent Error (MAPE)
mean(philly.test$sale_price.APE, na.rm = T)
## [1] 0.5351495
The following plot shows that the model and perfect prediction. The orange line represents a perfect prediction (y=x), and the green line represents average prediction. This plot shows that predicted sale price with proposed model was still a little over estimated.
ggplot(data=philly.test, aes(sale_price.Predict, sale_price)) +
geom_point(size = .5) +
geom_abline(intercept = 0, slope = 1, color = "#FA7800") +
geom_smooth(method = "lm", se=F, colour = "#25CB10") +
coord_cartesian(xlim = c(0, 4000000), ylim = c(0, 4000000)) +
labs(title = "Price as a function of continuous variables") +
theme_minimal()
Compared to the current prediction model, the predicted sale price is slightly higher and the error is reduced. However, the geographical distribution of moods is almost the same.
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv <-
train(sale_price ~ ., data = st_drop_geometry(philly.sf_New) %>%
dplyr::select(sale_price, exterior_condition, interior_condition,
number_of_bathrooms, number_of_bedrooms,
total_area, total_livable_area, type_heater,
view_type, year_built, number_stories.cat, shooting_count, PPR_count, school_count, crime_count, TOD),
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv
## Linear Regression
##
## 23698 samples
## 15 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 23460, 23461, 23461, 23461, 23462, 23460, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 146256.9 0.5846468 88412.76
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
ggplot()+
geom_histogram(data = reg.cv$resample, aes(x = MAE), binwidth = 10000, fill = "orange") +
scale_x_continuous(breaks=seq(0, 600000, by = 100000), )+
coord_cartesian(xlim = c(0, 600000)) +
labs(title = "Distribution of MAE",
subtitle = "k-fold cross-validation; k=100",
caption = "") +
theme_minimal()
philly.test <- philly.test[!is.na(philly.test$sale_price.Predict),]
ggplot() +
geom_sf(data=nhoods, fill="#DDD", color="white")+
geom_sf(data = philly.test, aes(color = q5(sale_price.Predict)),
show.legend = "point", size = 0.5, alpha = 1) +
scale_color_manual(values = palette5,
labels=qBr(philly.test, "sale_price.Predict"),
name="Quintile\nBreaks($)") +
labs(
title = "Predicted Sale Price in Philadelphia",
subtitle = "Assessment date: May 24, 2022-August 14, 2023",
caption = "Data: U.S. Census Bureau, Zillow") +
theme_void()
ggplot() +
geom_sf(data=nhoods, fill="#DDD", color="white")+
geom_sf(data = philly.test, aes(color = q5(sale_price.AbsError)),
show.legend = "point", size = 0.5, alpha = 1) +
scale_color_manual(values = palette5,
labels=qBr(philly.test, "sale_price.AbsError"),
name="Quintile\nBreaks($)") +
labs(
title = "Test Set Absolute Sale Price Errors in Philadelphia",
subtitle = "Assessment date: May 24, 2022-August 14, 2023",
caption = "Data: U.S. Census Bureau, Zillow") +
theme_void()
Even in the proposed model, the errors are still clustered and spatial lag of error and sale price error are positively correlated.
coords <- st_coordinates(philly.sf_New)
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")
philly.sf_New$lagPrice <- lag.listw(spatialWeights, philly.sf_New$sale_price)
philly.test <- philly.test[!is.na(philly.test$sale_price.Error),]
coords.test <- st_coordinates(philly.test)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
# Sale Price as a function of the Spatial Lag of Price
philly.test %>%
mutate(lagPrice = lag.listw(spatialWeights.test, sale_price)) %>%
ggplot(aes(lagPrice, sale_price))+
geom_point(color = "orange") +
geom_smooth(method = "lm", se=F, colour = "#25CB10") +
labs(title = "Price as a function of the Spatial Lag of Price",
x = "Spatial Lag of Price (mean price of 5 nearest neighbors)",
y = "Sale Price") +
theme_minimal()
# Error as a Function of the Spatial Lag of Price
philly.test %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)) %>%
ggplot(aes(lagPriceError, sale_price.Error))+
geom_point(color = "orange") +
geom_smooth(method = "lm", se=F, colour = "#25CB10") +
labs(title = "Error as a Function of the Spatial Lag of Price",
x = "Spatial Lag of Errors (mean error of 5 nearest neighbors)",
y = "Sale Price Error") +
theme_minimal()
We can calculate the Pearson’s R coefficient to test this a different way
pearsons_test <-
philly.test %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error, NAOK=TRUE))
cor.test(pearsons_test$lagPriceError,
pearsons_test$sale_price.Error,
method = "pearson")
##
## Pearson's product-moment correlation
##
## data: pearsons_test$lagPriceError and pearsons_test$sale_price.Error
## t = 65.244, df = 9342, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5454005 0.5732627
## sample estimates:
## cor
## 0.5594897
We can assess the Moran’s I value is 0.359 to see if the errors are spatially autocorrelated. Moran’s I in the proposed model became lower than the current model (0.395), but still indicates positive spatial autocorrelation. This means that the errors are clustered, and the model is not capturing all the spatial patterns in the data.
moranTest <- moran.mc(philly.test$sale_price.Error,
spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "orange",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in orange",
x="Moran's I",
y="Count") +
theme_minimal()
To validate the generalizability of the proposed model, we summarize
the predictions by 158 neighborhood (name
in the dataset).
This model examines the impact of neighborhood characteristics based on
158 neighborhoods. The following table shows that the mean prediction
and actual price of 158 neighborhoods.
philly.test %>%
as.data.frame() %>%
group_by(name) %>%
summarize(meanPrediction = mean(sale_price.Predict),
meanPrice = mean(sale_price)) %>%
pander(caption = "Mean Predicted and Actual Sale Price by Neighborhood")
name | meanPrediction | meanPrice |
---|---|---|
Academy Gardens | 293483 | 300350 |
Allegheny West | 97203 | 85369 |
Andorra | 487722 | 407943 |
Aston-Woodbridge | 278812 | 279297 |
Bartram Village | 123251 | 122812 |
Bella Vista | 408471 | 535949 |
Belmont | 156508 | 146286 |
Brewerytown | 199271 | 228292 |
Bridesburg | 257782 | 209771 |
Burholme | 278010 | 257385 |
Bustleton | 354624 | 363054 |
Byberry | 513086 | 368250 |
Callowhill | 309002 | 283333 |
Carroll Park | 139529 | 118299 |
Cedar Park | 370136 | 419500 |
Cedarbrook | 316453 | 257483 |
Chestnut Hill | 665122 | 944563 |
Clearview | 197990 | 156389 |
Cobbs Creek | 221163 | 170617 |
Crescentville | 225795 | 201280 |
Dearnley Park | 509445 | 339300 |
Dickinson Narrows | 375397 | 377542 |
Dunlap | 199575 | 132950 |
East Falls | 342331 | 394138 |
East Germantown | 200894 | 151451 |
East Kensington | 407024 | 365128 |
East Mount Airy | 333963 | 303050 |
East Oak Lane | 361981 | 248553 |
East Parkside | 238520 | 155800 |
East Passyunk | 345746 | 378162 |
East Poplar | 233444 | 264996 |
Eastwick | 254580 | 232257 |
Elmwood | 124242 | 122346 |
Fairhill | -59099 | 76924 |
Fairmount | 319011 | 426223 |
Feltonville | 173099 | 121190 |
Fern Rock | 308779 | 182284 |
Fishtown - Lower Kensington | 374256 | 385690 |
Fitler Square | 438982 | 1116021 |
Fox Chase | 313783 | 322707 |
Francisville | 480053 | 522825 |
Frankford | 197198 | 124153 |
Franklinville | 119423 | 85353 |
Garden Court | 369893 | 450116 |
Germantown - Morton | 238715 | 180053 |
Germantown - Penn Knox | 326430 | 252271 |
Germantown - Westside | 365830 | 279670 |
Germany Hill | 384796 | 374416 |
Girard Estates | 306358 | 299820 |
Glenwood | 137863 | 60971 |
Graduate Hospital | 447849 | 599698 |
Grays Ferry | 120769 | 194289 |
Greenwich | 340574 | 341392 |
Haddington | 202283 | 122306 |
Harrowgate | 141409 | 90044 |
Hartranft | 158450 | 106478 |
Haverford North | 186295 | 149833 |
Hawthorne | 481607 | 599138 |
Holmesburg | 205199 | 205505 |
Hunting Park | 118034 | 94110 |
Juniata Park | 170825 | 156814 |
Kingsessing | 181774 | 142126 |
Lawndale | 229132 | 201666 |
Lexington Park | 263029 | 323492 |
Logan | 245004 | 135113 |
Logan Square | 394816 | 547362 |
Lower Moyamensing | 259531 | 234116 |
Ludlow | 374504 | 343200 |
Manayunk | 364486 | 366448 |
Mantua | 198464 | 182546 |
Mayfair | 242747 | 229553 |
McGuire | -119342 | 53200 |
Melrose Park Gardens | 241217 | 192688 |
Mill Creek | 158872 | 109421 |
Millbrook | 279656 | 283342 |
Modena | 280237 | 300000 |
Morrell Park | 282406 | 285467 |
Newbold | 330358 | 279215 |
Nicetown | 162980 | 92800 |
Normandy Village | 293909 | 277414 |
North Central | 205588 | 217361 |
Northern Liberties | 515668 | 548994 |
Northwood | 253752 | 197994 |
Ogontz | 193015 | 159895 |
Old Kensington | 503564 | 444860 |
Olney | 204609 | 153851 |
Overbrook | 283941 | 209842 |
Oxford Circle | 251697 | 218439 |
Packer Park | 523722 | 550154 |
Parkwood Manor | 271488 | 273238 |
Paschall | 140748 | 101000 |
Passyunk Square | 426200 | 442702 |
Pennsport | 358390 | 374683 |
Pennypack | 284115 | 306579 |
Pennypack Woods | 283869 | 274574 |
Penrose | 179374 | 193889 |
Point Breeze | 350273 | 339582 |
Powelton | 676212 | 470000 |
Queen Village | 421545 | 623321 |
Rhawnhurst | 266819 | 290961 |
Richmond | 165606 | 191863 |
Rittenhouse | 489333 | 838472 |
Riverfront | 506165 | 572500 |
Roxborough | 361685 | 354145 |
Roxborough Park | 479141 | 421333 |
Sharswood | 261039 | 197444 |
Society Hill | 404819 | 672283 |
Somerton | 334167 | 345849 |
Southwest Germantown | 260410 | 182117 |
Southwest Schuylkill | 152295 | 134826 |
Spring Garden | 363399 | 384521 |
Spruce Hill | 560409 | 715200 |
Stadium District | 218755 | 294280 |
Stanton | 178804 | 118298 |
Strawberry Mansion | 97946 | 104665 |
Summerdale | 180730 | 155469 |
Tacony | 227111 | 180684 |
Tioga | 189917 | 93103 |
Torresdale | 293401 | 270365 |
Upper Kensington | 31254 | 59021 |
Upper Roxborough | 364507 | 344112 |
Walnut Hill | 340056 | 321795 |
Washington Square West | 377467 | 568568 |
West Central Germantown | 399479 | 364615 |
West Kensington | 285330 | 215717 |
West Mount Airy | 619711 | 591589 |
West Oak Lane | 228893 | 197622 |
West Parkside | 39000 | 127500 |
West Passyunk | 189510 | 209869 |
West Poplar | 416075 | 455925 |
West Powelton | 328753 | 263500 |
Whitman | 257551 | 232478 |
Winchester Park | 352945 | 333068 |
Wissahickon | 344782 | 344632 |
Wissahickon Hills | 344606 | 365386 |
Wissinoming | 172466 | 158985 |
Wister | 207398 | 150118 |
Wynnefield | 262108 | 219093 |
Wynnefield Heights | 232725 | 161950 |
Yorktown | 241213 | 253750 |
reg.nhood <- lm(sale_price ~ ., data = as.data.frame(philly.training) %>%
dplyr::select(div_code, sale_price, exterior_condition, interior_condition,
number_of_bathrooms, number_of_bedrooms,
total_area, total_livable_area, type_heater,
view_type,year_built, number_stories.cat, shooting_count, PPR_count, school_count, crime_count, TOD, name))
summary(reg.nhood)
##
## Call:
## lm(formula = sale_price ~ ., data = as.data.frame(philly.training) %>%
## dplyr::select(div_code, sale_price, exterior_condition, interior_condition,
## number_of_bathrooms, number_of_bedrooms, total_area,
## total_livable_area, type_heater, view_type, year_built,
## number_stories.cat, shooting_count, PPR_count, school_count,
## crime_count, TOD, name))
##
## Residuals:
## Min 1Q Median 3Q Max
## -958080 -48459 -92 45397 3378009
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) -135242.2469 77874.5451 -1.737
## div_codeEPD -71094.7371 17853.0603 -3.982
## div_codeNEPD 73882.1219 37787.9553 1.955
## div_codeNWPD -30623.5577 72463.6701 -0.423
## div_codeSPD -127972.3942 19088.5378 -6.704
## div_codeSWPD -37545.9299 38209.2863 -0.983
## exterior_condition -17479.4094 1899.6517 -9.201
## interior_condition -23206.7347 1979.2696 -11.725
## number_of_bathrooms 34958.7686 2373.9899 14.726
## number_of_bedrooms -4316.1648 1183.1624 -3.648
## total_area 2.1810 0.3178 6.862
## total_livable_area 195.6809 2.7473 71.226
## type_heater0 2705.5964 18636.9577 0.145
## type_heaterA 4935.3139 3485.8585 1.416
## type_heaterB 1529.6906 4369.3662 0.350
## type_heaterC -72839.7887 12158.9380 -5.991
## type_heaterD 115014.9248 22800.4998 5.044
## type_heaterE -4158.9772 21778.9117 -0.191
## type_heaterG 9191.8100 6685.3171 1.375
## type_heaterH 5320.9524 3673.0713 1.449
## view_type0 4334.6676 27444.3988 0.158
## view_typeA 48651.8298 22324.6061 2.179
## view_typeB -36223.9676 31568.6564 -1.147
## view_typeC 26830.0256 22958.5792 1.169
## view_typeD 42265.7664 26148.3456 1.616
## view_typeE 27243.3066 26414.4270 1.031
## view_typeH 29422.5166 24871.5539 1.183
## view_typeI 26365.4079 20902.3159 1.261
## year_built 108.8595 33.7257 3.228
## number_stories.cat4+ Floors 787152.1858 87904.0643 8.955
## number_stories.catUp to 2 Floors 17428.9855 4107.8982 4.243
## shooting_count -347.3366 42.2635 -8.218
## PPR_count -1513.4131 1159.0202 -1.306
## school_count NA NA NA
## crime_count 1.6088 10.7904 0.149
## TODTOD 16830.7552 4008.7350 4.199
## nameAllegheny West -27700.2032 79696.4126 -0.348
## nameAndorra 54524.2898 82787.8828 0.659
## nameAston-Woodbridge -55750.5405 29665.8680 -1.879
## nameBartram Village -23375.1161 38107.6000 -0.613
## nameBella Vista 397300.0124 39587.2953 10.036
## nameBelmont 1164.6972 38532.4361 0.030
## nameBrewerytown 21334.6399 33088.0076 0.645
## nameBridesburg -102555.8736 25849.5612 -3.967
## nameBurholme -44681.8856 34528.9165 -1.294
## nameBustleton -25581.7615 22682.3491 -1.128
## nameByberry -126622.7326 49327.4751 -2.567
## nameCallowhill 96149.8416 69484.0218 1.384
## nameCarroll Park -38393.8486 25070.6024 -1.531
## nameCedar Park 144111.6279 27619.0785 5.218
## nameCedarbrook 27853.9770 80040.5996 0.348
## nameChestnut Hill 354536.1902 80378.7562 4.411
## nameClearview 48357.2960 36014.7639 1.343
## nameCobbs Creek -17968.0358 23470.0371 -0.766
## nameCrescentville -64461.5057 90185.5319 -0.715
## nameCrestmont Farms -31412.6244 90350.1466 -0.348
## nameDearnley Park 82214.8550 85418.1986 0.962
## nameDickinson Narrows 217463.8520 39136.3866 5.557
## nameDunlap -38397.3516 35940.5754 -1.068
## nameEast Falls 123157.2560 79877.8809 1.542
## nameEast Germantown -44303.5498 79776.6838 -0.555
## nameEast Kensington 122573.2036 38110.4442 3.216
## nameEast Mount Airy 18554.5503 79537.6947 0.233
## nameEast Oak Lane -57696.4338 80993.6085 -0.712
## nameEast Parkside -24373.5115 32468.5168 -0.751
## nameEast Passyunk 241414.7621 38284.3978 6.306
## nameEast Poplar 37267.2053 50218.2760 0.742
## nameEastwick 5312.2114 35476.8590 0.150
## nameElmwood -4023.2802 24140.8905 -0.167
## nameFairhill 34244.6435 41703.8801 0.821
## nameFairmount 197700.3910 32678.9043 6.050
## nameFeltonville 9980.1499 37424.8272 0.267
## nameFern Rock -67127.4496 81490.5200 -0.824
## nameFishtown - Lower Kensington 178690.0036 36480.3475 4.898
## nameFitler Square 634647.9571 41520.2464 15.285
## nameFox Chase -27674.7729 23340.0709 -1.186
## nameFrancisville 113354.2450 35730.6951 3.172
## nameFrankford -156324.2859 23444.5040 -6.668
## nameFranklinville -22693.2803 41798.1606 -0.543
## nameGarden Court 134573.9961 40675.3484 3.308
## nameGermantown - Morton -38129.0948 81124.0927 -0.470
## nameGermantown - Penn Knox 2684.5239 87377.0851 0.031
## nameGermantown - Westside -19888.1683 83439.1762 -0.238
## nameGermany Hill 81844.4684 81651.1612 1.002
## nameGirard Estates 195195.6786 38056.2665 5.129
## nameGlenwood -60083.9541 83094.1446 -0.723
## nameGraduate Hospital 385001.7907 37362.9657 10.304
## nameGrays Ferry 144776.4699 37971.4934 3.813
## nameGreenwich 168916.8857 41739.5802 4.047
## nameHaddington -45071.8048 24559.3500 -1.835
## nameHarrowgate -9154.5364 37166.2269 -0.246
## nameHartranft -43871.3962 35126.9340 -1.249
## nameHaverford North -78044.7136 39031.5509 -2.000
## nameHawthorne 333682.9234 43099.6232 7.742
## nameHolmesburg -77103.9925 22731.2997 -3.392
## nameHunting Park -18037.2394 37411.8963 -0.482
## nameJuniata Park 38271.9148 37144.3932 1.030
## nameKingsessing -27723.8975 23972.9668 -1.156
## nameLawndale -101264.8414 23067.6975 -4.390
## nameLexington Park 25101.2533 29087.7039 0.863
## nameLogan -77570.8664 79547.1706 -0.975
## nameLogan Square 234778.2626 34438.1587 6.817
## nameLower Moyamensing 145256.6523 37520.0680 3.871
## nameLudlow -18685.5834 46340.4219 -0.403
## nameManayunk 83990.2313 79661.4222 1.054
## nameMantua 35257.0962 29560.5400 1.193
## nameMayfair -84446.5477 22435.0855 -3.764
## nameMcGuire 62622.0076 46018.4668 1.361
## nameMelrose Park Gardens 4292.8318 82114.8503 0.052
## nameMill Creek -29218.1121 27012.8064 -1.082
## nameMillbrook -36805.8624 26656.9708 -1.381
## nameModena -32120.5754 24206.7237 -1.327
## nameMorrell Park -41243.0522 24978.6875 -1.651
## nameNewbold 186091.2459 38747.8387 4.803
## nameNicetown -50908.8093 81640.7214 -0.624
## nameNormandy Village -17065.5423 46459.4735 -0.367
## nameNorth Central -31047.0447 33951.9104 -0.914
## nameNorthern Liberties 230007.5329 35828.9306 6.420
## nameNorthwood -137491.7170 26100.7979 -5.268
## nameOgontz -27287.3521 79711.9329 -0.342
## nameOld Kensington 123301.6057 39252.4341 3.141
## nameOlney -35506.9558 79244.7743 -0.448
## nameOverbrook -21530.6055 23612.4623 -0.912
## nameOxford Circle -95461.4318 22307.8418 -4.279
## namePacker Park 226052.3357 42365.7045 5.336
## nameParkwood Manor -41554.6730 23560.0517 -1.764
## namePaschall -47376.1160 24915.8569 -1.901
## namePassyunk Square 265682.0669 38232.3460 6.949
## namePennsport 211889.8151 38609.7410 5.488
## namePennypack -18725.3035 24994.6787 -0.749
## namePennypack Woods -48731.8390 36192.1514 -1.346
## namePenrose 10131.6261 31515.1377 0.321
## namePoint Breeze 182964.5741 37177.1617 4.921
## namePowelton 260694.1638 59799.6009 4.359
## nameQueen Village 441436.4201 38837.0765 11.366
## nameRhawnhurst -19645.3595 23113.0509 -0.850
## nameRichmond 75804.4770 36359.0160 2.085
## nameRittenhouse 565823.1636 33011.0457 17.140
## nameRiverfront 206687.7891 79150.1994 2.611
## nameRoxborough 73947.9117 79489.4554 0.930
## nameRoxborough Park 82499.4279 84576.4649 0.975
## nameSharswood 56606.3969 46595.5394 1.215
## nameSociety Hill 408970.3935 33555.1169 12.188
## nameSomerton -14753.5571 22859.9565 -0.645
## nameSouthwest Germantown -31447.0920 80630.4225 -0.390
## nameSouthwest Schuylkill -4873.7931 25935.7904 -0.188
## nameSpring Garden 217508.4292 33629.3802 6.468
## nameSpruce Hill 195546.5815 35111.4075 5.569
## nameStadium District 192350.0278 40943.8174 4.698
## nameStanton -99983.5433 33328.6739 -3.000
## nameStrawberry Mansion -64330.9920 33003.1107 -1.949
## nameSummerdale -116053.6536 26121.1901 -4.443
## nameTacony -104702.7261 23113.8374 -4.530
## nameTioga -99947.1216 79912.5414 -1.251
## nameTorresdale -56131.7322 24716.3097 -2.271
## nameUpper Kensington 1431.1903 36915.3589 0.039
## nameUpper Roxborough 59966.7795 79781.2467 0.752
## nameWalnut Hill 36723.8440 36294.4456 1.012
## nameWashington Square West 330744.0149 34888.8370 9.480
## nameWest Central Germantown -37334.8593 81543.8432 -0.458
## nameWest Kensington 69705.3958 37948.2816 1.837
## nameWest Mount Airy 86475.5827 79849.8687 1.083
## nameWest Oak Lane -6850.7653 79256.2176 -0.086
## nameWest Parkside -17760.2798 52228.1287 -0.340
## nameWest Passyunk 148363.0958 38448.3091 3.859
## nameWest Poplar 38934.6920 50005.7065 0.779
## nameWest Powelton 46473.2328 36813.9390 1.262
## nameWhitman 155904.7686 38972.7675 4.000
## nameWinchester Park -23805.3236 37687.9395 -0.632
## nameWissahickon 72922.9302 80453.9826 0.906
## nameWissahickon Hills 122620.5736 84827.8386 1.446
## nameWissinoming -107276.3467 22730.1851 -4.720
## nameWister -48779.3878 81589.9932 -0.598
## nameWoodland Terrace 212236.8503 90379.9457 2.348
## nameWynnefield -49343.8847 26171.7584 -1.885
## nameWynnefield Heights NA NA NA
## nameYorktown NA NA NA
## Pr(>|t|)
## (Intercept) 0.082468 .
## div_codeEPD 0.000068624548814746 ***
## div_codeNEPD 0.050582 .
## div_codeNWPD 0.672590
## div_codeSPD 0.000000000021042707 ***
## div_codeSWPD 0.325802
## exterior_condition < 0.0000000000000002 ***
## interior_condition < 0.0000000000000002 ***
## number_of_bathrooms < 0.0000000000000002 ***
## number_of_bedrooms 0.000265 ***
## total_area 0.000000000007062227 ***
## total_livable_area < 0.0000000000000002 ***
## type_heater0 0.884576
## type_heaterA 0.156854
## type_heaterB 0.726273
## type_heaterC 0.000000002142008747 ***
## type_heaterD 0.000000460695242559 ***
## type_heaterE 0.848557
## type_heaterG 0.169177
## type_heaterH 0.147461
## view_type0 0.874504
## view_typeA 0.029327 *
## view_typeB 0.251209
## view_typeC 0.242574
## view_typeD 0.106034
## view_typeE 0.302381
## view_typeH 0.236838
## view_typeI 0.207199
## year_built 0.001250 **
## number_stories.cat4+ Floors < 0.0000000000000002 ***
## number_stories.catUp to 2 Floors 0.000022218456474646 ***
## shooting_count 0.000000000000000224 ***
## PPR_count 0.191653
## school_count NA
## crime_count 0.881483
## TODTOD 0.000027034298856716 ***
## nameAllegheny West 0.728167
## nameAndorra 0.510162
## nameAston-Woodbridge 0.060227 .
## nameBartram Village 0.539623
## nameBella Vista < 0.0000000000000002 ***
## nameBelmont 0.975887
## nameBrewerytown 0.519077
## nameBridesburg 0.000073023119075921 ***
## nameBurholme 0.195672
## nameBustleton 0.259413
## nameByberry 0.010269 *
## nameCallowhill 0.166451
## nameCarroll Park 0.125686
## nameCedar Park 0.000000183645054455 ***
## nameCedarbrook 0.727847
## nameChestnut Hill 0.000010375681362365 ***
## nameClearview 0.179389
## nameCobbs Creek 0.443943
## nameCrescentville 0.474766
## nameCrestmont Farms 0.728088
## nameDearnley Park 0.335816
## nameDickinson Narrows 0.000000028021154490 ***
## nameDunlap 0.285378
## nameEast Falls 0.123140
## nameEast Germantown 0.578668
## nameEast Kensington 0.001302 **
## nameEast Mount Airy 0.815547
## nameEast Oak Lane 0.476255
## nameEast Parkside 0.452857
## nameEast Passyunk 0.000000000295360940 ***
## nameEast Poplar 0.458037
## nameEastwick 0.880974
## nameElmwood 0.867641
## nameFairhill 0.411582
## nameFairmount 0.000000001487748413 ***
## nameFeltonville 0.789726
## nameFern Rock 0.410098
## nameFishtown - Lower Kensington 0.000000977839542346 ***
## nameFitler Square < 0.0000000000000002 ***
## nameFox Chase 0.235754
## nameFrancisville 0.001515 **
## nameFrankford 0.000000000026942406 ***
## nameFranklinville 0.587190
## nameGarden Court 0.000940 ***
## nameGermantown - Morton 0.638356
## nameGermantown - Penn Knox 0.975491
## nameGermantown - Westside 0.811609
## nameGermany Hill 0.316184
## nameGirard Estates 0.000000295008330100 ***
## nameGlenwood 0.469641
## nameGraduate Hospital < 0.0000000000000002 ***
## nameGrays Ferry 0.000138 ***
## nameGreenwich 0.000052177028145784 ***
## nameHaddington 0.066495 .
## nameHarrowgate 0.805443
## nameHartranft 0.211709
## nameHaverford North 0.045571 *
## nameHawthorne 0.000000000000010452 ***
## nameHolmesburg 0.000696 ***
## nameHunting Park 0.629724
## nameJuniata Park 0.302861
## nameKingsessing 0.247511
## nameLawndale 0.000011424367647150 ***
## nameLexington Park 0.388180
## nameLogan 0.329500
## nameLogan Square 0.000000000009654858 ***
## nameLower Moyamensing 0.000109 ***
## nameLudlow 0.686789
## nameManayunk 0.291746
## nameMantua 0.233004
## nameMayfair 0.000168 ***
## nameMcGuire 0.173599
## nameMelrose Park Gardens 0.958308
## nameMill Creek 0.279432
## nameMillbrook 0.167387
## nameModena 0.184554
## nameMorrell Park 0.098735 .
## nameNewbold 0.000001582395147524 ***
## nameNicetown 0.532919
## nameNormandy Village 0.713385
## nameNorth Central 0.360501
## nameNorthern Liberties 0.000000000141081178 ***
## nameNorthwood 0.000000140196566596 ***
## nameOgontz 0.732112
## nameOld Kensington 0.001686 **
## nameOlney 0.654112
## nameOverbrook 0.361873
## nameOxford Circle 0.000018876100378693 ***
## namePacker Park 0.000000096654400958 ***
## nameParkwood Manor 0.077792 .
## namePaschall 0.057265 .
## namePassyunk Square 0.000000000003839318 ***
## namePennsport 0.000000041368009326 ***
## namePennypack 0.453767
## namePennypack Woods 0.178171
## namePenrose 0.747848
## namePoint Breeze 0.000000869048906350 ***
## namePowelton 0.000013132208723538 ***
## nameQueen Village < 0.0000000000000002 ***
## nameRhawnhurst 0.395357
## nameRichmond 0.037098 *
## nameRittenhouse < 0.0000000000000002 ***
## nameRiverfront 0.009029 **
## nameRoxborough 0.352239
## nameRoxborough Park 0.329358
## nameSharswood 0.224446
## nameSociety Hill < 0.0000000000000002 ***
## nameSomerton 0.518686
## nameSouthwest Germantown 0.696531
## nameSouthwest Schuylkill 0.850944
## nameSpring Garden 0.000000000102770967 ***
## nameSpruce Hill 0.000000026049887132 ***
## nameStadium District 0.000002653740300370 ***
## nameStanton 0.002705 **
## nameStrawberry Mansion 0.051287 .
## nameSummerdale 0.000008944558394892 ***
## nameTacony 0.000005951244738983 ***
## nameTioga 0.211063
## nameTorresdale 0.023160 *
## nameUpper Kensington 0.969075
## nameUpper Roxborough 0.452280
## nameWalnut Hill 0.311637
## nameWashington Square West < 0.0000000000000002 ***
## nameWest Central Germantown 0.647067
## nameWest Kensington 0.066253 .
## nameWest Mount Airy 0.278837
## nameWest Oak Lane 0.931119
## nameWest Parkside 0.733823
## nameWest Passyunk 0.000114 ***
## nameWest Poplar 0.436226
## nameWest Powelton 0.206833
## nameWhitman 0.000063577183942097 ***
## nameWinchester Park 0.527631
## nameWissahickon 0.364744
## nameWissahickon Hills 0.148334
## nameWissinoming 0.000002386739447581 ***
## nameWister 0.549943
## nameWoodland Terrace 0.018875 *
## nameWynnefield 0.059399 .
## nameWynnefield Heights NA
## nameYorktown NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 123800 on 13852 degrees of freedom
## (199 observations deleted due to missingness)
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7113
## F-statistic: 200.7 on 173 and 13852 DF, p-value: < 0.00000000000000022
philly.test.nhood <-
philly.test %>%
mutate(Regression = "Neighborhood Effects",
sale_price.Predict = predict(reg.nhood, philly.test),
sale_price.Error = sale_price.Predict- sale_price,
sale_price.AbsError = abs(sale_price.Predict- sale_price),
sale_price.APE = (abs(sale_price.Predict- sale_price)) / sale_price)%>%
filter(sale_price < 5000000)
bothRegressions <-
rbind(
dplyr::select(philly.test, starts_with("sale_price"), Regression, name) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)),
dplyr::select(philly.test.nhood, starts_with("sale_price"), Regression, name) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)))
The MAE and MAPE of the proposed model, which accounts for the neighborhood effect, are as follows: when incorporating the neighborhood effect, the MAE decreases by 20,000 and the MAPE decreases by 18%p compared to the proposed model without the neighborhood effect.
st_drop_geometry(bothRegressions) %>%
gather(Variable, Value, -Regression, -name) %>%
filter(Variable == "sale_price.AbsError" | Variable == "sale_price.APE") %>%
group_by(Regression, Variable) %>%
summarize(meanValue = mean(Value)) %>%
spread(Variable, meanValue) %>%
pander(caption = "Mean Absolute Error and Absolute Percentage Error by Regression")
Regression | sale_price.AbsError | sale_price.APE |
---|---|---|
Baseline Regression | 87459 | 0.5351 |
Neighborhood Effects | 67766 | 0.3774 |
bothRegressions %>%
dplyr::select(sale_price.Predict, sale_price, Regression) %>%
ggplot(aes(sale_price, sale_price.Predict)) +
geom_point() +
stat_smooth(aes(sale_price, sale_price),
method = "lm", se = FALSE, size = 1, colour="orange") +
stat_smooth(aes(sale_price.Predict, sale_price),
method = "lm", se = FALSE, size = 1, colour="olivedrab") +
facet_wrap(~Regression) +
labs(title="Predicted sale price as a function of observed price",
subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
theme_minimal()
The following results of table and graph denote there is significant difference in MAE and MAPE by regression methods, particularly in Central and North Philadelphia.
st_drop_geometry(bothRegressions) %>%
group_by(Regression, name) %>%
summarize(mean.MAPE = mean(sale_price.APE, na.rm = T)) %>%
ungroup() %>%
left_join(neighborhoods, by = c("name" = "name")) %>%
st_sf() %>%
ggplot() +
geom_sf(aes(fill = 100*mean.MAPE), color = "transparent") +
geom_sf(data = bothRegressions, colour = "gray30", size = .5) +
facet_wrap(~Regression) +
scale_fill_gradient(low = palette5[1], high = palette5[5],
name = "MAPE") +
labs(title = "Mean test set MAPE by neighborhood (Neighborhood)") +
theme_void()
The following scatterplot shows the average MAPE by 158 neighborhood mean price. The overall MAPE values were lower after accounting for the neighborhood effect, and the peak MAPE values observed before considering the neighborhood effect disappeared (Fairhill and McGuire). In both regression methods, MAPE did show an decrease depending on the sale price. However, in both methods, MAPE has a positive value.
a <- st_drop_geometry(bothRegressions) %>%
group_by(Regression, name) %>%
summarize(mean.sale_price = mean(sale_price, na.rm = T))
a2 <- st_drop_geometry(bothRegressions) %>%
group_by(Regression, name) %>%
summarize(mean.MAPE = mean(sale_price.APE, na.rm = T))
a3<- left_join(a, a2, by = c("name" = "name", "Regression" = "Regression"))
ggplot(a3) +
geom_point(aes(x = mean.sale_price, y = 100*mean.MAPE), color = "gray30", size = 1) +
facet_wrap(~Regression) +
scale_fill_gradient(low = palette5[1], high = palette5[5],
name = "MAPE") +
labs(title = "Average MAPE by Neighborhood (Neighborhood) Mean Price") +
theme_minimal()
We validated if the proposed model that accounts for neighborhood effect is applied to different group contexts: race and income.
In race, the regression that considered neighborhood effects resulted in a smaller MAPE in the majority white group. In income, the regression that considered neighborhood effects resulted in a smaller MAPE in the high-income group.
For the majority non-White and low-income groups, the MAPE decreased significantly—by more than 30%p and 25%p, respectively—when the neighborhood effect was considered compared to when it was not. It means the ‘neighborhood effect’ played an important role in predicting house prices.
As a result of generalizability, however, it can be concluded that it is relatively difficult to apply to the Majority Non-White or Low Income group. We should consider other additional variables that may resolve the correlation between the groups or analyze more data.
grid.arrange(ncol = 2,
ggplot() + geom_sf(data = na.omit(nhoods), aes(fill = raceContext), color = "#999") +
scale_fill_manual(values = c("olivedrab", "orange"), name="Race Context") +
labs(title = "Race Context") +
theme_void() + theme(legend.position="bottom"),
ggplot() + geom_sf(data = na.omit(nhoods), aes(fill = incomeContext), color = "#999") +
scale_fill_manual(values = c("olivedrab", "orange"), name="Income Context") +
labs(title = "Income Context") +
theme_void() + theme(legend.position="bottom"))
st_join(bothRegressions, nhoods_f) %>%
filter(!is.na(raceContext)) %>%
group_by(Regression, raceContext) %>%
summarize(mean.MAPE = 100*(mean(sale_price.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(raceContext, mean.MAPE) %>%
pander(caption = "Test set MAPE by neighborhood race context(%)")
Regression | Majority Non-White | Majority White |
---|---|---|
Baseline Regression | 74.56 | 27.67 |
Neighborhood Effects | 48.3 | 24.78 |
st_join(bothRegressions, nhoods_f) %>%
filter(!is.na(incomeContext)) %>%
group_by(Regression, incomeContext) %>%
summarize(mean.MAPE = 100*(mean(sale_price.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(incomeContext, mean.MAPE) %>%
pander(caption = "Test set MAPE by neighborhood income context(%)")
Regression | High Income | Low Income |
---|---|---|
Baseline Regression | 27.94 | 64.37 |
Neighborhood Effects | 25.78 | 42.89 |
We recommend that Zillow use our proposed OLS model instead of the current one. Our proposed model aimed to predict housing prices using key features such as property condition, total livable area, number of bedrooms and bathrooms, proximity to crime, and access to amenities like schools and parks. These variables are crucial as they directly influence property values based on what homebuyers prioritize. The current Zillow model mainly focuses on internal characteristics and does not focus on public services or (dis)amenities. However, our proposed model was able to obtain significantly reduced MAE and MAPE values by incorporating six public services, including crime, shooting, and police districts, which are actual factors that people around Philadelphia often consider when choosing their home location. Our proposed model finally did effectively account for neighborhood effects, possibly because the neighborhood defined by 158 neighborhoods significantly decreased the MAE and MAPE.
While the model captured a significant portion of price variation, it was not perfect. First, metrics like MAE and MAPE decreased compared to the current Zillow model, but there was still a positive correlation. We also observed positive spatial autocorrelation in the model errors, as indicated by Moran’s I statistics. We need to identify more variables with spatial correlation.
Second, regression analysis did not work well for school data because the number of schools per tract had a lot of NA data. In the future, it is necessary to reanalyze the distance to the nearest school from each house as continuous data.
Third, our proposed model should still improve generalizability. The overall MAPE values were lower after accounting for the neighborhood effect, but MAPE by different group context has a positive value and large difference between two different groups which was race and house income, respectively. The model showed better performance in high-income and predominantly white neighborhoods, evidenced by lower MAE values, possibly due to more stable socio-economic factors in these areas. To improve generalization performance by eliminating these effects, we need to add sufficient other variables that can predict the variability occurring in low-income and major non-white groups, or change the weights of each variable.
In sum, further refinement is necessary to address local neighborhood variations that are not fully captured by the current set of predictors. Incorporating more granular, demographically aligned, or culturally relevant boundaries could improve the model’s predictive power, allowing it to capture neighborhood effects more accurately and potentially reduce the overall error rate.
philly.sf_Chall <- philly.sf %>%
filter(sale_price < 5000000 & toPredict == "CHALLENGE") %>%
mutate(pricepersq = ifelse(total_area > 10, sale_price / total_area, 0))
philly.sf_Chall <- philly.sf_Chall[!is.na(as.numeric(philly.sf_Chall$pricepersq)), ]
philly.sf_Chall <- st_join(philly.sf_Chall, nhoods_f) # Crime Data, School Data, Green Area Data, Shooting Incident Data, TOD Data
philly.sf_Chall <- st_join(philly.sf_Chall, philly.police) # Police District Data
philly.sf_Chall$sale_price <- 798332.8947 + philly.sf_Chall$exterior_condition * -27278.6103 +philly.sf_Chall$interior_condition* -27822.5115 +philly.sf_Chall$number_of_bathrooms* 60246.0186 +philly.sf_Chall$number_of_bedrooms* -25245.0484 +philly.sf_Chall$total_livable_area* 193.4259 +philly.sf_Chall$year_built * -161.9947 +philly.sf_Chall$shooting_count * -1020.9659 +philly.sf_Chall$crime_count * -18.7429 + ifelse(philly.sf_Chall$type_heater == "A", 13823.6306,
ifelse(philly.sf_Chall$type_heater == "B", -8972.2810,
ifelse(philly.sf_Chall$type_heater == "D", 239670.7969,
ifelse(philly.sf_Chall$type_heater == "E", 58203.8361,
ifelse(philly.sf_Chall$type_heater == "H", 7760.8603,0)))))+
ifelse(philly.sf_Chall$view_type == "0", -82474.3969,
ifelse(philly.sf_Chall$view_type == "C", 182468.1793,
ifelse(philly.sf_Chall$view_type == "E", -50004.9183,0)))
#+ifelse(philly.sf_Chall$number_stories.cat =="4+ Floors",730714.0040,
#ifelse(philly.sf_Chall$number_stories.cat =="Up to 2 Floors",-45798.0893,0))
MUSA <- philly.sf_Chall %>%
st_drop_geometry() %>%
dplyr::select(musaID, sale_price) %>%
mutate(team = "Agarwal and Jun")
write.csv(MUSA,"mean_prediction_AgarwalJun.csv", row.names = FALSE)