Introduction

Zillow has struggled to accurately estimate home prices, partly because it is a national platform that often lacks detailed, localized information. Urban dynamics have become increasingly complex, and understanding housing markets requires a more granular approach that incorporates local factors like government policies, crime, gentrification, pollution levels, and access to public transit. Our objective is to use local data to create a model that predicts Philadelphia home prices with greater accuracy.

This markdown consists of two parts. First, we analyze the current house prices prediction model. We determine the correlation and spatial autocorrelation between house prices and variables in the current model, and derive the problems of the house prices prediction model. Second, we propose an improved model. After adding six continuous and categorical variables that affect house prices but are not considered in the current model to the model, we will predict house prices and verify the accuracy and generalization of the improved model.

Our overall modeling strategy focuses on using open-source datasets, such as those available from the Philadelphia Open Data Portal, to enhance our understanding of the local housing market. By incorporating data on crime, zoning, pollution, and transit-oriented development, we aim to build a more contextually rich model that reflects the true dynamics of the city.

Methods

Our approach to developing a more accurate home price prediction model for Philadelphia involved three main phases: data collection and preparation, analysis of the existing OLS model, and proposal of an improved model.

1.⁠ ⁠Data Collection and Preparation We gathered localized datasets from the Philadelphia Open Data Portal, aiming to capture the unique characteristics influencing home prices within the city. These datasets included:

Property Attributes: sale price, total area, livable area, age, and internal/external quality assessments. Neighborhood Characteristics: crime rates, public amenities, and zoning classifications.
Accessibility Indicators: transit-oriented development zones, proximity to subway stations, and public parks.
Socioeconomic Indicators: demographic trends and school quality. To ensure consistent spatial alignment, we mapped all spatial datasets to Philadelphia’s local coordinate system (ESRI:102728). Outliers were identified and removed, such as properties with sale prices exceeding $5 million, to focus the analysis on a more representative price range.

2.⁠ ⁠Analysis of Current OLS Model

We analyzed Zillow’s existing OLS model by examining correlations between sale price and both continuous (e.g., age, total area) and categorical variables (e.g., quality grade, view type). Our analysis showed:

Negative correlation between age and sale price, suggesting that older properties tend to sell for less.

Positive correlation between total area, livable area, and sale price, indicating larger properties command higher prices.

We evaluated these correlations using a correlation matrix, allowing us to select non-collinear predictors for regression. In this phase, we tested multiple regression models using variables like total livable area, number of bathrooms, and internal/external condition, observing that livable area was the most significant predictor.

3.⁠ ⁠Proposed Model

We sought to improve prediction accuracy by integrating six new variables beyond the original model:

Crime Data: Frequency of incidents per neighborhood to capture safety perception.
Shooting Incidents: High-density crime areas, often perceived as riskier. Public Transit Proximity: Impact of TOD zones on home values.
School Access: Number of schools per census tract as an indicator of educational access.
Environmental Quality: Pollution and green space proximity.
Gentrification Indicators: Presence of rapidly changing areas.

Using multiple linear regression, we compared our model’s Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) against Zillow’s baseline. We implemented 100-fold cross-validation to assess model generalizability, allowing systematic testing across varied training and testing splits. We also applied spatial analysis techniques, such as Moran’s I, to evaluate spatial clustering in model errors, highlighting areas for improvement.

4.⁠ ⁠Statistical and Spatial Analysis

We assessed spatial lags for both price and model errors to understand the clustering of predictions and identify areas where neighborhood effects were inadequately captured. Additionally, we tested spatial autocorrelation in errors using Moran’s I, finding positive autocorrelation that suggested the model could benefit from further integration of neighborhood-specific factors.

This approach leverages spatial and socioeconomic variables to capture Philadelphia’s local housing dynamics, ultimately reducing MAE and MAPE compared to Zillow’s model. Further refinements, such as geographically weighted regression and enhanced socio-demographic data, could improve predictive accuracy across diverse neighborhoods and price ranges.

1. Data Collection and Preparation

Load Libraries and Initial Setup

# Install and Load Libraries
# install.packages('mapview')
library(tidyverse)
library(tidycensus)
library(sf)
library(knitr)
library(kableExtra)
library(mapview)
library(dplyr)
library(scales)
library(viridis)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(jtools)     # for regression model plots
library(broom)
library(tufte)
library(rmarkdown)
library(pander)
library(classInt)
library(ggplot2)
library(units)

# functions and data directory
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"

# Source for Multiple Ring Buffer
source(
  paste0("https://raw.githubusercontent.com/urbanSpatial/",
         "Public-Policy-Analytics-Landing/master/functions.r"))

# Set Color Table
palette5 <- c("#25CB10", "#5AB60C", "#8FA108",   "#C48C04", "#FA7800")

# Disable Scientific Notation
options(scipen=999)
options(tigris_class = "sf")

# Set ACS API Key
census_api_key("b2835f54d89a4499ba29829c908967b86765b345", overwrite = TRUE)

Data loading

We used spatial data techniques to ensure all data were correctly mapped to Philadelphia’s local coordinate system (ESRI:102728).

# Load Philadelphia Zillow Data
philly.sf <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/studentData.geojson") %>% 
#  as.data.frame() %>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728') 

# Get ACS data
nhoods <-  
  get_acs(geography = "tract",
          variables = c("B01003_001E","B02001_002E",
                        "B06011_001E"), 
          year=2020, state="PA",
          county="Philadelphia", geometry=TRUE) %>%
  st_transform(crs = 'ESRI:102728') %>%
  separate(NAME, into = c("Census_Tract", "City_State"), sep = ", ", extra = "merge") %>%
  mutate(
    Census_Tract = gsub("Census Tract ", "", Census_Tract),
    City_State = gsub(" County, Pennsylvania", "", City_State)) %>%
  dplyr::select(  -moe)%>%
  spread(key = variable, value = estimate) %>%
  rename(TotalPop = B01003_001,
         NumberWhites = B02001_002,
         Median_Income = B06011_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(Median_Income > 35322, "High Income", "Low Income"))

neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/philadelphia.geojson") %>%
  st_transform(crs = 'ESRI:102728') 

# Filter to eliminate outlier
philly.sf_filtered <- philly.sf %>%
  filter(sale_price < 5000000 & toPredict == "MODELLING") %>%
  mutate(pricepersq = ifelse(total_area > 10, sale_price / total_area, 0)) 
philly.sf_filtered <- philly.sf_filtered[!is.na(as.numeric(philly.sf_filtered$pricepersq)), ]

House Price per square foot in Philadelphia

The following chart shows House Price per square foot in Philadelphia. Center City and the Northeast are mostly in the 1st quintile for house prices per square foot, while Southwest, Northwest, and Northeast Philadelphia are around in the 3rd quintile. By mapping these prices, we can pinpoint high-demand neighborhoods and observe how features like proximity to parks or specific zoning affect property desirability. In addition, it helps us understand the spatial distribution of property values and reveals which areas are more desirable based on pricing trends.

ggplot() + 
  geom_sf(data=nhoods, fill="#DDD", color="white")+ 
  geom_sf(data = philly.sf_filtered, aes(color = q5(pricepersq)),  
          show.legend = "point", size = 0.5, alpha = 1) +  
  scale_color_manual(values = palette5,
                     labels=qBr(philly.sf_filtered, "pricepersq"),
                     name="Quintile\nBreaks($/sqft)") + 
  labs(
    title = "House Price per square foot in Philadelphia",
    subtitle = "Assessment date: May 24, 2022-August 14, 2023",
    caption = "Data: U.S. Census Bureau, Zillow") +
  theme_void()

#mapview(philly.sf_filtered)

Summary of Other Variables

# Drop geometry and calculate summary statistics
data_for_summary <- philly.sf_filtered %>%
  st_drop_geometry() %>%
  select(
    number_of_bathrooms,
    number_of_bedrooms,
    total_area,
    total_livable_area,
    year_built,
    exterior_condition,
    interior_condition,
    sale_price
  ) %>%
  mutate(
    property_age = 2024 - year_built # Assuming '2024' as the current year for age
  )

# Calculate the summary statistics
sum_stats_combined <- data_for_summary %>%
  summarize(
    Mean = round(c(
      mean(number_of_bathrooms, na.rm = TRUE),
      mean(number_of_bedrooms, na.rm = TRUE),
      mean(total_area, na.rm = TRUE),
      mean(total_livable_area, na.rm = TRUE),
      mean(property_age, na.rm = TRUE),
      mean(exterior_condition, na.rm = TRUE),
      mean(interior_condition, na.rm = TRUE),
      mean(sale_price, na.rm = TRUE)), 2),
    
    Median = round(c(
      median(number_of_bathrooms, na.rm = TRUE),
      median(number_of_bedrooms, na.rm = TRUE),
      median(total_area, na.rm = TRUE),
      median(total_livable_area, na.rm = TRUE),
      median(property_age, na.rm = TRUE),
      median(exterior_condition, na.rm = TRUE),
      median(interior_condition, na.rm = TRUE),
      median(sale_price, na.rm = TRUE)), 2),
    
    Min = round(c(
      min(number_of_bathrooms, na.rm = TRUE),
      min(number_of_bedrooms, na.rm = TRUE),
      min(total_area, na.rm = TRUE),
      min(total_livable_area, na.rm = TRUE),
      min(property_age, na.rm = TRUE),
      min(exterior_condition, na.rm = TRUE),
      min(interior_condition, na.rm = TRUE),
      min(sale_price, na.rm = TRUE)), 2),
    
    Max = round(c(
      max(number_of_bathrooms, na.rm = TRUE),
      max(number_of_bedrooms, na.rm = TRUE),
      max(total_area, na.rm = TRUE),
      max(total_livable_area, na.rm = TRUE),
      max(property_age, na.rm = TRUE),
      max(exterior_condition, na.rm = TRUE),
      max(interior_condition, na.rm = TRUE),
      max(sale_price, na.rm = TRUE)), 2),
    
    SD = round(c(
      sd(number_of_bathrooms, na.rm = TRUE),
      sd(number_of_bedrooms, na.rm = TRUE),
      sd(total_area, na.rm = TRUE),
      sd(total_livable_area, na.rm = TRUE),
      sd(property_age, na.rm = TRUE),
      sd(exterior_condition, na.rm = TRUE),
      sd(interior_condition, na.rm = TRUE),
      sd(sale_price, na.rm = TRUE)), 2)
  ) %>%
  mutate(
    `Variable Name` = c("number_of_bathrooms", "number_of_bedrooms", 
                        "total_area", "total_livable_area", 
                        "property_age", "exterior_condition", 
                        "interior_condition", "sale_price"),
    
    Description = c("Number of Bathrooms", "Number of Bedrooms", 
                    "Total Area of Property", "Total Livable Area of Property",
                    "Property Age", "Exterior Condition Rating", 
                    "Interior Condition Rating", "Sale Price of Property"),
    
    Category = c(rep("Physical characteristics", 4), 
                 "Age", rep("Condition", 2), "Financial")
  ) %>%
  select(Category, `Variable Name`, Description, everything())

# Display the summary statistics table with formatting
sum_stats_combined %>%
  kable(format = "html", caption = "Table: Summary statistics for numeric variables by category") %>%
  row_spec(1:4, background = "#ffffff50") %>%
  row_spec(5, background = "#f2f2f250") %>%
  row_spec(6:7, background = "#e6e6e680") %>%
  kable_styling(bootstrap_options = c("hover"), fixed_thead = TRUE)

Table: Summary statistics for numeric variables by category
Category	Variable Name	Description	Mean	Median	Min	Max	SD
Physical characteristics	number_of_bathrooms	Number of Bathrooms	1.07	1	0	8	0.67
Physical characteristics	number_of_bedrooms	Number of Bedrooms	2.58	3	0	31	1.27
Physical characteristics	total_area	Total Area of Property	1828.80	1179	0	226512	3824.23
Physical characteristics	total_livable_area	Total Livable Area of Property	1333.19	1208	0	15246	546.51
Age	property_age	Property Age	87.46	99	0	2024	49.43
Condition	exterior_condition	Exterior Condition Rating	3.77	4	1	7	0.87
Condition	interior_condition	Interior Condition Rating	3.65	4	0	7	0.88
Financial	sale_price	Sale Price of Property	271929.21	229000	10100	4900000	234193.33

2. Current OLS Model Analysis

Price as a function of continuous variables

We looked at the relationship between House Price i.e. sale_price and Age, total_area, and total_livable_area, which are continuous variables currently used by Zillow. Age was calculated by subtracting the year of completion, year_built, from this year, 2024. As a result, Age and sale_price showed a negative correlation, while total_area and sale_price and total_livable_area and sale_price showed a positive correlation. We filtered our datasets to remove outliers and errors, such as excluding properties with sale prices above $5 million to focus on the majority market segment.

# Price as a function of continuous variables
philly.sf_filtered %>%
  st_drop_geometry() %>%
  mutate(Age = 2024 - year_built) %>%
  dplyr::select(sale_price, total_area, Age, total_livable_area) %>%
  filter(sale_price <= 5000000, total_area <= 10000, total_livable_area <= 5000, 
         Age < 500) %>% # Filter out crazy outliers
  gather(Variable, Value, -sale_price) %>% 
   ggplot(aes(Value, sale_price)) +
     geom_point(size = .5) + 
     geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Price as a function of continuous variables",
          subtitle="Sale Price: $, Age: year(s), Area: sqft, Total Livable Area: sqft") +
     theme_minimal()

Price as a function of categorical variables

We also examine the relationship between sale_price and quality_grade, view_type, exterior_condition, and interior_condition, which are categorical variables currently used by Zillow. Quality_grade has two types of grading systems, so we changed the 1-6 system to an S-A-B-C-D-E system. The following histograms show that as the condition or quality grade goes high, the sale price tends to increase.

# Price as a function of categorical variables

philly.sf_filtered <- philly.sf_filtered %>%
  mutate(
    quality_grade_new = case_when(
      quality_grade == "1 " ~ "S ",  
      quality_grade == "2 " ~ "A ",  
      quality_grade == "3 " ~ "B ",  
      quality_grade == "4 " ~ "C ", 
      quality_grade == "5 " ~ "D ", 
      quality_grade == "6 " ~ "E ", 
      TRUE ~ as.character(quality_grade)        
    )
  )

philly.sf_filtered$quality_grade_new <- factor(philly.sf_filtered$quality_grade_new, levels = c("S+", "S ", "A+", "A ", "A-", "B+", "B ", "B-", "C+", "C ", "C-", "D+", "D ", "D-", "E+", "E ", "E-"))
# levels(philly.sf_filtered$quality_grade_new)
philly.sf_filtered %>%
  st_drop_geometry() %>%
  dplyr::select(sale_price, quality_grade_new, view_type, exterior_condition, interior_condition) %>%
  filter(sale_price <= 1000000, !is.na(interior_condition), interior_condition != 0, view_type != "", view_type != 0, quality_grade_new != "") %>% # Filter out crazy outliers
  gather(Variable, Value, -sale_price) %>%
  ggplot(aes(Value, sale_price)) +
    geom_bar(position = "dodge", stat="summary", fun="mean")+
    facet_wrap(~Variable, ncol = 2, scales = "free") +
    labs(title = "Price as a function of categorical variables", y="Mean_Price") +
    plotTheme() + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))+
    theme_minimal()

The following diagram shows correlation matrix across numeric variables, which helps us decide to choose which variable for regression. It shows how different property and neighborhood numeric factors relate to each other and to home prices. It uses values between -1 and 1 to indicate the strength and direction of these relationships: - Positive values mean that as one variable increases, so does the other (e.g., property size and price). - Negative values show that as one variable increases, the other decreases (e.g., distance from the city center and price). - Values near zero suggest no clear relationship.

Since it is helpful for analysis to perform regression by selecting variables that are not correlated with each other, year_built, total_area, and number_stories were chosen for the subsequent regression analysis.

numericVars <- 
  select_if(st_drop_geometry(philly.sf_filtered), is.numeric) %>% na.omit()

ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  insig = "blank") +  
    labs(title = "Correlation across numeric variables")

Multiple Linear Regression Model Results

The following is the result of linear regression analysis of sale_price and total_livable_area. If this result is statistically significant, we can predict sale_price from total_livable_area using the formula SalePrice_i = -65,358 + 252.99Xi + ei. The example below is statistically significant because the p-value is small, and if total_livable_area are 2,262 sqft, we can say that the predicted sale_price is -65,358+252.99x2,262=$507,000.

livingReg <- lm(sale_price ~ total_livable_area, data = philly.sf_filtered)
summary(livingReg)

## 
## Call:
## lm(formula = sale_price ~ total_livable_area, data = philly.sf_filtered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2963434  -100606   -19246    69176  3911388 
## 
## Coefficients:
##                      Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)        -65357.845   3236.965  -20.19 <0.0000000000000002 ***
## total_livable_area    252.993      2.247  112.61 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 189000 on 23703 degrees of freedom
## Multiple R-squared:  0.3485, Adjusted R-squared:  0.3485 
## F-statistic: 1.268e+04 on 1 and 23703 DF,  p-value: < 0.00000000000000022

This regression analysis can be extended to multiple variables. The following denotes that sale_price, exterior_condition, interior_condition, number_of_bathrooms, number_of_bedrooms, total_livable_area, type_heater, year_built, and number_stories were found to be correlated with sale price, while total_area and view_type showed no clear evidence of correlation with sale price. We removed parcel_shape because it contains only one ‘D’ value, which makes dividing the data into test and training sets impossible.

reg1 = lm(sale_price ~ ., data = 
            st_drop_geometry(philly.sf_filtered) %>%
            filter(number_of_bedrooms <13) %>%
            dplyr::select(sale_price, exterior_condition, interior_condition, 
                          number_of_bathrooms, number_of_bedrooms,
                          total_area, total_livable_area, type_heater, 
                          view_type, year_built, number_stories))
summary(reg1)

## 
## Call:
## lm(formula = sale_price ~ ., data = st_drop_geometry(philly.sf_filtered) %>% 
##     filter(number_of_bedrooms < 13) %>% dplyr::select(sale_price, 
##     exterior_condition, interior_condition, number_of_bathrooms, 
##     number_of_bedrooms, total_area, total_livable_area, type_heater, 
##     view_type, year_built, number_stories))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2521256   -82578   -12655    60090  3825388 
## 
## Coefficients:
##                         Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)         367739.59845  63229.63736   5.816      0.0000000061079 ***
## exterior_condition  -37451.65930   1887.05912 -19.847 < 0.0000000000000002 ***
## interior_condition  -26394.16316   1976.31186 -13.355 < 0.0000000000000002 ***
## number_of_bathrooms  69992.38887   2449.30053  28.576 < 0.0000000000000002 ***
## number_of_bedrooms  -41559.64236   1200.83142 -34.609 < 0.0000000000000002 ***
## total_area              -0.05222      0.33045  -0.158             0.874435    
## total_livable_area     216.49972      2.51828  85.971 < 0.0000000000000002 ***
## type_heater0         36254.74058  18521.03755   1.957             0.050302 .  
## type_heaterA         38553.45983   3173.42475  12.149 < 0.0000000000000002 ***
## type_heaterB         13393.14301   3739.78342   3.581             0.000343 ***
## type_heaterC         25736.27846  11129.93332   2.312             0.020767 *  
## type_heaterD        278963.64750  23014.73383  12.121 < 0.0000000000000002 ***
## type_heaterE         86043.92597  23535.90821   3.656             0.000257 ***
## type_heaterG         -2073.41856   6689.82181  -0.310             0.756612    
## type_heaterH          2302.29790   3131.02207   0.735             0.462153    
## view_type0          -84520.30980  26618.55376  -3.175             0.001499 ** 
## view_typeA           29405.09534  22013.74444   1.336             0.181640    
## view_typeB           63892.75561  31807.47220   2.009             0.044577 *  
## view_typeC          210729.11889  22714.28495   9.277 < 0.0000000000000002 ***
## view_typeD          -20698.92314  26335.83105  -0.786             0.431899    
## view_typeE          -46473.14493  26580.12568  -1.748             0.080405 .  
## view_typeH           16053.34952  25087.63073   0.640             0.522250    
## view_typeI           11350.53124  20898.37655   0.543             0.587046    
## year_built             -89.20768     29.89523  -2.984             0.002848 ** 
## number_stories       14605.88394   2212.10634   6.603      0.0000000000412 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 165600 on 23411 degrees of freedom
##   (251 observations deleted due to missingness)
## Multiple R-squared:  0.4957, Adjusted R-squared:  0.4952 
## F-statistic:   959 on 24 and 23411 DF,  p-value: < 0.00000000000000022

We divided number_stories into three groups (Up to 2 Floors, 3 Floors, 4+ Floors) and performed regression analysis. The results are as follows.

# Categorize Number of Stories
philly.sf_filtered <-
  philly.sf_filtered %>%
  mutate(number_stories.cat = case_when(
    number_stories >= 0 & number_stories < 3 ~ "Up to 2 Floors",
    number_stories >= 3 & number_stories < 4 ~ "3 Floors",
    number_stories > 4 ~ "4+ Floors"
  ))

reg2 <- lm(sale_price ~ ., data = 
            st_drop_geometry(philly.sf_filtered) %>%
            dplyr::select(sale_price, exterior_condition, interior_condition, 
                          number_of_bathrooms, number_of_bedrooms,
                          total_area, total_livable_area, type_heater, 
                          view_type, year_built, number_stories.cat))
summary(reg2)

## 
## Call:
## lm(formula = sale_price ~ ., data = st_drop_geometry(philly.sf_filtered) %>% 
##     dplyr::select(sale_price, exterior_condition, interior_condition, 
##         number_of_bathrooms, number_of_bedrooms, total_area, 
##         total_livable_area, type_heater, view_type, year_built, 
##         number_stories.cat))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2360228   -80731   -11461    59821  3870160 
## 
## Coefficients:
##                                     Estimate  Std. Error t value
## (Intercept)                      239825.2385  61007.0510   3.931
## exterior_condition               -36339.7202   1847.4969 -19.670
## interior_condition               -26039.5099   1929.7698 -13.494
## number_of_bathrooms               59553.8245   2349.8048  25.344
## number_of_bedrooms               -32765.3007   1112.4259 -29.454
## total_area                            0.7790      0.3256   2.392
## total_livable_area                  201.6932      2.6339  76.576
## type_heater0                      37546.8341  18181.0871   2.065
## type_heaterA                      38583.0065   3100.0228  12.446
## type_heaterB                       8422.3315   3646.0934   2.310
## type_heaterC                      29722.5686  10827.0012   2.745
## type_heaterD                     280510.5222  22446.7370  12.497
## type_heaterE                      87731.8702  23197.4335   3.782
## type_heaterG                       -835.0819   6526.1774  -0.128
## type_heaterH                       3139.2226   3039.3621   1.033
## view_type0                       -78211.5136  25977.4159  -3.011
## view_typeA                        36045.5687  21502.6498   1.676
## view_typeB                        70821.8996  31033.5056   2.282
## view_typeC                       218674.2867  22174.7120   9.861
## view_typeD                       -13321.0688  25705.3411  -0.518
## view_typeE                       -38733.7420  25945.9282  -1.493
## view_typeH                        21977.7716  24493.9795   0.897
## view_typeI                        15120.9048  20404.5295   0.741
## year_built                           13.7549     29.3130   0.469
## number_stories.cat4+ Floors      677963.5782  93438.7191   7.256
## number_stories.catUp to 2 Floors -53050.6479   3789.9301 -13.998
##                                              Pr(>|t|)    
## (Intercept)                         0.000084799494045 ***
## exterior_condition               < 0.0000000000000002 ***
## interior_condition               < 0.0000000000000002 ***
## number_of_bathrooms              < 0.0000000000000002 ***
## number_of_bedrooms               < 0.0000000000000002 ***
## total_area                                   0.016753 *  
## total_livable_area               < 0.0000000000000002 ***
## type_heater0                                 0.038919 *  
## type_heaterA                     < 0.0000000000000002 ***
## type_heaterB                                 0.020899 *  
## type_heaterC                                 0.006052 ** 
## type_heaterD                     < 0.0000000000000002 ***
## type_heaterE                                 0.000156 ***
## type_heaterG                                 0.898183    
## type_heaterH                                 0.301682    
## view_type0                                   0.002609 ** 
## view_typeA                                   0.093687 .  
## view_typeB                                   0.022492 *  
## view_typeC                       < 0.0000000000000002 ***
## view_typeD                                   0.604308    
## view_typeE                                   0.135486    
## view_typeH                                   0.369583    
## view_typeI                                   0.458667    
## year_built                                   0.638901    
## number_stories.cat4+ Floors         0.000000000000412 ***
## number_stories.catUp to 2 Floors < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 161600 on 23351 degrees of freedom
##   (328 observations deleted due to missingness)
## Multiple R-squared:  0.4998, Adjusted R-squared:  0.4992 
## F-statistic: 933.2 on 25 and 23351 DF,  p-value: < 0.00000000000000022

Accuracy of the Current Model

To assess the accuracy of the current model, we calculated the Mean Absolute Error (MAE) and Mean Absolute Percent Error (MAPE). To do this, the dataset was randomly divided into training data and testing data in a ratio of 6:4, and then the difference between the observed value and the actual value was calculated.

The MAE is the average of the absolute differences between predicted and observed values, while the MAPE is the average of the absolute percentage differences between predicted and observed values. The MAE and MAPE are useful metrics for evaluating the accuracy of regression models, as they provide a measure of how well the model predicts actual values.

inTrain <- createDataPartition(
              y = paste(philly.sf_filtered$number_stories.cat, 
                        philly.sf_filtered$view_type, philly.sf_filtered$heat_type), 
              p = .60, list = FALSE)
philly.training <- philly.sf_filtered[inTrain,] 
philly.test <- philly.sf_filtered[-inTrain,]

reg.training <- 
  lm(sale_price ~ ., data = as.data.frame(philly.training) %>% 
                             dplyr::select(sale_price, exterior_condition, interior_condition, number_of_bathrooms, number_of_bedrooms, total_area, total_livable_area, type_heater, view_type, year_built, number_stories.cat))

summary(reg.training)

## 
## Call:
## lm(formula = sale_price ~ ., data = as.data.frame(philly.training) %>% 
##     dplyr::select(sale_price, exterior_condition, interior_condition, 
##         number_of_bathrooms, number_of_bedrooms, total_area, 
##         total_livable_area, type_heater, view_type, year_built, 
##         number_stories.cat))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2263573   -81244   -11784    59685  3058212 
## 
## Coefficients:
##                                     Estimate  Std. Error t value
## (Intercept)                      221253.3658  94284.9960   2.347
## exterior_condition               -36972.9817   2409.3830 -15.345
## interior_condition               -25774.9487   2520.4327 -10.226
## number_of_bathrooms               68178.1442   3026.0574  22.530
## number_of_bedrooms               -33208.3228   1441.2109 -23.042
## total_area                            1.8720      0.4726   3.961
## total_livable_area                  191.9226      3.5233  54.472
## type_heater0                      48055.3735  25517.0851   1.883
## type_heaterA                      34606.7146   4023.5892   8.601
## type_heaterB                       6748.6386   4689.9764   1.439
## type_heaterC                      20372.3193  14261.6151   1.428
## type_heaterD                     328732.5501  32862.8599  10.003
## type_heaterE                      85431.9992  26204.1878   3.260
## type_heaterG                       -519.8221   8490.7987  -0.061
## type_heaterH                        341.6382   3950.8861   0.086
## view_type0                       -57844.7989  32805.1269  -1.763
## view_typeA                        51438.6471  26920.0604   1.911
## view_typeB                        70498.8011  39118.7987   1.802
## view_typeC                       250486.9406  27834.1494   8.999
## view_typeD                         -203.5163  32517.9642  -0.006
## view_typeE                       -11670.7026  32814.1360  -0.356
## view_typeH                        30549.0509  30913.5975   0.988
## view_typeI                        32568.7706  25453.8469   1.280
## year_built                           16.5537     45.5838   0.363
## number_stories.cat4+ Floors      958613.1237 115017.1006   8.335
## number_stories.catUp to 2 Floors -50325.0509   4959.2903 -10.148
##                                              Pr(>|t|)    
## (Intercept)                                   0.01896 *  
## exterior_condition               < 0.0000000000000002 ***
## interior_condition               < 0.0000000000000002 ***
## number_of_bathrooms              < 0.0000000000000002 ***
## number_of_bedrooms               < 0.0000000000000002 ***
## total_area                                   0.000075 ***
## total_livable_area               < 0.0000000000000002 ***
## type_heater0                                  0.05969 .  
## type_heaterA                     < 0.0000000000000002 ***
## type_heaterB                                  0.15019    
## type_heaterC                                  0.15318    
## type_heaterD                     < 0.0000000000000002 ***
## type_heaterE                                  0.00112 ** 
## type_heaterG                                  0.95118    
## type_heaterH                                  0.93109    
## view_type0                                    0.07787 .  
## view_typeA                                    0.05605 .  
## view_typeB                                    0.07154 .  
## view_typeC                       < 0.0000000000000002 ***
## view_typeD                                    0.99501    
## view_typeE                                    0.72210    
## view_typeH                                    0.32307    
## view_typeI                                    0.20073    
## year_built                                    0.71650    
## number_stories.cat4+ Floors      < 0.0000000000000002 ***
## number_stories.catUp to 2 Floors < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 162500 on 14008 degrees of freedom
##   (195 observations deleted due to missingness)
## Multiple R-squared:  0.4965, Adjusted R-squared:  0.4956 
## F-statistic: 552.5 on 25 and 14008 DF,  p-value: < 0.00000000000000022

philly.test <-
  philly.test %>%
  mutate(Regression = "Baseline Regression",
         sale_price.Predict = predict(reg.training, philly.test),
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price) %>%
  filter(sale_price < 5000000) 

ggplot()+
  geom_histogram(data = philly.test, aes(x = sale_price.AbsError), binwidth = 10000, fill = "orange") +
  scale_x_continuous(breaks=seq(0, 1000000, by = 100000), )+
  coord_cartesian(xlim = c(0, 1000000)) +
  labs(title = "Distribution of Prediction Errors",
  subtitle = "",
  caption = "") +
  theme_minimal()

The result of the current model is as follows:
- Mean Absolute Error (MAE): $100,278.2
- Mean Absolute Percent Error (MAPE): 63.5%
* The result of knit can be different.

The error is not trivial given the mean sale_price of $271,929. This suggests that the current model may not be accurately capturing the underlying patterns in the data.

#Mean Absolute Error (MAE)
mean(philly.test$sale_price.AbsError, na.rm = T)

## [1] 100010.9

# Mean Absolute Percent Error (MAPE)
mean(philly.test$sale_price.APE, na.rm = T)

## [1] 0.6384693

The following plot shows that the model and perfect prediction. The orange line represents a perfect prediction (y=x), and the green line represents average prediction. This plot shows that predicted sale price was a little over estimated.

ggplot(data=philly.test, aes(sale_price.Predict, sale_price)) +
     geom_point(size = .5) + 
     geom_abline(intercept = 0, slope = 1, color = "#FA7800") +
     geom_smooth(method = "lm", se=F, colour = "#25CB10") +
    coord_cartesian(xlim = c(0, 4000000), ylim = c(0, 4000000)) +
     labs(title = "Price as a function of continuous variables",
          subtitle = "Orange line represents a perfect prrediction (y=x)\nGreen line represents average prediction") +
     theme_minimal()

Generalizability - ‘k-fold’ Cross-Validation

To assess the generalizability of the current model, we performed k-fold cross-validation with k=100. This algorithm involves splitting the dataset into 100 subsets, training the model on 99 subsets, and testing it on the remaining subset. This process is repeated 100 times, with each subset used as the test set once. The results are then averaged to provide an estimate of the model’s performance on new data.

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(sale_price ~ ., data = st_drop_geometry(philly.sf_filtered) %>% 
                                dplyr::select(sale_price, exterior_condition,                
                                           interior_condition, 
                                           number_of_bathrooms, number_of_bedrooms,
                                           total_area, total_livable_area, type_heater, 
                                           view_type, year_built, 
                                           number_stories.cat), 
     method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv

## Linear Regression 
## 
## 23705 samples
##    10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 23467, 23467, 23468, 23468, 23469, 23467, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   159484.7  0.5090659  99752.15
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

The distribution of MAE values is shown in the histogram below.

ggplot()+
  geom_histogram(data = reg.cv$resample, aes(x = MAE), binwidth = 10000, fill = "orange") +
  scale_x_continuous(breaks=seq(0, 600000, by = 100000), )+
  coord_cartesian(xlim = c(0, 600000)) +
  labs(title = "Distribution of MAE",
  subtitle = "k-fold cross-validation; k=100",
  caption = "") +
  theme_minimal()

The first plot shows the predicted sale price in the test set(philly.test), and the second plot shows the absolute sale price errors. No matter how accurate this model may be, we can see that some of the errors still remain clustered on the second map. It suggests that other factors with spatial correlations remain unmodeled. We will explore this further in the next step by applying spatial lag analysis and Moran’s I.

philly.test <- philly.test[!is.na(philly.test$sale_price.Predict),]
ggplot() + 
  geom_sf(data=nhoods, fill="#DDD", color="white")+ 
  geom_sf(data = philly.test, aes(color = q5(sale_price.Predict)),  
          show.legend = "point", size = 0.5, alpha = 1) +  
  scale_color_manual(values = palette5,
                     labels=qBr(philly.test, "sale_price.Predict"),
                     name="Quintile\nBreaks($/sqft)") + 
  labs(
    title = "Predicted Sale Price in Philadelphia",
    subtitle = "Assessment date: May 24, 2022-August 14, 2023",
    caption = "Data: U.S. Census Bureau, Zillow") +
  theme_void()

ggplot() + 
  geom_sf(data=nhoods, fill="#DDD", color="white")+ 
  geom_sf(data = philly.test, aes(color = q5(sale_price.AbsError)),  
          show.legend = "point", size = 0.5, alpha = 1) +  
  scale_color_manual(values = palette5,
                     labels=qBr(philly.test, "sale_price.AbsError"),
                     name="Quintile\nBreaks($/sqft)") + 
  labs(
    title = "Test Set Absolute Sale Price Errors in Philadelphia",
    subtitle = "Assessment date: May 24, 2022-August 14, 2023",
    caption = "Data: U.S. Census Bureau, Zillow") +
  theme_void()

Check if Prices and Errors cluster

(1) Spatial Lags

In this step, we calculated for each home sale, the average sale price of its k=5 nearest neighbors. This spatial lag variable, lagPrice, represents the average sale price of the five nearest neighbors of each home sale. We then used this variable to assess the spatial autocorrelation of sale prices and errors.

coords <- st_coordinates(philly.sf_filtered) 

neighborList <- knn2nb(knearneigh(coords, 5))

spatialWeights <- nb2listw(neighborList, style="W")

philly.sf_filtered$lagPrice <- lag.listw(spatialWeights, philly.sf_filtered$sale_price)

philly.test <- philly.test[!is.na(philly.test$sale_price.Error),]
coords.test <-  st_coordinates(philly.test) 
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")

There is still positive correlation between sale_price and lagPrice and between sale_price.Error and lagPriceError. This suggests that the model’s errors are clustered spatially, indicating that the model is not adequately capturing localized factors.

# Sale Price as a function of the Spatial Lag of Price
philly.test %>% 
  mutate(lagPrice = lag.listw(spatialWeights.test, sale_price)) %>% 
  ggplot(aes(lagPrice, sale_price))+
  geom_point(color = "orange") +
  geom_smooth(method = "lm", se=F, colour = "#25CB10") +
  labs(title = "Price as a function of the Spatial Lag of Price",
       x = "Spatial Lag of Price (mean price of 5 nearest neighbors)",
       y = "Sale Price") +
  theme_minimal()

# Error as a Function of the Spatial Lag of Price
philly.test %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)) %>% 
  ggplot(aes(lagPriceError, sale_price.Error))+
  geom_point(color = "orange") +
  geom_smooth(method = "lm", se=F, colour = "#25CB10") +
  labs(title = "Error as a Function of the Spatial Lag of Price",
       x = "Spatial Lag of Errors (mean error of 5 nearest neighbors)",
       y = "Sale Price Error") +
  theme_minimal()

We can calculate the Pearson’s R coefficient to test this a different way.

pearsons_test <- 
  philly.test %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error, NAOK=TRUE))

cor.test(pearsons_test$lagPriceError,
         pearsons_test$sale_price.Error, 
         method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  pearsons_test$lagPriceError and pearsons_test$sale_price.Error
## t = 71.832, df = 9341, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5832941 0.6094222
## sample estimates:
##       cor 
## 0.5965162

(2) Moran’s I

By using the Moran’s I statistic, we can assess the global degree of clustering or dispersion of sales price values. The result shows that Moran’s I = 0.396, which indicates positive spatial autocorrelation. It suggests that the model’s errors are clustered, often occurring in specific neighborhoods or regions. If Moran’s I is close to zero, it implies that errors are more randomly distributed, indicating the model might be performing well without significant spatial bias. To improve this prediction model, we need to find more spatial variables that can reduce the spatial autocorrelation of the errors.

moranTest <- moran.mc(philly.test$sale_price.Error, 
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "orange",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  theme_minimal()

3. Proposed OLS Model

Which Data to be Included

In Chapter 2, we discovered that more spatial variables need to be incorporated on the current Zillow model so that we can reduce the spatial autocorrelation of the errors, and finally improve prediction quality in accuracy and generalization. Therefore, We collected data from open sources such as the Philadelphia Open Data Portal, focusing on datasets that provide insight into the local character of neighborhoods.

We intentionally excluded factors like walkability because, although important in urban housing preferences, it may not be universally desirable, especially for families who might prefer less densely populated areas. The following table shows the data types we included and excluded, along with the reasons for our decisions.

# Create the data frame
data <- data.frame(
  Category = c("Safety & Crime", "Urban Planning & Zoning", "Gentrification Indicators", "Accessibility & Transit", "Environmental Quality", "Education", "Quality of Life", "Public Spaces", "Government & Infrastructure"),
  `Data Type` = c("Crime Data", "Zoning Regulations", "Gentrification Indicators", "Transit Accessibility", "Pollution Levels", "School Quality", "Walkability Index", "Access to Parks", "Government Investments"),
  `Included/Excluded` = c("Included", "Included", "Included", "Included", "Included", "Included", "Excluded", "Excluded", "Excluded"),
  Reason = c(
    "Safety is a critical factor for homeowners.",
    "Impacts property types, uses, and value.",
    "Gentrification trends affect property values.",
    "Commute to work.",
    "Pollution impacts health and desirability.",
    "The quality of local schools is a major consideration for families and directly impacts property values.",
    "Walkability is valued in urban settings but may not be a priority for all buyers, such as families preferring suburban environments.",
    "Access to parks and recreational spaces may have variable impact depending on demographics, making it less reliable for general predictions.",
    "Government resources and investments are often dispersed and long-term, making their direct impact on property values harder to measure."
  )
)

# Create the kable table
kable(data, "html", col.names = c("Category", "Data Type", "Included/Excluded", "Reason")) %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  column_spec(1:4, width = "20em") # Adjust column width for better readability

Category	Data Type	Included/Excluded	Reason
Safety & Crime	Crime Data	Included	Safety is a critical factor for homeowners.
Urban Planning & Zoning	Zoning Regulations	Included	Impacts property types, uses, and value.
Gentrification Indicators	Gentrification Indicators	Included	Gentrification trends affect property values.
Accessibility & Transit	Transit Accessibility	Included	Commute to work.
Environmental Quality	Pollution Levels	Included	Pollution impacts health and desirability.
Education	School Quality	Included	The quality of local schools is a major consideration for families and directly impacts property values.
Quality of Life	Walkability Index	Excluded	Walkability is valued in urban settings but may not be a priority for all buyers, such as families preferring suburban environments.
Public Spaces	Access to Parks	Excluded	Access to parks and recreational spaces may have variable impact depending on demographics, making it less reliable for general predictions.
Government & Infrastructure	Government Investments	Excluded	Government resources and investments are often dispersed and long-term, making their direct impact on property values harder to measure.

Summary Statistics Table

This table gives an overview of different features related to properties and their surroundings, grouped into categories like Property Details, Local Amenities, Neighborhood Layout, Crime Data, and Transit/Demographics.

The table breaks down various features of properties and neighborhoods into five main groups:

Property Details: How many bedrooms and bathrooms they have, and the year they were built.
Local Amenities: Features and services that might impact a property’s appeal, such as whether a property has central air or a garage.
Neighborhood Layout: Zoning rules and the shape of the property lots, giving insight into how the area is organized.
Crime Data: Frequency of crime incidents in different areas, which can help gauge the safety of a neighborhood.
Transit/Demographics: Access to public transportation and basic demographic trends like population levels.

file_path <- "~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/Cleaned_Comprehensive_Summary_Statistics.csv"

summary_stats <- read_csv(file_path)

summary_stats_cleaned <- summary_stats %>%
  drop_na()

summary_stats_cleaned %>%
  kable("html", caption = "Summary Statistics") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Summary Statistics
Variable	Category	Description	count
total_livable_area	Internal Characteristics	Total livable area of the property (sqft)	23881
number_of_bathrooms	Internal Characteristics	Number of bathrooms	23622
number_of_bedrooms	Internal Characteristics	Number of bedrooms	23865
number_of_rooms	Internal Characteristics	Total number of rooms	666
number_stories	Internal Characteristics	Number of stories	23873
year_built	Internal Characteristics	Year the property was built	23881
zoning	Spatial Structure	Zoning classification of the parcel	23881
parcel_shape	Spatial Structure	Shape type of the parcel (e.g., rectangular, irregular)	23881
Thefts	Crime	Frequency of reported incidents for each crime type	29413
Other Assaults	Crime	Frequency of reported incidents for each crime type	20172
All Other Offenses	Crime	Frequency of reported incidents for each crime type	10846
Motor Vehicle Theft	Crime	Frequency of reported incidents for each crime type	10468
Vandalism/Criminal Mischief	Crime	Frequency of reported incidents for each crime type	10424
Theft from Vehicle	Crime	Frequency of reported incidents for each crime type	7892
Aggravated Assault No Firearm	Crime	Frequency of reported incidents for each crime type	4159
Burglary Residential	Crime	Frequency of reported incidents for each crime type	2521
Narcotic / Drug Law Violations	Crime	Frequency of reported incidents for each crime type	2262
Aggravated Assault Firearm	Crime	Frequency of reported incidents for each crime type	2117
Weapon Violations	Crime	Frequency of reported incidents for each crime type	2040

6 Newly Included Data and Visualization

We collected 6 additional datasets including crime data in 2018-2023, crime dispatch incidents in 2024, location of subway stations, schools, public parks and recreation areas from the Philadelphia Open Data portal the U.S. Census Bureau (ACS data) or other data source. We also cleaned data to exclude records with missing or erroneous geographic points. We created new variables, such as price per square foot, number of schools per tract, which are crucial for our model.

# Load Philadelphia Zillow Data
philly.sf <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/studentData.geojson") %>% 
#  as.data.frame() %>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728') 

nhoods <-  
  get_acs(geography = "tract",
          variables = c("B01003_001E","B02001_002E",
                        "B06011_001E"), 
          year=2020, state="PA",
          county="Philadelphia", geometry=TRUE) %>%
  st_transform(crs = 'ESRI:102728') %>%
  separate(NAME, into = c("Census_Tract", "City_State"), sep = ", ", extra = "merge") %>%
  mutate(
    Census_Tract = gsub("Census Tract ", "", Census_Tract),
    City_State = gsub(" County, Pennsylvania", "", City_State)) %>%
  dplyr::select(  -moe)%>%
  spread(key = variable, value = estimate) %>%
  rename(TotalPop = B01003_001,
         NumberWhites = B02001_002,
         Median_Income = B06011_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(Median_Income > 35322, "High Income", "Low Income"))

## Getting data from the 2016-2020 5-year ACS

## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.

# Load Police District Data in Philadelphia
philly.police <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/police_districts.geojson") %>% 
#  as.data.frame() %>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728') 

# Load School Data in Philadelphia
philly.school <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/Schools.geojson") %>% 
#  as.data.frame() %>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728') 

# Load Green Area (Public Parks Recreation) Data in Philadelphia
philly.PPR <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/PPR_Properties.geojson") %>% 
#  as.data.frame() %>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728') 

# Load Shooting Incident Data in Philadelphia
philly.shooting <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/shootings.geojson") %>% 
#  as.data.frame() %>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728') 

ppr_properties_sf <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/PPR_Properties.geojson") %>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728')

police_districts_sf <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/police_districts.geojson") %>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728')

shootings_sf <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/shootings.geojson")%>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728')

# Shooting Incident Data 
shootingPol <- st_join(philly.shooting, nhoods, left = FALSE)
shootingPol_count <- shootingPol %>%
  group_by(Census_Tract) %>%  
  summarise(shooting_count = n())  
nhoods_b <- left_join(nhoods, st_drop_geometry(shootingPol_count), by = "Census_Tract") %>%
  st_sf() %>%
  mutate(shooting_count = ifelse(is.na(shooting_count), 0, shooting_count)) 
  #ggplot()+
  #  geom_sf(data=nhoods_b, aes(fill=shooting_count))

# Green area (Public Parks Recreation) data
PPRPol <- st_join(philly.PPR, nhoods_b, left = FALSE)
PPRPol_count <- PPRPol %>%
  group_by(Census_Tract) %>%  
  summarise(PPR_count = n())  
nhoods_c <- left_join(nhoods_b, st_drop_geometry(PPRPol_count), by = "Census_Tract") %>%
  st_sf() %>%
  mutate(PPR_count = ifelse(is.na(PPR_count), 0, PPR_count)) 
  #ggplot()+
  #  geom_sf(data=nhoods_c, aes(fill=PPR_count))

# Number of school in each tract data
schoolPol <- st_join(philly.school, nhoods_c, left = FALSE)
schoolPol_count <- schoolPol %>%
  group_by(Census_Tract) %>%  
  summarise(school_count = n())  
nhoods_d <- left_join(nhoods_c, st_drop_geometry(schoolPol_count), by = "Census_Tract") %>%
  st_sf() %>%
  mutate(PPR_count = ifelse(is.na(school_count), 0, school_count)) 
  #ggplot()+
  #  geom_sf(data=nhoods_d, aes(fill=school_count))

# Load Crime Data in Philadelphia
philly.crime <- read.csv("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/CrimeData.csv") 
philly.crime <- subset(philly.crime, the_geom!='0101000020E6100000A5A31CCC262054C0A8BE77C4B61C4540') #error
philly.crime <- subset(philly.crime, the_geom!='0101000020E6100000E3D840DB262054C0CCE8EC09B71C4540') #error
philly.crime <- st_as_sf(philly.crime, coords = c("lng", "lat"), crs = 4326) %>%
  st_transform(crs = 'ESRI:102728') 

# Crime Data 
crimePol <- st_join(philly.crime, nhoods_d, left = FALSE)
crimePol_count <- crimePol %>%
  group_by(Census_Tract) %>%  
  summarise(crime_count = n())  
nhoods_e <- left_join(nhoods_d, st_drop_geometry(crimePol_count), by = "Census_Tract") %>%
  st_sf() %>%
  mutate(crime_count = ifelse(is.na(crime_count), 0, crime_count)) 
  #ggplot()+
  # geom_sf(data=nhoods_e, aes(fill=crime_count))

(1) Number of Public Parks/Recreation Areas by Tract in Philadelphia

The map shows different types of alongside public parks and recreational areas within the city. Access to green spaces and recreational amenities is a significant driver of housing desirability and value.

ggplot() + 
  geom_sf(data=st_union(nhoods), fill="#DDD", color="white")+
  geom_sf(data = ppr_properties_sf, fill = "green", color = "darkgreen", shape = 21, size = 1, alpha = 0.8) + # PPR properties
  labs(
    title = "Public Parks/Recreation (PPR) Areas in Philadelphia",
    subtitle = "Visualizing the distribution of PPR properties within city limits",
    caption = "Data: Philadelphia Planning Department, PPR, and City Limits") +
  theme_void() +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )

(2) Police Districts and Population Density in Philadelphia

This map overlays police districts with population density information for each census tract, illustrating how densely populated areas align with police resources. Population density often correlates with housing demand, as denser areas tend to have higher property values due to limited space. Understanding how law enforcement resources align with population clusters can also offer insights into perceptions of safety and how these perceptions might impact local housing prices.

philly_population <- get_acs(geography = "tract",
                             variables = "B01003_001E", # Total population variable
                             year = 2020, state = "PA",
                             county = "Philadelphia", geometry = TRUE) %>%
  st_transform(crs = 'ESRI:102728')

philly_population$area_sq_km <- set_units(st_area(philly_population), "km^2")

# Calculate population density (people per square kilometer) and ensure consistency in units
philly_population$population_density <- philly_population$estimate / drop_units(philly_population$area_sq_km)

# Plot the map showing police districts and population density
ggplot() + 
  geom_sf(data = philly_population, aes(fill = population_density), color = NA) +
  scale_fill_viridis_c(name = "Population Density\n(per sq km)", option = "inferno") +
  geom_sf(data = police_districts_sf, fill = NA, color = "red", size = 0.8) +
  labs(
    title = "Police Districts and Population Density in Philadelphia",
    subtitle = "Population Density per Census Tract with Police District Boundaries",
    caption = "Data: U.S. Census Bureau, Philadelphia Open Data Portal") +
  theme_void()

(3) Density of Shooting Incidents in Philadelphia with Police Districts Overlay

This map illustrates the concentration of shooting incidents and overlays police district boundaries to see how shootings align with law enforcement zones. Identifying high-risk areas helps us understand which neighborhoods might need more intervention or are perceived as less safe, directly influencing property values.

# Create the density map with police districts
ggplot() + 
  stat_density2d(data = data.frame(st_coordinates(shootings_sf)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..), 
                 size = 0.01, bins = 40, geom = 'polygon') +
  geom_sf(data = police_districts_sf, fill = NA, color = "black", linetype = "solid", size = 0.5) + # Police districts overlay
  scale_fill_gradient(low = "lightblue", high = "darkblue", 
                      breaks = c(0.000000002, 0.00000002),
                      labels = c("Minimum", "Maximum"),
                      name = "Density") +
  scale_alpha(range = c(0.00, 0.50), guide = "none") +
  labs(title = "Density of Shooting Incidents in Philadelphia",
       subtitle = "Overlaid with Police District Boundaries",
       caption = "Data: Philadelphia Open Data Portal") +
  theme_void()

(4) TOD (public transit) in Philadelphia

This map shows the areas within 0.5 miles (2,641 ft) of the Market-Frankford Line (MFL) and Broad Street Line, which are key public transit routes in Philadelphia. As we discovered on the last assignment, properties near public transit are often more desirable, as they offer convenient access to transportation and amenities, which can drive up property values.

# TOD Data 
MFL <- st_read("https://opendata.arcgis.com/datasets/8c6e2575c8ad46eb887e6bb35825e1a6_0.geojson")

## Reading layer `SEPTA_-_Market-Frankford_Line_Stations' from data source 
##   `https://opendata.arcgis.com/datasets/8c6e2575c8ad46eb887e6bb35825e1a6_0.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 28 features and 45 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -75.25973 ymin: 39.94989 xmax: -75.07788 ymax: 40.02299
## Geodetic CRS:  WGS 84

Broad_St <- st_read("https://opendata.arcgis.com/datasets/2e9037fd5bef406488ffe5bb67d21312_0.geojson")

## Reading layer `SEPTA_-_Broad_Street_Line_Stations' from data source 
##   `https://opendata.arcgis.com/datasets/2e9037fd5bef406488ffe5bb67d21312_0.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 24 features and 45 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -75.17394 ymin: 39.90543 xmax: -75.13679 ymax: 40.04185
## Geodetic CRS:  WGS 84

septaStops <- 
  rbind(
     MFL %>% 
      mutate(Line = "MFL") %>%
      dplyr::select(Station, Line),
     Broad_St %>%
      mutate(Line ="Broad_St") %>%
      dplyr::select(Station, Line)) %>%
  st_transform(st_crs(nhoods_e))  

stopBuffer <- st_buffer(septaStops, 2641)
stopUnion <- st_union(st_buffer(septaStops, 2641))
buffer <- stopUnion %>%
      st_sf() 
selectCentroids <-
  st_centroid(nhoods_e)[buffer,] %>%
  st_drop_geometry() %>%
  left_join(., dplyr::select(nhoods_e, GEOID), by = "GEOID") %>%
  st_sf() 

selectCentroids_union <- st_union(selectCentroids)%>%
  st_sf() 

nhoods_f <- 
  rbind(
    st_centroid(nhoods_e)[buffer,] %>%
      st_drop_geometry() %>%
      left_join(nhoods_e) %>%
      st_sf() %>%
      mutate(TOD = "TOD"),
    st_centroid(nhoods_e)[buffer, op = st_disjoint] %>%
      st_drop_geometry() %>%
      left_join(nhoods_e) %>%
      st_sf() %>%
      mutate(TOD = "Non-TOD"))

## Plot TOD
ggplot() + 
  geom_sf(data=nhoods_f, fill="#DDD", color="white")+
  geom_sf(data=septaStops, 
          aes(colour = Line), 
          show.legend = "point", size= 1.5) +
  scale_colour_manual(values = c("orange","blue")) +
  geom_sf(data=selectCentroids_union, fill='transparent', color='red')+
  labs(title = "TOD area in Philadelphia",
    subtitle = "0.5 miles from the MFL and Broad St Line",
    caption = "Data: U.S. Census Bureau, SEPTA") +
  theme_void()

(5) Number of Crime Dispatch by Tract in Philadelphia

This map highlights areas where crime incidents are more frequent, providing a visual representation of crime density across the city. Safety is a critical factor in homebuyer decisions—people prefer safer neighborhoods, which drives up property values there. By mapping crime density, we can factor safety into our model, predicting how values might increase in safer areas or decrease in higher-crime zones.

## Plot Density of Crime Dispatch
ggplot() + 
  geom_sf(data=nhoods, fill="#DDD", color="white")+ 
  stat_density2d(data = data.frame(st_coordinates(philly.crime)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#25CB10", high = "#FA7800", 
                     breaks=c(0.000000002,0.00000002),
                     labels=c("Minimum", "Maximum"),
                      name = "Density") +
  scale_alpha(range = c(0.00, 0.50), guide = "none") +
  labs(title = "Density of Crime Dispach in Philadelphia",
    subtitle = "Date: January 1, 2024-October 1, 2024",
    caption = "Data: U.S. Census Bureau, Open Data Philly") +
  theme_void()

(6) Number of School by Tract in Philadelphia

This map displays how close properties are to schools, which is a key factor for families seeking homes. Properties closer to schools often become more desirable, increasing demand. Families place a premium on properties near quality schools, making school proximity a crucial factor in price determination.

## Plot School
ggplot() + 
  geom_sf(data=nhoods_f, aes(fill=school_count), color="white")+
  scale_fill_viridis_c(name = "School Count", option = "inferno") +
  labs(title = "Number of Schools by Tract in Philadelphia",
    caption = "Data: U.S. Census Bureau, OpenData Philly") +
  theme_void()

Multiple Linear Regression Model Results

The following plots show the relationship between sale price and various continuous variables. New data includes crime count, school count, shooting count, and public park and recreation data. The plots show that sale price is positively correlated with total area, while it is negatively correlated with age, shooting count, public parks and recreation areas and crime count. The plots also show that sale price is not significantly correlated with school count.

philly.sf_New <- philly.sf %>%
  filter(sale_price < 5000000 & toPredict == "MODELLING") %>%
  mutate(pricepersq = ifelse(total_area > 10, sale_price / total_area, 0)) 
philly.sf_New <- philly.sf_New[!is.na(as.numeric(philly.sf_New$pricepersq)), ]
philly.sf_New <- st_join(philly.sf_New, nhoods_f) # Crime Data, School Data, Green Area Data, Shooting Incident Data, TOD Data
philly.sf_New <- st_join(philly.sf_New, philly.police) # Police District Data
philly.sf_New <- st_join(philly.sf_New, neighborhoods) # Neighborhood Data
philly.sf_New <- philly.sf_New %>%
  filter(name != "East Park" & name != "Wissahickon Park" & name != "Chinatown" & name != "Mechanicsville" & name != "University City" & name != "Pennypack Park")

# Price as a function of continuous variables
philly.sf_New %>%
  st_drop_geometry() %>%
  mutate(Age = 2024 - year_built) %>%
  dplyr::select(sale_price, total_area, Age, shooting_count, PPR_count, school_count, crime_count) %>%
  filter(sale_price <= 5000000, total_area <= 10000, 
         Age < 500) %>% # Filter out crazy outliers
  gather(Variable, Value, -sale_price) %>% 
   ggplot(aes(Value, sale_price)) +
     geom_point(size = .5) + 
     geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Price as a function of continuous variables") +
     theme_minimal()

The following plots represents the difference in sale price by race, income, and whether it is a TOD area. The mean sale price in majority white was significantly higher than that in majority non-white, and the mean sale price in high-income areas was significantly higher than that in low-income areas. Meanwhile, the difference between TOD and non-TOD areas was not large, but the mean sale price in TOD areas was slightly higher.

# Price as a function of categorical variables

philly.sf_New <- philly.sf_New %>%
  mutate(
    quality_grade_new = case_when(
      quality_grade == "1 " ~ "S ",  
      quality_grade == "2 " ~ "A ",  
      quality_grade == "3 " ~ "B ",  
      quality_grade == "4 " ~ "C ", 
      quality_grade == "5 " ~ "D ", 
      quality_grade == "6 " ~ "E ", 
      TRUE ~ as.character(quality_grade)        
    )
  )

philly.sf_New$quality_grade_new <- factor(philly.sf_New$quality_grade_new, levels = c("S+", "S ", "A+", "A ", "A-", "B+", "B ", "B-", "C+", "C ", "C-", "D+", "D ", "D-", "E+", "E ", "E-"))
# levels(philly.sf_filtered$quality_grade_new)
philly.sf_New %>%
  st_drop_geometry() %>%
  dplyr::select(sale_price, incomeContext, raceContext, TOD) %>%
  filter(sale_price <= 5000000, !is.na(incomeContext), !is.na(raceContext)) %>% # Filter out crazy outliers
  gather(Variable, Value, -sale_price) %>%
  ggplot(aes(Value, sale_price)) +
    geom_bar(position = "dodge", stat="summary", fun="mean")+
    facet_wrap(~Variable, ncol = 3, scales = "free") +
    labs(title = "Price as a function of categorical variables", y="Mean_Price") +
    plotTheme() + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))+
    theme_minimal()

The following chart is correlation plot that shows the relationship between numeric variables.

numericVars <- 
  select_if(st_drop_geometry(philly.sf_New), is.numeric) %>% na.omit()

ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  insig = "blank") +  
    labs(title = "Correlation across numeric variables")

The following regression denotes that sale_price, exterior_condition, interior_condition, number_of_bathrooms, number_of_bedrooms, total_area, total_livable_area, year_built, number_stories shooting_count, crime_count, and TOD were found to be correlated with sale_price, while type_heater,view_type, and PPR_count showed no clear evidence of correlation with sale price. Because there were many missing values in school_count, regression values could not be obtained.

philly.sf_New$school_count <- ifelse(is.na(philly.sf_New$school_count), 0, philly.sf_New$school_count)
# is.na(philly.sf_New$school_count)
reg1 = lm(sale_price ~ ., data = 
            st_drop_geometry(philly.sf_New) %>%
            dplyr::select(sale_price, exterior_condition, interior_condition, 
                          number_of_bathrooms, number_of_bedrooms,
                          total_area, total_livable_area, type_heater, 
                          view_type, year_built, number_stories, shooting_count, PPR_count, school_count, crime_count, TOD))
summary(reg1)

## 
## Call:
## lm(formula = sale_price ~ ., data = st_drop_geometry(philly.sf_New) %>% 
##     dplyr::select(sale_price, exterior_condition, interior_condition, 
##         number_of_bathrooms, number_of_bedrooms, total_area, 
##         total_livable_area, type_heater, view_type, year_built, 
##         number_stories, shooting_count, PPR_count, school_count, 
##         crime_count, TOD))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2252262   -70998   -11252    49989  3810993 
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)         553722.0670  59464.2513   9.312 < 0.0000000000000002 ***
## exterior_condition  -26700.2433   1766.2094 -15.117 < 0.0000000000000002 ***
## interior_condition  -28613.4919   1838.5652 -15.563 < 0.0000000000000002 ***
## number_of_bathrooms  60913.8737   2227.6278  27.345 < 0.0000000000000002 ***
## number_of_bedrooms  -26451.6802   1066.2799 -24.807 < 0.0000000000000002 ***
## total_area              -0.3317      0.3102  -1.069             0.284954    
## total_livable_area     202.2441      2.3572  85.799 < 0.0000000000000002 ***
## type_heater0          4063.2541  17236.1603   0.236             0.813636    
## type_heaterA         12119.8467   2988.9752   4.055        0.00005032877 ***
## type_heaterB         -9275.0159   3500.9291  -2.649             0.008071 ** 
## type_heaterC         -6897.3368  10363.6578  -0.666             0.505717    
## type_heaterD        231329.2464  21428.4686  10.795 < 0.0000000000000002 ***
## type_heaterE         45462.8442  21902.9804   2.076             0.037938 *  
## type_heaterG         -7624.3958   6226.8943  -1.224             0.220802    
## type_heaterH          3397.9300   2917.0122   1.165             0.244085    
## view_type0          -62224.7060  24799.2999  -2.509             0.012110 *  
## view_typeA            4966.6924  20513.1679   0.242             0.808688    
## view_typeB           20229.3844  29639.8403   0.683             0.494925    
## view_typeC          182173.5961  21143.7392   8.616 < 0.0000000000000002 ***
## view_typeD          -20369.5147  24504.0534  -0.831             0.405829    
## view_typeE          -50831.8708  24740.2644  -2.055             0.039927 *  
## view_typeH            7967.1794  23334.9414   0.341             0.732785    
## view_typeI            3073.2911  19439.1740   0.158             0.874381    
## year_built            -167.2654     28.1106  -5.950        0.00000000272 ***
## number_stories       11842.0226   2062.0367   5.743        0.00000000942 ***
## shooting_count       -1121.5567     27.1865 -41.254 < 0.0000000000000002 ***
## PPR_count            -3983.8069    904.6838  -4.404        0.00001069728 ***
## school_count                 NA          NA      NA                   NA    
## crime_count            -28.8719      8.0289  -3.596             0.000324 ***
## TODTOD               61117.0594   2509.8704  24.351 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 154000 on 23403 degrees of freedom
##   (266 observations deleted due to missingness)
## Multiple R-squared:  0.5633, Adjusted R-squared:  0.5628 
## F-statistic:  1078 on 28 and 23403 DF,  p-value: < 0.00000000000000022

We divided number_stories into three groups (Up to 2 Floors, 3 Floors, 4+ Floors) and performed regression analysis. The results are as follows.

# Categorize Number of Stories
philly.sf_New <-
  philly.sf_New %>%
  mutate(number_stories.cat = case_when(
    number_stories >= 0 & number_stories < 3 ~ "Up to 2 Floors",
    number_stories >= 3 & number_stories < 4 ~ "3 Floors",
    number_stories > 4 ~ "4+ Floors"
  ))

reg2 <- lm(sale_price ~ ., data = 
            st_drop_geometry(philly.sf_New) %>%
            dplyr::select(sale_price, exterior_condition, interior_condition, 
                          number_of_bathrooms, number_of_bedrooms,
                          total_area, total_livable_area, type_heater, 
                          view_type,  year_built, number_stories.cat, shooting_count, PPR_count, school_count, crime_count, TOD))
summary(reg2)

## 
## Call:
## lm(formula = sale_price ~ ., data = st_drop_geometry(philly.sf_New) %>% 
##     dplyr::select(sale_price, exterior_condition, interior_condition, 
##         number_of_bathrooms, number_of_bedrooms, total_area, 
##         total_livable_area, type_heater, view_type, year_built, 
##         number_stories.cat, shooting_count, PPR_count, school_count, 
##         crime_count, TOD))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2151905   -68687   -10403    49506  3840631 
## 
## Coefficients: (1 not defined because of singularities)
##                                     Estimate  Std. Error t value
## (Intercept)                      463466.3695  57331.6626   8.084
## exterior_condition               -25926.5733   1725.1513 -15.029
## interior_condition               -27917.1888   1792.3469 -15.576
## number_of_bathrooms               57670.1015   2183.3574  26.413
## number_of_bedrooms               -23579.4317   1043.7718 -22.591
## total_area                            0.1719      0.3048   0.564
## total_livable_area                  192.3956      2.4512  78.491
## type_heater0                       7083.0815  16883.1429   0.420
## type_heaterA                      12617.4026   2913.1631   4.331
## type_heaterB                     -11872.8520   3404.3831  -3.488
## type_heaterC                      -6986.3475  10068.7071  -0.694
## type_heaterD                     232163.4261  20855.9064  11.132
## type_heaterE                      49071.3715  21540.8822   2.278
## type_heaterG                      -5429.5187   6062.0646  -0.896
## type_heaterH                       5355.2005   2826.8743   1.894
## view_type0                       -59252.5331  24150.1624  -2.454
## view_typeA                         9903.6023  19994.8658   0.495
## view_typeB                        22838.0764  28859.1103   0.791
## view_typeC                       186366.8527  20598.7134   9.047
## view_typeD                       -14437.6299  23866.2115  -0.605
## view_typeE                       -44611.4126  24097.9268  -1.851
## view_typeH                        12723.9461  22734.7448   0.560
## view_typeI                         5967.5524  18939.7381   0.315
## year_built                          -93.2800     27.5116  -3.391
## number_stories.cat4+ Floors      734862.5603  86734.7612   8.473
## number_stories.catUp to 2 Floors -36718.5589   3541.2890 -10.369
## shooting_count                    -1092.3269     26.5004 -41.219
## PPR_count                         -3603.6422    881.5378  -4.088
## school_count                              NA          NA      NA
## crime_count                         -28.7629      7.8262  -3.675
## TODTOD                            55536.9422   2458.9433  22.586
##                                              Pr(>|t|)    
## (Intercept)                      0.000000000000000657 ***
## exterior_condition               < 0.0000000000000002 ***
## interior_condition               < 0.0000000000000002 ***
## number_of_bathrooms              < 0.0000000000000002 ***
## number_of_bedrooms               < 0.0000000000000002 ***
## total_area                                   0.572825    
## total_livable_area               < 0.0000000000000002 ***
## type_heater0                                 0.674829    
## type_heaterA                     0.000014893813424129 ***
## type_heaterB                                 0.000488 ***
## type_heaterC                                 0.487772    
## type_heaterD                     < 0.0000000000000002 ***
## type_heaterE                                 0.022732 *  
## type_heaterG                                 0.370446    
## type_heaterH                                 0.058186 .  
## view_type0                                   0.014154 *  
## view_typeA                                   0.620388    
## view_typeB                                   0.428739    
## view_typeC                       < 0.0000000000000002 ***
## view_typeD                                   0.545225    
## view_typeE                                   0.064145 .  
## view_typeH                                   0.575710    
## view_typeI                                   0.752703    
## year_built                                   0.000699 ***
## number_stories.cat4+ Floors      < 0.0000000000000002 ***
## number_stories.catUp to 2 Floors < 0.0000000000000002 ***
## shooting_count                   < 0.0000000000000002 ***
## PPR_count                        0.000043674366054245 ***
## school_count                                       NA    
## crime_count                                  0.000238 ***
## TODTOD                           < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 150000 on 23340 degrees of freedom
##   (328 observations deleted due to missingness)
## Multiple R-squared:  0.5687, Adjusted R-squared:  0.5682 
## F-statistic:  1061 on 29 and 23340 DF,  p-value: < 0.00000000000000022

Accuracy of the Proposed Model

As the same method in Chapter 2, we split our dataset into training and testing sets to evaluate our model’s performance. We used a 60/40 split for training and testing respectively.

The result of the proposed model is as follows:
- Mean Absolute Error (MAE): $86,664.03
- Mean Absolute Percent Error (MAPE): 53.3%
* The result of knit can be different.

M In Chapter 2, MAE of the current model was $100,278 and MAPE of the current model was 63.5%. We can say that the proposed model is more accurate than the current model.

inTrain <- createDataPartition(
              y = paste(philly.sf_New$number_stories.cat, 
                        philly.sf_New$view_type, philly.sf_New$heat_type), 
              p = .60, list = FALSE)

philly.training <- philly.sf_New[inTrain,] 
philly.test <- philly.sf_New[-inTrain,]

reg.training <- 
  lm(sale_price ~ ., data = as.data.frame(philly.training) %>% 
                             dplyr::select(sale_price, exterior_condition, interior_condition, 
                                           number_of_bathrooms, number_of_bedrooms,
                                           total_area, total_livable_area, type_heater, 
                                           view_type, year_built, number_stories.cat, shooting_count, PPR_count, school_count, crime_count, TOD))

summary(reg.training)

## 
## Call:
## lm(formula = sale_price ~ ., data = as.data.frame(philly.training) %>% 
##     dplyr::select(sale_price, exterior_condition, interior_condition, 
##         number_of_bathrooms, number_of_bedrooms, total_area, 
##         total_livable_area, type_heater, view_type, year_built, 
##         number_stories.cat, shooting_count, PPR_count, school_count, 
##         crime_count, TOD))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1074673   -68580    -9827    50952  3812301 
## 
## Coefficients: (1 not defined because of singularities)
##                                      Estimate   Std. Error t value
## (Intercept)                      490137.34729  79557.55234   6.161
## exterior_condition               -28919.35845   2256.62974 -12.815
## interior_condition               -24841.55802   2366.35275 -10.498
## number_of_bathrooms               56226.61177   2821.43041  19.928
## number_of_bedrooms               -23902.67515   1349.35375 -17.714
## total_area                            0.07366      0.36953   0.199
## total_livable_area                  196.93821      3.18435  61.846
## type_heater0                       6551.31164  22476.65009   0.291
## type_heaterA                      14042.50489   3791.35342   3.704
## type_heaterB                     -11951.99850   4395.60560  -2.719
## type_heaterC                       4951.23661  13903.87959   0.356
## type_heaterD                     203546.11393  27558.43175   7.386
## type_heaterE                      15365.67297  25743.32118   0.597
## type_heaterG                        -74.21530   7947.24310  -0.009
## type_heaterH                       8117.54188   3682.87989   2.204
## view_type0                       -55009.72618  31875.84085  -1.726
## view_typeA                         8006.77022  26563.89203   0.301
## view_typeB                        18499.71463  37468.51683   0.494
## view_typeC                       175299.09257  27330.83944   6.414
## view_typeD                       -25057.70347  31466.92885  -0.796
## view_typeE                       -46509.41187  31751.29592  -1.465
## view_typeH                       -11260.21426  29976.84517  -0.376
## view_typeI                        -5480.26176  25189.57255  -0.218
## year_built                         -103.97253     37.75948  -2.754
## number_stories.cat4+ Floors      970997.60613 107180.06165   9.059
## number_stories.catUp to 2 Floors -37169.59740   4623.52985  -8.039
## shooting_count                    -1081.11325     34.30396 -31.516
## PPR_count                         -4548.45351   1158.22264  -3.927
## school_count                               NA           NA      NA
## crime_count                         -27.71435     10.11622  -2.740
## TODTOD                            59215.81273   3206.96427  18.465
##                                              Pr(>|t|)    
## (Intercept)                      0.000000000743659306 ***
## exterior_condition               < 0.0000000000000002 ***
## interior_condition               < 0.0000000000000002 ***
## number_of_bathrooms              < 0.0000000000000002 ***
## number_of_bedrooms               < 0.0000000000000002 ***
## total_area                                   0.842013    
## total_livable_area               < 0.0000000000000002 ***
## type_heater0                                 0.770695    
## type_heaterA                                 0.000213 ***
## type_heaterB                                 0.006554 ** 
## type_heaterC                                 0.721768    
## type_heaterD                     0.000000000000159879 ***
## type_heaterE                                 0.550597    
## type_heaterG                                 0.992549    
## type_heaterH                                 0.027531 *  
## view_type0                                   0.084415 .  
## view_typeA                                   0.763102    
## view_typeB                                   0.621497    
## view_typeC                       0.000000000146336550 ***
## view_typeD                                   0.425860    
## view_typeE                                   0.142997    
## view_typeH                                   0.707197    
## view_typeI                                   0.827775    
## year_built                                   0.005903 ** 
## number_stories.cat4+ Floors      < 0.0000000000000002 ***
## number_stories.catUp to 2 Floors 0.000000000000000976 ***
## shooting_count                   < 0.0000000000000002 ***
## PPR_count                        0.000086389376607873 ***
## school_count                                       NA    
## crime_count                                  0.006159 ** 
## TODTOD                           < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 151400 on 13996 degrees of freedom
##   (199 observations deleted due to missingness)
## Multiple R-squared:  0.5695, Adjusted R-squared:  0.5686 
## F-statistic: 638.4 on 29 and 13996 DF,  p-value: < 0.00000000000000022

philly.test <-
  philly.test %>%
  mutate(Regression = "Baseline Regression",
         sale_price.Predict = predict(reg.training, philly.test),
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price) %>%
  filter(sale_price < 5000000) 

ggplot()+
  geom_histogram(data = philly.test, aes(x = sale_price.AbsError), binwidth = 10000, fill = "orange") +
  scale_x_continuous(breaks=seq(0, 1000000, by = 100000), )+
  coord_cartesian(xlim = c(0, 1000000)) +
  labs(title = "Distribution of Prediction Errors",
  subtitle = "",
  caption = "") +
  theme_minimal()

#Mean Absolute Error (MAE)
mean(philly.test$sale_price.AbsError, na.rm = T)

## [1] 87459.48

# Mean Absolute Percent Error (MAPE)
mean(philly.test$sale_price.APE, na.rm = T)

## [1] 0.5351495

ggplot(data=philly.test, aes(sale_price.Predict, sale_price)) +
     geom_point(size = .5) + 
     geom_abline(intercept = 0, slope = 1, color = "#FA7800") +
     geom_smooth(method = "lm", se=F, colour = "#25CB10") +
    coord_cartesian(xlim = c(0, 4000000), ylim = c(0, 4000000)) +
     labs(title = "Price as a function of continuous variables") +
     theme_minimal()

Generalizability - ‘k-fold’ Cross-Validation

Compared to the current prediction model, the predicted sale price is slightly higher and the error is reduced. However, the geographical distribution of moods is almost the same.

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(sale_price ~ ., data = st_drop_geometry(philly.sf_New) %>% 
                                dplyr::select(sale_price, exterior_condition, interior_condition, 
                                           number_of_bathrooms, number_of_bedrooms,
                                           total_area, total_livable_area, type_heater, 
                                           view_type,  year_built, number_stories.cat, shooting_count, PPR_count, school_count, crime_count, TOD), 
     method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv

## Linear Regression 
## 
## 23698 samples
##    15 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 23460, 23461, 23461, 23461, 23462, 23460, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   146256.9  0.5846468  88412.76
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

ggplot()+
  geom_histogram(data = reg.cv$resample, aes(x = MAE), binwidth = 10000, fill = "orange") +
  scale_x_continuous(breaks=seq(0, 600000, by = 100000), )+
  coord_cartesian(xlim = c(0, 600000)) +
  labs(title = "Distribution of MAE",
  subtitle = "k-fold cross-validation; k=100",
  caption = "") +
  theme_minimal()

philly.test <- philly.test[!is.na(philly.test$sale_price.Predict),]
ggplot() + 
  geom_sf(data=nhoods, fill="#DDD", color="white")+ 
  geom_sf(data = philly.test, aes(color = q5(sale_price.Predict)),  
          show.legend = "point", size = 0.5, alpha = 1) +  
  scale_color_manual(values = palette5,
                     labels=qBr(philly.test, "sale_price.Predict"),
                     name="Quintile\nBreaks($)") +
  labs(
    title = "Predicted Sale Price in Philadelphia",
    subtitle = "Assessment date: May 24, 2022-August 14, 2023",
    caption = "Data: U.S. Census Bureau, Zillow") +
  theme_void()

ggplot() + 
  geom_sf(data=nhoods, fill="#DDD", color="white")+ 
  geom_sf(data = philly.test, aes(color = q5(sale_price.AbsError)), 
          show.legend = "point", size = 0.5, alpha = 1) +  
  scale_color_manual(values = palette5,
                     labels=qBr(philly.test, "sale_price.AbsError"),
                     name="Quintile\nBreaks($)") + 
  labs(
    title = "Test Set Absolute Sale Price Errors in Philadelphia",
    subtitle = "Assessment date: May 24, 2022-August 14, 2023",
    caption = "Data: U.S. Census Bureau, Zillow") +
  theme_void()

Check if Prices and Errors Cluster

(1) Spatial Lags

Even in the proposed model, the errors are still clustered and spatial lag of error and sale price error are positively correlated.

coords <- st_coordinates(philly.sf_New) 

neighborList <- knn2nb(knearneigh(coords, 5))

spatialWeights <- nb2listw(neighborList, style="W")

philly.sf_New$lagPrice <- lag.listw(spatialWeights, philly.sf_New$sale_price)

philly.test <- philly.test[!is.na(philly.test$sale_price.Error),]
coords.test <-  st_coordinates(philly.test) 
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")

# Sale Price as a function of the Spatial Lag of Price
philly.test %>% 
  mutate(lagPrice = lag.listw(spatialWeights.test, sale_price)) %>% 
  ggplot(aes(lagPrice, sale_price))+
  geom_point(color = "orange") +
  geom_smooth(method = "lm", se=F, colour = "#25CB10") +
  labs(title = "Price as a function of the Spatial Lag of Price",
       x = "Spatial Lag of Price (mean price of 5 nearest neighbors)",
       y = "Sale Price") +
  theme_minimal()

# Error as a Function of the Spatial Lag of Price
philly.test %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)) %>% 
  ggplot(aes(lagPriceError, sale_price.Error))+
  geom_point(color = "orange") +
  geom_smooth(method = "lm", se=F, colour = "#25CB10") +
  labs(title = "Error as a Function of the Spatial Lag of Price",
       x = "Spatial Lag of Errors (mean error of 5 nearest neighbors)",
       y = "Sale Price Error") +
  theme_minimal()

We can calculate the Pearson’s R coefficient to test this a different way

pearsons_test <- 
  philly.test %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error, NAOK=TRUE))

cor.test(pearsons_test$lagPriceError,
         pearsons_test$sale_price.Error, 
         method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  pearsons_test$lagPriceError and pearsons_test$sale_price.Error
## t = 65.244, df = 9342, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5454005 0.5732627
## sample estimates:
##       cor 
## 0.5594897

(2) Moran’s I

We can assess the Moran’s I value is 0.359 to see if the errors are spatially autocorrelated. Moran’s I in the proposed model became lower than the current model (0.395), but still indicates positive spatial autocorrelation. This means that the errors are clustered, and the model is not capturing all the spatial patterns in the data.

moranTest <- moran.mc(philly.test$sale_price.Error, 
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "orange",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  theme_minimal()

Predictions by neighborhood

To validate the generalizability of the proposed model, we summarize the predictions by 158 neighborhood (name in the dataset). This model examines the impact of neighborhood characteristics based on 158 neighborhoods. The following table shows that the mean prediction and actual price of 158 neighborhoods.

philly.test %>%
as.data.frame() %>%
  group_by(name) %>%
    summarize(meanPrediction = mean(sale_price.Predict),
              meanPrice = mean(sale_price)) %>%
      pander(caption = "Mean Predicted and Actual Sale Price by Neighborhood")

Mean Predicted and Actual Sale Price by Neighborhood
name	meanPrediction	meanPrice
Academy Gardens	293483	300350
Allegheny West	97203	85369
Andorra	487722	407943
Aston-Woodbridge	278812	279297
Bartram Village	123251	122812
Bella Vista	408471	535949
Belmont	156508	146286
Brewerytown	199271	228292
Bridesburg	257782	209771
Burholme	278010	257385
Bustleton	354624	363054
Byberry	513086	368250
Callowhill	309002	283333
Carroll Park	139529	118299
Cedar Park	370136	419500
Cedarbrook	316453	257483
Chestnut Hill	665122	944563
Clearview	197990	156389
Cobbs Creek	221163	170617
Crescentville	225795	201280
Dearnley Park	509445	339300
Dickinson Narrows	375397	377542
Dunlap	199575	132950
East Falls	342331	394138
East Germantown	200894	151451
East Kensington	407024	365128
East Mount Airy	333963	303050
East Oak Lane	361981	248553
East Parkside	238520	155800
East Passyunk	345746	378162
East Poplar	233444	264996
Eastwick	254580	232257
Elmwood	124242	122346
Fairhill	-59099	76924
Fairmount	319011	426223
Feltonville	173099	121190
Fern Rock	308779	182284
Fishtown - Lower Kensington	374256	385690
Fitler Square	438982	1116021
Fox Chase	313783	322707
Francisville	480053	522825
Frankford	197198	124153
Franklinville	119423	85353
Garden Court	369893	450116
Germantown - Morton	238715	180053
Germantown - Penn Knox	326430	252271
Germantown - Westside	365830	279670
Germany Hill	384796	374416
Girard Estates	306358	299820
Glenwood	137863	60971
Graduate Hospital	447849	599698
Grays Ferry	120769	194289
Greenwich	340574	341392
Haddington	202283	122306
Harrowgate	141409	90044
Hartranft	158450	106478
Haverford North	186295	149833
Hawthorne	481607	599138
Holmesburg	205199	205505
Hunting Park	118034	94110
Juniata Park	170825	156814
Kingsessing	181774	142126
Lawndale	229132	201666
Lexington Park	263029	323492
Logan	245004	135113
Logan Square	394816	547362
Lower Moyamensing	259531	234116
Ludlow	374504	343200
Manayunk	364486	366448
Mantua	198464	182546
Mayfair	242747	229553
McGuire	-119342	53200
Melrose Park Gardens	241217	192688
Mill Creek	158872	109421
Millbrook	279656	283342
Modena	280237	300000
Morrell Park	282406	285467
Newbold	330358	279215
Nicetown	162980	92800
Normandy Village	293909	277414
North Central	205588	217361
Northern Liberties	515668	548994
Northwood	253752	197994
Ogontz	193015	159895
Old Kensington	503564	444860
Olney	204609	153851
Overbrook	283941	209842
Oxford Circle	251697	218439
Packer Park	523722	550154
Parkwood Manor	271488	273238
Paschall	140748	101000
Passyunk Square	426200	442702
Pennsport	358390	374683
Pennypack	284115	306579
Pennypack Woods	283869	274574
Penrose	179374	193889
Point Breeze	350273	339582
Powelton	676212	470000
Queen Village	421545	623321
Rhawnhurst	266819	290961
Richmond	165606	191863
Rittenhouse	489333	838472
Riverfront	506165	572500
Roxborough	361685	354145
Roxborough Park	479141	421333
Sharswood	261039	197444
Society Hill	404819	672283
Somerton	334167	345849
Southwest Germantown	260410	182117
Southwest Schuylkill	152295	134826
Spring Garden	363399	384521
Spruce Hill	560409	715200
Stadium District	218755	294280
Stanton	178804	118298
Strawberry Mansion	97946	104665
Summerdale	180730	155469
Tacony	227111	180684
Tioga	189917	93103
Torresdale	293401	270365
Upper Kensington	31254	59021
Upper Roxborough	364507	344112
Walnut Hill	340056	321795
Washington Square West	377467	568568
West Central Germantown	399479	364615
West Kensington	285330	215717
West Mount Airy	619711	591589
West Oak Lane	228893	197622
West Parkside	39000	127500
West Passyunk	189510	209869
West Poplar	416075	455925
West Powelton	328753	263500
Whitman	257551	232478
Winchester Park	352945	333068
Wissahickon	344782	344632
Wissahickon Hills	344606	365386
Wissinoming	172466	158985
Wister	207398	150118
Wynnefield	262108	219093
Wynnefield Heights	232725	161950
Yorktown	241213	253750

reg.nhood <- lm(sale_price ~ ., data = as.data.frame(philly.training) %>% 
                                 dplyr::select(div_code, sale_price, exterior_condition, interior_condition, 
                                           number_of_bathrooms, number_of_bedrooms,
                                           total_area, total_livable_area, type_heater, 
                                           view_type,year_built, number_stories.cat, shooting_count, PPR_count, school_count, crime_count, TOD, name))

summary(reg.nhood)

## 
## Call:
## lm(formula = sale_price ~ ., data = as.data.frame(philly.training) %>% 
##     dplyr::select(div_code, sale_price, exterior_condition, interior_condition, 
##         number_of_bathrooms, number_of_bedrooms, total_area, 
##         total_livable_area, type_heater, view_type, year_built, 
##         number_stories.cat, shooting_count, PPR_count, school_count, 
##         crime_count, TOD, name))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -958080  -48459     -92   45397 3378009 
## 
## Coefficients: (3 not defined because of singularities)
##                                      Estimate   Std. Error t value
## (Intercept)                      -135242.2469   77874.5451  -1.737
## div_codeEPD                       -71094.7371   17853.0603  -3.982
## div_codeNEPD                       73882.1219   37787.9553   1.955
## div_codeNWPD                      -30623.5577   72463.6701  -0.423
## div_codeSPD                      -127972.3942   19088.5378  -6.704
## div_codeSWPD                      -37545.9299   38209.2863  -0.983
## exterior_condition                -17479.4094    1899.6517  -9.201
## interior_condition                -23206.7347    1979.2696 -11.725
## number_of_bathrooms                34958.7686    2373.9899  14.726
## number_of_bedrooms                 -4316.1648    1183.1624  -3.648
## total_area                             2.1810       0.3178   6.862
## total_livable_area                   195.6809       2.7473  71.226
## type_heater0                        2705.5964   18636.9577   0.145
## type_heaterA                        4935.3139    3485.8585   1.416
## type_heaterB                        1529.6906    4369.3662   0.350
## type_heaterC                      -72839.7887   12158.9380  -5.991
## type_heaterD                      115014.9248   22800.4998   5.044
## type_heaterE                       -4158.9772   21778.9117  -0.191
## type_heaterG                        9191.8100    6685.3171   1.375
## type_heaterH                        5320.9524    3673.0713   1.449
## view_type0                          4334.6676   27444.3988   0.158
## view_typeA                         48651.8298   22324.6061   2.179
## view_typeB                        -36223.9676   31568.6564  -1.147
## view_typeC                         26830.0256   22958.5792   1.169
## view_typeD                         42265.7664   26148.3456   1.616
## view_typeE                         27243.3066   26414.4270   1.031
## view_typeH                         29422.5166   24871.5539   1.183
## view_typeI                         26365.4079   20902.3159   1.261
## year_built                           108.8595      33.7257   3.228
## number_stories.cat4+ Floors       787152.1858   87904.0643   8.955
## number_stories.catUp to 2 Floors   17428.9855    4107.8982   4.243
## shooting_count                      -347.3366      42.2635  -8.218
## PPR_count                          -1513.4131    1159.0202  -1.306
## school_count                               NA           NA      NA
## crime_count                            1.6088      10.7904   0.149
## TODTOD                             16830.7552    4008.7350   4.199
## nameAllegheny West                -27700.2032   79696.4126  -0.348
## nameAndorra                        54524.2898   82787.8828   0.659
## nameAston-Woodbridge              -55750.5405   29665.8680  -1.879
## nameBartram Village               -23375.1161   38107.6000  -0.613
## nameBella Vista                   397300.0124   39587.2953  10.036
## nameBelmont                         1164.6972   38532.4361   0.030
## nameBrewerytown                    21334.6399   33088.0076   0.645
## nameBridesburg                   -102555.8736   25849.5612  -3.967
## nameBurholme                      -44681.8856   34528.9165  -1.294
## nameBustleton                     -25581.7615   22682.3491  -1.128
## nameByberry                      -126622.7326   49327.4751  -2.567
## nameCallowhill                     96149.8416   69484.0218   1.384
## nameCarroll Park                  -38393.8486   25070.6024  -1.531
## nameCedar Park                    144111.6279   27619.0785   5.218
## nameCedarbrook                     27853.9770   80040.5996   0.348
## nameChestnut Hill                 354536.1902   80378.7562   4.411
## nameClearview                      48357.2960   36014.7639   1.343
## nameCobbs Creek                   -17968.0358   23470.0371  -0.766
## nameCrescentville                 -64461.5057   90185.5319  -0.715
## nameCrestmont Farms               -31412.6244   90350.1466  -0.348
## nameDearnley Park                  82214.8550   85418.1986   0.962
## nameDickinson Narrows             217463.8520   39136.3866   5.557
## nameDunlap                        -38397.3516   35940.5754  -1.068
## nameEast Falls                    123157.2560   79877.8809   1.542
## nameEast Germantown               -44303.5498   79776.6838  -0.555
## nameEast Kensington               122573.2036   38110.4442   3.216
## nameEast Mount Airy                18554.5503   79537.6947   0.233
## nameEast Oak Lane                 -57696.4338   80993.6085  -0.712
## nameEast Parkside                 -24373.5115   32468.5168  -0.751
## nameEast Passyunk                 241414.7621   38284.3978   6.306
## nameEast Poplar                    37267.2053   50218.2760   0.742
## nameEastwick                        5312.2114   35476.8590   0.150
## nameElmwood                        -4023.2802   24140.8905  -0.167
## nameFairhill                       34244.6435   41703.8801   0.821
## nameFairmount                     197700.3910   32678.9043   6.050
## nameFeltonville                     9980.1499   37424.8272   0.267
## nameFern Rock                     -67127.4496   81490.5200  -0.824
## nameFishtown - Lower Kensington   178690.0036   36480.3475   4.898
## nameFitler Square                 634647.9571   41520.2464  15.285
## nameFox Chase                     -27674.7729   23340.0709  -1.186
## nameFrancisville                  113354.2450   35730.6951   3.172
## nameFrankford                    -156324.2859   23444.5040  -6.668
## nameFranklinville                 -22693.2803   41798.1606  -0.543
## nameGarden Court                  134573.9961   40675.3484   3.308
## nameGermantown - Morton           -38129.0948   81124.0927  -0.470
## nameGermantown - Penn Knox          2684.5239   87377.0851   0.031
## nameGermantown - Westside         -19888.1683   83439.1762  -0.238
## nameGermany Hill                   81844.4684   81651.1612   1.002
## nameGirard Estates                195195.6786   38056.2665   5.129
## nameGlenwood                      -60083.9541   83094.1446  -0.723
## nameGraduate Hospital             385001.7907   37362.9657  10.304
## nameGrays Ferry                   144776.4699   37971.4934   3.813
## nameGreenwich                     168916.8857   41739.5802   4.047
## nameHaddington                    -45071.8048   24559.3500  -1.835
## nameHarrowgate                     -9154.5364   37166.2269  -0.246
## nameHartranft                     -43871.3962   35126.9340  -1.249
## nameHaverford North               -78044.7136   39031.5509  -2.000
## nameHawthorne                     333682.9234   43099.6232   7.742
## nameHolmesburg                    -77103.9925   22731.2997  -3.392
## nameHunting Park                  -18037.2394   37411.8963  -0.482
## nameJuniata Park                   38271.9148   37144.3932   1.030
## nameKingsessing                   -27723.8975   23972.9668  -1.156
## nameLawndale                     -101264.8414   23067.6975  -4.390
## nameLexington Park                 25101.2533   29087.7039   0.863
## nameLogan                         -77570.8664   79547.1706  -0.975
## nameLogan Square                  234778.2626   34438.1587   6.817
## nameLower Moyamensing             145256.6523   37520.0680   3.871
## nameLudlow                        -18685.5834   46340.4219  -0.403
## nameManayunk                       83990.2313   79661.4222   1.054
## nameMantua                         35257.0962   29560.5400   1.193
## nameMayfair                       -84446.5477   22435.0855  -3.764
## nameMcGuire                        62622.0076   46018.4668   1.361
## nameMelrose Park Gardens            4292.8318   82114.8503   0.052
## nameMill Creek                    -29218.1121   27012.8064  -1.082
## nameMillbrook                     -36805.8624   26656.9708  -1.381
## nameModena                        -32120.5754   24206.7237  -1.327
## nameMorrell Park                  -41243.0522   24978.6875  -1.651
## nameNewbold                       186091.2459   38747.8387   4.803
## nameNicetown                      -50908.8093   81640.7214  -0.624
## nameNormandy Village              -17065.5423   46459.4735  -0.367
## nameNorth Central                 -31047.0447   33951.9104  -0.914
## nameNorthern Liberties            230007.5329   35828.9306   6.420
## nameNorthwood                    -137491.7170   26100.7979  -5.268
## nameOgontz                        -27287.3521   79711.9329  -0.342
## nameOld Kensington                123301.6057   39252.4341   3.141
## nameOlney                         -35506.9558   79244.7743  -0.448
## nameOverbrook                     -21530.6055   23612.4623  -0.912
## nameOxford Circle                 -95461.4318   22307.8418  -4.279
## namePacker Park                   226052.3357   42365.7045   5.336
## nameParkwood Manor                -41554.6730   23560.0517  -1.764
## namePaschall                      -47376.1160   24915.8569  -1.901
## namePassyunk Square               265682.0669   38232.3460   6.949
## namePennsport                     211889.8151   38609.7410   5.488
## namePennypack                     -18725.3035   24994.6787  -0.749
## namePennypack Woods               -48731.8390   36192.1514  -1.346
## namePenrose                        10131.6261   31515.1377   0.321
## namePoint Breeze                  182964.5741   37177.1617   4.921
## namePowelton                      260694.1638   59799.6009   4.359
## nameQueen Village                 441436.4201   38837.0765  11.366
## nameRhawnhurst                    -19645.3595   23113.0509  -0.850
## nameRichmond                       75804.4770   36359.0160   2.085
## nameRittenhouse                   565823.1636   33011.0457  17.140
## nameRiverfront                    206687.7891   79150.1994   2.611
## nameRoxborough                     73947.9117   79489.4554   0.930
## nameRoxborough Park                82499.4279   84576.4649   0.975
## nameSharswood                      56606.3969   46595.5394   1.215
## nameSociety Hill                  408970.3935   33555.1169  12.188
## nameSomerton                      -14753.5571   22859.9565  -0.645
## nameSouthwest Germantown          -31447.0920   80630.4225  -0.390
## nameSouthwest Schuylkill           -4873.7931   25935.7904  -0.188
## nameSpring Garden                 217508.4292   33629.3802   6.468
## nameSpruce Hill                   195546.5815   35111.4075   5.569
## nameStadium District              192350.0278   40943.8174   4.698
## nameStanton                       -99983.5433   33328.6739  -3.000
## nameStrawberry Mansion            -64330.9920   33003.1107  -1.949
## nameSummerdale                   -116053.6536   26121.1901  -4.443
## nameTacony                       -104702.7261   23113.8374  -4.530
## nameTioga                         -99947.1216   79912.5414  -1.251
## nameTorresdale                    -56131.7322   24716.3097  -2.271
## nameUpper Kensington                1431.1903   36915.3589   0.039
## nameUpper Roxborough               59966.7795   79781.2467   0.752
## nameWalnut Hill                    36723.8440   36294.4456   1.012
## nameWashington Square West        330744.0149   34888.8370   9.480
## nameWest Central Germantown       -37334.8593   81543.8432  -0.458
## nameWest Kensington                69705.3958   37948.2816   1.837
## nameWest Mount Airy                86475.5827   79849.8687   1.083
## nameWest Oak Lane                  -6850.7653   79256.2176  -0.086
## nameWest Parkside                 -17760.2798   52228.1287  -0.340
## nameWest Passyunk                 148363.0958   38448.3091   3.859
## nameWest Poplar                    38934.6920   50005.7065   0.779
## nameWest Powelton                  46473.2328   36813.9390   1.262
## nameWhitman                       155904.7686   38972.7675   4.000
## nameWinchester Park               -23805.3236   37687.9395  -0.632
## nameWissahickon                    72922.9302   80453.9826   0.906
## nameWissahickon Hills             122620.5736   84827.8386   1.446
## nameWissinoming                  -107276.3467   22730.1851  -4.720
## nameWister                        -48779.3878   81589.9932  -0.598
## nameWoodland Terrace              212236.8503   90379.9457   2.348
## nameWynnefield                    -49343.8847   26171.7584  -1.885
## nameWynnefield Heights                     NA           NA      NA
## nameYorktown                               NA           NA      NA
##                                              Pr(>|t|)    
## (Intercept)                                  0.082468 .  
## div_codeEPD                      0.000068624548814746 ***
## div_codeNEPD                                 0.050582 .  
## div_codeNWPD                                 0.672590    
## div_codeSPD                      0.000000000021042707 ***
## div_codeSWPD                                 0.325802    
## exterior_condition               < 0.0000000000000002 ***
## interior_condition               < 0.0000000000000002 ***
## number_of_bathrooms              < 0.0000000000000002 ***
## number_of_bedrooms                           0.000265 ***
## total_area                       0.000000000007062227 ***
## total_livable_area               < 0.0000000000000002 ***
## type_heater0                                 0.884576    
## type_heaterA                                 0.156854    
## type_heaterB                                 0.726273    
## type_heaterC                     0.000000002142008747 ***
## type_heaterD                     0.000000460695242559 ***
## type_heaterE                                 0.848557    
## type_heaterG                                 0.169177    
## type_heaterH                                 0.147461    
## view_type0                                   0.874504    
## view_typeA                                   0.029327 *  
## view_typeB                                   0.251209    
## view_typeC                                   0.242574    
## view_typeD                                   0.106034    
## view_typeE                                   0.302381    
## view_typeH                                   0.236838    
## view_typeI                                   0.207199    
## year_built                                   0.001250 ** 
## number_stories.cat4+ Floors      < 0.0000000000000002 ***
## number_stories.catUp to 2 Floors 0.000022218456474646 ***
## shooting_count                   0.000000000000000224 ***
## PPR_count                                    0.191653    
## school_count                                       NA    
## crime_count                                  0.881483    
## TODTOD                           0.000027034298856716 ***
## nameAllegheny West                           0.728167    
## nameAndorra                                  0.510162    
## nameAston-Woodbridge                         0.060227 .  
## nameBartram Village                          0.539623    
## nameBella Vista                  < 0.0000000000000002 ***
## nameBelmont                                  0.975887    
## nameBrewerytown                              0.519077    
## nameBridesburg                   0.000073023119075921 ***
## nameBurholme                                 0.195672    
## nameBustleton                                0.259413    
## nameByberry                                  0.010269 *  
## nameCallowhill                               0.166451    
## nameCarroll Park                             0.125686    
## nameCedar Park                   0.000000183645054455 ***
## nameCedarbrook                               0.727847    
## nameChestnut Hill                0.000010375681362365 ***
## nameClearview                                0.179389    
## nameCobbs Creek                              0.443943    
## nameCrescentville                            0.474766    
## nameCrestmont Farms                          0.728088    
## nameDearnley Park                            0.335816    
## nameDickinson Narrows            0.000000028021154490 ***
## nameDunlap                                   0.285378    
## nameEast Falls                               0.123140    
## nameEast Germantown                          0.578668    
## nameEast Kensington                          0.001302 ** 
## nameEast Mount Airy                          0.815547    
## nameEast Oak Lane                            0.476255    
## nameEast Parkside                            0.452857    
## nameEast Passyunk                0.000000000295360940 ***
## nameEast Poplar                              0.458037    
## nameEastwick                                 0.880974    
## nameElmwood                                  0.867641    
## nameFairhill                                 0.411582    
## nameFairmount                    0.000000001487748413 ***
## nameFeltonville                              0.789726    
## nameFern Rock                                0.410098    
## nameFishtown - Lower Kensington  0.000000977839542346 ***
## nameFitler Square                < 0.0000000000000002 ***
## nameFox Chase                                0.235754    
## nameFrancisville                             0.001515 ** 
## nameFrankford                    0.000000000026942406 ***
## nameFranklinville                            0.587190    
## nameGarden Court                             0.000940 ***
## nameGermantown - Morton                      0.638356    
## nameGermantown - Penn Knox                   0.975491    
## nameGermantown - Westside                    0.811609    
## nameGermany Hill                             0.316184    
## nameGirard Estates               0.000000295008330100 ***
## nameGlenwood                                 0.469641    
## nameGraduate Hospital            < 0.0000000000000002 ***
## nameGrays Ferry                              0.000138 ***
## nameGreenwich                    0.000052177028145784 ***
## nameHaddington                               0.066495 .  
## nameHarrowgate                               0.805443    
## nameHartranft                                0.211709    
## nameHaverford North                          0.045571 *  
## nameHawthorne                    0.000000000000010452 ***
## nameHolmesburg                               0.000696 ***
## nameHunting Park                             0.629724    
## nameJuniata Park                             0.302861    
## nameKingsessing                              0.247511    
## nameLawndale                     0.000011424367647150 ***
## nameLexington Park                           0.388180    
## nameLogan                                    0.329500    
## nameLogan Square                 0.000000000009654858 ***
## nameLower Moyamensing                        0.000109 ***
## nameLudlow                                   0.686789    
## nameManayunk                                 0.291746    
## nameMantua                                   0.233004    
## nameMayfair                                  0.000168 ***
## nameMcGuire                                  0.173599    
## nameMelrose Park Gardens                     0.958308    
## nameMill Creek                               0.279432    
## nameMillbrook                                0.167387    
## nameModena                                   0.184554    
## nameMorrell Park                             0.098735 .  
## nameNewbold                      0.000001582395147524 ***
## nameNicetown                                 0.532919    
## nameNormandy Village                         0.713385    
## nameNorth Central                            0.360501    
## nameNorthern Liberties           0.000000000141081178 ***
## nameNorthwood                    0.000000140196566596 ***
## nameOgontz                                   0.732112    
## nameOld Kensington                           0.001686 ** 
## nameOlney                                    0.654112    
## nameOverbrook                                0.361873    
## nameOxford Circle                0.000018876100378693 ***
## namePacker Park                  0.000000096654400958 ***
## nameParkwood Manor                           0.077792 .  
## namePaschall                                 0.057265 .  
## namePassyunk Square              0.000000000003839318 ***
## namePennsport                    0.000000041368009326 ***
## namePennypack                                0.453767    
## namePennypack Woods                          0.178171    
## namePenrose                                  0.747848    
## namePoint Breeze                 0.000000869048906350 ***
## namePowelton                     0.000013132208723538 ***
## nameQueen Village                < 0.0000000000000002 ***
## nameRhawnhurst                               0.395357    
## nameRichmond                                 0.037098 *  
## nameRittenhouse                  < 0.0000000000000002 ***
## nameRiverfront                               0.009029 ** 
## nameRoxborough                               0.352239    
## nameRoxborough Park                          0.329358    
## nameSharswood                                0.224446    
## nameSociety Hill                 < 0.0000000000000002 ***
## nameSomerton                                 0.518686    
## nameSouthwest Germantown                     0.696531    
## nameSouthwest Schuylkill                     0.850944    
## nameSpring Garden                0.000000000102770967 ***
## nameSpruce Hill                  0.000000026049887132 ***
## nameStadium District             0.000002653740300370 ***
## nameStanton                                  0.002705 ** 
## nameStrawberry Mansion                       0.051287 .  
## nameSummerdale                   0.000008944558394892 ***
## nameTacony                       0.000005951244738983 ***
## nameTioga                                    0.211063    
## nameTorresdale                               0.023160 *  
## nameUpper Kensington                         0.969075    
## nameUpper Roxborough                         0.452280    
## nameWalnut Hill                              0.311637    
## nameWashington Square West       < 0.0000000000000002 ***
## nameWest Central Germantown                  0.647067    
## nameWest Kensington                          0.066253 .  
## nameWest Mount Airy                          0.278837    
## nameWest Oak Lane                            0.931119    
## nameWest Parkside                            0.733823    
## nameWest Passyunk                            0.000114 ***
## nameWest Poplar                              0.436226    
## nameWest Powelton                            0.206833    
## nameWhitman                      0.000063577183942097 ***
## nameWinchester Park                          0.527631    
## nameWissahickon                              0.364744    
## nameWissahickon Hills                        0.148334    
## nameWissinoming                  0.000002386739447581 ***
## nameWister                                   0.549943    
## nameWoodland Terrace                         0.018875 *  
## nameWynnefield                               0.059399 .  
## nameWynnefield Heights                             NA    
## nameYorktown                                       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123800 on 13852 degrees of freedom
##   (199 observations deleted due to missingness)
## Multiple R-squared:  0.7148, Adjusted R-squared:  0.7113 
## F-statistic: 200.7 on 173 and 13852 DF,  p-value: < 0.00000000000000022

philly.test.nhood <-
  philly.test %>%
  mutate(Regression = "Neighborhood Effects",
         sale_price.Predict = predict(reg.nhood, philly.test),
         sale_price.Error = sale_price.Predict- sale_price,
         sale_price.AbsError = abs(sale_price.Predict- sale_price),
         sale_price.APE = (abs(sale_price.Predict- sale_price)) / sale_price)%>%
  filter(sale_price < 5000000)

bothRegressions <- 
  rbind(
    dplyr::select(philly.test, starts_with("sale_price"), Regression, name) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)),
    dplyr::select(philly.test.nhood, starts_with("sale_price"), Regression, name) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)))

The MAE and MAPE of the proposed model, which accounts for the neighborhood effect, are as follows: when incorporating the neighborhood effect, the MAE decreases by 20,000 and the MAPE decreases by 18%p compared to the proposed model without the neighborhood effect.

Initially, the neighborhood effect was defined by police districts, but upon resubmission, it was adjusted to 158 neighborhoods, and the neighborhood effect was recalculated.

st_drop_geometry(bothRegressions) %>%
  gather(Variable, Value, -Regression, -name) %>%
  filter(Variable == "sale_price.AbsError" | Variable == "sale_price.APE") %>%
  group_by(Regression, Variable) %>%
    summarize(meanValue = mean(Value)) %>%
    spread(Variable, meanValue) %>%
    pander(caption = "Mean Absolute Error and Absolute Percentage Error by Regression")

Mean Absolute Error and Absolute Percentage Error by Regression
Regression	sale_price.AbsError	sale_price.APE
Baseline Regression	87459	0.5351
Neighborhood Effects	67766	0.3774

bothRegressions %>%
  dplyr::select(sale_price.Predict, sale_price, Regression) %>%
    ggplot(aes(sale_price, sale_price.Predict)) +
  geom_point() +
  stat_smooth(aes(sale_price, sale_price), 
             method = "lm", se = FALSE, size = 1, colour="orange") + 
  stat_smooth(aes(sale_price.Predict, sale_price), 
              method = "lm", se = FALSE, size = 1, colour="olivedrab") +
  facet_wrap(~Regression) +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
  theme_minimal()

Spatial Evaluation of Errors

The following results of table and graph denote there is significant difference in MAE and MAPE by regression methods, particularly in Central and North Philadelphia.

st_drop_geometry(bothRegressions) %>%
  group_by(Regression, name) %>%
  summarize(mean.MAPE = mean(sale_price.APE, na.rm = T)) %>%
  ungroup() %>% 
  left_join(neighborhoods, by = c("name" = "name")) %>%
    st_sf() %>%
    ggplot() + 
      geom_sf(aes(fill = 100*mean.MAPE), color = "transparent") +
      geom_sf(data = bothRegressions, colour = "gray30", size = .5) +
      facet_wrap(~Regression) +
      scale_fill_gradient(low = palette5[1], high = palette5[5],
                          name = "MAPE") +
      labs(title = "Mean test set MAPE by neighborhood (Neighborhood)") +
      theme_void()

The following scatterplot shows the average MAPE by 158 neighborhood mean price. The overall MAPE values were lower after accounting for the neighborhood effect, and the peak MAPE values observed before considering the neighborhood effect disappeared (Fairhill and McGuire). In both regression methods, MAPE did show an decrease depending on the sale price. However, in both methods, MAPE has a positive value.

a <- st_drop_geometry(bothRegressions) %>%
  group_by(Regression, name) %>%
  summarize(mean.sale_price = mean(sale_price, na.rm = T))

a2 <- st_drop_geometry(bothRegressions) %>%
  group_by(Regression, name) %>%
  summarize(mean.MAPE = mean(sale_price.APE, na.rm = T))

a3<- left_join(a, a2, by = c("name" = "name", "Regression" = "Regression")) 

ggplot(a3) + 
    geom_point(aes(x = mean.sale_price, y = 100*mean.MAPE), color = "gray30", size = 1) +
    facet_wrap(~Regression) +
    scale_fill_gradient(low = palette5[1], high = palette5[5],
                        name = "MAPE") +
    labs(title = "Average MAPE by Neighborhood (Neighborhood) Mean Price") +
    theme_minimal()

Race and Income Context of Predictions

We validated if the proposed model that accounts for neighborhood effect is applied to different group contexts: race and income.

In race, the regression that considered neighborhood effects resulted in a smaller MAPE in the majority white group. In income, the regression that considered neighborhood effects resulted in a smaller MAPE in the high-income group.

For the majority non-White and low-income groups, the MAPE decreased significantly—by more than 30%p and 25%p, respectively—when the neighborhood effect was considered compared to when it was not. It means the ‘neighborhood effect’ played an important role in predicting house prices.

As a result of generalizability, however, it can be concluded that it is relatively difficult to apply to the Majority Non-White or Low Income group. We should consider other additional variables that may resolve the correlation between the groups or analyze more data.

grid.arrange(ncol = 2,
  ggplot() + geom_sf(data = na.omit(nhoods), aes(fill = raceContext), color = "#999") +
    scale_fill_manual(values = c("olivedrab", "orange"), name="Race Context") +
    labs(title = "Race Context") +
    theme_void() + theme(legend.position="bottom"), 
  ggplot() + geom_sf(data = na.omit(nhoods), aes(fill = incomeContext), color = "#999") +
    scale_fill_manual(values = c("olivedrab", "orange"), name="Income Context") +
    labs(title = "Income Context") +
    theme_void() + theme(legend.position="bottom"))

st_join(bothRegressions, nhoods_f) %>% 
  filter(!is.na(raceContext)) %>%
  group_by(Regression, raceContext) %>%
  summarize(mean.MAPE = 100*(mean(sale_price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(raceContext, mean.MAPE) %>%
  pander(caption = "Test set MAPE by neighborhood race context(%)")

Test set MAPE by neighborhood race context(%)
Regression	Majority Non-White	Majority White
Baseline Regression	74.56	27.67
Neighborhood Effects	48.3	24.78

st_join(bothRegressions, nhoods_f) %>% 
  filter(!is.na(incomeContext)) %>%
  group_by(Regression, incomeContext) %>%
  summarize(mean.MAPE = 100*(mean(sale_price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  pander(caption = "Test set MAPE by neighborhood income context(%)")

Test set MAPE by neighborhood income context(%)
Regression	High Income	Low Income
Baseline Regression	27.94	64.37
Neighborhood Effects	25.78	42.89

Conclusion and Discussion

We recommend that Zillow use our proposed OLS model instead of the current one. Our proposed model aimed to predict housing prices using key features such as property condition, total livable area, number of bedrooms and bathrooms, proximity to crime, and access to amenities like schools and parks. These variables are crucial as they directly influence property values based on what homebuyers prioritize. The current Zillow model mainly focuses on internal characteristics and does not focus on public services or (dis)amenities. However, our proposed model was able to obtain significantly reduced MAE and MAPE values by incorporating six public services, including crime, shooting, and police districts, which are actual factors that people around Philadelphia often consider when choosing their home location. Our proposed model finally did effectively account for neighborhood effects, possibly because the neighborhood defined by 158 neighborhoods significantly decreased the MAE and MAPE.

While the model captured a significant portion of price variation, it was not perfect. First, metrics like MAE and MAPE decreased compared to the current Zillow model, but there was still a positive correlation. We also observed positive spatial autocorrelation in the model errors, as indicated by Moran’s I statistics. We need to identify more variables with spatial correlation.

Second, regression analysis did not work well for school data because the number of schools per tract had a lot of NA data. In the future, it is necessary to reanalyze the distance to the nearest school from each house as continuous data.

Third, our proposed model should still improve generalizability. The overall MAPE values were lower after accounting for the neighborhood effect, but MAPE by different group context has a positive value and large difference between two different groups which was race and house income, respectively. The model showed better performance in high-income and predominantly white neighborhoods, evidenced by lower MAE values, possibly due to more stable socio-economic factors in these areas. To improve generalization performance by eliminating these effects, we need to add sufficient other variables that can predict the variability occurring in low-income and major non-white groups, or change the weights of each variable.

In sum, further refinement is necessary to address local neighborhood variations that are not fully captured by the current set of predictors. Incorporating more granular, demographically aligned, or culturally relevant boundaries could improve the model’s predictive power, allowing it to capture neighborhood effects more accurately and potentially reduce the overall error rate.

philly.sf_Chall <- philly.sf %>%
  filter(sale_price < 5000000 & toPredict == "CHALLENGE") %>%
  mutate(pricepersq = ifelse(total_area > 10, sale_price / total_area, 0)) 
philly.sf_Chall <- philly.sf_Chall[!is.na(as.numeric(philly.sf_Chall$pricepersq)), ]
philly.sf_Chall <- st_join(philly.sf_Chall, nhoods_f) # Crime Data, School Data, Green Area Data, Shooting Incident Data, TOD Data
philly.sf_Chall <- st_join(philly.sf_Chall, philly.police) # Police District Data

philly.sf_Chall$sale_price <-  798332.8947 + philly.sf_Chall$exterior_condition * -27278.6103    +philly.sf_Chall$interior_condition* -27822.5115   +philly.sf_Chall$number_of_bathrooms*   60246.0186    +philly.sf_Chall$number_of_bedrooms*   -25245.0484    +philly.sf_Chall$total_livable_area*  193.4259       +philly.sf_Chall$year_built * -161.9947      +philly.sf_Chall$shooting_count * -1020.9659      +philly.sf_Chall$crime_count  * -18.7429 + ifelse(philly.sf_Chall$type_heater == "A", 13823.6306, 
ifelse(philly.sf_Chall$type_heater == "B", -8972.2810, 
ifelse(philly.sf_Chall$type_heater == "D", 239670.7969, 
ifelse(philly.sf_Chall$type_heater == "E", 58203.8361,
ifelse(philly.sf_Chall$type_heater == "H", 7760.8603,0)))))+
ifelse(philly.sf_Chall$view_type == "0", -82474.3969,
ifelse(philly.sf_Chall$view_type == "C", 182468.1793,
ifelse(philly.sf_Chall$view_type == "E", -50004.9183,0)))
#+ifelse(philly.sf_Chall$number_stories.cat =="4+ Floors",730714.0040,
#ifelse(philly.sf_Chall$number_stories.cat =="Up to 2 Floors",-45798.0893,0))

MUSA <- philly.sf_Chall %>%
  st_drop_geometry() %>%
  dplyr::select(musaID, sale_price) %>%
  mutate(team = "Agarwal and Jun") 
write.csv(MUSA,"mean_prediction_AgarwalJun.csv", row.names = FALSE)

Midterm: Suggestion for Home Price Prediction Modeling in Philadelphia

Agarwal, Vrinda; Jun, Youngsang

October 11, 2024 / November 12, 2024 (resubmitted)

Introduction

Methods

1. Data Collection and Preparation

House Price per square foot in Philadelphia

Summary of Other Variables

2. Current OLS Model Analysis

Price as a function of continuous variables

Price as a function of categorical variables

Multiple Linear Regression Model Results

Accuracy of the Current Model

Generalizability - ‘k-fold’ Cross-Validation

Check if Prices and Errors cluster

(1) Spatial Lags

(2) Moran’s I

3. Proposed OLS Model

Which Data to be Included

Summary Statistics Table

6 Newly Included Data and Visualization

(1) Number of Public Parks/Recreation Areas by Tract in Philadelphia

(2) Police Districts and Population Density in Philadelphia

(3) Density of Shooting Incidents in Philadelphia with Police Districts Overlay

(4) TOD (public transit) in Philadelphia

(5) Number of Crime Dispatch by Tract in Philadelphia

(6) Number of School by Tract in Philadelphia

Multiple Linear Regression Model Results

Accuracy of the Proposed Model

Generalizability - ‘k-fold’ Cross-Validation

Check if Prices and Errors Cluster

(1) Spatial Lags

(2) Moran’s I

Predictions by neighborhood

Spatial Evaluation of Errors

Race and Income Context of Predictions

Conclusion and Discussion