Introduction

Zillow has struggled to accurately estimate home prices, partly because it is a national platform that often lacks detailed, localized information. Urban dynamics have become increasingly complex, and understanding housing markets requires a more granular approach that incorporates local factors like government policies, crime, gentrification, pollution levels, and access to public transit. Our objective is to use local data to create a model that predicts Philadelphia home prices with greater accuracy.

This markdown consists of two parts. First, we analyze the current house prices prediction model. We determine the correlation and spatial autocorrelation between house prices and variables in the current model, and derive the problems of the house prices prediction model. Second, we propose an improved model. After adding six continuous and categorical variables that affect house prices but are not considered in the current model to the model, we will predict house prices and verify the accuracy and generalization of the improved model.

Our overall modeling strategy focuses on using open-source datasets, such as those available from the Philadelphia Open Data Portal, to enhance our understanding of the local housing market. By incorporating data on crime, zoning, pollution, and transit-oriented development, we aim to build a more contextually rich model that reflects the true dynamics of the city.

Methods

Our approach to developing a more accurate home price prediction model for Philadelphia involved three main phases: data collection and preparation, analysis of the existing OLS model, and proposal of an improved model.

1.⁠ ⁠Data Collection and Preparation We gathered localized datasets from the Philadelphia Open Data Portal, aiming to capture the unique characteristics influencing home prices within the city. These datasets included:

2.⁠ ⁠Analysis of Current OLS Model

We analyzed Zillow’s existing OLS model by examining correlations between sale price and both continuous (e.g., age, total area) and categorical variables (e.g., quality grade, view type). Our analysis showed:

Negative correlation between age and sale price, suggesting that older properties tend to sell for less.

Positive correlation between total area, livable area, and sale price, indicating larger properties command higher prices.

We evaluated these correlations using a correlation matrix, allowing us to select non-collinear predictors for regression. In this phase, we tested multiple regression models using variables like total livable area, number of bathrooms, and internal/external condition, observing that livable area was the most significant predictor.

3.⁠ ⁠Proposed Model

We sought to improve prediction accuracy by integrating six new variables beyond the original model:

Using multiple linear regression, we compared our model’s Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) against Zillow’s baseline. We implemented 100-fold cross-validation to assess model generalizability, allowing systematic testing across varied training and testing splits. We also applied spatial analysis techniques, such as Moran’s I, to evaluate spatial clustering in model errors, highlighting areas for improvement.

4.⁠ ⁠Statistical and Spatial Analysis

We assessed spatial lags for both price and model errors to understand the clustering of predictions and identify areas where neighborhood effects were inadequately captured. Additionally, we tested spatial autocorrelation in errors using Moran’s I, finding positive autocorrelation that suggested the model could benefit from further integration of neighborhood-specific factors.

This approach leverages spatial and socioeconomic variables to capture Philadelphia’s local housing dynamics, ultimately reducing MAE and MAPE compared to Zillow’s model. Further refinements, such as geographically weighted regression and enhanced socio-demographic data, could improve predictive accuracy across diverse neighborhoods and price ranges.

1. Data Collection and Preparation

  1. Load Libraries and Initial Setup
# Install and Load Libraries
# install.packages('mapview')
library(tidyverse)
library(tidycensus)
library(sf)
library(knitr)
library(kableExtra)
library(mapview)
library(dplyr)
library(scales)
library(viridis)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(jtools)     # for regression model plots
library(broom)
library(tufte)
library(rmarkdown)
library(pander)
library(classInt)
library(ggplot2)
library(units)

# functions and data directory
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"

# Source for Multiple Ring Buffer
source(
  paste0("https://raw.githubusercontent.com/urbanSpatial/",
         "Public-Policy-Analytics-Landing/master/functions.r"))

# Set Color Table
palette5 <- c("#25CB10", "#5AB60C", "#8FA108",   "#C48C04", "#FA7800")

# Disable Scientific Notation
options(scipen=999)
options(tigris_class = "sf")

# Set ACS API Key
census_api_key("b2835f54d89a4499ba29829c908967b86765b345", overwrite = TRUE)
  1. Data loading

We used spatial data techniques to ensure all data were correctly mapped to Philadelphia’s local coordinate system (ESRI:102728).

# Load Philadelphia Zillow Data
philly.sf <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/studentData.geojson") %>% 
#  as.data.frame() %>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728') 

# Get ACS data
nhoods <-  
  get_acs(geography = "tract",
          variables = c("B01003_001E","B02001_002E",
                        "B06011_001E"), 
          year=2020, state="PA",
          county="Philadelphia", geometry=TRUE) %>%
  st_transform(crs = 'ESRI:102728') %>%
  separate(NAME, into = c("Census_Tract", "City_State"), sep = ", ", extra = "merge") %>%
  mutate(
    Census_Tract = gsub("Census Tract ", "", Census_Tract),
    City_State = gsub(" County, Pennsylvania", "", City_State)) %>%
  dplyr::select(  -moe)%>%
  spread(key = variable, value = estimate) %>%
  rename(TotalPop = B01003_001,
         NumberWhites = B02001_002,
         Median_Income = B06011_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(Median_Income > 35322, "High Income", "Low Income"))

neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/philadelphia.geojson") %>%
  st_transform(crs = 'ESRI:102728') 

# Filter to eliminate outlier
philly.sf_filtered <- philly.sf %>%
  filter(sale_price < 5000000 & toPredict == "MODELLING") %>%
  mutate(pricepersq = ifelse(total_area > 10, sale_price / total_area, 0)) 
philly.sf_filtered <- philly.sf_filtered[!is.na(as.numeric(philly.sf_filtered$pricepersq)), ]

House Price per square foot in Philadelphia

The following chart shows House Price per square foot in Philadelphia. Center City and the Northeast are mostly in the 1st quintile for house prices per square foot, while Southwest, Northwest, and Northeast Philadelphia are around in the 3rd quintile. By mapping these prices, we can pinpoint high-demand neighborhoods and observe how features like proximity to parks or specific zoning affect property desirability. In addition, it helps us understand the spatial distribution of property values and reveals which areas are more desirable based on pricing trends.

ggplot() + 
  geom_sf(data=nhoods, fill="#DDD", color="white")+ 
  geom_sf(data = philly.sf_filtered, aes(color = q5(pricepersq)),  
          show.legend = "point", size = 0.5, alpha = 1) +  
  scale_color_manual(values = palette5,
                     labels=qBr(philly.sf_filtered, "pricepersq"),
                     name="Quintile\nBreaks($/sqft)") + 
  labs(
    title = "House Price per square foot in Philadelphia",
    subtitle = "Assessment date: May 24, 2022-August 14, 2023",
    caption = "Data: U.S. Census Bureau, Zillow") +
  theme_void()

#mapview(philly.sf_filtered)

Summary of Other Variables

# Drop geometry and calculate summary statistics
data_for_summary <- philly.sf_filtered %>%
  st_drop_geometry() %>%
  select(
    number_of_bathrooms,
    number_of_bedrooms,
    total_area,
    total_livable_area,
    year_built,
    exterior_condition,
    interior_condition,
    sale_price
  ) %>%
  mutate(
    property_age = 2024 - year_built # Assuming '2024' as the current year for age
  )

# Calculate the summary statistics
sum_stats_combined <- data_for_summary %>%
  summarize(
    Mean = round(c(
      mean(number_of_bathrooms, na.rm = TRUE),
      mean(number_of_bedrooms, na.rm = TRUE),
      mean(total_area, na.rm = TRUE),
      mean(total_livable_area, na.rm = TRUE),
      mean(property_age, na.rm = TRUE),
      mean(exterior_condition, na.rm = TRUE),
      mean(interior_condition, na.rm = TRUE),
      mean(sale_price, na.rm = TRUE)), 2),
    
    Median = round(c(
      median(number_of_bathrooms, na.rm = TRUE),
      median(number_of_bedrooms, na.rm = TRUE),
      median(total_area, na.rm = TRUE),
      median(total_livable_area, na.rm = TRUE),
      median(property_age, na.rm = TRUE),
      median(exterior_condition, na.rm = TRUE),
      median(interior_condition, na.rm = TRUE),
      median(sale_price, na.rm = TRUE)), 2),
    
    Min = round(c(
      min(number_of_bathrooms, na.rm = TRUE),
      min(number_of_bedrooms, na.rm = TRUE),
      min(total_area, na.rm = TRUE),
      min(total_livable_area, na.rm = TRUE),
      min(property_age, na.rm = TRUE),
      min(exterior_condition, na.rm = TRUE),
      min(interior_condition, na.rm = TRUE),
      min(sale_price, na.rm = TRUE)), 2),
    
    Max = round(c(
      max(number_of_bathrooms, na.rm = TRUE),
      max(number_of_bedrooms, na.rm = TRUE),
      max(total_area, na.rm = TRUE),
      max(total_livable_area, na.rm = TRUE),
      max(property_age, na.rm = TRUE),
      max(exterior_condition, na.rm = TRUE),
      max(interior_condition, na.rm = TRUE),
      max(sale_price, na.rm = TRUE)), 2),
    
    SD = round(c(
      sd(number_of_bathrooms, na.rm = TRUE),
      sd(number_of_bedrooms, na.rm = TRUE),
      sd(total_area, na.rm = TRUE),
      sd(total_livable_area, na.rm = TRUE),
      sd(property_age, na.rm = TRUE),
      sd(exterior_condition, na.rm = TRUE),
      sd(interior_condition, na.rm = TRUE),
      sd(sale_price, na.rm = TRUE)), 2)
  ) %>%
  mutate(
    `Variable Name` = c("number_of_bathrooms", "number_of_bedrooms", 
                        "total_area", "total_livable_area", 
                        "property_age", "exterior_condition", 
                        "interior_condition", "sale_price"),
    
    Description = c("Number of Bathrooms", "Number of Bedrooms", 
                    "Total Area of Property", "Total Livable Area of Property",
                    "Property Age", "Exterior Condition Rating", 
                    "Interior Condition Rating", "Sale Price of Property"),
    
    Category = c(rep("Physical characteristics", 4), 
                 "Age", rep("Condition", 2), "Financial")
  ) %>%
  select(Category, `Variable Name`, Description, everything())

# Display the summary statistics table with formatting
sum_stats_combined %>%
  kable(format = "html", caption = "Table: Summary statistics for numeric variables by category") %>%
  row_spec(1:4, background = "#ffffff50") %>%
  row_spec(5, background = "#f2f2f250") %>%
  row_spec(6:7, background = "#e6e6e680") %>%
  kable_styling(bootstrap_options = c("hover"), fixed_thead = TRUE)
Table: Summary statistics for numeric variables by category
Category Variable Name Description Mean Median Min Max SD
Physical characteristics number_of_bathrooms Number of Bathrooms 1.07 1 0 8 0.67
Physical characteristics number_of_bedrooms Number of Bedrooms 2.58 3 0 31 1.27
Physical characteristics total_area Total Area of Property 1828.80 1179 0 226512 3824.23
Physical characteristics total_livable_area Total Livable Area of Property 1333.19 1208 0 15246 546.51
Age property_age Property Age 87.46 99 0 2024 49.43
Condition exterior_condition Exterior Condition Rating 3.77 4 1 7 0.87
Condition interior_condition Interior Condition Rating 3.65 4 0 7 0.88
Financial sale_price Sale Price of Property 271929.21 229000 10100 4900000 234193.33

2. Current OLS Model Analysis

Price as a function of continuous variables

We looked at the relationship between House Price i.e. sale_price and Age, total_area, and total_livable_area, which are continuous variables currently used by Zillow. Age was calculated by subtracting the year of completion, year_built, from this year, 2024. As a result, Age and sale_price showed a negative correlation, while total_area and sale_price and total_livable_area and sale_price showed a positive correlation. We filtered our datasets to remove outliers and errors, such as excluding properties with sale prices above $5 million to focus on the majority market segment.

# Price as a function of continuous variables
philly.sf_filtered %>%
  st_drop_geometry() %>%
  mutate(Age = 2024 - year_built) %>%
  dplyr::select(sale_price, total_area, Age, total_livable_area) %>%
  filter(sale_price <= 5000000, total_area <= 10000, total_livable_area <= 5000, 
         Age < 500) %>% # Filter out crazy outliers
  gather(Variable, Value, -sale_price) %>% 
   ggplot(aes(Value, sale_price)) +
     geom_point(size = .5) + 
     geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Price as a function of continuous variables",
          subtitle="Sale Price: $, Age: year(s), Area: sqft, Total Livable Area: sqft") +
     theme_minimal()

Price as a function of categorical variables

We also examine the relationship between sale_price and quality_grade, view_type, exterior_condition, and interior_condition, which are categorical variables currently used by Zillow. Quality_grade has two types of grading systems, so we changed the 1-6 system to an S-A-B-C-D-E system. The following histograms show that as the condition or quality grade goes high, the sale price tends to increase.

# Price as a function of categorical variables

philly.sf_filtered <- philly.sf_filtered %>%
  mutate(
    quality_grade_new = case_when(
      quality_grade == "1 " ~ "S ",  
      quality_grade == "2 " ~ "A ",  
      quality_grade == "3 " ~ "B ",  
      quality_grade == "4 " ~ "C ", 
      quality_grade == "5 " ~ "D ", 
      quality_grade == "6 " ~ "E ", 
      TRUE ~ as.character(quality_grade)        
    )
  )

philly.sf_filtered$quality_grade_new <- factor(philly.sf_filtered$quality_grade_new, levels = c("S+", "S ", "A+", "A ", "A-", "B+", "B ", "B-", "C+", "C ", "C-", "D+", "D ", "D-", "E+", "E ", "E-"))
# levels(philly.sf_filtered$quality_grade_new)
philly.sf_filtered %>%
  st_drop_geometry() %>%
  dplyr::select(sale_price, quality_grade_new, view_type, exterior_condition, interior_condition) %>%
  filter(sale_price <= 1000000, !is.na(interior_condition), interior_condition != 0, view_type != "", view_type != 0, quality_grade_new != "") %>% # Filter out crazy outliers
  gather(Variable, Value, -sale_price) %>%
  ggplot(aes(Value, sale_price)) +
    geom_bar(position = "dodge", stat="summary", fun="mean")+
    facet_wrap(~Variable, ncol = 2, scales = "free") +
    labs(title = "Price as a function of categorical variables", y="Mean_Price") +
    plotTheme() + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))+
    theme_minimal()

The following diagram shows correlation matrix across numeric variables, which helps us decide to choose which variable for regression. It shows how different property and neighborhood numeric factors relate to each other and to home prices. It uses values between -1 and 1 to indicate the strength and direction of these relationships: - Positive values mean that as one variable increases, so does the other (e.g., property size and price). - Negative values show that as one variable increases, the other decreases (e.g., distance from the city center and price). - Values near zero suggest no clear relationship.

Since it is helpful for analysis to perform regression by selecting variables that are not correlated with each other, year_built, total_area, and number_stories were chosen for the subsequent regression analysis.

numericVars <- 
  select_if(st_drop_geometry(philly.sf_filtered), is.numeric) %>% na.omit()

ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  insig = "blank") +  
    labs(title = "Correlation across numeric variables") 

Multiple Linear Regression Model Results

The following is the result of linear regression analysis of sale_price and total_livable_area. If this result is statistically significant, we can predict sale_price from total_livable_area using the formula SalePrice_i = -65,358 + 252.99Xi + ei. The example below is statistically significant because the p-value is small, and if total_livable_area are 2,262 sqft, we can say that the predicted sale_price is -65,358+252.99x2,262=$507,000.

livingReg <- lm(sale_price ~ total_livable_area, data = philly.sf_filtered)
summary(livingReg)
## 
## Call:
## lm(formula = sale_price ~ total_livable_area, data = philly.sf_filtered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2963434  -100606   -19246    69176  3911388 
## 
## Coefficients:
##                      Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)        -65357.845   3236.965  -20.19 <0.0000000000000002 ***
## total_livable_area    252.993      2.247  112.61 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 189000 on 23703 degrees of freedom
## Multiple R-squared:  0.3485, Adjusted R-squared:  0.3485 
## F-statistic: 1.268e+04 on 1 and 23703 DF,  p-value: < 0.00000000000000022

This regression analysis can be extended to multiple variables. The following denotes that sale_price, exterior_condition, interior_condition, number_of_bathrooms, number_of_bedrooms, total_livable_area, type_heater, year_built, and number_stories were found to be correlated with sale price, while total_area and view_type showed no clear evidence of correlation with sale price. We removed parcel_shape because it contains only one ‘D’ value, which makes dividing the data into test and training sets impossible.

reg1 = lm(sale_price ~ ., data = 
            st_drop_geometry(philly.sf_filtered) %>%
            filter(number_of_bedrooms <13) %>%
            dplyr::select(sale_price, exterior_condition, interior_condition, 
                          number_of_bathrooms, number_of_bedrooms,
                          total_area, total_livable_area, type_heater, 
                          view_type, year_built, number_stories))
summary(reg1)
## 
## Call:
## lm(formula = sale_price ~ ., data = st_drop_geometry(philly.sf_filtered) %>% 
##     filter(number_of_bedrooms < 13) %>% dplyr::select(sale_price, 
##     exterior_condition, interior_condition, number_of_bathrooms, 
##     number_of_bedrooms, total_area, total_livable_area, type_heater, 
##     view_type, year_built, number_stories))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2521256   -82578   -12655    60090  3825388 
## 
## Coefficients:
##                         Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)         367739.59845  63229.63736   5.816      0.0000000061079 ***
## exterior_condition  -37451.65930   1887.05912 -19.847 < 0.0000000000000002 ***
## interior_condition  -26394.16316   1976.31186 -13.355 < 0.0000000000000002 ***
## number_of_bathrooms  69992.38887   2449.30053  28.576 < 0.0000000000000002 ***
## number_of_bedrooms  -41559.64236   1200.83142 -34.609 < 0.0000000000000002 ***
## total_area              -0.05222      0.33045  -0.158             0.874435    
## total_livable_area     216.49972      2.51828  85.971 < 0.0000000000000002 ***
## type_heater0         36254.74058  18521.03755   1.957             0.050302 .  
## type_heaterA         38553.45983   3173.42475  12.149 < 0.0000000000000002 ***
## type_heaterB         13393.14301   3739.78342   3.581             0.000343 ***
## type_heaterC         25736.27846  11129.93332   2.312             0.020767 *  
## type_heaterD        278963.64750  23014.73383  12.121 < 0.0000000000000002 ***
## type_heaterE         86043.92597  23535.90821   3.656             0.000257 ***
## type_heaterG         -2073.41856   6689.82181  -0.310             0.756612    
## type_heaterH          2302.29790   3131.02207   0.735             0.462153    
## view_type0          -84520.30980  26618.55376  -3.175             0.001499 ** 
## view_typeA           29405.09534  22013.74444   1.336             0.181640    
## view_typeB           63892.75561  31807.47220   2.009             0.044577 *  
## view_typeC          210729.11889  22714.28495   9.277 < 0.0000000000000002 ***
## view_typeD          -20698.92314  26335.83105  -0.786             0.431899    
## view_typeE          -46473.14493  26580.12568  -1.748             0.080405 .  
## view_typeH           16053.34952  25087.63073   0.640             0.522250    
## view_typeI           11350.53124  20898.37655   0.543             0.587046    
## year_built             -89.20768     29.89523  -2.984             0.002848 ** 
## number_stories       14605.88394   2212.10634   6.603      0.0000000000412 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 165600 on 23411 degrees of freedom
##   (251 observations deleted due to missingness)
## Multiple R-squared:  0.4957, Adjusted R-squared:  0.4952 
## F-statistic:   959 on 24 and 23411 DF,  p-value: < 0.00000000000000022

We divided number_stories into three groups (Up to 2 Floors, 3 Floors, 4+ Floors) and performed regression analysis. The results are as follows.

# Categorize Number of Stories
philly.sf_filtered <-
  philly.sf_filtered %>%
  mutate(number_stories.cat = case_when(
    number_stories >= 0 & number_stories < 3 ~ "Up to 2 Floors",
    number_stories >= 3 & number_stories < 4 ~ "3 Floors",
    number_stories > 4 ~ "4+ Floors"
  ))
reg2 <- lm(sale_price ~ ., data = 
            st_drop_geometry(philly.sf_filtered) %>%
            dplyr::select(sale_price, exterior_condition, interior_condition, 
                          number_of_bathrooms, number_of_bedrooms,
                          total_area, total_livable_area, type_heater, 
                          view_type, year_built, number_stories.cat))
summary(reg2)
## 
## Call:
## lm(formula = sale_price ~ ., data = st_drop_geometry(philly.sf_filtered) %>% 
##     dplyr::select(sale_price, exterior_condition, interior_condition, 
##         number_of_bathrooms, number_of_bedrooms, total_area, 
##         total_livable_area, type_heater, view_type, year_built, 
##         number_stories.cat))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2360228   -80731   -11461    59821  3870160 
## 
## Coefficients:
##                                     Estimate  Std. Error t value
## (Intercept)                      239825.2385  61007.0510   3.931
## exterior_condition               -36339.7202   1847.4969 -19.670
## interior_condition               -26039.5099   1929.7698 -13.494
## number_of_bathrooms               59553.8245   2349.8048  25.344
## number_of_bedrooms               -32765.3007   1112.4259 -29.454
## total_area                            0.7790      0.3256   2.392
## total_livable_area                  201.6932      2.6339  76.576
## type_heater0                      37546.8341  18181.0871   2.065
## type_heaterA                      38583.0065   3100.0228  12.446
## type_heaterB                       8422.3315   3646.0934   2.310
## type_heaterC                      29722.5686  10827.0012   2.745
## type_heaterD                     280510.5222  22446.7370  12.497
## type_heaterE                      87731.8702  23197.4335   3.782
## type_heaterG                       -835.0819   6526.1774  -0.128
## type_heaterH                       3139.2226   3039.3621   1.033
## view_type0                       -78211.5136  25977.4159  -3.011
## view_typeA                        36045.5687  21502.6498   1.676
## view_typeB                        70821.8996  31033.5056   2.282
## view_typeC                       218674.2867  22174.7120   9.861
## view_typeD                       -13321.0688  25705.3411  -0.518
## view_typeE                       -38733.7420  25945.9282  -1.493
## view_typeH                        21977.7716  24493.9795   0.897
## view_typeI                        15120.9048  20404.5295   0.741
## year_built                           13.7549     29.3130   0.469
## number_stories.cat4+ Floors      677963.5782  93438.7191   7.256
## number_stories.catUp to 2 Floors -53050.6479   3789.9301 -13.998
##                                              Pr(>|t|)    
## (Intercept)                         0.000084799494045 ***
## exterior_condition               < 0.0000000000000002 ***
## interior_condition               < 0.0000000000000002 ***
## number_of_bathrooms              < 0.0000000000000002 ***
## number_of_bedrooms               < 0.0000000000000002 ***
## total_area                                   0.016753 *  
## total_livable_area               < 0.0000000000000002 ***
## type_heater0                                 0.038919 *  
## type_heaterA                     < 0.0000000000000002 ***
## type_heaterB                                 0.020899 *  
## type_heaterC                                 0.006052 ** 
## type_heaterD                     < 0.0000000000000002 ***
## type_heaterE                                 0.000156 ***
## type_heaterG                                 0.898183    
## type_heaterH                                 0.301682    
## view_type0                                   0.002609 ** 
## view_typeA                                   0.093687 .  
## view_typeB                                   0.022492 *  
## view_typeC                       < 0.0000000000000002 ***
## view_typeD                                   0.604308    
## view_typeE                                   0.135486    
## view_typeH                                   0.369583    
## view_typeI                                   0.458667    
## year_built                                   0.638901    
## number_stories.cat4+ Floors         0.000000000000412 ***
## number_stories.catUp to 2 Floors < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 161600 on 23351 degrees of freedom
##   (328 observations deleted due to missingness)
## Multiple R-squared:  0.4998, Adjusted R-squared:  0.4992 
## F-statistic: 933.2 on 25 and 23351 DF,  p-value: < 0.00000000000000022

Accuracy of the Current Model

To assess the accuracy of the current model, we calculated the Mean Absolute Error (MAE) and Mean Absolute Percent Error (MAPE). To do this, the dataset was randomly divided into training data and testing data in a ratio of 6:4, and then the difference between the observed value and the actual value was calculated.

The MAE is the average of the absolute differences between predicted and observed values, while the MAPE is the average of the absolute percentage differences between predicted and observed values. The MAE and MAPE are useful metrics for evaluating the accuracy of regression models, as they provide a measure of how well the model predicts actual values.

inTrain <- createDataPartition(
              y = paste(philly.sf_filtered$number_stories.cat, 
                        philly.sf_filtered$view_type, philly.sf_filtered$heat_type), 
              p = .60, list = FALSE)
philly.training <- philly.sf_filtered[inTrain,] 
philly.test <- philly.sf_filtered[-inTrain,]  
reg.training <- 
  lm(sale_price ~ ., data = as.data.frame(philly.training) %>% 
                             dplyr::select(sale_price, exterior_condition, interior_condition, number_of_bathrooms, number_of_bedrooms, total_area, total_livable_area, type_heater, view_type, year_built, number_stories.cat))

summary(reg.training)
## 
## Call:
## lm(formula = sale_price ~ ., data = as.data.frame(philly.training) %>% 
##     dplyr::select(sale_price, exterior_condition, interior_condition, 
##         number_of_bathrooms, number_of_bedrooms, total_area, 
##         total_livable_area, type_heater, view_type, year_built, 
##         number_stories.cat))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2263573   -81244   -11784    59685  3058212 
## 
## Coefficients:
##                                     Estimate  Std. Error t value
## (Intercept)                      221253.3658  94284.9960   2.347
## exterior_condition               -36972.9817   2409.3830 -15.345
## interior_condition               -25774.9487   2520.4327 -10.226
## number_of_bathrooms               68178.1442   3026.0574  22.530
## number_of_bedrooms               -33208.3228   1441.2109 -23.042
## total_area                            1.8720      0.4726   3.961
## total_livable_area                  191.9226      3.5233  54.472
## type_heater0                      48055.3735  25517.0851   1.883
## type_heaterA                      34606.7146   4023.5892   8.601
## type_heaterB                       6748.6386   4689.9764   1.439
## type_heaterC                      20372.3193  14261.6151   1.428
## type_heaterD                     328732.5501  32862.8599  10.003
## type_heaterE                      85431.9992  26204.1878   3.260
## type_heaterG                       -519.8221   8490.7987  -0.061
## type_heaterH                        341.6382   3950.8861   0.086
## view_type0                       -57844.7989  32805.1269  -1.763
## view_typeA                        51438.6471  26920.0604   1.911
## view_typeB                        70498.8011  39118.7987   1.802
## view_typeC                       250486.9406  27834.1494   8.999
## view_typeD                         -203.5163  32517.9642  -0.006
## view_typeE                       -11670.7026  32814.1360  -0.356
## view_typeH                        30549.0509  30913.5975   0.988
## view_typeI                        32568.7706  25453.8469   1.280
## year_built                           16.5537     45.5838   0.363
## number_stories.cat4+ Floors      958613.1237 115017.1006   8.335
## number_stories.catUp to 2 Floors -50325.0509   4959.2903 -10.148
##                                              Pr(>|t|)    
## (Intercept)                                   0.01896 *  
## exterior_condition               < 0.0000000000000002 ***
## interior_condition               < 0.0000000000000002 ***
## number_of_bathrooms              < 0.0000000000000002 ***
## number_of_bedrooms               < 0.0000000000000002 ***
## total_area                                   0.000075 ***
## total_livable_area               < 0.0000000000000002 ***
## type_heater0                                  0.05969 .  
## type_heaterA                     < 0.0000000000000002 ***
## type_heaterB                                  0.15019    
## type_heaterC                                  0.15318    
## type_heaterD                     < 0.0000000000000002 ***
## type_heaterE                                  0.00112 ** 
## type_heaterG                                  0.95118    
## type_heaterH                                  0.93109    
## view_type0                                    0.07787 .  
## view_typeA                                    0.05605 .  
## view_typeB                                    0.07154 .  
## view_typeC                       < 0.0000000000000002 ***
## view_typeD                                    0.99501    
## view_typeE                                    0.72210    
## view_typeH                                    0.32307    
## view_typeI                                    0.20073    
## year_built                                    0.71650    
## number_stories.cat4+ Floors      < 0.0000000000000002 ***
## number_stories.catUp to 2 Floors < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 162500 on 14008 degrees of freedom
##   (195 observations deleted due to missingness)
## Multiple R-squared:  0.4965, Adjusted R-squared:  0.4956 
## F-statistic: 552.5 on 25 and 14008 DF,  p-value: < 0.00000000000000022
philly.test <-
  philly.test %>%
  mutate(Regression = "Baseline Regression",
         sale_price.Predict = predict(reg.training, philly.test),
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price) %>%
  filter(sale_price < 5000000) 

ggplot()+
  geom_histogram(data = philly.test, aes(x = sale_price.AbsError), binwidth = 10000, fill = "orange") +
  scale_x_continuous(breaks=seq(0, 1000000, by = 100000), )+
  coord_cartesian(xlim = c(0, 1000000)) +
  labs(title = "Distribution of Prediction Errors",
  subtitle = "",
  caption = "") +
  theme_minimal()

The result of the current model is as follows:
- Mean Absolute Error (MAE): $100,278.2
- Mean Absolute Percent Error (MAPE): 63.5%
* The result of knit can be different.

The error is not trivial given the mean sale_price of $271,929. This suggests that the current model may not be accurately capturing the underlying patterns in the data.

#Mean Absolute Error (MAE)
mean(philly.test$sale_price.AbsError, na.rm = T)
## [1] 100010.9
# Mean Absolute Percent Error (MAPE)
mean(philly.test$sale_price.APE, na.rm = T)
## [1] 0.6384693

The following plot shows that the model and perfect prediction. The orange line represents a perfect prediction (y=x), and the green line represents average prediction. This plot shows that predicted sale price was a little over estimated.

ggplot(data=philly.test, aes(sale_price.Predict, sale_price)) +
     geom_point(size = .5) + 
     geom_abline(intercept = 0, slope = 1, color = "#FA7800") +
     geom_smooth(method = "lm", se=F, colour = "#25CB10") +
    coord_cartesian(xlim = c(0, 4000000), ylim = c(0, 4000000)) +
     labs(title = "Price as a function of continuous variables",
          subtitle = "Orange line represents a perfect prrediction (y=x)\nGreen line represents average prediction") +
     theme_minimal()

Generalizability - ‘k-fold’ Cross-Validation

To assess the generalizability of the current model, we performed k-fold cross-validation with k=100. This algorithm involves splitting the dataset into 100 subsets, training the model on 99 subsets, and testing it on the remaining subset. This process is repeated 100 times, with each subset used as the test set once. The results are then averaged to provide an estimate of the model’s performance on new data.

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(sale_price ~ ., data = st_drop_geometry(philly.sf_filtered) %>% 
                                dplyr::select(sale_price, exterior_condition,                
                                           interior_condition, 
                                           number_of_bathrooms, number_of_bedrooms,
                                           total_area, total_livable_area, type_heater, 
                                           view_type, year_built, 
                                           number_stories.cat), 
     method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv
## Linear Regression 
## 
## 23705 samples
##    10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 23467, 23467, 23468, 23468, 23469, 23467, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   159484.7  0.5090659  99752.15
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

The distribution of MAE values is shown in the histogram below.

ggplot()+
  geom_histogram(data = reg.cv$resample, aes(x = MAE), binwidth = 10000, fill = "orange") +
  scale_x_continuous(breaks=seq(0, 600000, by = 100000), )+
  coord_cartesian(xlim = c(0, 600000)) +
  labs(title = "Distribution of MAE",
  subtitle = "k-fold cross-validation; k=100",
  caption = "") +
  theme_minimal()

The first plot shows the predicted sale price in the test set(philly.test), and the second plot shows the absolute sale price errors. No matter how accurate this model may be, we can see that some of the errors still remain clustered on the second map. It suggests that other factors with spatial correlations remain unmodeled. We will explore this further in the next step by applying spatial lag analysis and Moran’s I.

philly.test <- philly.test[!is.na(philly.test$sale_price.Predict),]
ggplot() + 
  geom_sf(data=nhoods, fill="#DDD", color="white")+ 
  geom_sf(data = philly.test, aes(color = q5(sale_price.Predict)),  
          show.legend = "point", size = 0.5, alpha = 1) +  
  scale_color_manual(values = palette5,
                     labels=qBr(philly.test, "sale_price.Predict"),
                     name="Quintile\nBreaks($/sqft)") + 
  labs(
    title = "Predicted Sale Price in Philadelphia",
    subtitle = "Assessment date: May 24, 2022-August 14, 2023",
    caption = "Data: U.S. Census Bureau, Zillow") +
  theme_void()

ggplot() + 
  geom_sf(data=nhoods, fill="#DDD", color="white")+ 
  geom_sf(data = philly.test, aes(color = q5(sale_price.AbsError)),  
          show.legend = "point", size = 0.5, alpha = 1) +  
  scale_color_manual(values = palette5,
                     labels=qBr(philly.test, "sale_price.AbsError"),
                     name="Quintile\nBreaks($/sqft)") + 
  labs(
    title = "Test Set Absolute Sale Price Errors in Philadelphia",
    subtitle = "Assessment date: May 24, 2022-August 14, 2023",
    caption = "Data: U.S. Census Bureau, Zillow") +
  theme_void()

Check if Prices and Errors cluster

(1) Spatial Lags

In this step, we calculated for each home sale, the average sale price of its k=5 nearest neighbors. This spatial lag variable, lagPrice, represents the average sale price of the five nearest neighbors of each home sale. We then used this variable to assess the spatial autocorrelation of sale prices and errors.

coords <- st_coordinates(philly.sf_filtered) 

neighborList <- knn2nb(knearneigh(coords, 5))

spatialWeights <- nb2listw(neighborList, style="W")

philly.sf_filtered$lagPrice <- lag.listw(spatialWeights, philly.sf_filtered$sale_price)
philly.test <- philly.test[!is.na(philly.test$sale_price.Error),]
coords.test <-  st_coordinates(philly.test) 
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W") 

There is still positive correlation between sale_price and lagPrice and between sale_price.Error and lagPriceError. This suggests that the model’s errors are clustered spatially, indicating that the model is not adequately capturing localized factors.

# Sale Price as a function of the Spatial Lag of Price
philly.test %>% 
  mutate(lagPrice = lag.listw(spatialWeights.test, sale_price)) %>% 
  ggplot(aes(lagPrice, sale_price))+
  geom_point(color = "orange") +
  geom_smooth(method = "lm", se=F, colour = "#25CB10") +
  labs(title = "Price as a function of the Spatial Lag of Price",
       x = "Spatial Lag of Price (mean price of 5 nearest neighbors)",
       y = "Sale Price") +
  theme_minimal()

# Error as a Function of the Spatial Lag of Price
philly.test %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)) %>% 
  ggplot(aes(lagPriceError, sale_price.Error))+
  geom_point(color = "orange") +
  geom_smooth(method = "lm", se=F, colour = "#25CB10") +
  labs(title = "Error as a Function of the Spatial Lag of Price",
       x = "Spatial Lag of Errors (mean error of 5 nearest neighbors)",
       y = "Sale Price Error") +
  theme_minimal()

We can calculate the Pearson’s R coefficient to test this a different way.

pearsons_test <- 
  philly.test %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error, NAOK=TRUE))

cor.test(pearsons_test$lagPriceError,
         pearsons_test$sale_price.Error, 
         method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  pearsons_test$lagPriceError and pearsons_test$sale_price.Error
## t = 71.832, df = 9341, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5832941 0.6094222
## sample estimates:
##       cor 
## 0.5965162

(2) Moran’s I

By using the Moran’s I statistic, we can assess the global degree of clustering or dispersion of sales price values. The result shows that Moran’s I = 0.396, which indicates positive spatial autocorrelation. It suggests that the model’s errors are clustered, often occurring in specific neighborhoods or regions. If Moran’s I is close to zero, it implies that errors are more randomly distributed, indicating the model might be performing well without significant spatial bias. To improve this prediction model, we need to find more spatial variables that can reduce the spatial autocorrelation of the errors.

moranTest <- moran.mc(philly.test$sale_price.Error, 
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "orange",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  theme_minimal()

3. Proposed OLS Model

Which Data to be Included

In Chapter 2, we discovered that more spatial variables need to be incorporated on the current Zillow model so that we can reduce the spatial autocorrelation of the errors, and finally improve prediction quality in accuracy and generalization. Therefore, We collected data from open sources such as the Philadelphia Open Data Portal, focusing on datasets that provide insight into the local character of neighborhoods.

We intentionally excluded factors like walkability because, although important in urban housing preferences, it may not be universally desirable, especially for families who might prefer less densely populated areas. The following table shows the data types we included and excluded, along with the reasons for our decisions.

# Create the data frame
data <- data.frame(
  Category = c("Safety & Crime", "Urban Planning & Zoning", "Gentrification Indicators", "Accessibility & Transit", "Environmental Quality", "Education", "Quality of Life", "Public Spaces", "Government & Infrastructure"),
  `Data Type` = c("Crime Data", "Zoning Regulations", "Gentrification Indicators", "Transit Accessibility", "Pollution Levels", "School Quality", "Walkability Index", "Access to Parks", "Government Investments"),
  `Included/Excluded` = c("Included", "Included", "Included", "Included", "Included", "Included", "Excluded", "Excluded", "Excluded"),
  Reason = c(
    "Safety is a critical factor for homeowners.",
    "Impacts property types, uses, and value.",
    "Gentrification trends affect property values.",
    "Commute to work.",
    "Pollution impacts health and desirability.",
    "The quality of local schools is a major consideration for families and directly impacts property values.",
    "Walkability is valued in urban settings but may not be a priority for all buyers, such as families preferring suburban environments.",
    "Access to parks and recreational spaces may have variable impact depending on demographics, making it less reliable for general predictions.",
    "Government resources and investments are often dispersed and long-term, making their direct impact on property values harder to measure."
  )
)

# Create the kable table
kable(data, "html", col.names = c("Category", "Data Type", "Included/Excluded", "Reason")) %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  column_spec(1:4, width = "20em") # Adjust column width for better readability
Category Data Type Included/Excluded Reason
Safety & Crime Crime Data Included Safety is a critical factor for homeowners.
Urban Planning & Zoning Zoning Regulations Included Impacts property types, uses, and value.
Gentrification Indicators Gentrification Indicators Included Gentrification trends affect property values.
Accessibility & Transit Transit Accessibility Included Commute to work.
Environmental Quality Pollution Levels Included Pollution impacts health and desirability.
Education School Quality Included The quality of local schools is a major consideration for families and directly impacts property values.
Quality of Life Walkability Index Excluded Walkability is valued in urban settings but may not be a priority for all buyers, such as families preferring suburban environments.
Public Spaces Access to Parks Excluded Access to parks and recreational spaces may have variable impact depending on demographics, making it less reliable for general predictions.
Government & Infrastructure Government Investments Excluded Government resources and investments are often dispersed and long-term, making their direct impact on property values harder to measure.

Summary Statistics Table

This table gives an overview of different features related to properties and their surroundings, grouped into categories like Property Details, Local Amenities, Neighborhood Layout, Crime Data, and Transit/Demographics.

The table breaks down various features of properties and neighborhoods into five main groups:

  • Property Details: How many bedrooms and bathrooms they have, and the year they were built.

  • Local Amenities: Features and services that might impact a property’s appeal, such as whether a property has central air or a garage.

  • Neighborhood Layout: Zoning rules and the shape of the property lots, giving insight into how the area is organized.

  • Crime Data: Frequency of crime incidents in different areas, which can help gauge the safety of a neighborhood.

  • Transit/Demographics: Access to public transportation and basic demographic trends like population levels.

file_path <- "~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/Cleaned_Comprehensive_Summary_Statistics.csv"

summary_stats <- read_csv(file_path)

summary_stats_cleaned <- summary_stats %>%
  drop_na()

summary_stats_cleaned %>%
  kable("html", caption = "Summary Statistics") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Summary Statistics
Variable Category Description count
total_livable_area Internal Characteristics Total livable area of the property (sqft) 23881
number_of_bathrooms Internal Characteristics Number of bathrooms 23622
number_of_bedrooms Internal Characteristics Number of bedrooms 23865
number_of_rooms Internal Characteristics Total number of rooms 666
number_stories Internal Characteristics Number of stories 23873
year_built Internal Characteristics Year the property was built 23881
zoning Spatial Structure Zoning classification of the parcel 23881
parcel_shape Spatial Structure Shape type of the parcel (e.g., rectangular, irregular) 23881
Thefts Crime Frequency of reported incidents for each crime type 29413
Other Assaults Crime Frequency of reported incidents for each crime type 20172
All Other Offenses Crime Frequency of reported incidents for each crime type 10846
Motor Vehicle Theft Crime Frequency of reported incidents for each crime type 10468
Vandalism/Criminal Mischief Crime Frequency of reported incidents for each crime type 10424
Theft from Vehicle Crime Frequency of reported incidents for each crime type 7892
Aggravated Assault No Firearm Crime Frequency of reported incidents for each crime type 4159
Burglary Residential Crime Frequency of reported incidents for each crime type 2521
Narcotic / Drug Law Violations Crime Frequency of reported incidents for each crime type 2262
Aggravated Assault Firearm Crime Frequency of reported incidents for each crime type 2117
Weapon Violations Crime Frequency of reported incidents for each crime type 2040

6 Newly Included Data and Visualization

We collected 6 additional datasets including crime data in 2018-2023, crime dispatch incidents in 2024, location of subway stations, schools, public parks and recreation areas from the Philadelphia Open Data portal the U.S. Census Bureau (ACS data) or other data source. We also cleaned data to exclude records with missing or erroneous geographic points. We created new variables, such as price per square foot, number of schools per tract, which are crucial for our model.

# Load Philadelphia Zillow Data
philly.sf <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/studentData.geojson") %>% 
#  as.data.frame() %>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728') 

nhoods <-  
  get_acs(geography = "tract",
          variables = c("B01003_001E","B02001_002E",
                        "B06011_001E"), 
          year=2020, state="PA",
          county="Philadelphia", geometry=TRUE) %>%
  st_transform(crs = 'ESRI:102728') %>%
  separate(NAME, into = c("Census_Tract", "City_State"), sep = ", ", extra = "merge") %>%
  mutate(
    Census_Tract = gsub("Census Tract ", "", Census_Tract),
    City_State = gsub(" County, Pennsylvania", "", City_State)) %>%
  dplyr::select(  -moe)%>%
  spread(key = variable, value = estimate) %>%
  rename(TotalPop = B01003_001,
         NumberWhites = B02001_002,
         Median_Income = B06011_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(Median_Income > 35322, "High Income", "Low Income"))
## Getting data from the 2016-2020 5-year ACS
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
# Load Police District Data in Philadelphia
philly.police <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/police_districts.geojson") %>% 
#  as.data.frame() %>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728') 

# Load School Data in Philadelphia
philly.school <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/Schools.geojson") %>% 
#  as.data.frame() %>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728') 

# Load Green Area (Public Parks Recreation) Data in Philadelphia
philly.PPR <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/PPR_Properties.geojson") %>% 
#  as.data.frame() %>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728') 

# Load Shooting Incident Data in Philadelphia
philly.shooting <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/shootings.geojson") %>% 
#  as.data.frame() %>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728') 

ppr_properties_sf <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/PPR_Properties.geojson") %>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728')

police_districts_sf <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/police_districts.geojson") %>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728')

shootings_sf <- st_read("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/shootings.geojson")%>%
  st_as_sf(crs = 4326)%>%
  st_transform(crs = 'ESRI:102728')

# Shooting Incident Data 
shootingPol <- st_join(philly.shooting, nhoods, left = FALSE)
shootingPol_count <- shootingPol %>%
  group_by(Census_Tract) %>%  
  summarise(shooting_count = n())  
nhoods_b <- left_join(nhoods, st_drop_geometry(shootingPol_count), by = "Census_Tract") %>%
  st_sf() %>%
  mutate(shooting_count = ifelse(is.na(shooting_count), 0, shooting_count)) 
  #ggplot()+
  #  geom_sf(data=nhoods_b, aes(fill=shooting_count))

# Green area (Public Parks Recreation) data
PPRPol <- st_join(philly.PPR, nhoods_b, left = FALSE)
PPRPol_count <- PPRPol %>%
  group_by(Census_Tract) %>%  
  summarise(PPR_count = n())  
nhoods_c <- left_join(nhoods_b, st_drop_geometry(PPRPol_count), by = "Census_Tract") %>%
  st_sf() %>%
  mutate(PPR_count = ifelse(is.na(PPR_count), 0, PPR_count)) 
  #ggplot()+
  #  geom_sf(data=nhoods_c, aes(fill=PPR_count))

# Number of school in each tract data
schoolPol <- st_join(philly.school, nhoods_c, left = FALSE)
schoolPol_count <- schoolPol %>%
  group_by(Census_Tract) %>%  
  summarise(school_count = n())  
nhoods_d <- left_join(nhoods_c, st_drop_geometry(schoolPol_count), by = "Census_Tract") %>%
  st_sf() %>%
  mutate(PPR_count = ifelse(is.na(school_count), 0, school_count)) 
  #ggplot()+
  #  geom_sf(data=nhoods_d, aes(fill=school_count))

# Load Crime Data in Philadelphia
philly.crime <- read.csv("~/Documents/Public Policy/Midterm_Agarwal_Jun/Data/CrimeData.csv") 
philly.crime <- subset(philly.crime, the_geom!='0101000020E6100000A5A31CCC262054C0A8BE77C4B61C4540') #error
philly.crime <- subset(philly.crime, the_geom!='0101000020E6100000E3D840DB262054C0CCE8EC09B71C4540') #error
philly.crime <- st_as_sf(philly.crime, coords = c("lng", "lat"), crs = 4326) %>%
  st_transform(crs = 'ESRI:102728') 

# Crime Data 
crimePol <- st_join(philly.crime, nhoods_d, left = FALSE)
crimePol_count <- crimePol %>%
  group_by(Census_Tract) %>%  
  summarise(crime_count = n())  
nhoods_e <- left_join(nhoods_d, st_drop_geometry(crimePol_count), by = "Census_Tract") %>%
  st_sf() %>%
  mutate(crime_count = ifelse(is.na(crime_count), 0, crime_count)) 
  #ggplot()+
  # geom_sf(data=nhoods_e, aes(fill=crime_count))

(1) Number of Public Parks/Recreation Areas by Tract in Philadelphia

The map shows different types of alongside public parks and recreational areas within the city. Access to green spaces and recreational amenities is a significant driver of housing desirability and value.

ggplot() + 
  geom_sf(data=st_union(nhoods), fill="#DDD", color="white")+
  geom_sf(data = ppr_properties_sf, fill = "green", color = "darkgreen", shape = 21, size = 1, alpha = 0.8) + # PPR properties
  labs(
    title = "Public Parks/Recreation (PPR) Areas in Philadelphia",
    subtitle = "Visualizing the distribution of PPR properties within city limits",
    caption = "Data: Philadelphia Planning Department, PPR, and City Limits") +
  theme_void() +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )

(2) Police Districts and Population Density in Philadelphia

This map overlays police districts with population density information for each census tract, illustrating how densely populated areas align with police resources. Population density often correlates with housing demand, as denser areas tend to have higher property values due to limited space. Understanding how law enforcement resources align with population clusters can also offer insights into perceptions of safety and how these perceptions might impact local housing prices.

philly_population <- get_acs(geography = "tract",
                             variables = "B01003_001E", # Total population variable
                             year = 2020, state = "PA",
                             county = "Philadelphia", geometry = TRUE) %>%
  st_transform(crs = 'ESRI:102728')

philly_population$area_sq_km <- set_units(st_area(philly_population), "km^2")

# Calculate population density (people per square kilometer) and ensure consistency in units
philly_population$population_density <- philly_population$estimate / drop_units(philly_population$area_sq_km)

# Plot the map showing police districts and population density
ggplot() + 
  geom_sf(data = philly_population, aes(fill = population_density), color = NA) +
  scale_fill_viridis_c(name = "Population Density\n(per sq km)", option = "inferno") +
  geom_sf(data = police_districts_sf, fill = NA, color = "red", size = 0.8) +
  labs(
    title = "Police Districts and Population Density in Philadelphia",
    subtitle = "Population Density per Census Tract with Police District Boundaries",
    caption = "Data: U.S. Census Bureau, Philadelphia Open Data Portal") +
  theme_void()

(3) Density of Shooting Incidents in Philadelphia with Police Districts Overlay

This map illustrates the concentration of shooting incidents and overlays police district boundaries to see how shootings align with law enforcement zones. Identifying high-risk areas helps us understand which neighborhoods might need more intervention or are perceived as less safe, directly influencing property values.

# Create the density map with police districts
ggplot() + 
  stat_density2d(data = data.frame(st_coordinates(shootings_sf)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..), 
                 size = 0.01, bins = 40, geom = 'polygon') +
  geom_sf(data = police_districts_sf, fill = NA, color = "black", linetype = "solid", size = 0.5) + # Police districts overlay
  scale_fill_gradient(low = "lightblue", high = "darkblue", 
                      breaks = c(0.000000002, 0.00000002),
                      labels = c("Minimum", "Maximum"),
                      name = "Density") +
  scale_alpha(range = c(0.00, 0.50), guide = "none") +
  labs(title = "Density of Shooting Incidents in Philadelphia",
       subtitle = "Overlaid with Police District Boundaries",
       caption = "Data: Philadelphia Open Data Portal") +
  theme_void()

(4) TOD (public transit) in Philadelphia

This map shows the areas within 0.5 miles (2,641 ft) of the Market-Frankford Line (MFL) and Broad Street Line, which are key public transit routes in Philadelphia. As we discovered on the last assignment, properties near public transit are often more desirable, as they offer convenient access to transportation and amenities, which can drive up property values.

# TOD Data 
MFL <- st_read("https://opendata.arcgis.com/datasets/8c6e2575c8ad46eb887e6bb35825e1a6_0.geojson")
## Reading layer `SEPTA_-_Market-Frankford_Line_Stations' from data source 
##   `https://opendata.arcgis.com/datasets/8c6e2575c8ad46eb887e6bb35825e1a6_0.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 28 features and 45 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -75.25973 ymin: 39.94989 xmax: -75.07788 ymax: 40.02299
## Geodetic CRS:  WGS 84
Broad_St <- st_read("https://opendata.arcgis.com/datasets/2e9037fd5bef406488ffe5bb67d21312_0.geojson")
## Reading layer `SEPTA_-_Broad_Street_Line_Stations' from data source 
##   `https://opendata.arcgis.com/datasets/2e9037fd5bef406488ffe5bb67d21312_0.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 24 features and 45 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -75.17394 ymin: 39.90543 xmax: -75.13679 ymax: 40.04185
## Geodetic CRS:  WGS 84
septaStops <- 
  rbind(
     MFL %>% 
      mutate(Line = "MFL") %>%
      dplyr::select(Station, Line),
     Broad_St %>%
      mutate(Line ="Broad_St") %>%
      dplyr::select(Station, Line)) %>%
  st_transform(st_crs(nhoods_e))  

stopBuffer <- st_buffer(septaStops, 2641)
stopUnion <- st_union(st_buffer(septaStops, 2641))
buffer <- stopUnion %>%
      st_sf() 
selectCentroids <-
  st_centroid(nhoods_e)[buffer,] %>%
  st_drop_geometry() %>%
  left_join(., dplyr::select(nhoods_e, GEOID), by = "GEOID") %>%
  st_sf() 

selectCentroids_union <- st_union(selectCentroids)%>%
  st_sf() 

nhoods_f <- 
  rbind(
    st_centroid(nhoods_e)[buffer,] %>%
      st_drop_geometry() %>%
      left_join(nhoods_e) %>%
      st_sf() %>%
      mutate(TOD = "TOD"),
    st_centroid(nhoods_e)[buffer, op = st_disjoint] %>%
      st_drop_geometry() %>%
      left_join(nhoods_e) %>%
      st_sf() %>%
      mutate(TOD = "Non-TOD")) 
## Plot TOD
ggplot() + 
  geom_sf(data=nhoods_f, fill="#DDD", color="white")+
  geom_sf(data=septaStops, 
          aes(colour = Line), 
          show.legend = "point", size= 1.5) +
  scale_colour_manual(values = c("orange","blue")) +
  geom_sf(data=selectCentroids_union, fill='transparent', color='red')+
  labs(title = "TOD area in Philadelphia",
    subtitle = "0.5 miles from the MFL and Broad St Line",
    caption = "Data: U.S. Census Bureau, SEPTA") +
  theme_void()

(5) Number of Crime Dispatch by Tract in Philadelphia

This map highlights areas where crime incidents are more frequent, providing a visual representation of crime density across the city. Safety is a critical factor in homebuyer decisions—people prefer safer neighborhoods, which drives up property values there. By mapping crime density, we can factor safety into our model, predicting how values might increase in safer areas or decrease in higher-crime zones.

## Plot Density of Crime Dispatch
ggplot() + 
  geom_sf(data=nhoods, fill="#DDD", color="white")+ 
  stat_density2d(data = data.frame(st_coordinates(philly.crime)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#25CB10", high = "#FA7800", 
                     breaks=c(0.000000002,0.00000002),
                     labels=c("Minimum", "Maximum"),
                      name = "Density") +
  scale_alpha(range = c(0.00, 0.50), guide = "none") +
  labs(title = "Density of Crime Dispach in Philadelphia",
    subtitle = "Date: January 1, 2024-October 1, 2024",
    caption = "Data: U.S. Census Bureau, Open Data Philly") +
  theme_void()

(6) Number of School by Tract in Philadelphia

This map displays how close properties are to schools, which is a key factor for families seeking homes. Properties closer to schools often become more desirable, increasing demand. Families place a premium on properties near quality schools, making school proximity a crucial factor in price determination.

## Plot School
ggplot() + 
  geom_sf(data=nhoods_f, aes(fill=school_count), color="white")+
  scale_fill_viridis_c(name = "School Count", option = "inferno") +
  labs(title = "Number of Schools by Tract in Philadelphia",
    caption = "Data: U.S. Census Bureau, OpenData Philly") +
  theme_void()

Multiple Linear Regression Model Results

The following plots show the relationship between sale price and various continuous variables. New data includes crime count, school count, shooting count, and public park and recreation data. The plots show that sale price is positively correlated with total area, while it is negatively correlated with age, shooting count, public parks and recreation areas and crime count. The plots also show that sale price is not significantly correlated with school count.

philly.sf_New <- philly.sf %>%
  filter(sale_price < 5000000 & toPredict == "MODELLING") %>%
  mutate(pricepersq = ifelse(total_area > 10, sale_price / total_area, 0)) 
philly.sf_New <- philly.sf_New[!is.na(as.numeric(philly.sf_New$pricepersq)), ]
philly.sf_New <- st_join(philly.sf_New, nhoods_f) # Crime Data, School Data, Green Area Data, Shooting Incident Data, TOD Data
philly.sf_New <- st_join(philly.sf_New, philly.police) # Police District Data
philly.sf_New <- st_join(philly.sf_New, neighborhoods) # Neighborhood Data
philly.sf_New <- philly.sf_New %>%
  filter(name != "East Park" & name != "Wissahickon Park" & name != "Chinatown" & name != "Mechanicsville" & name != "University City" & name != "Pennypack Park")

# Price as a function of continuous variables
philly.sf_New %>%
  st_drop_geometry() %>%
  mutate(Age = 2024 - year_built) %>%
  dplyr::select(sale_price, total_area, Age, shooting_count, PPR_count, school_count, crime_count) %>%
  filter(sale_price <= 5000000, total_area <= 10000, 
         Age < 500) %>% # Filter out crazy outliers
  gather(Variable, Value, -sale_price) %>% 
   ggplot(aes(Value, sale_price)) +
     geom_point(size = .5) + 
     geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Price as a function of continuous variables") +
     theme_minimal()

The following plots represents the difference in sale price by race, income, and whether it is a TOD area. The mean sale price in majority white was significantly higher than that in majority non-white, and the mean sale price in high-income areas was significantly higher than that in low-income areas. Meanwhile, the difference between TOD and non-TOD areas was not large, but the mean sale price in TOD areas was slightly higher.

# Price as a function of categorical variables

philly.sf_New <- philly.sf_New %>%
  mutate(
    quality_grade_new = case_when(
      quality_grade == "1 " ~ "S ",  
      quality_grade == "2 " ~ "A ",  
      quality_grade == "3 " ~ "B ",  
      quality_grade == "4 " ~ "C ", 
      quality_grade == "5 " ~ "D ", 
      quality_grade == "6 " ~ "E ", 
      TRUE ~ as.character(quality_grade)        
    )
  )

philly.sf_New$quality_grade_new <- factor(philly.sf_New$quality_grade_new, levels = c("S+", "S ", "A+", "A ", "A-", "B+", "B ", "B-", "C+", "C ", "C-", "D+", "D ", "D-", "E+", "E ", "E-"))
# levels(philly.sf_filtered$quality_grade_new)
philly.sf_New %>%
  st_drop_geometry() %>%
  dplyr::select(sale_price, incomeContext, raceContext, TOD) %>%
  filter(sale_price <= 5000000, !is.na(incomeContext), !is.na(raceContext)) %>% # Filter out crazy outliers
  gather(Variable, Value, -sale_price) %>%
  ggplot(aes(Value, sale_price)) +
    geom_bar(position = "dodge", stat="summary", fun="mean")+
    facet_wrap(~Variable, ncol = 3, scales = "free") +
    labs(title = "Price as a function of categorical variables", y="Mean_Price") +
    plotTheme() + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))+
    theme_minimal()

The following chart is correlation plot that shows the relationship between numeric variables.

numericVars <- 
  select_if(st_drop_geometry(philly.sf_New), is.numeric) %>% na.omit()

ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  insig = "blank") +  
    labs(title = "Correlation across numeric variables") 

The following regression denotes that sale_price, exterior_condition, interior_condition, number_of_bathrooms, number_of_bedrooms, total_area, total_livable_area, year_built, number_stories shooting_count, crime_count, and TOD were found to be correlated with sale_price, while type_heater,view_type, and PPR_count showed no clear evidence of correlation with sale price. Because there were many missing values in school_count, regression values could not be obtained.

philly.sf_New$school_count <- ifelse(is.na(philly.sf_New$school_count), 0, philly.sf_New$school_count)
# is.na(philly.sf_New$school_count)
reg1 = lm(sale_price ~ ., data = 
            st_drop_geometry(philly.sf_New) %>%
            dplyr::select(sale_price, exterior_condition, interior_condition, 
                          number_of_bathrooms, number_of_bedrooms,
                          total_area, total_livable_area, type_heater, 
                          view_type, year_built, number_stories, shooting_count, PPR_count, school_count, crime_count, TOD))
summary(reg1)
## 
## Call:
## lm(formula = sale_price ~ ., data = st_drop_geometry(philly.sf_New) %>% 
##     dplyr::select(sale_price, exterior_condition, interior_condition, 
##         number_of_bathrooms, number_of_bedrooms, total_area, 
##         total_livable_area, type_heater, view_type, year_built, 
##         number_stories, shooting_count, PPR_count, school_count, 
##         crime_count, TOD))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2252262   -70998   -11252    49989  3810993 
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)         553722.0670  59464.2513   9.312 < 0.0000000000000002 ***
## exterior_condition  -26700.2433   1766.2094 -15.117 < 0.0000000000000002 ***
## interior_condition  -28613.4919   1838.5652 -15.563 < 0.0000000000000002 ***
## number_of_bathrooms  60913.8737   2227.6278  27.345 < 0.0000000000000002 ***
## number_of_bedrooms  -26451.6802   1066.2799 -24.807 < 0.0000000000000002 ***
## total_area              -0.3317      0.3102  -1.069             0.284954    
## total_livable_area     202.2441      2.3572  85.799 < 0.0000000000000002 ***
## type_heater0          4063.2541  17236.1603   0.236             0.813636    
## type_heaterA         12119.8467   2988.9752   4.055        0.00005032877 ***
## type_heaterB         -9275.0159   3500.9291  -2.649             0.008071 ** 
## type_heaterC         -6897.3368  10363.6578  -0.666             0.505717    
## type_heaterD        231329.2464  21428.4686  10.795 < 0.0000000000000002 ***
## type_heaterE         45462.8442  21902.9804   2.076             0.037938 *  
## type_heaterG         -7624.3958   6226.8943  -1.224             0.220802    
## type_heaterH          3397.9300   2917.0122   1.165             0.244085    
## view_type0          -62224.7060  24799.2999  -2.509             0.012110 *  
## view_typeA            4966.6924  20513.1679   0.242             0.808688    
## view_typeB           20229.3844  29639.8403   0.683             0.494925    
## view_typeC          182173.5961  21143.7392   8.616 < 0.0000000000000002 ***
## view_typeD          -20369.5147  24504.0534  -0.831             0.405829    
## view_typeE          -50831.8708  24740.2644  -2.055             0.039927 *  
## view_typeH            7967.1794  23334.9414   0.341             0.732785    
## view_typeI            3073.2911  19439.1740   0.158             0.874381    
## year_built            -167.2654     28.1106  -5.950        0.00000000272 ***
## number_stories       11842.0226   2062.0367   5.743        0.00000000942 ***
## shooting_count       -1121.5567     27.1865 -41.254 < 0.0000000000000002 ***
## PPR_count            -3983.8069    904.6838  -4.404        0.00001069728 ***
## school_count                 NA          NA      NA                   NA    
## crime_count            -28.8719      8.0289  -3.596             0.000324 ***
## TODTOD               61117.0594   2509.8704  24.351 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 154000 on 23403 degrees of freedom
##   (266 observations deleted due to missingness)
## Multiple R-squared:  0.5633, Adjusted R-squared:  0.5628 
## F-statistic:  1078 on 28 and 23403 DF,  p-value: < 0.00000000000000022

We divided number_stories into three groups (Up to 2 Floors, 3 Floors, 4+ Floors) and performed regression analysis. The results are as follows.

# Categorize Number of Stories
philly.sf_New <-
  philly.sf_New %>%
  mutate(number_stories.cat = case_when(
    number_stories >= 0 & number_stories < 3 ~ "Up to 2 Floors",
    number_stories >= 3 & number_stories < 4 ~ "3 Floors",
    number_stories > 4 ~ "4+ Floors"
  ))
reg2 <- lm(sale_price ~ ., data = 
            st_drop_geometry(philly.sf_New) %>%
            dplyr::select(sale_price, exterior_condition, interior_condition, 
                          number_of_bathrooms, number_of_bedrooms,
                          total_area, total_livable_area, type_heater, 
                          view_type,  year_built, number_stories.cat, shooting_count, PPR_count, school_count, crime_count, TOD))
summary(reg2)
## 
## Call:
## lm(formula = sale_price ~ ., data = st_drop_geometry(philly.sf_New) %>% 
##     dplyr::select(sale_price, exterior_condition, interior_condition, 
##         number_of_bathrooms, number_of_bedrooms, total_area, 
##         total_livable_area, type_heater, view_type, year_built, 
##         number_stories.cat, shooting_count, PPR_count, school_count, 
##         crime_count, TOD))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2151905   -68687   -10403    49506  3840631 
## 
## Coefficients: (1 not defined because of singularities)
##                                     Estimate  Std. Error t value
## (Intercept)                      463466.3695  57331.6626   8.084
## exterior_condition               -25926.5733   1725.1513 -15.029
## interior_condition               -27917.1888   1792.3469 -15.576
## number_of_bathrooms               57670.1015   2183.3574  26.413
## number_of_bedrooms               -23579.4317   1043.7718 -22.591
## total_area                            0.1719      0.3048   0.564
## total_livable_area                  192.3956      2.4512  78.491
## type_heater0                       7083.0815  16883.1429   0.420
## type_heaterA                      12617.4026   2913.1631   4.331
## type_heaterB                     -11872.8520   3404.3831  -3.488
## type_heaterC                      -6986.3475  10068.7071  -0.694
## type_heaterD                     232163.4261  20855.9064  11.132
## type_heaterE                      49071.3715  21540.8822   2.278
## type_heaterG                      -5429.5187   6062.0646  -0.896
## type_heaterH                       5355.2005   2826.8743   1.894
## view_type0                       -59252.5331  24150.1624  -2.454
## view_typeA                         9903.6023  19994.8658   0.495
## view_typeB                        22838.0764  28859.1103   0.791
## view_typeC                       186366.8527  20598.7134   9.047
## view_typeD                       -14437.6299  23866.2115  -0.605
## view_typeE                       -44611.4126  24097.9268  -1.851
## view_typeH                        12723.9461  22734.7448   0.560
## view_typeI                         5967.5524  18939.7381   0.315
## year_built                          -93.2800     27.5116  -3.391
## number_stories.cat4+ Floors      734862.5603  86734.7612   8.473
## number_stories.catUp to 2 Floors -36718.5589   3541.2890 -10.369
## shooting_count                    -1092.3269     26.5004 -41.219
## PPR_count                         -3603.6422    881.5378  -4.088
## school_count                              NA          NA      NA
## crime_count                         -28.7629      7.8262  -3.675
## TODTOD                            55536.9422   2458.9433  22.586
##                                              Pr(>|t|)    
## (Intercept)                      0.000000000000000657 ***
## exterior_condition               < 0.0000000000000002 ***
## interior_condition               < 0.0000000000000002 ***
## number_of_bathrooms              < 0.0000000000000002 ***
## number_of_bedrooms               < 0.0000000000000002 ***
## total_area                                   0.572825    
## total_livable_area               < 0.0000000000000002 ***
## type_heater0                                 0.674829    
## type_heaterA                     0.000014893813424129 ***
## type_heaterB                                 0.000488 ***
## type_heaterC                                 0.487772    
## type_heaterD                     < 0.0000000000000002 ***
## type_heaterE                                 0.022732 *  
## type_heaterG                                 0.370446    
## type_heaterH                                 0.058186 .  
## view_type0                                   0.014154 *  
## view_typeA                                   0.620388    
## view_typeB                                   0.428739    
## view_typeC                       < 0.0000000000000002 ***
## view_typeD                                   0.545225    
## view_typeE                                   0.064145 .  
## view_typeH                                   0.575710    
## view_typeI                                   0.752703    
## year_built                                   0.000699 ***
## number_stories.cat4+ Floors      < 0.0000000000000002 ***
## number_stories.catUp to 2 Floors < 0.0000000000000002 ***
## shooting_count                   < 0.0000000000000002 ***
## PPR_count                        0.000043674366054245 ***
## school_count                                       NA    
## crime_count                                  0.000238 ***
## TODTOD                           < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 150000 on 23340 degrees of freedom
##   (328 observations deleted due to missingness)
## Multiple R-squared:  0.5687, Adjusted R-squared:  0.5682 
## F-statistic:  1061 on 29 and 23340 DF,  p-value: < 0.00000000000000022

Accuracy of the Proposed Model

As the same method in Chapter 2, we split our dataset into training and testing sets to evaluate our model’s performance. We used a 60/40 split for training and testing respectively.

The result of the proposed model is as follows:
- Mean Absolute Error (MAE): $86,664.03
- Mean Absolute Percent Error (MAPE): 53.3%
* The result of knit can be different.

M In Chapter 2, MAE of the current model was $100,278 and MAPE of the current model was 63.5%. We can say that the proposed model is more accurate than the current model.

inTrain <- createDataPartition(
              y = paste(philly.sf_New$number_stories.cat, 
                        philly.sf_New$view_type, philly.sf_New$heat_type), 
              p = .60, list = FALSE)

philly.training <- philly.sf_New[inTrain,] 
philly.test <- philly.sf_New[-inTrain,]  
reg.training <- 
  lm(sale_price ~ ., data = as.data.frame(philly.training) %>% 
                             dplyr::select(sale_price, exterior_condition, interior_condition, 
                                           number_of_bathrooms, number_of_bedrooms,
                                           total_area, total_livable_area, type_heater, 
                                           view_type, year_built, number_stories.cat, shooting_count, PPR_count, school_count, crime_count, TOD))

summary(reg.training)
## 
## Call:
## lm(formula = sale_price ~ ., data = as.data.frame(philly.training) %>% 
##     dplyr::select(sale_price, exterior_condition, interior_condition, 
##         number_of_bathrooms, number_of_bedrooms, total_area, 
##         total_livable_area, type_heater, view_type, year_built, 
##         number_stories.cat, shooting_count, PPR_count, school_count, 
##         crime_count, TOD))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1074673   -68580    -9827    50952  3812301 
## 
## Coefficients: (1 not defined because of singularities)
##                                      Estimate   Std. Error t value
## (Intercept)                      490137.34729  79557.55234   6.161
## exterior_condition               -28919.35845   2256.62974 -12.815
## interior_condition               -24841.55802   2366.35275 -10.498
## number_of_bathrooms               56226.61177   2821.43041  19.928
## number_of_bedrooms               -23902.67515   1349.35375 -17.714
## total_area                            0.07366      0.36953   0.199
## total_livable_area                  196.93821      3.18435  61.846
## type_heater0                       6551.31164  22476.65009   0.291
## type_heaterA                      14042.50489   3791.35342   3.704
## type_heaterB                     -11951.99850   4395.60560  -2.719
## type_heaterC                       4951.23661  13903.87959   0.356
## type_heaterD                     203546.11393  27558.43175   7.386
## type_heaterE                      15365.67297  25743.32118   0.597
## type_heaterG                        -74.21530   7947.24310  -0.009
## type_heaterH                       8117.54188   3682.87989   2.204
## view_type0                       -55009.72618  31875.84085  -1.726
## view_typeA                         8006.77022  26563.89203   0.301
## view_typeB                        18499.71463  37468.51683   0.494
## view_typeC                       175299.09257  27330.83944   6.414
## view_typeD                       -25057.70347  31466.92885  -0.796
## view_typeE                       -46509.41187  31751.29592  -1.465
## view_typeH                       -11260.21426  29976.84517  -0.376
## view_typeI                        -5480.26176  25189.57255  -0.218
## year_built                         -103.97253     37.75948  -2.754
## number_stories.cat4+ Floors      970997.60613 107180.06165   9.059
## number_stories.catUp to 2 Floors -37169.59740   4623.52985  -8.039
## shooting_count                    -1081.11325     34.30396 -31.516
## PPR_count                         -4548.45351   1158.22264  -3.927
## school_count                               NA           NA      NA
## crime_count                         -27.71435     10.11622  -2.740
## TODTOD                            59215.81273   3206.96427  18.465
##                                              Pr(>|t|)    
## (Intercept)                      0.000000000743659306 ***
## exterior_condition               < 0.0000000000000002 ***
## interior_condition               < 0.0000000000000002 ***
## number_of_bathrooms              < 0.0000000000000002 ***
## number_of_bedrooms               < 0.0000000000000002 ***
## total_area                                   0.842013    
## total_livable_area               < 0.0000000000000002 ***
## type_heater0                                 0.770695    
## type_heaterA                                 0.000213 ***
## type_heaterB                                 0.006554 ** 
## type_heaterC                                 0.721768    
## type_heaterD                     0.000000000000159879 ***
## type_heaterE                                 0.550597    
## type_heaterG                                 0.992549    
## type_heaterH                                 0.027531 *  
## view_type0                                   0.084415 .  
## view_typeA                                   0.763102    
## view_typeB                                   0.621497    
## view_typeC                       0.000000000146336550 ***
## view_typeD                                   0.425860    
## view_typeE                                   0.142997    
## view_typeH                                   0.707197    
## view_typeI                                   0.827775    
## year_built                                   0.005903 ** 
## number_stories.cat4+ Floors      < 0.0000000000000002 ***
## number_stories.catUp to 2 Floors 0.000000000000000976 ***
## shooting_count                   < 0.0000000000000002 ***
## PPR_count                        0.000086389376607873 ***
## school_count                                       NA    
## crime_count                                  0.006159 ** 
## TODTOD                           < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 151400 on 13996 degrees of freedom
##   (199 observations deleted due to missingness)
## Multiple R-squared:  0.5695, Adjusted R-squared:  0.5686 
## F-statistic: 638.4 on 29 and 13996 DF,  p-value: < 0.00000000000000022
philly.test <-
  philly.test %>%
  mutate(Regression = "Baseline Regression",
         sale_price.Predict = predict(reg.training, philly.test),
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price) %>%
  filter(sale_price < 5000000) 

ggplot()+
  geom_histogram(data = philly.test, aes(x = sale_price.AbsError), binwidth = 10000, fill = "orange") +
  scale_x_continuous(breaks=seq(0, 1000000, by = 100000), )+
  coord_cartesian(xlim = c(0, 1000000)) +
  labs(title = "Distribution of Prediction Errors",
  subtitle = "",
  caption = "") +
  theme_minimal()

#Mean Absolute Error (MAE)
mean(philly.test$sale_price.AbsError, na.rm = T)
## [1] 87459.48
# Mean Absolute Percent Error (MAPE)
mean(philly.test$sale_price.APE, na.rm = T)
## [1] 0.5351495

The following plot shows that the model and perfect prediction. The orange line represents a perfect prediction (y=x), and the green line represents average prediction. This plot shows that predicted sale price with proposed model was still a little over estimated.

ggplot(data=philly.test, aes(sale_price.Predict, sale_price)) +
     geom_point(size = .5) + 
     geom_abline(intercept = 0, slope = 1, color = "#FA7800") +
     geom_smooth(method = "lm", se=F, colour = "#25CB10") +
    coord_cartesian(xlim = c(0, 4000000), ylim = c(0, 4000000)) +
     labs(title = "Price as a function of continuous variables") +
     theme_minimal()

Generalizability - ‘k-fold’ Cross-Validation

Compared to the current prediction model, the predicted sale price is slightly higher and the error is reduced. However, the geographical distribution of moods is almost the same.

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(sale_price ~ ., data = st_drop_geometry(philly.sf_New) %>% 
                                dplyr::select(sale_price, exterior_condition, interior_condition, 
                                           number_of_bathrooms, number_of_bedrooms,
                                           total_area, total_livable_area, type_heater, 
                                           view_type,  year_built, number_stories.cat, shooting_count, PPR_count, school_count, crime_count, TOD), 
     method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv
## Linear Regression 
## 
## 23698 samples
##    15 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 23460, 23461, 23461, 23461, 23462, 23460, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   146256.9  0.5846468  88412.76
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
ggplot()+
  geom_histogram(data = reg.cv$resample, aes(x = MAE), binwidth = 10000, fill = "orange") +
  scale_x_continuous(breaks=seq(0, 600000, by = 100000), )+
  coord_cartesian(xlim = c(0, 600000)) +
  labs(title = "Distribution of MAE",
  subtitle = "k-fold cross-validation; k=100",
  caption = "") +
  theme_minimal()

philly.test <- philly.test[!is.na(philly.test$sale_price.Predict),]
ggplot() + 
  geom_sf(data=nhoods, fill="#DDD", color="white")+ 
  geom_sf(data = philly.test, aes(color = q5(sale_price.Predict)),  
          show.legend = "point", size = 0.5, alpha = 1) +  
  scale_color_manual(values = palette5,
                     labels=qBr(philly.test, "sale_price.Predict"),
                     name="Quintile\nBreaks($)") +
  labs(
    title = "Predicted Sale Price in Philadelphia",
    subtitle = "Assessment date: May 24, 2022-August 14, 2023",
    caption = "Data: U.S. Census Bureau, Zillow") +
  theme_void()

ggplot() + 
  geom_sf(data=nhoods, fill="#DDD", color="white")+ 
  geom_sf(data = philly.test, aes(color = q5(sale_price.AbsError)), 
          show.legend = "point", size = 0.5, alpha = 1) +  
  scale_color_manual(values = palette5,
                     labels=qBr(philly.test, "sale_price.AbsError"),
                     name="Quintile\nBreaks($)") + 
  labs(
    title = "Test Set Absolute Sale Price Errors in Philadelphia",
    subtitle = "Assessment date: May 24, 2022-August 14, 2023",
    caption = "Data: U.S. Census Bureau, Zillow") +
  theme_void()

Check if Prices and Errors Cluster

(1) Spatial Lags

Even in the proposed model, the errors are still clustered and spatial lag of error and sale price error are positively correlated.

coords <- st_coordinates(philly.sf_New) 

neighborList <- knn2nb(knearneigh(coords, 5))

spatialWeights <- nb2listw(neighborList, style="W")

philly.sf_New$lagPrice <- lag.listw(spatialWeights, philly.sf_New$sale_price)
philly.test <- philly.test[!is.na(philly.test$sale_price.Error),]
coords.test <-  st_coordinates(philly.test) 
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W") 
# Sale Price as a function of the Spatial Lag of Price
philly.test %>% 
  mutate(lagPrice = lag.listw(spatialWeights.test, sale_price)) %>% 
  ggplot(aes(lagPrice, sale_price))+
  geom_point(color = "orange") +
  geom_smooth(method = "lm", se=F, colour = "#25CB10") +
  labs(title = "Price as a function of the Spatial Lag of Price",
       x = "Spatial Lag of Price (mean price of 5 nearest neighbors)",
       y = "Sale Price") +
  theme_minimal()

# Error as a Function of the Spatial Lag of Price
philly.test %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)) %>% 
  ggplot(aes(lagPriceError, sale_price.Error))+
  geom_point(color = "orange") +
  geom_smooth(method = "lm", se=F, colour = "#25CB10") +
  labs(title = "Error as a Function of the Spatial Lag of Price",
       x = "Spatial Lag of Errors (mean error of 5 nearest neighbors)",
       y = "Sale Price Error") +
  theme_minimal()

We can calculate the Pearson’s R coefficient to test this a different way

pearsons_test <- 
  philly.test %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error, NAOK=TRUE))

cor.test(pearsons_test$lagPriceError,
         pearsons_test$sale_price.Error, 
         method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  pearsons_test$lagPriceError and pearsons_test$sale_price.Error
## t = 65.244, df = 9342, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5454005 0.5732627
## sample estimates:
##       cor 
## 0.5594897

(2) Moran’s I

We can assess the Moran’s I value is 0.359 to see if the errors are spatially autocorrelated. Moran’s I in the proposed model became lower than the current model (0.395), but still indicates positive spatial autocorrelation. This means that the errors are clustered, and the model is not capturing all the spatial patterns in the data.

moranTest <- moran.mc(philly.test$sale_price.Error, 
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "orange",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  theme_minimal()

Predictions by neighborhood

To validate the generalizability of the proposed model, we summarize the predictions by 158 neighborhood (name in the dataset). This model examines the impact of neighborhood characteristics based on 158 neighborhoods. The following table shows that the mean prediction and actual price of 158 neighborhoods.

philly.test %>%
as.data.frame() %>%
  group_by(name) %>%
    summarize(meanPrediction = mean(sale_price.Predict),
              meanPrice = mean(sale_price)) %>%
      pander(caption = "Mean Predicted and Actual Sale Price by Neighborhood")
Mean Predicted and Actual Sale Price by Neighborhood
name meanPrediction meanPrice
Academy Gardens 293483 300350
Allegheny West 97203 85369
Andorra 487722 407943
Aston-Woodbridge 278812 279297
Bartram Village 123251 122812
Bella Vista 408471 535949
Belmont 156508 146286
Brewerytown 199271 228292
Bridesburg 257782 209771
Burholme 278010 257385
Bustleton 354624 363054
Byberry 513086 368250
Callowhill 309002 283333
Carroll Park 139529 118299
Cedar Park 370136 419500
Cedarbrook 316453 257483
Chestnut Hill 665122 944563
Clearview 197990 156389
Cobbs Creek 221163 170617
Crescentville 225795 201280
Dearnley Park 509445 339300
Dickinson Narrows 375397 377542
Dunlap 199575 132950
East Falls 342331 394138
East Germantown 200894 151451
East Kensington 407024 365128
East Mount Airy 333963 303050
East Oak Lane 361981 248553
East Parkside 238520 155800
East Passyunk 345746 378162
East Poplar 233444 264996
Eastwick 254580 232257
Elmwood 124242 122346
Fairhill -59099 76924
Fairmount 319011 426223
Feltonville 173099 121190
Fern Rock 308779 182284
Fishtown - Lower Kensington 374256 385690
Fitler Square 438982 1116021
Fox Chase 313783 322707
Francisville 480053 522825
Frankford 197198 124153
Franklinville 119423 85353
Garden Court 369893 450116
Germantown - Morton 238715 180053
Germantown - Penn Knox 326430 252271
Germantown - Westside 365830 279670
Germany Hill 384796 374416
Girard Estates 306358 299820
Glenwood 137863 60971
Graduate Hospital 447849 599698
Grays Ferry 120769 194289
Greenwich 340574 341392
Haddington 202283 122306
Harrowgate 141409 90044
Hartranft 158450 106478
Haverford North 186295 149833
Hawthorne 481607 599138
Holmesburg 205199 205505
Hunting Park 118034 94110
Juniata Park 170825 156814
Kingsessing 181774 142126
Lawndale 229132 201666
Lexington Park 263029 323492
Logan 245004 135113
Logan Square 394816 547362
Lower Moyamensing 259531 234116
Ludlow 374504 343200
Manayunk 364486 366448
Mantua 198464 182546
Mayfair 242747 229553
McGuire -119342 53200
Melrose Park Gardens 241217 192688
Mill Creek 158872 109421
Millbrook 279656 283342
Modena 280237 300000
Morrell Park 282406 285467
Newbold 330358 279215
Nicetown 162980 92800
Normandy Village 293909 277414
North Central 205588 217361
Northern Liberties 515668 548994
Northwood 253752 197994
Ogontz 193015 159895
Old Kensington 503564 444860
Olney 204609 153851
Overbrook 283941 209842
Oxford Circle 251697 218439
Packer Park 523722 550154
Parkwood Manor 271488 273238
Paschall 140748 101000
Passyunk Square 426200 442702
Pennsport 358390 374683
Pennypack 284115 306579
Pennypack Woods 283869 274574
Penrose 179374 193889
Point Breeze 350273 339582
Powelton 676212 470000
Queen Village 421545 623321
Rhawnhurst 266819 290961
Richmond 165606 191863
Rittenhouse 489333 838472
Riverfront 506165 572500
Roxborough 361685 354145
Roxborough Park 479141 421333
Sharswood 261039 197444
Society Hill 404819 672283
Somerton 334167 345849
Southwest Germantown 260410 182117
Southwest Schuylkill 152295 134826
Spring Garden 363399 384521
Spruce Hill 560409 715200
Stadium District 218755 294280
Stanton 178804 118298
Strawberry Mansion 97946 104665
Summerdale 180730 155469
Tacony 227111 180684
Tioga 189917 93103
Torresdale 293401 270365
Upper Kensington 31254 59021
Upper Roxborough 364507 344112
Walnut Hill 340056 321795
Washington Square West 377467 568568
West Central Germantown 399479 364615
West Kensington 285330 215717
West Mount Airy 619711 591589
West Oak Lane 228893 197622
West Parkside 39000 127500
West Passyunk 189510 209869
West Poplar 416075 455925
West Powelton 328753 263500
Whitman 257551 232478
Winchester Park 352945 333068
Wissahickon 344782 344632
Wissahickon Hills 344606 365386
Wissinoming 172466 158985
Wister 207398 150118
Wynnefield 262108 219093
Wynnefield Heights 232725 161950
Yorktown 241213 253750
reg.nhood <- lm(sale_price ~ ., data = as.data.frame(philly.training) %>% 
                                 dplyr::select(div_code, sale_price, exterior_condition, interior_condition, 
                                           number_of_bathrooms, number_of_bedrooms,
                                           total_area, total_livable_area, type_heater, 
                                           view_type,year_built, number_stories.cat, shooting_count, PPR_count, school_count, crime_count, TOD, name))
summary(reg.nhood)
## 
## Call:
## lm(formula = sale_price ~ ., data = as.data.frame(philly.training) %>% 
##     dplyr::select(div_code, sale_price, exterior_condition, interior_condition, 
##         number_of_bathrooms, number_of_bedrooms, total_area, 
##         total_livable_area, type_heater, view_type, year_built, 
##         number_stories.cat, shooting_count, PPR_count, school_count, 
##         crime_count, TOD, name))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -958080  -48459     -92   45397 3378009 
## 
## Coefficients: (3 not defined because of singularities)
##                                      Estimate   Std. Error t value
## (Intercept)                      -135242.2469   77874.5451  -1.737
## div_codeEPD                       -71094.7371   17853.0603  -3.982
## div_codeNEPD                       73882.1219   37787.9553   1.955
## div_codeNWPD                      -30623.5577   72463.6701  -0.423
## div_codeSPD                      -127972.3942   19088.5378  -6.704
## div_codeSWPD                      -37545.9299   38209.2863  -0.983
## exterior_condition                -17479.4094    1899.6517  -9.201
## interior_condition                -23206.7347    1979.2696 -11.725
## number_of_bathrooms                34958.7686    2373.9899  14.726
## number_of_bedrooms                 -4316.1648    1183.1624  -3.648
## total_area                             2.1810       0.3178   6.862
## total_livable_area                   195.6809       2.7473  71.226
## type_heater0                        2705.5964   18636.9577   0.145
## type_heaterA                        4935.3139    3485.8585   1.416
## type_heaterB                        1529.6906    4369.3662   0.350
## type_heaterC                      -72839.7887   12158.9380  -5.991
## type_heaterD                      115014.9248   22800.4998   5.044
## type_heaterE                       -4158.9772   21778.9117  -0.191
## type_heaterG                        9191.8100    6685.3171   1.375
## type_heaterH                        5320.9524    3673.0713   1.449
## view_type0                          4334.6676   27444.3988   0.158
## view_typeA                         48651.8298   22324.6061   2.179
## view_typeB                        -36223.9676   31568.6564  -1.147
## view_typeC                         26830.0256   22958.5792   1.169
## view_typeD                         42265.7664   26148.3456   1.616
## view_typeE                         27243.3066   26414.4270   1.031
## view_typeH                         29422.5166   24871.5539   1.183
## view_typeI                         26365.4079   20902.3159   1.261
## year_built                           108.8595      33.7257   3.228
## number_stories.cat4+ Floors       787152.1858   87904.0643   8.955
## number_stories.catUp to 2 Floors   17428.9855    4107.8982   4.243
## shooting_count                      -347.3366      42.2635  -8.218
## PPR_count                          -1513.4131    1159.0202  -1.306
## school_count                               NA           NA      NA
## crime_count                            1.6088      10.7904   0.149
## TODTOD                             16830.7552    4008.7350   4.199
## nameAllegheny West                -27700.2032   79696.4126  -0.348
## nameAndorra                        54524.2898   82787.8828   0.659
## nameAston-Woodbridge              -55750.5405   29665.8680  -1.879
## nameBartram Village               -23375.1161   38107.6000  -0.613
## nameBella Vista                   397300.0124   39587.2953  10.036
## nameBelmont                         1164.6972   38532.4361   0.030
## nameBrewerytown                    21334.6399   33088.0076   0.645
## nameBridesburg                   -102555.8736   25849.5612  -3.967
## nameBurholme                      -44681.8856   34528.9165  -1.294
## nameBustleton                     -25581.7615   22682.3491  -1.128
## nameByberry                      -126622.7326   49327.4751  -2.567
## nameCallowhill                     96149.8416   69484.0218   1.384
## nameCarroll Park                  -38393.8486   25070.6024  -1.531
## nameCedar Park                    144111.6279   27619.0785   5.218
## nameCedarbrook                     27853.9770   80040.5996   0.348
## nameChestnut Hill                 354536.1902   80378.7562   4.411
## nameClearview                      48357.2960   36014.7639   1.343
## nameCobbs Creek                   -17968.0358   23470.0371  -0.766
## nameCrescentville                 -64461.5057   90185.5319  -0.715
## nameCrestmont Farms               -31412.6244   90350.1466  -0.348
## nameDearnley Park                  82214.8550   85418.1986   0.962
## nameDickinson Narrows             217463.8520   39136.3866   5.557
## nameDunlap                        -38397.3516   35940.5754  -1.068
## nameEast Falls                    123157.2560   79877.8809   1.542
## nameEast Germantown               -44303.5498   79776.6838  -0.555
## nameEast Kensington               122573.2036   38110.4442   3.216
## nameEast Mount Airy                18554.5503   79537.6947   0.233
## nameEast Oak Lane                 -57696.4338   80993.6085  -0.712
## nameEast Parkside                 -24373.5115   32468.5168  -0.751
## nameEast Passyunk                 241414.7621   38284.3978   6.306
## nameEast Poplar                    37267.2053   50218.2760   0.742
## nameEastwick                        5312.2114   35476.8590   0.150
## nameElmwood                        -4023.2802   24140.8905  -0.167
## nameFairhill                       34244.6435   41703.8801   0.821
## nameFairmount                     197700.3910   32678.9043   6.050
## nameFeltonville                     9980.1499   37424.8272   0.267
## nameFern Rock                     -67127.4496   81490.5200  -0.824
## nameFishtown - Lower Kensington   178690.0036   36480.3475   4.898
## nameFitler Square                 634647.9571   41520.2464  15.285
## nameFox Chase                     -27674.7729   23340.0709  -1.186
## nameFrancisville                  113354.2450   35730.6951   3.172
## nameFrankford                    -156324.2859   23444.5040  -6.668
## nameFranklinville                 -22693.2803   41798.1606  -0.543
## nameGarden Court                  134573.9961   40675.3484   3.308
## nameGermantown - Morton           -38129.0948   81124.0927  -0.470
## nameGermantown - Penn Knox          2684.5239   87377.0851   0.031
## nameGermantown - Westside         -19888.1683   83439.1762  -0.238
## nameGermany Hill                   81844.4684   81651.1612   1.002
## nameGirard Estates                195195.6786   38056.2665   5.129
## nameGlenwood                      -60083.9541   83094.1446  -0.723
## nameGraduate Hospital             385001.7907   37362.9657  10.304
## nameGrays Ferry                   144776.4699   37971.4934   3.813
## nameGreenwich                     168916.8857   41739.5802   4.047
## nameHaddington                    -45071.8048   24559.3500  -1.835
## nameHarrowgate                     -9154.5364   37166.2269  -0.246
## nameHartranft                     -43871.3962   35126.9340  -1.249
## nameHaverford North               -78044.7136   39031.5509  -2.000
## nameHawthorne                     333682.9234   43099.6232   7.742
## nameHolmesburg                    -77103.9925   22731.2997  -3.392
## nameHunting Park                  -18037.2394   37411.8963  -0.482
## nameJuniata Park                   38271.9148   37144.3932   1.030
## nameKingsessing                   -27723.8975   23972.9668  -1.156
## nameLawndale                     -101264.8414   23067.6975  -4.390
## nameLexington Park                 25101.2533   29087.7039   0.863
## nameLogan                         -77570.8664   79547.1706  -0.975
## nameLogan Square                  234778.2626   34438.1587   6.817
## nameLower Moyamensing             145256.6523   37520.0680   3.871
## nameLudlow                        -18685.5834   46340.4219  -0.403
## nameManayunk                       83990.2313   79661.4222   1.054
## nameMantua                         35257.0962   29560.5400   1.193
## nameMayfair                       -84446.5477   22435.0855  -3.764
## nameMcGuire                        62622.0076   46018.4668   1.361
## nameMelrose Park Gardens            4292.8318   82114.8503   0.052
## nameMill Creek                    -29218.1121   27012.8064  -1.082
## nameMillbrook                     -36805.8624   26656.9708  -1.381
## nameModena                        -32120.5754   24206.7237  -1.327
## nameMorrell Park                  -41243.0522   24978.6875  -1.651
## nameNewbold                       186091.2459   38747.8387   4.803
## nameNicetown                      -50908.8093   81640.7214  -0.624
## nameNormandy Village              -17065.5423   46459.4735  -0.367
## nameNorth Central                 -31047.0447   33951.9104  -0.914
## nameNorthern Liberties            230007.5329   35828.9306   6.420
## nameNorthwood                    -137491.7170   26100.7979  -5.268
## nameOgontz                        -27287.3521   79711.9329  -0.342
## nameOld Kensington                123301.6057   39252.4341   3.141
## nameOlney                         -35506.9558   79244.7743  -0.448
## nameOverbrook                     -21530.6055   23612.4623  -0.912
## nameOxford Circle                 -95461.4318   22307.8418  -4.279
## namePacker Park                   226052.3357   42365.7045   5.336
## nameParkwood Manor                -41554.6730   23560.0517  -1.764
## namePaschall                      -47376.1160   24915.8569  -1.901
## namePassyunk Square               265682.0669   38232.3460   6.949
## namePennsport                     211889.8151   38609.7410   5.488
## namePennypack                     -18725.3035   24994.6787  -0.749
## namePennypack Woods               -48731.8390   36192.1514  -1.346
## namePenrose                        10131.6261   31515.1377   0.321
## namePoint Breeze                  182964.5741   37177.1617   4.921
## namePowelton                      260694.1638   59799.6009   4.359
## nameQueen Village                 441436.4201   38837.0765  11.366
## nameRhawnhurst                    -19645.3595   23113.0509  -0.850
## nameRichmond                       75804.4770   36359.0160   2.085
## nameRittenhouse                   565823.1636   33011.0457  17.140
## nameRiverfront                    206687.7891   79150.1994   2.611
## nameRoxborough                     73947.9117   79489.4554   0.930
## nameRoxborough Park                82499.4279   84576.4649   0.975
## nameSharswood                      56606.3969   46595.5394   1.215
## nameSociety Hill                  408970.3935   33555.1169  12.188
## nameSomerton                      -14753.5571   22859.9565  -0.645
## nameSouthwest Germantown          -31447.0920   80630.4225  -0.390
## nameSouthwest Schuylkill           -4873.7931   25935.7904  -0.188
## nameSpring Garden                 217508.4292   33629.3802   6.468
## nameSpruce Hill                   195546.5815   35111.4075   5.569
## nameStadium District              192350.0278   40943.8174   4.698
## nameStanton                       -99983.5433   33328.6739  -3.000
## nameStrawberry Mansion            -64330.9920   33003.1107  -1.949
## nameSummerdale                   -116053.6536   26121.1901  -4.443
## nameTacony                       -104702.7261   23113.8374  -4.530
## nameTioga                         -99947.1216   79912.5414  -1.251
## nameTorresdale                    -56131.7322   24716.3097  -2.271
## nameUpper Kensington                1431.1903   36915.3589   0.039
## nameUpper Roxborough               59966.7795   79781.2467   0.752
## nameWalnut Hill                    36723.8440   36294.4456   1.012
## nameWashington Square West        330744.0149   34888.8370   9.480
## nameWest Central Germantown       -37334.8593   81543.8432  -0.458
## nameWest Kensington                69705.3958   37948.2816   1.837
## nameWest Mount Airy                86475.5827   79849.8687   1.083
## nameWest Oak Lane                  -6850.7653   79256.2176  -0.086
## nameWest Parkside                 -17760.2798   52228.1287  -0.340
## nameWest Passyunk                 148363.0958   38448.3091   3.859
## nameWest Poplar                    38934.6920   50005.7065   0.779
## nameWest Powelton                  46473.2328   36813.9390   1.262
## nameWhitman                       155904.7686   38972.7675   4.000
## nameWinchester Park               -23805.3236   37687.9395  -0.632
## nameWissahickon                    72922.9302   80453.9826   0.906
## nameWissahickon Hills             122620.5736   84827.8386   1.446
## nameWissinoming                  -107276.3467   22730.1851  -4.720
## nameWister                        -48779.3878   81589.9932  -0.598
## nameWoodland Terrace              212236.8503   90379.9457   2.348
## nameWynnefield                    -49343.8847   26171.7584  -1.885
## nameWynnefield Heights                     NA           NA      NA
## nameYorktown                               NA           NA      NA
##                                              Pr(>|t|)    
## (Intercept)                                  0.082468 .  
## div_codeEPD                      0.000068624548814746 ***
## div_codeNEPD                                 0.050582 .  
## div_codeNWPD                                 0.672590    
## div_codeSPD                      0.000000000021042707 ***
## div_codeSWPD                                 0.325802    
## exterior_condition               < 0.0000000000000002 ***
## interior_condition               < 0.0000000000000002 ***
## number_of_bathrooms              < 0.0000000000000002 ***
## number_of_bedrooms                           0.000265 ***
## total_area                       0.000000000007062227 ***
## total_livable_area               < 0.0000000000000002 ***
## type_heater0                                 0.884576    
## type_heaterA                                 0.156854    
## type_heaterB                                 0.726273    
## type_heaterC                     0.000000002142008747 ***
## type_heaterD                     0.000000460695242559 ***
## type_heaterE                                 0.848557    
## type_heaterG                                 0.169177    
## type_heaterH                                 0.147461    
## view_type0                                   0.874504    
## view_typeA                                   0.029327 *  
## view_typeB                                   0.251209    
## view_typeC                                   0.242574    
## view_typeD                                   0.106034    
## view_typeE                                   0.302381    
## view_typeH                                   0.236838    
## view_typeI                                   0.207199    
## year_built                                   0.001250 ** 
## number_stories.cat4+ Floors      < 0.0000000000000002 ***
## number_stories.catUp to 2 Floors 0.000022218456474646 ***
## shooting_count                   0.000000000000000224 ***
## PPR_count                                    0.191653    
## school_count                                       NA    
## crime_count                                  0.881483    
## TODTOD                           0.000027034298856716 ***
## nameAllegheny West                           0.728167    
## nameAndorra                                  0.510162    
## nameAston-Woodbridge                         0.060227 .  
## nameBartram Village                          0.539623    
## nameBella Vista                  < 0.0000000000000002 ***
## nameBelmont                                  0.975887    
## nameBrewerytown                              0.519077    
## nameBridesburg                   0.000073023119075921 ***
## nameBurholme                                 0.195672    
## nameBustleton                                0.259413    
## nameByberry                                  0.010269 *  
## nameCallowhill                               0.166451    
## nameCarroll Park                             0.125686    
## nameCedar Park                   0.000000183645054455 ***
## nameCedarbrook                               0.727847    
## nameChestnut Hill                0.000010375681362365 ***
## nameClearview                                0.179389    
## nameCobbs Creek                              0.443943    
## nameCrescentville                            0.474766    
## nameCrestmont Farms                          0.728088    
## nameDearnley Park                            0.335816    
## nameDickinson Narrows            0.000000028021154490 ***
## nameDunlap                                   0.285378    
## nameEast Falls                               0.123140    
## nameEast Germantown                          0.578668    
## nameEast Kensington                          0.001302 ** 
## nameEast Mount Airy                          0.815547    
## nameEast Oak Lane                            0.476255    
## nameEast Parkside                            0.452857    
## nameEast Passyunk                0.000000000295360940 ***
## nameEast Poplar                              0.458037    
## nameEastwick                                 0.880974    
## nameElmwood                                  0.867641    
## nameFairhill                                 0.411582    
## nameFairmount                    0.000000001487748413 ***
## nameFeltonville                              0.789726    
## nameFern Rock                                0.410098    
## nameFishtown - Lower Kensington  0.000000977839542346 ***
## nameFitler Square                < 0.0000000000000002 ***
## nameFox Chase                                0.235754    
## nameFrancisville                             0.001515 ** 
## nameFrankford                    0.000000000026942406 ***
## nameFranklinville                            0.587190    
## nameGarden Court                             0.000940 ***
## nameGermantown - Morton                      0.638356    
## nameGermantown - Penn Knox                   0.975491    
## nameGermantown - Westside                    0.811609    
## nameGermany Hill                             0.316184    
## nameGirard Estates               0.000000295008330100 ***
## nameGlenwood                                 0.469641    
## nameGraduate Hospital            < 0.0000000000000002 ***
## nameGrays Ferry                              0.000138 ***
## nameGreenwich                    0.000052177028145784 ***
## nameHaddington                               0.066495 .  
## nameHarrowgate                               0.805443    
## nameHartranft                                0.211709    
## nameHaverford North                          0.045571 *  
## nameHawthorne                    0.000000000000010452 ***
## nameHolmesburg                               0.000696 ***
## nameHunting Park                             0.629724    
## nameJuniata Park                             0.302861    
## nameKingsessing                              0.247511    
## nameLawndale                     0.000011424367647150 ***
## nameLexington Park                           0.388180    
## nameLogan                                    0.329500    
## nameLogan Square                 0.000000000009654858 ***
## nameLower Moyamensing                        0.000109 ***
## nameLudlow                                   0.686789    
## nameManayunk                                 0.291746    
## nameMantua                                   0.233004    
## nameMayfair                                  0.000168 ***
## nameMcGuire                                  0.173599    
## nameMelrose Park Gardens                     0.958308    
## nameMill Creek                               0.279432    
## nameMillbrook                                0.167387    
## nameModena                                   0.184554    
## nameMorrell Park                             0.098735 .  
## nameNewbold                      0.000001582395147524 ***
## nameNicetown                                 0.532919    
## nameNormandy Village                         0.713385    
## nameNorth Central                            0.360501    
## nameNorthern Liberties           0.000000000141081178 ***
## nameNorthwood                    0.000000140196566596 ***
## nameOgontz                                   0.732112    
## nameOld Kensington                           0.001686 ** 
## nameOlney                                    0.654112    
## nameOverbrook                                0.361873    
## nameOxford Circle                0.000018876100378693 ***
## namePacker Park                  0.000000096654400958 ***
## nameParkwood Manor                           0.077792 .  
## namePaschall                                 0.057265 .  
## namePassyunk Square              0.000000000003839318 ***
## namePennsport                    0.000000041368009326 ***
## namePennypack                                0.453767    
## namePennypack Woods                          0.178171    
## namePenrose                                  0.747848    
## namePoint Breeze                 0.000000869048906350 ***
## namePowelton                     0.000013132208723538 ***
## nameQueen Village                < 0.0000000000000002 ***
## nameRhawnhurst                               0.395357    
## nameRichmond                                 0.037098 *  
## nameRittenhouse                  < 0.0000000000000002 ***
## nameRiverfront                               0.009029 ** 
## nameRoxborough                               0.352239    
## nameRoxborough Park                          0.329358    
## nameSharswood                                0.224446    
## nameSociety Hill                 < 0.0000000000000002 ***
## nameSomerton                                 0.518686    
## nameSouthwest Germantown                     0.696531    
## nameSouthwest Schuylkill                     0.850944    
## nameSpring Garden                0.000000000102770967 ***
## nameSpruce Hill                  0.000000026049887132 ***
## nameStadium District             0.000002653740300370 ***
## nameStanton                                  0.002705 ** 
## nameStrawberry Mansion                       0.051287 .  
## nameSummerdale                   0.000008944558394892 ***
## nameTacony                       0.000005951244738983 ***
## nameTioga                                    0.211063    
## nameTorresdale                               0.023160 *  
## nameUpper Kensington                         0.969075    
## nameUpper Roxborough                         0.452280    
## nameWalnut Hill                              0.311637    
## nameWashington Square West       < 0.0000000000000002 ***
## nameWest Central Germantown                  0.647067    
## nameWest Kensington                          0.066253 .  
## nameWest Mount Airy                          0.278837    
## nameWest Oak Lane                            0.931119    
## nameWest Parkside                            0.733823    
## nameWest Passyunk                            0.000114 ***
## nameWest Poplar                              0.436226    
## nameWest Powelton                            0.206833    
## nameWhitman                      0.000063577183942097 ***
## nameWinchester Park                          0.527631    
## nameWissahickon                              0.364744    
## nameWissahickon Hills                        0.148334    
## nameWissinoming                  0.000002386739447581 ***
## nameWister                                   0.549943    
## nameWoodland Terrace                         0.018875 *  
## nameWynnefield                               0.059399 .  
## nameWynnefield Heights                             NA    
## nameYorktown                                       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123800 on 13852 degrees of freedom
##   (199 observations deleted due to missingness)
## Multiple R-squared:  0.7148, Adjusted R-squared:  0.7113 
## F-statistic: 200.7 on 173 and 13852 DF,  p-value: < 0.00000000000000022
philly.test.nhood <-
  philly.test %>%
  mutate(Regression = "Neighborhood Effects",
         sale_price.Predict = predict(reg.nhood, philly.test),
         sale_price.Error = sale_price.Predict- sale_price,
         sale_price.AbsError = abs(sale_price.Predict- sale_price),
         sale_price.APE = (abs(sale_price.Predict- sale_price)) / sale_price)%>%
  filter(sale_price < 5000000)
bothRegressions <- 
  rbind(
    dplyr::select(philly.test, starts_with("sale_price"), Regression, name) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)),
    dplyr::select(philly.test.nhood, starts_with("sale_price"), Regression, name) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)))  

The MAE and MAPE of the proposed model, which accounts for the neighborhood effect, are as follows: when incorporating the neighborhood effect, the MAE decreases by 20,000 and the MAPE decreases by 18%p compared to the proposed model without the neighborhood effect.

  • Initially, the neighborhood effect was defined by police districts, but upon resubmission, it was adjusted to 158 neighborhoods, and the neighborhood effect was recalculated.
st_drop_geometry(bothRegressions) %>%
  gather(Variable, Value, -Regression, -name) %>%
  filter(Variable == "sale_price.AbsError" | Variable == "sale_price.APE") %>%
  group_by(Regression, Variable) %>%
    summarize(meanValue = mean(Value)) %>%
    spread(Variable, meanValue) %>%
    pander(caption = "Mean Absolute Error and Absolute Percentage Error by Regression")
Mean Absolute Error and Absolute Percentage Error by Regression
Regression sale_price.AbsError sale_price.APE
Baseline Regression 87459 0.5351
Neighborhood Effects 67766 0.3774
bothRegressions %>%
  dplyr::select(sale_price.Predict, sale_price, Regression) %>%
    ggplot(aes(sale_price, sale_price.Predict)) +
  geom_point() +
  stat_smooth(aes(sale_price, sale_price), 
             method = "lm", se = FALSE, size = 1, colour="orange") + 
  stat_smooth(aes(sale_price.Predict, sale_price), 
              method = "lm", se = FALSE, size = 1, colour="olivedrab") +
  facet_wrap(~Regression) +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
  theme_minimal()

Spatial Evaluation of Errors

The following results of table and graph denote there is significant difference in MAE and MAPE by regression methods, particularly in Central and North Philadelphia.

st_drop_geometry(bothRegressions) %>%
  group_by(Regression, name) %>%
  summarize(mean.MAPE = mean(sale_price.APE, na.rm = T)) %>%
  ungroup() %>% 
  left_join(neighborhoods, by = c("name" = "name")) %>%
    st_sf() %>%
    ggplot() + 
      geom_sf(aes(fill = 100*mean.MAPE), color = "transparent") +
      geom_sf(data = bothRegressions, colour = "gray30", size = .5) +
      facet_wrap(~Regression) +
      scale_fill_gradient(low = palette5[1], high = palette5[5],
                          name = "MAPE") +
      labs(title = "Mean test set MAPE by neighborhood (Neighborhood)") +
      theme_void()

The following scatterplot shows the average MAPE by 158 neighborhood mean price. The overall MAPE values were lower after accounting for the neighborhood effect, and the peak MAPE values observed before considering the neighborhood effect disappeared (Fairhill and McGuire). In both regression methods, MAPE did show an decrease depending on the sale price. However, in both methods, MAPE has a positive value.

a <- st_drop_geometry(bothRegressions) %>%
  group_by(Regression, name) %>%
  summarize(mean.sale_price = mean(sale_price, na.rm = T))

a2 <- st_drop_geometry(bothRegressions) %>%
  group_by(Regression, name) %>%
  summarize(mean.MAPE = mean(sale_price.APE, na.rm = T))

a3<- left_join(a, a2, by = c("name" = "name", "Regression" = "Regression")) 

ggplot(a3) + 
    geom_point(aes(x = mean.sale_price, y = 100*mean.MAPE), color = "gray30", size = 1) +
    facet_wrap(~Regression) +
    scale_fill_gradient(low = palette5[1], high = palette5[5],
                        name = "MAPE") +
    labs(title = "Average MAPE by Neighborhood (Neighborhood) Mean Price") +
    theme_minimal()

Race and Income Context of Predictions

We validated if the proposed model that accounts for neighborhood effect is applied to different group contexts: race and income.

In race, the regression that considered neighborhood effects resulted in a smaller MAPE in the majority white group. In income, the regression that considered neighborhood effects resulted in a smaller MAPE in the high-income group.

For the majority non-White and low-income groups, the MAPE decreased significantly—by more than 30%p and 25%p, respectively—when the neighborhood effect was considered compared to when it was not. It means the ‘neighborhood effect’ played an important role in predicting house prices.

As a result of generalizability, however, it can be concluded that it is relatively difficult to apply to the Majority Non-White or Low Income group. We should consider other additional variables that may resolve the correlation between the groups or analyze more data.

grid.arrange(ncol = 2,
  ggplot() + geom_sf(data = na.omit(nhoods), aes(fill = raceContext), color = "#999") +
    scale_fill_manual(values = c("olivedrab", "orange"), name="Race Context") +
    labs(title = "Race Context") +
    theme_void() + theme(legend.position="bottom"), 
  ggplot() + geom_sf(data = na.omit(nhoods), aes(fill = incomeContext), color = "#999") +
    scale_fill_manual(values = c("olivedrab", "orange"), name="Income Context") +
    labs(title = "Income Context") +
    theme_void() + theme(legend.position="bottom"))

st_join(bothRegressions, nhoods_f) %>% 
  filter(!is.na(raceContext)) %>%
  group_by(Regression, raceContext) %>%
  summarize(mean.MAPE = 100*(mean(sale_price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(raceContext, mean.MAPE) %>%
  pander(caption = "Test set MAPE by neighborhood race context(%)")
Test set MAPE by neighborhood race context(%)
Regression Majority Non-White Majority White
Baseline Regression 74.56 27.67
Neighborhood Effects 48.3 24.78
st_join(bothRegressions, nhoods_f) %>% 
  filter(!is.na(incomeContext)) %>%
  group_by(Regression, incomeContext) %>%
  summarize(mean.MAPE = 100*(mean(sale_price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  pander(caption = "Test set MAPE by neighborhood income context(%)")
Test set MAPE by neighborhood income context(%)
Regression High Income Low Income
Baseline Regression 27.94 64.37
Neighborhood Effects 25.78 42.89

Conclusion and Discussion

We recommend that Zillow use our proposed OLS model instead of the current one. Our proposed model aimed to predict housing prices using key features such as property condition, total livable area, number of bedrooms and bathrooms, proximity to crime, and access to amenities like schools and parks. These variables are crucial as they directly influence property values based on what homebuyers prioritize. The current Zillow model mainly focuses on internal characteristics and does not focus on public services or (dis)amenities. However, our proposed model was able to obtain significantly reduced MAE and MAPE values by incorporating six public services, including crime, shooting, and police districts, which are actual factors that people around Philadelphia often consider when choosing their home location. Our proposed model finally did effectively account for neighborhood effects, possibly because the neighborhood defined by 158 neighborhoods significantly decreased the MAE and MAPE.

While the model captured a significant portion of price variation, it was not perfect. First, metrics like MAE and MAPE decreased compared to the current Zillow model, but there was still a positive correlation. We also observed positive spatial autocorrelation in the model errors, as indicated by Moran’s I statistics. We need to identify more variables with spatial correlation.

Second, regression analysis did not work well for school data because the number of schools per tract had a lot of NA data. In the future, it is necessary to reanalyze the distance to the nearest school from each house as continuous data.

Third, our proposed model should still improve generalizability. The overall MAPE values were lower after accounting for the neighborhood effect, but MAPE by different group context has a positive value and large difference between two different groups which was race and house income, respectively. The model showed better performance in high-income and predominantly white neighborhoods, evidenced by lower MAE values, possibly due to more stable socio-economic factors in these areas. To improve generalization performance by eliminating these effects, we need to add sufficient other variables that can predict the variability occurring in low-income and major non-white groups, or change the weights of each variable.

In sum, further refinement is necessary to address local neighborhood variations that are not fully captured by the current set of predictors. Incorporating more granular, demographically aligned, or culturally relevant boundaries could improve the model’s predictive power, allowing it to capture neighborhood effects more accurately and potentially reduce the overall error rate.

philly.sf_Chall <- philly.sf %>%
  filter(sale_price < 5000000 & toPredict == "CHALLENGE") %>%
  mutate(pricepersq = ifelse(total_area > 10, sale_price / total_area, 0)) 
philly.sf_Chall <- philly.sf_Chall[!is.na(as.numeric(philly.sf_Chall$pricepersq)), ]
philly.sf_Chall <- st_join(philly.sf_Chall, nhoods_f) # Crime Data, School Data, Green Area Data, Shooting Incident Data, TOD Data
philly.sf_Chall <- st_join(philly.sf_Chall, philly.police) # Police District Data

philly.sf_Chall$sale_price <-  798332.8947 + philly.sf_Chall$exterior_condition * -27278.6103    +philly.sf_Chall$interior_condition* -27822.5115   +philly.sf_Chall$number_of_bathrooms*   60246.0186    +philly.sf_Chall$number_of_bedrooms*   -25245.0484    +philly.sf_Chall$total_livable_area*  193.4259       +philly.sf_Chall$year_built * -161.9947      +philly.sf_Chall$shooting_count * -1020.9659      +philly.sf_Chall$crime_count  * -18.7429 + ifelse(philly.sf_Chall$type_heater == "A", 13823.6306, 
ifelse(philly.sf_Chall$type_heater == "B", -8972.2810, 
ifelse(philly.sf_Chall$type_heater == "D", 239670.7969, 
ifelse(philly.sf_Chall$type_heater == "E", 58203.8361,
ifelse(philly.sf_Chall$type_heater == "H", 7760.8603,0)))))+
ifelse(philly.sf_Chall$view_type == "0", -82474.3969,
ifelse(philly.sf_Chall$view_type == "C", 182468.1793,
ifelse(philly.sf_Chall$view_type == "E", -50004.9183,0)))
#+ifelse(philly.sf_Chall$number_stories.cat =="4+ Floors",730714.0040,
#ifelse(philly.sf_Chall$number_stories.cat =="Up to 2 Floors",-45798.0893,0))

MUSA <- philly.sf_Chall %>%
  st_drop_geometry() %>%
  dplyr::select(musaID, sale_price) %>%
  mutate(team = "Agarwal and Jun") 
write.csv(MUSA,"mean_prediction_AgarwalJun.csv", row.names = FALSE)