DATA & METHODS

Hot City, Heated Calls:
Understanding How Urban Features Affect Quality of Life Under Different Heat Conditions Using New York City's 311 and SHAP

2.1. Study Area and Period

The study area is New York City, analyzed at the census tract level. Observations cover summer 2025, defined as the beginning of June through August 23, at a weekly temporal resolution. The last week of August was excluded because the weather data available at the time of analysis extended only to August 24 and therefore did not cover a full week.

2.2. Data Preparation

Heat Data

Excluding the last week of August left a total of 12 weeks in summer 2025. Extreme heat weeks were defined as weeks with at least two extreme heat days, using a temperature cutoff of 93°F measured at the John F. Kennedy (JFK) weather station at John F. Kennedy International Airport in Queens, New York. This threshold is the 95th percentile of daily maximum temperature over the 1981 through 2010 climatological baseline. It splits the observations into the two regimes needed for the analysis: 17 extreme heat days and 71 normal heat days, yielding 5 extreme heat weeks and 7 normal heat weeks. Data were downloaded directly from the National Oceanic and Atmospheric Administration (NOAA).
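The weekly regime rule described above can be sketched in a few lines. This is a minimal illustration, not the study's actual pipeline: the function names, the use of consecutive 7-day blocks starting June 1, and the choice to count a day as extreme when its maximum is at or above the cutoff are all assumptions, and the temperatures are toy values rather than NOAA records.

```python
from collections import defaultdict
from datetime import date, timedelta

THRESHOLD_F = 93.0    # extreme heat cutoff stated in the text
MIN_EXTREME_DAYS = 2  # extreme days required for an extreme heat week


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def classify_weeks(daily_tmax, start, threshold=THRESHOLD_F):
    """Map each 0-based week index (7-day blocks from `start`) to a regime."""
    extreme_days = defaultdict(int)
    for day, tmax in daily_tmax.items():
        week = (day - start).days // 7
        # Assumption: "at or above" the cutoff counts as an extreme heat day.
        extreme_days[week] += int(tmax >= threshold)
    return {
        week: "extreme" if n >= MIN_EXTREME_DAYS else "normal"
        for week, n in sorted(extreme_days.items())
    }


# Toy example: two weeks of daily maxima, with two hot days in the second week.
start = date(2025, 6, 1)
toy_tmax = {start + timedelta(days=i): 85.0 for i in range(14)}
toy_tmax[start + timedelta(days=8)] = 95.0
toy_tmax[start + timedelta(days=9)] = 93.0
regimes = classify_weeks(toy_tmax, start)

# The cutoff itself would come from a baseline series of daily maxima, e.g.
# percentile(baseline_tmax, 95), where baseline_tmax is hypothetical here.
```

With real data, `daily_tmax` would hold one entry per day of the study period and the cutoff would be computed once from the baseline years.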

311 Data

311 service request data were downloaded from NYC OpenData and filtered to the quality-of-life (QoL) categories below.

# Noise and Social Activity
QOL_NOISE = [
    "LOUD MUSIC/PARTY", "BANGING/POUNDING", "LOUD TALKING",
    "CAR/TRUCK MUSIC", "CAR/TRUCK HORN", "DOG NOISE",
    "NOISE: BOAT(ENGINE,MUSIC,ETC) (NR10)",
    "NOISE: ALARMS (NR3)",
    "NOISE: AIR CONDITION/VENTILATION EQUIPMENT (NV1)",
    "NOISE: CONSTRUCTION BEFORE/AFTER HOURS (NM1)",
    "NOISE: JACK HAMMERING (NC2)",
    "NOISE, BARKING DOG (NR5)",
    "NOISE: MANUFACTURING NOISE (NK1)",
    "NOISE: OTHER NOISE SOURCES (USE COMMENTS) (NZZ)"
]

# Outdoor/Public Space
QOL_OUTDOOR = [
    "BLOCKED HYDRANT", "BLOCKED SIDEWALK", "BLOCKED BIKE LANE",
    "ILLEGAL PARKING", "DOUBLE PARKED BLOCKING TRAFFIC",
    "BLOCKED CROSSWALK", "DERELICT VEHICLES", "CONGESTION/GRIDLOCK",
    "GRAFFITI", "CHRONIC DUMPING",
    "COMMERCIAL OVERNIGHT PARKING"
]

# Sanitation
QOL_SANITATION = [
    "GARBAGE OR LITTER", "TRASH", "OVERFLOWING",
    "RAT SIGHTING", "MOUSE SIGHTING", "CONDITION ATTRACTING RODENTS",
    "PESTS", "UNSANITARY CONDITION", "DEAD ANIMAL",
    "WASTE DISPOSAL", "DOG WASTE"
]

# Water Infrastructure
QOL_WATER = [
    "WATER LEAK", "WATER SUPPLY", "HYDRANT LEAKING (WC1)",
    "HYDRANT RUNNING FULL (WA4)", "HYDRANT RUNNING (WC3)",
    "HYDRANT DEFECTIVE (WC2)", "SEWER", "SEWER ODOR (SA2)",
    "SEWER BACKUP (SA)", "LEAK (USE COMMENTS) (WA2)"
]

# Infrastructure Heat Stress
QOL_INFRA_HEAT = [
    "POWER OUTAGE", "ELECTRICAL/GAS RANGE", "VENTILATION SYSTEM",
    "TRAFFIC SIGNAL LIGHT", "STREET LIGHT OUT",
    "STREET LIGHT LAMP MISSING", "STREET LIGHT CYCLING"
]
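One way the category lists above might be applied to raw 311 records is to invert them into a lookup table and match each record against it. The sketch below assumes hypothetical record fields named `complaint_type` and `descriptor` and uses short excerpts of the lists; which field each label actually matches in the dataset is an assumption, not something the text specifies.

```python
def build_lookup(categories):
    """Invert {category: [labels]} into {normalized label: category}."""
    lookup = {}
    for category, labels in categories.items():
        for label in labels:
            lookup[label.strip().upper()] = category
    return lookup


def categorize(record, lookup, fields=("complaint_type", "descriptor")):
    """Return the QoL category of a record, or None if no field matches."""
    for field in fields:
        value = (record.get(field) or "").strip().upper()
        if value in lookup:
            return lookup[value]
    return None


# Excerpts only; the full lists are given above.
CATEGORIES = {
    "noise": ["LOUD MUSIC/PARTY", "BANGING/POUNDING"],
    "sanitation": ["RAT SIGHTING", "DOG WASTE"],
}
lookup = build_lookup(CATEGORIES)

sample = {"complaint_type": "Rodent", "descriptor": "Rat Sighting"}
print(categorize(sample, lookup))  # sanitation
```

Normalizing case and whitespace before matching guards against minor formatting differences between the lists and the downloaded records.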

Human interaction with the built environment is fundamentally non-linear: there are thresholds at which behavior changes. For example, when it gets too hot, people may stop complaining about street noise because they are sealed indoors with air conditioning or adopt other insulating behaviors, even though heat otherwise heightens irritability and sensitivity to external issues.

Socioeconomic Data

Socioeconomic data were derived from the United States Census, specifically the most recent 5-year American Community Survey (ACS), from 2023. Python's pyCensus module was used to retrieve the data and filter it down to the main derived variables in the final table:

These variables were chosen to highlight socioeconomic disparities: heat-related issues disproportionately affect different communities, and communities differ in how they interact with public services like New York City's 311. More educated and higher-income individuals may be better able to navigate what their city offers, limited English speakers may face more barriers to accessing 311 services, and renters may face more infrastructural issues than owners.

Urban Environmental Data

Urban environmental data were derived from raster calculations on Landsat and land use/land cover (LULC) data, together with OSM water data. The Landsat scenes fall within the study period, whereas the LULC data was a static raster from 2024. Processing and computation were done in ArcGIS Pro and Python.
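NDVI, one of the Landsat-derived metrics used later in the analysis, follows the standard definition (NIR - Red) / (NIR + Red). The sketch below shows the per-pixel calculation and a tract-level mean; the function names and the idea of passing a tract's pixels as (NIR, Red) reflectance pairs are illustrative assumptions, since the study's raster workflow in ArcGIS Pro is not described in detail.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel.

    Returns None when NIR + Red is zero (e.g. a no-data pixel).
    """
    total = nir + red
    if total == 0:
        return None
    return (nir - red) / total


def mean_ndvi(pixels):
    """Mean NDVI over (nir, red) pairs, skipping undefined pixels."""
    values = [v for v in (ndvi(n, r) for n, r in pixels) if v is not None]
    return sum(values) / len(values) if values else None


# Toy reflectance pairs: vegetated, bare, and no-data pixels.
tract_pixels = [(0.5, 0.1), (0.3, 0.3), (0.0, 0.0)]
tract_ndvi = mean_ndvi(tract_pixels)
```

Values near 1 indicate dense vegetation, values near 0 indicate bare or built surfaces, and negative values typically indicate water.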

Urban Built and Spatial Data

Building data came from the NYC Open Data building footprint shapefile, which includes a height field. Spatial features were derived with Python's osmnx module: points-of-interest (POI) density within a 500-meter buffer, and the Euclidean distance from each census tract centroid to the nearest subway station.

POIs were defined as everyday amenities, shops, leisure, and public transport categories in OpenStreetMap (OSM), yielding 21,309 points, and are as follows:
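The two spatial features can be illustrated with plain coordinate arithmetic. This sketch assumes that centroids, POIs, and stations are already expressed as (x, y) pairs in a projected coordinate system with meter units, and that density is reported as POIs per square kilometer of buffer; both are assumptions, as are the function names.

```python
import math


def poi_density(centroid, pois, radius_m=500.0):
    """POIs per square kilometer inside a circular buffer around a centroid."""
    count = sum(1 for p in pois if math.dist(centroid, p) <= radius_m)
    area_km2 = math.pi * (radius_m / 1000.0) ** 2
    return count / area_km2


def nearest_distance(centroid, stations):
    """Euclidean distance in meters from a centroid to the closest station."""
    return min(math.dist(centroid, s) for s in stations)


# Toy coordinates in meters, relative to an arbitrary origin.
centroid = (0.0, 0.0)
pois = [(100.0, 0.0), (0.0, 400.0), (600.0, 0.0)]  # two fall inside 500 m
stations = [(300.0, 400.0), (1000.0, 0.0)]

density = poi_density(centroid, pois)
subway_dist = nearest_distance(centroid, stations)
```

Distances computed on unprojected latitude and longitude would not be in meters, which is why the projection assumption matters here.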

These variables were chosen because quantifiable metrics of greenery, such as tree canopy and NDVI, along with water coverage, help explore the role of these features in mitigating heat and alleviating air pollution within cities, and how they could in turn affect QoL requests. In addition, a high percentage of impervious surface suggests high heat absorption, as do greater building heights and densities. While many of these environmental and urban form variables may be multicollinear, the goal is interpretation rather than maximal prediction, and including them helps explore the different properties of a city.

2.3. OLS Regression Model

OLS regression was used as the foundational statistical model in this study because it provides an interpretable, baseline framework for understanding the linear associations between environmental, socioeconomic, urban morphology, and spatial accessibility characteristics and the dependent variable of heat-related QoL 311 complaints per capita.

Separate cross-sectional OLS models were estimated for extreme heat weeks, defined as those with at least two extreme heat days, and for normal heat weeks, defined as those with fewer than two extreme heat days. Each of the 2,225 observations represents a census tract by week.

Predictors were structured into three conceptual categories, which were added incrementally to assess the added explanatory value of each predictor block: Environmental Predictors, Socioeconomic Predictors, and Urban Morphology Predictors.

Urban Environmental Features: NDVI, percent tree canopy, percent impervious surface, and water cover ratio.

Urban Socioeconomic Features: Median income, poverty rate, percent renters, percent limited English, percent bachelor’s or more, and percent non-white.

Urban Built and Spatial Features: Average building height, building density, distance to the nearest subway station, and 500-meter buffer POI density.
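The incremental block structure can be sketched as a sequence of nested OLS fits, recording R² after each block is added. The example below is a plain-Python illustration solving the normal equations; the column names and toy data are hypothetical, and a perfectly collinear design would make the solver fail, so a statistics library would normally be preferred in practice.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols_r2(rows, y):
    """Fit y on the predictor rows plus an intercept; return (coefs, R^2)."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(xi[a] * xi[b] for xi in x) for b in range(k)] for a in range(k)]
    xty = [sum(xi[a] * yi for xi, yi in zip(x, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, xi)) for xi in x]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


def incremental_r2(data, y, blocks):
    """R^2 after adding each named predictor block, in the order given."""
    used, out = [], {}
    for name, cols in blocks:
        used += cols
        rows = [[rec[c] for c in used] for rec in data]
        out[name] = ols_r2(rows, y)[1]
    return out


# Toy data with one hypothetical column per block.
data = [
    {"ndvi": 0.1 * i, "pct_renters": (i * i) % 7, "poi_density": (3 * i) % 5}
    for i in range(10)
]
y = [
    2.0 + 1.5 * d["ndvi"] - 0.5 * d["pct_renters"] + 0.25 * d["poi_density"]
    for d in data
]
blocks = [
    ("environmental", ["ndvi"]),
    ("socioeconomic", ["pct_renters"]),
    ("built_spatial", ["poi_density"]),
]
r2 = incremental_r2(data, y, blocks)
```

Because the models are nested, R² cannot decrease as blocks are added; the size of each increase is what indicates a block's added explanatory value.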

OLS provides a transparent estimate of how predictors correlate with QoL complaint rates. Its coefficients can be directly interpreted and compared across extreme and normal heat conditions, making it an important reference model before introducing a nonlinear ML approach with Random Forest. Given the behavioral nature of 311 complaint reporting and the noisy, high-frequency variability of QoL calls, relatively low R² values are expected in this domain, consistent with existing literature on 311 data, urban complaints, and human-environment interactions.

2.4. ML Model and SHAP

To further understand the relationships between QoL and urban dynamics under different heat conditions, and to complement the OLS framework, a nonlinear ML model was used to test whether environmental, socioeconomic, and urban predictors collectively produce stronger predictive power for QoL rates per capita during extreme and normal heat weeks.

For this study, Random Forest (RF) is stable on moderate-size datasets and can handle highly correlated, multicollinear predictors like this study's without requiring regularization. It is also relatively insensitive to hyperparameter tuning and can model the nonlinear relationships and threshold behaviors associated with heat stress. RF is widely used in the environmental exposure, health, and urban prediction literature.

Like the OLS models, the RF models were trained separately on the extreme heat weeks and the normal heat weeks with the same predictor groups for direct comparison. The tracts were partitioned into an 80% training set and a 20% test set, and a grid search with 3-fold cross-validation was used to optimize the hyperparameters. The final performance metrics are based on the test set.
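The partitioning and tuning procedure can be sketched without committing to a particular library. In this illustration the split is made over unique tract IDs so that every week of a tract falls on the same side, which is one reading of the text; the parameter grid values, the seed, and the `score_fn` placeholder (standing in for fitting an RF on the training folds and scoring it on the validation fold) are all hypothetical.

```python
import random
from itertools import product


def split_tracts(tract_ids, test_share=0.2, seed=42):
    """Split unique tract IDs into (train, test) sets of roughly 80/20."""
    unique = sorted(set(tract_ids))
    random.Random(seed).shuffle(unique)
    n_test = round(len(unique) * test_share)
    return set(unique[n_test:]), set(unique[:n_test])


def k_folds(items, k=3):
    """Partition items into k folds of near-equal size."""
    ordered = sorted(items)
    return [ordered[i::k] for i in range(k)]


def grid_search(param_grid, folds, score_fn):
    """Return (best_params, best_mean_score) over all grid combinations."""
    names = sorted(param_grid)
    best, best_score = None, float("-inf")
    for combo in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        scores = []
        for i, valid in enumerate(folds):
            train = [t for j, f in enumerate(folds) if j != i for t in f]
            scores.append(score_fn(params, train, valid))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best, best_score = params, mean_score
    return best, best_score


# Toy tract-week rows: 10 tracts observed in 3 weeks each.
tract_ids = [f"T{i:02d}" for i in range(10)] * 3
train, test = split_tracts(tract_ids)
folds = k_folds(train, k=3)

# Hypothetical grid; the study's actual grid is not given in the text.
param_grid = {"max_depth": [4, 8, 16], "n_estimators": [100, 300]}


def score_fn(params, train_tracts, valid_tracts):
    """Placeholder score; a real version would fit and evaluate a model."""
    return -abs(params["max_depth"] - 8)


best_params, best_score = grid_search(param_grid, folds, score_fn)
```

Keeping the test tracts out of the grid search entirely is what allows the final metrics to be read as out-of-sample performance.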

While OLS is useful for interpretation, extreme heat effects on QoL are likely nonlinear and involve interactions, for example between the built environment and socioeconomic vulnerability. RF accommodates these idiosyncrasies, capturing behavioral and nonlinear dynamics that OLS cannot.

Finally, the SHAP (SHapley Additive exPlanations) method was used to interpret the Random Forest results and quantify the contribution of each predictor to the predicted QoL complaint rate per capita. This method is well suited to urban and environmental modeling because it decomposes each prediction into additive contributions from the predictors, providing both a measure of global importance and local explanations. This makes the two regimes, extreme heat weeks and normal heat weeks, directly comparable.
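The additive decomposition can be made concrete with an exact Shapley value calculation on a toy model. This is a brute-force sketch for illustration only, not the procedure used in the study: it enumerates every feature coalition, fills in missing features from a small background sample, and uses a hypothetical `toy_model` with a threshold and an interaction in place of a trained Random Forest. Enumeration is feasible only for a handful of features.

```python
from itertools import combinations
from math import factorial


def coalition_value(predict, x, background, subset):
    """Mean prediction with features in `subset` fixed to x's values and
    the remaining features taken from each background row."""
    total = 0.0
    for row in background:
        mixed = [x[j] if j in subset else row[j] for j in range(len(x))]
        total += predict(mixed)
    return total / len(background)


def shapley_values(predict, x, background):
    """Return (base_value, contributions) for one observation x."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                with_i = coalition_value(predict, x, background, set(s) | {i})
                without_i = coalition_value(predict, x, background, set(s))
                phi[i] += weight * (with_i - without_i)
    base = coalition_value(predict, x, background, set())
    return base, phi


def toy_model(v):
    """Hypothetical nonlinear model: linear term, threshold, interaction."""
    return 2.0 * v[0] + (3.0 if v[1] > 0.5 else 0.0) + v[0] * v[2]


x = [1.0, 0.8, 2.0]
background = [[0.0, 0.2, 1.0], [0.5, 0.9, 0.0], [1.0, 0.1, 3.0]]
base, phi = shapley_values(toy_model, x, background)

# Additivity: the base value plus all contributions recovers the prediction.
reconstructed = base + sum(phi)
```

Averaging the absolute contributions over many observations gives a global importance ranking, while the per-observation values are the local explanations.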

SHAP thus allows identification of which environmental, socioeconomic, or urban factors become more influential during extreme heat, and whether predictors behave differently under high heat versus normal heat conditions. SHAP plots can further reveal non-linear relationships between the features and the target.

Study Parameters

Location: New York City

Spatial Resolution: Census tract level

Time Period: Summer 2025 (June–August 23)

Temporal Resolution: Weekly

Heat Thresholds

93°F is the extreme heat threshold, based on the 95th percentile of the 1981–2010 climatological baseline.

- 5 extreme heat weeks
- 7 normal heat weeks

Data Sources

- NOAA (temperature)
- ACS 2023 (socioeconomic)
- Landsat (environmental / urban)
- OpenStreetMap (POIs / kNN / urban)