CODE: Weather Station Data Filtering

Hot City, Heated Calls:
Understanding Extreme Heat and Quality of Life
Using New York City's 311 and SHAP

Filters a yearly NOAA GSOD (Global Surface Summary of the Day) archive to extract two NYC weather stations: Central Park and JFK.

Code Cell 1
import os
os.getcwd()
Code Cell 2
import tarfile
from pathlib import Path
import pandas as pd

# ===== 1. PATH TO YOUR GSOD YEAR FILE =====
# change this to your actual file, e.g. 2024.tar.gz or 2025.tar.gz
tar_path = Path("data/2025.tar.gz")

# ===== 2. BUILD A STATION TABLE FROM ALL CSVs =====
stations = []

with tarfile.open(tar_path, "r:gz") as tar:
    for member in tar.getmembers():
        if not member.name.endswith(".csv"):
            continue  # skip non-csv files

        f = tar.extractfile(member)
        if f is None:
            continue

        # one row is enough: station metadata repeats on every row of the file
        df_head = pd.read_csv(f, nrows=1)

        stations.append({
            "file": member.name,
            "STATION": df_head.get("STATION", [None])[0],
            "NAME": df_head.get("NAME", [None])[0],
            "LATITUDE": df_head.get("LATITUDE", [None])[0],
            "LONGITUDE": df_head.get("LONGITUDE", [None])[0],
        })

stations_df = pd.DataFrame(stations)
print("Total stations in this year:", len(stations_df))

# ===== 3. FIND CANDIDATE NYC STATIONS =====
# (1) by rough lat/lon box around NYC
nyc_box = stations_df[
    (stations_df["LATITUDE"].between(40.5, 41.1)) &
    (stations_df["LONGITUDE"].between(-74.3, -73.6))
]

print("\nStations in NYC bounding box:")
print(nyc_box[["file", "STATION", "NAME", "LATITUDE", "LONGITUDE"]])

# (2) by name search (NEW YORK / CENTRAL PARK)
nyc_name = stations_df[
    stations_df["NAME"].str.contains("NEW YORK|CENTRAL PARK", case=False, na=False)
]

print("\nStations with NEW YORK or CENTRAL PARK in name:")
print(nyc_name[["file", "STATION", "NAME", "LATITUDE", "LONGITUDE"]])
Total stations in this year: 11656

Stations in NYC bounding box:
                  file      STATION  \
7356   72055399999.csv  72055399999   
7370   72058100178.csv  72058100178   
8239   72409454743.csv  72409454743   
8436   72502014734.csv  72502014734   
8437   72502594741.csv  72502594741   
8441   72503014732.csv  72503014732   
8443   72503794745.csv  72503794745   
8450   72505394728.csv  72505394728   
9077   74486094789.csv  74486094789   
11179  99727199999.csv  99727199999   
11186  99728099999.csv  99728099999   
11273  99774399999.csv  99774399999   

                                                NAME   LATITUDE  LONGITUDE  
7356   PORT AUTH DOWNTN MANHATTAN WALL ST HEL, NY US  40.701214 -74.009028  
7370                           LINDEN AIRPORT, NJ US  40.617000 -74.250000  
8239                CALDWELL ESSEX CO AIRPORT, NJ US  40.876450 -74.282840  
8436     NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  40.682750 -74.169270  
8437                        TETERBORO AIRPORT, NJ US  40.858980 -74.056160  
8441                        LAGUARDIA AIRPORT, NY US  40.779450 -73.880270  
8443                   WESTCHESTER CO AIRPORT, NY US  41.062360 -73.704540  
8450                     NY CITY CENTRAL PARK, NY US  40.778980 -73.969250  
9077                JFK INTERNATIONAL AIRPORT, NY US  40.639150 -73.763900  
11179                                THE BATTERY, US  40.701000 -74.014000  
11186                                KINGS POINT, US  40.811000 -73.765000  
11273                            ROBBINS REEF, NJ US  40.650000 -74.066667  

Stations with NEW YORK or CENTRAL PARK in name:
                 file      STATION                         NAME  LATITUDE  \
8450  72505394728.csv  72505394728  NY CITY CENTRAL PARK, NY US  40.77898   

      LONGITUDE  
8450  -73.96925  
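The two searches above return different sets: the name search finds only Central Park, while JFK shows up only in the bounding-box result. The sketch below shows one way to see both filters side by side on a small table. The three rows are copied from the output above; `flag_candidates` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Three rows copied from the bounding-box output above.
sample = pd.DataFrame({
    "STATION": [72505394728, 74486094789, 72503014732],
    "NAME": [
        "NY CITY CENTRAL PARK, NY US",
        "JFK INTERNATIONAL AIRPORT, NY US",
        "LAGUARDIA AIRPORT, NY US",
    ],
    "LATITUDE": [40.77898, 40.63915, 40.77945],
    "LONGITUDE": [-73.96925, -73.76390, -73.88027],
})


def flag_candidates(df, lat_range=(40.5, 41.1), lon_range=(-74.3, -73.6),
                    name_pattern="NEW YORK|CENTRAL PARK"):
    """Hypothetical helper: mark which of the two filters each station passes."""
    out = df.copy()
    out["in_box"] = (
        out["LATITUDE"].between(*lat_range)
        & out["LONGITUDE"].between(*lon_range)
    )
    out["name_match"] = out["NAME"].str.contains(
        name_pattern, case=False, na=False
    )
    return out


flags = flag_candidates(sample)
print(flags[["NAME", "in_box", "name_match"]])
```

All three stations fall inside the box, but only Central Park matches the name pattern, which is why the later cells look JFK up by its own name.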
Code Cell 3
import tarfile
import pandas as pd
from pathlib import Path

# ========== You already have these ==========
# stations_df  → a DataFrame containing columns: file, STATION, NAME, LATITUDE, LONGITUDE
# tar_path     → path to the GSOD year archive (here data/2025.tar.gz)

# Directory to save selected station csv
output_dir = Path("data/nyc_two_stations")
# parents=True also creates "data" if it does not exist yet
output_dir.mkdir(parents=True, exist_ok=True)
Code Cell 4
# Central Park
cp_row = stations_df[
    stations_df["NAME"].str.contains("NY CITY CENTRAL PARK", case=False, na=False)
].iloc[0]

# JFK
jfk_row = stations_df[
    stations_df["NAME"].str.contains("JFK INTERNATIONAL AIRPORT", case=False, na=False)
].iloc[0]

cp_file = cp_row["file"]
jfk_file = jfk_row["file"]

print("Central Park file:", cp_file)
print("JFK file:", jfk_file)
Central Park file: 72505394728.csv
JFK file: 74486094789.csv
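The `.iloc[0]` lookups above raise a bare `IndexError` when no station name matches, and silently take the first row when several do. A minimal sketch of a stricter lookup follows, assuming exactly one match is wanted. `pick_station` is a hypothetical helper, and `demo` is a two-row stand-in for `stations_df` built from the file names and station names shown above.

```python
import pandas as pd


def pick_station(stations_df, pattern):
    """Hypothetical helper: return the single row whose NAME contains pattern."""
    matches = stations_df[
        stations_df["NAME"].str.contains(pattern, case=False, na=False)
    ]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one station matching {pattern!r}, "
            f"found {len(matches)}"
        )
    return matches.iloc[0]


# Two-row stand-in for stations_df, using values from the outputs above.
demo = pd.DataFrame({
    "file": ["72505394728.csv", "74486094789.csv"],
    "NAME": ["NY CITY CENTRAL PARK, NY US", "JFK INTERNATIONAL AIRPORT, NY US"],
})

jfk_demo_file = pick_station(demo, "JFK INTERNATIONAL AIRPORT")["file"]
print(jfk_demo_file)
```

Failing loudly here is a design choice: a different GSOD year might rename or drop a station, and an explicit message is easier to act on than an index error.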
Code Cell 5
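def read_csv_from_tar(tar_path, csv_file_name):
    """Read one CSV member of a .tar.gz archive into a DataFrame."""
    with tarfile.open(tar_path, "r:gz") as tar:
        # read the member straight from the archive, without unpacking to disk
        f = tar.extractfile(csv_file_name)
        return pd.read_csv(f)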

cp_data = read_csv_from_tar(tar_path, cp_file)
jfk_data = read_csv_from_tar(tar_path, jfk_file)

cp_out = output_dir / "NYC_Central_Park.csv"
jfk_out = output_dir / "NYC_JFK_Airport.csv"

cp_data.to_csv(cp_out, index=False)
jfk_data.to_csv(jfk_out, index=False)

print("Saved:", cp_out)
print("Saved:", jfk_out)
Saved: data\nyc_two_stations\NYC_Central_Park.csv
Saved: data\nyc_two_stations\NYC_JFK_Airport.csv
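`tarfile.extractfile` returns `None` for members that are not regular files and raises `KeyError` for names missing from the archive. Cell 2 checks for `None`, but `read_csv_from_tar` does not. The sketch below is one possible guarded variant, exercised on a throwaway archive. `read_csv_from_tar_checked`, the `demo.csv` member, and its two data rows are all made up for illustration and are not GSOD data.

```python
import io
import tarfile
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_from_tar_checked(tar_path, csv_file_name):
    """Hypothetical variant of read_csv_from_tar with an explicit None check."""
    with tarfile.open(tar_path, "r:gz") as tar:
        # extractfile raises KeyError if the name is not in the archive
        f = tar.extractfile(csv_file_name)
        if f is None:  # member exists but is not a regular file
            raise ValueError(
                f"{csv_file_name} is not a regular file in {tar_path}"
            )
        return pd.read_csv(f)


# Build a throwaway archive with made-up contents to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "demo.tar.gz"
    payload = b"STATION,TEMP\n1,70.5\n1,72.0\n"
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo(name="demo.csv")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    demo_df = read_csv_from_tar_checked(archive, "demo.csv")

print(demo_df.shape)
```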

Source: 01a_weather_station_data_filtering.ipynb

Code Cells: 5

Figures: 0