Filtering NOAA GSOD data to extract NYC weather stations (Central Park & JFK).
import os
os.getcwd()
import tarfile
from pathlib import Path
import pandas as pd
# ===== 1. PATH TO YOUR GSOD YEAR FILE =====
# change this to your actual file, e.g. 2024.tar.gz or 2025.tar.gz
tar_path = Path("data/2025.tar.gz")
# ===== 2. BUILD A STATION TABLE FROM ALL CSVs =====
stations = []
with tarfile.open(tar_path, "r:gz") as tar:
    for member in tar.getmembers():
        if not member.name.endswith(".csv"):
            continue  # skip non-csv files
        f = tar.extractfile(member)
        if f is None:
            continue
        df_head = pd.read_csv(f, nrows=1)
        stations.append({
            "file": member.name,
            "STATION": df_head.get("STATION", [None])[0],
            "NAME": df_head.get("NAME", [None])[0],
            "LATITUDE": df_head.get("LATITUDE", [None])[0],
            "LONGITUDE": df_head.get("LONGITUDE", [None])[0],
        })
stations_df = pd.DataFrame(stations)
print("Total stations in this year:", len(stations_df))
# ===== 3. FIND CANDIDATE NYC STATIONS =====
# (1) by rough lat/lon box around NYC
nyc_box = stations_df[
    (stations_df["LATITUDE"].between(40.5, 41.1)) &
    (stations_df["LONGITUDE"].between(-74.3, -73.6))
]
print("\nStations in NYC bounding box:")
print(nyc_box[["file", "STATION", "NAME", "LATITUDE", "LONGITUDE"]])
# (2) by name search (NEW YORK / CENTRAL PARK)
nyc_name = stations_df[
    stations_df["NAME"].str.contains("NEW YORK|CENTRAL PARK", case=False, na=False)
]
print("\nStations with NEW YORK or CENTRAL PARK in name:")
print(nyc_name[["file", "STATION", "NAME", "LATITUDE", "LONGITUDE"]])
Total stations in this year: 11656
Stations in NYC bounding box:
       file             STATION      NAME                                           LATITUDE   LONGITUDE
7356   72055399999.csv  72055399999  PORT AUTH DOWNTN MANHATTAN WALL ST HEL, NY US  40.701214  -74.009028
7370   72058100178.csv  72058100178  LINDEN AIRPORT, NJ US                          40.617000  -74.250000
8239   72409454743.csv  72409454743  CALDWELL ESSEX CO AIRPORT, NJ US               40.876450  -74.282840
8436   72502014734.csv  72502014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US    40.682750  -74.169270
8437   72502594741.csv  72502594741  TETERBORO AIRPORT, NJ US                       40.858980  -74.056160
8441   72503014732.csv  72503014732  LAGUARDIA AIRPORT, NY US                       40.779450  -73.880270
8443   72503794745.csv  72503794745  WESTCHESTER CO AIRPORT, NY US                  41.062360  -73.704540
8450   72505394728.csv  72505394728  NY CITY CENTRAL PARK, NY US                    40.778980  -73.969250
9077   74486094789.csv  74486094789  JFK INTERNATIONAL AIRPORT, NY US               40.639150  -73.763900
11179  99727199999.csv  99727199999  THE BATTERY, US                                40.701000  -74.014000
11186  99728099999.csv  99728099999  KINGS POINT, US                                40.811000  -73.765000
11273  99774399999.csv  99774399999  ROBBINS REEF, NJ US                            40.650000  -74.066667
Stations with NEW YORK or CENTRAL PARK in name:
       file             STATION      NAME                         LATITUDE  LONGITUDE
8450   72505394728.csv  72505394728  NY CITY CENTRAL PARK, NY US  40.77898  -73.96925
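The bounding box is a coarse filter: it also returns New Jersey and Westchester stations, while the name search finds only Central Park. One possible alternative is to rank candidates by great-circle distance from a reference point. The sketch below is only an illustration: `haversine_km`, `nearest_stations`, the midtown Manhattan reference coordinates, and the 30 km cutoff are all assumptions, not part of the notebook. `demo_df` stands in for `stations_df`, using three rows copied from the output above.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or array-like inputs."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def nearest_stations(df, lat, lon, max_km=30.0):
    """Return the rows within max_km of (lat, lon), sorted by distance."""
    out = df.copy()
    out["DIST_KM"] = haversine_km(lat, lon, out["LATITUDE"], out["LONGITUDE"])
    return out[out["DIST_KM"] <= max_km].sort_values("DIST_KM")


# Stand-in for stations_df (rows copied from the bounding-box output).
demo_df = pd.DataFrame({
    "NAME": [
        "JFK INTERNATIONAL AIRPORT, NY US",
        "NY CITY CENTRAL PARK, NY US",
        "WESTCHESTER CO AIRPORT, NY US",
    ],
    "LATITUDE": [40.63915, 40.77898, 41.06236],
    "LONGITUDE": [-73.76390, -73.96925, -73.70454],
})

# Assumed reference point: roughly midtown Manhattan.
ranked = nearest_stations(demo_df, lat=40.7549, lon=-73.9840, max_km=30.0)
print(ranked[["NAME", "DIST_KM"]])
```

With these inputs Central Park ranks first, JFK second, and Westchester falls outside the cutoff. Rows with missing coordinates get a `NaN` distance and are dropped by the comparison.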
import tarfile
import pandas as pd
from pathlib import Path
# ========== You already have these ==========
# stations_df → a DataFrame containing columns: file, STATION, NAME, LATITUDE, LONGITUDE
# tar_path → path to 2024.tar.gz (or 2025.tar.gz)
# Directory to save selected station csv
output_dir = Path("data/nyc_two_stations")
output_dir.mkdir(parents=True, exist_ok=True)  # also create data/ if it is missing
# Central Park
cp_row = stations_df[
    stations_df["NAME"].str.contains("NY CITY CENTRAL PARK", case=False, na=False)
].iloc[0]
# JFK
jfk_row = stations_df[
    stations_df["NAME"].str.contains("JFK INTERNATIONAL AIRPORT", case=False, na=False)
].iloc[0]
cp_file = cp_row["file"]
jfk_file = jfk_row["file"]
print("Central Park file:", cp_file)
print("JFK file:", jfk_file)
Central Park file: 72505394728.csv
JFK file: 74486094789.csv
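Selecting by name with `.iloc[0]` silently takes the first hit when several names match, and raises a bare `IndexError` when none do. A minimal sketch of a stricter lookup by `STATION` ID follows. The helper `pick_station` is hypothetical, and the two IDs are the ones printed in the station listing above. `demo_df` stands in for `stations_df`.

```python
import pandas as pd


def pick_station(df, station_id):
    """Return the single row whose STATION equals station_id, or raise."""
    # Compare as strings so the lookup works whether STATION was parsed
    # as an integer or as text.
    matches = df[df["STATION"].astype(str) == str(station_id)]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly 1 row for STATION {station_id}, found {len(matches)}"
        )
    return matches.iloc[0]


# Stand-in for stations_df (IDs and names copied from the printed output).
demo_df = pd.DataFrame({
    "file": ["72505394728.csv", "74486094789.csv"],
    "STATION": [72505394728, 74486094789],
    "NAME": ["NY CITY CENTRAL PARK, NY US", "JFK INTERNATIONAL AIRPORT, NY US"],
})

cp_demo = pick_station(demo_df, 72505394728)
jfk_demo = pick_station(demo_df, "74486094789")
print("Central Park file:", cp_demo["file"])
print("JFK file:", jfk_demo["file"])
```

Matching on the ID avoids depending on the exact spelling of `NAME`, which is the less stable of the two fields in this table.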
def read_csv_from_tar(tar_path, csv_file_name):
    with tarfile.open(tar_path, "r:gz") as tar:
        f = tar.extractfile(csv_file_name)
        return pd.read_csv(f)
cp_data = read_csv_from_tar(tar_path, cp_file)
jfk_data = read_csv_from_tar(tar_path, jfk_file)
cp_out = output_dir / "NYC_Central_Park.csv"
jfk_out = output_dir / "NYC_JFK_Airport.csv"
cp_data.to_csv(cp_out, index=False)
jfk_data.to_csv(jfk_out, index=False)
print("Saved:", cp_out)
print("Saved:", jfk_out)
Saved: data\nyc_two_stations\NYC_Central_Park.csv
Saved: data\nyc_two_stations\NYC_JFK_Airport.csv
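`read_csv_from_tar` reopens the gzip archive for every station, which is fine for two files but repeats the scan each time. A sketch of reading several members in one pass is shown below, assuming the member names are known in advance. The helper `read_many_from_tar` is hypothetical, and the demo runs on a tiny synthetic archive with made-up values rather than real GSOD data.

```python
import io
import tarfile
import tempfile
from pathlib import Path

import pandas as pd


def read_many_from_tar(tar_path, csv_file_names):
    """Read several CSV members from one .tar.gz, opening the archive once."""
    wanted = set(csv_file_names)
    frames = {}
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar:  # single sequential scan of the archive
            if member.name not in wanted:
                continue
            f = tar.extractfile(member)
            if f is not None:
                frames[member.name] = pd.read_csv(f)
            if len(frames) == len(wanted):
                break  # everything found, stop scanning early
    missing = wanted - set(frames)
    if missing:
        raise FileNotFoundError(f"not found in archive: {sorted(missing)}")
    return frames


# Demo on a synthetic archive (values are invented for illustration).
with tempfile.TemporaryDirectory() as tmp:
    demo_tar = Path(tmp) / "demo.tar.gz"
    with tarfile.open(demo_tar, "w:gz") as tar:
        for name, temp in [("a.csv", 50.0), ("b.csv", 60.0)]:
            payload = f"STATION,TEMP\n{name[0]},{temp}\n".encode()
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    demo_frames = read_many_from_tar(demo_tar, ["a.csv", "b.csv"])

print({name: len(df) for name, df in demo_frames.items()})
```

In the notebook's setting the call would presumably look like `read_many_from_tar(tar_path, [cp_file, jfk_file])`, returning a dict keyed by member name.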
Source: 01a_weather_station_data_filtering.ipynb
Code Cells: 5
Figures: 0