College Campus Disaggregation

Disaggregating university enrollment to TAZ level using StreetLight data

This workflow distributes college enrollment data (Zip-level or Regional) to Traffic Analysis Zones (TAZs). It accounts for fixed Group Quarter populations (dorms) and distributes the remaining commuter population based on StreetLight origin-destination trip patterns.
Author: Pukar Bhandari

Published: January 1, 2025

1 Setup & Configuration

We begin by establishing the computational environment. This section loads the necessary libraries for spatial analysis (sf, mapview) and data manipulation (tidyverse).

Crucially, we define a central configuration list (CONSTANTS and PATHS). This practice avoids “magic numbers” (hardcoded values hidden deep in the code) and makes the script robust. If the TAZ shapefile name changes next year, or if the assumption about dorm distance needs adjusting, you only need to change it here.

  • min_hh_threshold: We drop TAZs with 24 or fewer households to prevent distributing students to industrial or unpopulated zones.
  • gq_dist_threshold: When assigning census blocks (dorms) to campuses, we only consider blocks within 4,000 feet of a campus boundary to ensure accuracy.
  • magic_fill_value: A specific fallback weight (approx. 0.74) used when StreetLight data is missing for a specific Zip code, ensuring no enrollment is lost due to gaps in the trip data.
Show the code
import pandas as pd
import geopandas as gpd
import numpy as np
from shapely.geometry import Point
from pathlib import Path
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

# --- CONFIGURATION ---
CONSTANTS = {
    "crs_projected": 26912,  # NAD83 / Utah Central (EPSG:26912)
    "min_hh_threshold": 24,  # TAZs with fewer HHs are ignored for distribution
    "gq_dist_threshold": 4000,  # Max distance (ft) to snap dorm blocks to campus
    "magic_fill_value": 0.7435897,  # Fallback weight for Zips with no StreetLight data
}

PATHS = {
    # Raw Input Data
    "se_data": "data/SE_2023.csv",
    "taz_shp": "data/TAZ/WFv1000_TAZ.shp",
    "sl_taz_shp": "data/StreetLight_TAZ_2021_09_22/StreetLight_TAZ_2021_09_22.shp",
    "college_key": "data/College_to_SL_COTAZID.csv",
    # Intermediate Input Data
    "block_shp": "intermediate/groupquarter_censusblocks_2020.shp",
    "zip_enrollment": "intermediate/TotEnrolByZipByCollege.csv",
    "zip_shp": "intermediate/ugrc_zip_codes.shp",
    "sl_trips": "intermediate/dfCollegeDataOD.csv",
    # Output
    "output_csv": "results/collegeHomeLocations.csv",
}

# Control Totals for Private Schools (Regional Level)
# These schools (BYU, Ensign, Westminster) do not have granular Zip-level enrollment data.
# Instead, we rely on a single "Control Total" for the entire region.
PRIVATE_CONTROLS = pd.DataFrame(
    {"COLLEGE": ["BYU", "ENSIGN", "WESTMIN"], "Control_Total": [34300, 1950, 2200]}
)

2 Data Loading & Preprocessing

2.1 Identifying Valid Residential Zones (TAZs)

Before processing trips, we need to define where students could possibly live. We load the master TAZ shapefile and join it with Socio-Economic (SE) data.

We calculate a studentRatio for each TAZ:

$$\text{Student Ratio} = \frac{\text{Total Households}}{\text{Total Households} + \text{Total Employment}}$$

This ratio acts as a weighting factor later. It assumes that a trip originating from a TAZ with high employment (like a business park) is less likely to be a “home-based” student trip than a trip from a purely residential TAZ. We filter out any TAZ with min_hh_threshold (24) households or fewer.
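As a toy illustration of the weighting (hypothetical TAZ values, not from the model inputs), a residential zone receives a far higher weight than an employment-heavy one:

```python
import pandas as pd

# Hypothetical TAZs: one residential, one business park
toy = pd.DataFrame({
    "TAZID": [101, 102],
    "TOTHH": [800, 50],     # households
    "TOTEMP": [200, 4950],  # jobs
})

# Same ratio as the workflow: HH / (HH + EMP)
toy["studentRatio"] = toy["TOTHH"] / (toy["TOTHH"] + toy["TOTEMP"])
print(toy)
# Residential TAZ gets weight 0.8; the business park only 0.01
```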

Show the code
# Load SE Data and calculate the residential/student weight
se_data = pd.read_csv(PATHS['se_data'])
se_data = se_data.rename(columns={";TAZID": "TAZID"})
se_data = se_data[se_data['TOTHH'] > CONSTANTS['min_hh_threshold']].copy()
se_data['studentRatio'] = se_data['TOTHH'] / (se_data['TOTEMP'] + se_data['TOTHH'])

# Load TAZ Polygons (Geometry)
taz_poly = gpd.read_file(PATHS['taz_shp'])[['TAZID', 'geometry']]
taz_poly = taz_poly.to_crs(CONSTANTS['crs_projected'])

# Define "Valid TAZs" (Geometry + SE Data for valid residential zones)
valid_tazs = taz_poly.merge(se_data[['TAZID', 'studentRatio']], on='TAZID', how='inner')

# Validation Output
print(f"Original TAZ Count: {len(taz_poly)}")
print(f"Filtered (Valid) TAZ Count: {len(valid_tazs)}")
Original TAZ Count: 3562
Filtered (Valid) TAZ Count: 2412

2.2 Processing College & StreetLight Geometries

We use the StreetLight TAZ shapefile for two distinct purposes:

  1. Campus Polygons: We define the physical boundaries of the 5 main universities (sl_college_poly). This allows us to spatially “snap” nearby dorms to the correct school.
  2. Trip Origins: We generate centroids for all StreetLight zones (sl_centroids_valid). These points represent the start/end locations for the trip data we will process later.
Show the code
# Load raw StreetLight Polygons
sl_taz_poly = gpd.read_file(PATHS['sl_taz_shp'])[['SA_TAZID', 'SL_COTAZID', 'geometry']]
sl_taz_poly = sl_taz_poly.to_crs(CONSTANTS['crs_projected'])

# 1. Define College Campus Polygons
#    Filter to the 5 main campuses relevant to this study
college_key = pd.read_csv(PATHS['college_key'])
sl_college_poly = sl_taz_poly.merge(college_key, on='SL_COTAZID', how='inner')
sl_college_poly = sl_college_poly[
    sl_college_poly['COLLEGE'].isin(["WSU_MAIN", "UOFU_MAIN", "UVU_MAIN", "BYU", 'WESTMIN'])
]

# 2. Define Valid SL Centroids (for Trip Distribution)
#    Convert SL polygons to centroids and join with our "Valid TAZ" list.
#    This ensures we only consider trips originating from residential areas.
sl_centroids_valid = sl_taz_poly.copy()
sl_centroids_valid['geometry'] = sl_centroids_valid.centroid

# Note: valid_tazs is a GeoDataFrame, we drop geometry for the merge
sl_centroids_valid = sl_centroids_valid.merge(
    valid_tazs.drop(columns='geometry'),
    left_on="SA_TAZID",
    right_on="TAZID",
    how='inner'
)

2.3 Assigning Group Quarters (Dorms)

Students living in dormitories (Group Quarters or GQ) are a “Fixed Population”—we know exactly where they are. We load Census Block data containing student GQ counts.

Using a nearest-neighbor spatial join, we assign each census block to the closest university campus. To avoid false positives (e.g., a nursing home being misidentified as a dorm), we enforce a strict distance threshold of gq_dist_threshold (4,000 feet). Blocks farther away are dropped.

Show the code
# Load Block Data (census block polygons with student GQ counts; reduced to centroids below)
block_polys = gpd.read_file(PATHS['block_shp'])[[
  'geoid20',
  'gq_student',
  'geometry'
]]
block_polys = block_polys.to_crs(CONSTANTS['crs_projected'])
block_polys['geometry'] = block_polys.centroid
Show the code
# Find nearest campus. If ties exist (equidistant), groupby takes the first one found.
block_gq_assigned = (
    gpd.sjoin_nearest(
        block_polys,
        sl_college_poly[["COLLEGE", "geometry"]],
        how="left",
        distance_col="dist_ft",
    )
    .query(f"dist_ft < {CONSTANTS['gq_dist_threshold']}")
    .groupby("geoid20", as_index=False)
    .first()  # Automatically resolves duplicates by taking the first match
)[["COLLEGE", "gq_student", "geometry"]]
Show the code
# Aggregate these GQ totals to the TAZ level
# (These will be added back to the final dataset at the very end)
# We spatial join the block centroids back to the TAZ polygons to find their TAZID
taz_gq_totals = gpd.sjoin(block_gq_assigned, taz_poly, how="inner", predicate="within")

taz_gq_totals = (
    taz_gq_totals.groupby(["TAZID", "COLLEGE"])["gq_student"].sum().reset_index()
)

2.4 Creating the Trip Probability Grid (StreetLight)

This step builds the “Probability Map” for where commuter students live. We load the raw Origin-Destination (OD) trip data from StreetLight.

We filter for Mid-Day Weekday trips (9am - 3pm, Tue-Thu), which are most representative of students attending classes. We then weight these trips using the studentRatio calculated earlier. This down-weights trips from commercial zones and up-weights trips from residential neighborhoods.

Show the code
# Load and process raw trip data
sl_trips = pd.read_csv(PATHS["sl_trips"])

# Filter for Weekday Mid-Day
sl_trips = sl_trips[
    (sl_trips["day_type"] == "1: Weekday (Tu-Th)")
    & (sl_trips["day_part"] == "3: Mid-Day (9am-3pm)")
]

# Join destination college names
college_key_filtered = pd.read_csv(PATHS["college_key"])
college_key_filtered = college_key_filtered[
    college_key_filtered["COLLEGE"] != "UVU_GENEVA"
]

sl_trips = sl_trips.merge(
    college_key_filtered,
    left_on="destination_zone_name",
    right_on="SL_COTAZID",
    how="inner",
)
sl_trips = sl_trips.rename(columns={"COLLEGE": "D_COLLEGE"})

# Sum total trips by Origin Zone -> Destination College
sl_trips = (
    sl_trips.groupby(["origin_zone_name", "D_COLLEGE"])[
        "o_d_traffic_calibrated_trip_volume"
    ]
    .sum()
    .reset_index()
    .rename(columns={"o_d_traffic_calibrated_trip_volume": "trip_vol"})
)

# Join to SL centroids to attach TAZ IDs and Student Ratios
sl_trips = sl_trips.merge(
    sl_centroids_valid[["SL_COTAZID", "SA_TAZID", "studentRatio"]],
    left_on="origin_zone_name",
    right_on="SL_COTAZID",
    how="inner",
)

# Apply the Student Ratio weighting
sl_trips["weighted_trips"] = sl_trips["trip_vol"] * sl_trips["studentRatio"]
sl_trips = sl_trips[["SA_TAZID", "origin_zone_name", "D_COLLEGE", "weighted_trips"]]

3 Public School Disaggregation

For public universities, we have enrollment data available at the Zip Code level. Our strategy is to perform a constrained distribution:

  1. Calculate Net Demand: For each Zip Code, we take the Total Enrollment and subtract the known Dorm Students (gq_in_zip).
  2. Handle Overflows: If a Zip Code has more dorm students than total enrollment (due to data reporting mismatches), we assume the “Commuter Demand” is zero (max(0, ...)).
  3. Distribute: We take the remaining students (Net Demand) and spread them across the TAZs inside that Zip Code. The spread is not even; it is proportional to the weighted StreetLight trips we calculated in Step 2.4.
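The three steps above can be sketched with made-up numbers for a single Zip Code (the real workflow runs the same arithmetic per Zip/College pair):

```python
import pandas as pd

# One hypothetical Zip: 1,000 enrolled, 150 in dorms -> 850 commuters to place
tot_enrol, gq_in_zip = 1000, 150
net_demand = max(0, tot_enrol - gq_in_zip)  # step 2: clip overflows at zero

# Three TAZs inside the Zip, with weighted StreetLight trips
tazs = pd.DataFrame({"TAZID": [1, 2, 3], "weighted_trips": [60.0, 30.0, 10.0]})

# Step 3: each TAZ's share of the Zip's trips allocates the net demand
tazs["share"] = tazs["weighted_trips"] / tazs["weighted_trips"].sum()
tazs["students"] = net_demand * tazs["share"]
print(tazs)  # ~510, 255, and 85 students respectively
```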

3.1 Preparing Zip Code Geometry & Enrollment

We load the Zip Code boundaries and the enrollment data. We pre-aggregate the enrollment CSV to ensure uniqueness (one row per Zip/College) and prevent join errors later.

Show the code
# 1. Load Zip Code Boundaries
# Load the full layer, then filter to the study-area counties below
zip_poly = gpd.read_file(PATHS["zip_shp"])
zip_poly = zip_poly.to_crs(CONSTANTS["crs_projected"])

# Filter to the study-area counties (COUNTYNBR codes 18, 2, 25, 6, 29, 20)
zip_poly = zip_poly[
    zip_poly["COUNTYNBR"].astype(str).isin(["18", "2", "25", "6", "29", "20"])
]

zip_poly["ZIP5"] = (
    pd.to_numeric(zip_poly["ZIP5"], errors="coerce").fillna(0).astype(int).astype(str)
)
Show the code
# 2. Prepare Enrollment Data
#    Summing by Zip/College ensures uniqueness before joining geometry.
zip_enroll = pd.read_csv(PATHS["zip_enrollment"])
zip_enroll = zip_enroll.groupby(["ZIPCODE", "COLLEGE"])["TotEnrol"].sum().reset_index()

zip_enroll["ZIPCODE"] = zip_enroll["ZIPCODE"].astype(str)

zip_enroll = zip_enroll.merge(zip_poly, left_on="ZIPCODE", right_on="ZIP5", how="inner")
zip_enroll = gpd.GeoDataFrame(zip_enroll, geometry="geometry")

3.2 Calculating Net Demand

Here we spatially join our assigned dorm blocks to the Zip Codes to determine how many “Fixed” students are in each Zip. We subtract this from the total enrollment to find the number of “Commuter” students we need to distribute.

Show the code
# 3. Calculate GQ (Dorm) students per Zip Code
block_gq_per_zip = gpd.sjoin(
    block_gq_assigned, zip_poly, how="inner", predicate="within"
)

block_gq_per_zip = (
    block_gq_per_zip.groupby(["ZIP5", "COLLEGE"])["gq_student"]
    .sum()
    .reset_index()
    .rename(columns={"gq_student": "gq_in_zip"})
)
Show the code
# 4. Calculate Net Demand (Total Enrollment - Dorms)
public_demand = pd.DataFrame(zip_enroll.drop(columns="geometry"))

public_demand = public_demand.merge(
    block_gq_per_zip,
    left_on=["ZIPCODE", "COLLEGE"],
    right_on=["ZIP5", "COLLEGE"],
    how="left",
)

public_demand["gq_in_zip"] = public_demand["gq_in_zip"].fillna(0)

# Use clip(lower=0) to ensure we don't get negative students
public_demand["net_demand"] = (
    public_demand["TotEnrol"] - public_demand["gq_in_zip"]
).clip(lower=0)

# Filter out private schools
public_demand = public_demand[
    ~public_demand["COLLEGE"].isin(PRIVATE_CONTROLS["COLLEGE"])
]

3.3 Distributing Commuters to TAZs

Finally, we apply the StreetLight trip distribution. We figure out which TAZs are in which Zip Code, calculate their share of the total trips in that Zip, and assign the Net Demand accordingly.

Show the code
# 5. Distribute Net Demand to TAZs
public_trips = sl_trips[~sl_trips["D_COLLEGE"].isin(PRIVATE_CONTROLS["COLLEGE"])].copy()

# Attach geometry to trips (using the centroids dataframe)
public_trips = public_trips.merge(
    sl_centroids_valid[["SL_COTAZID", "SA_TAZID", "geometry"]],
    left_on=["origin_zone_name", "SA_TAZID"],
    right_on=["SL_COTAZID", "SA_TAZID"],
    how="inner",
)
public_trips = gpd.GeoDataFrame(public_trips, geometry="geometry")

# Determine which Zip Code each Trip Origin falls into
public_trips = gpd.sjoin(public_trips, zip_poly, how="inner", predicate="within")

# Fill missing SL weights with a constant
public_trips["weighted_trips"] = public_trips["weighted_trips"].fillna(
    CONSTANTS["magic_fill_value"]
)

# Calculate the "Share" of trips for each TAZ within its specific Zip Code
public_trips["zip_total_weight"] = public_trips.groupby(["ZIP5", "D_COLLEGE"])[
    "weighted_trips"
].transform("sum")
public_trips["share"] = (
    public_trips["weighted_trips"] / public_trips["zip_total_weight"]
)

# Join the Net Demand calculated above
public_distribution = public_trips.merge(
    public_demand,
    left_on=["ZIP5", "D_COLLEGE"],
    right_on=["ZIPCODE", "COLLEGE"],
    how="inner",
)

# Final Calculation: Zip Demand * TAZ Share
public_distribution["distributed_students"] = (
    public_distribution["net_demand"] * public_distribution["share"]
)
public_distribution = public_distribution[
    ["SA_TAZID", "D_COLLEGE", "distributed_students"]
].rename(columns={"D_COLLEGE": "COLLEGE"})

3.4 Validation: Public Schools

We perform a quick check to ensure the distributed student counts roughly match the input net demand. Small discrepancies are expected due to Zip Codes with enrollment but no corresponding StreetLight trips (lost data).

Show the code
public_check = (
    public_distribution.groupby("COLLEGE")["distributed_students"]
    .sum()
    .reset_index()
    .rename(columns={"distributed_students": "Distributed_Total"})
)

public_input_demand = (
    public_demand.groupby("COLLEGE")["net_demand"]
    .sum()
    .reset_index()
    .rename(columns={"net_demand": "Input_Net_Demand"})
)

public_check = public_check.merge(public_input_demand, on="COLLEGE", how="left")
print(public_check)
     COLLEGE  Distributed_Total  Input_Net_Demand
0    SLCC_JD             2683.0            2910.0
1  SLCC_MAIN            14063.0           14742.0
2    SLCC_ML              325.0             352.0
3    SLCC_SC             2072.0            2207.0
4  UOFU_MAIN            23670.0           23996.0
5   UVU_MAIN            18098.0           18319.0
6  WSU_DAVIS             1888.0            1981.0
7   WSU_MAIN             8726.0            9158.0
Note

This comparison is a helpful double-check, but it isn’t exact: some Zip-level demand is disregarded during distribution because no StreetLight trips originated in those Zip Codes.

To verify that the shortfall comes from these “lost Zips” rather than a code bug, we can run the diagnostic below, which lists every Zip Code that had demand but zero StreetLight trips.

Show the code
# 1. Identify which Zips actually have StreetLight trips
#    (We reconstruct the data from the previous step)
zips_with_trips = public_trips[["ZIP5", "D_COLLEGE"]].drop_duplicates()
zips_with_trips = zips_with_trips.rename(
    columns={"ZIP5": "ZIPCODE", "D_COLLEGE": "COLLEGE"}
)

# 2. Find Demand that has NO matching trips
#    Merge with indicator=True to find rows only in left
lost_zips = public_demand.merge(
    zips_with_trips, on=["ZIPCODE", "COLLEGE"], how="left", indicator=True
)

lost_zips = lost_zips[
    (lost_zips["_merge"] == "left_only") & (lost_zips["net_demand"] > 0)
]

# 3. Validation
print(f"Total Lost Students: {lost_zips['net_demand'].sum()}")

# See which Zips were lost
print(lost_zips.head())
Total Lost Students: 10122.0
  ZIPCODE   COLLEGE  TotEnrol COUNTYNBR           NAME  SYMBOL ZIP5_x ZIP5_y  \
0   84003   SLCC_AP         1        25  AMERICAN FORK       4  84003    NaN   
5   84003   SLCC_UH        12        25  AMERICAN FORK       4  84003    NaN   
7   84003   UOFU_SC        13        25  AMERICAN FORK       4  84003    NaN   
8   84003    USU_BC         2        25  AMERICAN FORK       4  84003    NaN   
9   84003  USU_MAIN       175        25  AMERICAN FORK       4  84003    NaN   

   gq_in_zip  net_demand     _merge  
0        0.0         1.0  left_only  
5        0.0        12.0  left_only  
7        0.0        13.0  left_only  
8        0.0         2.0  left_only  
9        0.0       175.0  left_only  

4 Private School Disaggregation

For private universities (BYU, Ensign, Westminster), we do not have granular Zip Code data. Instead, we perform a Regional Distribution.

  1. Calculate Net Demand: We start with the regional Control Total for the school and subtract the total number of Dorm Students found in Step 2.3.
  2. Distribute: We spread the remaining students across all valid TAZs in the entire region. The spread is proportional to the weighted StreetLight trips, but unlike public schools, it is not constrained by Zip Code boundaries.
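Because the shares are computed over the whole region, they sum to exactly 1 per school, so no demand can be lost to missing Zips; this is why the private totals reconcile exactly. A small sketch with hypothetical trip weights:

```python
import pandas as pd

# Hypothetical region-wide weighted trips to one private school
trips = pd.DataFrame({"SA_TAZID": [1, 2, 3, 4],
                      "weighted_trips": [50.0, 25.0, 15.0, 10.0]})
net_demand = 24245  # e.g., a control total minus dorm students

# Shares are taken over the entire region, not within a Zip
trips["share"] = trips["weighted_trips"] / trips["weighted_trips"].sum()
trips["students"] = net_demand * trips["share"]

# Shares sum to 1, so the distributed total matches the target
assert abs(trips["students"].sum() - net_demand) < 1e-6
```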
Show the code
# 1. Calculate Net Demand (Region Total - Total Dorms)
gq_by_college = taz_gq_totals.groupby("COLLEGE")["gq_student"].sum().reset_index()

private_demand = PRIVATE_CONTROLS.merge(gq_by_college, on="COLLEGE", how="left")

private_demand["gq_student"] = private_demand["gq_student"].fillna(0)
private_demand["net_demand"] = (
    private_demand["Control_Total"] - private_demand["gq_student"]
)
Show the code
# 2. Distribute based on Region-wide Trip Shares
private_trips = sl_trips[sl_trips["D_COLLEGE"].isin(PRIVATE_CONTROLS["COLLEGE"])].copy()

# Here, sum across the entire region, not just a Zip
private_trips["region_total_weight"] = private_trips.groupby("D_COLLEGE")[
    "weighted_trips"
].transform("sum")
private_trips["share"] = (
    private_trips["weighted_trips"] / private_trips["region_total_weight"]
)

private_distribution = private_trips.merge(
    private_demand, left_on="D_COLLEGE", right_on="COLLEGE", how="inner"
)

private_distribution["distributed_students"] = (
    private_distribution["net_demand"] * private_distribution["share"]
)
private_distribution = private_distribution[
    ["SA_TAZID", "COLLEGE", "distributed_students"]
]

4.1 Validation: Private Schools

Since the distribution is regional and not constrained by missing Zip data, the input and output totals should match exactly.

Show the code
private_check = (
    private_distribution.groupby("COLLEGE")["distributed_students"]
    .sum()
    .reset_index()
    .rename(columns={"distributed_students": "Distributed_Total"})
)

private_target_demand = private_demand[["COLLEGE", "net_demand"]].rename(
    columns={"net_demand": "Target_Net_Demand"}
)

private_check = private_check.merge(private_target_demand, on="COLLEGE", how="left")
print(private_check)
   COLLEGE  Distributed_Total  Target_Net_Demand
0      BYU            24245.0            24245.0
1   ENSIGN             1950.0             1950.0
2  WESTMIN             1635.0             1635.0

5 Final Compilation & Export

In this final step, we synthesize the results.

  1. Combine: We merge the distributed “Commuter” students (from both Public and Private workflows).
  2. Add Dorms: We join the “Fixed” Group Quarter students (calculated in Step 2.3) back to their respective TAZs.
  3. Format: We pivot the data into a “Wide” format (TAZ rows x College columns), creating the final matrix required for the travel demand model.
Show the code
# Combine Public and Private distributions
final_long_data = pd.concat(
    [public_distribution, private_distribution], ignore_index=True
)

# Add back the fixed GQ (dorm) students to their specific TAZs
final_long_data = final_long_data.merge(
    taz_gq_totals,
    left_on=["SA_TAZID", "COLLEGE"],
    right_on=["TAZID", "COLLEGE"],
    how="outer",
)

final_long_data["distributed_students"] = final_long_data[
    "distributed_students"
].fillna(0)
final_long_data["gq_student"] = final_long_data["gq_student"].fillna(0)

# Total Students = Commuters + Dorm Residents
final_long_data["total_students"] = (
    final_long_data["distributed_students"] + final_long_data["gq_student"]
)

# Consolidate TAZID (coalesce keys from merge)
final_long_data["SA_TAZID"] = final_long_data["SA_TAZID"].fillna(
    final_long_data["TAZID"]
)
Show the code
# Pivot to Wide Format (One column per college)
# First sum up in case of duplicates (rare, but good practice)
final_grouped = (
    final_long_data.groupby(["SA_TAZID", "COLLEGE"])["total_students"]
    .sum()
    .round(0)
    .reset_index()
    .pivot(index="SA_TAZID", columns="COLLEGE", values="total_students")
)

# 1. Fill NaNs with 0 and convert to Integer
final_output = (
    final_grouped.reindex(
        columns=[
            "ENSIGN",
            "WESTMIN",
            "WSU_MAIN",
            "WSU_DAVIS",
            "SLCC_SC",
            "BYU",
            "SLCC_MAIN",
            "UVU_MAIN",
            "UOFU_MAIN",
            "SLCC_JD",
            "SLCC_ML",
        ],
        fill_value=0,
    )
    .fillna(0)
    .astype(int)
)

# Explicitly cast the index to integer
final_output.index = final_output.index.astype(int)
final_output.index.name = ";Z"

# Preview the final table
print(final_output.head())
COLLEGE  ENSIGN  WESTMIN  WSU_MAIN  WSU_DAVIS  SLCC_SC  BYU  SLCC_MAIN  \
;Z                                                                       
5             4        3         0          0        0    0          0   
8             0        0        57          0        0    0          0   
24            0        0         0          5        0    0          0   
25            0        0         7          0        0    0          0   
26            0        2         0          0        0    0          0   

COLLEGE  UVU_MAIN  UOFU_MAIN  SLCC_JD  SLCC_ML  
;Z                                              
5               0          0        0        0  
8               0          0        0        0  
24              0          0        0        0  
25              0          0        0        0  
26              0          0        0        0  
Show the code
# Export: the index (;Z) is written as the first column of the CSV
final_output.to_csv(PATHS["output_csv"])
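A round-trip sanity check can confirm that the ;Z index header and integer values survive the export. Sketched here on a toy frame and an in-memory buffer, since the real results/ path may not exist in every environment:

```python
import io
import pandas as pd

# Toy wide matrix mirroring final_output's shape (hypothetical values)
toy = pd.DataFrame({"BYU": [0, 12], "UVU_MAIN": [5, 0]},
                   index=pd.Index([5, 8], name=";Z"))

buf = io.StringIO()
toy.to_csv(buf)          # index (;Z) becomes the first CSV column
buf.seek(0)

back = pd.read_csv(buf, index_col=";Z")
assert back.equals(toy)  # values, dtypes, and index survive the round trip
```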