---
title: "College Campus Disaggregation"
subtitle: "Disaggregating university enrollment to TAZ level using StreetLight data"
description: "This workflow distributes college enrollment data (Zip-level or Regional) to Traffic Analysis Zones (TAZs). It accounts for fixed Group Quarter populations (dorms) and distributes the remaining commuter population based on StreetLight origin-destination trip patterns."
author:
- name: Pukar Bhandari
email: pukar.bhandari@wfrc.utah.gov
affiliation:
- name: Wasatch Front Regional Council
url: "https://wfrc.utah.gov/"
date: "2025-01-01"
---
# Setup & Configuration
We begin by establishing the computational environment. This section loads the necessary libraries for spatial analysis (`geopandas`, `shapely`) and data manipulation (`pandas`, `numpy`).
Crucially, we define a central **configuration list (`CONSTANTS` and `PATHS`)**. This practice avoids "magic numbers" (hardcoded values hidden deep in the code) and makes the script robust. If the TAZ shapefile name changes next year, or if the assumption about dorm distance needs adjusting, you only need to change it here.
* **`min_hh_threshold`**: We ignore TAZs with 24 or fewer households to prevent distributing students to industrial or unpopulated zones.
* **`gq_dist_threshold`**: When assigning census blocks (dorms) to campuses, we only consider blocks within 4,000 feet of a campus boundary to ensure accuracy.
* **`magic_fill_value`**: A specific fallback weight (approx. 0.74) used when StreetLight data is missing for a specific Zip code, ensuring no enrollment data is lost due to gaps in the trip data.
```{python}
#| message: false
#| warning: false
import pandas as pd
import geopandas as gpd
import numpy as np
from shapely.geometry import Point
from pathlib import Path
import warnings
# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")
# --- CONFIGURATION ---
CONSTANTS = {
"crs_projected": 26912, # NAD83 / Utah Central (EPSG:26912)
"min_hh_threshold": 24, # TAZs with fewer HHs are ignored for distribution
"gq_dist_threshold": 4000, # Max distance (ft) to snap dorm blocks to campus
"magic_fill_value": 0.7435897, # Fallback weight for Zips with no StreetLight data
}
PATHS = {
# Raw Input Data
"se_data": "data/SE_2023.csv",
"taz_shp": "data/TAZ/WFv1000_TAZ.shp",
"sl_taz_shp": "data/StreetLight_TAZ_2021_09_22/StreetLight_TAZ_2021_09_22.shp",
"college_key": "data/College_to_SL_COTAZID.csv",
# Intermediate Input Data
"block_shp": "intermediate/groupquarter_censusblocks_2020.shp",
"zip_enrollment": "intermediate/TotEnrolByZipByCollege.csv",
"zip_shp": "intermediate/ugrc_zip_codes.shp",
"sl_trips": "intermediate/dfCollegeDataOD.csv",
# Output
"output_csv": "results/collegeHomeLocations.csv",
}
# Control Totals for Private Schools (Regional Level)
# These schools (BYU, Ensign, Westminster) do not have granular Zip-level enrollment data.
# Instead, we rely on a single "Control Total" for the entire region.
PRIVATE_CONTROLS = pd.DataFrame(
{"COLLEGE": ["BYU", "ENSIGN", "WESTMIN"], "Control_Total": [34300, 1950, 2200]}
)
```
# Data Loading & Preprocessing
## Identifying Valid Residential Zones (TAZs)
Before processing trips, we need to define where students could possibly live. We load the master TAZ shapefile and join it with Socio-Economic (SE) data.
We calculate a `studentRatio` for each TAZ: $$ \text{Student Ratio} = \frac{\text{Total Households}}{\text{Total Households} + \text{Total Employment}} $$
This ratio acts as a weighting factor later. It assumes that a trip originating from a TAZ with high employment (like a business park) is less likely to be a "home-based" student trip than a trip from a purely residential TAZ. We filter out any TAZ with `{python} CONSTANTS['min_hh_threshold']` or fewer households.
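As a toy illustration of the weighting (all TAZ values below are invented, not from the SE file):

```python
import pandas as pd

# Hypothetical TAZs: a business park vs. a residential neighborhood
toy = pd.DataFrame({
    "TAZID": [101, 102],
    "TOTHH": [100, 500],    # total households
    "TOTEMP": [900, 50],    # total employment
})
toy["studentRatio"] = toy["TOTHH"] / (toy["TOTHH"] + toy["TOTEMP"])
print(toy)
```

The business park (TAZ 101) gets weight 0.10 while the residential TAZ (102) gets about 0.91, so its trips count roughly nine times as much in the distribution.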
```{python}
#| message: false
# Load SE Data and calculate the residential/student weight
se_data = pd.read_csv(PATHS['se_data'])
se_data = se_data.rename(columns={";TAZID": "TAZID"})
se_data = se_data[se_data['TOTHH'] > CONSTANTS['min_hh_threshold']].copy()
se_data['studentRatio'] = se_data['TOTHH'] / (se_data['TOTEMP'] + se_data['TOTHH'])
# Load TAZ Polygons (Geometry)
taz_poly = gpd.read_file(PATHS['taz_shp'])[['TAZID', 'geometry']]
taz_poly = taz_poly.to_crs(CONSTANTS['crs_projected'])
# Define "Valid TAZs" (Geometry + SE Data for valid residential zones)
valid_tazs = taz_poly.merge(se_data[['TAZID', 'studentRatio']], on='TAZID', how='inner')
# Validation Output
print(f"Original TAZ Count: {len(taz_poly)}")
print(f"Filtered (Valid) TAZ Count: {len(valid_tazs)}")
```
## Processing College & StreetLight Geometries
We use the StreetLight TAZ shapefile for two distinct purposes:
1. **Campus Polygons:** We define the physical boundaries of the 5 main universities (`sl_college_poly`). This allows us to spatially "snap" nearby dorms to the correct school.
2. **Trip Origins:** We generate centroids for all StreetLight zones (`sl_centroids_valid`). These points represent the start/end locations for the trip data we will process later.
```{python}
# Load raw StreetLight Polygons
sl_taz_poly = gpd.read_file(PATHS['sl_taz_shp'])[['SA_TAZID', 'SL_COTAZID', 'geometry']]
sl_taz_poly = sl_taz_poly.to_crs(CONSTANTS['crs_projected'])
# 1. Define College Campus Polygons
# Filter to the 5 main campuses relevant to this study
college_key = pd.read_csv(PATHS['college_key'])
sl_college_poly = sl_taz_poly.merge(college_key, on='SL_COTAZID', how='inner')
sl_college_poly = sl_college_poly[
sl_college_poly['COLLEGE'].isin(["WSU_MAIN", "UOFU_MAIN", "UVU_MAIN", "BYU", 'WESTMIN'])
]
# 2. Define Valid SL Centroids (for Trip Distribution)
# Convert SL polygons to centroids and join with our "Valid TAZ" list.
# This ensures we only consider trips originating from residential areas.
sl_centroids_valid = sl_taz_poly.copy()
sl_centroids_valid['geometry'] = sl_centroids_valid.centroid
# Note: valid_tazs is a GeoDataFrame, we drop geometry for the merge
sl_centroids_valid = sl_centroids_valid.merge(
valid_tazs.drop(columns='geometry'),
left_on="SA_TAZID",
right_on="TAZID",
how='inner'
)
```
## Assigning Group Quarters (Dorms)
Students living in dormitories (Group Quarters or GQ) are a "Fixed Population"—we know exactly where they are. We load Census Block data containing student GQ counts.
Using a nearest-neighbor spatial join, we assign each census block to the closest university campus. To avoid false positives (e.g., a nursing home misidentified as a dorm), we enforce a strict distance threshold of `{python} CONSTANTS['gq_dist_threshold']` feet; blocks farther away are dropped.
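Stripped of the GIS layer, the assignment rule is simply "nearest campus, capped at a distance." A minimal sketch of that logic with made-up planar coordinates (feet):

```python
import numpy as np

# Made-up coordinates (ft): three dorm blocks, two campus reference points
blocks = np.array([[0.0, 0.0], [500.0, 0.0], [20000.0, 0.0]])
campuses = np.array([[100.0, 0.0], [9000.0, 500.0]])
THRESHOLD_FT = 4000

# Pairwise Euclidean distances: rows = blocks, columns = campuses
dists = np.linalg.norm(blocks[:, None, :] - campuses[None, :, :], axis=2)
nearest = dists.argmin(axis=1)      # closest campus per block
nearest_dist = dists.min(axis=1)    # distance to that campus
keep = nearest_dist < THRESHOLD_FT  # blocks farther away are dropped

print(nearest, nearest_dist.round(0), keep)
```

The third block's nearest campus is over 11,000 ft away, so it is excluded, exactly as the `query` on `dist_ft` does in the real workflow.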
```{python}
# Load Block Data (Points representing census blocks with students)
block_polys = gpd.read_file(PATHS['block_shp'])[[
'geoid20',
'gq_student',
'geometry'
]]
block_polys = block_polys.to_crs(CONSTANTS['crs_projected'])
block_polys['geometry'] = block_polys.centroid
```
```{python}
# Find nearest campus. If ties exist (equidistant), groupby takes the first match.
block_gq_assigned = (
    gpd.sjoin_nearest(
        block_polys,
        sl_college_poly[["COLLEGE", "geometry"]],
        how="left",
        distance_col="dist_ft",
    )
    .query(f"dist_ft < {CONSTANTS['gq_dist_threshold']}")
    .groupby("geoid20", as_index=False)
    .first()  # resolves duplicates by taking the first match per block
)[["COLLEGE", "gq_student", "geometry"]]
# groupby().first() returns a plain DataFrame; rebuild the GeoDataFrame so the
# spatial joins below keep working
block_gq_assigned = gpd.GeoDataFrame(
    block_gq_assigned, geometry="geometry", crs=CONSTANTS["crs_projected"]
)
```
```{python}
# Aggregate these GQ totals to the TAZ level
# (These will be added back to the final dataset at the very end)
# We spatial join the block centroids back to the TAZ polygons to find their TAZID
taz_gq_totals = gpd.sjoin(block_gq_assigned, taz_poly, how="inner", predicate="within")
taz_gq_totals = (
taz_gq_totals.groupby(["TAZID", "COLLEGE"])["gq_student"].sum().reset_index()
)
```
## Creating the Trip Probability Grid (StreetLight)
This step builds the "Probability Map" for where commuter students live. We load the raw Origin-Destination (OD) trip data from StreetLight.
We filter for **Mid-Day Weekday** trips (9am - 3pm, Tue-Thu), which are most representative of students attending classes. We then weight these trips using the `studentRatio` calculated earlier. This down-weights trips from commercial zones and up-weights trips from residential neighborhoods.
```{python}
# Load and process raw trip data
sl_trips = pd.read_csv(PATHS["sl_trips"])
# Filter for Weekday Mid-Day
sl_trips = sl_trips[
(sl_trips["day_type"] == "1: Weekday (Tu-Th)")
& (sl_trips["day_part"] == "3: Mid-Day (9am-3pm)")
]
# Join destination college names
college_key_filtered = pd.read_csv(PATHS["college_key"])
college_key_filtered = college_key_filtered[
college_key_filtered["COLLEGE"] != "UVU_GENEVA"
]
sl_trips = sl_trips.merge(
college_key_filtered,
left_on="destination_zone_name",
right_on="SL_COTAZID",
how="inner",
)
sl_trips = sl_trips.rename(columns={"COLLEGE": "D_COLLEGE"})
# Sum total trips by Origin Zone -> Destination College
sl_trips = (
sl_trips.groupby(["origin_zone_name", "D_COLLEGE"])[
"o_d_traffic_calibrated_trip_volume"
]
.sum()
.reset_index()
.rename(columns={"o_d_traffic_calibrated_trip_volume": "trip_vol"})
)
# Join to SL centroids to attach TAZ IDs and Student Ratios
sl_trips = sl_trips.merge(
sl_centroids_valid[["SL_COTAZID", "SA_TAZID", "studentRatio"]],
left_on="origin_zone_name",
right_on="SL_COTAZID",
how="inner",
)
# Apply the Student Ratio weighting
sl_trips["weighted_trips"] = sl_trips["trip_vol"] * sl_trips["studentRatio"]
sl_trips = sl_trips[["SA_TAZID", "origin_zone_name", "D_COLLEGE", "weighted_trips"]]
```
# Public School Disaggregation
For public universities, we have enrollment data available at the **Zip Code** level. Our strategy is to perform a constrained distribution:
1. **Calculate Net Demand:** For each Zip Code, we take the Total Enrollment and subtract the known Dorm Students (`gq_in_zip`).
2. **Handle Overflows:** If a Zip Code has more dorm students than total enrollment (due to data reporting mismatches), we assume the "Commuter Demand" is zero (`max(0, ...)`).
3. **Distribute:** We take the remaining students (Net Demand) and spread them across the TAZs *inside* that Zip Code. The spread is not even; it is proportional to the weighted StreetLight trips we calculated in Step 2.4.
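The three steps above amount to a clipped subtraction followed by a proportional split. A toy example (all numbers invented):

```python
import pandas as pd

# Steps 1-2: net commuter demand for one Zip Code, overflow-safe
tot_enrol, gq_in_zip = 1000, 200
net_demand = max(0, tot_enrol - gq_in_zip)   # 800 commuters to place

# Step 3: three TAZs inside the Zip, with weighted StreetLight trip volumes
tazs = pd.DataFrame({"TAZID": [1, 2, 3], "weighted_trips": [60.0, 30.0, 10.0]})
tazs["share"] = tazs["weighted_trips"] / tazs["weighted_trips"].sum()
tazs["students"] = net_demand * tazs["share"]
print(tazs)
```

Shares of 0.6/0.3/0.1 allocate roughly 480/240/80 students; because the shares sum to 1, the Zip's total demand is preserved by construction.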
## Preparing Zip Code Geometry & Enrollment
We load the Zip Code boundaries and the enrollment data. We pre-aggregate the enrollment CSV to ensure uniqueness (one row per Zip/College), which prevents join errors later.
```{python}
# 1. Load Zip Code Boundaries
# Load all Zip polygons, then filter to the study-area counties below
zip_poly = gpd.read_file(PATHS["zip_shp"])
zip_poly = zip_poly.to_crs(CONSTANTS["crs_projected"])
# Keep only the study-area counties (COUNTYNBR codes from the UGRC Zip layer)
zip_poly = zip_poly[
zip_poly["COUNTYNBR"].astype(str).isin(["18", "2", "25", "6", "29", "20"])
]
zip_poly["ZIP5"] = (
pd.to_numeric(zip_poly["ZIP5"], errors="coerce").fillna(0).astype(int).astype(str)
)
```
```{python}
# 2. Prepare Enrollment Data
# Summing by Zip/College ensures uniqueness before joining geometry.
zip_enroll = pd.read_csv(PATHS["zip_enrollment"])
zip_enroll = zip_enroll.groupby(["ZIPCODE", "COLLEGE"])["TotEnrol"].sum().reset_index()
zip_enroll["ZIPCODE"] = zip_enroll["ZIPCODE"].astype(str)
zip_enroll = zip_enroll.merge(zip_poly, left_on="ZIPCODE", right_on="ZIP5", how="inner")
zip_enroll = gpd.GeoDataFrame(zip_enroll, geometry="geometry")
```
## Calculating Net Demand
Here we spatially join our assigned dorm blocks to the Zip Codes to determine how many "Fixed" students are in each Zip. We subtract this from the total enrollment to find the number of "Commuter" students we need to distribute.
```{python}
# 3. Calculate GQ (Dorm) students per Zip Code
block_gq_per_zip = gpd.sjoin(
block_gq_assigned, zip_poly, how="inner", predicate="within"
)
block_gq_per_zip = (
block_gq_per_zip.groupby(["ZIP5", "COLLEGE"])["gq_student"]
.sum()
.reset_index()
.rename(columns={"gq_student": "gq_in_zip"})
)
```
```{python}
# 4. Calculate Net Demand (Total Enrollment - Dorms)
public_demand = pd.DataFrame(zip_enroll.drop(columns="geometry"))
public_demand = public_demand.merge(
block_gq_per_zip,
left_on=["ZIPCODE", "COLLEGE"],
right_on=["ZIP5", "COLLEGE"],
how="left",
)
public_demand["gq_in_zip"] = public_demand["gq_in_zip"].fillna(0)
# Use clip(lower=0) to ensure we don't get negative students
public_demand["net_demand"] = (
public_demand["TotEnrol"] - public_demand["gq_in_zip"]
).clip(lower=0)
# Filter out private schools
public_demand = public_demand[
~public_demand["COLLEGE"].isin(PRIVATE_CONTROLS["COLLEGE"])
]
```
## Distributing Commuters to TAZs
Finally, we apply the StreetLight trip distribution. We figure out which TAZs are in which Zip Code, calculate their share of the total trips in that Zip, and assign the Net Demand accordingly.
```{python}
# 5. Distribute Net Demand to TAZs
public_trips = sl_trips[~sl_trips["D_COLLEGE"].isin(PRIVATE_CONTROLS["COLLEGE"])].copy()
# Attach geometry to trips (using the centroids dataframe)
public_trips = public_trips.merge(
sl_centroids_valid[["SL_COTAZID", "SA_TAZID", "geometry"]],
left_on=["origin_zone_name", "SA_TAZID"],
right_on=["SL_COTAZID", "SA_TAZID"],
how="inner",
)
public_trips = gpd.GeoDataFrame(public_trips, geometry="geometry")
# Determine which Zip Code each Trip Origin falls into
public_trips = gpd.sjoin(public_trips, zip_poly, how="inner", predicate="within")
# Fill missing SL weights with a constant
public_trips["weighted_trips"] = public_trips["weighted_trips"].fillna(
CONSTANTS["magic_fill_value"]
)
# Calculate the "Share" of trips for each TAZ within its specific Zip Code
public_trips["zip_total_weight"] = public_trips.groupby(["ZIP5", "D_COLLEGE"])[
"weighted_trips"
].transform("sum")
public_trips["share"] = (
public_trips["weighted_trips"] / public_trips["zip_total_weight"]
)
# Join the Net Demand calculated above
public_distribution = public_trips.merge(
public_demand,
left_on=["ZIP5", "D_COLLEGE"],
right_on=["ZIPCODE", "COLLEGE"],
how="inner",
)
# Final Calculation: Zip Demand * TAZ Share
public_distribution["distributed_students"] = (
public_distribution["net_demand"] * public_distribution["share"]
)
public_distribution = public_distribution[
["SA_TAZID", "D_COLLEGE", "distributed_students"]
].rename(columns={"D_COLLEGE": "COLLEGE"})
```
## Validation: Public Schools
We perform a quick check to ensure the distributed student counts roughly match the input net demand. Small discrepancies are expected due to Zip Codes with enrollment but no corresponding StreetLight trips (lost data).
```{python}
#| eval: true
public_check = (
public_distribution.groupby("COLLEGE")["distributed_students"]
.sum()
.reset_index()
.rename(columns={"distributed_students": "Distributed_Total"})
)
public_input_demand = (
public_demand.groupby("COLLEGE")["net_demand"]
.sum()
.reset_index()
.rename(columns={"net_demand": "Input_Net_Demand"})
)
public_check = public_check.merge(public_input_demand, on="COLLEGE", how="left")
print(public_check)
```
::: {.callout-note}
This check is a useful sanity test, but it is not exact: some Zip-level demand is dropped when no StreetLight trips exist within that Zip Code.
:::
To confirm that the discrepancy comes from these "lost" Zips and not from a code bug, we run a diagnostic that lists every Zip Code with demand but zero StreetLight trips.
```{python}
# 1. Identify which Zips actually have StreetLight trips
# (We reconstruct the data from the previous step)
zips_with_trips = public_trips[["ZIP5", "D_COLLEGE"]].drop_duplicates()
zips_with_trips = zips_with_trips.rename(
columns={"ZIP5": "ZIPCODE", "D_COLLEGE": "COLLEGE"}
)
# 2. Find Demand that has NO matching trips
# Merge with indicator=True to find rows only in left
lost_zips = public_demand.merge(
zips_with_trips, on=["ZIPCODE", "COLLEGE"], how="left", indicator=True
)
lost_zips = lost_zips[
(lost_zips["_merge"] == "left_only") & (lost_zips["net_demand"] > 0)
]
# 3. Validation
print(f"Total Lost Students: {lost_zips['net_demand'].sum()}")
# See which Zips were lost
print(lost_zips.head())
```
# Private School Disaggregation
For private universities (BYU, Ensign, Westminster), we do not have granular Zip Code data. Instead, we perform a **Regional Distribution**.
1. **Calculate Net Demand:** We start with the regional `Control Total` for the school and subtract the total number of Dorm Students found in Step 2.3.
2. **Distribute:** We spread the remaining students across **all** valid TAZs in the entire region. The spread is proportional to the weighted StreetLight trips, but unlike public schools, it is not constrained by Zip Code boundaries.
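The region-wide share computation is the same `groupby` + `transform` pattern used for public schools, just without the Zip key. A small sketch with invented trip volumes:

```python
import pandas as pd

trips = pd.DataFrame({
    "SA_TAZID": [1, 2, 3, 1, 3],
    "D_COLLEGE": ["BYU", "BYU", "BYU", "WESTMIN", "WESTMIN"],
    "weighted_trips": [70.0, 20.0, 10.0, 5.0, 15.0],
})
# One denominator per college, spanning the whole region
trips["region_total"] = trips.groupby("D_COLLEGE")["weighted_trips"].transform("sum")
trips["share"] = trips["weighted_trips"] / trips["region_total"]

# Shares sum to 1 within each college, so demand totals are preserved exactly
print(trips.groupby("D_COLLEGE")["share"].sum())
```

This is why the private-school validation below should match exactly: nothing is lost to missing Zip coverage.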
```{python}
# 1. Calculate Net Demand (Region Total - Total Dorms)
gq_by_college = taz_gq_totals.groupby("COLLEGE")["gq_student"].sum().reset_index()
private_demand = PRIVATE_CONTROLS.merge(gq_by_college, on="COLLEGE", how="left")
private_demand["gq_student"] = private_demand["gq_student"].fillna(0)
private_demand["net_demand"] = (
private_demand["Control_Total"] - private_demand["gq_student"]
)
```
```{python}
# 2. Distribute based on Region-wide Trip Shares
private_trips = sl_trips[sl_trips["D_COLLEGE"].isin(PRIVATE_CONTROLS["COLLEGE"])].copy()
# Here, sum across the entire region, not just a Zip
private_trips["region_total_weight"] = private_trips.groupby("D_COLLEGE")[
"weighted_trips"
].transform("sum")
private_trips["share"] = (
private_trips["weighted_trips"] / private_trips["region_total_weight"]
)
private_distribution = private_trips.merge(
private_demand, left_on="D_COLLEGE", right_on="COLLEGE", how="inner"
)
private_distribution["distributed_students"] = (
private_distribution["net_demand"] * private_distribution["share"]
)
private_distribution = private_distribution[
["SA_TAZID", "COLLEGE", "distributed_students"]
]
```
## Validation: Private Schools
Since the distribution is regional and not constrained by missing Zip data, the input and output totals should match exactly.
```{python}
#| eval: true
private_check = (
private_distribution.groupby("COLLEGE")["distributed_students"]
.sum()
.reset_index()
.rename(columns={"distributed_students": "Distributed_Total"})
)
private_target_demand = private_demand[["COLLEGE", "net_demand"]].rename(
columns={"net_demand": "Target_Net_Demand"}
)
private_check = private_check.merge(private_target_demand, on="COLLEGE", how="left")
print(private_check)
```
# Final Compilation & Export
In this final step, we synthesize the results.
1. **Combine:** We merge the distributed "Commuter" students (from both Public and Private workflows).
2. **Add Dorms:** We join the "Fixed" Group Quarter students (calculated in Step 2.3) back to their respective TAZs.
3. **Format:** We pivot the data into a "Wide" format (TAZ rows x College columns), creating the final matrix required for the travel demand model.
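The subtle step is the outer merge in (2): a TAZ can appear only on the commuter side, only on the dorm side, or on both, so the two ID columns must be coalesced afterwards. A toy sketch (IDs and counts invented):

```python
import pandas as pd

commuters = pd.DataFrame({"SA_TAZID": [1, 2], "COLLEGE": ["UVU_MAIN"] * 2,
                          "distributed_students": [120.0, 80.0]})
dorms = pd.DataFrame({"TAZID": [2, 3], "COLLEGE": ["UVU_MAIN"] * 2,
                      "gq_student": [50.0, 400.0]})

merged = commuters.merge(dorms, left_on=["SA_TAZID", "COLLEGE"],
                         right_on=["TAZID", "COLLEGE"], how="outer")
for col in ["distributed_students", "gq_student"]:
    merged[col] = merged[col].fillna(0)
# TAZ 3 exists only on the dorm side, so its SA_TAZID is NaN until coalesced
merged["SA_TAZID"] = merged["SA_TAZID"].fillna(merged["TAZID"])
merged["total_students"] = merged["distributed_students"] + merged["gq_student"]
print(merged[["SA_TAZID", "COLLEGE", "total_students"]])
```

TAZ 1 keeps its 120 commuters, TAZ 2 combines commuters and dorms (130), and TAZ 3 is dorm-only (400), which is exactly what the coalesce in the real code preserves.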
```{python}
# Combine Public and Private distributions
final_long_data = pd.concat(
[public_distribution, private_distribution], ignore_index=True
)
# Add back the fixed GQ (dorm) students to their specific TAZs
final_long_data = final_long_data.merge(
taz_gq_totals,
left_on=["SA_TAZID", "COLLEGE"],
right_on=["TAZID", "COLLEGE"],
how="outer",
)
final_long_data["distributed_students"] = final_long_data[
"distributed_students"
].fillna(0)
final_long_data["gq_student"] = final_long_data["gq_student"].fillna(0)
# Total Students = Commuters + Dorm Residents
final_long_data["total_students"] = (
final_long_data["distributed_students"] + final_long_data["gq_student"]
)
# Consolidate TAZID (coalesce keys from merge)
final_long_data["SA_TAZID"] = final_long_data["SA_TAZID"].fillna(
final_long_data["TAZID"]
)
```
```{python}
# Pivot to Wide Format (One column per college)
# First sum up in case of duplicates (rare, but good practice)
final_grouped = (
final_long_data.groupby(["SA_TAZID", "COLLEGE"])["total_students"]
.sum()
.round(0)
.reset_index()
.pivot(index="SA_TAZID", columns="COLLEGE", values="total_students")
)
# 1. Fill NaNs with 0 and convert to Integer
final_output = (
final_grouped.reindex(
columns=[
"ENSIGN",
"WESTMIN",
"WSU_MAIN",
"WSU_DAVIS",
"SLCC_SC",
"BYU",
"SLCC_MAIN",
"UVU_MAIN",
"UOFU_MAIN",
"SLCC_JD",
"SLCC_ML",
],
fill_value=0,
)
.fillna(0)
.astype(int)
)
# Explicitly cast the index to integer
final_output.index = final_output.index.astype(int)
final_output.index.name = ";Z"
# Preview the final table
print(final_output.head())
```
```{python}
# Export
# The index (TAZID, named ";Z") is written as the first CSV column
final_output.to_csv(PATHS["output_csv"])
```