---
title: "Forecasted College Enrollment"
subtitle: Forecasting college enrollment in the WF-TDM region to the horizon year.
description: |
This notebook forecasts college enrollment for Utah's major universities using a
**Weighted Participation Rate** method. The approach estimates each university's
share of its surrounding college-age population, fits a linear trend to that share,
applies a regional penetration ceiling as a control total, and multiplies by future
population projections from the Kem C. Gardner Policy Institute. Targeted
adjustments are made for capacity-constrained flagship institutions.
author:
- name: Pukar Bhandari
email: pukar.bhandari@wfrc.utah.gov
affiliation:
- name: Wasatch Front Regional Council
url: "https://wfrc.utah.gov/"
date: "`r Sys.Date()`"
---
# Methodology Overview
This forecast predicts future college enrollment for travel demand modeling using a hybrid demographic approach. The core methodology follows these steps:
1. **Catchment Weights (Addressable Markets):** We determine what percentage of each university's student body comes from specific counties to define its "Addressable Market."
2. **Weighted Populations:** We calculate the catchment population by multiplying county shares by historical county-level college-age populations (18-24).
3. **Base Participation Rates:** We calculate the historical ratio of enrolled students to this weighted population and forecast it forward using Linear Regression.
4. **Regional Control Totals:** We apply a ceiling to the overall regional penetration rate to prevent unrealistically high demographic capture.
5. **Flagship Override:** Because large flagship universities (BYU, UofU) draw heavily from outside the local region and are capacity-constrained, their forecasts are detached from local population trends and modeled directly against time.
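As a toy, end-to-end sketch of steps 1-4 (every number is invented; the hypothetical `County A` / `County B` values stand in for the real weights and populations computed later in this notebook):

```{python}
# Toy illustration of the weighted participation rate method.
# All numbers here are invented for demonstration only.

# Step 1: catchment weights -- share of a university's students from each county
weights = {"County A": 0.7, "County B": 0.3}

# Step 2: weighted population -- county 18-24 populations scaled by the weights
pop_18_24 = {"County A": 100_000, "County B": 50_000}
weighted_pop = sum(pop_18_24[c] * w for c, w in weights.items())  # 85,000

# Step 3: participation rate -- enrollment divided by the weighted population
enrollment = 17_000
rate = enrollment / weighted_pop  # 0.20

# Step 4: regional control total -- cap the projection if it implies a
# penetration rate above a historically grounded ceiling
ceiling_rate = 0.18
projected = rate * weighted_pop
capped = min(projected, ceiling_rate * weighted_pop)  # capped at ~15,300
```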
## Data Sources
* **Historical Demographics:** US Census Bureau API and IPUMS NHGIS datasets (Age range 18-24 matches GPI methodology).
* **Projected Demographics:** [Kem C. Gardner Policy Institute Population Projections](https://gardner.utah.edu/utah-demographics/population-projections/utah-2065-long-term-vintage-2025/)
* **Utah System of Higher Education (USHE):** [Institutional Data Resources](https://ushe.edu/institutional-data-resources-fte/)
* **Historical Enrollment (IPEDS):** Integrated Postsecondary Education Data System (2000-2023). Specific institutional profiles:
* [University of Utah (UofU)](https://nces.ed.gov/ipeds/reported-data/230764?year=2024&surveyNumber=15)
* [Brigham Young University (BYU)](https://nces.ed.gov/ipeds/reported-data/230038?year=2024&surveyNumber=15)
* [Utah Valley University (UVU)](https://nces.ed.gov/ipeds/reported-data/230737?year=2024&surveyNumber=15)
* [Salt Lake Community College (SLCC)](https://nces.ed.gov/ipeds/reported-data/230746?year=2024&surveyNumber=15)
* [Weber State University (WSU)](https://nces.ed.gov/ipeds/reported-data/230782?year=2024&surveyNumber=15)
* [Ensign College (ENSIGN)](https://nces.ed.gov/ipeds/reported-data/230418?year=2024&surveyNumber=15)
* [Westminster College (WESTMIN)](https://nces.ed.gov/ipeds/reported-data/216807?year=2024&surveyNumber=15)
---
# Configuration & Setup
## File Paths & Settings
```{python}
from pathlib import Path
import pandas as pd
import numpy as np
import geopandas as gpd
from pygris.data import get_census
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_percentage_error
from IPython.display import display, Markdown
import warnings
warnings.filterwarnings("ignore")
# Centralized path configuration for all data inputs and outputs
PATHS = {
# Cache files (intermediate data)
"census_cache": Path(
"intermediate/Utah_Census_All_Counties_All_Ages_2000_2023.csv"
),
# Input data files
"enrollment_ipeds": Path("intermediate/Fall_Enrollment_1997_2023.csv"),
"gpi_data": Path("data/GPI/SchColAgeFS-Dec2022.xlsx"),
"gpi_projections": Path(
"data/GPI/Gardner-Policy-Institute-State-and-County-Projections-2025-2065-Data.xlsx"
),
"zip_code": Path("data/UGRC ZIP/UtahZipCodeAreas_-1977770497264301863.gpkg"),
"enroll_by_zip": Path("intermediate/TotEnrolByZipByCollege.csv"),
"ushe_2025": Path("data/USHE 2025 Enrollment/WFRC Data Request_Final.xlsx"),
# NHGIS data
"nhgis_dir": Path("data/NHGIS"),
# Output paths
"model_diagnostics": Path("intermediate/model_diagnostics.csv"),
"forecast_output": Path("intermediate/enrollment_forecast_2024_2060.csv"),
"campus_forecast_output": Path("intermediate/enrollment_forecast_by_campus_2000_2060.csv"),
}
# WF-TDM region counties
TARGET_COUNTIES = [
"Box Elder County",
"Weber County",
"Davis County",
"Salt Lake County",
"Utah County",
]
# Age range for college-age population (18-24) — matches GPI methodology
COLLEGE_AGE_RANGES = ["18 and 19 years", "20 years", "21 years", "22 to 24 years"]
# Train/test split: hold out 2020-2023 for model validation
TEST_START_YEAR = 2020
# Global Toggles
CALIB_SHIFT = True # Toggle whether to shift model curves to meet 2023 actuals
```
# 1. Load Data
## Enrollment Data
### Utah System of Higher Education & IPEDS
**Sources:** USHE and Integrated Postsecondary Education Data System (2000-2023)
```{python}
# Load IPEDS enrollment data (2000-2023)
df_enroll_ipeds = pd.read_csv(PATHS['enrollment_ipeds'])
# Historical enrollment data from 2019 RTP (1990-1999)
df_enroll_rtp19 = pd.DataFrame(
{
"Year": [1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999],
"UofU": [24311, 25581, 25868, 25982, 25645, 26810, 26079, 25895, 25018, 25781],
"WSU": [13449, 14495, 14993, 14447, 14230, 13996, 13906, 14613, 13900, 14984],
"UVU": [7885, 8777, 9623, 10512, 13293, 14040, 14756, 15994, 18174, 20062],
"SLCC": [13344, 15972, 17766, 18311, 19440, 20834, 22394, 24307, 19754, 21273],
"BYU": [30465, 31101, 31108, 30547, 30413, 30465, 30563, 31249, 32202, 32731],
}
)
```
## College-Age Population (Historical 18-24)
### US Census Bureau
We define a crosswalk to extract historical 18-24 population cohorts from Census Summary Files and the American Community Survey.
```{python}
# --- 1. Define Variable Crosswalk (Metadata) ---
var_crosswalk = pd.DataFrame(
[
("Male", "Total", "P012002", "P12_002N", "B01001_002E"),
("Male", "Under 5 years", "P012003", "P12_003N", "B01001_003E"),
("Male", "5 to 9 years", "P012004", "P12_004N", "B01001_004E"),
("Male", "10 to 14 years", "P012005", "P12_005N", "B01001_005E"),
("Male", "15 to 17 years", "P012006", "P12_006N", "B01001_006E"),
("Male", "18 and 19 years", "P012007", "P12_007N", "B01001_007E"),
("Male", "20 years", "P012008", "P12_008N", "B01001_008E"),
("Male", "21 years", "P012009", "P12_009N", "B01001_009E"),
("Male", "22 to 24 years", "P012010", "P12_010N", "B01001_010E"),
("Male", "25 to 29 years", "P012011", "P12_011N", "B01001_011E"),
("Male", "30 to 34 years", "P012012", "P12_012N", "B01001_012E"),
("Male", "35 to 39 years", "P012013", "P12_013N", "B01001_013E"),
("Male", "40 to 44 years", "P012014", "P12_014N", "B01001_014E"),
("Male", "45 to 49 years", "P012015", "P12_015N", "B01001_015E"),
("Male", "50 to 54 years", "P012016", "P12_016N", "B01001_016E"),
("Male", "55 to 59 years", "P012017", "P12_017N", "B01001_017E"),
("Male", "60 and 61 years", "P012018", "P12_018N", "B01001_018E"),
("Male", "62 to 64 years", "P012019", "P12_019N", "B01001_019E"),
("Male", "65 and 66 years", "P012020", "P12_020N", "B01001_020E"),
("Male", "67 to 69 years", "P012021", "P12_021N", "B01001_021E"),
("Male", "70 to 74 years", "P012022", "P12_022N", "B01001_022E"),
("Male", "75 to 79 years", "P012023", "P12_023N", "B01001_023E"),
("Male", "80 to 84 years", "P012024", "P12_024N", "B01001_024E"),
("Male", "85 years and over", "P012025", "P12_025N", "B01001_025E"),
("Female", "Total", "P012026", "P12_026N", "B01001_026E"),
("Female", "Under 5 years", "P012027", "P12_027N", "B01001_027E"),
("Female", "5 to 9 years", "P012028", "P12_028N", "B01001_028E"),
("Female", "10 to 14 years", "P012029", "P12_029N", "B01001_029E"),
("Female", "15 to 17 years", "P012030", "P12_030N", "B01001_030E"),
("Female", "18 and 19 years", "P012031", "P12_031N", "B01001_031E"),
("Female", "20 years", "P012032", "P12_032N", "B01001_032E"),
("Female", "21 years", "P012033", "P12_033N", "B01001_033E"),
("Female", "22 to 24 years", "P012034", "P12_034N", "B01001_034E"),
("Female", "25 to 29 years", "P012035", "P12_035N", "B01001_035E"),
("Female", "30 to 34 years", "P012036", "P12_036N", "B01001_036E"),
("Female", "35 to 39 years", "P012037", "P12_037N", "B01001_037E"),
("Female", "40 to 44 years", "P012038", "P12_038N", "B01001_038E"),
("Female", "45 to 49 years", "P012039", "P12_039N", "B01001_039E"),
("Female", "50 to 54 years", "P012040", "P12_040N", "B01001_040E"),
("Female", "55 to 59 years", "P012041", "P12_041N", "B01001_041E"),
("Female", "60 and 61 years", "P012042", "P12_042N", "B01001_042E"),
("Female", "62 to 64 years", "P012043", "P12_043N", "B01001_043E"),
("Female", "65 and 66 years", "P012044", "P12_044N", "B01001_044E"),
("Female", "67 to 69 years", "P012045", "P12_045N", "B01001_045E"),
("Female", "70 to 74 years", "P012046", "P12_046N", "B01001_046E"),
("Female", "75 to 79 years", "P012047", "P12_047N", "B01001_047E"),
("Female", "80 to 84 years", "P012048", "P12_048N", "B01001_048E"),
("Female", "85 years and over", "P012049", "P12_049N", "B01001_049E"),
],
columns=["Gender", "Age", "SF1", "DHC", "ACS"],
)
var_crosswalk["Label"] = var_crosswalk["Gender"] + ": " + var_crosswalk["Age"]
```
```{python}
# --- 2. Fetch & Cache Census Data ---
if not PATHS["census_cache"].exists():
print(f"Fetching Census data to {PATHS['census_cache']}...")
decennial_years = [2000, 2010, 2020]
acs_years = [y for y in range(2009, 2024) if y not in decennial_years]
data_frames = []
for year in sorted(decennial_years + acs_years):
try:
if year in decennial_years:
dataset = "dec/dhc" if year == 2020 else "dec/sf1"
code_col = "DHC" if year == 2020 else "SF1"
else:
dataset, code_col = "acs/acs5", "ACS"
year_codes = var_crosswalk[code_col].tolist()
rename_map = dict(zip(year_codes, var_crosswalk["Label"]))
df = get_census(
dataset=dataset,
variables=["NAME"] + year_codes,
year=year,
params={"for": "county:*", "in": "state:49"},
return_geoid=True,
)
df = df.rename(columns={c: "NAME" for c in df.columns if c.upper() == "NAME"})
df = df.rename(columns=rename_map)
valid_cols = ["NAME"] + var_crosswalk["Label"].tolist()
df = df[df.columns.intersection(valid_cols)].copy()
df["Year"], df["Dataset"] = year, dataset
data_frames.append(df)
print(f"Fetched {year}")
except Exception as e:
print(f"Skipping {year}: {e}")
if data_frames:
PATHS["census_cache"].parent.mkdir(parents=True, exist_ok=True)
pd.concat(data_frames, ignore_index=True).to_csv(PATHS["census_cache"], index=False)
print("Census data cached successfully")
else:
print(f"Using cached Census data from {PATHS['census_cache']}")
```
```{python}
#| eval: false
#| echo: true
# Toggle ON (eval: true) to use Census API via pygris
# Toggle OFF (eval: false) if using NHGIS data instead
df_universe = pd.read_csv(PATHS['census_cache'])
df_universe["Geography"] = df_universe["NAME"].str.split(",").str[0]
df_wfrc = df_universe[df_universe["Geography"].isin(TARGET_COUNTIES)].copy()
target_cols = var_crosswalk.loc[var_crosswalk["Age"].isin(COLLEGE_AGE_RANGES), "Label"].tolist()
df_wfrc["Population_18_24"] = (
df_wfrc[target_cols].apply(pd.to_numeric, errors="coerce").fillna(0).sum(axis=1)
)
df_college_age_hist = (
    df_wfrc.pivot_table(
        index="Year", columns="Geography", values="Population_18_24", aggfunc="sum"
    )
    .rename_axis(None, axis=1)  # keep Year as the index, matching the NHGIS path
)
```
### Timeseries Data from NHGIS
```{python}
#| eval: false
#| echo: true
# Toggle ON (eval: true) to fetch from IPUMS NHGIS API
# Requires IPUMS_API_KEY environment variable in .env
import os, zipfile
from dotenv import load_dotenv
from ipumspy import IpumsApiClient, AggregateDataExtract, TimeSeriesTable
load_dotenv()
IPUMS_API_KEY = os.getenv("IPUMS_API_KEY")
if IPUMS_API_KEY:
client = IpumsApiClient(IPUMS_API_KEY)
ts_table = TimeSeriesTable(
name="B57",
geog_levels=["county"],
years=["1970","1980","1990","2000","2010","2020",
"2006-2010","2007-2011","2008-2012","2009-2013","2010-2014",
"2011-2015","2012-2016","2013-2017","2014-2018","2015-2019",
"2016-2020","2017-2021","2018-2022","2019-2023","2020-2024"],
)
extract = AggregateDataExtract(
collection="nhgis",
description="Utah County College Age Time Series",
time_series_tables=[ts_table],
geographic_extents=["490"],
tst_layout="time_by_column_layout",
)
print("Submitting extract to IPUMS NHGIS...")
client.submit_extract(extract)
print(f"Extract submitted. ID: {extract.extract_id}")
client.wait_for_extract(extract)
download_dir = PATHS['nhgis_dir']
download_dir.mkdir(parents=True, exist_ok=True)
client.download_extract(extract, download_dir=download_dir)
zip_path = download_dir / f"nhgis{extract.extract_id:04d}_csv.zip"
with zipfile.ZipFile(zip_path, "r") as zf:
zf.extractall(download_dir / f"nhgis{extract.extract_id:04d}_csv")
print("NHGIS Fetch Complete!")
else:
print("Error: IPUMS_API_KEY not found in .env file.")
```
```{python}
#| eval: true
#| echo: true
# Load and process NHGIS time-series data
# Works with both manually downloaded and API-fetched NHGIS extracts
# 1. Locate latest NHGIS file
nhgis_files = sorted(PATHS['nhgis_dir'].rglob("*_ts_nominal_county.csv"))
nhgis_path = (
nhgis_files[-1] if nhgis_files
else PATHS['nhgis_dir'] / "nhgis0002_csv/nhgis0002_ts_nominal_county.csv"
)
df_nhgis = pd.read_csv(nhgis_path)
df_ut = df_nhgis[df_nhgis["STATE"] == "Utah"].copy()
# 2. Select 18-24 age columns (B57AE–B57AH = 18–19, 20, 21, 22–24)
target_prefixes = ["B57AE", "B57AF", "B57AG", "B57AH"]
value_cols = [
c for c in df_ut.columns
if any(c.startswith(p) for p in target_prefixes) and not c.endswith("M")
]
for col in value_cols:
df_ut[col] = pd.to_numeric(df_ut[col], errors="coerce")
# 3. Melt to long format and parse year
df_long = df_ut.melt(
id_vars=["COUNTY"], value_vars=value_cols,
var_name="Variable", value_name="Population"
)
df_long["Suffix"] = df_long["Variable"].str[5:]
df_long["Year"] = df_long["Suffix"].apply(
lambda x: int(x) if len(x) == 4 else 2000 + int(x[:2])
)
df_long["Type"] = df_long["Suffix"].apply(
lambda x: "Decennial" if len(x) == 4 else "ACS"
)
# 4. Prefer Decennial over ACS for 2010 and 2020
mask_drop = (
((df_long["Year"] == 2010) & (df_long["Type"] == "ACS")) |
((df_long["Year"] == 2020) & (df_long["Type"] == "ACS"))
)
df_clean = df_long[~mask_drop].copy()
# 5. Aggregate by county and year
df_agg = df_clean.groupby(["COUNTY", "Year"])["Population"].sum().reset_index()
df_agg["Geography"] = df_agg["COUNTY"].apply(
lambda x: x if "County" in str(x) else str(x) + " County"
)
# 6. Pivot to wide format (counties as columns)
df_college_age_hist = (
df_agg[df_agg["Geography"].isin(TARGET_COUNTIES)]
.pivot(index="Year", columns="Geography", values="Population")
.sort_index()
)
df_college_age_hist.columns.name = None
```
### Forecast: Kem C. Gardner Policy Institute (2025-2060)
The Kem C. Gardner Policy Institute (GPI) provides the official projected demographic counts for Utah counties.
```{python}
# Primary forecast data: county-level GPI projections (2025–2065)
df_gpi_projections = (
pd.read_excel(
PATHS['gpi_projections'],
sheet_name="Demographic Detail",
header=3,
nrows=1230,
)
.query("Geography in @TARGET_COUNTIES")[["Geography", "Year", "Population Ages 18-24"]]
)
df_gpi_wide = (
df_gpi_projections
.pivot(index="Year", columns="Geography", values="Population Ages 18-24")
.rename_axis(None, axis=1)
.sort_index()
)
# Backcast 2023–2024 using the average annual delta over the first 3 GPI years
avg_annual_change = df_gpi_wide.loc[2025:2027].diff().mean()
gpi_2024 = df_gpi_wide.loc[2025] - avg_annual_change
gpi_2023 = df_gpi_wide.loc[2025] - 2 * avg_annual_change
df_gpi_ext = pd.concat([
pd.DataFrame(gpi_2023).T.rename(index={0: 2023}),
pd.DataFrame(gpi_2024).T.rename(index={0: 2024}),
df_gpi_wide,
]).sort_index()
```
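The backcasting arithmetic in the cell above can be checked with toy numbers (`proj` is an invented stand-in for one county's GPI series):

```{python}
# Minimal check of the backcast logic: extend a projection series backward
# by its average annual change over the first three projected years.
import pandas as pd

proj = pd.Series({2025: 1000.0, 2026: 1020.0, 2027: 1040.0})
avg_delta = proj.loc[2025:2027].diff().mean()   # 20.0
backcast_2024 = proj.loc[2025] - avg_delta       # 980.0
backcast_2023 = proj.loc[2025] - 2 * avg_delta   # 960.0
```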
---
# Data Exploration
## College Age (18–24) Population Trend
```{python}
# Melt both sources to long for plotting
df_hist_long = (
df_college_age_hist.reset_index()
.melt(id_vars=["Year"], var_name="County", value_name="Population")
)
df_hist_long["Type"] = "Historical"
df_proj_long = df_gpi_projections.rename(
columns={"Geography": "County", "Population Ages 18-24": "Population"}
).copy()
df_proj_long["Type"] = "Projected"
df_plot_combined = pd.concat([df_hist_long, df_proj_long], ignore_index=True)
ojs_define(data_pop_trend=df_plot_combined.to_dict(orient="records"))
```
```{ojs}
//| label: fig-college-age-trend
//| fig-cap: "Historical and Projected College-Age Population (18-24) by County"
Plot.plot({
marks: [
Plot.ruleY([0]),
Plot.line(data_pop_trend, {
filter: d => d.Type === "Historical",
x: "Year", y: "Population", z: "County",
stroke: "County", strokeWidth: 3
}),
Plot.line(data_pop_trend, {
filter: d => d.Type === "Projected",
x: "Year", y: "Population", z: "County",
stroke: "County", strokeWidth: 3, strokeDasharray: "5,5"
}),
Plot.text(data_pop_trend, Plot.selectLast({
x: "Year", y: "Population", z: "County",
text: "County", textAnchor: "start", dx: 8, fill: "var(--bs-body-color)"
})),
Plot.tip(data_pop_trend, Plot.pointerX({
x: "Year", y: "Population",
title: d => `${d.County}\nYear: ${d.Year}\nPop: ${d.Population.toLocaleString()}\n(${d.Type})`,
fill: "var(--bs-body-bg)", stroke: "var(--bs-body-color)"
}))
],
width: 780, marginLeft: 50, marginRight: 110,
color: { legend: true },
y: { grid: true, label: "Population (18-24)" },
x: { tickFormat: "d" },
style: {
backgroundColor: "transparent", color: "var(--bs-body-color)",
fontFamily: "sans-serif", fontSize: "12px"
}
})
```
---
# Forecast Enrollment
## Enrollment by College and County (Dynamic Catchment Generation)
We calculate dynamic base percentages from localized USHE zip-code enrollment patterns for the public institutions (UofU, UVU, WSU, SLCC). For private institutions absent from the USHE dataset, we apply assumed market footprints.
## Step 1 — Define Addressable Markets (Catchment Weights)
We programmatically extract the USHE-derived weights for the public institutions and combine them with explicitly assumed weights for the private institutions.
```{python}
TARGET_COLLEGES = ["UOFU", "BYU", "UVU", "SLCC", "WSU", "ENSIGN", "WESTMIN"]
ANALYSIS_COUNTIES = ["Box Elder", "Weber", "Davis", "Salt Lake", "Utah"]
# Utah Counties Mapping
utah_counties = sorted(
[
"Beaver",
"Box Elder",
"Cache",
"Carbon",
"Daggett",
"Davis",
"Duchesne",
"Emery",
"Garfield",
"Grand",
"Iron",
"Juab",
"Kane",
"Millard",
"Morgan",
"Piute",
"Rich",
"Salt Lake",
"San Juan",
"Sanpete",
"Sevier",
"Summit",
"Tooele",
"Uintah",
"Utah",
"Wasatch",
"Washington",
"Wayne",
"Weber",
]
)
county_map = {i + 1: name for i, name in enumerate(utah_counties)}
# Load Zip Data
df_enroll = pd.read_csv(PATHS["enroll_by_zip"])
gdf_zip = gpd.read_file(PATHS["zip_code"])
# Process
df_enroll["College_Clean"] = df_enroll["COLLEGE"].str.split("_").str[0]
df_enroll = df_enroll[df_enroll["College_Clean"].isin(TARGET_COLLEGES)].copy()
df_enroll["ZIP5"] = df_enroll["ZIPCODE"].astype(str).str.zfill(5)
gdf_zip["ZIP5"] = gdf_zip["ZIP5"].astype(str).str.zfill(5)
df_merged = df_enroll.merge(gdf_zip[["ZIP5", "COUNTYNBR"]], on="ZIP5", how="left")
df_merged["County"] = pd.to_numeric(df_merged["COUNTYNBR"], errors="coerce").map(
county_map
)
df_merged = df_merged[df_merged["County"].isin(ANALYSIS_COUNTIES)]
# Summary & Percentages
df_summary = (
df_merged.groupby(["County", "College_Clean"])["TotEnrol"]
.sum()
.unstack(fill_value=0)
.reindex(columns=TARGET_COLLEGES, fill_value=0)
)
df_percent = df_summary.div(df_summary.sum(axis=0), axis=1)
CATCHMENT_WEIGHTS = {
# Private institutions missing from USHE zip dataset
"BYU": {
"Box Elder County": 0.000,
"Davis County": 0.025,
"Weber County": 0.025,
"Salt Lake County": 0.20,
"Utah County": 0.75,
},
"ENSIGN": {
"Box Elder County": 0.000,
"Weber County": 0.000,
"Davis County": 0.15,
"Salt Lake County": 0.70,
"Utah County": 0.15,
},
"WESTMIN": {
"Box Elder County": 0.000,
"Weber County": 0.000,
"Davis County": 0.065,
"Salt Lake County": 0.90,
"Utah County": 0.035,
},
}
# Pull Dynamic Weights for Public Universities
zip_to_ipeds = {"UOFU": "UofU", "UVU": "UVU", "WSU": "WSU", "SLCC": "SLCC"}
for zip_key, ipeds_key in zip_to_ipeds.items():
if zip_key in df_percent.columns:
weights = df_percent[zip_key].to_dict()
CATCHMENT_WEIGHTS[ipeds_key] = {f"{k} County": v for k, v in weights.items()}
DISPLAY_NAMES = {
"UofU": "University of Utah",
"UVU": "Utah Valley University",
"WSU": "Weber State University",
"SLCC": "Salt Lake Community College",
"BYU": "Brigham Young University",
"ENSIGN": "Ensign College",
"WESTMIN": "Westminster College",
}
SCHOOL_KEYS = list(CATCHMENT_WEIGHTS.keys())
display(Markdown("**Calculated Catchment Weights (Addressable Market Shares):**"))
display(pd.DataFrame(CATCHMENT_WEIGHTS).fillna(0).style.format("{:.1%}"))
```
## Step 2 — Weighted Populations
We multiply each county's 18-24 population by its catchment weight and sum across counties to obtain each university's **Weighted Population**.
```{python}
# Interpolate Census data to fill 2001–2009 gap
df_census_full = df_college_age_hist.reindex(range(2000, 2024)).interpolate(
method="linear"
)
# Compute Historical & Projected Weighted Populations
df_weighted_hist = pd.DataFrame(index=range(2000, 2024))
df_weighted_proj = pd.DataFrame(index=df_gpi_ext.index)
for key, weights in CATCHMENT_WEIGHTS.items():
df_weighted_hist[key] = sum(
df_census_full[county] * weight
for county, weight in weights.items()
if county in df_census_full.columns
)
df_weighted_proj[key] = sum(
df_gpi_ext[county] * weight
for county, weight in weights.items()
if county in df_gpi_ext.columns
)
display(Markdown("**Historical Weighted Populations (Recent Years):**"))
display(df_weighted_hist.tail().astype(int))
display(Markdown("**Projected Weighted Populations (First 5 Years):**"))
display(df_weighted_proj.head().astype(int))
```
## Step 3 — Historical Participation Rates
We define the **Participation Rate** as:
$$\text{Rate}_{year} = \frac{\text{Actual Enrollment}_{year}}{\text{Weighted Population}_{year}}$$
```{python}
# Compute historic participation rates
df_ipeds = df_enroll_ipeds.set_index("Year").sort_index()
df_hist_rates = pd.DataFrame(index=range(2000, 2024))
for key in SCHOOL_KEYS:
df_hist_rates[key] = (
df_ipeds[key].reindex(df_hist_rates.index) / df_weighted_hist[key]
)
display(Markdown("**Historical Participation Rates (Recent Years):**"))
display(df_hist_rates.tail().style.format("{:.2%}"))
```
## Step 4 — Linear Regression Model
We fit a linear trend to each university's historical participation rate:
$$\text{Rate}(t) = \alpha + \beta \cdot t, \quad t = \text{Year} - 1999$$
The model is trained on 2000–`{python} (TEST_START_YEAR - 1)` and evaluated on `{python} TEST_START_YEAR`–2023.
```{python}
models_dict = {}
model_metrics = []
residual_records = []
t_2023 = 2023 - 1999
for key in SCHOOL_KEYS:
full_name = DISPLAY_NAMES[key]
rates_series = df_hist_rates[key].dropna()
years_arr = rates_series.index.values.astype(float)
rates_arr = rates_series.values
t_arr = years_arr - 1999
X = t_arr.reshape(-1, 1)
train_mask = years_arr < TEST_START_YEAR
test_mask = ~train_mask
years_train = years_arr[train_mask].astype(int)
years_test = years_arr[test_mask].astype(int)
rates_train = rates_arr[train_mask]
rates_test = rates_arr[test_mask]
X_train = X[train_mask]
X_test = X[test_mask]
model = LinearRegression().fit(X_train, rates_train)
models_dict[key] = model
alpha = float(model.intercept_)
beta = float(model.coef_[0])
pred_train = model.predict(X_train)
pred_test = model.predict(X_test) if test_mask.any() else None
# Train Metrics
pop_train = df_weighted_hist.loc[years_train, key].values
enroll_act_train = df_ipeds.loc[years_train, key].values
enroll_pred_train = pred_train * pop_train
r2_train = r2_score(rates_train, pred_train)
mape_train = mean_absolute_percentage_error(enroll_act_train, enroll_pred_train)
model_metrics.append({
"University": full_name, "Dataset": "Train", "Method": "Base Linear",
"Alpha": round(alpha, 4), "Beta": round(beta, 4), "R2": round(r2_train, 4), "MAPE": round(mape_train, 4),
})
# Collect Residuals (Train)
for yr, act, pred in zip(years_train, enroll_act_train, enroll_pred_train):
residual_records.append({
"Year": int(yr), "University": full_name, "Actual": float(act), "Predicted": float(pred),
"Residual": float(act - pred), "APE": float(abs(act - pred)/act) if act != 0 else 0,
"Method": "Base Linear", "Dataset": "Train"
})
# Test Metrics
if test_mask.any():
pop_test = df_weighted_hist.loc[years_test, key].values
enroll_act_test = df_ipeds.loc[years_test, key].values
enroll_pred_test = pred_test * pop_test
r2_test = r2_score(rates_test, pred_test)
mape_test = mean_absolute_percentage_error(enroll_act_test, enroll_pred_test)
model_metrics.append({
"University": full_name, "Dataset": "Test", "Method": "Base Linear",
"Alpha": round(alpha, 4), "Beta": round(beta, 4), "R2": round(r2_test, 4), "MAPE": round(mape_test, 4),
})
# Collect Residuals (Test)
for yr, act, pred in zip(years_test, enroll_act_test, enroll_pred_test):
residual_records.append({
"Year": int(yr), "University": full_name, "Actual": float(act), "Predicted": float(pred),
"Residual": float(act - pred), "APE": float(abs(act - pred)/act) if act != 0 else 0,
"Method": "Base Linear", "Dataset": "Test"
})
```
## Step 5 — Calibration Shift Calculation
We compute a per-school additive shift, in rate units, so that the model passes exactly through the 2023 actual enrollment when `CALIB_SHIFT = True`.
```{python}
shift_factors = {}
for key in SCHOOL_KEYS:
model_rate_2023 = float(models_dict[key].predict(np.array([[t_2023]]))[0])
actual_enroll_2023 = float(df_ipeds.loc[2023, key])
gpi_pop_2023 = float(df_weighted_proj.loc[2023, key])
implied_gpi_rate_2023 = actual_enroll_2023 / gpi_pop_2023
shift = (implied_gpi_rate_2023 - model_rate_2023) if CALIB_SHIFT else 0.0
shift_factors[key] = shift
display(Markdown("**Calibration Shifts (Base Linear Models):**"))
display(
pd.DataFrame.from_dict(
shift_factors, orient="index", columns=["Shift"]
).style.format("{:.4f}")
)
```
## Step 6 & 7 — Final Forecasts & Regional Control Totals
**Why a Control Total?** Left unchecked, applying straight-line growth to growing populations could imply that an impossibly high share of the region attends college. We set a `PENETRATION_CEILING` slightly above the historical peak regional rate and proportionally scale down each college's projection whenever total projected regional enrollment would exceed this ceiling.
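The scale-down mechanics in isolation (invented numbers; `raw` plays the role of the per-school unconstrained projections in the next cell):

```{python}
# Toy illustration of the proportional scale-down under the ceiling.
raw = {"A": 30_000, "B": 20_000}   # unconstrained per-school projections
ceiling_enroll = 40_000            # PENETRATION_CEILING x regional 18-24 pop
total_raw = sum(raw.values())      # 50,000
scale = min(1.0, ceiling_enroll / total_raw)   # 0.8
capped = {k: v * scale for k, v in raw.items()}
# Relative shares are preserved: A still accounts for 60% of the capped total
```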
```{python}
total_enroll_hist = df_ipeds[SCHOOL_KEYS].sum(axis=1).loc[2000:2023]
total_pop_hist_5cty = df_census_full[TARGET_COUNTIES].sum(axis=1).loc[2000:2023]
regional_rate_hist = (total_enroll_hist / total_pop_hist_5cty).dropna()
HIST_MAX_PENETRATION = float(regional_rate_hist.max())
PENETRATION_CEILING = HIST_MAX_PENETRATION * 1.02 # 2% buffer above historical peak
FORECAST_YEARS = list(range(2024, 2061))
df_total_gpi_pop = df_gpi_ext[TARGET_COUNTIES].sum(axis=1)
forecast_records = []
regional_rate_proj = []
for year in FORECAST_YEARS:
t = year - 1999
total_gpi_pop = float(df_total_gpi_pop.loc[year])
ceiling_enroll = PENETRATION_CEILING * total_gpi_pop
raw = {}
for key in SCHOOL_KEYS:
model_rate = float(models_dict[key].predict(np.array([[t]]))[0])
# Decay calibration shift to zero over 10 years for WESTMIN only
if key == "WESTMIN":
decay_factor = max(0.0, 1.0 - (year - 2023) / 10.0) if CALIB_SHIFT else 0.0
active_shift = shift_factors[key] * decay_factor
else:
active_shift = shift_factors[key]
gpi_pop = float(df_weighted_proj.loc[year, key])
raw[key] = max(0.0, (model_rate + active_shift) * gpi_pop)
total_raw = sum(raw.values())
proj_rate = total_raw / total_gpi_pop
if total_raw > ceiling_enroll:
scale_factor = ceiling_enroll / total_raw
capped = True
else:
scale_factor = 1.0
capped = False
regional_rate_proj.append({
"Year": year, "Rate": proj_rate * scale_factor,
"Type": "Projected", "Capped": capped,
})
for key, raw_enroll in raw.items():
forecast_records.append({
"Year": year, "Univ_Code": key,
"University": DISPLAY_NAMES[key],
"Enrollment": int(round(raw_enroll * scale_factor)),
"Type": "Projected",
})
df_forecast_base = pd.DataFrame(forecast_records)
```
---
# Flagship Override Adjustment
BYU and the University of Utah draw students nationally and internationally, making their enrollment largely independent of local Wasatch Front demographics. Tying their forecast to the regional college-age population via participation rates produces unrealistic trajectories as local cohort sizes fluctuate.
Instead, we model each flagship's enrollment directly as a function of time:
$$ \text{Enrollment}(t) = \alpha + \beta \cdot t, \quad t = \text{Year} - 1999 $$
This treats their historical enrollment trend as the best predictor of future enrollment, bypassing catchment weights, weighted populations, and participation-rate mechanics entirely. The resulting forecasts are then substituted back into the base linear forecast for BYU and UofU only; all other universities retain their original participation-rate projections.
```{python}
FLAGSHIP_KEYS = ["BYU", "UofU"]
flagship_models = {}
flagship_metrics = []
flagship_shifts = {}
for key in FLAGSHIP_KEYS:
full_name = DISPLAY_NAMES[key]
enroll_series = df_ipeds[key].dropna()
years_arr = enroll_series.index.values.astype(float)
enroll_arr = enroll_series.values
t_arr = years_arr - 1999
X = t_arr.reshape(-1, 1)
train_mask = years_arr < TEST_START_YEAR
test_mask = ~train_mask
years_train = years_arr[train_mask].astype(int)
years_test = years_arr[test_mask].astype(int)
enroll_train = enroll_arr[train_mask]
enroll_test = enroll_arr[test_mask]
X_train = X[train_mask]
X_test = X[test_mask]
model = LinearRegression().fit(X_train, enroll_train)
flagship_models[key] = model
alpha = float(model.intercept_)
beta = float(model.coef_[0])
pred_train = model.predict(X_train)
pred_test = model.predict(X_test) if test_mask.any() else None
# Flagship Train Metrics
r2_train = r2_score(enroll_train, pred_train)
mape_train = mean_absolute_percentage_error(enroll_train, pred_train)
flagship_metrics.append(
{
"University": full_name,
"Dataset": "Train",
"Method": "Direct Flagship Override",
"Alpha": round(alpha, 1),
"Beta": round(beta, 1),
"R2": round(r2_train, 4),
"MAPE": round(mape_train, 4),
}
)
# Collect Residuals (Train)
for yr, act, pred in zip(years_train, enroll_train, pred_train):
residual_records.append(
{
"Year": int(yr),
"University": full_name,
"Actual": float(act),
"Predicted": float(pred),
"Residual": float(act - pred),
"APE": float(abs(act - pred) / act) if act != 0 else 0,
"Method": "Direct Flagship Override",
"Dataset": "Train",
}
)
# Flagship Test Metrics
if test_mask.any():
r2_test = r2_score(enroll_test, pred_test)
mape_test = mean_absolute_percentage_error(enroll_test, pred_test)
flagship_metrics.append(
{
"University": full_name,
"Dataset": "Test",
"Method": "Direct Flagship Override",
"Alpha": round(alpha, 1),
"Beta": round(beta, 1),
"R2": round(r2_test, 4),
"MAPE": round(mape_test, 4),
}
)
# Collect Residuals (Test)
for yr, act, pred in zip(years_test, enroll_test, pred_test):
residual_records.append(
{
"Year": int(yr),
"University": full_name,
"Actual": float(act),
"Predicted": float(pred),
"Residual": float(act - pred),
"APE": float(abs(act - pred) / act) if act != 0 else 0,
"Method": "Direct Flagship Override",
"Dataset": "Test",
}
)
# Calibration Shift for Override (in enrollment units)
model_enroll_2023 = float(model.predict(np.array([[t_2023]]))[0])
actual_enroll_2023 = float(df_ipeds.loc[2023, key])
flagship_shifts[key] = (
(actual_enroll_2023 - model_enroll_2023) if CALIB_SHIFT else 0.0
)
display(Markdown("**Calibration Shifts (Flagship Direct Models - Students):**"))
display(
pd.DataFrame.from_dict(
flagship_shifts, orient="index", columns=["Enrollment Shift"]
).astype(int)
)
# Generate Override Forecast
flagship_override_records = []
for year in FORECAST_YEARS:
t = year - 1999
for key in FLAGSHIP_KEYS:
raw_enroll = float(flagship_models[key].predict(np.array([[t]]))[0])
final_enroll = max(0, int(round(raw_enroll + flagship_shifts[key])))
flagship_override_records.append(
{
"Year": year,
"Univ_Code": key,
"University": DISPLAY_NAMES[key],
"Enrollment": final_enroll,
"Type": "Projected",
}
)
df_flagship_override = pd.DataFrame(flagship_override_records)
# Drop original BYU/UofU projected rows and substitute flagship overrides
df_forecast_adjusted = pd.concat(
[
df_forecast_base[~df_forecast_base["Univ_Code"].isin(FLAGSHIP_KEYS)],
df_flagship_override,
],
ignore_index=True,
).sort_values(["Univ_Code", "Year"])
```
---
# Policy Adjustments
## BYU Medical School (2027-2030)
**Background:** Brigham Young University plans to open a medical school starting in 2027. Each entering class is expected to number roughly 100 students.
**Enrollment Impact:**
- **2027:** +100 students (first year class)
- **2028:** +100 students (second year class added; total +200)
- **2029:** +100 students (third year class added; total +300)
- **2030:** +100 students (fourth year class added; total +400)
- **2031 onwards:** Steady state at +400 students (first cohort graduates as new cohort enters)
This adjustment is applied to BYU's flagship override forecast to reflect the planned expansion.
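The cumulative ramp-up reduces to a one-line formula, sketched here in isolation as a sanity check (the constants mirror those defined in the chunk below):

```python
MED_START, MED_ANNUAL, MED_LENGTH = 2027, 100, 4  # start year, cohort size, program length

def med_school_adjustment(year: int) -> int:
    """Added enrollment from the medical school in a given year."""
    if year < MED_START:
        return 0
    # Cohorts simultaneously on campus, capped at program length
    cohorts = min(year - MED_START + 1, MED_LENGTH)
    return cohorts * MED_ANNUAL

# Ramp-up 2027-2030 adds one cohort per year; steady state of +400 from 2031 on
print([med_school_adjustment(y) for y in range(2026, 2033)])
# → [0, 100, 200, 300, 400, 400, 400]
```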
```{python}
# BYU Medical School Enrollment Adjustment
MEDICAL_SCHOOL_START_YEAR = 2027
MEDICAL_SCHOOL_ANNUAL_ENROLLMENT = 100
MEDICAL_SCHOOL_PROGRAM_LENGTH = 4 # MD program is 4 years
# Apply cumulative enrollment adjustment
for year in FORECAST_YEARS:
if year >= MEDICAL_SCHOOL_START_YEAR:
# Calculate cumulative enrollment (capped at steady-state)
years_since_start = year - MEDICAL_SCHOOL_START_YEAR
cohorts_enrolled = min(years_since_start + 1, MEDICAL_SCHOOL_PROGRAM_LENGTH)
medical_enrollment = cohorts_enrolled * MEDICAL_SCHOOL_ANNUAL_ENROLLMENT
# Apply adjustment to BYU forecast
mask = (df_forecast_adjusted['Univ_Code'] == 'BYU') & \
(df_forecast_adjusted['Year'] == year)
df_forecast_adjusted.loc[mask, 'Enrollment'] += medical_enrollment
# Log the adjustment
print(f"Applied BYU Medical School adjustment:")
print(f" - Annual enrollment: {MEDICAL_SCHOOL_ANNUAL_ENROLLMENT} students/year")
print(f" - Start year: {MEDICAL_SCHOOL_START_YEAR}")
print(f" - Ramp-up period: {MEDICAL_SCHOOL_PROGRAM_LENGTH} years (2027-2030)")
print(f" - Steady-state enrollment: {MEDICAL_SCHOOL_PROGRAM_LENGTH * MEDICAL_SCHOOL_ANNUAL_ENROLLMENT} students (2031+)")
```
---
# Campus-Level Disaggregation (2000-2060)
For Home Based College (HBC) travel modeling, we disaggregate institution-level forecasts to specific campus locations across the full time series (2000-2060).
**Methodology:**
- **Non-UVU institutions**: Apply 2025 USHE campus splits to all years (2000-2060)
- **UVU campuses**: Apply time-varying proportions from historical spatial planning data
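The two split modes differ only in whether the campus share is constant over time. A minimal sketch with made-up numbers (the shares and totals here are illustrative, not the actual USHE values):

```python
# Fixed split (non-UVU): one 2025 share per campus, applied to every year
fixed_shares = {"WSU_Main": 0.80, "WSU_Davis": 0.20}  # illustrative shares
total_2030 = 25_000                                   # illustrative institution total
fixed = {campus: round(total_2030 * share) for campus, share in fixed_shares.items()}
print(fixed)  # → {'WSU_Main': 20000, 'WSU_Davis': 5000}

# Time-varying split (UVU): a different proportion vector for each year
uvu_props = {2030: {"UVU_Main": 0.95, "UVU_Lehi": 0.05},
             2050: {"UVU_Main": 0.90, "UVU_Lehi": 0.10}}  # illustrative proportions
uvu_total = {2030: 55_000, 2050: 70_000}                  # illustrative totals
varying = {yr: {campus: round(uvu_total[yr] * p) for campus, p in props.items()}
           for yr, props in uvu_props.items()}
```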
## Step 1: Load USHE 2025 Campus Enrollment Data
```{python}
# Load USHE 2025 enrollment data
df_ushe_2025 = pd.read_excel(
PATHS["ushe_2025"],
sheet_name="Sheet1"
)
print(f"Total USHE records loaded: {len(df_ushe_2025)}")
print(f"\nUnique terms: {sorted(df_ushe_2025['Term'].unique())}")
print(f"Unique years: {sorted(df_ushe_2025['Academic Year'].unique())}")
# Filter to Academic Year 2025, Term 2 (Fall semester)
df_ushe_fall_2025 = df_ushe_2025[
(df_ushe_2025['Academic Year'] == 2025) &
(df_ushe_2025['Term'] == 2)
].copy()
print(f"\nFiltered to Fall 2025: {len(df_ushe_fall_2025)} records")
# Aggregate headcount by Institution and Site (sum across all ZIP codes)
df_campus_enrollment = df_ushe_fall_2025.groupby(['Institution', 'Site'])['Headcount'].sum().reset_index()
print(f"Unique campuses: {len(df_campus_enrollment)}")
# Show totals by institution
print("\n=== Total Enrollment by Institution (Fall 2025) ===")
inst_totals = df_campus_enrollment.groupby('Institution')['Headcount'].sum().sort_values(ascending=False)
for inst, count in inst_totals.items():
print(f"{inst}: {count:,}")
```
## Step 2: Define Campus Mapping (Non-UVU Institutions)
```{python}
# Map USHE Site names to campus codes for non-UVU institutions
CAMPUS_MAPPING = {
'Salt Lake Community College': {
'Main Campus': 'SLCC_MAIN',
'SLCC - Jordan Campus': 'SLCC_JD',
'SLCC - Larry H. Miller Campus': 'SLCC_ML',
'SLCC - South City Center': 'SLCC_SC',
},
'University Of Utah': {
'Main Campus': 'UOFU_Main',
},
'Weber State University': {
'Main Campus': 'WSU_Main',
'WSU - Davis Center': 'WSU_Davis',
}
}
# Single-campus institutions (no disaggregation needed)
SINGLE_CAMPUS_CODES = ['BYU', 'ENSIGN', 'WESTMIN']
# Map institution codes to USHE names
INSTITUTION_NAME_MAP = {
'SLCC': 'Salt Lake Community College',
'UofU': 'University Of Utah',
'WSU': 'Weber State University',
}
```
## Step 3: Calculate Campus Percentages (Non-UVU)
```{python}
# Calculate percentage splits for non-UVU institutions
campus_percentages = {}
print("\n=== Calculating Campus Percentages (Non-UVU) ===")
for institution, campus_map in CAMPUS_MAPPING.items():
print(f"\n{institution}:")
inst_data = df_campus_enrollment[df_campus_enrollment['Institution'] == institution]
if len(inst_data) == 0:
print(f" WARNING: No data found")
continue
# Get enrollment for mapped campuses only
mapped_campuses = list(campus_map.keys())
mapped_data = inst_data[inst_data['Site'].isin(mapped_campuses)]
total_mapped = mapped_data['Headcount'].sum()
print(f" Total mapped enrollment: {total_mapped:,}")
if total_mapped > 0:
for site, campus_code in campus_map.items():
site_enrollment = mapped_data[mapped_data['Site'] == site]['Headcount'].sum()
percentage = site_enrollment / total_mapped if total_mapped > 0 else 0
campus_percentages[campus_code] = percentage
print(f" {campus_code}: {site_enrollment:,} students ({percentage:.1%})")
print("\n" + "="*60)
print("Final Non-UVU Campus Split Percentages:")
for campus, pct in sorted(campus_percentages.items()):
print(f" {campus}: {pct:.1%}")
```
## Step 4: Hardcode UVU Campus Historical Data
```{python}
# Hardcoded UVU campus enrollment from previous RTP spatial planning
# Source: Historical records and approved campus expansion plans
# Note: UVU-Heber excluded (outside WF-TDM region)
uvu_campus_hardcoded = pd.DataFrame({
'Year': list(range(2000, 2061)),
'UVU_Orem': [20946, 22609, 23609, 23803, 24149, 24487, 23305, 23840, 27052, 29045,
31739, 32734, 31810, 30880, 31589, 33565, 35126, 37785, 39931, 40278,
39438, 39715, 41504, 43105, 44720, 46420, 47173, 48078, 48855, 48850,
49762, 50673, 51583, 52494, 53405, 54317, 55227, 56138, 57049, 57960,
58871, 59781, 60693, 61604, 62515, 63426, 64338, 65248, 66159, 67069,
67981, 67981, 67981, 67981, 67981, 67981, 67981, 67981, 67981, 67981, 67981],
'UVU_Lehi': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1300,
1342, 1384, 1426, 1468, 1510, 1552, 1594, 1635, 1677, 1719,
1761, 1803, 1845, 1887, 1929, 1971, 2013, 2055, 2097, 2139,
2181, 2223, 2265, 2306, 2348, 2390, 2432, 2474, 2516, 2558,
2600, 2600, 2600, 2600, 2600, 2600, 2600, 2600, 2600, 2600, 2600],
'UVU_Payson': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 650,
762, 874, 986, 1098, 1210, 1321, 1433, 1545, 1657, 1769,
1881, 1993, 2105, 2217, 2329, 2440, 2552, 2664, 2776, 2888,
3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000],
'UVU_Vineyard1': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 570, 725, 879, 1034,
1187, 1342, 1496, 1651, 1805, 1960, 2114, 2269, 2422, 2577,
2731, 2886, 3040, 3195, 3349, 3504, 3657, 3812, 3966, 4121,
4275, 4275, 4275, 4275, 4275, 4275, 4275, 4275, 4275, 4275, 4275],
'UVU_Vineyard2': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 30, 38, 46, 54,
63, 71, 79, 87, 95, 103, 111, 119, 128, 136,
144, 152, 160, 168, 176, 184, 193, 201, 209, 217,
225, 225, 225, 225, 225, 225, 225, 225, 225, 225, 225],
})
# Calculate total UVU enrollment from hardcoded data
uvu_campus_hardcoded['UVU_Total_Hardcoded'] = (
uvu_campus_hardcoded['UVU_Orem'] +
uvu_campus_hardcoded['UVU_Lehi'] +
uvu_campus_hardcoded['UVU_Payson'] +
uvu_campus_hardcoded['UVU_Vineyard1'] +
uvu_campus_hardcoded['UVU_Vineyard2']
)
print("=== UVU Campus Hardcoded Data ===")
print(f"Years covered: {uvu_campus_hardcoded['Year'].min()} - {uvu_campus_hardcoded['Year'].max()}")
print("\nSample (2019-2026 - satellite campus openings):")
print(uvu_campus_hardcoded[uvu_campus_hardcoded['Year'].between(2019, 2026)].to_string(index=False))
```
## Step 5: Calculate UVU Time-Varying Campus Proportions
```{python}
# Calculate proportion of total UVU enrollment at each campus for each year
# These proportions will be applied to our forecasted UVU total
uvu_campus_proportions = uvu_campus_hardcoded.copy()
for campus in ['UVU_Orem', 'UVU_Lehi', 'UVU_Payson', 'UVU_Vineyard1', 'UVU_Vineyard2']:
prop_col = campus.replace('UVU_', 'prop_')
uvu_campus_proportions[prop_col] = (
uvu_campus_proportions[campus] / uvu_campus_proportions['UVU_Total_Hardcoded']
)
# Keep proportions for merging
uvu_props_clean = uvu_campus_proportions[['Year', 'prop_Orem', 'prop_Lehi', 'prop_Payson', 'prop_Vineyard1', 'prop_Vineyard2']].copy()
print("\n=== UVU Campus Proportions (Sample Years) ===")
print(uvu_props_clean[uvu_props_clean['Year'].isin([2000, 2019, 2026, 2029, 2050, 2060])].to_string(index=False))
```
## Step 6: Create Combined Historical + Forecast Dataset
```{python}
# Combine historical enrollment (2000-2023) with forecast (2024-2060)
# Historical: df_enroll_ipeds
# Forecast: df_forecast_adjusted
# Prepare historical data
df_hist_inst = df_enroll_ipeds.copy()
df_hist_inst = df_hist_inst[df_hist_inst['Year'].between(2000, 2023)].copy()
# Melt to long format with proper column names
df_hist_long = df_hist_inst.melt(id_vars=['Year'], var_name='Univ_Code', value_name='Enrollment')
df_hist_long = df_hist_long.dropna()
df_hist_long['Type'] = 'Historical'
# Prepare forecast data
df_forecast_long = df_forecast_adjusted[['Year', 'Univ_Code', 'Enrollment']].copy()
df_forecast_long['Type'] = 'Projected'
# Combine
df_combined_inst = pd.concat([df_hist_long, df_forecast_long], ignore_index=True)
df_combined_inst = df_combined_inst.sort_values(['Univ_Code', 'Year']).reset_index(drop=True)
print(f"\n=== Combined Institution-Level Dataset ===")
print(f"Years: {df_combined_inst['Year'].min()} - {df_combined_inst['Year'].max()}")
print(f"Total records: {len(df_combined_inst)}")
print(f"Historical: {len(df_combined_inst[df_combined_inst['Type']=='Historical'])}")
print(f"Projected: {len(df_combined_inst[df_combined_inst['Type']=='Projected'])}")
```
## Step 7: Disaggregate to Campus Level (2000-2060)
```{python}
# Create year-indexed output dataframe
all_years = sorted(df_combined_inst['Year'].unique())
df_campus_output = pd.DataFrame({'YEAR': all_years})
print(f"\n=== Disaggregating to Campus Level ===")
print(f"Processing {len(all_years)} years from {min(all_years)} to {max(all_years)}")
# Process each institution
for univ_code in df_combined_inst['Univ_Code'].unique():
print(f"\n{univ_code}:")
# Get enrollment time series
inst_data = df_combined_inst[df_combined_inst['Univ_Code'] == univ_code][['Year', 'Enrollment']].copy()
inst_data = inst_data.set_index('Year').sort_index()
# === SINGLE-CAMPUS INSTITUTIONS ===
if univ_code in SINGLE_CAMPUS_CODES:
df_campus_output[univ_code] = df_campus_output['YEAR'].map(inst_data['Enrollment'])
print(f" Single campus - assigned directly")
# === UVU - SPECIAL TIME-VARYING PROPORTIONS ===
elif univ_code == 'UVU':
print(f" Applying time-varying campus proportions")
# Merge our UVU forecast with proportions
uvu_merged = inst_data.reset_index().merge(
uvu_props_clean,
on='Year',
how='left'
)
# Apply proportions to forecasted enrollment
for campus_name, prop_name in [
('UVU_Main', 'prop_Orem'),
('UVU_Lehi', 'prop_Lehi'),
('UVU_Payson', 'prop_Payson'),
('UVU_Vineyard1', 'prop_Vineyard1'),
('UVU_Vineyard2', 'prop_Vineyard2')
]:
uvu_merged[campus_name] = (uvu_merged['Enrollment'] * uvu_merged[prop_name]).round().astype(int)
# Map to output dataframe
uvu_merged_indexed = uvu_merged.set_index('Year')
df_campus_output['UVU_Main'] = df_campus_output['YEAR'].map(uvu_merged_indexed['UVU_Main'])
df_campus_output['UVU_Lehi'] = df_campus_output['YEAR'].map(uvu_merged_indexed['UVU_Lehi'])
df_campus_output['UVU_Payson'] = df_campus_output['YEAR'].map(uvu_merged_indexed['UVU_Payson'])
df_campus_output['UVU_Vineyard1'] = df_campus_output['YEAR'].map(uvu_merged_indexed['UVU_Vineyard1'])
df_campus_output['UVU_Vineyard2'] = df_campus_output['YEAR'].map(uvu_merged_indexed['UVU_Vineyard2'])
# Combined Vineyard column
df_campus_output['UVU_Vineyard'] = (
df_campus_output['UVU_Vineyard1'].fillna(0) + df_campus_output['UVU_Vineyard2'].fillna(0)
)
print(f" Created: UVU_Main, UVU_Lehi, UVU_Payson, UVU_Vineyard1, UVU_Vineyard2, UVU_Vineyard")
# === MULTI-CAMPUS (NON-UVU) - APPLY 2025 USHE SPLITS ===
elif univ_code in INSTITUTION_NAME_MAP:
institution = INSTITUTION_NAME_MAP[univ_code]
campus_map = CAMPUS_MAPPING.get(institution, {})
print(f" Applying 2025 USHE campus splits")
for site, campus_code in campus_map.items():
percentage = campus_percentages.get(campus_code, 0)
if percentage > 0:
df_campus_output[campus_code] = df_campus_output['YEAR'].map(
lambda y: int(round(inst_data.loc[y, 'Enrollment'] * percentage))
if y in inst_data.index else 0
)
print(f" {campus_code}: {percentage:.1%}")
else:
df_campus_output[campus_code] = 0
print(f" {campus_code}: WARNING - no split data")
print(f"\n{'='*60}")
print(f"Generated {len(df_campus_output.columns)-1} campus columns")
```
## Step 8: Finalize Column Order and Export
```{python}
# Define desired column order
COLUMN_ORDER = [
'YEAR', 'BYU', 'ENSIGN',
'SLCC_MAIN', 'SLCC_JD', 'SLCC_ML', 'SLCC_SC',
'UOFU_Main',
'UVU_Main', 'UVU_Lehi', 'UVU_Payson', 'UVU_Vineyard1', 'UVU_Vineyard2', 'UVU_Vineyard',
'WSU_Main', 'WSU_Davis',
'WESTMIN'
]
# Ensure all columns exist
for col in COLUMN_ORDER:
if col not in df_campus_output.columns:
print(f"WARNING: Column {col} missing, adding as zeros")
df_campus_output[col] = 0
# Reorder and clean
df_campus_final = df_campus_output[COLUMN_ORDER].copy()
df_campus_final = df_campus_final.sort_values('YEAR').reset_index(drop=True)
# Convert to integers
for col in COLUMN_ORDER:
if col != 'YEAR':
df_campus_final[col] = df_campus_final[col].fillna(0).astype(int)
# Display summary
print("\n=== Campus-Level Enrollment Forecast (2000-2060) ===")
print("\nFirst 5 years:")
print(df_campus_final.head())
print("\nSatellite campus opening years (2019-2030):")
print(df_campus_final[df_campus_final['YEAR'].between(2019, 2030)])
print("\nLast 5 years:")
print(df_campus_final.tail())
# Verification: Check UVU total
df_campus_final_check = df_campus_final.copy()
df_campus_final_check['UVU_Total_Check'] = (
df_campus_final_check['UVU_Main'] + df_campus_final_check['UVU_Lehi'] +
df_campus_final_check['UVU_Payson'] + df_campus_final_check['UVU_Vineyard1'] +
df_campus_final_check['UVU_Vineyard2']
)
print("\n=== UVU Total Verification (Sample Years) ===")
verification_years = [2000, 2019, 2026, 2029, 2050, 2060]
print(df_campus_final_check[df_campus_final_check['YEAR'].isin(verification_years)][
['YEAR', 'UVU_Main', 'UVU_Lehi', 'UVU_Payson', 'UVU_Vineyard1', 'UVU_Vineyard2', 'UVU_Total_Check']
].to_string(index=False))
```
## Export Campus-Level Forecast
```{python}
# Export disaggregated campus-level forecast (2000-2060)
df_campus_final.to_csv(PATHS['campus_forecast_output'], index=False)
print(f"\n{'='*60}")
print(f"Campus-level forecast (2000-2060) saved to:")
print(f" {PATHS['campus_forecast_output']}")
print(f"\nTotal years: {len(df_campus_final)}")
print(f"Total campuses: {len(COLUMN_ORDER) - 1}")
```
---
# OJS Data Preparation & Exports
```{python}
# ── Export Only the Final Adjusted Output CSV ─────────────────────────────────
df_forecast_adjusted.to_csv(PATHS["forecast_output"], index=False)
# ── Metrics & Residuals Export ────────────────────────────────────────────────
df_all_metrics = pd.concat([pd.DataFrame(model_metrics), pd.DataFrame(flagship_metrics)], ignore_index=True)
df_residuals = pd.DataFrame(residual_records)
# ── Prepare Forecast Data for OJS ─────────────────────────────────────────────
df_history_long = df_enroll_ipeds.melt(id_vars=["Year"], var_name="Univ_Code", value_name="Enrollment").dropna()
df_history_long["University"] = df_history_long["Univ_Code"].map(DISPLAY_NAMES)
df_history_long["Type"] = "Historical"
df_history_long["Scenario"] = "Both"
df_proj_base = df_forecast_base.copy()
df_proj_base["Scenario"] = "Off"
df_proj_adj = df_forecast_adjusted.copy()
df_proj_adj["Scenario"] = "On"
# Combine everything for the UI Toggle
df_final_trend = pd.concat(
[
df_history_long[["Year", "University", "Enrollment", "Type", "Scenario"]],
df_proj_base[["Year", "University", "Enrollment", "Type", "Scenario"]],
df_proj_adj[["Year", "University", "Enrollment", "Type", "Scenario"]],
],
ignore_index=True,
).sort_values(["University", "Year"])
# Regional rate data (Historical & Base Model only)
df_regional_rate_long = regional_rate_hist.reset_index()
df_regional_rate_long.columns = ["Year", "Rate"]
df_regional_rate_long["Type"] = "Historical"
ojs_define(
data_final_forecast=df_final_trend.to_dict(orient="records"),
data_regional_rate_hist=df_regional_rate_long.to_dict(orient="records"),
data_regional_rate_proj=pd.DataFrame(regional_rate_proj).to_dict(orient="records"),
control_ceiling=float(PENETRATION_CEILING),
data_model_metrics=df_all_metrics.to_dict(orient="records"),
data_residuals=df_residuals.to_dict(orient="records")
)
```
---
# Results & Diagnostics
## Model Performance
The following tables and charts report accuracy metrics (R² and Mean Absolute Percentage Error) for both the Base Linear Model (participation rate) and the Direct Flagship Override method. The Test rows measure performance against the held-out 2020–2023 actuals.
**How to interpret these metrics:**
* **$R^2$ Score:** Measures how much of the historical enrollment trend is explained by our model. You want this number as close to `1.0` as possible. Compare the "Train" and "Test" scores—a massive drop in the Test score suggests the model is overfitting to past data and struggling with recent years.
* **MAPE (Mean Absolute Percentage Error):** The average percentage difference between the model's prediction and the actual enrollment. Lower is better. A MAPE of `2%` means the model is usually off by about 2% of the total student body.
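Both metrics follow directly from their definitions. This standalone sketch on toy numbers matches what scikit-learn's `r2_score` and `mean_absolute_percentage_error` compute:

```python
actual    = [100, 110, 120, 130]
predicted = [102, 108, 123, 127]

# R²: share of variance around the mean explained by the predictions
mean_actual = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
r2 = 1 - ss_res / ss_tot

# MAPE: mean of |error| relative to the actual value
mape = sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

print(round(r2, 3), round(mape, 4))  # → 0.948 0.0216
```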
```{ojs}
//| label: table-metrics
Inputs.table(data_model_metrics, {
sort: "Dataset", reverse: true,
columns: ["University", "Dataset", "Method", "R2", "MAPE"],
format: {
R2: d => d.toFixed(3),
MAPE: d => (d * 100).toFixed(2) + "%"
}
})
```
```{ojs}
//| label: fig-metrics-bar
//| fig-cap: "R² Accuracy Score by University (Test Dataset vs Train Dataset)"
Plot.plot({
marks: [
Plot.barY(data_model_metrics, {
x: "University", y: "R2", fill: "Dataset",
title: d => `${d.University} (${d.Dataset})\nR²: ${d.R2.toFixed(3)}`,
sort: { x: "y", reverse: true }
}),
Plot.ruleY([0])
],
facet: { data: data_model_metrics, x: "Dataset" },
x: { tickRotate: -45, label: null },
y: { grid: true, label: "R² Score" },
color: { legend: true, range: ["#4e79a7", "#f28e2c"] },
width: 800, height: 400, marginLeft: 50, marginBottom: 120,
style: { backgroundColor: "transparent", color: "var(--bs-body-color)" }
})
```
### Residual Analysis
These charts illustrate how well the linear trends predict historical data.
**What to look for in the Residuals plot:**
* **Random Scatter:** Points should be randomly scattered above and below the zero line.
* **Avoid Patterns:** If you see a distinct "U" shape or a consistent diagonal slope, it indicates the university's growth might be non-linear, and a simple linear regression is consistently over- or under-predicting during certain decades.
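One crude way to quantify "random scatter" is to count sign changes along the residual sequence: white-noise residuals flip sign on roughly half of the transitions, while a U-shape or persistent drift produces long same-sign runs. A heuristic sketch, not part of the pipeline:

```python
def sign_change_fraction(residuals):
    """Fraction of consecutive residual pairs that change sign (zeros ignored)."""
    signs = [r > 0 for r in residuals if r != 0]
    if len(signs) < 2:
        return float("nan")
    changes = sum(a != b for a, b in zip(signs, signs[1:]))
    return changes / (len(signs) - 1)

print(sign_change_fraction([3, -2, 4, -1, 2, -3]))    # alternating noise → 1.0
print(sign_change_fraction([4, 2, -1, -3, -1, 2, 4])) # U-shape → 2 changes / 6 ≈ 0.33
```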
```{ojs}
//| label: fig-residuals
//| fig-cap: "Residuals (Actual - Predicted) over Time"
Plot.plot({
marks: [
Plot.ruleY([0], { stroke: "gray", strokeDasharray: "4,4" }),
Plot.dot(data_residuals, {
x: "Year", y: "Residual", fill: "University",
title: d => `${d.University}\nYear: ${d.Year}\nResidual: ${d.Residual.toFixed(0)}\nMethod: ${d.Method}`,
r: 4
})
],
facet: { data: data_residuals, x: "Method" },
color: { legend: true },
y: { grid: true, label: "Residual (Students)" },
x: { tickFormat: "d" },
width: 860,
style: { backgroundColor: "transparent", color: "var(--bs-body-color)", fontFamily: "sans-serif", fontSize: "12px" }
})
```
**What to look for in the Absolute Percentage Error (APE) plot:**
* Because universities vary wildly in size, raw student residuals can be misleading. Missing by 500 students is a massive error for Westminster but barely noticeable for BYU.
* This chart normalizes the error to the size of the school. Look for sudden spikes in APE, which flag anomalous years that the model failed to capture.
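Those spikes can also be pulled out programmatically. A small sketch over records shaped like the notebook's `residual_records` (the values and the 5% cutoff are illustrative choices, not pipeline settings):

```python
records = [  # same shape as residual_records built earlier; values illustrative
    {"Year": 2019, "University": "Westminster University", "APE": 0.012},
    {"Year": 2020, "University": "Westminster University", "APE": 0.081},
    {"Year": 2021, "University": "Brigham Young University", "APE": 0.004},
]

APE_SPIKE_THRESHOLD = 0.05  # arbitrary illustrative cutoff

spikes = [r for r in records if r["APE"] > APE_SPIKE_THRESHOLD]
for r in spikes:
    print(f"{r['University']} {r['Year']}: APE {r['APE']:.1%}")
# → Westminster University 2020: APE 8.1%
```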
```{ojs}
//| label: fig-ape
//| fig-cap: "Absolute Percentage Error over Time"
Plot.plot({
marks: [
Plot.line(data_residuals, {
x: "Year", y: "APE", stroke: "University", strokeWidth: 1.5
}),
Plot.dot(data_residuals, {
x: "Year", y: "APE", fill: "University",
title: d => `${d.University}\nYear: ${d.Year}\nAPE: ${(d.APE * 100).toFixed(1)}%\nMethod: ${d.Method}`,
r: 3
})
],
facet: { data: data_residuals, x: "Method" },
color: { legend: true },
y: { grid: true, label: "Absolute Percentage Error", tickFormat: "%" },
x: { tickFormat: "d" },
width: 860,
style: { backgroundColor: "transparent", color: "var(--bs-body-color)", fontFamily: "sans-serif", fontSize: "12px" }
})
```
## Regional Control Total
This chart maps the historical and projected regional penetration rate before any flagship adjustments, confirming that the population-based projections stay at or below the control-total ceiling.
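The control total itself is just a clamp on the projected regional rate before it is multiplied by population. A standalone sketch (the ceiling value here is illustrative; the notebook's actual value is `PENETRATION_CEILING`, defined in an earlier chunk):

```python
def apply_control_total(projected_rate: float, ceiling: float) -> tuple[float, bool]:
    """Clamp a projected penetration rate at the regional ceiling; flag if capped."""
    capped = projected_rate > ceiling
    return (min(projected_rate, ceiling), capped)

ceiling = 0.30  # illustrative stand-in for PENETRATION_CEILING
print(apply_control_total(0.27, ceiling))  # below the ceiling → passes through
print(apply_control_total(0.34, ceiling))  # above the ceiling → capped, flagged
```

The boolean flag corresponds to the `Capped` field that the chart below renders as red diamond markers.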
```{ojs}
//| label: fig-control-total
//| fig-cap: "Regional Penetration Rate: Historical and Projected vs Control Total Ceiling"
{
const allRates = [...data_regional_rate_hist, ...data_regional_rate_proj]
return Plot.plot({
marks: [
Plot.ruleY([0]),
Plot.ruleY([control_ceiling], {
stroke: "var(--bs-danger)", strokeWidth: 2, strokeDasharray: "6,3"
}),
Plot.line(allRates, {
filter: d => d.Type === "Historical",
x: "Year", y: "Rate",
stroke: "var(--bs-primary)", strokeWidth: 3
}),
Plot.line(allRates, {
filter: d => d.Type === "Projected",
x: "Year", y: "Rate",
stroke: "var(--bs-secondary)", strokeWidth: 2.5, strokeDasharray: "5,4"
}),
Plot.dot(allRates, {
filter: d => d.Type === "Projected" && d.Capped,
x: "Year", y: "Rate",
fill: "var(--bs-danger)", r: 4, symbol: "diamond"
}),
Plot.text([{year: 2053, rate: control_ceiling}], {
x: "year", y: "rate",
text: () => `Ceiling: ${(control_ceiling * 100).toFixed(1)}%`,
dy: -10, fill: "var(--bs-danger)", fontSize: 11
}),
Plot.tip(allRates, Plot.pointerX({
x: "Year", y: "Rate",
title: d => `Year: ${d.Year}\nRegional Rate: ${(d.Rate * 100).toFixed(1)}%\n${d.Type}${d.Capped ? "\n⚠ Control total applied" : ""}`,
fill: "var(--bs-body-bg)", stroke: "var(--bs-body-color)"
}))
],
width: 780, marginLeft: 60, marginRight: 40,
y: {
grid: true, label: "Regional Penetration Rate",
tickFormat: d => (d * 100).toFixed(0) + "%"
},
x: { tickFormat: "d" },
style: {
backgroundColor: "transparent", color: "var(--bs-body-color)",
fontFamily: "sans-serif", fontSize: "12px"
}
})
}
```
## Enrollment Forecast by University
**Note:** Solid lines represent historical IPEDS data (2000–2023). Dashed lines represent the forecast (2024–2060). Toggle the Flagship Adjustment "On" to see the direct-trend override applied to BYU and UofU, bypassing local demographic constraints.
```{ojs}
//| label: scenario-toggle
viewof adjustment_toggle = Inputs.radio(
["On", "Off"],
{ label: "Flagship Adjustment:", value: "On" }
)
```
```{ojs}
//| label: fig-enrollment-forecast
//| fig-cap: "College Enrollment Forecast by University (2000–2060)"
Plot.plot({
marks: [
Plot.ruleY([0]),
Plot.ruleX([2024], { stroke: "gray", strokeDasharray: "2,2", strokeOpacity: 0.5 }),
Plot.line(data_final_forecast, {
filter: d => d.Type === "Historical",
x: "Year", y: "Enrollment", z: "University",
stroke: "University", strokeWidth: 2.5
}),
Plot.line(data_final_forecast, {
filter: d => d.Type === "Projected" && d.Scenario === adjustment_toggle,
x: "Year", y: "Enrollment", z: "University",
stroke: "University",
strokeWidth: 2.5,
strokeDasharray: "5,5"
}),
Plot.text(data_final_forecast, Plot.selectLast({
filter: d => d.Type === "Historical" || (d.Type === "Projected" && d.Scenario === adjustment_toggle),
x: "Year", y: "Enrollment", z: "University",
text: "University", textAnchor: "start", dx: 8, fill: "var(--bs-body-color)"
}), { stroke: "var(--bs-body-bg)", strokeWidth: 3 }),
Plot.tip(data_final_forecast, Plot.pointerX({
filter: d => d.Type === "Historical" || (d.Type === "Projected" && d.Scenario === adjustment_toggle),
x: "Year", y: "Enrollment",
title: d => `${d.University}\nYear: ${d.Year}\nStudents: ${d.Enrollment.toLocaleString()}\n(${d.Type})`,
fill: "var(--bs-body-bg)", stroke: "var(--bs-body-color)"
}))
],
width: 860, marginLeft: 60, marginRight: 185,
color: { legend: true },
y: { grid: true, label: "Fall Enrollment", tickFormat: "s" },
x: { tickFormat: "d" },
style: {
backgroundColor: "transparent", color: "var(--bs-body-color)",
fontFamily: "sans-serif", fontSize: "12px"
}
})
```
::: {.callout-tip title="Download the output file:"}
* [enrollment_forecast_2024_2060.csv](./intermediate/enrollment_forecast_2024_2060.csv)
* [enrollment_forecast_by_campus_2000_2060.csv](./intermediate/enrollment_forecast_by_campus_2000_2060.csv)
:::