---
title: "Forecasted College Enrollment"
subtitle: Forecasting college enrollment in the WF-TDM region to the horizon year.
description: |
This notebook forecasts college enrollment for Utah's major universities using a
**Weighted Participation Rate** method. The approach estimates each university's
share of its surrounding college-age population, fits a linear trend to that share,
applies a regional penetration ceiling as a control total, and multiplies by future
population projections from the Kem C. Gardner Policy Institute. Targeted
adjustments are made for capacity-constrained flagship institutions.
author:
- name: Pukar Bhandari
email: pukar.bhandari@wfrc.utah.gov
affiliation:
- name: Wasatch Front Regional Council
url: "https://wfrc.utah.gov/"
date: "`r Sys.Date()`"
---
# Methodology Overview
This forecast predicts future college enrollment for travel demand modeling using a hybrid demographic approach. The core methodology follows these steps:
1. **Catchment Weights (Addressable Markets):** We determine what percentage of each university's student body comes from specific counties to define its "Addressable Market."
2. **Weighted Populations:** We calculate the catchment population by multiplying county shares by historical county-level college-age populations (18-24).
3. **Base Participation Rates:** We calculate the historical ratio of enrolled students to this weighted population and forecast it forward using Linear Regression.
4. **Regional Control Totals:** We apply a ceiling to the overall regional penetration rate to prevent unrealistically high demographic capture.
5. **Flagship Override:** Because large flagship universities (BYU, UofU) draw heavily from outside the local region and are capacity-constrained, their forecasts are detached from local population trends and modeled directly against time.
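As a toy, end-to-end sketch of steps 1-4 (every number is invented; the hypothetical `County A` / `County B` values stand in for the real weights and populations computed later in this notebook):

```{python}
# Toy illustration of the weighted participation rate method.
# All numbers here are invented for demonstration only.

# Step 1: catchment weights -- share of a university's students from each county
weights = {"County A": 0.7, "County B": 0.3}

# Step 2: weighted population -- county 18-24 populations scaled by the weights
pop_18_24 = {"County A": 100_000, "County B": 50_000}
weighted_pop = sum(pop_18_24[c] * w for c, w in weights.items())  # 85,000

# Step 3: participation rate -- enrollment divided by the weighted population
enrollment = 17_000
rate = enrollment / weighted_pop  # 0.20

# Step 4: regional control total -- cap the projection if it implies a
# penetration rate above a historically grounded ceiling
ceiling_rate = 0.18
projected = rate * weighted_pop
capped = min(projected, ceiling_rate * weighted_pop)  # capped at ~15,300
```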
## Data Sources
* **Historical Demographics:** US Census Bureau API and IPUMS NHGIS datasets (Age range 18-24 matches GPI methodology).
* **Projected Demographics:** [Kem C. Gardner Policy Institute Population Projections](https://gardner.utah.edu/utah-demographics/population-projections/utah-2065-long-term-vintage-2025/)
* **Utah System of Higher Education (USHE):** [Institutional Data Resources](https://ushe.edu/institutional-data-resources-fte/)
* **Historical Enrollment (IPEDS):** Integrated Postsecondary Education Data System (2000-2023). Specific institutional profiles:
* [University of Utah (UofU)](https://nces.ed.gov/ipeds/reported-data/230764?year=2024&surveyNumber=15)
* [Brigham Young University (BYU)](https://nces.ed.gov/ipeds/reported-data/230038?year=2024&surveyNumber=15)
* [Utah Valley University (UVU)](https://nces.ed.gov/ipeds/reported-data/230737?year=2024&surveyNumber=15)
* [Salt Lake Community College (SLCC)](https://nces.ed.gov/ipeds/reported-data/230746?year=2024&surveyNumber=15)
* [Weber State University (WSU)](https://nces.ed.gov/ipeds/reported-data/230782?year=2024&surveyNumber=15)
* [Ensign College (ENSIGN)](https://nces.ed.gov/ipeds/reported-data/230418?year=2024&surveyNumber=15)
* [Westminster College (WESTMIN)](https://nces.ed.gov/ipeds/reported-data/216807?year=2024&surveyNumber=15)
---
# Configuration & Setup
## File Paths & Settings
```{python}
from pathlib import Path
import pandas as pd
import numpy as np
import geopandas as gpd
from pygris.data import get_census
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_percentage_error
from IPython.display import display, Markdown
import warnings
warnings.filterwarnings("ignore")
# Centralized path configuration for all data inputs and outputs
PATHS = {
# Cache files (intermediate data)
"census_cache": Path(
"intermediate/Utah_Census_All_Counties_All_Ages_2000_2023.csv"
),
# Input data files
"enrollment_ipeds": Path("intermediate/Fall_Enrollment_1997_2023.csv"),
"gpi_data": Path("data/GPI/SchColAgeFS-Dec2022.xlsx"),
"gpi_projections": Path(
"data/GPI/Gardner-Policy-Institute-State-and-County-Projections-2025-2065-Data.xlsx"
),
"zip_code": Path("data/UGRC ZIP/UtahZipCodeAreas_-1977770497264301863.gpkg"),
"enroll_by_zip": Path("intermediate/TotEnrolByZipByCollege.csv"),
"ushe_2025": Path("data/USHE 2025 Enrollment/WFRC Data Request_Final.xlsx"),
# NHGIS data
"nhgis_dir": Path("data/NHGIS"),
# Output paths
"model_diagnostics": Path("intermediate/model_diagnostics.csv"),
"forecast_output": Path("intermediate/enrollment_forecast_2024_2060.csv"),
"campus_forecast_output": Path("intermediate/enrollment_forecast_by_campus_2000_2060.csv"),
}
# WF-TDM region counties
TARGET_COUNTIES = [
"Box Elder County",
"Weber County",
"Davis County",
"Salt Lake County",
"Utah County",
]
# Age range for college-age population (18-24) — matches GPI methodology
COLLEGE_AGE_RANGES = ["18 and 19 years", "20 years", "21 years", "22 to 24 years"]
# Train/test split: hold out 2020-2023 for model validation
TEST_START_YEAR = 2020
# Global Toggles
CALIB_SHIFT = True # Toggle whether to shift model curves to meet 2023 actuals
```
# 1. Load Data
## Enrollment Data
### Utah System of Higher Education & IPEDS
**Sources:** USHE and Integrated Postsecondary Education Data System (2000-2023)
```{python}
# Load IPEDS enrollment data (2000-2023)
df_enroll_ipeds = pd.read_csv(PATHS['enrollment_ipeds'])
# Historical enrollment data from 2019 RTP (1990-1999)
df_enroll_rtp19 = pd.DataFrame(
{
"Year": [1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999],
"UofU": [24311, 25581, 25868, 25982, 25645, 26810, 26079, 25895, 25018, 25781],
"WSU": [13449, 14495, 14993, 14447, 14230, 13996, 13906, 14613, 13900, 14984],
"UVU": [7885, 8777, 9623, 10512, 13293, 14040, 14756, 15994, 18174, 20062],
"SLCC": [13344, 15972, 17766, 18311, 19440, 20834, 22394, 24307, 19754, 21273],
"BYU": [30465, 31101, 31108, 30547, 30413, 30465, 30563, 31249, 32202, 32731],
}
)
```
## College-Age Population (Historical 18-24)
### US Census Bureau
We define a crosswalk to extract historical 18-24 population cohorts from Census Summary Files and the American Community Survey.
```{python}
# --- 1. Define Variable Crosswalk (Metadata) ---
var_crosswalk = pd.DataFrame(
[
("Male", "Total", "P012002", "P12_002N", "B01001_002E"),
("Male", "Under 5 years", "P012003", "P12_003N", "B01001_003E"),
("Male", "5 to 9 years", "P012004", "P12_004N", "B01001_004E"),
("Male", "10 to 14 years", "P012005", "P12_005N", "B01001_005E"),
("Male", "15 to 17 years", "P012006", "P12_006N", "B01001_006E"),
("Male", "18 and 19 years", "P012007", "P12_007N", "B01001_007E"),
("Male", "20 years", "P012008", "P12_008N", "B01001_008E"),
("Male", "21 years", "P012009", "P12_009N", "B01001_009E"),
("Male", "22 to 24 years", "P012010", "P12_010N", "B01001_010E"),
("Male", "25 to 29 years", "P012011", "P12_011N", "B01001_011E"),
("Male", "30 to 34 years", "P012012", "P12_012N", "B01001_012E"),
("Male", "35 to 39 years", "P012013", "P12_013N", "B01001_013E"),
("Male", "40 to 44 years", "P012014", "P12_014N", "B01001_014E"),
("Male", "45 to 49 years", "P012015", "P12_015N", "B01001_015E"),
("Male", "50 to 54 years", "P012016", "P12_016N", "B01001_016E"),
("Male", "55 to 59 years", "P012017", "P12_017N", "B01001_017E"),
("Male", "60 and 61 years", "P012018", "P12_018N", "B01001_018E"),
("Male", "62 to 64 years", "P012019", "P12_019N", "B01001_019E"),
("Male", "65 and 66 years", "P012020", "P12_020N", "B01001_020E"),
("Male", "67 to 69 years", "P012021", "P12_021N", "B01001_021E"),
("Male", "70 to 74 years", "P012022", "P12_022N", "B01001_022E"),
("Male", "75 to 79 years", "P012023", "P12_023N", "B01001_023E"),
("Male", "80 to 84 years", "P012024", "P12_024N", "B01001_024E"),
("Male", "85 years and over", "P012025", "P12_025N", "B01001_025E"),
("Female", "Total", "P012026", "P12_026N", "B01001_026E"),
("Female", "Under 5 years", "P012027", "P12_027N", "B01001_027E"),
("Female", "5 to 9 years", "P012028", "P12_028N", "B01001_028E"),
("Female", "10 to 14 years", "P012029", "P12_029N", "B01001_029E"),
("Female", "15 to 17 years", "P012030", "P12_030N", "B01001_030E"),
("Female", "18 and 19 years", "P012031", "P12_031N", "B01001_031E"),
("Female", "20 years", "P012032", "P12_032N", "B01001_032E"),
("Female", "21 years", "P012033", "P12_033N", "B01001_033E"),
("Female", "22 to 24 years", "P012034", "P12_034N", "B01001_034E"),
("Female", "25 to 29 years", "P012035", "P12_035N", "B01001_035E"),
("Female", "30 to 34 years", "P012036", "P12_036N", "B01001_036E"),
("Female", "35 to 39 years", "P012037", "P12_037N", "B01001_037E"),
("Female", "40 to 44 years", "P012038", "P12_038N", "B01001_038E"),
("Female", "45 to 49 years", "P012039", "P12_039N", "B01001_039E"),
("Female", "50 to 54 years", "P012040", "P12_040N", "B01001_040E"),
("Female", "55 to 59 years", "P012041", "P12_041N", "B01001_041E"),
("Female", "60 and 61 years", "P012042", "P12_042N", "B01001_042E"),
("Female", "62 to 64 years", "P012043", "P12_043N", "B01001_043E"),
("Female", "65 and 66 years", "P012044", "P12_044N", "B01001_044E"),
("Female", "67 to 69 years", "P012045", "P12_045N", "B01001_045E"),
("Female", "70 to 74 years", "P012046", "P12_046N", "B01001_046E"),
("Female", "75 to 79 years", "P012047", "P12_047N", "B01001_047E"),
("Female", "80 to 84 years", "P012048", "P12_048N", "B01001_048E"),
("Female", "85 years and over", "P012049", "P12_049N", "B01001_049E"),
],
columns=["Gender", "Age", "SF1", "DHC", "ACS"],
)
var_crosswalk["Label"] = var_crosswalk["Gender"] + ": " + var_crosswalk["Age"]
```
```{python}
# --- 2. Fetch & Cache Census Data ---
if not PATHS["census_cache"].exists():
print(f"Fetching Census data to {PATHS['census_cache']}...")
decennial_years = [2000, 2010, 2020]
acs_years = [y for y in range(2009, 2024) if y not in decennial_years]
data_frames = []
for year in sorted(decennial_years + acs_years):
try:
if year in decennial_years:
dataset = "dec/dhc" if year == 2020 else "dec/sf1"
code_col = "DHC" if year == 2020 else "SF1"
else:
dataset, code_col = "acs/acs5", "ACS"
year_codes = var_crosswalk[code_col].tolist()
rename_map = dict(zip(year_codes, var_crosswalk["Label"]))
df = get_census(
dataset=dataset,
variables=["NAME"] + year_codes,
year=year,
params={"for": "county:*", "in": "state:49"},
return_geoid=True,
)
df = df.rename(columns={c: "NAME" for c in df.columns if c.upper() == "NAME"})
df = df.rename(columns=rename_map)
valid_cols = ["NAME"] + var_crosswalk["Label"].tolist()
df = df[df.columns.intersection(valid_cols)].copy()
df["Year"], df["Dataset"] = year, dataset
data_frames.append(df)
print(f"Fetched {year}")
except Exception as e:
print(f"Skipping {year}: {e}")
if data_frames:
PATHS["census_cache"].parent.mkdir(parents=True, exist_ok=True)
pd.concat(data_frames, ignore_index=True).to_csv(PATHS["census_cache"], index=False)
print("Census data cached successfully")
else:
print(f"Using cached Census data from {PATHS['census_cache']}")
```
```{python}
#| eval: false
#| echo: true
# Toggle ON (eval: true) to use Census API via pygris
# Toggle OFF (eval: false) if using NHGIS data instead
df_universe = pd.read_csv(PATHS['census_cache'])
df_universe["Geography"] = df_universe["NAME"].str.split(",").str[0]
df_wfrc = df_universe[df_universe["Geography"].isin(TARGET_COUNTIES)].copy()
target_cols = var_crosswalk.loc[var_crosswalk["Age"].isin(COLLEGE_AGE_RANGES), "Label"].tolist()
df_wfrc["Population_18_24"] = (
df_wfrc[target_cols].apply(pd.to_numeric, errors="coerce").fillna(0).sum(axis=1)
)
df_college_age_hist = (
    df_wfrc.pivot_table(
        index="Year", columns="Geography", values="Population_18_24", aggfunc="sum"
    )
    .rename_axis(None, axis=1)  # keep Year as the index, matching the NHGIS path
)
```
### Timeseries Data from NHGIS
```{python}
#| eval: false
#| echo: true
# Toggle ON (eval: true) to fetch from IPUMS NHGIS API
# Requires IPUMS_API_KEY environment variable in .env
import os, zipfile
from dotenv import load_dotenv
from ipumspy import IpumsApiClient, AggregateDataExtract, TimeSeriesTable
load_dotenv()
IPUMS_API_KEY = os.getenv("IPUMS_API_KEY")
if IPUMS_API_KEY:
client = IpumsApiClient(IPUMS_API_KEY)
ts_table = TimeSeriesTable(
name="B57",
geog_levels=["county"],
years=["1970","1980","1990","2000","2010","2020",
"2006-2010","2007-2011","2008-2012","2009-2013","2010-2014",
"2011-2015","2012-2016","2013-2017","2014-2018","2015-2019",
"2016-2020","2017-2021","2018-2022","2019-2023","2020-2024"],
)
extract = AggregateDataExtract(
collection="nhgis",
description="Utah County College Age Time Series",
time_series_tables=[ts_table],
geographic_extents=["490"],
tst_layout="time_by_column_layout",
)
print("Submitting extract to IPUMS NHGIS...")
client.submit_extract(extract)
print(f"Extract submitted. ID: {extract.extract_id}")
client.wait_for_extract(extract)
download_dir = PATHS['nhgis_dir']
download_dir.mkdir(parents=True, exist_ok=True)
client.download_extract(extract, download_dir=download_dir)
zip_path = download_dir / f"nhgis{extract.extract_id:04d}_csv.zip"
with zipfile.ZipFile(zip_path, "r") as zf:
zf.extractall(download_dir / f"nhgis{extract.extract_id:04d}_csv")
print("NHGIS Fetch Complete!")
else:
print("Error: IPUMS_API_KEY not found in .env file.")
```
```{python}
#| eval: true
#| echo: true
# Load and process NHGIS time-series data
# Works with both manually downloaded and API-fetched NHGIS extracts
# 1. Locate latest NHGIS file
nhgis_files = sorted(PATHS['nhgis_dir'].rglob("*_ts_nominal_county.csv"))
nhgis_path = (
nhgis_files[-1] if nhgis_files
else PATHS['nhgis_dir'] / "nhgis0002_csv/nhgis0002_ts_nominal_county.csv"
)
df_nhgis = pd.read_csv(nhgis_path)
df_ut = df_nhgis[df_nhgis["STATE"] == "Utah"].copy()
# 2. Select 18-24 age columns (B57AE–B57AH = 18–19, 20, 21, 22–24)
target_prefixes = ["B57AE", "B57AF", "B57AG", "B57AH"]
value_cols = [
c for c in df_ut.columns
if any(c.startswith(p) for p in target_prefixes) and not c.endswith("M")
]
for col in value_cols:
df_ut[col] = pd.to_numeric(df_ut[col], errors="coerce")
# 3. Melt to long format and parse year
df_long = df_ut.melt(
id_vars=["COUNTY"], value_vars=value_cols,
var_name="Variable", value_name="Population"
)
df_long["Suffix"] = df_long["Variable"].str[5:]
df_long["Year"] = df_long["Suffix"].apply(
lambda x: int(x) if len(x) == 4 else 2000 + int(x[:2])
)
df_long["Type"] = df_long["Suffix"].apply(
lambda x: "Decennial" if len(x) == 4 else "ACS"
)
# 4. Prefer Decennial over ACS for 2010 and 2020
mask_drop = (
((df_long["Year"] == 2010) & (df_long["Type"] == "ACS")) |
((df_long["Year"] == 2020) & (df_long["Type"] == "ACS"))
)
df_clean = df_long[~mask_drop].copy()
# 5. Aggregate by county and year
df_agg = df_clean.groupby(["COUNTY", "Year"])["Population"].sum().reset_index()
df_agg["Geography"] = df_agg["COUNTY"].apply(
lambda x: x if "County" in str(x) else str(x) + " County"
)
# 6. Pivot to wide format (counties as columns)
df_college_age_hist = (
df_agg[df_agg["Geography"].isin(TARGET_COUNTIES)]
.pivot(index="Year", columns="Geography", values="Population")
.sort_index()
)
df_college_age_hist.columns.name = None
```
### Forecast: Kem C. Gardner Policy Institute (2025-2060)
The Kem C. Gardner Policy Institute (GPI) provides the official projected demographic counts for Utah counties.
```{python}
# Primary forecast data: county-level GPI projections (2025–2065)
df_gpi_projections = (
pd.read_excel(
PATHS['gpi_projections'],
sheet_name="Demographic Detail",
header=3,
nrows=1230,
)
.query("Geography in @TARGET_COUNTIES")[["Geography", "Year", "Population Ages 18-24"]]
)
df_gpi_wide = (
df_gpi_projections
.pivot(index="Year", columns="Geography", values="Population Ages 18-24")
.rename_axis(None, axis=1)
.sort_index()
)
# Backcast 2023–2024 using the average annual delta over the first 3 GPI years
avg_annual_change = df_gpi_wide.loc[2025:2027].diff().mean()
gpi_2024 = df_gpi_wide.loc[2025] - avg_annual_change
gpi_2023 = df_gpi_wide.loc[2025] - 2 * avg_annual_change
df_gpi_ext = pd.concat([
pd.DataFrame(gpi_2023).T.rename(index={0: 2023}),
pd.DataFrame(gpi_2024).T.rename(index={0: 2024}),
df_gpi_wide,
]).sort_index()
```
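The backcasting arithmetic in the cell above can be checked with toy numbers (`proj` is an invented stand-in for one county's GPI series):

```{python}
# Minimal check of the backcast logic: extend a projection series backward
# by its average annual change over the first three projected years.
import pandas as pd

proj = pd.Series({2025: 1000.0, 2026: 1020.0, 2027: 1040.0})
avg_delta = proj.loc[2025:2027].diff().mean()   # 20.0
backcast_2024 = proj.loc[2025] - avg_delta       # 980.0
backcast_2023 = proj.loc[2025] - 2 * avg_delta   # 960.0
```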
---
# Data Exploration
## College Age (18–24) Population Trend
```{python}
# Melt both sources to long for plotting
df_hist_long = (
df_college_age_hist.reset_index()
.melt(id_vars=["Year"], var_name="County", value_name="Population")
)
df_hist_long["Type"] = "Historical"
df_proj_long = df_gpi_projections.rename(
columns={"Geography": "County", "Population Ages 18-24": "Population"}
).copy()
df_proj_long["Type"] = "Projected"
df_plot_combined = pd.concat([df_hist_long, df_proj_long], ignore_index=True)
ojs_define(data_pop_trend=df_plot_combined.to_dict(orient="records"))
```
```{ojs}
//| label: fig-college-age-trend
//| fig-cap: "Historical and Projected College-Age Population (18-24) by County"
Plot.plot({
marks: [
Plot.ruleY([0]),
Plot.line(data_pop_trend, {
filter: d => d.Type === "Historical",
x: "Year", y: "Population", z: "County",
stroke: "County", strokeWidth: 3
}),
Plot.line(data_pop_trend, {
filter: d => d.Type === "Projected",
x: "Year", y: "Population", z: "County",
stroke: "County", strokeWidth: 3, strokeDasharray: "5,5"
}),
Plot.text(data_pop_trend, Plot.selectLast({
x: "Year", y: "Population", z: "County",
text: "County", textAnchor: "start", dx: 8, fill: "var(--bs-body-color)"
})),
Plot.tip(data_pop_trend, Plot.pointerX({
x: "Year", y: "Population",
title: d => `${d.County}\nYear: ${d.Year}\nPop: ${d.Population.toLocaleString()}\n(${d.Type})`,
fill: "var(--bs-body-bg)", stroke: "var(--bs-body-color)"
}))
],
width: 780, marginLeft: 50, marginRight: 110,
color: { legend: true },
y: { grid: true, label: "Population (18-24)" },
x: { tickFormat: "d" },
style: {
backgroundColor: "transparent", color: "var(--bs-body-color)",
fontFamily: "sans-serif", fontSize: "12px"
}
})
```
---
# Forecast Enrollment
## Enrollment by College and County (Dynamic Catchment Generation)
We calculate dynamic base percentages from localized USHE zip-code enrollment patterns for the public institutions (UofU, UVU, WSU, SLCC). For private institutions absent from the USHE dataset, we apply assumed market footprints.
## Step 1 — Define Addressable Markets (Catchment Weights)
We programmatically extract the USHE-derived weights for the public institutions and combine them with explicitly assumed weights for the private institutions.
```{python}
TARGET_COLLEGES = ["UOFU", "BYU", "UVU", "SLCC", "WSU", "ENSIGN", "WESTMIN"]
ANALYSIS_COUNTIES = ["Box Elder", "Weber", "Davis", "Salt Lake", "Utah"]
# Utah Counties Mapping
utah_counties = sorted(
[
"Beaver",
"Box Elder",
"Cache",
"Carbon",
"Daggett",
"Davis",
"Duchesne",
"Emery",
"Garfield",
"Grand",
"Iron",
"Juab",
"Kane",
"Millard",
"Morgan",
"Piute",
"Rich",
"Salt Lake",
"San Juan",
"Sanpete",
"Sevier",
"Summit",
"Tooele",
"Uintah",
"Utah",
"Wasatch",
"Washington",
"Wayne",
"Weber",
]
)
county_map = {i + 1: name for i, name in enumerate(utah_counties)}
# Load Zip Data
df_enroll = pd.read_csv(PATHS["enroll_by_zip"])
gdf_zip = gpd.read_file(PATHS["zip_code"])
# Process
df_enroll["College_Clean"] = df_enroll["COLLEGE"].str.split("_").str[0]
df_enroll = df_enroll[df_enroll["College_Clean"].isin(TARGET_COLLEGES)].copy()
df_enroll["ZIP5"] = df_enroll["ZIPCODE"].astype(str).str.zfill(5)
gdf_zip["ZIP5"] = gdf_zip["ZIP5"].astype(str).str.zfill(5)
df_merged = df_enroll.merge(gdf_zip[["ZIP5", "COUNTYNBR"]], on="ZIP5", how="left")
df_merged["County"] = pd.to_numeric(df_merged["COUNTYNBR"], errors="coerce").map(
county_map
)
df_merged = df_merged[df_merged["County"].isin(ANALYSIS_COUNTIES)]
# Summary & Percentages
df_summary = (
df_merged.groupby(["County", "College_Clean"])["TotEnrol"]
.sum()
.unstack(fill_value=0)
.reindex(columns=TARGET_COLLEGES, fill_value=0)
)
df_percent = df_summary.div(df_summary.sum(axis=0), axis=1)
CATCHMENT_WEIGHTS = {
# Private institutions missing from USHE zip dataset
"BYU": {
"Box Elder County": 0.000,
"Davis County": 0.025,
"Weber County": 0.025,
"Salt Lake County": 0.20,
"Utah County": 0.75,
},
"ENSIGN": {
"Box Elder County": 0.000,
"Weber County": 0.000,
"Davis County": 0.15,
"Salt Lake County": 0.70,
"Utah County": 0.15,
},
"WESTMIN": {
"Box Elder County": 0.000,
"Weber County": 0.000,
"Davis County": 0.065,
"Salt Lake County": 0.90,
"Utah County": 0.035,
},
}
# Pull Dynamic Weights for Public Universities
zip_to_ipeds = {"UOFU": "UofU", "UVU": "UVU", "WSU": "WSU", "SLCC": "SLCC"}
for zip_key, ipeds_key in zip_to_ipeds.items():
if zip_key in df_percent.columns:
weights = df_percent[zip_key].to_dict()
CATCHMENT_WEIGHTS[ipeds_key] = {f"{k} County": v for k, v in weights.items()}
DISPLAY_NAMES = {
"UofU": "University of Utah",
"UVU": "Utah Valley University",
"WSU": "Weber State University",
"SLCC": "Salt Lake Community College",
"BYU": "Brigham Young University",
"ENSIGN": "Ensign College",
"WESTMIN": "Westminster College",
}
SCHOOL_KEYS = list(CATCHMENT_WEIGHTS.keys())
display(Markdown("**Calculated Catchment Weights (Addressable Market Shares):**"))
display(pd.DataFrame(CATCHMENT_WEIGHTS).fillna(0).style.format("{:.1%}"))
```
## Step 2 — Weighted Populations
We multiply each county's 18-24 population by its catchment weight and sum across counties to obtain each university's **Weighted Population**.
```{python}
# Interpolate Census data to fill 2001–2009 gap
df_census_full = df_college_age_hist.reindex(range(2000, 2024)).interpolate(
method="linear"
)
# Compute Historical & Projected Weighted Populations
df_weighted_hist = pd.DataFrame(index=range(2000, 2024))
df_weighted_proj = pd.DataFrame(index=df_gpi_ext.index)
for key, weights in CATCHMENT_WEIGHTS.items():
df_weighted_hist[key] = sum(
df_census_full[county] * weight
for county, weight in weights.items()
if county in df_census_full.columns
)
df_weighted_proj[key] = sum(
df_gpi_ext[county] * weight
for county, weight in weights.items()
if county in df_gpi_ext.columns
)
display(Markdown("**Historical Weighted Populations (Recent Years):**"))
display(df_weighted_hist.tail().astype(int))
display(Markdown("**Projected Weighted Populations (First 5 Years):**"))
display(df_weighted_proj.head().astype(int))
```
## Step 3 — Historical Participation Rates
We define the **Participation Rate** as:
$$\text{Rate}_{year} = \frac{\text{Actual Enrollment}_{year}}{\text{Weighted Population}_{year}}$$
```{python}
# Compute historic participation rates
df_ipeds = df_enroll_ipeds.set_index("Year").sort_index()
df_hist_rates = pd.DataFrame(index=range(2000, 2024))
for key in SCHOOL_KEYS:
df_hist_rates[key] = (
df_ipeds[key].reindex(df_hist_rates.index) / df_weighted_hist[key]
)
display(Markdown("**Historical Participation Rates (Recent Years):**"))
display(df_hist_rates.tail().style.format("{:.2%}"))
```
## Step 4 — Linear Regression Model
We fit a linear trend to each university's historical participation rate:
$$\text{Rate}(t) = \alpha + \beta \cdot t, \quad t = \text{Year} - 1999$$
The model is trained on 2000–`{python} (TEST_START_YEAR - 1)` and evaluated on `{python} TEST_START_YEAR`–2023.
```{python}
models_dict = {}
model_metrics = []
residual_records = []
t_2023 = 2023 - 1999
for key in SCHOOL_KEYS:
full_name = DISPLAY_NAMES[key]
rates_series = df_hist_rates[key].dropna()
years_arr = rates_series.index.values.astype(float)
rates_arr = rates_series.values
t_arr = years_arr - 1999
X = t_arr.reshape(-1, 1)
train_mask = years_arr < TEST_START_YEAR
test_mask = ~train_mask
years_train = years_arr[train_mask].astype(int)
years_test = years_arr[test_mask].astype(int)
rates_train = rates_arr[train_mask]
rates_test = rates_arr[test_mask]
X_train = X[train_mask]
X_test = X[test_mask]
model = LinearRegression().fit(X_train, rates_train)
models_dict[key] = model
alpha = float(model.intercept_)
beta = float(model.coef_[0])
pred_train = model.predict(X_train)
pred_test = model.predict(X_test) if test_mask.any() else None
# Train Metrics
pop_train = df_weighted_hist.loc[years_train, key].values
enroll_act_train = df_ipeds.loc[years_train, key].values
enroll_pred_train = pred_train * pop_train
r2_train = r2_score(rates_train, pred_train)
mape_train = mean_absolute_percentage_error(enroll_act_train, enroll_pred_train)
model_metrics.append({
"University": full_name, "Dataset": "Train", "Method": "Base Linear",
"Alpha": round(alpha, 4), "Beta": round(beta, 4), "R2": round(r2_train, 4), "MAPE": round(mape_train, 4),
})
# Collect Residuals (Train)
for yr, act, pred in zip(years_train, enroll_act_train, enroll_pred_train):
residual_records.append({
"Year": int(yr), "University": full_name, "Actual": float(act), "Predicted": float(pred),
"Residual": float(act - pred), "APE": float(abs(act - pred)/act) if act != 0 else 0,
"Method": "Base Linear", "Dataset": "Train"
})
# Test Metrics
if test_mask.any():
pop_test = df_weighted_hist.loc[years_test, key].values
enroll_act_test = df_ipeds.loc[years_test, key].values
enroll_pred_test = pred_test * pop_test
r2_test = r2_score(rates_test, pred_test)
mape_test = mean_absolute_percentage_error(enroll_act_test, enroll_pred_test)
model_metrics.append({
"University": full_name, "Dataset": "Test", "Method": "Base Linear",
"Alpha": round(alpha, 4), "Beta": round(beta, 4), "R2": round(r2_test, 4), "MAPE": round(mape_test, 4),
})
# Collect Residuals (Test)
for yr, act, pred in zip(years_test, enroll_act_test, enroll_pred_test):
residual_records.append({
"Year": int(yr), "University": full_name, "Actual": float(act), "Predicted": float(pred),
"Residual": float(act - pred), "APE": float(abs(act - pred)/act) if act != 0 else 0,
"Method": "Base Linear", "Dataset": "Test"
})
```
## Step 5 — Calibration Shift Calculation
We compute a per-school additive shift, in rate units, so that the model passes exactly through the 2023 actual enrollment when `CALIB_SHIFT = True`.
```{python}
shift_factors = {}
for key in SCHOOL_KEYS:
model_rate_2023 = float(models_dict[key].predict(np.array([[t_2023]]))[0])
actual_enroll_2023 = float(df_ipeds.loc[2023, key])
gpi_pop_2023 = float(df_weighted_proj.loc[2023, key])
implied_gpi_rate_2023 = actual_enroll_2023 / gpi_pop_2023
shift = (implied_gpi_rate_2023 - model_rate_2023) if CALIB_SHIFT else 0.0
shift_factors[key] = shift
display(Markdown("**Calibration Shifts (Base Linear Models):**"))
display(
pd.DataFrame.from_dict(
shift_factors, orient="index", columns=["Shift"]
).style.format("{:.4f}")
)
```
## Step 6 & 7 — Final Forecasts & Regional Control Totals
**Why a Control Total?** Left unchecked, applying straight-line growth to growing populations could imply that an impossibly high share of the region attends college. We set a `PENETRATION_CEILING` slightly above the historical peak regional rate and proportionally scale down each college's projection whenever total projected regional enrollment would exceed this ceiling.
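The scale-down mechanics in isolation (invented numbers; `raw` plays the role of the per-school unconstrained projections in the next cell):

```{python}
# Toy illustration of the proportional scale-down under the ceiling.
raw = {"A": 30_000, "B": 20_000}   # unconstrained per-school projections
ceiling_enroll = 40_000            # PENETRATION_CEILING x regional 18-24 pop
total_raw = sum(raw.values())      # 50,000
scale = min(1.0, ceiling_enroll / total_raw)   # 0.8
capped = {k: v * scale for k, v in raw.items()}
# Relative shares are preserved: A still accounts for 60% of the capped total
```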
```{python}
total_enroll_hist = df_ipeds[SCHOOL_KEYS].sum(axis=1).loc[2000:2023]
total_pop_hist_5cty = df_census_full[TARGET_COUNTIES].sum(axis=1).loc[2000:2023]
regional_rate_hist = (total_enroll_hist / total_pop_hist_5cty).dropna()
HIST_MAX_PENETRATION = float(regional_rate_hist.max())
PENETRATION_CEILING = HIST_MAX_PENETRATION * 1.02 # 2% buffer above historical peak
FORECAST_YEARS = list(range(2024, 2061))
df_total_gpi_pop = df_gpi_ext[TARGET_COUNTIES].sum(axis=1)
forecast_records = []
regional_rate_proj = []
for year in FORECAST_YEARS:
t = year - 1999
total_gpi_pop = float(df_total_gpi_pop.loc[year])
ceiling_enroll = PENETRATION_CEILING * total_gpi_pop
raw = {}
for key in SCHOOL_KEYS:
model_rate = float(models_dict[key].predict(np.array([[t]]))[0])
# Decay calibration shift to zero over 10 years for WESTMIN only
if key == "WESTMIN":
decay_factor = max(0.0, 1.0 - (year - 2023) / 10.0) if CALIB_SHIFT else 0.0
active_shift = shift_factors[key] * decay_factor
else:
active_shift = shift_factors[key]
gpi_pop = float(df_weighted_proj.loc[year, key])
raw[key] = max(0.0, (model_rate + active_shift) * gpi_pop)
total_raw = sum(raw.values())
proj_rate = total_raw / total_gpi_pop
if total_raw > ceiling_enroll:
scale_factor = ceiling_enroll / total_raw
capped = True
else:
scale_factor = 1.0
capped = False
regional_rate_proj.append({
"Year": year, "Rate": proj_rate * scale_factor,
"Type": "Projected", "Capped": capped,
})
for key, raw_enroll in raw.items():
forecast_records.append({
"Year": year, "Univ_Code": key,
"University": DISPLAY_NAMES[key],
"Enrollment": int(round(raw_enroll * scale_factor)),
"Type": "Projected",
})
df_forecast_base = pd.DataFrame(forecast_records)
```
---
# Flagship Override Adjustment
BYU and the University of Utah draw students nationally and internationally, making their enrollment largely independent of local Wasatch Front demographics. Tying their forecast to the regional college-age population via participation rates produces unrealistic trajectories as local cohort sizes fluctuate.
Instead, we model each flagship's enrollment directly as a function of time:
$$ \text{Enrollment}(t) = \alpha + \beta \cdot t, \quad t = \text{Year} - 1999 $$
This treats their historical enrollment trend as the best predictor of future enrollment, bypassing catchment weights, weighted populations, and participation-rate mechanics entirely. The resulting forecasts are then substituted back into the base linear forecast for BYU and UofU only; all other universities retain their original participation-rate projections.
```{python}
FLAGSHIP_KEYS = ["BYU", "UofU"]
flagship_models = {}
flagship_metrics = []
flagship_shifts = {}
for key in FLAGSHIP_KEYS:
full_name = DISPLAY_NAMES[key]
enroll_series = df_ipeds[key].dropna()
years_arr = enroll_series.index.values.astype(float)
enroll_arr = enroll_series.values
t_arr = years_arr - 1999
X = t_arr.reshape(-1, 1)
train_mask = years_arr < TEST_START_YEAR
test_mask = ~train_mask
years_train = years_arr[train_mask].astype(int)
years_test = years_arr[test_mask].astype(int)
enroll_train = enroll_arr[train_mask]
enroll_test = enroll_arr[test_mask]
X_train = X[train_mask]
X_test = X[test_mask]
model = LinearRegression().fit(X_train, enroll_train)
flagship_models[key] = model
alpha = float(model.intercept_)
beta = float(model.coef_[0])
pred_train = model.predict(X_train)
pred_test = model.predict(X_test) if test_mask.any() else None
# Flagship Train Metrics
r2_train = r2_score(enroll_train, pred_train)
mape_train = mean_absolute_percentage_error(enroll_train, pred_train)
flagship_metrics.append(
{
"University": full_name,
"Dataset": "Train",
"Method": "Direct Flagship Override",
"Alpha": round(alpha, 1),
"Beta": round(beta, 1),
"R2": round(r2_train, 4),
"MAPE": round(mape_train, 4),
}
)
# Collect Residuals (Train)
for yr, act, pred in zip(years_train, enroll_train, pred_train):
residual_records.append(
{
"Year": int(yr),
"University": full_name,
"Actual": float(act),
"Predicted": float(pred),
"Residual": float(act - pred),
"APE": float(abs(act - pred) / act) if act != 0 else 0,
"Method": "Direct Flagship Override",
"Dataset": "Train",
}
)
# Flagship Test Metrics
if test_mask.any():
r2_test = r2_score(enroll_test, pred_test)
mape_test = mean_absolute_percentage_error(enroll_test, pred_test)
flagship_metrics.append(
{
"University": full_name,
"Dataset": "Test",
"Method": "Direct Flagship Override",
"Alpha": round(alpha, 1),
"Beta": round(beta, 1),
"R2": round(r2_test, 4),
"MAPE": round(mape_test, 4),
}
)
# Collect Residuals (Test)
for yr, act, pred in zip(years_test, enroll_test, pred_test):
residual_records.append(
{
"Year": int(yr),
"University": full_name,
"Actual": float(act),
"Predicted": float(pred),
"Residual": float(act - pred),
"APE": float(abs(act - pred) / act) if act != 0 else 0,
"Method": "Direct Flagship Override",
"Dataset": "Test",
}
)
# Calibration Shift for Override (in enrollment units)
model_enroll_2023 = float(model.predict(np.array([[t_2023]]))[0])
actual_enroll_2023 = float(df_ipeds.loc[2023, key])
flagship_shifts[key] = (
(actual_enroll_2023 - model_enroll_2023) if CALIB_SHIFT else 0.0
)
display(Markdown("**Calibration Shifts (Flagship Direct Models - Students):**"))
display(
pd.DataFrame.from_dict(
flagship_shifts, orient="index", columns=["Enrollment Shift"]
).astype(int)
)
# Generate Override Forecast
flagship_override_records = []
for year in FORECAST_YEARS:
t = year - 1999
for key in FLAGSHIP_KEYS:
raw_enroll = float(flagship_models[key].predict(np.array([[t]]))[0])
final_enroll = max(0, int(round(raw_enroll + flagship_shifts[key])))
flagship_override_records.append(
{
"Year": year,
"Univ_Code": key,
"University": DISPLAY_NAMES[key],
"Enrollment": final_enroll,
"Type": "Projected",
}
)
df_flagship_override = pd.DataFrame(flagship_override_records)
# Drop original BYU/UofU projected rows and substitute flagship overrides
df_forecast_adjusted = pd.concat(
[
df_forecast_base[~df_forecast_base["Univ_Code"].isin(FLAGSHIP_KEYS)],
df_flagship_override,
],
ignore_index=True,
).sort_values(["Univ_Code", "Year"])
```
---
# Policy Adjustments
## BYU Medical School (2027-2030)
**Background:** Brigham Young University plans to open a medical school starting in 2027. Each entering class is expected to number roughly 100 students.
**Enrollment Impact:**
- **2027:** +100 students (first year class)
- **2028:** +100 students (second year class added; total +200)
- **2029:** +100 students (third year class added; total +300)
- **2030:** +100 students (fourth year class added; total +400)
- **2031 onwards:** Steady state at +400 students (first cohort graduates as new cohort enters)
This adjustment is applied to BYU's flagship override forecast to reflect the planned expansion.
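The cumulative ramp-up reduces to a one-line formula, sketched here in isolation as a sanity check (the constants mirror those defined in the chunk below):

```python
MED_START, MED_ANNUAL, MED_LENGTH = 2027, 100, 4  # start year, cohort size, program length

def med_school_adjustment(year: int) -> int:
    """Added enrollment from the medical school in a given year."""
    if year < MED_START:
        return 0
    # Cohorts simultaneously on campus, capped at program length
    cohorts = min(year - MED_START + 1, MED_LENGTH)
    return cohorts * MED_ANNUAL

# Ramp-up 2027-2030 adds one cohort per year; steady state of +400 from 2031 on
print([med_school_adjustment(y) for y in range(2026, 2033)])
# → [0, 100, 200, 300, 400, 400, 400]
```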
```{python}
# BYU Medical School Enrollment Adjustment
MEDICAL_SCHOOL_START_YEAR = 2027
MEDICAL_SCHOOL_ANNUAL_ENROLLMENT = 100
MEDICAL_SCHOOL_PROGRAM_LENGTH = 4 # MD program is 4 years
# Apply cumulative enrollment adjustment
for year in FORECAST_YEARS:
if year >= MEDICAL_SCHOOL_START_YEAR:
# Calculate cumulative enrollment (capped at steady-state)
years_since_start = year - MEDICAL_SCHOOL_START_YEAR
cohorts_enrolled = min(years_since_start + 1, MEDICAL_SCHOOL_PROGRAM_LENGTH)
medical_enrollment = cohorts_enrolled * MEDICAL_SCHOOL_ANNUAL_ENROLLMENT
# Apply adjustment to BYU forecast
mask = (df_forecast_adjusted['Univ_Code'] == 'BYU') & \
(df_forecast_adjusted['Year'] == year)
df_forecast_adjusted.loc[mask, 'Enrollment'] += medical_enrollment
# Log the adjustment
print(f"Applied BYU Medical School adjustment:")
print(f" - Annual enrollment: {MEDICAL_SCHOOL_ANNUAL_ENROLLMENT} students/year")
print(f" - Start year: {MEDICAL_SCHOOL_START_YEAR}")
print(f" - Ramp-up period: {MEDICAL_SCHOOL_PROGRAM_LENGTH} years (2027-2030)")
print(f" - Steady-state enrollment: {MEDICAL_SCHOOL_PROGRAM_LENGTH * MEDICAL_SCHOOL_ANNUAL_ENROLLMENT} students (2031+)")
```
---
# Campus-Level Disaggregation (2000-2060)
For Home Based College (HBC) travel modeling, we disaggregate institution-level forecasts to specific campus locations across the full time series (2000-2060).
**Methodology:**
- **Non-UVU institutions**: Apply 2025 USHE campus splits to all years (2000-2060)
- **UVU campuses**: Apply time-varying proportions from historical spatial planning data
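The two split modes differ only in whether the campus share is constant over time. A minimal sketch with made-up numbers (the shares and totals here are illustrative, not the actual USHE values):

```python
# Fixed split (non-UVU): one 2025 share per campus, applied to every year
fixed_shares = {"WSU_Main": 0.80, "WSU_Davis": 0.20}  # illustrative shares
total_2030 = 25_000                                   # illustrative institution total
fixed = {campus: round(total_2030 * share) for campus, share in fixed_shares.items()}
print(fixed)  # → {'WSU_Main': 20000, 'WSU_Davis': 5000}

# Time-varying split (UVU): a different proportion vector for each year
uvu_props = {2030: {"UVU_Main": 0.95, "UVU_Lehi": 0.05},
             2050: {"UVU_Main": 0.90, "UVU_Lehi": 0.10}}  # illustrative proportions
uvu_total = {2030: 55_000, 2050: 70_000}                  # illustrative totals
varying = {yr: {campus: round(uvu_total[yr] * p) for campus, p in props.items()}
           for yr, props in uvu_props.items()}
```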
## Step 1: Load USHE 2025 Campus Enrollment Data
```{python}
# Load USHE 2025 enrollment data
df_ushe_2025 = pd.read_excel(
PATHS["ushe_2025"],
sheet_name="Sheet1"
)
print(f"Total USHE records loaded: {len(df_ushe_2025)}")
print(f"\nUnique terms: {sorted(df_ushe_2025['Term'].unique())}")
print(f"Unique years: {sorted(df_ushe_2025['Academic Year'].unique())}")
# Filter to Academic Year 2025, Term 2 (Fall semester)
df_ushe_fall_2025 = df_ushe_2025[
(df_ushe_2025['Academic Year'] == 2025) &
(df_ushe_2025['Term'] == 2)
].copy()
print(f"\nFiltered to Fall 2025: {len(df_ushe_fall_2025)} records")
# Aggregate headcount by Institution and Site (sum across all ZIP codes)
df_campus_enrollment = df_ushe_fall_2025.groupby(['Institution', 'Site'])['Headcount'].sum().reset_index()
print(f"Unique campuses: {len(df_campus_enrollment)}")
# Show totals by institution
print("\n=== Total Enrollment by Institution (Fall 2025) ===")
inst_totals = df_campus_enrollment.groupby('Institution')['Headcount'].sum().sort_values(ascending=False)
for inst, count in inst_totals.items():
print(f"{inst}: {count:,}")
```
## Step 2: Define Campus Mapping (Non-UVU Institutions)
```{python}
# Map USHE Site names to campus codes for non-UVU institutions
CAMPUS_MAPPING = {
'Salt Lake Community College': {
'Main Campus': 'SLCC_MAIN',
'SLCC - Jordan Campus': 'SLCC_JD',
'SLCC - Larry H. Miller Campus': 'SLCC_ML',
'SLCC - South City Center': 'SLCC_SC',
},
'University Of Utah': {
'Main Campus': 'UOFU_Main',
},
'Weber State University': {
'Main Campus': 'WSU_Main',
'WSU - Davis Center': 'WSU_Davis',
}
}
# Single-campus institutions (no disaggregation needed)
SINGLE_CAMPUS_CODES = ['BYU', 'ENSIGN', 'WESTMIN']
# Map institution codes to USHE names
INSTITUTION_NAME_MAP = {
'SLCC': 'Salt Lake Community College',
'UofU': 'University Of Utah',
'WSU': 'Weber State University',
}
```
## Step 3: Calculate Campus Percentages (Non-UVU)
```{python}
# Calculate percentage splits for non-UVU institutions
campus_percentages = {}
print("\n=== Calculating Campus Percentages (Non-UVU) ===")
for institution, campus_map in CAMPUS_MAPPING.items():
print(f"\n{institution}:")
inst_data = df_campus_enrollment[df_campus_enrollment['Institution'] == institution]
if len(inst_data) == 0:
print(f" WARNING: No data found")
continue
# Get enrollment for mapped campuses only
mapped_campuses = list(campus_map.keys())
mapped_data = inst_data[inst_data['Site'].isin(mapped_campuses)]
total_mapped = mapped_data['Headcount'].sum()
print(f" Total mapped enrollment: {total_mapped:,}")
if total_mapped > 0:
for site, campus_code in campus_map.items():
site_enrollment = mapped_data[mapped_data['Site'] == site]['Headcount'].sum()
percentage = site_enrollment / total_mapped if total_mapped > 0 else 0
campus_percentages[campus_code] = percentage
print(f" {campus_code}: {site_enrollment:,} students ({percentage:.1%})")
print("\n" + "="*60)
print("Final Non-UVU Campus Split Percentages:")
for campus, pct in sorted(campus_percentages.items()):
print(f" {campus}: {pct:.1%}")
```
## Step 4: Hardcode UVU Campus Historical Data
```{python}
# Hardcoded UVU campus enrollment from previous RTP spatial planning
# Source: Historical records and approved campus expansion plans
# Note: UVU-Heber excluded (outside WF-TDM region)
uvu_campus_hardcoded = pd.DataFrame({
'Year': list(range(2000, 2061)),
'UVU_Orem': [20946, 22609, 23609, 23803, 24149, 24487, 23305, 23840, 27052, 29045,
31739, 32734, 31810, 30880, 31589, 33565, 35126, 37785, 39931, 40278,
39438, 39715, 41504, 43105, 44720, 46420, 47173, 48078, 48855, 48850,
49762, 50673, 51583, 52494, 53405, 54317, 55227, 56138, 57049, 57960,
58871, 59781, 60693, 61604, 62515, 63426, 64338, 65248, 66159, 67069,
67981, 67981, 67981, 67981, 67981, 67981, 67981, 67981, 67981, 67981, 67981],
'UVU_Lehi': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1300,
1342, 1384, 1426, 1468, 1510, 1552, 1594, 1635, 1677, 1719,
1761, 1803, 1845, 1887, 1929, 1971, 2013, 2055, 2097, 2139,
2181, 2223, 2265, 2306, 2348, 2390, 2432, 2474, 2516, 2558,
2600, 2600, 2600, 2600, 2600, 2600, 2600, 2600, 2600, 2600, 2600],
'UVU_Payson': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 650,
762, 874, 986, 1098, 1210, 1321, 1433, 1545, 1657, 1769,
1881, 1993, 2105, 2217, 2329, 2440, 2552, 2664, 2776, 2888,
3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000],
'UVU_Vineyard1': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 570, 725, 879, 1034,
1187, 1342, 1496, 1651, 1805, 1960, 2114, 2269, 2422, 2577,
2731, 2886, 3040, 3195, 3349, 3504, 3657, 3812, 3966, 4121,
4275, 4275, 4275, 4275, 4275, 4275, 4275, 4275, 4275, 4275, 4275],
'UVU_Vineyard2': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 30, 38, 46, 54,
63, 71, 79, 87, 95, 103, 111, 119, 128, 136,
144, 152, 160, 168, 176, 184, 193, 201, 209, 217,
225, 225, 225, 225, 225, 225, 225, 225, 225, 225, 225],
})
# Calculate total UVU enrollment from hardcoded data
uvu_campus_hardcoded['UVU_Total_Hardcoded'] = (
uvu_campus_hardcoded['UVU_Orem'] +
uvu_campus_hardcoded['UVU_Lehi'] +
uvu_campus_hardcoded['UVU_Payson'] +
uvu_campus_hardcoded['UVU_Vineyard1'] +
uvu_campus_hardcoded['UVU_Vineyard2']
)
print("=== UVU Campus Hardcoded Data ===")
print(f"Years covered: {uvu_campus_hardcoded['Year'].min()} - {uvu_campus_hardcoded['Year'].max()}")
print("\nSample (2019-2026 - satellite campus openings):")
print(uvu_campus_hardcoded[uvu_campus_hardcoded['Year'].between(2019, 2026)].to_string(index=False))
```
## Step 5: Calculate UVU Time-Varying Campus Proportions
```{python}
# Calculate proportion of total UVU enrollment at each campus for each year
# These proportions will be applied to our forecasted UVU total
uvu_campus_proportions = uvu_campus_hardcoded.copy()
for campus in ['UVU_Orem', 'UVU_Lehi', 'UVU_Payson', 'UVU_Vineyard1', 'UVU_Vineyard2']:
prop_col = campus.replace('UVU_', 'prop_')
uvu_campus_proportions[prop_col] = (
uvu_campus_proportions[campus] / uvu_campus_proportions['UVU_Total_Hardcoded']
)
# Keep proportions for merging
uvu_props_clean = uvu_campus_proportions[['Year', 'prop_Orem', 'prop_Lehi', 'prop_Payson', 'prop_Vineyard1', 'prop_Vineyard2']].copy()
print("\n=== UVU Campus Proportions (Sample Years) ===")
print(uvu_props_clean[uvu_props_clean['Year'].isin([2000, 2019, 2026, 2029, 2050, 2060])].to_string(index=False))
```
## Step 6: Create Combined Historical + Forecast Dataset
```{python}
# Combine historical enrollment (2000-2023) with forecast (2024-2060)
# Historical: df_enroll_ipeds
# Forecast: df_forecast_adjusted
# Prepare historical data
df_hist_inst = df_enroll_ipeds.copy()
df_hist_inst = df_hist_inst[df_hist_inst['Year'].between(2000, 2023)].copy()
# Melt to long format with proper column names
df_hist_long = df_hist_inst.melt(id_vars=['Year'], var_name='Univ_Code', value_name='Enrollment')
df_hist_long = df_hist_long.dropna()
df_hist_long['Type'] = 'Historical'
# Prepare forecast data
df_forecast_long = df_forecast_adjusted[['Year', 'Univ_Code', 'Enrollment']].copy()
df_forecast_long['Type'] = 'Projected'
# Combine
df_combined_inst = pd.concat([df_hist_long, df_forecast_long], ignore_index=True)
df_combined_inst = df_combined_inst.sort_values(['Univ_Code', 'Year']).reset_index(drop=True)
print(f"\n=== Combined Institution-Level Dataset ===")
print(f"Years: {df_combined_inst['Year'].min()} - {df_combined_inst['Year'].max()}")
print(f"Total records: {len(df_combined_inst)}")
print(f"Historical: {len(df_combined_inst[df_combined_inst['Type']=='Historical'])}")
print(f"Projected: {len(df_combined_inst[df_combined_inst['Type']=='Projected'])}")
```
## Step 7: Disaggregate to Campus Level (2000-2060)
```{python}
# Create year-indexed output dataframe
all_years = sorted(df_combined_inst['Year'].unique())
df_campus_output = pd.DataFrame({'YEAR': all_years})
print(f"\n=== Disaggregating to Campus Level ===")
print(f"Processing {len(all_years)} years from {min(all_years)} to {max(all_years)}")
# Process each institution
for univ_code in df_combined_inst['Univ_Code'].unique():
print(f"\n{univ_code}:")
# Get enrollment time series
inst_data = df_combined_inst[df_combined_inst['Univ_Code'] == univ_code][['Year', 'Enrollment']].copy()
inst_data = inst_data.set_index('Year').sort_index()
# === SINGLE-CAMPUS INSTITUTIONS ===
if univ_code in SINGLE_CAMPUS_CODES:
df_campus_output[univ_code] = df_campus_output['YEAR'].map(inst_data['Enrollment'])
print(f" Single campus - assigned directly")
# === UVU - SPECIAL TIME-VARYING PROPORTIONS ===
elif univ_code == 'UVU':
print(f" Applying time-varying campus proportions")
# Merge our UVU forecast with proportions
uvu_merged = inst_data.reset_index().merge(
uvu_props_clean,
on='Year',
how='left'
)
# Apply proportions to forecasted enrollment
for campus_name, prop_name in [
('UVU_Main', 'prop_Orem'),
('UVU_Lehi', 'prop_Lehi'),
('UVU_Payson', 'prop_Payson'),
('UVU_Vineyard1', 'prop_Vineyard1'),
('UVU_Vineyard2', 'prop_Vineyard2')
]:
uvu_merged[campus_name] = (uvu_merged['Enrollment'] * uvu_merged[prop_name]).round().astype(int)
# Map to output dataframe
uvu_merged_indexed = uvu_merged.set_index('Year')
df_campus_output['UVU_Main'] = df_campus_output['YEAR'].map(uvu_merged_indexed['UVU_Main'])
df_campus_output['UVU_Lehi'] = df_campus_output['YEAR'].map(uvu_merged_indexed['UVU_Lehi'])
df_campus_output['UVU_Payson'] = df_campus_output['YEAR'].map(uvu_merged_indexed['UVU_Payson'])
df_campus_output['UVU_Vineyard1'] = df_campus_output['YEAR'].map(uvu_merged_indexed['UVU_Vineyard1'])
df_campus_output['UVU_Vineyard2'] = df_campus_output['YEAR'].map(uvu_merged_indexed['UVU_Vineyard2'])
# Combined Vineyard column
df_campus_output['UVU_Vineyard'] = (
df_campus_output['UVU_Vineyard1'].fillna(0) + df_campus_output['UVU_Vineyard2'].fillna(0)
)
print(f" Created: UVU_Main, UVU_Lehi, UVU_Payson, UVU_Vineyard1, UVU_Vineyard2, UVU_Vineyard")
# === MULTI-CAMPUS (NON-UVU) - APPLY 2025 USHE SPLITS ===
elif univ_code in INSTITUTION_NAME_MAP:
institution = INSTITUTION_NAME_MAP[univ_code]
campus_map = CAMPUS_MAPPING.get(institution, {})
print(f" Applying 2025 USHE campus splits")
for site, campus_code in campus_map.items():
percentage = campus_percentages.get(campus_code, 0)
if percentage > 0:
df_campus_output[campus_code] = df_campus_output['YEAR'].map(
lambda y: int(round(inst_data.loc[y, 'Enrollment'] * percentage))
if y in inst_data.index else 0
)
print(f" {campus_code}: {percentage:.1%}")
else:
df_campus_output[campus_code] = 0
print(f" {campus_code}: WARNING - no split data")
print(f"\n{'='*60}")
print(f"Generated {len(df_campus_output.columns)-1} campus columns")
```
## Step 8: Finalize Column Order and Export
```{python}
# Define desired column order
COLUMN_ORDER = [
'YEAR', 'BYU', 'ENSIGN',
'SLCC_MAIN', 'SLCC_JD', 'SLCC_ML', 'SLCC_SC',
'UOFU_Main',
'UVU_Main', 'UVU_Lehi', 'UVU_Payson', 'UVU_Vineyard1', 'UVU_Vineyard2', 'UVU_Vineyard',
'WSU_Main', 'WSU_Davis',
'WESTMIN'
]
# Ensure all columns exist
for col in COLUMN_ORDER:
if col not in df_campus_output.columns:
print(f"WARNING: Column {col} missing, adding as zeros")
df_campus_output[col] = 0
# Reorder and clean
df_campus_final = df_campus_output[COLUMN_ORDER].copy()
df_campus_final = df_campus_final.sort_values('YEAR').reset_index(drop=True)
# Convert to integers
for col in COLUMN_ORDER:
if col != 'YEAR':
df_campus_final[col] = df_campus_final[col].fillna(0).astype(int)
# Display summary
print("\n=== Campus-Level Enrollment Forecast (2000-2060) ===")
print("\nFirst 5 years:")
print(df_campus_final.head())
print("\nSatellite campus opening years (2019-2030):")
print(df_campus_final[df_campus_final['YEAR'].between(2019, 2030)])
print("\nLast 5 years:")
print(df_campus_final.tail())
# Verification: Check UVU total
df_campus_final_check = df_campus_final.copy()
df_campus_final_check['UVU_Total_Check'] = (
df_campus_final_check['UVU_Main'] + df_campus_final_check['UVU_Lehi'] +
df_campus_final_check['UVU_Payson'] + df_campus_final_check['UVU_Vineyard1'] +
df_campus_final_check['UVU_Vineyard2']
)
print("\n=== UVU Total Verification (Sample Years) ===")
verification_years = [2000, 2019, 2026, 2029, 2050, 2060]
print(df_campus_final_check[df_campus_final_check['YEAR'].isin(verification_years)][
['YEAR', 'UVU_Main', 'UVU_Lehi', 'UVU_Payson', 'UVU_Vineyard1', 'UVU_Vineyard2', 'UVU_Total_Check']
].to_string(index=False))
```
## Export Campus-Level Forecast
```{python}
# Export disaggregated campus-level forecast (2000-2060)
df_campus_final.to_csv(PATHS['campus_forecast_output'], index=False)
print(f"\n{'='*60}")
print(f"Campus-level forecast (2000-2060) saved to:")
print(f" {PATHS['campus_forecast_output']}")
print(f"\nTotal years: {len(df_campus_final)}")
print(f"Total campuses: {len(COLUMN_ORDER) - 1}")
```
---
# OJS Data Preparation & Exports
```{python}
# ── Export Only the Final Adjusted Output CSV ─────────────────────────────────
df_forecast_adjusted.to_csv(PATHS["forecast_output"], index=False)
# ── Metrics & Residuals Export ────────────────────────────────────────────────
df_all_metrics = pd.concat([pd.DataFrame(model_metrics), pd.DataFrame(flagship_metrics)], ignore_index=True)
df_residuals = pd.DataFrame(residual_records)
# ── Prepare Forecast Data for OJS ─────────────────────────────────────────────
df_history_long = df_enroll_ipeds.melt(id_vars=["Year"], var_name="Univ_Code", value_name="Enrollment").dropna()
df_history_long["University"] = df_history_long["Univ_Code"].map(DISPLAY_NAMES)
df_history_long["Type"] = "Historical"
df_history_long["Scenario"] = "Both"
df_proj_base = df_forecast_base.copy()
df_proj_base["Scenario"] = "Off"
df_proj_adj = df_forecast_adjusted.copy()
df_proj_adj["Scenario"] = "On"
# Combine everything for the UI Toggle
df_final_trend = pd.concat(
[
df_history_long[["Year", "University", "Enrollment", "Type", "Scenario"]],
df_proj_base[["Year", "University", "Enrollment", "Type", "Scenario"]],
df_proj_adj[["Year", "University", "Enrollment", "Type", "Scenario"]],
],
ignore_index=True,
).sort_values(["University", "Year"])
# Regional rate data (Historical & Base Model only)
df_regional_rate_long = regional_rate_hist.reset_index()
df_regional_rate_long.columns = ["Year", "Rate"]
df_regional_rate_long["Type"] = "Historical"
ojs_define(
data_final_forecast=df_final_trend.to_dict(orient="records"),
data_regional_rate_hist=df_regional_rate_long.to_dict(orient="records"),
data_regional_rate_proj=pd.DataFrame(regional_rate_proj).to_dict(orient="records"),
control_ceiling=float(PENETRATION_CEILING),
data_model_metrics=df_all_metrics.to_dict(orient="records"),
data_residuals=df_residuals.to_dict(orient="records")
)
```
---
# Results & Diagnostics
## Model Performance
The following tables and charts report accuracy metrics (R² and Mean Absolute Percentage Error) for both the Base Linear Model (participation rate) and the Direct Flagship Override method. The Test rows measure performance against the held-out 2020–2023 actuals.
**How to interpret these metrics:**
* **$R^2$ Score:** Measures how much of the historical enrollment trend is explained by our model. You want this number as close to `1.0` as possible. Compare the "Train" and "Test" scores—a massive drop in the Test score suggests the model is overfitting to past data and struggling with recent years.
* **MAPE (Mean Absolute Percentage Error):** The average percentage difference between the model's prediction and the actual enrollment. Lower is better. A MAPE of `2%` means the model is usually off by about 2% of the total student body.
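Both metrics follow directly from their definitions. This standalone sketch on toy numbers matches what scikit-learn's `r2_score` and `mean_absolute_percentage_error` compute:

```python
actual    = [100, 110, 120, 130]
predicted = [102, 108, 123, 127]

# R²: share of variance around the mean explained by the predictions
mean_actual = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
r2 = 1 - ss_res / ss_tot

# MAPE: mean of |error| relative to the actual value
mape = sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

print(round(r2, 3), round(mape, 4))  # → 0.948 0.0216
```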
```{ojs}
//| label: table-metrics
Inputs.table(data_model_metrics, {
sort: "Dataset", reverse: true,
columns: ["University", "Dataset", "Method", "R2", "MAPE"],
format: {
R2: d => d.toFixed(3),
MAPE: d => (d * 100).toFixed(2) + "%"
}
})
```
```{ojs}
//| label: fig-metrics-bar
//| fig-cap: "R² Accuracy Score by University (Test Dataset vs Train Dataset)"
Plot.plot({
marks: [
Plot.barY(data_model_metrics, {
x: "University", y: "R2", fill: "Dataset",
title: d => `${d.University} (${d.Dataset})\nR²: ${d.R2.toFixed(3)}`,
sort: { x: "y", reverse: true }
}),
Plot.ruleY([0])
],
facet: { data: data_model_metrics, x: "Dataset" },
x: { tickRotate: -45, label: null },
y: { grid: true, label: "R² Score" },
color: { legend: true, range: ["#4e79a7", "#f28e2c"] },
width: 800, height: 400, marginLeft: 50, marginBottom: 120,
style: { backgroundColor: "transparent", color: "var(--bs-body-color)" }
})
```
### Residual Analysis
These charts illustrate how well the linear trends predict historical data.
**What to look for in the Residuals plot:**
* **Random Scatter:** Points should be randomly scattered above and below the zero line.
* **Avoid Patterns:** If you see a distinct "U" shape or a consistent diagonal slope, it indicates the university's growth might be non-linear, and a simple linear regression is consistently over- or under-predicting during certain decades.
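One crude way to quantify "random scatter" is to count sign changes along the residual sequence: white-noise residuals flip sign on roughly half of the transitions, while a U-shape or persistent drift produces long same-sign runs. A heuristic sketch, not part of the pipeline:

```python
def sign_change_fraction(residuals):
    """Fraction of consecutive residual pairs that change sign (zeros ignored)."""
    signs = [r > 0 for r in residuals if r != 0]
    if len(signs) < 2:
        return float("nan")
    changes = sum(a != b for a, b in zip(signs, signs[1:]))
    return changes / (len(signs) - 1)

print(sign_change_fraction([3, -2, 4, -1, 2, -3]))    # alternating noise → 1.0
print(sign_change_fraction([4, 2, -1, -3, -1, 2, 4])) # U-shape → 2 changes / 6 ≈ 0.33
```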
```{ojs}
//| label: fig-residuals
//| fig-cap: "Residuals (Actual - Predicted) over Time"
Plot.plot({
marks: [
Plot.ruleY([0], { stroke: "gray", strokeDasharray: "4,4" }),
Plot.dot(data_residuals, {
x: "Year", y: "Residual", fill: "University",
title: d => `${d.University}\nYear: ${d.Year}\nResidual: ${d.Residual.toFixed(0)}\nMethod: ${d.Method}`,
r: 4
})
],
facet: { data: data_residuals, x: "Method" },
color: { legend: true },
y: { grid: true, label: "Residual (Students)" },
x: { tickFormat: "d" },
width: 860,
style: { backgroundColor: "transparent", color: "var(--bs-body-color)", fontFamily: "sans-serif", fontSize: "12px" }
})
```
**What to look for in the Absolute Percentage Error (APE) plot:**
* Because universities vary wildly in size, raw student residuals can be misleading. Missing by 500 students is a massive error for Westminster but barely noticeable for BYU.
* This chart normalizes the error to the size of the school. Look for sudden spikes in APE, which flag anomalous years that the model failed to capture.
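Those spikes can also be pulled out programmatically. A small sketch over records shaped like the notebook's `residual_records` (the values and the 5% cutoff are illustrative choices, not pipeline settings):

```python
records = [  # same shape as residual_records built earlier; values illustrative
    {"Year": 2019, "University": "Westminster University", "APE": 0.012},
    {"Year": 2020, "University": "Westminster University", "APE": 0.081},
    {"Year": 2021, "University": "Brigham Young University", "APE": 0.004},
]

APE_SPIKE_THRESHOLD = 0.05  # arbitrary illustrative cutoff

spikes = [r for r in records if r["APE"] > APE_SPIKE_THRESHOLD]
for r in spikes:
    print(f"{r['University']} {r['Year']}: APE {r['APE']:.1%}")
# → Westminster University 2020: APE 8.1%
```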
```{ojs}
//| label: fig-ape
//| fig-cap: "Absolute Percentage Error over Time"
Plot.plot({
marks: [
Plot.line(data_residuals, {
x: "Year", y: "APE", stroke: "University", strokeWidth: 1.5
}),
Plot.dot(data_residuals, {
x: "Year", y: "APE", fill: "University",
title: d => `${d.University}\nYear: ${d.Year}\nAPE: ${(d.APE * 100).toFixed(1)}%\nMethod: ${d.Method}`,
r: 3
})
],
facet: { data: data_residuals, x: "Method" },
color: { legend: true },
y: { grid: true, label: "Absolute Percentage Error", tickFormat: "%" },
x: { tickFormat: "d" },
width: 860,
style: { backgroundColor: "transparent", color: "var(--bs-body-color)", fontFamily: "sans-serif", fontSize: "12px" }
})
```
## Regional Control Total
This chart maps the historical and projected regional penetration rate before any flagship adjustments, confirming that the population-based projections stay at or below the control-total ceiling.
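The control total itself is just a clamp on the projected regional rate before it is multiplied by population. A standalone sketch (the ceiling value here is illustrative; the notebook's actual value is `PENETRATION_CEILING`, defined in an earlier chunk):

```python
def apply_control_total(projected_rate: float, ceiling: float) -> tuple[float, bool]:
    """Clamp a projected penetration rate at the regional ceiling; flag if capped."""
    capped = projected_rate > ceiling
    return (min(projected_rate, ceiling), capped)

ceiling = 0.30  # illustrative stand-in for PENETRATION_CEILING
print(apply_control_total(0.27, ceiling))  # below the ceiling → passes through
print(apply_control_total(0.34, ceiling))  # above the ceiling → capped, flagged
```

The boolean flag corresponds to the `Capped` field that the chart below renders as red diamond markers.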
```{ojs}
//| label: fig-control-total
//| fig-cap: "Regional Penetration Rate: Historical and Projected vs Control Total Ceiling"
{
const allRates = [...data_regional_rate_hist, ...data_regional_rate_proj]
return Plot.plot({
marks: [
Plot.ruleY([0]),
Plot.ruleY([control_ceiling], {
stroke: "var(--bs-danger)", strokeWidth: 2, strokeDasharray: "6,3"
}),
Plot.line(allRates, {
filter: d => d.Type === "Historical",
x: "Year", y: "Rate",
stroke: "var(--bs-primary)", strokeWidth: 3
}),
Plot.line(allRates, {
filter: d => d.Type === "Projected",
x: "Year", y: "Rate",
stroke: "var(--bs-secondary)", strokeWidth: 2.5, strokeDasharray: "5,4"
}),
Plot.dot(allRates, {
filter: d => d.Type === "Projected" && d.Capped,
x: "Year", y: "Rate",
fill: "var(--bs-danger)", r: 4, symbol: "diamond"
}),
Plot.text([{year: 2053, rate: control_ceiling}], {
x: "year", y: "rate",
text: () => `Ceiling: ${(control_ceiling * 100).toFixed(1)}%`,
dy: -10, fill: "var(--bs-danger)", fontSize: 11
}),
Plot.tip(allRates, Plot.pointerX({
x: "Year", y: "Rate",
title: d => `Year: ${d.Year}\nRegional Rate: ${(d.Rate * 100).toFixed(1)}%\n${d.Type}${d.Capped ? "\n⚠ Control total applied" : ""}`,
fill: "var(--bs-body-bg)", stroke: "var(--bs-body-color)"
}))
],
width: 780, marginLeft: 60, marginRight: 40,
y: {
grid: true, label: "Regional Penetration Rate",
tickFormat: d => (d * 100).toFixed(0) + "%"
},
x: { tickFormat: "d" },
style: {
backgroundColor: "transparent", color: "var(--bs-body-color)",
fontFamily: "sans-serif", fontSize: "12px"
}
})
}
```
## Enrollment Forecast by University
**Note:** Solid lines represent historical IPEDS data (2000–2023). Dashed lines represent the forecast (2024–2060). Toggle the Flagship Adjustment "On" to see the direct-trend override applied to BYU and UofU, bypassing local demographic constraints.
```{ojs}
//| label: scenario-toggle
viewof adjustment_toggle = Inputs.radio(
["On", "Off"],
{ label: "Flagship Adjustment:", value: "On" }
)
```
```{ojs}
//| label: fig-enrollment-forecast
//| fig-cap: "College Enrollment Forecast by University (2000–2060)"
Plot.plot({
marks: [
Plot.ruleY([0]),
Plot.ruleX([2024], { stroke: "gray", strokeDasharray: "2,2", strokeOpacity: 0.5 }),
Plot.line(data_final_forecast, {
filter: d => d.Type === "Historical",
x: "Year", y: "Enrollment", z: "University",
stroke: "University", strokeWidth: 2.5
}),
Plot.line(data_final_forecast, {
filter: d => d.Type === "Projected" && d.Scenario === adjustment_toggle,
x: "Year", y: "Enrollment", z: "University",
stroke: "University",
strokeWidth: 2.5,
strokeDasharray: "5,5"
}),
Plot.text(data_final_forecast, Plot.selectLast({
filter: d => d.Type === "Historical" || (d.Type === "Projected" && d.Scenario === adjustment_toggle),
x: "Year", y: "Enrollment", z: "University",
text: "University", textAnchor: "start", dx: 8, fill: "var(--bs-body-color)"
}), { stroke: "var(--bs-body-bg)", strokeWidth: 3 }),
Plot.tip(data_final_forecast, Plot.pointerX({
filter: d => d.Type === "Historical" || (d.Type === "Projected" && d.Scenario === adjustment_toggle),
x: "Year", y: "Enrollment",
title: d => `${d.University}\nYear: ${d.Year}\nStudents: ${d.Enrollment.toLocaleString()}\n(${d.Type})`,
fill: "var(--bs-body-bg)", stroke: "var(--bs-body-color)"
}))
],
width: 860, marginLeft: 60, marginRight: 185,
color: { legend: true },
y: { grid: true, label: "Fall Enrollment", tickFormat: "s" },
x: { tickFormat: "d" },
style: {
backgroundColor: "transparent", color: "var(--bs-body-color)",
fontFamily: "sans-serif", fontSize: "12px"
}
})
```
::: {.callout-tip title="Download the output file:"}
* [enrollment_forecast_2024_2060.csv](./intermediate/enrollment_forecast_2024_2060.csv)
* [enrollment_forecast_by_campus_2000_2060.csv](./intermediate/enrollment_forecast_by_campus_2000_2060.csv)
:::