Project link: https://github.com/eric-mc2/DNCTransit
A few weeks ago I fumbled an interview question on statistical modeling. That hurt my pride, so I decided to take on a little modeling project.
With all the local hubbub around the DNC security perimeter, I remember mostly staying home to avoid getting snarled in traffic. This got me wondering how the rest of the city responded to the convention. How did the visiting delegates fare? Did they use public transit? (Chicago was picked to highlight its transportation and infrastructure after all.)
Does the exogenous shock of the 2024 Democratic National Convention in Chicago induce a significant change in public transit usage?
At face value I expect public transit ridership to increase due to the DNC. This is because the Democratic coalition generally lives in urban areas, supports “green” policy, government spending, etc.
Disclaimer: I don’t really have a stake in whether Democrats in general use transit or not, but I like this setup for a few reasons:
Can we even test this? Is the DNC a big enough event compared to normal commuting patterns?
The official DNC impact report says 50,000 delegates visited and spent $58.7M outside of the convention within Chicago. The city-wide average daily CTA ridership is 790,000 +/- 140,000 so it would be hard to detect these extra rides against the normal rhythm of the city.
But we know the convention events are localized to the United Center and McCormick Place. When we only look at CTA routes that pass through these areas, the average daily ridership is 325,000 +/- 52,000. This breaks down to 136,000 bus and 189,000 train rides. So the anticipated impact becomes more noticeable.
Selecting only nearby train stations (ridership data is not available per bus stop), average daily boardings are 11,000. Any fraction of those 50,000 delegates ought to make a huge spike in this context.
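To make the gut check concrete, here is a rough back-of-the-envelope sketch using only the figures quoted above. It assumes, purely for illustration, that every delegate adds exactly one ride per day; the function names are mine, not from the project.

```python
# Figures quoted in the text above.
DELEGATES = 50_000

# (mean daily rides, day-to-day standard deviation) for the two wider scopes.
SCOPES = {
    "city-wide CTA": (790_000, 140_000),
    "routes near the venues": (325_000, 52_000),
}

# Nearby train stations: only the mean is quoted, no standard deviation.
NEARBY_STATION_MEAN = 11_000


def shock_in_sd(extra_rides: float, sd: float) -> float:
    """Express a number of added daily rides in units of day-to-day SD."""
    return extra_rides / sd


def shock_vs_mean(extra_rides: float, mean: float) -> float:
    """Express a number of added daily rides as a multiple of the mean."""
    return extra_rides / mean


if __name__ == "__main__":
    # Assumption: one extra ride per delegate per day.
    for name, (mean, sd) in SCOPES.items():
        print(f"{name}: {shock_in_sd(DELEGATES, sd):.2f} SD of daily ridership")
    ratio = shock_vs_mean(DELEGATES, NEARBY_STATION_MEAN)
    print(f"nearby stations: {ratio:.1f}x the average daily boardings")
```

The narrower the scope, the larger the same 50,000 rides look relative to ordinary variation, which is the intuition behind restricting the analysis to areas near the venues.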
What are the delegates’ other travel options? Walking, driving, Metra, biking, and rideshares. Data is available for the last two: there are 6,000 and 33,000 average daily bike and rideshare trips, respectively, in areas near the United Center and McCormick Place.
With these totals in mind, I’d expect the DNC delegation to significantly increase ridership near the convention centers, no matter what transit mode they choose.
The City of Chicago provides daily ridership totals for train (CTA), bus (CTA), rideshare (Uber, Lyft, etc.), and bikeshare (Divvy). The original data is published at varying spatial aggregations:
I unified these sources into a panel dataset, keeping the nominal spatial unit of each transit mode[1]. Unfortunately I cannot include buses in this analysis because the data is not granular enough to identify rides near the convention centers.
The next step is to label transit rides as near or not near the convention centers.
Convention Centers
I use the City of Chicago buildings shapefile to find the footprint of the United Center and McCormick Place, the two official sites of the DNC. Proximity to these locations will constitute the “treatment” group in the model.
I compute a buffer around each building and find all intersecting transit stations, routes, and tracts. For robustness, I tested 400m, 800m, and 1600m buffers. In practice I was forced to choose the one-mile (1600m) buffer as the catchment size, because the smaller sizes had too few stations and routes nearby.
To test the sensitivity of the buffer, I modeled:
$$\text{rides}_i \sim \text{within 400m}_i + \text{within 800m}_i + \text{within 1600m}_i$$

Average rides per spatial unit varied irregularly as the buffer size increased, meaning results are very sensitive to the buffer size.
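As a minimal sketch of the buffer labels used in that sensitivity model, the snippet below flags a point as within 400m, 800m, or 1600m of the nearest venue. It simplifies heavily: it measures great-circle distance to a single approximate coordinate per venue, whereas the post buffers the actual building footprints. The venue coordinates and all names here are my assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

# Approximate venue coordinates (assumption; the post uses building footprints).
VENUES = {
    "United Center": (41.8807, -87.6742),
    "McCormick Place": (41.8517, -87.6155),
}
BUFFERS_M = (400, 800, 1600)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def buffer_indicators(lat, lon):
    """Nested indicators: is the point within each buffer of the nearest venue?"""
    nearest = min(haversine_m(lat, lon, vlat, vlon) for vlat, vlon in VENUES.values())
    return {f"within_{b}m": nearest <= b for b in BUFFERS_M}


if __name__ == "__main__":
    # Hypothetical point roughly 1 km north of the United Center coordinate.
    print(buffer_indicators(41.8897, -87.6742))
```

Because the indicators are nested, a unit inside the 400m ring is also inside the 800m and 1600m rings, so each coefficient in the model above reads as the incremental difference from moving one ring inward.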
Using Community Areas vs Tracts
I chose to use tract-level data for rideshares. Here’s my reasoning. Tracts mean:
Overall the attenuation bias seemed like the most important consideration[2].
Conference Dates
The DNC itself runs from August 19 to 22. This constitutes the “treatment period” in the model[3][4].
Day of the Week
Commuting patterns have strong weekday/weekend polarity. CTA ridership is drastically higher on weekdays, while on weekends Uber ridership is higher.
Due to the restricted time frame of the DNC, I drop observations from Friday - Sunday. (Ridership on these days doesn’t convey any information about the “treatment” because the DNC only occurs on Monday - Thursday.)
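The two date rules above (the August 19-22 treatment period and dropping Friday through Sunday) can be sketched with the standard library alone. The function names are hypothetical, and the year 2024 comes from the convention named in the research question.

```python
from datetime import date

DNC_START = date(2024, 8, 19)
DNC_END = date(2024, 8, 22)


def keep_day(d: date) -> bool:
    """Keep Monday through Thursday only (date.weekday(): Monday=0 ... Sunday=6)."""
    return d.weekday() <= 3


def is_treatment_day(d: date) -> bool:
    """True for days inside the convention window, inclusive of both ends."""
    return DNC_START <= d <= DNC_END


if __name__ == "__main__":
    for d in (date(2024, 8, 12), date(2024, 8, 19), date(2024, 8, 23)):
        print(d.isoformat(), "keep" if keep_day(d) else "drop", is_treatment_day(d))
```

Since the convention fell on a Monday through Thursday, every treated day survives the weekday filter and no treated observations are lost.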
Transit Density
I coded the distance from the tract centroid to the nearest train, bus, and bike stop. These distances will help control for varying transit density, commercial density, and transit mode preferences.
$$ \sim \beta_1 d(\text{unit}_i, \text{train}) + \beta_2 d(\text{unit}_i, \text{bus}) + \beta_3 d(\text{unit}_i, \text{bike}) $$

Location
I include the stop/tract centroid longitude and latitude as a quadratic term in the model to help control for basic city-wide spatial variation:
$$ \sim \beta_1\text{lon} + \beta_2\text{lat} + \beta_3\text{lon}*\text{lat} + \beta_4\text{lon}^2 + \beta_5\text{lat}^2 $$

(For numeric stability, I scale lon/lat to zero-mean unit variance.)
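Here is a small sketch of how those five location terms could be built from raw coordinates. It standardizes with the population standard deviation, which is my assumption (the post does not say which variant it uses), and the function and column names are hypothetical.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Scale values to zero mean and unit variance (population SD).

    Raises ZeroDivisionError if all values are identical.
    """
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def location_features(lons, lats):
    """Build the quadratic location terms from standardized lon/lat."""
    zlon, zlat = standardize(lons), standardize(lats)
    return [
        {"lon": x, "lat": y, "lon_lat": x * y, "lon_sq": x * x, "lat_sq": y * y}
        for x, y in zip(zlon, zlat)
    ]


if __name__ == "__main__":
    # Hypothetical centroids, for illustration only.
    rows = location_features([-87.70, -87.60, -87.65], [41.80, 41.90, 41.85])
    for row in rows:
        print({k: round(v, 3) for k, v in row.items()})
```

Standardizing before squaring matters: raw longitudes near -87.6 would make the squared and interaction terms nearly collinear with the linear ones.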
Temporal Aggregation
I keep observations at the daily level. Aggregating to weekly would pull non-DNC dates into the treatment period, attenuating the treatment effect.
Time Frame
I use data from the summer months, June, July, and August. Restricting the data to this set makes sense for two reasons. First, I avoid needing to incorporate more complex modeling of seasonality. Second, transit is on a steady rebound reaching 65% of pre-pandemic levels. Though I do model a linear time covariate, the rebound suggests that disaggregated transit usage is in flux as commuter preferences and capacity continue to adjust. Therefore, previous summers are not a good baseline.
Boardings vs Trips
CTA train and bus data only provides the locations where riders board transit, not where they exit. Bike and rideshare data provides per-trip board and exit locations. For strict parity, I ought to drop exit data, but I won’t. Keeping it gives a fuller picture of transit usage[5][6].
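One way to read "keeping exit data" is that each bike or rideshare trip adds one count at its origin unit and one at its destination unit. The sketch below implements that reading; it is an assumption about the bookkeeping, not the project's actual code, and the trip tuples are hypothetical.

```python
from collections import Counter


def daily_counts_from_trips(trips):
    """Count trip ends per (day, unit).

    trips: iterable of (day, origin_unit, destination_unit).
    Each trip contributes one count to its origin and one to its destination,
    so a trip that starts and ends in the same unit counts twice there.
    """
    counts = Counter()
    for day, origin, destination in trips:
        counts[(day, origin)] += 1
        counts[(day, destination)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical trips between made-up tract IDs.
    trips = [
        ("2024-08-19", "tract_A", "tract_B"),
        ("2024-08-19", "tract_B", "tract_B"),
    ]
    print(daily_counts_from_trips(trips))
```

Under this bookkeeping the total count is twice the number of trips, which is why levels are not directly comparable with boardings-only train data.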
Statistical Power
Before running the regression, I want to do a slightly more rigorous version of the time series gut check. Looking at the distribution of daily ridership, let’s ask: what is the minimum number of additional rides to significantly change the mean? First I compute the sample variance and standard error[7], SE, to get the minimum detectable effect size:

$$ \text{MDE} = \big(z(1-\alpha / 2) + z(\beta)\big) \cdot SE $$

Multiplying the MDE by the number of (treated) observations yields the minimum detectable total “shock” to the system.
Transit | Mean Rides per Unit per Day (SD) | Minimum Detectable Change | Minimum Additional Rides |
---|---|---|---|
bike | 50.3 (17.5) | +151% | 10235 |
train | 1482.0 (392.7) | +227% | 58655 |
uber | 753.1 (583.5) | +142% | 55036 |
Let’s interpret the first row of this table:
Corroborating the first time series plot, if a decent fraction of delegates take transit, this regression design should be able to detect an effect.
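The minimum detectable effect calculation in this section can be sketched in a few lines. I read \(z(\beta)\) as the standard normal quantile at the target power, and the defaults of \(\alpha = 0.05\) and 80% power are conventional values I am assuming; the post does not state them. The pooled standard error follows the formula given in the footnote on unbalanced panels, and all inputs in the usage example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def pooled_se(sigma: float, group_sizes) -> float:
    """Pooled standard error: sigma * sqrt(sum(1 / n_i)) over the groups compared."""
    return sigma * sqrt(sum(1 / n for n in group_sizes))


def minimum_detectable_effect(se: float, alpha: float = 0.05, power: float = 0.8) -> float:
    """MDE = (z(1 - alpha/2) + z(power)) * SE, with z the standard normal quantile."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * se


if __name__ == "__main__":
    # Hypothetical inputs: sigma of 20 rides, 60 control days and 4 treated days.
    se = pooled_se(20.0, [60, 4])
    mde = minimum_detectable_effect(se)
    print(f"SE = {se:.2f}, MDE = {mde:.2f} rides per unit per day")
```

With these defaults the multiplier on the standard error is about 2.8, so the detectable effect scales directly with how noisy the units are and how few treated days there are.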
Selection Balance
Here is a balance table for the three separate transit modes:
Transit | Variable | Not Near DNC | Near DNC | P-Value |
---|---|---|---|---|
train | stations | 114 | 8 | |
train | station-days | 6270 | 420 | |
train | daily rides, mean (SD) | 2718.4 (2337.9) | 1482.8 (995.9) | <0.001 |
train | log(daily rides), mean (SD) | 7.6 (0.8) | 6.8 (1.8) | <0.001 |
train | bus_distance, mean (SD) | 305.0 (1472.7) | 165.3 (167.7) | 0.354 |
train | bike_distance, mean (SD) | 387.6 (1535.0) | 288.5 (281.6) | 0.573 |
train | sqrt(area), mean (SD) | 2521.0 (3723.0) | 3123.8 (3564.8) | 0.657 |
train | lat, mean (SD) | -0.1 (0.9) | -0.3 (0.2) | 0.027 |
train | long, mean (SD) | -0.0 (1.0) | 0.2 (0.4) | 0.296 |
bike | docks | 1459 | 47 | |
bike | dock-days | 42316 | 2568 | |
bike | daily rides, mean (SD) | 46.0 (67.7) | 49.5 (34.4) | <0.001 |
bike | log(daily rides), mean (SD) | 2.8 (1.6) | 3.7 (0.8) | <0.001 |
bike | train_distance, mean (SD) | 5007.3 (5396.5) | 1056.5 (1240.4) | <0.001 |
bike | bus_distance, mean (SD) | 146.2 (264.4) | 105.6 (73.5) | 0.002 |
bike | sqrt(area), mean (SD) | 666.9 (1009.0) | 655.0 (420.6) | 0.859 |
bike | lat, mean (SD) | -0.4 (1.3) | -0.3 (0.2) | 0.248 |
bike | long, mean (SD) | -0.3 (1.3) | 0.4 (0.5) | <0.001 |
uber | tracts | 1196 | 40 | |
uber | tract-days | 52995 | 2189 | |
uber | daily rides, mean (SD) | 213.8 (802.3) | 737.9 (1714.9) | <0.001 |
uber | log(daily rides), mean (SD) | 3.8 (1.7) | 5.5 (1.5) | <0.001 |
uber | train_distance, mean (SD) | 15886.8 (19731.6) | 2293.7 (1027.7) | <0.001 |
uber | bus_distance, mean (SD) | 9390.3 (17842.2) | 503.0 (313.2) | <0.001 |
uber | bike_distance, mean (SD) | 10707.4 (18691.3) | 889.5 (504.9) | <0.001 |
uber | sqrt(area), mean (SD) | 1134.0 (691.0) | 784.4 (320.0) | <0.001 |
uber | lat, mean (SD) | -0.1 (1.4) | -0.1 (0.2) | 0.153 |
uber | long, mean (SD) | -0.5 (1.5) | 0.5 (0.3) | <0.001 |
The imbalance in ridership, transit service density, and unit size corroborates the selection effect at play: the convention sites were chosen for their ability to accommodate lots of visitors (duh). These sites are unlike the rest of Chicago. Luckily with the diff-in-diff framework, the convention sites partially control for themselves. Is that enough to believe in this model? I’ll come back to this question later.
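The post does not say which test produced the P-Value column. One common choice for a balance table, assumed here, is Welch's unequal-variance t-test computed from the summary statistics. The sketch below uses a normal approximation for the p-value rather than the t distribution, and treating the number of units (rather than unit-days) as the sample size is also my assumption.

```python
from math import sqrt
from statistics import NormalDist


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch t statistic for a difference in means with unequal variances."""
    return (mean_a - mean_b) / sqrt(sd_a**2 / n_a + sd_b**2 / n_b)


def two_sided_p_normal(t):
    """Two-sided p-value using the standard normal as an approximation."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


if __name__ == "__main__":
    # Train stations, bus_distance row of the balance table:
    # 305.0 (1472.7) over 114 stations vs 165.3 (167.7) over 8 stations.
    t = welch_t(305.0, 1472.7, 114, 165.3, 167.7, 8)
    print(f"t = {t:.2f}, approximate p = {two_sided_p_normal(t):.3f}")
```

With only 8 treated stations the normal approximation is crude, so treat the result as a sanity check on the order of magnitude rather than a reproduction of the table.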
This post is already long. I’ll present the formal regression model and results in the next post.
[1] Given this approach I will not be able to compare effect sizes between transit modes. Unifying these data into one model is not trivial. To start, the construct we are interested in is travel origins. The best we have are transit origins, but these are not the same: people need to walk the “last mile” from their true origin to the nearest station, potentially crossing census tracts. Second, transit modes are substitutes: modeling them together probably requires some kind of structural equation model to handle simultaneity.
[2] Even at the tract level, 50% of “nearby” tract land area is outside the buffer. As a robustness check, I should vary the spatial intersection threshold from 0% to 100% and test the sensitivity of the results.
[3] For a robustness check I should add a 1 or 2 day buffer (travel/tourism days).
[4] See placebo test.
[5] Note this mechanically doubles the ridership levels of bike/rideshare vs train. I account for this by running separate models per transit mode, and by reporting percentage changes instead of level changes.
[6] If delegates prefer to commute to the DNC via train and commute back via Uber, I’d only observe the return trips. This complicates comparisons between transit modes.
[7] This was more complicated than I expected. The panel is unbalanced in two ways, which would lead me to underestimate the standard error if unaccounted for. First, some units have as many as 12x more observations than other units, due to data availability. Second, I’ve included 15x more non-DNC days than DNC days, in order to improve the baseline estimates. Instead of the typical \(SE = \sigma / \sqrt n\) calculation, I use the pooled \(SE = \sigma \sqrt{\sum{1/n_i}}\).