Do Democrats Really Support Public Transit?

Project link: https://github.com/eric-mc2/DNCTransit

A few weeks ago I fumbled an interview question on statistical modeling. That hurt my pride, so I decided to take on a little modeling project.

With all the local hubbub around the DNC security perimeter, I remember mostly staying home to avoid getting snarled in traffic. This got me wondering how the rest of the city responded to the convention. How did the visiting delegates fare? Did they use public transit? (Chicago was picked to highlight its transportation and infrastructure after all.)

Does the exogenous shock of the 2024 Democratic National Convention in Chicago significantly change public transit usage?

At face value, I expect the DNC to increase public transit ridership, because the Democratic coalition generally lives in urban areas and supports “green” policy, government spending, etc.

Disclaimer: I don’t really have a stake in whether Democrats in general use transit or not, but I like this setup for a few reasons:

- if AN effect is found, that is interesting
  - (statistics are designed to be cautious about stating effects)
- if NO effect is found, that is also interesting
  - (it suggests the economic impact was smaller than reported, or that delegates took Ubers!)
- if the whole setup is not statistically valid, I can practice critiquing my own bad statistics
  - (poorly tested statistics can have lasting effects on society!)

At First Glance

Can we even test this? Is the DNC a big enough event compared to normal commuting patterns?

The official DNC impact report says 50,000 delegates visited and spent $58.7M outside of the convention within Chicago. The city-wide average daily CTA ridership is 790,000 +/- 140,000 so it would be hard to detect these extra rides against the normal rhythm of the city.

But we know the convention is localised to the United Center and McCormick Place. When we only look at CTA routes that pass through these areas, the average daily ridership is 325,000 +/- 52,000. This breaks down to 136,000 bus and 189,000 train rides. Against this smaller baseline, the anticipated impact becomes more noticeable.

Selecting only nearby train stations (ridership data is not available per bus stop), average daily boardings are 11,000. Any fraction of those 50,000 delegates ought to make a huge spike in this context.
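To make the comparison concrete, here is a minimal sketch that expresses the delegate count as a share of each baseline quoted above. The one-ride-per-delegate-per-day figure is purely an illustrative assumption, not something the impact report states.

```python
# Baseline average daily ridership figures quoted in the text.
DELEGATES = 50_000
baselines = {
    "city-wide CTA": 790_000,
    "routes through convention areas": 325_000,
    "nearby train stations": 11_000,
}


def share_of_baseline(extra_rides: float, baseline: float) -> float:
    """Extra rides expressed as a fraction of the baseline daily total."""
    return extra_rides / baseline


# Hypothetical assumption: every delegate takes exactly one ride per day.
for name, base in baselines.items():
    print(f"{name}: +{share_of_baseline(DELEGATES, base):.0%}")
```

Under that assumption the same 50,000 rides are a single-digit percentage bump city-wide but several times the usual boardings at the nearby stations.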

What are the delegates’ other travel options? Walking, driving, Metra, biking, and rideshares. Data is available for the latter two: on average there are 6,000 bike trips and 33,000 rideshare trips per day in areas near the United Center and McCormick Place.

With these totals in mind, I’d expect the DNC delegation to significantly increase ridership near the convention centers, no matter what transit mode they choose.

The data

The City of Chicago provides daily ridership totals for train (CTA), bus (CTA), rideshare (Uber, Lyft, etc.), and bikeshare (Divvy). The original data is published at varying spatial aggregations:

Train: per station
Bike: per station
Bus: per route
Rideshares: per tract / community area

I unified these sources into a panel dataset, keeping the nominal spatial unit of each transit mode¹. Unfortunately I cannot include buses in this analysis because the data is not granular enough to identify rides near the convention centers.

Treatment Zone

The next step is to label transit rides as near or not near the convention centers.

Convention Centers

I use the City of Chicago buildings shapefile to find the footprint of the United Center and McCormick Place, the two official sites of the DNC. Proximity to these locations will constitute the “treatment” group in the model.

I compute a buffer around each building and find all intersecting transit stations, routes, and tracts. For robustness, I tested 400m, 800m, and 1600m buffers. In the end I was forced to choose the one-mile (1600m) catchment, because the smaller sizes had too few stations and routes nearby.

Fig: Transit serving convention areas. (Bus ridership is not known per stop.)
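The labeling step can be sketched roughly as follows. This is a simplification under stated assumptions: it measures great-circle distance to an approximate venue centroid rather than buffering the building footprint from the shapefile, and the venue coordinates and station names are illustrative placeholders.

```python
import math

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Approximate venue centroids (assumed for illustration, not building footprints).
VENUES = {
    "United Center": (41.8807, -87.6742),
    "McCormick Place": (41.8517, -87.6155),
}


def is_treated(lat, lon, buffer_m=1600):
    """True if the point lies within buffer_m of either venue centroid."""
    return any(
        haversine_m(lat, lon, vlat, vlon) <= buffer_m
        for vlat, vlon in VENUES.values()
    )


# Hypothetical stations, chosen only to show one hit and one miss.
stations = {
    "station_a": (41.8757, -87.6740),  # a few hundred meters south of the United Center
    "station_b": (41.9474, -87.6536),  # several kilometers north of both venues
}
for name, (lat, lon) in stations.items():
    print(name, {b: is_treated(lat, lon, b) for b in (400, 800, 1600)})
```

Looping over the three buffer sizes shows how quickly the treated set shrinks as the radius tightens.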

To test the sensitivity of the buffer, I modeled:

$$\text{rides}_i \sim \text{within 400m}_i + \text{within 800m}_i + \text{within 1600m}_i$$

Average rides per spatial unit varied irregularly as the buffer size increased, meaning results are very sensitive to the buffer size.
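One way to read the nested indicators: if the model includes an intercept (assumed here), it is saturated over four distance bands, so each coefficient equals the difference in mean rides between adjacent bands. The sketch below shows that reading with made-up numbers; it uses band means instead of fitting a regression.

```python
from statistics import mean

# Hypothetical (distance_to_venue_m, daily_rides) pairs; not real data.
units = [
    (250, 900), (350, 1100),
    (600, 400), (700, 600),
    (1200, 1500), (1500, 1300),
    (3000, 800), (5000, 1000),
]


def band(distance_m):
    """Label the narrowest distance band a unit falls in."""
    if distance_m <= 400:
        return "0-400m"
    if distance_m <= 800:
        return "400-800m"
    if distance_m <= 1600:
        return "800-1600m"
    return "outside"


by_band = {}
for distance_m, rides in units:
    by_band.setdefault(band(distance_m), []).append(rides)
band_mean = {name: mean(vals) for name, vals in by_band.items()}

# With nested dummies, each coefficient is a step between adjacent bands.
coefs = {
    "intercept": band_mean["outside"],
    "within_1600m": band_mean["800-1600m"] - band_mean["outside"],
    "within_800m": band_mean["400-800m"] - band_mean["800-1600m"],
    "within_400m": band_mean["0-400m"] - band_mean["400-800m"],
}
print(coefs)
```

Coefficients that flip sign from one band to the next, as in this toy data, are what an irregular relationship between buffer size and average rides looks like.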

Using Community Areas vs Tracts

I chose to use tract-level data for rideshares. Here’s my reasoning. Tracts mean:

✅ a larger sample size (more units of observation)
⛔️ more noise (smaller units => more variation across units/time)
✅ less bias (smaller units => 50% less spatial non-compliance on edge of buffer)
⛔️ fewer rides (smaller units => 23% more privacy redactions)

Overall the attenuation bias seemed like the most important consideration².

Time-like Features

Conference Dates

The DNC itself ran from August 19 to 22. This constitutes the “treatment period” in the model³⁴.

Day of the Week

Commuting patterns have strong weekday/weekend polarity: CTA ridership is drastically higher on weekdays, while Uber ridership is higher on weekends.

Due to the restricted time frame of the DNC, I drop observations from Friday through Sunday. (Ridership on these days doesn’t convey any information about the “treatment” because the DNC only occurs on Monday through Thursday.)
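A minimal sketch of these two time filters, using only the standard library. The June 1 to August 31, 2024 window is an assumption made for illustration; the function names are hypothetical.

```python
from datetime import date, timedelta

# DNC dates from the text; the surrounding date window is assumed.
DNC_START, DNC_END = date(2024, 8, 19), date(2024, 8, 22)


def keep_day(d: date) -> bool:
    """Keep Monday through Thursday only (weekday() is 0 for Monday)."""
    return d.weekday() <= 3


def is_dnc_day(d: date) -> bool:
    """Flag days inside the treatment period, inclusive of both ends."""
    return DNC_START <= d <= DNC_END


summer = [date(2024, 6, 1) + timedelta(days=i) for i in range(92)]
sample = [d for d in summer if keep_day(d)]
treated_days = [d for d in sample if is_dnc_day(d)]
print(len(sample), "days kept,", len(treated_days), "of them DNC days")
```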

Other Spatial Features

Transit Density

I coded the distance from the tract centroid to the nearest train, bus, and bike stop. These distances will help control for varying transit density, commercial density, and transit mode preferences.

$$ \sim \beta_1 d(\text{unit}_i, \text{train}) + \beta_2 d(\text{unit}_i, \text{bus}) + \beta_3 d(\text{unit}_i, \text{bike}) $$

Location

I include the stop/tract centroid longitude and latitude as a quadratic term in the model to help control for basic city-wide spatial variation:

$$ \sim \beta_1\text{lon} + \beta_2\text{lat} + \beta_3\text{lon}*\text{lat} + \beta_4\text{lon}^2 + \beta_5\text{lat}^2 $$

(For numeric stability, I scale lon/lat to zero-mean unit variance.)
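As a rough illustration of the scaling and the quadratic surface terms, the sketch below standardizes hypothetical centroids and builds the five features from the formula above. The coordinates are placeholders, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def quadratic_terms(lon, lat):
    """Features matching beta_1..beta_5: lon, lat, lon*lat, lon^2, lat^2."""
    return [lon, lat, lon * lat, lon ** 2, lat ** 2]


# Hypothetical stop/tract centroids, for illustration only.
lons = [-87.70, -87.65, -87.62, -87.75]
lats = [41.88, 41.85, 41.95, 41.78]

z_lon, z_lat = standardize(lons), standardize(lats)
features = [quadratic_terms(x, y) for x, y in zip(z_lon, z_lat)]
print(features[0])
```

Scaling first matters because raw longitudes near -87.6 squared are all roughly 7,700, which makes the squared term nearly collinear with the linear one.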

The Sample

Temporal Aggregation

I keep observations at the daily level. Aggregating to weekly would pull non-DNC dates into the treatment period, attenuating the treatment effect.
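A back-of-envelope version of that attenuation, under the simplifying assumption that the DNC adds a constant $\delta$ rides on each of its 4 days and nothing on the other 3 days of the same week ($\delta$ is a symbol introduced here for illustration):

$$ \bar{\Delta}_{\text{week}} = \frac{4 \cdot \delta + 3 \cdot 0}{7} = \frac{4}{7}\,\delta \approx 0.57\,\delta $$

So a weekly average would show only a little over half of the true daily effect.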

Time Frame

I use data from the summer months, June, July, and August. Restricting the data to this set makes sense for two reasons. First, I avoid needing to incorporate more complex modeling of seasonality. Second, transit is on a steady rebound reaching 65% of pre-pandemic levels. Though I do model a linear time covariate, the rebound suggests that disaggregated transit usage is in flux as commuter preferences and capacity continue to adjust. Therefore, previous summers are not a good baseline.

Boardings vs Trips

CTA train and bus data provide only the locations where riders board transit, not where they exit. Bike and rideshare data provide per-trip board and exit locations. For strict parity, I ought to drop exit data, but I won’t. Keeping it gives a fuller picture of transit usage⁵⁶.
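To see what keeping both ends of a trip does to the counts, here is a small sketch with hypothetical trip records. How a trip that starts and ends in the same unit should be counted is an assumption; this version counts it twice.

```python
from collections import Counter

# Hypothetical trip records: (day, start_unit, end_unit).
trips = [
    ("2024-08-19", "tract_A", "tract_B"),
    ("2024-08-19", "tract_B", "tract_A"),
    ("2024-08-19", "tract_A", "tract_A"),
]


def unit_day_counts(trips):
    """Count each trip once at its start unit and once at its end unit."""
    counts = Counter()
    for day, start, end in trips:
        counts[(day, start)] += 1  # ride begins here
        counts[(day, end)] += 1    # ride ends here
    return counts


counts = unit_day_counts(trips)
print(counts)
```

Every trip contributes two unit-day observations, which is why levels for bike and rideshare are not directly comparable to train boardings.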

Pre-Regression Checks

Statistical Power

Before running the regression, I want to do a slightly more rigorous version of the time series gut check. Looking at the distribution of daily ridership, let’s ask: what is the minimum number of additional rides to significantly change the mean? First I compute the sample variance and standard error⁷, SE, to get the minimum detectable effect size:

$$ \text{MDE} = (z(1-\alpha / 2) + z(\beta)) * SE $$

Multiplying the MDE by the number of (treated) observations yields the minimum detectable total “shock” to the system.
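The calculation can be sketched as below. Several things here are assumptions rather than the post's stated settings: $\alpha = 0.05$, reading $\beta$ in the formula as the desired power (0.80), the pooled standard error form from footnote 7, and converting a log-scale effect to a percentage via $e^{\text{MDE}} - 1$. The input numbers are made up.

```python
import math
from statistics import NormalDist


def pooled_se(sigma, group_sizes):
    """SE = sigma * sqrt(sum(1 / n_i)), the pooled form given in the footnotes."""
    return sigma * math.sqrt(sum(1 / n for n in group_sizes))


def mde(se, alpha=0.05, power=0.80):
    """Minimum detectable effect for a two-sided test (alpha and power assumed)."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * se


def pct_change(log_effect):
    """Convert an effect on log(rides) into a proportional change in rides."""
    return math.exp(log_effect) - 1


# Hypothetical inputs: SD of log rides, and treated vs. control observation counts.
se = pooled_se(sigma=1.5, group_sizes=[400, 6000])
effect = mde(se)
print(f"SE={se:.3f}, MDE on log scale={effect:.3f}, as percent=+{pct_change(effect):.0%}")
```

Because the smaller group dominates $\sum 1/n_i$, adding more control observations barely moves the standard error once the treated group is the bottleneck.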

Transit	Mean Rides per Unit per Day (SD)	Minimum Detectable Change	Minimum Additional Rides
bike	50.3 (17.5)	+151%	10235
train	1482.0 (392.7)	+227%	58655
uber	753.1 (583.5)	+142%	55036

Let’s interpret the first row of this table:

a bike rack near the DNC will serve an average of 50 rides per day
ridership would need to increase by 151% to be statistically significant
this translates to 954 extra rides over the course of the DNC
note: column 1 cannot simply be multiplied by column 2 to produce column 3 (because I log-transform ridership for the model)

Corroborating the first time series plot, if a decent fraction of delegates take transit, this regression design should be able to detect an effect.

Selection Balance

Here is a balance table for the three separate transit modes:

		Not Near DNC	Near DNC	P-Value
train	stations	114	8
	station-days	6270	420
	daily rides, mean (SD)	2718.4 (2337.9)	1482.8 (995.9)	<0.001
	log(daily rides), mean (SD)	7.6 (0.8)	6.8 (1.8)	<0.001
	bus_distance, mean (SD)	305.0 (1472.7)	165.3 (167.7)	0.354
	bike_distance, mean (SD)	387.6 (1535.0)	288.5 (281.6)	0.573
	sqrt(area), mean (SD)	2521.0 (3723.0)	3123.8 (3564.8)	0.657
	lat, mean (SD)	-0.1 (0.9)	-0.3 (0.2)	0.027
	long, mean (SD)	-0.0 (1.0)	0.2 (0.4)	0.296

bike	docks	1459	47
	dock-days	42316	2568
	daily rides, mean (SD)	46.0 (67.7)	49.5 (34.4)	<0.001
	log(daily rides), mean (SD)	2.8 (1.6)	3.7 (0.8)	<0.001
	train_distance, mean (SD)	5007.3 (5396.5)	1056.5 (1240.4)	<0.001
	bus_distance, mean (SD)	146.2 (264.4)	105.6 (73.5)	0.002
	sqrt(area), mean (SD)	666.9 (1009.0)	655.0 (420.6)	0.859
	lat, mean (SD)	-0.4 (1.3)	-0.3 (0.2)	0.248
	long, mean (SD)	-0.3 (1.3)	0.4 (0.5)	<0.001

uber	tracts	1196	40
	tract-days	52995	2189
	daily rides, mean (SD)	213.8 (802.3)	737.9 (1714.9)	<0.001
	log(daily rides), mean (SD)	3.8 (1.7)	5.5 (1.5)	<0.001
	train_distance, mean (SD)	15886.8 (19731.6)	2293.7 (1027.7)	<0.001
	bus_distance, mean (SD)	9390.3 (17842.2)	503.0 (313.2)	<0.001
	bike_distance, mean (SD)	10707.4 (18691.3)	889.5 (504.9)	<0.001
	sqrt(area), mean (SD)	1134.0 (691.0)	784.4 (320.0)	<0.001
	lat, mean (SD)	-0.1 (1.4)	-0.1 (0.2)	0.153
	long, mean (SD)	-0.5 (1.5)	0.5 (0.3)	<0.001

The imbalance in ridership, transit service density, and unit size corroborates the selection effect at play: the convention sites were chosen for their ability to accommodate lots of visitors (duh). These sites are unlike the rest of Chicago. Luckily with the diff-in-diff framework, the convention sites partially control for themselves. Is that enough to believe in this model? I’ll come back to this question later.
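To illustrate what "control for themselves" means, here is the textbook 2x2 difference-in-differences on mean log rides. This is not the regression model from the project, just a sketch with made-up numbers; the cell labels are hypothetical.

```python
from statistics import mean


def did_estimate(cells):
    """2x2 difference-in-differences on mean log rides."""
    treated_change = mean(cells[("near", "dnc")]) - mean(cells[("near", "baseline")])
    control_change = mean(cells[("far", "dnc")]) - mean(cells[("far", "baseline")])
    return treated_change - control_change


# Hypothetical log(daily rides) per group and period; not real data.
cells = {
    ("near", "baseline"): [6.8, 6.9, 6.7],
    ("near", "dnc"): [7.3, 7.2],
    ("far", "baseline"): [7.6, 7.5, 7.7],
    ("far", "dnc"): [7.7, 7.6],
}

estimate = did_estimate(cells)
print(f"DiD estimate on log scale: {estimate:.2f}")
```

The permanent level gap between near and far units cancels out of the estimate. What does not cancel is any difference in trends between the two groups, which is the part the balance table cannot rule out.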

Regression

This post is already long. I’ll present the formal regression model and results in the next post.

Footnotes

1. Given this approach I will not be able to compare effect sizes between transit modes. Unifying these data into one model is not trivial. To start, the construct we are interested in is travel origins. The best we have are transit origins, but these are not the same: people need to walk the “last mile” from their true origin to the nearest station, potentially crossing census tracts. Second, transit modes are substitutes: modeling them together probably requires some kind of structural equation model to handle simultaneity. ↩︎
2. Even at the tract level, 50% of “nearby” tract land area is outside the buffer. As a robustness check, I should vary the spatial intersection threshold from 0% to 100% and test the sensitivity of the results. ↩︎
3. For a robustness check I should add a 1 or 2 day buffer (travel/tourism days). ↩︎
4. See placebo test. ↩︎
5. Note this mechanically doubles the ridership levels of bike/rideshare vs train. I account for this by running separate models per transit mode, and by reporting percentage changes instead of level changes. ↩︎
6. If delegates prefer to commute to the DNC via train and commute back via Uber, I’d only observe the return trips. This complicates comparisons between transit modes. ↩︎
7. This was more complicated than I expected. The panel is unbalanced in two ways, and ignoring either would underestimate the standard error. First, some units have as many as 12x more observations than other units, due to data availability. Second, I’ve included 15x as many non-DNC days as DNC days, in order to improve the baseline estimates. Instead of the typical $SE = \sigma / \sqrt n$ calculation, I use the pooled $SE = \sigma \sqrt{\sum{1/n_i}}$. ↩︎