The future ain’t what it used to be. - Yogi Berra

NYC Subway entries

MTA turnstile data is shockingly bad, in two ways.

First, ridership is down a lot, with far-reaching implications.

Second, the data quality is really poor. You can’t run a railroad with data this bad. Maybe the MTA has something better internally. But if so, there is no reason to publish embarrassingly bad data.

We have data for each turnstile at 4-hour intervals. You don’t always get data consistently every 4 hours though. Check out Astoria Boulevard.


R514,R094,00-00-00,ASTORIA BLVD,NQW,BMT,04/27/2022,00:00:00,REGULAR,0006787742,0012484763
R514,R094,00-00-00,ASTORIA BLVD,NQW,BMT,04/27/2022,04:00:00,REGULAR,0006787746,0012484793
R514,R094,00-00-00,ASTORIA BLVD,NQW,BMT,04/27/2022,08:00:00,REGULAR,0006787934,0012484831
R514,R094,00-00-00,ASTORIA BLVD,NQW,BMT,04/27/2022,12:00:00,REGULAR,0006788217,0012484886
R514,R094,00-00-00,ASTORIA BLVD,NQW,BMT,04/30/2022,00:00:00,REGULAR,0006789742,0012487200
R514,R094,00-00-00,ASTORIA BLVD,NQW,BMT,04/30/2022,04:00:00,REGULAR,0006789752,0012487262
R514,R094,00-00-00,ASTORIA BLVD,NQW,BMT,04/30/2022,08:00:00,REGULAR,0006789786,0012487287
R514,R094,00-00-00,ASTORIA BLVD,NQW,BMT,04/30/2022,12:00:00,REGULAR,0006789894,0012487341
R514,R094,00-00-00,ASTORIA BLVD,NQW,BMT,04/30/2022,16:00:00,REGULAR,0006789985,0012487447
R514,R094,00-00-00,ASTORIA BLVD,NQW,BMT,04/30/2022,20:00:00,REGULAR,0006790056,0012487606

The last two columns are this turnstile’s entry and exit ‘odometers’. You can see that the readings skip from noon on April 27 to midnight on April 30, and the entry odometer jumps by about 1,500 across the gap. The jump suggests the station was not closed for scheduled maintenance. The data for those days simply was not collected.
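Here is a minimal sketch of how one might flag gaps like this, using a few of the Astoria Boulevard rows above. The column positions (date, time, entries, exits) are assumptions read off the sample rows, and `parse_rows` and `find_gaps` are hypothetical helper names, not anything the MTA provides.

```python
import csv
from datetime import datetime, timedelta
from io import StringIO

# Rows copied from the Astoria Boulevard sample above.
SAMPLE = """\
R514,R094,00-00-00,ASTORIA BLVD,NQW,BMT,04/27/2022,08:00:00,REGULAR,0006787934,0012484831
R514,R094,00-00-00,ASTORIA BLVD,NQW,BMT,04/27/2022,12:00:00,REGULAR,0006788217,0012484886
R514,R094,00-00-00,ASTORIA BLVD,NQW,BMT,04/30/2022,00:00:00,REGULAR,0006789742,0012487200
R514,R094,00-00-00,ASTORIA BLVD,NQW,BMT,04/30/2022,04:00:00,REGULAR,0006789752,0012487262
"""

def parse_rows(text):
    """Yield (timestamp, entries, exits) per row. Column positions are assumed."""
    for row in csv.reader(StringIO(text)):
        stamp = datetime.strptime(f"{row[6]} {row[7]}", "%m/%d/%Y %H:%M:%S")
        yield stamp, int(row[9]), int(row[10])

def find_gaps(readings, expected=timedelta(hours=4)):
    """Return (start, end, entry_delta) for each interval longer than expected."""
    gaps = []
    for (t0, e0, _), (t1, e1, _) in zip(readings, readings[1:]):
        if t1 - t0 > expected:
            gaps.append((t0, t1, e1 - e0))
    return gaps

readings = list(parse_rows(SAMPLE))
gaps = find_gaps(readings)
for start, end, delta in gaps:
    print(f"gap {start} -> {end}: {end - start}, entries +{delta}")
```

A gap with a large odometer change means riders were counted but the intermediate reads are missing. A gap with no change would look more like a closed station.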

Is 4-hour granularity even adequate in the age of the ‘Internet of Things’? If you want to schedule trains during the morning rush, wouldn’t you want data down to minute intervals or so? 4-hour data isn’t helpful beyond scheduling the number of trains in the four-hour period.1

Sometimes turnstiles randomly start counting down instead of up. This also happens quite a bit.

R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,07/23/2019,09:00:00,REGULAR,0000390322,0000202804
R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,07/23/2019,13:00:00,REGULAR,0000390763,0000203216
R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,07/23/2019,17:00:00,REGULAR,0000390763,0000203478
R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,07/23/2019,21:00:00,REGULAR,0592416589,0886336073
R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,07/24/2019,01:00:00,REGULAR,0592416496,0886336027
R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,07/24/2019,05:00:00,REGULAR,0592416496,0886336027
R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,07/24/2019,09:00:00,REGULAR,0592415729,0886335659
R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,07/24/2019,13:00:00,REGULAR,0592415135,0886335411
R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,07/24/2019,17:00:00,REGULAR,0592414623,0886335168

It goes on like that for a long time. In the rows below, it is still counting down in September 2022, and then it resets to zero and starts counting up again.

R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,09/18/2022,13:00:00,REGULAR,0591889116,0886065249
R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,09/18/2022,17:00:00,REGULAR,0591889096,0886065246
R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,09/18/2022,21:00:00,REGULAR,0591889071,0886065241
R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,09/19/2022,01:00:00,REGULAR,0591889055,0886065240
R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,09/19/2022,05:00:00,REGULAR,0591889055,0886065240
R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,09/20/2022,17:00:00,REGULAR,0000000026,0000000000
R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,09/20/2022,21:00:00,REGULAR,0000000076,0000000006
R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,09/21/2022,01:00:00,REGULAR,0000000080,0000000006
R236,R045,00-03-01,GRD CNTRL-42 ST,4567S,IRT,09/21/2022,05:00:00,REGULAR,0000000080,0000000006

One can conjecture that maintenance got done and the counter got reversed, and then eventually more maintenance got done and it got flipped back. Of course, for a counter running backward you can just take the absolute value of the difference between consecutive readings. But there are also a lot of jumps and resets like the ones above, where the difference is meaningless and you just have to drop the row.
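A rough sketch of that cleanup, using entry odometer readings copied from the Grand Central rows above. The cap of 10,000 entries per interval is an arbitrary assumption for illustration, and `interval_counts` is a hypothetical helper name. Pick a threshold that fits your own data.

```python
# Entry odometer readings copied from the Grand Central samples above,
# in time order: the July 2019 jump, then the September 2022 reset.
JULY_2019 = [390322, 390763, 390763, 592416589, 592416496, 592416496, 592415729]
SEPT_2022 = [591889071, 591889055, 591889055, 26, 76, 80, 80]

# Hypothetical cap on plausible entries through one turnstile per read interval.
MAX_PER_INTERVAL = 10_000

def interval_counts(odometer, cap=MAX_PER_INTERVAL):
    """Return per-interval counts, with None where the change is implausible."""
    counts = []
    for prev, curr in zip(odometer, odometer[1:]):
        delta = abs(curr - prev)  # absolute value handles a counter running backward
        counts.append(delta if delta <= cap else None)  # None marks a row to drop
    return counts

july_counts = interval_counts(JULY_2019)
sept_counts = interval_counts(SEPT_2022)
print(july_counts)
print(sept_counts)
```

The `None` entries mark the intervals to drop. Everything else is a usable count whether the odometer was running up or down.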

This is just scratching the surface. Inconsistently named stations, nonexistent stations, you name it. If your data gets assigned in data science classes the world over as the messiest data set around, you’ve got problems.

And it gets worse. In recent data, the entries look significantly undercounted relative to exits.