
Reorganize GTFS "catalog" as many/many relationship of transit providers GTFS datasets #21

Closed
e-lo opened this issue Feb 17, 2021 · 16 comments

Comments

@e-lo
Contributor

e-lo commented Feb 17, 2021

Whereas some GTFS datasets have multiple transit providers (e.g. MTC's) and some transit providers have multiple datasets (e.g. LACMTA), the GTFS data catalog needs to be formatted and maintained to acknowledge this relationship.

Options

1. List of feeds by transit provider

  • Keyed on transit provider LACMTA, BART
  • GTFS datasets as lists:
     LACMTA: [LACMTA Rail link..., LACMTA bus link...]
     BART: [MTC regional feed link...]
  • GTFS datasets will repeat, so will need to check if already completed

2. List of feeds by tuple of transit providers included

     (LACMTA) : [LACMTA Rail link..., LACMTA bus link...]
     (BART, Caltrain, SFMTA, Santa Rosa Citybus) : [MTC regional feed link...]
  • No repeating, but not as fun to parse

Other?
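
A minimal sketch of how the two options relate, assuming the catalog is loaded into a plain Python dict. The feed names are placeholders rather than real links, and both function names are hypothetical. It inverts the provider-keyed form (option 1) to find which providers share a feed, then groups feeds by provider tuple (option 2):

```python
from collections import defaultdict

# Option 1: keyed on transit provider, datasets as lists (placeholder feed names).
by_provider = {
    "LACMTA": ["lacmta_rail", "lacmta_bus"],
    "BART": ["mtc_regional"],
    "Caltrain": ["mtc_regional"],
}


def providers_by_feed(catalog):
    """Invert provider -> feeds into feed -> sorted list of providers."""
    inverted = defaultdict(set)
    for provider, feeds in catalog.items():
        for feed in feeds:
            inverted[feed].add(provider)
    return {feed: sorted(providers) for feed, providers in inverted.items()}


def by_provider_tuple(catalog):
    """Build option 2: tuple of providers -> feeds they share, with no repeats."""
    grouped = defaultdict(list)
    for feed, providers in providers_by_feed(catalog).items():
        grouped[tuple(providers)].append(feed)
    return dict(grouped)


print(by_provider_tuple(by_provider))
```

Either form can be derived from the other, so the choice is mostly about which one is easier to edit by hand.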

@hunterowens
Member

Not to plug yml, which is gonna be harder for the Trillium folks to edit, but we could do something like the structure below. It also seems like the MTC feed has individual per-operator feeds that you can get.

agency_1: 
  url: 
  - item 

agency_2:
  url: 
  - item
  - item2 

agency_mtc:
  url: 
  - their_feed_subset
  - regional_feed

@hunterowens
Member

(this might be just a restating of 2)

@e-lo
Contributor Author

e-lo commented Feb 17, 2021

(Totally fine with yml, it can parse out to same thing)

I think what you wrote is same as 1, which I think is the preferable answer - you'll just want to track feeds you've already validated if they are duped.

Interesting about feed subsets. It might be that we want to do the validation and grading on both the subset and the regional feed b/c the subset will have some agency-specific stuff in there.
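
A rough sketch of the "track feeds you've already validated" idea, assuming a provider-keyed catalog where a regional feed repeats under several agencies. The agency and feed names are placeholders and `validation_plan` is a hypothetical helper. Each distinct feed appears once in the plan, along with every agency that lists it, so both the subsets and the regional feed get validated without repeats:

```python
def validation_plan(catalog):
    """Return (feed, agencies) pairs so that each distinct feed is validated once."""
    agencies_by_feed = {}
    for agency, feeds in catalog.items():
        for feed in feeds:
            # setdefault keeps the first-seen order and avoids re-adding a feed.
            agencies_by_feed.setdefault(feed, []).append(agency)
    return list(agencies_by_feed.items())


catalog = {
    "agency_1": ["agency_1_subset", "regional_feed"],
    "agency_2": ["agency_2_subset", "regional_feed"],
}

for feed, agencies in validation_plan(catalog):
    print(feed, "->", agencies)
```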

@hunterowens
Member

@antrim brought up on yesterday's call the idea of using DMFR, which is currently used by the new transitland-atlas.

DMFR is fairly new but does capture the NTD ID (at least in transitland-atlas, as a "tag"). Monitoring changes to the static_current key would allow us to track changes in the static URL, which are part of the CA GTFS guidelines.

I think the two most viable options for moving the list away from a Google Sheet are either to create a repository full of DMFR files, one for each CA agency, or to do a simpler yml- or csv-based version tracking the following pieces of information:

  • cal_itp_id
  • [GTFS static download url(s)]
  • [GTFS RT vehicle position download url(s)]
  • [GTFS RT alerts download url(s)]
  • [GTFS RT trip updates download url(s)]

and store it in Github.

The MTC situation is a little messy in this format, but I think if we stick roughly with option 1), based on @e-lo's thoughts above, it will be most compatible with MobilityDatabase in the future, even if it requires a bit of custom code.

Should we store any agency metadata aside from itp_id in the GitHub-based list (i.e., agency_name etc.), or should we just join on itp_id with the Google Sheet as needed?

Here's our friend MST represented in the DMFR format, fwiw.

{
  "onestop_id": "o-9q9-monterey~salinastransit",
  "tags": {
    "us_ntd_id": "90062"
  },
  "name": "Monterey-Salinas Transit",
  "short_name": "MST",
  "associated_feeds": [
    {
      "feed_onestop_id": "f-9q9-monterey~salinastransit",
      "gtfs_agency_id": ""
    },
    {
      "feed_onestop_id": "f-mst~rt",
      "gtfs_agency_id": ""
    }
  ]
}
{
  "$schema": "https://dmfr.transit.land/json-schema/dmfr.schema-v0.3.0.json",
  "feeds": [
    {
      "spec": "gtfs",
      "id": "f-9q9-monterey~salinastransit",
      "urls": {
        "static_current": "https://www.mst.org/google/google_transit.zip"
      },
      "feed_namespace_id": "o-9q9-monterey~salinastransit",
      "license": {
        "url": "https://mst.org/about-mst/developer-resources/"
      }
    },
    {
      "spec": "gtfs-rt",
      "id": "f-mst~rt",
      "urls": {
        "realtime_alerts": "http://206.128.158.191/TMGTFSRealTimeWebService/Alert/Alerts.pb",
        "realtime_trip_updates": "http://206.128.158.191/TMGTFSRealTimeWebService/TripUpdate/TripUpdates.pb",
        "realtime_vehicle_positions": "http://206.128.158.191/TMGTFSRealTimeWebService/Vehicle/VehiclePositions.pb"
      },
      "feed_namespace_id": "o-9q9-monterey~salinastransit",
      "license": {
        "url": "https://mst.org/about-mst/developer-resources/"
      },
      "associated_feeds": [
        "f-9q9-monterey~salinastransit"
      ]
    }
  ],
  "license_spdx_identifier": "CDLA-Permissive-1.0"
}
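
A small sketch of how the static and realtime feeds in a DMFR file like the one above could be related, using only the standard `json` module. The embedded document is an abbreviated copy of the MST example, and the two helper names are hypothetical:

```python
import json

# Abbreviated copy of the MST DMFR example (license and namespace fields omitted).
dmfr_text = """
{
  "feeds": [
    {"spec": "gtfs",
     "id": "f-9q9-monterey~salinastransit",
     "urls": {"static_current": "https://www.mst.org/google/google_transit.zip"}},
    {"spec": "gtfs-rt",
     "id": "f-mst~rt",
     "urls": {"realtime_alerts": "http://206.128.158.191/TMGTFSRealTimeWebService/Alert/Alerts.pb"},
     "associated_feeds": ["f-9q9-monterey~salinastransit"]}
  ]
}
"""


def static_urls(dmfr):
    """Map each static GTFS feed id to its static_current URL."""
    return {
        feed["id"]: feed["urls"]["static_current"]
        for feed in dmfr["feeds"]
        if feed["spec"] == "gtfs" and "static_current" in feed.get("urls", {})
    }


def realtime_links(dmfr):
    """Map each gtfs-rt feed id to the static feed ids it lists as associated."""
    return {
        feed["id"]: feed.get("associated_feeds", [])
        for feed in dmfr["feeds"]
        if feed["spec"] == "gtfs-rt"
    }


dmfr = json.loads(dmfr_text)
print(static_urls(dmfr))
print(realtime_links(dmfr))
```

Comparing the output of `static_urls` between two snapshots of the file is one way the static_current changes could be monitored.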

@hunterowens
Member

Putting inline the list of agencies where the link under the GTFS column in the sheet either 404s or doesn't return a valid ZIP file:

['Santa Rosa CityBus', 'County Connection', 'Amador Regional Transit System', 'Anaheim Resort Transportation', 'Avalon Transit', 'Banning Pass Transit', 'Beaumont Pass Transit', 'Calaveras Transit', 'Caltrain', 'Camarillo Area Transit', 'Lawndale Beat', 'Clovis Transit System', 'Commerce Municipal Bus Lines', 'Corona Cruiser', 'Redwood Coast Transit', 'East Los Angeles Shuttle', 'Sunshine Bus(South Whittier)', 'the Link Florence-Firestone/Walnut Park', 'the Link-Athens', 'the Link Lennox', 'the Link Willowbrook', 'East Valinda Shuttle', 'Avocado Heights/Bassett/West Valinda Shuttle', 'the Link King Medical Center', 'Duarte Transit', 'Eastern Sierra Transit Authority', 'Mammoth Lakes Transit System', 'El Dorado Transit', 'El Monte Transportation Division', 'Emery Go-Round', 'Fairfield and Suisun Transit', 'GTrans', 'Humboldt Transit Authority', 'Arcata and Mad River Transit System', 'Eureka Transit Service', 'Blue Lake Rancheria', 'Kern Transit', 'Laguna Beach Municipal Transit', 'Tahoe Transportation', 'Tahoe Truckee Area Regional Transportation', 'Lake Transit', 'Madera County Connection', 'Mendocino Transit Authority', 'Merced The Bus', 'Mission Bay TMA', 'Spirit Bus', 'Moorpark City Transit', 'Morongo Basin Transit Authority', 'MVGO', 'Needles Area Transit', 'Norwalk Transit System', 'Desert Roadrunner', 'Petaluma Transit', 'Placer County Transit', 'Lincoln Transit', 'Plumas Transit Systems', 'Palos Verdes Peninsula Transit Authority', 'Redding Area Bus Authority', 'Burney Express', 'Rio Vista Delta Breeze', 'Sage Stage', 'County Express', 'San Francisco Bay Ferry', 'Simi Valley Transit', 'Siskiyou Transit and General Express', 'Sonoma-Marin Area Rail Transit', 'Santa Maria Area Transit', 'SolTrans', 'SolanoExpress', 'Sonoma County Transit', 'Cloverdale Transit', 'South County Transit Link', 'Stanislaus Regional Transit', 'Turlock Transit', 'Ceres Area Transit', 'Tehama Rural Area eXpress', 'Lassen Transit Service Agency', 'Susanville Indian Rancheria Public Transportation Program', 'Thousand Oaks Transit', 'Tideline', 'Trinity Transit', 'Vacaville City Coach', 'Ventura County Transportation Commission', 'Victor Valley Transit', 'Vine Transit', 'WestCAT', 'Yosemite Area Regional Transportation System', 'Yuba-Sutter Transit Authority', 'Porterville Transit', 'Burbank Bus', 'Big Blue Bus', 'Folsom Stage Line', 'Roseville Transit', 'Sacramento Regional Transit District', 'Unitrans', 'Yolobus', 'DASH', 'Commuter Express', 'Marin Transit', 'Morro Bay Transit', 'Santa Ynez Valley Transit', 'San Joaquin Regional Transit District', 'Santa Barbara Metropolitan Transit District', 'Santa Cruz Metropolitan Transit District', 'Capitol Corridor', 'Clean Air Express', 'Gold Coast Transit', 'North County Transit District', 'Monterey-Salinas Transit', 'OmniTrans', 'SamTrans', 'Fresno Area Express', 'MUNI', 'Long Beach Transit', 'Orange County Transportation Authority', 'Irvine Shuttle', 'Golden Gate Bridge Highway and Transportation District', 'Marguerite Shuttle', 'Bay Area Rapid Transit', 'Menlo Park Shuttles', 'Metrolink', 'Modesto Area Express', 'Riverside Transit Agency', 'San Diego Metropolitan Transit System', 'SunLine Transit Agency', 'Yuma County Area Transit', 'Madera Area Express', 'Bear Transit', 'Montebello Bus Lines', 'Carson Circuit', 'Huntington Park Express', 'DowneyLINK', 'Bell Gardens', 'Cudahy Area Rapid Transit', 'Baldwin Park Transit', 'Calabasas Transit System', 'Compton Renaissance Transit Service', 'Rosemead Explorer', 'Bellflower Bus', 'Go West Shuttle', 'Arcadia Transit', 'La Campana', 'Glendora Transportation Division', 'Delano Area Rapid Transit', 'Guadalupe Flyer', 'Arvin Transit', 'Auburn Transit', 'Blossom Express', 'Ridgecrest Transit', 'San Juan Capistrano Free Weekend Trolley', 'Alhambra Community Transit', 'Union City Transit']
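
For reference, a sketch of one way the "doesn't return a valid ZIP file" check could be done with the standard library. This is not necessarily how the list above was produced. The HTTP fetch is left out so the example runs offline, and `is_valid_zip` is a hypothetical name:

```python
import io
import zipfile


def is_valid_zip(payload: bytes) -> bool:
    """True if the bytes open as a ZIP archive and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            # testzip() returns the name of the first corrupt member, or None.
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False


# Build a tiny in-memory archive to stand in for a downloaded feed.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("agency.txt", "agency_name\nExample Transit\n")

print(is_valid_zip(buffer.getvalue()))
print(is_valid_zip(b"<html>404 Not Found</html>"))
```

A link that serves an HTML error page with a 200 status would fail this check even though it does not 404.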

@hunterowens
Member

Here's my thought on format. I would treat ITP ID and name string as the metadata.

To my knowledge, nobody has multiple GTFS-RT feeds, but we could adopt the {list(url)} structure if needed.

agency_1: 
  itp_id: {num}
  name_string: {"some string"}
  gtfs_schedule_url: 
  - item
  gtfs_rt: 
    trip_updates: {url}
    vehicle_locations: {url}
    alerts: {url}

agency_2:
  itp_id: {num}
  name_string: {"some string"}
  gtfs_schedule_url:
  - item
  - item2 

agency_mtc:
  itp_id: {num}
  name_string: {"some string"}
  gtfs_schedule_url:
  - their_feed_subset
  - regional_feed
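
A hedged sketch of a shape check for entries in this format, assuming the yml has already been parsed into Python dicts (the parsing step is not shown). The sample values and the `entry_problems` helper are hypothetical:

```python
RT_KEYS = {"trip_updates", "vehicle_locations", "alerts"}


def entry_problems(entry):
    """Return a list of human-readable problems found in one agency entry."""
    problems = []
    if not isinstance(entry.get("itp_id"), int):
        problems.append("itp_id must be an integer")
    if not isinstance(entry.get("name_string"), str):
        problems.append("name_string must be a string")
    urls = entry.get("gtfs_schedule_url")
    if not (isinstance(urls, list) and urls and all(isinstance(u, str) for u in urls)):
        problems.append("gtfs_schedule_url must be a non-empty list of strings")
    # gtfs_rt is optional; an empty key parses as None, so fall back to {}.
    unknown = set(entry.get("gtfs_rt") or {}) - RT_KEYS
    if unknown:
        problems.append("unknown gtfs_rt keys: " + ", ".join(sorted(unknown)))
    return problems


# Placeholder entry; the id and URLs are made up for illustration.
agency_2 = {
    "itp_id": 2,
    "name_string": "some string",
    "gtfs_schedule_url": ["https://example.com/item.zip", "https://example.com/item2.zip"],
}
print(entry_problems(agency_2))
print(entry_problems({"itp_id": "2", "gtfs_rt": {"positions": "https://example.com/vp"}}))
```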

@hunterowens
Member

cc @antrim @e-lo

@antrim

antrim commented Mar 8, 2021

Do we need a way of relating static and real-time feed URLs if there are multiple static URLs? Something like this?

agency_mtc:
  itp_id: {num}
  name_string: {"some string"}
  gtfs_url:
  - their_feed_subset:
      static: {url}
      gtfs_rt:
        trip_updates: {url}
        vehicle_locations: {url}
        alerts: {url}
  - regional_feed

@hunterowens
Member

I think inside the gtfs_url object, static should be a list of URLs, to handle the case where one agency has many static download URLs.

Given that MTC is a big portion of the state, we can either ignore the regional feed or code it inside a second name object, which is the approach above.

Here's what I think an MTC agency should look like:

ac_transit:
  itp_id: 
  name_string: "AC Transit"
  gtfs_schedule_url:
    - https://api.actransit.org/transit/gtfs/download?token=2512B81107A09D2DC44895CDDC650D47
    - http://api.511.org/transit/datafeeds?api_key=[your_key]&operator_id=[AC_TRANSIT_ID] 
  gtfs_rt:
    trip_updates: 
      - http://api.actransit.org/transit/Help/Api/GET-gtfsrt-tripupdates
      - http://api.511.org/transit/tripupdates?api_key=[your_key]&agency=[AC_TRANSIT_ID] 
    .... (so on for each of the three GTFS-RT feeds) 

Essentially, each URL key should have a list of URL values, I think. Note, those URLs I posted above should be equivalent in content but... we should monitor and find out.
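
A sketch of the two ideas in this comment, under stated assumptions: `as_url_list` (hypothetical) normalizes a key to a list of URLs, and `group_by_digest` (also hypothetical) groups already-downloaded payloads by SHA-256 to see which URLs currently serve identical bytes. The URLs and payloads are placeholders. This is only a coarse check, since two archives with the same GTFS content can still differ byte for byte (for example in file timestamps):

```python
import hashlib


def as_url_list(value):
    """Accept a missing value, a single URL string, or a list of URLs."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def group_by_digest(downloads):
    """Group URLs whose downloaded bytes share the same SHA-256 digest."""
    groups = {}
    for url, payload in downloads.items():
        digest = hashlib.sha256(payload).hexdigest()
        groups.setdefault(digest, []).append(url)
    return list(groups.values())


# Placeholder payloads standing in for fetched feed archives.
downloads = {
    "https://example.com/agency.zip": b"same bytes",
    "https://example.com/regional.zip": b"same bytes",
    "https://example.com/other.zip": b"different bytes",
}
print(as_url_list("https://example.com/agency.zip"))
print(group_by_digest(downloads))
```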

I can take a first pass at getting a PR for this ready today b/c I have a pending ask from @mcplanner.

@antrim

antrim commented Mar 8, 2021

The way I understand it, this would depend on linked_datasets.txt to associate the GTFS (static) and GTFS-realtime feeds. That seems like a potential issue, given that it's not yet officially adopted and widespread use would be a ways out.

This would be used internally by Cal-ITP, yes? Would it ever be published externally? If so, I see an issue storing API keys in the URL. It might be useful to separate out some of the API information. linked_datasets.txt provides inspiration: https://github.com/google/transit/pull/93/files

Also, would it be useful to have a URL with terms/license/API key info?
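
One possible way to keep keys out of a published file, sketched with `urllib.parse` from the standard library. The parameter names in `SECRET_PARAMS`, the `{api_key}` placeholder, and both function names are assumptions for illustration, and the key values below are made up:

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit

# Query parameters treated as secrets (an assumption; extend as needed).
SECRET_PARAMS = {"api_key", "token"}


def redact_url(url, placeholder="{api_key}"):
    """Replace secret query values with a placeholder so the URL can be stored."""
    parts = urlsplit(url)
    query = [
        (name, placeholder if name in SECRET_PARAMS else value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # safe="{}" keeps the braces of the placeholder from being percent-encoded.
    return urlunsplit(parts._replace(query=urlencode(query, safe="{}")))


def fill_url(template, api_key, placeholder="{api_key}"):
    """Substitute a key (kept outside the catalog) back into a stored template."""
    return template.replace(placeholder, quote(api_key, safe=""))


stored = redact_url("http://api.511.org/transit/datafeeds?api_key=SECRET&operator_id=AC")
print(stored)
print(fill_url(stored, "my-key"))
```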

@hunterowens
Member

I think they are all linked in that they are in a shared object?

@antrim

antrim commented Mar 8, 2021 via email

@e-lo
Contributor Author

e-lo commented Mar 10, 2021

> The way I understand it, this would depend on linked_datasets.txt to associate the GTFS (static) and GTFS-realtime feeds.

Wouldn't an associated transit provider in MobilityDatabase for both the realtime and static be sufficient?

@e-lo
Contributor Author

e-lo commented Mar 10, 2021

> Essentially, each URL key should have a list of URL values, I think. Note, those URLs I posted above should be equivalent in content but... we should monitor and find out.

In some cases, but not always. Case in point: LACMTA, or providers which contract out part of their service. In any case, we should clarify how we document what each dataset spans, noting that MobilityDatabase will be doing the same thing, so we should be consistent if possible. Strawperson for static:

  • effective date(s)/times (calendar.txt)
  • (agency_id,route_id)

Then a rule for hierarchy in the case of conflict e.g. most preferable sources at top, or last published, etc.
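
A sketch of the strawperson above, assuming each source records an effective date range and the (agency_id, route_id) pairs it covers, with sources listed most preferable first. All names, dates, and ids are placeholders, and `pick_source` is a hypothetical helper:

```python
from datetime import date

# Sources in preference order, most preferable at the top (placeholder data).
sources = [
    {
        "name": "agency_feed",
        "start": date(2021, 1, 1),
        "end": date(2021, 6, 30),
        "routes": {("A", "1"), ("A", "2")},
    },
    {
        "name": "regional_feed",
        "start": date(2021, 1, 1),
        "end": date(2021, 12, 31),
        "routes": {("A", "1"), ("A", "2"), ("B", "9")},
    },
]


def pick_source(sources, agency_id, route_id, on):
    """Return the first source, in preference order, covering the route on that date."""
    for source in sources:
        in_effect = source["start"] <= on <= source["end"]
        if in_effect and (agency_id, route_id) in source["routes"]:
            return source["name"]
    return None


print(pick_source(sources, "A", "1", date(2021, 3, 1)))
print(pick_source(sources, "A", "1", date(2021, 9, 1)))
```

A "last published wins" rule would only change how the list is sorted before the lookup.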

@hunterowens
Member

fyi, first draft of this PR is now live in #23

@hunterowens
Member

This is complete, or at least, mostly done and can be reopened w/ new sub-issues as we use the new file!
