Finalize JSON schemas #36
Considerations for data model (based on our experience in doing this for all of California) Critical Items
Desired items
Questions
|
@e-lo Thank you for the in-depth feedback! Let me know if you have any additional questions or concerns based on this response:
Answers
Not sure why stable vs auto-discovery URL would be different? What use case does this satisfy?
You’re correct, they are the same thing. We used auto-discovery URL as a term based on using GBFS’ systems.csv as inspiration. However, upon review it’s clear that discovery isn’t a meaningful term in GTFS and it should be changed. Our plan with this issue is to modify the auto-discovery URL to be direct download URL. The main reason we don’t plan to use stable URL is that oftentimes the URL provided from data publishers isn’t in fact stable (time bound, not an official source, etc).
This is marked as done, but I don't see a PR attached?
Originally there wasn’t a PR because the prototype PR was extremely large and attached to another issue. This has been fixed.
Critical Items
Primary municipality: We’re going to make both municipality and subdivision optional based on this feedback and after looking more closely at different source examples, and seeing there are many aggregate feeds and larger transit systems for which neither apply.
DataType: Currently within each GTFS Realtime source, there are three fields for Trip Updates, Service Alerts, and Vehicle Positions. This ensures that the user can get all the information they want under one GTFS Realtime file. Previously, it was complex for one to search and collect GTFS Realtime information using Transitfeeds. Now everything will be under one single file.
Could you elaborate on what use case needs a common template for URIs with API keys? Is this to standardize how we indicate an API key is needed within a URL?
Transit provider definition, enumerated list, and array of aliases: Thanks for sharing a suggested structure for how we could provide a catalog of organizations and services in the working document.
I’ve added a feature in the roadmap for expanding the catalogs that the community can vote on. (I’ve used some of the user stories you suggested for the search interface here since I believe this feature would address similar needs). Upon further consideration, we think it makes sense to use “agency” as a starting point rather than transit provider, since
For the purposes of launching V1 on the 23rd, we’ll be making this modification to agency in the
Desired Items
A few clarifying questions/comments based on our internal team review:
|
I agree that the user experience should be able to get all the realtime feeds with a single query, but that doesn't necessitate the data model do that as well. There are providers which have several realtime feeds of the same type (particularly for contracted service) and some which have duplicative or enhanced feeds – so the desired user experience will still require the API (or whatever level of obfuscation) to query and assemble feeds from multiple entries. Since the URLs are each optional, it effectively allows you to have an entry for each RT data type, but I do want to make sure the user experience isn't overly dependent on this structure. |
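To illustrate the point about assembling feeds from multiple entries, here is a minimal Python sketch. The field names (`provider`, `trip_updates_url`, `vehicle_positions_url`, `service_alerts_url`) and the example URLs are hypothetical and are not taken from the schema; the only thing assumed from the discussion is that each entry carries optional URLs per realtime type.

```python
from collections import defaultdict

# Hypothetical field names for the three optional realtime URLs.
RT_FIELDS = ("trip_updates_url", "vehicle_positions_url", "service_alerts_url")

# Hypothetical catalog entries: one provider may appear in several entries,
# and any of the realtime URLs may be absent.
ENTRIES = [
    {"provider": "Provider X",
     "trip_updates_url": "https://example.com/x/bus-trip-updates.pb",
     "vehicle_positions_url": "https://example.com/x/bus-vehicle-positions.pb"},
    {"provider": "Provider X",
     "trip_updates_url": "https://example.com/x/rail-trip-updates.pb"},
    {"provider": "Provider Y",
     "service_alerts_url": "https://example.com/y/alerts.pb"},
]


def assemble_feeds(entries, provider):
    """Collect every realtime URL for one provider, grouped by feed type."""
    feeds = defaultdict(list)
    for entry in entries:
        if entry.get("provider") != provider:
            continue
        for field in RT_FIELDS:
            url = entry.get(field)
            if url:  # URLs are optional, so skip missing or empty values
                feeds[field].append(url)
    return dict(feeds)
```

With this shape, a consumer still gets all feeds for a provider in one call even though the underlying data is spread across several entries.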
Exactly. ie. |
I actually think that overlap with |
You could alternatively use a "common name" as the "transit provider name" and then in a future catalog of transit providers add in "official organization name". |
This is really a question about an overall governance model – but ideally any changes to this priority in a PR would flag staff at the transit provider to review and disagree with. |
🙌 |
|
I think this has the maximum flexibility and searchability. Again - happy to hear reasoning for alternatives that meet the needs/situations described above. I mainly just don't want to oversimplify the data model and then have a bunch of technical debt if/when it needs to be updated based on cases we already know exist in some significant number... |
@e-lo Thanks for clarifying. Over the past week, we've heard some concern from a consumer perspective with the realtime feed information for one provider living in multiple sources, making the info more difficult to search and parse. Could you provide 2-3 examples of this use case with multiple realtime feeds of the same type so we could consider how to model it? We're considering making the static reference field a list, and the URLs nested so multiple URLs of the same type could be included in the same source file. Since these discussions are still ongoing, and we agree that we want to avoid considerable technical debt, we plan to delay importing the realtime data until later in Q2. The release plan will be updated to reflect this change. |
If this happens, the list should be an object such that it can be individually queried/filtered for the following use cases (which could end up adding complexity depending on how implemented):
Transit Provider X has three published GTFS datasets, but only one "syncs" with their realtime feeds. In order to link their realtime feed with the correct static feed, I need to reference a specific schedule dataset. There are lots of examples here (69 in our current data for California), including all Bay Area datasets, Victor Valley, Tulare, Thousand Oaks, Simi Valley, Santa Ynez, Ojai, Sacramento, Gold Coast, Glenn, etc. In many (not all) of these cases this is caused because there is a CAD/AVL/Realtime service provider which needs to update the static dataset in order to publish a static dataset which is consistent with realtime. This most often occurs when there is a combination of services with the same realtime feed and naming conflicts need to be avoided, such as in the Bay Area and Ventura County, which produce a single set of combined realtime feeds.
Transit providers sometimes need to publish different services in separate GTFS Schedule datasets for various reasons such as contracted service agreements (e.g. Visalia and V-Line) and feed size (e.g. LA Metro). In other cases, providing certain variables in a query to a GTFS Schedule API will yield different services (e.g. Bay Area 511). In all cases, we likely need to know which combination of feeds produces the entirety of service.
In some cases transit providers publish data on supporting services which aren't directly managed by them and overlap with the transit provider's GTFS Schedule dataset which provides them. As a data user, I need to understand which parts of the dataset contain duplicates of service which should be screened out, deferring to a separate feed for the information that the transit provider which manages that service wants me to see. For example, the Amtrak Schedule Dataset (whoot!) contains many supportive services such as the Altamont Corridor Express (ACE). ACE is also included in Bay Area 511 among other feeds. As a data consumer, I'd like to know which GTFS Schedule Dataset I should consume ACE information from, from the transit provider's perspective (if possible) |
From a transparency perspective, it would be great to have the use cases and discussion from the transit consumers here in this issue. (Note: I'm definitely not doubting that there are very valid and important issues...I'd just prefer if we could all discuss in one place that is traceable) |
Offhand I can think of the following cases:
|
HART is one such case here in Tampa, FL. They have a single GTFS dataset that covers their bus and streetcar. Bus originally had RT data (OrbCAD system, and we at USF built a GTFS Realtime exporter for it), but streetcar did not (streetcar was a separately managed system). RT was added to streetcar via Swiftly. So the resulting system has a single GTFS, but two GTFS Realtime endpoints for TripUpdates. To model these cases, my preference would be to see something like this (URLs aren't real here, as I'm not sure if the streetcar URL is public):
{
"mdb_source_id": 100,
"data_type": "gtfs_rt",
"provider": "Hillsborough Area Regional Transit",
"name": "Hillsborough Area Regional Transit GTFS Realtime",
"static_reference": 120,
"real_time_feeds": {
"vehicle_positions": [
{
"url": "https://www.hart.org/bus/bus-vehicle-positions.pb",
"license": "LicenseA",
"authentication_info_url": "https://www.hart.org/developer_info",
"authentication_type": 2,
"api_key_parameter_name": "key"
},
{
"url": "https://www.hart.org/streetcar/v1/key/API_KEY/streetcar-vehicle-positions.pb",
"license": "LicenseB",
"authentication_info_url": "https://www.hart.org/developer_info",
"authentication_type": 1
}
],
"trip_updates": [
{
"url": "https://www.hart.org/bus/bus-trip-updates.pb",
"license": "LicenseA",
"authentication_info_url": "https://www.hart.org/developer_info",
"authentication_type": 2,
"api_key_parameter_name": "key"
},
{
"url": "https://www.hart.org/streetcar/v1/key/API_KEY/streetcar-trip-updates.pb",
"license": "LicenseB",
"authentication_info_url": "https://www.hart.org/developer_info",
"authentication_type": 1
}
]
}
}
This allows us to model many attributes for each endpoint as needed, but still keeps the endpoints logically grouped under the same provider. The Note the API key structure in the streetcar URL. This will be harder to model in a directory than a simple URL parameter because it's integrated into the URL itself, which is why I've assigned a Something like:
{
"url": "https://www.hart.org/streetcar/v1/key/API_KEY/streetcar-trip-updates.pb",
"license": "LicenseB",
"authentication_info_url": "https://www.hart.org/developer_info",
"authentication_type": 3,
"api_key_url_placeholder_name": "API_KEY"
} |
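As a rough illustration of how a consumer might handle the two API key styles discussed above, here is a Python sketch. It assumes that `authentication_type` 2 means the key is sent as a URL query parameter named by `api_key_parameter_name`, and that the proposed type 3 means the key replaces the placeholder named by `api_key_url_placeholder_name`; neither meaning is confirmed by the schema, and `build_authenticated_url` is a hypothetical helper.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit


def build_authenticated_url(feed, api_key):
    """Return the feed URL with the API key applied.

    Assumed meanings (not confirmed by the schema):
      2 -> key is appended as a query parameter
      3 -> key replaces a placeholder embedded in the URL path
    Any other type returns the URL unchanged.
    """
    url = feed["url"]
    auth_type = feed.get("authentication_type")

    if auth_type == 2:
        parts = urlsplit(url)
        # Preserve any query parameters already present on the URL.
        query = parse_qsl(parts.query, keep_blank_values=True)
        query.append((feed["api_key_parameter_name"], api_key))
        return urlunsplit(parts._replace(query=urlencode(query)))

    if auth_type == 3:
        placeholder = feed["api_key_url_placeholder_name"]
        if placeholder not in url:
            raise ValueError(f"placeholder {placeholder!r} not found in URL")
        # Percent-encode the key so it is safe inside a path segment.
        return url.replace(placeholder, quote(api_key, safe=""))

    return url


bus = {
    "url": "https://www.hart.org/bus/bus-trip-updates.pb",
    "authentication_type": 2,
    "api_key_parameter_name": "key",
}
streetcar = {
    "url": "https://www.hart.org/streetcar/v1/key/API_KEY/streetcar-trip-updates.pb",
    "authentication_type": 3,
    "api_key_url_placeholder_name": "API_KEY",
}
```

For the bus feed, the key ends up in the query string (`...bus-trip-updates.pb?key=abc123`); for the streetcar feed, it ends up in the path (`.../v1/key/abc123/...`).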
Actually, looking back at the GTFS linked datasets proposal, Swiftly commented here asking for another
Not sure if the streetcar URL format is an older or newer API key format for them since that comment. |
I would generally prefer handlebar-like syntax with expected values.
{
"url": "https://www.hart.org/streetcar/v1/key/{API_KEY}/streetcar-trip-updates.pb",
"license": "LicenseB",
"authentication_info_url": "https://www.hart.org/developer_info",
"authentication_type": 3
} |
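A consumer could expand handlebar-like placeholders such as `{API_KEY}` along these lines. This is only a sketch in Python: the function names `placeholders` and `expand_url_template` are hypothetical, and the single-brace `{NAME}` form is taken from the example above rather than from any agreed specification.

```python
import re
from urllib.parse import quote

# Matches {NAME} where NAME is a simple identifier, as in the example URL.
PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def placeholders(template):
    """List the placeholder names a URL template expects, in order."""
    return PLACEHOLDER.findall(template)


def expand_url_template(template, values):
    """Replace each {NAME} with its percent-encoded value.

    Raises KeyError if a placeholder has no supplied value, so a missing
    API key fails loudly instead of producing a broken URL.
    """
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return quote(str(values[name]), safe="")

    return PLACEHOLDER.sub(substitute, template)


TEMPLATE = "https://www.hart.org/streetcar/v1/key/{API_KEY}/streetcar-trip-updates.pb"
```

One advantage of this syntax is that the expected values can be discovered from the URL itself (`placeholders(TEMPLATE)`), without a separate field naming the placeholder.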
@barbeau was who we were discussing this with previously so the relevant use cases so far have been mentioned now. Thanks to both of you for the above use cases and suggested approach going forward. I'm going to share this with the MobilityData team internally over the next few weeks after our quarterly planning process and get back to you with any relevant changes and how it'll accommodate the use cases you've provided. Let me know if you have any questions or concerns. |
This issue still hasn't been resolved in the JSON schema. There are many important feeds with multiple transit providers. |
Looking at the (very) lengthy filenames that are now in the catalog, I'm wondering about the use of |
Agreed, this issue hasn’t been resolved. Until we provide a catalog of organizations and providers, it’s unclear on our side how we could best achieve this enumerated list. Is there a lighter weight solution you're envisioning?
The original rationale behind this was around ease of entry and search - we didn’t want to require users to input the subdivision code name or search for it in instances where it isn't commonly used. However, it would make sense for us to alter the implementation of the file name at the bare minimum so they’re less lengthy (issue added here). |
I agree with @e-lo that these complex use cases of multiple RT feeds referring to one Schedule feed (and vice versa) have not been fully represented in the current schema. Perhaps one lightweight and interim approach to these challenging use cases is to add a |
@evansiroky @e-lo Looking back at #36 (comment), I think the one use case I didn't illustrate there is one RT feed to many static feeds - did I miss anything else? I think the one RT feed to many static feeds could be represented by making the
{
"mdb_source_id": [100, 101],
"data_type": "gtfs_rt",
"provider": "Hillsborough Area Regional Transit",
"name": "Hillsborough Area Regional Transit GTFS Realtime",
"static_reference": [120, 121],
"real_time_feeds": {
...
Do you know of any cases this doesn't cover, or reasons why this wouldn't work? |
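If `static_reference` may hold either a single ID or an array of IDs, a consumer would likely normalize it before resolving the linked schedule sources. The Python sketch below assumes that schedule sources are dictionaries keyed by `mdb_source_id`; the helper names and the sample records are hypothetical.

```python
def static_references(rt_source):
    """Return static_reference as a list, whether stored as a scalar or an array."""
    ref = rt_source.get("static_reference")
    if ref is None:
        return []
    if isinstance(ref, (list, tuple)):
        return list(ref)
    return [ref]


def linked_schedules(rt_source, schedule_sources):
    """Resolve every schedule source that a realtime source points to."""
    by_id = {source["mdb_source_id"]: source for source in schedule_sources}
    refs = static_references(rt_source)
    missing = [ref for ref in refs if ref not in by_id]
    if missing:
        raise LookupError(f"unknown static_reference IDs: {missing}")
    return [by_id[ref] for ref in refs]


# Hypothetical schedule sources for illustration only.
SCHEDULES = [
    {"mdb_source_id": 120, "data_type": "gtfs", "name": "Schedule A"},
    {"mdb_source_id": 121, "data_type": "gtfs", "name": "Schedule B"},
]
ONE_TO_MANY = {"data_type": "gtfs_rt", "static_reference": [120, 121]}
ONE_TO_ONE = {"data_type": "gtfs_rt", "static_reference": 120}
```

Normalizing at read time lets older single-ID records and newer array records coexist while the schema change is rolled out.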
@barbeau I think this may cover most use cases that are present until the catalog of organizations and providers item is discussed. |
@barbeau I think the schema here you suggested is great. Here, why would we need a |
You only need the I think MTA is a good test for this model: So, for example, if MTA Transit Bus is represented as one source with multiple GTFS static files (Bronx, Brooklyn, Manhattan, Queens, Staten Island), then you could have a single GTFS RT record with a single If you wanted to treat MTA Bronx as its own source, then you'd need an array for |
Since the goal is to make it easier for consumers to see which GTFS schedule sources are tied to a realtime source, we think keeping I've opened a new issue specifically focused on realtime changes to track our progress in updating the schema and an associated PR. The only notable changes from @barbeau 's original proposal are
Please feel free to take a look and comment on the PR. |
...deleted earlier one b/c now I realize that this was in reference to mdb_source_id, not the static one :-) |
The realtime schema has been implemented! I've separated out the remainder of this big conversation into the following outstanding issues:
In order to make the discussion easier to follow in the future, I'm going to close this issue. |
What problem are we trying to solve?
We want users to be able to easily search for data by location and provider.
How will we know when this is done?
The JSON schema is updated to match the following field breakdowns.
The CSV artifact is updated to match the fields.