Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Corrections from Cal-ITP (a selection) #125

Closed
wants to merge 22 commits into from

Conversation

e-lo
Copy link
Contributor

@e-lo e-lo commented Apr 6, 2022

Categories of fixes:

  1. Update download URLs to point to transit provider's DNS rather than contractor or transitfeeds
  2. Update transit provider names to be current (e.g. "Santa Maria Regional (not Area) Transit"), consistent (e.g. LA Metro should have same name for Bus and Rail feeds with only difference being bus vs rail), have their brand (e.g. "Muni")
  3. Fix outdated feed URLs
  4. Fix miscategorizations (e.g. Nevada RTC should be in...Nevada)

Still a lot more to add/fix, but this is a start.

e-lo added 18 commits April 6, 2022 09:55
Updates to use AC Transit's own URL
> Blue Lake Rancheria is included in Humboldt's feed
> Name should be Humboldt, not arcata/mad river (which is a subsystem of humboldt)
- LA Metro should have consistent naming between Rail and Bus (this is same agency, with a split feed only b/c of size)
- Should refer to DASH and Commuter Express bus services which is the brand people will search from
- Update URL to be agency's
- add brand: yolobus
- update URL to be agency's
- name is Taft Area Transit (no maricopa in any names other than routes)
Soltrans download should be agency DNS
- Should be nevada not california
- not named PASS transit
- Update URL to be agency DNS
- Add brand: SacRT to name for searchability
- Should be San Mateo  (or at least a city in San Mateo county) not San Francisco as municipality
- Download should be agency DNS
- Download should be agency DNS
Add acronym
Add acronym
 - Update download link to be current (previous one is several years outdated)
 - Update name to be accurate
- add acronym
- name isn't SMAT anymore - its SMRT / santa maria regional transit

NOTE: this is an outdated download URL!
@CLAassistant
Copy link

CLAassistant commented Apr 6, 2022

CLA assistant check
All committers have signed the CLA.

@e-lo e-lo changed the title Corrections from Cal-ITP (a selection) Fix: Corrections from Cal-ITP (a selection) Apr 6, 2022
@e-lo e-lo changed the title Fix: Corrections from Cal-ITP (a selection) fix: Corrections from Cal-ITP (a selection) Apr 6, 2022
@emmambd emmambd self-requested a review April 6, 2022 18:05
Copy link
Contributor

@emmambd emmambd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @e-lo,

Thank you for our first community source PR to the catalogs! It’s exciting to see some changes here that aren’t available on other aggregators yet.

  1. I’ve left some comments in the review for specific instances where we can’t make an alteration, mainly due to current limitations of the catalogs. I’ve created issues for the relevant changes where they seem clear.

  2. In order to trigger the pipeline, source additions/changes need to have [SOURCES] at the end of the PR name.

@@ -15,7 +15,7 @@
}
},
"urls": {
"direct_download": "https://transitfeeds.com/p/ac-transit/1269/latest/download",
"direct_download": "https://api.actransit.org/transit/gtfs/download?token={{ API_KEY }}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As highlighted in the release plan, right now API-based sources can’t be added since we’re running checks to see if a source is a valid ZIP file. I’ve created this issue to make the checks conditional so API sources can be added and we can define a convention for including API keys and authorization methods in the schema.

I’ve also added this info to CONTRIBUTING.MD so the constraint is clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in updated PR.

@@ -1,21 +0,0 @@
{
"mdb_source_id": 598,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to preserve the Mobility Database IDs, we can’t delete sources. This issue is here so we can add deprecation info to the schema.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unfortunate as I don't think the definition of what a unique ID even represents has been defined (see comment in I believe #36 ) - so it seems impractical /impossible to "hold them constant" if we don't know what they represent.

I would suggest we don't "hold anything to be frozen/backwards compatible until this issue is resolved/closed as it will end up in a LOT of messiness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I backed out this change in the PR.

@@ -15,7 +15,7 @@
}
},
"urls": {
"direct_download": "http://data.trilliumtransit.com/gtfs/smat-ca-us/smat-ca-us.zip",
"direct_download": "",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Currently, changing the name of a file would also trigger changing the direct download URL and break pipeline workflows. We could change the direct download URL structure so it’s non-descriptive (only the source ID for instance), or discuss other potential solutions.

  2. A direct download URL needs to be available for the source to be valid. I’m assuming there’s no available replacement since you left it empty? We have talked about adding feed_end_date to the schema to make it easier to see if the most recently available URL is still in use.

Copy link
Contributor Author

@e-lo e-lo Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This is unfortunate. So what happens when an agency changes their name? When a URL changes what they represent? If this is the case, I would suggest only having the ID and/or geography (something that isn't going to change) in the name.

A direct download URL needs to be available for the source to be valid.

  1. That is correct. What should be the course of action - since the URL isn't valid and therefore doesn't conform to the schema?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Agreed. Issue created here.

  2. The schema states that the URL should automatically open the source information, and doesn't describe anything regarding data recency. In instances where the URL is not up-to-date but there is no known available alternative, we've still been including it in the catalogs, as this is a very common occurrence and deleting the source entirely seemed counterproductive for data access. Once historical URLs are implemented in the schema, we could include these URLs there instead for clarity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Currently, changing the name of a file would also trigger changing the direct download URL and break pipeline workflows. We could change the direct download URL structure so it’s non-descriptive (only the source ID for instance), or discuss other potential solutions.

Here, it's the latest dataset URL that is affected, not the direct download one (so it's system generated not provided by the user), but I agree that in any cases it should be stable enough that changing the filename won't break the pipeline.

  1. A direct download URL needs to be available for the source to be valid. I’m assuming there’s no available replacement since you left it empty? We have talked about adding feed_end_date to the schema to make it easier to see if the most recently available URL is still in use.

I think adding feed_end_date and feed_start_date would be a good idea. It would allow us not to update the latest url if the feed expiration date is reached. Also, we could always keep the direct download url after adding the historical data.

@e-lo
Copy link
Contributor Author

e-lo commented Apr 12, 2022

General suggestion:

  • requirements about not deleting or renaming files should be added as a github action "check"

This should still be deleted - but is a limitation of the definition.
fredericsimard pushed a commit that referenced this pull request Nov 22, 2022
* beaumont transit fix

* humboldt county fix

* santa maria regional transit name fix

* taft area transit fix

* southern nevada rtc fix

* la county shuttles agency name fix

* ladot fix

* sacramento regional tansit fix

* sf bay area water emergency fix

* sfmta fix

* samtrans fix

* santa barbara fix

* scmtd fix

* tulare county area fix

* yolo county fix

* add fares features

* replace duplicate go bus file

* switch replacements of 561 and 28 so URL of 28 is consistent
@emmambd
Copy link
Contributor

emmambd commented Nov 23, 2022

@e-lo These data updates were blocked until we found a safe method to change file names. Now that we know consumers are systematically refreshing the CSV, we can modify file names confidently and have updated the documentation to highlight file name changes. Because of this, these changes are now unblocked and have been merged in PR #239. (I made a separate PR for the sake of ease.) I am closing this one since the changes are already reflected.

@emmambd emmambd closed this Nov 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants