-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow schedule source to be specified #91
Conversation
This one was approved by the CI 👌 Should be generally in a good shape (I hope). It completes the osm2gtfs nicely, because now people don't have to look for the creator related schedule source files. They are now automatically getting downloaded and saved locally. Would be happy to see this getting in. Thanks! |
I think the |
For me, the fundamental idea of osm2gtfs is to combine geo-information from OSM with time information (schedule) to produce a GTFS. Not having a schedule to feed in and to rely on hard coded time information in the trip creator (like currently implemented for Accra) is in my opinion a very edge case, and should not be encouraged by making |
Some information for potential reviewers: Check the two different commits in this PR. They are reflecting two different changes and are probably easier to review first separately:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this MR. Looks very good already. Added some minor comments to this review.
osm2gtfs/core/cache.py
Outdated
hard drive. | ||
|
||
""" | ||
if not os.path.isdir('data'): # checking is the dir exist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's a typo in the comment, but the method name would be clear enough for my taste.
osm2gtfs/core/cache.py
Outdated
Reads and returns an object (content) with an indicated name from a | ||
file on the hard drive. | ||
|
||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you should add that it returns an empty dict
if the file wasn't found?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
osm2gtfs/core/cache.py
Outdated
""" | ||
if not os.path.isdir('data'): # checking is the dir exist | ||
os.mkdir('data') | ||
with open('data/' + name, 'wb') as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While that works on Linux, it is better to use os.path.join()
here.
osm2gtfs/core/cache.py
Outdated
|
||
""" | ||
if os.path.isfile('data/' + name): | ||
with open('data/' + name, 'rb') as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While that works on Linux, it is better to use os.path.join()
here. You can also introduce a variable for the path, since you use it twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks, done in the whole class.
osm2gtfs/core/configuration.py
Outdated
|
||
if 'schedule_source' not in self.config: | ||
sys.stderr.write("Error: No schedule source specified in config file.\n") | ||
sys.exit(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like @nlehuby I prefer this to return None
instead of bailing out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still strongly in favour of explicitly opting out. This goes down to the very basic vision of the script, e.g. look at osm2gtfs' subtitle Turn OpenStreetMap data and schedule information into GTFS.
To have an implementation that accepts by default configuration files without any schedule source for me is not a good practice, kind of against the idea of GTFS and encourages to produce low quality GTFS. I do think it should be possible to do so, but then you have to explicitly state this in the configuration file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But weren't advocating for putting as much information into OSM? For frequency based regions, I don't see why they shouldn't define the frequency of a line in OSM and don't have to bother to specify this source.
If you allow None
here, you might as well allow it to not be there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, as much information that is also supposed to be in OSM. And especially to stick to the standards by the community. Ffull schedule information is clearly not desired to be entered in OpenStreetMap. For frequency based networks, there is no accepted nor proposed tag (standard), that I'm aware of. And I doubt it would find acceptance.
In other words, it is not expected and mostly not wanted to add any time information (about public transport) to OSM. Unfortunately, I must say, because I also think it would be nice to add it, but I respect the decisions and state of discussion of the OSM community. Exactly because of this rationale, I'm advocating to provide a default behaviour in osm2gtfs to get the schedule in, at the moment when it is right and good to feed it in; and not to encourage hacks.
Maybe we can agree on a compromise:
- We make
schedule_source
optional. - The standard
schedule_creator
(not the configuration class) throws an error, if there is no file specified.
This would basically follow your opinion, but (at least for the standard creator) do the same checks, as it does currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good compromise to me. @nlehuby?
osm2gtfs/core/configuration.py
Outdated
|
||
# Open file and add to config object | ||
with open(self.config['schedule_source'], 'r') as schedule_source_file: | ||
self._schedule_source_file = schedule_source_file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this leak the file object outside the with
environment and leave it open?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, is there any better way to do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure it is a good idea to pass file objects around this way. Why not passing its content around instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because, we don't know at this point the format of the schedule file (nor we want to enforce any particular format). It can be json, it can be csv or any other format, which then will be used in the (overridden) schedule_creator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. In this case, maybe just use a string and leave it to the creator to interpret it. So it can handle file opening and closing itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went for reading the files already at this point, but not decoding it. This is necessary to do proper caching and it avoids passing around any file objects. Hopefully this is also fine?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks!
osm2gtfs/core/creator_factory.py
Outdated
if 'selector' in config: | ||
self.selector = config['selector'] | ||
if 'selector' in self.config.config: | ||
self.selector = self.config.config['selector'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
config.config looks confusing. Why does this happen now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the config
object of Configuration
contains an attribute config
with the processed information from the configuration file. In this PR I started giving the full config
object to the creators, because some of them might want to use functions attached to Configuration
, for example the newly introduced schedule_creator
does this, to load the schedule information when needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to do this, you might want to rethink the naming at the same time, so people reading the code have a better change at knowing what is what.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about config.data
? Or is this too generic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess data
is fine, don't have a better idea right now. You forgot to adapt the param
docstrings where the Configuration object is passed, btw.
The ScheduleCreator loads, validates and prepares configuration data from | ||
the schedule source file for further use in the script. | ||
|
||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you could at the URL to the documentation here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
Thanks for the review, @grote ❤️ |
Because of other changes, this discussion and question got collapsed. So, I'm posting it here again. It is about the question about the issue risen by @nlehuby in one of the first comments - whether I proposed a compromise:
@grote answered and asked:
|
This pull request is now at the stage to be tested again - all feedback has been incorporated. The (in case @nlehuby disagreed, the implementation could be adjusted easily) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing the last comments. I made another round.
Looks like you didn't run the accra unit tests, did you? I get FAILED (failures=1, errors=2)
osm2gtfs/core/creator_factory.py
Outdated
print "Schedule creator: " + selector.capitalize() | ||
return schedule_creator_override(self.config) | ||
except ImportError: | ||
print "Schedule creator: Default" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's two python2 prints here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not only two, all of the prints in this class. Changed now.
osm2gtfs/core/configuration.py
Outdated
Loads a schedule source file from either a path or a url specified | ||
in the config file | ||
|
||
:return schedule_source_file: The schedule source file object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still talking about file object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, taken out.
osm2gtfs/core/cache.py
Outdated
Reads and returns an object (content) with an indicated name from a | ||
file on the hard drive. | ||
|
||
:return file: The opened file object, or an empty dictionary in case no |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still talking about file object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, taken out.
|
||
# Get Fenix data from JSON file | ||
json_data = [] | ||
with open('data/linhas.json') as f: | ||
schedule_source = self.config.get_schedule_source() | ||
with open(schedule_source) as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you test this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this part I was not able to test. Fenix doesn't run through on my machine, same problem as on master. So, I reverted the changes on this part. Loads the linhas.json file from location as before..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just applied a fix for Fenix. At least it runs now. But with all these stops queries I can still not test until then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works as before (even better with the fix), so this is hopefully not a blocker to merge this in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I finally had the chance to have enough stops cached to get until this point in code execution in Fenix. And I suggest to use the standard schedule_creator
for loading the file. This allows us to remove the block on opening and reading the file in the custom fenix trips_creator
and replace it with a simple linhas = data.schedule['data']
.
This change is now in the PR.
Thanks, @grote! I tested with Accra, but you are right, I didn't run the tests. There were some minor issues with the tests, that I fixed now. |
When running this branch, I still get:
Also, there's an unused import lying around still. |
A tip for testing with fenix: If you hit overpass API, disable the stop name search. This is mostly not necessary for testing. |
Oh, yes, |
I can confirm that Fenix works now (although the feed looks way smaller than list time I tried it, need to look into that some day). The Accra tests also pass. However, Incofer from @jamescr crashes. I think before merging in more PRs, we should first add more tests. Otherwise, we are going to screw things up I'm afraid. Also, manually testing these things is too much work for my taste. |
Not nice, to draw this line now 😢 (Incofer fails on master as it does on this one) |
Oh, I didn't know that. I indeed get the same error message on So for me this PR is good to go in.
I am not drawing a line, but merely suggest that we should maybe focus our energies on adding more tests first before merging even more complex PRs. This one already had some issue we only spotted with code review and manual testing. The others will most likely be even more disruptive. Not all creator maintainers have always time available to review PRs on short notice to make sure their creators are unaffected. Having more tests would reduce the burden on everybody, because we could rely on the CI to tell us whether something broke or not. |
Yes, I also think, the better and more test coverage we have the better we can maintain this growing script! Probably we should consider to require creators with respective tests, as Accra wonderfully did, from all network implementers. I suggest to create a separate issue to discuss this with all implementers(?)/maintainers(?)/sponsors(?) of the network creators. |
We have #75 for that. I just increased its priority. I am just leaving this one open for a bit longer in case one of the others wants to review as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my inline comments. I just gave it a quick reading, because I'm not very familiar with the three creators already in master
.
osm2gtfs/core/configuration.py
Outdated
source_file = self.data['schedule_source'] | ||
cached_file = 'schedule-source-' + self.data['selector'] | ||
|
||
# Preferably return cached data about routes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: shouldn't it say cached data about schedule
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks for noting.
rep += str(self.config) + " | " | ||
return rep | ||
|
||
def add_schedule_to_data(self, schedule, data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The schedule argument will be overridden in the next line and is never passed (see osm2gtfs.py
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, yes, you are right. fixed.
osm2gtfs/osm2gtfs.py
Outdated
@@ -44,58 +46,62 @@ def main(): | |||
elif args.refresh_stops: | |||
data.get_stops(refresh=True) | |||
sys.exit(0) | |||
elif args.refresh_schedule_source: | |||
config.get_schedule_source(refresh=True) | |||
sys.exit(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you sys.exit(0)
here? Couldn't the script continue to run as in the case of args.refresh_all
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this is a good question. I follow your point, removed all sys.exit(0)
from the other refresh calls also.
Thanks for the review @ialokim! |
Rebased after the latest pull request which has been accepted: #93. |
All green - all good. Any chance to get it in, anytime soon? As a side note: Mainly because of the renaming request (from #80), which is implemented in this PR - not having it inside osm2gtfs causes currently significant extra work to keep this in-sync with the other enhancements we are heavily working on to get Managua and Esteli ready. |
For me it is fine, just waiting a bit with merging to give others a chance to review. You should be able to rebase other work on top of this branch in the meantime. |
Yes, and what I'm trying to express is that it hurts (last sentences of the article):
|
The same kind of advise applies to force pushing. There's not really a difference between force pushing to this branch and rebasing another branch on this one. Both is easier when you are the only person working on these branches though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me it is fine, just waiting a bit with merging to give others a chance to review.
For me too now, reviewed another time.
Thanks for the review @prhod! |
Wuhuu! Thanks all! |
Proposed solution for #80.
schedule
object tofeed
variable name in osm2gtfsschedule_source
in the configuration file. Can be either a local path, a url to download any file to be used in the creators to get the time information in, or a "None" (like in Accra) to omit processing of any schedule file.--refresh-schedule-source
argument to re-download a cached schedule file.schedule_creator
, which allows the input file to be load, verified and even manipulated - if needed. The simple standardschedule_creator
takes a regular json file, but others could be supported with an overriddenschedule_creator_*
)schedule_creator
; make Fenix rely on theschedule_source
information of the config file.