Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow schedule source to be specified #91

Merged
merged 2 commits into from
Dec 14, 2017
Merged

Allow schedule source to be specified #91

merged 2 commits into from
Dec 14, 2017

Conversation

pantierra
Copy link
Contributor

@pantierra pantierra commented Nov 29, 2017

Proposed solution for #80.

  • Renamed (transitfeed) schedule object to feed variable name in osm2gtfs
  • Introduced a schedule_source in the configuration file. Can be either a local path, a url to download any file to be used in the creators to get the time information in, or a "None" (like in Accra) to omit processing of any schedule file.
  • Introduced a --refresh-schedule-source argument to re-download a cached schedule file.
  • Introduced a schedule_creator, which allows the input file to be load, verified and even manipulated - if needed. The simple standard schedule_creator takes a regular json file, but others could be supported with an overridden schedule_creator_*)
  • Added source file information in the configuration file for each of the existing creators
  • Make Incofer use the standard schedule_creator; make Fenix rely on the schedule_source information of the config file.

@pantierra pantierra requested a review from grote November 29, 2017 17:09
@pantierra
Copy link
Contributor Author

This one was approved by the CI 👌 Should be generally in a good shape (I hope). It completes the osm2gtfs nicely, because now people don't have to look for the creator related schedule source files. They are now automatically getting downloaded and saved locally. Would be happy to see this getting in. Thanks!

@nlehuby
Copy link
Collaborator

nlehuby commented Dec 1, 2017

I think the schedule_source should be optional (instead of mandatory, with a value of None to be ignored).
This would be more coherent with the other optional parameters (start_date).
What do you think ?

@pantierra
Copy link
Contributor Author

pantierra commented Dec 3, 2017

For me, the fundamental idea of osm2gtfs is to combine geo-information from OSM with time information (schedule) to produce a GTFS. Not having a schedule to feed in and to rely on hard coded time information in the trip creator (like currently implemented for Accra) is in my opinion a very edge case, and should not be encouraged by making schedule_source optional. However this edge case has been taken into consideration in this pull request and is supported, by opting-in explicitly and specifying None.

@pantierra
Copy link
Contributor Author

pantierra commented Dec 4, 2017

Some information for potential reviewers: Check the two different commits in this PR. They are reflecting two different changes and are probably easier to review first separately:

Copy link
Owner

@grote grote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this MR. Looks very good already. Added some minor comments to this review.

hard drive.

"""
if not os.path.isdir('data'): # checking is the dir exist
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a typo in the comment, but the method name would be clear enough for my taste.

Reads and returns an object (content) with an indicated name from a
file on the hard drive.

"""
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you should add that it returns an empty dict if the file wasn't found?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

"""
if not os.path.isdir('data'): # checking is the dir exist
os.mkdir('data')
with open('data/' + name, 'wb') as f:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While that works on Linux, it is better to use os.path.join() here.


"""
if os.path.isfile('data/' + name):
with open('data/' + name, 'rb') as f:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While that works on Linux, it is better to use os.path.join() here. You can also introduce a variable for the path, since you use it twice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks, done in the whole class.


if 'schedule_source' not in self.config:
sys.stderr.write("Error: No schedule source specified in config file.\n")
sys.exit(0)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like @nlehuby I prefer this to return None instead of bailing out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still strongly in favour of explicitly opting out. This goes down to the very basic vision of the script, e.g. look at osm2gtfs' subtitle Turn OpenStreetMap data and schedule information into GTFS.

To have an implementation that accepts by default configuration files without any schedule source for me is not a good practice, kind of against the idea of GTFS and encourages to produce low quality GTFS. I do think it should be possible to do so, but then you have to explicitly state this in the configuration file.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But weren't advocating for putting as much information into OSM? For frequency based regions, I don't see why they shouldn't define the frequency of a line in OSM and don't have to bother to specify this source.

If you allow None here, you might as well allow it to not be there.

Copy link
Contributor Author

@pantierra pantierra Dec 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as much information that is also supposed to be in OSM. And especially to stick to the standards by the community. Ffull schedule information is clearly not desired to be entered in OpenStreetMap. For frequency based networks, there is no accepted nor proposed tag (standard), that I'm aware of. And I doubt it would find acceptance.

In other words, it is not expected and mostly not wanted to add any time information (about public transport) to OSM. Unfortunately, I must say, because I also think it would be nice to add it, but I respect the decisions and state of discussion of the OSM community. Exactly because of this rationale, I'm advocating to provide a default behaviour in osm2gtfs to get the schedule in, at the moment when it is right and good to feed it in; and not to encourage hacks.

Maybe we can agree on a compromise:

  • We make schedule_source optional.
  • The standard schedule_creator (not the configuration class) throws an error, if there is no file specified.

This would basically follow your opinion, but (at least for the standard creator) do the same checks, as it does currently.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good compromise to me. @nlehuby?


# Open file and add to config object
with open(self.config['schedule_source'], 'r') as schedule_source_file:
self._schedule_source_file = schedule_source_file
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this leak the file object outside the with environment and leave it open?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, is there any better way to do it?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure it is a good idea to pass file objects around this way. Why not passing its content around instead?

Copy link
Contributor Author

@pantierra pantierra Dec 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because, we don't know at this point the format of the schedule file (nor we want to enforce any particular format). It can be json, it can be csv or any other format, which then will be used in the (overridden) schedule_creator

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. In this case, maybe just use a string and leave it to the creator to interpret it. So it can handle file opening and closing itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went for reading the files already at this point, but not decoding it. This is necessary to do proper caching and it avoids passing around any file objects. Hopefully this is also fine?!

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks!

if 'selector' in config:
self.selector = config['selector']
if 'selector' in self.config.config:
self.selector = self.config.config['selector']
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config.config looks confusing. Why does this happen now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the config object of Configuration contains an attribute config with the processed information from the configuration file. In this PR I started giving the full config object to the creators, because some of them might want to use functions attached to Configuration, for example the newly introduced schedule_creator does this, to load the schedule information when needed.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to do this, you might want to rethink the naming at the same time, so people reading the code have a better change at knowing what is what.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about config.data? Or is this too generic?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess data is fine, don't have a better idea right now. You forgot to adapt the param docstrings where the Configuration object is passed, btw.

The ScheduleCreator loads, validates and prepares configuration data from
the schedule source file for further use in the script.

"""
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you could at the URL to the documentation here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@pantierra
Copy link
Contributor Author

Thanks for the review, @grote ❤️

@pantierra
Copy link
Contributor Author

Because of other changes, this discussion and question got collapsed. So, I'm posting it here again. It is about the question about the issue risen by @nlehuby in one of the first comments - whether schedule_source should be optional.

I proposed a compromise:

  • We make schedule_source optional.
  • The standard schedule_creator (not the configuration class) throws an error, if there is no file specified.

This would basically follow your opinion, but (at least for the standard creator) do the same checks, as it does currently.

@grote answered and asked:

Sounds like a good compromise to me. @nlehuby?

@pantierra
Copy link
Contributor Author

pantierra commented Dec 7, 2017

This pull request is now at the stage to be tested again - all feedback has been incorporated. The schedule_source is currently implemented as proposed: It is optional in the config file, but the standard schedule_creator checks for a respective json file. For Accra the schedule_creator has been overridden to not process any source.

(in case @nlehuby disagreed, the implementation could be adjusted easily)

Copy link
Owner

@grote grote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the last comments. I made another round.

Looks like you didn't run the accra unit tests, did you? I get FAILED (failures=1, errors=2)

print "Schedule creator: " + selector.capitalize()
return schedule_creator_override(self.config)
except ImportError:
print "Schedule creator: Default"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's two python2 prints here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not only two, all of the prints in this class. Changed now.

Loads a schedule source file from either a path or a url specified
in the config file

:return schedule_source_file: The schedule source file object.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still talking about file object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, taken out.

Reads and returns an object (content) with an indicated name from a
file on the hard drive.

:return file: The opened file object, or an empty dictionary in case no
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still talking about file object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, taken out.


# Get Fenix data from JSON file
json_data = []
with open('data/linhas.json') as f:
schedule_source = self.config.get_schedule_source()
with open(schedule_source) as f:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you test this?

Copy link
Contributor Author

@pantierra pantierra Dec 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this part I was not able to test. Fenix doesn't run through on my machine, same problem as on master. So, I reverted the changes on this part. Loads the linhas.json file from location as before..

Copy link
Contributor Author

@pantierra pantierra Dec 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just applied a fix for Fenix. At least it runs now. But with all these stops queries I can still not test until then.

Copy link
Contributor Author

@pantierra pantierra Dec 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works as before (even better with the fix), so this is hopefully not a blocker to merge this in.

Copy link
Contributor Author

@pantierra pantierra Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I finally had the chance to have enough stops cached to get until this point in code execution in Fenix. And I suggest to use the standard schedule_creator for loading the file. This allows us to remove the block on opening and reading the file in the custom fenix trips_creator and replace it with a simple linhas = data.schedule['data'].

This change is now in the PR.

@pantierra
Copy link
Contributor Author

pantierra commented Dec 7, 2017

Thanks, @grote! I tested with Accra, but you are right, I didn't run the tests. There were some minor issues with the tests, that I fixed now.

@pantierra pantierra mentioned this pull request Dec 7, 2017
@grote
Copy link
Owner

grote commented Dec 8, 2017

When running this branch, I still get:

Traceback (most recent call last):
  File "/home/user/.local/bin/osm2gtfs", line 11, in <module>
    load_entry_point('osm2gtfs', 'console_scripts', 'osm2gtfs')()
  File "/home/user/vcs/osm2gtfs/osm2gtfs/osm2gtfs.py", line 72, in main
    stops_creator.add_stops_to_feed(feed, data)
  File "/home/user/vcs/osm2gtfs/osm2gtfs/creators/stops_creator.py", line 35, in add_stops_to_feed
    self.add_stop(feed, elem)
  File "/home/user/vcs/osm2gtfs/osm2gtfs/creators/stops_creator.py", line 49, in add_stop
    stop_dict["stop_id"] = str(stop.osm_id)
AttributeError: 'Stop' object has no attribute 'osm_id'

Also, there's an unused import lying around still.

@grote
Copy link
Owner

grote commented Dec 8, 2017

A tip for testing with fenix: If you hit overpass API, disable the stop name search. This is mostly not necessary for testing.

@pantierra
Copy link
Contributor Author

Oh, yes, osm_id is already from #96, fixed and tested with Fenix.

@grote
Copy link
Owner

grote commented Dec 8, 2017

I can confirm that Fenix works now (although the feed looks way smaller than list time I tried it, need to look into that some day). The Accra tests also pass.

However, Incofer from @jamescr crashes. I think before merging in more PRs, we should first add more tests. Otherwise, we are going to screw things up I'm afraid. Also, manually testing these things is too much work for my taste.

@pantierra
Copy link
Contributor Author

pantierra commented Dec 8, 2017

I think before merging in more PRs, we should first add more tests.

Not nice, to draw this line now 😢 (Incofer fails on master as it does on this one)

@grote
Copy link
Owner

grote commented Dec 8, 2017

(Incofer fails on master as it does on this one)

Oh, I didn't know that. I indeed get the same error message on master.

So for me this PR is good to go in.

Not nice, to draw this line now 😢

I am not drawing a line, but merely suggest that we should maybe focus our energies on adding more tests first before merging even more complex PRs. This one already had some issue we only spotted with code review and manual testing. The others will most likely be even more disruptive.

Not all creator maintainers have always time available to review PRs on short notice to make sure their creators are unaffected. Having more tests would reduce the burden on everybody, because we could rely on the CI to tell us whether something broke or not.

@pantierra
Copy link
Contributor Author

Yes, I also think, the better and more test coverage we have the better we can maintain this growing script! Probably we should consider to require creators with respective tests, as Accra wonderfully did, from all network implementers.

I suggest to create a separate issue to discuss this with all implementers(?)/maintainers(?)/sponsors(?) of the network creators.

@grote
Copy link
Owner

grote commented Dec 8, 2017

We have #75 for that. I just increased its priority.

I am just leaving this one open for a bit longer in case one of the others wants to review as well.

Copy link
Contributor

@ialokim ialokim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my inline comments. I just gave it a quick reading, because I'm not very familiar with the three creators already in master.

source_file = self.data['schedule_source']
cached_file = 'schedule-source-' + self.data['selector']

# Preferably return cached data about routes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: shouldn't it say cached data about schedule?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks for noting.

rep += str(self.config) + " | "
return rep

def add_schedule_to_data(self, schedule, data):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schedule argument will be overridden in the next line and is never passed (see osm2gtfs.py).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, yes, you are right. fixed.

@@ -44,58 +46,62 @@ def main():
elif args.refresh_stops:
data.get_stops(refresh=True)
sys.exit(0)
elif args.refresh_schedule_source:
config.get_schedule_source(refresh=True)
sys.exit(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you sys.exit(0) here? Couldn't the script continue to run as in the case of args.refresh_all?

Copy link
Contributor Author

@pantierra pantierra Dec 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this is a good question. I follow your point, removed all sys.exit(0) from the other refresh calls also.

@pantierra
Copy link
Contributor Author

Thanks for the review @ialokim!

@pantierra pantierra removed the request for review from jamescr December 11, 2017 13:00
@pantierra
Copy link
Contributor Author

pantierra commented Dec 11, 2017

Rebased after the latest pull request which has been accepted: #93.

@pantierra
Copy link
Contributor Author

pantierra commented Dec 11, 2017

All green - all good. Any chance to get it in, anytime soon?

As a side note: Mainly because of the renaming request (from #80), which is implemented in this PR - not having it inside osm2gtfs causes currently significant extra work to keep this in-sync with the other enhancements we are heavily working on to get Managua and Esteli ready.

@grote
Copy link
Owner

grote commented Dec 11, 2017

For me it is fine, just waiting a bit with merging to give others a chance to review.

You should be able to rebase other work on top of this branch in the meantime.

@pantierra
Copy link
Contributor Author

pantierra commented Dec 11, 2017

You should be able to rebase other work on top of this branch in the meantime.

Yes, and what I'm trying to express is that it hurts (last sentences of the article):

Don’t rebase branches you have shared with another developer.
Do otherwise at your own peril - yes, it will hurt.

@grote
Copy link
Owner

grote commented Dec 11, 2017

The same kind of advise applies to force pushing. There's not really a difference between force pushing to this branch and rebasing another branch on this one. Both is easier when you are the only person working on these branches though.

Copy link
Contributor

@ialokim ialokim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me it is fine, just waiting a bit with merging to give others a chance to review.

For me too now, reviewed another time.

@pantierra pantierra removed the request for review from AltNico December 12, 2017 23:56
@grote
Copy link
Owner

grote commented Dec 14, 2017

Thanks for the review @prhod!

@grote grote merged commit 6abf0ad into grote:master Dec 14, 2017
@pantierra pantierra deleted the schedule_source branch December 14, 2017 15:57
@pantierra
Copy link
Contributor Author

Wuhuu! Thanks all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants