[WIP] support streaming osm files through libexpat #55
Conversation
@yeesian try running it with http://docs.julialang.org/en/latest/stdlib/profile/#direct-analysis-of-memory-allocation to identify the worst spots. Also consider disabling GC selectively if it's just generating millions of small objects in a tight loop.
I've tried disabling GC selectively, but haven't yet gotten it to work. On the few occasions where it seemed promising, it blew up in memory; I'll try profiling as you suggest. (By the way, if you're looking for a smaller file to play with, @tedsteiner has a small instance on Dropbox, mostly for testing.)
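For reference, here is a minimal sketch of the two suggestions above, assuming Julia 0.3-era APIs (the --track-allocation flag and gc_disable/gc_enable); the loop body is just a stand-in for the real parsing work:

# Run as: julia --track-allocation=user script.jl
# then inspect the *.mem files written next to each source file.
function sum_lengths(strings)
    gc_disable()                        # pause GC around an allocation-heavy loop
    total = 0
    try
        for s in strings
            total += length(string(s, "!"))   # deliberately allocates a new string
        end
    finally
        gc_enable()                     # always re-enable GC, even on error
    end
    return total
end

sum_lengths(["node", "way", "relation"])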
Sorry for the noise, I did a ...
Thought you all might be interested to know: here's the latest comparison when reading an OSM file that's ~320 MB (extracts available here):
julia> @time nodes, highways, buildings, features = getOSMData(filename, nodes=true, highways=true, buildings=true, features=true);
elapsed time: 77.810447564 seconds (7885139620 bytes allocated, 70.46% gc time)
julia> @time nodes, highways, buildings, features = getOSMData(filename, true);
elapsed time: 85.566212155 seconds (10574148240 bytes allocated, 65.05% gc time)
👍
Excellent!
I'm wondering now ...
@yeesian I think letting users select what they want parsed should be on the roadmap. I'm not certain keyword arguments are a granular enough solution in the long term, but I haven't thought about it much. Anyway, I'd be for implementing something now, or just leaving the design open to adding it later.
Mm, agreed. I think it's worth having on the roadmap, but it's not an immediate priority (at least to me). I'd prefer to get the other aspects of the parsing (e.g. relations/turn restrictions) right first, before introducing conditional filtering. That said, if you wish, feel free to take a shot! One way to implement it would be simply to discard the irrelevant parts during parsing.
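As a hedged illustration of that idea (not this PR's actual code -- ParseOptions and want_element are made-up names), discarding could amount to checking the caller's flags in the element-start handler and skipping unwanted elements entirely:

# Hypothetical sketch -- ParseOptions and want_element are illustrative
# names, not OpenStreetMap.jl internals.
type ParseOptions
    nodes::Bool
    highways::Bool
    buildings::Bool
    features::Bool
end

function want_element(opts::ParseOptions, name::ASCIIString)
    name == "node" && return opts.nodes
    # Ways need their tags before they can be split into highways/buildings,
    # so they are always kept at this stage.
    return true
end

opts = ParseOptions(true, true, false, false)
want_element(opts, "node")       # -> true
want_element(opts, "relation")   # -> true (not filtered in this sketch)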
Also, I'm starting to wonder if a ...
Yep, not trying to suggest someone work on it now. I care about turn restrictions more, too -- that sounds great.
Yes, that sounds good -- returning a tuple that varies based on values of the inputs seems like a bit of an anti-pattern.
Oh no, just thought of this -- I often parse only nodes and highways -- instead of a 10% slowdown, it will be much slower in cases like that. I'm not sure what @tedsteiner does in his work, but he's the heaviest user, so I'll defer to him. In the meantime I'll run a quick test and report numbers -- and if it turns out people need it a little faster before merging into master, perhaps we could all commit to a feature branch to get it to parity as soon as possible.
If the numbers are significant, I'll have a look to see what I can do in this PR. I'm guessing that it might be a challenge though, since (i) the number of features is usually small, and (ii) you need to parse the tags to disambiguate between a building and a highway, which happens only after parsing their child elements. That said, it's always nice to have benchmarks to aim for (:
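To make that ordering concrete, here is an illustrative sketch (PendingWay and classify are hypothetical names, not this PR's handlers): in a streaming parse, a way can only be classified as a highway or a building once its tag children have been read, so the decision has to happen at the closing element:

# Hypothetical sketch -- PendingWay and classify are illustrative names.
type PendingWay
    id::Int
    tags::Dict{ASCIIString,ASCIIString}
end

function classify(way::PendingWay)
    haskey(way.tags, "highway")  && return :highway
    haskey(way.tags, "building") && return :building
    return :other
end

way = PendingWay(1, Dict{ASCIIString,ASCIIString}())
way.tags["building"] = "yes"    # only known after the <tag> children are parsed
classify(way)                   # -> :building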
I tried a couple of extracts, and I'm seeing performance like you mentioned for parsing everything, and performance 2-3x slower for parsing just the nodes and highways, e.g.:
➜ ~ julia-dev -e 'include("osmp.jl"); time_fun(tree4)'
elapsed time: 38.797660288 seconds (3637203160 bytes allocated, 34.39% gc time)
➜ ~ julia-dev -e 'include("osmp.jl"); time_fun(tree4)'
elapsed time: 30.45888161 seconds (3637203160 bytes allocated, 43.91% gc time)
➜ ~
➜ ~ julia-dev -e 'include("osmp.jl"); time_fun(tree2)'
elapsed time: 11.748997784 seconds (792707816 bytes allocated, 4.55% gc time)
➜ ~ julia-dev -e 'include("osmp.jl"); time_fun(tree2)'
elapsed time: 11.620226646 seconds (792707816 bytes allocated, 4.68% gc time)
➜ ~
Switched to branch 'streaming'
➜ OpenStreetMap git:(streaming)
➜ ~
➜ ~ julia-dev -e 'include("osmp.jl"); time_fun(stream4)'
elapsed time: 31.529423398 seconds (6290493080 bytes allocated, 40.69% gc time)
➜ ~ julia-dev -e 'include("osmp.jl"); time_fun(stream4)'
elapsed time: 31.566031387 seconds (6290493080 bytes allocated, 40.51% gc time)
➜ ~
➜ ~ julia-dev -e 'include("osmp.jl"); time_fun(stream2)'
elapsed time: 31.150185125 seconds (6290493112 bytes allocated, 40.87% gc time)
➜ ~ julia-dev -e 'include("osmp.jl"); time_fun(stream2)'
elapsed time: 31.565353016 seconds (6290493112 bytes allocated, 41.32% gc time)
Notes:
Thanks for the benchmarks! Guess I should not have collapsed the commits; I can revert back to having both parsers tomorrow, but I'd imagine it would be confusing for anyone using OpenStreetMap to suddenly see the memory requirement spike (when they are in fact asking for fewer OSM objects), so we'll have to be careful with the documentation. 2-3x slower does sound like quite a bit though (is it 10 vs. 30 seconds?); I guess I should look into possibilities for speeding the parser up, since I've only worked with nodes and highways myself so far.
@yeesian I think a note in the documentation is enough there, but... Do you think having two parsers to keep in sync will slow down development efforts to implement other parsing? If so, I'd be happy to take slower parsing in exchange for a better graph (as long as the new parsing functionality comes with tests, anyone who needed faster parsing would have a clear path). @tedsteiner, what's more important to you right now?
Sorry about being away for a while, this looks awesome, @yeesian! @garborg TL;DR: I'd lean towards Option B. My general thoughts are that:
That all sounds right to me, so I guess the one worry is that we'd be switching away from a tested package under the JuliaLang organization to an untested package that fewer people have their eyes on -- on the other hand, that's easy enough to remedy, and switching and adding testing upstream sounds like something that needs to happen eventually anyway.
I opened an issue out of curiosity (not to wait on) to see if support for the streaming API was on anyone's LightXML.jl roadmap.
FWIW, LibExpat also has some tests, even if it's not very well documented. I'm okay with having support for multiple parsers, but we don't yet have a parser-agnostic interface (apart from ...).
My hunch is: this package will probably profit more from understanding/modelling the different OSM entities (e.g. #56) and their relationships before we embark on the work of conditional filtering. As it is, we're already leaving out a lot of information in the OSM file that other people might be interested in. Moving forward, I think it might be nice (from time to time) to abstract out the backends a bit.
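For illustration only -- this is a hedged guess, and none of these names exist in the package -- such a backend abstraction could be as simple as dispatching on a backend type:

# Hypothetical sketch: OSMBackend, LibExpatBackend, LightXMLBackend, and
# parse_osm are made-up names, not existing APIs.
abstract OSMBackend
immutable LibExpatBackend <: OSMBackend end
immutable LightXMLBackend <: OSMBackend end

# Default to the streaming LibExpat backend; each backend adds its own method.
parse_osm(filename::String) = parse_osm(filename, LibExpatBackend())
parse_osm(filename::String, ::LibExpatBackend) = "parsed $filename via LibExpat"
parse_osm(filename::String, ::LightXMLBackend) = "parsed $filename via LightXML"

parse_osm("map.osm")    # -> "parsed map.osm via LibExpat"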
Sounds like we're all on the same page. Good to see the package has tests -- it will be quick work to get ... Thanks again for getting this working!
Compatibility / testing PR submitted upstream: JuliaIO/LibExpat.jl#23
Nice, thanks!
@yeesian @garborg Is there a reason why string fields in the OSM types are restricted to ASCIIString? Could it be that the XML tags are being read as different types on different architectures? It looks like that test is passing on the Travis Linux system.
@tedsteiner I believe I've run tests and some downstream code on OSX since this was merged, but perhaps only on 0.4 -- I'll see how 0.3 is looking here on my laptop and let you know.
Apologies, yeah, it was originally just ...
@tedsteiner On the latest commit (367af58), tests pass for me on the heads of both Julia release-0.3 and master on OS X, as they did on Travis for Linux. I'm not sure why it's giving you trouble. By all means, we can loosen up the type if it's getting in the way of your work, but I'm generally a fan of letting the type system work its magic and generate code as specialized as possible, even when it's not the main bottleneck.
@garborg Below is the error message I'm getting on OS X. I was able to run the tests on my Linux machine without any problems last night. Does anything stand out to you? It seems to be reading all of the tags as UTF8Strings.
julia> Pkg.test("OpenStreetMap")
Huh, that seems like a bug in Julia 0.3.5... Let me download the app and see if I can reproduce. Perhaps we managed to trigger that "Convertacolypse" error. Probably unrelated, but I just tagged a new version of LibExpat, so it's probably worth running Pkg.update() to get Yeesian's speedup and dual 0.3 and 0.4 support.
So here's something weird. First, I tried loading data with streets including Chinese characters, to see if that triggered the tags to be read as UTF8Strings. I got the same error. Then, I tried doing the following:
The test fails with the error message above, but then ... Running ...
Shoot, I can't replicate this running the 0.3.5 app with OpenStreetMap checked out, with LibExpat on either v0.0.6 (newly tagged) or v0.0.4 (the last 0.3-compatible version)... frustrating. At this point, I'd say switch to using String for now, but I just checked and it looks like that adds 25% overhead.
Do you know what version of libexpat you're using? My Mac has ...
Now I'll admit I don't really understand LibExpat.jl, but here's something that does concern me a little. In streaming.jl, it looks like the elements are passed around as a ... Since we're the ones introducing the type conflict, I think the bug is in OpenStreetMap.jl, not LibExpat.jl. If we want a robust speed-up, I think we need to get LibExpat.jl to be able to guarantee that its output is an ASCIIString.
I think that robustness needs to be the top priority, even though I want to parse files 25% faster. This is a pretty serious bug for anyone who shares my situation (they won't be able to read OSM data), and I don't want this to be a new user's first experience with OpenStreetMap.jl. It seems like the options are:
Thoughts? I'm leaning towards Option 1 until Option 2 becomes available.
I'll lean towards using ...
That sounds good. For the record, though, there shouldn't be anything stopping us from changing the code right back once the issue disappears: I don't think any of our code is the issue; it seems like a configuration issue, or that somehow Ted's machine (but not mine or Travis) is triggering the tuple/convert bug that's been messing with 0.3 and 0.4 for so long. @tedsteiner The line you linked to in LibExpat.jl is not a problem -- there's a ...
@garborg Yes, it's not a problem for LibExpat.jl, but it is a problem for us, right? Because LibExpat is happy to work with either ASCIIString or UTF8String, but we're only content when it delivers an ASCIIString. It seems risky for us to assume it's a UTF8String coming through, when it could be either.
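A small Julia 0.3-era illustration of why the concrete annotation is brittle (Tag is a made-up type, not one of the package's): a field typed ASCIIString only accepts a UTF8String whose contents happen to be ASCII, and even that relies on convert working correctly:

# Hypothetical type for illustration only.
type Tag
    key::ASCIIString
end

Tag("highway")          # ok: the literal is already an ASCIIString
Tag(utf8("highway"))    # ok only if convert(ASCIIString, ...) behaves
Tag(utf8("straße"))     # throws: non-ASCII data cannot become an ASCIIString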
We're forcing conversion from ...

type A
    x::Int
end

A(1.0) # -> A(1)

which is not ideal for files that will have only ... The fact that the convert function is broken for you suggests a known scary bug in Base Julia where using a library or user code that creates a pair or a tuple (in certain, but legitimate, ways) can magically break ... Here's my package status running off a clean install of OpenStreetMap -- older versions of some packages, like Color, could trigger the bug, so if we're lucky, you just have an old version of one of these packages:
Otherwise, I give up, and perhaps we can switch back after it fixes itself someday, or perhaps getting BinDeps up for LibExpat would be a good side project for one of us.
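If anyone picks that up, here is a rough, hedged sketch of what a deps/build.jl using BinDeps might look like (written from memory of the 0.3-era BinDeps API, so treat the details as assumptions):

# Hedged sketch of deps/build.jl for LibExpat (details are assumptions).
using BinDeps

@BinDeps.setup

expat = library_dependency("libexpat", aliases = ["libexpat", "libexpat-1"])

provides(AptGet, "libexpat1-dev", expat)   # Debian/Ubuntu
provides(Yum, "expat-devel", expat)        # RHEL/Fedora

@BinDeps.install [:libexpat => :libexpat]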
Yes! That's it! I had Color pinned at an old version to try to fix that devil bug from a long time ago. I went through and cleaned out some other packages I don't use anymore, either, and now I can parse with ...
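For anyone who hits the same wall, the fix boiled down to un-pinning the stale package and updating; a minimal sketch with the old Pkg API:

julia> Pkg.free("Color")          # un-pin the old version
julia> Pkg.update()               # pull current versions, including LibExpat
julia> Pkg.test("OpenStreetMap")  # re-run the package tests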
Signed-off-by: Ted Steiner <tsteiner2@gmail.com>
Awesome! That is a really nasty bug; luckily people are chipping away at it, and fingers crossed it's fixed completely soon. Glad it wasn't indicative of a LibExpat issue that we'd have to keep worrying about. Sorry it took me so long to think of the fix.
Just for comparison, when reading an OSM file that's ~320 MB (extracts available here):
However, garbage collection is making it very slow, and I'm not sure how to improve the performance. Any suggestions?
@garborg, @mlubin, @IainNZ, @joehuchette
Some numbers: