Read html tables into DataFrames #3477
Conversation
@y-p Ready for those notes whenever you are.
First of all, thanks for all the work. Much appreciated, especially being so thorough.
One thing I realized is glaringly obvious: I haven't done thorough testing using bs4. I need to remove the CI line that installs lxml and run travis again.
Are you aware that you can use detox/tox_prll.sh and
Oh ok thanks. Didn't know that. Great.
See #3156. Read the instructions at the top of scripts/use_build_cache.py
I think I will print that out as a poster and frame it, it's so helpful.
The **kwargs in the case of lxml builds an expression that does what bs4 would do in that case, see
It's fine; it should be a dict that's passed in, and not take over the entire kwargs by definition. So not:
but
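The design point being made above can be sketched as follows. This is a hypothetical illustration, not the actual pandas signature: the HTML attributes used to match a table should arrive as a single explicit dict argument rather than absorbing the function's entire `**kwargs`.

```python
# Hypothetical sketch of the API discussion above; the function names
# read_html_bad / read_html_good are illustrative, not real pandas code.

# Not this -- attrs swallows every keyword argument, so nothing else
# could ever be added to the signature without ambiguity:
def read_html_bad(io, match=".+", **attrs):
    ...

# But this -- attrs is one explicit dict parameter with a clear meaning:
def read_html_good(io, match=".+", attrs=None):
    attrs = attrs or {}
    # e.g. attrs={"id": "spam-table"} would be forwarded to the parser
    return attrs

print(read_html_good("page.html", attrs={"id": "spam-table"}))
```

Keeping `attrs` as a named dict also leaves room for future keyword arguments without changing the call convention.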
@y-p Maybe I'm being a bit thick here, but
Yeah, my bad, TextFileReader is actually a line-oriented, delimited file reader,
Oh wow, tox where have you been all my life :D
You've obviously spent considerable time on this and it shows: clean code, great docstrings. Very nice. In the "spam" example, I see that bs4 does a much better job than lxml in extracting. Also, have you checked the license on all the HTML pages you included in the PR?
@y-p Cool. Thanks for the compliments. Will add
The failed bank list data set is automagically part of the public domain
Anyway both are public domain, and I've only included those in the tests that are not |
Build is failing because of failed HDFStore test. I haven't touched that code. Investigating... |
This is especially strange since I don't get the failure locally (I've rebased on upstream/master) |
I have seen an occasional failure (not on this particular one), but can never reproduce them... prob some wonkiness with travis. Just rebase (squash some commits together in any event) and try again (btw, add PTF in your commit message to make it run faster)
Program temporary fix? |
Please Travis Faster... a y-p "hack" to make your whole testing run take about 4 min (it uses cached builds), but you have to put it in a particular commit
ah ok. thanks. |
I think I'm done twiddling here. Sorry about that, just wanted to be totally thorough. |
no problem, thanks for addressing all the issues. marking for merge in 0.11.1. |
need to tweak docstring to mention io accepts urls, suggest you make |
calling with the spam url now works with lxml but fails with |
@y-p can u post the error you're getting? The tests run successfully for both flavors over here. |
Also the io docstring mentions that it accepts urls. Should I add more detail? Would be happy to do that. |
@y-p Cool, thanks for the heads up. It's failing because you didn't provide a match, which I didn't write a test for. Doing that now and fixing up the bs4 issues with no match. It should return a list of DataFrames of all tables if no match is provided.
The issue was that bs4 needs to be told to find the tables first with attrs, and then search for text within the found tables, if any (it fails if none are found before searching for text, as per the docstring), whereas lxml can do it all at once via XPath.
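The two-step control flow described above can be sketched in plain Python. This is a dependency-free stand-in: in the real implementation the table search is done by `BeautifulSoup.find_all` (or an lxml XPath that combines both steps); here simple regexes play both roles so the sketch stays self-contained.

```python
import re

# Stdlib-only sketch of the bs4-style flow: (1) locate candidate tables
# by attributes, failing early if there are none; (2) only then filter
# them by a text-matching pattern. lxml would express both steps in a
# single XPath expression instead.

HTML = """
<table id="a"><tr><td>spam</td></tr></table>
<table id="b"><tr><td>eggs</td></tr></table>
"""

def find_tables(html, match):
    # Step 1: find all tables first (bs4: soup.find_all('table', attrs=...))
    tables = re.findall(r"<table.*?</table>", html, flags=re.S)
    if not tables:
        raise ValueError("no tables found")  # fail before the text search
    # Step 2: keep only tables whose text matches the pattern
    matched = [t for t in tables if re.search(match, t)]
    if not matched:
        raise ValueError(f"no tables matched {match!r}")
    return matched

print(len(find_tables(HTML, "spam")))  # only one table contains 'spam'
```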
@y-p This is ready to go. |
Getting ready to merge this. What happens if |
@y-p A list of
use __import__ on 2.6
extra code from previous merges and importlib failure on tests
fix bs4 issues with no match provided
docstring storm!
markup slow tests (bank list data) and add tests for failing parameter values PTF
ok that is really it for docstring mania
add test for multiple matches
yes, that's fine, it's hard to tell it's actually returning a singleton list, because the dataframe repr
Okay, cool. Test was added to show this. There are some examples in the documentation as well that note this. |
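The singleton-list behavior discussed above can be shown in a few lines. This assumes a working HTML parser backend (lxml or bs4) is installed; the HTML string here is a made-up minimal table.

```python
from io import StringIO
import pandas as pd

# read_html always returns a *list* of DataFrames, even when only one
# table is found -- which the DataFrame-like repr of a one-element list
# can obscure. Index into the list to get the frame itself.
html = "<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>"
dfs = pd.read_html(StringIO(html))
print(type(dfs).__name__, len(dfs))  # list 1
df = dfs[0]  # the actual DataFrame
```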
merged as 6518c79. great functionality, thank you. |
No problem :) I had fun hacking it out. Would love to contribute more. |
not a big deal, but a lot of your tests fail if you don't have bs4 installed... I installed it and all was fine (I had lxml installed though). Is this correct?
Hm no. There's a function that raises |
Okay, got it. Wasn't skipping correctly. |
Damn, not good. |
fixing now.... |
I will have this patch by this evening. It will support all of the parsers that bs4 supports. |
@cpcloud |
Should I just raise the caught exception here instead? UnicodeDecodeError occurs for some urls with lxml (spam was one of them) and IOError is thrown for invalid urls, so: try to parse from a string if it's not a url, and then if that doesn't work, check to make sure the url protocol is valid (for bs4); else the only thing I could think of is that it's a faulty connection. Am I missing anything here? Also, why the heck isn't that parsing? Is it possible that travis has a timeout on things using the network, or just on processes in general (obviously the OS has that): is there an additional mechanism?
ok just decided to catch |
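The url-vs-string fallback order described above can be sketched as follows. The function and constant names are illustrative, not the actual pandas internals: input with a recognized protocol is treated as a URL, anything else falls back to being parsed as a literal string.

```python
from urllib.parse import urlparse

# Illustrative sketch of the dispatch logic discussed above; the names
# VALID_SCHEMES and classify_input are made up for this example.

VALID_SCHEMES = ("http", "https", "ftp", "file")

def classify_input(obj: str) -> str:
    if urlparse(obj).scheme in VALID_SCHEMES:
        return "url"     # fetch over the network; may raise IOError
    return "string"      # fall back to parsing the text directly

print(classify_input("http://example.com/spam.html"))          # url
print(classify_input("<table><tr><td>1</td></tr></table>"))    # string
```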
Current state of
Thanks guys for taking on this one, really great addition to pandas |
Is the footer on a table usually a 'summary' type of row or something? One possibility is to return it as a separate dataframe (as you are already returning a list of frames). You could also have an option to 'control' this,
Yep, but they behave just like a regular row. Wouldn't it be more consistent to return a series, since there can only be one unless someone is nesting other tables in their html? I'll probably go with the
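The footer-splitting idea being weighed above can be sketched without any parser at all. This is a hedged illustration of the design choice, not pandas code: since a well-formed table has at most one `<tfoot>`, the footer can be peeled off the body rows and returned separately.

```python
# Illustrative sketch only: split_footer is a made-up helper showing the
# "return the footer separately" option discussed above. Rows are plain
# lists standing in for parsed <tr> contents.

def split_footer(rows, has_footer):
    """Return (body_rows, footer_row); footer_row is None without a <tfoot>."""
    if has_footer and rows:
        return rows[:-1], rows[-1]   # at most one footer row per table
    return rows, None

body, footer = split_footer([[1, 2], [3, 4], ["total", 6]], has_footer=True)
print(body, footer)
```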
I've decided not to support the native Python parsing library. It's god-awful (read: not lenient enough for the real world). |
convert in ObjectBlock |
What I mean is: look at the convert method, which iterates over items
This PR adds new functionality for reading HTML tables from a URI, string, or file-like object into a DataFrame.
#3369
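A minimal usage sketch of the functionality this PR adds, assuming an HTML parser backend (lxml or bs4) is installed; the HTML string and table ids here are made up for illustration. The `match` parameter filters tables by their text, and `attrs` filters by HTML attributes.

```python
from io import StringIO
import pandas as pd

# Two toy tables; we want only the one whose id is "eggs" and whose
# text matches the regex "eggs".
html = """
<table id="spam"><tr><th>item</th></tr><tr><td>spam</td></tr></table>
<table id="eggs"><tr><th>item</th></tr><tr><td>eggs</td></tr></table>
"""

# match: regex applied to table text; attrs: dict of HTML attributes
dfs = pd.read_html(StringIO(html), match="eggs", attrs={"id": "eggs"})
print(len(dfs))  # a list with the single matching table
```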