Read html tables into DataFrames #3477

Merged May 3, 2013 · 1 commit

Conversation

cpcloud
Member

@cpcloud cpcloud commented Apr 28, 2013

This PR adds new functionality for reading HTML tables from a URI, string, or file-like object into a DataFrame.
#3369

@cpcloud
Member Author

cpcloud commented Apr 28, 2013

@y-p Ready for those notes whenever you are.

@ghost

ghost commented Apr 28, 2013

First of all, thanks for all the work. Much appreciated, especially being so thorough.
Here are some points, mostly I guess for consistency in the API; we can discuss
what works best. I looked at the code a couple of days ago, so my apologies if
some points are now stale:

  1. The io arg might be a URL, in which case it should do the right thing. pd.read_csv does this
     (I voted this down originally and I was wrong). See the helper in io.parsers._read for a free lunch.
  2. I agree with @jreback that you should return a list of DataFrames, and my vote goes for
     returning a singleton list even when a match is specified or only one table matches. I see this
     as pd.read_html(). The user is then free to index the returned list, so I suggest jettisoning
     table_number. (See the usage sketch after this list.)
  3. The docstring note on "expect some cleanup" is somewhat opaque. What does it mean?
  4. df.applymap currently bombs on duplicate labels (df.applymap duplicates data when frame has dupe columns, #2786); need a workaround for now.
  5. _get_skiprows_iter is not actually used anywhere.
  6. It's not clear what the other legal values for flavor are.
  7. Dedicating all kwargs to be passed off somewhere is a back-compat killer,
     since if we need to change something later on we can't add anything to the signature;
     make it a named dict arg instead (perhaps yanked by name from **kwargs).
     Also, it doesn't make sense if you're using lxml directly, so if the user specifies them
     but uses a different engine, that should probably raise.
  8. Name bikeshedding: s/convert/infer_types?
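
For illustration, a minimal usage sketch of the API proposed in point 2 (the URL and match text are placeholders; this assumes the function lands as pd.read_html and always returns a list):

# hypothetical usage of the proposed API: a list of DataFrames comes back,
# even when only one table matches (URL and match text are placeholders)
import pandas as pd

dfs = pd.read_html('http://example.com/page.html', match='Some caption')
df = dfs[0]   # the user indexes the list; no table_number argument needed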

@cpcloud
Member Author

cpcloud commented Apr 28, 2013

  1. URLs are now handled properly, insofar as lxml can handle them (lxml only handles the file, http, and ftp protocols). For bs4 there is a little bit of kludge around inferring the type of the input, since bs4 doesn't handle it as transparently as lxml.
  2. TODO
  3. TODO
  4. not used anymore
  5. used in _rows_to_frame
  6. flavor is gone because there's no reason to use bs4 if you have lxml (that might be too opinionated :)
  7. TODO
  8. TODO

One thing I realized is glaringly obvious: I haven't done thorough testing using bs4. I need to remove the CI line that installs lxml and run travis again.

@ghost

ghost commented Apr 28, 2013

  1. The referenced read_ wraps everything up into a .read()-able object, and abstracts that
     away from the underlying parser.

Are you aware that you can use detox/tox_prll.sh and scripts/use_build_cache.py to
do all the testing locally and usually much faster?

@cpcloud
Member Author

cpcloud commented Apr 28, 2013

Oh ok thanks. Didn't know that. Great.

@ghost

ghost commented Apr 28, 2013

see #3156.

Read the instructions at the top of scripts/use_build_cache.py (create the dir, export the env var), run scripts/use_build_cache.py, which will monkey-patch setup.py, then run tox/tox_prll.sh/detox
to launch the build; the second time around you only pay for running the tests.
It runs py2.6/2.7/3.2/3.3.

@cpcloud
Member Author

cpcloud commented Apr 28, 2013

I think I will print that out as a poster and frame it, it's so helpful.

@cpcloud
Member Author

cpcloud commented Apr 28, 2013

The **kwargs in the case of lxml builds an expression that does what bs4 would do in that case; see _build_node_xpath_expr. Is that a good idea? It's sort of a hack, but I thought that was a "nice-to-have" from bs4. Basically, [@attr1='value1' and @attr2='value2' and ... and @attrN='valueN'] is what is built.
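
Roughly, the construction looks like the sketch below (a simplified illustration, not the PR's _build_node_xpath_expr itself; the helper name and attrs are made up):

# simplified sketch of building an XPath attribute predicate from a dict of
# attrs (hypothetical helper name; not the actual _build_node_xpath_expr)
def build_attrs_expr(attrs):
    # {'id': 'spam', 'class': 'wide'} -> "[@id='spam' and @class='wide']"
    conditions = " and ".join("@{0}={1!r}".format(k, v) for k, v in attrs.items())
    return "[{0}]".format(conditions)

expr = "//table" + build_attrs_expr({'id': 'spam', 'class': 'wide'})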

@ghost

ghost commented Apr 28, 2013

It's fine; it should be a dict that's passed in, and it should not take over the entire kwargs by definition,
because that paints us into a corner with respect to future changes. Just a general rule.

so not:

def a(foo, **kwargs):
    # everything in **kwargs is forwarded wholesale to the delegate
    my_delegate(**kwargs)

but:

def a(foo, **kwargs):
    # pull out only the named dict intended for the delegate
    dkwds = kwargs.get("lxml_kwds", {})
    my_delegate(**dkwds)

@cpcloud
Member Author

cpcloud commented Apr 28, 2013

@y-p Maybe I'm being a bit thick here, but _read tries to parse a delimited text file and thus fails when I pass in a URL/string/HTML blob. Is there something I'm missing? It doesn't return a readable object; it actually tries to do the reading.

@ghost

ghost commented Apr 28, 2013

Yeah, my bad. TextFileReader is actually a line-oriented, delimited file reader,
not an IO source abstraction, sorry.

@cpcloud
Member Author

cpcloud commented Apr 28, 2013

Oh wow, tox where have you been all my life :D

@ghost

ghost commented Apr 29, 2013

You've obviously spent considerable time on this and it shows: clean code, great docstrings. very nice.

in the "spam" example, I see that bs4 does a much better jobs then lxml in extracting
text from nested tags, while lxml's ".text" doesn't go the extra mile.
I actually think the flavor argument was a good choice, since getting better data back trumps
speed for many website scraping situations. It's great that you did both, so might as well let the
user enjoy whichever solves his/her problem best.

Also, have you checked the license on all the HTML pages you included in the PR?
Perhaps it's best to just leave everything as a @network test.

@cpcloud
Member Author

cpcloud commented Apr 29, 2013

@y-p Cool. Thanks for the compliments. Will add flavor back in. Also, I'm tidying up the docstrings a bit; some of them are slightly inaccurate now. And yep, while bs4 is slower than lxml, it's more intuitive and generally gives back "what you want". I also wanted the challenge of learning some XPath along the way, so I thought: why not do both?

@cpcloud
Member Author

cpcloud commented Apr 29, 2013

spam data set license

@cpcloud
Member Author

cpcloud commented Apr 29, 2013

failed banklist data set is automagically part of the public domain

@cpcloud
Member Author

cpcloud commented Apr 29, 2013

Anyway both are public domain, and I've only included those in the tests that are not @network so they're okay.

@cpcloud
Member Author

cpcloud commented Apr 30, 2013

Build is failing because of a failed HDFStore test. I haven't touched that code. Investigating...

@cpcloud
Member Author

cpcloud commented Apr 30, 2013

This is especially strange since I don't get the failure locally (I've rebased on upstream/master)

@jreback
Contributor

jreback commented Apr 30, 2013

I have seen occasional failures (not on this particular one), but I can never reproduce them... probably some wonkiness with travis.

Just rebase (squash some commits together in any event) and try again (btw, add PTF in your commit message to make it run faster).

@cpcloud
Member Author

cpcloud commented Apr 30, 2013

ok. just did git commit --amend -C HEAD to create a new hash a la @y-p's tip in #3156. @jreback What is PTF?

@cpcloud
Member Author

cpcloud commented Apr 30, 2013

Program temporary fix?

@jreback
Contributor

jreback commented Apr 30, 2013

Please Travis Faster... a @y-p "hack" that makes your whole testing run take about 4 minutes (it uses cached builds), but you have to put it in a particular commit.

@cpcloud
Member Author

cpcloud commented Apr 30, 2013

ah ok. thanks.

@cpcloud
Member Author

cpcloud commented May 1, 2013

I think I'm done twiddling here. Sorry about that, just wanted to be totally thorough.

@ghost

ghost commented May 1, 2013

no problem, thanks for addressing all the issues. marking for merge in 0.11.1.

@ghost

ghost commented May 2, 2013

Need to tweak the docstring to mention that io accepts URLs. I suggest you make bs4 the default for better results
out of the box and an easier install; I think the typical use case for parsing HTML values quality
over speed.

@ghost

ghost commented May 2, 2013

calling with the spam url now works with lxml but fails with flavor=bs4.

@cpcloud
Member Author

cpcloud commented May 2, 2013

@y-p can u post the error you're getting? The tests run successfully for both flavors over here.

@cpcloud
Member Author

cpcloud commented May 2, 2013

Also the io docstring mentions that it accepts urls. Should I add more detail? Would be happy to do that.

@cpcloud
Member Author

cpcloud commented May 2, 2013

@y-p Cool, thanks for the heads up. It's failing because you didn't provide a match, which I didn't write a test for. Doing that now and fixing up the bs4 issues with no match. It should return a list of DataFrames of all tables if no match is provided.

  • bs4 doesn't deal well with no match provided. fix that
  • make bs4 the default
  • update flavor docstring
  • add tests for failing parameter values
  • make docstrings even more awesome

The issue was that bs4 needs to be told to find the tables first, with attrs, and then search for text within the found tables, if any (it fails if none are found before searching for text, as per the docstring), whereas lxml can do it all at once via XPath.
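
Roughly, the difference between the two code paths is the following (a simplified sketch, not the PR's actual code; the file name, CSS class, and match text are placeholders):

import re
from bs4 import BeautifulSoup
import lxml.html

html = open('page.html').read()   # placeholder input

# bs4: two steps, find candidate tables by attrs first, then filter by text
soup = BeautifulSoup(html)
tables = soup.find_all('table', attrs={'class': 'dataframe'})
matched = [t for t in tables if t.find(text=re.compile('Metcalf'))]

# lxml: a single XPath expression does the attribute and text match at once
doc = lxml.html.fromstring(html)
matched = doc.xpath("//table[@class='dataframe']"
                    "[.//text()[re:test(., 'Metcalf')]]",
                    namespaces={'re': 'http://exslt.org/regular-expressions'})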

@cpcloud
Member Author

cpcloud commented May 2, 2013

@y-p This is ready to go.

@ghost

ghost commented May 3, 2013

Getting ready to merge this. What happens if match is provided and more than one table matches?

@cpcloud
Member Author

cpcloud commented May 3, 2013

@y-p A list of DataFrames is returned. That was the agreed upon API, no? I can write a test quickly.

use __import__ on 2.6

extra code from previous merges and importlib failure on tests

fix bs4 issues with no match provided

docstring storm!

markup slow tests (bank list data) and add tests for failing parameter values
PTF

ok that is really it for docstring mania

add test for multiple matches
@ghost

ghost commented May 3, 2013

Yes, that's fine. It's hard to tell it's actually returning a singleton list, because the DataFrame repr
expands.

@cpcloud
Member Author

cpcloud commented May 3, 2013

Okay, cool. Test was added to show this. There are some examples in the documentation as well that note this.

@ghost ghost merged commit b31c033 into pandas-dev:master May 3, 2013
@ghost

ghost commented May 3, 2013

merged as 6518c79. great functionality, thank you.

@cpcloud
Member Author

cpcloud commented May 3, 2013

No problem :) I had fun hacking it out. Would love to contribute more.

@ghost ghost mentioned this pull request May 4, 2013
@jreback
Contributor

jreback commented May 6, 2013

Not a big deal, but a lot of your tests fail if you don't have bs4 installed. I installed it and all was fine (I had lxml installed, though). Is this correct?

@cpcloud
Member Author

cpcloud commented May 6, 2013

Hm no. There's a function that raises nose.SkipTest after trying to import bs4 first, then lxml. Let me check this out.
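
Something along these lines (a hypothetical sketch of such a skip helper; the actual function in the PR may differ):

import nose

# hypothetical import-or-skip helper (names are made up, not the PR's code)
def _skip_if_no(module_name):
    try:
        __import__(module_name)
    except ImportError:
        raise nose.SkipTest('%s not installed, skipping' % module_name)

def test_banklist_bs4():
    _skip_if_no('bs4')   # skip instead of fail when bs4 is missing
    pass                 # actual test body would go here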

@cpcloud
Member Author

cpcloud commented May 6, 2013

Okay, got it. Wasn't skipping correctly.

@cpcloud
Member Author

cpcloud commented May 6, 2013

Damn, not good. bs4's fallback parser (html.parser) is not very lenient, so tests are failing because Python's parser yields different results from lxml (which bs4 uses by default if available, thus hiding this bug; I should have tested by mutually excluding lxml and bs4).

@cpcloud
Member Author

cpcloud commented May 6, 2013

fixing now....

@cpcloud
Member Author

cpcloud commented May 6, 2013

I will have this patch by this evening. It will support all of the parsers that bs4 supports.

@jreback
Contributor

jreback commented May 6, 2013

@cpcloud
this looks spurious to me...?

https://travis-ci.org/pydata/pandas/jobs/6929645

@cpcloud
Member Author

cpcloud commented May 6, 2013

Should I just raise the caught exception here instead? UnicodeDecodeError occurs for some URLs with lxml (spam was one of them), and IOError is thrown for invalid URLs, so I try to parse from a string if it's not a URL; if that doesn't work I check that the URL protocol is valid (for bs4), and otherwise the only thing I can think of is that it's a faulty connection. Am I missing anything here? Also, why the heck isn't that parsing? Is it possible that travis has a timeout on things using the network, or just on processes in general (obviously the OS has that)? Is there an additional mechanism?

@cpcloud
Member Author

cpcloud commented May 6, 2013

OK, I just decided to catch UnicodeDecodeError, and otherwise catch Exception and re-raise if the string is a URL that does not have a valid protocol.
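
In outline, that error handling looks something like this (a rough sketch only; the function names and protocol list are placeholders, not the PR's exact code):

# rough sketch of the error handling described above (placeholder names)
_VALID_PROTOCOLS = ('http://', 'https://', 'ftp://', 'file://')

def _read_and_parse(io):
    try:
        return _parse(io)              # _parse is a placeholder for the real parser
    except UnicodeDecodeError:
        raise                          # some URLs trip this under lxml; surface it as-is
    except Exception:
        if '://' in io and not io.startswith(_VALID_PROTOCOLS):
            raise                      # a URL with an unsupported protocol: re-raise
        raise ValueError('could not parse %r, possibly a bad connection' % io)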

@cpcloud
Member Author

cpcloud commented May 6, 2013

Current state of read_html

Compatibility Issues (check mark means something sane is returned)

  • lxml
  • bs4 + lxml
  • bs4 + html5lib (actually gives more "valid" results since it returns a valid HTML5 page from a garbage page)

Additional Documentation

  • Doc about what to do in the event of failure (will probably just say install both bs4 and lxml)
  • Doc the important user-facing results of the decisions made below.
    • The default column index now consists of the <th> elements of the <thead> element of a table, otherwise pandas defaults take over.
    • Any <tfoot> elements will be left as rows. An additional parameter, called footer, can be used to signify that the footer should be given a different index in the frame.
    • <colgroup> elements will not be processed (even though they could potentially be used to create a row MultiIndex).
  • bs4 parsing now checks for duplicate <table> elements using Python's set objects. So, for example if a bunch of strings are found that have the same <table> parent only one will be returned since their parent tables will all hash to the same value.
  • elements that have empty strings are no longer removed
  • infer_types now defaults to False.
  • <th> and <td> elements will be parsed from the header of a table.

Parsing Issues

  • deal with <thead>/<tfoot> elements that lack children (the issue here is that some HTML contains <thead> elements with no <tr> elements (but possibly containing text), thus the parser might leave out a row that is valid upon visual inspection [fewer surprises for users]). As per the HTML spec for <tfoot> elements, and the spec for <thead> elements, these elements can contain only 0 or more <tr> elements (obviously, this ignores nested tables, which I'm not going to deal with). This issue (of empty <thead>/<tfoot> elements) is then easily dealt with by parsing <thead>, <tbody> and <tfoot> elements separately (see the sketch after this list), which brings me to the next issue...
  • Decide what to do with <tfoot> elements.
    • There's only one allowed in valid HTML; however, I will not raise an error if there is more than one, since most of the parsers (if you're using Python >= 2.7) will accept this (will add a test for this).
    • If there is a <thead> element should a MultiIndex in the columns be created with the name 'footer'? I think no here.
    • If there's no <thead> but a <tfoot>, should <tfoot> become the column index? Again, I think no is the most practical choice here.
    • I think the least surprising option here would be to keep <tfoot> elements as rows, BUT
      • Use standard indexing (integer based or something else if the user provides a related option)
      • Label the row containing the individual elements of the <tfoot> elements 'footer'.
  • Make <thead> elements the default column index of the resulting DataFrame(s). Motivated by the default of read_csv. The header parameter in read_html would override this behavior.
  • Include Ground Truth™ data in banklist.csv to compare against parsed data from banklist.html
  • No need to parse colgroup elements since those are only used for formatting.
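
As referenced in the first item above, a simplified sketch of pulling the three sections out separately with lxml (placeholder code, not the PR's implementation; table is assumed to be an lxml.html <table> element):

# simplified sketch: parse <thead>, <tbody> and <tfoot> rows separately
# (placeholder code; `table` is assumed to be an lxml.html <table> element)
def _section_rows(table, tag):
    # cell text, row by row, for one section; empty list if the section is absent
    return [[cell.text_content().strip() for cell in tr.xpath('./th|./td')]
            for tr in table.xpath('./%s/tr' % tag)]

header = _section_rows(table, 'thead')
body   = _section_rows(table, 'tbody')   # rows directly under <table> need similar handling
footer = _section_rows(table, 'tfoot')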

Weirdness

  • The 'Gold Canyon' row from the failed bank list is not parsed by lxml (and, by definition, not by the bs4 + lxml combo either); fix this. This is really annoying. This single element won't parse... arghhhhhhhh!!

Performance Comparison via vbench

  • For users who are concerned about the relative performance of the 3 different parsers.

@wesm
Member

wesm commented May 8, 2013

Thanks guys for taking on this one, really great addition to pandas

@cpcloud
Member Author

cpcloud commented May 9, 2013

@y-p @jreback @wesm Any preferences regarding <tfoot> treatment? Thanks.

@jreback
Contributor

jreback commented May 9, 2013

Is the footer on a table usually a 'summary' type of row or something? One possibility is to return it as a separate dataframe (as you are already returning a list of frames). You could also have an option to 'control' this, footer='separate'?

@cpcloud
Member Author

cpcloud commented May 9, 2013

Yep, but they behave just like a regular row. Wouldn't it be more consistent to return a Series, since there can only be one unless someone is nesting other tables in their HTML? I'll probably go with the footer='separate' option.

@cpcloud
Member Author

cpcloud commented May 15, 2013

I've decided not to support the native Python parsing library. It's god-awful (read: not lenient enough for the real world).

@cpcloud
Member Author

cpcloud commented May 15, 2013

@jreback @y-p Is there a way to map a callable over the elements of a specific block type?

@jreback
Contributor

jreback commented May 15, 2013

convert in ObjectBlock

@jreback
Contributor

jreback commented May 15, 2013

What I mean is: look at the convert method, which iterates over items.

@cpcloud
Member Author

cpcloud commented May 15, 2013

@jreback @y-p What do you guys think about this named footer thing? I think it's not worth it... you can check out the branch if you want; it's called read-html-named-footer.
