Read html tables into DataFrames #3477

Merged May 3, 2013 · 1 commit

Conversation

cpcloud
Member

@cpcloud cpcloud commented Apr 28, 2013

This PR adds new functionality for reading HTML tables from a URI, string, or file-like object into a DataFrame.
#3369

@cpcloud
Member Author

cpcloud commented Apr 28, 2013

@y-p Ready for those notes whenever you are.

@ghost

ghost commented Apr 28, 2013

First of all, thanks for all the work. Much appreciated, especially being so thorough.
Here are some points, mostly I guess for consistency in the API; we can discuss
what works best. I looked at the code a couple of days ago, so my apologies if
some points are now stale:

  1. The io arg might be a URL, in which case it should do the right thing. pd.read_csv does this
     (I voted this down originally and I was wrong). See the helper in io.parsers._read for a free lunch.
  2. I agree with @jreback that you should return a list of DataFrames, and my vote goes for
     returning a singleton list even when a match is specified or only one table matches. I see this
     as pd.read_html(). The user is then free to index the returned list, so I suggest jettisoning
     table_number. (See the usage sketch after this list.)
  3. The docstring note on "expect some cleanup" is somewhat opaque. What does it mean?
  4. df.applymap currently bombs on duplicate labels (df.applymap duplicates data when frame has dupe columns, #2786); need a workaround for now.
  5. _get_skiprows_iter is not actually used anywhere.
  6. It's not clear what the other legal values for flavor are.
  7. Dedicating all kwargs to be passed off somewhere is a back-compat killer,
     since if we need to change something later on we can't add anything to the signature;
     make it a named dict arg instead (perhaps yanked by name from **kwargs).
     Also, it doesn't make sense if you're using lxml directly, so if the user specifies them
     but uses a different engine, that should probably raise.
  8. Name bikeshedding: s/convert/infer_types?
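
For illustration, a minimal usage sketch of the API proposed in point 2 (the URL and match text are placeholders; this assumes the function lands as pd.read_html and always returns a list):

# hypothetical usage of the proposed API: a list of DataFrames comes back,
# even when only one table matches (URL and match text are placeholders)
import pandas as pd

dfs = pd.read_html('http://example.com/page.html', match='Some caption')
df = dfs[0]   # the user indexes the list; no table_number argument needed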

@cpcloud
Member Author

cpcloud commented Apr 28, 2013

  1. URLs are now handled properly, insofar as lxml can handle them (lxml only handles the file, http, and ftp protocols). For bs4 there is a little bit of kludge around inferring the type of the input, since bs4 doesn't handle it as transparently as lxml.
  2. TODO
  3. TODO
  4. not used anymore
  5. used in _rows_to_frame
  6. flavor is gone because there's no reason to use bs4 if you have lxml (that might be too opinionated :)
  7. TODO
  8. TODO

One thing I realized is glaringly obvious: I haven't done thorough testing using bs4. I need to remove the CI line that installs lxml and run travis again.

@ghost

ghost commented Apr 28, 2013

  1. The referenced read_ wraps everything up into a .read()-able object, and abstracts that
     away from the underlying parser.

Are you aware that you can use detox/tox_prll.sh and scripts/use_build_cache.py to
do all the testing locally and usually much faster?

@cpcloud
Member Author

cpcloud commented Apr 28, 2013

Oh ok thanks. Didn't know that. Great.

@ghost

ghost commented Apr 28, 2013

see #3156.

Read the instructions at the top of scripts/use_build_cache.py (create the dir, export the env var), run scripts/use_build_cache.py, which will monkey-patch setup.py, then run tox/tox_prll.sh/detox
to launch the build; the second time around you only pay for running the tests.
It runs py2.6/2.7/3.2/3.3.

@cpcloud
Member Author

cpcloud commented Apr 28, 2013

I think I will print that out as a poster and frame it, it's so helpful.

@cpcloud
Member Author

cpcloud commented Apr 28, 2013

The **kwargs in the case of lxml builds an expression that does what bs4 would do in that case; see _build_node_xpath_expr. Is that a good idea? It's sort of a hack, but I thought that was a "nice-to-have" from bs4. Basically, [@attr1='value1' and @attr2='value2' and ... and @attrN='valueN'] is what is built.
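
Roughly, the construction looks like the sketch below (a simplified illustration, not the PR's _build_node_xpath_expr itself; the helper name and attrs are made up):

# simplified sketch of building an XPath attribute predicate from a dict of
# attrs (hypothetical helper name; not the actual _build_node_xpath_expr)
def build_attrs_expr(attrs):
    # {'id': 'spam', 'class': 'wide'} -> "[@id='spam' and @class='wide']"
    conditions = " and ".join("@{0}={1!r}".format(k, v) for k, v in attrs.items())
    return "[{0}]".format(conditions)

expr = "//table" + build_attrs_expr({'id': 'spam', 'class': 'wide'})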

@ghost

ghost commented Apr 28, 2013

It's fine; it should be a dict that's passed in, and it should not take over the entire kwargs by definition,
because that paints us into a corner with respect to future changes. Just a general rule.

so not:

def a(foo, **kwargs):
    # everything in **kwargs is forwarded wholesale to the delegate
    my_delegate(**kwargs)

but:

def a(foo, **kwargs):
    # pull out only the named dict intended for the delegate
    dkwds = kwargs.get("lxml_kwds", {})
    my_delegate(**dkwds)

@cpcloud
Member Author

cpcloud commented Apr 28, 2013

@y-p Maybe I'm being a bit thick here, but _read tries to parse a delimited text file and thus fails when I pass in a URL/string/HTML blob. Is there something I'm missing? It doesn't return a readable object; it actually tries to do the reading.

@ghost

ghost commented Apr 28, 2013

Yeah, my bad. TextFileReader is actually a line-oriented, delimited file reader,
not an IO source abstraction, sorry.

@cpcloud
Member Author

cpcloud commented Apr 28, 2013

Oh wow, tox where have you been all my life :D

@ghost

ghost commented Apr 29, 2013

You've obviously spent considerable time on this and it shows: clean code, great docstrings. very nice.

in the "spam" example, I see that bs4 does a much better jobs then lxml in extracting
text from nested tags, while lxml's ".text" doesn't go the extra mile.
I actually think the flavor argument was a good choice, since getting better data back trumps
speed for many website scraping situations. It's great that you did both, so might as well let the
user enjoy whichever solves his/her problem best.

Also, have you checked the license on all the HTML pages you included in the PR?
Perhaps it's best to just leave everything as a @network test.

@cpcloud
Member Author

cpcloud commented Apr 29, 2013

@y-p Cool. Thanks for the compliments. Will add flavor back in. Also, I'm tidying up the docstrings a bit; some of them are slightly inaccurate now. And yep, while bs4 is slower than lxml, it's more intuitive and generally gives back "what you want". I also wanted the challenge of learning some XPath along the way, so I thought: why not do both?

@cpcloud
Member Author

cpcloud commented Apr 29, 2013

spam data set license

@cpcloud
Member Author

cpcloud commented Apr 29, 2013

failed banklist data set is automagically part of the public domain

@cpcloud
Member Author

cpcloud commented Apr 29, 2013

Anyway both are public domain, and I've only included those in the tests that are not @network so they're okay.

@cpcloud
Member Author

cpcloud commented Apr 30, 2013

Build is failing because of a failed HDFStore test. I haven't touched that code. Investigating...

@cpcloud
Member Author

cpcloud commented Apr 30, 2013

This is especially strange since I don't get the failure locally (I've rebased on upstream/master)

@jreback
Contributor

jreback commented Apr 30, 2013

I have seen occasional failures (not on this particular one), but I can never reproduce them... probably some wonkiness with travis.

Just rebase (squash some commits together in any event) and try again (btw, add PTF in your commit message to make it run faster).

@cpcloud
Member Author

cpcloud commented Apr 30, 2013

ok. just did git commit --amend -C HEAD to create a new hash a la @y-p's tip in #3156. @jreback What is PTF?

@cpcloud
Member Author

cpcloud commented Apr 30, 2013

Program temporary fix?

@jreback
Contributor

jreback commented Apr 30, 2013

Please Travis Faster... a @y-p "hack" that makes your whole testing run take about 4 minutes (it uses cached builds), but you have to put it in a particular commit.

@cpcloud
Member Author

cpcloud commented Apr 30, 2013

ah ok. thanks.

@cpcloud
Member Author

cpcloud commented May 1, 2013

I think I'm done twiddling here. Sorry about that, just wanted to be totally thorough.

@ghost

ghost commented May 1, 2013

no problem, thanks for addressing all the issues. marking for merge in 0.11.1.

@ghost

ghost commented May 2, 2013

Need to tweak the docstring to mention that io accepts URLs. I suggest you make bs4 the default for better results
out of the box and an easier install; I think the typical use case for parsing HTML values quality
over speed.

@ghost

ghost commented May 2, 2013

calling with the spam url now works with lxml but fails with flavor=bs4.

@cpcloud
Member Author

cpcloud commented May 2, 2013

@y-p can u post the error you're getting? The tests run successfully for both flavors over here.

@cpcloud
Member Author

cpcloud commented May 2, 2013

Also the io docstring mentions that it accepts urls. Should I add more detail? Would be happy to do that.

@cpcloud
Member Author

cpcloud commented May 2, 2013

@y-p Cool, thanks for the heads up. It's failing because you didn't provide a match, which I didn't write a test for. Doing that now and fixing up the bs4 issues with no match. It should return a list of DataFrames of all tables if no match is provided.

  • bs4 doesn't deal well with no match provided. fix that
  • make bs4 the default
  • update flavor docstring
  • add tests for failing parameter values
  • make docstrings even more awesome

The issue was that bs4 needs to be told to find the tables first, with attrs, and then search for text within the found tables, if any (it fails if none are found before searching for text, as per the docstring), whereas lxml can do it all at once via XPath.
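
Roughly, the difference between the two code paths is the following (a simplified sketch, not the PR's actual code; the file name, CSS class, and match text are placeholders):

import re
from bs4 import BeautifulSoup
import lxml.html

html = open('page.html').read()   # placeholder input

# bs4: two steps, find candidate tables by attrs first, then filter by text
soup = BeautifulSoup(html)
tables = soup.find_all('table', attrs={'class': 'dataframe'})
matched = [t for t in tables if t.find(text=re.compile('Metcalf'))]

# lxml: a single XPath expression does the attribute and text match at once
doc = lxml.html.fromstring(html)
matched = doc.xpath("//table[@class='dataframe']"
                    "[.//text()[re:test(., 'Metcalf')]]",
                    namespaces={'re': 'http://exslt.org/regular-expressions'})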

@cpcloud
Member Author

cpcloud commented May 2, 2013

@y-p This is ready to go.

@ghost

ghost commented May 3, 2013

Getting ready to merge this. What happens if match is provided and more than one table matches?

@cpcloud
Member Author

cpcloud commented May 3, 2013

@y-p A list of DataFrames is returned. That was the agreed upon API, no? I can write a test quickly.

use __import__ on 2.6

extra code from previous merges and importlib failure on tests

fix bs4 issues with no match provided

docstring storm!

markup slow tests (bank list data) and add tests for failing parameter values
PTF

ok that is really it for docstring mania

add test for multiple matches
@ghost

ghost commented May 3, 2013

Yes, that's fine. It's hard to tell it's actually returning a singleton list, because the DataFrame repr
expands.

@cpcloud
Member Author

cpcloud commented May 3, 2013

Okay, cool. Test was added to show this. There are some examples in the documentation as well that note this.

@ghost ghost merged commit b31c033 into pandas-dev:master May 3, 2013
@ghost

ghost commented May 3, 2013

merged as 6518c79. great functionality, thank you.

@cpcloud
Member Author

cpcloud commented May 3, 2013

No problem :) I had fun hacking it out. Would love to contribute more.

@ghost ghost mentioned this pull request May 4, 2013
@jreback
Contributor

jreback commented May 6, 2013

Not a big deal, but a lot of your tests fail if you don't have bs4 installed. I installed it and all was fine (I had lxml installed, though). Is this correct?

@cpcloud
Member Author

cpcloud commented May 6, 2013

Hm no. There's a function that raises nose.SkipTest after trying to import bs4 first, then lxml. Let me check this out.
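
Something along these lines (a hypothetical sketch of such a skip helper; the actual function in the PR may differ):

import nose

# hypothetical import-or-skip helper (names are made up, not the PR's code)
def _skip_if_no(module_name):
    try:
        __import__(module_name)
    except ImportError:
        raise nose.SkipTest('%s not installed, skipping' % module_name)

def test_banklist_bs4():
    _skip_if_no('bs4')   # skip instead of fail when bs4 is missing
    pass                 # actual test body would go here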

@cpcloud
Member Author

cpcloud commented May 6, 2013

Okay, got it. Wasn't skipping correctly.

@cpcloud
Member Author

cpcloud commented May 6, 2013

Damn, not good. bs4's fallback parser (html.parser) is not very lenient, so tests are failing because Python's parser yields different results from lxml (which bs4 uses by default if available, thus hiding this bug; I should have tested by mutually excluding lxml and bs4).

@cpcloud
Member Author

cpcloud commented May 6, 2013

fixing now....

@cpcloud
Member Author

cpcloud commented May 6, 2013

I will have this patch by this evening. It will support all of the parsers that bs4 supports.

@jreback
Contributor

jreback commented May 6, 2013

@cpcloud
this looks spurious to me...?

https://travis-ci.org/pydata/pandas/jobs/6929645

@cpcloud
Member Author

cpcloud commented May 6, 2013

Should I just raise the caught exception here instead? UnicodeDecodeError occurs for some URLs with lxml (spam was one of them), and IOError is thrown for invalid URLs, so I try to parse from a string if it's not a URL; if that doesn't work I check that the URL protocol is valid (for bs4), and otherwise the only thing I can think of is that it's a faulty connection. Am I missing anything here? Also, why the heck isn't that parsing? Is it possible that travis has a timeout on things using the network, or just on processes in general (obviously the OS has that)? Is there an additional mechanism?

@cpcloud
Member Author

cpcloud commented May 6, 2013

OK, I just decided to catch UnicodeDecodeError, and otherwise catch Exception and re-raise if the string is a URL that does not have a valid protocol.
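
In outline, that error handling looks something like this (a rough sketch only; the function names and protocol list are placeholders, not the PR's exact code):

# rough sketch of the error handling described above (placeholder names)
_VALID_PROTOCOLS = ('http://', 'https://', 'ftp://', 'file://')

def _read_and_parse(io):
    try:
        return _parse(io)              # _parse is a placeholder for the real parser
    except UnicodeDecodeError:
        raise                          # some URLs trip this under lxml; surface it as-is
    except Exception:
        if '://' in io and not io.startswith(_VALID_PROTOCOLS):
            raise                      # a URL with an unsupported protocol: re-raise
        raise ValueError('could not parse %r, possibly a bad connection' % io)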

@cpcloud
Member Author

cpcloud commented May 6, 2013

Current state of read_html

Compatibility Issues (check mark means something sane is returned)

  • lxml
  • bs4 + lxml
  • bs4 + html5lib (actually gives more "valid" results since it returns a valid HTML5 page from a garbage page)

Additional Documentation

  • Doc about what to do in the event of failure (will probably just say install both bs4 and lxml)
  • Doc the important user-facing results of the decisions made below.
    • The default column index now consists of the <th> elements of the <thead> element of a table, otherwise pandas defaults take over.
    • Any <tfoot> elements will be left as rows. An additional parameter, called footer, can be used to signify that the footer should be given a different index in the frame.
    • <colgroup> elements will not be processed (even though they could potentially be used to create a row MultiIndex).
  • bs4 parsing now checks for duplicate <table> elements using Python's set objects. So, for example if a bunch of strings are found that have the same <table> parent only one will be returned since their parent tables will all hash to the same value.
  • elements that have empty strings are no longer removed
  • infer_types now defaults to False.
  • <th> and <td> elements will be parsed from the header of a table.

Parsing Issues

  • deal with <thead>/<tfoot> elements that lack children (the issue here is that some HTML contains <thead> elements with no <tr> elements (but possibly containing text), thus the parser might leave out a row that is valid upon visual inspection [fewer surprises for users]). As per the HTML spec for <tfoot> elements, and the spec for <thead> elements, these elements can contain only 0 or more <tr> elements (obviously, this ignores nested tables, which I'm not going to deal with). This issue (of empty <thead>/<tfoot> elements) is then easily dealt with by parsing <thead>, <tbody> and <tfoot> elements separately (see the sketch after this list), which brings me to the next issue...
  • Decide what to do with <tfoot> elements.
    • There's only one allowed in valid HTML; however, I will not raise an error if there is more than one, since most of the parsers (if you're using Python >= 2.7) will accept this (will add a test for this).
    • If there is a <thead> element should a MultiIndex in the columns be created with the name 'footer'? I think no here.
    • If there's no <thead> but a <tfoot>, should <tfoot> become the column index? Again, I think no is the most practical choice here.
    • I think the least surprising option here would be to keep <tfoot> elements as rows, BUT
      • Use standard indexing (integer based or something else if the user provides a related option)
      • Label the row containing the individual elements of the <tfoot> elements 'footer'.
  • Make <thead> elements the default column index of the resulting DataFrame(s). Motivated by the default of read_csv. The header parameter in read_html would override this behavior.
  • Include Ground Truth™ data in banklist.csv to compare against parsed data from banklist.html
  • No need to parse colgroup elements since those are only used for formatting.
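
As referenced in the first item above, a simplified sketch of pulling the three sections out separately with lxml (placeholder code, not the PR's implementation; table is assumed to be an lxml.html <table> element):

# simplified sketch: parse <thead>, <tbody> and <tfoot> rows separately
# (placeholder code; `table` is assumed to be an lxml.html <table> element)
def _section_rows(table, tag):
    # cell text, row by row, for one section; empty list if the section is absent
    return [[cell.text_content().strip() for cell in tr.xpath('./th|./td')]
            for tr in table.xpath('./%s/tr' % tag)]

header = _section_rows(table, 'thead')
body   = _section_rows(table, 'tbody')   # rows directly under <table> need similar handling
footer = _section_rows(table, 'tfoot')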

Weirdness

  • The 'Gold Canyon' row from the failed bank list is not parsed by lxml (and, by definition, not by the bs4 + lxml combo either); fix this. This is really annoying. This single element won't parse... arghhhhhhhh!!

Performance Comparison via vbench

  • For users who are concerned about the relative performance of the 3 different parsers.

@wesm
Member

wesm commented May 8, 2013

Thanks guys for taking on this one, really great addition to pandas

@cpcloud
Member Author

cpcloud commented May 9, 2013

@y-p @jreback @wesm Any preferences regarding <tfoot> treatment? Thanks.

@jreback
Contributor

jreback commented May 9, 2013

Is the footer on a table usually a 'summary' type of row or something? One possibility is to return it as a separate dataframe (as you are already returning a list of frames). You could also have an option to 'control' this, footer='separate'?

@cpcloud
Member Author

cpcloud commented May 9, 2013

Yep, but they behave just like a regular row. Wouldn't it be more consistent to return a Series, since there can only be one unless someone is nesting other tables in their HTML? I'll probably go with the footer='separate' option.

@cpcloud
Member Author

cpcloud commented May 15, 2013

I've decided not to support the native Python parsing library. It's god-awful (read: not lenient enough for the real world).

@cpcloud
Member Author

cpcloud commented May 15, 2013

@jreback @y-p Is there a way to map a callable over the elements of a specific block type?

@jreback
Contributor

jreback commented May 15, 2013

convert in ObjectBlock

@jreback
Contributor

jreback commented May 15, 2013

What I mean is: look at the convert method, which iterates over items.

@cpcloud
Member Author

cpcloud commented May 15, 2013

@jreback @y-p What do you guys think about this named footer thing? I think it's not worth it... you can check out the branch if you want; it's called read-html-named-footer.
