REF/BUG/ENH/API: refactor read_html to use TextParser #4770
👍 |
yep looks good |
also closes #4697 |
@cancan101 yep thanks. added |
@jtratner Nope it doesn't, I think I must've mixed up 79 and 97. Thanks |
this diff is very difficult to read ... sigh |
a 5-fir! |
I think you are trying to bump your stats with html files! lol |
@jreback yep
|
@jreback nah ... i'm trying to avoid slowing down the test suite with a bunch of |
raise Exception("invalid names passed _stack_arrays")
nitems, nstacked = len(items), len(stacked)
if nitems != nstacked:
    raise BadDataError('number of names in ref_items must equal the'
can you leave a note here that says "Caller must catch this error"
or I don't know, maybe not, just something like, if you think this could happen then you should catch this error and try to say something more meaningful.
done
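A minimal sketch of the pattern requested in this review thread, assuming Python 3. All names are stand-ins: `BadDataError` is defined locally here and is not the pandas class, and `check_names` and `build_frame` are hypothetical helpers. The low-level check raises a terse error; the caller catches it and re-raises with more context.

```python
class BadDataError(ValueError):
    """Stand-in for the error raised in the diff; not the pandas class."""


def check_names(items, stacked):
    # Low-level check: terse message, no knowledge of the caller's context.
    nitems, nstacked = len(items), len(stacked)
    if nitems != nstacked:
        raise BadDataError(
            "number of names ({0}) must equal the number of stacked "
            "arrays ({1})".format(nitems, nstacked)
        )


def build_frame(names, rows):
    # Hypothetical caller: catches the low-level error and reports it in
    # terms of what the user was actually trying to do.
    try:
        check_names(names, rows)
    except BadDataError as err:
        raise ValueError(
            "could not build table from parsed HTML: {0}".format(err)
        ) from err
    return dict(zip(names, rows))
```

Chaining with `from err` keeps the original error available as `__cause__`, so the low-level detail is not lost.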
I did a first pass. I'd probably like to go over it again and see what I see, but I'm sure it's good. As an aside, can you help me understand how ordering works for the outputted tables? My assumption is that the ordering is deterministic. Does it follow the order in the HTML data that's passed in? (i.e., if you found all the line numbers of |
@jtratner Re: ordering, see: #5029 (comment) and the subsequent comments. |
@cpcloud Maybe also add that note about table ordering to the html gotchas? |
looks ok. on tupleize_cols your explanation is odd - no other functions have it as true (by default) |
Probably a cycle in the HTML parse tree. Does copy-pasting just the table work? |
Surprised that site works... |
i think maybe a timeout parameter might be useful |
I interrupted and this is the trace:
/home/alex/git/pandas/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, infer_types, attrs, parse_dates, tupleize_cols, thousands)
838 'data (you passed a negative value)')
839 return _parse(flavor, io, match, header, index_col, skiprows, infer_types,
--> 840 parse_dates, tupleize_cols, thousands, attrs)
/home/alex/git/pandas/pandas/io/html.py in _parse(flavor, io, match, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands, attrs)
700
701 try:
--> 702 tables = p.parse_tables()
703 except Exception as caught:
704 retained = caught
/home/alex/git/pandas/pandas/io/html.py in parse_tables(self)
172
173 def parse_tables(self):
--> 174 tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
175 return (self._build_table(table) for table in tables)
176
/home/alex/git/pandas/pandas/io/html.py in _parse_tables(self, doc, match, attrs)
396 def _parse_tables(self, doc, match, attrs):
397 element_name = self._strainer.name
--> 398 tables = doc.find_all(element_name, attrs=attrs)
399
400 if not tables:
/usr/local/lib/python2.7/dist-packages/bs4/element.pyc in find_all(self, name, attrs, recursive, text, limit, **kwargs)
1165 if not recursive:
1166 generator = self.children
-> 1167 return self._find_all(name, attrs, text, limit, generator, **kwargs)
1168 findAll = find_all # BS3
1169 findChildren = find_all # BS2
/usr/local/lib/python2.7/dist-packages/bs4/element.pyc in _find_all(self, name, attrs, text, limit, generator, **kwargs)
483 # Optimization to find all tags with a given name.
484 elif isinstance(name, basestring):
--> 485 return [element for element in generator
486 if isinstance(element, Tag) and element.name == name]
487 else:
/usr/local/lib/python2.7/dist-packages/bs4/element.pyc in descendants(self)
1182 current = self.contents[0]
1183 while current is not stopNode:
-> 1184 yield current
1185 current = current.next_element
1186 |
Yep ... @jseabold had a similar issue ... let me find it. Essentially the borked HTML contains a cycle: a node goes to its child, then when the current node (the child) goes to its next element, that element is actually the previous node (its parent), and on and on ... |
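The failure mode described above can be reproduced with a toy model. This is only a sketch: `Node` and `iter_elements` are hypothetical names, not bs4 classes or functions, and the only thing borrowed from the traceback is the idea of walking `next_element` links. Remembering which nodes have already been visited turns the endless loop into an error.

```python
class Node:
    """Toy stand-in for a parse-tree element; not a bs4 class."""

    def __init__(self, name):
        self.name = name
        self.next_element = None


def iter_elements(start, stop=None):
    # Follow next_element links, but remember every node already yielded so
    # that a link pointing back into the chain raises instead of looping.
    seen = set()
    current = start
    while current is not None and current is not stop:
        if id(current) in seen:
            raise RuntimeError("cycle detected at node %r" % current.name)
        seen.add(id(current))
        yield current
        current = current.next_element


parent, child = Node("table"), Node("tr")
parent.next_element = child
child.next_element = parent  # the broken link: child points back to parent

try:
    names = [node.name for node in iter_elements(parent)]
except RuntimeError as err:
    message = str(err)
```

The visited set costs memory proportional to the number of nodes, which is the price of detecting the cycle rather than spinning until interrupted.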
Am I doing something wrong here:
This works fine:
|
this is an issue with if you need to pass only a single header just use the second version it doesn't really make a whole lot of sense to pass a singleton list if you just want the first row anyway only pass a list if you need more than 1 row as a i'll open an issue about |
That might have been a poor example, I see this issue even with a proper list:
|
What does your table look like? |
See: http://pastebin.com/7mAF0Ei6 The table is an extract from the problematic document. |
GitHub actually supports tables:
|
im betting that if you try to get the part of the table after the header it will work ... otherwise i'll have to take a look later |
Aside from the leading $ and trailing ) (which are like that in the HTML), GitHub does a great job of rendering that table. |
So this seems to work better:
but I am still left with a lot of nan's:
where those nans actually appear to be strings:
|
yep as i suspected. the string |
I think that It is not clear to me what the default value of Also, on I am not sure what setting It seems that some values are being incorrectly parsed as dates (See columns 9-12):
|
What are we supposed to do if we want to use read_html and not have Pandas infer types then? |
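One possible workaround, sketched here with only the Python 3 standard library, is to bypass `read_html` and collect every cell as an uninterpreted string. This is not how pandas does it: `CellTextParser` and `read_tables_as_strings` are hypothetical names, and the sketch assumes simple tables with no nesting and no `rowspan`/`colspan` handling.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td>/<th> cell as a plain string."""

    def __init__(self):
        super().__init__()
        self.tables = []   # list of tables; each table is a list of rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def read_tables_as_strings(html):
    # Returns a list of tables, each a list of rows, each a list of strings.
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    return parser.tables
```

The nested lists can then be handed to a `DataFrame` constructor, and any conversion to numbers or dates becomes an explicit, per-column step.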
closes #4697 (refactor issue) (REF/ENH)
closes #4700 (header inconsistency issue) (API)
closes #5029 (comma issue, added this data set, ordering issue) (BUG)
closes #5048 (header type conversion issue) (BUG)
closes #5066 (index_col issue) (BUG)
skiprows, header, and index_col interaction (a somewhat longstanding MultiIndex sorting issue, I just took the long way to get there :))
spam url not working anymore (US gov "shutdown" is responsible for this, it correctly skips)
table ordering doc blurb/HTML gotchas (was an actual "bug", now fixed in this PR)
add tests for rows with a different length (this is already done by the existing tests)