REF/BUG/ENH/API: refactor read_html to use TextParser #4770

cpcloud · 2013-09-07T04:14:16Z

closes #4697 (refactor issue) (REF/ENH)
closes #4700 (header inconsistency issue) (API)
closes #5029 (comma issue, added this data set, ordering issue) (BUG)
closes #5048 (header type conversion issue) (BUG)
closes #5066 (index_col issue) (BUG)

figure out skiprows, header, and index_col interaction (a somewhat longstanding MultiIndex sorting issue, I just took the long way to get there :))
~~spam url not working anymore~~ (US gov "shutdown" is responsible for this, it correctly skips)
~~table ordering doc blurb/HTML gotchas~~ (was an actual "bug", now fixed in this PR)
~~add tests for rows with a different length~~ (this is already done by the existing tests)

jtratner · 2013-09-08T02:48:35Z

👍

jreback · 2013-09-08T03:06:44Z

yep looks good

cancan101 · 2013-09-18T04:20:50Z

also closes #4697

cpcloud · 2013-09-19T19:01:17Z

@cancan101 yep thanks. added

cancan101 · 2013-09-19T19:03:18Z

@cpcloud Does this PR actually close #4679 ? That one is specific to Excel.

cpcloud · 2013-09-19T19:05:22Z

@jtratner Nope it doesn't, I think I must've mixed up 79 and 97. Thanks

cpcloud · 2013-10-01T23:25:24Z

this diff is very difficult to read ... sigh

jreback · 2013-10-01T23:32:20Z

a 5-fer!

jreback · 2013-10-01T23:34:38Z

I think you are trying to bump your stats with html files! lol

cpcloud · 2013-10-01T23:35:31Z

@jreback yep

~~last issue is for me to make sure that codecs.open with my chosen errors param is correct~~ not using it anymore ... sticking with our old pal open. would be nice to handle encoding/decoding ... but that's for the future

cpcloud · 2013-10-01T23:37:22Z

@jreback nah ... i'm trying to avoid slowing down the test suite with a bunch of @network tests

cpcloud · 2013-10-02T04:39:45Z

@jreback @jtratner comments?

jtratner · 2013-10-02T10:52:18Z

pandas/core/internals.py

-            raise Exception("invalid names passed _stack_arrays")
+        nitems, nstacked = len(items), len(stacked)
+        if nitems != nstacked:
+            raise BadDataError('number of names in ref_items must equal the'


can you leave a note here that says "Caller must catch this error"

or I don't know, maybe not, just something like, if you think this could happen then you should catch this error and try to say something more meaningful.
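
A minimal sketch of the pattern suggested here, where the caller catches the low-level error and re-raises something more meaningful. `BadDataError` is named in the diff above but its definition is not shown, so the class below is a stand-in; `stack_arrays` and `build_table` are hypothetical names and not the pandas internals.

```python
class BadDataError(ValueError):
    """Stand-in for the exception named in the diff (definition assumed)."""


def stack_arrays(items, stacked):
    # Hypothetical low-level helper: refuses mismatched lengths.
    nitems, nstacked = len(items), len(stacked)
    if nitems != nstacked:
        raise BadDataError(
            "number of names (%d) must equal the number of arrays (%d)"
            % (nitems, nstacked)
        )
    return dict(zip(items, stacked))


def build_table(names, columns):
    # Hypothetical caller: catches the low-level error and adds context
    # that only the caller knows (here, that these are header cells).
    try:
        return stack_arrays(names, columns)
    except BadDataError as err:
        raise ValueError(
            "could not build table: header has %d cells but the body has "
            "%d columns" % (len(names), len(columns))
        ) from err
```

Chaining with `from err` keeps the original error available as `__cause__`, so the low-level detail is not lost when the friendlier message is shown.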

jtratner · 2013-10-02T11:20:05Z

I did a first pass. I'd probably like to go over it again and see what I see, but I'm sure it's good.

As an aside, can you help me understand how ordering works for the outputted tables? My assumption is that the ordering is deterministic. Does it follow the order in the HTML data that's passed in? (i.e., if you found all the line numbers of <table> elements, the order of outputted tables would be the same as the order of line numbers [unless a table isn't parseable])

cancan101 · 2013-10-02T12:18:44Z

@jtratner Re: ordering, see: #5029 (comment) and the subsequent comments.

cancan101 · 2013-10-02T17:23:40Z

@cpcloud Maybe also add that note about table ordering to the html gotchas?

cpcloud · 2013-10-03T00:53:49Z

@jreback @jtratner any more comments?

jreback · 2013-10-03T01:00:16Z

look ok

on tupleize_cols your explanation is odd - no other functions have it as true (by default)

cpcloud · 2013-10-03T04:58:47Z

Probably a cycle in the HTML parse tree. Does copy-pasting just the table work?

cpcloud · 2013-10-03T05:01:06Z

Surprised that site works...

cpcloud · 2013-10-03T05:06:22Z

i think maybe a timeout parameter might be useful

cancan101 · 2013-10-04T04:01:41Z

I interrupted and this is the trace:

/home/alex/git/pandas/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, infer_types, attrs, parse_dates, tupleize_cols, thousands)
    838                          'data (you passed a negative value)')
    839     return _parse(flavor, io, match, header, index_col, skiprows, infer_types,
--> 840                   parse_dates, tupleize_cols, thousands, attrs)

/home/alex/git/pandas/pandas/io/html.py in _parse(flavor, io, match, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands, attrs)
    700 
    701         try:
--> 702             tables = p.parse_tables()
    703         except Exception as caught:
    704             retained = caught

/home/alex/git/pandas/pandas/io/html.py in parse_tables(self)
    172 
    173     def parse_tables(self):
--> 174         tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
    175         return (self._build_table(table) for table in tables)
    176 

/home/alex/git/pandas/pandas/io/html.py in _parse_tables(self, doc, match, attrs)
    396     def _parse_tables(self, doc, match, attrs):
    397         element_name = self._strainer.name
--> 398         tables = doc.find_all(element_name, attrs=attrs)
    399 
    400         if not tables:

/usr/local/lib/python2.7/dist-packages/bs4/element.pyc in find_all(self, name, attrs, recursive, text, limit, **kwargs)
   1165         if not recursive:
   1166             generator = self.children
-> 1167         return self._find_all(name, attrs, text, limit, generator, **kwargs)
   1168     findAll = find_all       # BS3
   1169     findChildren = find_all  # BS2

/usr/local/lib/python2.7/dist-packages/bs4/element.pyc in _find_all(self, name, attrs, text, limit, generator, **kwargs)
    483             # Optimization to find all tags with a given name.
    484             elif isinstance(name, basestring):
--> 485                 return [element for element in generator
    486                         if isinstance(element, Tag) and element.name == name]
    487             else:

/usr/local/lib/python2.7/dist-packages/bs4/element.pyc in descendants(self)
   1182         current = self.contents[0]
   1183         while current is not stopNode:
-> 1184             yield current
   1185             current = current.next_element
   1186

cpcloud · 2013-10-04T04:05:37Z

Yep ... @jseabold had a similar issue ... let me find it

essentially the borked html is in a cycle, in this case a node goes to its child then when the current node (the child) goes to the next element, it's actually the previous node (its parent) and on and on ...
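
A rough sketch of the cycle described above, and of one way a traversal could notice it instead of looping forever. The `Node` class and `walk` function are made up for illustration; only the `next_element` attribute and the shape of the `while` loop come from the bs4 traceback, and this is not how bs4 itself handles the case.

```python
class Node:
    """Hypothetical stand-in for a parse-tree element."""

    def __init__(self, name):
        self.name = name
        self.next_element = None


def walk(start, stop=None):
    # Same loop shape as `descendants` in the traceback, plus a seen-set
    # so that revisiting a node raises instead of spinning forever.
    seen = set()
    current = start
    while current is not stop:
        if id(current) in seen:
            raise RuntimeError("cycle detected at node %r" % current.name)
        seen.add(id(current))
        yield current
        current = current.next_element


# The broken case: the child's next element is its own parent.
parent, child = Node("parent"), Node("child")
parent.next_element = child
child.next_element = parent
```

Iterating over `walk(parent)` yields the parent and the child, then raises once the traversal comes back around to the parent.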

cpcloud · 2013-10-04T04:06:06Z

#4786

cancan101 · 2013-10-04T04:23:25Z

Am I doing something wrong here:

pd.read_html("/home/alex/table.html",infer_types=False,header=[0])

---------------------------------------------------------------------------
TypeError                                 
----> 1 pd.read_html("/home/alex/table.html",infer_types=False,header=[0])

/home/alex/git/pandas/pandas/io/html.pyc in read_html(io, match, flavor, header, index_col, skiprows, infer_types, attrs, parse_dates, tupleize_cols, thousands)
    838                          'data (you passed a negative value)')
    839     return _parse(flavor, io, match, header, index_col, skiprows, infer_types,
--> 840                   parse_dates, tupleize_cols, thousands, attrs)

/home/alex/git/pandas/pandas/io/html.pyc in _parse(flavor, io, match, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands, attrs)
    710     return [_data_to_frame(table, header, index_col, skiprows, infer_types,
    711                            parse_dates, tupleize_cols, thousands)
--> 712             for table in tables]
    713 
    714 

/home/alex/git/pandas/pandas/io/html.pyc in _data_to_frame(data, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands)
    600                     skiprows=_get_skiprows(skiprows),
    601                     parse_dates=parse_dates, tupleize_cols=tupleize_cols,
--> 602                     thousands=thousands)
    603     df = tp.read()
    604 

/home/alex/git/pandas/pandas/io/parsers.pyc in TextParser(*args, **kwds)
   1173     """
   1174     kwds['engine'] = 'python'
-> 1175     return TextFileReader(*args, **kwds)
   1176 
   1177 

/home/alex/git/pandas/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    485             self.options['has_index_names'] = kwds['has_index_names']
    486 
--> 487         self._make_engine(self.engine)
    488 
    489     def _get_options_with_defaults(self, engine):

/home/alex/git/pandas/pandas/io/parsers.pyc in _make_engine(self, engine)
    601             elif engine == 'python-fwf':
    602                 klass = FixedWidthFieldParser
--> 603             self._engine = klass(self.f, **self.options)
    604 
    605     def _failover_to_python(self):

/home/alex/git/pandas/pandas/io/parsers.pyc in __init__(self, f, **kwds)
   1296         if len(self.columns) > 1:
   1297             self.columns, self.index_names, self.col_names, _ = self._extract_multi_indexer_columns(
-> 1298                 self.columns, self.index_names, self.col_names)
   1299         else:
   1300             self.columns = self.columns[0]

/home/alex/git/pandas/pandas/io/parsers.pyc in _extract_multi_indexer_columns(self, header, index_names, col_names, passed_names)
    736         # if we find 'Unnamed' all of a single level, then our header was too long
    737         for n in range(len(columns[0])):
--> 738             if all([ 'Unnamed' in c[n] for c in columns ]):
    739                 raise _parser.CParserError("Passed header=[%s] are too many rows for this "
    740                                            "multi_index of columns" % ','.join([ str(x) for x in self.header ]))

TypeError: argument of type 'float' is not iterable

This works fine:

pd.read_html("/home/alex/table.html",infer_types=False,header=0)

cpcloud · 2013-10-04T04:27:12Z

this is an issue with TextParser

if you need to pass only a single header just use the second version

it doesn't really make a whole lot of sense to pass a singleton list if you just want the first row anyway

only pass a list if you need more than 1 row as a MultiIndex header

i'll open an issue about TextParser
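
To make the int-versus-list distinction concrete, here is a small sketch of what several header rows amount to: one tuple label per column. `column_labels` is a hypothetical helper, not the TextParser code, and the two header rows are a simplified version of the table discussed in this thread.

```python
def column_labels(header_rows):
    # One header row: plain labels. Several rows: one tuple per column,
    # which is what a MultiIndex on the columns is built from.
    if len(header_rows) == 1:
        return list(header_rows[0])
    return list(zip(*header_rows))


# Simplified, made-up header rows for illustration.
row0 = ["", "Three months ended April 30", "Three months ended April 30"]
row1 = ["", "2013", "2012"]

flat = column_labels([row0])
nested = column_labels([row0, row1])
```

With a single row the labels stay flat, so a singleton list adds nothing over passing the row number directly.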

cancan101 · 2013-10-04T04:28:12Z

That might have been a poor example, I see this issue even with a proper list:

In [6]: pd.read_html("/home/alex/table.html",infer_types=False,header=[0,1])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-1b482817d4da> in <module>()
----> 1 pd.read_html("/home/alex/table.html",infer_types=False,header=[0,1])

/home/alex/git/pandas/pandas/io/html.pyc in read_html(io, match, flavor, header, index_col, skiprows, infer_types, attrs, parse_dates, tupleize_cols, thousands)
    838                          'data (you passed a negative value)')
    839     return _parse(flavor, io, match, header, index_col, skiprows, infer_types,
--> 840                   parse_dates, tupleize_cols, thousands, attrs)

/home/alex/git/pandas/pandas/io/html.pyc in _parse(flavor, io, match, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands, attrs)
    710     return [_data_to_frame(table, header, index_col, skiprows, infer_types,
    711                            parse_dates, tupleize_cols, thousands)
--> 712             for table in tables]
    713 
    714 

/home/alex/git/pandas/pandas/io/html.pyc in _data_to_frame(data, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands)
    600                     skiprows=_get_skiprows(skiprows),
    601                     parse_dates=parse_dates, tupleize_cols=tupleize_cols,
--> 602                     thousands=thousands)
    603     df = tp.read()
    604 

/home/alex/git/pandas/pandas/io/parsers.pyc in TextParser(*args, **kwds)
   1173     """
   1174     kwds['engine'] = 'python'
-> 1175     return TextFileReader(*args, **kwds)
   1176 
   1177 

/home/alex/git/pandas/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    485             self.options['has_index_names'] = kwds['has_index_names']
    486 
--> 487         self._make_engine(self.engine)
    488 
    489     def _get_options_with_defaults(self, engine):

/home/alex/git/pandas/pandas/io/parsers.pyc in _make_engine(self, engine)
    601             elif engine == 'python-fwf':
    602                 klass = FixedWidthFieldParser
--> 603             self._engine = klass(self.f, **self.options)
    604 
    605     def _failover_to_python(self):

/home/alex/git/pandas/pandas/io/parsers.pyc in __init__(self, f, **kwds)
   1296         if len(self.columns) > 1:
   1297             self.columns, self.index_names, self.col_names, _ = self._extract_multi_indexer_columns(
-> 1298                 self.columns, self.index_names, self.col_names)
   1299         else:
   1300             self.columns = self.columns[0]

/home/alex/git/pandas/pandas/io/parsers.pyc in _extract_multi_indexer_columns(self, header, index_names, col_names, passed_names)
    736         # if we find 'Unnamed' all of a single level, then our header was too long
    737         for n in range(len(columns[0])):
--> 738             if all([ 'Unnamed' in c[n] for c in columns ]):
    739                 raise _parser.CParserError("Passed header=[%s] are too many rows for this "
    740                                            "multi_index of columns" % ','.join([ str(x) for x in self.header ]))

TypeError: argument of type 'float' is not iterable
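
The last frame of the traceback can be reproduced without pandas. The check `'Unnamed' in c[n]` assumes every header cell is a string, so a blank cell that has been read as a float NaN breaks the `in` test. A minimal sketch follows; the `columns` data and the `safe_all_unnamed` guard are made up for illustration and are not the fix that went into pandas.

```python
nan = float("nan")

# Hypothetical header tuples: one blank cell came through as NaN.
columns = [("Unnamed: 0", nan), ("Three months", "2013")]


def all_unnamed(columns, n):
    # Same shape as the check at line 738 in the traceback.
    return all(["Unnamed" in c[n] for c in columns])


def safe_all_unnamed(columns, n):
    # One possible guard (an assumption): only test cells that are strings.
    return all([isinstance(c[n], str) and "Unnamed" in c[n] for c in columns])


try:
    all_unnamed(columns, 1)
    failed = False
except TypeError:
    # `"Unnamed" in nan` is what raises: a float is not a container.
    failed = True
```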

cpcloud · 2013-10-04T04:31:54Z

What does your table look like?

cancan101 · 2013-10-04T04:33:17Z

See: http://pastebin.com/7mAF0Ei6

The table is an extract from the problematic document.

cpcloud · 2013-10-04T04:35:25Z

GitHub actually supports tables:

	Three months ended April 30						Six months ended April 30

	2013			2012			2013			2012
	In millions
Net revenue:
Notebooks	$	3,718		$	4,900		$	7,846		$	9,842
Desktops		3,103			3,827			6,424			7,033
Workstations		521			537			1,056			1,072
Other		242			206			462			415

Personal Systems		7,584			9,470			15,788			18,362

Supplies		4,122			4,060			8,015			8,139
Commercial Hardware		1,398			1,479			2,752			2,968
Consumer Hardware		561			593			1,240			1,283

Printing		6,081			6,132			12,007			12,390

Printing and Personal Systems Group		13,665			15,602			27,795			30,752

Industry Standard Servers		2,806			3,186			5,800			6,258
Technology Services		2,272			2,335			4,515			4,599
Storage		857			990			1,690			1,945
Networking		618			614			1,226			1,200
Business Critical Systems		266			421			572			826

Enterprise Group		6,819			7,546			13,803			14,828

Infrastructure Technology Outsourcing		3,721			3,954			7,457			7,934
Application and Business Services		2,278			2,535			4,461			4,926

Enterprise Services		5,999			6,489			11,918			12,860

Software		941			970			1,867			1,916
HP Financial Services		881			968			1,838			1,918
Corporate Investments		10			7			14			37

Total segments		28,315			31,582			57,235			62,311

Eliminations of intersegment net revenue and other		(733	)		(889	)		(1,294	)		(1,582	)

Total HP consolidated net revenue	$	27,582		$	30,693		$	55,941		$	60,729

cpcloud · 2013-10-04T04:36:14Z

im betting that if you try to get the part of the table after the header it will work ... otherwise i'll have to take a look later

cancan101 · 2013-10-04T04:37:29Z

Aside from the leading $ and trailing ) (which are like that in the HTML), GitHub does a great job of rendering that table.

cancan101 · 2013-10-04T04:41:38Z

So this seems to work better:

ret = pd.read_html("/home/alex/table.html",infer_types=False,skiprows=3)[0]

but I am still left with a lot of nan's:

                                                   0    1    2      3    4   \
0                                        Net revenue:  nan  nan    nan  nan   
1                                           Notebooks  nan    $   3718  nan   
2                                            Desktops  nan  nan   3103  nan   
3                                        Workstations  nan  nan    521  nan   
4                                               Other  nan  nan    242  nan   
5                                                 nan  nan  nan    nan  nan   
6                                    Personal Systems  nan  nan   7584  nan   
7                                                 nan  nan  nan    nan  nan   
8                                            Supplies  nan  nan   4122  nan   
9                                 Commercial Hardware  nan  nan   1398  nan   
10                                  Consumer Hardware  nan  nan    561  nan   
11                                                nan  nan  nan    nan  nan   
12                                           Printing  nan  nan   6081  nan   
13                                                nan  nan  nan    nan  nan   
14                Printing and Personal Systems Group  nan  nan  13665  nan   
15                                                nan  nan  nan    nan  nan   
16                          Industry Standard Servers  nan  nan   2806  nan   
17                                Technology Services  nan  nan   2272  nan   
18                                            Storage  nan  nan    857  nan   
19                                         Networking  nan  nan    618  nan   
20                          Business Critical Systems  nan  nan    266  nan   
21                                                nan  nan  nan    nan  nan   
22                                   Enterprise Group  nan  nan   6819  nan   
23                                                nan  nan  nan    nan  nan   
24              Infrastructure Technology Outsourcing  nan  nan   3721  nan   
25                  Application and Business Services  nan  nan   2278  nan   
26                                                nan  nan  nan    nan  nan   
27                                Enterprise Services  nan  nan   5999  nan   
28                                                nan  nan  nan    nan  nan   
29                                           Software  nan  nan    941  nan   
30                              HP Financial Services  nan  nan    881  nan   
31                              Corporate Investments  nan  nan     10  nan   
32                                                nan  nan  nan    nan  nan   
33                                     Total segments  nan  nan  28315  nan   
34                                                nan  nan  nan    nan  nan   
35  Eliminations of intersegment net revenue and o...  nan  nan   (733    )   
36                                                nan  nan  nan    nan  nan   
37                  Total HP consolidated net revenue  nan    $  27582  nan   
38                                                nan  nan  nan    nan  nan

where those nans actually appear to be strings:

In [22]: ret[5][0]
Out[22]: u'nan'

cpcloud · 2013-10-04T04:43:34Z

yep as i suspected.

the string nans are because you passed infer_types=False which converts everything to a string (kind of kludgy i know, it's for back compat). infer_types will have no effect starting in 0.14
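
The effect is easy to see in plain Python: converting a float NaN to text gives the literal string 'nan', which no longer behaves like a missing value. A small sketch, assuming a made-up row; `to_strings` and `restore_missing` are hypothetical helpers and not part of read_html.

```python
def to_strings(row):
    # Mimics "convert everything to a string": NaN becomes the text 'nan'.
    return [str(cell) for cell in row]


def restore_missing(row):
    # Map the literal text 'nan' back to a real missing value (None here).
    return [None if cell == "nan" else cell for cell in row]


row = ["Notebooks", float("nan"), "$", 3718]
as_text = to_strings(row)
cleaned = restore_missing(as_text)
```

Here `as_text` holds only strings, including 'nan', and `cleaned` puts `None` back in that position while leaving the other strings untouched.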

cancan101 · 2013-10-04T04:49:53Z

I think that parse_dates should have a little more documentation (I did look at docs on read_csv as well).

It is not clear to me what the default value of parse_dates is.

Also, on read_html, parse_dates is listed as "bool" whereas on read_csv as "boolean, list of ints or names, list of lists, or dict".

I am not sure what setting parse_dates=False does.

It seems that some values are being incorrectly parsed as dates (See columns 9-12):

In [38]: pd.read_html("/home/alex/table.html",skiprows=3,parse_dates=False)[0]
Out[38]: 
                                                   0   1    2      3    4    5      6    7    8                   9    10   11                  12   13
0                                        Net revenue: NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
1                                           Notebooks NaN    $   3718  NaN    $   4900  NaN    $                 NaT  NaN    $                 NaT  NaN
2                                            Desktops NaN  NaN   3103  NaN  NaN   3827  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
3                                        Workstations NaN  NaN    521  NaN  NaN    537  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
4                                               Other NaN  NaN    242  NaN  NaN    206  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
5                                                 NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
6                                    Personal Systems NaN  NaN   7584  NaN  NaN   9470  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
7                                                 NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
8                                            Supplies NaN  NaN   4122  NaN  NaN   4060  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
9                                 Commercial Hardware NaN  NaN   1398  NaN  NaN   1479  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
10                                  Consumer Hardware NaN  NaN    561  NaN  NaN    593  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
11                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
12                                           Printing NaN  NaN   6081  NaN  NaN   6132  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
13                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
14                Printing and Personal Systems Group NaN  NaN  13665  NaN  NaN  15602  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
15                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
16                          Industry Standard Servers NaN  NaN   2806  NaN  NaN   3186  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
17                                Technology Services NaN  NaN   2272  NaN  NaN   2335  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
18                                            Storage NaN  NaN    857  NaN  NaN    990  NaN  NaN 1690-01-01 00:00:00  NaN  NaN 1945-01-01 00:00:00  NaN
19                                         Networking NaN  NaN    618  NaN  NaN    614  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
20                          Business Critical Systems NaN  NaN    266  NaN  NaN    421  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
21                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
22                                   Enterprise Group NaN  NaN   6819  NaN  NaN   7546  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
23                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
24              Infrastructure Technology Outsourcing NaN  NaN   3721  NaN  NaN   3954  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
25                  Application and Business Services NaN  NaN   2278  NaN  NaN   2535  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
26                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
27                                Enterprise Services NaN  NaN   5999  NaN  NaN   6489  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
28                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
29                                           Software NaN  NaN    941  NaN  NaN    970  NaN  NaN 1867-01-01 00:00:00  NaN  NaN 1916-01-01 00:00:00  NaN
30                              HP Financial Services NaN  NaN    881  NaN  NaN    968  NaN  NaN 1838-01-01 00:00:00  NaN  NaN 1918-01-01 00:00:00  NaN
31                              Corporate Investments NaN  NaN     10  NaN  NaN      7  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
32                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
33                                     Total segments NaN  NaN  28315  NaN  NaN  31582  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
34                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
35  Eliminations of intersegment net revenue and o... NaN  NaN   (733    )  NaN   (889    )  NaN                 NaT    )  NaN                 NaT    )
36                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
37                  Total HP consolidated net revenue NaN    $  27582  NaN    $  30693  NaN    $                 NaT  NaN    $                 NaT  NaN
38                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
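
The dates in columns 9 and 12 line up with the cells whose values, once the thousands separator is stripped, are four-digit numbers that could be read as a year: 1,690, 1,945, 1,867, 1,916, 1,838 and 1,918. Values such as 1,226 stay NaT, which is consistent with the range a pandas Timestamp can hold (roughly 1678 through 2261). The sketch below illustrates that reading with the standard library only; `looks_like_year` is a hypothetical helper and the bounds are an assumption based on that range, not the actual date inference pandas performs.

```python
from datetime import datetime


def looks_like_year(cell, lo=1678, hi=2261):
    # Strip the thousands separator, then test for a plausible 4-digit year.
    digits = cell.replace(",", "")
    return digits.isdigit() and len(digits) == 4 and lo <= int(digits) <= hi


# Cell values taken from the table in this thread.
cells = ["857", "1,690", "1,945", "1,226", "13,803"]

parsed = {
    cell: datetime.strptime(cell.replace(",", ""), "%Y")
    for cell in cells
    if looks_like_year(cell)
}
```

Only "1,690" and "1,945" end up in `parsed`, each as January 1 of that year, matching the `1690-01-01 00:00:00` style values in the output above.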

kevindavenport · 2014-11-08T03:54:01Z

What are we supposed to do if we want to use read_html and not have Pandas infer types then?

ghost assigned cpcloud Sep 7, 2013

jtratner reviewed Oct 2, 2013

cancan101 mentioned this pull request Jan 22, 2014

ENH: read_html has no timeout #6029

Closed

This was referenced Apr 3, 2014

DEPR: create issues for the current FutureWarnings in pandas #6641

Closed

Remove number of deprecated parameters/functions/classes [fix #6641] #6813

Merged

DEPR: Clean up list of deprecations from prior versions #6581

Closed

cpcloud mentioned this pull request May 5, 2014

fully deprecate read_html infer_types argument in 0.14 #7037

Closed

sixtysecond mentioned this pull request Jul 27, 2015

Strong coupling on read_html and automatic conversion results in lost data for incorrectly inferred types #10684

Closed

jreback added the "IO HTML" label (read_html, to_html, Styler.apply, Styler.applymap) Aug 23, 2015

jreback mentioned this pull request Aug 23, 2015

DEPR: Bunch o deprecation removals part 2 #10892

Merged

jreback added a commit to jreback/pandas that referenced this pull request Aug 24, 2015

DEPR: Remove infer_type keyword from pd.read_html as its unused, pandas-dev#4770, pandas-dev#7032

0fde3ba

jreback mentioned this pull request Jul 24, 2016

DEPR: deprecations log for removed issues #13777

Closed
