REF/BUG/ENH/API: refactor read_html to use TextParser #4770

Merged
merged 5 commits into from
Oct 3, 2013

Conversation

cpcloud
Member

@cpcloud cpcloud commented Sep 7, 2013

closes #4697 (refactor issue) (REF/ENH)
closes #4700 (header inconsistency issue) (API)
closes #5029 (comma issue, added this data set, ordering issue) (BUG)
closes #5048 (header type conversion issue) (BUG)
closes #5066 (index_col issue) (BUG)

  • figure out skiprows, header, and index_col interaction (a somewhat longstanding MultiIndex sorting issue, I just took the long way to get there :))
  • spam url not working anymore (US gov "shutdown" is responsible for this, it correctly skips)
  • table ordering doc blurb/HTML gotchas (was an actual "bug", now fixed in this PR)
  • add tests for rows with a different length (this is already done by the existing tests)

@ghost ghost assigned cpcloud Sep 7, 2013
@jtratner
Contributor

jtratner commented Sep 8, 2013

👍

@jreback
Contributor

jreback commented Sep 8, 2013

yep looks good

@cancan101
Contributor

also closes #4697

@cpcloud
Member Author

cpcloud commented Sep 19, 2013

@cancan101 yep thanks. added

@cancan101
Contributor

@cpcloud Does this PR actually close #4679 ? That one is specific to Excel.

@cpcloud
Member Author

cpcloud commented Sep 19, 2013

@jtratner Nope it doesn't, I think I must've mixed up 79 and 97. Thanks

@cpcloud
Member Author

cpcloud commented Oct 1, 2013

this diff is very difficult to read ... sigh

@jreback
Contributor

jreback commented Oct 1, 2013

a 5-fer!

@jreback
Contributor

jreback commented Oct 1, 2013

I think you are trying to bump your stats with html files! lol

@cpcloud
Member Author

cpcloud commented Oct 1, 2013

@jreback yep

The last issue was for me to make sure that codecs.open with my chosen errors param is correct, but I'm not using it anymore ... sticking with our old pal open. It would be nice to handle encoding/decoding ... but that's for the future.

@cpcloud
Member Author

cpcloud commented Oct 1, 2013

@jreback nah ... i'm trying to avoid slowing down the test suite with a bunch of @network tests

@cpcloud
Member Author

cpcloud commented Oct 2, 2013

@jreback @jtratner comments?

raise Exception("invalid names passed _stack_arrays")
nitems, nstacked = len(items), len(stacked)
if nitems != nstacked:
    raise BadDataError('number of names in ref_items must equal the'
Contributor

can you leave a note here that says "Caller must catch this error"

Contributor

or I don't know, maybe not, just something like, if you think this could happen then you should catch this error and try to say something more meaningful.

Member Author

done
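
The "caller must catch this error" suggestion above can be sketched as follows. This is only an illustration, not the code from the PR: `BadDataError` here is a hypothetical stand-in class, `stack_arrays` and `build_frame` are made-up names, and the error messages are invented for the example.

```python
class BadDataError(ValueError):
    """Hypothetical stand-in for the low-level error discussed above."""


def stack_arrays(items, stacked):
    # Low-level helper: raises a terse error that callers are expected to catch.
    nitems, nstacked = len(items), len(stacked)
    if nitems != nstacked:
        raise BadDataError(
            "got %d names for %d stacked arrays" % (nitems, nstacked)
        )
    return dict(zip(items, stacked))


def build_frame(names, columns):
    # Higher-level caller: catch the internal error and re-raise it with a
    # message that tells the user what went wrong in their terms.
    try:
        return stack_arrays(names, columns)
    except BadDataError as err:
        raise ValueError(
            "could not build table: got %d column names for %d columns (%s)"
            % (len(names), len(columns), err)
        ) from err
```

The low-level function stays simple, and the layer that knows what the user was trying to do takes responsibility for the wording.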

@jtratner
Contributor

jtratner commented Oct 2, 2013

I did a first pass. I'd probably like to go over it again and see what I see, but I'm sure it's good.

As an aside, can you help me understand how ordering works for the outputted tables? My assumption is that the ordering is deterministic. Does it follow the order in the HTML data that's passed in? (i.e., if you found all the line numbers of <table> elements, the order of outputted tables would be the same as the order of line numbers [unless a table isn't parseable])
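
One way to check the document-order assumption by hand is to record where each `<table>` start tag sits in the source. The sketch below uses only the standard library's `html.parser`; `TableLocator` and `table_positions` are hypothetical helper names, and it makes no claim about how `read_html` itself orders its results.

```python
from html.parser import HTMLParser


class TableLocator(HTMLParser):
    """Record the (line, offset) of every <table> start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # getpos() returns the (lineno, offset) of the tag being handled.
            self.positions.append(self.getpos())


def table_positions(html):
    locator = TableLocator()
    locator.feed(html)
    locator.close()
    return locator.positions


sample = """<html><body>
<table id="first"><tr><td>1</td></tr></table>
<p>text</p>
<table id="second"><tr><td>2</td></tr></table>
</body></html>"""

for line, offset in table_positions(sample):
    print("table starts at line %d, offset %d" % (line, offset))
```

Comparing these positions with the order of the returned frames is a quick sanity check on a given page.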

@cancan101
Contributor

@jtratner Re: ordering, see: #5029 (comment) and the subsequent comments.

@cancan101
Contributor

@cpcloud Maybe also add that note about table ordering to the html gotchas?

@cpcloud
Member Author

cpcloud commented Oct 3, 2013

@jreback @jtratner any more comments?

@jreback
Contributor

jreback commented Oct 3, 2013

looks ok

on tupleize_cols your explanation is odd - no other functions have it as true (by default)

@cpcloud
Member Author

cpcloud commented Oct 3, 2013

Probably a cycle in the HTML parse tree. Does copy-pasting just the table work?

@cpcloud
Member Author

cpcloud commented Oct 3, 2013

Surprised that site works...

@cpcloud
Member Author

cpcloud commented Oct 3, 2013

i think maybe a timeout parameter might be useful

@cancan101
Contributor

I interrupted and this is the trace:

/home/alex/git/pandas/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, infer_types, attrs, parse_dates, tupleize_cols, thousands)
    838                          'data (you passed a negative value)')
    839     return _parse(flavor, io, match, header, index_col, skiprows, infer_types,
--> 840                   parse_dates, tupleize_cols, thousands, attrs)

/home/alex/git/pandas/pandas/io/html.py in _parse(flavor, io, match, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands, attrs)
    700 
    701         try:
--> 702             tables = p.parse_tables()
    703         except Exception as caught:
    704             retained = caught

/home/alex/git/pandas/pandas/io/html.py in parse_tables(self)
    172 
    173     def parse_tables(self):
--> 174         tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
    175         return (self._build_table(table) for table in tables)
    176 

/home/alex/git/pandas/pandas/io/html.py in _parse_tables(self, doc, match, attrs)
    396     def _parse_tables(self, doc, match, attrs):
    397         element_name = self._strainer.name
--> 398         tables = doc.find_all(element_name, attrs=attrs)
    399 
    400         if not tables:

/usr/local/lib/python2.7/dist-packages/bs4/element.pyc in find_all(self, name, attrs, recursive, text, limit, **kwargs)
   1165         if not recursive:
   1166             generator = self.children
-> 1167         return self._find_all(name, attrs, text, limit, generator, **kwargs)
   1168     findAll = find_all       # BS3
   1169     findChildren = find_all  # BS2

/usr/local/lib/python2.7/dist-packages/bs4/element.pyc in _find_all(self, name, attrs, text, limit, generator, **kwargs)
    483             # Optimization to find all tags with a given name.
    484             elif isinstance(name, basestring):
--> 485                 return [element for element in generator
    486                         if isinstance(element, Tag) and element.name == name]
    487             else:

/usr/local/lib/python2.7/dist-packages/bs4/element.pyc in descendants(self)
   1182         current = self.contents[0]
   1183         while current is not stopNode:
-> 1184             yield current
   1185             current = current.next_element
   1186 

@cpcloud
Member Author

cpcloud commented Oct 4, 2013

Yep ... @jseabold had a similar issue ... let me find it

essentially the borked html is in a cycle: in this case a node goes to its child, then when the current node (the child) goes to the next element, it's actually the previous node (its parent), and on and on ...
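
The parent/child loop described here can be illustrated with a toy traversal. This is a sketch under stated assumptions: `Node` is a hypothetical stand-in for a parse-tree element, not a BeautifulSoup class, and `walk` is a made-up helper that follows `next_element` links the way the `descendants` loop in the trace does, but stops when it sees the same node twice.

```python
class Node:
    """Hypothetical stand-in for a parse-tree element with a next_element link."""

    def __init__(self, name):
        self.name = name
        self.next_element = None


def walk(start):
    """Yield nodes by following next_element; raise if a node is revisited."""
    seen = set()
    current = start
    while current is not None:
        if id(current) in seen:
            raise ValueError("cycle detected at node %r" % current.name)
        seen.add(id(current))
        yield current
        current = current.next_element


# A parent whose child points straight back at it, as in the description above.
parent, child = Node("parent"), Node("child")
parent.next_element = child
child.next_element = parent

try:
    for node in walk(parent):
        print(node.name)
except ValueError as err:
    print(err)
```

Without the `seen` check the loop would never terminate, which matches the hang that had to be interrupted.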

@cpcloud
Member Author

cpcloud commented Oct 4, 2013

#4786

@cancan101
Contributor

Am I doing something wrong here:

pd.read_html("/home/alex/table.html",infer_types=False,header=[0])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
----> 1 pd.read_html("/home/alex/table.html",infer_types=False,header=[0])

/home/alex/git/pandas/pandas/io/html.pyc in read_html(io, match, flavor, header, index_col, skiprows, infer_types, attrs, parse_dates, tupleize_cols, thousands)
    838                          'data (you passed a negative value)')
    839     return _parse(flavor, io, match, header, index_col, skiprows, infer_types,
--> 840                   parse_dates, tupleize_cols, thousands, attrs)

/home/alex/git/pandas/pandas/io/html.pyc in _parse(flavor, io, match, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands, attrs)
    710     return [_data_to_frame(table, header, index_col, skiprows, infer_types,
    711                            parse_dates, tupleize_cols, thousands)
--> 712             for table in tables]
    713 
    714 

/home/alex/git/pandas/pandas/io/html.pyc in _data_to_frame(data, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands)
    600                     skiprows=_get_skiprows(skiprows),
    601                     parse_dates=parse_dates, tupleize_cols=tupleize_cols,
--> 602                     thousands=thousands)
    603     df = tp.read()
    604 

/home/alex/git/pandas/pandas/io/parsers.pyc in TextParser(*args, **kwds)
   1173     """
   1174     kwds['engine'] = 'python'
-> 1175     return TextFileReader(*args, **kwds)
   1176 
   1177 

/home/alex/git/pandas/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    485             self.options['has_index_names'] = kwds['has_index_names']
    486 
--> 487         self._make_engine(self.engine)
    488 
    489     def _get_options_with_defaults(self, engine):

/home/alex/git/pandas/pandas/io/parsers.pyc in _make_engine(self, engine)
    601             elif engine == 'python-fwf':
    602                 klass = FixedWidthFieldParser
--> 603             self._engine = klass(self.f, **self.options)
    604 
    605     def _failover_to_python(self):

/home/alex/git/pandas/pandas/io/parsers.pyc in __init__(self, f, **kwds)
   1296         if len(self.columns) > 1:
   1297             self.columns, self.index_names, self.col_names, _ = self._extract_multi_indexer_columns(
-> 1298                 self.columns, self.index_names, self.col_names)
   1299         else:
   1300             self.columns = self.columns[0]

/home/alex/git/pandas/pandas/io/parsers.pyc in _extract_multi_indexer_columns(self, header, index_names, col_names, passed_names)
    736         # if we find 'Unnamed' all of a single level, then our header was too long
    737         for n in range(len(columns[0])):
--> 738             if all([ 'Unnamed' in c[n] for c in columns ]):
    739                 raise _parser.CParserError("Passed header=[%s] are too many rows for this "
    740                                            "multi_index of columns" % ','.join([ str(x) for x in self.header ]))

TypeError: argument of type 'float' is not iterable

This works fine:

pd.read_html("/home/alex/table.html",infer_types=False,header=0)

@cpcloud
Member Author

cpcloud commented Oct 4, 2013

this is an issue with TextParser

if you need to pass only a single header just use the second version

it doesn't really make a whole lot of sense to pass a singleton list if you just want the first row anyway

only pass a list if you need more than 1 row as a MultiIndex header

i'll open an issue about TextParser
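
For reference, the `TypeError` at the bottom of the trace can be reproduced without pandas. The sketch below assumes (this is not confirmed in the thread) that one of the header labels is a float such as NaN, for example from an empty header cell; `has_unnamed_level` mirrors the shape of the check shown in the traceback, and `has_unnamed_level_safe` is a hypothetical guard, not the fix adopted in pandas.

```python
# Column labels as (level 0, level 1) tuples; the NaN label is an assumption.
columns = [("Three months ended", "2013"), (float("nan"), "2012")]


def has_unnamed_level(columns, level):
    # Same shape as the check in the traceback: a substring test applied to
    # every label at one header level (list comprehension, so no short-circuit).
    return all(["Unnamed" in c[level] for c in columns])


def has_unnamed_level_safe(columns, level):
    # Hypothetical guard: only run the substring test on actual strings.
    return all(
        isinstance(c[level], str) and "Unnamed" in c[level] for c in columns
    )


try:
    has_unnamed_level(columns, 0)
except TypeError as err:
    print("TypeError:", err)

print(has_unnamed_level_safe(columns, 0))
```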

@cancan101
Contributor

That might have been a poor example, I see this issue even with a proper list:

In [6]: pd.read_html("/home/alex/table.html",infer_types=False,header=[0,1])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-1b482817d4da> in <module>()
----> 1 pd.read_html("/home/alex/table.html",infer_types=False,header=[0,1])

/home/alex/git/pandas/pandas/io/html.pyc in read_html(io, match, flavor, header, index_col, skiprows, infer_types, attrs, parse_dates, tupleize_cols, thousands)
    838                          'data (you passed a negative value)')
    839     return _parse(flavor, io, match, header, index_col, skiprows, infer_types,
--> 840                   parse_dates, tupleize_cols, thousands, attrs)

/home/alex/git/pandas/pandas/io/html.pyc in _parse(flavor, io, match, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands, attrs)
    710     return [_data_to_frame(table, header, index_col, skiprows, infer_types,
    711                            parse_dates, tupleize_cols, thousands)
--> 712             for table in tables]
    713 
    714 

/home/alex/git/pandas/pandas/io/html.pyc in _data_to_frame(data, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands)
    600                     skiprows=_get_skiprows(skiprows),
    601                     parse_dates=parse_dates, tupleize_cols=tupleize_cols,
--> 602                     thousands=thousands)
    603     df = tp.read()
    604 

/home/alex/git/pandas/pandas/io/parsers.pyc in TextParser(*args, **kwds)
   1173     """
   1174     kwds['engine'] = 'python'
-> 1175     return TextFileReader(*args, **kwds)
   1176 
   1177 

/home/alex/git/pandas/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    485             self.options['has_index_names'] = kwds['has_index_names']
    486 
--> 487         self._make_engine(self.engine)
    488 
    489     def _get_options_with_defaults(self, engine):

/home/alex/git/pandas/pandas/io/parsers.pyc in _make_engine(self, engine)
    601             elif engine == 'python-fwf':
    602                 klass = FixedWidthFieldParser
--> 603             self._engine = klass(self.f, **self.options)
    604 
    605     def _failover_to_python(self):

/home/alex/git/pandas/pandas/io/parsers.pyc in __init__(self, f, **kwds)
   1296         if len(self.columns) > 1:
   1297             self.columns, self.index_names, self.col_names, _ = self._extract_multi_indexer_columns(
-> 1298                 self.columns, self.index_names, self.col_names)
   1299         else:
   1300             self.columns = self.columns[0]

/home/alex/git/pandas/pandas/io/parsers.pyc in _extract_multi_indexer_columns(self, header, index_names, col_names, passed_names)
    736         # if we find 'Unnamed' all of a single level, then our header was too long
    737         for n in range(len(columns[0])):
--> 738             if all([ 'Unnamed' in c[n] for c in columns ]):
    739                 raise _parser.CParserError("Passed header=[%s] are too many rows for this "
    740                                            "multi_index of columns" % ','.join([ str(x) for x in self.header ]))

TypeError: argument of type 'float' is not iterable

@cpcloud
Member Author

cpcloud commented Oct 4, 2013

What does your table look like?

@cancan101
Contributor

See: http://pastebin.com/7mAF0Ei6

The table is an extract from the problematic document.

@cpcloud
Member Author

cpcloud commented Oct 4, 2013

GitHub actually supports tables:

 
                                                       Three months ended       Six months ended
                                                            April 30                April 30
                                                            2013        2012        2013        2012
                                                                      In millions

Net revenue:

Notebooks                                                $ 3,718     $ 4,900     $ 7,846     $ 9,842
Desktops                                                   3,103       3,827       6,424       7,033
Workstations                                                 521         537       1,056       1,072
Other                                                        242         206         462         415

Personal Systems                                           7,584       9,470      15,788      18,362

Supplies                                                   4,122       4,060       8,015       8,139
Commercial Hardware                                        1,398       1,479       2,752       2,968
Consumer Hardware                                            561         593       1,240       1,283

Printing                                                   6,081       6,132      12,007      12,390

Printing and Personal Systems Group                       13,665      15,602      27,795      30,752

Industry Standard Servers                                  2,806       3,186       5,800       6,258
Technology Services                                        2,272       2,335       4,515       4,599
Storage                                                      857         990       1,690       1,945
Networking                                                   618         614       1,226       1,200
Business Critical Systems                                    266         421         572         826

Enterprise Group                                           6,819       7,546      13,803      14,828

Infrastructure Technology Outsourcing                      3,721       3,954       7,457       7,934
Application and Business Services                          2,278       2,535       4,461       4,926

Enterprise Services                                        5,999       6,489      11,918      12,860

Software                                                     941         970       1,867       1,916
HP Financial Services                                        881         968       1,838       1,918
Corporate Investments                                         10           7          14          37

Total segments                                            28,315      31,582      57,235      62,311

Eliminations of intersegment net revenue and other        (733 )      (889 )    (1,294 )    (1,582 )

Total HP consolidated net revenue                       $ 27,582    $ 30,693    $ 55,941    $ 60,729

@cpcloud
Member Author

cpcloud commented Oct 4, 2013

I'm betting that if you try to get the part of the table after the header it will work ... otherwise I'll have to take a look later

@cancan101
Contributor

Aside from the leading $ and trailing ) (which are like that in the HTML), GitHub does a great job of rendering that table.

@cancan101
Contributor

So this seems to work better:

ret = pd.read_html("/home/alex/table.html",infer_types=False,skiprows=3)[0]

but I am still left with a lot of nan's:

                                                   0    1    2      3    4   \
0                                        Net revenue:  nan  nan    nan  nan   
1                                           Notebooks  nan    $   3718  nan   
2                                            Desktops  nan  nan   3103  nan   
3                                        Workstations  nan  nan    521  nan   
4                                               Other  nan  nan    242  nan   
5                                                 nan  nan  nan    nan  nan   
6                                    Personal Systems  nan  nan   7584  nan   
7                                                 nan  nan  nan    nan  nan   
8                                            Supplies  nan  nan   4122  nan   
9                                 Commercial Hardware  nan  nan   1398  nan   
10                                  Consumer Hardware  nan  nan    561  nan   
11                                                nan  nan  nan    nan  nan   
12                                           Printing  nan  nan   6081  nan   
13                                                nan  nan  nan    nan  nan   
14                Printing and Personal Systems Group  nan  nan  13665  nan   
15                                                nan  nan  nan    nan  nan   
16                          Industry Standard Servers  nan  nan   2806  nan   
17                                Technology Services  nan  nan   2272  nan   
18                                            Storage  nan  nan    857  nan   
19                                         Networking  nan  nan    618  nan   
20                          Business Critical Systems  nan  nan    266  nan   
21                                                nan  nan  nan    nan  nan   
22                                   Enterprise Group  nan  nan   6819  nan   
23                                                nan  nan  nan    nan  nan   
24              Infrastructure Technology Outsourcing  nan  nan   3721  nan   
25                  Application and Business Services  nan  nan   2278  nan   
26                                                nan  nan  nan    nan  nan   
27                                Enterprise Services  nan  nan   5999  nan   
28                                                nan  nan  nan    nan  nan   
29                                           Software  nan  nan    941  nan   
30                              HP Financial Services  nan  nan    881  nan   
31                              Corporate Investments  nan  nan     10  nan   
32                                                nan  nan  nan    nan  nan   
33                                     Total segments  nan  nan  28315  nan   
34                                                nan  nan  nan    nan  nan   
35  Eliminations of intersegment net revenue and o...  nan  nan   (733    )   
36                                                nan  nan  nan    nan  nan   
37                  Total HP consolidated net revenue  nan    $  27582  nan   
38                                                nan  nan  nan    nan  nan   

where those nans actually appear to be strings:

In [22]: ret[5][0]
Out[22]: u'nan'

@cpcloud
Member Author

cpcloud commented Oct 4, 2013

yep as i suspected.

the string nans are because you passed infer_types=False which converts everything to a string (kind of kludgy i know, it's for back compat). infer_types will have no effect starting in 0.14
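
Why the missing cells show up as the string 'nan' can be seen in plain Python: calling `str` on a float NaN gives `'nan'`. The sketch below is only an analogy for the behaviour described above; `stringify` and `restore_missing` are hypothetical helpers, not pandas functions.

```python
def stringify(cells):
    # Analogue of "convert everything to a string".
    return [str(c) for c in cells]


def restore_missing(cells):
    # Map the literal string 'nan' back to a real missing value (None here).
    return [None if c == "nan" else c for c in cells]


row = ["Notebooks", float("nan"), "$", 3718]
as_text = stringify(row)
print(as_text)
print(restore_missing(as_text))
```

Note that `restore_missing` would also blank out a cell whose real text happens to be "nan", so it is only safe when that cannot occur in the data.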

@cancan101
Contributor

I think that parse_dates should have a little more documentation (I did look at docs on read_csv as well).

It is not clear to me what the default value of parse_dates is.

Also, on read_html, parse_dates is listed as "bool" whereas on read_csv as "boolean, list of ints or names, list of lists, or dict".

I am not sure what setting parse_dates=False does.

It seems that some values are being incorrectly parsed as dates (See columns 9-12):

In [38]: pd.read_html("/home/alex/table.html",skiprows=3,parse_dates=False)[0]
Out[38]: 
                                                   0   1    2      3    4    5      6    7    8                   9    10   11                  12   13
0                                        Net revenue: NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
1                                           Notebooks NaN    $   3718  NaN    $   4900  NaN    $                 NaT  NaN    $                 NaT  NaN
2                                            Desktops NaN  NaN   3103  NaN  NaN   3827  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
3                                        Workstations NaN  NaN    521  NaN  NaN    537  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
4                                               Other NaN  NaN    242  NaN  NaN    206  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
5                                                 NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
6                                    Personal Systems NaN  NaN   7584  NaN  NaN   9470  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
7                                                 NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
8                                            Supplies NaN  NaN   4122  NaN  NaN   4060  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
9                                 Commercial Hardware NaN  NaN   1398  NaN  NaN   1479  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
10                                  Consumer Hardware NaN  NaN    561  NaN  NaN    593  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
11                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
12                                           Printing NaN  NaN   6081  NaN  NaN   6132  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
13                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
14                Printing and Personal Systems Group NaN  NaN  13665  NaN  NaN  15602  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
15                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
16                          Industry Standard Servers NaN  NaN   2806  NaN  NaN   3186  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
17                                Technology Services NaN  NaN   2272  NaN  NaN   2335  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
18                                            Storage NaN  NaN    857  NaN  NaN    990  NaN  NaN 1690-01-01 00:00:00  NaN  NaN 1945-01-01 00:00:00  NaN
19                                         Networking NaN  NaN    618  NaN  NaN    614  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
20                          Business Critical Systems NaN  NaN    266  NaN  NaN    421  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
21                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
22                                   Enterprise Group NaN  NaN   6819  NaN  NaN   7546  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
23                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
24              Infrastructure Technology Outsourcing NaN  NaN   3721  NaN  NaN   3954  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
25                  Application and Business Services NaN  NaN   2278  NaN  NaN   2535  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
26                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
27                                Enterprise Services NaN  NaN   5999  NaN  NaN   6489  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
28                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
29                                           Software NaN  NaN    941  NaN  NaN    970  NaN  NaN 1867-01-01 00:00:00  NaN  NaN 1916-01-01 00:00:00  NaN
30                              HP Financial Services NaN  NaN    881  NaN  NaN    968  NaN  NaN 1838-01-01 00:00:00  NaN  NaN 1918-01-01 00:00:00  NaN
31                              Corporate Investments NaN  NaN     10  NaN  NaN      7  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
32                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
33                                     Total segments NaN  NaN  28315  NaN  NaN  31582  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
34                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
35  Eliminations of intersegment net revenue and o... NaN  NaN   (733    )  NaN   (889    )  NaN                 NaT    )  NaN                 NaT    )
36                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
37                  Total HP consolidated net revenue NaN    $  27582  NaN    $  30693  NaN    $                 NaT  NaN    $                 NaT  NaN
38                                                NaN NaN  NaN    NaN  NaN  NaN    NaN  NaN  NaN                 NaT  NaN  NaN                 NaT  NaN
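
Values such as 1,690 and 1,945 look like years once the thousands separator is stripped, which is presumably why they end up as dates. One possible workaround, sketched here as an assumption rather than anything from this PR, is to read the cells as text and convert them explicitly. `parse_amount` is a hypothetical helper; treating a leading parenthesis as a negative sign is the usual accounting convention, not something stated in the thread.

```python
import re


def parse_amount(text):
    """Convert cells like '$ 3,718' or '(733 )' to an int; None if not numeric."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    # Accounting convention (assumed): a leading parenthesis marks a negative.
    negative = cleaned.startswith("(")
    cleaned = cleaned.strip("() ")
    if not re.fullmatch(r"\d+", cleaned):
        return None
    value = int(cleaned)
    return -value if negative else value


for cell in ["$ 3,718", "1,690", "(733 )", "(733", "Net revenue:", ""]:
    print(repr(cell), "->", parse_amount(cell))
```

Because the conversion never goes through date inference, a value like 1,690 stays the integer 1690.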

@kevindavenport

What are we supposed to do if we want to use read_html and not have Pandas infer types then?
