read_docx? #22518

jnothman · 2018-08-27T03:18:58Z

Problem description

I sometimes need to extract tables from docx files, rather than from HTML. Given that docx XML is very HTML-like when it comes to tables, it seems appropriate to reuse Pandas' loading facilities, ideally without first converting the whole docx to HTML.

Here is a hacky solution, which simply:

reads 'word/document.xml' from the docx zipfile
translates WordprocessingML's tbl to table and tc to td. (I've not looked into how Word marks things corresponding to th, thead and tfoot.)

Working implementation, invoked by pd.read_html('file://path/to/my.docx', flavor='docx'):

from pandas.io.html import _LxmlFrameParser, _valid_parsers
from pandas.io.common import _is_url, urlopen


class _LxmlDocxParser(_LxmlFrameParser):
    def _build_doc(self):
        import io
        import zipfile
        from lxml.html import parse, HTMLParser
        from lxml.etree import XMLSyntaxError
        parser = HTMLParser(recover=True, encoding=self.encoding)

        source = self.io
        if _is_url(source):
            # ZipFile needs a seekable file, which an HTTP response is not,
            # so buffer the download in memory first.
            with urlopen(source) as response:
                source = io.BytesIO(response.read())
        # The main document body of a docx lives at word/document.xml.
        with zipfile.ZipFile(source) as zf:
            with zf.open('word/document.xml') as f:
                r = parse(f, parser=parser)
        try:
            r = r.getroot()
        except AttributeError:
            pass
        if not hasattr(r, 'text_content'):
            raise XMLSyntaxError("no text parsed from document", 0, 0, 0)

        # HACK: translate WordprocessingML tags to HTML tags
        for el in r.xpath('//tbl'):
            el.tag = 'table'
        for el in r.xpath('//tc'):
            el.tag = 'td'
        return r


_valid_parsers['docx'] = _LxmlDocxParser
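For anyone who wants to try the same idea without touching pandas internals, here is a rough standard-library sketch. It is not the implementation above: it parses word/document.xml as namespaced XML instead of relabelling tags for the HTML parser, and the function names read_docx_tables and _cell_text are made up for illustration. It assumes rows and cells are direct children of their table and row, so content wrapped in structured document tags would be missed.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace, in ElementTree's "{uri}tag" form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _cell_text(tc):
    # Hypothetical helper: join the text runs of each paragraph that is a
    # direct child of the cell, one line per paragraph. Text inside nested
    # tables is left out because only direct-child paragraphs are visited.
    paragraphs = []
    for p in tc.findall(W + "p"):
        paragraphs.append("".join(t.text or "" for t in p.iter(W + "t")))
    return "\n".join(paragraphs)


def read_docx_tables(path_or_file):
    # Hypothetical helper: return one list of rows per table found in the
    # document, where each row is a list of cell strings.
    with zipfile.ZipFile(path_or_file) as zf:
        with zf.open("word/document.xml") as f:
            root = ET.parse(f).getroot()

    tables = []
    # iter() walks the whole tree, so nested tables are returned as
    # separate entries after their parent table.
    for tbl in root.iter(W + "tbl"):
        rows = []
        for tr in tbl.findall(W + "tr"):
            rows.append([_cell_text(tc) for tc in tr.findall(W + "tc")])
        tables.append(rows)
    return tables
```

Each entry in the result can be handed to pd.DataFrame(rows) directly; deciding which rows are headers is left to the caller.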

Let me know what interest there is, or feel free to use this code in an implementation.


gfyoung · 2018-08-27T09:53:29Z

@jnothman : Do you mind providing an example docx file for us to try?

WillAyd · 2018-08-27T13:52:52Z

I’m a little hesitant here because I’m not aware of any good Python libraries to read and write Word files (@jnothman feel free to correct me), so I’m not sure how generalizable the reading of data in these types of files can be.

jnothman · 2018-09-04T03:05:26Z

> I’m a little hesitant here because I’m not aware of any good Python libraries to read and write Word files (@jnothman feel free to correct me) so I’m not sure how generalizable the reading of data in these types of files can be

I've not looked into these, but I'm not sure if it's relevant as long as we are:

working with the table markup defined by the open WordprocessingML standard
doing something similar to read_html: finding and parsing all tables
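On getting closer to what read_html does with thead: WordprocessingML lets a row be flagged as a repeating header row by putting a w:tblHeader element inside that row's w:trPr. A minimal sketch of how the leading header rows of a table could be counted follows. The helper names is_header_row and count_header_rows are hypothetical, and treating only the leading flagged rows as headers is an assumption of this sketch, not something settled in this thread.

```python
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace, in ElementTree's "{uri}tag" form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def is_header_row(tr):
    # Hypothetical helper: a row counts as a header when its row properties
    # contain w:tblHeader and that flag is not explicitly switched off.
    flag = tr.find(W + "trPr/" + W + "tblHeader")
    if flag is None:
        return False
    # On/off elements default to "on" when the w:val attribute is absent.
    return flag.get(W + "val", "true") not in ("0", "false", "off")


def count_header_rows(tbl):
    # Hypothetical helper: count the flagged rows at the top of the table,
    # stopping at the first row that is not flagged.
    count = 0
    for tr in tbl.findall(W + "tr"):
        if not is_header_row(tr):
            break
        count += 1
    return count
```

The count could then decide how many leading rows become column labels rather than data. Documents that format a header row by hand without setting the flag would still come through as plain data rows.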

I've converted the .html files in pandas/tests/io/data/ to docx; the results are at https://github.com/jnothman/pandas/tree/docx/pandas/tests/io/data

I can see in the spam.docx case that some mess at the top of the HTML table has now become an inappropriate row in the docx table and hence in the DataFrame. This might need some tweaking, but it is a result of first converting HTML to DocX, so it's a bit of a niche case.

jbrockmendel · 2019-12-22T16:55:51Z

Closing and adding to a tracker issue #30407 for IO format requests, can re-open if interest is expressed.

gfyoung added the Enhancement and IO Data labels Aug 27, 2018

gfyoung added the Needs Discussion label Aug 27, 2018

datapythonista mentioned this issue Sep 12, 2019

DEPR: Move rarely used I/O connectors to third party modules #28409

Closed

jbrockmendel mentioned this issue Dec 22, 2019

ENH: Requested IO readers/writers #30407

Open

13 tasks

jbrockmendel closed this as completed Dec 22, 2019
