read_docx? #22518

Closed
jnothman opened this issue Aug 27, 2018 · 4 comments
Labels
Enhancement · IO Data (IO issues that don't fit into a more specific label) · Needs Discussion (Requires discussion from core team before further action)

Comments

@jnothman
Contributor

jnothman commented Aug 27, 2018

Problem description

I sometimes need to extract tables from docx files, rather than from HTML. Given that docx XML is very HTML-like when it comes to tables, it seems appropriate to reuse Pandas' loading facilities, ideally without first converting the whole docx to HTML.

Here is a hacky solution, which simply:

  • reads 'word/document.xml' from the docx zipfile
  • translates WordprocessingML's tbl to table and tc to td. (I've not looked into how Word marks things corresponding to th, thead and tfoot.)

Working implementation, invoked by pd.read_html('file://path/to/my.docx', flavor='docx'):

from pandas.io.html import _LxmlFrameParser, _valid_parsers
from pandas.io.common import _is_url, urlopen


class _LxmlDocxParser(_LxmlFrameParser):
    def _build_doc(self):
        import io
        import zipfile
        from lxml.html import parse, HTMLParser
        from lxml.etree import XMLSyntaxError
        parser = HTMLParser(recover=True, encoding=self.encoding)

        if _is_url(self.io):
            # ZipFile needs a seekable file, which an HTTP response is not,
            # so buffer the download in memory first
            with urlopen(self.io) as resp:
                source = io.BytesIO(resp.read())
        else:
            source = self.io
        # the document body, including its tables, lives in word/document.xml
        with zipfile.ZipFile(source) as zf:
            with zf.open('word/document.xml') as f:
                r = parse(f, parser=parser)
        try:
            r = r.getroot()
        except AttributeError:
            pass
        if not hasattr(r, 'text_content'):
            raise XMLSyntaxError("no text parsed from document", 0, 0, 0)

        # HACK: translate WordprocessingML tags to HTML tags
        for el in r.xpath('//tbl'):
            el.tag = 'table'
        for el in r.xpath('//tc'):
            el.tag = 'td'
        return r


_valid_parsers['docx'] = _LxmlDocxParser

Let me know what interest there is, or feel free to use this code in an implementation.

gfyoung added the Enhancement and IO Data labels Aug 27, 2018
@gfyoung
Member

gfyoung commented Aug 27, 2018

@jnothman : Do you mind providing an example docx file for us to try?

gfyoung added the Needs Discussion label Aug 27, 2018
@WillAyd
Member

WillAyd commented Aug 27, 2018

I’m a little hesitant here because I’m not aware of any good Python libraries to read and write Word files (@jnothman, feel free to correct me), so I’m not sure how generalizable reading data from these types of files can be.

@jnothman
Contributor Author

jnothman commented Sep 4, 2018

I’m a little hesitant here because I’m not aware of any good Python libraries to read and write Word files (@jnothman, feel free to correct me), so I’m not sure how generalizable reading data from these types of files can be.

I've not looked into these, but I'm not sure if it's relevant as long as we are:

  • working with the table markup defined by the open WordprocessingML standard
  • doing something similar to read_html: finding and parsing all tables

I've converted the .html files in pandas/tests/io/data/ to docx; they are at https://github.com/jnothman/pandas/tree/docx/pandas/tests/io/data

I can see in the spam.docx case that some mess at the top of the HTML table has now become an inappropriate row in the docx table and hence in the DataFrame. This might need some tweaking, but it is a result of first converting HTML to DocX, so it's a bit of a niche case.

@jbrockmendel
Member

Closing and adding to tracker issue #30407 for IO format requests; can re-open if interest is expressed.
