Support XML as input format #9459

Open
fchareyr opened this issue Jun 20, 2023 · 11 comments
Labels
enhancement (New feature or an improvement of an existing feature), needs decision (Awaiting decision by a maintainer)

Comments

@fchareyr

fchareyr commented Jun 20, 2023

Problem description

It would be great to be able to read XML into a Polars DataFrame, similar to what pandas offers (https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html).

fchareyr added the enhancement label Jun 20, 2023
@horahh

horahh commented Jul 31, 2023

XML may not be a popular format anymore, but it is still widely used, so I believe supporting it would add a lot of value.

@mdeville

It's still massively used in production throughout the world.
As of today, I agree that almost no one would choose XML as a format to export or import their data.
But some (most?) companies that existed before the JSON, CSV, and XLSX hype didn't wait for those formats to create programs, APIs, and whatnot.
And since you don't fix what isn't broken, it's still very popular.

In my work there's not a single month (or dare I say week) without encountering tabular data presented as XML.

@MariusMerkleQC

This would indeed be very helpful!

For now, I would use pandas' function df_pd = pd.read_xml() to parse the XML file and then use df_pl = pl.from_pandas(df_pd).

What are alternatives?

@cmdlineluser
Contributor

@MariusMerkleQC pd.read_xml doesn't appear to do too much.

https://github.com/pandas-dev/pandas/blob/49ca01ba9023b677f2b2d1c42e99f45595258b74/pandas/io/xml.py#L757-L861

It seems to be essentially a small wrapper around lxml.etree / xml.etree.ElementTree

doc = lxml.etree.parse(...)
nodes = doc.xpath(...)

df = pd.DataFrame(nodes)

@MariusMerkleQC

Then it should be relatively easy to bring this to polars. What do you think?

@MariusMerkleQC

As it is not supported yet, I just used the library ElementTree to parse the .xml file. I then extracted value by value and just put them into a pl.DataFrame() one by one.
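
For anyone landing here, a minimal sketch of that ElementTree approach could look like the following. The XML layout (a `<rows>` root with flat `<row>` children) and the helper names `xml_to_rows` and `rows_to_polars` are made up for illustration; real files will likely need their own tag and attribute handling. All values come out as strings.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: one <row> element per record, one child per column.
SAMPLE_XML = """<rows>
  <row><id>1</id><name>alice</name></row>
  <row><id>2</id><name>bob</name></row>
</rows>"""


def xml_to_rows(xml_text, row_tag="row"):
    """Turn each <row_tag> element into a dict of {child tag: child text}."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text for child in record}
        for record in root.iter(row_tag)
    ]


def rows_to_polars(rows):
    """Build a DataFrame from the row dicts (values are still strings)."""
    import polars as pl  # imported lazily so the parsing step runs without polars

    return pl.DataFrame(rows)


rows = xml_to_rows(SAMPLE_XML)
print(rows)
```

Passing `rows` to `rows_to_polars` should then give a two-column frame; casting `id` to an integer would be a separate step, for example with `pl.col("id").cast(pl.Int64)`.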

@rupurt

rupurt commented Jan 24, 2024

Would definitely love to have native xml support in polars. Not hard to add but annoying when coming from pandas.

@blackerby

If this were to be implemented in Rust, the spark-xml data source Databricks created for Spark might be worth borrowing some ideas from. It uses a StAX approach to XML parsing, the same approach quick-xml takes.
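
To illustrate the streaming idea in Python terms, here is a rough analogue using the standard library's `xml.etree.ElementTree.iterparse`. This is not spark-xml or quick-xml, only a sketch of the same general pattern: records are emitted as soon as their closing tag is seen and then cleared, instead of building the whole tree first. The tag names and the `stream_rows` helper are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data, as bytes to mimic reading from a file.
SAMPLE_XML = (
    b"<rows>"
    b"<row><id>1</id><name>alice</name></row>"
    b"<row><id>2</id><name>bob</name></row>"
    b"</rows>"
)


def stream_rows(source, row_tag="row"):
    """Yield one dict per <row_tag> element while the input is still being read."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            yield {child.tag: child.text for child in elem}
            elem.clear()  # drop the children of a record we have already emitted


streamed = list(stream_rows(io.BytesIO(SAMPLE_XML)))
print(streamed)
```

A Rust implementation would presumably follow the same shape, with the reader's events feeding column builders directly rather than Python dicts.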

@deanm0000
Collaborator

Polars isn't going to implement an XML reader based on Python's XML reader; it'd have to be in Rust. I can't say whether or not the maintainers want the extra binary size.

deanm0000 added the needs decision label Apr 25, 2024
@cmdlineluser
Contributor

It seems that calamine (used by fastexcel in the read_excel engine #14000) uses the quick-xml library that @blackerby has mentioned.

Perhaps something could be done with quick-xml if Calamine integration happens at the Rust level.

We may be able to push the speed even further by integrating directly with Calamine down in the lower-levels of the Rust engine

@deanm0000
Collaborator

Along with XML reading, I'd like to see HTML reading (#13063) too.

8 participants