Support XML as input format #9459

fchareyr · 2023-06-20T11:52:28Z

Problem description

It would be great to be able to read XML into Polars DataFrame, similarly to what pandas offers (https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html).

horahh · 2023-07-31T01:39:53Z

XML is not a hot format anymore, but it is still widely used, so I believe supporting it would add a lot of value.

mdeville · 2023-08-11T15:06:26Z

It's still massively used in production throughout the world.
As of today I agree that almost no-one would use XML as a format to export / import their data.
But some (most?) companies that existed before the JSON, CSV, XLSX hype didn't wait for those formats to create programs, APIs and whatnot.
And since you don't fix what's not broken, it's still very popular.

In my work there's not a single month (or dare I say week) without encountering tabular data presented as XML.

MariusMerkleQC · 2023-09-07T08:17:29Z

This would indeed be very helpful!

For now, I would use pandas' function df_pd = pd.read_xml() to parse the XML file and then use df_pl = pl.from_pandas(df_pd).

What are alternatives?

cmdlineluser · 2023-09-07T12:28:41Z

@MariusMerkleQC pd.read_xml doesn't appear to do too much.

https://github.com/pandas-dev/pandas/blob/49ca01ba9023b677f2b2d1c42e99f45595258b74/pandas/io/xml.py#L757-L861

It seems to be essentially a small wrapper around lxml.etree / xml.etree.ElementTree

doc = lxml.etree.parse(...)
nodes = doc.xpath(...)

df = pd.DataFrame(nodes)

MariusMerkleQC · 2023-09-07T12:32:09Z

Then it should be relatively easy to bring this to polars, what do you think?

MariusMerkleQC · 2023-09-25T20:30:28Z

As it is not supported yet, I just used the library ElementTree to parse the .xml file. I then extracted value by value and just put them into a pl.DataFrame() one by one.
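A minimal sketch of that kind of workaround, using only the standard library's ElementTree. The sample document, the `row` tag, and the helper name `xml_to_columns` are made up for illustration; real files will usually need handling for attributes, namespaces, and nesting.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and field names are invented.
XML = """
<rows>
  <row><id>1</id><name>alpha</name></row>
  <row><id>2</id><name>beta</name></row>
  <row><id>3</id></row>
</rows>
"""


def xml_to_columns(xml_text, row_tag):
    """Turn each <row_tag> element into a record, then pivot to columns."""
    root = ET.fromstring(xml_text.strip())

    # One dict per row: child tag -> child text (always a string or None).
    records = [
        {child.tag: child.text for child in row}
        for row in root.iter(row_tag)
    ]

    # Collect field names in first-seen order so column order is stable.
    fields = []
    for record in records:
        for key in record:
            if key not in fields:
                fields.append(key)

    # Missing fields become None so every column has the same length.
    return {field: [record.get(field) for record in records] for field in fields}


columns = xml_to_columns(XML, "row")
```

The resulting dict of equal-length lists can be passed straight to `pl.DataFrame(columns)`. Every value is a string at that point, so numeric columns would still need a cast, for example `pl.col("id").cast(pl.Int64)`.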

rupurt · 2024-01-24T00:26:26Z

Would definitely love to have native xml support in polars. Not hard to add but annoying when coming from pandas.

blackerby · 2024-04-22T23:35:36Z

If this were to be implemented in Rust, the spark-xml data source Databricks created for Spark might be worth borrowing some ideas from. It uses a StAX approach to XML parsing, the same approach quick-xml takes.
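To illustrate the streaming idea only, here is a rough Python analogue using the standard library's `ElementTree.iterparse`. It is not the quick-xml or spark-xml API, and the `row` tag, sample bytes, and `iter_rows` helper are assumptions made up for this sketch.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source, row_tag):
    """Yield one dict per <row_tag> element as the parser reaches its end tag."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            # All children have already been parsed when the row's end tag fires.
            yield {child.tag: child.text for child in elem}
            # Drop the row's children so memory does not grow with file size.
            elem.clear()


# Hypothetical sample input; a real caller would pass a file path or file object.
SAMPLE = (
    b"<rows>"
    b"<row><id>1</id><name>alpha</name></row>"
    b"<row><id>2</id><name>beta</name></row>"
    b"</rows>"
)

rows = list(iter_rows(io.BytesIO(SAMPLE), "row"))
```

Because rows are produced one at a time, they could be collected in batches and handed to `pl.DataFrame(rows)` rather than loading the whole document first.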

deanm0000 · 2024-04-25T12:12:22Z

polars isn't going to implement an XML reader based on Python's XML reader; it would have to be in Rust. I can't say whether or not the maintainers want the extra binary size.

cmdlineluser · 2024-04-25T13:34:08Z

It seems that calamine (used by fastexcel in the read_excel engine #14000) uses the quick-xml library that @blackerby has mentioned.

Perhaps something could be done with quick-xml if Calamine integration happens at the Rust level.

We may be able to push the speed even further by integrating directly with Calamine down in the lower-levels of the Rust engine

deanm0000 · 2024-04-25T14:00:27Z

With XML reading I'd like to see #13063 HTML reading too.

fchareyr added the enhancement New feature or an improvement of an existing feature label Jun 20, 2023

deanm0000 added the needs decision Awaiting decision by a maintainer label Apr 25, 2024
