Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExcelControl(stringified=True) documentation could clarify interaction with schemas #1659

Open
richardt-engineb opened this issue Jun 12, 2024 · 1 comment · May be fixed by #1661
Open

ExcelControl(stringified=True) documentation could clarify interaction with schemas #1659

richardt-engineb opened this issue Jun 12, 2024 · 1 comment · May be fixed by #1661

Comments

@richardt-engineb
Copy link

Overview

formats.excelControl.stringified is documented as "Stringifies all the cell values.".

However it appears that it doesn't actually do that, even in the test cases that were added as part of the PR that added the stringified option.

e.g. in def test_xls_parser_cast_int_to_string_1251_xlsx() the test is:

resource = Resource(descriptor, control=formats.ExcelControl(stringified=True))
assert resource.read_rows() == [
    {"A": "001", "B": "b", "C": "1", "D": "a", "E": 1},
    {"A": "002", "B": "c", "C": "1", "D": "1", "E": 1},
]

where the "E": 1 is a number not a string.

Or an even simpler reproduction:

from frictionless import Resource, formats

res = Resource(path="data/table-infer-boolean.xlsx", control=formats.ExcelControl(stringified=True))
print(res.read_rows())

prints:

[{'number': Decimal('1.0'), 'string': 'two', 'boolean': True}, {'number': Decimal('3.0'), 'string': 'four', 'boolean': False}]

Is this the intended behaviour and I am misunderstanding the purpose of stringified=True or is this a bug (which I am happy to look at fixing).

To be honest, the code looks like it should convert everything to strings, but it doesn't seem to according to the unit tests (and our tests). Perhaps @roll or @shashigharti might remember from their work on the PR that created this capability?

@richardt-engineb
Copy link
Author

richardt-engineb commented Jun 12, 2024

With a bit more research, I think I understand what it is doing (though still counter-intuitive).

The stringified=True only stringifies the lowest level return from the Excel parser. Before you get to read_rows() or similar, the schema is used to convert whatever is returned by the parser. If you don't provide a schema, then detect_schema() (in detector.py) will detect a schema and that will likely pick types other than string.

So to actually get an output that is all strings you need both stringified=True and a schema with all columns set to string types.

So using table_infer_boolean.xlsx which looks like:

/ a b c
1 number string boolean
2 1 two true
3 3 four false

depending on the setting for stringified and schema we get different output:

stringified schema output notes
False None [{'number': 1, 'string': 'two', 'boolean': True}, {'number': 3, 'string': 'four', 'boolean': False}] Inferred schema, typed output
True None [{'number': 1, 'string': 'two', 'boolean': True}, {'number': 3, 'string': 'four', 'boolean': False}] Inferred schema overrides stringified so still typed output
False {"fields": [{"name": "number", "type": "string"},{"name": "string", "type": "string"},{"name": "boolean", "type": "string"}]} [{'number': None, 'string': 'two', 'boolean': None}, {'number': None, 'string': 'four', 'boolean': None}] Without stringified, but with an all-string schema, non-string cells are returned as None
True {"fields": [{"name": "number", "type": "string"},{"name": "string", "type": "string"},{"name": "boolean", "type": "string"}]} [{'number': '1.0', 'string': 'two', 'boolean': 'True'}, {'number': '3.0', 'string': 'four', 'boolean': 'False'}] Both stringified and all-string schema is the only way to get all-string output. Note though, that the 1 has been converted to "1.0" which is not same, so there has clearly been some type conversions even in this case.

This could probably do with a documentation update to clarify what stringified does (and doesn't) do.

@richardt-engineb richardt-engineb changed the title ExcelControl(stringified=True) doesn't seem to work ExcelControl(stringified=True) documentation could clarify interaction with schemas Jun 13, 2024
richardt-engineb added a commit to Engine-B/frictionless-py that referenced this issue Jun 13, 2024
…action with schema

- fixes ExcelControl(stringified=True) documentation could clarify interaction with schemas frictionlessdata#1659
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant