Calculate XPT number_rows using metadata and final chunk #261

Open
gerrycampion opened this issue Apr 30, 2024 · 3 comments
Labels
requires changes in Readstat (waiting for changes in the C library Readstat)

Comments

@gerrycampion

Describe the issue
According to the documentation for xpt metadata, number_rows cannot be determined unless the entire dataset is read. I understand that number_rows cannot be extracted from the metadata alone, but I think it can be calculated using only the metadata and final 80-byte chunk.

Expected behavior

  • Read the header information to find: variable_storage_widths and the start of record data
  • Calculate record_storage_width as sum of variable_storage_widths
  • Read the last 80-byte chunk of data to find out how much trailing ASCII blank padding there is.
  • Calculate number of records using:
    (total_file_size - start - padding) / record_storage_width
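
The steps above can be sketched roughly as follows. This is only an illustration of the arithmetic, not an existing pyreadstat or ReadStat API: the function names `count_trailing_blanks`, `estimate_number_rows`, and `read_last_chunk` are hypothetical, and `data_start` and `variable_storage_widths` are assumed to have been obtained already by parsing the XPT header (not shown). The sketch also assumes that ceiling division is acceptable, so that trailing blanks belonging to the last record's character variables are not mistaken for missing rows. Records that consist entirely of blanks at the end of the file would still be indistinguishable from padding.

```python
import os


def count_trailing_blanks(chunk: bytes) -> int:
    """Count ASCII blanks (0x20) at the end of a byte string."""
    return len(chunk) - len(chunk.rstrip(b" "))


def read_last_chunk(path: str, size: int = 80):
    """Return (total_file_size, last `size` bytes) without reading the whole file."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(max(file_size - size, 0))
        return file_size, f.read()


def estimate_number_rows(file_size, data_start, variable_storage_widths, last_chunk):
    """Estimate the record count from header-derived values and the final chunk.

    data_start and variable_storage_widths are assumed to come from a
    header parser that is not part of this sketch.
    """
    record_storage_width = sum(variable_storage_widths)
    if record_storage_width <= 0:
        raise ValueError("record storage width must be positive")
    padding = count_trailing_blanks(last_chunk[-80:])
    data_bytes = file_size - data_start - padding
    if data_bytes <= 0:
        return 0
    # Ceiling division: blanks that are really part of the last record
    # shorten data_bytes, so round up to the next whole record.
    return -(-data_bytes // record_storage_width)
```

For example, with three variables of widths 8, 8, and 4 (20 bytes per record), a data section starting at a hypothetical offset of 800, and a file of 880 bytes whose last chunk ends in 20 blanks, the estimate is `(880 - 800 - 20) / 20 = 3` records.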
@gerrycampion gerrycampion changed the title Calculate number_rows using metadata and final chunk Calculate XPT number_rows using metadata and final chunk Apr 30, 2024
@ofajardo
Collaborator

Thanks for the interesting suggestion. Pyreadstat is a wrapper around the C library ReadStat, so new functionality has to be implemented there before I can expose it here. I do not think ReadStat has functions that return the start of the data or the padding, so the calculation cannot be done right now. You can suggest it over there, and once it is implemented I can wrap it and provide it in Pyreadstat.

@gerrycampion
Author

WizardMac/ReadStat#315

@measiala

measiala commented May 1, 2024

I believe that the number of rows is available in v8 XPORT files, at least for those created by SAS v9.0401M8. This prevents readstat-created v8 XPORT files from being read by this version of SAS, because readstat does not write the observation count that SAS expects.

Unfortunately, this revised layout does not appear to be documented in the official v8/v9 XPORT layout documentation released by SAS in Oct 2021.

I am currently trying to test the changes necessary to the readstat code to, first of all, write the file. Then there could be some optional code to read in that metadata from the XPORT observation header.

I'll try to get this posted to the readstat site as a new issue (and ideally a PR) soonish once I finish testing "in my spare time". :)

-- Edit: This has been posted as issue #316. I included a blurb about reading in the observation count when available. This could be a partial solution to your issue.

@ofajardo ofajardo added the requires changes in Readstat waiting for changes in the C library Readstat label May 21, 2024