Passing skip_rows into scan_csv and read_csv with glob patterns doesn't skip rows before header in first file and doesn't skip rows at all in subsequent files #6692

Closed
2 tasks done
qiemem opened this issue Feb 5, 2023 · 1 comment · Fixed by #6754
Labels
bug Something isn't working python Related to Python Polars

Comments

@qiemem
Contributor

qiemem commented Feb 5, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

Passing skip_rows to scan_csv/read_csv with a glob pattern still uses the first line of the first file as the header, instead of skipping the specified number of rows before reading the header. In addition, no rows are skipped at all in any file after the first, even though the main use of skip_rows is to skip metadata headers, which are presumably the same across all files.

For anyone else who runs into this, a workaround is to do the globbing manually with the built-in glob library:

pl.concat([pl.read_csv(fn, skip_rows=6) for fn in glob("data_*.csv")])

Reproducible example

Input:

import polars as pl

for i in range(3):
    with open(f"test_{i}.csv", "w") as f:
        f.write(
            f"""
metadata goes here
file number {i}
foo,bar,baz
1,2,3
4,5,6
7,8,9
"""
        )

print("With glob")
print(
    pl.read_csv("test_*.csv", skip_rows=2)
    .to_pandas()
    .to_markdown(tablefmt="github", index=False)
)
print("\nSingle file without glob")
print(
    pl.read_csv("test_0.csv", skip_rows=2)
    .to_pandas()
    .to_markdown(tablefmt="github", index=False)
)

Output:

With glob

metadata goes here
foo
1
4
7
metadata goes here
file number 1
foo
1
4
7
metadata goes here
file number 2
foo
1
4
7
metadata goes here
file number 3
foo
1
4
7
metadata goes here
file number 4
foo
1
4
7
metadata goes here
file number 5
foo
1
4
7
metadata goes here
file number 6
foo
1
4
7
metadata goes here
file number 7
foo
1
4
7
metadata goes here
file number 8
foo
1
4
7
metadata goes here
file number 9
foo
1
4
7

Single file without glob

foo bar baz
1 2 3
4 5 6
7 8 9

Expected behavior

With glob

foo bar baz
1 2 3
4 5 6
7 8 9
1 2 3
4 5 6
7 8 9
1 2 3
4 5 6
7 8 9

Single file without glob

foo bar baz
1 2 3
4 5 6
7 8 9

Installed versions

---Version info---
Polars: 0.16.1
Index type: UInt32
Platform: Linux-6.0.16-300.fc37.x86_64-x86_64-with-glibc2.36
Python: 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) 
[GCC 9.4.0]
---Optional dependencies---
pyarrow: 6.0.1
pandas: 1.4.0
numpy: 1.22.2
fsspec: 2022.01.0
connectorx: 0.2.4
xlsx2csv: <not installed>
deltalake: <not installed>
matplotlib: 3.5.1
@qiemem qiemem added bug Something isn't working python Related to Python Polars labels Feb 5, 2023