Write to file w/o including `filename` column #293

joshwyatt · 2024-10-10T18:52:02Z

Is your feature request related to a problem? Please describe.

I'm working through the Classifier and Heuristic Quality Filtering docs and looking for an elegant way to write filtered documents back to file. If I follow the docs' example, I read with `books = DocumentDataset.read_json(files, add_filename=True)` and ultimately write with `long_books.to_json("long_books/", write_to_filename=True)`. This gives me output files with the correct names, but the written data now contains a `filename` field, which I do not want.

If I instead omit `add_filename=True` and call `.to_json("long_books")`, I get the data I want, but in a file called `0.part`.

Describe the solution you'd like

I'd like to be able to write directly to a `.jsonl` file, either without creating the `filename` column or without including it in the output file.
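
To make the requested behavior concrete, here is a minimal standard-library sketch. It is not NeMo Curator's API: the helper `write_without_filename`, its `filename_key` parameter, and the sample records are all hypothetical. It assumes each record carries its source file name in a `filename` field, uses that field only to choose the output file, and drops it from the written rows. It also holds every record in memory, so it illustrates the desired output rather than a scalable implementation.

```python
import json
import os
import tempfile
from collections import defaultdict


def write_without_filename(records, output_dir, filename_key="filename"):
    """Write records to one JSONL file per source filename,
    omitting the filename field from the written rows.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    os.makedirs(output_dir, exist_ok=True)

    # Group rows by the file they came from.
    groups = defaultdict(list)
    for record in records:
        row = dict(record)  # copy so the caller's records are not mutated
        name = row.pop(filename_key)  # used for routing only, not written
        groups[name].append(row)

    # Write each group to a file of the same name, one JSON object per line.
    written = []
    for name, rows in groups.items():
        path = os.path.join(output_dir, name)
        with open(path, "w", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row) + "\n")
        written.append(path)
    return sorted(written)


# Hypothetical sample data standing in for the filtered documents.
records = [
    {"text": "a long book", "filename": "books_0.jsonl"},
    {"text": "another long book", "filename": "books_1.jsonl"},
    {"text": "a third long book", "filename": "books_0.jsonl"},
]
output_dir = tempfile.mkdtemp()
paths = write_without_filename(records, output_dir)
```

The output files keep their original names (`books_0.jsonl`, `books_1.jsonl`), and each line contains only the remaining fields, here just `text`.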

Describe alternatives you've considered

df = long_books.to_pandas()
df.to_json('output.jsonl', orient='records', lines=True)

...which won't work for larger datasets or multiple files.


joshwyatt added the enhancement New feature or request label Oct 10, 2024

sarahyurick self-assigned this Oct 14, 2024

sarahyurick mentioned this issue Oct 14, 2024

[BUG] Semdedup Embedding Restart not working cleanly #211

Open

sarahyurick mentioned this issue Oct 22, 2024

Write to file without including "filename" column #317

Merged

sarahyurick closed this as completed in #317 Oct 23, 2024
