update csv output format to match OpenAI's Whisper dataframe output #552

Merged on Mar 2, 2023

hykelvinlee42

Currently, the CSV output file has no header row with column names, and reading it with a data analysis tool (e.g. pandas) fails with a tokenizing error.

import pandas as pd

# column names taken from the output shown further down
columns = ["start", "end", "text"]
pd.read_csv("output.csv", names=columns)
# pandas.errors.ParserError: Error tokenizing data

This can be worked around by setting the delimiter to a whitespace character in the read call. However, the start and end timestamps are then read as strings with trailing commas.

pd.read_csv("output.csv", names=columns, delimiter=" ")
"""
      start      end                                               text
0        0,   10320,   We're not actually looking at on the screen, ...
1    10320,   18100,        And you can see that recording has started?
2    18100,   23600,                       Recording and transcription.
"""
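
One plausible reading of the behaviour above, assuming the old rows were written as `start, end, "text"` with a space after each comma (inferred from the whitespace-delimited output, not confirmed in this thread): the opening quote does not sit at the start of its field, so a comma-delimited parser treats it as a literal character and splits on commas inside the text. A minimal sketch using Python's standard `csv` module and a made-up row:

```python
import csv
import io

# Hypothetical row in the assumed old format: a space follows each comma.
old_row = '0, 10320, "We\'re not actually looking at, on the screen"\n'

# Default comma delimiter: the quote is preceded by a space, so it is not
# treated as an opening quote and the comma inside the text splits the field.
comma_fields = next(csv.reader(io.StringIO(old_row)))

# Space delimiter: the quote now starts its field, so the text stays intact,
# but the timestamps keep their trailing commas.
space_fields = next(csv.reader(io.StringIO(old_row), delimiter=" "))

print(len(comma_fields))   # 4 fields instead of the expected 3
print(space_fields[:2])    # ['0,', '10320,']
```

The same field-count mismatch is the kind of condition that makes a stricter parser report a tokenizing error.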

This change allows the resulting CSV file to be read successfully with default settings, with the start and end timestamps parsed as int64.

@jordibruin

I'm curious: with this format, how does the CSV not get messed up when there are commas in the text? Won't that trigger a 'next item' in the CSV parsing?

@hykelvinlee42
hykelvinlee42 commented Mar 2, 2023

@jordibruin the text is already enclosed in double quotes ("). So pandas, or any CSV parser that I know of, will parse everything within the double quotes as a single string, commas included.

fout << 10 * t0 << "," << 10 * t1 << ",\"" << text << "\"\n";
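
To illustrate the point about quoting, here is a small sketch that parses rows in the layout the line above produces. It uses Python's standard `csv` module rather than pandas, the sample rows are made up, and the `start,end,text` header is an assumption based on the column names used earlier in this thread.

```python
import csv
import io

# Made-up sample in the new layout; the header names are an assumption
# based on the column names used earlier in this thread.
sample = (
    'start,end,text\n'
    '0,10320,"We\'re not actually looking at, on the screen"\n'
    '10320,18100,"And you can see that recording has started?"\n'
)

rows = list(csv.DictReader(io.StringIO(sample)))

# The quoted text, comma included, arrives as a single field.
print(rows[0]["text"])

# The timestamps contain only digits, so they convert cleanly to integers.
print([(int(r["start"]), int(r["end"])) for r in rows])
```

Because no space separates the comma from the opening quote, the quote is recognised as the start of a quoted field and the comma inside the text is not treated as a delimiter.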

ggerganov merged commit 72af0f5 into ggerganov:master on Mar 2, 2023