update csv output format to match OpenAI's Whisper dataframe output #552

hykelvinlee42 · 2023-03-01T17:44:07Z

Currently, the CSV output file has no header row with column names, and reading it with a data analysis tool such as pandas raises a parsing (tokenizing) error.

import pandas as pd

columns = ["start", "end", "text"]  # column names supplied by the caller

pd.read_csv("output.csv", names=columns)
# pandas.errors.ParserError: Error tokenizing data

This can be worked around by setting the delimiter to a whitespace character in the read function. However, the start and end timestamps are then read as strings with trailing commas by default.

pd.read_csv("output.csv", names=columns, delimiter=" ")
"""
      start      end                                               text
0        0,   10320,   We're not actually looking at on the screen, ...
1    10320,   18100,        And you can see that recording has started?
2    18100,   23600,                       Recording and transcription.
"""

With this change, the resulting CSV file can be read successfully with default settings, and the start and end timestamps are parsed as int64.
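As a rough illustration of the intended result, here is a minimal sketch that reads a small in-memory sample with pandas defaults. The header row `start,end,text` and the sample rows are assumptions based on the column names shown above, not output copied from the tool.

```python
import io

import pandas as pd

# Hypothetical sample of the updated output: a header row followed by
# comma-separated timestamps and a double-quoted text field.
sample = io.StringIO(
    "start,end,text\n"
    '0,10320,"First segment, with a comma in it"\n'
    '10320,18100,"And you can see that recording has started?"\n'
    '18100,23600,"Recording and transcription."\n'
)

# No names= or delimiter= arguments: the defaults are enough here.
df = pd.read_csv(sample)

print(df.dtypes)
print(df)
```

With a header row present, the column names come from the file itself, and the timestamp columns contain only digits, so pandas infers an integer dtype for them.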

jordibruin · 2023-03-02T11:55:15Z

I'm curious: with this format, how will the CSV not get messed up when there are commas in the text? Won't that trigger a 'next item' in the CSV parsing?

hykelvinlee42 · 2023-03-02T13:04:32Z

@jordibruin the text is already enclosed in double quotes ("), so pandas, or any CSV parser that I know of, will parse everything within the double quotes as a single string, commas included.

whisper.cpp/examples/main/main.cpp

Line 362 in 86cddf8

fout << 10 * t0 << "," << 10 * t1 << ",\"" << text << "\"\n";
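The quoting behaviour can be checked without pandas. The following is a minimal sketch using Python's standard `csv` module; `format_row` is a hypothetical helper that only imitates the formatting of the C++ line above, and the sample values are made up.

```python
import csv
import io


def format_row(t0, t1, text):
    # Hypothetical stand-in for the C++ line quoted above: timestamps are
    # multiplied by 10 and the text is wrapped in double quotes.
    return f'{10 * t0},{10 * t1},"{text}"\n'


line = format_row(0, 1032, "Hello, world, again")

# csv.reader treats everything between the double quotes as one field,
# so the commas inside the text do not start new columns.
rows = list(csv.reader(io.StringIO(line)))
print(rows)  # [['0', '10320', 'Hello, world, again']]
```

The row still has exactly three fields even though the text contains two commas.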

updated csv output

86cddf8

ggerganov merged commit 72af0f5 into ggerganov:master Mar 2, 2023

anandijain pushed a commit to anandijain/whisper.cpp that referenced this pull request Apr 28, 2023

main : add csv header (ggerganov#552)

dabc793

jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023

main : add csv header (ggerganov#552)

a0f4578

jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023

main : add csv header (ggerganov#552)

1c38951

landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023

main : add csv header (ggerganov#552)

0e9b45e

iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024

main : add csv header (ggerganov#552)

e4daabb
