Improve language detection when using clip_timestamps #867

Merged
merged 1 commit into from
Jul 1, 2024

@ben91lin ben91lin commented Jun 4, 2024

Use clip_timestamps to set the initial seek position for language detection, so that the language is detected from the start of the first requested clip rather than from the start of the audio file, where detection can be incorrect.
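The idea in this description can be sketched as a small conversion from the first clip timestamp to a frame index. This is only an illustration: `initial_seek` is a hypothetical helper, not a function from the library, and it assumes that `clip_timestamps` is either a comma-separated string or a list of seconds and that features are produced at 100 frames per second.

```python
from typing import List, Union

# Assumption: 16 kHz audio with a hop length of 160 samples, i.e. 100 frames/s.
FRAMES_PER_SECOND = 100


def initial_seek(
    clip_timestamps: Union[str, List[float]],
    frames_per_second: int = FRAMES_PER_SECOND,
) -> int:
    """Return the frame index at which language detection would start.

    Hypothetical helper for illustration only.
    """
    if isinstance(clip_timestamps, str):
        # Accept a comma-separated string such as "67,80".
        values = [float(ts) for ts in clip_timestamps.split(",") if ts.strip()]
    else:
        values = [float(ts) for ts in clip_timestamps]

    # Fall back to the start of the file when no clip is given.
    start_timestamp = values[0] if values else 0.0
    return int(start_timestamp * frames_per_second)
```

For example, `initial_seek("67,80")` gives a seek of 6700 frames under the 100 frames per second assumption.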

@ben91lin ben91lin force-pushed the language-detection branch 3 times, most recently from 1becddb to 65dcdc4 Compare June 4, 2024 05:11
detected_language_info = {}
seek = int(

This should be replaced with:

seek = int(start_timestamp * self.frames_per_second)

@trungkienbkhn
Collaborator

@ben91lin, hello. Thanks for your improvement. However, I found that an error occurs if clip_timestamps[0] is greater than the audio duration:

    language = max(
ValueError: max() arg is an empty sequence

Could you fix this?
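One plausible way this error arises is sketched below. It assumes a detection loop that collects per-language votes in `detected_language_info` (the name appears in the diff context above) and only runs while `seek` is inside the audio. The function `detect_language`, its parameters, and the fixed "en" vote are hypothetical stand-ins, not the library's code.

```python
def detect_language(num_frames: int, seek: int, window: int = 3000) -> str:
    """Toy stand-in for a detection loop; all names here are hypothetical."""
    detected_language_info = {}

    # The loop body never runs when seek already lies past the last frame.
    while seek < num_frames:
        # A real implementation would run the model on this window.
        # This sketch records a fixed vote so the control flow is visible.
        detected_language_info.setdefault("en", []).append(1.0)
        seek += window

    # With no votes collected, max() receives an empty dict and raises ValueError.
    return max(
        detected_language_info,
        key=lambda lang: len(detected_language_info[lang]),
    )
```

Calling `detect_language(8000, 6700)` returns a language, while `detect_language(8000, 9000)` raises `ValueError` because no window was ever processed.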

@ben91lin ben91lin force-pushed the language-detection branch 4 times, most recently from b8cc0fc to 370902e Compare June 6, 2024 18:55
@ben91lin
Contributor Author

ben91lin commented Jun 6, 2024

# If seek is beyond all frames, set it to the last segment.
if seek >= features.shape[-1]:
    seek = content_frames

If audio_length is 80s and start_timestamp is 67s, the last 13 seconds are clipped for detection.
If start_timestamp is greater than or equal to 80s, detection is forced to use the last nb_max_frames frames.
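The arithmetic behind the 80s / 67s example can be checked with a short sketch. It assumes 100 feature frames per second and a detection window of 3000 frames (30 s); both constants and the helper `real_seconds_in_window` are assumptions made for illustration.

```python
FRAMES_PER_SECOND = 100  # assumption: 100 feature frames per second
NB_MAX_FRAMES = 3000     # assumption: 30 s detection window


def real_seconds_in_window(audio_seconds: float, start_seconds: float) -> float:
    """Seconds of real (unpadded) audio inside the detection window."""
    content_frames = int(audio_seconds * FRAMES_PER_SECOND)
    seek = int(start_seconds * FRAMES_PER_SECOND)

    # The window cannot contain real audio beyond the end of the content.
    window_end = min(seek + NB_MAX_FRAMES, content_frames)
    return max(window_end - seek, 0) / FRAMES_PER_SECOND
```

Under these assumptions, `real_seconds_in_window(80, 67)` is 13.0, and a start at or beyond 80 s leaves 0.0 seconds of real audio in the window.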

@trungkienbkhn
Collaborator

# If seek is beyond all frames, set it to the last segment.
if seek >= features.shape[-1]:
    seek = content_frames

I think this logic is wrong. In this case, if seek = content_frames, then faster-whisper will detect the language from content_frames to features.shape[-1].
=> These are padded values (meaningless) and can lead to incorrect language detection.
seek should be set to 0 if seek >= content_frames.
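The padding problem and the suggested fallback can be illustrated with a toy feature sequence. This is a minimal sketch that assumes the features consist of `content_frames` real frames followed by padding, represented here as zeros; `safe_seek` and `detection_window` are hypothetical helpers, and the tiny sizes are chosen only for readability.

```python
from typing import List


def safe_seek(seek: int, content_frames: int) -> int:
    """Fall back to the start of the audio when seek is past the real content."""
    return 0 if seek >= content_frames else seek


def detection_window(features: List[float], seek: int, nb_max_frames: int) -> List[float]:
    """Return the frames that language detection would look at."""
    return features[seek : seek + nb_max_frames]


# Toy data: 8 real frames (1.0) followed by 3 padded frames (0.0).
content_frames = 8
nb_max_frames = 3
features = [1.0] * content_frames + [0.0] * nb_max_frames

# Seeking to content_frames selects only padding.
padding_only = detection_window(features, content_frames, nb_max_frames)

# Falling back to 0 selects real audio instead.
fallback = detection_window(features, safe_seek(content_frames, content_frames), nb_max_frames)
```

Here `padding_only` contains only zeros, while `fallback` contains real frames, which is why resetting seek to 0 avoids detecting the language from meaningless values.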

@ben91lin ben91lin force-pushed the language-detection branch from 370902e to 73dc4b2 Compare June 7, 2024 16:41
@ben91lin ben91lin force-pushed the language-detection branch from 73dc4b2 to 6f07c97 Compare June 7, 2024 16:46
@ben91lin
Contributor Author

ben91lin commented Jun 7, 2024

I think you are right; I overlooked the padding added by FeatureExtractor. Setting seek = 0 if seek >= content_frames is the correct approach.

@trungkienbkhn trungkienbkhn merged commit 8862bee into SYSTRAN:master Jul 1, 2024
3 checks passed