❓ Questions / Help / Support #562

sunnnnnnnny · 2024-10-23T02:12:37Z


❓ Questions and Help

How can I run VAD and keep a little silence at the beginning and end of each chunk_wav?


snakers4 (Maintainer) · 2024-10-23T07:10:51Z

silero-vad/src/silero_vad/utils_vad.py

Lines 187 to 251 in e531cd3

    def get_speech_timestamps(audio: torch.Tensor,
                              model,
                              threshold: float = 0.5,
                              sampling_rate: int = 16000,
                              min_speech_duration_ms: int = 250,
                              max_speech_duration_s: float = float('inf'),
                              min_silence_duration_ms: int = 100,
                              speech_pad_ms: int = 30,
                              return_seconds: bool = False,
                              visualize_probs: bool = False,
                              progress_tracking_callback: Callable[[float], None] = None,
                              neg_threshold: float = None,
                              window_size_samples: int = 512,):

        """
        This method is used for splitting long audios into speech chunks using silero VAD

        Parameters
        ----------
        audio: torch.Tensor, one dimensional
            One dimensional float torch.Tensor, other types are casted to torch if possible

        model: preloaded .jit/.onnx silero VAD model

        threshold: float (default - 0.5)
            Speech threshold. Silero VAD outputs speech probabilities for each audio chunk, probabilities ABOVE this value are considered as SPEECH.
            It is better to tune this parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets.

        sampling_rate: int (default - 16000)
            Currently silero VAD models support 8000 and 16000 (or multiply of 16000) sample rates

        min_speech_duration_ms: int (default - 250 milliseconds)
            Final speech chunks shorter min_speech_duration_ms are thrown out

        max_speech_duration_s: int (default -  inf)
            Maximum duration of speech chunks in seconds
            Chunks longer than max_speech_duration_s will be split at the timestamp of the last silence that lasts more than 100ms (if any), to prevent agressive cutting.
            Otherwise, they will be split aggressively just before max_speech_duration_s.

        min_silence_duration_ms: int (default - 100 milliseconds)
            In the end of each speech chunk wait for min_silence_duration_ms before separating it

        speech_pad_ms: int (default - 30 milliseconds)
            Final speech chunks are padded by speech_pad_ms each side

        return_seconds: bool (default - False)
            whether return timestamps in seconds (default - samples)

        visualize_probs: bool (default - False)
            whether draw prob hist or not

        progress_tracking_callback: Callable[[float], None] (default - None)
            callback function taking progress in percents as an argument

        neg_threshold: float (default = threshold - 0.15)
            Negative threshold (noise or exit threshold). If model's current state is SPEECH, values BELOW this value are considered as NON-SPEECH.

        window_size_samples: int (default - 512 samples)
            !!! DEPRECATED, DOES NOTHING !!!

        Returns
        ----------
        speeches: list of dicts
            list containing ends and beginnings of speech chunks (samples or seconds based on return_seconds)
        """

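Try adjusting min_silence_duration_ms and speech_pad_ms. As the docstring above says, speech_pad_ms pads each final speech chunk on both sides, so raising it keeps more silence at the beginning and end of every chunk.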
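The most direct route is probably to pass a larger `speech_pad_ms` (say 200) to `get_speech_timestamps`. If you already have timestamps in samples and want to widen them afterwards, here is a minimal sketch of the idea. Note that `pad_speech_timestamps` is a hypothetical helper written for this reply, not part of silero-vad, and the way it splits short gaps between neighbouring chunks is an assumption about reasonable behaviour, not a copy of the library's internals.

```python
from typing import Dict, List


def pad_speech_timestamps(
    timestamps: List[Dict[str, int]],
    pad_ms: int,
    total_samples: int,
    sampling_rate: int = 16000,
) -> List[Dict[str, int]]:
    """Widen each speech chunk by pad_ms on both sides (hypothetical helper).

    `timestamps` is a list of {'start': ..., 'end': ...} dicts in samples,
    sorted by time and non-overlapping. Padding is clamped to the audio
    bounds; when two neighbours are closer than 2 * pad, the gap between
    them is split in half so the padded chunks never overlap.
    """
    pad = sampling_rate * pad_ms // 1000  # padding expressed in samples
    padded = [dict(ts) for ts in timestamps]  # copy, leave the input untouched

    for i, ts in enumerate(timestamps):
        if i == 0:
            # First chunk: pad to the left, but not before the audio start.
            padded[i]["start"] = max(0, ts["start"] - pad)

        if i + 1 < len(timestamps):
            # Silence between this chunk and the next one.
            gap = timestamps[i + 1]["start"] - ts["end"]
            step = pad if gap >= 2 * pad else gap // 2
            padded[i]["end"] = ts["end"] + step
            padded[i + 1]["start"] = timestamps[i + 1]["start"] - step
        else:
            # Last chunk: pad to the right, but not past the audio end.
            padded[i]["end"] = min(total_samples, ts["end"] + pad)

    return padded


# Made-up timestamps for a 2-second clip at 16 kHz (32000 samples).
chunks = [
    {"start": 1600, "end": 8000},
    {"start": 9600, "end": 16000},
    {"start": 30000, "end": 31000},
]
print(pad_speech_timestamps(chunks, pad_ms=100, total_samples=32000))
```

Each padded chunk can then be cut out with `wav[ts["start"]:ts["end"]]`, which gives chunk_wav segments that carry a bit of silence on both ends.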
