Audio Processing Guide¶
This document covers the psifx
commands for audio pre-processing, inference, and visualization.
1. Audio Pre-processing¶
Audio Extraction¶
Extracts audio from a video file.
psifx audio manipulation extraction \
--video Videos/Video.mp4 \
--audio Audios/Audio.wav
--video
: Path to the input video file.--audio
: Path for the output audio file.
Audio Conversion¶
Converts any audio track to a mono audio track at 16kHz sample rate.
psifx audio manipulation conversion \
--audio Audios/Audio.wav \
--mono_audio Audios/MonoAudio.wav
--audio
: Path to the input audio file.--mono_audio
: Path to the output audio file.
Audio Split¶
Split a stereo audio track into two mono audio tracks.
psifx audio manipulation split \
--stereo_audio Audios/Audio.wav \
--left_audio Audios/MonoLeft.wav \
--right_audio Audios/MonoRight.wav
--stereo-audio
: Path to the input stereo audio file.--left_audio
: Path to the output left channel mono audio file.--right_audio
: Path to the output right channel mono audio file.
Audio Mixing¶
Combines multiple mono audio files into one.
psifx audio manipulation mixdown \
--mono_audios Audios/MonoRight.wav Audios/MonoLeft.wav \
--mixed_audio Audios/Mixed.wav
--mono_audios
: Paths of input mono audio files to combine.--mixed_audio
: Path for the output mixed audio file.
Audio Normalization¶
Normalizes the volume level of an audio file.
psifx audio manipulation normalization \
--audio Audios/Mixed.wav \
--normalized_audio Audios/MixedNormalized.wav
--audio
: Path to the input audio file.--normalized_audio
: Path for the output normalized audio file.
2. Inference¶
Speaker Diarization¶
Identifies segments for each speaker in an audio file.
psifx audio diarization pyannote inference \
--audio Audios/MixedNormalized.wav \
--diarization Diarizations/Mixed.rttm \
[--num_speakers 2] \
[--device cuda] \
[--api_token hf_SomeLetters] \
[--model_name pyannote/speaker-diarization@2.1.1]
--audio
: Input audio file for diarization.--diarization
: Path to save diarization results in.rttm
format.--num_speakers
: Number of speakers to identify.--device
: Processing device (cuda
for GPU,cpu
for CPU), defaultcpu
.--api_token
: Hugging Face token, may be required to download the model. Can also be provided as the environment variable HF_TOKEN.--model_name
: Name of the diarization model used, defaultpyannote/speaker-diarization@2.1.1
.
Speaker Identification¶
Associates speakers in a mixed audio file with known audio samples.
psifx audio identification pyannote inference \
--mixed_audio Audios/MixedNormalized.wav \
--diarization Diarizations/Mixed.rttm \
--mono_audios Audios/MonoRight.wav Audios/MonoLeft.wav \
--identification Identifications/Mixed.json \
[--device cuda] \
[--api_token hf_SomeLetters] \
[--model_names pyannote/embedding speechbrain/spkrec-ecapa-voxceleb]
--mixed_audio
: Mixed mono audio file with multiple speakers.--diarization
: Diarization results for speaker segmentation.--mono_audios
: Paths of known mono audio samples for each speaker.--identification
: Path for saving identification results in.json
format.--device
: Processing device (cuda
for GPU,cpu
for CPU), defaultcpu
.--api_token
: Hugging Face token, may be required to download the model. Can also be provided as the environment variable HF_TOKEN.--model_names
: Names of the embedding models.
Speech Transcription¶
Transcribes speech in an audio file to text.
psifx audio transcription whisper inference \
--audio Audios/MixedNormalized.wav \
--transcription Transcriptions/Mixed.vtt \
[--model_name small] \
[--language fr] \
[--device cuda] \
[--translate_to_english]
--audio
: Input audio file for transcription.--transcription
: Path to save the transcription in.vtt
format.--model_name
: Model name (e.g.,large
for higher accuracy) check Whisper, defaultsmall
.--language
: Language of the audio content.--device
: Processing device (cuda
for GPU,cpu
for CPU), defaultcpu
.--translate_to_english
: Whether to transcribe the audio in its original language or to translate it to english, defaultFalse
.
Enhanced Transcription¶
Enhances transcription with diarization and speaker labels.
psifx audio transcription whisper enhance \
--transcription Transcriptions/Mixed.vtt \
--diarization Diarizations/Mixed.rttm \
--identification Identifications/Mixed.json \
--enhanced_transcription Transcriptions/MixedEnhanced.vtt
--transcription
: Path to the initial transcription file.--diarization
: Diarization data file.--identification
: Identification data file.--enhanced_transcription
: Path to save the enhanced transcription.
3. Visualization¶
Diarization Visualization¶
Generates a visual timeline of speaker segments.
psifx audio diarization visualization \
--diarization Diarizations/Mixed.rttm \
--visualization Visualizations/Mixed.png
--diarization
: Diarization data file.--visualization
: Path for saving the visualization image.