# Audio Processing Guide This document covers the `psifx` commands for audio pre-processing, inference, and visualization. ## 1. Audio Pre-processing ### Audio Extraction Extracts audio from a video file. ```bash psifx audio manipulation extraction \ --video Videos/Video.mp4 \ --audio Audios/Audio.wav ``` - `--video`: Path to the input video file. - `--audio`: Path for the output audio file. ### Audio Conversion Converts any audio track to a mono audio track at 16kHz sample rate. ```bash psifx audio manipulation conversion \ --audio Audios/Audio.wav \ --mono_audio Audios/MonoAudio.wav ``` - `--audio`: Path to the input audio file. - `--mono_audio`: Path to the output audio file. ### Audio Split Split a stereo audio track into two mono audio tracks. ```bash psifx audio manipulation split \ --stereo_audio Audios/Audio.wav \ --left_audio Audios/MonoLeft.wav \ --right_audio Audios/MonoRight.wav ``` - `--stereo-audio`: Path to the input stereo audio file. - `--left_audio`: Path to the output left channel mono audio file. - `--right_audio`: Path to the output right channel mono audio file. ### Audio Mixing Combines multiple mono audio files into one. ```bash psifx audio manipulation mixdown \ --mono_audios Audios/MonoRight.wav Audios/MonoLeft.wav \ --mixed_audio Audios/Mixed.wav ``` - `--mono_audios`: Paths of input mono audio files to combine. - `--mixed_audio`: Path for the output mixed audio file. ### Audio Normalization Normalizes the volume level of an audio file. ```bash psifx audio manipulation normalization \ --audio Audios/Mixed.wav \ --normalized_audio Audios/MixedNormalized.wav ``` - `--audio`: Path to the input audio file. - `--normalized_audio`: Path for the output normalized audio file. ## 2. Inference ### Speaker Diarization Identifies segments for each speaker in an audio file. ```bash psifx audio diarization pyannote inference \ --audio Audios/MixedNormalized.wav \ --diarization Diarizations/Mixed.rttm \ [--num_speakers 2] \ [--device cuda] \ [--api_token hf_SomeLetters] \ [--model_name pyannote/speaker-diarization@2.1.1] ``` - `--audio`: Input audio file for diarization. - `--diarization`: Path to save diarization results in `.rttm` format. - `--num_speakers`: Number of speakers to identify. - `--device`: Processing device (`cuda` for GPU, `cpu` for CPU), default `cpu`. - `--api_token`: Hugging Face token, may be required to download the model. Can also be provided as the environment variable **HF_TOKEN**. - `--model_name`: Name of the diarization model used, default `pyannote/speaker-diarization@2.1.1`. ### Speaker Identification Associates speakers in a mixed audio file with known audio samples. ```bash psifx audio identification pyannote inference \ --mixed_audio Audios/MixedNormalized.wav \ --diarization Diarizations/Mixed.rttm \ --mono_audios Audios/MonoRight.wav Audios/MonoLeft.wav \ --identification Identifications/Mixed.json \ [--device cuda] \ [--api_token hf_SomeLetters] \ [--model_names pyannote/embedding speechbrain/spkrec-ecapa-voxceleb] ``` - `--mixed_audio`: Mixed mono audio file with multiple speakers. - `--diarization`: Diarization results for speaker segmentation. - `--mono_audios`: Paths of known mono audio samples for each speaker. - `--identification`: Path for saving identification results in `.json` format. - `--device`: Processing device (`cuda` for GPU, `cpu` for CPU), default `cpu`. - `--api_token`: Hugging Face token, may be required to download the model. Can also be provided as the environment variable **HF_TOKEN**. - `--model_names`: Names of the embedding models. ### Speech Transcription Transcribes speech in an audio file to text. ```bash psifx audio transcription whisper inference \ --audio Audios/MixedNormalized.wav \ --transcription Transcriptions/Mixed.vtt \ [--model_name small] \ [--language fr] \ [--device cuda] \ [--translate_to_english] ``` - `--audio`: Input audio file for transcription. - `--transcription`: Path to save the transcription in `.vtt` format. - `--model_name`: Model name (e.g., `large` for higher accuracy) check [Whisper](https://github.com/openai/whisper#available-models-and-languages), default `small`. - `--language`: Language of the audio content. - `--device`: Processing device (`cuda` for GPU, `cpu` for CPU), default `cpu`. - `--translate_to_english`: Whether to transcribe the audio in its original language or to translate it to english, default `False`. ### Enhanced Transcription Enhances transcription with diarization and speaker labels. ```bash psifx audio transcription whisper enhance \ --transcription Transcriptions/Mixed.vtt \ --diarization Diarizations/Mixed.rttm \ --identification Identifications/Mixed.json \ --enhanced_transcription Transcriptions/MixedEnhanced.vtt ``` - `--transcription`: Path to the initial transcription file. - `--diarization`: Diarization data file. - `--identification`: Identification data file. - `--enhanced_transcription`: Path to save the enhanced transcription. ## 3. Visualization ### Diarization Visualization Generates a visual timeline of speaker segments. ```bash psifx audio diarization visualization \ --diarization Diarizations/Mixed.rttm \ --visualization Visualizations/Mixed.png ``` - `--diarization`: Diarization data file. - `--visualization`: Path for saving the visualization image.