Audio Processing Guide

This document covers the psifx commands for audio pre-processing, inference, and visualization.

1. Audio Pre-processing

Audio Extraction

Extracts audio from a video file.

psifx audio manipulation extraction \
    --video Videos/Video.mp4 \
    --audio Audios/Audio.wav
  • --video: Path to the input video file.

  • --audio: Path for the output audio file.
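
For a whole folder of recordings, the same command can be run in a loop. The sketch below is a minimal example that assumes the videos are .mp4 files in Videos/ and writes one .wav file per video into Audios/; adjust the paths and extension to your setup.

mkdir -p Audios
for video in Videos/*.mp4; do
    name=$(basename "$video" .mp4)
    psifx audio manipulation extraction \
        --video "$video" \
        --audio "Audios/$name.wav"
done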

Audio Conversion

Converts any audio track to a mono audio track at a 16 kHz sample rate.

psifx audio manipulation conversion \
    --audio Audios/Audio.wav \
    --mono_audio Audios/MonoAudio.wav
  • --audio: Path to the input audio file.

  • --mono_audio: Path to the output audio file.

Audio Split

Splits a stereo audio track into two mono audio tracks.

psifx audio manipulation split \
    --stereo_audio Audios/Audio.wav \
    --left_audio Audios/MonoLeft.wav \
    --right_audio Audios/MonoRight.wav
  • --stereo_audio: Path to the input stereo audio file.

  • --left_audio: Path to the output left channel mono audio file.

  • --right_audio: Path to the output right channel mono audio file.

Audio Mixing

Combines multiple mono audio files into one.

psifx audio manipulation mixdown \
    --mono_audios Audios/MonoRight.wav Audios/MonoLeft.wav \
    --mixed_audio Audios/Mixed.wav
  • --mono_audios: Paths of input mono audio files to combine.

  • --mixed_audio: Path for the output mixed audio file.

Audio Normalization

Normalizes the volume level of an audio file.

psifx audio manipulation normalization \
    --audio Audios/Mixed.wav \
    --normalized_audio Audios/MixedNormalized.wav
  • --audio: Path to the input audio file.

  • --normalized_audio: Path for the output normalized audio file.
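
Putting the pre-processing steps together: the sketch below follows the file names used in this guide and assumes a stereo recording in which each channel captures one speaker. It extracts the audio, splits it into per-speaker mono files, mixes the channels back into a single track, and normalizes the result for inference.

# Extract the stereo track from the video.
psifx audio manipulation extraction \
    --video Videos/Video.mp4 \
    --audio Audios/Audio.wav

# Split it into one mono file per speaker.
psifx audio manipulation split \
    --stereo_audio Audios/Audio.wav \
    --left_audio Audios/MonoLeft.wav \
    --right_audio Audios/MonoRight.wav

# Mix the channels back into a single mono file, then normalize it.
psifx audio manipulation mixdown \
    --mono_audios Audios/MonoRight.wav Audios/MonoLeft.wav \
    --mixed_audio Audios/Mixed.wav
psifx audio manipulation normalization \
    --audio Audios/Mixed.wav \
    --normalized_audio Audios/MixedNormalized.wav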

2. Inference

Speaker Diarization

Detects who speaks when in an audio file, producing time-stamped segments for each speaker.

psifx audio diarization pyannote inference \
    --audio Audios/MixedNormalized.wav \
    --diarization Diarizations/Mixed.rttm \
    [--num_speakers 2] \
    [--device cuda] \
    [--api_token hf_SomeLetters] \
    [--model_name pyannote/speaker-diarization@2.1.1]
  • --audio: Input audio file for diarization.

  • --diarization: Path to save diarization results in .rttm format.

  • --num_speakers: Number of speakers to identify.

  • --device: Processing device (cuda for GPU, cpu for CPU), default cpu.

  • --api_token: Hugging Face access token, which may be required to download the model. It can also be provided via the HF_TOKEN environment variable (see the example below).

  • --model_name: Name of the diarization model to use, default pyannote/speaker-diarization@2.1.1.
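
As an alternative to passing --api_token on every call, the token can be exported once as HF_TOKEN. The sketch below assumes a two-speaker recording and a GPU; the final awk line lists the detected turns and assumes the standard RTTM column layout (start time in column 4, duration in column 5, speaker label in column 8).

export HF_TOKEN=hf_SomeLetters

psifx audio diarization pyannote inference \
    --audio Audios/MixedNormalized.wav \
    --diarization Diarizations/Mixed.rttm \
    --num_speakers 2 \
    --device cuda

# Print "speaker start duration" for each detected turn.
awk '{print $8, $4, $5}' Diarizations/Mixed.rttm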

Speaker Identification

Associates speakers in a mixed audio file with known audio samples.

psifx audio identification pyannote inference \
    --mixed_audio Audios/MixedNormalized.wav \
    --diarization Diarizations/Mixed.rttm \
    --mono_audios Audios/MonoRight.wav Audios/MonoLeft.wav \
    --identification Identifications/Mixed.json \
    [--device cuda] \
    [--api_token hf_SomeLetters] \
    [--model_names pyannote/embedding speechbrain/spkrec-ecapa-voxceleb]
  • --mixed_audio: Mixed mono audio file with multiple speakers.

  • --diarization: Diarization results for speaker segmentation.

  • --mono_audios: Paths of known mono audio samples for each speaker.

  • --identification: Path for saving identification results in .json format.

  • --device: Processing device (cuda for GPU, cpu for CPU), default cpu.

  • --api_token: Hugging Face access token, which may be required to download the models. It can also be provided via the HF_TOKEN environment variable.

  • --model_names: Names of the speaker embedding models to use.
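
The identification results are plain JSON, so they can be inspected with any JSON tool, for example the standard-library pretty-printer below. The exact structure of the file is defined by psifx and is not assumed here.

python -m json.tool Identifications/Mixed.json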

Speech Transcription

Transcribes speech in an audio file to text.

psifx audio transcription whisper inference \
    --audio Audios/MixedNormalized.wav \
    --transcription Transcriptions/Mixed.vtt \
    [--model_name small] \
    [--language fr] \
    [--device cuda] \
    [--translate_to_english] 
  • --audio: Input audio file for transcription.

  • --transcription: Path to save the transcription in .vtt format.

  • --model_name: Whisper model name, default small; larger models such as large are more accurate but slower. See the Whisper documentation for the available models.

  • --language: Language of the audio content (e.g., fr for French).

  • --device: Processing device (cuda for GPU, cpu for CPU), default cpu.

  • --translate_to_english: If set, translates the output to English instead of transcribing in the original language. Default False.
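
As an illustrative variant combining only the options documented above, the sketch below transcribes a French recording with the large model and translates the result to English; the output file name MixedEnglish.vtt is just an example.

psifx audio transcription whisper inference \
    --audio Audios/MixedNormalized.wav \
    --transcription Transcriptions/MixedEnglish.vtt \
    --model_name large \
    --language fr \
    --device cuda \
    --translate_to_english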

Enhanced Transcription

Enhances a transcription by adding speaker labels derived from the diarization and identification results.

psifx audio transcription whisper enhance \
    --transcription Transcriptions/Mixed.vtt \
    --diarization Diarizations/Mixed.rttm \
    --identification Identifications/Mixed.json \
    --enhanced_transcription Transcriptions/MixedEnhanced.vtt
  • --transcription: Path to the initial transcription file.

  • --diarization: Diarization data file.

  • --identification: Identification data file.

  • --enhanced_transcription: Path to save the enhanced transcription.
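
The full inference chain for the two-speaker example then reads as follows; it reuses the pre-processed files from section 1 and assumes a GPU and an exported HF_TOKEN.

psifx audio diarization pyannote inference \
    --audio Audios/MixedNormalized.wav \
    --diarization Diarizations/Mixed.rttm \
    --num_speakers 2 \
    --device cuda

psifx audio identification pyannote inference \
    --mixed_audio Audios/MixedNormalized.wav \
    --diarization Diarizations/Mixed.rttm \
    --mono_audios Audios/MonoRight.wav Audios/MonoLeft.wav \
    --identification Identifications/Mixed.json \
    --device cuda

psifx audio transcription whisper inference \
    --audio Audios/MixedNormalized.wav \
    --transcription Transcriptions/Mixed.vtt \
    --language fr \
    --device cuda

psifx audio transcription whisper enhance \
    --transcription Transcriptions/Mixed.vtt \
    --diarization Diarizations/Mixed.rttm \
    --identification Identifications/Mixed.json \
    --enhanced_transcription Transcriptions/MixedEnhanced.vtt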

3. Visualization

Diarization Visualization

Generates a visual timeline of speaker segments.

psifx audio diarization visualization \
    --diarization Diarizations/Mixed.rttm \
    --visualization Visualizations/Mixed.png
  • --diarization: Diarization data file.

  • --visualization: Path for saving the visualization image.
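
To render a timeline for every diarization at once, a simple loop over the .rttm files is enough; the sketch below assumes the output images should mirror the input file names.

mkdir -p Visualizations
for rttm in Diarizations/*.rttm; do
    name=$(basename "$rttm" .rttm)
    psifx audio diarization visualization \
        --diarization "$rttm" \
        --visualization "Visualizations/$name.png"
done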