{"id":41937,"date":"2025-10-03T06:48:58","date_gmt":"2025-10-03T06:48:58","guid":{"rendered":"https:\/\/youzum.net\/how-to-build-an-advanced-voice-ai-pipeline-with-whisperx-for-transcription-alignment-analysis-and-export\/"},"modified":"2025-10-03T06:48:58","modified_gmt":"2025-10-03T06:48:58","slug":"how-to-build-an-advanced-voice-ai-pipeline-with-whisperx-for-transcription-alignment-analysis-and-export","status":"publish","type":"post","link":"https:\/\/youzum.net\/it\/how-to-build-an-advanced-voice-ai-pipeline-with-whisperx-for-transcription-alignment-analysis-and-export\/","title":{"rendered":"How to Build an Advanced Voice AI Pipeline with WhisperX for Transcription, Alignment, Analysis, and Export?"},"content":{"rendered":"<p>In this tutorial, we walk through an advanced implementation of <a href=\"https:\/\/github.com\/m-bain\/whisperX\"><strong>WhisperX<\/strong><\/a>, where we explore transcription, alignment, and word-level timestamps in detail. We set up the environment, load and preprocess the audio, and then run the full pipeline, from transcription to alignment and analysis, while ensuring memory efficiency and supporting batch processing. Along the way, we also visualize results, export them in multiple formats, and even extract keywords to gain deeper insights from the audio content. 
Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Voice%20AI\/voice_ai_whisperx_advanced_tutorial_Marktechpost.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a><\/strong>.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-python\">!pip install -q git+https:\/\/github.com\/m-bain\/whisperX.git\n!pip install -q pandas matplotlib seaborn\n\n\nimport whisperx\nimport torch\nimport gc\nimport os\nimport json\nimport pandas as pd\nfrom pathlib import Path\nfrom IPython.display import Audio, display, HTML\nimport warnings\nwarnings.filterwarnings('ignore')\n\n\nCONFIG = {\n   \"device\": \"cuda\" if torch.cuda.is_available() else \"cpu\",\n   \"compute_type\": \"float16\" if torch.cuda.is_available() else \"int8\",\n   \"batch_size\": 16,\n   \"model_size\": \"base\",\n   \"language\": None,\n}\n\n\nprint(f\"\ud83d\ude80 Running on: {CONFIG['device']}\")\nprint(f\"\ud83d\udcca Compute type: {CONFIG['compute_type']}\")\nprint(f\"\ud83c\udfaf Model: {CONFIG['model_size']}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We begin by installing WhisperX along with the essential libraries and then configure our setup. We detect whether CUDA is available, select the compute type, and set parameters such as batch size, model size, and language to prepare for transcription. Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Voice%20AI\/voice_ai_whisperx_advanced_tutorial_Marktechpost.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a><\/strong>.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-python\">def download_sample_audio():\n   \"\"\"Download a sample audio file for testing\"\"\"\n   !wget -q -O sample.mp3 https:\/\/github.com\/mozilla-extensions\/speaktome\/raw\/master\/content\/cv-valid-dev\/sample-000000.mp3\n   print(\"\u2705 Sample audio downloaded\")\n   return \"sample.mp3\"\n\n\ndef load_and_analyze_audio(audio_path):\n   \"\"\"Load audio and display basic info\"\"\"\n   audio = whisperx.load_audio(audio_path)\n   duration = len(audio) \/ 16000  # WhisperX loads audio at 16 kHz\n   print(f\"\ud83d\udcc1 
Audio: {Path(audio_path).name}\")\n   print(f\"\u23f1 Duration: {duration:.2f} seconds\")\n   print(f\"\ud83c\udfb5 Sample rate: 16000 Hz\")\n   display(Audio(audio_path))\n   return audio, duration\n\n\ndef transcribe_audio(audio, model_size=CONFIG[\"model_size\"], language=None):\n   \"\"\"Transcribe audio using WhisperX (batched inference)\"\"\"\n   print(\"\\n\ud83c\udfa4 STEP 1: Transcribing audio...\")\n  \n   model = whisperx.load_model(\n       model_size,\n       CONFIG[\"device\"],\n       compute_type=CONFIG[\"compute_type\"]\n   )\n  \n   transcribe_kwargs = {\n       \"batch_size\": CONFIG[\"batch_size\"]\n   }\n   if language:\n       transcribe_kwargs[\"language\"] = language\n  \n   result = model.transcribe(audio, **transcribe_kwargs)\n  \n   total_segments = len(result[\"segments\"])\n  \n   del model\n   gc.collect()\n   if CONFIG[\"device\"] == \"cuda\":\n       torch.cuda.empty_cache()\n  \n   print(f\"\u2705 Transcription complete!\")\n   print(f\"   Language: {result['language']}\")\n   print(f\"   Segments: {total_segments}\")\n   print(f\"   Total text length: {sum(len(seg['text']) for seg in result['segments'])} characters\")\n  \n   return result<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We download a sample audio file, load it for analysis, and then transcribe it using WhisperX. 
We set up batched inference with our chosen model size and configuration, and we output key details such as language, number of segments, and total text length. Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Voice%20AI\/voice_ai_whisperx_advanced_tutorial_Marktechpost.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a><\/strong>.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-python\">def align_transcription(segments, audio, language_code):\n   \"\"\"Align transcription for accurate word-level timestamps\"\"\"\n   print(\"\\n\ud83c\udfaf STEP 2: Aligning for word-level timestamps...\")\n  \n   try:\n       model_a, metadata = whisperx.load_align_model(\n           language_code=language_code,\n           device=CONFIG[\"device\"]\n       )\n      \n       result = whisperx.align(\n           segments,\n           model_a,\n           metadata,\n           audio,\n           CONFIG[\"device\"],\n           return_char_alignments=False\n       )\n      \n       total_words = sum(len(seg.get(\"words\", [])) for seg in result[\"segments\"])\n      \n       del model_a\n       gc.collect()\n       if CONFIG[\"device\"] == \"cuda\":\n           torch.cuda.empty_cache()\n      \n      
 print(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/16.0.1\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> Alignment complete!\")\n       print(f\"   Aligned words: {total_words}\")\n      \n       return result\n   except Exception as e:\n       print(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/16.0.1\/72x72\/26a0.png\" alt=\"\u26a0\" class=\"wp-smiley\" \/>  Alignment failed: {str(e)}\")\n       print(\"   Continuing with segment-level timestamps only...\")\n       return {\"segments\": segments, \"word_segments\": []}<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We align the transcription to generate precise word-level timestamps. By loading the alignment model and applying it to the audio, we refine timing accuracy, and then report the total aligned words while ensuring memory is cleared for efficient processing. Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Voice%20AI\/voice_ai_whisperx_advanced_tutorial_Marktechpost.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a><\/strong>.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">def analyze_transcription(result):\n   \"\"\"Generate statistics about the transcription\"\"\"\n   print(\"n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/16.0.1\/72x72\/1f4ca.png\" 
alt=\"\ud83d\udcca\" class=\"wp-smiley\" \/> TRANSCRIPTION STATISTICS\")\n   print(\"=\"*70)\n  \n   segments = result[\"segments\"]\n  \n   total_duration = max(seg[\"end\"] for seg in segments) if segments else 0\n   total_words = sum(len(seg.get(\"words\", [])) for seg in segments)\n   total_chars = sum(len(seg[\"text\"].strip()) for seg in segments)\n  \n   print(f\"Total duration: {total_duration:.2f} seconds\")\n   print(f\"Total segments: {len(segments)}\")\n   print(f\"Total words: {total_words}\")\n   print(f\"Total characters: {total_chars}\")\n  \n   if total_duration &gt; 0:\n       print(f\"Words per minute: {(total_words \/ total_duration * 60):.1f}\")\n  \n   pauses = []\n   for i in range(len(segments) - 1):\n       pause = segments[i+1][\"start\"] - segments[i][\"end\"]\n       if pause &gt; 0:\n           pauses.append(pause)\n  \n   if pauses:\n       print(f\"Average pause between segments: {sum(pauses)\/len(pauses):.2f}s\")\n       print(f\"Longest pause: {max(pauses):.2f}s\")\n  \n   word_durations = []\n   for seg in segments:\n       if \"words\" in seg:\n           for word in seg[\"words\"]:\n               duration = word[\"end\"] - word[\"start\"]\n               word_durations.append(duration)\n  \n   if word_durations:\n       print(f\"Average word duration: {sum(word_durations)\/len(word_durations):.3f}s\")\n  \n   print(\"=\"*70)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We analyze the transcription by generating detailed statistics such as total duration, segment count, word count, and character count. We also calculate words per minute, pauses between segments, and average word duration to better understand the pacing and flow of the audio. 
Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Voice%20AI\/voice_ai_whisperx_advanced_tutorial_Marktechpost.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a><\/strong>.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-python\">def display_results(result, show_words=False, max_rows=50):\n   \"\"\"Display transcription results in a formatted table\"\"\"\n   data = []\n  \n   for seg in result[\"segments\"]:\n       text = seg[\"text\"].strip()\n       start = f\"{seg['start']:.2f}s\"\n       end = f\"{seg['end']:.2f}s\"\n       duration = f\"{seg['end'] - seg['start']:.2f}s\"\n      \n       if show_words and \"words\" in seg:\n           for word in seg[\"words\"]:\n               data.append({\n                   \"Start\": f\"{word['start']:.2f}s\",\n                   \"End\": f\"{word['end']:.2f}s\",\n                   \"Duration\": f\"{word['end'] - word['start']:.3f}s\",\n                   \"Text\": word[\"word\"],\n                   \"Score\": f\"{word.get('score', 0):.2f}\"\n               })\n       else:\n           data.append({\n               \"Start\": start,\n               \"End\": end,\n               \"Duration\": duration,\n               \"Text\": text\n           })\n  \n   df = pd.DataFrame(data)\n  \n   if len(df) &gt; max_rows:\n       print(f\"Showing first {max_rows} rows of {len(df)} total...\")\n       display(HTML(df.head(max_rows).to_html(index=False)))\n   else:\n       display(HTML(df.to_html(index=False)))\n  \n   return df\n\n\ndef export_results(result, output_dir=\"output\", filename=\"transcript\"):\n   \"\"\"Export results in multiple formats\"\"\"\n   os.makedirs(output_dir, exist_ok=True)\n  \n   json_path = f\"{output_dir}\/{filename}.json\"\n   with open(json_path, \"w\", encoding=\"utf-8\") as f:\n       json.dump(result, f, indent=2, ensure_ascii=False)\n  \n   srt_path = f\"{output_dir}\/{filename}.srt\"\n   with open(srt_path, \"w\", encoding=\"utf-8\") as f:\n       for i, seg in enumerate(result[\"segments\"], 1):\n           start = format_timestamp(seg[\"start\"])\n           end = format_timestamp(seg[\"end\"])\n           f.write(f\"{i}\\n{start} --&gt; {end}\\n{seg['text'].strip()}\\n\\n\")\n  \n   vtt_path = f\"{output_dir}\/{filename}.vtt\"\n   with open(vtt_path, \"w\", encoding=\"utf-8\") as f:\n       f.write(\"WEBVTT\\n\\n\")\n       for i, seg in enumerate(result[\"segments\"], 1):\n           start = format_timestamp_vtt(seg[\"start\"])\n           end = format_timestamp_vtt(seg[\"end\"])\n           f.write(f\"{start} --&gt; {end}\\n{seg['text'].strip()}\\n\\n\")\n  \n   txt_path = f\"{output_dir}\/{filename}.txt\"\n   with open(txt_path, \"w\", encoding=\"utf-8\") as f:\n       for seg in result[\"segments\"]:\n           f.write(f\"{seg['text'].strip()}\\n\")\n  \n   csv_path = f\"{output_dir}\/{filename}.csv\"\n   df_data = []\n   for seg in result[\"segments\"]:\n       df_data.append({\n           \"start\": seg[\"start\"],\n           \"end\": seg[\"end\"],\n           \"text\": seg[\"text\"].strip()\n       })\n   pd.DataFrame(df_data).to_csv(csv_path, index=False)\n  \n   print(f\"\\n\ud83d\udcbe Results exported to '{output_dir}\/' directory:\")\n   print(f\"   \u2713 {filename}.json (full structured data)\")\n   print(f\"   \u2713 {filename}.srt (subtitles)\")\n   print(f\"   \u2713 {filename}.vtt (web video subtitles)\")\n   print(f\"   \u2713 {filename}.txt (plain text)\")\n   print(f\"   \u2713 {filename}.csv (timestamps + text)\")\n\n\ndef format_timestamp(seconds):\n   \"\"\"Convert seconds to SRT timestamp format (HH:MM:SS,mmm)\"\"\"\n   hours = int(seconds \/\/ 3600)\n   minutes = int((seconds % 3600) \/\/ 60)\n   secs = int(seconds % 60)\n   millis = int((seconds % 1) * 1000)\n   return f\"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}\"\n\n\ndef format_timestamp_vtt(seconds):\n   \"\"\"Convert seconds to VTT timestamp format (HH:MM:SS.mmm)\"\"\"\n   hours = int(seconds \/\/ 3600)\n   minutes = int((seconds % 3600) \/\/ 60)\n   secs = int(seconds % 60)\n   millis = int((seconds % 1) * 1000)\n   return f\"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}\"\n\n\ndef batch_process_files(audio_files, output_dir=\"batch_output\"):\n   \"\"\"Process multiple audio files in batch\"\"\"\n   print(f\"\\n\ud83d\udce6 Batch processing {len(audio_files)} files...\")\n   results = {}\n  \n   for i, audio_path in enumerate(audio_files, 1):\n       print(f\"\\n[{i}\/{len(audio_files)}] Processing: {Path(audio_path).name}\")\n       try:\n           result, _ = process_audio_file(audio_path, show_output=False)\n           results[audio_path] = result\n          \n           filename = Path(audio_path).stem\n           export_results(result, output_dir, filename)\n       except Exception as e:\n           print(f\"\u274c Error processing {audio_path}: {str(e)}\")\n           results[audio_path] = None\n  \n   print(f\"\\n\u2705 Batch processing complete! Processed {len(results)} files.\")\n   return results\n\n\ndef extract_keywords(result, top_n=10):\n   \"\"\"Extract the most common words from the transcription\"\"\"\n   from collections import Counter\n   import re\n  \n   text = \" \".join(seg[\"text\"] for seg in result[\"segments\"])\n  \n   words = re.findall(r'\\b\\w+\\b', text.lower())\n  \n   stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',\n                 'of', 'with', 'is', 'was', 'are', 'were', 'be', 'been', 'being',\n                 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could',\n                 'should', 'may', 'might', 'must', 'can', 'this', 'that', 'these', 'those'}\n  \n   filtered_words = [w for w in words if w not in stop_words and len(w) &gt; 2]\n  \n   word_counts = Counter(filtered_words).most_common(top_n)\n  \n   print(f\"\\n\ud83d\udd11 Top {top_n} Keywords:\")\n   for word, count in word_counts:\n       print(f\"   {word}: {count}\")\n  \n   return word_counts<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We format results into clean tables, export transcripts to JSON\/SRT\/VTT\/TXT\/CSV formats, and maintain precise timestamps with helper formatters. We also batch-process multiple audio files end-to-end and extract top keywords, enabling us to quickly turn raw transcriptions into analysis-ready artifacts. 
Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Voice%20AI\/voice_ai_whisperx_advanced_tutorial_Marktechpost.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a><\/strong>.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-python\">def process_audio_file(audio_path, show_output=True, analyze=True):\n   \"\"\"Complete WhisperX pipeline: load, transcribe, align, analyze\"\"\"\n   if show_output:\n       print(\"=\"*70)\n       print(\"\ud83c\udfb5 WhisperX Advanced Tutorial\")\n       print(\"=\"*70)\n  \n   audio, duration = load_and_analyze_audio(audio_path)\n  \n   result = transcribe_audio(audio, CONFIG[\"model_size\"], CONFIG[\"language\"])\n  \n   aligned_result = align_transcription(\n       result[\"segments\"],\n       audio,\n       result[\"language\"]\n   )\n  \n   if analyze and show_output:\n       analyze_transcription(aligned_result)\n       extract_keywords(aligned_result)\n  \n   if show_output:\n       print(\"\\n\" + \"=\"*70)\n       print(\"\ud83d\udccb TRANSCRIPTION RESULTS\")\n       print(\"=\"*70)\n       df = display_results(aligned_result, show_words=False)\n      
\n       export_results(aligned_result)\n   else:\n       df = None\n  \n   return aligned_result, df\n\n\n# Example 1: Process the sample audio\n# audio_path = download_sample_audio()\n# result, df = process_audio_file(audio_path)\n\n\n# Example 2: Show word-level details\n# result, df = process_audio_file(audio_path)\n# word_df = display_results(result, show_words=True)\n\n\n# Example 3: Process your own audio\n# audio_path = \"your_audio.wav\"  # or .mp3, .m4a, etc.\n# result, df = process_audio_file(audio_path)\n\n\n# Example 4: Batch process multiple files\n# audio_files = [\"audio1.mp3\", \"audio2.wav\", \"audio3.m4a\"]\n# results = batch_process_files(audio_files)\n\n\n# Example 5: Use a larger model for better accuracy\n# CONFIG[\"model_size\"] = \"large-v2\"\n# result, df = process_audio_file(\"audio.mp3\")\n\n\nprint(\"\\n\u2728 Setup complete! Uncomment the examples above to run.\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We run the full WhisperX pipeline end-to-end, loading the audio, transcribing it, and aligning it for word-level timestamps. When enabled, we analyze statistics, extract keywords, render a clean results table, and export everything to multiple formats, ready for real use.<\/p>\n<p>In conclusion, we built a complete WhisperX pipeline that not only transcribes audio but also aligns it with precise word-level timestamps. We export the results in multiple formats, process files in batches, and analyze patterns to make the output more meaningful. 
With this, we now have a flexible, ready-to-use workflow for transcription and audio analysis on Colab, and we are ready to extend it further into real-world projects.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Voice%20AI\/voice_ai_whisperx_advanced_tutorial_Marktechpost.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a><\/strong>.\u00a0Feel free to check out our\u00a0<strong><mark><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub Page for Tutorials, Codes and Notebooks<\/a><\/mark><\/strong>.\u00a0Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">100k+ ML SubReddit<\/a><\/strong>\u00a0and subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! 
Are you on Telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">Now you can join us on Telegram as well.<\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2025\/10\/02\/how-to-build-an-advanced-voice-ai-pipeline-with-whisperx-for-transcription-alignment-analysis-and-export\/\">How to Build an Advanced Voice AI Pipeline with WhisperX for Transcription, Alignment, Analysis, and Export?<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we walk through an advanced implementation of WhisperX, where we explore transcription, alignment, and word-level timestamps in detail. We set up the environment, load and preprocess the audio, and then run the full pipeline, from transcription to alignment and analysis, while ensuring memory efficiency and supporting batch processing. Along the way, we also visualize results, export them in multiple formats, and even extract keywords to gain deeper insights from the audio content. Check out the\u00a0FULL CODES here. 
Copy CodeCopiedUse a different Browser !pip install -q git+https:\/\/github.com\/m-bain\/whisperX.git !pip install -q pandas matplotlib seaborn import whisperx import torch import gc import os import json import pandas as pd from pathlib import Path from IPython.display import Audio, display, HTML import warnings warnings.filterwarnings(&#8216;ignore&#8217;) CONFIG = { &#8220;device&#8221;: &#8220;cuda&#8221; if torch.cuda.is_available() else &#8220;cpu&#8221;, &#8220;compute_type&#8221;: &#8220;float16&#8221; if torch.cuda.is_available() else &#8220;int8&#8221;, &#8220;batch_size&#8221;: 16, &#8220;model_size&#8221;: &#8220;base&#8221;, &#8220;language&#8221;: None, } print(f&#8221; Running on: {CONFIG[&#8216;device&#8217;]}&#8221;) print(f&#8221; Compute type: {CONFIG[&#8216;compute_type&#8217;]}&#8221;) print(f&#8221; Model: {CONFIG[&#8216;model_size&#8217;]}&#8221;) We begin by installing WhisperX along with essential libraries and then configure our setup. We detect whether CUDA is available, select the compute type, and set parameters such as batch size, model size, and language to prepare for transcription. Check out the\u00a0FULL CODES here. 
Copy CodeCopiedUse a different Browser def download_sample_audio(): &#8220;&#8221;&#8221;Download a sample audio file for testing&#8221;&#8221;&#8221; !wget -q -O sample.mp3 https:\/\/github.com\/mozilla-extensions\/speaktome\/raw\/master\/content\/cv-valid-dev\/sample-000000.mp3 print(&#8221; Sample audio downloaded&#8221;) return &#8220;sample.mp3&#8221; def load_and_analyze_audio(audio_path): &#8220;&#8221;&#8221;Load audio and display basic info&#8221;&#8221;&#8221; audio = whisperx.load_audio(audio_path) duration = len(audio) \/ 16000 print(f&#8221; Audio: {Path(audio_path).name}&#8221;) print(f&#8221; Duration: {duration:.2f} seconds&#8221;) print(f&#8221; Sample rate: 16000 Hz&#8221;) display(Audio(audio_path)) return audio, duration def transcribe_audio(audio, model_size=CONFIG[&#8220;model_size&#8221;], language=None): &#8220;&#8221;&#8221;Transcribe audio using WhisperX (batched inference)&#8221;&#8221;&#8221; print(&#8220;n STEP 1: Transcribing audio&#8230;&#8221;) model = whisperx.load_model( model_size, CONFIG[&#8220;device&#8221;], compute_type=CONFIG[&#8220;compute_type&#8221;] ) transcribe_kwargs = { &#8220;batch_size&#8221;: CONFIG[&#8220;batch_size&#8221;] } if language: transcribe_kwargs[&#8220;language&#8221;] = language result = model.transcribe(audio, **transcribe_kwargs) total_segments = len(result[&#8220;segments&#8221;]) total_words = sum(len(seg.get(&#8220;words&#8221;, [])) for seg in result[&#8220;segments&#8221;]) del model gc.collect() if CONFIG[&#8220;device&#8221;] == &#8220;cuda&#8221;: torch.cuda.empty_cache() print(f&#8221; Transcription complete!&#8221;) print(f&#8221; Language: {result[&#8216;language&#8217;]}&#8221;) print(f&#8221; Segments: {total_segments}&#8221;) print(f&#8221; Total text length: {sum(len(seg[&#8216;text&#8217;]) for seg in result[&#8216;segments&#8217;])} characters&#8221;) return result We download a sample audio file, load it for analysis, and then transcribe it using WhisperX. 
We set up batched inference with our chosen model size and configuration, and we output key details such as language, number of segments, and total text length. Check out the\u00a0FULL CODES here. Copy CodeCopiedUse a different Browser def align_transcription(segments, audio, language_code): &#8220;&#8221;&#8221;Align transcription for accurate word-level timestamps&#8221;&#8221;&#8221; print(&#8220;n STEP 2: Aligning for word-level timestamps&#8230;&#8221;) try: model_a, metadata = whisperx.load_align_model( language_code=language_code, device=CONFIG[&#8220;device&#8221;] ) result = whisperx.align( segments, model_a, metadata, audio, CONFIG[&#8220;device&#8221;], return_char_alignments=False ) total_words = sum(len(seg.get(&#8220;words&#8221;, [])) for seg in result[&#8220;segments&#8221;]) del model_a gc.collect() if CONFIG[&#8220;device&#8221;] == &#8220;cuda&#8221;: torch.cuda.empty_cache() print(f&#8221; Alignment complete!&#8221;) print(f&#8221; Aligned words: {total_words}&#8221;) return result except Exception as e: print(f&#8221; Alignment failed: {str(e)}&#8221;) print(&#8221; Continuing with segment-level timestamps only&#8230;&#8221;) return {&#8220;segments&#8221;: segments, &#8220;word_segments&#8221;: []} We align the transcription to generate precise word-level timestamps. By loading the alignment model and applying it to the audio, we refine timing accuracy, and then report the total aligned words while ensuring memory is cleared for efficient processing. Check out the\u00a0FULL CODES here. 
Copy CodeCopiedUse a different Browser def analyze_transcription(result): &#8220;&#8221;&#8221;Generate statistics about the transcription&#8221;&#8221;&#8221; print(&#8220;n TRANSCRIPTION STATISTICS&#8221;) print(&#8220;=&#8221;*70) segments = result[&#8220;segments&#8221;] total_duration = max(seg[&#8220;end&#8221;] for seg in segments) if segments else 0 total_words = sum(len(seg.get(&#8220;words&#8221;, [])) for seg in segments) total_chars = sum(len(seg[&#8220;text&#8221;].strip()) for seg in segments) print(f&#8221;Total duration: {total_duration:.2f} seconds&#8221;) print(f&#8221;Total segments: {len(segments)}&#8221;) print(f&#8221;Total words: {total_words}&#8221;) print(f&#8221;Total characters: {total_chars}&#8221;) if total_duration &gt; 0: print(f&#8221;Words per minute: {(total_words \/ total_duration * 60):.1f}&#8221;) pauses = [] for i in range(len(segments) &#8211; 1): pause = segments[i+1][&#8220;start&#8221;] &#8211; segments[i][&#8220;end&#8221;] if pause &gt; 0: pauses.append(pause) if pauses: print(f&#8221;Average pause between segments: {sum(pauses)\/len(pauses):.2f}s&#8221;) print(f&#8221;Longest pause: {max(pauses):.2f}s&#8221;) word_durations = [] for seg in segments: if &#8220;words&#8221; in seg: for word in seg[&#8220;words&#8221;]: duration = word[&#8220;end&#8221;] &#8211; word[&#8220;start&#8221;] word_durations.append(duration) if word_durations: print(f&#8221;Average word duration: {sum(word_durations)\/len(word_durations):.3f}s&#8221;) print(&#8220;=&#8221;*70) We analyze the transcription by generating detailed statistics such as total duration, segment count, word count, and character count. We also calculate words per minute, pauses between segments, and average word duration to better understand the pacing and flow of the audio. Check out the\u00a0FULL CODES here. 
```python
def display_results(result, show_words=False, max_rows=50):
    """Display transcription results in a formatted table"""
    data = []
    for seg in result["segments"]:
        text = seg["text"].strip()
        start = f"{seg['start']:.2f}s"
        end = f"{seg['end']:.2f}s"
        duration = f"{seg['end'] - seg['start']:.2f}s"
        if show_words and "words" in seg:
            for word in seg["words"]:
                data.append({
                    "Start": f"{word['start']:.2f}s",
                    "End": f"{word['end']:.2f}s",
                    "Duration": f"{word['end'] - word['start']:.3f}s",
                    "Text": word["word"],
                    "Score": f"{word.get('score', 0):.2f}",
                })
        else:
            data.append({"Start": start, "End": end, "Duration": duration, "Text": text})
    df = pd.DataFrame(data)
    if len(df) > max_rows:
        print(f"Showing first {max_rows} rows of {len(df)} total...")
        display(HTML(df.head(max_rows).to_html(index=False)))
    else:
        display(HTML(df.to_html(index=False)))
    return df


def export_results(result, output_dir="output", filename="transcript"):
    """Export results in multiple formats"""
    os.makedirs(output_dir, exist_ok=True)

    json_path = f"{output_dir}/{filename}.json"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2, ensure_ascii=False)

    srt_path = f"{output_dir}/{filename}.srt"
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp(seg["start"])
            end = format_timestamp(seg["end"])
            f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")

    vtt_path = f"{output_dir}/{filename}.vtt"
    with open(vtt_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp_vtt(seg["start"])
            end = format_timestamp_vtt(seg["end"])
            f.write(f"{start} --> {end}\n{seg['text'].strip()}\n\n")

    txt_path = f"{output_dir}/{filename}.txt"
    with open(txt_path, "w", encoding="utf-8") as f:
        for seg in result["segments"]:
            f.write(f"{seg['text'].strip()}\n")

    csv_path = f"{output_dir}/{filename}.csv"
    df_data = []
    for seg in result["segments"]:
        df_data.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip(),
        })
    pd.DataFrame(df_data).to_csv(csv_path, index=False)

    print(f"\n Results exported to '{output_dir}/' directory:")
    print(f"   ✓ {filename}.json (full structured data)")
    print(f"   ✓ {filename}.srt (subtitles)")
    print(f"   ✓ {filename}.vtt (web video subtitles)")
    print(f"   ✓ {filename}.txt (plain text)")
    print(f"   ✓ {filename}.csv (timestamps + text)")


def format_timestamp(seconds):
    """Convert seconds to SRT timestamp format"""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def format_timestamp_vtt(seconds):
    """Convert seconds to VTT timestamp format"""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def batch_process_files(audio_files, output_dir="batch_output"):
    """Process multiple audio files in batch"""
    print(f"\n Batch processing {len(audio_files)} files...")
    results = {}
```
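The two timestamp helpers differ only in the millisecond separator: SRT uses a comma, WebVTT a period. A self-contained re-statement (renamed `srt_ts`/`vtt_ts` here to stay standalone) makes the conversion easy to verify:

```python
def srt_ts(seconds):
    """Seconds -> SRT timestamp (HH:MM:SS,mmm)."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def vtt_ts(seconds):
    """Seconds -> WebVTT timestamp (HH:MM:SS.mmm)."""
    return srt_ts(seconds).replace(",", ".")

# 3661.5 s = 1 h, 1 min, 1 s, 500 ms
print(srt_ts(3661.5))  # 01:01:01,500
print(vtt_ts(3661.5))  # 01:01:01.500
```

Getting the separator wrong is a common cause of subtitle files that players silently reject, which is why the exporter keeps two distinct formatters.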