Caption Skill — Pipeline Reference

00 Overview

Generate word-by-word bilingual captions synced to the audio, then render them to MP4 with HyperFrames.

Supports both 🇹🇭 Thai → EN (Thai words active + EN translation) and 🇬🇧 EN → Thai (EN words active + Thai translation).

01 Pipeline

🇹🇭 TH → EN Mode

❶ ffmpeg normalize → H.264
❷ ffmpeg extract 16 kHz WAV
❸ whisper.cpp -l th Metal GPU
❹ openai-whisper medium translate th→en
❺ pythainlp newmm: merge fragments → groups
❻ LLM proofread: check for ASR errors
❼ HyperFrames compose + render

🇬🇧 EN → TH Mode

❶ ffmpeg normalize → H.264
❷ ffmpeg extract 16 kHz WAV
❸ whisper.cpp -l en Metal GPU
❹ LLM translate en→th
❺ group EN words + distribute TH
❻ LLM proofread EN words
❼ LLM review_th: check Thai translation quality
❽ HyperFrames compose + render

Tools

Tool	Purpose
`ffmpeg`	convert video → H.264 + extract 16kHz audio
`whisper-cli`	transcribe with Metal GPU (large-v3-turbo GGML)
`openai-whisper`	translate th→en (medium model, CPU FP32)
`pythainlp`	Thai word segmentation (newmm) + distribute the TH translation across groups
`GLM LLM`	translate en→th + proofread ASR + review TH quality
`hyperframes`	compose HTML+GSAP + render MP4

02 Setup

Project structure

project/
├── video_fixed.mp4          ← H.264 normalized
├── audio.wav                ← 16kHz mono WAV
├── whisper_cpp_th.json      ← transcribe output
├── translate_en.json        ← EN translation
├── groups.json              ← caption groups
├── index.html               ← HyperFrames composition
├── output.mp4               ← rendered result
└── capture/assets/fonts/    ← Sarabun woff2 (6 files, ~60KB)

Fonts — Sarabun 400/700/800

Copy them straight from an existing project (6 woff2 files):

cp <old-project>/capture/assets/fonts/* <new-project>/capture/assets/fonts/
# Sarabun-{400,700,800}-{thai,latin}.woff2

03 Transcribe

Thai — whisper.cpp (Metal GPU)

ffmpeg -y -i video_fixed.mp4 -ar 16000 -ac 1 -c:a pcm_s16le audio.wav

whisper-cli \
  -m ~/.cache/whisper/ggml-large-v3-turbo.bin \
  -f audio.wav -l th \
  --output-json-full -of whisper_cpp_th \
  -t 8 -np -bo 5 -bs 5 -sow

~2.6s for a 14s clip (Metal GPU)
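The two commands above can be scripted so both modes share one code path. This is a minimal sketch that only builds and prints the argument lists; the function names `extract_audio_cmd` and `transcribe_cmd` are hypothetical, and the flags are copied unchanged from the commands above.

```python
import shlex
from pathlib import Path


def extract_audio_cmd(video, wav="audio.wav"):
    """Build the ffmpeg argv that extracts 16 kHz mono 16-bit PCM audio."""
    return ["ffmpeg", "-y", "-i", video,
            "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", wav]


def transcribe_cmd(wav, lang, model, out_base):
    """Build the whisper-cli argv with the same flags as the command above."""
    return ["whisper-cli", "-m", model, "-f", wav, "-l", lang,
            "--output-json-full", "-of", out_base,
            "-t", "8", "-np", "-bo", "5", "-bs", "5", "-sow"]


model = str(Path.home() / ".cache/whisper/ggml-large-v3-turbo.bin")
for cmd in (extract_audio_cmd("video_fixed.mp4"),
            transcribe_cmd("audio.wav", "th", model, "whisper_cpp_th")):
    # Printed for inspection; hand cmd to subprocess.run(cmd, check=True) to execute
    print(shlex.join(cmd))
```

For EN → TH mode, only the `lang` argument changes to `"en"`.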

English Translate — openai-whisper medium

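import whisper  # openai-whisper package
# task='translate' makes Whisper output English text for the Thai speech
model_m = whisper.load_model('medium')
result_en = model_m.transcribe('video_fixed.mp4', language='th', task='translate')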

~3.6s for a 14s clip (CPU) → total ~6.7s vs large-v3-turbo CPU ~17.5s

Fastest pipeline: whisper.cpp Metal + openai-whisper medium is 2.6x faster than openai-whisper large-v3-turbo on CPU
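The speed-up figure follows from the two totals quoted above. A quick check of the arithmetic, using only the numbers in this section (the variable names are illustrative):

```python
# Timings quoted above for a 14 s clip, in seconds
whisper_cpp_metal = 2.6    # Thai transcription, Metal GPU
whisper_medium_cpu = 3.6   # th→en translation, CPU
total_fast = 6.7           # quoted total for the fast pipeline
total_slow = 17.5          # openai-whisper large-v3-turbo on CPU

speedup = total_slow / total_fast
unitemised = total_fast - (whisper_cpp_metal + whisper_medium_cpu)

print(f"speed-up: {speedup:.1f}x")
print(f"time not itemised above: {unitemised:.1f}s")
```

The two stage timings sum to 6.2s, so about 0.5s of the quoted total is not broken down here; the source does not say what it covers.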

04 ⚠ Pitfalls

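Thai Word Fragments: must be merged manually
Whisper splits Thai into individual characters and vowels, not words.
Do not merge by gap threshold: stacked vowels and tone marks share a timestamp (gap=0), so everything collapses into one blob.
Correct approach: take the segment text → split it into words (newmm) → map the fragments onto the words in order.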
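The in-order mapping can be sketched as below. This assumes the words come from a segmenter such as pythainlp's `word_tokenize(text, engine="newmm")` with whitespace tokens removed, and that the fragments concatenate to the same characters as the words; `map_fragments_to_words` and the dict keys are hypothetical names, not the pipeline's actual code.

```python
def map_fragments_to_words(words, fragments):
    """Give each segmented word a start/end by consuming fragment characters in order.

    words:     list of str, already segmented (e.g. by newmm)
    fragments: list of {"text": str, "start": float, "end": float} from Whisper
    """
    timed = []
    fi = 0      # index of the fragment currently being consumed
    used = 0    # characters of fragments[fi] already assigned to a word
    for word in words:
        need = len(word)
        start = end = None
        while need > 0 and fi < len(fragments):
            frag = fragments[fi]
            text = frag["text"].strip()
            avail = len(text) - used
            if avail <= 0:                 # empty or exhausted fragment
                fi, used = fi + 1, 0
                continue
            if start is None:
                start = frag["start"]      # first fragment touching this word
            take = min(need, avail)
            need -= take
            used += take
            end = frag["end"]              # last fragment touching this word
            if used == len(text):
                fi, used = fi + 1, 0
        if start is not None:
            timed.append({"text": word, "start": start, "end": end})
    return timed


fragments = [
    {"text": "หนัง", "start": 0.0, "end": 0.3},
    {"text": "สือ", "start": 0.3, "end": 0.6},
    {"text": "ดี", "start": 0.7, "end": 0.9},
]
for w in map_fragments_to_words(["หนังสือ", "ดี"], fragments):
    print(w)
```

Because the mapping counts characters rather than comparing timestamps, fragments with gap=0 cannot pull neighbouring words together. If a word begins in the middle of a fragment it inherits that fragment's start time, so adjacent words may share a boundary.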

Word Timestamps Disappear After ~60%
Whisper tends to stretch a single fragment over the rest of the clip → use the segment boundaries from the EN translation and make the groups wider.
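One way to apply this fallback is sketched below. The detection rule (a word longer than `max_word_s` seconds marks the dropout) and its 2.0s default are assumptions for illustration, as is the function name; the segment dicts use the `start`, `end` and `text` keys that openai-whisper returns in `result["segments"]`.

```python
def wide_groups_after_dropout(words, en_segments, max_word_s=2.0):
    """Split into (reliable_words, fallback_groups).

    words:       list of {"start": float, "end": float, ...} in time order
    en_segments: list of {"start": float, "end": float, "text": str}
    """
    cutoff = None
    for i, w in enumerate(words):
        if w["end"] - w["start"] > max_word_s:   # implausibly long "word"
            cutoff = i
            break
    if cutoff is None:
        return words, []                          # timestamps look fine throughout

    t0 = words[cutoff]["start"]
    fallback = [
        {"groupStart": max(seg["start"], t0),
         "groupEnd": seg["end"],
         "en": seg["text"].strip()}
        for seg in en_segments
        if seg["end"] > t0                        # segments overlapping the dropout
    ]
    return words[:cutoff], fallback
```

Words before the cut-off keep their word-level timing; everything after it is captioned per EN segment, which gives the wider groups described above.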

EN segment spans several Thai groups
Do not paste the whole sentence into every group → it overflows the screen and repeats. Fix: split the EN words and distribute them across the groups.
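A minimal sketch of the split, assuming an even division by word count is acceptable (it ignores meaning, so phrase breaks may need a manual touch). `distribute_words` is a hypothetical helper name.

```python
def distribute_words(sentence, n_groups):
    """Split a sentence into n_groups consecutive chunks of near-equal word count."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    words = sentence.split()
    base, extra = divmod(len(words), n_groups)
    chunks, i = [], 0
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)   # earlier groups absorb the remainder
        chunks.append(" ".join(words[i:i + size]))
        i += size
    return chunks


print(distribute_words("He bought two pizzas with ten thousand bitcoin", 3))
```

Each group then shows only its own chunk, so no text repeats and no single line carries the full sentence.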

whisper.cpp translate does not work for Thai → use openai-whisper medium instead

English words inside Thai sentences (Bitcoin, Pizza, Laszlo): Whisper transcribes each as one complete word with clean timestamps, so no fragment merging is needed

05 Words Whisper Frequently Mishears

Whisper hears	Actual word	Cause
โกรนมิด	โลก	similar consonants
ทั้งสือ / ตั้งสือ	หนังสือ	spoken quickly
เหตุ	เห็น	ต/ต+็น confusion
เหตุการ	เหตุการณ์	ณ์ dropped
อดิต / อดิ	อดีต	vowel ิ/ี confusion
อาหาร	อ่าน	following word swallowed
เคียนเสียน	เขียนเรื่อง	words run together
ละลึก / การละลึก	รำลึก	รำ → ละ, similar sound
ควรสำคัญ	ความสำคัญ	ความ → ควร, vowel confusion

AI proofreading (GLM LLM) corrects these automatically, but some words still need a manual check.

06 Caption Groups

Principles

▸ 2–4 words per group (high-energy), never more than 5
▸ Break at a sentence boundary or a pause > 150ms
▸ Each group has a translation line that matches its meaning
▸ Silence gaps → no caption (leave blank)
▸ Spans with no word timestamps → wider groups (3-5 seconds)
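The pause and size rules above can be sketched as a single pass over the timed words. This assumes each word carries both `start` and `end` times in seconds; `build_groups` is a hypothetical name, and sentence-boundary breaks are left out because they need the segment text.

```python
def build_groups(words, max_words=4, pause_s=0.15):
    """Group timed words: start a new group on a long pause or when the group is full."""
    groups, current = [], []
    for w in words:
        if current:
            gap = w["start"] - current[-1]["end"]
            if gap > pause_s or len(current) >= max_words:
                groups.append(current)
                current = []
        current.append(w)
    if current:
        groups.append(current)

    # Emit the same field names as the GROUPS structure used by the template
    return [
        {"id": i,
         "groupStart": g[0]["start"],
         "groupEnd": g[-1]["end"],
         "words": [{"wi": k, "wordStart": w["start"]} for k, w in enumerate(g)]}
        for i, g in enumerate(groups)
    ]
```

Since each group ends at its last word's `end` and the next begins at its first word's `start`, silence gaps between groups stay caption-free without extra handling.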

GROUPS data structure

var GROUPS = [
  {
    id: 0, groupStart: 0.00, groupEnd: 1.78,
    en: "English phrase",       // TH mode: EN translation
    words: [                    // TH mode: Thai words; EN mode: English words
      { wi: 0, wordStart: 0.00 },
      { wi: 1, wordStart: 1.08 }
    ]
  }
];

The GROUPS word count must equal the HTML span count; a mismatch produces GSAP target null warnings (lint trap #1).
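This count check is easy to automate before running lint. The sketch below uses only Python's standard `html.parser`, and assumes the composition contains one `<div class="cg">` per group, in the same order as GROUPS, with each word in a `<span class="cw">`; `SpanCounter` and `mismatched_groups` are hypothetical names.

```python
from html.parser import HTMLParser


class SpanCounter(HTMLParser):
    """Count <span class="cw"> elements inside each <div class="cg">."""

    def __init__(self):
        super().__init__()
        self.counts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "cg" in classes:       # exact class, not cg-inner/cg-en
            self.counts.append(0)
        elif tag == "span" and "cw" in classes and self.counts:
            self.counts[-1] += 1


def mismatched_groups(groups, html):
    """Return (index, expected_words, found_spans) for every group that disagrees."""
    parser = SpanCounter()
    parser.feed(html)
    found = parser.counts
    problems = []
    for i in range(max(len(groups), len(found))):
        want = len(groups[i]["words"]) if i < len(groups) else None
        got = found[i] if i < len(found) else None
        if want != got:
            problems.append((i, want, got))
    return problems
```

An empty result means every group has as many spans as words; a `None` in a tuple means the group or its `.cg` block is missing entirely.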

07 Template (16:9 / 9:16)

Dimensions

Property	16:9	9:16
resolution	1920 × 1080	1080 × 1920
`.cg` bottom	64px	280px
`.cw` font-size	72px	62px
`.cg-en` font-size	36px	32px
`.cg-thai` wrap	wrap	wrap + center
`.cg-en`	white-space: normal, max-width: 1180px	max-width: 920px

HTML structure

<div class="cg">
  <div class="cg-inner">
    <div class="cg-thai">         <!-- active word-by-word -->
      <span class="cw">คำแรก</span>
      <span class="cw">คำสอง</span>
    </div>
    <div class="cg-en">English</div>  <!-- translation line -->
  </div>
</div>

EN Audio — swap classes

Replace .cg-thai with .cg-en-words; .cg-en then holds the Thai translation instead.

08 Silence Removal

Do not use -c copy with -ss/-to: it is not frame-accurate → duplicated audio + stuttering video.
Use filter_complex trim only.

Parameters

Parameter	Default	Adjust when
`NOISE`	-35dB	cuts too little → -30dB; cuts too much → -40dB
`THRESHOLD`	0.3s	tighter cuts → 0.2s; keep pauses → 0.5s
`d=0.25`	0.25s	minimum duration counted as silence
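A sketch of how these parameters could feed a filter_complex trim. It assumes silence is detected first with ffmpeg's `silencedetect` filter (for example `ffmpeg -i video_fixed.mp4 -af silencedetect=noise=-35dB:d=0.25 -f null -`, which logs `silence_start` and `silence_end` lines to stderr), and that THRESHOLD means the shortest silence worth cutting. The function names are hypothetical.

```python
import re


def parse_silences(log):
    """Extract (start, end) pairs from silencedetect log output."""
    starts = [float(x) for x in re.findall(r"silence_start: (-?[\d.]+)", log)]
    ends = [float(x) for x in re.findall(r"silence_end: (-?[\d.]+)", log)]
    return list(zip(starts, ends))


def keep_intervals(silences, duration, threshold=0.3):
    """Invert silences into the spans to keep, ignoring silences shorter than threshold."""
    keep, cursor = [], 0.0
    for s, e in silences:
        if e - s < threshold:
            continue                      # short pause: leave it in
        if s > cursor:
            keep.append((cursor, s))
        cursor = max(cursor, e)
    if cursor < duration:
        keep.append((cursor, duration))
    return keep


def trim_filter(keep):
    """Build a filter_complex string that trims each kept span and concatenates them."""
    parts, labels = [], []
    for i, (s, e) in enumerate(keep):
        parts.append(f"[0:v]trim=start={s:.3f}:end={e:.3f},setpts=PTS-STARTPTS[v{i}]")
        parts.append(f"[0:a]atrim=start={s:.3f}:end={e:.3f},asetpts=PTS-STARTPTS[a{i}]")
        labels.append(f"[v{i}][a{i}]")
    parts.append(f"{''.join(labels)}concat=n={len(keep)}:v=1:a=1[v][a]")
    return ";".join(parts)
```

The resulting string goes to `ffmpeg -i video_fixed.mp4 -filter_complex "<filter>" -map "[v]" -map "[a]" <output>`. Because the video is re-encoded through the filter graph, cuts land on exact timestamps rather than on keyframes.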

09 Design

Typography

Element	Font	Weight
Active words (TH/EN)	Sarabun	800
Translation line	Sarabun	400

Active word colours

Topic	Colour
Bitcoin	#F7931A
Crypto	#7B61FF
News	#EF4444
Tech	#3B82F6
Nature	#22C55E

Key values

Option	Value	Rationale
Active color	`#F7931A`	Bitcoin orange; change per topic
Inactive opacity	42%	low enough to make the active word obvious
Translation opacity	65%	secondary to the active word but still readable
Pill background	`rgba(0,0,0,0.68)`	contrast under any lighting
Animation	`back.out(1.6)` scale-pop	bouncy, suits social media

10 Pre-Delivery Checklist

☐ npx hyperframes lint → 0 errors, 0 warnings
☐ GROUPS words count = HTML <span> count in every group
☐ npx hyperframes snapshot → a caption is shown on every frame that has audio
☐ No caption lingers after its group ends (groupEnd)
☐ Audio stream is present in the output
☐ If rendering vertical → check that long words wrap correctly
☐ If cutting silence → use filter_complex trim only
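The audio-stream item can be checked mechanically. A minimal sketch, assuming the stream list comes from `ffprobe -v error -show_streams -of json output.mp4`; `has_audio_stream` is a hypothetical helper that only parses that JSON.

```python
import json


def has_audio_stream(ffprobe_json):
    """Return True if the ffprobe JSON report lists at least one audio stream."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return any(s.get("codec_type") == "audio" for s in streams)


sample = '{"streams": [{"codec_type": "video"}, {"codec_type": "audio"}]}'
print(has_audio_stream(sample))
```

In practice the JSON would be captured from ffprobe's stdout, for example with `subprocess.run(..., capture_output=True, text=True)`.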