Manual Subtitle Speech Alignment

18 Jun 2019 . #research #nlp #agents #discourse #fsharp

I’ve been working on an automated approach to subtitle alignment, with the goal of creating speech data for training deep-learning text-to-speech (TTS) models.

However, the resulting data isn’t clean enough to create good quality TTS because it suffers from the following defects:

  • Music/noise in the background
  • Overlapping speech
  • Clipped final word (e.g. ~100ms)
  • Preceding/following silence
  • Wrong speaker (rare)

To resolve these, I created a web-based UI with data preview/edit capabilities, similar to finetuneas. However, unlike that work, my program:

  • Allows editing the transcript (i.e., does not assume it is correct)
  • Allows rejection of utterances entirely, with rejection codes
  • Supports a keyboard-oriented UI for faster review/correction
  • Displays a waveform

The program requires two inputs: WAV audio and a JSON file in the following format, where the times are in milliseconds:

[
  {
    "Start": 184170,
    "Stop": 184284,
    "Text": "YES, HE CAN!"
  }
]
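As a rough illustration of working with this format outside the tool, here is a minimal Python sketch that parses an utterance list, checks that each entry has the three fields shown above, and converts the millisecond times to sample offsets. The helper names `load_utterances` and `to_sample_range`, the validation rules, and the 16 kHz sample rate are my own assumptions for the example; they are not part of the program itself.

```python
import json

# Field names taken from the JSON format shown above.
REQUIRED_KEYS = ("Start", "Stop", "Text")


def load_utterances(raw):
    """Parse a JSON utterance list and sanity-check each entry (hypothetical helper)."""
    utterances = json.loads(raw)
    if not isinstance(utterances, list):
        raise ValueError("expected a JSON array of utterances")
    for i, utt in enumerate(utterances):
        missing = [k for k in REQUIRED_KEYS if k not in utt]
        if missing:
            raise ValueError(f"utterance {i} is missing {missing}")
        # Assumed rule: times are non-negative and Start precedes Stop.
        if not 0 <= utt["Start"] < utt["Stop"]:
            raise ValueError(f"utterance {i} has an invalid time range")
    return utterances


def to_sample_range(utt, sample_rate):
    """Convert millisecond Start/Stop times to (first, last) sample offsets."""
    start = utt["Start"] * sample_rate // 1000
    stop = utt["Stop"] * sample_rate // 1000
    return start, stop


example = '[{"Start": 184170, "Stop": 184284, "Text": "YES, HE CAN!"}]'
utts = load_utterances(example)
sample_range = to_sample_range(utts[0], 16000)  # 16 kHz is an assumed rate
```

The sample offsets can then be used to slice the matching region out of the WAV data with whatever audio library you prefer.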

The program is available here. It is all client-side, so there’s no need to install it yourself.

The GitHub repository is here.