Using Speech Recognition to Generate Subtitles from Video

Introduction

Making subtitles from scratch usually consists of two tedious tasks: (1) figuring out the times when someone starts and stops speaking -- in subtitle-length pieces -- and (2) typing the text corresponding to that speech. An approach often worth attempting is to automate a large part of this work by using speech recognition to generate subtitles from a given video. This method cannot be expected to produce release-quality subtitles on its own, but it should provide a rough first draft, which can be finished by the usual manual methods. With most video sources, the actual speech recognition (turning speech into words) cannot be expected to perform well, but voice recognition (telling speech from silence) should provide decent results for the start and end times of subtitles.

Gaupol's speech recognition uses the excellent CMU Sphinx speech recognition toolkit developed at Carnegie Mellon University -- to be exact, the pocketsphinx plugin for the GStreamer multimedia framework.

We will begin with a description of the parameters that define how the speech recognition operates, followed by general tips on how to use this feature and how it fits in a subtitling workflow.

Parameters

Voice Recognition

There are three parameters that control the voice recognition, i.e. detecting the difference between speech and silence and splitting the detected speech into subtitle-length pieces.

Speech Recognition

Detected voice is further decoded into words by three models: an acoustic model, a phonetic dictionary and a language model. The default models shipped with pocketsphinx are US English models. Models for other languages can be downloaded from the CMU Sphinx directory. Acoustic models are directories with multiple files, phonetic dictionaries are single files, usually with a .dic filename extension, and language models are also single files, usually with a .DMP extension. For example, the default models shipped with pocketsphinx are in the following files.

/usr/share/pocketsphinx/model
|-- hmm
|   `-- en_US
|       `-- hub4wsj_sc_8k
|           |-- feat.params
|           |-- mdef
|           |-- means
|           |-- noisedict
|           |-- sendump
|           |-- transition_matrices
|           `-- variances
`-- lm
    `-- en_US
        |-- cmu07a.dic
        `-- hub4.5000.DMP
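
When pointing the dialog at downloaded or self-made models, it can help to first check that the three pieces are of the expected kind: a directory for the acoustic model and single files for the dictionary and the language model. The following is only a rough sketch of such a check. The function name check_models is made up for this example, and the list of acoustic model files is simply copied from the default model listing above, so other models may legitimately contain a different set of files.

```python
import os

# File names copied from the default acoustic model listing above.
# Assumption: other acoustic models may ship a different set of files.
ACOUSTIC_MODEL_FILES = (
    "feat.params", "mdef", "means", "noisedict",
    "sendump", "transition_matrices", "variances",
)

def check_models(hmm_dir, dic_file, lm_file):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    if not os.path.isdir(hmm_dir):
        problems.append("acoustic model is not a directory: %s" % hmm_dir)
    else:
        for name in ACOUSTIC_MODEL_FILES:
            if not os.path.isfile(os.path.join(hmm_dir, name)):
                problems.append("acoustic model file missing: %s" % name)
    if not os.path.isfile(dic_file):
        problems.append("phonetic dictionary not found: %s" % dic_file)
    if not os.path.isfile(lm_file):
        problems.append("language model not found: %s" % lm_file)
    return problems

if __name__ == "__main__":
    # Paths of the default US English models as listed above.
    root = "/usr/share/pocketsphinx/model"
    for problem in check_models(
        os.path.join(root, "hmm", "en_US", "hub4wsj_sc_8k"),
        os.path.join(root, "lm", "en_US", "cmu07a.dic"),
        os.path.join(root, "lm", "en_US", "hub4.5000.DMP"),
    ):
        print(problem)
```

An empty result only means the files are present, not that pocketsphinx will accept them.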

The default English model might work for other languages as well if you're only interested in voice recognition. If, on the other hand, you're interested in speech recognition, you'll need language-specific models. But be warned: the models you can download from CMU Sphinx might not work "out of the box".

Last, but not least, you can make your own models, tuned to your own recording conditions, your own voice or your own language. See the CMU Sphinx wiki for instructions.

Tips

Results will depend greatly on the cleanliness of the audio track you use. For something like a conference presentation, with very little background noise and the same person speaking the whole time into a good-quality microphone, you can expect fairly good results. When there are multiple speakers, varying noise levels (e.g. different scenes of a film) or disturbing non-verbal audio (e.g. background music or sound effects), the results will be considerably worse.

Gaupol's speech recognition dialog is designed to allow experimentation. You can start with the default values for the parameters and hit the "Start" button. If the results look bad, hit the "Stop" button, edit the parameters and start again. You'll probably want to set the noise level low rather than high and the silence length short rather than long, since it's easier to delete and merge subtitles later than it is to create or split them. Once you find the right parameters for your particular video, let it process all the way through; when it is done, you can close the dialog.

Among the results you'll see subtitles with no text -- only a number in square brackets. This means that voice was detected, but it could not be deciphered into words. These subtitles are often non-verbal sounds or speech in a different language.
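
If you post-process the saved file outside Gaupol, the bracket-only subtitles can be picked out with a simple pattern. This is a minimal sketch that assumes the placeholder text is exactly a number in square brackets, such as "[12]"; the exact format and the helper names are assumptions made for this example, not something taken from Gaupol's code.

```python
import re

# Assumption: a placeholder is nothing but digits inside square brackets.
PLACEHOLDER = re.compile(r"^\[\d+\]$")

def is_placeholder(text):
    """Return True if text looks like an undeciphered-voice placeholder."""
    return PLACEHOLDER.match(text.strip()) is not None

def placeholder_indices(texts):
    """Return the positions of placeholder subtitles in a list of texts."""
    return [i for i, text in enumerate(texts) if is_placeholder(text)]
```

The returned positions are the subtitles that need their text typed in by hand or that may be candidates for deletion.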

Once you have the results of speech recognition, you might want to let Gaupol extend the durations of those subtitles, since subtitles will often disappear too soon if they end right when speaking ends. From Gaupol's "Tools" menu select "Adjust Durations" and set it to at least extend durations to match a reasonable reading speed.
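
To illustrate the idea of extending durations to match a reading speed, here is a small sketch. It is not Gaupol's actual implementation. The function name, the reading speed of 15 characters per second and the 0.1 second gap kept before the next subtitle are all assumed values chosen for the example.

```python
def extend_durations(subtitles, chars_per_second=15.0, min_gap=0.1):
    """Lengthen subtitles that are too short to read.

    subtitles is a list of (start, end, text) tuples in seconds,
    sorted by start time. Durations are never shortened.
    """
    result = []
    for i, (start, end, text) in enumerate(subtitles):
        # End time needed to read the text at the assumed speed.
        wanted_end = start + len(text) / chars_per_second
        new_end = max(end, wanted_end)
        if i + 1 < len(subtitles):
            # Do not run into the next subtitle, but never end
            # earlier than the original end time either.
            limit = subtitles[i + 1][0] - min_gap
            new_end = min(new_end, max(end, limit))
        result.append((start, new_end, text))
    return result
```

A subtitle of 30 characters that starts at 10.0 seconds and is the last one in the file would thus be extended to end at 12.0 seconds, while one followed closely by another subtitle is extended only up to the gap before the next start time.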

After the automatic processing you'll need to go over the subtitles manually, checking the times and typing the text. Gaupol does not yet have proper tools to help with that work. The external video player preview in Gaupol is of little use here, since video players can only seek at keyframe accuracy. Thus, you'll want to save your subtitle file and continue the manual processing in some other subtitle editor, in particular one that can play back the video and audio of an individual subtitle, possibly with leading and trailing context, at millisecond precision.

Problems

Speech Recognition Menu Item Is Unavailable

Speech recognition was added in Gaupol version 0.19. If you have Gaupol 0.19 or later installed, but the "Speech Recognition" item in the "Tools" menu is grayed out, some dependency is missing.

You need the GStreamer multimedia framework and its Python bindings, as well as GStreamer plugins to decode whatever video and audio formats you wish to use. To verify GStreamer is correctly installed and accessible from Python, you should be able to import the gst module at the Python prompt without error.

$ python
Python 2.6.7 (r267:88850, Jun 13 2011, 20:39:28)
[GCC 4.6.1 20110611 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gst
>>>

From CMU Sphinx you need sphinxbase and pocketsphinx. To verify that these are installed correctly and accessible from GStreamer, gst-inspect should be able to find the vader and pocketsphinx elements, i.e. the commands

gst-inspect vader
gst-inspect pocketsphinx

should both print a long listing of the properties of those elements.
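
The same check can be scripted. The sketch below assumes that gst-inspect exits with a non-zero status when it cannot find the named element, and that the command is called gst-inspect on your system (some distributions install it under a versioned name instead). The helper name element_available is made up for this example.

```python
import os
import subprocess

def element_available(name, inspect_cmd="gst-inspect", run=subprocess.call):
    """Return True if inspect_cmd reports that element name exists.

    Assumption: the command exits with status 0 only when the
    element was found.
    """
    try:
        with open(os.devnull, "w") as devnull:
            status = run([inspect_cmd, name], stdout=devnull, stderr=devnull)
    except OSError:
        # The inspect command itself is not installed or not on PATH.
        return False
    return status == 0

if __name__ == "__main__":
    for element in ("vader", "pocketsphinx"):
        found = element_available(element)
        print("%s: %s" % (element, "found" if found else "NOT found"))
```

If either element is reported as not found, run the corresponding gst-inspect command by hand to see the actual error message.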

Gaupol/SpeechRecognition (last edited 2011-07-08 05:57:38 by OsmoSalomaa)