Transcription of human speech into written word sequences

Target users and customers

Companies that want to integrate the transcription of human speech into their products.

Application sectors

Speech-to-Text technology is key to indexing multimedia content, such as that found in multimedia databases or in video and audio collections on the World Wide Web, and to making it searchable by human queries. In addition, it offers a natural interface for submitting and executing queries.
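As a rough illustration of how transcripts make audio and video searchable, the sketch below builds a simple inverted index over time-stamped words. The input format (a media identifier mapped to a list of word and start-time pairs) and the function names build_index and search are assumptions made for this example; they are not part of the KIT system.

```python
from collections import defaultdict


def build_index(transcripts):
    """Build an inverted index from time-stamped transcripts.

    transcripts: dict mapping a media id to a list of (word, start_seconds)
    pairs. This input format is an assumption made for illustration.
    """
    index = defaultdict(list)
    for media_id, words in transcripts.items():
        for word, start in words:
            # Lowercase so that queries match regardless of capitalization.
            index[word.lower()].append((media_id, start))
    return index


def search(index, query):
    """Return the (media_id, start_seconds) hits for a single-word query."""
    return index.get(query.lower(), [])
```

Each hit points to a position inside a recording, so a search result can jump directly to the moment the word was spoken.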

This technology is also a component of speech-translation services. Combined with machine translation technology, it makes it possible to design machines that take human speech as input and translate it into another language. This can be used to enable human-to-human communication across the language barrier or to access content across languages.


The KIT speech transcription system is based on the JANUS Recognition Toolkit (JRTk), which features the IBIS single-pass decoder. The JRTk is a flexible toolkit that follows an object-oriented approach and is controlled via Tcl/Tk scripting.

Recognition can be performed in different modes:
In offline mode, the audio to be recognized is first segmented into sentence-like units. These segments are then clustered by speaker in an unsupervised way. Recognition can then be performed in several passes. Between passes, the models are adapted in an unsupervised manner to improve recognition performance. In addition, system combination via confusion network combination can further improve recognition performance.
In run-on mode, the audio to be recognized is continuously processed without prior segmentation. The output is a steady stream of words.
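The system combination step mentioned for offline mode can be sketched in a strongly simplified form. The example assumes that the confusion networks of the individual systems have already been aligned slot by slot, which is the hard part of real confusion network combination and is not shown here. The function name, the data layout, and the "<eps>" marker for an empty slot are hypothetical; this is not the JRTk implementation.

```python
from collections import defaultdict

EPS = "<eps>"  # hypothetical marker meaning "no word in this slot"


def combine_confusion_networks(networks, weights=None):
    """Combine pre-aligned confusion networks by weighted posterior summation.

    networks: list of confusion networks, one per system. Each network is a
    list of slots, and each slot is a dict mapping word -> posterior.
    All networks are assumed to have the same number of slots.
    """
    if not networks:
        return []
    n_slots = len(networks[0])
    if any(len(net) != n_slots for net in networks):
        raise ValueError("networks must be aligned to the same number of slots")
    if weights is None:
        # Default: every system gets the same weight.
        weights = [1.0 / len(networks)] * len(networks)

    words = []
    for i in range(n_slots):
        scores = defaultdict(float)
        for net, weight in zip(networks, weights):
            for word, posterior in net[i].items():
                scores[word] += weight * posterior
        if not scores:
            continue  # no system proposed anything for this slot
        best = max(scores, key=scores.get)
        if best != EPS:
            words.append(best)
    return words
```

For each slot, the word with the highest combined posterior wins, so a system that is confident about a word can outvote a system that is unsure.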

The recognizer can be flexibly configured to meet given real-time requirements, trading off recognition accuracy against recognition speed.
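Recognition speed is commonly expressed as the real-time factor, i.e. processing time divided by audio duration, where a value of 1.0 or below means the recognizer keeps up with the incoming audio. The sketch below computes this factor and picks, from a set of candidate configurations, the most accurate one that fits a given speed budget. The configuration records and their keys ("name", "rtf", "error_rate") are hypothetical and do not correspond to actual JRTk settings.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def pick_configuration(configs, max_rtf):
    """Pick the most accurate configuration within a real-time budget.

    configs: list of dicts with the hypothetical keys 'name', 'rtf', and
    'error_rate'. Returns the config with the lowest error rate among those
    whose rtf does not exceed max_rtf, or None if none qualifies.
    """
    feasible = [c for c in configs if c["rtf"] <= max_rtf]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["error_rate"])
```

In practice, the speed and error figures for each configuration would have to be measured on representative audio before such a selection is meaningful.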

Within the Quaero project, we are targeting the languages English, French, German, Russian, and Spanish. Given sufficient amounts of training material, the HMM-based acoustic models can be easily adapted to additional languages and domains.

Technical requirements:

Linux-based server with 2 GB of RAM.

Conditions for access and use:

Available for licensing on a case-by-case basis.



  • Karlsruhe Institute of Technology

Contact details:

Prof Alex Waibel

Karlsruhe Institute of Technology (KIT)
Adenauerring 2
76131 Karlsruhe