Detail of publication



Author: Kanis Jakub
Language: Czech
Year: 2009
Type of publication: Habilitation and dissertation theses
/ 2010-04-07 12:15:43 /


 author = {Kanis Jakub},
 title = {STATISTICK\'{Y} AUTOMATICK\'{Y} P\v{R}EKLAD \v{C}E\v{S}TINA - ZNAKOVAN\'{A} \v{R}E\v{C}},
 year = {2009},
 url = {},

Additional information

This thesis deals with the design of a system for automatic translation between Czech and Signed Speech. In this work, the term Signed Speech is used as a collective name for Czech Sign Language and Signed Czech, both of which serve as means of communication for deaf people in the Czech Republic. The main goal was to create a general translation system that can be used for both of these languages (and, more generally, for an arbitrary language pair). To this end, current approaches to building automatic translation systems and existing sign language translation systems worldwide were surveyed. A phrase-based statistical approach was chosen as the most suitable solution. This approach allows a translation system to be built for an arbitrary language pair, and phrase-based systems are currently among the most widely used statistical translation systems, representing the state of the art in both translation accuracy and speed.

In statistical machine translation, a parallel corpus containing corresponding texts in both languages is the main source of information about the language pair. For sign languages and Signed Speech, building such a corpus is complicated by the absence of an official written form of either any sign language or Signed Czech (consequently, to the best of the author's knowledge, no parallel corpus of Signed Speech has existed until now). A new parallel corpus of Signed Speech, the Czech - Signed Czech (CSC) corpus, was therefore created for this work. It was created by translating into Signed Czech the existing Human-Human Train Timetable dialog corpus (authors Jurčíček, Jelínek, Zahradil), which contains transcriptions of telephone queries to a train timetable information center. Signed Czech was chosen mainly because its written form is simpler to create: it is represented as a sequence of Czech Sign Language signs in the same order as the Czech words in the translated sentence. The final corpus contains 15 722 sentence pairs organized in 1 109 dialogs, with a rich annotation scheme useful for further processing (for each utterance, the corpus provides a transcription and a normalized transcription of the speech, named entity annotation, semantic annotation in the form of dialog acts, and the newly added Signed Czech translation).

In a phrase-based system, the information acquired from the parallel corpus is stored in a phrase table, which contains corresponding translation pairs. An original method for extracting translation pairs, based on the minimal loss principle, was proposed and tested in this work. The phrase table obtained with this method and its suggested improvements was compared with two other phrase tables: one obtained from a handcrafted phrase alignment established during the collection of the CSC corpus, and one obtained with a standard method for automatic phrase extraction. The suggested improvements of the new extraction method consist of splitting the selection of the best translation according to the occurrence frequency of the source phrase, combining the phrase tables for both translation directions, and filtering the resulting table by translating an appropriate text.
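
To make the notion of a phrase table concrete, the following minimal sketch builds one from a list of already aligned phrase pairs (such as a handcrafted alignment could provide) using plain relative-frequency estimation. This is the textbook estimate, not the minimal loss method proposed in the thesis, whose details the abstract does not give. The function name `build_phrase_table` and the toy phrase pairs are invented for illustration only.

```python
from collections import Counter, defaultdict


def build_phrase_table(phrase_pairs):
    """Estimate p(target | source) by relative frequency.

    phrase_pairs: iterable of (source_phrase, target_phrase) strings.
    Returns a dict: source_phrase -> {target_phrase: probability}.
    """
    phrase_pairs = list(phrase_pairs)  # allow generators; we iterate twice
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(src for src, _ in phrase_pairs)

    table = defaultdict(dict)
    for (src, tgt), count in pair_counts.items():
        table[src][tgt] = count / source_counts[src]
    return dict(table)


# Toy, invented phrase pairs (NOT taken from the CSC corpus).
toy_pairs = [
    ("jede vlak", "JET VLAK"),
    ("jede vlak", "JET VLAK"),
    ("jede vlak", "VLAK"),
    ("kdy", "KDY"),
]
toy_table = build_phrase_table(toy_pairs)
```

Each source phrase maps to its candidate translations with probabilities that sum to one, which is the kind of lookup structure a phrase-based decoder consults during search.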

Further, an algorithm for monotone and non-monotone search based on dynamic programming and using phrase n-grams was introduced. A new decoder for translation between Czech and Signed Czech was created by implementing the monotone search algorithm. The decoder was designed to be easy to use and to integrate into real applications. Its performance was compared with that of a state-of-the-art phrase-based decoder, and the results indicate that the two decoders perform comparably. Additionally, versions of the decoder using phrase bigrams and trigrams in combination with bigram and trigram language models were compared with one another. In terms of translation accuracy and speed, the best combination is the phrase bigram decoder with the trigram language model.
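
The idea of monotone search by dynamic programming can be illustrated with a heavily simplified sketch. It scores hypotheses with phrase translation log-probabilities only and omits the phrase n-gram model and the language model that the thesis decoder uses, so it should be read as an illustration of the search principle rather than as the author's algorithm. The function `monotone_decode`, the parameter `max_phrase_len`, and the toy phrase table are all assumptions made for this example.

```python
import math


def monotone_decode(words, phrase_table, max_phrase_len=3):
    """Best monotone segmentation and translation of `words`.

    phrase_table: source phrase -> list of (target phrase, log-probability).
    Returns (best log-probability, list of target tokens);
    (-inf, []) if the sentence cannot be covered.
    """
    n = len(words)
    # best[i] = (score, backpointer, target phrase) for the first i source words
    best = [(-math.inf, None, None)] * (n + 1)
    best[0] = (0.0, None, None)

    for i in range(1, n + 1):
        # Try every source phrase words[j:i] that ends at position i.
        for j in range(max(0, i - max_phrase_len), i):
            if best[j][0] == -math.inf:
                continue  # prefix of length j is not translatable
            src = " ".join(words[j:i])
            for tgt, logp in phrase_table.get(src, []):
                score = best[j][0] + logp
                if score > best[i][0]:
                    best[i] = (score, j, tgt)

    if best[n][0] == -math.inf:
        return -math.inf, []

    # Follow backpointers to recover the target phrases in source order.
    phrases = []
    i = n
    while i > 0:
        _, j, tgt = best[i]
        phrases.append(tgt)
        i = j
    phrases.reverse()
    return best[n][0], " ".join(phrases).split()


# Toy, invented phrase table (NOT taken from the thesis).
toy_table = {
    "kdy": [("KDY", math.log(0.9))],
    "jede": [("JET", math.log(0.8))],
    "vlak": [("VLAK", math.log(0.9))],
    "jede vlak": [("JET VLAK", math.log(0.85))],
}
score, tokens = monotone_decode("kdy jede vlak".split(), toy_table)
```

Because the search is monotone, the state only needs the number of source words covered so far, which keeps the table one-dimensional. Adding an n-gram model would extend the state with the last target words or phrases.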

In the experimental part of this work, the translation accuracy between Czech and Signed Czech was compared for all three available phrase tables. The results show that the handcrafted table and the table extracted by the standard automatic method perform equally well, while the table extracted by the new method is about one BLEU point behind them (a difference that is, however, negligible in real applications). The main advantage of the handcrafted table and the table extracted by the new method over the table extracted by the standard automatic method is their several times smaller size (12 times smaller for the handcrafted table and 5 times smaller for the newly extracted table). The basic system for translation from Czech to Signed Czech was tested and reached a BLEU score of 81.22. Possible improvements using information included in the rich annotation of the CSC corpus (a class-based language model and post-processing of the resulting translation) were then suggested and tested. These improvements increase translation accuracy by more than two BLEU points, depending on the phrase table used. The resulting best translation system was then used for the opposite translation direction, from Signed Czech to Czech, where it reached a best result of 63.98 BLEU points.
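
For readers unfamiliar with the metric, the sketch below computes a corpus-level BLEU score on the 0 to 100 scale used above. It assumes one reference per sentence, n-grams up to length 4, uniform weights, and no smoothing. The abstract does not state which BLEU configuration or tool was used in the thesis, so scores from this sketch are not directly comparable with the reported figures. The function names `ngrams` and `corpus_bleu` are chosen for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per hypothesis, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


perfect = corpus_bleu([["a", "b", "c", "d"]], [["a", "b", "c", "d"]])
partial = corpus_bleu([["a", "b", "c", "d", "e"]], [["a", "b", "c", "d", "f"]])
```

In the second call a single wrong token out of five already lowers every n-gram precision, which is why differences of one or two BLEU points can correspond to only a few changed words across a test set.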