
LINDAT/CLARIAH-CZ

About the Project

LINDAT/CLARIAH-CZ is a digital research infrastructure for language technologies, arts and humanities. It enables archiving, processing, management and accessibility of data, resources and tools from the fields of arts, humanities and social sciences.

The LINDAT/CLARIAH-CZ grant project is implemented under the OP VaVpI (Research and Development for Innovations Operational Programme) of the Ministry of Education, Youth and Sports and is funded from the European Structural and Investment Funds.

Our Role in the Project

Our research team at the Department of Cybernetics of the Faculty of Applied Sciences of the University of West Bohemia serves mainly as the developer of software tools that facilitate the processing and accessibility of various digitized data resources, especially those that employ speech processing technology.

Currently we are maintaining (and further developing) the following tools:

 UWebASR (https://lindat.mff.cuni.cz/services/uwebasr/)

This is a Web-based ASR engine for Czech and Slovak that is free to use for research purposes and does not require any background knowledge about the inner workings of the ASR engine or the API usage. 

 


MALACH indexing and search engine (available only via a personal visit to the Malach Centre for Visual History)

In recent decades, a constantly growing amount of multimodal data has been collected and stored in order to preserve it for future use. These data include, among other things, videotaped oral history interviews, archived footage of various TV broadcasts and a plethora of scanned hand-written and typed documents and photographs. The resulting archives are invaluable resources for many branches of the humanities (history, linguistics) and social sciences (political science, communication studies).

However, almost all such archives share the same problem: unless they are thoroughly equipped with detailed metadata (describing, for example, the topics of the individual documents or the names and places that appear or are mentioned in them), which is only rarely the case, it is almost impossible to find the desired information in the vast amount of data.

Our research lab has participated in several projects (in some of them as the leading partner) that applied first speech processing technologies and later also image processing technologies to facilitate access to the information content of the archives.

First, we built automatic speech recognition (ASR) systems for the Czech, Slovak, Russian and Polish recordings from what is now the USC Shoah Foundation Visual History Archive (http://vhaonline.usc.edu). The original plan was to transcribe the audio track of the video recordings into plain text and then use traditional information retrieval techniques to retrieve content for elaborately crafted search topics. This proved problematic because of relatively poor recognition accuracy (a Word Error Rate of more than 30%) on the challenging data, so we resorted to an approach called spoken term detection (STD) instead.
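For readers unfamiliar with the metric, Word Error Rate is conventionally defined as the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that standard definition in Python; the function name and the whitespace tokenization are assumptions made for illustration and are not taken from the project's own evaluation tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative sketch: tokenization is plain whitespace splitting,
    which is an assumption, not the project's actual text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Two substitutions and one deletion against a 10-word reference.
    print(word_error_rate("a b c d e f g h i j", "a b x d e f g y i"))
```

At a Word Error Rate above 30%, roughly one reference word in three is affected by an error, which helps explain why retrieval over the plain transcripts was unreliable.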

In STD, the task is to find every occurrence of a specified word or short phrase (the query) in the archive and return the exact time point at which each occurrence takes place. Ideally, there is also a graphical interface that allows the user to play the relevant segment of the recording instantly, as shown in the screenshot below.

[Screenshot: AMALACH]
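The input and output of the STD task described above can be illustrated with a deliberately simplified sketch. It assumes that each recording is already available as a list of recognized words with start and end times, and it performs only an exact match on that single transcript. This is an assumption made for illustration: the names TimedWord, Hit and find_term are hypothetical, and the engine described here may work quite differently (for example, by searching richer recognizer output than a single best transcript).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedWord:
    """One recognized word with its time span in seconds (hypothetical type)."""
    word: str
    start: float
    end: float


@dataclass(frozen=True)
class Hit:
    """One occurrence of the query in a recording (hypothetical type)."""
    recording_id: str
    start: float
    end: float


def find_term(archive: dict[str, list[TimedWord]], query: str) -> list[Hit]:
    """Return every exact, case-insensitive occurrence of the query.

    `archive` maps a recording identifier to its time-aligned words.
    """
    terms = query.lower().split()
    hits: list[Hit] = []
    if not terms:
        return hits
    for recording_id, words in archive.items():
        tokens = [w.word.lower() for w in words]
        # Slide a window of len(terms) words over the transcript.
        for i in range(len(tokens) - len(terms) + 1):
            if tokens[i:i + len(terms)] == terms:
                last = words[i + len(terms) - 1]
                hits.append(Hit(recording_id, words[i].start, last.end))
    return hits


if __name__ == "__main__":
    toy_archive = {
        "interview_001": [
            TimedWord("we", 12.0, 12.2),
            TimedWord("left", 12.2, 12.6),
            TimedWord("Prague", 12.6, 13.1),
            TimedWord("in", 13.1, 13.2),
            TimedWord("winter", 13.2, 13.8),
        ],
    }
    for hit in find_term(toy_archive, "prague"):
        print(hit.recording_id, hit.start, hit.end)
```

The returned start time is what a playback interface would use to jump directly to the relevant segment of the recording.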

Contact

Pavel Ircing
         head of the research team at UWB
         ircing@kky.zcu.cz