5 questions to Volker Steinbiss, Head of the Corpus project

Dr. Volker Steinbiss, a researcher at the Technical University of Aachen (RWTH Aachen), is in charge of the Corpus project within Quaero, one of the strategic pillars of the program. Together with the Core Technology Cluster, Corpus forms the common base of research and innovation and supplies Quaero with its raw material: real data that needs to be collected and annotated.

1/ Why did RWTH decide to join Quaero in 2008? What were your expectations when joining the program?

Right after the idea of Quaero was born, RWTH was approached by CNRS-LIMSI. We knew each other from scientific research where – you might not believe it – we were competing heavily against each other, each seeking to beat the other with the better speech recognition and machine translation systems. We as RWTH expected above all to work with world-class research groups. We were also attracted by the fact that objective evaluation played such a central role – in 2008, this important concept was not as strongly supported elsewhere as it is now, near the end of Quaero.

2/ Can you describe the role of Corpus within Quaero?

Within Quaero, the Corpus project collects corpora – data collections – for use in our research project, the Core Technology Cluster (CTC). These corpora fulfill three important functions. First, they are used in the objective comparative evaluation of the technologies developed in the CTC, which is a pillar of Quaero's strategic approach. Second, the corpora constitute the base of the systems developed in the CTC, not only for evaluation but also for training the many systems that use statistical methods. Third, the corpora implicitly define the scientific and technical agenda: technical and scientific problems that are represented in the corpora get attention and labour, in contrast to the ones not reflected there. This is one of the reasons why we use data that exist "in the wild" – they describe real challenges and are relevant for commercial applications. The corpora span a wide range and are very diverse.

3/ What are the main achievements of Corpus (within Quaero)?

I have always regarded this project as a service to the CTC, a prerequisite for their work: providing the raw material that the researchers need to build excellent systems. And this is what Corpus did. Besides this general achievement, there are specific ones. Let me just mention one: through their joint research work based on the large music corpus provided by the Corpus project, the three French research groups at Ircam, Telecom ParisTech and INRIA/Metiss have, in the course of the Quaero program, moved to the top on an international scale.

4/ Do you consider Corpus a "success story"?

The systematic collection of corpora is one of the pillars of Quaero's strategy and it has laid the basis for the remarkable technological advances. The corpora will continue to be used by the partners. Some are assets internal to Quaero partners that help them keep a competitive advantage; others have been made public in order to serve the community.

5/ On a larger scale, how would you assess Quaero, a few months before the end of the program?

Quaero has significantly advanced the state of the art in the underlying technologies, and there has been impressive technology transfer to and uptake by the industrial partners. We were actually much better at achieving results than at communicating them to the media; the latter was our weak point. From the insider's perspective, I was impressed that cross-fertilization between disciplines actually took place – an upside of a large concerted effort that, despite its size, had a rather lightweight management structure.