corpus-tools.org

Humboldt-Universität zu Berlin, Corpus Linguistics and Morphology group

Institution: Humboldt-Universität zu Berlin, Corpus Linguistics and Morphology group
Category: Project
Website: https://corpus-tools.org/

Short Description

ANNIS is a browser-based search and visualization platform for complex, multi-layered linguistic corpora. It enables querying and displaying annotated texts, audio, and video materials across multiple linguistic levels. The target audience consists of researchers and educators in corpus linguistics working with structured linguistic data. Universities benefit from a standardized, open solution for analyzing and presenting linguistic data in research and teaching.

General Description

Thematic Classification

Subject Areas

Humanities
Computer Science
Linguistics
Corpus Linguistics
Morphology
Computational Linguistics
Text Linguistics
Semantics
Syntax
Prosody
Information Theory
Digital Humanities

Research Fields

Corpus linguistics
Morphology
Linguistic information structure
Syntax
Semantics
Prosody
Referentiality
Lexicon
Multilingualism
Multimodal corpora (language, audio, video)
Historical linguistics
Ancient languages (e.g. Ancient Greek, Old High German, Old Occitan)
Corpus-based language analysis
Annotation of linguistic data
Language technology for digital language resources

Specializations

Annotation of linguistic data (multi-layer annotation)
Migration and conversion between different file formats (e.g. ANNIS, TreeTagger, EXMARaLDA, CoNLL-U, XLSX, TextGrid, PAULA, PTB)
Search and visualization in complex linguistic corpora
Support for multilingual and multimodal corpora (text, audio, video)
Working with multi-layered and multiple overlapping segmentations (e.g. in spoken corpora)
Integration of audiovisual annotations (e.g. timeline, speech tracks)
Development of custom HTML visualizations via CSS
Support for various linguistic phenomena: syntax, semantics, morphology, prosody, referentiality, lexicon, information structure, coreference, rhetoric, translation
Provision of open-source tools under the Apache 2.0 license
Focus on working with corpora from research projects such as SFB 632
Development of tools for processing historical and ancient languages (e.g. Old High German, Classical Greek, Old Occitan, Old Wolof)
Support for parallel corpora and translation annotations
Provision of demo corpora for various languages and use cases

Keywords

Annatto - file format converter - command-line tool - workflow-based - graphANNIS data model - import/export - data manipulation - consistency checking - multilingual corpora - Open Source

Funding

Funding Provider: -
Funding Program: SFB 632
Funding Reference: SFB 632
Funding Period: 2004 - 2025
Project Volume: -

Team & Partners

Project Leadership

Dr. Thomas Krause

Involved Persons

Dr. Thomas Krause (Project Lead, Humboldt-Universität zu Berlin)
Dr. Amir Zeldes (Co-developer, Georgetown University)
Dr. Francesco Mambrini (Corpus Contribution, Perseus Project, Tufts University)
Prof. Roland Meyer (Corpus Contribution, Humboldt-Universität zu Berlin)
Prof. Rosemarie Luehr (Corpus Contribution, Universität Jena)
Dr. Olga Scrivner (Corpus Contribution, Indiana University)
Dr. Michaela Schmitt (Co-developer, Humboldt-Universität zu Berlin)
Dr. Lena Weber (PhD Candidate, Humboldt-Universität zu Berlin)
Jan Müller (PhD Candidate, Humboldt-Universität zu Berlin)

Affiliated Institutions

External Partners

Project Contents

Goals

Annotation, migration, and analysis of linguistic data
Provision of open-source tools for complex multi-layered corpora
Support for various annotation types (syntax, semantics, morphology, prosody, etc.)
Integration of audio and video data into corpus analyses
Promotion of interoperability through conversion between different file formats

Work Packages

WP1: Development and maintenance of the ANNIS software (search and visualization of complex multi-layered corpora)
WP2: Development and maintenance of Annatto (format conversion based on the graphANNIS data model)
WP3: Development and maintenance of Artemisia (annotation editor, under development)
WP4: Development and maintenance of graphANNIS (integration of corpus search into users' own software)
WP5: Maintenance and provision of demo corpora and test data
WP6: Documentation and user support (User Guide, Developer Guide, AQL Tutorial)
WP7: Community and open-source contribution (GitHub repository, issue tracker, discussion forum)
WP8: Maintenance and support of legacy tools (e.g. Salt, Pepper, Pepper converter)
WP9: Coordination with third parties and integration of third-party tools

Methods

Open-source licensing under Apache 2.0
Cross-platform (Linux, macOS, Windows), browser-based architecture
Java (OpenJDK 11) as a requirement
Use of the graphANNIS data model as intermediate representation
Use of workflow files for configuring conversion processes
Modular architecture with import, export, and manipulation modules
Conducting consistency checks during conversion
Support for multiple annotation levels (syntax, semantics, morphology, prosody, referentiality, lexicon, etc.)
Integration of audio/video annotations for spoken language
Use of AQL (ANNIS Query Language) for complex search and query operations
Support for multiple segmentations and overlapping tokenizations
Creation of user-defined HTML visualizations with CSS
Use of GitHub for development, issue tracking, and community contributions
Provision of demo corpora in various formats (relANNIS, PAULA, TreeTagger SGML, EXMARaLDA XML, CoNLL-U, etc.)
Migration of data between different file formats using Annatto
Integration into external software via graphANNIS
Use of open-source tools and frameworks (e.g., Pepper for legacy conversion)
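
The conversion-related methods above (import, an intermediate representation, manipulation, consistency checks, export) can be illustrated with a minimal Python sketch. It is not Annatto's or graphANNIS's actual API: the AnnotationGraph class, every function name, and the toy tab-separated input format are hypothetical and chosen only to show how the steps fit together.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotationGraph:
    """Hypothetical intermediate representation: tokens plus labeled spans."""
    tokens: list[str] = field(default_factory=list)
    # Each span is (layer name, first token index, last token index, value).
    spans: list[tuple[str, int, int, str]] = field(default_factory=list)


def import_tabular(text: str) -> AnnotationGraph:
    """Import step (assumed toy format): one 'token<TAB>pos' pair per line."""
    graph = AnnotationGraph()
    for line in text.strip().splitlines():
        token, pos = line.split("\t")
        index = len(graph.tokens)
        graph.tokens.append(token)
        graph.spans.append(("pos", index, index, pos))
    return graph


def add_span(graph: AnnotationGraph, layer: str, start: int, end: int, value: str) -> AnnotationGraph:
    """Manipulation step: add an annotation on another, possibly overlapping, layer."""
    graph.spans.append((layer, start, end, value))
    return graph


def check_consistency(graph: AnnotationGraph) -> list[str]:
    """Consistency check: report spans that point outside the token list."""
    problems = []
    for layer, start, end, value in graph.spans:
        if not (0 <= start <= end < len(graph.tokens)):
            problems.append(f"{layer}={value}: invalid range {start}-{end}")
    return problems


def export_inline(graph: AnnotationGraph) -> str:
    """Export step: write each token with its 'pos' value as token/POS."""
    pos = {start: value for layer, start, end, value in graph.spans if layer == "pos"}
    return " ".join(f"{tok}/{pos.get(i, '_')}" for i, tok in enumerate(graph.tokens))


# A three-step "workflow": import, manipulate, check, then export.
graph = import_tabular("The\tDET\ncorpus\tNOUN\ngrows\tVERB")
graph = add_span(graph, "phrase", 0, 1, "NP")
problems = check_consistency(graph)
output = export_inline(graph)
print(problems)
print(output)
```

Keeping every layer in one shared structure means an exporter only needs to know the intermediate representation, not each source format, which is the usual motivation for converting through a single data model.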

Expected Outcomes

Provision of software for annotation, migration, and analysis of linguistic data
Provision of open-source tools for complex multi-layered linguistic corpora
Support for various annotation types (syntax, semantics, morphology, prosody, referentiality, lexicon, etc.)
Integration of audio/video annotations for spoken language
Provision of tools for conversion between different file formats (e.g. ANNIS, TreeTagger, EXMARaLDA, CoNLL-U, XLSX, TextGrid, PTB)
Provision of tools for visualization of search results and linguistic structures
Provision of demo corpora in various languages and annotation types
Support through documentation, user manuals, and developer guidelines
Promotion of collaboration and further development through an open-source license (Apache 2.0) and the GitHub community
Provision of graph-based search and visualization tools (graphANNIS) for integration into users' own software
Support for multilingual and multimodal corpora
Provision of tools for creating user-defined HTML visualizations with CSS
Provision of tools for processing multi-layered and multi-segmented corpora

Contact

Contact Person: Thomas Krause
Email: thomas.krause@hu-berlin.de
Project Website: https://corpus-tools.org/

Recorded: 2026-01-14
Source: https://corpus-tools.org/
