corpus-tools.org
Humboldt-Universität zu Berlin, Corpus Linguistics and Morphology group
Institution: Humboldt-Universität zu Berlin, Corpus Linguistics and Morphology group
Category: Project
Website: https://corpus-tools.org/
Short Description
ANNIS is a browser-based search and visualization platform for complex, multi-layered linguistic corpora. It enables querying and displaying annotated texts, audio, and video materials across multiple linguistic levels. The target audience consists of researchers and educators in corpus linguistics working with structured linguistic data. Universities benefit from a standardized, open solution for analyzing and presenting linguistic data in research and teaching.
General Description
-
Thematic Classification
Subject Areas
- Humanities
- Computer Science
- Linguistics
- Corpus Linguistics
- Morphology
- Computational Linguistics
- Text Linguistics
- Semantics
- Syntax
- Prosody
- Information Theory
- Digital Humanities
Research Fields
- Corpus linguistics
- Morphology
- Linguistic information structure
- Syntax
- Semantics
- Prosody
- Referentiality
- Lexicon
- Multilingualism
- Multimodal corpora (language, audio, video)
- Historical linguistics
- Ancient languages (e.g. Ancient Greek, Old High German, Old Occitan)
- Corpus-based language analysis
- Annotation of linguistic data
- Language technology for digital language resources
Specializations
- Annotation of linguistic data (multi-layer annotation)
- Migration and conversion between different file formats (e.g. ANNIS, TreeTagger, EXMARaLDA, CoNLL-U, XLSX, TextGrid, PAULA, PTB)
- Search and visualization in complex linguistic corpora
- Support for multilingual and multimodal corpora (text, audio, video)
- Working with multi-layered and multiple overlapping segmentations (e.g. in spoken corpora)
- Integration of audiovisual annotations (e.g. timeline, speech tracks)
- Development of custom HTML visualizations via CSS
- Support for various linguistic phenomena: syntax, semantics, morphology, prosody, referentiality, lexicon, information structure, coreference, rhetoric, translation
- Provision of open-source tools under Apache 2.0 license
- Focus on working with corpora from research projects such as SFB 632
- Development of tools for processing historical and ancient languages (e.g. Old High German, Classical Greek, Old Occitan, Old Wolof)
- Support for parallel corpora and translation annotations
- Provision of demo corpora for various languages and use cases
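The idea of multiple overlapping segmentations listed above can be illustrated with a minimal sketch. It assumes a simplified model in which every segment is an interval on a shared timeline. The class `Span`, the function `overlapping`, and the layer names `dipl` and `norm` are hypothetical and chosen for illustration only; this is not the ANNIS or graphANNIS data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One segment on a shared timeline (hypothetical, simplified model)."""
    start: float
    end: float
    layer: str   # name of the segmentation layer, e.g. "dipl" or "norm"
    value: str   # text of the segment


def overlapping(spans, layer_a, layer_b):
    """Return (a, b) value pairs from two layers whose intervals overlap."""
    pairs = []
    for a in spans:
        if a.layer != layer_a:
            continue
        for b in spans:
            # Two half-open intervals overlap if each starts before the other ends.
            if b.layer == layer_b and a.start < b.end and b.start < a.end:
                pairs.append((a.value, b.value))
    return pairs


# Two segmentations of the same stretch of text: a diplomatic layer with
# three segments and a normalized layer that merges the first two.
spans = [
    Span(0.0, 1.0, "dipl", "in"),
    Span(1.0, 2.0, "dipl", "dem"),
    Span(0.0, 2.0, "norm", "im"),
    Span(2.0, 3.0, "dipl", "Haus"),
    Span(2.0, 3.0, "norm", "Haus"),
]

aligned = overlapping(spans, "norm", "dipl")
print(aligned)
```

Because neither layer is treated as the single base tokenization, a search can be phrased against either one, and segments of one layer can be related to the other through the shared timeline.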
Keywords
- Annatto
- file format converter
- command-line tool
- workflow-based
- graphANNIS data model
- import/export
- data manipulation
- consistency checking
- multilingual corpora
- Open Source
Funding
Funding Provider: -
Funding Program: SFB 632
Funding Reference: SFB 632
Funding Period: 2004 - 2025
Project Volume: -
Team & Partners
Project Leadership
Prof. Dr. Thomas Krause
Involved Persons
- Dr. Thomas Krause (Project Lead, Humboldt-Universität zu Berlin)
- Dr. Amir Zeldes (Co-developer, Georgetown University)
- Dr. Francesco Mambrini (Corpus Contribution, Perseus Project, Tufts University)
- Prof. Roland Meyer (Corpus Contribution, Humboldt-Universität zu Berlin)
- Prof. Rosemarie Luehr (Corpus Contribution, Universität Jena)
- Dr. Olga Scrivner (Corpus Contribution, Indiana University)
- Dr. Michaela Schmitt (Co-developer, Humboldt-Universität zu Berlin)
- Dr. Lena Weber (PhD Candidate, Humboldt-Universität zu Berlin)
- Jan Müller (PhD Candidate, Humboldt-Universität zu Berlin)
Affiliated Institutions
-
External Partners
- Georgetown Linguistics
- SFB632/D1
- SFB632/A5
- SFB632/B4
- SFB632/B7
- SFB632/D2
- Vandenhoeck & Ruprecht
- Perseus Project, Tufts University
- Lehrstuhl für Indogermanistik, Universität Jena
- Institut für Slawistik, Humboldt-Universität zu Berlin
- Institut für Computerlinguistik, Universität Zürich
- RIDGES Project
- Olga Scrivner, Indiana University
Project Contents
Goals
- Annotation, migration, and analysis of linguistic data
- Provision of open-source tools for complex multi-layered corpora
- Support for various annotation types (syntax, semantics, morphology, prosody, etc.)
- Integration of audio and video data into corpus analyses
- Promotion of interoperability through conversion between different file formats
Work Packages
- WP1: Development and maintenance of the ANNIS software (annotation, search, and visualization of complex multi-layered corpora)
- WP2: Development and maintenance of Annatto (format conversion based on the graphANNIS data model)
- WP3: Development and maintenance of Artemisia (annotation editor, under development)
- WP4: Development and maintenance of graphANNIS (library for integrating corpus search into third-party software)
- WP5: Maintenance and provision of demo corpora and test data
- WP6: Documentation and user support (User Guide, Developer Guide, AQL Tutorial)
- WP7: Community and open-source contribution (GitHub repository, issue tracker, discussion forum)
- WP8: Maintenance and support of legacy tools (e.g. Salt, Pepper, Pepper converter)
- WP9: Coordination with third parties and integration of third-party tools
Methods
- Open-source licensing under Apache 2.0
- Cross-platform (Linux, macOS, Windows), browser-based architecture
- Java (OpenJDK 11) as a runtime requirement
- Use of the graphANNIS data model as intermediate representation
- Use of workflow files for configuring conversion processes
- Modular architecture with import, export, and manipulation modules
- Conducting consistency checks during conversion
- Support for multiple annotation levels (syntax, semantics, morphology, prosody, referentiality, lexicon, etc.)
- Integration of audio/video annotations for spoken language
- Use of AQL (ANNIS Query Language) for complex search and query operations
- Support for multiple segmentations and overlapping tokenizations
- Creation of user-defined HTML visualizations with CSS
- Use of GitHub for development, issue tracking, and community contributions
- Provision of demo corpora in various formats (relANNIS, PAULA, TreeTagger SGML, EXMARaLDA XML, CoNLL-U, etc.)
- Migration of data between different file formats using Annatto
- Integration into external software via graphANNIS
- Use of open-source tools and frameworks (e.g., Pepper for legacy conversion)
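The conversion approach described in this list (import into an intermediate graph, manipulate, check consistency, export) can be sketched as follows. This is a minimal illustration under stated assumptions, not Annatto's implementation or workflow file format: the `Graph` class, all function names, and the tab-separated input and output layouts are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class Graph:
    """Toy stand-in for an intermediate annotation graph (hypothetical)."""
    nodes: dict = field(default_factory=dict)  # node id -> {annotation name: value}
    edges: list = field(default_factory=list)  # (source id, target id, edge type)


def import_tsv(text: str) -> Graph:
    """Import module: read 'token<TAB>pos' lines as a chain of token nodes."""
    graph = Graph()
    previous = None
    lines = [line for line in text.splitlines() if line.strip()]
    for index, line in enumerate(lines):
        token, pos = line.split("\t")
        node_id = f"t{index}"
        graph.nodes[node_id] = {"tok": token, "pos": pos}
        if previous is not None:
            graph.edges.append((previous, node_id, "next"))
        previous = node_id
    return graph


def check_required(name: str) -> Callable[[Graph], Graph]:
    """Consistency check: every node must carry a non-empty annotation `name`."""
    def step(graph: Graph) -> Graph:
        missing = [n for n, annos in graph.nodes.items() if not annos.get(name)]
        if missing:
            raise ValueError(f"nodes without '{name}': {missing}")
        return graph
    return step


def rename_annotation(old: str, new: str) -> Callable[[Graph], Graph]:
    """Manipulation module: rename an annotation on every node that has it."""
    def step(graph: Graph) -> Graph:
        for annos in graph.nodes.values():
            if old in annos:
                annos[new] = annos.pop(old)
        return graph
    return step


def export_table(graph: Graph, columns: Iterable[str]) -> str:
    """Export module: one numbered, tab-separated row per node."""
    rows = []
    for number, annos in enumerate(graph.nodes.values(), start=1):
        rows.append("\t".join([str(number)] + [annos.get(c, "_") for c in columns]))
    return "\n".join(rows)


def run_workflow(source, importer, steps, exporter):
    """Run importer, then each step in order, then the exporter."""
    graph = importer(source)
    for step in steps:
        graph = step(graph)
    return exporter(graph)


result = run_workflow(
    "Das\tART\nHaus\tNN\n",
    import_tsv,
    [check_required("pos"), rename_annotation("pos", "xpos")],
    lambda g: export_table(g, ["tok", "xpos"]),
)
print(result)
```

Routing every format through one intermediate representation means each format needs only one importer and one exporter, rather than a dedicated converter for every pair of formats.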
Expected Outcomes
- Provision of software for annotation, migration, and analysis of linguistic data
- Provision of open-source tools for complex multi-layered linguistic corpora
- Support for various annotation types (syntax, semantics, morphology, prosody, referentiality, lexicon, etc.)
- Integration of audio/video annotations for spoken language
- Provision of tools for conversion between different file formats (e.g. ANNIS, TreeTagger, EXMARaLDA, CoNLL-U, XLSX, TextGrid, PTB)
- Provision of tools for visualization of search results and linguistic structures
- Provision of demo corpora in various languages and annotation types
- Support through documentation, user manuals, and developer guidelines
- Promotion of collaboration and further development through open-source license (Apache 2.0) and GitHub community
- Provision of graph-based search and visualization tools (graphANNIS) for integration into third-party software
- Support for multilingualism and multimedia corpora
- Provision of tools for creating user-defined HTML visualizations with CSS
- Provision of tools for processing multi-layered and multi-segmented corpora
Contact
Contact Person: Thomas Krause
Email: thomas.krause@hu-berlin.de
Project Website: https://corpus-tools.org/
Recorded: 2026-01-14
Source: https://corpus-tools.org/