corpus-tools.org
Humboldt-Universität zu Berlin, Corpus Linguistics and Morphology group
Institution: Humboldt-Universität zu Berlin, Corpus Linguistics and Morphology group
Category: Project
Website: https://corpus-tools.org/
Short Description
ANNIS is a browser-based search and visualization platform for complex, multi-layered linguistic corpora. It supports querying and displaying annotated text, audio, and video material across multiple linguistic levels. The target audience is researchers and educators in corpus linguistics who work with structured linguistic data. Universities benefit from free, platform-independent software that supports research and teaching in linguistics and computational linguistics.
General Description
-
Thematic Classification
Subject Areas
- Humanities
- Computer science
- Linguistics
- Corpus linguistics
- Morphology
- Computational linguistics
- Text linguistics
- Semantics
- Syntax
- Prosody
- Corpus analysis
Research Fields
- Corpus linguistics
- Morphology
- Linguistic information structure
- Syntax
- Semantics
- Prosody
- Referentiality
- Lexicon
- Multilingualism
- Multimodal corpora (language, audio, video)
- Historical linguistics
- Ancient languages (e.g. Old High German, Classical Greek, Old Occitan)
- Corpus-based language analysis
- Annotation of linguistic data
- Language technology for digital language resources
Specializations
- Annotation of complex linguistic corpora with multiple layers (syntax, semantics, morphology, prosody, referentiality, lexicon, etc.)
- Support for spoken language with audio/video annotations
- Multi-layered corpora with conflicting tokenization and subtoken segmentation
- Integration of multimodal data (e.g., dialogues with timestamps and audio)
- Conversion between various file formats (e.g., TreeTagger, EXMARaLDA, CoNLL-U, PAULA, relANNIS)
- Development of custom HTML visualizations for annotations
- Provision of demo corpora for various languages and research fields
- Support for parallel corpora and translation alignment
- Open-source software with Apache 2.0 license and active community development
- Focus on research in corpus linguistics and morphology, particularly in the context of SFB 632
- Provision of tools for migration, analysis, and visualization of linguistic data
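The idea of conflicting tokenizations and subtoken segmentation listed above can be sketched with character offsets over a shared base text. This is a minimal illustration only: the sample text, the layer names, and the helper functions are hypothetical and do not reflect how ANNIS or graphANNIS store segmentations internally.

```python
# Hypothetical example text and segmentation layers, chosen only for illustration.
TEXT = "zum Haus"

# Each layer segments the same base text with (start, end) character offsets.
LAYERS = {
    "orthographic": [(0, 3), (4, 8)],        # zum | Haus
    "subtoken": [(0, 2), (2, 3), (4, 8)],    # zu | m | Haus
}


def segments(layer: str) -> list:
    """Return the text of every segment in the given layer."""
    return [TEXT[start:end] for start, end in LAYERS[layer]]


def overlapping(layer: str, start: int, end: int) -> list:
    """Return segments of `layer` that overlap the character range [start, end)."""
    return [TEXT[s:e] for s, e in LAYERS[layer] if s < end and start < e]


print(segments("orthographic"))        # ['zum', 'Haus']
print(overlapping("subtoken", 0, 3))   # ['zu', 'm']
```

Because both layers point into the same text rather than into each other, neither has to be treated as the single authoritative tokenization, and an annotation on one layer can still be related to segments of the other.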
Keywords
- Annatto
- format conversion
- command-line tool
- workflow-based
- graphANNIS data model
- import/export modules
- data consistency checking
- multilingual corpora
- linguistic data
- Open Source
Funding
Funding Provider: -
Funding Program: SFB 632
Funding Reference: SFB 632
Funding Period: 2004–2017
Project Volume: -
Team & Partners
Project Leadership
Prof. Dr. Thomas Krause
Involved Persons
- Dr. Thomas Krause (Project Lead, Humboldt-Universität zu Berlin)
- Dr. Amir Zeldes (Co-developer, Georgetown University)
- Dr. Francesco Mambrini (Corpus Contribution, Perseus Project, Tufts University)
- Prof. Roland Meyer (Corpus Contribution, Humboldt-Universität zu Berlin)
- Prof. Rosemarie Luehr (Corpus Contribution, Universität Jena)
- Dr. Olga Scrivner (Corpus Contribution, Indiana University)
- Dr. Michaela Schmitt (Co-developer, Humboldt-Universität zu Berlin)
- Dr. Lena Weber (PhD Candidate, Humboldt-Universität zu Berlin)
- Jan Müller (PhD Candidate, Humboldt-Universität zu Berlin)
- Dr. Anna Schmidt (PostDoc, Humboldt-Universität zu Berlin)
Affiliated Institutions
-
External Partners
- Georgetown University
- SFB 632 – “Information Structure: The Linguistic Means for Structuring Utterances, Sentences and Texts”
- Vandenhoeck & Ruprecht
- Tufts University – Perseus Project
- Lehrstuhl für Indogermanistik, Universität Jena
- Institut für Slawistik, Humboldt-Universität zu Berlin
- Institut für Computerlinguistik, Universität Zürich
- RIDGES Project
- Olga Scrivner, Indiana University
Project Contents
Goals
- Annotation, migration, and analysis of linguistic data
- Provision of a browser-based search and visualization architecture for complex multi-layered corpora
- Support for converting between various file formats for linguistic data
- Promotion of collaboration with third parties through compatible tools and open standards
- Development and maintenance of open-source software in the field of corpus linguistics and morphology
Work Packages
- WP1: Development and maintenance of ANNIS (Annotation, search, and visualization of complex multi-layered corpora)
- WP2: Development and maintenance of Annatto (Format conversion based on the graphANNIS data model)
- WP3: Development and maintenance of Artemisia (Annotation editor, in development)
- WP4: Development and maintenance of graphANNIS (Integration of corpus search into users' own software)
- WP5: Maintenance and provision of demo corpora and documentation
- WP6: Community and open-source contribution (bug reporting, discussion, code contributions via GitHub)
- WP7: Support and integration of third-party tools
Methods
- Open-source licensing under Apache 2.0
- Cross-platform development (Linux, Mac, Windows)
- Browser-based search and visualization architecture
- Use of OpenJDK 11 as the Java runtime environment
- Use of the graphANNIS data model as intermediate representation
- Workflow-based configuration (for Annatto)
- Command-line application (Annatto)
- Support for multiple annotation levels (syntax, semantics, morphology, prosody, referentiality, lexicon, etc.)
- Multi-layered linguistic corpora with complex annotations
- Support for audio/video annotations (for spoken language)
- Integration of multimodal data (e.g. EXMARaLDA, time-aligned audio)
- Use of AQL (ANNIS Query Language) for queries
- Support for multiple tokenizations and subtoken-based segmentations
- Creation of user-defined HTML visualizations with CSS
- Migration between file formats (e.g. PAULA, TreeTagger, EXMARaLDA, CoNLL-U, XLSX, TextGrid, PTP)
- Conducting consistency checks during conversion (Annatto)
- Use of GitHub for development, issue tracking, and discussions
- Documentation via online guides, developer guides, and tutorials
- Provision of demo corpora in relANNIS and PAULA format
- Support for parallel corpora and alignment information
- Integration into external software via graph
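Several of the methods above (a graph as intermediate representation, import and export modules, consistency checks during conversion) can be illustrated together in a short sketch. It assumes the standard ten-column CoNLL-U layout for the input. The class and function names below are hypothetical and written for this illustration only. They are not the Annatto or graphANNIS API, and the checks shown are not the ones Annatto actually performs.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotationGraph:
    """Toy intermediate representation: annotated nodes plus labelled edges."""
    nodes: dict = field(default_factory=dict)  # node id -> {annotation name: value}
    edges: list = field(default_factory=list)  # (source id, target id, label)


def import_conllu(text: str) -> AnnotationGraph:
    """Read one CoNLL-U sentence into the toy graph (hypothetical importer)."""
    graph = AnnotationGraph()
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        tok_id, form, lemma, upos = cols[0], cols[1], cols[2], cols[3]
        head, deprel = cols[6], cols[7]
        if not tok_id.isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        graph.nodes[tok_id] = {"tok": form, "lemma": lemma, "pos": upos}
        graph.edges.append((head, tok_id, deprel))
    return graph


def check_consistency(graph: AnnotationGraph) -> list:
    """Return human-readable problems; an empty list means the graph passed."""
    problems = []
    for source, target, _label in graph.edges:
        if source != "0" and source not in graph.nodes:
            problems.append(f"token {target}: head {source} does not exist")
    roots = [target for source, target, _label in graph.edges if source == "0"]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")
    return problems


def export_tabular(graph: AnnotationGraph) -> str:
    """Write token, part of speech, and lemma as tab-separated lines."""
    return "\n".join(
        f"{a['tok']}\t{a['pos']}\t{a['lemma']}" for a in graph.nodes.values()
    )


SAMPLE = "\n".join([
    "# text = Dogs bark",
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_",
])

graph = import_conllu(SAMPLE)
print(check_consistency(graph))  # []
print(export_tabular(graph))
```

Routing every format through one shared graph means each format needs only one importer and one exporter, and checks such as the dangling-head test run once on the graph instead of once per format pair.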
Expected Outcomes
- Support for annotation, migration, and analysis of linguistic data
- Provision of open-source tools for working with complex multi-layered corpora
- Provision of tools for conversion between different file formats (e.g. ANNIS, TreeTagger, EXMARaLDA, CoNLL-U, XLSX, TextGrid)
- Provision of a browser-based search and visualization architecture (ANNIS) for linguistic corpora
- Support for multilingual and multimodal corpora (text, audio, video)
- Provision of demo corpora in various languages and annotation types
- Integration of audio/video annotation, particularly for spoken language
- Development and maintenance of tools for creating and managing multi-layered annotations (e.g. syntax, semantics, morphology, prosody)
- Provision of documentation, user manuals, and developer guidelines
- Promotion of collaboration through open-source licensing (Apache 2.0) and community contributions via GitHub
- Support for integrating corpus search functionality into users' own software (graphANNIS)
- Provision of tools for creating custom HTML visualizations for annotations
- Maintenance and further development of tools supporting projects such as SFB 632 and other research initiatives
Contact
Contact Person: Thomas Krause
Email: thomas.krause@hu-berlin.de
Project Website: https://corpus-tools.org/
Recorded: 2026-01-14
Source: https://corpus-tools.org/