Lexicographical corpus and computer vision
an AI-based methodology for large-scale semi-automatic collection and annotation of signs and the analysis of lexical variation (regionalisms) in Brazilian sign language
DOI:
https://doi.org/10.14393/DLv20a2026-4
Keywords:
Libras, Lexical Variation, Computational Sociolinguistics, Computer Vision, Lexicographical Corpus
Abstract
The quantitative study of lexical variation (regionalisms) in Brazilian Sign Language (Libras) is methodologically hindered by the absence of large-scale, cherologically annotated lexicographical corpora. Traditional manual annotation, typically carried out in tools such as ELAN, is unfeasible for building massive corpora (tens of thousands of hours), and the practice of conceptual glossing (translation) discards articulatory information (the signifier), making the study of subtle cherological variants impossible. This article proposes an interdisciplinary methodological architecture that solves this dual bottleneck by combining Corpus Linguistics, Computer Vision, and Computational Sociolinguistics. The objective is to detail a technically and ethically robust pipeline for large-scale semi-automatic collection and annotation, focused specifically on the discovery and analysis of sociolinguistic variation. The methodology begins with a systematic review (PRISMA) mapping the state of the art (2018-2025), which identifies the central gap: AI research focuses almost exclusively on recognition (SLR) and translation (SLT), treating linguistic variation as noise rather than as an object of study. The proposed architecture employs a two-phase pipeline. Phase 1 (Feature Extraction) uses MediaPipe pose estimation, following the optimization by dos Santos et al. (2025), to convert videos collected "in the wild" into vector representations (time series of landmarks), replacing glossing with a quantifiable "cherological transcription". Phase 2 (Semi-Automatic Annotation) uses Transformer models, trained on a seed corpus, to generate lexical label suggestions. These suggestions are submitted to a human-in-the-loop validation interface, where Deaf linguists validate or correct the annotations, and the model is iteratively retrained on the corrected data. As a result, the methodology yields a massive relational database that links articulatory forms (vectors) to sociolinguistic metadata.
This corpus enables computational sociolinguistic analysis via unsupervised clustering techniques, which support pattern discovery and the identification of regional variants. These variants are subsequently correlated with geographical data to map variation, enabling the creation of quantitative linguistic atlases for Libras. The conclusion is that this architecture overcomes the bottlenecks of traditional lexicography and offers a viable path for documenting variation. The article critically discusses the ethical foundations, positioning "Deaf-led Research" as a central methodological pillar to mitigate algorithmic biases (such as techno-ableism) and to ensure that the technology functions as a tool for documentation and empowerment, not for replacement or the erosion of linguistic rights.
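As a rough illustration of the analysis step described in the abstract, the sketch below turns landmark time series into fixed-length vectors, groups them without labels, and tabulates the resulting clusters by region. It is a minimal sketch under stated assumptions, not the article's implementation: the abstract does not name a clustering algorithm, so a plain k-means is swapped in here; the function names, the single-landmark toy data, and the region labels are all hypothetical; and in a real pipeline the input would be the landmark sequences produced by the pose estimator rather than synthetic trajectories.

```python
import math
from collections import Counter, defaultdict

# A token is a list of frames; each frame is a flat list [x0, y0, x1, y1, ...].

def resample(seq, n_frames):
    """Linearly interpolate a landmark sequence to n_frames (n_frames >= 2)."""
    if len(seq) == 1:
        return [list(seq[0]) for _ in range(n_frames)]
    out = []
    for i in range(n_frames):
        pos = i * (len(seq) - 1) / (n_frames - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(seq) - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(seq[lo], seq[hi])])
    return out

def normalize(seq):
    """Center on the mean point and divide by the largest offset, so that
    the signer's position and apparent size in the frame do not matter."""
    xs = [v for f in seq for v in f[0::2]]
    ys = [v for f in seq for v in f[1::2]]
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
    centered = [[v - (cx if j % 2 == 0 else cy) for j, v in enumerate(f)]
                for f in seq]
    scale = max(abs(v) for f in centered for v in f) or 1.0
    return [[v / scale for v in f] for f in centered]

def to_vector(seq, n_frames=8):
    """Flatten a resampled, normalized sequence into one feature vector."""
    return [v for f in normalize(resample(seq, n_frames)) for v in f]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(max(vectors,
                           key=lambda v: min(dist(v, c) for c in centers)))
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(v, centers[j]))
                  for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def variants_by_region(labels, regions):
    """Count how often each cluster (candidate variant) occurs per region."""
    counts = defaultdict(Counter)
    for label, region in zip(labels, regions):
        counts[region][label] += 1
    return {region: dict(c) for region, c in counts.items()}

def line(n, dx, dy, x0=0.0, y0=0.0, scale=1.0):
    """Synthetic one-landmark trajectory: a straight movement over n frames."""
    return [[x0 + scale * dx * t / (n - 1), y0 + scale * dy * t / (n - 1)]
            for t in range(n)]

# Toy data: two horizontal and two vertical movements, with different
# durations, positions, and sizes. Region labels are invented.
tokens = [
    line(10, 1, 0), line(14, 1, 0, x0=0.3, y0=0.2, scale=2.0),
    line(12, 0, 1), line(9, 0, 1, x0=0.5, y0=0.1, scale=0.5),
]
regions = ["South", "South", "Northeast", "Northeast"]

vectors = [to_vector(t) for t in tokens]
labels = kmeans(vectors, k=2)
table = variants_by_region(labels, regions)
print(labels, table)
```

The normalization and resampling steps are what let tokens of different duration, position, and scale be compared as forms. The per-region table is only the first step toward the kind of geographical correlation the abstract describes; any cluster found this way would still need validation by Deaf linguists before being treated as a variant.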
References
BRAGG, D. et al. Sign language recognition, generation, and translation: An interdisciplinary perspective. In: THE 21ST INTERNATIONAL ACM SIGACCESS CONFERENCE ON COMPUTERS AND ACCESSIBILITY, 2019. p. 16-31. DOI https://doi.org/10.1145/3308561.3353774
CAMGÖZ, N. C. et al. Sign language transformers: Joint end-to-end sign language recognition and translation. In: PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2020. p. 10023-10033. Available at: https://arxiv.org/abs/2003.13830. Accessed: 30 Jan. 2025.
CAO, Z. et al. OpenPose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, v. 43, n. 1, p. 172-186, 2021. DOI https://doi.org/10.1109/TPAMI.2019.2929257
DE MEULDER, M. The legal recognition of sign languages. Sign Language Studies, v. 15, n. 4, p. 498-506, 2015. DOI https://doi.org/10.1353/sls.2015.0018
DESAI, S. et al. Artificial intelligence in sign language research: Systematic review and future directions. ACM Computing Surveys, v. 56, n. 3, 2024.
DOS SANTOS, D. L. V. et al. Proper body landmark subset enables more accurate and 5X faster recognition of isolated signs in LIBRAS. arXiv preprint arXiv:2510.24887, 2025. Available at: https://arxiv.org/abs/2510.24887. Accessed: 30 Jan. 2025.
GRIEVE, J.; SPEELMAN, D.; GEERAERTS, D. A statistical method for the identification and aggregation of regional linguistic variation. Language Variation and Change, v. 23, n. 2, p. 193-221, 2011. DOI https://doi.org/10.1017/S095439451100007X
HIROOKA, K. et al. Stack Transformer based spatial-temporal attention model for dynamic sign language and fingerspelling recognition. arXiv preprint arXiv:2503.16855, 2025.
HOVY, D.; JOHANNSEN, A. Computational sociolinguistics. In: PROCEEDINGS OF THE 15TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL), 2017.
IEEE. IEEE P7000 series - Ethical standards for autonomous and intelligent systems. Available at: https://standards.ieee.org/. Accessed: 10 Jan. 2025.
LUGARESI, C. et al. MediaPipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172, 2019. Available at: https://arxiv.org/abs/1906.08172. Accessed: 30 Jan. 2025.
MACHADO, V. L. V. Análise da variação lexical em Libras. Repositório UFSC, 2018. Available at: https://repositorio.ufsc.br/. Accessed: 15 Jan. 2025.
MERCANOGLU, O.; KELES, H. AUTSL: A large scale multi-modal Turkish sign language dataset and baseline methods. arXiv preprint arXiv:2001.08078, 2020.
NIST. Artificial Intelligence Risk Management Framework (AI RMF 1.0). Available at: https://www.nist.gov/itl/ai-risk-management-framework. Accessed: 10 Jan. 2025. DOI https://doi.org/10.6028/NIST.AI.100-1.jpn
OLIVEIRA, L. A.; SILVA, M. P. S. C.; CAMPELO, W. N. M. Variações linguísticas na Libras: particularidades entre as formas de comunicação/sinalização. Revista Cocar, v. 4, 2020.
QUADROS, R. M. Língua de sinais brasileira: estudos linguísticos. Porto Alegre: Artmed, 2016.
QUADROS, R. M.; CRUZ, C. R. Língua de sinais: instrumentos de avaliação. Porto Alegre: Artmed, 2011.
QUADROS, R. M.; KARNOPP, L. B. Língua de Sinais Brasileira: estudos linguísticos. Porto Alegre: Artmed, 2004. DOI https://doi.org/10.18309/anp.v1i16.560
REZENDE, T. M.; ALMEIDA, S. G. M.; GUIMARÃES, F. G. Development and validation of a Brazilian sign language database for human gesture recognition. Research on Biomedical Engineering, v. 37, n. 4, p. 583-595, 2021.
SANTOS, J. B. A variação lexical em Libras em três municípios do Estado de Alagoas. Dissertation (Master's in Linguistics and Literature) – Universidade Federal de Alagoas, Maceió, 2020.
SILVA, K. A. A transcrição de textos do Corpus de Libras. In: ANAIS DO VIII SIMPÓSIO INTERNACIONAL DE ESTUDOS DE GÊNEROS TEXTUAIS, 2015.
License
Copyright (c) 2026 Bruno Jose Betti Galasso

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Authors who publish in this journal agree to the following terms:
Authors retain the copyright and grant the journal the right of first publication, with the work simultaneously licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives License (CC BY-NC-ND 4.0), which allows the work to be shared with acknowledgment of authorship and prevents its commercial use.
Authors are authorized to enter into separate, additional contracts for non-exclusive distribution of the version of the work published in this journal (e.g., publication in an institutional repository or as a book chapter), with acknowledgment of authorship and of initial publication in this journal.


