Lexicographical corpus and computer vision

an AI-based methodology for large-scale semi-automatic collection and annotation of signs and the analysis of lexical variation (regionalisms) in Brazilian sign language

Authors

DOI:

https://doi.org/10.14393/DLv20a2026-4

Keywords:

Libras, Lexical Variation, Computational Sociolinguistics, Computer Vision, Lexicographical Corpus

Abstract

The quantitative study of lexical variation (regionalisms) in Brazilian Sign Language (Libras) is methodologically hindered by the absence of large-scale, cherologically annotated lexicographical corpora. Traditional manual annotation, for example with ELAN, is unfeasible for building massive corpora (tens of thousands of hours), and the practice of conceptual glossing (translation) discards articulatory information (the signifier), making the study of subtle cherological variants impossible. This article proposes an interdisciplinary methodological architecture that solves this dual bottleneck by combining Corpus Linguistics, Computer Vision, and Computational Sociolinguistics. The objective is to detail a technically and ethically robust pipeline for large-scale semi-automatic collection and annotation, focused specifically on the discovery and analysis of sociolinguistic variation. The methodology begins with a systematic review (PRISMA) that maps the state of the art (2018-2025) and identifies the central gap: AI research focuses almost exclusively on recognition (SLR) and translation (SLT), treating linguistic variation as noise rather than as an object of study. The proposed architecture employs a two-phase pipeline. Phase 1 (Feature Extraction) uses pose estimation with MediaPipe, optimized following dos Santos et al. (2025), to convert videos collected "in the wild" into vector representations (time series of landmarks), replacing glossing with a quantifiable "cherological transcription". Phase 2 (Semi-Automatic Annotation) uses Transformer models, trained on a seed corpus, to generate lexical label suggestions. These suggestions are submitted to a human-in-the-loop validation interface, where Deaf linguists validate or correct the annotations, and the model is iteratively retrained. The result is a massive relational database that links articulatory forms (vectors) to sociolinguistic metadata. This corpus enables computational sociolinguistic analysis via unsupervised clustering, which discovers patterns and identifies regional variants; these are subsequently correlated with geographical data to map variation, enabling the creation of quantitative linguistic atlases for Libras. The article concludes that this architecture overcomes the bottlenecks of traditional lexicography and offers a viable path for documenting variation. It also critically discusses the ethical foundations, positioning "Deaf-led Research" as a central methodological pillar to mitigate algorithmic biases (such as techno-ableism) and to ensure that the technology functions as a tool for documentation and empowerment, not for replacement or the erosion of linguistic rights.
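The clustering step described in the abstract can be illustrated with a minimal sketch. This is not the article's implementation: the abstract names neither a distance measure nor a clustering algorithm, so the sketch assumes dynamic time warping (DTW) over landmark time series and a simple threshold-based single-linkage grouping. The function names, the toy two-coordinate "landmark" frames, the threshold, and the region codes are all hypothetical.

```python
from collections import Counter, defaultdict
from math import dist, inf


def dtw_distance(a, b):
    """Dynamic time warping cost between two landmark time series.

    Each series is a list of frames; each frame is a flat list of coordinates.
    """
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])  # Euclidean distance between frames
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def cluster_by_threshold(series, threshold):
    """Single-linkage grouping: tokens closer than `threshold` share a cluster."""
    parent = list(range(len(series)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(series)):
        for j in range(i + 1, len(series)):
            if dtw_distance(series[i], series[j]) < threshold:
                parent[find(i)] = find(j)

    # Relabel cluster roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(series))]


def variants_by_region(labels, regions):
    """Count how often each form cluster occurs in each region."""
    table = defaultdict(Counter)
    for label, region in zip(labels, regions):
        table[region][label] += 1
    return {region: dict(counts) for region, counts in table.items()}


# Toy data (hypothetical): four tokens of one lexical item, reduced to a
# single tracked point (x, y) per frame. Two tokens move horizontally and
# two move vertically, at different speeds.
tokens = [
    [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]],
    [[0.0, 0.0], [0.1, 0.0], [0.1, 0.0], [0.2, 0.0]],
    [[0.0, 0.0], [0.0, 0.1], [0.0, 0.2]],
    [[0.0, 0.0], [0.0, 0.1], [0.0, 0.2], [0.0, 0.2]],
]
regions = ["SP", "SP", "AL", "AL"]  # hypothetical region metadata per token

labels = cluster_by_threshold(tokens, threshold=0.2)
table = variants_by_region(labels, regions)
print(labels)  # [0, 0, 1, 1]
print(table)   # {'SP': {0: 2}, 'AL': {1: 2}}
```

DTW is used here because it tolerates differences in signing speed: the two horizontal tokens have different lengths but a warping cost of zero. In a real corpus, the frames would be normalized landmark vectors and the threshold would need empirical tuning; a region-by-cluster table like the one printed is the kind of input a quantitative atlas would map.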

Author Biography

  • Bruno Jose Betti Galasso, Federal University of São Paulo

    PhD in Education from the University of São Paulo, with a sandwich (split-site) doctoral fellowship at the University of Minho (Portugal) awarded by the Erasmus Mundus External Cooperation programme (Emundus15). Associate professor at the Federal University of São Paulo (UNIFESP).

References

BRAGG, D. et al. Sign language recognition, generation, and translation: An interdisciplinary perspective. In: THE 21ST INTERNATIONAL ACM SIGACCESS CONFERENCE ON COMPUTERS AND ACCESSIBILITY, 2019. p. 16-31. DOI https://doi.org/10.1145/3308561.3353774

CAMGÖZ, N. C. et al. Sign language transformers: Joint end-to-end sign language recognition and translation. In: PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2020. p. 10023-10033. Available at: https://arxiv.org/abs/2003.13830. Accessed: 30 Jan. 2025.

CAO, Z. et al. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, v. 43, n. 1, p. 172-186, 2021. DOI https://doi.org/10.1109/TPAMI.2019.2929257

DE MEULDER, M. The legal recognition of sign languages. Sign Language Studies, v. 15, n. 4, p. 498-506, 2015. DOI https://doi.org/10.1353/sls.2015.0018

DESAI, S. et al. Artificial intelligence in sign language research: Systematic review and future directions. ACM Computing Surveys, v. 56, n. 3, 2024.

DOS SANTOS, D. L. V. et al. Proper body landmark subset enables more accurate and 5X faster recognition of isolated signs in LIBRAS. arXiv preprint arXiv:2510.24887, 2025. Available at: https://arxiv.org/abs/2510.24887. Accessed: 30 Jan. 2025.

GRIEVE, J.; SPEELMAN, D.; GEERAERTS, D. A statistical method for the identification and aggregation of regional linguistic variation. Language Variation and Change, v. 23, n. 2, p. 193-221, 2011. DOI https://doi.org/10.1017/S095439451100007X

HIROOKA, K. et al. Stack Transformer based spatial-temporal attention model for dynamic sign language and fingerspelling recognition. arXiv preprint arXiv:2503.16855, 2025.

HOVY, D.; JOHANNSEN, A. Computational sociolinguistics. In: PROCEEDINGS OF THE 15TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL), 2017.

IEEE. IEEE P7000 series - Ethical standards for autonomous and intelligent systems. Available at: https://standards.ieee.org/. Accessed: 10 Jan. 2025.

LUGARESI, C. et al. MediaPipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172, 2019. Available at: https://arxiv.org/abs/1906.08172. Accessed: 30 Jan. 2025.

MACHADO, V. L. V. Análise da variação lexical em Libras. Repositório UFSC, 2018. Available at: https://repositorio.ufsc.br/. Accessed: 15 Jan. 2025.

MERCANOGLU, O.; KELES, H. AUTSL: A large scale multi-modal Turkish sign language dataset and baseline methods. arXiv preprint arXiv:2001.08078, 2020.

NIST. Artificial Intelligence Risk Management Framework (AI RMF 1.0). Available at: https://www.nist.gov/itl/ai-risk-management-framework. Accessed: 10 Jan. 2025. DOI https://doi.org/10.6028/NIST.AI.100-1

OLIVEIRA, L. A.; SILVA, M. P. S. C.; CAMPELO, W. N. M. Variações linguísticas na Libras: particularidades entre as formas de comunicação/sinalização. Revista Cocar, v. 4, 2020.

QUADROS, R. M. Língua de sinais brasileira: estudos linguísticos. Porto Alegre: Artmed, 2016.

QUADROS, R. M.; CRUZ, C. R. Língua de sinais: instrumentos de avaliação. Porto Alegre: Artmed, 2011.

QUADROS, R. M.; KARNOPP, L. B. Língua de Sinais Brasileira: estudos linguísticos. Porto Alegre: Artmed, 2004. DOI https://doi.org/10.18309/anp.v1i16.560

REZENDE, T. M.; ALMEIDA, S. G. M.; GUIMARÃES, F. G. Development and validation of a Brazilian sign language database for human gesture recognition. Research on Biomedical Engineering, v. 37, n. 4, p. 583-595, 2021.

SANTOS, J. B. A variação lexical em Libras em três municípios do Estado de Alagoas. Dissertation (Master's in Linguistics and Literature), Universidade Federal de Alagoas, Maceió, 2020.

SILVA, K. A. A transcrição de textos do Corpus de Libras. In: ANAIS DO VIII SIMPÓSIO INTERNACIONAL DE ESTUDOS DE GÊNEROS TEXTUAIS, 2015.

Published

2026-02-02

Issue

Section

Lexicography and Artificial Intelligence

How to Cite

GALASSO, Bruno Jose Betti. Lexicographical corpus and computer vision: an AI-based methodology for large-scale semi-automatic collection and annotation of signs and the analysis of lexical variation (regionalisms) in Brazilian sign language. Domínios de Lingu@gem, Uberlândia, v. 20, p. e020004, 2026. DOI: 10.14393/DLv20a2026-4. Available at: https://seer.ufu.br/index.php/dominiosdelinguagem/article/view/80463. Accessed: 2 Feb. 2026.