Annotation manual as a Natural Language Processing resource: the Universal Dependencies model in Portuguese

Magali Duran; Maria das Graças Volpe Nunes; Lucelene Lopes; Thiago Alexandre Salgueiro Pardo

doi:10.14393/DL52-v16n4a2022-13

Authors

Magali Duran USP-ICMC https://orcid.org/0000-0002-3843-4600
Maria das Graças Volpe Nunes USP-ICMC-Núcleo Interinstitucional de Linguística Computacional (NILC) https://orcid.org/0000-0002-2776-6140
Lucelene Lopes USP-ICMC-Núcleo Interinstitucional de Linguística Computacional (NILC) https://orcid.org/0000-0003-0314-140X
Thiago Alexandre Salgueiro Pardo ICMC/USP https://orcid.org/0000-0003-2111-1319

Keywords:

Annotated corpora, Annotation manual, Universal Dependencies, Dependency trees, Brazilian Portuguese

Abstract

With the advance of Natural Language Processing (NLP), corpora have come to occupy a prominent place among its resources. More than supporting linguistic studies, they form the basis for training Machine Learning models and for developing state-of-the-art computational applications. Annotated corpora are in particularly high demand, but producing them requires another essential resource, the annotation manual, which instantiates the annotation model of interest for the language in question and lays out the annotation decisions to be adopted. In this article, we explore issues related to the development of manuals for annotating corpora in Brazilian Portuguese according to the international Universal Dependencies model, which is widely adopted in the field. We start from a discussion of the evolution of NLP and the use of corpora, go through the fundamental issues, resources, and tools related to syntactic representation, discuss the Universal Dependencies model, and present the main decisions taken in instantiating its guidelines for Brazilian Portuguese. For practical and didactic reasons, we divided the manual into two parts: the PoS Tag Annotation Manual (morphosyntactic annotation) and the Dependency Relations Annotation Manual. Both resulted from the process reported in this article and are freely available on the POeTiSA project website.

Author Biographies

Magali Duran, USP-ICMC

PhD in Linguistic Studies from UNESP, São José do Rio Preto, and postdoctoral researcher at NILC.

Maria das Graças Volpe Nunes, USP-ICMC-Núcleo Interinstitucional de Linguística Computacional (NILC)

Professor, PhD, at the Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos campus.

Thiago Alexandre Salgueiro Pardo, ICMC/USP

Professor, PhD, at the Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos campus.
