A Natural Language Processing approach to Complexity Assessment of 18th-century health literature
DOI: https://doi.org/10.14393/DLv17a2023-53
Keywords: Textual complexity, 18th-century Portuguese, Historical Linguistics, Historical Terminology, Digital Humanities
Abstract
In this paper, we present an experiment in complexity-level analysis of 18th-century Portuguese texts using NLP tools. The 18th century saw the realization of a new world that had been taking shape since the Renaissance; it was the period in which many of today's sciences were consolidated. One of its characteristics is that scientific writing began to appear in national languages rather than Latin, accompanied by the expressed wish that specialized texts be more understandable to less erudite readers. We aim to help identify whether and how this wish was fulfilled. To this end, we use an NLP-supported methodology to detect the degree of complexity of a medical work from this period and compare it with two other works hypothesized to be of lesser and greater complexity, respectively. By using NILC-Metrix, we intend to identify features of a continuum of complexity in this kind of document.
License
Copyright (c) 2023 Leonardo Zilio, Maria José Bocorny Finatto, Renata Vieira, Paulo Quaresma
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Authors who publish in this journal agree to the following terms:
Authors retain the copyright and grant the journal the right of first publication, with the work simultaneously licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0), which allows the work to be shared with acknowledgment of authorship and prevents its commercial use.
Authors are authorized to enter into separate, additional contracts for non-exclusive distribution of the version of the work published in this journal (for example, posting it in an institutional repository or publishing it as a book chapter), with acknowledgment of authorship and of its initial publication in this journal.