A Natural Language Processing approach to Complexity Assessment of 18th-century health literature

Leonardo Zilio; Maria José Bocorny Finatto; Renata Vieira; Paulo Quaresma

doi:10.14393/DLv17a2023-53

Authors

Leonardo Zilio, Friedrich-Alexander-Universität Erlangen-Nürnberg, https://orcid.org/0000-0002-6101-0814
Maria José Bocorny Finatto, UFRGS, https://orcid.org/0000-0002-6022-8408
Renata Vieira, University of Évora, https://orcid.org/0000-0003-2449-5477
Paulo Quaresma, University of Évora, https://orcid.org/0000-0002-5086-059X

Keywords:

Textual complexity, 18th-century Portuguese, Historical Linguistics, Historical Terminology, Digital Humanities

Abstract

In this paper, we present an experiment in complexity-level analysis of 18th-century Portuguese texts using NLP tools. The 18th century saw the realization of a new world that had been taking shape since the Renaissance; it was the period in which many of today's sciences were consolidated. One of its characteristics is that scientific writing came to be presented in national languages rather than Latin, together with an expressed wish that specialized texts be more understandable to less erudite readers. We aim to help determine whether and how these wishes were fulfilled. To this end, we use an NLP-supported methodology to detect the degree of complexity of a medical work from this period and compare it with two other works hypothesized to be of lesser and greater complexity. By using NILC-Metrix, we intend to identify features of a continuum of complexity in this kind of document.

Author Biographies

Leonardo Zilio, Friedrich-Alexander-Universität Erlangen-Nürnberg

Post-doctoral Researcher. Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany.

Maria José Bocorny Finatto, UFRGS

Full Professor of Linguistics. Universidade Federal do Rio Grande do Sul (UFRGS), Brazil.

Renata Vieira, University of Évora

Principal Investigator. CIDEHUS, Universidade de Évora, Portugal.

Paulo Quaresma, University of Évora

Full Professor of Informatics. Universidade de Évora, Portugal.
