A Data Augmentation-Based Approach to Automatic Readability Assessment

Luiza Cunha de Menezes; Aline Paes; Maria José Bocorny Finatto

doi:10.14393/DLv17a2023-21

Authors

Luiza Cunha de Menezes Universidade Federal Fluminense https://orcid.org/0000-0003-0545-7336
Aline Paes UFF https://orcid.org/0000-0002-9089-7303
Maria José Bocorny Finatto UFRGS https://orcid.org/0000-0002-6022-8408


Keywords:

Natural Language Processing, Synonym Substitution, Back-Translation, Data Augmentation, Automatic Readability Assessment

Abstract

Although studies on how to measure the readability of a text date back to the last century, there is still no consensus on which metrics are best. Natural Language Processing (NLP) tools can support this task, but they depend on a large number of training samples, which is a barrier to progress. The main goal of this article is to analyze the impact of selected data augmentation (DA) methods in addressing this barrier and supporting readability classification in Brazilian Portuguese (BP). To this end, a paired and labeled corpus was built by linguists, containing complex original texts on science topics and their simplified versions. This corpus was augmented with agnostic DA techniques: synonym substitution (SS) and back-translation (BT). In total, 75 models were evaluated, with different techniques and combinations of input features. The best result obtained on the corpus texts without augmentation was 94.0% accuracy. This result rose to 95.2% when the metrics of the NILC-Metrix system were combined with contextualized vector representations of text. Compared with other work targeting BP, the proposed methodology increased accuracy in a domain different from the training domain. We conclude that the model trained with DA performs as well as or better than those trained without augmentation and, at the same time, generalizes better when applied to other domains.
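As a rough illustration of the synonym substitution idea mentioned in the abstract, the sketch below augments a small labeled corpus by swapping words for synonyms while keeping each text's readability label. It is a minimal sketch and not the authors' implementation: the synonym table `TOY_SYNONYMS`, the function names, and the `rate` and `seed` parameters are all hypothetical and chosen only for illustration. Back-translation is not sketched here because it depends on an external translation service.

```python
import random

# Hypothetical toy synonym table, invented for illustration only.
# A real setup would draw candidates from a thesaurus resource.
TOY_SYNONYMS = {
    "grande": ["enorme", "amplo"],
    "texto": ["escrito"],
    "difícil": ["complexo", "árduo"],
}


def synonym_substitution(sentence, synonyms, rate=0.5, seed=0):
    """Replace each whitespace token that has synonyms with probability `rate`."""
    rng = random.Random(seed)  # local generator keeps the result reproducible
    out = []
    for token in sentence.split():
        candidates = synonyms.get(token.lower())
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(token)
    return " ".join(out)


def augment(corpus, synonyms, rate=0.5, seed=0):
    """Return the original (text, label) pairs plus augmented copies.

    Assumption: substituting synonyms does not change the readability
    label, so each augmented text inherits the label of its source.
    """
    augmented = list(corpus)
    for i, (text, label) in enumerate(corpus):
        new_text = synonym_substitution(text, synonyms, rate, seed + i)
        if new_text != text:  # skip copies where nothing was replaced
            augmented.append((new_text, label))
    return augmented
```

For example, `augment([("um texto grande", "complex")], TOY_SYNONYMS, rate=1.0)` returns the original pair plus one augmented pair with the same label. Tokenization by whitespace is a simplification; punctuation attached to a word would prevent a match in this sketch.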

