A Data Augmentation-Based Approach to Automatic Readability Assessment

Luiza Cunha de Menezes; Aline Paes; Maria José Bocorny Finatto

doi:10.14393/DLv17a2023-21

Authors

Luiza Cunha de Menezes Universidade Federal Fluminense https://orcid.org/0000-0003-0545-7336
Aline Paes UFF https://orcid.org/0000-0002-9089-7303
Maria José Bocorny Finatto UFRGS https://orcid.org/0000-0002-6022-8408

Keywords:

Natural Language Processing, Synonym Replacement, Back-translation, Data Augmentation, Automatic Readability Assessment

Abstract

Studies on how to measure text readability date back to the last century. Nonetheless, there is no consensus on which metrics are best. Tools from the field of Natural Language Processing (NLP) can support this task, but they depend on a large number of training samples, and that dependence is a bottleneck to progress. The main goal of this paper is to analyze the impact of two data augmentation (DA) methods on the readability classification task in Brazilian Portuguese (BP), as a way to mitigate this bottleneck. To that end, we worked with a paired and classified corpus created by linguists. The corpus covers science topics, and each text has an original and a simplified version. As for methodology, we considered two task-agnostic techniques, synonym replacement and back-translation, and evaluated 75 models with different techniques and combinations of input features. For the model trained on the corpus without DA, the best score reached a hit rate (accuracy) of 94.0%. When NILC-Metrix metrics were combined with contextualized word embeddings, the results exceeded 95.2%. Compared with other work on BP, the proposed methodology improved the hit rate while using a distinct training domain. Our results show that models trained with DA can perform as well as or better than those trained without augmentation and, at the same time, generalize better when applied to other domains.
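To make the two augmentation ideas named in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the toy thesaurus, the function names (`synonym_replacement`, `back_translate`, `augment`), and the whitespace tokenization are all hypothetical, and back-translation is shown only as a composition of two caller-supplied translation functions rather than a call to any specific service.

```python
import random

# Hypothetical toy thesaurus (assumption): a real setup would load a
# Brazilian Portuguese synonym resource instead of this hand-made dict.
TOY_THESAURUS = {
    "rápido": ["veloz", "ligeiro"],
    "estudo": ["pesquisa", "trabalho"],
}


def synonym_replacement(text, thesaurus, n=1, seed=None):
    """Return a copy of `text` with up to `n` known words swapped for synonyms."""
    rng = random.Random(seed)
    tokens = text.split()  # naive whitespace tokenization (assumption)
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in thesaurus]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        tokens[i] = rng.choice(thesaurus[tokens[i].lower()])
    return " ".join(tokens)


def back_translate(text, to_pivot, from_pivot):
    """Translate to a pivot language and back using caller-supplied functions."""
    return from_pivot(to_pivot(text))


def augment(samples, thesaurus, n=1, seed=0):
    """Keep every (text, label) pair and add one synonym-replaced variant
    with the same label whenever the replacement actually changes the text."""
    augmented = []
    for idx, (text, label) in enumerate(samples):
        augmented.append((text, label))
        variant = synonym_replacement(text, thesaurus, n=n, seed=seed + idx)
        if variant != text:
            augmented.append((variant, label))
    return augmented
```

In this sketch each augmented variant inherits the label of its source text, which is the assumption that makes task-agnostic augmentation usable for classification; whether a given replacement preserves the readability level is exactly the kind of question the paper evaluates empirically.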
