A Data Augmentation approach to Automated Readability Assessment
DOI: https://doi.org/10.14393/DLv17a2023-21
Keywords: Natural Language Processing, Synonym Replacement, Back-translation, Data Augmentation, Automatic Readability Assessment
Abstract
Studies on how to measure text readability date back to the last century. Nonetheless, there is no consensus on which metrics are best. Natural Language Processing (NLP) tools can support this task, but they depend on a large number of training samples, which is a bottleneck to progress in the field. The main goal of this paper is to analyze the impact of two data augmentation (DA) methods on the readability classification task in Brazilian Portuguese (BP), in order to mitigate this bottleneck. To this end, we worked on a paired and classified corpus created by linguists. The corpus covers science topics, and each text comprises an original and a simplified version. As for the methodology, we considered two task-agnostic DA methods, synonym replacement and back-translation, and evaluated 75 models with different techniques and combinations of input features. For the model trained on the corpus without DA, the best hit rate reached 94.0%. When combining the NILC-Metrix metrics and contextualized word embeddings, the results exceeded 95.2%. Compared to other work on BP, the proposed methodology improved the hit rate even though it was trained on a distinct domain. Our results demonstrate that models trained with DA can perform as well as or better than those trained without augmentation and, at the same time, generalize better when applied to other domains.
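The two DA methods named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy thesaurus, the function names, and the pluggable translator callables are all hypothetical and chosen only for illustration. In practice the synonym source would be a real thesaurus and the translators would call a machine translation system.

```python
import random

# Hypothetical toy thesaurus; a real setup would load a full synonym resource.
TOY_THESAURUS = {
    "grande": ["enorme", "amplo"],
    "rápido": ["veloz", "ligeiro"],
}


def synonym_replacement(tokens, thesaurus, n_replacements, rng):
    """Return a copy of tokens with up to n_replacements words swapped for synonyms."""
    # Positions of tokens that have at least one synonym in the thesaurus.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in thesaurus]
    rng.shuffle(candidates)
    augmented = list(tokens)
    for i in candidates[:n_replacements]:
        augmented[i] = rng.choice(thesaurus[tokens[i].lower()])
    return augmented


def back_translation(text, to_pivot, from_pivot):
    """Translate text into a pivot language and back again.

    to_pivot and from_pivot are caller-supplied callables (assumed here),
    so any translation backend can be plugged in.
    """
    return from_pivot(to_pivot(text))


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "o cão grande é rápido".split()
    print(synonym_replacement(sentence, TOY_THESAURUS, 2, rng))
```

Each augmented sentence keeps the readability label of the text it was derived from, which is how such methods enlarge a small labeled corpus without new annotation.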
License
Copyright (c) 2023 Luiza Cunha de Menezes, Aline Paes, Maria José Bocorny Finatto
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Authors who publish in this journal agree to the following terms:
Authors retain the copyright and grant the journal the right of first publication, with the work simultaneously licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives License (CC BY-NC-ND 4.0), which allows sharing of the work with acknowledgment of authorship and prevents its commercial use.
Authors are authorized to enter into separate, additional contracts for the non-exclusive distribution of the version of the work published in this journal (e.g., publishing in an institutional repository or as a book chapter), with acknowledgment of authorship and initial publication in this journal.