Evaluating a typology of signals for automatic detection of complementarity

Jackson Wilke da Cruz Souza; Ariani Di Felippo

doi:10.14393/DL52-v16n4a2022-10

Authors

Jackson Wilke da Cruz Souza UNIFAL-MG https://orcid.org/0000-0003-1881-6780
Ariani Di Felippo UFSCar https://orcid.org/0000-0002-4566-9352

Keywords:

Cross-Document Structure Theory, Automatic summarization, Multi-document Corpus, Complementarity, Textual signal

Abstract

In a cluster of news texts about the same event, two sentences from different documents may express different multi-document phenomena (redundancy, complementarity, and contradiction). Cross-Document Structure Theory (CST) provides labels that represent these phenomena explicitly. Automatically identifying the multi-document phenomena and their corresponding CST relations is useful for Automatic Multi-Document Summarization, since it helps systems interpret the meaning of the texts. In this paper, we evaluated a typology of textual signals for the automatic detection of the CST relations of complementarity (i.e., Historical background, Follow-up, and Elaboration) in a multi-document corpus of news texts in Brazilian Portuguese. Using algorithms from different machine-learning paradigms, we obtained classifiers with an overall accuracy above 90%, which indicates the potential of the signals.

Author Biographies

Jackson Wilke da Cruz Souza, UNIFAL-MG

PhD in Linguistics (UFSCar); professor at the Instituto de Ciências Sociais Aplicadas, Universidade Federal de Alfenas (UNIFAL-MG).

Ariani Di Felippo, UFSCar

PhD in Linguistics (UNESP); professor at the Departamento de Letras, Universidade Federal de São Carlos (UFSCar).
