Optimization of Integration of Toponyms by Lexical Similarity
Abstract
Identifiable real-world features are instantiated in a Geographic Database (GD), through mapping functions, as representations of reality. Each representation is individualized by the specifier attributes of the mapped class, which include at least one geometry and an identifying name (toponym) associated with the primary key. Different data producers, however, interpret reality with slight discrepancies, so representations of the same mapped feature can be similar without being identical. Toponyms in particular show small differences that result from changes over the years, from spelling variants, or from human error when the data are recorded. Consequently, when different GDs are integrated through their toponyms, pairing is incomplete, because records that describe the same feature are not recognized as such. For the toponymy class, this is caused mainly by typos introduced during data entry, especially the inversion of adjacent characters within a word. In this research, an improved version of the Dice Coefficient was developed and compared with the original method on three distinct GDs. The analysis was based on the frequencies of the characters and bigrams found in those databases. The proposed improvement rests on the hypothesis that inverted bigrams, such as 'αβ' and 'βα', may be accepted as similar when certain criteria are met. The analysis identified the most common characters and the most frequent bigrams in the databases; combining these with a distance analysis on a standard keyboard yielded a set of bigram pairs to be treated as similar. The proposal produced an average increase of 0.58% in the total number of paired instances in the GDs tested.
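The idea of accepting inverted bigrams as similar can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the lowercasing step, and the `allowed` set are assumptions made for illustration. In particular, the abstract does not list which bigram pairs the frequency and keyboard-distance analysis selected, so `allowed` is a hypothetical stand-in for that list.

```python
from collections import Counter


def bigrams(text):
    """Return the multiset of adjacent character pairs of a lowercased string."""
    s = text.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice(a, b):
    """Classic Dice coefficient on bigram multisets: 2 * |A & B| / (|A| + |B|)."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Neither string has a bigram (length < 2): fall back to equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    shared = sum((x & y).values())
    return 2 * shared / total


def dice_with_inversions(a, b, allowed):
    """Dice variant in which an unmatched bigram may also match its reversed
    form, provided the bigram (or its reverse) is in `allowed`.

    `allowed` is a hypothetical set standing in for the pairs the paper
    derives from frequency and keyboard-distance analysis.
    """
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 1.0 if a.lower() == b.lower() else 0.0
    common = x & y
    shared = sum(common.values())
    # Bigrams left over after exact matching.
    rest_x, rest_y = x - common, y - common
    for bg, n in rest_x.items():
        rev = bg[::-1]
        if bg != rev and (bg in allowed or rev in allowed):
            matched = min(n, rest_y.get(rev, 0))
            shared += matched
            rest_y[rev] -= matched  # consume the matched reversed bigrams
    return 2 * shared / total


# "Santos" typed with 'n' and 't' swapped.
print(dice("Santos", "Satnos"))
print(dice_with_inversions("Santos", "Satnos", {"nt"}))
```

In this example the swap leaves only the bigrams `sa` and `os` in common, so the classic coefficient is 2·2/10 = 0.4; accepting `nt` and `tn` as similar raises it to 2·3/10 = 0.6. Whether such a gain pushes a pair over the matching threshold depends on the threshold and on the criteria used to build the list of accepted pairs, neither of which is given in the abstract.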
This work is licensed under a Creative Commons Attribution 3.0 Unported License.
Authors who publish in this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors can enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) before and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see "The Effect of Open Access").