Clustering functions

NgramPairs

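It produces a list of keys, each formed by concatenating the ngrams of two consecutive words in the input string. In the example below, each ngram is the lowercased prefix of a word, and the short words “for” and “the” contribute no ngram.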
Example:
Input string: “Search for the Standard Model Higgs Boson”
Parameters: ngram length = 3
List of ngrams: “sea”, “sta”, “mod”, “hig”
Ngram pairs: “seasta”, “stamod”, “modhig”
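
The example above can be reproduced with a minimal Python sketch. It is an illustration under stated assumptions, not the library's implementation: the function name `ngram_pairs`, the `STOPWORDS` set and the `max_ngrams` cap are all hypothetical. The cap is there only because the example lists four ngrams and none for “Boson”; the source does not explain why.

```python
import re

# Hypothetical stopword list; the source does not say which words are skipped.
STOPWORDS = {"a", "an", "and", "for", "of", "the"}


def ngram_pairs(text, ngram_len=3, max_ngrams=4):
    """Concatenate the leading ngrams of consecutive non-stopword words."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]
    # One ngram per word: its first ngram_len characters. Shorter words are skipped.
    # max_ngrams is an assumed cap, used only to match the documented example.
    ngrams = [w[:ngram_len] for w in words if len(w) >= ngram_len][:max_ngrams]
    # Pair each ngram with the one that follows it.
    return [left + right for left, right in zip(ngrams, ngrams[1:])]


print(ngram_pairs("Search for the Standard Model Higgs Boson"))
# ['seasta', 'stamod', 'modhig']
```

Raising `max_ngrams` to 5 in this sketch would also yield “higbos”, so the cap directly controls how many keys a record receives.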

SuffixPrefix

It produces ngram pairs by concatenating the suffix of one word with the prefix of the next word in the input string.
Example:
Input string: “Search for the Standard Model Higgs Boson”
Parameters: suffix and prefix length = 3
Output list: “ardmod” (suffix of the word “Standard” + prefix of the word “Model”), “rchsta” (suffix of the word “Search” + prefix of the word “Standard”)
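
The same behaviour can be sketched in Python. This is a sketch under assumptions rather than the actual implementation: the function name `suffix_prefix`, the `STOPWORDS` set and the `max_keys` cap are hypothetical. The cap and the stopword filter are there only so that the sketch matches the two keys listed in the example.

```python
import re

# Hypothetical stopword list; the source does not say which words are skipped.
STOPWORDS = {"a", "an", "and", "for", "of", "the"}


def suffix_prefix(text, length=3, max_keys=2):
    """Join the suffix of each word with the prefix of the word that follows it."""
    words = [
        w
        for w in re.findall(r"[a-z0-9]+", text.lower())
        if w not in STOPWORDS and len(w) >= length
    ]
    # Last `length` characters of one word + first `length` characters of the next.
    keys = [left[-length:] + right[:length] for left, right in zip(words, words[1:])]
    # max_keys is an assumed cap, used only to match the documented example.
    return keys[:max_keys]


print(suffix_prefix("Search for the Standard Model Higgs Boson"))
# ['rchsta', 'ardmod']
```

The sketch returns keys in input order, whereas the example lists the same two keys in the opposite order; for clustering purposes only the set of keys matters.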