added relationship redistribution phase

This commit is contained in:
Claudio Atzori 2023-01-10 12:08:32 +01:00
parent 87ef2724da
commit 38e3f8b780
3 changed files with 15 additions and 1 deletions

Binary file not shown.

After

Width:  |  Height:  |  Size: 387 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 474 KiB

After

Width:  |  Height:  |  Size: 34 KiB

View File

@ -8,11 +8,14 @@ The deduplication process can be divided into three different phases:
* Candidate identification (clustering) * Candidate identification (clustering)
* Duplicates identification (pair-wise comparisons) * Duplicates identification (pair-wise comparisons)
* Duplicates grouping (transitive closure) * Duplicates grouping (transitive closure)
* Relation redistribution
<p align="center"> <p align="center">
<img loading="lazy" alt="Deduplication Workflow" src={require('../../assets/img/deduplication-workflow.png').default} width="85%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/> <img loading="lazy" alt="Deduplication Workflow" src={require('../../assets/img/deduplication-workflow.png').default} width="85%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p> </p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/1lLLSU3wsWighmxGQMNMZbgP3mg3BfDVAGVLwt4_OFA8/edit?usp=sharing)
### Candidate identification (clustering) ### Candidate identification (clustering)
Clustering is a common heuristics used to overcome the N x N complexity required to match all pairs of objects to identify the equivalent ones. The challenge is to identify a [clustering function](./clustering-functions) that maximizes the chance of comparing only records that may lead to a match, while minimizing the number of records that will not be matched while being equivalent. Since the equivalence function is to some level tolerant to minimal errors (e.g. switching of characters in the title, or minimal difference in letters), we need this function to be not too precise (e.g. a hash of the title), but also not too flexible (e.g. random ngrams of the title). On the other hand, reality tells us that in some cases equality of two records can only be determined by their PIDs (e.g. DOI) as the metadata properties are very different across different versions and no [clustering function](./clustering-functions) will ever bring them into the same cluster. Clustering is a common heuristics used to overcome the N x N complexity required to match all pairs of objects to identify the equivalent ones. The challenge is to identify a [clustering function](./clustering-functions) that maximizes the chance of comparing only records that may lead to a match, while minimizing the number of records that will not be matched while being equivalent. Since the equivalence function is to some level tolerant to minimal errors (e.g. switching of characters in the title, or minimal difference in letters), we need this function to be not too precise (e.g. a hash of the title), but also not too flexible (e.g. random ngrams of the title). On the other hand, reality tells us that in some cases equality of two records can only be determined by their PIDs (e.g. DOI) as the metadata properties are very different across different versions and no [clustering function](./clustering-functions) will ever bring them into the same cluster.
@ -26,3 +29,14 @@ To further limit the number of comparisons, a sliding window mechanism is used:
### Duplicates grouping (transitive closure) ### Duplicates grouping (transitive closure)
Once the similarity relations between pairs of records are drawn, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new representative object is obtained, which inherits all properties from the merged records and keeps track of their provenance. Once the similarity relations between pairs of records are drawn, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new representative object is obtained, which inherits all properties from the merged records and keeps track of their provenance.
### Relation redistribution
Relations involved in nodes identified as duplicated are eventually marked as virtually deleted and used as template for creating a new relation pointing to the new representative record.
Note that nodes and relationships marked as virtually deleted are not exported.
<p align="center">
<img loading="lazy" alt="Deduplication Workflow" src={require('../../assets/img/dedup-relation-fixup.png').default} width="75%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/1cDEuVhWnSO8lUZs_Nd748vKfIPxg10jbwKSVZlv33Mg/edit?usp=sharing)