deduplication workflow to consider pre-existing (dis)equality relationships #32

Open
opened 2020-07-28 12:00:36 +02:00 by claudio.atzori · 1 comment

The current deduplication workflow implementation does not exploit pre-existing relationships provided to the graph by stateful subsystems.

The use case built around the OpenOrgs DB is one of those and will provide:

  • dedup similarity relationships indicating the equivalence between pairs of organizations; these are the same sort of relationship objects that the deduplication process already builds, so they could be taken into account by isolating them in the relation table and performing the union with the set of relationships produced by the dedup algorithm.
  • dis-equality relationships indicating that two organizations must NOT be considered equivalent. At the moment we don't have a set of labels expressing the dis-equality semantics, so please propose one (relType, subRelType, relClass). The workflow should use the dis-equality relationships to discard relationships generated by false positive matches of the dedup algorithm.
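To make the two points above concrete, here is a minimal sketch of the intended merge logic in plain Python. It is not the workflow's actual code: the function names, the representation of a relationship as an unordered pair of ids, and the choice to always trust pre-existing equivalences over dis-equalities are assumptions made for illustration only.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def normalize(pair: Pair) -> Pair:
    """Order the two ids so that (a, b) and (b, a) compare equal."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


def merge_relations(
    dedup_rels: Iterable[Pair],
    known_equal: Iterable[Pair],
    known_different: Iterable[Pair],
) -> Set[Pair]:
    """Combine dedup output with pre-existing (dis)equality relationships.

    - dedup_rels: similarity pairs produced by the dedup algorithm
    - known_equal: pre-existing similarity pairs (e.g. from OpenOrgs)
    - known_different: pre-existing dis-equality pairs (hypothetical input,
      since no labels for this semantics exist yet)
    """
    blocked = {normalize(p) for p in known_different}
    # Discard false positives: dedup pairs contradicted by a dis-equality.
    filtered = {normalize(p) for p in dedup_rels} - blocked
    # Union with the pre-existing equivalences (assumed to be trusted).
    return filtered | {normalize(p) for p in known_equal}


if __name__ == "__main__":
    merged = merge_relations(
        dedup_rels=[("a", "b"), ("c", "b")],
        known_equal=[("d", "a")],
        known_different=[("b", "c")],
    )
    print(sorted(merged))
```

Note that this only filters direct pairs. If groups of duplicates are later built by transitive closure, two organizations declared different could still end up in the same group through a third one, so the real workflow would likely need to handle that case as well.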
claudio.atzori added the enhancement label 2020-07-28 12:00:36 +02:00
michele.artini was assigned by claudio.atzori 2020-07-28 12:00:36 +02:00
michele.debonis was assigned by claudio.atzori 2020-07-28 12:00:36 +02:00
sandro.labruzzo was assigned by claudio.atzori 2020-07-28 12:09:04 +02:00

I created a graph that contains:

  • the Corda organizations (trust: 0.8, prefixes: corda_______ and corda__h2020)
  • the preliminary OpenOrgs organizations (imported from Grid/ROR) (trust: 0.99, prefix: openorgs____)
  • alternative organizations related to the OpenOrgs organizations, one for each alternative name (trust: 0.5, prefix: openorgsmesh)
  • Similarity relations between main and alternative OpenOrgs organizations

The graph is available in /tmp/graph_openorgs_and_corda.

I have also updated the dnet:pid_types vocabulary, adding the terms: ROR and GRID.

@michele.debonis In the tests performed some months ago, the final result of the dedup process was a TSV with these fields:

  • local_id (the openorgs id)
  • oa_original_id (the openaire id)
  • oa_name
  • oa_acronym
  • oa_country
  • oa_url
  • oa_collectedfrom

In addition to this TSV, you should produce another TSV containing the Corda organizations that have not been related to an OpenOrgs organization.
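As a rough illustration of how the second TSV could be derived, the sketch below selects the Corda organizations that have no similarity relation to an OpenOrgs organization. Several things here are assumptions and not part of the existing workflow: the function names, the id layout (an optional `NN|` type prefix followed by `namespace::suffix`), the decision to count both `openorgs____` and `openorgsmesh` as the OpenOrgs side, and the single `oa_original_id` output column.

```python
import csv
import io
from typing import Iterable, List, TextIO, Tuple

# Namespace prefixes taken from the graph description above.
CORDA_PREFIXES = ("corda_______", "corda__h2020")
# Assumption: a match with an alternative-name record also counts.
OPENORGS_PREFIXES = ("openorgs____", "openorgsmesh")


def ns_prefix(org_id: str) -> str:
    """Extract the namespace prefix, assuming ids look like '20|prefix::suffix'."""
    return org_id.split("|")[-1].split("::")[0]


def unmatched_corda(
    org_ids: Iterable[str], similarity_pairs: Iterable[Tuple[str, str]]
) -> List[str]:
    """Return the Corda ids that are not paired with any OpenOrgs id."""
    matched = set()
    for a, b in similarity_pairs:
        pa, pb = ns_prefix(a), ns_prefix(b)
        if pa in OPENORGS_PREFIXES and pb in CORDA_PREFIXES:
            matched.add(b)
        elif pb in OPENORGS_PREFIXES and pa in CORDA_PREFIXES:
            matched.add(a)
    return sorted(
        i for i in org_ids if ns_prefix(i) in CORDA_PREFIXES and i not in matched
    )


def write_tsv(rows: Iterable[str], out: TextIO) -> None:
    """Write one id per line, tab-separated, with a header row."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["oa_original_id"])  # hypothetical column choice
    for row in rows:
        writer.writerow([row])


if __name__ == "__main__":
    ids = ["20|corda_______::1", "20|corda__h2020::2", "20|openorgs____::9"]
    pairs = [("20|openorgs____::9", "20|corda_______::1")]
    buffer = io.StringIO()
    write_tsv(unmatched_corda(ids, pairs), buffer)
    print(buffer.getvalue(), end="")
```

This only looks at direct pairs; if the dedup output is expressed as groups of duplicates, the check would instead be whether a Corda organization shares a group with at least one OpenOrgs organization.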

Reference: D-Net/dnet-hadoop#32