deduplication workflow to consider pre-existing (dis)equality relationships #32


claudio.atzori · 2020-07-28T12:00:36+02:00

claudio.atzori commented

2020-07-28 12:00:36 +02:00

The current deduplication workflow implementation does not exploit pre-existing relationships provided to the graph by stateful subsystems.

The use case built around the OpenOrgs DB is one of those and will provide:

* dedup similarity relationships indicating the equivalence between pairs of organizations. These are the same kind of relationship objects that the deduplication process already builds, so they could be taken into account by isolating them in the relation table and computing their union with the set of relationships produced by the dedup algorithm.
* dis-equality relationships indicating that two organizations must NOT be considered equivalent. At the moment we don't have a set of labels expressing the dis-equality semantics, so please propose one (relType, subRelType, relClass). The workflow should use the dis-equality relationships to discard relationships generated by false positive matches of the dedup algorithm.
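The two points above can be sketched as plain set operations. This is only an illustration of the intended semantics, not the workflow code: the function names and the representation of a relationship as an unordered pair of organization ids are assumptions made for the example.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def normalize(pair: Pair) -> Pair:
    """Order the two ids so that (a, b) and (b, a) compare as the same pair."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


def merge_similarity_relations(
    dedup_pairs: Iterable[Pair],      # produced by the dedup algorithm
    known_equal: Iterable[Pair],      # pre-existing similarity relationships
    known_different: Iterable[Pair],  # pre-existing dis-equality relationships
) -> Set[Pair]:
    blocked = {normalize(p) for p in known_different}
    # Drop dedup matches that a dis-equality relationship marks as false positives.
    kept = {normalize(p) for p in dedup_pairs} - blocked
    # Union with the pre-existing similarity relationships.
    return kept | {normalize(p) for p in known_equal}
```

Note that this filters pairs only. If groups are later built as connected components of the resulting pairs, two organizations declared different could still end up in the same group through a chain of other matches, so that step would need its own check.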


claudio.atzori added the

enhancement

label 2020-07-28 12:00:36 +02:00

michele.artini was assigned by claudio.atzori

2020-07-28 12:00:36 +02:00

michele.debonis was assigned by claudio.atzori

2020-07-28 12:00:36 +02:00

sandro.labruzzo was assigned by claudio.atzori

2020-07-28 12:09:04 +02:00

michele.artini commented

2020-07-28 16:15:37 +02:00

I created a graph that contains:

* the Corda organizations (trust: 0.8, prefixes: corda_______ and corda__h2020)
* the preliminary OpenOrgs organizations (imported from Grid/ROR) (trust: 0.99, prefix: openorgs____)
* alternative organizations related to the OpenOrgs organizations, one for each alternative name (trust: 0.5, prefix: openorgsmesh)
* similarity relations between main and alternative OpenOrgs organizations

The graph is available in /tmp/graph_openorgs_and_corda.

I have also updated the **dnet:pid_types** vocabulary, adding the terms **ROR** and **GRID**.

@michele.debonis In the tests performed some months ago, the final result of the dedup process was a TSV with these fields:

* local_id (the openorgs id)
* oa_original_id (the openaire id)
* oa_name
* oa_acronym
* oa_country
* oa_url
* oa_collectedfrom

In addition to this TSV, you should produce another TSV containing the Corda organizations that have not been related to an OpenOrgs organization.
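A minimal sketch of how the second TSV could be derived from the first one. Several things here are assumptions made for illustration: the function names, the dict-based records, the column set of the second file (the fields above minus local_id), and the idea that the Corda prefix sits at the start of the identifier, optionally after a `type|` marker.

```python
import csv
from typing import Dict, Iterable, List, TextIO

# Prefixes of the Corda organizations, as listed above.
CORDA_PREFIXES = ("corda_______", "corda__h2020")

# Assumed columns for the second TSV: same fields as the first, without local_id.
COLUMNS = ["oa_original_id", "oa_name", "oa_acronym",
           "oa_country", "oa_url", "oa_collectedfrom"]


def is_corda(org_id: str) -> bool:
    """True if the id carries one of the Corda prefixes (assumed id layout)."""
    return org_id.split("|", 1)[-1].startswith(CORDA_PREFIXES)


def unmatched_corda(organizations: Iterable[Dict[str, str]],
                    matched_rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Corda organizations whose id never appears in the first TSV."""
    matched_ids = {row["oa_original_id"] for row in matched_rows}
    return [org for org in organizations
            if is_corda(org["oa_original_id"])
            and org["oa_original_id"] not in matched_ids]


def write_tsv(rows: Iterable[Dict[str, str]], out: TextIO) -> None:
    """Write the rows as tab-separated values; missing fields stay empty."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
```

Here `matched_rows` stands for the rows of the first TSV (local_id, oa_original_id, ...), so an organization counts as related as soon as its id shows up in the oa_original_id column.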

