Priority to records from delegated authorities #187

claudio.atzori · 2022-01-19T13:00:14+01:00

When the same record is aggregated from multiple sources that are each considered authoritative for minting specific PIDs, a different mapping may be applied to each copy, which leads to inconsistent field values. To overcome this, the idea is to include such a record only once in the graph.

For the time being, this case seems to involve only Zenodo, as a delegated authority of Datacite, and the policy we are going to implement picks the version from Zenodo (as it is assumed to be richer).

This "selection" can be performed when the entities in the graph sharing the same identifier are grouped together, but the graph pipeline does not currently include any such operation between the materialisation of the raw graph and the start of the deduplication workflow.

This implies that we must introduce a new grouping phase, producing a new graph materialisation. The implementation can reuse the code of the existing grouping procedure, extended to support this additional business logic; to this end, the grouping Spark job was factored out into the `dhp-common` module.
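As a rough illustration of the grouping-plus-selection idea described above (not the actual Spark job in this PR), the following plain-Python sketch groups records by identifier and, within each group, keeps only the copies coming from the delegated authority when one is present. The `Record` type, the source names, and the function names are all hypothetical, and the sketch only shows the selection step, leaving aside whatever else the real grouping job does with each group.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Iterable, List

# Hypothetical source label; the real pipeline identifies datasources differently.
DELEGATED_AUTHORITY = "zenodo"


@dataclass(frozen=True)
class Record:
    id: str      # identifier shared by all copies of the same record
    source: str  # datasource the copy was collected from
    title: str


def select_from_group(copies: List[Record]) -> List[Record]:
    """Keep only the delegated authority's copies when at least one exists."""
    preferred = [c for c in copies if c.source == DELEGATED_AUTHORITY]
    return preferred if preferred else copies


def group_entities(records: Iterable[Record]) -> List[Record]:
    """Group records by id and apply the selection policy to each group."""
    by_id = sorted(records, key=lambda r: r.id)  # groupby needs sorted input
    selected: List[Record] = []
    for _, group in groupby(by_id, key=lambda r: r.id):
        selected.extend(select_from_group(list(group)))
    return selected


# Assumed example data: the same DOI collected from two sources.
raw_graph = [
    Record("doi:10.5281/zenodo.1", "datacite", "A dataset"),
    Record("doi:10.5281/zenodo.1", "zenodo", "A dataset (richer metadata)"),
    Record("doi:10.1234/other", "datacite", "Another dataset"),
]
grouped = group_entities(raw_graph)
```

In this sketch the record available only from Datacite is left untouched, while the one collected from both sources survives only in its Zenodo version.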

Note that the project temporarily depends on `dhp-schemas 2.10.26-SNAPSHOT` until that version is released.

claudio.atzori added the enhancement label 2022-01-19 13:00:14 +01:00

claudio.atzori added 2 commits 2022-01-19 13:00:15 +01:00

44a937f4ed factored out entity grouping implementation, extended to consider results from delegated authorities rather than identical records from other sources

62f135262e code formatting

claudio.atzori requested review from miriam.baglioni 2022-01-19 13:02:05 +01:00

claudio.atzori added 2 commits 2022-01-19 17:17:16 +01:00

391aa1373b added unit test

abfa9c6045 code formatting

claudio.atzori added 1 commit 2022-01-19 18:16:02 +01:00

3b9020c1b7 added unit test for the DispatchEntitiesJob