Clean Country #241

Merged
claudio.atzori merged 11 commits from clean_country into beta 2022-09-27 14:35:41 +02:00

This PR is a first, naive implementation of cleaning for the country element. Its purpose is to remove the country NL from some of the records collected from NARCIS. This is needed because NARCIS has been included among the datasources allowed for the country propagation step. Since NARCIS is an aggregator, some of the repositories it collects from do not provide only content from the Netherlands (NL). In those cases the association to the country should be removed.

The criteria to be matched are:

  • among the pids of the record there is a DOI with prefix 10.17632 (Mendeley Data) or 10.5061 (Dryad),
  • the hostedby.key of the record does not refer to any institutional repository with country NL,
  • the record has collectedfrom.value = NARCIS,
  • the country with value NL has been inserted via propagation.

If all the constraints above match, the country NL is removed from the set of the countries for the result.
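
The four criteria above can be read as a single conjunction. The following is a minimal sketch of that check, not the code in this PR: `RecordSketch`, `CountryCleanSketch` and `shouldRemoveNl` are hypothetical names, the record is flattened to plain strings instead of the real result model, and matching a DOI prefix with `startsWith(prefix + "/")` is an assumption about how the prefix test is done.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified view of a result record (not the real data model).
final class RecordSketch {
    final List<String> dois;           // DOI values found among the record's pids
    final List<String> hostedByKeys;   // hostedby.key values
    final List<String> collectedFrom;  // collectedfrom.value values
    final boolean nlFromPropagation;   // true if country NL was added via propagation

    RecordSketch(List<String> dois, List<String> hostedByKeys,
                 List<String> collectedFrom, boolean nlFromPropagation) {
        this.dois = dois;
        this.hostedByKeys = hostedByKeys;
        this.collectedFrom = collectedFrom;
        this.nlFromPropagation = nlFromPropagation;
    }
}

final class CountryCleanSketch {
    // DOI prefixes named in the description: Mendeley Data and Dryad.
    static final List<String> DOI_PREFIXES = Arrays.asList("10.17632", "10.5061");

    // Returns true only when all four criteria hold, i.e. NL should be removed.
    static boolean shouldRemoveNl(RecordSketch r, Set<String> nlInstitutionalRepoKeys) {
        // Assumption: a DOI "has prefix P" when it starts with "P/".
        boolean hasMatchingDoi = r.dois.stream()
            .anyMatch(doi -> DOI_PREFIXES.stream().anyMatch(p -> doi.startsWith(p + "/")));
        boolean hostedByNlRepo = r.hostedByKeys.stream()
            .anyMatch(nlInstitutionalRepoKeys::contains);
        boolean fromNarcis = r.collectedFrom.contains("NARCIS");
        return hasMatchingDoi && !hostedByNlRepo && fromNarcis && r.nlFromPropagation;
    }
}
```

In this sketch the set of NL institutional repository keys is passed in by the caller. A record hosted by any key in that set keeps its NL country even if the other three criteria match.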

Introducing this new cleaning process changes one parameter of the workflow: the former shouldCleanContext parameter is replaced by the shouldClean parameter. If it is set to true, all the cleaning actions for the result are triggered. So far these are the cleaning of the context for sobigdata and the cleaning of the country for NARCIS.
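
The effect of the single switch can be sketched as follows. This is only an illustration under assumptions: `CleaningSwitchSketch`, `actionsToRun` and the action labels are hypothetical, and reading the flag with `Boolean.parseBoolean` is an assumption about how the workflow argument is parsed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of one flag gating every cleaning action.
final class CleaningSwitchSketch {
    // Returns the labels of the cleaning actions that would run for the given flag value.
    static List<String> actionsToRun(String shouldCleanArg) {
        // Assumption: the parameter arrives as a string such as "true" or "false".
        boolean shouldClean = Boolean.parseBoolean(shouldCleanArg);
        List<String> actions = new ArrayList<>();
        if (shouldClean) {
            actions.add("clean context (sobigdata)");
            actions.add("clean country (NARCIS)");
        }
        return actions;
    }
}
```

With this shape, a further cleaning action added later would be gated by the same flag rather than by a parameter of its own.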

miriam.baglioni added the enhancement label 2022-08-08 14:27:34 +02:00
alessia.bardi was assigned by miriam.baglioni 2022-08-08 14:27:34 +02:00
claudio.atzori was assigned by miriam.baglioni 2022-08-08 14:27:36 +02:00
miriam.baglioni added 6 commits 2022-08-08 14:27:44 +02:00

Two comments on the PR description, without having seen the changeset:

  1. it does not explain why we are introducing this change. I vaguely remember that we were assuming the contents of NARCIS to cover only the national scope, and then found out this is not the case. Is this the reason? If so, please describe it.
  2. please include a summary of the changes to the parameters used when invoking the graph cleaning workflow.
claudio.atzori reviewed 2022-08-10 08:50:10 +02:00
@ -0,0 +94,4 @@
List<String> hostedBy = spark
.read()
.textFile(datasourcePath)
// .filter((FilterFunction<String>) ds -> !ds.equals(collectedfrom))

if not needed, clean it up please.
claudio.atzori marked this conversation as resolved
claudio.atzori reviewed 2022-08-10 08:52:16 +02:00
@ -0,0 +113,4 @@
if (r
.getPid()
.stream()
.anyMatch(p -> p.getQualifier().getClassid().equals("doi") && pidInParam(p.getValue(), verifyParam))) {

Could you replace the doi string with the serialization of eu.dnetlib.dhp.schema.oaf.utils.PidType#doi?
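
A minimal sketch of the requested change, under assumptions: `PidTypeSketch` is a hypothetical stand-in for the `PidType` enum named above (assumed here to expose a lowercase `doi` constant whose `toString()` yields `"doi"`), and `isDoi` is a hypothetical helper, not a method of the PR.

```java
// Hypothetical stand-in for the PidType enum referenced in the comment.
enum PidTypeSketch { doi, handle }

final class PidTypeCheckSketch {
    // Compares against the enum's serialization instead of a hard-coded "doi" literal.
    // Calling equals on the enum string keeps the check safe when classid is null.
    static boolean isDoi(String classid) {
        return PidTypeSketch.doi.toString().equals(classid);
    }
}
```

Deriving the string from the enum constant keeps the comparison tied to one definition of the pid type.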
claudio.atzori marked this conversation as resolved
claudio.atzori requested changes 2022-08-10 08:54:35 +02:00
claudio.atzori left a comment
Owner

Minor changes. Please check the inline comments.
miriam.baglioni added 1 commit 2022-08-10 15:13:25 +02:00
claudio.atzori added 1 commit 2022-09-09 10:38:23 +02:00
claudio.atzori added 1 commit 2022-09-09 12:20:23 +02:00

I have no changes to suggest. You have my green light to test it on beta.
claudio.atzori added 1 commit 2022-09-19 11:34:07 +02:00
claudio.atzori added 1 commit 2022-09-27 14:27:53 +02:00
claudio.atzori merged commit 6ad38ade74 into beta 2022-09-27 14:35:39 +02:00
claudio.atzori deleted branch clean_country 2022-09-27 14:35:59 +02:00