This PR is a first, naive implementation of the cleaning of the country element. It removes the country `NL` from the records collected from NARCIS. This is needed because NARCIS has been included among the datasources allowed for the country propagation step. Since NARCIS is an aggregator, some of the repositories it collects from do not provide only content from the Netherlands. In those cases the association to the country should be removed.
The criteria to be matched are:
* the pids of the record include a DOI with prefix `10.17632` (Mendeley Data) or `10.5061` (Dryad),
* the record does not have in `hostedby.key` any institutional repository with country `NL`,
* the record has `collectedfrom.value = NARCIS`,
* the country with value `NL` has been inserted via propagation.
If all the constraints above match, the country `NL` is removed from the set of countries for the result.
The introduction of this new cleaning process modifies one parameter in the workflow: the `shouldCleanContext` parameter is replaced by the `shouldClean` parameter. If this parameter is set to true, all the cleaning actions for the result are triggered. So far these are the context cleaning for sobigdata and the country cleaning for NARCIS.
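As a rough illustration of how the four criteria combine, here is a minimal, self-contained Java sketch. It is not the code in this PR: the class `CountryCleaningSketch`, the method names and the flattened parameters (plain lists and sets instead of the OAF model classes) are hypothetical, and matching the DOI prefix followed by `/` is an assumption about how the prefix check works.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch: names and data shapes are assumptions, not the PR's classes.
final class CountryCleaningSketch {

    // DOI prefixes listed in the description: Mendeley Data and Dryad.
    static final List<String> DOI_PREFIXES = List.of("10.17632", "10.5061");

    // True when at least one DOI of the record starts with a listed prefix.
    // Assumption: the prefix is matched together with the following "/".
    static boolean hasListedDoi(List<String> dois) {
        return dois.stream()
            .anyMatch(doi -> DOI_PREFIXES.stream().anyMatch(prefix -> doi.startsWith(prefix + "/")));
    }

    // True only when all four criteria hold, i.e. when NL should be removed.
    static boolean shouldRemoveNl(
        List<String> dois,
        Set<String> hostedByKeys,
        Set<String> nlInstitutionalRepoKeys,
        String collectedFromValue,
        boolean nlInsertedByPropagation) {

        // Criterion 2 is negated: no hostedby key may be an NL institutional repository.
        boolean hostedByNlRepo = hostedByKeys.stream().anyMatch(nlInstitutionalRepoKeys::contains);

        return hasListedDoi(dois)
            && !hostedByNlRepo
            && "NARCIS".equals(collectedFromValue)
            && nlInsertedByPropagation;
    }
}
```

Because the conditions are joined with `&&`, a record that fails any single criterion keeps its `NL` country.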
Two comments on the PR description, without having seen the changeset:
1. It misses the _why_ we are introducing this change. I vaguely remember that we were assuming the contents in NARCIS to cover only the national scope, while we found out this is not the case. Is this the reason? If so, please describe it.
2. Include a summary of the changes to be introduced in the parameters when invoking the graph cleaning workflow.
@ -0,0 +94,4 @@
List<String> hostedBy = spark
.read()
.textFile(datasourcePath)
// .filter((FilterFunction<String>) ds -> !ds.equals(collectedfrom))
If the commented-out filter is not needed, please clean it up.
@ -0,0 +113,4 @@
if (r
.getPid()
.stream()
.anyMatch(p -> p.getQualifier().getClassid().equals("doi") && pidInParam(p.getValue(), verifyParam))) {
Could you replace the `doi` string with the serialization of `eu.dnetlib.dhp.schema.oaf.utils.PidType#doi`?
Minor changes. Please check the inline comments.
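The replacement suggested for the `doi` literal could look roughly like the sketch below. To stay self-contained it declares a local stand-in enum instead of importing `eu.dnetlib.dhp.schema.oaf.utils.PidType`; the reduced set of constants, the class `DoiQualifierSketch` and the helper `isDoi` are hypothetical, and using `toString()` as the serialization is an assumption.

```java
// Hypothetical sketch: a local stand-in replaces the real PidType enum.
final class DoiQualifierSketch {

    // Stand-in for eu.dnetlib.dhp.schema.oaf.utils.PidType, reduced to two constants.
    enum PidType { doi, handle }

    // Compares the qualifier classid against the enum constant instead of a string literal.
    // Calling equals on the constant side also tolerates a null classid.
    static boolean isDoi(String classid) {
        return PidType.doi.toString().equals(classid);
    }
}
```

With this shape the literal lives in one place, so a typo in the comparison becomes a compile error rather than a silent mismatch.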
I have no changes to suggest. You have my green light to test it on beta.
Merged commit 6ad38ade74 into beta 2 years ago.