Clean Country #241

Merged
claudio.atzori merged 11 commits from clean_country into beta 2 years ago
Collaborator

This PR is a first, naive implementation of the cleaning of the country element. It removes the country NL from the records collected from NARCIS. The cleaning is needed because NARCIS has been included in the allowed datasources for the country propagation step. Since NARCIS is an aggregator, some of the repositories it collects from do not provide only content from the NL. In those cases the association to the country should be removed.

The criteria to be matched are:

  • among the pids of the record there is a doi with prefix 10.17632 (Mendeley Data) or 10.5061 (Dryad),
  • the record does not have in hostedby.key any institutional repository with country NL,
  • the record has collectedfrom.value = NARCIS,
  • the country with value NL has been inserted via propagation.

If all the constraints above match, the country NL is removed from the set of the countries for the result.
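
The four criteria can be read as a single predicate followed by a filter on the country list. The following is only a minimal sketch of that reading, not the code in this changeset: the class, the `Country` holder, the method names and the idea of passing the NL institutional repository keys as a set are all assumptions made for illustration, and the prefix check (prefix followed by `/`) is one possible interpretation of "doi with prefix".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model: none of these names come from the PR.
final class CountryCleanSketch {

    // DOI prefixes named in the PR description (Mendeley Data, Dryad).
    static final List<String> DOI_PREFIXES = Arrays.asList("10.17632", "10.5061");

    /** Minimal stand-in for a country entry; `propagated` marks values added by propagation. */
    static final class Country {
        final String code;
        final boolean propagated;

        Country(String code, boolean propagated) {
            this.code = code;
            this.propagated = propagated;
        }
    }

    /** Criteria 1 to 3: listed DOI prefix, no NL institutional repository in hostedby, collected from NARCIS. */
    static boolean matchesCriteria(
            List<String> dois,
            Set<String> hostedByKeys,
            Set<String> nlInstitutionalRepoKeys,
            Set<String> collectedFromValues) {
        // Assumption: a DOI "has prefix P" when it starts with P followed by '/'.
        boolean hasListedDoi = dois.stream()
            .anyMatch(d -> DOI_PREFIXES.stream().anyMatch(p -> d.startsWith(p + "/")));
        boolean hostedByNlRepo = hostedByKeys.stream().anyMatch(nlInstitutionalRepoKeys::contains);
        boolean fromNarcis = collectedFromValues.contains("NARCIS");
        return hasListedDoi && !hostedByNlRepo && fromNarcis;
    }

    /** Criterion 4 and the removal: drop NL only where it was inserted via propagation. */
    static List<Country> clean(List<Country> countries, boolean criteriaMatch) {
        if (!criteriaMatch) {
            return countries;
        }
        List<Country> kept = new ArrayList<>();
        for (Country c : countries) {
            if (!("NL".equals(c.code) && c.propagated)) {
                kept.add(c);
            }
        }
        return kept;
    }
}
```

In this sketch an NL country that was not inserted via propagation is left untouched, which matches the fourth criterion as stated in the description.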

The introduction of this new cleaning process changes one parameter in the workflow: the shouldCleanContext parameter is replaced by the shouldClean parameter. If this parameter is set to true, all the cleaning actions for the result are triggered. So far these are the cleaning of the context for sobigdata and the cleaning of the country for NARCIS.
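
As a rough illustration of the parameter change, a single boolean now gates both cleaning actions. The class, method and action labels below are hypothetical and only mirror the description; they are not taken from the workflow definition.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the action labels are placeholders, not workflow node names.
final class CleaningSwitchSketch {

    /** Returns the cleaning actions that would run for a given value of shouldClean. */
    static List<String> actionsToRun(boolean shouldClean) {
        List<String> actions = new ArrayList<>();
        if (shouldClean) {
            actions.add("clean-context"); // context cleaning for sobigdata
            actions.add("clean-country"); // country cleaning for NARCIS
        }
        return actions;
    }
}
```

With shouldClean set to false neither action is selected, so existing invocations that passed shouldCleanContext need to switch to the new parameter name.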

miriam.baglioni added the enhancement label 2 years ago
alessia.bardi was assigned by miriam.baglioni 2 years ago
claudio.atzori was assigned by miriam.baglioni 2 years ago
miriam.baglioni added 6 commits 2 years ago
Owner

Two comments without having seen the changeset on the PR description:

  1. it does not explain why we are introducing this change. I vaguely remember that we were assuming the contents in NARCIS to cover only the national scope, while we found out this is not the case. Is this the reason? If so, please describe it.
  2. include a summary of the changes to be introduced in the parameters when invoking the graph cleaning workflow.
claudio.atzori reviewed 2 years ago
@ -0,0 +94,4 @@
List<String> hostedBy = spark
.read()
.textFile(datasourcePath)
// .filter((FilterFunction<String>) ds -> !ds.equals(collectedfrom))
Owner

if not needed, clean it up please.

claudio.atzori marked this conversation as resolved
claudio.atzori reviewed 2 years ago
@ -0,0 +113,4 @@
if (r
.getPid()
.stream()
.anyMatch(p -> p.getQualifier().getClassid().equals("doi") && pidInParam(p.getValue(), verifyParam))) {
Owner

Could you replace the doi string with the serialization of eu.dnetlib.dhp.schema.oaf.utils.PidType#doi?
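
A possible shape for this change, sketched with a local stand-in enum so that the snippet compiles on its own: only the existence of the `doi` constant in `eu.dnetlib.dhp.schema.oaf.utils.PidType` comes from this comment, and using `toString()` as its serialization is an assumption.

```java
// Hypothetical sketch: the nested enum below stands in for
// eu.dnetlib.dhp.schema.oaf.utils.PidType and is not the real class.
final class DoiConstantSketch {

    enum PidType { doi }

    /** Compares a pid classid against the enum constant instead of a string literal. */
    static boolean isDoi(String classid) {
        // Calling equals on the constant side keeps the check null-safe.
        return PidType.doi.toString().equals(classid);
    }
}
```

Putting the constant on the left side of `equals` also avoids a NullPointerException when a pid has no classid.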

claudio.atzori marked this conversation as resolved
claudio.atzori requested changes 2 years ago
claudio.atzori left a comment
Owner

Minor changes. Please check the inline comments.

miriam.baglioni added 1 commit 2 years ago
claudio.atzori added 1 commit 2 years ago
claudio.atzori added 1 commit 2 years ago
Owner

I have no changes to suggest. You have my green light to test it on beta.

claudio.atzori added 1 commit 2 years ago
claudio.atzori added 1 commit 2 years ago
claudio.atzori merged commit 6ad38ade74 into beta 2 years ago
claudio.atzori deleted branch clean_country 2 years ago

The pull request has been merged as 6ad38ade74.
3 Participants
Reference: D-Net/dnet-hadoop#241