subjects cleaning #239

This PR introduces support for cleaning the result subjects. Only subjects typed as `keyword` are candidates for cleaning. Using dedicated vocabularies, the bulk cleaning identifies synonyms and replaces them with the corresponding term. The following fields are also updated:

* `subject.qualifier.classid` is set to the vocabulary code
* `subject.qualifier.classname` is set to the vocabulary name

`subject.qualifier.schemeid` and `schemename`, instead, are left untouched.
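
To make the behaviour described above concrete, here is a minimal sketch of the cleaning step. It is not the code in this PR: the `Subject`, `Qualifier` and `Vocabulary` classes are simplified stand-ins, the case-insensitive synonym matching is an assumption, and the vocabulary code and scheme values are placeholders.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: simplified stand-ins, not the dnet-hadoop model classes. */
class SubjectCleaningSketch {

    static class Qualifier {
        String classid, classname, schemeid, schemename;

        Qualifier(String classid, String classname, String schemeid, String schemename) {
            this.classid = classid;
            this.classname = classname;
            this.schemeid = schemeid;
            this.schemename = schemename;
        }
    }

    static class Subject {
        String value;
        Qualifier qualifier;

        Subject(String value, Qualifier qualifier) {
            this.value = value;
            this.qualifier = qualifier;
        }
    }

    /** Hypothetical vocabulary: maps synonyms to their term. */
    static class Vocabulary {
        final String code;
        final String name;
        private final Map<String, String> synonyms = new HashMap<>();

        Vocabulary(String code, String name) {
            this.code = code;
            this.name = name;
        }

        void addSynonym(String synonym, String term) {
            // Assumption: synonyms are matched case-insensitively.
            synonyms.put(synonym.toLowerCase(Locale.ROOT), term);
        }

        /** Returns the term for a synonym, or null when the value is unknown. */
        String lookup(String value) {
            return value == null ? null : synonyms.get(value.trim().toLowerCase(Locale.ROOT));
        }
    }

    static void clean(Subject subject, Vocabulary vocabulary) {
        // Only subjects typed as keyword are candidates for cleaning.
        if (subject == null || subject.qualifier == null
                || !"keyword".equals(subject.qualifier.classid)) {
            return;
        }
        String term = vocabulary.lookup(subject.value);
        if (term == null) {
            return; // not a known synonym: leave the subject as it is
        }
        subject.value = term;
        subject.qualifier.classid = vocabulary.code;
        subject.qualifier.classname = vocabulary.name;
        // schemeid and schemename are deliberately left untouched.
    }
}
```

With a vocabulary that declares `maths` as a synonym of `0101 mathematics`, a `keyword` subject with value `Maths` would end up with the term as its value and the vocabulary code and name in its qualifier, while a subject that is not typed as `keyword` is skipped.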

Another consideration: let's assume that, for a given result,

1. the cleaning does its job and produces a FOS subject;
2. the same result has a DOI which has also been classified "by inference" with the same subject term.

In such a situation, I believe that the same subject term

1. should not be repeated, and
2. given that a single subject exposes only one provenance, the one from the repository should prevail over the inferred one.

Opinions?

I also think it is better to keep the provenance from the repository.
What if, instead, one subject relates to one research field and the other to a different, incompatible one?
Should/can we check this?

> What if, instead, one subject relates to one research field and the other to a different, incompatible one?
> Should/can we check this?

Is the compatibility declared/available anywhere? That would be slippery ground; I feel it would be quite hard to implement a procedure that evaluates it automatically.

Anyway, the last commit https://code-repo.d4science.org/D-Net/dnet-hadoop/commit/b7c387c21f946adbc9da90ded95166205195edb0 adds the grouping of the subjects during the bulk cleaning.

They are grouped by the concatenation of the subject's `qualifier.classid` and the subject `value`; conflict resolution is then delegated to the dedicated `SubjectProvenanceComparator`.
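
For readers who do not want to open the commit, the grouping idea can be sketched roughly as follows. This is not the committed code: the `Subject` class is a simplified stand-in, and `REPOSITORY_FIRST` is a hypothetical comparator that only mimics the agreed rule (repository provenance prevails over inferred); the actual logic of `SubjectProvenanceComparator` may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: simplified stand-in for the subject grouping. */
class SubjectGroupingSketch {

    static class Subject {
        final String classid;
        final String value;
        final boolean inferred; // assumption: provenance reduced to a single flag

        Subject(String classid, String value, boolean inferred) {
            this.classid = classid;
            this.value = value;
            this.inferred = inferred;
        }
    }

    // Hypothetical stand-in for SubjectProvenanceComparator:
    // subjects coming from the repository sort before inferred ones.
    static final Comparator<Subject> REPOSITORY_FIRST =
            Comparator.comparingInt(s -> s.inferred ? 1 : 0);

    static List<Subject> dedup(List<Subject> subjects) {
        Map<String, Subject> best = new LinkedHashMap<>();
        for (Subject s : subjects) {
            // Grouping key: concatenation of qualifier.classid and value.
            String key = s.classid + s.value;
            // On conflict, keep the subject that the comparator ranks first.
            best.merge(key, s, (kept, candidate) ->
                    REPOSITORY_FIRST.compare(kept, candidate) <= 0 ? kept : candidate);
        }
        return new ArrayList<>(best.values());
    }
}
```

Given the same FOS term once as inferred and once from the repository, `dedup` keeps a single subject, the one from the repository; subjects with a different `classid` or `value` are not affected.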

To update the `Fields of Science and Technology classification` vocabulary, the list of synonyms is available in this Zeppelin notebook: https://iis-cdh5-test-gw.ocean.icm.edu.pl/zeppelin/#/notebook/2HAR7PU7E

OK for me too. I am going to update the FOS vocabulary with the synonyms.
Regarding "compatible" fields of science, I do not think this is in the scope of this PR. We can ask the team that developed the mining module whether they are interested, for testing purposes, in knowing which records come with a FOS different from the inferred one.

Although this PR has already been merged, its implementation is not yet complete. In fact, identifying the typed subjects and replacing them with the corresponding term declared in a vocabulary is only one part of the task.

In the case of FOS subjects, each term belongs to a hierarchy tree, and for each existing FOS subject the entire upper hierarchy is also provided as separate subjects: when a FOS term such as `0101 mathematics` is provided, its parent `01 natural sciences` must also be available in the record.

This implies that the identification/cleaning of "hidden" typed subjects must be extended with a subroutine that also adds the relevant parent terms, unless they are already available in the record.

This task will be addressed in a separate PR.
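
As a rough idea of what such a subroutine could look like, here is a sketch under explicit assumptions: a FOS term starts with its numeric code, each hierarchy level adds two digits to the code (as in `01` / `0101`), and the terms are available in a code-to-term map. The class and method names are hypothetical, and the real vocabulary may expose the hierarchy differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: expands FOS terms with their parent terms. */
class FosHierarchySketch {

    /**
     * @param fosSubjects FOS terms already in the record, e.g. "0101 mathematics"
     * @param termsByCode hypothetical lookup from FOS code to full term
     * @return the input terms plus any missing parent terms, without duplicates
     */
    static List<String> expand(List<String> fosSubjects, Map<String, String> termsByCode) {
        Set<String> result = new LinkedHashSet<>(fosSubjects);
        for (String term : fosSubjects) {
            String code = term.split(" ", 2)[0];
            // Assumption: the parent code is the child code minus its last two digits.
            for (int len = code.length() - 2; len >= 2; len -= 2) {
                String parent = termsByCode.get(code.substring(0, len));
                if (parent != null) {
                    result.add(parent); // no effect if already in the record
                }
            }
        }
        return new ArrayList<>(result);
    }
}
```

For a record holding only `0101 mathematics`, and a map that knows `01` as `01 natural sciences`, the sketch adds the parent term; if the parent is already in the record, nothing is duplicated.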

@@ -33,0 +46,4 @@
 				return;
 			}
 			Qualifier newValue = vocabulary.lookup(subject.getValue());
 			if (!subject.getValue().equals(newValue.getClassid())) {
