subjects cleaning
#239
Merged
claudio.atzori
merged 15 commits from clean_subjects
into beta
2 years ago
This PR introduces support for cleaning the result subjects. Only the subjects typed as keyword are candidates for cleaning. Using dedicated vocabularies, the bulk cleaning can identify the synonyms and replace them with the corresponding term; furthermore, the following fields are also updated:
subject.qualifier.classid is set to the vocabulary code
subject.qualifier.classname is set to the vocabulary name
Instead, subject.qualifier.schemeid and schemename are left untouched.
@ -33,0 +46,4 @@
return;
}
Qualifier newValue = vocabulary.lookup(subject.getValue());
if (!subject.getValue().equals(newValue.getClassid())) {
I do not understand why you compare subject.value and newValue.classid
I saw by myself. It is ok.
One thing: you will not change the classId of the subject if the value provided is equal to the term in the vocabulary.
You are right. However, the vocabulary.lookup method already tries to match a synonym and, in case it can't, it then looks for a matching equivalent term. Then, in case a matching term cannot be found, the method returns a qualifier set to UNKNOWN,
hence I could just exploit this to decide what to do when the value provided is equal to the term in the vocabulary.
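The fallback order described in this comment (synonym first, then equivalent term, then UNKNOWN) could be sketched roughly as follows. This is not the project's actual vocabulary.lookup code: QualifierSketch and VocabularySketch are hypothetical stand-ins, and the case-insensitive, trimmed matching is an assumption made only for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a qualifier: only classid and classname are modeled. */
final class QualifierSketch {
    static final QualifierSketch UNKNOWN = new QualifierSketch("UNKNOWN", "UNKNOWN");

    final String classid;
    final String classname;

    QualifierSketch(String classid, String classname) {
        this.classid = classid;
        this.classname = classname;
    }
}

/** Hypothetical vocabulary: terms keyed by code, plus synonyms pointing to a term code. */
final class VocabularySketch {
    private final Map<String, QualifierSketch> terms = new HashMap<>();
    private final Map<String, String> synonyms = new HashMap<>();

    void addTerm(String code, String name) {
        terms.put(norm(code), new QualifierSketch(code, name));
    }

    void addSynonym(String synonym, String code) {
        synonyms.put(norm(synonym), code);
    }

    /** Tries a synonym match first, then a term match, and falls back to UNKNOWN. */
    QualifierSketch lookup(String value) {
        if (value == null) {
            return QualifierSketch.UNKNOWN;
        }
        String key = norm(value);
        String code = synonyms.get(key);
        if (code != null) {
            QualifierSketch bySynonym = terms.get(norm(code));
            if (bySynonym != null) {
                return bySynonym;
            }
        }
        return terms.getOrDefault(key, QualifierSketch.UNKNOWN);
    }

    // Assumption: matching ignores case and surrounding whitespace.
    private static String norm(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }
}
```

With a lookup shaped like this, a value that already equals the vocabulary term still resolves to a known qualifier, so the caller can set classid and classname whenever the result is not UNKNOWN.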
Yes, I do agree
Another consideration, let's assume that for a given result
In such situation, I believe the same subject term
Opinions?
I also think it is better to have as provenance the one from the repository.
What if, instead, one subject is related to a research field and the other one to a different, incompatible one?
Should/can we check this?
Is the compatibility declared/available anywhere? That would be slippery ground; I feel it would be quite hard to implement a procedure that evaluates it automatically.
Anyway, the last commit b7c387c21f adds the grouping of the subjects during the bulk cleaning. They are grouped using the concatenation of the subject's qualifier.classid and the subject value, then the conflict resolution is delegated to the dedicated SubjectProvenanceComparator.
To update the Fields of Science and Technology classification vocabulary, the list of synonyms is available in this Zeppelin notebook: https://iis-cdh5-test-gw.ocean.icm.edu.pl/zeppelin/#/notebook/2HAR7PU7E
Ok also for me. I am going to update the FOS vocabulary with the synonyms.
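For readers following the thread, the grouping introduced by commit b7c387c21f could look roughly like the sketch below. The key is the concatenation of classid and value, as described above. SubjectSketch and the PROVENANCE comparator are hypothetical: the real SubjectProvenanceComparator is not shown in this thread, so the rule used here (a non-inferred subject wins over an inferred one) is only an assumption inspired by the preference for the repository's provenance.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, reduced subject: classid, value and a single provenance flag. */
final class SubjectSketch {
    final String classid;
    final String value;
    final boolean inferred;

    SubjectSketch(String classid, String value, boolean inferred) {
        this.classid = classid;
        this.value = value;
        this.inferred = inferred;
    }
}

final class SubjectGroupingSketch {

    // Assumed ordering: non-inferred (false) sorts before inferred (true).
    static final Comparator<SubjectSketch> PROVENANCE =
        (a, b) -> Boolean.compare(a.inferred, b.inferred);

    /** Groups subjects by classid + value and keeps one subject per group. */
    static List<SubjectSketch> dedup(List<SubjectSketch> subjects) {
        Map<String, SubjectSketch> byKey = new LinkedHashMap<>();
        for (SubjectSketch s : subjects) {
            String key = s.classid + s.value;
            // On a key collision, keep the subject the comparator ranks first.
            byKey.merge(key, s, (a, b) -> PROVENANCE.compare(a, b) <= 0 ? a : b);
        }
        return new ArrayList<>(byKey.values());
    }
}
```

A LinkedHashMap keeps the first-seen order of the groups, which makes the output stable across runs; that choice is part of the sketch, not something stated in the PR.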
Regarding "compatible" fields of science I do not think it is in the scope of the activity of this PR. We can ask the team who developed the mining module if they are interested, for the sake of testing, to know which records come with a FOS that is different from the one that has been inferred.
Merged commit 5066db3386 into beta 2 years ago.
Although this PR was already merged, its implementation is not yet complete. In fact, identifying the typed subjects and replacing them with the corresponding term declared in a vocabulary is only one part of the task.
In the case of FOS subjects, each term belongs to a hierarchy tree and, for each already existing FOS subject, the entire upper hierarchy is also provided as separate subjects, i.e. when a FOS term like 0101 mathematics is provided, its parent 01 natural sciences must also be available in the record.
This implies that the identification/cleaning of "hidden" typed subjects must be extended with a subroutine that also adds the relative parent terms, unless they are already available in the record.
This task will be addressed in a separate PR.
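As a rough idea of what such a subroutine might do, the sketch below adds the missing ancestors of each FOS term. It is not the planned implementation. It assumes that FOS codes grow by two digits per hierarchy level (as in 01 and 0101), that a subject value has the form "code label", and that a code-to-label map is available from the vocabulary; FosExpansionSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FosExpansionSketch {

    /** Parent code under the two-digits-per-level assumption; null at the top level. */
    static String parentCode(String code) {
        return code.length() > 2 ? code.substring(0, code.length() - 2) : null;
    }

    /**
     * Returns the given FOS terms plus every ancestor term not already present.
     * labelsByCode is an assumed lookup from FOS code to label, e.g. "01" -> "natural sciences".
     */
    static List<String> withParents(List<String> present, Map<String, String> labelsByCode) {
        Set<String> out = new LinkedHashSet<>(present);
        for (String term : present) {
            String code = term.split(" ", 2)[0];
            for (String p = parentCode(code); p != null; p = parentCode(p)) {
                String label = labelsByCode.get(p);
                if (label != null) {
                    // The set silently ignores parents already available in the record.
                    out.add(p + " " + label);
                }
            }
        }
        return new ArrayList<>(out);
    }
}
```

For example, a record holding only 0101 mathematics would gain 01 natural sciences, while a record that already has both would be left unchanged.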