subjects cleaning #239
Reference: D-Net/dnet-hadoop#239
This PR introduces support for cleaning the result subjects. The only candidates for cleaning are the subjects typed as keyword. Using dedicated vocabularies, the bulk cleaning can identify synonyms and replace them with the corresponding term; furthermore, the following fields are also updated:
subject.qualifier.classid is set to the vocabulary code
subject.qualifier.classname is set to the vocabulary name
Instead, subject.qualifier.schemeid and schemename are left untouched.
@ -33,0 +46,4 @@
return;
}
Qualifier newValue = vocabulary.lookup(subject.getValue());
if (!subject.getValue().equals(newValue.getClassid())) {
I do not understand why you compare subject.value and newValue.classid
I saw by myself. It is ok.
One thing: you will not change the classId of the subject if the value provided is equal to the term in the vocabulary.
You are right. However, the vocabulary.lookup method already tries to match a synonym and, if it cannot, it looks for a matching equivalent term.
Then, in case a matching term cannot be found, the method returns a qualifier set to UNKNOWN,
hence I could just exploit this to decide what to do when the value provided is equal to the term in the vocabulary.
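The lookup order described in this comment can be sketched as follows. This is a minimal, hypothetical stand-in and not the actual dnet-hadoop vocabulary class: the names VocabularySketch, addTerm and normalize, the case-insensitive matching, and the sample synonym "maths" are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for illustration; NOT the dnet-hadoop vocabulary class.
final class VocabularySketch {

    static final String UNKNOWN = "UNKNOWN";

    // Reduced qualifier carrying only a code and a name.
    static final class Qualifier {
        final String classid;
        final String classname;

        Qualifier(String classid, String classname) {
            this.classid = classid;
            this.classname = classname;
        }
    }

    private final Map<String, Qualifier> bySynonym = new HashMap<>();
    private final Map<String, Qualifier> byTerm = new HashMap<>();

    // Registers a term together with its synonyms (assumed helper).
    void addTerm(String code, String name, String... synonyms) {
        Qualifier q = new Qualifier(code, name);
        byTerm.put(normalize(code), q);
        for (String synonym : synonyms) {
            bySynonym.put(normalize(synonym), q);
        }
    }

    // Lookup order from the discussion: synonym first, then term, then UNKNOWN.
    Qualifier lookup(String value) {
        String key = normalize(value);
        Qualifier q = bySynonym.get(key);
        if (q == null) {
            q = byTerm.get(key);
        }
        return q != null ? q : new Qualifier(UNKNOWN, UNKNOWN);
    }

    // Trimmed, case-insensitive matching is an assumption of this sketch.
    private static String normalize(String s) {
        return s == null ? "" : s.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        VocabularySketch fos = new VocabularySketch();
        fos.addTerm("0101 mathematics", "0101 mathematics", "maths");
        for (String value : new String[] {"Maths", "0101 mathematics", "cooking"}) {
            // The caller only needs to check for UNKNOWN, whether the value
            // was a synonym or the term itself.
            System.out.println(value + " -> " + fos.lookup(value).classid);
        }
    }
}
```

With a lookup shaped like this, the caller does not need to compare the subject value with the returned classid: any result other than UNKNOWN means the value matched either a synonym or the term itself.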
Yes, I do agree
Another consideration, let's assume that for a given result
In such situation, I believe the same subject term
Opinions?
I also think it is better to have as provenance the one from the repository.
What if, instead, we have one subject related to a research field and another related to a different, incompatible one?
Should/can we check this?
Is the compatibility declared or available anywhere? That would be slippery ground: I feel it would be quite hard to implement a procedure that evaluates it automatically.
Anyway, the last commit b7c387c21f adds the grouping of the subjects during the bulk cleaning.
They are grouped by the concatenation of the subject's qualifier.classid and the subject value; the conflict resolution is then delegated to the dedicated SubjectProvenanceComparator.
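A rough sketch of this grouping step is given below. It is not the code from the commit: the Subject class, the inferred flag, the "::" key separator and the ordering used in PROVENANCE_ORDER are assumptions for illustration. The ordering simply prefers a non-inferred subject over an inferred one, in line with the preference expressed above for the provenance coming from the repository; the real SubjectProvenanceComparator may apply different rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and ordering rules are assumptions.
final class SubjectGroupingSketch {

    // Reduced subject: type, value, and whether it was inferred by mining.
    static final class Subject {
        final String classid;
        final String value;
        final boolean inferred;

        Subject(String classid, String value, boolean inferred) {
            this.classid = classid;
            this.value = value;
            this.inferred = inferred;
        }
    }

    // Assumed ordering: non-inferred subjects sort before inferred ones.
    static final Comparator<Subject> PROVENANCE_ORDER =
        (a, b) -> Boolean.compare(a.inferred, b.inferred);

    // Groups by classid + value and keeps the preferred subject of each group.
    static List<Subject> dedupe(List<Subject> subjects) {
        Map<String, Subject> best = new LinkedHashMap<>();
        for (Subject s : subjects) {
            // The separator is an assumption that avoids accidental key collisions.
            String key = s.classid + "::" + s.value;
            best.merge(key, s, (a, b) -> PROVENANCE_ORDER.compare(a, b) <= 0 ? a : b);
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        List<Subject> cleaned = dedupe(Arrays.asList(
            new Subject("FOS", "0101 mathematics", true),
            new Subject("FOS", "0101 mathematics", false),
            new Subject("keyword", "algebra", false)));
        for (Subject s : cleaned) {
            System.out.println(s.classid + " | " + s.value + " | inferred=" + s.inferred);
        }
    }
}
```

Keeping the comparator separate from the grouping makes it possible to change the provenance rules without touching the way groups are built.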
To update the Fields of Science and Technology classification vocabulary, the list of synonyms is available in this Zeppelin notebook: https://iis-cdh5-test-gw.ocean.icm.edu.pl/zeppelin/#/notebook/2HAR7PU7E
Ok also for me. I am going to update the FOS vocabulary with the synonyms.
Regarding "compatible" fields of science I do not think it is in the scope of the activity of this PR. We can ask the team who developed the mining module if they are interested, for the sake of testing, to know which records come with a FOS that is different from the one that has been inferred.
Although this PR was already merged, its implementation is not yet complete. In fact, identifying the typed subjects and replacing them with the corresponding term declared in a vocabulary is only one part of the task.
In the case of FOS subjects, each term belongs to a hierarchy tree and, for each existing FOS subject, the entire upper hierarchy is also provided as separate subjects. For example, when a FOS term like 0101 mathematics is provided, its parent 01 natural sciences must also be available in the record.
This implies that the identification/cleaning of "hidden" typed subjects must be extended with a subroutine that also adds the relative parent terms, unless they are already available in the record.
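One possible shape for such a subroutine is sketched below, only to make the requirement concrete. It assumes the hierarchy is available as a child-to-parent map, which is not something this PR provides; FosParentExpansionSketch, withParents and parentOf are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the parent expansion; not part of this PR.
final class FosParentExpansionSketch {

    // Returns the given terms plus every ancestor not already present.
    // parentOf maps a term to its direct parent (assumed data structure).
    static List<String> withParents(List<String> terms, Map<String, String> parentOf) {
        Set<String> expanded = new LinkedHashSet<>(terms);
        for (String term : terms) {
            String parent = parentOf.get(term);
            // Stop when the root is reached or the parent is already present;
            // the second condition also protects against cycles in the map.
            while (parent != null && expanded.add(parent)) {
                parent = parentOf.get(parent);
            }
        }
        return new ArrayList<>(expanded);
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("0101 mathematics", "01 natural sciences");
        List<String> record = Arrays.asList("0101 mathematics");
        System.out.println(withParents(record, parentOf));
    }
}
```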
This task will be addressed in a separate PR.