subjects cleaning #239

Merged
claudio.atzori merged 15 commits from clean_subjects into beta 2 years ago
Owner

This PR introduces support for cleaning the result subjects. Only the subjects typed as keyword are candidates for cleaning. Using dedicated vocabularies, the bulk cleaning can identify synonyms and replace them with the corresponding term; in addition, the following fields are updated:

  • subject.qualifier.classid is set to the vocabulary code
  • subject.qualifier.classname is set to the vocabulary name

The fields subject.qualifier.schemeid and subject.qualifier.schemename, instead, are left untouched.
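The behaviour described above can be illustrated with a minimal sketch. The `Subject` and `Vocabulary` classes below are simplified, hypothetical stand-ins and not the actual dnet-hadoop model classes; the synonym map is an assumption about how a vocabulary could expose its synonyms.

```java
import java.util.Map;

// Simplified, hypothetical stand-ins for the real model classes.
final class SubjectCleaningSketch {

    static final class Subject {
        String value;
        String classid;    // subject.qualifier.classid
        String classname;  // subject.qualifier.classname
        String schemeid;   // subject.qualifier.schemeid, never modified here
        String schemename; // subject.qualifier.schemename, never modified here

        Subject(String value, String classid, String classname, String schemeid, String schemename) {
            this.value = value;
            this.classid = classid;
            this.classname = classname;
            this.schemeid = schemeid;
            this.schemename = schemename;
        }
    }

    // Assumed vocabulary shape: a code, a name and a synonym -> term map.
    static final class Vocabulary {
        final String code;
        final String name;
        final Map<String, String> synonymToTerm;

        Vocabulary(String code, String name, Map<String, String> synonymToTerm) {
            this.code = code;
            this.name = name;
            this.synonymToTerm = synonymToTerm;
        }
    }

    // Replaces a keyword subject matching a synonym with the vocabulary term.
    static void clean(Subject subject, Vocabulary vocabulary) {
        if (subject.value == null || !"keyword".equals(subject.classid)) {
            return; // only subjects typed as keyword are candidates
        }
        String term = vocabulary.synonymToTerm.get(subject.value);
        if (term == null) {
            return; // no synonym matched: leave the subject as it is
        }
        subject.value = term;
        subject.classid = vocabulary.code;   // set to the vocabulary code
        subject.classname = vocabulary.name; // set to the vocabulary name
        // schemeid and schemename are intentionally left untouched
    }
}
```

In this sketch a keyword subject whose value matches a synonym ends up with the term as its value and the vocabulary code and name as classid and classname, while any other subject passes through unchanged.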

claudio.atzori added 4 commits 2 years ago
claudio.atzori added 1 commit 2 years ago
claudio.atzori added 3 commits 2 years ago
miriam.baglioni reviewed 2 years ago
@ -33,0 +46,4 @@
return;
}
Qualifier newValue = vocabulary.lookup(subject.getValue());
if (!subject.getValue().equals(newValue.getClassid())) {
Collaborator

I do not understand why you compare subject.value and newValue.classid

Collaborator

I worked it out by myself. It is OK.

One thing: you will not change the classId of the subject if the value provided is equal to the term in the vocabulary.

Poster
Owner

You are right. However, the vocabulary.lookup method already tries to match a synonym and, if it cannot, it then looks for a matching term:

```
public Qualifier lookup(String id) {
	return Optional
		.ofNullable(getSynonymAsQualifier(id))
		.orElse(getTermAsQualifier(id));
}
```

Then, if the id is blank, the method returns a qualifier set to UNKNOWN; if no matching term can be found, it returns a qualifier built from the id itself:

```
public Qualifier getTermAsQualifier(final String termId) {
	if (StringUtils.isBlank(termId)) {
		return OafMapperUtils.unknown(getId(), getName());
	} else if (termExists(termId)) {
		final VocabularyTerm t = getTerm(termId);
		return OafMapperUtils.qualifier(t.getId(), t.getName(), getId(), getName());
	} else {
		return OafMapperUtils.qualifier(termId, termId, getId(), getName());
	}
}
```

hence I could just exploit this to decide what to do when the value provided is equal to the term in the vocabulary.
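To make the interaction between the lookup fallback and the `subject.getValue().equals(newValue.getClassid())` check easier to follow, here is a minimal sketch. `LookupSketch`, its two maps and the `UNKNOWN` placeholder values are assumptions made for illustration; only the control flow mirrors the two methods quoted in this comment.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified vocabulary; only the control flow mirrors the quoted methods.
final class LookupSketch {

    static final class Qualifier {
        final String classid;
        final String classname;

        Qualifier(String classid, String classname) {
            this.classid = classid;
            this.classname = classname;
        }
    }

    private final Map<String, String> synonymToTermId;
    private final Map<String, String> termIdToName;

    LookupSketch(Map<String, String> synonymToTermId, Map<String, String> termIdToName) {
        this.synonymToTermId = synonymToTermId;
        this.termIdToName = termIdToName;
    }

    // Tries the synonyms first, then falls back to the terms.
    Qualifier lookup(String id) {
        return Optional
            .ofNullable(getSynonymAsQualifier(id))
            .orElse(getTermAsQualifier(id));
    }

    private Qualifier getSynonymAsQualifier(String id) {
        String termId = (id == null) ? null : synonymToTermId.get(id);
        return (termId == null) ? null : getTermAsQualifier(termId);
    }

    private Qualifier getTermAsQualifier(String termId) {
        if (termId == null || termId.trim().isEmpty()) {
            return new Qualifier("UNKNOWN", "Unknown"); // assumed placeholder values
        } else if (termIdToName.containsKey(termId)) {
            return new Qualifier(termId, termIdToName.get(termId));
        } else {
            return new Qualifier(termId, termId); // the id is echoed back
        }
    }

    // Same comparison as in the reviewed diff: rewrite only when the classid differs.
    boolean wouldBeRewritten(String value) {
        return !value.equals(lookup(value).classid);
    }
}
```

Under these assumptions, a value that matches a synonym is rewritten, while a value that already equals a term id, or that is not in the vocabulary at all, yields a qualifier whose classid equals the value and is therefore left alone.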

Collaborator

Yes, I do agree

claudio.atzori added 1 commit 2 years ago
claudio.atzori added 1 commit 2 years ago
Poster
Owner

Another consideration: let's assume that, for a given result,

  1. the cleaning does its job and produces a FOS subject;
  2. the same result has a DOI which has also been classified "by inference" with the same subject term.

In such a situation, I believe the same subject term

  1. should not be repeated, and
  2. given that a single subject exposes only one provenance, the one from the repository should prevail over the inferred one.

Opinions?

Collaborator

I also think it is better to have as provenance the one from the repository.
What if, instead, one subject relates to a research field and the other one to a different, incompatible field?
Should/can we check this?

claudio.atzori added 2 commits 2 years ago
Poster
Owner

> What instead if we have one subject related to a research field and the other one to a different not compatible one?
> Should/can we check this?

Is the compatibility declared/available anywhere? That would be slippery ground; I feel it would be quite hard to implement a procedure that evaluates it automatically.

Anyway, the last commit b7c387c21f adds the grouping of the subjects during the bulk cleaning.

They are grouped by the concatenation of the subject's qualifier.classid and the subject value; the conflict resolution is then delegated to the dedicated SubjectProvenanceComparator.
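The grouping step can be sketched as follows. This is not the code from the commit: the `Subject` class, the provenance labels `repository` and `inferred`, and the ranking are hypothetical, and the comparator here only illustrates the role played by `SubjectProvenanceComparator`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.stream.Collectors;

final class SubjectGroupingSketch {

    // Hypothetical, simplified subject: classid, value and a provenance label.
    static final class Subject {
        final String classid;
        final String value;
        final String provenance;

        Subject(String classid, String value, String provenance) {
            this.classid = classid;
            this.value = value;
            this.provenance = provenance;
        }
    }

    // Assumed ranking: anything not inferred (e.g. repository-provided) wins.
    static int rank(Subject subject) {
        return "inferred".equals(subject.provenance) ? 1 : 0;
    }

    // Stand-in for the role of SubjectProvenanceComparator: lower rank first.
    static final Comparator<Subject> BY_PROVENANCE =
        Comparator.comparingInt(SubjectGroupingSketch::rank);

    // Groups by classid + value and keeps one subject per group.
    static List<Subject> deduplicate(List<Subject> subjects) {
        return new ArrayList<>(subjects.stream()
            .collect(Collectors.toMap(
                s -> s.classid + s.value,            // grouping key
                Function.identity(),
                BinaryOperator.minBy(BY_PROVENANCE), // conflict resolution
                LinkedHashMap::new))                 // keep first-seen order
            .values());
    }
}
```

With this sketch, two subjects sharing classid and value collapse into one, and the surviving entry carries the repository provenance whenever one of the duplicates was inferred.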

claudio.atzori added 1 commit 2 years ago
claudio.atzori added 1 commit 2 years ago
claudio.atzori requested review from alessia.bardi 2 years ago
Poster
Owner

To update the Fields of Science and Technology classification vocabulary, the list of synonyms is available in this Zeppelin notebook: https://iis-cdh5-test-gw.ocean.icm.edu.pl/zeppelin/#/notebook/2HAR7PU7E

Owner

OK for me too. I am going to update the FOS vocabulary with the synonyms.
Regarding "compatible" fields of science, I do not think it is in the scope of this PR. We can ask the team who developed the mining module whether they are interested, for the sake of testing, in knowing which records come with a FOS that is different from the one that has been inferred.

claudio.atzori added 1 commit 2 years ago
claudio.atzori merged commit 5066db3386 into beta 2 years ago
claudio.atzori deleted branch clean_subjects 2 years ago
Poster
Owner

Although this PR has already been merged, its implementation is not yet complete. In fact, identifying the typed subjects and replacing them with the corresponding term declared in a vocabulary is only one part of the task.

In the case of FOS subjects, each term belongs to a hierarchy tree and, for each existing FOS subject, the entire upper hierarchy is also provided as separate subjects. For example, when a FOS term like 0101 mathematics is provided, its parent 01 natural sciences must also be available in the record.

This implies that the identification/cleaning of "hidden" typed subjects must be extended with a subroutine that also adds the relevant parent terms, unless they are already available in the record.

This task will be addressed in a separate PR.
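As a rough illustration of the expansion described above, and not of the follow-up implementation, a subroutine could look like the sketch below. The `TERMS` map is a hypothetical two-entry excerpt of the FOS vocabulary, and deriving the parent code by dropping the last two digits is an assumption based only on the 0101 mathematics / 01 natural sciences example.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FosHierarchySketch {

    // Hypothetical excerpt of the FOS vocabulary: code -> full term.
    static final Map<String, String> TERMS = Map.of(
        "01", "01 natural sciences",
        "0101", "0101 mathematics");

    // Assumption: a parent code is the child code without its last two digits.
    static List<String> ancestors(String term) {
        String code = term.split(" ", 2)[0];
        List<String> result = new ArrayList<>();
        for (int length = code.length() - 2; length >= 2; length -= 2) {
            String parent = TERMS.get(code.substring(0, length));
            if (parent != null) {
                result.add(parent);
            }
        }
        return result;
    }

    // Adds the missing parent terms; terms already in the record are not repeated.
    static Set<String> expand(Collection<String> fosSubjects) {
        Set<String> expanded = new LinkedHashSet<>(fosSubjects);
        for (String subject : fosSubjects) {
            expanded.addAll(ancestors(subject));
        }
        return expanded;
    }
}
```

Using a set keeps the "unless already available" condition implicit: a parent that is already among the record's subjects is simply not added a second time.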


Reference: D-Net/dnet-hadoop#239