subjects cleaning #239

claudio.atzori · 2022-08-05T12:38:51+02:00

claudio.atzori commented

2022-08-05 12:38:51 +02:00

This PR introduces support for cleaning the result subjects. Only the subjects typed as `keyword` are candidates for cleaning. Using dedicated vocabularies, the bulk cleaning can identify synonyms and replace them with the corresponding term; furthermore, the following fields are also updated:

* `subject.qualifier.classid` is set to the vocabulary code
* `subject.qualifier.classname` is set to the vocabulary name

`subject.qualifier.schemeid` and `schemename`, instead, are left untouched.
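The rule described above can be sketched as follows. This is a minimal illustration and not the code in this PR: the `Subject` class, the `SYNONYMS` table, the vocabulary code/name constants and the sample scheme value are all hypothetical stand-ins for the real model and vocabularies.

```java
import java.util.HashMap;
import java.util.Map;

public class SubjectCleaningSketch {

	// Hypothetical, flattened subject: only the fields named in the description.
	static class Subject {
		String value;
		String classid;
		String classname;
		String schemeid;
		String schemename;

		Subject(String value, String classid, String classname, String schemeid, String schemename) {
			this.value = value;
			this.classid = classid;
			this.classname = classname;
			this.schemeid = schemeid;
			this.schemename = schemename;
		}
	}

	// Assumed vocabulary code and name, for illustration only.
	static final String VOCABULARY_CODE = "FOS";
	static final String VOCABULARY_NAME = "Fields of Science and Technology classification";

	// Assumed synonym -> term table; the real synonyms are declared in the vocabulary.
	static final Map<String, String> SYNONYMS = new HashMap<>();
	static {
		SYNONYMS.put("maths", "0101 mathematics");
	}

	static void clean(Subject s) {
		// Only subjects typed as keyword are candidates for cleaning.
		if (!"keyword".equals(s.classid)) {
			return;
		}
		String term = SYNONYMS.get(s.value);
		if (term == null) {
			return; // no synonym matched: the subject stays as it is
		}
		s.value = term;
		s.classid = VOCABULARY_CODE;
		s.classname = VOCABULARY_NAME;
		// schemeid and schemename are deliberately left untouched
	}

	public static void main(String[] args) {
		Subject s = new Subject("maths", "keyword", "keyword", "some-scheme", "some-scheme");
		clean(s);
		System.out.println(s.value + " | " + s.classid + " | " + s.classname + " | " + s.schemeid);
	}
}
```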


claudio.atzori added 4 commits 2022-08-05 12:38:55 +02:00

27a91841e7 WIP: cleaning of subjects

b78889a0ce WIP: cleaning of subjects

6c0fd9284b merge from beta

32cee1f619 WIP: cleaning of subjects

claudio.atzori added 1 commit 2022-08-05 12:39:08 +02:00

844f6eb465 Merge branch 'beta' into clean_subjects

claudio.atzori added 3 commits 2022-08-05 16:57:11 +02:00

4eaa063b1f cleaning of subjects

29c4cde42e Merge branch 'clean_subjects' of https://code-repo.d4science.org/D-Net/dnet-hadoop into clean_subjects

a4815f6bec Merge branch 'beta' into clean_subjects

miriam.baglioni reviewed 2022-08-08 10:52:28 +02:00

dhp-workflows/dhp-graph-mapper/src/main/java/eu/dnetlib/dhp/oa/graph/clean/CleaningRuleMap.java Outdated

						
@@ -33,0 +46,4 @@
				return;
			}
			Qualifier newValue = vocabulary.lookup(subject.getValue());
			if (!subject.getValue().equals(newValue.getClassid())) {

miriam.baglioni commented

2022-08-08 10:52:28 +02:00

I do not understand why you compare subject.value and newValue.classid

miriam.baglioni commented

2022-08-08 11:00:57 +02:00

I figured it out myself. It is ok.

One thing: you will not change the classId of the subject if the value provided is equal to the term in the vocabulary.


claudio.atzori commented

2022-08-08 12:41:34 +02:00

You are right. However, the `vocabulary.lookup` method already tries to match a synonym and, if it cannot, it looks for a matching term:

public Qualifier lookup(String id) {
	return Optional
		.ofNullable(getSynonymAsQualifier(id))
		.orElse(getTermAsQualifier(id));
}

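Then, if the given id is blank, the method returns a qualifier set to UNKNOWN; if no matching term can be found, it returns a qualifier whose classid and classname are both set to the given id: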

public Qualifier getTermAsQualifier(final String termId) {
	if (StringUtils.isBlank(termId)) {
		return OafMapperUtils.unknown(getId(), getName());
	} else if (termExists(termId)) {
		final VocabularyTerm t = getTerm(termId);
		return OafMapperUtils.qualifier(t.getId(), t.getName(), getId(), getName());
	} else {
		return OafMapperUtils.qualifier(termId, termId, getId(), getName());
	}
}

hence I could just exploit this to decide what to do when the value provided is equal to the term in the vocabulary.
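To make the fallback chain concrete, here is a self-contained sketch on a toy vocabulary with one term and one synonym. `ToyVocabulary`, `ToyQualifier`, the sample entries and the `"UNKNOWN"` literal are hypothetical; only the order of the checks mirrors the two methods quoted above. The sketch uses `orElseGet` so that the term lookup runs only when no synonym matches, whereas the quoted code uses `orElse`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class LookupSketch {

	// Hypothetical stand-in for Qualifier: just classid and classname.
	static class ToyQualifier {
		final String classid;
		final String classname;

		ToyQualifier(String classid, String classname) {
			this.classid = classid;
			this.classname = classname;
		}
	}

	static class ToyVocabulary {
		private final Map<String, String> terms = new HashMap<>(); // term id -> term name
		private final Map<String, String> synonyms = new HashMap<>(); // synonym -> term id

		ToyVocabulary() {
			terms.put("0101 mathematics", "0101 mathematics");
			synonyms.put("maths", "0101 mathematics");
		}

		// Returns null when the given string is not a known synonym.
		ToyQualifier getSynonymAsQualifier(String synonym) {
			String termId = synonyms.get(synonym);
			return termId == null ? null : new ToyQualifier(termId, terms.get(termId));
		}

		ToyQualifier getTermAsQualifier(String termId) {
			if (termId == null || termId.trim().isEmpty()) {
				return new ToyQualifier("UNKNOWN", "Unknown"); // blank input (assumed literal)
			} else if (terms.containsKey(termId)) {
				return new ToyQualifier(termId, terms.get(termId)); // exact term match
			} else {
				return new ToyQualifier(termId, termId); // no match: echo the input back
			}
		}

		ToyQualifier lookup(String id) {
			return Optional
				.ofNullable(getSynonymAsQualifier(id))
				.orElseGet(() -> getTermAsQualifier(id));
		}
	}

	public static void main(String[] args) {
		ToyVocabulary vocabulary = new ToyVocabulary();
		for (String value : new String[] { "maths", "0101 mathematics", "something else", " " }) {
			System.out.println("[" + value + "] -> " + vocabulary.lookup(value).classid);
		}
	}
}
```

Under these assumptions, a value that matches neither a synonym nor a term comes back with a classid equal to the value itself, which is the case that the comparison between `subject.getValue()` and `newValue.getClassid()` in the reviewed snippet detects.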


miriam.baglioni commented

2022-08-08 14:12:32 +02:00

Yes, I do agree

claudio.atzori added 1 commit 2022-08-08 12:34:41 +02:00

a78028dabc Merge branch 'beta' into clean_subjects

claudio.atzori added 1 commit 2022-08-08 12:48:56 +02:00

3418ce50ac cleaning of subjects: perform the cleaning when the given value is equivalent to one of the terms in the vocabulary

claudio.atzori commented

2022-08-08 12:52:16 +02:00

Another consideration: let's assume that, for a given result,

1. the cleaning does its job and produces a FOS subject;
2. the same result has a DOI which has also been classified "by inference" with the same subject term.

In such a situation, I believe that

1. the same subject term should not be repeated, and
2. given that a single subject exposes only one piece of provenance information, the one from the repository should prevail over the inferred one.

Opinions?


miriam.baglioni commented

2022-08-08 14:16:29 +02:00

I also think it is better to have the provenance from the repository.
What if, instead, we have one subject related to a research field and another one related to a different, incompatible field?
Should/can we check this?


claudio.atzori added 2 commits 2022-08-12 15:09:25 +02:00

adb526b0e1 Merge branch 'beta' into clean_subjects

b7c387c21f cleaning of subjects: avoid duplicated subjects, prioritise collected vs inferred or other sources

claudio.atzori commented

2022-08-12 15:14:06 +02:00

> What if, instead, we have one subject related to a research field and another one related to a different, incompatible field?
> Should/can we check this?

Is the compatibility declared/available anywhere? That would be slippery ground; I feel it would be quite hard to implement a procedure that evaluates it automatically.

Anyway, the last commit https://code-repo.d4science.org/D-Net/dnet-hadoop/commit/b7c387c21f946adbc9da90ded95166205195edb0 adds the grouping of the subjects during the bulk cleaning.

They are grouped by the concatenation of the subject's `qualifier.classid` and the subject `value`; the conflict resolution is then delegated to the dedicated `SubjectProvenanceComparator`.
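The grouping step can be sketched as follows. This is an assumption-laden illustration and not the actual `SubjectProvenanceComparator`: the `Subject` class, the `"collected"` and `"inferred"` provenance labels and the ranking rule are hypothetical, chosen only to show a collected subject winning over an inferred duplicate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SubjectDedupSketch {

	// Hypothetical, flattened subject with a single provenance label.
	static class Subject {
		final String classid;
		final String value;
		final String provenance;

		Subject(String classid, String value, String provenance) {
			this.classid = classid;
			this.value = value;
			this.provenance = provenance;
		}
	}

	// Assumed ranking: subjects collected from the repository come first, anything else after.
	static int rank(Subject s) {
		return "collected".equals(s.provenance) ? 0 : 1;
	}

	static final Comparator<Subject> BY_PROVENANCE = Comparator.comparingInt(SubjectDedupSketch::rank);

	static List<Subject> dedup(List<Subject> subjects) {
		// LinkedHashMap keeps the groups in the order in which they were first seen.
		Map<String, Subject> groups = new LinkedHashMap<>();
		for (Subject s : subjects) {
			String key = s.classid + s.value; // grouping key: classid concatenated with value
			// On a key clash keep the subject that the comparator ranks first.
			groups.merge(key, s, (kept, next) -> BY_PROVENANCE.compare(kept, next) <= 0 ? kept : next);
		}
		return new ArrayList<>(groups.values());
	}

	public static void main(String[] args) {
		List<Subject> cleaned = dedup(Arrays.asList(
			new Subject("FOS", "0101 mathematics", "inferred"),
			new Subject("FOS", "0101 mathematics", "collected"),
			new Subject("keyword", "algebra", "collected")));
		for (Subject s : cleaned) {
			System.out.println(s.classid + " | " + s.value + " | " + s.provenance);
		}
	}
}
```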


claudio.atzori added 1 commit 2022-09-09 10:38:51 +02:00

1203378441 Merge branch 'beta' into clean_subjects

claudio.atzori added 1 commit 2022-09-09 12:20:07 +02:00

b5f7bd30be Merge branch 'beta' into clean_subjects

claudio.atzori requested review from alessia.bardi 2022-09-09 12:27:45 +02:00

claudio.atzori commented

2022-09-09 12:28:32 +02:00

To update the `Fields of Science and Technology classification` vocabulary, the list of synonyms is available in this Zeppelin notebook: https://iis-cdh5-test-gw.ocean.icm.edu.pl/zeppelin/#/notebook/2HAR7PU7E


alessia.bardi commented

2022-09-09 12:44:22 +02:00

OK for me as well. I am going to update the FOS vocabulary with the synonyms.
Regarding "compatible" fields of science, I do not think this is in the scope of this PR. We can ask the team who developed the mining module whether they are interested, for the sake of testing, in knowing which records come with a FOS that is different from the one that has been inferred.


claudio.atzori added 1 commit 2022-09-09 15:16:41 +02:00

ff6f789b6d code formatting

claudio.atzori referenced this issue from a commit

2022-09-09 15:17:04 +02:00

Merge pull request 'subjects cleaning' (#239) from clean_subjects into beta

claudio.atzori merged commit 5066db3386 into beta

2022-09-09 15:17:07 +02:00

claudio.atzori deleted branch clean_subjects

2022-09-09 15:21:44 +02:00

claudio.atzori commented

2022-11-29 16:14:33 +01:00

Although this PR has already been merged, its implementation is not yet complete. In fact, identifying the typed subjects and replacing them with the corresponding term declared in a vocabulary is only one part of the task.

In the case of FOS subjects, each term belongs to a hierarchy tree and, for each existing FOS subject, the entire upper hierarchy is also provided as separate subjects: e.g. when a FOS term like `0101 mathematics` is provided, its parent `01 natural sciences` must also be available in the record.

This implies that the identification/cleaning of "hidden" typed subjects must be extended with a subroutine that also adds the relevant parent terms, unless they are already available in the record.

This task will be addressed in a separate PR.
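As a rough sketch of what such a subroutine might look like (it is not part of this PR, and the follow-up PR may take a different approach), the example below walks up a hypothetical child-to-parent table and adds the ancestors that are missing. The `PARENT` table, the class and the method names are assumptions; only the `0101 mathematics` / `01 natural sciences` pair comes from the comment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FosExpansionSketch {

	// Hypothetical child -> parent table; the real hierarchy would come from the FOS vocabulary.
	static final Map<String, String> PARENT = new HashMap<>();
	static {
		PARENT.put("0101 mathematics", "01 natural sciences");
	}

	// Returns the given FOS terms plus every ancestor that is not already in the list.
	static List<String> expand(List<String> fosTerms) {
		// A LinkedHashSet drops duplicates while preserving insertion order.
		Set<String> result = new LinkedHashSet<>(fosTerms);
		for (String term : fosTerms) {
			String parent = PARENT.get(term);
			while (parent != null) {
				result.add(parent); // no effect when the parent is already present
				parent = PARENT.get(parent);
			}
		}
		return new ArrayList<>(result);
	}

	public static void main(String[] args) {
		System.out.println(expand(Arrays.asList("0101 mathematics")));
	}
}
```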

