COAR based resource types & Irish tender #350

Manually merged
claudio.atzori merged 0 commits from resource_types into beta 2023-11-29 14:38:08 +01:00

This PR introduces support for

  • the new fields meant to support the activities in the Irish tender;
  • the new COAR based resource types, by adapting the mapping implemented in the parsers for the Dublin Core and DataCite formats.

The mapping populates the instance.instanceTypeMapping field by looking for the original types in the transformed records. When the relevant XPaths don't resolve to any literal, no entry can be produced for the instanceTypeMapping list. Otherwise the mapping proceeds as follows:

  1. it looks up the vocabulary openaire::coar_resource_types_3_1 for a term, using the original resource type as a synonym. When a term is found, the entry is created with it; when not found, the original type is included in the entry anyway, for further analysis and to support the refinement of the entries in the vocabulary;
  2. the term from the vocabulary openaire::coar_resource_types_3_1 is then used to look up the corresponding user term in the vocabulary openaire::user_resource_types.

Furthermore, the mapping populates result.metaResourceType based on the instanceTypeMapping entry associated with the vocabulary openaire::coar_resource_types_3_1, by means of the vocabulary named openaire::meta_resource_types.

For further reference, the PR introducing the changes in the model is D-Net/dhp-schemas#25 (https://code-repo.d4science.org/D-Net/dhp-schemas/pulls/25).
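
The lookup chain described in this PR can be sketched as follows. This is a minimal, self-contained illustration and not the code in this PR: the in-memory `VOCS` map, the `Entry` class, the helper names and every synonym/term value (`coar-term-a`, `user-term-a`, `meta-term-a`) are assumptions made up for the example; only the three vocabulary names come from the description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class ResourceTypeMappingSketch {

    static final String COAR = "openaire::coar_resource_types_3_1";
    static final String USER = "openaire::user_resource_types";
    static final String META = "openaire::meta_resource_types";

    // Hypothetical in-memory vocabularies: vocabulary id -> (synonym -> term).
    // All synonyms and terms are placeholders, not real vocabulary content.
    static final Map<String, Map<String, String>> VOCS = new HashMap<>();
    static {
        register(COAR, "journal article", "coar-term-a");
        register(USER, "coar-term-a", "user-term-a");
        register(META, "coar-term-a", "meta-term-a");
    }

    static void register(String vocId, String synonym, String term) {
        VOCS.computeIfAbsent(vocId, k -> new HashMap<>()).put(synonym, term);
    }

    // Hypothetical stand-in for one instanceTypeMapping entry.
    static final class Entry {
        final String originalType;
        final String vocabularyName;
        final String term; // null when no synonym matched

        Entry(String originalType, String vocabularyName, String term) {
            this.originalType = originalType;
            this.vocabularyName = vocabularyName;
            this.term = term;
        }
    }

    // Empty when either the vocabulary or the synonym is unknown.
    static Optional<String> lookup(String vocId, String synonym) {
        return Optional.ofNullable(VOCS.get(vocId)).map(v -> v.get(synonym.toLowerCase()));
    }

    static List<Entry> mapTypes(String originalType) {
        final List<Entry> entries = new ArrayList<>();
        if (originalType == null || originalType.trim().isEmpty()) {
            return entries; // the XPaths resolved to no literal: nothing to produce
        }
        // Step 1: COAR lookup; the original type is kept even without a match.
        final Optional<String> coarTerm = lookup(COAR, originalType);
        entries.add(new Entry(originalType, COAR, coarTerm.orElse(null)));
        // Step 2: the COAR term acts as the synonym for the user vocabulary.
        coarTerm
            .flatMap(t -> lookup(USER, t))
            .ifPresent(userTerm -> entries.add(new Entry(originalType, USER, userTerm)));
        return entries;
    }

    // result.metaResourceType: derived from the COAR entry via the meta vocabulary.
    static Optional<String> metaResourceType(List<Entry> entries) {
        return entries.stream()
            .filter(e -> COAR.equals(e.vocabularyName) && e.term != null)
            .findFirst()
            .flatMap(e -> lookup(META, e.term));
    }

    public static void main(String[] args) {
        final List<Entry> entries = mapTypes("Journal Article");
        entries.forEach(e -> System.out.println(e.vocabularyName + " -> " + e.term));
        System.out.println("meta: " + metaResourceType(entries).orElse("n/a"));
    }
}
```

Keeping the unresolved original type in the COAR entry (with no term) is what allows the unmatched values to be collected later and added to the vocabulary as synonyms.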

claudio.atzori added 4 commits 2023-10-16 14:49:13 +02:00
claudio.atzori requested review from alessia.bardi 2023-10-16 14:49:20 +02:00
claudio.atzori requested review from giambattista.bloisi 2023-10-16 14:49:26 +02:00
claudio.atzori requested review from miriam.baglioni 2023-10-16 14:49:31 +02:00
giambattista.bloisi requested changes 2023-10-16 15:34:35 +02:00
@ -138,0 +143,4 @@
final Vocabulary vocabulary = vocs.get(vocId.toLowerCase());
return Optional
.ofNullable(vocabulary.getTerm(syn))

It looks like vocs.get(vocId.toLowerCase()) should also be wrapped in an Optional to prevent a NullPointerException.

Author
Owner

Nice catch, I'm going to make it more NPE-safe. Thanks!
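
For illustration, one NPE-safe shape for this lookup is sketched below. It is not the change that was eventually committed: the `Vocabulary` class here is a hypothetical stand-in reduced to a synonym map, and `lookupTerm` is a name invented for the example. Only `vocs.get(vocId.toLowerCase())` and `getTerm(syn)` mirror the reviewed snippet.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class NpeSafeLookupSketch {

    // Hypothetical stand-in for the project's Vocabulary class.
    static final class Vocabulary {
        private final Map<String, String> termsBySynonym = new HashMap<>();

        Vocabulary with(String synonym, String term) {
            termsBySynonym.put(synonym, term);
            return this;
        }

        String getTerm(String synonym) {
            return termsBySynonym.get(synonym); // null when the synonym is unknown
        }
    }

    static final Map<String, Vocabulary> vocs = new HashMap<>();

    // Every step that may yield null goes through the Optional chain:
    // a null id, a missing vocabulary and a missing term all end up empty.
    static Optional<String> lookupTerm(String vocId, String syn) {
        return Optional
            .ofNullable(vocId)
            .map(id -> vocs.get(id.toLowerCase()))
            .map(vocabulary -> vocabulary.getTerm(syn));
    }

    public static void main(String[] args) {
        vocs.put("voc::example", new Vocabulary().with("syn", "term"));
        System.out.println(lookupTerm("VOC::EXAMPLE", "syn")); // known vocabulary and synonym
        System.out.println(lookupTerm("voc::missing", "syn")); // unknown vocabulary
    }
}
```

`Optional.map` turns a null result of the mapping function into an empty Optional, so no explicit null checks are needed between the steps.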

claudio.atzori added 1 commit 2023-10-17 11:09:36 +02:00
claudio.atzori added 1 commit 2023-10-17 11:09:51 +02:00
claudio.atzori added this to the OpenAIRE project 2023-10-26 09:41:50 +02:00
Author
Owner

So far, this PR applies the revised resource type mappings to the contents acquired from the aggregation system. It doesn't yet cover the results that make their way into the graph in the form of actionsets, so more work is needed to update the corresponding mapping implementations.

Author
Owner

So, we found a compromise solution that @sandro.labruzzo and I believe is viable for the time being:

  1. each component responsible for creating records for the graph will produce only the originalType field in a single instanceTypeMapping element;
  2. the new vocabulary based mappings will be applied downstream, in the job responsible for grouping the records by id:

eu.dnetlib.dhp.oa.merge.GroupEntitiesSparkJob

Below is the implementation of a function that can be used to apply the cleaning.

protected List<InstanceTypeMapping> prepareInstanceTypeMapping_orig(Document doc) {
  return Optional
    .ofNullable(findOriginalType(doc))
    .map(originalType -> {
      final List<InstanceTypeMapping> mappings = Lists.newArrayList();

      if (vocs.vocabularyExists(OPENAIRE_COAR_RESOURCE_TYPES_3_1)) {

        // Step 1: resolve the original type to a COAR term, using it as a synonym
        Optional
          .ofNullable(vocs.lookupTermBySynonym(OPENAIRE_COAR_RESOURCE_TYPES_3_1, originalType))
          .ifPresent(coarTerm -> {
            mappings.add(OafMapperUtils.instanceTypeMapping(originalType, coarTerm));

            if (vocs.vocabularyExists(OPENAIRE_USER_RESOURCE_TYPES)) {

              // Step 2: resolve the COAR term to the corresponding user term
              Optional
                .ofNullable(
                  vocs.lookupTermBySynonym(OPENAIRE_USER_RESOURCE_TYPES, coarTerm.getClassid()))
                .ifPresent(
                  type -> mappings.add(OafMapperUtils.instanceTypeMapping(originalType, type)));
            }
          });
      }
      return mappings;
    })
    // No original type found in the document: no mapping entries
    .orElse(new ArrayList<>());
}
claudio.atzori added 1 commit 2023-11-22 12:22:20 +01:00
sandro.labruzzo was assigned by claudio.atzori 2023-11-22 12:22:51 +01:00
Author
Owner

@sandro.labruzzo I moved the application of the COAR based vocabularies into the GroupEntitiesSparkJob mentioned above. It expects to find one element in the instance[].instanceTypeMapping[] list with two fields set:

  • originalType = [whatever comes from the source] // the only restriction is that we support a single value here
  • vocabularyName = "openaire::coar_resource_types_3_1" // from eu.dnetlib.dhp.schema.common.ModelConstants#OPENAIRE_COAR_RESOURCE_TYPES_3_1

I'm waiting for your contribution on adapting each of the actionset generating procedures for the following sources:

  • DOIBoost
  • Datacite
  • PubMed
  • The rest of the Scholexplorer specific sources
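
As a rough sketch of what each actionset producer would need to emit, the snippet below builds the single element described above and checks it. The `TypeMapping` class is a hypothetical stand-in limited to the two fields named in this comment (the real class lives in dhp-schemas), and `forSource` and `meetsExpectation` are helper names invented for the example. Reading "one element" as "exactly one" is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class ActionSetTypeMappingSketch {

    // Same value as ModelConstants#OPENAIRE_COAR_RESOURCE_TYPES_3_1, copied here
    // so that the sketch does not depend on dhp-schemas.
    static final String OPENAIRE_COAR_RESOURCE_TYPES_3_1 = "openaire::coar_resource_types_3_1";

    // Hypothetical stand-in limited to the two fields the grouping job expects.
    static final class TypeMapping {
        String originalType;
        String vocabularyName;
    }

    // What a producer would attach to each instance: a single element carrying
    // the source type as-is, tagged with the COAR vocabulary name.
    static List<TypeMapping> forSource(String originalType) {
        final TypeMapping mapping = new TypeMapping();
        mapping.originalType = originalType;
        mapping.vocabularyName = OPENAIRE_COAR_RESOURCE_TYPES_3_1;
        final List<TypeMapping> mappings = new ArrayList<>();
        mappings.add(mapping);
        return mappings;
    }

    // Checks the assumed expectation: exactly one element with both fields set.
    static boolean meetsExpectation(List<TypeMapping> mappings) {
        return mappings != null
            && mappings.size() == 1
            && mappings.get(0).originalType != null
            && OPENAIRE_COAR_RESOURCE_TYPES_3_1.equals(mappings.get(0).vocabularyName);
    }

    public static void main(String[] args) {
        System.out.println(meetsExpectation(forSource("journal-article")));
    }
}
```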

I made a fresh checkout of the branch and installed dhp-schemas from the master, and it doesn't compile. Please @claudio.atzori check the dependencies to dhp-schemas and the current version of the snapshot deployed on nexus.

sandro.labruzzo added 2 commits 2023-11-29 12:46:17 +01:00
sandro.labruzzo added 1 commit 2023-11-29 13:15:49 +01:00
86b5775e08 added vocabulary in instanceTypeMapping for
- DOIBoost
- Datacite
- PubMed
- Scholexplorer Datasource
Author
Owner

> I made a fresh checkout of the branch and installed dhp-schemas from the master, and it doesn't compile. Please @claudio.atzori check the dependencies to dhp-schemas and the current version of the snapshot deployed on nexus.

Thanks Sandro. The dhp-schemas version to use is 4.17.2 and I'm going to resolve the conflict locally.

claudio.atzori manually merged commit 4e1aac2e2f into beta 2023-11-29 14:38:08 +01:00
claudio.atzori deleted branch resource_types 2023-11-29 14:38:16 +01:00