dhp model extensions #9

Open
opened 4 years ago by claudio.atzori · 7 comments
Owner

This task aims to introduce a set of extensions to the dhp.Oaf data model and to propagate the changes to other modules and workflows where needed. The changes include:

  • Addition of Relation properties as List[KeyValue];
  • We need to keep track of funder-validated result-project Relations, thus we should introduce a pair [Date - Boolean] to assert: the moment in time the validation occurred (Date) and the validation status (Boolean);
  • Introduction of Measures to be attached to Result entities (see the proposed structure below). The aim is to embed analytical results about the impact factor of publications from the work done by the BIP! Team;
  • Contexts should be moved to the OafEntity level, because we also want to context-tag projects;
  • Addition of H2020 program information in the data model for projects.

Structure for the proposed measures to be attached to Result entities

  measureName [1]
  measureId [1]
  measureDescription [0]
  measureUnit [1..n]
	UnitLabel [1]
	UnitDescription [0]
	UnitType [1]
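
A minimal sketch of how the structure above could be represented, written in Python for brevity rather than as the actual dhp model class. The class and attribute names are illustrative, and reading the `[0]` cardinality as "optional" is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MeasureUnit:
    label: str                         # UnitLabel [1]
    type: str                          # UnitType [1]
    description: Optional[str] = None  # UnitDescription, assumed optional


@dataclass
class Measure:
    name: str                          # measureName [1]
    id: str                            # measureId [1]
    units: List[MeasureUnit]           # measureUnit [1..n]
    description: Optional[str] = None  # measureDescription, assumed optional

    def __post_init__(self) -> None:
        # Enforce the [1..n] cardinality declared for measureUnit.
        if not self.units:
            raise ValueError("a measure needs at least one unit")


# Hypothetical values, for illustration only.
example_measure = Measure(
    name="example-name",
    id="example-id",
    units=[MeasureUnit(label="score", type="decimal")],
)
```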
claudio.atzori self-assigned this 4 years ago
alessia.bardi was assigned by claudio.atzori 4 years ago
Poster
Owner

Update:

  • branch dhp_oaf_model aligned with master branch;
  • Relation properties added as List[KeyValue];
  • validated/validationDate added to relationships;
  • measure type added, with a simple unit test that shows its serialization.
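
The Relation changes listed above can be pictured with the following sketch. It is not the code in the dhp_oaf_model branch: the Python classes, the `source`/`target` attribute names, and the ISO date string used for the validation date are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class KeyValue:
    key: str
    value: str


@dataclass
class Relation:
    source: str
    target: str
    properties: List[KeyValue] = field(default_factory=list)
    validated: bool = False
    validation_date: Optional[str] = None  # ISO date string (assumed format)


def to_json(rel: Relation) -> str:
    # asdict() also converts the nested KeyValue objects into plain dicts.
    return json.dumps(asdict(rel), sort_keys=True)


# Hypothetical identifiers and values, for illustration only.
rel = Relation(
    source="result-id",
    target="project-id",
    properties=[KeyValue("example-key", "example-value")],
    validated=True,
    validation_date="2020-01-01",
)
```

Serializing with `to_json(rel)` and parsing the output back is the kind of round trip a simple serialization test could check.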

@alessia.bardi do we know which data type we must consider for the addition of H2020 program information?

Poster
Owner

Another update:

  • Relation properties were included in the master branch and released in version 1.2.0.
miriam.baglioni was assigned by claudio.atzori 4 years ago
Collaborator

Update: the information for the H2020 programme has been introduced in the data model (programme is a list of objects of type Programme, each having a code (String) and a description (String)).
Each project can be associated with more than one programme.

The update to the projects is created as an action set, produced by:

  • reading the csv file at http://cordis.europa.eu/data/reference/cordisref-H2020programmes.csv for information about the programmes;
  • reading the csv file at https://cordis.europa.eu/data/cordis-h2020projects.csv for information about the projects. We cannot use the info we have in the db because the programme code is missing from our data;
  • reading the codes of the projects for corda__h2020 from the db.

We select from the set of projects in the csv only those also present in the database, then we join them with the programme information in the other csv file and create a set of atomic actions, one for each project for which we get a match.
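
The selection and join described above can be sketched on in-memory rows as follows. This is not the actual workflow code: the column names (`project_code`, `programme_codes`, `code`, `description`), the `;` separator between programme codes, and the dictionaries standing in for atomic actions are all assumptions.

```python
from typing import Dict, Iterable, List, Set


def build_programme_updates(
    project_rows: Iterable[Dict[str, str]],
    programme_rows: Iterable[Dict[str, str]],
    db_project_codes: Set[str],
) -> List[Dict[str, object]]:
    # Index the programme descriptions by programme code.
    descriptions = {row["code"]: row["description"] for row in programme_rows}

    updates: List[Dict[str, object]] = []
    for row in project_rows:
        # Keep only the projects that are also present in the database.
        if row["project_code"] not in db_project_codes:
            continue
        # A project may reference several programmes (separator is assumed).
        codes = [c.strip() for c in row["programme_codes"].split(";") if c.strip()]
        programmes = [
            {"code": c, "description": descriptions[c]}
            for c in codes
            if c in descriptions
        ]
        # One update per matched project, standing in for an atomic action.
        if programmes:
            updates.append(
                {"project_code": row["project_code"], "programme": programmes}
            )
    return updates


# Hypothetical rows, for illustration only.
example_updates = build_programme_updates(
    project_rows=[
        {"project_code": "P1", "programme_codes": "PRG-A;PRG-B"},
        {"project_code": "P2", "programme_codes": "PRG-A"},
    ],
    programme_rows=[
        {"code": "PRG-A", "description": "Programme A"},
        {"code": "PRG-B", "description": "Programme B"},
    ],
    db_project_codes={"P1"},
)
```

In this example only P1 yields an update, because P2 is not among the project codes read from the database.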
Poster
Owner

Two other requests that imply changes in the data model:

  • make simple repeatable fields unique (suggested by Michele in #23)
  • add the original OAI identifier to the index schema: the internal graph model defines the field originalId as repeatable, but provides no way to distinguish the OAI identifier assigned by the aggregator from the one assigned by the original repository (originally requested in https://issue.openaire.research-infrastructures.eu/issues/5747)
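
For the first request, one possible reading of "unique" is to drop repeated values while keeping the order of first occurrence. The helper below is a hypothetical sketch of that reading, not code from the project.

```python
from typing import Iterable, List


def unique_values(values: Iterable[str]) -> List[str]:
    # dict preserves insertion order, so the first occurrence of each value wins.
    return list(dict.fromkeys(values))
```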
claudio.atzori added the enhancement label 4 years ago
claudio.atzori added the help wanted label 4 years ago
Poster
Owner

Another aspect that impacts the model definition derives from the record merge operation performed in the deduplication workflow on each group of duplicates. Provided that the representative records produced by this operation obey the exact same model used for the duplicate records, the merge policy generally gathers all the occurrences of a given field from the duplicates and sets them in the corresponding repeatable field of the representative record being built.

This approach works well when no restrictions are applied to the N values from the duplicates. However, the field publisher is declared as non-repeatable, so the winning value depends only on the trust-based ordering performed in the merge procedure (this case was originally spotted in #5915, https://issue.openaire.research-infrastructures.eu/issues/5915). One idea to solve this would be to move the definition of the field publisher inside each record instance, but the same problem applies to every field defined as non-repeatable.
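
The difference between the two kinds of fields can be illustrated with a simplified merge. This is a sketch, not the deduplication workflow code: the dictionary-based records, the numeric `trust` value, and the choice of `title` as the repeatable field are assumptions.

```python
from typing import Dict, List, Optional


def merge_duplicates(duplicates: List[Dict[str, object]]) -> Dict[str, object]:
    # Order by decreasing trust; sorted() is stable, so ties keep input order.
    ordered = sorted(duplicates, key=lambda r: r["trust"], reverse=True)

    # Repeatable field: gather every occurrence, dropping exact repeats.
    titles: List[str] = []
    for record in ordered:
        for title in record.get("title", []):
            if title not in titles:
                titles.append(title)

    # Non-repeatable field: only the most trusted non-empty value survives.
    publisher: Optional[str] = next(
        (r["publisher"] for r in ordered if r.get("publisher")), None
    )
    return {"title": titles, "publisher": publisher}


# Hypothetical duplicates, for illustration only.
merged = merge_duplicates(
    [
        {"trust": 0.8, "title": ["A"], "publisher": "Publisher X"},
        {"trust": 0.9, "title": ["A", "B"], "publisher": "Publisher Y"},
    ]
)
```

Both titles end up in the representative record, while "Publisher X" is lost because the other record has the higher trust.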

Poster
Owner

Make room for the OpenAccess statuses:

  • Green
  • Gold
  • Hybrid
  • Bronze

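These are the statuses defined by the method (https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-) to be applied to the Unpaywall records.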
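
One way to make room for these statuses is a closed enumeration. The sketch below is hypothetical: the type name, the lowercase string values, and the parsing helper are assumptions, not part of the current model.

```python
from enum import Enum


class OpenAccessStatus(Enum):
    GREEN = "green"
    GOLD = "gold"
    HYBRID = "hybrid"
    BRONZE = "bronze"


def parse_oa_status(raw: str) -> OpenAccessStatus:
    # Raises ValueError for any label outside the four listed statuses.
    return OpenAccessStatus(raw.strip().lower())
```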

Owner

To support, in the future, the validation of context tags in the same way as we are addressing the validation of result-project links, I think we should also add the validated and validationDate properties to the Context model class.

The information about the validation of a context tag can already be set when the tag has been added via a claim, because the gateway curators can already approve/reject claims. Further ways to get this kind of feedback for tags that do not come from claims still need to be worked out.

As a reminder, the main goals of the validated property are:

  • to be able to attach a "validated badge" to relationships and context tags in the portals.
  • to exclude from the API for Sygma the results whose relationship to a project has been already validated (as this means that the result was accepted by the project manager in the Sygma portal and we do not need to suggest it again for that specific project)
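
The second goal amounts to a filter on the validated flag. The sketch below illustrates it under stated assumptions: the `ResultProjectLink` class, its attribute names, and the `suggestions_for_project` helper are hypothetical and do not describe the actual API for Sygma.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class ResultProjectLink:
    result_id: str
    project_id: str
    validated: bool = False
    validation_date: Optional[str] = None  # ISO date string (assumed format)


def suggestions_for_project(
    links: Iterable[ResultProjectLink], project_id: str
) -> List[str]:
    # Suggest only results whose link to this project is not yet validated.
    return [
        link.result_id
        for link in links
        if link.project_id == project_id and not link.validated
    ]


# Hypothetical links, for illustration only.
example_links = [
    ResultProjectLink("r1", "p1", validated=True, validation_date="2020-01-01"),
    ResultProjectLink("r2", "p1"),
    ResultProjectLink("r1", "p2"),
]
```

Here r1 is no longer suggested for p1, but it is still suggested for p2, since the validation applies to a specific result-project pair.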
Reference: D-Net/dnet-hadoop#9