Fields under vocabulary control. Model adaptation & cleaning workflow #19

Closed
opened 2020-06-09 18:50:24 +02:00 by claudio.atzori · 2 comments

Cleaning the graph has become increasingly urgent and is now a blocking issue for the promotion of the beta graph to production.

Many fields that should be controlled by vocabularies are polluted by unmapped terms, which come from several sources:

  • content in the aggregation system needing to be re-transformed;
  • action sets produced recently as well as in the past.

An efficient way to capture all the unmapped terms is to run a cleaning process in bulk as part of the graph processing pipeline, so as to define a single point where all the fields in the model under vocabulary control are checked at once and, where possible, fixed according to the synonyms available in the vocabularies.
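
As a rough illustration of the synonym-based fix described above, here is a minimal sketch. The `Vocabulary` class, its methods and the example terms are assumptions made for this illustration and are not the actual vocabulary API of the project; unmapped values are simply left untouched.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public class VocabularySketch {

    /** Hypothetical in-memory vocabulary: maps normalised synonyms to a controlled term. */
    public static class Vocabulary {
        private final Map<String, String> termBySynonym = new HashMap<>();

        /** Registers a controlled term together with the synonyms that should resolve to it. */
        public void addTerm(String term, String... synonyms) {
            termBySynonym.put(normalise(term), term);
            for (String synonym : synonyms) {
                termBySynonym.put(normalise(synonym), term);
            }
        }

        /** Looks up the controlled term for a raw value, ignoring case and surrounding spaces. */
        public Optional<String> lookup(String raw) {
            if (raw == null) {
                return Optional.empty();
            }
            return Optional.ofNullable(termBySynonym.get(normalise(raw)));
        }

        private static String normalise(String value) {
            return value.trim().toLowerCase(Locale.ROOT);
        }
    }

    /** Returns the controlled term when the raw value is a known synonym, else the raw value. */
    public static String clean(Vocabulary vocabulary, String raw) {
        return vocabulary.lookup(raw).orElse(raw);
    }

    public static void main(String[] args) {
        // Example terms are illustrative only.
        Vocabulary accessRights = new Vocabulary();
        accessRights.addTerm("OPEN", "open access", "free access");

        System.out.println(clean(accessRights, " Open Access "));   // known synonym
        System.out.println(clean(accessRights, "something else"));  // unmapped, left as-is
    }
}
```

Running this once over every controlled field gives the single checkpoint mentioned above; values that stay unmapped can then be collected and reported.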

In the past this procedure was driven by dedicated profiles in the IS, each declaring an XPath that pointed to the XML fragment of the bibliographic record containing the information to be cleaned. This approach, although quite generic, is not well suited to the current graph representation. One alternative would be to port the same implementation to JSON path, but to my knowledge there is no Java library offering a query/manipulation language expressive enough to easily alter the structured fields (Qualifiers) subject to vocabulary control.

Another option is to use Java reflection to navigate the graph object structure, process all the Qualifier(s) found in a given <T extends Oaf> object instance, and apply the cleaning procedure using the vocabulary declared in the Qualifier itself (schemeid). This approach has one main drawback that limits its generality: all the fields subject to vocabulary control MUST be of type Qualifier (so far this has always been the case, except for the Result.Instance.Refereed field), so the cleaner implementation is bound to the dhp.Oaf model, at least in the first version.
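
The reflection-based idea can be sketched roughly as follows. This is not the actual cleaner: `Qualifier`, `Instance` and `Result` below are simplified stand-ins for the schema classes, and the scheme id, the terms and the `visitQualifiers`/`clean` helpers are hypothetical names chosen for the example. The sketch assumes the object graph is a tree (no cycle detection).

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Consumer;

public class QualifierScannerSketch {

    // Simplified stand-ins for the model classes; the real ones carry more fields.
    public static class Qualifier {
        public String classid;
        public String classname;
        public String schemeid;

        public Qualifier(String classid, String schemeid) {
            this.classid = classid;
            this.classname = classid;
            this.schemeid = schemeid;
        }
    }

    public static class Instance {
        public Qualifier refereed;
    }

    public static class Result {
        public Qualifier resulttype;
        public List<Instance> instance = new ArrayList<>();
    }

    /** Visits every Qualifier reachable from the given node via fields and collections. */
    public static void visitQualifiers(Object node, Consumer<Qualifier> action) {
        if (node == null) {
            return;
        }
        if (node instanceof Qualifier) {
            action.accept((Qualifier) node);
            return;
        }
        if (node instanceof Collection) {
            for (Object item : (Collection<?>) node) {
                visitQualifiers(item, action);
            }
            return;
        }
        // Descend only into this sketch's own model classes, never into JDK types.
        if (node.getClass().getEnclosingClass() != QualifierScannerSketch.class) {
            return;
        }
        for (Class<?> c = node.getClass(); c != null && c != Object.class; c = c.getSuperclass()) {
            for (Field field : c.getDeclaredFields()) {
                if (Modifier.isStatic(field.getModifiers())) {
                    continue;
                }
                field.setAccessible(true);
                try {
                    visitQualifiers(field.get(node), action);
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
    }

    /** Rewrites each Qualifier using the synonym table selected by its own schemeid. */
    public static void clean(Object record, Map<String, Map<String, String>> synonymsByScheme) {
        visitQualifiers(record, q -> {
            Map<String, String> synonyms = synonymsByScheme.get(q.schemeid);
            if (synonyms == null || q.classid == null) {
                return; // no vocabulary for this scheme, or nothing to map
            }
            String term = synonyms.get(q.classid.trim().toLowerCase(Locale.ROOT));
            if (term != null) {
                q.classid = term;
                q.classname = term;
            }
        });
    }

    public static void main(String[] args) {
        Instance instance = new Instance();
        instance.refereed = new Qualifier("Peer Reviewed", "example:review_levels");
        Result result = new Result();
        result.instance.add(instance);

        // Hypothetical vocabulary content: lower-cased synonym -> controlled term.
        Map<String, Map<String, String>> vocabularies = Collections.singletonMap(
            "example:review_levels", Collections.singletonMap("peer reviewed", "peerReviewed"));

        clean(result, vocabularies);
        System.out.println(instance.refereed.classid);
    }
}
```

The sketch also shows the drawback noted above: a controlled field that is not typed as a Qualifier is never visited, so it cannot be cleaned this way.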

claudio.atzori self-assigned this 2020-06-09 18:50:24 +02:00
Author
Owner

A first implementation of the cleaning workflow is ready. It features:

  • A data model change: the type of the Result.Instance.Refereed field changes from Field<String> to Qualifier;
  • The cleaning rule implementation, including the cleaning criteria mapping definitions and a reflection-based scanner for Oaf-typed objects;
  • A basic unit test verifying that the vocabulary terms are applied to a small synthetic example record.

These changes were merged into the master branch, and the schema module was released as dhp-1.2.2.

Author
Owner

The cleaning workflow is now implemented and covers the normalization of the dhp.Oaf model fields typed as Qualifier.

claudio.atzori added the enhancement label 2020-07-27 18:03:00 +02:00
Reference: D-Net/dnet-hadoop#19