Fields under vocabulary control. Model adaptation & cleaning workflow #19
Reference: D-Net/dnet-hadoop#19
Cleaning the graph has become an increasingly high priority and is now a blocking issue for the promotion of the beta graph to production.
Many terms that should be controlled by vocabularies are polluted by unmapped terms, which originate from several sources.
An efficient way to capture all the unmapped terms is to run a cleaning process in bulk along with the graph processing pipeline, so as to define a single point where all the model fields under vocabulary control are checked at once and, where possible, fixed according to the synonyms available in the vocabularies.
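As a rough illustration of the synonym-based fix described above, here is a minimal sketch of a vocabulary lookup. The Vocabulary class, its methods, and the case-insensitive matching rule are assumptions made for this example; they are not the actual vocabulary API of the project.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory vocabulary: maps normalized synonyms to a controlled term.
class Vocabulary {
    private final String id;
    private final Map<String, String> synonymToTerm = new HashMap<>();

    Vocabulary(String id) {
        this.id = id;
    }

    String getId() {
        return id;
    }

    // Registers a controlled term; the term itself is treated as its own synonym.
    Vocabulary addTerm(String term, String... synonyms) {
        synonymToTerm.put(normalize(term), term);
        for (String s : synonyms) {
            synonymToTerm.put(normalize(s), term);
        }
        return this;
    }

    // Returns the controlled term for a raw value, or empty when the value is unmapped.
    Optional<String> lookup(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(synonymToTerm.get(normalize(raw)));
    }

    // Assumption: matching ignores case and surrounding whitespace.
    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }
}
```

With a vocabulary built as new Vocabulary("dnet:languages").addTerm("eng", "English", "en") (identifier and terms chosen only for illustration), lookup(" ENGLISH ") yields the controlled term, while an empty result flags an unmapped value that can be reported rather than silently kept.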
In the past this procedure was guided by dedicated profiles in the IS, each declaring an XPath expression that pointed to the XML fragment of the bibliographic record containing the information to be cleaned. This approach, although quite generic, is not well suited to the current graph representation. One alternative would be to port the same implementation to JSONPath, but to my knowledge there is no Java library supporting a query/manipulation language expressive enough to easily alter the structured fields (Qualifiers) subject to vocabulary control.
Another implementation could instead use Java reflection to navigate the graph object structure, process all the Qualifier(s) found in a given object instance, and apply the cleaning procedure using the vocabulary declared in the Qualifier itself (schemeid). This approach has one main drawback that limits its generality: all the fields subject to vocabulary control MUST be declared with type Qualifier (so far this has always been the case, except for the Result.Instance.Refereed field), so the cleaner implementation is bound to the dhp.Oaf model, at least in the first version.
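The reflection-based traversal could look roughly like the sketch below. It is not the actual cleaner: Qualifier, Instance and Result are simplified stand-ins for the model classes, ReflectiveCleaner is a hypothetical name, vocabularies are plain maps keyed by schemeid, and the object graph is assumed to be acyclic.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Simplified stand-in for the model's Qualifier; only the fields needed here.
class Qualifier {
    String classid;
    String schemeid;

    Qualifier(String classid, String schemeid) {
        this.classid = classid;
        this.schemeid = schemeid;
    }
}

// Hypothetical model classes, used only to exercise the traversal.
class Instance {
    Qualifier refereed;

    Instance(Qualifier refereed) {
        this.refereed = refereed;
    }
}

class Result {
    Qualifier language;
    List<Instance> instance;
}

class ReflectiveCleaner {
    // schemeid -> (lower-cased synonym -> controlled term); an assumed representation.
    private final Map<String, Map<String, String>> vocabularies;

    ReflectiveCleaner(Map<String, Map<String, String>> vocabularies) {
        this.vocabularies = vocabularies;
    }

    // Recursively visits fields and collection items, fixing every Qualifier found.
    // Assumes an acyclic object graph; Map values are not traversed in this sketch.
    void clean(Object o) {
        if (o == null) {
            return;
        }
        if (o instanceof Qualifier) {
            fix((Qualifier) o);
            return;
        }
        if (o instanceof Collection) {
            for (Object item : (Collection<?>) o) {
                clean(item);
            }
            return;
        }
        // Walk the class hierarchy, stopping at JDK classes (String, Object, ...).
        for (Class<?> c = o.getClass(); c != null && !c.getName().startsWith("java."); c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers()) || f.getType().isPrimitive()) {
                    continue;
                }
                try {
                    f.setAccessible(true);
                    clean(f.get(o));
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException("cannot read field " + f.getName(), e);
                }
            }
        }
    }

    // Replaces classid with the controlled term when a synonym is known for the scheme.
    private void fix(Qualifier q) {
        if (q.classid == null || q.schemeid == null) {
            return;
        }
        Map<String, String> synonyms = vocabularies.get(q.schemeid);
        if (synonyms == null) {
            return;
        }
        String term = synonyms.get(q.classid.trim().toLowerCase(Locale.ROOT));
        if (term != null) {
            q.classid = term;
        }
    }
}
```

Because the traversal dispatches only on the Qualifier type, a field holding a controlled value as a plain string is never visited as a candidate, which is the limitation noted above for fields not typed as Qualifier.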
A first implementation of the cleaning workflow is ready. It features:
The Result.Instance.Refereed field type is changed from Field<String> to Qualifier. Such changes were included in the master branch and the schema module released under dhp-1.2.2.
The cleaning workflow is now implemented and covers the normalization of the dhp.Oaf model fields typed as Qualifier.