Update page 'Data provision workflow'

Claudio Atzori 2020-03-03 10:49:29 +01:00
parent 0adb200894
commit 8d80885b2e

The data provision workflow is a sequence of processing steps aimed at updating the content of the backends serving the OpenAIRE public services. Currently it implements the content update procedures for:
* Apache Solr fulltext index serving content to [explore.openaire.eu](https://explore.openaire.eu/) and to the [HTTP search API](http://api.openaire.eu/api.html) (see the indexing sketch after this list)
* Databases accessed through Apache Impala for the calculation of statistics over the Graph
* MongoDB noSQL database serving content to the [OpenAIRE OAI-PMH endpoint](http://api.openaire.eu/oai_pmh)
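
The sketch below illustrates the first of these targets: it feeds a single document to a Solr collection through SolrJ. It is a minimal sketch only; the endpoint URL, the collection name and the field names are assumptions for illustration, not the actual OpenAIRE index layout.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrFeedSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr endpoint and collection name: the actual
        // index topology is not described in this page.
        try (SolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr/openaire").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("objidentifier", "doi_________::0123456789abcdef"); // hypothetical field
            doc.addField("resulttitle", "An example publication");           // hypothetical field
            client.add(doc);  // stage the document
            client.commit();  // make it visible to searches
        }
    }
}
```
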
This document provides a coarse-grained description of the data provision workflow, together with a glossary of the terms used, in the text below and in the detailed subsections, to refer to the different materializations of the OpenAIRE Graph.

The data provision workflow is composed of the following logical steps:
* Freeze the content stored in the aggregation system backends, mapping it to the OCEAN Hadoop cluster according to the `dhp.Oaf` model to produce the so-called raw graph **G_r**;
* Apply the deduplication outcome from the dedicated action sets and process it to produce the `de-duplicated` graph **G_d**;
* Apply the inference processes outcome from the dedicated action sets and merge it with **G_d** to produce the `inferred` graph **G_i**;
* Remove from the Graph the relationships indicated as `not valid` in the black listing subsystem; the operation produces the `blacklisted` graph **G_bl** (see the sketch after this list);
* Process the Graph according to a set of defined deduction rules (bulk-tagging criteria) to produce the `bulk-tagged` graph **G_bt**;
* Process the Graph navigating the links between the nodes to propagate contextual information from one node to another, according to a set of defined criteria, to produce the `propagated` graph **G_p**.
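
Steps like the black listing above lend themselves to a relational formulation over the Graph materializations. Below is a minimal sketch of that step as a Spark job, assuming the relations are stored as parquet files and that the blacklist exports `(source, target, relClass)` triples; all paths and column names here are illustrative assumptions, not the actual layout.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BlacklistSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("blacklist-sketch")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical inputs: the relations of the inferred graph G_i and
        // the invalid relations exported by the black listing subsystem.
        Dataset<Row> relations = spark.read().parquet("/graph/g_i/relation");
        Dataset<Row> blacklist = spark.read().parquet("/blacklist/relation");

        // A left anti join keeps only the relations whose
        // (source, target, relClass) triple is NOT blacklisted.
        Dataset<Row> valid = relations.join(
                blacklist,
                relations.col("source").equalTo(blacklist.col("source"))
                        .and(relations.col("target").equalTo(blacklist.col("target")))
                        .and(relations.col("relClass").equalTo(blacklist.col("relClass"))),
                "left_anti");

        valid.write().parquet("/graph/g_bl/relation"); // materializes the G_bl relations
        spark.stop();
    }
}
```
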
The last materialization of the Graph, **G_p**, is finally mapped to the target backends.
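
Summarizing, the whole sequence is a linear chain of graph transformations. The sketch below only fixes the order of the steps; every method name and the `Graph` placeholder are hypothetical, since the real workflow runs as a set of cluster jobs rather than in-process calls.

```java
public class DataProvisionSketch {

    /** Stand-in for a graph materialization, e.g. a base path on HDFS. */
    static final class Graph {
        final String basePath;
        Graph(String basePath) { this.basePath = basePath; }
    }

    // Each step reads one materialization and writes the next one
    // (hypothetical names and paths, for illustration only).
    static Graph freezeAggregatedContent()   { return new Graph("/graph/g_r");  } // G_r
    static Graph applyDeduplication(Graph g) { return new Graph("/graph/g_d");  } // G_d
    static Graph applyInference(Graph g)     { return new Graph("/graph/g_i");  } // G_i
    static Graph removeBlacklisted(Graph g)  { return new Graph("/graph/g_bl"); } // G_bl
    static Graph bulkTag(Graph g)            { return new Graph("/graph/g_bt"); } // G_bt
    static Graph propagate(Graph g)          { return new Graph("/graph/g_p");  } // G_p

    static void mapToBackends(Graph g) {
        System.out.println("feeding Solr, Impala and MongoDB from " + g.basePath);
    }

    public static void main(String[] args) {
        Graph g = freezeAggregatedContent(); // raw graph G_r from the aggregator
        g = applyDeduplication(g);           // de-duplicated graph G_d
        g = applyInference(g);               // inferred graph G_i
        g = removeBlacklisted(g);            // blacklisted graph G_bl
        g = bulkTag(g);                      // bulk-tagged graph G_bt
        g = propagate(g);                    // propagated graph G_p
        mapToBackends(g);                    // the final materialization feeds the backends
    }
}
```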