From 8d80885b2e3c2c06a5665cd7cb03858fb5329aaf Mon Sep 17 00:00:00 2001
From: Claudio Atzori
Date: Tue, 3 Mar 2020 10:49:29 +0100
Subject: [PATCH] Update page 'Data provision workflow'

---
 Data-provision-workflow.md | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/Data-provision-workflow.md b/Data-provision-workflow.md
index 3e5e24f..c341960 100644
--- a/Data-provision-workflow.md
+++ b/Data-provision-workflow.md
@@ -1,7 +1,16 @@
-The data provision workflow is a sequence of processing steps aimed at updating the content of the backends serving the OpenAIRE public services. Currently it covers
+The data provision workflow is a sequence of processing steps aimed at updating the content of the backends serving the OpenAIRE public services. Currently it implements the content update procedures for:
 * Apache Solr fulltext index serving content to [explore.openaire.eu](https://explore.openaire.eu/) and to the [HTTP search API](http://api.openaire.eu/api.html)
 * Databases accessed through Apache Impala for calculation of statistics over the Graph
 * MongoDB noSQL database serving content to the [OpenAIRE OAI-PMH endpoint](http://api.openaire.eu/oai_pmh)
 
-This document provides a coarse grained description of the data provision workflow, it is composed of several data movement and manipulation steps:
-* Load data from the aggregator and map it to the OCEAN cluster according to the dhp.Oaf model. The procedure freezes the content stored in the aggregation system backends to produce the so called raw graph **G_r**
+This document provides a coarse-grained description of the data provision workflow, together with a glossary of the terms used in the following text and in the detailed subsections to refer to the different materializations of the OpenAIRE Graph.
+
+The data provision workflow is composed of the following logical steps:
+* Freeze the content stored in the aggregation system backends, mapping it to the OCEAN Hadoop cluster according to the `dhp.Oaf` model, to produce the so-called raw graph **G_r**;
+* Apply the deduplication outcome from the dedicated action sets and process it to produce the `de-duplicated` graph **G_d**;
+* Apply the inference processes outcome from the dedicated action sets and merge it with **G_d** to produce the `inferred` graph **G_i**;
+* Remove from the Graph the relationships marked as not valid in the blacklisting subsystem, producing the `blacklisted` graph **G_bl**;
+* Process the Graph according to a set of defined deduction rules (bulk-tagging criteria) to produce the `bulk-tagged` graph **G_bt**;
+* Process the Graph by navigating the links between its nodes to propagate contextual information from one node to another, according to a set of defined criteria, producing the `propagated` graph **G_p**.
+
+The last materialization of the Graph, **G_p**, is finally mapped to the target backends.
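
For readers who want a concrete picture of the sequence added above, the following is a minimal, hypothetical Java sketch that models the successive materializations G_r → G_d → G_i → G_bl → G_bt → G_p as an ordered chain of transformations. The `Graph` and `Step` types, the class name, and the placeholder lambdas are illustrative assumptions, not the actual `dhp.Oaf` model or the real workflow implementation.

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class GraphProvisionSketch {

    /** Placeholder for a materialized graph; the real model is the dhp.Oaf object model on HDFS. */
    record Graph(String name) {}

    /** One logical step of the provision workflow, turning the previous materialization into the next. */
    record Step(String produces, UnaryOperator<Graph> transformation) {}

    public static void main(String[] args) {
        // The ordered materializations described in the wiki page; the lambdas are stand-ins
        // for the real deduplication, inference, blacklisting, bulk-tagging and propagation jobs.
        List<Step> steps = List.of(
                new Step("G_r",  g -> new Graph("G_r")),   // freeze aggregator content into the raw graph
                new Step("G_d",  g -> new Graph("G_d")),   // apply the deduplication action sets
                new Step("G_i",  g -> new Graph("G_i")),   // merge the inference action sets
                new Step("G_bl", g -> new Graph("G_bl")),  // drop relationships marked as not valid
                new Step("G_bt", g -> new Graph("G_bt")),  // apply the bulk-tagging deduction rules
                new Step("G_p",  g -> new Graph("G_p"))    // propagate contextual information across links
        );

        Graph graph = new Graph("aggregator content");
        for (Step step : steps) {
            graph = step.transformation().apply(graph);
            System.out.println("materialized " + graph.name());
        }
        // graph now represents G_p, the materialization mapped to the Solr, Impala and MongoDB backends.
    }
}
```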