Update page 'Data provision workflow'

Claudio Atzori 2020-03-03 12:02:13 +01:00
parent 56b73fd9cc
commit 91265de1d2
1 changed file with 10 additions and 6 deletions

@@ -5,9 +5,9 @@ Go back to [[Home]].
The data provision workflow is a sequence of processing steps aimed at updating the content of the backends serving the OpenAIRE public services. Currently it implements the content update procedures for:
1. Apache Solr fulltext index serving content to [explore.openaire.eu](https://explore.openaire.eu/) and to the [HTTP search API](http://api.openaire.eu/api.html)
2. Databases accessed through Apache Impala for calculation of statistics over the Graph
3. MongoDB NoSQL database serving content to the [OpenAIRE OAI-PMH endpoint](http://api.openaire.eu/oai_pmh)
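As a quick illustration of the first backend, the sketch below queries the HTTP search API for publications from Python. The endpoint path, the query parameters, and the response shape are assumptions drawn from the public API documentation linked above, not guarantees.

```python
# Minimal sketch: query the OpenAIRE HTTP search API for publications.
# Endpoint path, parameters, and response shape are assumptions based on
# the public docs at http://api.openaire.eu/api.html; adjust as needed.
import requests

resp = requests.get(
    "https://api.openaire.eu/search/publications",
    params={"keywords": "open science", "format": "json", "size": 10},
    timeout=30,
)
resp.raise_for_status()
payload = resp.json()
print(list(payload.keys()))  # inspect the (assumed) JSON envelope
```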
This document provides a coarse-grained description of the data provision workflow and a glossary of the terms used, in the text below and in the detailed subsections, to refer to the different materializations of the OpenAIRE Graph.
@@ -16,7 +16,11 @@ The data provision workflow is composed of the following logical steps:
* Apply the deduplication outcome from the dedicated action sets and process it to produce the `de-duplicated` graph **G_d**;
* Apply the inference processes outcome from the dedicated action sets and merge it with **G_d** to produce the `inferred` Graph **G_i**;
* Remove from the Graph the relationships that were indicated as `not valid` in the black listing subsystem. The operation produces the `blacklisted` Graph, named **G_bl**;
* Process the Graph according to a set of defined deduction rules (bulk-tagging criteria) to produce the `bulk-tagged` Graph, named **G_bt** (a sketch of this step, together with the blacklist filtering, follows this list);
* Process the Graph, navigating the links between the nodes to propagate contextual information from one node to another according to a set of defined criteria, to produce the `propagated` Graph **G_p**.
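As a rough illustration, the following PySpark sketch mimics two of the rule-driven steps above: dropping blacklisted relationships (**G_bl**) and applying a single bulk-tagging rule (**G_bt**). The column names, the blacklist shape, and the tagging rule are illustrative assumptions, not the production schema or criteria.

```python
# Minimal PySpark sketch of two rule-driven steps: removing blacklisted
# relationships (G_bl) and applying one bulk-tagging rule (G_bt).
# Column names, the blacklist shape, and the rule are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("graph-steps-sketch").getOrCreate()

rels = spark.createDataFrame(
    [("pub1", "proj1", "isProducedBy"), ("pub2", "proj2", "isProducedBy")],
    ["source", "target", "relType"],
)
blacklist = spark.createDataFrame([("pub2", "proj2")], ["source", "target"])

# G_bl: drop every relationship flagged as `not valid` in the blacklist.
rels_bl = rels.join(blacklist, ["source", "target"], "left_anti")

pubs = spark.createDataFrame(
    [("pub1", "A dataset about COVID-19"), ("pub2", "A study of corals")],
    ["id", "title"],
)

# G_bt: a deduction rule that tags records whose title matches a keyword.
pubs_bt = pubs.withColumn(
    "context",
    F.when(F.lower(F.col("title")).contains("covid"), F.lit("covid-19")),
)

rels_bl.show()
pubs_bt.show()
```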
The materialization of the Graph named **G_p** is finally prepared to be mapped to the target backends:
1. Process the Graph **G_p** so that the nodes are joined by resolving edges at distance = 1, creating an adjacency list of linked objects. The operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization) and all the possible relationships among them. The resulting set of adjacency lists is then serialized according to the XML format required by the search service; such records are eventually indexed on the Solr fulltext server (a sketch of this step follows the list);
2. Map the Graph **G_p** onto a **Hive** database that represents a 1:1 mapping between the Graph materialization and the implicit relational schema. This makes it possible to slice and dice the data using SQL statements (see the second sketch below).
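A minimal PySpark sketch of step 1 above, under simplified assumptions: a toy schema, a single relationship, and a hypothetical XML layout standing in for the search service's actual record format.

```python
# Sketch of step 1: resolve edges at distance = 1 into adjacency lists,
# then serialize each joined record as XML. The schema and the XML layout
# are simplified assumptions, not the production record format.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adjacency-sketch").getOrCreate()

entities = spark.createDataFrame(
    [("pub1", "publication", "Some title"), ("proj1", "project", "Some grant")],
    ["id", "type", "name"],
)
rels = spark.createDataFrame([("pub1", "proj1", "isProducedBy")],
                             ["source", "target", "relType"])

# Join each node with its neighbours at distance 1 and collect them.
neighbours = (
    rels.join(entities.withColumnRenamed("id", "target"), "target")
        .groupBy("source")
        .agg(F.collect_list(F.struct("relType", "target", "type", "name"))
             .alias("links"))
)
adjacency = entities.join(neighbours,
                          entities.id == neighbours.source, "left")

def to_xml(row):
    # Hypothetical, simplified XML; real records follow the search
    # service schema.
    links = "".join(
        f'<rel to="{l.target}" type="{l.relType}"/>' for l in (row.links or [])
    )
    return f'<record id="{row.id}"><title>{row.name}</title>{links}</record>'

for xml in adjacency.rdd.map(to_xml).collect():
    print(xml)
```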
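And a minimal sketch of step 2: exposing a slice of the graph as a relational table and querying it with SQL. The table and column names are assumptions; in production the statistics are computed over Impala/Hive rather than a local temp view.

```python
# Minimal sketch: expose a slice of the graph as a relational table and
# query it with SQL. Table and column names are illustrative assumptions;
# the production statistics run over Impala/Hive, not a local temp view.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stats-sketch").getOrCreate()

pubs = spark.createDataFrame(
    [("pub1", 2019, "OPEN"), ("pub2", 2020, "CLOSED"), ("pub3", 2020, "OPEN")],
    ["id", "year", "access"],
)
pubs.createOrReplaceTempView("publication")

# Slice and dice with plain SQL over the (implicit) relational schema.
spark.sql("""
    SELECT year, access, COUNT(*) AS n
    FROM publication
    GROUP BY year, access
    ORDER BY year, access
""").show()
```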