Update page 'Data provision workflow'

Claudio Atzori 2020-03-03 17:54:56 +01:00
parent 91265de1d2
commit bc9904bda8
1 changed files with 64 additions and 2 deletions

@@ -3,6 +3,8 @@ Go back to [[Home]].
-----
## Data provision workflow
The data provision workflow is a sequence of processing steps aimed at updating the content of the backends serving the OpenAIRE public services. Currently it implements the content update procedures for:
1. Apache Solr fulltext index serving content to [explore.openaire.eu](https://explore.openaire.eu/) and to the [HTTP search API](http://api.openaire.eu/api.html)
@@ -11,8 +13,21 @@ The data provision workflow is a sequence of processing steps aimed at updating
This document provides a coarse-grained description of the data provision workflow, together with a glossary of the terms used, both in the text below and in the detailed subsections, to refer to the different materializations of the OpenAIRE Graph.
### Glossary
* **G_r**: raw graph
* **G_d**: deduplicated graph
* **G_i**: inferred graph
* **G_bl**: blacklisted graph
* **G_bt**: bulk-tagged graph
* **G_p**: propagated graph
* **dhp.Oaf**: data model used to describe entity & relationship types in the graph
### High level workflow description
The data provision workflow is composed of the following logical steps:
* Freeze the content stored in the aggregation system backends, mapping it to the OCEAN Hadoop cluster according to the **dhp.Oaf** model to produce the so-called raw graph **G_r**;
* Apply the deduplication outcome from the dedicated action sets and process it to produce the `de-duplicated` graph **G_d**;
* Apply the inference processes outcome from the dedicated action sets and merge it with **G_d** to produce the `inferred` Graph **G_i**;
* Remove from the Graph the relationships marked as `not valid` in the blacklisting subsystem. The operation produces the `blacklisted` Graph, named **G_bl**;
@@ -20,7 +35,54 @@ The data provision workflow is composed of the following logical steps:
* Process the Graph, navigating the links between nodes to propagate contextual information from one node to another according to a set of defined criteria, to produce the `propagated` Graph **G_p** (a minimal sketch of this pattern follows this list).
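As an illustration of the propagation step, here is a minimal Spark sketch of the general pattern: read a graph materialization, follow the relationships at distance = 1, and copy a property from one node onto the linked node. The paths, relation class and field names (`G_bt`, `hasAuthorInstitution`, `country`) are hypothetical placeholders, not the actual workflow configuration; the real jobs are defined in the dnet-hadoop workflows.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PropagationSketch {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("propagation-sketch")
                .getOrCreate();

        // Hypothetical input: the bulk-tagged graph G_bt, one directory per entity type
        Dataset<Row> publications = spark.read().json("G_bt/publication");
        Dataset<Row> organizations = spark.read().json("G_bt/organization");
        Dataset<Row> relations = spark.read().json("G_bt/relation");

        // Keep only the affiliation links (hypothetical relation class)
        Dataset<Row> pubToOrg = relations
                .filter("relClass = 'hasAuthorInstitution'")
                .selectExpr("source as pubId", "target as orgId");

        // Resolve the organization side of each link and pick the property to propagate
        Dataset<Row> countryByPub = pubToOrg
                .join(organizations.selectExpr("id as orgId", "country"), "orgId")
                .selectExpr("pubId", "country as propagatedCountry");

        // Attach the propagated property to the publication nodes and write the next materialization
        publications
                .join(countryByPub, publications.col("id").equalTo(countryByPub.col("pubId")), "left")
                .drop("pubId")
                .write()
                .json("G_p/publication");
    }
}
```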
The materialization of the Graph named **G_p** is finally prepared to be mapped onto the target backends:
1. Process the Graph **G_p** such that the nodes are joined by resolving edges at distance = 1 to create an adjacency list of linked objects. The operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization) as well as all the possible relationships among them. The resulting set of adjacency lists is then serialized according to the XML format required by the search service. Such records are eventually indexed on the Solr fulltext server (see the sketch after this list);
2. Materialize the Graph **G_p** as a **Hive** database that represents a 1:1 mapping between the Graph materialization and the implicit relational schema. This allows the data to be sliced and diced using SQL statements.
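A minimal sketch of step 1, under the assumption that the graph is stored as JSON-encoded records and using hypothetical paths and field names: relations are joined with the entities they point to, and the result is grouped by source node to form one adjacency list per node (the XML serialization and Solr indexing are omitted).

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.collect_list;
import static org.apache.spark.sql.functions.struct;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AdjacencyListSketch {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("adjacency-list-sketch")
                .getOrCreate();

        // Hypothetical layout: the propagated graph G_p, one directory per entity type
        Dataset<Row> relations = spark.read().json("G_p/relation")
                .selectExpr("source", "target", "relClass");

        // A single (id, payload) view over the entity types; only two of the eight are shown
        Dataset<Row> entities = spark.read().json("G_p/publication")
                .selectExpr("id", "to_json(struct(*)) as payload")
                .union(spark.read().json("G_p/dataset")
                        .selectExpr("id", "to_json(struct(*)) as payload"));

        // Resolve edges at distance = 1: attach the target entity to each relation,
        // then group by source node to obtain one adjacency list per node
        Dataset<Row> adjacencyLists = relations
                .join(entities, relations.col("target").equalTo(entities.col("id")))
                .groupBy(relations.col("source"))
                .agg(collect_list(struct(col("relClass"), col("payload"))).as("links"));

        // Each adjacency list would then be serialized as an XML record and sent to Solr
        adjacencyLists.write().json("G_p_adjacency");
    }
}
```

For step 2, once the Hive database is in place the same content can be explored with plain SQL, e.g. `spark.sql("SELECT relclass, count(*) FROM relation GROUP BY relclass")`; the table and column names here are only indicative.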
### Graph representation
The OpenAIRE scholarly graph is a collection of interlinked objects represented according to a given schema, to which both the nodes and the edges of the graph conform. The schema declares the set of properties that characterize each entity type, as well as the properties that characterize the edges. The schema is defined in the maven module `dhp-schema`.
Each materialization of the Graph **G_*** is encoded as a set of files and directories, one directory per entity type, plus one directory holding all the relationships:
```
.
├─ dataset
│ ├─ dataset-m-00000.part
│ ├─ dataset-m-00001.part
│ └─ ...
├─ datasource
│ ├─ datasource-m-00000.part
│ ├─ datasource-m-00001.part
│ └─ ...
├─ organization
│ ├─ organization-m-00000.part
│ ├─ organization-m-00001.part
│ └─ ...
├─ otherresearchproduct
│ ├─ otherresearchproduct-m-00000.part
│ ├─ otherresearchproduct-m-00001.part
│ └─ ...
├─ project
│ ├─ project-m-00000.part
│ ├─ project-m-00001.part
│ └─ ...
├─ publication
│ ├─ publication-m-00000.part
│ ├─ publication-m-00001.part
│ └─ ...
├─ relation
│ ├─ relation-m-00000.part
│ ├─ relation-m-00001.part
│ └─ ...
└─ software
  ├─ software-m-00000.part
  ├─ software-m-00001.part
  └─ ...
```
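To make the layout above concrete, here is a small sketch that reads a few records from one part file, assuming each part file contains one JSON-serialized record per line (the actual on-disk encoding is defined by the workflows); the entity class below is a deliberately minimal stand-in, not the actual dhp-schema model.

```java
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class ReadPartFileSketch {

    // Hypothetical, heavily simplified stand-in for a dhp-schema entity type
    public static class MinimalEntity {
        public String id;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper()
                .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);

        // Assumes one JSON record per line inside a part file of the publication directory
        try (Stream<String> lines = Files.lines(Paths.get("publication/publication-m-00000.part"))) {
            lines.limit(5).forEach(json -> {
                try {
                    MinimalEntity entity = mapper.readValue(json, MinimalEntity.class);
                    System.out.println(entity.id);
                } catch (Exception e) {
                    throw new RuntimeException("cannot parse record: " + json, e);
                }
            });
        }
    }
}
```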