Update page 'Data provision workflow'

Claudio Atzori 2020-03-04 11:39:28 +01:00
parent 8a66ebdd0c
commit c5c2586a35
1 changed files with 4 additions and 4 deletions

@ -35,19 +35,19 @@ The data provision workflow is composed of the following logical steps:
* Process the Graph navigating the links between the nodes to propagate contextual information from one node to another according to a set of defined criteria to produce the `propagated` Graph **G_p**.
The materialization of the Graph named **G_p** is finally prepared to be mapped towards the target backends:
1. Process the Graph **G_p** such that the nodes are joined by resolving edges at distance = 1 to create an adjacency list of linked objects. The operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization) as well as all the possible relationships among them. The resulting set of adjacency lists is then serialized according to the XML format required by the search service. Such records are eventually indexed on the Solr fulltext server;
2. Process the Graph **G_p** as a **Hive** database that represents a 1:1 mapping between the Graph materialization and the implicit relational schema. This makes it possible to slice and dice the data using SQL statements.
1. The Graph **G_p** is processed such that the nodes are joined by resolving edges at distance = 1 to create an adjacency list of linked objects. The operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization) as well as all the possible relationships among them. The resulting set of adjacency lists is then serialized according to the XML format required by the search service. Such XML records are eventually indexed on the Solr fulltext server;
2. The Graph **G_p** is mapped as a **Hive** database, whose Entity-Relation schema is described by figure XX.
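
As a rough illustration of the first step, the join at distance = 1 can be sketched in plain Python. This is not the actual implementation in the OpenAIRE codebase: the function name `build_adjacency_lists` and the record fields `id`, `source`, `target` and `relClass` are assumptions made for the example, and the real workflow operates on the serialized Graph rather than on in-memory lists.

```python
def build_adjacency_lists(entities, relations):
    """Join every entity with the entities it links to at distance = 1.

    entities:  iterable of dicts, each with a unique "id" (hypothetical field name)
    relations: iterable of dicts with "source", "target" and "relClass"
               (hypothetical field names)
    Returns a dict mapping each entity id to its record plus its linked objects.
    """
    # Index the entities once so that each edge can be resolved in O(1).
    by_id = {entity["id"]: entity for entity in entities}

    # Every entity gets an adjacency list, even when it has no outgoing edges.
    adjacency = {
        entity_id: {"entity": entity, "links": []}
        for entity_id, entity in by_id.items()
    }

    for rel in relations:
        source, target = rel["source"], rel["target"]
        # Skip dangling edges whose endpoints are not part of the graph.
        if source not in by_id or target not in by_id:
            continue
        adjacency[source]["links"].append(
            {"relClass": rel["relClass"], "target": by_id[target]}
        )

    return adjacency
```

Each value of the returned dictionary corresponds to one adjacency list, that is, one record that would then be serialized to XML for indexing. Dropping dangling edges is a choice made for this sketch only; the source does not state how the workflow treats them.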
### Graph representation
The OpenAIRE scholarly graph is a collection of interlinked objects represented according to a given schema, to which both the nodes and the edges of the graph conform. The schema declares the set of properties that characterize each entity type as well as the properties that characterize the edges. The schema is defined in the Maven module `dhp-schema`.
Each materialization of the Graph **G_*** is encoded as a set of files and directories, one directory per entity type, plus one directory holding all the relationships. Each file contains newline-delimited JSON records, produced by the serialization of each corresponding object.
Each materialization of the Graph **G_*** is encoded as a set of files and directories, one directory per entity type, plus one directory holding all the relationships. Note that although the schema includes a hierarchy of entities (e.g. Publication is a subclass of Result), only actual classes materialize as tables. Each file contains newline-delimited JSON records, produced by the serialization of each corresponding object.
```
.
${graph_base_path}/G_*.
├─ dataset
│ ├─ dataset-m-00000.part
│ ├─ dataset-m-00000.part