From c5c2586a35859822f9961016a5e9d624c28e0f6e Mon Sep 17 00:00:00 2001
From: Claudio Atzori
Date: Wed, 4 Mar 2020 11:39:28 +0100
Subject: [PATCH] Update page 'Data provision workflow'

---
 Data-provision-workflow.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Data-provision-workflow.md b/Data-provision-workflow.md
index bcb0053..39a0716 100644
--- a/Data-provision-workflow.md
+++ b/Data-provision-workflow.md
@@ -35,19 +35,19 @@ The data provision workflow is composed the following logical steps:
 * Process the Graph navigating the links between the nodes to propagate contextual information from one node to another according to a set of defined criteria to produce the `propagated` Graph **G_p**.

 The materialization of the Graph named **G_p** is finally prepared to be mapped towards the target backends:
-1. Process the Graph **G_p** such that the nodes are joined by resolving edges at distance = 1 to create an adjacency list of linked objects. The operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization) as well as all the possible relationships among them. The resulting set of adjacency lists is then serialized accoring to the XML format required by the search service. Such records are eventually indexed on the Solr fulltext server;
-2. Process the Graph **G_p** as a **Hive** database that represents a 1:1 mapping between the Graph materialization and the implicit relational schema. This allows to slice and dice the data using SQL statements.
+1. The Graph **G_p** is processed such that the nodes are joined by resolving edges at distance = 1 to create an adjacency list of linked objects. The operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization) as well as all the possible relationships among them. The resulting set of adjacency lists is then serialized accoring to the XML format required by the search service. Such XML records are eventually indexed on the Solr fulltext server;
+2. The Graph **G_p** is mapped as a **Hive** database, whose Entity-Relation schema is described by the figure XX.

 ### Graph representation

 The OpenAIRE scholarly graph is a collection of interlinked objects represented according to a given schema where both nodes and edges in the graph conform to it. The schema declares the set of properties that characterize each entity type as well as the properties that allows to characterize the edges. The schema is defined in the maven module `dhp-schema`.

-Each materialzation of the Graph **G_*** is encoded as a set of files and directories, one directory per entity type, plus one directory holding all the relationships. Each file contains newline-delimited JSON records, produced by the serialization of each corresponding object.
+Each materialzation of the Graph **G_*** is encoded as a set of files and directories, one directory per entity type, plus one directory holding all the relationships. Note that although the schema includes a hiearachy of entities (e.g. Publication is a subclass of Result), only actual classes materialize as tables. Each file contains newline-delimited JSON records, produced by the serialization of each corresponding object.

 ```
-.
+${graph_base_path}/G_*.
 ├─ dataset
 │ ├─ dataset-m-00000.part
 │ ├─ dataset-m-00000.part