14 Data provision workflow
Claudio Atzori edited this page 3 years ago

Data provision workflow

The data provision workflow is a sequence of processing steps aimed at updating the content of the backends serving the OpenAIRE public services. Currently it implements the content update procedures for:

  1. Apache Solr fulltext index serving content to explore.openaire.eu and to the HTTP search API
  2. Databases accessed through Apache Impala for calculation of statistics over the Graph
  3. MongoDB NoSQL database serving content to the OpenAIRE OAI-PMH endpoint

This document provides a coarse-grained description of the data provision workflow, together with a glossary of the terms used here and in the detailed subsections to refer to the different materializations of the OpenAIRE Graph.

Glossary

  • G_r: raw graph
  • G_d: deduplicated graph
  • G_i: inferred graph
  • G_bl: blacklisted graph
  • G_bt: bulk-tagged graph
  • G_p: propagated graph
  • dhp.Oaf: data model used to describe entity & relationship types in the graph

High level workflow description

[Figure: Data provision workflow]

The data provision workflow is composed of the following logical steps:

  • Freeze the content stored in the aggregation system backends, mapping it to the OCEAN Hadoop cluster according to the dhp.Oaf model to produce the so-called raw graph G_r;
  • Apply the deduplication outcome from the dedicated action sets and process it to produce the deduplicated graph G_d;
  • Apply the outcome of the inference processes from the dedicated action sets and merge it with G_d to produce the inferred Graph G_i;
  • Remove from the Graph the relationships marked as not valid in the blacklisting subsystem. The operation produces the blacklisted Graph, named G_bl;
  • Process the Graph according to a set of defined deduction rules (bulk-tagging criteria) to enrich its nodes and produce the bulk-tagged Graph, named G_bt;
  • Process the Graph by navigating the links between the nodes, propagating contextual information from one node to another according to a set of defined criteria, to produce the propagated Graph G_p.
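
The ordering of the steps above can be captured in a few lines of Python. This is only an illustrative sketch: the stage names come from the glossary, while the helper functions (`stage_path`, `upstream`, `lineage`) and the idea of storing each materialization under `<graph_base_path>/<stage>` are assumptions made for the example, not part of the actual workflow code.

```python
from typing import List, Optional

# Materialization names in the order they are produced (from the glossary).
STAGES: List[str] = ["G_r", "G_d", "G_i", "G_bl", "G_bt", "G_p"]


def stage_path(graph_base_path: str, stage: str) -> str:
    """Hypothetical helper: directory assumed to hold a given materialization."""
    if stage not in STAGES:
        raise ValueError(f"unknown graph materialization: {stage}")
    return f"{graph_base_path.rstrip('/')}/{stage}"


def upstream(stage: str) -> Optional[str]:
    """Return the materialization a stage is derived from (None for G_r)."""
    idx = STAGES.index(stage)
    return STAGES[idx - 1] if idx > 0 else None


def lineage(stage: str) -> List[str]:
    """Return every materialization produced up to and including `stage`."""
    return STAGES[: STAGES.index(stage) + 1]
```

For instance, `upstream("G_bl")` returns `"G_i"`, reflecting that blacklisting is applied to the inferred Graph.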

Finally, the G_p materialization of the Graph is prepared for mapping to the target backends:

  1. The Graph G_p is processed such that the nodes are joined by resolving edges at distance = 1 to create an adjacency list of linked objects. The operation considers all the entity types (publication, dataset, software, other research product (ORP), project, datasource, organization) as well as all the possible relationships among them. The resulting set of adjacency lists is then serialized according to the XML format required by the search service. These XML records are eventually indexed on the Solr fulltext server;
  2. The Graph G_p is mapped as a Hive database, whose Entity-Relation schema is described in figure XX.
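
The distance = 1 join described in the first item can be illustrated on in-memory data. The sketch below is not the production job (which runs on the Hadoop cluster); it assumes, for the sake of the example, that every entity record carries an `id` field and that every relation record carries `source`, `target` and `relClass` fields. The function name `build_adjacency_lists` is hypothetical.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def build_adjacency_lists(
    entities: Iterable[Record], relations: Iterable[Record]
) -> Dict[str, Record]:
    """Join each entity with the entities it links to directly (distance 1)."""
    by_id: Dict[str, Record] = {e["id"]: e for e in entities}
    # One adjacency list per entity, initially without links.
    adjacency: Dict[str, Record] = {
        eid: {"entity": entity, "links": []} for eid, entity in by_id.items()
    }
    for rel in relations:
        source, target = rel["source"], rel["target"]
        # Skip dangling relations whose endpoints are not in the graph.
        if source not in by_id or target not in by_id:
            continue
        link: Record = {"relClass": rel.get("relClass"), "target": by_id[target]}
        adjacency[source]["links"].append(link)
    return adjacency


# Toy data: one publication linked to one project, plus a dangling relation.
entities: List[Record] = [
    {"id": "pub1", "type": "publication"},
    {"id": "proj1", "type": "project"},
]
relations: List[Record] = [
    {"source": "pub1", "target": "proj1", "relClass": "isProducedBy"},
    {"source": "pub1", "target": "missing", "relClass": "cites"},
]
adjacency = build_adjacency_lists(entities, relations)
```

Each value of `adjacency` holds an entity together with the objects it is directly linked to, which is the shape that would then be serialized to XML for indexing.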

Graph representation

The OpenAIRE scholarly graph is a collection of interlinked objects represented according to a given schema, to which both the nodes and the edges of the graph conform. The schema declares the set of properties that characterize each entity type as well as the properties that characterize the edges. The schema is defined in the Maven module dhp-schema.

Each materialization of the Graph G_* is encoded as a set of files and directories, one directory per entity type, plus one directory holding all the relationships. Note that although the schema includes a hierarchy of entities (e.g. Publication is a subclass of Result), only the concrete classes materialize as tables. Each file contains newline-delimited JSON records, produced by the serialization of each corresponding object.

${graph_base_path}/G_*.
├─ dataset
│  ├─ dataset-m-00000.part
│  ├─ dataset-m-00001.part
│  └─ ...
├─ datasource
│  ├─ datasource-m-00000.part
│  ├─ datasource-m-00001.part
│  └─ ...
├─ organization
│  ├─ organization-m-00000.part
│  ├─ organization-m-00001.part
│  └─ ...
├─ otherresearchproduct
│  ├─ otherresearchproduct-m-00000.part
│  ├─ otherresearchproduct-m-00001.part
│  └─ ...
├─ project
│  ├─ project-m-00000.part
│  ├─ project-m-00001.part
│  └─ ...
├─ publication
│  ├─ publication-m-00000.part
│  ├─ publication-m-00001.part
│  └─ ...
├─ relation
│  ├─ relation-m-00000.part
│  ├─ relation-m-00001.part
│  └─ ...
└─ software
   ├─ software-m-00000.part
   ├─ software-m-00001.part
   └─ ...
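
As a rough illustration of how this layout can be consumed, the following Python sketch iterates over the newline-delimited JSON records of one entity type and counts the records per directory. It assumes a local, uncompressed copy of a materialization (the real data lives on the Hadoop cluster and may be stored differently); the function names `read_records` and `count_records` are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def read_records(graph_path: str, entity_type: str) -> Iterator[Dict[str, Any]]:
    """Yield the JSON records stored in <graph_path>/<entity_type>/*.part."""
    for part in sorted(Path(graph_path, entity_type).glob("*.part")):
        with part.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # tolerate blank lines between records
                    yield json.loads(line)


def count_records(graph_path: str) -> Dict[str, int]:
    """Count the records found in each sub-directory of a materialization."""
    return {
        directory.name: sum(1 for _ in read_records(graph_path, directory.name))
        for directory in sorted(Path(graph_path).iterdir())
        if directory.is_dir()
    }
```

Calling `count_records` on a copy of, say, the G_p directory would return a dictionary keyed by directory name (`publication`, `relation`, and so on), which can serve as a quick sanity check between two materializations.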