improved documentation in dhp-graph-provision

Claudio Atzori 2020-11-10 11:46:50 +01:00
parent d10447e747
commit c88ec1bc52
6 changed files with 29 additions and 95 deletions

@@ -0,0 +1,21 @@
Joins the graph nodes by resolving the links of distance = 1 to create an adjacency list of linked objects. The
operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization) and
all the possible relationships (similarity links produced by the Dedup process are excluded).

The operation is implemented by sequentially joining one entity type at a time (E) with the relationships (R), and
again by E, finally grouped by E.id.

The workflow is organized in different parts aimed at reducing the complexity of the operation:

1) PrepareRelationsJob: only consider relationships that are not virtually deleted ($.dataInfo.deletedbyinference ==
   false); each entity can be linked to at most 100 other objects.
2) CreateRelatedEntitiesJob:
   (phase 1): prepare tuples [relation - target entity] (R - T):
      for each entity type E_i:
         map E_i as RelatedEntity T_i to simplify the model, extracting only the necessary information
         join (R.target = T_i.id)
         save the tuples (R_i, T_i)
   (phase 2):
      create the union of all the entity types E, hash by id
      read the tuples (R, T), hash by R.source
      join E.id = (R, T).source, where E becomes the Source Entity S
      save the tuples (S, R, T)
3) AdjacencyListBuilderJob: given the tuple (S - R - T), group by S.id -> List [ R - T ], mapping the
   result as JoinedEntity.
4) XmlConverterJob: convert the JoinedEntities to XML records.

@@ -25,26 +25,8 @@ import scala.collection.JavaConverters;
 import scala.collection.Seq;
 /**
- * Joins the graph nodes by resolving the links of distance = 1 to create an adjacency list of linked objects. The
- * operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization, and
- * all the possible relationships (similarity links produced by the Dedup process are excluded).
- * <p>
- * The operation is implemented by sequentially joining one entity type at time (E) with the relationships (R), and
- * again by E, finally grouped by E.id;
- * <p>
- * The workflow is organized in different parts aimed to to reduce the complexity of the operation 1)
- * PrepareRelationsJob: only consider relationships that are not virtually deleted ($.dataInfo.deletedbyinference ==
- * false), each entity can be linked at most to 100 other objects
- * <p>
- * 2) JoinRelationEntityByTargetJob: (phase 1): prepare tuples [relation - target entity] (R - T): for each entity type
- * E_i map E_i as RelatedEntity T_i to simplify the model and extracting only the necessary information join (R.target =
- * T_i.id) save the tuples (R_i, T_i) (phase 2): create the union of all the entity types E, hash by id read the tuples
- * (R, T), hash by R.source join E.id = (R, T).source, where E becomes the Source Entity S save the tuples (S, R, T)
- * <p>
- * 3) AdjacencyListBuilderJob: given the tuple (S - R - T) we need to group by S.id -> List [ R - T ], mapping the
+ * AdjacencyListBuilderJob: given the tuple (S - R - T) we need to group by S.id -> List [ R - T ], mapping the
  * result as JoinedEntity
- * <p>
- * 4) XmlConverterJob: convert the JoinedEntities as XML records
  */
 public class AdjacencyListBuilderJob {
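
The grouping step described in this Javadoc (group the (S - R - T) tuples by S.id into a List [ R - T ]) can be sketched in plain Java, without Spark. This is only an illustration under stated assumptions, not the job's actual code: the `Triple` and `Link` classes, their fields, and the `group` method are hypothetical names, and a plain `Map` stands in for the JoinedEntity model.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class AdjacencyListSketch {

    /** Hypothetical flattened (S, R, T) tuple: source id, relation type, target id. */
    static class Triple {
        final String sourceId;
        final String relType;
        final String targetId;

        Triple(String sourceId, String relType, String targetId) {
            this.sourceId = sourceId;
            this.relType = relType;
            this.targetId = targetId;
        }
    }

    /** Hypothetical (R, T) pair kept in the adjacency list of a source entity. */
    static class Link {
        final String relType;
        final String targetId;

        Link(String relType, String targetId) {
            this.relType = relType;
            this.targetId = targetId;
        }
    }

    /** Groups the triples by source id, preserving the encounter order of the sources. */
    static Map<String, List<Link>> group(List<Triple> triples) {
        return triples.stream().collect(Collectors.groupingBy(
                t -> t.sourceId,
                LinkedHashMap::new,
                Collectors.mapping(t -> new Link(t.relType, t.targetId), Collectors.toList())));
    }

    public static void main(String[] args) {
        List<Triple> triples = new ArrayList<>();
        triples.add(new Triple("pub1", "isProducedBy", "proj1"));
        triples.add(new Triple("pub1", "hasAuthorInstitution", "org1"));
        triples.add(new Triple("ds1", "isProducedBy", "proj1"));

        Map<String, List<Link>> adjacency = group(triples);
        System.out.println(adjacency.get("pub1").size()); // prints 2
        System.out.println(adjacency.get("ds1").size());  // prints 1
    }
}
```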

@@ -31,26 +31,9 @@ import eu.dnetlib.dhp.schema.oaf.*;
 import scala.Tuple2;
 /**
- * Joins the graph nodes by resolving the links of distance = 1 to create an adjacency list of linked objects. The
- * operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization, and
- * all the possible relationships (similarity links produced by the Dedup process are excluded).
- * <p>
- * The operation is implemented by sequentially joining one entity type at time (E) with the relationships (R), and
- * again by E, finally grouped by E.id;
- * <p>
- * The workflow is organized in different parts aimed to to reduce the complexity of the operation 1)
- * PrepareRelationsJob: only consider relationships that are not virtually deleted ($.dataInfo.deletedbyinference ==
- * false), each entity can be linked at most to 100 other objects
- * <p>
- * 2) JoinRelationEntityByTargetJob: (phase 1): prepare tuples [relation - target entity] (R - T): for each entity type
- * E_i map E_i as RelatedEntity T_i to simplify the model and extracting only the necessary information join (R.target =
- * T_i.id) save the tuples (R_i, T_i) (phase 2): create the union of all the entity types E, hash by id read the tuples
- * (R, T), hash by R.source join E.id = (R, T).source, where E becomes the Source Entity S save the tuples (S, R, T)
- * <p>
- * 3) AdjacencyListBuilderJob: given the tuple (S - R - T) we need to group by S.id -> List [ R - T ], mapping the
- * result as JoinedEntity
- * <p>
- * 4) XmlConverterJob: convert the JoinedEntities as XML records
+ * CreateRelatedEntitiesJob: (phase 1): prepare tuples [relation - target entity] (R - T): for each entity type
+ * E_i map E_i as RelatedEntity T_i to simplify the model and extracting only the necessary information join
+ * (R.target = T_i.id) save the tuples (R_i, T_i)
  */
 public class CreateRelatedEntitiesJob_phase1 {
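
Phase 1 as described in this Javadoc (map each entity to a slimmer RelatedEntity, then join on R.target = T.id) might look like the following in plain Java. This is a rough sketch, not the Spark job itself: every class and field name here (`Entity`, `RelatedEntity`, `Rel`, `RelTarget`, `title`, `payload`) is a hypothetical simplification, and the sketch assumes an inner join, so relations whose target is unknown are dropped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JoinByTargetSketch {

    /** Hypothetical full entity; payload stands for everything the join does not need. */
    static class Entity {
        final String id;
        final String type;
        final String title;
        final String payload;

        Entity(String id, String type, String title, String payload) {
            this.id = id;
            this.type = type;
            this.title = title;
            this.payload = payload;
        }
    }

    /** Hypothetical slim projection of an entity, carrying only what the record needs. */
    static class RelatedEntity {
        final String id;
        final String type;
        final String title;

        RelatedEntity(String id, String type, String title) {
            this.id = id;
            this.type = type;
            this.title = title;
        }
    }

    static class Rel {
        final String source;
        final String target;

        Rel(String source, String target) {
            this.source = source;
            this.target = target;
        }
    }

    /** The (R, T) tuple saved at the end of phase 1. */
    static class RelTarget {
        final Rel rel;
        final RelatedEntity target;

        RelTarget(Rel rel, RelatedEntity target) {
            this.rel = rel;
            this.target = target;
        }
    }

    static List<RelTarget> joinByTarget(List<Rel> rels, List<Entity> entities) {
        // Map E_i -> T_i, indexed by id, dropping the fields the join does not need.
        Map<String, RelatedEntity> targetsById = new HashMap<>();
        for (Entity e : entities) {
            targetsById.put(e.id, new RelatedEntity(e.id, e.type, e.title));
        }
        // Inner join on R.target = T_i.id.
        List<RelTarget> out = new ArrayList<>();
        for (Rel r : rels) {
            RelatedEntity t = targetsById.get(r.target);
            if (t != null) {
                out.add(new RelTarget(r, t));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entity> projects = new ArrayList<>();
        projects.add(new Entity("proj1", "project", "Example project", "large record body"));

        List<Rel> rels = new ArrayList<>();
        rels.add(new Rel("pub1", "proj1"));
        rels.add(new Rel("pub1", "unknown")); // no matching target: dropped by the join

        for (RelTarget rt : joinByTarget(rels, projects)) {
            System.out.println(rt.rel.source + " -> " + rt.target.title);
        }
    }
}
```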

@@ -34,26 +34,8 @@ import scala.collection.JavaConverters;
 import scala.collection.Seq;
 /**
- * Joins the graph nodes by resolving the links of distance = 1 to create an adjacency list of linked objects. The
- * operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization, and
- * all the possible relationships (similarity links produced by the Dedup process are excluded).
- * <p>
- * The operation is implemented by sequentially joining one entity type at time (E) with the relationships (R), and
- * again by E, finally grouped by E.id;
- * <p>
- * The workflow is organized in different parts aimed to to reduce the complexity of the operation 1)
- * PrepareRelationsJob: only consider relationships that are not virtually deleted ($.dataInfo.deletedbyinference ==
- * false), each entity can be linked at most to 100 other objects
- * <p>
- * 2) JoinRelationEntityByTargetJob: (phase 1): prepare tuples [relation - target entity] (R - T): for each entity type
- * E_i map E_i as RelatedEntity T_i to simplify the model and extracting only the necessary information join (R.target =
- * T_i.id) save the tuples (R_i, T_i) (phase 2): create the union of all the entity types E, hash by id read the tuples
+ * CreateRelatedEntitiesJob (phase 2): create the union of all the entity types E, hash by id read the tuples
  * (R, T), hash by R.source join E.id = (R, T).source, where E becomes the Source Entity S save the tuples (S, R, T)
- * <p>
- * 3) AdjacencyListBuilderJob: given the tuple (S - R - T) we need to group by S.id -> List [ R - T ], mapping the
- * result as JoinedEntity
- * <p>
- * 4) XmlConverterJob: convert the JoinedEntities as XML records
  */
 public class CreateRelatedEntitiesJob_phase2 {
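
One possible reading of "hash by id" and "hash by R.source" in this Javadoc is a co-partitioned hash join: both inputs are bucketed with the same hash function, so matching keys land in the same bucket and each bucket can be joined independently. The sketch below illustrates that idea in plain Java. It is an assumption about the intent, not a description of how the Spark job is written, and `Entity`, `RelTarget`, `Triple`, `partitionOf`, and the `partitions` parameter are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HashJoinBySourceSketch {

    static class Entity {
        final String id;
        final String title;

        Entity(String id, String title) {
            this.id = id;
            this.title = title;
        }
    }

    /** Hypothetical flattened (R, T) tuple produced by phase 1. */
    static class RelTarget {
        final String source;
        final String targetId;

        RelTarget(String source, String targetId) {
            this.source = source;
            this.targetId = targetId;
        }
    }

    /** The (S, R, T) tuple saved at the end of phase 2. */
    static class Triple {
        final Entity source;
        final RelTarget relTarget;

        Triple(Entity source, RelTarget relTarget) {
            this.source = source;
            this.relTarget = relTarget;
        }
    }

    /** Same bucket function for both sides, so equal keys always share a bucket. */
    static int partitionOf(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    static List<Triple> joinBySource(List<Entity> entities, List<RelTarget> tuples, int partitions) {
        List<Map<String, Entity>> entityBuckets = new ArrayList<>();
        List<List<RelTarget>> tupleBuckets = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            entityBuckets.add(new HashMap<>());
            tupleBuckets.add(new ArrayList<>());
        }
        // Hash the union of all entities by id.
        for (Entity e : entities) {
            entityBuckets.get(partitionOf(e.id, partitions)).put(e.id, e);
        }
        // Hash the (R, T) tuples by R.source.
        for (RelTarget rt : tuples) {
            tupleBuckets.get(partitionOf(rt.source, partitions)).add(rt);
        }
        // Join bucket by bucket on E.id = (R, T).source; E becomes the source entity S.
        List<Triple> out = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            for (RelTarget rt : tupleBuckets.get(i)) {
                Entity s = entityBuckets.get(i).get(rt.source);
                if (s != null) {
                    out.add(new Triple(s, rt));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entity> entities = new ArrayList<>();
        entities.add(new Entity("pub1", "Example publication"));
        entities.add(new Entity("ds1", "Example dataset"));

        List<RelTarget> tuples = new ArrayList<>();
        tuples.add(new RelTarget("pub1", "proj1"));
        tuples.add(new RelTarget("ds1", "proj1"));
        tuples.add(new RelTarget("pub1", "org1"));

        System.out.println(joinBySource(entities, tuples, 4).size()); // prints 3
    }
}
```

The result does not depend on the number of buckets; the bucket count only controls how the work would be split across workers.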

@@ -36,26 +36,8 @@ import eu.dnetlib.dhp.schema.oaf.Relation;
 import scala.Tuple2;
 /**
- * Joins the graph nodes by resolving the links of distance = 1 to create an adjacency list of linked objects. The
- * operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization, and
- * all the possible relationships (similarity links produced by the Dedup process are excluded).
- * <p>
- * The operation is implemented by sequentially joining one entity type at time (E) with the relationships (R), and
- * again by E, finally grouped by E.id;
- * <p>
- * The workflow is organized in different parts aimed to to reduce the complexity of the operation 1)
- * PrepareRelationsJob: only consider relationships that are not virtually deleted ($.dataInfo.deletedbyinference ==
- * false), each entity can be linked at most to 100 other objects
- * <p>
- * 2) JoinRelationEntityByTargetJob: (phase 1): prepare tuples [relation - target entity] (R - T): for each entity type
- * E_i map E_i as RelatedEntity T_i to simplify the model and extracting only the necessary information join (R.target =
- * T_i.id) save the tuples (R_i, T_i) (phase 2): create the union of all the entity types E, hash by id read the tuples
- * (R, T), hash by R.source join E.id = (R, T).source, where E becomes the Source Entity S save the tuples (S, R, T)
- * <p>
- * 3) AdjacencyListBuilderJob: given the tuple (S - R - T) we need to group by S.id -> List [ R - T ], mapping the
- * result as JoinedEntity
- * <p>
- * 4) XmlConverterJob: convert the JoinedEntities as XML records
+ * PrepareRelationsJob prunes the relationships: only consider relationships that are not virtually deleted
+ * ($.dataInfo.deletedbyinference == false), each entity can be linked at most to 100 other objects
  */
 public class PrepareRelationsJob {
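
The pruning rule stated in this Javadoc (skip virtually deleted relations, cap the number of links per entity) can be illustrated with a small plain-Java sketch. Several things here are assumptions rather than facts from the commit: the `Rel` class and its fields are hypothetical, the cap is applied per source id, and the relations that survive the cap are simply the first ones encountered. The real job may choose which relations to keep differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PruneRelationsSketch {

    /** Hypothetical, simplified stand-in for a graph relation. */
    static class Rel {
        final String source;
        final String target;
        final boolean deletedByInference;

        Rel(String source, String target, boolean deletedByInference) {
            this.source = source;
            this.target = target;
            this.deletedByInference = deletedByInference;
        }
    }

    /** Drops virtually deleted relations, then keeps at most maxPerSource relations per source id. */
    static List<Rel> prune(List<Rel> relations, int maxPerSource) {
        Map<String, Integer> keptPerSource = new HashMap<>();
        List<Rel> out = new ArrayList<>();
        for (Rel r : relations) {
            if (r.deletedByInference) {
                continue; // virtually deleted: never considered
            }
            int kept = keptPerSource.getOrDefault(r.source, 0);
            if (kept < maxPerSource) {
                out.add(r);
                keptPerSource.put(r.source, kept + 1);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rel> rels = new ArrayList<>();
        rels.add(new Rel("pub1", "proj1", false));
        rels.add(new Rel("pub1", "proj2", true));  // deleted by inference
        rels.add(new Rel("pub1", "org1", false));
        rels.add(new Rel("pub1", "ds1", false));   // exceeds the cap of 2 used below

        // The Javadoc mentions a cap of 100; 2 is used here to keep the example small.
        System.out.println(prune(rels, 2).size()); // prints 2
    }
}
```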

@@ -37,23 +37,7 @@ import scala.collection.JavaConverters;
 import scala.collection.Seq;
 /**
- * Joins the graph nodes by resolving the links of distance = 1 to create an adjacency list of linked objects. The
- * operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization, and
- * all the possible relationships (similarity links produced by the Dedup process are excluded).
- * <p>
- * The workflow is organized in different parts aimed to to reduce the complexity of the operation 1)
- * PrepareRelationsJob: only consider relationships that are not virtually deleted ($.dataInfo.deletedbyinference ==
- * false), each entity can be linked at most to 100 other objects
- * <p>
- * 2) JoinRelationEntityByTargetJob: (phase 1): prepare tuples [relation - target entity] (R - T): for each entity type
- * E_i map E_i as RelatedEntity T_i to simplify the model and extracting only the necessary information join (R.target =
- * T_i.id) save the tuples (R_i, T_i) (phase 2): create the union of all the entity types E, hash by id read the tuples
- * (R, T), hash by R.source join E.id = (R, T).source, where E becomes the Source Entity S save the tuples (S, R, T)
- * <p>
- * 3) AdjacencyListBuilderJob: given the tuple (S - R - T) we need to group by S.id -> List [ R - T ], mapping the
- * result as JoinedEntity
- * <p>
- * 4) XmlConverterJob: convert the JoinedEntities as XML records
+ * XmlConverterJob converts the JoinedEntities as XML records
  */
 public class XmlConverterJob {