forked from D-Net/dnet-hadoop
improved documentation in dhp-graph-provision
This commit is contained in:
parent 027ef2326c
commit 18d9aad70c
@@ -0,0 +1,21 @@
Joins the graph nodes by resolving the links of distance = 1 to create an adjacency list of linked objects. The operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization) and all the possible relationships (similarity links produced by the Dedup process are excluded).

The operation is implemented by sequentially joining one entity type at a time (E) with the relationships (R), and again by E, finally grouping by E.id.

The workflow is organized in different parts aimed to reduce the complexity of the operation:

1) PrepareRelationsJob: only consider relationships that are not virtually deleted ($.dataInfo.deletedbyinference == false); each entity can be linked to at most 100 other objects (a sketch of this pruning step follows this list).

2) CreateRelatedEntitiesJob (phase 1): prepare tuples [relation - target entity] (R - T): for each entity type E_i, map E_i as RelatedEntity T_i to simplify the model and extract only the necessary information, join on (R.target = T_i.id), and save the tuples (R_i, T_i). (Phase 2): create the union of all the entity types E, hash by id, read the tuples (R, T), hash by R.source, join E.id = (R, T).source, where E becomes the Source Entity S, and save the tuples (S, R, T).

3) AdjacencyListBuilderJob: given the tuples (S - R - T), group by S.id -> List [ R - T ], mapping the result as JoinedEntity.

4) XmlConverterJob: convert the JoinedEntities into XML records.
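The pruning in step 1 amounts to a filter on the deletedbyinference flag plus a per-source cap of 100 links. The following is a minimal sketch of that step with the Spark SQL Java API; the input/output paths and the source/target column names are assumptions made for illustration only, not the actual PrepareRelationsJob implementation.

// Hypothetical sketch of the relation pruning described in step 1 (not the actual PrepareRelationsJob).
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.row_number;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

public class PrepareRelationsSketch {

    private static final int MAX_RELATIONS_PER_SOURCE = 100; // cap taken from the description above

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("prepare-relations-sketch")
            .master("local[*]")
            .getOrCreate();

        // Relations serialized as JSON, carrying the $.dataInfo.deletedbyinference flag mentioned above;
        // the path and the source/target field names are assumptions.
        Dataset<Row> relations = spark.read().json("/tmp/relations.json");

        // 1) drop virtually deleted relations
        Dataset<Row> active = relations.filter(col("dataInfo.deletedbyinference").equalTo(false));

        // 2) keep at most 100 relations per source entity
        WindowSpec bySource = Window.partitionBy(col("source")).orderBy(col("target"));
        Dataset<Row> pruned = active
            .withColumn("rank", row_number().over(bySource))
            .filter(col("rank").leq(MAX_RELATIONS_PER_SOURCE))
            .drop("rank");

        pruned.write().mode("overwrite").parquet("/tmp/relations_pruned");
        spark.stop();
    }
}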
@@ -25,26 +25,8 @@ import scala.collection.JavaConverters;
 import scala.collection.Seq;

 /**
- * Joins the graph nodes by resolving the links of distance = 1 to create an adjacency list of linked objects. The
- * operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization, and
- * all the possible relationships (similarity links produced by the Dedup process are excluded).
- * <p>
- * The operation is implemented by sequentially joining one entity type at time (E) with the relationships (R), and
- * again by E, finally grouped by E.id;
- * <p>
- * The workflow is organized in different parts aimed to to reduce the complexity of the operation 1)
- * PrepareRelationsJob: only consider relationships that are not virtually deleted ($.dataInfo.deletedbyinference ==
- * false), each entity can be linked at most to 100 other objects
- * <p>
- * 2) JoinRelationEntityByTargetJob: (phase 1): prepare tuples [relation - target entity] (R - T): for each entity type
- * E_i map E_i as RelatedEntity T_i to simplify the model and extracting only the necessary information join (R.target =
- * T_i.id) save the tuples (R_i, T_i) (phase 2): create the union of all the entity types E, hash by id read the tuples
- * (R, T), hash by R.source join E.id = (R, T).source, where E becomes the Source Entity S save the tuples (S, R, T)
- * <p>
- * 3) AdjacencyListBuilderJob: given the tuple (S - R - T) we need to group by S.id -> List [ R - T ], mapping the
+ * AdjacencyListBuilderJob: given the tuple (S - R - T) we need to group by S.id -> List [ R - T ], mapping the
  * result as JoinedEntity
- * <p>
- * 4) XmlConverterJob: convert the JoinedEntities as XML records
  */
 public class AdjacencyListBuilderJob {
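The grouping performed by AdjacencyListBuilderJob (S.id -> List [ R - T ]) can be expressed as a groupBy plus collect_list. Below is a minimal sketch under assumed paths and column names (sourceId, relType, targetId, targetType); it illustrates the technique only, not the actual job, which maps the result into JoinedEntity objects.

// Hypothetical sketch of grouping (S - R - T) tuples by S.id into an adjacency list (not the actual job).
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.collect_list;
import static org.apache.spark.sql.functions.struct;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AdjacencyListSketch {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("adjacency-list-sketch")
            .master("local[*]")
            .getOrCreate();

        // (S, R, T) tuples produced by the previous phase; path and column names are assumptions.
        Dataset<Row> srt = spark.read().parquet("/tmp/entities_relations_targets");

        // Group by the source entity id and collect the (relation, target) pairs,
        // i.e. S.id -> List [ R - T ], the shape of the JoinedEntity described above.
        Dataset<Row> joinedEntities = srt
            .groupBy(col("sourceId"))
            .agg(collect_list(struct(col("relType"), col("targetId"), col("targetType"))).as("links"));

        joinedEntities.write().mode("overwrite").parquet("/tmp/joined_entities");
        spark.stop();
    }
}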
@@ -31,26 +31,9 @@ import eu.dnetlib.dhp.schema.oaf.*;
 import scala.Tuple2;

 /**
- * Joins the graph nodes by resolving the links of distance = 1 to create an adjacency list of linked objects. The
- * operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization, and
- * all the possible relationships (similarity links produced by the Dedup process are excluded).
- * <p>
- * The operation is implemented by sequentially joining one entity type at time (E) with the relationships (R), and
- * again by E, finally grouped by E.id;
- * <p>
- * The workflow is organized in different parts aimed to to reduce the complexity of the operation 1)
- * PrepareRelationsJob: only consider relationships that are not virtually deleted ($.dataInfo.deletedbyinference ==
- * false), each entity can be linked at most to 100 other objects
- * <p>
- * 2) JoinRelationEntityByTargetJob: (phase 1): prepare tuples [relation - target entity] (R - T): for each entity type
- * E_i map E_i as RelatedEntity T_i to simplify the model and extracting only the necessary information join (R.target =
- * T_i.id) save the tuples (R_i, T_i) (phase 2): create the union of all the entity types E, hash by id read the tuples
- * (R, T), hash by R.source join E.id = (R, T).source, where E becomes the Source Entity S save the tuples (S, R, T)
- * <p>
- * 3) AdjacencyListBuilderJob: given the tuple (S - R - T) we need to group by S.id -> List [ R - T ], mapping the
- * result as JoinedEntity
- * <p>
- * 4) XmlConverterJob: convert the JoinedEntities as XML records
+ * CreateRelatedEntitiesJob: (phase 1): prepare tuples [relation - target entity] (R - T): for each entity type
+ * E_i map E_i as RelatedEntity T_i to simplify the model and extracting only the necessary information join
+ * (R.target = T_i.id) save the tuples (R_i, T_i)
  */
 public class CreateRelatedEntitiesJob_phase1 {
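Phase 1 boils down to projecting each entity type into a slim related-entity view and joining it with the pruned relations on R.target = T_i.id. A minimal sketch for a single hypothetical entity type follows; the paths, field names and selected columns are assumptions, and the real job uses the RelatedEntity model from the dnet-hadoop schema rather than this ad-hoc projection.

// Hypothetical sketch of phase 1: project one entity type to a slim "related entity" view and
// join it with the pruned relations on R.target = T_i.id (not the actual CreateRelatedEntitiesJob_phase1).
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RelatedEntitiesPhase1Sketch {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("related-entities-phase1-sketch")
            .master("local[*]")
            .getOrCreate();

        // Pruned relations from step 1 and one entity type E_i (publications here); paths and fields are assumptions.
        Dataset<Row> relations = spark.read().parquet("/tmp/relations_pruned");
        Dataset<Row> publications = spark.read().json("/tmp/publications.json");

        // Map E_i to a minimal RelatedEntity-like projection T_i, keeping only what the final record needs.
        Dataset<Row> relatedEntities = publications.select(
            col("id").as("targetId"),
            col("title").as("targetTitle"),
            col("resulttype").as("targetType"));

        // Join R.target = T_i.id and save the (R_i, T_i) tuples.
        Dataset<Row> relTarget = relations
            .join(relatedEntities, relations.col("target").equalTo(relatedEntities.col("targetId")));

        relTarget.write().mode("overwrite").parquet("/tmp/rel_target/publication");
        spark.stop();
    }
}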
@@ -34,26 +34,8 @@ import scala.collection.JavaConverters;
 import scala.collection.Seq;

 /**
- * Joins the graph nodes by resolving the links of distance = 1 to create an adjacency list of linked objects. The
- * operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization, and
- * all the possible relationships (similarity links produced by the Dedup process are excluded).
- * <p>
- * The operation is implemented by sequentially joining one entity type at time (E) with the relationships (R), and
- * again by E, finally grouped by E.id;
- * <p>
- * The workflow is organized in different parts aimed to to reduce the complexity of the operation 1)
- * PrepareRelationsJob: only consider relationships that are not virtually deleted ($.dataInfo.deletedbyinference ==
- * false), each entity can be linked at most to 100 other objects
- * <p>
- * 2) JoinRelationEntityByTargetJob: (phase 1): prepare tuples [relation - target entity] (R - T): for each entity type
- * E_i map E_i as RelatedEntity T_i to simplify the model and extracting only the necessary information join (R.target =
- * T_i.id) save the tuples (R_i, T_i) (phase 2): create the union of all the entity types E, hash by id read the tuples
+ * CreateRelatedEntitiesJob (phase 2): create the union of all the entity types E, hash by id read the tuples
  * (R, T), hash by R.source join E.id = (R, T).source, where E becomes the Source Entity S save the tuples (S, R, T)
- * <p>
- * 3) AdjacencyListBuilderJob: given the tuple (S - R - T) we need to group by S.id -> List [ R - T ], mapping the
- * result as JoinedEntity
- * <p>
- * 4) XmlConverterJob: convert the JoinedEntities as XML records
  */
 public class CreateRelatedEntitiesJob_phase2 {
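Phase 2 unions all the entity types and joins them with the (R, T) tuples on E.id = R.source, yielding (S, R, T). A minimal sketch follows, with only three entity types shown and assumed paths and column names; the hashing by id happens implicitly when Spark shuffles the data for the join.

// Hypothetical sketch of phase 2: union all entity types and join them with the (R, T) tuples
// on E.id = R.source, producing (S, R, T) (not the actual CreateRelatedEntitiesJob_phase2).
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RelatedEntitiesPhase2Sketch {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("related-entities-phase2-sketch")
            .master("local[*]")
            .getOrCreate();

        // Each entity type reduced to the same minimal schema (id, name, type); paths and fields are assumptions.
        Dataset<Row> publications = spark.read().parquet("/tmp/slim/publication");
        Dataset<Row> datasets     = spark.read().parquet("/tmp/slim/dataset");
        Dataset<Row> software     = spark.read().parquet("/tmp/slim/software");

        // Union of all the entity types E (only three of them shown here).
        Dataset<Row> allEntities = publications.unionByName(datasets).unionByName(software);

        // (R, T) tuples produced by phase 1.
        Dataset<Row> relTarget = spark.read().parquet("/tmp/rel_target");

        // Join E.id = R.source: E becomes the source entity S of the (S, R, T) tuple.
        Dataset<Row> srt = allEntities
            .join(relTarget, allEntities.col("id").equalTo(relTarget.col("source")))
            .withColumnRenamed("id", "sourceId");

        srt.write().mode("overwrite").parquet("/tmp/entities_relations_targets");
        spark.stop();
    }
}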
@@ -36,26 +36,8 @@ import eu.dnetlib.dhp.schema.oaf.Relation;
 import scala.Tuple2;

 /**
- * Joins the graph nodes by resolving the links of distance = 1 to create an adjacency list of linked objects. The
- * operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization, and
- * all the possible relationships (similarity links produced by the Dedup process are excluded).
- * <p>
- * The operation is implemented by sequentially joining one entity type at time (E) with the relationships (R), and
- * again by E, finally grouped by E.id;
- * <p>
- * The workflow is organized in different parts aimed to to reduce the complexity of the operation 1)
- * PrepareRelationsJob: only consider relationships that are not virtually deleted ($.dataInfo.deletedbyinference ==
- * false), each entity can be linked at most to 100 other objects
- * <p>
- * 2) JoinRelationEntityByTargetJob: (phase 1): prepare tuples [relation - target entity] (R - T): for each entity type
- * E_i map E_i as RelatedEntity T_i to simplify the model and extracting only the necessary information join (R.target =
- * T_i.id) save the tuples (R_i, T_i) (phase 2): create the union of all the entity types E, hash by id read the tuples
- * (R, T), hash by R.source join E.id = (R, T).source, where E becomes the Source Entity S save the tuples (S, R, T)
- * <p>
- * 3) AdjacencyListBuilderJob: given the tuple (S - R - T) we need to group by S.id -> List [ R - T ], mapping the
- * result as JoinedEntity
- * <p>
- * 4) XmlConverterJob: convert the JoinedEntities as XML records
+ * PrepareRelationsJob prunes the relationships: only consider relationships that are not virtually deleted
+ * ($.dataInfo.deletedbyinference == false), each entity can be linked at most to 100 other objects
  */
 public class PrepareRelationsJob {
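As an alternative to the untyped filter shown after the workflow description, the "not virtually deleted" predicate can also be expressed over a typed Dataset. The sketch below uses a deliberately simplified relation bean (not the real eu.dnetlib.dhp.schema.oaf.Relation) just to show the shape of the check on $.dataInfo.deletedbyinference; it assumes the input JSON provides at least these fields.

// Hypothetical, simplified relation model carrying the dataInfo.deletedbyinference flag.
import java.io.Serializable;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class TypedRelationFilterSketch {

    public static class DataInfo implements Serializable {
        private boolean deletedbyinference;
        public boolean isDeletedbyinference() { return deletedbyinference; }
        public void setDeletedbyinference(boolean d) { this.deletedbyinference = d; }
    }

    public static class SimpleRelation implements Serializable {
        private String source;
        private String target;
        private String relType;
        private DataInfo dataInfo;
        public String getSource() { return source; }
        public void setSource(String source) { this.source = source; }
        public String getTarget() { return target; }
        public void setTarget(String target) { this.target = target; }
        public String getRelType() { return relType; }
        public void setRelType(String relType) { this.relType = relType; }
        public DataInfo getDataInfo() { return dataInfo; }
        public void setDataInfo(DataInfo dataInfo) { this.dataInfo = dataInfo; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("typed-relation-filter-sketch")
            .master("local[*]")
            .getOrCreate();

        // Read relations as typed objects and keep only those not virtually deleted.
        Dataset<SimpleRelation> rels = spark.read()
            .json("/tmp/relations.json")
            .as(Encoders.bean(SimpleRelation.class));

        Dataset<SimpleRelation> active = rels
            .filter((FilterFunction<SimpleRelation>) r ->
                r.getDataInfo() == null || !r.getDataInfo().isDeletedbyinference());

        active.write().mode("overwrite").json("/tmp/relations_active");
        spark.stop();
    }
}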
@@ -37,23 +37,7 @@ import scala.collection.JavaConverters;
 import scala.collection.Seq;

 /**
- * Joins the graph nodes by resolving the links of distance = 1 to create an adjacency list of linked objects. The
- * operation considers all the entity types (publication, dataset, software, ORP, project, datasource, organization, and
- * all the possible relationships (similarity links produced by the Dedup process are excluded).
- * <p>
- * The workflow is organized in different parts aimed to to reduce the complexity of the operation 1)
- * PrepareRelationsJob: only consider relationships that are not virtually deleted ($.dataInfo.deletedbyinference ==
- * false), each entity can be linked at most to 100 other objects
- * <p>
- * 2) JoinRelationEntityByTargetJob: (phase 1): prepare tuples [relation - target entity] (R - T): for each entity type
- * E_i map E_i as RelatedEntity T_i to simplify the model and extracting only the necessary information join (R.target =
- * T_i.id) save the tuples (R_i, T_i) (phase 2): create the union of all the entity types E, hash by id read the tuples
- * (R, T), hash by R.source join E.id = (R, T).source, where E becomes the Source Entity S save the tuples (S, R, T)
- * <p>
- * 3) AdjacencyListBuilderJob: given the tuple (S - R - T) we need to group by S.id -> List [ R - T ], mapping the
- * result as JoinedEntity
- * <p>
- * 4) XmlConverterJob: convert the JoinedEntities as XML records
+ * XmlConverterJob converts the JoinedEntities as XML records
  */
 public class XmlConverterJob {
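XmlConverterJob serializes each JoinedEntity as an XML record. A minimal sketch of the idea follows, mapping each grouped row to a small XML string; the element and attribute names are invented for illustration, and no XML escaping or schema handling is done here, unlike the real job.

// Hypothetical sketch of turning a grouped "joined entity" record into a small XML string
// (not the actual XmlConverterJob, which builds much richer records).
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.List;

public class XmlConverterSketch {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("xml-converter-sketch")
            .master("local[*]")
            .getOrCreate();

        // JoinedEntity-like records: (sourceId, links = list of structs); path and fields are assumptions.
        Dataset<Row> joinedEntities = spark.read().parquet("/tmp/joined_entities");

        Dataset<String> xmlRecords = joinedEntities.map((MapFunction<Row, String>) row -> {
            StringBuilder sb = new StringBuilder();
            sb.append("<record id=\"").append(row.<String>getAs("sourceId")).append("\">");
            List<Row> links = row.getList(row.fieldIndex("links"));
            for (Row link : links) {
                sb.append("<rel type=\"").append(link.<String>getAs("relType")).append("\"")
                  .append(" target=\"").append(link.<String>getAs("targetId")).append("\"/>");
            }
            sb.append("</record>");
            return sb.toString(); // no XML escaping here: this is only a sketch
        }, Encoders.STRING());

        xmlRecords.write().mode("overwrite").text("/tmp/xml_records");
        spark.stop();
    }
}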