Collection of OOZIE workflows for the OpenAIRE Graph construction, processing, provisioning.

dnet-hadoop

Dnet-hadoop is the project that defines all the OOZIE workflows for the OpenAIRE Graph construction, processing, and provisioning.

How to build, package and run oozie workflows

Oozie-installer is a utility for building, uploading and running oozie workflows. In practice, it creates a *.tar.gz package containing the resources that define a workflow, along with some helper scripts.

This module is automatically executed when running:

mvn package -Poozie-package -Dworkflow.source.dir=classpath/to/parent/directory/of/oozie_app

on a module that declares:

<parent>
    <groupId>eu.dnetlib.dhp</groupId>
    <artifactId>dhp-workflows</artifactId>
</parent>

in its pom.xml file. The oozie-package profile initializes oozie workflow packaging, and the workflow.source.dir property points to a workflow (note: this is not a relative filesystem path but a classpath to the directory that usually holds the oozie_app subdirectory).

The outcome of this packaging is an oozie-package.tar.gz file containing all the resources required to run the Oozie workflow:

  • jar packages
  • workflow definitions
  • job properties
  • maintenance scripts

Required properties

To include the proper workflow in the package, the workflow.source.dir property has to be set. It can be provided through the -Dworkflow.source.dir=some/job/dir maven parameter.

In order to define the full set of cluster environment properties, create a ~/.dhp/application.properties file with the following properties:

  • dhp.hadoop.frontend.user.name - your user name on hadoop cluster and frontend machine
  • dhp.hadoop.frontend.host.name - frontend host name
  • dhp.hadoop.frontend.temp.dir - frontend directory for temporary files
  • dhp.hadoop.frontend.port.ssh - frontend machine ssh port
  • oozieServiceLoc - oozie service location required by run_workflow.sh script executing oozie job
  • nameNode - name node address
  • jobTracker - job tracker address
  • oozie.execution.log.file.location - location of file that will be created when executing oozie job, it contains output produced by run_workflow.sh script (needed to obtain oozie job id)
  • maven.executable - mvn command location, requires parameterization due to a different setup of CI cluster
  • sparkDriverMemory - amount of memory assigned to spark jobs driver
  • sparkExecutorMemory - amount of memory assigned to spark jobs executors
  • sparkExecutorCores - number of cores assigned to spark jobs executors
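
As a rough illustration of the list above, the file could be created as follows. This is only a sketch: every value is a placeholder (host names, ports, memory sizes and the user name are invented), and DHP_CONF_DIR is a hypothetical variable introduced here so that the target directory can be redirected while experimenting.

```shell
# Sketch only: all values below are placeholders, not real cluster settings.
# DHP_CONF_DIR is a hypothetical override; the build itself reads ~/.dhp.
conf_dir="${DHP_CONF_DIR:-$HOME/.dhp}"
mkdir -p "$conf_dir"

# Do not clobber an existing configuration.
if [ -e "$conf_dir/application.properties" ]; then
  echo "keeping existing $conf_dir/application.properties" >&2
else
  cat > "$conf_dir/application.properties" <<'EOF'
dhp.hadoop.frontend.user.name=jdoe
dhp.hadoop.frontend.host.name=frontend.example.org
dhp.hadoop.frontend.temp.dir=/tmp/jdoe
dhp.hadoop.frontend.port.ssh=22
oozieServiceLoc=http://oozie.example.org:11000/oozie
nameNode=hdfs://namenode.example.org:8020
jobTracker=jobtracker.example.org:8032
oozie.execution.log.file.location=/tmp/oozie-execution.log
maven.executable=mvn
sparkDriverMemory=4G
sparkExecutorMemory=4G
sparkExecutorCores=2
EOF
fi
```

The property names are the ones listed above; only the values need to be adapted to the target cluster.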

All values will be overridden by the ones from job.properties and, if present, job-override.properties stored in the module's main folder.

To override properties from job.properties, a job-override.properties file can be created in the main module directory (the one containing the pom.xml file), defining the properties that will override the existing ones. Those properties can also be provided one by one as command line -D arguments.

Properties are applied in the following order, with later sources overriding earlier ones:

  1. pom.xml defined properties (located in the project root dir)
  2. ~/.dhp/application.properties defined properties
  3. ${workflow.source.dir}/job.properties
  4. job-override.properties (located in the project root dir)
  5. maven -Dparam=value

where the maven -Dparam=value property overrides all the other ones.
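
The "last source wins" rule can be pictured with a small shell sketch. It is not how the Maven build resolves properties internally; merge_props is a hypothetical helper written for this illustration, and the demo covers only the three file-based sources (pom.xml properties and -D arguments are left out). Keys and values are taken literally, without trimming whitespace.

```shell
# merge_props FILE...: print key=value pairs, later files overriding earlier ones.
# Hypothetical helper, for illustration only.
merge_props() {
  awk -F= '
    /^[[:space:]]*(#|$)/ { next }          # skip comments and blank lines
    {
      key = $1
      val = substr($0, length($1) + 2)     # everything after the first "="
      if (!(key in seen)) order[++n] = key # remember first-seen order of keys
      seen[key] = 1
      value[key] = val                     # a later occurrence replaces the value
    }
    END { for (i = 1; i <= n; i++) print order[i] "=" value[order[i]] }
  ' "$@"
}

# Demo with throwaway files standing in for the real ones.
tmp="$(mktemp -d)"
printf 'sparkDriverMemory=1G\nnameNode=hdfs://nn.example.org:8020\n' > "$tmp/application.properties"
printf 'sparkDriverMemory=2G\n' > "$tmp/job.properties"
printf 'sparkDriverMemory=8G\n' > "$tmp/job-override.properties"

merged="$(merge_props "$tmp/application.properties" "$tmp/job.properties" "$tmp/job-override.properties")"
echo "$merged"
```

In this demo sparkDriverMemory ends up with the value from job-override.properties, while nameNode, which is defined only once, keeps its original value.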

Workflow definition requirements

workflow.source.dir property should point to the following directory structure:

[${workflow.source.dir}]
	|
	|-job.properties (optional)
	|
	\-[oozie_app]
		|
		\-workflow.xml

This property can be set using maven -D switch.

[oozie_app] is the default directory name; however, it can be set to any value as long as the oozieAppDir property is provided with the directory name as its value.

Sub-workflows are supported as well and sub-workflow directories should be nested within [oozie_app] directory.
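
The layout described in this section can be scaffolded with a few shell commands. The snippet below is a sketch under assumptions: WF_DIR and the my_workflow directory name are hypothetical, and the workflow.xml is a minimal placeholder with no actions, not one of the project's workflows.

```shell
# Hypothetical location; in a real module this directory lives on the classpath.
wf_dir="${WF_DIR:-$(mktemp -d)/my_workflow}"
mkdir -p "$wf_dir/oozie_app"

# job.properties is optional; an empty file is enough for the scaffold.
: > "$wf_dir/job.properties"

# Minimal placeholder workflow: it starts and ends immediately.
cat > "$wf_dir/oozie_app/workflow.xml" <<'EOF'
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="End"/>
    <end name="End"/>
</workflow-app>
EOF

# Show the resulting files.
find "$wf_dir" -type f | sort
```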

Creating oozie installer step-by-step

Automated oozie-installer steps are the following:

  1. creating jar packages: *.jar and *tests.jar along with copying all dependencies in target/dependencies
  2. reading properties from maven, ~/.dhp/application.properties, job.properties, job-override.properties
  3. invoking priming mechanism linking resources from import.txt file (currently resolving subworkflow resources)
  4. assembling shell scripts for preparing Hadoop filesystem, uploading Oozie application and starting workflow
  5. copying whole ${workflow.source.dir} content to target/${oozie.package.file.name}
  6. generating updated job.properties file in target/${oozie.package.file.name} based on maven, ~/.dhp/application.properties, job.properties and job-override.properties
  7. creating lib directory (or multiple directories for sub-workflows for each nested directory) and copying jar packages created at step (1) to each one of them
  8. bundling whole ${oozie.package.file.name} directory into single tar.gz package

Uploading oozie package and running workflow on cluster

To simplify the deployment and execution process, two dedicated profiles are available:

  • deploy
  • run

to be used along with oozie-package profile e.g. by providing -Poozie-package,deploy,run maven parameters.
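
A combined invocation might look like the sketch below. The classpath value is hypothetical, and the run helper with its DRY_RUN switch is an addition made for this example so that the command is printed rather than executed by default; set DRY_RUN=0 to actually launch Maven from inside a module whose parent is dhp-workflows.

```shell
# Hypothetical classpath; replace it with the one matching your workflow.
wf_source_dir="eu/dnetlib/dhp/example/workflow"

# Print the command by default; execute it only when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Package the workflow, deploy it to the frontend and HDFS, then run it.
run mvn package -Poozie-package,deploy,run -Dworkflow.source.dir="$wf_source_dir"
```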

The deploy profile supplements packaging process with:

  1. uploading oozie-package via scp to /home/${user.name}/oozie-packages directory on ${dhp.hadoop.frontend.host.name} machine
  2. extracting uploaded package
  3. uploading oozie content to hadoop cluster HDFS location defined in oozie.wf.application.path property (generated dynamically by maven build process, based on ${dhp.hadoop.frontend.user.name} and workflow.source.dir properties)

The run profile introduces:

  1. executing the oozie application uploaded to HDFS by the deploy profile. It triggers the run_workflow.sh script, providing the runtime properties defined in the job.properties file.

Note: ssh access to the frontend machine has to be configured at the system level, and key-based authentication is preferable in order to simplify remote operations.