Collection of OOZIE workflows for the OpenAIRE Graph construction, processing, provisioning.


Dnet-hadoop is the project that defines all the OOZIE workflows for the OpenAIRE Graph construction, processing, and provisioning.

How to build, package and run oozie workflows

Oozie-installer is a utility for building, uploading and running oozie workflows. In practice, it creates a *.tar.gz package that contains the resources defining a workflow, plus some helper scripts.

This module is automatically executed when running:

mvn package -Poozie-package -Dworkflow.source.dir=classpath/to/parent/directory/of/oozie_app

on a module that sets:


in its pom.xml file. The oozie-package profile initializes oozie workflow packaging, and the workflow.source.dir property points to a workflow (note: this is not a relative path but a classpath to the directory that usually holds the oozie_app subdirectory).

The outcome of this packaging is an oozie-package.tar.gz file containing all the resources required to run the Oozie workflow:

  • jar packages
  • workflow definitions
  • job properties
  • maintenance scripts
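
To check what actually ended up in a built package, the archive can be inspected before it is uploaded. The following is a minimal Python sketch, not part of the project: the helper name `list_package` is hypothetical, and nothing is assumed about the internal layout of the archive beyond it being a gzip-compressed tar file.

```python
import tarfile


def list_package(path):
    """Return the sorted names of the regular files stored in a .tar.gz package.

    Directory entries are skipped, so the result only shows actual resources
    (jars, workflow definitions, properties, scripts).
    """
    with tarfile.open(path, "r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())
```

Calling `list_package("target/oozie-package.tar.gz")` (the path is only an illustration) would return the file names, which can then be compared against the four kinds of resources listed above.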

Required properties

To include the proper workflow in the package, the workflow.source.dir property has to be set. It can be provided with the -Dworkflow.source.dir=some/job/dir maven parameter.

In order to define the full set of cluster environment properties, one should create a ~/.dhp/ file with the following properties:

  • - your user name on hadoop cluster and frontend machine
  • - frontend host name
  • dhp.hadoop.frontend.temp.dir - frontend directory for temporary files
  • dhp.hadoop.frontend.port.ssh - frontend machine ssh port
  • oozieServiceLoc - oozie service location required by script executing oozie job
  • nameNode - name node address
  • jobTracker - job tracker address
  • oozie.execution.log.file.location - location of the file created when executing an oozie job; it contains the output produced by the script (needed to obtain the oozie job id)
  • maven.executable - mvn command location; it requires parameterization due to a different setup of the CI cluster
  • sparkDriverMemory - amount of memory assigned to the driver of spark jobs
  • sparkExecutorMemory - amount of memory assigned to the executors of spark jobs
  • sparkExecutorCores - number of cores assigned to the executors of spark jobs

All values will be overridden with the ones from and eventually stored in the module's main folder.
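
Before launching a build it can be useful to verify that the user-level properties file defines every property listed above. The sketch below is an illustration only: `parse_properties` and `missing_keys` are hypothetical helpers, the parser handles just plain `key=value` lines (not the full Java properties syntax with continuation lines or `:` separators), and the first two properties of the list are left out because their names are not given here.

```python
# Property names taken from the list above; the two entries whose names
# are not shown in this document are intentionally omitted.
REQUIRED_KEYS = [
    "dhp.hadoop.frontend.temp.dir",
    "dhp.hadoop.frontend.port.ssh",
    "oozieServiceLoc",
    "nameNode",
    "jobTracker",
    "oozie.execution.log.file.location",
    "maven.executable",
    "sparkDriverMemory",
    "sparkExecutorMemory",
    "sparkExecutorCores",
]


def parse_properties(text):
    """Parse simple 'key=value' lines, skipping blank lines and comments.

    Simplification: lines without '=' are ignored rather than reported.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props, required=REQUIRED_KEYS):
    """Return the required property names that are absent from props."""
    return [key for key in required if key not in props]
```

An empty result from `missing_keys(parse_properties(text))` means that every property named in the list is present; it says nothing about whether the values are valid for a given cluster.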

When overriding properties from, file can be created in the main module directory (the one containing the pom.xml file) to define all new properties, which will override the existing ones. One can also provide those properties one by one as command line -D arguments.

Properties overriding order is the following:

  1. pom.xml defined properties (located in the project root dir)
  2. ~/.dhp/ defined properties
  3. ${workflow.source.dir}/
  4. (located in the project root dir)
  5. maven -Dparam=value

where a maven -Dparam property overrides all the other ones.
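
The override order above amounts to merging five layers of properties, where each later layer wins over the earlier ones. A minimal Python sketch of that rule follows; the function name, the variable names and all the values are hypothetical and only serve to show the precedence, not how the maven build implements it.

```python
def resolve_properties(*layers):
    """Merge property dictionaries; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Hypothetical values, one dictionary per layer, in the order listed above.
pom_defaults = {"sparkDriverMemory": "1G", "sparkExecutorCores": "1"}
user_home = {"sparkDriverMemory": "4G"}
workflow_dir = {}
project_root = {"sparkExecutorCores": "2"}
command_line = {"sparkExecutorCores": "8"}

resolved = resolve_properties(
    pom_defaults, user_home, workflow_dir, project_root, command_line
)
```

Here `sparkDriverMemory` ends up with the value from the second layer, since no later layer redefines it, while `sparkExecutorCores` takes the command line value.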

Workflow definition requirements

workflow.source.dir property should point to the following directory structure:

	| (optional)

This property can be set using the maven -D switch.

[oozie_app] is the default directory name; however, it can be set to any value as long as the oozieAppDir property is provided with the directory name as its value.

Sub-workflows are supported as well; their directories should be nested within the [oozie_app] directory.

Creating oozie installer step-by-step

Automated oozie-installer steps are the following:

  1. creating jar packages: *.jar and *tests.jar along with copying all dependencies in target/dependencies
  2. reading properties from maven, ~/.dhp/,,
  3. invoking priming mechanism linking resources from import.txt file (currently resolving subworkflow resources)
  4. assembling shell scripts for preparing Hadoop filesystem, uploading Oozie application and starting workflow
  5. copying whole ${workflow.source.dir} content to target/${}
  6. generating updated file in target/${} based on maven, ~/.dhp/, and
  7. creating lib directory (or multiple directories for sub-workflows for each nested directory) and copying jar packages created at step (1) to each one of them
  8. bundling whole ${} directory into single tar.gz package
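
Steps 7 and 8 can be pictured with the following Python sketch. It is not the installer's actual implementation: the helper names are hypothetical, and it assumes that a sub-workflow directory can be recognised by the presence of a `workflow.xml` file, which is the usual Oozie convention but is not stated in this document.

```python
import shutil
import tarfile
from pathlib import Path


def workflow_dirs(app_dir):
    """Return app_dir plus every nested directory that holds a workflow.xml.

    Assumption: a nested workflow.xml marks a sub-workflow directory.
    """
    app_dir = Path(app_dir)
    nested = sorted(
        p.parent for p in app_dir.rglob("workflow.xml") if p.parent != app_dir
    )
    return [app_dir] + nested


def install_jars(app_dir, jars):
    """Create a lib directory in each workflow directory and copy the jars in."""
    created = []
    for directory in workflow_dirs(app_dir):
        lib = directory / "lib"
        lib.mkdir(exist_ok=True)
        for jar in jars:
            shutil.copy2(jar, lib / Path(jar).name)
        created.append(lib)
    return created


def bundle(package_dir, archive_path):
    """Pack the whole package directory into a single gzip-compressed tar file."""
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(package_dir, arcname=Path(package_dir).name)
    return archive_path
```

Copying the same jars into every lib directory trades disk space for simplicity: each workflow and sub-workflow directory carries what it needs without relying on a shared location.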

Uploading oozie package and running workflow on cluster

To simplify the deployment and execution process, two dedicated profiles were introduced:

  • deploy
  • run

to be used along with the oozie-package profile, e.g. by providing the -Poozie-package,deploy,run maven parameters.
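
When the build is launched from a script, the maven command line can be assembled from the pieces described so far. The Python sketch below only composes the argument list and does not run anything; `build_mvn_command` is a hypothetical helper, and it assumes that the `package` goal shown earlier is also the one used together with the deploy and run profiles.

```python
def build_mvn_command(workflow_source_dir, deploy=False, run=False, overrides=None):
    """Compose the mvn argument list for packaging a workflow.

    The oozie-package profile is always enabled; deploy and run are added
    on request. Extra properties are passed as -Dkey=value arguments.
    """
    profiles = ["oozie-package"]
    if deploy:
        profiles.append("deploy")
    if run:
        profiles.append("run")
    command = [
        "mvn",
        "package",
        "-P" + ",".join(profiles),
        "-Dworkflow.source.dir=" + workflow_source_dir,
    ]
    for key, value in (overrides or {}).items():
        command.append(f"-D{key}={value}")
    return command
```

The returned list can be handed to `subprocess.run` as is, which avoids shell quoting issues with property values that contain spaces.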

The deploy profile supplements the packaging process with:

  1. uploading oozie-package via scp to /home/${}/oozie-packages directory on ${} machine
  2. extracting uploaded package
  3. uploading oozie content to hadoop cluster HDFS location defined in property (generated dynamically by maven build process, based on ${} and workflow.source.dir properties)

The run profile introduces:

  1. executing the oozie application uploaded to the HDFS cluster using the deploy command. Triggers script providing runtime properties defined in file.

Note: ssh access to the frontend machine has to be configured at the system level, and key-based authentication is preferable in order to simplify remote operations.