# dnet-hadoop
Dnet-hadoop is the project that defines all the OOZIE workflows for the OpenAIRE Graph construction, processing and provisioning.
This project adheres to the Contributor Covenant code of conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to dnet-team@isti.cnr.it.
This project is licensed under the AGPL v3 or later version.
## How to build, package and run oozie workflows
The oozie-installer is a utility for building, uploading and running oozie workflows. In practice, it creates a `*.tar.gz` package that contains the resources defining a workflow and some helper scripts.

This module is automatically executed when running:

```
mvn package -Poozie-package -Dworkflow.source.dir=classpath/to/parent/directory/of/oozie_app
```
on a module having set:

```xml
<parent>
    <groupId>eu.dnetlib.dhp</groupId>
    <artifactId>dhp-workflows</artifactId>
</parent>
```

in its `pom.xml` file. The `oozie-package` profile initializes the oozie workflow packaging, while the `workflow.source.dir` property points to a workflow (notice: this is not a relative path but a classpath to the directory usually holding the `oozie_app` subdirectory).
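For instance, a sketch of such an invocation for a workflow whose `oozie_app` directory lives under `src/main/resources/eu/dnetlib/dhp/example` (a hypothetical module layout, not an actual workflow of this project) would be:

```
# package the hypothetical workflow found on the classpath under eu/dnetlib/dhp/example
mvn clean package -Poozie-package \
    -Dworkflow.source.dir=eu/dnetlib/dhp/example
```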
The outcome of this packaging is an `oozie-package.tar.gz` file containing all the resources required to run the Oozie workflow:

- jar packages
- workflow definitions
- job properties
- maintenance scripts
### Required properties
In order to include the proper workflow within the package, the `workflow.source.dir` property has to be set. It can be provided by setting the `-Dworkflow.source.dir=some/job/dir` maven parameter.

In order to define the full set of cluster environment properties one should create a `~/.dhp/application.properties` file with the following properties:
- `dhp.hadoop.frontend.user.name` - your user name on the hadoop cluster and frontend machine
- `dhp.hadoop.frontend.host.name` - frontend host name
- `dhp.hadoop.frontend.temp.dir` - frontend directory for temporary files
- `dhp.hadoop.frontend.port.ssh` - frontend machine ssh port
- `oozieServiceLoc` - oozie service location, required by the `run_workflow.sh` script executing the oozie job
- `nameNode` - name node address
- `jobTracker` - job tracker address
- `oozie.execution.log.file.location` - location of the file that will be created when executing the oozie job; it contains the output produced by the `run_workflow.sh` script (needed to obtain the oozie job id)
- `maven.executable` - mvn command location, requires parameterization due to a different setup of the CI cluster
- `sparkDriverMemory` - amount of memory assigned to the spark jobs driver
- `sparkExecutorMemory` - amount of memory assigned to the spark jobs executors
- `sparkExecutorCores` - number of cores assigned to the spark jobs executors
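For reference, a minimal sketch of such a `~/.dhp/application.properties` file might look as follows; every host name, port, path and memory value below is a hypothetical placeholder to be replaced with the values of your own cluster:

```
# ssh access to the frontend machine (placeholders)
dhp.hadoop.frontend.user.name=jsmith
dhp.hadoop.frontend.host.name=hadoop-frontend.example.org
dhp.hadoop.frontend.temp.dir=/tmp
dhp.hadoop.frontend.port.ssh=22

# cluster service endpoints (placeholders)
oozieServiceLoc=http://oozie.example.org:11000/oozie
nameNode=hdfs://namenode.example.org:8020
jobTracker=jobtracker.example.org:8032

# local build settings (placeholders)
oozie.execution.log.file.location=target/oozie-execution.log
maven.executable=mvn

# spark resource settings (illustrative values)
sparkDriverMemory=2G
sparkExecutorMemory=4G
sparkExecutorCores=2
```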
All values will be overridden with the ones from `job.properties` and eventually `job-override.properties` stored in the module's main folder.

When overriding properties from `job.properties`, a `job-override.properties` file can be created in the main module directory (the one containing the `pom.xml` file) to define all the new properties overriding the existing ones. Those properties can also be provided one by one as command line `-D` arguments.
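As an illustration, a `job-override.properties` placed next to the `pom.xml` could pin a couple of values (the property names come from the list above; the values are hypothetical):

```
# override the spark memory settings for this module only (illustrative values)
sparkDriverMemory=4G
sparkExecutorMemory=8G
```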
Properties overriding order is the following:

1. `pom.xml` defined properties (located in the project root dir)
2. `~/.dhp/application.properties` defined properties
3. `${workflow.source.dir}/job.properties`
4. `job-override.properties` (located in the project root dir)
5. `maven -Dparam=value`

where the maven `-Dparam` property overrides all the other ones.
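For example, since `-D` parameters win over every other source, a single run can be tuned on the command line without touching any properties file (the memory value below is just an illustration):

```
# -DsparkDriverMemory takes precedence over pom.xml, application.properties and job.properties
mvn package -Poozie-package \
    -Dworkflow.source.dir=some/job/dir \
    -DsparkDriverMemory=4G
```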
### Workflow definition requirements
The `workflow.source.dir` property should point to the following directory structure:

```
[${workflow.source.dir}]
|
|-job.properties (optional)
|
\-[oozie_app]
      |
      \-workflow.xml
```

This property can be set using the maven `-D` switch.

`[oozie_app]` is the default directory name, however it can be set to any value, as long as the `oozieAppDir` property is provided with the directory name as value.

Sub-workflows are supported as well; sub-workflow directories should be nested within the `[oozie_app]` directory.
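For orientation, a minimal `workflow.xml` skeleton might look like the following; the workflow name and the Oozie schema version are illustrative assumptions, not values mandated by this project:

```xml
<workflow-app name="example_wf" xmlns="uri:oozie:workflow:0.5">
    <!-- real workflows transition from start to one or more actions;
         this skeleton goes straight to the end node -->
    <start to="end"/>
    <end name="end"/>
</workflow-app>
```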
### Creating oozie installer step-by-step
Automated oozie-installer steps are the following:

1. creating jar packages: `*.jar` and `*tests.jar`, along with copying all dependencies into `target/dependencies`
2. reading properties from maven, `~/.dhp/application.properties`, `job.properties`, `job-override.properties`
3. invoking the priming mechanism linking resources from the import.txt file (currently resolving subworkflow resources)
4. assembling shell scripts for preparing the Hadoop filesystem, uploading the Oozie application and starting the workflow
5. copying the whole `${workflow.source.dir}` content to `target/${oozie.package.file.name}`
6. generating an updated `job.properties` file in `target/${oozie.package.file.name}` based on maven, `~/.dhp/application.properties`, `job.properties` and `job-override.properties`
7. creating a `lib` directory (or multiple directories for sub-workflows, one for each nested directory) and copying the jar packages created at step (1) to each one of them
8. bundling the whole `${oozie.package.file.name}` directory into a single tar.gz package
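Once the build completes, the content of the generated archive can be inspected with standard tools; the exact file name under `target/` depends on the `oozie.package.file.name` property, `oozie-package.tar.gz` being the default mentioned above:

```
tar -tzf target/oozie-package.tar.gz
```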
### Uploading oozie package and running workflow on cluster
In order to simplify the deployment and execution process, two dedicated profiles were introduced:

- `deploy`
- `run`

to be used along with the `oozie-package` profile, e.g. by providing the `-Poozie-package,deploy,run` maven parameters.
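A complete build-deploy-run invocation could therefore look like this (the workflow path is a placeholder, as before):

```
# build the package, upload it to the frontend machine and start the oozie job
mvn clean package -Poozie-package,deploy,run \
    -Dworkflow.source.dir=some/job/dir
```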
The `deploy` profile supplements the packaging process with:

- uploading the oozie-package via scp to the `/home/${user.name}/oozie-packages` directory on the `${dhp.hadoop.frontend.host.name}` machine
- extracting the uploaded package
- uploading the oozie content to the hadoop cluster HDFS location defined in the `oozie.wf.application.path` property (generated dynamically by the maven build process, based on the `${dhp.hadoop.frontend.user.name}` and `workflow.source.dir` properties)
The `run` profile introduces:

- executing the oozie application uploaded to the HDFS cluster using the `deploy` command; triggers the `run_workflow.sh` script providing the runtime properties defined in the `job.properties` file
Notice: ssh access to the frontend machine has to be configured at system level, and it is preferable to set up key-based authentication in order to simplify remote operations.
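A typical way to set up key-based authentication with a standard OpenSSH client is sketched below; the user name, host name and port are placeholders matching the `dhp.hadoop.frontend.*` properties of your own setup:

```
# generate a key pair (if one does not exist yet)
ssh-keygen -t ed25519

# install the public key on the frontend machine (placeholder names)
ssh-copy-id -p 22 jsmith@hadoop-frontend.example.org
```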