From 76447958bb538c75872d6b5f0fef184e97b42d55 Mon Sep 17 00:00:00 2001
From: Claudio Atzori
Date: Thu, 12 Oct 2023 12:23:20 +0200
Subject: [PATCH] cleanup & docs

---
 README.md                                     | 128 +++++++++++++++++-
 dhp-workflows/dhp-distcp/pom.xml              |  13 --
 .../dhp/distcp/oozie_app/config-default.xml   |  18 ---
 .../dnetlib/dhp/distcp/oozie_app/workflow.xml |  46 -------
 dhp-workflows/docs/oozie-installer.markdown   | 111 ---------------
 dhp-workflows/pom.xml                         |   1 -
 6 files changed, 127 insertions(+), 190 deletions(-)
 delete mode 100644 dhp-workflows/dhp-distcp/pom.xml
 delete mode 100644 dhp-workflows/dhp-distcp/src/main/resources/eu/dnetlib/dhp/distcp/oozie_app/config-default.xml
 delete mode 100644 dhp-workflows/dhp-distcp/src/main/resources/eu/dnetlib/dhp/distcp/oozie_app/workflow.xml
 delete mode 100644 dhp-workflows/docs/oozie-installer.markdown

diff --git a/README.md b/README.md
index 0a0bd82ab..2c1440f44 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,128 @@
 # dnet-hadoop
-Dnet-hadoop is the project that defined all the OOZIE workflows for the OpenAIRE Graph construction, processing, provisioning.
\ No newline at end of file
+
+Dnet-hadoop is the project that defines all the [OOZIE workflows](https://oozie.apache.org/) for the OpenAIRE Graph construction, processing and provisioning.
+
+How to build, package and run oozie workflows
+====================
+
+Oozie-installer is a utility that allows building, uploading and running oozie workflows. In practice, it creates a `*.tar.gz`
+package that contains the resources defining a workflow together with some helper scripts.
+
+This module is automatically executed when running:
+
+`mvn package -Poozie-package -Dworkflow.source.dir=classpath/to/parent/directory/of/oozie_app`
+
+on a module having set:
+
+```
+<parent>
+    <groupId>eu.dnetlib.dhp</groupId>
+    <artifactId>dhp-workflows</artifactId>
+</parent>
+```
+
+in its `pom.xml` file. The `oozie-package` profile initializes the oozie workflow packaging, while the `workflow.source.dir`
+property points to a workflow (notice: this is not a relative path but a classpath to the directory usually holding the
+`oozie_app` subdirectory).
+
+The outcome of this packaging is the `oozie-package.tar.gz` file, containing all the resources required to run the Oozie workflow:
+
+- jar packages
+- workflow definitions
+- job properties
+- maintenance scripts
+
+Required properties
+====================
+
+In order to include the proper workflow within the package, the `workflow.source.dir` property has to be set. It can be
+provided via the `-Dworkflow.source.dir=some/job/dir` maven parameter.
+
+In order to define the full set of cluster environment properties one should create a `~/.dhp/application.properties` file
+with the following properties:
+
+- `dhp.hadoop.frontend.user.name` - your user name on the hadoop cluster and the frontend machine
+- `dhp.hadoop.frontend.host.name` - frontend host name
+- `dhp.hadoop.frontend.temp.dir` - frontend directory for temporary files
+- `dhp.hadoop.frontend.port.ssh` - frontend machine ssh port
+- `oozieServiceLoc` - oozie service location, required by the `run_workflow.sh` script executing the oozie job
+- `nameNode` - name node address
+- `jobTracker` - job tracker address
+- `oozie.execution.log.file.location` - location of the file created when executing the oozie job; it contains the output
+produced by the `run_workflow.sh` script (needed to obtain the oozie job id)
+- `maven.executable` - mvn command location, requires parameterization due to a different setup of the CI cluster
+- `sparkDriverMemory` - amount of memory assigned to the spark jobs driver
+- `sparkExecutorMemory` - amount of memory assigned to the spark jobs executors
+- `sparkExecutorCores` - number of cores assigned to the spark jobs executors
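+
+As a reference, a minimal `~/.dhp/application.properties` could look like the sketch below. All host names, ports, paths
+and memory values are illustrative placeholders to be adapted to the actual cluster and are not prescribed by this project:
+
+```
+# illustrative values only, to be adapted to the target environment
+dhp.hadoop.frontend.user.name=jsmith
+dhp.hadoop.frontend.host.name=hadoop-frontend.example.org
+dhp.hadoop.frontend.temp.dir=/tmp/dhp
+dhp.hadoop.frontend.port.ssh=22
+oozieServiceLoc=http://hadoop-frontend.example.org:11000/oozie
+nameNode=hdfs://nameservice1
+jobTracker=yarnRM
+oozie.execution.log.file.location=target/extract-and-run-on-remote-host.log
+maven.executable=mvn
+sparkDriverMemory=7G
+sparkExecutorMemory=7G
+sparkExecutorCores=4
+```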
+
+All values will be overridden with the ones from `job.properties` and, if present, `job-override.properties` stored in the
+module's main folder.
+
+To override properties from `job.properties`, a `job-override.properties` file can be created in the main module directory
+(the one containing the `pom.xml` file), defining all the new properties that will override the existing ones.
+Alternatively, those properties can be provided one by one as command line `-D` arguments.
+
+The property overriding order is the following:
+
+1. `pom.xml` defined properties (located in the project root dir)
+2. `~/.dhp/application.properties` defined properties
+3. `${workflow.source.dir}/job.properties`
+4. `job-override.properties` (located in the project root dir)
+5. `maven -Dparam=value`
+
+where the maven `-Dparam` property overrides all the other ones.
+
+Workflow definition requirements
+====================
+
+The `workflow.source.dir` property should point to the following directory structure:
+
+    [${workflow.source.dir}]
+    |
+    |-job.properties (optional)
+    |
+    \-[oozie_app]
+    |
+    \-workflow.xml
+
+This property can be set using the maven `-D` switch.
+
+`[oozie_app]` is the default directory name, however it can be set to any value as long as the `oozieAppDir` property is
+provided with the directory name as value.
+
+Sub-workflows are supported as well and sub-workflow directories should be nested within the `[oozie_app]` directory.
+
+Creating oozie installer step-by-step
+=====================================
+
+The automated oozie-installer steps are the following:
+
+1. creating jar packages: `*.jar` and `*tests.jar` along with copying all dependencies in `target/dependencies`
+2. reading properties from maven, `~/.dhp/application.properties`, `job.properties`, `job-override.properties`
+3. invoking the priming mechanism linking resources from the import.txt file (currently resolving sub-workflow resources)
+4. assembling shell scripts for preparing the Hadoop filesystem, uploading the Oozie application and starting the workflow
+5. copying the whole `${workflow.source.dir}` content to `target/${oozie.package.file.name}`
+6. generating an updated `job.properties` file in `target/${oozie.package.file.name}` based on maven,
+`~/.dhp/application.properties`, `job.properties` and `job-override.properties`
+7. creating the `lib` directory (or multiple directories, one for each nested sub-workflow directory) and copying the jar packages
+created at step (1) to each one of them
+8. bundling the whole `${oozie.package.file.name}` directory into a single tar.gz package
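+
+Putting the steps above together, the whole cycle of packaging, deploying and running a hypothetical workflow can be
+triggered with a single command such as the one below, where the `workflow.source.dir` value is purely illustrative and
+the `deploy` and `run` profiles are described in the next section:
+
+```
+mvn package -Poozie-package,deploy,run \
+    -Dworkflow.source.dir=eu/dnetlib/dhp/collection
+```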
+
+Uploading oozie package and running workflow on cluster
+=======================================================
+
+In order to simplify the deployment and execution process, two dedicated profiles were introduced:
+
+- `deploy`
+- `run`
+
+to be used along with the `oozie-package` profile, e.g. by providing the `-Poozie-package,deploy,run` maven parameters.
+
+The `deploy` profile supplements the packaging process with:
+1) uploading the oozie-package via scp to the `/home/${user.name}/oozie-packages` directory on the `${dhp.hadoop.frontend.host.name}` machine
+2) extracting the uploaded package
+3) uploading the oozie content to the hadoop cluster HDFS location defined in the `oozie.wf.application.path` property (generated dynamically by the maven build process, based on the `${dhp.hadoop.frontend.user.name}` and `workflow.source.dir` properties)
+
+The `run` profile introduces:
+1) executing the oozie application uploaded to the HDFS cluster by the `deploy` step; it triggers the `run_workflow.sh` script, providing the runtime properties defined in the `job.properties` file.
+
+Notice: ssh access to the frontend machine has to be configured at system level and it is preferable to set up key-based authentication in order to simplify remote operations.
\ No newline at end of file
diff --git a/dhp-workflows/dhp-distcp/pom.xml b/dhp-workflows/dhp-distcp/pom.xml deleted file mode 100644 index c3d3a7375..000000000 --- a/dhp-workflows/dhp-distcp/pom.xml +++ /dev/null @@ -1,13 +0,0 @@ - - - - dhp-workflows - eu.dnetlib.dhp - 1.2.5-SNAPSHOT - - 4.0.0 - - dhp-distcp - - - \ No newline at end of file
diff --git a/dhp-workflows/dhp-distcp/src/main/resources/eu/dnetlib/dhp/distcp/oozie_app/config-default.xml b/dhp-workflows/dhp-distcp/src/main/resources/eu/dnetlib/dhp/distcp/oozie_app/config-default.xml deleted file mode 100644 index 905fb9984..000000000 --- a/dhp-workflows/dhp-distcp/src/main/resources/eu/dnetlib/dhp/distcp/oozie_app/config-default.xml +++ /dev/null @@ -1,18 +0,0 @@ - - - jobTracker - yarnRM - - - nameNode - hdfs://nameservice1 - - - sourceNN - webhdfs://namenode2.hadoop.dm.openaire.eu:50071 - - - oozie.use.system.libpath - true - - \ No newline at end of file
diff --git a/dhp-workflows/dhp-distcp/src/main/resources/eu/dnetlib/dhp/distcp/oozie_app/workflow.xml b/dhp-workflows/dhp-distcp/src/main/resources/eu/dnetlib/dhp/distcp/oozie_app/workflow.xml deleted file mode 100644 index 91b97332b..000000000 --- a/dhp-workflows/dhp-distcp/src/main/resources/eu/dnetlib/dhp/distcp/oozie_app/workflow.xml +++ /dev/null @@ -1,46 +0,0 @@ - - - - sourceNN - the source name node - - - sourcePath - the source path - - - targetPath - the target path - - - hbase_dump_distcp_memory_mb - 6144 - memory for distcp action copying InfoSpace dump from remote cluster - - - hbase_dump_distcp_num_maps - 1 - maximum number of simultaneous copies of InfoSpace dump from remote location - - - - - - - Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}] - - - - - -Dmapreduce.map.memory.mb=${hbase_dump_distcp_memory_mb} - -pb - -m ${hbase_dump_distcp_num_maps} - ${sourceNN}/${sourcePath} - ${nameNode}/${targetPath} - - - - - - - \ No newline at end of file
diff --git a/dhp-workflows/docs/oozie-installer.markdown b/dhp-workflows/docs/oozie-installer.markdown deleted file mode 100644 index d2de80dcc..000000000 ---
a/dhp-workflows/docs/oozie-installer.markdown +++ /dev/null @@ -1,111 +0,0 @@ -General notes -==================== - -Oozie-installer is a utility allowing building, uploading and running oozie workflows. In practice, it creates a `*.tar.gz` package that contains resouces that define a workflow and some helper scripts. - -This module is automatically executed when running: - -`mvn package -Poozie-package -Dworkflow.source.dir=classpath/to/parent/directory/of/oozie_app` - -on module having set: - - - eu.dnetlib.dhp - dhp-workflows - - -in `pom.xml` file. `oozie-package` profile initializes oozie workflow packaging, `workflow.source.dir` property points to a workflow (notice: this is not a relative path but a classpath to directory usually holding `oozie_app` subdirectory). - -The outcome of this packaging is `oozie-package.tar.gz` file containing inside all the resources required to run Oozie workflow: - -- jar packages -- workflow definitions -- job properties -- maintenance scripts - -Required properties -==================== - -In order to include proper workflow within package, `workflow.source.dir` property has to be set. It could be provided by setting `-Dworkflow.source.dir=some/job/dir` maven parameter. - -In oder to define full set of cluster environment properties one should create `~/.dhp/application.properties` file with the following properties: - -- `dhp.hadoop.frontend.user.name` - your user name on hadoop cluster and frontend machine -- `dhp.hadoop.frontend.host.name` - frontend host name -- `dhp.hadoop.frontend.temp.dir` - frontend directory for temporary files -- `dhp.hadoop.frontend.port.ssh` - frontend machine ssh port -- `oozieServiceLoc` - oozie service location required by run_workflow.sh script executing oozie job -- `nameNode` - name node address -- `jobTracker` - job tracker address -- `oozie.execution.log.file.location` - location of file that will be created when executing oozie job, it contains output produced by `run_workflow.sh` script (needed to obtain oozie job id) -- `maven.executable` - mvn command location, requires parameterization due to a different setup of CI cluster -- `sparkDriverMemory` - amount of memory assigned to spark jobs driver -- `sparkExecutorMemory` - amount of memory assigned to spark jobs executors -- `sparkExecutorCores` - number of cores assigned to spark jobs executors - -All values will be overriden with the ones from `job.properties` and eventually `job-override.properties` stored in module's main folder. - -When overriding properties from `job.properties`, `job-override.properties` file can be created in main module directory (the one containing `pom.xml` file) and define all new properties which will override existing properties. One can provide those properties one by one as command line -D arguments. - -Properties overriding order is the following: - -1. `pom.xml` defined properties (located in the project root dir) -2. `~/.dhp/application.properties` defined properties -3. `${workflow.source.dir}/job.properties` -4. `job-override.properties` (located in the project root dir) -5. `maven -Dparam=value` - -where the maven `-Dparam` property is overriding all the other ones. - -Workflow definition requirements -==================== - -`workflow.source.dir` property should point to the following directory structure: - - [${workflow.source.dir}] - | - |-job.properties (optional) - | - \-[oozie_app] - | - \-workflow.xml - -This property can be set using maven `-D` switch. 
- -`[oozie_app]` is the default directory name however it can be set to any value as soon as `oozieAppDir` property is provided with directory name as value. - -Subworkflows are supported as well and subworkflow directories should be nested within `[oozie_app]` directory. - -Creating oozie installer step-by-step -===================================== - -Automated oozie-installer steps are the following: - -1. creating jar packages: `*.jar` and `*tests.jar` along with copying all dependancies in `target/dependencies` -2. reading properties from maven, `~/.dhp/application.properties`, `job.properties`, `job-override.properties` -3. invoking priming mechanism linking resources from import.txt file (currently resolving subworkflow resources) -4. assembling shell scripts for preparing Hadoop filesystem, uploading Oozie application and starting workflow -5. copying whole `${workflow.source.dir}` content to `target/${oozie.package.file.name}` -6. generating updated `job.properties` file in `target/${oozie.package.file.name}` based on maven, `~/.dhp/application.properties`, `job.properties` and `job-override.properties` -7. creating `lib` directory (or multiple directories for subworkflows for each nested directory) and copying jar packages created at step (1) to each one of them -8. bundling whole `${oozie.package.file.name}` directory into single tar.gz package - -Uploading oozie package and running workflow on cluster -======================================================= - -In order to simplify deployment and execution process two dedicated profiles were introduced: - -- `deploy` -- `run` - -to be used along with `oozie-package` profile e.g. by providing `-Poozie-package,deploy,run` maven parameters. - -`deploy` profile supplements packaging process with: -1) uploading oozie-package via scp to `/home/${user.name}/oozie-packages` directory on `${dhp.hadoop.frontend.host.name}` machine -2) extracting uploaded package -3) uploading oozie content to hadoop cluster HDFS location defined in `oozie.wf.application.path` property (generated dynamically by maven build process, based on `${dhp.hadoop.frontend.user.name}` and `workflow.source.dir` properties) - -`run` profile introduces: -1) executing oozie application uploaded to HDFS cluster using `deploy` command. Triggers `run_workflow.sh` script providing runtime properties defined in `job.properties` file. - -Notice: ssh access to frontend machine has to be configured on system level and it is preferable to set key-based authentication in order to simplify remote operations. \ No newline at end of file diff --git a/dhp-workflows/pom.xml b/dhp-workflows/pom.xml index 64f5f2d26..369c71b5b 100644 --- a/dhp-workflows/pom.xml +++ b/dhp-workflows/pom.xml @@ -25,7 +25,6 @@ dhp-workflow-profiles dhp-aggregation - dhp-distcp dhp-actionmanager dhp-graph-mapper dhp-dedup-openaire