dnet-hadoop

dnet-hadoop is the project that defines all the Oozie workflows for the OpenAIRE Graph construction, processing, and provisioning.

How to build, package and run Oozie workflows

The oozie-installer is a utility for building, uploading and running Oozie workflows. In practice, it creates a *.tar.gz package that contains the resources defining a workflow together with some helper scripts.

This module is automatically executed when running:

mvn package -Poozie-package -Dworkflow.source.dir=classpath/to/parent/directory/of/oozie_app

on a module that declares the following parent:

<parent>
    <groupId>eu.dnetlib.dhp</groupId>
    <artifactId>dhp-workflows</artifactId>
</parent>

in its pom.xml file. The oozie-package profile initializes the Oozie workflow packaging, while the workflow.source.dir property points to a workflow (notice: this is not a relative path but a classpath to the directory usually holding the oozie_app subdirectory).
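
For example, assuming a workflow descriptor stored under a hypothetical eu/dnetlib/dhp/examples/oozie_app directory on the classpath, the invocation would look like:

mvn clean package -Poozie-package -Dworkflow.source.dir=eu/dnetlib/dhp/examples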

The outcome of this packaging is an oozie-package.tar.gz file containing all the resources required to run the Oozie workflow:

  • jar packages
  • workflow definitions
  • job properties
  • maintenance scripts

Required properties

In order to include the proper workflow within the package, the workflow.source.dir property has to be set. It can be provided via the -Dworkflow.source.dir=some/job/dir Maven parameter.

In order to define the full set of cluster environment properties, one should create a ~/.dhp/application.properties file with the following properties (a sample file is sketched after this list):

  • dhp.hadoop.frontend.user.name - your user name on the Hadoop cluster and frontend machine
  • dhp.hadoop.frontend.host.name - the frontend host name
  • dhp.hadoop.frontend.temp.dir - the frontend directory for temporary files
  • dhp.hadoop.frontend.port.ssh - the frontend machine SSH port
  • oozieServiceLoc - the Oozie service location, required by the run_workflow.sh script executing the Oozie job
  • nameNode - the name node address
  • jobTracker - the job tracker address
  • oozie.execution.log.file.location - the location of the file created when executing the Oozie job; it contains the output produced by the run_workflow.sh script (needed to obtain the Oozie job id)
  • maven.executable - the mvn command location; requires parameterization due to the different setup of the CI cluster
  • sparkDriverMemory - the amount of memory assigned to the Spark job driver
  • sparkExecutorMemory - the amount of memory assigned to the Spark job executors
  • sparkExecutorCores - the number of cores assigned to the Spark job executors
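
A minimal sketch of such a file follows; the host names, paths and sizes below are illustrative assumptions to be adapted to your own cluster:

# ~/.dhp/application.properties -- illustrative values only
dhp.hadoop.frontend.user.name=jsmith
dhp.hadoop.frontend.host.name=hadoop-frontend.example.org
dhp.hadoop.frontend.temp.dir=/tmp/jsmith
dhp.hadoop.frontend.port.ssh=22
oozieServiceLoc=http://oozie.example.org:11000/oozie
nameNode=hdfs://namenode.example.org:8020
jobTracker=jobtracker.example.org:8032
oozie.execution.log.file.location=/tmp/jsmith/oozie-execution.log
maven.executable=mvn
sparkDriverMemory=4G
sparkExecutorMemory=4G
sparkExecutorCores=2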

All values will be overridden by the ones from job.properties and, optionally, job-override.properties stored in the module's main folder.

To override properties from job.properties, a job-override.properties file can be created in the main module directory (the one containing the pom.xml file) defining the properties that should override the existing ones. Alternatively, those properties can be provided one by one as command-line -D arguments (see the sample below).
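
For example, a sketch of a job-override.properties that raises the Spark resources for a single module (the values are illustrative):

# job-override.properties -- placed next to the module's pom.xml
sparkExecutorMemory=8G
sparkExecutorCores=4

The same effect can be obtained on the command line with -DsparkExecutorMemory=8G, which takes the highest precedence (see the ordering below).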

Properties overriding order is the following:

  1. pom.xml defined properties (located in the project root dir)
  2. ~/.dhp/application.properties defined properties
  3. ${workflow.source.dir}/job.properties
  4. job-override.properties (located in the project root dir)
  5. maven -Dparam=value

where the Maven -Dparam property overrides all the others.

Workflow definition requirements

The workflow.source.dir property should point to the following directory structure:

[${workflow.source.dir}]
	|
	|-job.properties (optional)
	|
	\-[oozie_app]
		|
		\-workflow.xml

This property can be set using the Maven -D switch.

[oozie_app] is the default directory name; however, it can be set to any value as long as the oozieAppDir property is provided with the directory name as its value.

Sub-workflows are supported as well; their directories should be nested within the [oozie_app] directory.
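
For reference, a minimal workflow.xml sketch (the workflow name and schema version are illustrative; a real workflow adds actions between start and end):

<workflow-app xmlns="uri:oozie:workflow:0.5" name="example-wf">
    <!-- a trivial workflow: the start node transitions directly to the end node -->
    <start to="end"/>
    <end name="end"/>
</workflow-app>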

Creating the oozie-installer package step by step

The automated oozie-installer steps are the following:

  1. creating jar packages: *.jar and *tests.jar, along with copying all dependencies into target/dependencies
  2. reading properties from Maven, ~/.dhp/application.properties, job.properties and job-override.properties
  3. invoking the priming mechanism that links resources from the import.txt file (currently resolving sub-workflow resources)
  4. assembling shell scripts for preparing the Hadoop filesystem, uploading the Oozie application and starting the workflow
  5. copying the whole ${workflow.source.dir} content to target/${oozie.package.file.name}
  6. generating an updated job.properties file in target/${oozie.package.file.name} based on Maven, ~/.dhp/application.properties, job.properties and job-override.properties
  7. creating a lib directory (or multiple lib directories, one for each nested sub-workflow directory) and copying the jar packages created at step (1) into each of them
  8. bundling the whole ${oozie.package.file.name} directory into a single tar.gz package
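
Assuming the default oozie_app directory name and a single workflow, a plausible layout of the resulting package is sketched below (the exact contents depend on the workflow being packaged):

[oozie-package.tar.gz]
	|
	|-job.properties (generated at step 6)
	|
	\-[oozie_app]
		|
		|-workflow.xml
		|
		\-[lib] (jar packages from step 1)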

Uploading the Oozie package and running the workflow on the cluster

In order to simplify the deployment and execution process, two dedicated profiles were introduced:

  • deploy
  • run

to be used along with the oozie-package profile, e.g. by providing the -Poozie-package,deploy,run Maven parameters.

The deploy profile supplements the packaging process with:

  1. uploading the oozie-package via scp to the /home/${user.name}/oozie-packages directory on the ${dhp.hadoop.frontend.host.name} machine
  2. extracting the uploaded package
  3. uploading the Oozie content to the Hadoop cluster HDFS location defined in the oozie.wf.application.path property (generated dynamically by the Maven build process, based on the ${dhp.hadoop.frontend.user.name} and workflow.source.dir properties)

The run profile introduces:

  1. executing the Oozie application uploaded to the HDFS cluster by the deploy command; it triggers the run_workflow.sh script, providing the runtime properties defined in the job.properties file
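
Putting it all together, a typical end-to-end invocation could look like the following (the workflow.source.dir value is a hypothetical example):

mvn clean package -Poozie-package,deploy,run -Dworkflow.source.dir=eu/dnetlib/dhp/examples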

Notice: SSH access to the frontend machine has to be configured at the system level, and it is preferable to set up key-based authentication in order to simplify remote operations.