2.1 KiB
Continuous Validation
This module is responsible for deploying an Oozie Workflow (on the desired cluster), which executes a Spark action.
This action takes the HDFS-path of a directory of parquet files containing metadata records, and applies the validation process on all of them, in parallel. Then it outputs the results, in json-format, in the given directory.
The validation process is powered by the "uoa-validator-engine2" software,
which is included as a dependency inside the main "pom.xml" file.
Configure the workflow
Add the wanted values for each of the parameters, defined in the "/src/main/resources/eu/dnetlib/dhp/continuous_validator/oozie_app/workflow.xml" file.
The most important parameters are the following:
- parquet_path: the input parquet
- openaire_guidelines: valid values: "4.0", "3.0", "2.0", "fair_data", "fair_literature_v4"
- output_path: Be careful to use a base directory which is different from the one that this module is running on, as during a new deployment, that base directory will be deleted.
Install the project and then deploy and run the workflow
Run the ./installProject.sh script and then the ./runOozieWorkflow.sh script.
Use the "workflow-id" displayed by the "runOozieWorkflow.sh" script to check the running status and logs, in the remote machine, as follows:
- Check the status:
oozie job -oozie http://<cluster's domain and port>/oozie -info <Workflow-ID>
- Copy the "Job-id" from the output of the above command (numbers with ONE underscore between them).
- Check the job's logs (not the app's logs!):
yarn logs -applicationId application_<Job-ID>
Note:
If you encounter any "java.lang.NoSuchFieldError" issues in the logs, rerun using the following steps:
- Delete some remote directories related to the workflow in your user's dir: /user//
- .sparkStaging
- oozie-oozi
- Run the ./installProject.sh script and then the ./runOozieWorkflow.sh script.