Closes #4: New action manager implementation #5

Manually merged
claudio.atzori merged 23 commits from :przemyslawjacewicz_actionmanager_impl_prototype into master 2020-04-06 17:35:08 +02:00
Contributor

Overview

This PR adds a new module containing an action manager implementation.

It contains a single Oozie workflow for promoting action sets, which are supplied as a comma-separated list of HDFS locations containing Hadoop sequence files with AtomicAction as the value.

Additions

  1. OAF model
    • autogenerated equals and hashCode for classes that did not override these methods
    • safe casting in mergeFrom, avoiding ClassCastException when the target instance cannot be cast to the source instance
    • mergeFrom for Relation safeguarded against null value of collectedfrom field
  2. Spark jobs
    • spark job for partitioning action sets by payload type
    • spark job for promoting given action payload for graph table
  3. Helper classes
    • HDFS support
    • SparkSession helper - allows a SparkSession created in a test to be reused in main code
    • minor helpers
  4. Oozie workflow for promotion of action sets
    • main workflow consisting of subworkflows handling promotion for each graph table
  5. Suite of unit and local tests
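
As a rough illustration of the two mergeFrom changes listed under "OAF model", the sketch below shows one way such guards could look. The classes SketchEntity, SketchPublication and SketchRelation and their fields are hypothetical stand-ins, not the actual OAF model classes, and the merge rules (fill in missing values, concatenate and deduplicate collectedfrom) are assumptions made only for the example.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for a base OAF entity; not the real model class.
class SketchEntity {
    String id;

    SketchEntity(String id) {
        this.id = id;
    }

    void mergeFrom(SketchEntity other) {
        if (other == null) {
            return;
        }
        // Assumed merge rule for the sketch: only fill in a missing id.
        if (id == null) {
            id = other.id;
        }
    }
}

// Hypothetical subtype with one extra field.
class SketchPublication extends SketchEntity {
    String title;

    SketchPublication(String id, String title) {
        super(id);
        this.title = title;
    }

    @Override
    void mergeFrom(SketchEntity other) {
        super.mergeFrom(other);
        // Safe cast: skip the subtype-specific merge instead of
        // throwing ClassCastException on an incompatible instance.
        if (!(other instanceof SketchPublication)) {
            return;
        }
        SketchPublication p = (SketchPublication) other;
        if (title == null) {
            title = p.title;
        }
    }
}

// Hypothetical stand-in for Relation with a nullable collectedfrom list.
class SketchRelation {
    List<String> collectedfrom; // may be null on either side

    void mergeFrom(SketchRelation other) {
        if (other == null) {
            return;
        }
        // Treat a null list as empty so the merge never dereferences null.
        collectedfrom = Stream.concat(nullSafe(collectedfrom), nullSafe(other.collectedfrom))
                .distinct()
                .collect(Collectors.toList());
    }

    private static Stream<String> nullSafe(List<String> values) {
        return values == null ? Stream.empty() : values.stream();
    }
}
```

With guards of this kind, merging a SketchEntity into a SketchPublication only applies the base-class merge, and merging two relations works whether or not either collectedfrom list is set.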

A major drawback is the lack of tests for Oozie workflows. These tests would become possible once a framework for testing Oozie workflows is introduced in the parent module.

claudio.atzori self-assigned this 2020-04-06 15:17:16 +02:00

Overall the code is well organized and the test suite is also fine. One aspect still under discussion is the encoding used to store the AtomicActions. Depending on what we decide, the procedure for reading the ActionSets might need to be adjusted a bit (https://code-repo.d4science.org/przemyslaw.jacewicz/dnet-hadoop/src/branch/przemyslawjacewicz_actionmanager_impl_prototype/dhp-workflows/dhp-actionmanager/src/main/java/eu/dnetlib/dhp/actionmanager/partition/PartitionActionSetsByPayloadTypeJob.java#L103), but this is something I can take care of.

Ref. ticket https://issue.openaire.research-infrastructures.eu/issues/5507

  1. OAF model
    • autogenerated equals and hashCode for classes that did not override these methods
    • safe casting in mergeFrom, avoiding ClassCastException when the target instance cannot be cast to the source instance
    • mergeFrom for Relation safeguarded against null value of collectedfrom field

Thanks for these!

  1. Helper classes
    • HDFS support
    • SparkSession helper - allows a SparkSession created in a test to be reused in main code
    • minor helpers

The helper classes defined in the eu.dnetlib.dhp.actionmanager.common package are well suited to be part of the dhp-common module. I'm probably going to move them there, under eu.dnetlib.dhp.common.

  1. Oozie workflow for promotion of action sets
    • main workflow consisting of subworkflows handling promotion for each graph table
  2. Suite of unit and local tests

A major drawback is the lack of tests for Oozie workflows. These tests would become possible once a framework for testing Oozie workflows is introduced in the parent module.

I agree. As more workflows are being defined, we need a framework to test them. Your experience with testing IIS workflows will be valuable :)

claudio.atzori closed this pull request 2020-04-06 17:35:08 +02:00
claudio.atzori reviewed 2020-04-07 18:01:32 +02:00
@ -0,0 +26,4 @@
public static void runWithSparkSession(SparkConf conf,
                                       Boolean isSparkSessionManaged,
                                       ThrowingConsumer<SparkSession, Exception> fn) {
    runWithSparkSession(c -> SparkSession.builder().config(c).getOrCreate(), conf, isSparkSessionManaged, fn);

Something I realized today while refactoring: what about those spark actions that need to have hive support enabled?

SparkSession.Builder has a nice .enableHiveSupport() method to activate it. Should we always enable it? Otherwise I can define a specialized runWithSparkSession.

Author
Contributor

Hmm, I didn't think about that. The runWithSparkSession methods were meant to be used by the action manager's Spark actions, where Hive support is not needed. But I think the version of runWithSparkSession that accepts a SparkSession builder function could be given a builder that creates a SparkSession with Hive support, using code like this:

SparkSession.builder().config(c).enableHiveSupport().getOrCreate()
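
To make the idea concrete without pulling in Spark, here is a minimal sketch of the pattern under discussion: a general method that takes a session builder function, plus thin wrappers that choose the builder. StubConf and StubSession are hypothetical stand-ins for SparkConf and SparkSession, runWithHiveSession is an invented name, and the body of the general method (apply the builder, run the function, stop the session only when it is managed) is an assumption about how such a helper might behave, not a copy of the PR code.

```java
import java.util.function.Function;

// Consumer whose accept may throw a checked exception
// (same name as the type visible in the diff above).
@FunctionalInterface
interface ThrowingConsumer<T, E extends Exception> {
    void accept(T t) throws E;
}

// Hypothetical stand-in for SparkConf.
class StubConf {
}

// Hypothetical stand-in for SparkSession; records the Hive flag and stop() calls.
class StubSession {
    final boolean hiveEnabled;
    boolean stopped = false;

    StubSession(boolean hiveEnabled) {
        this.hiveEnabled = hiveEnabled;
    }

    void stop() {
        stopped = true;
    }
}

final class SessionSupportSketch {

    private SessionSupportSketch() {
    }

    // General form: the caller decides how the session is built.
    static void runWithSession(Function<StubConf, StubSession> sessionBuilder,
                               StubConf conf,
                               boolean isSessionManaged,
                               ThrowingConsumer<StubSession, Exception> fn) {
        StubSession session = null;
        try {
            session = sessionBuilder.apply(conf);
            fn.accept(session);
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // Stop only sessions this helper owns; a session shared
            // with a test is left running for the test to clean up.
            if (session != null && isSessionManaged) {
                session.stop();
            }
        }
    }

    // Default wrapper: session without Hive support.
    static void runWithSession(StubConf conf,
                               boolean isSessionManaged,
                               ThrowingConsumer<StubSession, Exception> fn) {
        runWithSession(c -> new StubSession(false), conf, isSessionManaged, fn);
    }

    // Specialized wrapper: same flow, but the builder enables Hive support.
    static void runWithHiveSession(StubConf conf,
                                   boolean isSessionManaged,
                                   ThrowingConsumer<StubSession, Exception> fn) {
        runWithSession(c -> new StubSession(true), conf, isSessionManaged, fn);
    }
}
```

With real Spark types, the builder passed by the Hive wrapper would be the one-liner shown above, so Hive support stays opt-in for the jobs that need it.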

PS. Sorry for the late reply, didn't see the comment earlier.

Author
Contributor

Just saw that you already did that :)

Reference: D-Net/dnet-hadoop#5