Parameter naming conventions across the provision workflows #29

Open
opened 2020-07-27 17:37:04 +02:00 by claudio.atzori · 0 comments

Currently, the Oozie workflows that make up the OpenAIRE data provision pipeline don't follow precise conventions for naming their input/output parameters.

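The parameters available through the D-Net provision workflow definitions can be inspected here:

  • BETA - Graph construction for IIS: https://beta.services.openaire.eu/is/mvc/ui/isManager.do#/profile/b05c97e6-69b5-497d-87fd-2137d3ff2c2e_V29ya2Zsb3dEU1Jlc291cmNlcy9Xb3JrZmxvd0RTUmVzb3VyY2VUeXBl
  • BETA - Graph construction for PROVISION: https://beta.services.openaire.eu/is/mvc/ui/isManager.do#/profile/b05c97e6-69b5-497d-87fd-2137d3ff2c2e_V29ya2Zsb3dEU1Jlc291cmNlcy9Xb3JrZmxvd0RTUmVzb3VyY2VUeXBl
  • PROD - Graph construction for IIS: https://services.openaire.eu/is/mvc/ui/isManager.do#/profile/4801c33c-66ca-4ab6-af64-aa812194ec68_V29ya2Zsb3dEU1Jlc291cmNlcy9Xb3JrZmxvd0RTUmVzb3VyY2VUeXBl
  • PROD - Graph construction for PROVISION: https://services.openaire.eu/is/mvc/ui/isManager.do#/profile/74d90d54-bea4-4a79-82d9-adddcc89e660_V29ya2Zsb3dEU1Jlc291cmNlcy9Xb3JrZmxvd0RTUmVzb3VyY2VUeXBl

This overview shows that a few simple naming conventions could alleviate the burden of keeping the parameters coherent across the Oozie workflows, and of introducing new graph processing steps when needed.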

As the listing below shows, many parameters describe the same concept under different names across the workflow steps.

aggregatorGraph

  • graphOutputPath
  • isLookupUrl

promoteActions

  • inputGraphRootPath
  • outputGraphRootPath
  • isLookupUrl

duplicateScan

  • graphBasePath // input
  • dedupGraphPath // output
  • isLookUpUrl

dedupConsistency

  • graphBasePath // input
  • dedupGraphPath // output

graphCleaning

  • graphInputPath
  • graphOutputPath
  • isLookupUrl

orcidPropagation

  • sourcePath
  • outputPath

bulkTagging

  • sourcePath
  • outputPath
  • isLookUpUrl

affiliationPropagation

  • sourcePath
  • outputPath

communityOrganizationPropagation

  • sourcePath
  • outputPath

resultProjectPropagation

  • sourcePath
  • outputPath

communitySemrelPropagation

  • sourcePath
  • outputPath
  • isLookUpUrl

countryPropagation

  • sourcePath
  • outputPath

blacklistRelations

  • sourcePath
  • outputPath

In particular, the most frequently recurring parameters describe:

  1. the input/output graph. These parameters currently refer to HDFS paths, but they could be abstracted to refer generically to unambiguous input/output graph names;
  2. the URL of the information system lookup service, which luckily suffers from only two different spellings, isLookUpUrl vs isLookupUrl.
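
The overlap can be made concrete with a short tally. The following is only a minimal sketch in Python, not part of the pipeline: `WORKFLOW_PARAMS` transcribes the listing above, while the `CONCEPTS` grouping, its labels, and the `spellings_by_concept` helper are assumptions introduced here for illustration.

```python
from collections import defaultdict

# Parameters per workflow step, transcribed from the listing above.
WORKFLOW_PARAMS = {
    "aggregatorGraph": ["graphOutputPath", "isLookupUrl"],
    "promoteActions": ["inputGraphRootPath", "outputGraphRootPath", "isLookupUrl"],
    "duplicateScan": ["graphBasePath", "dedupGraphPath", "isLookUpUrl"],
    "dedupConsistency": ["graphBasePath", "dedupGraphPath"],
    "graphCleaning": ["graphInputPath", "graphOutputPath", "isLookupUrl"],
    "orcidPropagation": ["sourcePath", "outputPath"],
    "bulkTagging": ["sourcePath", "outputPath", "isLookUpUrl"],
    "affiliationPropagation": ["sourcePath", "outputPath"],
    "communityOrganizationPropagation": ["sourcePath", "outputPath"],
    "resultProjectPropagation": ["sourcePath", "outputPath"],
    "communitySemrelPropagation": ["sourcePath", "outputPath", "isLookUpUrl"],
    "countryPropagation": ["sourcePath", "outputPath"],
    "blacklistRelations": ["sourcePath", "outputPath"],
}

# Assumed grouping of parameter names by the concept they describe
# (the labels are hypothetical, chosen only for this sketch).
CONCEPTS = {
    "input graph": {"inputGraphRootPath", "graphBasePath", "graphInputPath", "sourcePath"},
    "output graph": {"graphOutputPath", "outputGraphRootPath", "dedupGraphPath", "outputPath"},
    "lookup URL": {"isLookupUrl", "isLookUpUrl"},
}


def spellings_by_concept(workflow_params):
    """Return {concept: {parameter name: number of steps using it}}."""
    tally = {concept: defaultdict(int) for concept in CONCEPTS}
    for params in workflow_params.values():
        for name in params:
            for concept, names in CONCEPTS.items():
                if name in names:
                    tally[concept][name] += 1
    return {concept: dict(counts) for concept, counts in tally.items()}


if __name__ == "__main__":
    for concept, counts in spellings_by_concept(WORKFLOW_PARAMS).items():
        print(f"{concept}: {len(counts)} distinct names -> {counts}")
```

Under this grouping, the input graph and the output graph are each addressed by four different names, and the lookup URL by two.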

So I propose adopting the same naming convention for these three parameters across all the workflows:

  • inputGraph
  • outputGraph
  • isLookUpUrl
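
As a rough illustration of what the migration could look like, the sketch below maps the names currently in use onto the three proposed ones. It is an assumption-laden example in Python rather than actual workflow code: the `RENAMES` table and the `normalize` helper are hypothetical, and the paths and URL in the usage example are placeholders.

```python
# Hypothetical mapping from the names currently in use to the proposed ones.
RENAMES = {
    # input graph
    "inputGraphRootPath": "inputGraph",
    "graphBasePath": "inputGraph",
    "graphInputPath": "inputGraph",
    "sourcePath": "inputGraph",
    # output graph
    "graphOutputPath": "outputGraph",
    "outputGraphRootPath": "outputGraph",
    "dedupGraphPath": "outputGraph",
    "outputPath": "outputGraph",
    # information system lookup service URL
    "isLookupUrl": "isLookUpUrl",
}


def normalize(params):
    """Return a copy of params with legacy names replaced by the proposed ones.

    Raises ValueError if two legacy names collapse onto the same proposed
    name, since one value would otherwise silently overwrite the other.
    """
    result = {}
    for name, value in params.items():
        new_name = RENAMES.get(name, name)
        if new_name in result:
            raise ValueError(f"'{name}' collides with another parameter on '{new_name}'")
        result[new_name] = value
    return result


if __name__ == "__main__":
    # Placeholder values: the paths and URL are made up for the example.
    legacy = {
        "graphBasePath": "/tmp/graph/raw",
        "dedupGraphPath": "/tmp/graph/dedup",
        "isLookupUrl": "http://example.org/is/lookup",
    }
    print(normalize(legacy))
```

Names that are not in the table pass through unchanged, so step-specific parameters would not be affected by such a rename.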
claudio.atzori self-assigned this 2020-07-27 17:37:04 +02:00
miriam.baglioni was assigned by claudio.atzori 2020-07-27 17:37:04 +02:00
claudio.atzori added the enhancement label 2020-07-27 18:01:24 +02:00
Reference: D-Net/dnet-hadoop#29