[graph cleaning] update collectedfrom & hostedby references as consequence of the datasource deduplication #260

Merged
claudio.atzori merged 6 commits from graph_cleaning into beta 2022-12-02 14:49:15 +01:00

Description

This PR introduces a further processing step in the graph cleaning workflow. It is necessary to update the fields

```
result.collectedfrom[].key
result.collectedfrom[].value
result.instance[].collectedfrom.key
result.instance[].collectedfrom.value
result.instance[].hostedby.key
result.instance[].hostedby.value
```

so that they point to the master datasource resulting from the deduplication task.
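
For illustration, the gist of the patch could look like the following minimal sketch. The class and method names here are hypothetical, not the actual implementation; it assumes the oaf model classes `Result`, `Instance` and `KeyValue`, plus a `duplicateId -> master` lookup built from the master-duplicate table:

```java
import java.util.Map;
import java.util.Optional;

import eu.dnetlib.dhp.schema.oaf.Instance;
import eu.dnetlib.dhp.schema.oaf.KeyValue;
import eu.dnetlib.dhp.schema.oaf.Result;

public class MasterDuplicatePatcher {

    /** Rewrites a KeyValue in place when its key matches a known duplicate datasource id. */
    private static void patch(KeyValue kv, Map<String, KeyValue> masterByDuplicateId) {
        if (kv == null || kv.getKey() == null) {
            return;
        }
        Optional
            .ofNullable(masterByDuplicateId.get(kv.getKey()))
            .ifPresent(master -> {
                kv.setKey(master.getKey());     // master datasource id
                kv.setValue(master.getValue()); // master datasource name
            });
    }

    /** Applies the patch to all the fields listed above. */
    public static Result patchResult(Result r, Map<String, KeyValue> masterByDuplicateId) {
        if (r.getCollectedfrom() != null) {
            r.getCollectedfrom().forEach(kv -> patch(kv, masterByDuplicateId));
        }
        if (r.getInstance() != null) {
            for (Instance i : r.getInstance()) {
                patch(i.getCollectedfrom(), masterByDuplicateId);
                patch(i.getHostedby(), masterByDuplicateId);
            }
        }
        return r;
    }
}
```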

The procedure can be summarised as follows:

  • it is activated when the workflow parameter `shouldClean` is set to `true`
  • it reads the datasource master-duplicate (MD) information from the DSM database and stores it on HDFS for later use
  • for each of the four result subtypes, it reads the MD information back and uses it to patch the fields mentioned above
  • the patched results are stored in the working directory and eventually copied to the graph directory - it operates in overwrite mode, replacing the input graph with the cleaned one (see the sketch after this list)
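
Putting the pieces together, a hedged sketch of the per-subtype Spark step. The paths, the layout of the MD dump on HDFS and the field names `duplicateId`/`masterId`/`masterName` are assumptions for illustration, not the actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

import com.fasterxml.jackson.databind.ObjectMapper;

import eu.dnetlib.dhp.schema.oaf.KeyValue;
import eu.dnetlib.dhp.schema.oaf.Publication;

public class PatchDatasourceReferences {

    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("PatchDatasourceReferences")
            .getOrCreate();

        // duplicateId -> master (id, name), built from the MD information
        // previously stored on HDFS; the table is small, so it can be
        // collected on the driver and shipped with the task closures
        Map<String, KeyValue> md = new HashMap<>();
        for (Row row : spark.read().json("/working_dir/masterduplicate").collectAsList()) {
            KeyValue master = new KeyValue();
            master.setKey(row.getAs("masterId"));
            master.setValue(row.getAs("masterName"));
            md.put(row.getAs("duplicateId"), master);
        }

        // one of the four result subtypes; the same step would run for
        // dataset, otherresearchproduct and software
        spark
            .read()
            .textFile("/graph/publication")
            .map(
                (MapFunction<String, Publication>) s -> OBJECT_MAPPER.readValue(s, Publication.class),
                Encoders.bean(Publication.class))
            .map(
                (MapFunction<Publication, Publication>) p ->
                    (Publication) MasterDuplicatePatcher.patchResult(p, md),
                Encoders.bean(Publication.class))
            .write()
            .mode(SaveMode.Overwrite) // the cleaned output replaces the input graph when copied back
            .option("compression", "gzip")
            .json("/working_dir/publication");
    }
}
```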

Further work

To test this modification in isolation, the result records are cloned before being stored again, but I think we should start optimising these aspects, chaining the cleaning processes whenever possible in this workflow to reduce the amount of intermediate data it produces. This will require restructuring the workflow as well as all of its unit tests. One possible direction is sketched below.
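
As a hedged sketch of that direction (the interface and its name are hypothetical, not part of this PR), the cleaning passes could be expressed as composable, Spark-serializable functions and applied in a single `map` per result subtype, so each record is read and written only once:

```java
import java.io.Serializable;
import java.util.function.Function;

import eu.dnetlib.dhp.schema.oaf.Result;

/** Hypothetical: a cleaning pass as a composable, Spark-serializable function. */
public interface CleaningStep extends Function<Result, Result>, Serializable {

    /** Chains several passes into one, avoiding intermediate datasets. */
    static CleaningStep chain(CleaningStep... steps) {
        return r -> {
            Result current = r;
            for (CleaningStep step : steps) {
                current = step.apply(current);
            }
            return current;
        };
    }
}
```

The `shouldClean`-gated patch introduced here would then be just one `CleaningStep` among the others, composed via `CleaningStep.chain(...)` and applied in the same pass as the existing cleaning logic.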

miriam.baglioni was assigned by claudio.atzori 2022-11-30 10:30:28 +01:00
claudio.atzori added 5 commits 2022-11-30 10:30:29 +01:00
claudio.atzori changed title from WIP: [graph cleaning] update collectedfrom & hostedby references as consequence of the datasource deduplication to [graph cleaning] update collectedfrom & hostedby references as consequence of the datasource deduplication 2022-12-02 14:48:57 +01:00
claudio.atzori added 1 commit 2022-12-02 14:49:02 +01:00
claudio.atzori merged commit 71b121e9f8 into beta 2022-12-02 14:49:15 +01:00
Reference: D-Net/dnet-hadoop#260