[graph cleaning] update collectedfrom & hostedby references as consequence of the datasource deduplication #260

claudio.atzori · 2022-11-30T10:30:28+01:00

claudio.atzori commented

2022-11-30 10:30:28 +01:00

Description

This PR introduces a further processing step in the graph cleaning workflow, which updates the fields

result.collectedfrom[].key
result.collectedfrom[].value
result.instance[].collectedfrom.key
result.instance[].collectedfrom.value
result.instance[].hostedby.key
result.instance[].hostedby.value

so that they point to the master datasource resulting from the deduplication task.
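To make the effect on a single record concrete, here is a minimal sketch in Python (not the code in this PR). It assumes a result is a plain dict shaped like the field paths above, and that the master-duplicate mapping is a dict from duplicate datasource id to a (master id, master name) pair. The names `patch_kv`, `patch_result` and `MasterMap` are hypothetical.

```python
from typing import Dict, Tuple

# Assumption: duplicate datasource id -> (master id, master name).
MasterMap = Dict[str, Tuple[str, str]]


def patch_kv(kv: dict, masters: MasterMap) -> dict:
    """Return a copy of a {key, value} reference, redirected to the master if known."""
    master = masters.get(kv.get("key"))
    if master is None:
        return dict(kv)  # not a known duplicate: leave the reference unchanged
    return {**kv, "key": master[0], "value": master[1]}


def patch_result(result: dict, masters: MasterMap) -> dict:
    """Patch the six fields listed above on a copy of the record."""
    patched = dict(result)
    # result.collectedfrom[].key / .value
    patched["collectedfrom"] = [
        patch_kv(cf, masters) for cf in result.get("collectedfrom") or []
    ]
    # result.instance[].collectedfrom and result.instance[].hostedby
    instances = []
    for inst in result.get("instance") or []:
        inst = dict(inst)
        for field in ("collectedfrom", "hostedby"):
            if inst.get(field) is not None:
                inst[field] = patch_kv(inst[field], masters)
        instances.append(inst)
    patched["instance"] = instances
    return patched
```

References whose key is not a known duplicate are left untouched, so the patch is a no-op for datasources that were never merged.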

The procedure can be summarised as follows:

it is activated when the workflow parameter shouldClean is set to true
it reads the datasource master-duplicate (MD) information from the DSM database and stores it on HDFS for later use
for each of the four result subtypes, it reads the stored MD information again and uses it to patch the fields mentioned above
the patched results are stored in the working directory and eventually copied to the graph directory; the step operates in overwrite mode, replacing the input graph with the cleaned one
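The steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the workflow code: local JSON-lines files stand in for HDFS, the MD rows are passed in rather than read from the DSM database, and the subtype names, file names and the `run_patch_step` function are all hypothetical. The per-record patch logic is supplied by the caller.

```python
import json
import shutil
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple

# Assumption: the PR only says "four result subtypes"; these names are illustrative.
RESULT_SUBTYPES = ("publication", "dataset", "otherresearchproduct", "software")

MasterMap = Dict[str, Tuple[str, str]]


def store_md(rows: Iterable[Tuple[str, str, str]], md_path: Path) -> None:
    """Persist (duplicate id, master id, master name) rows as JSON lines."""
    with md_path.open("w", encoding="utf-8") as out:
        for dup, master, name in rows:
            out.write(json.dumps({"duplicate": dup, "master": master, "name": name}) + "\n")


def load_md(md_path: Path) -> MasterMap:
    """Read the stored MD information back into a lookup table."""
    with md_path.open(encoding="utf-8") as src:
        return {r["duplicate"]: (r["master"], r["name"]) for r in map(json.loads, src)}


def run_patch_step(
    should_clean: bool,
    md_rows: Iterable[Tuple[str, str, str]],
    graph_dir: Path,
    working_dir: Path,
    patch: Callable[[dict, MasterMap], dict],
) -> bool:
    if not should_clean:  # step 1: gated by the workflow parameter
        return False
    working_dir.mkdir(parents=True, exist_ok=True)
    md_path = working_dir / "masterduplicate.json"
    store_md(md_rows, md_path)  # step 2: store the MD information for later use

    patched_files = []
    for subtype in RESULT_SUBTYPES:  # step 3: one pass per result subtype
        source = graph_dir / f"{subtype}.json"
        if not source.exists():
            continue
        masters = load_md(md_path)  # each subtype reads the MD information again
        target = working_dir / f"{subtype}.json"
        with source.open(encoding="utf-8") as src, target.open("w", encoding="utf-8") as out:
            for line in src:
                if line.strip():
                    out.write(json.dumps(patch(json.loads(line), masters)) + "\n")
        patched_files.append((target, source))

    for target, source in patched_files:  # step 4: overwrite the input graph
        shutil.copyfile(target, source)
    return True
```

Copying back only after every subtype has been patched keeps the input graph intact if one of the per-subtype passes fails.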

Further work

To test this modification in isolation, the result records are cloned before being stored again. I think we should start optimising these aspects by chaining the cleaning processes wherever possible in this workflow, to reduce the amount of intermediate data it produces. This will require restructuring the workflow as well as all its unit tests.
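As a rough illustration of what chaining could look like, the sketch below composes independent per-record cleaning functions and applies them in a single pass, so no intermediate copy of the records is materialised between steps. It is only a sketch of the idea, not a proposal for the actual workflow structure; `chain`, `clean_all` and the steps in the usage example are hypothetical.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator

Record = dict
Step = Callable[[Record], Record]


def chain(*steps: Step) -> Step:
    """Compose cleaning steps, applied left to right, into a single function."""
    return lambda record: reduce(lambda acc, step: step(acc), steps, record)


def clean_all(records: Iterable[Record], *steps: Step) -> Iterator[Record]:
    """Lazily apply every step to each record in one pass over the input."""
    pipeline = chain(*steps)
    return (pipeline(record) for record in records)
```

For example, with two hypothetical steps:

```python
def strip_title(record: Record) -> Record:
    return {**record, "title": record["title"].strip()}


def lowercase_id(record: Record) -> Record:
    return {**record, "id": record["id"].lower()}


cleaned = list(clean_all([{"id": "R1", "title": " A title "}], strip_title, lowercase_id))
```

Each step stays testable on its own, while the workflow reads and writes the data once.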


miriam.baglioni was assigned by claudio.atzori

2022-11-30 10:30:28 +01:00

claudio.atzori added 5 commits 2022-11-30 10:30:29 +01:00

24ef301cc1 [graph cleaning] patch the result's collectedfrom and hostedby identifiers according to the datasource master-duplicate mapping

6082d235d3 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into graph_cleaning

11695ba649 [graph cleaning] patch also the result's collectedfrom and hostedby datasource name according to the datasource master-duplicate mapping

58c05731f9 [graph cleaning] WIP: testing the collectedfron and hostedby patch procedure

8e3edba318 [graph cleaning] testing the collectedfron and hostedby patch procedure

claudio.atzori changed title from ~~WIP: [graph cleaning] update collectedfrom & hostedby references as consequence of the datasource deduplication~~ to [graph cleaning] update collectedfrom & hostedby references as consequence of the datasource deduplication

2022-12-02 14:48:57 +01:00

claudio.atzori added 1 commit 2022-12-02 14:49:02 +01:00

8248da40d9 Merge branch 'beta' into graph_cleaning

claudio.atzori merged commit 71b121e9f8 into beta

2022-12-02 14:49:15 +01:00

claudio.atzori referenced this issue from a commit

2022-12-02 14:49:15 +01:00

Merge pull request '[graph cleaning] update collectedfrom & hostedby references as consequence of the datasource deduplication' (#260) from graph_cleaning into beta

