[graph cleaning] update collectedfrom & hostedby references as consequence of the datasource deduplication #260
Reference: D-Net/dnet-hadoop#260
Description
This PR introduces a further processing step in the graph cleaning workflow: it updates the collectedfrom and hostedby fields so that they point to the master datasource resulting from the deduplication task.
The procedure can be summarised as follows
shouldClean is set to true
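As a rough illustration of the idea (not the code in this PR), the reference update can be sketched in plain Python. The sketch assumes a hypothetical lookup `master_of` that maps each duplicate datasource id to the id of its master, and a simplified record shape in which `collectedfrom` and `hostedby` are plain lists of datasource ids. The names `remap_ids` and `update_references` and the example ids are invented for the example.

```python
from typing import Dict, List


def remap_ids(ids: List[str], master_of: Dict[str, str]) -> List[str]:
    """Replace each datasource id with its master id, dropping repeats.

    Ids that do not appear in master_of were not merged by the
    deduplication and are kept unchanged. Input order is preserved.
    """
    seen = set()
    result = []
    for ds_id in ids:
        master = master_of.get(ds_id, ds_id)
        if master not in seen:
            seen.add(master)
            result.append(master)
    return result


def update_references(record: dict, master_of: Dict[str, str]) -> dict:
    """Return a copy of record whose collectedfrom and hostedby lists
    point at master datasources (hypothetical, simplified record shape)."""
    updated = dict(record)  # shallow copy: the input record is not modified
    for field in ("collectedfrom", "hostedby"):
        updated[field] = remap_ids(record.get(field, []), master_of)
    return updated


# Example with made-up ids: two duplicates merged into one master.
master_of = {"ds::dup1": "ds::master", "ds::dup2": "ds::master"}
record = {
    "id": "r1",
    "collectedfrom": ["ds::dup1", "ds::dup2", "ds::other"],
    "hostedby": ["ds::dup2"],
}
cleaned = update_references(record, master_of)
```

Returning a copy rather than mutating the input mirrors the note under Further work that the result records are currently cloned before being stored again.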
Further work
To test this modification in isolation, the result records are cloned before being stored again. However, I think we should start optimising these aspects by chaining the cleaning processes in this workflow wherever possible, to reduce the amount of intermediate data it produces. This will require restructuring the workflow as well as all its unit tests.
Title changed from "WIP: [graph cleaning] update collectedfrom & hostedby references as consequence of the datasource deduplication" to "[graph cleaning] update collectedfrom & hostedby references as consequence of the datasource deduplication"