compress broker workflow outputs #67
Reference: D-Net/dnet-hadoop#67
I noticed that the intermediate data produced by the broker-related workflows is stored uncompressed. Given that the inspection tools (HUE web UI, hdfs command-line tool) support on-the-fly decompression, I don't see a valid reason for not storing it in a compressed format.
Many other workflow implementations already do this using the gzip compression format, so please take inspiration from the existing classes.
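As a rough illustration of what gzip buys on this kind of output, here is a minimal, standalone sketch using only the Java standard library. It is not the code from the existing workflow classes: the class name, the helper methods, and the sample record fields are all hypothetical. In a Spark job the equivalent is usually a one-line change on the writer, e.g. `.option("compression", "gzip")`, and the existing classes remain the reference for how this project does it.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: not part of dnet-hadoop.
class GzipJsonLinesSketch {

    // Write one record per line into a gzip-compressed byte array.
    static byte[] writeGzip(List<String> records) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(bytes), StandardCharsets.UTF_8)) {
            for (String r : records) {
                w.write(r);
                w.write('\n');
            }
        }
        // Closing the writer finishes the gzip stream, so the array is complete.
        return bytes.toByteArray();
    }

    // Decompress on the fly and read the records back, one per line.
    static List<String> readGzip(byte[] data) throws IOException {
        List<String> out = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(data)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.add(line);
            }
        }
        return out;
    }

    // Size in bytes of the same records stored uncompressed (one per line).
    static int plainSize(List<String> records) {
        int size = 0;
        for (String r : records) {
            size += r.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return size;
    }

    // Invented sample records; the field names are assumptions for illustration.
    static List<String> sampleEvents(int n) {
        List<String> events = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            events.add("{\"eventId\":\"event-" + i
                    + "\",\"topic\":\"example-topic\",\"opendoarId\":\""
                    + (i % 10) + "\"}");
        }
        return events;
    }

    public static void main(String[] args) throws IOException {
        List<String> events = sampleEvents(1000);
        byte[] gz = writeGzip(events);
        System.out.println("uncompressed bytes: " + plainSize(events));
        System.out.println("gzip bytes:         " + gz.length);
        System.out.println("round trip ok:      " + readGzip(gz).equals(events));
    }
}
```

Repetitive JSON records like these compress well, which is the case for keeping the intermediate outputs gzipped; the actual ratio on the broker data would need to be measured on the cluster.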
Please also note that I'm assigning the task to the broker branch. Any change should be implemented and tested there, and only after successful testing integrated into the master branch, possibly via a pull request.
The fix has been committed on the broker branch.
Side note: the fact that the Event records were not stored compressed could have played a role in the slowness of the procedure that partitions them by opendoar ID. The bigger the data, the more time it takes to move it around the cluster.
Integrated in PR !78
@michele.artini We should monitor the execution of the partitioning by opendoar ID to see whether the compression brings any significant advantage.