compress broker workflow outputs #67

Closed
opened 2020-12-09 09:26:05 +01:00 by claudio.atzori · 3 comments

I noticed that the intermediate data produced by the broker-related workflows is stored uncompressed. Given that the inspection tools (HUE web UI, hdfs command-line tool) support on-the-fly decompression, I don't see a valid reason for not storing it in a compressed format.

Many other workflow implementations already do this using the gzip compression format, so please take inspiration from the existing classes.

Please also note that I'm assigning the task to the `broker` branch. Any change should be implemented there, tested, and only after successful testing integrated into the master branch, possibly via a pull request.
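
As an illustration of the request above, here is a minimal sketch of storing line-delimited JSON records in gzip format and reading them back with transparent decompression. It uses only the Python standard library and is not the workflow code itself; the helper names `write_events_gzip` and `read_events_gzip`, the file name, and the sample fields are hypothetical. In a Spark job the same effect is usually obtained through a writer option such as `compression` set to `gzip`, but whether the existing classes do it that way is an assumption.

```python
import gzip
import json
import os
import tempfile


def write_events_gzip(events, path):
    """Write one JSON object per line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for event in events:
            out.write(json.dumps(event) + "\n")


def read_events_gzip(path):
    """Read the records back; decompression happens on the fly."""
    with gzip.open(path, "rt", encoding="utf-8") as src:
        return [json.loads(line) for line in src]


# Hypothetical sample records, not the real broker Event schema.
events = [{"eventId": f"event-{i}", "opendoarId": str(i % 5)} for i in range(1000)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.json.gz")
    write_events_gzip(events, path)
    restored = read_events_gzip(path)
    compressed_size = os.path.getsize(path)
```

The reader does not need to know the data was compressed beyond opening it with a gzip-aware call, which mirrors the point about the inspection tools.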

claudio.atzori added the enhancement label 2020-12-09 09:26:05 +01:00
michele.artini was assigned by claudio.atzori 2020-12-09 09:26:05 +01:00
Member

The fix has been committed on the `broker` branch.

Author
Owner

Side note: the fact that the Event records were not stored compressed may have contributed to the slowness of the procedure that partitions them by OpenDOAR ID. The bigger the data, the more time it takes to move it around the cluster.
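
To get a rough feel for how much less data would have to be moved, the following sketch compares the raw and gzip-compressed sizes of a batch of synthetic records. The record fields and values are hypothetical and far more repetitive than real Event records are likely to be, so the ratio it prints should not be read as a prediction for the actual workflow.

```python
import gzip
import json

# Hypothetical records; real Event records will compress differently.
records = [
    {
        "eventId": f"event-{i}",
        "opendoarId": str(i % 50),
        "topic": "ENRICH/MISSING/ABSTRACT",
    }
    for i in range(10_000)
]

# Serialize as line-delimited JSON, then compress the whole payload.
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
compressed = gzip.compress(raw)

ratio = len(compressed) / len(raw)
print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes, ratio: {ratio:.2f}")
```

Running the same comparison on a sample of real output would give a more meaningful estimate than this synthetic batch.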

Author
Owner

Integrated in PR !78

@michele.artini We should monitor the execution of the partition-by-opendoarID operation to see whether the compression brings any significant advantage.

Reference: D-Net/dnet-hadoop#67