[BETA] Improvements to copying data from ocean to impala #421

Merged
claudio.atzori merged 4 commits from antonis.lempesis/dnet-hadoop:beta into beta 2024-04-16 14:22:33 +02:00

These 4 commits fix the namespace issue with the Impala cluster and also greatly improve the performance of the copying process (from hours to minutes).

claudio.atzori added 4 commits 2024-04-16 14:22:26 +02:00
abf0b69f29 Upgrade the copying operation to Impala Cluster:
- Use only hive commands in the Ocean Cluster, as the "impala-shell" will be removed from there to free-up resources.
- Hugely improve the performance in every aspect of the copying process: a) speedup file-transferring and DB-deletion, b) eliminate permissions-assignment, "load" operations and "use $db" queries, c) retry only the "create view" statements and only as long as they depend on other non-created views, instead of trying to recreate all tables and views 5 consecutive times.
- Add error-checks for the creation of tables and views.
14719dcd62 Miscellaneous updates to the copying operation to Impala Cluster:
- Update the algorithm for creating views that depend on other views.
- Add check for successful execution of the "hadoop distcp" command.
- Add a check for successful copy operation of all entities.
- Upon facing an error in a DB, exit the method, instead of the whole script.
- Improve logging.
- Code polishing.
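
The "hadoop distcp" check and the "exit the method, instead of the whole script" behaviour described in this commit can be illustrated with a minimal Python sketch. This is not the PR's actual code; the helper names `run_checked` and `copy_db_files` and the two URI parameters are hypothetical, and the sketch only assumes that a non-zero exit status signals a failed copy.

```python
import subprocess
from typing import Sequence


def run_checked(cmd: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    try:
        result = subprocess.run(cmd)
    except OSError as exc:
        # The executable could not be started at all (e.g. not installed).
        print(f"ERROR: could not start {cmd[0]!r}: {exc}")
        return False
    if result.returncode != 0:
        print(f"ERROR: exit status {result.returncode} from: {' '.join(cmd)}")
        return False
    return True


def copy_db_files(src_uri: str, dst_uri: str) -> bool:
    """Copy one DB's files between clusters (hypothetical helper).

    Returns False on failure so the caller can skip this DB and
    continue with the next one, rather than aborting the whole run.
    """
    return run_checked(["hadoop", "distcp", src_uri, dst_uri])
```

Returning a boolean instead of terminating the process is what lets a failure in one DB leave the remaining DBs unaffected.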
d7da4f814b Minor updates to the copying operation to Impala Cluster:
- Improve logging.
- Code optimization/polishing.
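
The view-creation strategy mentioned in the commits above (retry only the "create view" statements that failed because they depend on views not yet created) can be sketched as follows. This is a minimal illustration, not the PR's implementation: the function name `create_views_with_retry` and the `execute` callback are hypothetical, and stopping when a full pass creates nothing new is an assumption about how "as long as they depend on other non-created views" is decided.

```python
from typing import Callable, Dict, List


def create_views_with_retry(
    statements: Dict[str, str],
    execute: Callable[[str], bool],
) -> List[str]:
    """Run "create view" statements, retrying only the ones that failed.

    `statements` maps a view name to its SQL; `execute` runs one statement
    and returns True on success. Returns the sorted names of the views
    that could not be created.
    """
    pending = dict(statements)
    while pending:
        failed = {}
        for name, sql in pending.items():
            if not execute(sql):
                # Probably depends on a view that does not exist yet.
                failed[name] = sql
        if len(failed) == len(pending):
            # Assumption: a pass with no progress means the remaining
            # failures are not caused by ordering, so stop retrying.
            return sorted(failed)
        pending = failed
    return []
```

Each pass shrinks the set of pending statements, so the number of passes is bounded by the depth of the view dependency chain rather than by a fixed retry count over all tables and views.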
claudio.atzori merged commit 589bce3520 into beta 2024-04-16 14:22:33 +02:00
claudio.atzori changed title from [pBETA] Improvements to copying data from ocean to impala to [BETA] Improvements to copying data from ocean to impala 2024-04-17 11:59:40 +02:00
Reference: D-Net/dnet-hadoop#421