convert_hive_to_spark_actions #1

antonis.lempesis · 2024-09-23T13:53:08+02:00

No description provided.

antonis.lempesis added 20 commits 2024-09-23 13:53:08 +02:00

db33f7727c Update "dhp-stats-update" workflow to use "spark"-actions, instead of "hive" ones.

Note: Currently the code is set to test only "Step1".

0b897f2f66 Fix and add missing "DROP TABLE" statements, in "dhp-stats-update" sql-scripts.

ca091c0f1e dhp-stats-update:

- Fix not passing some parameters to some Spark actions.
- Allow the workflow to run up to Step7. The first 7 steps seem to work out of the box.

6f2ebb2a52 Revert Step8 and Step11 to use Hive again, since their "UPDATE" statements are not supported by Spark.

d46b78b659 dhp-stats-update:

- Set Steps 2-7 and 9 to limit the number of files generated by Spark from 8000 down to 100, to improve file-transfer and querying performance.
- Allow the workflow to run up to Step10. Step11 seems to have some issues even when using a hive-action.

ba533d9f34 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions

2616971e2b dhp-stats-update: remove leftover duplicate line

342223f75c Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions

69a9ac7393 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions

3c17183d10 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions

e0ac494859 Merge branch 'beta' into convert_hive_to_spark_actions

# Conflicts:
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step15.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step15_5.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step16_1-definitions.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step16_5.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step2.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step7.sql

888637773c Add missing "/*EOS*/" comments.
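For readers unfamiliar with the "/*EOS*/" comments, the sketch below shows one way such end-of-statement markers could be used to cut a multi-statement SQL script into individual statements before each one is submitted. It is only an illustration under assumptions: the function name `split_statements`, the `demo_db` tables, and the stripping of a trailing semicolon are hypothetical and are not taken from this repository.

```python
from typing import List

EOS_MARKER = "/*EOS*/"  # the end-of-statement marker named in the commit message


def split_statements(script: str) -> List[str]:
    """Split a SQL script into statements at each EOS marker (assumed semantics)."""
    statements = []
    for fragment in script.split(EOS_MARKER):
        # Assumption: a trailing semicolon is dropped before the statement is submitted.
        statement = fragment.strip().rstrip(";").strip()
        if statement:  # skip empty fragments, e.g. the text after the last marker
            statements.append(statement)
    return statements


# Hypothetical script content, for illustration only.
example_script = """
DROP TABLE IF EXISTS demo_db.demo_table; /*EOS*/
CREATE TABLE demo_db.demo_table AS SELECT 1 AS id; /*EOS*/
"""

for statement in split_statements(example_script):
    print(statement)
```

With this kind of splitting, a statement that lacks its marker would be merged with the next one and submitted as a single string, which is presumably why missing markers had to be added.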

a644a6f4fe Catch Spark-sql errors and show a log with the statement that failed.
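The idea of catching a failure and logging the statement that caused it can be sketched as follows. This is a minimal illustration, not the code from the commit: `run_statements`, the logger name, and the `execute` callback are hypothetical. The callback keeps the sketch independent of Spark; with PySpark it could be `spark.sql`.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("stats_sql_runner")  # hypothetical logger name


def run_statements(statements: Iterable[str], execute: Callable[[str], object]) -> int:
    """Run statements in order; on failure, log the offending statement and re-raise."""
    executed = 0
    for number, statement in enumerate(statements, start=1):
        try:
            execute(statement)
        except Exception:
            # logger.exception records the stack trace along with the message.
            logger.exception("Statement %d failed:\n%s", number, statement)
            raise  # propagate the error so the surrounding action still fails
        executed += 1
    return executed
```

Re-raising after logging keeps the failure visible to whatever scheduler runs the action, while the log pinpoints which statement of a long script broke.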

fe2275a9b0 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions

# Conflicts:
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step14.sql

54e11b6a43 Improve performance and efficiency by rewriting the creation process of "publication", "project", "dataset", "datasource", "software", "otherresearchproduct" and "result" tables, to be performed in a single query, for each one.

aa4d7d5e20 Prioritize the rest of the stats-queries over other tasks on the cluster, by putting them in the "analytics" queue.

7ce051d766 - Update the remaining hive-actions to spark-actions.

- Update the version of shell-actions.
- Fix missing "/*EOS*/" indicators.

7b7dd32ad5 - Fix placement of some "set mapred.job.queue.name=analytics" statements and remove their unused "/*EOS*/" indicator.

- Add stacktrace-info to failed actions.

ce0aee21cc Improve performance of transferring the stats-DBs to another cluster and querying the DBs' tables, by instructing Spark to create up to 100 files per table, instead of thousands.
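The commit message does not say how the file count is capped. One possible mechanism, shown here purely as an assumption, is Spark SQL's `COALESCE` partitioning hint (available since Spark 2.4), which reduces the number of output partitions and therefore the number of files written. The helper `ctas_with_coalesce` and the table names are hypothetical; the sketch only builds the statement text.

```python
def ctas_with_coalesce(target_table: str, select_body: str, max_files: int = 100) -> str:
    """Build a CREATE TABLE AS SELECT statement carrying a COALESCE hint.

    Assumption: the hint is placed directly after SELECT, as Spark SQL expects.
    COALESCE only merges partitions, so the result is at most `max_files` files.
    """
    if max_files < 1:
        raise ValueError("max_files must be at least 1")
    return (
        f"CREATE TABLE {target_table} STORED AS PARQUET AS "
        f"SELECT /*+ COALESCE({max_files}) */ {select_body}"
    )


# Hypothetical table names, for illustration only.
print(ctas_with_coalesce("demo_db.result_copy", "* FROM demo_db.result"))
```

Because the hint never increases the partition count, small tables would still produce fewer than 100 files, which matches the "up to 100" wording.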

e9686365a2 Improve performance of creating the "result_fos" table, by using a temp-table to cache data, which is requested multiple times.

antonis.lempesis merged commit c9241dba0d into beta

2024-09-23 13:53:29 +02:00

antonis.lempesis referenced this issue from a commit

2024-09-23 13:53:30 +02:00

Merge pull request 'convert_hive_to_spark_actions' (#1) from convert_hive_to_spark_actions into beta

Reference: antonis.lempesis/dnet-hadoop#1