Antonis Lempesis
f3c179658a
datasource table creation split in steps
2024-09-30 17:12:21 +03:00
Antonis Lempesis
619aa34a15
Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta
2024-09-23 15:25:59 +03:00
Antonis Lempesis
dbea7a4072
removed duplicate line
2024-09-23 14:57:11 +03:00
Antonis Lempesis
c9241dba0d
Merge pull request 'convert_hive_to_spark_actions' (#1) from convert_hive_to_spark_actions into beta
Reviewed-on: antonis.lempesis/dnet-hadoop#1
2024-09-23 13:53:28 +02:00
Antonis Lempesis
b64c144abf
added new institutions
2024-09-05 16:00:09 +03:00
Antonis Lempesis
7d2c0a3723
added new institutions
2024-07-23 15:10:17 +03:00
Lampros Smyrnaios
e9686365a2
Improve performance of creating the "result_fos" table, by using a temp-table to cache data, which is requested multiple times.
2024-07-03 20:24:36 +03:00
Lampros Smyrnaios
ce0aee21cc
Improve performance of transferring the stats-DBs to another cluster and querying the DBs' tables, by ordering Spark to create up to 100 files per table, instead of thousands.
2024-07-03 20:15:33 +03:00
Lampros Smyrnaios
7b7dd32ad5
- Fix placement of some "set mapred.job.queue.name=analytics" statements and remove their unused "/*EOS*/" indicator.
- Add stacktrace-info to failed actions.
2024-07-03 19:53:24 +03:00
Lampros Smyrnaios
7ce051d766
- Update the remaining hive-actions to spark-actions.
- Update the version of shell-actions.
- Fix missing "/*EOS*/" indicators.
2024-07-03 19:49:19 +03:00
Lampros Smyrnaios
aa4d7d5e20
Prioritize the rest of the stats-queries over other tasks on the cluster, by putting them in the "analytics" queue.
2024-07-03 19:14:25 +03:00
Lampros Smyrnaios
54e11b6a43
Improve performance and efficiency by rewriting the creation process of "publication", "project", "dataset", "datasource", "software", "otherresearchproduct" and "result" tables, to be performed in a single query, for each one.
2024-07-03 13:03:15 +03:00
Lampros Smyrnaios
fe2275a9b0
Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions
# Conflicts:
# dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step14.sql
2024-06-25 20:17:47 +03:00
Lampros Smyrnaios
66cd28f70a
- Fix not using the "export HADOOP_USER_NAME" statement in "createPDFsAggregated.sh", which caused permission-issues when creating tables with Impala.
- Remove unused "--user" parameter in "impala-shell" calls.
- Code polishing.
2024-06-20 14:33:46 +03:00
Lampros Smyrnaios
3095047e5e
Miscellaneous updates to the copying operation to Impala Cluster:
- Fix not breaking out of the VIEWS-infinite-loop when the "SHOULD_EXIT_WHOLE_SCRIPT_UPON_ERROR" is set to "false".
- Exit the script when no HDFS-active-node was found, independently of the "SHOULD_EXIT_WHOLE_SCRIPT_UPON_ERROR".
- Fix view_name-recognition in a log-message, by using the more advanced "Perl-Compatible Regular Expressions" in "grep".
- Add error-handling for "compute stats" errors.
2024-06-18 14:40:41 +03:00
Antonis Lempesis
0456f1b788
Merge remote-tracking branch 'origin/beta' into beta
2024-06-14 15:11:30 +03:00
Antonis Lempesis
38636942c7
filtering out deletedbyinference and invisible results from accessroute
2024-06-14 15:11:19 +03:00
Lampros Smyrnaios
d942a1101b
Miscellaneous updates to the copying operation to Impala Cluster:
- Show some counts and the elapsed time for various sub-tasks.
- Code polishing.
2024-06-14 12:14:38 +03:00
Lampros Smyrnaios
e3f28338c1
Miscellaneous updates to the copying operation to Impala Cluster:
- Assign the WRITE and EXECUTE permissions to the DBs' HDFS-directories, in order to be able to create tables on top of them, in the Impala Cluster.
- Make sure the "copydb" function returns early, when it encounters a fatal error, while respecting the "SHOULD_EXIT_WHOLE_SCRIPT_UPON_ERROR" config.
2024-05-28 17:51:45 +03:00
Lampros Smyrnaios
888637773c
Add missing "/*EOS*/" comments.
2024-05-27 12:34:49 +03:00
Lampros Smyrnaios
e0ac494859
Merge branch 'beta' into convert_hive_to_spark_actions
# Conflicts:
# dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step15.sql
# dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step15_5.sql
# dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step16_1-definitions.sql
# dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step16_5.sql
# dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step2.sql
# dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step7.sql
2024-05-27 12:27:40 +03:00
Antonis Lempesis
15b54a345a
added fos lvl4
2024-05-24 13:21:28 +03:00
Lampros Smyrnaios
b48ed6e617
Change configuration in the copy-operation to Impala Cluster:
Set the "SHOULD_EXIT_WHOLE_SCRIPT_UPON_ERROR" parameter to "false".
2024-05-23 16:58:12 +03:00
Lampros Smyrnaios
68322843e2
Small updates to the copy-operation to Impala Cluster:
- Add a configuration-"switch" to control whether the script exits upon an error or not.
- Allow the script to exit when a table could not be created.
- Show the elapsed time for processing each database.
2024-05-23 15:07:49 +03:00
Lampros Smyrnaios
c7b32bbacc
Update CopyDataToImpalaCluster:
Update the code of acquiring the entities from Ocean cluster, through hive, in order to optimize the process and account for additional reserved keywords in Impala.
Co-authored-by: Antonis Lempesis <antleb@di.uoa.gr>
2024-05-23 13:00:19 +03:00
Antonis Lempesis
0cada3cc8f
every step is run in the analytics queue. Hardcoded for now, will make a parameter later
2024-05-08 13:42:53 +03:00
Antonis Lempesis
90a4fb3547
fixed typos
2024-05-08 13:17:58 +03:00
Lampros Smyrnaios
3c17183d10
Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions
2024-04-23 17:18:16 +03:00
Lampros Smyrnaios
49af2e5740
Miscellaneous updates to the copying operation to Impala Cluster:
...
- Update the algorithm for creating views that depend on other views; overcome some bash-instabilities.
- Upon any error, fail the whole process, not just the current DB-creation, as those errors usually indicate a bug in the initial DB-creation, that should be fixed immediately.
- Enhance parallel-copy of large files by "hadoop distcp" command.
- Reduce the "invalidate metadata" commands to just the current DB's tables, in order to eliminate the general overhead on Impala.
- Show the number of tables and views in the logs.
- Fix some log-messages.
2024-04-23 17:15:04 +03:00
Lampros Smyrnaios
69a9ac7393
Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions
2024-04-22 17:07:11 +03:00
Antonis Lempesis
b52a5a753b
Merge remote-tracking branch 'upstream/beta' into beta
2024-04-19 15:28:28 +03:00
Lampros Smyrnaios
342223f75c
Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions
2024-04-19 13:18:34 +03:00
Antonis Lempesis
c3fe9662b2
all indicator tables are now stored as parquet
2024-04-19 12:45:36 +03:00
Lampros Smyrnaios
2616971e2b
dhp-stats-update: remove leftover duplicate line
2024-04-18 16:18:16 +03:00
Lampros Smyrnaios
ba533d9f34
Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions
2024-04-18 15:47:56 +03:00
Lampros Smyrnaios
d46b78b659
dhp-stats-update:
- Set Steps 2-7 and 9 to limit the number of files generated by Spark, from 8000 down to 100, to improve file-transfer and querying performance.
- Allow the workflow to run up to Step10. The Step11 seems to have some issues even when using hive-action.
2024-04-18 15:40:27 +03:00
Lampros Smyrnaios
6f2ebb2a52
Revert Step8 and Step11 to use Hive again, since their "UPDATE" statements are not supported by Spark.
2024-04-18 15:35:03 +03:00
Claudio Atzori
57c678d904
integrating changes from PR#424
2024-04-18 11:38:35 +02:00
Claudio Atzori
5ab8cd1794
Various fixes for the stats DB update workflow, step16-createIndicatorsTables.sql
2024-04-18 11:28:18 +02:00
Antonis Lempesis
0c71c58df6
fixed the definition of gold_oa
2024-04-18 12:01:27 +03:00
Antonis Lempesis
43d05dbebb
fixed the definition of result_country
2024-04-18 11:53:50 +03:00
Antonis Lempesis
e728a0897c
fixed the definition of indi_pub_bronze_oa
2024-04-18 11:07:55 +03:00
Antonis Lempesis
308ae580a9
slight optimization in indi_pub_gold_oa definition
2024-04-18 10:57:52 +03:00
Antonis Lempesis
27d22bd8f9
slight optimization in indi_pub_gold_oa definition
2024-04-17 23:59:52 +03:00
Antonis Lempesis
1f5aba12fa
slight optimization in indi_pub_gold_oa definition
2024-04-17 23:54:23 +03:00
Lampros Smyrnaios
ca091c0f1e
dhp-stats-update:
- Fix not passing some parameters to some Spark actions.
- Allow the workflow to run up to Step7. The first 7 steps seem to work out of the box.
2024-04-17 14:03:59 +03:00
Lampros Smyrnaios
0b897f2f66
Fix and add missing "DROP TABLE" statements, in "dhp-stats-update" sql-scripts.
2024-04-16 18:17:54 +03:00
Lampros Smyrnaios
db33f7727c
Update "dhp-stats-update" workflow to use "spark"-actions, instead of "hive" ones.
Note: Currently the code is set to only test the "Step1".
2024-04-15 16:22:40 +03:00
Lampros Smyrnaios
d7da4f814b
Minor updates to the copying operation to Impala Cluster:
- Improve logging.
- Code optimization/polishing.
2024-04-12 18:12:06 +03:00
Lampros Smyrnaios
14719dcd62
Miscellaneous updates to the copying operation to Impala Cluster:
- Update the algorithm for creating views that depend on other views.
- Add check for successful execution of the "hadoop distcp" command.
- Add a check for successful copy operation of all entities.
- Upon facing an error in a DB, exit the method, instead of the whole script.
- Improve logging.
- Code polishing.
2024-04-12 15:36:13 +03:00