dnet-hadoop

Author	SHA1	Message	Date
Lampros Smyrnaios	54e11b6a43	Improve performance and efficiency by rewriting the creation process of "publication", "project", "dataset", "datasource", "software", "otherresearchproduct" and "result" tables, to be performed in a single query, for each one.	2024-07-03 13:03:15 +03:00
Lampros Smyrnaios	fe2275a9b0	Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions # Conflicts: # dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step14.sql	2024-06-25 20:17:47 +03:00
Lampros Smyrnaios	66cd28f70a	- Fix not using the "export HADOOP_USER_NAME" statement in "createPDFsAggregated.sh", which caused permission-issues when creating tables with Impala. - Remove unused "--user" parameter in "impala-shell" calls. - Code polishing.	2024-06-20 14:33:46 +03:00
Lampros Smyrnaios	3095047e5e	Miscellaneous updates to the copying operation to Impala Cluster: - Fix not breaking out of the VIEWS-infinite-loop when the "SHOULD_EXIT_WHOLE_SCRIPT_UPON_ERROR" is set to "false". - Exit the script when no HDFS-active-node was found, independently of the "SHOULD_EXIT_WHOLE_SCRIPT_UPON_ERROR". - Fix view_name-recognition in a log-message, by using the more advanced "Perl-Compatible Regular Expressions" in "grep". - Add error-handling for "compute stats" errors.	2024-06-18 14:40:41 +03:00
Antonis Lempesis	0456f1b788	Merge remote-tracking branch 'origin/beta' into beta	2024-06-14 15:11:30 +03:00
Antonis Lempesis	38636942c7	filtering out deletedbyinference and invisible results from accessroute	2024-06-14 15:11:19 +03:00
Lampros Smyrnaios	d942a1101b	Miscellaneous updates to the copying operation to Impala Cluster: - Show some counts and the elapsed time for various sub-tasks. - Code polishing.	2024-06-14 12:14:38 +03:00
Lampros Smyrnaios	e3f28338c1	Miscellaneous updates to the copying operation to Impala Cluster: - Assign the WRITE and EXECUTE permissions to the DBs' HDFS-directories, in order to be able to create tables on top of them, in the Impala Cluster. - Make sure the "copydb" function returns early, when it encounters a fatal error, while respecting the "SHOULD_EXIT_WHOLE_SCRIPT_UPON_ERROR" config.	2024-05-28 17:51:45 +03:00
Lampros Smyrnaios	888637773c	Add missing "/EOS/" comments.	2024-05-27 12:34:49 +03:00
Lampros Smyrnaios	e0ac494859	Merge branch 'beta' into convert_hive_to_spark_actions # Conflicts: # dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step15.sql # dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step15_5.sql # dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step16_1-definitions.sql # dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step16_5.sql # dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step2.sql # dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step7.sql	2024-05-27 12:27:40 +03:00
Antonis Lempesis	15b54a345a	added fos lvl4	2024-05-24 13:21:28 +03:00
Lampros Smyrnaios	b48ed6e617	Change configuration in the copy-operation to Impala Cluster: Set the "SHOULD_EXIT_WHOLE_SCRIPT_UPON_ERROR" parameter to "false".	2024-05-23 16:58:12 +03:00
Lampros Smyrnaios	68322843e2	Small updates to the copy-operation to Impala Cluster: - Add a configuration-"switch" to control whether the script exits upon an error or not. - Allow the script to exit when a table could not be created. - Show the elapsed time for processing each database.	2024-05-23 15:07:49 +03:00
Lampros Smyrnaios	c7b32bbacc	Update CopyDataToImpalaCluster: Update the code of acquiring the entities from Ocean cluster, through hive, in order to optimize the process and account for additional reserved keywords in Impala. Co-authored-by: Antonis Lempesis <antleb@di.uoa.gr>	2024-05-23 13:00:19 +03:00
Antonis Lempesis	0cada3cc8f	every step is run in the analytics queue. Hardcoded for now, will make a parameter later	2024-05-08 13:42:53 +03:00
Antonis Lempesis	90a4fb3547	fixed typos	2024-05-08 13:17:58 +03:00
Lampros Smyrnaios	3c17183d10	Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions	2024-04-23 17:18:16 +03:00
Lampros Smyrnaios	49af2e5740	Miscellaneous updates to the copying operation to Impala Cluster: - Update the algorithm for creating views that depend on other views; overcome some bash-instabilities. - Upon any error, fail the whole process, not just the current DB-creation, as those errors usually indicate a bug in the initial DB-creation, that should be fixed immediately. - Enhance parallel-copy of large files by "hadoop distcp" command. - Reduce the "invalidate metadata" commands to just the current DB's tables, in order to eliminate the general overhead on Impala. - Show the number of tables and views in the logs. - Fix some log-messages.	2024-04-23 17:15:04 +03:00
Lampros Smyrnaios	69a9ac7393	Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions	2024-04-22 17:07:11 +03:00
Antonis Lempesis	b52a5a753b	Merge remote-tracking branch 'upstream/beta' into beta	2024-04-19 15:28:28 +03:00
Lampros Smyrnaios	342223f75c	Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions	2024-04-19 13:18:34 +03:00
Antonis Lempesis	c3fe9662b2	all indicator tables are now stored as parquet	2024-04-19 12:45:36 +03:00
Lampros Smyrnaios	2616971e2b	dhp-stats-update: remove leftover duplicate line	2024-04-18 16:18:16 +03:00
Lampros Smyrnaios	ba533d9f34	Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions	2024-04-18 15:47:56 +03:00
Lampros Smyrnaios	d46b78b659	dhp-stats-update: - Set Steps 2-7 and 9 to limit the amount of files generated by Spark, from 8000, down to 100, to improve file-transfer and querying performance. - Allow the workflow to run up to Step10. The Step11 seems to have some issues even when using hive-action.	2024-04-18 15:40:27 +03:00
Lampros Smyrnaios	6f2ebb2a52	Revert Step8 and Step11 to use Hive again, since their "UPDATE" statements are not supported by Spark.	2024-04-18 15:35:03 +03:00
Claudio Atzori	57c678d904	integrating changes from PR#424	2024-04-18 11:38:35 +02:00
Claudio Atzori	5ab8cd1794	Various fixes for the stats DB update workflow, step16-createIndicatorsTables.sql	2024-04-18 11:28:18 +02:00
Antonis Lempesis	0c71c58df6	fixed the definition of gold_oa	2024-04-18 12:01:27 +03:00
Antonis Lempesis	43d05dbebb	fixed the definition of result_country	2024-04-18 11:53:50 +03:00
Antonis Lempesis	e728a0897c	fixed the definition of indi_pub_bronze_oa	2024-04-18 11:07:55 +03:00
Antonis Lempesis	308ae580a9	slight optimization in indi_pub_gold_oa definition	2024-04-18 10:57:52 +03:00
Antonis Lempesis	27d22bd8f9	slight optimization in indi_pub_gold_oa definition	2024-04-17 23:59:52 +03:00
Antonis Lempesis	1f5aba12fa	slight optimization in indi_pub_gold_oa definition	2024-04-17 23:54:23 +03:00
Lampros Smyrnaios	ca091c0f1e	dhp-stats-update: - Fix not passing some parameters to some Spark actions. - Allow the workflow to run up to Step7. The first 7 steps seem to work out of the box.	2024-04-17 14:03:59 +03:00
Lampros Smyrnaios	0b897f2f66	Fix and add missing "DROP TABLE" statements, in "dhp-stats-update" sql-scripts.	2024-04-16 18:17:54 +03:00
Lampros Smyrnaios	db33f7727c	Update "dhp-stats-update" workflow to use "spark"-actions, instead of "hive" ones. Note: Currently the code is set to only test the "Step1".	2024-04-15 16:22:40 +03:00
Lampros Smyrnaios	d7da4f814b	Minor updates to the copying operation to Impala Cluster: - Improve logging. - Code optimization/polishing.	2024-04-12 18:12:06 +03:00
Lampros Smyrnaios	14719dcd62	Miscellaneous updates to the copying operation to Impala Cluster: - Update the algorithm for creating views that depend on other views. - Add check for successful execution of the "hadoop distcp" command. - Add a check for successful copy operation of all entities. - Upon facing an error in a DB, exit the method, instead of the whole script. - Improve logging. - Code polishing.	2024-04-12 15:36:13 +03:00
Lampros Smyrnaios	abf0b69f29	Upgrade the copying operation to Impala Cluster: - Use only hive commands in the Ocean Cluster, as the "impala-shell" will be removed from there to free-up resources. - Hugely improve the performance in every aspect of the copying process: a) speedup file-transferring and DB-deletion, b) eliminate permissions-assignment, "load" operations and "use $db" queries, c) retry only the "create view" statements and only as long as they depend on other non-created views, instead of trying to recreate all tables and views 5 consecutive times. - Add error-checks for the creation of tables and views.	2024-04-11 17:12:12 +03:00
Lampros Smyrnaios	b7c8acc563	- Update the code which acquires the "IMPALA_HDFS_NODE", to test the "tmp"-dir, instead of the base-dir and introduce retries, to overcome potential file-system failures. This change was suggested by "Sebastian Tymkow" and "Grzegorz Bakalarski". - Fix typos.	2024-04-03 13:15:37 +03:00
Antonis Lempesis	df6e3bda04	added new orgs in monitor	2024-04-01 22:45:29 +03:00
Antonis Lempesis	573b081f1d	added new orgs in monitor	2024-04-01 22:24:46 +03:00
Antonis Lempesis	0bf2a7a359	fixed the result_country definition	2024-04-01 15:23:22 +03:00
Antonis Lempesis	9ff44eed96	fixed typo in indicator query; added more institutions	2024-03-27 14:39:01 +02:00
Antonis Lempesis	1fee4124e0	added missing EOS	2024-03-27 12:58:25 +02:00
Lampros Smyrnaios	036ba03fcd	Generate tables with parquet-files, instead of csv, in "dhp-stats-update/.../contexts.sh" script.	2024-03-26 13:29:04 +02:00
Lampros Smyrnaios	92cc27e7eb	Use the ACTIVE HDFS NODE for Impala cluster, in "copyDataToImpalaCluster.sh" script.	2024-03-26 12:34:11 +02:00
Antonis Lempesis	4c40c96e30	code cleanup	2024-03-22 10:16:49 +02:00
Antonis Lempesis	459167ac2f	Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta	2024-03-21 12:44:58 +02:00

356 Commits