dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Lampros Smyrnaios	d46b78b659	dhp-stats-update: - Set Steps 2-7 and 9 to limit the amount of files generated by Spark, from 8000, down to 100, to improve file-transfer and querying performance. - Allow the workflow to run up to Step10. The Step11 seems to have some issues even when using hive-action.	2024-04-18 15:40:27 +03:00
Lampros Smyrnaios	6f2ebb2a52	Revert Step8 and Step11 to use Hive again, since their "UPDATE" statements are not supported by Spark.	2024-04-18 15:35:03 +03:00
Lampros Smyrnaios	ca091c0f1e	dhp-stats-update: - Fix not passing some parameters to some Spark actions. - Allow the workflow to run up to Step7. The first 7 steps seem to work out of the box.	2024-04-17 14:03:59 +03:00
Lampros Smyrnaios	0b897f2f66	Fix and add missing "DROP TABLE" statements, in "dhp-stats-update" sql-scripts.	2024-04-16 18:17:54 +03:00
Lampros Smyrnaios	db33f7727c	Update "dhp-stats-update" workflow to use "spark"-actions, instead of "hive" ones. Note: Currently the code is set to only test the "Step1".	2024-04-15 16:22:40 +03:00
Lampros Smyrnaios	d7da4f814b	Minor updates to the copying operation to Impala Cluster: - Improve logging. - Code optimization/polishing.	2024-04-12 18:12:06 +03:00
Lampros Smyrnaios	14719dcd62	Miscellaneous updates to the copying operation to Impala Cluster: - Update the algorithm for creating views that depend on other views. - Add check for successful execution of the "hadoop distcp" command. - Add a check for successful copy operation of all entities. - Upon facing an error in a DB, exit the method, instead of the whole script. - Improve logging. - Code polishing.	2024-04-12 15:36:13 +03:00
Lampros Smyrnaios	22745027c8	Use the "HADOOP_USER_NAME" value from the "workflow-property", in "copyDataToImpalaCluster.sh", in "stats-monitor-updates".	2024-04-11 17:46:33 +03:00
Lampros Smyrnaios	abf0b69f29	Upgrade the copying operation to Impala Cluster: - Use only hive commands in the Ocean Cluster, as the "impala-shell" will be removed from there to free-up resources. - Hugely improve the performance in every aspect of the copying process: a) speedup file-transferring and DB-deletion, b) eliminate permissions-assignment, "load" operations and "use $db" queries, c) retry only the "create view" statements and only as long as they depend on other non-created views, instead of trying to recreate all tables and views 5 consecutive times. - Add error-checks for the creation of tables and views.	2024-04-11 17:12:12 +03:00
Lampros Smyrnaios	b7c8acc563	- Update the code which acquires the "IMPALA_HDFS_NODE", to test the "tmp"-dir, instead of the base-dir and introduce retries, to overcome potential file-system failures. This change was suggested by "Sebastian Tymkow" and "Grzegorz Bakalarski". - Fix typos.	2024-04-03 13:15:37 +03:00
Antonis Lempesis	df6e3bda04	added new orgs in monitor	2024-04-01 22:45:29 +03:00
Antonis Lempesis	573b081f1d	added new orgs in monitor	2024-04-01 22:24:46 +03:00
Antonis Lempesis	0bf2a7a359	fixed the result_country definition	2024-04-01 15:23:22 +03:00
Antonis Lempesis	9ff44eed96	fixed typo in indicator query added more institutions	2024-03-27 14:39:01 +02:00
Antonis Lempesis	1fee4124e0	added missing EOS	2024-03-27 12:58:25 +02:00
Lampros Smyrnaios	036ba03fcd	Generate tables with parquet-files, instead of csv, in "dhp-stats-update/.../contexts.sh" script.	2024-03-26 13:29:04 +02:00
Lampros Smyrnaios	bc8c97182d	Automatically select the ACTIVE HDFS NODE for Impala cluster, in all "copyDataToImpalaCluster.sh" scripts.	2024-03-26 13:01:12 +02:00
Lampros Smyrnaios	92cc27e7eb	Use the ACTIVE HDFS NODE for Impala cluster, in "copyDataToImpalaCluster.sh" script.	2024-03-26 12:34:11 +02:00
Antonis Lempesis	4c40c96e30	code cleanup	2024-03-22 10:16:49 +02:00
Antonis Lempesis	459167ac2f	Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta	2024-03-21 12:44:58 +02:00
Antonis Lempesis	07f634a46d	code cleanup	2024-03-21 12:44:30 +02:00
Antonis Lempesis	9521625a07	code cleanup	2024-03-21 11:45:08 +02:00
Antonis Lempesis	67a5aa0a38	Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta	2024-03-19 11:24:54 +02:00
dimitrispie	a3a570e9a0	Commit monitor-updates-wf	2024-03-19 09:42:21 +02:00
Antonis Lempesis	f74c7e8689	selecting distinct peer_reviewed	2024-03-12 02:13:04 +02:00
Antonis Lempesis	3c79720342	fixed the irish result subset	2024-03-07 14:08:57 +02:00
Antonis Lempesis	5ae4b4286c	Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta	2024-03-07 12:15:19 +02:00
Antonis Lempesis	316d585c8a	using distinct apcs per publication to avoid huge sums	2024-03-07 02:07:59 +02:00
Antonis Lempesis	dd4c27f4f3	added 2 new institutions in monitor	2024-02-08 12:57:57 +02:00
Antonis Lempesis	a512ead447	changed orcid ids to all capital	2024-01-30 16:54:47 +02:00
Antonis Lempesis	bb10a22290	merged changes from dnet-hadoop	2024-01-29 21:51:47 +02:00
Claudio Atzori	f804c58bc7	Merge pull request 'Use SparkSQL in place of Hive for executing step16-createIndicatorsTables.sql of stats update wf' (#386) from stats_with_spark_sql into beta. Reviewed-on: D-Net/dnet-hadoop#386	2024-01-29 09:11:59 +01:00
Claudio Atzori	926903b06b	Merge branch 'beta' into stats_with_spark_sql	2024-01-29 09:11:45 +01:00
Giambattista Bloisi	078df0b4d1	Use SparkSQL in place of Hive for executing step16-createIndicatorsTables.sql of stats update wf	2024-01-26 21:56:55 +01:00
Claudio Atzori	bf99c424fa	Merge pull request 'Fixed problem on missing author in crossref Mapping' (#383) from crossref_missing_author_fix into beta. Reviewed-on: D-Net/dnet-hadoop#383	2024-01-26 15:57:23 +01:00
Claudio Atzori	ce3200263e	Merge branch 'beta' into crossref_missing_author_fix	2024-01-26 15:57:04 +01:00
Sandro La Bruzzo	e889808daa	Fixed problem on missing author in crossref Mapping	2024-01-26 12:19:04 +01:00
Claudio Atzori	9e8fc6aa88	[collection] increased logging from the oai-pmh metadata collection process	2024-01-26 09:17:20 +01:00
Antonis Lempesis	c548796463	Changed step16-createIndicatorsTables to use a spark oozie action instead of hive	2024-01-26 02:04:48 +02:00
Antonis Lempesis	a7115cfa9e	max mem of joins (hive.mapjoin.followby.gby.localtask.max.memory.usage) now 80%, up from 55%.	2024-01-25 15:13:16 +01:00
Antonis Lempesis	fd43b0e84a	max mem of joins (hive.mapjoin.followby.gby.localtask.max.memory.usage) now 80%, up from 55%.	2024-01-25 15:06:34 +01:00
Claudio Atzori	2838a9b630	Update 'CONTRIBUTING.md'	2024-01-24 16:07:05 +01:00
Claudio Atzori	da944a5c55	Merge pull request 'code of conduct and contributing' (#382) from contributing into beta. Reviewed-on: D-Net/dnet-hadoop#382	2024-01-24 15:40:26 +01:00
Claudio Atzori	0c97a3a81a	minor	2024-01-24 10:56:33 +01:00
Claudio Atzori	2c1e6849f0	added code of conduct and contributing files	2024-01-24 10:36:41 +01:00
Claudio Atzori	9b13c22e5d	[graph provision] retrieve all the context information by adding all=true to the requests issued to the API	2024-01-23 15:36:08 +01:00
Claudio Atzori	3e96777cc4	[collection] increased logging from the oai-pmh metadata collection process	2024-01-23 15:21:03 +01:00
Claudio Atzori	9812406589	Merge pull request '[graph provision] updated param specification for the XML converter job' (#380) from provision_community_api into beta. Reviewed-on: D-Net/dnet-hadoop#380	2024-01-23 08:55:59 +01:00
Claudio Atzori	f87f3a6483	[graph provision] updated param specification for the XML converter job	2024-01-23 08:54:37 +01:00
Claudio Atzori	6fd25cf549	code formatting	2024-01-23 08:47:12 +01:00

4932 Commits