Lampros Smyrnaios
d46b78b659
dhp-stats-update:
- Set Steps 2-7 and 9 to limit the number of files generated by Spark from 8000 down to 100, to improve file-transfer and querying performance.
- Allow the workflow to run up to Step10. Step11 seems to have some issues even when using a hive-action.
2024-04-18 15:40:27 +03:00
Lampros Smyrnaios
6f2ebb2a52
Revert Step8 and Step11 to use Hive again, since their "UPDATE" statements are not supported by Spark.
2024-04-18 15:35:03 +03:00
Lampros Smyrnaios
ca091c0f1e
dhp-stats-update:
- Fix not passing some parameters to some Spark actions.
- Allow the workflow to run up to Step7. The first 7 steps seem to work out of the box.
2024-04-17 14:03:59 +03:00
Lampros Smyrnaios
0b897f2f66
Fix and add missing "DROP TABLE" statements, in "dhp-stats-update" sql-scripts.
2024-04-16 18:17:54 +03:00
Lampros Smyrnaios
db33f7727c
Update "dhp-stats-update" workflow to use "spark"-actions, instead of "hive" ones.
Note: Currently the code is set to only test the "Step1".
2024-04-15 16:22:40 +03:00
Lampros Smyrnaios
d7da4f814b
Minor updates to the copying operation to Impala Cluster:
- Improve logging.
- Code optimization/polishing.
2024-04-12 18:12:06 +03:00
Lampros Smyrnaios
14719dcd62
Miscellaneous updates to the copying operation to Impala Cluster:
- Update the algorithm for creating views that depend on other views.
- Add check for successful execution of the "hadoop distcp" command.
- Add a check for successful copy operation of all entities.
- Upon facing an error in a DB, exit the method, instead of the whole script.
- Improve logging.
- Code polishing.
2024-04-12 15:36:13 +03:00
Lampros Smyrnaios
22745027c8
Use the "HADOOP_USER_NAME" value from the "workflow-property", in "copyDataToImpalaCluster.sh", in "stats-monitor-updates".
2024-04-11 17:46:33 +03:00
Lampros Smyrnaios
abf0b69f29
Upgrade the copying operation to Impala Cluster:
- Use only hive commands in the Ocean Cluster, as the "impala-shell" will be removed from there to free up resources.
- Hugely improve the performance in every aspect of the copying process: a) speedup file-transferring and DB-deletion, b) eliminate permissions-assignment, "load" operations and "use $db" queries, c) retry only the "create view" statements and only as long as they depend on other non-created views, instead of trying to recreate all tables and views 5 consecutive times.
- Add error-checks for the creation of tables and views.
2024-04-11 17:12:12 +03:00
Lampros Smyrnaios
b7c8acc563
- Update the code which acquires the "IMPALA_HDFS_NODE", to test the "tmp"-dir, instead of the base-dir and introduce retries, to overcome potential file-system failures. This change was suggested by "Sebastian Tymkow" and "Grzegorz Bakalarski".
- Fix typos.
2024-04-03 13:15:37 +03:00
Antonis Lempesis
df6e3bda04
added new orgs in monitor
2024-04-01 22:45:29 +03:00
Antonis Lempesis
573b081f1d
added new orgs in monitor
2024-04-01 22:24:46 +03:00
Antonis Lempesis
0bf2a7a359
fixed the result_country definition
2024-04-01 15:23:22 +03:00
Antonis Lempesis
9ff44eed96
fixed typo in indicator query
added more institutions
2024-03-27 14:39:01 +02:00
Antonis Lempesis
1fee4124e0
added missing EOS
2024-03-27 12:58:25 +02:00
Lampros Smyrnaios
036ba03fcd
Generate tables with parquet-files, instead of csv, in "dhp-stats-update/.../contexts.sh" script.
2024-03-26 13:29:04 +02:00
Lampros Smyrnaios
bc8c97182d
Automatically select the ACTIVE HDFS NODE for Impala cluster, in all "copyDataToImpalaCluster.sh" scripts.
2024-03-26 13:01:12 +02:00
Lampros Smyrnaios
92cc27e7eb
Use the ACTIVE HDFS NODE for Impala cluster, in "copyDataToImpalaCluster.sh" script.
2024-03-26 12:34:11 +02:00
Antonis Lempesis
4c40c96e30
code cleanup
2024-03-22 10:16:49 +02:00
Antonis Lempesis
459167ac2f
Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta
2024-03-21 12:44:58 +02:00
Antonis Lempesis
07f634a46d
code cleanup
2024-03-21 12:44:30 +02:00
Antonis Lempesis
9521625a07
code cleanup
2024-03-21 11:45:08 +02:00
Antonis Lempesis
67a5aa0a38
Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta
2024-03-19 11:24:54 +02:00
dimitrispie
a3a570e9a0
Commit monitor-updates-wf
2024-03-19 09:42:21 +02:00
Antonis Lempesis
f74c7e8689
selecting distinct peer_reviewed
2024-03-12 02:13:04 +02:00
Antonis Lempesis
3c79720342
fixed the irish result subset
2024-03-07 14:08:57 +02:00
Antonis Lempesis
5ae4b4286c
Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta
2024-03-07 12:15:19 +02:00
Antonis Lempesis
316d585c8a
using distinct apcs per publication to avoid huge sums
2024-03-07 02:07:59 +02:00
Antonis Lempesis
dd4c27f4f3
added 2 new institutions in monitor
2024-02-08 12:57:57 +02:00
Antonis Lempesis
a512ead447
changed orcid ids to all capital
2024-01-30 16:54:47 +02:00
Antonis Lempesis
bb10a22290
merged changes from dnet-hadoop
2024-01-29 21:51:47 +02:00
Claudio Atzori
f804c58bc7
Merge pull request 'Use SparkSQL in place of Hive for executing step16-createIndicatorsTables.sql of stats update wf' ( #386 ) from stats_with_spark_sql into beta
Reviewed-on: D-Net/dnet-hadoop#386
2024-01-29 09:11:59 +01:00
Claudio Atzori
926903b06b
Merge branch 'beta' into stats_with_spark_sql
2024-01-29 09:11:45 +01:00
Giambattista Bloisi
078df0b4d1
Use SparkSQL in place of Hive for executing step16-createIndicatorsTables.sql of stats update wf
2024-01-26 21:56:55 +01:00
Claudio Atzori
bf99c424fa
Merge pull request 'Fixed problem on missing author in crossref Mapping' ( #383 ) from crossref_missing_author_fix into beta
Reviewed-on: D-Net/dnet-hadoop#383
2024-01-26 15:57:23 +01:00
Claudio Atzori
ce3200263e
Merge branch 'beta' into crossref_missing_author_fix
2024-01-26 15:57:04 +01:00
Sandro La Bruzzo
e889808daa
Fixed problem on missing author in crossref Mapping
2024-01-26 12:19:04 +01:00
Claudio Atzori
9e8fc6aa88
[collection] increased logging from the oai-pmh metadata collection process
2024-01-26 09:17:20 +01:00
Antonis Lempesis
c548796463
Changed step16-createIndicatorsTables to use a spark oozie action instead of hive
2024-01-26 02:04:48 +02:00
Antonis Lempesis
a7115cfa9e
max mem of joins (hive.mapjoin.followby.gby.localtask.max.memory.usage) now 80%, up from 55%.
2024-01-25 15:13:16 +01:00
Antonis Lempesis
fd43b0e84a
max mem of joins (hive.mapjoin.followby.gby.localtask.max.memory.usage) now 80%, up from 55%.
2024-01-25 15:06:34 +01:00
Claudio Atzori
2838a9b630
Update 'CONTRIBUTING.md'
2024-01-24 16:07:05 +01:00
Claudio Atzori
da944a5c55
Merge pull request 'code of conduct and contributing' ( #382 ) from contributing into beta
Reviewed-on: D-Net/dnet-hadoop#382
2024-01-24 15:40:26 +01:00
Claudio Atzori
0c97a3a81a
minor
2024-01-24 10:56:33 +01:00
Claudio Atzori
2c1e6849f0
added code of conduct and contributing files
2024-01-24 10:36:41 +01:00
Claudio Atzori
9b13c22e5d
[graph provision] retrieve all the context information by adding all=true to the requests issued to the API
2024-01-23 15:36:08 +01:00
Claudio Atzori
3e96777cc4
[collection] increased logging from the oai-pmh metadata collection process
2024-01-23 15:21:03 +01:00
Claudio Atzori
9812406589
Merge pull request '[graph provision] updated param specification for the XML converter job' ( #380 ) from provision_community_api into beta
Reviewed-on: D-Net/dnet-hadoop#380
2024-01-23 08:55:59 +01:00
Claudio Atzori
f87f3a6483
[graph provision] updated param specification for the XML converter job
2024-01-23 08:54:37 +01:00
Claudio Atzori
6fd25cf549
code formatting
2024-01-23 08:47:12 +01:00