Commit Graph

  • 49af2e5740 Miscellaneous updates to the copying operation to Impala Cluster: - Update the algorithm for creating views that depend on other views; overcome some bash-instabilities. - Upon any error, fail the whole process, not just the current DB-creation, as those errors usually indicate a bug in the initial DB-creation, that should be fixed immediately. - Enhance parallel-copy of large files by "hadoop distcp" command. - Reduce the "invalidate metadata" commands to just the current DB's tables, in order to eliminate the general overhead on Impala. - Show the number of tables and views in the logs. - Fix some log-messages. (branch: beta) Lampros Smyrnaios 2024-04-23 17:15:04 +0300
  • d2649a1429 increased the jvm ram Antonis Lempesis 2024-04-23 16:03:16 +0300
  • b52a5a753b Merge remote-tracking branch 'upstream/beta' into beta Antonis Lempesis 2024-04-19 15:28:28 +0300
  • c3fe9662b2 all indicator tables are now stored as parquet Antonis Lempesis 2024-04-19 12:45:36 +0300
  • 2616971e2b dhp-stats-update: remove leftover duplicate line (branch: convert_hive_to_spark_actions) Lampros Smyrnaios 2024-04-18 16:18:16 +0300
  • ba533d9f34 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions Lampros Smyrnaios 2024-04-18 15:47:56 +0300
  • d46b78b659 dhp-stats-update: - Set Steps 2-7 and 9 to limit the amount of files generated by Spark, from 8000, down to 100, to improve file-transfer and querying performance. - Allow the workflow to run up to Step10. The Step11 seems to have some issues even when using hive-action. Lampros Smyrnaios 2024-04-18 15:40:27 +0300
  • 6f2ebb2a52 Revert Step8 and Step11 to use Hive again, since their "UPDATE" statements are not supported by Spark. Lampros Smyrnaios 2024-04-18 15:35:03 +0300
  • 57c678d904 integrating changes from PR#424 Claudio Atzori 2024-04-18 11:38:35 +0200
  • 5ab8cd1794 Various fixes for the stats DB update workflow, step16-createIndicatorsTables.sql Claudio Atzori 2024-04-18 11:28:18 +0200
  • 0c71c58df6 fixed the definition of gold_oa Antonis Lempesis 2024-04-18 12:01:27 +0300
  • 43d05dbebb fixed the definition of result_country Antonis Lempesis 2024-04-18 11:53:50 +0300
  • e728a0897c fixed the definition of indi_pub_bronze_oa Antonis Lempesis 2024-04-18 11:07:55 +0300
  • 308ae580a9 slight optimization in indi_pub_gold_oa definition Antonis Lempesis 2024-04-18 10:57:52 +0300
  • 27d22bd8f9 slight optimization in indi_pub_gold_oa definition Antonis Lempesis 2024-04-17 23:59:52 +0300
  • 1f5aba12fa slight optimization in indi_pub_gold_oa definition Antonis Lempesis 2024-04-17 23:54:23 +0300
  • ca091c0f1e dhp-stats-update: - Fix not passing some parameters to some Spark actions. - Allow the workflow to run up to Step7. The first 7 steps seem to work out of the box. Lampros Smyrnaios 2024-04-17 14:03:59 +0300
  • b554c41cc7 Merge pull request 'doidoost_dismiss' (#418) from doidoost_dismiss into beta Claudio Atzori 2024-04-17 12:01:11 +0200
  • ac8747582c Merge branch 'beta' into doidoost_dismiss Claudio Atzori 2024-04-17 12:01:01 +0200
  • 0db7e4ae9a Merge pull request 'Refinements to PR #404: refactoring the Oaf records merge utilities into dhp-common' (#422) from revised_merge_logic into beta Claudio Atzori 2024-04-17 11:58:26 +0200
  • 8ac167e420 Refinements to PR #404: refactoring the Oaf records merge utilities into dhp-common Giambattista Bloisi 2024-04-11 15:49:29 +0200
  • 0b897f2f66 Fix and add missing "DROP TABLE" statements, in "dhp-stats-update" sql-scripts. Lampros Smyrnaios 2024-04-16 18:17:54 +0300
  • 0625b9061f removed the funder id : 100011062 Asian Spinal Cord Network, wrongly associated to Ireland Miriam Baglioni 2024-04-16 15:26:53 +0200
  • 9eeb9f5d32 merging with branch beta Miriam Baglioni 2024-04-16 15:24:40 +0200
  • 589bce3520 Merge pull request '[pBETA] Improvements to copying data from ocean to impala' (#421) from antonis.lempesis/dnet-hadoop:beta into beta Claudio Atzori 2024-04-16 14:22:32 +0200
  • a5ddd8dfbb Added Action set generation for the MAG organization Sandro La Bruzzo 2024-04-16 13:39:15 +0200
  • da333e9f4d Merge pull request 'Enhance Dedup authors matching with algorithms used for ORCID enhancements (task 9690)' (#419) from dedup_authorsmatch_bytoken into beta Giambattista Bloisi 2024-04-16 10:24:11 +0200
  • 43fd1de681 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta Claudio Atzori 2024-04-16 09:42:05 +0200
  • d070db4a32 added a couple more invalid author names Claudio Atzori 2024-04-16 09:41:59 +0200
  • 78b9d84e4a test Michele Artini 2024-04-16 09:41:16 +0200
  • 43b454399f - Bug fix in matchOrderedTokenAndAbbreviations algorithms where tokens with same initial character were always considered equal - AuthorsMatch exploits the new matching strategy used for ORCID enhancements in #PR398: split author names in tokens, order the tokens, then check for matches of ordered full tokens or abbreviations Giambattista Bloisi 2024-04-15 18:19:29 +0200
  • db33f7727c Update "dhp-stats-update" workflow to use "spark"-actions, instead of "hive" ones. Note: Currently the code is set to only test the "Step1". Lampros Smyrnaios 2024-04-15 16:22:40 +0300
  • d7da4f814b Minor updates to the copying operation to Impala Cluster: - Improve logging. - Code optimization/polishing. Lampros Smyrnaios 2024-04-12 18:12:06 +0300
  • 14719dcd62 Miscellaneous updates to the copying operation to Impala Cluster: - Update the algorithm for creating views that depend on other views. - Add check for successful execution of the "hadoop distcp" command. - Add a check for successful copy operation of all entities. - Upon facing an error in a DB, exit the method, instead of the whole script. - Improve logging. - Code polishing. Lampros Smyrnaios 2024-04-12 15:36:13 +0300
  • 41a42dde64 code formatted Sandro La Bruzzo 2024-04-11 17:43:48 +0200
  • 843dc95340 resolved conflict Sandro La Bruzzo 2024-04-11 17:38:16 +0200
  • 1e30454ee0 added vocabulary to instanceTypeMapping of MAG Sandro La Bruzzo 2024-04-11 17:32:30 +0200
  • 2581672c11 updated wf of MAG and crossref to use transaction Sandro La Bruzzo 2024-04-11 17:27:49 +0200
  • 22745027c8 Use the "HADOOP_USER_NAME" value from the "workflow-property", in "copyDataToImpalaCluster.sh", in "stats-monitor-updates". Lampros Smyrnaios 2024-04-11 17:46:33 +0300
  • abf0b69f29 Upgrade the copying operation to Impala Cluster: - Use only hive commands in the Ocean Cluster, as the "impala-shell" will be removed from there to free-up resources. - Hugely improve the performance in every aspect of the copying process: a) speedup file-transferring and DB-deletion, b) eliminate permissions-assignment, "load" operations and "use $db" queries, c) retry only the "create view" statements and only as long as they depend on other non-created views, instead of trying to recreate all tables and views 5 consecutive times. - Add error-checks for the creation of tables and views. Lampros Smyrnaios 2024-04-11 17:12:12 +0300
  • 3cad4a415d fixed duplicated property dhp-schemas.version Claudio Atzori 2024-04-11 15:44:12 +0200
  • a0642bd190 added instanceTypeMapping field on MAG Sandro La Bruzzo 2024-04-11 13:10:12 +0200
  • 98dc042db5 mapping generated for MAG, missing generation of Organization Action set Sandro La Bruzzo 2024-04-05 18:12:53 +0200
  • ef582948a7 Updated mapping Sandro La Bruzzo 2024-04-05 11:10:44 +0200
  • 5142f462b5 completed mapping from paper to OAF, not tested Sandro La Bruzzo 2024-04-04 21:06:04 +0200
  • 0794e0667b Merge branch 'doidoost_dismiss' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doidoost_dismiss Miriam Baglioni 2024-04-04 09:16:18 +0200
  • 4b1de076ac [DataciteHostedByMap] added entry for EBRAINS Miriam Baglioni 2024-04-04 09:16:14 +0200
  • c8a88b2187 [DataciteHostedByMap] added entry for EBRAINS Miriam Baglioni 2024-04-04 09:14:58 +0200
  • 31e152d2bb Merge remote-tracking branch 'origin/doidoost_dismiss' into doidoost_dismiss Sandro La Bruzzo 2024-04-03 17:08:35 +0200
  • 6f3e925cae Implemented first part of the new MAG mapping Sandro La Bruzzo 2024-04-03 17:07:14 +0200
  • f0f6abf892 [MapToFunderLink] added references for HFRI and Erasmus+ for the creation of links for funders Miriam Baglioni 2024-04-03 14:59:09 +0200
  • 26b97aa5ed Merge pull request '[BETA] fixed the result_country definition and updated the stats DB copy procedure' (#416) from antonis.lempesis/dnet-hadoop:beta into beta Claudio Atzori 2024-04-03 12:36:03 +0200
  • b7c8acc563 - Update the code which acquires the "IMPALA_HDFS_NODE", to test the "tmp"-dir, instead of the base-dir and introduce retries, to overcome potential file-system failures. This change was suggested by "Sebastian Tymkow" and "Grzegorz Bakalarski". - Fix typos. Lampros Smyrnaios 2024-04-03 13:15:37 +0300
  • 50fbebf186 [NOAMI] removed entry for Health and Social Care Board from the list of funders. Modified IRC putting 1596 and 1597 as synonyms, as required in ticket 9635 Miriam Baglioni 2024-04-03 11:45:40 +0200
  • 71d6e02886 Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into beta Michele Artini 2024-04-03 09:50:41 +0200
  • 02c9a311c8 base datainfo with trust=0.89 Michele Artini 2024-04-03 09:50:21 +0200
  • 42846d3b91 [OpenCitation] add compression option when writing the sequence file Miriam Baglioni 2024-04-03 09:25:00 +0200
  • 4f0a044245 Merge pull request 'Add action set creation for Datacite affiliations' (#413) from 9647_datacite_affiliations into beta Miriam Baglioni 2024-04-02 17:33:38 +0200
  • 4bb504e693 Merge pull request '[UsageCount] fixed error' (#415) from UsageStatsRecordDS into beta Miriam Baglioni 2024-04-02 17:06:12 +0200
  • cbe13a5c61 Fix datacite input path in properties file Serafeim Chatzopoulos 2024-04-02 18:00:35 +0300
  • 9c9a9562ae [UsageCount] fixed error Miriam Baglioni 2024-04-02 16:56:37 +0200
  • 2c4440951f Merge pull request '[UsageCount] add check in case the datasource is not matched against those present in the graph' (#414) from UsageStatsRecordDS into beta Miriam Baglioni 2024-04-02 16:30:39 +0200
  • b42bdd5fb3 [UsageCount] add check in case the datasource is not matched against those present in the graph Miriam Baglioni 2024-04-02 16:28:27 +0200
  • 64cbd8abe9 Merge pull request '[UsageCount] Usage count per result split by datasource' (#318) from UsageStatsRecordDS into beta Miriam Baglioni 2024-04-02 10:21:39 +0200
  • df6e3bda04 added new orgs in monitor Antonis Lempesis 2024-04-01 22:45:29 +0300
  • 573b081f1d added new orgs in monitor Antonis Lempesis 2024-04-01 22:24:46 +0300
  • 0eb0701b26 Add action set creation for Datacite affiliations Serafeim Chatzopoulos 2024-04-01 17:23:26 +0300
  • 0bf2a7a359 fixed the result_country definition Antonis Lempesis 2024-04-01 15:23:22 +0300
  • 24227ab598 Merge pull request '[BETA] fixed typo in indicator query' (#411) from antonis.lempesis/dnet-hadoop:beta into beta Claudio Atzori 2024-03-27 13:56:43 +0100
  • 9ff44eed96 fixed typo in indicator query; added more institutions Antonis Lempesis 2024-03-27 14:39:01 +0200
  • cff6040424 Merge pull request '[BETA] added missing EOS, Generate tables with parquet-files, instead of csv in the contexts.sh script' (#409) from antonis.lempesis/dnet-hadoop:beta into beta Claudio Atzori 2024-03-27 12:04:04 +0100
  • 1fee4124e0 added missing EOS Antonis Lempesis 2024-03-27 12:58:25 +0200
  • 73a67c0e4a Improved Crossref mapping to also include Unpaywall; tested Sandro La Bruzzo 2024-03-26 17:26:47 +0100
  • 9e700a8b0d Merge pull request 'adding context information to projects and datasources' (#407) from taggingProjects into beta Claudio Atzori 2024-03-26 14:53:38 +0100
  • 75551ad4ec code formatting Claudio Atzori 2024-03-26 14:53:16 +0100
  • 94b931f7bd [BulkTagging - tag datasource and projects] merging with branch beta Miriam Baglioni 2024-03-26 14:25:19 +0100
  • 3b209261f2 [BulkTagging - tag datasource and projects] merging with branch beta Miriam Baglioni 2024-03-26 14:21:27 +0100
  • 036ba03fcd Generate tables with parquet-files, instead of csv, in "dhp-stats-update/.../contexts.sh" script. Lampros Smyrnaios 2024-03-26 13:29:04 +0200
  • 730eaffc85 Merge pull request 'correctly selecting the active hdfs node for the impala cluster' (#405) from antonis.lempesis/dnet-hadoop:beta into beta Claudio Atzori 2024-03-26 12:07:46 +0100
  • bc8c97182d Automatically select the ACTIVE HDFS NODE for Impala cluster, in all "copyDataToImpalaCluster.sh" scripts. Lampros Smyrnaios 2024-03-26 13:01:12 +0200
  • 92cc27e7eb Use the ACTIVE HDFS NODE for Impala cluster, in "copyDataToImpalaCluster.sh" script. Lampros Smyrnaios 2024-03-26 12:34:11 +0200
  • ef52128c55 included new stats* workflows in parent pom list of modules, code formatting Claudio Atzori 2024-03-26 10:42:10 +0100
  • bfba71a95c further follow up changes from integrating the mergeutils branch Claudio Atzori 2024-03-26 09:01:18 +0100
  • d72e7b7487 Merge pull request 'Changes to indicators and funders definition' (#372) from antonis.lempesis/dnet-hadoop:beta into beta Claudio Atzori 2024-03-26 08:46:20 +0100
  • ece56f0178 update crossref mapping to be transformed together with UnpayWall Sandro La Bruzzo 2024-03-25 18:18:10 +0100
  • 414acd4ef4 Merge pull request 'refactoring the Oaf records merge utilities into dhp-common' (#404) from mergeutils into beta Claudio Atzori 2024-03-25 16:16:07 +0100
  • ecff0b4825 merge from beta Claudio Atzori 2024-03-25 16:15:52 +0100
  • 25c2025223 Merge pull request 'mapped oaf:country from results' (#403) from oaf_country_beta into beta Claudio Atzori 2024-03-25 16:13:31 +0100
  • 538b180fe0 Merge branch 'beta' into oaf_country_beta Claudio Atzori 2024-03-25 16:13:20 +0100
  • eae88c0fe3 Merge pull request 'Solr JSON payload' (#399) from index_records into beta Claudio Atzori 2024-03-25 16:12:59 +0100
  • 82fc609c4f Merge branch 'beta' into index_records Claudio Atzori 2024-03-25 16:12:49 +0100
  • 4b978ffa2d align dhp-schema.version with the beta branch Claudio Atzori 2024-03-25 16:12:36 +0100
  • fa4b3e6d2b Merge pull request 'Open Citation integration' (#401) from ocnew into beta Claudio Atzori 2024-03-25 16:10:40 +0100
  • 74e5d05577 Merge branch 'beta' into ocnew Claudio Atzori 2024-03-25 16:10:31 +0100
  • 6c3b692f60 integrated minor change from beta branch Claudio Atzori 2024-03-25 16:10:23 +0100
  • e9eb590f87 Merge pull request 'FOS ActionSet for the classification of results without a doi' (#397) from FOSNew into beta Claudio Atzori 2024-03-25 16:07:47 +0100
  • 9a5b134ddf Merge branch 'beta' into FOSNew Claudio Atzori 2024-03-25 16:07:37 +0100
  • 069803f34a Merge pull request 'Added exception throwing in Hadoop transformation when TR is not syntactically valid' (#387) from exception_on_invalid_transofmation_rule into beta Claudio Atzori 2024-03-25 16:05:43 +0100
  • 71c1f81b54 Merge branch 'beta' into exception_on_invalid_transofmation_rule Claudio Atzori 2024-03-25 16:05:11 +0100
  • c3c9bdb59c Merge pull request 'bulkTaggingPathMapExtention' (#381) from bulkTaggingPathMapExtention into beta Claudio Atzori 2024-03-25 16:02:01 +0100