Commit Graph

4443 Commits

Author SHA1 Message Date
Miriam Baglioni 75d5ddb999 Update to include a blackList that filters out the results we know are wrongly associated to IE - update workflow definition - the blacklist parameter 2024-05-27 12:01:28 +02:00
Miriam Baglioni 87c9c61b41 Update to include a blackList that filters out the results we know are wrongly associated to IE - refactoring 2024-05-27 12:01:16 +02:00
Miriam Baglioni b55fed09f8 Update to include a blackList that filters out the results we know are wrongly associated to IE 2024-05-27 12:01:01 +02:00
Claudio Atzori 107d958b89 [org dedup] avoid NPEs in SparkPrepareNewOrgs 2024-05-27 11:59:54 +02:00
Claudio Atzori 3a7a6ecc32 [org dedup] avoid NPEs in SparkPrepareOrgRels 2024-05-27 11:59:45 +02:00
Claudio Atzori 1af4224d3d [org dedup] avoid NPEs in SparkPrepareOrgRels 2024-05-27 11:59:33 +02:00
Claudio Atzori 0d5bdb2db0 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2024-05-27 11:59:02 +02:00
Claudio Atzori 66548e6a83 Merge pull request 'changes in copy script' (#438) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: #438
2024-05-27 11:54:03 +02:00
Antonis Lempesis 15b54a345a added fos lvl4 2024-05-24 13:21:28 +03:00
Lampros Smyrnaios b48ed6e617 Change configuration in the copy-operation to Impala Cluster:
Set the "SHOULD_EXIT_WHOLE_SCRIPT_UPON_ERROR" parameter to "false".
2024-05-23 16:58:12 +03:00
Lampros Smyrnaios 68322843e2 Small updates to the copy-operation to Impala Cluster:
- Add a configuration-"switch" to control whether the script exits upon an error or not.
- Allow the script to exit when a table could not be created.
- Show the elapsed time for processing each database.
2024-05-23 15:07:49 +03:00
Lampros Smyrnaios c7b32bbacc Update CopyDataToImpalaCluster:
Update the code of acquiring the entities from Ocean cluster, through hive, in order to optimize the process and account for additional reserved keywords in Impala.

Co-authored-by: Antonis Lempesis <antleb@di.uoa.gr>
2024-05-23 13:00:19 +03:00
Sandro La Bruzzo f1fe363b19 merged again from beta (I hope for the last time) 2024-05-22 11:08:52 +02:00
Sandro La Bruzzo 66c1ffc866 merged again from beta (I hope for the last time) 2024-05-22 11:02:46 +02:00
Claudio Atzori 1ea67eba82 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2024-05-21 13:48:48 +02:00
Claudio Atzori 834461ba26 [graph provision]fixed wf definition, revised serialization of the usage counts measures 2024-05-21 13:48:06 +02:00
Sandro La Bruzzo e8a61d5dd5 removed plugin, use only FileGZip plugin 2024-05-21 13:45:29 +02:00
Sandro La Bruzzo ca9414b737 Implement multiple node name splitter on GZipCollectorPlugin and all nodes that use XMLIterator. If the splitter name contains is a comma separated values it splits for all the values 2024-05-21 09:11:13 +02:00
Sandro La Bruzzo 032bcc8279 since last beta workflow we decide to introduce in the graph only MAG item with DOI and set them invisible ( this should be the same behaviour of the previous DOIBoost mapping).
This commit apply this type of mapping
2024-05-20 09:24:15 +02:00
Sandro La Bruzzo 103e2652b3 merged beta 2024-05-17 14:43:07 +02:00
Sandro La Bruzzo a87f9ea643 fixed scholexplorer bug 2024-05-17 14:16:43 +02:00
Sandro La Bruzzo 6efab4d88e fixed scholexplorer bug 2024-05-16 16:19:18 +02:00
Claudio Atzori 92f018d196 [graph provision] fixed path pointing to an intermediate data store in the working directory 2024-05-15 15:39:18 +02:00
Claudio Atzori 0611c81a2f [graph provision] using Qualifier.classNames to populate the correponsing fields in the JSON payload 2024-05-15 15:33:10 +02:00
Michele Artini 2b3b5fe9a1 oai finalization and test 2024-05-15 14:13:16 +02:00
Claudio Atzori 1efe7f7e39 [graph provision] upgrade to dhp-schema:6.1.2, included project.oamandatepublications in the JSON payload mapping, fixed serialisation of the usageCounts measures 2024-05-14 12:39:31 +02:00
Claudio Atzori f7d56e2ef2 Merge branch 'beta' into rest-collector-plugin-with-retry 2024-05-10 09:02:21 +02:00
Claudio Atzori dc3a5858f7 Merge branch 'beta' into beta_provision_relation 2024-05-09 14:14:43 +02:00
Claudio Atzori 55f39f7850 [graph provision] adds the possibility to validate the XML records before storing them via the validateXML parameter 2024-05-09 14:06:04 +02:00
Claudio Atzori 39a2afe8b5 [graph provision] fixed XML serialization of the usage counts measures, renamed workflow actions to better reflect their role 2024-05-09 13:54:42 +02:00
Claudio Atzori 908ed9da7a Merge pull request 'Various fixes in the stats wf' (#430) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: #430
2024-05-08 13:41:02 +02:00
Antonis Lempesis 0cada3cc8f every step is run in the analytics queue. Hardcoded for now, will make a parameter later 2024-05-08 13:42:53 +03:00
Antonis Lempesis 90a4fb3547 fixed typos 2024-05-08 13:17:58 +03:00
Claudio Atzori 18aa323ee9 cleanup unused classes, adjustments in the oozie wf definition 2024-05-08 11:36:46 +02:00
Michele Artini c9a327bc50 refactoring of gzip method 2024-05-08 11:34:08 +02:00
Michele Artini e234848af8 oaf record: xpath for root 2024-05-08 10:00:53 +02:00
Claudio Atzori b4e3389432 fixed property mapping creating the RelatedEntity transient objects. spark cores & memory adjustments. Code formatting 2024-05-07 16:25:17 +02:00
Giambattista Bloisi 711048ceed PrepareRelationsJob rewritten to use Spark Dataframe API and Windowing functions 2024-05-07 15:44:33 +02:00
Michele Artini 70bf6ac415 oai exporter tests 2024-05-07 09:36:26 +02:00
Michele Artini aa40e53c19 oai exporter parameters 2024-05-07 08:01:19 +02:00
Michele Artini ed052a3476 job for the population of the oai database 2024-05-06 16:08:33 +02:00
Claudio Atzori 26363060ed fixed id prefix creation for the fosnodoi records, again 2024-05-03 15:53:52 +02:00
Claudio Atzori 0486227185 [cleaning] deactivating the cleaning of FOS subjects found in the metadata provided by repositories 2024-05-03 14:31:12 +02:00
Claudio Atzori e1a0fb8933 fixed id prefix creation for the fosnodoi records 2024-05-03 14:14:18 +02:00
Sandro La Bruzzo 26bf8e763a merged from beta 2024-05-02 15:20:23 +02:00
Sandro La Bruzzo 0646d0d064 Updated main sparkApplication to avoid to require master variable 2024-05-02 15:15:03 +02:00
Claudio Atzori 4355f64810 reverted to version 1.2.5-SNAPSHOT 2024-05-02 11:23:53 +02:00
Claudio Atzori 66680b8b9a refactoring of common utilities 2024-05-02 11:16:58 +02:00
Claudio Atzori dcf23b3d06 Merge branch 'beta' into beta-release-1.2.5 2024-05-02 10:01:49 +02:00
Michele Artini f4068de298 code reindent + tests 2024-05-02 09:51:33 +02:00
Claudio Atzori 11bd89e132 [enrichment] use sparkExecutorMemory to define also the memoryOverhead 2024-05-01 08:32:59 +02:00
Claudio Atzori e96c2c1606 [ranking wf] set spark.executor.memoryOverhead to fine tune the resource consumption 2024-04-30 16:23:25 +02:00
Claudio Atzori 50c18f7a0b [dedup wf] revised memory settings to address the increased volume of input contents 2024-04-30 12:34:16 +02:00
Michele Artini 2615136efc added a retry mechanism 2024-04-30 11:58:42 +02:00
Sandro La Bruzzo 133ead1e3e updated new version of scholexplorer Generation 2024-04-29 09:00:30 +02:00
Sandro La Bruzzo 052c6aac9d formatted code 2024-04-26 16:03:04 +02:00
Sandro La Bruzzo 9cd3bc0f10 Added a new generation of the dump for scholexplorer tested with last version of spark, and strongly refactored 2024-04-26 16:02:07 +02:00
Claudio Atzori e2937db385 Merge branch 'beta' into misc_fixes_merge_entities 2024-04-24 08:55:28 +02:00
Giambattista Bloisi 1878199dae Miscellaneous fixes:
- in Merge By ID pick by preference those records coming from delegated Authorities
- fix various tests
- close spark session in SparkCreateSimRels
2024-04-24 08:12:45 +02:00
Sandro La Bruzzo 0d628cd62b merged again from beta 2024-04-23 17:34:55 +02:00
Lampros Smyrnaios 49af2e5740 Miscellaneous updates to the copying operation to Impala Cluster:
- Update the algorithm for creating views that depend on other views; overcome some bash-instabilities.
- Upon any error, fail the whole process, not just the current DB-creation, as those errors usually indicate a bug in the initial DB-creation, that should be fixed immediately.
- Enhance parallel-copy of large files by "hadoop distcp" command.
- Reduce the "invalidate metadata" commands to just the current DB's tables, in order to eliminate the general overhead on Impala.
- Show the number of tables and views in the logs.
- Fix some log-messages.
2024-04-23 17:15:04 +03:00
Antonis Lempesis d2649a1429 increased the jvm ram 2024-04-23 16:03:16 +03:00
Claudio Atzori c3053ef34d using version 1.2.5-beta for the release 2024-04-23 14:52:32 +02:00
Claudio Atzori b5bcab13ec using version 1.2.5-beta for the release 2024-04-23 14:36:39 +02:00
Claudio Atzori 425c9afc36 using version 1.2.5-beta for the release 2024-04-23 14:30:04 +02:00
Claudio Atzori 93dd9cc639 code formatting 2024-04-23 11:28:00 +02:00
Miriam Baglioni 6189879643 [NOAMI] removed entry for Irish Research eLibray (IReL) Care Board from the list of funders. 2024-04-23 11:09:18 +02:00
Miriam Baglioni 7de114bda0 [WebCrawl] addressing comments from PR 2024-04-22 13:52:50 +02:00
Sandro La Bruzzo 073f320c6a Added module containing all the dependencies, useful for spark deploy on k8. 2024-04-22 11:32:31 +02:00
Miriam Baglioni 776c898c4b [WebCrawl] adding affiliation relations from web information 2024-04-22 11:04:17 +02:00
Claudio Atzori 0656ab2838 code formatting 2024-04-20 08:10:58 +02:00
Claudio Atzori ab7f0855af fixed query reading projects from the aggregator DB 2024-04-20 08:10:32 +02:00
Claudio Atzori e5879b68c7 [transformative agreement] including reuslt-funder relations to the information imported from the TRs 2024-04-19 17:14:18 +02:00
Claudio Atzori 3a027e97a7 [graph indexing] sets spark memoryOverhead in the join operations to the same value used for the memory executor 2024-04-19 16:59:58 +02:00
Sandro La Bruzzo b72c3139e2 updated Ignore annotation that is deprecated to Disabled 2024-04-19 14:52:40 +02:00
Sandro La Bruzzo b84ad0c06e merged beta 2024-04-19 14:39:59 +02:00
Antonis Lempesis b52a5a753b Merge remote-tracking branch 'upstream/beta' into beta 2024-04-19 15:28:28 +03:00
Sandro La Bruzzo 8dd9cf84e2 code formatted 2024-04-19 12:30:59 +02:00
Sandro La Bruzzo 342cb6189b fixed problem on changed signature on RowEncoder
removed property dhp.schema.artifact
2024-04-19 12:13:26 +02:00
Antonis Lempesis c3fe9662b2 all indicator tables are now stored as parquet 2024-04-19 12:45:36 +03:00
Claudio Atzori 57c678d904 integrating changes from PR#424 2024-04-18 11:38:35 +02:00
Claudio Atzori 5ab8cd1794 Various fixes for the stats DB update workflow, step16-createIndicatorsTables.sql 2024-04-18 11:28:18 +02:00
Antonis Lempesis 0c71c58df6 fixed the definition of gold_oa 2024-04-18 12:01:27 +03:00
Antonis Lempesis 43d05dbebb fixed the definition of result_country 2024-04-18 11:53:50 +03:00
Antonis Lempesis e728a0897c fixed the definition of indi_pub_bronze_oa 2024-04-18 11:07:55 +03:00
Antonis Lempesis 308ae580a9 slight optimization in indi_pub_gold_oa definition 2024-04-18 10:57:52 +03:00
Antonis Lempesis 27d22bd8f9 slight optimization in indi_pub_gold_oa definition 2024-04-17 23:59:52 +03:00
Antonis Lempesis 1f5aba12fa slight optimization in indi_pub_gold_oa definition 2024-04-17 23:54:23 +03:00
Claudio Atzori ac8747582c Merge branch 'beta' into doidoost_dismiss 2024-04-17 12:01:01 +02:00
Giambattista Bloisi 8ac167e420 Refinements to PR #404: refactoring the Oaf records merge utilities into dhp-common 2024-04-16 17:18:28 +02:00
Miriam Baglioni 0625b9061f removed the funder id : 100011062 Asian Spinal Cord Network, wrongly associated to Ireland 2024-04-16 15:26:53 +02:00
Miriam Baglioni 9eeb9f5d32 mergin with branch beta 2024-04-16 15:24:40 +02:00
Claudio Atzori 589bce3520 Merge pull request '[pBETA] Improvements to copying data from ocean to impala' (#421) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: #421
2024-04-16 14:22:32 +02:00
Sandro La Bruzzo a5ddd8dfbb Added Action set generation for the MAG organization 2024-04-16 13:39:15 +02:00
Giambattista Bloisi da333e9f4d Merge pull request 'Enhance Dedup authors matching with algorithms used for ORCID enhancements (task 9690)' (#419) from dedup_authorsmatch_bytoken into beta
Reviewed-on: #419
2024-04-16 10:24:11 +02:00
Michele Artini 78b9d84e4a test 2024-04-16 09:41:16 +02:00
Giambattista Bloisi 43b454399f - Bug fix in matchOrderedTokenAndAbbreviations algorithms where tokens with same initial character were always considered equal
- AuthorsMatch exploits the new matching strategy used for ORCID enhancements in #PR398: split author names in tokens, order the tokens, then check for matches of ordered full tokens or abbreviations
2024-04-15 18:19:29 +02:00
Lampros Smyrnaios d7da4f814b Minor updates to the copying operation to Impala Cluster:
- Improve logging.
- Code optimization/polishing.
2024-04-12 18:12:06 +03:00
Lampros Smyrnaios 14719dcd62 Miscellaneous updates to the copying operation to Impala Cluster:
- Update the algorithm for creating views that depend on other views.
- Add check for successful execution of the "hadoop distcp" command.
- Add a check for successful copy operation of all entities.
- Upon facing an error in a DB, exit the method, instead of the whole script.
- Improve logging.
- Code polishing.
2024-04-12 15:36:13 +03:00
Sandro La Bruzzo 41a42dde64 code formatted 2024-04-11 17:43:48 +02:00
Sandro La Bruzzo 843dc95340 resolved conflict 2024-04-11 17:38:16 +02:00
Sandro La Bruzzo 1e30454ee0 added vocabulary tu instanceTypeMApping of Mag 2024-04-11 17:32:30 +02:00
Sandro La Bruzzo 2581672c11 updated wf of MAG and crossref to use transaction 2024-04-11 17:27:49 +02:00
Lampros Smyrnaios 22745027c8 Use the "HADOOP_USER_NAME" value from the "workflow-property", in "copyDataToImpalaCluster.sh", in "stats-monitor-updates". 2024-04-11 17:46:33 +03:00
Lampros Smyrnaios abf0b69f29 Upgrade the copying operation to Impala Cluster:
- Use only hive commands in the Ocean Cluster, as the "impala-shell" will be removed from there to free-up resources.
- Hugely improve the performance in every aspect of the copying process: a) speedup file-transferring and DB-deletion, b) eliminate permissions-assignment, "load" operations and "use $db" queries, c) retry only the "create view" statements and only as long as they depend on other non-created views, instead of trying to recreate all tables and views 5 consecutive times.
- Add error-checks for the creation of tables and views.
2024-04-11 17:12:12 +03:00
Sandro La Bruzzo a0642bd190 added instanceTypeMapping field on MAG 2024-04-11 13:10:12 +02:00
Sandro La Bruzzo 98dc042db5 mapping generated for MAG,
missing generation of Organization Action set
2024-04-05 18:12:53 +02:00
Sandro La Bruzzo ef582948a7 Updated mapping 2024-04-05 11:10:44 +02:00
Sandro La Bruzzo 5142f462b5 completed mapping from paper to OAF, not tested 2024-04-04 21:06:04 +02:00
Miriam Baglioni 0794e0667b Merge branch 'doidoost_dismiss' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doidoost_dismiss 2024-04-04 09:16:18 +02:00
Miriam Baglioni 4b1de076ac [DataciteHostedByMap] added entry for EBRAINS 2024-04-04 09:16:14 +02:00
Miriam Baglioni c8a88b2187 [DataciteHostedByMap] added entry for EBRAINS 2024-04-04 09:14:58 +02:00
Sandro La Bruzzo 31e152d2bb Merge remote-tracking branch 'origin/doidoost_dismiss' into doidoost_dismiss 2024-04-03 17:08:35 +02:00
Sandro La Bruzzo 6f3e925cae Implemented first part of the new MAG mapping 2024-04-03 17:07:14 +02:00
Miriam Baglioni f0f6abf892 [MapToFunderLink]added references for HFRI and Erasmus+ for the creation of links for funders 2024-04-03 14:59:09 +02:00
Claudio Atzori 26b97aa5ed Merge pull request '[BETA] fixed the result_country definition and updated the stats DB copy procedure' (#416) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: #416
2024-04-03 12:36:03 +02:00
Lampros Smyrnaios b7c8acc563 - Update the code which acquires the "IMPALA_HDFS_NODE", to test the "tmp"-dir, instead of the base-dir and introduce retries, to overcome potential file-system failures. This change was suggested by "Sebastian Tymkow" and "Grzegorz Bakalarski".
- Fix typos.
2024-04-03 13:15:37 +03:00
Miriam Baglioni 50fbebf186 [NOAMI] removed entry for Health and Social Care Board from the list of funders. Modified IRC putting 1596 and 1597 as synonyms, as required in ticket 9635 2024-04-03 11:45:40 +02:00
Michele Artini 71d6e02886 Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into beta 2024-04-03 09:50:41 +02:00
Michele Artini 02c9a311c8 base datainfo with trust=0.89 2024-04-03 09:50:21 +02:00
Miriam Baglioni 42846d3b91 [OpenCitation] add compression option when writing the sequence file 2024-04-03 09:25:00 +02:00
Miriam Baglioni 4f0a044245 Merge pull request 'Add action set creation for Datacite affiliations' (#413) from 9647_datacite_affiliations into beta
Reviewed-on: #413
2024-04-02 17:33:38 +02:00
Serafeim Chatzopoulos cbe13a5c61 Fix datacite input path in properties file 2024-04-02 18:00:35 +03:00
Miriam Baglioni 9c9a9562ae [UsageCount] fixed error 2024-04-02 16:56:37 +02:00
Miriam Baglioni b42bdd5fb3 [UsageCount] add check in case the datasource is not matched against those present in the graph 2024-04-02 16:28:27 +02:00
Miriam Baglioni 64cbd8abe9 Merge pull request '[UsageCount] Usage count per result split by datasource' (#318) from UsageStatsRecordDS into beta
Reviewed-on: #318
2024-04-02 10:21:39 +02:00
Antonis Lempesis df6e3bda04 added new orgs in monitor 2024-04-01 22:45:29 +03:00
Antonis Lempesis 573b081f1d added new orgs in monitor 2024-04-01 22:24:46 +03:00
Serafeim Chatzopoulos 0eb0701b26 Add action set creation for Datacite affiliations 2024-04-01 17:23:26 +03:00
Antonis Lempesis 0bf2a7a359 fixed the result_country definition 2024-04-01 15:23:22 +03:00
Claudio Atzori 24227ab598 Merge pull request '[BETA] fixed typo in indicator query' (#411) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: #411
2024-03-27 13:56:43 +01:00
Antonis Lempesis 9ff44eed96 fixed typo in indicator query
added more institutions
2024-03-27 14:39:01 +02:00
Claudio Atzori cff6040424 Merge pull request '[BETA] added missing EOS, Generate tables with parquet-files, instead of csv in the contexts.sh script' (#409) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: #409
2024-03-27 12:04:04 +01:00
Antonis Lempesis 1fee4124e0 added missing EOS 2024-03-27 12:58:25 +02:00
Sandro La Bruzzo 73a67c0e4a Improved Crossref mapping to include also unpaywall tested 2024-03-26 17:26:47 +01:00
Claudio Atzori 75551ad4ec code formatting 2024-03-26 14:53:16 +01:00
Miriam Baglioni 94b931f7bd [BulkTagging - tag datasource and projects]merging with branch beta 2024-03-26 14:25:19 +01:00
Miriam Baglioni 3b209261f2 [BulkTagging - tag datasource and projects]merging with branch beta 2024-03-26 14:21:27 +01:00
Lampros Smyrnaios 036ba03fcd Generate tables with parquet-files, instead of csv, in "dhp-stats-update/.../contexts.sh" script. 2024-03-26 13:29:04 +02:00
Claudio Atzori 730eaffc85 Merge pull request 'correctly selecting the active hdfs node for the impala cluster' (#405) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: #405
2024-03-26 12:07:46 +01:00
Lampros Smyrnaios bc8c97182d Automatically select the ACTIVE HDFS NODE for Impala cluster, in all "copyDataToImpalaCluster.sh" scripts. 2024-03-26 13:01:12 +02:00
Lampros Smyrnaios 92cc27e7eb Use the ACTIVE HDFS NODE for Impala cluster, in "copyDataToImpalaCluster.sh" script. 2024-03-26 12:34:11 +02:00
Claudio Atzori ef52128c55 included new stats* workflows in parent pom list of modules, code formatting 2024-03-26 10:42:10 +01:00
Claudio Atzori bfba71a95c further follow up changes from integrating the mergeutils branch 2024-03-26 09:01:18 +01:00
Claudio Atzori d72e7b7487 Merge pull request 'Changes to indicators and funders definition' (#372) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: #372
2024-03-26 08:46:20 +01:00
Sandro La Bruzzo ece56f0178 update crossref mapping to be transformed together with UnpayWall 2024-03-25 18:18:10 +01:00
Claudio Atzori 538b180fe0 Merge branch 'beta' into oaf_country_beta 2024-03-25 16:13:20 +01:00
Claudio Atzori 82fc609c4f Merge branch 'beta' into index_records 2024-03-25 16:12:49 +01:00
Claudio Atzori 74e5d05577 Merge branch 'beta' into ocnew 2024-03-25 16:10:31 +01:00
Claudio Atzori 6c3b692f60 integrated minor change from beta branch 2024-03-25 16:10:23 +01:00