Commit Graph

4396 Commits

Author SHA1 Message Date
Lampros Smyrnaios d7da4f814b Minor updates to the copying operation to Impala Cluster:
- Improve logging.
- Code optimization/polishing.
2024-04-12 18:12:06 +03:00
Lampros Smyrnaios 14719dcd62 Miscellaneous updates to the copying operation to Impala Cluster:
- Update the algorithm for creating views that depend on other views.
- Add check for successful execution of the "hadoop distcp" command.
- Add a check for successful copy operation of all entities.
- Upon facing an error in a DB, exit the method, instead of the whole script.
- Improve logging.
- Code polishing.
2024-04-12 15:36:13 +03:00
Sandro La Bruzzo 41a42dde64 code formatted 2024-04-11 17:43:48 +02:00
Sandro La Bruzzo 843dc95340 resolved conflict 2024-04-11 17:38:16 +02:00
Sandro La Bruzzo 1e30454ee0 added vocabulary to instanceTypeMapping of MAG 2024-04-11 17:32:30 +02:00
Sandro La Bruzzo 2581672c11 updated wf of MAG and crossref to use transaction 2024-04-11 17:27:49 +02:00
Lampros Smyrnaios 22745027c8 Use the "HADOOP_USER_NAME" value from the "workflow-property", in "copyDataToImpalaCluster.sh", in "stats-monitor-updates". 2024-04-11 17:46:33 +03:00
Lampros Smyrnaios abf0b69f29 Upgrade the copying operation to Impala Cluster:
- Use only hive commands in the Ocean Cluster, as the "impala-shell" will be removed from there to free-up resources.
- Hugely improve the performance in every aspect of the copying process: a) speedup file-transferring and DB-deletion, b) eliminate permissions-assignment, "load" operations and "use $db" queries, c) retry only the "create view" statements and only as long as they depend on other non-created views, instead of trying to recreate all tables and views 5 consecutive times.
- Add error-checks for the creation of tables and views.
2024-04-11 17:12:12 +03:00
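The retry strategy described in the commit above (retry only the "create view" statements, and only while they depend on views that have not been created yet) can be sketched as follows. This is a minimal illustration in Python, not the logic of the actual shell script: the function `create_views_in_passes`, the `try_create` callback and the simulated backend are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List


def create_views_in_passes(
    view_statements: Dict[str, str],
    try_create: Callable[[str, str], bool],
) -> List[str]:
    """Create views in repeated passes, retrying only the ones that failed.

    Returns the names of the views that could not be created.
    """
    pending = dict(view_statements)
    while pending:
        # Attempt every still-pending view once in this pass.
        created_now = [name for name, sql in pending.items() if try_create(name, sql)]
        if not created_now:
            # No progress in a full pass: the remaining failures are not
            # caused by creation order, so retrying again would not help.
            break
        for name in created_now:
            del pending[name]
    return list(pending)


# Simulated backend (an assumption for the demo): a view can be created
# only once every view it reads from already exists.
deps = {"v_c": ["v_b"], "v_b": ["v_a"], "v_a": [], "v_d": ["missing_view"]}
existing = set()


def fake_try_create(name: str, sql: str) -> bool:
    if all(d in existing for d in deps[name]):
        existing.add(name)
        return True
    return False


failed = create_views_in_passes(
    {name: f"CREATE VIEW {name} AS SELECT 1" for name in deps}, fake_try_create
)
print("created:", sorted(existing), "failed:", failed)
```

The loop stops as soon as a pass creates nothing, which avoids a fixed number of blanket retries over all tables and views.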
Sandro La Bruzzo a0642bd190 added instanceTypeMapping field on MAG 2024-04-11 13:10:12 +02:00
Sandro La Bruzzo 98dc042db5 mapping generated for MAG,
missing generation of Organization Action set
2024-04-05 18:12:53 +02:00
Sandro La Bruzzo ef582948a7 Updated mapping 2024-04-05 11:10:44 +02:00
Sandro La Bruzzo 5142f462b5 completed mapping from paper to OAF, not tested 2024-04-04 21:06:04 +02:00
Miriam Baglioni 0794e0667b Merge branch 'doidoost_dismiss' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doidoost_dismiss 2024-04-04 09:16:18 +02:00
Miriam Baglioni 4b1de076ac [DataciteHostedByMap] added entry for EBRAINS 2024-04-04 09:16:14 +02:00
Miriam Baglioni c8a88b2187 [DataciteHostedByMap] added entry for EBRAINS 2024-04-04 09:14:58 +02:00
Sandro La Bruzzo 31e152d2bb Merge remote-tracking branch 'origin/doidoost_dismiss' into doidoost_dismiss 2024-04-03 17:08:35 +02:00
Sandro La Bruzzo 6f3e925cae Implemented first part of the new MAG mapping 2024-04-03 17:07:14 +02:00
Miriam Baglioni f0f6abf892 [MapToFunderLink]added references for HFRI and Erasmus+ for the creation of links for funders 2024-04-03 14:59:09 +02:00
Claudio Atzori 26b97aa5ed Merge pull request '[BETA] fixed the result_country definition and updated the stats DB copy procedure' (#416) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: #416
2024-04-03 12:36:03 +02:00
Lampros Smyrnaios b7c8acc563 - Update the code which acquires the "IMPALA_HDFS_NODE", to test the "tmp"-dir, instead of the base-dir and introduce retries, to overcome potential file-system failures. This change was suggested by "Sebastian Tymkow" and "Grzegorz Bakalarski".
- Fix typos.
2024-04-03 13:15:37 +03:00
Miriam Baglioni 50fbebf186 [NOAMI] removed entry for Health and Social Care Board from the list of funders. Modified IRC putting 1596 and 1597 as synonyms, as required in ticket 9635 2024-04-03 11:45:40 +02:00
Michele Artini 71d6e02886 Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into beta 2024-04-03 09:50:41 +02:00
Michele Artini 02c9a311c8 base datainfo with trust=0.89 2024-04-03 09:50:21 +02:00
Miriam Baglioni 42846d3b91 [OpenCitation] add compression option when writing the sequence file 2024-04-03 09:25:00 +02:00
Miriam Baglioni 4f0a044245 Merge pull request 'Add action set creation for Datacite affiliations' (#413) from 9647_datacite_affiliations into beta
Reviewed-on: #413
2024-04-02 17:33:38 +02:00
Serafeim Chatzopoulos cbe13a5c61 Fix datacite input path in properties file 2024-04-02 18:00:35 +03:00
Miriam Baglioni 9c9a9562ae [UsageCount] fixed error 2024-04-02 16:56:37 +02:00
Miriam Baglioni b42bdd5fb3 [UsageCount] add check in case the datasource is not matched against those present in the graph 2024-04-02 16:28:27 +02:00
Miriam Baglioni 64cbd8abe9 Merge pull request '[UsageCount] Usage count per result split by datasource' (#318) from UsageStatsRecordDS into beta
Reviewed-on: #318
2024-04-02 10:21:39 +02:00
Antonis Lempesis df6e3bda04 added new orgs in monitor 2024-04-01 22:45:29 +03:00
Antonis Lempesis 573b081f1d added new orgs in monitor 2024-04-01 22:24:46 +03:00
Serafeim Chatzopoulos 0eb0701b26 Add action set creation for Datacite affiliations 2024-04-01 17:23:26 +03:00
Antonis Lempesis 0bf2a7a359 fixed the result_country definition 2024-04-01 15:23:22 +03:00
Claudio Atzori 24227ab598 Merge pull request '[BETA] fixed typo in indicator query' (#411) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: #411
2024-03-27 13:56:43 +01:00
Antonis Lempesis 9ff44eed96 fixed typo in indicator query
added more institutions
2024-03-27 14:39:01 +02:00
Claudio Atzori cff6040424 Merge pull request '[BETA] added missing EOS, Generate tables with parquet-files, instead of csv in the contexts.sh script' (#409) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: #409
2024-03-27 12:04:04 +01:00
Antonis Lempesis 1fee4124e0 added missing EOS 2024-03-27 12:58:25 +02:00
Sandro La Bruzzo 73a67c0e4a Improved Crossref mapping to include also unpaywall tested 2024-03-26 17:26:47 +01:00
Claudio Atzori 75551ad4ec code formatting 2024-03-26 14:53:16 +01:00
Miriam Baglioni 94b931f7bd [BulkTagging - tag datasource and projects]merging with branch beta 2024-03-26 14:25:19 +01:00
Miriam Baglioni 3b209261f2 [BulkTagging - tag datasource and projects]merging with branch beta 2024-03-26 14:21:27 +01:00
Lampros Smyrnaios 036ba03fcd Generate tables with parquet-files, instead of csv, in "dhp-stats-update/.../contexts.sh" script. 2024-03-26 13:29:04 +02:00
Claudio Atzori 730eaffc85 Merge pull request 'correctly selecting the active hdfs node for the impala cluster' (#405) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: #405
2024-03-26 12:07:46 +01:00
Lampros Smyrnaios bc8c97182d Automatically select the ACTIVE HDFS NODE for Impala cluster, in all "copyDataToImpalaCluster.sh" scripts. 2024-03-26 13:01:12 +02:00
Lampros Smyrnaios 92cc27e7eb Use the ACTIVE HDFS NODE for Impala cluster, in "copyDataToImpalaCluster.sh" script. 2024-03-26 12:34:11 +02:00
Claudio Atzori ef52128c55 included new stats* workflows in parent pom list of modules, code formatting 2024-03-26 10:42:10 +01:00
Claudio Atzori bfba71a95c further follow up changes from integrating the mergeutils branch 2024-03-26 09:01:18 +01:00
Claudio Atzori d72e7b7487 Merge pull request 'Changes to indicators and funders definition' (#372) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: #372
2024-03-26 08:46:20 +01:00
Sandro La Bruzzo ece56f0178 update crossref mapping to be transformed together with UnpayWall 2024-03-25 18:18:10 +01:00
Claudio Atzori 538b180fe0 Merge branch 'beta' into oaf_country_beta 2024-03-25 16:13:20 +01:00
Claudio Atzori 82fc609c4f Merge branch 'beta' into index_records 2024-03-25 16:12:49 +01:00
Claudio Atzori 74e5d05577 Merge branch 'beta' into ocnew 2024-03-25 16:10:31 +01:00
Claudio Atzori 6c3b692f60 integrated minor change from beta branch 2024-03-25 16:10:23 +01:00
Claudio Atzori 9a5b134ddf Merge branch 'beta' into FOSNew 2024-03-25 16:07:37 +01:00
Claudio Atzori 71c1f81b54 Merge branch 'beta' into exception_on_invalid_transofmation_rule 2024-03-25 16:05:11 +01:00
Claudio Atzori 91b61687fa Merge branch 'beta' into bulkTaggingPathMapExtention 2024-03-25 15:50:18 +01:00
Claudio Atzori 54936b7f42 Merge branch 'beta' into transformativeagreement 2024-03-25 15:42:22 +01:00
Michele Artini e1149eb5c4 xslt rules and tests 2024-03-25 15:01:42 +01:00
Michele Artini 3f174ad90f Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into beta 2024-03-25 12:16:02 +01:00
Michele Artini 6ffb1faf09 fixed a problem with multiple nodes 2024-03-25 12:15:51 +01:00
Giambattista Bloisi 3f22c101d9 Merge pull request 'Enrich authors with ORCID info using new matching algorithm' (#398) from new_orcid_enhancement into beta
Reviewed-on: #398
2024-03-22 17:29:20 +01:00
Giambattista Bloisi 0ff7faad72 Fix conditions that prevented ORCID Enrichment 2024-03-22 16:24:49 +01:00
Michele Artini 7faa115ba0 Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into beta 2024-03-22 11:08:59 +01:00
Michele Artini f9c74c98fa fixed an identifier xpath 2024-03-22 11:08:45 +01:00
Antonis Lempesis 4c40c96e30 code cleanup 2024-03-22 10:16:49 +02:00
Antonis Lempesis 459167ac2f Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta 2024-03-21 12:44:58 +02:00
Antonis Lempesis 07f634a46d code cleanup 2024-03-21 12:44:30 +02:00
Antonis Lempesis 9521625a07 code cleanup 2024-03-21 11:45:08 +02:00
Sandro La Bruzzo 58dbe71d39 update crossref mapping to be runnable separately as a single datasource outside doiboost 2024-03-20 17:04:52 +01:00
Antonis Lempesis 67a5aa0a38 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta 2024-03-19 11:24:54 +02:00
dimitrispie a3a570e9a0 Commit monitor-updates-wf 2024-03-19 09:42:21 +02:00
Giambattista Bloisi 664a381d31 Unify merge logic of entities in MergeUtils.class 2024-03-18 16:04:49 +01:00
Michele Artini cb29b9773c xslt rules 2024-03-18 15:31:34 +01:00
Michele Artini 85b844d57e updated BASE filter param 2024-03-15 15:03:27 +01:00
Michele Artini 455f2e1e07 apply commits from master 2024-03-15 14:56:39 +01:00
Michele Artini 30167aa882 mapped oaf:country from results 2024-03-15 11:24:16 +01:00
Michele Artini 88fef367b9 new plugin to collect from a dump of BASE 2024-03-15 10:47:52 +01:00
Claudio Atzori 078169b922 cleanup 2024-03-15 09:56:04 +01:00
Claudio Atzori af154d4456 implemented changes from #9497: sort abstracts by string length, included author fullnames in the related results, expanded instance details within each children/result XML element 2024-03-14 16:21:23 +01:00
Claudio Atzori 7863c92466 expanded paper abstract in the result/children XML element (ticket #9497) 2024-03-13 16:25:31 +01:00
Claudio Atzori eb5887cb9a including related organization url in the XML record serialization (ticket #9498) 2024-03-13 14:46:00 +01:00
Sandro La Bruzzo 5281f010a5 applied cherry pick 2024-03-13 09:59:20 +01:00
Sandro La Bruzzo ee1fcb672b code refactor 2024-03-13 09:46:31 +01:00
Miriam Baglioni 5a32bb9578 [OC New] last fix 2024-03-13 09:36:18 +01:00
Sandro La Bruzzo c532831718 Moved Crossref Mapping on dhp-aggregations,
refactored code; avoided using the utility defined in DOIBoostMappingUtils to create parts of the OAF, using the utility in OafMappingUtils instead
2024-03-13 06:56:10 +01:00
Miriam Baglioni 48c052215c [OC New] last fix 2024-03-12 23:12:32 +01:00
Claudio Atzori db66555ebb WIP: updated provision workflow to create a JSON based representation of the payload 2024-03-12 09:56:09 +01:00
Antonis Lempesis f74c7e8689 selecting distinct peer_reviewed 2024-03-12 02:13:04 +02:00
Giambattista Bloisi 9092075760 Enrich authors with ORCID info using new matching algorithm 2024-03-11 13:23:59 +01:00
Sandro La Bruzzo cbd4e5e4bb update mag mapping 2024-03-08 16:31:40 +01:00
Claudio Atzori d4871b31e8 WIP: extended provision workflow to create the JSON based payload 2024-03-08 11:43:20 +01:00
Antonis Lempesis 3c79720342 fixed the irish result subset 2024-03-07 14:08:57 +02:00
Antonis Lempesis 5ae4b4286c Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta 2024-03-07 12:15:19 +02:00
Miriam Baglioni 5180b6ec8a [FOSNEW] removed test class 2024-03-07 10:47:13 +01:00
Miriam Baglioni 7827a2d66b [OCNEW] added creation of the actionset for the results classified with FoS based on the OpenAIRE identifier 2024-03-07 10:36:30 +01:00
Antonis Lempesis 316d585c8a using distinct apcs per publication to avoid huge sums 2024-03-07 02:07:59 +02:00
Miriam Baglioni fd34372c40 [OCNEW] first implementation 2024-03-06 13:42:00 +01:00
Sandro La Bruzzo d34cef3f8d Merge remote-tracking branch 'origin/beta' into doidoost_dismiss 2024-03-05 11:45:31 +01:00
Sandro La Bruzzo 3b837d38ce added oozie workflow 2024-03-05 11:44:59 +01:00
Sandro La Bruzzo f417515e43 Implemented class that generates a normalized table of MAG, which is the starting point for the creation of the mag source 2024-03-04 17:15:13 +01:00
Sandro La Bruzzo ad0e9aa80c added first part of refactoring of the code generating MAG,
make it more readable using spark sql queries
2024-02-29 18:16:15 +01:00
Sandro La Bruzzo 9d94648f3b code formatted 2024-02-29 18:15:20 +01:00
Giambattista Bloisi 3cd5590f3b When converting json to XML, remove characters that are not allowed in the XML 1.0 specs, as they will cause xpath failures even if escaped 2024-02-28 15:14:18 +01:00
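As an illustration of the cleanup described in the commit above, the sketch below removes every character outside the `Char` production of the XML 1.0 specification. It is written in Python for brevity and is not the project's code; the function name `strip_invalid_xml10_chars` is hypothetical.

```python
import re

# Characters allowed by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The negated class below matches everything else.
_INVALID_XML10 = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml10_chars(text: str) -> str:
    """Return text with every character that XML 1.0 forbids removed."""
    return _INVALID_XML10.sub("", text)


cleaned = strip_invalid_xml10_chars("title\x00 with\x0b control\tchars")
print(repr(cleaned))
```

Removing these characters, rather than escaping them, matters because a character reference such as `&#0;` is still not well-formed in XML 1.0.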
Giambattista Bloisi 56dd05f85c Merge pull request 'Revised procedure when converting json data into xml' (#395) from restiterator_xmlcleanup into beta
Reviewed-on: #395
2024-02-28 10:38:54 +01:00
Claudio Atzori 6fcf872daa Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into index_records 2024-02-28 10:27:28 +01:00
Claudio Atzori 3f07390a58 WIP 2024-02-28 10:10:10 +01:00
Sandro La Bruzzo 7d806a434c formatted code 2024-02-28 09:31:58 +01:00
Sandro La Bruzzo b63994dcc4 Merge remote-tracking branch 'origin/beta' into orcid_update 2024-02-28 09:11:18 +01:00
Sandro La Bruzzo 915a76a796 following the comment on the pull requests:
- Added #NUM_OF_THREADS complete job in the queue at the end of the main loop to avoid deadlock
2024-02-28 09:10:55 +01:00
Giambattista Bloisi 773e856550 Revised procedure when converting json data into xml:
- json object keys are renamed to be conformant to xml tag elements, special characters are substituted or removed
- json string values are no longer post-processed as they are already escaped by the org.json.XML.toString method
2024-02-24 16:54:30 +01:00
Sandro La Bruzzo a712df1e1d Merge remote-tracking branch 'origin/beta' into orcid_update 2024-02-23 10:12:25 +01:00
Sandro La Bruzzo b32a9d1994 Implemented workflow for updating table, added step to check if the newly generated table is valid 2024-02-23 10:04:28 +01:00
Michele Artini 3268570b2c mapping of project PIDs 2024-02-22 14:47:21 +01:00
Miriam Baglioni 72bae7af76 [Transformative Agreement] removed the relations from the ActionSet waiting to have the green light from Ioanna 2024-02-19 16:20:12 +01:00
Miriam Baglioni 43da7e1191 [Tagging Projects and Datasource] changed the way the pathMap parameter is passed. It was too long and was truncated 2024-02-19 16:12:59 +01:00
Serafeim Chatzopoulos f0dc12634b Add Action Set creation for affiliations inferred from the OpenAPC data 2024-02-18 18:02:09 +02:00
Claudio Atzori a63b091bae Merge branch 'beta' into import_orps_fix 2024-02-15 15:01:56 +01:00
Miriam Baglioni 8dae10b442 - 2024-02-14 14:57:08 +01:00
Miriam Baglioni 83bb97be83 [Tagging Projects and Datasource] added test to check datasource tagging. Fixed issue 2024-02-14 11:23:47 +01:00
Miriam Baglioni 6e1f383e4a [Tagging Projects and Datasource] first extension of bulktagging to add the context to projects and datasource 2024-02-13 16:37:14 +01:00
Miriam Baglioni 3f7d262a4e merging with branch beta 2024-02-13 14:05:58 +01:00
Miriam Baglioni eca021f4d6 [Transformative Agreement] add results with information about the agreement and the country of the organization that paid for it 2024-02-13 12:21:07 +01:00
Miriam Baglioni bdb6bbb365 merging with branch beta 2024-02-12 15:50:43 +01:00
Claudio Atzori d85d2df6ad [graph raw] fixed mapping of the original resource type from the Datacite format 2024-02-09 10:20:20 +01:00
Giambattista Bloisi b19643f6eb Dedup aliases, created when a dedup in a previous build has been merged in a new dedup, need to be marked as "deletedbyinference", since they are "merged" in the new dedup 2024-02-08 15:34:59 +01:00
Antonis Lempesis dd4c27f4f3 added 2 new institutions in monitor 2024-02-08 12:57:57 +02:00
Claudio Atzori 38c9001147 fixed import of ORPs stored on HDFS in the internal graph format (e.g. Datacite) 2024-02-07 17:02:05 +01:00
Claudio Atzori fd17c1f17c [actionsets] fixed join type 2024-02-05 16:55:36 +02:00
Claudio Atzori 009dcf6aea [actionsets] introduced support for the PromoteAction strategy 2024-02-05 16:43:40 +02:00
Claudio Atzori 42f5506306 [orcid enrichment] fixed directory cleanup before distcp 2024-02-05 09:45:36 +02:00
Alessia Bardi f2a08d8cc2 test for Italian records from IRS repositories 2024-01-30 19:20:14 +01:00
Antonis Lempesis a512ead447 changed orcid ids to all capital 2024-01-30 16:54:47 +02:00
Miriam Baglioni 07a373a0bd [bulkTagging] removing checks while performing the substring action so that it will fire an Exception if the parameters are wrongly set 2024-01-30 13:51:11 +01:00
Miriam Baglioni ead08b0dd4 merging with branch beta 2024-01-30 12:19:10 +01:00
Antonis Lempesis bb10a22290 merged changes from dnet-hadoop 2024-01-29 21:51:47 +02:00
Miriam Baglioni a5995ab557 [orcid-enrichment] change the value of parameters. 2024-01-29 18:19:48 +01:00
Miriam Baglioni a418dacb47 [UsageCount] code extension to include also the name of the datasource 2024-01-29 18:12:33 +01:00
Miriam Baglioni e9131f4e4a merging with branch beta 2024-01-29 16:27:18 +01:00
Sandro La Bruzzo 9aebca77a0 Added exception throwing in Hadoop transformation when TR is not syntactically valid 2024-01-29 14:41:02 +01:00
Claudio Atzori 926903b06b Merge branch 'beta' into stats_with_spark_sql 2024-01-29 09:11:45 +01:00
Giambattista Bloisi 078df0b4d1 Use SparkSQL in place of Hive for executing step16-createIndicatorsTables.sql of stats update wf 2024-01-26 21:56:55 +01:00
Claudio Atzori ce3200263e Merge branch 'beta' into crossref_missing_author_fix 2024-01-26 15:57:04 +01:00
Sandro La Bruzzo e889808daa Fixed problem on missing author in crossref Mapping 2024-01-26 12:19:04 +01:00
Antonis Lempesis c548796463 Changed step16-createIndicatorsTables to use a spark oozie action instead of hive 2024-01-26 02:04:48 +02:00
Sandro La Bruzzo 0386f36385 Added workflow to update ORCID and replaced some parsing, because the update works and employments xml differs from the dump one. 2024-01-25 19:40:59 +01:00
Antonis Lempesis a7115cfa9e max mem of joins (hive.mapjoin.followby.gby.localtask.max.memory.usage) now 80%, up from 55%. 2024-01-25 15:13:16 +01:00
Antonis Lempesis fd43b0e84a max mem of joins (hive.mapjoin.followby.gby.localtask.max.memory.usage) now 80%, up from 55%. 2024-01-25 15:06:34 +01:00
Claudio Atzori 9b13c22e5d [graph provision] retrieve all the context information by adding all=true to the requests issued to the API 2024-01-23 15:36:08 +01:00
Sandro La Bruzzo 43e0bba7ed logging added during download 2024-01-23 15:04:49 +01:00
Miriam Baglioni f7d06dc661 compilation after merging 2024-01-23 11:43:08 +01:00
Miriam Baglioni 6e58d79623 merging with branch beta 2024-01-23 11:36:47 +01:00
Miriam Baglioni e0ec800d7e [BulkTagging] extend the definition of the pathMap to include also actions that should be performed on the value extracted from the result before applying the constraint 2024-01-23 11:34:53 +01:00
Claudio Atzori f87f3a6483 [graph provision] updated param specification for the XML converter job 2024-01-23 08:54:37 +01:00
Claudio Atzori 6fd25cf549 code formatting 2024-01-23 08:47:12 +01:00
Claudio Atzori f76852f385 Merge branch 'beta' into update_pivots_table 2024-01-22 16:37:22 +01:00
Claudio Atzori 1c6db320f4 [graph provision] obtain context info from the context API instead from the ISLookUp service 2024-01-22 15:53:17 +01:00
Claudio Atzori 2655eea5bc [orcid enrichment] drop paths before copying the non-modified contents 2024-01-19 16:28:05 +01:00
Claudio Atzori c6b3401596 increased shuffle partitions for publications in the country propagation workflow 2024-01-19 10:15:39 +01:00
Miriam Baglioni bcc0a13981 [enrichment single step] adding <end> element in wf definition 2024-01-18 17:39:14 +01:00
Miriam Baglioni 6af536541d [enrichment single step] moving parameter file in correct location 2024-01-18 15:35:40 +01:00
Miriam Baglioni a12a3eb143 - 2024-01-18 15:18:10 +01:00
Miriam Baglioni 82e9e262ee [enrichment single step] remove parameter from execution 2024-01-17 17:38:03 +01:00
Miriam Baglioni 67ce2d54be [enrichment single step] refactoring to fix issues in disappeared result type 2024-01-17 16:50:00 +01:00
Miriam Baglioni 59eaccbd87 [enrichment single step] refactoring to fix issue in disappeared result type 2024-01-15 17:49:54 +01:00
Giambattista Bloisi 21a14fcd80 Reusable RunSQLSparkJob for executing SQL in Spark through Oozie Spark Actions
Implements pivots table update oozie workflow
2024-01-15 10:18:14 +01:00
Sandro La Bruzzo e0753f19da Fixed error of connection timeout 2024-01-13 09:27:08 +01:00
sandro.labruzzo e328bc0ade fixed missing parameter on download update 2024-01-12 16:18:20 +01:00
Miriam Baglioni f612125939 fix issue on FoS integration. Removing the null values from FoS 2024-01-12 10:20:28 +01:00
Claudio Atzori cb9e739484 Merge branch 'beta' into resource_types 2024-01-11 16:29:41 +01:00
Claudio Atzori 2753044d13 refined mapping for the extraction of the original resource type 2024-01-11 16:28:26 +01:00
Giambattista Bloisi 3c66e3bd7b Create dedup record for "merged" pivots
Do not create dedup records for groups that have more than 20 different acceptance dates
2024-01-10 22:59:52 +01:00
Giambattista Bloisi 10e135db1e Use dedup_wf_002 in place of dedup_wf_001 to make explicit that a different algorithm has been used to generate those kinds of ids 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 831cc1fdde Generate "merged" dedup id relations also for records that are filtered out by the cut parameters 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 1287315ffb No longer use dedupId information from the pivotHistory database 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 02636e802c SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
2024-01-10 22:59:52 +01:00
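One item in the commit above, using incremental connected components to decrease comparisons within a block, can be illustrated with a small union-find sketch. This is a Python approximation under stated assumptions, not the BlockProcessor code: `UnionFind`, `cluster_block` and the `is_similar` callback are hypothetical names, and the similarity test used in the demo is a toy.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Set, Tuple


class UnionFind:
    """Minimal disjoint-set structure with path halving."""

    def __init__(self) -> None:
        self.parent: Dict[str, str] = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: str, b: str) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def cluster_block(
    records: Sequence[str], is_similar: Callable[[str, str], bool]
) -> Tuple[List[Set[str]], int]:
    """Group records of one block, skipping pairs that are already connected.

    Returns the groups and the number of similarity comparisons performed.
    """
    uf = UnionFind()
    comparisons = 0
    for a, b in combinations(records, 2):
        if uf.find(a) == uf.find(b):
            continue  # same component already: the costly comparison is skipped
        comparisons += 1
        if is_similar(a, b):
            uf.union(a, b)
    groups: Dict[str, Set[str]] = {}
    for r in records:
        groups.setdefault(uf.find(r), set()).add(r)
    return list(groups.values()), comparisons


# Toy similarity (an assumption for the demo): same first letter.
groups, comparisons = cluster_block(
    ["a1", "a2", "a3", "b1"], lambda x, y: x[0] == y[0]
)
print(groups, comparisons)
```

Because the components are maintained while the block is scanned, a pair whose members are already linked through a third record never reaches the similarity function.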
Antonis Lempesis e024718f73 creating result_instances even when no pids exist for the instance 2024-01-10 22:25:50 +01:00
Sandro La Bruzzo 859babf722 added some useful comment 2024-01-10 19:51:13 +01:00
Sandro La Bruzzo 39ebb60b38 Merge remote-tracking branch 'origin/beta' into orcid_update 2024-01-10 19:50:00 +01:00
Sandro La Bruzzo 9d5a7c3b22 code refactor 2024-01-10 19:42:34 +01:00
Sandro La Bruzzo 8f61063201 Added workflow 2024-01-10 19:42:22 +01:00
Sandro La Bruzzo 1a42a5c10d Implemented Download update of ORCID 2024-01-10 18:03:20 +01:00
Miriam Baglioni e711a05229 fixed conflicts 2024-01-10 11:03:42 +01:00
Miriam Baglioni 71d6f30711 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2024-01-10 10:59:58 +01:00
dimitrispie b920307bdd Changes to indicators 2024-01-09 00:47:09 +02:00
dimitrispie 8b2cbb611e Changes to beta db names 2024-01-09 00:40:56 +02:00
Antonis Lempesis 2e4cab026c fixed the result_country definition 2024-01-08 16:01:26 +02:00
dimitrispie 6b823100ae Update buildIrishMonitorDB.sql
New indicators added
2024-01-07 22:54:39 +02:00
dimitrispie 75bfde043c Historical Snapshots Workflow
Create historical snapshots db with parameters:

hist_db_name=openaire_beta_historical_snapshots_xxx
hist_db_name_prev=openaire_beta_historical_snapshots_xxx (previous run of wf)
stats_db_name=openaire_beta_stats_xxx
stats_irish_db_name=openaire_beta_stats_monitor_ie_xxx
monitor_db_name=openaire_beta_stats_monitor_xxx
monitor_db_prod_name=openaire_beta_stats_monitor
monitor_irish_db_name=openaire_beta_stats_monitor_ie_xxx
monitor_irish_db_prod_name=openaire_beta_stats_monitor_ie
hist_db_prod_name=openaire_beta_historical_snapshots
hist_db_shadow_name=openaire_beta_historical_snapshots_shadow
hist_date=122023
hive_timeout=150000
hadoop_user_name=xxx
resumeFrom=CreateDB
2024-01-04 15:11:04 +02:00
Miriam Baglioni cb14470ba6 added properties file in the folder for the workflow of result to organization from inst repo propagation. Changes the path in the classes implementing the propagation 2023-12-22 14:50:05 +01:00
Miriam Baglioni 9f966b59d4 added properties file in the folder for the workflow of result to community from semrel propagation. Changes the path in the classes implementing the propagation 2023-12-22 14:11:47 +01:00
Miriam Baglioni 2f3b5a133d added properties file in the folder for the workflow of result to community from organization propagation. Changes the path in the classes implementing the propagation 2023-12-22 13:56:40 +01:00
Miriam Baglioni 2f7b9ad815 added properties file in the folder for the workflow of project to result propagation. Changes the path in the classes implementing the propagation 2023-12-22 11:46:15 +01:00
Miriam Baglioni f2352e8a78 changed in the classes the path for the property files for the propagation of community from project 2023-12-22 11:43:34 +01:00
Miriam Baglioni 009730b3d1 added properties file in the folder for the workflow of orcid propagation. Changes the path in the classes implementing the propagation. Changed the path to the parameter file in the class for entitytoorganization propagation 2023-12-22 11:42:09 +01:00
Miriam Baglioni 89f269c7f4 changed the path to the parameter file in the class for entitytoorganization propagation 2023-12-22 11:37:50 +01:00
Miriam Baglioni b06aea0adf adding the bulkTag parameter file in the folder for the oozie workflow for bulkTagging. Changes the path in the class 2023-12-22 11:35:37 +01:00
Miriam Baglioni 3afd4aa57b adjustments for country propagation 2023-12-22 11:27:30 +01:00
dimitrispie ffdd03d2f4 Monitor Irish Stats WF
Parameters (with examples):
stats_db_name=openaire_beta_stats_20231208
monitor_irish_db_name=openaire_beta_stats_monitor_ie_20231208b
monitor_irish_db_prod_name=openaire_beta_stats_monitor_ie
graph_db_name=openaire_beta_20231208
monitor_irish_db_shadow_name=openaire_beta_stats_monitor_ie_shadow
hive_timeout=150000
hadoop_user_name=dnet.beta
resumeFrom=Step1-buildIrishMonitorDB
2023-12-22 11:05:24 +02:00
dimitrispie 40b98d8182 Changes to indicators and funders definition
- Changes result_refereed definition
- Added result_country indicator
- Added indi_pub_green_with_license indicator
- Added country from jurisdiction to funders
2023-12-22 10:29:20 +02:00
Claudio Atzori 62104790ae added metaresourcetype to the result hive DB view 2023-12-21 12:27:10 +01:00