1
0
Fork 0
Commit Graph

4302 Commits

Author SHA1 Message Date
Miriam Baglioni 94b931f7bd [BulkTagging - tag datasource and projects]merging with branch beta 2024-03-26 14:25:19 +01:00
Miriam Baglioni 3b209261f2 [BulkTagging - tag datasource and projects]merging with branch beta 2024-03-26 14:21:27 +01:00
Lampros Smyrnaios 036ba03fcd Generate tables with parquet-files, instead of csv, in "dhp-stats-update/.../contexts.sh" script. 2024-03-26 13:29:04 +02:00
Claudio Atzori 730eaffc85 Merge pull request 'correctly selecting the active hdfs node for the impala cluster' (#405) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#405
2024-03-26 12:07:46 +01:00
Lampros Smyrnaios bc8c97182d Automatically select the ACTIVE HDFS NODE for Impala cluster, in all "copyDataToImpalaCluster.sh" scripts. 2024-03-26 13:01:12 +02:00
Lampros Smyrnaios 92cc27e7eb Use the ACTIVE HDFS NODE for Impala cluster, in "copyDataToImpalaCluster.sh" script. 2024-03-26 12:34:11 +02:00
Claudio Atzori ef52128c55 included new stats* workflows in parent pom list of modules, code formatting 2024-03-26 10:42:10 +01:00
Claudio Atzori bfba71a95c further follow up changes from integrating the mergeutils branch 2024-03-26 09:01:18 +01:00
Claudio Atzori d72e7b7487 Merge pull request 'Changes to indicators and funders definition' (#372) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#372
2024-03-26 08:46:20 +01:00
Sandro La Bruzzo ece56f0178 update crossref mapping to be transformed together with UnpayWall 2024-03-25 18:18:10 +01:00
Claudio Atzori 538b180fe0 Merge branch 'beta' into oaf_country_beta 2024-03-25 16:13:20 +01:00
Claudio Atzori 82fc609c4f Merge branch 'beta' into index_records 2024-03-25 16:12:49 +01:00
Claudio Atzori 74e5d05577 Merge branch 'beta' into ocnew 2024-03-25 16:10:31 +01:00
Claudio Atzori 6c3b692f60 integrated minor change from beta branch 2024-03-25 16:10:23 +01:00
Claudio Atzori 9a5b134ddf Merge branch 'beta' into FOSNew 2024-03-25 16:07:37 +01:00
Claudio Atzori 71c1f81b54 Merge branch 'beta' into exception_on_invalid_transofmation_rule 2024-03-25 16:05:11 +01:00
Claudio Atzori 91b61687fa Merge branch 'beta' into bulkTaggingPathMapExtention 2024-03-25 15:50:18 +01:00
Claudio Atzori 54936b7f42 Merge branch 'beta' into transformativeagreement 2024-03-25 15:42:22 +01:00
Michele Artini e1149eb5c4 xslt rules and tests 2024-03-25 15:01:42 +01:00
Michele Artini 3f174ad90f Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into beta 2024-03-25 12:16:02 +01:00
Michele Artini 6ffb1faf09 fixed a problem with multiple nodes 2024-03-25 12:15:51 +01:00
Giambattista Bloisi 3f22c101d9 Merge pull request 'Enrich authors with ORCID info using new matching algorithm' (#398) from new_orcid_enhancement into beta
Reviewed-on: D-Net/dnet-hadoop#398
2024-03-22 17:29:20 +01:00
Giambattista Bloisi 0ff7faad72 Fix conditions that prevented ORCID Enrichment 2024-03-22 16:24:49 +01:00
Michele Artini 7faa115ba0 Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into beta 2024-03-22 11:08:59 +01:00
Michele Artini f9c74c98fa fixed an identifier xpath 2024-03-22 11:08:45 +01:00
Antonis Lempesis 4c40c96e30 code cleanup 2024-03-22 10:16:49 +02:00
Antonis Lempesis 459167ac2f Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta 2024-03-21 12:44:58 +02:00
Antonis Lempesis 07f634a46d code cleanup 2024-03-21 12:44:30 +02:00
Antonis Lempesis 9521625a07 code cleanup 2024-03-21 11:45:08 +02:00
Sandro La Bruzzo 58dbe71d39 update crossref mapping to be runnable separately as a single datasource outside doiboost 2024-03-20 17:04:52 +01:00
Antonis Lempesis 67a5aa0a38 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta 2024-03-19 11:24:54 +02:00
dimitrispie a3a570e9a0 Commit monitor-updates-wf 2024-03-19 09:42:21 +02:00
Giambattista Bloisi 664a381d31 Unify merge logic of entities in MergeUtils.class 2024-03-18 16:04:49 +01:00
Michele Artini cb29b9773c xslt rules 2024-03-18 15:31:34 +01:00
Michele Artini 85b844d57e updated BASE filter param 2024-03-15 15:03:27 +01:00
Michele Artini 455f2e1e07 apply commits from master 2024-03-15 14:56:39 +01:00
Michele Artini 30167aa882 mapped oaf:country from results 2024-03-15 11:24:16 +01:00
Michele Artini 88fef367b9 new plugin to collect from a dump of BASE 2024-03-15 10:47:52 +01:00
Claudio Atzori 078169b922 cleanup 2024-03-15 09:56:04 +01:00
Claudio Atzori af154d4456 implemented changes from #9497: sort abstracts by string length, included author fullnames in the related results, expanded instance details within each children/result XML element 2024-03-14 16:21:23 +01:00
Claudio Atzori 7863c92466 expanded paper abstract in the result/children XML element (ticket #9497) 2024-03-13 16:25:31 +01:00
Claudio Atzori eb5887cb9a including related organization url in the XML record serialization (ticket #9498) 2024-03-13 14:46:00 +01:00
Sandro La Bruzzo 5281f010a5 applied cherry pick 2024-03-13 09:59:20 +01:00
Sandro La Bruzzo ee1fcb672b code refactor 2024-03-13 09:46:31 +01:00
Miriam Baglioni 5a32bb9578 [OC New] last fix 2024-03-13 09:36:18 +01:00
Sandro La Bruzzo c532831718 Moved Crossref Mapping on dhp-aggregations,
refactored code, avoid to use utility for create part of the oaf defined in DOIBoostMappingUtils, used instead utility in OafMappingUtils
2024-03-13 06:56:10 +01:00
Miriam Baglioni 48c052215c [OC New] last fix 2024-03-12 23:12:32 +01:00
Claudio Atzori db66555ebb WIP: updated provision workflow to create a JSON based representation of the payload 2024-03-12 09:56:09 +01:00
Antonis Lempesis f74c7e8689 selecting distinct peer_reviewed 2024-03-12 02:13:04 +02:00
Giambattista Bloisi 9092075760 Enrich authors with ORCID info using new matching algorithm 2024-03-11 13:23:59 +01:00
Sandro La Bruzzo cbd4e5e4bb update mag mapping 2024-03-08 16:31:40 +01:00
Claudio Atzori d4871b31e8 WIP: extended provision workflow to create the JSON based payload 2024-03-08 11:43:20 +01:00
Antonis Lempesis 3c79720342 fixed the irish result subset 2024-03-07 14:08:57 +02:00
Antonis Lempesis 5ae4b4286c Merge branch 'beta' of https://code-repo.d3science.org/antonis.lempesis/dnet-hadoop into beta 2024-03-07 12:15:19 +02:00
Miriam Baglioni 5180b6ec8a [FOSNEW] removed test class 2024-03-07 10:47:13 +01:00
Miriam Baglioni 7827a2d66b [OCNEW] added creation of the actionset for the results classified with FoS based ont he OpenAIRE identifier 2024-03-07 10:36:30 +01:00
Antonis Lempesis 316d585c8a using distinct apcs per publication to avoid huge sums 2024-03-07 02:07:59 +02:00
Miriam Baglioni fd34372c40 [OCNEW] first implementation 2024-03-06 13:42:00 +01:00
Sandro La Bruzzo d34cef3f8d Merge remote-tracking branch 'origin/beta' into doidoost_dismiss 2024-03-05 11:45:31 +01:00
Sandro La Bruzzo 3b837d38ce added oozie workflow 2024-03-05 11:44:59 +01:00
Sandro La Bruzzo f417515e43 Implemented class that generates a normalized table of MAG, which is the starting point for the creation of the mag source 2024-03-04 17:15:13 +01:00
Sandro La Bruzzo ad0e9aa80c added first part of refactoring of the code generating MAG,
make it more readable using spark sql queries
2024-02-29 18:16:15 +01:00
Sandro La Bruzzo 9d94648f3b code formatted 2024-02-29 18:15:20 +01:00
Giambattista Bloisi 3cd5590f3b When converting json to XML, remove characters that are not allowed in the XML 1.0 specs, as they will cause xpath failures even if escaped 2024-02-28 15:14:18 +01:00
Giambattista Bloisi 56dd05f85c Merge pull request 'Revised procedure when converting json data into xml' (#395) from restiterator_xmlcleanup into beta
Reviewed-on: D-Net/dnet-hadoop#395
2024-02-28 10:38:54 +01:00
Claudio Atzori 6fcf872daa Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into index_records 2024-02-28 10:27:28 +01:00
Claudio Atzori 3f07390a58 WIP 2024-02-28 10:10:10 +01:00
Sandro La Bruzzo 7d806a434c formatted code 2024-02-28 09:31:58 +01:00
Sandro La Bruzzo b63994dcc4 Merge remote-tracking branch 'origin/beta' into orcid_update 2024-02-28 09:11:18 +01:00
Sandro La Bruzzo 915a76a796 following the comment on the pull requests:
- Added #NUM_OF_THREADS complete job in the queue at the end of  the main loop to avoid deadlock
2024-02-28 09:10:55 +01:00
Giambattista Bloisi 773e856550 Revised procedure when converting json data into xml:
- json object keys are renamed to be conformant to xml tag elements, special characters are substituted or removed
- json string values are no longer post-processed as they are already escaped by the org.json.XML.toString method
2024-02-24 16:54:30 +01:00
Sandro La Bruzzo a712df1e1d Merge remote-tracking branch 'origin/beta' into orcid_update 2024-02-23 10:12:25 +01:00
Sandro La Bruzzo b32a9d1994 Implemented workflow for updating table , added step to check if the new generated table is valid 2024-02-23 10:04:28 +01:00
Michele Artini 3268570b2c mapping of project PIDs 2024-02-22 14:47:21 +01:00
Miriam Baglioni 72bae7af76 [Transformative Agreement] removed the relations from the ActionSet waiting to have the gree light from Ioanna 2024-02-19 16:20:12 +01:00
Miriam Baglioni 43da7e1191 [Tagging Projects and Datasource] changed the way the pathMap parameter is passed. It was too long and was truncated 2024-02-19 16:12:59 +01:00
Serafeim Chatzopoulos f0dc12634b Add Action Set creation for affiliations inferred from the OpenAPC data 2024-02-18 18:02:09 +02:00
Claudio Atzori a63b091bae Merge branch 'beta' into import_orps_fix 2024-02-15 15:01:56 +01:00
Miriam Baglioni 8dae10b442 - 2024-02-14 14:57:08 +01:00
Miriam Baglioni 83bb97be83 [Tagging Projects and Datasource] added test to check datasource tagging. Fixed issue 2024-02-14 11:23:47 +01:00
Miriam Baglioni 6e1f383e4a [Tagging Projects and Datasource] first extention of bulktagging to add the context to projects and datasource 2024-02-13 16:37:14 +01:00
Miriam Baglioni 3f7d262a4e mergin with branch beta 2024-02-13 14:05:58 +01:00
Miriam Baglioni eca021f4d6 [Transformative Agreement] add results with information abount the agreement and the country of the organization paid for it 2024-02-13 12:21:07 +01:00
Miriam Baglioni bdb6bbb365 mergin with branch beta 2024-02-12 15:50:43 +01:00
Claudio Atzori d85d2df6ad [graph raw] fixed mapping of the original resource type from the Datacite format 2024-02-09 10:20:20 +01:00
Giambattista Bloisi b19643f6eb Dedup aliases, created when a dedup in a previous build has been merged in a new dedup, need to be marked as "deletedbyinference", since they are "merged" in the new dedup 2024-02-08 15:34:59 +01:00
Antonis Lempesis dd4c27f4f3 added 2 new institutions in monitor 2024-02-08 12:57:57 +02:00
Claudio Atzori 38c9001147 fixed import of ORPs stored on HDFS in the internal graph format (e.g. Datacite) 2024-02-07 17:02:05 +01:00
Claudio Atzori fd17c1f17c [actiosets] fixed join type 2024-02-05 16:55:36 +02:00
Claudio Atzori 009dcf6aea [actiosets] introduced support for the PromoteAction strategy 2024-02-05 16:43:40 +02:00
Claudio Atzori 42f5506306 [orcid enrichment] fixed directory cleanup before distcp 2024-02-05 09:45:36 +02:00
Alessia Bardi f2a08d8cc2 test for Italian records from IRS repositories 2024-01-30 19:20:14 +01:00
Antonis Lempesis a512ead447 changed orcid ids to all capital 2024-01-30 16:54:47 +02:00
Miriam Baglioni 07a373a0bd [bulkTagging] removing checks while performing the substring action so that it will fire an Exception if the paramneters are wrongly set 2024-01-30 13:51:11 +01:00
Miriam Baglioni ead08b0dd4 mergin with branch beta 2024-01-30 12:19:10 +01:00
Antonis Lempesis bb10a22290 merged changes from dnet-hadoop 2024-01-29 21:51:47 +02:00
Miriam Baglioni a5995ab557 [orcid-enrichment] change the value of parameters. 2024-01-29 18:19:48 +01:00
Miriam Baglioni a418dacb47 [UsageCount] code extention to include also the name of the datasource 2024-01-29 18:12:33 +01:00
Miriam Baglioni e9131f4e4a mergin with branch beta 2024-01-29 16:27:18 +01:00
Sandro La Bruzzo 9aebca77a0 Added exception throwing in Hadoop transformation when TR is not syntactically valid 2024-01-29 14:41:02 +01:00
Claudio Atzori 926903b06b Merge branch 'beta' into stats_with_spark_sql 2024-01-29 09:11:45 +01:00
Giambattista Bloisi 078df0b4d1 Use SparkSQL in place of Hive for executing step16-createIndicatorsTables.sql of stats update wf 2024-01-26 21:56:55 +01:00
Claudio Atzori ce3200263e Merge branch 'beta' into crossref_missing_author_fix 2024-01-26 15:57:04 +01:00
Sandro La Bruzzo e889808daa Fixed problem on missing author in crossref Mapping 2024-01-26 12:19:04 +01:00
Antonis Lempesis c548796463 Changed step16-createIndicatorsTables to use a spark oozie action instead of hive 2024-01-26 02:04:48 +02:00
Sandro La Bruzzo 0386f36385 Added workflow to update ORCID and replaced some parsing, because the update works and employments xml differs from the dump one. 2024-01-25 19:40:59 +01:00
Antonis Lempesis a7115cfa9e max mem of joins (hive.mapjoin.followby.gby.localtask.max.memory.usage) now 80%, up from 55%. 2024-01-25 15:13:16 +01:00
Antonis Lempesis fd43b0e84a max mem of joins (hive.mapjoin.followby.gby.localtask.max.memory.usage) now 80%, up from 55%. 2024-01-25 15:06:34 +01:00
Claudio Atzori 9b13c22e5d [graph provision] retrieve all the context information by adding all=true to the requests issued to thr API 2024-01-23 15:36:08 +01:00
Sandro La Bruzzo 43e0bba7ed logg added during download 2024-01-23 15:04:49 +01:00
Miriam Baglioni f7d06dc661 compilation after merging 2024-01-23 11:43:08 +01:00
Miriam Baglioni 6e58d79623 mergin with branch beta 2024-01-23 11:36:47 +01:00
Miriam Baglioni e0ec800d7e [BulkTagging] extend the definition of the pathMap to include also actions that should be performed of the value extracted from the result befor applying the constraint 2024-01-23 11:34:53 +01:00
Claudio Atzori f87f3a6483 [graph provision] updated param specification for the XML converter job 2024-01-23 08:54:37 +01:00
Claudio Atzori 6fd25cf549 code formatting 2024-01-23 08:47:12 +01:00
Claudio Atzori f76852f385 Merge branch 'beta' into update_pivots_table 2024-01-22 16:37:22 +01:00
Claudio Atzori 1c6db320f4 [graph provision] obtain context info from the context API instead from the ISLookUp service 2024-01-22 15:53:17 +01:00
Claudio Atzori 2655eea5bc [orcid enrichment] drop paths before copying the non-modifyed contents 2024-01-19 16:28:05 +01:00
Claudio Atzori c6b3401596 increased shuffle partitions for publications in the country propagation workflow 2024-01-19 10:15:39 +01:00
Miriam Baglioni bcc0a13981 [enrichment single step] adding <end> element in wf definition 2024-01-18 17:39:14 +01:00
Miriam Baglioni 6af536541d [enrichment single step] moving parameter file in correct location 2024-01-18 15:35:40 +01:00
Miriam Baglioni a12a3eb143 - 2024-01-18 15:18:10 +01:00
Miriam Baglioni 82e9e262ee [enrichment single step] remove parameter from execution 2024-01-17 17:38:03 +01:00
Miriam Baglioni 67ce2d54be [enrichment single step] refactoring to fix issues in disappeared result type 2024-01-17 16:50:00 +01:00
Miriam Baglioni 59eaccbd87 [enrichment single step] refactoring to fix issue in disappeared result type 2024-01-15 17:49:54 +01:00
Giambattista Bloisi 21a14fcd80 Reusable RunSQLSparkJob for executing SQL in Spark through Oozie Spark Actions
Implements pivots table update oozie workflow
2024-01-15 10:18:14 +01:00
Sandro La Bruzzo e0753f19da Fixed error of connection timeout 2024-01-13 09:27:08 +01:00
sandro.labruzzo e328bc0ade fixed missing parameter on download update 2024-01-12 16:18:20 +01:00
Miriam Baglioni f612125939 fix issue on FoS integration. Removing the null values from FoS 2024-01-12 10:20:28 +01:00
Claudio Atzori cb9e739484 Merge branch 'beta' into resource_types 2024-01-11 16:29:41 +01:00
Claudio Atzori 2753044d13 refined mapping for the extraction of the original resource type 2024-01-11 16:28:26 +01:00
Giambattista Bloisi 3c66e3bd7b Create dedup record for "merged" pivots
Do not create dedup records for group that have more than 20 different acceptance date
2024-01-10 22:59:52 +01:00
Giambattista Bloisi 10e135db1e Use dedup_wf_002 in place of dedup_wf_001 to make explicit a different algorithm has been used to generate those kind of ids 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 831cc1fdde Generate "merged" dedup id relations also for records that are filtered out by the cut parameters 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 1287315ffb Do no longer use dedupId information from pivotHistory Database 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 02636e802c SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use new connected component algorithm that converge faster than Spark GraphX provided algorithm
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
2024-01-10 22:59:52 +01:00
Antonis Lempesis e024718f73 creating result_instances even when no pids exist for the instance 2024-01-10 22:25:50 +01:00
Sandro La Bruzzo 859babf722 added some useful comment 2024-01-10 19:51:13 +01:00
Sandro La Bruzzo 39ebb60b38 Merge remote-tracking branch 'origin/beta' into orcid_update 2024-01-10 19:50:00 +01:00
Sandro La Bruzzo 9d5a7c3b22 code refactor 2024-01-10 19:42:34 +01:00
Sandro La Bruzzo 8f61063201 Added workflow 2024-01-10 19:42:22 +01:00
Sandro La Bruzzo 1a42a5c10d Implemented Download update of ORCID 2024-01-10 18:03:20 +01:00
Miriam Baglioni e711a05229 fixed conflicts 2024-01-10 11:03:42 +01:00
Miriam Baglioni 71d6f30711 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2024-01-10 10:59:58 +01:00
dimitrispie b920307bdd Changes to indicators 2024-01-09 00:47:09 +02:00
dimitrispie 8b2cbb611e Changes to beta db names 2024-01-09 00:40:56 +02:00
Antonis Lempesis 2e4cab026c fixed the result_country definition 2024-01-08 16:01:26 +02:00
dimitrispie 6b823100ae Update buildIrishMonitorDB.sql
New indicators added
2024-01-07 22:54:39 +02:00
dimitrispie 75bfde043c Historical Snapshots Workflow
Create historical snapshots db with parameters:

hist_db_name=openaire_beta_historical_snapshots_xxx
hist_db_name_prev=openaire_beta_historical_snapshots_xxx (previous run of wf)
stats_db_name=openaire_beta_stats_xxx
stats_irish_db_name=openaire_beta_stats_monitor_ie_xxx
monitor_db_name=openaire_beta_stats_monitor_xxx
monitor_db_prod_name=openaire_beta_stats_monitor
monitor_irish_db_name=openaire_beta_stats_monitor_ie_xxx
monitor_irish_db_prod_name=openaire_beta_stats_monitor_ie
hist_db_prod_name=openaire_beta_historical_snapshots
hist_db_shadow_name=openaire_beta_historical_snapshots_shadow
hist_date=122023
hive_timeout=150000
hadoop_user_name=xxx
resumeFrom=CreateDB
2024-01-04 15:11:04 +02:00
Miriam Baglioni cb14470ba6 added properties file in the forlder for the workflow of result to organization from inst repo propagation. Changes the path in the classes implementing the propagation 2023-12-22 14:50:05 +01:00