Commit Graph

1365 Commits

Author SHA1 Message Date
Claudio Atzori d85d2df6ad [graph raw] fixed mapping of the original resource type from the Datacite format 2024-02-09 10:20:20 +01:00
Claudio Atzori 42f5506306 [orcid enrichment] fixed directory cleanup before distcp 2024-02-05 09:45:36 +02:00
Alessia Bardi f2a08d8cc2 test for Italian records from IRS repositories 2024-01-30 19:20:14 +01:00
Claudio Atzori 2655eea5bc [orcid enrichment] drop paths before copying the non-modified contents 2024-01-19 16:28:05 +01:00
Claudio Atzori cb9e739484 Merge branch 'beta' into resource_types 2024-01-11 16:29:41 +01:00
Claudio Atzori 2753044d13 refined mapping for the extraction of the original resource type 2024-01-11 16:28:26 +01:00
Miriam Baglioni e711a05229 fixed conflicts 2024-01-10 11:03:42 +01:00
Claudio Atzori 62104790ae added metaresourcetype to the result hive DB view 2023-12-21 12:27:10 +01:00
Miriam Baglioni 4740c808f7 - 2023-12-20 14:26:54 +01:00
Claudio Atzori cb71a7936b [graph cleaning] avoid stack overflow error when navigating Oaf objects declaring an Enum 2023-12-07 23:09:54 +01:00
Claudio Atzori 259c69e446 [orcid enrichment] fixed workflow definition 2023-12-06 19:41:53 +01:00
Claudio Atzori 2a233a89aa [graph grouping] added isLookupUrl to the workflow definition, passed to the grouping spark action 2023-12-03 13:32:52 +01:00
Claudio Atzori 622fafbd2e Merge branch 'beta' into orcid_import 2023-12-01 12:28:14 +01:00
Sandro La Bruzzo bf0fd27c36 Removed unused function
Applied PR Comment of Giambattista in the PR
2023-12-01 12:16:42 +01:00
Sandro La Bruzzo cdfb7588dd code formatting 2023-11-30 15:31:42 +01:00
Sandro La Bruzzo 5e22b67b8a Merge remote-tracking branch 'origin/beta' into orcid_import 2023-11-30 15:27:46 +01:00
Sandro La Bruzzo f718caaac9 Added copy of the untouched entities of the graph 2023-11-30 14:51:00 +01:00
Sandro La Bruzzo 7b5e04f37e removed Orcid intersection on DOIBoost 2023-11-30 14:36:50 +01:00
Claudio Atzori 4e1aac2e2f resolved conflict in pom.xml before applying the changes from [COAR based resource types & Irish tender] #350 2023-11-29 14:37:52 +01:00
Sandro La Bruzzo 279100fa52 added test 2023-11-29 11:17:58 +01:00
Sandro La Bruzzo 59111713fa added comment 2023-11-28 09:00:48 +01:00
Sandro La Bruzzo 6f4d0c05ea Implemented Author Merger for ORCID that takes into account the case where name and surname are swapped 2023-11-28 08:43:56 +01:00
Sandro La Bruzzo 34a4b3cbdf Implemented ORCID Enrichment 2023-11-24 12:39:58 +01:00
Claudio Atzori 2c77638bf5 Merge branch 'beta' into cleaning_8898 2023-11-22 14:00:10 +01:00
Claudio Atzori 11a1207f9c [graph cleaning] applying coar based vocabularies in bulk 2023-11-22 12:22:14 +01:00
Claudio Atzori 262d7c581b [graph cleaning] implemented further suggestions from https://support.openaire.eu/issues/8898 2023-10-31 14:34:10 +01:00
Claudio Atzori b3a61ea955 Merge branch 'beta' into url_validation 2023-10-25 14:22:56 +02:00
Claudio Atzori 7fc621cdec added defaults to the graph resolution workflow config-default.xml 2023-10-20 22:28:12 +02:00
Claudio Atzori 2b9d0416ec [graph raw] URL Validator to accept double slashes 2023-10-19 16:26:37 +02:00
Claudio Atzori 6dfcd0c9a2 [raw graph] mapping original resource types 2023-10-16 12:57:18 +02:00
Claudio Atzori 54fbf09ac6 [raw graph] WIP: mapping original resource types 2023-10-16 08:57:47 +02:00
Claudio Atzori 554551682d [raw graph] adopting the new COAR based vocabularies for the resource typing 2023-10-11 16:09:19 +02:00
Claudio Atzori eed9fe0902 code formatting 2023-10-06 12:31:17 +02:00
Claudio Atzori dc86018a5f Merge branch 'merge_entities_job' into beta 2023-10-02 11:24:48 +02:00
Alessia Bardi 0935d7757c Use v5 of the UNIBI Gold ISSN list in test 2023-09-20 15:41:35 +02:00
Alessia Bardi cc7204a089 tests for d4science catalog 2023-09-20 15:38:32 +02:00
Claudio Atzori 5b06c9d06f [graph raw] datainfo.invisible set as true only for entities 2023-09-04 15:15:24 +02:00
Giambattista Bloisi 6cc7d8ca7b GroupEntities and DispatchEntities are now merged in GroupEntitiesSparkJob 2023-08-30 10:43:31 +02:00
Claudio Atzori bf35280ea6 code formatting 2023-08-29 11:11:00 +02:00
Giambattista Bloisi 95cd2b9b1e Make filterInvisible a mandatory parameter of DispatchEntitiesSparkJob
Make filterInvisible a mandatory parameter of both dedup/consistency and graph/group oozie workflows
2023-08-10 11:53:48 +02:00
Giambattista Bloisi fab9920271 DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag 2023-08-09 15:41:43 +02:00
Miriam Baglioni c25ac21e5e Merge pull request 'graph cleaning, suggestions from ticket 8898' (#325) from cleaning_8898 into beta
Reviewed-on: D-Net/dnet-hadoop#325
2023-08-08 11:14:19 +02:00
Claudio Atzori 11ffb9bd68 rule out records with NULL dataInfo 2023-07-31 12:35:33 +02:00
Claudio Atzori 270df939c4 partial implementation of the suggestions from https://support.openaire.eu/issues/8898 2023-07-25 17:29:50 +02:00
Giambattista Bloisi e64c2854a3 Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)
2023-07-24 15:36:24 +02:00
Giambattista Bloisi bb5b845e3c Use scala.binary.version property to resolve scala maven dependencies
Ensure consistent usage of maven properties
Profile for compiling with scala 2.12 and Spark 3.4
2023-07-24 11:13:48 +02:00
Claudio Atzori b76a47b103 [aggregator graph] added column alias when mapping organization PIDs from the OpenOrgs database 2023-06-13 11:38:10 +02:00
Claudio Atzori ad04f14b81 Merge branch 'beta' into distinct_pids_from_openorgs_beta 2023-06-12 09:58:21 +02:00
Claudio Atzori e1409ffe80 update sql query to return distinct pids 2023-06-12 09:47:45 +02:00
Claudio Atzori e45777e7e1 [aggregator graph] added validation for URLs mapped from oaf:fulltext 2023-05-26 11:33:42 +02:00