Commit Graph

385 Commits

Author SHA1 Message Date
Giambattista Bloisi 613ec5ffce Add profiles for different spark versions: spark-24, spark-34, spark-35 2023-12-05 19:11:06 +01:00
Claudio Atzori 7c3041b276 avoid NPEs 2023-12-03 16:49:49 +01:00
Claudio Atzori 74b185d07b avoid NPEs 2023-12-03 16:18:20 +01:00
Claudio Atzori e6086efc53 avoid NPEs in Vocabulary.getTermBySynonym 2023-12-03 13:33:20 +01:00
Claudio Atzori d33f578e54 code formatting 2023-12-01 15:14:17 +01:00
Claudio Atzori 622fafbd2e Merge branch 'beta' into orcid_import 2023-12-01 12:28:14 +01:00
Sandro La Bruzzo bf0fd27c36 Removed unused function. Applied PR Comment of Giambattista in the PR 2023-12-01 12:16:42 +01:00
Sandro La Bruzzo cdfb7588dd code formatting 2023-11-30 15:31:42 +01:00
Sandro La Bruzzo 5e22b67b8a Merge remote-tracking branch 'origin/beta' into orcid_import 2023-11-30 15:27:46 +01:00
Claudio Atzori 4e1aac2e2f resolved conflict in pom.xml before applying the changes from [COAR based resource types & Irish tender] #350 2023-11-29 14:37:52 +01:00
Sandro La Bruzzo aa239ec673 Changed implementation of check similarity to verify exact match of name instead of the first char 2023-11-29 11:17:41 +01:00
Sandro La Bruzzo 59111713fa added comment 2023-11-28 09:00:48 +01:00
Sandro La Bruzzo 6f4d0c05ea Implemented Author Merger for ORCID that takes into account the case when name and surname are swapped 2023-11-28 08:43:56 +01:00
Sandro La Bruzzo 34a4b3cbdf Implemented ORCID Enrichment 2023-11-24 12:39:58 +01:00
Claudio Atzori 1ba582de3c [graph cleaning] added cleaning for result.publisher and result.instance.license 2023-11-23 16:27:19 +01:00
Claudio Atzori 11a1207f9c [graph cleaning] applying coar based vocabularies in bulk 2023-11-22 12:22:14 +01:00
Claudio Atzori dde2fec035 [graph cleaning] cleanup 2023-10-31 14:35:33 +01:00
Claudio Atzori 262d7c581b [graph cleaning] implemented further suggestions from https://support.openaire.eu/issues/8898 2023-10-31 14:34:10 +01:00
Claudio Atzori b0fed1725e avoid NPEs 2023-10-19 12:13:45 +02:00
Claudio Atzori a24178cb93 Merge branch 'beta' into resource_types 2023-10-17 11:09:50 +02:00
Claudio Atzori d28b7085f6 more NPE checks 2023-10-17 11:09:31 +02:00
Giambattista Bloisi 0e44b037a5 FIX: GroupEntitiesSparkJob deletes whole graph outputPath instead of its temporary folder 2023-10-17 07:54:01 +02:00
Claudio Atzori 39d24d5469 Merge branch 'beta' into resource_types 2023-10-16 11:56:38 +02:00
Claudio Atzori 05ee7d8b09 [graph cleaning] avoid NPEs 2023-10-12 09:13:42 +02:00
Claudio Atzori 554551682d [raw graph] adopting the new COAR based vocabularies for the resource typing 2023-10-11 16:09:19 +02:00
Claudio Atzori 8108491722 Merge branch 'beta' into peer_reviewed 2023-10-06 14:21:52 +02:00
Giambattista Bloisi 2f3cf6d0e7 Fix cleaning of Pmid where parsing of numbers stopped at the first non-leading '0' character 2023-10-06 14:20:15 +02:00
Claudio Atzori eed9fe0902 code formatting 2023-10-06 12:31:17 +02:00
Claudio Atzori 73c49b8d26 Merge branch 'beta' into SWH_integration 2023-10-06 12:21:51 +02:00
Claudio Atzori c9a5ad6a02 extending the coverage of the peer non-unknown refereed instances 2023-10-02 16:28:42 +02:00
Serafeim Chatzopoulos ab0d70691c Add step for archiving repoUrls to SWH 2023-09-28 20:56:18 +03:00
Serafeim Chatzopoulos ed9c81a0b7 Add steps to collect last visit data && archive not found repository URLs 2023-09-27 19:00:54 +03:00
Claudio Atzori 8a6892cc63 [graph dedup] consistency wf should not remove the relations while dispatching the entities 2023-09-12 21:27:05 +02:00
Giambattista Bloisi 6cc7d8ca7b GroupEntities and DispatchEntities are now merged in GroupEntitiesSparkJob 2023-08-30 10:43:31 +02:00
Claudio Atzori bf35280ea6 code formatting 2023-08-29 11:11:00 +02:00
Giambattista Bloisi 95cd2b9b1e Make filterInvisible a mandatory parameter of DispatchEntitiesSparkJob. Make filterInvisible a mandatory parameter of both dedup/consistency and graph/group oozie workflows 2023-08-10 11:53:48 +02:00
Giambattista Bloisi fab9920271 DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag 2023-08-09 15:41:43 +02:00
Miriam Baglioni c25ac21e5e Merge pull request 'graph cleaning, suggestions from ticket 8898' (#325) from cleaning_8898 into beta. Reviewed-on: #325 2023-08-08 11:14:19 +02:00
Claudio Atzori b9dddbfe54 rule out records with NULL dataInfo, except for Relations 2023-07-31 17:53:54 +02:00
Claudio Atzori 11ffb9bd68 rule out records with NULL dataInfo 2023-07-31 12:35:33 +02:00
Claudio Atzori d8435a6512 inverted condition 2023-07-25 17:39:57 +02:00
Claudio Atzori 270df939c4 partial implementation of the suggestions from https://support.openaire.eu/issues/8898 2023-07-25 17:29:50 +02:00
Claudio Atzori c754397a19 Merge branch 'beta' into pid_cleaning 2023-07-24 10:49:31 +02:00
Giambattista Bloisi 38dfebfbe6 Disable MdStoreClientTest test as it requires a local mongodb running and it does not perform any assertions 2023-07-19 14:18:56 +02:00
Sandro La Bruzzo 9910ce06ae added to CreateSimRel the feature to write time log 2023-06-28 11:38:16 +02:00
Sandro La Bruzzo b195da3a83 Added utility to write time logs during the deduplication phase 2023-06-28 11:20:09 +02:00
Claudio Atzori 0f5a819f44 [graph cleaning] fixed regex behaviour for cleaning ROR and GRID identifiers, added tests 2023-06-23 16:10:49 +02:00
Claudio Atzori 1d33074fd1 WIP: pid cleaning 2023-06-09 16:47:25 +02:00
Claudio Atzori 8a463cc3e8 fixed organization id created when mapping APC affiliations. Factored out ROR constants in dhp-common 2023-05-15 15:44:46 +02:00
Claudio Atzori d02916ef82 code formatting 2023-05-02 11:05:37 +02:00
Claudio Atzori 851f664bd9 Merge branch 'beta' into graph_cleaning_refactoring 2023-05-02 09:55:40 +02:00
Miriam Baglioni 73f77575bd [ZenodoApiClient] align with master version 2023-04-18 10:25:27 +02:00
Claudio Atzori 2a6ba29b64 [graph cleaning] unit tests & cleanup 2023-04-04 12:34:51 +02:00
Claudio Atzori 6d3d18d8b5 [graph cleaning] WIP: refactoring of the cleaning stages 2023-03-16 17:23:36 +01:00
Sandro La Bruzzo 0b9819f1ab Code formatted 2023-02-08 10:32:33 +01:00
Sandro La Bruzzo 6c81a161d2 Merge remote-tracking branch 'origin/beta' into 8231-mdstore-synch-improve 2023-02-08 10:29:09 +01:00
Claudio Atzori 9cf0a98699 [cleaning] set the common subject classid/name 2022-12-20 10:17:33 +01:00
Claudio Atzori b8bafab8a0 [cleaning] improved vocabulary based mapping, specialization for the strict vocab cleaning 2022-12-12 14:43:03 +01:00
Sandro La Bruzzo 5a48a2fb18 implemented synch for single mdstore 2022-12-01 11:34:43 +01:00
Claudio Atzori 11695ba649 [graph cleaning] patch also the result's collectedfrom and hostedby datasource name according to the datasource master-duplicate mapping 2022-11-28 10:18:43 +01:00
Claudio Atzori 24ef301cc1 [graph cleaning] patch the result's collectedfrom and hostedby identifiers according to the datasource master-duplicate mapping 2022-11-28 09:54:18 +01:00
Claudio Atzori b47aaf4dd1 [cleaning] subjects declared as belonging to specific vocabularies whose values are not found in the vocab are set to type keyword 2022-10-13 11:23:43 +02:00
Claudio Atzori b7c387c21f cleaning of subjects: avoid duplicated subjects, prioritise collected vs inferred or other sources 2022-08-12 15:09:16 +02:00
Claudio Atzori adb526b0e1 Merge branch 'beta' into clean_subjects 2022-08-12 10:51:17 +02:00
Claudio Atzori cb7c07c54e [scholix] added step to create tar archive 2022-08-11 11:25:24 +02:00
Claudio Atzori 3418ce50ac cleaning of subjects: perform the cleaning when the given value is equivalent to one of the terms in the vocabulary 2022-08-08 12:48:47 +02:00
Claudio Atzori 32cee1f619 WIP: cleaning of subjects 2022-08-05 12:32:08 +02:00
Claudio Atzori b78889a0ce WIP: cleaning of subjects 2022-08-05 09:11:37 +02:00
Claudio Atzori 27a91841e7 WIP: cleaning of subjects 2022-08-04 11:39:39 +02:00
Claudio Atzori 09ccc7b472 Merge branch 'beta' into project_organization_contribution 2022-07-28 09:49:59 +02:00
Claudio Atzori 1138b2ac8e code formatting 2022-07-19 14:15:49 +02:00
Claudio Atzori 0cb1c70788 code formatting 2022-07-01 10:44:08 +02:00
Claudio Atzori 7da24c1dec added more logging 2022-06-28 13:47:49 +02:00
Claudio Atzori a8773af0cb Merge branch 'beta' into project_organization_contribution 2022-06-27 09:37:40 +02:00
Claudio Atzori 316b0fd73c added 'von' to the name particles file 2022-06-27 09:36:51 +02:00
Claudio Atzori 5130eac247 mapping by participant project contribution 2022-06-24 17:16:42 +02:00
Claudio Atzori b295a40d9c restored use of name_particles when parsing author names 2022-06-16 12:20:43 +02:00
Miriam Baglioni ab8868bd3a [ZENODO-API] changed to iterate in all the deposited products and not just the last ten 2022-06-08 17:03:15 +02:00
Claudio Atzori da611cfbbd [eosc_services] resolved merge conflicts 2022-05-03 13:37:15 +02:00
Claudio Atzori f5f532d134 EOSC Services - ongoing update 2022-04-29 12:25:24 +02:00
Miriam Baglioni b61efd613b [Measures] addressed comments in the PR 2022-04-21 12:09:37 +02:00
Miriam Baglioni c304657d91 [Measures] put the logic in common, no need to change the schema 2022-04-21 11:27:26 +02:00
Miriam Baglioni b7c2340952 [HostedByMap - DOIBoost] changed to use the code moved to common, since it is now also used by hostedbymap 2022-03-04 11:05:23 +01:00
Alessia Bardi 6158170334 testing delegated authority and bumped dep to schemas 2022-02-11 18:05:18 +01:00
Claudio Atzori db299dd8ab fixed typo 2022-01-27 16:24:06 +01:00
Claudio Atzori c42623f006 added NPE checks 2022-01-21 14:30:09 +01:00
Claudio Atzori 391aa1373b added unit test 2022-01-19 17:13:21 +01:00
Claudio Atzori 62f135262e code formatting 2022-01-19 12:30:52 +01:00
Claudio Atzori 44a937f4ed factored out entity grouping implementation, extended to consider results from delegated authorities rather than identical records from other sources 2022-01-19 12:24:52 +01:00
Miriam Baglioni 42e8f76778 [GraphCleaning] change the return value in the filtering function to avoid losing the APC entities 2022-01-13 16:06:43 +01:00
Claudio Atzori 4f212652ca scalafmt: code formatting 2022-01-11 16:57:48 +01:00
Miriam Baglioni be0acccf42 Merge branch 'beta' into dump 2021-12-22 12:39:57 +01:00
Sandro La Bruzzo 3920d68992 Fixed workflow generation of delta in datacite 2021-12-21 11:41:49 +01:00
Sandro La Bruzzo b881ee5ef8 [scholexplorer] implemented generation of scholix of delta update of datacite 2021-12-15 11:25:32 +01:00
Miriam Baglioni 56409d1281 [Dump] resolved conflicts with beta and merging 2021-12-14 15:03:45 +01:00
Miriam Baglioni a3592b463a Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2021-12-14 14:58:26 +01:00
Claudio Atzori aff3ddc8d2 added cleaning for the format field, removing carriage return and tab characters 2021-12-14 11:41:46 +01:00
Miriam Baglioni 936578aaf1 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2021-12-13 15:01:47 +01:00
Claudio Atzori 41c70c607d cleaning workflow assigns the proper default instance type when a value could not be cleaned using the vocabularies 2021-12-09 16:44:28 +01:00
Claudio Atzori e6e177dda0 vocabulary based cleaning also considers the term label when looking up a synonym 2021-12-09 13:57:53 +01:00