Commit Graph

90 Commits

Author SHA1 Message Date
Claudio Atzori 1845dcfedf WIP: refactoring the internal graph data model and its utilities 2023-02-01 16:24:35 +01:00
Claudio Atzori 9cf0a98699 [cleaning] set the common subject classid/name 2022-12-20 10:17:33 +01:00
Claudio Atzori b8bafab8a0 [cleaning] improved vocabulary based mapping, specialization for the strict vocab cleaning 2022-12-12 14:43:03 +01:00
Claudio Atzori b47aaf4dd1 [cleaning] subjects declared as belonging to specific vocabularies whose values are not found in the vocab are set to type keyword 2022-10-13 11:23:43 +02:00
Claudio Atzori b7c387c21f cleaning of subjects: avoid duplicated subjects, prioritise collected vs inferred or other sources 2022-08-12 15:09:16 +02:00
Claudio Atzori 3418ce50ac cleaning of subjects: perform the cleaning when the given value is equivalent to one of the terms in the vocabulary 2022-08-08 12:48:47 +02:00
Claudio Atzori 27a91841e7 WIP: cleaning of subjects 2022-08-04 11:39:39 +02:00
Claudio Atzori 5130eac247 mapping by participant project contribution 2022-06-24 17:16:42 +02:00
Claudio Atzori da611cfbbd [eosc_services] resolved merge conflicts 2022-05-03 13:37:15 +02:00
Claudio Atzori f5f532d134 EOSC Services - ongoing update 2022-04-29 12:25:24 +02:00
Miriam Baglioni b61efd613b [Measures] addressed comments in the PR 2022-04-21 12:09:37 +02:00
Miriam Baglioni c304657d91 [Measures] put the logic in common, no need to change the schema 2022-04-21 11:27:26 +02:00
Claudio Atzori db299dd8ab fixed typo 2022-01-27 16:24:06 +01:00
Claudio Atzori c42623f006 added NPE checks 2022-01-21 14:30:09 +01:00
Claudio Atzori 62f135262e code formatting 2022-01-19 12:30:52 +01:00
Claudio Atzori 44a937f4ed factored out entity grouping implementation, extended to consider results from delegated authorities rather than identical records from other sources 2022-01-19 12:24:52 +01:00
Miriam Baglioni 42e8f76778 [GraphCleaning] change the return value in the filtering function to avoid to lose the APC entities 2022-01-13 16:06:43 +01:00
Claudio Atzori aff3ddc8d2 added cleaning for the format field, removing carrige return and tab characters 2021-12-14 11:41:46 +01:00
Claudio Atzori 41c70c607d cleaning workflow assigns the proper default instance type when a value could not be cleaned using the vocabularies 2021-12-09 16:44:28 +01:00
Claudio Atzori 863a2f9db3 avoid to filter OAF records defined as invisible = true 2021-12-03 09:08:12 +01:00
Claudio Atzori 82a4e4efae [cleaning wf] fixed methodology to rule out invalid result titles, based on https://support.openaire.eu/issues/7206 2021-11-17 14:17:22 +01:00
Claudio Atzori 49f897ef29 [cleaning wf] fixed regex used to spot garbage in result titles; adjusted threshold for filtering titles 2021-11-16 15:24:23 +01:00
Claudio Atzori 2ee21da43b suggestions from SonarLint 2021-08-11 12:13:22 +02:00
Claudio Atzori 6dddad86ee [cleaning] title cleaning based on the me.xuender:unidecode library 2021-07-28 16:21:29 +02:00
Claudio Atzori bc835d2024 [cleaning] fixed filtering function for missing titles 2021-07-23 11:56:13 +02:00
Claudio Atzori 67afd06cd1 [cleaning] cleaning instance.pid and instance.alternateidentifier using the same procedure used to clean result.pid 2021-06-24 12:10:17 +02:00
Claudio Atzori 2039bb9f5f orcid / orcid_pending cleaning backported from master branch 2021-06-14 09:40:50 +02:00
Claudio Atzori a900bfb874 delegating the date parsing to https://github.com/sisyphsu/dateparser 2021-06-11 16:53:01 +02:00
Claudio Atzori eb6acfbabc [cleaning] removing non parsable relation.validationDate(s) 2021-05-28 10:50:44 +02:00
Claudio Atzori 23b8883ab1 applied intellij code cleanup 2021-05-14 10:58:12 +02:00
Claudio Atzori d4c3476152 mapping datasource.journal only when an issn is available, null otherwhise 2021-05-11 11:08:54 +02:00
Claudio Atzori d1cbee8413 imported methods from CleaningFunctions, defined in GraphCleaningFunctions 2021-05-10 16:43:39 +02:00
Claudio Atzori 5afa7d3e0c core utilities in dhp-common moved in external module dhp-schemas 2021-04-27 15:44:01 +02:00
Claudio Atzori f783e60ff7 cleanup 2021-04-27 14:04:50 +02:00
Claudio Atzori 8704d32780 code formatting 2021-04-15 16:52:58 +02:00
Claudio Atzori ba4b4c74d8 do not make the identifier prefix depend on the Handle 2021-04-15 16:48:26 +02:00
Claudio Atzori d1ca025b0b [cleaning] remiving authors without fullname or providing 'deactivated' keyword. Removing test test titles 2021-04-13 14:32:41 +02:00
Claudio Atzori 902d05f548 [cleaning] avoiding NPEs handling null author PIDs 2021-04-12 17:31:40 +02:00
Claudio Atzori 72ce741ea6 WIP: using common definitions from ModelConstants 2021-03-31 17:07:13 +02:00
Claudio Atzori 27681b876c code formatting 2021-03-29 17:47:11 +02:00
miconis 2709d08fc2 Merge branch 'stable_ids' into openorgswf 2021-03-29 16:39:07 +02:00
Claudio Atzori 3becaa5539 [Cleaning] drop alternate identifiers with empty values 2021-03-29 16:01:35 +02:00
Claudio Atzori 48f2b6127e [Cleaning] drop alternate identifiers with empty values 2021-03-29 14:23:18 +02:00
miconis 2355cc4e9b minor changes and bug fix 2021-03-29 10:07:12 +02:00
Claudio Atzori b5b7dc2104 [Cleaning] drop alternate identifiers with empty values 2021-03-26 12:30:00 +01:00
Claudio Atzori 827e7e37db [Cleaning] drop instance.alternateIdentifier elements when they are available among instance.pid 2021-03-25 11:07:59 +01:00
Claudio Atzori 431cbe9955 handle missing instance.pid during bulk cleaning 2021-03-23 09:28:58 +01:00
Claudio Atzori 3256b9c836 code formatting 2021-03-19 09:36:12 +01:00
Claudio Atzori 75144dacb3 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-03-19 09:07:40 +01:00
Claudio Atzori 9588bfba81 [cleaning] entries avaialbe as PIDs must not appear as alternateIdentifier 2021-03-19 09:07:30 +01:00