Commit Graph

385 Commits

Author SHA1 Message Date
Miriam Baglioni a9ede1e989 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2023-10-20 10:14:43 +02:00
Giambattista Bloisi 2c235e82ad Fix cleaning of Pmid where parsing of numbers stopped at the first non-leading '0' character 2023-10-06 12:35:54 +02:00
Claudio Atzori dc80ab14d3 [graph dedup] consistency wf should not remove the relations while dispatching the entities 2023-09-12 14:34:28 +02:00
Claudio Atzori bf35280ea6 code formatting 2023-08-29 11:11:00 +02:00
Giambattista Bloisi 95cd2b9b1e Make filterInvisible a mandatory parameter of DispatchEntitiesSparkJob
Make filterInvisible a mandatory parameter of both dedup/consistency and graph/group oozie workflows
2023-08-10 11:53:48 +02:00
Giambattista Bloisi fab9920271 DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag 2023-08-09 15:41:43 +02:00
Miriam Baglioni 599828ce35 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2023-08-09 13:07:13 +02:00
Miriam Baglioni c25ac21e5e Merge pull request 'graph cleaning, suggestions from ticket 8898' (#325) from cleaning_8898 into beta
Reviewed-on: D-Net/dnet-hadoop#325
2023-08-08 11:14:19 +02:00
Claudio Atzori 0bc74e2000 code formatting 2023-08-02 11:52:10 +02:00
Claudio Atzori 7180911ded [graph cleaning] fixed regex behaviour for cleaning ROR and GRID identifiers, added tests 2023-08-02 11:44:14 +02:00
Claudio Atzori b9dddbfe54 rule out records with NULL dataInfo, except for Relations 2023-07-31 17:53:54 +02:00
Claudio Atzori da1727f93f rule out records with NULL dataInfo, except for Relations 2023-07-31 17:52:56 +02:00
Claudio Atzori 11ffb9bd68 rule out records with NULL dataInfo 2023-07-31 12:35:33 +02:00
Claudio Atzori ccac6a7f75 rule out records with NULL dataInfo 2023-07-31 12:35:05 +02:00
Claudio Atzori d512df8612 code formatting 2023-07-26 09:14:08 +02:00
Claudio Atzori d8435a6512 inverted condition 2023-07-25 17:39:57 +02:00
Claudio Atzori 59764145bb cherry picked & fixed commit 270df939c4 2023-07-25 17:39:00 +02:00
Claudio Atzori 270df939c4 partial implementation of the suggestions from https://support.openaire.eu/issues/8898 2023-07-25 17:29:50 +02:00
Giambattista Bloisi bb5b845e3c Use scala.binary.version property to resolve scala maven dependencies
Ensure consistent usage of maven properties
Profile for compiling with scala 2.12 and Spark 3.4
2023-07-24 11:13:48 +02:00
Claudio Atzori c754397a19 Merge branch 'beta' into pid_cleaning 2023-07-24 10:49:31 +02:00
Giambattista Bloisi 38dfebfbe6 Disable MdStoreClientTest test as it requires a local mongodb running and it does not perform any assertions 2023-07-19 14:18:56 +02:00
Miriam Baglioni 9e8e39f78a - 2023-07-19 11:35:58 +02:00
Giambattista Bloisi bd3fcf869a rename dnet-pace-core into dhp-pace-core module and use it as dependency in other modules 2023-07-06 10:02:23 +02:00
Claudio Atzori f3a85e224b merged from branch beta the bulk tagging (single step, negative constraints), the cleaning workflow (single step, pid type based cleaning), instance level fulltext 2023-06-28 13:33:57 +02:00
Sandro La Bruzzo 9910ce06ae added to CreateSimRel the feature to write time log 2023-06-28 11:38:16 +02:00
Sandro La Bruzzo b195da3a83 Added utility to write time logs during the deduplication phase 2023-06-28 11:20:09 +02:00
Claudio Atzori 0f5a819f44 [graph cleaning] fixed regex behaviour for cleaning ROR and GRID identifiers, added tests 2023-06-23 16:10:49 +02:00
Miriam Baglioni e4b27182d0 [master] refactoring 2023-06-21 11:15:53 +02:00
Claudio Atzori 1d33074fd1 WIP: pid cleaning 2023-06-09 16:47:25 +02:00
Miriam Baglioni d9506035e4 [ZenodoApi] gone back to okhttp3 to send the payload. 2023-06-09 12:05:02 +02:00
Claudio Atzori 8a463cc3e8 fixed organization id created when mapping APC affiliations. Factored out ROR constants in dhp-common 2023-05-15 15:44:46 +02:00
Claudio Atzori d02916ef82 code formatting 2023-05-02 11:05:37 +02:00
Claudio Atzori 851f664bd9 Merge branch 'beta' into graph_cleaning_refactoring 2023-05-02 09:55:40 +02:00
Miriam Baglioni 9fc8ebe98b refactoring 2023-04-19 09:32:13 +02:00
Miriam Baglioni 73f77575bd [ZenodoApiClient] align with master version 2023-04-18 10:25:27 +02:00
Miriam Baglioni 24c41806ac [ZenodoApiClienttest] change test to mirror change in the implementation 2023-04-18 09:08:09 +02:00
Miriam Baglioni 087b5a7973 [ZenodoApiClient] new version of the API to connect to Zenodo (change the http client) 2023-04-17 18:59:22 +02:00
Miriam Baglioni c6a7602b3e refactoring after compilation 2023-04-06 14:45:01 +02:00
Claudio Atzori 2a6ba29b64 [graph cleaning] unit tests & cleanup 2023-04-04 12:34:51 +02:00
Miriam Baglioni 9a9cc6a1dd changed the way the tar archive is built to support renaming in case we need to change .tt.gz into .json.gz 2023-04-04 11:40:58 +02:00
Claudio Atzori 6d3d18d8b5 [graph cleaning] WIP: refactoring of the cleaning stages 2023-03-16 17:23:36 +01:00
Miriam Baglioni 32870339f5 refactoring after compile 2023-02-13 13:06:48 +01:00
Sandro La Bruzzo 0b9819f1ab Code formatted 2023-02-08 10:32:33 +01:00
Sandro La Bruzzo 6c81a161d2 Merge remote-tracking branch 'origin/beta' into 8231-mdstore-synch-improve 2023-02-08 10:29:09 +01:00
Miriam Baglioni b713132db7 [Cleaning] adding missing classes 2022-12-21 12:49:08 +01:00
Claudio Atzori 9cf0a98699 [cleaning] set the common subject classid/name 2022-12-20 10:17:33 +01:00
Claudio Atzori b8bafab8a0 [cleaning] improved vocabulary based mapping, specialization for the strict vocab cleaning 2022-12-12 14:43:03 +01:00
Sandro La Bruzzo 5a48a2fb18 implemented synch for single mdstore 2022-12-01 11:34:43 +01:00
Claudio Atzori 11695ba649 [graph cleaning] patch also the result's collectedfrom and hostedby datasource name according to the datasource master-duplicate mapping 2022-11-28 10:18:43 +01:00
Claudio Atzori 24ef301cc1 [graph cleaning] patch the result's collectedfrom and hostedby identifiers according to the datasource master-duplicate mapping 2022-11-28 09:54:18 +01:00