Commit Graph

215 Commits

Author SHA1 Message Date
Claudio Atzori c3d67f709a adjusted dedup configuration for result entities: using new wordssuffixprefix clustering function, removed ngrampairs, adjusted queueMaxSize (800) and slidingWindowSize (80) 2020-07-02 17:35:22 +02:00
Claudio Atzori 0f77cac4b5 fix: deduper must use queueMaxSize instead of groupMaxSize for the block definition 2020-07-02 12:43:51 +02:00
miconis 11b77b9f4e json dumps for entity merge test modified to fit the new model. title merge adjusted to fix the error 2020-06-16 18:31:11 +02:00
Claudio Atzori 7b288a94cb code formatting 2020-05-26 09:54:13 +02:00
miconis da1e5cf557 implementation of the result title merge. main title with higher trust, distinct between the others 2020-05-25 18:02:57 +02:00
Claudio Atzori 7181807e64 code formatting 2020-05-23 09:51:48 +02:00
miconis 0fd0c7d725 reimplementation of the sim between two authors. now it takes into account both name and surname. threshold incremented to 1.0 if the name is too short 2020-05-22 17:24:57 +02:00
Claudio Atzori 3cf2796ac6 code formatting 2020-05-22 12:34:00 +02:00
miconis 8bbd1d0501 reimplementation of the author merging in deduprecord creation. implementation of the test class. 2020-05-21 11:52:14 +02:00
Claudio Atzori 42f1a2bf94 bumped project version to 1.2.0-SNAPSHOT 2020-05-11 10:05:57 +02:00
Claudio Atzori fd519df616 new rels produced by dedup workflow must be unique 2020-05-08 19:00:38 +02:00
Claudio Atzori 5b28bb4131 code formatting 2020-05-08 16:49:47 +02:00
miconis 3420998bb4 reltype set in mergerels 2020-05-08 15:43:30 +02:00
Claudio Atzori c79e2f5977 drop workingPath before starting the dedup workflow 2020-05-06 11:27:44 +02:00
miconis 3df703f67d mergerels added to propagate relations 2020-05-04 12:08:12 +02:00
Claudio Atzori 439c6255a2 cleanup 2020-04-29 19:09:07 +02:00
Claudio Atzori 77ac995770 cleaned up poms, added descriptions 2020-04-29 18:44:17 +02:00
miconis 0352d3b0ba entity dumps in dedup compressed 2020-04-29 13:02:34 +02:00
miconis 62e467eb0c assertion numbers updated to fit the new implementation of the pace-core 2020-04-28 11:46:23 +02:00
Claudio Atzori 6f5b899038 reformatted code according to the updated style descriptor 2020-04-28 11:23:29 +02:00
Claudio Atzori a0bdbacdae switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin 2020-04-27 14:52:31 +02:00
Claudio Atzori 7a3f8085f7 switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin 2020-04-27 14:45:40 +02:00
Claudio Atzori 278fc9d276 code formatting 2020-04-23 18:51:38 +02:00
miconis 8d258c85ff spark dedup test fixed, sample for dataset and orp added, test implemented 2020-04-23 18:16:20 +02:00
Claudio Atzori 9ddafd46ca fixed dedup record id prefix, set the correct dataInfo in the DedupRecordFactory 2020-04-23 07:50:18 +02:00
Claudio Atzori 91e72a6944 Dataset based implementation for SparkCreateDedupRecord phase, fixed datasource entity dump supplementing dedup unit tests 2020-04-21 12:06:08 +02:00
miconis 5c9ef08a8e spark dedup test fixed 2020-04-21 10:19:04 +02:00
Claudio Atzori eb8a020859 fixed behaviour of DedupRecordFactory 2020-04-20 18:44:06 +02:00
miconis 1102e32462 SparkDedupTest updated and organization dump fixed 2020-04-20 16:49:01 +02:00
miconis 4da13e4570 Revert "Merge branch 'master' into deduptesting". This reverts commit 772f75d167, reversing changes made to 5f45f2c77f. 2020-04-20 16:04:49 +02:00
miconis 772f75d167 Merge branch 'master' into deduptesting 2020-04-20 14:50:12 +02:00
Claudio Atzori d714bfb4d4 collectedfrom field moved in common parent class Oaf.java 2020-04-20 12:25:19 +02:00
Claudio Atzori 5f45f2c77f Merge branch 'master' into deduptesting 2020-04-18 12:46:40 +02:00
Claudio Atzori ad7a131b18 introduced common project code formatting plugin, works on the commit hook, based on https://github.com/Cosium/git-code-format-maven-plugin, applied to each java class in the project 2020-04-18 12:42:58 +02:00
Claudio Atzori a2938dd059 cleanup 2020-04-18 12:24:22 +02:00
Claudio Atzori 9374ff03ea Merge branch 'master' into deduptesting 2020-04-18 12:06:58 +02:00
Claudio Atzori 71813795f6 various refactorings on the dnet-dedup-openaire workflow 2020-04-18 12:06:23 +02:00
miconis 6450bb0daa test for softwares dedup added. definition of orp, dataset and sw dedup configurations 2020-04-17 17:31:59 +02:00
Claudio Atzori 038ac7afd7 relation consistency workflow separated from dedup scan and creation of CCs 2020-04-17 13:12:44 +02:00
miconis 418cf94642 implementation of the deletedbyinference test in propagating relations 2020-04-17 10:40:21 +02:00
Claudio Atzori cc21bbfb1a Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting 2020-04-16 14:41:37 +02:00
Claudio Atzori ec5dfc068d added spark.sql.shuffle.partitions=3840 to dedup scan wf 2020-04-16 14:41:28 +02:00
miconis 0eccbc318b Deduper class (utilities for dedup) cleaned. Useless methods removed 2020-04-16 12:36:37 +02:00
Claudio Atzori 76d23895e6 Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting 2020-04-16 12:18:32 +02:00
miconis 6a089ec287 minor changes 2020-04-16 12:15:38 +02:00
Claudio Atzori 376efd67de removed prepare statement in spark action 2020-04-16 12:14:16 +02:00
miconis 9b36458b6a Merge branch 'deduptesting' of code-repo.d4science.org:D-Net/dnet-hadoop into deduptesting 2020-04-16 12:13:58 +02:00
miconis cd4d9a148f creating temporary directories in dedup test 2020-04-16 12:13:26 +02:00
Claudio Atzori b39ff36c16 improving the wf definitions 2020-04-16 12:11:37 +02:00
Claudio Atzori 011b342bc9 trying to avoid OOM in SparkPropagateRelation 2020-04-16 11:13:51 +02:00
Claudio Atzori 069ef5eaed trying to avoid OOM in SparkPropagateRelation 2020-04-15 21:23:21 +02:00
Claudio Atzori 8eedfefc98 try to introduce intermediate serialization on hdfs to avoid OOM 2020-04-15 18:35:35 +02:00
miconis 5689d49689 minor changes 2020-04-15 16:34:06 +02:00
miconis 0be2e72be5 further implementation of tests for the deduplication of each entity. publication dump added, empty entity files created 2020-04-08 18:02:30 +02:00
miconis 56fbe689f0 implementation of the tests for each spark action 2020-04-06 16:30:31 +02:00
miconis 53fd624c34 implemented test for sparkcreatesimrels 2020-04-03 18:32:25 +02:00
miconis a61763d149 structure for sparksimrel changed to be compliant with mockito testing 2020-04-02 18:37:53 +02:00
miconis bfa5bc74df minor changes 2020-04-01 19:05:48 +02:00
miconis 9802bcb9fe dedup testing 2020-04-01 18:48:31 +02:00
Claudio Atzori 673e744649 moved openaire specific implementations under dedicated package eu.dnetlib.dhp.oa 2020-03-27 10:42:17 +01:00
Sandro La Bruzzo e71e001b58 commented test that doesn't work 2020-03-26 14:15:21 +01:00
Sandro La Bruzzo 0cd022ad6a merge with master 2020-03-26 14:08:29 +01:00
Claudio Atzori cd7dc3e1ae dhp-dedup-openaire workflow tests upgraded to junit5 2020-03-25 18:04:23 +01:00
Michele Artini ebe45003d9 fixed some junit packages 2020-03-25 16:45:03 +01:00
Claudio Atzori 71ae7dd272 renamed module dnet-dedup to dnet-dedup-openaire 2020-03-25 15:57:09 +01:00