Commit Graph

193 Commits

Author SHA1 Message Date
miconis c7e2d5a59a minor changes 2021-01-25 12:40:45 +01:00
miconis 8fea29177c refactoring, minor changes and implementation of the wf for openorgs with integration of organization phases into the scan wf 2021-01-18 16:48:08 +01:00
miconis 1e1aab83e3 implementation of the raw wf for openorgs: still not complete, some functionalities are missing 2020-12-21 11:58:21 +01:00
Claudio Atzori d9532446eb imported more diffs from master branch; code formatting 2020-12-10 16:14:16 +01:00
Claudio Atzori 1eaad89a3c do not fail on unknown properties when grouping entities by ID 2020-12-10 15:56:11 +01:00
Claudio Atzori 758d27745d cleaning tab characters from text fields 2020-11-27 16:07:24 +01:00
Claudio Atzori 5151850a19 CROSSREF and DATACITE constants moved in common ModelConstants 2020-11-26 13:08:36 +01:00
Claudio Atzori d0d5525d40 minor changes 2020-11-26 11:04:17 +01:00
Claudio Atzori 13eae4b31e GroupEntitiesSparkJob must read all graph paths but relations 2020-11-26 11:04:01 +01:00
Claudio Atzori 76363a8512 SimpleDateFormat is not thread safe; improved error reporting in case of invalid dates 2020-11-26 11:03:12 +01:00
Claudio Atzori e208b03755 renamed workflow 2020-11-25 14:55:50 +01:00
Claudio Atzori dfd6205b95 Consistency graph workflow merges all the entities by ID 2020-11-25 14:55:32 +01:00
Claudio Atzori e5da4ee9b1 dedup workflow using the common PidComparator 2020-11-04 15:02:02 +01:00
Claudio Atzori 385214eeae code formatting 2020-10-30 15:47:05 +01:00
miconis c4a59d1b9a merge with the master to port the new packages 2020-10-20 16:07:30 +02:00
miconis 708d887e64 minor changes 2020-10-20 15:12:19 +02:00
miconis 0e54803177 bug fix in the id generator and implementation of jobs for organization dedup 2020-10-20 12:19:46 +02:00
miconis 6f8720982c bug fix in the idgenerator and test implementation 2020-10-09 09:30:23 +02:00
Sandro La Bruzzo 734934e2eb fixed error on empty intersection with publication and relation on export to OAF 2020-10-08 17:29:29 +02:00
Sandro La Bruzzo eec418cd26 moved AuthoreMerger into dhp-common 2020-10-08 10:33:55 +02:00
miconis 1804c5d809 refactoring: classes moved in the right package 2020-10-06 16:44:51 +02:00
miconis 7093355487 bug fix and minor changes 2020-10-06 16:21:34 +02:00
miconis 5a8bc329c5 bug fix in the result merge: it takes the correct bestaccessright based on the license instead of the trust 2020-10-06 15:26:44 +02:00
miconis a2ac7e52fb implementation of the workflow for new organizations in openorgs 2020-10-06 13:58:09 +02:00
Claudio Atzori 23f64d9eb4 updated dedup tests following the dnet-pace-core library update 2020-10-02 14:30:53 +02:00
miconis e3f7798d1b minor changes in dedup tests, bug fix in the idgenerator and pace-core version update 2020-09-29 15:31:46 +02:00
miconis 4cf79f32eb implementation of the oozie wf to prepare the openorgs input: relations between organizations 2020-09-25 11:29:51 +02:00
miconis 259362ef47 implementation of the job to collect simrels from postgres db 2020-09-22 09:43:27 +02:00
Sandro La Bruzzo 168bfb496a adopted dedup to the new schema 2020-07-31 09:06:57 +02:00
miconis d47352cbc7 refactoring of the procedure for the id generation, minor changes and addition of a comparison on the original id and the origin datasource 2020-07-24 20:10:47 +02:00
miconis b260fee787 implementation of the dedup_id generation using pids to make the graph more stable 2020-07-22 17:29:48 +02:00
Claudio Atzori de72b1c859 cleanup 2020-07-20 09:59:11 +02:00
Claudio Atzori 805de4eca1 fix: filter the blocks with size = 1 2020-07-16 10:11:32 +02:00
Claudio Atzori b90389bac4 code formatting 2020-07-15 11:24:48 +02:00
Claudio Atzori 4e6f46e8fa filter blocks with one record only 2020-07-15 11:22:20 +02:00
Claudio Atzori 06def0c0cb SparkBlockStats allows to repartition the input rdd via the numPartitions workflow parameter 2020-07-13 20:09:06 +02:00
miconis b52c246aed merge done 2020-07-13 19:57:02 +02:00
miconis b8a45041fd minor changes 2020-07-13 19:53:18 +02:00
Claudio Atzori 66f9f6d323 adjusted parameters for the dedup stats workflow 2020-07-13 19:26:46 +02:00
miconis 03ecfa5ebd implementation of the test class for the new block stats spark action 2020-07-13 18:48:23 +02:00
miconis 10e08ccf45 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-13 18:22:45 +02:00
miconis 9258e4f095 implementation of a new workflow to compute statistics on the blocks 2020-07-13 18:22:34 +02:00
Claudio Atzori c6f6fb0f28 code formatting 2020-07-13 16:46:13 +02:00
Claudio Atzori 344a90c2e6 updated assertions in propagateRelationTest 2020-07-13 16:32:04 +02:00
Claudio Atzori 1143f426aa WIP SparkCreateMergeRels distinct relations 2020-07-13 16:13:36 +02:00
Claudio Atzori 8c67938ad0 configurable number of partitions used in the SparkCreateSimRels phase 2020-07-13 16:07:07 +02:00
Claudio Atzori c73168b18e Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting 2020-07-13 15:54:58 +02:00
Claudio Atzori c8284bab06 WIP SparkCreateMergeRels distinct relations 2020-07-13 15:54:51 +02:00
Sandro La Bruzzo 1d133b7fe6 update test 2020-07-13 15:52:41 +02:00
Claudio Atzori 7dd91edf43 parsing of optional parameter 2020-07-13 15:40:41 +02:00
Claudio Atzori 4c101a9d66 WIP SparkCreateMergeRels distinct relations 2020-07-13 15:31:38 +02:00
Claudio Atzori 8a612d861a WIP SparkCreateMergeRels distinct relations 2020-07-13 15:30:57 +02:00
Sandro La Bruzzo 9ef2385022 implemented test for cut of connected component 2020-07-13 15:28:17 +02:00
Sandro La Bruzzo d561b2dd21 implemented cut of connected component 2020-07-13 14:18:42 +02:00
Claudio Atzori e2093e42db Merge branch 'master' into deduptesting 2020-07-13 10:57:49 +02:00
Claudio Atzori 7a3fd9f54c dedup relation aggregator moved into dedicated class 2020-07-13 10:11:36 +02:00
Alessia Bardi 7e96105947 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-07-12 19:29:12 +02:00
Alessia Bardi b7a39731a6 assert, not print 2020-07-12 19:28:56 +02:00
Claudio Atzori 770adc26e9 WIP aggregator to make relationships unique 2020-07-10 19:35:10 +02:00
Claudio Atzori ecf119f37a Merge branch 'master' into deduptesting 2020-07-10 19:04:16 +02:00
Michele Artini e1ae964bc4 stats 2020-07-10 16:12:08 +02:00
Claudio Atzori 752d28f8eb make the relations produced by the dedup SparkPropagateRelation job unique 2020-07-10 15:09:50 +02:00
Claudio Atzori 3c728aaa0c trying to overcome OOM errors during duplicate scan phase 2020-07-08 22:39:51 +02:00
Claudio Atzori 18c555cd79 Merge branch 'master' into deduptesting 2020-07-08 22:32:01 +02:00
Claudio Atzori 4365cf41d7 trying to overcome OOM errors during duplicate scan phase 2020-07-08 22:31:46 +02:00
Alessia Bardi 853e8d7987 test for software merge 2020-07-08 17:03:53 +02:00
Claudio Atzori c3d67f709a adjusted dedup configuration for result entities: using new wordssuffixprefix clustering function, removed ngrampairs, adjusted queueMaxSize (800) and slidingWindowSize (80) 2020-07-02 17:35:22 +02:00
Claudio Atzori 0f77cac4b5 fix: deduper must use queueMaxSize instead of groupMaxSize for the block definition 2020-07-02 12:43:51 +02:00
Claudio Atzori 9cd27183b6 [maven-release-plugin] prepare for next development iteration 2020-06-22 11:27:44 +02:00
Claudio Atzori 1e3dab0631 [maven-release-plugin] prepare release dhp-1.2.3 2020-06-22 11:27:39 +02:00
miconis 11b77b9f4e json dumps for entity merge test modified to fit the new model. title merge adjusted to fix the error 2020-06-16 18:31:11 +02:00
Claudio Atzori c4d9f1837f [maven-release-plugin] prepare for next development iteration 2020-06-12 12:21:08 +02:00
Claudio Atzori f0746a7605 [maven-release-plugin] prepare release dhp-1.2.2 2020-06-12 12:21:03 +02:00
Claudio Atzori 7b288a94cb code formatting 2020-05-26 09:54:13 +02:00
Claudio Atzori 7582532e73 [maven-release-plugin] prepare for next development iteration 2020-05-25 19:48:18 +02:00
Claudio Atzori 01c2e93395 [maven-release-plugin] prepare release dhp-1.2.1 2020-05-25 19:48:14 +02:00
miconis da1e5cf557 implementation of the result title merge. main title with higher trust, distinct between the others 2020-05-25 18:02:57 +02:00
Claudio Atzori 7181807e64 code formatting 2020-05-23 09:51:48 +02:00
miconis 0fd0c7d725 reimplementation of the sim between two authors. now it takes into account both name and surname. threshold incremented to 1.0 if the name is too short 2020-05-22 17:24:57 +02:00
Claudio Atzori 3cf2796ac6 code formatting 2020-05-22 12:34:00 +02:00
miconis 8bbd1d0501 reimplementation of the author merging in deduprecord creation. implementation of the test class. 2020-05-21 11:52:14 +02:00
Claudio Atzori 60c40618d3 [maven-release-plugin] prepare for next development iteration 2020-05-11 10:17:14 +02:00
Claudio Atzori c267d958d5 [maven-release-plugin] prepare release dhp-1.2.0 2020-05-11 10:17:10 +02:00
Claudio Atzori 42f1a2bf94 bumped project version to 1.2.0-SNAPSHOT 2020-05-11 10:05:57 +02:00
Claudio Atzori fd519df616 new rels produced by dedup workflow must be unique 2020-05-08 19:00:38 +02:00
Claudio Atzori 0ccc864ad9 [maven-release-plugin] prepare for next development iteration 2020-05-08 17:01:31 +02:00
Claudio Atzori 6e47c724c6 [maven-release-plugin] prepare release dhp-1.1.7 2020-05-08 17:01:27 +02:00
Claudio Atzori 5b28bb4131 code formatting 2020-05-08 16:49:47 +02:00
miconis 3420998bb4 reltype set in mergerels 2020-05-08 15:43:30 +02:00
Claudio Atzori c79e2f5977 drop workingPath before starting the dedup workflow 2020-05-06 11:27:44 +02:00
miconis 3df703f67d mergerels added to propagate relations 2020-05-04 12:08:12 +02:00
Claudio Atzori 439c6255a2 cleanup 2020-04-29 19:09:07 +02:00
Claudio Atzori 77ac995770 cleaned up poms, added descriptions 2020-04-29 18:44:17 +02:00
miconis 0352d3b0ba entity dumps in dedup compressed 2020-04-29 13:02:34 +02:00
miconis 62e467eb0c assertion numbers updated to fit the new implementation of the pace-core 2020-04-28 11:46:23 +02:00
Claudio Atzori 6f5b899038 reformatted code according to the updated style descriptor 2020-04-28 11:23:29 +02:00
Claudio Atzori a0bdbacdae switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin 2020-04-27 14:52:31 +02:00
Claudio Atzori 7a3f8085f7 switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin 2020-04-27 14:45:40 +02:00
Claudio Atzori 278fc9d276 code formatting 2020-04-23 18:51:38 +02:00
miconis 8d258c85ff spark dedup test fixed, sample for dataset and orp added, test implemented 2020-04-23 18:16:20 +02:00