Commit Graph

131 Commits

Author SHA1 Message Date
Claudio Atzori e5da4ee9b1 dedup workflow using the common PidComparator 2020-11-04 15:02:02 +01:00
Claudio Atzori 385214eeae code formatting 2020-10-30 15:47:05 +01:00
miconis c4a59d1b9a merge with the master to port the new packages 2020-10-20 16:07:30 +02:00
miconis 708d887e64 minor changes 2020-10-20 15:12:19 +02:00
miconis 0e54803177 bug fix in the id generator and implementation of jobs for organization dedup 2020-10-20 12:19:46 +02:00
miconis 6f8720982c bug fix in the idgenerator and test implementation 2020-10-09 09:30:23 +02:00
Sandro La Bruzzo 734934e2eb fixed error on empty intersection with publication and relation on export to OAF 2020-10-08 17:29:29 +02:00
Sandro La Bruzzo eec418cd26 moved AuthoreMerger into dhp-common 2020-10-08 10:33:55 +02:00
miconis 1804c5d809 refactoring: classes moved in the right package 2020-10-06 16:44:51 +02:00
miconis 7093355487 bug fix and minor changes 2020-10-06 16:21:34 +02:00
miconis 5a8bc329c5 bug fix in the result merge: it takes the correct bestaccessright basing on the license instead of the trust 2020-10-06 15:26:44 +02:00
miconis a2ac7e52fb implementation of the workflow for new organizations in openorgs 2020-10-06 13:58:09 +02:00
Claudio Atzori 23f64d9eb4 updated dedup tests following the dnet-pace-core library update 2020-10-02 14:30:53 +02:00
miconis e3f7798d1b minor changes in dedup tests, bug fix in the idgenerator and pace-core version update 2020-09-29 15:31:46 +02:00
miconis 4cf79f32eb implementation of the oozie wf to prepare the openorgs input: relations between organizations 2020-09-25 11:29:51 +02:00
miconis 259362ef47 implementation of the job to collect simrels from postgres db 2020-09-22 09:43:27 +02:00
Sandro La Bruzzo 168bfb496a adopted dedup to the new schema 2020-07-31 09:06:57 +02:00
miconis d47352cbc7 refactoring of the procedure for the id generation, minor changes and addition of a comparation on the original id and the origin datasource 2020-07-24 20:10:47 +02:00
miconis b260fee787 implementation of the dedup_id generation using pids to make the graph more stable 2020-07-22 17:29:48 +02:00
Claudio Atzori de72b1c859 cleanup 2020-07-20 09:59:11 +02:00
Claudio Atzori 805de4eca1 fix: filter the blocks with size = 1 2020-07-16 10:11:32 +02:00
Claudio Atzori b90389bac4 code formatting 2020-07-15 11:24:48 +02:00
Claudio Atzori 4e6f46e8fa filter blocks with one record only 2020-07-15 11:22:20 +02:00
Claudio Atzori 06def0c0cb SparkBlockStats allows to repartition the input rdd via the numPartitions workflow parameter 2020-07-13 20:09:06 +02:00
miconis b52c246aed merge done 2020-07-13 19:57:02 +02:00
miconis b8a45041fd minor changes 2020-07-13 19:53:18 +02:00
Claudio Atzori 66f9f6d323 adjusted parameters for the dedup stats workflow 2020-07-13 19:26:46 +02:00
miconis 03ecfa5ebd implementation of the test class for the new block stats spark action 2020-07-13 18:48:23 +02:00
miconis 10e08ccf45 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-13 18:22:45 +02:00
miconis 9258e4f095 implementation of a new workflow to compute statistics on the blocks 2020-07-13 18:22:34 +02:00
Claudio Atzori c6f6fb0f28 code formatting 2020-07-13 16:46:13 +02:00
Claudio Atzori 344a90c2e6 updated assertions in propagateRelationTest 2020-07-13 16:32:04 +02:00
Claudio Atzori 1143f426aa WIP SparkCreateMergeRels distinct relations 2020-07-13 16:13:36 +02:00
Claudio Atzori 8c67938ad0 configurable number of partitions used in the SparkCreateSimRels phase 2020-07-13 16:07:07 +02:00
Claudio Atzori c73168b18e Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting 2020-07-13 15:54:58 +02:00
Claudio Atzori c8284bab06 WIP SparkCreateMergeRels distinct relations 2020-07-13 15:54:51 +02:00
Sandro La Bruzzo 1d133b7fe6 update test 2020-07-13 15:52:41 +02:00
Claudio Atzori 7dd91edf43 parsing of optional parameter 2020-07-13 15:40:41 +02:00
Claudio Atzori 4c101a9d66 WIP SparkCreateMergeRels distinct relations 2020-07-13 15:31:38 +02:00
Claudio Atzori 8a612d861a WIP SparkCreateMergeRels distinct relations 2020-07-13 15:30:57 +02:00
Sandro La Bruzzo 9ef2385022 implemented test for cut of connected component 2020-07-13 15:28:17 +02:00
Sandro La Bruzzo d561b2dd21 implemented cut of connected component 2020-07-13 14:18:42 +02:00
Claudio Atzori e2093e42db Merge branch 'master' into deduptesting 2020-07-13 10:57:49 +02:00
Claudio Atzori 7a3fd9f54c dedup relation aggregator moved into dedicated class 2020-07-13 10:11:36 +02:00
Alessia Bardi 7e96105947 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-07-12 19:29:12 +02:00
Alessia Bardi b7a39731a6 assert, not print 2020-07-12 19:28:56 +02:00
Claudio Atzori 770adc26e9 WIP aggregator to make relationships unique 2020-07-10 19:35:10 +02:00
Claudio Atzori ecf119f37a Merge branch 'master' into deduptesting 2020-07-10 19:04:16 +02:00
Michele Artini e1ae964bc4 stats 2020-07-10 16:12:08 +02:00
Claudio Atzori 752d28f8eb make the relations produced by the dedup SparkPropagateRelation jon unique 2020-07-10 15:09:50 +02:00