Commit Graph

1264 Commits

Author SHA1 Message Date
Claudio Atzori 805de4eca1 fix: filter the blocks with size = 1 2020-07-16 10:11:32 +02:00
Claudio Atzori 4b9fb2ffb8 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-07-15 11:26:04 +02:00
Claudio Atzori b90389bac4 code formatting 2020-07-15 11:24:48 +02:00
Claudio Atzori 4e6f46e8fa filter blocks with one record only 2020-07-15 11:22:20 +02:00
Michele Artini 262c29463e relations with multiple datasources 2020-07-15 09:18:40 +02:00
Claudio Atzori 7d6e269b40 reverted CreateRelatedEntitiesJob_phase1 to its previous state 2020-07-13 22:54:04 +02:00
Claudio Atzori 8e97598eb4 avoid to NPE in case of null instances 2020-07-13 20:46:14 +02:00
Claudio Atzori 06def0c0cb SparkBlockStats allows to repartition the input rdd via the numPartitions workflow parameter 2020-07-13 20:09:06 +02:00
miconis b52c246aed merge done 2020-07-13 19:57:02 +02:00
miconis b8a45041fd minor changes 2020-07-13 19:53:18 +02:00
Claudio Atzori 66f9f6d323 adjusted parameters for the dedup stats workflow 2020-07-13 19:26:46 +02:00
miconis 03ecfa5ebd implementation of the test class for the new block stats spark action 2020-07-13 18:48:23 +02:00
miconis 10e08ccf45 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-13 18:22:45 +02:00
miconis 9258e4f095 implementation of a new workflow to compute statistics on the blocks 2020-07-13 18:22:34 +02:00
Claudio Atzori c6f6fb0f28 code formatting 2020-07-13 16:46:13 +02:00
Claudio Atzori 8d2102d7d2 Merge branch 'deduptesting' 2020-07-13 16:32:43 +02:00
Claudio Atzori 344a90c2e6 updated assertions in propagateRelationTest 2020-07-13 16:32:04 +02:00
Claudio Atzori 1143f426aa WIP SparkCreateMergeRels distinct relations 2020-07-13 16:13:36 +02:00
Claudio Atzori 8c67938ad0 configurable number of partitions used in the SparkCreateSimRels phase 2020-07-13 16:07:07 +02:00
Claudio Atzori c73168b18e Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting 2020-07-13 15:54:58 +02:00
Claudio Atzori c8284bab06 WIP SparkCreateMergeRels distinct relations 2020-07-13 15:54:51 +02:00
Sandro La Bruzzo 1d133b7fe6 update test 2020-07-13 15:52:41 +02:00
Michele Artini 3635d05061 poms 2020-07-13 15:52:23 +02:00
Claudio Atzori 7dd91edf43 parsing of optional parameter 2020-07-13 15:40:41 +02:00
Claudio Atzori 4c101a9d66 WIP SparkCreateMergeRels distinct relations 2020-07-13 15:31:38 +02:00
Claudio Atzori 8a612d861a WIP SparkCreateMergeRels distinct relations 2020-07-13 15:30:57 +02:00
Sandro La Bruzzo 9ef2385022 implemented test for cut of connected component 2020-07-13 15:28:17 +02:00
Sandro La Bruzzo d561b2dd21 implemented cut of connected component 2020-07-13 14:18:42 +02:00
Miriam Baglioni 8e0e090d7a merge upstream 2020-07-13 12:46:55 +02:00
Claudio Atzori e2093e42db Merge branch 'master' into deduptesting 2020-07-13 10:57:49 +02:00
Michele Artini 2c4ed9a043 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-13 10:55:39 +02:00
Michele Artini ccbe5c5658 fixed import of eu.dnetlib.dhp:dnet-openaire-broker-common 2020-07-13 10:55:27 +02:00
Claudio Atzori 7a3fd9f54c dedup relation aggregator moved into dedicated class 2020-07-13 10:11:36 +02:00
Alessia Bardi 7e96105947 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-07-12 19:29:12 +02:00
Alessia Bardi b7a39731a6 assert, not print 2020-07-12 19:28:56 +02:00
Miriam Baglioni f9ad6f3255 Merge branch 'dump' of code-repo.d4science.org:miriam.baglioni/dnet-hadoop into dump 2020-07-10 19:42:53 +02:00
Miriam Baglioni c27f12d6e8 avoid to consider _SUCCESS file 2020-07-10 19:42:23 +02:00
Claudio Atzori 770adc26e9 WIP aggregator to make relationships unique 2020-07-10 19:35:10 +02:00
Claudio Atzori ecf119f37a Merge branch 'master' into deduptesting 2020-07-10 19:04:16 +02:00
Claudio Atzori 31071e363f Merge branch 'provision_indexing' 2020-07-10 19:03:57 +02:00
Claudio Atzori 06c1913062 added different limits for grouping by source and by target, incremented spark.sql.shuffle.partitions for the join operations 2020-07-10 19:03:33 +02:00
Claudio Atzori cc77446dc4 added dbSchema parameter to the raw_db workflow 2020-07-10 19:01:50 +02:00
Claudio Atzori 4c3836f62e materialize the related entities before joining them 2020-07-10 19:00:44 +02:00
Michele Artini e1ae964bc4 stats 2020-07-10 16:12:08 +02:00
Claudio Atzori 752d28f8eb make the relations produced by the dedup SparkPropagateRelation job unique 2020-07-10 15:09:50 +02:00
Sandro La Bruzzo c01efed79b Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-10 14:44:57 +02:00
Sandro La Bruzzo a7d3977481 added generation of EBI Dataset 2020-07-10 14:44:50 +02:00
Claudio Atzori b21866a2da allow to set different relations cut points by source and by target; adjusted weight assigned to relationship types 2020-07-10 13:59:48 +02:00
Claudio Atzori ff4d6214f1 experimenting with pruning of relations 2020-07-10 10:06:41 +02:00
Miriam Baglioni faea30cda0 - 2020-07-09 14:05:21 +02:00
Michele Artini 2d742a84ae DedupConfig as json file 2020-07-09 12:53:46 +02:00
Miriam Baglioni a634794242 merge upstream 2020-07-09 11:46:51 +02:00
Michele Artini a44b9b36b9 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-09 11:02:31 +02:00
Michele Artini 1c6a171633 updated pom 2020-07-09 11:02:09 +02:00
Claudio Atzori 3c728aaa0c trying to overcome OOM errors during duplicate scan phase 2020-07-08 22:39:51 +02:00
Claudio Atzori 18c555cd79 Merge branch 'master' into deduptesting 2020-07-08 22:32:01 +02:00
Claudio Atzori 4365cf41d7 trying to overcome OOM errors during duplicate scan phase 2020-07-08 22:31:46 +02:00
Claudio Atzori 67e1d222b6 bulk cleaning when found null or empty, sets bestaccessrights evaluating the result instances 2020-07-08 17:53:35 +02:00
Alessia Bardi 853e8d7987 test for software merge 2020-07-08 17:03:53 +02:00
Claudio Atzori 610d377d57 first implementation of the BETA & PROD graphs merge procedure 2020-07-08 16:54:26 +02:00
Alessia Bardi 9a898c0e4c Json schema generator 2020-07-08 12:52:00 +02:00
Alessia Bardi 636f9ce7d6 json schema generator lib 2020-07-08 12:50:57 +02:00
Alessia Bardi 8f83b726fa Dump json schema compliant to json schema Draft 7 2020-07-08 12:48:46 +02:00
Claudio Atzori e2ea30f89d updated graph construction workflow definition: cleaning wf moved at the bottom to include cleaning of the information produced by the enrichment workflows 2020-07-08 12:16:24 +02:00
Miriam Baglioni 1b0b968548 fixed issue on substring 2020-07-08 12:11:51 +02:00
Miriam Baglioni 7fe00cb4fb - 2020-07-08 10:29:37 +02:00
Miriam Baglioni 375ef07d7b changed the description for the upload 2020-07-07 18:41:27 +02:00
Miriam Baglioni 35c8265793 added the json extension to filename 2020-07-07 18:29:49 +02:00
Miriam Baglioni 81434f8e5e added method newInstance 2020-07-07 18:26:10 +02:00
Miriam Baglioni 817cddfc52 - 2020-07-07 18:25:12 +02:00
Miriam Baglioni a66aa9bd83 removed unuseful tests 2020-07-07 18:25:00 +02:00
Miriam Baglioni 9b20a21b24 removed unuseful tests 2020-07-07 18:23:37 +02:00
Miriam Baglioni 8a1b42ff21 added check to verify that dump contains at least one product 2020-07-07 18:21:35 +02:00
Miriam Baglioni d86adb82a7 - 2020-07-07 18:20:51 +02:00
Miriam Baglioni b2782025f6 enabled the whole workflow to run. Added property to give priority to dependency in the classpath - to solve conflicts 2020-07-07 18:10:47 +02:00
Miriam Baglioni 83d2c84b77 added constraints to xquery so that to get only profiles with status manager or all 2020-07-07 18:09:48 +02:00
Miriam Baglioni 4c8d86493c - 2020-07-07 18:09:06 +02:00
Miriam Baglioni 0208bc18f3 added new resource for testing 2020-07-07 17:47:24 +02:00
Miriam Baglioni f5bb65c9ef the json schema for the dump of the results 2020-07-07 17:34:40 +02:00
Michele Artini dffa0b01a2 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-07 15:37:29 +02:00
Michele Artini efadbdb2bc fixed a bug with duplicated events 2020-07-07 15:37:13 +02:00
Claudio Atzori 8af8e7481a code formatting 2020-07-07 14:23:34 +02:00
Claudio Atzori b383ed42fa pass optional parameter relationFilter to the PrepareRelationJob implementation 2020-07-07 14:21:28 +02:00
Claudio Atzori 911894a987 Merge branch 'deduptesting' 2020-07-07 14:20:43 +02:00
Miriam Baglioni c19818a3f8 merge branch with fork master 2020-07-06 13:58:23 +02:00
Miriam Baglioni d22240c0ba merge upstream 2020-07-06 13:58:02 +02:00
Michele Artini edf6c6c4dc Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-03 11:48:24 +02:00
Michele Artini 04bebb708c some fixes 2020-07-03 11:48:12 +02:00
Claudio Atzori c3d67f709a adjusted dedup configuration for result entities: using new wordssuffixprefix clustering function, removed ngrampairs, adjusted queueMaxSize (800) and slidingWindowSize (80) 2020-07-02 17:35:22 +02:00
Miriam Baglioni f8bf4acd76 - 2020-07-02 16:03:11 +02:00
Miriam Baglioni e6c79d44e6 - 2020-07-02 16:02:02 +02:00
Miriam Baglioni d7f6f0c216 changed code to use other lib 2020-07-02 16:01:34 +02:00
Miriam Baglioni 8fdc9e070c added dependency to OkHttp 2020-07-02 16:01:08 +02:00
Miriam Baglioni 94500a581b merge branch with fork master 2020-07-02 14:25:39 +02:00
Miriam Baglioni c133a23cf0 merge upstream 2020-07-02 14:24:57 +02:00
Claudio Atzori 1d39f7901c Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-07-02 12:45:01 +02:00
Claudio Atzori 0f77cac4b5 fix: deduper must use queueMaxSize instead of groupMaxSize for the block definition 2020-07-02 12:43:51 +02:00
Sandro La Bruzzo 18b9330312 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-02 12:43:19 +02:00
Michele Artini b413db0bff white/blacklists 2020-07-02 12:43:03 +02:00
Claudio Atzori d380b85246 unit test for the preparation of the relations 2020-07-02 12:42:13 +02:00