Commit Graph

1163 Commits

Author SHA1 Message Date
Claudio Atzori 7a3fd9f54c dedup relation aggregator moved into dedicated class 2020-07-13 10:11:36 +02:00
Alessia Bardi 7e96105947 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-07-12 19:29:12 +02:00
Alessia Bardi b7a39731a6 assert, not print 2020-07-12 19:28:56 +02:00
Claudio Atzori 770adc26e9 WIP aggregator to make relationships unique 2020-07-10 19:35:10 +02:00
Claudio Atzori ecf119f37a Merge branch 'master' into deduptesting 2020-07-10 19:04:16 +02:00
Claudio Atzori 31071e363f Merge branch 'provision_indexing' 2020-07-10 19:03:57 +02:00
Claudio Atzori 06c1913062 added different limits for grouping by source and by target, incremented spark.sql.shuffle.partitions for the join operations 2020-07-10 19:03:33 +02:00
Claudio Atzori cc77446dc4 added dbSchema parameter to the raw_db workflow 2020-07-10 19:01:50 +02:00
Claudio Atzori 4c3836f62e materialize the related entities before joining them 2020-07-10 19:00:44 +02:00
Michele Artini e1ae964bc4 stats 2020-07-10 16:12:08 +02:00
Claudio Atzori 752d28f8eb make the relations produced by the dedup SparkPropagateRelation jon unique 2020-07-10 15:09:50 +02:00
Sandro La Bruzzo c01efed79b Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-10 14:44:57 +02:00
Sandro La Bruzzo a7d3977481 added generation of EBI Dataset 2020-07-10 14:44:50 +02:00
Claudio Atzori b21866a2da allow to set different to relations cut points by source and by target; adjusted weight assigned to relationship types 2020-07-10 13:59:48 +02:00
Claudio Atzori ff4d6214f1 experimenting with pruning of relations 2020-07-10 10:06:41 +02:00
Michele Artini 2d742a84ae DedupConfig as json file 2020-07-09 12:53:46 +02:00
Michele Artini a44b9b36b9 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-09 11:02:31 +02:00
Michele Artini 1c6a171633 updated pom 2020-07-09 11:02:09 +02:00
Claudio Atzori 3c728aaa0c trying to overcome OOM errors during duplicate scan phase 2020-07-08 22:39:51 +02:00
Claudio Atzori 18c555cd79 Merge branch 'master' into deduptesting 2020-07-08 22:32:01 +02:00
Claudio Atzori 4365cf41d7 trying to overcome OOM errors during duplicate scan phase 2020-07-08 22:31:46 +02:00
Claudio Atzori 67e1d222b6 bulk cleaning when found null or empty, sets bestaccessrights evaluating the result instances 2020-07-08 17:53:35 +02:00
Alessia Bardi 853e8d7987 test for software merge 2020-07-08 17:03:53 +02:00
Claudio Atzori 610d377d57 first implementation of the BETA & PROD graphs merge procedure 2020-07-08 16:54:26 +02:00
Claudio Atzori e2ea30f89d updated graph construction workflow definition: cleaning wf moved at the bottom to include cleaning of the information produced by the enrichment workflows 2020-07-08 12:16:24 +02:00
Michele Artini dffa0b01a2 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-07 15:37:29 +02:00
Michele Artini efadbdb2bc fixed a bug with duplicated events 2020-07-07 15:37:13 +02:00
Claudio Atzori 8af8e7481a code formatting 2020-07-07 14:23:34 +02:00
Claudio Atzori b383ed42fa pass optional parameter relationFilter to the PrepareRelationJob implementation 2020-07-07 14:21:28 +02:00
Claudio Atzori 911894a987 Merge branch 'deduptesting' 2020-07-07 14:20:43 +02:00
Michele Artini edf6c6c4dc Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-03 11:48:24 +02:00
Michele Artini 04bebb708c some fixes 2020-07-03 11:48:12 +02:00
Claudio Atzori c3d67f709a adjusted dedup configuration for result entities: using new wordssuffixprefix clustering function, removed ngrampairs, adjusted queueMaxSize (800) and slidingWindowSize (80) 2020-07-02 17:35:22 +02:00
Claudio Atzori 1d39f7901c Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-07-02 12:45:01 +02:00
Claudio Atzori 0f77cac4b5 fix: deduper must use queueMaxSize instead of groupMaxSize for the block definition 2020-07-02 12:43:51 +02:00
Sandro La Bruzzo 18b9330312 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-02 12:43:19 +02:00
Michele Artini b413db0bff white/blacklists 2020-07-02 12:43:03 +02:00
Claudio Atzori d380b85246 unit test for the preparation of the relations 2020-07-02 12:42:13 +02:00
Claudio Atzori ed1c7e5d75 fixed workflow for the import of the claims alone 2020-07-02 12:40:21 +02:00
Sandro La Bruzzo 07f0723fa7 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-02 12:37:49 +02:00
Sandro La Bruzzo 1d420eedb4 added generation of EBI Dataset 2020-07-02 12:37:43 +02:00
Claudio Atzori e4a29a4513 fixed workflow for the import of the claims alone 2020-07-02 12:36:33 +02:00
Michele Artini 3bcdfbabe9 list with limits 2020-07-01 08:42:39 +02:00
Michele Artini 59a5421c24 indexing, accumulators, limited lists 2020-06-30 16:17:09 +02:00
Michele Artini 6f13673464 accumulators 2020-06-29 16:33:32 +02:00
Sandro La Bruzzo dab783b173 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-29 09:05:00 +02:00
Michele Artini a6ea432435 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-29 08:44:20 +02:00
Michele Artini 35ae381d28 all events matchers 2020-06-29 08:43:56 +02:00
Claudio Atzori 7817338e05 added test to verify the relation pre-processing 2020-06-26 17:58:33 +02:00
Claudio Atzori 8d59fdf34e WIP: dataset based PrepareRelationsJob 2020-06-26 14:32:58 +02:00
Michele Artini 2393d9da2f limits 2020-06-26 11:20:45 +02:00
Sandro La Bruzzo 96ce124b59 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-25 17:00:43 +02:00
Michele Artini 408165a756 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-25 15:53:35 +02:00
Michele Artini e8fb305f18 compilation of event map 2020-06-25 15:53:20 +02:00
Michele Artini 4eb3e109d7 compilation of event map 2020-06-25 15:45:50 +02:00
Claudio Atzori d839e88783 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-06-25 14:06:30 +02:00
Claudio Atzori 6f5771c1c9 sets author.rank when null 2020-06-25 14:06:21 +02:00
Michele Artini e28033c6d8 some fixes 2020-06-25 13:01:09 +02:00
Claudio Atzori 216975c4ec restored complete provision workflow 2020-06-25 12:55:52 +02:00
Claudio Atzori 2d77d3a388 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-06-25 12:54:30 +02:00
Claudio Atzori 93f627ea51 code formatting 2020-06-25 12:54:21 +02:00
Miriam Baglioni 05a99cfb61 change the position of value and description elements in the workflow definition 2020-06-25 12:36:08 +02:00
Claudio Atzori 7df2712824 Merge branch 'provision_indexing' 2020-06-25 12:22:41 +02:00
Claudio Atzori e62333192c WIP: prepare relation job 2020-06-25 12:22:18 +02:00
Claudio Atzori 6933ec11fb WIP: prepare relation job 2020-06-25 11:04:12 +02:00
Sandro La Bruzzo a6c0faac70 added test to verify secondary sorting 2020-06-25 10:48:15 +02:00
Claudio Atzori 69b0391708 WIP: prepare relation job 2020-06-25 10:19:56 +02:00
Michele Artini abcbebcbb4 fixed generation of ids 2020-06-25 09:50:46 +02:00
Michele Artini 77d2a1b1c4 params to choose sql queries for beta or production 2020-06-25 09:28:13 +02:00
Claudio Atzori 46e76affeb WIP: prepare relation job 2020-06-24 19:01:15 +02:00
Claudio Atzori 0e723d378b added default from vocab for missing instance.refereed; remove spurious prefixes from orcid values; WIP: prepare relation job 2020-06-24 18:34:42 +02:00
Michele Artini 202f6e62ff Splitted join wf 2020-06-24 15:47:06 +02:00
Sandro La Bruzzo 96689a8994 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-24 14:06:50 +02:00
Sandro La Bruzzo 46631a4421 updated mapping scholexplorer to OAF 2020-06-24 14:06:38 +02:00
Michele Artini e53dd62e87 minor changes 2020-06-24 09:24:45 +02:00
Michele Artini 8b9933b934 refactoring aggregators 2020-06-24 08:57:13 +02:00
Michele Artini d13e3d3f68 fixed paths 2020-06-23 11:01:42 +02:00
Michele Artini 8386c6f90d filter of valid resultResult relations 2020-06-23 10:24:15 +02:00
Michele Artini 38bb45d0b6 test osf:refereed 2020-06-23 10:14:39 +02:00
Michele Artini c3286f4c37 fixed relType 2020-06-23 09:32:32 +02:00
Michele Artini af2f7705fc partial refactoring of some joins 2020-06-23 08:37:35 +02:00
Claudio Atzori 8a3bc7c183 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-06-22 14:12:33 +02:00
Claudio Atzori e162ba5075 added dnet workflows to orchestrate the execution of graph2hive, updateSolr and updateStats oozie wfs 2020-06-22 14:12:28 +02:00
Michele Artini 3ce20c198e reformatting 2020-06-22 12:14:25 +02:00
Michele Artini ed787398b3 refactoring wf 2020-06-22 11:45:14 +02:00
Claudio Atzori 9cd27183b6 [maven-release-plugin] prepare for next development iteration 2020-06-22 11:27:44 +02:00
Claudio Atzori 1e3dab0631 [maven-release-plugin] prepare release dhp-1.2.3 2020-06-22 11:27:39 +02:00
Claudio Atzori 961a0d0b49 [actionset promotion] log debugging info in case of error in the action payload extraction or parsing the data 2020-06-22 10:20:45 +02:00
Claudio Atzori 5e8b922962 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-06-22 09:50:47 +02:00
Claudio Atzori 7d416f08d8 graph cleaning workflow: set hostedby to unknown repository when defined as NULL 2020-06-22 09:50:43 +02:00
Michele Artini 16c7a18435 refactoring 2020-06-22 08:51:31 +02:00
Michele Artini f9fc64ffaf Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-19 15:24:43 +02:00
Michele Artini d88fe0ac84 join methods 2020-06-19 15:24:30 +02:00
Sandro La Bruzzo 464eeeec87 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-19 15:11:53 +02:00
Sandro La Bruzzo 1681de672d updated mapping scholexplorer to OAF 2020-06-19 15:11:46 +02:00
Michele Artini 4822747313 some fixes 2020-06-19 13:53:56 +02:00
Michele Artini 834f139e6e fixed some NPE 2020-06-19 12:33:29 +02:00
Claudio Atzori d0ac7514b2 cleaning workflow to include cleaning of default values 2020-06-18 19:37:25 +02:00
Michele Artini 52f62d5d8c events 2020-06-18 14:49:13 +02:00
Michele Artini 61634fbfe0 removed kryo encoding 2020-06-18 14:09:58 +02:00