Michele Artini
|
442f30930c
|
removed duplicated fields
|
2020-07-17 12:25:36 +02:00 |
Claudio Atzori
|
cc5d13da85
|
introduced parameter shouldIndex (true|false)
|
2020-07-16 13:46:39 +02:00 |
Claudio Atzori
|
b098cc3cbe
|
avoid repeating identical values for fields: source, description
|
2020-07-16 13:45:53 +02:00 |
Claudio Atzori
|
805de4eca1
|
fix: filter the blocks with size = 1
|
2020-07-16 10:11:32 +02:00 |
Claudio Atzori
|
4b9fb2ffb8
|
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
|
2020-07-15 11:26:04 +02:00 |
Claudio Atzori
|
5033c25587
|
code formatting
|
2020-07-15 11:26:00 +02:00 |
Claudio Atzori
|
b90389bac4
|
code formatting
|
2020-07-15 11:24:48 +02:00 |
Claudio Atzori
|
4e6f46e8fa
|
filter blocks with one record only
|
2020-07-15 11:22:20 +02:00 |
Michele Artini
|
262c29463e
|
relations with multiple datasources
|
2020-07-15 09:18:40 +02:00 |
Claudio Atzori
|
7d6e269b40
|
reverted CreateRelatedEntitiesJob_phase1 to its previous state
|
2020-07-13 22:54:04 +02:00 |
Claudio Atzori
|
8e97598eb4
|
avoid NPE in case of null instances
|
2020-07-13 20:46:14 +02:00 |
Claudio Atzori
|
06def0c0cb
|
SparkBlockStats allows repartitioning the input rdd via the numPartitions workflow parameter
|
2020-07-13 20:09:06 +02:00 |
miconis
|
b52c246aed
|
merge done
|
2020-07-13 19:57:02 +02:00 |
miconis
|
b8a45041fd
|
minor changes
|
2020-07-13 19:53:18 +02:00 |
Claudio Atzori
|
66f9f6d323
|
adjusted parameters for the dedup stats workflow
|
2020-07-13 19:26:46 +02:00 |
miconis
|
03ecfa5ebd
|
implementation of the test class for the new block stats spark action
|
2020-07-13 18:48:23 +02:00 |
miconis
|
10e08ccf45
|
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
|
2020-07-13 18:22:45 +02:00 |
miconis
|
9258e4f095
|
implementation of a new workflow to compute statistics on the blocks
|
2020-07-13 18:22:34 +02:00 |
Claudio Atzori
|
c6f6fb0f28
|
code formatting
|
2020-07-13 16:46:13 +02:00 |
Claudio Atzori
|
8d2102d7d2
|
Merge branch 'deduptesting'
|
2020-07-13 16:32:43 +02:00 |
Claudio Atzori
|
344a90c2e6
|
updated assertions in propagateRelationTest
|
2020-07-13 16:32:04 +02:00 |
Claudio Atzori
|
1143f426aa
|
WIP SparkCreateMergeRels distinct relations
|
2020-07-13 16:13:36 +02:00 |
Claudio Atzori
|
8c67938ad0
|
configurable number of partitions used in the SparkCreateSimRels phase
|
2020-07-13 16:07:07 +02:00 |
Claudio Atzori
|
c73168b18e
|
Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting
|
2020-07-13 15:54:58 +02:00 |
Claudio Atzori
|
c8284bab06
|
WIP SparkCreateMergeRels distinct relations
|
2020-07-13 15:54:51 +02:00 |
Sandro La Bruzzo
|
1d133b7fe6
|
update test
|
2020-07-13 15:52:41 +02:00 |
Michele Artini
|
3635d05061
|
poms
|
2020-07-13 15:52:23 +02:00 |
Claudio Atzori
|
7dd91edf43
|
parsing of optional parameter
|
2020-07-13 15:40:41 +02:00 |
Claudio Atzori
|
4c101a9d66
|
WIP SparkCreateMergeRels distinct relations
|
2020-07-13 15:31:38 +02:00 |
Claudio Atzori
|
8a612d861a
|
WIP SparkCreateMergeRels distinct relations
|
2020-07-13 15:30:57 +02:00 |
Sandro La Bruzzo
|
9ef2385022
|
implemented test for cut of connected component
|
2020-07-13 15:28:17 +02:00 |
Sandro La Bruzzo
|
d561b2dd21
|
implemented cut of connected component
|
2020-07-13 14:18:42 +02:00 |
Claudio Atzori
|
e2093e42db
|
Merge branch 'master' into deduptesting
|
2020-07-13 10:57:49 +02:00 |
Michele Artini
|
2c4ed9a043
|
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
|
2020-07-13 10:55:39 +02:00 |
Michele Artini
|
ccbe5c5658
|
fixed import of eu.dnetlib.dhp:dnet-openaire-broker-common
|
2020-07-13 10:55:27 +02:00 |
Claudio Atzori
|
7a3fd9f54c
|
dedup relation aggregator moved into dedicated class
|
2020-07-13 10:11:36 +02:00 |
Alessia Bardi
|
7e96105947
|
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
|
2020-07-12 19:29:12 +02:00 |
Alessia Bardi
|
b7a39731a6
|
assert, not print
|
2020-07-12 19:28:56 +02:00 |
Claudio Atzori
|
770adc26e9
|
WIP aggregator to make relationships unique
|
2020-07-10 19:35:10 +02:00 |
Claudio Atzori
|
ecf119f37a
|
Merge branch 'master' into deduptesting
|
2020-07-10 19:04:16 +02:00 |
Claudio Atzori
|
31071e363f
|
Merge branch 'provision_indexing'
|
2020-07-10 19:03:57 +02:00 |
Claudio Atzori
|
06c1913062
|
added different limits for grouping by source and by target, incremented spark.sql.shuffle.partitions for the join operations
|
2020-07-10 19:03:33 +02:00 |
Claudio Atzori
|
cc77446dc4
|
added dbSchema parameter to the raw_db workflow
|
2020-07-10 19:01:50 +02:00 |
Claudio Atzori
|
4c3836f62e
|
materialize the related entities before joining them
|
2020-07-10 19:00:44 +02:00 |
Michele Artini
|
e1ae964bc4
|
stats
|
2020-07-10 16:12:08 +02:00 |
Claudio Atzori
|
752d28f8eb
|
make the relations produced by the dedup SparkPropagateRelation job unique
|
2020-07-10 15:09:50 +02:00 |
Claudio Atzori
|
b21866a2da
|
allow setting different relation cut points by source and by target; adjusted weight assigned to relationship types
|
2020-07-10 13:59:48 +02:00 |
Claudio Atzori
|
ff4d6214f1
|
experimenting with pruning of relations
|
2020-07-10 10:06:41 +02:00 |
Michele Artini
|
2d742a84ae
|
DedupConfig as json file
|
2020-07-09 12:53:46 +02:00 |
Michele Artini
|
a44b9b36b9
|
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
|
2020-07-09 11:02:31 +02:00 |