Claudio Atzori
|
1781609508
|
code formatting
|
2020-07-16 19:06:56 +02:00 |
Claudio Atzori
|
db8b90a156
|
renamed CORE -> BETA
|
2020-07-16 19:05:13 +02:00 |
Miriam Baglioni
|
44e1c40c42
|
merge upstream
|
2020-07-16 18:49:38 +02:00 |
Claudio Atzori
|
878f2b931c
|
Merge branch 'master' into merge_graph
|
2020-07-16 16:34:24 +02:00 |
Claudio Atzori
|
cc5d13da85
|
introduced parameter shouldIndex (true|false)
|
2020-07-16 13:46:39 +02:00 |
Claudio Atzori
|
b098cc3cbe
|
avoid repeating identical values for fields: source, description
|
2020-07-16 13:45:53 +02:00 |
Claudio Atzori
|
805de4eca1
|
fix: filter the blocks with size = 1
|
2020-07-16 10:11:32 +02:00 |
Claudio Atzori
|
4b9fb2ffb8
|
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
|
2020-07-15 11:26:04 +02:00 |
Claudio Atzori
|
b90389bac4
|
code formatting
|
2020-07-15 11:24:48 +02:00 |
Claudio Atzori
|
4e6f46e8fa
|
filter blocks with one record only
|
2020-07-15 11:22:20 +02:00 |
Michele Artini
|
262c29463e
|
relations with multiple datasources
|
2020-07-15 09:18:40 +02:00 |
Claudio Atzori
|
7d6e269b40
|
reverted CreateRelatedEntitiesJob_phase1 to its previous state
|
2020-07-13 22:54:04 +02:00 |
Claudio Atzori
|
8e97598eb4
|
avoid NPE in case of null instances
|
2020-07-13 20:46:14 +02:00 |
Claudio Atzori
|
06def0c0cb
|
SparkBlockStats allows repartitioning the input rdd via the numPartitions workflow parameter
|
2020-07-13 20:09:06 +02:00 |
miconis
|
b52c246aed
|
merge done
|
2020-07-13 19:57:02 +02:00 |
miconis
|
b8a45041fd
|
minor changes
|
2020-07-13 19:53:18 +02:00 |
Claudio Atzori
|
66f9f6d323
|
adjusted parameters for the dedup stats workflow
|
2020-07-13 19:26:46 +02:00 |
miconis
|
03ecfa5ebd
|
implementation of the test class for the new block stats spark action
|
2020-07-13 18:48:23 +02:00 |
miconis
|
10e08ccf45
|
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
|
2020-07-13 18:22:45 +02:00 |
miconis
|
9258e4f095
|
implementation of a new workflow to compute statistics on the blocks
|
2020-07-13 18:22:34 +02:00 |
Claudio Atzori
|
c6f6fb0f28
|
code formatting
|
2020-07-13 16:46:13 +02:00 |
Claudio Atzori
|
8d2102d7d2
|
Merge branch 'deduptesting'
|
2020-07-13 16:32:43 +02:00 |
Claudio Atzori
|
344a90c2e6
|
updated assertions in propagateRelationTest
|
2020-07-13 16:32:04 +02:00 |
Claudio Atzori
|
1143f426aa
|
WIP SparkCreateMergeRels distinct relations
|
2020-07-13 16:13:36 +02:00 |
Claudio Atzori
|
8c67938ad0
|
configurable number of partitions used in the SparkCreateSimRels phase
|
2020-07-13 16:07:07 +02:00 |
Claudio Atzori
|
c73168b18e
|
Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting
|
2020-07-13 15:54:58 +02:00 |
Claudio Atzori
|
c8284bab06
|
WIP SparkCreateMergeRels distinct relations
|
2020-07-13 15:54:51 +02:00 |
Sandro La Bruzzo
|
1d133b7fe6
|
update test
|
2020-07-13 15:52:41 +02:00 |
Michele Artini
|
3635d05061
|
poms
|
2020-07-13 15:52:23 +02:00 |
Claudio Atzori
|
7dd91edf43
|
parsing of optional parameter
|
2020-07-13 15:40:41 +02:00 |
Claudio Atzori
|
4c101a9d66
|
WIP SparkCreateMergeRels distinct relations
|
2020-07-13 15:31:38 +02:00 |
Claudio Atzori
|
8a612d861a
|
WIP SparkCreateMergeRels distinct relations
|
2020-07-13 15:30:57 +02:00 |
Sandro La Bruzzo
|
9ef2385022
|
implemented test for cut of connected component
|
2020-07-13 15:28:17 +02:00 |
Sandro La Bruzzo
|
d561b2dd21
|
implemented cut of connected component
|
2020-07-13 14:18:42 +02:00 |
Miriam Baglioni
|
8e0e090d7a
|
merge upstream
|
2020-07-13 12:46:55 +02:00 |
Claudio Atzori
|
e2093e42db
|
Merge branch 'master' into deduptesting
|
2020-07-13 10:57:49 +02:00 |
Michele Artini
|
2c4ed9a043
|
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
|
2020-07-13 10:55:39 +02:00 |
Michele Artini
|
ccbe5c5658
|
fixed import of eu.dnetlib.dhp:dnet-openaire-broker-common
|
2020-07-13 10:55:27 +02:00 |
Claudio Atzori
|
7a3fd9f54c
|
dedup relation aggregator moved into dedicated class
|
2020-07-13 10:11:36 +02:00 |
Alessia Bardi
|
7e96105947
|
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
|
2020-07-12 19:29:12 +02:00 |
Alessia Bardi
|
b7a39731a6
|
assert, not print
|
2020-07-12 19:28:56 +02:00 |
Miriam Baglioni
|
f9ad6f3255
|
Merge branch 'dump' of code-repo.d4science.org:miriam.baglioni/dnet-hadoop into dump
|
2020-07-10 19:42:53 +02:00 |
Miriam Baglioni
|
c27f12d6e8
|
avoid considering the _SUCCESS file
|
2020-07-10 19:42:23 +02:00 |
Claudio Atzori
|
770adc26e9
|
WIP aggregator to make relationships unique
|
2020-07-10 19:35:10 +02:00 |
Claudio Atzori
|
ecf119f37a
|
Merge branch 'master' into deduptesting
|
2020-07-10 19:04:16 +02:00 |
Claudio Atzori
|
31071e363f
|
Merge branch 'provision_indexing'
|
2020-07-10 19:03:57 +02:00 |
Claudio Atzori
|
06c1913062
|
added different limits for grouping by source and by target, incremented spark.sql.shuffle.partitions for the join operations
|
2020-07-10 19:03:33 +02:00 |
Claudio Atzori
|
cc77446dc4
|
added dbSchema parameter to the raw_db workflow
|
2020-07-10 19:01:50 +02:00 |
Claudio Atzori
|
4c3836f62e
|
materialize the related entities before joining them
|
2020-07-10 19:00:44 +02:00 |
Michele Artini
|
e1ae964bc4
|
stats
|
2020-07-10 16:12:08 +02:00 |
Claudio Atzori
|
752d28f8eb
|
make the relations produced by the dedup SparkPropagateRelation job unique
|
2020-07-10 15:09:50 +02:00 |
Sandro La Bruzzo
|
c01efed79b
|
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
|
2020-07-10 14:44:57 +02:00 |
Sandro La Bruzzo
|
a7d3977481
|
added generation of EBI Dataset
|
2020-07-10 14:44:50 +02:00 |
Claudio Atzori
|
b21866a2da
|
allow setting different relation cut points by source and by target; adjusted weight assigned to relationship types
|
2020-07-10 13:59:48 +02:00 |
Claudio Atzori
|
ff4d6214f1
|
experimenting with pruning of relations
|
2020-07-10 10:06:41 +02:00 |
Miriam Baglioni
|
faea30cda0
|
-
|
2020-07-09 14:05:21 +02:00 |
Michele Artini
|
2d742a84ae
|
DedupConfig as json file
|
2020-07-09 12:53:46 +02:00 |
Miriam Baglioni
|
a634794242
|
merge upstream
|
2020-07-09 11:46:51 +02:00 |
Michele Artini
|
a44b9b36b9
|
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
|
2020-07-09 11:02:31 +02:00 |
Michele Artini
|
1c6a171633
|
updated pom
|
2020-07-09 11:02:09 +02:00 |
Claudio Atzori
|
3c728aaa0c
|
trying to overcome OOM errors during duplicate scan phase
|
2020-07-08 22:39:51 +02:00 |
Claudio Atzori
|
18c555cd79
|
Merge branch 'master' into deduptesting
|
2020-07-08 22:32:01 +02:00 |
Claudio Atzori
|
4365cf41d7
|
trying to overcome OOM errors during duplicate scan phase
|
2020-07-08 22:31:46 +02:00 |
Claudio Atzori
|
67e1d222b6
|
bulk cleaning when found null or empty, sets bestaccessrights evaluating the result instances
|
2020-07-08 17:53:35 +02:00 |
Alessia Bardi
|
853e8d7987
|
test for software merge
|
2020-07-08 17:03:53 +02:00 |
Claudio Atzori
|
610d377d57
|
first implementation of the BETA & PROD graphs merge procedure
|
2020-07-08 16:54:26 +02:00 |
Alessia Bardi
|
9a898c0e4c
|
Json schema generator
|
2020-07-08 12:52:00 +02:00 |
Alessia Bardi
|
636f9ce7d6
|
json schema generator lib
|
2020-07-08 12:50:57 +02:00 |
Alessia Bardi
|
8f83b726fa
|
Dump json schema compliant to json schema Draft 7
|
2020-07-08 12:48:46 +02:00 |
Claudio Atzori
|
e2ea30f89d
|
updated graph construction workflow definition: cleaning wf moved at the bottom to include cleaning of the information produced by the enrichment workflows
|
2020-07-08 12:16:24 +02:00 |
Miriam Baglioni
|
1b0b968548
|
fixed issue on substring
|
2020-07-08 12:11:51 +02:00 |
Miriam Baglioni
|
7fe00cb4fb
|
-
|
2020-07-08 10:29:37 +02:00 |
Miriam Baglioni
|
375ef07d7b
|
changed the description for the upload
|
2020-07-07 18:41:27 +02:00 |
Miriam Baglioni
|
35c8265793
|
added the json extension to filename
|
2020-07-07 18:29:49 +02:00 |
Miriam Baglioni
|
81434f8e5e
|
added method newInstance
|
2020-07-07 18:26:10 +02:00 |
Miriam Baglioni
|
817cddfc52
|
-
|
2020-07-07 18:25:12 +02:00 |
Miriam Baglioni
|
a66aa9bd83
|
removed useless tests
|
2020-07-07 18:25:00 +02:00 |
Miriam Baglioni
|
9b20a21b24
|
removed useless tests
|
2020-07-07 18:23:37 +02:00 |
Miriam Baglioni
|
8a1b42ff21
|
added check to verify that dump contains at least one product
|
2020-07-07 18:21:35 +02:00 |
Miriam Baglioni
|
d86adb82a7
|
-
|
2020-07-07 18:20:51 +02:00 |
Miriam Baglioni
|
b2782025f6
|
enabled the whole workflow to run. Added property to give priority to dependency in the classpath - to solve conflicts
|
2020-07-07 18:10:47 +02:00 |
Miriam Baglioni
|
83d2c84b77
|
added constraints to xquery so as to get only profiles with status manager or all
|
2020-07-07 18:09:48 +02:00 |
Miriam Baglioni
|
4c8d86493c
|
-
|
2020-07-07 18:09:06 +02:00 |
Miriam Baglioni
|
0208bc18f3
|
added new resource for testing
|
2020-07-07 17:47:24 +02:00 |
Miriam Baglioni
|
f5bb65c9ef
|
the json schema for the dump of the results
|
2020-07-07 17:34:40 +02:00 |
Michele Artini
|
dffa0b01a2
|
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
|
2020-07-07 15:37:29 +02:00 |
Michele Artini
|
efadbdb2bc
|
fixed a bug with duplicated events
|
2020-07-07 15:37:13 +02:00 |
Claudio Atzori
|
8af8e7481a
|
code formatting
|
2020-07-07 14:23:34 +02:00 |
Claudio Atzori
|
b383ed42fa
|
pass optional parameter relationFilter to the PrepareRelationJob implementation
|
2020-07-07 14:21:28 +02:00 |
Claudio Atzori
|
911894a987
|
Merge branch 'deduptesting'
|
2020-07-07 14:20:43 +02:00 |
Miriam Baglioni
|
c19818a3f8
|
merge branch with fork master
|
2020-07-06 13:58:23 +02:00 |
Miriam Baglioni
|
d22240c0ba
|
merge upstream
|
2020-07-06 13:58:02 +02:00 |
Michele Artini
|
edf6c6c4dc
|
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
|
2020-07-03 11:48:24 +02:00 |
Michele Artini
|
04bebb708c
|
some fixes
|
2020-07-03 11:48:12 +02:00 |
Claudio Atzori
|
c3d67f709a
|
adjusted dedup configuration for result entities: using new wordssuffixprefix clustering function, removed ngrampairs, adjusted queueMaxSize (800) and slidingWindowSize (80)
|
2020-07-02 17:35:22 +02:00 |
Miriam Baglioni
|
f8bf4acd76
|
-
|
2020-07-02 16:03:11 +02:00 |
Miriam Baglioni
|
e6c79d44e6
|
-
|
2020-07-02 16:02:02 +02:00 |
Miriam Baglioni
|
d7f6f0c216
|
changed code to use other lib
|
2020-07-02 16:01:34 +02:00 |
Miriam Baglioni
|
8fdc9e070c
|
added dependency to OkHttp
|
2020-07-02 16:01:08 +02:00 |
Miriam Baglioni
|
94500a581b
|
merge branch with fork master
|
2020-07-02 14:25:39 +02:00 |
Miriam Baglioni
|
c133a23cf0
|
merge upstream
|
2020-07-02 14:24:57 +02:00 |
Claudio Atzori
|
1d39f7901c
|
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
|
2020-07-02 12:45:01 +02:00 |
Claudio Atzori
|
0f77cac4b5
|
fix: deduper must use queueMaxSize instead of groupMaxSize for the block definition
|
2020-07-02 12:43:51 +02:00 |
Sandro La Bruzzo
|
18b9330312
|
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
|
2020-07-02 12:43:19 +02:00 |
Michele Artini
|
b413db0bff
|
white/blacklists
|
2020-07-02 12:43:03 +02:00 |
Claudio Atzori
|
d380b85246
|
unit test for the preparation of the relations
|
2020-07-02 12:42:13 +02:00 |
Claudio Atzori
|
ed1c7e5d75
|
fixed workflow for the import of the claims alone
|
2020-07-02 12:40:21 +02:00 |
Sandro La Bruzzo
|
07f0723fa7
|
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
|
2020-07-02 12:37:49 +02:00 |
Sandro La Bruzzo
|
1d420eedb4
|
added generation of EBI Dataset
|
2020-07-02 12:37:43 +02:00 |
Claudio Atzori
|
e4a29a4513
|
fixed workflow for the import of the claims alone
|
2020-07-02 12:36:33 +02:00 |
Michele Artini
|
3bcdfbabe9
|
list with limits
|
2020-07-01 08:42:39 +02:00 |
Michele Artini
|
59a5421c24
|
indexing, accumulators, limited lists
|
2020-06-30 16:17:09 +02:00 |
Michele Artini
|
6f13673464
|
accumulators
|
2020-06-29 16:33:32 +02:00 |
Sandro La Bruzzo
|
dab783b173
|
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
|
2020-06-29 09:05:00 +02:00 |
Michele Artini
|
a6ea432435
|
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
|
2020-06-29 08:44:20 +02:00 |
Michele Artini
|
35ae381d28
|
all events matchers
|
2020-06-29 08:43:56 +02:00 |
Claudio Atzori
|
7817338e05
|
added test to verify the relation pre-processing
|
2020-06-26 17:58:33 +02:00 |
Claudio Atzori
|
8d59fdf34e
|
WIP: dataset based PrepareRelationsJob
|
2020-06-26 14:32:58 +02:00 |
Michele Artini
|
2393d9da2f
|
limits
|
2020-06-26 11:20:45 +02:00 |
Sandro La Bruzzo
|
96ce124b59
|
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
|
2020-06-25 17:00:43 +02:00 |
Miriam Baglioni
|
4a7de07ea2
|
refactoring
|
2020-06-25 16:32:40 +02:00 |
Miriam Baglioni
|
54a12978d3
|
fixed issue in xquery
|
2020-06-25 16:30:20 +02:00 |
Michele Artini
|
408165a756
|
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
|
2020-06-25 15:53:35 +02:00 |
Michele Artini
|
e8fb305f18
|
compilation of event map
|
2020-06-25 15:53:20 +02:00 |
Michele Artini
|
4eb3e109d7
|
compilation of event map
|
2020-06-25 15:45:50 +02:00 |
Claudio Atzori
|
d839e88783
|
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
|
2020-06-25 14:06:30 +02:00 |
Claudio Atzori
|
6f5771c1c9
|
sets author.rank when null
|
2020-06-25 14:06:21 +02:00 |
Michele Artini
|
e28033c6d8
|
some fixes
|
2020-06-25 13:01:09 +02:00 |
Claudio Atzori
|
216975c4ec
|
restored complete provision workflow
|
2020-06-25 12:55:52 +02:00 |
Claudio Atzori
|
2d77d3a388
|
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
|
2020-06-25 12:54:30 +02:00 |
Claudio Atzori
|
93f627ea51
|
code formatting
|
2020-06-25 12:54:21 +02:00 |
Miriam Baglioni
|
05a99cfb61
|
change the position of value and description elements in the workflow definition
|
2020-06-25 12:36:08 +02:00 |
Claudio Atzori
|
7df2712824
|
Merge branch 'provision_indexing'
|
2020-06-25 12:22:41 +02:00 |
Claudio Atzori
|
e62333192c
|
WIP: prepare relation job
|
2020-06-25 12:22:18 +02:00 |
Claudio Atzori
|
6933ec11fb
|
WIP: prepare relation job
|
2020-06-25 11:04:12 +02:00 |
Sandro La Bruzzo
|
a6c0faac70
|
added test to verify secondary sorting
|
2020-06-25 10:48:15 +02:00 |
Claudio Atzori
|
69b0391708
|
WIP: prepare relation job
|
2020-06-25 10:19:56 +02:00 |
Michele Artini
|
abcbebcbb4
|
fixed generation of ids
|
2020-06-25 09:50:46 +02:00 |
Michele Artini
|
77d2a1b1c4
|
params to choose sql queries for beta or production
|
2020-06-25 09:28:13 +02:00 |
Claudio Atzori
|
46e76affeb
|
WIP: prepare relation job
|
2020-06-24 19:01:15 +02:00 |
Claudio Atzori
|
0e723d378b
|
added default from vocab for missing instance.refereed; remove spurious prefixes from orcid values; WIP: prepare relation job
|
2020-06-24 18:34:42 +02:00 |
Michele Artini
|
202f6e62ff
|
Split join wf
|
2020-06-24 15:47:06 +02:00 |
Sandro La Bruzzo
|
96689a8994
|
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
|
2020-06-24 14:06:50 +02:00 |
Sandro La Bruzzo
|
46631a4421
|
updated mapping scholexplorer to OAF
|
2020-06-24 14:06:38 +02:00 |
Michele Artini
|
e53dd62e87
|
minor changes
|
2020-06-24 09:24:45 +02:00 |
Michele Artini
|
8b9933b934
|
refactoring aggregators
|
2020-06-24 08:57:13 +02:00 |
Miriam Baglioni
|
3e5570de7a
|
-
|
2020-06-23 15:44:54 +02:00 |
Michele Artini
|
d13e3d3f68
|
fixed paths
|
2020-06-23 11:01:42 +02:00 |
Michele Artini
|
8386c6f90d
|
filter of valid resultResult relations
|
2020-06-23 10:24:15 +02:00 |
Michele Artini
|
38bb45d0b6
|
test osf:refereed
|
2020-06-23 10:14:39 +02:00 |