miconis
|
8fea29177c
|
refactoring, minor changes and implementation of the wf for openorgs with integration of organization phases into the scan wf
|
2021-01-18 16:48:08 +01:00 |
miconis
|
1e1aab83e3
|
implementation of the raw wf for openorgs: still not complete, some functionalities are missing
|
2020-12-21 11:58:21 +01:00 |
Claudio Atzori
|
d9532446eb
|
imported more diffs from master branch; code formatting
|
2020-12-10 16:14:16 +01:00 |
Claudio Atzori
|
1eaad89a3c
|
do not fail on unknown properties when grouping entities by ID
|
2020-12-10 15:56:11 +01:00 |
Claudio Atzori
|
758d27745d
|
cleaning tab characters from text fields
|
2020-11-27 16:07:24 +01:00 |
Claudio Atzori
|
5151850a19
|
CROSSREF and DATACITE constants moved in common ModelConstants
|
2020-11-26 13:08:36 +01:00 |
Claudio Atzori
|
d0d5525d40
|
minor changes
|
2020-11-26 11:04:17 +01:00 |
Claudio Atzori
|
13eae4b31e
|
GroupEntitiesSparkJob must read all graph paths but relations
|
2020-11-26 11:04:01 +01:00 |
Claudio Atzori
|
76363a8512
|
SimpleDateFormat is not thread safe; improved error reporting in case of invalid dates
|
2020-11-26 11:03:12 +01:00 |
Claudio Atzori
|
e208b03755
|
renamed workflow
|
2020-11-25 14:55:50 +01:00 |
Claudio Atzori
|
dfd6205b95
|
Consistency graph workflow merges all the entities by ID
|
2020-11-25 14:55:32 +01:00 |
Claudio Atzori
|
e5da4ee9b1
|
dedup workflow using the common PidComparator
|
2020-11-04 15:02:02 +01:00 |
Claudio Atzori
|
385214eeae
|
code formatting
|
2020-10-30 15:47:05 +01:00 |
miconis
|
c4a59d1b9a
|
merge with the master to port the new packages
|
2020-10-20 16:07:30 +02:00 |
miconis
|
708d887e64
|
minor changes
|
2020-10-20 15:12:19 +02:00 |
miconis
|
0e54803177
|
bug fix in the id generator and implementation of jobs for organization dedup
|
2020-10-20 12:19:46 +02:00 |
miconis
|
6f8720982c
|
bug fix in the idgenerator and test implementation
|
2020-10-09 09:30:23 +02:00 |
Sandro La Bruzzo
|
734934e2eb
|
fixed error on empty intersection with publication and relation on export to OAF
|
2020-10-08 17:29:29 +02:00 |
Sandro La Bruzzo
|
eec418cd26
|
moved AuthorMerger into dhp-common
|
2020-10-08 10:33:55 +02:00 |
miconis
|
1804c5d809
|
refactoring: classes moved in the right package
|
2020-10-06 16:44:51 +02:00 |
miconis
|
7093355487
|
bug fix and minor changes
|
2020-10-06 16:21:34 +02:00 |
miconis
|
5a8bc329c5
|
bug fix in the result merge: it takes the correct bestaccessright based on the license instead of the trust
|
2020-10-06 15:26:44 +02:00 |
miconis
|
a2ac7e52fb
|
implementation of the workflow for new organizations in openorgs
|
2020-10-06 13:58:09 +02:00 |
Claudio Atzori
|
23f64d9eb4
|
updated dedup tests following the dnet-pace-core library update
|
2020-10-02 14:30:53 +02:00 |
miconis
|
e3f7798d1b
|
minor changes in dedup tests, bug fix in the idgenerator and pace-core version update
|
2020-09-29 15:31:46 +02:00 |
miconis
|
4cf79f32eb
|
implementation of the oozie wf to prepare the openorgs input: relations between organizations
|
2020-09-25 11:29:51 +02:00 |
miconis
|
259362ef47
|
implementation of the job to collect simrels from postgres db
|
2020-09-22 09:43:27 +02:00 |
Sandro La Bruzzo
|
168bfb496a
|
adapted dedup to the new schema
|
2020-07-31 09:06:57 +02:00 |
miconis
|
d47352cbc7
|
refactoring of the procedure for the id generation, minor changes and addition of a comparison on the original id and the origin datasource
|
2020-07-24 20:10:47 +02:00 |
miconis
|
b260fee787
|
implementation of the dedup_id generation using pids to make the graph more stable
|
2020-07-22 17:29:48 +02:00 |
Claudio Atzori
|
de72b1c859
|
cleanup
|
2020-07-20 09:59:11 +02:00 |
Claudio Atzori
|
805de4eca1
|
fix: filter the blocks with size = 1
|
2020-07-16 10:11:32 +02:00 |
Claudio Atzori
|
b90389bac4
|
code formatting
|
2020-07-15 11:24:48 +02:00 |
Claudio Atzori
|
4e6f46e8fa
|
filter blocks with one record only
|
2020-07-15 11:22:20 +02:00 |
Claudio Atzori
|
06def0c0cb
|
SparkBlockStats allows repartitioning the input rdd via the numPartitions workflow parameter
|
2020-07-13 20:09:06 +02:00 |
miconis
|
b52c246aed
|
merge done
|
2020-07-13 19:57:02 +02:00 |
miconis
|
b8a45041fd
|
minor changes
|
2020-07-13 19:53:18 +02:00 |
Claudio Atzori
|
66f9f6d323
|
adjusted parameters for the dedup stats workflow
|
2020-07-13 19:26:46 +02:00 |
miconis
|
03ecfa5ebd
|
implementation of the test class for the new block stats spark action
|
2020-07-13 18:48:23 +02:00 |
miconis
|
10e08ccf45
|
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
|
2020-07-13 18:22:45 +02:00 |
miconis
|
9258e4f095
|
implementation of a new workflow to compute statistics on the blocks
|
2020-07-13 18:22:34 +02:00 |
Claudio Atzori
|
c6f6fb0f28
|
code formatting
|
2020-07-13 16:46:13 +02:00 |
Claudio Atzori
|
344a90c2e6
|
updated assertions in propagateRelationTest
|
2020-07-13 16:32:04 +02:00 |
Claudio Atzori
|
1143f426aa
|
WIP SparkCreateMergeRels distinct relations
|
2020-07-13 16:13:36 +02:00 |
Claudio Atzori
|
8c67938ad0
|
configurable number of partitions used in the SparkCreateSimRels phase
|
2020-07-13 16:07:07 +02:00 |
Claudio Atzori
|
c73168b18e
|
Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting
|
2020-07-13 15:54:58 +02:00 |
Claudio Atzori
|
c8284bab06
|
WIP SparkCreateMergeRels distinct relations
|
2020-07-13 15:54:51 +02:00 |
Sandro La Bruzzo
|
1d133b7fe6
|
update test
|
2020-07-13 15:52:41 +02:00 |
Claudio Atzori
|
7dd91edf43
|
parsing of optional parameter
|
2020-07-13 15:40:41 +02:00 |
Claudio Atzori
|
4c101a9d66
|
WIP SparkCreateMergeRels distinct relations
|
2020-07-13 15:31:38 +02:00 |
Claudio Atzori
|
8a612d861a
|
WIP SparkCreateMergeRels distinct relations
|
2020-07-13 15:30:57 +02:00 |
Sandro La Bruzzo
|
9ef2385022
|
implemented test for cut of connected component
|
2020-07-13 15:28:17 +02:00 |
Sandro La Bruzzo
|
d561b2dd21
|
implemented cut of connected component
|
2020-07-13 14:18:42 +02:00 |
Claudio Atzori
|
e2093e42db
|
Merge branch 'master' into deduptesting
|
2020-07-13 10:57:49 +02:00 |
Claudio Atzori
|
7a3fd9f54c
|
dedup relation aggregator moved into dedicated class
|
2020-07-13 10:11:36 +02:00 |
Alessia Bardi
|
7e96105947
|
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
|
2020-07-12 19:29:12 +02:00 |
Alessia Bardi
|
b7a39731a6
|
assert, not print
|
2020-07-12 19:28:56 +02:00 |
Claudio Atzori
|
770adc26e9
|
WIP aggregator to make relationships unique
|
2020-07-10 19:35:10 +02:00 |
Claudio Atzori
|
ecf119f37a
|
Merge branch 'master' into deduptesting
|
2020-07-10 19:04:16 +02:00 |
Michele Artini
|
e1ae964bc4
|
stats
|
2020-07-10 16:12:08 +02:00 |
Claudio Atzori
|
752d28f8eb
|
make the relations produced by the dedup SparkPropagateRelation job unique
|
2020-07-10 15:09:50 +02:00 |
Claudio Atzori
|
3c728aaa0c
|
trying to overcome OOM errors during duplicate scan phase
|
2020-07-08 22:39:51 +02:00 |
Claudio Atzori
|
18c555cd79
|
Merge branch 'master' into deduptesting
|
2020-07-08 22:32:01 +02:00 |
Claudio Atzori
|
4365cf41d7
|
trying to overcome OOM errors during duplicate scan phase
|
2020-07-08 22:31:46 +02:00 |
Alessia Bardi
|
853e8d7987
|
test for software merge
|
2020-07-08 17:03:53 +02:00 |
Claudio Atzori
|
c3d67f709a
|
adjusted dedup configuration for result entities: using new wordssuffixprefix clustering function, removed ngrampairs, adjusted queueMaxSize (800) and slidingWindowSize (80)
|
2020-07-02 17:35:22 +02:00 |
Claudio Atzori
|
0f77cac4b5
|
fix: deduper must use queueMaxSize instead of groupMaxSize for the block definition
|
2020-07-02 12:43:51 +02:00 |
Claudio Atzori
|
9cd27183b6
|
[maven-release-plugin] prepare for next development iteration
|
2020-06-22 11:27:44 +02:00 |
Claudio Atzori
|
1e3dab0631
|
[maven-release-plugin] prepare release dhp-1.2.3
|
2020-06-22 11:27:39 +02:00 |
miconis
|
11b77b9f4e
|
json dumps for entity merge test modified to fit the new model. title merge adjusted to fix the error
|
2020-06-16 18:31:11 +02:00 |
Claudio Atzori
|
c4d9f1837f
|
[maven-release-plugin] prepare for next development iteration
|
2020-06-12 12:21:08 +02:00 |
Claudio Atzori
|
f0746a7605
|
[maven-release-plugin] prepare release dhp-1.2.2
|
2020-06-12 12:21:03 +02:00 |
Claudio Atzori
|
7b288a94cb
|
code formatting
|
2020-05-26 09:54:13 +02:00 |
Claudio Atzori
|
7582532e73
|
[maven-release-plugin] prepare for next development iteration
|
2020-05-25 19:48:18 +02:00 |
Claudio Atzori
|
01c2e93395
|
[maven-release-plugin] prepare release dhp-1.2.1
|
2020-05-25 19:48:14 +02:00 |
miconis
|
da1e5cf557
|
implementation of the result title merge. main title with higher trust, distinct between the others
|
2020-05-25 18:02:57 +02:00 |
Claudio Atzori
|
7181807e64
|
code formatting
|
2020-05-23 09:51:48 +02:00 |
miconis
|
0fd0c7d725
|
reimplementation of the sim between two authors. now it takes into account both name and surname. threshold incremented to 1.0 if the name is too short
|
2020-05-22 17:24:57 +02:00 |
Claudio Atzori
|
3cf2796ac6
|
code formatting
|
2020-05-22 12:34:00 +02:00 |
miconis
|
8bbd1d0501
|
reimplementation of the author merging in deduprecord creation. implementation of the test class.
|
2020-05-21 11:52:14 +02:00 |
Claudio Atzori
|
60c40618d3
|
[maven-release-plugin] prepare for next development iteration
|
2020-05-11 10:17:14 +02:00 |
Claudio Atzori
|
c267d958d5
|
[maven-release-plugin] prepare release dhp-1.2.0
|
2020-05-11 10:17:10 +02:00 |
Claudio Atzori
|
42f1a2bf94
|
bumped project version to 1.2.0-SNAPSHOT
|
2020-05-11 10:05:57 +02:00 |
Claudio Atzori
|
fd519df616
|
new rels produced by dedup workflow must be unique
|
2020-05-08 19:00:38 +02:00 |
Claudio Atzori
|
0ccc864ad9
|
[maven-release-plugin] prepare for next development iteration
|
2020-05-08 17:01:31 +02:00 |
Claudio Atzori
|
6e47c724c6
|
[maven-release-plugin] prepare release dhp-1.1.7
|
2020-05-08 17:01:27 +02:00 |
Claudio Atzori
|
5b28bb4131
|
code formatting
|
2020-05-08 16:49:47 +02:00 |
miconis
|
3420998bb4
|
reltype set in mergerels
|
2020-05-08 15:43:30 +02:00 |
Claudio Atzori
|
c79e2f5977
|
drop workingPath before starting the dedup workflow
|
2020-05-06 11:27:44 +02:00 |
miconis
|
3df703f67d
|
mergerels added to propagate relations
|
2020-05-04 12:08:12 +02:00 |
Claudio Atzori
|
439c6255a2
|
cleanup
|
2020-04-29 19:09:07 +02:00 |
Claudio Atzori
|
77ac995770
|
cleaned up poms, added descriptions
|
2020-04-29 18:44:17 +02:00 |
miconis
|
0352d3b0ba
|
entity dumps in dedup compressed
|
2020-04-29 13:02:34 +02:00 |
miconis
|
62e467eb0c
|
assertion numbers updated to fit the new implementation of the pace-core
|
2020-04-28 11:46:23 +02:00 |
Claudio Atzori
|
6f5b899038
|
reformatted code according to the updated style descriptor
|
2020-04-28 11:23:29 +02:00 |
Claudio Atzori
|
a0bdbacdae
|
switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin
|
2020-04-27 14:52:31 +02:00 |
Claudio Atzori
|
7a3f8085f7
|
switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin
|
2020-04-27 14:45:40 +02:00 |
Claudio Atzori
|
278fc9d276
|
code formatting
|
2020-04-23 18:51:38 +02:00 |
miconis
|
8d258c85ff
|
spark dedup test fixed, sample for dataset and orp added, test implemented
|
2020-04-23 18:16:20 +02:00 |
Claudio Atzori
|
9ddafd46ca
|
fixed dedup record id prefix, set the correct dataInfo in the DedupRecordFactory
|
2020-04-23 07:50:18 +02:00 |
Claudio Atzori
|
91e72a6944
|
Dataset based implementation for SparkCreateDedupRecord phase, fixed datasource entity dump supplementing dedup unit tests
|
2020-04-21 12:06:08 +02:00 |
miconis
|
5c9ef08a8e
|
spark dedup test fixed
|
2020-04-21 10:19:04 +02:00 |
Claudio Atzori
|
eb8a020859
|
fixed behaviour of DedupRecordFactory
|
2020-04-20 18:44:06 +02:00 |
miconis
|
1102e32462
|
SparkDedupTest updated and organization dump fixed
|
2020-04-20 16:49:01 +02:00 |
miconis
|
4da13e4570
|
Revert "Merge branch 'master' into deduptesting"
This reverts commit 772f75d167, reversing
changes made to 5f45f2c77f.
|
2020-04-20 16:04:49 +02:00 |
miconis
|
772f75d167
|
Merge branch 'master' into deduptesting
|
2020-04-20 14:50:12 +02:00 |
Claudio Atzori
|
d714bfb4d4
|
collectedfrom field moved in common parent class Oaf.java
|
2020-04-20 12:25:19 +02:00 |
Claudio Atzori
|
5f45f2c77f
|
Merge branch 'master' into deduptesting
|
2020-04-18 12:46:40 +02:00 |
Claudio Atzori
|
ad7a131b18
|
introduced common project code formatting plugin, works on the commit hook, based on https://github.com/Cosium/git-code-format-maven-plugin, applied to each java class in the project
|
2020-04-18 12:42:58 +02:00 |
Claudio Atzori
|
a2938dd059
|
cleanup
|
2020-04-18 12:24:22 +02:00 |
Claudio Atzori
|
9374ff03ea
|
Merge branch 'master' into deduptesting
|
2020-04-18 12:06:58 +02:00 |
Claudio Atzori
|
71813795f6
|
various refactorings on the dnet-dedup-openaire workflow
|
2020-04-18 12:06:23 +02:00 |
miconis
|
6450bb0daa
|
test for software dedup added. definition of orp, dataset and sw dedup configurations
|
2020-04-17 17:31:59 +02:00 |
Claudio Atzori
|
038ac7afd7
|
relation consistency workflow separated from dedup scan and creation of CCs
|
2020-04-17 13:12:44 +02:00 |
miconis
|
418cf94642
|
implementation of the deletedbyinference test in propagating relations
|
2020-04-17 10:40:21 +02:00 |
Claudio Atzori
|
cc21bbfb1a
|
Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting
|
2020-04-16 14:41:37 +02:00 |
Claudio Atzori
|
ec5dfc068d
|
added spark.sql.shuffle.partitions=3840 to dedup scan wf
|
2020-04-16 14:41:28 +02:00 |
miconis
|
0eccbc318b
|
Deduper class (utilities for dedup) cleaned. Useless methods removed
|
2020-04-16 12:36:37 +02:00 |
Claudio Atzori
|
76d23895e6
|
Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting
|
2020-04-16 12:18:32 +02:00 |
miconis
|
6a089ec287
|
minor changes
|
2020-04-16 12:15:38 +02:00 |
Claudio Atzori
|
376efd67de
|
removed prepare statement in spark action
|
2020-04-16 12:14:16 +02:00 |
miconis
|
9b36458b6a
|
Merge branch 'deduptesting' of code-repo.d4science.org:D-Net/dnet-hadoop into deduptesting
|
2020-04-16 12:13:58 +02:00 |
miconis
|
cd4d9a148f
|
creating temporary directories in dedup test
|
2020-04-16 12:13:26 +02:00 |
Claudio Atzori
|
b39ff36c16
|
improving the wf definitions
|
2020-04-16 12:11:37 +02:00 |
Claudio Atzori
|
011b342bc9
|
trying to avoid OOM in SparkPropagateRelation
|
2020-04-16 11:13:51 +02:00 |
Claudio Atzori
|
069ef5eaed
|
trying to avoid OOM in SparkPropagateRelation
|
2020-04-15 21:23:21 +02:00 |
Claudio Atzori
|
8eedfefc98
|
try to introduce intermediate serialization on hdfs to avoid OOM
|
2020-04-15 18:35:35 +02:00 |
miconis
|
5689d49689
|
minor changes
|
2020-04-15 16:34:06 +02:00 |
miconis
|
0be2e72be5
|
further implementation of tests for the deduplication of each entity. publication dump added, empty entity files created
|
2020-04-08 18:02:30 +02:00 |
miconis
|
56fbe689f0
|
implementation of the tests for each spark action
|
2020-04-06 16:30:31 +02:00 |
miconis
|
53fd624c34
|
implemented test for sparkcreatesimrels
|
2020-04-03 18:32:25 +02:00 |
miconis
|
a61763d149
|
structure for sparksimrel changed to be compliant with mockito testing
|
2020-04-02 18:37:53 +02:00 |
miconis
|
bfa5bc74df
|
minor changes
|
2020-04-01 19:05:48 +02:00 |
miconis
|
9802bcb9fe
|
dedup testing
|
2020-04-01 18:48:31 +02:00 |
Claudio Atzori
|
377e1ba840
|
[maven-release-plugin] prepare for next development iteration
|
2020-03-30 20:06:00 +02:00 |
Claudio Atzori
|
76d9315129
|
[maven-release-plugin] prepare release dhp-1.1.6
|
2020-03-30 20:05:56 +02:00 |
Claudio Atzori
|
673e744649
|
moved openaire specific implementations under dedicated package eu.dnetlib.dhp.oa
|
2020-03-27 10:42:17 +01:00 |
Sandro La Bruzzo
|
e71e001b58
|
commented test that doesn't work
|
2020-03-26 14:15:21 +01:00 |
Sandro La Bruzzo
|
0cd022ad6a
|
merge with master
|
2020-03-26 14:08:29 +01:00 |
Claudio Atzori
|
cd7dc3e1ae
|
dhp-dedup-openaire workflow tests upgraded to junit5
|
2020-03-25 18:04:23 +01:00 |
Michele Artini
|
ebe45003d9
|
fixed some junit packages
|
2020-03-25 16:45:03 +01:00 |
Claudio Atzori
|
71ae7dd272
|
renamed module dnet-dedup to dnet-dedup-openaire
|
2020-03-25 15:57:09 +01:00 |