Giambattista Bloisi
02636e802c
SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests
SparkWhitelistSimRels: use left semi join for clarity and performance
SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions
DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
2024-01-10 22:59:52 +01:00
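
Note on "reduced complexity Levenshtein with threshold": the sketch below illustrates the general idea of a threshold-bounded edit distance, where the computation stops as soon as the result can no longer stay within the threshold. It is a minimal illustration under stated assumptions, not the code from this commit; the class name BoundedLevenshtein, the method signature, and the -1 "exceeded" sentinel are all hypothetical.

```java
// Hypothetical illustration; not the levensteinTitle implementation from the commit.
final class BoundedLevenshtein {

    /** Returns the edit distance between a and b, or -1 if it exceeds threshold. */
    static int distance(String a, String b, int threshold) {
        if (threshold < 0) {
            throw new IllegalArgumentException("threshold must be >= 0");
        }
        // The length difference is a lower bound on the edit distance.
        if (Math.abs(a.length() - b.length()) > threshold) {
            return -1;
        }
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // The row minimum never decreases in later rows, so once it
            // exceeds the threshold the final distance must exceed it too.
            if (rowMin > threshold) {
                return -1;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        int result = prev[b.length()];
        return result <= threshold ? result : -1;
    }
}
```

A caller would typically derive the threshold from the similarity cutoff and the string lengths, so that pairs which cannot reach the cutoff are rejected without filling the whole table.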
Giambattista Bloisi
e64c2854a3
Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)
2023-07-24 15:36:24 +02:00
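
Note on "JsonPath cache contention fixed by using a ConcurrentHashMap": the sketch below shows the general pattern of caching compiled expressions in a ConcurrentHashMap so that concurrent readers do not serialize on a single lock. It is only an illustration, not the code from this commit: java.util.regex.Pattern stands in for a compiled JsonPath expression to keep the example free of external dependencies, and the class and method names are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.regex.Pattern;

// Hypothetical illustration; Pattern is a stand-in for a compiled JsonPath.
final class CompiledExpressionCache {

    private static final ConcurrentMap<String, Pattern> CACHE = new ConcurrentHashMap<>();

    /** Returns the cached compiled form of the expression, compiling it at most once per key. */
    static Pattern get(String expression) {
        // computeIfAbsent is atomic per key on ConcurrentHashMap, so lookups
        // of already-cached keys proceed without a global lock.
        return CACHE.computeIfAbsent(expression, Pattern::compile);
    }

    /** Number of distinct expressions currently cached. */
    static int size() {
        return CACHE.size();
    }
}
```

Repeated calls with the same expression return the same compiled instance, which is the property that makes such a cache worthwhile when the same paths are evaluated for every record.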
Claudio Atzori
0aa725083f
extended dedup testing
2022-11-17 16:13:43 +01:00
Claudio Atzori
ddff0e8999
merging duplicates using IdentifierComparator
2022-11-11 16:10:25 +01:00
Claudio Atzori
5af5a8ae42
added IdentifierComparator
2022-11-09 14:20:59 +01:00
miconis
8991d097b4
bug fix in the DedupRecordFactory, DataInfo set before merge
2022-02-24 17:13:12 +01:00
Claudio Atzori
2b46b87f56
fixed filtering criteria applied in SparkCopyRelationsNoOpenorgs to keep the parent/child relations from OpenOrgs
2021-11-19 11:30:29 +01:00
miconis
853333bdde
implementation of the whitelist for similarity relations
2021-09-20 16:21:47 +02:00
miconis
0857100fb8
implementation of the tests for the openorgs integration in the openaire provision
2021-04-07 18:42:16 +02:00
miconis
f446580e9f
code refactoring (useless classes and wf removed), implementation of the test for the openorgs dedup
2021-03-29 16:10:46 +02:00
miconis
2355cc4e9b
minor changes and bug fix
2021-03-29 10:07:12 +02:00
miconis
28c1cdd132
merged stable_ids into openorgswf
2021-03-25 10:44:49 +01:00
miconis
98854b0124
minor changes
2021-03-19 16:57:40 +01:00
miconis
1a85020572
bug fix in graph-mapper, changes in the implementation of the openorgs wf to create relations and populate openorgs db
2021-02-26 10:19:28 +01:00
Claudio Atzori
e5da4ee9b1
dedup workflow using the common PidComparator
2020-11-04 15:02:02 +01:00
miconis
c4a59d1b9a
merge with the master to port the new packages
2020-10-20 16:07:30 +02:00
miconis
6f8720982c
bug fix in the idgenerator and test implementation
2020-10-09 09:30:23 +02:00
miconis
5a8bc329c5
bug fix in the result merge: it takes the correct bestaccessright based on the license instead of the trust
2020-10-06 15:26:44 +02:00
miconis
259362ef47
implementation of the job to collect simrels from postgres db
2020-09-22 09:43:27 +02:00
miconis
d47352cbc7
refactoring of the procedure for the id generation, minor changes and addition of a comparison on the original id and the origin datasource
2020-07-24 20:10:47 +02:00
miconis
b260fee787
implementation of the dedup_id generation using pids to make the graph more stable
2020-07-22 17:29:48 +02:00
Claudio Atzori
8a612d861a
WIP SparkCreateMergeRels distinct relations
2020-07-13 15:30:57 +02:00
Alessia Bardi
853e8d7987
test for software merge
2020-07-08 17:03:53 +02:00
Claudio Atzori
c3d67f709a
adjusted dedup configuration for result entities: using new wordssuffixprefix clustering function, removed ngrampairs, adjusted queueMaxSize (800) and slidingWindowSize (80)
2020-07-02 17:35:22 +02:00
miconis
11b77b9f4e
json dumps for entity merge test modified to fit the new model. title merge adjusted to fix the error
2020-06-16 18:31:11 +02:00
miconis
da1e5cf557
implementation of the result title merge. main title with higher trust, distinct between the others
2020-05-25 18:02:57 +02:00
miconis
0fd0c7d725
reimplementation of the sim between two authors. now it takes into account both name and surname. threshold incremented to 1.0 if the name is too short
2020-05-22 17:24:57 +02:00
miconis
8bbd1d0501
reimplementation of the author merging in deduprecord creation. implementation of the test class.
2020-05-21 11:52:14 +02:00
Claudio Atzori
fd519df616
new rels produced by dedup workflow must be unique
2020-05-08 19:00:38 +02:00
miconis
0352d3b0ba
entity dumps in dedup compressed
2020-04-29 13:02:34 +02:00
miconis
8d258c85ff
spark dedup test fixed, sample for dataset and orp added, test implemented
2020-04-23 18:16:20 +02:00
Claudio Atzori
91e72a6944
Dataset based implementation for SparkCreateDedupRecord phase, fixed datasource entity dump supplementing dedup unit tests
2020-04-21 12:06:08 +02:00
miconis
1102e32462
SparkDedupTest updated and organization dump fixed
2020-04-20 16:49:01 +02:00
miconis
4da13e4570
Revert "Merge branch 'master' into deduptesting"
This reverts commit 772f75d167, reversing changes made to 5f45f2c77f.
2020-04-20 16:04:49 +02:00
miconis
772f75d167
Merge branch 'master' into deduptesting
2020-04-20 14:50:12 +02:00
Claudio Atzori
d714bfb4d4
collectedfrom field moved in common parent class Oaf.java
2020-04-20 12:25:19 +02:00
miconis
6450bb0daa
test for software dedup added. definition of orp, dataset and sw dedup configurations
2020-04-17 17:31:59 +02:00
miconis
0be2e72be5
further implementation of tests for the deduplication of each entity. publication dump added, empty entity files created
2020-04-08 18:02:30 +02:00
miconis
56fbe689f0
implementation of the tests for each spark action
2020-04-06 16:30:31 +02:00
miconis
53fd624c34
implemented test for sparkcreatesimrels
2020-04-03 18:32:25 +02:00
miconis
a61763d149
structure for sparksimrel changed to be compliant with mockito testing
2020-04-02 18:37:53 +02:00
Sandro La Bruzzo
0cd022ad6a
merge with master
2020-03-26 14:08:29 +01:00
Claudio Atzori
71ae7dd272
renamed module dnet-dedup to dnet-dedup-openaire
2020-03-25 15:57:09 +01:00