dnet-hadoop/dhp-workflows/dhp-dedup-openaire/src/test/resources/eu/dnetlib/dhp/dedup
Giambattista Bloisi 02636e802c SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching the cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold-based early exit
- Use incremental connected components to reduce the number of similarity comparisons in BlockProcessor
- Use new clusterings configuration in Dedup tests

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use a new connected component algorithm that converges faster than the algorithm provided by Spark GraphX
- Refactored to use window-based sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect dates instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
2024-01-10 22:59:52 +01:00
..
assertions implementation of the job to collect simrels from postgres db 2020-09-22 09:43:27 +02:00
conf SparkCreateSimRels: 2024-01-10 22:59:52 +01:00
entities merging duplicates using IdentifierComparator 2022-11-11 16:10:25 +01:00
json minor changes and bug fix 2021-03-29 10:07:12 +02:00
openorgs fixed filtering criteria applied in SparkCopyRelationsNoOpenorgs to keep the parent/child relations from OpenOrgs 2021-11-19 11:30:29 +01:00
pivot_history SparkCreateSimRels: 2024-01-10 22:59:52 +01:00
profiles merging duplicates using IdentifierComparator 2022-11-11 16:10:25 +01:00
root extended dedup testing 2022-11-17 16:13:43 +01:00
test new rels produced by dedup workflow must be unique 2020-05-08 19:00:38 +02:00
whitelist.simrels.txt implementation of the whitelist for similarity relations 2021-09-20 16:21:47 +02:00