forked from D-Net/dnet-hadoop
Giambattista Bloisi
02636e802c
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure

| ||
---|---|---|
dhp-actionmanager | ||
dhp-aggregation | ||
dhp-blacklist | ||
dhp-broker-events | ||
dhp-dedup-openaire | ||
dhp-doiboost | ||
dhp-enrichment | ||
dhp-graph-mapper | ||
dhp-graph-provision | ||
dhp-impact-indicators | ||
dhp-stats-actionsets | ||
dhp-stats-promote | ||
dhp-stats-update | ||
dhp-swh | ||
dhp-usage-raw-data-update | ||
dhp-usage-stats-build | ||
dhp-workflow-profiles | ||
src/site | ||
pom.xml |
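
The commit message above mentions a "reduced complexity Levenshtein with threshold". The following is only a minimal sketch of that general idea, not the code in this repository: the class name `BoundedLevenshtein`, the method `distance` and the `-1` sentinel are hypothetical names chosen for illustration. It computes the edit distance row by row and quits as soon as every cell in the current row exceeds the threshold, since the final distance can never be smaller than the minimum of any row.

```java
// Hypothetical illustration; not taken from dnet-hadoop.
class BoundedLevenshtein {

    /**
     * Returns the edit distance between a and b,
     * or -1 as soon as it is known to exceed maxDistance.
     */
    static int distance(String a, String b, int maxDistance) {
        // The length difference is a lower bound on the edit distance.
        if (Math.abs(a.length() - b.length()) > maxDistance) {
            return -1;
        }
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // Early quit: no later row can go below the minimum of this row.
            if (rowMin > maxDistance) {
                return -1;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        int result = prev[b.length()];
        return result <= maxDistance ? result : -1;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting", 3)); // 3
        System.out.println(distance("kitten", "sitting", 2)); // -1
    }
}
```

A caller that only needs to know whether two titles are "close enough" can treat `-1` as a non-match and skip the rest of the comparison. A banded variant that fills only the cells near the diagonal would reduce the work further, but it is omitted here to keep the sketch short.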