dnet-hadoop/dhp-workflows/dhp-dedup-openaire/src/main/resources/eu/dnetlib/dhp/oa/dedup
Giambattista Bloisi 02636e802c SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests
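
The incremental connected-component idea listed above can be sketched as follows. This is a minimal illustration, not the BlockProcessor code: `UnionFind`, `process_block`, and `is_similar` are hypothetical names, and the sketch assumes that a pair of records already known to belong to the same component does not need to be compared again.

```python
from itertools import combinations


class UnionFind:
    """Disjoint-set forest with path halving; tracks components incrementally."""

    def __init__(self, items):
        self.parent = {item: item for item in items}

    def find(self, item):
        # Walk up to the root, shortening the path as we go.
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def process_block(records, is_similar):
    """Compare records pairwise, skipping pairs already in the same component.

    Returns (similarity pairs found, number of comparator calls).
    `is_similar` is a hypothetical stand-in for the real comparator chain.
    """
    components = UnionFind(range(len(records)))
    sim_rels, comparisons = [], 0
    for i, j in combinations(range(len(records)), 2):
        if components.find(i) == components.find(j):
            continue  # already connected: the comparison is redundant
        comparisons += 1
        if is_similar(records[i], records[j]):
            components.union(i, j)
            sim_rels.append((i, j))
    return sim_rels, comparisons
```

In a block of four records where three are duplicates, the sketch makes five comparator calls instead of six; the saving grows with the size of each duplicate group.
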

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect dates instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions
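
The effect of the "tomorrow" filler can be illustrated with a small sketch. It assumes, purely for illustration, that the comparator prefers the record with the oldest date and breaks ties by id; the actual selection criteria are not stated here. `date_or_tomorrow` and `pick_pivot` are hypothetical names. Under that assumption, a fixed filler such as "2000-01-01" would make undated records look older than most dated ones, while "tomorrow" pushes them behind every validly dated record.

```python
from datetime import date, datetime, timedelta


def date_or_tomorrow(raw, today=None):
    """Parse an ISO date (YYYY-MM-DD); fall back to tomorrow if missing or invalid."""
    today = today or date.today()
    try:
        return datetime.strptime(raw, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        # Missing (None) or malformed dates sort after any valid past date.
        return today + timedelta(days=1)


def pick_pivot(records, today=None):
    """Pick the record with the oldest usable date; ties are broken by id.

    Assumption: 'oldest date wins' is illustrative, not the confirmed rule.
    """
    return min(
        records,
        key=lambda r: (date_or_tomorrow(r.get("date"), today), r["id"]),
    )
```

With records dated `None`, `"2015-03-01"`, and `"not-a-date"`, the sketch selects the one dated 2015-03-01.
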

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
2024-01-10 22:59:52 +01:00
consistency/oozie_app [dedup] added isLookupUrl to the graph consistency workflow definition, required now by the entity grouping phase 2023-12-06 11:06:46 +01:00
openorgs/oozie_app set configuration property in openorgs duplicates wf 2021-10-07 15:39:55 +02:00
scan/oozie_app SparkCreateSimRels: 2024-01-10 22:59:52 +01:00
statistics/oozie_app SparkBlockStats allows repartitioning the input rdd via the numPartitions workflow parameter 2020-07-13 20:09:06 +02:00
copyOpenorgsMergeRels_parameters.json implementation of the openorgs wfs, implementation of the raw_all wf to migrate openorgs db entities 2021-02-10 11:51:50 +01:00
copyOpenorgs_parameters.json bug fix: lookupurl parameter added to dedup record job 2021-04-13 09:08:05 +02:00
createBlockStats_parameters.json SparkBlockStats allows repartitioning the input rdd via the numPartitions workflow parameter 2020-07-13 20:09:06 +02:00
createCC_parameters.json SparkCreateSimRels: 2024-01-10 22:59:52 +01:00
createDedupRecord_parameters.json various refactorings on the dnet-dedup-openaire workflow 2020-04-18 12:06:23 +02:00
createSimRels_parameters.json added to CreateSimRel the feature to write time log 2023-06-28 11:20:58 +02:00
prepareNewOrgs_parameters.json bug fix in graph-mapper, changes in the implementation of the openorgs wf to create relations and populate openorgs db 2021-02-26 10:19:28 +01:00
prepareOrgRels_parameters.json bug fix in graph-mapper, changes in the implementation of the openorgs wf to create relations and populate openorgs db 2021-02-26 10:19:28 +01:00
propagateRelation_parameters.json Consistency graph workflow merges all the entities by ID 2020-11-25 14:55:32 +01:00
updateEntity_parameters.json configurable number of partitions used in the SparkCreateSimRels phase 2020-07-13 16:07:07 +02:00
updateOpenorgsJob_parameters.json implementation of the http call for the update of openorgs suggestions 2021-10-07 11:29:11 +02:00
whitelistSimRels_parameters.json implementation of the whitelist for similarity relations 2021-09-20 16:21:47 +02:00