dnet-hadoop

Giambattista Bloisi 02636e802c	2024-01-10 22:59:52 +01:00
SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests
SparkWhitelistSimRels: use left semi join for clarity and performance
SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions
DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
consistency/oozie_app	[dedup] added isLookupUrl to the graph consistency workflow definition, required now by the entity grouping phase	2023-12-06 11:06:46 +01:00
openorgs/oozie_app	set configuration property in openorgs duplicates wf	2021-10-07 15:39:55 +02:00
scan/oozie_app	SparkCreateSimRels:	2024-01-10 22:59:52 +01:00
statistics/oozie_app	SparkBlockStats allows repartitioning the input RDD via the numPartitions workflow parameter	2020-07-13 20:09:06 +02:00
copyOpenorgsMergeRels_parameters.json	implementation of the openorgs wfs, implementation of the raw_all wf to migrate openorgs db entities	2021-02-10 11:51:50 +01:00
copyOpenorgs_parameters.json	bug fix: lookupurl parameter added to dedup record job	2021-04-13 09:08:05 +02:00
createBlockStats_parameters.json	SparkBlockStats allows repartitioning the input RDD via the numPartitions workflow parameter	2020-07-13 20:09:06 +02:00
createCC_parameters.json	SparkCreateSimRels:	2024-01-10 22:59:52 +01:00
createDedupRecord_parameters.json	various refactorings on the dnet-dedup-openaire workflow	2020-04-18 12:06:23 +02:00
createSimRels_parameters.json	added to CreateSimRel the feature to write time log	2023-06-28 11:20:58 +02:00
prepareNewOrgs_parameters.json	bug fix in graph-mapper, changes in the implementation of the openorgs wf to create relations and populate openorgs db	2021-02-26 10:19:28 +01:00
prepareOrgRels_parameters.json	bug fix in graph-mapper, changes in the implementation of the openorgs wf to create relations and populate openorgs db	2021-02-26 10:19:28 +01:00
propagateRelation_parameters.json	Consistency graph workflow merges all the entities by ID	2020-11-25 14:55:32 +01:00
updateEntity_parameters.json	configurable number of partitions used in the SparkCreateSimRels phase	2020-07-13 16:07:07 +02:00
updateOpenorgsJob_parameters.json	implementation of the http call for the update of openorgs suggestions	2021-10-07 11:29:11 +02:00
whitelistSimRels_parameters.json	implementation of the whitelist for similarity relations	2021-09-20 16:21:47 +02:00
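
The latest commit message mentions a "reduced complexity Levenshtein with threshold" for title comparison. The listing does not show that code, so the following is only a minimal sketch of the general idea, not the project's implementation: the function name `bounded_levenshtein` and the `-1` sentinel are assumptions made for illustration. It stops early when the distance can no longer fall within the threshold; a production version would typically also restrict each row to a diagonal band.

```python
def bounded_levenshtein(a: str, b: str, threshold: int) -> int:
    """Return the edit distance between a and b, or -1 if it exceeds threshold.

    Hypothetical helper for illustration only; not taken from dnet-hadoop.
    """
    if threshold < 0:
        raise ValueError("threshold must be non-negative")

    # The distance is at least the length difference: cheap early exit.
    if abs(len(a) - len(b)) > threshold:
        return -1

    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        # Row minima never decrease, so a row entirely above the threshold
        # means the final distance is above it too.
        if min(current) > threshold:
            return -1
        previous = current

    return previous[-1] if previous[-1] <= threshold else -1
```

A caller comparing two titles would treat `-1` as "not similar enough" and skip any further, more expensive comparison for that pair.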