dnet-hadoop

Giambattista Bloisi 02636e802c	2024-01-10 22:59:52 +01:00

SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching the cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced-complexity Levenshtein with threshold in levensteinTitle
- Use reduced-complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels and merged rels, and to keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect dates instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure

model	SparkCreateSimRels:	2024-01-10 22:59:52 +01:00
AbstractSparkAction.java	SparkCreateSimRels:	2024-01-10 22:59:52 +01:00
DatePicker.java	Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface	2023-07-24 15:36:24 +02:00
DedupRecordFactory.java	SparkCreateSimRels:	2024-01-10 22:59:52 +01:00
DedupUtility.java	Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface	2023-07-24 15:36:24 +02:00
IdGenerator.java	SparkCreateSimRels:	2024-01-10 22:59:52 +01:00
IdentifierComparator.java	added IdentifierComparator	2022-11-09 14:20:59 +01:00
SparkBlockStats.java	Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface	2023-07-24 15:36:24 +02:00
SparkCopyOpenorgsMergeRels.java	[dedup] use common saveParquet and save methods to ensure outputs are compressed	2023-10-16 10:55:47 +02:00
SparkCopyOpenorgsSimRels.java	suggestions from SonarLint	2021-08-11 12:13:22 +02:00
SparkCopyRelationsNoOpenorgs.java	[dedup] use common saveParquet and save methods to ensure outputs are compressed	2023-10-16 10:55:47 +02:00
SparkCreateDedupRecord.java	suggestions from SonarLint	2021-08-11 12:13:22 +02:00
SparkCreateMergeRels.java	SparkCreateSimRels:	2024-01-10 22:59:52 +01:00
SparkCreateOrgsDedupRecord.java	[dedup] use common saveParquet and save methods to ensure outputs are compressed	2023-10-16 10:55:47 +02:00
SparkCreateSimRels.java	[dedup] use common saveParquet and save methods to ensure outputs are compressed	2023-10-16 10:55:47 +02:00
SparkPrepareNewOrgs.java	suggestions from SonarLint	2021-08-11 12:13:22 +02:00
SparkPrepareOrgRels.java	suggestions from SonarLint	2021-08-11 12:13:22 +02:00
SparkPropagateRelation.java	code formatting	2023-10-06 12:31:17 +02:00
SparkUpdateEntity.java	suggestions from SonarLint	2021-08-11 12:13:22 +02:00
SparkWhitelistSimRels.java	SparkCreateSimRels:	2024-01-10 22:59:52 +01:00
UpdateOpenorgsJob.java	set configuration property in openorgs duplicates wf	2021-10-07 15:39:55 +02:00