Improvements and refactoring in Dedup #367

Merged
giambattista.bloisi merged 5 commits from dedup_increasenumofblocks into beta 4 months ago

SparkCreateSimRels:

  • Create dedup blocks from the complete queue of records matching a cluster key instead of truncating the results
  • Clean titles once before clustering and similarity comparisons
  • Added support for filtered fields in model
  • Added support for sorting List fields in model
  • Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
  • Added new maxLengthMatch comparator function
  • Use a reduced-complexity Levenshtein with a threshold in levensteinTitle (see the sketch after this list)
  • Use a reduced-complexity AuthorsMatch with a threshold for early quit
  • Use incremental Connected Components to decrease the number of comparisons during similarity matching in BlockProcessor
  • Use new clusterings configuration in Dedup tests
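
To make the early-quit idea behind the thresholded comparators concrete, below is a minimal Java sketch built on the bounded LevenshteinDistance from Apache Commons Text; the class, method and threshold handling are illustrative and this is not the actual levensteinTitle implementation.

import org.apache.commons.text.similarity.LevenshteinDistance;

// Illustrative sketch of a thresholded title similarity: the bounded algorithm
// stops as soon as the edit distance is proven to exceed the allowed maximum,
// instead of filling the full O(n*m) dynamic-programming table.
public class ThresholdedTitleSimilarity {

    // Returns a similarity in [0,1], or 0.0 when the threshold cannot be met.
    public static double similarity(String a, String b, double threshold) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0;
        }
        // Largest number of edits still compatible with the requested similarity.
        int maxEdits = (int) Math.floor((1.0 - threshold) * maxLen);
        // apply() returns -1 when the distance exceeds maxEdits (early quit).
        int distance = new LevenshteinDistance(maxEdits).apply(a, b);
        return distance < 0 ? 0.0 : 1.0 - ((double) distance) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(similarity("knowledge graph deduplication",
                "knowledge graphs deduplication", 0.9));
    }
}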

SparkWhitelistSimRels:

  • Use a left semi join for clarity and performance (see the sketch below)
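
As a reference, here is a minimal sketch of a left semi join with the Spark Dataset API in Java; the input paths and column names are invented for illustration and do not reflect the actual SparkWhitelistSimRels code.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LeftSemiJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("whitelist-semi-join-sketch")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical inputs: whitelisted similarity pairs and the ids of known entities.
        Dataset<Row> whitelistRels = spark.read().parquet("/tmp/whitelist_rels");
        Dataset<Row> entityIds = spark.read().parquet("/tmp/entity_ids");

        // A left semi join keeps only whitelist rows whose source id exists among
        // the entities; no columns from the right side are carried along, which
        // states the intent directly and avoids dragging unused data through the shuffle.
        Dataset<Row> validRels = whitelistRels.join(
                entityIds,
                whitelistRels.col("source").equalTo(entityIds.col("id")),
                "left_semi");

        validRels.show(false);
        spark.stop();
    }
}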

SparkCreateMergeRels:

  • Use a new connected component algorithm that converges faster than the one provided by Spark GraphX
  • Refactored to use window sorting rather than groupBy to reduce memory pressure (sketched below)
  • Use the historical pivot table to generate singleton rels and merged rels, and to keep continuity with the dedup records used in the past
  • The comparator for pivot record selection now uses "next week" as the filler for missing or incorrect dates instead of "2000-01-01"
  • Changed the algorithm for generating ids of type dedup_wf_001 to avoid collisions
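
The window-based pivot selection can be pictured with a short Java sketch like the one below; the input, column names and ordering criteria are invented for illustration and are not the actual SparkCreateMergeRels logic.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.row_number;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

public class WindowPivotSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("merge-rels-window-sketch")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical input: one row per entity, tagged with its connected component id.
        Dataset<Row> members = spark.read().parquet("/tmp/cc_members");

        // Rank the members of each component by the pivot-selection criteria;
        // sorting inside a window can spill to disk instead of materialising
        // whole groups in memory as a groupBy with per-group collections would.
        WindowSpec byComponent = Window
                .partitionBy(col("componentId"))
                .orderBy(col("dateofacceptance").asc_nulls_last(), col("id").asc());

        Dataset<Row> pivots = members
                .withColumn("rank", row_number().over(byComponent))
                .filter(col("rank").equalTo(1));

        pivots.show(false);
        spark.stop();
    }
}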

DedupRecordFactory:

  • Use reduceGroups instead of mapGroups to decrease memory pressure (see the sketch below)
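
A minimal Java sketch of the reduceGroups pattern, with plain strings standing in for the real entities; the grouping key and merge function are placeholders, not the actual DedupRecordFactory logic.

import java.util.Arrays;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.ReduceFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class ReduceGroupsSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("dedup-record-reduce-sketch")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical input: "dedupId|payload" strings standing in for entities
        // already tagged with the id of the dedup record they belong to.
        Dataset<String> entities = spark.createDataset(
                Arrays.asList("d1|aaa", "d1|aaaa", "d2|x"),
                Encoders.STRING());

        // reduceGroups combines elements two at a time and allows partial
        // aggregation before the shuffle, whereas mapGroups has to ship every
        // element of a group to a single task before producing its output.
        Dataset<Tuple2<String, String>> dedupRecords = entities
                .groupByKey((MapFunction<String, String>) s -> s.split("\\|")[0], Encoders.STRING())
                .reduceGroups((ReduceFunction<String>) (a, b) -> a.length() >= b.length() ? a : b);

        dedupRecords.show(false);
        spark.stop();
    }
}
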
giambattista.bloisi added 2 commits 5 months ago
b0fc113749 SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold for early quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the Spark GraphX provided algorithm
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
giambattista.bloisi added 1 commit 5 months ago
giambattista.bloisi added 1 commit 4 months ago
giambattista.bloisi added 1 commit 4 months ago
ebfeec26b5 Create dedup record for "merged" pivots
Do not create dedup records for groups that have more than 20 different acceptance dates
giambattista.bloisi force-pushed dedup_increasenumofblocks from ebfeec26b5 to 3c66e3bd7b 4 months ago
giambattista.bloisi merged commit a88dce5bf3 into beta 4 months ago
giambattista.bloisi deleted branch dedup_increasenumofblocks 4 months ago
The pull request has been merged as a88dce5bf3.
Command line instructions:

Step 1:

From your project repository, check out a new branch and test the changes.
git checkout -b dedup_increasenumofblocks beta
git pull origin dedup_increasenumofblocks

Step 2:

Merge the changes and update on Gitea.
git checkout beta
git merge --no-ff dedup_increasenumofblocks
git push origin beta
Reference: D-Net/dnet-hadoop#367