Improvements and refactoring in Dedup #367

Merged
giambattista.bloisi merged 5 commits from dedup_increasenumofblocks into beta 4 months ago

SparkCreateSimRels:

  • Create dedup blocks from the complete queue of records matching a cluster key instead of truncating the results
  • Clean titles once before clustering and similarity comparisons
  • Added support for filtered fields in model
  • Added support for sorting List fields in model
  • Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
  • Added new maxLengthMatch comparator function
  • Use a reduced-complexity Levenshtein with a threshold in levensteinTitle (see the sketch after this list)
  • Use a reduced-complexity AuthorsMatch with a threshold for early quit
  • Use incremental Connected Components to decrease the number of comparisons during similarity matching in BlockProcessor
  • Use new clusterings configuration in Dedup tests
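
To make the early-quit idea behind the thresholded comparators concrete, below is a minimal Java sketch built on the bounded LevenshteinDistance from Apache Commons Text; the class, method and threshold handling are illustrative and this is not the actual levensteinTitle implementation.

import org.apache.commons.text.similarity.LevenshteinDistance;

// Illustrative sketch of a thresholded title similarity: the bounded algorithm
// stops as soon as the edit distance is proven to exceed the allowed maximum,
// instead of filling the full O(n*m) dynamic-programming table.
public class ThresholdedTitleSimilarity {

    // Returns a similarity in [0,1], or 0.0 when the threshold cannot be met.
    public static double similarity(String a, String b, double threshold) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0;
        }
        // Largest number of edits still compatible with the requested similarity.
        int maxEdits = (int) Math.floor((1.0 - threshold) * maxLen);
        // apply() returns -1 when the distance exceeds maxEdits (early quit).
        int distance = new LevenshteinDistance(maxEdits).apply(a, b);
        return distance < 0 ? 0.0 : 1.0 - ((double) distance) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(similarity("knowledge graph deduplication",
                "knowledge graphs deduplication", 0.9));
    }
}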

SparkWhitelistSimRels:

  • Use a left semi join for clarity and performance (see the sketch below)
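
As a reference, here is a minimal sketch of a left semi join with the Spark Dataset API in Java; the input paths and column names are invented for illustration and do not reflect the actual SparkWhitelistSimRels code.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LeftSemiJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("whitelist-semi-join-sketch")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical inputs: whitelisted similarity pairs and the ids of known entities.
        Dataset<Row> whitelistRels = spark.read().parquet("/tmp/whitelist_rels");
        Dataset<Row> entityIds = spark.read().parquet("/tmp/entity_ids");

        // A left semi join keeps only whitelist rows whose source id exists among
        // the entities; no columns from the right side are carried along, which
        // states the intent directly and avoids dragging unused data through the shuffle.
        Dataset<Row> validRels = whitelistRels.join(
                entityIds,
                whitelistRels.col("source").equalTo(entityIds.col("id")),
                "left_semi");

        validRels.show(false);
        spark.stop();
    }
}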

SparkCreateMergeRels:

  • Use a new connected component algorithm that converges faster than the one provided by Spark GraphX
  • Refactored to use window sorting rather than groupBy to reduce memory pressure (sketched below)
  • Use the historical pivot table to generate singleton rels and merged rels, and to keep continuity with the dedup records used in the past
  • The comparator for pivot record selection now uses "next week" as the filler for missing or incorrect dates instead of "2000-01-01"
  • Changed the algorithm for generating ids of type dedup_wf_001 to avoid collisions
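
The window-based pivot selection can be pictured with a short Java sketch like the one below; the input, column names and ordering criteria are invented for illustration and are not the actual SparkCreateMergeRels logic.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.row_number;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

public class WindowPivotSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("merge-rels-window-sketch")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical input: one row per entity, tagged with its connected component id.
        Dataset<Row> members = spark.read().parquet("/tmp/cc_members");

        // Rank the members of each component by the pivot-selection criteria;
        // sorting inside a window can spill to disk instead of materialising
        // whole groups in memory as a groupBy with per-group collections would.
        WindowSpec byComponent = Window
                .partitionBy(col("componentId"))
                .orderBy(col("dateofacceptance").asc_nulls_last(), col("id").asc());

        Dataset<Row> pivots = members
                .withColumn("rank", row_number().over(byComponent))
                .filter(col("rank").equalTo(1));

        pivots.show(false);
        spark.stop();
    }
}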

DedupRecordFactory:

  • Use reduceGroups instead of mapGroups to decrease memory pressure (see the sketch below)
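
A minimal Java sketch of the reduceGroups pattern, with plain strings standing in for the real entities; the grouping key and merge function are placeholders, not the actual DedupRecordFactory logic.

import java.util.Arrays;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.ReduceFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class ReduceGroupsSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("dedup-record-reduce-sketch")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical input: "dedupId|payload" strings standing in for entities
        // already tagged with the id of the dedup record they belong to.
        Dataset<String> entities = spark.createDataset(
                Arrays.asList("d1|aaa", "d1|aaaa", "d2|x"),
                Encoders.STRING());

        // reduceGroups combines elements two at a time and allows partial
        // aggregation before the shuffle, whereas mapGroups has to ship every
        // element of a group to a single task before producing its output.
        Dataset<Tuple2<String, String>> dedupRecords = entities
                .groupByKey((MapFunction<String, String>) s -> s.split("\\|")[0], Encoders.STRING())
                .reduceGroups((ReduceFunction<String>) (a, b) -> a.length() >= b.length() ? a : b);

        dedupRecords.show(false);
        spark.stop();
    }
}
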
giambattista.bloisi added 2 commits 5 months ago
b0fc113749 SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold for early quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the Spark GraphX provided algorithm
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
giambattista.bloisi added 1 commit 5 months ago
giambattista.bloisi added 1 commit 4 months ago
giambattista.bloisi added 1 commit 4 months ago
ebfeec26b5 Create dedup record for "merged" pivots
Do not create dedup records for groups that have more than 20 different acceptance dates
giambattista.bloisi force-pushed dedup_increasenumofblocks from ebfeec26b5 to 3c66e3bd7b 4 months ago
giambattista.bloisi merged commit a88dce5bf3 into beta 4 months ago
giambattista.bloisi deleted branch dedup_increasenumofblocks 4 months ago
The pull request has been merged as a88dce5bf3.
Command line instructions:

Step 1:

From your project repository, check out a new branch and test the changes.
git checkout -b dedup_increasenumofblocks beta
git pull origin dedup_increasenumofblocks

Step 2:

Merge the changes and update on Gitea.
git checkout beta
git merge --no-ff dedup_increasenumofblocks
git push origin beta
Reference: D-Net/dnet-hadoop#367