Master branch updates from beta January 2024 #385

Manually merged
claudio.atzori merged 0 commits from beta into master 2024-01-26 16:09:14 +01:00
  • #367 - Improvements and refactoring in Dedup
  • #373 - Graph enrichment workflows implemented as a single step
  • #374 - Refined mapping for the extraction of the original resource type used in the context of the COAR resource type attribution
  • #375 - [FoS integration] fix issue on FoS integration: removing the null values from FoS
  • #379 - [Graph provision] Use the context API instead of the ISLookUp service
  • #380 - [Graph provision] updated param specification for the XML converter job
  • #383 - [DOIBoost] Fixed problem on missing author in Crossref Mapping
  • 3e96777cc4 [metadata collection] increased logging from the oai-pmh metadata collection process
claudio.atzori added 63 commits 2024-01-26 16:03:33 +01:00
02636e802c SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching the cluster key, instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use a reduced-complexity Levenshtein with threshold in levensteinTitle
- Use a reduced-complexity AuthorsMatch with an early-quit threshold
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use the new clustering configurations in the Dedup tests
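The "reduced complexity Levenshtein with threshold" mentioned above can be sketched as follows. This is a minimal, hypothetical illustration of the technique, not the actual dnet-hadoop comparator: once every value in a dynamic-programming row exceeds the threshold, the final distance can no longer come back under it, so the comparison quits early instead of completing the full matrix.

```java
// Hypothetical sketch of a threshold-limited Levenshtein distance,
// illustrating the early-quit idea behind comparators like levensteinTitle.
public class ThresholdLevenshtein {

    /** Returns the edit distance, or -1 if it provably exceeds {@code threshold}. */
    public static int distance(String a, String b, int threshold) {
        // The distance is at least the length difference, so quit immediately.
        if (Math.abs(a.length() - b.length()) > threshold) return -1;
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // Early quit: values in later rows can only grow from the row minimum.
            if (rowMin > threshold) return -1;
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()] <= threshold ? prev[b.length()] : -1;
    }
}
```

For dedup, the -1 result is enough to reject a candidate pair without ever computing the exact distance of clearly dissimilar titles.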

SparkWhitelistSimRels: use left semi join for clarity and performance
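The semantics of the left semi join used here can be illustrated without Spark (names below are hypothetical): each left row survives if and only if its key appears on the right, and it is emitted at most once, unlike an inner join, which can duplicate rows and drags the right-hand columns into the output. In Spark this corresponds to the built-in `left_semi` join type.

```java
// Plain-Java sketch of left-semi-join semantics (hypothetical names).
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SemiJoinSketch {
    public static List<String> leftSemiJoin(List<String> leftKeys, Set<String> rightKeys) {
        return leftKeys.stream()
            .filter(rightKeys::contains) // a left row survives iff a match exists
            .collect(Collectors.toList());
    }
}
```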

SparkCreateMergeRels:
- Use a new connected components algorithm that converges faster than the one provided by Spark GraphX
- Refactored to use window-based sorting rather than groupBy to reduce memory pressure
- Use a historical pivot table to generate singleton rels and merged rels, keeping continuity with the dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as the filler for missing or incorrect dates instead of "2000-01-01"
- Changed the generation of ids of type dedup_wf_001 to avoid collisions
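The connected components computation referenced in both SparkCreateSimRels and SparkCreateMergeRels treats sim rels as edges and turns each component of the similarity graph into one dedup group. As a hypothetical, non-distributed sketch of the idea (not the actual dnet-hadoop implementation), a union-find structure does the same grouping; label-propagation variants of it converge in far fewer rounds than a generic Pregel-style iteration.

```java
// Hypothetical union-find sketch: each sim rel merges two records'
// components; find() returns the representative of a record's dedup group.
import java.util.HashMap;
import java.util.Map;

public class UnionFind {
    private static final Map<String, String> parent = new HashMap<>();

    public static String find(String x) {
        parent.putIfAbsent(x, x);
        String p = parent.get(x);
        if (!p.equals(x)) {
            p = find(p);
            parent.put(x, p); // path compression keeps lookup chains short
        }
        return p;
    }

    public static void union(String a, String b) {
        String ra = find(a), rb = find(b);
        if (!ra.equals(rb)) parent.put(rb, ra); // merge the two components
    }
}
```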

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
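The memory argument behind the reduceGroups change can be shown outside Spark (class and merge logic below are hypothetical): a mapGroups-style step receives the whole group and tends to materialize it before merging, whereas a reduce-style step folds records two at a time, so only one partial result plus one incoming record needs to be held.

```java
// Illustrative, non-Spark contrast behind the DedupRecordFactory change.
import java.util.List;

public class PairwiseMerge {
    /** Fold a group into one dedup record by repeated two-way merges. */
    public static String mergePairwise(List<String> group) {
        String acc = "";
        for (String record : group) {
            acc = merge(acc, record); // only acc + one record in memory at a time
        }
        return acc;
    }

    // Hypothetical two-way merge: keep the richer (here: longer) record.
    static String merge(String a, String b) {
        return a.length() >= b.length() ? a : b;
    }
}
```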
3c66e3bd7b Create dedup record for "merged" pivots
Do not create dedup records for groups that have more than 20 different acceptance dates
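The "tomorrow as filler" rule from the pivot comparator above can be sketched like this (a minimal, hypothetical sketch, assuming ISO-formatted dates): a missing or unparsable acceptance date is replaced with tomorrow's date, so such records sort after every record with a real date instead of being pinned to an artificial past date like "2000-01-01".

```java
// Hypothetical sketch of the date filler used when ranking pivot candidates.
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

public class PivotDateFiller {
    /** Parse an ISO date, falling back to "tomorrow" for missing/bad values. */
    public static LocalDate parseOrTomorrow(String raw) {
        try {
            return LocalDate.parse(raw); // expects yyyy-MM-dd
        } catch (DateTimeParseException | NullPointerException e) {
            return LocalDate.now().plusDays(1); // sorts after every real date
        }
    }
}
```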
claudio.atzori manually merged commit 4d0c59669b into master 2024-01-26 16:09:14 +01:00
Reference: D-Net/dnet-hadoop#385