Master branch updates from beta January 2024 #385

Manually merged
claudio.atzori merged 0 commits from beta into master 3 months ago
  • #367 - Improvements and refactoring in Dedup
  • #373 - Graph enrichment workflows implemented as a single step
  • #374 - Refined mapping for the extraction of the original resource type used in the context of the COAR resource type attribution
  • #375 - [FoS integration] fix issue on FoS integration: remove null values from FoS
  • #379 - [Graph provision] Use the context API instead of the ISLookUp service
  • #380 - [Graph provision] updated param specification for the XML converter job
  • #383 - [DOIBoost] Fixed problem with missing authors in the Crossref mapping
  • 3e96777cc4 [metadata collection] increased logging in the OAI-PMH metadata collection process
claudio.atzori added 63 commits 3 months ago
02636e802c SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching the cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use a reduced-complexity Levenshtein with a threshold in levensteinTitle (see the sketch after this list)
- Use a reduced-complexity AuthorsMatch with a threshold early-quit
- Use incremental connected components to reduce the number of comparisons in the BlockProcessor similarity match
- Use the new clusterings configuration in the Dedup tests
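
As an illustration of the threshold early-quit idea above, here is a minimal Java sketch of a bounded Levenshtein distance. It is an illustrative example, not the project's levensteinTitle implementation: once every cell of the current DP row exceeds the threshold, no alignment can come back under it, so the computation stops early.

// Minimal sketch of a threshold-bounded Levenshtein distance (illustrative,
// not the dnet-hadoop implementation). Returns the edit distance, or -1 when
// it is guaranteed to exceed maxDist.
public final class BoundedLevenshtein {

    public static int distance(String a, String b, int maxDist) {
        // The distance is at least the length difference: quit immediately.
        if (Math.abs(a.length() - b.length()) > maxDist) return -1;
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // Early quit: every path through this row already exceeds the bound.
            if (rowMin > maxDist) return -1;
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()] <= maxDist ? prev[b.length()] : -1;
    }

    public static void main(String[] args) {
        System.out.println(distance("knowledge graph", "knowledge graphs", 2)); // 1
        System.out.println(distance("a very different title", "unrelated text", 2)); // -1
    }
}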

SparkWhitelistSimRels: use a left semi join for clarity and performance
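
As a rough illustration of the pattern (the real job's inputs and schema differ; the columns source, target, and id below are assumptions), a left semi join keeps only the left-hand rows that have a match on the right, without duplicating rows or pulling in right-hand columns:

import static org.apache.spark.sql.functions.col;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LeftSemiJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("left-semi-sketch").master("local[*]").getOrCreate();

        // Toy whitelisted similarity pairs and the ids known to the graph.
        Dataset<Row> pairs = spark
                .createDataset(Arrays.asList("a|b", "a|z"), Encoders.STRING())
                .selectExpr("split(value, '\\\\|')[0] as source",
                        "split(value, '\\\\|')[1] as target");
        Dataset<Row> entities = spark
                .createDataset(Arrays.asList("a", "b"), Encoders.STRING())
                .toDF("id");

        // left_semi acts as an existence filter: a pair survives only if both
        // of its endpoints match an entity id, and no extra columns appear.
        Dataset<Row> valid = pairs
                .join(entities, pairs.col("source").equalTo(entities.col("id")), "left_semi")
                .join(entities, col("target").equalTo(entities.col("id")), "left_semi");

        valid.show(); // only the (a, b) pair survives
        spark.stop();
    }
}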

SparkCreateMergeRels:
- Use a new connected-components algorithm that converges faster than the one provided by Spark GraphX
- Refactored to use window-based sorting rather than groupBy to reduce memory pressure (see the sketch after this list)
- Use a historical pivot table to generate singleton rels and merged rels, and to keep continuity with the dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as a filler for missing or incorrect dates instead of "2000-01-01"
- Changed the generation of ids of type dedup_wf_001 to avoid collisions
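
The following self-contained sketch illustrates the windowing and "tomorrow" filler ideas from the list above. The schema (componentId, id, dateofacceptance) is an assumption, not the project's model: row_number over a window partitioned by component lets Spark stream each partition in sort order instead of materialising whole groups as a groupBy-based selection would.

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

public class WindowPivotSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("window-pivot-sketch").master("local[*]").getOrCreate();

        // Toy rows: componentId|recordId|dateofacceptance (blank = missing).
        Dataset<Row> records = spark
                .createDataset(Arrays.asList(
                        "c1|r1|2019-03-01", "c1|r2|", "c2|r3|2021-07-15"),
                        Encoders.STRING())
                .selectExpr("split(value, '\\\\|')[0] as componentId",
                        "split(value, '\\\\|')[1] as id",
                        "nullif(split(value, '\\\\|')[2], '') as dateofacceptance");

        // Missing dates sort last by substituting "tomorrow" as the filler,
        // echoing the pivot comparator change described above.
        String tomorrow = java.time.LocalDate.now().plusDays(1).toString();
        WindowSpec byComponent = Window.partitionBy("componentId")
                .orderBy(coalesce(col("dateofacceptance"), lit(tomorrow)).asc(),
                        col("id").asc());

        // row_number streams each partition in sort order; no whole group has
        // to be collected in memory.
        Dataset<Row> pivots = records
                .withColumn("rn", row_number().over(byComponent))
                .filter(col("rn").equalTo(1))
                .drop("rn");

        pivots.show(); // one pivot per component: r1 for c1, r3 for c2
        spark.stop();
    }
}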

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
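
To see why this lowers memory pressure, consider a toy sketch (illustrative, not DedupRecordFactory's actual code): mapGroups hands the function an iterator over the entire group, while reduceGroups folds the group pairwise and only ever keeps one running value per key.

import java.util.Arrays;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.ReduceFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class ReduceGroupsSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("reduce-groups-sketch").master("local[*]").getOrCreate();

        // Toy stand-ins for dedup records: "<groupKey>|<payload>" strings.
        Dataset<String> records = spark.createDataset(
                Arrays.asList("g1|a", "g1|bb", "g2|ccc"), Encoders.STRING());

        // reduceGroups folds each group with a binary merge, keeping a single
        // running value per key instead of materialising the whole group.
        Dataset<Tuple2<String, String>> merged = records
                .groupByKey((MapFunction<String, String>) r -> r.split("\\|")[0],
                        Encoders.STRING())
                .reduceGroups((ReduceFunction<String>) (a, b) ->
                        a.length() >= b.length() ? a : b); // keep the richer record

        merged.show(); // (g1, g1|bb) and (g2, g2|ccc)
        spark.stop();
    }
}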
3c66e3bd7b Create dedup record for "merged" pivots
Do not create dedup records for groups that have more than 20 different acceptance dates
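
A hypothetical sketch of such a guard, assuming columns named dedupId and dateofacceptance (not the project's actual schema): count the distinct acceptance dates per group and only let groups at or below the threshold through.

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AcceptanceDateGuardSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("acceptance-date-guard").master("local[*]").getOrCreate();

        // Toy rows: dedupId|dateofacceptance.
        Dataset<Row> records = spark
                .createDataset(Arrays.asList(
                        "g1|2020-01-01", "g1|2020-01-02", "g2|2021-05-05"),
                        Encoders.STRING())
                .selectExpr("split(value, '\\\\|')[0] as dedupId",
                        "split(value, '\\\\|')[1] as dateofacceptance");

        // Keep only groups with at most 20 distinct acceptance dates; groups
        // above the threshold get no dedup record.
        Dataset<Row> eligible = records
                .groupBy("dedupId")
                .agg(countDistinct("dateofacceptance").as("nDates"))
                .filter(col("nDates").leq(20));

        eligible.show();
        spark.stop();
    }
}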
claudio.atzori merged commit 4d0c59669b into master manually 3 months ago
The pull request has been manually merged as 4d0c59669b.
Command line instructions:

Step 1:

From your project repository, check out a new branch and test the changes.
git checkout -b beta master
git pull origin beta

Step 2:

Merge the changes and update on Gitea.
git checkout master
git merge --no-ff beta
git push origin master

Reference: D-Net/dnet-hadoop#385