dnet-hadoop/dhp-workflows
Giambattista Bloisi 02636e802c SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests
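The incremental connected component idea mentioned above can be sketched roughly as follows. This is only an illustration under assumptions, not the actual BlockProcessor code: the class name `IncrementalComponents`, the `processBlock` helper and the index-based similarity predicate are all hypothetical. The point is that once two records are known to belong to the same component, the (expensive) similarity comparison between them can be skipped.

```java
import java.util.function.BiPredicate;

// Hypothetical sketch: union-find used to skip comparisons inside one block.
public class IncrementalComponents {
    private final int[] parent;

    public IncrementalComponents(int size) {
        parent = new int[size];
        for (int i = 0; i < size; i++) {
            parent[i] = i; // every record starts in its own component
        }
    }

    // Returns the representative of x's component (with path halving).
    public int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Merges the components of a and b; returns false if already merged.
    public boolean union(int a, int b) {
        int ra = find(a);
        int rb = find(b);
        if (ra == rb) {
            return false;
        }
        parent[ra] = rb;
        return true;
    }

    // Compares the n records of a block pairwise, skipping pairs that are
    // already connected. Returns the number of comparisons actually executed.
    public static int processBlock(int n, BiPredicate<Integer, Integer> similar,
                                   IncrementalComponents cc) {
        int comparisons = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (cc.find(i) == cc.find(j)) {
                    continue; // already in the same component: nothing to learn
                }
                comparisons++;
                if (similar.test(i, j)) {
                    cc.union(i, j);
                }
            }
        }
        return comparisons;
    }
}
```

In a block of four mutually similar records this sketch executes three comparisons instead of the six required by a full pairwise scan; when no records match, it still performs all six.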

SparkWhitelistSimRels: use left semi join for clarity and performance
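For readers unfamiliar with the term, the semantics of a left semi join can be illustrated in plain Java. This is a hypothetical sketch (the class `LeftSemiJoinSketch` and its method are not part of the project); in Spark the same effect is obtained by passing the `left_semi` join type to a Dataset join. Only rows of the left side are returned, each at most once, and no columns of the right side are carried along.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of left semi join semantics on in-memory collections.
public class LeftSemiJoinSketch {

    // Keeps the left rows whose key appears among rightKeys.
    // Duplicated right keys never duplicate a left row.
    public static <T, K> List<T> leftSemiJoin(List<T> left,
                                              Function<T, K> leftKey,
                                              Collection<K> rightKeys) {
        Set<K> keys = new HashSet<>(rightKeys);
        List<T> out = new ArrayList<>();
        for (T row : left) {
            if (keys.contains(leftKey.apply(row))) {
                out.add(row);
            }
        }
        return out;
    }
}
```

Compared with an inner join followed by a projection and a de-duplication step, the intent (filter the left side by existence on the right) is stated directly.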

SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX
- Refactored to use window-based sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for a missing or incorrect date instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions
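The effect of the "tomorrow" filler can be sketched as follows. This is a hedged illustration, not the project's comparator: the class `PivotDateSketch`, its method names, the ISO date format and the assumption that earlier dates are preferred when choosing a pivot are all hypothetical. With a "2000-01-01" filler, a record with no usable date would look older than most real records; with "tomorrow" it sorts after every record that carries a valid past date.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Comparator;

// Hypothetical sketch: date comparison with "tomorrow" as filler value.
public class PivotDateSketch {

    // Parses an ISO date (yyyy-MM-dd); missing or malformed input
    // falls back to the day after 'today'.
    public static LocalDate dateOrTomorrow(String raw, LocalDate today) {
        if (raw == null || raw.isEmpty()) {
            return today.plusDays(1);
        }
        try {
            return LocalDate.parse(raw);
        } catch (DateTimeParseException e) {
            return today.plusDays(1);
        }
    }

    // Orders raw date strings from oldest to newest (assumed preference),
    // so unusable dates end up last.
    public static Comparator<String> oldestFirst(LocalDate today) {
        return Comparator.comparing((String raw) -> dateOrTomorrow(raw, today));
    }
}
```

Passing `today` explicitly keeps the sketch deterministic; real code would more likely read the current date once per run.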

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
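Why a reduce-style merge needs less memory than a map-over-group can be shown with a plain Java analogy. The sketch below is hypothetical (class `ReduceGroupsSketch` and its method are not project code, and it does not use the Spark API): records are folded pairwise into a single accumulator per key, so the full list of group members never has to be held at once, which is what a function receiving the whole group would require.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical sketch: pairwise reduction per key instead of materialized groups.
public class ReduceGroupsSketch {

    // Folds each record into the accumulator of its key using 'merge'.
    // Only one merged value per key is retained at any time.
    public static <T, K> Map<K, T> reduceByKey(Iterable<T> records,
                                               Function<T, K> key,
                                               BinaryOperator<T> merge) {
        Map<K, T> acc = new HashMap<>();
        for (T record : records) {
            acc.merge(key.apply(record), record, merge);
        }
        return acc;
    }
}
```

The trade-off is that the merge function must work on two records at a time, which is usually a natural fit for record merging.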
2024-01-10 22:59:52 +01:00
dhp-actionmanager [AMF] docs 2023-10-12 10:05:46 +02:00
dhp-aggregation uploaded input parameters on CreateBaseline WF 2023-12-18 12:21:55 +01:00
dhp-blacklist Use scala.binary.version property to resolve scala maven dependencies 2023-07-24 11:13:48 +02:00
dhp-broker-events code formatting 2023-10-02 11:04:36 +02:00
dhp-dedup-openaire SparkCreateSimRels: 2024-01-10 22:59:52 +01:00
dhp-doiboost [doiboost - preprocess] remove transition to orcid preparation from sequence of steps at the beginning of the workflow 2023-12-15 12:24:55 +01:00
dhp-enrichment added properties file in the folder for the workflow of result to organization from inst repo propagation. Changed the path in the classes implementing the propagation 2023-12-22 14:50:05 +01:00
dhp-graph-mapper fixed conflicts 2024-01-10 11:03:42 +01:00
dhp-graph-provision refactoring after compilation 2023-12-20 15:57:26 +01:00
dhp-impact-indicators Run CC and RAM sequentially in dhp-impact-indicators WF 2023-09-13 08:59:40 +02:00
dhp-stats-actionsets Update StatsAtomicActionsJob.java 2023-12-01 11:35:01 +02:00
dhp-stats-promote Merge pull request 'Updates Promotion DBs' (#321) from antonis.lempesis/dnet-hadoop:beta into beta 2023-08-07 12:09:16 +02:00
dhp-stats-update Merge pull request 'Changes for tables and creation of the new indicator indi_is_result_accessible' (#363) from antonis.lempesis/dnet-hadoop:beta into beta 2023-12-01 15:05:23 +01:00
dhp-swh [SWH] renamed 'Software Heritage Identifier' to 'Software Hash Identifier' 2023-10-13 10:09:26 +02:00
dhp-usage-raw-data-update Use scala.binary.version property to resolve scala maven dependencies 2023-07-24 11:13:48 +02:00
dhp-usage-stats-build Use scala.binary.version property to resolve scala maven dependencies 2023-07-24 11:13:48 +02:00
dhp-workflow-profiles [maven-release-plugin] prepare for next development iteration 2022-04-07 13:32:22 +02:00
src/site added mvn site for dnet-hadoop project 2021-11-16 15:16:28 +01:00
pom.xml cleanup & docs 2023-10-12 12:23:20 +02:00