dnet-hadoop/dhp-workflows
Giambattista Bloisi b0fc113749 SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests
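
The incremental connected component item above can be illustrated with a small union-find sketch. This is not the BlockProcessor code from the repository: the class `IncrementalComponents` and the method `process` are hypothetical names, and the similarity check is passed in as a plain predicate. The assumption is that two records already known to belong to the same component do not need to be compared again.

```java
import java.util.List;
import java.util.function.BiPredicate;

/** Hypothetical illustration; not the BlockProcessor implementation. */
class IncrementalComponents {
    private final int[] parent;

    IncrementalComponents(int size) {
        parent = new int[size];
        for (int i = 0; i < size; i++) {
            parent[i] = i; // every record starts in its own component
        }
    }

    /** Returns the representative of x's component, halving the path on the way. */
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    /** Merges the components that contain a and b. */
    void union(int a, int b) {
        parent[find(a)] = find(b);
    }

    /**
     * Compares the records of a block pairwise, skipping every pair that is
     * already connected through earlier matches.
     * Returns the number of similarity checks actually performed.
     */
    static <T> int process(List<T> block, BiPredicate<T, T> similar, IncrementalComponents cc) {
        int comparisons = 0;
        for (int i = 0; i < block.size(); i++) {
            for (int j = i + 1; j < block.size(); j++) {
                if (cc.find(i) == cc.find(j)) {
                    continue; // same component already: no comparison needed
                }
                comparisons++;
                if (similar.test(block.get(i), block.get(j))) {
                    cc.union(i, j);
                }
            }
        }
        return comparisons;
    }
}
```

For a block of four mutually similar records, this sketch performs 3 similarity checks where an exhaustive pairwise loop performs 6.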

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the one provided by Spark GraphX
- Refactored to use window-based sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect dates instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions
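
The "tomorrow" filler mentioned above can be sketched as follows. This is a minimal illustration, not the comparator from the repository: the class `PivotDateOrdering` and its methods are hypothetical, dates are assumed to be ISO-8601 strings, and the ordering assumes that older dates are preferred. Under those assumptions, a missing or malformed date sorts after every real past date, whereas a fixed filler such as "2000-01-01" would sort before any later date.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Comparator;

/** Hypothetical illustration; not the pivot comparator from the repository. */
class PivotDateOrdering {

    /** Parses an ISO date; missing or malformed values become the day after 'today'. */
    static LocalDate dateOrTomorrow(String raw, LocalDate today) {
        if (raw == null || raw.trim().isEmpty()) {
            return today.plusDays(1);
        }
        try {
            return LocalDate.parse(raw.trim());
        } catch (DateTimeParseException e) {
            return today.plusDays(1);
        }
    }

    /** Orders raw date strings oldest first, with unusable dates last. */
    static Comparator<String> oldestFirst(LocalDate today) {
        return Comparator.comparing((String raw) -> dateOrTomorrow(raw, today));
    }
}
```

Passing `today` explicitly keeps the sketch deterministic; a caller would typically supply `LocalDate.now()`.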

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
2023-12-05 00:14:41 +01:00
dhp-actionmanager [AMF] docs 2023-10-12 10:05:46 +02:00
dhp-aggregation using objectSubType as originalType in Crossref2Oaf, code formatting 2023-12-01 15:03:05 +01:00
dhp-blacklist Use scala.binary.version property to resolve scala maven dependencies 2023-07-24 11:13:48 +02:00
dhp-broker-events code formatting 2023-10-02 11:04:36 +02:00
dhp-dedup-openaire SparkCreateSimRels: 2023-12-05 00:14:41 +01:00
dhp-doiboost code formatting 2023-12-03 13:31:58 +01:00
dhp-enrichment refactoring 2023-11-27 15:13:15 +01:00
dhp-graph-mapper [graph grouping] added isLookupUrl to the workflow definition, passed to the grouping spark action 2023-12-03 13:32:52 +01:00
dhp-graph-provision tests for d4science catalog 2023-09-20 15:38:32 +02:00
dhp-impact-indicators Run CC and RAM sequentially in dhp-impact-indicators WF 2023-09-13 08:59:40 +02:00
dhp-stats-actionsets Update StatsAtomicActionsJob.java 2023-12-01 11:35:01 +02:00
dhp-stats-promote Merge pull request 'Updates Promotion DBs' (#321) from antonis.lempesis/dnet-hadoop:beta into beta 2023-08-07 12:09:16 +02:00
dhp-stats-update Merge pull request 'Changes for tables and creation of the new indicator indi_is_result_accessible' (#363) from antonis.lempesis/dnet-hadoop:beta into beta 2023-12-01 15:05:23 +01:00
dhp-swh [SWH] renamed 'Software Heritage Identifier' to 'Software Hash Identifier' 2023-10-13 10:09:26 +02:00
dhp-usage-raw-data-update Use scala.binary.version property to resolve scala maven dependencies 2023-07-24 11:13:48 +02:00
dhp-usage-stats-build Use scala.binary.version property to resolve scala maven dependencies 2023-07-24 11:13:48 +02:00
dhp-workflow-profiles [maven-release-plugin] prepare for next development iteration 2022-04-07 13:32:22 +02:00
src/site added mvn site for dnet-hadoop project 2021-11-16 15:16:28 +01:00
pom.xml cleanup & docs 2023-10-12 12:23:20 +02:00