dnet-hadoop

Giambattista Bloisi 02636e802c		2024-01-10 22:59:52 +01:00

SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use new connected component algorithm that converge faster than Spark GraphX provided algorithm
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
dhp-actionmanager	[AMF] docs	2023-10-12 10:05:46 +02:00
dhp-aggregation	uploaded input parameters on CreateBaseline WF	2023-12-18 12:21:55 +01:00
dhp-blacklist	Use scala.binary.version property to resolve scala maven dependencies	2023-07-24 11:13:48 +02:00
dhp-broker-events	code formatting	2023-10-02 11:04:36 +02:00
dhp-dedup-openaire	SparkCreateSimRels:	2024-01-10 22:59:52 +01:00
dhp-doiboost	[doiboost - preprocess] remove transition to orcid preparation from sequence of steps at the beginning of the workflow	2023-12-15 12:24:55 +01:00
dhp-enrichment	added properties file in the folder for the workflow of result to organization from inst repo propagation. Changes the path in the classes implementing the propagation	2023-12-22 14:50:05 +01:00
dhp-graph-mapper	fixed conflicts	2024-01-10 11:03:42 +01:00
dhp-graph-provision	refactoring after compilation	2023-12-20 15:57:26 +01:00
dhp-impact-indicators	Run CC and RAM sequentially in dhp-impact-indicators WF	2023-09-13 08:59:40 +02:00
dhp-stats-actionsets	Update StatsAtomicActionsJob.java	2023-12-01 11:35:01 +02:00
dhp-stats-promote	Merge pull request 'Updates Promotion DBs' (#321) from antonis.lempesis/dnet-hadoop:beta into beta	2023-08-07 12:09:16 +02:00
dhp-stats-update	Merge pull request 'Changes for tables and creation of the new indicator indi_is_result_accessible' (#363) from antonis.lempesis/dnet-hadoop:beta into beta	2023-12-01 15:05:23 +01:00
dhp-swh	[SWH] renamed 'Software Heritage Identifier' to 'Software Hash Identifier'	2023-10-13 10:09:26 +02:00
dhp-usage-raw-data-update	Use scala.binary.version property to resolve scala maven dependencies	2023-07-24 11:13:48 +02:00
dhp-usage-stats-build	Use scala.binary.version property to resolve scala maven dependencies	2023-07-24 11:13:48 +02:00
dhp-workflow-profiles	[maven-release-plugin] prepare for next development iteration	2022-04-07 13:32:22 +02:00
src/site	added mvn site for dnet-hadoop project	2021-11-16 15:16:28 +01:00
pom.xml	cleanup & docs	2023-10-12 12:23:20 +02:00
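
The latest commit mentions using an incremental Connected Component to decrease comparisons in the BlockProcessor similarity match. The general idea can be sketched as follows. This is only an illustration, not the code from this repository: the names `DisjointSet`, `process_block` and `is_similar` are hypothetical, and a plain union-find structure is assumed as the way to track components.

```python
from itertools import combinations


class DisjointSet:
    """Union-find with path compression (assumed structure, not the repo's)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen ids as their own root.
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def process_block(records, is_similar):
    """Compare the record pairs of one block, skipping already-connected pairs.

    `records` maps a hypothetical record id to its payload and `is_similar`
    is a caller-supplied predicate. Returns the similarity relations found
    and the number of comparisons actually executed.
    """
    components = DisjointSet()
    rels = []
    comparisons = 0
    for a, b in combinations(sorted(records), 2):
        if components.find(a) == components.find(b):
            continue  # already linked transitively, no comparison needed
        comparisons += 1
        if is_similar(records[a], records[b]):
            rels.append((a, b))
            components.union(a, b)
    return rels, comparisons
```

With four records where three share the same title and `is_similar` is plain equality, this sketch runs 5 comparisons instead of the 6 possible pairs, because the third matching pair is already in one component. The saving grows with the size of the duplicate groups inside a block.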