dnet-hadoop

Giambattista Bloisi b0fc113749	2023-12-05 00:14:41 +01:00

SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
dhp-actionmanager	[AMF] docs	2023-10-12 10:05:46 +02:00
dhp-aggregation	using objectSubType as originalType in Crossref2Oaf, code formatting	2023-12-01 15:03:05 +01:00
dhp-blacklist	Use scala.binary.version property to resolve scala maven dependencies	2023-07-24 11:13:48 +02:00
dhp-broker-events	code formatting	2023-10-02 11:04:36 +02:00
dhp-dedup-openaire	SparkCreateSimRels:	2023-12-05 00:14:41 +01:00
dhp-doiboost	code formatting	2023-12-03 13:31:58 +01:00
dhp-enrichment	refactoring	2023-11-27 15:13:15 +01:00
dhp-graph-mapper	[graph grouping] added isLookupUrl to the workflow definition, passed to the grouping spark action	2023-12-03 13:32:52 +01:00
dhp-graph-provision	tests for d4science catalog	2023-09-20 15:38:32 +02:00
dhp-impact-indicators	Run CC and RAM sequentially in dhp-impact-indicators WF	2023-09-13 08:59:40 +02:00
dhp-stats-actionsets	Update StatsAtomicActionsJob.java	2023-12-01 11:35:01 +02:00
dhp-stats-promote	Merge pull request 'Updates Promotion DBs' (#321) from antonis.lempesis/dnet-hadoop:beta into beta	2023-08-07 12:09:16 +02:00
dhp-stats-update	Merge pull request 'Changes for tables and creation of the new indicator indi_is_result_accessible' (#363) from antonis.lempesis/dnet-hadoop:beta into beta	2023-12-01 15:05:23 +01:00
dhp-swh	[SWH] renamed 'Software Heritage Identifier' to 'Software Hash Identifier'	2023-10-13 10:09:26 +02:00
dhp-usage-raw-data-update	Use scala.binary.version property to resolve scala maven dependencies	2023-07-24 11:13:48 +02:00
dhp-usage-stats-build	Use scala.binary.version property to resolve scala maven dependencies	2023-07-24 11:13:48 +02:00
dhp-workflow-profiles	[maven-release-plugin] prepare for next development iteration	2022-04-07 13:32:22 +02:00
src/site	added mvn site for dnet-hadoop project	2021-11-16 15:16:28 +01:00
pom.xml	cleanup & docs	2023-10-12 12:23:20 +02:00