Commit Graph

20 Commits

Author SHA1 Message Date
Giambattista Bloisi 02636e802c SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching a cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use a reduced-complexity, threshold-bounded Levenshtein in levensteinTitle (see the sketch after this list)
- Use a reduced-complexity AuthorsMatch with early exit once the threshold is exceeded
- Use incremental connected components to decrease the number of comparisons during similarity matching in BlockProcessor
- Use new clusterings configuration in Dedup tests
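
A minimal sketch of the threshold-bounded Levenshtein idea, shown here with the banded implementation from Apache Commons Text rather than the project's own levensteinTitle code; the threshold value is illustrative:

```scala
import org.apache.commons.text.similarity.LevenshteinDistance

// Bounding the edit distance lets the algorithm stop as soon as the
// distance provably exceeds the limit, cutting the cost from O(n*m)
// to roughly O(threshold * min(n, m)). It returns -1 above the threshold.
val maxDistance = 5 // illustrative threshold, not the project's value
val boundedLevenshtein = new LevenshteinDistance(Integer.valueOf(maxDistance))

def titlesWithinThreshold(a: String, b: String): Boolean =
  boundedLevenshtein.apply(a, b).intValue() >= 0 // -1 means "too far apart"
```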

SparkWhitelistSimRels: use a left semi join for clarity and performance
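
A sketch of the left-semi-join pattern, assuming simRels and whitelist are DataFrames sharing source/target columns (the names are assumptions, not the project's schema):

```scala
import org.apache.spark.sql.DataFrame

// A left semi join keeps exactly the simRels rows that have at least one
// match in the whitelist: no row duplication when the whitelist matches
// more than once, and no whitelist columns leaking into the output --
// clearer and cheaper than an inner join followed by select/distinct.
def keepWhitelisted(simRels: DataFrame, whitelist: DataFrame): DataFrame =
  simRels.join(whitelist, Seq("source", "target"), "left_semi")
```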

SparkCreateMergeRels:
- Use a new connected components algorithm that converges faster than the one provided by Spark GraphX
- Refactored to use window-based sorting rather than groupBy to reduce memory pressure (see the sketch after this list)
- Use the historical pivot table to generate singleton rels and merged rels, and to keep continuity with the dedupIds used in the past
- The comparator for pivot record selection now uses "tomorrow" as the filler for missing or incorrect dates instead of "2000-01-01"
- Changed the generation of ids of type dedup_wf_001 to avoid collisions
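
A sketch of the windowing idea for pivot selection mentioned above, using illustrative column names (componentId, dateOfAcceptance, id) rather than the project's actual schema:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, lit, row_number}

// Rank records inside each connected component with a window function
// instead of groupBy, so Spark sorts within partitions rather than
// materializing whole groups in memory. Records lacking a date get
// "tomorrow" as a filler, so any record with a real (past) date wins.
def selectPivots(records: DataFrame): DataFrame = {
  val tomorrow = java.time.LocalDate.now().plusDays(1).toString
  val byComponent = Window
    .partitionBy("componentId")
    .orderBy(coalesce(col("dateOfAcceptance"), lit(tomorrow)).asc, col("id").asc)

  records
    .withColumn("pivotRank", row_number().over(byComponent))
    .filter(col("pivotRank") === 1)
    .drop("pivotRank")
}
```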

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
2024-01-10 22:59:52 +01:00
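
A sketch of the DedupRecordFactory change above, with a hypothetical Entity type and merge function standing in for the real dedup record model:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// mapGroups shuffles every record of a group to a single task and offers
// no map-side combine; reduceGroups is a typed aggregation, so Spark can
// merge records pairwise and partially before the shuffle, holding only
// a pair of records per group in memory at any time.
case class Entity(dedupId: String, payload: String)

// Placeholder merge logic -- the real factory merges full entity fields.
def mergeEntities(a: Entity, b: Entity): Entity =
  a.copy(payload = a.payload + "|" + b.payload)

def buildDedupRecords(entities: Dataset[Entity])(implicit spark: SparkSession): Dataset[Entity] = {
  import spark.implicits._
  entities
    .groupByKey(_.dedupId)
    .reduceGroups(mergeEntities _)
    .map(_._2)
}
```
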
Sandro La Bruzzo ed7e2ab6d1 reverted a mistaken commit on workflow.xml 2023-06-28 11:40:19 +02:00
Sandro La Bruzzo 9910ce06ae added to CreateSimRel the ability to write a time log 2023-06-28 11:38:16 +02:00
Claudio Atzori 909729a2fc [dedup] tweaking num partitions, minor changes 2023-05-17 10:16:22 +02:00
miconis 853333bdde implementation of the whitelist for similarity relations 2021-09-20 16:21:47 +02:00
miconis 1542196a33 bug fix: starting node of duplicate scan wf changed 2021-04-13 10:15:43 +02:00
miconis c39c82dfe9 modification of the jobs for the integration of openorgs in the provision; dedup records are no longer created by merging but are taken directly from the results of the openorgs portal 2021-04-06 14:31:00 +02:00
miconis 2355cc4e9b minor changes and bug fix 2021-03-29 10:07:12 +02:00
miconis 98854b0124 minor changes 2021-03-19 16:57:40 +01:00
miconis 4b2124a18e implementation of the openorgs wfs and of the raw_all wf to migrate openorgs db entities 2021-02-10 11:51:50 +01:00
miconis 8fea29177c refactoring, minor changes and implementation of the wf for openorgs with integration of organization phases into the scan wf 2021-01-18 16:48:08 +01:00
Claudio Atzori 8c67938ad0 configurable number of partitions used in the SparkCreateSimRels phase 2020-07-13 16:07:07 +02:00
Claudio Atzori c79e2f5977 drop workingPath before starting the dedup workflow 2020-05-06 11:27:44 +02:00
Claudio Atzori 71813795f6 various refactorings on the dnet-dedup-openaire workflow 2020-04-18 12:06:23 +02:00
Claudio Atzori 038ac7afd7 relation consistency workflow separated from dedup scan and creation of CCs 2020-04-17 13:12:44 +02:00
Claudio Atzori ec5dfc068d added spark.sql.shuffle.partitions=3840 to dedup scan wf 2020-04-16 14:41:28 +02:00
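
For context, the session-level equivalent of that setting (in the workflow it is presumably passed as a spark-submit --conf rather than set in code):

```scala
import org.apache.spark.sql.SparkSession

// spark.sql.shuffle.partitions controls how many partitions every
// shuffle (joins, groupBy) produces; the default of 200 is far too few
// for graph-scale dedup joins, so the workflow pins it to 3840.
val spark = SparkSession.builder()
  .appName("dedup-scan")
  .config("spark.sql.shuffle.partitions", "3840")
  .getOrCreate()
```
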
miconis 9b36458b6a Merge branch 'deduptesting' of code-repo.d4science.org:D-Net/dnet-hadoop into deduptesting 2020-04-16 12:13:58 +02:00
miconis cd4d9a148f creating temporary directories in dedup test 2020-04-16 12:13:26 +02:00
Claudio Atzori b39ff36c16 improving the wf definitions 2020-04-16 12:11:37 +02:00
Claudio Atzori 673e744649 moved openaire specific implementations under dedicated package eu.dnetlib.dhp.oa 2020-03-27 10:42:17 +01:00