dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Giambattista Bloisi	3c66e3bd7b	Create dedup record for "merged" pivots. Do not create dedup records for groups that have more than 20 different acceptance dates	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	02636e802c	SparkCreateSimRels: create dedup blocks from the complete queue of records matching the cluster key instead of truncating the results; clean titles once before clustering and similarity comparisons; added support for filtered fields in model; added support for sorting List fields in model; added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions; added new maxLengthMatch comparator function; use reduced complexity Levenshtein with threshold in levensteinTitle; use reduced complexity AuthorsMatch with threshold early-quit; use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor; use new clusterings configuration in Dedup tests. SparkWhitelistSimRels: use left semi join for clarity and performance. SparkCreateMergeRels: use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX; refactored to use Windowing sorting rather than groupBy to reduce memory pressure; use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past; comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"; changed generation of ids of type dedup_wf_001 to avoid collisions. DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	2caaaec42d	Include SparkCleanRelation logic in SparkPropagateRelation. SparkPropagateRelation includes merge relations. Revised tests for SparkPropagateRelation	2023-09-04 11:33:20 +02:00
Giambattista Bloisi	fab9920271	DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag	2023-08-09 15:41:43 +02:00
Giambattista Bloisi	af49424b59	Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities	2023-08-04 14:27:39 +02:00
Giambattista Bloisi	e64c2854a3	Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface. JsonPath cache contention fixed by using a ConcurrentHashMap. Blacklist filtering performance improvement. Minor performance improvements when evaluating similarity. Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)	2023-07-24 15:36:24 +02:00
Claudio Atzori	ddff0e8999	merging duplicates using IdentifierComparator	2022-11-11 16:10:25 +01:00
miconis	c959639bd5	dependency updated to the new pace-core version	2022-03-15 16:33:03 +01:00
miconis	611ca511db	set configuration property in openorgs duplicates wf	2021-10-07 15:39:55 +02:00
miconis	853333bdde	implementation of the whitelist for similarity relations	2021-09-20 16:21:47 +02:00
Claudio Atzori	9f4db73f30	updated/fixed unit tests	2021-08-11 15:02:51 +02:00
Claudio Atzori	2ee21da43b	suggestions from SonarLint	2021-08-11 12:13:22 +02:00
miconis	3c12eeadce	bug fix in propagation of relations	2021-04-22 11:44:33 +02:00
miconis	7ad573d023	bug fix: changed join in propagaterelations without applying filter on the id	2021-04-16 16:40:42 +02:00
miconis	f64e57c112	refactoring of the id generation, sparkcreatemergerels collects entities to create root id after a join	2021-04-15 10:59:24 +02:00
miconis	3525a8f504	id generation of representative record moved to the SparkCreateMergeRel job	2021-04-14 18:06:07 +02:00
miconis	f446580e9f	code refactoring (useless classes and wf removed), implementation of the test for the openorgs dedup	2021-03-29 16:10:46 +02:00
miconis	2355cc4e9b	minor changes and bug fix	2021-03-29 10:07:12 +02:00
Claudio Atzori	385214eeae	code formatting	2020-10-30 15:47:05 +01:00
miconis	0e54803177	bug fix in the id generator and implementation of jobs for organization dedup	2020-10-20 12:19:46 +02:00
miconis	a2ac7e52fb	implementation of the workflow for new organizations in openorgs	2020-10-06 13:58:09 +02:00
miconis	e3f7798d1b	minor changes in dedup tests, bug fix in the idgenerator and pace-core version update	2020-09-29 15:31:46 +02:00
miconis	259362ef47	implementation of the job to collect simrels from postgres db	2020-09-22 09:43:27 +02:00
miconis	b260fee787	implementation of the dedup_id generation using pids to make the graph more stable	2020-07-22 17:29:48 +02:00
Claudio Atzori	c6f6fb0f28	code formatting	2020-07-13 16:46:13 +02:00
Claudio Atzori	344a90c2e6	updated assertions in propagateRelationTest	2020-07-13 16:32:04 +02:00
Claudio Atzori	c73168b18e	Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting	2020-07-13 15:54:58 +02:00
Claudio Atzori	c8284bab06	WIP SparkCreateMergeRels distinct relations	2020-07-13 15:54:51 +02:00
Sandro La Bruzzo	1d133b7fe6	update test	2020-07-13 15:52:41 +02:00
Claudio Atzori	4c101a9d66	WIP SparkCreateMergeRels distinct relations	2020-07-13 15:31:38 +02:00
Claudio Atzori	8a612d861a	WIP SparkCreateMergeRels distinct relations	2020-07-13 15:30:57 +02:00
Sandro La Bruzzo	9ef2385022	implemented test for cut of connected component	2020-07-13 15:28:17 +02:00
Claudio Atzori	c3d67f709a	adjusted dedup configuration for result entities: using new wordssuffixprefix clustering function, removed ngrampairs, adjusted queueMaxSize (800) and slidingWindowSize (80)	2020-07-02 17:35:22 +02:00
Claudio Atzori	42f1a2bf94	bumped project version to 1.2.0-SNAPSHOT	2020-05-11 10:05:57 +02:00
Claudio Atzori	fd519df616	new rels produced by dedup workflow must be unique	2020-05-08 19:00:38 +02:00
miconis	3df703f67d	mergerels added to propagate relations	2020-05-04 12:08:12 +02:00
miconis	62e467eb0c	assertion numbers updated to fit the new implementation of the pace-core	2020-04-28 11:46:23 +02:00
Claudio Atzori	6f5b899038	reformatted code according to the updated style descriptor	2020-04-28 11:23:29 +02:00
Claudio Atzori	a0bdbacdae	switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin	2020-04-27 14:52:31 +02:00
Claudio Atzori	7a3f8085f7	switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin	2020-04-27 14:45:40 +02:00
Claudio Atzori	278fc9d276	code formatting	2020-04-23 18:51:38 +02:00
miconis	8d258c85ff	spark dedup test fixed, sample for dataset and orp added, test implemented	2020-04-23 18:16:20 +02:00
Claudio Atzori	91e72a6944	Dataset based implementation for SparkCreateDedupRecord phase, fixed datasource entity dump supplementing dedup unit tests	2020-04-21 12:06:08 +02:00
miconis	5c9ef08a8e	spark dedup test fixed	2020-04-21 10:19:04 +02:00
Claudio Atzori	eb8a020859	fixed behaviour of DedupRecordFactory	2020-04-20 18:44:06 +02:00
miconis	1102e32462	SparkDedupTest updated and organization dump fixed	2020-04-20 16:49:01 +02:00
Claudio Atzori	ad7a131b18	introduced common project code formatting plugin, works on the commit hook, based on https://github.com/Cosium/git-code-format-maven-plugin , applied to each java class in the project	2020-04-18 12:42:58 +02:00
miconis	6a089ec287	minor changes	2020-04-16 12:15:38 +02:00
miconis	cd4d9a148f	creating temporary directories in dedup test	2020-04-16 12:13:26 +02:00
miconis	0be2e72be5	further implementation of tests for the deduplication of each entity. publication dump added, empty entity files created	2020-04-08 18:02:30 +02:00

53 Commits