dnet-hadoop

Author	SHA1	Message	Date
Giambattista Bloisi	664a381d31	Unify merge logic of entities in MergeUtils.class	2024-03-18 16:04:49 +01:00
Giambattista Bloisi	b19643f6eb	Dedup aliases, created when a dedup in a previous build has been merged in a new dedup, need to be marked as "deletedbyinference", since they are "merged" in the new dedup	2024-02-08 15:34:59 +01:00
Claudio Atzori	6fd25cf549	code formatting	2024-01-23 08:47:12 +01:00
Giambattista Bloisi	3c66e3bd7b	Create dedup record for "merged" pivots. Do not create dedup records for group that have more than 20 different acceptance date	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	02636e802c	SparkCreateSimRels: - Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results - Clean titles once before clustering and similarity comparisons - Added support for filtered fields in model - Added support for sorting List fields in model - Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions - Added new maxLengthMatch comparator function - Use reduced complexity Levenshtein with threshold in levensteinTitle - Use reduced complexity AuthorsMatch with threshold early-quit - Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor - Use new clusterings configuration in Dedup tests SparkWhitelistSimRels: use left semi join for clarity and performance SparkCreateMergeRels: - Use new connected component algorithm that converge faster than Spark GraphX provided algorithm - Refactored to use Windowing sorting rather than groupBy to reduce memory pressure - Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past - Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01" - Changed generation of ids of type dedup_wf_001 to avoid collisions DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	5e15f20e6e	Fix entityMerger that was excluding the authors of the first entity in the list to merge	2023-07-21 00:46:54 +02:00
Claudio Atzori	062abfd669	fixed NPE, removed unused stuff	2022-12-06 12:04:00 +01:00
Claudio Atzori	0aa725083f	extended dedup testing	2022-11-17 16:13:43 +01:00
Claudio Atzori	ddff0e8999	merging duplicates using IdentifierComparator	2022-11-11 16:10:25 +01:00
Claudio Atzori	5af5a8ae42	added IdentifierComparator	2022-11-09 14:20:59 +01:00
Claudio Atzori	61319b2e83	updated dhp-schema version; set entity-level dataInfo before & after merging the fields from the group of duplicates	2022-03-25 16:38:33 +01:00
miconis	8991d097b4	bug fix in the DedupRecordFactory, DataInfo set before merge	2022-02-24 17:13:12 +01:00
Claudio Atzori	2ee21da43b	suggestions from SonarLint	2021-08-11 12:13:22 +02:00
miconis	f64e57c112	refactoring of the id generation, sparkcreatemergerels collects entities to create root id after a join	2021-04-15 10:59:24 +02:00
miconis	3525a8f504	id generation of representative record moved to the SparkCreateMergeRel job	2021-04-14 18:06:07 +02:00
miconis	2355cc4e9b	minor changes and bug fix	2021-03-29 10:07:12 +02:00
Claudio Atzori	e5da4ee9b1	dedup workflow using the common PidComparator	2020-11-04 15:02:02 +01:00
miconis	c4a59d1b9a	merge with the master to port the new packages	2020-10-20 16:07:30 +02:00
miconis	0e54803177	bug fix in the id generator and implementation of jobs for organization dedup	2020-10-20 12:19:46 +02:00
miconis	6f8720982c	bug fix in the idgenerator and test implementation	2020-10-09 09:30:23 +02:00
Sandro La Bruzzo	734934e2eb	fixed error on empty intersection with publication and relation on export to OAF	2020-10-08 17:29:29 +02:00
Sandro La Bruzzo	eec418cd26	moved AuthoreMerger into dhp-common	2020-10-08 10:33:55 +02:00
miconis	1804c5d809	refactoring: classes moved in the right package	2020-10-06 16:44:51 +02:00
miconis	7093355487	bug fix and minor changes	2020-10-06 16:21:34 +02:00
miconis	e3f7798d1b	minor changes in dedup tests, bug fix in the idgenerator and pace-core version update	2020-09-29 15:31:46 +02:00
miconis	d47352cbc7	refactoring of the procedure for the id generation, minor changes and addition of a comparison on the original id and the origin datasource	2020-07-24 20:10:47 +02:00
miconis	b260fee787	implementation of the dedup_id generation using pids to make the graph more stable	2020-07-22 17:29:48 +02:00
Claudio Atzori	3cf2796ac6	code formatting	2020-05-22 12:34:00 +02:00
miconis	8bbd1d0501	reimplementation of the author merging in deduprecord creation. implementation of the test class.	2020-05-21 11:52:14 +02:00
Claudio Atzori	6f5b899038	reformatted code according to the updated style descriptor	2020-04-28 11:23:29 +02:00
Claudio Atzori	a0bdbacdae	switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin	2020-04-27 14:52:31 +02:00
Claudio Atzori	7a3f8085f7	switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin	2020-04-27 14:45:40 +02:00
Claudio Atzori	9ddafd46ca	fixed dedup record id prefix, set the correct dataInfo in the DedupRecordFactory	2020-04-23 07:50:18 +02:00
Claudio Atzori	eb8a020859	fixed behaviour of DedupRecordFactory	2020-04-20 18:44:06 +02:00
Claudio Atzori	ad7a131b18	introduced common project code formatting plugin, works on the commit hook, based on https://github.com/Cosium/git-code-format-maven-plugin , applied to each java class in the project	2020-04-18 12:42:58 +02:00
Claudio Atzori	71813795f6	various refactorings on the dnet-dedup-openaire workflow	2020-04-18 12:06:23 +02:00
Claudio Atzori	673e744649	moved openaire specific implementations under dedicated package eu.dnetlib.dhp.oa	2020-03-27 10:42:17 +01:00

37 Commits