dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Giambattista Bloisi	3c66e3bd7b	Create dedup record for "merged" pivots; do not create dedup records for groups that have more than 20 different acceptance dates	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	02636e802c	SparkCreateSimRels: Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results; Clean titles once before clustering and similarity comparisons; Added support for filtered fields in model; Added support for sorting List fields in model; Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions; Added new maxLengthMatch comparator function; Use reduced complexity Levenshtein with threshold in levensteinTitle; Use reduced complexity AuthorsMatch with threshold early-quit; Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor; Use new clusterings configuration in Dedup tests. SparkWhitelistSimRels: use left semi join for clarity and performance. SparkCreateMergeRels: Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX; Refactored to use Windowing sorting rather than groupBy to reduce memory pressure; Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past; Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect dates instead of "2000-01-01"; Changed generation of ids of type dedup_wf_001 to avoid collisions. DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	5e15f20e6e	Fix entityMerger that was excluding the authors of the first entity in the list to merge	2023-07-21 00:46:54 +02:00
Claudio Atzori	062abfd669	fixed NPE, removed unused stuff	2022-12-06 12:04:00 +01:00
Claudio Atzori	0aa725083f	extended dedup testing	2022-11-17 16:13:43 +01:00
Claudio Atzori	ddff0e8999	merging duplicates using IdentifierComparator	2022-11-11 16:10:25 +01:00
Claudio Atzori	5af5a8ae42	added IdentifierComparator	2022-11-09 14:20:59 +01:00
Claudio Atzori	61319b2e83	updated dhp-schema version; set entity-level dataInfo before & after merging the fields from the group of duplicates	2022-03-25 16:38:33 +01:00
miconis	8991d097b4	bug fix in the DedupRecordFactory, DataInfo set before merge	2022-02-24 17:13:12 +01:00
Claudio Atzori	2ee21da43b	suggestions from SonarLint	2021-08-11 12:13:22 +02:00
miconis	f64e57c112	refactoring of the id generation; SparkCreateMergeRels collects entities to create root id after a join	2021-04-15 10:59:24 +02:00
miconis	3525a8f504	id generation of representative record moved to the SparkCreateMergeRel job	2021-04-14 18:06:07 +02:00
miconis	2355cc4e9b	minor changes and bug fix	2021-03-29 10:07:12 +02:00
Claudio Atzori	e5da4ee9b1	dedup workflow using the common PidComparator	2020-11-04 15:02:02 +01:00
miconis	c4a59d1b9a	merge with the master to port the new packages	2020-10-20 16:07:30 +02:00
miconis	0e54803177	bug fix in the id generator and implementation of jobs for organization dedup	2020-10-20 12:19:46 +02:00
miconis	6f8720982c	bug fix in the idgenerator and test implementation	2020-10-09 09:30:23 +02:00
Sandro La Bruzzo	734934e2eb	fixed error on empty intersection with publication and relation on export to OAF	2020-10-08 17:29:29 +02:00
Sandro La Bruzzo	eec418cd26	moved AuthorMerger into dhp-common	2020-10-08 10:33:55 +02:00
miconis	1804c5d809	refactoring: classes moved in the right package	2020-10-06 16:44:51 +02:00
miconis	7093355487	bug fix and minor changes	2020-10-06 16:21:34 +02:00
miconis	e3f7798d1b	minor changes in dedup tests, bug fix in the idgenerator and pace-core version update	2020-09-29 15:31:46 +02:00
miconis	d47352cbc7	refactoring of the procedure for the id generation, minor changes and addition of a comparison on the original id and the origin datasource	2020-07-24 20:10:47 +02:00
miconis	b260fee787	implementation of the dedup_id generation using pids to make the graph more stable	2020-07-22 17:29:48 +02:00
Claudio Atzori	3cf2796ac6	code formatting	2020-05-22 12:34:00 +02:00
miconis	8bbd1d0501	reimplementation of the author merging in deduprecord creation. implementation of the test class.	2020-05-21 11:52:14 +02:00
Claudio Atzori	6f5b899038	reformatted code according to the updated style descriptor	2020-04-28 11:23:29 +02:00
Claudio Atzori	a0bdbacdae	switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin	2020-04-27 14:52:31 +02:00
Claudio Atzori	7a3f8085f7	switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin	2020-04-27 14:45:40 +02:00
Claudio Atzori	9ddafd46ca	fixed dedup record id prefix, set the correct dataInfo in the DedupRecordFactory	2020-04-23 07:50:18 +02:00
Claudio Atzori	eb8a020859	fixed behaviour of DedupRecordFactory	2020-04-20 18:44:06 +02:00
Claudio Atzori	ad7a131b18	introduced common project code formatting plugin, works on the commit hook, based on https://github.com/Cosium/git-code-format-maven-plugin , applied to each java class in the project	2020-04-18 12:42:58 +02:00
Claudio Atzori	71813795f6	various refactorings on the dnet-dedup-openaire workflow	2020-04-18 12:06:23 +02:00
Claudio Atzori	673e744649	moved openaire specific implementations under dedicated package eu.dnetlib.dhp.oa	2020-03-27 10:42:17 +01:00

34 Commits