dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Claudio Atzori	6fd25cf549	code formatting	2024-01-23 08:47:12 +01:00
Giambattista Bloisi	3c66e3bd7b	Create dedup record for "merged" pivots. Do not create dedup records for groups that have more than 20 different acceptance dates	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	10e135db1e	Use dedup_wf_002 in place of dedup_wf_001 to make explicit a different algorithm has been used to generate those kind of ids	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	831cc1fdde	Generate "merged" dedup id relations also for records that are filtered out by the cut parameters	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	1287315ffb	No longer use dedupId information from pivotHistory Database	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	02636e802c	SparkCreateSimRels: create dedup blocks from the complete queue of records matching cluster key instead of truncating the results; clean titles once before clustering and similarity comparisons; added support for filtered fields in model; added support for sorting List fields in model; added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions; added new maxLengthMatch comparator function; use reduced complexity Levenshtein with threshold in levensteinTitle; use reduced complexity AuthorsMatch with threshold early-quit; use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor; use new clusterings configuration in Dedup tests. SparkWhitelistSimRels: use left semi join for clarity and performance. SparkCreateMergeRels: use new connected component algorithm that converges faster than Spark GraphX provided algorithm; refactored to use Windowing sorting rather than groupBy to reduce memory pressure; use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past; comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"; changed generation of ids of type dedup_wf_001 to avoid collisions. DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure	2024-01-10 22:59:52 +01:00
Claudio Atzori	03670bb9ce	[dedup] use common saveParquet and save methods to ensure outputs are compressed	2023-10-16 10:55:47 +02:00
Claudio Atzori	eed9fe0902	code formatting	2023-10-06 12:31:17 +02:00
Giambattista Bloisi	2caaaec42d	Include SparkCleanRelation logic in SparkPropagateRelation. SparkPropagateRelation includes merge relations. Revised tests for SparkPropagateRelation	2023-09-04 11:33:20 +02:00
Claudio Atzori	bf35280ea6	code formatting	2023-08-29 11:11:00 +02:00
Giambattista Bloisi	a860e19423	Fix: ensure all relations are written out, not only those managed by dedup	2023-08-28 15:36:02 +02:00
Giambattista Bloisi	0d7b2bf83d	Rewrite SparkPropagateRelation exploiting Dataframe API	2023-08-28 10:34:54 +02:00
Giambattista Bloisi	97b6d1dc45	Filter ids by dataInfo.deletedbyinference and DataInfo.invisible flags. Filter relations also by dataInfo.invisible flag	2023-08-07 10:24:11 +02:00
Giambattista Bloisi	af49424b59	Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities	2023-08-04 14:27:39 +02:00
Giambattista Bloisi	e64c2854a3	Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface. JsonPath cache contention fixed by using a ConcurrentHashMap. Blacklist filtering performance improvement. Minor performance improvements when evaluating similarity. Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)	2023-07-24 15:36:24 +02:00
Giambattista Bloisi	5e15f20e6e	Fix entityMerger that was excluding the authors of the first entity in the list to merge	2023-07-21 00:46:54 +02:00
Sandro La Bruzzo	9963fd6d29	updated log to add subentity	2023-06-28 13:36:05 +02:00
Sandro La Bruzzo	9910ce06ae	added to CreateSimRel the feature to write time log	2023-06-28 11:38:16 +02:00
Sandro La Bruzzo	bd17c3edc8	added to CreateSimRel the feature to write time log	2023-06-28 11:20:58 +02:00
Claudio Atzori	909729a2fc	[dedup] tweaking num partitions, minor changes	2023-05-17 10:16:22 +02:00
Claudio Atzori	062abfd669	fixed NPE, removed unused stuff	2022-12-06 12:04:00 +01:00
Claudio Atzori	0aa725083f	extended dedup testing	2022-11-17 16:13:43 +01:00
Claudio Atzori	ddff0e8999	merging duplicates using IdentifierComparator	2022-11-11 16:10:25 +01:00
Claudio Atzori	5af5a8ae42	added IdentifierComparator	2022-11-09 14:20:59 +01:00
Claudio Atzori	61319b2e83	updated dhp-schema version; set entity-level dataInfo before & after merging the fields from the group of duplicates	2022-03-25 16:38:33 +01:00
miconis	8991d097b4	bug fix in the DedupRecordFactory, DataInfo set before merge	2022-02-24 17:13:12 +01:00
Claudio Atzori	44a937f4ed	factored out entity grouping implementation, extended to consider results from delegated authorities rather than identical records from other sources	2022-01-19 12:24:52 +01:00
Claudio Atzori	2b46b87f56	fixed filtering criteria applied in SparkCopyRelationsNoOpenorgs to keep the parent/child relations from OpenOrgs	2021-11-19 11:30:29 +01:00
Claudio Atzori	a24b9f8268	[dedup] trivial refactoring	2021-11-18 17:12:02 +01:00
Claudio Atzori	c0750fb17c	avoid unnecessary count operations over large spark datasets	2021-11-18 17:11:31 +01:00
miconis	611ca511db	set configuration property in openorgs duplicates wf	2021-10-07 15:39:55 +02:00
miconis	9646b9fd98	implementation of the http call for the update of openorgs suggestions	2021-10-07 11:29:11 +02:00
miconis	853333bdde	implementation of the whitelist for similarity relations	2021-09-20 16:21:47 +02:00
Claudio Atzori	9f4db73f30	updated/fixed unit tests	2021-08-11 15:02:51 +02:00
Claudio Atzori	2ee21da43b	suggestions from SonarLint	2021-08-11 12:13:22 +02:00
Claudio Atzori	2fff24df55	code formatting	2021-07-28 11:34:19 +02:00
Sandro La Bruzzo	3920c69bc8	change implementation of resolve Relation to generate jsonRdd in output	2021-07-25 09:51:36 +02:00
Sandro La Bruzzo	058b636d4d	added control to check if the entity exists	2021-07-22 16:08:54 +02:00
Claudio Atzori	41b551562e	applying PR#115 (DatePicker) on stable_ids	2021-06-17 09:33:50 +02:00
Claudio Atzori	23b8883ab1	applied intellij code cleanup	2021-05-14 10:58:12 +02:00
Claudio Atzori	5afa7d3e0c	core utilities in dhp-common moved in external module dhp-schemas	2021-04-27 15:44:01 +02:00
Claudio Atzori	ef4bfd82e2	code formatting	2021-04-27 10:09:31 +02:00
miconis	3c12eeadce	bug fix in propagation of relations	2021-04-22 11:44:33 +02:00
Claudio Atzori	815b9f4d56	[openorgs dedup] fixed workflow parameter declarations. Introduced support for resuming the execution from intermediate steps	2021-04-20 17:24:45 +02:00
Claudio Atzori	45057440c1	code formatting	2021-04-16 17:28:25 +02:00
miconis	7ad573d023	bug fix: changed join in propagaterelations without applying filter on the id	2021-04-16 16:40:42 +02:00
miconis	f64e57c112	refactoring of the id generation, sparkcreatemergerels collects entities to create root id after a join	2021-04-15 10:59:24 +02:00
miconis	3525a8f504	id generation of representative record moved to the SparkCreateMergeRel job	2021-04-14 18:06:07 +02:00
Claudio Atzori	511c0521e5	[dedup] avoiding NPEs handling OpenOrg relations	2021-04-12 17:45:11 +02:00
miconis	d442e25cbc	bug fix: ids in self mergerels are not marked deletedbyinference=true	2021-04-12 15:56:22 +02:00

147 Commits