dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Giambattista Bloisi	3c66e3bd7b	Create dedup record for "merged" pivots; do not create dedup records for groups that have more than 20 different acceptance dates	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	10e135db1e	Use dedup_wf_002 in place of dedup_wf_001 to make explicit that a different algorithm has been used to generate those kinds of ids	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	831cc1fdde	Generate "merged" dedup id relations also for records that are filtered out by the cut parameters	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	1287315ffb	No longer use dedupId information from pivotHistory Database	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	02636e802c	SparkCreateSimRels: create dedup blocks from the complete queue of records matching cluster key instead of truncating the results; clean titles once before clustering and similarity comparisons; added support for filtered fields in model; added support for sorting List fields in model; added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions; added new maxLengthMatch comparator function; use reduced complexity Levenshtein with threshold in levensteinTitle; use reduced complexity AuthorsMatch with threshold early-quit; use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor; use new clusterings configuration in Dedup tests. SparkWhitelistSimRels: use left semi join for clarity and performance. SparkCreateMergeRels: use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX; refactored to use Windowing sorting rather than groupBy to reduce memory pressure; use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past; comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect dates instead of "2000-01-01"; changed generation of ids of type dedup_wf_001 to avoid collisions. DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure	2024-01-10 22:59:52 +01:00
Claudio Atzori	03670bb9ce	[dedup] use common saveParquet and save methods to ensure outputs are compressed	2023-10-16 10:55:47 +02:00
Claudio Atzori	eed9fe0902	code formatting	2023-10-06 12:31:17 +02:00
Giambattista Bloisi	2caaaec42d	Include SparkCleanRelation logic in SparkPropagateRelation; SparkPropagateRelation includes merge relations; revised tests for SparkPropagateRelation	2023-09-04 11:33:20 +02:00
Claudio Atzori	bf35280ea6	code formatting	2023-08-29 11:11:00 +02:00
Giambattista Bloisi	a860e19423	Fix: ensure all relations are written out, not only those managed by dedup	2023-08-28 15:36:02 +02:00
Giambattista Bloisi	0d7b2bf83d	Rewrite SparkPropagateRelation exploiting Dataframe API	2023-08-28 10:34:54 +02:00
Giambattista Bloisi	97b6d1dc45	Filter ids by dataInfo.deletedbyinference and DataInfo.invisible flags; filter relations also by dataInfo.invisible flag	2023-08-07 10:24:11 +02:00
Giambattista Bloisi	af49424b59	Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities	2023-08-04 14:27:39 +02:00
Giambattista Bloisi	e64c2854a3	Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface; JsonPath cache contention fixed by using a ConcurrentHashMap; blacklist filtering performance improvement; minor performance improvements when evaluating similarity; sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)	2023-07-24 15:36:24 +02:00
Giambattista Bloisi	5e15f20e6e	Fix entityMerger that was excluding the authors of the first entity in the list to merge	2023-07-21 00:46:54 +02:00
Sandro La Bruzzo	9963fd6d29	updated log to add subentity	2023-06-28 13:36:05 +02:00
Sandro La Bruzzo	9910ce06ae	added to CreateSimRel the feature to write time log	2023-06-28 11:38:16 +02:00
Sandro La Bruzzo	bd17c3edc8	added to CreateSimRel the feature to write time log	2023-06-28 11:20:58 +02:00
Claudio Atzori	909729a2fc	[dedup] tweaking num partitions, minor changes	2023-05-17 10:16:22 +02:00
Claudio Atzori	062abfd669	fixed NPE, removed unused stuff	2022-12-06 12:04:00 +01:00
Claudio Atzori	0aa725083f	extended dedup testing	2022-11-17 16:13:43 +01:00
Claudio Atzori	ddff0e8999	merging duplicates using IdentifierComparator	2022-11-11 16:10:25 +01:00
Claudio Atzori	5af5a8ae42	added IdentifierComparator	2022-11-09 14:20:59 +01:00
Claudio Atzori	61319b2e83	updated dhp-schema version; set entity-level dataInfo before & after merging the fields from the group of duplicates	2022-03-25 16:38:33 +01:00
miconis	8991d097b4	bug fix in the DedupRecordFactory, DataInfo set before merge	2022-02-24 17:13:12 +01:00
Claudio Atzori	44a937f4ed	factored out entity grouping implementation, extended to consider results from delegated authorities rather than identical records from other sources	2022-01-19 12:24:52 +01:00
Claudio Atzori	2b46b87f56	fixed filtering criteria applied in SparkCopyRelationsNoOpenorgs to keep the parent/child relations from OpenOrgs	2021-11-19 11:30:29 +01:00
Claudio Atzori	a24b9f8268	[dedup] trivial refactoring	2021-11-18 17:12:02 +01:00
Claudio Atzori	c0750fb17c	avoid unnecessary count operations over large spark datasets	2021-11-18 17:11:31 +01:00
miconis	611ca511db	set configuration property in openorgs duplicates wf	2021-10-07 15:39:55 +02:00
miconis	9646b9fd98	implementation of the http call for the update of openorgs suggestions	2021-10-07 11:29:11 +02:00
miconis	853333bdde	implementation of the whitelist for similarity relations	2021-09-20 16:21:47 +02:00
Claudio Atzori	9f4db73f30	updated/fixed unit tests	2021-08-11 15:02:51 +02:00
Claudio Atzori	2ee21da43b	suggestions from SonarLint	2021-08-11 12:13:22 +02:00
Claudio Atzori	2fff24df55	code formatting	2021-07-28 11:34:19 +02:00
Sandro La Bruzzo	3920c69bc8	change implementation of resolve Relation to generate jsonRdd in output	2021-07-25 09:51:36 +02:00
Sandro La Bruzzo	058b636d4d	added control to check if the entity exists	2021-07-22 16:08:54 +02:00
Claudio Atzori	41b551562e	applying PR#115 (DatePicker) on stable_ids	2021-06-17 09:33:50 +02:00
Claudio Atzori	23b8883ab1	applied intellij code cleanup	2021-05-14 10:58:12 +02:00
Claudio Atzori	5afa7d3e0c	core utilities in dhp-common moved in external module dhp-schemas	2021-04-27 15:44:01 +02:00
Claudio Atzori	ef4bfd82e2	code formatting	2021-04-27 10:09:31 +02:00
miconis	3c12eeadce	bug fix in propagation of relations	2021-04-22 11:44:33 +02:00
Claudio Atzori	815b9f4d56	[openorgs dedup] fixed workflow parameter declarations. Introduced support for resuming the execution from intermediate steps	2021-04-20 17:24:45 +02:00
Claudio Atzori	45057440c1	code formatting	2021-04-16 17:28:25 +02:00
miconis	7ad573d023	bug fix: changed join in propagaterelations without applying filter on the id	2021-04-16 16:40:42 +02:00
miconis	f64e57c112	refactoring of the id generation, sparkcreatemergerels collects entities to create root id after a join	2021-04-15 10:59:24 +02:00
miconis	3525a8f504	id generation of representative record moved to the SparkCreateMergeRel job	2021-04-14 18:06:07 +02:00
Claudio Atzori	511c0521e5	[dedup] avoiding NPEs handling OpenOrg relations	2021-04-12 17:45:11 +02:00
miconis	d442e25cbc	bug fix: ids in self mergerels are not marked deletedbyinference=true	2021-04-12 15:56:22 +02:00
miconis	bf685d849f	addition of pids in the query for the export of openorgs for the provision, addition of ec_fields in the openorgs model	2021-04-07 14:27:43 +02:00

146 Commits