dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Michele De Bonis	f6601ea7d1	default parameters for openorgs updated	2024-03-25 13:07:04 +01:00
Michele De Bonis	cd4c3c934d	openorgs wf updated	2024-03-22 15:42:37 +01:00
Giambattista Bloisi	8dd666aedd	Dedup aliases, created when a dedup in a previous build has been merged in a new dedup, need to be marked as "deletedbyinference", since they are "merged" in the new dedup	2024-02-08 15:27:57 +01:00
Claudio Atzori	6fd25cf549	code formatting	2024-01-23 08:47:12 +01:00
Giambattista Bloisi	21a14fcd80	Reusable RunSQLSparkJob for executing SQL in Spark through Oozie Spark Actions; implements pivots table update oozie workflow	2024-01-15 10:18:14 +01:00
Giambattista Bloisi	3c66e3bd7b	Create dedup record for "merged" pivots; do not create dedup records for groups that have more than 20 different acceptance dates	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	10e135db1e	Use dedup_wf_002 in place of dedup_wf_001 to make explicit that a different algorithm has been used to generate those kinds of ids	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	831cc1fdde	Generate "merged" dedup id relations also for records that are filtered out by the cut parameters	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	1287315ffb	No longer use dedupId information from pivotHistory Database	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	02636e802c	SparkCreateSimRels: - Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results - Clean titles once before clustering and similarity comparisons - Added support for filtered fields in model - Added support for sorting List fields in model - Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions - Added new maxLengthMatch comparator function - Use reduced complexity Levenshtein with threshold in levensteinTitle - Use reduced complexity AuthorsMatch with threshold early-quit - Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor - Use new clusterings configuration in Dedup tests SparkWhitelistSimRels: use left semi join for clarity and performance SparkCreateMergeRels: - Use new connected component algorithm that converges faster than Spark GraphX provided algorithm - Refactored to use Windowing sorting rather than groupBy to reduce memory pressure - Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past - Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01" - Changed generation of ids of type dedup_wf_001 to avoid collisions DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure	2024-01-10 22:59:52 +01:00
Claudio Atzori	431c6bb08a	[dedup] added isLookupUrl to the graph consistency workflow definition, required now by the entity grouping phase	2023-12-06 11:06:46 +01:00
Claudio Atzori	03670bb9ce	[dedup] use common saveParquet and save methods to ensure outputs are compressed	2023-10-16 10:55:47 +02:00
Claudio Atzori	eed9fe0902	code formatting	2023-10-06 12:31:17 +02:00
Claudio Atzori	7b403a920f	Merge branch 'beta' into consistency_keep_mergerels	2023-10-02 11:26:00 +02:00
Giambattista Bloisi	2caaaec42d	Include SparkCleanRelation logic in SparkPropagateRelation; SparkPropagateRelation includes merge relations; revised tests for SparkPropagateRelation	2023-09-04 11:33:20 +02:00
Giambattista Bloisi	6cc7d8ca7b	GroupEntities and DispatchEntities are now merged in GroupEntitiesSparkJob	2023-08-30 10:43:31 +02:00
Giambattista Bloisi	6b1c05d118	Add sparkExecutorMemoryOverhead workflow config to set off-heap memory for Spark actions. If not explicitly set, it defaults to 1 GB	2023-08-29 16:04:19 +02:00
Claudio Atzori	bf35280ea6	code formatting	2023-08-29 11:11:00 +02:00
Claudio Atzori	58665a246c	Merge branch 'beta' into propagate_relation_rewrite	2023-08-29 10:47:02 +02:00
Giambattista Bloisi	d012aec0b3	Revert PropagateRelation's argument name from outputPath to graphOutputPath in consistency workflow (#8964)	2023-08-28 22:44:54 +02:00
Giambattista Bloisi	a860e19423	Fix: ensure all relations are written out, not only those managed by dedup	2023-08-28 15:36:02 +02:00
Giambattista Bloisi	0d7b2bf83d	Rewrite SparkPropagateRelation exploiting Dataframe API	2023-08-28 10:34:54 +02:00
Giambattista Bloisi	95cd2b9b1e	Make filterInvisible a mandatory parameter of DispatchEntitiesSparkJob; make filterInvisible a mandatory parameter of both dedup/consistency and graph/group oozie workflows	2023-08-10 11:53:48 +02:00
Giambattista Bloisi	fab9920271	DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag	2023-08-09 15:41:43 +02:00
Giambattista Bloisi	97b6d1dc45	Filter ids by dataInfo.deletedbyinference and dataInfo.invisible flags; filter relations also by dataInfo.invisible flag	2023-08-07 10:24:11 +02:00
Giambattista Bloisi	af49424b59	Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities	2023-08-04 14:27:39 +02:00
Giambattista Bloisi	e64c2854a3	Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface; JsonPath cache contention fixed by using a ConcurrentHashMap; Blacklist filtering performance improvement; minor performance improvements when evaluating similarity; sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)	2023-07-24 15:36:24 +02:00
Giambattista Bloisi	5e15f20e6e	Fix entityMerger that was excluding the authors of the first entity in the list to merge	2023-07-21 00:46:54 +02:00
Giambattista Bloisi	dba34505de	Fix SparkStatsTest bug where parquet tables were incorrectly read as text files leading to unpredictable count() values	2023-07-19 14:24:52 +02:00
Sandro La Bruzzo	9963fd6d29	updated log to add subentity	2023-06-28 13:36:05 +02:00
Sandro La Bruzzo	ed7e2ab6d1	reverted mistake on commit workflow.xml	2023-06-28 11:40:19 +02:00
Sandro La Bruzzo	9910ce06ae	added to CreateSimRel the feature to write time log	2023-06-28 11:38:16 +02:00
Sandro La Bruzzo	bd17c3edc8	added to CreateSimRel the feature to write time log	2023-06-28 11:20:58 +02:00
Claudio Atzori	909729a2fc	[dedup] tweaking num partitions, minor changes	2023-05-17 10:16:22 +02:00
Claudio Atzori	062abfd669	fixed NPE, removed unused stuff	2022-12-06 12:04:00 +01:00
Claudio Atzori	0aa725083f	extended dedup testing	2022-11-17 16:13:43 +01:00
Claudio Atzori	3dbc637d3e	code formatting	2022-11-17 09:55:41 +01:00
Claudio Atzori	ddff0e8999	merging duplicates using IdentifierComparator	2022-11-11 16:10:25 +01:00
Claudio Atzori	5af5a8ae42	added IdentifierComparator	2022-11-09 14:20:59 +01:00
Claudio Atzori	61319b2e83	updated dhp-schema version; set entity-level dataInfo before & after merging the fields from the group of duplicates	2022-03-25 16:38:33 +01:00
miconis	c959639bd5	dependency updated to the new pace-core version	2022-03-15 16:33:03 +01:00
miconis	8991d097b4	bug fix in the DedupRecordFactory, DataInfo set before merge	2022-02-24 17:13:12 +01:00
Claudio Atzori	391aa1373b	added unit test	2022-01-19 17:13:21 +01:00
Claudio Atzori	44a937f4ed	factored out entity grouping implementation, extended to consider results from delegated authorities rather than identical records from other sources	2022-01-19 12:24:52 +01:00
Claudio Atzori	f4538f3c4c	cleanup	2021-11-19 11:33:10 +01:00
Claudio Atzori	2b46b87f56	fixed filtering criteria applied in SparkCopyRelationsNoOpenorgs to keep the parent/child relations from OpenOrgs	2021-11-19 11:30:29 +01:00
Claudio Atzori	a24b9f8268	[dedup] trivial refactoring	2021-11-18 17:12:02 +01:00
Claudio Atzori	c0750fb17c	avoid non necessary count operations over large spark datasets	2021-11-18 17:11:31 +01:00
Claudio Atzori	0a727d325d	[dedup] increased number of partitions in the consistency phase	2021-11-16 08:43:41 +01:00
miconis	611ca511db	set configuration property in openorgs duplicates wf	2021-10-07 15:39:55 +02:00

220 Commits