dnet-hadoop

Author	SHA1	Message	Date
Claudio Atzori	6fd25cf549	code formatting	2024-01-23 08:47:12 +01:00
Giambattista Bloisi	21a14fcd80	Reusable RunSQLSparkJob for executing SQL in Spark through Oozie Spark Actions. Implements pivots table update oozie workflow	2024-01-15 10:18:14 +01:00
Giambattista Bloisi	3c66e3bd7b	Create dedup record for "merged" pivots. Do not create dedup records for groups that have more than 20 different acceptance dates	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	10e135db1e	Use dedup_wf_002 in place of dedup_wf_001 to make explicit that a different algorithm has been used to generate this kind of id	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	831cc1fdde	Generate "merged" dedup id relations also for records that are filtered out by the cut parameters	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	1287315ffb	No longer use dedupId information from pivotHistory Database	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	02636e802c	SparkCreateSimRels: create dedup blocks from the complete queue of records matching cluster key instead of truncating the results; clean titles once before clustering and similarity comparisons; added support for filtered fields in model; added support for sorting List fields in model; added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions; added new maxLengthMatch comparator function; use reduced complexity Levenshtein with threshold in levensteinTitle; use reduced complexity AuthorsMatch with threshold early-quit; use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor; use new clusterings configuration in Dedup tests. SparkWhitelistSimRels: use left semi join for clarity and performance. SparkCreateMergeRels: use new connected component algorithm that converges faster than Spark GraphX provided algorithm; refactored to use Windowing sorting rather than groupBy to reduce memory pressure; use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past; comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"; changed generation of ids of type dedup_wf_001 to avoid collisions. DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure	2024-01-10 22:59:52 +01:00
Claudio Atzori	431c6bb08a	[dedup] added isLookupUrl to the graph consistency workflow definition, required now by the entity grouping phase	2023-12-06 11:06:46 +01:00
Claudio Atzori	03670bb9ce	[dedup] use common saveParquet and save methods to ensure outputs are compressed	2023-10-16 10:55:47 +02:00
Claudio Atzori	eed9fe0902	code formatting	2023-10-06 12:31:17 +02:00
Claudio Atzori	7b403a920f	Merge branch 'beta' into consistency_keep_mergerels	2023-10-02 11:26:00 +02:00
Giambattista Bloisi	2caaaec42d	Include SparkCleanRelation logic in SparkPropagateRelation. SparkPropagateRelation includes merge relations. Revised tests for SparkPropagateRelation	2023-09-04 11:33:20 +02:00
Giambattista Bloisi	6cc7d8ca7b	GroupEntities and DispatchEntities are now merged in GroupEntitiesSparkJob	2023-08-30 10:43:31 +02:00
Giambattista Bloisi	6b1c05d118	Add sparkExecutorMemoryOverhead workflow config to set off-heap memory for Spark actions. If not explicitly set it is defaulted to 1Gb	2023-08-29 16:04:19 +02:00
Claudio Atzori	bf35280ea6	code formatting	2023-08-29 11:11:00 +02:00
Claudio Atzori	58665a246c	Merge branch 'beta' into propagate_relation_rewrite	2023-08-29 10:47:02 +02:00
Giambattista Bloisi	d012aec0b3	Revert PropagateRelation's argument name from outputPath to graphOutputPath in consistency workflow (#8964)	2023-08-28 22:44:54 +02:00
Giambattista Bloisi	a860e19423	Fix: ensure all relations are written out, not only those managed by dedup	2023-08-28 15:36:02 +02:00
Giambattista Bloisi	0d7b2bf83d	Rewrite SparkPropagateRelation exploiting Dataframe API	2023-08-28 10:34:54 +02:00
Giambattista Bloisi	95cd2b9b1e	Make filterInvisible a mandatory parameter of DispatchEntitiesSparkJob. Make filterInvisible a mandatory parameter of both dedup/consistency and graph/group oozie workflows	2023-08-10 11:53:48 +02:00
Giambattista Bloisi	fab9920271	DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag	2023-08-09 15:41:43 +02:00
Giambattista Bloisi	97b6d1dc45	Filter ids by dataInfo.deletedbyinference and DataInfo.invisible flags. Filter relations also by dataInfo.invisible flag	2023-08-07 10:24:11 +02:00
Giambattista Bloisi	af49424b59	Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities	2023-08-04 14:27:39 +02:00
Giambattista Bloisi	e64c2854a3	Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface. JsonPath cache contention fixed by using a ConcurrentHashMap. Blacklist filtering performance improvement. Minor performance improvements when evaluating similarity. Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)	2023-07-24 15:36:24 +02:00
Giambattista Bloisi	bb5b845e3c	Use scala.binary.version property to resolve scala maven dependencies. Ensure consistent usage of maven properties. Profile for compiling with scala 2.12 and Spark 3.4	2023-07-24 11:13:48 +02:00
Giambattista Bloisi	5e15f20e6e	Fix entityMerger that was excluding the authors of the first entity in the list to merge	2023-07-21 00:46:54 +02:00
Giambattista Bloisi	dba34505de	Fix SparkStatsTest bug where parquet tables were incorrectly read as text files leading to unpredictable count() values	2023-07-19 14:24:52 +02:00
Giambattista Bloisi	bd3fcf869a	rename dnet-pace-core into dhp-pace-core module and use it as dependency in other modules	2023-07-06 10:02:23 +02:00
Sandro La Bruzzo	9963fd6d29	updated log to add subentity	2023-06-28 13:36:05 +02:00
Sandro La Bruzzo	ed7e2ab6d1	reverted mistake on commit workflow.xml	2023-06-28 11:40:19 +02:00
Sandro La Bruzzo	9910ce06ae	added to CreateSimRel the feature to write time log	2023-06-28 11:38:16 +02:00
Sandro La Bruzzo	bd17c3edc8	added to CreateSimRel the feature to write time log	2023-06-28 11:20:58 +02:00
Claudio Atzori	909729a2fc	[dedup] tweaking num partitions, minor changes	2023-05-17 10:16:22 +02:00
Claudio Atzori	062abfd669	fixed NPE, removed unused stuff	2022-12-06 12:04:00 +01:00
Claudio Atzori	0aa725083f	extended dedup testing	2022-11-17 16:13:43 +01:00
Claudio Atzori	3dbc637d3e	code formatting	2022-11-17 09:55:41 +01:00
Claudio Atzori	ddff0e8999	merging duplicates using IdentifierComparator	2022-11-11 16:10:25 +01:00
Claudio Atzori	5af5a8ae42	added IdentifierComparator	2022-11-09 14:20:59 +01:00
Claudio Atzori	c26222623f	[maven-release-plugin] prepare for next development iteration	2022-04-07 13:32:22 +02:00
Claudio Atzori	86585a6b27	[maven-release-plugin] prepare release dhp-1.2.4	2022-04-07 13:32:19 +02:00
Claudio Atzori	ad85d88eaf	[maven-release-plugin] rollback the release of dhp-1.2.4	2022-04-07 13:28:35 +02:00
Claudio Atzori	598e11dfd7	[maven-release-plugin] prepare for next development iteration	2022-04-07 13:27:02 +02:00
Claudio Atzori	db3d9877a5	[maven-release-plugin] prepare release dhp-1.2.4	2022-04-07 13:26:58 +02:00
Claudio Atzori	3bba6d6e38	[maven-release-plugin] rollback the release of dhp-1.2.4	2022-04-07 12:23:17 +02:00
Claudio Atzori	2ac2d928bd	[maven-release-plugin] prepare for next development iteration	2022-04-07 12:18:47 +02:00
Claudio Atzori	85bc722ff4	[maven-release-plugin] prepare release dhp-1.2.4	2022-04-07 12:18:43 +02:00
Claudio Atzori	bc05b6168a	[maven-release-plugin] rollback the release of dhp-1.2.4	2022-04-07 11:49:06 +02:00
Claudio Atzori	505420fd61	[maven-release-plugin] prepare for next development iteration	2022-04-07 11:34:06 +02:00
Claudio Atzori	66e718981e	[maven-release-plugin] prepare release dhp-1.2.4	2022-04-07 11:34:02 +02:00
Claudio Atzori	61319b2e83	updated dhp-schema version; set entity-level dataInfo before & after merging the fields from the group of duplicates	2022-03-25 16:38:33 +01:00

245 Commits