dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Giambattista Bloisi	3c66e3bd7b	Create dedup record for "merged" pivots; do not create dedup records for groups that have more than 20 different acceptance dates	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	10e135db1e	Use dedup_wf_002 in place of dedup_wf_001 to make explicit that a different algorithm has been used to generate this kind of id	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	831cc1fdde	Generate "merged" dedup id relations also for records that are filtered out by the cut parameters	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	1287315ffb	No longer use dedupId information from pivotHistory Database	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	02636e802c	SparkCreateSimRels: - Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results - Clean titles once before clustering and similarity comparisons - Added support for filtered fields in model - Added support for sorting List fields in model - Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions - Added new maxLengthMatch comparator function - Use reduced complexity Levenshtein with threshold in levensteinTitle - Use reduced complexity AuthorsMatch with threshold early-quit - Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor - Use new clusterings configuration in Dedup tests. SparkWhitelistSimRels: use left semi join for clarity and performance. SparkCreateMergeRels: - Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX - Refactored to use Windowing sorting rather than groupBy to reduce memory pressure - Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past - Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01" - Changed generation of ids of type dedup_wf_001 to avoid collisions. DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure	2024-01-10 22:59:52 +01:00
Claudio Atzori	431c6bb08a	[dedup] added isLookupUrl to the graph consistency workflow definition, required now by the entity grouping phase	2023-12-06 11:06:46 +01:00
Claudio Atzori	03670bb9ce	[dedup] use common saveParquet and save methods to ensure outputs are compressed	2023-10-16 10:55:47 +02:00
Claudio Atzori	eed9fe0902	code formatting	2023-10-06 12:31:17 +02:00
Claudio Atzori	7b403a920f	Merge branch 'beta' into consistency_keep_mergerels	2023-10-02 11:26:00 +02:00
Giambattista Bloisi	2caaaec42d	Include SparkCleanRelation logic in SparkPropagateRelation; SparkPropagateRelation includes merge relations; revised tests for SparkPropagateRelation	2023-09-04 11:33:20 +02:00
Giambattista Bloisi	6cc7d8ca7b	GroupEntities and DispatchEntities are now merged in GroupEntitiesSparkJob	2023-08-30 10:43:31 +02:00
Giambattista Bloisi	6b1c05d118	Add sparkExecutorMemoryOverhead workflow config to set off-heap memory for Spark actions. If not explicitly set, it defaults to 1 GB	2023-08-29 16:04:19 +02:00
Claudio Atzori	bf35280ea6	code formatting	2023-08-29 11:11:00 +02:00
Claudio Atzori	58665a246c	Merge branch 'beta' into propagate_relation_rewrite	2023-08-29 10:47:02 +02:00
Giambattista Bloisi	d012aec0b3	Revert PropagateRelation's argument name from outputPath to graphOutputPath in consistency workflow (#8964)	2023-08-28 22:44:54 +02:00
Giambattista Bloisi	a860e19423	Fix: ensure all relations are written out, not only those managed by dedup	2023-08-28 15:36:02 +02:00
Giambattista Bloisi	0d7b2bf83d	Rewrite SparkPropagateRelation exploiting Dataframe API	2023-08-28 10:34:54 +02:00
Giambattista Bloisi	95cd2b9b1e	Make filterInvisible a mandatory parameter of DispatchEntitiesSparkJob; make filterInvisible a mandatory parameter of both dedup/consistency and graph/group oozie workflows	2023-08-10 11:53:48 +02:00
Giambattista Bloisi	fab9920271	DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag	2023-08-09 15:41:43 +02:00
Giambattista Bloisi	97b6d1dc45	Filter ids by dataInfo.deletedbyinference and DataInfo.invisible flags; filter relations also by dataInfo.invisible flag	2023-08-07 10:24:11 +02:00
Giambattista Bloisi	af49424b59	Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities	2023-08-04 14:27:39 +02:00
Giambattista Bloisi	e64c2854a3	Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface; JsonPath cache contention fixed by using a ConcurrentHashMap; blacklist filtering performance improvement; minor performance improvements when evaluating similarity; sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)	2023-07-24 15:36:24 +02:00
Giambattista Bloisi	bb5b845e3c	Use scala.binary.version property to resolve scala maven dependencies; ensure consistent usage of maven properties; profile for compiling with scala 2.12 and Spark 3.4	2023-07-24 11:13:48 +02:00
Giambattista Bloisi	5e15f20e6e	Fix entityMerger that was excluding the authors of the first entity in the list to merge	2023-07-21 00:46:54 +02:00
Giambattista Bloisi	dba34505de	Fix SparkStatsTest bug where parquet tables were incorrectly read as text files leading to unpredictable count() values	2023-07-19 14:24:52 +02:00
Giambattista Bloisi	bd3fcf869a	rename dnet-pace-core into dhp-pace-core module and use it as dependency in other modules	2023-07-06 10:02:23 +02:00
Sandro La Bruzzo	9963fd6d29	updated log to add subentity	2023-06-28 13:36:05 +02:00
Sandro La Bruzzo	ed7e2ab6d1	reverted mistake on commit workflow.xml	2023-06-28 11:40:19 +02:00
Sandro La Bruzzo	9910ce06ae	added to CreateSimRel the feature to write time log	2023-06-28 11:38:16 +02:00
Sandro La Bruzzo	bd17c3edc8	added to CreateSimRel the feature to write time log	2023-06-28 11:20:58 +02:00
Claudio Atzori	909729a2fc	[dedup] tweaking num partitions, minor changes	2023-05-17 10:16:22 +02:00
Claudio Atzori	062abfd669	fixed NPE, removed unused stuff	2022-12-06 12:04:00 +01:00
Claudio Atzori	0aa725083f	extended dedup testing	2022-11-17 16:13:43 +01:00
Claudio Atzori	3dbc637d3e	code formatting	2022-11-17 09:55:41 +01:00
Claudio Atzori	ddff0e8999	merging duplicates using IdentifierComparator	2022-11-11 16:10:25 +01:00
Claudio Atzori	5af5a8ae42	added IdentifierComparator	2022-11-09 14:20:59 +01:00
Claudio Atzori	c26222623f	[maven-release-plugin] prepare for next development iteration	2022-04-07 13:32:22 +02:00
Claudio Atzori	86585a6b27	[maven-release-plugin] prepare release dhp-1.2.4	2022-04-07 13:32:19 +02:00
Claudio Atzori	ad85d88eaf	[maven-release-plugin] rollback the release of dhp-1.2.4	2022-04-07 13:28:35 +02:00
Claudio Atzori	598e11dfd7	[maven-release-plugin] prepare for next development iteration	2022-04-07 13:27:02 +02:00
Claudio Atzori	db3d9877a5	[maven-release-plugin] prepare release dhp-1.2.4	2022-04-07 13:26:58 +02:00
Claudio Atzori	3bba6d6e38	[maven-release-plugin] rollback the release of dhp-1.2.4	2022-04-07 12:23:17 +02:00
Claudio Atzori	2ac2d928bd	[maven-release-plugin] prepare for next development iteration	2022-04-07 12:18:47 +02:00
Claudio Atzori	85bc722ff4	[maven-release-plugin] prepare release dhp-1.2.4	2022-04-07 12:18:43 +02:00
Claudio Atzori	bc05b6168a	[maven-release-plugin] rollback the release of dhp-1.2.4	2022-04-07 11:49:06 +02:00
Claudio Atzori	505420fd61	[maven-release-plugin] prepare for next development iteration	2022-04-07 11:34:06 +02:00
Claudio Atzori	66e718981e	[maven-release-plugin] prepare release dhp-1.2.4	2022-04-07 11:34:02 +02:00
Claudio Atzori	61319b2e83	updated dhp-schema version; set entity-level dataInfo before & after merging the fields from the group of duplicates	2022-03-25 16:38:33 +01:00
miconis	c959639bd5	dependency updated to the new pace-core version	2022-03-15 16:33:03 +01:00
miconis	8991d097b4	bug fix in the DedupRecordFactory, DataInfo set before merge	2022-02-24 17:13:12 +01:00


243 Commits