dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Michele De Bonis	a10e8d9f05	implementation of countryMatch and addition of workflow parameters	2024-06-28 16:46:52 +02:00
Giambattista Bloisi	4f2a61e10f	Change the selection criteria for the pivot record of a group so that best pid type becomes the first criterion. This will have the effect of slowly converging to records having a DOI pid	2024-06-11 15:33:56 +02:00
Claudio Atzori	ce2364743a	applying changes from PR#442: Fix for missing collectedfrom after dedup	2024-06-06 10:43:43 +02:00
Claudio Atzori	107d958b89	[org dedup] avoid NPEs in SparkPrepareNewOrgs	2024-05-27 11:59:54 +02:00
Claudio Atzori	3a7a6ecc32	[org dedup] avoid NPEs in SparkPrepareOrgRels	2024-05-27 11:59:45 +02:00
Claudio Atzori	1af4224d3d	[org dedup] avoid NPEs in SparkPrepareOrgRels	2024-05-27 11:59:33 +02:00
Sandro La Bruzzo	66c1ffc866	merged again from beta (I hope for the last time)	2024-05-22 11:02:46 +02:00
Claudio Atzori	50c18f7a0b	[dedup wf] revised memory settings to address the increased volume of input contents	2024-04-30 12:34:16 +02:00
Giambattista Bloisi	1878199dae	Miscellaneous fixes: - in Merge By ID pick by preference those records coming from delegated Authorities - fix various tests - close spark session in SparkCreateSimRels	2024-04-24 08:12:45 +02:00
Sandro La Bruzzo	073f320c6a	Added module containing all the dependencies, useful for spark deploy on k8.	2024-04-22 11:32:31 +02:00
Claudio Atzori	0656ab2838	code formatting	2024-04-20 08:10:58 +02:00
Sandro La Bruzzo	b84ad0c06e	merged beta	2024-04-19 14:39:59 +02:00
Sandro La Bruzzo	8dd9cf84e2	code formatted	2024-04-19 12:30:59 +02:00
Sandro La Bruzzo	342cb6189b	fixed problem on changed signature on RowEncoder removed property dhp.schema.artifact	2024-04-19 12:13:26 +02:00
Giambattista Bloisi	8ac167e420	Refinements to PR #404 : refactoring the Oaf records merge utilities into dhp-common	2024-04-16 17:18:28 +02:00
Giambattista Bloisi	43b454399f	- Bug fix in matchOrderedTokenAndAbbreviations algorithms where tokens with same initial character were always considered equal - AuthorsMatch exploits the new matching strategy used for ORCID enhancements in #PR398: split author names in tokens, order the tokens, then check for matches of ordered full tokens or abbreviations	2024-04-15 18:19:29 +02:00
Claudio Atzori	ef52128c55	included new stats* workflows in parent pom list of modules, code formatting	2024-03-26 10:42:10 +01:00
Giambattista Bloisi	664a381d31	Unify merge logic of entities in MergeUtils.class	2024-03-18 16:04:49 +01:00
Giambattista Bloisi	b19643f6eb	Dedup aliases, created when a dedup in a previous build has been merged in a new dedup, need to be marked as "deletedbyinference", since they are "merged" in the new dedup	2024-02-08 15:34:59 +01:00
Claudio Atzori	6fd25cf549	code formatting	2024-01-23 08:47:12 +01:00
Giambattista Bloisi	21a14fcd80	Reusable RunSQLSparkJob for executing SQL in Spark through Oozie Spark Actions Implements pivots table update oozie workflow	2024-01-15 10:18:14 +01:00
Giambattista Bloisi	3c66e3bd7b	Create dedup record for "merged" pivots Do not create dedup records for groups that have more than 20 different acceptance dates	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	10e135db1e	Use dedup_wf_002 in place of dedup_wf_001 to make explicit a different algorithm has been used to generate those kind of ids	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	831cc1fdde	Generate "merged" dedup id relations also for records that are filtered out by the cut parameters	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	1287315ffb	No longer use dedupId information from pivotHistory Database	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	02636e802c	SparkCreateSimRels: - Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results - Clean titles once before clustering and similarity comparisons - Added support for filtered fields in model - Added support for sorting List fields in model - Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions - Added new maxLengthMatch comparator function - Use reduced complexity Levenshtein with threshold in levensteinTitle - Use reduced complexity AuthorsMatch with threshold early-quit - Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor - Use new clusterings configuration in Dedup tests SparkWhitelistSimRels: use left semi join for clarity and performance SparkCreateMergeRels: - Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX - Refactored to use Windowing sorting rather than groupBy to reduce memory pressure - Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past - Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01" - Changed generation of ids of type dedup_wf_001 to avoid collisions DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure	2024-01-10 22:59:52 +01:00
Claudio Atzori	431c6bb08a	[dedup] added isLookupUrl to the graph consistency workflow definition, required now by the entity grouping phase	2023-12-06 11:06:46 +01:00
Sandro La Bruzzo	8c3e9a09d3	added repository openaire-third-parties	2023-12-05 19:11:06 +01:00
Giambattista Bloisi	2fa78f6071	Changes required to build and run tests with Java 17	2023-12-05 19:11:06 +01:00
Giambattista Bloisi	326c9dc08c	Changes in maven poms to build and test the project using Spark 3.4.x and scala 2.12	2023-12-05 19:11:06 +01:00
Claudio Atzori	03670bb9ce	[dedup] use common saveParquet and save methods to ensure outputs are compressed	2023-10-16 10:55:47 +02:00
Claudio Atzori	eed9fe0902	code formatting	2023-10-06 12:31:17 +02:00
Claudio Atzori	7b403a920f	Merge branch 'beta' into consistency_keep_mergerels	2023-10-02 11:26:00 +02:00
Giambattista Bloisi	2caaaec42d	Include SparkCleanRelation logic in SparkPropagateRelation SparkPropagateRelation includes merge relations Revised tests for SparkPropagateRelation	2023-09-04 11:33:20 +02:00
Giambattista Bloisi	6cc7d8ca7b	GroupEntities and DispatchEntities are now merged in GroupEntitiesSparkJob	2023-08-30 10:43:31 +02:00
Giambattista Bloisi	6b1c05d118	Add sparkExecutorMemoryOverhead workflow config to set off-heap memory for Spark actions. If not explicitly set it is defaulted to 1Gb	2023-08-29 16:04:19 +02:00
Claudio Atzori	bf35280ea6	code formatting	2023-08-29 11:11:00 +02:00
Claudio Atzori	58665a246c	Merge branch 'beta' into propagate_relation_rewrite	2023-08-29 10:47:02 +02:00
Giambattista Bloisi	d012aec0b3	Revert PropagateRelation's argument name from outputPath to graphOutputPath in consistency workflow (#8964)	2023-08-28 22:44:54 +02:00
Giambattista Bloisi	a860e19423	Fix ensure all relations are written out, not only those managed by dedup	2023-08-28 15:36:02 +02:00
Giambattista Bloisi	0d7b2bf83d	Rewrite SparkPropagateRelation exploiting Dataframe API	2023-08-28 10:34:54 +02:00
Giambattista Bloisi	95cd2b9b1e	Make filterInvisible a mandatory parameter of DispatchEntitiesSparkJob Make filterInvisible a mandatory parameter of both dedup/consistency and graph/group oozie workflows	2023-08-10 11:53:48 +02:00
Giambattista Bloisi	fab9920271	DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag	2023-08-09 15:41:43 +02:00
Giambattista Bloisi	97b6d1dc45	Filter ids by dataInfo.deletedbyinference and DataInfo.invisible flags Filter relations also by dataInfo.invisible flag	2023-08-07 10:24:11 +02:00
Giambattista Bloisi	af49424b59	Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities	2023-08-04 14:27:39 +02:00
Giambattista Bloisi	e64c2854a3	Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface JsonPath cache contention fixed by using a ConcurrentHashMap Blacklist filtering performance improvement Minor performance improvements when evaluating similarity Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)	2023-07-24 15:36:24 +02:00
Giambattista Bloisi	bb5b845e3c	Use scala.binary.version property to resolve scala maven dependencies Ensure consistent usage of maven properties Profile for compiling with scala 2.12 and Spark 3.4	2023-07-24 11:13:48 +02:00
Giambattista Bloisi	5e15f20e6e	Fix entityMerger that was excluding the authors of the first entity in the list to merge	2023-07-21 00:46:54 +02:00
Giambattista Bloisi	dba34505de	Fix SparkStatsTest bug where parquet tables were incorrectly read as text files leading to unpredictable count() values	2023-07-19 14:24:52 +02:00
Giambattista Bloisi	bd3fcf869a	rename dnet-pace-core into dhp-pace-core module and use it as dependency in other modules	2023-07-06 10:02:23 +02:00
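
Commit 43b454399f describes the author matching strategy as: split author names in tokens, order the tokens, then check for matches of ordered full tokens or abbreviations. The Python sketch below is one possible reading of that description and is not the project's code. The function names `ordered_tokens`, `token_matches` and `same_author` are hypothetical, and treating a single-letter token as an abbreviation is an assumption made for illustration.

```python
import re


def ordered_tokens(name: str) -> list[str]:
    """Lowercase the name, split it on non-word characters, and sort the tokens."""
    return sorted(t for t in re.split(r"[^\w]+", name.lower()) if t)


def token_matches(a: str, b: str) -> bool:
    """Compare two tokens.

    Assumption: a single-letter token is an abbreviation and matches any token
    with the same initial. Two full tokens must be equal, so sharing an initial
    is not enough (the bug mentioned in the commit message).
    """
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def same_author(name_a: str, name_b: str) -> bool:
    """Match two names by comparing their ordered tokens pairwise."""
    ta, tb = ordered_tokens(name_a), ordered_tokens(name_b)
    return len(ta) == len(tb) and all(token_matches(x, y) for x, y in zip(ta, tb))
```

With this sketch, `same_author("Bloisi, Giambattista", "G. Bloisi")` matches through the abbreviation, while `same_author("Maria Rossi", "Marco Rossi")` does not, because two full tokens that only share an initial are not considered equal.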

267 Commits