dnet-hadoop

Author	SHA1	Message	Date
Michele De Bonis	6df6b4583e	blacklist filtering moved before the cleanup phase in order to have case sensitive regex	2024-09-16 14:04:59 +02:00
Claudio Atzori	83327239de	fixed pom definitions, bumped dependency version for the dhp-schema module, removed unnecessary dependencies	2024-07-17 11:58:48 +02:00
Michele De Bonis	2a36ccb997	optimization of normalization stage in openorgs workflow, implementation of new comparators replacing older versions, openorgs configuration update, addition of inference flag in model definition, new test classes	2024-07-09 16:58:10 +02:00
Michele De Bonis	a10e8d9f05	implementation of countryMatch and addition of workflow parameters	2024-06-28 16:46:52 +02:00
Giambattista Bloisi	4f2a61e10f	Change the selection criteria for the pivot record of a group so that best pid type becomes the first criterion. This has the effect of slowly converging to records having a DOI pid	2024-06-11 15:33:56 +02:00
Claudio Atzori	ce2364743a	applying changes from PR#442: Fix for missing collectedfrom after dedup	2024-06-06 10:43:43 +02:00
Claudio Atzori	107d958b89	[org dedup] avoid NPEs in SparkPrepareNewOrgs	2024-05-27 11:59:54 +02:00
Claudio Atzori	3a7a6ecc32	[org dedup] avoid NPEs in SparkPrepareOrgRels	2024-05-27 11:59:45 +02:00
Claudio Atzori	1af4224d3d	[org dedup] avoid NPEs in SparkPrepareOrgRels	2024-05-27 11:59:33 +02:00
Sandro La Bruzzo	66c1ffc866	merged again from beta (I hope for the last time)	2024-05-22 11:02:46 +02:00
Claudio Atzori	50c18f7a0b	[dedup wf] revised memory settings to address the increased volume of input contents	2024-04-30 12:34:16 +02:00
Giambattista Bloisi	1878199dae	Miscellaneous fixes: - in Merge By ID pick by preference those records coming from delegated Authorities - fix various tests - close spark session in SparkCreateSimRels	2024-04-24 08:12:45 +02:00
Sandro La Bruzzo	073f320c6a	Added module containing all the dependencies, useful for spark deploy on k8.	2024-04-22 11:32:31 +02:00
Claudio Atzori	0656ab2838	code formatting	2024-04-20 08:10:58 +02:00
Sandro La Bruzzo	b84ad0c06e	merged beta	2024-04-19 14:39:59 +02:00
Sandro La Bruzzo	8dd9cf84e2	code formatted	2024-04-19 12:30:59 +02:00
Sandro La Bruzzo	342cb6189b	fixed problem with changed signature of RowEncoder; removed property dhp.schema.artifact	2024-04-19 12:13:26 +02:00
Giambattista Bloisi	8ac167e420	Refinements to PR #404 : refactoring the Oaf records merge utilities into dhp-common	2024-04-16 17:18:28 +02:00
Giambattista Bloisi	43b454399f	- Bug fix in matchOrderedTokenAndAbbreviations algorithms where tokens with same initial character were always considered equal - AuthorsMatch exploits the new matching strategy used for ORCID enhancements in #PR398: split author names into tokens, order the tokens, then check for matches of ordered full tokens or abbreviations	2024-04-15 18:19:29 +02:00
Claudio Atzori	ef52128c55	included new stats* workflows in parent pom list of modules, code formatting	2024-03-26 10:42:10 +01:00
Giambattista Bloisi	664a381d31	Unify merge logic of entities in MergeUtils.class	2024-03-18 16:04:49 +01:00
Giambattista Bloisi	b19643f6eb	Dedup aliases, created when a dedup in a previous build has been merged in a new dedup, need to be marked as "deletedbyinference", since they are "merged" in the new dedup	2024-02-08 15:34:59 +01:00
Claudio Atzori	6fd25cf549	code formatting	2024-01-23 08:47:12 +01:00
Giambattista Bloisi	21a14fcd80	Reusable RunSQLSparkJob for executing SQL in Spark through Oozie Spark Actions. Implements pivots table update oozie workflow	2024-01-15 10:18:14 +01:00
Giambattista Bloisi	3c66e3bd7b	Create dedup record for "merged" pivots. Do not create dedup records for groups that have more than 20 different acceptance dates	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	10e135db1e	Use dedup_wf_002 in place of dedup_wf_001 to make explicit a different algorithm has been used to generate those kind of ids	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	831cc1fdde	Generate "merged" dedup id relations also for records that are filtered out by the cut parameters	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	1287315ffb	No longer use dedupId information from pivotHistory Database	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	02636e802c	SparkCreateSimRels: - Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results - Clean titles once before clustering and similarity comparisons - Added support for filtered fields in model - Added support for sorting List fields in model - Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions - Added new maxLengthMatch comparator function - Use reduced complexity Levenshtein with threshold in levensteinTitle - Use reduced complexity AuthorsMatch with threshold early-quit - Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor - Use new clusterings configuration in Dedup tests. SparkWhitelistSimRels: use left semi join for clarity and performance. SparkCreateMergeRels: - Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX - Refactored to use Windowing sorting rather than groupBy to reduce memory pressure - Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past - Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01" - Changed generation of ids of type dedup_wf_001 to avoid collisions. DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure	2024-01-10 22:59:52 +01:00
Claudio Atzori	431c6bb08a	[dedup] added isLookupUrl to the graph consistency workflow definition, required now by the entity grouping phase	2023-12-06 11:06:46 +01:00
Sandro La Bruzzo	8c3e9a09d3	added repository openaire-third-parties	2023-12-05 19:11:06 +01:00
Giambattista Bloisi	2fa78f6071	Changes required to build and run tests with Java 17	2023-12-05 19:11:06 +01:00
Giambattista Bloisi	326c9dc08c	Changes in maven poms to build and test the project using Spark 3.4.x and scala 2.12	2023-12-05 19:11:06 +01:00
Claudio Atzori	03670bb9ce	[dedup] use common saveParquet and save methods to ensure outputs are compressed	2023-10-16 10:55:47 +02:00
Claudio Atzori	eed9fe0902	code formatting	2023-10-06 12:31:17 +02:00
Claudio Atzori	7b403a920f	Merge branch 'beta' into consistency_keep_mergerels	2023-10-02 11:26:00 +02:00
Giambattista Bloisi	2caaaec42d	Include SparkCleanRelation logic in SparkPropagateRelation. SparkPropagateRelation includes merge relations. Revised tests for SparkPropagateRelation	2023-09-04 11:33:20 +02:00
Giambattista Bloisi	6cc7d8ca7b	GroupEntities and DispatchEntities are now merged in GroupEntitiesSparkJob	2023-08-30 10:43:31 +02:00
Giambattista Bloisi	6b1c05d118	Add sparkExecutorMemoryOverhead workflow config to set off-heap memory for Spark actions. If not explicitly set, it defaults to 1 GB	2023-08-29 16:04:19 +02:00
Claudio Atzori	bf35280ea6	code formatting	2023-08-29 11:11:00 +02:00
Claudio Atzori	58665a246c	Merge branch 'beta' into propagate_relation_rewrite	2023-08-29 10:47:02 +02:00
Giambattista Bloisi	d012aec0b3	Revert PropagateRelation's argument name from outputPath to graphOutputPath in consistency workflow (#8964)	2023-08-28 22:44:54 +02:00
Giambattista Bloisi	a860e19423	Fix: ensure all relations are written out, not only those managed by dedup	2023-08-28 15:36:02 +02:00
Giambattista Bloisi	0d7b2bf83d	Rewrite SparkPropagateRelation exploiting Dataframe API	2023-08-28 10:34:54 +02:00
Giambattista Bloisi	95cd2b9b1e	Make filterInvisible a mandatory parameter of DispatchEntitiesSparkJob. Make filterInvisible a mandatory parameter of both dedup/consistency and graph/group oozie workflows	2023-08-10 11:53:48 +02:00
Giambattista Bloisi	fab9920271	DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag	2023-08-09 15:41:43 +02:00
Giambattista Bloisi	97b6d1dc45	Filter ids by dataInfo.deletedbyinference and dataInfo.invisible flags. Filter relations also by dataInfo.invisible flag	2023-08-07 10:24:11 +02:00
Giambattista Bloisi	af49424b59	Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities	2023-08-04 14:27:39 +02:00
Giambattista Bloisi	e64c2854a3	Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface. JsonPath cache contention fixed by using a ConcurrentHashMap. Blacklist filtering performance improvement. Minor performance improvements when evaluating similarity. Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)	2023-07-24 15:36:24 +02:00
Giambattista Bloisi	bb5b845e3c	Use scala.binary.version property to resolve scala maven dependencies. Ensure consistent usage of maven properties. Profile for compiling with scala 2.12 and Spark 3.4	2023-07-24 11:13:48 +02:00


270 Commits