dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Claudio Atzori	67e37f41fb	Merge pull request 'blacklist filtering moved before the cleanup phase in order to have case sensitive regex' (#485) from dedup_blacklist_fix into beta. Reviewed-on: #485	2024-10-28 09:42:51 +01:00
Claudio Atzori	d3764265d5	Merge pull request '[dedup] avoid NPEs in the countryInference dedup utility' (#475) from dedup_countryInference_NPE into beta. Reviewed-on: #475	2024-10-25 10:12:06 +02:00
Giambattista Bloisi	0e34b0ece1	Fix imports: point them from the main distribution packages	2024-10-23 14:01:52 +02:00
Giambattista Bloisi	56b05cde0b	Revert the changes for IgnoreUndefined management in tree evaluation	2024-10-11 10:35:15 +02:00
Michele De Bonis	6df6b4583e	blacklist filtering moved before the cleanup phase in order to have case sensitive regex	2024-09-16 14:04:59 +02:00
Claudio Atzori	75a11d0ba5	[dedup] avoid NPEs in the countryInference dedup utility	2024-07-25 16:34:32 +02:00
Claudio Atzori	83327239de	fixed pom definitions, bumped dependency version for the dhp-schema module, removed unnecessary dependencies	2024-07-17 11:58:48 +02:00
Michele De Bonis	2a36ccb997	optimization of normalization stage in openorgs workflow, implementation of new comparators replacing older versions, openorgs configuration update, addition of inference flag in model definition, new test classes	2024-07-09 16:58:10 +02:00
Michele De Bonis	a10e8d9f05	implementation of countryMatch and addition of workflow parameters	2024-06-28 16:46:52 +02:00
Sandro La Bruzzo	db358ad0d2	code formatted	2024-05-02 15:25:57 +02:00
Sandro La Bruzzo	26bf8e763a	merged from beta	2024-05-02 15:20:23 +02:00
Claudio Atzori	4355f64810	reverted to version 1.2.5-SNAPSHOT	2024-05-02 11:23:53 +02:00
Claudio Atzori	66680b8b9a	refactoring of common utilities	2024-05-02 11:16:58 +02:00
Sandro La Bruzzo	0d628cd62b	merged again from beta	2024-04-23 17:34:55 +02:00
Claudio Atzori	c3053ef34d	using version 1.2.5-beta for the release	2024-04-23 14:52:32 +02:00
Claudio Atzori	b5bcab13ec	using version 1.2.5-beta for the release	2024-04-23 14:36:39 +02:00
Claudio Atzori	425c9afc36	using version 1.2.5-beta for the release	2024-04-23 14:30:04 +02:00
Sandro La Bruzzo	073f320c6a	Added module containing all the dependencies, useful for spark deploy on k8.	2024-04-22 11:32:31 +02:00
Claudio Atzori	0656ab2838	code formatting	2024-04-20 08:10:58 +02:00
Sandro La Bruzzo	b72c3139e2	updated Ignore annotation that is deprecated to Disabled	2024-04-19 14:52:40 +02:00
Sandro La Bruzzo	b84ad0c06e	merged beta	2024-04-19 14:39:59 +02:00
Giambattista Bloisi	8ac167e420	Refinements to PR #404: refactoring the Oaf records merge utilities into dhp-common	2024-04-16 17:18:28 +02:00
Giambattista Bloisi	43b454399f	Bug fix in matchOrderedTokenAndAbbreviations algorithms where tokens with same initial character were always considered equal. AuthorsMatch exploits the new matching strategy used for ORCID enhancements in #PR398: split author names in tokens, order the tokens, then check for matches of ordered full tokens or abbreviations	2024-04-15 18:19:29 +02:00
Giambattista Bloisi	d65285da7f	Promote "Research" to a jolly instanceType in dedup comparisons. Compare "Journal" and "Part of book or chapter of book" with "Article"	2024-02-15 12:11:04 +01:00
Giambattista Bloisi	29194472a7	Promote "Research" to a jolly instanceType in dedup comparisons. Compare Part of book or chapter of book with Article	2024-02-15 11:53:46 +01:00
Giambattista Bloisi	02636e802c	SparkCreateSimRels: create dedup blocks from the complete queue of records matching cluster key instead of truncating the results; clean titles once before clustering and similarity comparisons; added support for filtered fields in model; added support for sorting List fields in model; added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions; added new maxLengthMatch comparator function; use reduced complexity Levenshtein with threshold in levensteinTitle; use reduced complexity AuthorsMatch with threshold early-quit; use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor; use new clusterings configuration in Dedup tests. SparkWhitelistSimRels: use left semi join for clarity and performance. SparkCreateMergeRels: use new connected component algorithm that converges faster than Spark GraphX provided algorithm; refactored to use Windowing sorting rather than groupBy to reduce memory pressure; use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past; comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"; changed generation of ids of type dedup_wf_001 to avoid collisions. DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	613ec5ffce	Add profiles for different spark versions: spark-24, spark-34, spark-35	2023-12-05 19:11:06 +01:00
Sandro La Bruzzo	8c3e9a09d3	added repository openaire-third-parties	2023-12-05 19:11:06 +01:00
Giambattista Bloisi	2fa78f6071	Changes required to build and run tests with Java 17	2023-12-05 19:11:06 +01:00
Giambattista Bloisi	326c9dc08c	Changes in maven poms to build and test the project using Spark 3.4.x and scala 2.12	2023-12-05 19:11:06 +01:00
Giambattista Bloisi	c412dc162b	Fix bug in conversion from dedup json model to Spark Dataset of Rows: list of strings contained the json escaped representation of the value instead of the plain value, this caused instanceTypeMatch failures because of the leading and trailing double quotes	2023-10-02 11:34:51 +02:00
Giambattista Bloisi	3c47920c78	Use asScala to convert java List to Scala Sequence	2023-10-02 11:04:47 +02:00
Giambattista Bloisi	e239b81740	Fix defect #8997: GenerateEventsJob is generating huge amounts of logs because broker entity similarity calculation consistently failed	2023-10-02 11:04:18 +02:00
Sandro La Bruzzo	76476cdfb6	Added maven repo for dependencies that are not in maven central	2023-09-20 10:33:14 +02:00
Claudio Atzori	bf35280ea6	code formatting	2023-08-29 11:11:00 +02:00
Giambattista Bloisi	e64c2854a3	Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface. JsonPath cache contention fixed by using a ConcurrentHashMap. Blacklist filtering performance improvement. Minor performance improvements when evaluating similarity. Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)	2023-07-24 15:36:24 +02:00
Giambattista Bloisi	bb5b845e3c	Use scala.binary.version property to resolve scala maven dependencies. Ensure consistent usage of maven properties. Profile for compiling with scala 2.12 and Spark 3.4	2023-07-24 11:13:48 +02:00
Giambattista Bloisi	801da2fd4a	New sources formatted by maven plugin	2023-07-06 10:28:53 +02:00
Giambattista Bloisi	bd3fcf869a	rename dnet-pace-core into dhp-pace-core module and use it as dependency in other modules	2023-07-06 10:02:23 +02:00

39 Commits
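
Commit 43b454399f describes the author matching strategy as: split author names in tokens, order the tokens, then check for matches of ordered full tokens or abbreviations. The following is a minimal Python sketch of that idea, not the project's code. The function names (`name_tokens`, `token_matches`, `ordered_token_match`) are hypothetical, and the sketch assumes that both names have the same number of tokens and that an abbreviation is a single-letter token. It also illustrates the bug the commit mentions: two full tokens that merely share an initial must not be treated as equal.

```python
import re


def name_tokens(name):
    """Lowercase the name, split on non-word characters, and sort the tokens."""
    return sorted(t for t in re.split(r"\W+", name.lower()) if t)


def token_matches(a, b):
    """Compare two tokens.

    Assumption: a single-letter token is an abbreviation and matches any
    token with the same initial. Two full tokens must be identical; sharing
    only the first character is not enough.
    """
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def ordered_token_match(name_a, name_b):
    """Pairwise comparison of the sorted tokens of two author names."""
    tokens_a, tokens_b = name_tokens(name_a), name_tokens(name_b)
    if len(tokens_a) != len(tokens_b):
        return False
    return all(token_matches(x, y) for x, y in zip(tokens_a, tokens_b))


if __name__ == "__main__":
    print(ordered_token_match("John Smith", "Smith, J."))
    print(ordered_token_match("John Smith", "Jane Smith"))
```

Sorting makes the comparison independent of the order in which given name and surname appear. The pairwise walk is a simplification: it can miss matches when an abbreviation sorts into a different position than the full token it stands for.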