dnet-hadoop

sabeel

Author	SHA1	Message	Date
Sandro La Bruzzo	0d628cd62b	merged again from beta	2024-04-23 17:34:55 +02:00
Sandro La Bruzzo	073f320c6a	Added module containing all the dependencies, useful for Spark deployment on k8s.	2024-04-22 11:32:31 +02:00
Claudio Atzori	0656ab2838	code formatting	2024-04-20 08:10:58 +02:00
Sandro La Bruzzo	b72c3139e2	Replaced the deprecated Ignore annotation with Disabled	2024-04-19 14:52:40 +02:00
Sandro La Bruzzo	b84ad0c06e	merged beta	2024-04-19 14:39:59 +02:00
Giambattista Bloisi	8ac167e420	Refinements to PR #404 : refactoring the Oaf records merge utilities into dhp-common	2024-04-16 17:18:28 +02:00
Giambattista Bloisi	43b454399f	Bug fix in matchOrderedTokenAndAbbreviations algorithms where tokens with same initial character were always considered equal; AuthorsMatch exploits the new matching strategy used for ORCID enhancements in #PR398: split author names in tokens, order the tokens, then check for matches of ordered full tokens or abbreviations	2024-04-15 18:19:29 +02:00
Giambattista Bloisi	d65285da7f	Promote "Research" to a jolly instanceType in dedup comparisons; compare "Journal" and "Part of book or chapter of book" with "Article"	2024-02-15 12:11:04 +01:00
Giambattista Bloisi	29194472a7	Promote "Research" to a jolly instanceType in dedup comparisons; compare "Part of book or chapter of book" with "Article"	2024-02-15 11:53:46 +01:00
Giambattista Bloisi	02636e802c	SparkCreateSimRels: create dedup blocks from the complete queue of records matching cluster key instead of truncating the results; clean titles once before clustering and similarity comparisons; added support for filtered fields in model; added support for sorting List fields in model; added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions; added new maxLengthMatch comparator function; use reduced complexity Levenshtein with threshold in levensteinTitle; use reduced complexity AuthorsMatch with threshold early-quit; use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor; use new clusterings configuration in Dedup tests. SparkWhitelistSimRels: use left semi join for clarity and performance. SparkCreateMergeRels: use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX; refactored to use Windowing sorting rather than groupBy to reduce memory pressure; use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past; comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"; changed generation of ids of type dedup_wf_001 to avoid collisions. DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure.	2024-01-10 22:59:52 +01:00
Giambattista Bloisi	613ec5ffce	Add profiles for different spark versions: spark-24, spark-34, spark-35	2023-12-05 19:11:06 +01:00
Sandro La Bruzzo	8c3e9a09d3	added repository openaire-third-parties	2023-12-05 19:11:06 +01:00
Giambattista Bloisi	2fa78f6071	Changes required to build and run tests with Java 17	2023-12-05 19:11:06 +01:00
Giambattista Bloisi	c412dc162b	Fix bug in conversion from dedup json model to Spark Dataset of Rows: list of strings contained the json escaped representation of the value instead of the plain value; this caused instanceTypeMatch failures because of the leading and trailing double quotes	2023-10-02 11:34:51 +02:00
Giambattista Bloisi	3c47920c78	Use asScala to convert java List to Scala Sequence	2023-10-02 11:04:47 +02:00
Giambattista Bloisi	e239b81740	Fix defect #8997 : GenerateEventsJob is generating huge amounts of logs because broker entity similarity calculation consistently failed	2023-10-02 11:04:18 +02:00
Sandro La Bruzzo	76476cdfb6	Added maven repo for dependencies that are not in maven central	2023-09-20 10:33:14 +02:00
Claudio Atzori	bf35280ea6	code formatting	2023-08-29 11:11:00 +02:00
Giambattista Bloisi	e64c2854a3	Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface; JsonPath cache contention fixed by using a ConcurrentHashMap; Blacklist filtering performance improvement; minor performance improvements when evaluating similarity; sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)	2023-07-24 15:36:24 +02:00
Giambattista Bloisi	bb5b845e3c	Use scala.binary.version property to resolve scala maven dependencies; ensure consistent usage of maven properties; profile for compiling with scala 2.12 and Spark 3.4	2023-07-24 11:13:48 +02:00
Giambattista Bloisi	801da2fd4a	New sources formatted by maven plugin	2023-07-06 10:28:53 +02:00
Giambattista Bloisi	bd3fcf869a	rename dnet-pace-core into dhp-pace-core module and use it as dependency in other modules	2023-07-06 10:02:23 +02:00

22 Commits