Commit Graph

39 Commits

Author SHA1 Message Date
Claudio Atzori 67e37f41fb Merge pull request 'blacklist filtering moved before the cleanup phase in order to have case sensitive regex' (#485) from dedup_blacklist_fix into beta
Reviewed-on: #485
2024-10-28 09:42:51 +01:00
Claudio Atzori d3764265d5 Merge pull request '[dedup] avoid NPEs in the countryInference dedup utility' (#475) from dedup_countryInference_NPE into beta
Reviewed-on: #475
2024-10-25 10:12:06 +02:00
Giambattista Bloisi 0e34b0ece1 Fix imports: point them from the main distribution packages 2024-10-23 14:01:52 +02:00
Giambattista Bloisi 56b05cde0b Revert the changes for IgnoreUndefined management in tree evaluation 2024-10-11 10:35:15 +02:00
Michele De Bonis 6df6b4583e blacklist filtering moved before the cleanup phase in order to have case sensitive regex 2024-09-16 14:04:59 +02:00
Claudio Atzori 75a11d0ba5 [dedup] avoid NPEs in the countryInference dedup utility 2024-07-25 16:34:32 +02:00
Claudio Atzori 83327239de fixed pom definitions, bumped dependency version for the dhp-schema module, removed unnecessary dependencies 2024-07-17 11:58:48 +02:00
Michele De Bonis 2a36ccb997 optimization of normalization stage in openorgs workflow, implementation of new comparators replacing older versions, openorgs configuration update, addition of inference flag in model definition, new test classes 2024-07-09 16:58:10 +02:00
Michele De Bonis a10e8d9f05 implementation of countryMatch and addition of workflow parameters 2024-06-28 16:46:52 +02:00
Sandro La Bruzzo db358ad0d2 code formatted 2024-05-02 15:25:57 +02:00
Sandro La Bruzzo 26bf8e763a merged from beta 2024-05-02 15:20:23 +02:00
Claudio Atzori 4355f64810 reverted to version 1.2.5-SNAPSHOT 2024-05-02 11:23:53 +02:00
Claudio Atzori 66680b8b9a refactoring of common utilities 2024-05-02 11:16:58 +02:00
Sandro La Bruzzo 0d628cd62b merged again from beta 2024-04-23 17:34:55 +02:00
Claudio Atzori c3053ef34d using version 1.2.5-beta for the release 2024-04-23 14:52:32 +02:00
Claudio Atzori b5bcab13ec using version 1.2.5-beta for the release 2024-04-23 14:36:39 +02:00
Claudio Atzori 425c9afc36 using version 1.2.5-beta for the release 2024-04-23 14:30:04 +02:00
Sandro La Bruzzo 073f320c6a Added module containing all the dependencies, useful for spark deploy on k8. 2024-04-22 11:32:31 +02:00
Claudio Atzori 0656ab2838 code formatting 2024-04-20 08:10:58 +02:00
Sandro La Bruzzo b72c3139e2 updated Ignore annotation that is deprecated to Disabled 2024-04-19 14:52:40 +02:00
Sandro La Bruzzo b84ad0c06e merged beta 2024-04-19 14:39:59 +02:00
Giambattista Bloisi 8ac167e420 Refinements to PR #404: refactoring the Oaf records merge utilities into dhp-common 2024-04-16 17:18:28 +02:00
Giambattista Bloisi 43b454399f - Bug fix in matchOrderedTokenAndAbbreviations algorithms where tokens with same initial character were always considered equal
- AuthorsMatch exploits the new matching strategy used for ORCID enhancements in #PR398: split author names into tokens, order the tokens, then check for matches of ordered full tokens or abbreviations
2024-04-15 18:19:29 +02:00
Giambattista Bloisi d65285da7f Promote "Research" to a jolly instanceType in dedup comparisons
Compare "Journal" and "Part of book or chapter of book" with "Article"
2024-02-15 12:11:04 +01:00
Giambattista Bloisi 29194472a7 Promote "Research" to a jolly instanceType in dedup comparisons
Compare Part of book or chapter of book with Article
2024-02-15 11:53:46 +01:00
Giambattista Bloisi 02636e802c SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
2024-01-10 22:59:52 +01:00
Giambattista Bloisi 613ec5ffce Add profiles for different spark versions: spark-24, spark-34, spark-35 2023-12-05 19:11:06 +01:00
Sandro La Bruzzo 8c3e9a09d3 added repository openaire-third-parties 2023-12-05 19:11:06 +01:00
Giambattista Bloisi 2fa78f6071 Changes required to build and run tests with Java 17 2023-12-05 19:11:06 +01:00
Giambattista Bloisi 326c9dc08c Changes in maven poms to build and test the project using Spark 3.4.x and scala 2.12 2023-12-05 19:11:06 +01:00
Giambattista Bloisi c412dc162b Fix bug in conversion from dedup json model to Spark Dataset of Rows: list of strings contained the json escaped representation of the value instead of the plain value; this caused instanceTypeMatch failures because of the leading and trailing double quotes 2023-10-02 11:34:51 +02:00
Giambattista Bloisi 3c47920c78 Use asScala to convert java List to Scala Sequence 2023-10-02 11:04:47 +02:00
Giambattista Bloisi e239b81740 Fix defect #8997: GenerateEventsJob is generating huge amounts of logs because broker entity similarity calculation consistently failed 2023-10-02 11:04:18 +02:00
Sandro La Bruzzo 76476cdfb6 Added maven repo for dependencies that are not in maven central 2023-09-20 10:33:14 +02:00
Claudio Atzori bf35280ea6 code formatting 2023-08-29 11:11:00 +02:00
Giambattista Bloisi e64c2854a3 Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)
2023-07-24 15:36:24 +02:00
Giambattista Bloisi bb5b845e3c Use scala.binary.version property to resolve scala maven dependencies
Ensure consistent usage of maven properties
Profile for compiling with scala 2.12 and Spark 3.4
2023-07-24 11:13:48 +02:00
Giambattista Bloisi 801da2fd4a New sources formatted by maven plugin 2023-07-06 10:28:53 +02:00
Giambattista Bloisi bd3fcf869a rename dnet-pace-core into dhp-pace-core module and use it as dependency in other modules 2023-07-06 10:02:23 +02:00
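
Note on commit 43b454399f: the message describes splitting author names into tokens, ordering the tokens, and then matching ordered full tokens or abbreviations, and it fixes a bug where tokens sharing an initial character were always considered equal. The following is a minimal Java sketch of that idea, not the project's code. The class name `OrderedTokenMatchSketch`, its method names, and the rule that an abbreviation is a single-character token are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the matchOrderedTokenAndAbbreviations implementation.
final class OrderedTokenMatchSketch {

    // Lowercase the name, split on anything that is not a letter or digit,
    // drop empty pieces, and sort the tokens alphabetically.
    static List<String> orderedTokens(String name) {
        List<String> tokens = new ArrayList<>();
        for (String t : name.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        Collections.sort(tokens);
        return tokens;
    }

    // Two tokens match when they are identical, or when exactly one of them is a
    // single-character initial (assumed definition of "abbreviation") that equals
    // the first character of the other. Two different full tokens that merely
    // share an initial, such as "john" and "jane", do NOT match.
    static boolean tokenMatches(String a, String b) {
        if (a.equals(b)) {
            return true;
        }
        boolean aIsInitial = a.length() == 1;
        boolean bIsInitial = b.length() == 1;
        return (aIsInitial != bIsInitial) && a.charAt(0) == b.charAt(0);
    }

    // Names match when they have the same number of tokens and every pair of
    // tokens at the same sorted position matches.
    static boolean sameAuthor(String left, String right) {
        List<String> l = orderedTokens(left);
        List<String> r = orderedTokens(right);
        if (l.isEmpty() || l.size() != r.size()) {
            return false;
        }
        for (int i = 0; i < l.size(); i++) {
            if (!tokenMatches(l.get(i), r.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameAuthor("John Smith", "Smith, J."));  // prints "true"
        System.out.println(sameAuthor("John Smith", "Jane Smith")); // prints "false"
    }
}
```

Sorting the tokens makes the comparison independent of name order ("John Smith" versus "Smith, J."). Restricting abbreviation matches to single-character tokens is one way to avoid the behaviour the commit message reports as a bug; the actual fix in the repository may differ.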