Commit Graph

196 Commits

Author SHA1 Message Date
Giambattista Bloisi 02636e802c SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests
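
The incremental connected component item above can be illustrated with a minimal Python sketch. It is not the project's code: `DisjointSet`, `process_block`, and `same_lowercase_title` are hypothetical names, and a union-find structure is assumed here as one possible way to track components. The idea shown is that two records already known to be in the same component do not need to be compared again.

```python
from itertools import combinations


class DisjointSet:
    """Union-find structure that tracks connected components incrementally."""

    def __init__(self, size):
        self.parent = list(range(size))

    def find(self, item):
        # Walk up to the root, halving the path along the way.
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def process_block(records, is_similar):
    """Return the component root of each record and the number of comparisons made."""
    components = DisjointSet(len(records))
    comparisons = 0
    for i, j in combinations(range(len(records)), 2):
        # Skip the (possibly expensive) comparison when both records
        # already belong to the same component.
        if components.find(i) == components.find(j):
            continue
        comparisons += 1
        if is_similar(records[i], records[j]):
            components.union(i, j)
    return [components.find(i) for i in range(len(records))], comparisons


def same_lowercase_title(a, b):
    # Hypothetical stand-in for a real similarity function.
    return a.lower() == b.lower()


block = ["Deep Learning", "deep learning", "DEEP LEARNING", "Graph Theory"]
roots, comparisons = process_block(block, same_lowercase_title)
```

In this toy block, one of the six possible pairs is skipped because both records were already linked through a third one. The resulting components are the same as with exhaustive comparison, although fewer pairwise relations are produced.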

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions
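
The effect of the "tomorrow" filler mentioned above can be sketched in Python. This is an illustration only, under the assumption that the pivot comparator prefers the earliest date. `parse_or_tomorrow`, `pick_pivot`, the record layout, and the tie-break on `id` are all hypothetical.

```python
from datetime import date, timedelta


def parse_or_tomorrow(raw, today):
    """Parse an ISO date; use the day after `today` when it is missing or invalid."""
    try:
        return date.fromisoformat(raw)
    except (TypeError, ValueError):
        # Missing (None) or malformed dates sort after every real past date.
        return today + timedelta(days=1)


def pick_pivot(records, today):
    """Pick the record with the earliest usable date (assumed ordering), then lowest id."""
    return min(
        records,
        key=lambda r: (parse_or_tomorrow(r.get("date"), today), r["id"]),
    )


records = [
    {"id": "a", "date": None},
    {"id": "b", "date": "2015-06-01"},
    {"id": "c", "date": "not-a-date"},
    {"id": "d", "date": "2003-02-11"},
]
today = date(2024, 1, 10)
pivot = pick_pivot(records, today)
```

With a fixed filler such as "2000-01-01", record `a` would sort before every record dated after 2000 and could be picked as pivot despite having no date. With the "tomorrow" filler, records with a real date are preferred.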

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
2024-01-10 22:59:52 +01:00
Claudio Atzori 11a1207f9c [graph cleaning] applying coar based vocabularies in bulk 2023-11-22 12:22:14 +01:00
Claudio Atzori 554551682d [raw graph] adopting the new COAR based vocabularies for the resource typing 2023-10-11 16:09:19 +02:00
Sandro La Bruzzo 76476cdfb6 Added maven repo for dependencies that are not in maven central 2023-09-20 10:33:14 +02:00
Giambattista Bloisi bb5b845e3c Use scala.binary.version property to resolve scala maven dependencies
Ensure consistent usage of maven properties
Profile for compiling with scala 2.12 and Spark 3.4
2023-07-24 11:13:48 +02:00
Giambattista Bloisi bd3fcf869a rename dnet-pace-core into dhp-pace-core module and use it as dependency in other modules 2023-07-06 10:02:23 +02:00
Claudio Atzori 744a61a030 depending on dhp-schema:3.17.1 2023-06-12 13:49:44 +02:00
Claudio Atzori e45777e7e1 [aggregator graph] added validation for URLs mapped from oaf:fulltext 2023-05-26 11:33:42 +02:00
Claudio Atzori 3b876d9327 depending on dhp-schemas v. 3.16.0 2023-02-22 10:15:10 +01:00
Miriam Baglioni 85e53fad00 [UsageCount] addition of usagecount for Projects and datasources. Extension of the action set created for the results with new entities for projects and datasources. Extension of the resource set and modification of the testing class 2023-02-09 18:59:45 +01:00
Claudio Atzori 3f90d159e3 code formatting 2022-09-27 15:08:00 +02:00
Alessia Bardi 982bcc1e35 test wrid pid and record identifier 2022-09-23 12:06:06 +02:00
Claudio Atzori 26e1badded added instance.url syntactical validation, avoid creating multiple duplicated URLs 2022-09-19 11:19:10 +02:00
Claudio Atzori ff6f789b6d code formatting 2022-09-09 15:16:31 +02:00
Claudio Atzori 27a91841e7 WIP: cleaning of subjects 2022-08-04 11:39:39 +02:00
Miriam Baglioni 438abdf96f [EOSC TAG] adding eosc interoperability guidelines in the specific element in the result. Removed from subjects. Removed also the deletion of EOSC Jupyter Notebook from subject since now the criteria are searched for in a different place 2022-07-20 18:07:54 +02:00
Claudio Atzori 9e12cb3c92 EOSC Services - removed field knowledgegraph; depending on the released schema module 2022-05-03 11:55:45 +02:00
Claudio Atzori f5f532d134 EOSC Services - ongoing update 2022-04-29 12:25:24 +02:00
Claudio Atzori c26222623f [maven-release-plugin] prepare for next development iteration 2022-04-07 13:32:22 +02:00
Claudio Atzori 86585a6b27 [maven-release-plugin] prepare release dhp-1.2.4 2022-04-07 13:32:19 +02:00
Claudio Atzori ad85d88eaf [maven-release-plugin] rollback the release of dhp-1.2.4 2022-04-07 13:28:35 +02:00
Claudio Atzori 598e11dfd7 [maven-release-plugin] prepare for next development iteration 2022-04-07 13:27:02 +02:00
Claudio Atzori db3d9877a5 [maven-release-plugin] prepare release dhp-1.2.4 2022-04-07 13:26:58 +02:00
Claudio Atzori f03dea4f49 allow to skip maven site 2022-04-07 13:22:55 +02:00
Claudio Atzori 3bba6d6e38 [maven-release-plugin] rollback the release of dhp-1.2.4 2022-04-07 12:23:17 +02:00
Claudio Atzori 2ac2d928bd [maven-release-plugin] prepare for next development iteration 2022-04-07 12:18:47 +02:00
Claudio Atzori 85bc722ff4 [maven-release-plugin] prepare release dhp-1.2.4 2022-04-07 12:18:43 +02:00
Claudio Atzori bc05b6168a [maven-release-plugin] rollback the release of dhp-1.2.4 2022-04-07 11:49:06 +02:00
Claudio Atzori 505420fd61 [maven-release-plugin] prepare for next development iteration 2022-04-07 11:34:06 +02:00
Claudio Atzori 66e718981e [maven-release-plugin] prepare release dhp-1.2.4 2022-04-07 11:34:02 +02:00
Claudio Atzori eca82e30c9 updated dhp-schema version 2022-03-29 09:46:49 +02:00
Claudio Atzori 61319b2e83 updated dhp-schema version; set entity-level dataInfo before & after merging the fields from the group of duplicates 2022-03-25 16:38:33 +01:00
miconis c959639bd5 dependency updated to the new pace-core version 2022-03-15 16:33:03 +01:00
Alessia Bardi 6158170334 testing delegated authority and bumped dep to schemas 2022-02-11 18:05:18 +01:00
Miriam Baglioni 9fd2ef468e [APC at the result level] changed dependency in external pom 2022-02-04 16:40:32 +01:00
Miriam Baglioni aae667e6b6 [APC at the result level] added the APC at the level of the result and modified test class 2022-02-04 12:34:25 +01:00
Miriam Baglioni 37784209c9 [dhp-schemas-] updated the version of dhp-schema to 2.10.27 for APC name and id modification 2022-02-02 12:46:31 +01:00
Claudio Atzori 4fc44edb71 depending on dhp-schemas:2.10.26 2022-01-27 16:03:57 +01:00
Claudio Atzori 44a937f4ed factored out entity grouping implementation, extended to consider results from delegated authorities rather than identical records from other sources 2022-01-19 12:24:52 +01:00
Miriam Baglioni a75fb8c47a [BipFinderInstanceLevel] change pom to align to the dhp-schema release 2.10.24 and refactoring 2022-01-12 18:06:26 +01:00
Miriam Baglioni 4d517ed9ec merging with branch beta 2022-01-12 17:29:37 +01:00
Claudio Atzori dbd6fa1d65 scalafmt: remote referencing the common definition files makes it work compiling the entire project as well as the individual submodules 2022-01-12 17:19:38 +01:00
Miriam Baglioni 4993666d73 [BipFinderInstanceLevel] changed creation of the instance to allow to enrich existing instances with same pid 2022-01-12 16:53:47 +01:00
Claudio Atzori 4f212652ca scalafmt: code formatting 2022-01-11 16:57:48 +01:00
Claudio Atzori 8d18500069 using dhp-schema:2.9.24 2021-12-22 12:47:21 +01:00
Miriam Baglioni 8905a39bf3 merging with branch beta 2021-12-02 13:17:29 +01:00
Sandro La Bruzzo 6110a2b984 reverted version 2021-11-19 15:31:45 +01:00
Sandro La Bruzzo 65ebe1019b updated wagon-ssh version 2021-11-19 14:59:04 +01:00
Sandro La Bruzzo 4542a2338b updated site configuration to deploy on website 2021-11-19 13:44:08 +01:00
Miriam Baglioni 9fae872181 [Graph Dump] changed to mirror the changes in the model 2021-11-19 11:25:50 +01:00