Commit Graph

196 Commits (master)

Author SHA1 Message Date
Giambattista Bloisi 02636e802c SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
3 months ago
Claudio Atzori 11a1207f9c [graph cleaning] applying coar based vocabularies in bulk 5 months ago
Claudio Atzori 554551682d [raw graph] adopting the new COAR based vocabularies for the resource typing 7 months ago
Sandro La Bruzzo 76476cdfb6 Added maven repo for dependencies that are not in maven central 7 months ago
Giambattista Bloisi bb5b845e3c Use scala.binary.version property to resolve scala maven dependencies
Ensure consistent usage of maven properties
Profile for compiling with scala 2.12 and Spark 3.4
9 months ago
Giambattista Bloisi bd3fcf869a rename dnet-pace-core into dhp-pace-core module and use it as dependency in other modules 10 months ago
Claudio Atzori 744a61a030 depending on dhp-schema:3.17.1 11 months ago
Claudio Atzori e45777e7e1 [aggregator graph] added validation for URLs mapped from oaf:fulltext 11 months ago
Claudio Atzori 3b876d9327 depending on dhp-schemas v. 3.16.0 1 year ago
Miriam Baglioni 85e53fad00 [UsageCount] addition of usagecount for Projects and datasources. Extension of the action set created for the results with new entities for projects and datasources. Extension of the resource set and modification of the testing class 1 year ago
Claudio Atzori 3f90d159e3 code formatting 2 years ago
Alessia Bardi 982bcc1e35 test wrid pid and record identifier 2 years ago
Claudio Atzori 26e1badded added instance.url syntactical validation, avoid creating multiple duplicated URLs 2 years ago
Claudio Atzori ff6f789b6d code formatting 2 years ago
Claudio Atzori 27a91841e7 WIP: cleaning of subjects 2 years ago
Miriam Baglioni 438abdf96f [EOSC TAG] adding eosc interoperability guidelines in the specific element in the result. Removed from subjects. Removed also the deletion of EOSC Jupyter Notebook from subject since now the criteria are searched for in a different place 2 years ago
Claudio Atzori 9e12cb3c92 EOSC Services - removed field knowledgegraph; depending on the released schema module 2 years ago
Claudio Atzori f5f532d134 EOSC Services - ongoing update 2 years ago
Claudio Atzori c26222623f [maven-release-plugin] prepare for next development iteration 2 years ago
Claudio Atzori 86585a6b27 [maven-release-plugin] prepare release dhp-1.2.4 2 years ago
Claudio Atzori ad85d88eaf [maven-release-plugin] rollback the release of dhp-1.2.4 2 years ago
Claudio Atzori 598e11dfd7 [maven-release-plugin] prepare for next development iteration 2 years ago
Claudio Atzori db3d9877a5 [maven-release-plugin] prepare release dhp-1.2.4 2 years ago
Claudio Atzori f03dea4f49 allow to skip maven site 2 years ago
Claudio Atzori 3bba6d6e38 [maven-release-plugin] rollback the release of dhp-1.2.4 2 years ago
Claudio Atzori 2ac2d928bd [maven-release-plugin] prepare for next development iteration 2 years ago
Claudio Atzori 85bc722ff4 [maven-release-plugin] prepare release dhp-1.2.4 2 years ago
Claudio Atzori bc05b6168a [maven-release-plugin] rollback the release of dhp-1.2.4 2 years ago
Claudio Atzori 505420fd61 [maven-release-plugin] prepare for next development iteration 2 years ago
Claudio Atzori 66e718981e [maven-release-plugin] prepare release dhp-1.2.4 2 years ago
Claudio Atzori eca82e30c9 updated dhp-schema version 2 years ago
Claudio Atzori 61319b2e83 updated dhp-schema version; set entity-level dataInfo before & after merging the fields from the group of duplicates 2 years ago
miconis c959639bd5 dependency updated to the new pace-core version 2 years ago
Alessia Bardi 6158170334 testing delegated authority and bumped dep to schemas 2 years ago
Miriam Baglioni 9fd2ef468e [APC at the result level] changed dependency in external pom 2 years ago
Miriam Baglioni aae667e6b6 [APC at the result level] added the APC at the level of the result and modified test class 2 years ago
Miriam Baglioni 37784209c9 [dhp-schemas-] updated the version of dhp-schema to 2.10.27 for APC name and id modification 2 years ago
Claudio Atzori 4fc44edb71 depending on dhp-schemas:2.10.26 2 years ago
Claudio Atzori 44a937f4ed factored out entity grouping implementation, extended to consider results from delegated authorities rather than identical records from other sources 2 years ago
Miriam Baglioni a75fb8c47a [BipFinderInstanceLevel] change pom to align to the dhp-schema release 2.10.24 and refactoring 2 years ago
Miriam Baglioni 4d517ed9ec merging with branch beta 2 years ago
Claudio Atzori dbd6fa1d65 scalafmt: referencing the common definition files remotely makes it work when compiling the entire project as well as the individual submodules 2 years ago
Miriam Baglioni 4993666d73 [BipFinderInstanceLevel] changed creation of the instance to allow to enrich existing instances with same pid 2 years ago
Claudio Atzori 4f212652ca scalafmt: code formatting 2 years ago
Claudio Atzori 8d18500069 using dhp-schema:2.9.24 2 years ago
Miriam Baglioni 8905a39bf3 merging with branch beta 2 years ago
Sandro La Bruzzo 6110a2b984 reverted version 2 years ago
Sandro La Bruzzo 65ebe1019b updated wagon-ssh version 2 years ago
Sandro La Bruzzo 4542a2338b updated site configuration to deploy on website 2 years ago
Miriam Baglioni 9fae872181 [Graph Dump] changed to mirror the changes in the model 2 years ago