Commit Graph

4854 Commits

Author SHA1 Message Date
Giambattista Bloisi 3c66e3bd7b Create dedup record for "merged" pivots
Do not create dedup records for group that have more than 20 different acceptance date
2024-01-10 22:59:52 +01:00
Giambattista Bloisi 10e135db1e Use dedup_wf_002 in place of dedup_wf_001 to make explicit a different algorithm has been used to generate those kind of ids 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 831cc1fdde Generate "merged" dedup id relations also for records that are filtered out by the cut parameters 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 1287315ffb Do no longer use dedupId information from pivotHistory Database 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 02636e802c SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use new connected component algorithm that converge faster than Spark GraphX provided algorithm
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
2024-01-10 22:59:52 +01:00
Claudio Atzori 16d858fbf0 Merge pull request 'enrichmentSingleStep' (#373) from enrichmentSingleStep into beta
Reviewed-on: #373
2024-01-10 16:58:49 +01:00
Miriam Baglioni e711a05229 fixed conflicts 2024-01-10 11:03:42 +01:00
Miriam Baglioni 71d6f30711 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2024-01-10 10:59:58 +01:00
Miriam Baglioni cb14470ba6 added properties file in the forlder for the workflow of result to organization from inst repo propagation. Changes the path in the classes implementing the propagation 2023-12-22 14:50:05 +01:00
Miriam Baglioni 9f966b59d4 added properties file in the forlder for the workflow of result to community from semrel propagation. Changes the path in the classes implementing the propagation 2023-12-22 14:11:47 +01:00
Miriam Baglioni 2f3b5a133d added properties file in the forlder for the workflow of result to community from organization propagation. Changes the path in the classes implementing the propagation 2023-12-22 13:56:40 +01:00
Miriam Baglioni 2f7b9ad815 added properties file in the forlder for the workflow of project to result propagation. Changes the path in the classes implementing the propagation 2023-12-22 11:46:15 +01:00
Miriam Baglioni f2352e8a78 changed in the classes the path for the property files for the propagation of community from project 2023-12-22 11:43:34 +01:00
Miriam Baglioni 009730b3d1 added properties file in the forlder for the workflow of orcid propagation. Changes the path in the classes implementing the propagationchanged the path to the parameter file in the class for entitytoorganization propagation 2023-12-22 11:42:09 +01:00
Miriam Baglioni 89f269c7f4 changed the path to the parameter file in the class for entitytoorganization propagation 2023-12-22 11:37:50 +01:00
Miriam Baglioni b06aea0adf adding the bulkTag parameter file in the folder for the oozie workflow for bulkTagging. Changes the path in the class 2023-12-22 11:35:37 +01:00
Miriam Baglioni 3afd4aa57b adjustments for country propagation 2023-12-22 11:27:30 +01:00
Claudio Atzori 62104790ae added metaresourcetype to the result hive DB view 2023-12-21 12:27:10 +01:00
Miriam Baglioni 5011c4d11a refactoring after compiletion 2023-12-20 15:57:26 +01:00
Miriam Baglioni 4740c808f7 - 2023-12-20 14:26:54 +01:00
Miriam Baglioni d410ea8a41 added needed parameter 2023-12-19 12:15:01 +01:00
Sandro La Bruzzo 15fd93a2b6 uploaded input parameters on CreateBaseline WF 2023-12-18 12:21:55 +01:00
Sandro La Bruzzo 9d342a47da updated the transformation Baseline workflow to include mdstore rollback/commit action 2023-12-18 11:48:57 +01:00
Miriam Baglioni 3eca5d2e1c - 2023-12-18 09:55:27 +01:00
Miriam Baglioni 01ce0b9c76 [doiboost - preprocess] remove transition to orcid preparation from sequence of steps at the beginning of the workflow 2023-12-15 12:24:55 +01:00
Miriam Baglioni 0d8e496a63 - 2023-12-15 12:16:43 +01:00
Claudio Atzori a59be5779e Merge pull request '9078_xml_records_irish_tender' (#368) from 9078_xml_records_irish_tender into beta
Reviewed-on: #368
2023-12-12 12:34:43 +01:00
Claudio Atzori ff924215b8 [graph provision] added tests for new peerreviewed field 2023-12-12 11:21:30 +01:00
Claudio Atzori a6d635e695 Merge branch 'beta' into 9078_xml_records_irish_tender 2023-12-12 11:06:42 +01:00
Claudio Atzori 98cce5bfb2 code formatting 2023-12-12 09:59:05 +01:00
Claudio Atzori 84d54643cf [cleaning] allow enriched orcids to pass the cleaning, rule out non-orcid author pids 2023-12-12 09:57:00 +01:00
Claudio Atzori 7e8eff40c1 [graph provision] added tests for the new model fields 2023-12-12 08:54:15 +01:00
Miriam Baglioni 8752d275fa removed not needed parameter 2023-12-09 15:24:45 +01:00
Miriam Baglioni d4eedada71 adjusting workflow definition 2023-12-09 15:20:11 +01:00
Claudio Atzori aba95ed1d1 code formatting 2023-12-08 17:06:19 +01:00
Claudio Atzori 2877839df0 Merge pull request '[graph cleaning] added cleaning for result.publisher and result.instance.license' (#366) from clean_license_publisher into beta
Reviewed-on: #366
2023-12-08 16:58:37 +01:00
Claudio Atzori 34abd0fc43 Merge branch 'beta' into clean_license_publisher 2023-12-08 16:58:27 +01:00
Claudio Atzori cb71a7936b [graph cleaning] avoid stack overflow error when navigating Oaf objects declaring an Enum 2023-12-07 23:09:54 +01:00
Claudio Atzori 70eb1796b2 logging typo 2023-12-07 14:08:04 +01:00
Claudio Atzori c381bacee0 [enrichment] passing the community API base URL 2023-12-07 14:07:11 +01:00
Miriam Baglioni 336fb31d87 [community_result_propagation] adjusting starting poit of workflow 2023-12-07 10:27:25 +01:00
Miriam Baglioni c0cde53bf6 [bulktagging] setting first step of bulktaggin as the copy of the entities and relations not involved in the tagging' 2023-12-07 10:08:35 +01:00
Miriam Baglioni 616622d2bb first version of the workflow single step 2023-12-07 09:59:52 +01:00
Claudio Atzori 259c69e446 [orcid enrichment] fixed workflow definition 2023-12-06 19:41:53 +01:00
Claudio Atzori 431c6bb08a [dedup] added isLookupUrl to the graph consistency workflow definition, required now by the entity grouping phase 2023-12-06 11:06:46 +01:00
Claudio Atzori 982c0c110b Merge pull request '[graph provision] added serialization for the new fields imported from the stats DB' (#365) from 9078_xml_records_irish_tender into beta
Reviewed-on: #365
2023-12-05 16:39:44 +01:00
Claudio Atzori 321922772b added serialization for the new fields imported for the Irish tender 2023-12-05 16:37:04 +01:00
Claudio Atzori c5b7253130 [community_organization propagation] fixed workflow parameters 2023-12-05 09:13:33 +01:00
Claudio Atzori 3c3bdb8318 [bulktagging] fixed workflow parameters 2023-12-05 09:08:48 +01:00
Claudio Atzori 7c3041b276 avoid NPEs 2023-12-03 16:49:49 +01:00