Commit Graph

4873 Commits

Author SHA1 Message Date
Lampros Smyrnaios ff47a941f5 - Add the "installProject.sh" script.
- Show the Job-ID or potential deployment-error-logs, right after the deployment of the workflow.
- Code polishing.
2024-01-18 18:06:50 +02:00
Lampros Smyrnaios 00644ef487 - Fix the "NoSuchFieldError", caused by library-conflicts, by introducing the "oozie.libpath" property in "workflow.xml".
- Fix the value of the "outputPath" property, in "workflow.xml".
2024-01-18 15:46:27 +02:00
Lampros Smyrnaios 23ec57c670 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into continuous_validation2 2024-01-17 18:16:11 +02:00
Lampros Smyrnaios c17834dddf - Use "KryoSerializer" for Spark and register some "result" classes.
- Code polishing and cleanup.
2024-01-17 18:08:05 +02:00
Claudio Atzori 2d302e6827 Merge pull request '[FoS integration]fix issue on FoS integration. Removing the null values from FoS' (#375) from fosPreparationBeta into beta
Reviewed-on: #375
2024-01-12 10:27:28 +01:00
Miriam Baglioni f612125939 fix issue on FoS integration. Removing the null values from FoS 2024-01-12 10:20:28 +01:00
Claudio Atzori c67467723b Merge pull request 'refined mapping for the extraction of the original resource type' (#374) from resource_types into beta
Reviewed-on: #374
2024-01-11 16:29:47 +01:00
Claudio Atzori cb9e739484 Merge branch 'beta' into resource_types 2024-01-11 16:29:41 +01:00
Claudio Atzori 2753044d13 refined mapping for the extraction of the original resource type 2024-01-11 16:28:26 +01:00
Giambattista Bloisi a88dce5bf3 Merge pull request 'Improvements and refactoring in Dedup' (#367) from dedup_increasenumofblocks into beta
Reviewed-on: #367
2024-01-11 11:24:06 +01:00
Giambattista Bloisi 3c66e3bd7b Create dedup record for "merged" pivots
Do not create dedup records for groups that have more than 20 different acceptance dates
2024-01-10 22:59:52 +01:00
Giambattista Bloisi 10e135db1e Use dedup_wf_002 in place of dedup_wf_001 to make explicit that a different algorithm has been used to generate this kind of id 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 831cc1fdde Generate "merged" dedup id relations also for records that are filtered out by the cut parameters 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 1287315ffb No longer use dedupId information from pivotHistory Database 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 02636e802c SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for a missing or incorrect date instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
2024-01-10 22:59:52 +01:00
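
The "Levenshtein with threshold" and "threshold early-quit" items in the commit above refer to a common optimization: stop computing an edit distance as soon as it is certain to exceed a limit. The following is a minimal sketch of that general idea, not the code from this commit. The function name `bounded_levenshtein` and the convention of returning -1 when the limit is exceeded are assumptions made for illustration.

```python
def bounded_levenshtein(a: str, b: str, threshold: int) -> int:
    """Return the edit distance between a and b, or -1 if it exceeds threshold.

    Hypothetical helper for illustration; not taken from the repository.
    """
    if threshold < 0:
        raise ValueError("threshold must be non-negative")

    # The difference in length is a lower bound on the edit distance,
    # so some pairs can be rejected without filling any table.
    if abs(len(a) - len(b)) > threshold:
        return -1

    # Keep the shorter string on the row axis to minimise memory use.
    if len(a) > len(b):
        a, b = b, a

    previous = list(range(len(a) + 1))
    for i, char_b in enumerate(b, start=1):
        current = [i]
        for j, char_a in enumerate(a, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        # The minimum of a row never decreases in later rows, so once every
        # cell is above the threshold the final distance must be as well.
        if min(current) > threshold:
            return -1
        previous = current

    distance = previous[-1]
    return distance if distance <= threshold else -1
```

With this sketch, `bounded_levenshtein("kitten", "sitting", 3)` returns 3, while the same pair with a threshold of 2 returns -1. A banded variant that only fills cells within `threshold` of the diagonal would cut the work further; the version here only quits early.
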
Claudio Atzori 16d858fbf0 Merge pull request 'enrichmentSingleStep' (#373) from enrichmentSingleStep into beta
Reviewed-on: #373
2024-01-10 16:58:49 +01:00
Miriam Baglioni e711a05229 fixed conflicts 2024-01-10 11:03:42 +01:00
Miriam Baglioni 71d6f30711 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2024-01-10 10:59:58 +01:00
Lampros Smyrnaios 32e02247bc Merge branch 'continuous_validation2' of https://code-repo.d4science.org/lsmyrnaios/dnet-hadoop into continuous_validation2 2024-01-09 17:05:07 +02:00
Lampros Smyrnaios eaa070f1e6 Code cleanup. 2024-01-09 17:03:35 +02:00
Claudio Atzori fc35b44e22 bumped version of uoa-validator-engine2 to 0.9.3 2024-01-09 15:56:56 +01:00
Miriam Baglioni cb14470ba6 added properties file in the folder for the workflow of result to organization from inst repo propagation. Changes the path in the classes implementing the propagation 2023-12-22 14:50:05 +01:00
Miriam Baglioni 9f966b59d4 added properties file in the folder for the workflow of result to community from semrel propagation. Changes the path in the classes implementing the propagation 2023-12-22 14:11:47 +01:00
Miriam Baglioni 2f3b5a133d added properties file in the folder for the workflow of result to community from organization propagation. Changes the path in the classes implementing the propagation 2023-12-22 13:56:40 +01:00
Miriam Baglioni 2f7b9ad815 added properties file in the folder for the workflow of project to result propagation. Changes the path in the classes implementing the propagation 2023-12-22 11:46:15 +01:00
Miriam Baglioni f2352e8a78 changed in the classes the path for the property files for the propagation of community from project 2023-12-22 11:43:34 +01:00
Miriam Baglioni 009730b3d1 added properties file in the folder for the workflow of orcid propagation. Changes the path in the classes implementing the propagation. Changed the path to the parameter file in the class for entitytoorganization propagation 2023-12-22 11:42:09 +01:00
Miriam Baglioni 89f269c7f4 changed the path to the parameter file in the class for entitytoorganization propagation 2023-12-22 11:37:50 +01:00
Miriam Baglioni b06aea0adf adding the bulkTag parameter file in the folder for the oozie workflow for bulkTagging. Changes the path in the class 2023-12-22 11:35:37 +01:00
Miriam Baglioni 3afd4aa57b adjustments for country propagation 2023-12-22 11:27:30 +01:00
Claudio Atzori 62104790ae added metaresourcetype to the result hive DB view 2023-12-21 12:27:10 +01:00
Miriam Baglioni 5011c4d11a refactoring after compilation 2023-12-20 15:57:26 +01:00
Miriam Baglioni 4740c808f7 - 2023-12-20 14:26:54 +01:00
Lampros Smyrnaios 17282ea8fc - Fix the "is not NULL" checks inside "spark.filter()"
- Make sure the "outputPath" ends with a "/", in any case.
- Fix a parameter-description.
2023-12-20 15:15:56 +02:00
Miriam Baglioni d410ea8a41 added needed parameter 2023-12-19 12:15:01 +01:00
Claudio Atzori 22fa60c3dd removed lib binary 2023-12-18 15:50:29 +01:00
Claudio Atzori 24173d7a0b continuous validation WIP 2023-12-18 15:46:36 +01:00
Sandro La Bruzzo 15fd93a2b6 uploaded input parameters on CreateBaseline WF 2023-12-18 12:21:55 +01:00
Sandro La Bruzzo 9d342a47da updated the transformation Baseline workflow to include mdstore rollback/commit action 2023-12-18 11:48:57 +01:00
Lampros Smyrnaios a2feda6c07 - Fix acquiring the "openaire_guidelines" parameter.
- Use the right Guidelines-profile, depending on the "openaire_guidelines" version.
- Update log-levels.
- Optimize imports.
2023-12-18 10:57:40 +02:00
Miriam Baglioni 3eca5d2e1c - 2023-12-18 09:55:27 +01:00
Lampros Smyrnaios b71633fd7f - Fix the location of the "input_continuous_validator_parameters.json" file.
- Fix handling the "isSparkSessionManaged" parameter.
- Add the "provided" scope for some dependencies. They do not inherit it from the main pom, since the "version" tag is declared, even though the value is the same as the one from the main pom.
- Code polishing / cleanup.
2023-12-15 18:29:38 +02:00
Lampros Smyrnaios 9e6a03e4e2 Initial commit of the "dhp-continuous-validation" module. 2023-12-15 15:53:31 +02:00
Miriam Baglioni 01ce0b9c76 [doiboost - preprocess] remove transition to orcid preparation from sequence of steps at the beginning of the workflow 2023-12-15 12:24:55 +01:00
Miriam Baglioni 0d8e496a63 - 2023-12-15 12:16:43 +01:00
Claudio Atzori a59be5779e Merge pull request '9078_xml_records_irish_tender' (#368) from 9078_xml_records_irish_tender into beta
Reviewed-on: #368
2023-12-12 12:34:43 +01:00
Claudio Atzori ff924215b8 [graph provision] added tests for new peerreviewed field 2023-12-12 11:21:30 +01:00
Claudio Atzori a6d635e695 Merge branch 'beta' into 9078_xml_records_irish_tender 2023-12-12 11:06:42 +01:00
Claudio Atzori 98cce5bfb2 code formatting 2023-12-12 09:59:05 +01:00
Claudio Atzori 84d54643cf [cleaning] allow enriched orcids to pass the cleaning, rule out non-orcid author pids 2023-12-12 09:57:00 +01:00