Commit Graph

4885 Commits

Author SHA1 Message Date
Claudio Atzori 9e8fc6aa88 [collection] increased logging from the oai-pmh metadata collection process 2024-01-26 09:17:20 +01:00
Antonis Lempesis a7115cfa9e max mem of joins (hive.mapjoin.followby.gby.localtask.max.memory.usage) now 80%, up from 55%. 2024-01-25 15:13:16 +01:00
Claudio Atzori 2838a9b630 Update 'CONTRIBUTING.md' 2024-01-24 16:07:05 +01:00
Claudio Atzori da944a5c55 Merge pull request 'code of conduct and contributing' (#382) from contributing into beta
Reviewed-on: D-Net/dnet-hadoop#382
2024-01-24 15:40:26 +01:00
Claudio Atzori 0c97a3a81a minor 2024-01-24 10:56:33 +01:00
Claudio Atzori 2c1e6849f0 added code of conduct and contributing files 2024-01-24 10:36:41 +01:00
Claudio Atzori 9b13c22e5d [graph provision] retrieve all the context information by adding all=true to the requests issued to the API 2024-01-23 15:36:08 +01:00
Claudio Atzori 3e96777cc4 [collection] increased logging from the oai-pmh metadata collection process 2024-01-23 15:21:03 +01:00
Claudio Atzori 9812406589 Merge pull request '[graph provision] updated param specification for the XML converter job' (#380) from provision_community_api into beta
Reviewed-on: D-Net/dnet-hadoop#380
2024-01-23 08:55:59 +01:00
Claudio Atzori f87f3a6483 [graph provision] updated param specification for the XML converter job 2024-01-23 08:54:37 +01:00
Claudio Atzori 6fd25cf549 code formatting 2024-01-23 08:47:12 +01:00
Claudio Atzori bd187ec6e7 Merge pull request 'Implements pivots table update oozie workflow' (#376) from update_pivots_table into beta
Reviewed-on: D-Net/dnet-hadoop#376
2024-01-22 16:37:30 +01:00
Claudio Atzori f76852f385 Merge branch 'beta' into update_pivots_table 2024-01-22 16:37:22 +01:00
Claudio Atzori b9fcc5ad5e Merge pull request 'Context API update' (#379) from provision_community_api into beta
Reviewed-on: D-Net/dnet-hadoop#379
2024-01-22 15:55:33 +01:00
Claudio Atzori 1c6db320f4 [graph provision] obtain context info from the context API instead from the ISLookUp service 2024-01-22 15:53:17 +01:00
Claudio Atzori 2655eea5bc [orcid enrichment] drop paths before copying the non-modified contents 2024-01-19 16:28:05 +01:00
Claudio Atzori c6b3401596 increased shuffle partitions for publications in the country propagation workflow 2024-01-19 10:15:39 +01:00
Miriam Baglioni bcc0a13981 [enrichment single step] adding <end> element in wf definition 2024-01-18 17:39:14 +01:00
Miriam Baglioni 6af536541d [enrichment single step] moving parameter file in correct location 2024-01-18 15:35:40 +01:00
Miriam Baglioni a12a3eb143 - 2024-01-18 15:18:10 +01:00
Claudio Atzori 628fdfb5eb Merge pull request '[enrichment single step]' (#378) from enrichmentSingleStepFixed into beta
Reviewed-on: D-Net/dnet-hadoop#378
2024-01-18 09:41:09 +01:00
Miriam Baglioni 82e9e262ee [enrichment single step] remove parameter from execution 2024-01-17 17:38:03 +01:00
Miriam Baglioni 67ce2d54be [enrichment single step] refactoring to fix issues in disappeared result type 2024-01-17 16:50:00 +01:00
Miriam Baglioni 59eaccbd87 [enrichment single step] refactoring to fix issue in disappeared result type 2024-01-15 17:49:54 +01:00
Giambattista Bloisi 21a14fcd80 Reusable RunSQLSparkJob for executing SQL in Spark through Oozie Spark Actions
Implements pivots table update oozie workflow
2024-01-15 10:18:14 +01:00
Claudio Atzori 2d302e6827 Merge pull request '[FoS integration]fix issue on FoS integration. Removing the null values from FoS' (#375) from fosPreparationBeta into beta
Reviewed-on: D-Net/dnet-hadoop#375
2024-01-12 10:27:28 +01:00
Miriam Baglioni f612125939 fix issue on FoS integration. Removing the null values from FoS 2024-01-12 10:20:28 +01:00
Claudio Atzori c67467723b Merge pull request 'refined mapping for the extraction of the original resource type' (#374) from resource_types into beta
Reviewed-on: D-Net/dnet-hadoop#374
2024-01-11 16:29:47 +01:00
Claudio Atzori cb9e739484 Merge branch 'beta' into resource_types 2024-01-11 16:29:41 +01:00
Claudio Atzori 2753044d13 refined mapping for the extraction of the original resource type 2024-01-11 16:28:26 +01:00
Giambattista Bloisi a88dce5bf3 Merge pull request 'Improvements and refactoring in Dedup' (#367) from dedup_increasenumofblocks into beta
Reviewed-on: D-Net/dnet-hadoop#367
2024-01-11 11:24:06 +01:00
Giambattista Bloisi 3c66e3bd7b Create dedup record for "merged" pivots
Do not create dedup records for groups that have more than 20 different acceptance dates
2024-01-10 22:59:52 +01:00
Giambattista Bloisi 10e135db1e Use dedup_wf_002 in place of dedup_wf_001 to make explicit that a different algorithm has been used to generate those kinds of ids 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 831cc1fdde Generate "merged" dedup id relations also for records that are filtered out by the cut parameters 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 1287315ffb No longer use dedupId information from pivotHistory Database 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 02636e802c SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect dates instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
2024-01-10 22:59:52 +01:00
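The "reduced complexity Levenshtein with threshold" item in the commit above can be illustrated with a minimal sketch. This is not the code from the repository: the function name `bounded_levenshtein`, the parameter `max_dist`, and the `-1` return convention are assumptions made for illustration only. The idea shown is that once every cell in a row of the dynamic-programming table exceeds the threshold, the final distance must exceed it too, so the comparison can stop early.

```python
def bounded_levenshtein(a: str, b: str, max_dist: int) -> int:
    """Return the edit distance between a and b, or -1 if it exceeds max_dist.

    Hypothetical helper for illustration; not the project's levensteinTitle.
    """
    # The length gap is a lower bound on the distance: bail out immediately.
    if abs(len(a) - len(b)) > max_dist:
        return -1

    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        # Row minima never decrease, so the result can no longer be <= max_dist.
        if min(current) > max_dist:
            return -1
        previous = current

    return previous[-1] if previous[-1] <= max_dist else -1
```

With a small threshold, most non-matching title pairs are rejected after a few rows instead of filling the whole table, which is where the saving comes from in a blocking-based deduplication step.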
Claudio Atzori 16d858fbf0 Merge pull request 'enrichmentSingleStep' (#373) from enrichmentSingleStep into beta
Reviewed-on: D-Net/dnet-hadoop#373
2024-01-10 16:58:49 +01:00
Miriam Baglioni e711a05229 fixed conflicts 2024-01-10 11:03:42 +01:00
Miriam Baglioni 71d6f30711 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2024-01-10 10:59:58 +01:00
Miriam Baglioni cb14470ba6 added properties file in the folder for the workflow of result to organization from inst repo propagation. Changes the path in the classes implementing the propagation 2023-12-22 14:50:05 +01:00
Miriam Baglioni 9f966b59d4 added properties file in the folder for the workflow of result to community from semrel propagation. Changes the path in the classes implementing the propagation 2023-12-22 14:11:47 +01:00
Miriam Baglioni 2f3b5a133d added properties file in the folder for the workflow of result to community from organization propagation. Changes the path in the classes implementing the propagation 2023-12-22 13:56:40 +01:00
Miriam Baglioni 2f7b9ad815 added properties file in the folder for the workflow of project to result propagation. Changes the path in the classes implementing the propagation 2023-12-22 11:46:15 +01:00
Miriam Baglioni f2352e8a78 changed in the classes the path for the property files for the propagation of community from project 2023-12-22 11:43:34 +01:00
Miriam Baglioni 009730b3d1 added properties file in the folder for the workflow of orcid propagation. Changes the path in the classes implementing the propagation. Changed the path to the parameter file in the class for entitytoorganization propagation 2023-12-22 11:42:09 +01:00
Miriam Baglioni 89f269c7f4 changed the path to the parameter file in the class for entitytoorganization propagation 2023-12-22 11:37:50 +01:00
Miriam Baglioni b06aea0adf adding the bulkTag parameter file in the folder for the oozie workflow for bulkTagging. Changes the path in the class 2023-12-22 11:35:37 +01:00
Miriam Baglioni 3afd4aa57b adjustments for country propagation 2023-12-22 11:27:30 +01:00
Claudio Atzori 62104790ae added metaresourcetype to the result hive DB view 2023-12-21 12:27:10 +01:00
Miriam Baglioni 5011c4d11a refactoring after compilation 2023-12-20 15:57:26 +01:00