Claudio Atzori
628fdfb5eb
Merge pull request '[enrichment single step]' ( #378 ) from enrichmentSingleStepFixed into beta
...
Reviewed-on: D-Net/dnet-hadoop#378
2024-01-18 09:41:09 +01:00
Miriam Baglioni
82e9e262ee
[enrichment single step] remove parameter from execution
2024-01-17 17:38:03 +01:00
Miriam Baglioni
67ce2d54be
[enrichment single step] refactoring to fix issues in disappeared result type
2024-01-17 16:50:00 +01:00
Miriam Baglioni
59eaccbd87
[enrichment single step] refactoring to fix issue in disappeared result type
2024-01-15 17:49:54 +01:00
Giambattista Bloisi
21a14fcd80
Reusable RunSQLSparkJob for executing SQL in Spark through Oozie Spark Actions
...
Implements pivots table update oozie workflow
2024-01-15 10:18:14 +01:00
Sandro La Bruzzo
e0753f19da
Fixed error of connection timeout
2024-01-13 09:27:08 +01:00
sandro.labruzzo
e328bc0ade
fixed missing parameter on download update
2024-01-12 16:18:20 +01:00
Claudio Atzori
2d302e6827
Merge pull request '[FoS integration]fix issue on FoS integration. Removing the null values from FoS' ( #375 ) from fosPreparationBeta into beta
...
Reviewed-on: D-Net/dnet-hadoop#375
2024-01-12 10:27:28 +01:00
Miriam Baglioni
f612125939
fix issue on FoS integration. Removing the null values from FoS
2024-01-12 10:20:28 +01:00
Claudio Atzori
c67467723b
Merge pull request 'refined mapping for the extraction of the original resource type' ( #374 ) from resource_types into beta
...
Reviewed-on: D-Net/dnet-hadoop#374
2024-01-11 16:29:47 +01:00
Claudio Atzori
cb9e739484
Merge branch 'beta' into resource_types
2024-01-11 16:29:41 +01:00
Claudio Atzori
2753044d13
refined mapping for the extraction of the original resource type
2024-01-11 16:28:26 +01:00
Giambattista Bloisi
a88dce5bf3
Merge pull request 'Improvements and refactoring in Dedup' ( #367 ) from dedup_increasenumofblocks into beta
...
Reviewed-on: D-Net/dnet-hadoop#367
2024-01-11 11:24:06 +01:00
Giambattista Bloisi
3c66e3bd7b
Create dedup record for "merged" pivots
...
Do not create dedup records for groups that have more than 20 different acceptance dates
2024-01-10 22:59:52 +01:00
Giambattista Bloisi
10e135db1e
Use dedup_wf_002 in place of dedup_wf_001 to make explicit that a different algorithm has been used to generate this kind of id
2024-01-10 22:59:52 +01:00
Giambattista Bloisi
831cc1fdde
Generate "merged" dedup id relations also for records that are filtered out by the cut parameters
2024-01-10 22:59:52 +01:00
Giambattista Bloisi
1287315ffb
No longer use dedupId information from the pivotHistory database
2024-01-10 22:59:52 +01:00
Giambattista Bloisi
02636e802c
SparkCreateSimRels:
...
- Create dedup blocks from the complete queue of records matching the cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests
SparkWhitelistSimRels: use left semi join for clarity and performance
SparkCreateMergeRels:
- Use new connected component algorithm that converges faster than the algorithm provided by Spark GraphX
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect dates instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions
DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
2024-01-10 22:59:52 +01:00
Antonis Lempesis
e024718f73
creating result_instances even when no pids exist for the instance
2024-01-10 22:25:50 +01:00
Sandro La Bruzzo
859babf722
added some useful comments
2024-01-10 19:51:13 +01:00
Sandro La Bruzzo
39ebb60b38
Merge remote-tracking branch 'origin/beta' into orcid_update
2024-01-10 19:50:00 +01:00
Sandro La Bruzzo
9d5a7c3b22
code refactor
2024-01-10 19:42:34 +01:00
Sandro La Bruzzo
8f61063201
Added workflow
2024-01-10 19:42:22 +01:00
Sandro La Bruzzo
1a42a5c10d
Implemented Download update of ORCID
2024-01-10 18:03:20 +01:00
Claudio Atzori
16d858fbf0
Merge pull request 'enrichmentSingleStep' ( #373 ) from enrichmentSingleStep into beta
...
Reviewed-on: D-Net/dnet-hadoop#373
2024-01-10 16:58:49 +01:00
Miriam Baglioni
e711a05229
fixed conflicts
2024-01-10 11:03:42 +01:00
Miriam Baglioni
71d6f30711
Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta
2024-01-10 10:59:58 +01:00
dimitrispie
b920307bdd
Changes to indicators
2024-01-09 00:47:09 +02:00
dimitrispie
8b2cbb611e
Changes to beta db names
2024-01-09 00:40:56 +02:00
Antonis Lempesis
2e4cab026c
fixed the result_country definition
2024-01-08 16:01:26 +02:00
dimitrispie
6b823100ae
Update buildIrishMonitorDB.sql
...
New indicators added
2024-01-07 22:54:39 +02:00
dimitrispie
75bfde043c
Historical Snapshots Workflow
...
Create historical snapshots db with parameters:
hist_db_name=openaire_beta_historical_snapshots_xxx
hist_db_name_prev=openaire_beta_historical_snapshots_xxx (previous run of wf)
stats_db_name=openaire_beta_stats_xxx
stats_irish_db_name=openaire_beta_stats_monitor_ie_xxx
monitor_db_name=openaire_beta_stats_monitor_xxx
monitor_db_prod_name=openaire_beta_stats_monitor
monitor_irish_db_name=openaire_beta_stats_monitor_ie_xxx
monitor_irish_db_prod_name=openaire_beta_stats_monitor_ie
hist_db_prod_name=openaire_beta_historical_snapshots
hist_db_shadow_name=openaire_beta_historical_snapshots_shadow
hist_date=122023
hive_timeout=150000
hadoop_user_name=xxx
resumeFrom=CreateDB
2024-01-04 15:11:04 +02:00
Miriam Baglioni
cb14470ba6
added properties file in the folder for the workflow of result to organization from inst repo propagation. Changed the path in the classes implementing the propagation
2023-12-22 14:50:05 +01:00
Miriam Baglioni
9f966b59d4
added properties file in the folder for the workflow of result to community from semrel propagation. Changed the path in the classes implementing the propagation
2023-12-22 14:11:47 +01:00
Miriam Baglioni
2f3b5a133d
added properties file in the folder for the workflow of result to community from organization propagation. Changed the path in the classes implementing the propagation
2023-12-22 13:56:40 +01:00
Miriam Baglioni
2f7b9ad815
added properties file in the folder for the workflow of project to result propagation. Changed the path in the classes implementing the propagation
2023-12-22 11:46:15 +01:00
Miriam Baglioni
f2352e8a78
changed the path in the classes for the property files for the propagation of community from project
2023-12-22 11:43:34 +01:00
Miriam Baglioni
009730b3d1
added properties file in the folder for the workflow of orcid propagation. Changed the path in the classes implementing the propagation. Changed the path to the parameter file in the class for entitytoorganization propagation
2023-12-22 11:42:09 +01:00
Miriam Baglioni
89f269c7f4
changed the path to the parameter file in the class for entitytoorganization propagation
2023-12-22 11:37:50 +01:00
Miriam Baglioni
b06aea0adf
added the bulkTag parameter file in the folder for the oozie workflow for bulkTagging. Changed the path in the class
2023-12-22 11:35:37 +01:00
Miriam Baglioni
3afd4aa57b
adjustments for country propagation
2023-12-22 11:27:30 +01:00
dimitrispie
ffdd03d2f4
Monitor Irish Stats WF
...
Parameters (with examples):
stats_db_name=openaire_beta_stats_20231208
monitor_irish_db_name=openaire_beta_stats_monitor_ie_20231208b
monitor_irish_db_prod_name=openaire_beta_stats_monitor_ie
graph_db_name=openaire_beta_20231208
monitor_irish_db_shadow_name=openaire_beta_stats_monitor_ie_shadow
hive_timeout=150000
hadoop_user_name=dnet.beta
resumeFrom=Step1-buildIrishMonitorDB
2023-12-22 11:05:24 +02:00
dimitrispie
40b98d8182
Changes to indicators and funders definition
...
- Changes result_refereed definition
- Added result_country indicator
- Added indi_pub_green_with_license indicator
- Added country from jurisdiction to funders
2023-12-22 10:29:20 +02:00
Claudio Atzori
62104790ae
added metaresourcetype to the result hive DB view
2023-12-21 12:27:10 +01:00
Miriam Baglioni
5011c4d11a
refactoring after compilation
2023-12-20 15:57:26 +01:00
Miriam Baglioni
4740c808f7
-
2023-12-20 14:26:54 +01:00
Miriam Baglioni
d410ea8a41
added needed parameter
2023-12-19 12:15:01 +01:00
Miriam Baglioni
624f5f3f21
[Transformative Agreement] added check to verify the APCs were paid by the IReL funder
2023-12-18 15:28:19 +01:00
Miriam Baglioni
354e02e6a9
[Transformative Agreement] removed unneeded class. Read the JSON directly, with no need to go through the CSV
2023-12-18 15:20:27 +01:00
Miriam Baglioni
b00771c7cc
[Transformative Agreement] added code to extract relations from the transformative agreement file for the IE products obtained from OpenAPC
2023-12-18 15:12:44 +01:00