Currently, the OpenAIRE graph materialized across the different processing steps is stored as compressed text files containing newline-delimited JSON records.
In order to analyse the data with ease, it is often necessary to make the graph content available as a database on Hive, which introduces overhead in the form of an extra mapping step.
The purpose of this enhancement is therefore to address this limitation by making each individual graph processing step capable of storing the graph data as a Hive DB.
In order to move on with this activity incrementally, both save modes could be supported, by means of a new parameter saveMode=json|parquet.
Hints
Set up the org.apache.spark.SparkConf and make use of runWithSparkHiveSession instead of runWithSparkSession.
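As a concrete reading of the hints, a minimal sketch of a step writing its output as a Hive table; the input path, the graph_db DB name and the metastore URI are placeholders, and the exact SparkSessionSupport signature should be double-checked against dhp-common.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SaveMode;

import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkHiveSession;

public class SaveGraphTableAsHive {

	public static void main(String[] args) {
		SparkConf conf = new SparkConf();
		// the Hive metastore must be reachable so that DB/table names can be resolved (placeholder URI)
		conf.set("hive.metastore.uris", "thrift://hive-metastore.example.org:9083");

		runWithSparkHiveSession(conf, Boolean.TRUE, spark ->
			spark
				.read()
				.json("/tmp/graph/publication")        // current encoding: newline-delimited JSON
				.write()
				.mode(SaveMode.Overwrite)
				.saveAsTable("graph_db.publication"));  // target encoding: Hive DB.table
	}
}
```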
I'm factoring out utilities to be used across different workflows in a new module dhp-workflows-common. Ongoing update, up to f3ce97ecf9.
I introduced the following changes:
- eu.dnetlib.dhp.common.GraphFormat declares the two supported formats JSON | HIVE;
- the workflows aggregatorGraph and mergeAggregatorGraphs (both defined in dhp-graph-mapper) are updated to accept both formats.
@michele.debonis can you take care of updating the deduplication workflow?
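For context, a sketch of how a GraphFormat-aware save utility could look; GraphTableWriter, the save helper and the pathOrTable parameter are illustrative names, not the actual API introduced in dhp-workflows-common.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SaveMode;

import eu.dnetlib.dhp.schema.oaf.Publication;

public class GraphTableWriter {

	// mirrors the two encodings declared by eu.dnetlib.dhp.common.GraphFormat
	public enum GraphFormat {
		JSON, HIVE
	}

	// Saves a graph table either as compressed newline-delimited JSON (path-based parameter)
	// or as a Hive table (DB-name-based parameter), depending on the requested format.
	public static void save(Dataset<Publication> table, GraphFormat format, String pathOrTable) {
		switch (format) {
			case JSON:
				table.write().mode(SaveMode.Overwrite).option("compression", "gzip").json(pathOrTable);
				break;
			case HIVE:
				table.write().mode(SaveMode.Overwrite).saveAsTable(pathOrTable);
				break;
		}
	}
}
```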
@przemyslaw.jacewicz can you take over the adaptation of the actionmanager promote workflow?
@all The idea is to progressively move from the current JSON-based encoding and the path-based parameters to the HIVE-based encoding and DB-name-based parameters.
Yes, I can. I'm not sure when I'll be able to start the work, though. We have some pending tasks in IIS and some planning within our team is needed.
Ongoing update: I'm using the graphCleaning workflow to experiment with the different combinations of input/output graph format, and I notice that when the input graph is stored in Hive, the memory footprint increases more than I expected; in particular, with the default settings the workflow dies processing publications, running OOM: http://iis-cdh5-test-m1.ocean.icm.edu.pl:8088/cluster/app/application_1595483222400_5962
The actual error cause is hidden by the 2nd execution attempt, but in fact it failed with:
We already know that with the current graph model classes the impact on the available memory is quite high when using bean Encoders, so whenever possible we used the kryo encoders instead. However, unfortunately, storing a kryo-encoded dataset in a Hive table results, in the experiments I performed so far, in a table with a single binary column (not too useful for data inspection/query).
Perhaps this behaviour can be changed with more investigation, but maybe there are alternatives.
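To make the encoder trade-off explicit, a small comparison sketch (the parsing helper and table names are illustrative): the kryo-encoded dataset is persisted by Spark as a single binary value column, while the bean-encoded one keeps a column per field and remains queryable.

```java
import java.io.IOException;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

import com.fasterxml.jackson.databind.ObjectMapper;

import eu.dnetlib.dhp.schema.oaf.Publication;

public class EncoderComparison {

	private static final ObjectMapper MAPPER = new ObjectMapper();

	public static void write(SparkSession spark, String inputPath) {
		// kryo encoder: lighter on memory, but the dataset schema collapses to one binary column,
		// so the resulting Hive table is not usable for ad-hoc inspection/queries
		spark
			.read()
			.textFile(inputPath)
			.map((MapFunction<String, Publication>) EncoderComparison::parse, Encoders.kryo(Publication.class))
			.write()
			.mode(SaveMode.Overwrite)
			.saveAsTable("graph_db.publication_kryo");

		// bean encoder: heavier memory footprint, but the table exposes one column per field
		spark
			.read()
			.textFile(inputPath)
			.map((MapFunction<String, Publication>) EncoderComparison::parse, Encoders.bean(Publication.class))
			.write()
			.mode(SaveMode.Overwrite)
			.saveAsTable("graph_db.publication_bean");
	}

	private static Publication parse(String json) {
		try {
			return MAPPER.readValue(json, Publication.class);
		} catch (IOException e) {
			throw new RuntimeException(e);
		}
	}
}
```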
The idea for tracking data quality metrics over time is to run a batch of SQL statements over some of the JSON-encoded graph materializations and store the observations in Prometheus. Running the same queries over time would create a set of time series that we hope would capture meaningful data quality aspects.
The materialization of the graph as a proper HIVE DB is not really a prerequisite for running SQL queries against it. In fact, SQL statements could also be executed against tempViews created from a Spark dataset. As many SQL queries would need to join different tables, we would probably need to make the entire set of graph tables available as tempViews.
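A sketch of the tempView-based approach (paths, view names and the example metric are illustrative): each graph materialization is registered as a temporary view, and the quality queries can then join across them without a persistent Hive DB.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GraphQualityMetrics {

	public static Dataset<Row> publicationsWithRelations(SparkSession spark) {
		// register the graph tables needed by the queries as temporary views, no Hive DB required
		spark.read().json("/tmp/graph/publication").createOrReplaceTempView("publication");
		spark.read().json("/tmp/graph/relation").createOrReplaceTempView("relation");

		// example observation to be pushed to Prometheus: publications having at least one relation
		return spark.sql(
			"SELECT COUNT(DISTINCT p.id) AS publications_with_relations " +
			"FROM publication p JOIN relation r ON p.id = r.source");
	}
}
```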
Another benefit of this approach is that the existing procedure we already use to map the graph as a proper HIVE DB would still be available and could be run when we need to deepen the analysis.
All this would need some experimentation, but before moving on I'd like to hear from the other people involved in this task :)