store graph as hive DB #31
Reference: D-Net/dnet-hadoop#31
Currently the OpenAIRE graph materialized across the different processing steps is stored as compressed text files containing newline-delimited JSON records.
In order to analyse the data with ease, it is often necessary to make the graph content available as a database on Hive, which introduces overhead in the overall procedure.
The purpose of this enhancement is therefore to address this limitation by making each individual graph processing step capable of storing the graph data as a Hive DB.
In order to move on with this activity incrementally, both save modes could be supported by means of a new parameter saveMode=json|parquet, as in the sketch below.
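Just to make the idea concrete, a minimal sketch of how a single step could dispatch on such a parameter; the saveGraphTable helper, its parameters and the database/table naming are hypothetical and not part of the current codebase:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SaveMode;

public class GraphWriterSketch {

	/**
	 * Hypothetical helper: persists one graph entity table either as compressed,
	 * newline-delimited JSON text files (current behaviour) or as a table of a
	 * Hive database (proposed behaviour). Requires a Hive-enabled SparkSession
	 * for the second case.
	 */
	public static <T> void saveGraphTable(
			Dataset<T> table, String saveMode, String outputPath, String hiveDbName, String tableName) {
		switch (saveMode) {
			case "json":
				// current encoding: gzipped, newline-delimited JSON records on HDFS
				table
					.toJSON()
					.write()
					.option("compression", "gzip")
					.mode(SaveMode.Overwrite)
					.text(outputPath + "/" + tableName);
				break;
			case "parquet":
				// proposed encoding: a parquet-backed table registered in the given Hive DB
				table
					.write()
					.mode(SaveMode.Overwrite)
					.format("parquet")
					.saveAsTable(hiveDbName + "." + tableName);
				break;
			default:
				throw new IllegalArgumentException("unsupported saveMode: " + saveMode);
		}
	}
}
```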
Hints
- provide the Hive configuration (e.g. the metastore URI) via org.apache.spark.SparkConf and make use of runWithSparkHiveSession instead of runWithSparkSession (see the sketch below);
- I'm factoring out utilities to be used across different workflows in a new module dhp-workflows-common.
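For reference, a sketch of what moving a step to a Hive-enabled session could look like; the runWithSparkHiveSession signature shown here is an assumption mirroring runWithSparkSession, and the metastore URI handling is only indicative:

```java
import java.util.function.Consumer;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class HiveSessionSketch {

	public static void main(String[] args) {
		SparkConf conf = new SparkConf();
		// point the session at the cluster metastore so that saveAsTable / spark.sql
		// target the right catalog; the value would come from a workflow parameter
		conf.set("hive.metastore.uris", "thrift://example-metastore:9083");

		runWithSparkHiveSession(conf, true, spark ->
			spark.sql("CREATE DATABASE IF NOT EXISTS graph_db"));
	}

	// placeholder with an assumed signature mirroring runWithSparkSession,
	// the only intended difference being enableHiveSupport()
	private static void runWithSparkHiveSession(
			SparkConf conf, boolean isSparkSessionManaged, Consumer<SparkSession> fn) {
		SparkSession spark = SparkSession
			.builder()
			.config(conf)
			.enableHiveSupport()
			.getOrCreate();
		try {
			fn.accept(spark);
		} finally {
			if (isSparkSessionManaged) {
				spark.stop();
			}
		}
	}
}
```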
Ongoing update, up to f3ce97ecf9 I introduced the following changes:

- eu.dnetlib.dhp.common.GraphFormat declares the two supported formats JSON | HIVE;
- the workflows (aggregatorGraph, mergeAggregatorGraphs, both defined in dhp-graph-mapper) are updated to accept both formats (see the read-side sketch below).

@michele.debonis can you take care of updating the deduplication workflow?
@przemyslaw.jacewicz can you take over the adaptation of the actionmanager promote workflow?
@all The idea is to progressively move from the current JSON-based encoding and the path-based parameters to the HIVE-based encoding and DB-name-based parameters.
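To make the direction concrete, a hedged sketch of the read side of the switch; the GraphFormat enum mirrors what is described above, while the readGraphTable helper and its parameters are purely illustrative:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class GraphReaderSketch {

	/** The two supported graph encodings, mirroring what eu.dnetlib.dhp.common.GraphFormat declares. */
	public enum GraphFormat {
		JSON, HIVE
	}

	/**
	 * Hypothetical helper: loads one graph entity table either from a path containing
	 * newline-delimited JSON (path-based parameters) or from a Hive DB (DB-name-based parameters).
	 */
	public static <T> Dataset<T> readGraphTable(
			SparkSession spark, GraphFormat format, String pathOrDbName, String tableName, Class<T> clazz) {
		switch (format) {
			case JSON:
				// path-based: read the text files, parsing them with the bean schema
				return spark
					.read()
					.schema(Encoders.bean(clazz).schema())
					.json(pathOrDbName + "/" + tableName)
					.as(Encoders.bean(clazz));
			case HIVE:
				// DB-name-based: read the table from the Hive catalog
				return spark
					.table(pathOrDbName + "." + tableName)
					.as(Encoders.bean(clazz));
			default:
				throw new IllegalArgumentException("unknown graph format: " + format);
		}
	}
}
```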
Yes, I can. I'm not sure when I'll be able to start the work, though. We have some pending tasks in IIS and some planning within our team is needed.
Ongoing update: I'm using the graphCleaning workflow to experiment with the different combinations of input/output graph format, and I notice that when the input graph is stored in Hive, the memory footprint increases more than I expected; in particular, with the default settings the workflow dies processing publications, running OOM: http://iis-cdh5-test-m1.ocean.icm.edu.pl:8088/cluster/app/application_1595483222400_5962
The actual error cause is hidden by the 2nd execution attempt, but in fact it failed with:
We already know that with the current graph model classes the impact on the available memory is quite high when using bean Encoders, so whenever possible we used the kryo encoders instead. However, in the experiments I performed so far, storing a kryo encoded dataset in a Hive table unfortunately results in a table with a single binary column (not too useful for data inspection/query), as in the sketch below.
Perhaps this behaviour can be changed with more investigation, but maybe there are some alternatives.
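To illustrate the behaviour described above, a small sketch comparing the two encoders; the Publication bean and the graph_db database/table names are only illustrative:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class EncoderComparisonSketch {

	public static class Publication implements java.io.Serializable {
		private String id;
		private String title;
		public String getId() { return id; }
		public void setId(String id) { this.id = id; }
		public String getTitle() { return title; }
		public void setTitle(String title) { this.title = title; }
	}

	public static void main(String[] args) {
		SparkSession spark = SparkSession.builder().enableHiveSupport().getOrCreate();

		Dataset<Publication> kryoDs = spark.emptyDataset(Encoders.kryo(Publication.class));
		// the kryo encoder serializes the whole object into one column: the resulting
		// table has a single 'value: binary' column, hard to inspect or query with SQL
		kryoDs.printSchema();
		kryoDs.write().mode(SaveMode.Overwrite).saveAsTable("graph_db.publication_kryo");

		Dataset<Publication> beanDs = spark.emptyDataset(Encoders.bean(Publication.class));
		// the bean encoder preserves the field structure (id, title, ...) but is
		// considerably heavier on memory for the full OpenAIRE graph model classes
		beanDs.printSchema();
		beanDs.write().mode(SaveMode.Overwrite).saveAsTable("graph_db.publication_bean");
	}
}
```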
The idea for tracking data quality metrics over time is to run a batch of SQL statements over some of the JSON encoded graph materializations and store the observations in Prometheus. Running the same queries over time would create a set of time series that we hope would capture meaningful data quality aspects.
The materialization of the graph as a proper HIVE DB is not really a prerequisite for running SQL queries against it. In fact, SQL statements could also be executed against tempViews created from a Spark dataset. As many SQL queries would need to join different tables, we would probably need to make the entire set of graph tables available as tempViews, as sketched below.
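A sketch of this approach, assuming the graph tables have already been read as Datasets; the paths, view names and the example data quality query are only indicative:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TempViewMetricsSketch {

	public static void main(String[] args) {
		SparkSession spark = SparkSession.builder().getOrCreate();

		// assume publication and relation have been read from the JSON encoded graph
		Dataset<Row> publication = spark.read().json("/graph/publication");
		Dataset<Row> relation = spark.read().json("/graph/relation");

		// expose the graph tables as temporary views: no Hive DB is needed for this
		publication.createOrReplaceTempView("publication");
		relation.createOrReplaceTempView("relation");

		// example data quality observation: publications not involved in any relation;
		// the observed value would then be stored in Prometheus as a time series point
		Dataset<Row> orphanPublications = spark.sql(
			"SELECT count(*) AS orphan_publications " +
			"FROM publication p LEFT JOIN relation r ON p.id = r.source " +
			"WHERE r.source IS NULL");

		orphanPublications.show();
	}
}
```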
Another benefit of this approach is that the existing procedure we already use to map the graph as a proper HIVE DB would still be available and could be run when we need to deepen the analysis.
All this would need some experimentation, but before moving on I'd like to hear from the other people involved in this task :)