StatsDB workflow to export actionsets about OA routes, diamond, and publicly-funded #355
Reference: D-Net/dnet-hadoop#355
A new Oozie workflow that reads from the StatsDB to produce a new ActionSet for updating results with:
Inputs:
Please have a look at the various comments in the PR, but I think the main aspect that needs to be fixed is the representation of the information being exported from the StatsDB as an ActionSet. In fact, as agreed during the last meeting, the information about isGreen, openAccessColor, isInDiamondJournal, and publiclyFunded had to be included at the result level in a new release of the dhp-schema module, version 4.17.1.
@ -0,0 +8,4 @@
</parent>
<artifactId>dhp-stats-actionsets</artifactId>
<!-- <build>-->
If this commented-out block is not needed, please remove it.
@ -0,0 +25,4 @@
public static final String DEFAULT_FOS_DELIMITER = "\t";
public static final String UPDATE_DATA_INFO_TYPE = "update";
// public static final String UPDATE_SUBJECT_FOS_CLASS_ID = "subject:fos";
Consider also removing the commented-out code lines. They only hurt readability.
@ -0,0 +51,4 @@
private Constants() {
}
public static Boolean isSparkSessionManaged(ArgumentApplicationParser parser) {
I cannot see any usage of this method in the codebase. Consider removing it.
@ -0,0 +58,4 @@
.orElse(Boolean.TRUE);
}
public static <R> Dataset<R> readPath(
I cannot see any usage of this method in the codebase. Consider removing it.
@ -0,0 +66,4 @@
.map((MapFunction<String, R>) value -> OBJECT_MAPPER.readValue(value, clazz), Encoders.bean(clazz));
}
public static Subject getSubject(String sbj, String classid, String classname,
I cannot see any usage of this method in the codebase. Consider removing it.
@ -0,0 +156,4 @@
.toJavaRDD()
.map(p -> new AtomicAction(p.getClass(), p))
.union(
getFinalIndicatorsDiamondResult(spark, inputPath + "/diamondOADΒ")
Gitea warns me about the character Β used in the string "/diamondOADΒ". Any chance that it is a typo from a different keyboard setting?
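A check along these lines could catch such characters before they end up in a path. This is only a minimal sketch: it assumes the flagged character is a look-alike such as the Greek capital beta (U+0392), and the class and method names are hypothetical, not part of the PR.

```java
final class PathCharCheck {

    private PathCharCheck() {
    }

    /** Returns the index of the first non-ASCII character in s, or -1 if there is none. */
    static int firstNonAsciiIndex(String s) {
        for (int i = 0; i < s.length(); i++) {
            // Anything above code point 127 is outside the ASCII range.
            if (s.charAt(i) > 127) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumption: the suffix ends with U+0392 (Greek capital beta),
        // which renders almost identically to the Latin capital B.
        String suspicious = "/diamondOAD\u0392";
        String plain = "/diamondOADB";
        System.out.println("suspicious: " + firstNonAsciiIndex(suspicious));
        System.out.println("plain: " + firstNonAsciiIndex(plain));
    }
}
```

A non-negative result points at the position of the character that should be retyped with a Latin keyboard layout.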
@ -0,0 +160,4 @@
.toJavaRDD()
.map(p -> new AtomicAction(p.getClass(), p)))
.union(
getFinalIndicatorsPubliclyFundedResult(spark, inputPath + "/publiclyFundedDΒ")
The same as above, Gitea warns me about the character Β used in the string "/publiclyFundedDΒ". Any chance that it is a typo from a different keyboard setting?
@ -0,0 +170,4 @@
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))))
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class);
Consider storing the output in compressed form, using the org.apache.hadoop.io.compress.GzipCodec.
@ -0,0 +224,4 @@
}, Encoders.bean(Result.class));
}
private static List<Measure> getMeasure(Boolean is_model_oa, String model_type) {
As agreed during the last meeting, the information about isGreen, openAccessColor, isInDiamondJournal, and publiclyFunded is included at the result level in a recent release of the dhp-schema module, version 4.17.1. Those fields are available at the result level and should be used instead of the generic measures, which have a different purpose.
@ -0,0 +68,4 @@
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="atomicactions">
If I understand correctly from the Spark action name, this action would produce an ActionSet containing the usage stats metrics (views / downloads). However, an Oozie workflow responsible for exporting such information is already available under
Why would we need to duplicate it? Can you elaborate?
Furthermore, I see the tag points to
eu.dnetlib.dhp.actionmanager.stats_actionsets.SparkAtomicActionGreenOAJob
which does not exist in the classpath.
Was this workflow run in isolation? Did it produce any output? It would be good to link in the PR discussion a Zeppelin notebook that provides some statistics on the data produced by this procedure.
Other than that, the code seems fine. Please clarify the single comment I left below.
@ -0,0 +140,4 @@
r.setOpenAccessColor(OpenAccessColor.bronze);
else if (usm.isIs_gold())
r.setOpenAccessColor(OpenAccessColor.bronze);
else if (usm.isIs_gold())
Please check the chain of conditions: usm.isIs_gold() seems to be checked twice, leading to different then-blocks, one setting the OpenAccess color to bronze and the other to gold, which seems counterintuitive. I assume this is not what you intended to write.
This is a separate workflow that produces the action sets in the output folder.
Please also check the sample Zeppelin notebook
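For the condition chain discussed above, where usm.isIs_gold() is tested twice, a corrected chain might look like the sketch below. It is self-contained and uses hypothetical stand-ins: the Color enum models only the two values mentioned in this thread, the boolean parameters stand in for the usm flags, and checking gold before bronze is an assumption, not the agreed precedence.

```java
import java.util.Optional;

final class OpenAccessColorSketch {

    // Hypothetical stand-in for the schema's OpenAccessColor enum; only the
    // two values mentioned in this thread are modeled.
    enum Color { gold, bronze }

    private OpenAccessColorSketch() {
    }

    // Each flag is tested exactly once and each branch sets the matching color.
    // The gold-before-bronze order is an assumption made for this sketch.
    static Optional<Color> resolve(boolean isGold, boolean isBronze) {
        if (isGold) {
            return Optional.of(Color.gold);
        } else if (isBronze) {
            return Optional.of(Color.bronze);
        }
        // No flag set: leave the color unassigned.
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(resolve(true, false));
        System.out.println(resolve(false, true));
        System.out.println(resolve(false, false));
    }
}
```

Keeping one test per flag makes an unreachable branch, like the second isIs_gold() check, easy to spot in review.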
Thanks Dimitris, the notebook is a good start. Can you include a basic statistical analysis that covers for
?
I added a Zeppelin notebook with some results.