StatsDB workflow to export actionsets about OA routes, diamond, and publicly-funded #355

Merged
claudio.atzori merged 4 commits from dimitris.pierrakos/dnet-hadoop:beta into beta 2023-12-01 15:03:58 +01:00
Contributor

A new Oozie workflow that reads from the stats DB and produces a new actionSet for updating results with:

  • green_oa = {true, false}
  • openAccessColor = {gold, hybrid, bronze}
  • in_diamond_journal = {true, false}
  • publicly_funded = {true, false}

Inputs:

  • outputPath
  • statsDB
dimitris.pierrakos added 1 commit 2023-10-24 08:49:56 +02:00
89c4dfbaf4 StatsDB workflow to export actionsets about OA routes, diamond, and publicly-funded
claudio.atzori requested changes 2023-10-25 12:16:49 +02:00
claudio.atzori left a comment
Owner

Please have a look at the various comments in the PR, but I think the main aspect that needs to be fixed is how the information exported from the StatsDB is represented in the ActionSet. As agreed during the last meeting, the information about isGreen, openAccessColor, isInDiamondJournal, and publiclyFunded was to be included at the result level in a new release of the dhp-schema module, version 4.17.1.

@ -0,0 +8,4 @@
</parent>
<artifactId>dhp-stats-actionsets</artifactId>
<!-- <build>-->

If this block is commented out and not needed, please remove it.

dimitris.pierrakos marked this conversation as resolved
@ -0,0 +25,4 @@
public static final String DEFAULT_FOS_DELIMITER = "\t";
public static final String UPDATE_DATA_INFO_TYPE = "update";
// public static final String UPDATE_SUBJECT_FOS_CLASS_ID = "subject:fos";

Consider also removing the commented-out code lines. They only hurt the readability of the code.

dimitris.pierrakos marked this conversation as resolved
@ -0,0 +51,4 @@
private Constants() {
}
public static Boolean isSparkSessionManaged(ArgumentApplicationParser parser) {

I cannot see any usage of this method in the codebase. Consider removing it.

dimitris.pierrakos marked this conversation as resolved
@ -0,0 +58,4 @@
.orElse(Boolean.TRUE);
}
public static <R> Dataset<R> readPath(

I cannot see any usage of this method in the codebase. Consider removing it.

dimitris.pierrakos marked this conversation as resolved
@ -0,0 +66,4 @@
.map((MapFunction<String, R>) value -> OBJECT_MAPPER.readValue(value, clazz), Encoders.bean(clazz));
}
public static Subject getSubject(String sbj, String classid, String classname,

I cannot see any usage of this method in the codebase. Consider removing it.

dimitris.pierrakos marked this conversation as resolved
@ -0,0 +156,4 @@
.toJavaRDD()
.map(p -> new AtomicAction(p.getClass(), p))
.union(
getFinalIndicatorsDiamondResult(spark, inputPath + "/diamondOADΒ")

Gitea warns me about the character Β used in the string "/diamondOADΒ":

Β [U+0392] is confusable with B [U+0042]

Could this be a typo caused by a different keyboard layout?
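A quick way to catch this kind of typo before review is to scan path strings for non-ASCII code points. The following is only a minimal sketch in plain Java; the class and method names are hypothetical and not part of dnet-hadoop:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of dnet-hadoop): reports every non-ASCII
// code point in a string, for example a Greek capital Beta (U+0392) typed
// in place of a Latin capital B (U+0042).
final class NonAsciiScanner {

    private NonAsciiScanner() {
    }

    // Returns one entry per non-ASCII code point, formatted as "index:U+XXXX".
    static List<String> findNonAscii(String s) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < s.length();) {
            int cp = s.codePointAt(i);
            if (cp > 0x7F) {
                hits.add(String.format("%d:U+%04X", i, cp));
            }
            // Advance by one or two chars, depending on the code point width.
            i += Character.charCount(cp);
        }
        return hits;
    }
}
```

Calling `NonAsciiScanner.findNonAscii("/diamondOAD\u0392")` would flag the last character, while the all-Latin `"/diamondOADB"` yields an empty list.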

dimitris.pierrakos marked this conversation as resolved
@ -0,0 +160,4 @@
.toJavaRDD()
.map(p -> new AtomicAction(p.getClass(), p)))
.union(
getFinalIndicatorsPubliclyFundedResult(spark, inputPath + "/publiclyFundedDΒ")

Same as above: Gitea warns me about the character Β used in the string "/publiclyFundedDΒ":

Β [U+0392] is confusable with B [U+0042]

Could this be a typo caused by a different keyboard layout?

dimitris.pierrakos marked this conversation as resolved
@ -0,0 +170,4 @@
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))))
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class);

Consider storing the output compressed, using org.apache.hadoop.io.compress.GzipCodec:

.saveAsHadoopFile(
    outputPath, 
    Text.class, 
    Text.class, 
    SequenceFileOutputFormat.class, 
    GzipCodec.class);
dimitris.pierrakos marked this conversation as resolved
@ -0,0 +224,4 @@
}, Encoders.bean(Result.class));
}
private static List<Measure> getMeasure(Boolean is_model_oa, String model_type) {

As agreed during the last meeting, the information about isGreen, openAccessColor, isInDiamondJournal, publiclyFunded is included at the result level in a recent release of the dhp-schema module, version 4.17.1. Those fields are available at the result level and should be used instead of the generic measures, which have a different purpose.

dimitris.pierrakos marked this conversation as resolved
@ -0,0 +68,4 @@
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="atomicactions">

If I understand correctly from the Spark action name, this action would produce an actionset containing the usage stats metrics (views / downloads). However, an Oozie workflow responsible for exporting such information is already available under

dhp-workflows/dhp-aggregation/src/main/resources/eu/dnetlib/dhp/actionmanager/usagestats/oozie_app/workflow.xml

Why would we need to duplicate it? Can you elaborate?

Furthermore, I see the <class> tag points to eu.dnetlib.dhp.actionmanager.stats_actionsets.SparkAtomicActionGreenOAJob, which does not exist in the classpath.
dimitris.pierrakos marked this conversation as resolved
claudio.atzori requested review from miriam.baglioni 2023-10-25 12:17:11 +02:00
claudio.atzori added this to the OpenAIRE project 2023-10-26 09:40:33 +02:00
claudio.atzori modified the project from OpenAIRE to OpenAIRE - DNet 2023-10-26 09:56:27 +02:00
dimitris.pierrakos added 1 commit 2023-11-14 08:46:58 +01:00
d524e30866 Changes to actionsets
Resolve comments from
#355
claudio.atzori requested changes 2023-11-22 13:59:58 +01:00
claudio.atzori left a comment
Owner

Was this workflow run in isolation? Did it produce any output? It would be good to link a Zeppelin notebook in the PR discussion that provides some statistics on the data produced by this procedure.

Other than that, the code seems fine; please clarify the single comment I left below.

@ -0,0 +140,4 @@
r.setOpenAccessColor(OpenAccessColor.bronze);
else if (usm.isIs_gold())
r.setOpenAccessColor(OpenAccessColor.bronze);
else if (usm.isIs_gold())

Please check the chain of conditions: usm.isIs_gold() seems to be checked twice, leading to two different branches. One sets the open access color to bronze, the other to gold, which seems counterintuitive. I assume this is not what you intended to write.
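For illustration, here is a minimal sketch of a chain in which each flag is tested exactly once and maps to its own color. OaColor and OaFlags are hypothetical stand-ins for the schema enum and the stats DB model, and the precedence gold > hybrid > bronze is an assumption; the actual ordering in StatsAtomicActionsJob may need to differ.

```java
// Sketch only. OaColor and OaFlags are hypothetical stand-ins for the schema
// enum and the stats DB model; the precedence gold > hybrid > bronze is an
// assumption made for this example.
enum OaColor { gold, hybrid, bronze }

final class OaFlags {
    final boolean gold;
    final boolean hybrid;
    final boolean bronze;

    OaFlags(boolean gold, boolean hybrid, boolean bronze) {
        this.gold = gold;
        this.hybrid = hybrid;
        this.bronze = bronze;
    }
}

final class OaColorResolver {

    private OaColorResolver() {
    }

    // Each flag is tested exactly once and maps to its own color.
    // Returns null when no flag is set.
    static OaColor resolve(OaFlags f) {
        if (f.gold) {
            return OaColor.gold;
        } else if (f.hybrid) {
            return OaColor.hybrid;
        } else if (f.bronze) {
            return OaColor.bronze;
        }
        return null;
    }
}
```

Keeping the mapping in one small method like this also makes it easy to cover every branch with a unit test.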

dimitris.pierrakos marked this conversation as resolved
dimitris.pierrakos added 1 commit 2023-11-23 09:48:59 +01:00
359e81b7a6 Update StatsAtomicActionsJob.java
Bug fix for duplicate bronze checks
Author
Contributor

> Was this workflow run in isolation? Did it produce any output? It would be good to link a Zeppelin notebook in the PR discussion that provides some statistics on the data produced by this procedure.
>
> Other than that, the code seems fine; please clarify the single comment I left below.

This is a separate workflow that produces the action sets in the output folder.
Please also check the sample [zeppelin notebook](https://iis-cdh5-test-gw.ocean.icm.edu.pl/zeppelin/#/notebook/2JFBJHF5P)

> Please also check the sample [zeppelin notebook](https://iis-cdh5-test-gw.ocean.icm.edu.pl/zeppelin/#/notebook/2JFBJHF5P)

Thanks Dimitris, the notebook is a good start. Can you include a basic statistical analysis that covers the following?

  • total number of records (distinct result ids) available in the actionset
  • frequency of the result ids grouped by the new fields (one group-by per field)
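As a rough illustration of the two statistics requested above, here is a plain-Java sketch over an in-memory list. The Row class and its fields are hypothetical simplifications; in the notebook the same counts would more likely be computed with Spark over the generated actionset.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical, simplified view of one exported record (only two of the
// new fields are modelled here).
final class Row {
    final String id;
    final boolean green;
    final String color; // "gold", "hybrid", "bronze", or null when not set

    Row(String id, boolean green, String color) {
        this.id = id;
        this.green = green;
        this.color = color;
    }
}

final class ActionSetStats {

    private ActionSetStats() {
    }

    // Total number of distinct result ids.
    static long distinctIds(List<Row> rows) {
        return rows.stream().map(r -> r.id).distinct().count();
    }

    // Number of distinct result ids per value of one field (call once per field).
    // Field values are rendered with String.valueOf, so a missing value is
    // counted under the key "null".
    static <K> Map<String, Long> frequencyBy(List<Row> rows, Function<Row, K> field) {
        Map<String, Set<String>> idsByValue = new TreeMap<>();
        for (Row r : rows) {
            idsByValue
                .computeIfAbsent(String.valueOf(field.apply(r)), k -> new HashSet<>())
                .add(r.id);
        }
        Map<String, Long> counts = new TreeMap<>();
        idsByValue.forEach((value, ids) -> counts.put(value, (long) ids.size()));
        return counts;
    }
}
```

For example, `ActionSetStats.frequencyBy(rows, r -> r.color)` gives the per-color counts, and `ActionSetStats.frequencyBy(rows, r -> r.green)` the green/non-green split. Counting distinct ids rather than rows keeps duplicated records from inflating the figures.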

Author
Contributor

> > Please also check the sample [zeppelin notebook](https://iis-cdh5-test-gw.ocean.icm.edu.pl/zeppelin/#/notebook/2JFBJHF5P)
>
> Thanks Dimitris, the notebook is a good start. Can you include a basic statistical analysis that covers the following?
>
>   • total number of records (distinct result ids) available in the actionset
>   • frequency of the result ids grouped by the new fields (one group-by per field)

I added a [Zeppelin notebook](https://iis-cdh5-test-gw.ocean.icm.edu.pl/zeppelin/#/notebook/2JHJSK1A1) with some results.
dimitris.pierrakos added 1 commit 2023-12-01 10:35:09 +01:00
48430a32a6 Update StatsAtomicActionsJob.java
Added indi_funded_result_with_fundref indicator
claudio.atzori merged commit 0c3c9ea43d into beta 2023-12-01 15:03:56 +01:00