Import affiliation relations from Crossref #320
Reference: D-Net/dnet-hadoop#320
Work carried out in the context of task 8876.
Changed title from "WIP: Import affiliation relations from Crossref" to "Import affiliation relations from Crossref".
Documentation for the matching algorithm from Crossref affiliation strings to OpenAIRE's organisations can be found here.
Other than reviewing this PR, @schatz do we expect to run it on BETA asap? If so, we should accelerate the review process and define the remaining steps. Being a new oozie workflow, it requires to…
I added @giambattista.bloisi to help with the review process
Yes, afaik we want to include this in the next BETA round. I have also talked with Miriam regarding the integration and I am pretty sure she has set up a D-Net profile for this. @miriam.baglioni can you confirm this, and also indicate the HDFS directory with the produced actionsets? Thank you
Ok, thanks for confirming that.
I checked myself and found the profile registered on beta. For further reference:
bip-affiliations
I'm going to update the deployment specification: I see the workflow already ran from the D-Net orchestration (here is the related D-Net workflow profile), but it pointed to the oozie workflow deployed in a personal directory.
The WF on beta pointed to the personal directory because the code was part of a branch and not yet integrated; this was done to verify that everything would run smoothly. Once integrated, I will replace the directory with the one specified in the deployment specification file.
I think the PR can be integrated.
The deployment directory specified in the D-Net workflow profile is already updated.
Hi Serafeim,
overall the code looks good. In the subsequent comments I recommend an alternative approach that performs the same business logic with less code and fewer Java classes.
@@ -0,0 +85,4 @@
final JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
// load and parse affiliation relations from HDFS
JavaRDD<AffiliationRelationDeserializer> affiliationRelationsDeserializeRDD = sc
AffiliationRelationDeserializer and AffiliationRelationModel are two classes used to store an intermediate representation of the data that eventually ends up in the generated Relation(s). These two classes leverage Lombok annotations to have a few methods generated automatically.
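For illustration, a hypothetical sketch of what such a Lombok-based holder class could look like; the field names are assumptions for the example, not the actual ones from the PR:

import java.io.Serializable;
import java.util.List;
import lombok.Data;

// Lombok's @Data generates getters, setters, equals/hashCode and toString automatically
@Data
public class AffiliationRelationDeserializer implements Serializable {
	private String doi; // hypothetical field: the publication DOI
	private List<AffiliationRelationModel> matchings; // hypothetical field: the matched organisations
}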
After a talk with @claudio.atzori I discovered that Lombok is only accidentally available: it was pulled in as a transitive dependency by a recent dependency addition, and there was no intention to introduce or promote its usage.
Nevertheless, that was an input to rethink the whole processing and reduce the amount of code needed by leveraging Spark DataFrame functionality.
The schema of the fields to load is declared using a DDL string, thus avoiding the need to define a temporary data-holding Java class (AffiliationRelationDeserializer).
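For illustration, a minimal sketch of the DDL-schema approach; the field names DOI/Matchings/RORid/Confidence and the JSON input layout are assumptions used only for the example:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// declare the load schema as a DDL string instead of a dedicated deserializer class
Dataset<Row> df = spark
	.read()
	.schema("DOI STRING, Matchings ARRAY<STRUCT<RORid: STRING, Confidence: DOUBLE>>")
	.json(inputPath);

// one row per (DOI, ROR id) pair, ready to be mapped to OpenAIRE Relation(s)
Dataset<Row> pairs = df
	.selectExpr("DOI", "explode(Matchings) AS m")
	.selectExpr("DOI", "m.RORid AS rorId", "m.Confidence AS confidence");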
Hi Giambattista,
Thank you very much for your comments! I agree with the approach you propose; it seems cleaner. Not sure why I used Lombok there, old habits :)
@@ -0,0 +148,4 @@
.mapToPair(
	aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
		new Text(OBJECT_MAPPER.writeValueAsString(aa))))
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class);
It is advisable to compress the output file here (using /data/bip-affiliations/data.json as the input, the total disk size of the output file is reduced from 50GB to 1.5GB):
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, GzipCodec.class);
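For completeness, this variant only needs the Gzip codec import in addition to what the diff already uses; a minimal sketch of the relevant imports, assuming the mapred-based output format already used above:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;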
You are right, thanks for pointing that.
@@ -0,0 +1,27 @@
That class can be removed by using the DataFrame API approach.
@@ -0,0 +1,16 @@
This class can be removed by using the DataFrame API approach.
I made the requested changes; if there is no objection, I think this PR is ready to be merged.