Import affiliation relations from Crossref #320

Merged
miriam.baglioni merged 12 commits from 8876 into beta 2023-08-07 10:45:31 +02:00
Member

Work carried out in the context of task [8876](https://support.openaire.eu/issues/8876).
schatz added 5 commits 2023-07-06 19:57:32 +02:00
schatz added 1 commit 2023-07-06 20:07:54 +02:00
schatz added 1 commit 2023-07-17 10:18:13 +02:00
schatz added 1 commit 2023-07-17 15:04:30 +02:00
schatz changed title from WIP: Import affiliation relations from Crossref to Import affiliation relations from Crossref 2023-07-17 15:16:50 +02:00
claudio.atzori was assigned by schatz 2023-07-17 15:17:01 +02:00
schatz requested review from miriam.baglioni 2023-07-17 15:17:08 +02:00
miriam.baglioni was assigned by schatz 2023-07-17 15:17:14 +02:00
Author
Member

Documentation for the matching algorithm from Crossref affiliation strings to OpenAIRE's organisations can be found [here](https://github.com/mkallipo/matching/tree/main#description-of-the-algorithm).
claudio.atzori requested review from giambattista.bloisi 2023-07-27 15:26:45 +02:00

Other than reviewing this PR, @schatz do we expect to run it on BETA asap? If so, we should accelerate the review process and define the remaining steps. Being a new oozie workflow, it requires:

  • inclusion in the [deployment specifications](https://code-repo.d4science.org/D-Net/dnet-hadoop-deployment)
  • a dedicated actionset for its output, which still has to be created; this consists of
    • a profile in the D-Net information system
    • a corresponding directory on HDFS
  • a D-Net workflow responsible for parametrizing and triggering the oozie workflow

I added @giambattista.bloisi to help with the review process

Author
Member

Yes, afaik we want to include this in the next BETA round. I have also talked with Miriam regarding the integration and I am pretty sure she has set up a D-Net profile for this. @miriam.baglioni can you confirm this, and also indicate the HDFS directory with the produced actionsets? Thank you

> Yes, afaik we want to include this in the next BETA round.

Ok, thanks for confirming that.

> I have also talked with Miriam regarding the integration and I am pretty sure she has set up a D-Net profile for this. @miriam.baglioni can you confirm this, and also indicate the HDFS directory with the produced actionsets? Thank you

I checked myself and found the profile registered on beta. For further reference:

  • the actionset id is `bip-affiliations`
  • the URL to the [actionset profile](https://beta.services.openaire.eu/is/mvc/inspector/index.do/db/DRIVER/ActionManagerSetDSResources/ActionManagerSetDSResourceType/e5a00d5d-c878-46e9-abfc-3e288eea29e3/show)
  • the directory where the rawsets are going to be stored is

```
/var/lib/dnet/actionManager_BETA/bip-affiliation
```

I'm going to update the deployment specification, as I see the workflow already ran from the D-Net orchestration ([here is the related D-Net workflow profile](https://beta.services.openaire.eu/is/mvc/ui/isManager.do#/profile/b309aa1c-a320-4c40-adff-5db1e1deb0e0_V29ya2Zsb3dEU1Jlc291cmNlcy9Xb3JrZmxvd0RTUmVzb3VyY2VUeXBl)), but it pointed to the oozie workflow deployed on a personal directory:

```
/user/schatz/actionmanager/bipaffiliations/oozie_app
```

The WF on beta pointed to the personal directory because the code was part of a branch and not yet integrated; this was done to verify that everything would run smoothly. Once integrated, I will change the directory to the one specified in the deployment specification file.

I think the PR can be integrated.


> The WF on beta pointed to the personal directory because the code was part of a branch and not yet integrated; this was done to verify that everything would run smoothly. Once integrated, I will change the directory to the one specified in the deployment specification file.

The deployment directory specified in the D-Net workflow profile is already updated.

schatz added 1 commit 2023-07-28 13:39:35 +02:00
giambattista.bloisi requested changes 2023-07-28 15:05:41 +02:00
giambattista.bloisi left a comment
Member

Hi Serafeim,
overall the code looks good. In the comments below I recommend an alternative approach that performs the same business logic with less code and fewer Java classes.

@ -0,0 +85,4 @@
final JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
// load and parse affiliation relations from HDFS
JavaRDD<AffiliationRelationDeserializer> affiliationRelationsDeserializeRDD = sc

AffiliationRelationDeserializer and AffiliationRelationModel are two classes used to store intermediate representations of the data that eventually end up in the generated Relation(s). Those two classes leverage Lombok annotations to get a few methods generated automatically.
After a talk with @claudio.atzori I discovered that Lombok is only accidentally available, because it was imported as a transitive dependency by a recent dependency addition; there was no intention to introduce or promote its usage.

Nevertheless, this was an incentive to rethink the whole processing and reduce the code needed by leveraging Spark dataframe functionality.

  1. There is a shortcut to load a subset of the fields of a JSON file as a Dataset:
	Dataset<Row> df = spark
			.read()
			.schema("`DOI` STRING, `Matchings` ARRAY<STRUCT<`RORid`:ARRAY<STRING>,`Confidence`:DOUBLE>>")
			.json(inputPath);

the schema of the fields to load is declared using a [DDL](https://vincent.doba.fr/posts/20211004_spark_data_description_language_for_defining_spark_schema/), thus avoiding the need to define a temporary data-holding Java class (AffiliationRelationDeserializer).

  2. The subsequent flatMap that unrolls the nested arrays (Matchings and then RORid) can be replaced by generating 'exploded' columns, thus avoiding the need for AffiliationRelationModel and the flatMap code:
			.withColumn("matching", functions.explode(new Column("Matchings")))
			.withColumn("rorid", functions.explode(new Column("matching.RORid")))
			.select(
				new Column("DOI").as("doi"),
				new Column("rorid"),
				new Column("matching.Confidence").as("confidence"));
  3. At this point the dataframe can be fed to the flatMap as before; the only difference is extracting fields from the Row interface instead of a custom Bean/POJO (a self-contained sketch combining the three steps follows the snippet):
		df
			.toJavaRDD()
			// prepare action sets for affiliation relations
			.flatMap((FlatMapFunction<Row, Relation>) row -> {
				// DOI to OpenAIRE id
				final String paperId = ID_PREFIX
					+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", row.getAs("doi")));
				...
				final String affId = GenerateRorActionSetJob.calculateOpenaireId(row.getAs("rorid"));
				...
				Double.toString(row.getAs("confidence")));
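
To make the suggestion concrete, the following is a minimal, self-contained sketch wiring the three steps together on a single toy record. The class name, the local SparkSession, the inline JSON sample, and the final show() call are illustrative assumptions only; the actual job would read from inputPath and keep the flatMap to Relation from step 3.

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;

public class AffiliationExplodeSketch {

	public static void main(String[] args) {
		// local session, only for the sake of the sketch
		SparkSession spark = SparkSession
			.builder()
			.appName("affiliation-explode-sketch")
			.master("local[*]")
			.getOrCreate();

		// one toy record mirroring the expected input shape
		Dataset<String> json = spark
			.createDataset(
				Arrays.asList(
					"{\"DOI\":\"10.1234/abc\",\"Matchings\":[{\"RORid\":[\"https://ror.org/05gq02987\"],\"Confidence\":0.94}]}"),
				Encoders.STRING());

		Dataset<Row> df = spark
			.read()
			// step 1: DDL-declared schema, no deserializer class needed
			.schema("`DOI` STRING, `Matchings` ARRAY<STRUCT<`RORid`:ARRAY<STRING>,`Confidence`:DOUBLE>>")
			.json(json)
			// step 2: one row per matching, then one row per ROR id
			.withColumn("matching", functions.explode(functions.col("Matchings")))
			.withColumn("rorid", functions.explode(functions.col("matching.RORid")))
			.select(
				functions.col("DOI").as("doi"),
				functions.col("rorid"),
				functions.col("matching.Confidence").as("confidence"));

		// step 3 would flatMap each Row into Relation(s); here we just show the flat rows
		df.show(false); // doi | rorid | confidence

		spark.stop();
	}
}
```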
Author
Member

Hi Giambattista,
Thank you very much for your comments! I agree with the approach you propose; it seems cleaner. Not sure why I used Lombok there, old habits :)

@ -0,0 +148,4 @@
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))))
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class);

It is advisable to compress the output file here (using /data/bip-affiliations/data.json as the input, the total disk size of the output is reduced from 50GB to 1.5GB):
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, GzipCodec.class);
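
For reference, a minimal runnable sketch of the compressed variant, assuming toy (class name, payload) pairs and a made-up local output path; the only substantive difference from the uncompressed call is the extra GzipCodec argument.

```java
import java.util.Arrays;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class CompressedActionsetSketch {

	public static void main(String[] args) {
		JavaSparkContext sc = new JavaSparkContext(
			new SparkConf().setAppName("compressed-actionset-sketch").setMaster("local[*]"));

		// toy (class name, serialized payload) pairs standing in for the real atomic actions
		JavaPairRDD<Text, Text> pairs = sc
			.parallelize(Arrays.asList(
				new Tuple2<>("eu.dnetlib.dhp.schema.oaf.Relation", "{\"relClass\":\"hasAuthorInstitution\"}")))
			.mapToPair(t -> new Tuple2<>(new Text(t._1()), new Text(t._2())));

		// the extra codec argument block-compresses the sequence file on write
		pairs.saveAsHadoopFile(
			"/tmp/actionset-compressed", Text.class, Text.class,
			SequenceFileOutputFormat.class, GzipCodec.class);

		sc.stop();
	}
}
```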

Author
Member

You are right, thanks for pointing that out.

schatz marked this conversation as resolved
@ -0,0 +1,27 @@

That class can be removed by using the dataframe API approach.

schatz marked this conversation as resolved
@ -0,0 +1,16 @@

This class can be removed by using the dataframe API approach.

schatz marked this conversation as resolved
schatz added 2 commits 2023-07-28 18:04:06 +02:00
schatz added 1 commit 2023-07-28 18:14:45 +02:00
schatz requested review from giambattista.bloisi 2023-07-28 18:39:31 +02:00
Author
Member

I made the requested changes; if there is no objection, I think that this PR is ready to be merged.

miriam.baglioni merged commit 18fbe52b20 into beta 2023-08-07 10:45:31 +02:00