Import affiliation relations from Crossref #320
Reference: D-Net/dnet-hadoop#320
Work carried out in the context of task 8876.
Changed title from "WIP: Import affiliation relations from Crossref" to "Import affiliation relations from Crossref".
Documentation for the matching algorithm from Crossref affiliation strings to OpenAIRE's organisations can be found here.
Other than reviewing this PR, @schatz do we expect to run it on BETA asap? If so, we should accelerate the review process and define the remaining steps. Being a new oozie workflow, it requires to…
I added @giambattista.bloisi to help with the review process
Yes, afaik we want to include this in the next BETA round. I have also talked with Miriam regarding the integration and I am pretty sure she has set up a D-Net profile for this. @miriam.baglioni can you confirm this, and also indicate the HDFS directory with the produced actionsets? Thank you
Ok, thanks for confirming that.
I checked myself and found the profile registered on beta. For further reference:
bip-affiliations
I'm going to update the deployment specification: I see the workflow already ran from the D-Net orchestration (here is the related D-Net workflow profile), but it pointed to the oozie workflow deployed in a personal directory.
The WF on beta pointed to the personal directory because the code was part of a branch and not yet integrated; this was done to verify that everything would run smoothly. Once integrated, I will replace the directory with the one specified in the deployment specification file.
I think the PR can be integrated.
The deployment directory specified in the D-Net workflow profile is already updated.
Hi Serafeim,
overall the code looks good. In the subsequent comments I recommend an alternative approach that performs the same business logic with less code and fewer Java classes.
@@ -0,0 +85,4 @@
final JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
// load and parse affiliation relations from HDFS
JavaRDD<AffiliationRelationDeserializer> affiliationRelationsDeserializeRDD = sc
AffiliationRelationDeserializer and AffiliationRelationModel are two classes used to store an intermediate representation of the data that eventually ends up in the generated Relation(s). These two classes leverage Lombok annotations to have a few methods generated automatically.
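For illustration, a hypothetical sketch of what such a Lombok-based holder class could look like; the field names are assumptions for the example, not the actual ones from the PR:

import java.io.Serializable;
import java.util.List;
import lombok.Data;

// Lombok's @Data generates getters, setters, equals/hashCode and toString automatically
@Data
public class AffiliationRelationDeserializer implements Serializable {
	private String doi; // hypothetical field: the publication DOI
	private List<AffiliationRelationModel> matchings; // hypothetical field: the matched organisations
}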
After a talk with @claudio.atzori I discovered that Lombok is only accidentally available: it was pulled in as a transitive dependency by a recent dependency addition, and there was no intention to introduce or promote its usage.
Nevertheless, that was an input to rethink the whole processing and reduce the amount of code needed by leveraging Spark DataFrame functionality.
The schema of the fields to load is declared using a DDL string, thus avoiding the need to define a temporary data-holding Java class (AffiliationRelationDeserializer).
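For illustration, a minimal sketch of the DDL-schema approach; the field names DOI/Matchings/RORid/Confidence and the JSON input layout are assumptions used only for the example:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// declare the load schema as a DDL string instead of a dedicated deserializer class
Dataset<Row> df = spark
	.read()
	.schema("DOI STRING, Matchings ARRAY<STRUCT<RORid: STRING, Confidence: DOUBLE>>")
	.json(inputPath);

// one row per (DOI, ROR id) pair, ready to be mapped to OpenAIRE Relation(s)
Dataset<Row> pairs = df
	.selectExpr("DOI", "explode(Matchings) AS m")
	.selectExpr("DOI", "m.RORid AS rorId", "m.Confidence AS confidence");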
Hi Giambattista,
Thank you very much for your comments! I agree with the approach you propose; it seems cleaner. Not sure why I used Lombok there, old habits :)
@@ -0,0 +148,4 @@
.mapToPair(
	aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
		new Text(OBJECT_MAPPER.writeValueAsString(aa))))
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class);
It is advisable to compress the output file here (using /data/bip-affiliations/data.json as the input, the total disk size of the output file is reduced from 50GB to 1.5GB):
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, GzipCodec.class);
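For completeness, this variant only needs the Gzip codec import in addition to what the diff already uses; a minimal sketch of the relevant imports, assuming the mapred-based output format already used above:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;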
You are right, thanks for pointing that.
@@ -0,0 +1,27 @@
That class can be removed by using the DataFrame API approach.
@@ -0,0 +1,16 @@
This class can be removed by using the DataFrame API approach.
I made the requested changes; if there is no objection, I think this PR is ready to be merged.