ORCID Enrichment and Download #364

sandro.labruzzo · 2023-11-30T15:40:38+01:00

sandro.labruzzo commented

2023-11-30 15:40:38 +01:00

This Pull Request refactors the code that downloads the ORCID dump. The previous version made poor use of the cluster parallelism and took more than 20 hours to generate the ORCID dataframe from the dump, while the new one takes 4 hours.

In addition, a process for enriching the entire graph with authors from ORCID has been defined. This process computes the intersection between the PIDs of the graph entities and those of the ORCID works. It then applies a heuristic that tries to associate the right author based on the name. The results are described in ticket [#9118](https://support.openaire.eu/issues/9118).
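As a rough illustration of the name-based association, the sketch below checks for an exact match on given name and family name, also accepting the case where the two are swapped (as mentioned in the commit messages of this PR). It is only a minimal sketch under assumptions: the class `OrcidNameMatch`, the methods `norm` and `sameAuthor`, and the normalisation (trim plus lower-casing) are hypothetical and are not taken from `AuthorMerger`.

```java
import java.util.Locale;

// Hypothetical helper, NOT the AuthorMerger code from this PR.
class OrcidNameMatch {

    // Assumed normalisation: null becomes empty, then trim and lower-case.
    static String norm(String s) {
        return s == null ? "" : s.trim().toLowerCase(Locale.ROOT);
    }

    // Returns true when name and surname match exactly,
    // either in the same order or with the two fields swapped.
    static boolean sameAuthor(String graphName, String graphSurname,
                              String orcidGiven, String orcidFamily) {
        String gn = norm(graphName);
        String gs = norm(graphSurname);
        String on = norm(orcidGiven);
        String of = norm(orcidFamily);

        // Without both parts there is nothing reliable to compare.
        if (gn.isEmpty() || gs.isEmpty())
            return false;

        boolean sameOrder = gn.equals(on) && gs.equals(of);
        boolean swapped = gn.equals(of) && gs.equals(on);
        return sameOrder || swapped;
    }

    public static void main(String[] args) {
        System.out.println(sameAuthor("Sandro", "La Bruzzo", "sandro", "la bruzzo"));
        System.out.println(sameAuthor("La Bruzzo", "Sandro", "Sandro", "La Bruzzo"));
        System.out.println(sameAuthor("S.", "La Bruzzo", "Sandro", "La Bruzzo"));
    }
}
```

In this sketch an initial such as "S." does not match "Sandro", which mirrors the move from a first-character check to an exact match described in the commits.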

sandro.labruzzo added 10 commits 2023-11-30 15:40:39 +01:00

6ce36b3e41 Implemented ORCID Workflow on DHP-Aggregation for retrieving ORCID DUMP and generating tables

34a4b3cbdf Implemented ORCID Enrichment

6f4d0c05ea Implemented Author MErger for ORCID that takes in account the case when name and surname are swapped

59111713fa added comment

aa239ec673 Changed implementation of check similarity to verify exact match of name instead of the first char

279100fa52 added test

7b5e04f37e removed Orcid intersection on DOIBoost

f718caaac9 Added copy of the untouched entities of the graph

5e22b67b8a Merge remote-tracking branch 'origin/beta' into orcid_import

cdfb7588dd code formatting

claudio.atzori was assigned by sandro.labruzzo

2023-11-30 15:41:10 +01:00

alessia.bardi was assigned by sandro.labruzzo

2023-11-30 15:41:10 +01:00

miriam.baglioni was assigned by sandro.labruzzo

2023-11-30 15:41:10 +01:00

giambattista.bloisi was assigned by sandro.labruzzo

2023-11-30 15:41:10 +01:00

claudio.atzori requested review from giambattista.bloisi 2023-12-01 10:53:05 +01:00

giambattista.bloisi requested changes 2023-12-01 11:32:09 +01:00

giambattista.bloisi left a comment

Looks OK, please check the comment on hammingDist.

dhp-common/src/main/java/eu/dnetlib/dhp/oa/merge/AuthorMerger.java

						
@ -196,0 +134,4 @@
    static int hammingDist(String str1, String str2) {
        if (str1.length() != str2.length())
            return Math.max(str1.length(), str2.length());

giambattista.bloisi commented

2023-12-01 11:08:55 +01:00

Out of context, this seems to penalize strings of different lengths too heavily. What about using the shorter string as the reference for the while loop, prefilling the count with the length difference, and then adding the char-by-char comparison difference? That would be more permissive for strings that share the very same prefix.
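The suggestion above could be sketched roughly as follows. This is only an illustration of the proposed variant, not code from the PR: the class name `PrefixTolerantDistance` is hypothetical, and a `for` loop is used where the comment mentions a while loop.

```java
// Hypothetical sketch of the variant proposed in the review comment.
class PrefixTolerantDistance {

    static int hammingDist(String str1, String str2) {
        // Use the shorter string as the reference for the loop.
        boolean firstIsShorter = str1.length() <= str2.length();
        String shorter = firstIsShorter ? str1 : str2;
        String longer = firstIsShorter ? str2 : str1;

        // Prefill the count with the length difference.
        int count = longer.length() - shorter.length();

        // Add one for every position where the characters differ.
        for (int i = 0; i < shorter.length(); i++) {
            if (shorter.charAt(i) != longer.charAt(i))
                count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(hammingDist("labruzzo", "labruzzo")); // 0
        System.out.println(hammingDist("sandro", "sandr"));      // 1
        System.out.println(hammingDist("abc", "xbcde"));         // 3
    }
}
```

For comparison, the version in the diff would return 6 for "sandro" vs "sandr", because any length mismatch yields the length of the longer string.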

sandro.labruzzo commented

2023-12-01 11:53:26 +01:00

The hammingDist function is never used; it was a test for the previous comparison function, so I'll delete it.

dhp-workflows/dhp-graph-mapper/src/main/scala/eu/dnetlib/dhp/enrich/orcid/SparkEnrichGraphWithOrcidAuthors.scala Outdated

						
@ -0,0 +21,4 @@
    val targetPath = parser.get("targetPath")
    log.info(s"targetPath is '$targetPath'")
    val orcidPublication: Dataset[Row] = generateOrcidTable(spark, orcidPath)
    enrichResult(

giambattista.bloisi commented

2023-12-01 11:22:46 +01:00

This can be transformed into a loop using ModelSupport.entityTypes, filtering non-result types

dhp-workflows/dhp-graph-mapper/src/main/scala/eu/dnetlib/dhp/enrich/orcid/SparkEnrichGraphWithOrcidAuthors.scala

						
@ -0,0 +63,4 @@
      .schema(enc.schema)
      .json(graphPath)
      .select(col("id"), col("datainfo"), col("instance"))
      .where("datainfo.deletedbyinference = false")

giambattista.bloisi commented

2023-12-01 11:24:07 +01:00

datainfo.deletedbyinference != true will take care of the case where datainfo is null

sandro.labruzzo commented

2023-12-01 12:15:06 +01:00

Thank you @giambattista.bloisi, I'll update the code

dhp-workflows/dhp-graph-mapper/src/main/scala/eu/dnetlib/dhp/enrich/orcid/SparkEnrichGraphWithOrcidAuthors.scala

						
@ -0,0 +109,4 @@
      .load(s"$inputPath/Works")
      .select(col("orcid"), explode(col("pids")).alias("identifier"))
      .where(
        "identifier.schema = 'doi' or identifier.schema ='pmid' or identifier.schema ='pmc' or identifier.schema ='arxiv' or identifier.schema ='handle'"

giambattista.bloisi commented

2023-12-01 11:27:11 +01:00

a shorter form is identifier.schema IN ('doi', 'pmid', ...)

sandro.labruzzo commented

2023-12-01 12:15:17 +01:00

Thank you @giambattista.bloisi, I'll update the code

sandro.labruzzo added 1 commit 2023-12-01 12:16:47 +01:00

bf0fd27c36 Removed unused function

Applied PR Comment of Giambattista in the PR

claudio.atzori added 1 commit 2023-12-01 12:28:16 +01:00

622fafbd2e Merge branch 'beta' into orcid_import

claudio.atzori added 1 commit 2023-12-01 15:03:09 +01:00

33cb483c75 using objectSubType as originalType in Crossref2Oaf, code formatting

claudio.atzori added 1 commit 2023-12-01 15:05:36 +01:00

09d061e90b Merge branch 'beta' into orcid_import

claudio.atzori merged commit c5ac593c07 into beta

2023-12-01 15:05:45 +01:00

claudio.atzori referenced this issue from a commit

2023-12-01 15:05:45 +01:00

Merge pull request 'ORCID Enrichment and Download' (#364) from orcid_import into beta

claudio.atzori referenced this pull request

2023-12-15 10:15:25 +01:00

Master branch updates from beta December 2023 #369

Reference: D-Net/dnet-hadoop#364