ORCID Enrichment and Download #364

Merged
claudio.atzori merged 14 commits from orcid_import into beta 2023-12-01 15:05:45 +01:00

This pull request refactors the code that downloads the ORCID dump.
The previous version made poor use of the cluster's parallelism and took
more than 20 hours to generate the ORCID dataframe from the dump, while the new one takes 4 hours.

In addition, it defines a process that enriches the entire graph with authors from ORCID.
The process intersects the PIDs of the graph entities with those of the ORCID works,
then applies a heuristic that tries to associate the right author based on the name.
The results are described in ticket [#9118](https://support.openaire.eu/issues/9118).
sandro.labruzzo added 10 commits 2023-11-30 15:40:39 +01:00
claudio.atzori was assigned by sandro.labruzzo 2023-11-30 15:41:10 +01:00
alessia.bardi was assigned by sandro.labruzzo 2023-11-30 15:41:10 +01:00
miriam.baglioni was assigned by sandro.labruzzo 2023-11-30 15:41:10 +01:00
giambattista.bloisi was assigned by sandro.labruzzo 2023-11-30 15:41:10 +01:00
claudio.atzori requested review from giambattista.bloisi 2023-12-01 10:53:05 +01:00
giambattista.bloisi requested changes 2023-12-01 11:32:09 +01:00
giambattista.bloisi left a comment
Member

Looks OK, please check the comment on hammingDist
@ -196,0 +134,4 @@
static int hammingDist(String str1, String str2) {
if (str1.length() != str2.length())
return Math.max(str1.length(), str2.length());

Out of context, this seems to penalize strings of different lengths too heavily.
What about using the shorter string as the reference for the while loop, pre-filling the count with the length difference, and then adding the char-by-char comparison difference?
That would be more permissive toward strings that share the very same prefix.
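
The suggestion above can be sketched as follows. This is only an illustration of the reviewer's idea, not the code in the PR: the class name `HammingSketch` is hypothetical, and only the method signature is taken from the diff.

```java
final class HammingSketch {

    // Length-tolerant variant of the distance discussed in the review:
    // start from the length difference, then count mismatching characters
    // over the common prefix length (i.e. the length of the shorter string).
    static int hammingDist(String str1, String str2) {
        int minLen = Math.min(str1.length(), str2.length());
        int count = Math.abs(str1.length() - str2.length());
        int i = 0;
        while (i < minLen) {
            if (str1.charAt(i) != str2.charAt(i)) {
                count++;
            }
            i++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(hammingDist("labruzzo", "labruzzos"));
        System.out.println(hammingDist("smith", "smyth"));
    }
}
```

With this variant, "labruzzo" and "labruzzos" are at distance 1, whereas the early return shown in the diff would report 9, the length of the longer string.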
Author
Owner

The hammingDist function is never used; it was a test of a previous comparison function, so I will delete it
@ -0,0 +21,4 @@
val targetPath = parser.get("targetPath")
log.info(s"targetPath is '$targetPath'")
val orcidPublication: Dataset[Row] = generateOrcidTable(spark, orcidPath)
enrichResult(

This can be turned into a loop over ModelSupport.entityTypes, filtering out the non-result types
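
A rough sketch of the loop being suggested. To stay self-contained it does not use the real `ModelSupport.entityTypes` map: the two enums below are hypothetical stand-ins for it, and the per-type sub-path layout (`graphPath/<type>`) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class EntityLoopSketch {

    // Hypothetical stand-in for the main entity categories.
    enum MainEntityType { result, datasource, organization, project }

    // Hypothetical stand-in for the entity type -> main type mapping.
    enum EntityType {
        publication(MainEntityType.result),
        dataset(MainEntityType.result),
        otherresearchproduct(MainEntityType.result),
        software(MainEntityType.result),
        datasource(MainEntityType.datasource),
        organization(MainEntityType.organization),
        project(MainEntityType.project);

        final MainEntityType mainType;

        EntityType(MainEntityType mainType) {
            this.mainType = mainType;
        }
    }

    // Collects one input path per result type, skipping non-result types.
    // In the real job, each iteration would call the enrichment step
    // instead of just building a path.
    static List<String> resultPaths(String graphPath) {
        List<String> paths = new ArrayList<>();
        for (EntityType type : EntityType.values()) {
            if (type.mainType == MainEntityType.result) {
                paths.add(graphPath + "/" + type.name());
            }
        }
        return paths;
    }
}
```

The point of the loop is that a newly added result type is picked up without another hand-written call.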
@ -0,0 +63,4 @@
.schema(enc.schema)
.json(graphPath)
.select(col("id"), col("datainfo"), col("instance"))
.where("datainfo.deletedbyinference = false")

datainfo.deletedbyinference != true will take care of the case where datainfo is null
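
One caveat worth verifying on real data: Spark SQL follows SQL three-valued logic, so a comparison with a NULL operand evaluates to NULL, and a WHERE clause keeps only rows whose predicate is TRUE. Under those rules, `!= true` would drop rows with a null datainfo just as `= false` does. A null-safe form such as `NOT (datainfo.deletedbyinference <=> true)` or `coalesce(datainfo.deletedbyinference, false) = false` would keep them. The sketch below models these semantics in plain Java with a nullable `Boolean`; the class and method names are hypothetical and nothing here calls Spark.

```java
final class NullFilterSketch {

    // SQL '=': any NULL operand makes the result NULL (unknown).
    static Boolean sqlEquals(Boolean a, Boolean b) {
        return (a == null || b == null) ? null : a.equals(b);
    }

    // SQL '!=': the negation of '=', still NULL when an operand is NULL.
    static Boolean sqlNotEquals(Boolean a, Boolean b) {
        Boolean eq = sqlEquals(a, b);
        return eq == null ? null : !eq;
    }

    // Null-safe equality ('<=>' in Spark SQL): never returns NULL.
    static boolean nullSafeEquals(Boolean a, Boolean b) {
        return a == null ? b == null : a.equals(b);
    }

    // WHERE keeps a row only when the predicate is exactly TRUE.
    static boolean whereKeeps(Boolean predicate) {
        return Boolean.TRUE.equals(predicate);
    }

    public static void main(String[] args) {
        Boolean deleted = null; // models a row whose datainfo is null
        System.out.println(whereKeeps(sqlEquals(deleted, false)));
        System.out.println(whereKeeps(sqlNotEquals(deleted, true)));
        System.out.println(whereKeeps(!nullSafeEquals(deleted, true)));
    }
}
```

In this model only the null-safe predicate keeps the row with a null flag; the two plain comparisons both filter it out.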
Author
Owner

Thank you @giambattista.bloisi, I'll update the code
@ -0,0 +109,4 @@
.load(s"$inputPath/Works")
.select(col("orcid"), explode(col("pids")).alias("identifier"))
.where(
"identifier.schema = 'doi' or identifier.schema ='pmid' or identifier.schema ='pmc' or identifier.schema ='arxiv' or identifier.schema ='handle'"

A shorter form is identifier.schema IN ('doi', 'pmid', ...)
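
As an illustration, the IN predicate can be generated from the list of schemas in the original filter (doi, pmid, pmc, arxiv, handle) instead of being written by hand. The helper below is a hypothetical sketch, not code from the PR; it only builds the predicate string that would be passed to `.where(...)`.

```java
import java.util.List;
import java.util.stream.Collectors;

final class SchemaFilterSketch {

    // Builds "<column> IN ('v1', 'v2', ...)" from a list of literal values.
    // Single quotes inside values are doubled, the usual SQL escaping.
    static String inPredicate(String column, List<String> values) {
        if (values.isEmpty()) {
            // "IN ()" is not valid SQL, so reject an empty list up front.
            throw new IllegalArgumentException("values must not be empty");
        }
        return values.stream()
            .map(v -> "'" + v.replace("'", "''") + "'")
            .collect(Collectors.joining(", ", column + " IN (", ")"));
    }
}
```

Keeping the schemas in one list means that adding a PID type is a one-word change rather than another `or` clause.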
Author
Owner

Thank you @giambattista.bloisi, I'll update the code
sandro.labruzzo added 1 commit 2023-12-01 12:16:47 +01:00
bf0fd27c36 Removed unused function
Applied PR Comment of Giambattista in the PR
claudio.atzori added 1 commit 2023-12-01 12:28:16 +01:00
claudio.atzori added 1 commit 2023-12-01 15:03:09 +01:00
claudio.atzori added 1 commit 2023-12-01 15:05:36 +01:00
claudio.atzori merged commit c5ac593c07 into beta 2023-12-01 15:05:45 +01:00