ORCID Enrichment and Download #364
Reference: D-Net/dnet-hadoop#364
This pull request refactors the code that downloads the ORCID dump.
The previous version made poor use of the cluster's parallelism: generating the ORCID dataframe
from the dump took more than 20 hours, while the new version takes 4 hours.
In addition, it defines a process that enriches the entire graph with authors from ORCID.
This process intersects the PIDs of all graph entities with those of the ORCID works.
It then applies a heuristic that tries to associate the right author based on the name.
The results are described in ticket #9118.
Looks OK, please check the comment on hammingDist.
@ -196,0 +134,4 @@
static int hammingDist(String str1, String str2) {
if (str1.length() != str2.length())
return Math.max(str1.length(), str2.length());
Out of context, this seems to penalize strings of different lengths too heavily.
What about iterating over the shorter string in the while loop, initializing count with the length difference, and then adding the char-by-char differences?
That would be more permissive for strings that share the very same prefix.
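For illustration, the suggestion could look roughly like the sketch below. It is not code from the PR: the class name HammingSketch and the method name lenientHammingDist are hypothetical, and a for loop stands in for the original while loop.

```java
public class HammingSketch {

    // Hypothetical variant of hammingDist: the count starts at the length
    // difference, then one is added for every position (within the shorter
    // string) where the characters differ.
    static int lenientHammingDist(String str1, String str2) {
        int shorter = Math.min(str1.length(), str2.length());
        int count = Math.abs(str1.length() - str2.length());
        for (int i = 0; i < shorter; i++) {
            if (str1.charAt(i) != str2.charAt(i)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "smith" is a prefix of "smithson": only the 3 extra characters count.
        System.out.println(lenientHammingDist("smith", "smithson"));
        // Equal lengths: behaves like the plain Hamming distance.
        System.out.println(lenientHammingDist("karolin", "kathrin"));
    }
}
```

With the original early return, "smith" and "smithson" would score 8 (the longer length); with this variant they score 3.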
The hammingDist function is never used; it was a test of a previous comparison function, so I have deleted it.
@ -0,0 +21,4 @@
val targetPath = parser.get("targetPath")
log.info(s"targetPath is '$targetPath'")
val orcidPublication: Dataset[Row] = generateOrcidTable(spark, orcidPath)
enrichResult(
This can be transformed into a loop over ModelSupport.entityTypes, filtering out the non-result types.
@ -0,0 +63,4 @@
.schema(enc.schema)
.json(graphPath)
.select(col("id"), col("datainfo"), col("instance"))
.where("datainfo.deletedbyinference = false")
datainfo.deletedbyinference != true will take care of the case where datainfo is null
Thank you @giambattista.bloisi, I'll update the code.
@ -0,0 +109,4 @@
.load(s"$inputPath/Works")
.select(col("orcid"), explode(col("pids")).alias("identifier"))
.where(
"identifier.schema = 'doi' or identifier.schema ='pmid' or identifier.schema ='pmc' or identifier.schema ='arxiv' or identifier.schema ='handle'"
A shorter form is identifier.schema IN ('doi', 'pmid', ...)
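As a sketch of that shorter form, the predicate string could be built from a list of schemas instead of being written out by hand. The class PidSchemaFilter and the helper inPredicate are hypothetical names, not part of the PR; the schema values are the ones listed in the original condition.

```java
import java.util.List;
import java.util.stream.Collectors;

public class PidSchemaFilter {

    // Schemas taken from the original where(...) condition in the diff.
    static final List<String> SCHEMAS = List.of("doi", "pmid", "pmc", "arxiv", "handle");

    // Hypothetical helper: renders "<column> IN ('a', 'b', ...)".
    // Values are assumed to be trusted constants; no quote escaping is done.
    static String inPredicate(String column, List<String> values) {
        return values.stream()
            .map(v -> "'" + v + "'")
            .collect(Collectors.joining(", ", column + " IN (", ")"));
    }

    public static void main(String[] args) {
        // The resulting string can be passed to where(...) in place of the chained ORs.
        System.out.println(inPredicate("identifier.schema", SCHEMAS));
    }
}
```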
Thank you @giambattista.bloisi, I'll update the code.