[WebCrawl] adding affiliation relations from web information #428
Reference: D-Net/dnet-hadoop#428
This PR adds affiliation relations derived from web resources. It takes everything for IE, and only data up to 2021 for the other countries.
Please address the inline comments.
@ -0,0 +78,4 @@
spark -> {
createActionSet(spark, inputPath, outputPath + "actionSet");
createPlainRelations(spark, inputPath, outputPath + "relations");
The output path is controlled by the actionset management framework and should not be altered by the workflow. Hence, consider placing the plain relations elsewhere, under a different root directory. Also, the output of the createActionSet function should be placed directly inside outputPath.
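To illustrate the suggestion, here is a minimal sketch of how the two destinations could be kept apart. The OutputPaths class, its methods, and the workingDir parameter are hypothetical names that are not part of this PR; the only point is that the action set lands directly in outputPath while the plain relations live under a separate root.

```java
public final class OutputPaths {

    private OutputPaths() {
    }

    // The action set is written directly to the framework-controlled path, unchanged.
    static String actionSetPath(String outputPath) {
        return outputPath;
    }

    // The plain relations go under a separate root (hypothetical workingDir parameter).
    static String relationsPath(String workingDir) {
        String root = workingDir.endsWith("/") ? workingDir : workingDir + "/";
        return root + "relations";
    }

    public static void main(String[] args) {
        // Example values only; real paths come from the workflow configuration.
        System.out.println(actionSetPath("/data/actionsets/webcrawl"));
        System.out.println(relationsPath("/tmp/webcrawl"));
    }
}
```

With something along these lines, the calls would become createActionSet(spark, inputPath, outputPath) and createPlainRelations(spark, inputPath, relationsPath(workingDir)), assuming an extra workflow parameter is acceptable.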
@ -0,0 +197,4 @@
return createAffiliatioRelationPair(
PMCID_PREFIX
+ IdentifierFactory
.md5(PidCleaner.normalizePidValue(PidType.pmc.toString(), "PMC" + pmcid.substring(43))),
It seems that the PIDs found in the input dataset include their resolver prefix. Since we need them without it, it would be better to make this manipulation explicit rather than relying on a hard-coded substring offset.
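One possible way to make the manipulation explicit is sketched below. The PidResolverStripper class, the stripResolver method, and the PMC_RESOLVER_PREFIX value are all hypothetical: the actual resolver prefix present in the input dataset is not shown in this thread, so the constant is a placeholder that would need to be replaced with the real one.

```java
public final class PidResolverStripper {

    // Placeholder prefix for illustration only; NOT the real resolver found in the dataset.
    static final String PMC_RESOLVER_PREFIX = "https://resolver.example.org/pmc/PMC";

    private PidResolverStripper() {
    }

    // Removes a known resolver prefix and fails loudly if the value does not start with it,
    // instead of silently cutting at a fixed character offset.
    static String stripResolver(String value, String resolverPrefix) {
        if (value == null || !value.startsWith(resolverPrefix)) {
            throw new IllegalArgumentException("Unexpected PID format: " + value);
        }
        return value.substring(resolverPrefix.length());
    }

    public static void main(String[] args) {
        String raw = PMC_RESOLVER_PREFIX + "1234567";
        // Rebuild the bare identifier in the same "PMC" + digits form used in the diff.
        System.out.println("PMC" + stripResolver(raw, PMC_RESOLVER_PREFIX));
    }
}
```

Compared with pmcid.substring(43), a named prefix documents what is being removed and surfaces malformed input as an error rather than as a wrong identifier.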