Orcid Update Procedure #394
Reference: D-Net/dnet-hadoop#394
This pull request implements a new procedure to update the ORCID table from the ORCID APIs. The previous one was overly complicated: it used Spark executors to download from the APIs and offered no way to control the request rate.
The new procedure creates a number of threads lower than the ORCID API's per-second request rate limit, and each thread sleeps when needed so that it never makes more than one request per second.
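A minimal sketch of the threading model as I read it from the description above; the class name, constants, queue size, and `download` method are illustrative placeholders, not the actual code in this PR:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RateLimitedWorkers {

    // Hypothetical values: in the real workflow these are set by the caller.
    static final int REQUESTS_PER_SECOND = 24; // assumed ORCID API rate limit
    static final int NUM_WORKERS = 22;         // strictly below the limit

    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        for (int i = 0; i < NUM_WORKERS; i++) {
            new Thread(() -> {
                try {
                    while (true) {
                        String orcidId = queue.take();
                        long start = System.currentTimeMillis();
                        download(orcidId); // one API request
                        long elapsed = System.currentTimeMillis() - start;
                        // Pad each iteration to a full second: with NUM_WORKERS
                        // threads each making at most one request per second,
                        // the overall rate stays below REQUESTS_PER_SECOND.
                        if (elapsed < 1000) {
                            Thread.sleep(1000 - elapsed);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
        // ... a producer fills the queue and later shuts the workers down ...
    }

    static void download(String orcidId) {
        // call the ORCID API here
    }
}
```

In this reading, the caller controls the request limit simply by choosing the number of worker threads, since each thread is capped at one request per second.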
The new procedure is more efficient and easier to use than the previous version.
Furthermore, before applying the updates, the procedure checks that the record count does not decrease with respect to the original table; if it does, an exception is raised.
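The guard can be as simple as comparing record counts before overwriting the table. A sketch, assuming the tables are read as parquet; the class, method, and path parameters are hypothetical:

```java
import org.apache.spark.sql.SparkSession;

public class OrcidUpdateGuard {

    // Hypothetical check: refuse to apply the update if the new table
    // has fewer records than the one it would replace.
    public static void checkNoShrinkage(SparkSession spark, String originalPath, String updatedPath) {
        long original = spark.read().parquet(originalPath).count();
        long updated = spark.read().parquet(updatedPath).count();
        if (updated < original) {
            throw new IllegalStateException(String.format(
                "Updated ORCID table has %d records, fewer than the original %d: aborting the update",
                updated, original));
        }
    }
}
```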
Can you summarize how the caller of the workflow can control the request limit?
@@ -0,0 +155,4 @@
throw new RuntimeException(e);
}
});
queue.put(ORCIDWorker.JOB_COMPLETE);
I read this line as a request to shut down one worker.
I would expect instead to have 22 of those puts, to shut down all workers. Also, I would expect this to be done at the end of the outermost while loop, that is, before joining the workers.
When an ORCIDWorker encounters a JOB_COMPLETE message, it re-inserts the same object back into the queue before it closes itself. Do you think this behavior is incorrect, or could it lead to deadlocks or other types of bugs?
A deadlock is possible if this thread (or, in general, any producer of messages) continues to put other messages after putting the JOB_COMPLETE message. What could happen is that the last worker takes the last re-routed JOB_COMPLETE message and is pre-empted before re-inserting it, while in the meantime the producers completely fill the queue: at that point the re-insert attempt will block forever, because the queue is full and no one is consuming from it.
If I'm not reading the code wrongly, the JOB_COMPLETE should in this case also be put outside the while loop, to cover the cases where the tar is empty or its last entry is not a file.
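For reference, a minimal sketch of the shutdown protocol suggested above; the class name, queue size, and helper methods are hypothetical, only the JOB_COMPLETE sentinel and the 22 workers come from the discussion:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ShutdownSketch {

    static final String JOB_COMPLETE = "JOB_COMPLETE";
    static final int NUM_WORKERS = 22;

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

        Thread[] workers = new Thread[NUM_WORKERS];
        for (int i = 0; i < NUM_WORKERS; i++) {
            workers[i] = new Thread(() -> {
                try {
                    while (true) {
                        String item = queue.take();
                        // Each worker consumes exactly one sentinel and stops;
                        // nothing is re-inserted, so a full queue can never
                        // block a worker holding the last sentinel.
                        if (JOB_COMPLETE.equals(item)) {
                            return;
                        }
                        process(item);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }

        // Producer: enqueue all the real work first ...
        for (String id : readIdsFromTar()) {
            queue.put(id);
        }
        // ... then, outside the read loop (so it also runs when the tar is
        // empty or ends with a non-file entry), enqueue one sentinel per
        // worker before joining them.
        for (int i = 0; i < NUM_WORKERS; i++) {
            queue.put(JOB_COMPLETE);
        }
        for (Thread w : workers) {
            w.join();
        }
    }

    static void process(String id) {
        // download and store one record
    }

    static String[] readIdsFromTar() {
        return new String[0]; // placeholder for reading entries from the tar
    }
}
```

With one sentinel per worker, enqueued only after all real work, the deadlock scenario described above cannot occur: no worker ever has to put anything back into the queue.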
Thanks for your revision, Jedi Master; I've updated the code.
I think I can merge the request.
@claudio.atzori I've rewritten the PR description, I hope it's clearer now.