Orcid Update Procedure #394

Merged
sandro.labruzzo merged 13 commits from orcid_update into beta 2 months ago

This pull request implements a new procedure to update the ORCID table from the APIs. The previous one was too complicated and used spark executor to download from the APIs, and it did not give the possibility to control the request limit.
This function creates a number of threads less than the total number of the ORCID API's request rate limit per second. It introduces a possible sleep for each thread to prevent it from making more than one request per second.
The new procedure is more efficient and easier to use than the previous version.
Furthermore before apply updates check if there are no decrease respect the original table in case raises an exception.

This pull request implements a new procedure to update the ORCID table from the APIs. The previous one was too complicated and used spark executor to download from the APIs, and it did not give the possibility to control the request limit. This function creates a number of threads less than the total number of the ORCID API's request rate limit per second. It introduces a possible sleep for each thread to prevent it from making more than one request per second. The new procedure is more efficient and easier to use than the previous version. Furthermore before apply updates check if there are no decrease respect the original table in case raises an exception.
claudio.atzori was assigned by sandro.labruzzo 2 months ago
giambattista.bloisi was assigned by sandro.labruzzo 2 months ago
miriam.baglioni was assigned by sandro.labruzzo 2 months ago
sandro.labruzzo added 11 commits 2 months ago
Owner

It also gives the possibility to control the request limit, which is important to avoid exceeding the ORCID API limits.

Can you sunmmarize how the caller of the workflow can control the request limit?

> It also gives the possibility to control the request limit, which is important to avoid exceeding the ORCID API limits. Can you sunmmarize how the caller of the workflow can control the request limit?
giambattista.bloisi reviewed 2 months ago
@ -0,0 +155,4 @@
throw new RuntimeException(e);
}
});
queue.put(ORCIDWorker.JOB_COMPLETE);
Collaborator

I read this line as a request to shut down one worker.
I would expect instead to have 22 of those put to shut down all workers. Also I would expect this activity to be done at the end of the outermost while loop, that is before joining the workers.

I read this line as a request to shut down one worker. I would expect instead to have 22 of those put to shut down all workers. Also I would expect this activity to be done at the end of the outermost while loop, that is before joining the workers.
Poster
Owner

When a ORCIDworker encounters a JOB_COMPLETE message before it closes itself, it re-inserts the same object back into the queue. Do you think this behavior is incorrect, or could it lead to deadlocks or other types of bugs?

When a [ORCIDworker](https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/orcid_update/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/collection/orcid/ORCIDWorker.java#L137) encounters a JOB_COMPLETE message before it closes itself, it re-inserts the same object back into the queue. Do you think this behavior is incorrect, or could it lead to deadlocks or other types of bugs?
Collaborator

A deadlock is possible if this thread (or in general producer of messages) - after putting the JOB_COMPLETE message - continues to put other messages. What could happen is that the last worker gets the last re-routed JOB_COMPLETE message and, meanwhile before reinserting the message, it gets pre-empted and the queue gets completely filled by the producers: at this point the last reinsert tentative will last forever because the queue is full and not consumed by anyone.

If I'm not reading the code wrongly, in this case the JOB_COMPLETE should be put outside the while loop also to cover cases where the tar is empty or the last entry of the tar is not a file.

A deadlock is possible if this thread (or in general producer of messages) - after putting the JOB_COMPLETE message - continues to put other messages. What could happen is that the last worker gets the last re-routed JOB_COMPLETE message and, meanwhile before reinserting the message, it gets pre-empted and the queue gets completely filled by the producers: at this point the last reinsert tentative will last forever because the queue is full and not consumed by anyone. If I'm not reading the code wrongly, in this case the JOB_COMPLETE should be put outside the while loop also to cover cases where the tar is empty or the last entry of the tar is not a file.
Poster
Owner

Thanks for your revision Jedi Master, I've updated the code.
I Think I can merge the request

Thanks for your revision Jedi Master, I've updated the code. I Think I can merge the request
sandro.labruzzo marked this conversation as resolved
Poster
Owner

It also gives the possibility to control the request limit, which is important to avoid exceeding the ORCID API limits.

Can you sunmmarize how the caller of the workflow can control the request limit?

@claudio.atzori I've rewritten the PR description, I hope it's clearer now.

> > It also gives the possibility to control the request limit, which is important to avoid exceeding the ORCID API limits. > > Can you sunmmarize how the caller of the workflow can control the request limit? > > @claudio.atzori I've rewritten the PR description, I hope it's clearer now.
sandro.labruzzo added 1 commit 2 months ago
915a76a796 following the comment on the pull requests:
- Added #NUM_OF_THREADS complete job in the queue at the end of  the main loop to avoid deadlock
sandro.labruzzo added 1 commit 2 months ago
sandro.labruzzo merged commit e468e99100 into beta 2 months ago
sandro.labruzzo deleted branch orcid_update 2 months ago
The pull request has been merged as e468e99100.
You can also view command line instructions.

Step 1:

From your project repository, check out a new branch and test the changes.
git checkout -b orcid_update beta
git pull origin orcid_update

Step 2:

Merge the changes and update on Gitea.
git checkout beta
git merge --no-ff orcid_update
git push origin beta
Sign in to join this conversation.
No reviewers
No Milestone
No project
3 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: D-Net/dnet-hadoop#394
Loading…
There is no content yet.