Orcid Update Procedure #394

Merged
sandro.labruzzo merged 13 commits from orcid_update into beta 2024-02-28 09:17:30 +01:00

This pull request implements a new procedure to update the ORCID table from the APIs. The previous one was too complicated: it used Spark executors to download from the APIs and gave no way to control the request rate.
The new procedure creates a number of threads lower than the ORCID API's per-second request rate limit, and introduces an optional sleep in each thread to prevent it from making more than one request per second.
The new procedure is more efficient and easier to use than the previous version.
Furthermore, before applying the updates it checks that the updated table is not smaller than the original one; if it is, an exception is raised.

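The threading scheme described above can be sketched as follows. This is a minimal, illustrative sketch, not the actual workflow code: the class name, the rate-limit value (24 requests/second), the thread count, the queue capacity, and the `JOB_COMPLETE` shutdown loop are all assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class RateLimitedWorkers {
    static final String JOB_COMPLETE = "JOB_COMPLETE"; // shutdown sentinel, as in the PR
    static final AtomicInteger processed = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        int rateLimitPerSecond = 24;             // assumed ORCID API limit
        int numThreads = rateLimitPerSecond - 2; // fewer threads than the limit
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < numThreads; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        String orcidId = queue.take();
                        if (JOB_COMPLETE.equals(orcidId)) break;
                        long start = System.currentTimeMillis();
                        // the real worker would call the ORCID API here
                        processed.incrementAndGet();
                        // sleep so this thread never makes more than one request per second
                        long elapsed = System.currentTimeMillis() - start;
                        if (elapsed < 1000) Thread.sleep(1000 - elapsed);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            workers.add(t);
        }

        for (int i = 0; i < 50; i++) queue.put("0000-0000-0000-" + i);
        // one sentinel per worker, enqueued only after all real messages
        for (int i = 0; i < numThreads; i++) queue.put(JOB_COMPLETE);
        for (Thread t : workers) t.join();
        System.out.println(processed.get()); // 50
    }
}
```

With `numThreads` strictly below the per-second limit and each thread pacing itself to one request per second, the aggregate request rate stays under the API limit by construction.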
claudio.atzori was assigned by sandro.labruzzo 2024-02-23 10:25:36 +01:00
giambattista.bloisi was assigned by sandro.labruzzo 2024-02-23 10:25:36 +01:00
miriam.baglioni was assigned by sandro.labruzzo 2024-02-23 10:25:36 +01:00
sandro.labruzzo added 11 commits 2024-02-23 10:25:37 +01:00

> It also gives the possibility to control the request limit, which is important to avoid exceeding the ORCID API limits.

Can you summarize how the caller of the workflow can control the request limit?
giambattista.bloisi reviewed 2024-02-24 17:25:10 +01:00
@ -0,0 +155,4 @@
throw new RuntimeException(e);
}
});
queue.put(ORCIDWorker.JOB_COMPLETE);

I read this line as a request to shut down one worker.
I would expect instead to have 22 of those puts, to shut down all the workers. I would also expect this to be done at the end of the outermost while loop, that is, before joining the workers.
Author
Owner

When an [ORCIDWorker](https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/orcid_update/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/collection/orcid/ORCIDWorker.java#L137) encounters a JOB_COMPLETE message, before closing itself it re-inserts the same object back into the queue. Do you think this behavior is incorrect, or could it lead to deadlocks or other types of bugs?
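The re-insertion behavior in question can be illustrated with a minimal sketch (the class name and worker count are illustrative, not the actual ORCIDWorker code). Note that it terminates cleanly here only because nothing is enqueued after the sentinel, which is exactly the condition analysed in the deadlock discussion:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class PoisonPillRelay {
    static final String JOB_COMPLETE = "JOB_COMPLETE";
    static final AtomicInteger terminated = new AtomicInteger();

    // Worker loop: on seeing the sentinel, put it back so the next worker
    // also sees it, then terminate (the single-sentinel relay under discussion).
    static Runnable worker(BlockingQueue<String> queue) {
        return () -> {
            try {
                while (true) {
                    String msg = queue.take();
                    if (JOB_COMPLETE.equals(msg)) {
                        queue.put(msg); // relay the sentinel to the other workers
                        terminated.incrementAndGet();
                        return;
                    }
                    // processing of a real message would go here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);
        Thread[] workers = new Thread[3];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(worker(queue));
            workers[i].start();
        }
        queue.put("work-1");
        queue.put("work-2");
        queue.put(JOB_COMPLETE); // a single sentinel, relayed from worker to worker
        for (Thread t : workers) t.join();
        System.out.println(terminated.get()); // 3
    }
}
```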

A deadlock is possible if this thread (or, in general, a producer of messages) continues to put other messages after putting the JOB_COMPLETE message. The last worker could take the last re-routed JOB_COMPLETE message and, before reinserting it, be pre-empted while the producers completely fill the queue: at that point the reinsert attempt will block forever, because the queue is full and no one is consuming from it.

If I'm not reading the code wrongly, in this case the JOB_COMPLETE should also be put outside the while loop, to cover the cases where the tar is empty or the last entry of the tar is not a file.
Author
Owner

Thanks for your review, Jedi Master, I've updated the code.
I think I can merge the request.
sandro.labruzzo marked this conversation as resolved
Author
Owner

> It also gives the possibility to control the request limit, which is important to avoid exceeding the ORCID API limits.
>
> Can you summarize how the caller of the workflow can control the request limit?

@claudio.atzori I've rewritten the PR description, I hope it's clearer now.
sandro.labruzzo added 1 commit 2024-02-28 09:11:00 +01:00
915a76a796 following the comment on the pull requests:
- Added #NUM_OF_THREADS complete job in the queue at the end of  the main loop to avoid deadlock
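The fix described in the commit message can be sketched as follows (an illustrative sketch: the class name, `NUM_OF_THREADS` value, queue capacity, and message count are assumptions, not the committed code). Because one JOB_COMPLETE per worker is enqueued only after the main loop has finished producing, no relay is needed, the full-queue deadlock cannot occur, and the empty-tar case is covered as well:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class ShutdownAfterMainLoop {
    static final String JOB_COMPLETE = "JOB_COMPLETE";
    static final int NUM_OF_THREADS = 4;
    static final AtomicInteger processed = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(8);
        Thread[] workers = new Thread[NUM_OF_THREADS];
        for (int i = 0; i < NUM_OF_THREADS; i++) {
            workers[i] = new Thread(() -> {
                try {
                    String msg;
                    while (!JOB_COMPLETE.equals(msg = queue.take())) {
                        processed.incrementAndGet(); // real processing would go here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }

        // main loop: read entries (e.g. from the tar) and enqueue them
        for (int i = 0; i < 20; i++) queue.put("entry-" + i);

        // at the END of the main loop: one JOB_COMPLETE per worker; no producer
        // ever puts after the sentinels, so the reinsert-on-full deadlock is gone,
        // and workers still receive a sentinel even if the tar was empty
        for (int i = 0; i < NUM_OF_THREADS; i++) queue.put(JOB_COMPLETE);
        for (Thread t : workers) t.join();
        System.out.println(processed.get()); // 20
    }
}
```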
sandro.labruzzo added 1 commit 2024-02-28 09:11:30 +01:00
sandro.labruzzo merged commit e468e99100 into beta 2024-02-28 09:17:30 +01:00
sandro.labruzzo deleted branch orcid_update 2024-02-28 09:17:31 +01:00