Orcid Update Procedure #394
Reference: D-Net/dnet-hadoop#394
This pull request implements a new procedure to update the ORCID table from the ORCID APIs. The previous one was overly complicated: it used Spark executors to download from the APIs and offered no way to control the request rate.
The new procedure creates a number of threads lower than the ORCID API's per-second request rate limit, and each thread sleeps when needed so that it never makes more than one request per second.
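A minimal sketch of the threading model as I read it from the description above; the class name, constants, queue size, and `download` method are illustrative placeholders, not the actual code in this PR:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RateLimitedWorkers {

    // Hypothetical values: in the real workflow these are set by the caller.
    static final int REQUESTS_PER_SECOND = 24; // assumed ORCID API rate limit
    static final int NUM_WORKERS = 22;         // strictly below the limit

    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        for (int i = 0; i < NUM_WORKERS; i++) {
            new Thread(() -> {
                try {
                    while (true) {
                        String orcidId = queue.take();
                        long start = System.currentTimeMillis();
                        download(orcidId); // one API request
                        long elapsed = System.currentTimeMillis() - start;
                        // Pad each iteration to a full second: with NUM_WORKERS
                        // threads each making at most one request per second,
                        // the overall rate stays below REQUESTS_PER_SECOND.
                        if (elapsed < 1000) {
                            Thread.sleep(1000 - elapsed);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
        // ... a producer fills the queue and later shuts the workers down ...
    }

    static void download(String orcidId) {
        // call the ORCID API here
    }
}
```

In this reading, the caller controls the request limit simply by choosing the number of worker threads, since each thread is capped at one request per second.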
The new procedure is more efficient and easier to use than the previous version.
Furthermore, before applying the updates, the procedure checks that the record count does not decrease with respect to the original table; if it does, an exception is raised.
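The guard can be as simple as comparing record counts before overwriting the table. A sketch, assuming the tables are read as parquet; the class, method, and path parameters are hypothetical:

```java
import org.apache.spark.sql.SparkSession;

public class OrcidUpdateGuard {

    // Hypothetical check: refuse to apply the update if the new table
    // has fewer records than the one it would replace.
    public static void checkNoShrinkage(SparkSession spark, String originalPath, String updatedPath) {
        long original = spark.read().parquet(originalPath).count();
        long updated = spark.read().parquet(updatedPath).count();
        if (updated < original) {
            throw new IllegalStateException(String.format(
                "Updated ORCID table has %d records, fewer than the original %d: aborting the update",
                updated, original));
        }
    }
}
```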
Can you summarize how the caller of the workflow can control the request limit?
@@ -0,0 +155,4 @@
throw new RuntimeException(e);
}
});
queue.put(ORCIDWorker.JOB_COMPLETE);
I read this line as a request to shut down one worker.
I would expect instead to have 22 of those puts, to shut down all workers. Also, I would expect this to be done at the end of the outermost while loop, that is, before joining the workers.
When an ORCIDWorker encounters a JOB_COMPLETE message, it re-inserts the same object back into the queue before it closes itself. Do you think this behavior is incorrect, or could it lead to deadlocks or other types of bugs?
A deadlock is possible if this thread (or, in general, any producer of messages) continues to put other messages after putting the JOB_COMPLETE message. What could happen is that the last worker takes the last re-routed JOB_COMPLETE message and is pre-empted before re-inserting it, while in the meantime the producers completely fill the queue: at that point the re-insert attempt will block forever, because the queue is full and no one is consuming from it.
If I'm not reading the code wrongly, the JOB_COMPLETE should in this case also be put outside the while loop, to cover the cases where the tar is empty or its last entry is not a file.
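For reference, a minimal sketch of the shutdown protocol suggested above; the class name, queue size, and helper methods are hypothetical, only the JOB_COMPLETE sentinel and the 22 workers come from the discussion:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ShutdownSketch {

    static final String JOB_COMPLETE = "JOB_COMPLETE";
    static final int NUM_WORKERS = 22;

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

        Thread[] workers = new Thread[NUM_WORKERS];
        for (int i = 0; i < NUM_WORKERS; i++) {
            workers[i] = new Thread(() -> {
                try {
                    while (true) {
                        String item = queue.take();
                        // Each worker consumes exactly one sentinel and stops;
                        // nothing is re-inserted, so a full queue can never
                        // block a worker holding the last sentinel.
                        if (JOB_COMPLETE.equals(item)) {
                            return;
                        }
                        process(item);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }

        // Producer: enqueue all the real work first ...
        for (String id : readIdsFromTar()) {
            queue.put(id);
        }
        // ... then, outside the read loop (so it also runs when the tar is
        // empty or ends with a non-file entry), enqueue one sentinel per
        // worker before joining them.
        for (int i = 0; i < NUM_WORKERS; i++) {
            queue.put(JOB_COMPLETE);
        }
        for (Thread w : workers) {
            w.join();
        }
    }

    static void process(String id) {
        // download and store one record
    }

    static String[] readIdsFromTar() {
        return new String[0]; // placeholder for reading entries from the tar
    }
}
```

With one sentinel per worker, enqueued only after all real work, the deadlock scenario described above cannot occur: no worker ever has to put anything back into the queue.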
Thanks for your revision, Jedi Master; I've updated the code.
I think I can merge the request.
@claudio.atzori I've rewritten the PR description, I hope it's clearer now.