SWH_integration #343

Merged
claudio.atzori merged 15 commits from SWH_integration into beta 2023-10-06 14:15:56 +02:00
Member

Related to tasks: #9010 and #9011

Introduces a new oozie workflow (/eu/dnetlib/dhp/swh) that:

  • Collects distinct software repository URLs (using the coderepositoryurl.value field)
  • Collects last visit data from SWH API (e.g., https://archive.softwareheritage.org/api/1/origin/https://github.com/openaire/iis/visit/latest/ -- note that snapshot is the SWH id)
  • Sends archive requests (via the SWH API) for repo URLs that were not found and for those whose last visit is older than archiveThresholdInDays (workflow parameter)
  • Creates actionsets, inserting the SWH id (i.e., the snapshot id from step 2) into the pid array at the result level
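
Steps 2 and 3 can be illustrated with a minimal sketch. It is not the code in this PR: the class and method names (SwhVisitSketch, lastVisitUrl, needsArchiving) are hypothetical, and the only things taken from the description are the endpoint shape of the example URL and the archiveThresholdInDays parameter. The repository URL is appended as-is, mirroring the example above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Hypothetical helper, for illustration only; not part of the workflow code.
final class SwhVisitSketch {

    // Endpoint prefix taken from the example URL in the description.
    static final String ORIGIN_API = "https://archive.softwareheritage.org/api/1/origin/";

    /** Builds the "latest visit" URL for a repository, following the example above. */
    static String lastVisitUrl(String repoUrl) {
        return ORIGIN_API + repoUrl + "/visit/latest/";
    }

    /**
     * Decides whether an archive request should be sent: either no visit
     * was found for the repository, or the last visit is older than the threshold.
     */
    static boolean needsArchiving(Optional<Instant> lastVisit, int archiveThresholdInDays, Instant now) {
        if (!lastVisit.isPresent()) {
            return true; // repository not found in SWH
        }
        Instant cutoff = now.minus(Duration.ofDays(archiveThresholdInDays));
        return lastVisit.get().isBefore(cutoff);
    }
}
```

Passing the current time as an argument keeps the threshold check deterministic and easy to test.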

We can add this workflow in the beta provision; the parameters required, along with some sample values, are the following:

hiveDbName=openaire_prod_20230914
softwareCodeRepositoryURLs=${workingDir}/1_code_repo_urls.csv
lastVisitsPath=${workingDir}/2_last_visits.seq
archiveRequestsPath=${workingDir}/3_archive_requests.seq
actionsetsPath=${workingDir}/4_actionsets
graphPath=/tmp/prod_provision/graph/18_graph_blacklisted

maxNumberOfRetry=2
retryDelay=1
requestDelay=100

softwareLimit=500
resume=collect-software-repository-urls <== we can set it to create-swh-actionsets in order to just create the actionsets, and bypass the API calls
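
As a rough illustration of how maxNumberOfRetry and retryDelay might be applied around a single API call, here is a generic retry helper. This is a sketch under assumptions, not the workflow's actual client: the class and method names are hypothetical, and the unit of retryDelay is not stated in this PR, so the sketch takes the delay in milliseconds.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, for illustration only; not the client used by the workflow.
final class RetrySketch {

    /**
     * Runs the call once and, on failure, retries it up to maxNumberOfRetry more times,
     * sleeping retryDelayMillis between attempts (the delay unit is an assumption).
     */
    static <T> T callWithRetry(Callable<T> call, int maxNumberOfRetry, long retryDelayMillis) throws Exception {
        if (maxNumberOfRetry < 0) {
            throw new IllegalArgumentException("maxNumberOfRetry must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= maxNumberOfRetry; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxNumberOfRetry) {
                    Thread.sleep(retryDelayMillis); // wait before the next attempt
                }
            }
        }
        throw last; // every attempt failed; surface the last error
    }
}
```

With maxNumberOfRetry=2, a call that fails twice and then succeeds returns normally on the third attempt.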

claudio.atzori was assigned by schatz 2023-10-03 14:31:00 +02:00
schatz added 6 commits 2023-10-03 14:31:00 +02:00
schatz added 1 commit 2023-10-03 14:43:42 +02:00
claudio.atzori requested changes 2023-10-03 15:19:50 +02:00
claudio.atzori left a comment
Owner

The implementation looks OK overall. Let's address the single comment I left before moving on to testing it in the automated execution.

@ -0,0 +132,4 @@
Dataset<Row> joinedDF = graphSoftwareDF.join(swhDF, "repoUrl").select("id", "swhid");
// joinedDF.show(false);
return joinedDF.map((MapFunction<Row, Software>) row -> {

The returned Software record should also contain a reference to the Software Heritage datasource in the collectedfrom element; it is defined as a List<KeyValue>, so you can simply populate it with

key = "10|openaire____::dbfd07503aaa1ed31beed7dec942f3f4"
value = "Software Heritage"
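
A minimal sketch of what populating that element could look like. The KeyValue and Software classes below are simplified stand-ins written for this example (assumptions, not the real schema classes, which carry more fields), and CollectedFromSketch is a hypothetical helper name; only the key and value come from the comment above.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the schema's KeyValue type (assumption for this sketch).
final class KeyValue {
    private String key;
    private String value;

    String getKey() { return key; }
    void setKey(String key) { this.key = key; }
    String getValue() { return value; }
    void setValue(String value) { this.value = value; }
}

// Simplified stand-in for the Software record (assumption for this sketch).
final class Software {
    private List<KeyValue> collectedfrom = new ArrayList<>();

    List<KeyValue> getCollectedfrom() { return collectedfrom; }
    void setCollectedfrom(List<KeyValue> collectedfrom) { this.collectedfrom = collectedfrom; }
}

// Hypothetical helper, for illustration only.
final class CollectedFromSketch {
    static final String SWH_DATASOURCE_ID = "10|openaire____::dbfd07503aaa1ed31beed7dec942f3f4";
    static final String SWH_DATASOURCE_NAME = "Software Heritage";

    /** Sets collectedfrom to a single entry pointing at the Software Heritage datasource. */
    static Software withSwhCollectedFrom(Software software) {
        KeyValue swh = new KeyValue();
        swh.setKey(SWH_DATASOURCE_ID);
        swh.setValue(SWH_DATASOURCE_NAME);

        List<KeyValue> collectedFrom = new ArrayList<>();
        collectedFrom.add(swh);
        software.setCollectedfrom(collectedFrom);
        return software;
    }
}
```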
schatz added 1 commit 2023-10-03 15:55:14 +02:00
schatz added 1 commit 2023-10-03 19:58:01 +02:00
schatz requested review from claudio.atzori 2023-10-03 20:04:11 +02:00
Author
Member

@claudio.atzori Can you check again?

  • Added collectedFrom field
  • Moved SWH API Key to a WF parameter
schatz added 1 commit 2023-10-04 19:31:56 +02:00
claudio.atzori added 1 commit 2023-10-06 12:21:52 +02:00

@claudio.atzori Can you check again?

  • Added collectedFrom field
  • Moved SWH API Key to a WF parameter

Looks good to me, I'm going to deploy & test it on BETA.

claudio.atzori added 1 commit 2023-10-06 12:31:23 +02:00
claudio.atzori added 2 commits 2023-10-06 14:03:36 +02:00
claudio.atzori added 1 commit 2023-10-06 14:15:40 +02:00
claudio.atzori merged commit 6856ab28ab into beta 2023-10-06 14:15:56 +02:00
Author
Member

"Software Heritage Identifier" was renamed to "Software Hash Identifier" so we have to rename it in our code.

claudio.atzori added this to the FAIRCORE4EOSC project 2023-10-26 09:44:28 +02:00
claudio.atzori modified the project from FAIRCORE4EOSC to OpenAIRE 2023-10-26 09:44:32 +02:00
claudio.atzori modified the project from OpenAIRE to FAIRCORE4EOSC 2023-10-26 09:44:36 +02:00
claudio.atzori modified the project from FAIRCORE4EOSC to OpenAIRE 2023-10-26 09:44:45 +02:00
claudio.atzori modified the project from OpenAIRE to OpenAIRE - DNet 2023-10-26 10:00:35 +02:00