SWH_integration #343
Reference: D-Net/dnet-hadoop#343
Related to tasks: #9010 and #9011
Introduces a new Oozie workflow (/eu/dnetlib/dhp/swh) that:
1. […] (coderepositoryurl.value field)
2. […] (snapshot is the SWH id)
3. […] archiveThresholdInDays (workflow parameter)
4. […] (snapshot id from step 2) into the pid array at the level of the result

We can add this workflow in the beta provision; the parameters required, along with some sample values, are the following:
hiveDbName=openaire_prod_20230914
softwareCodeRepositoryURLs=${workingDir}/1_code_repo_urls.csv
lastVisitsPath=${workingDir}/2_last_visits.seq
archiveRequestsPath=${workingDir}/3_archive_requests.seq
actionsetsPath=${workingDir}/4_actionsets
graphPath=/tmp/prod_provision/graph/18_graph_blacklisted
maxNumberOfRetry=2
retryDelay=1
requestDelay=100
softwareLimit=500
resume=collect-software-repository-urls <== we can set it to create-swh-actionsets in order to just create the actionsets and bypass the API calls

The implementation overall looks OK. Let's address the single comment I left before moving to testing it on the automated execution.
@@ -0,0 +132,4 @@
Dataset<Row> joinedDF = graphSoftwareDF.join(swhDF, "repoUrl").select("id", "swhid");
// joinedDF.show(false);
return joinedDF.map((MapFunction<Row, Software>) row -> {
The returned Software record should also contain a reference to the Software Heritage datasource in the collectedfrom element; it is defined as a List<KeyValue>, so you can simply populate it with […]

@claudio.atzori Can you check again?
Looks good to me, I'm going to deploy & test it on BETA.
"Software Heritage Identifier" was renamed to "Software Hash Identifier" so we have to rename it in our code.
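For reference, the review comment about collectedfrom could be addressed along the lines of the sketch below. It is only an illustration under stated assumptions: KeyValue and Software are simplified stand-ins for the schema classes (the real pid element is not a list of plain strings), and the datasource id, the datasource name, and the sample snapshot identifier are placeholders rather than values taken from this PR.

```java
import java.util.ArrayList;
import java.util.List;

public class SwhCollectedFromSketch {

    // Simplified stand-in for the schema's KeyValue type (assumption).
    static class KeyValue {
        String key;
        String value;

        KeyValue(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Simplified stand-in for the Software record (assumption):
    // only the fields discussed in the review are modelled here.
    static class Software {
        String id;
        List<KeyValue> collectedfrom = new ArrayList<>();
        List<String> pid = new ArrayList<>(); // plain strings for brevity
    }

    // Placeholder datasource reference; the real id is not given in the PR.
    static final String SWH_DATASOURCE_ID = "<software-heritage-datasource-id>";
    static final String SWH_DATASOURCE_NAME = "Software Heritage";

    // Builds the record for one (id, swhid) row of the joined dataset,
    // adding the SWH id to pid and the datasource to collectedfrom.
    static Software toSoftware(String id, String swhid) {
        Software s = new Software();
        s.id = id;
        s.pid.add(swhid);
        s.collectedfrom.add(new KeyValue(SWH_DATASOURCE_ID, SWH_DATASOURCE_NAME));
        return s;
    }

    public static void main(String[] args) {
        // Made-up identifiers, used only to exercise the sketch.
        Software s = toSoftware("sample-software-id",
                "swh:1:snp:0123456789abcdef0123456789abcdef01234567");
        System.out.println(s.id + " collectedfrom=" + s.collectedfrom.get(0).value
                + " pid=" + s.pid.get(0));
    }
}
```

In the actual map function the same two assignments would happen inside the lambda, using the schema's own setters and the datasource constants already defined in the project, if any.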