Open Citation integration #401

Merged
claudio.atzori merged 5 commits from ocnew into beta 1 month ago

New implementation for the integration of the new Open Citation (OC) dump. The implementation had to change because OC changed the information in the dataset: citing and cited results are now linked through internal OC identifiers instead of pids (DOI, PMID, etc.).

We also need to download a correspondence file between OC internal identifiers and pids.

OC also uses ISBN and ISSN as identifiers for the citing/cited resource. For now, these types of identifiers are not considered. We can reconsider them once venues have been inserted.
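
To illustrate the remapping described above, here is a minimal sketch rather than the code in this PR. The `omid:br/...` identifiers, the `scheme:value` pid strings, the in-memory `correspondence` dictionary and both function names are hypothetical; the actual layout of the OC dump and of the correspondence file may differ.

```python
# Illustrative only: data shapes and names are assumptions, not the PR's code.
SKIPPED_SCHEMES = {"isbn", "issn"}  # not considered for now, as stated above


def usable_pids(pids):
    """Keep only the pids whose scheme (the part before ':') is not skipped."""
    return [p for p in pids if p.split(":", 1)[0].lower() not in SKIPPED_SCHEMES]


def remap_citation(citing, cited, correspondence):
    """Expand one citing -> cited link between OC identifiers into pid pairs."""
    citing_pids = usable_pids(correspondence.get(citing, []))
    cited_pids = usable_pids(correspondence.get(cited, []))
    # One relation per combination of citing pid and cited pid.
    return [(a, b) for a in citing_pids for b in cited_pids]


# Hypothetical correspondence between OC internal identifiers and pids.
correspondence = {
    "omid:br/1": ["doi:10.1000/a", "pmid:111"],
    "omid:br/2": ["doi:10.1000/b"],
    "omid:br/3": ["isbn:9780000000002"],
}
pairs = remap_citation("omid:br/1", "omid:br/2", correspondence)
```

In this sketch a link whose citing or cited side resolves only to ISBN/ISSN identifiers yields no relation, which mirrors the decision to ignore those identifiers for now.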

Parameters to be passed to the workflow:

1. inputPath = the path where the newly downloaded OC dumps are stored
2. outputPathExtraction = non-volatile path storing the result of the citation extraction, remapped to the pids. All newly extracted relations are appended to this path, and the Action Set (AS) is created starting from it
3. outputPath = the Action Set path
4. resumeFrom = the step from which to resume the execution
5. filelist = the list of files, separated by `;`. Each element contains the URL used to download the dump and the date of the dump as in the OC folder, separated by `@`. Example: https://figshare.com/ndownloader/files/42777190@2023-05-20; ... 
All the dumps at the URL will be downloaded and stored in inputPath/date
6. filecorrispondence = the URL of the file storing the correspondence between the OC internal identifiers and the pids. It uses the same format, `URL@omid.zip` (https://figshare.com/ndownloader/files/43491411@omid.zip)
7. delimiter = the delimiter used to split the fields in the csv (,)
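
As an illustration of the `filelist` and `filecorrispondence` format, the sketch below splits the value on `;` and each element on its last `@`. The function names `parse_filelist` and `target_dir` are hypothetical and the workflow may parse these values differently.

```python
# Illustrative only: function names are assumptions, not the workflow's code.
def parse_filelist(filelist):
    """Split 'URL@date;URL@date;...' into (url, date) pairs."""
    entries = []
    for item in filelist.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing ';' or stray whitespace
        # Split on the last '@' so the URL itself is left untouched.
        url, _, label = item.rpartition("@")
        if not url or not label:
            raise ValueError(f"expected URL@date, got {item!r}")
        entries.append((url, label))
    return entries


def target_dir(input_path, date):
    """Directory where a dump with the given date is stored (inputPath/date)."""
    return f"{input_path.rstrip('/')}/{date}"


dumps = parse_filelist("https://figshare.com/ndownloader/files/42777190@2023-05-20;")
correspondence_file = parse_filelist("https://figshare.com/ndownloader/files/43491411@omid.zip")
```

The same helper handles `filecorrispondence`, where the part after `@` is a file name (`omid.zip`) rather than a date.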

Possible values for the resumeFrom parameter:

  • DownloadDump to download the new OC dump files and the correspondence file
  • ExtractContent to extract the zip files of the dump
  • ReadContent to read the content of the new dump and create the corresponding json file
  • MapContent to map the internal OC identifiers to the corresponding pids
  • CreateAS to create the action set
  • The default step deletes the inputPath before downloading the new dump files
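
The resume behaviour can be pictured with the following sketch. It assumes that the steps run in the order listed above and that resuming from a step also runs every later step; `DeleteInputPath` is a hypothetical name for the default step, and the actual workflow definition may be organised differently.

```python
# Illustrative only: step order and the default step name are assumptions.
STEPS = ["DownloadDump", "ExtractContent", "ReadContent", "MapContent", "CreateAS"]
DEFAULT_STEP = "DeleteInputPath"  # hypothetical label for the default step


def steps_to_run(resume_from=None):
    """Return the steps executed for a given resumeFrom value."""
    if resume_from is None:
        # Default: clean inputPath first, then run the whole chain.
        return [DEFAULT_STEP] + STEPS
    if resume_from not in STEPS:
        raise ValueError(f"unknown resumeFrom value: {resume_from!r}")
    # Resume at the named step and continue to the end.
    return STEPS[STEPS.index(resume_from):]


remaining = steps_to_run("MapContent")
```

Under these assumptions, resuming from `MapContent` skips the download, extraction and reading of the dump and only remaps the identifiers and rebuilds the action set.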
claudio.atzori was assigned by miriam.baglioni 2 months ago
miriam.baglioni added 3 commits 2 months ago
claudio.atzori added 1 commit 1 month ago
claudio.atzori added 1 commit 1 month ago
claudio.atzori merged commit fa4b3e6d2b into beta 1 month ago
Reference: D-Net/dnet-hadoop#401