WIP: ORCID content acquisition

This commit is contained in:
Claudio Atzori 2024-07-24 15:27:54 +02:00
parent 749124253d
commit 0884de87bf
1 changed files with 41 additions and 7 deletions

@ -1,17 +1,17 @@
# Open Researcher and Contributor ID (ORCID)
ORCID (Open Researcher and Contributor ID) is a non-profit organization that provides a unique identifier for researchers. ORCID iDs are used to connect researchers with their contributions, such as publications, grants, and affiliations.
This document describes how to collect ORCID data from the ORCID datasource.
This document describes how OpenAIRE collects information about researcher profiles and their works from ORCID.
## Data acquisition
### Full ORCID Dump
The full ORCID dataset is publicly available for download from [Figshare](https://orcid.figshare.com/) and is described on the [ORCID website](https://support.orcid.org/hc/en-us/articles/360006897394-How-do-I-get-the-public-data-file).
This dataset served as the initial import; to keep up with later changes in the data, a scheduled process regularly retrieves the delta.
The ORCID dump can be downloaded from the ORCID website https://support.orcid.org/hc/en-us/articles/360006897394-How-do-I-get-the-public-data-file.
The ORCID dump consists in different compressed files that needs to be extracted.
This compressed file contains information on researchers in XML format. Once extracted, they will be parsed to populate the three tables described below.
The ORCID dataset consists of several compressed files containing information about researchers in XML format. Once the files were uncompressed, the information extracted from the XML records was used to populate the three tables described below.
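
As an illustration of the extraction step, the following is a minimal sketch of how a single XML record could be reduced to a few columns of the Authors table. It is not the OpenAIRE implementation: the `SAMPLE` record is a simplified, made-up stand-in for a real ORCID record, its namespaces are placeholders, and the helper names (`local_name`, `first_text`, `parse_author`) are hypothetical. Elements are matched by local name only, so the sketch does not depend on the exact namespace URIs used in the dump.

```python
import xml.etree.ElementTree as ET

# Assumption: a simplified, hypothetical record. Real ORCID records are
# richer and use their own namespaces; only the local element names matter here.
SAMPLE = """<record xmlns="http://example.org/record" xmlns:p="http://example.org/person">
  <orcid-identifier><path>0000-0002-1825-0097</path></orcid-identifier>
  <p:name>
    <p:given-names>Josiah</p:given-names>
    <p:family-name>Carberry</p:family-name>
  </p:name>
</record>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def first_text(root, name):
    """Return the text of the first element with the given local name, or None."""
    for element in root.iter():
        if local_name(element.tag) == name and element.text and element.text.strip():
            return element.text.strip()
    return None


def parse_author(xml_text):
    """Map one XML record to a subset of the Authors columns."""
    root = ET.fromstring(xml_text)
    return {
        "orcid": first_text(root, "path"),
        "givenName": first_text(root, "given-names"),
        "familyName": first_text(root, "family-name"),
        "creditName": first_text(root, "credit-name"),  # None when absent
    }


if __name__ == "__main__":
    print(parse_author(SAMPLE))
```

Fields that are missing from a record come back as `None`, which keeps the row shape stable regardless of how complete a profile is.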
### Incremental Updates
ORCID provides an API to get incremental updates,the parsed incremental data can be used to update the three tables with the latest changes.
ORCID provides an API to get incremental updates; the parsed incremental data can be used to update the three tables with the latest changes.
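
The update step can be pictured as an upsert keyed by ORCID iD. The sketch below is only an illustration under stated assumptions: tables are modelled as plain dictionaries, `apply_delta` is a hypothetical helper, and `lastModifiedDate` is assumed to be an ISO 8601 string in a single consistent format, so that string comparison matches chronological order. It does not call the ORCID API and does not describe the actual OpenAIRE job.

```python
def apply_delta(table, delta):
    """Return a new table where rows from `delta` replace older rows in `table`.

    Both arguments map an ORCID iD to a row (a dict with a
    'lastModifiedDate' string). The inputs are left unmodified.
    """
    merged = dict(table)
    for orcid, row in delta.items():
        current = merged.get(orcid)
        # Insert unseen profiles; overwrite only when the delta is not older.
        if current is None or row["lastModifiedDate"] >= current["lastModifiedDate"]:
            merged[orcid] = row
    return merged


if __name__ == "__main__":
    authors = {
        "0000-0002-1825-0097": {"familyName": "Carberry", "lastModifiedDate": "2024-01-10T08:00:00Z"},
    }
    delta = {
        "0000-0002-1825-0097": {"familyName": "Carberry-Smith", "lastModifiedDate": "2024-07-01T12:30:00Z"},
        "0000-0001-5109-3700": {"familyName": "Doe", "lastModifiedDate": "2024-06-15T09:00:00Z"},
    }
    for orcid, row in sorted(apply_delta(authors, delta).items()):
        print(orcid, row)
```

Comparing modification dates before overwriting guards against a late-arriving, stale delta replacing fresher data.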
### OpenAIRE ORCID Data model
@ -19,7 +19,41 @@ ORCID provides an API to get incremental updates,the parsed incremental data can
- **Employments**: This table contains information about the employments of ORCID authors, including their ORCID ID, organization, start date, end date, and ROAR ID.
- **Works**: This table contains information about the works of ORCID authors, including the paper PID and ORCID ID.
**Authors**
| Column name | Type |
|----------------------|----------------------------------------------|
| `biography` | `string` |
| `creditName` | `string` |
| `familyName` | `string` |
| `givenName` | `string` |
| `orcid` | `string` |
| `otherNames` | `array[string]` |
| `otherPids` | `array[struct[schema:string, value:string]]` |
| `visibility` | `string` |
| `lastModifiedDate` | `string` |
**Employments**
| Column name | Type |
|------------------|---------------------------------------|
| `affiliationId` | `struct[schema:string, value:string]` |
| `departmentName` | `string` |
| `endDate` | `string` |
| `orcid` | `string` |
| `roleTitle` | `string` |
| `startDate` | `string` |
**Works**
| Column name | Type |
|-------------|----------------------------------------------|
| `orcid` | `string` |
| `pids` | `array[struct[schema:string, value:string]]` |
| `title` | `string` |
For a more extensive description of the different fields and the schema of the record model please refer to the [ORCID project on GitHub](https://github.com/ORCID/orcid-model).
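
For readers who prefer code to tables, the three schemas above can be mirrored as Python dataclasses. This is a sketch for illustration only: the class names (`Pid`, `Author`, `Employment`, `Work`) and the choice of optional fields and defaults are assumptions, while the attribute names and types follow the column listings.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Pid:
    """A struct[schema:string, value:string] as used by otherPids, pids, affiliationId."""
    schema: str
    value: str


@dataclass
class Author:
    orcid: str
    givenName: Optional[str] = None
    familyName: Optional[str] = None
    creditName: Optional[str] = None
    biography: Optional[str] = None
    visibility: Optional[str] = None
    lastModifiedDate: Optional[str] = None
    otherNames: List[str] = field(default_factory=list)
    otherPids: List[Pid] = field(default_factory=list)


@dataclass
class Employment:
    orcid: str
    affiliationId: Optional[Pid] = None
    departmentName: Optional[str] = None
    roleTitle: Optional[str] = None
    startDate: Optional[str] = None
    endDate: Optional[str] = None


@dataclass
class Work:
    orcid: str
    title: Optional[str] = None
    pids: List[Pid] = field(default_factory=list)


if __name__ == "__main__":
    # Hypothetical values, used only to show the nesting of the pids column.
    work = Work(
        orcid="0000-0002-1825-0097",
        title="An example title",
        pids=[Pid(schema="doi", value="10.1234/example")],
    )
    print(asdict(work))
```

The `orcid` attribute is the only required field in each class because it is the column shared by all three tables and the natural join key between them.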
## Process