# Datacite
This section describes the aggregation workflow of Datacite and the mapping implemented for it.

## Datacite datasource
[Datacite](https://datacite.org/index.html) is a leading global non-profit organisation that provides persistent identifiers (DOIs) for research data and other research outputs. 

## Datacite API
The [DataCite REST API](https://support.datacite.org/docs/api) allows users to retrieve, query, and browse DataCite DOI metadata records. In particular, it exposes a method for incrementally harvesting new Datacite records.

```
https://api.datacite.org/dois?page[cursor]=$CURSOR&page[size]=$NUMBER_OF_ITEM_PER_PAGE&query=updated:[$FROM_DATE_TIMESAMP TO $TO_DATE_TIMESAMP]
```

This API request uses the following variables:
- **CURSOR**: the cursor value used to iterate over the pages.
- **NUMBER_OF_ITEM_PER_PAGE**: the number of records downloaded per page (maximum 1000).
- **FROM_DATE_TIMESAMP, TO_DATE_TIMESAMP**: the timestamp interval in which the records were updated.
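
As a minimal sketch of how these variables fit into the request, the snippet below builds the harvesting URL with the Python standard library. The function name `build_harvest_url` is hypothetical, and the use of numeric timestamps in the example is an assumption: this page does not state the exact timestamp format expected by the `updated` query.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.datacite.org/dois"


def build_harvest_url(cursor, page_size, from_ts, to_ts):
    """Build the incremental-harvesting URL (hypothetical helper).

    from_ts / to_ts are passed through unchanged; their format is an
    assumption, since it is not specified in this document.
    """
    if not 1 <= page_size <= 1000:
        # The page size is capped at 1000 records per page.
        raise ValueError("page_size must be between 1 and 1000")
    params = {
        "page[cursor]": cursor,
        "page[size]": page_size,
        "query": f"updated:[{from_ts} TO {to_ts}]",
    }
    # urlencode percent-encodes brackets and spaces so the URL is safe to send.
    return f"{BASE_URL}?{urlencode(params)}"
```

Percent-encoding the parameters keeps the square brackets and the spaces of the `updated:[... TO ...]` range from being misread by HTTP clients.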


Each record contains two pieces of information needed for incremental harvesting:
- **isActive**: tells if the record is deleted (`isActive:false`)
- **updated**: timestamp of last update
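
The following sketch illustrates how these two fields could drive an incremental harvest by separating deleted records from active ones. It assumes that both fields sit under the record's `attributes` object (as `updated` does in the mapping table) and that a record without `isActive` counts as active; `split_by_status` is a hypothetical helper, not part of the actual workflow code.

```python
def split_by_status(records):
    """Split harvested records into (active, deleted) lists.

    Assumption: `isActive` lives under `attributes`; a missing flag is
    treated as active.
    """
    active, deleted = [], []
    for record in records:
        attributes = record.get("attributes", {})
        if attributes.get("isActive", True) is False:
            deleted.append(record)
        else:
            active.append(record)
    return active, deleted
```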


## Collection Workflow

The collection workflow is responsible for aggregating new Datacite records. Each record is stored in a table called Native Datacite Store with the following schema:
- **DOI**: the DOI PID of the Datacite record (primary key)
- **update_timestamp**: the timestamp of the last update
- **json**: the native record JSON

During the collection workflow, we identify the most recent update date among the stored records; the collection phase then downloads all new Datacite records and updates the existing ones through the API, using this date as the **FROM_DATE_TIMESAMP** variable.
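
The logic above can be sketched with an in-memory dictionary standing in for the Native Datacite Store. This is an illustration only: the real storage backend is not described here, and the helper names `latest_update` and `upsert` are hypothetical.

```python
def latest_update(store):
    """Return the most recent update_timestamp in the store, or None if empty.

    The result would be used as FROM_DATE_TIMESAMP for the next harvest.
    """
    return max((row["update_timestamp"] for row in store.values()), default=None)


def upsert(store, doi, update_timestamp, json_record):
    """Insert a new record or overwrite an older one; the DOI is the primary key."""
    current = store.get(doi)
    if current is None or update_timestamp >= current["update_timestamp"]:
        store[doi] = {"update_timestamp": update_timestamp, "json": json_record}
```

Keeping only the newest version per DOI means the harvest can be repeated over an overlapping interval without creating duplicates.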


## Datacite Mapping
The table below describes the mapping from the Datacite JSON records to the OpenAIRE Graph dump format.


| OpenAIRE Result field path         | Datacite record JSON path     | Notes             |
|------------------------------------|-------------------------------|-------------------|
| `id`                               | `\attributes\doi` | The identifier is created by following the OpenAIRE PID generation policy |
| <ul><li>`instance`</li> <li>`instance.type`</li></ul> | <ul><li>`\attributes\types\resourceType`</li> <li>`\attributes\types\resourceTypeGeneral`</li> <li>`\attributes\types\schemaOrg`</li></ul> | Use the vocabulary **_dnet:publication_resource_** to find a synonym for one of these terms and get the `instance.type`. Using the **_dnet:result_typologies_** vocabulary, we look up the `instance.type` synonym to generate one of the following main entities: <ul><li>`publication`</li> <li>`dataset`</li> <li>`software`</li> <li>`otherresearchproduct`</li></ul> |
| `pid` | `\attributes\doi` | `scheme = doi` |
| `originalid` | `\attributes\doi` |  |
| `dateofcollection` | `\attributes\updated` | The timestamp is defined in milliseconds; it is converted to the `yyyy-MM-dd'T'HH:mm:ssZ` format |
| `author` | `\attributes\creators` | Each creator is mapped to an author entity with the subfields below. **If the record has no creator, it is skipped** |
| `author.fullname` | `\attributes\creators\name` | If `name` is not defined, it is built from the given and family names |
| `author.rank` |   | Incremental index starting from 1  |
| `author.name` |  `\attributes\creators\givenName` |  |
| `author.surname` |  `\attributes\creators\familyName` |  |
| `author.pid` | `\attributes\creators\nameIdentifiers` | The list of PIDs associated with the creator |
| `author.pid.scheme` | `\attributes\creators\nameIdentifiers` | Mapped using the vocabulary **dnet:pid_types** |
| `author.pid.value` | `\attributes\creators\nameIdentifiers\nameIdentifier` | The PID value |
|  `maintitle` | `\attributes\titles`  |  Titles whose title type is null or title type is Main |
| `subtitle` | `\attributes\titles` | Titles whose title type is Subtitle; the title type vocabulary in OpenAIRE uses the Datacite title type vocabulary |
| **date section** |  | For each date, in particular for DOIs starting with _10.14457_, we apply a Thai date fix: the date is converted to a ThaiBuddhistDate and reformatted to the local one; see ticket [#6791](https://support.openaire.eu/issues/6791) |
|`publicationdate` |  `\attributes\dates` | where `dateType` is **issued** |
|`publicationdate` | `\attributes\publicationYear` | The date is built in the format `01-01-publicationYear` |
|`embargoenddate` | `\attributes\dates` | where `dateType` is **available** |
| `subjects`      | `\attributes\subject`  |  `scheme=keywords` |
| `description`   | `\attributes\descriptions`    |            |
| `publisher`   | `\attributes\publisher`    |            |
| `language`   | `\attributes\language`    |  cleaned by using vocabulary `dnet:languages`           |
| `instance.license`     | `\attributes\rightsList` | Set if the rights value starts with `http` and matches a particular regex |
| `instance.accessright` | `\attributes\rightsList` | <ul><li>If not present: `unknown`</li><li>If the datasource is _figshare_: `open`</li><li>If `embargo_date < today()`: _OPEN_</li></ul> |
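
A few of the rules in the table can be sketched in Python as follows. This is a hedged illustration rather than the actual mapping code: the function names are hypothetical, the use of UTC for `dateofcollection` is an assumption, and so is the "given name, then family name" order used when `name` is missing.

```python
from datetime import datetime, timezone


def format_collection_date(updated_ms):
    """Convert a millisecond timestamp to the yyyy-MM-dd'T'HH:mm:ssZ format.

    Assumption: the timestamp is interpreted in UTC.
    """
    moment = datetime.fromtimestamp(updated_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%S%z")


def publication_date_from_year(publication_year):
    """Build the 01-01-publicationYear date from `publicationYear`."""
    return f"01-01-{publication_year}"


def map_authors(creators):
    """Map Datacite creators to authors; return None when there are none,
    signalling that the whole record should be skipped."""
    if not creators:
        return None
    authors = []
    for rank, creator in enumerate(creators, start=1):  # rank starts from 1
        given = creator.get("givenName")
        family = creator.get("familyName")
        # Fall back to "given family" when `name` is missing (order assumed).
        fullname = creator.get("name") or " ".join(p for p in (given, family) if p)
        authors.append(
            {"fullname": fullname, "rank": rank, "name": given, "surname": family}
        )
    return authors
```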


### Mapping Relation


| OpenAIRE Relation Semantic and inverse    | Datacite record JSON path     | Source/Target type            | Notes   |
|-------------------------------------------|-------------------------------|-------------------------------|---------|
| `isProducedBy`      | `\attributes\fundingReferences` | `Result/Project` | We must check whether the funding reference matches this pattern: `(info:eu-repo/grantagreement/ec/h2020/)(\d{6})(.*)` |
| `IsProvidedBy`   | | `Result/DataSource` | Datasource is always Datacite|
| `IsHostedBy`   | `\attributes\relationships\client\id` | `Result/DataSource` | We defined a curated map from client id to datasource; if a match is found, we create a _hostedBy Relation_ |
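
As an illustration of the `isProducedBy` rule, the sketch below applies the pattern from the table to a single string and extracts the six-digit grant number. Which sub-field of `fundingReferences` is tested is not stated on this page, and the pattern is applied exactly as written (case-sensitive, anchored at the start of the string), which is an assumption; `extract_h2020_grant` is a hypothetical helper.

```python
import re

# Pattern copied from the relation mapping table.
H2020_PATTERN = re.compile(r"(info:eu-repo/grantagreement/ec/h2020/)(\d{6})(.*)")


def extract_h2020_grant(value):
    """Return the six-digit H2020 grant number, or None if the pattern fails.

    Assumption: matching is case-sensitive and anchored at the string start.
    """
    match = H2020_PATTERN.match(value)
    return match.group(2) if match else None
```

A result of `None` would mean that no `isProducedBy` relation is created for that funding reference.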


### Relation Resolution