Datacite

This section describes the aggregation workflow of Datacite and the mapping implemented for it.

Datacite datasource

Datacite is a leading global non-profit organisation that provides persistent identifiers (DOIs) for research data and other research outputs.

Datacite API

The DataCite REST API allows users to retrieve, query, and browse DataCite DOI metadata records. In particular, it exposes a method for incrementally harvesting new DataCite records.

https://api.datacite.org/dois?page[cursor]=$CURSOR&page[size]=$NUMBER_OF_ITEM_PER_PAGE&query=updated:[$FROM_DATE_TIMESAMP TO $TO_DATE_TIMESAMP]

This API request uses the following variables:

  • CURSOR: the cursor value used to iterate over the pages
  • NUMBER_OF_ITEM_PER_PAGE: the number of records to download per page (max 1000)
  • FROM_DATE_TIMESAMP, TO_DATE_TIMESAMP: the bounds of the time interval in which the records were updated
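
As a minimal sketch, the request URL above could be assembled as follows. The helper name build_harvest_url is hypothetical, and the timestamp values are passed through unchanged because their exact format is not specified here.

```python
from urllib.parse import urlencode

# Endpoint taken from the request template above.
BASE_URL = "https://api.datacite.org/dois"
MAX_PAGE_SIZE = 1000  # upper bound stated for NUMBER_OF_ITEM_PER_PAGE


def build_harvest_url(cursor, page_size, from_ts, to_ts):
    """Build the incremental-harvesting URL (hypothetical helper)."""
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError(f"page_size must be between 1 and {MAX_PAGE_SIZE}")
    params = {
        "page[cursor]": cursor,
        "page[size]": page_size,
        "query": f"updated:[{from_ts} TO {to_ts}]",
    }
    # urlencode escapes the brackets and the colon and turns spaces into '+'.
    return f"{BASE_URL}?{urlencode(params)}"
```

The page size is validated up front so that a misconfigured harvest fails before any request is sent.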

Each record contains two pieces of information needed for incremental harvesting:

  • isActive: indicates whether the record has been deleted (isActive:false)
  • updated: timestamp of last update
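
The two fields could be used as in the sketch below, which separates deleted records from active ones and finds the most recent update. It assumes that both fields sit under the record's attributes object (as the updated path in the mapping suggests) and that a missing isActive flag means the record is active. The function names are hypothetical.

```python
def split_by_status(records):
    """Partition harvested records into (active, deleted) lists."""
    active, deleted = [], []
    for record in records:
        attributes = record.get("attributes", {})
        # Assumption: a record without an isActive flag is treated as active.
        if attributes.get("isActive", True):
            active.append(record)
        else:
            deleted.append(record)
    return active, deleted


def latest_update(records):
    """Return the largest 'updated' value found, or None if there is none."""
    timestamps = [
        record["attributes"]["updated"]
        for record in records
        if "updated" in record.get("attributes", {})
    ]
    return max(timestamps) if timestamps else None
```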

Collection Workflow

The collection workflow is responsible for aggregating new Datacite records. Each record is stored in a table called Native Datacite Store with the following schema:

  • DOI: the DOI PID of the Datacite record (primary key)
  • update_timestamp: the timestamp of the last update
  • json: the native record JSON

During the collection workflow, we identify the most recent update date among the stored records. The collection phase then downloads all new Datacite records and updates the existing ones through the API, using this date as the FROM_DATE_TIMESAMP variable.
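
The store and the incremental step can be sketched with an in-memory SQLite table. This is only an illustration of the schema and the upsert-by-DOI behaviour described above: the actual storage backend is not stated here, and the table and function names are hypothetical.

```python
import json
import sqlite3


def create_store(conn):
    """Create a stand-in for the Native Datacite Store."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS native_datacite_store (
               doi TEXT PRIMARY KEY,
               update_timestamp INTEGER NOT NULL,
               json TEXT NOT NULL
           )"""
    )


def upsert_record(conn, doi, update_timestamp, record):
    """Insert a new record, or overwrite the stored one with the same DOI."""
    conn.execute(
        "INSERT OR REPLACE INTO native_datacite_store "
        "(doi, update_timestamp, json) VALUES (?, ?, ?)",
        (doi, update_timestamp, json.dumps(record)),
    )


def latest_update_timestamp(conn):
    """Return the most recent update_timestamp, or None for an empty store."""
    row = conn.execute(
        "SELECT MAX(update_timestamp) FROM native_datacite_store"
    ).fetchone()
    return row[0]
```

The value returned by latest_update_timestamp would then be used as FROM_DATE_TIMESAMP for the next harvest.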

Datacite Mapping

The table below describes the mapping from the Datacite JSON records to the OpenAIRE Graph dump format.

| OpenAIRE Result field path | Datacite record JSON path | Notes |
| --- | --- | --- |
| id | \attributes\doi | The identifier is created by following the OpenAIRE PID generation policy |
| instance, instance.type | \attributes\types\resourceType, \attributes\types\resourceTypeGeneral, \attributes\types\schemaOrg | Use the vocabulary dnet:publication_resource to find a synonym of one of these terms and obtain the instance.type. Then, using the dnet:result_typologies vocabulary, look up the instance.type synonym to generate one of the following main entities: publication, dataset, software, otherresearchproduct |
| pid | \attributes\doi | scheme = doi |
| originalid | \attributes\doi | |
| dateofcollection | \attributes\updated | The timestamp is defined in milliseconds; we convert it to the "yyyy-MM-dd'T'HH:mm:ssZ" format |
| author | \attributes\creators | Each creator is mapped to an author entity with the subfields below. If the record has no creator, it is skipped |
| author.fullname | \attributes\creators\name | If name is not defined, we construct it from the given and family names |
| author.rank | | Incremental index starting from 1 |
| author.name | \attributes\creators\givenName | |
| author.surname | \attributes\creators\familyName | |
| author.pid | \attributes\creators\nameIdentifiers | The list of PIDs associated with the creator |
| author.pid.scheme | \attributes\creators\nameIdentifiers | Mapped with the vocabulary dnet:pid_types |
| author.pid.value | \attributes\creators\nameIdentifiers\nameIdentifier | The PID value |
| maintitle | \attributes\titles | Titles whose title type is null or Main |
| subtitle | \attributes\titles | Titles whose title type is Subtitle (the title type vocabulary in OpenAIRE uses the Datacite title type vocabulary) |
| date section | | Applies to each date. In particular, for DOIs starting with 10.14457 we apply a Thai date fix: convert the date to ThaiBuddhistDate and reformat it to the local one (see ticket #6791) |
| publicationdate | \attributes\dates | Where dateType is issued |
| publicationdate | \attributes\publicationYear | We create a date in the format 01-01-publicationYear |
| embargoenddate | \attributes\dates | Where dateType is available |
| subjects | \attributes\subject | scheme=keywords |
| description | \attributes\descriptions | |
| publisher | \attributes\publisher | |
| language | \attributes\language | Cleaned by using the vocabulary dnet:languages |
| instance.license | \attributes\rightsList | If the right value starts with http and matches a particular regex |
| instance.accessright | \attributes\rightsList | If not present: unknown. If the datasource is figshare: open. If embargo_date < today(): OPEN |
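
The author rows of the mapping can be sketched as follows. This is an illustrative reading of the table, not the actual implementation. The function name is hypothetical, the "given family" order used for the fallback full name is an assumption, and the dnet:pid_types vocabulary lookup for the PID scheme is left out because it is not described here.

```python
def map_authors(attributes):
    """Map Datacite creators to a list of author dictionaries.

    An empty list means the record has no creator; the caller is then
    expected to skip the record.
    """
    authors = []
    creators = attributes.get("creators") or []
    for rank, creator in enumerate(creators, start=1):  # rank starts from 1
        given = creator.get("givenName")
        family = creator.get("familyName")
        fullname = creator.get("name")
        if not fullname:
            # Assumption: the fallback full name is "given family".
            fullname = " ".join(part for part in (given, family) if part)
        pids = [
            {"value": identifier.get("nameIdentifier")}
            for identifier in creator.get("nameIdentifiers") or []
        ]
        authors.append(
            {
                "fullname": fullname,
                "rank": rank,
                "name": given,
                "surname": family,
                "pid": pids,
            }
        )
    return authors
```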

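The two date conversions mentioned in the mapping (dateofcollection and the publicationYear fallback) can be illustrated with the sketch below. It assumes that the millisecond timestamp is interpreted in UTC, which is not stated in the mapping. The function names are hypothetical.

```python
from datetime import datetime, timezone


def format_date_of_collection(updated_ms):
    """Convert a millisecond timestamp to the yyyy-MM-dd'T'HH:mm:ssZ format."""
    # Assumption: the timestamp is interpreted in UTC.
    moment = datetime.fromtimestamp(updated_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%S%z")


def publication_date_from_year(publication_year):
    """Build the 01-01-publicationYear date used for publicationdate."""
    return f"01-01-{publication_year}"
```

For example, format_date_of_collection(0) yields "1970-01-01T00:00:00+0000".
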
Mapping Relation

| OpenAIRE Relation Semantic and inverse | Datacite record JSON path | Source/Target type | Notes |
| --- | --- | --- | --- |
| isProducedBy | \attributes\fundingReferences | Result/Project | We must identify whether the value matches this pattern: (info:eu-repo/grantagreement/ec/h2020/)(\d{6})(.*) |
| IsProvidedBy | | Result/DataSource | The datasource is always Datacite |
| IsHostedBy | \attributes\relationships\client\id | Result/DataSource | We defined a curated clientId/Datasource map; if we find a match, we create a hostedBy relation |
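
The isProducedBy check can be sketched with the pattern quoted above. Which field of fundingReferences is tested is not stated here, so the function simply takes a string. Lower-casing the input before matching is an assumption, made because the pattern itself is lower-case. The function name is hypothetical.

```python
import re

# Pattern quoted in the relation table above.
H2020_PATTERN = re.compile(
    r"(info:eu-repo/grantagreement/ec/h2020/)(\d{6})(.*)"
)


def extract_h2020_grant_id(funding_reference):
    """Return the six-digit H2020 grant number, or None if there is no match."""
    # Assumption: the value is lower-cased before matching.
    match = H2020_PATTERN.match(funding_reference.strip().lower())
    return match.group(2) if match else None
```

A non-None result would then identify the project that the isProducedBy relation points to.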

Relation Resolution