Compare commits

..

1 Commits

Author SHA1 Message Date
Serafeim Chatzopoulos 08fc4a72fd (WIP) - Enable Algolia search plugin 2022-09-21 21:13:56 +03:00
53 changed files with 1622 additions and 2896 deletions

2
.gitignore vendored
View File

@ -18,5 +18,3 @@
npm-debug.log* npm-debug.log*
yarn-debug.log* yarn-debug.log*
yarn-error.log* yarn-error.log*
.idea/

View File

@ -1,52 +0,0 @@
## Base ########################################################################
# Use a larger node image to do the build for native deps (e.g., gcc, python)
FROM node:lts as base
# Reduce npm log spam and colour during install within Docker
ENV NPM_CONFIG_LOGLEVEL=error
ENV NPM_CONFIG_COLOR=false
# We'll run the app as the `node` user, so put it in their home directory
WORKDIR /home/node/app
RUN chown -R node:node /home/node
# Copy the source code over
COPY --chown=node:node . /home/node/app/
## Development #################################################################
# Define a development target that installs devDeps and runs in dev mode
FROM base as development
WORKDIR /home/node/app
# Install (not ci) with dependencies, and for Linux vs. Linux Musl (which we use for -alpine)
RUN npm install
# Switch to the node user vs. root
USER node
# Expose port 3000
EXPOSE 3000
# Start the app in debug mode so we can attach the debugger
CMD ["npm", "start"]
## Production ##################################################################
# Also define a production target which doesn't use devDeps
FROM base as production
WORKDIR /home/node/app
COPY --chown=node:node --from=development /home/node/app/node_modules /home/node/app/node_modules
# Build the Docusaurus app
RUN npm run build
## Deploy ######################################################################
# Use a stable nginx image
FROM nginx:stable-alpine as deploy
WORKDIR /home/node/app
# Copy what we've installed/built from production
COPY --from=production /home/node/app/build /usr/share/nginx/html/

View File

@ -2,56 +2,30 @@
This website is built using [Docusaurus 2](https://docusaurus.io/); please check [here](https://docusaurus.io/docs/installation#requirements) the requirements to run the project. This website is built using [Docusaurus 2](https://docusaurus.io/); please check [here](https://docusaurus.io/docs/installation#requirements) the requirements to run the project.
## Clone repository ### Clone repository
``` ```
$ git clone https://code-repo.d4science.org/D-Net/openaire-graph-docs.git $ git clone https://code-repo.d4science.org/D-Net/openaire-graph-docs.git
``` ```
## Local installation and deployment ### Installation
To install the required packages use:
``` ```
$ npm install $ npm install
``` ```
The following command starts a local development server and opens up a browser window. Note that most changes are reflected live without having to restart the server. ### Local Development
``` ```
$ npm run start $ npm run start
``` ```
Generate the static content into the `build` directory using the command tha follows. Then this directory can be served using any static contents hosting service. This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server.
### Build and deploy
``` ```
$ npm run build $ npm run build
``` ```
## Deployment using Docker This command generates static content into the `build` directory and can be served using any static contents hosting service.
### Development
Build docker container for development and run on port 3000:
```
docker build --target development -t docs:dev .
sudo docker run -d -p 3000:3000 docs:dev
```
### Production
Build docker container for production and run on port 80:
```
docker build -t docs:latest .
docker run -d -p 80:80 docs:latest
```
## Documentation versioning
The versioning documentation of Docusaurus can be found [here](https://docusaurus.io/docs/versioning).
Specifically, a new version can be created with the following command:
```
npm run docusaurus docs:version <versionName>
```
When tagging a new version, the document versioning mechanism will:
* Copy the full `docs/` folder contents into a new `versioned_docs/version-<versionName>/` folder.
* Create a versioned sidebars file based from your current sidebar configuration, saved as `versioned_sidebars/version-<versionName>-sidebars.json`.
* Append the new version number to `versions.json`.

View File

@ -1,6 +0,0 @@
---
sidebar_position: 12
---
# Changelog
<span className="todo">TODO</span>

View File

@ -1,6 +1,6 @@
# Data model # Data model
The OpenAIRE Research Graph comprises several types of [entities](../category/entities) and [relationships](./relationships) among them. The OpenAIRE Research Graph comprises several types of entities and [relationships](./relationships) among them.
The latest version of the JSON schema can be found on [Bulk downloads](../download). The latest version of the JSON schema can be found on [Bulk downloads](../download).
@ -15,9 +15,12 @@ Its main entities are described in brief below:
* [Data Sources](entities/data-source) are the resources used to collect metadata for the graph objects * [Data Sources](entities/data-source) are the resources used to collect metadata for the graph objects
* [Organizations](entities/organization) correspond to companies or research institutions involved in projects, * [Organizations](entities/organization) correspond to companies or research institutions involved in projects,
responsible for operating data sources or consisting the affiliations of Product creators. responsible for operating data sources or consisting the affiliations of Product creators.
<!-- * [Funders](entities/funder) (e.g. EC, Wellcome Trust) are agencies responsible for a list of Funding Streams. -->
<!-- * [Funding Streams](entities/funding-stream) represent investments (funding actions) from Funders (e.g. FP7 or H2020). -->
* [Projects](entities/project) are research projects funded by a Funding Stream of a Funder. * [Projects](entities/project) are research projects funded by a Funding Stream of a Funder.
* [Communities](entities/community) are groups of people with a common research intent. * [Communities](entities/community) are groups of people with a common research intent.
:::note Further reading :::note Further reading
A detailed report on the OpenAIRE Research Graph Data Model can be found on [Zenodo](https://zenodo.org/record/2643199). A detailed report on the OpenAIRE Research Graph Data Model can be found on [Zenodo](https://zenodo.org/record/2643199).

View File

@ -2,81 +2,7 @@
sidebar_position: 6 sidebar_position: 6
--- ---
# Community
Research communities and research initiatives are intended as groups of people with a common research intent and can be of two types: research initiatives or research communities: # Community (Initiative)
<span className="todo">TODO</span>
* Research initiatives are intended to capture a view of the information space that is "research impact"-oriented, i.e. all products generated due to my research initiative;
* Research communities the latter “research activity” oriented, i.e. all products that may be of interest or related to my research initiative.
For example, the organizations supporting a research infrastructure fall in the first category, while the researchers involved in a discipline fall in the second.
## The `Community` object
### id
_Type: String &bull; Cardinality: ONE_
The OpenAIRE id for the community/research infrastructure, created according to the [OpenAIRE entity identifier and PID mapping policy](../pids-and-identifiers).
```json
"id": "00|context_____::5b7f9fa40bdc12072249204cedfa7808"
```
### acronym
_Type: String &bull; Cardinality: ONE_
The acronym of the community.
```json
"acronym": "covid-19"
```
### description
_Type: String &bull; Cardinality: ONE_
Description of the research community/research infrastructure
```json
"description": "This portal provides access to publications, research data, projects and software that may be relevant to the Corona Virus Disease (COVID-19). The OpenAIRE COVID-19 Gateway aggregates COVID-19 related records, links them and provides a single access point for discovery and navigation. We tag content from the OpenAIRE Research Graph (10,000+ data sources) and additional sources. All COVID-19 related research results are linked to people, organizations and projects, providing a contextualized navigation."
```
### name
_Type: String &bull; Cardinality: ONE_
The long name of the community.
```json
"name": "Corona Virus Disease"
```
### subject
_Type: String &bull; Cardinality: MANY_
The list of the subjects associated to the research community (only appies to research communities).
```json
"subject": [
"COVID19",
"SARS-CoV",
"HCoV-19",
...
]
```
### type
_Type: String &bull; Cardinality: ONE_
The type of the community; one of `{ Research Community, Research infrastructure }`.
```json
"type": "Research Community"
```
### zenodo_community
_Type: String &bull; Cardinality: ONE_
The URL of the Zenodo community associated to the Research community/Research infrastructure.
```json
"zenodo_community": "https://zenodo.org/communities/covid-19"
```

View File

@ -4,9 +4,10 @@ sidebar_position: 2
# Data source # Data source
OpenAIRE entity instances are created out of data collected from various data sources of different kinds, such as publication repositories, dataset archives, CRIS systems, funder databases, etc. Data sources export information packages (e.g., XML records, HTTP responses, RDF data, JSON) that may contain information on one or more of such entities and possibly relationships between them. OpenAIRE entity instances are created out of data collected from various data sources of different kinds, such as publication repositories, dataset archives, CRIS systems, funder databases, etc. Data sources export information packages (e.g., XML records, HTTP responses, RDF data, JSON) that may contain information on one or more of such entities and possibly relationships between them. For example, a metadata record about a project carries information for the creation of a Project entity and its participants (as Organization entities). It is important, once each piece of information is extracted from such packages and inserted into the OpenAIRE information space as an entity, for such pieces to keep provenance information relative to the originating data source. This is to give visibility to the data source, but also to enable the reconstruction of the very same piece of information if problems arise.
For example, a metadata record about a project carries information for the creation of a Project entity and its participants (as Organization entities). It is important, once each piece of information is extracted from such packages and inserted into the OpenAIRE information space as an entity, for such pieces to keep provenance information relative to the originating data source. This is to give visibility to the data source, but also to enable the reconstruction of the very same piece of information if problems arise.
<span className="todo">Definitions for the re3data specific elements from: https://gfzpublic.gfz-potsdam.de/rest/items/item_758898_6/component/file_775891/content</span>
--- ---
@ -15,280 +16,127 @@ For example, a metadata record about a project carries information for the creat
### id ### id
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The OpenAIRE id of the data source, created according to the [OpenAIRE entity identifier and PID mapping policy](../pids-and-identifiers). Main entity identifier, created according to [OpenAIRE_entity_identifier_and_PID_mapping_policy](https://support.openaire.eu/projects/docs/wiki/OpenAIRE_entity_identifier_and_PID_mapping_policy).
```json
"id": "10|issn___print::22c514d022b199c346e7f29ca06efc95"
```
### originalId ### originalId
_Type: String &bull; Cardinality: MANY_ _Type: String &bull; Cardinality: MANY_
The list of original identifiers associated to the datasource. The list of original ids associated to the datasource.
```json
"originalId": [
"issn___print::2451-8271",
...
]
```
### pid ### pid
_Type: [ControlledField](other#controlledfield) &bull; Cardinality: MANY_ _Type: [ControlledField](other#controlledfield) &bull; Cardinality: MANY_
The persistent identifiers for the datasource. The persistent identifiers for the datasource.
```json
"pid": [
{
"scheme": "DOI",
"value": "10.5281/zenodo.4707307"
},
...
]
```
### datasourcetype ### datasourcetype
_Type: [ControlledField](other#controlledfield) &bull; Cardinality: ONE_ _Type: [ControlledField](other#controlledfield) &bull; Cardinality: ONE_
The datasource type; see the vocabulary [dnet:datasource_typologies](https://api.openaire.eu/vocabularies/dnet:datasource_typologies). The datasource type (e.g. pubsrepository::institutional, Institutional Repository) as in the vocabulary [dnet:datasource_typologies](https://api.openaire.eu/vocabularies/dnet:datasourceCompatibilityLevel).
```json
"datasourcetype": {
"scheme": "pubsrepository::journal",
"value": "Journal"
}
```
### openairecompatibility ### openairecompatibility
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The OpenAIRE compatibility of the ingested results, indicates which guidelines they are compliant according to the vocabulary [dnet:datasourceCompatibilityLevel](https://api.openaire.eu/vocabularies/dnet:datasourceCompatibilityLevel). The OpenAIRE compatibility of the ingested results, indicates which guidelines they are compliant to the vocabulary [dnet:datasourceCompatibilityLevel](https://api.openaire.eu/vocabularies/dnet:datasourceCompatibilityLevel).
```json
"openairecompatibility": "collected from a compatible aggregator"
```
### officialname ### officialname
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The official name of the datasource. The official name of the datasource.
```json
"officialname": "Recent Patents and Topics on Medical Imaging"
```
### englishname ### englishname
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The English name of the datasource. The English name of the datasource.
```json
"englishname": "Recent Patents and Topics on Medical Imaging"
```
### websiteurl ### websiteurl
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The URL of the website of the datasource. The URL of the website of the datasource.
```json
"websiteurl": "http://dspace.unict.it/"
```
### logourl ### logourl
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The URL of the logo for the datasource. The URL of the logo for the datasource.
```json
"logourl": "https://impactum-journals.uc.pt/public/journals/26/pageHeaderLogoImage_en_US.png"
```
### dateofvalidation ### dateofvalidation
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The date of validation against the OpenAIRE guidelines for the datasource records. The date of validation against the guidelines for the datasource records.
```json
"dateofvalidation": "2016-10-10"
```
### description ### description
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The description for the datasource. The description for the datasource.
```json
"description": "Recent Patents on Medical Imaging publishes review and research articles, and guest edited single-topic issues on recent patents in the field of medical imaging. It provides an important and reliable source of current information on developments in the field. The journal is essential reading for all researchers involved in Medical Imaging."
```
### subjects ### subjects
_Type: String &bull; Cardinality: MANY_ _Type: String &bull; Cardinality: ONE_
List of subjects associated to the datasource The subjects of the contents provided by the datasource.
```json
"subjects": [
"Medicine",
"Imaging",
...
]
```
### languages ### languages
_Type: String &bull; Cardinality: MANY_ _Type: String &bull; Cardinality: MANY_
The languages present in the data source's content, as defined by OpenDOAR. The languages of the contents provided by the datasource (OpenDOAR only).
```json
"languages":[
"eng",
...
]
```
### contenttypes ### contenttypes
_Type: String &bull; Cardinality: MANY_ _Type: String &bull; Cardinality: MANY_
Types of content in the data source, as defined by OpenDOAR The typologies of the contents provided by the datasource (OpenDOAR only).
```json
"contenttypes": [
"Journal articles",
...
]
```
### releasestartdate ### releasestartdate
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
Releasing date of the data source, as defined by re3data.org. <span className="todo">TODO</span>
```json
"releasestartdate": "2010-07-24"
```
### releaseenddate ### releaseenddate
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
Date when the data source went offline or stopped ingesting new research data. As defined by re3data.org <span className="todo">TODO</span>
```json
"releaseenddate": "2016-03-28"
```
### accessrights ### accessrights
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
Type of access to the data source, as defined by re3data.org. Possible values: `{ open, restricted, closed }`. Open, restricted or closed.
```json
"accessrights": "open"
```
### uploadrights ### uploadrights
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
Type of data upload, as defined by re3data.org; one of `{ open, restricted, closed }`. Open, restricted or closed.
```json
"uploadrights": "closed"
```
### databaseaccessrestriction ### databaseaccessrestriction
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
Access restrictions to the research data repository. Allowed values are: `{ feeRequired, registration, other }`. All existing access restrictions to the research data repository. Allowed values are: feeRequired, registration, other (re3data only).
This field only applies for re3data data source; see [re3data schema specification](https://gfzpublic.gfz-potsdam.de/rest/items/item_758898_6/component/file_775891/content) for more details.
```json
"databaseaccessrestriction": "registration"
```
### datauploadrestriction ### datauploadrestriction
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
Upload restrictions applied by the datasource, as defined by re3data.org. One of `{ feeRequired, registration, other }`. All existing restrictions to the data upload. (re3data only).
This field only applies for re3data data source; see [re3data schema specification](https://gfzpublic.gfz-potsdam.de/rest/items/item_758898_6/component/file_775891/content) for more details.
```json
"datauploadrestriction": "feeRequired registration"
```
### versioning ### versioning
_Type: Boolean &bull; Cardinality: ONE_ _Type: Boolean &bull; Cardinality: ONE_
Whether the research data repository supports versioning: The research data repository supports versioning of research data. (re3data only).
`yes` if the data source supports versioning, `no` otherwise.
This field only applies for re3data data source; see [re3data schema specification](https://gfzpublic.gfz-potsdam.de/rest/items/item_758898_6/component/file_775891/content) for more details.
```json
"versioning": true
```
### citationguidelineurl ### citationguidelineurl
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The URL of the data source providing information on how to cite its items. The DataCite citation format is recommended (http://www.datacite.org/whycitedata). The URL of the research data repository providing information on how to cite its research data. The DataCite citation format is recommended (http://www.datacite.org/whycitedata). (re3data only)
This field only applies for re3data data source; see [re3data schema specification](https://gfzpublic.gfz-potsdam.de/rest/items/item_758898_6/component/file_775891/content) for more details.
```json
"citationguidelineurl": "https://physionet.org/about/#citation"
```
### pidsystems ### pidsystems
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The persistent identifier system that is used by the data source. As defined by re3data.org.
```json
"pidsystems": "hdl"
```
### certificates ### certificates
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The certificate, seal or standard the data source complies with. As defined by re3data.org. <span className="todo">TODO</span>
```json
"certificates": "WDS"
```
### policies ### policies
_Type: String &bull; Cardinality: MANY_ _Type: String &bull; Cardinality: MANY_
Policies of the data source, as defined in OpenDOAR. <span className="todo">TODO</span>
### journal ### journal
_Type: [Container](other#container) &bull; Cardinality: ONE_ _Type: [Container](other#container) &bull; Cardinality: ONE_
Information about the journal, if this data source is of type Journal. <span className="todo">TODO</span>
```json
"container": {
"edition": "",
"iss": "5",
"issnLinking": "",
"issnOnline": "1873-7625",
"issnPrinted":"2451-8271",
"name": "Recent Patents and Topics on Imaging",
"sp": "12",
"ep": "22",
"vol": "50"
}
```
### missionstatementurl
_Type: String &bull; Cardinality: ONE_
The URL of a mission statement describing the designated community of the data source. As defined by re3data.org
```json
"missionstatementurl": "https://www.sigma2.no/content/nird-research-data-archive"
```

View File

@ -0,0 +1,8 @@
---
sidebar_position: 8
---
# OpenAIRE entity identifier and PID mapping policy
https://support.openaire.eu/projects/docs/wiki/OpenAIRE_entity_identifier_and_PID_mapping_policy
<span className="todo">TODO: include this here? it referenced by many other pages</span>

View File

@ -14,81 +14,35 @@ Organizations include companies, research centers or institutions involved as pr
### id ### id
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The OpenAIRE id for the organization, created according to the [OpenAIRE entity identifier and PID mapping policy](../pids-and-identifiers). Main entity identifier, created according to [OpenAIRE_entity_identifier_and_PID_mapping_policy](https://support.openaire.eu/projects/docs/wiki/OpenAIRE_entity_identifier_and_PID_mapping_policy).
```json
"id": "20|openorgs____::b84450f9864182c67b8611b5593f4250"
```
### legalshortname ### legalshortname
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The legal name in short form of the organization. The legal name in short form of the organization.
```json
"legalshortname": "ARC"
```
### legalname ### legalname
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The legal name of the organization. The legal name of the organization.
```json
"legalname": "Athena Research and Innovation Center In Information Communication & Knowledge Technologies"
```
### alternativenames ### alternativenames
_Type: String &bull; Cardinality: MANY_ _Type: String &bull; Cardinality: MANY_
Alternative names that identify the organization. The alternative names of the organization.
```json
"alternativenames": [
"Athena Research and Innovation Center In Information Communication & Knowledge Technologies",
"Athena RIC",
"ARC",
...
]
```
### websiteurl ### websiteurl
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The websiteurl of the organization. The websiteurl of the organization.
```json
"websiteurl": "https://www.athena-innovation.gr/el/announce/pressreleases.html"
```
### country ### country
_Type: [Country](other#country) &bull; Cardinality: ONE_ _Type: [Country](other#country) &bull; Cardinality: ONE_
The country where the organization is located. The country where the organization is located.
```json
"country":{
"code": "GR",
"label": "Greece"
}
```
### pid ### pid
_Type: [OrganizationPid](other#organizationpid) &bull; Cardinality: MANY_ _Type: [OrganizationPid](other#organizationpid) &bull; Cardinality: MANY_
The list of persistent identifiers for the organization. The list of persistent identifiers for the organization.
```json
"pid": [
{
"scheme": "ISNI",
"value": "0000 0004 0393 5688"
},
{
"scheme": "GRID",
"value":
"grid.19843.37"
},
...
]
```

File diff suppressed because it is too large Load Diff

View File

@ -13,159 +13,19 @@ Of crucial interest to OpenAIRE is also the identification of the funders (e.g.
### id ### id
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
Main entity identifier, created according to the [OpenAIRE entity identifier and PID mapping policy](../pids-and-identifiers). Main entity identifier, created according to [OpenAIRE_entity_identifier_and_PID_mapping_policy](https://support.openaire.eu/projects/docs/wiki/OpenAIRE_entity_identifier_and_PID_mapping_policy).
```json
"id": "40|corda__h2020::70ea22400fd890c5033cb31642c4ae68"
```
### code ### code
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
Τhe grant agreement code of the project. Τhe grant agreement code of the project.
```json
"code": "777541"
```
### acronym ### acronym
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
Project's acronym. Project's acronym.
```json
"acronym": "OpenAIRE-Advance"
```
### title ### title
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
Project's title. Project's title.
```json
"title": "OpenAIRE Advancing Open Scholarship"
```
### callidentifier
_Type: String &bull; Cardinality: ONE_
The identifier of the research call.
```json
"callidentifier": "H2020-EINFRA-2017"`
```
### funding
_Type: [Funding](other#funding) &bull; Cardinality: MANY_
Funding information for the project.
```json
"funding": [
{
"funding_stream": {
"description": "Horizon 2020 Framework Programme - Research and Innovation action",
"id": "EC::H2020::RIA"
},
"jurisdiction": "EU",
"name": "European Commission",
"shortName": "EC"
}
]
```
### granted
_Type: [Grant](other#grant) &bull; Cardinality: ONE_
The money granted to the project.
```json
"granted": {
"currency": "EUR",
"fundedamount": 1.0E7,
"totalcost": 1.0E7
}
```
### h2020programme
_Type: [H2020Programme](other#h2020programme) &bull; Cardinality: MANY_
The H2020 programme funding the project.
```json
"h2020programme":[
{
"code": "H2020-EU.1.4.1.3.",
"description": "Development, deployment and operation of ICT-based e-infrastructures"
}
]
```
### keywords
_Type: String &bull; Cardinality: ONE_
```json
"keywords": [
"Open Science",
...
]
```
### openaccessmandatefordataset
_Type: Boolean &bull; Cardinality: ONE_
```json
"openaccessmandatefordataset": true
```
### openaccessmandateforpublications
_Type: Boolean &bull; Cardinality: ONE_
```json
"openaccessmandateforpublications": true
```
### startdate
_Type: String &bull; Cardinality: ONE_
The start year of the project.
```json
"startdate": "2018-01-01"
```
### enddate
_Type: String &bull; Cardinality: ONE_
The end year pf the project.
```json
"enddate": "2021-02-28"
```
### subject
_Type: String &bull; Cardinality: MANY_
The subjects of the project
```json
"subject": [
"Data and Distributed Computing e-infrastructures for Open Science",
...
]
```
### summary
_Type: String &bull; Cardinality: ONE_
Short summary of the project.
```json
"summary": "OpenAIRE-Advance continues the mission of OpenAIRE to support the Open Access/Open Data mandates in Europe. By sustaining the current successful infrastructure, comprised of a human network and robust technical services, it consolidates its achievements while working to shift the momentum among its communities to Open Science, aiming to be a trusted e-Infrastructurewithin the realms of the European Open Science Cloud.In this next phase, OpenAIRE-Advance strives to empower its National Open Access Desks (NOADs) so they become a pivotal part within their own national data infrastructures, positioningOA and open science onto national agendas. The capacity building activities bring together experts ontopical task groups in thematic areas(open policies, RDM, legal issues, TDM), promoting a train the trainer approach, strengthening and expanding the pan-European Helpdesk with support and training toolkits, training resources and workshops.It examines key elements of scholarly communication, i.e., co-operative OA publishing and next generation repositories, to develop essential building blocks of the scholarly commons.On the technical level OpenAIRE-Advance focuses on the operation and maintenance of the OpenAIRE technical TRL8/9 services,and radically improvesthe OpenAIRE services on offer by: a) optimizing their performance and scalability, b) refining their functionality based on end-user feedback, c) repackagingthem into products, taking a professional marketing approach with well-defined KPIs, d)consolidating the range of services/products into a common e-Infra catalogue to enable a wider uptake.OpenAIRE-Advancesteps up its outreach activities with concrete pilots with three major RIs,citizen science initiatives, and innovators via a rigorous Open Innovation programme. Finally, viaits partnership with COAR, OpenAIRE-Advance consolidatesOpenAIREs global roleextending its collaborations with Latin America, US, Japan, Canada, and Africa."
```
### websiteurl
_Type: String &bull; Cardinality: ONE_
The website of the project
```json
"websiteurl": "https://www.openaire.eu/advance/"
```

View File

@ -20,11 +20,8 @@ Moreover, there are the following sub-types of a `Result`, that inherit all its
### id ### id
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
Main entity identifier, created according to the [OpenAIRE entity identifier and PID mapping policy](../pids-and-identifiers). Main entity identifier, created according to
<span className="todo">[OpenAIRE entity identifier and PID mapping policy](https://support.openaire.eu/projects/docs/wiki/OpenAIRE_entity_identifier_and_PID_mapping_policy)</span>.
```json
"id": "50|doi_dedup___::80f29c8c8ba18c46c88a285b7e739dc3"
```
### type ### type
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
@ -38,293 +35,102 @@ Type of the result. Possible types:
as declared in the terms from the [dnet:result_typologies vocabulary](https://api.openaire.eu/vocabularies/dnet:result_typologies). as declared in the terms from the [dnet:result_typologies vocabulary](https://api.openaire.eu/vocabularies/dnet:result_typologies).
```json
"type": "publication"
```
### originalId ### originalId
_Type: String &bull; Cardinality: MANY_ _Type: String &bull; Cardinality: MANY_
Identifiers of the record at the original sources. Identifiers of the record at the original sources.
```json
"originalId": [
"oai:pubmedcentral.nih.gov:8024784",
"S0048733321000305",
"10.1016/j.respol.2021.104226",
"3136742816"
]
```
### maintitle ### maintitle
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
A name or title by which a scientific result is known. May be the title of a publication, of a dataset or the name of a piece of software. A name or title by which a scientific result is known. May be the title of a publication, of a dataset or the name of a piece of software.
```json
"maintitle": "The fall of the innovation empire and its possible rise through open science"
```
### subtitle ### subtitle
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
Explanatory or alternative name by which a scientific result is known. Explanatory or alternative name by which a scientific result is known.
```json
"subtitle": "An analysis of cases from 1980 - 2020"
```
### author ### author
_Type: [Author](other#author) &bull; Cardinality: MANY_ _Type: [Author](other#author) &bull; Cardinality: MANY_
The main researchers involved in producing the data, or the authors of the publication. The main researchers involved in producing the data, or the authors of the publication.
```json
"author": [
{
"fullname": "E. Richard Gold",
"rank": 1,
"name": "Richard",
"surname": "Gold",
"pid": {
"id": {
"scheme": "orcid",
"value": "0000-0002-3789-9238"
},
"provenance"; {
"provenance": "Harvested",
"trust": "0.9"
}
}
},
...
]
```
### bestaccessright ### bestaccessright
_Type: [BestAccessRight](other#bestaccessright) &bull; Cardinality: ONE_ _Type: [BestAccessRight](other#bestaccessright) &bull; Cardinality: ONE_
The most open access right associated to the manifestations of this research results. The most open access right associated to the manifestations of this research results.
```json
"bestaccessright": {
"code": "c_abf2",
"label": "OPEN",
"scheme": "http://vocabularies.coar-repositories.org/documentation/access_rights/"
}
```
### contributor ### contributor
_Type: String &bull; Cardinality: MANY_ _Type: String &bull; Cardinality: MANY_
The institution or person responsible for collecting, managing, distributing, or otherwise contributing to the development of the resource. The institution or person responsible for collecting, managing, distributing, or otherwise contributing to the development of the resource.
```json
"contributor": [
"University of Zurich",
"Wright, Aidan G C",
"Hallquist, Michael",
...
]
```
### country ### country
_Type: [ResultCountry](other#resultcountry) &bull; Cardinality: MANY_ _Type: [ResultCountry](other#resultcountry) &bull; Cardinality: MANY_
Country associated with the result because it is the country of the organisation that manages the institutional repository or national aggregator or CRIS system from which this record was collected Country associated with the result because it is the country of the organisation that manages the institutional repository or national aggregator or CRIS system from which this record was collected
Country of affiliations of authors can be found instead in the affiliation rel. Country of affiliations of authors can be found instead in the affiliation rel.
```json
"country": [
{
"code": "CH",
"label": "Switzerland",
"provenance": {
"provenance": "Inferred by OpenAIRE",
"trust": "0.85"
}
},
...
]
```
### coverage ### coverage
_Type: String &bull; Cardinality: MANY_ _Type: String &bull; Cardinality: MANY_
<span className="todo">TODO</span>
### dateofcollection ### dateofcollection
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
When OpenAIRE collected the record the last time. When OpenAIRE collected the record the last time.
<span className="todo">TODO: we should indicate the used date format</span>
```json
"dateofcollection": "2021-06-09T11:37:56.248Z"
```
### description ### description
_Type: String &bull; Cardinality: MANY_ _Type: String &bull; Cardinality: MANY_
A brief description of the resource and the context in which the resource was created. A brief description of the resource and the context in which the resource was created.
```json
"description": [
"Open science partnerships (OSPs) are one mechanism to reverse declining efficiency. OSPs are public-private partnerships that openly share publications, data and materials.",
"There is growing concern that the innovation system's ability to create wealth and attain social benefit is declining in effectiveness. This article explores the reasons for this decline and suggests a structure, the open science partnership, as one mechanism through which to slow down or reverse this decline.",
"The article examines the empirical literature of the last century to document the decline. This literature suggests that the cost of research and innovation is increasing exponentially, that researcher productivity is declining, and, third, that these two phenomena have led to an overall flat or declining level of innovation productivity.",
...
]
```
### embargoenddate ### embargoenddate
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
Date when the embargo ends and this result turns Open Access. Date when the embargo ends and this result turns Open Access. <span className="todo">TODO: we should indicate the used date format</span>
```json
"embargoenddate": "2017-01-01"
```
### instance ### instance
_Type: [Instance](other#instance) &bull; Cardinality: MANY_ _Type: [Instance](other#instance) &bull; Cardinality: MANY_
Specific materialization or version of the result. For example, you can have one result with three instances: one is the pre-print, one is the post-print, one is the published version. Specific materialization or version of the result. For example, you can have one result with three instances: one is the pre-print, one is the post-print, one is the published version
```json
"instance": [
{
"accessright": {
"code": "c_abf2",
"label": "OPEN",
"openAccessRoute": "gold",
"scheme": "http://vocabularies.coar-repositories.org/documentation/access_rights/"
},
"alternateIdentifier": [
{
"scheme": "doi",
"value": "10.1016/j.respol.2021.104226"
},
...
],
"articleprocessingcharge": {
"amount": "4063.93",
"currency": "EUR"
},
"license": "http://creativecommons.org/licenses/by-nc/4.0",
"measures":[
{
"key": "influence",
"value": "6.45335454246e-09"
},
...
],
"pid": [
{
"scheme": "pmc",
"value": "PMC8024784"
},
...
],
"publicationdate": "2021-01-01",
"refereed": "UNKNOWN",
"type": "Article",
"url": [
"http://europepmc.org/articles/PMC8024784"
]
},
...
]
```
### language ### language
_Type: [Language](other#language) &bull; Cardinality: ONE_ _Type: [Language](other#language) &bull; Cardinality: ONE_
The alpha-3/ISO 639-2 code of the language. Values controlled by the [dnet:languages vocabulary](https://api.openaire.eu/vocabularies/dnet:languages). The `alpha-3/ISO 639-2` code of the language. Values controlled by the [dnet:languages vocabulary](https://api.openaire.eu/vocabularies/dnet:languages)
```json
"language": {
"code": "eng",
"label": "English"
}
```
### lastupdatetimestamp ### lastupdatetimestamp
_Type: Long &bull; Cardinality: ONE_ _Type: Long &bull; Cardinality: ONE_
Timestamp of last update of the record in OpenAIRE. Timestamp of last update of the record in OpenAIRE.
```json
"lastupdatetimestamp": 1652722279987
```
### pid ### pid
_Type: [ResultPid](other#resultpid) &bull; Cardinality: MANY_ _Type: [ResultPid](other#resultpid) &bull; Cardinality: MANY_
Persistent identifiers of the result. See also the [OpenAIRE entity identifier and PID mapping policy](../pids-and-identifiers) to learn more. Persistent identifiers of the result. See also <span className="todo">[OpenAIRE entity identifier and PID mapping policy](https://support.openaire.eu/projects/docs/wiki/OpenAIRE_entity_identifier_and_PID_mapping_policy)</span> to learn more.
```json
"pid": [
{
"scheme": "pmc",
"value": "PMC8024784"
},
{
"scheme": "doi",
"value": "10.1016/j.respol.2021.104226"
},
...
]
```
### publicationdate ### publicationdate
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
Main date of the research product: typically the publication or issued date. In case of a research result with different versions with different dates, the date of the result is selected as the most frequent well-formatted date. If not available, then the most recent and complete date among those that are well-formatted. For statistics, the year is extracted and the result is counted only among the result of that year. Example: Pre-print date: 2019-02-03, Article date provided by repository: 2020-02, Article date provided by Crossref: 2020, OpenAIRE will set as date 2019-02-03, because its the most recent among the complete and well-formed dates. If then the repository updates the metadata and set a complete date (e.g. 2020-02-12), then this will be the new date for the result because it becomes the most recent most complete date. However, if OpenAIRE then collects the pre-print from another repository with date 2019-02-03, then this will be the “winning date” because it becomes the most frequent well-formatted date. Main date of the research product: typically the publication or issued date. In case of a research result with different versions with different dates, the date of the result is selected as the most frequent well-formatted date. If not available, then the most recent and complete date among those that are well-formatted. For statistics, the year is extracted and the result is counted only among the result of that year. Example: Pre-print date: 2019-02-03, Article date provided by repository: 2020-02, Article date provided by Crossref: 2020, OpenAIRE will set as date 2019-02-03, because its the most recent among the complete and well-formed dates. If then the repository updates the metadata and set a complete date (e.g. 2020-02-12), then this will be the new date for the result because it becomes the most recent most complete date. However, if OpenAIRE then collects the pre-print from another repository with date 2019-02-03, then this will be the “winning date” because it becomes the most frequent well-formatted date.
```json
"publicationdate": "2021-03-18"
```
### publisher ### publisher
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The name of the entity that holds, archives, publishes prints, distributes, releases, issues, or produces the resource. The name of the entity that holds, archives, publishes prints, distributes, releases, issues, or produces the resource.
```json
"publisher": "Elsevier, North-Holland Pub. Co"
```
### source ### source
_Type: String &bull; Cardinality: MANY_ _Type: String &bull; Cardinality: MANY_
A related resource from which the described resource is derived. See definition of Dublin Core field [dc:source](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/elements11/source). A related resource from which the described resource is derived. See definition of Dublin Core field [dc:source](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/elements11/source).
```json
"source": [
"Research Policy",
"Crossref",
...
]
```
### subjects ### subjects
_Type: [Subject](other#subject) &bull; Cardinality: MANY_ _Type: [Subject](other#subject) &bull; Cardinality: MANY_
Subject, keyword, classification code, or key phrase describing the resource. Subject, keyword, classification code, or key phrase describing the resource.
```json
"subjects": [
{
"provenance": {
"provenance": "Harvested",
"trust": "0.9"
},
"subject": {
"scheme": "keyword",
"value": "Open science"
}
},
...
]
```
--- ---
## Sub-types ## Sub-types
@ -333,127 +139,59 @@ There are the following sub-types of `Result`. Each inherits all its fields and
### Publication ### Publication
Metadata records about research literature (includes types of publications listed [here](http://api.openaire.eu/vocabularies/dnet:result_typologies/publication)).
#### container #### container
_Type: [Container](other#container) &bull; Cardinality: ONE_ _Type: [Container](other#container) &bull; Cardinality: ONE_
Container has information about the conference or journal where the result has been presented or published. Container has information about the conference or journal where the result has been presented or published.
```json
"container": {
"edition": "",
"iss": "5",
"issnLinking": "",
"issnOnline": "1873-7625",
"issnPrinted": "0048-7333",
"name": "Research Policy",
"sp": "12",
"ep": "22",
"vol": "50"
}
```
### Dataset ### Dataset
Metadata records about research data (includes the subtypes listed [here](http://api.openaire.eu/vocabularies/dnet:result_typologies/dataset)).
#### size #### size
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The declared size of the dataset. The size of the dataset.
```json
"size": "10129818"
```
#### version #### version
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The version of the dataset. The version of the dataset.
```json
"version": "v1.3"
```
#### geolocation #### geolocation
_Type: [GeoLocation](other#geolocation) &bull; Cardinality: MANY_ _Type: [GeoLocation](other#geolocation) &bull; Cardinality: MANY_
The list of geolocations associated with the dataset. The list of geolocations associated with the dataset.
```json
"geolocation": [
{
"box": "18.569386 54.468973 18.066832 54.83707",
"place": "Tübingen, Baden-Württemberg, Southern Germany",
"point": "7.72486 50.1084"
},
...
]
```
### Software ### Software
Metadata records about research software (includes the subtypes listed [here](http://api.openaire.eu/vocabularies/dnet:result_typologies/software)).
#### documentationUrl #### documentationUrl
_Type: String &bull; Cardinality: MANY_ _Type: String &bull; Cardinality: MANY_
The URLs to the software documentation. The URLs to the software documentation.
```json
"documentationUrl": [
"https://github.com/openaire/iis/blob/master/README.markdown",
...
]
```
#### codeRepositoryUrl #### codeRepositoryUrl
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The URL to the repository with the source code. The URL to the repository holding the source code,
```json
"codeRepositoryUrl": "https://github.com/openaire/iis"
```
#### programmingLanguage #### programmingLanguage
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
The programming language. The programming language.
```json
"programmingLanguage": "Java"
```
### Other research product ### Other research product
Metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed [here](http://api.openaire.eu/vocabularies/dnet:result_typologies/other)).
#### contactperson #### contactperson
_Type: String &bull; Cardinality: MANY_ _Type: String &bull; Cardinality: MANY_
Information on the person responsible for providing further information regarding the resource. The contact person for this ORP.
```json
"contactperson": [
"Noémie Dominguez",
...
]
```
#### contactgroup #### contactgroup
_Type: String &bull; Cardinality: MANY_ _Type: String &bull; Cardinality: MANY_
Information on the group responsible for providing further information regarding the resource. The information for the contact group.
```json
"contactgroup": [
"Networked Multimedia Information Systems (NeMIS)",
...
]
```
#### tool #### tool
_Type: String &bull; Cardinality: MANY_ _Type: String &bull; Cardinality: MANY_
Information about tool useful for the interpretation and/or re-use of the research product. Information about tool useful for the interpretation and/or re-use of the research product.

View File

@ -1,74 +0,0 @@
# PIDs and identifiers
One of the challenges towards the stability of the contents in the OpenAIRE Research Graph consists of making its identifiers and records stable over time.
The barriers to this scenario are many, as the Graph keeps a map of data sources that is subject to constant variations: records in repositories vary in content,
original IDs, and PIDs, may disappear or reappear, and the same holds for the repository or the metadata collection it exposes.
Not only, but the mappings applied to the original contents may also change and improve over time to catch up with the changes in the input records.
## PID Authorities
One of the fronts regards the attribution of the identity to the objects populating the graph. The basic idea is to build the identifiers of the objects in the graph from the PIDs available in some authoritative sources while considering all the other sources as by definition “unstable”. Examples of authoritative sources are Crossref and DataCite. Examples of non-authoritative ones are institutional repositories, aggregators, etc. PIDs from the authoritative sources would form the stable OpenAIRE ID skeleton of the Graph, precisely because they are immutable by construction.
Such a policy defines a list of data sources that are considered authoritative for a specific type of PID they provide, whose effect is twofold:
* OpenAIRE IDs depend on persistent IDs when they are provided by the authority responsible to create them;
* PIDs are included in the graph according to a tight criterion: the PID Types declared in the table below are considered to be mapped as PIDs only when they are collected from the relative PID authority data source.
| PID Type | Authority |
|-----------|-----------------------------------------------------------------------------------------------------|
| doi | [Crossref](https://www.crossref.org), [Datacite](https://datacite.org) |
| pmc, pmid | [Europe PubMed Central](https://europepmc.org/), [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc) |
| arXiv | [arXiv.org e-Print Archive](https://arxiv.org/) |
There is an exception though: Handle(s) are minted by several repositories; as listing them all would not be a viable option, to avoid losing them as PIDs, Handles bypass the PID authority filtering rule.
In all other cases, PIDs are be included in the graph as alternate Identifiers.
## Delegated authorities
When a record is aggregated from multiple sources considered authoritative for minting specific PIDs, different mappings could be applied to them and, depending on the case,
this could result in inconsistencies in the attribution of the field values.
To overcome the issue, the intuition is to include such records only once in the graph. To do so, the concept of "delegated authorities" defines a list of datasources that
assigns PIDs to their scientific products from a given PID minter.
This "selection" can be performed when the entities in the graph sharing the same identifier are grouped together. The list of the delegated authorities currently includes
| Datasource delegated | Datasource delegating | Pid Type |
|--------------------------------------|----------------------------------|-----------|
| [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | doi |
| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id |
## Identifiers in the Graph
OpenAIRE assigns internal identifiers for each object it collects.
By default, the internal identifier is generated as `sourcePrefix::md5(localId)` where:
* `sourcePrefix` is a namespace prefix of 12 chars assigned to the data source at registration time
* `localid` is the identifier assigned to the object by the data source
After years of operation, we can say that:
* `localId` are generally unstable
* objects can disappear from sources
* PIDs provided by sources that are not PID agencies (authoritative sources for a specific type of PID) are often wrong (e.g. pre-print with the DOI of the published version, DOIs with typos)
Therefore, when the record is collected from an authoritative source:
* the identity of the record is forged using the PID, like `pidTypePrefix::md5(lowercase(doi))`
* the PID is added in a `pid` element of the data model
When the record is collected from a source which is not authoritative for any type of PID:
* the identity of the record is forged as usual using the local identifier
* the PID, if available, is added as `alternateIdentifier`
Currently, the following data sources are used as "PID authorities":
| PID Type | Prefix (12 chars) | Authority |
|-----------|------------------------|-----------------------------------------|
| doi | `doi_________` | Crossref, Datacite, Zenodo |
| pmc | `pmc_________` | Europe PubMed Central, PubMed Central |
| pmid | `pmid________` | Europe PubMed Central, PubMed Central |
| arXiv | `arXiv_______` | arXiv.org e-Print Archive |
| handle | `handle______` | any repository |
OpenAIRE also perform duplicate identification (see the [dedicated section for details](../../data-provision/deduplication/)).
All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated record).

View File

@ -15,66 +15,31 @@ _Type: [Node](#the-node-object) &bull; Cardinality: ONE_
Represents the source node in the relation. Represents the source node in the relation.
```json
"source": {
"id": "20|openorgs____::1cb75a3ad756e4c83e455e3e7347643b",
"type": "organization"
}
```
### target ### target
_Type: [Node](#the-node-object) &bull; Cardinality: ONE_ _Type: [Node](#the-node-object) &bull; Cardinality: ONE_
Represents the target node in the relation. Represents the target node in the relation.
```json
"target": {
"id": "10|doajarticles::022409068174087a003647ff46070f7f",
"type": "datasource"
}
```
### reltype ### reltype
_Type: [RelType](#the-reltype-object) &bull; Cardinality: ONE_ _Type: [RelType](#the-reltype-object) &bull; Cardinality: ONE_
Represent the semantics of the relation between two nodes of the graph. Represent the semantics of the relation between two nodes of the graph.
```json
"reltype": {
"name": "provides",
"type": "provision"
}
```
### provenance ### provenance
_Type: [Provenance](entities/other#provenance-1) &bull; Cardinality: ONE_ _Type: [Provenance](entities/other#provenance-1) &bull; Cardinality: ONE_
Indicates the process that produced (or provided) the information. Indicates the process that produced (or provided) the information.
```json
"provenance": {
"provenance": "Harvested",
"trust":"0.900"
}
```
### validated ### validated
_Type: Boolean &bull; Cardinality: ONE_ _Type: Boolean &bull; Cardinality: ONE_
Indicates weather or not the relation was validated. Indicates weather or not the relation was validated.
```json
"validated": true
```
### validationDate ### validationDate
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
Indicates the validation date of the relation - applies only when the validated flag is set to true. Indicates the validation date of the relation - applies only when the validated flag is set to true.
```json
"validationDate": "2022-09-02"
```
--- ---
## The `Node` object ## The `Node` object
@ -87,18 +52,11 @@ _Type: String &bull; Cardinality: ONE_
OpenAIRE identifier of the node in the graph. OpenAIRE identifier of the node in the graph.
```json
"id": "10|doajarticles::022409068174087a003647ff46070f7f"
```
### type ### type
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
Graph node type. Graph node type.
```json
"type": "datasource"
```
## The `RelType` object ## The `RelType` object
@ -109,25 +67,19 @@ _Type: String &bull; Cardinality: ONE_
Relation category, e.g. affiliation, citation, see table Relation typologies. Relation category, e.g. affiliation, citation, see table Relation typologies.
```json
"name": "provides"
```
### name ### name
_Type: String &bull; Cardinality: ONE_ _Type: String &bull; Cardinality: ONE_
Further specifies the relation semantic, indicating the relation direction, e.g. Cites, isCitedBy. Further specifies the relation semantic, indicating the relation direction, e.g. Cites, isCitedBy.
```json
"type": "provision"
```
--- ---
## Relationship types ## Relationship types
The following table lists all the possible relation semantics found in the graph dump. The following table lists all the possible relation semantics found in the graph dump.
| # | Source entity type | Target entity type | Relation type | Relation name | Inverse relation name | | # | source entity type | target entity type | relType.type | relType.name | relType.name (inverse) |
|:--:|:------------------:|:-------------------:|:-------------:|:---------------------------:|:----------------------------:| |:--:|:------------------:|:-------------------:|:-------------:|:---------------------------:|:----------------------------:|
| 1 | [Project](entities/project) | [Result](entities/result) | outcome | produces | isProducedBy | | 1 | [Project](entities/project) | [Result](entities/result) | outcome | produces | isProducedBy |
| 2 | [Result](entities/result) | [Organization](entities/organization) | affiliation | hasAuthorInstitution | isAuthorInstitutionOf | | 2 | [Result](entities/result) | [Organization](entities/organization) | affiliation | hasAuthorInstitution | isAuthorInstitutionOf |
@ -138,9 +90,9 @@ The following table lists all the possible relation semantics found in the graph
| 7 | [Data source](entities/data-source) | [Organization](entities/organization) | provision | provides | isProvidedBy | | 7 | [Data source](entities/data-source) | [Organization](entities/organization) | provision | provides | isProvidedBy |
| 8 | [Result](entities/result) | [Data source](entities/data-source) | provision | isHostedBy | hosts | | 8 | [Result](entities/result) | [Data source](entities/data-source) | provision | isHostedBy | hosts |
| 9 | [Result](entities/result) | [Data source](entities/data-source) | provision | isProvidedBy | provides | | 9 | [Result](entities/result) | [Data source](entities/data-source) | provision | isProvidedBy | provides |
| 10 | [Result](entities/result) | [Community](entities/community) | relationship | isRelatedTo | isRelatedTo | | 10 | [Result](entities/result) | [CommunityInitiative](entities/community) | relationship | isRelatedTo | isRelatedTo |
| 11 | [Organization](entities/organization) | [Community](entities/community) | relationship | isRelatedTo | isRelatedTo | | 11 | [Organization](entities/organization) | [CommunityInitiative](entities/community) | relationship | isRelatedTo | isRelatedTo |
| 12 | [Data source](entities/data-source) | [Community](entities/community) | relationship | isRelatedTo | isRelatedTo | | 12 | [Data source](entities/data-source) | [CommunityInitiative](entities/community) | relationship | isRelatedTo | isRelatedTo |
| 13 | [Project](entities/project) | [Community](entities/community) | relationship | isRelatedTo | isRelatedTo | | 13 | [Project](entities/project) | [CommunityInitiative](entities/community) | relationship | isRelatedTo | isRelatedTo |

View File

@ -0,0 +1,14 @@
---
sidebar_position: 1
---
# Aggregation
OpenAIRE collects metadata records from a variety of content providers as described in https://www.openaire.eu/aggregation-and-content-provision-workflows.
OpenAIRE aggregates metadata records describing objects of the research life-cycle from content providers compliant to the [OpenAIRE guidelines](https://guidelines.openaire.eu/) and from entity registries (i.e. data sources offering authoritative lists of entities, like OpenDOAR, re3data, DOAJ, and funder databases). After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is used to generate the final OpenAIRE Research Graph that you can access from the OpenAIRE portal and the APIs.
The transformation process includes the application of cleaning functions whose goal is to ensure that values are harmonised according to a common format (e.g. dates as YYYY-MM-dd) and, whenever applicable, to a common controlled vocabulary. The controlled vocabularies used for cleansing are accessible at http://api.openaire.eu/vocabularies. Each vocabulary features a set of controlled terms, each with one code, one label, and a set of synonyms. If a synonym is found as field value, the value is updated with the corresponding term. Also, the OpenAIRE Research Graph is extended with other relevant scholarly communication sources that are too big to be integrated via the “normal” aggregation mechanism: DOIBoost (which merges Crossref, ORCID, Microsoft Academic Graph, and Unpaywall), and ScholeXplorer, one of the Scholix hubs offering a large set of links between research literature and data.
![Aggregation](./assets/aggregation.png)

View File

@ -1,56 +0,0 @@
---
sidebar_position: 1
---
# Aggregation
OpenAIRE materializes an open, participatory research graph (the OpenAIRE Research graph) where products of the research life-cycle (e.g. scientific literature, research data, project, software) are semantically linked to each other and carry information about their access rights (i.e. if they are Open Access, Restricted, Embargoed, or Closed) and the sources from which they have been collected and where they are hosted. The OpenAIRE research graph is materialised via a set of autonomic, orchestrated workflows operating in a regimen of continuous data aggregation and integration. [1]
## What does OpenAIRE collect?
OpenAIRE aggregates metadata records describing objects of the research life-cycle from content providers compliant to the [OpenAIRE guidelines](https://guidelines.openaire.eu/) and from entity registries (i.e. data sources offering authoritative lists of entities, like [OpenDOAR](https://v2.sherpa.ac.uk/opendoar/), [re3data](https://www.re3data.org/), [DOAJ](https://doaj.org/), and various funder databases). After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is used to generate the final OpenAIRE Research Graph, accessible from the [OpenAIRE EXPLORE portal](https://explore.openaire.eu) and the [APIs](https://graph.openaire.eu/develop/).
The transformation process includes the application of cleaning functions whose goal is to ensure that values are harmonised according to a common format (e.g. dates as YYYY-MM-dd) and, whenever applicable, to a common controlled vocabulary. The controlled vocabularies used for cleansing are accessible at [api.openaire.eu/vocabularies](https://api.openaire.eu/vocabularies/). Each vocabulary features a set of controlled terms, each with one code, one label, and a set of synonyms. If a synonym is found as field value, the value is updated with the corresponding term.
Also, the OpenAIRE Research Graph is extended with other relevant scholarly communication sources that do not follow the OpenAIRE Guidelines and/or are too large to be integrated via the “normal” aggregation mechanism: DOIBoost (which merges Crossref, ORCID, Microsoft Academic Graph, and Unpaywall).
<p align="center">
<img loading="lazy" alt="Aggregation" src="/img/docs/aggregation.png" width="65%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
The OpenAIRE aggregation system collects information about objects of the research life-cycle compliant to the [OpenAIRE acquisition policy](https://www.openaire.eu/content-acquisition-policy) from [different types of data sources](https://explore.openaire.eu/search/find/dataproviders):
1. Scientific literature metadata and full-texts from institutional and thematic repositories, CRIS (Common Research Information Systems), Open Access journals and publishers;
2. Dataset metadata from data repositories and data journals;
3. Scientific literature, data and software metadata from Zenodo;
4. Metadata about data sources, organizations, projects, and funding programs from entity registries, i.e. authoritative sources such as CORDA and other funder databases for projects, OpenDOAR for publication repositories, re3data for data repositories, DOAJ for Open Access journals;
5. Metadata of open source research software from software repositories and SoftwareHeritge
6. Metadata about other types of research products, like workflow, protocols, methods, research packages
Relationships between objects are collected from the data sources, but also automatically detected by [inference algorithms](https://www.openaire.eu/blogs/text-mining-services-in-openaire-1) and added by authenticated users, who can insert links between literature, datasets, software and projects via [the “Link” procedure available from the OpenAIRE explore portal](https://explore.openaire.eu). More information about the linking functionality can be found [here](https://www.openaire.eu/linking).
## What kind of data sources are in OpenAIRE?
Objects and relationships in the OpenAIRE Research Graph are extracted from information packages, i.e. metadata records, collected from data sources of the following kinds:
- *Institutional or thematic repositories*: Information systems where scientists upload the bibliographic metadata and full-texts of their articles, due to obligations from their organization or due to community practices (e.g. ArXiv, Europe PMC);
- *Open Access Publishers and journals*: Information system of open access publishers or relative journals, which offer bibliographic metadata and PDFs of their published articles;
- *Data archives*: Information systems where scientists deposit descriptive metadata and files about their research data (also known as scientific data, datasets, etc.).;
- *Hybrid repositories/archives*: information systems where scientists deposit metadata and file of any kind of scientific products, incuding scientific literature, research data and research software (e.g. Zenodo)
- *Aggregator services*: Information systems that collect descriptive metadata about publications or datasets from multiple sources in order to enable cross-data source discovery of given research products. Examples are DataCite, BASE, DOAJ;
- *Entity Registries*: Information systems created with the intent of maintaining authoritative registries of given entities in the scholarly communication, such as OpenDOAR for the institutional repositories, re3data for the data repositories, CORDA and other funder databases for projects and funding information;
- *CRIS*: Information systems adopted by research and academic organizations to keep track of their research administration records and relative results; examples of CRIS content are articles or datasets funded by projects, their principal investigators, facilities acquired thanks to funding, etc..
- *Research Graphs*: services that maintain an information space of (possibly interlinked) scholalrly communication objects. Examples are CrossRef, ScholeXplorer and OpenAIRE itself.
## How does OpenAIRE collect metadata records?
OpenAIRE collects metadata records describing objects of the research life-cycle from content providers compliant to the OpenAIRE guidelines and from entity registries (i.e. data sources offering authoritative lists of entities, like OpenDOAR, re3data, DOAJ, and funder databases).
The OpenAIRE aggregator collects metadata records in the majority of cases via [OAI-PMH](https://www.openarchives.org/pmh/), but also supports other standard exchange protocols like FTP(S), SFTP, and some RESTful API.
For additional details about the aggregation workflows, please refer to [2].
## References
[1] Manghi P. et al. (2014) "The D-NET software toolkit: A framework for the realization, maintenance, and operation of aggregative infrastructures", Program, Vol. 48 Issue: 4, pp.322-354, [10.1108/PROG-08-2013-0045](https://doi.org/10.1108/PROG-08-2013-0045)
[2] Atzori, Claudio, Bardi, Alessia, Manghi, Paolo, & Mannocci, Andrea. (2017). The OpenAIRE workflows for data management. Zenodo. [10.5281/zenodo.996006](http://doi.org/10.5281/zenodo.996006)

View File

@ -1,77 +0,0 @@
# Datacite
This section describes the aggregation workflow used to gather the bibliographic material from Datacite and the relative mapping.
## Datacite datasource
[Datacite](https://datacite.org/index.html) is a leading global non-profit organisation that provides persistent identifiers (DOIs) for research data and other research outputs.
## Datacite API
The [DataCite REST API](https://support.datacite.org/docs/api) allows users to retrieve, query, and browse Datacite metadata records. In particular, it exposes a method for harvesting new records incrementally.
```
https://api.datacite.org/dois?page[cursor]=$CURSOR&page[size]=$NUMBER_OF_ITEM_PER_PAGE&query=updated:[$FROM_DATE_TIMESAMP TO $TO_DATE_TIMESAMP]
```
On this API Request, we introduce some variables:
- **CURSOR**: The value of the cursor to iterate the pages; the cursor is extracted from each API response and used in the next request.
- **NUMBER_OF_ITEM_PER_PAGE**: (max 1000) defines how many records must be returned within each API response.
- **FROM_DATE_TIMESAMP, TO_DATE_TIMESAMP** interval timestamp of the updated record.
Each record contains two pieces of information needed for incremental harvesting:
- **isActive**: tells if the record is deleted (`isActive:false`)
- **updated**: timestamp of last update
## Collection Workflow
The collection workflow is responsible for aggregating new records. Each record is stored locally on a table with the following schema:
- **DOI**: The DOI of the Datacite record (it is the primary key)
- **update_timestamp**: the last update date timestamp
- **json**: the native record JSON
The metadata collection process identifies the most recent record date available locally and uses such date to requests the records to the Datacite API, populating the **FROM_DATE_TIMESAMP** variable. The records in the API response are included in the local storage in upsert mode.
## Datacite Mapping
### Entity Mapping
The table below describes the mapping from the XML baseline records to the OpenAIRE Graph dump format.
| OpenAIRE Result field path | Datacite record JSON path | # Notes |
|--------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `id` | `\attributes\doi` | id in the form `doi_________::md5(doi)` |
| <ul><li>`instance`</li> <li>`instance.type`</li></ul> | <ul><li>`\attributes\types\resourceType`</li> <li> `\attributes\types\resourceTypeGeneral` </li> <li>`attributes\types\schemaOrg`</li></ul> | Use the vocabulary **_dnet:publication_resource_** to find a synonym to one of these terms and get the `instance.type`. |
| `type` | <ul><li>`\attributes\types\resourceType`</li> <li> `\attributes\types\resourceTypeGeneral` </li> <li>`attributes\types\schemaOrg`</li></ul> | Using the **_dnet:result_typologies_** vocabulary, we look up the `instance.type` synonym to generate one of the following main entities: <ul><li>`publication`</li> <li>`dataset`</li> <li> `software`</li> <li>`otherresearchproduct`</li></ul> |
| `pid` | `\attributes\doi` | `scheme = doi` |
| `originalid` | `\attributes\doi` | |
| `dateofcollection` | `attributes\updated` | the timestamp is defined in milliseconds we convert to "yyyy-MM-dd'T'HH:mm:ssZ" format |
| `author` | `\attributes\creators` | Each creator field will be mapped in the author entity below the subfield. **If the record has no Creator it will be skipped** |
| `author.fullname` | `\attributes\creators\name` | if name is not defined, we construct from given and family name |
| `author.rank` | | Incremental index starting from 1 |
| `author.name` | `\attributes\creators\givenName` | |
| `author.surname` | `\attributes\creators\familyName` | |
| `author.pid` | `\attributes\creators\nameIdentifiers` | this is a list of pids associated to the creator |
| `author.pid.scheme` | `\attributes\creators\nameIdentifiers` | mapping with vocabulary **dnet:pid_types** |
| `author.pid.value` | `\attributes\creators\nameIdentifiers/nameIdentifier` | the pid value |
| `maintitle` | `\attributes\titles` | Titles whose title type is null or title type is Main |
| `subtitle` | `\attributes\titles` | Titles whose title type is Subtitle since the title type vocabulary in OpenAIRE use the datacite title type vocabulary |
| **date section** | | for each date in particular for DOI starting with _10.14457_ we Apply a fix thai date convert a date to ThaiBuddhistDate and reformat to local one see ticket [#6791](https://support.openaire.eu/issues/6791) |
| `publicationdate` | `\attributes\dates` | where `dateType` is **issued** |
| `publicationdate` | `\attributes\publicationYear` | we create this date format `01-01-publicationYear` |
| `embargoenddate` | `\attributes\dates` | where `dateType` is **available** |
| `subjects` | `\attributes\subject` | `scheme=keywords` |
| `description` | `\attributes\descriptions` | |
| `publisher` | `\attributes\publisher` | |
| `language` | `\attributes\language` | cleaned by using vocabulary `dnet:languages` |
| `publisher` | `\attributes\publisher` | |
| `instance.license` | `\attributes\rightsList` | if the rights value starts with http and matches a particular regex |
| `instance.accessright` | `\attributes\rightsList` | <ul><li>if not present :`unknown`</li><li>if datasource is Figshare:`open`</li><li>If `embargo_date < today()`: OPEN</li></ul> |
### Relation Mapping
| OpenAIRE Relation Semantic and inverse | Datacite record JSON path | Source/Target type | #Notes |
|----------------------------------------|---------------------------------------|---------------------|------------------------------------------------------------------------------------------------------------|
| `isProducedBy/produces` | `attributes\fundingReferences` | `result/project` | only when the fundingReferences matches the pattern `(info:eu-repo/grantagreement/ec/h2020/)(\d{6})(.*)` |
| `IsProvidedBy/provides` | | `result/datasource` | Datasource is always set to `Datacite` |
| `isHostedBy/host` | `\attributes\relationships\client\id` | `result/datasource` | we defined a curated map clientId/Datasource if we found a match we create an _hostedBy Relation_ |
| `isRelatedTo` | `\attribute\relatedIdentifiers` | `result/result` | we create relationships whenever the pid of the target is resolved on the Research Graph |

View File

@ -1,253 +0,0 @@
# DOIBoost: Crossref, Unpaywall, Microsoft Academic Graph, ORCID
DOIBoost is a dataset that combines research outputs and links among them from a selection of data sources.
It enriches the records available on Crossref with what's available on Unpaywall, Microsoft Academic Graph, ORCID intersecting all those datasets by DOI.
As consequence, DOIBoost does not contain any record from MAG, Unpaywall, or ORCID that doesn't provide a DOI available in Crossref.
Each Crossref record is enriched with:
* ORCID identifiers of authors from ORCID
* Open Access instance (with OA color/route and license) from Unpaywall
* the following information from MAG:
* abstracts
* MAG identifiers of authors
* affiliation (result - organization) relationships
* subjects (MAG FieldsOfStudy)
* conference or journal information
The Open Access status is also set by intersecting the journal information of a record with the journal lists available from DOAJ and the Gold ISSN list.
## Inputs
* *Crossref*: dump available to Crossref subscribers via MetadataPlus service, updated once a month.
* *Microsoft Academic Graph*: downloaded version on 2021-02-15. We plan to take the latest version in Dec 2021 before MAG will be retired.
* *ORCID*: baseline dump obtained in 2020-10-13, regularly updated every week from the [ORCID public API](https://info.orcid.org/documentation/features/public-api).
* *Unpaywall*: public database snapshot downloaded in March 2021. Unpaywall updates it twice a year (https://unpaywall.org/products/snapshot)
The construction of the DOIBoost dataset consists of the following phases:
## Process
The following section describes the processing steps needed to build DOIBoost starting from the input data.
### Crossref filtering
Records in Crossref are ruled out according to the following criteria
* have blank title, examples:
* `10.1093/rheumatology/41.7.837`
* `10.1093/qjmed/95.7.430`
* `10.1371/journal.pone.0171434.g005`
* have one of the following publishers: `"Test accounts"`, `"CrossRef Test Account"`
* Examples from https://api.crossref.org/works?query.publisher-name=%22Test%20accounts%22
* `10.1007/bf00344543`
* `10.1007/bf00186154`
* `10.1306/64ed947a-1724-11d7-8645000102c1865d`
* have no authors with valid names, where valid means: not blank and different from all strings in this list: `List(",", "none none", "none, none", "none &na;", "(:null)", "test test test", "test test", "test", "&na; &na;")`
* Examples for blank authors:
* `10.1108/00070709810247807`
* `10.1016/s1074-9098(02)00346-5`
* `10.1136/heart.88.1.6`
* Examples for `"none"` author from https://api.crossref.org/works?query.author=%22none%22
* `10.4007/annals.2016.184.3.11`
* `10.4007/annals.2012.176.1.6`
* `10.2172/6393585`
* Examples for `"test"` author from https://api.crossref.org/works?query.author=%22test%22
* `10.5116/ijme.54ca.a5ae`
* `10.5755/j01.ss.71.2.544`
* `10.5755/j01.ee.22.2.319`
* have `"Addie Jackson"` as author and `"Elsevier BV"` as publisher (empirically we say they are test records)
* Examples from https://api.crossref.org/works?query.author=Addie+Jackson&query.publisher-name=%22Elsevier%20BV%22
* `10.2139/ssrn.2082156`
* `10.2139/ssrn.2202300`
* `10.2139/ssrn.2255657`
* have not one of the following values in the field `type` : `"book-section"`, `"book"`, `"book-chapter"`, `"book-part"`, `"book-series"`, `"book-set"`, `"book-track"`, `"edited-book"`, `"reference-book"`, `"monograph"`, `"journal-article"`, `"dissertation"`, `"other"`, `"peer-review"`, `"proceedings"`, `"proceedings-article"`, `"reference-entry"`, `"report"`, `"report-series"`, `"standard"`, `"standard-series"`, `"posted-content"`, `"dataset"`,
* Example:
* `10.1371/journal.pone.0171434.g005`
* `10.7554/elife.21052.049`
* `10.1371/journal.pcbi.1005379.s006`
Records with `type=dataset` are mapped into OpenAIRE results of type dataset. All others are mapped as OpenAIRE results of type publication.
### Mapping Crossref properties into the OpenAIRE Research Graph
Properties in OpenAIRE results are set based on the logic described in the following table:
| OpenAIRE Result field path | Crossref path(s) | Notes |
|----------------------------------------|--------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `id` | `doi` | id in the form `doi_________::md5(doi)` |
| `dateofcollection` | `indexed.datetime` | |
| `lastupdatetimestamp` | `indexed.timestamp` | |
| `type` | `type` | `dataset` if the Crossref type is dataset, `publication` otherwise (based on the filtering logics described above) |
| `originalId` | `doi, clinical-trial-number, alternative-id` | |
| `pid` | | The scheme tells the type of PID, the value contains the actual value |
| `pid.scheme` | | Default value: doi |
| `pid.value` | `doi` | The doi is normalised and lower-cased |
| `maintitle` | `title` | |
| `subtitle` | `subtitle` | |
| `author` | `author` | if available the sequence is mapped to rank and the ORCID is also mapped |
| `author.name` | `author.given` | |
| `author.surname` | `author.family` | |
| `author.fullname` | `author.given author.family` | |
| `author.rank` | | based on the order, starts from 1 |
| `author.pid` | | only if the ORCID is available |
| `author.pid.id.scheme` | | Default `'pending_orcid'` (meaning that it is not an id confirmed by ORCID) |
| `author.pid.id.value` | `author.ORCID` | |
| `author.pid.provenance.provenance` | | Default 'Harvested' |
| `author.pid.provenance.trust` | | Default '0.9' |
| `description` | `abstract` | |
| `subject` | `subject` | with `classid='keywords'`, i.e. no controlled vocabularies for Crossref subjects |
| `publicationdate` | `issued.datetime` or, if not available, `created.datetime` | |
| `publisher` | `publisher` | |
| `source` | `source` | only if the record is not of type `book` |
| `source` | concatenation of `container-title.head` + `"ISBN: "` + `ISBN.head` | only if the record is of type `book` |
| `container` | | It is set only for publications with information about the journal it was published in. |
| `container.name` | `container-title.head` | |
| `container.issnOnline` | `issn-type.value` | if `issn-type.type='electronic'` |
| `container.issnPrinted` | `issn-type.value` | if `issn-type.type='print'` |
| `container.vol` | `volume` | |
| `container.sp` | `page` | before `'-'` |
| `container.ep` | `page` | after `'-'` |
| `instance` | | One instance is created with the DOI URL |
| `instance.accessright` | | Values in `instance.accessright.code` and `instance.accessright.label` are set based on license and dateofacceptance:<br/>- `UNKNOWN`: if the license is blank<br/>- `OPEN ACCESS`: if the license is a CC license or an ACS license or an APA license (considered OPEN also by Unpaywall, see [Unpaywall FAQ](https://support.unpaywall.org/support/solutions/articles/44002063718-what-is-an-oa-license-) for details) or if OUP license, but only after 12 months from the publication date<br/>- `EMBARGO`: OUP license, before 12 months from the publication date<br/>- `CLOSED`: if there is a license not covered by the previous cases |
| `instance.accessright.code` | | Code from the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
| `instance.accessright.label` | | One of: `OPEN`, `RESTRICTED`, `CLOSED`, `EMBARGO` |
| `instance.accessright.scheme` | | Scheme that defines the code and label, i.e. the URL to the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
| `instance.accessright.openAccessRoute` | | only if `instance.accessright.value = 'OPEN ACCESS'`. Default is `hybrid`. The route is fixed in subsequent phases of DOIBoost, namely when intersecting with Unpaywall and patching the hostedby via DOAJ and the Gold-ISSN list. |
| `instance.license` | `license.URL ` | If there is a `license.content-version='vor'`, then this is used. Otherwise the first license entry is used. |
| `instance.pid` | | The scheme tells the type of PID, the value contains the actual value |
| `instance.pid.scheme` | | Default value: `doi` |
| `instance.pid.value` | `doi` | The doi is normalised and lower-cased |
| `instance.publicationdate` | `issued.datetime` or, if not available, `created.datetime` | |
| `instance.refereed` | | set to `peerReviewed` only if `relation.has-review.id` is not empty, `UNKNOWN` otherwise. |
| `instance.type` | `subtype` | mapped using the [OpenAIRE vocabulary for result typologies](https://api.openaire.eu/vocabularies/dnet:result_typologies) |
| `instance.url` | `doi` | Full URL of the DOI |
All other fields of the Json schema not mentioned in the table contain empty values.
All the records from Crossref are related to the datasource with `name=Crossref` and `id=openaire____::081b82f96300b6a6e3d282bad31cb6e2`
Possible improvements:
* map `clinical-trial-number` and `alternative-id` in `alternateIdentifiers`?
* Verify if Crossref has a property for `language`, `country`, `container.issnLinking`, `container.iss`, `container.edition`, `container.conferenceplace` and `container.conferencedate`
* Different approach to set the `refereed` field and improve its coverage?
h3. 2 Map Crossref links to projects/funders
Links to funding available in Crossref are mapped as funding relationships (`result -- isProducedBy -- project`) applying the following mapping:
| Funder | Grant code | Link to |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|
| DOI: `{10.13039/100010663, 10.13039/100010661, 10.13039/501100007601, 10.13039/501100000780, 10.13039/100010665}` or name: `'European Unions Horizon 2020 research and innovation program'` | series of `4-9` digits in `award` | Link to H2020 project |
| DOI: `{10.13039/100011199, 10.13039/100004431, 10.13039/501100004963, 10.13039/501100000780}` | series of `4-9` digits in `award` | Link to FP7 project |
| DOI: `10.13039/501100000781` OR name: `'European Union's'` | series of `4-9` digits in `award` | Link to FP7 or H2020 project |
| DOI: `10.13039/100000001` | `award` | Link to NSF project |
| DOI: `10.13039/501100001665` OR name: `{'The French National Research Agency (ANR)', 'The French National Research Agency'}` | `award` | Link to ANR project |
| DOI: `10.13039/501100002341` | `award` | Link to Academy of Finland project |
| DOI: `10.13039/501100001602` | `award`, removing the initial 'SFI' if present | Link to SFI project |
| DOI: `10.13039/501100000923` | `award` | Link to ARC project |
| DOI: `10.13039/501100000038` | `award` ignore: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to NSERC (`unidentified` project) |
| DOI: `10.13039/501100000155` | `award` ignore: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to SSHRC (`unidentified` project) |
| DOI: `10.13039/501100000024` | `award` ignore: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to CIHR (`unidentified` project) |
| DOI: `10.13039/501100002848` OR name :`'CONICYT, Programa de Formación de Capital Humano Avanzado'` | `award` | Link to CONICYT project |
| DOI: `10.13039/501100003448` | series of `4-9` digits in award | Link to GSRT project |
| DOI: `10.13039/501100010198` | `award` | Link to SGOV project |
| DOI: `10.13039/501100004564` | series of `4-9` digits in award | Link to MESTD project |
| DOI: `10.13039/501100003407` | `award` | Link to MIUR project. Since OpenAIRE has a small subset of MIUR projects, a link to the MIUR funder (`unidentified`<br/> project) is also generated |
| DOI: `{10.13039/501100006588, 10.13039/501100004488}` | `award`, removing `'Project No'` and `'HRZZ'` prefix, if present | Link to HRZZ or MZOS project |
| DOI: `10.13039/501100006769` | `award` | Link to Russian Science Foundation project |
| DOI: `10.13039/501100001711` | `award` after `'_'` and before `'/'` | Link to SNSF project |
| DOI: `10.13039/501100004410` | `award` | Link to TUBITAK project |
| DOI: `10.10.13039/100004440` or name: `Wellcome Trust Masters Fellowship` | `award` | Link to Wellcome Trust specific project and to the `unidentified` project. |
### Intersect Crossref with UnpayWall by DOI
The fields we consider from UnpayWall are:
* `is_oa`
* `best_oa_location`
* `oa_status`
The results of Crossref that intersect by DOI with UnpayWall records are enriched with one additional `instance` with the following properties:
| OpenAIRE Result field path | Unpaywall field path | Notes |
|----------------------------------------|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `instance` | | created only if `is_oa` and a `best_oa_location` is available |
| `instance.accessright` | | default value `Open Access`: we do not add instances if UnpayWall says there is no open version |
| `instance.accessright.code` | | Open Access code from the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
| `instance.accessright.label` | | Always `OPEN` |
| `instance.accessright.scheme` | | Scheme that defines the code and label, i.e. the URL to the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
| `instance.accessright.openAccessRoute` | `oa_status` | |
| `instance.url` | `best_oa_location` | |
| `instance.license` | `best_oa_location.license` | |
| `instance.pid` | | The scheme tells the type of PID, the value contains the actual value |
| `instance.pid.scheme` | | Default value: `doi` |
| `instance.pid.value` | `doi` | The doi is normalised and lower-cased |
For the definition of UnpayWall's `oa_status` refer to the [Unpaywall FAQ](https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-)
The record will also feature a relation to the UnpayWall data source: `name="UnpayWall"`, `id=openaire____::8ac8380272269217cb09a928c8caa993`.
### Intersect with ORCID
The fields we consider from ORCID are:
* `doi`
* `authors`, a list of authors, each with optional `name`, `surname`, `creditName`, `oid`
| OpenAIRE field path | ORCID path | Notes |
|------------------------------------|-----------------------|--------------------------------------------------------------------------------------------------------------------------------------|
| `pid` | `doi` | |
| `author.name` | `capitalize(name)` | only mapped if not blank |
| `author.surname` | `capitalize(surname)` | only mapped if not blank |
| `author.fullname` | | if name and surname are not blank, they are concatenated (`capitalize(name) capitalize(surname)`), otherwise we use the `creditName` |
| `author.pid` | | only if the `ORCID` is available |
| `author.pid.id.scheme` | | Default `orcid` (meaning that it is confirmed by ORCID, (in contrast to the `orcid_pending` set from Crossref and Unpaywall) |
| `author.pid.id.value` | `oid` | |
| `author.pid.provenance.provenance` | | Default `Harvested` |
| `author.pid.provenance.trust` | | Default `0.9` |
The records are enriched with the ORCID identifiers of their authors.
[//]: # (TODO: Update with the new approach implemented by Miriam.)
The current approach is:
* if the number of authors from Crossref equals the size of authors from ORCID, then we pick the list of authors with more PIDs and try to enrich it with the PIDs from the other list, based on JaroWrinkler distance on authors' names, surnames, or fullnames, depending on which properties are available;
* if the number of authors are different, then we take the longest and try to enrich it with the PIDs from the other author list, based on JaroWrinkler distance on authors' names, surnames, or fullnames, depending on which properties are available
Miriam will modify the process to ensure that:
* the list of authors from Crossred always "win"
* the identifiers from ORCID "win"
### Intersect with Microsoft Academic Graph
*Important Notes*
* Only papers with DOI are considered
* Since for the same DOI we have multiple version of item with different MAG PaperId, we only take one per DOI (the last one we process). We call this dataset `Papers_distinct`
When mapping MAG records to the OpenAIRE Research Graph, we consider the following MAG tables:
* `PaperAbstractsInvertedIndex`: for the paper abstracts
* `Authors`: for the authors. The MAG data is pre-processed by grouping authors by PaperId
* `Affiliations` and `PaperAuthorAffiliations`: to generate links between publications and organisations
* `Journals` and `ConferenceInstances`: joined with `Papers_distinct` to have the information about the venues where the paper was published
* TO BE REMOVED `PaperUrls`: to create one instance for the OpenAIRE publication
* `FieldsOfStudy`: to add subjects
The records are enriched with:
* abstracts
* MAG identifiers of authors
* affiliation relationships
* subjects (MAG FieldsOfStudy)
* conference or journal information (in the `journal` field) TODO: or `container`, in case of the dump?
* [TO BE REMOVED] instances with URL from MAG
### Enrich DOIBoost3 with hosting data sources (`hostedby`) and access right information
In this phase, we intersect DOIBoost3 with a dataset composed of journals from OpenAIRE, Crossref, and the ISSN gold list. Each journal comes with its International Standard Serial Numbers (`issn`, `eissn`, `lissn`) and, when available, a flag that tells if the journal is open access. The intersection is done on the basis of the International Standard Serial Numbers. The records with a `journal.[l|e]issn` that match are enriched as follows:
* Each instance gain the `hostedby` information corresponding to the journal
* If the journal is open access, the access rights of the instances are also set to `Open Access` with `gold` route (because by construction, the journals we know are open are from DOAJ or Gold ISSN list)
The hostedby of records that do not match are set to the `Unknown Repository`.
## References
The idea behind DOIBoost and its origin can be found in the paper (and related resources) at:
* La Bruzzo S., Manghi P., Mannocci A. (2019) OpenAIRE's DOIBoost - Boosting CrossRef for Research. In: Manghi P., Candela L., Silvello G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, doi:10.1007/978-3-030-11226-4_11 . Open Access version available at: [10.5281/zenodo.1441071](https://doi.org/10.5281/zenodo.1441071)

View File

@ -1,94 +0,0 @@
# EMBL-EBIs Protein Data Bank in Europe
This section describes the mapping implemented for [EMBL-EBIs Protein Data Bank in Europe](https://www.ebi.ac.uk/).
The Europe PMC RESTful Web Service gives the [datalinks API](https://europepmc.org/RestfulWebService#!/Europe32PMC32Articles32RESTful32API) to retrieve data-literature links in Scholix format.
## How the data is collected
Starting from the Pubmed collection, the API below is used to obtain the bioentities related to publications for each PubMed identifier.
Example:
```commandline
curl -s "https://www.ebi.ac.uk/europepmc/webservices/rest/MED/33024307/datalinks?format=json" | jq '.'
{
"version": "6.8",
"hitCount": 9,
"request": {
"id": "33024307",
"source": "MED"
},
"dataLinkList": {
"Category": [
{
"Name": "Nucleotide Sequences",
"CategoryLinkCount": 5,
"Section": [
{
"ObtainedBy": "tm_accession",
"Tags": [
"supporting_data"
],
"SectionLinkCount": 5,
"Linklist": {
"Link": [
{
"ObtainedBy": "tm_accession",
"PublicationDate": "04-11-2022",
"LinkProvider": {
"Name": "Europe PMC"
},
"RelationshipType": {
"Name": "References"
},
"Source": {
"Type": {
"Name": "literature"
},
"Identifier": {
"ID": "33024307",
"IDScheme": "MED"
}
},
"Target": {
"Type": {
"Name": "dataset"
},
"Identifier": {
"ID": "AY278488",
"IDScheme": "ENA",
"IDURL": "http://identifiers.org/ebi/ena.embl:AY278488"
},
"Title": "AY278488",
"Publisher": {
"Name": "Europe PMC"
}
},
[...]
```
## Mapping
The table below describes the mapping from the EBI links records to the OpenAIRE Graph dump format.
We filter all the target links with pid type **ena**, **pdb** or **uniprot**
For each target we construct a Bioentity with the following mapping
| OpenAIRE Result field path | EBI record field xpath | Notes |
|-----------------------------|----------------------------------------------------------|---------------------------------------------------------------|
| `id` | `target/identifier/ID` and `target/identifier/IDScheme` | id in the form `SCHEMA_________::md5(pid)` |
| `pid` | `target/identifier/ID` and `target/identifier/IDScheme` | `classid = classname = schema` |
| `publicationdate` | `target/PublicationDate` | clean and normalize the format of the date to be `YYYY-mm-dd` |
| `maintitle` | `target/Title` | |
| **Instance Mapping** | | |
| `instance.type` | | `Bioentity` |
| `type` | | `Dataset` |
| `instance.pid` | `target/identifier/ID` and `target/identifier/IDScheme` | `classid = classname = schema` |
| `instance.url` | `target/identifier/IDURL` | Copy the value as it is |
| `instance.publicationdate` | `//PubmedPubDate` | clean and normalize the format of the date to be YYYY-mm-dd |
### Relation Mapping
| OpenAIRE Relation Semantic and inverse | Source/Target type | Notes |
|----------------------------------------|---------------------|--------------------------------------------------------------------------|
| `IsRelatedTo` | `result/result` | we create relationships between the BioEntity and the pubmed publication |

View File

@ -1,44 +0,0 @@
# PubMed
This section describes the mapping implemented for [MEDLINE/PubMed](https://pubmed.ncbi.nlm.nih.gov/).
## Input
The native data is collected from the [ftp baseline](https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/) site.
It contains XML records compliant with the schema available at [www.nlm.nih.gov](https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html).
## Incremental harvesting
Pubmed exposes an entry point FTP with all the updates for each one. [ftp baseline update](https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/). We collect the new file and generate the new dataset by upserting the existing item.
## Entity Mapping
The table below describes the mapping from the XML baseline records to the OpenAIRE Graph dump format.
| OpenAIRE Result field path | PubMed record field xpath | Notes |
|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Publication Mapping** | | |
| `id` | `//PMID` | id in the form `pmid_________::md5(pmid)` |
| `pid` | `//PMID` | `classid = classname = pmid` |
| `publicationdate` | `//PubmedPubDate` | clean and normalize the format of the date to be YYYY-mm-dd |
| `maintitle` | `//Title` | |
| `description` | `//AbstractText` | |
| `language` | `//Language` | cleaning vocabulary -> dnet:languages |
| `subjects` | `//DescriptorName` | classId, className = keyword |
| **Author Mapping** | | |
| `author.surname` | `//Author/LastName` | |
| `author.name` | `//Author/ForeName` | |
| `author.fullname` | `//Author/FullName` | Concatenation of forename + lastName if exist |
| `author.rank` | FOR ALL AUTHORS | sequential number starting from 1 |
| **Journal Mapping** | | |
| `container.conferencedate` | `//Journal/PubDate` | map the date of the Journal |
| `container.name` | `//Journal/Title` | name of the journal |
| `container.vol` | `//Journal/Volume` | journal volume |
| `container.issPrinted` | `//Journal/ISSN` | the journal issn |
| `container.iss` | `//Journal/Issue` | The journal issue |
| **Instance Mapping** | | |
| `instance.type` | `//PublicationType` | if the article contains the typology `Journal Article` then we apply this type else We have to find a terms that match the vocabulary otherwise we discard it |
| `type` | <ul><li>`\attributes\types\resourceType`</li> <li> `\attributes\types\resourceTypeGeneral` </li> <li>`attributes\types\schemaOrg`</li></ul> | Using the **_dnet:result_typologies_** vocabulary, we look up the `instance.type` synonym to generate one of the following main entities: <ul><li>`publication`</li> <li>`dataset`</li> <li> `software`</li> <li>`otherresearchproduct`</li></ul> |
| `instance.pid` | `//PMID` | map the pmid in the pid in the instance |
| `instance.url` | `//PMID` | creates the URL by prepending `https://pubmed.ncbi.nlm.nih.gov/` to the PMId |
| `instance.alternateIdentifier` | `//ArticleId[./@IdType="doi"]` | |
| `instance.publicationdate` | `//PubmedPubDate` | clean and normalize the format of the date to be YYYY-mm-dd |

View File

Before

Width:  |  Height:  |  Size: 236 KiB

After

Width:  |  Height:  |  Size: 236 KiB

View File

Before

Width:  |  Height:  |  Size: 649 KiB

After

Width:  |  Height:  |  Size: 649 KiB

View File

Before

Width:  |  Height:  |  Size: 152 KiB

After

Width:  |  Height:  |  Size: 152 KiB

View File

@ -1,7 +1,11 @@
# Data provision # Data provision
<span className="todo">source: https://graph.openaire.eu/about#tabs_card</span>
OpenAIRE collects metadata records from more than 70K scholarly communication sources from all over the world, including Open Access institutional repositories, data archives, journals. All the metadata records (i.e. descriptions of research products) are put together in a data lake, together with records from Crossref, Unpaywall, ORCID, Grid.ac, and information about projects provided by national and international funders. Dedicated inference algorithms applied to metadata and to the full-texts of Open Access publications enrich the content of the data lake with links between research results and projects, author affiliations, subject classification, links to entries from domain-specific databases. Duplicated organisations and results are identified and merged together to obtain an open, trusted, public resource enabling explorations of the scholarly communication landscape like never before. OpenAIRE collects metadata records from more than 70K scholarly communication sources from all over the world, including Open Access institutional repositories, data archives, journals. All the metadata records (i.e. descriptions of research products) are put together in a data lake, together with records from Crossref, Unpaywall, ORCID, Grid.ac, and information about projects provided by national and international funders. Dedicated inference algorithms applied to metadata and to the full-texts of Open Access publications enrich the content of the data lake with links between research results and projects, author affiliations, subject classification, links to entries from domain-specific databases. Duplicated organisations and results are identified and merged together to obtain an open, trusted, public resource enabling explorations of the scholarly communication landscape like never before.
<p align="center"> ![Architecture](./assets/architecture.png)
<img loading="lazy" alt="Data provision" src="/img/docs/architecture.png" width="80%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p> <span className="todo">TODO: make this image linkable</span>

View File

@ -2,92 +2,20 @@
sidebar_position: 3 sidebar_position: 3
--- ---
# Clustering functions # Clustering functions
<span className="todo">TODO</span>
## Ngrams
It creates ngrams from the input field. <br />
```
Example:
Input string: “Search for the Standard Model Higgs Boson”
Parameters: ngram length = 3, maximum number = 4
List of ngrams: “sea”, “sta”, “mod”, “hig”
```
## NgramPairs ## NgramPairs
It produces a list of concatenations of a pair of ngrams generated from different words.<br /> It produces a list of concatenations of a pair of ngrams generated from different words.<br />
``` *Example:*<br />
Example: Input string: `“Search for the Standard Model Higgs Boson”`<br />
Input string: “Search for the Standard Model Higgs Boson” Parameters: ngram length = 3<br />
Parameters: ngram length = 3 List of ngrams: `“sea”`, `“sta”`, `“mod”`, `“hig”`<br />
Ngram pairs: “seasta”, “stamod”, “modhig” Ngram pairs: `“seasta”`, `“stamod”`, `“modhig”`
```
## SuffixPrefix ## SuffixPrefix
It produces ngrams pairs in a particular way: it concatenates the suffix of a string with the prefix of the next in the input string. A specialization of this function is available as SortedSuffixPrefix. It returns a sorted list. <br /> It produces ngrams pairs in a particular way: it concatenates the suffix of a string with the prefix of the next in the input string.<br />
``` *Example:*<br />
Example: Input string: `“Search for the Standard Model Higgs Boson”`<br />
Input string: “Search for the Standard Model Higgs Boson” Parameters: suffix and prefix length = 3<br />
Parameters: suffix and prefix length = 3, maximum number = 2 Output list: `“ardmod”` (suffix of the word `“Standard”` + prefix of the word `“Model”`), `“rchsta”` (suffix of the word `“Search”` + prefix of the word `“Standard”`)
Output list: “ardmod”` (suffix of the word “Standard” + prefix of the word “Model”), “rchsta” (suffix of the word “Search” + prefix of the word “Standard”)
```
## Acronyms
It creates a number of acronyms out of the words in the input field. <br />
```
Example:
Input string: “Search for the Standard Model Higgs Boson”
Output: "ssmhb"
```
## KeywordsClustering
It creates keys by extracting keywords, out of a customizable list, from the input field. <br />
```
Example:
Input string: “University of Pisa”
Output: "key::001" (code that identifies the keyword "University" in the customizable list)
```
## LowercaseClustering
It creates keys by lowercasing the input field. <br />
```
Example:
Input string: “10.001/ABCD”
Output: "10.001/abcd"
```
## RandomClusteringFunction
It creates random keys from the input field. <br />
## SpaceTrimmingFieldValue
It creates keys by trimming spaces in the input field. <br />
```
Example:
Input string: “Search for the Standard Model Higgs Boson”
Output: "searchstandardmodelhiggsboson"
```
## UrlClustering
It creates keys for an URL field by extracting the domain. <br />
```
Example:
Input string: “http://www.google.it/page”
Output: "www.google.it"
```
## WordsStatsSuffixPrefixChain
It creates keys containing concatenated statistics of the field, i.e. number of words, number of letters and a chain of suffixes and prefixes of the words. <br />
```
Example:
Input string: “Search for the Standard Model Higgs Boson”
Parameters: mod = 10
Output list: "5-3-seaardmod" (number of words + number of letters % 10 + prefix of the word "Search" + suffix of the word "Standard" + prefix of the word "Model"), "5-3-rchstadel" (number of words + number of letters % 10 + suffix of the word "Search" + prefix of the word "Standard" + suffix of the word "Model")
```

View File

@ -1,28 +1,31 @@
# Deduplication # Deduplication
Metadata records about the same scholarly work can be collected from different providers. Each metadata record can possibly carry different information because, for example, some providers are not aware of links to projects, keywords or other details. Another common case is when OpenAIRE collects one metadata record from a repository about a pre-print and another record from a journal about the published article. For the provision of statistics, OpenAIRE must identify those cases and “merge” the two metadata records, so that the scholarly work is counted only once in the statistics OpenAIRE produces. <span className="todo">TODO: intro</span>
## Methodology overview ## Clustering
The deduplication process can be divided into three different phases: Clustering is a common heuristics used to overcome the N x N complexity required to match all pairs of objects to identify the equivalent ones. The challenge is to identify a clustering function that maximizes the chance of comparing only records that may lead to a match, while minimizing the number of records that will not be matched while being equivalent. Since the equivalence function is to some level tolerant to minimal errors (e.g. switching of characters in the title, or minimal difference in letters), we need this function to be not too precise (e.g. a hash of the title), but also not too flexible (e.g. random ngrams of the title). On the other hand, reality tells us that in some cases equality of two records can only be determined by their PIDs (e.g. DOI) as the metadata properties are very different across different versions and no clustering function will ever bring them into the same cluster. To match these requirements OpenAIRE clustering for products works with two functions:
* Candidate identification (clustering) * DOI: the function generates the DOI when this is provided as part of the record properties;
* Duplicates identification (pair-wise comparisons) * Title-based function: the function generates a key that depends on (i) number of significant words in the title (normalized, stemming, etc.), (ii) module 10 of the number of characters of such words, and (iii) a string obtained as an alternation of the function prefix(3) and suffix(3) (and vice versa) o the first 3 words (2 words if the title only has 2). For example, the title “Entity deduplication in big data graphs for scholarly communication” becomes “entity deduplication big data graphs scholarly communication” with two keys key “7.1entionbig” and “7.1itydedbig” (where 1 is module 10 of 54 characters of the normalized title.
* Duplicates grouping (transitive closure) To give an idea, this configuration generates around 77Mi blocks, which we limited to 200 records each (only 15K blocks are affected by the cut), and entails 260Bi matches. Matches in a block are performed using a “sliding window” set to 80 records. The records are sorted lexicographically on a normalized version of their titles. The 1st record is matched against all the 80 following ones, then the second, etc. for an NlogN complexity.
<p align="center"> ## Matching and election
<img loading="lazy" alt="Deduplication Workflow" src="/img/docs/deduplication-workflow.png" width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
### Candidate identification (clustering) Once the clusters have been built, the algorithm proceeds with the comparisons. Comparisons are driven by a decisional tree that:
1. Tries to capture equivalence via PIDs: if records share a PID then they are equivalent
Clustering is a common heuristics used to overcome the N x N complexity required to match all pairs of objects to identify the equivalent ones. The challenge is to identify a [clustering function](./clustering-functions) that maximizes the chance of comparing only records that may lead to a match, while minimizing the number of records that will not be matched while being equivalent. Since the equivalence function is to some level tolerant to minimal errors (e.g. switching of characters in the title, or minimal difference in letters), we need this function to be not too precise (e.g. a hash of the title), but also not too flexible (e.g. random ngrams of the title). On the other hand, reality tells us that in some cases equality of two records can only be determined by their PIDs (e.g. DOI) as the metadata properties are very different across different versions and no [clustering function](./clustering-functions) will ever bring them into the same cluster. 2. Tries to capture difference:
### Duplicates identification (pair-wise comparisons) a. If record titles contain different “numbers” then they are different (this rule is subject to different feelings, and should be fine-tuned);
Pair-wise comparisons are conducted over records in the same cluster following the strategy defined in the decision tree. A different decision tree is adopted depending on the type of the entity being processed. b. If record contain different number of authors then they are different;
To further limit the number of comparisons, a sliding window mechanism is used: (i) records in the same cluster are lexicographically sorted by their title, (ii) a window of K records slides over the cluster, and (iii) records ending up in the same window are pair-wise compared. The result of each comparison produces a similarity relation when the pair of record matches. Such relations will be consequently used as input for the duplicates grouping stage. c. Note that different PIDs do not imply different records, as different versions may have different PIDs.
### Duplicates grouping (transitive closure) 3. Measures equivalence:
Once the similarity relations between pairs of records are drawn, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new representative object is obtained, which inherits all properties from the merged records and keeps track of their provenance. a. The titles of the two records are normalised and compared for similarity by applying the Levenstein distance algorithm. The algorithm returns a number in the range [0,1], where 0 means “very different” and 1 means “equal”. If the distance is greater than or equal 0,99 the two records are identified as duplicates.
b. Dates are not regarded for equivalence matching because different versions of the same records should be merged and may be published on different dates, e.g. pre-print and published version of an article.
Once the equivalence relationships between pairs of records are set, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new representative object is obtained, which inherits all properties from the merged records and keeps track of their provenance. The ID of the record is obtained by appending the prefix “dedup_” to the MD5 of the first ID (given their lexicographical ordering). A new, more stable function to generate the ID is under development, which exploits the DOI when one of the records to be merged includes a Crossref or a DataCite record.

View File

@ -3,68 +3,4 @@ sidebar_position: 2
--- ---
# Organizations # Organizations
<span className="todo">TODO</span>
The organizations in OpenAIRE are aggregated from different registries (e.g. CORDA, OpenDOAR, Re3data, ROR). In some cases, a registry provides organizations as entities with their own persistent identifier. In other cases, those organizations are extracted from other main entities provided by the registry (e.g. datasources, projects, etc.).
The deduplication of organizations is enhanced by the [OpenOrgs](https://orgs.openaire.eu), a tool that combines an automated approach for identifying duplicated instances
of the same organization record with a "humans in the loop" approach, in which the equivalences produced by a duplicate identification algorithm are suggested to data curators, in charge for validating them.
The data curation activity is twofold, on one end pivots around the disambiguation task, on the other hand assumes to improve the metadata describing the organization records
(e.g. including the translated name, or a different PID) as well as defining the hierarchical structure of existing large organizations (i.e. Universities comprising its departments or large research centers with all its sub-units or sub-institutes).
Duplicates among organizations are therefore managed through three different stages:
* *Creation of Suggestions*: executes an automatic workflow that performs the deduplication and prepare new suggestions for the curators to be processed;
* *Curation*: manual editing of the organization records performed by the data curators;
* *Creation of Representative Organizations*: executes an automatic workflow that creates curated organizations and exposes them on the OpenAIRE Research Graph by using the curators' feedback from the OpenOrgs underlying database.
The next sections describe the above mentioned stages.
### Creation of Suggestions
This stage executes an automatic workflow that faces the *candidate identification* and the *duplicates identification* stages of the deduplication to provide suggestions for the curators in the OpenOrgs.
#### Candidate identification (clustering)
To match the requirements of limiting the number of comparisons, OpenAIRE clustering for organizations aims at grouping records that would more likely be comparable.
It works with four functions:
* *URL-based function*: the function generates the URL domain when this is provided as part of the record properties from the organization's `websiteurl` field;
* *Title-based functions*:
* generate strings dependent to the keywords in the `legalname` field;
* generate strings obtained as an alternation of the function prefix(3) and suffix(3) (and vice versa) on the first 3 words of the `legalname` field;
* generate strings obtained as a concatenation of ngrams of the `legalname` field;
#### Duplicates identification (pair-wise comparisons)
For each pair of organization in a cluster the following strategy (depicted in the figure below) is applied.
The comparison goes through the following decision tree:
1. *grid id check*: comparison of the grid ids. If the grid id is equivalent, then the similarity relation is drawn. If the grid id is not available, the comparison proceeds to the next stage;
2. *early exits*: comparison of the numbers extracted from the `legalname`, the `country` and the `website` url. No similarity relation is drawn in this stage, the comparison proceeds only if the compared fields verified the conditions of equivalence;
3. *city check*: comparison of the city names in the `legalname`. The comparison proceeds only if the legalnames shares at least 10% of cities;
4. *keyword check*: comparison of the keywords in the `legalname`. The comparison proceeds only if the legalnames shares at least 70% of keywords;
5. *legalname check*: comparison of the normalized `legalnames` with the `Jaro-Winkler` distance to determine if it is higher than `0.9`. If so, a similarity relation is drawn. Otherwise, no similarity relation is drawn.
<p align="center">
<img loading="lazy" alt="Organization Decision Tree" src="/img/docs/decisiontree-organization.png" width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/1YKInGGtHu09QG4pT2gRLEum4LxU82d4nKkvGNvRQmrg/edit?usp=sharing)
### Data Curation
All the similarity relations drawn by the algorithm involving the decision tree are exposed in OpenOrgs, where are made available to the data curators to give feedbacks and to improve the organizations metadata.
A data curator can:
* *edit organization metadata*: legalname, pid, country, url, parent relations, etc.;
* *approve suggested duplicates*: establish if an equivalence relation is valid;
* *discard suggested duplicates*: establish if an equivalence relation is wrong;
* *create similarity relations*: add a new equivalence relation not drawn by the algorithm.
Note that if a curator does not provide a feedback on a similarity relation suggested by the algorithm, then such relation is considered as valid.
### Creation of Representative Organizations
This stage executes an automatic workflow that faces the *duplicates grouping* stage to create representative organizations and to update them on the OpenAIRE Research Graph. Such organizations are obtained via transitive closure and the relations used comes from the curators' feedback gathered on the OpenOrgs underlying Database.
#### Duplicates grouping (transitive closure)
Once the similarity relations between pairs of organizations have been gathered, the groups of equivalent organizations are obtained (transitive closure, i.e. “mesh”). From such sets a new representative organization is obtained, which inherits all properties from the merged records and keeps track of their provenance.
The IDs of the representative organizations are obtained by the OpenOrgs Database that creates a unique ``openorgs`` ID for each approved organization. In case an organization is not approved by the curators, the ID is obtained by appending the prefix ``pending_org`` to the MD5 of the first ID (given their lexicographical ordering).

View File

@ -4,66 +4,46 @@ sidebar_position: 1
# Research results # Research results
Metadata records about the same scholarly work can be collected from different providers. Each metadata record can possibly carry different information because, for example, some providers are not aware of links to projects, keywords or other details. Another common case is when OpenAIRE collects one metadata record from a repository about a pre-print and another record from a journal about the published article. For the provision of statistics, OpenAIRE must identify those cases and “merge” the two metadata records, so that the scholarly work is counted only once in the statistics OpenAIRE produces.
Duplicates among research results are identified among results of the same type (publications, datasets, software, other research products). If two duplicate results are aggregated one as a dataset and one as a software, for example, they will never be compared and they will never be identified as duplicates. Duplicates among research results are identified among results of the same type (publications, datasets, software, other research products). If two duplicate results are aggregated one as a dataset and one as a software, for example, they will never be compared and they will never be identified as duplicates.
OpenAIRE supports different deduplication strategies based on the type of results. OpenAIRE supports different deduplication strategies based on the type of results.
The next sections describe how each stage of the deduplication workflow is faced for research results. ## Methodology overview
### Candidate identification (clustering) The deduplication process can be divided into two different phases:
* Candidate identification (clustering)
* Decision tree
* Creation of representative record
To match the requirements of limiting the number of comparisons, OpenAIRE clustering for research products works with two functions: The implementation of each phase is different based on the type of results that are being processed.
* *DOI-based function*: the function generates the DOI when this is provided as part of the record properties;
* *Title-based function*: the function generates a key that depends on (i) number of significant words in the title (normalized, stemming, etc.), (ii) module 10 of the number of characters of such words, and (iii) a string obtained as an alternation of the function prefix(3) and suffix(3) (and vice versa) on the first 3 words (2 words if the title only has 2). For example, the title ``Search for the Standard Model Higgs Boson`` becomes ``search standard model higgs boson`` with two keys key ``5-3-seaardmod`` and ``5-3-rchstadel``.
To give an idea, this configuration generates around 77Mi blocks, which we limited to 200 records each (only 15K blocks are affected by the cut), and entails 260Bi matches. ### Publications
### Duplicates identification (pair-wise comparisons) #### Candidate identification (clustering)
Comparisons in a block are performed using a *sliding window* set to 50 records. The records are sorted lexicographically on a normalized version of their titles. The 1st record is compared against all the 50 following ones using the decision tree, then the second, etc. for an NlogN complexity. Clustering is a common heuristics used to overcome the N x N complexity required to match all pairs of objects to identify the equivalent ones. The challenge is to identify a [clustering function](./clustering-functions) that maximizes the chance of comparing only records that may lead to a match, while minimizing the number of records that will not be matched while being equivalent. Since the equivalence function is to some level tolerant to minimal errors (e.g. switching of characters in the title, or minimal difference in letters), we need this function to be not too precise (e.g. a hash of the title), but also not too flexible (e.g. random ngrams of the title). On the other hand, reality tells us that in some cases equality of two records can only be determined by their PIDs (e.g. DOI) as the metadata properties are very different across different versions and no [clustering function](./clustering-functions) will ever bring them into the same cluster. To match these requirements OpenAIRE clustering for products works with two functions:
A different decision tree is adopted depending on the type of the entity being processed. DOI: the function generates the DOI when this is provided as part of the record properties;
Similarity relations drawn in this stage will be consequently used to perform the duplicates grouping. Title-based function: the function generates a key that depends on (i) number of significant words in the title (normalized, stemming, etc.), (ii) module 10 of the number of characters of such words, and (iii) a string obtained as an alternation of the function prefix(3) and suffix(3) (and vice versa) o the first 3 words (2 words if the title only has 2). For example, the title “Entity deduplication in big data graphs for scholarly communication” becomes “entity deduplication big data graphs scholarly communication” with two keys key “7.1entionbig” and “7.1itydedbig” (where 1 is module 10 of 54 characters of the normalized title.
#### Publications #### Decision tree
For each pair of publications in a cluster the following strategy (depicted in the figure below) is applied. For each pair of publications in a cluster the following strategy (depicted in the figure below) is applied.
The comparison goes through different stages: Cross comparison of the pid lists (in the `pid` and `alternateid` elements). If 50% common pids, levenshtein distance on titles with low threshold (0.9).
1. *trusted pids check*: comparison of the trusted pid lists (in the `pid` field of the record). If at least 1 pid is equivalent, records match and the similarity relation is drawn. Otherwise, check if the number of authors and the title version is equal. If so, levenshtein distance on titles with higher threshold (0.99).
2. *instance type check*: comparison of the instance types (indicating the subtype of the record, i.e. presentation, conference object, etc.). If the instance types are not compatible then the records does not match. Otherwise, the comparison proceeds to the next stage The publications are matched as duplicate if the distance is higher than the threshold, in every other case they are considered as distinct publications.
3. *untrusted pids check*: comparison of all the available pids (in the `pid` and the `alternateid` fields of the record). In every case, no similarity relation is drawn in this stage. If at least one pid is equivalent, the next stage will be a *soft check*, otherwise the next stage is a *strong check*.
4. *soft check*: comparison of the record titles with the Levenshtein distance. If the distance measure is above 0.9 then the similarity relation is drawn.
5. *strong check*: comparison composed by three substages involving the (i) comparison of the author list sizes and the version of the record to determine if they are coherent, (ii) comparison of the record titles with the Levenshtein distance to determine if it is higher than 0.99, (iii) "smart" comparison of the author lists to check if common authors are more than 60%.
<p align="center"> ![Example banner](../assets/dedup-results.png)
<img loading="lazy" alt="Publications Decision Tree" src="/img/docs/decisiontree-publication.png" width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/19SIilTp1vukw6STMZuPMdc0pv0ODYCiOxP7OU3iPWK8/edit?usp=sharing) #### Creation of representative record
<span className="todo">TODO</span>
#### Software ### Datasets
For each pair of software in a cluster the following strategy (depicted in the figure below) is applied. <span className="todo">TODO</span>
The comparison goes through different stages:
1. *pids check*: comparison of the pids in the records. No similarity relation is drawn in this stage, it is only used to establish the final threshold to be used to compare record titles. If there is at least one common pid, then the next stage is a *soft check*. Otherwise, the next stage is a *strong check*
2. *soft check*: comparison of the record titles with Levenshtein distance. If the measure is above 0.9, then the similarity relation is drawn
3. *strong check*: comparison of the record titles with Levenshtein distance. If the measure is above 0.99, then the similarity relation is drawn
<p align="center"> ### Software
<img loading="lazy" alt="Software Decision Tree" src="/img/docs/decisiontree-software.png" width="85%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/> <span className="todo">TODO</span>
</p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/19gd1-GTOEEo6awMObGRkYFhpAlO_38mfbDFFX0HAkuo/edit?usp=sharing) ### Other types of research products
<span className="todo">TODO</span>
#### Datasets and Other types of research products
For each pair of datasets or other types of research products in a cluster the strategy depicted in the figure below is applied.
The decision tree is almost identical to the publication decision tree, with the only exception of the *instance type check* stage. Since such type of record does not have a relatable instance type, the check is not performed and the decision tree node is skipped.
<p align="center">
<img loading="lazy" alt="Dataset and Other types of research products Decision Tree" src="/img/docs/decisiontree-dataset-orp.png" width="90%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/1uBa7Bw2KwBRDUYIfyRr_Keol7UOeyvMNN7MPXYLg4qw/edit?usp=sharing)
### Duplicates grouping (transitive closure)
The general concept is that the field coming from the record with higher "trust" value is used as reference for the field of the representative record.
The IDs of the representative records are obtained by appending the prefix ``dedup_`` to the MD5 of the first ID (given their lexicographical ordering). If the group of merged records contains a trusted ID (i.e. the DOI), also the ``doi`` keyword is added to the prefix.

View File

@ -1,23 +0,0 @@
---
sidebar_position: 3
---
# Extraction of Acknowledged Concepts
| Property | Description |
| --- | --- |
| Short description | Scans the plaintexts of publications for acknowledged concepts, including grant identifiers (projects) of funders, accession numbers of bioetities, EPO patent mentions, as well as custom concepts that can link research objects to specific research communities and initiatives in OpenAIRE. |
| Authority | ATHENA Research Center, Greece |
| Licence | CC-BY/CC-0 |
| Algorithmic details | The algorithm processes the publication's fulltext and extracts references to acknowledged concepts. It applies pattern matching and string join between the fulltext and a target database which contains the title, the acronym and the identifier of the searched concept. |
| Parameters | Concept titles, acronyms, and identifiers, publication's identifiers and fulltexts |
| Limitations | N/A |
| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
| References & resources | [Foufoulas, Y., Zacharia, E., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2022). DETEXA: Declarative Extensible Text Exploration and Analysis. In: , et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham.](https://doi.org/10.1007/978-3-031-16802-4_9) |

View File

@ -1,23 +0,0 @@
---
sidebar_position: 4
---
# Extraction of Cited Concepts
| Property | Description |
| --- | --- |
| Short description | Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs. |
| Authority | ATHENA Research Center, Greece |
| Licence | CC-BY/CC-0 |
| Algorithmic details | The algorithm extracts citations to specific datasets and software. It extracts the citation section of a publication's fulltext and applies string matching against a target database which includes an inverted index with dataset/software titles, urls and other metadata. |
| Parameters | Title, URL, creator names, publisher names and publication year for each concept to create the target database. Identifier and publication's fulltext to extract the cited concepts. |
| Limitations | N/A |
| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
| References & resources | [Foufoulas Y., Stamatogiannakis L., Dimitropoulos H., Ioannidis Y. (2017) “High-Pass Text Filtering for Citation Matching”. In: Kamps J., Tsakonas G., Manolopoulos Y., Iliadis L., Karydis I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham.](https://doi.org/10.1007/978-3-319-67008-9_28) |

View File

@ -1,23 +0,0 @@
---
sidebar_position: 5
---
# Classifiers
| Property | Description |
| --- | --- |
| Short description | A document classification algorithm that employs analysis of free text stemming from the abstracts of the publications. The purpose of applying a document classification module is to assign a scientific text to one or more predefined content classes. |
| Authority | ATHENA Research Center, Greece |
| Licence | CC-BY/CC-0 |
| Algorithmic details | The algorithm classifies publication's fulltexts using a Bayesian classifier and weighted terms according to an offline training phase. The training has been done using the following taxonomies: arXiv, MeSH (Medical Subject Headings), ACM, and DDC (Dewey Decimal Classification, or Dewey Decimal System). |
| Parameters | Publication's identifier and fulltext |
| Limitations | N/A |
| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
| References & resources | [Giannakopoulos, T., Stamatogiannakis, E., Foufoulas, I., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2014). Content Visualization of Scientific Corpora Using an Extensible Relational Database Implementation. In: Bolikowski, Ł., Casarosa, V., Goodale, P., Houssos, N., Manghi, P., Schirrwagen, J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham.](https://doi.org/10.1007/978-3-319-08425-1_10) |

View File

@ -1,5 +1,8 @@
# Enrichment # Enrichment
<span className="todo">TODO: intro</span>
## Mining ## Mining
The OpenAIRE Research Graph is enriched by links mined by OpenAIREs full-text mining algorithms that scan the plaintexts of publications for funding information, references to datasets, software URIs, accession numbers of bioetities, and EPO patent mentions. Custom mining modules also link research objects to specific research communities, initiatives and infrastructures. In addition, other inference modules provide content-based document classification, document similarity, citation matching, and author affiliation matching. The OpenAIRE Research Graph is enriched by links mined by OpenAIREs full-text mining algorithms that scan the plaintexts of publications for funding information, references to datasets, software URIs, accession numbers of bioetities, and EPO patent mentions. Custom mining modules also link research objects to specific research communities, initiatives and infrastructures. In addition, other inference modules provide content-based document classification, document similarity, citation matching, and author affiliation matching.

View File

@ -3,19 +3,4 @@ sidebar_position: 1
--- ---
# Mining algorithms # Mining algorithms
The Text and Data Mining (TDM) algorithms used for enriching the OpenAIRE Graph are grouped in the following main categories:
[Extraction of acknowledged concepts](acks.md)
[Extraction of cited concepts](cites.md)
[Document Classification](classified.md)
<span className="todo">TODO</span> <span className="todo">TODO</span>

View File

@ -6,16 +6,8 @@ sidebar_position: 5
The final version of the OpenAIRE Research Graph is indexed on a Solr server that is used by the OpenAIRE portals (EXPLORE, CONNECT, PROVIDE) and APIs, the latter adopted by several third-party applications and organizations, such as: The final version of the OpenAIRE Research Graph is indexed on a Solr server that is used by the OpenAIRE portals (EXPLORE, CONNECT, PROVIDE) and APIs, the latter adopted by several third-party applications and organizations, such as:
* The OpenAIRE Research Graph APIs and Portals will offer to the EOSC (European Open Science Cloud) an Open Science Resource Catalogue, keeping an up to date map of all research results (publications, datasets, software), services, organizations, projects, funders in Europe and beyond. * EOSC --The OpenAIRE Research Graph APIs and Portals will offer to the EOSC an Open Science Resource Catalogue, keeping an up to date map of all research results (publications, datasets, software), services, organizations, projects, funders in Europe and beyond.
* DSpace & EPrints repositories can install the OpenAIRE plugin to expose OpenAIRE compliant metadata records via their OAI-PMH endpoint and offer to researchers the possibility to link their depositions to the funding project, by selecting it from the list of project provided by OpenAIRE. * DSpace & EPrints repositories can install the OpenAIRE plugin to expose OpenAIRE compliant metadata records via their OAI-PMH endpoint and offer to researchers the possibility to link their depositions to the funding project, by selecting it from the list of project provided by OpenAIRE
* EC participant portal (Sygma - System for Grant Management) uses the OpenAIRE API in the “Continuous Reporting” section. Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in the OpenAIRE Research Graph that are linked to the project. The user can select the research products from the list and easily compile the continuous reporting data of the project. * EC participant portal (Sygma - System for Grant Management) uses the OpenAIRE API in the “Continuous Reporting” section. Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in the OpenAIRE Research Graph that are linked to the project. The user can select the research products from the list and easily compile the continuous reporting data of the project.
* ScholExplorer is used by different players of the scholarly communication ecosystem. For example, [Elsevier](https://www.elsevier.com/authors/tools-and-resources/research-data/data-base-linking) uses its API to make the links between
publications and datasets automatically appear on ScienceDirect.
ScholExplorer indexes the links among the four major types of research products (API v3) available in the OpenAIRE Research Graph and makes them available through an HTTP API that allows
to search them by the following criteria:
* Links whose source object has a given PID or PID type;
* Links whose source object has been published by a given data source ("data source as publisher");
* Links that were collected from a given data source ("data source as provider").

View File

@ -2,54 +2,8 @@
sidebar_position: 4 sidebar_position: 4
--- ---
# Post cleaning # Post-cleaning
At the very end of the processing pipeline, a step is dedicated to perform cleaning operations aimed at improving the overall quality of the data. The aggregation processes are continuously running and apply vocabularies as they are in a given moment of time. It could be the case that a vocabulary changes after the aggregation of one data source has finished, thus the aggregated content does not reflect the current status of the controlled vocabularies.
The output of this final cleansing step is the final version of the OpenAIRE Research Graph.
## Vocabulary based cleaning
The aggregation processes run independently one from another and continuously. Each aggregation process, depending on the characteristics of the records exposed by the data source, makes use of one or more vocabularies to harmonise the values available in a given field.
A vocabulary is a data structure that defines a list of terms, and for each term defines a list of synonyms:
```xml
<TERMS>
<TERM native_name="Annotation" code="0018" english_name="Annotation" encoding="OPENAIRE">
<SYNONYMS>
<SYNONYM term="Comentario" encoding="CSIC"/>
<SYNONYM term="Comment/debate" encoding="Aaltodoc Publication Archive"/>
<SYNONYM term="annotation" encoding="OPENAIRE-PR202112"/>
[...]
</SYNONYMS>
<RELATIONS/>
</TERM>
<TERM native_name="Article" code="0001" english_name="Article" encoding="OPENAIRE">
<SYNONYMS>
<SYNONYM term="A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä" encoding="Aaltodoc Publication Archive"/>
<SYNONYM term="A4 Artikkeli konferenssijulkaisussa" encoding="Aaltodoc Publication Archive"/>
<SYNONYM term="Article" encoding="OTHER"/>
<SYNONYM term="Article (author)" encoding="OTHER"/>
[...]
```
Each vocabulary is typically used to control and harmonise the values available in a specific field characterising the bibliographic records. The example above provides a preview of the vocabulary used to clean the [result's instance typology](/data-model/entities/result#instance).
The content of the vocabularies can be accessed on [api.openaire.eu/vocabularies](https://api.openaire.eu/vocabularies/).
Given a value provided in the original records, the cleaning process looks for a synonym and, when found, resolves the corresponding term which is used in turn to build the cleaned record.
Each aggregation process applies vocabularies according to their definitions in a given moment of time, however, it could be the case that a vocabulary changes after the aggregation of one data source has finished, thus the aggregated content does not reflect the current status of the controlled vocabularies.
In addition, the integration of ScholeXplorer and DOIBoost and some enrichment processes applied on the raw and on the de-duplicated graph may introduce values that do not comply with the current status of the OpenAIRE controlled vocabularies. For these reasons, we included a final step of cleansing at the end of the workflow materialisation.
## Filtering
Bibliographic records that do not meet minimal requirements for being part of the OpenAIRE Research Graph are eliminated during this phase.
Currently, the only criteria applied horizontally to the entire graph aims at excluding scientific results whose title is not meaningful for citation purposes.
Then, different criteria are applied in the pre-processing of specific sub-collections:
* [Crossref filtering](/data-provision/aggregation/doiboost#crossref-filtering)
## Country cleaning
This phase is responsible for removing the country information from result records that match specific criteria. The need for this phase is driven by the fact that some datasources, although referred of national pertinence, they contain material that is not always related to the given country.
In addition, the integration of ScholeXplorer and DOIBoost and some enrichment processes applied on the raw and on the de-duplicated graph may introduce values that do not comply with the current status of the OpenAIRE controlled vocabularies. For these reasons, we included a final step of cleansing at the end of the workflow materialisation. The output of the final cleansing step is the final version of the OpenAIRE Research Graph.

View File

@ -4,14 +4,14 @@ id: intro
sidebar_position: 1 sidebar_position: 1
--- ---
# Overview # Welcome!
The OpenAIRE Research Graph is one of the largest open scholarly record collections worldwide, key in fostering Open Science and establishing its practices in the daily research activities. The OpenAIRE Research Graph is one of the largest open scholarly record collections worldwide, key in fostering Open Science and establishing its practices in the daily research activities.
Conceived as a public and transparent good, populated out of data sources trusted by scientists, the Graph aims at bringing discovery, monitoring, and assessment of science back in the hands of the scientific community. Conceived as a public and transparent good, populated out of data sources trusted by scientists, the Graph aims at bringing discovery, monitoring, and assessment of science back in the hands of the scientific community.
Imagine a vast collection of research products all linked together, contextualised and openly available. For the past years OpenAIRE has been working to gather this valuable record. It is a massive collection of metadata and links between scientific products such as articles, datasets, software, and other research products, entities like organisations, funders, funding streams, projects, communities, and data sources. Imagine a vast collection of research products all linked together, contextualised and openly available. For the past ten years OpenAIRE has been working to gather this valuable record. It is a massive collection of metadata and links between scientific products such as articles, datasets, software, and other research products, entities like organisations, funders, funding streams, projects, communities, and data sources.
As of today, the OpenAIRE Research Graph aggregates hundreds of millions of metadata records (and links among them) from multiple data sources trusted by scientists, including: As of today, the OpenAIRE Research Graph aggregates around 450Mi metadata records with links collecting from 10K data sources trusted by scientists, including:
* Repositories registered in OpenDOAR or re3data.org (soon FAIRSharing.org) * Repositories registered in OpenDOAR or re3data.org (soon FAIRSharing.org)
* Open Access journals registered in DOAJ * Open Access journals registered in DOAJ
@ -21,5 +21,6 @@ As of today, the OpenAIRE Research Graph aggregates hundreds of millions of meta
* Microsoft Academic Graph * Microsoft Academic Graph
* Datacite * Datacite
After cleaning, deduplication, enrichment and full-text mining processes, the graph is analysed to produce statistics for the [OpenAIRE MONITOR](https://monitor.openaire.eu), the [Open Science Observatory](https://osobservatory.openaire.eu), made discoverable via the [OpenAIRE EXPLORE](https://explore.openaire.eu) and programmatically accessible via [OpenAIRE Public APIs](https://develop.openaire.eu). After cleaning, deduplication, enrichment and full-text mining processes, the graph is analysed to produce statistics for the [OpenAIRE MONITOR](https://monitor.openaire.eu), the [Open Science Observatory](https://osobservatory.openaire.eu), made discoverable via the [OpenAIRE EXPLORE](https://explore.openaire.eu) and programmatically accessible as described at
Last but not least, the Graph data are openly available and can be used by third-parties to create added value services. <span className="todo">https://develop.openaire.eu</span>.
Json dumps are also published on Zenodo.

View File

@ -0,0 +1,8 @@
{
"label": "Learning center",
"position": 9,
"link": {
"type": "generated-index",
"description": "5 minutes to learn the most important Docusaurus concepts."
}
}

Binary file not shown.

After

Width:  |  Height:  |  Size: 25 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 27 KiB

View File

@ -0,0 +1,7 @@
---
sidebar_position: 1
---
# OpenPlato Webinars
<span className="todo">TODO</span>

View File

@ -0,0 +1,7 @@
---
sidebar_position: 2
---
# Tutorials
<span className="todo">TODO</span>

View File

@ -2,71 +2,5 @@
sidebar_position: 7 sidebar_position: 7
--- ---
# How to cite # Related publications
<span className="todo">TODO</span>
Open Science services are open and transparent and survive thanks to your active support and to the visibility and reward they gather. If you use one of the [OpenAIRE Research Graph dumps](https://zenodo.org/record/6616871) for your research, please provide a proper citation following the recommendation that you find on the dump's Zenodo page.
## Relevant research products
### Aggregation system
Manghi, P., Artini, M., Atzori, C., Bardi, A., Mannocci, A., La Bruzzo, S., Candela, L., Castelli, D. and Pagano, P. (2014), “The D-NET software toolkit: A framework for the realization, maintenance, and operation of aggregative infrastructures”, Program: electronic library and information systems, Vol. 48 No. 4, pp. 322-354. [doi:10.1108/prog-08-2013-0045](http://doi.org/10.1108/prog-08-2013-0045)
Atzori, C., Bardi, A., Manghi, P., & Mannocci, A. (2017, January). "The OpenAIRE workflows for data management". In Italian Research Conference on Digital Libraries (pp. 95-107). Springer, Cham. [doi:10.1007/978-3-319-68130-6_8](https://doi.org/10.1007/978-3-319-68130-6_8)
*Software* Michele Artini, Claudio Atzori, Alessia Bardi, Sandro La Bruzzo, Paolo Manghi, & Andrea Mannocci. (2016, November 24). "The D-NET software toolkit: dnet-basic-aggregator (Version 1.3.0)". Zenodo. [doi:10.5281/zenodo.168356](https://doi.org/10.5281/zenodo.168356) <i className="fa-solid fa-arrow-up-right-from-square"></i>
Mannocci, A., & Manghi, P. (2016, September). "DataQ: a data flow quality monitoring system for aggregative data infrastructures". In International Conference on Theory and Practice of Digital Libraries (pp. 357-369). Springer, Cham. [doi:10.1007/978-3-319-43997-6_28](https://doi.org/10.1007/978-3-319-43997-6_28)
### Deduplication
Vichos K., De Bonis M., Kanellos I., Chatzopoulos S., Atzori C., Manola N., Manghi P., Vergoulis T. (Feb. 2022), "A preliminary assessment of the article deduplication algorithm used for the OpenAIRE Research Graph". IRCDL 2022 - 18th Italian Research Conference on Digital Libraries, Padua, Italy. CEUR-WS Proceedings. [http://ceur-ws.org/Vol-3160](http://ceur-ws.org/Vol-3160/)
De Bonis, M., Manghi, P., & Atzori, C. (2022). "FDup: a framework for general-purpose and efficient entity deduplication of record collections". PeerJ Computer Science, 8, e1058. [https://peerj.com/articles/cs-1058](https://peerj.com/articles/cs-1058)
Manghi, P., Atzori, C., De Bonis, M., & Bardi, A. (2020). "Entity deduplication in big data graphs for scholarly communication". Data Technologies and Applications. [doi:10.1108/dta-09-2019-0163](https://doi.org/10.1108/dta-09-2019-0163)
Atzori, C., Manghi, P., & Bardi, A. (2018, December). "GDup: de-duplication of scholarly communication big graphs". In 2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT) (pp. 142-151). IEEE. [doi:10.1109/bdcat.2018.00025](https://doi.org/10.1109/bdcat.2018.00025)
*Software* Claudio Atzori, & Paolo Manghi. (2017, February 17). "GDup: a big graph entity deduplication system" (Version 4.0.5). Zenodo. [doi:/10.5281/zenodo.292980](https://doi.org/10.5281/zenodo.292980)
Atzori, Claudio. "GDup: an Integrated, Scalable Big Graph Deduplication System." (2016). [doi:10.5281/zenodo.1454879](https://doi.org/10.5281/zenodo.1454879)
Manghi, Paolo, Marko Mikulicic, and Claudio Atzori. "De-duplication of aggregation authority files." International Journal of Metadata, Semantics and Ontologies 7.2 (2012): 114-130. [doi:10.1504/ijmso.2012.050014](https://doi.org/10.1504/ijmso.2012.050014)
Manghi, P., & Mikulicic, M. (2011, October). "PACE: A general-purpose tool for authority control". In Research Conference on Metadata and Semantic Research (pp. 80-92). Springer, Berlin, Heidelberg. [doi:10.1007/978-3-642-24731-6_8](https://doi.org/10.1007/978-3-642-24731-6_8)
### Mining
Giannakopoulos T., Foufoulas Y., Dimitropoulos H., Manola N. (2019) “Interactive Text Analysis and Information Extraction”. In: Manghi P., Candela L., Silvello G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, Cham. [doi:10.1007/978-3-030-11226-4_27](https://doi.org/10.1007/978-3-030-11226-4_27)
Foufoulas Y., Stamatogiannakis L., Dimitropoulos H., Ioannidis Y. (2017) “High-Pass Text Filtering for Citation Matching”. In: Kamps J., Tsakonas G., Manolopoulos Y., Iliadis L., Karydis I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham. [doi:10.1007/978-3-319-67008-9_28](https://doi.org/10.1007/978-3-319-67008-9_28)
Y. Chronis, Y. Foufoulas, V. Nikolopoulos, A. Papadopoulos, L. Stamatogiannakis, C. Svingos, Y. E. Ioannidis, "A Relational Approach to Complex Dataflows", in Workshop Proceedings of the EDBT/ICDT 2016 (MEDAL 2016) Joint Conference (March 15, 2016, Bordeaux, France) on CEUR-WS.org (ISSN 1613-0073) [http://ceur-ws.org/Vol-1558/paper45.pdf](http://ceur-ws.org/Vol-1558/paper45.pdf)
T. Giannakopoulos, I. Foufoulas, E. Stamatogiannakis, H. Dimitropoulos, N. Manola, and Y. Ioannidis. 2015. “Visual-Based Classification of Figures from Scientific Literature”. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion). Association for Computing Machinery, New York, NY, USA, 10591060. [doi:10.1145/2740908.2742024](https://doi.org/10.1145/2740908.2742024)
Giannakopoulos, T., Foufoulas, I., Stamatogiannakis, E., Dimitropoulos, H., Manola, N., & Ioannidis, Y. (2014). “Discovering and Visualizing Interdisciplinary Content Classes in Scientific Publications”. D-Lib Mag., Volume 20, Number 11/12. [doi:10.1045/november14-giannakopoulos](https://doi.org/10.1045/november14-giannakopoulos)
Giannakopoulos T., Stamatogiannakis E., Foufoulas I., Dimitropoulos H., Manola N., Ioannidis Y. (2014) “Content Visualization of Scientific Corpora Using an Extensible Relational Database Implementation”. In: Bolikowski Ł., Casarosa V., Goodale P., Houssos N., Manghi P., Schirrwagen J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham. [doi:10.1007/978-3-319-08425-1_10](https://doi.org/10.1007/978-3-319-08425-1_10)
Giannakopoulos T., Dimitropoulos H., Metaxas O., Manola N., Ioannidis Y. (2013) “Supervised Content Visualization of Scientific Publications: A Case Study on the ArXiv Dataset”. In: Kłopotek M.A., Koronacki J., Marciniak M., Mykowiecka A., Wierzchoń S.T. (eds) Language Processing and Intelligent Information Systems. IIS 2013. Lecture Notes in Computer Science, vol 7912. Springer, Berlin, Heidelberg. [doi:10.1007/978-3-642-38634-3_23](https://doi.org/10.1007/978-3-642-38634-3_23)
Tkaczyk, D., Szostek, P., Fedoryszak, M. et al. "CERMINE: automatic extraction of structured metadata from scientific literature". IJDAR 18, 317335 (2015). [doi:10.1007/s10032-015-0249-8](https://doi.org/10.1007/s10032-015-0249-8)
M. Kobos, Ł. Bolikowski, M. Horst, P. Manghi, N. Manola, J. Schirrwagen (2014) “Information inference in scholarly communication infrastructures: the OpenAIREplus project experience”, Procedia Computer Science 38, 92-99. [doi:10.1016/j.procs.2014.10.016](https://doi.org/10.1016/j.procs.2014.10.016)
### Portals
Baglioni M. et al. (2019) "The OpenAIRE Research Community Dashboard: On Blending Scientific Workflows and Scientific Publishing". In: Doucet A., Isaac A., Golub K., Aalberg T., Jatowt A. (eds) Digital Libraries for Open Knowledge. TPDL 2019. Lecture Notes in Computer Science, vol 11799. Springer, Cham. [doi:10.1007/978-3-030-30760-8_5](https://doi.org/10.1007/978-3-030-30760-8_5)
### Broker Service
Manghi, P., Atzori, C., Bardi, A., La Bruzzo, S., & Artini, M. (2016, February). "Realizing a Scalable and History-Aware Literature Broker Service for OpenAIRE". In Italian Research Conference on Digital Libraries (pp. 92-103). Springer, Cham. [doi:10.1007/978-3-319-56300-8_9](https://doi.org/10.1007/978-3-319-56300-8_9)
Artini, M., Atzori, C., Bardi, A., La Bruzzo, S., Manghi, P., & Mannocci, A. (2015). "The OpenAIRE literature broker service for institutional repositories". D-Lib Magazine, 21(11/12), 1. [doi:10.1045/november2015-artini](https://doi.org/10.1045/november2015-artini)

View File

@ -91,19 +91,15 @@ const config = {
type: 'doc', type: 'doc',
docId: 'intro', docId: 'intro',
position: 'left', position: 'left',
label: 'Research graph v5.0', label: 'Research graph',
}, },
//
// documentation version in the navbar
// { // {
// type: 'docsVersionDropdown', // type: 'doc',
// position: 'right' // docId: 'intro',
// position: 'left',
// label: 'docs',
// }, // },
//
// link to blog, the blog must be enabled first
// {to: '/blog', label: 'Blog', position: 'left'}, // {to: '/blog', label: 'Blog', position: 'left'},
//
// link to github repo
// { // {
// href: 'https://github.com/facebook/docusaurus', // href: 'https://github.com/facebook/docusaurus',
// label: 'GitHub', // label: 'GitHub',
@ -111,6 +107,30 @@ const config = {
// }, // },
], ],
}, },
algolia: {
// The application ID provided by Algolia
appId: 'QMQXA47JBK',
// Public API key: it is safe to commit it
apiKey: '72d95d6ee477ad827c88faa456ef29b1',
indexName: 'OpenAIRE_Graph_Docs',
rateLimit: 8,
// Optional: see doc section below
contextualSearch: true,
// Optional: Specify domains where the navigation should occur through window.location instead on history.push. Useful when our Algolia config crawls multiple documentation sites and we want to navigate with window.location.href to them.
externalUrlRegex: 'external\\.com|domain\\.com',
// Optional: Algolia search parameters
searchParameters: {},
// Optional: path for search page that enabled by default (`false` to disable it)
searchPagePath: 'search',
},
footer: { footer: {
style: 'dark', style: 'dark',
links: [ links: [

1537
package-lock.json generated

File diff suppressed because it is too large Load Diff

View File

@ -14,8 +14,9 @@
"write-heading-ids": "docusaurus write-heading-ids" "write-heading-ids": "docusaurus write-heading-ids"
}, },
"dependencies": { "dependencies": {
"@docusaurus/core": "^2.1.0", "@docusaurus/core": "2.0.1",
"@docusaurus/preset-classic": "^2.1.0", "@docusaurus/preset-classic": "2.0.1",
"@docusaurus/theme-search-algolia": "^2.1.0",
"@mdx-js/react": "^1.6.22", "@mdx-js/react": "^1.6.22",
"clsx": "^1.2.1", "clsx": "^1.2.1",
"hast-util-is-element": "^1.1.0", "hast-util-is-element": "^1.1.0",
@ -26,7 +27,7 @@
"remark-math": "^3.0.1" "remark-math": "^3.0.1"
}, },
"devDependencies": { "devDependencies": {
"@docusaurus/module-type-aliases": "^2.1.0" "@docusaurus/module-type-aliases": "2.0.1"
}, },
"browserslist": { "browserslist": {
"production": [ "production": [

View File

@ -13,113 +13,19 @@
/** @type {import('@docusaurus/plugin-content-docs').SidebarsConfig} */ /** @type {import('@docusaurus/plugin-content-docs').SidebarsConfig} */
const sidebars = { const sidebars = {
mySidebar: [ // By default, Docusaurus generates a sidebar from the docs folder structure
{ tutorialSidebar: [{type: 'autogenerated', dirName: '.'}],
type: 'doc',
id: 'intro' // But you can create a sidebar manually
}, /*
tutorialSidebar: [
{ {
type: 'category', type: 'category',
label: "Data model", label: 'Tutorial',
link: {type: 'doc', id: 'data-model/data-model'}, items: ['hello'],
items: [
{ type: 'doc', id: 'data-model/pids-and-identifiers' },
{
type: 'category',
label: "Entities",
link: {
type: 'generated-index',
description: 'The main entities of the OpenAIRE Research Graph are listed below.'
}, },
items: [ ],
{ type: 'doc', id: 'data-model/entities/result' }, */
{ type: 'doc', id: 'data-model/entities/data-source' },
{ type: 'doc', id: 'data-model/entities/organization' },
{ type: 'doc', id: 'data-model/entities/project' },
{ type: 'doc', id: 'data-model/entities/community' },
]
},
{
type: 'doc',
id: 'data-model/relationships'
}
]
},
{
type: "link",
label: "Public API",
href: "https://graph.openaire.eu/develop/overview.html"
},
{
type: 'doc',
id: 'download'
},
{
type: 'category',
label: "Data provision",
link: {type: 'doc', id: 'data-provision/data-provision'},
items: [
{
type: 'category',
label: "Aggregation",
link: {type: 'doc', id: 'data-provision/aggregation/aggregation'},
items: [
{ type: 'doc', id: 'data-provision/aggregation/doiboost', label: 'DOIBoost' },
{ type: 'doc', id: 'data-provision/aggregation/pubmed' },
{ type: 'doc', id: 'data-provision/aggregation/datacite' },
{ type: 'doc', id: 'data-provision/aggregation/ebi', label: 'EMBL-EBI' },
]
},
{
type: 'category',
label: "Deduplication",
link: {type: 'doc', id: 'data-provision/deduplication/deduplication'},
items: [
{ type: 'doc', id: 'data-provision/deduplication/research-products' },
{ type: 'doc', id: 'data-provision/deduplication/organizations' },
]
},
{
type: 'category',
label: "Enrichment",
link: {type: 'doc', id: 'data-provision/enrichment/enrichment'},
items: [
{ type: 'doc', id: 'data-provision/enrichment/mining' },
{ type: 'doc', id: 'data-provision/enrichment/impact-scores' },
]
},
{ type: 'doc', id: 'data-provision/post-cleaning' },
{ type: 'doc', id: 'data-provision/indexing' },
{ type: 'doc', id: 'data-provision/stats' },
]
},
{
type: 'doc',
id: 'services'
},
{
type: "link",
label: "Learning center",
href: "https://openplato.eu/"
},
{
type: 'doc',
id: 'publications',
label: "Relevant publications"
},
{
type: 'doc',
id: 'faq'
},
{
type: 'doc',
id: 'license'
},
{
type: 'doc',
id: 'changelog'
},
]
}; };
module.exports = sidebars; module.exports = sidebars;

View File

@ -5,34 +5,6 @@
*/ */
/* You can override the default Infima variables here. */ /* You can override the default Infima variables here. */
/*
:root {
--ifm-color-primary: #2e8555;
--ifm-color-primary-dark: #29784c;
--ifm-color-primary-darker: #277148;
--ifm-color-primary-darkest: #205d3b;
--ifm-color-primary-light: #33925d;
--ifm-color-primary-lighter: #359962;
--ifm-color-primary-lightest: #3cad6e;
--ifm-code-font-size: 95%;
--docusaurus-highlighted-code-line-bg: rgba(0, 0, 0, 0.1);
}
*/
/* For readability concerns, you should choose a lighter palette in dark mode. */
/*
[data-theme='dark'] {
--ifm-color-primary: #25c2a0;
--ifm-color-primary-dark: #21af90;
--ifm-color-primary-darker: #1fa588;
--ifm-color-primary-darkest: #1a8870;
--ifm-color-primary-light: #29d5b0;
--ifm-color-primary-lighter: #32d8b4;
--ifm-color-primary-lightest: #4fddbf;
--docusaurus-highlighted-code-line-bg: rgba(0, 0, 0, 0.3);
}
*/
:root { :root {
--ifm-color-primary: #4666ca; --ifm-color-primary: #4666ca;
--ifm-color-primary-dark: #3757be; --ifm-color-primary-dark: #3757be;

Binary file not shown.

Before

Width:  |  Height:  |  Size: 174 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 130 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 184 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 79 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 914 KiB