Merge pull request 'rpl 'OpenAIRE Research Graph 'OpenAIRE Graph'' (#13) from openaire_graph_rename into main

Reviewed-on: D-Net/openaire-graph-docs#13
This commit is contained in:
Serafeim Chatzopoulos 2022-11-15 15:39:33 +01:00
commit f688d64dd5
14 changed files with 26 additions and 26 deletions

View File

@ -1,6 +1,6 @@
# Data model
The OpenAIRE Research Graph comprises several types of [entities](../category/entities) and [relationships](./relationships) among them.
The OpenAIRE Graph comprises several types of [entities](../category/entities) and [relationships](./relationships) among them.
The latest version of the JSON schema can be found on [Bulk downloads](../download).
@ -20,6 +20,6 @@ responsible for operating data sources or consisting the affiliations of Product
:::note Further reading
A detailed report on the OpenAIRE Research Graph Data Model can be found on [Zenodo](https://zenodo.org/record/2643199).
A detailed report on the OpenAIRE Graph Data Model can be found on [Zenodo](https://zenodo.org/record/2643199).
:::

View File

@ -3,6 +3,6 @@
"position": 1,
"link": {
"type": "generated-index",
"description": "The main entities of the OpenAIRE Research Graph are listed below."
"description": "The main entities of the OpenAIRE Graph are listed below."
}
}

View File

@ -37,7 +37,7 @@ _Type: String • Cardinality: ONE_
Description of the research community/research infrastructure
```json
"description": "This portal provides access to publications, research data, projects and software that may be relevant to the Corona Virus Disease (COVID-19). The OpenAIRE COVID-19 Gateway aggregates COVID-19 related records, links them and provides a single access point for discovery and navigation. We tag content from the OpenAIRE Research Graph (10,000+ data sources) and additional sources. All COVID-19 related research results are linked to people, organizations and projects, providing a contextualized navigation."
"description": "This portal provides access to publications, research data, projects and software that may be relevant to the Corona Virus Disease (COVID-19). The OpenAIRE COVID-19 Gateway aggregates COVID-19 related records, links them and provides a single access point for discovery and navigation. We tag content from the OpenAIRE Graph (10,000+ data sources) and additional sources. All COVID-19 related research results are linked to people, organizations and projects, providing a contextualized navigation."
```
### name

View File

@ -1,6 +1,6 @@
# PIDs and identifiers
One of the challenges towards the stability of the contents in the OpenAIRE Research Graph consists of making its identifiers and records stable over time.
One of the challenges towards the stability of the contents in the OpenAIRE Graph consists of making its identifiers and records stable over time.
The barriers to this scenario are many, as the Graph keeps a map of data sources that is subject to constant variations: records in repositories vary in content,
original IDs, and PIDs, may disappear or reappear, and the same holds for the repository or the metadata collection it exposes.
Not only, but the mappings applied to the original contents may also change and improve over time to catch up with the changes in the input records.

View File

@ -4,14 +4,14 @@ sidebar_position: 1
# Aggregation
OpenAIRE materializes an open, participatory research graph (the OpenAIRE Research graph) where products of the research life-cycle (e.g. scientific literature, research data, project, software) are semantically linked to each other and carry information about their access rights (i.e. if they are Open Access, Restricted, Embargoed, or Closed) and the sources from which they have been collected and where they are hosted. The OpenAIRE research graph is materialised via a set of autonomic, orchestrated workflows operating in a regimen of continuous data aggregation and integration. [1]
OpenAIRE materializes an open, participatory research graph (the OpenAIRE Graph) where products of the research life-cycle (e.g. scientific literature, research data, project, software) are semantically linked to each other and carry information about their access rights (i.e. if they are Open Access, Restricted, Embargoed, or Closed) and the sources from which they have been collected and where they are hosted. The OpenAIRE Graph is materialised via a set of autonomic, orchestrated workflows operating in a regimen of continuous data aggregation and integration. [1]
## What does OpenAIRE collect?
OpenAIRE aggregates metadata records describing objects of the research life-cycle from content providers compliant to the [OpenAIRE guidelines](https://guidelines.openaire.eu/) and from entity registries (i.e. data sources offering authoritative lists of entities, like [OpenDOAR](https://v2.sherpa.ac.uk/opendoar/), [re3data](https://www.re3data.org/), [DOAJ](https://doaj.org/), and various funder databases). After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is used to generate the final OpenAIRE Research Graph, accessible from the [OpenAIRE EXPLORE portal](https://explore.openaire.eu) and the [APIs](https://graph.openaire.eu/develop/).
OpenAIRE aggregates metadata records describing objects of the research life-cycle from content providers compliant to the [OpenAIRE guidelines](https://guidelines.openaire.eu/) and from entity registries (i.e. data sources offering authoritative lists of entities, like [OpenDOAR](https://v2.sherpa.ac.uk/opendoar/), [re3data](https://www.re3data.org/), [DOAJ](https://doaj.org/), and various funder databases). After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is used to generate the final OpenAIRE Graph, accessible from the [OpenAIRE EXPLORE portal](https://explore.openaire.eu) and the [APIs](https://graph.openaire.eu/develop/).
The transformation process includes the application of cleaning functions whose goal is to ensure that values are harmonised according to a common format (e.g. dates as YYYY-MM-dd) and, whenever applicable, to a common controlled vocabulary. The controlled vocabularies used for cleansing are accessible at [api.openaire.eu/vocabularies](https://api.openaire.eu/vocabularies/). Each vocabulary features a set of controlled terms, each with one code, one label, and a set of synonyms. If a synonym is found as field value, the value is updated with the corresponding term.
Also, the OpenAIRE Research Graph is extended with other relevant scholarly communication sources that do not follow the OpenAIRE Guidelines and/or are too large to be integrated via the “normal” aggregation mechanism: DOIBoost (which merges Crossref, ORCID, Microsoft Academic Graph, and Unpaywall).
Also, the OpenAIRE Graph is extended with other relevant scholarly communication sources that do not follow the OpenAIRE Guidelines and/or are too large to be integrated via the “normal” aggregation mechanism: DOIBoost (which merges Crossref, ORCID, Microsoft Academic Graph, and Unpaywall).
<p align="center">
<img loading="lazy" alt="Aggregation" src="/img/docs/aggregation.png" width="65%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
@ -30,7 +30,7 @@ Relationships between objects are collected from the data sources, but also auto
## What kind of data sources are in OpenAIRE?
Objects and relationships in the OpenAIRE Research Graph are extracted from information packages, i.e. metadata records, collected from data sources of the following kinds:
Objects and relationships in the OpenAIRE Graph are extracted from information packages, i.e. metadata records, collected from data sources of the following kinds:
- *Institutional or thematic repositories*: Information systems where scientists upload the bibliographic metadata and full-texts of their articles, due to obligations from their organization or due to community practices (e.g. ArXiv, Europe PMC);
- *Open Access Publishers and journals*: Information system of open access publishers or relative journals, which offer bibliographic metadata and PDFs of their published articles;

View File

@ -68,7 +68,7 @@ Records in Crossref are ruled out according to the following criteria
Records with `type=dataset` are mapped into OpenAIRE results of type dataset. All others are mapped as OpenAIRE results of type publication.
### Mapping Crossref properties into the OpenAIRE Research Graph
### Mapping Crossref properties into the OpenAIRE Graph
Properties in OpenAIRE results are set based on the logic described in the following table:
@ -222,7 +222,7 @@ Miriam will modify the process to ensure that:
* Only papers with DOI are considered
* Since for the same DOI we have multiple version of item with different MAG PaperId, we only take one per DOI (the last one we process). We call this dataset `Papers_distinct`
When mapping MAG records to the OpenAIRE Research Graph, we consider the following MAG tables:
When mapping MAG records to the OpenAIRE Graph, we consider the following MAG tables:
* `PaperAbstractsInvertedIndex`: for the paper abstracts
* `Authors`: for the authors. The MAG data is pre-processed by grouping authors by PaperId
* `Affiliations` and `PaperAuthorAffiliations`: to generate links between publications and organisations

View File

@ -14,7 +14,7 @@ The data curation activity is twofold, on one end pivots around the disambiguati
Duplicates among organizations are therefore managed through three different stages:
* *Creation of Suggestions*: executes an automatic workflow that performs the deduplication and prepare new suggestions for the curators to be processed;
* *Curation*: manual editing of the organization records performed by the data curators;
* *Creation of Representative Organizations*: executes an automatic workflow that creates curated organizations and exposes them on the OpenAIRE Research Graph by using the curators' feedback from the OpenOrgs underlying database.
* *Creation of Representative Organizations*: executes an automatic workflow that creates curated organizations and exposes them on the OpenAIRE Graph by using the curators' feedback from the OpenOrgs underlying database.
The next sections describe the above mentioned stages.
@ -61,7 +61,7 @@ Note that if a curator does not provide a feedback on a similarity relation sugg
### Creation of Representative Organizations
This stage executes an automatic workflow that faces the *duplicates grouping* stage to create representative organizations and to update them on the OpenAIRE Research Graph. Such organizations are obtained via transitive closure and the relations used comes from the curators' feedback gathered on the OpenOrgs underlying Database.
This stage executes an automatic workflow that faces the *duplicates grouping* stage to create representative organizations and to update them on the OpenAIRE Graph. Such organizations are obtained via transitive closure and the relations used comes from the curators' feedback gathered on the OpenOrgs underlying Database.
#### Duplicates grouping (transitive closure)

View File

@ -2,7 +2,7 @@
## Mining
The OpenAIRE Research Graph is enriched by links mined by OpenAIREs full-text mining algorithms that scan the plaintexts of publications for funding information, references to datasets, software URIs, accession numbers of bioetities, and EPO patent mentions. Custom mining modules also link research objects to specific research communities, initiatives and infrastructures. In addition, other inference modules provide content-based document classification, document similarity, citation matching, and author affiliation matching.
The OpenAIRE Graph is enriched by links mined by OpenAIREs full-text mining algorithms that scan the plaintexts of publications for funding information, references to datasets, software URIs, accession numbers of bioetities, and EPO patent mentions. Custom mining modules also link research objects to specific research communities, initiatives and infrastructures. In addition, other inference modules provide content-based document classification, document similarity, citation matching, and author affiliation matching.
**Project mining** in OpenAIRE text mines the full-texts of publications in order to extract matches to funding project codes/IDs. The mining algorithm works by utilising (i) the grant identifier, and (ii) the project acronym (if available) of each project. The mining algorithm: (1) Preprocesses/normalizes the full-texts using several functions, which depend on the characteristics of each funder (i.e., the format of the grant identifiers), such as stopword and/or punctuation removal, tokenization, stemming, converting to lowercase; then (2) String matching of grant identifiers against the normalized text is done using database techniques; and (3) The results are validated and cleaned using the context near the match by looking at the context around the matched ID for relevant metadata and positive or negative words/phrases, in order to calculate a confidence value for each publication-->project link. A confidence threshold is set to optimise high accuracy while minimising false positives, such as matches with page or report numbers, post/zip codes, parts of telephone numbers, DOIs or URLs, accession numbers. The algorithm also applies rules for disambiguating results, as different funders can share identical project IDs; for example, grant number 633172 could refer to H2020 project EuroMix but also to Australian-funded NHMRC project “Brain activity (EEG) analysis and brain imaging techniques to measure the neurobiological effects of sleep apnea”. Project mining works very well and was the first Text & Data Mining (TDM) service of OpenAIRE. Performance results vary from funder to funder but precision is higher than 98% for all funders and 99.5% for EC projects. Recall is higher than 95% (99% for EC projects), when projects are properly acknowledged using project/grant IDs.

View File

@ -4,17 +4,17 @@ sidebar_position: 5
# Indexing
The final version of the OpenAIRE Research Graph is indexed on a Solr server that is used by the OpenAIRE portals (EXPLORE, CONNECT, PROVIDE) and APIs, the latter adopted by several third-party applications and organizations, such as:
The final version of the OpenAIRE Graph is indexed on a Solr server that is used by the OpenAIRE portals (EXPLORE, CONNECT, PROVIDE) and APIs, the latter adopted by several third-party applications and organizations, such as:
* The OpenAIRE Research Graph APIs and Portals will offer to the EOSC (European Open Science Cloud) an Open Science Resource Catalogue, keeping an up to date map of all research results (publications, datasets, software), services, organizations, projects, funders in Europe and beyond.
* The OpenAIRE Graph APIs and Portals will offer to the EOSC (European Open Science Cloud) an Open Science Resource Catalogue, keeping an up to date map of all research results (publications, datasets, software), services, organizations, projects, funders in Europe and beyond.
* DSpace & EPrints repositories can install the OpenAIRE plugin to expose OpenAIRE compliant metadata records via their OAI-PMH endpoint and offer to researchers the possibility to link their depositions to the funding project, by selecting it from the list of project provided by OpenAIRE.
* EC participant portal (Sygma - System for Grant Management) uses the OpenAIRE API in the “Continuous Reporting” section. Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in the OpenAIRE Research Graph that are linked to the project. The user can select the research products from the list and easily compile the continuous reporting data of the project.
* EC participant portal (Sygma - System for Grant Management) uses the OpenAIRE API in the “Continuous Reporting” section. Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in the OpenAIRE Graph that are linked to the project. The user can select the research products from the list and easily compile the continuous reporting data of the project.
* ScholExplorer is used by different players of the scholarly communication ecosystem. For example, [Elsevier](https://www.elsevier.com/authors/tools-and-resources/research-data/data-base-linking) uses its API to make the links between
publications and datasets automatically appear on ScienceDirect.
ScholExplorer indexes the links among the four major types of research products (API v3) available in the OpenAIRE Research Graph and makes them available through an HTTP API that allows
ScholExplorer indexes the links among the four major types of research products (API v3) available in the OpenAIRE Graph and makes them available through an HTTP API that allows
to search them by the following criteria:
* Links whose source object has a given PID or PID type;
* Links whose source object has been published by a given data source ("data source as publisher");

View File

@ -5,7 +5,7 @@ sidebar_position: 4
# Post cleaning
At the very end of the processing pipeline, a step is dedicated to perform cleaning operations aimed at improving the overall quality of the data.
The output of this final cleansing step is the final version of the OpenAIRE Research Graph.
The output of this final cleansing step is the final version of the OpenAIRE Graph.
## Vocabulary based cleaning
@ -43,7 +43,7 @@ In addition, the integration of ScholeXplorer and DOIBoost and some enrichment p
## Filtering
Bibliographic records that do not meet minimal requirements for being part of the OpenAIRE Research Graph are eliminated during this phase.
Bibliographic records that do not meet minimal requirements for being part of the OpenAIRE Graph are eliminated during this phase.
Currently, the only criteria applied horizontally to the entire graph aims at excluding scientific results whose title is not meaningful for citation purposes.
Then, different criteria are applied in the pre-processing of specific sub-collections:

View File

@ -4,4 +4,4 @@ sidebar_position: 6
# Stats analysis
The OpenAIRE Research Graph is also processed by a pipeline for extracting the statistics and producing the charts for funders, research initiative, infrastructures, and policy makers that you can see on MONITOR. Based on the information available on the graph, OpenAIRE provides a set of indicators for monitoring the funding and research impact and the uptake of Open Science publishing practices, such as Open Access publishing of publications and datasets, availability of interlinks between research products, availability of post-print versions in institutional or thematic Open Access repositories, etc.
The OpenAIRE Graph is also processed by a pipeline for extracting the statistics and producing the charts for funders, research initiative, infrastructures, and policy makers that you can see on MONITOR. Based on the information available on the graph, OpenAIRE provides a set of indicators for monitoring the funding and research impact and the uptake of Open Science publishing practices, such as Open Access publishing of publications and datasets, availability of interlinks between research products, availability of post-print versions in institutional or thematic Open Access repositories, etc.

View File

@ -6,12 +6,12 @@ sidebar_position: 1
# Overview
The OpenAIRE Research Graph is one of the largest open scholarly record collections worldwide, key in fostering Open Science and establishing its practices in the daily research activities.
The OpenAIRE Graph is one of the largest open scholarly record collections worldwide, key in fostering Open Science and establishing its practices in the daily research activities.
Conceived as a public and transparent good, populated out of data sources trusted by scientists, the Graph aims at bringing discovery, monitoring, and assessment of science back in the hands of the scientific community.
Imagine a vast collection of research products all linked together, contextualised and openly available. For the past years OpenAIRE has been working to gather this valuable record. It is a massive collection of metadata and links between scientific products such as articles, datasets, software, and other research products, entities like organisations, funders, funding streams, projects, communities, and data sources.
As of today, the OpenAIRE Research Graph aggregates hundreds of millions of metadata records (and links among them) from multiple data sources trusted by scientists, including:
As of today, the OpenAIRE Graph aggregates hundreds of millions of metadata records (and links among them) from multiple data sources trusted by scientists, including:
* Repositories registered in OpenDOAR or re3data.org (soon FAIRSharing.org)
* Open Access journals registered in DOAJ

View File

@ -4,7 +4,7 @@ sidebar_position: 7
# How to cite
Open Science services are open and transparent and survive thanks to your active support and to the visibility and reward they gather. If you use one of the [OpenAIRE Research Graph dumps](https://zenodo.org/record/6616871) for your research, please provide a proper citation following the recommendation that you find on the dump's Zenodo page.
Open Science services are open and transparent and survive thanks to your active support and to the visibility and reward they gather. If you use one of the [OpenAIRE Graph dumps](https://zenodo.org/record/6616871) for your research, please provide a proper citation following the recommendation that you find on the dump's Zenodo page.
## Relevant research products
@ -20,7 +20,7 @@ Mannocci, A., & Manghi, P. (2016, September). "DataQ: a data flow quality monito
### Deduplication
Vichos K., De Bonis M., Kanellos I., Chatzopoulos S., Atzori C., Manola N., Manghi P., Vergoulis T. (Feb. 2022), "A preliminary assessment of the article deduplication algorithm used for the OpenAIRE Research Graph". IRCDL 2022 - 18th Italian Research Conference on Digital Libraries, Padua, Italy. CEUR-WS Proceedings. [http://ceur-ws.org/Vol-3160](http://ceur-ws.org/Vol-3160/)
Vichos K., De Bonis M., Kanellos I., Chatzopoulos S., Atzori C., Manola N., Manghi P., Vergoulis T. (Feb. 2022), "A preliminary assessment of the article deduplication algorithm used for the OpenAIRE Graph". IRCDL 2022 - 18th Italian Research Conference on Digital Libraries, Padua, Italy. CEUR-WS Proceedings. [http://ceur-ws.org/Vol-3160](http://ceur-ws.org/Vol-3160/)
De Bonis, M., Manghi, P., & Atzori, C. (2022). "FDup: a framework for general-purpose and efficient entity deduplication of record collections". PeerJ Computer Science, 8, e1058. [https://peerj.com/articles/cs-1058](https://peerj.com/articles/cs-1058)

View File

@ -29,7 +29,7 @@ const sidebars = {
label: "Entities",
link: {
type: 'generated-index',
description: 'The main entities of the OpenAIRE Research Graph are listed below.'
description: 'The main entities of the OpenAIRE Graph are listed below.'
},
items: [
{ type: 'doc', id: 'data-model/entities/result' },