Merge pull request 'Restructure data provision section' (#32) from restructure_data_provision into main

Reviewed-on: #32
Serafeim Chatzopoulos 2022-12-21 17:56:43 +01:00
commit fdc331641d
28 changed files with 117 additions and 89 deletions

2
.env
View File

@ -1,2 +1,2 @@
URL="http://snf-23385.ok-kno.grnetcloud.net"
BASE_URL="/docs"
BASE_URL="/"

View File

@ -647,11 +647,11 @@ A measure computed for this instance (e.g. those provided by [BIP! Finder](https
_Type: String • Cardinality: ONE_
The specified measure. Currently, one of the following is supported:
* `influence` (see [PageRank](/data-provision/enrichment/impact-scores#pagerank-pr))
* `influence_alt` (see [Citation Count](/data-provision/enrichment/impact-scores#citation-count-cc))
* `popularity` (see [AttRank](/data-provision/enrichment/impact-scores#attrank))
* `popularity_alt` (see [RAM](/data-provision/enrichment/impact-scores#ram))
* `impulse` (see ["Incubation" Citation Count](/data-provision/enrichment/impact-scores#incubation-citation-count-icc))
* `influence` (see [PageRank](/data-provision/indicators-ingestion/impact-scores#pagerank-pr))
* `influence_alt` (see [Citation Count](/data-provision/indicators-ingestion/impact-scores#citation-count-cc))
* `popularity` (see [AttRank](/data-provision/indicators-ingestion/impact-scores#attrank))
* `popularity_alt` (see [RAM](/data-provision/indicators-ingestion/impact-scores#ram))
* `impulse` (see ["Incubation" Citation Count](/data-provision/indicators-ingestion/impact-scores#incubation-citation-count-icc))
```json
"key": "influence"

View File

@ -0,0 +1,5 @@
---
sidebar_position: 1
---
# OpenAIRE compatible sources

View File

@ -1,15 +1,10 @@
---
sidebar_position: 4
---
# Cleaning
# Post cleaning
At the very end of the processing pipeline, a dedicated step performs cleaning operations aimed at improving the overall quality of the data.
The output of this final cleansing step is the final version of the OpenAIRE Graph.
## Vocabulary based cleaning
<!-- ## Vocabulary based cleaning -->
The aggregation processes run independently one from another and continuously. Each aggregation process, depending on the characteristics of the records exposed by the data source, makes use of one or more vocabularies to harmonise the values available in a given field.
In this page, we describe the *vocabulary-based cleaning* operation performed to harmonise the data of the different data sources.
A vocabulary is a data structure that defines a list of terms, and for each term defines a list of synonyms:
```xml
@ -39,17 +34,4 @@ The content of the vocabularies can be accessed on [api.openaire.eu/vocabularies
Given a value provided in the original records, the cleaning process looks for a synonym and, when found, resolves the corresponding term which is used in turn to build the cleaned record.
Each aggregation process applies vocabularies according to their definitions at a given moment in time. However, a vocabulary may change after the aggregation of a data source has finished, in which case the aggregated content does not reflect the current status of the controlled vocabularies.
In addition, the integration of ScholeXplorer and DOIBoost and some enrichment processes applied on the raw and on the de-duplicated graph may introduce values that do not comply with the current status of the OpenAIRE controlled vocabularies. For these reasons, we included a final step of cleansing at the end of the workflow materialisation.
## Filtering
Bibliographic records that do not meet minimal requirements for being part of the OpenAIRE Graph are eliminated during this phase.
Currently, the only criterion applied horizontally to the entire graph aims at excluding scientific results whose title is not meaningful for citation purposes.
Then, different criteria are applied in the pre-processing of specific sub-collections:
* [Crossref filtering](/data-provision/aggregation/doiboost#crossref-filtering)
## Country cleaning
This phase is responsible for removing the country information from result records that match specific criteria. The need for this phase is driven by the fact that some data sources, although regarded as being of national pertinence, contain material that is not always related to the given country.
In addition, the integration of ScholeXplorer and DOIBoost and some enrichment processes applied on the raw and on the de-duplicated graph may introduce values that do not comply with the current status of the OpenAIRE controlled vocabularies. For these reasons, we included a final step of cleansing at the end of the workflow materialisation.
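
The vocabulary-based cleaning described in this file can be illustrated with a minimal sketch. It is not the OpenAIRE implementation: the vocabulary content, the function names, the case-insensitive matching, and the choice to return `None` for unknown values are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical vocabulary: each term maps to its list of synonyms.
# The entries are made up for this example and are not taken from
# the actual OpenAIRE vocabularies.
ACCESS_RIGHTS_VOCABULARY: Dict[str, List[str]] = {
    "OPEN": ["open access", "info:eu-repo/semantics/openAccess"],
    "RESTRICTED": ["restricted access", "info:eu-repo/semantics/restrictedAccess"],
}


def build_synonym_index(vocabulary: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert the vocabulary into a synonym -> term lookup table."""
    index: Dict[str, str] = {}
    for term, synonyms in vocabulary.items():
        # A term also resolves to itself.
        index[term.strip().lower()] = term
        for synonym in synonyms:
            # Assumption: matching ignores case and surrounding whitespace.
            index[synonym.strip().lower()] = term
    return index


def clean_value(value: str, index: Dict[str, str]) -> Optional[str]:
    """Return the term for a raw value, or None when no synonym matches."""
    return index.get(value.strip().lower())
```

A caller would build the index once per vocabulary and then call `clean_value` for every field value; what happens to values that resolve to `None` is left open here because the text above does not specify it.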

View File

@ -1,7 +1,7 @@
# Data provision
# Graph production workflow
OpenAIRE collects metadata records from more than 70K scholarly communication sources from all over the world, including Open Access institutional repositories, data archives, journals. All the metadata records (i.e. descriptions of research products) are put together in a data lake, together with records from Crossref, Unpaywall, ORCID, Grid.ac, and information about projects provided by national and international funders. Dedicated inference algorithms applied to metadata and to the full-texts of Open Access publications enrich the content of the data lake with links between research results and projects, author affiliations, subject classification, links to entries from domain-specific databases. Duplicated organisations and results are identified and merged together to obtain an open, trusted, public resource enabling explorations of the scholarly communication landscape like never before.
<p align="center">
<img loading="lazy" alt="Data provision" src="/img/docs/architecture.png" width="80%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
<img loading="lazy" alt="Data provision" src="/img/docs/architecture.png" width="120%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>

View File

@ -1,5 +1,4 @@
# Bulk Tagging/Deduction
# Deduction
The Deduction process (also known as “bulk tagging”) enriches each record with new information that can be derived from the existing property values.

View File

@ -4,8 +4,7 @@ sidebar_position: 3
# Extraction of acknowledged concepts
***Short description:***
Scans the plaintexts of publications for acknowledged concepts, including grant identifiers (projects) of funders, accession numbers of bioentities, EPO patent mentions, as well as custom concepts that can link research objects to specific research communities and initiatives in OpenAIRE.
***Short description:*** Scans the plaintexts of publications for acknowledged concepts, including grant identifiers (projects) of funders, accession numbers of bioentities, EPO patent mentions, as well as custom concepts that can link research objects to specific research communities and initiatives in OpenAIRE.
***Algorithmic details:***
The algorithm processes the publication's fulltext and extracts references to acknowledged concepts. It applies pattern matching and string join between the fulltext and a target database which contains the title, the acronym and the identifier of the searched concept.
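
The combination of pattern matching and a join against a target database can be sketched as follows. This is only an illustration under stated assumptions: the six-digit identifier pattern, the `Concept` record, the sample database, and the acronym check are hypothetical and do not reproduce the actual mining algorithm.

```python
import re
from typing import List, NamedTuple, Tuple


class Concept(NamedTuple):
    """Hypothetical row of the target database."""
    identifier: str
    acronym: str
    title: str


# Made-up target database used only for this example.
TARGET_DB: List[Concept] = [
    Concept("123456", "EXAMPLE", "An Example Research Project"),
    Concept("654321", "DEMO", "Demonstration of Text Mining"),
]

# Assumption: identifiers are six-digit numbers; real funders use
# many different formats.
GRANT_ID_PATTERN = re.compile(r"\b\d{6}\b")


def extract_acknowledged_concepts(
    fulltext: str, target_db: List[Concept]
) -> List[Tuple[str, bool]]:
    """Return (identifier, acronym_mentioned) for each matched concept."""
    by_identifier = {concept.identifier: concept for concept in target_db}
    # Pattern matching: collect candidate identifiers from the fulltext.
    candidates = set(GRANT_ID_PATTERN.findall(fulltext))
    matches = []
    # Join: keep only candidates that exist in the target database.
    for identifier in sorted(candidates):
        concept = by_identifier.get(identifier)
        if concept is not None:
            # The acronym appearing in the text is recorded as extra evidence.
            matches.append((identifier, concept.acronym in fulltext))
    return matches
```

Joining on the identifier discards numbers that merely look like grant identifiers, while the acronym flag gives a downstream step a simple signal for ranking or filtering matches.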

View File

@ -4,8 +4,7 @@ sidebar_position: 1
# Affiliation matching
***Short description:***
The goal of the affiliation matching module is to match affiliations extracted from the pdf and xml documents with organizations from the OpenAIRE organization database.
***Short description:*** The goal of the affiliation matching module is to match affiliations extracted from the pdf and xml documents with organizations from the OpenAIRE organization database.
***Algorithmic details:***

View File

@ -1,7 +1,6 @@
# Citation matching
***Short description:***
During a citation matching task, bibliographic entries are linked to the documents that they reference. The citation matching module - one of the modules of the Information Inference Service (IIS) - receives as an input a list of documents accompanied by their metadata and bibliography. Among them, it discovers links described above and returns them as a list. In this document we shall evaluate if the module has been properly integrated with the whole
***Short description:*** During a citation matching task, bibliographic entries are linked to the documents that they reference. The citation matching module - one of the modules of the Information Inference Service (IIS) - receives as an input a list of documents accompanied by their metadata and bibliography. Among them, it discovers links described above and returns them as a list. In this document we shall evaluate if the module has been properly integrated with the whole
system and assess the accuracy of the algorithm used. It is worth mentioning that the implemented algorithm has been described in detail in arXiv:1303.6906 [cs.IR]. However, in the referenced paper the algorithm was tested on small datasets, whereas here we focus on larger datasets, which are expected to be analysed by the system in the production environment.
***Algorithmic details:***

View File

@ -4,8 +4,7 @@ sidebar_position: 4
# Extraction of cited concepts
***Short description:***
Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs.
***Short description:*** Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs.
***Algorithmic details:***
The algorithm extracts citations to specific datasets and software. It extracts the citation section of a publication's fulltext and applies string matching against a target database which includes an inverted index with dataset/software titles, urls and other metadata.

View File

@ -1,7 +1,6 @@
# Documents similarity
***Short description:***
The document similarity module is responsible for finding similar documents among the ones available in the OpenAIRE Information Space. It produces "similarity" links between the documents stored in the OpenAIRE Information Space. Each link is assigned a similarity score in the [0,1] range; the higher the score, the more similar the documents are expected to be with respect to their content.
***Short description:*** The document similarity module is responsible for finding similar documents among the ones available in the OpenAIRE Information Space. It produces "similarity" links between the documents stored in the OpenAIRE Information Space. Each link is assigned a similarity score in the [0,1] range; the higher the score, the more similar the documents are expected to be with respect to their content.
***Algorithmic details:***
The similarity between two documents is expressed as the similarity between weights of their common terms (i.e., words being reduced to their root form) within a context of all terms from the first and the second document. In this approach, the computation can be divided into three consecutive steps:

View File

Image file changed. Size before: 37 KiB. Size after: 37 KiB.

View File

@ -1,7 +1,6 @@
# Metadata extraction
***Short description:***
The Metadata Extraction algorithm is responsible for extracting plaintext and metadata from PDF documents. It is based on the [CERMINE](http://cermine.ceon.pl/about.html) project.
***Short description:*** The Metadata Extraction algorithm is responsible for extracting plaintext and metadata from PDF documents. It is based on the [CERMINE](http://cermine.ceon.pl/about.html) project.
CERMINE is a comprehensive open source system for extracting metadata and content from scientific articles in born-digital form. The system is able to process documents in PDF format and extracts:

View File

@ -0,0 +1,18 @@
# Finalisation
At the very end of the graph production workflow, a dedicated step performs the finalisation operations described on this page,
which aim to improve the overall quality of the data.
The output of this final step is the final version of the OpenAIRE Graph.
## Filtering
Bibliographic records that do not meet minimal requirements for being part of the OpenAIRE Graph are eliminated during this phase.
Currently, the only criterion applied horizontally to the entire graph aims at excluding scientific results whose title is not meaningful for citation purposes.
Then, different criteria are applied in the pre-processing of specific sub-collections:
* [Crossref filtering](/data-provision/aggregation/non-compatible-sources/doiboost#crossref-filtering)
## Country cleaning
This phase is responsible for removing the country information from result records that match specific criteria. The need for this phase is driven by the fact that some data sources, although regarded as being of national pertinence, contain material that is not always related to the given country.

View File

@ -1,7 +1,3 @@
---
sidebar_position: 5
---
# Indexing
The final version of the OpenAIRE Graph is indexed on a Solr server that is used by the OpenAIRE portals (EXPLORE, CONNECT, PROVIDE) and APIs, the latter adopted by several third-party applications and organizations, such as:

View File

@ -1,7 +1,3 @@
---
sidebar_position: 2
---
# Impact indicators
This page summarises all calculated impact indicators, which are included in the [measure](/data-model/entities/other#measure) property.

View File

@ -0,0 +1,7 @@
# Usage counts
Usage counts cover the needs of content providers and consumers offering added value to assist them in reaching their goals.
They include usage activity metrics for Open Access repositories, categorising the retrieved data by country, number of downloads, number of views, and number of repositories, together with all derivative quantitative open metrics.
You can find more information about the UsageCounts service [here](https://usagecounts.openaire.eu/).

View File

@ -0,0 +1,3 @@
# Merge by id
<span className="todo">TODO</span>

View File

@ -66,7 +66,7 @@ const sidebars = {
},
{
type: 'category',
label: "Data provision",
label: "Graph production workflow",
link: {type: 'doc', id: 'data-provision/data-provision'},
items: [
{
@ -74,12 +74,46 @@ const sidebars = {
label: "Aggregation",
link: {type: 'doc', id: 'data-provision/aggregation/aggregation'},
items: [
{ type: 'doc', id: 'data-provision/aggregation/doiboost', label: 'DOIBoost' },
{ type: 'doc', id: 'data-provision/aggregation/pubmed' },
{ type: 'doc', id: 'data-provision/aggregation/datacite' },
{ type: 'doc', id: 'data-provision/aggregation/ebi', label: 'EMBL-EBI' },
{
type: 'doc',
label: "OpenAIRE compatible sources",
id: 'data-provision/aggregation/compatible-sources',
},
{
type: 'category',
label: "Non-compatible sources",
link: { type: 'generated-index' },
items: [
{ type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/doiboost', label: 'DOIBoost' },
{ type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/pubmed' },
{ type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/datacite' },
{ type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/ebi', label: 'EMBL-EBI' },
]
}
]
},
{
type: 'doc',
id: 'data-provision/merge-by-id'
},
{
type: 'category',
label: "Enrichment by mining",
link: {
type: 'generated-index',
description: 'The OpenAIRE Graph is enriched using the different Text and Data Mining (TDM) algorithms that are grouped in the following categories.'
},
items: [
{ type: 'doc', id: 'data-provision/enrichment-by-mining/affiliation_matching' },
{ type: 'doc', id: 'data-provision/enrichment-by-mining/citation_matching' },
{ type: 'doc', id: 'data-provision/enrichment-by-mining/classifies' },
{ type: 'doc', id: 'data-provision/enrichment-by-mining/documents_similarity' },
{ type: 'doc', id: 'data-provision/enrichment-by-mining/acks' },
{ type: 'doc', id: 'data-provision/enrichment-by-mining/cites' },
{ type: 'doc', id: 'data-provision/enrichment-by-mining/metadata_extraction' },
]
},
{ type: 'doc', id: 'data-provision/cleaning' },
{
type: 'category',
label: "Deduplication",
@ -90,38 +124,32 @@ const sidebars = {
]
},
{
type: 'category',
label: "Enrichment",
link: {
type: 'generated-index',
description: 'The OpenAIRE Graph is enriched using the different processes that we describe in this section.'
type: 'category',
label: "Deduction & propagation",
link: {
type: 'generated-index' ,
description: 'The OpenAIRE Graph is further enriched by the deduction and propagation processes described in this section.'
},
items: [
{
type: 'category',
label: "Mining",
link: {
type: 'generated-index',
description: 'The Text and Data Mining (TDM) algorithms used for enriching the OpenAIRE Graph are grouped in the following main categories:'
},
items: [
{ type: 'doc', id: 'data-provision/enrichment/affiliation_matching' },
{ type: 'doc', id: 'data-provision/enrichment/citation_matching' },
{ type: 'doc', id: 'data-provision/enrichment/classifies' },
{ type: 'doc', id: 'data-provision/enrichment/documents_similarity' },
{ type: 'doc', id: 'data-provision/enrichment/acks' },
{ type: 'doc', id: 'data-provision/enrichment/cites' },
{ type: 'doc', id: 'data-provision/enrichment/metadata_extraction' },
]
},
{ type: 'doc', id: 'data-provision/enrichment/bulk-tagging' },
{ type: 'doc', id: 'data-provision/enrichment/propagation' },
{ type: 'doc', id: 'data-provision/enrichment/impact-scores' },
{ type: 'doc', id: 'data-provision/deduction-and-propagation/bulk-tagging' },
{ type: 'doc', id: 'data-provision/deduction-and-propagation/propagation' },
]
},
{ type: 'doc', id: 'data-provision/post-cleaning' },
{
type: 'category',
label: "Indicators ingestion",
link: {
type: 'generated-index' ,
description: 'In this step, the following types of indicators are ingested in the OpenAIRE Graph.'
},
items: [
{ type: 'doc', id: 'data-provision/indicators-ingestion/impact-scores' },
{ type: 'doc', id: 'data-provision/indicators-ingestion/usage-counts' },
]
},
{ type: 'doc', id: 'data-provision/finalisation' },
{ type: 'doc', id: 'data-provision/indexing' },
]
},
@ -135,10 +163,10 @@ const sidebars = {
id: 'publications',
label: "Relevant publications"
},
{
type: 'doc',
id: 'faq'
},
// {
// type: 'doc',
// id: 'faq'
// },
{
type: 'doc',
id: 'license'

Binary file not shown. Size after: 278 KiB.

Binary file not shown. Size before: 278 KiB. Size after: 83 KiB.