Restructure data provision section #32

Merged
schatz merged 5 commits from restructure_data_provision into main 2022-12-21 17:56:43 +01:00
10 changed files with 50 additions and 51 deletions
Showing only changes of commit 69ff846180

@ -1 +1,37 @@
# Cleaning
<!-- ## Vocabulary based cleaning -->
The aggregation processes run continuously and independently of one another. Each aggregation process, depending on the characteristics of the records exposed by the data source, makes use of one or more vocabularies to harmonise the values available in a given field.
On this page, we describe the *vocabulary-based cleaning* operation performed to harmonise the data of the different data sources.
A vocabulary is a data structure that defines a list of terms, and for each term defines a list of synonyms:
```xml
<TERMS>
<TERM native_name="Annotation" code="0018" english_name="Annotation" encoding="OPENAIRE">
<SYNONYMS>
<SYNONYM term="Comentario" encoding="CSIC"/>
<SYNONYM term="Comment/debate" encoding="Aaltodoc Publication Archive"/>
<SYNONYM term="annotation" encoding="OPENAIRE-PR202112"/>
[...]
</SYNONYMS>
<RELATIONS/>
</TERM>
<TERM native_name="Article" code="0001" english_name="Article" encoding="OPENAIRE">
<SYNONYMS>
<SYNONYM term="A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä" encoding="Aaltodoc Publication Archive"/>
<SYNONYM term="A4 Artikkeli konferenssijulkaisussa" encoding="Aaltodoc Publication Archive"/>
<SYNONYM term="Article" encoding="OTHER"/>
<SYNONYM term="Article (author)" encoding="OTHER"/>
[...]
```
Each vocabulary is typically used to control and harmonise the values available in a specific field characterising the bibliographic records. The example above provides a preview of the vocabulary used to clean the [result's instance typology](/data-model/entities/result#instance).
The content of the vocabularies can be accessed on [api.openaire.eu/vocabularies](https://api.openaire.eu/vocabularies/).
Given a value provided in the original records, the cleaning process looks for a matching synonym and, when one is found, resolves the corresponding term, which is then used to build the cleaned record.
Each aggregation process applies the vocabularies as they are defined at the time it runs. A vocabulary may, however, change after the aggregation of a data source has finished, in which case the aggregated content no longer reflects the current status of the controlled vocabularies.
In addition, the integration of ScholeXplorer and DOIBoost, and some enrichment processes applied on the raw and on the de-duplicated graph, may introduce values that do not comply with the current status of the OpenAIRE controlled vocabularies. For these reasons, we included a final cleansing step at the end of the workflow materialisation.
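
The synonym lookup described above can be illustrated with a minimal sketch. It is not the OpenAIRE implementation: the function names `build_synonym_index` and `clean_value` are hypothetical, the XML is a shortened, well-formed excerpt modelled on the preview above, and the case-insensitive, whitespace-trimmed matching is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed excerpt modelled on the vocabulary preview above.
VOCABULARY_XML = """
<TERMS>
  <TERM native_name="Annotation" code="0018" english_name="Annotation" encoding="OPENAIRE">
    <SYNONYMS>
      <SYNONYM term="Comentario" encoding="CSIC"/>
      <SYNONYM term="annotation" encoding="OPENAIRE-PR202112"/>
    </SYNONYMS>
  </TERM>
  <TERM native_name="Article" code="0001" english_name="Article" encoding="OPENAIRE">
    <SYNONYMS>
      <SYNONYM term="Article" encoding="OTHER"/>
      <SYNONYM term="Article (author)" encoding="OTHER"/>
    </SYNONYMS>
  </TERM>
</TERMS>
"""


def build_synonym_index(vocabulary_xml):
    """Map each normalised synonym to the (code, english_name) of its term."""
    index = {}
    root = ET.fromstring(vocabulary_xml)
    for term in root.iter("TERM"):
        resolved = (term.get("code"), term.get("english_name"))
        for synonym in term.iter("SYNONYM"):
            # Assumption: matching ignores case and surrounding whitespace.
            index[synonym.get("term").strip().lower()] = resolved
    return index


def clean_value(raw_value, index):
    """Return the resolved term for a raw value, or None if no synonym matches."""
    return index.get(raw_value.strip().lower())


index = build_synonym_index(VOCABULARY_XML)
resolved_term = clean_value("Comentario", index)
unresolved = clean_value("Unknown type", index)
```

In this sketch a value without a matching synonym resolves to `None`; how unmatched values are actually handled in the cleaning step is not stated in the text above.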

@ -4,8 +4,7 @@ sidebar_position: 3
# Extraction of acknowledged concepts
***Short description:*** Scans the plaintexts of publications for acknowledged concepts, including grant identifiers (projects) of funders, accession numbers of bioentities, EPO patent mentions, as well as custom concepts that can link research objects to specific research communities and initiatives in OpenAIRE.
***Algorithmic details:***
The algorithm processes the publication's fulltext and extracts references to acknowledged concepts. It applies pattern matching and string join between the fulltext and a target database which contains the title, the acronym and the identifier of the searched concept.
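
As a rough illustration of matching a fulltext against such a target database, here is a minimal sketch. The records in `CONCEPTS`, the function name `find_acknowledged_concepts`, and the rule that an identifier only counts when the acronym or title also appears are all assumptions made for the example, not the rules used by the actual algorithm.

```python
import re

# Hypothetical target database: each concept has a title, an acronym and an identifier.
CONCEPTS = [
    {"title": "Example Project One", "acronym": "EXPONE", "identifier": "123456"},
    {"title": "Another Sample Initiative", "acronym": "ASI", "identifier": "654321"},
]


def find_acknowledged_concepts(fulltext, concepts):
    """Return identifiers of concepts that appear to be acknowledged in the text."""
    lowered = fulltext.lower()
    matches = []
    for concept in concepts:
        # Whole-word match, so that an identifier embedded in a longer number is ignored.
        id_pattern = re.compile(r"\b" + re.escape(concept["identifier"]) + r"\b")
        acronym_pattern = re.compile(r"\b" + re.escape(concept["acronym"]) + r"\b")
        title_found = concept["title"].lower() in lowered
        # Assumption: require the identifier plus supporting context (acronym or title).
        if id_pattern.search(fulltext) and (acronym_pattern.search(fulltext) or title_found):
            matches.append(concept["identifier"])
    return matches


text = "This work was funded by the EXPONE project (grant agreement no. 123456)."
found = find_acknowledged_concepts(text, CONCEPTS)
```

Requiring supporting context alongside the identifier is one way to reduce false positives from bare numbers; a real pipeline would likely use funder-specific patterns.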

@ -4,8 +4,7 @@ sidebar_position: 1
# Affiliation matching
***Short description:*** The goal of the affiliation matching module is to match affiliations extracted from the PDF and XML documents with organizations from the OpenAIRE organization database.
***Algorithmic details:***

@ -1,7 +1,6 @@
# Citation matching
***Short description:*** During a citation matching task, bibliographic entries are linked to the documents that they reference. The citation matching module, one of the modules of the Information Inference Service (IIS), receives as input a list of documents accompanied by their metadata and bibliography. Among them, it discovers the links described above and returns them as a list. In this document we evaluate whether the module has been properly integrated with the whole system and assess the accuracy of the algorithm used. The implemented algorithm is described in detail in arXiv:1303.6906 [cs.IR]. In the referenced paper the algorithm was tested on small datasets, whereas here we focus on larger datasets, which the system is expected to analyse in the production environment.
***Algorithmic details:***

@ -4,8 +4,7 @@ sidebar_position: 4
# Extraction of cited concepts
***Short description:*** Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs.
***Algorithmic details:***
The algorithm extracts citations to specific datasets and software. It extracts the citation section of a publication's fulltext and applies string matching against a target database which includes an inverted index with dataset/software titles, URLs and other metadata.
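
The idea of string matching through an inverted index can be sketched as follows. This is only an illustration under stated assumptions: the records in `TARGETS` are invented, the helper names are hypothetical, and the matching rule (the full normalised title must occur in the citation) is a simplification chosen for the example.

```python
import re
from collections import defaultdict

# Hypothetical target records: identifier -> dataset/software title.
TARGETS = {
    "ds-1": "Global Ocean Temperature Dataset",
    "sw-1": "FastAlign Sequence Toolkit",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(targets):
    """Map each title token to the set of target ids whose title contains it."""
    index = defaultdict(set)
    for target_id, title in targets.items():
        for token in tokenize(title):
            index[token].add(target_id)
    return index


def match_citation(citation, targets, index):
    """Return ids of targets whose normalised title occurs in the citation string."""
    tokens = tokenize(citation)
    # The inverted index narrows the search to targets sharing at least one token.
    candidates = set()
    for token in tokens:
        candidates |= index.get(token, set())
    normalised = " ".join(tokens)
    return sorted(
        target_id
        for target_id in candidates
        if " ".join(tokenize(targets[target_id])) in normalised
    )


INDEX = build_inverted_index(TARGETS)
hits = match_citation("Smith J. (2020). Global Ocean Temperature Dataset, v2.", TARGETS, INDEX)
```

The index only selects candidates; the final check against the full title is what decides a match in this sketch.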

@ -1,7 +1,6 @@
# Documents similarity
***Short description:*** The document similarity module is responsible for finding similar documents among the ones available in the OpenAIRE Information Space. It produces "similarity" links between the documents stored in the OpenAIRE Information Space. Each link is assigned a similarity score in the [0,1] range; the higher the score, the more similar the documents are expected to be with respect to their content.
***Algorithmic details:***
The similarity between two documents is expressed as the similarity between the weights of their common terms (i.e., words reduced to their root form) within the context of all terms from the first and the second document. In this approach, the computation can be divided into three consecutive steps:

@ -1,7 +1,6 @@
# Metadata extraction
***Short description:*** The metadata extraction algorithm is responsible for extracting plaintext and metadata from PDF documents. It is based on the [CERMINE](http://cermine.ceon.pl/about.html) project.
CERMINE is a comprehensive open source system for extracting metadata and content from scientific articles in born-digital form. The system is able to process documents in PDF format and extracts:

@ -1,41 +1,8 @@
# Finalisation
At the very end of the graph production workflow, a dedicated step performs the finalisation operations described on this page, which aim to improve the overall quality of the data.
The output of this final step is the final version of the OpenAIRE Graph.
## Vocabulary based cleaning
The aggregation processes run continuously and independently of one another. Each aggregation process, depending on the characteristics of the records exposed by the data source, makes use of one or more vocabularies to harmonise the values available in a given field.
A vocabulary is a data structure that defines a list of terms, and for each term defines a list of synonyms:
```xml
<TERMS>
<TERM native_name="Annotation" code="0018" english_name="Annotation" encoding="OPENAIRE">
<SYNONYMS>
<SYNONYM term="Comentario" encoding="CSIC"/>
<SYNONYM term="Comment/debate" encoding="Aaltodoc Publication Archive"/>
<SYNONYM term="annotation" encoding="OPENAIRE-PR202112"/>
[...]
</SYNONYMS>
<RELATIONS/>
</TERM>
<TERM native_name="Article" code="0001" english_name="Article" encoding="OPENAIRE">
<SYNONYMS>
<SYNONYM term="A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä" encoding="Aaltodoc Publication Archive"/>
<SYNONYM term="A4 Artikkeli konferenssijulkaisussa" encoding="Aaltodoc Publication Archive"/>
<SYNONYM term="Article" encoding="OTHER"/>
<SYNONYM term="Article (author)" encoding="OTHER"/>
[...]
```
Each vocabulary is typically used to control and harmonise the values available in a specific field characterising the bibliographic records. The example above provides a preview of the vocabulary used to clean the [result's instance typology](/data-model/entities/result#instance).
The content of the vocabularies can be accessed on [api.openaire.eu/vocabularies](https://api.openaire.eu/vocabularies/).
Given a value provided in the original records, the cleaning process looks for a matching synonym and, when one is found, resolves the corresponding term, which is then used to build the cleaned record.
Each aggregation process applies the vocabularies as they are defined at the time it runs. A vocabulary may, however, change after the aggregation of a data source has finished, in which case the aggregated content no longer reflects the current status of the controlled vocabularies.
In addition, the integration of ScholeXplorer and DOIBoost, and some enrichment processes applied on the raw and on the de-duplicated graph, may introduce values that do not comply with the current status of the OpenAIRE controlled vocabularies. For these reasons, we included a final cleansing step at the end of the workflow materialisation.
## Filtering

@ -1 +1,3 @@
# Merge by id
<span className="todo">TODO</span>

@ -125,7 +125,7 @@ const sidebars = {
},
{
type: 'category',
label: "Deduction & propagation",
link: {
type: 'generated-index',
description: 'The OpenAIRE Graph is further enriched by the deduction and propagation processes described in this section.'