diff --git a/docs/data-provision/cleaning.md b/docs/data-provision/cleaning.md
index 81b62dd..e920026 100644
--- a/docs/data-provision/cleaning.md
+++ b/docs/data-provision/cleaning.md
@@ -1 +1,37 @@
-# Cleaning
\ No newline at end of file
+# Cleaning
+
+
+
+
+The aggregation processes run continuously and independently of one another. Each aggregation process, depending on the characteristics of the records exposed by the data source, makes use of one or more vocabularies to harmonise the values available in a given field.
+On this page, we describe the *vocabulary-based cleaning* operation performed to harmonise the data of the different data sources.
+A vocabulary is a data structure that defines a list of terms, and for each term defines a list of synonyms:
+
+```xml
+
+
+
+
+
+
+ [...]
+
+
+
+
+
+
+
+
+
+ [...]
+```
+
+Each vocabulary is typically used to control and harmonise the values available in a specific field characterising the bibliographic records. The example above provides a preview of the vocabulary used to clean the [result's instance typology](/data-model/entities/result#instance).
+
+The content of the vocabularies can be accessed on [api.openaire.eu/vocabularies](https://api.openaire.eu/vocabularies/).
+
+Given a value provided in the original records, the cleaning process looks for a synonym and, when found, resolves the corresponding term, which is in turn used to build the cleaned record.
+Each aggregation process applies vocabularies according to their definitions at a given moment in time. However, a vocabulary may change after the aggregation of one data source has finished, so the aggregated content may not reflect the current status of the controlled vocabularies.
+
+In addition, the integration of ScholeXplorer and DOIBoost and some enrichment processes applied on the raw and on the de-duplicated graph may introduce values that do not comply with the current status of the OpenAIRE controlled vocabularies. For these reasons, we included a final step of cleansing at the end of the workflow materialisation.
\ No newline at end of file
diff --git a/docs/data-provision/enrichment-by-mining/acks.md b/docs/data-provision/enrichment-by-mining/acks.md
index 903e0b4..eed8cb1 100644
--- a/docs/data-provision/enrichment-by-mining/acks.md
+++ b/docs/data-provision/enrichment-by-mining/acks.md
@@ -4,8 +4,7 @@ sidebar_position: 3
 
 # Extraction of acknowledged concepts
 
-***Short description:***
-Scans the plaintexts of publications for acknowledged concepts, including grant identifiers (projects) of funders, accession numbers of bioetities, EPO patent mentions, as well as custom concepts that can link research objects to specific research communities and initiatives in OpenAIRE.
+***Short description:*** Scans the plaintexts of publications for acknowledged concepts, including grant identifiers (projects) of funders, accession numbers of bioentities, EPO patent mentions, as well as custom concepts that can link research objects to specific research communities and initiatives in OpenAIRE.
 
 ***Algorithmic details:***
 The algorithm processes the publication's fulltext and extracts references to acknowledged concepts. It applies pattern matching and string join between the fulltext and a target database which contains the title, the acronym and the identifier of the searched concept.
diff --git a/docs/data-provision/enrichment-by-mining/affiliation_matching.md b/docs/data-provision/enrichment-by-mining/affiliation_matching.md
index fb2ce11..539e51b 100644
--- a/docs/data-provision/enrichment-by-mining/affiliation_matching.md
+++ b/docs/data-provision/enrichment-by-mining/affiliation_matching.md
@@ -4,8 +4,7 @@ sidebar_position: 1
 
 # Affiliation matching
 
-***Short description:***
-The goal of the affiliation matching module is to match affiliations extracted from the pdf and xml documents with organizations from the OpenAIRE organization database.
+***Short description:*** The goal of the affiliation matching module is to match affiliations extracted from the pdf and xml documents with organizations from the OpenAIRE organization database.
 
 ***Algorithmic details:***
 
diff --git a/docs/data-provision/enrichment-by-mining/citation_matching.md b/docs/data-provision/enrichment-by-mining/citation_matching.md
index 7cf56db..01fcf37 100644
--- a/docs/data-provision/enrichment-by-mining/citation_matching.md
+++ b/docs/data-provision/enrichment-by-mining/citation_matching.md
@@ -1,7 +1,6 @@
 # Citation matching
 
-***Short description:***
-During a citation matching task, bibliographic entries are linked to the documents that they reference. The citation matching module - one of the modules of the Information Inference Service (IIS) - receives as an input a list of documents accompanied by their metadata and bibliography. Among them, it discovers links described above and returns them as a list. In this document we shall evaluate if the module has been properly integrated with the whole
+***Short description:*** During a citation matching task, bibliographic entries are linked to the documents that they reference. The citation matching module - one of the modules of the Information Inference Service (IIS) - receives as an input a list of documents accompanied by their metadata and bibliography. Among them, it discovers links described above and returns them as a list. In this document we shall evaluate if the module has been properly integrated with the whole
 system and assess the accuracy of the algorithm used. It is worth mentioning that the implemented algorithm has been described in detail in arXiv:1303.6906 [cs.IR]1. However, in the referenced paper the algorithm was tested on small datasets, but here we will focus on larger datasets, which are expected to be analysed by the system in the production environment.
 
 ***Algorithmic details:***
diff --git a/docs/data-provision/enrichment-by-mining/cites.md b/docs/data-provision/enrichment-by-mining/cites.md
index 9a45946..f7d8158 100644
--- a/docs/data-provision/enrichment-by-mining/cites.md
+++ b/docs/data-provision/enrichment-by-mining/cites.md
@@ -4,8 +4,7 @@ sidebar_position: 4
 
 # Extraction of cited concepts
 
-***Short description:***
-Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs.
+***Short description:*** Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs.
 
 ***Algorithmic details:***
 The algorithm extracts citations to specific datasets and software. It extracts the citation section of a publication's fulltext and applies string matching against a target database which includes an inverted index with dataset/software titles, urls and other metadata.
diff --git a/docs/data-provision/enrichment-by-mining/documents_similarity.md b/docs/data-provision/enrichment-by-mining/documents_similarity.md
index c67700c..1e02b95 100644
--- a/docs/data-provision/enrichment-by-mining/documents_similarity.md
+++ b/docs/data-provision/enrichment-by-mining/documents_similarity.md
@@ -1,7 +1,6 @@
 # Documents similarity
 
-***Short description:***
-Document similarity module is responsible for finding similar documents among the ones available in the OpenAIRE Information Space. It produces "similarity" links between the documents stored in the OpenAIRE Information Space. Each link has a similarity score from [0,1] range assigned; it is expected that the higher the score, the more similar are the documents with respect to their content.
+***Short description:*** The document similarity module is responsible for finding similar documents among the ones available in the OpenAIRE Information Space. It produces "similarity" links between the documents stored in the OpenAIRE Information Space. Each link is assigned a similarity score in the [0,1] range; the higher the score, the more similar the documents are expected to be with respect to their content.
 
 ***Algorithmic details:***
 The similarity between two documents is expressed as the similarity between weights of their common terms (i.e., words being reduced to their root form) within a context of all terms from the first and the second document. In this approach, the computation can be divided into three consecutive steps:
diff --git a/docs/data-provision/enrichment-by-mining/metadata_extraction.md b/docs/data-provision/enrichment-by-mining/metadata_extraction.md
index ef930bd..4ade667 100644
--- a/docs/data-provision/enrichment-by-mining/metadata_extraction.md
+++ b/docs/data-provision/enrichment-by-mining/metadata_extraction.md
@@ -1,7 +1,6 @@
 # Metadata extraction
 
-***Short description:***
-Metadata Extraction algorithm is responsible for plaintext and metadata extraction out of the PDF documents. It based on [CERMINE](http://cermine.ceon.pl/about.html) project.
+***Short description:*** The Metadata Extraction algorithm is responsible for extracting plaintext and metadata from PDF documents. It is based on the [CERMINE](http://cermine.ceon.pl/about.html) project.
 
 CERMINE is a comprehensive open source system for extracting metadata and content from scientific articles in born-digital form. The system is able to process documents in PDF format and extracts:
 
diff --git a/docs/data-provision/finalisation.md b/docs/data-provision/finalisation.md
index b08dff7..22e5ca9 100644
--- a/docs/data-provision/finalisation.md
+++ b/docs/data-provision/finalisation.md
@@ -1,41 +1,8 @@
 # Finalisation
 
-At the very end of the processing pipeline, a step is dedicated to perform cleaning operations aimed at improving the overall quality of the data.
-The output of this final cleansing step is the final version of the OpenAIRE Graph.
-
-## Vocabulary based cleaning
-
-The aggregation processes run independently one from another and continuously. Each aggregation process, depending on the characteristics of the records exposed by the data source, makes use of one or more vocabularies to harmonise the values available in a given field.
-A vocabulary is a data structure that defines a list of terms, and for each term defines a list of synonyms:
-
-```xml
-
-
-
-
-
-
- [...]
-
-
-
-
-
-
-
-
-
- [...]
-```
-
-Each vocabulary is typically used to control and harmonise the values available in a specific field characterising the bibliographic records. The example above provides a preview of the vocabulary used to clean the [result's instance typology](/data-model/entities/result#instance).
-
-The content of the vocabularies can be accessed on [api.openaire.eu/vocabularies](https://api.openaire.eu/vocabularies/).
-
-Given a value provided in the original records, the cleaning process looks for a synonym and, when found, resolves the corresponding term which is used in turn to build the cleaned record.
-Each aggregation process applies vocabularies according to their definitions in a given moment of time, however, it could be the case that a vocabulary changes after the aggregation of one data source has finished, thus the aggregated content does not reflect the current status of the controlled vocabularies.
-
-In addition, the integration of ScholeXplorer and DOIBoost and some enrichment processes applied on the raw and on the de-duplicated graph may introduce values that do not comply with the current status of the OpenAIRE controlled vocabularies. For these reasons, we included a final step of cleansing at the end of the workflow materialisation.
+At the very end of the graph production workflow, a step is dedicated to performing certain finalisation operations, described on this page,
+aimed at improving the overall quality of the data.
+The output of this final step is the final version of the OpenAIRE Graph.
 
 ## Filtering
 
diff --git a/docs/data-provision/merge-by-id.md b/docs/data-provision/merge-by-id.md
index 1b6bfa5..fea9776 100644
--- a/docs/data-provision/merge-by-id.md
+++ b/docs/data-provision/merge-by-id.md
@@ -1 +1,3 @@
-# Merge by id
\ No newline at end of file
+# Merge by id
+
+TODO
\ No newline at end of file
diff --git a/sidebars.js b/sidebars.js
index 6a7d61c..1913193 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -125,7 +125,7 @@ const sidebars = {
     },
     {
       type: 'category',
-      label: "Deduplication & propagation",
+      label: "Deduction & propagation",
       link: {
         type: 'generated-index' ,
         description: 'The OpenAIRE Graph is further enriched by the deduction and propagation processes descibed in this section.'