merging with main

This commit is contained in:
Miriam Baglioni 2022-11-23 14:01:30 +01:00
commit b14f89a845
33 changed files with 942 additions and 584 deletions

View File

@ -1,6 +1,6 @@
# Data model # Data model
The OpenAIRE Research Graph comprises several types of entities and [relationships](./relationships) among them. The OpenAIRE Graph comprises several types of [entities](../category/entities) and [relationships](./relationships) among them.
The latest version of the JSON schema can be found on [Bulk downloads](../download). The latest version of the JSON schema can be found on [Bulk downloads](../download).
@ -20,6 +20,6 @@ responsible for operating data sources or consisting the affiliations of Product
:::note Further reading :::note Further reading
A detailed report on the OpenAIRE Research Graph Data Model can be found on [Zenodo](https://zenodo.org/record/2643199). A detailed report on the OpenAIRE Graph Data Model can be found on [Zenodo](https://zenodo.org/record/2643199).
::: :::

View File

@ -3,6 +3,6 @@
"position": 1, "position": 1,
"link": { "link": {
"type": "generated-index", "type": "generated-index",
"description": "The main entities of the OpenAIRE Research Graph are listed below." "description": "The main entities of the OpenAIRE Graph are listed below."
} }
} }

View File

@ -37,7 +37,7 @@ _Type: String • Cardinality: ONE_
Description of the research community/research infrastructure Description of the research community/research infrastructure
```json ```json
"description": "This portal provides access to publications, research data, projects and software that may be relevant to the Corona Virus Disease (COVID-19). The OpenAIRE COVID-19 Gateway aggregates COVID-19 related records, links them and provides a single access point for discovery and navigation. We tag content from the OpenAIRE Research Graph (10,000+ data sources) and additional sources. All COVID-19 related research results are linked to people, organizations and projects, providing a contextualized navigation." "description": "This portal provides access to publications, research data, projects and software that may be relevant to the Corona Virus Disease (COVID-19). The OpenAIRE COVID-19 Gateway aggregates COVID-19 related records, links them and provides a single access point for discovery and navigation. We tag content from the OpenAIRE Graph (10,000+ data sources) and additional sources. All COVID-19 related research results are linked to people, organizations and projects, providing a contextualized navigation."
``` ```
### name ### name

View File

@ -646,7 +646,12 @@ A measure computed for this instance (e.g. those provided by [BIP! Finder](https
### key ### key
_Type: String • Cardinality: ONE_ _Type: String • Cardinality: ONE_
The specified measure. Currently supported one of: `{ influence, influence_alt, popularity, popularity_alt, impulse, cc }` (see [the dedicated page](../../data-provision/enrichment/impact-scores) for more details). The specified measure. Currently supported one of:
* `influence` (see [PageRank](/data-provision/enrichment/impact-scores#pagerank-pr))
* `influence_alt` (see [Citation Count](/data-provision/enrichment/impact-scores#citation-count-cc))
* `popularity` (see [AttRank](/data-provision/enrichment/impact-scores#attrank))
* `popularity_alt` (see [RAM](/data-provision/enrichment/impact-scores#ram))
* `impulse` (see ["Incubation" Citation Count](/data-provision/enrichment/impact-scores#incubation-citation-count-icc))
```json ```json
"key": "influence" "key": "influence"

View File

@ -311,7 +311,7 @@ _Type: [Subject](other#subject) • Cardinality: MANY_
Subject, keyword, classification code, or key phrase describing the resource. Subject, keyword, classification code, or key phrase describing the resource.
```json ```json
"subjecsts": [ "subjects": [
{ {
"provenance": { "provenance": {
"provenance": "Harvested", "provenance": "Harvested",

View File

@ -1,6 +1,6 @@
# PIDs and identifiers # PIDs and identifiers
One of the challenges towards the stability of the contents in the OpenAIRE Research Graph consists of making its identifiers and records stable over time. One of the challenges towards the stability of the contents in the OpenAIRE Graph consists of making its identifiers and records stable over time.
The barriers to this scenario are many, as the Graph keeps a map of data sources that is subject to constant variations: records in repositories vary in content, The barriers to this scenario are many, as the Graph keeps a map of data sources that is subject to constant variations: records in repositories vary in content,
original IDs, and PIDs, may disappear or reappear, and the same holds for the repository or the metadata collection it exposes. original IDs, and PIDs, may disappear or reappear, and the same holds for the repository or the metadata collection it exposes.
Not only, but the mappings applied to the original contents may also change and improve over time to catch up with the changes in the input records. Not only, but the mappings applied to the original contents may also change and improve over time to catch up with the changes in the input records.

View File

@ -4,14 +4,14 @@ sidebar_position: 1
# Aggregation # Aggregation
OpenAIRE materializes an open, participatory research graph (the OpenAIRE Research graph) where products of the research life-cycle (e.g. scientific literature, research data, project, software) are semantically linked to each other and carry information about their access rights (i.e. if they are Open Access, Restricted, Embargoed, or Closed) and the sources from which they have been collected and where they are hosted. The OpenAIRE research graph is materialised via a set of autonomic, orchestrated workflows operating in a regimen of continuous data aggregation and integration. [1] OpenAIRE materializes an open, participatory research graph (the OpenAIRE Graph) where products of the research life-cycle (e.g. scientific literature, research data, project, software) are semantically linked to each other and carry information about their access rights (i.e. if they are Open Access, Restricted, Embargoed, or Closed) and the sources from which they have been collected and where they are hosted. The OpenAIRE Graph is materialised via a set of autonomic, orchestrated workflows operating in a regimen of continuous data aggregation and integration. [1]
## What does OpenAIRE collect? ## What does OpenAIRE collect?
OpenAIRE aggregates metadata records describing objects of the research life-cycle from content providers compliant to the [OpenAIRE guidelines](https://guidelines.openaire.eu/) and from entity registries (i.e. data sources offering authoritative lists of entities, like [OpenDOAR](https://v2.sherpa.ac.uk/opendoar/), [re3data](https://www.re3data.org/), [DOAJ](https://doaj.org/), and various funder databases). After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is used to generate the final OpenAIRE Research Graph, accessible from the [OpenAIRE EXPLORE portal](https://explore.openaire.eu) and the [APIs](https://graph.openaire.eu/develop/). OpenAIRE aggregates metadata records describing objects of the research life-cycle from content providers compliant to the [OpenAIRE guidelines](https://guidelines.openaire.eu/) and from entity registries (i.e. data sources offering authoritative lists of entities, like [OpenDOAR](https://v2.sherpa.ac.uk/opendoar/), [re3data](https://www.re3data.org/), [DOAJ](https://doaj.org/), and various funder databases). After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is used to generate the final OpenAIRE Graph, accessible from the [OpenAIRE EXPLORE portal](https://explore.openaire.eu) and the [APIs](https://graph.openaire.eu/develop/).
The transformation process includes the application of cleaning functions whose goal is to ensure that values are harmonised according to a common format (e.g. dates as YYYY-MM-dd) and, whenever applicable, to a common controlled vocabulary. The controlled vocabularies used for cleansing are accessible at [api.openaire.eu/vocabularies](https://api.openaire.eu/vocabularies/). Each vocabulary features a set of controlled terms, each with one code, one label, and a set of synonyms. If a synonym is found as field value, the value is updated with the corresponding term. The transformation process includes the application of cleaning functions whose goal is to ensure that values are harmonised according to a common format (e.g. dates as YYYY-MM-dd) and, whenever applicable, to a common controlled vocabulary. The controlled vocabularies used for cleansing are accessible at [api.openaire.eu/vocabularies](https://api.openaire.eu/vocabularies/). Each vocabulary features a set of controlled terms, each with one code, one label, and a set of synonyms. If a synonym is found as field value, the value is updated with the corresponding term.
Also, the OpenAIRE Research Graph is extended with other relevant scholarly communication sources that do not follow the OpenAIRE Guidelines and/or are too large to be integrated via the “normal” aggregation mechanism: DOIBoost (which merges Crossref, ORCID, Microsoft Academic Graph, and Unpaywall). Also, the OpenAIRE Graph is extended with other relevant scholarly communication sources that do not follow the OpenAIRE Guidelines and/or are too large to be integrated via the “normal” aggregation mechanism: DOIBoost (which merges Crossref, ORCID, Microsoft Academic Graph, and Unpaywall).
<p align="center"> <p align="center">
<img loading="lazy" alt="Aggregation" src="/img/docs/aggregation.png" width="65%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/> <img loading="lazy" alt="Aggregation" src="/img/docs/aggregation.png" width="65%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
@ -30,7 +30,7 @@ Relationships between objects are collected from the data sources, but also auto
## What kind of data sources are in OpenAIRE? ## What kind of data sources are in OpenAIRE?
Objects and relationships in the OpenAIRE Research Graph are extracted from information packages, i.e. metadata records, collected from data sources of the following kinds: Objects and relationships in the OpenAIRE Graph are extracted from information packages, i.e. metadata records, collected from data sources of the following kinds:
- *Institutional or thematic repositories*: Information systems where scientists upload the bibliographic metadata and full-texts of their articles, due to obligations from their organization or due to community practices (e.g. ArXiv, Europe PMC); - *Institutional or thematic repositories*: Information systems where scientists upload the bibliographic metadata and full-texts of their articles, due to obligations from their organization or due to community practices (e.g. ArXiv, Europe PMC);
- *Open Access Publishers and journals*: Information system of open access publishers or relative journals, which offer bibliographic metadata and PDFs of their published articles; - *Open Access Publishers and journals*: Information system of open access publishers or relative journals, which offer bibliographic metadata and PDFs of their published articles;

View File

@ -68,7 +68,7 @@ Records in Crossref are ruled out according to the following criteria
Records with `type=dataset` are mapped into OpenAIRE results of type dataset. All others are mapped as OpenAIRE results of type publication. Records with `type=dataset` are mapped into OpenAIRE results of type dataset. All others are mapped as OpenAIRE results of type publication.
### Mapping Crossref properties into the OpenAIRE Research Graph ### Mapping Crossref properties into the OpenAIRE Graph
Properties in OpenAIRE results are set based on the logic described in the following table: Properties in OpenAIRE results are set based on the logic described in the following table:
@ -131,7 +131,7 @@ Possible improvements:
* Verify if Crossref has a property for `language`, `country`, `container.issnLinking`, `container.iss`, `container.edition`, `container.conferenceplace` and `container.conferencedate` * Verify if Crossref has a property for `language`, `country`, `container.issnLinking`, `container.iss`, `container.edition`, `container.conferenceplace` and `container.conferencedate`
* Different approach to set the `refereed` field and improve its coverage? * Different approach to set the `refereed` field and improve its coverage?
h3. 2 Map Crossref links to projects/funders ### Map Crossref links to projects/funders
Links to funding available in Crossref are mapped as funding relationships (`result -- isProducedBy -- project`) applying the following mapping: Links to funding available in Crossref are mapped as funding relationships (`result -- isProducedBy -- project`) applying the following mapping:
@ -222,7 +222,7 @@ Miriam will modify the process to ensure that:
* Only papers with DOI are considered * Only papers with DOI are considered
* Since for the same DOI we have multiple version of item with different MAG PaperId, we only take one per DOI (the last one we process). We call this dataset `Papers_distinct` * Since for the same DOI we have multiple version of item with different MAG PaperId, we only take one per DOI (the last one we process). We call this dataset `Papers_distinct`
When mapping MAG records to the OpenAIRE Research Graph, we consider the following MAG tables: When mapping MAG records to the OpenAIRE Graph, we consider the following MAG tables:
* `PaperAbstractsInvertedIndex`: for the paper abstracts * `PaperAbstractsInvertedIndex`: for the paper abstracts
* `Authors`: for the authors. The MAG data is pre-processed by grouping authors by PaperId * `Authors`: for the authors. The MAG data is pre-processed by grouping authors by PaperId
* `Affiliations` and `PaperAuthorAffiliations`: to generate links between publications and organisations * `Affiliations` and `PaperAuthorAffiliations`: to generate links between publications and organisations

View File

@ -14,7 +14,7 @@ The data curation activity is twofold, on one end pivots around the disambiguati
Duplicates among organizations are therefore managed through three different stages: Duplicates among organizations are therefore managed through three different stages:
* *Creation of Suggestions*: executes an automatic workflow that performs the deduplication and prepare new suggestions for the curators to be processed; * *Creation of Suggestions*: executes an automatic workflow that performs the deduplication and prepare new suggestions for the curators to be processed;
* *Curation*: manual editing of the organization records performed by the data curators; * *Curation*: manual editing of the organization records performed by the data curators;
* *Creation of Representative Organizations*: executes an automatic workflow that creates curated organizations and exposes them on the OpenAIRE Research Graph by using the curators' feedback from the OpenOrgs underlying database. * *Creation of Representative Organizations*: executes an automatic workflow that creates curated organizations and exposes them on the OpenAIRE Graph by using the curators' feedback from the OpenOrgs underlying database.
The next sections describe the above mentioned stages. The next sections describe the above mentioned stages.
@ -46,6 +46,8 @@ The comparison goes through the following decision tree:
<img loading="lazy" alt="Organization Decision Tree" src="/img/docs/decisiontree-organization.png" width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/> <img loading="lazy" alt="Organization Decision Tree" src="/img/docs/decisiontree-organization.png" width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p> </p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/1YKInGGtHu09QG4pT2gRLEum4LxU82d4nKkvGNvRQmrg/edit?usp=sharing)
### Data Curation ### Data Curation
All the similarity relations drawn by the algorithm involving the decision tree are exposed in OpenOrgs, where are made available to the data curators to give feedbacks and to improve the organizations metadata. All the similarity relations drawn by the algorithm involving the decision tree are exposed in OpenOrgs, where are made available to the data curators to give feedbacks and to improve the organizations metadata.
@ -59,7 +61,7 @@ Note that if a curator does not provide a feedback on a similarity relation sugg
### Creation of Representative Organizations ### Creation of Representative Organizations
This stage executes an automatic workflow that faces the *duplicates grouping* stage to create representative organizations and to update them on the OpenAIRE Research Graph. Such organizations are obtained via transitive closure and the relations used comes from the curators' feedback gathered on the OpenOrgs underlying Database. This stage executes an automatic workflow that faces the *duplicates grouping* stage to create representative organizations and to update them on the OpenAIRE Graph. Such organizations are obtained via transitive closure and the relations used comes from the curators' feedback gathered on the OpenOrgs underlying Database.
#### Duplicates grouping (transitive closure) #### Duplicates grouping (transitive closure)

View File

@ -37,6 +37,8 @@ The comparison goes through different stages:
<img loading="lazy" alt="Publications Decision Tree" src="/img/docs/decisiontree-publication.png" width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/> <img loading="lazy" alt="Publications Decision Tree" src="/img/docs/decisiontree-publication.png" width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p> </p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/19SIilTp1vukw6STMZuPMdc0pv0ODYCiOxP7OU3iPWK8/edit?usp=sharing)
#### Software #### Software
For each pair of software in a cluster the following strategy (depicted in the figure below) is applied. For each pair of software in a cluster the following strategy (depicted in the figure below) is applied.
The comparison goes through different stages: The comparison goes through different stages:
@ -48,6 +50,8 @@ The comparison goes through different stages:
<img loading="lazy" alt="Software Decision Tree" src="/img/docs/decisiontree-software.png" width="85%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/> <img loading="lazy" alt="Software Decision Tree" src="/img/docs/decisiontree-software.png" width="85%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p> </p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/19gd1-GTOEEo6awMObGRkYFhpAlO_38mfbDFFX0HAkuo/edit?usp=sharing)
#### Datasets and Other types of research products #### Datasets and Other types of research products
For each pair of datasets or other types of research products in a cluster the strategy depicted in the figure below is applied. For each pair of datasets or other types of research products in a cluster the strategy depicted in the figure below is applied.
The decision tree is almost identical to the publication decision tree, with the only exception of the *instance type check* stage. Since such type of record does not have a relatable instance type, the check is not performed and the decision tree node is skipped. The decision tree is almost identical to the publication decision tree, with the only exception of the *instance type check* stage. Since such type of record does not have a relatable instance type, the check is not performed and the decision tree node is skipped.
@ -56,6 +60,8 @@ The decision tree is almost identical to the publication decision tree, with the
<img loading="lazy" alt="Dataset and Other types of research products Decision Tree" src="/img/docs/decisiontree-dataset-orp.png" width="90%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/> <img loading="lazy" alt="Dataset and Other types of research products Decision Tree" src="/img/docs/decisiontree-dataset-orp.png" width="90%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p> </p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/1uBa7Bw2KwBRDUYIfyRr_Keol7UOeyvMNN7MPXYLg4qw/edit?usp=sharing)
### Duplicates grouping (transitive closure) ### Duplicates grouping (transitive closure)
The general concept is that the field coming from the record with higher "trust" value is used as reference for the field of the representative record. The general concept is that the field coming from the record with higher "trust" value is used as reference for the field of the representative record.

View File

@ -0,0 +1,31 @@
---
sidebar_position: 3
---
# Extraction of acknowledged concepts
***Short description:***
Scans the plaintexts of publications for acknowledged concepts, including grant identifiers (projects) of funders, accession numbers of bioetities, EPO patent mentions, as well as custom concepts that can link research objects to specific research communities and initiatives in OpenAIRE.
***Algorithmic details:***
The algorithm processes the publication's fulltext and extracts references to acknowledged concepts. It applies pattern matching and string join between the fulltext and a target database which contains the title, the acronym and the identifier of the searched concept.
***Parameters:***
Concept titles, acronyms, and identifiers, publication's identifiers and fulltexts
***Limitations:*** -
***Environment:***
Python, [madIS](https://github.com/madgik/madis), [APSW](https://github.com/rogerbinns/apsw)
***References:***
* Foufoulas, Y., Zacharia, E., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2022). DETEXA: Declarative Extensible Text Exploration and Analysis. In: , et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. [doi:10.1007/978-3-031-16802-4_9](https://doi.org/10.1007/978-3-031-16802-4_9)
***Authority:*** ATHENA RC &bull; ***License:*** CC-BY/CC-0 &bull; ***Code:*** [iis/referenceextraction](https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction)

View File

@ -0,0 +1,58 @@
---
sidebar_position: 1
---
# Affiliation matching
***Short description:***
The goal of the affiliation matching module is to match affiliations extracted from the pdf and xml documents with organizations from the OpenAIRE organization database.
***Algorithmic details:***
*The buckets concept*
In order to get the best possible results, the algorithm should compare every affiliation with every organization. However, this approach would be very inefficient and slow, because it would involve the processing of the cartesian product (all possible pairs) of millions of affiliations and thousands of organizations. To avoid this, IIS has introduced the concept of buckets. A bucket is a smaller group of affiliations and organizations that have been selected to be matched with one another. The matching algorithm compares only these affiliations and organizations that belong to the same bucket.
*Affiliation matching process*
Every affiliation in a given *bucket* is compared with every organization in the same bucket multiple times, each time by using a different algorithm (*voter*). Each *voter* is assigned a number (match strength) that describes the estimated correctness of the result of its comparison. All the affiliation-organization pairs that have been matched by at least one *voter*, will be assigned the match strength > 0 (the actual number depends on the voters, its calculation method will be shown later).
It is very important for the algorithm to group the affiliations and organizations properly i.e. the ones that have a chance to match should be in the same *bucket*. To guarantee this, the affiliation matching module allows to create different methods of dividing the affiliations and organizations into *buckets*, and to use all of these methods in a single matching process. The specific method of grouping the affiliations and organizations into *bucket* and then joining them into pairs is carried out by the service called *Joiner*.
Every *joiner* can be linked with many different *voters* that will tell if the affiliation-organization pairs joined match or not. By providing new *joiners* and *voters* one can extend the matching algorithm with countless new methods for matching affiliations with organizations, thus adjusting the algorithm to his or her needs.
All the affiliations and organizations are sequentially computed by all the *matchers*. In every *matcher* they are grouped by some *joiner* in pairs, and then these pairs are processed by all the *voters* in the *matcher*. Every affiliation-organization pair that has been matched at least once is assigned the match strength that depends on the match strengths of the *voters* that pointed the given pair is a match.
**NOTE:** There can be many organizations matched with a given affiliation, each of them matched with a different match strength. The user of the module can set a match strength threshold which will limit the results to only those matches that have the match strength greater than the specified threshold.
*Calculation of the match strength of the affiliation-organization pair matched by multiple matchers*
It often happens that the given affiliation-organization pair is returned as a match by more than one matcher, each time with a different match strength. In such a case **the match with the highest match strength will be selected**.
*Calculation of the match strength of the affiliation-organization pair within a single matcher*
Every voter has a match strength that is in the range (0, 1]. **The voter match strength says what the quotient of correct matches to all matches guessed by this voter is, and is based on real data and hundreds of matches prepared by hand.**
The match strength of the given affiliation-organization pair is based on the match strengths of all the voters in the matcher that have pointed that the pair is a match. It will always be less than or equal to 1 and greater than the match strength of each single voter that matched the given pair.
The total match strength is calculated in such a way that each consecutive voter reduces (by its match strength) the gap of uncertainty about the correctness of the given match.
***Parameters:***
* input
* input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. Document metadata is the source of affiliations.
* input_organizations: [Organization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/Organization.avdl) avro datastore location.
* input_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/DocumentToProject.avdl) avro datastore location with **imported** document-to-project relations. These relations (alongside with inferred document-project and project-organization relations) are used to generate document-organization pairs which are used as a hint for matching affiliations.
* input_inferred_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/referenceextraction/project/DocumentToProject.avdl) avro datastore location with **inferred** document-to-project relations.
* input_project_to_organization: [ProjectToOrganization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/ProjectToOrganization.avdl) avro datastore location. These relations (alongside with infered document-project and document-project relations) are used to generate document-organization pairs which are used as a hint for matching affiliations
* output
* [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations.
***Limitations:*** -
***Environment:***
Java, Spark
***References:*** -
***Authority:*** ICM &bull; ***License:*** AGPL-3.0 &bull; ***Code:*** [CoAnSys/affiliation-organization-matching](https://github.com/CeON/CoAnSys/tree/master/affiliation-organization-matching)

View File

@ -0,0 +1,42 @@
# Citation matching
***Short description:***
During a citation matching task, bibliographic entries are linked to the documents that they reference. The citation matching module - one of the modules of the Information Inference Service (IIS) - receives as an input a list of documents accompanied by their metadata and bibliography. Among them, it discovers links described above and returns them as a list. In this document we shall evaluate if the module has been properly integrated with the whole
system and assess the accuracy of the algorithm used. It is worth mentioning that the implemented algorithm has been described in detail in arXiv:1303.6906 [cs.IR]1. However, in the referenced paper the algorithm was tested on small datasets, but here we will focus on larger datasets, which are expected to be analysed by the system in the production environment.
***Algorithmic details:***
*General description*
The algorithm used in citation matching task consists of two phases. In the first one, for each citation string a set of potentially matching documents is retrieved using a heuristic. In the second one, the metadata of these documents is analysed in order to assess which of them is the most similar to given citation. We assume that citations are parsed, i.e. fragments containing meaningful pieces of metadata information are marked in a special way. Note that in the IIS system, the citation parsing step is executed by another module. The following metadata fields are used by the described solution:
* an author,
* a title,
* a journal name,
* pages,
* a year of publication.
*Heuristic matching*
The heuristic is based on indexing of document metadata by their author names. For each citation we extract author names and try to find documents in the index which have the same author entries. As spelling errors and inaccuracies commonly occur in citations, we have implemented approximate index which enables retrieval of entities with edit distance less than or equal 1.
*Strict matching*
In this step, all the potentially matching pairs obtained in the heuristic step are evaluated and only the most probable ones are returned as the final result. As citations tend to contain spelling errors and differ in style, there is a need to introduce fuzzy similarity measures fitted to the specifics of various metadata fields. Most of them compute a fraction of tokens or trigrams that occur in both fields being compared. When comparing journal
names, we have taken longest common subsequence (LCS) of two strings into consideration. This can be seen as an instance of the assignment problem with some refinements added. The overall similarity of two citation strings is obtained by applying a linear Support Vector Machine (SVM) using field similarities as features.
***Parameters:***
* input:
* input_metadata: [ExtractedDocumentMetadataMergedWithOriginal](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/transformers/metadatamerger/ExtractedDocumentMetadataMergedWithOriginal.avdl) avro datastore location with the metadata of both publications and bibliorgaphic references to be matched
* input_matched_citations: [Citation](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/common/citations/Citation.avdl) avro datastore location with citations which were already matched and should be excluded from fuzzy matching
* output: [Citation](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/common/citations/Citation.avdl) avro datastore location with matched publications
***Limitations:*** -
***Environment:***
Java, Spark
***References:*** -
***Authority:*** ICM &bull; ***License:*** AGPL-3.0 &bull; ***Code:*** [CoAnSys/citation-matching](https://github.com/CeON/CoAnSys/tree/master/citation-matching)

View File

@ -0,0 +1,24 @@
---
sidebar_position: 4
---
# Extraction of cited concepts
***Short description:***
Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs.
***Algorithmic details:***
The algorithm extracts citations to specific datasets and software. It extracts the citation section of a publication's fulltext and applies string matching against a target database which includes an inverted index with dataset/software titles, urls and other metadata.
***Parameters:***
Title, URL, creator names, publisher names and publication year for each concept to create the target database. Identifier and publication's fulltext to extract the cited concepts
***Limitations:*** -
***Environment:***
Python, [madIS](https://github.com/madgik/madis), [APSW](https://github.com/rogerbinns/apsw)
***References:***
* Foufoulas Y., Stamatogiannakis L., Dimitropoulos H., Ioannidis Y. (2017) “High-Pass Text Filtering for Citation Matching”. In: Kamps J., Tsakonas G., Manolopoulos Y., Iliadis L., Karydis I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham. [doi:10.1007/978-3-319-67008-9_28](https://doi.org/10.1007/978-3-319-67008-9_28)
***Authority:*** ATHENA RC &bull; ***License:*** CC-BY/CC-0 &bull; ***Code:*** [iis/referenceextraction](https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction)

View File

@ -0,0 +1,22 @@
---
sidebar_position: 5
---
# Classifiers
***Short description:*** A document classification algorithm that employs analysis of free text stemming from the abstracts of the publications. The purpose of applying a document classification module is to assign a scientific text to one or more predefined content classes.
***Algorithmic details:***
The algorithm classifies publication's fulltexts using a Bayesian classifier and weighted terms according to an offline training phase. The training has been done using the following taxonomies: arXiv, MeSH (Medical Subject Headings), ACM, and DDC (Dewey Decimal Classification, or Dewey Decimal System).
***Parameters:*** Publication's identifier and fulltext
***Limitations:*** -
***Environment:***
Python, [madIS](https://github.com/madgik/madis), [APSW](https://github.com/rogerbinns/apsw)
***References:***
* Giannakopoulos, T., Stamatogiannakis, E., Foufoulas, I., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2014). Content Visualization of Scientific Corpora Using an Extensible Relational Database Implementation. In: Bolikowski, Ł., Casarosa, V., Goodale, P., Houssos, N., Manghi, P., Schirrwagen, J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham. [doi:10.1007/978-3-319-08425-1_10](https://doi.org/10.1007/978-3-319-08425-1_10)
***Authority:*** ATHENA RC &bull; ***License:*** CC-BY/CC-0 &bull; ***Code:*** [iis/referenceextraction](https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction)

View File

@ -0,0 +1,49 @@
# Documents similarity
***Short description:***
Document similarity module is responsible for finding similar documents among the ones available in the OpenAIRE Information Space. It produces "similarity" links between the documents stored in the OpenAIRE Information Space. Each link has a similarity score from [0,1] range assigned; it is expected that the higher the score, the more similar are the documents with respect to their content.
***Algorithmic details:***
The similarity between two documents is expressed as the similarity between weights of their common terms (i.e., words being reduced to their root form) within a context of all terms from the first and the second document. In this approach, the computation can be divided into three consecutive steps:
1. selection of proper terms,
2. calculation of weights of terms for each document,
3. calculation of a given similarity function on weights of terms corresponding to each pair of documents.
 
The document similarity module uses the term frequency inverse-document frequency (TFIDF) measure and the cosine similarity to produce weights for terms and calculate their similarity respectively.
*Steps of execution*
Computation of similarity between documents is executed in the following steps.
1. First, we create a text representation of each document. The text is a concatenation of 3 attributes of document object coming from Information Space: title, abstract, and keywords.
2. Text representation of each document is split into words. Next, stop words or words which occur in more than the N percent of documents (say 99%) or these occurring in less than M documents (say 5) are discarded as we assume that they carry no important information.
3. Next, the words are stemmed (reduced to their root form) and thus converted to terms. The importance of each term in each document is calculated using TFIDF measure (resulting in a vector of weights of terms for each document). Only the top P (say 20) important terms per documents remain for the further computations.
4. In order to calculate the cosine similarity value for the documents, we execute the following steps.
a. Triples [document id, term, term weight] are grouped by a common term and for each pair of triples from the group, term importance is recalculated as the multiplication of terms weights, producing quads [document id 1, document id 2, term, multiplied term weight].
b. Quads are grouped by [document id 1, document id 2] and the values of the multiplied term weight are summed up, resulting in the creation of triples [document id 1, document id 2, total common weight].
c. Finally, triples are normalized using product of the norm of the term weights' vectors. The normalized value is the final similarity measure with value between 0 and 1.
5. For a given document, only the top R (say 20) links to similar documents are returned. The links that are thrown away are assumed to be uninteresting for the end-user and thus storing them would only needlessly take disk space.
***Parameters:***
* input:
* input_document: [DocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/documentssimilarity/DocumentMetadata.avdl) avro datastore location
* parallel: sets parameter parallel for Pig actions (default=80)
* mapredChildJavaOpts: mapreduce's map and reduce child java opts set to all PIG actions (default=Xmx12g)
* tfidfTopnTermPerDocument: number of the most important terms taken into account (default=20)
* similarityTopnDocumentPerDocument: maximum number of similar documents for each publication (default=20)
* removal_rate: removal rate (default=0.99)
* removal_least_used: removal of the least used terms (default=20)
* threshold_num_of_vector_elems_length: vector elements length threshold, when set to less than 2 all documents will be included in similarity matching (default=2)
* output: [DocumentSimilarity](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/documentssimilarity/DocumentSimilarity.avdl) avro datastore location
***Limitations:*** -
***Environment:***
Pig, Java
***References:***
* P. J. Dendek, A. Czeczko, M. Fedoryszak, A. Kawa, and L. Bolikowski, "Content Analysis of Scientific Articles in Apache Hadoop Ecosystem", Stud. Comp.Intelligence, vol. 541, 2014.
***Authority:*** ICM &bull; ***License:*** AGPL-3.0 &bull; ***Code:*** [CoAnSys/document-similarity](https://github.com/CeON/CoAnSys/tree/master/document-similarity)

View File

@ -2,7 +2,7 @@
## Mining ## Mining
The OpenAIRE Research Graph is enriched by links mined by OpenAIREs full-text mining algorithms that scan the plaintexts of publications for funding information, references to datasets, software URIs, accession numbers of bioetities, and EPO patent mentions. Custom mining modules also link research objects to specific research communities, initiatives and infrastructures. In addition, other inference modules provide content-based document classification, document similarity, citation matching, and author affiliation matching. The OpenAIRE Graph is enriched by links mined by OpenAIREs full-text mining algorithms that scan the plaintexts of publications for funding information, references to datasets, software URIs, accession numbers of bioetities, and EPO patent mentions. Custom mining modules also link research objects to specific research communities, initiatives and infrastructures. In addition, other inference modules provide content-based document classification, document similarity, citation matching, and author affiliation matching.
**Project mining** in OpenAIRE text mines the full-texts of publications in order to extract matches to funding project codes/IDs. The mining algorithm works by utilising (i) the grant identifier, and (ii) the project acronym (if available) of each project. The mining algorithm: (1) Preprocesses/normalizes the full-texts using several functions, which depend on the characteristics of each funder (i.e., the format of the grant identifiers), such as stopword and/or punctuation removal, tokenization, stemming, converting to lowercase; then (2) String matching of grant identifiers against the normalized text is done using database techniques; and (3) The results are validated and cleaned using the context near the match by looking at the context around the matched ID for relevant metadata and positive or negative words/phrases, in order to calculate a confidence value for each publication-->project link. A confidence threshold is set to optimise high accuracy while minimising false positives, such as matches with page or report numbers, post/zip codes, parts of telephone numbers, DOIs or URLs, accession numbers. The algorithm also applies rules for disambiguating results, as different funders can share identical project IDs; for example, grant number 633172 could refer to H2020 project EuroMix but also to Australian-funded NHMRC project “Brain activity (EEG) analysis and brain imaging techniques to measure the neurobiological effects of sleep apnea”. Project mining works very well and was the first Text & Data Mining (TDM) service of OpenAIRE. Performance results vary from funder to funder but precision is higher than 98% for all funders and 99.5% for EC projects. Recall is higher than 95% (99% for EC projects), when projects are properly acknowledged using project/grant IDs. **Project mining** in OpenAIRE text mines the full-texts of publications in order to extract matches to funding project codes/IDs. The mining algorithm works by utilising (i) the grant identifier, and (ii) the project acronym (if available) of each project. The mining algorithm: (1) Preprocesses/normalizes the full-texts using several functions, which depend on the characteristics of each funder (i.e., the format of the grant identifiers), such as stopword and/or punctuation removal, tokenization, stemming, converting to lowercase; then (2) String matching of grant identifiers against the normalized text is done using database techniques; and (3) The results are validated and cleaned using the context near the match by looking at the context around the matched ID for relevant metadata and positive or negative words/phrases, in order to calculate a confidence value for each publication-->project link. A confidence threshold is set to optimise high accuracy while minimising false positives, such as matches with page or report numbers, post/zip codes, parts of telephone numbers, DOIs or URLs, accession numbers. The algorithm also applies rules for disambiguating results, as different funders can share identical project IDs; for example, grant number 633172 could refer to H2020 project EuroMix but also to Australian-funded NHMRC project “Brain activity (EEG) analysis and brain imaging techniques to measure the neurobiological effects of sleep apnea”. Project mining works very well and was the first Text & Data Mining (TDM) service of OpenAIRE. Performance results vary from funder to funder but precision is higher than 98% for all funders and 99.5% for EC projects. Recall is higher than 95% (99% for EC projects), when projects are properly acknowledged using project/grant IDs.

View File

@ -2,30 +2,74 @@
sidebar_position: 2 sidebar_position: 2
--- ---
# Impact scores # Impact indicators
<span className="todo">TODO - add intro</span>
This page summarises all calculated impact indicators, which are included into the [measure](/data-model/entities/other#measure) property.
It should be noted that the impact indicators are being calculated both on the level of the research output as well on the level of distinct DOIs.
Below we explain their main intuition, the way they are calculated, and their most important limitations, in an attempt help avoiding common pitfalls and misuses.
## Citation Count (CC) ## Citation Count (CC)
This is the most widely used scientific impact indicator, which sums all citations received by each article. The citation count of a ***Short description:***
publication $i$ corresponds to the in-degree of the corresponding node in the underlying citation network: $s_i = \sum_{j} A_{i,j}$, This is the most widely used scientific impact indicator, which sums all citations received by each article.
where $A$ is the adjacency matrix of the network (i.e., $A_{i,j}=1$ when paper $j$ cites paper $i$, while $A_{i,j}=0$ otherwise).
Citation count can be viewed as a measure of a publication's overall impact, since it conveys the number of other works that directly Citation count can be viewed as a measure of a publication's overall impact, since it conveys the number of other works that directly
drew on it. drew on it.
***Algorithmic details:***
The citation count of a
publication $i$ corresponds to the in-degree of the corresponding node in the underlying citation network: $s_i = \sum_{j} A_{i,j}$,
where $A$ is the adjacency matrix of the network (i.e., $A_{i,j}=1$ when paper $j$ cites paper $i$, while $A_{i,j}=0$ otherwise).
***Parameters:*** -
***Limitations:***
OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator.
Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
***Environment:*** PySpark
***References:*** -
***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
## "Incubation" Citation Count (iCC) ## "Incubation" Citation Count (iCC)
***Short description:***
This measure is essentially a time-restricted version of the citation count, where the time window is distinct for each paper, i.e., This measure is essentially a time-restricted version of the citation count, where the time window is distinct for each paper, i.e.,
only citations $y$ years after its publication are counted (usually, $y=3$). The "incubation" citation count of a paper $i$ is only citations $y$ years after its publication are counted.
calculated as: $s_i = \sum_{j,t_j \leq t_i+3} A_{i,j}$, where $A$ is the adjacency matrix and $t_j, t_i$ are the citing and cited paper's
***Algorithmic details:***
The "incubation" citation count of a paper $i$ is
calculated as: $s_i = \sum_{j,t_j \leq t_i+y} A_{i,j}$, where $A$ is the adjacency matrix and $t_j, t_i$ are the citing and cited paper's
publication years, respectively. $t_i$ is cited paper $i$'s publication year. iCC can be seen as an indicator of a paper's initial momentum publication years, respectively. $t_i$ is cited paper $i$'s publication year. iCC can be seen as an indicator of a paper's initial momentum
(impulse) directly after its publication. (impulse) directly after its publication.
***Parameters:***
$y=3$
***Limitations:***
OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator.
Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
***Environment:*** PySpark
***References:***
* Vergoulis, T., Kanellos, I., Atzori, C., Mannocci, A., Chatzopoulos, S., Bruzzo, S. L., Manola, N., & Manghi, P. (2021, April). Bip! db: A dataset of impact measures for scientific publications. In Companion Proceedings of the Web Conference 2021 (pp. 456-460).
***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
## PageRank (PR) ## PageRank (PR)
***Short description:***
Originally developed to rank Web pages, PageRank has been also widely used to rank publications in citation Originally developed to rank Web pages, PageRank has been also widely used to rank publications in citation
networks. In this latter context, a publication's PageRank networks. In this latter context, a publication's PageRank
score also serves as a measure of its influence. In particular, the PageRank score of a publication is calculated score also serves as a measure of its influence.
***Algorithmic details:***
The PageRank score of a publication is calculated
as its probability of being read by a researcher that either randomly selects publications to read or selects as its probability of being read by a researcher that either randomly selects publications to read or selects
publications based on the references of her latest read. Formally, the score of a publication $i$ is given by: publications based on the references of her latest read. Formally, the score of a publication $i$ is given by:
@ -41,12 +85,31 @@ score of each publication relies of the score of publications citing it (the alg
until all scores converge). As a result, PageRank differentiates citations based on the importance of citing until all scores converge). As a result, PageRank differentiates citations based on the importance of citing
articles, thus alleviating the corresponding issue of the Citation Count. articles, thus alleviating the corresponding issue of the Citation Count.
***Parameters:***
$\alpha = 0.5, convergence\_error = 10^{-12}$
***Limitations:***
OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator.
Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
***Environment:*** PySpark
***References:***
* Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab.
***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
## RAM ## RAM
RAM is essentially a modified Citation Count, where recent citations are considered of higher importance compared ***Short description:***
to older ones. Hence, it better captures the popularity of publications. This "time-awareness" of citations RAM is essentially a modified Citation Count, where recent citations are considered of higher importance compared to older ones.
Hence, it better captures the popularity of publications. This "time-awareness" of citations
alleviates the bias of methods like Citation Count and PageRank against recently published articles, which have alleviates the bias of methods like Citation Count and PageRank against recently published articles, which have
not had "enough" time to gather as many citations. The RAM score of each paper $i$ is calculated as follows: not had "enough" time to gather as many citations.
***Algorithmic details:***
The RAM score of each paper $i$ is calculated as follows:
$$ $$
s_i = \sum_j{R_{i,j}} s_i = \sum_j{R_{i,j}}
@ -56,11 +119,30 @@ where $R$ is the so-called Retained Adjacency Matrix (RAM) and $R_{i,j}=\gamma^{
$i$, and $R_{i,j}=0$ otherwise. Parameter $\gamma \in (0,1)$, $t_c$ corresponds to the current year and $t_j$ corresponds to the $i$, and $R_{i,j}=0$ otherwise. Parameter $\gamma \in (0,1)$, $t_c$ corresponds to the current year and $t_j$ corresponds to the
publication year of citing article $j$. publication year of citing article $j$.
***Parameters:***
$\gamma = 0.6$
***Limitations:***
OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator.
Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
***Environment:*** PySpark
***References:***
* Ghosh, R., Kuo, T. T., Hsu, C. N., Lin, S. D., & Lerman, K. (2011, December). Time-aware ranking in dynamic citation networks. In 2011 ieee 11^{th} international conference on data mining workshops (pp. 373-380). IEEE.
***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
## AttRank ## AttRank
***Short description:***
AttRank is a PageRank variant that alleviates its bias against recent publications (i.e., it is tailored to capture popularity). AttRank is a PageRank variant that alleviates its bias against recent publications (i.e., it is tailored to capture popularity).
AttRank achieves this by modifying PageRank's probability of randomly selecting a publication. Instead of using a uniform probability, AttRank achieves this by modifying PageRank's probability of randomly selecting a publication. Instead of using a uniform probability,
AttRank defines it based on a combination of the publication's age and the citations it received in recent years. The AttRank score AttRank defines it based on a combination of the publication's age and the citations it received in recent years.
***Algorithmic details:***
The AttRank score
of each publication $i$ is calculated based on: of each publication $i$ is calculated based on:
$$ $$
@ -71,3 +153,21 @@ $$
where $\alpha + \beta + \gamma =1$ and $\alpha,\beta,\gamma \in [0,1]$. $Att(i)$ denotes a recent attention-based score for publication $i$, where $\alpha + \beta + \gamma =1$ and $\alpha,\beta,\gamma \in [0,1]$. $Att(i)$ denotes a recent attention-based score for publication $i$,
which reflects its share of citations in the $y$ most recent years, $t_i$ is the publication year of article $i$, $t_c$ denotes the current which reflects its share of citations in the $y$ most recent years, $t_i$ is the publication year of article $i$, $t_c$ denotes the current
year, and $c$ is a normalisation constant. Finally, $P$ is the stochastic transition matrix. year, and $c$ is a normalisation constant. Finally, $P$ is the stochastic transition matrix.
***Parameters:***
$\alpha = 0.2, \beta = 0.5, \gamma = 0.3, \rho = -0.16, convergence\_error = 10^-{12}$
Note that recent attention is based on the 3 most recent years (including current one).
***Limitations:***
OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator.
Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
***Environment:*** PySpark
***References:***
* Kanellos, I., Vergoulis, T., Sacharidis, D., Dalamagas, T., & Vassiliou, Y. (2021, April). Ranking papers by their short-term scientific impact. In 2021 IEEE 37th International Conference on Data Engineering (ICDE) (pp. 1997-2002). IEEE.
***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)

View File

@ -0,0 +1,37 @@
# Metadata extraction
***Short description:***
Metadata Extraction algorithm is responsible for plaintext and metadata extraction out of the PDF documents. It based on [CERMINE](http://cermine.ceon.pl/about.html) project.
CERMINE is a comprehensive open source system for extracting metadata and content from scientific articles in born-digital form. The system is able to process documents in PDF format and extracts:
* document's metadata, including title, authors, affiliations, abstract, keywords, journal name, volume and issue,
* parsed bibliographic references
* the structure of document's sections, section titles and paragraphs
CERMINE is based on a modular workflow, whose architecture ensures that individual workflow steps can be maintained separately. As a result it is easy to perform evaluation, training, improve or replace one step implementation without changing other parts of the workflow. Most steps implementations utilize supervised and unsupervised machine-leaning techniques, which increases the maintainability of the system, as well as its ability to adapt to new document layouts.
***Algorithmic details:***
CERMINE workflow is composed of four main parts:
* Basic structure extraction takes a PDF file on the input and produces a geometric hierarchical structure representing the document. The structure is composed of pages, zones, lines, words and characters. The reading order of all elements is determined. Every zone is labelled with one of four general categories: METADATA, REFERENCES, BODY and OTHER.
* Metadata extraction part analyses parts of the geometric hierarchical structure labelled as METADATA and extracts a rich set of document's metadata from it.
* References extraction part analyses parts of the geometric hierarchical structure labelled as REFERENCES and the result is a list of document's parsed bibliographic references.
* Text extraction part analyses parts of the geometric hierarchical structure labelled as BODY and extracts document's body structure composed of sections, subsections and paragraphs.
CERMINE uses supervised and unsupervised machine-leaning techniques, such as Support Vector Machines, K-means clustering and Conditional Random Fields. Content classifiers are trained on [GROTOAP2 dataset](http://cermine.ceon.pl/grotoap2/). More information about CERMINE can be found in the [presentation](http://cermine.ceon.pl/static/docs/slides.pdf).
***Parameters:***
* input: [DocumentText](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/DocumentText.avdl) avro datastore location
* output: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location
***Limitations:***
Born-digital form of PDF documents is supported only. Large PDF documents may require more than 4g of assgined memory (set by default).
***Environment:***
Java, Hadoop
***References:***
* Dominika Tkaczyk, Pawel Szostek, Mateusz Fedoryszak, Piotr Jan Dendek and Lukasz Bolikowski. CERMINE: automatic extraction of structured metadata from scientific literature. In International Journal on Document Analysis and Recognition, 2015, vol. 18, no. 4, pp. 317-335, doi: 10.1007/s10032-015-0249-8.
***Authority:*** ICM &bull; ***License:*** AGPL-3.0 &bull; ***Code:*** [CERMINE](https://github.com/CeON/CERMINE)

View File

@ -1,6 +0,0 @@
---
sidebar_position: 1
---
# Mining algorithms
<span className="todo">TODO</span>

View File

@ -4,10 +4,18 @@ sidebar_position: 5
# Indexing # Indexing
The final version of the OpenAIRE Research Graph is indexed on a Solr server that is used by the OpenAIRE portals (EXPLORE, CONNECT, PROVIDE) and APIs, the latter adopted by several third-party applications and organizations, such as: The final version of the OpenAIRE Graph is indexed on a Solr server that is used by the OpenAIRE portals (EXPLORE, CONNECT, PROVIDE) and APIs, the latter adopted by several third-party applications and organizations, such as:
* EOSC --The OpenAIRE Research Graph APIs and Portals will offer to the EOSC an Open Science Resource Catalogue, keeping an up to date map of all research results (publications, datasets, software), services, organizations, projects, funders in Europe and beyond. * The OpenAIRE Graph APIs and Portals will offer to the EOSC (European Open Science Cloud) an Open Science Resource Catalogue, keeping an up to date map of all research results (publications, datasets, software), services, organizations, projects, funders in Europe and beyond.
* DSpace & EPrints repositories can install the OpenAIRE plugin to expose OpenAIRE compliant metadata records via their OAI-PMH endpoint and offer to researchers the possibility to link their depositions to the funding project, by selecting it from the list of project provided by OpenAIRE * DSpace & EPrints repositories can install the OpenAIRE plugin to expose OpenAIRE compliant metadata records via their OAI-PMH endpoint and offer to researchers the possibility to link their depositions to the funding project, by selecting it from the list of project provided by OpenAIRE.
* EC participant portal (Sygma - System for Grant Management) uses the OpenAIRE API in the “Continuous Reporting” section. Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in the OpenAIRE Research Graph that are linked to the project. The user can select the research products from the list and easily compile the continuous reporting data of the project. * EC participant portal (Sygma - System for Grant Management) uses the OpenAIRE API in the “Continuous Reporting” section. Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in the OpenAIRE Graph that are linked to the project. The user can select the research products from the list and easily compile the continuous reporting data of the project.
* ScholExplorer is used by different players of the scholarly communication ecosystem. For example, [Elsevier](https://www.elsevier.com/authors/tools-and-resources/research-data/data-base-linking) uses its API to make the links between
publications and datasets automatically appear on ScienceDirect.
ScholExplorer indexes the links among the four major types of research products (API v3) available in the OpenAIRE Graph and makes them available through an HTTP API that allows
to search them by the following criteria:
* Links whose source object has a given PID or PID type;
* Links whose source object has been published by a given data source ("data source as publisher");
* Links that were collected from a given data source ("data source as provider").

View File

@ -5,7 +5,7 @@ sidebar_position: 4
# Post cleaning # Post cleaning
At the very end of the processing pipeline, a step is dedicated to perform cleaning operations aimed at improving the overall quality of the data. At the very end of the processing pipeline, a step is dedicated to perform cleaning operations aimed at improving the overall quality of the data.
The output of this final cleansing step is the final version of the OpenAIRE Research Graph. The output of this final cleansing step is the final version of the OpenAIRE Graph.
## Vocabulary based cleaning ## Vocabulary based cleaning
@ -43,7 +43,7 @@ In addition, the integration of ScholeXplorer and DOIBoost and some enrichment p
## Filtering ## Filtering
Bibliographic records that do not meet minimal requirements for being part of the OpenAIRE Research Graph are eliminated during this phase. Bibliographic records that do not meet minimal requirements for being part of the OpenAIRE Graph are eliminated during this phase.
Currently, the only criteria applied horizontally to the entire graph aims at excluding scientific results whose title is not meaningful for citation purposes. Currently, the only criteria applied horizontally to the entire graph aims at excluding scientific results whose title is not meaningful for citation purposes.
Then, different criteria are applied in the pre-processing of specific sub-collections: Then, different criteria are applied in the pre-processing of specific sub-collections:

View File

@ -4,4 +4,4 @@ sidebar_position: 6
# Stats analysis # Stats analysis
The OpenAIRE Research Graph is also processed by a pipeline for extracting the statistics and producing the charts for funders, research initiative, infrastructures, and policy makers that you can see on MONITOR. Based on the information available on the graph, OpenAIRE provides a set of indicators for monitoring the funding and research impact and the uptake of Open Science publishing practices, such as Open Access publishing of publications and datasets, availability of interlinks between research products, availability of post-print versions in institutional or thematic Open Access repositories, etc. The OpenAIRE Graph is also processed by a pipeline for extracting the statistics and producing the charts for funders, research initiative, infrastructures, and policy makers that you can see on MONITOR. Based on the information available on the graph, OpenAIRE provides a set of indicators for monitoring the funding and research impact and the uptake of Open Science publishing practices, such as Open Access publishing of publications and datasets, availability of interlinks between research products, availability of post-print versions in institutional or thematic Open Access repositories, etc.

View File

@ -6,12 +6,12 @@ sidebar_position: 1
# Overview # Overview
The OpenAIRE Research Graph is one of the largest open scholarly record collections worldwide, key in fostering Open Science and establishing its practices in the daily research activities. The OpenAIRE Graph is one of the largest open scholarly record collections worldwide, key in fostering Open Science and establishing its practices in the daily research activities.
Conceived as a public and transparent good, populated out of data sources trusted by scientists, the Graph aims at bringing discovery, monitoring, and assessment of science back in the hands of the scientific community. Conceived as a public and transparent good, populated out of data sources trusted by scientists, the Graph aims at bringing discovery, monitoring, and assessment of science back in the hands of the scientific community.
Imagine a vast collection of research products all linked together, contextualised and openly available. For the past years OpenAIRE has been working to gather this valuable record. It is a massive collection of metadata and links between scientific products such as articles, datasets, software, and other research products, entities like organisations, funders, funding streams, projects, communities, and data sources. Imagine a vast collection of research products all linked together, contextualised and openly available. For the past years OpenAIRE has been working to gather this valuable record. It is a massive collection of metadata and links between scientific products such as articles, datasets, software, and other research products, entities like organisations, funders, funding streams, projects, communities, and data sources.
As of today, the OpenAIRE Research Graph aggregates hundreds of millions of metadata records (and links among them) from multiple data sources trusted by scientists, including: As of today, the OpenAIRE Graph aggregates hundreds of millions of metadata records (and links among them) from multiple data sources trusted by scientists, including:
* Repositories registered in OpenDOAR or re3data.org (soon FAIRSharing.org) * Repositories registered in OpenDOAR or re3data.org (soon FAIRSharing.org)
* Open Access journals registered in DOAJ * Open Access journals registered in DOAJ

View File

@ -3,4 +3,6 @@ sidebar_position: 11
--- ---
# License # License
<span className="todo">TODO</span>
OpenAIRE Graph is available for download and re-use as CC-BY (due to some input sources whose license is CC-BY). Parts of the graphs can be re-used as CC-0.

View File

@ -4,7 +4,7 @@ sidebar_position: 7
# How to cite # How to cite
Open Science services are open and transparent and survive thanks to your active support and to the visibility and reward they gather. If you use one of the [OpenAIRE Research Graph dumps](https://zenodo.org/record/6616871) for your research, please provide a proper citation following the recommendation that you find on the dump's Zenodo page. Open Science services are open and transparent and survive thanks to your active support and to the visibility and reward they gather. If you use one of the [OpenAIRE Graph dumps](https://zenodo.org/record/6616871) for your research, please provide a proper citation following the recommendation that you find on the dump's Zenodo page.
## Relevant research products ## Relevant research products
@ -20,7 +20,7 @@ Mannocci, A., & Manghi, P. (2016, September). "DataQ: a data flow quality monito
### Deduplication ### Deduplication
Vichos K., De Bonis M., Kanellos I., Chatzopoulos S., Atzori C., Manola N., Manghi P., Vergoulis T. (Feb. 2022), "A preliminary assessment of the article deduplication algorithm used for the OpenAIRE Research Graph". IRCDL 2022 - 18th Italian Research Conference on Digital Libraries, Padua, Italy. CEUR-WS Proceedings. [http://ceur-ws.org/Vol-3160](http://ceur-ws.org/Vol-3160/) Vichos K., De Bonis M., Kanellos I., Chatzopoulos S., Atzori C., Manola N., Manghi P., Vergoulis T. (Feb. 2022), "A preliminary assessment of the article deduplication algorithm used for the OpenAIRE Graph". IRCDL 2022 - 18th Italian Research Conference on Digital Libraries, Padua, Italy. CEUR-WS Proceedings. [http://ceur-ws.org/Vol-3160](http://ceur-ws.org/Vol-3160/)
De Bonis, M., Manghi, P., & Atzori, C. (2022). "FDup: a framework for general-purpose and efficient entity deduplication of record collections". PeerJ Computer Science, 8, e1058. [https://peerj.com/articles/cs-1058](https://peerj.com/articles/cs-1058) De Bonis, M., Manghi, P., & Atzori, C. (2022). "FDup: a framework for general-purpose and efficient entity deduplication of record collections". PeerJ Computer Science, 8, e1058. [https://peerj.com/articles/cs-1058](https://peerj.com/articles/cs-1058)

1018
package-lock.json generated

File diff suppressed because it is too large Load Diff

View File

@ -4,18 +4,18 @@
"private": true, "private": true,
"scripts": { "scripts": {
"docusaurus": "docusaurus", "docusaurus": "docusaurus",
"start": "docusaurus start", "start": "docusaurus start --host 0.0.0.0",
"build": "docusaurus build", "build": "docusaurus build",
"swizzle": "docusaurus swizzle", "swizzle": "docusaurus swizzle",
"deploy": "docusaurus deploy", "deploy": "docusaurus deploy",
"clear": "docusaurus clear", "clear": "docusaurus clear",
"serve": "docusaurus serve", "serve": "docusaurus serve --host 0.0.0.0",
"write-translations": "docusaurus write-translations", "write-translations": "docusaurus write-translations",
"write-heading-ids": "docusaurus write-heading-ids" "write-heading-ids": "docusaurus write-heading-ids"
}, },
"dependencies": { "dependencies": {
"@docusaurus/core": "^2.1.0", "@docusaurus/core": "^2.2.0",
"@docusaurus/preset-classic": "^2.1.0", "@docusaurus/preset-classic": "^2.2.0",
"@mdx-js/react": "^1.6.22", "@mdx-js/react": "^1.6.22",
"clsx": "^1.2.1", "clsx": "^1.2.1",
"hast-util-is-element": "^1.1.0", "hast-util-is-element": "^1.1.0",

View File

@ -29,7 +29,7 @@ const sidebars = {
label: "Entities", label: "Entities",
link: { link: {
type: 'generated-index', type: 'generated-index',
description: 'The main entities of the OpenAIRE Research Graph are listed below.' description: 'The main entities of the OpenAIRE Graph are listed below.'
}, },
items: [ items: [
{ type: 'doc', id: 'data-model/entities/result' }, { type: 'doc', id: 'data-model/entities/result' },
@ -84,7 +84,25 @@ const sidebars = {
label: "Enrichment", label: "Enrichment",
link: {type: 'doc', id: 'data-provision/enrichment/enrichment'}, link: {type: 'doc', id: 'data-provision/enrichment/enrichment'},
items: [ items: [
{ type: 'doc', id: 'data-provision/enrichment/mining' }, {
type: 'category',
label: "Mining algorithms",
link: {
type: 'generated-index',
description: 'The Text and Data Mining (TDM) algorithms used for enriching the OpenAIRE Graph are grouped in the following main categories:'
},
items: [
{ type: 'doc', id: 'data-provision/enrichment/affiliation_matching' },
{ type: 'doc', id: 'data-provision/enrichment/citation_matching' },
{ type: 'doc', id: 'data-provision/enrichment/classifies' },
{ type: 'doc', id: 'data-provision/enrichment/documents_similarity' },
{ type: 'doc', id: 'data-provision/enrichment/acks' },
{ type: 'doc', id: 'data-provision/enrichment/cites' },
{ type: 'doc', id: 'data-provision/enrichment/metadata_extraction' },
]
},
{ type: 'doc', id: 'data-provision/enrichment/impact-scores' }, { type: 'doc', id: 'data-provision/enrichment/impact-scores' },
] ]
}, },

Binary file not shown.

Before

Width:  |  Height:  |  Size: 170 KiB

After

Width:  |  Height:  |  Size: 174 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 129 KiB

After

Width:  |  Height:  |  Size: 130 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 181 KiB

After

Width:  |  Height:  |  Size: 184 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 78 KiB

After

Width:  |  Height:  |  Size: 79 KiB