Compare commits

...

39 Commits

Author SHA1 Message Date
Harry Dimitropoulos 8cddb71098 Update 'docs/data-provision/enrichment/classifies.md' 2022-11-16 16:56:42 +01:00
Harry Dimitropoulos e562936a18 Update 'docs/data-provision/enrichment/cites.md' 2022-11-16 16:56:16 +01:00
Harry Dimitropoulos 96c7a6d87c Update 'docs/data-provision/enrichment/acks.md' 2022-11-16 16:56:02 +01:00
Harry Dimitropoulos a48f5a263d Update 'docs/data-provision/enrichment/mining.md' 2022-11-16 16:54:28 +01:00
Harry Dimitropoulos 2d75ea529f Update 'docs/data-provision/enrichment/classified.md' 2022-11-16 16:48:47 +01:00
Harry Dimitropoulos 8f9184146c Update 'docs/data-provision/enrichment/classified.md' 2022-11-16 16:41:42 +01:00
Yannis Foufoulas 8fda5c81cf Update 'docs/data-provision/enrichment/mining.md' 2022-11-16 16:36:42 +01:00
Harry Dimitropoulos e40fee8408 Update 'docs/data-provision/enrichment/acks.md' 2022-11-16 16:36:15 +01:00
Yannis Foufoulas aa35a239f3 Update 'docs/data-provision/enrichment/classified.md' 2022-11-16 16:34:04 +01:00
Harry Dimitropoulos fcedfc1d9d Update 'docs/data-provision/enrichment/acks.md' 2022-11-16 16:31:37 +01:00
Harry Dimitropoulos 1f5856ecf4 Add 'docs/data-provision/enrichment/classified.md' 2022-11-16 16:27:43 +01:00
Yannis Foufoulas 45d3b152dc Update 'docs/data-provision/enrichment/acks.md' 2022-11-16 16:25:05 +01:00
Harry Dimitropoulos 163c5a6bca Update 'docs/data-provision/enrichment/cites.md' 2022-11-16 16:20:45 +01:00
Yannis Foufoulas 44815cc8e1 Update 'docs/data-provision/enrichment/cites.md' 2022-11-16 16:19:14 +01:00
Yannis Foufoulas 0732dd5df6 Update 'docs/data-provision/enrichment/cites.md' 2022-11-16 16:15:51 +01:00
Yannis Foufoulas d5dd2f6d0b Update 'docs/data-provision/enrichment/cites.md' 2022-11-16 16:13:15 +01:00
Harry Dimitropoulos 4458952a2e Update 'docs/data-provision/enrichment/cites.md'
Added Reference and link to High-Pass Text Filtering paper
2022-11-16 15:37:58 +01:00
Harry Dimitropoulos 544808c7cd Update 'docs/data-provision/enrichment/acks.md' 2022-11-16 15:28:43 +01:00
Harry Dimitropoulos 5dec33d26f Update 'docs/data-provision/enrichment/cites.md'
added short description
2022-11-16 15:22:35 +01:00
Harry Dimitropoulos c9228633ec Update 'docs/data-provision/enrichment/acks.md'
Added a brief description
2022-11-16 15:17:49 +01:00
Harry Dimitropoulos ca9a8f75c3 Update 'docs/data-provision/enrichment/acks.md' 2022-11-16 15:07:29 +01:00
Harry Dimitropoulos 6b48a13bc1 Update 'docs/data-provision/enrichment/cites.md' 2022-11-16 14:59:39 +01:00
Harry Dimitropoulos c2dbf0536b Update 'docs/data-provision/enrichment/acks.md' 2022-11-16 14:59:02 +01:00
Harry Dimitropoulos f933f541fe Update 'docs/data-provision/enrichment/mining.md' 2022-11-16 14:58:12 +01:00
Yannis Foufoulas 39d3f47fa0 Update 'docs/data-provision/enrichment/mining.md' 2022-11-16 14:45:41 +01:00
Yannis Foufoulas 5fc5032537 Update 'docs/data-provision/enrichment/mining.md' 2022-11-16 14:44:46 +01:00
Harry Dimitropoulos 002bfdd851 Add 'docs/data-provision/enrichment/cites.md' 2022-11-16 14:18:40 +01:00
Harry Dimitropoulos ad4c4f909e Update 'docs/data-provision/enrichment/acks.md' 2022-11-16 14:16:46 +01:00
Harry Dimitropoulos 1cf79bc30c Add 'docs/data-provision/enrichment/acks.md' 2022-11-16 14:08:56 +01:00
Harry Dimitropoulos b739759e3a Update 'docs/data-provision/enrichment/mining.md' 2022-11-15 15:53:21 +01:00
Harry Dimitropoulos c5b84be1d3 Update 'docs/data-provision/enrichment/mining.md' 2022-11-15 15:49:57 +01:00
Harry Dimitropoulos e6b02ffc32 Update 'docs/data-provision/enrichment/mining.md' 2022-11-15 14:50:45 +01:00
Yannis Foufoulas 47e112420e edit md file 2022-11-15 15:38:34 +02:00
Serafeim Chatzopoulos 76ffd07839 Merge pull request 'expanded indexing section' (#12) from indexing into main
Reviewed-on: D-Net/openaire-graph-docs#12
2022-11-15 12:27:07 +01:00
Serafeim Chatzopoulos 1456f4f045 Update typo in '/docs/data-model/entities/result.md' 2022-11-15 12:25:51 +01:00
Claudio Atzori b0598daa72 expanded indexing section 2022-11-15 11:46:46 +01:00
Claudio Atzori edaffdef8c added link to the entities section 2022-11-15 09:56:27 +01:00
Serafeim Chatzopoulos 673e2579fc Merge pull request 'Deduplication section: decision trees updated and link of images added in comments' (#11) from deduplication into main
Reviewed-on: D-Net/openaire-graph-docs#11
2022-11-14 11:19:02 +01:00
Michele De Bonis 3419c0ee40 decision trees updated and link of images added in comments 2022-11-14 11:13:29 +01:00
13 changed files with 105 additions and 5 deletions

View File

@ -1,6 +1,6 @@
# Data model
The OpenAIRE Research Graph comprises several types of entities and [relationships](./relationships) among them.
The OpenAIRE Research Graph comprises several types of [entities](../category/entities) and [relationships](./relationships) among them.
The latest version of the JSON schema can be found on [Bulk downloads](../download).

View File

@ -311,7 +311,7 @@ _Type: [Subject](other#subject) • Cardinality: MANY_
Subject, keyword, classification code, or key phrase describing the resource.
```json
"subjecsts": [
"subjects": [
{
"provenance": {
"provenance": "Harvested",

View File

@ -46,6 +46,8 @@ The comparison goes through the following decision tree:
<img loading="lazy" alt="Organization Decision Tree" src="/img/docs/decisiontree-organization.png" width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/1YKInGGtHu09QG4pT2gRLEum4LxU82d4nKkvGNvRQmrg/edit?usp=sharing)
### Data Curation
All the similarity relations drawn by the algorithm involving the decision tree are exposed in OpenOrgs, where they are made available to the data curators, who can give feedback and improve the organizations' metadata.

View File

@ -37,6 +37,8 @@ The comparison goes through different stages:
<img loading="lazy" alt="Publications Decision Tree" src="/img/docs/decisiontree-publication.png" width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/19SIilTp1vukw6STMZuPMdc0pv0ODYCiOxP7OU3iPWK8/edit?usp=sharing)
#### Software
For each pair of software in a cluster the following strategy (depicted in the figure below) is applied.
The comparison goes through different stages:
@ -48,6 +50,8 @@ The comparison goes through different stages:
<img loading="lazy" alt="Software Decision Tree" src="/img/docs/decisiontree-software.png" width="85%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/19gd1-GTOEEo6awMObGRkYFhpAlO_38mfbDFFX0HAkuo/edit?usp=sharing)
#### Datasets and Other types of research products
For each pair of datasets or other types of research products in a cluster the strategy depicted in the figure below is applied.
The decision tree is almost identical to the publication decision tree, with the only exception of the *instance type check* stage. Since these types of record do not have a relatable instance type, the check is not performed and the decision tree node is skipped.
@ -56,6 +60,8 @@ The decision tree is almost identical to the publication decision tree, with the
<img loading="lazy" alt="Dataset and Other types of research products Decision Tree" src="/img/docs/decisiontree-dataset-orp.png" width="90%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/1uBa7Bw2KwBRDUYIfyRr_Keol7UOeyvMNN7MPXYLg4qw/edit?usp=sharing)
### Duplicates grouping (transitive closure)
The general concept is that the field coming from the record with the higher "trust" value is used as the reference for the field of the representative record.

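The trust-based selection described above can be illustrated with a minimal Python sketch. The record layout (`trust`, `fields`), the sample values, and the fallback to a lower-trust record when a field is empty are all assumptions made for illustration; the actual merging logic may differ.

```python
from typing import Any


def build_representative(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge duplicates: each field comes from the most trusted record that has it."""
    # Visit records from highest to lowest trust (hypothetical "trust" key).
    ordered = sorted(records, key=lambda r: r["trust"], reverse=True)
    representative: dict[str, Any] = {}
    for record in ordered:
        for field, value in record["fields"].items():
            # Keep the first (most trusted) non-empty value seen for each field.
            if field not in representative and value not in (None, "", []):
                representative[field] = value
    return representative


# Hypothetical group of duplicate records.
duplicates = [
    {"trust": 0.8, "fields": {"title": "Graph mining", "publisher": ""}},
    {"trust": 0.9, "fields": {"title": "Graph Mining at Scale", "publisher": None}},
    {"trust": 0.5, "fields": {"title": "graph mining", "publisher": "ACME Press"}},
]
representative = build_representative(duplicates)
```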
View File

@ -0,0 +1,23 @@
---
sidebar_position: 3
---
# Extraction of Acknowledged Concepts
| Property | Description |
| --- | --- |
| Short description | Scans the plaintexts of publications for acknowledged concepts, including grant identifiers (projects) of funders, accession numbers of bioentities, EPO patent mentions, as well as custom concepts that can link research objects to specific research communities and initiatives in OpenAIRE. |
| Authority | ATHENA Research Center, Greece |
| Licence | CC-BY/CC-0 |
| Algorithmic details | The algorithm processes the publication's fulltext and extracts references to acknowledged concepts. It applies pattern matching and string join between the fulltext and a target database which contains the title, the acronym and the identifier of the searched concept. |
| Parameters | Concept titles, acronyms, and identifiers, publication's identifiers and fulltexts |
| Limitations | N/A |
| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
| References & resources | [Foufoulas, Y., Zacharia, E., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2022). DETEXA: Declarative Extensible Text Exploration and Analysis. In: , et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham.](https://doi.org/10.1007/978-3-031-16802-4_9) |

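As a rough illustration of the "pattern matching and string join" idea in the table above, the following Python sketch matches candidate identifiers in a text and joins them against a small target database. It is not the madIS implementation: the database rows, the six-digit identifier pattern, and the rule that the acronym must also appear in the text are assumptions.

```python
import re

# Hypothetical target database: one row per concept (title, acronym, identifier).
TARGET_DB = [
    {"title": "Example Research Infrastructure", "acronym": "EXRI", "identifier": "123456"},
    {"title": "Sample Data Initiative", "acronym": "SDI", "identifier": "654321"},
]

# Assumed pattern: identifiers are standalone runs of exactly six digits.
ID_PATTERN = re.compile(r"\b\d{6}\b")


def extract_acknowledged(fulltext: str) -> list[dict[str, str]]:
    # Step 1 (pattern matching): collect candidate identifiers from the text.
    candidates = set(ID_PATTERN.findall(fulltext))
    # Step 2 (join): keep concepts whose identifier is among the candidates and
    # whose acronym also occurs in the text as supporting evidence.
    return [
        concept
        for concept in TARGET_DB
        if concept["identifier"] in candidates and concept["acronym"] in fulltext
    ]


text = "This work was funded by the EXRI project (grant agreement no. 123456)."
found = extract_acknowledged(text)
```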
View File

@ -0,0 +1,23 @@
---
sidebar_position: 4
---
# Extraction of Cited Concepts
| Property | Description |
| --- | --- |
| Short description | Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs. |
| Authority | ATHENA Research Center, Greece |
| Licence | CC-BY/CC-0 |
| Algorithmic details | The algorithm extracts citations to specific datasets and software. It extracts the citation section of a publication's fulltext and applies string matching against a target database which includes an inverted index with dataset/software titles, URLs and other metadata. |
| Parameters | Title, URL, creator names, publisher names and publication year for each concept to create the target database. Identifier and publication's fulltext to extract the cited concepts. |
| Limitations | N/A |
| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
| References & resources | [Foufoulas Y., Stamatogiannakis L., Dimitropoulos H., Ioannidis Y. (2017) “High-Pass Text Filtering for Citation Matching”. In: Kamps J., Tsakonas G., Manolopoulos Y., Iliadis L., Karydis I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham.](https://doi.org/10.1007/978-3-319-67008-9_28) |

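The matching against an inverted index described in the table above can be sketched in Python as follows. This is a simplified token-overlap matcher, not the published high-pass filtering method; the concept identifiers, titles, tokenizer, and the 0.75 threshold are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical concepts (datasets/software) forming the target database.
CONCEPTS = {
    "ds1": "Global Ocean Temperature Records",
    "sw1": "FastAlign Sequence Toolkit",
}


def tokenize(text: str) -> set[str]:
    # Lowercase alphanumeric tokens; duplicates are collapsed by the set.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_inverted_index(concepts: dict[str, str]) -> dict[str, set[str]]:
    # Map each title token to the identifiers of the concepts containing it.
    index: dict[str, set[str]] = defaultdict(set)
    for concept_id, title in concepts.items():
        for token in tokenize(title):
            index[token].add(concept_id)
    return index


def match_reference(reference: str, concepts: dict[str, str],
                    index: dict[str, set[str]], threshold: float = 0.75) -> list[str]:
    # Count, per concept, how many of its title tokens occur in the reference.
    hits: dict[str, int] = defaultdict(int)
    for token in tokenize(reference):
        for concept_id in index.get(token, ()):
            hits[concept_id] += 1
    # Accept a concept when enough of its title tokens are covered.
    return sorted(
        concept_id
        for concept_id, count in hits.items()
        if count / len(tokenize(concepts[concept_id])) >= threshold
    )


index = build_inverted_index(CONCEPTS)
reference = "Smith J. (2020). Global Ocean Temperature Records, v2. Example Repository."
matched = match_reference(reference, CONCEPTS, index)
```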
View File

@ -0,0 +1,23 @@
---
sidebar_position: 5
---
# Classifiers
| Property | Description |
| --- | --- |
| Short description | A document classification algorithm that employs analysis of free text stemming from the abstracts of the publications. The purpose of applying a document classification module is to assign a scientific text to one or more predefined content classes. |
| Authority | ATHENA Research Center, Greece |
| Licence | CC-BY/CC-0 |
| Algorithmic details | The algorithm classifies publications' fulltexts using a Bayesian classifier and weighted terms according to an offline training phase. The training has been done using the following taxonomies: arXiv, MeSH (Medical Subject Headings), ACM, and DDC (Dewey Decimal Classification, or Dewey Decimal System). |
| Parameters | Publication's identifier and fulltext |
| Limitations | N/A |
| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
| References & resources | [Giannakopoulos, T., Stamatogiannakis, E., Foufoulas, I., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2014). Content Visualization of Scientific Corpora Using an Extensible Relational Database Implementation. In: Bolikowski, Ł., Casarosa, V., Goodale, P., Houssos, N., Manghi, P., Schirrwagen, J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham.](https://doi.org/10.1007/978-3-319-08425-1_10) |

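To make the Bayesian classification step in the table above more concrete, here is a minimal multinomial naive Bayes sketch in plain Python. It uses raw term counts with add-one smoothing rather than the weighted terms mentioned in the table, and the training texts and class labels are invented for illustration only.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Offline training: count documents per class and terms per class."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in samples:
        class_docs[label] += 1
        for term in text.lower().split():
            term_counts[label][term] += 1
            vocabulary.add(term)
    return class_docs, term_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log posterior for the given text."""
    class_docs, term_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        # Log prior of the class.
        score = math.log(class_docs[label] / total_docs)
        total_terms = sum(term_counts[label].values())
        for term in text.lower().split():
            if term in vocabulary:  # terms unseen in training are ignored
                count = term_counts[label][term]
                # Add-one (Laplace) smoothed log likelihood of the term.
                score += math.log((count + 1) / (total_terms + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data with two invented classes.
model = train([
    ("neural network training gradient", "cs"),
    ("protein cell gene expression", "bio"),
    ("graph algorithm complexity", "cs"),
    ("gene mutation disease", "bio"),
])
predicted = classify("gene expression in a cell", model)
```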
View File

@ -3,4 +3,19 @@ sidebar_position: 1
---
# Mining algorithms
The Text and Data Mining (TDM) algorithms used for enriching the OpenAIRE Graph are grouped in the following main categories:
[Extraction of acknowledged concepts](acks.md)
[Extraction of cited concepts](cites.md)
[Document Classification](classified.md)
<span className="todo">TODO</span>

View File

@ -6,8 +6,16 @@ sidebar_position: 5
The final version of the OpenAIRE Research Graph is indexed on a Solr server that is used by the OpenAIRE portals (EXPLORE, CONNECT, PROVIDE) and APIs, the latter adopted by several third-party applications and organizations, such as:
* EOSC --The OpenAIRE Research Graph APIs and Portals will offer to the EOSC an Open Science Resource Catalogue, keeping an up to date map of all research results (publications, datasets, software), services, organizations, projects, funders in Europe and beyond.
* The OpenAIRE Research Graph APIs and Portals will offer to the EOSC (European Open Science Cloud) an Open Science Resource Catalogue, keeping an up to date map of all research results (publications, datasets, software), services, organizations, projects, funders in Europe and beyond.
* DSpace & EPrints repositories can install the OpenAIRE plugin to expose OpenAIRE compliant metadata records via their OAI-PMH endpoint and offer to researchers the possibility to link their depositions to the funding project, by selecting it from the list of project provided by OpenAIRE
* DSpace & EPrints repositories can install the OpenAIRE plugin to expose OpenAIRE compliant metadata records via their OAI-PMH endpoint and offer to researchers the possibility to link their depositions to the funding project, by selecting it from the list of projects provided by OpenAIRE.
* EC participant portal (Sygma - System for Grant Management) uses the OpenAIRE API in the “Continuous Reporting” section. Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in the OpenAIRE Research Graph that are linked to the project. The user can select the research products from the list and easily compile the continuous reporting data of the project.
* EC participant portal (Sygma - System for Grant Management) uses the OpenAIRE API in the “Continuous Reporting” section. Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in the OpenAIRE Research Graph that are linked to the project. The user can select the research products from the list and easily compile the continuous reporting data of the project.
* ScholExplorer is used by different players of the scholarly communication ecosystem. For example, [Elsevier](https://www.elsevier.com/authors/tools-and-resources/research-data/data-base-linking) uses its API to make the links between
publications and datasets automatically appear on ScienceDirect.
ScholExplorer indexes the links among the four major types of research products (API v3) available in the OpenAIRE Research Graph and makes them available through an HTTP API that allows
searching them by the following criteria:
* Links whose source object has a given PID or PID type;
* Links whose source object has been published by a given data source ("data source as publisher");
* Links that were collected from a given data source ("data source as provider").

Binary file not shown. Size before: 170 KiB; after: 174 KiB.

Binary file not shown. Size before: 129 KiB; after: 130 KiB.

Binary file not shown. Size before: 181 KiB; after: 184 KiB.

Binary file not shown. Size before: 78 KiB; after: 79 KiB.