[Enrichment] first version of documentation for the bulktagging and part of the propagation

2022-11-09 18:03:55 +01:00
26 changed files with 66 additions and 117 deletions
--- a/docs/data-model/data-model.md
+++ b/docs/data-model/data-model.md
@ -1,6 +1,6 @@
 # Data model

-The OpenAIRE Research Graph comprises several types of [entities](../category/entities) and [relationships](./relationships) among them.
+The OpenAIRE Research Graph comprises several types of entities and [relationships](./relationships) among them.

 The latest version of the JSON schema can be found on [Bulk downloads](../download).

--- a/docs/data-model/entities/result.md
+++ b/docs/data-model/entities/result.md
@ -311,7 +311,7 @@ _Type: [Subject](other#subject) &bull; Cardinality: MANY_
 Subject, keyword, classification code, or key phrase describing the resource.

 ```json
-"subjects": [
+"subjects": [
    {
        "provenance": {
            "provenance": "Harvested",
--- a/docs/data-provision/deduplication/organizations.md
+++ b/docs/data-provision/deduplication/organizations.md
@ -46,8 +46,6 @@ The comparison goes through the following decision tree:
    <img loading="lazy" alt="Organization Decision Tree" src="/img/docs/decisiontree-organization.png" width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
 </p>

-[//]: # (Link to the image: https://docs.google.com/drawings/d/1YKInGGtHu09QG4pT2gRLEum4LxU82d4nKkvGNvRQmrg/edit?usp=sharing)
-
 ### Data Curation

 All the similarity relations drawn by the algorithm involving the decision tree are exposed in OpenOrgs, where they are made available to the data curators so that they can give feedback and improve the organizations' metadata.
--- a/docs/data-provision/deduplication/research-products.md
+++ b/docs/data-provision/deduplication/research-products.md
@ -37,8 +37,6 @@ The comparison goes through different stages:
    <img loading="lazy" alt="Publications Decision Tree" src="/img/docs/decisiontree-publication.png" width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
 </p>

-[//]: # (Link to the image: https://docs.google.com/drawings/d/19SIilTp1vukw6STMZuPMdc0pv0ODYCiOxP7OU3iPWK8/edit?usp=sharing)
-
 #### Software
 For each pair of software in a cluster the following strategy (depicted in the figure below) is applied.
 The comparison goes through different stages:
@ -50,8 +48,6 @@ The comparison goes through different stages:
    <img loading="lazy" alt="Software Decision Tree" src="/img/docs/decisiontree-software.png" width="85%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
 </p>

-[//]: # (Link to the image: https://docs.google.com/drawings/d/19gd1-GTOEEo6awMObGRkYFhpAlO_38mfbDFFX0HAkuo/edit?usp=sharing)
-
 #### Datasets and Other types of research products
 For each pair of datasets or other types of research products in a cluster the strategy depicted in the figure below is applied.
 The decision tree is almost identical to the publication decision tree, with the only exception of the *instance type check* stage. Since these types of record do not have a relatable instance type, the check is not performed and the decision tree node is skipped.
@ -60,8 +56,6 @@ The decision tree is almost identical to the publication decision tree, with the
    <img loading="lazy" alt="Dataset and Other types of research products Decision Tree" src="/img/docs/decisiontree-dataset-orp.png" width="90%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
 </p>

-[//]: # (Link to the image: https://docs.google.com/drawings/d/1uBa7Bw2KwBRDUYIfyRr_Keol7UOeyvMNN7MPXYLg4qw/edit?usp=sharing)
-
 ### Duplicates grouping (transitive closure)

 The general concept is that the field coming from the record with higher "trust" value is used as reference for the field of the representative record.
--- a/docs/data-provision/enrichment/acks.md
+++ b/docs/data-provision/enrichment/acks.md
@ -1,23 +0,0 @@
---
-sidebar_position: 3
---
-
-# Extraction of Acknowledged Concepts
-
-| Property  | Description |
-| --- | --- |
-| Short description  | Scans the plaintexts of publications for acknowledged concepts, including grant identifiers (projects) of funders, accession numbers of bioetities, EPO patent mentions, as well as custom concepts that can link research objects to specific research communities and initiatives in OpenAIRE. |
-| Authority  | ATHENA Research Center, Greece |
-| Licence  | CC-BY/CC-0  |
-| Algorithmic details | The algorithm processes the publication's fulltext and extracts references to acknowledged concepts. It applies pattern matching and string join between the fulltext and a target database which contains the title, the acronym and the identifier of the searched concept. |
-| Parameters | Concept titles, acronyms, and identifiers, publication's identifiers and fulltexts |
-| Limitations | N/A |
-| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
-| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
-| References & resources | [Foufoulas, Y., Zacharia, E., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2022). DETEXA: Declarative Extensible Text Exploration and Analysis. In: , et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham.](https://doi.org/10.1007/978-3-031-16802-4_9) |
-
-
-
-
-
-
--- a/docs/data-provision/enrichment/cites.md
+++ b/docs/data-provision/enrichment/cites.md
@ -1,23 +0,0 @@
---
-sidebar_position: 4
---
-
-# Extraction of Cited Concepts
-
-| Property  | Description |
-| --- | --- |
-| Short description  | Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs. |
-| Authority  | ATHENA Research Center, Greece  |
-| Licence  | CC-BY/CC-0  |
-| Algorithmic details | The algorithm extracts citations to specific datasets and software. It extracts the citation section of a publication's fulltext and applies string matching against a target database which includes an inverted index with dataset/software titles, urls and other metadata. |
-| Parameters | Title, URL, creator names, publisher names and publication year for each concept to create the target database. Identifier and publication's fulltext to extract the cited concepts. |
-| Limitations | N/A |
-| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
-| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
-| References & resources | [Foufoulas Y., Stamatogiannakis L., Dimitropoulos H., Ioannidis Y. (2017) “High-Pass Text Filtering for Citation Matching”. In: Kamps J., Tsakonas G., Manolopoulos Y., Iliadis L., Karydis I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham.](https://doi.org/10.1007/978-3-319-67008-9_28) |
-
-
-
-
-
-
--- a/docs/data-provision/enrichment/classifies.md
+++ b/docs/data-provision/enrichment/classifies.md
@ -1,23 +0,0 @@
---
-sidebar_position: 5
---
-
-# Classifiers
-
-| Property  | Description |
-| --- | --- |
-| Short description  | A document classification algorithm that employs analysis of free text stemming from the abstracts of the publications. The purpose of applying a document classification module is to assign a scientific text to one or more predefined content classes. |
-| Authority  | ATHENA Research Center, Greece  |
-| Licence  | CC-BY/CC-0  |
-| Algorithmic details | The algorithm classifies publication's fulltexts using a Bayesian classifier and weighted terms according to an offline training phase. The training has been done using the following taxonomies: arXiv, MeSH (Medical Subject Headings), ACM, and DDC (Dewey Decimal Classification, or Dewey Decimal System).  |
-| Parameters | Publication's identifier and fulltext |
-| Limitations | N/A |
-| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
-| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
-|  References & resources | [Giannakopoulos, T., Stamatogiannakis, E., Foufoulas, I., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2014). Content Visualization of Scientific Corpora Using an Extensible Relational Database Implementation. In: Bolikowski, Ł., Casarosa, V., Goodale, P., Houssos, N., Manghi, P., Schirrwagen, J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham.](https://doi.org/10.1007/978-3-319-08425-1_10) |
-
-
-
-
-
-
--- a/docs/data-provision/enrichment/enrichment.md
+++ b/docs/data-provision/enrichment/enrichment.md
@ -18,27 +18,76 @@ The OpenAIRE Research Graph is enriched by links mined by OpenAIRE’s full-text

 The Deduction process (also known as “bulk tagging”) enriches each record with new information that can be derived from the existing property values.

-As of September 2020, three procedures are in place to relate a research product to a research initiative, infrastructure (RI) or community (RC) based on:
+This process is used to associate results with the communities and research initiatives that are part of OpenAIRE.
+As of November 2022, three procedures are in place to relate a research product to a research initiative, infrastructure (RI) or community (RC) based on:

-* subjects (2.7M results tagged)
+* subjects: it is possible to specify a list of subjects that are relevant for the RC/RI. Every time one of the subjects is found among the subjects of a result, the result is linked to the RC/RI.

-* Zenodo community (16K results tagged)
+<p align="center">
+    <img loading="lazy" alt="Bulktagging Subject" src="/img/docs/enrichment/bulktagging_subject.png" width="70%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
+</p>
+
+
+* data sources: it is possible to list a set of data sources relevant for the RC/RI. All the results collected from these data sources will be linked to the RC/RI.
+<p align="center">
+    <img loading="lazy" alt="Bulktagging Data source" src="/img/docs/enrichment/bulktagging_datasource.png" width="70%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
+</p>
+
+ When only some results collected from a data source are relevant for the RC/RI, it is possible to specify a set of selection constraints (SC) that have to be verified before linking the result to the 
+community. The selection constraint has the form <strong>SC = S1 or S2 or ... or Sn</strong>. Each Si has the form <strong>Si = s<sub>i1</sub> and s<sub>i2</sub> and ... and s<sub>im</sub></strong>, where each s<sub>ij</sub> is a condition on a specific field of the result. The set of fields that can be specified is <strong>F={title, author, contributor, description, orcid}</strong>, 
+while the verb of each condition is one of <strong>V={contains, equals, not_contains, not_equals, contains_ignorecase, equals_ignorecase, not_contains_ignorecase, not_equal_ignorecase}</strong>, and the value is free text.
+An example of a selection criterion is: “All the products whose contributor contains DARIAH”.
+
+<p align="center">
+    <img loading="lazy" alt="Bulktagging Selection constraints" src="/img/docs/enrichment/bulktagging_selconstraints.png" width="70%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
+</p>
+
+* Zenodo community: it is possible to list a set of Zenodo communities relevant for the RC/RI. All the products collected from the listed Zenodo communities are linked to the RC/RI.
+
+
+<p align="center">
+    <img loading="lazy" alt="Bulktagging Zenodo Community" src="/img/docs/enrichment/bulktagging_zenodo.png" width="70%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
+</p>

-* the data source it comes from (250K results tagged)

 The list of subjects, Zenodo communities and data sources used to enrich the products are defined by the managers of the community gateway or infrastructure monitoring dashboard associated with the RC/RI.
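
The selection constraints described above amount to a disjunction (SC) of conjunctions (Si) of field conditions. The following is a minimal Python sketch of that evaluation, not the actual OpenAIRE implementation: the names `Condition`, `condition_holds` and `matches_constraint` are hypothetical, a result is modelled as a plain dictionary from field name to a list of values, a positive verb is assumed to hold when any value of the field satisfies it, and a `not_` verb is assumed to be the negation of the corresponding positive verb.

```python
from typing import Dict, List, Tuple

# A condition is (field, verb, value); field is taken from F and verb from V.
Condition = Tuple[str, str, str]


def condition_holds(values: List[str], verb: str, value: str) -> bool:
    """Evaluate one condition against all the values of a field (assumed semantics)."""
    negate = verb.startswith("not_")
    base = verb[len("not_"):] if negate else verb
    if base.endswith("_ignorecase"):
        base = base[: -len("_ignorecase")]
        value = value.lower()
        values = [v.lower() for v in values]
    if base == "contains":
        hit = any(value in v for v in values)
    elif base in ("equals", "equal"):  # the text lists both spellings
        hit = any(value == v for v in values)
    else:
        raise ValueError(f"unknown verb: {verb}")
    return not hit if negate else hit


def matches_constraint(result: Dict[str, List[str]],
                       constraint: List[List[Condition]]) -> bool:
    """SC = S1 or ... or Sn, where each Si is a conjunction of conditions."""
    return any(
        all(condition_holds(result.get(field, []), verb, value)
            for field, verb, value in group)
        for group in constraint
    )


# "All the products whose contributor contains DARIAH"
dariah = [[("contributor", "contains", "DARIAH")]]
record = {"title": ["A study"], "contributor": ["DARIAH-EU"]}
print(matches_constraint(record, dariah))
```

Keeping each Si as an inner list makes it straightforward to add an alternative group later without touching the existing ones.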

 ## Propagation

-This process “propagates” properties and links from one product to another if between the two there is a “strong” semantic relationship.
+This process enriches the graph by adding new links and/or new properties. The new information is derived by exploiting existing semantic 
+relationships between the involved entities and their property values.

-As of September 2020, the following procedures are in place:
-Propagation of the property “country” to results from institutional repositories: e.g. publication collected from an institutional repository maintained by an italian university will be enriched with the property “country = IT”.
+As of November 2022, the following procedures are in place:

-* Propagation of links to projects: e.g. publication linked to project P “is supplemented by” a dataset D. Dataset D will get the link to project P. The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
+* Country propagation: updates the property “country” of a result. This happens when the result is collected from an institutional data source or when the data source hosting the result is included in a whitelist. For all the results whose hosting data source satisfies one of the conditions above, the country of the organization providing the data source is added to the countries of the result: e.g. a publication collected from an institutional repository maintained by an Italian university will be enriched with the property “country = IT”.
+<p align="center">
+    <img loading="lazy" alt="Country Propagation" src="/img/docs/enrichment/propagation_country.png" width="70%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
+</p>

-* Propagation of related community/infrastructure/initiative from organizations to products via affiliation relationships: e.g. a publication with an author affiliated with organization O. The manager of the community gateway C declared that the outputs of O are all relevant for his/her community C. The publication is tagged as relevant for C.
+* Project propagation: adds an "isProducedBy" relationship (and its inverse) between a project P and a result R, if R has a strong semantic relationship with another result R1 and R1 is linked to P: e.g. a publication linked to project P “is supplemented by” a dataset D. Dataset D will get the link to project P. The relationships considered for this procedure are “isSupplementedBy” and “isSupplementTo”.
+<p align="center">
+    <img loading="lazy" alt="Project Propagation" src="/img/docs/enrichment/propagation_resulttoproject.png" width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
+</p>
+* Result to RC/RI through organization propagation: the manager of the RC/RI can specify a set of organizations whose products are relevant for the 
+community. This kind of propagation exploits the hasAuthorInstitution relation between results and organizations. 
+Each result having such a relation with at least one organization relevant for the RC/RI will be linked to the RC/RI.
+<p align="center">
+    <img loading="lazy" alt="Result to community through organization propagation" src="/img/docs/enrichment/propagation_resulttocommunitythroughorganization.png" width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
+</p>

-* Propagation of related community/infrastructure/initiative to related products: e.g. publication associated to community C is supplemented by a dataset D. Dataset D will get the association to C. The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
-
-* Propagation of ORCID identifiers to related products, if the products have the same authors: e.g. publication has ORCID for its authors and is supplemented by a dataset D. Dataset D has the same authors as the publication. Authors of D are enriched with the ORCIDs available in the publication. The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
+* Result to RC/RI through semantic relation: e.g. a publication associated with community C is supplemented by a dataset D. Dataset D will get the association to C. The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
+<p align="center">
+    <img loading="lazy" alt="Result to community through semantic relation propagation" src="/img/docs/enrichment/propagation_resulttocommunitythroughsemrel.png" width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
+</p>
+* ORCID identifiers to result through semantic relation: propagates ORCID identifiers to related products, if the products have the same authors: e.g. a publication has ORCIDs for its authors and is supplemented by a dataset D. Dataset D has the same authors as the publication. Authors of D are enriched with the ORCIDs available in the publication. The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
+<p align="center">
+    <img loading="lazy" alt="ORCID propagation" src="/img/docs/enrichment/propagation_orcid.png" width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
+</p>
+* Affiliation to organization through institutional repository
+<p align="center">
+    <img loading="lazy" alt="Affiliation to organization through institutional repository" src="/img/docs/enrichment/propagation_affiliationistrepo.png" width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
+</p>
+* Affiliation to organization through semantic relation
+<p align="center">
+    <img loading="lazy" alt="Affiliation to organization through semantic relation" src="/img/docs/enrichment/propagation_organizationsemrel.png" width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
+</p>
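
The project propagation rule described in the list above can be read as a single-hop join between result-to-project links and semantic relations among results. The sketch below illustrates that reading in Python under stated assumptions: `propagate_projects` and its input shapes are hypothetical, only the relation names quoted in the text are used, and only the "isProducedBy" direction is produced (the inverse relationship is omitted).

```python
from typing import Dict, Iterable, Set, Tuple

# Relations named in the text as "strong" for project propagation.
STRONG_RELATIONS = {"isSupplementedBy", "isSupplementTo"}


def propagate_projects(
    result_projects: Dict[str, Set[str]],
    semantic_links: Iterable[Tuple[str, str, str]],
) -> Set[Tuple[str, str, str]]:
    """Return the new (result, "isProducedBy", project) links.

    result_projects maps a result id to the projects it is already linked to.
    semantic_links holds (source result, relation, target result) triples.
    Only the original links are read, so a propagated link is never
    propagated further (single hop).
    """
    new_links: Set[Tuple[str, str, str]] = set()
    for source, relation, target in semantic_links:
        if relation not in STRONG_RELATIONS:
            continue
        already_linked = result_projects.get(target, set())
        for project in result_projects.get(source, set()):
            if project not in already_linked:
                new_links.add((target, "isProducedBy", project))
    return new_links


# Publication pub1 is linked to project P and "is supplemented by" dataset D.
links = propagate_projects(
    {"pub1": {"P"}},
    [("pub1", "isSupplementedBy", "D"), ("pub1", "cites", "pub2")],
)
print(links)
```

Returning only the new links, instead of mutating the input, keeps the original graph separate from the enrichment, so the added relationships can be inspected or tagged with provenance before being merged.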
--- a/docs/data-provision/enrichment/img.png
+++ b/docs/data-provision/enrichment/img.png
--- a/docs/data-provision/enrichment/mining.md
+++ b/docs/data-provision/enrichment/mining.md
@ -3,19 +3,4 @@ sidebar_position: 1
 ---

 # Mining algorithms
-
-The Text and Data Mining (TDM) algorithms used for enriching the OpenAIRE Graph are grouped in the following main categories:
-
-[Extraction of acknowledged concepts](acks.md)
-
-[Extraction of cited concepts](cites.md)
-
-[Document Classification](classified.md)
-
 <span className="todo">TODO</span>
-
-
-
-
-
-
--- a/docs/data-provision/indexing.md
+++ b/docs/data-provision/indexing.md
@ -6,16 +6,8 @@ sidebar_position: 5

 The final version of the OpenAIRE Research Graph is indexed on a Solr server that is used by the OpenAIRE portals (EXPLORE, CONNECT, PROVIDE) and APIs, the latter adopted by several third-party applications and organizations, such as:

-* The OpenAIRE Research Graph APIs and Portals will offer to the EOSC (European Open Science Cloud) an Open Science Resource Catalogue, keeping an up to date map of all research results (publications, datasets, software), services, organizations, projects, funders in Europe and beyond.
+* EOSC: the OpenAIRE Research Graph APIs and Portals will offer the EOSC an Open Science Resource Catalogue, keeping an up-to-date map of all research results (publications, datasets, software), services, organizations, projects, funders in Europe and beyond.

-* DSpace & EPrints repositories can install the OpenAIRE plugin to expose OpenAIRE compliant metadata records via their OAI-PMH endpoint and offer to researchers the possibility to link their depositions to the funding project, by selecting it from the list of project provided by OpenAIRE.
+* DSpace & EPrints repositories can install the OpenAIRE plugin to expose OpenAIRE compliant metadata records via their OAI-PMH endpoint and offer researchers the possibility to link their depositions to the funding project, by selecting it from the list of projects provided by OpenAIRE.

-* EC participant portal (Sygma - System for Grant Management) uses the OpenAIRE API in the “Continuous Reporting” section. Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in the OpenAIRE Research Graph that are linked to the project. The user can select the research products from the list and easily compile the continuous reporting data of the project.
-
-* ScholExplorer is used by different players of the scholarly communication ecosystem. For example, [Elsevier](https://www.elsevier.com/authors/tools-and-resources/research-data/data-base-linking) uses its API to make the links between 
-publications and datasets automatically appear on ScienceDirect.
-ScholExplorer indexes the links among the four major types of research products (API v3) available in the OpenAIRE Research Graph and makes them available through an HTTP API that allows 
-to search them by the following criteria:
-  * Links whose source object has a given PID or PID type;
-  * Links whose source object has been published by a given data source ("data source as publisher");
-  * Links that were collected from a given data source ("data source as provider").
+* EC participant portal (Sygma - System for Grant Management) uses the OpenAIRE API in the “Continuous Reporting” section. Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in the OpenAIRE Research Graph that are linked to the project. The user can select the research products from the list and easily compile the continuous reporting data of the project.
--- a/static/img/docs/decisiontree-dataset-orp.png
+++ b/static/img/docs/decisiontree-dataset-orp.png
--- a/static/img/docs/decisiontree-organization.png
+++ b/static/img/docs/decisiontree-organization.png
--- a/static/img/docs/decisiontree-publication.png
+++ b/static/img/docs/decisiontree-publication.png
--- a/static/img/docs/decisiontree-software.png
+++ b/static/img/docs/decisiontree-software.png
--- a/static/img/docs/enrichment/bulktagging_datasource.png
+++ b/static/img/docs/enrichment/bulktagging_datasource.png
--- a/static/img/docs/enrichment/bulktagging_selconstraints.png
+++ b/static/img/docs/enrichment/bulktagging_selconstraints.png
--- a/static/img/docs/enrichment/bulktagging_subject.png
+++ b/static/img/docs/enrichment/bulktagging_subject.png
--- a/static/img/docs/enrichment/bulktagging_zenodo.png
+++ b/static/img/docs/enrichment/bulktagging_zenodo.png
--- a/static/img/docs/enrichment/propagation_affiliationistrepo.png
+++ b/static/img/docs/enrichment/propagation_affiliationistrepo.png
--- a/static/img/docs/enrichment/propagation_country.png
+++ b/static/img/docs/enrichment/propagation_country.png
--- a/static/img/docs/enrichment/propagation_orcid.png
+++ b/static/img/docs/enrichment/propagation_orcid.png
--- a/static/img/docs/enrichment/propagation_organizationsemrel.png
+++ b/static/img/docs/enrichment/propagation_organizationsemrel.png
--- a/static/img/docs/enrichment/propagation_resulttocommunitythroughorganization.png
+++ b/static/img/docs/enrichment/propagation_resulttocommunitythroughorganization.png
--- a/static/img/docs/enrichment/propagation_resulttocommunitythroughsemrel.png
+++ b/static/img/docs/enrichment/propagation_resulttocommunitythroughsemrel.png
--- a/static/img/docs/enrichment/propagation_resulttoproject.png
+++ b/static/img/docs/enrichment/propagation_resulttoproject.png