Add formatting to impact indicators page

2022-11-11 18:07:24 +02:00
15 changed files with 122 additions and 120 deletions
--- a/docs/data-model/data-model.md
+++ b/docs/data-model/data-model.md
@ -1,6 +1,6 @@
 # Data model

-The OpenAIRE Research Graph comprises several types of [entities](../category/entities) and [relationships](./relationships) among them.
+The OpenAIRE Research Graph comprises several types of entities and [relationships](./relationships) among them.

 The latest version of the JSON schema can be found on [Bulk downloads](../download).

--- a/docs/data-model/entities/other.md
+++ b/docs/data-model/entities/other.md
@ -646,7 +646,12 @@ A measure computed for this instance (e.g. those provided by [BIP! Finder](https
 ### key
 _Type: String &bull; Cardinality: ONE_

-The specified measure. Currently supported one of: `{ influence, influence_alt, popularity, popularity_alt, impulse, cc }` (see [the dedicated page](../../data-provision/enrichment/impact-scores) for more details).
+The specified measure. Currently, one of the following is supported:
+* `influence` (see [PageRank](/data-provision/enrichment/impact-scores#pagerank-pr))
+* `influence_alt` (see [Citation Count](/data-provision/enrichment/impact-scores#citation-count-cc))
+* `popularity` (see [AttRank](/data-provision/enrichment/impact-scores#attrank))
+* `popularity_alt` (see [RAM](/data-provision/enrichment/impact-scores#ram))
+* `impulse` (see ["Incubation" Citation Count](/data-provision/enrichment/impact-scores#incubation-citation-count-icc))

 ```json
 "key": "influence"
--- a/docs/data-model/entities/result.md
+++ b/docs/data-model/entities/result.md
@ -311,7 +311,7 @@ _Type: [Subject](other#subject) &bull; Cardinality: MANY_
 Subject, keyword, classification code, or key phrase describing the resource.

 ```json
-"subjects": [
+"subjects": [
    {
        "provenance": {
            "provenance": "Harvested",
--- a/docs/data-provision/deduplication/organizations.md
+++ b/docs/data-provision/deduplication/organizations.md
@ -46,8 +46,6 @@ The comparison goes through the following decision tree:
    <img loading="lazy" alt="Organization Decision Tree" src="/img/docs/decisiontree-organization.png" width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
 </p>

-[//]: # (Link to the image: https://docs.google.com/drawings/d/1YKInGGtHu09QG4pT2gRLEum4LxU82d4nKkvGNvRQmrg/edit?usp=sharing)
-
 ### Data Curation

 All the similarity relations drawn by the algorithm involving the decision tree are exposed in OpenOrgs, where are made available to the data curators to give feedbacks and to improve the organizations metadata.
--- a/docs/data-provision/deduplication/research-products.md
+++ b/docs/data-provision/deduplication/research-products.md
@ -37,8 +37,6 @@ The comparison goes through different stages:
    <img loading="lazy" alt="Publications Decision Tree" src="/img/docs/decisiontree-publication.png" width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
 </p>

-[//]: # (Link to the image: https://docs.google.com/drawings/d/19SIilTp1vukw6STMZuPMdc0pv0ODYCiOxP7OU3iPWK8/edit?usp=sharing)
-
 #### Software
 For each pair of software in a cluster the following strategy (depicted in the figure below) is applied.
 The comparison goes through different stages:
@ -50,8 +48,6 @@ The comparison goes through different stages:
    <img loading="lazy" alt="Software Decision Tree" src="/img/docs/decisiontree-software.png" width="85%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
 </p>

-[//]: # (Link to the image: https://docs.google.com/drawings/d/19gd1-GTOEEo6awMObGRkYFhpAlO_38mfbDFFX0HAkuo/edit?usp=sharing)
-
 #### Datasets and Other types of research products
 For each pair of datasets or other types of research products in a cluster the strategy depicted in the figure below is applied.
 The decision tree is almost identical to the publication decision tree, with the only exception of the *instance type check* stage. Since such type of record does not have a relatable instance type, the check is not performed and the decision tree node is skipped.
@ -60,8 +56,6 @@ The decision tree is almost identical to the publication decision tree, with the
    <img loading="lazy" alt="Dataset and Other types of research products Decision Tree" src="/img/docs/decisiontree-dataset-orp.png" width="90%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
 </p>

-[//]: # (Link to the image: https://docs.google.com/drawings/d/1uBa7Bw2KwBRDUYIfyRr_Keol7UOeyvMNN7MPXYLg4qw/edit?usp=sharing)
-
 ### Duplicates grouping (transitive closure)

 The general concept is that the field coming from the record with higher "trust" value is used as reference for the field of the representative record.
--- a/docs/data-provision/enrichment/acks.md
+++ b/docs/data-provision/enrichment/acks.md
@ -1,23 +0,0 @@
---
-sidebar_position: 3
---
-
-# Extraction of Acknowledged Concepts
-
-| Property  | Description |
-| --- | --- |
-| Short description  | Scans the plaintexts of publications for acknowledged concepts, including grant identifiers (projects) of funders, accession numbers of bioetities, EPO patent mentions, as well as custom concepts that can link research objects to specific research communities and initiatives in OpenAIRE. |
-| Authority  | ATHENA Research Center, Greece |
-| Licence  | CC-BY/CC-0  |
-| Algorithmic details | The algorithm processes the publication's fulltext and extracts references to acknowledged concepts. It applies pattern matching and string join between the fulltext and a target database which contains the title, the acronym and the identifier of the searched concept. |
-| Parameters | Concept titles, acronyms, and identifiers, publication's identifiers and fulltexts |
-| Limitations | N/A |
-| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
-| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
-| References & resources | [Foufoulas, Y., Zacharia, E., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2022). DETEXA: Declarative Extensible Text Exploration and Analysis. In: , et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham.](https://doi.org/10.1007/978-3-031-16802-4_9) |
-
-
-
-
-
-
--- a/docs/data-provision/enrichment/cites.md
+++ b/docs/data-provision/enrichment/cites.md
@ -1,23 +0,0 @@
---
-sidebar_position: 4
---
-
-# Extraction of Cited Concepts
-
-| Property  | Description |
-| --- | --- |
-| Short description  | Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs. |
-| Authority  | ATHENA Research Center, Greece  |
-| Licence  | CC-BY/CC-0  |
-| Algorithmic details | The algorithm extracts citations to specific datasets and software. It extracts the citation section of a publication's fulltext and applies string matching against a target database which includes an inverted index with dataset/software titles, urls and other metadata. |
-| Parameters | Title, URL, creator names, publisher names and publication year for each concept to create the target database. Identifier and publication's fulltext to extract the cited concepts. |
-| Limitations | N/A |
-| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
-| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
-| References & resources | [Foufoulas Y., Stamatogiannakis L., Dimitropoulos H., Ioannidis Y. (2017) “High-Pass Text Filtering for Citation Matching”. In: Kamps J., Tsakonas G., Manolopoulos Y., Iliadis L., Karydis I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham.](https://doi.org/10.1007/978-3-319-67008-9_28) |
-
-
-
-
-
-
--- a/docs/data-provision/enrichment/classifies.md
+++ b/docs/data-provision/enrichment/classifies.md
@ -1,23 +0,0 @@
---
-sidebar_position: 5
---
-
-# Classifiers
-
-| Property  | Description |
-| --- | --- |
-| Short description  | A document classification algorithm that employs analysis of free text stemming from the abstracts of the publications. The purpose of applying a document classification module is to assign a scientific text to one or more predefined content classes. |
-| Authority  | ATHENA Research Center, Greece  |
-| Licence  | CC-BY/CC-0  |
-| Algorithmic details | The algorithm classifies publication's fulltexts using a Bayesian classifier and weighted terms according to an offline training phase. The training has been done using the following taxonomies: arXiv, MeSH (Medical Subject Headings), ACM, and DDC (Dewey Decimal Classification, or Dewey Decimal System).  |
-| Parameters | Publication's identifier and fulltext |
-| Limitations | N/A |
-| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
-| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
-|  References & resources | [Giannakopoulos, T., Stamatogiannakis, E., Foufoulas, I., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2014). Content Visualization of Scientific Corpora Using an Extensible Relational Database Implementation. In: Bolikowski, Ł., Casarosa, V., Goodale, P., Houssos, N., Manghi, P., Schirrwagen, J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham.](https://doi.org/10.1007/978-3-319-08425-1_10) |
-
-
-
-
-
-
--- a/docs/data-provision/enrichment/impact-scores.md
+++ b/docs/data-provision/enrichment/impact-scores.md
@ -2,30 +2,71 @@
 sidebar_position: 2
 ---

-# Impact scores
-<span className="todo">TODO - add intro</span>
+# Impact indicators
+
+This page summarises all calculated impact indicators, explaining the main intuition behind each one, how it is calculated, and its most important limitations, to help avoid common pitfalls and misuses.

 ## Citation Count (CC)

-This is the most widely used scientific impact indicator, which sums all citations received by each article. The citation count of a 
-publication $i$ corresponds to the in-degree of the corresponding node in the underlying citation network: $s_i = \sum_{j} A_{i,j}$, 
-where $A$ is the adjacency matrix of the network (i.e., $A_{i,j}=1$ when paper $j$ cites paper $i$, while $A_{i,j}=0$ otherwise). 
+***Short description:***
+This is the most widely used scientific impact indicator, which sums all citations received by each article.
 Citation count can be viewed as a measure of a publication's overall impact, since it conveys the number of other works that directly 
 drew on it.

+***Solution:***
+The citation count of a 
+publication $i$ corresponds to the in-degree of the corresponding node in the underlying citation network: $s_i = \sum_{j} A_{i,j}$, 
+where $A$ is the adjacency matrix of the network (i.e., $A_{i,j}=1$ when paper $j$ cites paper $i$, while $A_{i,j}=0$ otherwise). 
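
As a rough illustration of the in-degree definition above, here is a minimal plain-Python sketch. It is not the BIP! Ranker implementation (which runs on PySpark); the function name `citation_counts` and the `(citing, cited)` edge-list input are assumptions made for this example.

```python
from collections import Counter


def citation_counts(citations):
    """Return s_i = sum_j A[i, j] for every cited paper i.

    `citations` is an iterable of (citing, cited) pairs (hypothetical
    input format), so each pair corresponds to one entry A[cited, citing] = 1.
    """
    # De-duplicate pairs so that each A[i, j] contributes at most 1.
    unique_pairs = set(citations)
    return Counter(cited for _citing, cited in unique_pairs)


# p2 and p3 cite p1; p3 cites p2 (listed twice on purpose).
edges = [("p2", "p1"), ("p3", "p1"), ("p3", "p2"), ("p3", "p2")]
cc = citation_counts(edges)
```

A `Counter` returns 0 for papers that were never cited, which matches an empty row of $A$.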
+
+***Parameters:*** -
+
+***Limitations:***
+OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator.
+Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
+
+***Environment:*** PySpark
+
+***References:*** -
+
+***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
+
+
 ## "Incubation" Citation Count (iCC)

+***Short description:***
 This measure is essentially a time-restricted version of the citation count, where the time window is distinct for each paper, i.e., 
-only citations $y$ years after its publication are counted (usually, $y=3$). The "incubation" citation count of a paper $i$ is 
-calculated as: $s_i = \sum_{j,t_j \leq t_i+3} A_{i,j}$, where $A$ is the adjacency matrix and $t_j, t_i$ are the citing and cited paper's 
+only citations received within $y$ years of its publication are counted.
+
+***Solution:***
+The "incubation" citation count of a paper $i$ is 
+calculated as: $s_i = \sum_{j,t_j \leq t_i+y} A_{i,j}$, where $A$ is the adjacency matrix and $t_j, t_i$ are the citing and cited paper's 
 publication years, respectively. iCC can be seen as an indicator of a paper's initial momentum 
 (impulse) directly after its publication.
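
The time-window condition $t_j \leq t_i + y$ can be sketched in plain Python as follows. This is only an illustration, not the PySpark code of BIP! Ranker; the function name, the `(citing, cited)` edge list and the `pub_year` dictionary are assumptions for this example. Papers without a known publication year are skipped, in line with the limitation stated for these indicators.

```python
def incubation_citation_counts(citations, pub_year, y=3):
    """Count citations made at most `y` years after the cited paper's publication.

    `citations`: iterable of (citing, cited) pairs (hypothetical input format).
    `pub_year`: dict mapping a paper id to its publication year.
    """
    counts = {paper: 0 for paper in pub_year}
    for citing, cited in set(citations):
        # Only papers with a known publication year are considered.
        if citing not in pub_year or cited not in pub_year:
            continue
        # Condition t_j <= t_i + y from the formula above.
        if pub_year[citing] <= pub_year[cited] + y:
            counts[cited] += 1
    return counts


years = {"p1": 2010, "p2": 2012, "p3": 2015}
edges = [("p2", "p1"), ("p3", "p1"), ("p3", "p2")]
icc = incubation_citation_counts(edges, years)
```

Here the citation from `p3` to `p1` falls outside the three-year window (2015 > 2013), so it is not counted.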

-## PageRank (PR)
+***Parameters:*** 
+$y=3$

+***Limitations:***
+OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator.
+Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
+
+***Environment:*** PySpark
+
+***References:*** 
+* Vergoulis, T., Kanellos, I., Atzori, C., Mannocci, A., Chatzopoulos, S., Bruzzo, S. L., Manola, N., & Manghi, P. (2021, April). BIP! DB: A dataset of impact measures for scientific publications. In Companion Proceedings of the Web Conference 2021 (pp. 456-460).
+
+***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
+
+
+ ## PageRank (PR)
+
+***Short description:***
 Originally developed to rank Web pages, PageRank has been also widely used to rank publications in citation
 networks. In this latter context, a publication's PageRank 
-score also serves as a measure of its influence. In particular, the PageRank score of a publication is calculated 
+score also serves as a measure of its influence.
+
+***Solution:***
+The PageRank score of a publication is calculated 
 as its probability of being read by a researcher that either randomly selects publications to read or selects 
 publications based on the references of her latest read. Formally, the score of a publication $i$ is given by: 

@ -41,12 +82,31 @@ score of each publication relies of the score of publications citing it (the alg
 until all scores converge). As a result, PageRank differentiates citations based on the importance of citing 
 articles, thus alleviating the corresponding issue of the Citation Count.
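
The iterative computation described above can be sketched as a simple power iteration. The formula itself is not visible in this hunk, so the sketch assumes the standard PageRank formulation, $s_i = \alpha \sum_j P_{i,j} s_j + (1-\alpha)/N$. The function name, the edge-list input, the uniform redistribution of score from papers without references, and the `max_iter` safeguard are all assumptions of this example and are not taken from BIP! Ranker.

```python
def pagerank(citations, alpha=0.5, convergence_error=1e-12, max_iter=1000):
    """Power-iteration PageRank over (citing, cited) pairs (hypothetical input)."""
    edges = set(citations)
    nodes = sorted({paper for edge in edges for paper in edge})
    n = len(nodes)
    # Reference list of every paper (outgoing edges of the citation network).
    refs = {paper: [] for paper in nodes}
    for citing, cited in edges:
        refs[citing].append(cited)

    scores = {paper: 1.0 / n for paper in nodes}
    for _ in range(max_iter):
        # Assumption: score of papers with no references is spread uniformly.
        dangling = sum(scores[p] for p in nodes if not refs[p])
        new = {p: (1 - alpha) / n + alpha * dangling / n for p in nodes}
        for citing, cited_list in refs.items():
            for cited in cited_list:
                # Each paper passes an equal share of its score to its references.
                new[cited] += alpha * scores[citing] / len(cited_list)
        error = sum(abs(new[p] - scores[p]) for p in nodes)
        scores = new
        if error < convergence_error:
            break
    return scores


edges = [("p2", "p1"), ("p3", "p1"), ("p3", "p2")]
pr = pagerank(edges)
```

With these edges, `p1` is cited by both other papers and ends up with the highest score, while the uncited `p3` keeps only the random-selection share.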

+***Parameters:*** 
+$\alpha = 0.5, convergence\_error = 10^{-12}$
+
+***Limitations:***
+OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator.
+Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
+
+***Environment:*** PySpark
+
+***References:*** 
+* Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab.
+
+***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
+ 
+
 ## RAM

-RAM is essentially a modified Citation Count, where recent citations are considered of higher importance compared 
-to older ones. Hence, it better captures the popularity of publications. This "time-awareness" of citations 
+***Short description:***
+RAM is essentially a modified Citation Count, where recent citations are considered of higher importance compared to older ones.
+Hence, it better captures the popularity of publications. This "time-awareness" of citations 
 alleviates the bias of methods like Citation Count and PageRank against recently published articles, which have 
-not had "enough" time to gather as many citations. The RAM score of each paper $i$ is calculated as follows:
+not had "enough" time to gather as many citations.
+
+***Solution:***
+The RAM score of each paper $i$ is calculated as follows:

 $$
 s_i = \sum_j{R_{i,j}}
@ -56,11 +116,30 @@ where $R$ is the so-called Retained Adjacency Matrix (RAM) and $R_{i,j}=\gamma^{
 $i$, and $R_{i,j}=0$ otherwise. Parameter $\gamma \in (0,1)$, $t_c$ corresponds to the current year and $t_j$ corresponds to the 
 publication year of citing article $j$.
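
A minimal sketch of the RAM score follows. The exponent of $\gamma$ is cut off in this view of the diff, so the sketch assumes $R_{i,j} = \gamma^{t_c - t_j}$, which is consistent with the definitions of $t_c$ and $t_j$ above; the function name and the input formats are likewise assumptions of this example, not the BIP! Ranker code.

```python
def ram_scores(citations, pub_year, current_year, gamma=0.6):
    """Sum age-discounted citations: s_i = sum_j R[i, j].

    Assumes R[i, j] = gamma ** (current_year - pub_year[j]) when j cites i.
    `citations` is an iterable of (citing, cited) pairs (hypothetical format).
    """
    scores = {paper: 0.0 for paper in pub_year}
    for citing, cited in set(citations):
        # Only papers with a known publication year are considered.
        if citing not in pub_year or cited not in pub_year:
            continue
        # Older citing papers contribute less, since 0 < gamma < 1.
        scores[cited] += gamma ** (current_year - pub_year[citing])
    return scores


years = {"p1": 2018, "p2": 2020, "p3": 2022}
edges = [("p2", "p1"), ("p3", "p1"), ("p3", "p2")]
ram = ram_scores(edges, years, current_year=2022)
```

In this example the citation made in the current year counts fully (weight 1), while the one made two years earlier counts with weight $0.6^2 = 0.36$.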

+***Parameters:*** 
+$\gamma = 0.6$
+
+***Limitations:***
+OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator.
+Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
+
+***Environment:*** PySpark
+
+***References:*** 
+* Ghosh, R., Kuo, T. T., Hsu, C. N., Lin, S. D., & Lerman, K. (2011, December). Time-aware ranking in dynamic citation networks. In 2011 IEEE 11th International Conference on Data Mining Workshops (pp. 373-380). IEEE.
+
+***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
+
+
 ## AttRank

+***Short description:***
 AttRank is a PageRank variant that alleviates its bias against recent publications (i.e., it is tailored to capture popularity). 
 AttRank achieves this by modifying PageRank's probability of randomly selecting a publication. Instead of using a uniform probability,
-AttRank defines it based on a combination of the publication's age and the citations it received in recent years. The AttRank score 
+AttRank defines it based on a combination of the publication's age and the citations it received in recent years.
+
+***Solution:***
+The AttRank score 
 of each publication $i$ is calculated based on:

 $$
@ -70,4 +149,22 @@ $$

 where $\alpha + \beta + \gamma =1$ and $\alpha,\beta,\gamma \in [0,1]$. $Att(i)$ denotes a recent attention-based score for publication $i$, 
 which reflects its share of citations in the $y$ most recent years, $t_i$ is the publication year of article $i$, $t_c$ denotes the current 
-year, and $c$ is a normalisation constant. Finally, $P$ is the stochastic transition matrix.
+year, and $c$ is a normalisation constant. Finally, $P$ is the stochastic transition matrix.
+
+***Parameters:*** 
+$\alpha = 0.2, \beta = 0.5, \gamma = 0.3, \rho = -0.16, convergence\_error = 10^{-12}$
+
+Note that recent attention is based on the 3 most recent years (including current one).
+
+***Limitations:***
+OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator.
+Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
+
+***Environment:*** PySpark
+
+***References:*** 
+* Kanellos, I., Vergoulis, T., Sacharidis, D., Dalamagas, T., & Vassiliou, Y. (2021, April). Ranking papers by their short-term scientific impact. In 2021 IEEE 37th International Conference on Data Engineering (ICDE) (pp. 1997-2002). IEEE.
+
+***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
+
+ 
--- a/docs/data-provision/enrichment/mining.md
+++ b/docs/data-provision/enrichment/mining.md
@ -3,19 +3,4 @@ sidebar_position: 1
 ---

 # Mining algorithms
-
-The Text and Data Mining (TDM) algorithms used for enriching the OpenAIRE Graph are grouped in the following main categories:
-
-[Extraction of acknowledged concepts](acks.md)
-
-[Extraction of cited concepts](cites.md)
-
-[Document Classification](classified.md)
-
 <span className="todo">TODO</span>
-
-
-
-
-
-
--- a/docs/data-provision/indexing.md
+++ b/docs/data-provision/indexing.md
@ -6,16 +6,8 @@ sidebar_position: 5

 The final version of the OpenAIRE Research Graph is indexed on a Solr server that is used by the OpenAIRE portals (EXPLORE, CONNECT, PROVIDE) and APIs, the latter adopted by several third-party applications and organizations, such as:

-* The OpenAIRE Research Graph APIs and Portals will offer to the EOSC (European Open Science Cloud) an Open Science Resource Catalogue, keeping an up to date map of all research results (publications, datasets, software), services, organizations, projects, funders in Europe and beyond.
+* EOSC: the OpenAIRE Research Graph APIs and Portals will offer the EOSC an Open Science Resource Catalogue, keeping an up-to-date map of all research results (publications, datasets, software), services, organizations, projects, and funders in Europe and beyond.

-* DSpace & EPrints repositories can install the OpenAIRE plugin to expose OpenAIRE compliant metadata records via their OAI-PMH endpoint and offer to researchers the possibility to link their depositions to the funding project, by selecting it from the list of project provided by OpenAIRE.
+* DSpace & EPrints repositories can install the OpenAIRE plugin to expose OpenAIRE-compliant metadata records via their OAI-PMH endpoint and offer researchers the possibility to link their depositions to the funding project by selecting it from the list of projects provided by OpenAIRE.

-* EC participant portal (Sygma - System for Grant Management) uses the OpenAIRE API in the “Continuous Reporting” section. Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in the OpenAIRE Research Graph that are linked to the project. The user can select the research products from the list and easily compile the continuous reporting data of the project.
-
-* ScholExplorer is used by different players of the scholarly communication ecosystem. For example, [Elsevier](https://www.elsevier.com/authors/tools-and-resources/research-data/data-base-linking) uses its API to make the links between 
-publications and datasets automatically appear on ScienceDirect.
-ScholExplorer indexes the links among the four major types of research products (API v3) available in the OpenAIRE Research Graph and makes them available through an HTTP API that allows 
-to search them by the following criteria:
-  * Links whose source object has a given PID or PID type;
-  * Links whose source object has been published by a given data source ("data source as publisher");
-  * Links that were collected from a given data source ("data source as provider").
+* EC participant portal (Sygma - System for Grant Management) uses the OpenAIRE API in the “Continuous Reporting” section. Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in the OpenAIRE Research Graph that are linked to the project. The user can select the research products from the list and easily compile the continuous reporting data of the project.
--- a/static/img/docs/decisiontree-dataset-orp.png
+++ b/static/img/docs/decisiontree-dataset-orp.png
--- a/static/img/docs/decisiontree-organization.png
+++ b/static/img/docs/decisiontree-organization.png
--- a/static/img/docs/decisiontree-publication.png
+++ b/static/img/docs/decisiontree-publication.png
--- a/static/img/docs/decisiontree-software.png
+++ b/static/img/docs/decisiontree-software.png