Merge pull request 'Add formating to impact indicators page' (#9) from impact_indicators into main

Reviewed-on: D-Net/openaire-graph-docs#9
1 year ago · 96912ea7ec
parent 3a7578fe16 7d9c7b214c
commit 96912ea7ec
2 changed files with 120 additions and 15 deletions
--- a/docs/data-model/entities/other.md
+++ b/docs/data-model/entities/other.md
@ -646,7 +646,12 @@ A measure computed for this instance (e.g. those provided by [BIP! Finder](https
 ### key
 _Type: String &bull; Cardinality: ONE_

-The specified measure. Currently supported one of: `{ influence, influence_alt, popularity, popularity_alt, impulse, cc }` (see [the dedicated page](../../data-provision/enrichment/impact-scores) for more details).
+The specified measure. Currently, one of the following is supported:
+* `influence` (see [PageRank](/data-provision/enrichment/impact-scores#pagerank-pr))
+* `influence_alt` (see [Citation Count](/data-provision/enrichment/impact-scores#citation-count-cc))
+* `popularity` (see [AttRank](/data-provision/enrichment/impact-scores#attrank))
+* `popularity_alt` (see [RAM](/data-provision/enrichment/impact-scores#ram))
+* `impulse` (see ["Incubation" Citation Count](/data-provision/enrichment/impact-scores#incubation-citation-count-icc))

 ```json
 "key": "influence"
--- a/docs/data-provision/enrichment/impact-scores.md
+++ b/docs/data-provision/enrichment/impact-scores.md
@ -2,30 +2,74 @@
 sidebar_position: 2
 ---

-# Impact scores
-<span className="todo">TODO - add intro</span>
+# Impact indicators
+
+This page summarises all calculated impact indicators, which are included in the [measure](/data-model/entities/other#measure) property.
+Note that the impact indicators are calculated both at the level of the research output and at the level of distinct DOIs.
+Below we explain their main intuition, the way they are calculated, and their most important limitations, in an attempt to help avoid common pitfalls and misuses.
+

 ## Citation Count (CC)

-This is the most widely used scientific impact indicator, which sums all citations received by each article. The citation count of a 
-publication $i$ corresponds to the in-degree of the corresponding node in the underlying citation network: $s_i = \sum_{j} A_{i,j}$, 
-where $A$ is the adjacency matrix of the network (i.e., $A_{i,j}=1$ when paper $j$ cites paper $i$, while $A_{i,j}=0$ otherwise). 
+***Short description:***
+This is the most widely used scientific impact indicator, which sums all citations received by each article.
 Citation count can be viewed as a measure of a publication's overall impact, since it conveys the number of other works that directly 
 drew on it.

+***Algorithmic details:***
+The citation count of a 
+publication $i$ corresponds to the in-degree of the corresponding node in the underlying citation network: $s_i = \sum_{j} A_{i,j}$, 
+where $A$ is the adjacency matrix of the network (i.e., $A_{i,j}=1$ when paper $j$ cites paper $i$, while $A_{i,j}=0$ otherwise). 
+
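A minimal sketch of the in-degree computation described above, written in plain Python rather than the PySpark environment the page mentions. The function name `citation_counts` and the `(citing, cited)` edge-list input are assumptions made for illustration and are not taken from BIP! Ranker.

```python
from collections import Counter


def citation_counts(citations):
    """Return s_i = sum_j A[i][j] for every cited paper i.

    `citations` is an iterable of (citing, cited) pairs (hypothetical input
    format). Duplicate pairs are collapsed because A is a 0/1 matrix.
    """
    counts = Counter()
    for _citing, cited in set(citations):
        counts[cited] += 1
    return dict(counts)


# Example: b and c cite a, and c cites b (listed twice on purpose).
edges = [("b", "a"), ("c", "a"), ("c", "b"), ("c", "b")]
scores = citation_counts(edges)
```

Papers that are never cited do not appear in the result; a caller that needs explicit zeros would have to add them.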
+***Parameters:*** -
+
+***Limitations:***
+OpenAIRE collects data from specific data sources, which means that part of the existing literature may not be considered when computing this indicator.
+Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
+
+***Environment:*** PySpark
+
+***References:*** -
+
+***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
+
+
 ## "Incubation" Citation Count (iCC)

+***Short description:***
 This measure is essentially a time-restricted version of the citation count, where the time window is distinct for each paper, i.e., 
-only citations $y$ years after its publication are counted (usually, $y=3$). The "incubation" citation count of a paper $i$ is 
-calculated as: $s_i = \sum_{j,t_j \leq t_i+3} A_{i,j}$, where $A$ is the adjacency matrix and $t_j, t_i$ are the citing and cited paper's 
+only citations received within $y$ years of its publication are counted.
+
+***Algorithmic details:***
+The "incubation" citation count of a paper $i$ is 
+calculated as: $s_i = \sum_{j,t_j \leq t_i+y} A_{i,j}$, where $A$ is the adjacency matrix and $t_j, t_i$ are the citing and cited paper's 
 publication years, respectively. $t_i$ is cited paper $i$'s publication year. iCC can be seen as an indicator of a paper's initial momentum 
 (impulse) directly after its publication.
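
A minimal sketch of the time-restricted sum above, in plain Python. The function name, the edge-list input, and the `pub_year` dictionary are assumptions made for illustration; pairs whose publication year is unknown are skipped, in line with the limitation that only research products with a known publication year are considered.

```python
def incubation_citation_counts(citations, pub_year, y=3):
    """Return s_i = sum over j with t_j <= t_i + y of A[i][j].

    `citations` is an iterable of (citing, cited) pairs and `pub_year`
    maps each paper to its publication year (both hypothetical formats).
    """
    counts = {paper: 0 for paper in pub_year}
    for citing, cited in set(citations):
        if citing not in pub_year or cited not in pub_year:
            continue  # publication year unknown, so the pair is ignored
        if pub_year[citing] <= pub_year[cited] + y:
            counts[cited] += 1
    return counts


years = {"a": 2010, "b": 2012, "c": 2015}
edges = [("b", "a"), ("c", "a"), ("c", "b")]
icc = incubation_citation_counts(edges, years)
```

Here the citation from `c` to `a` falls outside the three-year window (2015 > 2010 + 3), so `a` keeps only the citation from `b`.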

-## PageRank (PR)
+***Parameters:*** 
+$y=3$
+
+***Limitations:***
+OpenAIRE collects data from specific data sources, which means that part of the existing literature may not be considered when computing this indicator.
+Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
+
+***Environment:*** PySpark

+***References:*** 
+* Vergoulis, T., Kanellos, I., Atzori, C., Mannocci, A., Chatzopoulos, S., Bruzzo, S. L., Manola, N., & Manghi, P. (2021, April). BIP! DB: A dataset of impact measures for scientific publications. In Companion Proceedings of the Web Conference 2021 (pp. 456-460).
+
+***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
+
+
+## PageRank (PR)
+
+***Short description:***
 Originally developed to rank Web pages, PageRank has been also widely used to rank publications in citation
 networks. In this latter context, a publication's PageRank 
-score also serves as a measure of its influence. In particular, the PageRank score of a publication is calculated 
+score also serves as a measure of its influence.
+
+***Algorithmic details:***
+The PageRank score of a publication is calculated 
 as its probability of being read by a researcher that either randomly selects publications to read or selects 
 publications based on the references of her latest read. Formally, the score of a publication $i$ is given by: 

@ -41,12 +85,31 @@ score of each publication relies of the score of publications citing it (the alg
 until all scores converge). As a result, PageRank differentiates citations based on the importance of citing 
 articles, thus alleviating the corresponding issue of the Citation Count.

+***Parameters:*** 
+$\alpha = 0.5, convergence\_error = 10^{-12}$
+
+***Limitations:***
+OpenAIRE collects data from specific data sources, which means that part of the existing literature may not be considered when computing this indicator.
+Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
+
+***Environment:*** PySpark
+
+***References:*** 
+* Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab.
+
+***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
+ 
+
 ## RAM

-RAM is essentially a modified Citation Count, where recent citations are considered of higher importance compared 
-to older ones. Hence, it better captures the popularity of publications. This "time-awareness" of citations 
+***Short description:***
+RAM is essentially a modified Citation Count, where recent citations are considered of higher importance compared to older ones.
+Hence, it better captures the popularity of publications. This "time-awareness" of citations 
 alleviates the bias of methods like Citation Count and PageRank against recently published articles, which have 
-not had "enough" time to gather as many citations. The RAM score of each paper $i$ is calculated as follows:
+not had "enough" time to gather as many citations.
+
+***Algorithmic details:***
+The RAM score of each paper $i$ is calculated as follows:

 $$
 s_i = \sum_j{R_{i,j}}
@ -56,11 +119,30 @@ where $R$ is the so-called Retained Adjacency Matrix (RAM) and $R_{i,j}=\gamma^{
 $i$, and $R_{i,j}=0$ otherwise. Parameter $\gamma \in (0,1)$, $t_c$ corresponds to the current year and $t_j$ corresponds to the 
 publication year of citing article $j$.
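
A minimal sketch of the RAM sum in plain Python. The exponent of $\gamma$ is cut off in the hunk header above, so the code assumes $R_{i,j}=\gamma^{t_c-t_j}$, following the definitions of $t_c$ and $t_j$ given here; the function name and input formats are likewise assumptions made for illustration.

```python
def ram_scores(citations, pub_year, current_year, gamma=0.6):
    """Return s_i = sum_j R[i][j], assuming R[i][j] = gamma ** (t_c - t_j).

    `citations` is an iterable of (citing, cited) pairs and `pub_year`
    maps each paper to its publication year (both hypothetical formats).
    """
    scores = {}
    for citing, cited in set(citations):
        if citing not in pub_year:
            continue  # citing year unknown, so the citation cannot be weighted
        weight = gamma ** (current_year - pub_year[citing])
        scores[cited] = scores.get(cited, 0.0) + weight
    return scores


years = {"a": 2010, "b": 2012, "c": 2015}
edges = [("b", "a"), ("c", "a"), ("c", "b")]
ram = ram_scores(edges, years, current_year=2015)
```

With $\gamma = 0.6$, the three-year-old citation from `b` contributes $0.6^3 = 0.216$, while the citation from `c`, made in the current year, contributes $1$.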

+***Parameters:*** 
+$\gamma = 0.6$
+
+***Limitations:***
+OpenAIRE collects data from specific data sources, which means that part of the existing literature may not be considered when computing this indicator.
+Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
+
+***Environment:*** PySpark
+
+***References:*** 
+* Ghosh, R., Kuo, T. T., Hsu, C. N., Lin, S. D., & Lerman, K. (2011, December). Time-aware ranking in dynamic citation networks. In 2011 IEEE 11th International Conference on Data Mining Workshops (pp. 373-380). IEEE.
+
+***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
+
+
 ## AttRank

+***Short description:***
 AttRank is a PageRank variant that alleviates its bias against recent publications (i.e., it is tailored to capture popularity). 
 AttRank achieves this by modifying PageRank's probability of randomly selecting a publication. Instead of using a uniform probability,
-AttRank defines it based on a combination of the publication's age and the citations it received in recent years. The AttRank score 
+AttRank defines it based on a combination of the publication's age and the citations it received in recent years.
+
+***Algorithmic details:***
+The AttRank score 
 of each publication $i$ is calculated based on:

 $$
@ -70,4 +152,22 @@ $$

 where $\alpha + \beta + \gamma =1$ and $\alpha,\beta,\gamma \in [0,1]$. $Att(i)$ denotes a recent attention-based score for publication $i$, 
 which reflects its share of citations in the $y$ most recent years, $t_i$ is the publication year of article $i$, $t_c$ denotes the current 
-year, and $c$ is a normalisation constant. Finally, $P$ is the stochastic transition matrix.
+year, and $c$ is a normalisation constant. Finally, $P$ is the stochastic transition matrix.
+
+***Parameters:*** 
+$\alpha = 0.2, \beta = 0.5, \gamma = 0.3, \rho = -0.16, convergence\_error = 10^{-12}$
+
+Note that recent attention is based on the 3 most recent years, including the current one (i.e., $y=3$).
+
+***Limitations:***
+OpenAIRE collects data from specific data sources, which means that part of the existing literature may not be considered when computing this indicator.
+Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
+
+***Environment:*** PySpark
+
+***References:*** 
+* Kanellos, I., Vergoulis, T., Sacharidis, D., Dalamagas, T., & Vassiliou, Y. (2021, April). Ranking papers by their short-term scientific impact. In 2021 IEEE 37th International Conference on Data Engineering (ICDE) (pp. 1997-2002). IEEE.
+
+***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
+
+