diff --git a/docs/data-model/entities/other.md b/docs/data-model/entities/other.md
index a14ca5e..cd12f18 100644
--- a/docs/data-model/entities/other.md
+++ b/docs/data-model/entities/other.md
@@ -646,7 +646,12 @@ A measure computed for this instance (e.g. those provided by [BIP! Finder](https
 ### key
 _Type: String • Cardinality: ONE_
 
-The specified measure. Currently supported one of: `{ influence, influence_alt, popularity, popularity_alt, impulse, cc }` (see [the dedicated page](../../data-provision/enrichment/impact-scores) for more details).
+The specified measure. Currently, one of the following is supported:
+* `influence` (see [PageRank](/data-provision/enrichment/impact-scores#pagerank-pr))
+* `influence_alt` (see [Citation Count](/data-provision/enrichment/impact-scores#citation-count-cc))
+* `popularity` (see [AttRank](/data-provision/enrichment/impact-scores#attrank))
+* `popularity_alt` (see [RAM](/data-provision/enrichment/impact-scores#ram))
+* `impulse` (see ["Incubation" Citation Count](/data-provision/enrichment/impact-scores#incubation-citation-count-icc))
 
 ```json
 "key": "influence"
diff --git a/docs/data-provision/enrichment/impact-scores.md b/docs/data-provision/enrichment/impact-scores.md
index 79a63e9..d3db939 100644
--- a/docs/data-provision/enrichment/impact-scores.md
+++ b/docs/data-provision/enrichment/impact-scores.md
@@ -2,30 +2,74 @@
 sidebar_position: 2
 ---
 
-# Impact scores
-TODO - add intro
+# Impact indicators
+
+This page summarises all calculated impact indicators, which are included in the [measure](/data-model/entities/other#measure) property.
+Note that the impact indicators are calculated both at the level of the research output and at the level of distinct DOIs.
+Below we explain their main intuition, the way they are calculated, and their most important limitations, in an attempt to help avoid common pitfalls and misuse.
+
 ## Citation Count (CC)
 
-This is the most widely used scientific impact indicator, which sums all citations received by each article. The citation count of a
-publication $i$ corresponds to the in-degree of the corresponding node in the underlying citation network: $s_i = \sum_{j} A_{i,j}$,
-where $A$ is the adjacency matrix of the network (i.e., $A_{i,j}=1$ when paper $j$ cites paper $i$, while $A_{i,j}=0$ otherwise).
+***Short description:***
+This is the most widely used scientific impact indicator, which sums all citations received by each article. Citation count can be viewed as a measure of a publication's overall impact, since it conveys the number of other works that directly drew on it.
 
+***Algorithmic details:***
+The citation count of a
+publication $i$ corresponds to the in-degree of the corresponding node in the underlying citation network: $s_i = \sum_{j} A_{i,j}$,
+where $A$ is the adjacency matrix of the network (i.e., $A_{i,j}=1$ when paper $j$ cites paper $i$, while $A_{i,j}=0$ otherwise).
+
+***Parameters:*** -
+
+***Limitations:***
+OpenAIRE collects data from specific data sources, which means that part of the existing literature may not be considered when computing this indicator.
+Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
+
+***Environment:*** PySpark
+
+***References:*** -
+
+***Authority:*** ATHENA RC • ***License:*** GPL-2.0 • ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
+
+
 ## "Incubation" Citation Count (iCC)
 
+***Short description:***
 This measure is essentially a time-restricted version of the citation count, where the time window is distinct for each paper, i.e.,
-only citations $y$ years after its publication are counted (usually, $y=3$). The "incubation" citation count of a paper $i$ is
-calculated as: $s_i = \sum_{j,t_j \leq t_i+3} A_{i,j}$, where $A$ is the adjacency matrix and $t_j, t_i$ are the citing and cited paper's
+only citations received within $y$ years after its publication are counted.
+
+***Algorithmic details:***
+The "incubation" citation count of a paper $i$ is
+calculated as: $s_i = \sum_{j,t_j \leq t_i+y} A_{i,j}$, where $A$ is the adjacency matrix and $t_j, t_i$ are the citing and cited paper's
 publication years, respectively. $t_i$ is cited paper $i$'s publication year. iCC can be seen as an indicator of a paper's initial momentum
 (impulse) directly after its publication.
 
-## PageRank (PR)
+***Parameters:***
+$y=3$
 
+***Limitations:***
+OpenAIRE collects data from specific data sources, which means that part of the existing literature may not be considered when computing this indicator.
+Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
+
+***Environment:*** PySpark
+
+***References:***
+* Vergoulis, T., Kanellos, I., Atzori, C., Mannocci, A., Chatzopoulos, S., Bruzzo, S. L., Manola, N., & Manghi, P. (2021, April). BIP! DB: A dataset of impact measures for scientific publications. In Companion Proceedings of the Web Conference 2021 (pp. 456-460).
+
+***Authority:*** ATHENA RC • ***License:*** GPL-2.0 • ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
+
+
+ ## PageRank (PR)
+
+***Short description:***
 Originally developed to rank Web pages, PageRank has been also widely used to rank publications in citation networks.
 In this latter context, a publication's PageRank
-score also serves as a measure of its influence. In particular, the PageRank score of a publication is calculated
+score also serves as a measure of its influence.
+
+***Algorithmic details:***
+The PageRank score of a publication is calculated
 as its probability of being read by a researcher that either randomly selects publications to read or selects
 publications based on the references of her latest read. Formally, the score of a publication $i$ is given by:
 
@@ -41,12 +85,31 @@ score of each publication relies of the score of publications citing it (the alg
 until all scores converge). As a result, PageRank differentiates citations based on the importance of citing articles, thus
 alleviating the corresponding issue of the Citation Count.
 
+***Parameters:***
+$\alpha = 0.5, convergence\_error = 10^{-12}$
+
+***Limitations:***
+OpenAIRE collects data from specific data sources, which means that part of the existing literature may not be considered when computing this indicator.
+Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
+
+***Environment:*** PySpark
+
+***References:***
+* Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab.
+
+***Authority:*** ATHENA RC • ***License:*** GPL-2.0 • ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
+
+
 ## RAM
 
-RAM is essentially a modified Citation Count, where recent citations are considered of higher importance compared
-to older ones. Hence, it better captures the popularity of publications. This "time-awareness" of citations
+***Short description:***
+RAM is essentially a modified Citation Count, where recent citations are considered of higher importance compared to older ones.
+Hence, it better captures the popularity of publications. This "time-awareness" of citations
 alleviates the bias of methods like Citation Count and PageRank against recently published articles, which have
-not had "enough" time to gather as many citations. The RAM score of each paper $i$ is calculated as follows:
+not had "enough" time to gather as many citations.
+
+***Algorithmic details:***
+The RAM score of each paper $i$ is calculated as follows:
 
 $$
 s_i = \sum_j{R_{i,j}}
@@ -56,11 +119,30 @@ where $R$ is the so-called Retained Adjacency Matrix (RAM) and $R_{i,j}=\gamma^{
 $i$, and $R_{i,j}=0$ otherwise. Parameter $\gamma \in (0,1)$, $t_c$ corresponds to the current year and $t_j$ corresponds to the
 publication year of citing article $j$.
 
+***Parameters:***
+$\gamma = 0.6$
+
+***Limitations:***
+OpenAIRE collects data from specific data sources, which means that part of the existing literature may not be considered when computing this indicator.
+Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
+
+***Environment:*** PySpark
+
+***References:***
+* Ghosh, R., Kuo, T. T., Hsu, C. N., Lin, S. D., & Lerman, K. (2011, December). Time-aware ranking in dynamic citation networks. In 2011 IEEE 11th International Conference on Data Mining Workshops (pp. 373-380). IEEE.
+
+***Authority:*** ATHENA RC • ***License:*** GPL-2.0 • ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
+
+
 ## AttRank
 
+***Short description:***
 AttRank is a PageRank variant that alleviates its bias against recent publications (i.e., it is tailored to capture popularity).
 AttRank achieves this by modifying PageRank's probability of randomly selecting a publication. Instead of using a uniform probability,
-AttRank defines it based on a combination of the publication's age and the citations it received in recent years. The AttRank score
+AttRank defines it based on a combination of the publication's age and the citations it received in recent years.
+
+***Algorithmic details:***
+The AttRank score
 of each publication $i$ is calculated based on:
 
 $$
@@ -70,4 +152,22 @@ $$
 
 where $\alpha + \beta + \gamma =1$ and $\alpha,\beta,\gamma \in [0,1]$. $Att(i)$ denotes a recent attention-based score for publication $i$,
 which reflects its share of citations in the $y$ most recent years, $t_i$ is the publication year of article $i$, $t_c$ denotes the current
-year, and $c$ is a normalisation constant. Finally, $P$ is the stochastic transition matrix.
\ No newline at end of file
+year, and $c$ is a normalisation constant. Finally, $P$ is the stochastic transition matrix.
+
+***Parameters:***
+$\alpha = 0.2, \beta = 0.5, \gamma = 0.3, \rho = -0.16, convergence\_error = 10^{-12}$
+
+Note that recent attention is based on the 3 most recent years (including the current one).
+
+***Limitations:***
+OpenAIRE collects data from specific data sources, which means that part of the existing literature may not be considered when computing this indicator.
+Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
+
+***Environment:*** PySpark
+
+***References:***
+* Kanellos, I., Vergoulis, T., Sacharidis, D., Dalamagas, T., & Vassiliou, Y. (2021, April). Ranking papers by their short-term scientific impact. In 2021 IEEE 37th International Conference on Data Engineering (ICDE) (pp. 1997-2002). IEEE.
+
+***Authority:*** ATHENA RC • ***License:*** GPL-2.0 • ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
+
+
\ No newline at end of file
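
Reviewer note: the three citation-counting definitions documented in this patch (CC, iCC, and RAM) can be sketched in plain Python on a toy edge list. This is a minimal illustration only, not the PySpark code from BIP! Ranker. The function names and sample data are hypothetical, and the RAM exponent $t_c - t_j$ is an assumption, because the diff truncates that part of the formula in the hunk header.

```python
from collections import defaultdict

# Toy citation edges as (citing_paper, cited_paper); all IDs and years are made up.
citations = [("p2", "p1"), ("p3", "p1"), ("p4", "p1"), ("p4", "p2")]
pub_year = {"p1": 2015, "p2": 2016, "p3": 2018, "p4": 2022}


def citation_count(citations):
    """CC: in-degree of each cited paper, s_i = sum_j A[i][j]."""
    scores = defaultdict(int)
    for _citing, cited in citations:
        scores[cited] += 1
    return dict(scores)


def incubation_citation_count(citations, pub_year, y=3):
    """iCC: count only citations whose citing year t_j satisfies t_j <= t_i + y."""
    scores = defaultdict(int)
    for citing, cited in citations:
        if pub_year[citing] <= pub_year[cited] + y:
            scores[cited] += 1
    return dict(scores)


def ram(citations, pub_year, current_year, gamma=0.6):
    """RAM: each citation contributes gamma ** (t_c - t_j).

    The exponent t_c - t_j is assumed here; it is cut off in the diff.
    """
    scores = defaultdict(float)
    for citing, cited in citations:
        scores[cited] += gamma ** (current_year - pub_year[citing])
    return dict(scores)


if __name__ == "__main__":
    print(citation_count(citations))
    print(incubation_citation_count(citations, pub_year))
    print(ram(citations, pub_year, current_year=2022))
```

Papers that receive no qualifying citation are simply absent from the returned dictionaries, so callers that need explicit zeros would have to fill them in. The defaults `y=3` and `gamma=0.6` mirror the parameter values listed in the patch.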