From 7717d883ee2c40c622232b8cb6a6f666b4b6f20b Mon Sep 17 00:00:00 2001 From: Serafeim Chatzopoulos Date: Fri, 11 Nov 2022 18:07:24 +0200 Subject: [PATCH 1/3] Add formating to impact indicators page --- docs/data-model/entities/other.md | 7 +- .../enrichment/impact-scores.md | 125 ++++++++++++++++-- 2 files changed, 117 insertions(+), 15 deletions(-) diff --git a/docs/data-model/entities/other.md b/docs/data-model/entities/other.md index a14ca5e..cd12f18 100644 --- a/docs/data-model/entities/other.md +++ b/docs/data-model/entities/other.md @@ -646,7 +646,12 @@ A measure computed for this instance (e.g. those provided by [BIP! Finder](https ### key _Type: String • Cardinality: ONE_ -The specified measure. Currently supported one of: `{ influence, influence_alt, popularity, popularity_alt, impulse, cc }` (see [the dedicated page](../../data-provision/enrichment/impact-scores) for more details). +The specified measure. Currently supported one of: +* `influence` (see [PageRank](/data-provision/enrichment/impact-scores#pagerank-pr)) +* `influence_alt` (see [Citation Count](/data-provision/enrichment/impact-scores#citation-count-cc)) +* `popularity` (see [AttRank](/data-provision/enrichment/impact-scores#attrank)) +* `popularity_alt` (see [RAM](/data-provision/enrichment/impact-scores#ram)) +* `impulse` (see ["Incubation" Citation Count](/data-provision/enrichment/impact-scores#incubation-citation-count-icc)) ```json "key": "influence" diff --git a/docs/data-provision/enrichment/impact-scores.md b/docs/data-provision/enrichment/impact-scores.md index 79a63e9..8d8433a 100644 --- a/docs/data-provision/enrichment/impact-scores.md +++ b/docs/data-provision/enrichment/impact-scores.md @@ -2,30 +2,71 @@ sidebar_position: 2 --- -# Impact scores -TODO - add intro +# Impact indicators + +This page summarises all calculated impact indicators, along with explanations about their main intuition, the way they are calculated, and their most important limitations, in an attempt help 
avoiding common pitfalls and misuses. ## Citation Count (CC) -This is the most widely used scientific impact indicator, which sums all citations received by each article. The citation count of a -publication $i$ corresponds to the in-degree of the corresponding node in the underlying citation network: $s_i = \sum_{j} A_{i,j}$, -where $A$ is the adjacency matrix of the network (i.e., $A_{i,j}=1$ when paper $j$ cites paper $i$, while $A_{i,j}=0$ otherwise). +***Short description:*** +This is the most widely used scientific impact indicator, which sums all citations received by each article. Citation count can be viewed as a measure of a publication's overall impact, since it conveys the number of other works that directly drew on it. +***Solution:*** +The citation count of a +publication $i$ corresponds to the in-degree of the corresponding node in the underlying citation network: $s_i = \sum_{j} A_{i,j}$, +where $A$ is the adjacency matrix of the network (i.e., $A_{i,j}=1$ when paper $j$ cites paper $i$, while $A_{i,j}=0$ otherwise). + +***Parameters:*** - + +***Limitations:*** +OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator. +Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source. + +***Environment:*** PySpark + +***References:*** - + +***Authority:*** ATHENA RC • ***License:*** GPL-2.0 • ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker) + + ## "Incubation" Citation Count (iCC) +***Short description:*** This measure is essentially a time-restricted version of the citation count, where the time window is distinct for each paper, i.e., -only citations $y$ years after its publication are counted (usually, $y=3$). 
The "incubation" citation count of a paper $i$ is -calculated as: $s_i = \sum_{j,t_j \leq t_i+3} A_{i,j}$, where $A$ is the adjacency matrix and $t_j, t_i$ are the citing and cited paper's +only citations $y$ years after its publication are counted. + +***Solution:*** +The "incubation" citation count of a paper $i$ is +calculated as: $s_i = \sum_{j,t_j \leq t_i+y} A_{i,j}$, where $A$ is the adjacency matrix and $t_j, t_i$ are the citing and cited paper's publication years, respectively. $t_i$ is cited paper $i$'s publication year. iCC can be seen as an indicator of a paper's initial momentum (impulse) directly after its publication. -## PageRank (PR) +***Parameters:*** +$y=3$ +***Limitations:*** +OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator. +Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source. + +***Environment:*** PySpark + +***References:*** +* Vergoulis, T., Kanellos, I., Atzori, C., Mannocci, A., Chatzopoulos, S., Bruzzo, S. L., Manola, N., & Manghi, P. (2021, April). Bip! db: A dataset of impact measures for scientific publications. In Companion Proceedings of the Web Conference 2021 (pp. 456-460). + +***Authority:*** ATHENA RC • ***License:*** GPL-2.0 • ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker) + + + ## PageRank (PR) + +***Short description:*** Originally developed to rank Web pages, PageRank has been also widely used to rank publications in citation networks. In this latter context, a publication's PageRank -score also serves as a measure of its influence. In particular, the PageRank score of a publication is calculated +score also serves as a measure of its influence. 
+ +***Solution:*** +The PageRank score of a publication is calculated as its probability of being read by a researcher that either randomly selects publications to read or selects publications based on the references of her latest read. Formally, the score of a publication $i$ is given by: @@ -41,12 +82,31 @@ score of each publication relies of the score of publications citing it (the alg until all scores converge). As a result, PageRank differentiates citations based on the importance of citing articles, thus alleviating the corresponding issue of the Citation Count. +***Parameters:*** +$\alpha = 0.5, convergence\_error = 10^{-12}$ + +***Limitations:*** +OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator. +Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source. + +***Environment:*** PySpark + +***References:*** +* Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab. + +***Authority:*** ATHENA RC • ***License:*** GPL-2.0 • ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker) + + ## RAM -RAM is essentially a modified Citation Count, where recent citations are considered of higher importance compared -to older ones. Hence, it better captures the popularity of publications. This "time-awareness" of citations +***Short description:*** +RAM is essentially a modified Citation Count, where recent citations are considered of higher importance compared to older ones. +Hence, it better captures the popularity of publications. This "time-awareness" of citations alleviates the bias of methods like Citation Count and PageRank against recently published articles, which have -not had "enough" time to gather as many citations. 
The RAM score of each paper $i$ is calculated as follows: +not had "enough" time to gather as many citations. + +***Solution:*** +The RAM score of each paper $i$ is calculated as follows: $$ s_i = \sum_j{R_{i,j}} @@ -56,11 +116,30 @@ where $R$ is the so-called Retained Adjacency Matrix (RAM) and $R_{i,j}=\gamma^{ $i$, and $R_{i,j}=0$ otherwise. Parameter $\gamma \in (0,1)$, $t_c$ corresponds to the current year and $t_j$ corresponds to the publication year of citing article $j$. +***Parameters:*** +$\gamma = 0.6$ + +***Limitations:*** +OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator. +Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source. + +***Environment:*** PySpark + +***References:*** +* Ghosh, R., Kuo, T. T., Hsu, C. N., Lin, S. D., & Lerman, K. (2011, December). Time-aware ranking in dynamic citation networks. In 2011 IEEE 11th International Conference on Data Mining Workshops (pp. 373-380). IEEE. + +***Authority:*** ATHENA RC • ***License:*** GPL-2.0 • ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker) + + ## AttRank +***Short description:*** AttRank is a PageRank variant that alleviates its bias against recent publications (i.e., it is tailored to capture popularity). AttRank achieves this by modifying PageRank's probability of randomly selecting a publication. Instead of using a uniform probability, -AttRank defines it based on a combination of the publication's age and the citations it received in recent years. The AttRank score +AttRank defines it based on a combination of the publication's age and the citations it received in recent years.
+ +***Solution:*** +The AttRank score of each publication $i$ is calculated based on: $$ @@ -70,4 +149,22 @@ $$ where $\alpha + \beta + \gamma =1$ and $\alpha,\beta,\gamma \in [0,1]$. $Att(i)$ denotes a recent attention-based score for publication $i$, which reflects its share of citations in the $y$ most recent years, $t_i$ is the publication year of article $i$, $t_c$ denotes the current -year, and $c$ is a normalisation constant. Finally, $P$ is the stochastic transition matrix. \ No newline at end of file +year, and $c$ is a normalisation constant. Finally, $P$ is the stochastic transition matrix. + +***Parameters:*** +$\alpha = 0.2, \beta = 0.5, \gamma = 0.3, \rho = -0.16, convergence\_error = 10^{-12}$ + +Note that recent attention is based on the 3 most recent years (including the current one). + +***Limitations:*** +OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator. +Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source. + +***Environment:*** PySpark + +***References:*** +* Kanellos, I., Vergoulis, T., Sacharidis, D., Dalamagas, T., & Vassiliou, Y. (2021, April). Ranking papers by their short-term scientific impact. In 2021 IEEE 37th International Conference on Data Engineering (ICDE) (pp. 1997-2002). IEEE. + +***Authority:*** ATHENA RC • ***License:*** GPL-2.0 • ***Code:*** [BIP! 
Ranker](https://github.com/athenarc/Bip-Ranker) + + \ No newline at end of file From ce31a6d5c75a68937bdfebd084f69fd71de31b24 Mon Sep 17 00:00:00 2001 From: Serafeim Chatzopoulos Date: Tue, 15 Nov 2022 16:29:39 +0200 Subject: [PATCH 2/3] Address review comments --- docs/data-provision/enrichment/impact-scores.md | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/docs/data-provision/enrichment/impact-scores.md b/docs/data-provision/enrichment/impact-scores.md index 8d8433a..9f142e2 100644 --- a/docs/data-provision/enrichment/impact-scores.md +++ b/docs/data-provision/enrichment/impact-scores.md @@ -4,7 +4,10 @@ sidebar_position: 2 # Impact indicators -This page summarises all calculated impact indicators, along with explanations about their main intuition, the way they are calculated, and their most important limitations, in an attempt help avoiding common pitfalls and misuses. +This page summarises all calculated impact indicators, which are included into the [measure](/data-model/entities/other#measure) property. +It should be noted that the impact indicators are being calculated both on the level of the research output as well on the level of their instances that correspond to a given DOI. +Below we explain their main intuition, the way they are calculated, and their most important limitations, in an attempt help avoiding common pitfalls and misuses. + ## Citation Count (CC) @@ -13,7 +16,7 @@ This is the most widely used scientific impact indicator, which sums all citatio Citation count can be viewed as a measure of a publication's overall impact, since it conveys the number of other works that directly drew on it. 
-***Solution:*** +***Algorithmic details:*** The citation count of a publication $i$ corresponds to the in-degree of the corresponding node in the underlying citation network: $s_i = \sum_{j} A_{i,j}$, where $A$ is the adjacency matrix of the network (i.e., $A_{i,j}=1$ when paper $j$ cites paper $i$, while $A_{i,j}=0$ otherwise). @@ -37,7 +40,7 @@ Also, since some indicators require the publication year for their calculation, This measure is essentially a time-restricted version of the citation count, where the time window is distinct for each paper, i.e., only citations $y$ years after its publication are counted. -***Solution:*** +***Algorithmic details:*** The "incubation" citation count of a paper $i$ is calculated as: $s_i = \sum_{j,t_j \leq t_i+y} A_{i,j}$, where $A$ is the adjacency matrix and $t_j, t_i$ are the citing and cited paper's publication years, respectively. $t_i$ is cited paper $i$'s publication year. iCC can be seen as an indicator of a paper's initial momentum @@ -65,7 +68,7 @@ Originally developed to rank Web pages, PageRank has been also widely used to ra networks. In this latter context, a publication's PageRank score also serves as a measure of its influence. -***Solution:*** +***Algorithmic details:*** The PageRank score of a publication is calculated as its probability of being read by a researcher that either randomly selects publications to read or selects publications based on the references of her latest read. Formally, the score of a publication $i$ is given by: @@ -105,7 +108,7 @@ Hence, it better captures the popularity of publications. This "time-awareness" alleviates the bias of methods like Citation Count and PageRank against recently published articles, which have not had "enough" time to gather as many citations. 
-***Solution:*** +***Algorithmic details:*** The RAM score of each paper $i$ is calculated as follows: $$ @@ -138,7 +141,7 @@ AttRank is a PageRank variant that alleviates its bias against recent publicatio AttRank achieves this by modifying PageRank's probability of randomly selecting a publication. Instead of using a uniform probability, AttRank defines it based on a combination of the publication's age and the citations it received in recent years. -***Solution:*** +***Algorithmic details:*** The AttRank score of each publication $i$ is calculated based on: From 7d9c7b214cfbc1b818e550f08fa4bcf3c23ca0b9 Mon Sep 17 00:00:00 2001 From: Serafeim Chatzopoulos Date: Tue, 15 Nov 2022 16:54:44 +0200 Subject: [PATCH 3/3] Minor change in impact-scores.md --- docs/data-provision/enrichment/impact-scores.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/data-provision/enrichment/impact-scores.md b/docs/data-provision/enrichment/impact-scores.md index 9f142e2..d3db939 100644 --- a/docs/data-provision/enrichment/impact-scores.md +++ b/docs/data-provision/enrichment/impact-scores.md @@ -5,7 +5,7 @@ sidebar_position: 2 # Impact indicators This page summarises all calculated impact indicators, which are included into the [measure](/data-model/entities/other#measure) property. -It should be noted that the impact indicators are being calculated both on the level of the research output as well on the level of their instances that correspond to a given DOI. +It should be noted that the impact indicators are calculated both at the level of the research output and at the level of distinct DOIs. Below we explain their main intuition, the way they are calculated, and their most important limitations, in an attempt help avoiding common pitfalls and misuses.
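
The Citation Count and "Incubation" Citation Count definitions documented in this series ($s_i = \sum_{j} A_{i,j}$ and $s_i = \sum_{j,t_j \leq t_i+y} A_{i,j}$ with $y=3$) can be checked by hand on a toy citation list. The following is only a minimal plain-Python sketch, not the PySpark code from BIP! Ranker: the function names, the `(citing, cited)` edge-list representation, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def citation_count(citations):
    """CC: in-degree of each cited paper.

    `citations` is an iterable of (citing, cited) pairs (hypothetical format).
    """
    scores = defaultdict(int)
    for _citing, cited in citations:
        scores[cited] += 1
    return dict(scores)


def incubation_citation_count(citations, pub_year, y=3):
    """iCC: count a citation only if t_j <= t_i + y.

    t_j is the citing paper's publication year, t_i the cited paper's.
    """
    scores = defaultdict(int)
    for citing, cited in citations:
        if pub_year[citing] <= pub_year[cited] + y:
            scores[cited] += 1
    return dict(scores)


# Hypothetical toy data: paper id -> publication year, and citation edges.
pub_year = {"p1": 2010, "p2": 2011, "p3": 2013, "p4": 2020}
citations = [("p2", "p1"), ("p3", "p1"), ("p4", "p1"), ("p4", "p3")]

cc = citation_count(citations)
icc = incubation_citation_count(citations, pub_year)
print(cc)   # {'p1': 3, 'p3': 1}
print(icc)  # {'p1': 2}
```

In this sketch, papers that receive no qualifying citation are simply absent from the result rather than reported with a score of 0. The citation from `p4` to `p1` is dropped by iCC because 2020 falls outside the three-year window after 2010.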