fake-orcid-analysis/notebooks/01-Exploration.ipynb


Exploratory analysis

TODO:

  • Understanding why fake profiles are created can give insight into how to catch them (this could be trivial with prior knowledge, e.g., SEO hacking => look at URLs)
  • Build a taxonomy of cases (e.g., authors publishing with an empty ORCID profile, authors publishing but not in OpenAIRE, etc.)
  • Is the temporal dimension of any use?
  • Can we access private info thanks to the OpenAIRE-ORCID agreement?
In [1]:
import pandas as pd
import ast
import tldextract
import numpy

import plotly
from plotly.offline import iplot, init_notebook_mode
import plotly.graph_objs as go
import plotly.express as px

init_notebook_mode(connected=True)
TOP_N = 0
TOP_RANGE = [0, 0]
def set_top_n(n):
    global TOP_N, TOP_RANGE
    TOP_N = n
    TOP_RANGE = [-.5, n - 1 + .5]
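A note on `TOP_RANGE`: on a categorical x-axis the bars are assumed to sit at integer positions 0..n-1, so a range of [-0.5, n - 0.5] frames the first n bars with half a bar of padding on each side. A minimal sketch of the same computation without module-level globals (the name `top_range` is an assumption, not part of the notebook):

```python
def top_range(n):
    """Return the x-axis range that frames the first n bars of a bar chart.

    Assumes bars sit at integer positions 0..n-1, so half a unit of padding
    is added on each side.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return [-0.5, n - 0.5]


print(top_range(30))
```

Returning the range instead of mutating globals would let each plotting cell state its own N explicitly.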

Notable genuine ORCID iDs for exploratory purposes:

In [2]:
AM = '0000-0002-5193-7851'
PP = '0000-0002-8588-4196'

Notable anomalies:

In [3]:
JOURNAL = '0000-0003-1815-5732'
NOINFO = '0000-0001-5009-2052'
VALID_NO_OA = '0000-0002-5154-6404' # True profile, but not in OpenAIRE
# todo: find group-shared ORCiD, if possible

Notable fake ORCID iDs:

In [4]:
SCAFFOLD = '0000-0001-5004-7761'
WHATSAPP = '0000-0001-6997-9470'
PENIS = '0000-0002-3399-7287'
BITCOIN = '0000-0002-7518-6845'
FITNESS_CHINA = '0000-0002-1234-835x' # URL record + employment (iDs are lowercase in the dataset)
CANNABIS = '0000-0002-9025-8632'      # URL > 70 + works (REMOVED)
PLUMBER = '0000-0002-1700-8311'       # URL > 10 + works 

Load the dataset

In [5]:
df = pd.read_pickle('../data/processed/dataset.pkl')
df.head(5)
Out[5]:
orcid claimed verified_email verified_primary_email given_names family_name biography other_names urls primary_email ... employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label
0 0000-0001-5009-2052 1 1 1 NaN NaN NaN NaN NaN NaN ... NaN 0 NaN 2019-06-05t20:25:43.066z 2019-12-11t03:57:41.741z 0 0 0 0 0
1 0000-0001-5943-0732 1 1 1 NaN NaN NaN NaN NaN NaN ... NaN 0 NaN 2015-08-18t13:10:42.871z 2016-06-15t01:05:19.986z 0 0 0 0 0
2 0000-0001-6083-622x 1 1 1 NaN NaN NaN NaN NaN NaN ... NaN 0 NaN 2019-01-21t10:55:27.997z 2019-01-28t16:24:02.199z 0 0 0 0 0
3 0000-0001-6262-5709 1 1 1 NaN NaN NaN NaN NaN NaN ... NaN 0 NaN 2015-08-18t14:29:39.440z 2017-06-21t07:18:20.787z 0 0 0 0 0
4 0000-0001-6616-4890 1 1 1 NaN NaN NaN NaN NaN NaN ... NaN 0 NaN 2015-08-13t01:59:51.802z 2016-06-15t01:05:21.373z 0 0 0 0 0

5 rows × 24 columns

Notable profiles inspection

In [6]:
df[df['orcid'] == AM]
Out[6]:
orcid claimed verified_email verified_primary_email given_names family_name biography other_names urls primary_email ... employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label
1575869 0000-0002-5193-7851 1 1 1 andrea mannocci data scientist & researcher; scholarly knowled... NaN [[personal website, https://andremann.github.i... andrea.mannocci@isti.cnr.it ... [[research associate, istituto di scienza e te... 37 [scopus - elsevier, crossref metadata search, ... 2017-09-12t14:28:33.467z 2021-03-09t08:32:47.840z 34 0 0 60 1

1 rows × 24 columns

In [7]:
df[df['orcid'] == WHATSAPP]
Out[7]:
orcid claimed verified_email verified_primary_email given_names family_name biography other_names urls primary_email ... employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label
6819986 0000-0001-6997-9470 1 1 1 other whatsapp NaN NaN [[otherwhatsapp, https://otherwhatsapp.com/], ... NaN ... NaN 0 NaN 2020-10-07t10:37:12.237z 2020-10-08t02:32:03.935z 0 0 0 0 0

1 rows × 24 columns

In [8]:
df.count()
Out[8]:
orcid                     10916574
claimed                   10916574
verified_email            10916574
verified_primary_email    10916574
given_names               10886150
family_name               10601571
biography                   348649
other_names                 551482
urls                        707687
primary_email               123851
other_emails                 48306
keywords                    646400
external_ids               1301959
education                  2430233
employment                 2665092
n_works                   10916574
works_source               2721431
activation_date           10916574
last_update_date          10916574
n_doi                     10916574
n_arxiv                   10916574
n_pmc                     10916574
n_other_pids              10916574
label                     10916574
dtype: int64
In [9]:
df['orcid'].describe()
Out[9]:
count                10916574
unique               10916574
top       0000-0001-8786-4765
freq                        1
Name: orcid, dtype: object

Primary email

In [10]:
df['primary_email'].describe()
Out[10]:
count                       123851
unique                      123848
top       patrick.davey@monash.edu
freq                             2
Name: primary_email, dtype: object

Duplicate emails

In [11]:
df['primary_email'].dropna().loc[df['primary_email'].duplicated()]
Out[11]:
6347224            maykin@owasp.org
7027865    patrick.davey@monash.edu
9529005      opercin@erbakan.edu.tr
Name: primary_email, dtype: object
In [12]:
df[df['primary_email'] == 'maykin@owasp.org']
Out[12]:
orcid claimed verified_email verified_primary_email given_names family_name biography other_names urls primary_email ... employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label
4450046 0000-0001-9855-1676 1 1 1 maykin warasart NaN NaN NaN maykin@owasp.org ... NaN 0 NaN 2020-10-23t17:51:51.925z 2021-01-01t15:00:52.053z 0 0 0 0 0
6347224 0000-0002-0836-2271 1 1 1 maykin warasart NaN NaN NaN maykin@owasp.org ... NaN 0 NaN 2020-09-15t04:43:55.709z 2020-09-15t05:17:28.509z 0 0 0 0 0

2 rows × 24 columns

In [13]:
df[df['primary_email'] == 'opercin@erbakan.edu.tr']
Out[13]:
orcid claimed verified_email verified_primary_email given_names family_name biography other_names urls primary_email ... employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label
6840791 0000-0002-2232-9638 1 1 1 osman perçin NaN NaN NaN opercin@erbakan.edu.tr ... NaN 0 NaN 2015-01-12t13:47:55.549z 2020-01-27t07:38:24.269z 0 0 0 0 0
9529005 0000-0003-0033-0918 1 1 1 osman perçin NaN NaN NaN opercin@erbakan.edu.tr ... [[, necmettin erbakan university, konya, , tr,... 0 NaN 2015-10-13t05:47:12.014z 2020-12-25t13:52:03.976z 0 0 0 0 0

2 rows × 24 columns

In [14]:
df[df['primary_email'] == 'patrick.davey@monash.edu']
Out[14]:
orcid claimed verified_email verified_primary_email given_names family_name biography other_names urls primary_email ... employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label
944993 0000-0002-9158-1757 1 1 1 patrick davey NaN NaN NaN patrick.davey@monash.edu ... [[phd student, monash university, melbourne, ,... 0 NaN 2019-05-09t23:01:02.170z 2019-08-20t03:00:17.844z 0 0 0 0 0
7027865 0000-0002-8774-0030 1 1 1 patrick davey NaN NaN NaN patrick.davey@monash.edu ... [[phd student, monash university, melbourne, v... 1 [crossref] 2018-09-11t10:47:10.997z 2021-02-09t06:21:44.138z 1 0 0 0 1

2 rows × 24 columns
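
The three lookups above could be collapsed into a single grouping step. A minimal sketch in plain Python, assuming each profile is an `(orcid, email)` pair; the function name `group_duplicate_emails` and the sample values are hypothetical:

```python
from collections import defaultdict


def group_duplicate_emails(profiles):
    """Map each email shared by two or more profiles to the ORCID iDs using it.

    `profiles` is an iterable of (orcid, email) pairs; pairs whose email is
    None are skipped, mirroring the dropna() step.
    """
    by_email = defaultdict(list)
    for orcid, email in profiles:
        if email is not None:
            by_email[email].append(orcid)
    return {email: orcids for email, orcids in by_email.items() if len(orcids) > 1}


# Hypothetical sample data, not taken from the dataset.
sample = [
    ('0000-0000-0000-0001', 'a@example.org'),
    ('0000-0000-0000-0002', 'b@example.org'),
    ('0000-0000-0000-0003', 'a@example.org'),
    ('0000-0000-0000-0004', None),
]
print(group_duplicate_emails(sample))
```

In pandas, `df['primary_email'].duplicated(keep=False)` combined with `df['primary_email'].notna()` should select the same rows in one pass.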

In [15]:
df['primary_email_domain'] = df[df.primary_email.notna()]['primary_email'].apply(lambda x: x.split('@')[1])
In [16]:
df['primary_email_domain'].describe()
Out[16]:
count        123851
unique        17089
top       gmail.com
freq          26540
Name: primary_email_domain, dtype: object
In [17]:
top_primary_emails = df[['primary_email_domain', 'orcid']]\
                .groupby('primary_email_domain')\
                .count()\
                .sort_values('orcid', ascending=False)
top_primary_emails
Out[17]:
                      orcid
primary_email_domain
gmail.com             26540
hotmail.com            3769
yahoo.com              2614
163.com                2109
yuhs.ac                1132
...                     ...
imean-biotech.com         1
imec.msu.ru               1
imedea.uib-csic.es        1
imes.uni-hannover.de      1
zzuli.edu.cn              1

17089 rows × 1 columns

In [18]:
set_top_n(30)
data = [
    go.Bar(
        x=top_primary_emails[:TOP_N].index,
        y=top_primary_emails[:TOP_N]['orcid']
    )
]

layout = go.Layout(
    title='Top-%s email domains' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

Other emails

In [19]:
def extract_email_domains(lst):
    res = []
    for email in lst:
        res.append(email.split('@')[1])
    return res
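`extract_email_domains` raises an `IndexError` on any entry without an `@`. A more defensive sketch, assuming malformed entries should simply be skipped and domains lowercased (the name `extract_email_domains_safe` is hypothetical):

```python
def extract_email_domains_safe(emails):
    """Return the domain of each address, skipping entries without a usable '@'."""
    domains = []
    for email in emails:
        # rpartition splits on the last '@', so the text after it is the domain.
        local, sep, domain = email.rpartition('@')
        if sep and local and domain:
            domains.append(domain.strip().lower())
    return domains


print(extract_email_domains_safe(['a@Example.org', 'broken', 'b@cnr.it']))
```

Unlike `split('@')[1]`, this takes the text after the last `@`, which only matters for addresses containing more than one.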
In [20]:
df['other_email_domains'] = df[df.other_emails.notna()]['other_emails'].apply(lambda x: extract_email_domains(x))
In [21]:
df[df['other_email_domains'].notna()].head()
Out[21]:
orcid claimed verified_email verified_primary_email given_names family_name biography other_names urls primary_email ... works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains
34 0000-0002-5774-8947 1 1 1 NaN NaN NaN [omah m. williams - duncan] NaN NaN ... NaN 2014-03-07t04:34:39.598z 2019-05-21t17:08:12.202z 0 0 0 0 0 NaN [gmail.com]
1199 0000-0003-2877-5492 1 1 0 aliasghar khosroabadi NaN NaN NaN khosroedc@yahoo.com ... [scopus - elsevier] 2018-01-19t13:40:29.874z 2019-12-11t02:19:08.160z 0 0 0 1 1 yahoo.com [medsab.ac.ir, gmail.com]
1995 0000-0001-8004-5054 1 1 1 angiola orlando NaN NaN NaN angiola.orlando@mib.infn.it ... [angiola orlando, crossref] 2015-08-31t09:12:02.349z 2020-06-22t14:22:31.786z 59 2 0 53 1 mib.infn.it [ge.infn.it]
2323 0000-0003-3048-4504 1 1 1 apichat saejio NaN NaN NaN NaN ... [scopus - elsevier] 2016-03-06t08:54:15.121z 2020-08-28t08:31:15.790z 2 0 0 4 0 NaN [eat.kmutnb.ac.th]
4461 0000-0001-9961-9732 1 1 1 chunfeng yun NaN NaN NaN sallyycf@163.com ... [multidisciplinary digital publishing institut... 2016-11-22t07:55:23.863z 2019-11-26t02:29:35.104z 5 0 9 0 1 163.com [pku.edu.cn]

5 rows × 26 columns

In [22]:
df['n_emails'] = df['other_emails'].str.len()
In [23]:
emails_by_orcid = df.sort_values('n_emails', ascending=False)
In [24]:
set_top_n(30)
data = [
    go.Bar(
        x=emails_by_orcid[:TOP_N]['orcid'],
        y=emails_by_orcid[:TOP_N]['n_emails']
    )
]

layout = go.Layout(
    title='Top %s ORCID iDs by email' % TOP_N, 
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
In [25]:
top_other_emails = df[['orcid', 'other_email_domains']]\
                        .explode('other_email_domains')\
                        .reset_index(drop=True)\
                        .groupby('other_email_domains')\
                        .count()\
                        .sort_values('orcid', ascending=False)
In [26]:
set_top_n(30)
data = [
    go.Bar(
        x=top_other_emails[:TOP_N].index,
        y=top_other_emails[:TOP_N]['orcid']
    )
]

layout = go.Layout(
    title='Top %s other email domains' % TOP_N, 
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

Email speculation

In [27]:
df[df['primary_email'].isna() & df['other_emails'].notna()]
Out[27]:
orcid claimed verified_email verified_primary_email given_names family_name biography other_names urls primary_email ... activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains n_emails
34 0000-0002-5774-8947 1 1 1 NaN NaN NaN [omah m. williams - duncan] NaN NaN ... 2014-03-07t04:34:39.598z 2019-05-21t17:08:12.202z 0 0 0 0 0 NaN [gmail.com] 1.0
2323 0000-0003-3048-4504 1 1 1 apichat saejio NaN NaN NaN NaN ... 2016-03-06t08:54:15.121z 2020-08-28t08:31:15.790z 2 0 0 4 0 NaN [eat.kmutnb.ac.th] 1.0
7622 0000-0002-5612-7444 1 1 1 friederike m. hesse NaN NaN [[midwifery care - milla hebammenpraxis, http:... NaN ... 2017-06-10t07:45:11.387z 2017-06-10t07:55:03.455z 0 0 0 0 0 NaN [gmail.com, dghwi.de] 2.0
7956 0000-0002-8943-0538 1 1 1 geo sunny NaN NaN NaN NaN ... 2019-11-30t14:08:11.221z 2020-05-15t09:06:25.637z 1 0 0 0 1 NaN [students.cutn.ac.in] 1.0
10508 0000-0002-4022-0580 1 1 1 jean carlos da silva gomes NaN NaN [[currículo lattes, http://lattes.cnpq.br/0026... NaN ... 2017-05-26t19:09:33.432z 2020-06-02t00:23:14.020z 2 0 0 2 1 NaN [letras.ufrj.br] 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10915002 0000-0002-3715-3866 1 1 1 joanna korybut-orlowska NaN [joanna gołębiewska] NaN NaN ... 2017-04-27t10:08:48.102z 2020-12-08t09:44:59.088z 6 0 0 0 0 NaN [gmail.com] 1.0
10915305 0000-0003-1925-0141 1 1 1 marco ferretti NaN NaN NaN NaN ... 2015-02-23t10:29:00.543z 2020-11-30t21:58:07.439z 7 0 0 9 1 NaN [itabc.cnr.it] 1.0
10915495 0000-0001-5526-3017 1 1 1 nadia yacoubi NaN NaN NaN NaN ... 2015-03-10t16:45:31.974z 2020-12-11t00:00:01.060z 3 0 0 0 1 NaN [evonik.com] 1.0
10915820 0000-0002-9902-7953 1 1 1 s m mahmudul hasan NaN NaN NaN NaN ... 2018-01-26t02:18:25.551z 2020-11-24t05:37:24.167z 7 0 2 7 1 NaN [gmail.com] 1.0
10916306 0000-0002-5126-5127 1 1 1 andonis neophytou NaN NaN NaN NaN ... 2017-03-30t17:08:15.383z 2020-12-09t16:16:50.762z 2 0 0 3 0 NaN [ucy.ac.cy] 1.0

19692 rows × 27 columns

URLs

In [28]:
def extract_url_domains(lst):
    domains = []
    for e in lst:
        # e[0] is a string describing the url
        # e[1] is the url
        domain = tldextract.extract(e[1])
        domains.append(domain.registered_domain)
    return domains
In [29]:
df['url_domains'] = df[df.urls.notna()]['urls'].apply(lambda x: extract_url_domains(x))
In [30]:
df[df['url_domains'].notna()].head()
Out[30]:
orcid claimed verified_email verified_primary_email given_names family_name biography other_names urls primary_email ... last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains n_emails url_domains
9 0000-0001-8718-0056 1 1 1 NaN NaN NaN [飛資得] [[link1, http://orcid.flysheetmed.info], [ntu ... ericlin.flysheet@gmail.com ... 2019-10-11t17:51:12.473z 0 0 0 6 1 gmail.com NaN NaN [flysheetmed.info, ntu.edu.tw]
41 0000-0002-7845-4016 1 1 1 NaN NaN NaN NaN [[publication profile, http://publications.lib... NaN ... 2016-06-06t15:29:36.952z 0 0 0 0 0 NaN NaN NaN [chalmers.se]
59 0000-0003-0967-6157 1 1 1 NaN NaN NaN [徐興慶] [[ntu researcher profile, http://ah.ntu.edu.tw... NaN ... 2017-03-10t07:30:04.778z 12 0 0 4 1 NaN NaN NaN [ntu.edu.tw, ntu.edu.tw]
149 0000-0002-8015-3781 1 1 1 alejandro ossorio NaN NaN [[web de la universidad carlos iii de madrid, ... aossorio@di.uc3m.es ... 2019-07-04t08:47:12.005z 0 0 0 0 0 di.uc3m.es NaN NaN [uc3m.es]
155 0000-0003-3444-936x 1 1 1 alessandra caravale archeologa, con laurea in metodologia e tecnic... NaN [[isma- cnr, http://www.isma.cnr.it/?page_id=1... NaN ... 2020-05-14t15:54:38.235z 7 0 0 14 1 NaN NaN NaN [cnr.it]

5 rows × 28 columns

In [31]:
df['n_urls'] = df['url_domains'].str.len()
In [32]:
urls_by_orcid = df.sort_values('n_urls', ascending=False)[['orcid', 'n_urls']]
urls_by_orcid
Out[32]:
                        orcid  n_urls
257375    0000-0002-1234-835x   219.0
3630067   0000-0001-7478-4539   174.0
5196089   0000-0002-7392-3792   169.0
10696059  0000-0002-6938-9638   152.0
6868932   0000-0002-5710-4041   114.0
...                       ...     ...
10916569  0000-0001-5692-7639     NaN
10916570  0000-0003-1539-0999     NaN
10916571  0000-0003-2858-5509     NaN
10916572  0000-0003-2438-9500     NaN
10916573  0000-0003-4119-4772     NaN

10916574 rows × 2 columns

In [33]:
set_top_n(100)
data = [
    go.Bar(
        x=urls_by_orcid[:TOP_N]['orcid'],
        y=urls_by_orcid[:TOP_N]['n_urls']
    )
]

layout = go.Layout(
    title='Top %s ORCID iDs with URLs' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
In [34]:
top_urls = df[['orcid', 'url_domains']]\
                .explode('url_domains')\
                .reset_index(drop=True)\
                .groupby('url_domains')\
                .count()\
                .sort_values('orcid', ascending=False)
In [35]:
set_top_n(30)
data = [
    go.Bar(
        x=top_urls[:TOP_N].index,
        y=top_urls[:TOP_N]['orcid']
    )
]

layout = go.Layout(
    title='Top-%s URL domains' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

URLs speculation

In [36]:
df[(df['url_domains'].str.len() > 50) & (df['n_works'] > 0)]
Out[36]:
orcid claimed verified_email verified_primary_email given_names family_name biography other_names urls primary_email ... n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains n_emails url_domains n_urls
382497 0000-0002-9025-8632 1 1 1 buycannabis dispensary we procure and deliver premium cannabis strain... [we procure and deliver premium cannabis strai... [[find your cannabis & marijuana dispensary , ... NaN ... 0 0 0 0 0 NaN NaN NaN [goowonderland.com, goowonderland.com, goowond... 81.0
911811 0000-0002-4062-3603 1 1 1 juan de dios beltrán mancilla juan de dios beltrán mancilla (*) filósofo aut... [juan de dios beltrán mancilla, filósofo autod... [[01.- juan de dios beltrán mancilla. teoría o... NaN ... 0 0 0 7 0 NaN NaN NaN [yumpu.com, ijopm.org, google.com, blogspot.co... 69.0
1136129 0000-0002-1929-6054 1 1 1 franklin américo canaza choque docente-investigador social. maestrando en der... [franklin américo canaza-choque , franklin a. ... [[consejo nacional de ciencia, tecnología e in... leo_123fa@hotmail.com ... 29 0 0 33 1 hotmail.com [gmail.com, gmail.com, hotmail.com, baldwin.ed... 5.0 [concytec.gob.pe, redalyc.org, redalyc.org, un... 61.0
3102686 0000-0003-2593-7134 1 1 1 aan jaelani all my papers can be downloaded from portal:re... [jaelani, a., jaelani, aan] [[microsoft academic research, https://academi... aan_jaelani@syekhnurjati.ac.id ... 88 0 0 193 1 syekhnurjati.ac.id [gmail.com] 1.0 [microsoft.com, twitter.com, academia.edu, aca... 67.0
6868932 0000-0002-5710-4041 1 1 1 ryszard romaniuk professor of electronics and communications en... [r.romaniuk, r.s.romaniuk, ryszard romaniuk, r... [[scholar google, http://scholar.google.pl/cit... rrom@ise.pw.edu.pl ... 1221 25 0 1742 1 ise.pw.edu.pl [ise.pw.edu.pl, elka.pw.edu.pl, cern.ch] 3.0 [google.pl, publons.com, scopus.com, mendeley.... 114.0
8088987 0000-0002-9965-2425 1 1 1 jaroslaw spychala jaroslaw spychala has received a doctoral degr... [jaroslaw jozef spychala] [[resume, http://www.biowebspin.com/wp-content... NaN ... 15 0 0 29 1 NaN NaN NaN [biowebspin.com, biowebspin.com, google.com, l... 73.0
8658355 0000-0002-3920-7389 1 1 1 а. гусев surname, name gusev alexander leonidovichdate... [alexander l. gusev , alexander leonidovich gu... [[a.l. gusev alternative energy and ecology, ... NaN ... 37 0 0 21 1 NaN NaN NaN [youtube.com, isjaee.com, researchgate.net, re... 111.0
8778864 0000-0002-3997-5070 1 1 1 dr. parameshachari b d dr. parameshachari b dacm distinguished speake... [dr. parameshachari b d] [[gsssietw,mysuru, http://geethashishu.in/], [... NaN ... 47 0 0 48 1 NaN NaN NaN [geethashishu.in, geethashishu.in, acm.org, go... 71.0
9980164 0000-0003-4948-9268 1 1 1 gustavo duperré gustavo norberto duperré graduated in arts and... [gustavo norberto duperré, duperré, g. n., gus... [[gis in cultural heritage - icomos românia, h... gustavo.duperre@usal.edu.ar ... 13 0 0 34 0 usal.edu.ar NaN NaN [icomos.ro, unirioja.es, unirioja.es, unc.edu.... 61.0
10024501 0000-0003-2407-3557 1 1 1 abdul aziz abdul aziz was born on may 25, 1973, in brebes... [abdul aziz, aziz, abdul, aziz, a., aziz, abd,... [[google scholar, https://scholar.google.com/c... NaN ... 19 0 0 77 1 NaN NaN NaN [google.com, syekhnurjati.ac.id, orcid.org, bl... 59.0
10091165 0000-0003-2183-8112 1 1 1 pelayo munhoz olea pós-doutorado em gestão ambiental pela univers... [ munhoz, pelayo olea, olea, pelayo, olea, p... [[currículo lattes, http://lattes.cnpq.br/6209... NaN ... 797 0 1 582 1 NaN NaN NaN [cnpq.br, cnpq.br, cnpq.br, cnpq.br, publons.c... 61.0
10523205 0000-0003-2450-090x 1 1 1 eduard babulak professor eduard babulak is accomplished inter... [professor eduard babulak] [[honorary chair, chief mentor & senior adviso... NaN ... 199 0 1 174 1 NaN NaN NaN [worldassessmentcouncil.org, spseke.sk, bcs.or... 114.0
10696059 0000-0002-6938-9638 1 1 1 adolfo catral sanabria my education is in computer science, mathemati... NaN [[researchgate adolfo catral , https://www.res... NaN ... 2022 0 0 16 1 NaN NaN NaN [researchgate.net, youtube.com, linkedin.com, ... 152.0

13 rows × 29 columns

In [37]:
df[(df['url_domains'].str.len() > 10) & (df['n_works'] > 0) & (df['works_source'].str.len() == 1)]
Out[37]:
orcid claimed verified_email verified_primary_email given_names family_name biography other_names urls primary_email ... n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains n_emails url_domains n_urls
97666 0000-0002-7843-8497 1 1 1 davi barbosa pesquisador na área sociojurídica, professor, ... [professor davi barbosa delmont] [[plataforma de cursos ideia criativa, https:/... NaN ... 0 0 0 0 0 NaN NaN NaN [eadplataforma.com, facebook.com, youtube.com,... 39.0
200670 0000-0003-1554-1531 1 1 1 katarzyna ochman katarzyna ochman [kataˈʐɨna ˈɔxman] is assista... [[kataˈʐɨna ˈɔxman], catharina ochman, cathari... [[researchgate, https://www.researchgate.net/p... NaN ... 1 0 0 0 1 NaN NaN NaN [researchgate.net, academia.edu, facebook.com,... 11.0
210325 0000-0003-3080-4643 1 1 1 graham dawson science and engineering faculty (sef) libraria... [ graham colin dawson, g.c. dawson] [[qut home page, https://www.library.qut.edu.a... g.dawson@qut.edu.au ... 0 0 0 6 1 qut.edu.au NaN NaN [qut.edu.au, qut.edu.au, google.com.au, resear... 11.0
218947 0000-0003-3193-030x 1 1 1 juan pablo wolff mejia aspirante a maestría en derecho y negocios int... [juan pablo wolff, pablo wolff mejia, juan p. ... [[twitter, https://twitter.com/pablomejiam], [... juanpmejia@ulasallista.edu.co ... 0 0 0 0 1 ulasallista.edu.co NaN NaN [twitter.com, youtube.com, google.com, linkedi... 11.0
261974 0000-0002-5341-6531 1 1 1 trent hammond mr trent hammond is an honorary research fello... [trent ernest hammond (t.e.hammond)] [[academic support masters, http://trenthammon... trent.hammond@academicsupportmasters.com.au ... 1 0 0 1 1 academicsupportmasters.com.au [health.nsw.gov.au, csu.edu.au, sociologist.co... 5.0 [wix.com, academia.edu, researchgate.net, rese... 12.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10405738 0000-0002-3374-5709 1 1 1 guillermo ortiz médico, internista, neumólogo, intensivista, e... [guillermo ortiz-ruiz] [[elsevier, https://www.elsevier.com/], [asoci... NaN ... 62 0 0 88 0 NaN NaN NaN [elsevier.com, amci.org.co, springer.com, revi... 12.0
10472264 0000-0001-7228-5680 1 1 1 text protocol NaN NaN [[about, https://about.me/textprotocol], [gith... NaN ... 0 0 0 0 0 NaN NaN NaN [about.me, github.com, gitlab.com, gravatar.co... 12.0
10785961 0000-0002-3064-0194 1 1 1 leonardo fernando cruz basso NaN NaN [[papers-1, https://www.researchgate.net/profi... leonardofernando.basso@mackenzie.br ... 5 0 0 0 1 mackenzie.br [mackenzie.br] 1.0 [researchgate.net, ssrn.com, cnpq.br, google.c... 17.0
10845645 0000-0003-1047-4229 1 1 1 bayu sakti bayu purbha saktisaya adalah bayu purbha sakti... [bayu purbha sakti] [[osf, http://osf.io/qe2ug], [inarxiv, https:/... NaN ... 0 0 0 0 1 NaN NaN NaN [osf.io, osf.io, academia.edu, mendeley.com, f... 12.0
10896059 0000-0003-4836-7074 1 1 1 karla haydee ortiz palafox karla haydee ortíz palafoxmiembro del sistema ... [karla palafox] [[opinión día del maestro, http://www.cronicaj... NaN ... 0 0 0 2 1 NaN NaN NaN [cronicajalisco.com, youtube.com, tlaquepaque.... 22.0

141 rows × 29 columns

In [38]:
exploded_sources = df[(df['url_domains'].str.len() > 10) & (df['n_works'] > 0) & (df['works_source'].str.len() == 1)].explode('works_source').reset_index(drop=True)
exploded_sources
Out[38]:
orcid claimed verified_email verified_primary_email given_names family_name biography other_names urls primary_email ... n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains n_emails url_domains n_urls
0 0000-0002-7843-8497 1 1 1 davi barbosa pesquisador na área sociojurídica, professor, ... [professor davi barbosa delmont] [[plataforma de cursos ideia criativa, https:/... NaN ... 0 0 0 0 0 NaN NaN NaN [eadplataforma.com, facebook.com, youtube.com,... 39.0
1 0000-0003-1554-1531 1 1 1 katarzyna ochman katarzyna ochman [kataˈʐɨna ˈɔxman] is assista... [[kataˈʐɨna ˈɔxman], catharina ochman, cathari... [[researchgate, https://www.researchgate.net/p... NaN ... 1 0 0 0 1 NaN NaN NaN [researchgate.net, academia.edu, facebook.com,... 11.0
2 0000-0003-3080-4643 1 1 1 graham dawson science and engineering faculty (sef) libraria... [ graham colin dawson, g.c. dawson] [[qut home page, https://www.library.qut.edu.a... g.dawson@qut.edu.au ... 0 0 0 6 1 qut.edu.au NaN NaN [qut.edu.au, qut.edu.au, google.com.au, resear... 11.0
3 0000-0003-3193-030x 1 1 1 juan pablo wolff mejia aspirante a maestría en derecho y negocios int... [juan pablo wolff, pablo wolff mejia, juan p. ... [[twitter, https://twitter.com/pablomejiam], [... juanpmejia@ulasallista.edu.co ... 0 0 0 0 1 ulasallista.edu.co NaN NaN [twitter.com, youtube.com, google.com, linkedi... 11.0
4 0000-0002-5341-6531 1 1 1 trent hammond mr trent hammond is an honorary research fello... [trent ernest hammond (t.e.hammond)] [[academic support masters, http://trenthammon... trent.hammond@academicsupportmasters.com.au ... 1 0 0 1 1 academicsupportmasters.com.au [health.nsw.gov.au, csu.edu.au, sociologist.co... 5.0 [wix.com, academia.edu, researchgate.net, rese... 12.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
136 0000-0002-3374-5709 1 1 1 guillermo ortiz médico, internista, neumólogo, intensivista, e... [guillermo ortiz-ruiz] [[elsevier, https://www.elsevier.com/], [asoci... NaN ... 62 0 0 88 0 NaN NaN NaN [elsevier.com, amci.org.co, springer.com, revi... 12.0
137 0000-0001-7228-5680 1 1 1 text protocol NaN NaN [[about, https://about.me/textprotocol], [gith... NaN ... 0 0 0 0 0 NaN NaN NaN [about.me, github.com, gitlab.com, gravatar.co... 12.0
138 0000-0002-3064-0194 1 1 1 leonardo fernando cruz basso NaN NaN [[papers-1, https://www.researchgate.net/profi... leonardofernando.basso@mackenzie.br ... 5 0 0 0 1 mackenzie.br [mackenzie.br] 1.0 [researchgate.net, ssrn.com, cnpq.br, google.c... 17.0
139 0000-0003-1047-4229 1 1 1 bayu sakti bayu purbha saktisaya adalah bayu purbha sakti... [bayu purbha sakti] [[osf, http://osf.io/qe2ug], [inarxiv, https:/... NaN ... 0 0 0 0 1 NaN NaN NaN [osf.io, osf.io, academia.edu, mendeley.com, f... 12.0
140 0000-0003-4836-7074 1 1 1 karla haydee ortiz palafox karla haydee ortíz palafoxmiembro del sistema ... [karla palafox] [[opinión día del maestro, http://www.cronicaj... NaN ... 0 0 0 2 1 NaN NaN NaN [cronicajalisco.com, youtube.com, tlaquepaque.... 22.0

141 rows × 29 columns

In [39]:
exploded_sources[exploded_sources.apply(lambda x: x['works_source'].find(x['given_names']) >= 0, axis=1)]
Out[39]:
orcid claimed verified_email verified_primary_email given_names family_name biography other_names urls primary_email ... n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains n_emails url_domains n_urls
0 0000-0002-7843-8497 1 1 1 davi barbosa pesquisador na área sociojurídica, professor, ... [professor davi barbosa delmont] [[plataforma de cursos ideia criativa, https:/... NaN ... 0 0 0 0 0 NaN NaN NaN [eadplataforma.com, facebook.com, youtube.com,... 39.0
1 0000-0003-1554-1531 1 1 1 katarzyna ochman katarzyna ochman [kataˈʐɨna ˈɔxman] is assista... [[kataˈʐɨna ˈɔxman], catharina ochman, cathari... [[researchgate, https://www.researchgate.net/p... NaN ... 1 0 0 0 1 NaN NaN NaN [researchgate.net, academia.edu, facebook.com,... 11.0
3 0000-0003-3193-030x 1 1 1 juan pablo wolff mejia aspirante a maestría en derecho y negocios int... [juan pablo wolff, pablo wolff mejia, juan p. ... [[twitter, https://twitter.com/pablomejiam], [... juanpmejia@ulasallista.edu.co ... 0 0 0 0 1 ulasallista.edu.co NaN NaN [twitter.com, youtube.com, google.com, linkedi... 11.0
4 0000-0002-5341-6531 1 1 1 trent hammond mr trent hammond is an honorary research fello... [trent ernest hammond (t.e.hammond)] [[academic support masters, http://trenthammon... trent.hammond@academicsupportmasters.com.au ... 1 0 0 1 1 academicsupportmasters.com.au [health.nsw.gov.au, csu.edu.au, sociologist.co... 5.0 [wix.com, academia.edu, researchgate.net, rese... 12.0
5 0000-0001-5295-2271 1 1 1 antoniy moysey NaN NaN [[academic journals database, http://journalda... antoniimoisei@bsmu.edu.ua ... 0 0 0 0 1 bsmu.edu.ua NaN NaN [journaldatabase.info, nplu.org, acls.org, ind... 21.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
135 0000-0002-8125-0081 1 1 1 issam bencheikh NaN [issame1982, دكتور عصام بن الشيخ] [[my blog web site, http://issame1982.blogspot... NaN ... 0 0 0 0 1 NaN NaN NaN [blogspot.com, researchgate.net, google.com, l... 12.0
136 0000-0002-3374-5709 1 1 1 guillermo ortiz médico, internista, neumólogo, intensivista, e... [guillermo ortiz-ruiz] [[elsevier, https://www.elsevier.com/], [asoci... NaN ... 62 0 0 88 0 NaN NaN NaN [elsevier.com, amci.org.co, springer.com, revi... 12.0
137 0000-0001-7228-5680 1 1 1 text protocol NaN NaN [[about, https://about.me/textprotocol], [gith... NaN ... 0 0 0 0 0 NaN NaN NaN [about.me, github.com, gitlab.com, gravatar.co... 12.0
139 0000-0003-1047-4229 1 1 1 bayu sakti bayu purbha saktisaya adalah bayu purbha sakti... [bayu purbha sakti] [[osf, http://osf.io/qe2ug], [inarxiv, https:/... NaN ... 0 0 0 0 1 NaN NaN NaN [osf.io, osf.io, academia.edu, mendeley.com, f... 12.0
140 0000-0003-4836-7074 1 1 1 karla haydee ortiz palafox karla haydee ortíz palafoxmiembro del sistema ... [karla palafox] [[opinión día del maestro, http://www.cronicaj... NaN ... 0 0 0 2 1 NaN NaN NaN [cronicajalisco.com, youtube.com, tlaquepaque.... 22.0

115 rows × 29 columns
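
The last filter seems to target self-asserted works: profiles with many URLs, at least one work, and a single works source whose name contains the registrant's own given names. A minimal sketch of that predicate on a plain dictionary; the function name, the threshold default, and the sample record are assumptions for illustration:

```python
def looks_self_asserted(profile, min_urls=10):
    """Check the three conditions used in the cells above on one profile dict."""
    urls = profile.get('url_domains') or []
    sources = profile.get('works_source') or []
    given = profile.get('given_names')
    if len(urls) <= min_urls or profile.get('n_works', 0) <= 0:
        return False
    if len(sources) != 1 or not given:
        return False
    # The only works source mentions the registrant's given names.
    return given in sources[0]


# Hypothetical record shaped like a row of the dataset.
record = {
    'given_names': 'jane',
    'url_domains': ['example.org'] * 12,
    'n_works': 3,
    'works_source': ['jane doe'],
}
print(looks_self_asserted(record))
```

Unlike the `apply` call above, the sketch returns `False` when `given_names` is missing instead of raising.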

Works source

TODO: paste the analysis from Miriam.

External IDs

External IDs should come from reliable sources. ORCiD registrants cannot add them freely.

In [40]:
df['n_ids'] = df[df['external_ids'].notna()].external_ids.str.len()
In [41]:
df.n_ids.describe()
Out[41]:
count    1.301959e+06
mean     1.358640e+00
std      6.635087e-01
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      2.000000e+00
max      8.000000e+01
Name: n_ids, dtype: float64
In [42]:
df[df.n_ids == df.n_ids.max()]
Out[42]:
orcid claimed verified_email verified_primary_email given_names family_name biography other_names urls primary_email ... n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains n_emails url_domains n_urls n_ids
7253330 0000-0002-9554-6633 1 1 1 john a williams NaN NaN [[aston university profile page, https://resea... NaN ... 0 0 208 1 NaN NaN NaN [aston.ac.uk] 1.0 80.0

1 rows × 30 columns

In [43]:
ids = df[['orcid', 'external_ids']].explode('external_ids').reset_index(drop=True)
In [44]:
ids['provider'] = ids[ids.external_ids.notna()]['external_ids'].apply(lambda x: x[0])
In [45]:
ids[ids.provider.notna()].head()
Out[45]:
                  orcid                     external_ids          provider
7   0000-0001-7463-977x           [loop profile, 371409]      loop profile
9   0000-0001-8718-0056  [scopus author id, 55466912100]  scopus author id
10  0000-0001-8718-0056   [scopus author id, 7102015452]  scopus author id
14  0000-0001-9708-5570      [researcherid, p-5112-2015]      researcherid
15  0000-0001-9708-5570  [scopus author id, 42062216900]  scopus author id
In [46]:
top_ids_providers = ids.groupby('provider').count().sort_values('orcid', ascending=False)
In [47]:
data = [
    go.Bar(
        x=top_ids_providers.index,
        y=top_ids_providers['orcid']
    )
]

layout = go.Layout(
    title='IDs provided by providers',
    xaxis=dict(tickangle=45, tickfont=dict(size=12))
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
In [48]:
pd.unique(ids['provider'])
Out[48]:
array([nan, 'loop profile', 'scopus author id', 'researcherid',
       'scopus author id: ', 'gnd', 'isni', 'ciência id', 'pitt id',
       'id dialnet', 'technical university of denmark cwis',
       'researcher name resolver id', 'scopus author id:',
       'hkust profile', '中国科学家在线', 'cti vitae', 'escientist',
       'researcher id', 'sciprofile', 'digital author id', 'scopus  id',
       'uow scholars', 'authenticusid', 'authenticus', 'authid',
       'hku researcherpage', 'chalmers id', 'iauthor', 'us epa vivo',
       'digital author id (dai)', 'vivo cornell', 'smithsonian profiles',
       'github', 'google scholar', 'scopus id', 'researcherid:', 'dai',
       'kaken', 'orcid id', 'dialnet id', 'profile system identifier',
       'sciprofiles', 'id dialnet:', 'researcherid: ', 'scienceopen',
       'une researcher id', 'custom', 'orcid'], dtype=object)
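
Several labels above are variants of the same provider ('scopus author id', 'scopus author id:', 'scopus author id: '; 'researcherid', 'researcherid:'). A minimal normalization sketch, assuming that trimming whitespace and trailing colons and collapsing inner runs of spaces is enough for a first pass. `normalize_provider` is a hypothetical helper, and it does not merge genuinely different spellings such as 'researcher id' and 'researcherid':

```python
import re


def normalize_provider(label):
    """Lowercase a provider label, collapse whitespace, and drop trailing colons."""
    cleaned = re.sub(r'\s+', ' ', label.strip().lower())
    return cleaned.rstrip(': ').strip()


labels = ['scopus author id', 'scopus author id: ', 'scopus author id:', 'scopus  id', 'researcherid: ']
print(sorted({normalize_provider(label) for label in labels}))
```

Applying such a function to the `provider` column before grouping would keep these variants from showing up as separate bars.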

Keywords

This field is problematic: users can cram several comma-separated keywords into a single entry instead of adding them as separate keywords. Look at this:

In [49]:
df[df['orcid'] == AM]['keywords'].values[0]
Out[49]:
['data science ',
 'science of science',
 'scholarly knowledge mining',
 'open science',
 'research infrastructures']

I did a good job here: one keyword per entry. The following profile, instead, is dirty:

In [50]:
df[df['orcid'] == PP]['keywords'].values[0]
Out[50]:
['open access, open science, libraries, repositories, social web,']

So the keyword field needs some cleaning:

In [51]:
def fix_keywords(lst):
    fixed = set()
    for k in lst:
        # a single entry may pack several comma-separated keywords
        for t in k.split(','):
            fixed.add(t.strip())
    # trailing commas leave empty tokens behind
    fixed.discard('')
    return list(fixed)
In [52]:
df['fixed_keywords'] = df[df.keywords.notna()]['keywords'].apply(lambda x: fix_keywords(x))
In [53]:
df[df['orcid'] == PP]['fixed_keywords'].values[0]
Out[53]:
['open science', 'repositories', 'social web', 'libraries', 'open access']
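
Because `fix_keywords` collects tokens in a `set`, the order of the cleaned keywords is not guaranteed to follow the original entry. If order matters, a variant that de-duplicates with `dict.fromkeys` keeps first-seen order. `fix_keywords_ordered` is a hypothetical name, and this is only a sketch of the same cleaning logic:

```python
def fix_keywords_ordered(keywords):
    """Split comma-packed keywords, strip whitespace, drop blanks and duplicates."""
    tokens = (token.strip() for entry in keywords for token in entry.split(','))
    # dict.fromkeys de-duplicates while preserving first-seen order.
    return list(dict.fromkeys(token for token in tokens if token))


print(fix_keywords_ordered(['open access, open science, libraries, repositories, social web,']))
```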
In [54]:
df['n_keywords'] = df.keywords.str.len()
In [55]:
keywords_by_orcid = df.sort_values('n_keywords', ascending=False)[['orcid', 'n_keywords']]
keywords_by_orcid
Out[55]:
                        orcid  n_keywords
2851081   0000-0002-0673-0341       154.0
7344151   0000-0002-7060-4112       141.0
2235440   0000-0002-6075-3501       140.0
2994233   0000-0002-4071-0301       118.0
3971323   0000-0002-9638-8091       115.0
...                       ...         ...
10916569  0000-0001-5692-7639         NaN
10916570  0000-0003-1539-0999         NaN
10916571  0000-0003-2858-5509         NaN
10916572  0000-0003-2438-9500         NaN
10916573  0000-0003-4119-4772         NaN

10916574 rows × 2 columns

In [56]:
set_top_n(100)
data = [
    go.Bar(
        x=keywords_by_orcid[:TOP_N]['orcid'],
        y=keywords_by_orcid[:TOP_N]['n_keywords']
    )
]

layout = go.Layout(
    title='Keywords provided by ORCiD',
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
In [57]:
top_keywords = df[['orcid', 'keywords']]\
                .explode('keywords')\
                .reset_index(drop=True)\
                .groupby('keywords')\
                .count()\
                .sort_values('orcid', ascending=False)
In [58]:
set_top_n(50)
data = [
    go.Bar(
        x=top_keywords[:TOP_N].index,
        y=top_keywords[:TOP_N]['orcid']
    )
]

layout = go.Layout(
    title='Top-%s keywords occurrence' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12))
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

Correlation

In [59]:
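# Restrict to numeric columns first: recent pandas versions raise on object columns in corr()
fig = px.imshow(df[df.n_ids > 0].select_dtypes('number').corr())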
fig.show()