Exploratory analysis¶
TODO:
- Understanding the reasons behind fake profiles can bring insight into how to catch them (could be trivial with prior knowledge, e.g., SEO hacking => URLs)
- Build a taxonomy of cases (e.g., author publishing with an empty ORCID record, author publishing but not on OpenAIRE, etc.); a first sketch appears right after the dataset is loaded
- Is the temporal dimension of any use?
- Can we access private info thanks to the OpenAIRE-ORCID agreement?
In [1]:
import pandas as pd
import ast
import tldextract
import numpy
import plotly
from plotly.offline import iplot, init_notebook_mode
import plotly.graph_objs as go
import plotly.express as px
init_notebook_mode(connected=True)
TOP_N = 0
TOP_RANGE = [0, 0]
def set_top_n(n):
    global TOP_N, TOP_RANGE
    TOP_N = n
    TOP_RANGE = [-.5, n - 1 + .5]
Notable legitimate ORCID iDs for exploratory purposes:
In [2]:
AM = '0000-0002-5193-7851'
PP = '0000-0002-8588-4196'
Notable anomalies:
In [3]:
JOURNAL = '0000-0003-1815-5732'
NOINFO = '0000-0001-5009-2052'
VALID_NO_OA = '0000-0002-5154-6404' # True profile, but not in OpenAIRE
# todo: find group-shared ORCiD, if possible
Notable fake ORCID iDs:
In [4]:
SCAFFOLD = '0000-0001-5004-7761'
WHATSAPP = '0000-0001-6997-9470'
PENIS = '0000-0002-3399-7287'
BITCOIN = '0000-0002-7518-6845'
FITNESS_CHINA = '0000-0002-1234-835X' # URL record + employment
CANNABIS = '0000-0002-9025-8632' # URL > 70 + works (REMOVED)
PLUMBER = '0000-0002-1700-8311' # URL > 10 + works
Load the dataset
In [5]:
df = pd.read_pickle('../data/processed/dataset.pkl')
df.head(5)
Out[5]:
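As a first step towards the case taxonomy in the TODO list, here is a minimal sketch (not executed in this notebook) that builds boolean case flags from columns used later on (primary_email, urls, external_ids, n_works); the flag names are illustrative assumptions.
# Sketch: boolean "case" flags per profile; column names assumed from the rest of this notebook.
cases = pd.DataFrame({'orcid': df['orcid']})
cases['has_email'] = df['primary_email'].notna()
cases['has_urls'] = df['urls'].notna()
cases['has_external_ids'] = df['external_ids'].notna()
cases['has_works'] = df['n_works'] > 0
# e.g., profiles claiming works but exposing no externally asserted identifier
cases[cases['has_works'] & ~cases['has_external_ids']].head()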
Inspection of notable profiles
In [6]:
df[df['orcid'] == AM]
Out[6]:
In [7]:
df[df['orcid'] == WHATSAPP]
Out[7]:
In [8]:
df.count()
Out[8]:
In [9]:
df['orcid'].describe()
Out[9]:
Primary email¶
In [10]:
df['primary_email'].describe()
Out[10]:
Duplicate emails
In [11]:
df['primary_email'].dropna().loc[df['primary_email'].duplicated()]
Out[11]:
In [12]:
df[df['primary_email'] == 'maykin@owasp.org']
Out[12]:
In [13]:
df[df['primary_email'] == 'opercin@erbakan.edu.tr']
Out[13]:
In [14]:
df[df['primary_email'] == 'patrick.davey@monash.edu']
Out[14]:
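The three lookups above can be generalized; a minimal sketch (not executed here) listing every primary email shared by more than one ORCID iD:
# Sketch: primary emails appearing on more than one profile, with counts.
shared_emails = df.dropna(subset=['primary_email'])\
    .groupby('primary_email')['orcid']\
    .count()
shared_emails[shared_emails > 1].sort_values(ascending=False)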
In [15]:
df['primary_email_domain'] = df[df.primary_email.notna()]['primary_email'].apply(lambda x: x.split('@')[1])
In [16]:
df['primary_email_domain'].describe()
Out[16]:
In [17]:
top_primary_emails = df[['primary_email_domain', 'orcid']]\
    .groupby('primary_email_domain')\
    .count()\
    .sort_values('orcid', ascending=False)
top_primary_emails
Out[17]:
In [18]:
set_top_n(30)
data = [
    go.Bar(
        x=top_primary_emails[:TOP_N].index,
        y=top_primary_emails[:TOP_N]['orcid']
    )
]
layout = go.Layout(
    title='Top-%s email domains' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
Other emails¶
In [19]:
def extract_email_domains(lst):
    res = []
    for email in lst:
        res.append(email.split('@')[1])
    return res
In [20]:
df['other_email_domains'] = df[df.other_emails.notna()]['other_emails'].apply(lambda x: extract_email_domains(x))
In [21]:
df[df['other_email_domains'].notna()].head()
Out[21]:
In [22]:
df['n_emails'] = df['other_emails'].str.len()
In [23]:
emails_by_orcid = df.sort_values('n_emails', ascending=False)
In [24]:
set_top_n(30)
data = [
    go.Bar(
        x=emails_by_orcid[:TOP_N]['orcid'],
        y=emails_by_orcid[:TOP_N]['n_emails']
    )
]
layout = go.Layout(
    title='Top %s ORCID iDs by email' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
In [25]:
top_other_emails = df[['orcid', 'other_email_domains']]\
    .explode('other_email_domains')\
    .reset_index(drop=True)\
    .groupby('other_email_domains')\
    .count()\
    .sort_values('orcid', ascending=False)
In [26]:
set_top_n(30)
data = [
    go.Bar(
        x=top_other_emails[:TOP_N].index,
        y=top_other_emails[:TOP_N]['orcid']
    )
]
layout = go.Layout(
    title='Top %s other email domains' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
Email speculation¶
In [27]:
df[df['primary_email'].isna() & df['other_emails'].notna()]
Out[27]:
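Another possible angle: the share of profiles registered with free webmail providers versus institutional or other domains. The provider list below is a small hand-made assumption for illustration, not an exhaustive blocklist.
# Assumed, non-exhaustive set of free webmail providers (illustrative only).
FREE_MAIL = {'gmail.com', 'yahoo.com', 'hotmail.com', 'outlook.com', 'qq.com', '163.com'}
free_mail = df['primary_email_domain'].isin(FREE_MAIL)
free_mail[df['primary_email_domain'].notna()].value_counts(normalize=True)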
URLs¶
In [28]:
def extract_url_domains(lst):
    domains = []
    for e in lst:
        # e[0] is a string describing the url
        # e[1] is the url
        domain = tldextract.extract(e[1])
        domains.append(domain.registered_domain)
    return domains
In [29]:
df['url_domains'] = df[df.urls.notna()]['urls'].apply(lambda x: extract_url_domains(x))
In [30]:
df[df['url_domains'].notna()].head()
Out[30]:
In [31]:
df['n_urls'] = df['url_domains'].str.len()
In [32]:
urls_by_orcid = df.sort_values('n_urls', ascending=False)[['orcid', 'n_urls']]
urls_by_orcid
Out[32]:
In [33]:
set_top_n(100)
data = [
    go.Bar(
        x=urls_by_orcid[:TOP_N]['orcid'],
        y=urls_by_orcid[:TOP_N]['n_urls']
    )
]
layout = go.Layout(
    title='Top %s ORCID iDs with URLs' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
In [34]:
top_urls = df[['orcid', 'url_domains']]\
    .explode('url_domains')\
    .reset_index(drop=True)\
    .groupby('url_domains')\
    .count()\
    .sort_values('orcid', ascending=False)
In [35]:
set_top_n(30)
data = [
    go.Bar(
        x=top_urls[:TOP_N].index,
        y=top_urls[:TOP_N]['orcid']
    )
]
layout = go.Layout(
    title='Top-%s URL domains' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
URL speculation¶
In [36]:
df[(df['url_domains'].str.len() > 50) & (df['n_works'] > 0)]
Out[36]:
In [37]:
df[(df['url_domains'].str.len() > 10) & (df['n_works'] > 0) & (df['works_source'].str.len() == 1)]
Out[37]:
In [38]:
exploded_sources = df[(df['url_domains'].str.len() > 10) & (df['n_works'] > 0) & (df['works_source'].str.len() == 1)].explode('works_source').reset_index(drop=True)
exploded_sources
Out[38]:
In [39]:
exploded_sources[exploded_sources.apply(lambda x: x['works_source'].find(x['given_names']) >= 0, axis=1)]
Out[39]:
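Following the TODO note on SEO hacking, a hedged sketch that flags profiles whose URL domains contain commercially loaded terms; the keyword list is a hand-made assumption for illustration, not a curated blocklist.
# Hypothetical spam hints; a real detector would need a curated list.
SPAM_HINTS = ['casino', 'escort', 'loan', 'pills', 'bitcoin', 'fitness', 'plumb']

def looks_spammy(domains):
    # True if any registered domain contains any of the hint substrings
    return any(hint in d for d in domains for hint in SPAM_HINTS)

spam_flag = df['url_domains'].apply(lambda ds: looks_spammy(ds) if isinstance(ds, list) else False)
df[spam_flag][['orcid', 'url_domains', 'n_works']]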
Works source¶
TODO: paste the works-source analysis from Miriam.
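In the meantime, a minimal sketch of what the tabulation could look like, assuming works_source holds a list of source names per profile (as used in the URL speculation above):
# Sketch: how many profiles each works source appears on.
top_works_sources = df[['orcid', 'works_source']]\
    .explode('works_source')\
    .groupby('works_source')\
    .count()\
    .sort_values('orcid', ascending=False)
top_works_sources.head(20)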
External IDs¶
External IDs should come from reliable sources. ORCiD registrants cannot add them freely.
In [40]:
df['n_ids'] = df[df['external_ids'].notna()].external_ids.str.len()
In [41]:
df.n_ids.describe()
Out[41]:
In [42]:
df[df.n_ids == df.n_ids.max()]
Out[42]:
In [43]:
ids = df[['orcid', 'external_ids']].explode('external_ids').reset_index(drop=True)
In [44]:
ids['provider'] = ids[ids.external_ids.notna()]['external_ids'].apply(lambda x: x[0])
In [45]:
ids[ids.provider.notna()].head()
Out[45]:
In [46]:
top_ids_providers = ids.groupby('provider').count().sort_values('orcid', ascending=False)
In [47]:
data = [
    go.Bar(
        x=top_ids_providers.index,
        y=top_ids_providers['orcid']
    )
]
layout = go.Layout(
    title='External IDs per provider',
    xaxis=dict(tickangle=45, tickfont=dict(size=12))
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
In [48]:
pd.unique(ids['provider'])
Out[48]:
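Since registrants cannot add external IDs themselves, their mere presence can be read as a weak trust signal; a quick sketch of how many profiles carry at least one:
# Share of profiles with at least one externally asserted identifier.
(df['n_ids'].fillna(0) > 0).value_counts(normalize=True)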
Keywords¶
This field is problematic: users can cram multiple keywords into a single entry instead of listing them separately. Look at this one:
In [49]:
df[df['orcid'] == AM]['keywords'].values[0]
Out[49]:
I did a good job here. The following one, instead, is dirty:
In [50]:
df[df['orcid'] == PP]['keywords'].values[0]
Out[50]:
So the keywords field needs some cleaning:
In [51]:
def fix_keywords(lst):
    fixed = set()
    for k in lst:
        tokens = set(k.split(','))
        # tokens.remove('')
        for t in tokens:
            fixed.add(str.strip(t))
    fixed.discard('')
    return list(fixed)
In [52]:
df['fixed_keywords'] = df[df.keywords.notna()]['keywords'].apply(lambda x: fix_keywords(x))
In [53]:
df[df['orcid'] == PP]['fixed_keywords'].values[0]
Out[53]:
In [54]:
df['n_keywords'] = df.keywords.str.len()
In [55]:
keywords_by_orcid = df.sort_values('n_keywords', ascending=False)[['orcid', 'n_keywords']]
keywords_by_orcid
Out[55]:
In [56]:
set_top_n(100)
data = [
    go.Bar(
        x=keywords_by_orcid[:TOP_N]['orcid'],
        y=keywords_by_orcid[:TOP_N]['n_keywords']
    )
]
layout = go.Layout(
    title='Top %s ORCID iDs by number of keywords' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
In [57]:
top_keywords = df[['orcid', 'keywords']]\
    .explode('keywords')\
    .reset_index(drop=True)\
    .groupby('keywords')\
    .count()\
    .sort_values('orcid', ascending=False)
In [58]:
set_top_n(50)
data = [
    go.Bar(
        x=top_keywords[:TOP_N].index,
        y=top_keywords[:TOP_N]['orcid']
    )
]
layout = go.Layout(
    title='Top-%s keywords occurrence' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12))
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
Correlation¶
In [59]:
fig = px.imshow(df[df.n_ids > 0].corr())
fig.show()
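Note that recent pandas releases no longer silently drop non-numeric columns in DataFrame.corr(); a sketch restricting the heatmap explicitly to the numeric features derived above:
# Restrict to numeric columns before correlating; works on both old and new pandas.
numeric_cols = df[df.n_ids > 0].select_dtypes(include='number')
fig = px.imshow(numeric_cols.corr())
fig.show()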