Exploratory analysis
TODO:
- Understanding why fake profiles are created can suggest how to catch them (this could be trivial with prior knowledge, e.g., SEO hacking => look at URLs)
- Build a taxonomy of cases (e.g., an author who publishes but has an empty ORCID profile, an author who publishes but is not in OpenAIRE, etc.)
- Is the temporal dimension of any use?
- Can we access private info thanks to the OpenAIRE-ORCID agreement?
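The case taxonomy mentioned in the list above could start from a handful of explicit rules. The following is only a sketch on a made-up toy frame: the case names, the helper `assign_case`, and the reading of `label` as "matched in OpenAIRE" are assumptions for illustration, not definitions taken from the dataset.

```python
import pandas as pd

# Toy frame reusing a few dataset column names; all values are invented.
toy = pd.DataFrame({
    "orcid": ["A", "B", "C", "D"],
    "n_works": [0, 3, 0, 5],
    "n_urls": pd.array([pd.NA, pd.NA, 27, 2], dtype="Int64"),
    "label": [False, True, False, False],  # ASSUMPTION: True = matched in OpenAIRE
})

def assign_case(row):
    """Return a coarse, hypothetical case name for one profile."""
    has_urls = pd.notna(row["n_urls"]) and row["n_urls"] > 0
    if row["n_works"] == 0 and not has_urls:
        return "empty"
    if row["n_works"] == 0 and has_urls:
        return "urls_only"
    if row["n_works"] > 0 and not row["label"]:
        return "works_not_matched"
    return "works_matched"

toy["case"] = toy.apply(assign_case, axis=1)
case_counts = toy["case"].value_counts()
print(case_counts)
```

On the full dataset a vectorised version would likely be preferable to a row-wise `apply`; the rule-based form is used here only because it is easy to read and extend.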
import glob
import pandas as pd
import ast
import tldextract
import numpy as np
import antispam
import plotly
from plotly.offline import iplot, init_notebook_mode
import plotly.graph_objs as go
import plotly.express as px
init_notebook_mode(connected=True)
TOP_N = 0
TOP_RANGE = [0, 0]
def set_top_n(n):
    global TOP_N, TOP_RANGE
    TOP_N = n
    TOP_RANGE = [-.5, n - 1 + .5]
pd.set_option('display.max_columns', None)
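As a side note, the axis range computed by `set_top_n` frames the first `n` bars because bar `i` is centred on position `i` of a categorical axis. A possible alternative that avoids module-level globals is sketched below; the name `top_range` is hypothetical and not part of the notebook.

```python
def top_range(n):
    """Return the x-axis range that frames exactly the first n bars.

    Bars sit at positions 0 .. n-1, so half a unit of padding is added
    on each side.
    """
    return [-0.5, n - 0.5]

top_n = 30
x_range = top_range(top_n)
print(x_range)
```

The returned list could be passed as `range=` in the `xaxis` dictionary in place of `TOP_RANGE`.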
Notable legitimate ORCID iDs for exploratory purposes:
AM = '0000-0002-5193-7851'
PP = '0000-0002-8588-4196'
Notable anomalies:
JOURNAL = '0000-0003-1815-5732'
NOINFO = '0000-0001-5009-2052'
VALID_NO_OA = '0000-0002-5154-6404' # True profile, but not in OpenAIRE
WORK_MISUSE = '0000-0001-7870-1120'
# TODO: find a group-shared ORCID iD, if possible
Notable fake ORCID iDs:
SCAFFOLD = '0000-0001-5004-7761'
WHATSAPP = '0000-0001-6997-9470'
PENIS = '0000-0002-3399-7287'
BITCOIN = '0000-0002-7518-6845'
FITNESS_CHINA = '0000-0002-1234-835X' # URL record + employment
CANNABIS = '0000-0002-9025-8632' # URL > 70 + works (REMOVED)
PLUMBER = '0000-0002-1700-8311' # URL > 10 + works
Load the dataset
parts = glob.glob('../data/processed/dataset.pkl.*')
df = pd.concat((pd.read_pickle(part) for part in sorted(parts)))
df.head(5)
index | orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0000-0001-6097-3953 | False | False | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | NaN | 0 | NaN | 2018-03-02t09:29:16.528z | 2018-03-02t09:43:07.551z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
1 | 0000-0001-6112-5550 | True | True | <NA> | <NA> | <NA> | [v.i. yurtaev; v. yurtaev] | <NA> | NaN | NaN | NaN | [[professor, peoples friendship university of ... | 0 | NaN | 2018-04-03t07:50:23.358z | 2020-03-18t09:42:44.753z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | 1 |
2 | 0000-0001-6152-2695 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | NaN | 0 | NaN | 2019-12-11t15:31:56.388z | 2020-01-28t15:34:17.309z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
3 | 0000-0001-6220-5683 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | [[research scientist, new york university abu ... | 0 | NaN | 2015-08-18t12:36:45.307z | 2020-09-23t13:37:54.180z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | 1 |
4 | 0000-0001-7071-8294 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | [[researcher (academic), universidad de zarago... | 0 | NaN | 2014-03-10t13:22:01.966z | 2016-06-14t22:17:54.470z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | 2 |
Inspection of notable profiles
df[df['orcid'] == AM]
index | orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3073261 | 0000-0002-5193-7851 | True | True | andrea | mannocci | data scientist & researcher; scholarly knowled... | NaN | andrea.mannocci@isti.cnr.it | [open science, data science, science of scienc... | scopus author id, 55233589900 | [[information engineering, ph.d., università d... | [[research associate, istituto di scienza e te... | 37 | [scopus - elsevier, crossref metadata search, ... | 2017-09-12t14:28:33.467z | 2021-03-17t15:40:07.776z | 34 | 0 | 0 | 60 | True | isti.cnr.it | NaN | [github.io, twitter.com, linkedin.com] | <NA> | 3 | 1 | 5 | 4 | 5 |
df[df['orcid'] == WHATSAPP]
index | orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9887272 | 0000-0001-6997-9470 | True | True | other | <NA> | NaN | <NA> | [whatsapp gb baixar, whatsapp gb 2020, whatsap... | NaN | NaN | NaN | 0 | NaN | 2020-10-07t10:37:12.237z | 2020-10-08t02:32:03.935z | 0 | 0 | 0 | 0 | False | NaN | NaN | [otherwhatsapp.com, im-creator.com, facebook.c... | <NA> | 27 | <NA> | 4 | <NA> | <NA> |
df.count()
orcid                     10989649
verified_email            10989649
verified_primary_email    10989649
given_names               10959039
family_name               10671715
biography                   354015
other_names                 554684
primary_email               124722
keywords                    649637
external_ids               1308598
education                  2441645
employment                 2680488
n_works                   10989649
works_source               2740939
activation_date           10989649
last_update_date          10989649
n_doi                     10989649
n_arxiv                   10989649
n_pmc                     10989649
n_other_pids              10989649
label                     10989649
primary_email_domain        124722
other_email_domains          48615
url_domains                 715067
n_emails                     48615
n_urls                      715067
n_ids                      1308598
n_keywords                  649637
n_education                2441645
n_employment               2680488
dtype: int64
df['orcid'].describe()
count     10989649
unique    10989649
top       0000-0001-5242-3687
freq      1
Name: orcid, dtype: object
Primary email
df['primary_email'].describe()
count     124722
unique    124718
top       opercin@erbakan.edu.tr
freq      2
Name: primary_email, dtype: object
Duplicate emails
df['primary_email'].dropna().loc[df['primary_email'].duplicated()]
1681787     opercin@erbakan.edu.tr
5590332     patrick.davey@monash.edu
9316843     maykin@owasp.org
10375852    andycheng2026@163.com
Name: primary_email, dtype: string
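Note that `duplicated()` defaults to `keep='first'`, so the listing above contains only the second and later occurrences of each repeated address. To retrieve every record sharing an address in one pass, `keep=False` can be used. The toy series below is invented for illustration.

```python
import pandas as pd

# Made-up addresses; missing values mimic profiles without a public email.
emails = pd.Series(
    ["a@x.org", None, "b@y.edu", "a@x.org", None, "c@z.com"],
    dtype="string",
    name="primary_email",
)

non_null = emails.dropna()

# Default keep='first': the first occurrence is NOT flagged.
repeats_only = non_null[non_null.duplicated()]

# keep=False: every member of a duplicated group is flagged.
all_copies = non_null[non_null.duplicated(keep=False)]

print(repeats_only)
print(all_copies)
```

Dropping missing values first matters because repeated missing entries would otherwise be flagged as duplicates of each other.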
df[df['primary_email'] == 'maykin@owasp.org']
index | orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7543981 | 0000-0002-0836-2271 | True | True | maykin | warasart | <NA> | NaN | maykin@owasp.org | NaN | NaN | NaN | NaN | 0 | NaN | 2020-09-15t04:43:55.709z | 2020-09-15t05:17:28.509z | 0 | 0 | 0 | 0 | False | owasp.org | [dga.or.th] | NaN | 1 | <NA> | <NA> | <NA> | <NA> | <NA> |
9316843 | 0000-0001-9855-1676 | True | True | maykin | warasart | <NA> | NaN | maykin@owasp.org | NaN | NaN | NaN | NaN | 0 | NaN | 2020-10-23t17:51:51.925z | 2021-01-01t15:00:52.053z | 0 | 0 | 0 | 0 | False | owasp.org | [dga.or.th, ieee.org] | NaN | 2 | <NA> | <NA> | <NA> | <NA> | <NA> |
df[df['primary_email'] == 'opercin@erbakan.edu.tr']
index | orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
347852 | 0000-0002-2232-9638 | True | True | osman | perçin | <NA> | NaN | opercin@erbakan.edu.tr | NaN | NaN | NaN | NaN | 0 | NaN | 2015-01-12t13:47:55.549z | 2020-01-27t07:38:24.269z | 0 | 0 | 0 | 0 | False | erbakan.edu.tr | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
1681787 | 0000-0003-0033-0918 | True | True | osman | perçin | <NA> | NaN | opercin@erbakan.edu.tr | NaN | NaN | NaN | [[, necmettin erbakan university, konya, , tr,... | 0 | NaN | 2015-10-13t05:47:12.014z | 2020-12-25t13:52:03.976z | 0 | 0 | 0 | 0 | False | erbakan.edu.tr | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | 1 |
df[df['primary_email'] == 'patrick.davey@monash.edu']
index | orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
954085 | 0000-0002-9158-1757 | True | True | patrick | davey | <NA> | NaN | patrick.davey@monash.edu | [radiochemistry, inorganic chemistry, bioinorg... | NaN | NaN | [[phd student, monash university, melbourne, ,... | 0 | NaN | 2019-05-09t23:01:02.170z | 2019-08-20t03:00:17.844z | 0 | 0 | 0 | 0 | False | monash.edu | NaN | NaN | <NA> | <NA> | <NA> | 4 | <NA> | 1 |
5590332 | 0000-0002-8774-0030 | True | True | patrick | davey | <NA> | NaN | patrick.davey@monash.edu | NaN | NaN | NaN | [[phd student, monash university, melbourne, v... | 1 | [crossref] | 2018-09-11t10:47:10.997z | 2021-02-09t06:21:44.138z | 1 | 0 | 0 | 0 | True | monash.edu | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | 1 |
df['primary_email_domain'].describe()
count     124722
unique    17160
top       gmail.com
freq      26750
Name: primary_email_domain, dtype: object
top_primary_emails = df[['primary_email_domain', 'orcid']]\
    .groupby('primary_email_domain')\
    .count()\
    .sort_values('orcid', ascending=False)
top_primary_emails
primary_email_domain | orcid |
---|---|
gmail.com | 26750 |
hotmail.com | 3801 |
yahoo.com | 2625 |
163.com | 2132 |
yuhs.ac | 1134 |
... | ... |
imf.csic.es | 1 |
imf.org | 1 |
imfd.tu-freiberg.de | 1 |
imft.fr | 1 |
zzuli.edu.cn | 1 |
17160 rows × 1 columns
set_top_n(30)
data = [
    go.Bar(
        x=top_primary_emails[:TOP_N].index,
        y=top_primary_emails[:TOP_N]['orcid']
    )
]
layout = go.Layout(
    title='Top-%s email domains' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
Other emails
df[df.other_email_domains.notna()].head()
index | orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
251 | 0000-0002-5916-446X | True | True | antonio gilvan | teixeira júnior | <NA> | [teixeira, antônio gilvan, júnior, antonio gil... | gilvan.junior@aluno.ufca.edu.br | [ethicis; medicine; infectology; neurology; ne... | [[scopus author id, 56647743200], [scopus auth... | [[faculty of health and life sciences, , unive... | NaN | 14 | [antonio gilvan teixeira júnior, scopus - else... | 2016-05-18t11:26:36.642z | 2016-09-20t18:25:05.728z | 13 | 0 | 0 | 8 | False | aluno.ufca.edu.br | [liverpool.ac.uk] | [researchgate.net, academia.edu, cnpq.br] | 1 | 3 | 4 | 1 | 1 | <NA> |
316 | 0000-0002-8742-947X | True | True | aaron | tan shing loong | <NA> | NaN | aaron.tanshingloong@wadh.ox.ac.uk | NaN | NaN | [[ruskin school of art; wadham college, , univ... | NaN | 0 | NaN | 2015-10-05t23:10:08.771z | 2016-06-14t19:55:50.313z | 0 | 0 | 0 | 0 | False | wadh.ox.ac.uk | [rsa.ox.ac.uk] | NaN | 1 | <NA> | <NA> | <NA> | 1 | <NA> |
433 | 0000-0001-9097-2281 | True | True | abhishek | solanki | <NA> | NaN | <NA> | NaN | NaN | NaN | [[senior engineer, robert bosch (india), benga... | 1 | [abhishek solanki] | 2019-04-22t04:43:06.232z | 2020-07-02t14:18:28.305z | 0 | 0 | 0 | 0 | False | NaN | [in.bosch.com] | [github.com, linkedin.com] | 1 | 2 | <NA> | <NA> | <NA> | 2 |
497 | 0000-0002-8614-3007 | True | True | adam | arra | <NA> | NaN | <NA> | NaN | NaN | NaN | NaN | 0 | NaN | 2017-11-15t06:33:45.625z | 2017-11-15t06:44:02.998z | 0 | 0 | 0 | 0 | False | NaN | [hct.ac.ae] | NaN | 1 | <NA> | <NA> | <NA> | <NA> | <NA> |
869 | 0000-0001-9884-5498 | True | True | alberto | ronzani | <NA> | NaN | alberto@aronza.com | NaN | NaN | NaN | [[research scientist, vtt technical research c... | 19 | [crossref metadata search, alberto ronzani, cr... | 2014-04-16t13:21:54.287z | 2020-09-28t15:10:37.439z | 18 | 0 | 0 | 3 | True | aronza.com | [vtt.fi] | NaN | 1 | <NA> | <NA> | <NA> | <NA> | 1 |
emails_by_orcid = df[['orcid', 'n_emails']].sort_values('n_emails', ascending=False)
set_top_n(30)
data = [
    go.Bar(
        x=emails_by_orcid[:TOP_N]['orcid'],
        y=emails_by_orcid[:TOP_N]['n_emails']
    )
]
layout = go.Layout(
    title='Top %s ORCID iDs by email' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
top_other_emails = df[['orcid', 'other_email_domains']]\
    .explode('other_email_domains')\
    .reset_index(drop=True)\
    .groupby('other_email_domains')\
    .count()\
    .sort_values('orcid', ascending=False)
set_top_n(30)
data = [
    go.Bar(
        x=top_other_emails[:TOP_N].index,
        y=top_other_emails[:TOP_N]['orcid']
    )
]
layout = go.Layout(
    title='Top %s other email domains' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
This makes sense: legitimate users may register a Gmail account as their primary address for login purposes and list institutional addresses as additional emails. It also makes life easier when they move to another institution.
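One way to probe this hypothesis is to measure, among profiles whose primary domain is a free-mail provider, the share that also lists at least one non-free-mail domain among the other emails. The sketch below runs on an invented toy frame; the `FREE_MAIL` set and the helper `has_non_free_domain` are assumptions, and the column is assumed to hold plain Python lists.

```python
import pandas as pd

# ASSUMPTION: a short, hand-picked list of free-mail providers.
FREE_MAIL = {"gmail.com", "hotmail.com", "yahoo.com", "163.com"}

# Toy rows reusing the dataset column names; values are invented.
toy = pd.DataFrame({
    "primary_email_domain": ["gmail.com", "gmail.com", "isti.cnr.it", None],
    "other_email_domains": [["unipi.it"], ["hotmail.com"], None, ["yale.edu"]],
})

def has_non_free_domain(domains):
    """True if the list holds at least one domain outside FREE_MAIL."""
    if not isinstance(domains, list):
        return False  # missing value
    return any(d not in FREE_MAIL for d in domains)

free_primary = toy["primary_email_domain"].isin(FREE_MAIL)
other_institutional = toy["other_email_domains"].apply(has_non_free_domain)

# Share of free-mail-primary profiles that also expose a non-free domain.
share = (free_primary & other_institutional).sum() / free_primary.sum()
print(share)
```

A high share on the real data would be consistent with the explanation, although "not a free-mail provider" is only a rough proxy for "institutional".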
Email speculation
df[df.primary_email.isna() & df.other_email_domains.notna()]
index | orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
433 | 0000-0001-9097-2281 | True | True | abhishek | solanki | <NA> | NaN | <NA> | NaN | NaN | NaN | [[senior engineer, robert bosch (india), benga... | 1 | [abhishek solanki] | 2019-04-22t04:43:06.232z | 2020-07-02t14:18:28.305z | 0 | 0 | 0 | 0 | False | NaN | [in.bosch.com] | [github.com, linkedin.com] | 1 | 2 | <NA> | <NA> | <NA> | 2 |
497 | 0000-0002-8614-3007 | True | True | adam | arra | <NA> | NaN | <NA> | NaN | NaN | NaN | NaN | 0 | NaN | 2017-11-15t06:33:45.625z | 2017-11-15t06:44:02.998z | 0 | 0 | 0 | 0 | False | NaN | [hct.ac.ae] | NaN | 1 | <NA> | <NA> | <NA> | <NA> | <NA> |
898 | 0000-0003-3728-6439 | True | True | alejandra | echeverry velásquez | alejandra echeverry is an industrial electrici... | NaN | <NA> | [control, technology, science, innovation, eng... | NaN | [[, electrical engineer, institución universit... | [[professor, institución universitaria pascual... | 1 | [crossref] | 2019-03-31t00:00:42.929z | 2020-09-06t02:18:54.290z | 1 | 0 | 0 | 0 | True | NaN | [pascualbravo.edu.co] | NaN | 1 | <NA> | <NA> | 7 | 1 | 1 |
1719 | 0000-0001-8330-7443 | True | True | andrea | tesoniero | <NA> | NaN | <NA> | NaN | researcherid, d-9056-2015 | [[department of geophysics, master of science ... | [[postdoctoral associate, yale university, new... | 4 | [andrea tesoniero] | 2015-03-09t11:59:06.093z | 2020-08-20t15:03:23.447z | 4 | 0 | 0 | 2 | False | NaN | [yale.edu] | NaN | 1 | <NA> | 1 | <NA> | 4 | 2 |
6829 | 0000-0001-9670-515X | True | True | esma esin | yildirim | <NA> | NaN | <NA> | [pharmacognosy, natural chemistry, chemical en... | NaN | [[business management, master of science, ista... | NaN | 0 | NaN | 2020-07-26t10:38:03.721z | 2020-07-26t10:52:26.539z | 0 | 0 | 0 | 0 | False | NaN | [gmail.com] | NaN | 1 | <NA> | <NA> | 3 | 3 | <NA> |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10985816 | 0000-0003-1204-6009 | True | True | nathan | walk | <NA> | NaN | <NA> | NaN | NaN | [[department of physics, doctor of philosophy,... | [[, university of oxford, oxford, oxfordshire,... | 10 | [crossref metadata search] | 2016-07-28t14:24:16.844z | 2020-10-13t11:47:50.621z | 10 | 0 | 0 | 0 | True | NaN | [cs.ox.ac.uk] | [fu-berlin.de] | 1 | 1 | <NA> | <NA> | 3 | 2 |
10986027 | 0000-0002-3472-7668 | True | True | raf | vandevelde | <NA> | NaN | <NA> | NaN | NaN | [[chemical engineering technology, master, kat... | [[phd researcher, katholieke universiteit leuv... | 0 | NaN | 2020-10-14t13:56:44.779z | 2020-10-16t14:21:40.673z | 0 | 0 | 0 | 0 | False | NaN | [kuleuven.be] | [linkedin.com] | 1 | 1 | <NA> | <NA> | 2 | 1 |
10987501 | 0000-0002-9602-0529 | True | True | carlos augusto | finelli | <NA> | NaN | <NA> | NaN | NaN | NaN | NaN | 1 | [crossref] | 2013-09-16t16:52:06.120z | 2020-12-01t22:47:08.074z | 1 | 0 | 0 | 0 | True | NaN | [cecot.com.br] | NaN | 1 | <NA> | <NA> | <NA> | <NA> | <NA> |
10987829 | 0000-0003-4402-5982 | True | True | filipe | de almeida araújo | <NA> | NaN | <NA> | NaN | NaN | [[materials science, msc. materials science, m... | [[co-owner, aeft acessory, manaus, amazonas, b... | 0 | NaN | 2020-03-02t20:11:01.699z | 2020-12-04t13:53:39.404z | 0 | 0 | 0 | 0 | False | NaN | [ime.eb.br] | NaN | 1 | <NA> | <NA> | <NA> | 2 | 1 |
10988444 | 0000-0002-1734-7241 | True | True | manareldeen | ahmed | <NA> | NaN | <NA> | [deep learning, atomistic simulation, graphene... | NaN | NaN | [[post-doctor, zhejiang university, hangzhou, ... | 6 | [manareldeen ahmed] | 2017-02-17t13:18:36.540z | 2020-12-04t02:04:36.668z | 6 | 0 | 0 | 3 | True | NaN | [hotmail.com] | NaN | 1 | <NA> | <NA> | 5 | <NA> | 1 |
19814 rows × 30 columns
URLs
df[df.url_domains.notna()].head()
index | orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 0000-0001-7402-0096 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | [[, kth royal institute of technology, stockho... | 0 | NaN | 2015-01-11t15:13:06.467z | 2016-06-14t23:55:59.896z | 0 | 0 | 0 | 0 | False | NaN | NaN | [kth.se] | <NA> | 1 | <NA> | <NA> | <NA> | 1 |
11 | 0000-0001-8377-3508 | True | True | <NA> | <NA> | <NA> | [fontana, milena da silva] | <NA> | [educação; informática; matemática.] | NaN | NaN | [[, instituto federal de educação, ciência e t... | 0 | NaN | 2018-05-23t23:39:04.534z | 2019-10-16t02:50:11.007z | 0 | 0 | 0 | 0 | False | NaN | NaN | [cnpq.br] | <NA> | 1 | <NA> | 1 | <NA> | 3 |
29 | 0000-0002-2638-4108 | True | True | <NA> | <NA> | investigador de la universidad de oviedo. depa... | NaN | <NA> | [constitutional history, history of political ... | scopus author id, 54394231000 | [[public law, ph doctor, university of oviedo,... | [[professor of constitutional law, university ... | 1 | [crossref] | 2013-03-25t14:38:06.016z | 2020-07-01t13:10:37.025z | 1 | 0 | 0 | 0 | False | NaN | NaN | [unioviedo.es] | <NA> | 1 | 1 | 3 | 1 | 1 |
46 | 0000-0003-1435-6545 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | [prostate cancer, migration, culture cell] | researcherid, p-2223-2018 | [[morfologia, , universidade estadual paulista... | [[, universidade estadual paulista (unesp), in... | 0 | NaN | 2018-08-09t12:12:24.405z | 2020-04-22t01:38:03.184z | 0 | 0 | 0 | 0 | False | NaN | NaN | [cnpq.br, linkedin.com] | <NA> | 2 | 1 | 3 | 1 | 1 |
158 | 0000-0003-1284-9741 | True | True | alex percy antonio | manriquez paisig | <NA> | NaN | <NA> | NaN | NaN | NaN | NaN | 0 | NaN | 2020-09-08t20:04:33.906z | 2020-09-08t20:25:55.432z | 0 | 0 | 0 | 0 | False | NaN | NaN | [youtube.com] | <NA> | 1 | <NA> | <NA> | <NA> | <NA> |
urls_by_orcid = df[['orcid', 'n_urls']].sort_values('n_urls', ascending=False)
urls_by_orcid
index | orcid | n_urls |
---|---|---|
3226518 | 0000-0002-1234-835X | 219 |
4206055 | 0000-0001-7478-4539 | 174 |
4901870 | 0000-0002-7392-3792 | 169 |
8184260 | 0000-0002-6938-9638 | 152 |
2743648 | 0000-0002-5710-4041 | 114 |
... | ... | ... |
10989644 | 0000-0002-1686-1935 | <NA> |
10989645 | 0000-0002-3800-6331 | <NA> |
10989646 | 0000-0002-8783-5814 | <NA> |
10989647 | 0000-0002-7584-2283 | <NA> |
10989648 | 0000-0003-0529-3538 | <NA> |
10989649 rows × 2 columns
set_top_n(100)
data = [
    go.Bar(
        x=urls_by_orcid[:TOP_N]['orcid'],
        y=urls_by_orcid[:TOP_N]['n_urls']
    )
]
layout = go.Layout(
    title='Top %s ORCID iDs with URLs' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
top_urls = df[['orcid', 'url_domains']]\
    .explode('url_domains')\
    .reset_index(drop=True)\
    .groupby('url_domains')\
    .count()\
    .sort_values('orcid', ascending=False)
set_top_n(50)
data = [
    go.Bar(
        x=top_urls[:TOP_N].index,
        y=top_urls[:TOP_N]['orcid']
    )
]
layout = go.Layout(
    title='Top-%s URL domains' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
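A caveat on the ranking above: after `explode`, a profile that lists the same domain several times contributes one row per URL, so `count()` measures URLs per domain rather than profiles per domain. The sketch below contrasts the two on an invented toy frame; it is an illustration, not a replacement for the cell above.

```python
import pandas as pd

# Toy rows; profile A repeats the same domain three times.
toy = pd.DataFrame({
    "orcid": ["A", "B", "C"],
    "url_domains": [
        ["cnpq.br", "cnpq.br", "cnpq.br"],
        ["cnpq.br", "linkedin.com"],
        None,
    ],
})

# One row per (profile, URL domain); profiles without URLs are dropped.
exploded = toy.explode("url_domains").dropna(subset=["url_domains"])

# Number of URLs pointing to each domain (what count() measures).
urls_per_domain = exploded.groupby("url_domains")["orcid"].count()

# Number of distinct profiles mentioning each domain.
profiles_per_domain = exploded.groupby("url_domains")["orcid"].nunique()

print(urls_per_domain)
print(profiles_per_domain)
```

Which of the two counts is more informative depends on the question: URL volume highlights heavy repeaters, while distinct profiles reflects how widespread a domain is.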
URLs speculation
df[(df['url_domains'].str.len() > 50) & (df['n_works'] > 0)]
index | orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1025713 | 0000-0003-2407-3557 | True | True | abdul | aziz | abdul aziz was born on may 25, 1973, in brebes... | [abdul aziz, aziz, abdul, aziz, a., aziz, abd,... | <NA> | [metodologi penelitian, ilmu ekonomi, ekonomi ... | NaN | [[ilmu ekonomi, dr, universitas borobudur, jak... | [[assisten professor/dr, institut agama islam ... | 72 | [base - bielefeld academic search engine, abdu... | 2016-09-12t04:41:24.842z | 2021-01-26t11:58:33.039z | 19 | 0 | 0 | 77 | False | NaN | NaN | [google.com, syekhnurjati.ac.id, orcid.org, bl... | <NA> | 59 | <NA> | 4 | 3 | 1 |
2743648 | 0000-0002-5710-4041 | True | True | ryszard | romaniuk | professor of electronics and communications en... | [r.romaniuk, r.s.romaniuk, ryszard romaniuk, r... | rrom@ise.pw.edu.pl | [electronics, measurement systems, research sy... | [[isni, 0000000071432485], [researcherid, b-91... | [[faculty of electronics and information techn... | [[professor, institute director, politechnika ... | 5008 | [inspire-hep, researcherid, isni2orcid search ... | 2013-01-20t12:09:21.600z | 2021-03-16t19:37:31.650z | 1221 | 25 | 0 | 1742 | True | ise.pw.edu.pl | [ise.pw.edu.pl, elka.pw.edu.pl, cern.ch] | [google.pl, publons.com, scopus.com, mendeley.... | 3 | 114 | 3 | 5 | 1 | 1 |
3011724 | 0000-0003-2450-090X | True | True | eduard | babulak | professor eduard babulak is accomplished inter... | [professor eduard babulak] | <NA> | [quality of service provision assessment, next... | [[scopus author id, 6506867432], [researcherid... | [[information technology, doctor habilitated (... | [[consultant, horizon 2020 framework programme... | 274 | [the lens, base - bielefeld academic search en... | 2013-04-03t08:02:30.013z | 2021-02-28t10:07:13.231z | 199 | 0 | 1 | 174 | False | NaN | NaN | [worldassessmentcouncil.org, spseke.sk, bcs.or... | <NA> | 114 | 5 | 8 | 6 | 22 |
3881064 | 0000-0002-3920-7389 | True | True | а. | гусев | surname, name gusev alexander leonidovichdate... | [alexander l. gusev , alexander leonidovich gu... | <NA> | [technologies of production, technologies of i... | [[researcherid, f-8048-2014], [scopus author i... | [[chemical technology and cryogenic-vacuum tec... | [[general director, scientific technical centr... | 472 | [publons, datacite, scopus - elsevier, a.l. gu... | 2014-05-14t00:01:28.030z | 2021-01-16t13:44:14.134z | 37 | 0 | 0 | 21 | False | NaN | NaN | [youtube.com, isjaee.com, researchgate.net, re... | <NA> | 111 | 2 | 16 | 2 | 7 |
7466062 | 0000-0002-1929-6054 | True | True | franklin américo | canaza choque | docente-investigador social. maestrando en der... | [franklin américo canaza-choque , franklin a. ... | leo_123fa@hotmail.com | [filosofía; educación; políticas de desarrollo... | [[researcherid, p-8613-2018], [loop profile, 8... | [[facultad de ciencias de la educación , maest... | [[investigador social, universidad católica de... | 39 | [researcherid, base - bielefeld academic searc... | 2017-09-15t19:45:43.483z | 2021-03-23t20:12:47.297z | 30 | 0 | 0 | 34 | True | hotmail.com | [gmail.com, gmail.com, hotmail.com, baldwin.ed... | [concytec.gob.pe, redalyc.org, redalyc.org, un... | 5 | 61 | 4 | 2 | 1 | 1 |
7517096 | 0000-0003-4948-9268 | True | True | gustavo | duperré | gustavo norberto duperré graduated in arts and... | [gustavo norberto duperré, duperré, g. n., gus... | gustavo.duperre@usal.edu.ar | [computer science, sciences of antiquity, cont... | [[scopus author id, 57195936346], [researcheri... | [[programme in history, history of art and ter... | [[titular professor, dirección general de cult... | 41 | [gustavo duperré, scopus - elsevier, publons, ... | 2020-02-22t15:49:52.386z | 2021-03-12t15:13:44.065z | 13 | 0 | 0 | 34 | False | usal.edu.ar | NaN | [icomos.ro, unirioja.es, unirioja.es, unc.edu.... | <NA> | 61 | 2 | 11 | 6 | 5 |
8068275 | 0000-0003-2183-8112 | True | True | pelayo munhoz | olea | pós-doutorado em gestão ambiental pela univers... | [ munhoz, pelayo olea, olea, pelayo, olea, p... | <NA> | [empreendedorismo, sustentabilidade, inovação] | [[scopus author id, 55175503300], [researcheri... | [[, postdoctoral in environmental sustainabili... | [[professor, universidade federal do rio grand... | 1109 | [the lens, pelayo munhoz olea, dimensions, bas... | 2013-02-04t17:25:34.723z | 2021-03-19t18:51:01.128z | 798 | 0 | 1 | 582 | True | NaN | NaN | [cnpq.br, cnpq.br, cnpq.br, cnpq.br, publons.c... | <NA> | 61 | 2 | 3 | 7 | 9 |
8184260 | 0000-0002-6938-9638 | True | True | adolfo | catral sanabria | my education is in computer science, mathemati... | NaN | <NA> | NaN | loop profile, 747193 | [[education, capacitación para la enseñanza en... | NaN | 2023 | [base - bielefeld academic search engine, data... | 2019-05-07t19:27:02.210z | 2020-12-10t23:39:15.236z | 2022 | 0 | 0 | 16 | False | NaN | NaN | [researchgate.net, youtube.com, linkedin.com, ... | <NA> | 152 | 1 | <NA> | 6 | <NA> |
8791256 | 0000-0002-9025-8632 | True | True | buycannabis | dispensary | we procure and deliver premium cannabis strain... | [we procure and deliver premium cannabis strai... | <NA> | [marijuana dispensary, cannabis, canabis dispe... | NaN | NaN | NaN | 10 | [goowonderland dispensary] | 2020-12-09t21:19:46.004z | 2020-12-10t01:17:28.772z | 0 | 0 | 0 | 0 | False | NaN | NaN | [goowonderland.com, goowonderland.com, goowond... | <NA> | 81 | <NA> | 7 | <NA> | <NA> |
10174509 | 0000-0002-9965-2425 | True | True | jaroslaw | spychala | jaroslaw spychala has received a doctoral degr... | [jaroslaw jozef spychala] | <NA> | [photochemistry, medicinal and pharmaceutical ... | scopus author id, 7006745874 | [[department of chemistry, postdoctoral associ... | [[assistant professor, adam mickiewicz univers... | 29 | [scopus - elsevier] | 2014-09-18t12:34:14.242z | 2020-02-11t14:31:25.544z | 15 | 0 | 0 | 29 | True | NaN | NaN | [biowebspin.com, biowebspin.com, google.com, l... | <NA> | 73 | 1 | 4 | 4 | 2 |
10257808 | 0000-0002-4062-3603 | True | True | juan de dios | beltrán mancilla | juan de dios beltrán mancilla (*) filósofo aut... | [juan de dios beltrán mancilla, filósofo autod... | <NA> | [filosofia medicina arquitectura economía dere... | NaN | [[, diplomado en practicas directivas para or... | [[inspector general jornada vespertina // de 2... | 11 | [juan de dios beltr´´án mancilla] | 2020-04-19t21:06:33.495z | 2021-02-10t20:13:07.698z | 0 | 0 | 0 | 7 | False | NaN | NaN | [yumpu.com, ijopm.org, google.com, blogspot.co... | <NA> | 69 | <NA> | 1 | 8 | 6 |
10486212 | 0000-0002-3997-5070 | True | True | dr. parameshachari | b d | dr. parameshachari b dacm distinguished speake... | [dr. parameshachari b d] | <NA> | [mysore region coordinator|ieee bangalore sect... | [[researcherid, f-7045-2018], [scopus author i... | [[electronics and communication engineering, p... | [[acm distinguished speaker (volunteer), assoc... | 93 | [publons, multidisciplinary digital publishing... | 2016-08-24t11:00:30.403z | 2021-03-23t07:16:22.582z | 47 | 0 | 0 | 48 | False | NaN | NaN | [geethashishu.in, geethashishu.in, acm.org, go... | <NA> | 71 | 3 | 6 | 5 | 10 |
10652632 | 0000-0003-2593-7134 | True | True | aan | jaelani | all my papers can be downloaded from portal:re... | [jaelani, a., jaelani, aan] | aan_jaelani@syekhnurjati.ac.id | [islamic economics, islamic finance and bankin... | [[scopus author id, 57195963463], [loop profil... | [[post graduate, s3/dr, universitas islam nege... | [[dr, institut agama islam negeri syekh nurjat... | 79 | [publons, aan jaelani, scopus - elsevier, dime... | 2016-03-02t18:37:44.989z | 2021-03-19t10:11:57.908z | 88 | 0 | 0 | 193 | True | syekhnurjati.ac.id | [gmail.com] | [microsoft.com, twitter.com, academia.edu, aca... | 1 | 67 | 4 | 7 | 2 | 1 |
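In the table above, some profiles spread their URLs over many different domains while others repeat the same one (the cannabis dispensary lists `goowonderland.com` over and over). A possible additional signal, sketched here on invented data, is the ratio of distinct domains to total URLs. The helper `distinct_domain_ratio` is hypothetical, the column is assumed to hold plain Python lists, and legitimate profiles also repeat domains, so this could only ever be one weak feature among several.

```python
import pandas as pd

# Toy profiles; "shop.example" is a placeholder domain, not real data.
toy = pd.DataFrame({
    "orcid": ["SPAM", "LEGIT", "NO_URLS"],
    "url_domains": [
        ["shop.example"] * 8,
        ["scopus.com", "publons.com", "google.pl", "cern.ch"],
        None,
    ],
})

def distinct_domain_ratio(domains):
    """Distinct domains divided by total URLs; None when there are no URLs."""
    if not isinstance(domains, list) or not domains:
        return None
    return len(set(domains)) / len(domains)

toy["distinct_ratio"] = toy["url_domains"].apply(distinct_domain_ratio)
print(toy[["orcid", "distinct_ratio"]])
```

A ratio close to zero means nearly all URLs point to the same site; a ratio of one means every URL points to a different domain.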
df[(df['url_domains'].str.len() > 10) & (df['n_works'] > 0) & (df['works_source'].str.len() == 1)]
index | orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
47439 | 0000-0002-5967-2835 | True | True | oleksiy | goryayinov | <NA> | [алексей николаевич горяинов, о.м.горяїнов, а.... | <NA> | [diagnostics, transport, logistics] | researcherid, i-7977-2016 | [[, дистанционный курс «ctl.sc2x: supply chain... | [[docent, kharkiv petro vasylenko national tec... | 274 | [oleksiy goryayinov] | 2014-08-03t18:06:42.925z | 2021-03-22t13:56:48.311z | 0 | 0 | 0 | 0 | False | NaN | NaN | [khntusg.com.ua, khntusg.com.ua, google.com.ua... | <NA> | 13 | 1 | 3 | 14 | 7 |
72557 | 0000-0002-3505-2797 | True | True | nurul | malahayati | google scholar | NaN | <NA> | NaN | researcherid, q-3861-2017 | [[civil and transportation engineering , maste... | [[senior lecturer, universitas syiah kuala, ba... | 6 | [nurul malahayati] | 2017-10-01t00:46:31.324z | 2019-08-19t15:52:47.253z | 3 | 0 | 0 | 3 | False | NaN | NaN | [google.com, ristekdikti.go.id, unsyiah.ac.id,... | <NA> | 16 | 1 | <NA> | 2 | 1 |
94081 | 0000-0003-3670-9620 | True | True | carlos | barrera | im individual inventor, and this is my work; s... | [retrodynamic, novelinflow] | <NA> | [energy, technology, gearturbine, imploturboco... | loop profile, 394457 | NaN | NaN | 1 | [carlos barrera] | 2016-08-29t20:32:10.362z | 2021-02-09t04:56:35.554z | 0 | 0 | 0 | 0 | False | NaN | NaN | [blogspot.mx, behance.net, authorstream.com, d... | <NA> | 24 | 1 | 8 | <NA> | <NA> |
261673 | 0000-0002-5441-0465 | True | True | nuria | hernández-león | <NA> | [nuria h. león, nuria hernández león, hernánde... | <NA> | [training, icts, business management, research... | NaN | [[, course: social skills, university of salam... | [[merchandise reception and expedition trainer... | 11 | [nuria hernández-león] | 2015-11-28t07:18:58.442z | 2021-03-05t16:37:47.403z | 1 | 0 | 0 | 4 | False | NaN | NaN | [feriaempresamujer.com, escueladenegociosydire... | <NA> | 16 | <NA> | 7 | 19 | 16 |
326211 | 0000-0002-7781-6767 | True | True | mohd nazri | ismail | born in penang, malaysia in 1971, dr. mohd had... | [ndum (national defence university of malaysia)] | <NA> | [wsn, manet, simulation and modelling, network... | [[scopus author id, 24372977800], [researcheri... | NaN | [[lecturer, universiti pertahanan nasional mal... | 35 | [scopus - elsevier] | 2016-09-06t02:25:52.974z | 2020-10-20t06:55:55.051z | 24 | 0 | 0 | 35 | True | NaN | NaN | [google.com.my, researchgate.net, academia.edu... | <NA> | 16 | 2 | 10 | <NA> | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10579801 | 0000-0001-5087-6965 | True | True | robert | ohara | systematics, evolutionary biology, and the his... | [r. o’hara, r.j. o’hara, robert o’hara, robert... | <NA> | [history and philosophy of science, evolutiona... | [[isni, 0000000138200102], [researcherid, b-47... | [[biology, ph.d., harvard university, cambridg... | NaN | 45 | [robert j. o’hara] | 2014-09-21t02:45:19.620z | 2020-07-09t06:51:09.228z | 23 | 0 | 0 | 72 | True | NaN | NaN | [rjohara.net, google.com, collegiateway.org, r... | <NA> | 12 | 3 | 5 | 1 | <NA> |
10590882 | 0000-0002-3318-9861 | True | True | shagufta | perveen | prof. dr. shagufta perveen is a professor at k... | NaN | shagufta792000@yahoo.com | [shagufta perveen professor, shagufta perveen ... | NaN | [[hej research institute of chemistry, phd che... | [[professor, king saud university college of p... | 66 | [scopus - elsevier] | 2015-12-21t10:34:06.771z | 2021-02-22t14:58:30.893z | 56 | 0 | 0 | 66 | True | yahoo.com | [msu.edu, ksu.edu.sa] | [shaguftaperveen.com, researchgate.net, ksu.ed... | 2 | 11 | <NA> | 25 | 3 | 7 |
10766062 | 0000-0001-8960-9004 | True | True | susan | bastani | <NA> | [s. bastani, سوسن باستانی] | sbastani@alzahra.ac.ir | [social networks, fuzzy logic, online and offl... | scopus author id, 16642098400 | [[sociology, ph.d., university of toronto, tor... | [[professor, alzahra university, tehran, vanak... | 20 | [scopus - elsevier] | 2019-07-10t06:50:46.255z | 2020-10-07t04:08:01.961z | 19 | 0 | 0 | 33 | True | alzahra.ac.ir | [gmail.com, gmail.com] | [scopus.com, google.com, publons.com, zenodo.o... | 2 | 11 | 1 | 4 | 3 | 4 |
10807839 | 0000-0002-4379-6454 | True | True | caroline wanjiru | kariuki | caroline holds a phd in economics from curtin ... | NaN | <NA> | [development economics, applied econometrics, ... | NaN | [[economics, doctor of philosophy , curtin uni... | [[director, educational development, strathmor... | 4 | [caroline wanjiru kariuki] | 2020-03-18t10:18:04.007z | 2021-02-11t14:40:38.515z | 1 | 0 | 0 | 0 | False | NaN | NaN | [scopus.com, mendeley.com, publons.com, resear... | <NA> | 13 | <NA> | 4 | 3 | 6 |
10911966 | 0000-0003-2311-0600 | True | True | myo | kyaw hlaing | <NA> | [dr myo kyaw hlaing] | <NA> | [economic geology] | NaN | NaN | [[lecturer, union of myanmar ministry of educa... | 2 | [myo kyaw hlaing] | 2018-12-26t12:51:57.801z | 2021-01-26t14:36:47.421z | 1 | 0 | 0 | 2 | False | NaN | NaN | [facebook.com, linkedin.com, instagram.com, re... | <NA> | 12 | <NA> | 1 | <NA> | 2 |
140 rows × 30 columns
exploded_sources = df[(df['url_domains'].str.len() > 10) & (df['n_works'] > 0) & (df['works_source'].str.len() == 1)].explode('works_source').reset_index(drop=True)
exploded_sources
orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0000-0002-5967-2835 | True | True | oleksiy | goryayinov | <NA> | [алексей николаевич горяинов, о.м.горяїнов, а.... | <NA> | [diagnostics, transport, logistics] | researcherid, i-7977-2016 | [[, дистанционный курс «ctl.sc2x: supply chain... | [[docent, kharkiv petro vasylenko national tec... | 274 | oleksiy goryayinov | 2014-08-03t18:06:42.925z | 2021-03-22t13:56:48.311z | 0 | 0 | 0 | 0 | False | NaN | NaN | [khntusg.com.ua, khntusg.com.ua, google.com.ua... | <NA> | 13 | 1 | 3 | 14 | 7 |
1 | 0000-0002-3505-2797 | True | True | nurul | malahayati | google scholar | NaN | <NA> | NaN | researcherid, q-3861-2017 | [[civil and transportation engineering , maste... | [[senior lecturer, universitas syiah kuala, ba... | 6 | nurul malahayati | 2017-10-01t00:46:31.324z | 2019-08-19t15:52:47.253z | 3 | 0 | 0 | 3 | False | NaN | NaN | [google.com, ristekdikti.go.id, unsyiah.ac.id,... | <NA> | 16 | 1 | <NA> | 2 | 1 |
2 | 0000-0003-3670-9620 | True | True | carlos | barrera | im individual inventor, and this is my work; s... | [retrodynamic, novelinflow] | <NA> | [energy, technology, gearturbine, imploturboco... | loop profile, 394457 | NaN | NaN | 1 | carlos barrera | 2016-08-29t20:32:10.362z | 2021-02-09t04:56:35.554z | 0 | 0 | 0 | 0 | False | NaN | NaN | [blogspot.mx, behance.net, authorstream.com, d... | <NA> | 24 | 1 | 8 | <NA> | <NA> |
3 | 0000-0002-5441-0465 | True | True | nuria | hernández-león | <NA> | [nuria h. león, nuria hernández león, hernánde... | <NA> | [training, icts, business management, research... | NaN | [[, course: social skills, university of salam... | [[merchandise reception and expedition trainer... | 11 | nuria hernández-león | 2015-11-28t07:18:58.442z | 2021-03-05t16:37:47.403z | 1 | 0 | 0 | 4 | False | NaN | NaN | [feriaempresamujer.com, escueladenegociosydire... | <NA> | 16 | <NA> | 7 | 19 | 16 |
4 | 0000-0002-7781-6767 | True | True | mohd nazri | ismail | born in penang, malaysia in 1971, dr. mohd had... | [ndum (national defence university of malaysia)] | <NA> | [wsn, manet, simulation and modelling, network... | [[scopus author id, 24372977800], [researcheri... | NaN | [[lecturer, universiti pertahanan nasional mal... | 35 | scopus - elsevier | 2016-09-06t02:25:52.974z | 2020-10-20t06:55:55.051z | 24 | 0 | 0 | 35 | True | NaN | NaN | [google.com.my, researchgate.net, academia.edu... | <NA> | 16 | 2 | 10 | <NA> | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
135 | 0000-0001-5087-6965 | True | True | robert | ohara | systematics, evolutionary biology, and the his... | [r. o’hara, r.j. o’hara, robert o’hara, robert... | <NA> | [history and philosophy of science, evolutiona... | [[isni, 0000000138200102], [researcherid, b-47... | [[biology, ph.d., harvard university, cambridg... | NaN | 45 | robert j. o’hara | 2014-09-21t02:45:19.620z | 2020-07-09t06:51:09.228z | 23 | 0 | 0 | 72 | True | NaN | NaN | [rjohara.net, google.com, collegiateway.org, r... | <NA> | 12 | 3 | 5 | 1 | <NA> |
136 | 0000-0002-3318-9861 | True | True | shagufta | perveen | prof. dr. shagufta perveen is a professor at k... | NaN | shagufta792000@yahoo.com | [shagufta perveen professor, shagufta perveen ... | NaN | [[hej research institute of chemistry, phd che... | [[professor, king saud university college of p... | 66 | scopus - elsevier | 2015-12-21t10:34:06.771z | 2021-02-22t14:58:30.893z | 56 | 0 | 0 | 66 | True | yahoo.com | [msu.edu, ksu.edu.sa] | [shaguftaperveen.com, researchgate.net, ksu.ed... | 2 | 11 | <NA> | 25 | 3 | 7 |
137 | 0000-0001-8960-9004 | True | True | susan | bastani | <NA> | [s. bastani, سوسن باستانی] | sbastani@alzahra.ac.ir | [social networks, fuzzy logic, online and offl... | scopus author id, 16642098400 | [[sociology, ph.d., university of toronto, tor... | [[professor, alzahra university, tehran, vanak... | 20 | scopus - elsevier | 2019-07-10t06:50:46.255z | 2020-10-07t04:08:01.961z | 19 | 0 | 0 | 33 | True | alzahra.ac.ir | [gmail.com, gmail.com] | [scopus.com, google.com, publons.com, zenodo.o... | 2 | 11 | 1 | 4 | 3 | 4 |
138 | 0000-0002-4379-6454 | True | True | caroline wanjiru | kariuki | caroline holds a phd in economics from curtin ... | NaN | <NA> | [development economics, applied econometrics, ... | NaN | [[economics, doctor of philosophy , curtin uni... | [[director, educational development, strathmor... | 4 | caroline wanjiru kariuki | 2020-03-18t10:18:04.007z | 2021-02-11t14:40:38.515z | 1 | 0 | 0 | 0 | False | NaN | NaN | [scopus.com, mendeley.com, publons.com, resear... | <NA> | 13 | <NA> | 4 | 3 | 6 |
139 | 0000-0003-2311-0600 | True | True | myo | kyaw hlaing | <NA> | [dr myo kyaw hlaing] | <NA> | [economic geology] | NaN | NaN | [[lecturer, union of myanmar ministry of educa... | 2 | myo kyaw hlaing | 2018-12-26t12:51:57.801z | 2021-01-26t14:36:47.421z | 1 | 0 | 0 | 2 | False | NaN | NaN | [facebook.com, linkedin.com, instagram.com, re... | <NA> | 12 | <NA> | 1 | <NA> | 2 |
140 rows × 30 columns
exploded_sources[exploded_sources.apply(lambda x: x['works_source'].find(x['given_names']) >= 0, axis=1)]
orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0000-0002-5967-2835 | True | True | oleksiy | goryayinov | <NA> | [алексей николаевич горяинов, о.м.горяїнов, а.... | <NA> | [diagnostics, transport, logistics] | researcherid, i-7977-2016 | [[, дистанционный курс «ctl.sc2x: supply chain... | [[docent, kharkiv petro vasylenko national tec... | 274 | oleksiy goryayinov | 2014-08-03t18:06:42.925z | 2021-03-22t13:56:48.311z | 0 | 0 | 0 | 0 | False | NaN | NaN | [khntusg.com.ua, khntusg.com.ua, google.com.ua... | <NA> | 13 | 1 | 3 | 14 | 7 |
1 | 0000-0002-3505-2797 | True | True | nurul | malahayati | google scholar | NaN | <NA> | NaN | researcherid, q-3861-2017 | [[civil and transportation engineering , maste... | [[senior lecturer, universitas syiah kuala, ba... | 6 | nurul malahayati | 2017-10-01t00:46:31.324z | 2019-08-19t15:52:47.253z | 3 | 0 | 0 | 3 | False | NaN | NaN | [google.com, ristekdikti.go.id, unsyiah.ac.id,... | <NA> | 16 | 1 | <NA> | 2 | 1 |
2 | 0000-0003-3670-9620 | True | True | carlos | barrera | im individual inventor, and this is my work; s... | [retrodynamic, novelinflow] | <NA> | [energy, technology, gearturbine, imploturboco... | loop profile, 394457 | NaN | NaN | 1 | carlos barrera | 2016-08-29t20:32:10.362z | 2021-02-09t04:56:35.554z | 0 | 0 | 0 | 0 | False | NaN | NaN | [blogspot.mx, behance.net, authorstream.com, d... | <NA> | 24 | 1 | 8 | <NA> | <NA> |
3 | 0000-0002-5441-0465 | True | True | nuria | hernández-león | <NA> | [nuria h. león, nuria hernández león, hernánde... | <NA> | [training, icts, business management, research... | NaN | [[, course: social skills, university of salam... | [[merchandise reception and expedition trainer... | 11 | nuria hernández-león | 2015-11-28t07:18:58.442z | 2021-03-05t16:37:47.403z | 1 | 0 | 0 | 4 | False | NaN | NaN | [feriaempresamujer.com, escueladenegociosydire... | <NA> | 16 | <NA> | 7 | 19 | 16 |
5 | 0000-0001-7010-2908 | True | True | clara | sarmento | clara sarmento holds an aggregation in cultura... | NaN | <NA> | [portuguese culture and literature, cultural a... | ciência id, d418-d6f8-7d49 | [[ao abrigo da bolsa santander ie best practic... | [[presidente da comissão de acreditação do nov... | 275 | clara sarmento | 2013-12-12t00:33:58.190z | 2020-10-12t14:43:00.749z | 17 | 0 | 0 | 60 | True | NaN | NaN | [iscap.pt, google.pt, academia.edu, researchga... | <NA> | 13 | 1 | 6 | 8 | 37 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
133 | 0000-0003-1020-1351 | True | True | sheikh saifullah | ahmed | sheikh saifullah ahmed is a full-time lecturer... | NaN | saifullahahmedku@gmail.com | [south asian literature, postmodern literature... | NaN | [[english discipline , ma & ba in english , kh... | [[lecturer , international university of busin... | 3 | sheikh saifullah ahmed | 2020-04-08t21:00:11.201z | 2021-02-12t20:45:32.247z | 2 | 0 | 0 | 3 | False | gmail.com | NaN | [academia.edu, iubat.edu, google.com, research... | <NA> | 12 | <NA> | 5 | 1 | 1 |
134 | 0000-0001-7228-5680 | True | True | text | protocol | <NA> | NaN | <NA> | NaN | NaN | NaN | [[engineer, textprotocol.org, palo alto, ca, u... | 1 | text protocol | 2021-03-09t10:30:32.237z | 2021-03-21t17:17:40.500z | 0 | 0 | 0 | 0 | False | NaN | NaN | [about.me, figma.com, github.com, gitlab.com, ... | <NA> | 15 | <NA> | <NA> | <NA> | 1 |
135 | 0000-0001-5087-6965 | True | True | robert | ohara | systematics, evolutionary biology, and the his... | [r. o’hara, r.j. o’hara, robert o’hara, robert... | <NA> | [history and philosophy of science, evolutiona... | [[isni, 0000000138200102], [researcherid, b-47... | [[biology, ph.d., harvard university, cambridg... | NaN | 45 | robert j. o’hara | 2014-09-21t02:45:19.620z | 2020-07-09t06:51:09.228z | 23 | 0 | 0 | 72 | True | NaN | NaN | [rjohara.net, google.com, collegiateway.org, r... | <NA> | 12 | 3 | 5 | 1 | <NA> |
138 | 0000-0002-4379-6454 | True | True | caroline wanjiru | kariuki | caroline holds a phd in economics from curtin ... | NaN | <NA> | [development economics, applied econometrics, ... | NaN | [[economics, doctor of philosophy , curtin uni... | [[director, educational development, strathmor... | 4 | caroline wanjiru kariuki | 2020-03-18t10:18:04.007z | 2021-02-11t14:40:38.515z | 1 | 0 | 0 | 0 | False | NaN | NaN | [scopus.com, mendeley.com, publons.com, resear... | <NA> | 13 | <NA> | 4 | 3 | 6 |
139 | 0000-0003-2311-0600 | True | True | myo | kyaw hlaing | <NA> | [dr myo kyaw hlaing] | <NA> | [economic geology] | NaN | NaN | [[lecturer, union of myanmar ministry of educa... | 2 | myo kyaw hlaing | 2018-12-26t12:51:57.801z | 2021-01-26t14:36:47.421z | 1 | 0 | 0 | 2 | False | NaN | NaN | [facebook.com, linkedin.com, instagram.com, re... | <NA> | 12 | <NA> | 1 | <NA> | 2 |
113 rows × 30 columns
Works source¶
def remove_own_source(lst, given, family):
    res = []
    for ws in lst:
        if ws.lower().find(given.lower()) == -1:
            if pd.notna(family):
                if ws.lower().find(family.lower()) == -1:
                    res.append(ws)
            else:
                res.append(ws)
    return res
df['ext_works_source'] = df[(df.works_source.notna()) & (df.given_names.notna())]\
.apply(lambda x: remove_own_source(x['works_source'], x['given_names'], x['family_name']), axis=1)
df['n_ext_work_source'] = df.ext_works_source.str.len()
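To make the effect of `remove_own_source` easier to see, here is a minimal plain-Python sketch of the same filtering rule on a toy list. The function name `drop_self_asserted` and the sample values are illustrative assumptions, not part of the notebook; `None` stands in for a missing family name.

```python
def drop_self_asserted(sources, given, family=None):
    """Keep only the sources that do not contain the owner's names."""
    given = given.lower()
    family = family.lower() if family else None
    kept = []
    for source in sources:
        name = source.lower()
        if given in name:
            continue  # mentions the given name: treated as self-asserted
        if family is not None and family in name:
            continue  # mentions the family name: treated as self-asserted
        kept.append(source)
    return kept


# Toy input: one self-asserted source between two external ones.
sources = ["scopus - elsevier", "susan bastani", "crossref"]
external = drop_self_asserted(sources, "susan", "bastani")
print(external)
```

Because the rule is a substring match, a short name can also remove a genuine external source: a given name of `ross`, for instance, would drop `crossref`. A stricter comparison (whole-word or full-string) may be worth considering if that turns out to matter.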
exploded_external_sources = df[df['ext_works_source'].str.len() > 0][['orcid','ext_works_source']]\
.explode('ext_works_source').reset_index(drop=True)
grouped_ext_sources = exploded_external_sources.groupby('ext_works_source')\
.count()\
.sort_values('orcid', ascending=False)\
.reset_index()
data = [
go.Bar(
x=grouped_ext_sources[:30].ext_works_source,
y=grouped_ext_sources[:30].orcid
)
]
layout = go.Layout(
title='Top 30 works_source',
xaxis=dict(tickangle=45, tickfont=dict(size=12))
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
authoritative_sources = grouped_ext_sources[grouped_ext_sources['orcid'] > 2]
authoritative_sources
ext_works_source | orcid | |
---|---|---|
0 | crossref | 1460841 |
1 | scopus - elsevier | 902231 |
2 | crossref metadata search | 297684 |
3 | multidisciplinary digital publishing institute | 281664 |
4 | europe pubmed central | 181605 |
... | ... | ... |
337 | uta - oa journal global insight | 3 |
338 | francis crick institute | 3 |
339 | anna | 3 |
340 | santos | 3 |
341 | universitäts- und stadtbibliothek köln | 3 |
342 rows × 2 columns
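The `> 2` cut-off keeps every source listed by at least three profiles, which is why name-like entries such as `anna` and `santos` appear at the bottom of the table. The sketch below, on made-up pairs, shows how sensitive the resulting set is to that threshold; the helper `sources_above` and all sample values are assumptions for illustration only.

```python
from collections import Counter

# Toy (profile, source) pairs; placeholders, not rows from the dataset.
pairs = [
    ("profile-1", "crossref"),
    ("profile-2", "crossref"),
    ("profile-3", "crossref"),
    ("profile-4", "crossref"),
    ("profile-1", "anna"),
    ("profile-2", "anna"),
    ("profile-3", "anna"),
    ("profile-4", "some lab"),
]

# Number of rows per source, as in the groupby/count above.
counts = Counter(source for _, source in pairs)


def sources_above(counts, min_profiles):
    """Return the sources listed strictly more than min_profiles times."""
    return {source for source, n in counts.items() if n > min_profiles}


loose = sources_above(counts, 2)
strict = sources_above(counts, 3)
print(sorted(loose))
print(sorted(strict))
```

With the looser threshold the name-like source survives, while raising it by one removes it. On the real data a higher cut-off, or a manual review of the tail of the table, might be a reasonable safeguard.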
exploded_external_sources['authoritative'] = exploded_external_sources.ext_works_source\
.isin(authoritative_sources['ext_works_source'])
orcid_authoritative_source = exploded_external_sources\
.groupby('orcid')['authoritative']\
.any()\
.reset_index()[['orcid', 'authoritative']]
df = df.set_index('orcid').join(orcid_authoritative_source.set_index('orcid')).reset_index()
df.loc[df.authoritative.isna(), 'authoritative'] = False
df.head()
orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment | ext_works_source | n_ext_work_source | authoritative | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0000-0001-6097-3953 | False | False | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | NaN | 0 | NaN | 2018-03-02t09:29:16.528z | 2018-03-02t09:43:07.551z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaN | False |
1 | 0000-0001-6112-5550 | True | True | <NA> | <NA> | <NA> | [v.i. yurtaev; v. yurtaev] | <NA> | NaN | NaN | NaN | [[professor, peoples friendship university of ... | 0 | NaN | 2018-04-03t07:50:23.358z | 2020-03-18t09:42:44.753z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | 1 | NaN | NaN | False |
2 | 0000-0001-6152-2695 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | NaN | 0 | NaN | 2019-12-11t15:31:56.388z | 2020-01-28t15:34:17.309z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaN | False |
3 | 0000-0001-6220-5683 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | [[research scientist, new york university abu ... | 0 | NaN | 2015-08-18t12:36:45.307z | 2020-09-23t13:37:54.180z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | 1 | NaN | NaN | False |
4 | 0000-0001-7071-8294 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | [[researcher (academic), universidad de zarago... | 0 | NaN | 2014-03-10t13:22:01.966z | 2016-06-14t22:17:54.470z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | 2 | NaN | NaN | False |
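The per-profile flag computed above is true when at least one external source is in the authoritative set, and profiles without any external source fall back to `False`. A plain-Python sketch of that rule follows; the profile identifiers and the contents of the set are hypothetical.

```python
# Assumed set of authoritative sources, for illustration only.
authoritative_sources = {"crossref", "scopus - elsevier"}

# Toy mapping from a profile to its external sources.
profiles = {
    "profile-a": ["crossref", "some lab"],
    "profile-b": ["some lab"],
    "profile-c": [],  # no external source at all
}

# any() over an empty list is False, matching the fill applied after the join.
flags = {
    profile: any(source in authoritative_sources for source in sources)
    for profile, sources in profiles.items()
}
print(flags)
```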
External IDs¶
External IDs should come from reliable sources. ORCiD registrants cannot add them freely.
df.n_ids.describe()
count    1.308598e+06
mean     1.359082e+00
std      6.643235e-01
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      2.000000e+00
max      8.000000e+01
Name: n_ids, dtype: float64
df[df.n_ids == df.n_ids.max()]
orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment | ext_works_source | n_ext_work_source | authoritative | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3896226 | 0000-0002-9554-6633 | True | True | john a | williams | <NA> | NaN | <NA> | NaN | [[scopus author id, 55553733518], [scopus aut... | NaN | [[, aston university, birmingham, , gb, 1722, ... | 92 | [aston research explorer] | 2014-11-20t09:42:10.690z | 2021-03-17t01:00:51.203z | 80 | 0 | 0 | 208 | True | NaN | NaN | [aston.ac.uk] | <NA> | 1 | 80 | <NA> | <NA> | 1 | [aston research explorer] | 1.0 | True |
ids = df[['orcid', 'external_ids']].explode('external_ids').reset_index(drop=True)
ids['provider'] = ids[ids.external_ids.notna()]['external_ids'].apply(lambda x: x[0])
ids[ids.provider.notna()].head()
orcid | external_ids | provider | |
---|---|---|---|
9 | 0000-0001-8315-2066 | [researcherid, k-4630-2014] | researcherid |
29 | 0000-0002-2638-4108 | [scopus author id, 54394231000] | scopus author id |
46 | 0000-0003-1435-6545 | [researcherid, p-2223-2018] | researcherid |
50 | 0000-0003-2259-7023 | [scopus author id, 57189297461] | scopus author id |
64 | 0000-0002-7397-5824 | [scopus author id, 8399842800] | scopus author id |
top_ids_providers = ids.groupby('provider').count().sort_values('orcid', ascending=False)
data = [
go.Bar(
x=top_ids_providers.index,
y=top_ids_providers['orcid']
)
]
layout = go.Layout(
title='External IDs per provider',
xaxis=dict(tickangle=45, tickfont=dict(size=12))
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
pd.unique(ids['provider'])
array([nan, 'researcherid', 'scopus author id', 'loop profile', 'gnd', 'ciência id', 'researcher name resolver id', 'pitt id', 'id dialnet', 'isni', 'technical university of denmark cwis', 'chalmers id', 'scopus author id: ', 'scopus author id:', 'hkust profile', 'hku researcherpage', '中国科学家在线', 'uow scholars', 'sciprofile', 'cti vitae', 'digital author id', 'researcher id', 'authenticusid', 'authid', 'authenticus', 'scopus id', 'digital author id (dai)', 'researcherid:', 'vivo cornell', 'us epa vivo', 'escientist', 'github', 'iauthor', 'orcid id', 'dai', 'scopus id', 'smithsonian profiles', 'google scholar', 'kaken', 'dialnet id', 'researcherid: ', 'une researcher id', 'sciprofiles', 'id dialnet:', 'scienceopen', 'orcid', 'profile system identifier', 'custom'], dtype=object)
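Several provider labels in the array differ only by a trailing colon or stray whitespace (`'scopus author id'`, `'scopus author id:'`, `'scopus author id: '`). A minimal normalization sketch is shown below; `normalize_provider` is a hypothetical helper, and it deliberately does not merge genuinely different spellings such as `'scopus id'`, since treating those as the same provider would be an extra assumption.

```python
def normalize_provider(label):
    """Lower-case, collapse whitespace, and drop a trailing colon."""
    cleaned = " ".join(label.lower().split())
    return cleaned.rstrip(":").strip()


# Variants copied from the array above.
labels = [
    "scopus author id",
    "scopus author id: ",
    "scopus author id:",
    "researcherid",
    "researcherid:",
    "researcherid: ",
]
normalized = sorted({normalize_provider(label) for label in labels})
print(normalized)
```

Applying such a step before the `groupby('provider')` would fold these variants into a single bar per provider.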
Keywords¶
This field is problematic because users can pack several keywords into a single entry instead of adding them as separate keywords. Look at this:
keywords_by_orcid = df[['orcid', 'n_keywords']].sort_values('n_keywords', ascending=False)
keywords_by_orcid
orcid | n_keywords | |
---|---|---|
3751714 | 0000-0002-0673-0341 | 154 |
8697926 | 0000-0003-3343-5660 | 148 |
1154523 | 0000-0002-6075-3501 | 140 |
6512971 | 0000-0002-7060-4112 | 140 |
1515197 | 0000-0001-5287-1949 | 132 |
... | ... | ... |
10989644 | 0000-0002-1686-1935 | <NA> |
10989645 | 0000-0002-3800-6331 | <NA> |
10989646 | 0000-0002-8783-5814 | <NA> |
10989647 | 0000-0002-7584-2283 | <NA> |
10989648 | 0000-0003-0529-3538 | <NA> |
10989649 rows × 2 columns
set_top_n(100)
data = [
go.Bar(
x=keywords_by_orcid[:TOP_N]['orcid'],
y=keywords_by_orcid[:TOP_N]['n_keywords']
)
]
layout = go.Layout(
title='Keywords provided by ORCiD',
xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
top_keywords = df[['orcid', 'keywords']]\
.explode('keywords')\
.reset_index(drop=True)\
.groupby('keywords')\
.count()\
.sort_values('orcid', ascending=False)
set_top_n(50)
data = [
go.Bar(
x=top_keywords[:TOP_N].index,
y=top_keywords[:TOP_N]['orcid']
)
]
layout = go.Layout(
title='Top-%s keywords occurrence' % TOP_N,
xaxis=dict(tickangle=45, tickfont=dict(size=12))
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
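Since some profiles pack several keywords into one entry, the counts above may under-represent individual terms. One possible remedy is to split each entry on common delimiters before counting. The sketch below assumes `;`, `|` and `,` act as separators, which will not hold for every entry (a keyword may legitimately contain a comma); `split_keywords` and the sample string are illustrative only.

```python
import re

# Assumed separators; adjust after inspecting the real entries.
_DELIMITERS = re.compile(r"[;|,]")


def split_keywords(entry):
    """Split one keyword entry into trimmed, non-empty keywords."""
    parts = (part.strip() for part in _DELIMITERS.split(entry))
    return [part for part in parts if part]


packed = "machine learning; data mining | nlp;"
keywords = split_keywords(packed)
print(keywords)
```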
Education¶
df.n_education.describe()
count    1.753340e+06
mean     1.913072e+00
std      1.197388e+00
min      1.000000e+00
25%      1.000000e+00
50%      2.000000e+00
75%      3.000000e+00
max      2.000000e+02
Name: n_education, dtype: float64
df[df.n_education == df.n_education.max()]
orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment | ext_works_source | n_ext_work_source | authoritative | spam_score | n_valid_employment | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2568539 | 0000-0002-1927-0292 | True | True | phd. carmen m | galvez-sánchez | my name is carmen maria galvez sánchez. i´m a ... | NaN | <NA> | [qualitative research, fibromyalgia, quantitat... | [[loop profile, 509331], [scopus author id, 57... | [[psychology, 2019-2020 course. degree in psyc... | [[researcher and teaching staff. postdoctoral ... | 35 | [phd. carmen m galvez-sánchez, multidisciplina... | 2016-04-18t14:28:57.237z | 2021-03-06t14:17:33.246z | 24 | 0 | 0 | 7 | True | NaN | NaN | NaN | <NA> | <NA> | 2 | 5 | 200 | 3 | [multidisciplinary digital publishing institut... | 4.0 | True | 0.999948 | 1 |
exploded_education = df[['orcid', 'education']].explode('education').dropna()
exploded_education
orcid | education | |
---|---|---|
28 | 0000-0002-2343-910X | [aeronautics and astronautics, phd, massachuse... |
28 | 0000-0002-2343-910X | [aeronautics and astronautics, sm, massachuset... |
28 | 0000-0002-2343-910X | [mechanical engineering and material science, ... |
29 | 0000-0002-2638-4108 | [public law, ph doctor, university of oviedo, ... |
46 | 0000-0003-1435-6545 | [morfologia, , universidade estadual paulista ... |
... | ... | ... |
10989644 | 0000-0002-1686-1935 | [, , south china agricultural university, guan... |
10989645 | 0000-0002-3800-6331 | [richard gilder graduate school, phd in compar... |
10989645 | 0000-0002-3800-6331 | [geological sciences and history (dual major),... |
10989647 | 0000-0002-7584-2283 | [school of electronics and information, master... |
10989647 | 0000-0002-7584-2283 | [ department of electrical engineering, bachel... |
4434439 rows × 2 columns
exploded_education[['degree', 'role', 'university', 'city', 'region', 'country', 'id', 'id_scheme']] = pd.DataFrame(exploded_education.education.tolist(), index=exploded_education.index)
exploded_education['id'] = exploded_education['id'].replace('', pd.NA)
exploded_education.groupby('orcid').id.count().reset_index()
orcid | id | |
---|---|---|
0 | 0000-0001-5000-0162 | 3 |
1 | 0000-0001-5000-0170 | 2 |
2 | 0000-0001-5000-0218 | 3 |
3 | 0000-0001-5000-0226 | 1 |
4 | 0000-0001-5000-0306 | 0 |
... | ... | ... |
2441640 | 0000-0003-4999-9719 | 1 |
2441641 | 0000-0003-4999-9735 | 1 |
2441642 | 0000-0003-4999-992X | 2 |
2441643 | 0000-0003-4999-9938 | 2 |
2441644 | 0000-0003-4999-9954 | 1 |
2441645 rows × 2 columns
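The count above treats an education entry as valid when its organization `id` field is non-empty. The same rule can be sketched in plain Python on toy entries that follow the eight-field layout unpacked earlier; every value below is a placeholder, not data from the dataset.

```python
from collections import defaultdict

# Position of the `id` field in the eight-field layout used above.
ID_FIELD = 6

# Toy (profile, education entry) rows with placeholder values.
rows = [
    ("profile-1", ["dept", "phd", "univ a", "city", "", "xx", "1234", "scheme"]),
    ("profile-1", ["dept", "msc", "univ b", "city", "", "xx", "", ""]),
    ("profile-2", ["", "", "univ c", "city", "", "xx", "", ""]),
]

valid_education = defaultdict(int)
for profile, entry in rows:
    # An empty string means no organization identifier was attached.
    valid_education[profile] += 1 if entry[ID_FIELD] else 0

print(dict(valid_education))
```

A profile whose entries all lack an identifier still appears in the result with a count of 0, which mirrors the zero rows in the table above.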
df = df.merge(exploded_education.groupby('orcid').id.count().reset_index(), on='orcid')
df.rename(columns={'id': 'n_valid_education'}, inplace=True)
df[df.n_education != df.n_valid_education]
orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment | ext_works_source | n_ext_work_source | authoritative | spam_score | n_valid_employment | n_valid_education | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 0000-0003-1435-6545 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | [prostate cancer, migration, culture cell] | researcherid, p-2223-2018 | [[morfologia, , universidade estadual paulista... | [[, universidade estadual paulista (unesp), in... | 0 | NaN | 2018-08-09t12:12:24.405z | 2020-04-22t01:38:03.184z | 0 | 0 | 0 | 0 | False | NaN | NaN | [cnpq.br, linkedin.com] | <NA> | 2 | 1 | 3 | 1 | 1 | NaN | NaN | False | NaN | 0 | 0 |
6 | 0000-0002-0427-9745 | True | True | a. can | inci | i am a professor of finance at bryant universi... | NaN | <NA> | NaN | [[researcherid, b-5471-2018], [scopus author i... | [[finance, ph.d., university of michigan - ros... | [[professor of finance, bryant university, smi... | 34 | [a. can inci] | 2018-01-20t02:58:05.199z | 2020-06-16t12:35:09.403z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | 2 | <NA> | 4 | 5 | [] | 0.0 | False | 4.341588e-10 | 0 | 0 |
9 | 0000-0002-3380-6671 | True | True | abdul | asis pata | <NA> | NaN | <NA> | NaN | NaN | [[agribisnis, m.si, universitas hasanuddin, ma... | [[s.p, universitas muslim maros, maros, , id, ... | 0 | NaN | 2018-02-12t02:08:37.018z | 2018-02-12t02:22:33.378z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | 1 | 1 | NaN | NaN | False | NaN | 0 | 0 |
11 | 0000-0001-6902-6549 | True | True | abubakar | muhammad | <NA> | NaN | <NA> | NaN | NaN | [[school of electrical and information enginee... | [[lecturer, university of faisalabad, faisalab... | 1 | [multidisciplinary digital publishing institute] | 2017-07-06t10:29:17.738z | 2020-08-01t05:18:53.393z | 1 | 0 | 0 | 0 | True | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | 1 | 1 | [multidisciplinary digital publishing institute] | 1.0 | True | NaN | 0 | 0 |
12 | 0000-0002-6142-6406 | True | True | adam | mamadou | <NA> | NaN | <NA> | NaN | NaN | [[département deconomie sociologie rurale et t... | [[, institut national de la recherche agronomi... | 0 | NaN | 2018-02-15t09:54:59.943z | 2018-02-15t10:19:27.869z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | 1 | 1 | NaN | NaN | False | NaN | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1753316 | 0000-0002-1842-4130 | True | True | josé de jesús | cázares-marinero | <NA> | [josé cázares] | <NA> | [chemical biology, industrial chemistry, biote... | [[researcherid, h-2597-2013], [scopus author i... | [[charles friedel, postdoc, école nationale su... | [[mtc, polioles, mexico, , mx, , ], [head of r... | 17 | [crossref metadata search, scopus - elsevier, ... | 2013-07-09t14:39:30.950z | 2020-12-10t17:42:20.176z | 17 | 0 | 0 | 29 | False | NaN | NaN | [linkedin.com, google.com, researchgate.net] | <NA> | 3 | 2 | 5 | 3 | 3 | [crossref metadata search, scopus - elsevier] | 2.0 | True | NaN | 0 | 0 |
1753319 | 0000-0003-0459-4822 | True | True | luana | <NA> | mestranda em tecnologia na saúde e foi aluna o... | [luana bastos morey] | <NA> | [tradução; língua espanhol; língua portuguesa;... | NaN | [[pós-graduação em tecnologia em saúde stricto... | [[professora de espanhol e português para estr... | 7 | [luana arrial bastos] | 2017-05-11t13:14:59.372z | 2020-12-08t20:18:24.163z | 0 | 0 | 0 | 0 | False | NaN | NaN | [unidospelasaude.com.br, facebook.com, faceboo... | <NA> | 4 | <NA> | 2 | 4 | 3 | [] | 0.0 | False | 1.000000e+00 | 2 | 3 |
1753320 | 0000-0003-0057-1551 | True | True | lyudmyla | antypenko | the phd degree of pharmacy was received under ... | [lyudmila nikolaevna antipenko (russian transl... | <NA> | [pharmaceutical chemistry, organic synthesis, ... | [[scopus author id, 55070809900], [researcheri... | [[centre for nanomaterials, advanced technolog... | [[visiting scientist, north dakota state unive... | 35 | [crossref metadata search, scopus - elsevier, ... | 2014-02-19t08:15:15.698z | 2020-12-09t18:14:17.963z | 28 | 0 | 11 | 17 | True | NaN | NaN | NaN | <NA> | <NA> | 2 | 5 | 7 | 8 | [crossref metadata search, scopus - elsevier, ... | 4.0 | True | 1.000000e+00 | 2 | 4 |
1753325 | 0000-0003-4653-4705 | True | True | patricia | teixeira | 2005 - phd, university of coimbrajuly 2009-jun... | NaN | <NA> | [ecotoxicology, heavy metals, steroid hormones... | [[researcherid, i-6863-2013], [scopus author i... | [[, phd, university of coimbra, coimbra, , pt,... | [[senior researcher, university of coimbra, co... | 95 | [ciênciavitae, scopus - elsevier, pg cardoso, ... | 2013-11-26t10:59:34.331z | 2020-12-02t15:28:26.221z | 90 | 0 | 0 | 42 | False | NaN | NaN | NaN | <NA> | <NA> | 3 | 7 | 1 | 3 | [ciênciavitae, scopus - elsevier, pg cardoso, ... | 4.0 | True | 7.147059e-10 | 3 | 0 |
1753337 | 0000-0002-1686-1935 | True | True | youxia | wang | youxia wang (1995-), native of zunyi, guizhou ... | NaN | <NA> | NaN | NaN | [[institute of animal nutrition, master degree... | [[master, sichuan agricultural university , ch... | 0 | NaN | 2020-12-11t02:11:51.808z | 2020-12-11t03:25:28.263z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | 2 | 1 | NaN | NaN | False | 4.475163e-02 | 1 | 1 |
473043 rows × 36 columns
Employment
df.n_employment.describe()
count    2.680488e+06
mean     1.664713e+00
std      1.530077e+00
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      2.000000e+00
max      1.980000e+02
Name: n_employment, dtype: float64
df[df.n_employment == df.n_employment.max()]
orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment | ext_works_source | n_ext_work_source | authoritative | spam_score | n_valid_employment | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2020738 | 0000-0002-0293-964X | True | True | ben zhong | tang | <NA> | [唐本忠] | tangbenz@ust.hk | [fluorescent biosensors, light-emitting molecu... | [[hkust profile, tang-benzhong], [researcherid... | [[department of chemistry and faculty of pharm... | [[chair professor, division of biomedical engi... | 422 | [tang, benzhong, crossref] | 2015-03-13t00:28:33.270z | 2021-03-23t07:56:34.824z | 359 | 0 | 0 | 0 | False | ust.hk | NaN | [ust.hk] | <NA> | 1 | 3 | 7 | 7 | 198 | [crossref] | 1.0 | True | NaN | 32 |
Let's count, for each ORCID iD, how many employment entries carry an assigned organization identifier (Ringgold, ISNI, GRID, etc.).
exploded_employment = df[['orcid', 'employment']].explode('employment').dropna()
exploded_employment
orcid | employment | |
---|---|---|
1 | 0000-0001-6112-5550 | [professor, peoples friendship university of r... |
3 | 0000-0001-6220-5683 | [research scientist, new york university abu d... |
4 | 0000-0001-7071-8294 | [researcher (academic), universidad de zaragoz... |
4 | 0000-0001-7071-8294 | [researcher (academic), instituto de síntesis ... |
6 | 0000-0001-7402-0096 | [, kth royal institute of technology, stockhol... |
... | ... | ... |
10989643 | 0000-0003-2606-0936 | [post-doc, institute of biochemistry and cell ... |
10989644 | 0000-0002-1686-1935 | [master, sichuan agricultural university , che... |
10989645 | 0000-0002-3800-6331 | [assistant professor, baruch college, city uni... |
10989645 | 0000-0002-3800-6331 | [postdoctoral scholar, university of californi... |
10989647 | 0000-0002-7584-2283 | [lecturer, henan institute of science and tech... |
4462243 rows × 2 columns
exploded_employment[['role', 'institution', 'city', 'region', 'country', 'id', 'id_scheme']] = pd.DataFrame(exploded_employment.employment.tolist(), index=exploded_employment.index)
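# Treat empty identifiers as missing; assign back instead of relying on a chained in-place replace
exploded_employment['id'] = exploded_employment['id'].replace('', pd.NA)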
exploded_employment.groupby('orcid').id.count().reset_index()
orcid | id | |
---|---|---|
0 | 0000-0001-5000-0031 | 1 |
1 | 0000-0001-5000-0138 | 1 |
2 | 0000-0001-5000-0170 | 2 |
3 | 0000-0001-5000-0218 | 1 |
4 | 0000-0001-5000-0226 | 1 |
... | ... | ... |
2680483 | 0000-0003-4999-9831 | 1 |
2680484 | 0000-0003-4999-9890 | 1 |
2680485 | 0000-0003-4999-992X | 0 |
2680486 | 0000-0003-4999-9938 | 1 |
2680487 | 0000-0003-4999-9954 | 2 |
2680488 rows × 2 columns
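The counting step above can be reproduced on a toy frame to make the logic easier to check. This is a minimal sketch, assuming each employment entry is a 7-item list in the order used by the cell above (role, institution, city, region, country, id, id_scheme); the ORCID iDs and institutions below are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df[['orcid', 'employment']].
toy = pd.DataFrame({
    'orcid': ['0000-0000-0000-0001', '0000-0000-0000-0002'],
    'employment': [
        [['professor', 'univ a', 'rome', '', 'it', '1234', 'ringgold'],
         ['lecturer', 'univ b', 'pisa', '', 'it', '', '']],
        [['researcher', 'lab c', 'paris', '', 'fr', '', '']],
    ],
})

COLUMNS = ['role', 'institution', 'city', 'region', 'country', 'id', 'id_scheme']

# One row per employment entry; a fresh index avoids duplicate labels.
exploded = (toy.explode('employment')
               .dropna(subset=['employment'])
               .reset_index(drop=True))
fields = pd.DataFrame(exploded['employment'].tolist(), columns=COLUMNS)
fields['orcid'] = exploded['orcid']

# Empty identifier strings become missing values, so count() skips them.
fields['id'] = fields['id'].replace('', pd.NA)
n_valid = fields.groupby('orcid')['id'].count()
print(n_valid)
```

Because `count()` ignores missing values, a profile whose entries all lack an identifier still appears in the result, with a count of 0.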
df = df.merge(exploded_employment.groupby('orcid').id.count().reset_index(), on='orcid')
df.rename(columns={'id': 'n_valid_employment'}, inplace=True)
orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment | ext_works_source | n_ext_work_source | authoritative | spam_score | n_valid_employment | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0000-0001-6112-5550 | True | True | <NA> | <NA> | <NA> | [v.i. yurtaev; v. yurtaev] | <NA> | NaN | NaN | NaN | [[professor, peoples friendship university of ... | 0 | NaN | 2018-04-03t07:50:23.358z | 2020-03-18t09:42:44.753z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | 1 | NaN | NaN | False | NaN | 1 |
1 | 0000-0001-6220-5683 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | [[research scientist, new york university abu ... | 0 | NaN | 2015-08-18t12:36:45.307z | 2020-09-23t13:37:54.180z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | 1 | NaN | NaN | False | NaN | 0 |
2 | 0000-0001-7071-8294 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | [[researcher (academic), universidad de zarago... | 0 | NaN | 2014-03-10t13:22:01.966z | 2016-06-14t22:17:54.470z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | 2 | NaN | NaN | False | NaN | 1 |
3 | 0000-0001-7402-0096 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | [[, kth royal institute of technology, stockho... | 0 | NaN | 2015-01-11t15:13:06.467z | 2016-06-14t23:55:59.896z | 0 | 0 | 0 | 0 | False | NaN | NaN | [kth.se] | <NA> | 1 | <NA> | <NA> | <NA> | 1 | NaN | NaN | False | NaN | 0 |
4 | 0000-0001-8315-2066 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | [iron chlorosis, fertilizers, calcareous soil,... | researcherid, k-4630-2014 | NaN | [[, universidad de córdoba, córdoba, andalucía... | 0 | NaN | 2014-05-26t08:57:12.661z | 2019-03-27t07:53:48.987z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | 1 | 4 | <NA> | 1 | NaN | NaN | False | NaN | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2680483 | 0000-0002-8004-688X | True | True | paul | wanjala muyoma | <NA> | [wanjala] | <NA> | [environment and sustainability] | NaN | NaN | [[graduate teaching assistant, university of p... | 0 | NaN | 2016-03-07t08:53:06.561z | 2020-12-02t02:14:50.213z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | 1 | <NA> | 2 | NaN | NaN | False | NaN | 2 |
2680484 | 0000-0003-2606-0936 | True | True | luang | xu | <NA> | [xu lu-ang, lu lu] | <NA> | NaN | NaN | NaN | [[post-doc, institute of biochemistry and cell... | 2 | [scopus - elsevier, crossref] | 2015-10-24t03:53:23.544z | 2020-11-19t09:23:48.896z | 2 | 0 | 0 | 1 | True | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | 1 | [scopus - elsevier, crossref] | 2.0 | True | NaN | 1 |
2680485 | 0000-0002-1686-1935 | True | True | youxia | wang | youxia wang (1995-), native of zunyi, guizhou ... | NaN | <NA> | NaN | NaN | [[institute of animal nutrition, master degree... | [[master, sichuan agricultural university , ch... | 0 | NaN | 2020-12-11t02:11:51.808z | 2020-12-11t03:25:28.263z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | 2 | 1 | NaN | NaN | False | 0.044752 | 1 |
2680486 | 0000-0002-3800-6331 | True | True | zachary | calamari | <NA> | NaN | <NA> | NaN | NaN | [[richard gilder graduate school, phd in compa... | [[assistant professor, baruch college, city un... | 7 | [crossref metadata search, zachary t. calamari... | 2015-01-20t20:20:17.042z | 2020-11-21t19:48:36.221z | 7 | 0 | 1 | 0 | True | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | 2 | 2 | [crossref metadata search, crossref] | 2.0 | True | NaN | 0 |
2680487 | 0000-0002-7584-2283 | True | True | 现刚 | 左 | <NA> | [zuo xiangang, xiangang zuo, zuo x g, x g zuo] | <NA> | NaN | NaN | [[school of electronics and information, maste... | [[lecturer, henan institute of science and tec... | 0 | NaN | 2016-12-27t07:45:25.073z | 2020-11-29t13:06:17.582z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | 2 | 1 | NaN | NaN | False | NaN | 1 |
2680488 rows × 35 columns
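Note that `merge` defaults to an inner join, so the merged frame keeps only the profiles that have at least one employment entry. If profiles without employment should be kept for later sections, a left join is one option. The sketch below uses hypothetical toy frames, not the notebook's data.

```python
import pandas as pd

# Hypothetical profiles: 'c' has no employment entries at all.
profiles = pd.DataFrame({'orcid': ['a', 'b', 'c']})
counts = pd.DataFrame({'orcid': ['a', 'b'], 'n_valid_employment': [1, 0]})

# Default how='inner': 'c' is dropped.
inner = profiles.merge(counts, on='orcid')

# how='left': 'c' is kept, with a missing count that we set to 0.
left = profiles.merge(counts, on='orcid', how='left')
left['n_valid_employment'] = left['n_valid_employment'].fillna(0).astype(int)

print(len(inner), len(left))
```

Whether 0 is the right fill value depends on the analysis: it conflates "no employment listed" with "employment listed but no identifier", so keeping the missing value may be preferable in some cases.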
df[df.n_employment != df.n_valid_employment]
orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment | ext_works_source | n_ext_work_source | authoritative | spam_score | n_valid_employment | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0000-0001-6220-5683 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | [[research scientist, new york university abu ... | 0 | NaN | 2015-08-18t12:36:45.307z | 2020-09-23t13:37:54.180z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | 1 | NaN | NaN | False | NaN | 0 |
2 | 0000-0001-7071-8294 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | [[researcher (academic), universidad de zarago... | 0 | NaN | 2014-03-10t13:22:01.966z | 2016-06-14t22:17:54.470z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | 2 | NaN | NaN | False | NaN | 1 |
3 | 0000-0001-7402-0096 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | [[, kth royal institute of technology, stockho... | 0 | NaN | 2015-01-11t15:13:06.467z | 2016-06-14t23:55:59.896z | 0 | 0 | 0 | 0 | False | NaN | NaN | [kth.se] | <NA> | 1 | <NA> | <NA> | <NA> | 1 | NaN | NaN | False | NaN | 0 |
5 | 0000-0001-8377-3508 | True | True | <NA> | <NA> | <NA> | [fontana, milena da silva] | <NA> | [educação; informática; matemática.] | NaN | NaN | [[, instituto federal de educação, ciência e t... | 0 | NaN | 2018-05-23t23:39:04.534z | 2019-10-16t02:50:11.007z | 0 | 0 | 0 | 0 | False | NaN | NaN | [cnpq.br] | <NA> | 1 | <NA> | 1 | <NA> | 3 | NaN | NaN | False | NaN | 0 |
8 | 0000-0002-6508-6998 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | [[researcher (academic), universidad de zarago... | 0 | NaN | 2014-03-12t08:23:22.492z | 2015-07-27t15:51:38.411z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | 2 | NaN | NaN | False | NaN | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2680476 | 0000-0001-9133-2366 | True | True | søren | staugaard | <NA> | NaN | <NA> | NaN | NaN | [[, , aarhus universitet, aarhus, , dk, 1006, ... | [[, aarhus university, aarhus c, , dk, , ], [s... | 29 | [aarhus university, crossref] | 2013-03-19t11:34:48.477z | 2020-12-07t08:03:23.190z | 14 | 0 | 10 | 35 | True | NaN | NaN | [au.dk, au.dk] | <NA> | 2 | <NA> | <NA> | 1 | 3 | [aarhus university, crossref] | 2.0 | True | NaN | 1 |
2680477 | 0000-0001-8494-2123 | True | True | tarun | jain | <NA> | NaN | <NA> | [pet/ct specialist; nuclear medicine physician... | NaN | NaN | [[assistant professor, mahatma gandhi medical ... | 0 | NaN | 2014-12-19t08:21:46.292z | 2020-12-09t06:03:57.055z | 0 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | 1 | <NA> | 5 | NaN | NaN | False | NaN | 4 |
2680479 | 0000-0002-2906-0299 | True | True | tiffany | mackay | <NA> | [tiffany russel sia] | <NA> | [microfluidics, gpc-1, gallium-67, pet/ct, oxy... | researcherid, a-2121-2017 | [[faculty of medicine, master in pharmaceutica... | [[clinical project lead, minomic international... | 11 | [crossref, researcherid, tiffany mackay] | 2017-01-03t23:28:48.736z | 2020-12-09t17:12:20.326z | 11 | 0 | 0 | 0 | True | NaN | NaN | [oxytocin.com.au, linkedin.com] | <NA> | 2 | 1 | 13 | 2 | 4 | [crossref, researcherid] | 2.0 | True | NaN | 1 |
2680481 | 0000-0002-4422-4036 | True | True | vijay | krishnan | <NA> | NaN | <NA> | NaN | NaN | [[psychiatry, md, all india institute of medic... | [[assistant professor, all india institute of ... | 2 | [crossref] | 2015-05-28t17:24:39.519z | 2020-11-24t08:57:22.875z | 2 | 0 | 0 | 0 | False | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | 2 | 5 | [crossref] | 1.0 | True | NaN | 3 |
2680486 | 0000-0002-3800-6331 | True | True | zachary | calamari | <NA> | NaN | <NA> | NaN | NaN | [[richard gilder graduate school, phd in compa... | [[assistant professor, baruch college, city un... | 7 | [crossref metadata search, zachary t. calamari... | 2015-01-20t20:20:17.042z | 2020-11-21t19:48:36.221z | 7 | 0 | 1 | 0 | True | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | 2 | 2 | [crossref metadata search, crossref] | 2.0 | True | NaN | 0 |
1036967 rows × 35 columns
Biography
# Treat empty biographies as missing (np.NaN was removed in NumPy 2.0; np.nan is the supported spelling)
df['biography'] = df['biography'].replace('', np.nan)
df.biography.describe()
count     354015
unique    337007
top       car title loans are a more straightforward way...
freq      343
Name: biography, dtype: object
df[(df.biography.notna()) & (df.biography.str.contains('car title loans are a more straightforward'))]
orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment | ext_works_source | n_ext_work_source | authoritative | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
51306 | 0000-0002-7397-7977 | True | True | premium car | title loans | car title loans are a more straightforward way... | [premium car title loans] | <NA> | [car title loan upland] | NaN | NaN | NaN | 0 | NaN | 2020-11-06t06:10:20.070z | 2020-11-06t06:24:28.005z | 0 | 0 | 0 | 0 | False | NaN | NaN | [premiumcartitleloans.com] | <NA> | 1 | <NA> | 1 | <NA> | <NA> | NaN | NaN | False |
51307 | 0000-0003-4931-9736 | True | True | premium car | title loans | car title loans are a more straightforward way... | [premium car title loans] | <NA> | [car title loan saratoga] | NaN | NaN | NaN | 0 | NaN | 2020-11-13t01:04:19.859z | 2020-11-13t01:15:12.546z | 0 | 0 | 0 | 0 | False | NaN | NaN | [premiumcartitleloans.com] | <NA> | 1 | <NA> | 1 | <NA> | <NA> | NaN | NaN | False |
106024 | 0000-0001-8221-2303 | True | True | premium car | title loans | car title loans are a more straightforward way... | [premium car title loans] | <NA> | [car title loan victorville] | NaN | NaN | NaN | 0 | NaN | 2020-11-05t00:38:21.096z | 2020-11-05t00:40:40.091z | 0 | 0 | 0 | 0 | False | NaN | NaN | [premiumcartitleloans.com] | <NA> | 1 | <NA> | 1 | <NA> | <NA> | NaN | NaN | False |
108770 | 0000-0001-6736-072X | True | True | premium car | title loans | car title loans are a more straightforward way... | NaN | <NA> | NaN | NaN | NaN | NaN | 0 | NaN | 2020-12-08t05:38:30.786z | 2020-12-08t05:40:03.786z | 0 | 0 | 0 | 0 | False | NaN | NaN | [premiumcartitleloans.com] | <NA> | 1 | <NA> | <NA> | <NA> | <NA> | NaN | NaN | False |
108771 | 0000-0002-8727-1246 | True | True | premium car | title loans | car title loans are a more straightforward way... | [loan agency] | <NA> | [refinance car title loan, title loan on car, ... | NaN | NaN | NaN | 0 | NaN | 2020-12-10t08:54:56.127z | 2020-12-10t08:57:15.791z | 0 | 0 | 0 | 0 | False | NaN | NaN | [premiumcartitleloans.com] | <NA> | 1 | <NA> | 4 | <NA> | <NA> | NaN | NaN | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10875416 | 0000-0002-9640-8136 | True | True | premium car | title loans | car title loans are a more straightforward way... | [premium car title loans] | <NA> | [car title loan clovis] | NaN | NaN | NaN | 0 | NaN | 2020-10-22t06:11:02.945z | 2020-10-22t06:17:09.111z | 0 | 0 | 0 | 0 | False | NaN | NaN | [premiumcartitleloans.com] | <NA> | 1 | <NA> | 1 | <NA> | <NA> | NaN | NaN | False |
10878239 | 0000-0002-6926-3752 | True | True | premium car | title loans | car title loans are a more straightforward way... | [premium car title loans] | <NA> | [car title loan escondido] | NaN | NaN | NaN | 0 | NaN | 2020-12-03t02:00:33.684z | 2020-12-03t02:02:07.054z | 0 | 0 | 0 | 0 | False | NaN | NaN | [premiumcartitleloans.com] | <NA> | 1 | <NA> | 1 | <NA> | <NA> | NaN | NaN | False |
10933380 | 0000-0002-3655-4713 | True | True | premium car | title loans | car title loans are a more straightforward way... | [premium car title loans] | <NA> | [car title loan san rafael] | NaN | NaN | NaN | 0 | NaN | 2020-11-18t00:39:17.492z | 2020-11-18t00:52:19.024z | 0 | 0 | 0 | 0 | False | NaN | NaN | [premiumcartitleloans.com] | <NA> | 1 | <NA> | 1 | <NA> | <NA> | NaN | NaN | False |
10933381 | 0000-0002-8724-1020 | True | True | premium car | title loans | car title loans are a more straightforward way... | [premium car title loans] | <NA> | [car title loan san juan capistrano] | NaN | NaN | NaN | 0 | NaN | 2020-11-19t00:31:54.080z | 2020-11-19t00:34:08.721z | 0 | 0 | 0 | 0 | False | NaN | NaN | [premiumcartitleloans.com] | <NA> | 1 | <NA> | 1 | <NA> | <NA> | NaN | NaN | False |
10985986 | 0000-0002-4601-4569 | True | True | premium car | title loans | car title loans are a more straightforward way... | [premium car title loans] | <NA> | [car title loan mount pleasant] | NaN | NaN | NaN | 0 | NaN | 2020-10-16t00:32:26.207z | 2020-10-16t00:37:42.646z | 0 | 0 | 0 | 0 | False | NaN | NaN | [premiumcartitleloans.com] | <NA> | 1 | <NA> | 1 | <NA> | <NA> | NaN | NaN | False |
421 rows × 33 columns
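The exact-match frequency reported by `describe()` (343) is lower than the number of rows matched by the substring search (421), which suggests that some copies of the template differ slightly. A minimal sketch of how repeated or templated biographies could be surfaced more generally is given below; the sample strings and the `PREFIX_LEN` value are hypothetical.

```python
import pandas as pd

# Hypothetical biographies: two exact copies, one near copy, one unrelated.
bios = pd.Series([
    'car title loans are a more straightforward way to borrow',
    'car title loans are a more straightforward way to borrow',
    'car title loans are a more straightforward way to get cash',
    'phd in chemistry',
    None,
])

# Exact duplicates: occurrences of each distinct biography (missing values are skipped).
exact = bios.value_counts()

# Near duplicates that share an opening: count on a fixed-length prefix instead.
PREFIX_LEN = 40  # assumed cut-off, to be tuned on real data
by_prefix = bios.dropna().str[:PREFIX_LEN].value_counts()

print(exact.head(3))
print(by_prefix.head(3))
```

Prefix grouping is a crude heuristic: it catches templates with varying endings, but not templates whose opening words change.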
def score(bio):
    try:
        return antispam.score(bio)
    except Exception:  # if len(bio) < 3 the filter doesn't know how to handle that
        return -1
df['spam_score'] = df[df.biography.notna()]['biography'].apply(score)
df[df.spam_score == -1][['orcid','biography']]
orcid | biography | |
---|---|---|
25505 | 0000-0003-0505-2734 | j |
138487 | 0000-0002-3417-7299 | ..... |
139595 | 0000-0003-3794-1288 | m.d., ph.d. |
193340 | 0000-0001-9655-4806 | 肿瘤 |
194990 | 0000-0002-9149-0142 | be y |
... | ... | ... |
10927866 | 0000-0002-7341-5480 | ph.d. |
10976080 | 0000-0003-4041-0840 | / |
10976689 | 0000-0002-4285-8537 | |
10976922 | 0000-0002-1545-8773 | hi |
10987379 | 0000-0002-6302-4224 | . |
348 rows × 2 columns
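Some of the biographies that failed to score (e.g. `m.d., ph.d.`) are longer than three characters, so length alone may not explain every failure. A wrapper that returns NaN directly, rather than a -1 sentinel that must be replaced afterwards, could look like the sketch below. The `safe_score` helper and the `toy_scorer` stand-in are hypothetical; in the notebook the scorer would be `antispam.score`.

```python
import math
from typing import Callable, Optional


def safe_score(bio: Optional[str],
               scorer: Callable[[str], float],
               min_len: int = 3) -> float:
    """Return scorer(bio), or NaN if the text is missing, too short, or the scorer fails."""
    if bio is None or len(bio.strip()) < min_len:
        return math.nan
    try:
        return float(scorer(bio))
    except Exception:
        # Some longer inputs may still be rejected by the scorer; treat them as unscored.
        return math.nan


def toy_scorer(text: str) -> float:
    """Hypothetical stand-in for a real spam scorer."""
    return 1.0 if 'loans' in text else 0.0


print(safe_score('premium car title loans', toy_scorer))
print(safe_score('j', toy_scorer))
```

Returning NaN keeps unscored biographies out of `describe()` and of threshold filters without a separate clean-up step.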
df['spam_score'] = df['spam_score'].replace(-1, np.nan)
df.spam_score.describe()
count    3.536670e+05
mean     6.098044e-01
std      4.476618e-01
min      1.917500e-22
25%      1.858235e-02
50%      9.529688e-01
75%      9.999992e-01
max      1.000000e+00
Name: spam_score, dtype: float64
df[df.spam_score > 0.9999][['biography', 'spam_score']]
biography | spam_score | |
---|---|---|
29 | investigador de la universidad de oviedo. depa... | 1.000000 |
83 | formación académica en la temática de manejo d... | 1.000000 |
217 | doctor en educación, maestro en gerencia de la... | 1.000000 |
222 | possui graduação em psicologia pela pontifícia... | 1.000000 |
470 | roofing contractors in seattle waroofing contr... | 1.000000 |
... | ... | ... |
10989593 | jose ignacio peláez sánchez ha sido profesor e... | 0.999966 |
10989603 | mestranda em tecnologia na saúde e foi aluna o... | 1.000000 |
10989605 | the phd degree of pharmacy was received under ... | 1.000000 |
10989615 | mostafa metwaly is an assistant lecturer at th... | 1.000000 |
10989617 | jual obat aborsi di tangerang, obat penggugur ... | 0.999999 |
120733 rows × 2 columns
TODO: detect offensive words and sexually explicit content
All VS All correlation
fig = px.imshow(df.select_dtypes(include=['bool','number']).fillna(-1).corr())
fig.show()
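Filling missing values with -1 before calling `corr()` turns "missing" into an extreme numeric value, which can distort or even flip correlations for sparsely populated columns such as `spam_score`. By default `DataFrame.corr` excludes missing values pairwise, so the fill is not strictly required. The toy numbers below are hypothetical and only illustrate the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first profile has no spam score.
toy = pd.DataFrame({
    'n_works': [0, 2, 5, 9],
    'spam_score': [np.nan, 0.9, 0.5, 0.1],
})

# The sentinel -1 is treated as a real observation.
with_sentinel = toy.fillna(-1).corr().loc['n_works', 'spam_score']

# Rows with a missing value are skipped for this pair of columns.
pairwise = toy.corr().loc['n_works', 'spam_score']

print(with_sentinel, pairwise)
```

In this toy case the sentinel version is positive while the pairwise version is strongly negative, so the heatmap should be read with the fill value in mind.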
# df[['verified_email',
# 'verified_primary_email',
# 'n_works',
# 'n_doi',
# 'n_arxiv',
# 'n_pmc',
# 'n_other_pids',
# 'n_emails',
# 'n_urls',
# 'n_ids',
# 'n_keywords',
# 'n_employment',
# 'n_education',
# 'label']].to_pickle('../data/processed/features.pkl')
Label speculation
df[df.label == 1]
orcid | verified_email | verified_primary_email | given_names | family_name | biography | other_names | primary_email | keywords | external_ids | education | employment | n_works | works_source | activation_date | last_update_date | n_doi | n_arxiv | n_pmc | n_other_pids | label | primary_email_domain | other_email_domains | url_domains | n_emails | n_urls | n_ids | n_keywords | n_education | n_employment | ext_works_source | n_ext_work_source | authoritative | spam_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17 | 0000-0002-0137-3066 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | NaN | 0 | NaN | 2017-07-25t04:34:17.338z | 2019-11-27t17:54:45.418z | 0 | 0 | 0 | 0 | True | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaN | False | NaN |
19 | 0000-0002-0461-9711 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | NaN | 2 | [crossref] | 2015-08-18t12:42:01.797z | 2019-12-06t11:37:38.203z | 2 | 0 | 0 | 0 | True | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaN | False | NaN |
22 | 0000-0002-0761-9450 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | NaN | 1 | [crossref] | 2020-05-13t17:15:28.405z | 2020-08-11t21:00:45.694z | 1 | 0 | 0 | 0 | True | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaN | False | NaN |
33 | 0000-0002-4447-9215 | True | True | <NA> | <NA> | <NA> | NaN | <NA> | NaN | NaN | NaN | NaN | 0 | NaN | 2017-07-24t09:37:50.242z | 2019-11-15t08:31:24.820z | 0 | 0 | 0 | 0 | True | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaN | False | NaN |
44 | 0000-0003-0426-4065 | True | True | <NA> | <NA> | <NA> | [eliza i. gilbert] | <NA> | NaN | NaN | NaN | [[, us fish and wildlife service, albuquerque,... | 0 | NaN | 2017-08-07t18:32:31.802z | 2020-04-08t16:48:55.732z | 0 | 0 | 0 | 0 | True | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | 1 | NaN | NaN | False | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10989635 | 0000-0002-7340-9697 | True | True | tawanda | marandure | <NA> | NaN | <NA> | NaN | scopus author id, 48261373600 | [[animal science, msc sustainable agriculture,... | [[lecturer, zimbabwe open university faculty o... | 7 | [scopus - elsevier] | 2015-11-05t08:52:08.743z | 2020-12-09t17:59:18.350z | 7 | 0 | 0 | 7 | True | NaN | NaN | NaN | <NA> | <NA> | 1 | <NA> | 3 | 3 | [scopus - elsevier] | 1.0 | True | NaN |
10989636 | 0000-0002-2906-0299 | True | True | tiffany | mackay | <NA> | [tiffany russel sia] | <NA> | [microfluidics, gpc-1, gallium-67, pet/ct, oxy... | researcherid, a-2121-2017 | [[faculty of medicine, master in pharmaceutica... | [[clinical project lead, minomic international... | 11 | [crossref, researcherid, tiffany mackay] | 2017-01-03t23:28:48.736z | 2020-12-09t17:12:20.326z | 11 | 0 | 0 | 0 | True | NaN | NaN | [oxytocin.com.au, linkedin.com] | <NA> | 2 | 1 | 13 | 2 | 4 | [crossref, researcherid] | 2.0 | True | NaN |
10989637 | 0000-0001-5896-2024 | True | True | giovanni, l | tiscia | <NA> | NaN | <NA> | NaN | scopus author id, 54948242800 | NaN | NaN | 70 | [scopus - elsevier, tiscia giovanni, l, europe... | 2016-07-27t10:09:13.585z | 2020-12-07t22:23:05.706z | 65 | 0 | 17 | 52 | True | NaN | NaN | NaN | <NA> | <NA> | 1 | <NA> | <NA> | <NA> | [scopus - elsevier, europe pubmed central, cro... | 3.0 | True | NaN |
10989643 | 0000-0003-2606-0936 | True | True | luang | xu | <NA> | [xu lu-ang, lu lu] | <NA> | NaN | NaN | NaN | [[post-doc, institute of biochemistry and cell... | 2 | [scopus - elsevier, crossref] | 2015-10-24t03:53:23.544z | 2020-11-19t09:23:48.896z | 2 | 0 | 0 | 1 | True | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | 1 | [scopus - elsevier, crossref] | 2.0 | True | NaN |
10989645 | 0000-0002-3800-6331 | True | True | zachary | calamari | <NA> | NaN | <NA> | NaN | NaN | [[richard gilder graduate school, phd in compa... | [[assistant professor, baruch college, city un... | 7 | [crossref metadata search, zachary t. calamari... | 2015-01-20t20:20:17.042z | 2020-11-21t19:48:36.221z | 7 | 0 | 1 | 0 | True | NaN | NaN | NaN | <NA> | <NA> | <NA> | <NA> | 2 | 2 | [crossref metadata search, crossref] | 2.0 | True | NaN |
2075872 rows × 34 columns
# (df.n_works > 0) & (df.n_ids > 1)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10989649 entries, 0 to 10989648
Data columns (total 34 columns):
 #   Column                  Dtype
---  ------                  -----
 0   orcid                   object
 1   verified_email          bool
 2   verified_primary_email  bool
 3   given_names             string
 4   family_name             string
 5   biography               string
 6   other_names             object
 7   primary_email           string
 8   keywords                object
 9   external_ids            object
 10  education               object
 11  employment              object
 12  n_works                 Int16
 13  works_source            object
 14  activation_date         string
 15  last_update_date        string
 16  n_doi                   Int16
 17  n_arxiv                 Int16
 18  n_pmc                   Int16
 19  n_other_pids            Int16
 20  label                   bool
 21  primary_email_domain    object
 22  other_email_domains     object
 23  url_domains             object
 24  n_emails                Int16
 25  n_urls                  Int16
 26  n_ids                   Int16
 27  n_keywords              Int16
 28  n_education             Int16
 29  n_employment            Int16
 30  ext_works_source        object
 31  n_ext_work_source       float64
 32  authoritative           object
 33  spam_score              float64
dtypes: Int16(11), bool(3), float64(2), object(12), string(6)
memory usage: 2.0+ GB