fake-orcid-analysis/notebooks/01-Exploration.ipynb


Exploratory analysis

TODO:

  • Understanding why fake profiles are created can suggest how to catch them (this could be trivial with prior knowledge, e.g., SEO hacking => URLs)
  • Build a taxonomy of cases (e.g., an author who publishes but has an empty ORCID profile, an author who publishes but is not in OpenAIRE, etc.)
  • Is the temporal dimension of any use?
  • Can we access private info thanks to the OpenAIRE-ORCID agreement?
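
As a starting point for the taxonomy of cases mentioned in the TODO list, the following is a minimal sketch that assigns each profile to a coarse bucket. The toy rows, the bucket names, and the `profile_case` helper are all hypothetical; only the column names `n_works`, `n_urls` and `biography` come from the dataset, and the real taxonomy would need more fields than these three.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; all values are invented for illustration.
toy = pd.DataFrame({
    'orcid': ['A', 'B', 'C', 'D'],
    'n_works': [0, 12, 0, 3],
    'n_urls': [np.nan, 2, 40, np.nan],
    'biography': [None, 'researcher', None, None],
})

def profile_case(row):
    """Assign a coarse, hypothetical case label to one profile."""
    has_works = row['n_works'] > 0
    has_urls = pd.notna(row['n_urls']) and row['n_urls'] > 0
    has_bio = pd.notna(row['biography'])
    if not (has_works or has_urls or has_bio):
        return 'empty profile'
    if has_works and not (has_urls or has_bio):
        return 'works only'
    if has_urls and not has_works:
        return 'urls without works'
    return 'other'

toy['case'] = toy.apply(profile_case, axis=1)
print(toy[['orcid', 'case']])
```

A row-wise `apply` is slow on roughly 11 million rows; for the full dataset the same rules would be better expressed as vectorised boolean masks.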
In [1]:
import glob

import pandas as pd
import ast
import tldextract
import numpy as np

import antispam

import plotly
from plotly.offline import iplot, init_notebook_mode
import plotly.graph_objs as go
import plotly.express as px

init_notebook_mode(connected=True)
TOP_N = 0
TOP_RANGE = [0, 0]

def set_top_n(n):
    global TOP_N, TOP_RANGE
    TOP_N = n
    TOP_RANGE = [-.5, n - 1 + .5]
    
pd.set_option('display.max_columns', None)

Notable legitimate ORCID iDs for exploratory purposes:

In [2]:
AM = '0000-0002-5193-7851'
PP = '0000-0002-8588-4196'

Notable anomalies:

In [3]:
JOURNAL = '0000-0003-1815-5732'
NOINFO = '0000-0001-5009-2052'
VALID_NO_OA = '0000-0002-5154-6404' # Genuine profile, but not in OpenAIRE
WORK_MISUSE = '0000-0001-7870-1120'
# TODO: find an ORCID iD shared by a group, if possible

Notable fake ORCID iDs:

In [4]:
SCAFFOLD = '0000-0001-5004-7761'
WHATSAPP = '0000-0001-6997-9470'
PENIS = '0000-0002-3399-7287'
BITCOIN = '0000-0002-7518-6845'
FITNESS_CHINA = '0000-0002-1234-835X' # URL record + employment
CANNABIS = '0000-0002-9025-8632'      # URL > 70 + works (REMOVED)
PLUMBER = '0000-0002-1700-8311'       # URL > 10 + works 

Load the dataset

In [5]:
parts = glob.glob('../data/processed/dataset.pkl.*')

df = pd.concat((pd.read_pickle(part) for part in sorted(parts)))
df.head(5)
Out[5]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment
0 0000-0001-6097-3953 False False <NA> <NA> <NA> NaN <NA> NaN NaN NaN NaN 0 NaN 2018-03-02t09:29:16.528z 2018-03-02t09:43:07.551z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> <NA> <NA>
1 0000-0001-6112-5550 True True <NA> <NA> <NA> [v.i. yurtaev; v. yurtaev] <NA> NaN NaN NaN [[professor, peoples friendship university of ... 0 NaN 2018-04-03t07:50:23.358z 2020-03-18t09:42:44.753z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> <NA> 1
2 0000-0001-6152-2695 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN NaN 0 NaN 2019-12-11t15:31:56.388z 2020-01-28t15:34:17.309z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> <NA> <NA>
3 0000-0001-6220-5683 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN [[research scientist, new york university abu ... 0 NaN 2015-08-18t12:36:45.307z 2020-09-23t13:37:54.180z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> <NA> 1
4 0000-0001-7071-8294 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN [[researcher (academic), universidad de zarago... 0 NaN 2014-03-10t13:22:01.966z 2016-06-14t22:17:54.470z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> <NA> 2

Notable profiles inspection

In [6]:
df[df['orcid'] == AM]
Out[6]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment
3073261 0000-0002-5193-7851 True True andrea mannocci data scientist & researcher; scholarly knowled... NaN andrea.mannocci@isti.cnr.it [open science, data science, science of scienc... scopus author id, 55233589900 [[information engineering, ph.d., università d... [[research associate, istituto di scienza e te... 37 [scopus - elsevier, crossref metadata search, ... 2017-09-12t14:28:33.467z 2021-03-17t15:40:07.776z 34 0 0 60 True isti.cnr.it NaN [github.io, twitter.com, linkedin.com] <NA> 3 1 5 4 5
In [7]:
df[df['orcid'] == WHATSAPP]
Out[7]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment
9887272 0000-0001-6997-9470 True True other whatsapp <NA> NaN <NA> [whatsapp gb baixar, whatsapp gb 2020, whatsap... NaN NaN NaN 0 NaN 2020-10-07t10:37:12.237z 2020-10-08t02:32:03.935z 0 0 0 0 False NaN NaN [otherwhatsapp.com, im-creator.com, facebook.c... <NA> 27 <NA> 4 <NA> <NA>
In [8]:
df.count()
Out[8]:
orcid                     10989649
verified_email            10989649
verified_primary_email    10989649
given_names               10959039
family_name               10671715
biography                   354015
other_names                 554684
primary_email               124722
keywords                    649637
external_ids               1308598
education                  2441645
employment                 2680488
n_works                   10989649
works_source               2740939
activation_date           10989649
last_update_date          10989649
n_doi                     10989649
n_arxiv                   10989649
n_pmc                     10989649
n_other_pids              10989649
label                     10989649
primary_email_domain        124722
other_email_domains          48615
url_domains                 715067
n_emails                     48615
n_urls                      715067
n_ids                      1308598
n_keywords                  649637
n_education                2441645
n_employment               2680488
dtype: int64
In [9]:
df['orcid'].describe()
Out[9]:
count                10989649
unique               10989649
top       0000-0001-5242-3687
freq                        1
Name: orcid, dtype: object

Primary email

In [10]:
df['primary_email'].describe()
Out[10]:
count                     124722
unique                    124718
top       opercin@erbakan.edu.tr
freq                           2
Name: primary_email, dtype: object

Duplicate emails

In [11]:
# Non-null primary emails that already appeared earlier in the dataset
emails = df['primary_email'].dropna()
emails[emails.duplicated()]
Out[11]:
1681787       opercin@erbakan.edu.tr
5590332     patrick.davey@monash.edu
9316843             maykin@owasp.org
10375852       andycheng2026@163.com
Name: primary_email, dtype: string
In [12]:
df[df['primary_email'] == 'maykin@owasp.org']
Out[12]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment
7543981 0000-0002-0836-2271 True True maykin warasart <NA> NaN maykin@owasp.org NaN NaN NaN NaN 0 NaN 2020-09-15t04:43:55.709z 2020-09-15t05:17:28.509z 0 0 0 0 False owasp.org [dga.or.th] NaN 1 <NA> <NA> <NA> <NA> <NA>
9316843 0000-0001-9855-1676 True True maykin warasart <NA> NaN maykin@owasp.org NaN NaN NaN NaN 0 NaN 2020-10-23t17:51:51.925z 2021-01-01t15:00:52.053z 0 0 0 0 False owasp.org [dga.or.th, ieee.org] NaN 2 <NA> <NA> <NA> <NA> <NA>
In [13]:
df[df['primary_email'] == 'opercin@erbakan.edu.tr']
Out[13]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment
347852 0000-0002-2232-9638 True True osman perçin <NA> NaN opercin@erbakan.edu.tr NaN NaN NaN NaN 0 NaN 2015-01-12t13:47:55.549z 2020-01-27t07:38:24.269z 0 0 0 0 False erbakan.edu.tr NaN NaN <NA> <NA> <NA> <NA> <NA> <NA>
1681787 0000-0003-0033-0918 True True osman perçin <NA> NaN opercin@erbakan.edu.tr NaN NaN NaN [[, necmettin erbakan university, konya, , tr,... 0 NaN 2015-10-13t05:47:12.014z 2020-12-25t13:52:03.976z 0 0 0 0 False erbakan.edu.tr NaN NaN <NA> <NA> <NA> <NA> <NA> 1
In [14]:
df[df['primary_email'] == 'patrick.davey@monash.edu']
Out[14]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment
954085 0000-0002-9158-1757 True True patrick davey <NA> NaN patrick.davey@monash.edu [radiochemistry, inorganic chemistry, bioinorg... NaN NaN [[phd student, monash university, melbourne, ,... 0 NaN 2019-05-09t23:01:02.170z 2019-08-20t03:00:17.844z 0 0 0 0 False monash.edu NaN NaN <NA> <NA> <NA> 4 <NA> 1
5590332 0000-0002-8774-0030 True True patrick davey <NA> NaN patrick.davey@monash.edu NaN NaN NaN [[phd student, monash university, melbourne, v... 1 [crossref] 2018-09-11t10:47:10.997z 2021-02-09t06:21:44.138z 1 0 0 0 True monash.edu NaN NaN <NA> <NA> <NA> <NA> <NA> 1
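
The three per-address lookups above can be collapsed into a single query. This is a minimal sketch on a toy frame (the rows are invented; only the column names `orcid` and `primary_email` come from the dataset). It uses `duplicated(keep=False)`, which marks every member of a duplicate group instead of only the second and later occurrences.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration.
toy = pd.DataFrame({
    'orcid': ['A', 'B', 'C', 'D', 'E'],
    'primary_email': ['x@example.org', None, 'x@example.org', 'y@example.org', None],
})

# Drop missing emails first so that missing values are not treated as duplicates of each other.
with_email = toy.dropna(subset=['primary_email'])

# keep=False flags all rows of each duplicate group, not just the repeats.
dupes = with_email[with_email['primary_email'].duplicated(keep=False)]

# Sort so that profiles sharing an address appear next to each other.
dupe_groups = dupes.sort_values(['primary_email', 'orcid'])
print(dupe_groups)
```

Applied to `df`, this would list both profiles for each of the four duplicated addresses in one table.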
In [15]:
df['primary_email_domain'].describe()
Out[15]:
count        124722
unique        17160
top       gmail.com
freq          26750
Name: primary_email_domain, dtype: object
In [16]:
top_primary_emails = df[['primary_email_domain', 'orcid']]\
                .groupby('primary_email_domain')\
                .count()\
                .sort_values('orcid', ascending=False)
top_primary_emails
Out[16]:
primary_email_domain   orcid
gmail.com              26750
hotmail.com             3801
yahoo.com               2625
163.com                 2132
yuhs.ac                 1134
...                      ...
imf.csic.es                1
imf.org                    1
imfd.tu-freiberg.de        1
imft.fr                    1
zzuli.edu.cn               1

17160 rows × 1 columns

In [17]:
set_top_n(30)
data = [
    go.Bar(
        x=top_primary_emails[:TOP_N].index,
        y=top_primary_emails[:TOP_N]['orcid']
    )
]

layout = go.Layout(
    title='Top-%s email domains' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

Other emails

In [18]:
df[df.other_email_domains.notna()].head()
Out[18]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment
251 0000-0002-5916-446X True True antonio gilvan teixeira júnior <NA> [teixeira, antônio gilvan, júnior, antonio gil... gilvan.junior@aluno.ufca.edu.br [ethicis; medicine; infectology; neurology; ne... [[scopus author id, 56647743200], [scopus auth... [[faculty of health and life sciences, , unive... NaN 14 [antonio gilvan teixeira júnior, scopus - else... 2016-05-18t11:26:36.642z 2016-09-20t18:25:05.728z 13 0 0 8 False aluno.ufca.edu.br [liverpool.ac.uk] [researchgate.net, academia.edu, cnpq.br] 1 3 4 1 1 <NA>
316 0000-0002-8742-947X True True aaron tan shing loong <NA> NaN aaron.tanshingloong@wadh.ox.ac.uk NaN NaN [[ruskin school of art; wadham college, , univ... NaN 0 NaN 2015-10-05t23:10:08.771z 2016-06-14t19:55:50.313z 0 0 0 0 False wadh.ox.ac.uk [rsa.ox.ac.uk] NaN 1 <NA> <NA> <NA> 1 <NA>
433 0000-0001-9097-2281 True True abhishek solanki <NA> NaN <NA> NaN NaN NaN [[senior engineer, robert bosch (india), benga... 1 [abhishek solanki] 2019-04-22t04:43:06.232z 2020-07-02t14:18:28.305z 0 0 0 0 False NaN [in.bosch.com] [github.com, linkedin.com] 1 2 <NA> <NA> <NA> 2
497 0000-0002-8614-3007 True True adam arra <NA> NaN <NA> NaN NaN NaN NaN 0 NaN 2017-11-15t06:33:45.625z 2017-11-15t06:44:02.998z 0 0 0 0 False NaN [hct.ac.ae] NaN 1 <NA> <NA> <NA> <NA> <NA>
869 0000-0001-9884-5498 True True alberto ronzani <NA> NaN alberto@aronza.com NaN NaN NaN [[research scientist, vtt technical research c... 19 [crossref metadata search, alberto ronzani, cr... 2014-04-16t13:21:54.287z 2020-09-28t15:10:37.439z 18 0 0 3 True aronza.com [vtt.fi] NaN 1 <NA> <NA> <NA> <NA> 1
In [19]:
emails_by_orcid = df[['orcid', 'n_emails']].sort_values('n_emails', ascending=False)
In [20]:
set_top_n(30)
data = [
    go.Bar(
        x=emails_by_orcid[:TOP_N]['orcid'],
        y=emails_by_orcid[:TOP_N]['n_emails']
    )
]

layout = go.Layout(
    title='Top %s ORCID iDs by email' % TOP_N, 
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
In [21]:
top_other_emails = df[['orcid', 'other_email_domains']]\
                        .explode('other_email_domains')\
                        .reset_index(drop=True)\
                        .groupby('other_email_domains')\
                        .count()\
                        .sort_values('orcid', ascending=False)
In [22]:
set_top_n(30)
data = [
    go.Bar(
        x=top_other_emails[:TOP_N].index,
        y=top_other_emails[:TOP_N]['orcid']
    )
]

layout = go.Layout(
    title='Top %s other email domains' % TOP_N, 
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

This makes some sense: legitimate users may register a Gmail account as their primary address for login purposes and list institutional addresses as other emails. It also makes life easier when they move to a new institution.
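
One way to check this hypothesis is to count profiles whose primary address is on a free-mail provider while at least one other address is not. The sketch below runs on a toy frame with invented rows. The `FREE_MAIL` set is a hypothetical, deliberately short list, and "not a free-mail domain" is only a rough proxy for "institutional". It also assumes that `other_email_domains` holds Python lists, as the outputs above suggest.

```python
import pandas as pd

# Hypothetical, deliberately short list of free-mail providers; extend as needed.
FREE_MAIL = {'gmail.com', 'hotmail.com', 'yahoo.com', '163.com'}

# Toy stand-in for the real dataset; the values are invented for illustration.
toy = pd.DataFrame({
    'orcid': ['A', 'B', 'C', 'D'],
    'primary_email_domain': ['gmail.com', 'gmail.com', 'isti.cnr.it', None],
    'other_email_domains': [['monash.edu'], ['hotmail.com'], ['gmail.com'], ['yale.edu']],
})

def has_non_free_domain(domains):
    """True if the list holds at least one domain outside FREE_MAIL."""
    if not isinstance(domains, list):
        return False  # missing value
    return any(d not in FREE_MAIL for d in domains)

free_primary = toy['primary_email_domain'].isin(FREE_MAIL)
other_non_free = toy['other_email_domains'].apply(has_non_free_domain)

# Profiles that log in with a free-mail address but also list a non-free one.
matches = toy[free_primary & other_non_free]
print(matches['orcid'].tolist())
```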

Email speculation

In [23]:
df[df.primary_email.isna() & df.other_email_domains.notna()]
Out[23]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment
433 0000-0001-9097-2281 True True abhishek solanki <NA> NaN <NA> NaN NaN NaN [[senior engineer, robert bosch (india), benga... 1 [abhishek solanki] 2019-04-22t04:43:06.232z 2020-07-02t14:18:28.305z 0 0 0 0 False NaN [in.bosch.com] [github.com, linkedin.com] 1 2 <NA> <NA> <NA> 2
497 0000-0002-8614-3007 True True adam arra <NA> NaN <NA> NaN NaN NaN NaN 0 NaN 2017-11-15t06:33:45.625z 2017-11-15t06:44:02.998z 0 0 0 0 False NaN [hct.ac.ae] NaN 1 <NA> <NA> <NA> <NA> <NA>
898 0000-0003-3728-6439 True True alejandra echeverry velásquez alejandra echeverry is an industrial electrici... NaN <NA> [control, technology, science, innovation, eng... NaN [[, electrical engineer, institución universit... [[professor, institución universitaria pascual... 1 [crossref] 2019-03-31t00:00:42.929z 2020-09-06t02:18:54.290z 1 0 0 0 True NaN [pascualbravo.edu.co] NaN 1 <NA> <NA> 7 1 1
1719 0000-0001-8330-7443 True True andrea tesoniero <NA> NaN <NA> NaN researcherid, d-9056-2015 [[department of geophysics, master of science ... [[postdoctoral associate, yale university, new... 4 [andrea tesoniero] 2015-03-09t11:59:06.093z 2020-08-20t15:03:23.447z 4 0 0 2 False NaN [yale.edu] NaN 1 <NA> 1 <NA> 4 2
6829 0000-0001-9670-515X True True esma esin yildirim <NA> NaN <NA> [pharmacognosy, natural chemistry, chemical en... NaN [[business management, master of science, ista... NaN 0 NaN 2020-07-26t10:38:03.721z 2020-07-26t10:52:26.539z 0 0 0 0 False NaN [gmail.com] NaN 1 <NA> <NA> 3 3 <NA>
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10985816 0000-0003-1204-6009 True True nathan walk <NA> NaN <NA> NaN NaN [[department of physics, doctor of philosophy,... [[, university of oxford, oxford, oxfordshire,... 10 [crossref metadata search] 2016-07-28t14:24:16.844z 2020-10-13t11:47:50.621z 10 0 0 0 True NaN [cs.ox.ac.uk] [fu-berlin.de] 1 1 <NA> <NA> 3 2
10986027 0000-0002-3472-7668 True True raf vandevelde <NA> NaN <NA> NaN NaN [[chemical engineering technology, master, kat... [[phd researcher, katholieke universiteit leuv... 0 NaN 2020-10-14t13:56:44.779z 2020-10-16t14:21:40.673z 0 0 0 0 False NaN [kuleuven.be] [linkedin.com] 1 1 <NA> <NA> 2 1
10987501 0000-0002-9602-0529 True True carlos augusto finelli <NA> NaN <NA> NaN NaN NaN NaN 1 [crossref] 2013-09-16t16:52:06.120z 2020-12-01t22:47:08.074z 1 0 0 0 True NaN [cecot.com.br] NaN 1 <NA> <NA> <NA> <NA> <NA>
10987829 0000-0003-4402-5982 True True filipe de almeida araújo <NA> NaN <NA> NaN NaN [[materials science, msc. materials science, m... [[co-owner, aeft acessory, manaus, amazonas, b... 0 NaN 2020-03-02t20:11:01.699z 2020-12-04t13:53:39.404z 0 0 0 0 False NaN [ime.eb.br] NaN 1 <NA> <NA> <NA> 2 1
10988444 0000-0002-1734-7241 True True manareldeen ahmed <NA> NaN <NA> [deep learning, atomistic simulation, graphene... NaN NaN [[post-doctor, zhejiang university, hangzhou, ... 6 [manareldeen ahmed] 2017-02-17t13:18:36.540z 2020-12-04t02:04:36.668z 6 0 0 3 True NaN [hotmail.com] NaN 1 <NA> <NA> 5 <NA> 1

19814 rows × 30 columns

URLs

In [24]:
df[df.url_domains.notna()].head()
Out[24]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment
6 0000-0001-7402-0096 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN [[, kth royal institute of technology, stockho... 0 NaN 2015-01-11t15:13:06.467z 2016-06-14t23:55:59.896z 0 0 0 0 False NaN NaN [kth.se] <NA> 1 <NA> <NA> <NA> 1
11 0000-0001-8377-3508 True True <NA> <NA> <NA> [fontana, milena da silva] <NA> [educação; informática; matemática.] NaN NaN [[, instituto federal de educação, ciência e t... 0 NaN 2018-05-23t23:39:04.534z 2019-10-16t02:50:11.007z 0 0 0 0 False NaN NaN [cnpq.br] <NA> 1 <NA> 1 <NA> 3
29 0000-0002-2638-4108 True True <NA> <NA> investigador de la universidad de oviedo. depa... NaN <NA> [constitutional history, history of political ... scopus author id, 54394231000 [[public law, ph doctor, university of oviedo,... [[professor of constitutional law, university ... 1 [crossref] 2013-03-25t14:38:06.016z 2020-07-01t13:10:37.025z 1 0 0 0 False NaN NaN [unioviedo.es] <NA> 1 1 3 1 1
46 0000-0003-1435-6545 True True <NA> <NA> <NA> NaN <NA> [prostate cancer, migration, culture cell] researcherid, p-2223-2018 [[morfologia, , universidade estadual paulista... [[, universidade estadual paulista (unesp), in... 0 NaN 2018-08-09t12:12:24.405z 2020-04-22t01:38:03.184z 0 0 0 0 False NaN NaN [cnpq.br, linkedin.com] <NA> 2 1 3 1 1
158 0000-0003-1284-9741 True True alex percy antonio manriquez paisig <NA> NaN <NA> NaN NaN NaN NaN 0 NaN 2020-09-08t20:04:33.906z 2020-09-08t20:25:55.432z 0 0 0 0 False NaN NaN [youtube.com] <NA> 1 <NA> <NA> <NA> <NA>
In [25]:
urls_by_orcid = df[['orcid', 'n_urls']].sort_values('n_urls', ascending=False)
urls_by_orcid
Out[25]:
          orcid                n_urls
3226518   0000-0002-1234-835X     219
4206055   0000-0001-7478-4539     174
4901870   0000-0002-7392-3792     169
8184260   0000-0002-6938-9638     152
2743648   0000-0002-5710-4041     114
...       ...                     ...
10989644  0000-0002-1686-1935    <NA>
10989645  0000-0002-3800-6331    <NA>
10989646  0000-0002-8783-5814    <NA>
10989647  0000-0002-7584-2283    <NA>
10989648  0000-0003-0529-3538    <NA>

10989649 rows × 2 columns

In [26]:
set_top_n(100)
data = [
    go.Bar(
        x=urls_by_orcid[:TOP_N]['orcid'],
        y=urls_by_orcid[:TOP_N]['n_urls']
    )
]

layout = go.Layout(
    title='Top %s ORCID iDs with URLs' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
In [27]:
top_urls = df[['orcid', 'url_domains']]\
                .explode('url_domains')\
                .reset_index(drop=True)\
                .groupby('url_domains')\
                .count()\
                .sort_values('orcid', ascending=False)
In [28]:
set_top_n(50)
data = [
    go.Bar(
        x=top_urls[:TOP_N].index,
        y=top_urls[:TOP_N]['orcid']
    )
]

layout = go.Layout(
    title='Top-%s URL domains' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

URLs speculation

In [29]:
df[(df['url_domains'].str.len() > 50) & (df['n_works'] > 0)]
Out[29]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment
1025713 0000-0003-2407-3557 True True abdul aziz abdul aziz was born on may 25, 1973, in brebes... [abdul aziz, aziz, abdul, aziz, a., aziz, abd,... <NA> [metodologi penelitian, ilmu ekonomi, ekonomi ... NaN [[ilmu ekonomi, dr, universitas borobudur, jak... [[assisten professor/dr, institut agama islam ... 72 [base - bielefeld academic search engine, abdu... 2016-09-12t04:41:24.842z 2021-01-26t11:58:33.039z 19 0 0 77 False NaN NaN [google.com, syekhnurjati.ac.id, orcid.org, bl... <NA> 59 <NA> 4 3 1
2743648 0000-0002-5710-4041 True True ryszard romaniuk professor of electronics and communications en... [r.romaniuk, r.s.romaniuk, ryszard romaniuk, r... rrom@ise.pw.edu.pl [electronics, measurement systems, research sy... [[isni, 0000000071432485], [researcherid, b-91... [[faculty of electronics and information techn... [[professor, institute director, politechnika ... 5008 [inspire-hep, researcherid, isni2orcid search ... 2013-01-20t12:09:21.600z 2021-03-16t19:37:31.650z 1221 25 0 1742 True ise.pw.edu.pl [ise.pw.edu.pl, elka.pw.edu.pl, cern.ch] [google.pl, publons.com, scopus.com, mendeley.... 3 114 3 5 1 1
3011724 0000-0003-2450-090X True True eduard babulak professor eduard babulak is accomplished inter... [professor eduard babulak] <NA> [quality of service provision assessment, next... [[scopus author id, 6506867432], [researcherid... [[information technology, doctor habilitated (... [[consultant, horizon 2020 framework programme... 274 [the lens, base - bielefeld academic search en... 2013-04-03t08:02:30.013z 2021-02-28t10:07:13.231z 199 0 1 174 False NaN NaN [worldassessmentcouncil.org, spseke.sk, bcs.or... <NA> 114 5 8 6 22
3881064 0000-0002-3920-7389 True True а. гусев surname, name gusev alexander leonidovichdate... [alexander l. gusev , alexander leonidovich gu... <NA> [technologies of production, technologies of i... [[researcherid, f-8048-2014], [scopus author i... [[chemical technology and cryogenic-vacuum tec... [[general director, scientific technical centr... 472 [publons, datacite, scopus - elsevier, a.l. gu... 2014-05-14t00:01:28.030z 2021-01-16t13:44:14.134z 37 0 0 21 False NaN NaN [youtube.com, isjaee.com, researchgate.net, re... <NA> 111 2 16 2 7
7466062 0000-0002-1929-6054 True True franklin américo canaza choque docente-investigador social. maestrando en der... [franklin américo canaza-choque , franklin a. ... leo_123fa@hotmail.com [filosofía; educación; políticas de desarrollo... [[researcherid, p-8613-2018], [loop profile, 8... [[facultad de ciencias de la educación , maest... [[investigador social, universidad católica de... 39 [researcherid, base - bielefeld academic searc... 2017-09-15t19:45:43.483z 2021-03-23t20:12:47.297z 30 0 0 34 True hotmail.com [gmail.com, gmail.com, hotmail.com, baldwin.ed... [concytec.gob.pe, redalyc.org, redalyc.org, un... 5 61 4 2 1 1
7517096 0000-0003-4948-9268 True True gustavo duperré gustavo norberto duperré graduated in arts and... [gustavo norberto duperré, duperré, g. n., gus... gustavo.duperre@usal.edu.ar [computer science, sciences of antiquity, cont... [[scopus author id, 57195936346], [researcheri... [[programme in history, history of art and ter... [[titular professor, dirección general de cult... 41 [gustavo duperré, scopus - elsevier, publons, ... 2020-02-22t15:49:52.386z 2021-03-12t15:13:44.065z 13 0 0 34 False usal.edu.ar NaN [icomos.ro, unirioja.es, unirioja.es, unc.edu.... <NA> 61 2 11 6 5
8068275 0000-0003-2183-8112 True True pelayo munhoz olea pós-doutorado em gestão ambiental pela univers... [ munhoz, pelayo olea, olea, pelayo, olea, p... <NA> [empreendedorismo, sustentabilidade, inovação] [[scopus author id, 55175503300], [researcheri... [[, postdoctoral in environmental sustainabili... [[professor, universidade federal do rio grand... 1109 [the lens, pelayo munhoz olea, dimensions, bas... 2013-02-04t17:25:34.723z 2021-03-19t18:51:01.128z 798 0 1 582 True NaN NaN [cnpq.br, cnpq.br, cnpq.br, cnpq.br, publons.c... <NA> 61 2 3 7 9
8184260 0000-0002-6938-9638 True True adolfo catral sanabria my education is in computer science, mathemati... NaN <NA> NaN loop profile, 747193 [[education, capacitación para la enseñanza en... NaN 2023 [base - bielefeld academic search engine, data... 2019-05-07t19:27:02.210z 2020-12-10t23:39:15.236z 2022 0 0 16 False NaN NaN [researchgate.net, youtube.com, linkedin.com, ... <NA> 152 1 <NA> 6 <NA>
8791256 0000-0002-9025-8632 True True buycannabis dispensary we procure and deliver premium cannabis strain... [we procure and deliver premium cannabis strai... <NA> [marijuana dispensary, cannabis, canabis dispe... NaN NaN NaN 10 [goowonderland dispensary] 2020-12-09t21:19:46.004z 2020-12-10t01:17:28.772z 0 0 0 0 False NaN NaN [goowonderland.com, goowonderland.com, goowond... <NA> 81 <NA> 7 <NA> <NA>
10174509 0000-0002-9965-2425 True True jaroslaw spychala jaroslaw spychala has received a doctoral degr... [jaroslaw jozef spychala] <NA> [photochemistry, medicinal and pharmaceutical ... scopus author id, 7006745874 [[department of chemistry, postdoctoral associ... [[assistant professor, adam mickiewicz univers... 29 [scopus - elsevier] 2014-09-18t12:34:14.242z 2020-02-11t14:31:25.544z 15 0 0 29 True NaN NaN [biowebspin.com, biowebspin.com, google.com, l... <NA> 73 1 4 4 2
10257808 0000-0002-4062-3603 True True juan de dios beltrán mancilla juan de dios beltrán mancilla (*) filósofo aut... [juan de dios beltrán mancilla, filósofo autod... <NA> [filosofia medicina arquitectura economía dere... NaN [[, diplomado en practicas directivas para or... [[inspector general jornada vespertina // de 2... 11 [juan de dios beltr´´án mancilla] 2020-04-19t21:06:33.495z 2021-02-10t20:13:07.698z 0 0 0 7 False NaN NaN [yumpu.com, ijopm.org, google.com, blogspot.co... <NA> 69 <NA> 1 8 6
10486212 0000-0002-3997-5070 True True dr. parameshachari b d dr. parameshachari b dacm distinguished speake... [dr. parameshachari b d] <NA> [mysore region coordinator|ieee bangalore sect... [[researcherid, f-7045-2018], [scopus author i... [[electronics and communication engineering, p... [[acm distinguished speaker (volunteer), assoc... 93 [publons, multidisciplinary digital publishing... 2016-08-24t11:00:30.403z 2021-03-23t07:16:22.582z 47 0 0 48 False NaN NaN [geethashishu.in, geethashishu.in, acm.org, go... <NA> 71 3 6 5 10
10652632 0000-0003-2593-7134 True True aan jaelani all my papers can be downloaded from portal:re... [jaelani, a., jaelani, aan] aan_jaelani@syekhnurjati.ac.id [islamic economics, islamic finance and bankin... [[scopus author id, 57195963463], [loop profil... [[post graduate, s3/dr, universitas islam nege... [[dr, institut agama islam negeri syekh nurjat... 79 [publons, aan jaelani, scopus - elsevier, dime... 2016-03-02t18:37:44.989z 2021-03-19t10:11:57.908z 88 0 0 193 True syekhnurjati.ac.id [gmail.com] [microsoft.com, twitter.com, academia.edu, aca... 1 67 4 7 2 1
In [30]:
df[(df['url_domains'].str.len() > 10) & (df['n_works'] > 0) & (df['works_source'].str.len() == 1)]
Out[30]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment
47439 0000-0002-5967-2835 True True oleksiy goryayinov <NA> [алексей николаевич горяинов, о.м.горяїнов, а.... <NA> [diagnostics, transport, logistics] researcherid, i-7977-2016 [[, дистанционный курс «ctl.sc2x: supply chain... [[docent, kharkiv petro vasylenko national tec... 274 [oleksiy goryayinov] 2014-08-03t18:06:42.925z 2021-03-22t13:56:48.311z 0 0 0 0 False NaN NaN [khntusg.com.ua, khntusg.com.ua, google.com.ua... <NA> 13 1 3 14 7
72557 0000-0002-3505-2797 True True nurul malahayati google scholar NaN <NA> NaN researcherid, q-3861-2017 [[civil and transportation engineering , maste... [[senior lecturer, universitas syiah kuala, ba... 6 [nurul malahayati] 2017-10-01t00:46:31.324z 2019-08-19t15:52:47.253z 3 0 0 3 False NaN NaN [google.com, ristekdikti.go.id, unsyiah.ac.id,... <NA> 16 1 <NA> 2 1
94081 0000-0003-3670-9620 True True carlos barrera im individual inventor, and this is my work; s... [retrodynamic, novelinflow] <NA> [energy, technology, gearturbine, imploturboco... loop profile, 394457 NaN NaN 1 [carlos barrera] 2016-08-29t20:32:10.362z 2021-02-09t04:56:35.554z 0 0 0 0 False NaN NaN [blogspot.mx, behance.net, authorstream.com, d... <NA> 24 1 8 <NA> <NA>
261673 0000-0002-5441-0465 True True nuria hernández-león <NA> [nuria h. león, nuria hernández león, hernánde... <NA> [training, icts, business management, research... NaN [[, course: social skills, university of salam... [[merchandise reception and expedition trainer... 11 [nuria hernández-león] 2015-11-28t07:18:58.442z 2021-03-05t16:37:47.403z 1 0 0 4 False NaN NaN [feriaempresamujer.com, escueladenegociosydire... <NA> 16 <NA> 7 19 16
326211 0000-0002-7781-6767 True True mohd nazri ismail born in penang, malaysia in 1971, dr. mohd had... [ndum (national defence university of malaysia)] <NA> [wsn, manet, simulation and modelling, network... [[scopus author id, 24372977800], [researcheri... NaN [[lecturer, universiti pertahanan nasional mal... 35 [scopus - elsevier] 2016-09-06t02:25:52.974z 2020-10-20t06:55:55.051z 24 0 0 35 True NaN NaN [google.com.my, researchgate.net, academia.edu... <NA> 16 2 10 <NA> 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10579801 0000-0001-5087-6965 True True robert ohara systematics, evolutionary biology, and the his... [r. ohara, r.j. ohara, robert ohara, robert... <NA> [history and philosophy of science, evolutiona... [[isni, 0000000138200102], [researcherid, b-47... [[biology, ph.d., harvard university, cambridg... NaN 45 [robert j. ohara] 2014-09-21t02:45:19.620z 2020-07-09t06:51:09.228z 23 0 0 72 True NaN NaN [rjohara.net, google.com, collegiateway.org, r... <NA> 12 3 5 1 <NA>
10590882 0000-0002-3318-9861 True True shagufta perveen prof. dr. shagufta perveen is a professor at k... NaN shagufta792000@yahoo.com [shagufta perveen professor, shagufta perveen ... NaN [[hej research institute of chemistry, phd che... [[professor, king saud university college of p... 66 [scopus - elsevier] 2015-12-21t10:34:06.771z 2021-02-22t14:58:30.893z 56 0 0 66 True yahoo.com [msu.edu, ksu.edu.sa] [shaguftaperveen.com, researchgate.net, ksu.ed... 2 11 <NA> 25 3 7
10766062 0000-0001-8960-9004 True True susan bastani <NA> [s. bastani, سوسن باستانی] sbastani@alzahra.ac.ir [social networks, fuzzy logic, online and offl... scopus author id, 16642098400 [[sociology, ph.d., university of toronto, tor... [[professor, alzahra university, tehran, vanak... 20 [scopus - elsevier] 2019-07-10t06:50:46.255z 2020-10-07t04:08:01.961z 19 0 0 33 True alzahra.ac.ir [gmail.com, gmail.com] [scopus.com, google.com, publons.com, zenodo.o... 2 11 1 4 3 4
10807839 0000-0002-4379-6454 True True caroline wanjiru kariuki caroline holds a phd in economics from curtin ... NaN <NA> [development economics, applied econometrics, ... NaN [[economics, doctor of philosophy , curtin uni... [[director, educational development, strathmor... 4 [caroline wanjiru kariuki] 2020-03-18t10:18:04.007z 2021-02-11t14:40:38.515z 1 0 0 0 False NaN NaN [scopus.com, mendeley.com, publons.com, resear... <NA> 13 <NA> 4 3 6
10911966 0000-0003-2311-0600 True True myo kyaw hlaing <NA> [dr myo kyaw hlaing] <NA> [economic geology] NaN NaN [[lecturer, union of myanmar ministry of educa... 2 [myo kyaw hlaing] 2018-12-26t12:51:57.801z 2021-01-26t14:36:47.421z 1 0 0 2 False NaN NaN [facebook.com, linkedin.com, instagram.com, re... <NA> 12 <NA> 1 <NA> 2

140 rows × 30 columns

In [31]:
exploded_sources = df[(df['url_domains'].str.len() > 10) & (df['n_works'] > 0) & (df['works_source'].str.len() == 1)].explode('works_source').reset_index(drop=True)
exploded_sources
Out[31]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment
0 0000-0002-5967-2835 True True oleksiy goryayinov <NA> [алексей николаевич горяинов, о.м.горяїнов, а.... <NA> [diagnostics, transport, logistics] researcherid, i-7977-2016 [[, дистанционный курс «ctl.sc2x: supply chain... [[docent, kharkiv petro vasylenko national tec... 274 oleksiy goryayinov 2014-08-03t18:06:42.925z 2021-03-22t13:56:48.311z 0 0 0 0 False NaN NaN [khntusg.com.ua, khntusg.com.ua, google.com.ua... <NA> 13 1 3 14 7
1 0000-0002-3505-2797 True True nurul malahayati google scholar NaN <NA> NaN researcherid, q-3861-2017 [[civil and transportation engineering , maste... [[senior lecturer, universitas syiah kuala, ba... 6 nurul malahayati 2017-10-01t00:46:31.324z 2019-08-19t15:52:47.253z 3 0 0 3 False NaN NaN [google.com, ristekdikti.go.id, unsyiah.ac.id,... <NA> 16 1 <NA> 2 1
2 0000-0003-3670-9620 True True carlos barrera im individual inventor, and this is my work; s... [retrodynamic, novelinflow] <NA> [energy, technology, gearturbine, imploturboco... loop profile, 394457 NaN NaN 1 carlos barrera 2016-08-29t20:32:10.362z 2021-02-09t04:56:35.554z 0 0 0 0 False NaN NaN [blogspot.mx, behance.net, authorstream.com, d... <NA> 24 1 8 <NA> <NA>
3 0000-0002-5441-0465 True True nuria hernández-león <NA> [nuria h. león, nuria hernández león, hernánde... <NA> [training, icts, business management, research... NaN [[, course: social skills, university of salam... [[merchandise reception and expedition trainer... 11 nuria hernández-león 2015-11-28t07:18:58.442z 2021-03-05t16:37:47.403z 1 0 0 4 False NaN NaN [feriaempresamujer.com, escueladenegociosydire... <NA> 16 <NA> 7 19 16
4 0000-0002-7781-6767 True True mohd nazri ismail born in penang, malaysia in 1971, dr. mohd had... [ndum (national defence university of malaysia)] <NA> [wsn, manet, simulation and modelling, network... [[scopus author id, 24372977800], [researcheri... NaN [[lecturer, universiti pertahanan nasional mal... 35 scopus - elsevier 2016-09-06t02:25:52.974z 2020-10-20t06:55:55.051z 24 0 0 35 True NaN NaN [google.com.my, researchgate.net, academia.edu... <NA> 16 2 10 <NA> 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
135 0000-0001-5087-6965 True True robert ohara systematics, evolutionary biology, and the his... [r. ohara, r.j. ohara, robert ohara, robert... <NA> [history and philosophy of science, evolutiona... [[isni, 0000000138200102], [researcherid, b-47... [[biology, ph.d., harvard university, cambridg... NaN 45 robert j. ohara 2014-09-21t02:45:19.620z 2020-07-09t06:51:09.228z 23 0 0 72 True NaN NaN [rjohara.net, google.com, collegiateway.org, r... <NA> 12 3 5 1 <NA>
136 0000-0002-3318-9861 True True shagufta perveen prof. dr. shagufta perveen is a professor at k... NaN shagufta792000@yahoo.com [shagufta perveen professor, shagufta perveen ... NaN [[hej research institute of chemistry, phd che... [[professor, king saud university college of p... 66 scopus - elsevier 2015-12-21t10:34:06.771z 2021-02-22t14:58:30.893z 56 0 0 66 True yahoo.com [msu.edu, ksu.edu.sa] [shaguftaperveen.com, researchgate.net, ksu.ed... 2 11 <NA> 25 3 7
137 0000-0001-8960-9004 True True susan bastani <NA> [s. bastani, سوسن باستانی] sbastani@alzahra.ac.ir [social networks, fuzzy logic, online and offl... scopus author id, 16642098400 [[sociology, ph.d., university of toronto, tor... [[professor, alzahra university, tehran, vanak... 20 scopus - elsevier 2019-07-10t06:50:46.255z 2020-10-07t04:08:01.961z 19 0 0 33 True alzahra.ac.ir [gmail.com, gmail.com] [scopus.com, google.com, publons.com, zenodo.o... 2 11 1 4 3 4
138 0000-0002-4379-6454 True True caroline wanjiru kariuki caroline holds a phd in economics from curtin ... NaN <NA> [development economics, applied econometrics, ... NaN [[economics, doctor of philosophy , curtin uni... [[director, educational development, strathmor... 4 caroline wanjiru kariuki 2020-03-18t10:18:04.007z 2021-02-11t14:40:38.515z 1 0 0 0 False NaN NaN [scopus.com, mendeley.com, publons.com, resear... <NA> 13 <NA> 4 3 6
139 0000-0003-2311-0600 True True myo kyaw hlaing <NA> [dr myo kyaw hlaing] <NA> [economic geology] NaN NaN [[lecturer, union of myanmar ministry of educa... 2 myo kyaw hlaing 2018-12-26t12:51:57.801z 2021-01-26t14:36:47.421z 1 0 0 2 False NaN NaN [facebook.com, linkedin.com, instagram.com, re... <NA> 12 <NA> 1 <NA> 2

140 rows × 30 columns

In [32]:
exploded_sources[exploded_sources.apply(lambda x: x['works_source'].find(x['given_names']) >= 0, axis=1)]
Out[32]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment
0 0000-0002-5967-2835 True True oleksiy goryayinov <NA> [алексей николаевич горяинов, о.м.горяїнов, а.... <NA> [diagnostics, transport, logistics] researcherid, i-7977-2016 [[, дистанционный курс «ctl.sc2x: supply chain... [[docent, kharkiv petro vasylenko national tec... 274 oleksiy goryayinov 2014-08-03t18:06:42.925z 2021-03-22t13:56:48.311z 0 0 0 0 False NaN NaN [khntusg.com.ua, khntusg.com.ua, google.com.ua... <NA> 13 1 3 14 7
1 0000-0002-3505-2797 True True nurul malahayati google scholar NaN <NA> NaN researcherid, q-3861-2017 [[civil and transportation engineering , maste... [[senior lecturer, universitas syiah kuala, ba... 6 nurul malahayati 2017-10-01t00:46:31.324z 2019-08-19t15:52:47.253z 3 0 0 3 False NaN NaN [google.com, ristekdikti.go.id, unsyiah.ac.id,... <NA> 16 1 <NA> 2 1
2 0000-0003-3670-9620 True True carlos barrera im individual inventor, and this is my work; s... [retrodynamic, novelinflow] <NA> [energy, technology, gearturbine, imploturboco... loop profile, 394457 NaN NaN 1 carlos barrera 2016-08-29t20:32:10.362z 2021-02-09t04:56:35.554z 0 0 0 0 False NaN NaN [blogspot.mx, behance.net, authorstream.com, d... <NA> 24 1 8 <NA> <NA>
3 0000-0002-5441-0465 True True nuria hernández-león <NA> [nuria h. león, nuria hernández león, hernánde... <NA> [training, icts, business management, research... NaN [[, course: social skills, university of salam... [[merchandise reception and expedition trainer... 11 nuria hernández-león 2015-11-28t07:18:58.442z 2021-03-05t16:37:47.403z 1 0 0 4 False NaN NaN [feriaempresamujer.com, escueladenegociosydire... <NA> 16 <NA> 7 19 16
5 0000-0001-7010-2908 True True clara sarmento clara sarmento holds an aggregation in cultura... NaN <NA> [portuguese culture and literature, cultural a... ciência id, d418-d6f8-7d49 [[ao abrigo da bolsa santander ie best practic... [[presidente da comissão de acreditação do nov... 275 clara sarmento 2013-12-12t00:33:58.190z 2020-10-12t14:43:00.749z 17 0 0 60 True NaN NaN [iscap.pt, google.pt, academia.edu, researchga... <NA> 13 1 6 8 37
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
133 0000-0003-1020-1351 True True sheikh saifullah ahmed sheikh saifullah ahmed is a full-time lecturer... NaN saifullahahmedku@gmail.com [south asian literature, postmodern literature... NaN [[english discipline , ma & ba in english , kh... [[lecturer , international university of busin... 3 sheikh saifullah ahmed 2020-04-08t21:00:11.201z 2021-02-12t20:45:32.247z 2 0 0 3 False gmail.com NaN [academia.edu, iubat.edu, google.com, research... <NA> 12 <NA> 5 1 1
134 0000-0001-7228-5680 True True text protocol <NA> NaN <NA> NaN NaN NaN [[engineer, textprotocol.org, palo alto, ca, u... 1 text protocol 2021-03-09t10:30:32.237z 2021-03-21t17:17:40.500z 0 0 0 0 False NaN NaN [about.me, figma.com, github.com, gitlab.com, ... <NA> 15 <NA> <NA> <NA> 1
135 0000-0001-5087-6965 True True robert ohara systematics, evolutionary biology, and the his... [r. ohara, r.j. ohara, robert ohara, robert... <NA> [history and philosophy of science, evolutiona... [[isni, 0000000138200102], [researcherid, b-47... [[biology, ph.d., harvard university, cambridg... NaN 45 robert j. ohara 2014-09-21t02:45:19.620z 2020-07-09t06:51:09.228z 23 0 0 72 True NaN NaN [rjohara.net, google.com, collegiateway.org, r... <NA> 12 3 5 1 <NA>
138 0000-0002-4379-6454 True True caroline wanjiru kariuki caroline holds a phd in economics from curtin ... NaN <NA> [development economics, applied econometrics, ... NaN [[economics, doctor of philosophy , curtin uni... [[director, educational development, strathmor... 4 caroline wanjiru kariuki 2020-03-18t10:18:04.007z 2021-02-11t14:40:38.515z 1 0 0 0 False NaN NaN [scopus.com, mendeley.com, publons.com, resear... <NA> 13 <NA> 4 3 6
139 0000-0003-2311-0600 True True myo kyaw hlaing <NA> [dr myo kyaw hlaing] <NA> [economic geology] NaN NaN [[lecturer, union of myanmar ministry of educa... 2 myo kyaw hlaing 2018-12-26t12:51:57.801z 2021-01-26t14:36:47.421z 1 0 0 2 False NaN NaN [facebook.com, linkedin.com, instagram.com, re... <NA> 12 <NA> 1 <NA> 2

113 rows × 30 columns

Works source

In [33]:
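def remove_own_source(lst, given, family):
    """Drop the sources that match the profile owner's own name.

    A source counts as self-asserted when it contains the given name or,
    if available, the family name (case-insensitive substring match).
    """
    given = given.lower()
    family = family.lower() if pd.notna(family) else None
    res = []
    for ws in lst:
        source = ws.lower()
        if given in source:
            continue
        if family is not None and family in source:
            continue
        res.append(ws)
    return res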
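
To make the filtering rule concrete, here is a minimal sketch on toy data. It restates the helper in compact form so the snippet runs on its own; the two toy profiles are made up for illustration and only mimic the shape of the real columns.

```python
import pandas as pd


def remove_own_source(lst, given, family):
    """Keep only the sources containing neither the given nor the family name."""
    given = given.lower()
    family = family.lower() if pd.notna(family) else None
    return [ws for ws in lst
            if given not in ws.lower()
            and (family is None or family not in ws.lower())]


# Hypothetical profiles: the second one has no family name.
toy = pd.DataFrame({
    'given_names': ['robert', 'luang'],
    'family_name': ['ohara', pd.NA],
    'works_source': [['robert j. ohara', 'crossref'],
                     ['scopus - elsevier', 'luang xu']],
})

toy['ext_works_source'] = toy.apply(
    lambda x: remove_own_source(x['works_source'], x['given_names'], x['family_name']),
    axis=1,
)
print(toy[['given_names', 'ext_works_source']])
```

Because the match is a plain substring test, a very short given name can also remove unrelated sources that happen to contain it.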
In [34]:
df['ext_works_source'] = df[(df.works_source.notna()) & (df.given_names.notna())]\
                        .apply(lambda x: remove_own_source(x['works_source'], x['given_names'], x['family_name']), axis=1)
In [35]:
df['n_ext_work_source'] = df.ext_works_source.str.len()
In [36]:
exploded_external_sources = df[df['ext_works_source'].str.len() > 0][['orcid','ext_works_source']]\
                            .explode('ext_works_source').reset_index(drop=True)
In [37]:
grouped_ext_sources = exploded_external_sources.groupby('ext_works_source')\
                        .count()\
                        .sort_values('orcid', ascending=False)\
                        .reset_index()
In [38]:
data = [
    go.Bar(
        x=grouped_ext_sources[:30].ext_works_source,
        y=grouped_ext_sources[:30].orcid
    )
]

layout = go.Layout(
    title='Top 30 works_source',
    xaxis=dict(tickangle=45, tickfont=dict(size=12))
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
In [39]:
authoritative_sources = grouped_ext_sources[grouped_ext_sources['orcid'] > 2]
authoritative_sources
Out[39]:
ext_works_source orcid
0 crossref 1460841
1 scopus - elsevier 902231
2 crossref metadata search 297684
3 multidisciplinary digital publishing institute 281664
4 europe pubmed central 181605
... ... ...
337 uta - oa journal global insight 3
338 francis crick institute 3
339 anna 3
340 santos 3
341 universitäts- und stadtbibliothek köln 3

342 rows × 2 columns

In [40]:
exploded_external_sources['authoritative'] = exploded_external_sources.ext_works_source\
                                            .isin(authoritative_sources['ext_works_source'])
In [41]:
orcid_authoritative_source = exploded_external_sources\
                            .groupby('orcid')['authoritative']\
                            .any()\
                            .reset_index()[['orcid', 'authoritative']]
In [42]:
df = df.set_index('orcid').join(orcid_authoritative_source.set_index('orcid')).reset_index()
In [43]:
df.loc[df.authoritative.isna(), 'authoritative'] = False
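
The steps in cells 36 to 43 can be sketched end to end on toy data. This is only an illustration under assumptions: the profiles are invented and `MIN_PROFILES` is a hypothetical name for the threshold (the notebook keeps sources used by more than 2 profiles).

```python
import pandas as pd

MIN_PROFILES = 2  # hypothetical name; a source must appear on more profiles than this

toy = pd.DataFrame({
    'orcid': ['A', 'B', 'C', 'D', 'E'],
    'ext_works_source': [['crossref'], ['crossref', 'anna'], ['crossref'], ['santos'], []],
})

# One row per (profile, external source) pair; profiles without sources drop out.
exploded = toy[toy['ext_works_source'].str.len() > 0].explode('ext_works_source')

# A source is treated as authoritative when enough distinct profiles use it.
counts = exploded.groupby('ext_works_source')['orcid'].count()
authoritative_names = counts[counts > MIN_PROFILES].index
exploded['authoritative'] = exploded['ext_works_source'].isin(authoritative_names)

# A profile is flagged when at least one of its sources is authoritative;
# profiles with no external source at all default to False.
flag = exploded.groupby('orcid')['authoritative'].any()
toy['authoritative'] = flag.reindex(toy['orcid'], fill_value=False).to_numpy().astype(bool)
print(toy)
```

A purely count-based threshold still admits common personal names that occur on a handful of profiles, so the cut-off is worth tuning.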
In [44]:
df.head()
Out[44]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment ext_works_source n_ext_work_source authoritative
0 0000-0001-6097-3953 False False <NA> <NA> <NA> NaN <NA> NaN NaN NaN NaN 0 NaN 2018-03-02t09:29:16.528z 2018-03-02t09:43:07.551z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> NaN NaN False
1 0000-0001-6112-5550 True True <NA> <NA> <NA> [v.i. yurtaev; v. yurtaev] <NA> NaN NaN NaN [[professor, peoples friendship university of ... 0 NaN 2018-04-03t07:50:23.358z 2020-03-18t09:42:44.753z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> <NA> 1 NaN NaN False
2 0000-0001-6152-2695 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN NaN 0 NaN 2019-12-11t15:31:56.388z 2020-01-28t15:34:17.309z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> NaN NaN False
3 0000-0001-6220-5683 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN [[research scientist, new york university abu ... 0 NaN 2015-08-18t12:36:45.307z 2020-09-23t13:37:54.180z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> <NA> 1 NaN NaN False
4 0000-0001-7071-8294 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN [[researcher (academic), universidad de zarago... 0 NaN 2014-03-10t13:22:01.966z 2016-06-14t22:17:54.470z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> <NA> 2 NaN NaN False

External IDs

External IDs should come from reliable sources. ORCiD registrants cannot add them freely.

In [45]:
df.n_ids.describe()
Out[45]:
count    1.308598e+06
mean     1.359082e+00
std      6.643235e-01
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      2.000000e+00
max      8.000000e+01
Name: n_ids, dtype: float64
In [46]:
df[df.n_ids == df.n_ids.max()]
Out[46]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment ext_works_source n_ext_work_source authoritative
3896226 0000-0002-9554-6633 True True john a williams <NA> NaN <NA> NaN [[scopus author id,  55553733518], [scopus aut... NaN [[, aston university, birmingham, , gb, 1722, ... 92 [aston research explorer] 2014-11-20t09:42:10.690z 2021-03-17t01:00:51.203z 80 0 0 208 True NaN NaN [aston.ac.uk] <NA> 1 80 <NA> <NA> 1 [aston research explorer] 1.0 True
In [47]:
ids = df[['orcid', 'external_ids']].explode('external_ids').reset_index(drop=True)
In [48]:
ids['provider'] = ids[ids.external_ids.notna()]['external_ids'].apply(lambda x: x[0])
In [49]:
ids[ids.provider.notna()].head()
Out[49]:
orcid external_ids provider
9 0000-0001-8315-2066 [researcherid, k-4630-2014] researcherid
29 0000-0002-2638-4108 [scopus author id, 54394231000] scopus author id
46 0000-0003-1435-6545 [researcherid, p-2223-2018] researcherid
50 0000-0003-2259-7023 [scopus author id, 57189297461] scopus author id
64 0000-0002-7397-5824 [scopus author id, 8399842800] scopus author id
In [50]:
top_ids_providers = ids.groupby('provider').count().sort_values('orcid', ascending=False)
In [51]:
data = [
    go.Bar(
        x=top_ids_providers.index,
        y=top_ids_providers['orcid']
    )
]

layout = go.Layout(
    title='External IDs by provider',
    xaxis=dict(tickangle=45, tickfont=dict(size=12))
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
In [52]:
pd.unique(ids['provider'])
Out[52]:
array([nan, 'researcherid', 'scopus author id', 'loop profile', 'gnd',
       'ciência id', 'researcher name resolver id', 'pitt id',
       'id dialnet', 'isni', 'technical university of denmark cwis',
       'chalmers id', 'scopus author id: ', 'scopus author id:',
       'hkust profile', 'hku researcherpage', '中国科学家在线', 'uow scholars',
       'sciprofile', 'cti vitae', 'digital author id', 'researcher id',
       'authenticusid', 'authid', 'authenticus', 'scopus  id',
       'digital author id (dai)', 'researcherid:', 'vivo cornell',
       'us epa vivo', 'escientist', 'github', 'iauthor', 'orcid id',
       'dai', 'scopus id', 'smithsonian profiles', 'google scholar',
       'kaken', 'dialnet id', 'researcherid: ', 'une researcher id',
       'sciprofiles', 'id dialnet:', 'scienceopen', 'orcid',
       'profile system identifier', 'custom'], dtype=object)
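
Several provider labels above differ only in whitespace or a trailing colon (for example 'scopus author id', 'scopus author id:' and 'scopus author id: '). A minimal normalization sketch follows; `normalize_provider` is a hypothetical helper, not part of the notebook.

```python
import re

import pandas as pd


def normalize_provider(label):
    """Lower-case, collapse repeated whitespace and drop a trailing colon."""
    if pd.isna(label):
        return label
    label = re.sub(r'\s+', ' ', label.strip().lower())
    return label.rstrip(': ')


providers = pd.Series([
    'scopus author id', 'scopus author id: ', 'scopus author id:',
    'scopus  id', 'researcherid:', 'researcherid', None,
])
normalized = providers.map(normalize_provider)
print(normalized.value_counts())
```

This only removes formatting noise. Labels that are likely synonyms, such as 'researcher id' and 'researcherid' or 'id dialnet' and 'dialnet id', would still need an explicit mapping.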

Keywords

This field is problematic: users can pack several keywords into a single entry instead of adding them as separate keywords. Consider the following:

In [53]:
keywords_by_orcid = df[['orcid', 'n_keywords']].sort_values('n_keywords', ascending=False)
keywords_by_orcid
Out[53]:
orcid n_keywords
3751714 0000-0002-0673-0341 154
8697926 0000-0003-3343-5660 148
1154523 0000-0002-6075-3501 140
6512971 0000-0002-7060-4112 140
1515197 0000-0001-5287-1949 132
... ... ...
10989644 0000-0002-1686-1935 <NA>
10989645 0000-0002-3800-6331 <NA>
10989646 0000-0002-8783-5814 <NA>
10989647 0000-0002-7584-2283 <NA>
10989648 0000-0003-0529-3538 <NA>

10989649 rows × 2 columns

In [54]:
set_top_n(100)
data = [
    go.Bar(
        x=keywords_by_orcid[:TOP_N]['orcid'],
        y=keywords_by_orcid[:TOP_N]['n_keywords']
    )
]

layout = go.Layout(
    title='Keywords provided by ORCiD',
    xaxis=dict(tickangle=45, tickfont=dict(size=12), range=TOP_RANGE)
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
In [55]:
top_keywords = df[['orcid', 'keywords']]\
                .explode('keywords')\
                .reset_index(drop=True)\
                .groupby('keywords')\
                .count()\
                .sort_values('orcid', ascending=False)
In [56]:
set_top_n(50)
data = [
    go.Bar(
        x=top_keywords[:TOP_N].index,
        y=top_keywords[:TOP_N]['orcid']
    )
]

layout = go.Layout(
    title='Top-%s keywords occurrence' % TOP_N,
    xaxis=dict(tickangle=45, tickfont=dict(size=12))
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
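
Since some profiles pack several keywords into one entry, a possible clean-up is to split entries on common separators before counting. The sketch below assumes that `;` and `,` are the separators and that `keywords` holds Python lists; both are assumptions, and `split_keywords` is a hypothetical helper.

```python
import re

import pandas as pd

SEPARATORS = re.compile(r'\s*[;,]\s*')  # assumed delimiters, adjust after inspecting the data


def split_keywords(keywords):
    """Split every entry on the assumed separators and drop empty pieces."""
    if not isinstance(keywords, list):
        return keywords  # leave missing values untouched
    pieces = []
    for entry in keywords:
        pieces.extend(p for p in SEPARATORS.split(entry.strip()) if p)
    return pieces


toy = pd.DataFrame({
    'orcid': ['A', 'B', 'C'],
    'keywords': [['economic geology'],
                 ['tradução; língua espanhol; língua portuguesa'],
                 None],
})
toy['keywords_split'] = toy['keywords'].apply(split_keywords)
toy['n_keywords_split'] = toy['keywords_split'].str.len()
print(toy[['orcid', 'n_keywords_split']])
```

Splitting on commas can break keywords that legitimately contain one, so the separator set should be checked against real entries first.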

Education

In [120]:
df.n_education.describe()
Out[120]:
count    1.753340e+06
mean     1.913072e+00
std      1.197388e+00
min      1.000000e+00
25%      1.000000e+00
50%      2.000000e+00
75%      3.000000e+00
max      2.000000e+02
Name: n_education, dtype: float64
In [121]:
df[df.n_education == df.n_education.max()]
Out[121]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment ext_works_source n_ext_work_source authoritative spam_score n_valid_employment
2568539 0000-0002-1927-0292 True True phd. carmen m galvez-sánchez my name is carmen maria galvez sánchez. i´m a ... NaN <NA> [qualitative research, fibromyalgia, quantitat... [[loop profile, 509331], [scopus author id, 57... [[psychology, 2019-2020 course. degree in psyc... [[researcher and teaching staff. postdoctoral ... 35 [phd. carmen m galvez-sánchez, multidisciplina... 2016-04-18t14:28:57.237z 2021-03-06t14:17:33.246z 24 0 0 7 True NaN NaN NaN <NA> <NA> 2 5 200 3 [multidisciplinary digital publishing institut... 4.0 True 0.999948 1
In [57]:
exploded_education = df[['orcid', 'education']].explode('education').dropna()
exploded_education
Out[57]:
orcid education
28 0000-0002-2343-910X [aeronautics and astronautics, phd, massachuse...
28 0000-0002-2343-910X [aeronautics and astronautics, sm, massachuset...
28 0000-0002-2343-910X [mechanical engineering and material science, ...
29 0000-0002-2638-4108 [public law, ph doctor, university of oviedo, ...
46 0000-0003-1435-6545 [morfologia, , universidade estadual paulista ...
... ... ...
10989644 0000-0002-1686-1935 [, , south china agricultural university, guan...
10989645 0000-0002-3800-6331 [richard gilder graduate school, phd in compar...
10989645 0000-0002-3800-6331 [geological sciences and history (dual major),...
10989647 0000-0002-7584-2283 [school of electronics and information, master...
10989647 0000-0002-7584-2283 [ department of electrical engineering, bachel...

4434439 rows × 2 columns

In [58]:
exploded_education[['degree', 'role', 'university', 'city', 'region', 'country', 'id', 'id_scheme']] = pd.DataFrame(exploded_education.education.tolist(), index=exploded_education.index)
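
The expansion above relies on every education record having exactly eight fields. A small sketch on invented records shows the same step with an explicit length check; the column names are the ones used in the notebook, everything else is toy data.

```python
import pandas as pd

# Column names as used in the notebook for one education record.
FIELDS = ['degree', 'role', 'university', 'city', 'region', 'country', 'id', 'id_scheme']

toy = pd.DataFrame({
    'orcid': ['A', 'B'],
    'education': [
        [['physics', 'phd', 'univ x', 'rome', '', 'it', '1234', 'ringgold'],
         ['physics', 'msc', 'univ x', 'rome', '', 'it', '', '']],
        [['biology', 'phd', 'univ y', 'lyon', '', 'fr', '', '']],
    ],
})

# One row per education record; a fresh 0..n-1 index keeps the alignment simple.
exploded = toy.explode('education').dropna(subset=['education']).reset_index(drop=True)

# Guard against malformed records before expanding them into columns.
lengths = exploded['education'].str.len()
assert (lengths == len(FIELDS)).all(), 'unexpected number of fields in a record'

fields = pd.DataFrame(exploded['education'].tolist(), columns=FIELDS)
expanded = pd.concat([exploded[['orcid']], fields], axis=1)
print(expanded[['orcid', 'university', 'id']])
```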
In [130]:
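# Assign the result back: an in-place replace on an attribute-accessed column may act on a copy.
exploded_education['id'] = exploded_education['id'].replace('', pd.NA)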
In [132]:
exploded_education.groupby('orcid').id.count().reset_index()
Out[132]:
orcid id
0 0000-0001-5000-0162 3
1 0000-0001-5000-0170 2
2 0000-0001-5000-0218 3
3 0000-0001-5000-0226 1
4 0000-0001-5000-0306 0
... ... ...
2441640 0000-0003-4999-9719 1
2441641 0000-0003-4999-9735 1
2441642 0000-0003-4999-992X 2
2441643 0000-0003-4999-9938 2
2441644 0000-0003-4999-9954 1

2441645 rows × 2 columns

In [133]:
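# NOTE: merge defaults to an inner join, so profiles without any education entry are dropped from df here.
df = df.merge(exploded_education.groupby('orcid').id.count().reset_index(), on='orcid')
df.rename(columns={'id': 'n_valid_education'}, inplace=True)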
In [134]:
df[df.n_education != df.n_valid_education]
Out[134]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment ext_works_source n_ext_work_source authoritative spam_score n_valid_employment n_valid_education
2 0000-0003-1435-6545 True True <NA> <NA> <NA> NaN <NA> [prostate cancer, migration, culture cell] researcherid, p-2223-2018 [[morfologia, , universidade estadual paulista... [[, universidade estadual paulista (unesp), in... 0 NaN 2018-08-09t12:12:24.405z 2020-04-22t01:38:03.184z 0 0 0 0 False NaN NaN [cnpq.br, linkedin.com] <NA> 2 1 3 1 1 NaN NaN False NaN 0 0
6 0000-0002-0427-9745 True True a. can inci i am a professor of finance at bryant universi... NaN <NA> NaN [[researcherid, b-5471-2018], [scopus author i... [[finance, ph.d., university of michigan - ros... [[professor of finance, bryant university, smi... 34 [a. can inci] 2018-01-20t02:58:05.199z 2020-06-16t12:35:09.403z 0 0 0 0 False NaN NaN NaN <NA> <NA> 2 <NA> 4 5 [] 0.0 False 4.341588e-10 0 0
9 0000-0002-3380-6671 True True abdul asis pata <NA> NaN <NA> NaN NaN [[agribisnis, m.si, universitas hasanuddin, ma... [[s.p, universitas muslim maros, maros, , id, ... 0 NaN 2018-02-12t02:08:37.018z 2018-02-12t02:22:33.378z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> 1 1 NaN NaN False NaN 0 0
11 0000-0001-6902-6549 True True abubakar muhammad <NA> NaN <NA> NaN NaN [[school of electrical and information enginee... [[lecturer, university of faisalabad, faisalab... 1 [multidisciplinary digital publishing institute] 2017-07-06t10:29:17.738z 2020-08-01t05:18:53.393z 1 0 0 0 True NaN NaN NaN <NA> <NA> <NA> <NA> 1 1 [multidisciplinary digital publishing institute] 1.0 True NaN 0 0
12 0000-0002-6142-6406 True True adam mamadou <NA> NaN <NA> NaN NaN [[département deconomie sociologie rurale et t... [[, institut national de la recherche agronomi... 0 NaN 2018-02-15t09:54:59.943z 2018-02-15t10:19:27.869z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> 1 1 NaN NaN False NaN 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1753316 0000-0002-1842-4130 True True josé de jesús cázares-marinero <NA> [josé cázares] <NA> [chemical biology, industrial chemistry, biote... [[researcherid, h-2597-2013], [scopus author i... [[charles friedel, postdoc, école nationale su... [[mtc, polioles, mexico, , mx, , ], [head of r... 17 [crossref metadata search, scopus - elsevier, ... 2013-07-09t14:39:30.950z 2020-12-10t17:42:20.176z 17 0 0 29 False NaN NaN [linkedin.com, google.com, researchgate.net] <NA> 3 2 5 3 3 [crossref metadata search, scopus - elsevier] 2.0 True NaN 0 0
1753319 0000-0003-0459-4822 True True luana <NA> mestranda em tecnologia na saúde e foi aluna o... [luana bastos morey] <NA> [tradução; língua espanhol; língua portuguesa;... NaN [[pós-graduação em tecnologia em saúde stricto... [[professora de espanhol e português para estr... 7 [luana arrial bastos] 2017-05-11t13:14:59.372z 2020-12-08t20:18:24.163z 0 0 0 0 False NaN NaN [unidospelasaude.com.br, facebook.com, faceboo... <NA> 4 <NA> 2 4 3 [] 0.0 False 1.000000e+00 2 3
1753320 0000-0003-0057-1551 True True lyudmyla antypenko the phd degree of pharmacy was received under ... [lyudmila nikolaevna antipenko (russian transl... <NA> [pharmaceutical chemistry, organic synthesis, ... [[scopus author id, 55070809900], [researcheri... [[centre for nanomaterials, advanced technolog... [[visiting scientist, north dakota state unive... 35 [crossref metadata search, scopus - elsevier, ... 2014-02-19t08:15:15.698z 2020-12-09t18:14:17.963z 28 0 11 17 True NaN NaN NaN <NA> <NA> 2 5 7 8 [crossref metadata search, scopus - elsevier, ... 4.0 True 1.000000e+00 2 4
1753325 0000-0003-4653-4705 True True patricia teixeira 2005 - phd, university of coimbrajuly 2009-jun... NaN <NA> [ecotoxicology, heavy metals, steroid hormones... [[researcherid, i-6863-2013], [scopus author i... [[, phd, university of coimbra, coimbra, , pt,... [[senior researcher, university of coimbra, co... 95 [ciênciavitae, scopus - elsevier, pg cardoso, ... 2013-11-26t10:59:34.331z 2020-12-02t15:28:26.221z 90 0 0 42 False NaN NaN NaN <NA> <NA> 3 7 1 3 [ciênciavitae, scopus - elsevier, pg cardoso, ... 4.0 True 7.147059e-10 3 0
1753337 0000-0002-1686-1935 True True youxia wang youxia wang (1995-), native of zunyi, guizhou ... NaN <NA> NaN NaN [[institute of animal nutrition, master degree... [[master, sichuan agricultural university , ch... 0 NaN 2020-12-11t02:11:51.808z 2020-12-11t03:25:28.263z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> 2 1 NaN NaN False 4.475163e-02 1 1

473043 rows × 36 columns

Employment

In [116]:
df.n_employment.describe()
Out[116]:
count    2.680488e+06
mean     1.664713e+00
std      1.530077e+00
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      2.000000e+00
max      1.980000e+02
Name: n_employment, dtype: float64
In [119]:
df[df.n_employment == df.n_employment.max()]
Out[119]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment ext_works_source n_ext_work_source authoritative spam_score n_valid_employment
2020738 0000-0002-0293-964X True True ben zhong tang <NA> [唐本忠] tangbenz@ust.hk [fluorescent biosensors, light-emitting molecu... [[hkust profile, tang-benzhong], [researcherid... [[department of chemistry and faculty of pharm... [[chair professor, division of biomedical engi... 422 [tang, benzhong, crossref] 2015-03-13t00:28:33.270z 2021-03-23t07:56:34.824z 359 0 0 0 False ust.hk NaN [ust.hk] <NA> 1 3 7 7 198 [crossref] 1.0 True NaN 32

Let's count, for each ORCID iD, how many employment entries have a valid organization identifier (Ringgold, ISNI, GRID, etc.).

In [60]:
exploded_employment = df[['orcid', 'employment']].explode('employment').dropna()
exploded_employment
Out[60]:
orcid employment
1 0000-0001-6112-5550 [professor, peoples friendship university of r...
3 0000-0001-6220-5683 [research scientist, new york university abu d...
4 0000-0001-7071-8294 [researcher (academic), universidad de zaragoz...
4 0000-0001-7071-8294 [researcher (academic), instituto de síntesis ...
6 0000-0001-7402-0096 [, kth royal institute of technology, stockhol...
... ... ...
10989643 0000-0003-2606-0936 [post-doc, institute of biochemistry and cell ...
10989644 0000-0002-1686-1935 [master, sichuan agricultural university , che...
10989645 0000-0002-3800-6331 [assistant professor, baruch college, city uni...
10989645 0000-0002-3800-6331 [postdoctoral scholar, university of californi...
10989647 0000-0002-7584-2283 [lecturer, henan institute of science and tech...

4462243 rows × 2 columns

In [61]:
exploded_employment[['role', 'institution', 'city', 'region', 'country', 'id', 'id_scheme']] = pd.DataFrame(exploded_employment.employment.tolist(), index=exploded_employment.index)
In [83]:
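# Assign the result back: an in-place replace on an attribute-accessed column may act on a copy.
exploded_employment['id'] = exploded_employment['id'].replace('', pd.NA)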
In [105]:
exploded_employment.groupby('orcid').id.count().reset_index()
Out[105]:
orcid id
0 0000-0001-5000-0031 1
1 0000-0001-5000-0138 1
2 0000-0001-5000-0170 2
3 0000-0001-5000-0218 1
4 0000-0001-5000-0226 1
... ... ...
2680483 0000-0003-4999-9831 1
2680484 0000-0003-4999-9890 1
2680485 0000-0003-4999-992X 0
2680486 0000-0003-4999-9938 1
2680487 0000-0003-4999-9954 2

2680488 rows × 2 columns

In [106]:
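# NOTE: merge defaults to an inner join, so profiles without any employment entry are dropped from df here.
df = df.merge(exploded_employment.groupby('orcid').id.count().reset_index(), on='orcid')
df.rename(columns={'id': 'n_valid_employment'}, inplace=True)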
Out[106]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment ext_works_source n_ext_work_source authoritative spam_score n_valid_employment
0 0000-0001-6112-5550 True True <NA> <NA> <NA> [v.i. yurtaev; v. yurtaev] <NA> NaN NaN NaN [[professor, peoples friendship university of ... 0 NaN 2018-04-03t07:50:23.358z 2020-03-18t09:42:44.753z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> <NA> 1 NaN NaN False NaN 1
1 0000-0001-6220-5683 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN [[research scientist, new york university abu ... 0 NaN 2015-08-18t12:36:45.307z 2020-09-23t13:37:54.180z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> <NA> 1 NaN NaN False NaN 0
2 0000-0001-7071-8294 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN [[researcher (academic), universidad de zarago... 0 NaN 2014-03-10t13:22:01.966z 2016-06-14t22:17:54.470z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> <NA> 2 NaN NaN False NaN 1
3 0000-0001-7402-0096 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN [[, kth royal institute of technology, stockho... 0 NaN 2015-01-11t15:13:06.467z 2016-06-14t23:55:59.896z 0 0 0 0 False NaN NaN [kth.se] <NA> 1 <NA> <NA> <NA> 1 NaN NaN False NaN 0
4 0000-0001-8315-2066 True True <NA> <NA> <NA> NaN <NA> [iron chlorosis, fertilizers, calcareous soil,... researcherid, k-4630-2014 NaN [[, universidad de córdoba, córdoba, andalucía... 0 NaN 2014-05-26t08:57:12.661z 2019-03-27t07:53:48.987z 0 0 0 0 False NaN NaN NaN <NA> <NA> 1 4 <NA> 1 NaN NaN False NaN 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2680483 0000-0002-8004-688X True True paul wanjala muyoma <NA> [wanjala] <NA> [environment and sustainability] NaN NaN [[graduate teaching assistant, university of p... 0 NaN 2016-03-07t08:53:06.561z 2020-12-02t02:14:50.213z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> 1 <NA> 2 NaN NaN False NaN 2
2680484 0000-0003-2606-0936 True True luang xu <NA> [xu lu-ang, lu lu] <NA> NaN NaN NaN [[post-doc, institute of biochemistry and cell... 2 [scopus - elsevier, crossref] 2015-10-24t03:53:23.544z 2020-11-19t09:23:48.896z 2 0 0 1 True NaN NaN NaN <NA> <NA> <NA> <NA> <NA> 1 [scopus - elsevier, crossref] 2.0 True NaN 1
2680485 0000-0002-1686-1935 True True youxia wang youxia wang (1995-), native of zunyi, guizhou ... NaN <NA> NaN NaN [[institute of animal nutrition, master degree... [[master, sichuan agricultural university , ch... 0 NaN 2020-12-11t02:11:51.808z 2020-12-11t03:25:28.263z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> 2 1 NaN NaN False 0.044752 1
2680486 0000-0002-3800-6331 True True zachary calamari <NA> NaN <NA> NaN NaN [[richard gilder graduate school, phd in compa... [[assistant professor, baruch college, city un... 7 [crossref metadata search, zachary t. calamari... 2015-01-20t20:20:17.042z 2020-11-21t19:48:36.221z 7 0 1 0 True NaN NaN NaN <NA> <NA> <NA> <NA> 2 2 [crossref metadata search, crossref] 2.0 True NaN 0
2680487 0000-0002-7584-2283 True True 现刚 <NA> [zuo xiangang, xiangang zuo, zuo x g, x g zuo] <NA> NaN NaN [[school of electronics and information, maste... [[lecturer, henan institute of science and tec... 0 NaN 2016-12-27t07:45:25.073z 2020-11-29t13:06:17.582z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> 2 1 NaN NaN False NaN 1

2680488 rows × 35 columns

In [115]:
df[df.n_employment != df.n_valid_employment]
Out[115]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment ext_works_source n_ext_work_source authoritative spam_score n_valid_employment
1 0000-0001-6220-5683 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN [[research scientist, new york university abu ... 0 NaN 2015-08-18t12:36:45.307z 2020-09-23t13:37:54.180z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> <NA> 1 NaN NaN False NaN 0
2 0000-0001-7071-8294 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN [[researcher (academic), universidad de zarago... 0 NaN 2014-03-10t13:22:01.966z 2016-06-14t22:17:54.470z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> <NA> 2 NaN NaN False NaN 1
3 0000-0001-7402-0096 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN [[, kth royal institute of technology, stockho... 0 NaN 2015-01-11t15:13:06.467z 2016-06-14t23:55:59.896z 0 0 0 0 False NaN NaN [kth.se] <NA> 1 <NA> <NA> <NA> 1 NaN NaN False NaN 0
5 0000-0001-8377-3508 True True <NA> <NA> <NA> [fontana, milena da silva] <NA> [educação; informática; matemática.] NaN NaN [[, instituto federal de educação, ciência e t... 0 NaN 2018-05-23t23:39:04.534z 2019-10-16t02:50:11.007z 0 0 0 0 False NaN NaN [cnpq.br] <NA> 1 <NA> 1 <NA> 3 NaN NaN False NaN 0
8 0000-0002-6508-6998 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN [[researcher (academic), universidad de zarago... 0 NaN 2014-03-12t08:23:22.492z 2015-07-27t15:51:38.411z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> <NA> 2 NaN NaN False NaN 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2680476 0000-0001-9133-2366 True True søren staugaard <NA> NaN <NA> NaN NaN [[, , aarhus universitet, aarhus, , dk, 1006, ... [[, aarhus university, aarhus c, , dk, , ], [s... 29 [aarhus university, crossref] 2013-03-19t11:34:48.477z 2020-12-07t08:03:23.190z 14 0 10 35 True NaN NaN [au.dk, au.dk] <NA> 2 <NA> <NA> 1 3 [aarhus university, crossref] 2.0 True NaN 1
2680477 0000-0001-8494-2123 True True tarun jain <NA> NaN <NA> [pet/ct specialist; nuclear medicine physician... NaN NaN [[assistant professor, mahatma gandhi medical ... 0 NaN 2014-12-19t08:21:46.292z 2020-12-09t06:03:57.055z 0 0 0 0 False NaN NaN NaN <NA> <NA> <NA> 1 <NA> 5 NaN NaN False NaN 4
2680479 0000-0002-2906-0299 True True tiffany mackay <NA> [tiffany russel sia] <NA> [microfluidics, gpc-1, gallium-67, pet/ct, oxy... researcherid, a-2121-2017 [[faculty of medicine, master in pharmaceutica... [[clinical project lead, minomic international... 11 [crossref, researcherid, tiffany mackay] 2017-01-03t23:28:48.736z 2020-12-09t17:12:20.326z 11 0 0 0 True NaN NaN [oxytocin.com.au, linkedin.com] <NA> 2 1 13 2 4 [crossref, researcherid] 2.0 True NaN 1
2680481 0000-0002-4422-4036 True True vijay krishnan <NA> NaN <NA> NaN NaN [[psychiatry, md, all india institute of medic... [[assistant professor, all india institute of ... 2 [crossref] 2015-05-28t17:24:39.519z 2020-11-24t08:57:22.875z 2 0 0 0 False NaN NaN NaN <NA> <NA> <NA> <NA> 2 5 [crossref] 1.0 True NaN 3
2680486 0000-0002-3800-6331 True True zachary calamari <NA> NaN <NA> NaN NaN [[richard gilder graduate school, phd in compa... [[assistant professor, baruch college, city un... 7 [crossref metadata search, zachary t. calamari... 2015-01-20t20:20:17.042z 2020-11-21t19:48:36.221z 7 0 1 0 True NaN NaN NaN <NA> <NA> <NA> <NA> 2 2 [crossref metadata search, crossref] 2.0 True NaN 0

1036967 rows × 35 columns

Biography

In [63]:
df['biography'] = df['biography'].replace('', np.nan)  # treat empty biographies as missing
In [64]:
df.biography.describe()
Out[64]:
count                                                354015
unique                                               337007
top       car title loans are a more straightforward way...
freq                                                    343
Name: biography, dtype: object
In [65]:
df[df.biography.str.contains('car title loans are a more straightforward', regex=False, na=False)]
Out[65]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment ext_works_source n_ext_work_source authoritative
51306 0000-0002-7397-7977 True True premium car title loans car title loans are a more straightforward way... [premium car title loans] <NA> [car title loan upland] NaN NaN NaN 0 NaN 2020-11-06t06:10:20.070z 2020-11-06t06:24:28.005z 0 0 0 0 False NaN NaN [premiumcartitleloans.com] <NA> 1 <NA> 1 <NA> <NA> NaN NaN False
51307 0000-0003-4931-9736 True True premium car title loans car title loans are a more straightforward way... [premium car title loans] <NA> [car title loan saratoga] NaN NaN NaN 0 NaN 2020-11-13t01:04:19.859z 2020-11-13t01:15:12.546z 0 0 0 0 False NaN NaN [premiumcartitleloans.com] <NA> 1 <NA> 1 <NA> <NA> NaN NaN False
106024 0000-0001-8221-2303 True True premium car title loans car title loans are a more straightforward way... [premium car title loans] <NA> [car title loan victorville] NaN NaN NaN 0 NaN 2020-11-05t00:38:21.096z 2020-11-05t00:40:40.091z 0 0 0 0 False NaN NaN [premiumcartitleloans.com] <NA> 1 <NA> 1 <NA> <NA> NaN NaN False
108770 0000-0001-6736-072X True True premium car title loans car title loans are a more straightforward way... NaN <NA> NaN NaN NaN NaN 0 NaN 2020-12-08t05:38:30.786z 2020-12-08t05:40:03.786z 0 0 0 0 False NaN NaN [premiumcartitleloans.com] <NA> 1 <NA> <NA> <NA> <NA> NaN NaN False
108771 0000-0002-8727-1246 True True premium car title loans car title loans are a more straightforward way... [loan agency] <NA> [refinance car title loan, title loan on car, ... NaN NaN NaN 0 NaN 2020-12-10t08:54:56.127z 2020-12-10t08:57:15.791z 0 0 0 0 False NaN NaN [premiumcartitleloans.com] <NA> 1 <NA> 4 <NA> <NA> NaN NaN False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10875416 0000-0002-9640-8136 True True premium car title loans car title loans are a more straightforward way... [premium car title loans] <NA> [car title loan clovis] NaN NaN NaN 0 NaN 2020-10-22t06:11:02.945z 2020-10-22t06:17:09.111z 0 0 0 0 False NaN NaN [premiumcartitleloans.com] <NA> 1 <NA> 1 <NA> <NA> NaN NaN False
10878239 0000-0002-6926-3752 True True premium car title loans car title loans are a more straightforward way... [premium car title loans] <NA> [car title loan escondido] NaN NaN NaN 0 NaN 2020-12-03t02:00:33.684z 2020-12-03t02:02:07.054z 0 0 0 0 False NaN NaN [premiumcartitleloans.com] <NA> 1 <NA> 1 <NA> <NA> NaN NaN False
10933380 0000-0002-3655-4713 True True premium car title loans car title loans are a more straightforward way... [premium car title loans] <NA> [car title loan san rafael] NaN NaN NaN 0 NaN 2020-11-18t00:39:17.492z 2020-11-18t00:52:19.024z 0 0 0 0 False NaN NaN [premiumcartitleloans.com] <NA> 1 <NA> 1 <NA> <NA> NaN NaN False
10933381 0000-0002-8724-1020 True True premium car title loans car title loans are a more straightforward way... [premium car title loans] <NA> [car title loan san juan capistrano] NaN NaN NaN 0 NaN 2020-11-19t00:31:54.080z 2020-11-19t00:34:08.721z 0 0 0 0 False NaN NaN [premiumcartitleloans.com] <NA> 1 <NA> 1 <NA> <NA> NaN NaN False
10985986 0000-0002-4601-4569 True True premium car title loans car title loans are a more straightforward way... [premium car title loans] <NA> [car title loan mount pleasant] NaN NaN NaN 0 NaN 2020-10-16t00:32:26.207z 2020-10-16t00:37:42.646z 0 0 0 0 False NaN NaN [premiumcartitleloans.com] <NA> 1 <NA> 1 <NA> <NA> NaN NaN False

421 rows × 33 columns
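Note that `describe()` reports a frequency of 343 for the top biography, while the substring filter returns 421 rows. One plausible explanation is that `freq` counts exact duplicates only, whereas `str.contains` also matches variants that share the same opening sentence. A minimal sketch of that difference on made-up data (the biographies below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy biographies: two exact duplicates, one variant with the same opening,
# one unrelated biography, and one missing value.
bios = pd.Series([
    "car title loans are a more straightforward way to borrow",
    "car title loans are a more straightforward way to borrow",
    "car title loans are a more straightforward way to get cash",
    "professor of physics",
    None,
])

# Exact-match frequency of the most common value (what describe() calls `freq`).
exact = bios.value_counts().iloc[0]

# Substring match: also catches variants sharing the same opening.
prefix = "car title loans are a more straightforward"
substring = bios.str.contains(prefix, regex=False, na=False).sum()

print(exact, substring)
```

Here the substring count is larger than the exact-match count, which mirrors the gap seen above.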

In [66]:
def score(bio):
    """Return the antispam score of a biography, or -1 if it cannot be scored."""
    try:
        return antispam.score(bio)
    except Exception:  # the filter fails on some inputs, e.g. when len(bio) < 3
        return -1
In [67]:
df['spam_score'] = df.loc[df.biography.notna(), 'biography'].apply(score)
In [68]:
df[df.spam_score == -1][['orcid','biography']]
Out[68]:
          orcid                biography
25505     0000-0003-0505-2734  j
138487    0000-0002-3417-7299  .....
139595    0000-0003-3794-1288  m.d., ph.d.
193340    0000-0001-9655-4806  肿瘤
194990    0000-0002-9149-0142  be y
...       ...                  ...
10927866  0000-0002-7341-5480  ph.d.
10976080  0000-0003-4041-0840  /
10976689  0000-0002-4285-8537
10976922  0000-0002-1545-8773  hi
10987379  0000-0002-6302-4224  .

348 rows × 2 columns

In [69]:
df['spam_score'] = df['spam_score'].replace(-1, np.nan)
In [70]:
df.spam_score.describe()
Out[70]:
count    3.536670e+05
mean     6.098044e-01
std      4.476618e-01
min      1.917500e-22
25%      1.858235e-02
50%      9.529688e-01
75%      9.999992e-01
max      1.000000e+00
Name: spam_score, dtype: float64
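The scoring steps above (score each biography, flag failures, convert them to missing values) can be sketched in a single pass. This is only an illustration under stated assumptions: `toy_scorer` is a hypothetical stand-in for `antispam.score`, and `safe_score` is a hypothetical helper name, so the example runs without the `antispam` package.

```python
import numpy as np
import pandas as pd

def toy_scorer(text):
    """Hypothetical stand-in for antispam.score: raises on very short input."""
    if len(text) < 3:
        raise ValueError("text too short to score")
    return 1.0 if "loans" in text else 0.1

def safe_score(text, scorer=toy_scorer):
    """Return the scorer's value, or NaN when the scorer raises."""
    try:
        return scorer(text)
    except Exception:
        return np.nan

bios = pd.Series(["car title loans are great", "j", None, "professor of physics"])

# Score only the non-missing biographies, then realign to the full index
# so that missing biographies also end up with a NaN score.
scores = bios.dropna().apply(safe_score).reindex(bios.index)
print(scores)
```

Returning NaN directly skips the intermediate `-1` sentinel. The sentinel is still useful when the failing rows need to be inspected, as done above.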
In [71]:
df[df.spam_score > 0.9999][['biography', 'spam_score']]
Out[71]:
          biography                                          spam_score
29        investigador de la universidad de oviedo. depa...  1.000000
83        formación académica en la temática de manejo d...  1.000000
217       doctor en educación, maestro en gerencia de la...  1.000000
222       possui graduação em psicologia pela pontifícia...  1.000000
470       roofing contractors in seattle waroofing contr...  1.000000
...       ...                                                ...
10989593  jose ignacio peláez sánchez ha sido profesor e...  0.999966
10989603  mestranda em tecnologia na saúde e foi aluna o...  1.000000
10989605  the phd degree of pharmacy was received under ...  1.000000
10989615  mostafa metwaly is an assistant lecturer at th...  1.000000
10989617  jual obat aborsi di tangerang, obat penggugur ...  0.999999

120733 rows × 2 columns

TODO: detect offensive words and sexually explicit content

All vs. all correlation

In [136]:
fig = px.imshow(df.select_dtypes(include=['bool','number']).fillna(-1).corr())
fig.show()
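One caveat about the heatmap above: filling missing values with `-1` before calling `corr()` turns "missing" into a numeric value, which can shift the coefficients. A minimal sketch on made-up data compares that approach with pandas' default behaviour, which computes each coefficient on pairwise-complete observations. The toy frame reuses three column names from the dataset, but its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: n_urls is missing for half of the rows.
toy = pd.DataFrame({
    "n_works": [0, 2, 7, 11, 0, 3],
    "n_urls": [np.nan, 1, np.nan, 2, np.nan, 1],
    "label": [False, True, True, True, False, True],
})

# Keep boolean and numeric columns, cast to float for a uniform dtype.
numeric = toy.select_dtypes(include=["bool", "number"]).astype(float)

# Approach used above: missing values become -1 and enter the computation.
corr_filled = numeric.fillna(-1).corr()

# Alternative: leave NaN in place; corr() then uses, for each pair of
# columns, only the rows where both values are present.
corr_pairwise = numeric.corr()

print(corr_filled.loc["n_works", "n_urls"])
print(corr_pairwise.loc["n_works", "n_urls"])
```

The two coefficients differ noticeably on this toy frame. Which treatment is preferable depends on whether a missing value is itself informative for the label.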
In [74]:
# df[['verified_email', 
#     'verified_primary_email', 
#     'n_works', 
#     'n_doi',
#     'n_arxiv', 
#     'n_pmc', 
#     'n_other_pids', 
#     'n_emails', 
#     'n_urls', 
#     'n_ids', 
#     'n_keywords', 
#     'n_employment', 
#     'n_education', 
#     'label']].to_pickle('../data/processed/features.pkl')

Label speculation

In [75]:
df[df.label == 1]
Out[75]:
orcid verified_email verified_primary_email given_names family_name biography other_names primary_email keywords external_ids education employment n_works works_source activation_date last_update_date n_doi n_arxiv n_pmc n_other_pids label primary_email_domain other_email_domains url_domains n_emails n_urls n_ids n_keywords n_education n_employment ext_works_source n_ext_work_source authoritative spam_score
17 0000-0002-0137-3066 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN NaN 0 NaN 2017-07-25t04:34:17.338z 2019-11-27t17:54:45.418z 0 0 0 0 True NaN NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> NaN NaN False NaN
19 0000-0002-0461-9711 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN NaN 2 [crossref] 2015-08-18t12:42:01.797z 2019-12-06t11:37:38.203z 2 0 0 0 True NaN NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> NaN NaN False NaN
22 0000-0002-0761-9450 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN NaN 1 [crossref] 2020-05-13t17:15:28.405z 2020-08-11t21:00:45.694z 1 0 0 0 True NaN NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> NaN NaN False NaN
33 0000-0002-4447-9215 True True <NA> <NA> <NA> NaN <NA> NaN NaN NaN NaN 0 NaN 2017-07-24t09:37:50.242z 2019-11-15t08:31:24.820z 0 0 0 0 True NaN NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> NaN NaN False NaN
44 0000-0003-0426-4065 True True <NA> <NA> <NA> [eliza i. gilbert] <NA> NaN NaN NaN [[, us fish and wildlife service, albuquerque,... 0 NaN 2017-08-07t18:32:31.802z 2020-04-08t16:48:55.732z 0 0 0 0 True NaN NaN NaN <NA> <NA> <NA> <NA> <NA> 1 NaN NaN False NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10989635 0000-0002-7340-9697 True True tawanda marandure <NA> NaN <NA> NaN scopus author id, 48261373600 [[animal science, msc sustainable agriculture,... [[lecturer, zimbabwe open university faculty o... 7 [scopus - elsevier] 2015-11-05t08:52:08.743z 2020-12-09t17:59:18.350z 7 0 0 7 True NaN NaN NaN <NA> <NA> 1 <NA> 3 3 [scopus - elsevier] 1.0 True NaN
10989636 0000-0002-2906-0299 True True tiffany mackay <NA> [tiffany russel sia] <NA> [microfluidics, gpc-1, gallium-67, pet/ct, oxy... researcherid, a-2121-2017 [[faculty of medicine, master in pharmaceutica... [[clinical project lead, minomic international... 11 [crossref, researcherid, tiffany mackay] 2017-01-03t23:28:48.736z 2020-12-09t17:12:20.326z 11 0 0 0 True NaN NaN [oxytocin.com.au, linkedin.com] <NA> 2 1 13 2 4 [crossref, researcherid] 2.0 True NaN
10989637 0000-0001-5896-2024 True True giovanni, l tiscia <NA> NaN <NA> NaN scopus author id, 54948242800 NaN NaN 70 [scopus - elsevier, tiscia giovanni, l, europe... 2016-07-27t10:09:13.585z 2020-12-07t22:23:05.706z 65 0 17 52 True NaN NaN NaN <NA> <NA> 1 <NA> <NA> <NA> [scopus - elsevier, europe pubmed central, cro... 3.0 True NaN
10989643 0000-0003-2606-0936 True True luang xu <NA> [xu lu-ang, lu lu] <NA> NaN NaN NaN [[post-doc, institute of biochemistry and cell... 2 [scopus - elsevier, crossref] 2015-10-24t03:53:23.544z 2020-11-19t09:23:48.896z 2 0 0 1 True NaN NaN NaN <NA> <NA> <NA> <NA> <NA> 1 [scopus - elsevier, crossref] 2.0 True NaN
10989645 0000-0002-3800-6331 True True zachary calamari <NA> NaN <NA> NaN NaN [[richard gilder graduate school, phd in compa... [[assistant professor, baruch college, city un... 7 [crossref metadata search, zachary t. calamari... 2015-01-20t20:20:17.042z 2020-11-21t19:48:36.221z 7 0 1 0 True NaN NaN NaN <NA> <NA> <NA> <NA> 2 2 [crossref metadata search, crossref] 2.0 True NaN

2075872 rows × 34 columns

In [76]:
# (df.n_works > 0) & (df.n_ids > 1)
In [77]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10989649 entries, 0 to 10989648
Data columns (total 34 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   orcid                   object 
 1   verified_email          bool   
 2   verified_primary_email  bool   
 3   given_names             string 
 4   family_name             string 
 5   biography               string 
 6   other_names             object 
 7   primary_email           string 
 8   keywords                object 
 9   external_ids            object 
 10  education               object 
 11  employment              object 
 12  n_works                 Int16  
 13  works_source            object 
 14  activation_date         string 
 15  last_update_date        string 
 16  n_doi                   Int16  
 17  n_arxiv                 Int16  
 18  n_pmc                   Int16  
 19  n_other_pids            Int16  
 20  label                   bool   
 21  primary_email_domain    object 
 22  other_email_domains     object 
 23  url_domains             object 
 24  n_emails                Int16  
 25  n_urls                  Int16  
 26  n_ids                   Int16  
 27  n_keywords              Int16  
 28  n_education             Int16  
 29  n_employment            Int16  
 30  ext_works_source        object 
 31  n_ext_work_source       float64
 32  authoritative           object 
 33  spam_score              float64
dtypes: Int16(11), bool(3), float64(2), object(12), string(6)
memory usage: 2.0+ GB