{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Information to check\n", "- names\n", "- description\n", "- url\n", "- subjects & keywords\n", "- content type\n", "- repo type\n", "- policies\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import ast\n", "import csv\n", "import json\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import plotly\n", "from plotly.offline import iplot, init_notebook_mode\n", "import plotly.graph_objs as go\n", "import plotly.express as px\n", "\n", "pd.set_option('display.max_columns', None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading dataset" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " openaire_id re3data_id \\\n", "0 re3data_____::91780fe96da5ba32f804e43359c154ba r3d100000001 \n", "1 re3data_____::cc3ea05c863cd49af75f7f54e0e86f09 r3d100000002 \n", "2 re3data_____::a2f73fbe91311f4356d0d7957c441773 r3d100000004 \n", "3 re3data_____::0394b97eb11f19785cbca1ec830429da r3d100000005 \n", "4 re3data_____::a48f09c562b247a9919acfe195549b47 r3d100000006 \n", "\n", " repository_name \\\n", "0 Odum Institute Archive Dataverse \n", "1 Access to Archival Databases \n", "2 Datenbank Gesprochenes Deutsch \n", "3 UNC Dataverse \n", "4 Archaeology Data Service \n", "\n", " additional_name \\\n", "0 [] \n", "1 [AAD] \n", "2 [DGD, DGD2 (formerly), Database for Spoken Ger... \n", "3 [University of North Carolina Dataverse] \n", "4 [ADS] \n", "\n", " repository_url \\\n", "0 https://dataverse.unc.edu/dataverse/odum \n", "1 https://aad.archives.gov/aad/ \n", "2 https://dgd.ids-mannheim.de/ \n", "3 https://dataverse.unc.edu/ \n", "4 https://archaeologydataservice.ac.uk/ \n", "\n", " repository_id \\\n", "0 [] \n", "1 [RRID:SCR_010479, RRID:nlx_157752] \n", "2 [] \n", "3 [] \n", "4 [FAIRsharing_doi:10.25504/FAIRsharing.hm1mfg] \n", "\n", " description type \\\n", "0 The Odum Institute Archive Dataverse contains ... [disciplinary] \n", "1 You will find in the Access to Archival Databa... [disciplinary] \n", "2 The \"Database for Spoken German (DGD)\" is a co... [disciplinary] \n", "3 UNC Dataverse is an open-source repository sof... [institutional] \n", "4 The ADS is an accredited digital repository fo... [disciplinary] \n", "\n", " size update_date start_date \\\n", "0 13 dataverses; 3.050 datasets 2020-12-04 NaN \n", "1 NaN NaN 1985 \n", "2 34 corpora 2020-02-03 2012 \n", "3 186 dataverses; 25.272 studies; 229.442 files 2020-11-30 2011 \n", "4 1837 results 2020-05-20 1996-10-01 \n", "\n", " end_date subject \\\n", "0 NaN [1 Humanities and Social Sciences, 111 Social ... \n", "1 NaN [1 Humanities and Social Sciences, 102 History... 
\n", "2 NaN [1 Humanities and Social Sciences, 104 Linguis... \n", "3 NaN [1 Humanities and Social Sciences, 111 Social ... \n", "4 NaN [1 Humanities and Social Sciences, 101 Ancient... \n", "\n", " mission_statement content_type \\\n", "0 false [Databases, Plain text, Scientific and statist... \n", "1 true [Images, Standard office documents, Structured... \n", "2 true [Audiovisual data, Standard office documents, ... \n", "3 true [Archived data, Plain text, Raw data, Scientif... \n", "4 true [Archived data, Audiovisual data, Databases, I... \n", "\n", " provider_type \\\n", "0 [dataProvider] \n", "1 [dataProvider] \n", "2 [dataProvider, serviceProvider] \n", "3 [dataProvider, serviceProvider] \n", "4 [dataProvider, serviceProvider] \n", "\n", " keyword \\\n", "0 [FAIR, Middle East, crime, demography, economy... \n", "1 [US History] \n", "2 [Australian German, FOLK, German dialects, Pfe... \n", "3 [FAIR, census, demographic survey, demography,... \n", "4 [FAIR, archaeology, cultural heritage, prehist... \n", "\n", " institution policy database_access \\\n", "0 [[Odum Institute for Research in Social Scienc... true true \n", "1 [[The U.S. National Archives and Records Admin... true true \n", "2 [[Institut für Deutsche Sprache, Archiv für Ge... true true \n", "3 [[Odum Institute for Research in Social Scienc... true true \n", "4 [[Arts and Humanities Research Council, [AHRC]... 
true true \n", "\n", " database_license data_access data_license data_upload data_upload_license \\\n", "0 true true true true false \n", "1 false true true true false \n", "2 false true true true false \n", "3 false true true true true \n", "4 true true true true true \n", "\n", " software versioning api pid_system citation_guideline_url aid_system \\\n", "0 true NaN false true true true \n", "1 true no true true true true \n", "2 true yes false true true true \n", "3 true yes true true true true \n", "4 true yes true true true true \n", "\n", " enhanced_publication quality_management certificate metadata_standard \\\n", "0 unknown yes true true \n", "1 unknown unknown false false \n", "2 unknown unknown true false \n", "3 unknown yes false true \n", "4 unknown yes true true \n", "\n", " syndication remarks entry_date \\\n", "0 false Odum Dataverse is covered by Thomson Reuters D... 2013-06-10 \n", "1 true NaN 2012-07-04 \n", "2 false NaN 2012-07-20 \n", "3 false The Odum Institute houses one of the oldest an... 2012-07-23 \n", "4 true ADS is covered by Clarivate Data Citation Inde... 
2012-07-23 \n", "\n", " last_update \n", "0 2021-07-06 \n", "1 2021-05-25 \n", "2 2020-08-27 \n", "3 2020-11-30 \n", "4 2021-06-11 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# List-valued columns are stored as Python literals in the TSV;\n", "# ast.literal_eval parses them back into lists.\n", "re3data_df = pd.read_csv(\n", "    '../data/raw/re3data.tsv', delimiter='\\t',\n", "    converters={'subject': ast.literal_eval,\n", "                'keyword': ast.literal_eval,\n", "                'additional_name': ast.literal_eval,\n", "                'repository_id': ast.literal_eval,\n", "                'type': ast.literal_eval,\n", "                'content_type': ast.literal_eval,\n", "                'provider_type': ast.literal_eval,\n", "                'institution': ast.literal_eval\n", "                })\n", "re3data_df.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['openaire_id', 're3data_id', 'repository_name', 'additional_name',\n", " 'repository_url', 'repository_id', 'description', 'type', 'size',\n", " 'update_date', 'start_date', 'end_date', 'subject', 'mission_statement',\n", " 'content_type', 'provider_type', 'keyword', 'institution', 'policy',\n", " 'database_access', 'database_license', 'data_access', 'data_license',\n", " 'data_upload', 'data_upload_license', 'software', 'versioning', 'api',\n", " 'pid_system', 'citation_guideline_url', 'aid_system',\n", " 'enhanced_publication', 'quality_management', 'certificate',\n", " 'metadata_standard', 'syndication', 'remarks', 'entry_date',\n", " 'last_update'],\n", " dtype='object')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re3data_df.columns" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def empty_list_is_nan(cell):\n", "    # Treat empty lists as missing values so that isna() counts them\n", "    if isinstance(cell, list) and len(cell) == 0:\n", "        return np.nan\n", "    return cell\n", "\n", "re3data_df = re3data_df.applymap(empty_list_is_nan)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " openaire_id re3data_id \\\n", "count 2707 2707 \n", "unique 2707 2707 \n", "top re3data_____::4cea5a5ea78542232a51190879756661 r3d100011254 \n", "freq 1 1 \n", "\n", " repository_name additional_name repository_url \\\n", "count 2707 2137 2686 \n", "unique 2704 2128 2683 \n", "top EarthChem Library [IRIS] http://www.jcvi.org/cms/home/ \n", "freq 2 2 2 \n", "\n", " repository_id description \\\n", "count 829 2707 \n", "unique 828 2705 \n", "top [doi:10.17171/1-6] The repository is no longer available. >>>!!!<... \n", "freq 2 2 \n", "\n", " type size update_date start_date end_date \\\n", "count 2677 1260 1248 1762 146 \n", "unique 8 1233 687 351 79 \n", "top [disciplinary] 2 datasets 2019-05-15 2008 2015 \n", "freq 1713 6 15 92 11 \n", "\n", " subject mission_statement \\\n", "count 2685 2707 \n", "unique 1367 2 \n", "top [1 Humanities and Social Sciences, 2 Life Scie... true \n", "freq 222 2286 \n", "\n", " content_type provider_type keyword \\\n", "count 2700 2699 2699 \n", "unique 1323 4 2474 \n", "top [Standard office documents] [dataProvider] [multidisciplinary] \n", "freq 30 1748 190 \n", "\n", " institution policy \\\n", "count 2706 2707 \n", "unique 2685 2 \n", "top [[National Center for Biotechnology Informatio... 
true \n", "freq 6 2394 \n", "\n", " database_access database_license data_access data_license data_upload \\\n", "count 2707 2707 2707 2707 2707 \n", "unique 1 2 2 2 2 \n", "top true false true true true \n", "freq 2707 2134 2701 2693 2681 \n", "\n", " data_upload_license software versioning api pid_system \\\n", "count 2707 2707 1292 2707 2707 \n", "unique 2 2 2 2 2 \n", "top false true yes false true \n", "freq 1988 2227 1086 1485 2448 \n", "\n", " citation_guideline_url aid_system enhanced_publication \\\n", "count 2707 2707 2704 \n", "unique 1 1 3 \n", "top true true unknown \n", "freq 2707 2707 1592 \n", "\n", " quality_management certificate metadata_standard syndication \\\n", "count 2705 2707 2707 2707 \n", "unique 3 2 2 2 \n", "top yes false false false \n", "freq 1492 2481 1655 2129 \n", "\n", " remarks entry_date \\\n", "count 1637 2707 \n", "unique 1632 1259 \n", "top The National Institute of Standards and Techno... 2016-05-10 \n", "freq 3 20 \n", "\n", " last_update \n", "count 2707 \n", "unique 814 \n", "top 2021-07-02 \n", "freq 47 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re3data_df.describe(include='all')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "openaire_id 0\n", "re3data_id 0\n", "repository_name 0\n", "additional_name 570\n", "repository_url 21\n", "repository_id 1878\n", "description 0\n", "type 30\n", "size 1447\n", "update_date 1459\n", "start_date 945\n", "end_date 2561\n", "subject 22\n", "mission_statement 0\n", "content_type 7\n", "provider_type 8\n", "keyword 8\n", "institution 1\n", "policy 0\n", "database_access 0\n", "database_license 0\n", "data_access 0\n", "data_license 0\n", "data_upload 0\n", "data_upload_license 0\n", "software 0\n", "versioning 1415\n", "api 0\n", "pid_system 0\n", "citation_guideline_url 0\n", "aid_system 0\n", "enhanced_publication 3\n", "quality_management 2\n", "certificate 0\n", 
"metadata_standard 0\n", "syndication 0\n", "remarks 1070\n", "entry_date 0\n", "last_update 0\n", "dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re3data_df.isna().sum()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Databases', 'Plain text',\n", " 'Scientific and statistical data formats',\n", " 'Standard office documents', 'other', 'Images', 'Structured text',\n", " 'Audiovisual data', 'Archived data', 'Raw data',\n", " 'Software applications', 'Source code', 'Structured graphics',\n", " 'Configuration data', 'Networkbased data', nan], dtype=object)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re3data_df.content_type.explode().unique()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['dataProvider', 'serviceProvider', nan], dtype=object)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re3data_df.provider_type.explode().unique()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }