moved cells

Andrea Mannocci 2023-06-23 07:02:05 +02:00
parent 95e2f4a8ba
commit f676ccc989
1 changed files with 81 additions and 36 deletions

@ -27,9 +27,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This can take some time depending on your network speed.\n",
"This step can take some time depending on your network speed.\n",
"\n",
"Uncomment and run only if you need to downlaod the data the first time: these lines just download the datasets from the deposition on Zenodo containing data for this kit (https://zenodo.org/record/7490192), untar the content and clean up. All the data needed will sit under the `data` folder."
"Uncomment and run only if you need to download the data the first time: these lines just download the datasets from the deposition on Zenodo containing data for this kit (https://zenodo.org/record/7490192), untar the content and clean up. All the data needed will sit under the `data` folder."
]
},
{
@ -59,18 +59,6 @@
"# os.system(f'tar -xf data/{item} -C data/; rm data/{item}')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"import pandas as pd\n",
"import glob\n",
"\n",
"files = sorted(glob.glob('./data/publication/part-*.txt.gz'))\n",
"publications_df = pd.concat(pd.read_json(f) for f in files)"
]
},
{
"attachments": {},
"cell_type": "markdown",
@ -114,7 +102,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If you try to load the data straight into memory, one file would fit..."
"If you try to load the data straight into memory, one `part file` would fit"
]
},
{
@ -132,7 +120,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"But if you try to load everything, even just all the publications, the chances are slim.\n",
"However, if you try to load the whole thing, even just the publications, the chances are slim.\n",
"\n",
"If you try uncommenting and running the following lines, after some time, the kernel will die while trying and restart."
]
@ -152,7 +140,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So, let's see how Spark can help us.\n",
"So, let's see how `Spark` can help us.\n",
"\n",
"First thing first, let's create the Spark session."
]
@ -490,7 +478,7 @@
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show journal information.\n",
"Show information about publishing venues.\n",
"</div>"
]
},
@ -536,6 +524,31 @@
"spark.sql(query).limit(20).toPandas()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Since relations are bi-directional, so we could have used the dual semantic `Cites` and join on target ids. We would get the same results.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT publications.id, pid.value, COUNT(*) AS count\n",
"FROM publications JOIN relations ON publications.id = relations.target.id \n",
"WHERE reltype.name = 'Cites'\n",
"GROUP BY publications.id, pid.value\n",
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
{
"attachments": {},
"cell_type": "markdown",
@ -570,11 +583,12 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show the journals with the highest number of results published in\n",
"Show the journals with the highest number of publications\n",
"</div>"
]
},
@ -600,11 +614,14 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show the number of projects per organization; sort results in descending order; limit to the first 20.\n",
"\n",
"Hint: the `COALESCE` function can be oh help to select over the possible name forms of an organisation (e.g., short and full name). You can specify multiple columns to select and it will return the first column that is not null. \n",
"</div>"
]
},
@ -615,10 +632,11 @@
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT COALESCE(legalshortname, legalname) AS organization, \n",
"SELECT COALESCE(legalshortname, legalname) AS name, \n",
" COUNT(*) AS count \n",
"FROM organizations JOIN relations ON organizations.id = relations.source.id AND reltype.name = 'isParticipant'\n",
"GROUP BY organization \n",
"FROM organizations JOIN relations ON organizations.id = relations.source.id \n",
" AND reltype.name = 'isParticipant'\n",
"GROUP BY name \n",
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
@ -626,13 +644,14 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show projects with the highest number of associated results. \n",
"\n",
"Note: An \"unidentified\" project is a placeholder for all the association to a funder without knowing the specific project. It should be removed from the count.\n",
"Note: An `unidentified` project title is a placeholder for all the associations to a funder without knowing the specific project. It should be removed from the count.\n",
"</div>"
]
},
@ -644,7 +663,9 @@
"source": [
"query = \"\"\"\n",
"SELECT funding.shortName, code, title, COUNT(*) AS count \n",
"FROM projects JOIN relations ON projects.id = relations.source.id AND reltype.name = 'produces' AND not projects.title ilike '%unidentified%' \n",
"FROM projects JOIN relations ON projects.id = relations.source.id \n",
" AND reltype.name = 'produces' \n",
" AND not projects.title ilike '%unidentified%' \n",
"GROUP BY funding.shortName, code, title\n",
"ORDER BY count DESC\n",
"\"\"\"\n",
@ -653,6 +674,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@ -666,11 +688,13 @@
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT CONCAT_WS(' / ', \n",
"SELECT CONCAT_WS(' / ',\n",
" IF(SIZE(funding.shortName) > 0, ARRAY_JOIN(funding.shortName, ',', '-'), '?'), \n",
" COALESCE(code, '?'), \n",
" SUBSTRING(title, 0, 50)) AS project, COUNT(*) AS count \n",
"FROM projects JOIN relations ON projects.id = relations.source.id AND reltype.name = 'produces' AND NOT projects.title ilike '%unidentified%' \n",
"FROM projects JOIN relations ON projects.id = relations.source.id \n",
" AND reltype.name = 'produces' \n",
" AND NOT projects.title ilike '%unidentified%' \n",
"GROUP BY project \n",
"ORDER BY count DESC\n",
"\"\"\"\n",
@ -734,7 +758,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Organization short names can be empty, so the legal name could be a fallback option."
"Organization short names can be empty, so the legal name could be a fallback option to use in `COALESCE`."
]
},
{
@ -763,7 +787,8 @@
"query = \"\"\"\n",
"SELECT COALESCE(legalshortname, legalname) AS organization,\n",
" COUNT(*) AS count \n",
"FROM organizations JOIN relations ON organizations.id = relations.source.id AND reltype.name = 'isAuthorInstitutionOf' \n",
"FROM organizations JOIN relations ON organizations.id = relations.source.id \n",
" AND reltype.name = 'isAuthorInstitutionOf' \n",
"GROUP BY organization\n",
"ORDER BY count DESC\n",
"\"\"\"\n",
@ -796,7 +821,9 @@
" COUNT(IF(type = 'dataset', 1, NULL)) AS dataset,\n",
" COUNT(IF(type = 'software', 1, NULL)) AS software,\n",
" COUNT(IF(type = 'other', 1, NULL)) AS other\n",
"FROM results JOIN organizations JOIN relations ON organizations.id = relations.source.id AND results.id = relations.target.id AND reltype.name = 'isAuthorInstitutionOf' \n",
"FROM results JOIN organizations JOIN relations ON organizations.id = relations.source.id \n",
" AND results.id = relations.target.id \n",
" AND reltype.name = 'isAuthorInstitutionOf' \n",
"GROUP BY organization \n",
"ORDER BY total DESC\n",
"\"\"\"\n",
@ -828,7 +855,9 @@
" COUNT(IF(bestaccessright.label = 'OPEN', 1, NULL)) AS open,\n",
" COUNT(IF(bestaccessright.label = 'EMBARGO', 1, NULL)) AS embargo,\n",
" COUNT(IF(bestaccessright.label = 'CLOSED', 1, NULL)) AS closed\n",
"FROM organizations JOIN relations JOIN results ON organizations.id = relations.source.id AND results.id = relations.target.id AND reltype.name = 'isAuthorInstitutionOf'\n",
"FROM organizations JOIN relations JOIN results ON organizations.id = relations.source.id \n",
" AND results.id = relations.target.id \n",
" AND reltype.name = 'isAuthorInstitutionOf'\n",
"GROUP BY organization\n",
"ORDER BY total DESC\n",
"\"\"\"\n",
@ -860,7 +889,9 @@
" COUNT(IF(bestaccessright.label = 'OPEN', 1, NULL)) AS open,\n",
" COUNT(IF(bestaccessright.label = 'EMBARGO', 1, NULL)) AS embargo,\n",
" COUNT(IF(bestaccessright.label = 'CLOSED', 1, NULL)) AS closed\n",
"FROM organizations JOIN relations JOIN results ON organizations.id = relations.source.id AND results.id = relations.target.id AND reltype.name = 'isAuthorInstitutionOf'\n",
"FROM organizations JOIN relations JOIN results ON organizations.id = relations.source.id\n",
" AND results.id = relations.target.id \n",
" AND reltype.name = 'isAuthorInstitutionOf'\n",
"WHERE organizations.country IS NOT NULL\n",
"GROUP BY organizations.country.code\n",
"ORDER BY total DESC\n",
@ -902,7 +933,16 @@
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).toPandas()"
"edges = spark.sql(query).toPandas()\n",
"edges"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Results can modeled as graph and analysed."
]
},
{
@ -1029,7 +1069,8 @@
"WITH countryProject AS (\n",
" SELECT country.code AS country, \n",
" target.id AS id \n",
" FROM organizations JOIN relations ON reltype.name = 'isParticipant' AND source.id = organizations.id\n",
" FROM organizations JOIN relations ON reltype.name = 'isParticipant' \n",
" AND source.id = organizations.id\n",
" WHERE country IS NOT NULL\n",
")\n",
"SELECT l.country AS left, \n",
@ -1065,7 +1106,8 @@
"WITH orgProject AS (\n",
" SELECT COALESCE(legalshortname, legalname) AS organization, \n",
" target.id AS id \n",
" FROM organizations JOIN relations ON reltype.name = 'isParticipant' AND source.id = organizations.id\n",
" FROM organizations JOIN relations ON reltype.name = 'isParticipant' \n",
" AND source.id = organizations.id\n",
")\n",
"SELECT l.organization AS left,\n",
" r.organization AS right,\n",
@ -1100,7 +1142,8 @@
"WITH orgProduct AS (\n",
" SELECT COALESCE(legalshortname, legalname) AS organization, \n",
" target.id AS id \n",
" FROM organizations JOIN relations ON reltype.name = 'isAuthorInstitutionOf' AND source.id = organizations.id\n",
" FROM organizations JOIN relations ON reltype.name = 'isAuthorInstitutionOf' \n",
" AND source.id = organizations.id\n",
")\n",
"SELECT l.organization AS left, \n",
" r.organization AS right,\n",
@ -1164,7 +1207,9 @@
"source": [
"query = \"\"\"\n",
"SELECT COUNT(*) AS count\n",
"FROM relations JOIN publications JOIN datasets ON reltype.name = 'IsSupplementedBy' AND publications.id = relations.source.id AND datasets.id = relations.target.id\n",
"FROM relations JOIN publications JOIN datasets ON reltype.name = 'IsSupplementedBy' \n",
" AND publications.id = relations.source.id \n",
" AND datasets.id = relations.target.id\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"