{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# OpenAIRE Beginners Kit\n",
"\n",
"The **OpenAIRE Graph** is an Open Access dataset containing metadata about research products (literature, datasets, software, and other research products) linked to other entities of the research ecosystem, such as organisations, grants, and data sources.\n",
"\n",
"The sheer size of the OpenAIRE Graph makes it hard for beginners to familiarise themselves with the underlying data model and explore its contents. Working with the full Graph typically requires access to a large distributed computing infrastructure, which is not readily available to everyone.\n",
"\n",
"The OpenAIRE Beginners Kit aims to address this issue. It consists of two components: a subset of the Graph composed of the research products published between `2022-06-29` and `2022-12-29`, together with all the entities connected to them and the respective relationships, and the present Jupyter notebook, which demonstrates how you can use `PySpark` to analyse the Graph and get answers to some interesting research questions."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download data"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"This step can take some time depending on your network speed.\n",
"\n",
"Uncomment and run the following cell only if you need to download the data for the first time. It downloads the datasets from the Zenodo deposition for this kit (https://zenodo.org/record/7490192), extracts each archive, and removes it afterwards. All the data will sit under the `data` folder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "notes"
},
"tags": []
},
"outputs": [],
"source": [
"# import os\n",
"# base_url = \"https://zenodo.org/record/7490192/files/\"\n",
"\n",
"# items = [\"communities_infrastructures.tar\", \"dataset.tar\", \"datasource.tar\", \"organization.tar\", \"otherresearchproduct.tar\", \"project.tar\", \"publication.tar\", \"relation.tar\", \"software.tar\"]\n",
"\n",
"# os.makedirs(\"data\", exist_ok=True)  # wget -O fails if the target folder is missing\n",
"# for item in items:\n",
"#     print(f\"Downloading {item}\")\n",
"#     os.system(f'wget \"{base_url}{item}?download=1\" -O data/{item}')\n",
"#     print(f\"Extracting {item}\")\n",
"#     os.system(f'tar -xf data/{item} -C data/; rm data/{item}')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"import glob\n",
"import pandas as pd\n",
"\n",
"import pyspark.sql.functions as F\n",
"from pyspark.sql.functions import col\n",
"from pyspark.sql.types import StructType\n",
"from pyspark.sql import SparkSession\n",
"from IPython.display import JSON as pretty_print\n",
"\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_colwidth', 100)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the datasets"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"If you try to load the data straight into memory with `pandas`, a single `part file` fits without trouble."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_json('./data/publication/part-00000.txt.gz', compression='gzip', lines=True)\n",
"df.head(2)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"However, loading the whole dataset this way, or even just the publications, is unlikely to succeed.\n",
"\n",
"If you uncomment and run the following lines, the kernel will most likely run out of memory after a while, die, and restart."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# files = sorted(glob.glob('./data/publication/part-*.txt.gz'))\n",
"# publications_df = pd.concat(pd.read_json(f, compression='gzip', lines=True) for f in files)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"So, let's see how `Spark` can help us.\n",
"\n",
"First things first, let's create the Spark session."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"spark = SparkSession.builder.getOrCreate()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now define a few variables holding, as JSON strings, the schemas that the OpenAIRE data follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"publicationSchema = '{\"fields\":[{\"metadata\":{},\"name\":\"author\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":{\"fields\":[{\"metadata\":{},\"name\":\"fullname\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"name\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"pid\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"id\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"scheme\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"value\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"trust\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"rank\",\"nullable\":true,\"type\":\"long\"},{\"metadata\":{},\"name\":\"surname\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"},\"type\":\"array\"}},{\"metadata\":{},\"name\":\"bestaccessright\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"code\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"label\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"scheme\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"container\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"conferencedate\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"conferenceplace\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"edition\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"ep\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"iss\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"issnLinking\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{}
,\"name\":\"issnOnline\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"issnPrinted\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"name\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"sp\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"vol\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"contributor\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"country\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":{\"fields\":[{\"metadata\":{},\"name\":\"code\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"label\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"trust\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}}],\"type\":\"struct\"},\"type\":\"array\"}},{\"metadata\":{},\"name\":\"coverage\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"dateofcollection\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"description\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"embargoenddate\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"format\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"id\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"indicators\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"impactMeasures\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"impulse\",\"nullable\":true,\"type\":{\"fields\":[{\"meta
data\":{},\"name\":\"class\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name
"datasetSchema = '{\"fields\":[{\"metadata\":{},\"name\":\"author\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":{\"fields\":[{\"metadata\":{},\"name\":\"fullname\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"name\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"pid\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"id\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"scheme\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"value\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"trust\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"rank\",\"nullable\":true,\"type\":\"long\"},{\"metadata\":{},\"name\":\"surname\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"},\"type\":\"array\"}},{\"metadata\":{},\"name\":\"bestaccessright\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"code\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"label\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"scheme\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"contributor\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"country\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":{\"fields\":[{\"metadata\":{},\"name\":\"code\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"label\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":\"string\"},{\"me
tadata\":{},\"name\":\"trust\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}}],\"type\":\"struct\"},\"type\":\"array\"}},{\"metadata\":{},\"name\":\"coverage\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"dateofcollection\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"description\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"embargoenddate\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"format\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"geolocation\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":{\"fields\":[{\"metadata\":{},\"name\":\"box\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"place\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"point\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"},\"type\":\"array\"}},{\"metadata\":{},\"name\":\"id\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"indicators\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"impactMeasures\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"impulse\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"class\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"score\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"influence\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"class\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"score\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"influence_alt\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"class\",\"nullable\":true,\"type\":\"stri
ng\"},{\"metadata\":{},\"name\":\"score\",\"nullable\":true,\"type\":\"string\"}],\"type\":
"softwareSchema = '{\"fields\":[{\"metadata\":{},\"name\":\"author\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":{\"fields\":[{\"metadata\":{},\"name\":\"fullname\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"name\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"pid\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"id\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"scheme\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"value\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"trust\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"rank\",\"nullable\":true,\"type\":\"long\"},{\"metadata\":{},\"name\":\"surname\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"},\"type\":\"array\"}},{\"metadata\":{},\"name\":\"bestaccessright\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"code\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"label\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"scheme\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"contributor\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"country\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":{\"fields\":[{\"metadata\":{},\"name\":\"code\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"label\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":\"string\"},{\"m
etadata\":{},\"name\":\"trust\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}}],\"type\":\"struct\"},\"type\":\"array\"}},{\"metadata\":{},\"name\":\"coverage\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"dateofcollection\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"description\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"documentationUrl\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"embargoenddate\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"format\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"id\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"indicators\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"impactMeasures\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"impulse\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"class\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"score\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"influence\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"class\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"score\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"influence_alt\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"class\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"score\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"popularity\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"class\",\"nullable\":true,\"
type\":\"string\"},{\"metadata\":{},\"name\":\"score\",\"nullable\":true,\"type\":\"string\
"otherSchema = '{\"fields\":[{\"metadata\":{},\"name\":\"author\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":{\"fields\":[{\"metadata\":{},\"name\":\"fullname\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"name\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"pid\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"id\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"scheme\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"value\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"trust\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"rank\",\"nullable\":true,\"type\":\"long\"},{\"metadata\":{},\"name\":\"surname\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"},\"type\":\"array\"}},{\"metadata\":{},\"name\":\"bestaccessright\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"code\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"label\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"scheme\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"contactgroup\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"contactperson\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"contributor\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"country\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":{\"fields\":[{\"metadata\":{},\"name\":\"code\"
,\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"label\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"trust\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}}],\"type\":\"struct\"},\"type\":\"array\"}},{\"metadata\":{},\"name\":\"coverage\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"dateofcollection\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"description\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"embargoenddate\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"format\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"id\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"indicators\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"impactMeasures\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"impulse\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"class\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"score\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"influence\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"class\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"score\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"influence_alt\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"class\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"score\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\
"metadata\":{},\"name\":\"popularity\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\
"datasourceSchema = '{\"fields\":[{\"metadata\":{},\"name\":\"accessrights\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"certificates\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"citationguidelineurl\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"databaseaccessrestriction\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"datasourcetype\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"scheme\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"value\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"datauploadrestriction\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"dateofvalidation\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"description\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"englishname\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"id\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"journal\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"issnLinking\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"issnOnline\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"issnPrinted\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"name\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"languages\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"logourl\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"missionstatementurl\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"officialname\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"openairecompatibility\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"originalId\",\"nullable\"
:true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"pid\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":{\"fields\":[{\"metadata\":{},\"name\":\"scheme\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"value\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"},\"type\":\"array\"}},{\"metadata\":{},\"name\":\"pidsystems\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"policies\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"releasestartdate\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"subjects\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"uploadrights\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"versioning\",\"nullable\":true,\"type\":\"boolean\"},{\"metadata\":{},\"name\":\"websiteurl\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}'\n",
"organizationSchema = '{\"fields\":[{\"metadata\":{},\"name\":\"alternativenames\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"country\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"code\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"label\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"id\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"legalname\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"legalshortname\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"pid\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":{\"fields\":[{\"metadata\":{},\"name\":\"scheme\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"value\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"},\"type\":\"array\"}},{\"metadata\":{},\"name\":\"websiteurl\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}'\n",
"projectSchema = '{\"fields\":[{\"metadata\":{},\"name\":\"acronym\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"callidentifier\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"code\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"enddate\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"funding\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":{\"fields\":[{\"metadata\":{},\"name\":\"funding_stream\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"description\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"id\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"jurisdiction\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"name\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"shortName\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"},\"type\":\"array\"}},{\"metadata\":{},\"name\":\"granted\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"currency\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"fundedamount\",\"nullable\":true,\"type\":\"double\"},{\"metadata\":{},\"name\":\"totalcost\",\"nullable\":true,\"type\":\"double\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"h2020programme\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":{\"fields\":[{\"metadata\":{},\"name\":\"code\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"description\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"},\"type\":\"array\"}},{\"metadata\":{},\"name\":\"id\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"keywords\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"openaccessmandatefordataset\",\"nullable\":true,\"type\":\"boolean\"},{\"metadata\":{},\"name\":\"openaccessmandateforpublications\",\"nullable\":true,\"type\":\
"boolean\"},{\"metadata\":{},\"name\":\"startdate\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"subject\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"summary\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"title\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"websiteurl\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}'\n",
"communitySchema = '{\"fields\":[{\"metadata\":{},\"name\":\"acronym\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"description\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"id\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"name\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"subject\",\"nullable\":true,\"type\":{\"containsNull\":true,\"elementType\":\"string\",\"type\":\"array\"}},{\"metadata\":{},\"name\":\"type\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"zenodo_community\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}'\n",
"relationSchema = '{\"fields\":[{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"provenance\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"trust\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"reltype\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"name\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"type\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"source\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"id\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"type\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"target\",\"nullable\":true,\"type\":{\"fields\":[{\"metadata\":{},\"name\":\"id\",\"nullable\":true,\"type\":\"string\"},{\"metadata\":{},\"name\":\"type\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}},{\"metadata\":{},\"name\":\"validated\",\"nullable\":true,\"type\":\"boolean\"},{\"metadata\":{},\"name\":\"validationDate\",\"nullable\":true,\"type\":\"string\"}],\"type\":\"struct\"}'"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's read the datasets about OpenAIRE entities."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"inputPath = 'data/'\n",
" \n",
"publications = spark.read.schema(StructType.fromJson(json.loads(publicationSchema))).json(inputPath + 'publication')\n",
"datasets = spark.read.schema(StructType.fromJson(json.loads(datasetSchema))).json(inputPath + 'dataset')\n",
"softwares = spark.read.schema(StructType.fromJson(json.loads(softwareSchema))).json(inputPath + 'software')\n",
"others = spark.read.schema(StructType.fromJson(json.loads(otherSchema))).json(inputPath + 'otherresearchproduct')\n",
"results = publications.unionByName(datasets, allowMissingColumns=True).unionByName(softwares, allowMissingColumns=True).unionByName(others, allowMissingColumns=True)\n",
"datasources = spark.read.schema(StructType.fromJson(json.loads(datasourceSchema))).json(inputPath + 'datasource')\n",
"organizations = spark.read.schema(StructType.fromJson(json.loads(organizationSchema))).json(inputPath + 'organization')\n",
"projects = spark.read.schema(StructType.fromJson(json.loads(projectSchema))).json(inputPath + 'project')\n",
"communities = spark.read.schema(StructType.fromJson(json.loads(communitySchema))).json(inputPath + 'communities_infrastructures')\n",
"relations = spark.read.schema(StructType.fromJson(json.loads(relationSchema))).json(inputPath + 'relation')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create some temporary views, which behave like SQL tables that you can query via Spark."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"publications.createOrReplaceTempView(\"publications\")\n",
"datasets.createOrReplaceTempView(\"datasets\")\n",
"softwares.createOrReplaceTempView(\"software\")\n",
"others.createOrReplaceTempView(\"others\")\n",
"results.createOrReplaceTempView(\"results\")\n",
"datasources.createOrReplaceTempView(\"datasources\")\n",
"organizations.createOrReplaceTempView(\"organizations\")\n",
"projects.createOrReplaceTempView(\"projects\")\n",
"communities.createOrReplaceTempView(\"communities\")\n",
"relations.createOrReplaceTempView(\"relations\")\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, let's count the number of rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"number of publications %s\"%publications.count())\n",
"print(\"number of datasets %s\"%datasets.count())\n",
"print(\"number of software %s\"%softwares.count())\n",
"print(\"number of other research products %s\"%others.count())\n",
"print(\"number of results %s\"%results.count())\n",
"print(\"number of datasources %s\"%datasources.count())\n",
"print(\"number of organizations %s\"%organizations.count())\n",
"print(\"number of communities %s\"%communities.count())\n",
"print(\"number of projects %s\"%projects.count())\n",
"print(\"number of relations %s\"%relations.count())"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"By the way, the same could be achieved in SQL via Spark."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"spark.sql(\"SELECT COUNT(*) FROM publications\").toPandas()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's show some data now. \n",
"For example, a generic publication (link to documentation: https://graph.openaire.eu/docs/data-model/entities/result)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"pretty_print(json.loads(publications\n",
" .where(\"id='50|78975075580c::2ff84f3173897001283274434e8f3eaa'\")\n",
" .toJSON()\n",
" .first()), expanded=False)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Or a data source (link to documentation: https://graph.openaire.eu/docs/data-model/entities/data-source)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"pretty_print(json.loads(datasources\n",
" .where(\"id='10|fairsharing_::c3a690be93aa602ee2dc0ccab5b7b67e'\")\n",
" .toJSON()\n",
" .first()), expanded=False)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"An organization (link to documentation: https://graph.openaire.eu/docs/data-model/entities/organization)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"pretty_print(json.loads(organizations\n",
" .where(\"id='20|openorgs____::5836463160e0e5d1cd12997f7d2f0257'\")\n",
" .toJSON()\n",
" .first()), expanded=False)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"A project (link to documentation: https://graph.openaire.eu/docs/data-model/entities/project)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"pretty_print(json.loads(projects\n",
" .toJSON()\n",
" .first()), expanded=False)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"A community (link to documentation: https://graph.openaire.eu/docs/data-model/entities/community)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"pretty_print(json.loads(communities\n",
" .where(\"acronym='mes'\")\n",
" .toJSON()\n",
" .first()), expanded=False)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"And finally, a relation (link to documentation: https://graph.openaire.eu/docs/data-model/relationships)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"pretty_print(json.loads(relations.toJSON().first()), expanded=False)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Exercises "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"All the exercises follow the template below.\n",
"```python\n",
"query = \"\"\"\n",
"SELECT <columns>\n",
"FROM <table>\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()\n",
"```\n",
"\n",
"`toPandas()` collects all the records of the PySpark DataFrame on the driver program, so it should be used only on a small subset of the data. Running it on larger datasets can cause out-of-memory errors and crash the application.\n",
"`limit(20)` is added exactly for this purpose."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Group relations based on their semantics and count them; sort results in descending order, limit to the first 20. \n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto",
"tags": []
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT reltype.name, COUNT(*) AS count \n",
"FROM relations \n",
"GROUP BY reltype.name \n",
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show information about publishing venues.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT container.issnLinking, container.issnOnline, container.issnPrinted, container.name \n",
"FROM publications \n",
"WHERE container IS NOT NULL\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Count and sort publications by citations.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT publications.id, pid.value, COUNT(*) AS count\n",
"FROM publications JOIN relations ON publications.id = relations.source.id \n",
"WHERE reltype.name = 'IsCitedBy'\n",
"GROUP BY publications.id, pid.value\n",
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Since relations are bi-directional, we could also have used the dual semantics `Cites` and joined on target ids. We would get the same results.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT publications.id, pid.value, COUNT(*) AS count\n",
"FROM publications JOIN relations ON publications.id = relations.target.id \n",
"WHERE reltype.name = 'Cites'\n",
"GROUP BY publications.id, pid.value\n",
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
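The equivalence of the two queries above can be checked on a toy example. The sketch below is only an illustration: it uses Python's built-in `sqlite3` instead of Spark so that it runs anywhere, and the flattened `relations` table (columns `source`, `target`, `name`) and its rows are made up for the purpose, not taken from the OpenAIRE data.

```python
import sqlite3

# Hypothetical, flattened stand-in for the `relations` view (not real OpenAIRE data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relations (source TEXT, target TEXT, name TEXT)")

# p2 and p3 cite p1; p3 also cites p2. Every citation is stored in both directions.
citations = [("p2", "p1"), ("p3", "p1"), ("p3", "p2")]
for citing, cited in citations:
    conn.execute("INSERT INTO relations VALUES (?, ?, 'Cites')", (citing, cited))
    conn.execute("INSERT INTO relations VALUES (?, ?, 'IsCitedBy')", (cited, citing))

# Count citations by grouping 'IsCitedBy' relations on the source id ...
by_source = conn.execute(
    "SELECT source, COUNT(*) FROM relations "
    "WHERE name = 'IsCitedBy' GROUP BY source ORDER BY source"
).fetchall()

# ... or by grouping the dual 'Cites' relations on the target id.
by_target = conn.execute(
    "SELECT target, COUNT(*) FROM relations "
    "WHERE name = 'Cites' GROUP BY target ORDER BY target"
).fetchall()

print(by_source == by_target)
```

Both groupings yield the same per-publication counts only as long as every relation is actually stored together with its inverse, which is the assumption made in the text above.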
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show the most occurring publication subjects; sort results in descending order; limit to the first 20.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"WITH terms AS (\n",
" SELECT explode(subjects.subject.value) AS `term`\n",
" FROM publications\n",
")\n",
"SELECT term AS `subject term`,\n",
" COUNT(*) AS count \n",
"FROM terms \n",
"GROUP BY term \n",
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show the journals with the highest number of publications.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"\"\"\n",
"WITH journals AS (\n",
" SELECT container.*\n",
" FROM publications\n",
" WHERE container IS NOT NULL\n",
")\n",
"SELECT name, count(*) AS count \n",
"FROM journals \n",
"GROUP BY name \n",
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show the number of projects per organization; sort results in descending order; limit to the first 20.\n",
"\n",
"Hint: the `COALESCE` function can help to select among the possible name forms of an organisation (e.g., short and full name). You can pass it multiple columns and it returns the first one that is not null. \n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT COALESCE(legalshortname, legalname) AS name, \n",
" COUNT(*) AS count \n",
"FROM organizations JOIN relations ON organizations.id = relations.source.id \n",
" AND reltype.name = 'isParticipant'\n",
"GROUP BY name \n",
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
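The behaviour of `COALESCE` used in the query above can be seen in isolation on a few rows. This is a minimal sketch, assuming a tiny hypothetical `organizations` table with invented names; it runs on Python's built-in `sqlite3` rather than Spark, where `COALESCE` follows the same first-non-null rule.

```python
import sqlite3

# Hypothetical organizations; the names are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE organizations (legalshortname TEXT, legalname TEXT)")
conn.executemany(
    "INSERT INTO organizations VALUES (?, ?)",
    [
        ("ACME", "Acme Research Institute"),  # short name available
        (None, "Example University"),         # only the legal name is known
        (None, None),                          # nothing is known
    ],
)

# COALESCE returns its first non-null argument, or NULL when all are NULL.
names = [
    row[0]
    for row in conn.execute(
        "SELECT COALESCE(legalshortname, legalname) FROM organizations ORDER BY rowid"
    )
]
print(names)
```

The last row shows that `COALESCE` can still return null when every candidate column is null, so a literal default (for example `'?'`) can be appended as a final argument if null group labels are not wanted.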
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show projects with the highest number of associated results. \n",
"\n",
"Note: a project titled `unidentified` is a placeholder used when a result is associated with a funder but the specific project is not known. Such projects should be excluded from the count.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT funding.shortName, code, title, COUNT(*) AS count \n",
"FROM projects JOIN relations ON projects.id = relations.source.id \n",
" AND reltype.name = 'produces' \n",
" AND not projects.title ilike '%unidentified%' \n",
"GROUP BY funding.shortName, code, title\n",
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Strings can also be manipulated on the fly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT CONCAT_WS(' / ',\n",
" IF(SIZE(funding.shortName) > 0, ARRAY_JOIN(funding.shortName, ',', '-'), '?'), \n",
" COALESCE(code, '?'), \n",
" SUBSTRING(title, 0, 50)) AS project, COUNT(*) AS count \n",
"FROM projects JOIN relations ON projects.id = relations.source.id \n",
" AND reltype.name = 'produces' \n",
" AND NOT projects.title ilike '%unidentified%' \n",
"GROUP BY project \n",
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"<div class=\"alert alert-info\">\n",
"Show the most co-occurring publication subjects from controlled vocabularies (i.e., scheme != 'keyword') avoiding repetition; limit to the first 20.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto",
"tags": []
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"WITH subjects AS (\n",
"    WITH exploded_subjects AS (\n",
" SELECT id, EXPLODE(subjects.subject) AS subject \n",
" FROM publications) \n",
" SELECT id, subject.value AS `subject` \n",
" FROM exploded_subjects \n",
" WHERE subject.scheme != 'keyword'\n",
")\n",
"SELECT l.subject AS left, \n",
" r.subject AS right, \n",
" COUNT(*) AS count\n",
"FROM subjects AS l JOIN subjects AS r ON l.id = r.id AND l.subject < r.subject\n",
"GROUP BY left, right\n",
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
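The `l.subject < r.subject` condition in the self-join above is what avoids repetition: it keeps each unordered pair once and drops pairs of a subject with itself. The sketch below illustrates this on a toy table, using Python's built-in `sqlite3` instead of Spark; the `subjects` table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical (publication id, subject) rows, already exploded one subject per row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subjects (id INTEGER, subject TEXT)")
conn.executemany(
    "INSERT INTO subjects VALUES (?, ?)",
    [(1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b")],
)

# Self-join on the publication id; `<` keeps (a, b) but not (b, a) or (a, a).
pairs = conn.execute(
    """
    SELECT l.subject, r.subject, COUNT(*) AS n
    FROM subjects AS l JOIN subjects AS r
      ON l.id = r.id AND l.subject < r.subject
    GROUP BY l.subject, r.subject
    ORDER BY n DESC, l.subject, r.subject
    """
).fetchall()

# For comparison: with `<>` every pair would be counted in both orders.
mirrored = conn.execute(
    "SELECT COUNT(*) FROM subjects AS l JOIN subjects AS r "
    "ON l.id = r.id AND l.subject <> r.subject"
).fetchone()[0]

print(pairs)
print(mirrored)
```

Replacing `<` with `<=` would additionally keep the pairs of a value with itself, which is a choice one may want when self-loops carry meaning in the resulting network.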
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show the number of research products per organization; sort results in descending order; limit to the first 20. The relation used is the affiliation, since in our data this relation links products to organizations and not authors to organizations.\n",
"</div>"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Organization short names can be empty, so the legal name could be a fallback option to use in `COALESCE`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT legalshortname, legalname\n",
"FROM organizations \n",
"WHERE legalshortname IS NULL\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT COALESCE(legalshortname, legalname) AS organization,\n",
" COUNT(*) AS count \n",
"FROM organizations JOIN relations ON organizations.id = relations.source.id \n",
" AND reltype.name = 'isAuthorInstitutionOf' \n",
"GROUP BY organization\n",
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show the number of research products (per type) per organization.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT COALESCE(legalshortname, legalname) AS organization, \n",
" COUNT(*) AS total,\n",
" COUNT(IF(type = 'publication', 1, NULL)) AS publication,\n",
" COUNT(IF(type = 'dataset', 1, NULL)) AS dataset,\n",
" COUNT(IF(type = 'software', 1, NULL)) AS software,\n",
" COUNT(IF(type = 'other', 1, NULL)) AS other\n",
"FROM results JOIN organizations JOIN relations ON organizations.id = relations.source.id \n",
" AND results.id = relations.target.id \n",
" AND reltype.name = 'isAuthorInstitutionOf' \n",
"GROUP BY organization \n",
"ORDER BY total DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
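The per-type columns in the query above rely on the fact that `COUNT(expr)` ignores nulls, so `COUNT(IF(type = 'publication', 1, NULL))` counts only the matching rows. The sketch below shows the same pivot-by-counting idea on a toy table. It is an illustration under stated assumptions: it runs on Python's built-in `sqlite3`, where the portable `CASE WHEN ... THEN 1 END` form is used in place of Spark's `IF(...)`, and the `products` table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical (organization, product type) rows, as if already joined via relations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (organization TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [
        ("org A", "publication"),
        ("org A", "publication"),
        ("org A", "dataset"),
        ("org B", "software"),
    ],
)

# CASE without ELSE yields NULL for non-matching rows, and COUNT skips NULLs.
result = conn.execute(
    """
    SELECT organization,
           COUNT(*) AS total,
           COUNT(CASE WHEN type = 'publication' THEN 1 END) AS publication,
           COUNT(CASE WHEN type = 'dataset' THEN 1 END) AS dataset,
           COUNT(CASE WHEN type = 'software' THEN 1 END) AS software
    FROM products
    GROUP BY organization
    ORDER BY total DESC
    """
).fetchall()

for row in result:
    print(row)
```

Each conditional count becomes one column, so a single pass over the grouped rows produces the whole breakdown without one query per type.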
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show result access types per organization.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT COALESCE(legalshortname, legalname) AS organization, \n",
" COUNT(*) as total,\n",
" COUNT(IF(bestaccessright.label = 'OPEN', 1, NULL)) AS open,\n",
" COUNT(IF(bestaccessright.label = 'EMBARGO', 1, NULL)) AS embargo,\n",
" COUNT(IF(bestaccessright.label = 'CLOSED', 1, NULL)) AS closed\n",
"FROM organizations JOIN relations JOIN results ON organizations.id = relations.source.id \n",
" AND results.id = relations.target.id \n",
" AND reltype.name = 'isAuthorInstitutionOf'\n",
"GROUP BY organization\n",
"ORDER BY total DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show the result access types per country of the organizations.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT organizations.country.code AS country, \n",
" COUNT(*) AS total,\n",
" COUNT(IF(bestaccessright.label = 'OPEN', 1, NULL)) AS open,\n",
" COUNT(IF(bestaccessright.label = 'EMBARGO', 1, NULL)) AS embargo,\n",
" COUNT(IF(bestaccessright.label = 'CLOSED', 1, NULL)) AS closed\n",
"FROM organizations JOIN relations JOIN results ON organizations.id = relations.source.id\n",
" AND results.id = relations.target.id \n",
" AND reltype.name = 'isAuthorInstitutionOf'\n",
"WHERE organizations.country IS NOT NULL\n",
"GROUP BY organizations.country.code\n",
"ORDER BY total DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show the collaboration network of countries participating in projects with respect to the participating organizations.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"WITH countryProject AS (\n",
" SELECT country.code AS country, \n",
" target.id AS id \n",
" FROM organizations JOIN relations ON reltype.name = 'isParticipant' AND source.id = organizations.id\n",
" WHERE country IS NOT NULL\n",
")\n",
"SELECT l.country AS left, \n",
" r.country AS right,\n",
" COUNT(*) AS count \n",
"FROM countryProject AS l JOIN countryProject AS r ON l.id = r.id AND l.country <= r.country\n",
"GROUP BY left, right \n",
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
"edges = spark.sql(query).toPandas()\n",
"edges"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Results can be modeled as a graph and analysed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import igraph as ig\n",
"\n",
"G = ig.Graph.TupleList(\n",
" edges=edges[['left', 'right', 'count']].values,\n",
" vertex_name_attr='countrycode',\n",
" edge_attrs = ['weight'],\n",
" directed=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"G.vcount()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"G.ecount()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"G.vs[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"G.es[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"fig, ax = plt.subplots()\n",
"ig.plot(G, vertex_label=G.vs['countrycode'], target=ax)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"G.vs.find(countrycode_eq = 'MY') # Malaysia"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"H = G.induced_subgraph(G.neighborhood(50))\n",
"H.summary()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"H.vs['color'] = 'grey'\n",
"H.vs[0]['color'] = 'red'\n",
"fig, ax = plt.subplots()\n",
"ig.plot(H, target=ax)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"G.transitivity_local_undirected(50)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show international project collaborations; focus on organizations.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"WITH countryProject AS (\n",
" SELECT country.code AS country, \n",
" target.id AS id \n",
" FROM organizations JOIN relations ON reltype.name = 'isParticipant' \n",
" AND source.id = organizations.id\n",
" WHERE country IS NOT NULL\n",
")\n",
"SELECT l.country AS left, \n",
" r.country AS right, \n",
" COUNT(*) AS count \n",
"FROM countryProject AS l JOIN countryProject AS r ON l.id = r.id AND l.country < r.country\n",
"GROUP BY left, right \n",
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).toPandas() "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show the organisations that collaborate in projects most often.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"WITH orgProject AS (\n",
" SELECT COALESCE(legalshortname, legalname) AS organization, \n",
" target.id AS id \n",
" FROM organizations JOIN relations ON reltype.name = 'isParticipant' \n",
" AND source.id = organizations.id\n",
")\n",
"SELECT l.organization AS left,\n",
" r.organization AS right,\n",
" COUNT(*) AS count\n",
"FROM orgProject AS l JOIN orgProject AS r ON l.id = r.id AND l.organization < r.organization\n",
"GROUP BY left, right \n",
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show the organizations that co-author papers most often.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"WITH orgProduct AS (\n",
" SELECT COALESCE(legalshortname, legalname) AS organization, \n",
" target.id AS id \n",
" FROM organizations JOIN relations ON reltype.name = 'isAuthorInstitutionOf' \n",
" AND source.id = organizations.id\n",
")\n",
"SELECT l.organization AS left, \n",
" r.organization AS right,\n",
" COUNT(*) AS count \n",
"FROM orgProduct AS l JOIN orgProduct AS r ON l.id = r.id AND l.organization < r.organization\n",
"GROUP BY left, right \n",
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Summarize the access rights over the years.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT bestaccessright.label AS accessright,\n",
" SUBSTRING(publicationdate, 0,4) AS year,\n",
" COUNT(*) AS count\n",
"FROM results\n",
"WHERE bestaccessright IS NOT NULL AND publicationdate IS NOT NULL\n",
"GROUP BY accessright, year\n",
"ORDER BY count DESC\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"Show the number of publications supplemented by datasets.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT COUNT(*) AS count\n",
"FROM relations JOIN publications JOIN datasets ON reltype.name = 'IsSupplementedBy' \n",
" AND publications.id = relations.source.id \n",
" AND datasets.id = relations.target.id\n",
"\"\"\"\n",
"\n",
"spark.sql(query).limit(20).toPandas()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
},
"name": "openaire_beginners_kit SQL"
},
"nbformat": 4,
"nbformat_minor": 4
}