openaire_beginners_kit/data/beginners_kit.ipynb

OpenAIRE Beginners Kit

The OpenAIRE Research Graph is an Open Access dataset containing metadata about research products (literature, datasets, software, etc.) linked to other entities of the research ecosystem like organisations, project grants, and data sources.

The sheer size of the OpenAIRE Research Graph is a major obstacle for beginners who want to familiarise themselves with the underlying data model and explore its contents. Working with the Graph in its full size typically requires a large distributed computing infrastructure, which is not easily accessible to everyone.

The OpenAIRE Beginners Kit aims to address this issue. It consists of two components: (1) a subset of the Graph comprising the research products published between 2022-06-29 and 2022-12-29, all the entities connected to them, and the respective relationships; and (2) the present Jupyter notebook, which demonstrates how to use PySpark to analyse the Graph and answer some interesting research questions.

Download data

In [5]:
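# Start from a clean data directory
!rm -rf data
!mkdir data

import os

# Zenodo record that hosts the Beginners Kit archives
base_url = "https://zenodo.org/record/7490192/files/"

# One tar archive per entity type, plus the relations between entities
items = ["communities_infrastructures.tar", "dataset.tar", "datasource.tar", "organization.tar", "otherresearchproduct.tar", "project.tar", "publication.tar", "relation.tar", "software.tar"]

for item in items:
    print(f"Downloading {item}")
    # Quote the URL so the shell does not try to interpret the '?' in the query string
    os.system(f'wget "{base_url}{item}?download=1" -O data/{item}')
    print(f"Extracting {item}")
    # Unpack the archive into data/, then delete the archive to save disk space
    os.system(f'tar -xf data/{item} -C data/; rm data/{item}')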
    
Downloading communities_infrastructures.tar
Extracting communities_infrastructures.tar
Downloading dataset.tar
Extracting dataset.tar
Downloading datasource.tar
Extracting datasource.tar
Downloading organization.tar
Extracting organization.tar
Downloading otherresearchproduct.tar
Extracting otherresearchproduct.tar
Downloading project.tar
Extracting project.tar
Downloading publication.tar
Extracting publication.tar
Downloading relation.tar
Extracting relation.tar
Downloading software.tar
Extracting software.tar
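
If `wget` or `tar` is not available on your system, the same download-and-extract step can be sketched with Python's standard library alone. This is a minimal sketch, not part of the kit: the helper names `build_download_url`, `extract_and_remove` and `fetch` are hypothetical, and the code assumes the archives are plain tar files, as their `.tar` extension suggests.

```python
import tarfile
from pathlib import Path
from urllib.request import urlretrieve

# Same Zenodo record as in the cell above
BASE_URL = "https://zenodo.org/record/7490192/files/"


def build_download_url(item: str, base_url: str = BASE_URL) -> str:
    """Return the download URL for one archive name (hypothetical helper)."""
    return f"{base_url}{item}?download=1"


def extract_and_remove(archive: Path, dest: Path) -> None:
    """Unpack a tar archive into dest, then delete the archive."""
    with tarfile.open(archive) as tar:
        tar.extractall(dest)
    archive.unlink()


def fetch(item: str, dest: Path = Path("data")) -> None:
    """Download one archive into dest and unpack it there."""
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / item
    urlretrieve(build_download_url(item), str(archive))
    extract_and_remove(archive, dest)
```

Calling `fetch("project.tar")` would then play the role of one iteration of the loop above. The archives can be large, so expect each call to take a while.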

Have a look at the input data
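
The next cell defines the schema of each entity type as a long Spark JSON schema string, which is hard to read at a glance. As a reading aid, here is a minimal sketch that lists the top-level fields of such a string together with a short type label. The helpers `describe_type` and `summarize_schema` are hypothetical and not part of the kit, and the `example` string is a tiny hand-written schema in the same layout, not one of the kit's schemas.

```python
import json


def describe_type(spark_type) -> str:
    """Render a Spark JSON type node as a short label such as 'array<struct>'."""
    if isinstance(spark_type, str):
        # Primitive types are plain strings: "string", "long", "boolean", ...
        return spark_type
    if spark_type["type"] == "array":
        return f"array<{describe_type(spark_type['elementType'])}>"
    # Any other complex type (here: "struct") is reported by its type name only
    return spark_type["type"]


def summarize_schema(schema_json: str) -> dict:
    """Map each top-level field name to a short type label."""
    schema = json.loads(schema_json)
    return {field["name"]: describe_type(field["type"]) for field in schema["fields"]}


# Hand-written example in the same JSON layout as the schema strings
example = (
    '{"fields":['
    '{"metadata":{},"name":"id","nullable":true,"type":"string"},'
    '{"metadata":{},"name":"pid","nullable":true,"type":'
    '{"containsNull":true,"elementType":{"fields":[],"type":"struct"},"type":"array"}}'
    '],"type":"struct"}'
)

print(summarize_schema(example))  # {'id': 'string', 'pid': 'array<struct>'}
```

Once the cell below has run, `summarize_schema(organizationSchema)` should give a compact overview of the organisation entity in the same way.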

In [8]:
import json

import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import StructType
from pyspark.sql import SparkSession
from IPython.display import JSON as pretty_print


spark = SparkSession.builder.getOrCreate()



publicationSchema = '{"fields":[{"metadata":{},"name":"author","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"fullname","nullable":true,"type":"string"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"pid","nullable":true,"type":{"fields":[{"metadata":{},"name":"id","nullable":true,"type":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"provenance","nullable":true,"type":{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":"string"},{"metadata":{},"name":"trust","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"}},{"metadata":{},"name":"rank","nullable":true,"type":"long"},{"metadata":{},"name":"surname","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"bestaccessright","nullable":true,"type":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"label","nullable":true,"type":"string"},{"metadata":{},"name":"scheme","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"container","nullable":true,"type":{"fields":[{"metadata":{},"name":"conferencedate","nullable":true,"type":"string"},{"metadata":{},"name":"conferenceplace","nullable":true,"type":"string"},{"metadata":{},"name":"edition","nullable":true,"type":"string"},{"metadata":{},"name":"ep","nullable":true,"type":"string"},{"metadata":{},"name":"iss","nullable":true,"type":"string"},{"metadata":{},"name":"issnLinking","nullable":true,"type":"string"},{"metadata":{},"name":"issnOnline","nullable":true,"type":"string"},{"metadata":{},"name":"issnPrinted","nullable":true,"type":"string"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"sp","nullable":true,"type":"string"},{"metadata":{},"name":"vol","nullable":true,"type":"string"}],"t
ype":"struct"}},{"metadata":{},"name":"contributor","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"country","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"label","nullable":true,"type":"string"},{"metadata":{},"name":"provenance","nullable":true,"type":{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":"string"},{"metadata":{},"name":"trust","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"coverage","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"dateofcollection","nullable":true,"type":"string"},{"metadata":{},"name":"description","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"embargoenddate","nullable":true,"type":"string"},{"metadata":{},"name":"format","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"id","nullable":true,"type":"string"},{"metadata":{},"name":"indicators","nullable":true,"type":{"fields":[{"metadata":{},"name":"impactMeasures","nullable":true,"type":{"fields":[{"metadata":{},"name":"impulse","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"influence","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"influence_alt","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"popularity","nullabl
e":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"popularity_alt","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"}},{"metadata":{},"name":"usageCounts","nullable":true,"type":{"fields":[{"metadata":{},"name":"downloads","nullable":true,"type":"string"},{"metadata":{},"name":"views","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"}},{"metadata":{},"name":"instance","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"accessright","nullable":true,"type":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"label","nullable":true,"type":"string"},{"metadata":{},"name":"openAccessRoute","nullable":true,"type":"string"},{"metadata":{},"name":"scheme","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"alternateIdentifier","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"license","nullable":true,"type":"string"},{"metadata":{},"name":"pid","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"publicationdate","nullable":true,"type":"string"},{"metadata":{},"name":"refereed","nullable":true,"type":"string"},{"metadata":{},"name":"type","nullable":true,"type":"string"},{"metadata":{},"name":"url","nullable":true,"type":{"containsNull":true,"elementType":"string","type"
:"array"}}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"language","nullable":true,"type":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"label","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"lastupdatetimestamp","nullable":true,"type":"long"},{"metadata":{},"name":"maintitle","nullable":true,"type":"string"},{"metadata":{},"name":"originalId","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"pid","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"publicationdate","nullable":true,"type":"string"},{"metadata":{},"name":"publisher","nullable":true,"type":"string"},{"metadata":{},"name":"source","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"subjects","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":"string"},{"metadata":{},"name":"trust","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"subject","nullable":true,"type":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"subtitle","nullable":true,"type":"string"},{"metadata":{},"name":"type","nullable":true,"type":"string"}],"type":"struct"}'
datasetSchema = '{"fields":[{"metadata":{},"name":"author","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"fullname","nullable":true,"type":"string"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"pid","nullable":true,"type":{"fields":[{"metadata":{},"name":"id","nullable":true,"type":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"provenance","nullable":true,"type":{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":"string"},{"metadata":{},"name":"trust","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"}},{"metadata":{},"name":"rank","nullable":true,"type":"long"},{"metadata":{},"name":"surname","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"bestaccessright","nullable":true,"type":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"label","nullable":true,"type":"string"},{"metadata":{},"name":"scheme","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"contributor","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"country","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"label","nullable":true,"type":"string"},{"metadata":{},"name":"provenance","nullable":true,"type":{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":"string"},{"metadata":{},"name":"trust","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"coverage","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"dateofcollection","nullable":true,"type":"string"},{"me
tadata":{},"name":"description","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"embargoenddate","nullable":true,"type":"string"},{"metadata":{},"name":"format","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"geolocation","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"box","nullable":true,"type":"string"},{"metadata":{},"name":"place","nullable":true,"type":"string"},{"metadata":{},"name":"point","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"id","nullable":true,"type":"string"},{"metadata":{},"name":"indicators","nullable":true,"type":{"fields":[{"metadata":{},"name":"impactMeasures","nullable":true,"type":{"fields":[{"metadata":{},"name":"impulse","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"influence","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"influence_alt","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"popularity","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"popularity_alt","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"}},{"metadata":{},"name":"usageCounts","nullable":true,"type":{"fields":[{"meta
data":{},"name":"downloads","nullable":true,"type":"string"},{"metadata":{},"name":"views","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"}},{"metadata":{},"name":"instance","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"accessright","nullable":true,"type":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"label","nullable":true,"type":"string"},{"metadata":{},"name":"openAccessRoute","nullable":true,"type":"string"},{"metadata":{},"name":"scheme","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"alternateIdentifier","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"license","nullable":true,"type":"string"},{"metadata":{},"name":"pid","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"publicationdate","nullable":true,"type":"string"},{"metadata":{},"name":"refereed","nullable":true,"type":"string"},{"metadata":{},"name":"type","nullable":true,"type":"string"},{"metadata":{},"name":"url","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"language","nullable":true,"type":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"label","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"lastupdatetimestamp","nullable":true,"type":"long"},{"metadata":{},"name":"maintitle","nullable":true,"type":"string"},{"metadata":{},"name":"originalId","nullable":true,"type":{"containsNull":true,"eleme
ntType":"string","type":"array"}},{"metadata":{},"name":"pid","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"publicationdate","nullable":true,"type":"string"},{"metadata":{},"name":"publisher","nullable":true,"type":"string"},{"metadata":{},"name":"size","nullable":true,"type":"string"},{"metadata":{},"name":"source","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"subjects","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":"string"},{"metadata":{},"name":"trust","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"subject","nullable":true,"type":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"subtitle","nullable":true,"type":"string"},{"metadata":{},"name":"type","nullable":true,"type":"string"},{"metadata":{},"name":"version","nullable":true,"type":"string"}],"type":"struct"}'
softwareSchema = '{"fields":[{"metadata":{},"name":"author","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"fullname","nullable":true,"type":"string"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"pid","nullable":true,"type":{"fields":[{"metadata":{},"name":"id","nullable":true,"type":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"provenance","nullable":true,"type":{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":"string"},{"metadata":{},"name":"trust","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"}},{"metadata":{},"name":"rank","nullable":true,"type":"long"},{"metadata":{},"name":"surname","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"bestaccessright","nullable":true,"type":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"label","nullable":true,"type":"string"},{"metadata":{},"name":"scheme","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"contributor","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"country","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"label","nullable":true,"type":"string"},{"metadata":{},"name":"provenance","nullable":true,"type":{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":"string"},{"metadata":{},"name":"trust","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"coverage","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"dateofcollection","nullable":true,"type":"string"},{"m
etadata":{},"name":"description","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"documentationUrl","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"embargoenddate","nullable":true,"type":"string"},{"metadata":{},"name":"format","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"id","nullable":true,"type":"string"},{"metadata":{},"name":"indicators","nullable":true,"type":{"fields":[{"metadata":{},"name":"impactMeasures","nullable":true,"type":{"fields":[{"metadata":{},"name":"impulse","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"influence","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"influence_alt","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"popularity","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"popularity_alt","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"}},{"metadata":{},"name":"usageCounts","nullable":true,"type":{"fields":[{"metadata":{},"name":"downloads","nullable":true,"type":"string"},{"metadata":{},"name":"views","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"}},{"metadata":{},"name":"instance","nulla
ble":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"accessright","nullable":true,"type":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"label","nullable":true,"type":"string"},{"metadata":{},"name":"openAccessRoute","nullable":true,"type":"string"},{"metadata":{},"name":"scheme","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"alternateIdentifier","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"license","nullable":true,"type":"string"},{"metadata":{},"name":"pid","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"publicationdate","nullable":true,"type":"string"},{"metadata":{},"name":"refereed","nullable":true,"type":"string"},{"metadata":{},"name":"type","nullable":true,"type":"string"},{"metadata":{},"name":"url","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"language","nullable":true,"type":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"label","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"lastupdatetimestamp","nullable":true,"type":"long"},{"metadata":{},"name":"maintitle","nullable":true,"type":"string"},{"metadata":{},"name":"originalId","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"pid","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"meta
data":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"programmingLanguage","nullable":true,"type":"string"},{"metadata":{},"name":"publicationdate","nullable":true,"type":"string"},{"metadata":{},"name":"publisher","nullable":true,"type":"string"},{"metadata":{},"name":"source","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"subjects","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":"string"},{"metadata":{},"name":"trust","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"subject","nullable":true,"type":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"subtitle","nullable":true,"type":"string"},{"metadata":{},"name":"type","nullable":true,"type":"string"}],"type":"struct"}'
otherSchema = '{"fields":[{"metadata":{},"name":"author","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"fullname","nullable":true,"type":"string"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"pid","nullable":true,"type":{"fields":[{"metadata":{},"name":"id","nullable":true,"type":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"provenance","nullable":true,"type":{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":"string"},{"metadata":{},"name":"trust","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"}},{"metadata":{},"name":"rank","nullable":true,"type":"long"},{"metadata":{},"name":"surname","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"bestaccessright","nullable":true,"type":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"label","nullable":true,"type":"string"},{"metadata":{},"name":"scheme","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"contactgroup","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"contactperson","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"contributor","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"country","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"label","nullable":true,"type":"string"},{"metadata":{},"name":"provenance","nullable":true,"type":{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":"string"},{"metadata":{},"name":"trust","nullable":true,"type":"string"}],"type":
"struct"}}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"coverage","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"dateofcollection","nullable":true,"type":"string"},{"metadata":{},"name":"description","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"embargoenddate","nullable":true,"type":"string"},{"metadata":{},"name":"format","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"id","nullable":true,"type":"string"},{"metadata":{},"name":"indicators","nullable":true,"type":{"fields":[{"metadata":{},"name":"impactMeasures","nullable":true,"type":{"fields":[{"metadata":{},"name":"impulse","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"influence","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"influence_alt","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"popularity","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"popularity_alt","nullable":true,"type":{"fields":[{"metadata":{},"name":"class","nullable":true,"type":"string"},{"metadata":{},"name":"score","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"}},{"metadata":{},"name":"usageCounts","nullable":true,"type":{"fields":[{"metadata":{},"name":"downloads","nullable":true,"type":"string"},{"metadata":{},"name":"vi
ews","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"}},{"metadata":{},"name":"instance","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"accessright","nullable":true,"type":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"label","nullable":true,"type":"string"},{"metadata":{},"name":"openAccessRoute","nullable":true,"type":"string"},{"metadata":{},"name":"scheme","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"alternateIdentifier","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"license","nullable":true,"type":"string"},{"metadata":{},"name":"pid","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"publicationdate","nullable":true,"type":"string"},{"metadata":{},"name":"refereed","nullable":true,"type":"string"},{"metadata":{},"name":"type","nullable":true,"type":"string"},{"metadata":{},"name":"url","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"language","nullable":true,"type":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"label","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"lastupdatetimestamp","nullable":true,"type":"long"},{"metadata":{},"name":"maintitle","nullable":true,"type":"string"},{"metadata":{},"name":"originalId","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"pid","nullable":true,"type":{
"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"publicationdate","nullable":true,"type":"string"},{"metadata":{},"name":"publisher","nullable":true,"type":"string"},{"metadata":{},"name":"source","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"subjects","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":"string"},{"metadata":{},"name":"trust","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"subject","nullable":true,"type":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"}}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"subtitle","nullable":true,"type":"string"},{"metadata":{},"name":"tool","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"type","nullable":true,"type":"string"}],"type":"struct"}'
datasourceSchema = '{"fields":[{"metadata":{},"name":"accessrights","nullable":true,"type":"string"},{"metadata":{},"name":"certificates","nullable":true,"type":"string"},{"metadata":{},"name":"citationguidelineurl","nullable":true,"type":"string"},{"metadata":{},"name":"databaseaccessrestriction","nullable":true,"type":"string"},{"metadata":{},"name":"datasourcetype","nullable":true,"type":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"datauploadrestriction","nullable":true,"type":"string"},{"metadata":{},"name":"dateofvalidation","nullable":true,"type":"string"},{"metadata":{},"name":"description","nullable":true,"type":"string"},{"metadata":{},"name":"englishname","nullable":true,"type":"string"},{"metadata":{},"name":"id","nullable":true,"type":"string"},{"metadata":{},"name":"journal","nullable":true,"type":{"fields":[{"metadata":{},"name":"issnLinking","nullable":true,"type":"string"},{"metadata":{},"name":"issnOnline","nullable":true,"type":"string"},{"metadata":{},"name":"issnPrinted","nullable":true,"type":"string"},{"metadata":{},"name":"name","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"languages","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"logourl","nullable":true,"type":"string"},{"metadata":{},"name":"missionstatementurl","nullable":true,"type":"string"},{"metadata":{},"name":"officialname","nullable":true,"type":"string"},{"metadata":{},"name":"openairecompatibility","nullable":true,"type":"string"},{"metadata":{},"name":"originalId","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"pid","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"pidsystems","nullable":true,"type":"string"},{"metadata":{},"name":"policies","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"releasestartdate","nullable":true,"type":"string"},{"metadata":{},"name":"subjects","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"uploadrights","nullable":true,"type":"string"},{"metadata":{},"name":"versioning","nullable":true,"type":"boolean"},{"metadata":{},"name":"websiteurl","nullable":true,"type":"string"}],"type":"struct"}'
organizationSchema = '{"fields":[{"metadata":{},"name":"alternativenames","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"country","nullable":true,"type":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"label","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"id","nullable":true,"type":"string"},{"metadata":{},"name":"legalname","nullable":true,"type":"string"},{"metadata":{},"name":"legalshortname","nullable":true,"type":"string"},{"metadata":{},"name":"pid","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"scheme","nullable":true,"type":"string"},{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"websiteurl","nullable":true,"type":"string"}],"type":"struct"}'
projectSchema = '{"fields":[{"metadata":{},"name":"acronym","nullable":true,"type":"string"},{"metadata":{},"name":"callidentifier","nullable":true,"type":"string"},{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"enddate","nullable":true,"type":"string"},{"metadata":{},"name":"funding","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"funding_stream","nullable":true,"type":{"fields":[{"metadata":{},"name":"description","nullable":true,"type":"string"},{"metadata":{},"name":"id","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"jurisdiction","nullable":true,"type":"string"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"shortName","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"granted","nullable":true,"type":{"fields":[{"metadata":{},"name":"currency","nullable":true,"type":"string"},{"metadata":{},"name":"fundedamount","nullable":true,"type":"double"},{"metadata":{},"name":"totalcost","nullable":true,"type":"double"}],"type":"struct"}},{"metadata":{},"name":"h2020programme","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"code","nullable":true,"type":"string"},{"metadata":{},"name":"description","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}},{"metadata":{},"name":"id","nullable":true,"type":"string"},{"metadata":{},"name":"keywords","nullable":true,"type":"string"},{"metadata":{},"name":"openaccessmandatefordataset","nullable":true,"type":"boolean"},{"metadata":{},"name":"openaccessmandateforpublications","nullable":true,"type":"boolean"},{"metadata":{},"name":"startdate","nullable":true,"type":"string"},{"metadata":{},"name":"subject","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"summary","nullable":true,"type":"string"},{"metadata":{},"name":"title","nullable":true,"type":"string"},{"metadata":{},"name":"websiteurl","nullable":true,"type":"string"}],"type":"struct"}'
communitySchema = '{"fields":[{"metadata":{},"name":"acronym","nullable":true,"type":"string"},{"metadata":{},"name":"description","nullable":true,"type":"string"},{"metadata":{},"name":"id","nullable":true,"type":"string"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"subject","nullable":true,"type":{"containsNull":true,"elementType":"string","type":"array"}},{"metadata":{},"name":"type","nullable":true,"type":"string"},{"metadata":{},"name":"zenodo_community","nullable":true,"type":"string"}],"type":"struct"}'
relationSchema = '{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":{"fields":[{"metadata":{},"name":"provenance","nullable":true,"type":"string"},{"metadata":{},"name":"trust","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"reltype","nullable":true,"type":{"fields":[{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"type","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"source","nullable":true,"type":{"fields":[{"metadata":{},"name":"id","nullable":true,"type":"string"},{"metadata":{},"name":"type","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"target","nullable":true,"type":{"fields":[{"metadata":{},"name":"id","nullable":true,"type":"string"},{"metadata":{},"name":"type","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"validated","nullable":true,"type":"boolean"},{"metadata":{},"name":"validationDate","nullable":true,"type":"string"}],"type":"struct"}'
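Each schema string above is the JSON serialisation of a Spark `StructType`, which makes it hard to read at a glance. The sketch below walks such a JSON document with the standard library only and lists the dotted path and type of every leaf field. The helper `list_fields` and the shortened `relation_schema_excerpt` (two fields taken from `relationSchema`) are illustrative assumptions and not part of the notebook.

```python
import json

# Shortened excerpt of relationSchema (two of its fields), used only for illustration.
relation_schema_excerpt = (
    '{"fields":[{"metadata":{},"name":"reltype","nullable":true,"type":'
    '{"fields":[{"metadata":{},"name":"name","nullable":true,"type":"string"},'
    '{"metadata":{},"name":"type","nullable":true,"type":"string"}],"type":"struct"}},'
    '{"metadata":{},"name":"validated","nullable":true,"type":"boolean"}],"type":"struct"}'
)


def list_fields(node, prefix=""):
    """Return (dotted path, type) pairs for the leaf fields of a StructType JSON dict."""
    out = []
    for field in node["fields"]:
        path = prefix + field["name"]
        ftype = field["type"]
        suffix = ""
        # Arrays wrap their element type: mark them with [] and look inside.
        while isinstance(ftype, dict) and ftype["type"] == "array":
            suffix += "[]"
            ftype = ftype["elementType"]
        if isinstance(ftype, dict) and ftype["type"] == "struct":
            out.extend(list_fields(ftype, path + suffix + "."))
        else:
            out.append((path + suffix, ftype))
    return out


fields = list_fields(json.loads(relation_schema_excerpt))
for path, ftype in fields:
    print(path, ftype)
```

Passing `json.loads(projectSchema)` or any other schema string defined above to the same helper should list its fields in the same way.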
In [9]:
# Set the input path: the directory that holds the extracted dataset folders. Here it is 'data/', filled by the download cell; on a cluster, untar each folder and move it to a path such as '/data/openaire_dump_subset/'.

inputPath = 'data/'
 
# load entities and relationships
publication = spark.read.schema(StructType.fromJson(json.loads(publicationSchema))).json(inputPath + 'publication')
dataset = spark.read.schema(StructType.fromJson(json.loads(datasetSchema))).json(inputPath + 'dataset')
software = spark.read.schema(StructType.fromJson(json.loads(softwareSchema))).json(inputPath + 'software')
other = spark.read.schema(StructType.fromJson(json.loads(otherSchema))).json(inputPath + 'otherresearchproduct')
# alternative (disabled): drop the type-specific columns before the union
#results = publication.drop('container').unionByName(dataset.drop('size', 'version', 'geolocation'), allowMissingColumns=True).unionByName(software.drop('documentationUrl', 'codeRepositoryUrl', 'programmingLanguage'), allowMissingColumns=True).unionByName(other.drop('contactperson', 'contactgroup', 'tool'), allowMissingColumns=True)
results = publication.unionByName(dataset, allowMissingColumns=True).unionByName(software, allowMissingColumns=True).unionByName(other, allowMissingColumns=True)
datasource = spark.read.schema(StructType.fromJson(json.loads(datasourceSchema))).json(inputPath + 'datasource')
organization = spark.read.schema(StructType.fromJson(json.loads(organizationSchema))).json(inputPath + 'organization')
project = spark.read.schema(StructType.fromJson(json.loads(projectSchema))).json(inputPath + 'project')
community = spark.read.schema(StructType.fromJson(json.loads(communitySchema))).json(inputPath + 'communities_infrastructures')
relation = spark.read.schema(StructType.fromJson(json.loads(relationSchema))).json(inputPath + 'relation')

publication.createOrReplaceTempView("publications")
dataset.createOrReplaceTempView("datasets")
software.createOrReplaceTempView("software")
other.createOrReplaceTempView("others")
results.createOrReplaceTempView("results")
datasource.createOrReplaceTempView("datasources")
organization.createOrReplaceTempView("organizations")
project.createOrReplaceTempView("projects")
community.createOrReplaceTempView("communities")
relation.createOrReplaceTempView("relations")

# count the records in each table and print the totals
print("number of publications %s"%publication.count())
print("number of datasets %s"%dataset.count())
print("number of software %s"%software.count())
print("number of other research products %s"%other.count())
print("number of results %s"%results.count())
print("number of datasources %s"%datasource.count())
print("number of organizations %s"%organization.count())
print("number of communities %s"%community.count())
print("number of projects %s"%project.count())
print("number of relationships %s"%relation.count())
number of publications 2685793
number of datasets 128092
number of software 26992
number of other research products 22779
number of results 2863656
number of datasources 47356
number of organizations 7411
number of communities 17
number of projects 15780
number of relationships 14004807
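Because `results` is the union of the four product tables, its count should equal the sum of their counts. A quick check with the figures printed above, in plain Python with no Spark session needed:

```python
# Counts copied from the output above.
counts = {
    "publications": 2685793,
    "datasets": 128092,
    "software": 26992,
    "other research products": 22779,
}
results_count = 2863656

total = sum(counts.values())
print(total, total == results_count)
```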
In [13]:
# the generic result (link to documentation: https://graph.openaire.eu/docs/data-model/entities/result)
pretty_print(json.loads(publication.where("id='50|78975075580c::2ff84f3173897001283274434e8f3eaa'").toJSON().first()), expanded=True)
Out[13]:
<IPython.core.display.JSON object>
In [12]:
# the data source (link to documentation: https://graph.openaire.eu/docs/data-model/entities/data-source)
pretty_print(json.loads(datasource.where("id='10|fairsharing_::c3a690be93aa602ee2dc0ccab5b7b67e'").toJSON().first()), expanded=True)
Out[12]:
<IPython.core.display.JSON object>
In [16]:
# the organization (link to documentation: https://graph.openaire.eu/docs/data-model/entities/organization)
pretty_print(json.loads(organization.where("id='20|openorgs____::5836463160e0e5d1cd12997f7d2f0257'").toJSON().first()), expanded=True)
Out[16]:
<IPython.core.display.JSON object>
In [17]:
# the project (link to documentation: https://graph.openaire.eu/docs/data-model/entities/project)
pretty_print(json.loads(project.toJSON().first()), expanded=True)
Out[17]:
<IPython.core.display.JSON object>
In [18]:
# the community (link to documentation: https://graph.openaire.eu/docs/data-model/entities/community)
pretty_print(json.loads(community.where("acronym='mes'").toJSON().first()), expanded=True)
Out[18]:
<IPython.core.display.JSON object>
In [19]:
# the relation (link to documentation: https://graph.openaire.eu/docs/data-model/relationships)
pretty_print(json.loads(relation.toJSON().first()), expanded=True)
Out[19]:
<IPython.core.display.JSON object>
In [22]:
# number of relationships per relation name, most frequent first
query = """SELECT reltype.name, 
       COUNT(*) AS count 
FROM relations 
GROUP BY reltype.name 
ORDER BY count DESC"""
spark.sql(query).limit(20).toPandas()
Out[22]:
name count
0 isProvidedBy 3534319
1 provides 3534312
2 hosts 2696438
3 isHostedBy 2696436
4 IsRelatedTo 399737
5 isAuthorInstitutionOf 231642
6 hasAuthorInstitution 231642
7 IsCitedBy 174058
8 Cites 174058
9 HasVersion 44402
10 IsVersionOf 44402
11 isProducedBy 38672
12 produces 38671
13 IsPartOf 34520
14 HasPart 34520
15 hasParticipant 31035
16 isParticipant 31035
17 IsIdenticalTo 12974
18 HasAmongTopNSimilarDocuments 5903
19 IsAmongTopNSimilarDocuments 5903
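Most relation names in the table come in inverse pairs with identical counts, but a few pairs differ slightly. The sketch below compares the counts copied from the table. The pairing in `inverse_pairs` is an assumption inferred from the relation names, not something the notebook states.

```python
# Counts copied from the Out[22] table above.
counts = {
    "isProvidedBy": 3534319, "provides": 3534312,
    "hosts": 2696438, "isHostedBy": 2696436,
    "isAuthorInstitutionOf": 231642, "hasAuthorInstitution": 231642,
    "IsCitedBy": 174058, "Cites": 174058,
    "HasVersion": 44402, "IsVersionOf": 44402,
    "isProducedBy": 38672, "produces": 38671,
    "IsPartOf": 34520, "HasPart": 34520,
    "hasParticipant": 31035, "isParticipant": 31035,
}

# Assumed inverse pairs (inferred from the names).
inverse_pairs = [
    ("isProvidedBy", "provides"),
    ("hosts", "isHostedBy"),
    ("isAuthorInstitutionOf", "hasAuthorInstitution"),
    ("IsCitedBy", "Cites"),
    ("HasVersion", "IsVersionOf"),
    ("isProducedBy", "produces"),
    ("IsPartOf", "HasPart"),
    ("hasParticipant", "isParticipant"),
]

# Keep only the pairs whose two directions have different counts.
mismatches = {
    (a, b): counts[a] - counts[b]
    for a, b in inverse_pairs
    if counts[a] != counts[b]
}
for pair, diff in mismatches.items():
    print(pair, diff)
```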
In [23]:
# most frequent subject terms across publications (one row per term occurrence)
query="""WITH terms AS (
    SELECT explode(subjects.subject.value) AS `term` FROM publications
)
SELECT term AS `subject term`, 
       COUNT(*) AS count 
FROM terms 
GROUP BY term 
ORDER BY count DESC"""

spark.sql(query).limit(20).toPandas()

    
Out[23]:
subject term count
0 General Medicine 242423
1 Electrical and Electronic Engineering 66295
2 General Materials Science 62012
3 General Chemistry 56444
4 Biochemistry 52956
5 Computer Science Applications 52099
6 Mechanical Engineering 46967
7 Condensed Matter Physics 46413
8 Surgery 42772
9 General Environmental Science 41371
10 Public Health, Environmental and Occupational ... 40836
11 FOS: Computer and information sciences 40609
12 Oncology 40491
13 Molecular Biology 39883
14 General Engineering 39537
15 FOS: Physical sciences 39021
16 Social and Behavioral Sciences 38058
17 Renewable Energy, Sustainability and the Envir... 36529
18 Education 36364
19 Materials Chemistry 35187
In [25]:
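# pairs of non-keyword subjects that co-occur on the same publication; l.subject < r.subject keeps each unordered pair once
query="""
WITH subjects AS (
    WITH tmp AS (SELECT id, EXPLODE(subjects.subject) AS subject FROM publications) 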
    SELECT id, subject.value AS `subject` FROM tmp WHERE subject.scheme != 'keyword'
)
SELECT l.subject AS left, 
       r.subject AS right, 
       COUNT(*) AS count
FROM subjects AS l JOIN subjects AS r ON l.id = r.id AND l.subject < r.subject
GROUP BY left, right
ORDER BY count DESC"""
spark.sql(query).limit(20).toPandas()
Out[25]:
left right count
0 business business.industry 12625
1 business.industry medicine.medical_specialty 5327
2 business medicine.medical_specialty 5323
3 business.industry medicine 5190
4 business medicine 5187
5 medicine medicine.medical_specialty 4396
6 business.industry medicine.disease 3997
7 business medicine.disease 3994
8 medicine medicine.disease 3754
9 Computer science business 3275
10 Computer science business.industry 3239
11 media_common media_common.quotation_subject 3234
12 medicine.disease medicine.medical_specialty 3153
13 Medicine business 2630
14 Medicine business.industry 2630
15 Artificial intelligence business.industry 1758
16 Artificial intelligence business 1754
17 Internal medicine medicine.medical_specialty 1715
18 Internal medicine business 1670
19 Internal medicine business.industry 1670
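The self-join with `l.subject < r.subject` counts every unordered pair of subjects once per publication and drops pairs of a subject with itself. A minimal plain-Python sketch of the same logic follows. The dictionary `subjects_by_id` is hypothetical toy data, not taken from the Graph.

```python
from collections import Counter

# Hypothetical toy data: publication id -> subject terms.
subjects_by_id = {
    "pub1": ["medicine", "business", "business.industry"],
    "pub2": ["business", "business.industry"],
    "pub3": ["medicine"],
}

pair_counts = Counter()
for terms in subjects_by_id.values():
    # Equivalent of: subjects AS l JOIN subjects AS r ON l.id = r.id AND l.subject < r.subject
    for left in terms:
        for right in terms:
            if left < right:
                pair_counts[(left, right)] += 1

for pair, count in pair_counts.most_common():
    print(pair, count)
```

String comparison is lexicographic, so a term such as `business` always ends up on the left of `business.industry`, which is the pattern visible in the table above.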
In [24]:
# journal or conference (container) metadata attached to publications, first 20 rows
query="""SELECT container.* 
FROM publications 
WHERE container IS NOT NULL"""
spark.sql(query).limit(20).toPandas()
Out[24]:
conferencedate conferenceplace edition ep iss issnLinking issnOnline issnPrinted name sp vol
0 None None 0012-835X 0012-835X East African Medical Journal
1 None None None None None None None 0032-5910 Powder Technology 117586 406
2 None None None 0 None None 1110-8460 None المجلة العلمیة لعلوم وفنون الریاضة 0 0
3 None None None 1319 None None None 0883-5403 The Journal of Arthroplasty 1314 37
4 None None None 837 None None 1435-8115 1431-9276 Microscopy and Microanalysis 836 28
5 None None None 42133 None None 1944-8252 1944-8244 ACS Applied Materials & Interfaces 42123 14
6 None None None None None None None 0272-8842 Ceramics International None None
7 None None None 1023 None None None 0020-0255 Information Sciences 994 612
8 None None None None None None 2632-959X None Nano Express None None
9 None None None None None None 1863-4613 1865-1704 International Review of Economics None None
10 None None None None None None None 2651-4141 Ankara Hacı Bayram Veli Üniversitesi Hukuk Fak... None None
11 None None None None None None 2107-0180 0378-7966 European Journal of Drug Metabolism and Pharma... None None
12 None None None None None None 1742-6596 1742-6588 Journal of Physics: Conference Series 012008 2304
13 None None None 9454 None None 2574-0962 2574-0962 ACS Applied Energy Materials 9447 5
14 None None None None None None None None 2022 International Conference on Intelligent C... None None
15 None None None 1321 None None 2093-6311 1598-2351 International Journal of Steel Structures 1306 22
16 None None None None None None 1475-4762 0004-0894 Area None None
17 None None None None None None 2326-831X 2326-8298 Annual Review of Statistics and Its Application None 10
18 None None None None None None None None Spintronics XV None None
19 None None None None None None 2072-6694 None Cancers 3291 14
In [26]:
# venues (container name) ranked by number of publications
query="""WITH journals AS (
    SELECT container.* FROM publications WHERE container IS NOT NULL
)
SELECT name, 
       count(*) AS count 
FROM journals 
GROUP BY name 
ORDER BY count DESC"""
spark.sql(query).limit(20).toPandas()
Out[26]:
name count
0 Scientific Reports 8152
1 SSRN Electronic Journal 8061
2 Blood 6527
3 PLOS ONE 6206
4 Cureus 5636
5 International Journal of Molecular Sciences 4793
6 International Journal of Environmental Researc... 4466
7 Academy of Management Proceedings 4391
8 Sustainability 4334
9 ECS Meeting Abstracts 4235
10 Research, Society and Development 4042
11 Frontiers in Immunology 3750
12 Frontiers in Psychology 3667
13 Science of The Total Environment 3630
14 International journal of health sciences 3592
15 Frontiers in Oncology 3562
16 European Heart Journal 3358
17 Applied Sciences 3111
18 IOP Conference Series: Earth and Environmental... 3047
19 Journal of Cleaner Production 3030
In [27]:
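# projects ranked by number of produced results; the label joins the funder short name(s), the project code and the first 50 characters of the title
# (SUBSTRING positions are 1-based in Spark SQL; position 0 is accepted and behaves like 1)
query="""SELECT CONCAT_WS(' - ',  IF(SIZE(funding.shortName) > 0, ARRAY_JOIN(funding.shortName, ',', '-'), '-'), COALESCE(code, '-'), SUBSTRING(title, 1, 50)) AS project,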
       COUNT(*) AS count 
FROM projects JOIN relations ON projects.id = relations.source.id AND reltype.name = 'produces'
GROUP BY project 
ORDER BY count DESC"""
spark.sql(query).limit(20).toPandas()
Out[27]:
project count
0 NSERC - unidentified - unidentified 5817
1 CIHR - unidentified - unidentified 2216
2 SSHRC - unidentified - unidentified 1044
3 EC - 822336 - Representation and Preservation ... 921
4 WT - unidentified - unidentified 588
5 EC - 773830 - Promoting One Health in Europe t... 155
6 EC - 786314 - Continuity and Rupture in Centra... 60
7 EC - 633053 - Implementation of activities des... 55
8 EC - 881603 - Graphene Flagship Core Project 3 47
9 EC - 945539 - Human Brain Project Specific Gra... 46
10 EC - 824093 - The strong interaction at the fr... 41
11 EC - 872522 - Expanding our knowledge on Citiz... 40
12 EC - 823717 - Enabling Science and Technology ... 40
13 EC - 900014 - Fracture mechanics testing of ir... 38
14 EC - 823914 - Advanced Research Infrastructure... 37
15 EC - 733032 - European Human Biomonitoring Ini... 32
16 NSF - 1852977 - The Management and Operation o... 31
17 EC - 776613 - European Climate Prediction system 31
18 EC - 776816 - Project Ô: demonstration of plan... 30
19 EC - 812880 - Joint PhD Laboratory for New Mat... 30
In [39]:
# organizations ranked by the number of projects they participate in
query="""SELECT COALESCE(legalshortname, legalname) AS organization, 
       COUNT(*) AS count 
FROM organizations JOIN relations ON organizations.id = relations.source.id AND reltype.name = 'isParticipant'
GROUP BY organization 
ORDER BY count DESC"""
spark.sql(query).limit(20).toPandas()
    
Out[39]:
organization count
0 CNRS 638
1 UH 579
2 CSIC 379
3 FHG 322
4 CNR 317
5 UCL 310
6 ETH Zurich 300
7 MPG 299
8 THE CHANCELLOR, MASTERS AND SCHOLARS OF THE UN... 271
9 CEA 267
10 KUL 255
11 UOXF 249
12 DTU 209
13 Delft University of Technology 207
14 UCPH 203
15 Imperial 203
16 University of Edinburgh 181
17 Aalto University 180
18 AU 177
19 EPFL 172
In [40]:
# organizations ranked by the number of results linked to them through isAuthorInstitutionOf
query="""SELECT COALESCE(legalshortname, legalname) AS organization, 
       COUNT(*) AS count 
FROM organizations JOIN relations ON organizations.id = relations.source.id AND reltype.name = 'isAuthorInstitutionOf' 
GROUP BY organization
ORDER BY count DESC"""
spark.sql(query).limit(20).toPandas()
Out[40]:
organization count
0 UPV 4980
1 UL 4819
2 University of Oxford 3859
3 University of Cambridge 3670
4 UPC 3041
5 ULP 2855
6 AMU 2624
7 KUL 2582
8 UB 2576
9 University of Zagreb 2555
10 AAU 2522
11 University of California System 2497
12 University of Edinburgh 2422
13 Andalas University 2350
14 Amsterdam UMC 2323
15 ETH Zurich 2276
16 UPM 2191
17 INRIA 2096
18 UH 2082
19 VUA 1982
In [41]:
# results per organization, broken down by result type
query="""SELECT COALESCE(legalshortname, legalname) AS organization, 
       COUNT(IF(type = 'publication', 1, NULL)) AS publication,
       COUNT(IF(type = 'dataset', 1, NULL)) AS dataset,
       COUNT(IF(type = 'software', 1, NULL)) AS software,
       COUNT(IF(type = 'other', 1, NULL)) AS other
FROM results JOIN organizations JOIN relations ON organizations.id = relations.source.id AND results.id = relations.target.id AND reltype.name = 'isAuthorInstitutionOf' 
GROUP BY organization 
ORDER BY publication DESC"""
spark.sql(query).limit(20).toPandas()
Out[41]:
organization publication dataset software other
0 UPV 4974 6 0 0
1 UL 4493 0 0 326
2 University of Oxford 3711 104 0 44
3 University of Cambridge 3468 99 4 99
4 UPC 3023 6 0 12
5 ULP 2822 0 0 33
6 AMU 2567 8 1 48
7 University of Zagreb 2509 3 0 43
8 University of California System 2483 0 0 14
9 AAU 2470 1 1 50
10 UB 2432 0 1 143
11 University of Edinburgh 2414 1 1 6
12 Andalas University 2342 0 0 8
13 Amsterdam UMC 2323 0 0 0
14 UPM 2188 0 0 3
15 ETH Zurich 2186 0 0 90
16 INRIA 2068 0 7 21
17 KUL 2060 0 1 521
18 INSERM 1954 0 3 5
19 VUA 1945 0 0 37
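The `COUNT(IF(condition, 1, NULL))` idiom builds one column per category, because `COUNT` skips NULL values. The plain-Python sketch below reproduces the idea on hypothetical toy rows. The organization names `OrgA` and `OrgB` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy rows: (organization, result type).
rows = [
    ("OrgA", "publication"),
    ("OrgA", "dataset"),
    ("OrgA", "publication"),
    ("OrgB", "software"),
    ("OrgB", "other"),
]
types = ["publication", "dataset", "software", "other"]

# One counter per type for each organization, all starting at zero.
table = defaultdict(lambda: dict.fromkeys(types, 0))
for org, rtype in rows:
    if rtype in table[org]:
        # Same effect as COUNT(IF(type = '<rtype>', 1, NULL)) for this group.
        table[org][rtype] += 1

for org, by_type in sorted(table.items()):
    print(org, by_type)
```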
In [42]:
# results per organization, broken down by best access right
query="""SELECT COALESCE(legalshortname, legalname) AS organization, 
       COUNT(IF(bestaccessright.label = 'OPEN', 1, NULL)) AS open,
       COUNT(IF(bestaccessright.label = 'EMBARGO', 1, NULL)) AS embargo,
       COUNT(IF(bestaccessright.label = 'CLOSED', 1, NULL)) AS closed
FROM organizations JOIN relations JOIN results ON organizations.id = relations.source.id  AND results.id = relations.target.id AND reltype.name = 'isAuthorInstitutionOf'
GROUP BY organization
ORDER BY open DESC"""
spark.sql(query).limit(20).toPandas()
Out[42]:
organization open embargo closed
0 UPV 4770 11 199
1 UL 4603 67 144
2 University of Oxford 2999 837 21
3 UPC 2607 158 3
4 KUL 2518 15 28
5 UB 2481 19 58
6 University of California System 2450 1 32
7 University of Edinburgh 2394 1 10
8 Andalas University 2350 0 0
9 ETH Zurich 2251 12 10
10 University of Zagreb 2191 141 13
11 UH 2060 0 17
12 UPM 1997 35 11
13 ULP 1977 679 1
14 University of Cambridge 1944 12 181
15 University of Copenhagen 1896 0 5
16 Amsterdam UMC 1870 15 3
17 CSIC 1585 2 77
18 VUA 1503 10 55
19 UWO 1427 0 25
In [43]:
# results per country of the author institution, broken down by best access right
query="""SELECT organizations.country.code AS country, 
       COUNT(IF(bestaccessright.label = 'OPEN', 1, NULL)) AS open,
       COUNT(IF(bestaccessright.label = 'EMBARGO', 1, NULL)) AS embargo,
       COUNT(IF(bestaccessright.label = 'CLOSED', 1, NULL)) AS closed
FROM organizations JOIN relations JOIN results ON organizations.id = relations.source.id  AND results.id = relations.target.id AND reltype.name = 'isAuthorInstitutionOf'
WHERE organizations.country IS NOT NULL
GROUP BY organizations.country.code
ORDER BY open DESC"""
spark.sql(query).limit(20).toPandas()
Out[43]:
country open embargo closed
0 ES 23724 309 618
1 GB 21034 1044 994
2 DE 15356 368 2772
3 US 11900 36 5577
4 FR 9348 176 3779
5 CH 6908 136 536
6 PT 6221 814 57
7 HR 5944 157 35
8 BE 5636 245 412
9 FI 5421 35 43
10 IT 4989 119 1658
11 NL 4963 29 240
12 DK 4651 56 491
13 SI 4642 67 398
14 CO 4124 59 155
15 ID 4060 0 9
16 SE 3690 1 93
17 CA 3458 66 778
18 NO 3338 2 56
19 TR 2759 139 1397
In [44]:
# country pairs participating in the same project; <= also keeps same-country pairs
query="""WITH countryProject AS (
    SELECT country.code AS country, 
           target.id AS id 
    FROM organizations JOIN relations ON reltype.name = 'isParticipant' AND source.id = organizations.id
    WHERE country IS NOT NULL
)
SELECT l.country AS left, 
       r.country AS right,
       COUNT(*) AS count 
FROM countryProject AS l JOIN countryProject AS r ON l.id = r.id AND l.country <= r.country
GROUP BY left, right 
ORDER BY count DESC"""
spark.sql(query).limit(20).toPandas()
    
Out[44]:
left right count
0 DE DE 12806
1 GB GB 9955
2 DE GB 6269
3 IT IT 5240
4 ES ES 4906
5 FR FR 4830
6 DE IT 4683
7 DE FR 4573
8 DE ES 4472
9 NL NL 3613
10 DE NL 3427
11 GB IT 3332
12 FR GB 3328
13 ES GB 3195
14 GB NL 2860
15 CH DE 2676
16 ES IT 2665
17 FR IT 2456
18 ES FR 2365
19 US US 2040
In [45]:
# pairs of different countries participating in the same project (strict <)
query="""WITH countryProject AS (
    SELECT country.code AS country, 
           target.id AS id 
    FROM organizations JOIN relations ON  reltype.name = 'isParticipant' AND source.id = organizations.id
    WHERE country IS NOT NULL
)
SELECT l.country AS left, 
       r.country AS right, 
       COUNT(*) AS count 
FROM countryProject AS l JOIN countryProject AS r ON l.id = r.id AND l.country < r.country
GROUP BY left, right 
ORDER BY count DESC"""
spark.sql(query).limit(20).toPandas()
 
Out[45]:
left right count
0 DE GB 6269
1 DE IT 4683
2 DE FR 4573
3 DE ES 4472
4 DE NL 3427
5 GB IT 3332
6 FR GB 3328
7 ES GB 3195
8 GB NL 2860
9 CH DE 2676
10 ES IT 2665
11 FR IT 2456
12 ES FR 2365
13 CH GB 1955
14 DE SE 1804
15 BE DE 1759
16 FR NL 1726
17 IT NL 1708
18 ES NL 1596
19 GB SE 1491
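The two country queries differ only in the join condition: `<=` keeps same-country pairs, `<` drops them. With `<=`, a row also joins with itself and two distinct rows of the same country join in both orders, so counts such as DE/DE are not a number of distinct organization pairs. The plain-Python sketch below replays both conditions on hypothetical toy rows. The helper `pair_counts` and the data are illustrative assumptions.

```python
from collections import Counter

# Hypothetical toy rows of countryProject: (country, project id).
rows = [("DE", "p1"), ("DE", "p1"), ("GB", "p1"), ("DE", "p2")]


def pair_counts(rows, include_same_country):
    """Self-join rows on project id, keeping l.country <= r.country or l.country < r.country."""
    counts = Counter()
    for l_country, l_id in rows:
        for r_country, r_id in rows:
            if l_id != r_id:
                continue
            if include_same_country:
                keep = l_country <= r_country
            else:
                keep = l_country < r_country
            if keep:
                counts[(l_country, r_country)] += 1
    return counts


with_same = pair_counts(rows, include_same_country=True)
without_same = pair_counts(rows, include_same_country=False)
print(dict(with_same))
print(dict(without_same))
```

In this toy example project `p1` has two German participants, yet DE/DE receives four rows from `p1` alone (each row with itself plus both orderings), while the cross-country count DE/GB is the same under both conditions, as in the two tables above.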
In [ ]:
# pairs of organizations participating in the same project
query="""WITH orgProject AS (
    SELECT COALESCE(legalshortname, legalname) AS organization, 
           target.id AS id 
    FROM organizations JOIN relations ON  reltype.name = 'isParticipant' AND source.id = organizations.id
)
SELECT l.organization AS left,
       r.organization AS right,
       COUNT(*) AS count
FROM orgProject AS l JOIN orgProject AS r ON l.id = r.id AND l.organization < r.organization
GROUP BY left, right 
ORDER BY count DESC"""
spark.sql(query).limit(20).toPandas()
In [ ]:
# pairs of organizations linked to the same result through isAuthorInstitutionOf
query="""WITH orgResult AS (
    SELECT COALESCE(legalshortname, legalname) AS organization, 
           target.id AS id 
    FROM organizations JOIN relations ON reltype.name = 'isAuthorInstitutionOf' AND source.id = organizations.id
)
SELECT l.organization AS left, 
       r.organization AS right,
       COUNT(*) AS count 
FROM orgResult AS l JOIN orgResult AS r ON l.id = r.id AND l.organization < r.organization
GROUP BY left, right 
ORDER BY count DESC"""
spark.sql(query).limit(20).toPandas()
In [37]:
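# results per best access right and publication year (SUBSTRING positions are 1-based in Spark SQL)
query="""SELECT bestaccessright.label AS accessright,
       SUBSTRING(publicationdate, 1, 4) AS year,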
       COUNT(*) AS count
FROM results
WHERE bestaccessright IS NOT NULL AND publicationdate IS NOT NULL
GROUP BY accessright, year
ORDER BY count DESC"""
spark.sql(query).limit(20).toPandas()
Out[37]:
accessright year count
0 OPEN 2022 1391279
1 CLOSED 2022 672566
2 EMBARGO 2022 14258
3 RESTRICTED 2022 12312
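The counts above can be turned into shares to see how the access rights are distributed among the results that have both an access right and a publication date. A small worked example in plain Python, using only the figures from the table:

```python
# Counts copied from the Out[37] table above (all rows are for 2022).
by_access_right = {
    "OPEN": 1391279,
    "CLOSED": 672566,
    "EMBARGO": 14258,
    "RESTRICTED": 12312,
}

total = sum(by_access_right.values())
shares = {label: count / total for label, count in by_access_right.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```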
In [ ]:
# number of publication-to-dataset links of type IsSupplementedBy
query="""SELECT COUNT(*) AS count
FROM relations JOIN publications JOIN datasets ON reltype.name = 'IsSupplementedBy' AND publications.id = relations.source.id AND datasets.id = relations.target.id"""
spark.sql(query).limit(20).toPandas()