Microsoft Academic Graph

Data acquisition

The Microsoft Academic Graph dataset is generated from the last released version of the graph, dated 06-12-2021.

Changes from the previous version

  • New workflow: MAG is no longer created within the DOIBoost process. Now, a new workflow normalizes the various MAG tables into a single table, from which the action set is generated.
  • MAG discontinued: MAG has been discontinued, so normalization runs only once, after the data has been imported from a complete dump of MAG.

Process

The Microsoft Academic Graph (MAG) is a heterogeneous graph that contains scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study. The MAG schema is designed to capture the rich and complex relationships between these entities.

The main node types in the MAG schema are:

  • Paper: Publications represent works of scientific research, such as articles, books, and book chapters.
  • PaperAbstractsInvertedIndex: stores the paper abstracts in inverted-index form, from which the abstract text is mapped.
  • Authors: Authors represent the people who wrote the publications.
  • Institutions: Institutions represent the organizations with which the authors are affiliated.
  • Journals: Journals represent the periodical series in which the publications are published.
  • Conferences: Conferences represent the academic meetings in which the publications are presented.
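To illustrate how an abstract can be recovered from PaperAbstractsInvertedIndex, here is a minimal sketch. It assumes the abstract is stored as a JSON string with an `IndexLength` field and an `InvertedIndex` object mapping each word to its positions in the text; that layout and the function name `decode_inverted_abstract` are assumptions made for illustration, not taken from the workflow itself.

```python
import json


def decode_inverted_abstract(raw: str) -> str:
    """Rebuild plain abstract text from an inverted-index JSON string.

    Assumed layout: {"IndexLength": n, "InvertedIndex": {word: [positions]}}.
    """
    payload = json.loads(raw)
    length = payload["IndexLength"]
    words = [""] * length
    for token, positions in payload["InvertedIndex"].items():
        for pos in positions:
            # Ignore positions that fall outside the declared length.
            if 0 <= pos < length:
                words[pos] = token
    # Skip positions that were never filled.
    return " ".join(w for w in words if w)


# Hypothetical sample record, for illustration only.
sample = json.dumps({
    "IndexLength": 5,
    "InvertedIndex": {
        "graphs": [0, 3],
        "link": [1],
        "other": [2],
        "together": [4],
    },
})
abstract_text = decode_inverted_abstract(sample)
```

Each word is written back to every position listed for it, so repeated words are restored at all of their occurrences.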

The main edge types in the MAG schema are:

  • Citation relationships: Citation relationships connect citing publications to cited publications.
  • Affiliation relationships: Affiliation relationships connect authors to the institutions with which they are affiliated.

Preprocess

In the first phase, a normalized table is defined containing all papers and associated relationships.
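The normalization step can be pictured as a join of the paper table with its author and affiliation links. The sketch below uses plain Python lists of dictionaries rather than the actual workflow code; the function name `normalize`, the link table name `paper_author_affiliations`, and the sample records are hypothetical, and the column names are borrowed from the mapping table in this document.

```python
from collections import defaultdict


def normalize(papers, paper_author_affiliations, authors, affiliations):
    """Join papers with their authors and affiliations into one row per paper."""
    authors_by_id = {a["AuthorId"]: a for a in authors}
    affiliations_by_id = {a["AffiliationId"]: a for a in affiliations}

    # Group the author/affiliation links by paper.
    links = defaultdict(list)
    for link in paper_author_affiliations:
        author = authors_by_id.get(link["AuthorId"], {})
        affiliation = affiliations_by_id.get(link.get("AffiliationId"), {})
        links[link["PaperId"]].append({
            "AuthorId": link["AuthorId"],
            "AuthorName": author.get("AuthorName"),
            "AuthorSequenceNumber": link["AuthorSequenceNumber"],
            "AffiliationId": link.get("AffiliationId"),
            "AffiliationName": affiliation.get("AffiliationName"),
            "GridId": affiliation.get("GridId"),
        })

    rows = []
    for paper in papers:
        row = dict(paper)
        # Keep authors in their declared order.
        row["authors"] = sorted(
            links.get(paper["PaperId"], []),
            key=lambda a: a["AuthorSequenceNumber"],
        )
        rows.append(row)
    return rows


# Hypothetical sample data, for illustration only.
papers = [{"PaperId": 1, "OriginalTitle": "A sample title"}]
paper_author_affiliations = [
    {"PaperId": 1, "AuthorId": 11, "AffiliationId": 21, "AuthorSequenceNumber": 2},
    {"PaperId": 1, "AuthorId": 10, "AffiliationId": None, "AuthorSequenceNumber": 1},
]
authors = [
    {"AuthorId": 10, "AuthorName": "First Example"},
    {"AuthorId": 11, "AuthorName": "Second Example"},
]
affiliations = [
    {"AffiliationId": 21, "AffiliationName": "Example University", "GridId": "grid.0"},
]
normalized = normalize(papers, paper_author_affiliations, authors, affiliations)
```

Keeping one row per paper with nested authors means the later mapping step can read everything it needs from a single record.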

Mapping MAG properties into the OpenAIRE Graph

Properties in OpenAIRE research products are set based on the logic described in the following table:

| OpenAIRE Research Product field path | MAG path(s) | Notes |
|---|---|---|
| id | PaperId | id in the form mag_________::md5(PaperId) |
| instance.alternateIdentifier[@type = DOI] | Doi | DOI intersected with Crossref. Only MAG papers with a DOI present in Crossref are filtered |
| instance.instancetype | DocType | Using the dnet:result_typologies vocabulary, we look up the DocType synonym to generate one of the following main entities: publication, dataset, software, otherresearchproduct |
| maintitle | OriginalTitle | |
| publicationdate | Year | publication date if Date is not available |
| publicationdate | Date | |
| publicationdate | OnlineDate | Date the article was put online |
| publisher | Publisher | |
| journal.name | ConferenceName | |
| journal.issnPrinted | JournalISSN | |
| journal.edition | JournalPublisher | |
| journal.ConferencePlace | ConferenceLocation | |
| journal.conferencedate | ConferenceStartDate, ConferenceEndDate | conference date as an append of conferencestartdate-conferenceenddate |
| journal.vol | Volume | |
| journal.iss | Issue | |
| journal.sp | FirstPage | |
| journal.ep | LastPage | |
| abstract | Paper abstract | |
| Author Mapping | | |
| author.fullname | AuthorName | |
| organization.legalname | AffiliationName | |
| organization.id | AffiliationId | id in the form mag_________::md5(AffiliationId) |
| organization.id | AffiliationId | for each affiliation we generate an affiliation relation between paper and organization |
| author.pid[@type = mag] | AuthorId | |
| author.rank | AuthorSequenceNumber | |
| organization.pid | GridId | |
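A few of the rules in the table above can be sketched in code. This is a minimal illustration, not the workflow's implementation: the helper names are hypothetical, the DOI rule is read as "keep only papers whose DOI is present in Crossref", DOIs are compared case-insensitively, and a bare Year is returned as a string when Date is missing. All of these are assumptions.

```python
import hashlib

MAG_PREFIX = "mag_________"  # prefix as written in the mapping table


def openaire_id(mag_id) -> str:
    """Build an identifier of the form mag_________::md5(<MAG id>)."""
    digest = hashlib.md5(str(mag_id).encode("utf-8")).hexdigest()
    return f"{MAG_PREFIX}::{digest}"


def publication_date(date, year):
    """Prefer Date; fall back to Year when Date is not available."""
    if date:
        return date
    if year:
        return str(year)  # assumption: the bare year is used as-is
    return None


def conference_date(start, end):
    """Append ConferenceStartDate and ConferenceEndDate with a hyphen."""
    parts = [p for p in (start, end) if p]
    return "-".join(parts) if parts else None


def keep_with_crossref_doi(papers, crossref_dois):
    """Keep only papers whose Doi also appears in the Crossref DOI set."""
    known = {doi.lower() for doi in crossref_dois}
    return [p for p in papers if p.get("Doi") and p["Doi"].lower() in known]


# Hypothetical sample records, for illustration only.
papers = [
    {"PaperId": 1, "Doi": "10.1234/ABC", "Date": None, "Year": 2019},
    {"PaperId": 2, "Doi": None, "Date": "2020-05-01", "Year": 2020},
]
kept = keep_with_crossref_doi(papers, {"10.1234/abc"})
paper_ids = [openaire_id(p["PaperId"]) for p in kept]
```

The same `openaire_id` helper would apply to AffiliationId when building organization identifiers, since both rows in the table use the same form.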