documented mapping for MAG and crossref

Sandro La Bruzzo 2024-07-18 11:12:24 +02:00
parent b00c45af1c
commit 584abf5a42
2 changed files with 63 additions and 8 deletions

@@ -55,7 +55,7 @@ Properties in OpenAIRE research products are set based on the logic described in
 | `id` | `doi` | id in the form `doi_________::md5(doi)` |
 | `dateofcollection` | `indexed.datetime` | |
 | `lastupdatetimestamp` | `indexed.timestamp` | |
-| `type` | `type` | `dataset` if the Crossref type is dataset, `publication` otherwise (based on the filtering logics described above) |
+| `type` | `type` | Using the **_dnet:result_typologies_** vocabulary, we look up the `instance.type` synonym to generate one of the following main entities: <ul><li>`publication`</li> <li>`dataset`</li></ul> |
 | `originalId` | `doi, clinical-trial-number, alternative-id` | |
 | `pid` | | The scheme tells the type of PID, the value contains the actual value |
 | `pid.scheme` | | Default value: doi |
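
The `id` row in this hunk builds the identifier from a fixed prefix and the MD5 of the DOI. A minimal Python sketch of that construction follows. The helper name `make_openaire_id` is hypothetical, and the sketch assumes the DOI is hashed as a UTF-8 string with no prior normalization, which the table does not specify.

```python
import hashlib


def make_openaire_id(prefix: str, local_id: str) -> str:
    """Build an identifier of the form `<prefix>::md5(<local_id>)`.

    Assumption: the local identifier is hashed as-is (UTF-8 bytes);
    any normalization applied by the real workflow is not modeled here.
    """
    digest = hashlib.md5(local_id.encode("utf-8")).hexdigest()
    return f"{prefix}::{digest}"


# Prefix copied from the mapping table; the DOI is an illustrative value.
crossref_id = make_openaire_id("doi_________", "10.1000/example")
print(crossref_id)
```

The same pattern would apply to the `mag_________` prefix that the MAG mapping uses for `PaperId` and `AffiliationId`.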

@@ -1,14 +1,69 @@
 # Microsoft Academic Graph
 ## Data acquisition
+The Microsoft Academic Graph dataset is generated from the latest released version of the graph, dated 06-12-2021.
+### Changes from the previous version
+* New workflow: MAG is no longer created within the DOIBoost process. Instead, a new workflow normalizes the various MAG tables into a single table, from which the action set is generated.
+* MAG discontinued: MAG has been discontinued. Normalization therefore happens only when data is imported from a complete MAG dump.
 ## Process
-When mapping MAG records to the OpenAIRE Graph, we consider the following MAG tables:
-* `PaperAbstractsInvertedIndex`: used to map the paper abstracts
-* `Authors`: for the authors. The MAG data is pre-processed by grouping authors by PaperId and then used to map the MAG author identifier.
-* `Affiliations` and `PaperAuthorAffiliations`: used to generate links between publications and organisations (paper affiliations)
-* `Journals` and `ConferenceInstances`: joined with `Papers_distinct` is used to map the information about the venues where the paper was published
-* TO BE REMOVED `PaperUrls`: to create one instance for the OpenAIRE publication
-* TO BE REMOVED `FieldsOfStudy`: to add subjects
+The Microsoft Academic Graph (MAG) is a heterogeneous graph that contains scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study. The MAG schema is designed to capture the relationships between these entities.
+The main node types in the MAG schema are:
+* `Paper`: Publications represent works of scientific research, such as articles, books, and book chapters.
+* `PaperAbstractsInvertedIndex`: used to map the paper abstracts
+* `Authors`: Authors represent the people who wrote the publications.
+* `Institutions`: Institutions represent the organizations with which the authors are affiliated.
+* `Journals`: Journals represent the periodical series in which the publications are published.
+* `Conferences`: Conferences represent the academic meetings in which the publications are presented.
+The main edge types in the MAG schema are:
+* `Citation relationships`: Citation relationships connect citing publications to cited publications.
+* `Affiliation relationships`: Affiliation relationships connect authors to the institutions with which they are affiliated.
+### Preprocess
+In the first phase, a normalized table is built that contains all papers and their associated relationships.
+### Mapping MAG properties into the OpenAIRE Graph
+Properties in OpenAIRE research products are set based on the logic described in the following table:
+| OpenAIRE Research Product field path | MAG path(s) | Notes |
+|---------------------------------------|------------------|-------------|
+| `id` | `PaperId` | id in the form `mag_________::md5(PaperId)` |
+| `instance.alternateIdentifier[@type = DOI]` | `Doi` | DOI intersected with Crossref: only MAG papers whose DOI is present in Crossref are kept |
+| `instance.instancetype` | `DocType` | Using the **_dnet:result_typologies_** vocabulary, we look up the `DocType` synonym to generate one of the following main entities: <ul><li>`publication`</li> <li>`dataset`</li><li>`software`</li><li>`otherresearchproduct`</li></ul> |
+| `maintitle` | `OriginalTitle` | |
+| `publicationdate` | `Year` | used as the publication date if `Date` is not available |
+| `publicationdate` | `Date` | |
+| `publicationdate` | `OnlineDate` | date the article was put online |
+| `publisher` | `Publisher` | |
+| `journal.name` | `ConferenceName` | |
+| `journal.issnPrinted` | `JournalISSN` | |
+| `journal.edition` | `JournalPublisher` | |
+| `journal.ConferencePlace` | `ConferenceLocation` | |
+| `journal.conferencedate` | `ConferenceStartDate`, `ConferenceEndDate` | conference date built by appending `ConferenceEndDate` to `ConferenceStartDate` (`ConferenceStartDate-ConferenceEndDate`) |
+| `journal.vol` | `Volume` | |
+| `journal.iss` | `Issue` | |
+| `journal.sp` | `FirstPage` | |
+| `journal.ep` | `LastPage` | |
+| `abstract` | `Paper abstract` | |
+| **Author Mapping** | | |
+| `author.fullname` | `AuthorName` | |
+| `organization.legalname` | `AffiliationName` | |
+| `organization.id` | `AffiliationId` | id in the form `mag_________::md5(AffiliationId)` |
+| `organization.id` | `AffiliationId` | for each affiliation, an affiliation relation is generated between the paper and the organization |
+| `author.pid[@type = mag]` | `AuthorId` | |
+| `author.rank` | `AuthorSequenceNumber` | |
+| `organization.pid` | `GridId` | |
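
Three notes in the MAG mapping table describe small rules: the `Year` fallback for `publicationdate`, the joined conference date, and the DOI intersection with Crossref. The Python sketch below illustrates them under stated assumptions. All function names are hypothetical, records are modeled as plain dictionaries, `Year` is emitted as a bare year string, the conference dates are joined with a hyphen, and DOIs are compared case-insensitively. The precedence of `OnlineDate` is not stated in the table, so it is left out.

```python
from typing import Iterable, Optional


def pick_publication_date(date: Optional[str], year: Optional[int]) -> Optional[str]:
    """Prefer the full MAG `Date`; fall back to `Year` when `Date` is missing."""
    if date:
        return date
    if year is not None:
        # Assumption: the year is emitted as-is, not expanded to a full date.
        return str(year)
    return None


def conference_date(start: Optional[str], end: Optional[str]) -> Optional[str]:
    """Join start and end dates as `start-end` (separator is an assumption)."""
    if start and end:
        return f"{start}-{end}"
    return None


def keep_papers_with_crossref_doi(papers: Iterable[dict], crossref_dois: Iterable[str]) -> list:
    """Keep only papers whose `Doi` also appears in the Crossref DOI set."""
    known = {doi.lower() for doi in crossref_dois}
    return [p for p in papers if p.get("Doi") and p["Doi"].lower() in known]


# Illustrative records; the values are made up for the example.
papers = [
    {"PaperId": 1, "Doi": "10.1000/A"},
    {"PaperId": 2, "Doi": None},
    {"PaperId": 3, "Doi": "10.1000/c"},
]
kept = keep_papers_with_crossref_doi(papers, ["10.1000/a"])
print([p["PaperId"] for p in kept])
```

In the real workflow this filtering would presumably be a join on the normalized table rather than an in-memory set lookup; the set is used here only to keep the example self-contained.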