diff --git a/docs/api.md b/docs/api.md deleted file mode 100644 index f167e6c..0000000 --- a/docs/api.md +++ /dev/null @@ -1,6 +0,0 @@ ---- -sidebar_position: 5 ---- - -# Public API -TODO \ No newline at end of file diff --git a/docs/api/_category_.json b/docs/api/_category_.json new file mode 100644 index 0000000..36617e4 --- /dev/null +++ b/docs/api/_category_.json @@ -0,0 +1,8 @@ +{ + "label": "Public API", + "position": 4, + "link": { + "type": "doc", + "id": "api" + } +} \ No newline at end of file diff --git a/docs/api/api.md b/docs/api/api.md new file mode 100644 index 0000000..1cf4b7f --- /dev/null +++ b/docs/api/api.md @@ -0,0 +1,6 @@ +--- +sidebar_position: 5 +--- + +# Public API +TODO: https://graph.openaire.eu/develop/overview.html \ No newline at end of file diff --git a/docs/data-model/data-model.md b/docs/data-model/data-model.md index a24b918..7e4f745 100644 --- a/docs/data-model/data-model.md +++ b/docs/data-model/data-model.md @@ -12,8 +12,16 @@ Its main entities are described in brief below: * [Research Products](entities/result) represent the outcomes of research activities. * [Organizations](entities/organization) correspond to companies or research institutions involved in projects, responsible for operating data sources or consisting the affiliations of Product creators. -* Funders (e.g. EC, Wellcome Trust) are agencies responsible for a list of Funding Streams. -* Funding Streams represent investments (funding actions) from Funders (e.g. FP7 or H2020). +* [Funders](entities/funder) (e.g. EC, Wellcome Trust) are agencies responsible for a list of Funding Streams. +* [Funding Streams](entities/funding-stream) represent investments (funding actions) from Funders (e.g. FP7 or H2020). * [Projects](entities/project) are research projects funded by a Funding Stream of a Funder. 
* [Data Sources](entities/data-source) are the resources used to collect metadata for the graph objects +TODO: communities are present in the existing documentation instead of funders and funding streams + +:::note Further reading + +A detailed report on the OpenAIRE Research Graph Data Model can be found on [Zenodo](https://zenodo.org/record/2643199). +::: + + diff --git a/docs/data-model/entities/community.md b/docs/data-model/entities/community.md index a3ea2d1..11c3bcc 100644 --- a/docs/data-model/entities/community.md +++ b/docs/data-model/entities/community.md @@ -1,6 +1,8 @@ --- -sidebar_position: 5 +sidebar_position: 7 --- + # Community (Initiative) +TODO diff --git a/docs/data-model/entities/data-source.md b/docs/data-model/entities/data-source.md index 2db29c2..5bee910 100644 --- a/docs/data-model/entities/data-source.md +++ b/docs/data-model/entities/data-source.md @@ -6,10 +6,12 @@ sidebar_position: 2 OpenAIRE entity instances are created out of data collected from various data sources of different kinds, such as publication repositories, dataset archives, CRIS systems, funder databases, etc. Data sources export information packages (e.g., XML records, HTTP responses, RDF data, JSON) that may contain information on one or more of such entities and possibly relationships between them. For example, a metadata record about a project carries information for the creation of a Project entity and its participants (as Organization entities). It is important, once each piece of information is extracted from such packages and inserted into the OpenAIRE information space as an entity, for such pieces to keep provenance information relative to the originating data source. This is to give visibility to the data source, but also to enable the reconstruction of the very same piece of information if problems arise. 
-Definitions for the re3data specific elements from: https://gfzpublic.gfz-potsdam.de/rest/items/item_758898_6/component/file_775891/content + +Definitions for the re3data specific elements from: https://gfzpublic.gfz-potsdam.de/rest/items/item_758898_6/component/file_775891/content + --- -## Properties +## The `DataSource` object ### id _Type: String • Cardinality: ONE_ diff --git a/docs/data-model/entities/entity-identifiers.md b/docs/data-model/entities/entity-identifiers.md new file mode 100644 index 0000000..58ffa26 --- /dev/null +++ b/docs/data-model/entities/entity-identifiers.md @@ -0,0 +1,8 @@ +--- +sidebar_position: 8 +--- + +# OpenAIRE entity identifier and PID mapping policy + +https://support.openaire.eu/projects/docs/wiki/OpenAIRE_entity_identifier_and_PID_mapping_policy +TODO: include this here? it is referenced by many other pages \ No newline at end of file diff --git a/docs/data-model/entities/funder.md b/docs/data-model/entities/funder.md new file mode 100644 index 0000000..0c0a8ba --- /dev/null +++ b/docs/data-model/entities/funder.md @@ -0,0 +1,8 @@ +--- +sidebar_position: 5 +--- + +# Funder +TODO + + diff --git a/docs/data-model/entities/funding-stream.md b/docs/data-model/entities/funding-stream.md new file mode 100644 index 0000000..fa946d7 --- /dev/null +++ b/docs/data-model/entities/funding-stream.md @@ -0,0 +1,8 @@ +--- +sidebar_position: 6 +--- + +# Funding stream +TODO + + diff --git a/docs/data-model/entities/organization.md b/docs/data-model/entities/organization.md index 6f3504a..1564998 100644 --- a/docs/data-model/entities/organization.md +++ b/docs/data-model/entities/organization.md @@ -3,3 +3,46 @@ sidebar_position: 3 --- # Organization + +Organizations include companies, research centers or institutions involved as project partners or responsible for operating data sources. 
Information about organizations is collected from funder databases like CORDA, registries of data sources like OpenDOAR and re3data, and CRIS systems, where the organizations appear in relation to projects or data sources. + + +--- + +## The `Organization` object + +### id +_Type: String • Cardinality: ONE_ + +Main entity identifier, created according to [OpenAIRE_entity_identifier_and_PID_mapping_policy](https://support.openaire.eu/projects/docs/wiki/OpenAIRE_entity_identifier_and_PID_mapping_policy). + +### legalshortname +_Type: String • Cardinality: ONE_ + +The short form of the legal name of the organization. + +### legalname +_Type: String • Cardinality: ONE_ + +The legal name of the organization. + +### alternativenames +_Type: String • Cardinality: MANY_ + +The alternative names of the organization. + +### websiteurl +_Type: String • Cardinality: ONE_ + +The website URL of the organization. + +### country +_Type: [Country](other#country) • Cardinality: ONE_ + +The country where the organization is located. + +### pid +_Type: [OrganizationPid](other#organizationpid) • Cardinality: MANY_ + +The list of persistent identifiers for the organization. + diff --git a/docs/data-model/entities/other.md b/docs/data-model/entities/other.md index 9a7580b..c75ad4e 100644 --- a/docs/data-model/entities/other.md +++ b/docs/data-model/entities/other.md @@ -4,6 +4,7 @@ sidebar_position: 6 # Other helper objects +Here, we describe other helper objects that are used as part of the main graph entities. ## AccessRight _Type: One of `{ gold, green, hybrid, bronze }` • Cardinality: ONE_ @@ -261,20 +262,6 @@ _Type: String • Cardinality: ONE_ The date of the conference. -## GeoLocation -Represents the geolocation information. - -### point -_Type: String • Cardinality: ONE_ -TODO - -### box -_Type: String • Cardinality: ONE_ -TODO - -### place -_Type: String • Cardinality: ONE_ -TODO ## ControlledField TODO: similar to AlternateIdentifier and ResultPid? 
@@ -319,31 +306,22 @@ _Type: String • Cardinality: ONE_ The country label (i.e. Italy). -## ResultCountry -It is for the country associated to the result. -It is a subclass of [Country](#country) and extends it with provenance information. -
- Example - -```json -{ - "code" : "IT", - "label": "Italy", - "provenance" : { - "provenance": "inferred by OpenAIRE", - "trust": "0.85" - } -} -``` +## GeoLocation +Represents the geolocation information. -
+### point +_Type: String • Cardinality: ONE_ +TODO -### provenance -_Type: [Provenance](#provenance-2) • Cardinality: ONE_ +### box +_Type: String • Cardinality: ONE_ +TODO -Indicates the reason why this country is associated to this result +### place +_Type: String • Cardinality: ONE_ +TODO ## Instance An instance is one specific materialization or version of the result. For example, you can have one result with three instances as result of deduplication: @@ -443,6 +421,32 @@ _Type: String • Cardinality: ONE_ Language label in English +## OrganizationPid + +The schema and value for identifiers of the organization. + +
+ Example + + +```json +{ + "scheme" : "GRID", + "value" : "grid.7119.e" +} +``` + +
+ +### scheme +_Type: String • Cardinality: ONE_ + +Vocabulary reference (e.g. isni). + +### value +_Type: String • Cardinality: ONE_ + +Value from the given scheme/vocabulary (e.g. 0000000090326370). ## Provenance Indicates the process that produced (or provided) the information, and the trust associated to the information. @@ -473,13 +477,39 @@ Indicates the process that produced (or provided) the information, and the trust ### provenance _Type: String • Cardinality: ONE_ -provenance term from the vocabulary [dnet:provenanceActions](https://api.openaire.eu/vocabularies/dnet:provenanceActions). +Provenance term from the vocabulary [dnet:provenanceActions](https://api.openaire.eu/vocabularies/dnet:provenanceActions). ### trust _Type: String • Cardinality: ONE_ Trust, expressed as a number in the range [0-1]. +## ResultCountry +It is for the country associated to the result. +It is a subclass of [Country](#country) and extends it with provenance information. + +
+ Example + + +```json +{ + "code" : "IT", + "label": "Italy", + "provenance" : { + "provenance": "inferred by OpenAIRE", + "trust": "0.85" + } +} +``` + +
+ +### provenance +_Type: [Provenance](#provenance-2) • Cardinality: ONE_ + +Indicates the reason why this country is associated to this result. + ## ResultPid Type used to represent the information associated to persistent identifiers for the result that have been forged by an authority for that pid type. diff --git a/docs/data-model/entities/project.md b/docs/data-model/entities/project.md index fcfc0c8..265255e 100644 --- a/docs/data-model/entities/project.md +++ b/docs/data-model/entities/project.md @@ -4,3 +4,28 @@ sidebar_position: 4 # Project +Of crucial interest to OpenAIRE is also the identification of the funders (e.g. European Commission, Wellcome Trust, FCT Portugal, NWO The Netherlands) that co-funded the projects that have led to a given result. Projects are characterized by a list of funding streams (e.g. FP7, H2020 for the EC), which identify the strands of funding. Funding streams can be nested to form a tree of sub-funding streams. + +--- + +## The `Project` object + +### id +_Type: String • Cardinality: ONE_ + +Main entity identifier, created according to [OpenAIRE_entity_identifier_and_PID_mapping_policy](https://support.openaire.eu/projects/docs/wiki/OpenAIRE_entity_identifier_and_PID_mapping_policy). + +### code +_Type: String • Cardinality: ONE_ + +The grant agreement code of the project. + +### acronym +_Type: String • Cardinality: ONE_ + +Project's acronym. + +### title +_Type: String • Cardinality: ONE_ + +Project's title. 
diff --git a/docs/data-model/entities/result.md b/docs/data-model/entities/result.md index e86241d..7b913c0 100644 --- a/docs/data-model/entities/result.md +++ b/docs/data-model/entities/result.md @@ -15,7 +15,7 @@ Moreover, there are the following sub-types of a `Result`, that inherit all its --- -## Properties +## The `Result` object ### id _Type: String • Cardinality: ONE_ diff --git a/docs/data-model/relationships.md b/docs/data-model/relationships.md index 446af4d..5832dd2 100644 --- a/docs/data-model/relationships.md +++ b/docs/data-model/relationships.md @@ -2,4 +2,97 @@ sidebar_position: 2 --- -# Relationships \ No newline at end of file +# Relationships + +A relationship in the graph is represented by the following data type, which models a directed edge between two nodes and provides information about the semantics of the relation, its provenance and validation. + +--- + +## The `Relationship` object + +### source +_Type: [Node](#the-node-object) • Cardinality: ONE_ + +Represents the source node in the relation. + +### target +_Type: [Node](#the-node-object) • Cardinality: ONE_ + +Represents the target node in the relation. + +### reltype +_Type: [RelType](#the-reltype-object) • Cardinality: ONE_ + +Represents the semantics of the relation between two nodes of the graph. + +### provenance +_Type: [Provenance](entities/other#provenance-1) • Cardinality: ONE_ + +Indicates the process that produced (or provided) the information. + +### validated +_Type: Boolean • Cardinality: ONE_ + +Indicates whether or not the relation was validated. + +### validationDate +_Type: String • Cardinality: ONE_ + +Indicates the validation date of the relation; applies only when the validated flag is set to true. + +--- + +## The `Node` object + +The Node data type contains the minimum information needed to identify a graph node: its identifier and entity type. + + +### id +_Type: String • Cardinality: ONE_ + +OpenAIRE identifier of the node in the graph. 
+ +### type +_Type: String • Cardinality: ONE_ + +Graph node type. + + +## The `RelType` object + +The RelType data type models the semantics of the relationship between two nodes. + +### type +_Type: String • Cardinality: ONE_ + +Relation category, e.g. affiliation, citation; see the table under Relationship types. + +### name +_Type: String • Cardinality: ONE_ + +Further specifies the relation semantics, indicating the relation direction, e.g. Cites, isCitedBy. + + +--- + +## Relationship types + +The following table lists all the possible relation semantics found in the graph dump. + +| # | source entity type | target entity type | relType.type | relType.name | relType.name (inverse) | +|:--:|:------------------:|:-------------------:|:-------------:|:---------------------------:|:----------------------------:| +| 1 | [Project](entities/project) | [Result](entities/result) | outcome | produces | isProducedBy | +| 2 | [Result](entities/result) | [Organization](entities/organization) | affiliation | hasAuthorInstitution | isAuthorInstitutionOf | +| 3 | [Result](entities/result) | [Result](entities/result) | similarity | isAmongTopNSimilarDocuments | HasAmongTopNSimilarDocuments | +| 4 | [Project](entities/project) | [Organization](entities/organization) | participation | isParticipant | hasParticipant | +| 5 | [Result](entities/result) | [Result](entities/result) | supplement | isSupplementTo | isSupplementedBy | +| 6 | [Result](entities/result) | [Result](entities/result) | relationship | isRelatedTo | isRelatedTo | +| 7 | [Data source](entities/data-source) | [Organization](entities/organization) | provision | provides | isProvidedBy | +| 8 | [Result](entities/result) | [Data source](entities/data-source) | provision | isHostedBy | hosts | +| 9 | [Result](entities/result) | [Data source](entities/data-source) | provision | isProvidedBy | provides | +| 10 | [Result](entities/result) | [CommunityInitiative](entities/community) | relationship | isRelatedTo | isRelatedTo | +| 11 
| [Organization](entities/organization) | [CommunityInitiative](entities/community) | relationship | isRelatedTo | isRelatedTo | +| 12 | [Data source](entities/data-source) | [CommunityInitiative](entities/community) | relationship | isRelatedTo | isRelatedTo | +| 13 | [Project](entities/project) | [CommunityInitiative](entities/community) | relationship | isRelatedTo | isRelatedTo | + + diff --git a/docs/data-provision/_category_.json b/docs/data-provision/_category_.json index d16f944..80915c5 100644 --- a/docs/data-provision/_category_.json +++ b/docs/data-provision/_category_.json @@ -2,7 +2,7 @@ "label": "Data provision", "position": 6, "link": { - "type": "generated-index", - "description": "5 minutes to learn the most important Docusaurus concepts." + "type": "doc", + "id": "data-provision" } } \ No newline at end of file diff --git a/docs/data-provision/aggregation.md b/docs/data-provision/aggregation.md new file mode 100644 index 0000000..4ae6ab4 --- /dev/null +++ b/docs/data-provision/aggregation.md @@ -0,0 +1,14 @@ +--- +sidebar_position: 1 +--- + +# Aggregation + +OpenAIRE collects metadata records from a variety of content providers as described in https://www.openaire.eu/aggregation-and-content-provision-workflows. + +OpenAIRE aggregates metadata records describing objects of the research life-cycle from content providers compliant to the [OpenAIRE guidelines](https://guidelines.openaire.eu/) and from entity registries (i.e. data sources offering authoritative lists of entities, like OpenDOAR, re3data, DOAJ, and funder databases). After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is used to generate the final OpenAIRE Research Graph that you can access from the OpenAIRE portal and the APIs. + +The transformation process includes the application of cleaning functions whose goal is to ensure that values are harmonised according to a common format (e.g. 
dates as YYYY-MM-dd) and, whenever applicable, to a common controlled vocabulary. The controlled vocabularies used for cleansing are accessible at http://api.openaire.eu/vocabularies. Each vocabulary features a set of controlled terms, each with one code, one label, and a set of synonyms. If a synonym is found as field value, the value is updated with the corresponding term. Also, the OpenAIRE Research Graph is extended with other relevant scholarly communication sources that are too big to be integrated via the “normal” aggregation mechanism: DOIBoost (which merges Crossref, ORCID, Microsoft Academic Graph, and Unpaywall), and ScholeXplorer, one of the Scholix hubs offering a large set of links between research literature and data. + + +![Aggregation](./assets/aggregation.png) diff --git a/docs/data-provision/assets/aggregation.png b/docs/data-provision/assets/aggregation.png new file mode 100644 index 0000000..bd6dd19 Binary files /dev/null and b/docs/data-provision/assets/aggregation.png differ diff --git a/docs/data-provision/assets/architecture.png b/docs/data-provision/assets/architecture.png new file mode 100644 index 0000000..8db82ef Binary files /dev/null and b/docs/data-provision/assets/architecture.png differ diff --git a/docs/data-provision/data-provision.md b/docs/data-provision/data-provision.md new file mode 100644 index 0000000..bc43579 --- /dev/null +++ b/docs/data-provision/data-provision.md @@ -0,0 +1,11 @@ +# Data provision + + +source: https://graph.openaire.eu/about#tabs_card + +OpenAIRE collects metadata records from more than 70K scholarly communication sources from all over the world, including Open Access institutional repositories, data archives, journals. All the metadata records (i.e. descriptions of research products) are put together in a data lake, together with records from Crossref, Unpaywall, ORCID, Grid.ac, and information about projects provided by national and international funders. 
Dedicated inference algorithms applied to metadata and to the full-texts of Open Access publications enrich the content of the data lake with links between research results and projects, author affiliations, subject classification, and links to entries from domain-specific databases. Duplicated organisations and results are identified and merged together to obtain an open, trusted, public resource enabling explorations of the scholarly communication landscape like never before. + +![Architecture](./assets/architecture.png) + +TODO: make this image linkable + diff --git a/docs/data-provision/data-sources.md b/docs/data-provision/data-sources.md deleted file mode 100644 index 3cb91c7..0000000 --- a/docs/data-provision/data-sources.md +++ /dev/null @@ -1,6 +0,0 @@ ---- -sidebar_position: 1 ---- - -# Data sources -TODO \ No newline at end of file diff --git a/docs/data-provision/deduplication/_category_.json b/docs/data-provision/deduplication/_category_.json index be4ad14..c80249b 100644 --- a/docs/data-provision/deduplication/_category_.json +++ b/docs/data-provision/deduplication/_category_.json @@ -2,7 +2,7 @@ "label": "Deduplication", "position": 2, "link": { - "type": "generated-index", - "description": "5 minutes to learn the most important Docusaurus concepts." + "type": "doc", + "id": "deduplication" } } \ No newline at end of file diff --git a/docs/data-provision/deduplication/deduplication.md b/docs/data-provision/deduplication/deduplication.md new file mode 100644 index 0000000..b37c350 --- /dev/null +++ b/docs/data-provision/deduplication/deduplication.md @@ -0,0 +1,31 @@ +# Deduplication + +TODO: intro + +## Clustering + +Clustering is a common heuristic used to overcome the N x N complexity required to match all pairs of objects to identify the equivalent ones. 
The challenge is to identify a clustering function that maximizes the chance of comparing only records that may lead to a match, while minimizing the number of equivalent records that are never compared. Since the equivalence function is to some degree tolerant of minor errors (e.g. switching of characters in the title, or minimal difference in letters), we need this function to be not too precise (e.g. a hash of the title), but also not too flexible (e.g. random ngrams of the title). On the other hand, reality tells us that in some cases equality of two records can only be determined by their PIDs (e.g. DOI) as the metadata properties are very different across different versions and no clustering function will ever bring them into the same cluster. To meet these requirements, OpenAIRE clustering for products works with two functions: +* DOI: the function generates the DOI when this is provided as part of the record properties; +* Title-based function: the function generates a key that depends on (i) the number of significant words in the title (normalized, stemming, etc.), (ii) the number of characters of such words modulo 10, and (iii) a string obtained as an alternation of the function prefix(3) and suffix(3) (and vice versa) of the first 3 words (2 words if the title only has 2). For example, the title “Entity deduplication in big data graphs for scholarly communication” becomes “entity deduplication big data graphs scholarly communication” with the two keys “7.1entionbig” and “7.1itydedbig” (where 1 is modulo 10 of 54 characters of the normalized title). +To give an idea, this configuration generates around 77 million blocks, which we limited to 200 records each (only 15K blocks are affected by the cut), and entails 260 billion matches. Matches in a block are performed using a “sliding window” set to 80 records. The records are sorted lexicographically on a normalized version of their titles. 
The first record is matched against the 80 records that follow it, then the second, and so on, for an N log N complexity. + +## Matching and election + +Once the clusters have been built, the algorithm proceeds with the comparisons. Comparisons are driven by a decision tree that: +1. Tries to capture equivalence via PIDs: if records share a PID then they are equivalent. + +2. Tries to capture difference: + + a. If record titles contain different “numbers” then they are different (this rule is debatable and should be fine-tuned); + + b. If records contain different numbers of authors then they are different; + + c. Note that different PIDs do not imply different records, as different versions may have different PIDs. + +3. Measures equivalence: + + a. The titles of the two records are normalised and compared for similarity by applying the Levenshtein distance algorithm. The algorithm returns a number in the range [0,1], where 0 means “very different” and 1 means “equal”. If the resulting score is greater than or equal to 0.99 the two records are identified as duplicates. + + b. Dates are not considered for equivalence matching because different versions of the same record should be merged and may be published on different dates, e.g. pre-print and published version of an article. + +Once the equivalence relationships between pairs of records are set, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new representative object is obtained, which inherits all properties from the merged records and keeps track of their provenance. The ID of the record is obtained by prepending the prefix “dedup_” to the MD5 of the first ID (given their lexicographical ordering). A new, more stable function to generate the ID is under development, which exploits the DOI when one of the records to be merged includes a Crossref or a DataCite record. 
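The grouping and ID-generation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the OpenAIRE implementation: the function names `group_duplicates` and `dedup_id`, the union-find approach, and the exact ID layout (the prefix followed by the hexadecimal MD5 digest of the UTF-8 encoded identifier) are all hypothetical, and the record identifiers are made up.

```python
import hashlib
from collections import defaultdict


def group_duplicates(pairs):
    """Return groups of equivalent record IDs (transitive closure of the pairs)."""
    parent = {}

    def find(record_id):
        # Locate the representative of the set containing record_id.
        parent.setdefault(record_id, record_id)
        while parent[record_id] != record_id:
            parent[record_id] = parent[parent[record_id]]  # path halving
            record_id = parent[record_id]
        return record_id

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two sets

    groups = defaultdict(set)
    for record_id in list(parent):
        groups[find(record_id)].add(record_id)
    return sorted(sorted(group) for group in groups.values())


def dedup_id(group):
    """Assumed ID layout: 'dedup_' + MD5 hex digest of the lexicographically first ID."""
    first_id = min(group)
    return "dedup_" + hashlib.md5(first_id.encode("utf-8")).hexdigest()


# Hypothetical pairwise matches produced by the comparison step.
matches = [("r2", "r1"), ("r2", "r3"), ("r8", "r9")]
for group in group_duplicates(matches):
    print(dedup_id(group), group)
```

Because the ID depends only on the lexicographically first member, it changes whenever a record with a smaller identifier joins the group, which is consistent with the text's remark that a more stable function is under development.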
\ No newline at end of file diff --git a/docs/data-provision/enrichment/_category_.json b/docs/data-provision/enrichment/_category_.json new file mode 100644 index 0000000..9ecbe8a --- /dev/null +++ b/docs/data-provision/enrichment/_category_.json @@ -0,0 +1,8 @@ +{ + "label": "Enrichment", + "position": 3, + "link": { + "type": "doc", + "id": "enrichment" + } +} \ No newline at end of file diff --git a/docs/data-provision/enrichment/enrichment.md b/docs/data-provision/enrichment/enrichment.md new file mode 100644 index 0000000..d5996a6 --- /dev/null +++ b/docs/data-provision/enrichment/enrichment.md @@ -0,0 +1,47 @@ +# Enrichment + + +TODO: intro + +## Mining + +The OpenAIRE Research Graph is enriched by links mined by OpenAIRE’s full-text mining algorithms that scan the plain texts of publications for funding information, references to datasets, software URIs, accession numbers of bio-entities, and EPO patent mentions. Custom mining modules also link research objects to specific research communities, initiatives and infrastructures. In addition, other inference modules provide content-based document classification, document similarity, citation matching, and author affiliation matching. + +**Project mining** in OpenAIRE text mines the full-texts of publications in order to extract matches to funding project codes/IDs. The mining algorithm works by utilising (i) the grant identifier, and (ii) the project acronym (if available) of each project. 
The mining algorithm: (1) Preprocesses/normalizes the full-texts using several functions, which depend on the characteristics of each funder (i.e., the format of the grant identifiers), such as stopword and/or punctuation removal, tokenization, stemming, converting to lowercase; then (2) String matching of grant identifiers against the normalized text is done using database techniques; and (3) The results are validated and cleaned using the context near the match by looking at the context around the matched ID for relevant metadata and positive or negative words/phrases, in order to calculate a confidence value for each publication-->project link. A confidence threshold is set to optimise high accuracy while minimising false positives, such as matches with page or report numbers, post/zip codes, parts of telephone numbers, DOIs or URLs, accession numbers. The algorithm also applies rules for disambiguating results, as different funders can share identical project IDs; for example, grant number 633172 could refer to H2020 project EuroMix but also to Australian-funded NHMRC project “Brain activity (EEG) analysis and brain imaging techniques to measure the neurobiological effects of sleep apnea”. Project mining works very well and was the first Text & Data Mining (TDM) service of OpenAIRE. Performance results vary from funder to funder but precision is higher than 98% for all funders and 99.5% for EC projects. Recall is higher than 95% (99% for EC projects), when projects are properly acknowledged using project/grant IDs. + +**Dataset extraction** runs on publications full-texts as described in “High pass text-filtering for Citation matching”, TPDL 2017[1]. In particular, we search for citations to datasets using their DOIs, titles and other metadata (i.e., dates, creator names, publishers, etc.). We extract parts of the text which look like citations and search for datasets using database join and pattern matching techniques. 
Based on the experiments described in the paper, precision of the dataset extraction module is 98.5% and recall is 97.4% but it is also probably overestimated since it does not take into account corruptions that may take place during PDF-to-text extraction. It is calculated on the extracted full-texts of small samples from PubMed and arXiv. + +**Software extraction** also runs on parts of the text which look like citations. We search the citations for links to software in open software repositories, specifically GitHub, SourceForge, Bitbucket and the Google Code Archive. After that, we search for links that are included in Software Heritage (SH, https://www.softwareheritage.org) and return the permanent URL that SH provides for each software project. We also enrich this content with user names, titles and descriptions of the software projects using web mining techniques. Since software mining is based on URL matching, our precision is 100% (we return a software link only if we find it in the text and there is no need to disambiguate). As for recall rate, this is not calculable for this mining task. Although we apply all the necessary normalizations to the URLs in order to overcome usual issues (e.g., http or https, existence of www or not, lower/upper case), we cannot account for cases where software is mentioned by name rather than by a link from the supported software repositories. + +**For the extraction of bio-entities**, we focus on Protein Data Bank (PDB) entries. We have downloaded the database with PDB codes and we update it regularly. We search through the whole publication’s full-text for references to PDB codes. We apply disambiguation rules (e.g., there are PDB codes that are the same as antibody codes or other issues) so that we return valid results. Current precision is 98%. Although it's risky to mention recall rates since these are usually overestimated, we have calculated a recall rate of 98% using small samples from PubMed publications. 
Moreover, our technique is able to identify about 30% more links to proteins than the ones that are tagged in PubMed XML records. + +**Other text-mining modules** include mining for links to EPO patents, or custom mining modules for linking research objects to specific research communities, initiatives and infrastructures, e.g. the COVID-19 mining module. Apart from text-mining modules, OpenAIRE also provides a document classification service that employs analysis of free text stemming from the abstracts of the publications. The purpose of applying a document classification module is to assign to a scientific text one or more predefined content classes. In OpenAIRE, the currently used taxonomies are arXiv, MeSH (Medical Subject Headings), ACM and DDC (Dewey Decimal Classification, or Dewey Decimal System). + +## Bulk Tagging/Deduction + +The Deduction process (also known as “bulk tagging”) enriches each record with new information that can be derived from the existing property values. + +As of September 2020, three procedures are in place to relate a research product to a research initiative, infrastructure (RI) or community (RC) based on: + +* subjects (2.7M results tagged) + +* Zenodo community (16K results tagged) + +* the data source it comes from (250K results tagged) + +The lists of subjects, Zenodo communities and data sources used to enrich the products are defined by the managers of the community gateway or infrastructure monitoring dashboard associated with the RC/RI. + +## Propagation + +This process “propagates” properties and links from one product to another if between the two there is a “strong” semantic relationship. + +As of September 2020, the following procedures are in place: + +* Propagation of the property “country” to results from institutional repositories: e.g. a publication collected from an institutional repository maintained by an Italian university will be enriched with the property “country = IT”. + +* Propagation of links to projects: e.g. 
a publication linked to project P “is supplemented by” a dataset D. Dataset D will get the link to project P. The relationships considered for this procedure are “isSupplementedBy” and “supplements”. + +* Propagation of related community/infrastructure/initiative from organizations to products via affiliation relationships: e.g. a publication has an author affiliated with organization O. The manager of the community gateway C declared that the outputs of O are all relevant for community C. The publication is tagged as relevant for C. + +* Propagation of related community/infrastructure/initiative to related products: e.g. a publication associated with community C is supplemented by a dataset D. Dataset D will get the association to C. The relationships considered for this procedure are “isSupplementedBy” and “supplements”. + +* Propagation of ORCID identifiers to related products, if the products have the same authors: e.g. a publication has ORCIDs for its authors and is supplemented by a dataset D. Dataset D has the same authors as the publication. Authors of D are enriched with the ORCIDs available in the publication. The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
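As a rough illustration of the project-link propagation described in the list above, the following Python sketch copies project links between products joined by an “isSupplementedBy” or “supplements” relationship. It is not the OpenAIRE implementation: the function name `propagate_projects`, the dictionary-and-triple record layout, the product and project identifiers, and the choice to propagate in both directions over a single hop are all assumptions made for the example.

```python
from collections import defaultdict

# Relationship names come from the text; everything else here is hypothetical.
SUPPLEMENT_RELS = {"isSupplementedBy", "supplements"}


def propagate_projects(project_links, relations):
    """Return new project links in which products joined by a supplement
    relationship share each other's projects.

    project_links: dict mapping product id -> set of project ids
    relations: iterable of (source id, relation name, target id) triples
    """
    # Start from a copy so the caller's sets are never mutated.
    enriched = defaultdict(set)
    for product, projects in project_links.items():
        enriched[product] |= projects

    for source, rel, target in relations:
        if rel not in SUPPLEMENT_RELS:
            continue  # other relationship types do not propagate
        # Read from the original links (assumption: single hop only),
        # so one pass never chains propagation across several products.
        enriched[target] |= project_links.get(source, set())
        enriched[source] |= project_links.get(target, set())
    return dict(enriched)


links = {"publication-1": {"project-P"}, "dataset-D": set()}
rels = [
    ("publication-1", "isSupplementedBy", "dataset-D"),
    ("publication-1", "cites", "publication-2"),
]
result = propagate_projects(links, rels)
print(result["dataset-D"])
```

The same shape would apply to the community propagation over the same two relationships; only the propagated value changes.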
\ No newline at end of file diff --git a/docs/data-provision/inference/impact-scores.md b/docs/data-provision/enrichment/impact-scores.md similarity index 100% rename from docs/data-provision/inference/impact-scores.md rename to docs/data-provision/enrichment/impact-scores.md diff --git a/docs/data-provision/inference/mining.md b/docs/data-provision/enrichment/mining.md similarity index 100% rename from docs/data-provision/inference/mining.md rename to docs/data-provision/enrichment/mining.md diff --git a/docs/data-provision/indexing.md b/docs/data-provision/indexing.md new file mode 100644 index 0000000..d37a716 --- /dev/null +++ b/docs/data-provision/indexing.md @@ -0,0 +1,13 @@ +--- +sidebar_position: 5 +--- + +# Indexing + +The final version of the OpenAIRE Research Graph is indexed on a Solr server that is used by the OpenAIRE portals (EXPLORE, CONNECT, PROVIDE) and APIs, the latter adopted by several third-party applications and organizations, such as: + +* EOSC: the OpenAIRE Research Graph APIs and Portals will offer the EOSC an Open Science Resource Catalogue, keeping an up-to-date map of all research results (publications, datasets, software), services, organizations, projects and funders in Europe and beyond. + +* DSpace & EPrints repositories can install the OpenAIRE plugin to expose OpenAIRE-compliant metadata records via their OAI-PMH endpoint and offer researchers the possibility to link their depositions to the funding project by selecting it from the list of projects provided by OpenAIRE. + +* EC participant portal (Sygma - System for Grant Management) uses the OpenAIRE API in the “Continuous Reporting” section. Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in the OpenAIRE Research Graph that are linked to the project. The user can select the research products from the list and easily compile the continuous reporting data of the project.
\ No newline at end of file diff --git a/docs/data-provision/inference/_category_.json b/docs/data-provision/inference/_category_.json deleted file mode 100644 index 6dbf3a8..0000000 --- a/docs/data-provision/inference/_category_.json +++ /dev/null @@ -1,8 +0,0 @@ -{ - "label": "Inference and annotations", - "position": 3, - "link": { - "type": "generated-index", - "description": "5 minutes to learn the most important Docusaurus concepts." - } -} \ No newline at end of file diff --git a/docs/data-provision/post-cleaning.md b/docs/data-provision/post-cleaning.md new file mode 100644 index 0000000..512d70c --- /dev/null +++ b/docs/data-provision/post-cleaning.md @@ -0,0 +1,9 @@ +--- +sidebar_position: 4 +--- + +# Post-cleaning + +The aggregation processes run continuously and apply the vocabularies as they are at a given moment in time. A vocabulary may change after the aggregation of a data source has finished, in which case the aggregated content no longer reflects the current status of the controlled vocabularies. + +In addition, the integration of ScholeXplorer and DOIBoost and some enrichment processes applied to the raw and to the de-duplicated graph may introduce values that do not comply with the current status of the OpenAIRE controlled vocabularies. For these reasons, we included a final cleansing step at the end of the workflow materialisation. The output of the final cleansing step is the final version of the OpenAIRE Research Graph. \ No newline at end of file diff --git a/docs/data-provision/stats.md b/docs/data-provision/stats.md new file mode 100644 index 0000000..8e44f9b --- /dev/null +++ b/docs/data-provision/stats.md @@ -0,0 +1,7 @@ +--- +sidebar_position: 6 +--- + +# Stats analysis + +The OpenAIRE Research Graph is also processed by a pipeline for extracting the statistics and producing the charts for funders, research initiatives, infrastructures, and policy makers that you can see on MONITOR.
Based on the information available on the graph, OpenAIRE provides a set of indicators for monitoring the funding and research impact and the uptake of Open Science publishing practices, such as Open Access publishing of publications and datasets, availability of interlinks between research products, availability of post-print versions in institutional or thematic Open Access repositories, etc. \ No newline at end of file diff --git a/docs/download.md b/docs/download.md index 9b7e8d5..6a6e6a8 100644 --- a/docs/download.md +++ b/docs/download.md @@ -10,7 +10,7 @@ Here we provide detailed documentation about the full dump: * JSON dump: https://doi.org/10.5281/zenodo.3516917 * JSON schema: https://doi.org/10.5281/zenodo.4238938 -:::tip Tip! +:::note Tip! For a visual and interactive overview of the JSON schema, we suggest to use a JSON schema viewer like [jsonschemaviewer](https://navneethg.github.io/jsonschemaviewer/) (you just need to copy the schema and then you can easily navigate through the nodes).