diff --git a/docs/data-model/data-model.md b/docs/data-model/data-model.md index 0f891b1..6a32834 100644 --- a/docs/data-model/data-model.md +++ b/docs/data-model/data-model.md @@ -1,26 +1,39 @@ # Data model -The OpenAIRE Graph comprises several types of [entities](../category/entities) and [relationships](/category/relationships) among them. +The OpenAIRE Graph comprises several types of [entities](../category/entities) +and [relationships](/category/relationships) among them. -The latest version of the JSON schema can be found on the [Downloads](../downloads/full-graph) section. +The latest version of the JSON schema can be found on +the [Downloads](../downloads/full-graph) section.
-The figure above, presents the graph's data model. +The figure above presents the graph's data model. Its main entities are described in brief below: -* [Research products](./entities/research-product) represent the outcomes (or products) of research activities. -* [Data sources](./entities/data-source) are the sources from which the metadata of graph objects are collected. -* [Organizations](./entities/organization) correspond to companies or research institutions involved in projects, -responsible for operating data sources or consisting the affiliations of Product creators. -* [Projects](./entities/project) are research project grants funded by a Funding Stream of a Funder. -* [Communities](./entities/community) are groups of people with a common research intent (e.g. research infrastructures, university alliances). -* Persons correspond to individual researchers who are involved in the design, creation or maintenance of research products. Currently, this is a non-materialized entity type in the Graph, which means that the respective metadata (and relationships) are encapsulated in the author field of the respective research products. +* [Research products](./entities/research-product) represent the outcomes (or + products) of research activities. +* [Data sources](./entities/data-source) are the sources from which the metadata + of graph objects are collected. +* [Organizations](./entities/organization) correspond to companies or research + institutions involved in projects, + responsible for operating data sources or constituting the affiliations of + Product creators. +* [Projects](./entities/project) are research project grants funded by a Funding + Stream of a Funder. +* [Communities](./entities/community) are groups of people with a common + research intent (e.g. research infrastructures, university alliances). +* Persons correspond to individual researchers who are involved in the design, + creation or maintenance of research products. 
Currently, this is a + non-materialized entity type in the Graph, which means that the respective + metadata (and relationships) are encapsulated in the author field of the + respective research products. :::note Further reading -A detailed report on the OpenAIRE Graph Data Model can be found on [Zenodo](https://zenodo.org/record/2643199). +A detailed report on the OpenAIRE Graph Data Model can be found +on [Zenodo](https://zenodo.org/record/2643199). ::: diff --git a/docs/data-model/pids-and-identifiers.md b/docs/data-model/pids-and-identifiers.md index 3e3012e..b4a6889 100644 --- a/docs/data-model/pids-and-identifiers.md +++ b/docs/data-model/pids-and-identifiers.md @@ -1,17 +1,33 @@ # PIDs and identifiers -One of the challenges towards the stability of the contents in the OpenAIRE Graph consists of making its identifiers and records stable over time. -The barriers to this scenario are many, as the Graph keeps a map of data sources that is subject to constant variations: records in repositories vary in content, -original IDs, and PIDs, may disappear or reappear, and the same holds for the repository or the metadata collection it exposes. -Not only, but the mappings applied to the original contents may also change and improve over time to catch up with the changes in the input records. +One of the challenges towards the stability of the contents in the OpenAIRE +Graph consists of making its identifiers and records stable over time. +The barriers to this scenario are many, as the Graph keeps a map of data sources +that is subject to constant variations: records in repositories vary in content, +original IDs, and PIDs, may disappear or reappear, and the same holds for the +repository or the metadata collection it exposes. +Not only, but the mappings applied to the original contents may also change and +improve over time to catch up with the changes in the input records. 
## PID Authorities -One of the fronts regards the attribution of the identity to the objects populating the graph. The basic idea is to build the identifiers of the objects in the graph from the PIDs available in some authoritative sources while considering all the other sources as by definition “unstable”. Examples of authoritative sources are Crossref and DataCite. Examples of non-authoritative ones are institutional repositories, aggregators, etc. PIDs from the authoritative sources would form the stable OpenAIRE ID skeleton of the Graph, precisely because they are immutable by construction. +One of the fronts regards the attribution of the identity to the objects +populating the graph. The basic idea is to build the identifiers of the objects +in the graph from the PIDs available in some authoritative sources while +considering all the other sources as by definition “unstable”. Examples of +authoritative sources are Crossref and DataCite. Examples of non-authoritative +ones are institutional repositories, aggregators, etc. PIDs from the +authoritative sources would form the stable OpenAIRE ID skeleton of the Graph, +precisely because they are immutable by construction. -Such a policy defines a list of data sources that are considered authoritative for a specific type of PID they provide, whose effect is twofold: -* OpenAIRE IDs depend on persistent IDs when they are provided by the authority responsible to create them; -* PIDs are included in the graph according to a tight criterion: the PID Types declared in the table below are considered to be mapped as PIDs only when they are collected from the relative PID authority data source. 
+Such a policy defines a list of data sources that are considered authoritative +for a specific type of PID they provide, whose effect is twofold: + +* OpenAIRE IDs depend on persistent IDs when they are provided by the authority + responsible to create them; +* PIDs are included in the graph according to a tight criterion: the PID Types + declared in the table below are considered to be mapped as PIDs only when they + are collected from the relative PID authority data source. | PID Type | Authority | |-----------|-----------------------------------------------------------------------------------------------------| @@ -22,60 +38,76 @@ Such a policy defines a list of data sources that are considered authoritative f | ena | [Protein Data Bank](http://www.pdb.org/) | | pdb | [Protein Data Bank](http://www.pdb.org/) | - -There is an exception though: Handle(s) are minted by several repositories; as listing them all would not be a viable option, to avoid losing them as PIDs, Handles bypass the PID authority filtering rule. +There is an exception though: Handle(s) are minted by several repositories; as +listing them all would not be a viable option, to avoid losing them as PIDs, +Handles bypass the PID authority filtering rule. In all other cases, PIDs are be included in the graph as alternate Identifiers. ## Delegated authorities -When a record is aggregated from multiple sources considered authoritative for minting specific PIDs, different mappings could be applied to them and, depending on the case, +When a record is aggregated from multiple sources considered authoritative for +minting specific PIDs, different mappings could be applied to them and, +depending on the case, this could result in inconsistencies in the attribution of the field values. -To overcome the issue, the intuition is to include such records only once in the graph. 
To do so, the concept of "delegated authorities" defines a list of datasources that +To overcome the issue, the intuition is to include such records only once in the +graph. To do so, the concept of "delegated authorities" defines a list of +datasources that assigns PIDs to their scientific products from a given PID minter. -This "selection" can be performed when the entities in the graph sharing the same identifier are grouped together. The list of the delegated authorities currently includes - -| Datasource delegated | Datasource delegating | Pid Type | -|--------------------------------------|----------------------------------|-----------| -| [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | doi | -| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id | +This "selection" can be performed when the entities in the graph sharing the +same identifier are grouped together. The list of the delegated authorities +currently includes +| Datasource delegated | Datasource delegating | Pid Type | +|--------------------------------------|----------------------------------|----------| +| [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | doi | +| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id | ## Identifiers in the Graph OpenAIRE assigns internal identifiers for each object it collects. 
-By default, the internal identifier is generated as `sourcePrefix::md5(localId)` where: +By default, the internal identifier is generated as `sourcePrefix::md5(localId)` +where: -* `sourcePrefix` is a namespace prefix of 12 chars assigned to the data source at registration time +* `sourcePrefix` is a namespace prefix of 12 chars assigned to the data source + at registration time * `localΙd` is the identifier assigned to the object by the data source After years of operation, we can say that: * `localId` are generally unstable * objects can disappear from sources -* PIDs provided by sources that are not PID agencies (authoritative sources for a specific type of PID) are often wrong (e.g. pre-print with the DOI of the published version, DOIs with typos) +* PIDs provided by sources that are not PID agencies (authoritative sources for + a specific type of PID) are often wrong (e.g. pre-print with the DOI of the + published version, DOIs with typos) Therefore, when the record is collected from an authoritative source: -* the identity of the record is forged using the PID, like `pidTypePrefix::md5(lowercase(doi))` +* the identity of the record is forged using the PID, + like `pidTypePrefix::md5(lowercase(doi))` * the PID is added in a `pid` element of the data model -When the record is collected from a source which is not authoritative for any type of PID: +When the record is collected from a source which is not authoritative for any +type of PID: + * the identity of the record is forged as usual using the local identifier * the PID, if available, is added as `alternateIdentifier` Currently, the following data sources are used as "PID authorities": -| PID Type | Prefix (12 chars) | Authority | -|-----------|------------------------|-------------------------------------------| -| doi | `doi_________` | Crossref, Datacite, Zenodo | -| pmc | `pmc_________` | Europe PubMed Central, PubMed Central | -| pmid | `pmid________` | Europe PubMed Central, PubMed Central | -| arXiv 
| `arXiv_______` | arXiv.org e-Print Archive | -| handle | `handle______` | any repository | -| ena | `ena_________` | EMBL-EBI | -| pdb | `pdb_________` | EMBL-EBI | -| uniprot | `uniprot_____` | EMBL-EBI | +| PID Type | Prefix (12 chars) | Authority | +|----------|-----------------------|-----------------------------------------| +| doi | `doi_________` | Crossref, Datacite, Zenodo | +| pmc | `pmc_________` | Europe PubMed Central, PubMed Central | +| pmid | `pmid________` | Europe PubMed Central, PubMed Central | +| arXiv | `arXiv_______` | arXiv.org e-Print Archive | +| handle | `handle______` | any repository | +| ena | `ena_________` | EMBL-EBI | +| pdb | `pdb_________` | EMBL-EBI | +| uniprot | `uniprot_____` | EMBL-EBI | -OpenAIRE also perform duplicate identification (see the [dedicated section for details](/graph-production-workflow/deduplication)). -All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated record). +OpenAIRE also performs duplicate identification (see +the [dedicated section for details](/graph-production-workflow/deduplication)). +All duplicates are **merged** together in a **representative record** which must +be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier +of one of the aggregated records). diff --git a/docs/graph-production-workflow/deduplication/clustering-functions.md b/docs/graph-production-workflow/deduplication/clustering-functions.md index ded6c57..9632b33 100644 --- a/docs/graph-production-workflow/deduplication/clustering-functions.md +++ b/docs/graph-production-workflow/deduplication/clustering-functions.md @@ -1,11 +1,13 @@ --- sidebar_position: 3 --- -# Clustering functions + +# Clustering functions ## Ngrams It creates ngrams from the input field.
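Reviewer note on the identifier scheme reflowed in `pids-and-identifiers.md` above: the two patterns `sourcePrefix::md5(localId)` and `pidTypePrefix::md5(lowercase(doi))` can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the production code: the function names are hypothetical, UTF-8 encoding and a hexadecimal digest are assumed, and lowercasing is applied to every PID type although the text only shows it for DOIs.

```python
import hashlib

PREFIX_LENGTH = 12  # the docs state that prefixes are 12 characters long


def pad_prefix(pid_type: str) -> str:
    """Right-pad a PID type with underscores, e.g. 'doi' -> 'doi_________'."""
    if len(pid_type) > PREFIX_LENGTH:
        raise ValueError(f"prefix longer than {PREFIX_LENGTH} chars: {pid_type!r}")
    return pid_type.ljust(PREFIX_LENGTH, "_")


def md5_hex(value: str) -> str:
    # Assumption: the hash is taken over the UTF-8 bytes and rendered as hex.
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def local_identifier(source_prefix: str, local_id: str) -> str:
    """Default case: sourcePrefix::md5(localId)."""
    return f"{source_prefix}::{md5_hex(local_id)}"


def pid_identifier(pid_type: str, pid_value: str) -> str:
    """Authoritative source case: pidTypePrefix::md5(lowercase(pid))."""
    return f"{pad_prefix(pid_type)}::{md5_hex(pid_value.lower())}"
```

Because the PID is lowercased before hashing, `pid_identifier("doi", "10.1234/ABC")` and `pid_identifier("doi", "10.1234/abc")` yield the same identifier, which is the stability property the section argues for.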
diff --git a/docs/graph-production-workflow/deduplication/organizations.md b/docs/graph-production-workflow/deduplication/organizations.md index c2c57e1..8b73455 100644 --- a/docs/graph-production-workflow/deduplication/organizations.md +++ b/docs/graph-production-workflow/deduplication/organizations.md @@ -4,43 +4,82 @@ sidebar_position: 2 # Organizations -The organizations in OpenAIRE are aggregated from different registries (e.g. CORDA, OpenDOAR, Re3data, ROR). In some cases, a registry provides organizations as entities with their own persistent identifier. In other cases, those organizations are extracted from other main entities provided by the registry (e.g. datasources, projects, etc.). +The organizations in OpenAIRE are aggregated from different registries (e.g. +CORDA, OpenDOAR, Re3data, ROR). In some cases, a registry provides organizations +as entities with their own persistent identifier. In other cases, those +organizations are extracted from other main entities provided by the registry ( +e.g. datasources, projects, etc.). -The deduplication of organizations is enhanced by the [OpenOrgs](https://orgs.openaire.eu), a tool that combines an automated approach for identifying duplicated instances -of the same organization record with a "humans in the loop" approach, in which the equivalences produced by a duplicate identification algorithm are suggested to data curators, in charge for validating them. -The data curation activity is twofold, on one end pivots around the disambiguation task, on the other hand assumes to improve the metadata describing the organization records -(e.g. including the translated name, or a different PID) as well as defining the hierarchical structure of existing large organizations (i.e. Universities comprising its departments or large research centers with all its sub-units or sub-institutes). 
+The deduplication of organizations is enhanced by +[OpenOrgs](https://orgs.openaire.eu), a tool that combines an automated +approach for identifying duplicated instances +of the same organization record with a "humans in the loop" approach, in which +the equivalences produced by a duplicate identification algorithm are suggested +to data curators, who are in charge of validating them. +The data curation activity is twofold: on the one hand it pivots around the +disambiguation task, on the other hand it aims to improve the metadata +describing the organization records +(e.g. including the translated name, or a different PID) as well as defining the +hierarchical structure of existing large organizations (i.e. universities +comprising their departments or large research centers with all their sub-units or +sub-institutes). -Duplicates among organizations are therefore managed through three different stages: - * *Creation of Suggestions*: executes an automatic workflow that performs the deduplication and prepare new suggestions for the curators to be processed; - * *Curation*: manual editing of the organization records performed by the data curators; - * *Creation of Representative Organizations*: executes an automatic workflow that creates curated organizations and exposes them on the OpenAIRE Graph by using the curators' feedback from the OpenOrgs underlying database. +Duplicates among organizations are therefore managed through three different +stages: + +* *Creation of Suggestions*: executes an automatic workflow that performs the + deduplication and prepares new suggestions to be processed by the curators; +* *Curation*: manual editing of the organization records performed by the data + curators; +* *Creation of Representative Organizations*: executes an automatic workflow + that creates curated organizations and exposes them on the OpenAIRE Graph by + using the curators' feedback from the OpenOrgs underlying database. The next sections describe the above mentioned stages. 
### Creation of Suggestions -This stage executes an automatic workflow that faces the *candidate identification* and the *duplicates identification* stages of the deduplication to provide suggestions for the curators in the OpenOrgs. +This stage executes an automatic workflow that faces the *candidate +identification* and the *duplicates identification* stages of the deduplication +to provide suggestions for the curators in the OpenOrgs. #### Candidate identification (clustering) -To match the requirements of limiting the number of comparisons, OpenAIRE clustering for organizations aims at grouping records that would more likely be comparable. +To match the requirements of limiting the number of comparisons, OpenAIRE +clustering for organizations aims at grouping records that would more likely be +comparable. It works with four functions: -* *URL-based function*: the function generates the URL domain when this is provided as part of the record properties from the organization's `websiteurl` field; -* *Title-based functions*: - * generate strings dependent to the keywords in the `legalname` field; - * generate strings obtained as an alternation of the function prefix(3) and suffix(3) (and vice versa) on the first 3 words of the `legalname` field; - * generate strings obtained as a concatenation of ngrams of the `legalname` field; + +* *URL-based function*: the function generates the URL domain when this is + provided as part of the record properties from the organization's `websiteurl` + field; +* *Title-based functions*: + * generate strings dependent to the keywords in the `legalname` field; + * generate strings obtained as an alternation of the function prefix(3) and + suffix(3) (and vice versa) on the first 3 words of the `legalname` field; + * generate strings obtained as a concatenation of ngrams of the `legalname` + field; #### Duplicates identification (pair-wise comparisons) -For each pair of organization in a cluster the following strategy (depicted in the 
figure below) is applied. +For each pair of organizations in a cluster the following strategy (depicted in +the figure below) is applied. The comparison goes through the following decision tree: -1. *grid id check*: comparison of the grid ids. If the grid id is equivalent, then the similarity relation is drawn. If the grid id is not available, the comparison proceeds to the next stage; -2. *early exits*: comparison of the numbers extracted from the `legalname`, the `country` and the `website` url. No similarity relation is drawn in this stage, the comparison proceeds only if the compared fields verified the conditions of equivalence; -3. *city check*: comparison of the city names in the `legalname`. The comparison proceeds only if the legalnames shares at least 10% of cities; -4. *keyword check*: comparison of the keywords in the `legalname`. The comparison proceeds only if the legalnames shares at least 70% of keywords; -5. *legalname check*: comparison of the normalized `legalnames` with the `Jaro-Winkler` distance to determine if it is higher than `0.9`. If so, a similarity relation is drawn. Otherwise, no similarity relation is drawn. + +1. *grid id check*: comparison of the grid ids. If the grid ids are equivalent, + then a similarity relation is drawn. If the grid id is not available, the + comparison proceeds to the next stage; +2. *early exits*: comparison of the numbers extracted from the `legalname`, + the `country` and the `website` url. No similarity relation is drawn in this + stage; the comparison proceeds only if the compared fields satisfy the + conditions of equivalence; +3. *city check*: comparison of the city names in the `legalname`. The comparison + proceeds only if the legalnames share at least 10% of cities; +4. *keyword check*: comparison of the keywords in the `legalname`. The + comparison proceeds only if the legalnames share at least 70% of keywords; +5. 
*legalname check*: comparison of the normalized `legalnames` with + the `Jaro-Winkler` distance to determine if it is higher than `0.9`. If so, a + similarity relation is drawn. Otherwise, no similarity relation is drawn.
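Reviewer note: the five-step decision tree above can be read as a chain of early returns. The Python sketch below is one possible reading, not the actual implementation. Everything not named in the text is an assumption: the `Org` class, the `shared_ratio` definition (shared items over the smaller set), the rule that missing values never block a comparison, and the behaviour when both grid ids are present but different (here the comparison simply proceeds). City and keyword extraction are passed in as callables, and the toy similarity uses `difflib.SequenceMatcher` as a stand-in for Jaro-Winkler, which it does not reproduce.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional, Set


@dataclass
class Org:  # hypothetical record holding only the fields the checks need
    legalname: str
    country: Optional[str] = None
    website: Optional[str] = None
    grid_id: Optional[str] = None


def shared_ratio(a: Set[str], b: Set[str]) -> float:
    """Shared items relative to the smaller set (an assumed definition)."""
    if not a or not b:
        return 1.0  # assumption: nothing to compare, so the check does not block
    return len(a & b) / min(len(a), len(b))


def same_if_present(x: Optional[str], y: Optional[str]) -> bool:
    """Assumption: a missing value never blocks the comparison."""
    return x is None or y is None or x.lower() == y.lower()


def numbers(text: str) -> Set[str]:
    return set(re.findall(r"\d+", text))


def are_similar(
    a: Org,
    b: Org,
    cities: Callable[[str], Set[str]],
    keywords: Callable[[str], Set[str]],
    name_similarity: Callable[[str, str], float],
) -> bool:
    # 1. grid id check: equal grid ids draw the relation immediately.
    if a.grid_id and b.grid_id and a.grid_id == b.grid_id:
        return True
    # 2. early exits: numbers in the legalname, country and website must agree.
    if numbers(a.legalname) != numbers(b.legalname):
        return False
    if not same_if_present(a.country, b.country):
        return False
    if not same_if_present(a.website, b.website):
        return False
    # 3. city check: at least 10% of cities shared.
    if shared_ratio(cities(a.legalname), cities(b.legalname)) < 0.1:
        return False
    # 4. keyword check: at least 70% of keywords shared.
    if shared_ratio(keywords(a.legalname), keywords(b.legalname)) < 0.7:
        return False
    # 5. legalname check: similarity of the normalized names above 0.9.
    return name_similarity(a.legalname.lower(), b.legalname.lower()) > 0.9


# Toy helpers for illustration only; real extractors would use vocabularies.
def toy_cities(name: str) -> Set[str]:
    return {w for w in name.lower().split() if w in {"pisa", "athens"}}


def toy_keywords(name: str) -> Set[str]:
    return {w for w in name.lower().split() if w in {"university", "institute"}}


def toy_similarity(x: str, y: str) -> float:
    return SequenceMatcher(None, x, y).ratio()  # stand-in, not Jaro-Winkler
```

Injecting the extractors and the similarity function keeps the control flow of the tree visible and lets each check be swapped for the real one without touching the ordering of the stages.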
@@ -50,21 +89,39 @@ The comparison goes through the following decision tree: ### Data Curation -All the similarity relations drawn by the algorithm involving the decision tree are exposed in OpenOrgs, where are made available to the data curators to give feedbacks and to improve the organizations metadata. +All the similarity relations drawn by the algorithm involving the decision tree +are exposed in OpenOrgs, where they are made available to the data curators to give +feedback and to improve the organizations' metadata. A data curator can: - * *edit organization metadata*: legalname, pid, country, url, parent relations, etc.; - * *approve suggested duplicates*: establish if an equivalence relation is valid; - * *discard suggested duplicates*: establish if an equivalence relation is wrong; - * *create similarity relations*: add a new equivalence relation not drawn by the algorithm. -Note that if a curator does not provide a feedback on a similarity relation suggested by the algorithm, then such relation is considered as valid. +* *edit organization metadata*: legalname, pid, country, url, parent relations, + etc.; +* *approve suggested duplicates*: establish if an equivalence relation is valid; +* *discard suggested duplicates*: establish if an equivalence relation is wrong; +* *create similarity relations*: add a new equivalence relation not drawn by the + algorithm. + +Note that if a curator does not provide feedback on a similarity relation +suggested by the algorithm, then such a relation is considered valid. ### Creation of Representative Organizations -This stage executes an automatic workflow that faces the *duplicates grouping* stage to create representative organizations and to update them on the OpenAIRE Graph. Such organizations are obtained via transitive closure and the relations used comes from the curators' feedback gathered on the OpenOrgs underlying Database. 
+This stage executes an automatic workflow that carries out the *duplicates grouping* +stage to create representative organizations and to update them on the OpenAIRE +Graph. Such organizations are obtained via transitive closure and the relations +used come from the curators' feedback gathered on the OpenOrgs underlying +Database. #### Duplicates grouping (transitive closure) -Once the similarity relations between pairs of organizations have been gathered, the groups of equivalent organizations are obtained (transitive closure, i.e. “mesh”). From such sets a new representative organization is obtained, which inherits all properties from the merged records and keeps track of their provenance. +Once the similarity relations between pairs of organizations have been gathered, +the groups of equivalent organizations are obtained (transitive closure, i.e. +“mesh”). From such sets a new representative organization is obtained, which +inherits all properties from the merged records and keeps track of their +provenance. -The IDs of the representative organizations are obtained by the OpenOrgs Database that creates a unique ``openorgs`` ID for each approved organization. In case an organization is not approved by the curators, the ID is obtained by appending the prefix ``pending_org`` to the MD5 of the first ID (given their lexicographical ordering). \ No newline at end of file +The IDs of the representative organizations are obtained from the OpenOrgs +Database, which creates a unique ``openorgs`` ID for each approved organization. +In case an organization is not approved by the curators, the ID is obtained by +prepending the prefix ``pending_org`` to the MD5 of the first ID (given their +lexicographical ordering). 
\ No newline at end of file diff --git a/docs/graph-production-workflow/deduplication/research-products.md b/docs/graph-production-workflow/deduplication/research-products.md index ca56b89..e9fe2d1 100644 --- a/docs/graph-production-workflow/deduplication/research-products.md +++ b/docs/graph-production-workflow/deduplication/research-products.md @@ -149,13 +149,15 @@ The comparison goes through different stages: ### Duplicates grouping -The aim of the final stage is the creation of objects that group all the equivalent -entities discovered by the previous step. This is done in two phases. +The aim of the final stage is the creation of objects that group all the +equivalent +entities discovered by the previous step. This is done in two phases. #### Transitive closure + As a final step of duplicate identification a transitive closure -is run against similarity relations to find groups of duplicates not directly -caught by the previous steps. If a group is larger than 200 elements only the +is run against similarity relations to find groups of duplicates not directly +caught by the previous steps. If a group is larger than 200 elements only the first 200 elements will be included in the group, while the remaining will be kept ungrouped.
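Reviewer note: the transitive closure with the 200-element cap described in this hunk can be illustrated with a small union-find in Python. This is a sketch under assumptions, not the workflow's actual code: `group_duplicates` is a hypothetical name, identifiers are assumed to be strings, and "the first 200 elements" is interpreted as the first 200 in sorted order, since the text does not say how the order is defined.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MAX_GROUP_SIZE = 200  # cap stated in the documentation


def group_duplicates(
    pairs: Iterable[Tuple[str, str]],
    max_size: int = MAX_GROUP_SIZE,
) -> Tuple[List[List[str]], List[str]]:
    """Return (groups, ungrouped) from pairwise similarity relations."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Merge the two sides of every similarity relation.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each connected component.
    components: Dict[str, List[str]] = defaultdict(list)
    for node in list(parent):
        components[find(node)].append(node)

    groups: List[List[str]] = []
    ungrouped: List[str] = []
    for members in components.values():
        members.sort()  # assumption: "first" means first in sorted order
        groups.append(members[:max_size])
        ungrouped.extend(members[max_size:])
    return groups, ungrouped
```

With the relations `("a", "b")` and `("b", "c")`, the records `a` and `c` end up in the same group even though they were never compared directly, which is the case the closure is meant to catch.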