From a7c4daa10fb223384032340ea75f22a0cd15229f Mon Sep 17 00:00:00 2001 From: Serafeim Chatzopoulos Date: Mon, 3 Apr 2023 18:49:15 +0300 Subject: [PATCH 1/5] Add placeholders for info to be updated regarding pids per entity && hasAuthorInstitution detailed page --- docs/data-model/pids-and-identifiers.md | 61 +++++++++++++------ docs/data-model/relationships.md | 2 +- .../relationships/hasAuthorInstitution.md | 15 +++++ 3 files changed, 60 insertions(+), 18 deletions(-) create mode 100644 docs/data-model/relationships/hasAuthorInstitution.md diff --git a/docs/data-model/pids-and-identifiers.md b/docs/data-model/pids-and-identifiers.md index c613366..dfafe16 100644 --- a/docs/data-model/pids-and-identifiers.md +++ b/docs/data-model/pids-and-identifiers.md @@ -10,18 +10,10 @@ Not only, but the mappings applied to the original contents may also change and One of the fronts regards the attribution of the identity to the objects populating the graph. The basic idea is to build the identifiers of the objects in the graph from the PIDs available in some authoritative sources while considering all the other sources as by definition “unstable”. Examples of authoritative sources are Crossref and DataCite. Examples of non-authoritative ones are institutional repositories, aggregators, etc. PIDs from the authoritative sources would form the stable OpenAIRE ID skeleton of the Graph, precisely because they are immutable by construction. 
Such a policy defines a list of data sources that are considered authoritative for a specific type of PID they provide, whose effect is twofold: -* OpenAIRE IDs depend on persistent IDs when they are provided by the authority responsible to create them; +* OpenAIRE IDs depend on persistent IDs when they are provided by the authority responsible to create them * PIDs are included in the graph according to a tight criterion: the PID Types declared in the table below are considered to be mapped as PIDs only when they are collected from the relative PID authority data source. -| PID Type | Authority | -|-----------|-----------------------------------------------------------------------------------------------------| -| doi | [Crossref](https://www.crossref.org), [Datacite](https://datacite.org) | -| pmc, pmid | [Europe PubMed Central](https://europepmc.org/), [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc) | -| arXiv | [arXiv.org e-Print Archive](https://arxiv.org/) | -| uniprot | [Protein Data Bank](http://www.pdb.org/) | -| ena | [Protein Data Bank](http://www.pdb.org/) | -| pdb | [Protein Data Bank](http://www.pdb.org/) | - +[PID authorities table was removed from here] There is an exception though: Handle(s) are minted by several repositories; as listing them all would not be a viable option, to avoid losing them as PIDs, Handles bypass the PID authority filtering rule. In all other cases, PIDs are be included in the graph as alternate Identifiers. @@ -35,10 +27,7 @@ assigns PIDs to their scientific products from a given PID minter. This "selection" can be performed when the entities in the graph sharing the same identifier are grouped together. 
The list of the delegated authorities currently includes -| Datasource delegated | Datasource delegating | Pid Type | -|--------------------------------------|----------------------------------|-----------| -| [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | doi | -| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id | +[deletated authorities table was removed from here] ## Identifiers in the Graph @@ -47,7 +36,8 @@ OpenAIRE assigns internal identifiers for each object it collects. By default, the internal identifier is generated as `sourcePrefix::md5(localId)` where: * `sourcePrefix` is a namespace prefix of 12 chars assigned to the data source at registration time -* `localid` is the identifier assigned to the object by the data source +* `localId` is the identifier assigned to the object by the data source +[so, the openaire id of objects with no pid is based on this local id; is this always available?] After years of operation, we can say that: @@ -66,6 +56,33 @@ When the record is collected from a source which is not authoritative for any ty Currently, the following data sources are used as "PID authorities": +[PID authorities table was removed from here] + + + +OpenAIRE also perform duplicate identification (see the [dedicated section for details](/graph-production-workflow/deduplication)). +All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated record). 
+ + +## PID authorities per entity + +### Result + +| PID Type | Authority | +|-----------|-----------------------------------------------------------------------------------------------------| +| doi | [Crossref](https://www.crossref.org), [Datacite](https://datacite.org) | +| pmc, pmid | [Europe PubMed Central](https://europepmc.org/), [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc) | +| arXiv | [arXiv.org e-Print Archive](https://arxiv.org/) | +| uniprot | [Protein Data Bank](http://www.pdb.org/) | +| ena | [Protein Data Bank](http://www.pdb.org/) | +| pdb | [Protein Data Bank](http://www.pdb.org/) | + + +| Datasource delegated | Datasource delegating | Pid Type | +|--------------------------------------|----------------------------------|-----------| +| [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | doi | +| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id | + | PID Type | Prefix (12 chars) | Authority | |-----------|------------------------|-------------------------------------------| | doi | `doi_________` | Crossref, Datacite, Zenodo | @@ -77,5 +94,15 @@ Currently, the following data sources are used as "PID authorities": | pdb | `pdb_________` | EMBL-EBI | | uniprot | `uniprot_____` | EMBL-EBI | -OpenAIRE also perform duplicate identification (see the [dedicated section for details](/graph-production-workflow/deduplication)). -All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated record). +### Data source + +### Organization + +
* how we use OpenOrgs?
+ + +
* explain what is "pending" in the openaire id of some organizations
+ +### Project + +### Community diff --git a/docs/data-model/relationships.md b/docs/data-model/relationships.md index 18a4875..76f31f5 100644 --- a/docs/data-model/relationships.md +++ b/docs/data-model/relationships.md @@ -152,7 +152,7 @@ Note: the labels used to specify the semantic of the relationships are (for the | 19 | [Result](entities/result) | [Result](entities/result) | IsPreviousVersionOf / IsNewVersionOf | Harvested | | 20 | [Result](entities/result) | [Result](entities/result) | IsContinuedBy / Continues | Harvested | | 21 | [Result](entities/result) | [Result](entities/result) | IsDescribedBy / Describes | Harvested | -| 22 | [Result](entities/result) | [Organization](entities/organization) | hasAuthorInstitution / isAuthorInstitutionOf | Harvested, Inferred by OpenAIRE | +| 22 | [Result](entities/result) | [Organization](entities/organization) | hasAuthorInstitution / isAuthorInstitutionOf | Harvested, Inferred by OpenAIRE [(more)](relationships/hasAuthorInstitution) | | 23 | [Result](entities/result) | [Data source](entities/data-source) | isHostedBy / hosts | Harvested, Inferred by OpenAIRE | | 24 | [Result](entities/result) | [Data source](entities/data-source) | isProvidedBy / provides | Harvested | | 25 | [Result](entities/result) | [Community](entities/community) | IsRelatedTo / IsRelatedTo | Harvested, Inferred by OpenAIRE, Linked by user | diff --git a/docs/data-model/relationships/hasAuthorInstitution.md b/docs/data-model/relationships/hasAuthorInstitution.md new file mode 100644 index 0000000..657b9a3 --- /dev/null +++ b/docs/data-model/relationships/hasAuthorInstitution.md @@ -0,0 +1,15 @@ +# hasAuthorInstitution +#### Inverse relationship type: `isAuthorInstitutionOf` + +This relationship connects [Results](/data-model/entities/result) with the affiliated [Organizations](/data-model/entities/organization) for their authors. 
+ + +Specifically, we collect those relations from the following data sources: +[TODO: add more details and enrich the following list] + + +* MAG +* Institutional repositories + +Last but to least, the final graph is also enriched with `hasAuthorInstitution` relationships through the Propagation process; you can find more details +[here](/graph-production-workflow/deduction-and-propagation/propagation). \ No newline at end of file -- 2.17.1 From 6dd690b4d0b51cd923b59e0feffaad8f838b6589 Mon Sep 17 00:00:00 2001 From: Serafeim Chatzopoulos Date: Tue, 4 Apr 2023 15:16:41 +0300 Subject: [PATCH 2/5] Restructuring PID and identifiers page --- docs/data-model/pids-and-identifiers.md | 94 ++++++++++++------------- 1 file changed, 47 insertions(+), 47 deletions(-) diff --git a/docs/data-model/pids-and-identifiers.md b/docs/data-model/pids-and-identifiers.md index dfafe16..4e891c8 100644 --- a/docs/data-model/pids-and-identifiers.md +++ b/docs/data-model/pids-and-identifiers.md @@ -7,28 +7,21 @@ Not only, but the mappings applied to the original contents may also change and ## PID Authorities -One of the fronts regards the attribution of the identity to the objects populating the graph. The basic idea is to build the identifiers of the objects in the graph from the PIDs available in some authoritative sources while considering all the other sources as by definition “unstable”. Examples of authoritative sources are Crossref and DataCite. Examples of non-authoritative ones are institutional repositories, aggregators, etc. PIDs from the authoritative sources would form the stable OpenAIRE ID skeleton of the Graph, precisely because they are immutable by construction. +One of the fronts, regards the attribution of the identity to the objects populating the Graph. The basic idea is to build the identifiers of the objects in the Graph from the PIDs available in some authoritative sources, while considering all the other sources as by definition “unstable”. 
+For instance, Crossref and DataCite are considered to be authoritative sources for results, +contrary to institutional repositories, aggregators, etc. +PIDs from the authoritative sources would form the stable OpenAIRE ID skeleton of the Graph, precisely because they are immutable by construction. Such a policy defines a list of data sources that are considered authoritative for a specific type of PID they provide, whose effect is twofold: -* OpenAIRE IDs depend on persistent IDs when they are provided by the authority responsible to create them -* PIDs are included in the graph according to a tight criterion: the PID Types declared in the table below are considered to be mapped as PIDs only when they are collected from the relative PID authority data source. - -[PID authorities table was removed from here] +* OpenAIRE IDs depend on persistent IDs when they are provided by the authority responsible to create them. +* PIDs are included in the Graph according to a tight criterion: + +PIDs are considered valid only when they are collected from a relative PID authority data source. +For each entity, we outline the PID authorities per PID Type in the [following section](#pid-authorities-per-entity). +[TODO: refine this part if not accurate] There is an exception though: Handle(s) are minted by several repositories; as listing them all would not be a viable option, to avoid losing them as PIDs, Handles bypass the PID authority filtering rule. -In all other cases, PIDs are be included in the graph as alternate Identifiers. - -## Delegated authorities - -When a record is aggregated from multiple sources considered authoritative for minting specific PIDs, different mappings could be applied to them and, depending on the case, -this could result in inconsistencies in the attribution of the field values. -To overcome the issue, the intuition is to include such records only once in the graph. 
To do so, the concept of "delegated authorities" defines a list of datasources that -assigns PIDs to their scientific products from a given PID minter. - -This "selection" can be performed when the entities in the graph sharing the same identifier are grouped together. The list of the delegated authorities currently includes - -[deletated authorities table was removed from here] - +In all other cases, PIDs are be included in the Graph as alternate Identifiers. ## Identifiers in the Graph @@ -37,9 +30,8 @@ By default, the internal identifier is generated as `sourcePrefix::md5(localId)` * `sourcePrefix` is a namespace prefix of 12 chars assigned to the data source at registration time * `localId` is the identifier assigned to the object by the data source -[so, the openaire id of objects with no pid is based on this local id; is this always available?] -After years of operation, we can say that: +After years of operation, we can conclude that: * `localId` are generally unstable * objects can disappear from sources @@ -54,47 +46,51 @@ When the record is collected from a source which is not authoritative for any ty * the identity of the record is forged as usual using the local identifier * the PID, if available, is added as `alternateIdentifier` -Currently, the following data sources are used as "PID authorities": +You can review the list of the PID authorities per entity in the [following section](#pid-authorities-per-entity). -[PID authorities table was removed from here] - - - -OpenAIRE also perform duplicate identification (see the [dedicated section for details](/graph-production-workflow/deduplication)). +OpenAIRE also performs duplicate identification (see the [dedicated section for details](/graph-production-workflow/deduplication)). All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated record). 
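The identifier scheme described under "Identifiers in the Graph" can be sketched in a few lines of Python. This is a minimal illustration and not the OpenAIRE implementation: the function names, the UTF-8 encoding, and the underscore padding of short prefixes (inferred from prefixes such as `doi_________`) are assumptions. It also leaves out the entity-type marker that precedes full identifiers in the Graph.

```python
import hashlib

PREFIX_LENGTH = 12  # the docs state that namespace prefixes are 12 characters long


def pad_prefix(name: str) -> str:
    """Pad a short name with underscores up to 12 characters (assumed convention)."""
    if len(name) > PREFIX_LENGTH:
        raise ValueError(f"prefix {name!r} is longer than {PREFIX_LENGTH} characters")
    return name.ljust(PREFIX_LENGTH, "_")


def md5_hex(value: str) -> str:
    """Hex MD5 digest of the value; UTF-8 encoding is an assumption."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def local_identifier(source_prefix: str, local_id: str) -> str:
    """Default scheme: sourcePrefix::md5(localId)."""
    return f"{pad_prefix(source_prefix)}::{md5_hex(local_id)}"


def pid_identifier(pid_type: str, pid_value: str) -> str:
    """Scheme for records from a PID authority: pidTypePrefix::md5(lowercase(pid))."""
    return f"{pad_prefix(pid_type)}::{md5_hex(pid_value.lower())}"


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(pid_identifier("doi", "10.1234/EXAMPLE"))
    print(local_identifier("doajarticles", "oai:example.org:42"))
```

Because the PID value is lowercased before hashing, two records that carry the same DOI in different letter case end up with the same identifier, whereas an identifier built from a `localId` changes whenever the source changes that local identifier.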
## PID authorities per entity +This section gathers all PID Types and their respective authorities for each entity in the Graph. + ### Result -| PID Type | Authority | -|-----------|-----------------------------------------------------------------------------------------------------| -| doi | [Crossref](https://www.crossref.org), [Datacite](https://datacite.org) | -| pmc, pmid | [Europe PubMed Central](https://europepmc.org/), [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc) | -| arXiv | [arXiv.org e-Print Archive](https://arxiv.org/) | -| uniprot | [Protein Data Bank](http://www.pdb.org/) | -| ena | [Protein Data Bank](http://www.pdb.org/) | -| pdb | [Protein Data Bank](http://www.pdb.org/) | +| PID Type | Authority | OpenAIRE ID prefix (12 chars) | +|-----------|------------------------|-----------------------| +| doi | [Crossref](https://www.crossref.org), [Datacite](https://datacite.org), Zenodo | `doi_________` +| pmid | [Europe PubMed Central](https://europepmc.org/), [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc) | `pmid________` +| pmc | [Europe PubMed Central](https://europepmc.org/), [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc) | `pmc_________` +| arXiv | [arXiv.org e-Print Archive](https://arxiv.org/) | `arXiv_______` +| uniprot | [Protein Data Bank](http://www.pdb.org/) [ or EMBL-EBI ?] | `uniprot_____` +| ena | [Protein Data Bank](http://www.pdb.org/) [ or EMBL-EBI ?] | `ena_________` +| pdb | [Protein Data Bank](http://www.pdb.org/) [ or EMBL-EBI ?] | `pdb_________` +| handle | Any repository | `handle______` + +#### Delegated authorities + +[TODO: the problem that this solves is that we can get a specific PID from more than one auhtoritative sources right ? For example, if we get DOIs from Crossref, Datacite, and Zenodo (btw Zenodo was not mentioned in the first table). 
+Can't we mention those sources by priority in the first table and simply mention in the text that we prefer to collect those PIDs starting from the first till the last one? Is this the problem or I am missing something else here?] + +When a record is aggregated from multiple sources considered authoritative for minting specific PIDs, different mappings could be applied to them and, depending on the case, +this could result in inconsistencies in the attribution of the field values. +To overcome the issue, the intuition is to include such records only once in the Graph. To do so, the concept of "delegated authorities" defines a list of datasources that +assigns PIDs to their scientific products from a given PID minter. + +This "selection" can be performed when the entities in the Graph sharing the same identifier are grouped together. +The list of the delegated authorities currently includes the following: -| Datasource delegated | Datasource delegating | Pid Type | +| PID Type | Datasource delegated | Datasource delegating | |--------------------------------------|----------------------------------|-----------| -| [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | doi | -| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id | +| doi | [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | +| w3id [is not mentioned in the table above] | [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | -| PID Type | Prefix (12 chars) | Authority | -|-----------|------------------------|-------------------------------------------| -| doi | `doi_________` | Crossref, Datacite, Zenodo | -| pmc | `pmc_________` | Europe PubMed Central, PubMed Central | -| pmid | `pmid________` | Europe PubMed Central, PubMed Central | -| arXiv | `arXiv_______` | arXiv.org e-Print Archive | -| handle | `handle______` | any repository | -| ena | `ena_________` | EMBL-EBI | -| pdb | `pdb_________` | EMBL-EBI | -| uniprot | `uniprot_____` | 
EMBL-EBI | ### Data source +[TODO] ### Organization @@ -104,5 +100,9 @@ All duplicates are **merged** together in a **representative record** which must
* explain what is "pending" in the openaire id of some organizations
### Project +[TODO] + ### Community +[TODO] + -- 2.17.1 From c3661c9547085dd01eb1049d08309eff937fcab7 Mon Sep 17 00:00:00 2001 From: Serafeim Chatzopoulos Date: Tue, 4 Apr 2023 15:17:55 +0300 Subject: [PATCH 3/5] Fix typos --- .../enrichment-by-mining/affiliation_matching.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md b/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md index 539e51b..f79843d 100644 --- a/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md +++ b/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md @@ -4,7 +4,7 @@ sidebar_position: 1 # Affiliation matching -***Short description:*** The goal of the affiliation matching module is to match affiliations extracted from the pdf and xml documents with organizations from the OpenAIRE organization database. +***Short description:*** The goal of the affiliation matching module is to match affiliations extracted from the PDF and XML documents with organizations from the OpenAIRE organization database. 
***Algorithmic details:*** -- 2.17.1 From 89a36878bb0e513e9ea43ef23a340d0b4cd5d70f Mon Sep 17 00:00:00 2001 From: Claudio Atzori Date: Tue, 11 Apr 2023 16:53:42 +0200 Subject: [PATCH 4/5] revised affiliation provenance section and pid and identifiers page --- docs/data-model/pids-and-identifiers.md | 64 +++--- docs/data-model/relationships/affiliation.md | 17 ++ .../relationships/hasAuthorInstitution.md | 15 -- .../relationships/relationship-types.md | 190 +++++++++++++++--- 4 files changed, 210 insertions(+), 76 deletions(-) create mode 100644 docs/data-model/relationships/affiliation.md delete mode 100644 docs/data-model/relationships/hasAuthorInstitution.md diff --git a/docs/data-model/pids-and-identifiers.md b/docs/data-model/pids-and-identifiers.md index 4e891c8..395b224 100644 --- a/docs/data-model/pids-and-identifiers.md +++ b/docs/data-model/pids-and-identifiers.md @@ -1,27 +1,24 @@ # PIDs and identifiers One of the challenges towards the stability of the contents in the OpenAIRE Graph consists of making its identifiers and records stable over time. -The barriers to this scenario are many, as the Graph keeps a map of data sources that is subject to constant variations: records in repositories vary in content, -original IDs, and PIDs, may disappear or reappear, and the same holds for the repository or the metadata collection it exposes. +The barriers to this scenario are many, as the Graph keeps a map of data sources that is subject to constant variations: records in repositories vary in content, original IDs, and PIDs, may disappear or reappear, and the same holds for the repository or the metadata collection it exposes. Not only, but the mappings applied to the original contents may also change and improve over time to catch up with the changes in the input records. ## PID Authorities One of the fronts, regards the attribution of the identity to the objects populating the Graph. 
The basic idea is to build the identifiers of the objects in the Graph from the PIDs available in some authoritative sources, while considering all the other sources as by definition “unstable”. -For instance, Crossref and DataCite are considered to be authoritative sources for results, -contrary to institutional repositories, aggregators, etc. +For instance, Crossref and DataCite are considered to be authoritative sources for results, contrary to institutional repositories, aggregators, etc. PIDs from the authoritative sources would form the stable OpenAIRE ID skeleton of the Graph, precisely because they are immutable by construction. Such a policy defines a list of data sources that are considered authoritative for a specific type of PID they provide, whose effect is twofold: * OpenAIRE IDs depend on persistent IDs when they are provided by the authority responsible to create them. * PIDs are included in the Graph according to a tight criterion: - -PIDs are considered valid only when they are collected from a relative PID authority data source. + +The PID Types declared in the table below are considered to be mapped as [`result.pid`](entities/result#pid) and [`result.instance[].pid`](entities/other#pid-1) only when they are collected from a relative PID authority data source. For each entity, we outline the PID authorities per PID Type in the [following section](#pid-authorities-per-entity). -[TODO: refine this part if not accurate] There is an exception though: Handle(s) are minted by several repositories; as listing them all would not be a viable option, to avoid losing them as PIDs, Handles bypass the PID authority filtering rule. -In all other cases, PIDs are be included in the Graph as alternate Identifiers. +In all other cases, PIDs are included in the Graph as alternate Identifiers. 
## Identifiers in the Graph @@ -39,12 +36,13 @@ After years of operation, we can conclude that: Therefore, when the record is collected from an authoritative source: -* the identity of the record is forged using the PID, like `pidTypePrefix::md5(lowercase(doi))` +* the identity of the record is forged using the PID, like `pidTypePrefix::md5(lowercase(pid value))` * the PID is added in a `pid` element of the data model -When the record is collected from a source which is not authoritative for any type of PID: -* the identity of the record is forged as usual using the local identifier +When the record is collected from a source which is _not_ authoritative for any type of PID: +* the identity of the record is forged as usual using the local identifier (typically the [oai identifier](http://www.openarchives.org/OAI/2.0/guidelines-oai-identifier.htm)) * the PID, if available, is added as `alternateIdentifier` +* Handles are still mapped as PIDs, although they are not associated with any OpenAIRE internal identifier prefix You can review the list of the PID authorities per entity in the [following section](#pid-authorities-per-entity). @@ -58,20 +56,19 @@ This section gathers all PID Types and their respective authorities for each ent ### Result -| PID Type | Authority | OpenAIRE ID prefix (12 chars) | -|-----------|------------------------|-----------------------| -| doi | [Crossref](https://www.crossref.org), [Datacite](https://datacite.org), Zenodo | `doi_________` -| pmid | [Europe PubMed Central](https://europepmc.org/), [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc) | `pmid________` -| pmc | [Europe PubMed Central](https://europepmc.org/), [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc) | `pmc_________` -| arXiv | [arXiv.org e-Print Archive](https://arxiv.org/) | `arXiv_______` -| uniprot | [Protein Data Bank](http://www.pdb.org/) [ or EMBL-EBI ?] | `uniprot_____` -| ena | [Protein Data Bank](http://www.pdb.org/) [ or EMBL-EBI ?] 
| `ena_________` -| pdb | [Protein Data Bank](http://www.pdb.org/) [ or EMBL-EBI ?] | `pdb_________` -| handle | Any repository | `handle______` +| PID Type | Authority | OpenAIRE ID prefix (12 chars) | +|----------|-----------------------------------------------------------------------------------------------------|----------------------------------------------| +| doi | [Crossref](https://www.crossref.org), [Datacite](https://datacite.org) | `doi_________` | +| pmid | [Europe PubMed Central](https://europepmc.org/), [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc) | `pmid________` | +| pmc | [Europe PubMed Central](https://europepmc.org/), [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc) | `pmc_________` | +| arXiv | [arXiv.org e-Print Archive](https://arxiv.org/) | `arXiv_______` | +| uniprot | [Protein Data Bank](http://www.pdb.org/) [ or EMBL-EBI ?] | `uniprot_____` | +| ena | [Protein Data Bank](http://www.pdb.org/) [ or EMBL-EBI ?] | `ena_________` | +| pdb | [Protein Data Bank](http://www.pdb.org/) [ or EMBL-EBI ?] | `pdb_________` | #### Delegated authorities -[TODO: the problem that this solves is that we can get a specific PID from more than one auhtoritative sources right ? For example, if we get DOIs from Crossref, Datacite, and Zenodo (btw Zenodo was not mentioned in the first table). +[TODO: the problem that this solves is that we can get a specific PID from more than one authoritative sources right ? For example, if we get DOIs from Crossref, Datacite, and Zenodo (btw Zenodo was not mentioned in the first table). Can't we mention those sources by priority in the first table and simply mention in the text that we prefer to collect those PIDs starting from the first till the last one? Is this the problem or I am missing something else here?] 
When a record is aggregated from multiple sources considered authoritative for minting specific PIDs, different mappings could be applied to them and, depending on the case, @@ -80,17 +77,27 @@ To overcome the issue, the intuition is to include such records only once in the assigns PIDs to their scientific products from a given PID minter. This "selection" can be performed when the entities in the Graph sharing the same identifier are grouped together. -The list of the delegated authorities currently includes the following: +The list of the delegated authorities currently includes the following, which can be considered as an extension of the table above: -| PID Type | Datasource delegated | Datasource delegating | -|--------------------------------------|----------------------------------|-----------| -| doi | [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | -| w3id [is not mentioned in the table above] | [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | +| PID Type | Datasource delegated | Datasource delegating | OpenAIRE ID prefix (12 chars) | +|----------|--------------------------------------|----------------------------------|-------------------------------| +| doi | [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | `doi_________` | +| w3id | [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | `w3id________` | ### Data source -[TODO] + +The following table lists the most important registries from which OpenAIRE imports datasource records. 
+ +| PID Type | Authority | OpenAIRE ID prefix (12 chars) | +|------------------------|--------------------------------------------------------------------------------------------|-------------------------------| +| OpenDOAR ID | [OpenDOAR](https://v2.sherpa.ac.uk/opendoar/) | `opendoar____` | +| Re3Data ID | [re3data](https://www.re3data.org/) | `re3data_____` | +| Fairsharing | [Fairsharing](https://fairsharing.org/) | `fairsharing_` | +| EuroCRIS - DRIS | [EuroCRIS - Directory of Research Information Systems](https://eurocris.org/services/dris) | `eurocrisdris` | +| EOSC Service Catalogue | [EOSC Service Catalogue](https://eosc-portal.eu/services-resources) | `eosc________` | + ### Organization @@ -102,7 +109,6 @@ The list of the delegated authorities currently includes the following: ### Project [TODO] - ### Community [TODO] diff --git a/docs/data-model/relationships/affiliation.md b/docs/data-model/relationships/affiliation.md new file mode 100644 index 0000000..dc1e9c7 --- /dev/null +++ b/docs/data-model/relationships/affiliation.md @@ -0,0 +1,17 @@ +# Affiliation + +This relationship connects [Results](/data-model/entities/result) with the affiliated [Organizations](/data-model/entities/organization) for their authors. + +* **[Result](/data-model/entities/result) hasAuthorInstitution [Organizations](/data-model/entities/organization)** +* **[Organizations](/data-model/entities/organization) isAuthorInstitutionOf [Result](/data-model/entities/result)** + +Specifically, OpenAIRE collects those relations from the following data sources: + +* MAG +* CNR ExploRA (Institutional repository) as a pilot case. + +Note that the aggregation of affiliation links from repositories is supported in the [OpenAIRE Guidelines v4.1-rc1](https://openaire-guidelines-for-literature-repository-managers.readthedocs.io/en/latest/field_creator.html#attribute-affiliationidentifier-r) (yet to be released). 
+ +Moreover, the final graph is also enriched with `affiliation` relationships through the following processes: +* Context propagation on the Graph; you can find more details [here](/graph-production-workflow/deduction-and-propagation/propagation). +* Text and data mining (TDM) through the affiliation matching algorithm; you can find more details [here](/graph-production-workflow/enrichment-by-mining/affiliation_matching.md). \ No newline at end of file diff --git a/docs/data-model/relationships/hasAuthorInstitution.md b/docs/data-model/relationships/hasAuthorInstitution.md deleted file mode 100644 index 657b9a3..0000000 --- a/docs/data-model/relationships/hasAuthorInstitution.md +++ /dev/null @@ -1,15 +0,0 @@ -# hasAuthorInstitution -#### Inverse relationship type: `isAuthorInstitutionOf` - -This relationship connects [Results](/data-model/entities/result) with the affiliated [Organizations](/data-model/entities/organization) for their authors. - - -Specifically, we collect those relations from the following data sources: -[TODO: add more details and enrich the following list] - - -* MAG -* Institutional repositories - -Last but to least, the final graph is also enriched with `hasAuthorInstitution` relationships through the Propagation process; you can find more details -[here](/graph-production-workflow/deduction-and-propagation/propagation). \ No newline at end of file diff --git a/docs/data-model/relationships/relationship-types.md b/docs/data-model/relationships/relationship-types.md index 55378b3..131b580 100644 --- a/docs/data-model/relationships/relationship-types.md +++ b/docs/data-model/relationships/relationship-types.md @@ -1,37 +1,163 @@ -# Relationship types +--- +sidebar_position: 2 +--- + +# Relationships + +A relationship in the graph is represented by the following data type, which models a directed edge between two nodes and provides information about the semantics of the relation, its provenance and its validation.
+ +--- + +## The `Relationship` object + +### source +_Type: [Node](#the-node-object) • Cardinality: ONE_ + +Represents the source node in the relation. + +```json +"source": { + "id": "20|openorgs____::1cb75a3ad756e4c83e455e3e7347643b", + "type": "organization" +} +``` + +### target +_Type: [Node](#the-node-object) • Cardinality: ONE_ + +Represents the target node in the relation. + +```json +"target": { + "id": "10|doajarticles::022409068174087a003647ff46070f7f", + "type": "datasource" +} +``` + +### reltype +_Type: [RelType](#the-reltype-object) • Cardinality: ONE_ + +Represents the semantics of the relation between two nodes of the graph. + +```json +"reltype": { + "name": "provides", + "type": "provision" +} +``` +### provenance +_Type: [Provenance](entities/other#provenance-1) • Cardinality: ONE_ + +Indicates the process that produced (or provided) the information. + +```json +"provenance": { + "provenance": "Harvested", + "trust": "0.900" +} +``` + +### validated +_Type: Boolean • Cardinality: ONE_ + +Indicates whether or not the relation was validated. + +```json +"validated": true +``` + +### validationDate +_Type: String • Cardinality: ONE_ + +Indicates the validation date of the relation; it applies only when the validated flag is set to true. + +```json +"validationDate": "2022-09-02" +``` + +--- + +## The `Node` object + +The Node data type contains the minimum information needed to identify a graph node: its identifier and entity type. + + +### id +_Type: String • Cardinality: ONE_ + +OpenAIRE identifier of the node in the graph. + +```json +"id": "10|doajarticles::022409068174087a003647ff46070f7f" +``` + +### type +_Type: String • Cardinality: ONE_ + +Graph node type. + +```json +"type": "datasource" +``` + +## The `RelType` object + +The RelType data type models the semantics of the relationship between two nodes. + +### type +_Type: String • Cardinality: ONE_ + +Relation category, e.g. affiliation, citation, see table Relation typologies. 
+ +```json +"type": "provision" +``` + +### name +_Type: String • Cardinality: ONE_ + +Further specifies the relation semantics, indicating the relation direction, e.g. Cites, isCitedBy. + +```json +"name": "provides" +``` +--- + +## Relationship types The following table lists all the possible relation semantics found in the graph dump. Note: the labels used to specify the semantic of the relationships are (for the large) inherited from the [DataCite metadata kernel](https://schema.datacite.org/meta/kernel-4.4/doc/DataCite-MetadataKernel_v4.4.pdf), which provides a description for them. -| # | Source entity type | Target entity type | Relation name / inverse | Provenance | -|:--:|:--------------------------------------:|:--------------------------------------:|:----------------------------------------------------------:|:-----------------------------------------------:| -| 1 | [Project](/data-model/entities/project) | [Result](/data-model/entities/result) | produces / isProducedBy | Harvested, Inferred by OpenAIRE, Linked by user | -| 2 | [Project](/data-model/entities/project) | [Organization](/data-model/entities/organization) | hasParticipant / isParticipant | Harvested | -| 3 | [Project](/data-model/entities/project) | [Community](/data-model/entities/community) | IsRelatedTo / IsRelatedTo | Linked by user | -| 4 | [Result](/data-model/entities/result) | [Result](/data-model/entities/result) | IsAmongTopNSimilarDocuments / HasAmongTopNSimilarDocuments | Inferred by OpenAIRE | -| 5 | [Result](/data-model/entities/result) | [Result](/data-model/entities/result) | IsSupplementTo / IsSupplementedBy | Harvested | -| 6 | [Result](/data-model/entities/result) | [Result](/data-model/entities/result) | IsRelatedTo / IsRelatedTo | Harvested, Inferred by OpenAIRE, Linked by user | -| 7 | [Result](/data-model/entities/result) | [Result](/data-model/entities/result) | IsPartOf / HasPart | Harvested | -| 8 | [Result](/data-model/entities/result) | 
[Result](/data-model/entities/result) | IsDocumentedBy / Documents | Harvested | -| 9 | [Result](/data-model/entities/result) | [Result](/data-model/entities/result) | IsObsoletedBy / Obsoletes | Harvested | -| 10 | [Result](/data-model/entities/result) | [Result](/data-model/entities/result) | IsSourceOf / IsDerivedFrom | Harvested | -| 11 | [Result](/data-model/entities/result) | [Result](/data-model/entities/result) | IsCompiledBy / Compiles | Harvested | -| 12 | [Result](/data-model/entities/result) | [Result](/data-model/entities/result) | IsRequiredBy / Requires | Harvested | -| 13 | [Result](/data-model/entities/result) | [Result](/data-model/entities/result) | IsCitedBy / Cites | Harvested, Inferred by OpenAIRE | -| 14 | [Result](/data-model/entities/result) | [Result](/data-model/entities/result) | IsReferencedBy / References | Harvested | -| 15 | [Result](/data-model/entities/result) | [Result](/data-model/entities/result) | IsReviewedBy / Reviews | Harvested | -| 16 | [Result](/data-model/entities/result) | [Result](/data-model/entities/result) | IsOriginalFormOf / IsVariantFormOf | Harvested | -| 17 | [Result](/data-model/entities/result) | [Result](/data-model/entities/result) | IsVersionOf / HasVersion | Harvested | -| 18 | [Result](/data-model/entities/result) | [Result](/data-model/entities/result) | IsIdenticalTo / IsIdenticalTo | Harvested | -| 19 | [Result](/data-model/entities/result) | [Result](/data-model/entities/result) | IsPreviousVersionOf / IsNewVersionOf | Harvested | -| 20 | [Result](/data-model/entities/result) | [Result](/data-model/entities/result) | IsContinuedBy / Continues | Harvested | -| 21 | [Result](/data-model/entities/result) | [Result](/data-model/entities/result) | IsDescribedBy / Describes | Harvested | -| 22 | [Result](/data-model/entities/result) | [Organization](/data-model/entities/organization) | hasAuthorInstitution / isAuthorInstitutionOf | Harvested, Inferred by OpenAIRE | -| 23 | 
[Result](/data-model/entities/result) | [Data source](/data-model/entities/data-source) | isHostedBy / hosts | Harvested, Inferred by OpenAIRE | -| 24 | [Result](/data-model/entities/result) | [Data source](/data-model/entities/data-source) | isProvidedBy / provides | Harvested | -| 25 | [Result](/data-model/entities/result) | [Community](/data-model/entities/community) | IsRelatedTo / IsRelatedTo | Harvested, Inferred by OpenAIRE, Linked by user | -| 26 | [Organization](/data-model/entities/organization) | [Community](/data-model/entities/community) | IsRelatedTo / IsRelatedTo | Linked by user | -| 27 | [Organization](/data-model/entities/organization) | [Organization](/data-model/entities/organization) | IsChildOf / IsParentOf | Linked by user | -| 28 | [Data source](/data-model/entities/data-source) | [Community](/data-model/entities/community) | IsRelatedTo / IsRelatedTo | Linked by user | -| 29 | [Data source](/data-model/entities/data-source) | [Organization](/data-model/entities/organization) | isProvidedBy / provides | Harvested | +| # | Source entity type | Target entity type | Relation name / inverse | Provenance | +|-----|:-------------------------------------:|:-------------------------------------:|:----------------------------------------------------------:|:-----------------------------------------------------:| +| 1 | [Project](entities/project) | [Result](entities/result) | produces / isProducedBy | Harvested, Inferred by OpenAIRE, Linked by user | +| 2 | [Project](entities/project) | [Organization](entities/organization) | hasParticipant / isParticipant | Harvested | +| 3 | [Project](entities/project) | [Community](entities/community) | IsRelatedTo / IsRelatedTo | Linked by user | +| 4 | [Result](entities/result) | [Result](entities/result) | IsAmongTopNSimilarDocuments / HasAmongTopNSimilarDocuments | Inferred by OpenAIRE | +| 5 | [Result](entities/result) | [Result](entities/result) | IsSupplementTo / IsSupplementedBy | Harvested | +| 6 | 
[Result](entities/result) | [Result](entities/result) | IsRelatedTo / IsRelatedTo | Harvested, Inferred by OpenAIRE, Linked by user | +| 7 | [Result](entities/result) | [Result](entities/result) | IsPartOf / HasPart | Harvested | +| 8 | [Result](entities/result) | [Result](entities/result) | IsDocumentedBy / Documents | Harvested | +| 9 | [Result](entities/result) | [Result](entities/result) | IsObsoletedBy / Obsoletes | Harvested | +| 10 | [Result](entities/result) | [Result](entities/result) | IsSourceOf / IsDerivedFrom | Harvested | +| 11 | [Result](entities/result) | [Result](entities/result) | IsCompiledBy / Compiles | Harvested | +| 12 | [Result](entities/result) | [Result](entities/result) | IsRequiredBy / Requires | Harvested | +| 13 | [Result](entities/result) | [Result](entities/result) | IsCitedBy / Cites | Harvested, Inferred by OpenAIRE | +| 14 | [Result](entities/result) | [Result](entities/result) | IsReferencedBy / References | Harvested | +| 15 | [Result](entities/result) | [Result](entities/result) | IsReviewedBy / Reviews | Harvested | +| 16 | [Result](entities/result) | [Result](entities/result) | IsOriginalFormOf / IsVariantFormOf | Harvested | +| 17 | [Result](entities/result) | [Result](entities/result) | IsVersionOf / HasVersion | Harvested | +| 18 | [Result](entities/result) | [Result](entities/result) | IsIdenticalTo / IsIdenticalTo | Harvested | +| 19 | [Result](entities/result) | [Result](entities/result) | IsPreviousVersionOf / IsNewVersionOf | Harvested | +| 20 | [Result](entities/result) | [Result](entities/result) | IsContinuedBy / Continues | Harvested | +| 21 | [Result](entities/result) | [Result](entities/result) | IsDescribedBy / Describes | Harvested | +| 22 | [Result](entities/result) | [Organization](entities/organization) | hasAuthorInstitution / isAuthorInstitutionOf | Harvested, Inferred by OpenAIRE [(more)](affiliation) | +| 23 | [Result](entities/result) | [Data source](entities/data-source) | isHostedBy / hosts | 
Harvested, Inferred by OpenAIRE | +| 24 | [Result](entities/result) | [Data source](entities/data-source) | isProvidedBy / provides | Harvested | +| 25 | [Result](entities/result) | [Community](entities/community) | IsRelatedTo / IsRelatedTo | Harvested, Inferred by OpenAIRE, Linked by user | +| 26 | [Organization](entities/organization) | [Community](entities/community) | IsRelatedTo / IsRelatedTo | Linked by user | +| 27 | [Organization](entities/organization) | [Organization](entities/organization) | IsChildOf / IsParentOf | Linked by user | +| 28 | [Data source](entities/data-source) | [Community](entities/community) | IsRelatedTo / IsRelatedTo | Linked by user | +| 29 | [Data source](entities/data-source) | [Organization](entities/organization) | isProvidedBy / provides | Harvested | + -- 2.17.1 From cf81cb5ba3e7bb75b4ba6b327dc722e85e550923 Mon Sep 17 00:00:00 2001 From: Thanasis Vergoulis Date: Fri, 21 Apr 2023 22:35:43 +0200 Subject: [PATCH 5/5] Update 'docs/data-model/pids-and-identifiers.md' --- docs/data-model/pids-and-identifiers.md | 34 +++++++++++++++---------- 1 file changed, 21 insertions(+), 13 deletions(-) diff --git a/docs/data-model/pids-and-identifiers.md b/docs/data-model/pids-and-identifiers.md index 395b224..b12beb5 100644 --- a/docs/data-model/pids-and-identifiers.md +++ b/docs/data-model/pids-and-identifiers.md @@ -1,21 +1,20 @@ -# PIDs and identifiers +# Object identifiers -One of the challenges towards the stability of the contents in the OpenAIRE Graph consists of making its identifiers and records stable over time. -The barriers to this scenario are many, as the Graph keeps a map of data sources that is subject to constant variations: records in repositories vary in content, original IDs, and PIDs, may disappear or reappear, and the same holds for the repository or the metadata collection it exposes. 
-Not only, but the mappings applied to the original contents may also change and improve over time to catch up with the changes in the input records. +One of the challenges towards the stability of the contents in the OpenAIRE Graph consists of making its objects and their identifiers (called "OpenAIRE IDs") stable over time. +~~The barriers to this scenario are many, as the Graph keeps a map of data sources that is subject to constant variations: records in repositories vary in content, original identifiers, and persistent identifiers (PIDs), may disappear or reappear, and the same holds for the repository or the metadata collection it exposes.~~ +Not only can the mappings applied to the original contents change over time, but they can also improve to catch up with changes in the input records. -## PID Authorities +## Adding stability using PIDs -One of the fronts, regards the attribution of the identity to the objects populating the Graph. The basic idea is to build the identifiers of the objects in the Graph from the PIDs available in some authoritative sources, while considering all the other sources as by definition “unstable”. +One of the main issues concerns the attribution of the identity to the objects populating the Graph. The basic idea is to build the identifiers of the objects in the Graph from the related PIDs, where they are available. As a result, PIDs are collected and stored inside the respective objects (in the `pid` field). +However, although various sources can provide object-related PIDs, some of them can be "unstable". For that reason, during the process, only the PIDs available from some "authoritative", stable sources are being considered for the population of the values in the `pid` field and for the creation of the OpenAIRE IDs. OpenAIRE maintains a [list of data sources that are considered authoritative](#pid-authorities) for each specific type of PID. 
For instance, Crossref and DataCite are considered to be authoritative sources for results, contrary to institutional repositories, aggregators, etc. PIDs from the authoritative sources would form the stable OpenAIRE ID skeleton of the Graph, precisely because they are immutable by construction. -Such a policy defines a list of data sources that are considered authoritative for a specific type of PID they provide, whose effect is twofold: -* OpenAIRE IDs depend on persistent IDs when they are provided by the authority responsible to create them. -* PIDs are included in the Graph according to a tight criterion: + -The PID Types declared in the table below are considered to be mapped as [`result.pid`](entities/result#pid) and [`result.instance[].pid`](entities/other#pid-1) only when they are collected from a relative PID authority data source. -For each entity, we outline the PID authorities per PID Type in the [following section](#pid-authorities-per-entity). +~~The PID Types declared in the table below are considered to be mapped as [`result.pid`](entities/result#pid) and [`result.instance[].pid`](entities/other#pid-1) only when they are collected from a relative PID authority data source. +For each entity, we outline the PID authorities per PID Type in the [following section](#pid-authorities-per-entity).~~ There is an exception though: Handle(s) are minted by several repositories; as listing them all would not be a viable option, to avoid losing them as PIDs, Handles bypass the PID authority filtering rule. In all other cases, PIDs are included in the Graph as alternate Identifiers. @@ -50,9 +49,18 @@ OpenAIRE also performs duplicate identification (see the [dedicated section for All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated record). 
-## PID authorities per entity +## OpenAIRE ID prefixes -This section gathers all PID Types and their respective authorities for each entity in the Graph. +| Prefix (12 chars) | Interpretation | +|-------------------|----------------| +| `doi_________` | constructed based on a DOI | +| `pmid________` | ... | + + + +## PID authorities + +This section describes the PID types supported by the OpenAIRE Graph, along with their respective authoritative sources. ### Result -- 2.17.1
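
The 12-character prefixes in the "OpenAIRE ID prefixes" table, and the sample identifiers used in the `Relationship` examples (e.g. `20|openorgs____::1cb75a3ad756e4c83e455e3e7347643b`), suggest a simple layout: a two-digit code, a `|`, a prefix right-padded with underscores to 12 characters, a `::` separator, and a local part. The following is a minimal Python sketch of that reading. The helper names `pad_prefix` and `split_openaire_id` are hypothetical, and the pattern is inferred only from the examples in this patch, not from an official specification.

```python
import re

# Width taken from the "Prefix (12 chars)" table header in the patch.
PREFIX_LENGTH = 12

# Layout inferred from the sample IDs in the patch (an assumption):
# two digits, "|", a 12-character prefix, "::", then a local identifier.
ID_PATTERN = re.compile(r"^(?P<code>\d{2})\|(?P<prefix>.{12})::(?P<local>.+)$")


def pad_prefix(name: str) -> str:
    """Right-pad a source name with underscores up to PREFIX_LENGTH characters."""
    if len(name) > PREFIX_LENGTH:
        raise ValueError(f"prefix source longer than {PREFIX_LENGTH} chars: {name!r}")
    return name.ljust(PREFIX_LENGTH, "_")


def split_openaire_id(openaire_id: str) -> dict:
    """Split an identifier into its code, prefix, and local parts."""
    match = ID_PATTERN.match(openaire_id)
    if match is None:
        raise ValueError(f"unexpected identifier layout: {openaire_id!r}")
    return match.groupdict()


if __name__ == "__main__":
    print(pad_prefix("doi"))
    print(split_openaire_id("10|doajarticles::022409068174087a003647ff46070f7f"))
```

Keeping the padding and the parsing in separate helpers makes it easy to adjust either one if the actual identifier rules turn out to differ from what the examples imply.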