aggregation section #2

Merged
schatz merged 27 commits from aggregation into main 2022-11-09 12:01:13 +01:00

This PR introduces the aggregation section. It is currently a work in progress.
claudio.atzori self-assigned this 2022-10-04 15:50:29 +02:00
claudio.atzori added 2 commits 2022-10-04 15:50:34 +02:00
claudio.atzori requested review from alessia.bardi 2022-10-04 15:50:46 +02:00
claudio.atzori requested review from miriam.baglioni 2022-10-04 15:51:35 +02:00
claudio.atzori requested review from schatz 2022-10-04 15:51:56 +02:00
claudio.atzori added 1 commit 2022-10-05 15:06:31 +02:00
claudio.atzori requested review from andrea.mannocci 2022-10-05 15:07:25 +02:00
claudio.atzori added 3 commits 2022-10-06 12:11:01 +02:00
claudio.atzori added 1 commit 2022-10-06 13:57:54 +02:00
sandro.labruzzo added 1 commit 2022-10-06 16:31:10 +02:00
sandro.labruzzo added 1 commit 2022-10-11 11:55:15 +02:00
sandro.labruzzo added 1 commit 2022-10-12 12:16:44 +02:00
claudio.atzori added 1 commit 2022-10-21 13:44:49 +02:00
sandro.labruzzo added 1 commit 2022-10-21 14:58:20 +02:00
sandro.labruzzo added 1 commit 2022-11-02 14:36:56 +01:00
sandro.labruzzo added 1 commit 2022-11-02 14:38:34 +01:00
sandro.labruzzo added 1 commit 2022-11-02 14:43:00 +01:00
sandro.labruzzo added 1 commit 2022-11-02 14:48:52 +01:00
claudio.atzori changed title from WIP: aggregation section to aggregation section 2022-11-07 12:14:47 +01:00
claudio.atzori added 1 commit 2022-11-07 12:14:54 +01:00
claudio.atzori added 2 commits 2022-11-08 09:35:15 +01:00
claudio.atzori added 1 commit 2022-11-08 10:53:27 +01:00
schatz requested changes 2022-11-08 12:59:16 +01:00
@ -0,0 +13,4 @@
* OpenAIRE IDs depend on persistent IDs when they are provided by the authority responsible for creating them;
* PIDs are included in the graph according to a strict criterion: the PID types declared in the table below are mapped as PIDs only when they are collected from the corresponding PID authority data source.
| *PID Type* | *Authority* |
Member

I would remove italics from the header of this table. Note that headers are already styled in bold face.
Author
Owner

Thanks, it is indeed useless to further boldify them. I am going to remove the extras.
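The PID criterion quoted in this hunk (a PID type counts as a PID only when collected from its authority) can be sketched roughly as below. This is a minimal illustration, not the OpenAIRE implementation: the `PID_AUTHORITIES` entries and the `keep_as_pid` helper are hypothetical, since the actual table is not shown in this thread.

```python
# Hypothetical PID type -> authority mapping, for illustration only;
# the real table lives in the documentation page under review.
PID_AUTHORITIES = {
    "doi": {"Crossref", "Datacite"},
    "pmid": {"PubMed Central"},
}


def keep_as_pid(pid_type: str, collected_from: str) -> bool:
    """True when the identifier was collected from an authority for its type."""
    return collected_from in PID_AUTHORITIES.get(pid_type.lower(), set())


print(keep_as_pid("doi", "Crossref"))         # collected from a listed authority
print(keep_as_pid("doi", "Some Repository"))  # collected from elsewhere
```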
@ -0,0 +31,4 @@
This "selection" can be performed when the entities in the graph sharing the same identifier are grouped together. The list of the delegated authorities currently includes
| *Datasource delegated* | *Datasource delegating* | *Pid Type* |
Member

Here as well, I would remove italics.
Author
Owner

Thanks, it is indeed useless to further boldify them. I am going to remove the extras.
@ -0,0 +10,4 @@
OpenAIRE aggregates metadata records describing objects of the research life-cycle from content providers compliant to the [OpenAIRE guidelines](https://guidelines.openaire.eu/) and from entity registries (i.e. data sources offering authoritative lists of entities, like [OpenDOAR](https://v2.sherpa.ac.uk/opendoar/), [re3data](https://www.re3data.org/), [DOAJ](https://doaj.org/), and various funder databases). After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is used to generate the final OpenAIRE Research Graph, accessible from the [OpenAIRE EXPLORE portal](https://explore.openaire.eu) and the [APIs](https://graph.openaire.eu/develop/).
The transformation process includes the application of cleaning functions whose goal is to ensure that values are harmonised according to a common format (e.g. dates as YYYY-MM-dd) and, whenever applicable, to a common controlled vocabulary. The controlled vocabularies used for cleansing are accessible at http://api.openaire.eu/vocabularies. Each vocabulary features a set of controlled terms, each with one code, one label, and a set of synonyms. If a synonym is found as field value, the value is updated with the corresponding term.
Member

The link "http://api.openaire.eu/vocabularies" here is broken
Author
Owner

Fixed.
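The synonym-based cleaning described in the quoted paragraph (a field value matching a synonym is replaced by the corresponding controlled term) can be sketched as follows. This is only an illustration under stated assumptions: the `Term` class, the sample vocabulary entry, the case-insensitive matching, and the choice to return the term code are all hypothetical and not taken from the OpenAIRE code base.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    code: str
    label: str
    synonyms: set[str] = field(default_factory=set)


def build_synonym_index(terms):
    """Map every lower-cased synonym (and the code itself) to its term."""
    index = {}
    for term in terms:
        index[term.code.lower()] = term
        for synonym in term.synonyms:
            index[synonym.lower()] = term
    return index


def clean_value(value, index):
    """Return the controlled term code for a value, or the value unchanged."""
    term = index.get(value.strip().lower())
    return term.code if term else value


# Hypothetical vocabulary entry, for illustration only.
languages = [Term(code="eng", label="English", synonyms={"en", "english"})]
index = build_synonym_index(languages)
print(clean_value("English", index))  # matches a synonym
print(clean_value("Klingon", index))  # no match, value left untouched
```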
@ -0,0 +17,4 @@
<img loading="lazy" alt="Aggregation" src="/img/docs/aggregation.png" width="65%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
The OpenAIRE aggregation system collects information about objects of the research life-cycle compliant to the [OpenAIRE acquisition policy](https://www.openaire.eu/content-aquisition-policy1) from [different types of data sources](https://explore.openaire.eu/search/find/dataproviders):
Member

The link to "OpenAIRE acquisition policy" is broken.
Author
Owner

After adding the link, I informed the person responsible for maintaining those pages on the OpenAIRE website, and the URL was then changed. It is fixed now.
@ -0,0 +26,4 @@
5. Metadata of open source research software from software repositories and Software Heritage
6. Metadata about other types of research products, like workflows, protocols, methods, research packages
Relationships between objects are collected from the data sources, but also automatically detected by [inference algorithms](https://www.openaire.eu/blogs/text-mining-services-in-openaire-1) and added by authenticated users, who can insert links between literature, datasets, software and projects via [the “Link” procedure available from the OpenAIRE explore portal](https://explore.openaire.eu/participate/claim).
Member

The second link here requires authentication; is that OK?
Author
Owner

Well, it is explicitly mentioned that the functionality is available for authenticated users. However I agree it is not nice to expose a link that brings the users to a login form. I added a second link to the claiming guide.
@ -0,0 +53,4 @@
| `author.pid.value` | `\attributes\creators\nameIdentifiers/nameIdentifier` | the pid value |
| `maintitle` | `\attributes\titles` | Titles whose title type is null or title type is Main |
| `subtitle` | `\attributes\titles` | Titles whose title type is Subtitle, since the title type vocabulary in OpenAIRE uses the DataCite title type vocabulary |
| **date section** | | for each date, and in particular for DOIs starting with _10.14457_, we apply a Thai date fix: the date is converted to a ThaiBuddhistDate and reformatted to a local one; see ticket [#6791](https://support.openaire.eu/issues/6791) |
Member

Why is this bold? Is it correct?
Author
Owner

It is a way to group "sections" of the mapping related to common aspects together.
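The Thai date fix mentioned in the quoted "date section" row can be illustrated with a small sketch. It assumes that the affected dates arrive as `YYYY-MM-DD` strings whose year is expressed in the Buddhist Era (543 years ahead of the ISO calendar); the function name and the input format are hypothetical, and the actual fix tracked in ticket #6791 may differ.

```python
from datetime import date

# The Thai Buddhist calendar year is the ISO year plus 543.
BUDDHIST_ERA_OFFSET = 543
THAI_DOI_PREFIX = "10.14457"


def fix_thai_date(value: str, doi: str) -> str:
    """Shift a Buddhist Era date to the ISO calendar for the affected DOIs."""
    if not doi.startswith(THAI_DOI_PREFIX):
        return value  # other DOIs are left untouched
    year, month, day = (int(part) for part in value.split("-"))
    return date(year - BUDDHIST_ERA_OFFSET, month, day).isoformat()


# Hypothetical DOI suffix, for illustration only.
print(fix_thai_date("2561-01-31", "10.14457/example.1"))
```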
@ -0,0 +76,4 @@
| `IsHostedBy` | `\attributes\relationships\client\id` | `Result/DataSource` | we defined a curated clientId/Datasource map; if a match is found, we create a _hostedBy Relation_ |
### Relation Resolution
Member

This section is empty. Remove this or add content.
Author
Owner

Removed.
@ -0,0 +6,4 @@
The idea behind DOIBoost and its origin can be found in the paper (and related resources) at:
* La Bruzzo S., Manghi P., Mannocci A. (2019) OpenAIRE's DOIBoost - Boosting CrossRef for Research. In: Manghi P., Candela L., Silvello G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, doi:10.1007/978-3-030-11226-4_11 . Open Access version available at: [10.5281/zenodo.1441071](https://doi.org/10.5281/zenodo.1441071)
Member

I would move the reference to a "References" section at the end of the page, like in the aggregation page.
Author
Owner

Done.
@ -0,0 +29,4 @@
The construction of the DOIBoost dataset consists of the following phases:
## 1. Crossref filtering
Member

I would remove the numbering from the titles, since the titles in other pages have no numbers.
Also, I am not sure these sections need to be under the "Inputs" section; if they do, they should be moved one level down in the title hierarchy.
Author
Owner

Those subsections describe the processing steps needed to build DOIBoost; I reorganised the hierarchy.
@ -0,0 +34,4 @@
Records in Crossref are ruled out according to the following criteria
* have blank title, examples:
* `10.1093/rheumatology/41.7.837`
Member

Do we want examples here, or is it "too much"?
Author
Owner

I'm not sure. Many people who base their work on Crossref content usually don't mention such cases, and I think it would be good, for transparency, to mention them.
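As a rough illustration of the "blank title" criterion discussed here, the filter could look like the sketch below. It assumes records shaped like Crossref REST API works, where `title` is a list of strings; the helper names and the sample records are hypothetical, and only the one criterion visible in this hunk is covered.

```python
def has_blank_title(record: dict) -> bool:
    """True when the record has no title, or only empty/whitespace titles."""
    titles = record.get("title") or []
    if isinstance(titles, str):
        titles = [titles]
    return not any(title and title.strip() for title in titles)


def filter_crossref(records: list) -> list:
    """Keep only the records that pass the blank-title check."""
    return [record for record in records if not has_blank_title(record)]


# Made-up records, for illustration only.
sample = [
    {"DOI": "10.1234/kept", "title": ["A proper title"]},
    {"DOI": "10.1234/blank", "title": ["   "]},
    {"DOI": "10.1234/missing"},
]
print([record["DOI"] for record in filter_crossref(sample)])
```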
@ -0,0 +10,4 @@
Example:
```commandline
Member

I am not sure if the full response from EMBL is required here.
@ -0,0 +404,4 @@
The table below describes the mapping from the EBI links records to the OpenAIRE Graph dump format.
| *OpenAIRE Result field path* | PubMed record field xpath | Notes |
Member

This is empty. Remove it or add content. Also remove italics from the table header.
Author
Owner

Fixed
@ -0,0 +8,4 @@
It contains XML records compliant with the schema available at https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html.
## Incremental harvesting
Pubmed exposes an entry point FTP with all the updates for each one. [ftp baseline update](https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/). We collect the new file and generate the new dataset by upserting the existing item.
Member

Remove the full stop before the link?
Author
Owner

Updated.
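The upsert step mentioned in the quoted "Incremental harvesting" paragraph can be sketched as below. This is a simplified assumption of how it might work: records are taken to be dictionaries keyed by a `pmid` field, and an update file simply replaces the stored record with the same PMID or adds a new one. The function and field names are hypothetical.

```python
def upsert(dataset: dict, updates: list) -> dict:
    """Return a new dataset where each update replaces or adds its record."""
    merged = dict(dataset)  # copy, so the previous dataset is left intact
    for record in updates:
        merged[record["pmid"]] = record
    return merged


# Made-up records, for illustration only.
current = {"1": {"pmid": "1", "title": "Old title"}}
update_file = [
    {"pmid": "1", "title": "Revised title"},  # updates an existing item
    {"pmid": "2", "title": "New article"},    # inserts a new item
]
refreshed = upsert(current, update_file)
print(sorted(refreshed))
```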
@ -0,0 +15,4 @@
The table below describes the mapping from the XML baseline records to the OpenAIRE Graph dump format.
| *OpenAIRE Result field path* | PubMed record field xpath | Notes |
Member

Remove italics from the table header.
Author
Owner

Removed.
@ -0,0 +18,4 @@
| *OpenAIRE Result field path* | PubMed record field xpath | Notes |
|--------------------------------|--------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Publication Mapping** | | |
| `id` | ?? | id in the form `pmid_________::md5(pmid)` |
Member

??
Author
Owner

Filled
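The identifier pattern quoted in the `id` row, `pmid_________::md5(pmid)`, can be illustrated with the sketch below. It assumes the MD5 is computed over the UTF-8 bytes of the raw PMID string and rendered as lowercase hex; those details, and the function name, are assumptions rather than something stated in this thread.

```python
import hashlib

# Prefix copied verbatim from the mapping table row quoted above.
PREFIX = "pmid_________"


def openaire_pmid_id(pmid: str) -> str:
    """Build an identifier of the form <prefix>::<md5 of the PMID>."""
    digest = hashlib.md5(pmid.encode("utf-8")).hexdigest()
    return f"{PREFIX}::{digest}"


# Made-up PMID, for illustration only.
print(openaire_pmid_id("12345"))
```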
sidebars.js Outdated
@ -62,0 +64,4 @@
label: "Aggregation",
link: {type: 'doc', id: 'data-provision/aggregation/aggregation'},
items: [
{ type: 'doc', id: 'data-provision/aggregation/doiboost' },
Member

Is it OK to use only "DOIBoost" here as the title of the item in the sidebar? If so, we can add a "label" here.
Author
Owner

Thanks for the hint. I'm not sure what would be better for the end user reading this doc. On the one hand, DOIBoost means nothing to them, hence I'm tempted to leave the longer title (listing the different providers); on the other hand, aesthetically speaking, I surely prefer the short version.

It's good to know anyway that we can build a cleaner TOC.
sandro.labruzzo added 2 commits 2022-11-08 15:42:14 +01:00
sandro.labruzzo added 1 commit 2022-11-08 15:58:28 +01:00
claudio.atzori added 2 commits 2022-11-08 17:05:54 +01:00
claudio.atzori added 1 commit 2022-11-08 17:12:31 +01:00
claudio.atzori added 1 commit 2022-11-08 17:16:13 +01:00
schatz merged commit 372ee33111 into main 2022-11-09 12:01:13 +01:00
Reference: D-Net/openaire-graph-docs#2