Static sidebar && add publications

Serafeim Chatzopoulos 2022-09-23 19:00:46 +03:00
parent 2d8c3ad241
commit cb0e7a921a
13 changed files with 160 additions and 34 deletions

View File

@ -2,8 +2,7 @@
sidebar_position: 6 sidebar_position: 6
--- ---
# Community
# Community (Initiative)
Research communities and research initiatives are intended as groups of people with a common research intent and can be of two types: research initiatives or research communities: Research communities and research initiatives are intended as groups of people with a common research intent and can be of two types: research initiatives or research communities:

View File

@ -4,11 +4,13 @@ sidebar_position: 1
# Aggregation # Aggregation
OpenAIRE collects metadata records from a variety of content providers as described in https://www.openaire.eu/aggregation-and-content-provision-workflows. OpenAIRE collects metadata records from a variety of content providers as described in the [aggregation and content provision workflows](https://www.openaire.eu/aggregation-and-content-provision-workflows).
OpenAIRE aggregates metadata records describing objects of the research life-cycle from content providers compliant to the [OpenAIRE guidelines](https://guidelines.openaire.eu/) and from entity registries (i.e. data sources offering authoritative lists of entities, like OpenDOAR, re3data, DOAJ, and funder databases). After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is used to generate the final OpenAIRE Research Graph that you can access from the OpenAIRE portal and the APIs. OpenAIRE aggregates metadata records describing objects of the research life-cycle from content providers compliant to the [OpenAIRE guidelines](https://guidelines.openaire.eu/) and from entity registries (i.e. data sources offering authoritative lists of entities, like OpenDOAR, re3data, DOAJ, and funder databases). After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is used to generate the final OpenAIRE Research Graph that you can access from the OpenAIRE portal and the APIs.
The transformation process includes the application of cleaning functions whose goal is to ensure that values are harmonised according to a common format (e.g. dates as YYYY-MM-dd) and, whenever applicable, to a common controlled vocabulary. The controlled vocabularies used for cleansing are accessible at http://api.openaire.eu/vocabularies. Each vocabulary features a set of controlled terms, each with one code, one label, and a set of synonyms. If a synonym is found as field value, the value is updated with the corresponding term. Also, the OpenAIRE Research Graph is extended with other relevant scholarly communication sources that are too big to be integrated via the “normal” aggregation mechanism: DOIBoost (which merges Crossref, ORCID, Microsoft Academic Graph, and Unpaywall), and ScholeXplorer, one of the Scholix hubs offering a large set of links between research literature and data. The transformation process includes the application of cleaning functions whose goal is to ensure that values are harmonised according to a common format (e.g. dates as YYYY-MM-dd) and, whenever applicable, to a common controlled vocabulary. The controlled vocabularies used for cleansing are accessible at http://api.openaire.eu/vocabularies. Each vocabulary features a set of controlled terms, each with one code, one label, and a set of synonyms. If a synonym is found as field value, the value is updated with the corresponding term. Also, the OpenAIRE Research Graph is extended with other relevant scholarly communication sources that are too big to be integrated via the “normal” aggregation mechanism: DOIBoost (which merges Crossref, ORCID, Microsoft Academic Graph, and Unpaywall), and ScholeXplorer, one of the Scholix hubs offering a large set of links between research literature and data.
![Aggregation](./assets/aggregation.png) <p align="center">
<img loading="lazy" alt="Aggregation" src="/img/docs/aggregation.png" width="65%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
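The synonym-based cleaning described in the Aggregation text can be sketched roughly as follows. This is only an illustration: the vocabulary entries, the `buildSynonymIndex` and `cleanValue` helpers, and the case-insensitive matching are assumptions made for the example, not the actual OpenAIRE implementation or the real content of the vocabularies API.

```javascript
// Hypothetical vocabulary: each term has one code, one label, and a set of synonyms.
// The codes and synonyms below are invented for illustration.
const exampleVocabulary = [
  {
    code: 'OPEN',
    label: 'Open Access',
    synonyms: ['open', 'info:eu-repo/semantics/openAccess'],
  },
  {
    code: 'RESTRICTED',
    label: 'Restricted',
    synonyms: ['info:eu-repo/semantics/restrictedAccess'],
  },
];

// Build a lookup from every known spelling (code, label, synonym) to its term.
// Lower-casing and trimming the keys is an assumption of this sketch.
function buildSynonymIndex(vocabulary) {
  const index = new Map();
  for (const term of vocabulary) {
    for (const key of [term.code, term.label, ...term.synonyms]) {
      index.set(key.trim().toLowerCase(), term);
    }
  }
  return index;
}

// If the field value matches a known spelling, replace it with the controlled term;
// otherwise return the value unchanged and flag it as unmatched.
function cleanValue(value, index) {
  const term = index.get(String(value).trim().toLowerCase());
  return term
    ? { code: term.code, label: term.label, matched: true }
    : { code: null, label: value, matched: false };
}

const index = buildSynonymIndex(exampleVocabulary);
console.log(cleanValue('info:eu-repo/semantics/openAccess', index));
```

Values that match no term are returned as-is here; how the real pipeline treats unmatched values is not stated in the text.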

View File

@ -1,11 +1,7 @@
# Data provision # Data provision
<span className="todo">source: https://graph.openaire.eu/about#tabs_card</span>
OpenAIRE collects metadata records from more than 70K scholarly communication sources from all over the world, including Open Access institutional repositories, data archives, journals. All the metadata records (i.e. descriptions of research products) are put together in a data lake, together with records from Crossref, Unpaywall, ORCID, Grid.ac, and information about projects provided by national and international funders. Dedicated inference algorithms applied to metadata and to the full-texts of Open Access publications enrich the content of the data lake with links between research results and projects, author affiliations, subject classification, links to entries from domain-specific databases. Duplicated organisations and results are identified and merged together to obtain an open, trusted, public resource enabling explorations of the scholarly communication landscape like never before. OpenAIRE collects metadata records from more than 70K scholarly communication sources from all over the world, including Open Access institutional repositories, data archives, journals. All the metadata records (i.e. descriptions of research products) are put together in a data lake, together with records from Crossref, Unpaywall, ORCID, Grid.ac, and information about projects provided by national and international funders. Dedicated inference algorithms applied to metadata and to the full-texts of Open Access publications enrich the content of the data lake with links between research results and projects, author affiliations, subject classification, links to entries from domain-specific databases. Duplicated organisations and results are identified and merged together to obtain an open, trusted, public resource enabling explorations of the scholarly communication landscape like never before.
![Architecture](./assets/architecture.png) <p align="center">
<img loading="lazy" alt="Data provision" src="/img/docs/architecture.png" width="80%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
<span className="todo">TODO: make this image linkable</span> </p>

View File

@ -2,7 +2,6 @@
sidebar_position: 3 sidebar_position: 3
--- ---
# Clustering functions # Clustering functions
<span className="todo">TODO</span>
## NgramPairs ## NgramPairs
It produces a list of concatenations of a pair of ngrams generated from different words.<br /> It produces a list of concatenations of a pair of ngrams generated from different words.<br />
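One possible reading of the NgramPairs description is sketched below. The function name `ngramPairs`, its parameters `ngramLen` and `maxPairs`, and the choice to take the leading ngram of each word and pair adjacent words are assumptions for illustration; the actual clustering function may tokenise and combine ngrams differently.

```javascript
// Sketch: build clustering keys by concatenating the leading ngrams of adjacent words.
// ngramLen and maxPairs are hypothetical parameters, not taken from the source.
function ngramPairs(text, ngramLen = 3, maxPairs = 2) {
  const words = text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u) // split on anything that is not a letter or digit
    .filter((word) => word.length >= ngramLen);

  // One ngram per word: its first ngramLen characters.
  const ngrams = words.map((word) => word.slice(0, ngramLen));

  const keys = [];
  for (let i = 0; i + 1 < ngrams.length && keys.length < maxPairs; i++) {
    keys.push(ngrams[i] + ngrams[i + 1]);
  }
  return keys;
}

console.log(ngramPairs('Search for the Standard Model Higgs boson'));
```

Records that share at least one key would land in the same cluster and be compared pairwise, which is what keeps the comparison below N x N.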

View File

@ -1,7 +1,5 @@
# Deduplication # Deduplication
<span className="todo">TODO: intro</span>
## Clustering ## Clustering
Clustering is a common heuristic used to overcome the N x N complexity required to match all pairs of objects to identify the equivalent ones. The challenge is to identify a clustering function that maximizes the chance of comparing only records that may lead to a match, while minimizing the number of records that will not be matched while being equivalent. Since the equivalence function is to some level tolerant to minimal errors (e.g. switching of characters in the title, or minimal difference in letters), we need this function to be not too precise (e.g. a hash of the title), but also not too flexible (e.g. random ngrams of the title). On the other hand, reality tells us that in some cases equality of two records can only be determined by their PIDs (e.g. DOI) as the metadata properties are very different across different versions and no clustering function will ever bring them into the same cluster. To match these requirements OpenAIRE clustering for products works with two functions: Clustering is a common heuristic used to overcome the N x N complexity required to match all pairs of objects to identify the equivalent ones. The challenge is to identify a clustering function that maximizes the chance of comparing only records that may lead to a match, while minimizing the number of records that will not be matched while being equivalent. Since the equivalence function is to some level tolerant to minimal errors (e.g. switching of characters in the title, or minimal difference in letters), we need this function to be not too precise (e.g. a hash of the title), but also not too flexible (e.g. random ngrams of the title). On the other hand, reality tells us that in some cases equality of two records can only be determined by their PIDs (e.g. DOI) as the metadata properties are very different across different versions and no clustering function will ever bring them into the same cluster.
To match these requirements OpenAIRE clustering for products works with two functions:

View File

@ -33,7 +33,9 @@ Cross comparison of the pid lists (in the `pid` and `alternateid` elements). If
Otherwise, check if the number of authors and the title version is equal. If so, levenshtein distance on titles with higher threshold (0.99). Otherwise, check if the number of authors and the title version is equal. If so, levenshtein distance on titles with higher threshold (0.99).
The publications are matched as duplicate if the distance is higher than the threshold, in every other case they are considered as distinct publications. The publications are matched as duplicate if the distance is higher than the threshold, in every other case they are considered as distinct publications.
![Example banner](../assets/dedup-results.png) <p align="center">
<img loading="lazy" alt="Deduplication workflow" src="/img/docs/dedup-results.png" width="80%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
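The title-based branch of the decision described above (same number of authors, same title version, then a similarity check against the 0.99 threshold) could look roughly like this. The record fields `authorCount`, `titleVersion`, and `title`, and the normalisation of the Levenshtein distance into a 0..1 similarity, are assumptions of this sketch; the PID comparison that precedes this branch is not shown.

```javascript
// Classic dynamic-programming Levenshtein edit distance, keeping two rows in memory.
function levenshtein(a, b) {
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1, // deletion
        current[j - 1] + 1, // insertion
        previous[j - 1] + cost // substitution
      );
    }
    previous = current;
  }
  return previous[b.length];
}

// Assumed normalisation: 1 means identical titles, 0 means nothing in common.
function titleSimilarity(a, b) {
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(a, b) / maxLen;
}

// Hypothetical record shape: { title, authorCount, titleVersion }.
function matchByTitle(left, right, threshold = 0.99) {
  if (left.authorCount !== right.authorCount) return false;
  if (left.titleVersion !== right.titleVersion) return false;
  return titleSimilarity(left.title, right.title) > threshold;
}

const a = { title: 'entity deduplication in big data graphs', authorCount: 4, titleVersion: '' };
const b = { title: 'entity deduplication in big data graphs', authorCount: 4, titleVersion: '' };
console.log(matchByTitle(a, b));
```

With a threshold this high, only titles that are identical or differ by about one character in a hundred or more pass, so this branch acts as a near-exact match.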
#### Creation of representative record #### Creation of representative record
<span className="todo">TODO</span> <span className="todo">TODO</span>

View File

@ -1,8 +1,5 @@
# Enrichment # Enrichment
<span className="todo">TODO: intro</span>
## Mining ## Mining
The OpenAIRE Research Graph is enriched by links mined by OpenAIRE's full-text mining algorithms that scan the plaintexts of publications for funding information, references to datasets, software URIs, accession numbers of bioentities, and EPO patent mentions. Custom mining modules also link research objects to specific research communities, initiatives and infrastructures. In addition, other inference modules provide content-based document classification, document similarity, citation matching, and author affiliation matching. The OpenAIRE Research Graph is enriched by links mined by OpenAIRE's full-text mining algorithms that scan the plaintexts of publications for funding information, references to datasets, software URIs, accession numbers of bioentities, and EPO patent mentions. Custom mining modules also link research objects to specific research communities, initiatives and infrastructures. In addition, other inference modules provide content-based document classification, document similarity, citation matching, and author affiliation matching.

View File

@ -21,6 +21,5 @@ As of today, the OpenAIRE Research Graph aggregates around 450Mi metadata record
* Microsoft Academic Graph * Microsoft Academic Graph
* Datacite * Datacite
After cleaning, deduplication, enrichment and full-text mining processes, the graph is analysed to produce statistics for the [OpenAIRE MONITOR](https://monitor.openaire.eu), the [Open Science Observatory](https://osobservatory.openaire.eu), made discoverable via the [OpenAIRE EXPLORE](https://explore.openaire.eu) and programmatically accessible as described at After cleaning, deduplication, enrichment and full-text mining processes, the graph is analysed to produce statistics for the [OpenAIRE MONITOR](https://monitor.openaire.eu), the [Open Science Observatory](https://osobservatory.openaire.eu), made discoverable via the [OpenAIRE EXPLORE](https://explore.openaire.eu) and programmatically accessible via [OpenAIRE Public APIs](https://develop.openaire.eu).
<span className="todo">https://develop.openaire.eu</span>. Last but not least, frequently updated [JSON dumps](download) are published on Zenodo.
Json dumps are also published on Zenodo.

View File

@ -2,5 +2,56 @@
sidebar_position: 7 sidebar_position: 7
--- ---
# Related publications # How to cite
<span className="todo">TODO</span>
If you use one of the [OpenAIRE Research Graph dumps](https://zenodo.org/record/6616871), please cite it following the recommendation that you find on the Zenodo page.
## Other relevant publications
### Aggregation system
Manghi, P., Artini, M., Atzori, C., Bardi, A., Mannocci, A., La Bruzzo, S., Candela, L., Castelli, D. and Pagano, P. (2014), “The D-NET software toolkit: A framework for the realization, maintenance, and operation of aggregative infrastructures”, Program: electronic library and information systems, Vol. 48 No. 4, pp. 322-354.
Michele Artini, Claudio Atzori, Alessia Bardi, Sandro La Bruzzo, Paolo Manghi, & Andrea Mannocci. (2016, November 24). The D-NET software toolkit: dnet-basic-aggregator (Version 1.3.0). Zenodo. <i className="fa-solid fa-arrow-up-right-from-square"></i>
Atzori, C., Bardi, A., Manghi, P., & Mannocci, A. (2017, January). The OpenAIRE workflows for data management. In Italian Research Conference on Digital Libraries (pp. 95-107). Springer, Cham.
Mannocci, A., & Manghi, P. (2016, September). DataQ: a data flow quality monitoring system for aggregative data infrastructures. In International Conference on Theory and Practice of Digital Libraries (pp. 357-369). Springer, Cham.
### Deduplication
Claudio Atzori, & Paolo Manghi. (2017, February 17). gdup: a big graph entity deduplication system (Version 4.0.5). Zenodo. https://code-repo.d4science.org/D-Net/dnet-dedup/releases
Manghi, Paolo, Marko Mikulicic, and Claudio Atzori. "De-duplication of aggregation authority files." International Journal of Metadata, Semantics and Ontologies 7.2 (2012): 114-130.
Manghi, P., Atzori, C., De Bonis, M., & Bardi, A. (2020). Entity deduplication in big data graphs for scholarly communication. Data Technologies and Applications.
Manghi, P., & Mikulicic, M. (2011, October). PACE: A general-purpose tool for authority control. In Research Conference on Metadata and Semantic Research (pp. 80-92). Springer, Berlin, Heidelberg.
Atzori, C., Manghi, P., & Bardi, A. (2018, December). GDup: de-duplication of scholarly communication big graphs. In 2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT) (pp. 142-151). IEEE.
Atzori, Claudio. "GDup: an Integrated, Scalable Big Graph Deduplication System." (2016).
### Mining
M. Kobos, Ł. Bolikowski, M. Horst, P. Manghi, N. Manola, J. Schirrwagen, “Information inference in scholarly communication infrastructures: the OpenAIREplus project experience”, Procedia Computer Science 38, 92-99.
Tkaczyk, D., Szostek, P., Fedoryszak, M. et al. CERMINE: automatic extraction of structured metadata from scientific literature. IJDAR 18, 317–335 (2015).
Giannakopoulos T., Foufoulas Y., Dimitropoulos H., Manola N. (2019) “Interactive Text Analysis and Information Extraction”. In: Manghi P., Candela L., Silvello G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, Cham.
Foufoulas Y., Stamatogiannakis L., Dimitropoulos H., Ioannidis Y. (2017) “High-Pass Text Filtering for Citation Matching”. In: Kamps J., Tsakonas G., Manolopoulos Y., Iliadis L., Karydis I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham.
T. Giannakopoulos, I. Foufoulas, E. Stamatogiannakis, H. Dimitropoulos, N. Manola, and Y. Ioannidis. 2015. “Visual-Based Classification of Figures from Scientific Literature”. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion). Association for Computing Machinery, New York, NY, USA, 1059–1060.
Giannakopoulos, T., Foufoulas, I., Stamatogiannakis, E., Dimitropoulos, H., Manola, N., & Ioannidis, Y. (2014). “Discovering and Visualizing Interdisciplinary Content Classes in Scientific Publications”. D-Lib Mag., Volume 20, Number 11/12.
Giannakopoulos T., Stamatogiannakis E., Foufoulas I., Dimitropoulos H., Manola N., Ioannidis Y. (2014) “Content Visualization of Scientific Corpora Using an Extensible Relational Database Implementation”. In: Bolikowski Ł., Casarosa V., Goodale P., Houssos N., Manghi P., Schirrwagen J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham. Also in: Google Books
Giannakopoulos T., Dimitropoulos H., Metaxas O., Manola N., Ioannidis Y. (2013) “Supervised Content Visualization of Scientific Publications: A Case Study on the ArXiv Dataset”. In: Kłopotek M.A., Koronacki J., Marciniak M., Mykowiecka A., Wierzchoń S.T. (eds) Language Processing and Intelligent Information Systems. IIS 2013. Lecture Notes in Computer Science, vol 7912. Springer, Berlin, Heidelberg.
Y. Chronis, Y. Foufoulas, V. Nikolopoulos, A. Papadopoulos, L. Stamatogiannakis, C. Svingos, Y. E. Ioannidis, "A Relational Approach to Complex Dataflows", in Workshop Proceedings of the EDBT/ICDT 2016 (MEDAL 2016) Joint Conference (March 15, 2016, Bordeaux, France) on CEUR-WS.org (ISSN 1613-0073)
### Portals
Baglioni M. et al. (2019) The OpenAIRE Research Community Dashboard: On Blending Scientific Workflows and Scientific Publishing. In: Doucet A., Isaac A., Golub K., Aalberg T., Jatowt A. (eds) Digital Libraries for Open Knowledge. TPDL 2019. Lecture Notes in Computer Science, vol 11799. Springer, Cham.
### Broker Service
Artini, M., Atzori, C., Bardi, A., La Bruzzo, S., Manghi, P., & Mannocci, A. (2015). The OpenAIRE literature broker service for institutional repositories. D-Lib Magazine, 21(11/12), 1.
Manghi, P., Atzori, C., Bardi, A., La Bruzzo, S., & Artini, M. (2016, February). Realizing a Scalable and History-Aware Literature Broker Service for OpenAIRE. In Italian Research Conference on Digital Libraries (pp. 92-103). Springer, Cham.

View File

@ -13,19 +13,102 @@
/** @type {import('@docusaurus/plugin-content-docs').SidebarsConfig} */ /** @type {import('@docusaurus/plugin-content-docs').SidebarsConfig} */
const sidebars = { const sidebars = {
// By default, Docusaurus generates a sidebar from the docs folder structure mySidebar: [
tutorialSidebar: [{type: 'autogenerated', dirName: '.'}],
// But you can create a sidebar manually
/*
tutorialSidebar: [
{ {
type: 'category', type: 'doc',
label: 'Tutorial', id: 'intro'
items: ['hello'],
}, },
], {
*/ type: 'category',
label: "Data model",
link: {type: 'doc', id: 'data-model/data-model'},
items: [
{
type: 'category',
label: "Entities",
link: {
type: 'generated-index',
description: 'The main entities of the OpenAIRE Research Graph are listed below.'
},
items: [
{ type: 'doc', id: 'data-model/entities/result' },
{ type: 'doc', id: 'data-model/entities/data-source' },
{ type: 'doc', id: 'data-model/entities/organization' },
{ type: 'doc', id: 'data-model/entities/project' },
{ type: 'doc', id: 'data-model/entities/community' },
]
},
{
type: 'doc',
id: 'data-model/relationships'
}
]
},
{
type: "link",
label: "Public API",
href: "https://graph.openaire.eu/develop/overview.html"
},
{
type: 'doc',
id: 'download'
},
{
type: 'category',
label: "Data provision",
link: {type: 'doc', id: 'data-provision/data-provision'},
items: [
{ type: 'doc', id: 'data-provision/aggregation' },
{
type: 'category',
label: "Deduplication",
link: {type: 'doc', id: 'data-provision/deduplication/deduplication'},
items: [
{ type: 'doc', id: 'data-provision/deduplication/research-products' },
{ type: 'doc', id: 'data-provision/deduplication/organizations' },
]
},
{
type: 'category',
label: "Enrichment",
link: {type: 'doc', id: 'data-provision/enrichment/enrichment'},
items: [
{ type: 'doc', id: 'data-provision/enrichment/mining' },
{ type: 'doc', id: 'data-provision/enrichment/impact-scores' },
]
},
{ type: 'doc', id: 'data-provision/post-cleaning' },
{ type: 'doc', id: 'data-provision/indexing' },
{ type: 'doc', id: 'data-provision/stats' },
]
},
{
type: 'doc',
id: 'services'
},
{
type: 'category',
label: "Learning center",
link: { type: 'generated-index' },
items: [
{ type: 'doc', id: 'learning-center/open-plato' },
{ type: 'doc', id: 'learning-center/tutorials' },
]
},
{
type: 'doc',
id: 'publications',
label: "Relevant publications"
},
{
type: 'doc',
id: 'faq'
},
{
type: 'doc',
id: 'license'
},
]
}; };
module.exports = sidebars; module.exports = sidebars;
