Merge pull request 'update mining docs' (#14) from ioannis.foufoulas/openaire-graph-docs:mining_docs into main

Reviewed-on: D-Net/openaire-graph-docs#14
This commit is contained in:
Serafeim Chatzopoulos 2022-11-17 13:41:49 +01:00
commit 5684d7bff7
4 changed files with 84 additions and 0 deletions

View File

@ -0,0 +1,23 @@
---
sidebar_position: 3
---
# Extraction of Acknowledged Concepts
| Property | Description |
| --- | --- |
| Short description | Scans the plaintexts of publications for acknowledged concepts, including grant identifiers (projects) of funders, accession numbers of bioetities, EPO patent mentions, as well as custom concepts that can link research objects to specific research communities and initiatives in OpenAIRE. |
| Authority | ATHENA Research Center, Greece |
| Licence | CC-BY/CC-0 |
| Algorithmic details | The algorithm processes the publication's fulltext and extracts references to acknowledged concepts. It applies pattern matching and string join between the fulltext and a target database which contains the title, the acronym and the identifier of the searched concept. |
| Parameters | Concept titles, acronyms, and identifiers, publication's identifiers and fulltexts |
| Limitations | N/A |
| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
| References & resources | [Foufoulas, Y., Zacharia, E., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2022). DETEXA: Declarative Extensible Text Exploration and Analysis. In: , et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham.](https://doi.org/10.1007/978-3-031-16802-4_9) |

View File

@ -0,0 +1,23 @@
---
sidebar_position: 4
---
# Extraction of Cited Concepts
| Property | Description |
| --- | --- |
| Short description | Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs. |
| Authority | ATHENA Research Center, Greece |
| Licence | CC-BY/CC-0 |
| Algorithmic details | The algorithm extracts citations to specific datasets and software. It extracts the citation section of a publication's fulltext and applies string matching against a target database which includes an inverted index with dataset/software titles, urls and other metadata. |
| Parameters | Title, URL, creator names, publisher names and publication year for each concept to create the target database. Identifier and publication's fulltext to extract the cited concepts. |
| Limitations | N/A |
| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
| References & resources | [Foufoulas Y., Stamatogiannakis L., Dimitropoulos H., Ioannidis Y. (2017) “High-Pass Text Filtering for Citation Matching”. In: Kamps J., Tsakonas G., Manolopoulos Y., Iliadis L., Karydis I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham.](https://doi.org/10.1007/978-3-319-67008-9_28) |

View File

@ -0,0 +1,23 @@
---
sidebar_position: 5
---
# Classifiers
| Property | Description |
| --- | --- |
| Short description | A document classification algorithm that employs analysis of free text stemming from the abstracts of the publications. The purpose of applying a document classification module is to assign a scientific text to one or more predefined content classes. |
| Authority | ATHENA Research Center, Greece |
| Licence | CC-BY/CC-0 |
| Algorithmic details | The algorithm classifies publication's fulltexts using a Bayesian classifier and weighted terms according to an offline training phase. The training has been done using the following taxonomies: arXiv, MeSH (Medical Subject Headings), ACM, and DDC (Dewey Decimal Classification, or Dewey Decimal System). |
| Parameters | Publication's identifier and fulltext |
| Limitations | N/A |
| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
| References & resources | [Giannakopoulos, T., Stamatogiannakis, E., Foufoulas, I., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2014). Content Visualization of Scientific Corpora Using an Extensible Relational Database Implementation. In: Bolikowski, Ł., Casarosa, V., Goodale, P., Houssos, N., Manghi, P., Schirrwagen, J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham.](https://doi.org/10.1007/978-3-319-08425-1_10) |

View File

@ -3,4 +3,19 @@ sidebar_position: 1
---
# Mining algorithms
The Text and Data Mining (TDM) algorithms used for enriching the OpenAIRE Graph are grouped in the following main categories:
[Extraction of acknowledged concepts](acks.md)
[Extraction of cited concepts](cites.md)
[Document Classification](classified.md)
<span className="todo">TODO</span>