Format md

Giambattista Bloisi 2024-04-22 14:22:29 +02:00
parent 2f1042d747
commit 9222fe3456
6 changed files with 266 additions and 96 deletions

# Data model
The OpenAIRE Graph comprises several types of [entities](../category/entities) and [relationships](/category/relationships) among them.
The latest version of the JSON schema can be found in the [Downloads](../downloads/full-graph) section.
<p align="center">
<img loading="lazy" alt="Data model" src={require('../assets/img/data-model-3.png').default} width="80%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
The figure above presents the graph's data model.
Its main entities are described in brief below:
* [Research products](./entities/research-product) represent the outcomes (or products) of research activities.
* [Data sources](./entities/data-source) are the sources from which the metadata of graph objects are collected.
* [Organizations](./entities/organization) correspond to companies or research institutions involved in projects, responsible for operating data sources, or serving as the affiliations of research product creators.
* [Projects](./entities/project) are research project grants funded by a Funding Stream of a Funder.
* [Communities](./entities/community) are groups of people with a common research intent (e.g. research infrastructures, university alliances).
* Persons correspond to individual researchers who are involved in the design, creation or maintenance of research products. Currently, this is a non-materialized entity type in the Graph, which means that the respective metadata (and relationships) are encapsulated in the author field of the respective research products.
:::note Further reading
A detailed report on the OpenAIRE Graph Data Model can be found on [Zenodo](https://zenodo.org/record/2643199).
:::

# PIDs and identifiers
One of the challenges towards the stability of the contents in the OpenAIRE Graph consists of making its identifiers and records stable over time.
The barriers to this scenario are many, as the Graph keeps a map of data sources that is subject to constant variations: records in repositories vary in content, original IDs and PIDs may disappear or reappear, and the same holds for the repository or the metadata collection it exposes.
Moreover, the mappings applied to the original contents may also change and improve over time to catch up with the changes in the input records.
## PID Authorities
One of the fronts regards the attribution of the identity to the objects populating the graph. The basic idea is to build the identifiers of the objects in the graph from the PIDs available in some authoritative sources while considering all the other sources as by definition “unstable”. Examples of authoritative sources are Crossref and DataCite. Examples of non-authoritative ones are institutional repositories, aggregators, etc. PIDs from the authoritative sources would form the stable OpenAIRE ID skeleton of the Graph, precisely because they are immutable by construction.
Such a policy defines a list of data sources that are considered authoritative for a specific type of PID they provide, whose effect is twofold:
* OpenAIRE IDs depend on persistent IDs when they are provided by the authority responsible for creating them;
* PIDs are included in the graph according to a tight criterion: the PID Types declared in the table below are mapped as PIDs only when they are collected from the corresponding PID authority data source.
| PID Type | Authority |
|-----------|-----------------------------------------------------------------------------------------------------|
| ena | [Protein Data Bank](http://www.pdb.org/) |
| pdb | [Protein Data Bank](http://www.pdb.org/) |
There is an exception though: Handle(s) are minted by several repositories; as listing them all would not be a viable option, to avoid losing them as PIDs, Handles bypass the PID authority filtering rule.
In all other cases, PIDs are included in the graph as alternate identifiers.
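As a sketch of this filtering rule, a record's PID can be classified against a table of authorities. The table below is partial and purely illustrative, and `classify_pid` is a hypothetical helper, not OpenAIRE's actual code:

```python
# Illustrative sketch of the PID authority filtering rule.
# The authority table here is partial and for demonstration only.
PID_AUTHORITIES = {
    "doi": {"Crossref", "Datacite", "Zenodo"},
    "pmc": {"Europe PubMed Central", "PubMed Central"},
    "arXiv": {"arXiv.org e-Print Archive"},
}

def classify_pid(pid_type: str, collected_from: str) -> str:
    """Return 'pid' when the PID comes from its authority, else 'alternateIdentifier'."""
    if pid_type == "handle":
        return "pid"  # Handles bypass the PID authority filtering rule
    if collected_from in PID_AUTHORITIES.get(pid_type, set()):
        return "pid"
    return "alternateIdentifier"
```

A DOI collected from Crossref would thus be mapped as a `pid`, while the same DOI collected from an institutional repository would only become an alternate identifier.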
## Delegated authorities
When a record is aggregated from multiple sources considered authoritative for minting specific PIDs, different mappings could be applied to them and, depending on the case,
this could result in inconsistencies in the attribution of the field values.
To overcome the issue, the intuition is to include such records only once in the graph. To do so, the concept of "delegated authorities" defines a list of data sources that
assign PIDs to their scientific products from a given PID minter.
This "selection" can be performed when the entities in the graph sharing the same identifier are grouped together. The list of the delegated authorities currently includes:
| Datasource delegated | Datasource delegating | Pid Type |
|--------------------------------------|----------------------------------|-----------|
| [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | doi |
| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id |
## Identifiers in the Graph
OpenAIRE assigns internal identifiers for each object it collects.
By default, the internal identifier is generated as `sourcePrefix::md5(localId)` where:
* `sourcePrefix` is a namespace prefix of 12 chars assigned to the data source at registration time
* `localId` is the identifier assigned to the object by the data source
After years of operation, we can say that:
* `localId`s are generally unstable
* objects can disappear from sources
* PIDs provided by sources that are not PID agencies (authoritative sources for a specific type of PID) are often wrong (e.g. pre-print with the DOI of the published version, DOIs with typos)
Therefore, when the record is collected from an authoritative source:
* the identity of the record is forged using the PID, like `pidTypePrefix::md5(lowercase(doi))`
* the PID is added in a `pid` element of the data model
When the record is collected from a source which is not authoritative for any type of PID:
* the identity of the record is forged as usual using the local identifier
* the PID, if available, is added as `alternateIdentifier`
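The two identity-forging rules can be sketched as follows; this is a minimal illustration assuming the `prefix::md5(value)` scheme described above, and `forge_id` is a hypothetical helper name:

```python
import hashlib

def forge_id(prefix: str, value: str, lowercase: bool = False) -> str:
    """Build an OpenAIRE-style identifier as <prefix>::md5(value)."""
    if lowercase:
        value = value.lower()
    return f"{prefix}::{hashlib.md5(value.encode('utf-8')).hexdigest()}"

# Record collected from a PID authority: identity forged from the PID.
authoritative_id = forge_id("doi_________", "10.1000/ABC", lowercase=True)

# Record from a non-authoritative source: identity forged from the local id.
local_id = forge_id("sourceprefix", "oai:repo.example.org:1234")
```

Lowercasing before hashing makes the DOI-based identity case-insensitive, so `10.1000/ABC` and `10.1000/abc` forge the same identifier.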
Currently, the following data sources are used as "PID authorities":
| PID Type | Prefix (12 chars) | Authority |
|-----------|------------------------|-------------------------------------------|
| doi | `doi_________` | Crossref, Datacite, Zenodo |
| pmc | `pmc_________` | Europe PubMed Central, PubMed Central |
| pmid | `pmid________` | Europe PubMed Central, PubMed Central |
| arXiv | `arXiv_______` | arXiv.org e-Print Archive |
| handle | `handle______` | any repository |
| ena | `ena_________` | EMBL-EBI |
| pdb | `pdb_________` | EMBL-EBI |
| uniprot | `uniprot_____` | EMBL-EBI |
OpenAIRE also performs duplicate identification (see the [dedicated section for details](/graph-production-workflow/deduplication)).
All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated records).

---
sidebar_position: 3
---
# Clustering functions
## Ngrams
It creates ngrams from the input field. <br />
```
Example:
Input string: “Search for the Standard Model Higgs Boson”
List of ngrams: “sea”, “sta”, “mod”, “hig”
```
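The behaviour shown in the example can be sketched as below; the parameter values and the word-filtering rule are assumptions for illustration, not the actual implementation:

```python
def ngrams(field: str, ngram_len: int = 3, max_ngrams: int = 4, min_word_len: int = 5) -> list:
    """Take the first `ngram_len` characters of each sufficiently long word."""
    words = [w.lower() for w in field.split() if len(w) >= min_word_len]
    return [w[:ngram_len] for w in words[:max_ngrams]]

print(ngrams("Search for the Standard Model Higgs Boson"))
# ['sea', 'sta', 'mod', 'hig']
```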
## NgramPairs
It produces a list of concatenations of a pair of ngrams generated from different words.<br />
```
Example:
Input string: “Search for the Standard Model Higgs Boson”
Ngram pairs: “seasta”, “stamod”, “modhig”
```
## SuffixPrefix
It produces ngrams pairs in a particular way: it concatenates the suffix of a string with the prefix of the next in the input string. A specialization of this function is available as SortedSuffixPrefix. It returns a sorted list. <br />
```
Example:
Input string: “Search for the Standard Model Higgs Boson”
Output list: “ardmod” (suffix of the word “Standard” + prefix of the word “Model”)
```
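A sketch of the idea follows; the real function applies additional selection of the pairs, and the filtering parameters here are assumptions:

```python
def suffix_prefix(field: str, length: int = 3, min_word_len: int = 5) -> list:
    """Concatenate the suffix of each word with the prefix of the next word."""
    words = [w.lower() for w in field.split() if len(w) >= min_word_len]
    keys = {a[-length:] + b[:length] for a, b in zip(words, words[1:])}
    return sorted(keys)  # sorted, as in the SortedSuffixPrefix variant
```

For the example above, the generated set contains “ardmod” (suffix of “Standard” + prefix of “Model”) among the other word pairs.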
## Acronyms
It creates a number of acronyms out of the words in the input field. <br />
```
Example:
Input string: “Search for the Standard Model Higgs Boson”
Output: "ssmhb"
```
## KeywordsClustering
It creates keys by extracting keywords, out of a customizable list, from the input field. <br />
```
Example:
Input string: “University of Pisa”
Output: "key::001" (code that identifies the keyword "University" in the customizable list)
```
## LowercaseClustering
It creates keys by lowercasing the input field. <br />
```
Example:
Input string: “10.001/ABCD”
Output: "10.001/abcd"
```

## RandomClusteringFunction
It creates random keys from the input field. <br />
## SpaceTrimmingFieldValue
It creates keys by trimming spaces in the input field. <br />
```
Example:
Input string: “Search for the Standard Model Higgs Boson”
Output: "searchstandardmodelhiggsboson"
```
## UrlClustering
It creates keys for a URL field by extracting the domain. <br />
```
Example:
Input string: “http://www.google.it/page”
Output: "www.google.it"
```
## WordsStatsSuffixPrefixChain
It creates keys containing concatenated statistics of the field, i.e. number of words, number of letters and a chain of suffixes and prefixes of the words. <br />
```
Example:
Input string: “Search for the Standard Model Higgs Boson”
```

# Deduplication
The OpenAIRE Graph is populated by aggregating metadata records from distinct data sources whose content typically overlaps, for example the collection of article metadata records from publishers' archives (e.g. Frontiers, Elsevier, Copernicus) and from pre-print platforms (e.g. ArXiv.org, UKPubMed, BioarXiv.org). In order to support the monitoring of science, the OpenAIRE Graph implements record deduplication and merge strategies, so that the scientific production can be consistently represented in statistics. Such strategies reflect the following intuition behind OpenAIRE monitoring: "Two metadata records are equivalent when they describe the same research product, hence they feature compatible resource types, have the same title, the same authors, or, alternatively, the same PID". Finally, groups of duplicates can be whitelisted or blacklisted, in order to manually refine the quality of this strategy.
It should be noted that publication dates do not make a difference, as different versions of the same product can be published at different times, e.g. the pre-print and the published version of a scientific article, which should be counted as one object. Abstracts, subjects, and other possibly related fields are not used to strengthen similarity, due to their heterogeneity or absence across different data sources. Moreover, even when two products are indicated as one being a new version of the other, the presence of different authors will not bring them into the same group, to avoid an unfair distribution of scientific reward.
Groups of duplicates are finally merged into a new "dedup" record that embeds all properties of the merged records and carries provenance information about the data sources and the relative "instances", i.e. manifestations of the products, together with their resource type, access rights, and publishing date.
## Methodology overview
The deduplication process can be divided into five different phases:
* Collection import
* Candidate identification (clustering)
* Duplicates identification (pair-wise comparisons)
* Duplicates grouping (transitive closure)
* Relation redistribution
### Collection import
The nodes in the graph represent entities of different types. This phase is responsible for identifying all the nodes of a given type and making them available to the subsequent phases by representing them in the deduplication record model.
### Candidate identification (clustering)
Clustering is a common heuristic used to overcome the N x N complexity required to match all pairs of objects to identify the equivalent ones. The challenge is to identify a [clustering function](./clustering-functions) that maximizes the chance of comparing only records that may lead to a match, while minimizing the number of records that will not be matched while being equivalent. Since the equivalence function is to some level tolerant to minimal errors (e.g. switching of characters in the title, or minimal differences in letters), we need this function to be not too precise (e.g. a hash of the title), but also not too flexible (e.g. random ngrams of the title). On the other hand, reality tells us that in some cases the equality of two records can only be determined by their PIDs (e.g. DOI), as the metadata properties are very different across different versions and no [clustering function](./clustering-functions) will ever bring them into the same cluster.
### Duplicates identification (pair-wise comparisons)
Pair-wise comparisons are conducted over records in the same cluster following the strategy defined in the decision tree. A different decision tree is adopted depending on the type of the entity being processed.
To further limit the number of comparisons, a sliding window mechanism is used: (i) records in the same cluster are lexicographically sorted by their title, (ii) a window of K records slides over the cluster, and (iii) records ending up in the same window are pair-wise compared. The result of each comparison produces a similarity relation when the pair of records matches. Such relations are subsequently used as input for the duplicates grouping stage.
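The sliding-window mechanism can be sketched as follows. This is an illustrative in-memory version; the decision-tree comparison itself is left abstract, and each yielded pair would be fed to it:

```python
from itertools import combinations

def windowed_pairs(cluster, window=3):
    """Yield the record pairs compared by a sliding window over a sorted cluster."""
    records = sorted(cluster)              # (i) sort lexicographically (by title)
    emitted = set()
    for start in range(len(records)):      # (ii) slide a window of `window` records
        for pair in combinations(records[start:start + window], 2):
            if pair not in emitted:        # (iii) compare each pair only once
                emitted.add(pair)
                yield pair
```

With a window of size K, each record is compared with at most K - 1 neighbours instead of the whole cluster.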
### Duplicates grouping (transitive closure)
Once the similarity relations between pairs of records are drawn, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new representative object is obtained, which inherits all properties from the merged records and keeps track of their provenance.
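The transitive closure over similarity relations can be sketched with a union-find structure; this is an illustrative in-memory implementation, not the actual one:

```python
def group_duplicates(similarity_relations):
    """Compute the transitive closure of pair-wise similarity relations."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in similarity_relations:      # union the two sides of each relation
        parent[find(a)] = find(b)

    groups = {}
    for record in list(parent):
        groups.setdefault(find(record), set()).add(record)
    return list(groups.values())
```

Each resulting group is the "mesh" from which a representative object is then built.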
### Relation redistribution
Relations involving nodes identified as duplicates are eventually marked as virtually deleted and used as templates for creating new relations pointing to the representative record.
Note that nodes and relationships marked as virtually deleted are not exported.
<p align="center">

# Organizations
The organizations in OpenAIRE are aggregated from different registries (e.g. CORDA, OpenDOAR, Re3data, ROR). In some cases, a registry provides organizations as entities with their own persistent identifier. In other cases, those organizations are extracted from other main entities provided by the registry (e.g. datasources, projects, etc.).
The deduplication of organizations is enhanced by [OpenOrgs](https://orgs.openaire.eu), a tool that combines an automated approach for identifying duplicated instances of the same organization record with a "humans in the loop" approach, in which the equivalences produced by a duplicate identification algorithm are suggested to data curators, who are in charge of validating them.
The data curation activity is twofold: on the one hand it pivots around the disambiguation task, on the other hand it aims to improve the metadata describing the organization records (e.g. including the translated name, or a different PID) as well as to define the hierarchical structure of existing large organizations (i.e. universities comprising their departments, or large research centers with all their sub-units or sub-institutes).
Duplicates among organizations are therefore managed through three different stages:
* *Creation of Suggestions*: executes an automatic workflow that performs the deduplication and prepares new suggestions for the curators to process;
* *Curation*: manual editing of the organization records performed by the data curators;
* *Creation of Representative Organizations*: executes an automatic workflow that creates curated organizations and exposes them on the OpenAIRE Graph by using the curators' feedback from the OpenOrgs underlying database.
The next sections describe the above-mentioned stages.
### Creation of Suggestions
This stage executes an automatic workflow that carries out the *candidate identification* and *duplicates identification* stages of the deduplication to provide suggestions for the curators in OpenOrgs.
#### Candidate identification (clustering)
To match the requirements of limiting the number of comparisons, OpenAIRE clustering for organizations aims at grouping records that would more likely be comparable.
It works with four functions:
* *URL-based function*: generates the URL domain from the organization's `websiteurl` field, when provided as part of the record properties;
* *Title-based functions*:
* generate strings dependent on the keywords in the `legalname` field;
* generate strings obtained as an alternation of the function prefix(3) and suffix(3) (and vice versa) on the first 3 words of the `legalname` field;
* generate strings obtained as a concatenation of ngrams of the `legalname` field;
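The URL-based function can be sketched as follows; this is a minimal illustration, and the helper name is hypothetical:

```python
from urllib.parse import urlparse

def url_clustering_key(websiteurl: str) -> str:
    """Clustering key for an organization: the domain of its `websiteurl`."""
    return urlparse(websiteurl).netloc.lower()

print(url_clustering_key("http://www.example.edu/about"))
# www.example.edu
```

Two organization records exposing the same website domain would thus end up in the same cluster and be compared.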
#### Duplicates identification (pair-wise comparisons)
For each pair of organizations in a cluster the following strategy (depicted in the figure below) is applied.
The comparison goes through the following decision tree:
1. *grid id check*: comparison of the grid ids. If the grid ids are equivalent, then the similarity relation is drawn. If the grid id is not available, the comparison proceeds to the next stage;
2. *early exits*: comparison of the numbers extracted from the `legalname`, the `country` and the `website` url. No similarity relation is drawn in this stage; the comparison proceeds only if the compared fields verify the conditions of equivalence;
3. *city check*: comparison of the city names in the `legalname`. The comparison proceeds only if the legalnames share at least 10% of the cities;
4. *keyword check*: comparison of the keywords in the `legalname`. The comparison proceeds only if the legalnames share at least 70% of the keywords;
5. *legalname check*: comparison of the normalized `legalnames` with the `Jaro-Winkler` distance to determine if it is higher than `0.9`. If so, a similarity relation is drawn. Otherwise, no similarity relation is drawn.
1. *grid id check*: comparison of the grid ids. If the grid id is equivalent,
then the similarity relation is drawn. If the grid id is not available, the
comparison proceeds to the next stage;
2. *early exits*: comparison of the numbers extracted from the `legalname`,
the `country` and the `website` url. No similarity relation is drawn in this
stage, the comparison proceeds only if the compared fields verified the
conditions of equivalence;
3. *city check*: comparison of the city names in the `legalname`. The comparison
proceeds only if the legalnames shares at least 10% of cities;
4. *keyword check*: comparison of the keywords in the `legalname`. The
comparison proceeds only if the legalnames shares at least 70% of keywords;
5. *legalname check*: comparison of the normalized `legalnames` with
the `Jaro-Winkler` distance to determine if it is higher than `0.9`. If so, a
similarity relation is drawn. Otherwise, no similarity relation is drawn.
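The decision tree above can be sketched as follows. This is an illustrative
approximation, not the actual implementation: the `numbers` and
`shared_fraction` helpers are assumptions about how the fields are compared,
and `difflib` is used as a stand-in for the Jaro-Winkler measure.

```python
import re
from difflib import SequenceMatcher


def numbers(name: str) -> set:
    # Digits extracted from a legalname, e.g. "University 2 of X" -> {"2"}
    return set(re.findall(r"\d+", name))


def shared_fraction(a: set, b: set) -> float:
    # Fraction of shared elements relative to the smaller set
    if not a or not b:
        return 1.0  # nothing extracted: do not block the comparison
    return len(a & b) / min(len(a), len(b))


def name_similarity(a: str, b: str) -> float:
    # difflib ratio as a stand-in for the Jaro-Winkler distance
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_duplicate(o1: dict, o2: dict) -> bool:
    # 1. grid id check: decisive when both records carry a grid id
    if o1.get("grid") and o2.get("grid"):
        return o1["grid"] == o2["grid"]
    # 2. early exits: numbers in the legalname, country and website must agree
    if numbers(o1["legalname"]) != numbers(o2["legalname"]):
        return False
    if o1.get("country") != o2.get("country"):
        return False
    if o1.get("website") and o2.get("website") and o1["website"] != o2["website"]:
        return False
    # 3. city check: legalnames must share at least 10% of the city names
    if shared_fraction(o1["cities"], o2["cities"]) < 0.10:
        return False
    # 4. keyword check: legalnames must share at least 70% of the keywords
    if shared_fraction(o1["keywords"], o2["keywords"]) < 0.70:
        return False
    # 5. legalname check: normalized-name similarity must exceed 0.9
    return name_similarity(o1["legalname"], o2["legalname"]) > 0.9
```

Each stage either decides the outcome or lets the comparison fall through to
the next, cheaper-to-more-expensive, which is what makes the early exits
worthwhile on large clusters.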
<p align="center">
<img loading="lazy" alt="Organization Decision Tree" src={require('../../assets/img/decisiontree-organization.png').default} width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
### Data Curation
All the similarity relations drawn by the algorithm involving the decision tree
are exposed in OpenOrgs, where they are made available to the data curators to
give feedback and to improve the organizations' metadata.
A data curator can:
* *edit organization metadata*: legalname, pid, country, url, parent relations,
etc.;
* *approve suggested duplicates*: establish that an equivalence relation is valid;
* *discard suggested duplicates*: establish that an equivalence relation is wrong;
* *create similarity relations*: add a new equivalence relation not drawn by the
algorithm.

Note that if a curator does not provide feedback on a similarity relation
suggested by the algorithm, such relation is considered valid.
### Creation of Representative Organizations
This stage executes an automatic workflow that carries out the *duplicates
grouping* stage to create representative organizations and to update them in
the OpenAIRE Graph. Such organizations are obtained via transitive closure, and
the relations used come from the curators' feedback gathered in the underlying
OpenOrgs Database.
#### Duplicates grouping (transitive closure)
Once the similarity relations between pairs of organizations have been
gathered, the groups of equivalent organizations are obtained (transitive
closure, i.e. “mesh”). From each such set a new representative organization is
obtained, which inherits all properties from the merged records and keeps track
of their provenance.
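The transitive closure can be computed with a standard union-find pass over the
similarity relations; a minimal sketch, assuming organization ids and the
approved similarity pairs are already at hand:

```python
def find(parent: dict, x: str) -> str:
    # Path-compressing find: follow parent links up to the root
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def transitive_closure(ids, similarity_pairs):
    # Union-find over the similarity relations: every connected
    # component becomes one group of equivalent organizations.
    parent = {i: i for i in ids}
    for a, b in similarity_pairs:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb
    groups = {}
    for i in ids:
        groups.setdefault(find(parent, i), []).append(i)
    return list(groups.values())
```

For example, the pairs (o1, o2) and (o2, o3) yield a single group {o1, o2, o3}
even though o1 and o3 were never compared directly, which is exactly the
“mesh” effect described above.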
The IDs of the representative organizations are obtained from the OpenOrgs
Database, which creates a unique ``openorgs`` ID for each approved
organization. In case an organization is not approved by the curators, the ID
is obtained by prepending the prefix ``pending_org`` to the MD5 of the first ID
(given their lexicographical ordering).
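The ID rule can be sketched as follows; the exact separator between the
``pending_org`` prefix and the hash is an assumption, as is the shape of the
``openorgs`` identifier used in the example:

```python
import hashlib


def representative_id(merged_ids, approved_openorgs_id=None):
    # Approved groups get the unique id assigned by the OpenOrgs Database;
    # otherwise the id is derived from the lexicographically first merged id.
    if approved_openorgs_id is not None:
        return approved_openorgs_id
    first = min(merged_ids)
    return "pending_org_" + hashlib.md5(first.encode("utf-8")).hexdigest()
```

Deriving the pending ID from the lexicographically first merged ID makes the
result deterministic: the same group of records always yields the same
representative ID, regardless of the order in which they were merged.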


### Duplicates grouping
The aim of the final stage is the creation of objects that group all the
equivalent entities discovered by the previous step. This is done in two
phases.
#### Transitive closure
As a final step of duplicate identification, a transitive closure is run
against the similarity relations to find groups of duplicates not directly
caught by the previous steps. If a group is larger than 200 elements, only the
first 200 elements will be included in the group, while the remaining ones will
be kept ungrouped.
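The size cap can be sketched as below; the text does not specify *which* 200
elements are retained, so keeping the first 200 in lexicographical order is an
assumption made here for determinism:

```python
MAX_GROUP_SIZE = 200


def cap_group(group):
    # Split an oversized group into the (at most 200) records that stay
    # grouped and the remaining records that are kept ungrouped.
    ordered = sorted(group)
    return ordered[:MAX_GROUP_SIZE], ordered[MAX_GROUP_SIZE:]
```

Capping group size bounds the cost of building the representative record, at
the price of leaving the overflow records as singletons.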