diff --git a/docs/data-model/data-model.md b/docs/data-model/data-model.md index 0f891b1..6a32834 100644 --- a/docs/data-model/data-model.md +++ b/docs/data-model/data-model.md @@ -1,26 +1,39 @@ # Data model
-The OpenAIRE Graph comprises several types of [entities](../category/entities) and [relationships](/category/relationships) among them.
+The OpenAIRE Graph comprises several types of [entities](../category/entities)
+and [relationships](/category/relationships) among them.
-The latest version of the JSON schema can be found on the [Downloads](../downloads/full-graph) section.
+The latest version of the JSON schema can be found in
+the [Downloads](../downloads/full-graph) section.

Data model

-The figure above, presents the graph's data model.
+The figure above presents the graph's data model. Its main entities are described in brief below:
-* [Research products](./entities/research-product) represent the outcomes (or products) of research activities.
-* [Data sources](./entities/data-source) are the sources from which the metadata of graph objects are collected.
-* [Organizations](./entities/organization) correspond to companies or research institutions involved in projects,
-responsible for operating data sources or consisting the affiliations of Product creators.
-* [Projects](./entities/project) are research project grants funded by a Funding Stream of a Funder.
-* [Communities](./entities/community) are groups of people with a common research intent (e.g. research infrastructures, university alliances).
-* Persons correspond to individual researchers who are involved in the design, creation or maintenance of research products. Currently, this is a non-materialized entity type in the Graph, which means that the respective metadata (and relationships) are encapsulated in the author field of the respective research products.
+* [Research products](./entities/research-product) represent the outcomes (or
+ products) of research activities.
+* [Data sources](./entities/data-source) are the sources from which the metadata
+ of graph objects are collected.
+* [Organizations](./entities/organization) correspond to companies or research
+ institutions involved in projects,
+ responsible for operating data sources or constituting the affiliations of
+ Product creators.
+* [Projects](./entities/project) are research project grants funded by a Funding
+ Stream of a Funder.
+* [Communities](./entities/community) are groups of people with a common
+ research intent (e.g. research infrastructures, university alliances).
+* Persons correspond to individual researchers who are involved in the design,
+ creation or maintenance of research products. Currently, this is a
+ non-materialized entity type in the Graph, which means that the respective
+ metadata (and relationships) are encapsulated in the author field of the
+ respective research products.

:::note Further reading
-A detailed report on the OpenAIRE Graph Data Model can be found on [Zenodo](https://zenodo.org/record/2643199).
+A detailed report on the OpenAIRE Graph Data Model can be found
+on [Zenodo](https://zenodo.org/record/2643199).
:::

diff --git a/docs/data-model/pids-and-identifiers.md b/docs/data-model/pids-and-identifiers.md index 3e3012e..b4a6889 100644 --- a/docs/data-model/pids-and-identifiers.md +++ b/docs/data-model/pids-and-identifiers.md @@ -1,17 +1,33 @@ # PIDs and identifiers
-One of the challenges towards the stability of the contents in the OpenAIRE Graph consists of making its identifiers and records stable over time.
-The barriers to this scenario are many, as the Graph keeps a map of data sources that is subject to constant variations: records in repositories vary in content,
-original IDs, and PIDs, may disappear or reappear, and the same holds for the repository or the metadata collection it exposes.
-Not only, but the mappings applied to the original contents may also change and improve over time to catch up with the changes in the input records.
+One of the challenges towards the stability of the contents in the OpenAIRE
+Graph consists of making its identifiers and records stable over time.
+The barriers to this scenario are many, as the Graph keeps a map of data sources
+that is subject to constant variations: records in repositories vary in content,
+original IDs and PIDs may disappear or reappear, and the same holds for the
+repository or the metadata collection it exposes.
+Moreover, the mappings applied to the original contents may also change and
+improve over time to catch up with the changes in the input records.

## PID Authorities

-One of the fronts regards the attribution of the identity to the objects populating the graph. The basic idea is to build the identifiers of the objects in the graph from the PIDs available in some authoritative sources while considering all the other sources as by definition “unstable”. Examples of authoritative sources are Crossref and DataCite. Examples of non-authoritative ones are institutional repositories, aggregators, etc. PIDs from the authoritative sources would form the stable OpenAIRE ID skeleton of the Graph, precisely because they are immutable by construction.
+One of the fronts regards the attribution of the identity to the objects
+populating the graph. The basic idea is to build the identifiers of the objects
+in the graph from the PIDs available in some authoritative sources while
+considering all the other sources as by definition “unstable”. Examples of
+authoritative sources are Crossref and DataCite. Examples of non-authoritative
+ones are institutional repositories, aggregators, etc. PIDs from the
+authoritative sources would form the stable OpenAIRE ID skeleton of the Graph,
+precisely because they are immutable by construction.

-Such a policy defines a list of data sources that are considered authoritative for a specific type of PID they provide, whose effect is twofold:
-* OpenAIRE IDs depend on persistent IDs when they are provided by the authority responsible to create them;
-* PIDs are included in the graph according to a tight criterion: the PID Types declared in the table below are considered to be mapped as PIDs only when they are collected from the relative PID authority data source.
+Such a policy defines a list of data sources that are considered authoritative
+for a specific type of PID they provide, whose effect is twofold:
+
+* OpenAIRE IDs depend on persistent IDs when they are provided by the authority
+ responsible for creating them;
+* PIDs are included in the graph according to a tight criterion: the PID Types
+ declared in the table below are considered to be mapped as PIDs only when they
+ are collected from the relative PID authority data source.

| PID Type | Authority |
|-----------|-----------------------------------------------------------------------------------------------------|
@@ -22,60 +38,76 @@ Such a policy defines a list of data sources that are considered authoritative f
| ena | [Protein Data Bank](http://www.pdb.org/) |
| pdb | [Protein Data Bank](http://www.pdb.org/) |
-
-There is an exception though: Handle(s) are minted by several repositories; as listing them all would not be a viable option, to avoid losing them as PIDs, Handles bypass the PID authority filtering rule.
+There is an exception though: Handle(s) are minted by several repositories; as
+listing them all would not be a viable option, to avoid losing them as PIDs,
+Handles bypass the PID authority filtering rule.
In all other cases, PIDs are included in the graph as alternate Identifiers.
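To make the effect of this policy concrete, the following minimal Python sketch (an illustration, not the actual OpenAIRE implementation; the `PID_AUTHORITIES` mapping and the function name are hypothetical, with values taken from the authority tables on this page) shows how a record's PIDs could be split into trusted `pid` entries and `alternateIdentifier` entries:

```python
# Hypothetical sketch of the PID authority filtering rule described above.
PID_AUTHORITIES = {
    "doi": {"Crossref", "Datacite", "Zenodo"},
    "pmc": {"Europe PubMed Central", "PubMed Central"},
    "pmid": {"Europe PubMed Central", "PubMed Central"},
    "arXiv": {"arXiv.org e-Print Archive"},
}

def map_pids(collected_from: str, raw_pids: dict) -> tuple:
    """Split collected PIDs into trusted `pid` and `alternateIdentifier` entries."""
    pid, alternate = {}, {}
    for pid_type, value in raw_pids.items():
        if pid_type == "handle":
            pid[pid_type] = value            # Handles bypass the filtering rule
        elif collected_from in PID_AUTHORITIES.get(pid_type, set()):
            pid[pid_type] = value            # collected from its PID authority
        else:
            alternate[pid_type] = value      # kept only as an alternate identifier
    return pid, alternate

# A DOI collected from an institutional repository is not promoted to `pid`:
print(map_pids("Some Institutional Repository", {"doi": "10.1234/abc"}))
# ({}, {'doi': '10.1234/abc'})
```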
## Delegated authorities

-When a record is aggregated from multiple sources considered authoritative for minting specific PIDs, different mappings could be applied to them and, depending on the case,
+When a record is aggregated from multiple sources considered authoritative for
+minting specific PIDs, different mappings could be applied to them and,
+depending on the case,
this could result in inconsistencies in the attribution of the field values.
-To overcome the issue, the intuition is to include such records only once in the graph. To do so, the concept of "delegated authorities" defines a list of datasources that
+To overcome the issue, the intuition is to include such records only once in the
+graph. To do so, the concept of "delegated authorities" defines a list of
+datasources that
assign PIDs to their scientific products from a given PID minter.
-This "selection" can be performed when the entities in the graph sharing the same identifier are grouped together. The list of the delegated authorities currently includes
-
-| Datasource delegated | Datasource delegating | Pid Type |
|--------------------------------------|----------------------------------|-----------|
| [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | doi |
| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id |
+This "selection" can be performed when the entities in the graph sharing the
+same identifier are grouped together. The list of the delegated authorities
+currently includes:
+| Datasource delegated | Datasource delegating | Pid Type |
|--------------------------------------|----------------------------------|----------|
| [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | doi |
| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id |

## Identifiers in the Graph

OpenAIRE assigns internal identifiers for each object it collects.
-By default, the internal identifier is generated as `sourcePrefix::md5(localId)` where:
+By default, the internal identifier is generated as `sourcePrefix::md5(localId)`
+where:

-* `sourcePrefix` is a namespace prefix of 12 chars assigned to the data source at registration time
+* `sourcePrefix` is a namespace prefix of 12 chars assigned to the data source
+ at registration time
* `localId` is the identifier assigned to the object by the data source

After years of operation, we can say that:

* `localId` are generally unstable
* objects can disappear from sources
-* PIDs provided by sources that are not PID agencies (authoritative sources for a specific type of PID) are often wrong (e.g. pre-print with the DOI of the published version, DOIs with typos)
+* PIDs provided by sources that are not PID agencies (authoritative sources for
+ a specific type of PID) are often wrong (e.g.
pre-print with the DOI of the
+ published version, DOIs with typos)

Therefore, when the record is collected from an authoritative source:

-* the identity of the record is forged using the PID, like `pidTypePrefix::md5(lowercase(doi))`
+* the identity of the record is forged using the PID,
+ like `pidTypePrefix::md5(lowercase(doi))`
* the PID is added in a `pid` element of the data model

-When the record is collected from a source which is not authoritative for any type of PID:
+When the record is collected from a source which is not authoritative for any
+type of PID:
+
* the identity of the record is forged as usual using the local identifier
* the PID, if available, is added as `alternateIdentifier`

Currently, the following data sources are used as "PID authorities":

-| PID Type | Prefix (12 chars) | Authority |
|-----------|------------------------|-------------------------------------------|
| doi | `doi_________` | Crossref, Datacite, Zenodo |
| pmc | `pmc_________` | Europe PubMed Central, PubMed Central |
| pmid | `pmid________` | Europe PubMed Central, PubMed Central |
| arXiv | `arXiv_______` | arXiv.org e-Print Archive |
| handle | `handle______` | any repository |
| ena | `ena_________` | EMBL-EBI |
| pdb | `pdb_________` | EMBL-EBI |
| uniprot | `uniprot_____` | EMBL-EBI |
+| PID Type | Prefix (12 chars) | Authority |
|----------|-----------------------|-----------------------------------------|
| doi | `doi_________` | Crossref, Datacite, Zenodo |
| pmc | `pmc_________` | Europe PubMed Central, PubMed Central |
| pmid | `pmid________` | Europe PubMed Central, PubMed Central |
| arXiv | `arXiv_______` | arXiv.org e-Print Archive |
| handle | `handle______` | any repository |
| ena | `ena_________` | EMBL-EBI |
| pdb | `pdb_________` | EMBL-EBI |
| uniprot | `uniprot_____` | EMBL-EBI |

-OpenAIRE also perform duplicate identification (see the [dedicated section for details](/graph-production-workflow/deduplication)).
-All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated record).
+OpenAIRE also performs duplicate identification (see
+the [dedicated section for details](/graph-production-workflow/deduplication)).
+All duplicates are **merged** together in a **representative record** which must
+be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier
+of one of the aggregated records).

diff --git a/docs/graph-production-workflow/deduplication/clustering-functions.md b/docs/graph-production-workflow/deduplication/clustering-functions.md index ded6c57..9632b33 100644 --- a/docs/graph-production-workflow/deduplication/clustering-functions.md +++ b/docs/graph-production-workflow/deduplication/clustering-functions.md @@ -1,11 +1,13 @@ --- sidebar_position: 3 ---
-# Clustering functions
+
+# Clustering functions

## Ngrams

It creates ngrams from the input field.
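As a rough illustration (not the production code), the following Python sketch reproduces the worked example below; the assumption that words shorter than four characters are skipped, and the default parameter values, are guesses made for the illustration:

```python
def ngrams(field: str, ngram_len: int = 3, max_ngrams: int = 4) -> list:
    """Toy ngram clustering key: first `ngram_len` letters of each long word."""
    words = [w for w in field.lower().split() if len(w) >= 4]  # skip short words
    return [w[:ngram_len] for w in words][:max_ngrams]

print(ngrams("Search for the Standard Model Higgs Boson"))
# ['sea', 'sta', 'mod', 'hig']
```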
+ ``` Example: Input string: “Search for the Standard Model Higgs Boson” @@ -15,7 +17,9 @@ List of ngrams: “sea”, “sta”, “mod”, “hig” ## NgramPairs -It produces a list of concatenations of a pair of ngrams generated from different words.
+It produces a list of concatenations of a pair of ngrams generated from +different words.
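A minimal sketch of the idea, under the same (assumed) tokenisation as the Ngrams sketch above; it reproduces the example below:

```python
def ngram_pairs(field: str, ngram_len: int = 3, max_ngrams: int = 4) -> list:
    """Concatenate ngrams taken from consecutive words of the field."""
    words = [w for w in field.lower().split() if len(w) >= 4]
    grams = [w[:ngram_len] for w in words][:max_ngrams]
    return [a + b for a, b in zip(grams, grams[1:])]

print(ngram_pairs("Search for the Standard Model Higgs Boson"))
# ['seasta', 'stamod', 'modhig']
```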
+ ``` Example: Input string: “Search for the Standard Model Higgs Boson” @@ -25,7 +29,10 @@ Ngram pairs: “seasta”, “stamod”, “modhig” ## SuffixPrefix -It produces ngrams pairs in a particular way: it concatenates the suffix of a string with the prefix of the next in the input string. A specialization of this function is available as SortedSuffixPrefix. It returns a sorted list.
+It produces ngram pairs in a particular way: it concatenates the suffix of a
+word with the prefix of the next word in the input string. A specialization of this
+function is available as SortedSuffixPrefix. It returns a sorted list.
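A hedged sketch of the behaviour (the suffix/prefix length of 3 and the word filtering are assumptions); the worked example below shows the key `ardmod` built from “Standard” and “Model”:

```python
def suffix_prefix(field: str, length: int = 3) -> list:
    """Suffix of each word concatenated with the prefix of the following word."""
    words = [w for w in field.lower().split() if len(w) >= 4]
    keys = [a[-length:] + b[:length] for a, b in zip(words, words[1:])]
    return keys  # SortedSuffixPrefix would return sorted(keys) instead

print(suffix_prefix("Search for the Standard Model Higgs Boson"))
# ['rchsta', 'ardmod', 'delhig', 'ggsbos']
```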
+ ``` Example: Input string: “Search for the Standard Model Higgs Boson” @@ -36,6 +43,7 @@ Output list: “ardmod”` (suffix of the word “Standard” + prefix of the wo ## Acronyms It creates a number of acronyms out of the words in the input field.
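A minimal sketch, assuming the acronym is built from the first letter of each word and that short words such as “for” and “the” are skipped (both assumptions inferred from the example below):

```python
def acronym(field: str, min_word_len: int = 4) -> str:
    """Concatenate the first letter of every (long enough) word."""
    return "".join(w[0] for w in field.lower().split() if len(w) >= min_word_len)

print(acronym("Search for the Standard Model Higgs Boson"))  # 'ssmhb'
```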
+ ``` Example: Input string: “Search for the Standard Model Higgs Boson” @@ -44,7 +52,9 @@ Output: "ssmhb" ## KeywordsClustering -It creates keys by extracting keywords, out of a customizable list, from the input field.
+It creates keys by extracting keywords, out of a customizable list, from the +input field.
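A sketch of the idea, assuming a customizable keyword-to-code table; the table content and the `key::NNN` format are only inferred from the example below:

```python
KEYWORD_CODES = {"university": "001", "institute": "002", "hospital": "003"}  # assumed

def keyword_keys(field: str) -> list:
    """One key per known keyword appearing in the field."""
    return [f"key::{KEYWORD_CODES[w]}" for w in field.lower().split() if w in KEYWORD_CODES]

print(keyword_keys("University of Pisa"))  # ['key::001']
```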
+ ``` Example: Input string: “University of Pisa” @@ -54,6 +64,7 @@ Output: "key::001" (code that identifies the keyword "University" in the customi ## LowercaseClustering It creates keys by lowercasing the input field.
+ ``` Example: Input string: “10.001/ABCD” @@ -67,6 +78,7 @@ It creates random keys from the input field.
## SpaceTrimmingFieldValue It creates keys by trimming spaces in the input field.
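A minimal sketch; dropping short words such as “for” and “the” is an assumption made so that the output matches the example below:

```python
def space_trimming_key(field: str, min_word_len: int = 4) -> str:
    """Lowercase the field and join the remaining words without spaces."""
    return "".join(w for w in field.lower().split() if len(w) >= min_word_len)

print(space_trimming_key("Search for the Standard Model Higgs Boson"))
# 'searchstandardmodelhiggsboson'
```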
+ ``` Example: Input string: “Search for the Standard Model Higgs Boson” @@ -76,6 +88,7 @@ Output: "searchstandardmodelhiggsboson" ## UrlClustering It creates keys for a URL field by extracting the domain.
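A minimal sketch using the Python standard library; the behaviour matches the example below:

```python
from urllib.parse import urlparse

def url_key(url: str) -> str:
    """The clustering key of a URL field is its domain."""
    return urlparse(url).netloc.lower()

print(url_key("http://www.google.it/page"))  # 'www.google.it'
```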
+ ``` Example: Input string: “http://www.google.it/page” @@ -84,7 +97,10 @@ Output: "www.google.it" ## WordsStatsSuffixPrefixChain -It creates keys containing concatenated statistics of the field, i.e. number of words, number of letters and a chain of suffixes and prefixes of the words.
+It creates keys containing concatenated statistics of the field, i.e. number of +words, number of letters and a chain of suffixes and prefixes of the +words.
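The exact key layout is not documented on this page, so the sketch below only conveys the idea; the ordering and separators of the statistics are assumptions:

```python
def words_stats_suffix_prefix_chain(field: str, length: int = 3) -> str:
    """Word count, letter count and a chain of word suffixes and prefixes."""
    words = [w for w in field.lower().split() if len(w) >= 4]
    letters = sum(len(w) for w in words)
    chain = "".join(a[-length:] + b[:length] for a, b in zip(words, words[1:]))
    return f"{len(words)}-{letters}-{chain}"  # hypothetical layout

print(words_stats_suffix_prefix_chain("Search for the Standard Model Higgs Boson"))
# '5-29-rchstaardmoddelhigggsbos'
```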
+ ``` Example: Input string: “Search for the Standard Model Higgs Boson”
diff --git a/docs/graph-production-workflow/deduplication/deduplication.md b/docs/graph-production-workflow/deduplication/deduplication.md index 09516f9..ef86fdc 100644 --- a/docs/graph-production-workflow/deduplication/deduplication.md +++ b/docs/graph-production-workflow/deduplication/deduplication.md @@ -1,14 +1,37 @@ # Deduplication
-The OpenAIRE Graph is populated by aggregating metadata records from distinct data sources whose content typically overlaps. For example, the collection of article metadata records from publisher' archives (e.g. Frontiers, Elsevier, Copernicus) and from pre-print platforms (e.g. ArXiv.org, UKPubMed, BioarXiv.org). In order to support monitoring of science, the OpenAIRE Graph implements record deduplication and merge strategies, in such a way the scientific production can be consistently statistically represented. Such strategies reflect the following intuition behind OpenAIRE monitoring: "Two metadata records are equivalent when they describe the same research product, hence they feature compatible resource types, have the same title, the same authors, or, alternatively, the same PID". Finally, groups of duplicates can be whitelisted or blacklisted, in order to manually refine the quality of this strategy.
+The OpenAIRE Graph is populated by aggregating metadata records from distinct
+data sources whose content typically overlaps. For example, the collection of
+article metadata records from publishers' archives (e.g. Frontiers, Elsevier,
+Copernicus) and from pre-print platforms (e.g. ArXiv.org, UKPubMed,
+BioarXiv.org). In order to support monitoring of science, the OpenAIRE Graph
+implements record deduplication and merge strategies, in such a way that the
+scientific production can be consistently statistically represented. Such
+strategies reflect the following intuition behind OpenAIRE monitoring: "Two
+metadata records are equivalent when they describe the same research product,
+hence they feature compatible resource types, have the same title, the same
+authors, or, alternatively, the same PID". Finally, groups of duplicates can be
+whitelisted or blacklisted, in order to manually refine the quality of this
+strategy.

-It should be noticed that publication dates do not make a difference, as different versions of the same product can be published at different times; e.g. the pre-print and a published version of a scientific article, which should be counted as one object; abstracts, subjects, and other possible related fields, are not used to strenghten similarity, due to their heterogeneity or absence across different data sources. Moreover, even when two products are indicated as one a new version of the other, the presence of different authors will not bring them into the same group, to avoid unfair distribution of scientific reward.
+It should be noted that publication dates do not make a difference, as
+different versions of the same product can be published at different times; e.g.
+the pre-print and a published version of a scientific article, which should be
+counted as one object; abstracts, subjects, and other possible related fields,
+are not used to strengthen similarity, due to their heterogeneity or absence
+across different data sources. Moreover, even when one product is indicated as
+a new version of the other, the presence of different authors will not bring
+them into the same group, to avoid unfair distribution of scientific reward.
-Groups of duplicates are finally merged into a new "dedup" record that embeds all properties of the merged records and carries provenance information about the data sources and the relative "instances", i.e. manifestations of the products, together with their resource type, access rights, and publishing date.
+Groups of duplicates are finally merged into a new "dedup" record that embeds
+all properties of the merged records and carries provenance information about
+the data sources and the relative "instances", i.e. manifestations of the
+products, together with their resource type, access rights, and publishing date.

## Methodology overview

The deduplication process can be divided into five different phases:
+
* Collection import
* Candidate identification (clustering)
* Duplicates identification (pair-wise comparisons)
@@ -23,25 +46,52 @@ The deduplication process can be divided into five different phases:

### Collection import

-The nodes in the graph represent entities of different types. This phase is responsible for identifying all the nodes with a given type and make them available to the subsequent phases representing them in the deduplication record model.
+The nodes in the graph represent entities of different types. This phase is
+responsible for identifying all the nodes with a given type and making them
+available to the subsequent phases, representing them in the deduplication record
+model.

-### Candidate identification (clustering)
+### Candidate identification (clustering)

-Clustering is a common heuristics used to overcome the N x N complexity required to match all pairs of objects to identify the equivalent ones. The challenge is to identify a [clustering function](./clustering-functions) that maximizes the chance of comparing only records that may lead to a match, while minimizing the number of records that will not be matched while being equivalent. Since the equivalence function is to some level tolerant to minimal errors (e.g. switching of characters in the title, or minimal difference in letters), we need this function to be not too precise (e.g. a hash of the title), but also not too flexible (e.g. random ngrams of the title). On the other hand, reality tells us that in some cases equality of two records can only be determined by their PIDs (e.g. DOI) as the metadata properties are very different across different versions and no [clustering function](./clustering-functions) will ever bring them into the same cluster.
+Clustering is a common heuristic used to overcome the N x N complexity required
+to match all pairs of objects to identify the equivalent ones. The challenge is
+to identify a [clustering function](./clustering-functions) that maximizes the
+chance of comparing only records that may lead to a match, while minimizing the
+number of records that will not be matched while being equivalent. Since the
+equivalence function is to some level tolerant to minimal errors (e.g. switching
+of characters in the title, or minimal difference in letters), we need this
+function to be not too precise (e.g. a hash of the title), but also not too
+flexible (e.g. random ngrams of the title). On the other hand, reality tells us
+that in some cases equality of two records can only be determined by their
+PIDs (e.g. DOI) as the metadata properties are very different across different
+versions and no [clustering function](./clustering-functions) will ever bring
+them into the same cluster.
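To illustrate the point (a toy sketch, not the OpenAIRE implementation; the key function is invented for the example), clustering keys act as blocking keys: records that share a key end up in the same candidate cluster, and only those are compared:

```python
from collections import defaultdict

def title_key(title: str) -> str:
    """Toy clustering key: first three letters of the first three long words."""
    words = [w for w in title.lower().split() if len(w) >= 4]
    return "".join(w[:3] for w in words[:3])

records = [
    {"id": "A", "title": "Search for the Standard Model Higgs Boson"},
    {"id": "B", "title": "Search for the standard model Higgs boson at LHC"},
    {"id": "C", "title": "A completely unrelated paper"},
]

clusters = defaultdict(list)
for record in records:
    clusters[title_key(record["title"])].append(record["id"])

print(dict(clusters))  # {'seastamod': ['A', 'B'], 'comunrpap': ['C']}
```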
### Duplicates identification (pair-wise comparisons)

-Pair-wise comparisons are conducted over records in the same cluster following the strategy defined in the decision tree. A different decision tree is adopted depending on the type of the entity being processed.
+Pair-wise comparisons are conducted over records in the same cluster following
+the strategy defined in the decision tree. A different decision tree is adopted
+depending on the type of the entity being processed.

-To further limit the number of comparisons, a sliding window mechanism is used: (i) records in the same cluster are lexicographically sorted by their title, (ii) a window of K records slides over the cluster, and (iii) records ending up in the same window are pair-wise compared. The result of each comparison produces a similarity relation when the pair of record matches. Such relations will be consequently used as input for the duplicates grouping stage.
+To further limit the number of comparisons, a sliding window mechanism is
+used: (i) records in the same cluster are lexicographically sorted by their
+title, (ii) a window of K records slides over the cluster, and (iii) records
+ending up in the same window are pair-wise compared. The result of each
+comparison produces a similarity relation when the pair of records matches. Such
+relations will be subsequently used as input for the duplicates grouping stage.

### Duplicates grouping (transitive closure)

-Once the similarity relations between pairs of records are drawn, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new representative object is obtained, which inherits all properties from the merged records and keeps track of their provenance.
+Once the similarity relations between pairs of records are drawn, the groups of
+equivalent records are obtained (transitive closure, i.e. “mesh”). From such
+sets a new representative object is obtained, which inherits all properties from
+the merged records and keeps track of their provenance.

### Relation redistribution

-Relations involved in nodes identified as duplicated are eventually marked as virtually deleted and used as template for creating a new relation pointing to the new representative record.
+Relations involved in nodes identified as duplicates are eventually marked as
+virtually deleted and used as a template for creating a new relation pointing to
+the new representative record.

Note that nodes and relationships marked as virtually deleted are not exported.
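As a minimal illustration of the sliding-window mechanism described in the duplicates identification step above (the window size, record shape and ordering key are assumptions; the real system runs this logic at scale):

```python
def sliding_window_pairs(cluster: list, window_size: int = 5):
    """Compare each record only with the next `window_size - 1` records by title."""
    ordered = sorted(cluster, key=lambda record: record["title"])
    for i, record in enumerate(ordered):
        for other in ordered[i + 1:i + window_size]:
            yield record, other  # each pair is then evaluated by the decision tree

# Every yielded pair accepted by the decision tree produces a similarity relation.
records = [{"title": t} for t in ("A study", "A study of X", "Another topic")]
print(sum(1 for _ in sliding_window_pairs(records, window_size=2)))  # 2 adjacent pairs
```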

diff --git a/docs/graph-production-workflow/deduplication/organizations.md b/docs/graph-production-workflow/deduplication/organizations.md index c2c57e1..8b73455 100644 --- a/docs/graph-production-workflow/deduplication/organizations.md +++ b/docs/graph-production-workflow/deduplication/organizations.md @@ -4,43 +4,82 @@ sidebar_position: 2 # Organizations
-The organizations in OpenAIRE are aggregated from different registries (e.g. CORDA, OpenDOAR, Re3data, ROR). In some cases, a registry provides organizations as entities with their own persistent identifier. In other cases, those organizations are extracted from other main entities provided by the registry (e.g. datasources, projects, etc.).
+The organizations in OpenAIRE are aggregated from different registries (e.g.
+CORDA, OpenDOAR, Re3data, ROR). In some cases, a registry provides organizations
+as entities with their own persistent identifier. In other cases, those
+organizations are extracted from other main entities provided by the registry (
+e.g. datasources, projects, etc.).

-The deduplication of organizations is enhanced by the [OpenOrgs](https://orgs.openaire.eu), a tool that combines an automated approach for identifying duplicated instances
-of the same organization record with a "humans in the loop" approach, in which the equivalences produced by a duplicate identification algorithm are suggested to data curators, in charge for validating them.
-The data curation activity is twofold, on one end pivots around the disambiguation task, on the other hand assumes to improve the metadata describing the organization records
-(e.g. including the translated name, or a different PID) as well as defining the hierarchical structure of existing large organizations (i.e. Universities comprising its departments or large research centers with all its sub-units or sub-institutes).
+The deduplication of organizations is enhanced by
+the [OpenOrgs](https://orgs.openaire.eu), a tool that combines an automated
+approach for identifying duplicated instances
+of the same organization record with a "humans in the loop" approach, in which
+the equivalences produced by a duplicate identification algorithm are suggested
+to data curators, in charge of validating them.
+The data curation activity is twofold: on the one hand, it pivots around the
+disambiguation task; on the other hand, it aims to improve the metadata
+describing the organization records
+(e.g. including the translated name, or a different PID) as well as defining the
+hierarchical structure of existing large organizations (i.e. Universities
+comprising their departments or large research centers with all their sub-units or
+sub-institutes).

-Duplicates among organizations are therefore managed through three different stages:
- *Creation of Suggestions*: executes an automatic workflow that performs the deduplication and prepare new suggestions for the curators to be processed;
- *Curation*: manual editing of the organization records performed by the data curators;
- *Creation of Representative Organizations*: executes an automatic workflow that creates curated organizations and exposes them on the OpenAIRE Graph by using the curators' feedback from the OpenOrgs underlying database.
+Duplicates among organizations are therefore managed through three different
+stages:
+
+* *Creation of Suggestions*: executes an automatic workflow that performs the
+ deduplication and prepares new suggestions for the curators to be processed;
+* *Curation*: manual editing of the organization records performed by the data
+ curators;
+* *Creation of Representative Organizations*: executes an automatic workflow
+ that creates curated organizations and exposes them on the OpenAIRE Graph by
+ using the curators' feedback from the OpenOrgs underlying database.

The next sections describe the above mentioned stages.

### Creation of Suggestions

-This stage executes an automatic workflow that faces the *candidate identification* and the *duplicates identification* stages of the deduplication to provide suggestions for the curators in the OpenOrgs.
+This stage executes an automatic workflow that performs the *candidate
+identification* and the *duplicates identification* stages of the deduplication
+to provide suggestions for the curators in the OpenOrgs.

#### Candidate identification (clustering)

-To match the requirements of limiting the number of comparisons, OpenAIRE clustering for organizations aims at grouping records that would more likely be comparable.
+To match the requirements of limiting the number of comparisons, OpenAIRE
+clustering for organizations aims at grouping records that would more likely be
+comparable.

It works with four functions:
-* *URL-based function*: the function generates the URL domain when this is provided as part of the record properties from the organization's `websiteurl` field;
-* *Title-based functions*:
- * generate strings dependent to the keywords in the `legalname` field;
- * generate strings obtained as an alternation of the function prefix(3) and suffix(3) (and vice versa) on the first 3 words of the `legalname` field;
- * generate strings obtained as a concatenation of ngrams of the `legalname` field;
+
+* *URL-based function*: the function generates the URL domain when this is
+ provided as part of the record properties from the organization's `websiteurl`
+ field;
+* *Title-based functions*:
+ * generate strings dependent on the keywords in the `legalname` field;
+ * generate strings obtained as an alternation of the function prefix(3) and
+ suffix(3) (and vice versa) on the first 3 words of the `legalname` field;
+ * generate strings obtained as a concatenation of ngrams of the `legalname`
+ field;

#### Duplicates identification (pair-wise comparisons)

-For each pair of organization in a cluster the following strategy (depicted in the figure below) is applied.
+For each pair of organizations in a cluster, the following strategy (depicted in
+the figure below) is applied.

The comparison goes through the following decision tree (a code sketch follows the list):
-1. *grid id check*: comparison of the grid ids. If the grid id is equivalent, then the similarity relation is drawn. If the grid id is not available, the comparison proceeds to the next stage;
-2. *early exits*: comparison of the numbers extracted from the `legalname`, the `country` and the `website` url. No similarity relation is drawn in this stage, the comparison proceeds only if the compared fields verified the conditions of equivalence;
-3. *city check*: comparison of the city names in the `legalname`. The comparison proceeds only if the legalnames shares at least 10% of cities;
-4. *keyword check*: comparison of the keywords in the `legalname`.
The comparison proceeds only if the legalnames shares at least 70% of keywords;
-5. *legalname check*: comparison of the normalized `legalnames` with the `Jaro-Winkler` distance to determine if it is higher than `0.9`. If so, a similarity relation is drawn. Otherwise, no similarity relation is drawn.
+
+1. *grid id check*: comparison of the grid ids. If the grid id is equivalent,
+ then the similarity relation is drawn. If the grid id is not available, the
+ comparison proceeds to the next stage;
+2. *early exits*: comparison of the numbers extracted from the `legalname`,
+ the `country` and the `website` url. No similarity relation is drawn in this
+ stage; the comparison proceeds only if the compared fields verify the
+ conditions of equivalence;
+3. *city check*: comparison of the city names in the `legalname`. The comparison
+ proceeds only if the legalnames share at least 10% of cities;
+4. *keyword check*: comparison of the keywords in the `legalname`. The
+ comparison proceeds only if the legalnames share at least 70% of keywords;
+5. *legalname check*: comparison of the normalized `legalnames` with
+ the `Jaro-Winkler` distance to determine if it is higher than `0.9`. If so, a
+ similarity relation is drawn. Otherwise, no similarity relation is drawn.
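The sketch below (plain Python, with simplified helpers and stand-in city/keyword lists; the Jaro-Winkler implementation is taken from the third-party `jellyfish` package, but any implementation works) is only meant to make the order of the checks concrete — it is not the OpenOrgs code:

```python
import re
from jellyfish import jaro_winkler_similarity  # any Jaro-Winkler implementation works

CITIES = {"pisa", "athens", "berlin"}               # stand-in for the real city gazetteer
KEYWORDS = {"university", "institute", "national"}  # stand-in for the real keyword list

def normalize(name: str) -> str:
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()

def overlap(a: set, b: set) -> float:
    return 1.0 if not a and not b else len(a & b) / max(len(a | b), 1)

def organizations_match(a: dict, b: dict) -> bool:
    na, nb = normalize(a["legalname"]), normalize(b["legalname"])
    # 1. grid id check
    if a.get("grid") and b.get("grid"):
        return a["grid"] == b["grid"]
    # 2. early exits (simplified: the real tree also compares the website domain)
    if (set(re.findall(r"\d+", na)) != set(re.findall(r"\d+", nb))
            or a.get("country") != b.get("country")):
        return False
    # 3. city check: at least 10% of the cities mentioned in the legalnames in common
    if overlap({w for w in na.split() if w in CITIES},
               {w for w in nb.split() if w in CITIES}) < 0.10:
        return False
    # 4. keyword check: at least 70% of the keywords in common
    if overlap({w for w in na.split() if w in KEYWORDS},
               {w for w in nb.split() if w in KEYWORDS}) < 0.70:
        return False
    # 5. legalname check: normalized names must be almost identical
    return jaro_winkler_similarity(na, nb) > 0.9

print(organizations_match({"legalname": "University of Pisa", "grid": "grid.0000.1", "country": "IT"},
                          {"legalname": "Università di Pisa", "grid": "grid.0000.1", "country": "IT"}))
# True (the grid ids match, so the tree stops at the first check)
```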

Organization Decision Tree
@@ -50,21 +89,39 @@ The comparison goes through the following decision tree:

### Data Curation

-All the similarity relations drawn by the algorithm involving the decision tree are exposed in OpenOrgs, where are made available to the data curators to give feedbacks and to improve the organizations metadata.
+All the similarity relations drawn by the algorithm involving the decision tree
+are exposed in OpenOrgs, where they are made available to the data curators to give
+feedback and to improve the organizations' metadata.

A data curator can:
- *edit organization metadata*: legalname, pid, country, url, parent relations, etc.;
- *approve suggested duplicates*: establish if an equivalence relation is valid;
- *discard suggested duplicates*: establish if an equivalence relation is wrong;
- *create similarity relations*: add a new equivalence relation not drawn by the algorithm.
-Note that if a curator does not provide a feedback on a similarity relation suggested by the algorithm, then such relation is considered as valid.
+* *edit organization metadata*: legalname, pid, country, url, parent relations,
+ etc.;
+* *approve suggested duplicates*: establish if an equivalence relation is valid;
+* *discard suggested duplicates*: establish if an equivalence relation is wrong;
+* *create similarity relations*: add a new equivalence relation not drawn by the
+ algorithm.
+
+Note that if a curator does not provide feedback on a similarity relation
+suggested by the algorithm, then such a relation is considered valid.

### Creation of Representative Organizations

-This stage executes an automatic workflow that faces the *duplicates grouping* stage to create representative organizations and to update them on the OpenAIRE Graph. Such organizations are obtained via transitive closure and the relations used comes from the curators' feedback gathered on the OpenOrgs underlying Database.
+This stage executes an automatic workflow that performs the *duplicates grouping*
+stage to create representative organizations and to update them on the OpenAIRE
+Graph. Such organizations are obtained via transitive closure and the relations
+used come from the curators' feedback gathered on the OpenOrgs underlying
+database.

#### Duplicates grouping (transitive closure)

-Once the similarity relations between pairs of organizations have been gathered, the groups of equivalent organizations are obtained (transitive closure, i.e. “mesh”). From such sets a new representative organization is obtained, which inherits all properties from the merged records and keeps track of their provenance.
+Once the similarity relations between pairs of organizations have been gathered,
+the groups of equivalent organizations are obtained (transitive closure, i.e.
+“mesh”). From such sets a new representative organization is obtained, which
+inherits all properties from the merged records and keeps track of their
+provenance.

-The IDs of the representative organizations are obtained by the OpenOrgs Database that creates a unique ``openorgs`` ID for each approved organization. In case an organization is not approved by the curators, the ID is obtained by appending the prefix ``pending_org`` to the MD5 of the first ID (given their lexicographical ordering). \ No newline at end of file
+The IDs of the representative organizations are obtained by the OpenOrgs
+Database that creates a unique ``openorgs`` ID for each approved organization.
In case an organization is not approved by the curators, the ID is obtained by
+appending the prefix ``pending_org`` to the MD5 of the first ID (given their
+lexicographical ordering).
\ No newline at end of file
diff --git a/docs/graph-production-workflow/deduplication/research-products.md b/docs/graph-production-workflow/deduplication/research-products.md index ca56b89..e9fe2d1 100644 --- a/docs/graph-production-workflow/deduplication/research-products.md +++ b/docs/graph-production-workflow/deduplication/research-products.md @@ -149,13 +149,15 @@ The comparison goes through different stages:

### Duplicates grouping

-The aim of the final stage is the creation of objects that group all the equivalent
-entities discovered by the previous step. This is done in two phases.
+The aim of the final stage is the creation of objects that group all the
+equivalent
+entities discovered by the previous step. This is done in two phases.

#### Transitive closure
+
As a final step of duplicate identification a transitive closure
-is run against similarity relations to find groups of duplicates not directly
-caught by the previous steps. If a group is larger than 200 elements only the
+is run against similarity relations to find groups of duplicates not directly
+caught by the previous steps. If a group is larger than 200 elements, only the
first 200 elements will be included in the group, while the remaining will be
kept ungrouped.
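A minimal, self-contained sketch of this grouping step (plain Python instead of the distributed implementation actually used; the record identifiers and the choice of a sorted order for truncation are illustrative):

```python
from collections import defaultdict

def dedup_groups(similarity_pairs, max_group_size: int = 200):
    """Connected components over similarity relations, capped at `max_group_size`."""
    graph = defaultdict(set)
    for a, b in similarity_pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        component, stack = [], [start]
        while stack:                      # iterative traversal = transitive closure
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(component)[:max_group_size])  # oversized groups are truncated
    return groups

print(dedup_groups([("A", "B"), ("B", "C"), ("X", "Y")]))
# [['A', 'B', 'C'], ['X', 'Y']]
```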