2024-05-15 16:14:12 +02:00
4 changed files with 96 additions and 33 deletions
--- a/docs/data-model/pids-and-identifiers.md
+++ b/docs/data-model/pids-and-identifiers.md
@@ -35,10 +35,10 @@ assigns PIDs to their scientific products from a given PID minter.
 This "selection" can be performed when the entities in the graph sharing the same identifier are grouped together. The list of the delegated authorities currently includes
-| Datasource delegated                 | Datasource delegating            | Pid Type  |
+| Datasource delegated                 | Datasource delegating            | Pid Type |
-|--------------------------------------|----------------------------------|-----------|
+|--------------------------------------|----------------------------------|----------|
-| [Zenodo](https://zenodo.org)         | [Datacite](https://datacite.org) | doi       |
+| [Zenodo](https://zenodo.org)         | [Datacite](https://datacite.org) | doi      |
-| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/)        | w3id      |
+| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/)        | w3id     |
 ## Identifiers in the Graph
@@ -66,16 +66,16 @@ When the record is collected from a source which is not authoritative for any ty
 Currently, the following data sources are used as "PID authorities":
-| PID Type  | Prefix (12 chars)      | Authority                             	 |
+| PID Type | Prefix (12 chars)     | Authority                             	 |
-|-----------|------------------------|-------------------------------------------|
+|----------|-----------------------|-----------------------------------------|
-| doi       | `doi_________`      	  | Crossref, Datacite, Zenodo            	 |
+| doi      | `doi_________`      	 | Crossref, Datacite, Zenodo            	 |
-| pmc       | `pmc_________`      	  | Europe PubMed Central, PubMed Central 	 |
+| pmc      | `pmc_________`      	 | Europe PubMed Central, PubMed Central 	 |
-| pmid      | `pmid________`      	  | Europe PubMed Central, PubMed Central 	 |
+| pmid     | `pmid________`      	 | Europe PubMed Central, PubMed Central 	 |
-| arXiv     | `arXiv_______`      	  | arXiv.org e-Print Archive             	 |
+| arXiv    | `arXiv_______`      	 | arXiv.org e-Print Archive             	 |
-| handle    | `handle______`      	  | any repository                        	 |
+| handle   | `handle______`      	 | any repository                        	 |
-| ena       | `ena_________`      	  | EMBL-EBI                            	 |
+| ena      | `ena_________`      	 | EMBL-EBI                            	   |
-| pdb       | `pdb_________`      	  | EMBL-EBI                            	 |
+| pdb      | `pdb_________`      	 | EMBL-EBI                            	   |
-| uniprot   | `uniprot_____`      	  | EMBL-EBI                            	 |
+| uniprot  | `uniprot_____`      	 | EMBL-EBI                            	   |
 OpenAIRE also performs duplicate identification (see the [dedicated section for details](/graph-production-workflow/deduplication)).
-All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated record).
+All duplicates are **merged** together in a **representative record** which must be assigned a [dedicated OpenAIRE identifier](/graph-production-workflow/deduplication/research-products#openaire-identifier-of-the-representative-record) (i.e. it cannot have the identifier of one of the aggregated records).
--- a/docs/graph-production-workflow/deduplication/deduplication.md
+++ b/docs/graph-production-workflow/deduplication/deduplication.md
@@ -2,9 +2,9 @@
 The OpenAIRE Graph is populated by aggregating metadata records from distinct data sources whose content typically overlaps. For example, article metadata records are collected both from publishers' archives (e.g. Frontiers, Elsevier, Copernicus) and from pre-print platforms (e.g. ArXiv.org, UKPubMed, BioarXiv.org). In order to support monitoring of science, the OpenAIRE Graph implements record deduplication and merge strategies, so that the scientific production can be represented consistently in statistics. Such strategies reflect the following intuition behind OpenAIRE monitoring: "Two metadata records are equivalent when they describe the same research product, hence they feature compatible resource types, have the same title, the same authors, or, alternatively, the same PID". Finally, groups of duplicates can be whitelisted or blacklisted, in order to manually refine the quality of this strategy.
-It should be noticed that publication dates do not make a difference, as different versions of the same product can be published at different times; e.g. the pre-print and a published version of a scientific article, which should be counted as one object; abstracts, subjects, and other possible related fields, are not used to strenghten similarity, due to their heterogeneity or absence across different data sources. Moreover, even when two products are indicated as one a new version of the other, the presence of different authors will not bring them into the same group, to avoid unfair distribution of scientific reward. 
+It should be noted that publication dates do not make a difference, as different versions of the same product can be published at different times (e.g. the pre-print and the published version of a scientific article, which should be counted as one object). Abstracts, subjects, and other related fields are not used to strengthen similarity, due to their heterogeneity or absence across different data sources. Moreover, even when one product is indicated as a new version of another, the presence of different authors will not bring them into the same group, to avoid unfair distribution of scientific reward.
-Groups of duplicates are finally merged into a new "dedup" record that embeds all properties of the merged records and carries provenance information about the data sources and the relative "instances", i.e. manifestations of the products, together with their resource type, access rights, and publishing date.
+Groups of duplicates are finally merged into a new "representative record", having its own id, embedding properties of the merged records and carrying provenance information about the data sources and the relative "instances", i.e. manifestations of the products, together with their resource type, access rights, and publishing date.
 ## Methodology overview
@@ -37,7 +37,7 @@ To further limit the number of comparisons, a sliding window mechanism is used:
 ### Duplicates grouping (transitive closure)
-Once the similarity relations between pairs of records are drawn, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new representative object is obtained, which inherits all properties from the merged records and keeps track of their provenance. 
+Once the similarity relations between pairs of records are drawn, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new **representative record** is obtained, which inherits properties from the merged records and keeps track of their provenance.
 ### Relation redistribution
--- a/docs/graph-production-workflow/deduplication/research-products.md
+++ b/docs/graph-production-workflow/deduplication/research-products.md
@@ -149,22 +149,85 @@ The comparison goes through different stages:
 ### Duplicates grouping
-The aim of the final stage is the creation of objects that group all the equivalent
+The aim of the final stage is the creation of records that group all the
-entities discovered by the previous step. This is done in two phases. 
+equivalent entities discovered pairwise by the previous step. This is done in
 multiple phases.
 #### Transitive closure
 As a final step of duplicate identification a transitive closure
 is run against similarity relations to find groups of duplicates not directly 
 caught by the previous steps. If a group is larger than 200 elements only the 
 first 200 elements will be included in the group, while the remaining will be
 kept ungrouped.
-#### Creation of representative record (dedup record)
+As the concluding step of duplicate identification, a transitive closure is
 performed against similarity relations to identify complete groups of duplicated
 records (cliques). If a group exceeds 200 elements, only the first 200 elements
 are included in the group, while the remaining elements are kept ungrouped.
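As a rough illustration of this step, the grouping can be sketched with a union-find structure over the pairwise similarity relations. This is a minimal sketch and not the OpenAIRE implementation: the function name `group_duplicates`, the pair-based input, and the use of lexicographic order to decide which elements count as the "first 200" are assumptions made for the example.

```python
from collections import defaultdict

MAX_GROUP_SIZE = 200  # cap stated in the text above


def group_duplicates(pairs, max_size=MAX_GROUP_SIZE):
    """Return (groups, ungrouped) from an iterable of (id_a, id_b) similarity pairs."""
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union the two sides of every similarity relation.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each connected component.
    components = defaultdict(list)
    for node in parent:
        components[find(node)].append(node)

    groups, ungrouped = [], []
    for members in components.values():
        members.sort()  # assumption: "first" means lexicographically smallest
        groups.append(members[:max_size])
        ungrouped.extend(members[max_size:])
    return groups, ungrouped
```

Called with the pairs `("a", "b")` and `("b", "c")`, the sketch places `a`, `b` and `c` in one group even though `a` and `c` were never compared directly, which is the effect the transitive closure is meant to achieve.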
-The general concept is that the field coming from the record with higher "trust"
+#### Selection of the pivot record
 value is used as reference for the field of the representative record.
-The IDs of the representative records are obtained by prepending the
+Each group of duplicate records needs to be identified in the final graph with
-prefix ``dedup_`` to the MD5 of the first ID (given their lexicographical
+an OpenAIRE identifier, derived from a record of the group known as the _pivot
-ordering). If the group of merged records contains a trusted ID type (i.e. the
+record_. It is determined after sorting the group of duplicate records by the 
-DOI), also the type keyword (i.e. ``DOI``) is added to the prefix.
+following criteria:
 1. Records previously chosen as pivot records in the graph's previous
   generations.
 2. Records with identifiers from a [PID authority](/data-model/pids-and-identifiers#pid-authorities).
 3. Publications from CrossRef or datasets from DataCite.
 4. Records with an earlier date of acceptance.
 5. Records with smaller IDs in lexicographical order.
 The first sorting criterion is possible because a state table, called "pivot
 history", is maintained across graph generations. It keeps track of which
 records were used as pivot records in which graph, and is guaranteed to retain
 data for the last 12 months.
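The ordering above can be sketched as a composite sort key. This is only an illustration under stated assumptions: the `Record` class and its field names are hypothetical, criterion 3 is reduced to a single boolean flag, and records without a date of acceptance are assumed to sort after dated ones (the text does not say how missing dates are handled).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    # Hypothetical, simplified view of a record; not the OpenAIRE data model.
    id: str
    was_pivot: bool = False                  # criterion 1: found in the pivot history
    from_pid_authority: bool = False         # criterion 2
    from_crossref_or_datacite: bool = False  # criterion 3, collapsed into one flag
    date_of_acceptance: Optional[date] = None  # criterion 4


def pivot_sort_key(record: Record):
    # False sorts before True, so each flag is negated to put matches first.
    return (
        not record.was_pivot,
        not record.from_pid_authority,
        not record.from_crossref_or_datacite,
        record.date_of_acceptance or date.max,  # assumption: missing dates go last
        record.id,                              # criterion 5: lexicographic tie-break
    )


def select_pivot(group):
    """Return the record that comes first under the sorting criteria."""
    return min(group, key=pivot_sort_key)
```

Because the key is a tuple, a later criterion is consulted only when all earlier ones tie, which mirrors the numbered priority of the list.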
 #### Creation of representative records
 The representative record, also known as the "dedup record", replaces the group
 of deduplicated records in the graph.
 ##### OpenAIRE identifier of the representative record
 The OpenAIRE identifier of the representative record is generated based on the
 identifier of the record chosen as the pivot of the group:
 - if the pivot record comes from a "PID authority", the identifier of the
  representative record is the same, but the "PID Type Prefix" part of the
  identifier is modified to append ``_dedup``.<br/>
  For example ```doi_________::d5021b53204e4fdeab6ff5d5bc468032``` will
  become ```doi_dedup___::d5021b53204e4fdeab6ff5d5bc468032```
 - otherwise the "PID Type Prefix" part will be set to the fixed value
  ``dedup_wf_002``, and the following hash will be calculated as the MD5 hash of
  the entire raw id of the pivot record.<br/>
  For example ``DansKnawCris::0829b5191605bdbea36d6502b8c1ce1g`` will
  become ``dedup_wf_002::345e5d1b80537b0d0e0a49241ae9e516``
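The two rules can be sketched as follows. This is a hedged illustration rather than the actual implementation: the function name `representative_id` is hypothetical, a pivot is assumed to come from a PID authority when its prefix matches one of the PID types listed in the PID authorities table, and the "entire raw id" is assumed to be the full identifier string encoded as UTF-8, so the hash produced here is not guaranteed to match the example above. Prefixes that would exceed 12 characters after appending ``_dedup`` are not covered by the text and are not handled specially here.

```python
import hashlib

# Assumption: PID types taken from the "PID authorities" table.
PID_TYPES = {"doi", "pmc", "pmid", "arXiv", "handle", "ena", "pdb", "uniprot"}
PREFIX_LENGTH = 12
FALLBACK_PREFIX = "dedup_wf_002"


def representative_id(pivot_id: str) -> str:
    """Derive the identifier of a representative record from its pivot's id."""
    prefix, _, local_part = pivot_id.partition("::")
    pid_type = prefix.rstrip("_")  # drop the underscore padding
    if pid_type in PID_TYPES:
        # Keep the hash, append "_dedup" to the PID type and pad back to 12 chars.
        new_prefix = (pid_type + "_dedup").ljust(PREFIX_LENGTH, "_")
        return f"{new_prefix}::{local_part}"
    # Otherwise use the fixed prefix and hash the whole pivot id (assumed UTF-8).
    digest = hashlib.md5(pivot_id.encode("utf-8")).hexdigest()
    return f"{FALLBACK_PREFIX}::{digest}"
```

For instance, `representative_id("doi_________::d5021b53204e4fdeab6ff5d5bc468032")` yields `doi_dedup___::d5021b53204e4fdeab6ff5d5bc468032`, matching the first example.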
 ##### Content of the representative record
 The representative record inherits properties from the records it merges
 and tracks their provenance. Whenever possible, it preserves all data from the
 merged records, such as the ``instance`` field. In cases where a specific value
 must be chosen, the most representative one is selected. For example, for the
 "dateofacceptance" field, the earliest value is chosen.
 ##### Merged and singleton representative record
 Changes in metadata content or graph construction may lead to cases where
 representative records disappear from the graph:
 1. When two or more representative records are merged into one representative
   record. In other terms, this happens when a group of duplicated records
   contains multiple records formerly used as pivot records.
 2. When a record chosen as a pivot record leaves its group and remains alone.
 3. When a record chosen as a pivot record is no longer published by its data
   source (deletion of the metadata record).
 To address these cases, the pivot history table ensures the visibility of
 disappearing representative records for the first two cases. Specifically:
 1. In the case of merged representative records, the new representative record
   and the ones that would be lost are generated and linked as part of the new
   representative record.
 2. In the case of a record no longer serving as a pivot, a representative record
   is generated and linked only with that record.
 This approach ensures that users can access representative records that would
 otherwise be lost.
--- a/docs/graph-production-workflow/merge-by-id.md
+++ b/docs/graph-production-workflow/merge-by-id.md
@@ -16,7 +16,7 @@ a global grouping of every record available in the graph:
 This ensures that the same record, possibly assigned to different types by different 
 mappings, appears only once in the graph and under a single typing. In case of clashing 
-identifiers, the properties are merged (including the provencance information), considering 
+identifiers, the properties are merged (including the provenance information), considering 
 the following precedence order for the research product typing:
 ```