Describe the usage of the pivot table to improve stability of “representative records” and how “non authoritative” PIDs are used to generate “representative records”
parent
9222fe3456
commit
6bb810a606
|
@ -18,15 +18,16 @@ It should be noticed that publication dates do not make a difference, as
|
||||||
different versions of the same product can be published at different times; e.g.
|
different versions of the same product can be published at different times; e.g.
|
||||||
the pre-print and a published version of a scientific article, which should be
|
the pre-print and a published version of a scientific article, which should be
|
||||||
counted as one object; abstracts, subjects, and other possible related fields,
|
counted as one object; abstracts, subjects, and other possible related fields,
|
||||||
are not used to strenghten similarity, due to their heterogeneity or absence
|
are not used to strengthen similarity, due to their heterogeneity or absence
|
||||||
across different data sources. Moreover, even when two products are indicated as
|
across different data sources. Moreover, even when two products are indicated as
|
||||||
one a new version of the other, the presence of different authors will not bring
|
one a new version of the other, the presence of different authors will not bring
|
||||||
them into the same group, to avoid unfair distribution of scientific reward.
|
them into the same group, to avoid unfair distribution of scientific reward.
|
||||||
|
|
||||||
Groups of duplicates are finally merged into a new "dedup" record that embeds
|
Groups of duplicates are finally merged into a new "representative record",
|
||||||
all properties of the merged records and carries provenance information about
|
having its own id, embedding properties of the merged records and carrying
|
||||||
the data sources and the relative "instances", i.e. manifestations of the
|
provenance information about the data sources and the relative "instances", i.e.
|
||||||
products, together with their resource type, access rights, and publishing date.
|
manifestations of the products, together with their resource type, access
|
||||||
|
rights, and publishing date.
|
||||||
|
|
||||||
## Methodology overview
|
## Methodology overview
|
||||||
|
|
||||||
|
@ -84,7 +85,7 @@ relations will be consequently used as input for the duplicates grouping stage.
|
||||||
|
|
||||||
Once the similarity relations between pairs of records are drawn, the groups of
|
Once the similarity relations between pairs of records are drawn, the groups of
|
||||||
equivalent records are obtained (transitive closure, i.e. “mesh”). From such
|
equivalent records are obtained (transitive closure, i.e. “mesh”). From such
|
||||||
sets a new representative object is obtained, which inherits all properties from
|
sets a new **representative record** is obtained, which inherits properties from
|
||||||
the merged records and keeps track of their provenance.
|
the merged records and keeps track of their provenance.
|
||||||
|
|
||||||
### Relation redistribution
|
### Relation redistribution
|
||||||
|
|
|
@ -149,24 +149,84 @@ The comparison goes through different stages:
|
||||||
|
|
||||||
### Duplicates grouping
|
### Duplicates grouping
|
||||||
|
|
||||||
The aim of the final stage is the creation of objects that group all the
|
The aim of the final stage is the creation of records that group all the
|
||||||
equivalent
|
equivalent entities discovered pairwise by the previous step. This is done in
|
||||||
entities discovered by the previous step. This is done in two phases.
|
multiple phases.
|
||||||
|
|
||||||
#### Transitive closure
|
#### Transitive closure
|
||||||
|
|
||||||
As a final step of duplicate identification a transitive closure
|
As the concluding step of duplicate identification, a transitive closure is
|
||||||
is run against similarity relations to find groups of duplicates not directly
|
performed against similarity relations to identify complete groups of duplicated
|
||||||
caught by the previous steps. If a group is larger than 200 elements only the
|
records (cliques). If a group exceeds 200 elements, only the first 200 elements
|
||||||
first 200 elements will be included in the group, while the remaining will be
|
are included in the group, while the remaining elements are kept ungrouped.
|
||||||
kept ungrouped.
|
|
||||||
|
|
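The grouping described above can be sketched as a union-find pass over the similarity pairs. This is a minimal illustration, not the pipeline's actual code: the function name `group_duplicates`, the plain string ids, and the choice of "first 200" as the first 200 ids in lexicographical order are all assumptions made for the example.

```python
from collections import defaultdict

MAX_GROUP_SIZE = 200  # cap stated in the text above


def group_duplicates(pairs, max_size=MAX_GROUP_SIZE):
    """Group record ids connected by similarity pairs (hypothetical helper).

    Returns (groups, ungrouped): each group holds at most max_size ids,
    and ids beyond the cap are returned as ungrouped.
    """
    parent = {}

    def find(x):
        # Find the root of x, halving the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    components = defaultdict(list)
    for node in list(parent):
        components[find(node)].append(node)

    groups, ungrouped = [], []
    for members in components.values():
        members.sort()  # assumption: "first" means lexicographical order
        groups.append(members[:max_size])
        ungrouped.extend(members[max_size:])
    return groups, ungrouped
```

For instance, the pairs `("a", "b")` and `("b", "c")` place `a`, `b`, and `c` in one group even though `a` and `c` were never compared directly.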
||||||
#### Creation of representative record (dedup record)
|
#### Selection of the pivot record
|
||||||
|
|
||||||
The general concept is that the field coming from the record with higher "trust"
|
Each group of duplicate records needs to be identified in the final graph with
|
||||||
value is used as reference for the field of the representative record.
|
an OpenAIRE identifier, derived from a record of the group known as the pivot
|
||||||
|
record. The pivot record is determined after sorting by the following criteria:
|
||||||
|
|
||||||
The IDs of the representative records are obtained by prepending the
|
1. Records previously chosen as pivot records in the graph's previous
|
||||||
prefix ``dedup_`` to the MD5 of the first ID (given their lexicographical
|
generations.
|
||||||
ordering). If the group of merged records contains a trusted ID type (i.e. the
|
2. Records with identifiers from a "PID authority".
|
||||||
DOI), also the type keyword (i.e. ``DOI``) is added to the prefix.
|
3. Publications from CrossRef or datasets from DataCite.
|
||||||
|
4. Records with an earlier date of acceptance.
|
||||||
|
5. Records with smaller IDs in lexicographical order.
|
||||||
|
|
||||||
|
The first sorting criterion is possible because a state table, called "pivot
|
||||||
|
history," is maintained across graph generations. It keeps track of which
|
||||||
|
records were used as pivot records in which graph, and is guaranteed to retain data for
|
||||||
|
the last 12 months.
|
||||||
|
|
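The five sorting criteria above can be expressed as a single composite sort key. The sketch below is illustrative only: the `Candidate` class and its field names are hypothetical, and treating a missing acceptance date as "latest" is an assumption that the text does not state.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical view of a record, reduced to what the criteria need."""
    id: str
    was_pivot: bool                  # found in the pivot history table
    has_authority_pid: bool          # identifier from a "PID authority"
    from_crossref_or_datacite: bool  # CrossRef publication or DataCite dataset
    date_of_acceptance: Optional[date]


def pivot_sort_key(record: Candidate):
    # False sorts before True, so each boolean criterion is negated.
    return (
        not record.was_pivot,
        not record.has_authority_pid,
        not record.from_crossref_or_datacite,
        record.date_of_acceptance or date.max,  # assumption: no date sorts last
        record.id,
    )


def select_pivot(group: List[Candidate]) -> Candidate:
    """Return the record that ranks first under the criteria."""
    return min(group, key=pivot_sort_key)
```

Because the pivot-history flag is the first element of the key, a record that already served as a pivot wins over any newcomer, which is what keeps the representative record's identifier stable across graph generations.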
||||||
|
#### Creation of representative records
|
||||||
|
|
||||||
|
The representative record, also known as the "dedup record," replaces the group
|
||||||
|
of deduplicated records in the graph.
|
||||||
|
|
||||||
|
##### OpenAIRE identifier of the representative record
|
||||||
|
|
||||||
|
The OpenAIRE identifier of the representative record is generated based on the
|
||||||
|
identifier of the record chosen as the pivot of the group:
|
||||||
|
|
||||||
|
- if the pivot record comes from a "PID authority", the identifier of the
|
||||||
|
representative record is the same, but the "PID Type Prefix" part of the
|
||||||
|
identifier is modified to append ``_dedup``.<br/>
|
||||||
|
For example ``50|doi_________::d5021b53204e4fdeab6ff5d5bc468032`` will
|
||||||
|
become ``50|doi_dedup___::d5021b53204e4fdeab6ff5d5bc468032``
|
||||||
|
- otherwise the "PID Type Prefix" part will be set to the fixed value
|
||||||
|
``dedup_wf_002``, and the following hash will be calculated as the MD5 hash of
|
||||||
|
the entire raw id of the pivot record.<br/>
|
||||||
|
For example ``50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1g`` will
|
||||||
|
become ``50|dedup_wf_002::345e5d1b80537b0d0e0a49241ae9e516``
|
||||||
|
|
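The two rules above can be sketched as follows. The function name `representative_id`, the 12-character prefix width (inferred from the examples), and the exact string passed to MD5 are assumptions; in particular, the text does not specify what "the entire raw id" contains, so this sketch is not expected to reproduce the hash shown in the second example.

```python
import hashlib

PREFIX_WIDTH = 12  # width of the "PID Type Prefix" in the examples above


def representative_id(pivot_id: str, from_pid_authority: bool) -> str:
    """Derive a representative record id from a pivot id (illustrative)."""
    entity, rest = pivot_id.split("|", 1)    # e.g. "50" and the remainder
    prefix, suffix = rest.split("::", 1)     # PID type prefix and hash part
    if from_pid_authority:
        # Append "_dedup" to the prefix, then pad back to the fixed width.
        new_prefix = (prefix.rstrip("_") + "_dedup").ljust(PREFIX_WIDTH, "_")
        return f"{entity}|{new_prefix}::{suffix}"
    # Assumption: the full identifier string is what gets hashed.
    digest = hashlib.md5(pivot_id.encode("utf-8")).hexdigest()
    return f"{entity}|dedup_wf_002::{digest}"
```

In the first case the hash part is preserved, so the representative record remains recognisable from the pivot's PID-based identifier.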
||||||
|
##### Content of the representative record
|
||||||
|
|
||||||
|
The representative record inherits properties from the records it merges
|
||||||
|
and tracks their provenance. Whenever possible, it preserves all data from the
|
||||||
|
merged records, such as the ``instance`` field. In cases where a specific value
|
||||||
|
must be chosen, the most representative one is selected. For example, for the
|
||||||
|
"dateofacceptance" field, the earliest value is chosen.
|
||||||
|
|
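As a rough illustration of the two behaviours described above (keep everything where possible, pick one value where necessary), the sketch below merges records represented as plain dictionaries. The dictionary layout, the `mergedfrom` key, and the use of ISO `YYYY-MM-DD` date strings are assumptions made for the example; only the ``instance`` and "dateofacceptance" field names come from the text.

```python
def merge_records(records):
    """Build the content of a representative record (illustrative sketch)."""
    instances = []
    for record in records:
        # Preserve every instance from every merged record.
        instances.extend(record.get("instance", []))

    # ISO date strings compare correctly as plain strings.
    dates = [r["dateofacceptance"] for r in records if r.get("dateofacceptance")]

    return {
        "instance": instances,
        "dateofacceptance": min(dates) if dates else None,  # earliest value
        "mergedfrom": [r["id"] for r in records],  # hypothetical provenance key
    }
```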
||||||
|
##### Merged and singleton representative record
|
||||||
|
|
||||||
|
Changes in metadata content or graph construction may lead to cases where
|
||||||
|
representative records disappear from the graph:
|
||||||
|
|
||||||
|
1. When two or more representative records are merged into one representative
|
||||||
|
record. In other terms, this happens when a group of duplicated records
|
||||||
|
contains multiple records formerly used as pivot records.
|
||||||
|
2. When a record chosen as a pivot record exits its group and remains alone.
|
||||||
|
3. When a record chosen as a pivot record is no longer published by its data
|
||||||
|
source (deletion of the metadata record).
|
||||||
|
|
||||||
|
To address these cases, the pivot history table ensures the visibility of
|
||||||
|
disappearing representative records for the first two cases. Specifically:
|
||||||
|
|
||||||
|
1. In the case of merged representative records, the new representative record
|
||||||
|
and the ones that would be lost are generated and linked as part of the new
|
||||||
|
representative record.
|
||||||
|
2. In the case of a record no longer serving as a pivot, a representative record
|
||||||
|
is generated and linked only with that record.
|
||||||
|
|
||||||
|
This approach ensures that users can access representative records that would
|
||||||
|
otherwise be lost.
|
||||||
|
|