From f279cdfe1062f527333b91f6652d797a458e17e5 Mon Sep 17 00:00:00 2001 From: mkallipo <95910739+mkallipo@users.noreply.github.com> Date: Fri, 26 Apr 2024 10:55:10 +0200 Subject: [PATCH 1/5] affiliation matching description update --- .../affiliation_matching.md | 66 ++++++++++++++++--- 1 file changed, 58 insertions(+), 8 deletions(-) diff --git a/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md b/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md index 539e51b..fadb3d7 100644 --- a/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md +++ b/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md @@ -4,9 +4,14 @@ sidebar_position: 1 # Affiliation matching -***Short description:*** The goal of the affiliation matching module is to match affiliations extracted from the pdf and xml documents with organizations from the OpenAIRE organization database. +***Short description:*** The goal of the affiliation matching module is to pair affiliations with organizations listed in either OpenAIRE or ROR organization database. +Depending on the data source, we currently employ two distinct methodologies: -***Algorithmic details:*** + * The first method revolves around affiliations extracted from PDF and XML documents, which are subsequently matched with organizations within the OpenAIRE database. + * The second concerns affiliations retrieved from platforms such as Crossref, PubMed, and Datacite, and are matched to organizations of the ROR database. + + +***Algorithmic details of the first method:*** *The buckets concept* @@ -39,13 +44,13 @@ The total match strength is calculated in such a way that each consecutive voter ***Parameters:*** * input - * input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. 
Document metadata is the source of affiliations. - * input_organizations: [Organization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/Organization.avdl) avro datastore location. - * input_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/DocumentToProject.avdl) avro datastore location with **imported** document-to-project relations. These relations (alongside with inferred document-project and project-organization relations) are used to generate document-organization pairs which are used as a hint for matching affiliations. - * input_inferred_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/referenceextraction/project/DocumentToProject.avdl) avro datastore location with **inferred** document-to-project relations. - * input_project_to_organization: [ProjectToOrganization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/ProjectToOrganization.avdl) avro datastore location. These relations (alongside with infered document-project and document-project relations) are used to generate document-organization pairs which are used as a hint for matching affiliations + * input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. Document metadata is the source of affiliations. + * input_organizations: [Organization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/Organization.avdl) avro datastore location. 
+ * input_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/DocumentToProject.avdl) avro datastore location with **imported** document-to-project relations. These relations (along with inferred document-project and project-organization relations) are used to generate document-organization pairs which are used as a hint for matching affiliations. + * input_inferred_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/referenceextraction/project/DocumentToProject.avdl) avro datastore location with **inferred** document-to-project relations. + * input_project_to_organization: [ProjectToOrganization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/ProjectToOrganization.avdl) avro datastore location. These relations (along with imported and inferred document-project relations) are used to generate document-organization pairs which are used as a hint for matching affiliations. * output - * [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations. + * [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations. 
***Limitations:*** - @@ -55,3 +60,48 @@ Java, Spark ***References:*** - ***Authority:*** ICM • ***License:*** AGPL-3.0 • ***Code:*** [CoAnSys/affiliation-organization-matching](https://github.com/CeON/CoAnSys/tree/master/affiliation-organization-matching) + + +***Algorithmic details of the second method:*** + +*Categorization* + +The affiliation strings are imported and undergo cleaning, tokenization, and removal of stopwords. Similar to the “buckets concept” of the first method, the goal is to split the affiliation strings, as well as the ROR organizations, into coherent groups. To achieve this, data preprocessing has already been conducted on ROR's data, involving the analysis of word frequency ('keywords') within the legal names of ROR's organizations to define specific categories. These categories include universities and institutes, laboratories, hospitals, companies, museums, governments, foundations, and other organizations. ROR's organizations have subsequently been assigned to these categories based on their legal names. The algorithm employs a similar approach to categorize affiliations into these same groups. + +*String Shortening* + +The objective is to extract pertinent details from each affiliation string. The algorithm divides the string whenever a comma (,) or semicolon (;) is detected. It then applies specific 'rules' to these segments and retains only those containing relevant keywords. Additionally, it trims down the segments by preserving words in proximity to particular keywords like "university," "institute," "laboratory," or "hospital." As a result, the average string length is reduced from 90 to 35 characters. + +*Matching with ROR's Database* + +The algorithm checks whether a substring containing a keyword is linked to a legal name or to an alternative name in the organizations listed in the ROR database. In order to identify the most accurate match, the algorithm employs cosine similarity. 
Although alternative methods like Levenshtein Distance or Jaro-Winkler Distance were considered for measuring string similarity, it was concluded that cosine similarity was the most appropriate choice for this specific application. + +*Refinement* + +If multiple matches are found above the desired similarity thresholds, the algorithm performs another check. It applies cosine similarity between the organizations found in the ROR's database and the original affiliation string. This comparison takes into account additional information present in the original affiliation, such as addresses or city names. The algorithm aims to identify the best fit among the potential matches. Note that the case where two or more different organizations share the same name is also considered. + +***Parameters:*** + +* input + * source of affiliations: JSON Crossref or XML Pubmed or Parquet DataCite files. + + * organizations: [dix_acad.pkl](https://github.com/mkallipo/affiliation-matching/blob/main/dictionaries/dix_acad.pkl), [dix_mult](https://github.com/mkallipo/affiliation-matching/blob/main/dictionaries/dix_mult.pkl, [dix_city](https://github.com/mkallipo/affiliation-matching/blob/main/dictionaries/dix_city.pkl), [dix_country](https://github.com/mkallipo/affiliation-matching/blob/main/dictionaries/dix_country.pkl) (four pickled dictionaries with keys legalnames and alternativenames of organizations in the ROR database.) + + * similarity thresholds: simU for universities, simG for other organizations (default values are simU = 0.64, simG = 0.87). + cument-organization pairs which are used as a hint for matching affiliations + +* output + * [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations. 
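
As a rough illustration of the matching step and of the two similarity thresholds listed in the parameters above, the following is a minimal sketch, not the AffRo implementation. It assumes whitespace tokenization and plain word-count vectors (the text does not say how strings are vectorized), and the function names `cosine_similarity` and `best_match` are hypothetical.

```python
import math
from collections import Counter

SIM_U = 0.64  # default threshold for universities (value quoted in the parameters)
SIM_G = 0.87  # default threshold for other organizations


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings.

    Assumption: lowercase + whitespace tokenization; the real tool may differ.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(segment: str, candidates: list[str], is_university: bool):
    """Return (name, score) of the best candidate above the threshold, or None."""
    threshold = SIM_U if is_university else SIM_G
    scored = [(name, cosine_similarity(segment, name)) for name in candidates]
    above = [pair for pair in scored if pair[1] >= threshold]
    if not above:
        return None
    return max(above, key=lambda pair: pair[1])


if __name__ == "__main__":
    names = ["university of athens", "university of patras"]  # toy data
    print(best_match("university of athens", names, is_university=True))
```

Both toy names clear the university threshold here, which is the situation the refinement step addresses by comparing the candidates against the full original affiliation string.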
+ + +***Limitations:*** - + +***Environment:*** +Python + +***References:*** - + +***Authority:*** OpenAIRE • ***License:***AGPL-3.0 • ***Code:*** [AffRo](https://github.com/mkallipo/affiliation-matching) + + + -- 2.17.1 From 4cdb5f7f31eb52d387e5994a8c8fd89778da6e3e Mon Sep 17 00:00:00 2001 From: mkallipo <95910739+mkallipo@users.noreply.github.com> Date: Fri, 26 Apr 2024 11:13:04 +0200 Subject: [PATCH 2/5] affiliation matching description update --- .../enrichment-by-mining/affiliation_matching.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md b/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md index fadb3d7..a23b711 100644 --- a/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md +++ b/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md @@ -85,13 +85,13 @@ If multiple matches are found above the desired similarity thresholds, the algor * input * source of affiliations: JSON Crossref or XML Pubmed or Parquet DataCite files. - * organizations: [dix_acad.pkl](https://github.com/mkallipo/affiliation-matching/blob/main/dictionaries/dix_acad.pkl), [dix_mult](https://github.com/mkallipo/affiliation-matching/blob/main/dictionaries/dix_mult.pkl, [dix_city](https://github.com/mkallipo/affiliation-matching/blob/main/dictionaries/dix_city.pkl), [dix_country](https://github.com/mkallipo/affiliation-matching/blob/main/dictionaries/dix_country.pkl) (four pickled dictionaries with keys legalnames and alternativenames of organizations in the ROR database.) 
+ * organizations: [dix_acad.pkl](https://github.com/openaire/affro/blob/main/dictionaries/dix_acad.pkl), [dix_mult](https://github.com/openaire/affro/blob/main/dictionaries/dix_mult.pkl), [dix_city](https://github.com/openaire/affro/blob/main/dictionaries/dix_city.pkl), [dix_country](https://github.com/openaire/affro/blob/main/dictionaries/dix_country.pkl) (four pickled dictionaries with keys legalnames and alternativenames of organizations in the ROR database.) * similarity thresholds: simU for universities, simG for other organizations (default values are simU = 0.64, simG = 0.87). cument-organization pairs which are used as a hint for matching affiliations * output - * [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations. + * JSON file with ROR ids of organizations and corresponding similarity scores for each DOI. ***Limitations:*** - -- 2.17.1 From c017c95486436adc7b04f13cbef46f3e4e9ce606 Mon Sep 17 00:00:00 2001 From: Serafeim Chatzopoulos Date: Sat, 4 May 2024 11:56:31 +0300 Subject: [PATCH 3/5] Adjust text in affiliation matching page --- .../enrichment-by-mining/affiliation_matching.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md b/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md index a23b711..3637b3c 100644 --- a/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md +++ b/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md @@ -4,14 +4,14 @@ sidebar_position: 1 # Affiliation matching -***Short description:*** The goal of the affiliation matching module is to pair affiliations with organizations listed in either OpenAIRE or ROR organization database. 
+***Short description:*** The goal of the affiliation matching module is to match affiliation strings (identified in full-text PDFs or in scholarly databases, such as Crossref) with persistent organization identifiers (e.g., ROR identifiers). Depending on the data source, we currently employ two distinct methodologies: - * The first method revolves around affiliations extracted from PDF and XML documents, which are subsequently matched with organizations within the OpenAIRE database. - * The second concerns affiliations retrieved from platforms such as Crossref, PubMed, and Datacite, and are matched to organizations of the ROR database. +- The [first](#algorithmic-details-of-the-first-method) method revolves around affiliations extracted from PDF and XML documents, which are subsequently matched with organizations within the OpenAIRE database. +- The [second](#algorithmic-details-of-the-second-method) concerns affiliations retrieved from platforms such as Crossref, PubMed, and Datacite, and are matched to organizations of the ROR database. 
-***Algorithmic details of the first method:*** +## Algorithmic details of the first method *The buckets concept* @@ -62,7 +62,7 @@ Java, Spark ***Authority:*** ICM • ***License:*** AGPL-3.0 • ***Code:*** [CoAnSys/affiliation-organization-matching](https://github.com/CeON/CoAnSys/tree/master/affiliation-organization-matching) -***Algorithmic details of the second method:*** +## Algorithmic details of the second method *Categorization* @@ -101,7 +101,7 @@ Python ***References:*** - -***Authority:*** OpenAIRE • ***License:***AGPL-3.0 • ***Code:*** [AffRo](https://github.com/mkallipo/affiliation-matching) +***Authority:*** OpenAIRE • ***License:*** AGPL-3.0 • ***Code:*** [AffRo](https://github.com/openaire/affro) -- 2.17.1 From b7cb15e94249fdcf79b6461d85115f96ab2c76a7 Mon Sep 17 00:00:00 2001 From: Serafeim Chatzopoulos Date: Sat, 4 May 2024 11:59:14 +0300 Subject: [PATCH 4/5] Update affiliation matching page in v7.1.3 --- .../affiliation_matching.md | 66 ++++++++++++++++--- 1 file changed, 58 insertions(+), 8 deletions(-) diff --git a/versioned_docs/version-7.1.3/graph-production-workflow/enrichment-by-mining/affiliation_matching.md b/versioned_docs/version-7.1.3/graph-production-workflow/enrichment-by-mining/affiliation_matching.md index 539e51b..3637b3c 100644 --- a/versioned_docs/version-7.1.3/graph-production-workflow/enrichment-by-mining/affiliation_matching.md +++ b/versioned_docs/version-7.1.3/graph-production-workflow/enrichment-by-mining/affiliation_matching.md @@ -4,9 +4,14 @@ sidebar_position: 1 # Affiliation matching -***Short description:*** The goal of the affiliation matching module is to match affiliations extracted from the pdf and xml documents with organizations from the OpenAIRE organization database. +***Short description:*** The goal of the affiliation matching module is to match affiliation strings (identified in full-text PDFs or in scholarly databases, such as Crossref) with persistent organization identifiers (e.g., ROR identifiers). 
+Depending on the data source, we currently employ two distinct methodologies: -***Algorithmic details:*** +- The [first](#algorithmic-details-of-the-first-method) method revolves around affiliations extracted from PDF and XML documents, which are subsequently matched with organizations within the OpenAIRE database. +- The [second](#algorithmic-details-of-the-second-method) concerns affiliations retrieved from platforms such as Crossref, PubMed, and Datacite, and are matched to organizations of the ROR database. + + +## Algorithmic details of the first method *The buckets concept* @@ -39,13 +44,13 @@ The total match strength is calculated in such a way that each consecutive voter ***Parameters:*** * input - * input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. Document metadata is the source of affiliations. - * input_organizations: [Organization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/Organization.avdl) avro datastore location. - * input_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/DocumentToProject.avdl) avro datastore location with **imported** document-to-project relations. These relations (alongside with inferred document-project and project-organization relations) are used to generate document-organization pairs which are used as a hint for matching affiliations. - * input_inferred_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/referenceextraction/project/DocumentToProject.avdl) avro datastore location with **inferred** document-to-project relations. 
- * input_project_to_organization: [ProjectToOrganization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/ProjectToOrganization.avdl) avro datastore location. These relations (alongside with infered document-project and document-project relations) are used to generate document-organization pairs which are used as a hint for matching affiliations + * input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. Document metadata is the source of affiliations. + * input_organizations: [Organization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/Organization.avdl) avro datastore location. + * input_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/DocumentToProject.avdl) avro datastore location with **imported** document-to-project relations. These relations (alongside with inferred document-project and project-organization relations) are used to generate document-organization pairs which are used as a hint for matching affiliations. + * input_inferred_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/referenceextraction/project/DocumentToProject.avdl) avro datastore location with **inferred** document-to-project relations. + * input_project_to_organization: [ProjectToOrganization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/ProjectToOrganization.avdl) avro datastore location. 
These relations (along with imported and inferred document-project relations) are used to generate document-organization pairs which are used as a hint for matching affiliations. * output - * [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations. + * [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations. ***Limitations:*** - @@ -55,3 +60,48 @@ Java, Spark ***References:*** - ***Authority:*** ICM • ***License:*** AGPL-3.0 • ***Code:*** [CoAnSys/affiliation-organization-matching](https://github.com/CeON/CoAnSys/tree/master/affiliation-organization-matching) + + +## Algorithmic details of the second method + +*Categorization* + +The affiliation strings are imported and undergo cleaning, tokenization, and removal of stopwords. Similar to the “buckets concept” of the first method, the goal is to split the affiliation strings, as well as the ROR organizations, into coherent groups. To achieve this, data preprocessing has already been conducted on ROR's data, involving the analysis of word frequency ('keywords') within the legal names of ROR's organizations to define specific categories. These categories include universities and institutes, laboratories, hospitals, companies, museums, governments, foundations, and other organizations. ROR's organizations have subsequently been assigned to these categories based on their legal names. The algorithm employs a similar approach to categorize affiliations into these same groups. + +*String Shortening* + +The objective is to extract pertinent details from each affiliation string. 
The algorithm divides the string whenever a comma (,) or semicolon (;) is detected. It then applies specific 'rules' to these segments and retains only those containing relevant keywords. Additionally, it trims down the segments by preserving words in proximity to particular keywords like "university," "institute," "laboratory," or "hospital." As a result, the average string length is reduced from 90 to 35 characters. + +*Matching with ROR's Database* + +The algorithm checks whether a substring containing a keyword is linked to a legal name or to an alternative name in the organizations listed in the ROR database. In order to identify the most accurate match, the algorithm employs cosine similarity. Although alternative methods like Levenshtein Distance or Jaro-Winkler Distance were considered for measuring string similarity, it was concluded that cosine similarity was the most appropriate choice for this specific application. + +*Refinement* + +If multiple matches are found above the desired similarity thresholds, the algorithm performs another check. It applies cosine similarity between the organizations found in the ROR database and the original affiliation string. This comparison takes into account additional information present in the original affiliation, such as addresses or city names. The algorithm aims to identify the best fit among the potential matches. Note that the case where two or more different organizations share the same name is also considered. + +***Parameters:*** + +* input + * source of affiliations: Crossref JSON, PubMed XML, or DataCite Parquet files. 
+ + * organizations: [dix_acad.pkl](https://github.com/openaire/affro/blob/main/dictionaries/dix_acad.pkl), [dix_mult](https://github.com/openaire/affro/blob/main/dictionaries/dix_mult.pkl), [dix_city](https://github.com/openaire/affro/blob/main/dictionaries/dix_city.pkl), [dix_country](https://github.com/openaire/affro/blob/main/dictionaries/dix_country.pkl) (four pickled dictionaries with keys legalnames and alternativenames of organizations in the ROR database.) + + * similarity thresholds: simU for universities, simG for other organizations (default values are simU = 0.64, simG = 0.87). + cument-organization pairs which are used as a hint for matching affiliations + +* output + * JSON file with ROR ids of organizations and corresponding similarity scores for each DOI. + + +***Limitations:*** - + +***Environment:*** +Python + +***References:*** - + +***Authority:*** OpenAIRE • ***License:*** AGPL-3.0 • ***Code:*** [AffRo](https://github.com/openaire/affro) + + + -- 2.17.1 From 75b1cdf92ef064d8e54d445c674c7f4f017959ca Mon Sep 17 00:00:00 2001 From: Giambattista Bloisi Date: Mon, 22 Apr 2024 14:22:29 +0200 Subject: [PATCH 5/5] =?UTF-8?q?Describe=20the=20usage=20of=20the=20pivot?= =?UTF-8?q?=20table=20to=20improve=20stability=20of=20=E2=80=9Crepresentat?= =?UTF-8?q?ive=20records=E2=80=9D=20and=20how=20=E2=80=9Cnon=20authoritati?= =?UTF-8?q?ve=E2=80=9D=20PIDs=20are=20used=20to=20generate=20=E2=80=9Crepr?= =?UTF-8?q?esentative=20records=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/data-model/pids-and-identifiers.md | 30 +++--- .../deduplication/deduplication.md | 6 +- .../deduplication/research-products.md | 91 ++++++++++++++++--- docs/graph-production-workflow/merge-by-id.md | 2 +- 4 files changed, 96 insertions(+), 33 deletions(-) diff --git a/docs/data-model/pids-and-identifiers.md b/docs/data-model/pids-and-identifiers.md index 3e3012e..baf9a9b 100644 --- a/docs/data-model/pids-and-identifiers.md 
+++ b/docs/data-model/pids-and-identifiers.md @@ -35,10 +35,10 @@ assigns PIDs to their scientific products from a given PID minter. This "selection" can be performed when the entities in the graph sharing the same identifier are grouped together. The list of the delegated authorities currently includes -| Datasource delegated | Datasource delegating | Pid Type | -|--------------------------------------|----------------------------------|-----------| -| [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | doi | -| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id | +| Datasource delegated | Datasource delegating | Pid Type | +|--------------------------------------|----------------------------------|----------| +| [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | doi | +| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id | ## Identifiers in the Graph @@ -66,16 +66,16 @@ When the record is collected from a source which is not authoritative for any ty Currently, the following data sources are used as "PID authorities": -| PID Type | Prefix (12 chars) | Authority | -|-----------|------------------------|-------------------------------------------| -| doi | `doi_________` | Crossref, Datacite, Zenodo | -| pmc | `pmc_________` | Europe PubMed Central, PubMed Central | -| pmid | `pmid________` | Europe PubMed Central, PubMed Central | -| arXiv | `arXiv_______` | arXiv.org e-Print Archive | -| handle | `handle______` | any repository | -| ena | `ena_________` | EMBL-EBI | -| pdb | `pdb_________` | EMBL-EBI | -| uniprot | `uniprot_____` | EMBL-EBI | +| PID Type | Prefix (12 chars) | Authority | +|----------|-----------------------|-----------------------------------------| +| doi | `doi_________` | Crossref, Datacite, Zenodo | +| pmc | `pmc_________` | Europe PubMed Central, PubMed Central | +| pmid | `pmid________` | Europe PubMed Central, PubMed Central | +| arXiv | `arXiv_______` | arXiv.org 
e-Print Archive | +| handle | `handle______` | any repository | +| ena | `ena_________` | EMBL-EBI | +| pdb | `pdb_________` | EMBL-EBI | +| uniprot | `uniprot_____` | EMBL-EBI | OpenAIRE also perform duplicate identification (see the [dedicated section for details](/graph-production-workflow/deduplication)). -All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated record). +All duplicates are **merged** together in a **representative record** which must be assigned a [dedicated OpenAIRE identifier](/graph-production-workflow/deduplication/research-products#openaire-identifier-of-the-representative-record) (i.e. it cannot have the identifier of one of the aggregated record). diff --git a/docs/graph-production-workflow/deduplication/deduplication.md b/docs/graph-production-workflow/deduplication/deduplication.md index 09516f9..1308c90 100644 --- a/docs/graph-production-workflow/deduplication/deduplication.md +++ b/docs/graph-production-workflow/deduplication/deduplication.md @@ -2,9 +2,9 @@ The OpenAIRE Graph is populated by aggregating metadata records from distinct data sources whose content typically overlaps. For example, the collection of article metadata records from publisher' archives (e.g. Frontiers, Elsevier, Copernicus) and from pre-print platforms (e.g. ArXiv.org, UKPubMed, BioarXiv.org). In order to support monitoring of science, the OpenAIRE Graph implements record deduplication and merge strategies, in such a way the scientific production can be consistently statistically represented. Such strategies reflect the following intuition behind OpenAIRE monitoring: "Two metadata records are equivalent when they describe the same research product, hence they feature compatible resource types, have the same title, the same authors, or, alternatively, the same PID". 
Finally, groups of duplicates can be whitelisted or blacklisted, in order to manually refine the quality of this strategy. -It should be noticed that publication dates do not make a difference, as different versions of the same product can be published at different times; e.g. the pre-print and a published version of a scientific article, which should be counted as one object; abstracts, subjects, and other possible related fields, are not used to strenghten similarity, due to their heterogeneity or absence across different data sources. Moreover, even when two products are indicated as one a new version of the other, the presence of different authors will not bring them into the same group, to avoid unfair distribution of scientific reward. +It should be noticed that publication dates do not make a difference, as different versions of the same product can be published at different times; e.g. the pre-print and a published version of a scientific article, which should be counted as one object; abstracts, subjects, and other possible related fields, are not used to strengthen similarity, due to their heterogeneity or absence across different data sources. Moreover, even when two products are indicated as one a new version of the other, the presence of different authors will not bring them into the same group, to avoid unfair distribution of scientific reward. -Groups of duplicates are finally merged into a new "dedup" record that embeds all properties of the merged records and carries provenance information about the data sources and the relative "instances", i.e. manifestations of the products, together with their resource type, access rights, and publishing date. +Groups of duplicates are finally merged into a new "representative record", having its own id, embedding properties of the merged records and carrying provenance information about the data sources and the relative "instances", i.e. 
manifestations of the products, together with their resource type, access rights, and publishing date. ## Methodology overview @@ -37,7 +37,7 @@ To further limit the number of comparisons, a sliding window mechanism is used: ### Duplicates grouping (transitive closure) -Once the similarity relations between pairs of records are drawn, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new representative object is obtained, which inherits all properties from the merged records and keeps track of their provenance. +Once the similarity relations between pairs of records are drawn, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new **representative record** is obtained, which inherits properties from the merged records and keeps track of their provenance. ### Relation redistribution diff --git a/docs/graph-production-workflow/deduplication/research-products.md b/docs/graph-production-workflow/deduplication/research-products.md index ca56b89..9287c2b 100644 --- a/docs/graph-production-workflow/deduplication/research-products.md +++ b/docs/graph-production-workflow/deduplication/research-products.md @@ -149,22 +149,85 @@ The comparison goes through different stages: ### Duplicates grouping -The aim of the final stage is the creation of objects that group all the equivalent -entities discovered by the previous step. This is done in two phases. +The aim of the final stage is the creation of records that group all the +equivalent entities discovered pairwise by the previous step. This is done in +multiple phases. #### Transitive closure -As a final step of duplicate identification a transitive closure -is run against similarity relations to find groups of duplicates not directly -caught by the previous steps. If a group is larger than 200 elements only the -first 200 elements will be included in the group, while the remaining will be -kept ungrouped. 
-#### Creation of representative record (dedup record) +As the concluding step of duplicate identification, a transitive closure is +performed against similarity relations to identify complete groups of duplicated +records (cliques). If a group exceeds 200 elements, only the first 200 elements +are included in the group, while the remaining elements are kept ungrouped. -The general concept is that the field coming from the record with higher "trust" -value is used as reference for the field of the representative record. +#### Selection of the pivot record -The IDs of the representative records are obtained by prepending the -prefix ``dedup_`` to the MD5 of the first ID (given their lexicographical -ordering). If the group of merged records contains a trusted ID type (i.e. the -DOI), also the type keyword (i.e. ``DOI``) is added to the prefix. \ No newline at end of file +Each group of duplicate records needs to be identified in the final graph with +an OpenAIRE identifier, derived from a record of the group known as the _pivot +record_. The pivot record is the first record of the group after sorting it by +the following criteria, in order of priority: + +1. Records previously chosen as pivot records in previous generations of the + graph. +2. Records with identifiers from a [PID authority](/data-model/pids-and-identifiers#pid-authorities). +3. Publications from CrossRef or datasets from DataCite. +4. Records with an earlier date of acceptance. +5. Records with smaller IDs in lexicographical order. + +The first sorting criterion is possible because a state table, called "pivot +history", is maintained across graph generations. It keeps track of which +records were used as pivot records in which graph, and is guaranteed to retain +data for the last 12 months. + +#### Creation of representative records + +The representative record, also known as the "dedup record", replaces the group +of deduplicated records in the graph. 
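
The pivot selection criteria listed above can be read as a single multi-key sort. The following is a minimal sketch under stated assumptions, not the production implementation: the `Record` class, its field names, and the choice to sort records without a date of acceptance last are all hypothetical and only serve to illustrate the ordering.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical, simplified view of one record in a group of duplicates."""
    id: str
    was_pivot: bool = False                  # listed in the "pivot history" table
    from_pid_authority: bool = False         # identifier issued by a PID authority
    from_crossref_or_datacite: bool = False  # CrossRef publication or DataCite dataset
    date_of_acceptance: Optional[date] = None


def pivot_sort_key(record: Record):
    # Tuples compare element by element and False sorts before True,
    # so each flag is negated to place matching records first.
    return (
        not record.was_pivot,                      # criterion 1
        not record.from_pid_authority,             # criterion 2
        not record.from_crossref_or_datacite,      # criterion 3
        record.date_of_acceptance or date.max,     # criterion 4 (assumption: no date sorts last)
        record.id,                                 # criterion 5
    )


def select_pivot(group: Iterable[Record]) -> Record:
    """Return the record that sorts first according to the five criteria."""
    return min(group, key=pivot_sort_key)
```

Because the key is a tuple, a later criterion is only consulted when all earlier ones tie, which matches the priority order of the numbered list.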
+ +##### OpenAIRE identifier of the representative record + +The OpenAIRE identifier of the representative record is generated based on the +identifier of the record chosen as the pivot of the group: + +- if the pivot record comes from a "PID authority", the identifier of the + representative record is the same, but the "PID Type Prefix" part of the + identifier is modified to append ``_dedup``.
+ For example ``doi_________::d5021b53204e4fdeab6ff5d5bc468032`` will + become ``doi_dedup___::d5021b53204e4fdeab6ff5d5bc468032`` +- otherwise the "PID Type Prefix" part will be set to the fixed value + ``dedup_wf_002``, and the hash part that follows it will be calculated as the + MD5 hash of the entire raw id of the pivot record.
+ For example ``DansKnawCris::0829b5191605bdbea36d6502b8c1ce1g`` will + become ``dedup_wf_002::345e5d1b80537b0d0e0a49241ae9e516`` + +##### Content of the representative record + +The representative record inherits properties from the records it merges +and tracks their provenance. Whenever possible, it preserves all data from the +merged records, such as the ``instance`` field. In cases where a specific value +must be chosen, the most representative one is selected. For example, for the +"dateofacceptance" field, the earliest value is chosen. + +##### Merged and singleton representative records + +Changes in metadata content or graph construction may lead to cases where +representative records disappear from the graph: + +1. When two or more representative records are merged into one representative + record. In other words, this happens when a group of duplicated records + contains multiple records formerly used as pivot records. +2. When a record chosen as a pivot record leaves its group and remains alone. +3. When a record chosen as a pivot record is no longer published by its data + source (deletion of the metadata record). + +For the first two cases, the pivot history table keeps the disappearing +representative records visible. Specifically: + +1. In the case of merged representative records, the new representative record + and the ones that would be lost are generated and linked as part of the new + representative record. +2. In the case of a record no longer serving as a pivot, a representative record + is generated and linked only with that record. + +This approach ensures that users can access representative records that would +otherwise be lost. 
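
The identifier rules from "OpenAIRE identifier of the representative record" can be sketched as follows. This is an illustrative sketch only: the function name, the 12-character prefix width (inferred from the examples), the ``_`` padding, and the UTF-8 encoding of the raw id before hashing are assumptions, not confirmed details of the workflow.

```python
import hashlib

SEPARATOR = "::"
PREFIX_LENGTH = 12  # assumption: prefix width observed in the documented examples


def representative_id(pivot_id: str, from_pid_authority: bool) -> str:
    """Derive the representative record id from the pivot record id (sketch)."""
    prefix, _, suffix = pivot_id.partition(SEPARATOR)
    if from_pid_authority:
        # Keep the hash part, append "_dedup" to the PID type and re-pad with "_".
        pid_type = prefix.rstrip("_")
        new_prefix = (pid_type + "_dedup").ljust(PREFIX_LENGTH, "_")
        return new_prefix + SEPARATOR + suffix
    # Otherwise: fixed prefix plus the MD5 of the entire raw pivot id
    # (assumption: the id is encoded as UTF-8 before hashing).
    digest = hashlib.md5(pivot_id.encode("utf-8")).hexdigest()
    return "dedup_wf_002" + SEPARATOR + digest


print(representative_id("doi_________::d5021b53204e4fdeab6ff5d5bc468032", True))
```

For the DOI example this prints ``doi_dedup___::d5021b53204e4fdeab6ff5d5bc468032``. For a pivot that does not come from a PID authority, the result always has the ``dedup_wf_002`` prefix followed by a 32-character hexadecimal digest; the exact digest depends on how the raw id is serialized, which this sketch only assumes.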
diff --git a/docs/graph-production-workflow/merge-by-id.md b/docs/graph-production-workflow/merge-by-id.md index 3549ecd..9e994c7 100644 --- a/docs/graph-production-workflow/merge-by-id.md +++ b/docs/graph-production-workflow/merge-by-id.md @@ -16,7 +16,7 @@ a global grouping of every record available in the graph: This ensures that the same record, possibly assigned to different types by different mappings, appears only once in the graph and under a single typing. In case of clashing -identifiers, the properties are merged (including the provencance information), considering +identifiers, the properties are merged (including the provenance information), considering the following precedence order for the research product typing: ``` -- 2.17.1