From f0adbba8d78cffa06f8ccb276044ba436c5932b7 Mon Sep 17 00:00:00 2001 From: mkallipo <95910739+mkallipo@users.noreply.github.com> Date: Fri, 26 Apr 2024 10:55:10 +0200 Subject: [PATCH 1/4] affiliation matching description update --- .../affiliation_matching.md | 66 ++++++++++++++++--- 1 file changed, 58 insertions(+), 8 deletions(-) diff --git a/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md b/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md index 539e51b..fadb3d7 100644 --- a/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md +++ b/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md @@ -4,9 +4,14 @@ sidebar_position: 1 # Affiliation matching -***Short description:*** The goal of the affiliation matching module is to match affiliations extracted from the pdf and xml documents with organizations from the OpenAIRE organization database. +***Short description:*** The goal of the affiliation matching module is to pair affiliations with organizations listed in either OpenAIRE or ROR organization database. +Depending on the data source, we currently employ two distinct methodologies: -***Algorithmic details:*** + * The first method revolves around affiliations extracted from PDF and XML documents, which are subsequently matched with organizations within the OpenAIRE database. + * The second concerns affiliations retrieved from platforms such as Crossref, PubMed, and Datacite, and are matched to organizations of the ROR database. + + +***Algorithmic details of the first method:*** *The buckets concept* @@ -39,13 +44,13 @@ The total match strength is calculated in such a way that each consecutive voter ***Parameters:*** * input - * input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. 
Document metadata is the source of affiliations. - * input_organizations: [Organization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/Organization.avdl) avro datastore location. - * input_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/DocumentToProject.avdl) avro datastore location with **imported** document-to-project relations. These relations (alongside with inferred document-project and project-organization relations) are used to generate document-organization pairs which are used as a hint for matching affiliations. - * input_inferred_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/referenceextraction/project/DocumentToProject.avdl) avro datastore location with **inferred** document-to-project relations. - * input_project_to_organization: [ProjectToOrganization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/ProjectToOrganization.avdl) avro datastore location. These relations (alongside with infered document-project and document-project relations) are used to generate document-organization pairs which are used as a hint for matching affiliations + * input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. Document metadata is the source of affiliations. + * input_organizations: [Organization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/Organization.avdl) avro datastore location. 
+ * input_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/DocumentToProject.avdl) avro datastore location with **imported** document-to-project relations. These relations (alongside with inferred document-project and project-organization relations) are used to generate document-organization pairs which are used as a hint for matching affiliations. + * input_inferred_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/referenceextraction/project/DocumentToProject.avdl) avro datastore location with **inferred** document-to-project relations. + * input_project_to_organization: [ProjectToOrganization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/ProjectToOrganization.avdl) avro datastore location. These relations (alongside with infered document-project and document-project relations) are used to generate document-organization pairs which are used as a hint for matching affiliations * output - * [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations. + * [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations. 
***Limitations:*** - @@ -55,3 +60,48 @@ Java, Spark ***References:*** - ***Authority:*** ICM • ***License:*** AGPL-3.0 • ***Code:*** [CoAnSys/affiliation-organization-matching](https://github.com/CeON/CoAnSys/tree/master/affiliation-organization-matching) + + +***Algorithmic details of the second method:*** + +*Categorization* + +The affiliations' strings are imported and undergo cleaning, tokenization, and removal of stopwords. Similar to the “buckets concept” of the first method, the goal is to split the affiliation strings, as well as the ROR organizations, into coherent groups. To achieve this, data preprocessing has already been conducted on ROR's data, involving the analysis of word frequency ('keywords') within the legal names of ROR's organizations to define specific categories. These categories include universities and institutes, laboratories, hospitals, companies, museums, governments, foundation, and rest organizations. ROR's organizations have subsequently been assigned to these categories based on their legal names. The algorithm employs a similar approach to categorize affiliations into these same groups. + +*String Shortening* + +The objective is to extract pertinent details from each affiliation string. The algorithm divides the string whenever a comma (,) or semicolon (;) is detected. It then applies specific 'rules' to these segments and retains only those containing relevant keywords. Additionally, it trims down the segments by preserving words in proximity to particular keywords like "university," "institute," "laboratory," or "hospital." As a result, the average string length is reduced from 90 to 35 characters. + +*Matching with ROR's Database* + +The algorithm checks whether a substring containing a keyword is linked to a legal name or to an alternative name in the organizations listed in the ROR's database. In order to identify the most accurate match, the algorithm employs cosine similarity.. 
Although alternative methods like Levenshtein Distance or Jaro-Winkler Distance were considered for measuring string similarity, it was concluded that cosine similarity was the most appropriate choice for this specific application. + +*Refinement* + +If multiple matches are found above the desired similarity thresholds, the algorithm performs another check. It applies cosine similarity between the organizations found in the ROR's database and the original affiliation string. This comparison takes into account additional information present in the original affiliation, such as addresses or city names. The algorithm aims to identify the best fit among the potential matches. Note that the case where two or more different organizations share the same name is also considered. + +***Parameters:*** + +* input + * source of affiliations: JSON Crossref or XML Pubmed or Parquet DataCite files. + + * organizations: [dix_acad.pkl](https://github.com/mkallipo/affiliation-matching/blob/main/dictionaries/dix_acad.pkl), [dix_mult](https://github.com/mkallipo/affiliation-matching/blob/main/dictionaries/dix_mult.pkl, [dix_city](https://github.com/mkallipo/affiliation-matching/blob/main/dictionaries/dix_city.pkl), [dix_country](https://github.com/mkallipo/affiliation-matching/blob/main/dictionaries/dix_country.pkl) (four pickled dictionaries with keys legalnames and alternativenames of organizations in the ROR database.) + + * similarity thresholds: simU for universities, simG for other organizations (default values are simU = 0.64, simG = 0.87). + cument-organization pairs which are used as a hint for matching affiliations + +* output + * [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations. 
+ + +***Limitations:*** - + +***Environment:*** +Python + +***References:*** - + +***Authority:*** OpenAIRE • ***License:***AGPL-3.0 • ***Code:*** [AffRo](https://github.com/mkallipo/affiliation-matching) + + + From f7e9e93209b2d435edfeba6504d2d515c1d9db7a Mon Sep 17 00:00:00 2001 From: mkallipo <95910739+mkallipo@users.noreply.github.com> Date: Fri, 26 Apr 2024 11:13:04 +0200 Subject: [PATCH 2/4] affiliation matching description update --- .../enrichment-by-mining/affiliation_matching.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md b/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md index fadb3d7..a23b711 100644 --- a/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md +++ b/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md @@ -85,13 +85,13 @@ If multiple matches are found above the desired similarity thresholds, the algor * input * source of affiliations: JSON Crossref or XML Pubmed or Parquet DataCite files. - * organizations: [dix_acad.pkl](https://github.com/mkallipo/affiliation-matching/blob/main/dictionaries/dix_acad.pkl), [dix_mult](https://github.com/mkallipo/affiliation-matching/blob/main/dictionaries/dix_mult.pkl, [dix_city](https://github.com/mkallipo/affiliation-matching/blob/main/dictionaries/dix_city.pkl), [dix_country](https://github.com/mkallipo/affiliation-matching/blob/main/dictionaries/dix_country.pkl) (four pickled dictionaries with keys legalnames and alternativenames of organizations in the ROR database.) 
+ * organizations: [dix_acad.pkl](https://github.com/openaire/affro/blob/main/dictionaries/dix_acad.pkl), [dix_mult](https://github.com/openaire/affro/blob/main/dictionaries/dix_mult.pkl), [dix_city](https://github.com/openaire/affro/blob/main/dictionaries/dix_city.pkl), [dix_country](https://github.com/openaire/affro/blob/main/dictionaries/dix_country.pkl) (four pickled dictionaries with keys legalnames and alternativenames of organizations in the ROR database.) * similarity thresholds: simU for universities, simG for other organizations (default values are simU = 0.64, simG = 0.87). cument-organization pairs which are used as a hint for matching affiliations * output - * [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations. + * JSON file with ROR ids of organizations and corresponding similarity scores for each DOI. ***Limitations:*** - From 755c0117ccb9b6a150caac1cdc0557970bd4c836 Mon Sep 17 00:00:00 2001 From: Serafeim Chatzopoulos Date: Sat, 4 May 2024 11:56:31 +0300 Subject: [PATCH 3/4] Adjust text in affiliation matching page --- .../enrichment-by-mining/affiliation_matching.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md b/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md index a23b711..3637b3c 100644 --- a/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md +++ b/docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md @@ -4,14 +4,14 @@ sidebar_position: 1 # Affiliation matching -***Short description:*** The goal of the affiliation matching module is to pair affiliations with organizations listed in either OpenAIRE or ROR organization database. 
+***Short description:*** The goal of the affiliation matching module is to match affiliation strings (identified in full-text PDFs or in scholarly databases, such as Crossref) with persistent organization identifiers (e.g., ROR identifiers). Depending on the data source, we currently employ two distinct methodologies: - * The first method revolves around affiliations extracted from PDF and XML documents, which are subsequently matched with organizations within the OpenAIRE database. - * The second concerns affiliations retrieved from platforms such as Crossref, PubMed, and Datacite, and are matched to organizations of the ROR database. +- The [first](#algorithmic-details-of-the-first-method) method revolves around affiliations extracted from PDF and XML documents, which are subsequently matched with organizations within the OpenAIRE database. +- The [second](#algorithmic-details-of-the-second-method) concerns affiliations retrieved from platforms such as Crossref, PubMed, and Datacite, and are matched to organizations of the ROR database. 
-***Algorithmic details of the first method:*** +## Algorithmic details of the first method *The buckets concept* @@ -62,7 +62,7 @@ Java, Spark ***Authority:*** ICM • ***License:*** AGPL-3.0 • ***Code:*** [CoAnSys/affiliation-organization-matching](https://github.com/CeON/CoAnSys/tree/master/affiliation-organization-matching) -***Algorithmic details of the second method:*** +## Algorithmic details of the second method *Categorization* @@ -101,7 +101,7 @@ Python ***References:*** - -***Authority:*** OpenAIRE • ***License:***AGPL-3.0 • ***Code:*** [AffRo](https://github.com/mkallipo/affiliation-matching) +***Authority:*** OpenAIRE • ***License:*** AGPL-3.0 • ***Code:*** [AffRo](https://github.com/openaire/affro) From 8e3710d970dee7e1ca9475096daf9775f147afe3 Mon Sep 17 00:00:00 2001 From: Serafeim Chatzopoulos Date: Sat, 4 May 2024 11:59:14 +0300 Subject: [PATCH 4/4] Update affiliation matching page in v7.1.3 --- .../affiliation_matching.md | 66 ++++++++++++++++--- 1 file changed, 58 insertions(+), 8 deletions(-) diff --git a/versioned_docs/version-7.1.3/graph-production-workflow/enrichment-by-mining/affiliation_matching.md b/versioned_docs/version-7.1.3/graph-production-workflow/enrichment-by-mining/affiliation_matching.md index 539e51b..3637b3c 100644 --- a/versioned_docs/version-7.1.3/graph-production-workflow/enrichment-by-mining/affiliation_matching.md +++ b/versioned_docs/version-7.1.3/graph-production-workflow/enrichment-by-mining/affiliation_matching.md @@ -4,9 +4,14 @@ sidebar_position: 1 # Affiliation matching -***Short description:*** The goal of the affiliation matching module is to match affiliations extracted from the pdf and xml documents with organizations from the OpenAIRE organization database. +***Short description:*** The goal of the affiliation matching module is to match affiliation strings (identified in full-text PDFs or in scholarly databases, such as Crossref) with persistent organization identifiers (e.g., ROR identifiers). 
+Depending on the data source, we currently employ two distinct methodologies: -***Algorithmic details:*** +- The [first](#algorithmic-details-of-the-first-method) method revolves around affiliations extracted from PDF and XML documents, which are subsequently matched with organizations within the OpenAIRE database. +- The [second](#algorithmic-details-of-the-second-method) concerns affiliations retrieved from platforms such as Crossref, PubMed, and Datacite, and are matched to organizations of the ROR database. + + +## Algorithmic details of the first method *The buckets concept* @@ -39,13 +44,13 @@ The total match strength is calculated in such a way that each consecutive voter ***Parameters:*** * input - * input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. Document metadata is the source of affiliations. - * input_organizations: [Organization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/Organization.avdl) avro datastore location. - * input_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/DocumentToProject.avdl) avro datastore location with **imported** document-to-project relations. These relations (alongside with inferred document-project and project-organization relations) are used to generate document-organization pairs which are used as a hint for matching affiliations. - * input_inferred_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/referenceextraction/project/DocumentToProject.avdl) avro datastore location with **inferred** document-to-project relations. 
- * input_project_to_organization: [ProjectToOrganization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/ProjectToOrganization.avdl) avro datastore location. These relations (alongside with infered document-project and document-project relations) are used to generate document-organization pairs which are used as a hint for matching affiliations + * input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. Document metadata is the source of affiliations. + * input_organizations: [Organization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/Organization.avdl) avro datastore location. + * input_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/DocumentToProject.avdl) avro datastore location with **imported** document-to-project relations. These relations (alongside with inferred document-project and project-organization relations) are used to generate document-organization pairs which are used as a hint for matching affiliations. + * input_inferred_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/referenceextraction/project/DocumentToProject.avdl) avro datastore location with **inferred** document-to-project relations. + * input_project_to_organization: [ProjectToOrganization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/ProjectToOrganization.avdl) avro datastore location. 
These relations (alongside with infered document-project and document-project relations) are used to generate document-organization pairs which are used as a hint for matching affiliations * output - * [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations. + * [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations. ***Limitations:*** - @@ -55,3 +60,48 @@ Java, Spark ***References:*** - ***Authority:*** ICM • ***License:*** AGPL-3.0 • ***Code:*** [CoAnSys/affiliation-organization-matching](https://github.com/CeON/CoAnSys/tree/master/affiliation-organization-matching) + + +## Algorithmic details of the second method + +*Categorization* + +The affiliations' strings are imported and undergo cleaning, tokenization, and removal of stopwords. Similar to the “buckets concept” of the first method, the goal is to split the affiliation strings, as well as the ROR organizations, into coherent groups. To achieve this, data preprocessing has already been conducted on ROR's data, involving the analysis of word frequency ('keywords') within the legal names of ROR's organizations to define specific categories. These categories include universities and institutes, laboratories, hospitals, companies, museums, governments, foundation, and rest organizations. ROR's organizations have subsequently been assigned to these categories based on their legal names. The algorithm employs a similar approach to categorize affiliations into these same groups. + +*String Shortening* + +The objective is to extract pertinent details from each affiliation string. 
The algorithm divides the string whenever a comma (,) or semicolon (;) is detected. It then applies specific 'rules' to these segments and retains only those containing relevant keywords. Additionally, it trims down the segments by preserving words in proximity to particular keywords like "university," "institute," "laboratory," or "hospital." As a result, the average string length is reduced from 90 to 35 characters. + +*Matching with ROR's Database* + +The algorithm checks whether a substring containing a keyword is linked to a legal name or to an alternative name in the organizations listed in the ROR's database. In order to identify the most accurate match, the algorithm employs cosine similarity.. Although alternative methods like Levenshtein Distance or Jaro-Winkler Distance were considered for measuring string similarity, it was concluded that cosine similarity was the most appropriate choice for this specific application. + +*Refinement* + +If multiple matches are found above the desired similarity thresholds, the algorithm performs another check. It applies cosine similarity between the organizations found in the ROR's database and the original affiliation string. This comparison takes into account additional information present in the original affiliation, such as addresses or city names. The algorithm aims to identify the best fit among the potential matches. Note that the case where two or more different organizations share the same name is also considered. + +***Parameters:*** + +* input + * source of affiliations: JSON Crossref or XML Pubmed or Parquet DataCite files. 
+ + * organizations: [dix_acad.pkl](https://github.com/openaire/affro/blob/main/dictionaries/dix_acad.pkl), [dix_mult](https://github.com/openaire/affro/blob/main/dictionaries/dix_mult.pkl), [dix_city](https://github.com/openaire/affro/blob/main/dictionaries/dix_city.pkl), [dix_country](https://github.com/openaire/affro/blob/main/dictionaries/dix_country.pkl) (four pickled dictionaries with keys legalnames and alternativenames of organizations in the ROR database.) + + * similarity thresholds: simU for universities, simG for other organizations (default values are simU = 0.64, simG = 0.87). + cument-organization pairs which are used as a hint for matching affiliations + +* output + * JSON file with ROR ids of organizations and corresponding similarity scores for each DOI. + + +***Limitations:*** - + +***Environment:*** +Python + +***References:*** - + +***Authority:*** OpenAIRE • ***License:*** AGPL-3.0 • ***Code:*** [AffRo](https://github.com/openaire/affro) + + +
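Note on the second method described in the patches above: the "Matching with ROR's Database" and "Refinement" steps can be illustrated with a small sketch. This is not the AffRo implementation. It assumes that cosine similarity is computed over simple token-count vectors, that a name counts as a university when it contains the token "university", and that candidates arrive as hypothetical `(org_id, name, city)` tuples. Only the default thresholds (simU = 0.64, simG = 0.87) and the comma/semicolon splitting come from the description itself.

```python
import math
import re
from collections import Counter

SIM_U = 0.64  # default threshold for universities (from the description)
SIM_G = 0.87  # default threshold for other organizations (from the description)


def tokens(text):
    """Lowercase a string and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def cosine(a, b):
    """Cosine similarity between token-count vectors (an assumed vectorization)."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match(affiliation, candidates):
    """Return (org_id, score) for the best candidate, or None if nothing passes.

    candidates: list of (org_id, name, city) tuples; a hypothetical layout
    chosen so that two organizations may share the same name.
    """
    hits = []
    # Split the affiliation on commas and semicolons, as in "String Shortening".
    for segment in re.split(r"[,;]", affiliation):
        for org_id, name, city in candidates:
            threshold = SIM_U if "university" in tokens(name) else SIM_G
            score = cosine(segment, name)
            if score >= threshold:
                hits.append((org_id, name, city, score))
    if not hits:
        return None
    # Refinement: compare the full original affiliation with name plus city,
    # so that address information can separate same-named organizations.
    best = max(hits, key=lambda h: cosine(affiliation, h[1] + " " + h[2]))
    return best[0], best[3]
```

With two fictional candidates that share the name "University of Northtown" but sit in different cities, the affiliation "Department of Physics, University of Northtown, Eastville, Freedonia" passes the university threshold for both, and the refinement step then prefers the one located in Eastville.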