Compare commits

..

No commits in common. "main" and "v7.1.3" have entirely different histories.
main ... v7.1.3

22 changed files with 112 additions and 336 deletions

View File

@ -27,7 +27,7 @@ Endpoint: https://api.openaire.eu/search/researchProducts
| funder | WT \| EC \| ARC \| ANDS \| NSF \| FCT \| NHMRC | Search for entities by funder. |
| fundingStream | ... | Search for entities by funding stream. |
| FP7scientificArea | ... | Search for FP7 entities by scientific area. |
| keywords | White-space separated list of keywords. | This parameter is used to support a keyword search functionality in various fields (e.g., for research products the keywords are used to search in the products title, description, authors, etc). Regarding the semantics, when you provide multiple keywords, all keywords should be present, hence the correct interpretation is `kwd1 AND kw2`. |
| keywords | White-space separated list of keywords. | N/A |
| doi | Comma separated list of DOIs. <br/>Alternatively, it is possible to repeat the parameter for each requested doi. | Gets the research products with the given DOIs, if any. |
| orcid | Comma separated list of ORCID iDs of authors. <br/>Alternatively, it is possible to repeat the parameter for each author ORCID iD. | Gets the research products linked to the given ORCID iD of an author, if any. |
| fromDateAccepted | Date formatted as `YYYY-MM-DD` | Gets the research products whose date of acceptance is greater than or equal the given date. |
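A request URL for the endpoint named in the hunk header can be assembled from the parameters in the table. The sketch below only builds the URL and sends nothing; `build_search_url` is a hypothetical helper, not part of any OpenAIRE client, and it assumes that whitespace-separated keywords and comma-separated DOIs are passed as single parameter values.

```python
from urllib.parse import urlencode

# Endpoint taken from the hunk header above.
ENDPOINT = "https://api.openaire.eu/search/researchProducts"


def build_search_url(keywords=None, dois=None, from_date_accepted=None):
    """Hypothetical helper: build a query URL from the documented parameters."""
    params = []
    if keywords:
        # White-space separated list; all keywords are expected to be present.
        params.append(("keywords", " ".join(keywords)))
    if dois:
        # Comma separated list of DOIs.
        params.append(("doi", ",".join(dois)))
    if from_date_accepted:
        # Date formatted as YYYY-MM-DD.
        params.append(("fromDateAccepted", from_date_accepted))
    return ENDPOINT + "?" + urlencode(params)


url = build_search_url(keywords=["open", "science"], from_date_accepted="2020-01-01")
print(url)
```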

Binary file not shown (image). Before: 357 KiB. After: 96 KiB.

View File

@ -203,17 +203,17 @@ Scheme of reference for access right code. Currently, always set to COAR access
## BipIndicator
The different citation-based impact indicators as computed by [BIP!](https://bip.imsi.athenarc.gr/).
The different impact indicators as computed by [BIP!](https://bip.imsi.athenarc.gr/).
### indicator
_Type: String &bull; Cardinality: ONE_
The name of indicator; it can be either one of:
* `influence`: it reflects the overall/total (citation-based) impact of an article in the research community at large, based on the underlying citation network (diachronically).
* `influence_alt`: it is an alternative to the "Influence" indicator, which also reflects the overall/total (citation-based) impact of an article in the research community at large, based on the underlying citation network (diachronically).
* `popularity`: it reflects the "current" (citation-based) impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
* `popularity_alt`: it is an alternative to the "Popularity" indicator, which also reflects the "current" (citation-based) impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
* `influence`: it reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
* `influence_alt`: it is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
* `popularity`: it reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
* `popularity_alt`: it is an alternative to the "Popularity" indicator, which also reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
* `impulse`: it reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
For more details on how these indicators are calculated, please refer [here](/graph-production-workflow/indicators-ingestion/impact-indicators).
@ -662,7 +662,7 @@ Each Indicator object is composed of the following properties:
### bipIndicators
_Type: [BipIndicator](#bipindicator) &bull; Cardinality: MANY_
These indicators, provided by [BIP!](https://bip.imsi.athenarc.gr/), estimate the citation-based impact of a research product.
These impact-based indicators, provided by [BIP!](https://bip.imsi.athenarc.gr/), estimate the impact of a research product.
For details about their calculation, please refer [here](/graph-production-workflow/indicators-ingestion/impact-indicators).

View File

@ -189,7 +189,7 @@ _Type: [Indicator](other#indicator-1) &bull; Cardinality: ONE_
The indicators computed for this research product;
currently, the following types of indicators are supported:
* [Citation-based impact indicators by BIP!](other#bipindicators)
* [Impact indicators by BIP!](other#bipindicators)
* [Usage Statistics indicators](other#usagecounts)
```json

View File

@ -24,7 +24,7 @@ Such a policy defines a list of data sources that are considered authoritative f
There is an exception though: Handle(s) are minted by several repositories; as listing them all would not be a viable option, to avoid losing them as PIDs, Handles bypass the PID authority filtering rule.
In all other cases, PIDs are included in the graph as alternate Identifiers.
In all other cases, PIDs are be included in the graph as alternate Identifiers.
## Delegated authorities
@ -35,10 +35,10 @@ assigns PIDs to their scientific products from a given PID minter.
This "selection" can be performed when the entities in the graph sharing the same identifier are grouped together. The list of the delegated authorities currently includes
| Datasource delegated | Datasource delegating | Pid Type |
|--------------------------------------|----------------------------------|----------|
| [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | doi |
| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id |
| Datasource delegated | Datasource delegating | Pid Type |
|--------------------------------------|----------------------------------|-----------|
| [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | doi |
| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id |
## Identifiers in the Graph
@ -66,15 +66,16 @@ When the record is collected from a source which is not authoritative for any ty
Currently, the following data sources are used as "PID authorities":
| PID Type | Prefix (12 chars) | Authority |
|----------|-----------------------|-----------------------------------------|
| doi | `doi_________` | Crossref, Datacite, Zenodo |
| pmc | `pmc_________` | Europe PubMed Central, PubMed Central |
| pmid | `pmid________` | Europe PubMed Central, PubMed Central |
| arXiv | `arXiv_______` | arXiv.org e-Print Archive |
| ena | `ena_________` | EMBL-EBI |
| pdb | `pdb_________` | EMBL-EBI |
| uniprot | `uniprot_____` | EMBL-EBI |
| PID Type | Prefix (12 chars) | Authority |
|-----------|------------------------|-------------------------------------------|
| doi | `doi_________` | Crossref, Datacite, Zenodo |
| pmc | `pmc_________` | Europe PubMed Central, PubMed Central |
| pmid | `pmid________` | Europe PubMed Central, PubMed Central |
| arXiv | `arXiv_______` | arXiv.org e-Print Archive |
| handle | `handle______` | any repository |
| ena | `ena_________` | EMBL-EBI |
| pdb | `pdb_________` | EMBL-EBI |
| uniprot | `uniprot_____` | EMBL-EBI |
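The 12-character prefixes in the table follow a visible pattern: the PID type is right-padded with underscores. A minimal sketch of that padding, where `pid_prefix` is a hypothetical name used only for illustration:

```python
def pid_prefix(pid_type, width=12):
    """Right-pad a PID type with underscores to the fixed prefix width."""
    if len(pid_type) > width:
        # The table lists no PID type longer than the prefix width.
        raise ValueError("PID type longer than prefix width")
    return pid_type.ljust(width, "_")


for pid_type in ("doi", "pmid", "arXiv", "handle", "uniprot"):
    print(pid_type, pid_prefix(pid_type))
```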
OpenAIRE also perform duplicate identification (see the [dedicated section for details](/graph-production-workflow/deduplication)).
All duplicates are **merged** together in a **representative record** which must be assigned a [dedicated OpenAIRE identifier](/graph-production-workflow/deduplication/research-products#openaire-identifier-of-the-representative-record) (i.e. it cannot have the identifier of one of the aggregated record).
All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated record).

View File

@ -11,7 +11,7 @@ OpenAIRE materializes an open, participatory research graph (the OpenAIRE Graph)
OpenAIRE aggregates metadata records describing objects of the research life-cycle from content providers compliant to the [OpenAIRE guidelines](https://guidelines.openaire.eu/) and from entity registries (i.e. data sources offering authoritative lists of entities, like [OpenDOAR](https://v2.sherpa.ac.uk/opendoar/), [re3data](https://www.re3data.org/), [DOAJ](https://doaj.org/), and various funder databases). After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is used to generate the final OpenAIRE Graph, accessible from the [OpenAIRE EXPLORE portal](https://explore.openaire.eu) and the [APIs](https://graph.openaire.eu/develop/).
The transformation process includes the application of cleaning functions whose goal is to ensure that values are harmonised according to a common format (e.g. dates as YYYY-MM-dd) and, whenever applicable, to a common controlled vocabulary. The controlled vocabularies used for cleansing are accessible at [api.openaire.eu/vocabularies](https://api.openaire.eu/vocabularies/). Each vocabulary features a set of controlled terms, each with one code, one label, and a set of synonyms. If a synonym is found as field value, the value is updated with the corresponding term.
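The synonym-based cleaning described above can be pictured as a dictionary lookup. This is a sketch under assumptions, not the OpenAIRE implementation: the `Term` class and the example vocabulary are hypothetical, matching is assumed to be case-insensitive, and the cleaned value is assumed to be the term's code.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    # One controlled term: one code, one label, and a set of synonyms.
    code: str
    label: str
    synonyms: set = field(default_factory=set)


def build_synonym_index(terms):
    """Map every known spelling (code, label, synonym) to its term."""
    index = {}
    for term in terms:
        for spelling in term.synonyms | {term.code, term.label}:
            index[spelling.strip().lower()] = term
    return index


def clean_value(value, index):
    """Replace a value with the code of its term; leave unknown values untouched."""
    term = index.get(value.strip().lower())
    return term.code if term else value


# Hypothetical vocabulary, for illustration only.
vocabulary = [Term(code="eng", label="English", synonyms={"en", "english language"})]
index = build_synonym_index(vocabulary)
print(clean_value("EN", index))
print(clean_value("unknown", index))
```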
In addition, the OpenAIRE Graph is extended with other relevant scholarly communication sources that need special handling, either because they do not strictly follow the OpenAIRE Guidelines or due to the vast amount of data of data they offer; these include Crossref, ORCID, Microsoft Academic Graph, Unpaywall).
In addition, the OpenAIRE Graph is extended with other relevant scholarly communication sources that need special handling, either because they do not strictly follow the OpenAIRE Guidelines or due to the vast amount of data of data they offer (e.g. DOIBoost, that merges Crossref, ORCID, Microsoft Academic Graph, and Unpaywall).
<p align="center">
<img loading="lazy" alt="Aggregation" src={require('../../assets/img/aggregation.png').default} width="65%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>

View File

@ -2,9 +2,9 @@
The OpenAIRE Graph is populated by aggregating metadata records from distinct data sources whose content typically overlaps. For example, the collection of article metadata records from publisher' archives (e.g. Frontiers, Elsevier, Copernicus) and from pre-print platforms (e.g. ArXiv.org, UKPubMed, BioarXiv.org). In order to support monitoring of science, the OpenAIRE Graph implements record deduplication and merge strategies, in such a way the scientific production can be consistently statistically represented. Such strategies reflect the following intuition behind OpenAIRE monitoring: "Two metadata records are equivalent when they describe the same research product, hence they feature compatible resource types, have the same title, the same authors, or, alternatively, the same PID". Finally, groups of duplicates can be whitelisted or blacklisted, in order to manually refine the quality of this strategy.
It should be noticed that publication dates do not make a difference, as different versions of the same product can be published at different times; e.g. the pre-print and a published version of a scientific article, which should be counted as one object; abstracts, subjects, and other possible related fields, are not used to strengthen similarity, due to their heterogeneity or absence across different data sources. Moreover, even when two products are indicated as one a new version of the other, the presence of different authors will not bring them into the same group, to avoid unfair distribution of scientific reward.
It should be noticed that publication dates do not make a difference, as different versions of the same product can be published at different times; e.g. the pre-print and a published version of a scientific article, which should be counted as one object; abstracts, subjects, and other possible related fields, are not used to strenghten similarity, due to their heterogeneity or absence across different data sources. Moreover, even when two products are indicated as one a new version of the other, the presence of different authors will not bring them into the same group, to avoid unfair distribution of scientific reward.
Groups of duplicates are finally merged into a new "representative record", having its own id, embedding properties of the merged records and carrying provenance information about the data sources and the relative "instances", i.e. manifestations of the products, together with their resource type, access rights, and publishing date.
Groups of duplicates are finally merged into a new "dedup" record that embeds all properties of the merged records and carries provenance information about the data sources and the relative "instances", i.e. manifestations of the products, together with their resource type, access rights, and publishing date.
## Methodology overview
@ -37,7 +37,7 @@ To further limit the number of comparisons, a sliding window mechanism is used:
### Duplicates grouping (transitive closure)
Once the similarity relations between pairs of records are drawn, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new **representative record** is obtained, which inherits properties from the merged records and keeps track of their provenance.
Once the similarity relations between pairs of records are drawn, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new representative object is obtained, which inherits all properties from the merged records and keeps track of their provenance.
### Relation redistribution

View File

@ -149,85 +149,22 @@ The comparison goes through different stages:
### Duplicates grouping
The aim of the final stage is the creation of records that group all the
equivalent entities discovered pairwise by the previous step. This is done in
multiple phases.
The aim of the final stage is the creation of objects that group all the equivalent
entities discovered by the previous step. This is done in two phases.
#### Transitive closure
As a final step of duplicate identification a transitive closure
is run against similarity relations to find groups of duplicates not directly
caught by the previous steps. If a group is larger than 200 elements only the
first 200 elements will be included in the group, while the remaining will be
kept ungrouped.
As the concluding step of duplicate identification, a transitive closure is
performed against similarity relations to identify complete groups of duplicated
records (cliques). If a group exceeds 200 elements, only the first 200 elements
are included in the group, while the remaining elements are kept ungrouped.
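The grouping step can be sketched with a union-find structure over the pairwise similarity relations. This is an illustration under assumptions, not the production Spark job: the text does not say how the "first 200 elements" are ordered, so the sketch assumes lexicographical order of record ids, and `group_duplicates` is a hypothetical name.

```python
from collections import defaultdict

MAX_GROUP_SIZE = 200  # group size limit stated in the text


def group_duplicates(pairs, max_size=MAX_GROUP_SIZE):
    """Transitive closure of similarity pairs, with a cap on group size."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    components = defaultdict(list)
    for record in sorted(parent):  # assumed ordering: lexicographical by id
        components[find(record)].append(record)

    groups, ungrouped = [], []
    for members in components.values():
        groups.append(members[:max_size])
        ungrouped.extend(members[max_size:])  # overflow stays ungrouped
    return groups, ungrouped


groups, ungrouped = group_duplicates([("a", "b"), ("b", "c"), ("d", "e")])
print(groups, ungrouped)
```

Records `a` and `c` are never compared directly, yet they end up in the same group through `b`, which is the point of the closure.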
#### Creation of representative record (dedup record)
#### Selection of the pivot record
The general concept is that the field coming from the record with higher "trust"
value is used as reference for the field of the representative record.
Each group of duplicate records needs to be identified in the final graph with
an OpenAIRE identifier, derived from a record of the group known as the _pivot
record_. It is determined after sorting the group of duplicate records by the
following criteria:
1. Records previously chosen as pivot records in the graph's previous
generations.
2. Records with identifiers from a [PID authority](/data-model/pids-and-identifiers#pid-authorities).
3. Publications from CrossRef or datasets from DataCite.
4. Records with an earlier date of acceptance.
5. Records with smaller IDs in lexicographical order.
The first sorting criterion is possible because a state table, called "pivot
history", is maintained across graph generations. It keeps track of which
records were used as pivot records in what graph, guaranteed to retain data for
the last 12 months.
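The five sorting criteria can be expressed as one composite sort key. The sketch below is illustrative only: the `Record` class and its fields are hypothetical, the third criterion is collapsed into a single boolean, and the lookup in the pivot history table is assumed to have been done already.

```python
from dataclasses import dataclass


@dataclass
class Record:
    id: str
    was_pivot: bool                  # found in the pivot history table
    has_authority_pid: bool          # identifier from a PID authority
    from_crossref_or_datacite: bool  # CrossRef publication or DataCite dataset
    date_of_acceptance: str          # YYYY-MM-DD, so string order is date order


def pivot_sort_key(record):
    # False sorts before True, so each preferred property is negated.
    return (
        not record.was_pivot,
        not record.has_authority_pid,
        not record.from_crossref_or_datacite,
        record.date_of_acceptance,
        record.id,
    )


def select_pivot(group):
    """Return the record that sorts first under the five criteria."""
    return min(group, key=pivot_sort_key)


group = [
    Record("b", False, True, True, "2020-01-01"),
    Record("a", False, True, True, "2020-01-01"),
    Record("z", True, False, False, "2022-05-05"),
]
print(select_pivot(group).id)
```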
#### Creation of representative records
The representative record, also known as the "dedup record", replaces the group
of deduplicated records in the graph.
##### OpenAIRE identifier of the representative record
The OpenAIRE identifier of the representative record is generated based on the
identifier of the record chosen as the pivot of the group:
- if the pivot record comes from a "PID authority", the identifier of the
representative record is the same, but the "PID Type Prefix" part of the
identifier is modified to append ``_dedup``.<br/>
For example ```doi_________::d5021b53204e4fdeab6ff5d5bc468032``` will
become ```doi_dedup___::d5021b53204e4fdeab6ff5d5bc468032```
- otherwise the "PID Type Prefix" part will be set to the fixed value
``dedup_wf_002``, and the following hash will be calculated as the MD5 hash of
the entire raw id of the pivot record.<br/>
For example ``DansKnawCris::0829b5191605bdbea36d6502b8c1ce1g`` will
become ``dedup_wf_002::345e5d1b80537b0d0e0a49241ae9e516``
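The two identifier rules above can be sketched as follows. This is not the OpenAIRE code: `representative_id` is a hypothetical name, the list of PID authority types is copied from the PID authorities table, and the "entire raw id" is assumed to be the full identifier string, so the hash in the example above is not claimed to be reproduced. The text does not describe PID types whose `_dedup` form exceeds 12 characters; the sketch leaves those unpadded.

```python
import hashlib

# PID types listed in the "PID authorities" table.
PID_AUTHORITY_TYPES = ("doi", "pmc", "pmid", "arXiv", "handle", "ena", "pdb", "uniprot")


def representative_id(pivot_id):
    """Derive the representative record id from the pivot record id."""
    prefix, _, suffix = pivot_id.partition("::")
    pid_type = prefix.rstrip("_")
    if pid_type in PID_AUTHORITY_TYPES:
        # Same hash part, prefix becomes "<type>_dedup" padded to 12 characters.
        return (pid_type + "_dedup").ljust(12, "_") + "::" + suffix
    # Otherwise: fixed prefix plus the MD5 of the pivot id (assumed: the full string).
    digest = hashlib.md5(pivot_id.encode("utf-8")).hexdigest()
    return "dedup_wf_002::" + digest


print(representative_id("doi_________::d5021b53204e4fdeab6ff5d5bc468032"))
print(representative_id("DansKnawCris::0829b5191605bdbea36d6502b8c1ce1g"))
```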
##### Content of the representative record
The representative records inherits properties from the records it merges
and tracks their provenance. Whenever possible, it preserves all data from the
merged records, such as the ``instance`` field. In cases where a specific value
must be chosen, the most representative one is selected. For example, for the
"dateofacceptance" field, the earliest value is chosen.
##### Merged and singleton representative record
Changes in metadata content or graph construction may lead to cases where
representative records disappear from the graph:
1. When two or more representative records are merged into one representative
record. Put it other terms this happens when a group of duplicated records
contains multiple records formerly used as pivot record.
2. When a record chosen as a pivot record leaves its group and remains alone.
3. When a record chosen as a pivot record is no longer published by its data
source (deletion of the metadata record).
To address these cases, the pivot history table ensures the visibility of
disappearing representative records for the first two cases. Specifically:
1. In the case of merged representative records, the new representative record
and the ones that would be lost are generated and linked as part of the new
representative record.
2. In the case of a record no longer serving as a pivot, a representative record
is generated and linked only with that record.
This approach ensures that users can access representative records that would
otherwise be lost.
The IDs of the representative records are obtained by prepending the
prefix ``dedup_`` to the MD5 of the first ID (given their lexicographical
ordering). If the group of merged records contains a trusted ID type (i.e. the
DOI), also the type keyword (i.e. ``DOI``) is added to the prefix.

View File

@ -4,14 +4,9 @@ sidebar_position: 1
# Affiliation matching
***Short description:*** The goal of the affiliation matching module is to match affiliation strings (identified in full-text PDFs or in scholarly databases, such as Crossref) with persistent organization identifiers (e.g., ROR identifiers).
Depending on the data source, we currently employ two distinct methodologies:
***Short description:*** The goal of the affiliation matching module is to match affiliations extracted from the pdf and xml documents with organizations from the OpenAIRE organization database.
- The [first](#algorithmic-details-of-the-first-method) method revolves around affiliations extracted from PDF and XML documents, which are subsequently matched with organizations within the OpenAIRE database.
- The [second](#algorithmic-details-of-the-second-method) concerns affiliations retrieved from platforms such as Crossref, PubMed, and Datacite, and are matched to organizations of the ROR database.
## Algorithmic details of the first method
***Algorithmic details:***
*The buckets concept*
@ -44,13 +39,13 @@ The total match strength is calculated in such a way that each consecutive voter
***Parameters:***
* input
* input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. Document metadata is the source of affiliations.
* input_organizations: [Organization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/Organization.avdl) avro datastore location.
* input_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/DocumentToProject.avdl) avro datastore location with **imported** document-to-project relations. These relations (alongside with inferred document-project and project-organization relations) are used to generate document-organization pairs which are used as a hint for matching affiliations.
* input_inferred_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/referenceextraction/project/DocumentToProject.avdl) avro datastore location with **inferred** document-to-project relations.
* input_project_to_organization: [ProjectToOrganization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/ProjectToOrganization.avdl) avro datastore location. These relations (alongside with infered document-project and document-project relations) are used to generate document-organization pairs which are used as a hint for matching affiliations
* input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. Document metadata is the source of affiliations.
* input_organizations: [Organization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/Organization.avdl) avro datastore location.
* input_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/DocumentToProject.avdl) avro datastore location with **imported** document-to-project relations. These relations (alongside with inferred document-project and project-organization relations) are used to generate document-organization pairs which are used as a hint for matching affiliations.
* input_inferred_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/referenceextraction/project/DocumentToProject.avdl) avro datastore location with **inferred** document-to-project relations.
* input_project_to_organization: [ProjectToOrganization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/ProjectToOrganization.avdl) avro datastore location. These relations (alongside with infered document-project and document-project relations) are used to generate document-organization pairs which are used as a hint for matching affiliations
* output
* [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations.
* [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations.
***Limitations:*** -
@ -60,48 +55,3 @@ Java, Spark
***References:*** -
***Authority:*** ICM &bull; ***License:*** AGPL-3.0 &bull; ***Code:*** [CoAnSys/affiliation-organization-matching](https://github.com/CeON/CoAnSys/tree/master/affiliation-organization-matching)
## Algorithmic details of the second method
*Categorization*
The affiliations' strings are imported and undergo cleaning, tokenization, and removal of stopwords. Similar to the “buckets concept” of the first method, the goal is to split the affiliation strings, as well as the ROR organizations, into coherent groups. To achieve this, data preprocessing has already been conducted on ROR's data, involving the analysis of word frequency ('keywords') within the legal names of ROR's organizations to define specific categories. These categories include universities and institutes, laboratories, hospitals, companies, museums, governments, foundation, and rest organizations. ROR's organizations have subsequently been assigned to these categories based on their legal names. The algorithm employs a similar approach to categorize affiliations into these same groups.
*String Shortening*
The objective is to extract pertinent details from each affiliation string. The algorithm divides the string whenever a comma (,) or semicolon (;) is detected. It then applies specific 'rules' to these segments and retains only those containing relevant keywords. Additionally, it trims down the segments by preserving words in proximity to particular keywords like "university," "institute," "laboratory," or "hospital." As a result, the average string length is reduced from 90 to 35 characters.
*Matching with ROR's Database*
The algorithm checks whether a substring containing a keyword is linked to a legal name or to an alternative name in the organizations listed in the ROR's database. In order to identify the most accurate match, the algorithm employs cosine similarity.. Although alternative methods like Levenshtein Distance or Jaro-Winkler Distance were considered for measuring string similarity, it was concluded that cosine similarity was the most appropriate choice for this specific application.
*Refinement*
If multiple matches are found above the desired similarity thresholds, the algorithm performs another check. It applies cosine similarity between the organizations found in the ROR's database and the original affiliation string. This comparison takes into account additional information present in the original affiliation, such as addresses or city names. The algorithm aims to identify the best fit among the potential matches. Note that the case where two or more different organizations share the same name is also considered.
***Parameters:***
* input
* source of affiliations: JSON Crossref or XML Pubmed or Parquet DataCite files.
* organizations: [dix_acad.pkl](https://github.com/openaire/affro/blob/main/dictionaries/dix_acad.pkl), [dix_mult](https://github.com/openaire/affro/blob/main/dictionaries/dix_mult.pkl), [dix_city](https://github.com/openaire/affro/blob/main/dictionaries/dix_city.pkl), [dix_country](https://github.com/openaire/affro/blob/main/dictionaries/dix_country.pkl) (four pickled dictionaries with keys legalnames and alternativenames of organizations in the ROR database.)
* similarity thresholds: simU for universities, simG for other organizations (default values are simU = 0.64, simG = 0.87).
cument-organization pairs which are used as a hint for matching affiliations
* output
* JSON file with ROR ids of organizations and corresponding similarity scores for each DOI.
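The shortening and matching steps of the second method can be approximated in a few lines. This is a rough sketch, not the AffRo code: the keyword list, the bag-of-words vectors, and the function names are assumptions, and only the default thresholds (simU = 0.64, simG = 0.87) come from the parameter list above.

```python
import math
import re
from collections import Counter

SIM_U = 0.64  # default threshold for universities
SIM_G = 0.87  # default threshold for other organizations
KEYWORDS = ("university", "institute", "laboratory", "hospital")


def segments_with_keywords(affiliation):
    """Split on commas and semicolons, keep segments that contain a keyword."""
    parts = [p.strip().lower() for p in re.split(r"[,;]", affiliation)]
    return [p for p in parts if any(k in p for k in KEYWORDS)]


def cosine_similarity(a, b):
    """Cosine similarity over word counts (assumed vectorization)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values()) * sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def best_match(affiliation, legal_names):
    """Return (name, score) of the best candidate above its threshold, or None."""
    best = None
    for segment in segments_with_keywords(affiliation):
        threshold = SIM_U if "university" in segment else SIM_G
        for name in legal_names:
            score = cosine_similarity(segment, name.lower())
            if score >= threshold and (best is None or score > best[1]):
                best = (name, score)
    return best


names = ["University of Athens", "University of Crete"]
print(best_match("Dept. of Physics, University of Athens; Athens, Greece", names))
```

The refinement step, which compares candidates against the original, unshortened affiliation string, is left out of the sketch.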
***Limitations:*** -
***Environment:***
Python
***References:*** -
***Authority:*** OpenAIRE &bull; ***License:*** AGPL-3.0 &bull; ***Code:*** [AffRo](https://github.com/openaire/affro)

View File

@ -1,16 +1,16 @@
# Citation-based impact indicators
# Impact indicators
This page summarises all calculated citation-based impact indicators, provided by [BIP!](https://bip.imsi.athenarc.gr/), which are included in the [bipIndicators](../../data-model/entities/other#bipindicators) property (found under the [indicators](../../data-model/entities/research-product#indicators) property of the reseach product).
This page summarises all calculated impact indicators, provided by [BIP!](https://bip.imsi.athenarc.gr/), which are included in the [bipIndicators](../../data-model/entities/other#bipindicators) property (found under the [indicators](../../data-model/entities/research-product#indicators) property of the reseach product).
It should be noted that the citation-based impact indicators are being calculated on the level of the research output.
It should be noted that the impact indicators are being calculated on the level of the research output.
Below we explain their main intuition, the way they are calculated, and their most important limitations, in an attempt help avoiding common pitfalls and misuses.
## Citation Count (CC) <small><span className="bip-indicator-names">&bull; influence_alt</span></small>
***Short description:***
This is the most widely used citation-based impact indicator, which sums all citations received by each article.
Citation count can be viewed as a measure of a publication's overall (citation-based) impact, since it conveys the number of other works that directly
This is the most widely used scientific impact indicator, which sums all citations received by each article.
Citation count can be viewed as a measure of a publication's overall impact, since it conveys the number of other works that directly
drew on it.
***Algorithmic details:***

View File

@ -16,7 +16,7 @@ a global grouping of every record available in the graph:
This ensures that the same record, possibly assigned to different types by different
mappings, appears only once in the graph and under a single typing. In case of clashing
identifiers, the properties are merged (including the provenance information), considering
identifiers, the properties are merged (including the provencance information), considering
the following precedence order for the research product typing:
```


View File

@ -189,7 +189,7 @@ _Type: [Indicator](other#indicator-1) &bull; Cardinality: ONE_
The indicators computed for this research product;
currently, the following types of indicators are supported:
* [Citation-based impact indicators by BIP!](other#bipindicators)
* [Impact indicators by BIP!](other#bipindicators)
* [Usage Statistics indicators](other#usagecounts)
```json

View File

@ -24,7 +24,7 @@ Such a policy defines a list of data sources that are considered authoritative f
There is an exception though: Handle(s) are minted by several repositories; as listing them all would not be a viable option, to avoid losing them as PIDs, Handles bypass the PID authority filtering rule.
In all other cases, PIDs are included in the graph as alternate Identifiers.
In all other cases, PIDs are be included in the graph as alternate Identifiers.
## Delegated authorities
@ -35,10 +35,10 @@ assigns PIDs to their scientific products from a given PID minter.
This "selection" can be performed when the entities in the graph sharing the same identifier are grouped together. The list of the delegated authorities currently includes
| Datasource delegated | Datasource delegating | Pid Type |
|--------------------------------------|----------------------------------|----------|
| [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | doi |
| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id |
| Datasource delegated | Datasource delegating | Pid Type |
|--------------------------------------|----------------------------------|-----------|
| [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | doi |
| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id |
## Identifiers in the Graph
@ -66,15 +66,16 @@ When the record is collected from a source which is not authoritative for any ty
Currently, the following data sources are used as "PID authorities":
| PID Type | Prefix (12 chars) | Authority |
|----------|-----------------------|-----------------------------------------|
| doi | `doi_________` | Crossref, Datacite, Zenodo |
| pmc | `pmc_________` | Europe PubMed Central, PubMed Central |
| pmid | `pmid________` | Europe PubMed Central, PubMed Central |
| arXiv | `arXiv_______` | arXiv.org e-Print Archive |
| ena | `ena_________` | EMBL-EBI |
| pdb | `pdb_________` | EMBL-EBI |
| uniprot | `uniprot_____` | EMBL-EBI |
| PID Type | Prefix (12 chars) | Authority |
|-----------|------------------------|-------------------------------------------|
| doi | `doi_________` | Crossref, Datacite, Zenodo |
| pmc | `pmc_________` | Europe PubMed Central, PubMed Central |
| pmid | `pmid________` | Europe PubMed Central, PubMed Central |
| arXiv | `arXiv_______` | arXiv.org e-Print Archive |
| handle | `handle______` | any repository |
| ena | `ena_________` | EMBL-EBI |
| pdb | `pdb_________` | EMBL-EBI |
| uniprot | `uniprot_____` | EMBL-EBI |
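
The filtering rule described in this file can be sketched as follows. This is a simplified reading of the text, not the actual implementation: the function name and the return labels are hypothetical, and the authority names are copied from the table above.

```python
# PID type -> data sources considered authoritative (from the table above).
PID_AUTHORITIES = {
    "doi": {"Crossref", "Datacite", "Zenodo"},
    "pmc": {"Europe PubMed Central", "PubMed Central"},
    "pmid": {"Europe PubMed Central", "PubMed Central"},
    "arXiv": {"arXiv.org e-Print Archive"},
    "ena": {"EMBL-EBI"},
    "pdb": {"EMBL-EBI"},
    "uniprot": {"EMBL-EBI"},
}


def classify_identifier(pid_type, collected_from):
    """Return 'pid' or 'alternateIdentifier' (hypothetical labels)."""
    # Handles are minted by many repositories, so they bypass the rule.
    if pid_type == "handle":
        return "pid"
    if collected_from in PID_AUTHORITIES.get(pid_type, set()):
        return "pid"
    # In all other cases the value is kept as an alternate identifier.
    return "alternateIdentifier"


print(classify_identifier("doi", "Crossref"))
print(classify_identifier("doi", "Some Repository"))
```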
OpenAIRE also performs duplicate identification (see the [dedicated section for details](/graph-production-workflow/deduplication)).
All duplicates are **merged** together in a **representative record** which must be assigned a [dedicated OpenAIRE identifier](/graph-production-workflow/deduplication/research-products#openaire-identifier-of-the-representative-record) (i.e. it cannot have the identifier of one of the aggregated records).
All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated records).

View File

@ -11,7 +11,7 @@ OpenAIRE materializes an open, participatory research graph (the OpenAIRE Graph)
OpenAIRE aggregates metadata records describing objects of the research life-cycle from content providers compliant to the [OpenAIRE guidelines](https://guidelines.openaire.eu/) and from entity registries (i.e. data sources offering authoritative lists of entities, like [OpenDOAR](https://v2.sherpa.ac.uk/opendoar/), [re3data](https://www.re3data.org/), [DOAJ](https://doaj.org/), and various funder databases). After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is used to generate the final OpenAIRE Graph, accessible from the [OpenAIRE EXPLORE portal](https://explore.openaire.eu) and the [APIs](https://graph.openaire.eu/develop/).
The transformation process includes the application of cleaning functions whose goal is to ensure that values are harmonised according to a common format (e.g. dates as YYYY-MM-dd) and, whenever applicable, to a common controlled vocabulary. The controlled vocabularies used for cleansing are accessible at [api.openaire.eu/vocabularies](https://api.openaire.eu/vocabularies/). Each vocabulary features a set of controlled terms, each with one code, one label, and a set of synonyms. If a synonym is found as field value, the value is updated with the corresponding term.
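
The synonym-based cleaning step can be pictured with the minimal sketch below. The `Term` class, the sample vocabulary entry, the case-insensitive matching, and the choice to return the term's code are all assumptions made for illustration; the real vocabularies are the ones published at api.openaire.eu/vocabularies.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    code: str
    label: str
    synonyms: set = field(default_factory=set)


def build_lookup(terms):
    """Index every code, label and synonym (lower-cased) by its term."""
    lookup = {}
    for term in terms:
        for value in term.synonyms | {term.code, term.label}:
            lookup[value.strip().lower()] = term
    return lookup


def clean(value, lookup):
    """Replace a field value with the code of the matching term, if any."""
    term = lookup.get(value.strip().lower())
    return term.code if term else value


# Hypothetical vocabulary entry, for illustration only.
lookup = build_lookup([Term("OPEN", "Open Access", {"openAccess"})])
print(clean("openaccess", lookup))
```

Values that match no term are returned unchanged in this sketch.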
In addition, the OpenAIRE Graph is extended with other relevant scholarly communication sources that need special handling, either because they do not strictly follow the OpenAIRE Guidelines or due to the vast amount of data they offer; these include Crossref, ORCID, Microsoft Academic Graph, and Unpaywall.
In addition, the OpenAIRE Graph is extended with other relevant scholarly communication sources that need special handling, either because they do not strictly follow the OpenAIRE Guidelines or due to the vast amount of data they offer (e.g. DOIBoost, which merges Crossref, ORCID, Microsoft Academic Graph, and Unpaywall).
<p align="center">
<img loading="lazy" alt="Aggregation" src={require('../../assets/img/aggregation.png').default} width="65%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>

View File

@ -2,9 +2,9 @@
The OpenAIRE Graph is populated by aggregating metadata records from distinct data sources whose content typically overlaps. For example, the collection of article metadata records from publishers' archives (e.g. Frontiers, Elsevier, Copernicus) and from pre-print platforms (e.g. ArXiv.org, UKPubMed, BioarXiv.org). In order to support monitoring of science, the OpenAIRE Graph implements record deduplication and merge strategies, in such a way that the scientific production can be consistently statistically represented. Such strategies reflect the following intuition behind OpenAIRE monitoring: "Two metadata records are equivalent when they describe the same research product, hence they feature compatible resource types, have the same title, the same authors, or, alternatively, the same PID". Finally, groups of duplicates can be whitelisted or blacklisted, in order to manually refine the quality of this strategy.
It should be noticed that publication dates do not make a difference, as different versions of the same product can be published at different times; e.g. the pre-print and a published version of a scientific article, which should be counted as one object; abstracts, subjects, and other possible related fields, are not used to strengthen similarity, due to their heterogeneity or absence across different data sources. Moreover, even when two products are indicated as one a new version of the other, the presence of different authors will not bring them into the same group, to avoid unfair distribution of scientific reward.
It should be noticed that publication dates do not make a difference, as different versions of the same product can be published at different times; e.g. the pre-print and a published version of a scientific article, which should be counted as one object; abstracts, subjects, and other possible related fields, are not used to strenghten similarity, due to their heterogeneity or absence across different data sources. Moreover, even when two products are indicated as one a new version of the other, the presence of different authors will not bring them into the same group, to avoid unfair distribution of scientific reward.
Groups of duplicates are finally merged into a new "representative record", having its own id, embedding properties of the merged records and carrying provenance information about the data sources and the relative "instances", i.e. manifestations of the products, together with their resource type, access rights, and publishing date.
Groups of duplicates are finally merged into a new "dedup" record that embeds all properties of the merged records and carries provenance information about the data sources and the relative "instances", i.e. manifestations of the products, together with their resource type, access rights, and publishing date.
## Methodology overview
@ -37,7 +37,7 @@ To further limit the number of comparisons, a sliding window mechanism is used:
### Duplicates grouping (transitive closure)
Once the similarity relations between pairs of records are drawn, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new **representative record** is obtained, which inherits properties from the merged records and keeps track of their provenance.
Once the similarity relations between pairs of records are drawn, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new representative object is obtained, which inherits all properties from the merged records and keeps track of their provenance.
### Relation redistribution

View File

@ -149,85 +149,22 @@ The comparison goes through different stages:
### Duplicates grouping
The aim of the final stage is the creation of records that group all the
equivalent entities discovered pairwise by the previous step. This is done in
multiple phases.
The aim of the final stage is the creation of objects that group all the equivalent
entities discovered by the previous step. This is done in two phases.
#### Transitive closure
As a final step of duplicate identification a transitive closure
is run against similarity relations to find groups of duplicates not directly
caught by the previous steps. If a group is larger than 200 elements only the
first 200 elements will be included in the group, while the remaining will be
kept ungrouped.
As the concluding step of duplicate identification, a transitive closure is
performed against similarity relations to identify complete groups of duplicated
records (cliques). If a group exceeds 200 elements, only the first 200 elements
are included in the group, while the remaining elements are kept ungrouped.
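
A minimal sketch of the transitive closure with the 200-element cap is given below, using a union-find structure over the pairwise similarity relations. The text does not say how the "first 200 elements" are ordered, so sorting by record identifier is an assumption here, as are the function and variable names.

```python
from collections import defaultdict

MAX_GROUP_SIZE = 200


def group_duplicates(similarity_pairs, max_size=MAX_GROUP_SIZE):
    """Return (groups, ungrouped) from pairwise similarity relations."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union the two records of every similarity relation.
    for a, b in similarity_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    components = defaultdict(list)
    for record in parent:
        components[find(record)].append(record)

    groups, ungrouped = [], []
    for members in components.values():
        members.sort()  # assumption: "first" means lowest identifier
        groups.append(members[:max_size])
        ungrouped.extend(members[max_size:])
    return groups, ungrouped


groups, ungrouped = group_duplicates([("a", "b"), ("b", "c"), ("x", "y")])
print(sorted(groups), ungrouped)
```

Here `a` and `c` end up in the same group although they were never compared directly, which is the effect the closure is meant to achieve.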
#### Creation of representative record (dedup record)
#### Selection of the pivot record
The general concept is that the field coming from the record with higher "trust"
value is used as reference for the field of the representative record.
Each group of duplicate records needs to be identified in the final graph with
an OpenAIRE identifier, derived from a record of the group known as the _pivot
record_. It is determined after sorting the group of duplicate records by the
following criteria:
1. Records previously chosen as pivot records in the graph's previous
generations.
2. Records with identifiers from a [PID authority](/data-model/pids-and-identifiers#pid-authorities).
3. Publications from CrossRef or datasets from DataCite.
4. Records with an earlier date of acceptance.
5. Records with smaller IDs in lexicographical order.
The first sorting criterion is possible because a state table, called "pivot
history", is maintained across graph generations. It keeps track of which
records were used as pivot records in what graph, guaranteed to retain data for
the last 12 months.
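
The five sorting criteria can be expressed as a composite sort key, as in the sketch below. The record layout (`id`, `type`, `source`, `has_authority_pid`, `date_accepted`) and the handling of missing dates are hypothetical; only the order of the criteria comes from the list above.

```python
def pivot_sort_key(record, pivot_history):
    """Smaller keys sort first; False sorts before True."""
    from_registry = (
        (record["type"] == "publication" and record["source"] == "Crossref")
        or (record["type"] == "dataset" and record["source"] == "Datacite")
    )
    return (
        record["id"] not in pivot_history,            # 1. former pivot records
        not record["has_authority_pid"],              # 2. PID authority identifiers
        not from_registry,                            # 3. Crossref / DataCite
        record.get("date_accepted") or "9999-12-31",  # 4. earlier date (ISO strings)
        record["id"],                                 # 5. lexicographical id
    )


def select_pivot(group, pivot_history=frozenset()):
    """Pick the pivot record of a group of duplicates."""
    return min(group, key=lambda r: pivot_sort_key(r, pivot_history))


group = [
    {"id": "b", "type": "publication", "source": "Repo",
     "has_authority_pid": False, "date_accepted": "2019-01-01"},
    {"id": "a", "type": "publication", "source": "Crossref",
     "has_authority_pid": True, "date_accepted": "2020-01-01"},
]
print(select_pivot(group)["id"])
print(select_pivot(group, pivot_history={"b"})["id"])
```

The second call shows how the pivot history takes precedence over all other criteria.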
#### Creation of representative records
The representative record, also known as the "dedup record", replaces the group
of deduplicated records in the graph.
##### OpenAIRE identifier of the representative record
The OpenAIRE identifier of the representative record is generated based on the
identifier of the record chosen as the pivot of the group:
- if the pivot record comes from a "PID authority", the identifier of the
representative record is the same, but the "PID Type Prefix" part of the
identifier is modified to append ``_dedup``.<br/>
For example ```doi_________::d5021b53204e4fdeab6ff5d5bc468032``` will
become ```doi_dedup___::d5021b53204e4fdeab6ff5d5bc468032```
- otherwise the "PID Type Prefix" part will be set to the fixed value
``dedup_wf_002``, and the following hash will be calculated as the MD5 hash of
the entire raw id of the pivot record.<br/>
For example ``DansKnawCris::0829b5191605bdbea36d6502b8c1ce1g`` will
become ``dedup_wf_002::345e5d1b80537b0d0e0a49241ae9e516``
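
The two rules can be sketched as follows. This is an illustration under assumptions: the set of authority prefixes is taken from the PID authorities table, "entire raw id" is read as the full identifier string including its prefix, and the text does not say what happens when a PID type plus `_dedup` exceeds 12 characters, so that case is left unhandled.

```python
import hashlib

# PID types with an authority (assumption: taken from the PID authorities table).
AUTHORITY_PID_TYPES = {"doi", "pmc", "pmid", "arXiv", "ena", "pdb", "uniprot"}
PREFIX_LENGTH = 12


def representative_id(pivot_id):
    """Derive the identifier of the representative record from its pivot."""
    prefix, _, suffix = pivot_id.partition("::")
    pid_type = prefix.rstrip("_")
    if pid_type in AUTHORITY_PID_TYPES:
        # Append "_dedup" and pad with underscores to the fixed prefix length.
        new_prefix = (pid_type + "_dedup").ljust(PREFIX_LENGTH, "_")
        return f"{new_prefix}::{suffix}"
    # Otherwise: fixed prefix plus the MD5 of the whole pivot identifier.
    digest = hashlib.md5(pivot_id.encode("utf-8")).hexdigest()
    return f"dedup_wf_002::{digest}"


print(representative_id("doi_________::d5021b53204e4fdeab6ff5d5bc468032"))
print(representative_id("DansKnawCris::0829b5191605bdbea36d6502b8c1ce1g"))
```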
##### Content of the representative record
The representative record inherits properties from the records it merges
and tracks their provenance. Whenever possible, it preserves all data from the
merged records, such as the ``instance`` field. In cases where a specific value
must be chosen, the most representative one is selected. For example, for the
"dateofacceptance" field, the earliest value is chosen.
##### Merged and singleton representative record
Changes in metadata content or graph construction may lead to cases where
representative records disappear from the graph:
1. When two or more representative records are merged into one representative
record. In other words, this happens when a group of duplicated records
contains multiple records formerly used as pivot record.
2. When a record chosen as a pivot record leaves its group and remains alone.
3. When a record chosen as a pivot record is no longer published by its data
source (deletion of the metadata record).
To address these cases, the pivot history table ensures the visibility of
disappearing representative records for the first two cases. Specifically:
1. In the case of merged representative records, the new representative record
and the ones that would be lost are generated and linked as part of the new
representative record.
2. In the case of a record no longer serving as a pivot, a representative record
is generated and linked only with that record.
This approach ensures that users can access representative records that would
otherwise be lost.
The IDs of the representative records are obtained by prepending the
prefix ``dedup_`` to the MD5 of the first ID (given their lexicographical
ordering). If the group of merged records contains a trusted ID type (i.e. the
DOI), also the type keyword (i.e. ``DOI``) is added to the prefix.

View File

@ -4,14 +4,9 @@ sidebar_position: 1
# Affiliation matching
***Short description:*** The goal of the affiliation matching module is to match affiliation strings (identified in full-text PDFs or in scholarly databases, such as Crossref) with persistent organization identifiers (e.g., ROR identifiers).
Depending on the data source, we currently employ two distinct methodologies:
***Short description:*** The goal of the affiliation matching module is to match affiliations extracted from the pdf and xml documents with organizations from the OpenAIRE organization database.
- The [first](#algorithmic-details-of-the-first-method) method revolves around affiliations extracted from PDF and XML documents, which are subsequently matched with organizations within the OpenAIRE database.
- The [second](#algorithmic-details-of-the-second-method) concerns affiliations retrieved from platforms such as Crossref, PubMed, and Datacite, and are matched to organizations of the ROR database.
## Algorithmic details of the first method
***Algorithmic details:***
*The buckets concept*
@ -44,13 +39,13 @@ The total match strength is calculated in such a way that each consecutive voter
***Parameters:***
* input
* input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. Document metadata is the source of affiliations.
* input_organizations: [Organization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/Organization.avdl) avro datastore location.
* input_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/DocumentToProject.avdl) avro datastore location with **imported** document-to-project relations. These relations (alongside with inferred document-project and project-organization relations) are used to generate document-organization pairs which are used as a hint for matching affiliations.
* input_inferred_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/referenceextraction/project/DocumentToProject.avdl) avro datastore location with **inferred** document-to-project relations.
* input_project_to_organization: [ProjectToOrganization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/ProjectToOrganization.avdl) avro datastore location. These relations (alongside with inferred document-project and document-project relations) are used to generate document-organization pairs which are used as a hint for matching affiliations.
* input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. Document metadata is the source of affiliations.
* input_organizations: [Organization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/Organization.avdl) avro datastore location.
* input_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/DocumentToProject.avdl) avro datastore location with **imported** document-to-project relations. These relations (alongside with inferred document-project and project-organization relations) are used to generate document-organization pairs which are used as a hint for matching affiliations.
* input_inferred_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/referenceextraction/project/DocumentToProject.avdl) avro datastore location with **inferred** document-to-project relations.
* input_project_to_organization: [ProjectToOrganization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/ProjectToOrganization.avdl) avro datastore location. These relations (alongside with inferred document-project and document-project relations) are used to generate document-organization pairs which are used as a hint for matching affiliations.
* output
* [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations.
* [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations.
***Limitations:*** -
@ -60,48 +55,3 @@ Java, Spark
***References:*** -
***Authority:*** ICM &bull; ***License:*** AGPL-3.0 &bull; ***Code:*** [CoAnSys/affiliation-organization-matching](https://github.com/CeON/CoAnSys/tree/master/affiliation-organization-matching)
## Algorithmic details of the second method
*Categorization*
The affiliation strings are imported and undergo cleaning, tokenization, and removal of stopwords. Similar to the “buckets concept” of the first method, the goal is to split the affiliation strings, as well as the ROR organizations, into coherent groups. To achieve this, data preprocessing has already been conducted on ROR's data, involving the analysis of word frequency ('keywords') within the legal names of ROR's organizations to define specific categories. These categories include universities and institutes, laboratories, hospitals, companies, museums, governments, foundations, and other organizations. ROR's organizations have subsequently been assigned to these categories based on their legal names. The algorithm employs a similar approach to categorize affiliations into these same groups.
*String Shortening*
The objective is to extract pertinent details from each affiliation string. The algorithm divides the string whenever a comma (,) or semicolon (;) is detected. It then applies specific 'rules' to these segments and retains only those containing relevant keywords. Additionally, it trims down the segments by preserving words in proximity to particular keywords like "university," "institute," "laboratory," or "hospital." As a result, the average string length is reduced from 90 to 35 characters.
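
The split-and-filter part of this step can be sketched as below. It is a simplified illustration, not the AffRo code: the keyword list is limited to the four words quoted above, and the additional rules and the trimming of words around keywords are not reproduced.

```python
import re

# Keywords quoted in the text; the real list is assumed to be longer.
KEYWORDS = ("university", "institute", "laboratory", "hospital")


def shorten_affiliation(affiliation):
    """Split on commas/semicolons and keep segments containing a keyword."""
    segments = [s.strip() for s in re.split(r"[;,]", affiliation)]
    return [s for s in segments if any(k in s.lower() for k in KEYWORDS)]


raw = "Department of Physics, University of Athens, 15784 Athens; Greece"
print(shorten_affiliation(raw))
```

In this example only the segment naming the university survives, which is how the string length drops so sharply on average.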
*Matching with ROR's Database*
The algorithm checks whether a substring containing a keyword is linked to a legal name or to an alternative name in the organizations listed in the ROR's database. In order to identify the most accurate match, the algorithm employs cosine similarity. Although alternative methods like Levenshtein Distance or Jaro-Winkler Distance were considered for measuring string similarity, it was concluded that cosine similarity was the most appropriate choice for this specific application.
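
A minimal sketch of cosine-similarity matching against a list of organization names follows. The vectorization is an assumption (plain word counts; the text does not say how strings are turned into vectors), the function names are hypothetical, and the candidate names are made up for the example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between word-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def best_match(segment, candidate_names, threshold):
    """Return (name, score) of the best candidate, or (None, score) below threshold."""
    scored = [(cosine_similarity(segment, name), name) for name in candidate_names]
    score, name = max(scored, default=(0.0, None))
    return (name, score) if score >= threshold else (None, score)


names = ["university of athens", "university of crete"]
print(best_match("University of Athens", names, threshold=0.64))
```

The threshold is passed in explicitly so that different values can be used for universities and for other organizations.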
*Refinement*
If multiple matches are found above the desired similarity thresholds, the algorithm performs another check. It applies cosine similarity between the organizations found in the ROR's database and the original affiliation string. This comparison takes into account additional information present in the original affiliation, such as addresses or city names. The algorithm aims to identify the best fit among the potential matches. Note that the case where two or more different organizations share the same name is also considered.
***Parameters:***
* input
* source of affiliations: JSON Crossref or XML Pubmed or Parquet DataCite files.
* organizations: [dix_acad.pkl](https://github.com/openaire/affro/blob/main/dictionaries/dix_acad.pkl), [dix_mult](https://github.com/openaire/affro/blob/main/dictionaries/dix_mult.pkl), [dix_city](https://github.com/openaire/affro/blob/main/dictionaries/dix_city.pkl), [dix_country](https://github.com/openaire/affro/blob/main/dictionaries/dix_country.pkl) (four pickled dictionaries with keys legalnames and alternativenames of organizations in the ROR database.)
* similarity thresholds: simU for universities, simG for other organizations (default values are simU = 0.64, simG = 0.87).
* output
* JSON file with ROR ids of organizations and corresponding similarity scores for each DOI.
***Limitations:*** -
***Environment:***
Python
***References:*** -
***Authority:*** OpenAIRE &bull; ***License:*** AGPL-3.0 &bull; ***Code:*** [AffRo](https://github.com/openaire/affro)

View File

@ -1,16 +1,16 @@
# Citation-based impact indicators
# Impact indicators
This page summarises all calculated citation-based impact indicators, provided by [BIP!](https://bip.imsi.athenarc.gr/), which are included in the [bipIndicators](../../data-model/entities/other#bipindicators) property (found under the [indicators](../../data-model/entities/research-product#indicators) property of the research product).
This page summarises all calculated impact indicators, provided by [BIP!](https://bip.imsi.athenarc.gr/), which are included in the [bipIndicators](../../data-model/entities/other#bipindicators) property (found under the [indicators](../../data-model/entities/research-product#indicators) property of the research product).
It should be noted that the citation-based impact indicators are being calculated on the level of the research output.
It should be noted that the impact indicators are being calculated on the level of the research output.
Below we explain their main intuition, the way they are calculated, and their most important limitations, in an attempt to help avoid common pitfalls and misuses.
## Citation Count (CC) <small><span className="bip-indicator-names">&bull; influence_alt</span></small>
***Short description:***
This is the most widely used citation-based impact indicator, which sums all citations received by each article.
Citation count can be viewed as a measure of a publication's overall (citation-based) impact, since it conveys the number of other works that directly
This is the most widely used scientific impact indicator, which sums all citations received by each article.
Citation count can be viewed as a measure of a publication's overall impact, since it conveys the number of other works that directly
drew on it.
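
As a small illustration, citation count can be computed from a list of (citing, cited) pairs as sketched below. The edge-list representation and the function name are assumptions made for the example.

```python
from collections import Counter


def citation_counts(citations):
    """Count, for every article, the citations it receives.

    `citations` is an iterable of (citing_id, cited_id) pairs.
    """
    return Counter(cited for _citing, cited in citations)


counts = citation_counts([("A", "C"), ("B", "C"), ("C", "D")])
print(counts["C"], counts["D"], counts["A"])
```

Articles that are never cited simply have a count of zero.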
***Algorithmic details:***

View File

@ -16,7 +16,7 @@ a global grouping of every record available in the graph:
This ensures that the same record, possibly assigned to different types by different
mappings, appears only once in the graph and under a single typing. In case of clashing
identifiers, the properties are merged (including the provenance information), considering
identifiers, the properties are merged (including the provencance information), considering
the following precedence order for the research product typing:
```