Compare commits

...

262 Commits
main ... main

Author SHA1 Message Date
Serafeim Chatzopoulos 8e3710d970 Update affiliation matching page in v7.1.3 2024-05-04 11:59:14 +03:00
Serafeim Chatzopoulos d4b02e71ad Merge pull request 'Update affiliation matching description' (#74) from update_affiliation_algorithms into main
Reviewed-on: D-Net/openaire-graph-docs#74
2024-05-04 10:57:57 +02:00
Serafeim Chatzopoulos 755c0117cc Adjust text in affiliation matching page 2024-05-04 11:56:31 +03:00
Serafeim Chatzopoulos 5c28427adc Merge branch 'main' of https://code-repo.d4science.org/D-Net/openaire-graph-docs 2024-05-03 15:13:20 +03:00
Serafeim Chatzopoulos 1fc2158abc Merge branch 'main' of https://code-repo.d4science.org/D-Net/openaire-graph-docs 2024-05-03 15:13:15 +03:00
Serafeim Chatzopoulos 8b547c1fee Merge branch 'main' of https://code-repo.d4science.org/D-Net/openaire-graph-docs 2024-05-03 14:34:15 +03:00
Serafeim Chatzopoulos cce0901008 Merge branch 'main' of https://code-repo.d4science.org/D-Net/openaire-graph-docs 2024-05-03 14:34:07 +03:00
Serafeim Chatzopoulos 3d6729c598 Merge branch 'main' of https://code-repo.d4science.org/D-Net/openaire-graph-docs 2024-05-03 14:21:49 +03:00
Serafeim Chatzopoulos 9713276f3e Rename 'impact indicators' to 'citation-based impact indicators' 2024-05-03 14:21:41 +03:00
Serafeim Chatzopoulos 57506751ef Rename 'impact indicators' to 'citation-based impact indicators' 2024-05-03 12:59:33 +03:00
mkallipo f7e9e93209 affiliation matching description update 2024-04-26 11:13:04 +02:00
mkallipo f0adbba8d7 affiliation matching description update 2024-04-26 10:55:10 +02:00
Claudio Atzori 60b5b1e021 Merge pull request 'added changelog for versions 7.1.2 and 7.1.3' (#73) from v7.1.3 into main
Reviewed-on: D-Net/openaire-graph-docs#73
2024-04-24 16:28:02 +02:00
Claudio Atzori 587508f693 added changelog for versions 7.1.2 and 7.1.3 2024-04-24 16:26:47 +02:00
Claudio Atzori 2f1042d747 Merge pull request 'changelog for v7.1.1' (#71) from v7.1.1 into main
Reviewed-on: D-Net/openaire-graph-docs#71
2024-03-14 10:36:25 +01:00
Claudio Atzori f37d8d8e67 changelog for v7.1.1 2024-03-14 10:35:15 +01:00
Claudio Atzori 5e32a5829f added version 7.1.0 2024-02-21 12:22:06 +01:00
Claudio Atzori 48250cc47a Merge pull request 'Update documentation to describe dedup profile v4' (#70) from dedup_v4 into main
Reviewed-on: D-Net/openaire-graph-docs#70
2024-02-21 10:55:51 +01:00
Claudio Atzori 6a58319814 Merge branch 'main' into dedup_v4 2024-02-21 10:55:43 +01:00
Claudio Atzori c84f5f08eb updated changelog 2024-02-21 10:53:05 +01:00
Claudio Atzori 9f8db418c1 updated changelog 2024-02-19 12:19:18 +01:00
Serafeim Chatzopoulos 5abf090dd3 Fix links in Public APIs home page 2024-02-18 18:17:45 +02:00
Claudio Atzori c95c2228b1 fixed field name, minor changes in wording, also in version 7.0.0 2024-02-16 09:49:36 +01:00
Claudio Atzori a2dfc2482e fixed field name, minor changes in wording 2024-02-16 09:49:36 +01:00
Giambattista Bloisi cc17acb259 Fix usage of <br> in markkdown 2024-02-14 09:43:36 +01:00
Claudio Atzori 882be07650 fixed field name, minor changes in wording, also in version 7.0.0 2024-02-12 12:10:29 +01:00
Claudio Atzori 5bf002b969 fixed field name, minor changes in wording 2024-02-12 08:55:57 +01:00
Giambattista Bloisi 77b24157d6 Refinement of research product chapter 2024-02-09 12:45:53 +01:00
Michele De Bonis f4e7332869 decision trees updated 2024-02-08 15:43:44 +01:00
Giambattista Bloisi 24bdb4e8fd Descripe dedupe profile v4 2024-02-08 12:20:05 +01:00
Serafeim Chatzopoulos d8e23c2277 Add link to User Forum 2024-01-24 17:08:36 +02:00
Serafeim Chatzopoulos 1eba5b613b Update v7.0.0 2024-01-17 15:56:05 +02:00
Serafeim Chatzopoulos 06114518ca Change absolute paths to relative ones 2024-01-17 15:22:28 +02:00
Serafeim Chatzopoulos 13c696b417 Change research results to research products 2024-01-17 14:42:18 +02:00
Serafeim Chatzopoulos 096cbbb74e Fix typos in related datasets 2024-01-17 12:40:48 +02:00
Serafeim Chatzopoulos 362b60e29d Merge branch 'main' of https://code-repo.d4science.org/D-Net/openaire-graph-docs 2024-01-17 12:04:44 +02:00
Serafeim Chatzopoulos bdb4c63aa3 Update v7.0.0 2024-01-17 12:04:32 +02:00
Serafeim Chatzopoulos 0b5d027f5f Update v7.0.0 2024-01-17 11:52:11 +02:00
Serafeim Chatzopoulos 4b27dd22ae Rename dump to dataset 2024-01-17 11:36:15 +02:00
Serafeim Chatzopoulos c801abe833 Add version 7.0.0 2024-01-17 11:16:12 +02:00
Serafeim Chatzopoulos e76e5211b9 Add info box 2024-01-17 10:50:48 +02:00
Serafeim Chatzopoulos 317c5824b3 Merge pull request 'Fix dead links' (#67) from api into main
Reviewed-on: D-Net/openaire-graph-docs#67
2024-01-15 18:58:29 +01:00
Serafeim Chatzopoulos f09a6b2f24 Fix dead links 2024-01-15 19:48:55 +02:00
Serafeim Chatzopoulos 23a3fd2810 Merge pull request 'Add authentication pages for APIs' (#66) from api into main
Reviewed-on: D-Net/openaire-graph-docs#66
2024-01-15 18:40:52 +01:00
Serafeim Chatzopoulos 3c1004c7dc Add authentication pages for APIs 2024-01-15 19:40:20 +02:00
Serafeim Chatzopoulos 8deb1dce77 Merge pull request 'Add Docs for APIs' (#65) from api into main
Reviewed-on: D-Net/openaire-graph-docs#65
2024-01-11 17:07:57 +01:00
Serafeim Chatzopoulos ca12a508ba Add Docs for APIs 2024-01-11 18:05:41 +02:00
Claudio Atzori 4a071ba919 beginners kit: updated URL to the version independent DOI 2023-12-12 12:26:23 +01:00
Claudio Atzori a878d9a4b0 revert to previous changelog structure 2023-11-29 10:18:01 +01:00
Claudio Atzori 48a183d53b added version 6.2.2 2023-11-27 11:00:20 +01:00
Serafeim Chatzopoulos 4d20184a1d Update release date in v6.1.1 changelog && add notice in download pages 2023-11-06 16:10:24 +02:00
Serafeim Chatzopoulos 9b0f50794e Add v6.1.1 2023-11-06 14:54:17 +02:00
Serafeim Chatzopoulos 2a8d984d0a Add v6.1.1 2023-11-06 14:53:42 +02:00
Claudio Atzori 5dffc486f2 Merge pull request 'Update of introduction to the deduplication section' (#59) from deduplication into main
Reviewed-on: D-Net/openaire-graph-docs#59
2023-11-06 12:48:03 +01:00
Claudio Atzori da21610ee4 Merge branch 'main' into deduplication 2023-11-06 12:47:33 +01:00
Claudio Atzori 2cdc031839 Merge pull request 'v6.1.1' (#63) from v6.1.1 into main
Reviewed-on: D-Net/openaire-graph-docs#63
2023-11-06 12:46:20 +01:00
Claudio Atzori 62c6f4a3a1 Merge branch 'main' into v6.1.1 2023-11-06 12:44:26 +01:00
Serafeim Chatzopoulos 493291a327 Change 'relation' to 'relationship' in relationship-object.md 2023-11-02 12:28:30 +02:00
Serafeim Chatzopoulos 6830c2948c Updat changelog record for v6.1.1 2023-10-31 16:35:24 +02:00
Serafeim Chatzopoulos e73bec7430 Add changelog record for v6.1.1 2023-10-31 16:26:29 +02:00
Serafeim Chatzopoulos ecda5f7c22 Merge pull request 'Add versionless doi for scholix' (#62) from updateScholixLink into main
Reviewed-on: D-Net/openaire-graph-docs#62
2023-10-25 09:14:43 +02:00
Serafeim Chatzopoulos 3f078bb26b Add versionless doi for scholix 2023-10-25 00:13:40 -07:00
Serafeim Chatzopoulos 80cc6c587f Merge pull request 'changed the link to the scholix dataset so it always points to the last version' (#61) from updateScholixLink into main
Reviewed-on: D-Net/openaire-graph-docs#61
2023-10-25 08:55:43 +02:00
Miriam Baglioni 00a0f77ecb changed the link to the scholix dataset so it always points to the last version 2023-10-24 08:41:15 +02:00
Serafeim Chatzopoulos 1f511a1845 Merge pull request 'Updating the "Citation matching" page by adding an entry in references section' (#60) from marekhorst_fixing_citationmatching_references into main
Reviewed-on: D-Net/openaire-graph-docs#60
2023-10-19 14:06:02 +02:00
Marek Horst 3f7e939f57 Updating the "Citation matching" page by adding an entry in references section. 2023-10-19 11:44:02 +02:00
Paolo Manghi ed12572fa1 Update 'docs/graph-production-workflow/deduplication/deduplication.md' 2023-10-18 11:52:06 +02:00
Serafeim Chatzopoulos f1938bd159 Add citation_matching.md changes to current release 2023-10-16 17:25:53 +03:00
Serafeim Chatzopoulos 93e61b95ef Merge pull request 'Fixing the "Citation matching" page short description paragraph by removing two sentences' (#58) from marekhorst_fixing_citationmatching_short_descr into main
Reviewed-on: D-Net/openaire-graph-docs#58
2023-10-16 15:13:35 +02:00
Marek Horst b3421ca0ab Fixing the "Citation matching" page short description paragraph by removing two sentences:
1) phrase that leaked from the page template
2) irrelevant remark about the dataset from the paper and the work carried out in one of the OpenAIRE project incarnations
2023-10-16 14:56:18 +02:00
Claudio Atzori f600fcf8d6 Merge pull request 'download section: zenodo depositions referred as 'dataset' instead of 'dump'' (#57) from v6.0.0_fix into main
Reviewed-on: D-Net/openaire-graph-docs#57
2023-08-22 09:44:34 +02:00
Claudio Atzori e81f9e1b11 download section: zenodo depositions referred as 'dataset' instead of 'dump' 2023-08-22 09:40:46 +02:00
Serafeim Chatzopoulos b1344520fa Merge pull request 'Add changes for version 6.0.0' (#56) from v6.1.0 into main
Reviewed-on: D-Net/openaire-graph-docs#56
2023-08-17 11:01:21 +02:00
Serafeim Chatzopoulos 32f6708557 Merge main into branch 2023-08-17 11:30:22 +03:00
Serafeim Chatzopoulos e540c9afb5 Add versioned folders for v5.2.0 2023-08-17 11:10:24 +03:00
Serafeim Chatzopoulos d0d758c0af Rename current version to 5.2.0 2023-08-17 11:04:14 +03:00
Serafeim Chatzopoulos 5b910df0e2 Add changes for version 6.1.0 2023-08-16 21:02:52 +03:00
Claudio Atzori 5422c5a048 Merge pull request 'Add matomo tracking script' (#55) from 8925 into main
Reviewed-on: D-Net/openaire-graph-docs#55
2023-07-24 09:55:08 +02:00
Claudio Atzori 8274746f0b Merge branch 'main' into 8925 2023-07-24 09:52:18 +02:00
Claudio Atzori a6c79793e1 Merge pull request 'Version 6.0.0' (#54) from v6.0.0 into main
Reviewed-on: D-Net/openaire-graph-docs#54
2023-07-24 09:51:54 +02:00
Claudio Atzori 171c56f1ce updated changelog 2023-07-24 09:49:57 +02:00
Claudio Atzori 4493a93380 updated changelog 2023-07-24 09:44:25 +02:00
Serafeim Chatzopoulos 11c378a47c Add matomo tracking script 2023-07-20 17:07:00 +03:00
Claudio Atzori c7f4f5fe8c created docusaurus version 6.0.0 2023-07-19 10:01:26 +02:00
Claudio Atzori 905c80b042 changes for version 6.0.0 2023-07-19 10:00:10 +02:00
Serafeim Chatzopoulos d851cae07b Add page for enrichment by mining 2023-07-05 00:10:21 +03:00
Thanasis Vergoulis bc0c487f5f Merge pull request 'Enriching the full-text collection process' (#53) from pdf_aggregation into main
Reviewed-on: D-Net/openaire-graph-docs#53
2023-07-04 20:17:38 +02:00
Lampros Smyrnaios 1c668b7fd8 Eliminate the "ambiguous unicode character" warning on Gitea. 2023-07-03 18:07:01 +03:00
Lampros Smyrnaios a4f15a3f83 Add the documentation about the PDF Aggregation Service. 2023-07-03 17:57:30 +03:00
Claudio Atzori 9fbc4cc6e0 Merge pull request 'Add changelog entry for v5.1.3' (#51) from v5.1.3 into main
Reviewed-on: D-Net/openaire-graph-docs#51
2023-06-13 16:06:25 +02:00
Serafeim Chatzopoulos a878ba4385 Add new doc & sidebar files 2023-06-13 17:02:29 +03:00
Serafeim Chatzopoulos 4603ba6cec Add changelog entry for v5.1.3 2023-06-13 16:58:02 +03:00
Serafeim Chatzopoulos fd6aaa299d Hide link to OpenenPlato 2023-06-05 17:41:15 +03:00
Serafeim Chatzopoulos 7b9a8df4ee Change intro 2023-05-08 17:32:16 +03:00
Serafeim Chatzopoulos c11367f286 Updata data model figure 2023-04-21 17:32:14 +03:00
Serafeim Chatzopoulos ea1ea195f4 Change data model figure 2023-04-21 17:20:39 +03:00
Serafeim Chatzopoulos 40ab5d2366 Update 'README.md' 2023-04-19 14:39:27 +02:00
Serafeim Chatzopoulos 47ed9d47d5 Update 'README.md' 2023-04-14 14:20:21 +02:00
Serafeim Chatzopoulos 984bfd6a1b Update 'README.md' 2023-04-14 14:13:04 +02:00
Serafeim Chatzopoulos 42ae72e16b Split relationships in two pages && minor changes in sidebar 2023-04-05 18:28:12 +03:00
Claudio Atzori a33eaabc4e created version 5.1.2, updated changelog 2023-04-05 16:07:05 +02:00
Claudio Atzori 7d123d94cc renamed version 5.2.0 -> 5.1.1 2023-04-05 15:50:13 +02:00
Claudio Atzori 57ed9fbc13 renamed version 5.2.0 -> 5.1.1 2023-04-05 15:49:40 +02:00
Claudio Atzori 8d4f376de1 renamed version 5.2.0 -> 5.1.1 2023-04-05 15:49:21 +02:00
Serafeim Chatzopoulos 5a50af2301 Merge branch 'pdf_aggregation' of https://code-repo.d4science.org/D-Net/openaire-graph-docs into pdf_aggregation 2023-03-22 14:57:07 +02:00
Serafeim Chatzopoulos f530e6b738 Add a dedicated page for enrichment by mining 2023-03-22 14:56:59 +02:00
Serafeim Chatzopoulos 12a0827944 Add a dedicated page for enrichment by mining 2023-03-22 14:56:06 +02:00
Serafeim Chatzopoulos 93e14c1754 Fix broken links 2023-03-15 20:31:12 +02:00
Serafeim Chatzopoulos de89887e7e Update url path for graph-production-workflow; add more details in indictors ingestion page 2023-03-15 20:19:49 +02:00
Claudio Atzori bd4e9bd417 refer to 'more recent versions' rather than a particular one 2023-03-10 11:08:18 +01:00
Claudio Atzori fb5e6cd814 integrating Scholexplorer Bio Entity Datasource documentation (PR#48) 2023-03-09 15:00:45 +01:00
Sandro La Bruzzo 49fc73d09d added mapping uniprot/swiss 2023-03-09 14:21:55 +01:00
Sandro La Bruzzo da45601f2c Added mapping of UNiprot and updated pid types 2023-03-09 14:21:23 +01:00
Claudio Atzori 16c87331e4 created version 5.2.0, updated changelog 2023-03-01 15:32:36 +01:00
Serafeim Chatzopoulos e636f50b12 Add ack link to license page 2023-02-28 14:58:53 +02:00
Serafeim Chatzopoulos 3addaee00d Merge pull request 'Update graph name, logo, and badges' (#46) from Change_graph_name into main
Reviewed-on: D-Net/openaire-graph-docs#46
2023-02-23 12:52:55 +01:00
Serafeim Chatzopoulos 29de468b6a Update graph name, logo, and badges 2023-02-23 13:50:36 +02:00
Claudio Atzori bb0f8335e2 aligned relation labels with the data 2023-02-17 15:41:19 +01:00
Serafeim Chatzopoulos 7f994877e4 Add warning message in download pages 2023-02-14 11:51:07 +02:00
Claudio Atzori d19f9e4342 Added link to a Zeppelin guide 2023-02-07 11:58:00 +01:00
Claudio Atzori 13457c0280 Added link to a Zeppelin guide 2023-02-07 11:50:53 +01:00
Serafeim Chatzopoulos aa18073f86 Merge pull request 'Add warning message in full graph dump page' (#45) from 5.1.0 into main
Reviewed-on: D-Net/openaire-graph-docs#45
2023-02-06 21:07:14 +01:00
Serafeim Chatzopoulos 028479c36c Add warning message in full graph dump page 2023-02-06 22:06:35 +02:00
Serafeim Chatzopoulos aaf64e8c65 Merge pull request 'Version 5.1.0' (#44) from 5.1.0 into main
Reviewed-on: D-Net/openaire-graph-docs#44
2023-02-06 14:17:46 +01:00
Serafeim Chatzopoulos 8146a2706e Add relationship provenance in 5.1.0 2023-02-06 15:17:25 +02:00
Serafeim Chatzopoulos 0f68d79164 Merge branch 'main' of https://code-repo.d4science.org/D-Net/openaire-graph-docs into 5.1.0 2023-02-06 15:15:35 +02:00
Serafeim Chatzopoulos 4b2dcaa5e2 Merge pull request 'Adding provenance to relation types' (#43) from relation_provenance into main
Reviewed-on: D-Net/openaire-graph-docs#43
2023-02-06 14:12:38 +01:00
Serafeim Chatzopoulos 6009ac9a88 Changelog minor fixes: start/release date and dump release format 2023-02-06 15:06:16 +02:00
Serafeim Chatzopoulos 276a645717 Change number of data sources to 2k && replace Grid.ac with ROR 2023-02-06 14:44:20 +02:00
Claudio Atzori a67ece1080 completed provenances in the relation table 2023-02-03 16:00:13 +01:00
Claudio Atzori 6be9720458 created version 5.1.0 2023-02-03 12:37:28 +01:00
Claudio Atzori 73ae675672 preparing version 5.1.0 2023-02-03 12:35:26 +01:00
Serafeim Chatzopoulos fc54cfcf02 Change workflow image 2023-02-02 14:45:42 +02:00
Serafeim Chatzopoulos 4eecc333bc Update intro.md 2023-01-22 21:35:51 +02:00
Serafeim Chatzopoulos 34d2b37c1e Update intro.md 2023-01-22 21:28:34 +02:00
Serafeim Chatzopoulos b95ba3707e Tag version 5.0.0 2023-01-18 17:50:04 +02:00
Serafeim Chatzopoulos e02f71b037 Merge pull request '[deduplication] added relationship redistribution phase' (#40) from deduplication into main
Reviewed-on: D-Net/openaire-graph-docs#40
2023-01-18 16:47:34 +01:00
Serafeim Chatzopoulos 997c31a9df Show only version number 2023-01-18 17:45:16 +02:00
Claudio Atzori c33f7aa62d WIP adding provenance to relation types 2023-01-17 16:00:01 +01:00
Serafeim Chatzopoulos b2a0130ce5 Merge pull request 'Link to the DataCite metadata kernel' (#42) from relation_datacite into main
Reviewed-on: D-Net/openaire-graph-docs#42
2023-01-13 14:16:11 +01:00
Claudio Atzori e34d565882 adding link to the DataCite metadata kernel 2023-01-13 14:14:07 +01:00
Miriam Baglioni 8c2e0e0022 fixed issues on relationship table 2023-01-10 12:34:00 +01:00
Claudio Atzori 937de81e83 make the phases explicit in the text 2023-01-10 12:21:59 +01:00
Claudio Atzori 38e3f8b780 added relationship redistribution phase 2023-01-10 12:08:32 +01:00
Serafeim Chatzopoulos 87ef2724da Add helpdesk in sidebar 2023-01-09 20:05:03 +02:00
Serafeim Chatzopoulos 22e90827e2 Update links on Zenodo for dumps 2023-01-05 18:07:32 +02:00
Serafeim Chatzopoulos 148564e098 Update citation of OpenAIRE Research Graph 2023-01-05 17:58:50 +02:00
Serafeim Chatzopoulos 5921b13dc7 Update 'docs/downloads/full-graph.md' 2022-12-30 22:07:17 +01:00
Serafeim Chatzopoulos 20d9cea33b Update 'docs/changelog.md' 2022-12-30 22:00:07 +01:00
Miriam Baglioni e5574b8490 Merge pull request 'Add versioning section & changelog' (#10) from changelog into main
Reviewed-on: D-Net/openaire-graph-docs#10
2022-12-30 16:35:01 +01:00
Miriam Baglioni 7035ad6878 remove the set of the added relationships. 2022-12-30 16:34:07 +01:00
Miriam Baglioni e234c3630a Merge pull request 'Add beginner's kit text' (#38) from beginners-kit into main
Reviewed-on: D-Net/openaire-graph-docs#38
2022-12-30 16:30:45 +01:00
Miriam Baglioni cb3509ba38 added link to the beginner's kit uploaded on Zenodo 2022-12-30 16:27:50 +01:00
Serafeim Chatzopoulos 5dd5cd836d Change architecture diagram 2022-12-30 17:19:03 +02:00
Serafeim Chatzopoulos 8d78ebc5db Add Beginner's kit in changelog 2022-12-30 16:50:48 +02:00
Serafeim Chatzopoulos 8aa4183dd9 Merge branch 'main' of https://code-repo.d4science.org/D-Net/openaire-graph-docs into changelog 2022-12-30 16:42:52 +02:00
Serafeim Chatzopoulos f15912051f Add beginner's kit text 2022-12-30 16:40:16 +02:00
Serafeim Chatzopoulos b943be8ee3 Fix links from impact measures page to specific properties/objects in the result 2022-12-27 21:22:30 +02:00
Serafeim Chatzopoulos 4c23bb429b Merge pull request 'graph-data-model-revision' (#37) from graph-data-model-revision into main
Reviewed-on: D-Net/openaire-graph-docs#37
2022-12-27 19:24:10 +01:00
Serafeim Chatzopoulos ccf3ea1529 Add info for impact indicators 2022-12-27 19:34:46 +02:00
Miriam Baglioni 709b5f49bd updated changelog 2022-12-27 14:54:28 +01:00
Miriam Baglioni 5f75cd4011 merging with main 2022-12-27 14:47:12 +01:00
Miriam Baglioni f170f72d8d indentation for json 2022-12-27 12:57:36 +01:00
Miriam Baglioni 7bf48ea976 added new relationships 2022-12-27 12:51:52 +01:00
Miriam Baglioni 248e758a94 merge with main 2022-12-27 11:59:15 +01:00
Miriam Baglioni 489bfef146 added the serialization of Indicators at the level of the result. Removed the serialization of measures at the level of the instance 2022-12-27 11:55:54 +01:00
Serafeim Chatzopoulos 0e7b14c0af Merge pull request 'Restructuring data provision section' (#34) from restructure_data_provision into main
Reviewed-on: D-Net/openaire-graph-docs#34
2022-12-23 12:32:17 +01:00
Claudio Atzori 29731b7be7 added links to the explore, connect, provide portals. Further adoption of the OpenAIRE Graph shorter wording 2022-12-23 12:13:43 +01:00
Claudio Atzori 070219b095 added synthetic stats page 2022-12-23 12:11:59 +01:00
Claudio Atzori 8e4172c1f7 usage count text from Dimitris 2022-12-22 16:25:25 +01:00
Claudio Atzori 099a500e88 added merge by id description 2022-12-22 16:21:00 +01:00
Thanasis Vergoulis c66de2b9e7 Merge pull request 'Adds a searchbox in the navbar' (#35) from enable_search into main
Reviewed-on: D-Net/openaire-graph-docs#35
2022-12-22 09:51:44 +01:00
Serafeim Chatzopoulos 078ec28a6a Set docsRouteBasePath for search plugin 2022-12-21 22:53:44 +02:00
Serafeim Chatzopoulos 61d62ddab3 Install plugin 2022-12-21 22:12:26 +02:00
Serafeim Chatzopoulos 6e56aa1a4d Add text to compatible sources - aggregation 2022-12-21 21:44:50 +02:00
Serafeim Chatzopoulos 8e9295947c Rename back to OpenAIRE Research Graph 2022-12-21 20:52:33 +02:00
Serafeim Chatzopoulos b9bdda24b7 Merge pull request 'Fixes images path when using a BASE_URL = "/docs"' (#33) from fix-image-links into main
Reviewed-on: D-Net/openaire-graph-docs#33
2022-12-21 18:15:19 +01:00
Serafeim Chatzopoulos 79e3a5b563 Merge with main 2022-12-21 19:13:15 +02:00
Serafeim Chatzopoulos fdc331641d Merge pull request 'Restructure data provision section' (#32) from restructure_data_provision into main
Reviewed-on: D-Net/openaire-graph-docs#32
2022-12-21 17:56:43 +01:00
Serafeim Chatzopoulos 69ff846180 Move text from finalisation to cleaning; minor changes in mining; fix typo in sidebar 2022-12-21 17:03:44 +02:00
Serafeim Chatzopoulos f1f011210c Rename folder deduction-and-propagation 2022-12-21 14:40:34 +02:00
Serafeim Chatzopoulos 387fd97e24 Remove FAQ 2022-12-21 14:40:05 +02:00
Serafeim Chatzopoulos 53b955a373 Add usage counts text 2022-12-21 14:39:41 +02:00
Serafeim Chatzopoulos 484d6cb82b Restructure data provision section 2022-12-20 17:55:04 +02:00
Serafeim Chatzopoulos 1506ce928a Update 'release.properties' 2022-12-20 15:16:23 +01:00
Serafeim Chatzopoulos 4e3806e05e Change footer 2022-12-20 15:20:50 +02:00
Serafeim Chatzopoulos db8bdc4a08 Fix broken links 2022-12-20 14:05:55 +02:00
Serafeim Chatzopoulos e3126ec32d Merge pull request 'Add support for ENV variables' (#27) from parameter_config_with_env into main
Reviewed-on: D-Net/openaire-graph-docs#27
2022-12-16 17:56:07 +01:00
Serafeim Chatzopoulos 0b57188a58 Merge main into branch 2022-12-16 18:55:55 +02:00
Serafeim Chatzopoulos 6686a7ec50 Merge pull request 'Add LOD dump in other related datasets section' (#29) from update_related_datasets into main
Reviewed-on: D-Net/openaire-graph-docs#29
2022-12-16 07:12:41 +01:00
Serafeim Chatzopoulos 69a2a92909 Merge pull request 'Add new badges for ack' (#30) from update_badges into main
Reviewed-on: D-Net/openaire-graph-docs#30
2022-12-16 07:12:31 +01:00
Serafeim Chatzopoulos f8fde1dba8 Merge pull request 'Disable color theme switch' (#31) from disable_color_theme_switch into main
Reviewed-on: D-Net/openaire-graph-docs#31
2022-12-16 07:12:19 +01:00
Serafeim Chatzopoulos 440e8c5b9c Disable color theme switch & remove code filtering sidebar items 2022-12-15 20:04:15 +02:00
Serafeim Chatzopoulos c1cf65e2d3 Add extra padding in badges 2022-12-15 17:13:08 +02:00
Serafeim Chatzopoulos 6281938c81 Add new badges for ack 2022-12-15 17:01:15 +02:00
Serafeim Chatzopoulos 2839958e38 Add LOD dump in other related datasets section 2022-12-15 14:31:41 +02:00
Alessia Bardi 159f50c9ef simplified some sentences 2022-12-14 19:01:35 +01:00
Claudio Atzori e3fb581270 Update 'README.md' 2022-12-14 14:23:22 +01:00
Serafeim Chatzopoulos 24af35739e Merge pull request 'Release information' (#26) from release_properties into main
Reviewed-on: D-Net/openaire-graph-docs#26
2022-12-13 12:56:56 +01:00
Serafeim Chatzopoulos 17bd13446b Add support for ENV variables 2022-12-12 09:35:53 +02:00
Serafeim Chatzopoulos caa6f7d196 Merge pull request '[Bulk Download] first versione of the documentation' (#19) from bulk_downloads into main
Reviewed-on: D-Net/openaire-graph-docs#19
2022-12-08 19:30:26 +01:00
Serafeim Chatzopoulos 83f28816b8 Fix link to downloads 2022-12-08 20:26:24 +02:00
Serafeim Chatzopoulos 67d2e38f6d Merge branch 'main' of https://code-repo.d4science.org/D-Net/openaire-graph-docs into bulk_downloads 2022-12-08 20:18:08 +02:00
Serafeim Chatzopoulos 750d57a110 Format the publications in the same way 2022-12-08 20:12:29 +02:00
Serafeim Chatzopoulos b4cd25b8db Add how to cite & badge in download page 2022-12-08 19:40:12 +02:00
Serafeim Chatzopoulos 6c283bde25 Minor fix on alternative sub-graph data model description 2022-12-06 18:52:35 +02:00
Serafeim Chatzopoulos bee82cbd4c Re-arrange downloads section 2022-12-06 18:43:54 +02:00
Claudio Atzori ede1bd98ea added release.properties file 2022-12-06 15:46:40 +01:00
Miriam Baglioni 47394afd5e [CommunityModel] added comments on subgraphs 2022-12-05 15:39:25 +01:00
Miriam Baglioni a61a407c14 [CommunityModel] first version of the community model 2022-12-05 12:45:57 +01:00
Serafeim Chatzopoulos 029429fcc5 Merge pull request 'Smaller images' (#24) from reduced_image_file_size into main
Reviewed-on: D-Net/openaire-graph-docs#24
2022-12-02 17:21:06 +01:00
Miriam Baglioni 48895caa3c merging with main 2022-12-02 16:28:42 +01:00
Claudio Atzori 965785e183 reduced image sizes for a lower build footprint 2022-12-02 15:56:52 +01:00
Serafeim Chatzopoulos ab4f9afe31 Merge pull request 'fix_issues_raised_in_PR_7' (#23) from fix_issues_raised_in_PR_7 into main
Reviewed-on: D-Net/openaire-graph-docs#23
2022-12-02 12:58:06 +01:00
Serafeim Chatzopoulos ac7554cb8a Minor rephrasing 2022-12-02 13:57:16 +02:00
Serafeim Chatzopoulos 1e2e95cc08 Merge branch 'main' into 'fix_issues_raised_in_PR_7' 2022-12-02 13:39:13 +02:00
Serafeim Chatzopoulos 9c45c0533e Merge pull request 'Add sitemap.xml generation during build' (#21) from Add_sitemap_generation into main
Reviewed-on: D-Net/openaire-graph-docs#21
2022-12-02 12:27:52 +01:00
Serafeim Chatzopoulos 0c0352048f Merge pull request 'Attempt to match the look and feel of graph.openaire.eu' (#22) from styling into main
Reviewed-on: D-Net/openaire-graph-docs#22
2022-12-02 12:27:39 +01:00
Serafeim Chatzopoulos 2cd5c4d686 Remove stats page for now 2022-12-02 12:37:54 +02:00
Serafeim Chatzopoulos 7a22db2ad1 Merge branch 'main' into styling 2022-12-02 11:29:58 +01:00
Serafeim Chatzopoulos 10c6330c1b Merge branch 'main' into Add_sitemap_generation 2022-12-02 11:29:47 +01:00
Serafeim Chatzopoulos eb2364b8f4 Add missing enrichment files 2022-12-02 12:26:21 +02:00
Serafeim Chatzopoulos 01e4744550 Merge pull request 'enrichment' (#20) from enrichment into main
Reviewed-on: D-Net/openaire-graph-docs#20
2022-12-01 13:41:25 +01:00
Serafeim Chatzopoulos 7c12a37f11 Split the enrichment section in sub-pages 2022-11-30 18:09:31 +02:00
Serafeim Chatzopoulos 74eab8b908 Merge branch 'main' into Add_sitemap_generation 2022-11-30 13:18:31 +01:00
Serafeim Chatzopoulos 9a8c0f6923 Change the update frequency of the sitemap.xml to monthly 2022-11-30 14:15:45 +02:00
Serafeim Chatzopoulos 79c516f21c Change background color 2022-11-29 18:16:52 +02:00
Serafeim Chatzopoulos 7b5d9eae82 Update logo & styling 2022-11-29 16:43:06 +02:00
Serafeim Chatzopoulos df6b49bd8f Remove services page 2022-11-29 15:49:11 +02:00
Serafeim Chatzopoulos a844ac459c Align references in aggregation section with those in relevant pubs 2022-11-29 14:21:52 +02:00
Serafeim Chatzopoulos f4f84a5a31 Fix typos 2022-11-29 14:16:22 +02:00
Serafeim Chatzopoulos 4b63ab0ace Add sitemap.xml generation during build 2022-11-29 13:18:34 +02:00
Serafeim Chatzopoulos 989d9ea34c Split bulk downloads page in sub-pages 2022-11-28 14:19:40 +02:00
Alessia Bardi d96049a3ab ignore intellij project file 2022-11-23 14:33:34 +01:00
Miriam Baglioni b14f89a845 merging with main 2022-11-23 14:01:30 +01:00
Miriam Baglioni 2f3e832d4d [Bulk Download] first versione of the documentation 2022-11-18 17:53:03 +01:00
Miriam Baglioni 6a773cfe1a [Enrichment] first version for propagation finished 2022-11-18 17:15:02 +01:00
Serafeim Chatzopoulos b32ee99cdf Merge pull request 'Initial OpenAIRE Graph license description' (#8) from license into main
Reviewed-on: D-Net/openaire-graph-docs#8
2022-11-17 16:44:45 +01:00
Serafeim Chatzopoulos 964bb10439 Merge pull request 'Format mining algorithms' (#17) from formating_enrichment_section into main
Reviewed-on: D-Net/openaire-graph-docs#17
2022-11-17 14:49:47 +01:00
Serafeim Chatzopoulos af2589274a Merge pull request 'Update to docusaurus v2.2.0 && npm audit fix' (#16) from update_docusaurus into main
Reviewed-on: D-Net/openaire-graph-docs#16
2022-11-17 14:48:57 +01:00
Serafeim Chatzopoulos 96912ea7ec Merge pull request 'Add formating to impact indicators page' (#9) from impact_indicators into main
Reviewed-on: D-Net/openaire-graph-docs#9
2022-11-17 14:24:53 +01:00
Serafeim Chatzopoulos 14c995a362 Format mining algorithms 2022-11-17 15:21:38 +02:00
Serafeim Chatzopoulos 3a7578fe16 Merge pull request 'Introducing the description of mining algorithms developed by ICM' (#15) from enrichment_mining_icm into main
Reviewed-on: D-Net/openaire-graph-docs#15
2022-11-17 13:44:26 +01:00
Serafeim Chatzopoulos 7526743ef6 Merge branch main into enrichment_mining_icm 2022-11-17 14:44:09 +02:00
Serafeim Chatzopoulos 5684d7bff7 Merge pull request 'update mining docs' (#14) from ioannis.foufoulas/openaire-graph-docs:mining_docs into main
Reviewed-on: D-Net/openaire-graph-docs#14
2022-11-17 13:41:49 +01:00
Serafeim Chatzopoulos c77ac867e0 Add changelog for v5 2022-11-17 14:28:09 +02:00
Serafeim Chatzopoulos 36a3bc35f0 Update to docusaurus v2.2.0 && npm audit fix 2022-11-17 13:50:45 +02:00
Marek Horst 32864f74c6 Small structural corrections. 2022-11-16 19:13:38 +01:00
Marek Horst 0e96fae405 Introducing the description of mining algorithms developed by ICM. 2022-11-16 19:04:32 +01:00
Claudio Atzori 2de2ed1932 fixed section title formatting 2022-11-16 14:07:07 +01:00
Serafeim Chatzopoulos 7d9c7b214c Minor change in impact-scores.md 2022-11-15 16:54:44 +02:00
Serafeim Chatzopoulos 90eb3d4380 Merge brnach 'main' into impact_indicators 2022-11-15 16:47:18 +02:00
Serafeim Chatzopoulos f688d64dd5 Merge pull request 'rpl 'OpenAIRE Research Graph 'OpenAIRE Graph'' (#13) from openaire_graph_rename into main
Reviewed-on: D-Net/openaire-graph-docs#13
2022-11-15 15:39:33 +01:00
Claudio Atzori 7b454f70d4 restored the original 'OpenAIRE Research Graph' Zenodo community name 2022-11-15 15:31:12 +01:00
Serafeim Chatzopoulos ce31a6d5c7 Address review comments 2022-11-15 16:29:39 +02:00
Claudio Atzori 681be1e2f8 Replaced 'OpenAIRE Research Graph' with 'OpenAIRE Graph' 2022-11-15 15:26:58 +01:00
Serafeim Chatzopoulos 0db019e51a Add versioning section 2022-11-11 19:15:55 +02:00
Serafeim Chatzopoulos 7717d883ee Add formating to impact indicators page 2022-11-11 18:07:24 +02:00
Marek Horst d5f68e5348 Initial OpenAIRE Graph license description. 2022-11-10 18:55:27 +01:00
Andreas Czerniak ce17228075 contributing APIs wiki page, CAP, DRIS 2022-11-10 12:26:43 +01:00
Andreas Czerniak 849901f231 add redmine page 2022-11-10 12:15:55 +01:00
Miriam Baglioni 1669c7a5fe [Enrichment] first version of documentation for the bulktagging and part of the propagation 2022-11-09 18:03:55 +01:00
1490 changed files with 80645 additions and 1680 deletions

2
.env Normal file

@ -0,0 +1,2 @@
URL="https://graph.openaire.eu"
BASE_URL="/docs"

3
.gitignore vendored

@ -19,4 +19,5 @@ npm-debug.log*
yarn-debug.log*
yarn-error.log*
.idea/
.idea/
openaire-graph-docs.iml


@ -2,28 +2,37 @@
This website is built using [Docusaurus 2](https://docusaurus.io/); please check [here](https://docusaurus.io/docs/installation#requirements) the requirements to run the project.
## Clone repository
## Local installation and development
From https://docusaurus.io/docs/installation#requirements
> Node.js version 16.14 or above (which can be checked by running node -v)
Clone the repository:
```
$ git clone https://code-repo.d4science.org/D-Net/openaire-graph-docs.git
git clone https://code-repo.d4science.org/D-Net/openaire-graph-docs.git
```
NOTE: please use git branches for introducing new changes.
Install the required packages:
```
npm install
```
## Local installation and deployment
Start a local development server (opens in a new browser window).
```
npm run start
```
NOTE: most changes are reflected live without having to restart the server.
To install the required packages use:
```
$ npm install
```
The following command starts a local development server and opens up a browser window. Note that most changes are reflected live without having to restart the server.
Before issuing a Pull Request, please ensure that the following command runs successfully:
```
$ npm run start
```
Generate the static content into the `build` directory using the command that follows. This directory can then be served using any static content hosting service.
```
$ npm run build
npm run build
```
NOTE: This command generates the static content into the `build` directory.
This output directory is then used to deploy the documentation website.
## Deployment using Docker
@ -55,3 +64,6 @@ When tagging a new version, the document versioning mechanism will:
* Copy the full `docs/` folder contents into a new `versioned_docs/version-<versionName>/` folder.
* Create a versioned sidebars file based from your current sidebar configuration, saved as `versioned_sidebars/version-<versionName>-sidebars.json`.
* Append the new version number to `versions.json`.
Therefore, when previewing the compiled site locally with `npm run start`, make sure to view the `Next` version in the browser, as it shows the changes under `/docs`.
To change documentation that has already been versioned, modify the source files in the `versioned_docs/version-<versionName>/` folder.

308
docs/apis/authentication.md Normal file

@ -0,0 +1,308 @@
# Guide for authenticated requests
The OpenAIRE APIs can be accessed over HTTPS by both authenticated and unauthenticated requests.
You can use authenticated requests to increase the rate limit of your requests (please refer [here](./terms#authentication--limits) for the current API rate limits).
There are 2 main modes that you can use to authenticate API requests:
* [Personal access tokens](#personal-access-token)
* [Registered services](#registered-services)
In the following, we elaborate on these modes.
## Personal access token
To access the OpenAIRE APIs with better rate limits you can use your personal access token. To have access to the following functionalities you need to log in to OpenAIRE. In case you are not already a member, you will need to register first and provide your [Personal information](https://develop.openaire.eu/personal-info).
:::info New!
The registration process has been updated! In order to visit the Personal Token and Registered Services functionalities you need to fill in the Personal Information form available [here](https://develop.openaire.eu/personal-info). This update will not affect the operation of your existing services. However, if you want to register a new service or access/modify an existing one, you will need to provide your personal information first.
:::
### How to create your personal access token
To create your personal access token go to [your personal access token page](https://develop.openaire.eu/personal-token) and copy it!
:::info
Your access token is valid for an hour.
:::
:::caution
Do not share your personal access token. Send your personal access token over HTTPS.
:::
### How to use your personal access token
To access the OpenAIRE APIs send your personal access token using the Authorization header.
```js
GET https://api.openaire.eu/{resourceServicePath}
Authorization: Bearer {ACCESS_TOKEN}
```
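As an illustration, the request above can be prepared from Python using only the standard library. This is a minimal sketch rather than an official client: the helper name `build_authenticated_request` is hypothetical, and `search/publications` is used only as an example resource path.

```python
import urllib.request

API_BASE = "https://api.openaire.eu"


def build_authenticated_request(resource_path: str, access_token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request that carries the bearer token."""
    url = f"{API_BASE}/{resource_path.lstrip('/')}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )


# Sending the request performs a network call, so it is left commented out:
# with urllib.request.urlopen(build_authenticated_request("search/publications", token)) as resp:
#     body = resp.read()
```

Keeping the token in a header (and never in the URL) avoids leaking it into server logs or browser history.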
### An hour is not enough? What to do.
To prolong your access to our APIs you can use a **refresh token** that allows you to programmatically issue a new access token.
To get your refresh token, go to [your personal access token page](https://develop.openaire.eu/personal-token) and click the **"Get a refresh token"** button.
An OpenAIRE refresh token expires after 1 month.
In case you already have a refresh token, a new one will be issued and the old one will no longer be valid.
Please copy your refresh token and store it confidentially. You will not be able to retrieve it. Do not share your refresh token. Send your refresh token over HTTPS.
Since the OpenAIRE refresh token expires after one month, when a client gets a refresh token, this token must be stored securely to keep it from being used by potential attackers. If a refresh token is leaked, it may be used to obtain new access tokens and access protected resources until a new one is issued or it expires.
To get a personal access token using your refresh token you need to make the following request:
```js
GET https://services.openaire.eu/uoa-user-management/api/users/getAccessToken?refreshToken={your_refresh_token}
```
The response has the following format:
```json
{
"access_token": "...",
"token_type": "Bearer",
"refresh_token": "...",
"expires_in": "...",
"scope": "...",
"id_token": "..."
}
```
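The refresh flow described above could be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, only the `access_token` field of the response is read, and error handling (expired or revoked refresh tokens) is omitted.

```python
import json
import urllib.parse
import urllib.request

REFRESH_ENDPOINT = (
    "https://services.openaire.eu/uoa-user-management/api/users/getAccessToken"
)


def build_refresh_url(refresh_token: str) -> str:
    """Return the URL that exchanges a refresh token for a new access token."""
    query = urllib.parse.urlencode({"refreshToken": refresh_token})
    return f"{REFRESH_ENDPOINT}?{query}"


def extract_access_token(response_body: str) -> str:
    """Pull the access token out of the JSON response shown above."""
    return json.loads(response_body)["access_token"]


def fetch_access_token(refresh_token: str) -> str:
    """Perform the network call; keep the refresh token out of logs."""
    with urllib.request.urlopen(build_refresh_url(refresh_token)) as resp:
        return extract_access_token(resp.read().decode("utf-8"))
```

Since an access token is valid for an hour, a long-running client would typically call `fetch_access_token` shortly before the current token expires.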
## Registered services
If you have a service (client) that you want to interact with the OpenAIRE APIs you need to register it.
:::info
You can register up to 5 services.
:::
We offer two ways of authenticating your service: the Basic Authentication and the Advanced Authentication.
### Which one is for me?
| | How | Client Credential Issuer | Authentication Method |
| --- | --- | --- | --- |
| **Basic** | Client ID & Client Secret | OpenAIRE AAI server | Client Secret (Basic) |
| **Advanced** | Private Key signed JWT | Service owner | Private Key JWT Client Authentication |
For the **Basic Authentication** method the OpenAIRE AAI server generates a pair of _Client ID_ and _Client Secret_ credentials for your service upon its registration. The service sends the client id and client secret when authenticating to the OpenAIRE AAI Server to obtain the access token for the OpenAIRE APIs. The OpenAIRE AAI server checks whether the client id and client secret sent are valid. [Continue reading for the Basic Authentication](#basic-service-authentication-and-registration).
For the **Advanced Authentication** method your service does not send a client secret but it uses a _self signed client assertion_ to authenticate to the OpenAIRE AAI server in order to obtain the access token for the OpenAIRE APIs. The client assertion is a JWT that must be signed with RSASSA using SHA-256 hash algorithm. The OpenAIRE AAI server validates the client assertion using the public key that you have provided upon the service registration. [Continue reading for the Advanced Authentication](#advanced-service-authentication-and-registration).
:::info
The Advanced Authentication method allows the OpenAIRE AAI server to verify that the client authentication request at the token endpoint was signed by your service and not altered in any way. This is more computationally intensive than the Basic Authentication, but it ensures non-repudiation. On the other hand, the Basic Authentication is more lightweight and easier to deploy, but it does not provide signature verification, and there is always a possibility of the Client ID/secret credentials being stolen. Note that the Advanced Authentication method gives a higher level of security to the process only as long as it is used correctly, i.e. when the signed JWT has a short duration. When the duration of the JWT is long, the process is no different from the basic one.
:::
### Basic service authentication and registration
To have access to the following functionalities you need to log in to OpenAIRE. In case you are not already a member, you will need to register first and provide your [Personal information](https://develop.openaire.eu/personal-info).
:::info New!
The registration process has been updated! In order to visit the Personal Token and Registered Services functionalities you need to fill in the Personal Information form available [here](https://develop.openaire.eu/personal-info). This update will not affect the operation of your existing services. However, if you want to register a new service or access/modify an existing one, you will need to provide your personal information first.
:::
For the **Basic Authentication** method the OpenAIRE AAI server generates a pair of _Client ID_ and _Client Secret_ for your service upon its registration. The service uses the client id and client secret to obtain the access token for the OpenAIRE APIs. The OpenAIRE AAI server checks whether the client id and client secret sent are valid.
#### How to register your service
To register your service you need to:
1. Go to your [Registered Services](https://develop.openaire.eu/apis) page and click the **\+ New Service** button.
2. Provide the mandatory information for your service.
3. Select the **Basic** Security level.
4. Click the **Create** button.
Once your service is created, the _Client ID_ and _Client Secret_ will appear on your screen. Click "OK" and your new service will appear in the list on your [Registered Services](https://develop.openaire.eu/apis) page.
#### How to make a request
##### Step 1. Request for an access token
To make an access token request use the _Client ID_ and _Client Secret_ of your service.
```js
curl -u {CLIENT_ID}:{CLIENT_SECRET} \
-X POST 'https://aai.openaire.eu/oidc/token' \
-d 'grant_type=client_credentials'
```
where **{CLIENT_ID}** and **{CLIENT_SECRET}** are the _Client ID_ and _Client Secret_ assigned to your service upon registration.
The response is:
```json
{
"access_token": ...,
"token_type": "Bearer",
"expires_in": ...
}
```
Store the access token confidentially on the service side.
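For services written in Python, Step 1 can be sketched with the standard library as shown below. The helper name is hypothetical; the code only mirrors what `curl -u` does, i.e. it sends the credentials in an HTTP Basic `Authorization` header together with the `client_credentials` grant.

```python
import base64
import urllib.parse
import urllib.request

TOKEN_ENDPOINT = "https://aai.openaire.eu/oidc/token"


def build_client_credentials_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) the token request for Basic Authentication."""
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    basic = base64.b64encode(credentials).decode("ascii")
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode("ascii")
    return urllib.request.Request(
        TOKEN_ENDPOINT,
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


# Sending performs a network call, so it is left commented out:
# with urllib.request.urlopen(build_client_credentials_request(cid, secret)) as resp:
#     token_response = resp.read()
```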
##### Step 2. Make a request
To access the OpenAIRE APIs send the access token returned in **Step 1**.
```js
GET https://api.openaire.eu/{resourceServicePath}
Authorization: Bearer {ACCESS_TOKEN}
```
### Advanced service authentication and registration
To have access to the following functionalities you need to log in to OpenAIRE. In case you are not already a member, you will need to register first and provide your [Personal information](https://develop.openaire.eu/personal-info).
:::info New!
The registration process has been updated! In order to visit the Personal Token and Registered Services functionalities you need to fill in the Personal Information form available [here](https://develop.openaire.eu/personal-info). This update will not affect the operation of your existing services. However, if you want to register a new service or access/modify an existing one, you will need to provide your personal information first.
:::
For the **Advanced Authentication** method your service does not send a client secret but it uses a _self signed client assertion_ to obtain the access token for the OpenAIRE APIs. The client assertion is a JWT that must be signed with RSASSA using SHA-256 hash algorithm. The OpenAIRE AAI server validates the client assertion using the public key that you have provided upon the service registration.
#### Prepare to register your service
Before you register your service you need to prepare a pair of a private key and a public key on your side.
:::info
We accept keys signed with RSASSA using SHA-256 hash algorithm.
:::
To create the key pair you have the following options:
* Use the built-in tool of the OpenAIRE authorization server. You can access the service here: [https://aai.openaire.eu/oidc/generate-oidc-keystore](https://aai.openaire.eu/oidc/generate-oidc-keystore).
The response is your **Public and Private Keypair** and has the following format:
```json
{
"p" : ...,
"kty" : "RSA",
"q" : ...,
"d" : ...,
"e" : "AQAB",
"kid" : ...,
"qi" : ...,
"dp" : ...,
"alg" : "RS256",
"dq" : ...,
"n" : ....
}
```
Use the public key parameters (kty, e, kid, alg, n) to create your **Public Key** in the following format:
```json
{
"kty": "RSA",
"e": "AQAB",
"kid": ...,
"alg": "RS256",
"n": ...
}
```
:::info
Store both the **Public and Private keypair** and the **Public key**. You will need them to register your service.
:::
:::caution
Store the **Public and Private keypair** confidentially on the service side.
:::
* Use openssl and then convert the keys to JWK format using PEM to JWK scripts, such as [https://github.com/danedmunds/pem-to-jwk](https://github.com/danedmunds/pem-to-jwk). Alternatively, the client application can read the key pair in PEM format and then convert it using JWK libraries. Use the public key parameters (kty, e, kid, alg, n) for the service registration.
:::info
You can also provide a public key in JWK format that can be accessed using a link.
:::
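Deriving the **Public Key** from the **Public and Private Keypair** amounts to keeping only the public parameters (kty, e, kid, alg, n). A minimal sketch, assuming the keypair JSON has already been parsed into a dictionary; the function name is hypothetical:

```python
PUBLIC_JWK_FIELDS = ("kty", "e", "kid", "alg", "n")


def public_jwk_from_keypair(keypair: dict) -> dict:
    """Return a copy of the JWK that contains only the public parameters."""
    missing = [field for field in PUBLIC_JWK_FIELDS if field not in keypair]
    if missing:
        raise ValueError(f"keypair is missing public parameters: {missing}")
    # Private parameters (p, q, d, qi, dp, dq) are deliberately left out.
    return {field: keypair[field] for field in PUBLIC_JWK_FIELDS}
```

Selecting the fields explicitly, instead of deleting the private ones, ensures that no private parameter can slip into the registered key by accident.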
#### How to register your service
To register your service you need to:
1. Go to your [Registered Services](https://develop.openaire.eu/apis) page and click the **\+ New Service** button.
2. Provide the mandatory information for your service.
3. Select the **Advanced** Security level.
4. Use the public key parameters (kty, e, kid, alg, n) you previously produced to declare your **"Public Key"** **"By value"** in the following format:
```json
{
"kty": "RSA",
"e": "AQAB",
"kid": ...,
"alg": "RS256",
"n": ...
}
```
**\- OR -**
If your service has a public key in JWK format that can be accessed using a link, you can set **“Public Key”** to **“By URL”**.
5. Click the **Create** button.
Once your service is created it will appear in the list of your [Registered Services](https://develop.openaire.eu/apis) page, with the **Service Id** that was automatically assigned to it by the AAI OpenAIRE service.
#### How to make a request
##### Step 1. Create and sign a JWT
Your service must create and sign a JWT and include it in the request to the token endpoint as described in the [OpenID Connect Core 1.0, 9. Client Authentication](https://openid.net/specs/openid-connect-core-1_0.html#ClientAuthentication).
To create a JWT you can use [https://mkjose.org/](https://mkjose.org/). To do so you need to create a **payload** that should contain the following claims:
```json
{
"iss": "{SERVICE_ID}",
"sub": "{SERVICE_ID}",
"aud": "https://aai.openaire.eu/oidc/token",
"jti": "{RANDOM_STRING}",
"exp": {EXPIRATION_TIME_OF_SIGNED_JWT}
}
```
* **iss**, _(required)_ the “issuer” claim identifies the principal that issued the JWT. The value is the **Service Id** that was created when you registered your service.
* **sub**, _(required)_ the “subject” claim identifies the principal that is the subject of the JWT. The value is the **Service Id** that was created when you registered your service.
* **aud**, _(required)_ the “audience” claim identifies the recipients that the JWT is intended for. The value is **https://aai.openaire.eu/oidc/token**.
* **jti**, _(required)_ the “JWT ID” claim provides a unique identifier for the JWT. The value is a random string.
* **exp**, _(required)_ the “expiration time” claim identifies the expiration time on or after which the JWT **MUST NOT** be accepted for processing. The value is a timestamp in **epoch format**.
Fill in the payload in the form available at [https://mkjose.org/](https://mkjose.org/), select the Signing Algorithm to be **RS256 using SHA-256** and paste the **Public and Private Keypair** previously created.
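If you prefer to generate the payload programmatically rather than by hand, the claims listed above can be assembled as in the following sketch. The function name and the default lifetime of 300 seconds are assumptions made for illustration (a short lifetime is in line with the recommendation to keep the signed JWT short-lived). Signing with RS256 is not shown, because the Python standard library has no RSA support; a JWT library of your choice would take these claims and your private key.

```python
import time
import uuid
from typing import Optional

AUDIENCE = "https://aai.openaire.eu/oidc/token"


def build_client_assertion_claims(
    service_id: str,
    lifetime_seconds: int = 300,  # assumption: short-lived assertion
    now: Optional[float] = None,
) -> dict:
    """Assemble the claims of the client assertion (unsigned)."""
    issued_at = int(time.time()) if now is None else int(now)
    return {
        "iss": service_id,               # the Service Id issued at registration
        "sub": service_id,               # same value as "iss"
        "aud": AUDIENCE,                 # the token endpoint
        "jti": uuid.uuid4().hex,         # random, unique per assertion
        "exp": issued_at + lifetime_seconds,  # epoch seconds
    }
```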
To check your JWT you can go to [https://jwt.io/](https://jwt.io/). The **header** should contain the following claims:
```json
{
"alg": "RS256",
"kid": ...
}
```
where **kid** is the one of your **Public and Private Keypair** you used to sign the JWT in **Step 1**.
:::caution
Store the signed JWT confidentially on the service side. You will need it in Step 2.
:::
##### Step 2. Request for an access token
To make an access token request use the _signed JWT_ that you created in **Step 1**. The OpenAIRE AAI server will check if the signed JWT is valid using the public key that you declared in the **"How to register your service"** process.
```js
curl -k -X POST "https://aai.openaire.eu/oidc/token" \
-d "grant_type=client_credentials" \
-d "client_assertion_type=urn:ietf:params:oauth:client-assertion-type:jwt-bearer" \
-d "client_assertion={signedJWT}"
```
where **{signedJWT}** is the signed JWT created in **Step 1**.
The response is:
```json
{
"access_token": "{ACCESS_TOKEN}",
"token_type":"Bearer",
"expires_in": ...,
"scope":"openid"
}
```
Store the access token confidentially on the service side.
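The token request of Step 2 can also be prepared from Python. The sketch below only builds the form-encoded POST request with the three parameters used in the `curl` command; the helper name is hypothetical and the signed JWT is assumed to have been produced in Step 1.

```python
import urllib.parse
import urllib.request

TOKEN_ENDPOINT = "https://aai.openaire.eu/oidc/token"
ASSERTION_TYPE = "urn:ietf:params:oauth:client-assertion-type:jwt-bearer"


def build_jwt_bearer_token_request(signed_jwt: str) -> urllib.request.Request:
    """Build (but do not send) the token request for Advanced Authentication."""
    body = urllib.parse.urlencode(
        {
            "grant_type": "client_credentials",
            "client_assertion_type": ASSERTION_TYPE,
            "client_assertion": signed_jwt,
        }
    ).encode("ascii")
    return urllib.request.Request(
        TOKEN_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Note that `urllib.request.urlopen` verifies the server certificate by default, so no equivalent of the `-k` flag is needed or used here.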
##### Step 3. Make a request
To access the OpenAIRE APIs send the access token returned in **Step 2**.
```js
GET https://api.openaire.eu/{resourceServicePath}
Authorization: Bearer {ACCESS_TOKEN}
```

50
docs/apis/broker-api.md Normal file

@ -0,0 +1,50 @@
# Broker API
## Introduction
The Broker Service is available to use via the OpenAIRE Content Provider Dashboard. Thanks to the Broker, repositories, publishers or aggregators can exchange metadata and enrich their local metadata collection by subscribing to notifications of different types. The Broker is able to notify providers when the OpenAIRE Graph contains information that is not available in the original collection of the data source. In particular, the data source manager can subscribe via the [Content Provider Dashboard](https://provide.openaire.eu) and be notified about:
* Additional PIDs of its publications (e.g. DOIs)
* Links to projects
* ORCID iDs that can be associated with authors of the data source's publications
* Links to Open Access versions
* Additional classification subjects (e.g. subjects from standard schemes like ACM, JEL and DDC)
* Abstracts identified in duplicate publications
* Missing publication dates
All repository managers using the Content Provider Dashboard can preview a set of enrichments for their repository that OpenAIRE can derive from the Graph. Enrichments are organized into categories, named topics, which represent the different types of enrichments OpenAIRE can build. For each topic, the preview consists of 100 “enrichment events”, a subset of all the possible enrichments pertinent to a given repository in the OpenAIRE Graph. The user can explore these events by applying filters on different criteria, and the UI highlights the total number of events that could potentially be built.
Repository managers can create subscriptions for specific topics, including the filtering criteria they used to analyze the enrichments preview, or can subscribe to all the available topics at once with no restrictions. Once the repository manager creates a subscription, the algorithm analyzing the OpenAIRE Graph produces the full set of enrichments for the manager's repository, possibly far beyond the 100 enrichments available in the preview.
The enrichments are made available as notifications in a dedicated section of the Content Provider Dashboard UI, where they can be checked further, as well as through the Broker Service API for programmatic access. Notifications are sent to subscribers every time the OpenAIRE Graph is updated and analyzed to derive the enrichments.
## Usage Example
The following commands indicate how the broker API documented at [api.openaire.eu/broker](https://api.openaire.eu/broker/swagger-ui/index.html) can be used to access the set of enrichments:
1. Get the list of subscriptions for a given subscriber, e.g.
```js
curl -X GET --header 'Accept: application/json' 'https://api.openaire.eu/broker/subscriptions?email=[subscriber_email]'
```
2. Extract the subscription ID and use it to access the 1st page of enrichment notification records
```js
curl -X GET --header 'Accept: application/json' 'https://api.openaire.eu/broker/scroll/notifications/bySubscriptionId/[sub-1234]'
```
3. Extract the scroll ID from the response to request subsequent pages
```js
curl -X GET --header 'Accept: application/json' 'https://api.openaire.eu/broker/scroll/notifications/[scroll_id]'
```
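The three calls above differ only in their URLs, which can be generated as in the following sketch. The function names are hypothetical, and the structure of the JSON responses (for instance, where the scroll ID is found) is not assumed here; please consult the Swagger documentation linked above for it.

```python
from urllib.parse import quote, urlencode

BROKER_BASE = "https://api.openaire.eu/broker"


def subscriptions_url(subscriber_email: str) -> str:
    """Step 1: list the subscriptions of a subscriber."""
    return f"{BROKER_BASE}/subscriptions?{urlencode({'email': subscriber_email})}"


def first_page_url(subscription_id: str) -> str:
    """Step 2: first page of notifications for a subscription."""
    return f"{BROKER_BASE}/scroll/notifications/bySubscriptionId/{quote(subscription_id, safe='')}"


def next_page_url(scroll_id: str) -> str:
    """Step 3: subsequent pages, addressed by the scroll ID."""
    return f"{BROKER_BASE}/scroll/notifications/{quote(scroll_id, safe='')}"
```

Percent-encoding the identifiers keeps the URLs valid even if an email address or ID contains reserved characters.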
To simplify accessing the enrichment notification records, please check the OpenAIRE broker cmdline client available on [GitHub](https://github.com/openaire/broker-cmdline-client).
## Terms of Use and SLA
APIs are free-to-use (no sign-up needed) by any third-party service.
**Metadata license is CC-BY**: the metadata records returned by the service can be freely re-used by commercial and non-commercial partners under the CC-BY license, as long as OpenAIRE is acknowledged as the data source.
**Quality of Service**: all API services are running in production 24/7 within the OpenAIRE infrastructure premises deployed at the [data center](http://icm.edu.pl/en/centre-of-technology/) facilities of the [Interdisciplinary Centre for Mathematical and Computational Modelling](http://icm.edu.pl/en/) (ICM).
**APIs rate limits**: please check [here](./authentication).


@ -0,0 +1,61 @@
# DSpace & EPrints API
<!-- Bulk access to projects -->
The APIs offer custom access to metadata about projects funded by a selection of international funders for the **DSpace** and **EPrints** platforms. The currently supported funders and relative codes are:
* **FP7:** The 7th Framework Programme funded by the European Commission
* **H2020:** Horizon2020 Programme funded by the European Commission
* **HE:** Horizon Europe Programme funded by the European Commission
* **AKA:** Academy of Finland
* **ARC:** Australian Research Council
* **FWF:** Austrian Science Foundation
* **CHISTERA:** CHIST-ERA
* **CIHR:** Canadian Institutes of Health Research
* **HRZZ:** Croatian Science Foundation
* **EEA:** European Environment Agency
* **ANR:** French National Research Agency
* **FCT:** The funding programme of Fundação para a Ciência e a Tecnologia, the national funding agency of Portugal
* **MESTD:** The Ministry of Education, Science and Technological Development of Serbia
* **MZOS:** Ministry of Science, Education and Sports of the Republic of Croatia
* **NHMRC:** Australian National Health and Medical Research Council
* **NIH:** US National Institutes of Health
* **NSF:** US National Science Foundation
* **NSERC:** Natural Sciences and Engineering Research Council of Canada
* **NWO:** The Netherlands Organisation for Scientific Research
* **SFI:** Science Foundation Ireland
* **SSHRC:** Social Sciences and Humanities Research Council
* **SNSF:** Swiss National Science Foundation
* **TARA:** Tara Expeditions Foundation
* **TUBITAK:** The National funder of Turkey
* **UKRI:** United Kingdom Research and Innovation
* **WT:** Wellcome Trust
## DSpace/ePrints
DSpace endpoint: http://api.openaire.eu/projects/dspace/$fundingStream/ALL/ALL
ePrints endpoint: http://api.openaire.eu/projects/eprints/$fundingStream/ALL/ALL
The URLs embed the parameters needed to collect projects funded by a specific funding stream, where the pattern is FundingStream/FundingSubStream/FundingSubSubStream.
Additional parameters can be concatenated to the URL to refine the results by date (date must be in the form `YYYY-MM-DD`):
* startFrom
* startUntil
* endFrom
* endUntil
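The URL pattern and the optional date filters described above can be combined as in this sketch. The function name is hypothetical; dates are passed as `datetime.date` objects so that they are always rendered in the `YYYY-MM-DD` form.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "http://api.openaire.eu/projects"
PLATFORMS = ("dspace", "eprints")
DATE_FILTERS = ("startFrom", "startUntil", "endFrom", "endUntil")


def build_projects_url(
    platform: str,
    funding_stream: str,
    sub_stream: str = "ALL",
    sub_sub_stream: str = "ALL",
    **date_filters: date,
) -> str:
    """Build a DSpace/EPrints projects URL with optional date filters."""
    if platform not in PLATFORMS:
        raise ValueError(f"platform must be one of {PLATFORMS}")
    unknown = sorted(set(date_filters) - set(DATE_FILTERS))
    if unknown:
        raise ValueError(f"unsupported date filters: {unknown}")
    url = f"{BASE_URL}/{platform}/{funding_stream}/{sub_stream}/{sub_sub_stream}"
    # Emit the filters in a fixed order so that the URL is reproducible.
    query = urlencode(
        [(name, date_filters[name].isoformat()) for name in DATE_FILTERS if name in date_filters]
    )
    return f"{url}?{query}" if query else url
```

For instance, `build_projects_url("dspace", "FP7", startFrom=date(2011, 1, 1))` produces the DSpace URL for FP7 projects that started on or after 1 January 2011.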
## Examples
Get Wellcome Trust projects for EPrints: [http://api.openaire.eu/projects/eprints/WT/ALL/ALL](http://api.openaire.eu/projects/eprints/WT/ALL/ALL)
Get EC-FP7 projects of the specific programme “SP2-IDEAS” for EPrints: [http://api.openaire.eu/projects/eprints/FP7/SP2/ALL](http://api.openaire.eu/projects/eprints/FP7/SP2/ALL)
Get EC-FP7 projects for DSpace that started after the given date: [http://api.openaire.eu/projects/dspace/FP7/ALL/ALL?startFrom=2011-01-01](http://api.openaire.eu/projects/dspace/FP7/ALL/ALL?startFrom=2011-01-01).
## Terms of Use and SLA
APIs are free-to-use (no sign-up needed) by any third-party service.
**Metadata license is CC-BY**: the metadata records returned by the service can be freely re-used by commercial and non-commercial partners under the CC-BY license, as long as OpenAIRE is acknowledged as the data source.
**Quality of Service**: all API services are running in production 24/7 within the OpenAIRE infrastructure premises deployed at the [data center](http://icm.edu.pl/en/centre-of-technology/) facilities of the [Interdisciplinary Centre for Mathematical and Computational Modelling](http://icm.edu.pl/en/) (ICM).
**APIs rate limits**: please check [here](./authentication).

9
docs/apis/home.md Normal file

@ -0,0 +1,9 @@
# Public APIs
The OpenAIRE Graph data are accessible through various public APIs. More specifically, the following APIs are currently provided:
* [Search API](./search-api/search-api.md) (an API to search for research products and projects)
* [ScholeXplorer API](https://api.scholexplorer.openaire.eu/swagger-ui/index.html?urls.primaryName=Scholexplorer%20API%20V2.0) (an API offering dataset-publication & dataset-dataset links)
* [DSpace & EPrints API](./dspace-eprints-api.md) (an API to offer custom access to metadata for projects funded by a selection of international funders for DSpace and EPrints platforms)
* [Broker API](./broker-api.md) (an API to enrich metadata for repositories, publishers, and aggregators)
Note that a LOD API was provided between 2015 and 2023, but the respective service has been discontinued. Old LOD datasets can be found on Zenodo [here](https://zenodo.org/records/4587369).


@ -0,0 +1,31 @@
# Searching for projects
## Endpoints
For research projects: http://api.openaire.eu/search/projects
## Parameters
| Parameter | Option | Description |
| --- | --- | --- |
| page | integer | Page number of the search results. |
| size | integer | Number of results per page. |
| format | json \| xml \| csv \| tsv | The format of the response. The default is xml. |
| model | openaire \| sygma | The data model of the response. Default is openaire. Model sygma is a simplified version of the openaire model. For sygma, only the xml format is available. The relative XML schema is available [here](https://www.openaire.eu/schema/sygma/oaf_sygma_v2.1.xsd). |
| sortBy | `sortBy=field,[ascending\|descending]`; **'field'** is one of: `projectstartdate`, `projectstartyear`, `projectenddate`, `projectendyear`, `projectduration` | The sorting order of the specified field. |
| hasECFunding | true \| false | If hasECFunding is true gets the entities funded by the EC. If hasECFunding is false gets the entities related to projects not funded by the EC. |
| hasWTFunding | true \| false | If hasWTFunding is true gets the entities funded by Wellcome Trust. The results are the same as those obtained with `funder=wt`. If hasWTFunding is false gets the entities related to projects not funded by Wellcome Trust. |
| funder | WT \| EC \| ARC \| ANDS \| NSF \| FCT \| NHMRC | Search for entities by funder. |
| fundingStream | ... | Search for entities by funding stream. |
| FP7scientificArea | ... | Search for FP7 entities by scientific area. |
| keywords | White-space separated list of keywords. | N/A |
| grantID | Comma separated list of grant identifiers. | Gets the project with the given grant identifier, if any. |
| openairePublicationID | Comma separated list of OpenAIRE identifiers. | Gets the publication with the given openaire identifier, if any. |
| name | White-space separated list of keywords. | Gets the projects whose names contain the given list of keywords. Using double quotes `"` you get an exact match, if any. |
| acronym | N/A | Gets the project with the given acronym, if any. |
| callID | N/A | Search for projects by call identifier. |
| startYear | Year formatted as `YYYY` | Gets the projects that started in the given year. |
| endYear | Year formatted as `YYYY`. | Gets the projects that ended in the given year. |
| participantCountries | Comma separated list of 2 letter country codes. | Search for projects by participant countries. |
| participantAcronyms | White-space separated list of acronyms of institutions. | Search for projects by participant institutions. |
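As a usage illustration, the parameters in the table above can be combined into a query URL as sketched below. The helper names are hypothetical; `sort_by` only validates the field name against the list given for the `sortBy` parameter.

```python
from urllib.parse import urlencode

PROJECT_SEARCH_URL = "http://api.openaire.eu/search/projects"
SORT_FIELDS = (
    "projectstartdate",
    "projectstartyear",
    "projectenddate",
    "projectendyear",
    "projectduration",
)


def sort_by(field: str, descending: bool = False) -> str:
    """Format a value for the sortBy parameter, e.g. 'projectstartdate,descending'."""
    if field not in SORT_FIELDS:
        raise ValueError(f"unsupported sort field: {field}")
    return f"{field},{'descending' if descending else 'ascending'}"


def build_project_search_url(params: dict) -> str:
    """Encode the given parameters; commas are kept readable in the URL."""
    return f"{PROJECT_SEARCH_URL}?{urlencode(params, safe=',')}"
```

For example, EC-funded projects that started in 2020, newest first and in JSON, could be requested with `build_project_search_url({"funder": "EC", "startYear": 2020, "format": "json", "sortBy": sort_by("projectstartdate", descending=True)})`.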


@ -0,0 +1,98 @@
# Searching for research products
## Endpoints
For research products: https://api.openaire.eu/search/researchProducts
By specific type:
* publications: https://api.openaire.eu/search/publications
* research data: https://api.openaire.eu/search/datasets
* research software: https://api.openaire.eu/search/software
* other research products: https://api.openaire.eu/search/other
## General parameters
Endpoint: https://api.openaire.eu/search/researchProducts
| Parameter | Option | Description |
| --- | --- | --- |
| page | integer | Page number of the search results. |
| size | integer | Number of results per page. |
| format | json \| xml \| csv \| tsv | The format of the response. The default is xml. |
| model | openaire \| sygma | The data model of the response. Default is openaire. Model sygma is a simplified version of the openaire model. For sygma, only the xml format is available. The relative XML schema is available [here](https://www.openaire.eu/schema/sygma/oaf_sygma_v2.1.xsd). |
| sortBy | `sortBy=field,[ascending\|descending]` <br/>**'field'** can be one of: <ul> <li>`dateofcollection`</li><li>`resultstoragedate`</li> <li>`resultembargoenddate`</li><li>`resultembargoendyear`</li><li>`resultdateofacceptance`</li> <li>`resultacceptanceyear`</li><li>`influence`</li><li>`popularity`</li> <li>`citationCount`</li><li>`impulse`</li> </ul>Multiple sorting is supported by repeating the `sortBy` parameter. | The sorting order of the specified field. |
| hasECFunding | true \| false | If hasECFunding is true gets the entities funded by the EC. If hasECFunding is false gets the entities related to projects not funded by the EC. |
| hasWTFunding | true \| false | If hasWTFunding is true gets the entities funded by Wellcome Trust. The results are the same as those obtained with `funder=wt`. If hasWTFunding is false gets the entities related to projects not funded by Wellcome Trust. |
| funder | WT \| EC \| ARC \| ANDS \| NSF \| FCT \| NHMRC | Search for entities by funder. |
| fundingStream | ... | Search for entities by funding stream. |
| FP7scientificArea | ... | Search for FP7 entities by scientific area. |
| keywords | White-space separated list of keywords. | N/A |
| doi | Comma separated list of DOIs. <br/>Alternatively, it is possible to repeat the parameter for each requested doi. | Gets the research products with the given DOIs, if any. |
| orcid | Comma separated list of ORCID iDs of authors. <br/>Alternatively, it is possible to repeat the parameter for each author ORCID iD. | Gets the research products linked to the given ORCID iD of an author, if any. |
| fromDateAccepted | Date formatted as `YYYY-MM-DD` | Gets the research products whose date of acceptance is greater than or equal to the given date. |
| toDateAccepted | Date formatted as `YYYY-MM-DD` | Gets the research products whose date of acceptance is less than or equal to the given date. |
| title | White-space separated list of keywords. | Gets the research products whose titles contain the given list of keywords. |
| author | White-space separated list of names and/or surnames. | Search for research products by authors. |
| OA | true \| false | If OA is true gets Open Access research products. If OA is false gets the non-Open Access research products. |
| projectID | The given grant identifier of the project | Search for research products of the project with the specified projectID |
| country | 2 letter country code | Search for research products associated to the country code |
| influence <br/> | Accepted values: <br/>`C1` for top 0.01% in terms of influence <br/>`C2` for top 0.1% in terms of influence <br/>`C3` for top 1% in terms of influence <br/>`C4` for top 10% in terms of influence <br/>`C5` for average/low in terms of influence <br/> <br/>Comma separated list of values or repeat of the parameter for each value will form a query with OR semantics, eg. `?influence=C1&influence=C2` | Search for research products based on their influence. |
| popularity <br/> | Accepted values: <br/>`C1` for top 0.01% in terms of popularity <br/>`C2` for top 0.1% in terms of popularity <br/>`C3` for top 1% in terms of popularity <br/>`C4` for top 10% in terms of popularity <br/>`C5` for average/low in terms of popularity <br/> <br/>Comma separated list of values or repeat of the parameter for each value will form a query with OR semantics, eg. `?popularity=C1&popularity=C2` | Search for research products based on their popularity. |
| impulse <br/> | Accepted values: <br/>`C1` for top 0.01% in terms of impulse <br/>`C2` for top 0.1% in terms of impulse <br/>`C3` for top 1% in terms of impulse <br/>`C4` for top 10% in terms of impulse <br/>`C5` for average/low in terms of impulse <br/> <br/>Comma separated list of values or repeat of the parameter for each value will form a query with OR semantics, eg. `?impulse=C1&impulse=C2` | Search for research products based on their impulse. |
| citationCount <br/> | Accepted values: <br/>`C1` for top 0.01% in terms of citation count <br/>`C2` for top 0.1% in terms of citation count <br/>`C3` for top 1% in terms of citation count <br/>`C4` for top 10% in terms of citation count <br/>`C5` for average/low in terms of citation count <br/> <br/>Comma separated list of values or repeat of the parameter for each value will form a query with OR semantics, eg. `?citationCount=C1&citationCount=C2` | Search for research products based on their number of citations. |
| openaireProviderID | Comma separated list of identifiers. | Search for research products by openaire data provider identifier. <br/>Alternatively, it is possible to repeat the parameter for each provider id. In both cases, provider identifiers will form a query with OR semantics. |
| openaireProjectID | Comma separated list of identifiers. <br/>Alternatively, it is possible to repeat the parameter for each project id. In both cases, project identifiers will form a query with OR semantics. | Search for research products by openaire project identifier. |
| hasProject | true \| false | If hasProject is true, gets the research products that have a link to a project. If hasProject is false, gets the research products with no links to projects. |
| FP7ProjectID | ... | Search for research products associated to a FP7 project with the given grant number. It is equivalent to a query by `funder=FP7&projectID={grantID}` |
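
The OR semantics described for parameters such as `influence` can be sketched in Python as follows. This is a minimal illustration, not an official client: the helper name `build_query` is hypothetical, and the publications endpoint is used only as an example base URL. Parameter names and values come from the table above.

```python
from urllib.parse import urlencode

# Example endpoint from this documentation; any search endpoint could be used instead.
BASE_URL = "https://api.openaire.eu/search/publications"


def build_query(params):
    """Return a search URL; list values repeat the parameter (OR semantics)."""
    # doseq=True expands {"influence": ["C1", "C2"]} into influence=C1&influence=C2
    return f"{BASE_URL}?{urlencode(params, doseq=True)}"


url = build_query({"influence": ["C1", "C2"], "country": "GR", "OA": "true"})
print(url)
```

Repeating the parameter avoids any question about how a comma inside a value is encoded, which is why this sketch prefers it over the comma separated form.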
## Parameters for publications
Endpoint: https://api.openaire.eu/search/publications
You can use all the [general research products parameters](#general-parameters) as well as those in the following table.
| Parameter | Option | Description |
| --- | --- | --- |
| instancetype | Comma separated list of publication types. Check [here](http://api.openaire.eu/vocabularies/dnet:publication_resource) to see the possible values | Gets the publication of the given type, if any. |
| originalId | Comma separated list of original identifiers as we get them from the data source. <br/>Alternatively, it is possible to repeat the parameter for each requested identifier. | Gets the publication with the given original identifier, if any. |
| sdg | The number of the Sustainable Development Goals `[1-17]`. <br/>Check [here](https://sdgs.un.org/goals) to see the Sustainable Development Goals. | Gets the publications that are classified with the respective Sustainable Development Goal number. |
| fos | The Field of Science classification value. <br/>Check [here](/resources/athenarc_fos_hierarchy.json) to see the Field of Science classification values | Gets the publications that are classified with the respective Field of Science classification value. |
| openairePublicationID | Comma separated list of OpenAIRE identifiers. <br/>Alternatively, it is possible to repeat the parameter for each requested identifier. | Gets the publication with the given openaire identifier, if any. |
| peerReviewed | Accepted values: <br/>true \| false | Specify if the publications are peerReviewed or not. |
| diamondJournal | Accepted values: <br/>true \| false | Specify if the publications are published in a diamond journal or not. |
| publiclyFunded | Accepted values: <br/>true \| false | Specify if the publications are publicly funded or not. |
| green | Accepted values: <br/>true \| false | Specify if the publications are green open access or not. |
| openAccessColor | Accepted values: <br/>`gold`\| `bronze`\| `hybrid` <br/>Comma separated list of values or repeat of the parameter for each value will form a query with OR semantics, eg. `?openAccessColor=gold&openAccessColor=hybrid` | Specify the open access color of a publication. |
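
As a hedged sketch of the comma separated form, the snippet below validates `openAccessColor` values against the accepted list and builds a publications query. The function name `publications_query` is hypothetical, and keeping the comma unescaped in the URL is an assumption about what the API accepts.

```python
from urllib.parse import urlencode

PUBLICATIONS_URL = "https://api.openaire.eu/search/publications"
ALLOWED_COLORS = {"gold", "bronze", "hybrid"}  # accepted values listed in the table


def publications_query(colors, peer_reviewed=None):
    """Build a publications URL filtered by open access colour and peer review."""
    unknown = set(colors) - ALLOWED_COLORS
    if unknown:
        raise ValueError(f"unsupported openAccessColor value(s): {sorted(unknown)}")
    params = {"openAccessColor": ",".join(colors)}
    if peer_reviewed is not None:
        params["peerReviewed"] = "true" if peer_reviewed else "false"
    # safe="," keeps the list separator as a literal comma (assumption, see above)
    return f"{PUBLICATIONS_URL}?{urlencode(params, safe=',')}"


print(publications_query(["gold", "hybrid"], peer_reviewed=True))
```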
## Parameters for research data
Endpoint: https://api.openaire.eu/search/datasets
You can use all the [general research products parameters](#general-parameters) as well as those in the following table.
| Parameter | Option | Description |
| --- | --- | --- |
| openaireDatasetID | Comma separated list of OpenAIRE identifiers. <br/>Alternatively, it is possible to repeat the parameter for each requested identifier. | Gets the research data with the given openaire identifier, if any. |
## Parameters for research software
Endpoint: https://api.openaire.eu/search/software
You can use all the [general research products parameters](#general-parameters) as well as those in the following table.
| Parameter | Option | Description |
| --- | --- | --- |
| openaireSoftwareID | Comma separated list of OpenAIRE identifiers. <br/>Alternatively, it is possible to repeat the parameter for each requested identifier. | Gets the research software with the given openaire identifier, if any. |
## Parameters for other research products
Endpoint: https://api.openaire.eu/search/other
You can use all the [general research products parameters](#general-parameters) as well as those in the following table.
| Parameter | Option | Description |
| --- | --- | --- |
| openaireOtherID | Comma separated list of OpenAIRE identifiers. <br/>Alternatively, it is possible to repeat the parameter for each requested identifier. | Gets the other research products with the given openaire identifier, if any. |

@ -0,0 +1,172 @@
# Response metadata format
In this page, we elaborate on the metadata response format, as well as response headers and errors.
## Main response
The OpenAIRE Search API supports the following types of response formats:
* XML
* JSON
* CSV
* TSV
In the next paragraphs, we elaborate on the respective metadata formats.
### XML/JSON
The default format of delivered records is oaf (OpenAIRE Format - current version 1.0):
* XML schema: https://www.openaire.eu/schema/1.0/oaf-1.0.xsd
* Documentation: https://www.openaire.eu/schema/1.0/doc/oaf-1.0.html
For the list of changes [click here](https://www.openaire.eu/openaire-xml-schema-change-announcement).
Note that latest versions of the XML schema and documentation are also available at the following permanent links:
* XML schema: https://www.openaire.eu/schema/latest/oaf.xsd
* Documentation: https://www.openaire.eu/schema/latest/doc/oaf.html
Older versions:
* oaf v0.3 [XML schema](https://www.openaire.eu/schema/0.3/oaf-0.3.xsd) and [documentation](https://www.openaire.eu/schema/0.3/doc/oaf-0.3.html)
* oaf v0.2 [XML schema](https://www.openaire.eu/schema/0.2/oaf-0.2.xsd) and [documentation](https://www.openaire.eu/schema/0.2/doc/oaf-0.2.html)
* oaf v0.1 [XML schema](https://www.openaire.eu/schema/0.1/oaf-0.1.xsd) and [documentation](https://www.openaire.eu/schema/0.1/doc/oaf-0.1.html)
### CSV/TSV
In comma-separated (CSV) or tab-separated (TSV) responses, the API returns the following fields:
* Title
* Authors
* Publication year
* DOI
* Download from
* Publication type
* Journal
* Funder
* Project name (GA Number)
* Access
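
A CSV or TSV response with these fields could be read along the following lines. This is a sketch under assumptions: the exact header strings emitted by the API are not confirmed here, the helper `parse_rows` is hypothetical, and the sample payload contains placeholder values only.

```python
import csv
import io

# Field names as listed above; the exact header strings are an assumption.
FIELDS = ["Title", "Authors", "Publication year", "DOI", "Download from",
          "Publication type", "Journal", "Funder", "Project name (GA Number)", "Access"]


def parse_rows(text, delimiter=","):
    """Parse a CSV (default) or TSV (delimiter="\t") payload into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Placeholder payload standing in for a real API response.
sample = ",".join(FIELDS) + "\n" + ",".join(f"value{i}" for i in range(len(FIELDS))) + "\n"
rows = parse_rows(sample)
print(rows[0]["Title"], rows[0]["Access"])
```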
## Headers
| Name | Description |
| --- | --- |
| x-ratelimit-limit | The maximum number of requests allowed for the client in one time window. |
| x-ratelimit-used | The number of requests already made by the client in the current time window. |
The OpenAIRE APIs use a sliding time window of one hour.
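
The two headers can be combined to estimate how many requests remain in the current window. The helper below is an illustrative sketch (the name `remaining_requests` is hypothetical) and the header values in the usage line are made up.

```python
def remaining_requests(headers):
    """Return the requests left in the current window, or None if unknown."""
    # HTTP header names are case-insensitive, so normalise the keys first.
    lowered = {key.lower(): value for key, value in headers.items()}
    try:
        limit = int(lowered["x-ratelimit-limit"])
        used = int(lowered["x-ratelimit-used"])
    except (KeyError, ValueError):
        return None
    return max(limit - used, 0)


print(remaining_requests({"X-RateLimit-Limit": "60", "X-RateLimit-Used": "12"}))
```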
## Errors
### General
404 - Not found
```json
{
"error": "Not found",
"description": "Invalid request path."
}
```
429 - Rate limit abuse
```json
{
"error": "Too many requests",
"description": "Request rate exceeded. Slow down."
}
```
### Only for authenticated requests
400 - Missing grant type
```json
{
"error": "invalid_request",
"error_description": "Missing grant type"
}
```
400 - Wrong grant type
```json
{
"error": "unsupported_grant_type",
"error_description": "Unsupported grant type: ..."
}
```
400 - Missing Refresh Token
```json
{
"status" : "error",
"code" : "400",
"message" : "Bad Request",
"description" : "Missing refreshToken parameter"
}
```
401 - Missing username or/and password
```json
{
"error": "unauthorized",
"error_description": "Client id must not be empty!"
}
```
401 - Wrong username or/and password
```json
{
"error": "unauthorized",
"error_description": "Bad credentials"
}
```
401 - Invalid Refresh Token (for authenticated requests)
```json
{
"status" : "error",
"code" : "401",
"message" : "Unauthorised",
"description" : "Invalid refreshToken token"
}
```
401 - Invalid client assertion
```json
{
"error":"invalid_client",
"error_description":"Bad client credentials"
}
```
401 - Client assertion for missing service
```json
{
"error":"invalid_client",
"error_description":"Could not find client {SERVICE_ID}"
}
```
401 - Expired signed jwt
```json
{
"error":"unauthorized",
"error_description":"Assertion Token in expired: {EXPIRATION_TIME}"
}
```
403 - Invalid Access Token
```json
{
"error": "Token invalid",
"description": "Authorization header value invalid."
}
```
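
The error bodies above come in three shapes: `error`/`description`, `error`/`error_description`, and `message`/`description`. A client might normalise them as sketched below; the function name `normalise_error` is hypothetical and the mapping is only inferred from the examples on this page.

```python
import json


def normalise_error(body):
    """Return (error, description) for any of the error shapes shown above."""
    data = json.loads(body)
    # "error" is used by general and token errors; refresh-token errors use "message".
    error = data.get("error") or data.get("message")
    # Token errors use "error_description"; the other shapes use "description".
    description = data.get("error_description") or data.get("description")
    return error, description


print(normalise_error('{"error": "Too many requests", "description": "Request rate exceeded. Slow down."}'))
```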

@ -0,0 +1,7 @@
# Search API
The Search API allows developers to access metadata records of the OpenAIRE Graph by performing queries over research products (i.e., publications, data, software, other research products), and projects.
The API is intended for metadata discovery and exploration only, hence it does not provide access to the whole information space: the number of total results returned by one query is limited to 10,000.
For accessing the whole graph, developers are encouraged to use the [OpenAIRE full Graph dataset](../../downloads/full-graph).
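
To illustrate the 10,000-result limit, the sketch below computes how many pages of a query can actually be retrieved. It assumes the cap applies to the cumulative number of results across pages; the helper name `reachable_pages` and the page sizes are illustrative.

```python
import math

MAX_RESULTS = 10_000  # per-query cap stated above


def reachable_pages(total_hits, page_size):
    """Number of result pages that can be fetched before hitting the cap."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return math.ceil(min(total_hits, MAX_RESULTS) / page_size)


print(reachable_pages(250_000, 100))
print(reachable_pages(420, 100))
```

When a query matches more records than the cap, narrowing it with additional parameters or switching to the full Graph dataset is the way to reach the remaining records.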

@ -0,0 +1,93 @@
# APIs specification changelog
| Date | Description |
| --- | --- |
| 2024-01-09T11:14:10.524604Z | New parameters for publications. Now you can specify if they are peer reviewed, in a diamond journal, publicly funded, green and specify their OA colour. |
| 2023-11-30T11:39:10.159187Z | Added impact factor parameters. Now you can sort results and query by impact, influence, impulse and citation count. |
| 2023-11-29T12:26:17.660379Z | New registration and token process available at https://develop.openaire.eu. Updated documentation |
| 2023-05-25T09:16:19.903365Z | new instancetype parameter added |
| 2022-09-29T07:03:32.109909Z | updated URLs to the broker swagger UI |
| 2022-09-28T20:35:13.116653Z | updated URLs to the broker swagger UI |
| 2022-07-28T12:02:06.271154Z | Updated list of funders supported by the API for bulk access to projects: EC Horizon Europe also included |
| 2022-05-11T10:01:33.969973Z | New end point for researchProducts in selective access! FOS and SDG classifications available for publication requests |
| 2022-03-29T15:03:29.583536Z | Graph dataset: add new Scholix version 4 |
| 2021-11-12T12:04:52.900385Z | originalId parameter added |
| 2021-10-18T15:31:18.446582Z | OAI-PMH publisher completely dismissed as announced in January 2021 |
| 2021-10-12T07:46:48.032978Z | orcid parameter added in selective access |
| 2021-04-08T10:28:02.371361Z | Authenticated requests to our APIs are now enabled. |
| 2021-02-26T16:28:15.364435Z | NEWS: new dataset available with research products with project funding information |
| 2021-02-17T07:39:46.051129Z | WIP: broker API documentation |
| 2021-02-11T09:06:41.608115Z | Broker API documentation |
| 2021-02-10T10:17:39.504429Z | Authentication documentation added + broker card + broker dummy page |
| 2021-02-01T08:55:35.496938Z | OAI-PMH shutdown announced for the end of April 2021 |
| 2021-01-15T18:56:04.748404Z | Updated documentation on OpenAIRE Research Graph Datasets |
| 2021-01-15T16:57:08.569766Z | Announcing the shutdown of the OAI-PMH publisher |
| 2019-01-25T15:36:27.264313Z | Added new parameter country for research products |
| 2018-10-17T10:39:56.570815Z | Software and Other research products are available via HTTP API. Documentation has been updated. |
| 2018-04-09T09:20:24.763966Z | Added section on terms of services and SLA in the specific API pages |
| 2018-04-09T08:26:18.897089Z | Added section for terms of use and SLA in the home page |
| 2018-03-21T15:31:13.490821Z | Added page with list of changes generated from the svn log |
| 2018-03-21T14:58:14.569096Z | Added API rate limits |
| 2018-03-21T14:46:32.362617Z | ignore intellij settings |
| 2018-02-01T14:44:00.743257Z | Latest schema version is 1.0 |
| 2018-01-30T10:29:03.037760Z | removed authorOpenaireId parameter + change the message to say the schema is already changed |
| 2018-01-26T13:09:17.887663Z | Removed openaireAuthorID from API documentation |
| 2018-01-11T14:41:29.910148Z | Rephrase LOD to Linked Open Data |
| 2018-01-11T13:56:40.051318Z | add LOD box in overview.html |
| 2018-01-11T13:48:19.812005Z | Adding warning for schema change |
| 2017-10-23T14:21:15.794995Z | intellij file |
| 2017-10-09T10:43:56.532687Z | Added HTML files for api documentation based on uikit |
| 2017-10-06T12:08:16.603152Z | deleting old API documentation: new will be committed soon by Katerina |
| 2017-10-06T12:04:55.560134Z | copied from dnet40 |
| 2017-05-26T11:44:59.926816Z | removed warning for fundingStream queries |
| 2017-05-25T12:36:43.800409Z | warning and location of the api in the prod infra |
| 2017-03-29T13:58:34.013071Z | reformatted xml and new generated HTML |
| 2017-03-29T13:57:23.196971Z | changed pubdate |
| 2017-03-29T13:46:08.349593Z | added link to the OpenAIRE helpdesk |
| 2017-03-29T13:39:44.386894Z | fixed param hasWTFunding (instead of hasUKFunding) + list of supported funders |
| 2017-03-29T13:37:43.381141Z | param name is dateOfAcceptance not of collection |
| 2017-02-22T09:31:34.767373Z | #2630: informing that incremental harvesting is not supported and updated list of interesting OAI sets |
| 2016-01-18T10:38:57.125792Z | commented warning section |
| 2015-09-15T09:04:00.819955Z | added this week in the warning week |
| 2015-09-15T09:02:37.458839Z | updated supported funders and removed section about the TSV as it is only to be used by NOADs |
| 2015-09-15T08:56:37.943151Z | removed organizations OAI set in the examples. Added FP7Publications. |
| 2015-09-15T08:54:41.579385Z | Updated links to the guidelines |
| 2015-09-15T08:53:06.677011Z | OAI-PMH discards duplicates now |
| 2015-08-26T08:51:32.795385Z | added schema 0.3 as the latest schema |
| 2015-05-18T12:10:24.329058Z | csv and tsv formats available for search api |
| 2015-03-20T10:54:31.069584Z | fixed tsv URL |
| 2015-03-20T10:49:46.639336Z | updated date |
| 2015-03-20T10:49:11.327980Z | added documentation for the projects2tsv endpoint |
| 2015-03-19T11:18:36.226626Z | minor changes to a couple of sentences |
| 2015-03-13T17:35:01.980176Z | updated the generated html |
| 2015-03-13T17:33:41.882951Z | added list of available funding streams and those that are coming soon |
| 2015-03-13T17:33:02.339565Z | openaire compliance of OAI-PMH |
| 2015-02-04T14:16:56.528188Z | #1062: OAI-PMH and HTTP numbers are not the same because of duplicates |
| 2014-12-03T15:17:27.207961Z | #1031: title of eprints/dspace export |
| 2014-11-13T16:11:13.633046Z | Updated date and generated new html |
| 2014-11-13T16:08:12.045544Z | Fixed documentation about datasets |
| 2014-11-11T18:43:18.738678Z | Fixed documentation for publications |
| 2014-11-11T17:09:19.351093Z | added sortby parameter |
| 2014-09-17T09:05:38.726757Z | created tag folder for release |
| 2014-08-04T10:59:48.089720Z | Updated pubdate |
| 2014-08-04T10:58:22.814919Z | Overview cleanup |
| 2014-08-04T10:50:54.515588Z | added links to the latest available schema and documentation |
| 2014-07-24T14:09:49.733958Z | #690: HTTP API documentation for project and other updates. |
| 2014-06-06T08:41:45.731338Z | #550: making it clear we are delivering metadata only. Cleanup. |
| 2014-05-14T16:38:30.702554Z | updated date |
| 2014-05-14T16:35:05.787718Z | re-added OAI set for projects |
| 2014-04-30T10:42:18.355154Z | updated oxygen project with the correct tree structure |
| 2014-04-30T10:41:14.539090Z | Added and commented property to generate output in chunks |
| 2014-04-30T10:40:30.012256Z | mvn generates output with no chunks in a single file: api-doc.html |
| 2014-04-30T10:39:37.875730Z | Main docbook file renamed from book.xml to api-doc.xml |
| 2014-04-30T10:34:16.576722Z | updated OAI-PMH sets: now delivering only research products and no other entities. |
| 2014-04-15T09:53:22.158487Z | copied dnet-api-http-doc to new dnet40 codebase |
| 2014-04-10T09:55:41.690052Z | ignore |
| 2014-04-10T09:53:59.192401Z | removed target/*classes from svn |
| 2014-04-09T10:46:05.757155Z | mavenized project. Generates html running mvn docbkx:generate-html. results are then in target/docbkx |
| 2014-04-09T09:18:26.268418Z | added links to xsd and xsd doc in the overview chapter |
| 2014-04-08T12:55:01.169556Z | ticket #300: updated doc for APIs |
| 2014-03-10T18:13:38.784171Z | not a maven project |
| 2014-03-10T18:13:18.180379Z | basic structure for API doc |
| 2014-03-10T13:50:02.957489Z | added files as generated by the archetype docbkx-quickstart-archetype v2.0.15 |
| 2014-03-10T13:45:30.505315Z | created module for HTTP API docbook |

17
docs/apis/terms.md Normal file

@ -0,0 +1,17 @@
# Terms of use
## Authentication & limits
The OpenAIRE APIs are free-to-use by any third-party service and can be accessed over HTTPS both by authenticated and unauthenticated requests. The rate limit for the former type of requests is up to 7200 requests per hour, while the latter is up to 60 requests per hour.
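
As a worked example of these limits, the snippet below converts the hourly quotas into the minimum spacing between evenly distributed requests. The names are illustrative; only the two quotas come from the text above.

```python
# Hourly quotas stated above.
LIMITS_PER_HOUR = {"authenticated": 7200, "unauthenticated": 60}


def min_interval_seconds(kind):
    """Seconds to wait between evenly spaced requests to stay within the quota."""
    return 3600 / LIMITS_PER_HOUR[kind]


print(min_interval_seconds("authenticated"))    # 0.5
print(min_interval_seconds("unauthenticated"))  # 60.0
```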
To make an authenticated request, you must first [register](https://services.openaire.eu/uoa-user-management/register.jsp). Then, go to the [personal access token page](https://develop.openaire.eu/user-info?errorCode=1&redirectUrl=%2Fpersonal-token) in your account, copy your token, and use it for up to one hour. See [authentication](./authentication) to find out more.
Our OAuth 2.0 implementation conforms to the OpenID Connect specification and is [OpenID Certified](https://openid.net/certification/). OpenID Connect is a simple identity layer on top of the OAuth 2.0 protocol. For more information about OAuth 2.0, please visit the [OAuth 2.0 official site](https://oauth.net/2/). For more information about OpenID Connect, please visit the [OpenID Connect official site](https://openid.net/connect/). Also, check [here](http://www.openaire.eu/privacy-policy) for more information on our Privacy Policy.
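
A request carrying a personal access token might be prepared as sketched below. The `Bearer` scheme in the `Authorization` header is an assumption based on common OAuth 2.0 usage, not something stated on this page; the helper name and the token placeholder are hypothetical, and the request is built but not sent.

```python
import urllib.request


def authenticated_request(url, access_token):
    """Build (but do not send) a GET request carrying an access token."""
    # Assumption: the token is passed as an OAuth 2.0 Bearer token.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {access_token}"})


req = authenticated_request(
    "https://api.openaire.eu/search/publications?title=graph",
    "YOUR_ACCESS_TOKEN",  # placeholder, replace with the token copied from your account
)
print(req.get_header("Authorization"))
```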
## Quality of service
OpenAIRE API services are running in production 24/7 within the OpenAIRE infrastructure premises deployed at the data center facilities of the Interdisciplinary Centre for Mathematical and Computational Modelling (ICM).
## License
The OpenAIRE Graph license is CC-BY: the records returned by the service can be freely re-used by commercial and non-commercial partners under the CC-BY license, as long as OpenAIRE is acknowledged as a data source.

@ -2,5 +2,245 @@
sidebar_position: 12
---
# Changelog
<span className="todo">TODO</span>
# Versions & changelog
## Versioning
Our versioning policy follows the [Semantic Versioning specification](https://semver.org/).
In our case, given a version `MAJOR.MINOR.PATCH`, we increment the:
* `MAJOR` version when the data model of the Graph changes
* `MINOR` version when the pipeline (e.g., different deduplication method, different implementation for an enrichment process) or major data sources change
* `PATCH` version when the graph data are updated
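
The versioning policy can be illustrated with a small helper that reports which component changed between two graph versions. This is a sketch only: the function name `changed_component` is hypothetical, and the version strings are taken from the changelog on this page.

```python
def changed_component(old, new):
    """Return the first of MAJOR, MINOR, PATCH that differs, or None if equal."""
    names = ("MAJOR", "MINOR", "PATCH")
    # Accept versions written with or without the leading "v".
    old_parts = [int(part) for part in old.lstrip("v").split(".")]
    new_parts = [int(part) for part in new.lstrip("v").split(".")]
    for name, before, after in zip(names, old_parts, new_parts):
        if before != after:
            return name
    return None


print(changed_component("v6.2.2", "v7.0.0"))  # MAJOR: the data model changed
print(changed_component("v7.1.2", "v7.1.3"))  # PATCH: the graph data were updated
```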
## Changelog
This section documents all notable changes for each graph version.
---
### v7.1.3
_Start Date: 2024-04-10 &bull; Release Date: 2024-04-22 &bull; Dataset release: **no**_
#### Added
- Introduced new Field of Science classifications, reaching a total of ~73Mi publications classified
- General increase of the funded scientific outputs, thanks to full-text mining of new Open Access publications. Some examples:
  - European Commission - EC +7% (from 1.52Mi to 1.62Mi)
  - Irish Research Council - IRC +7% (from 12.7K to 13.5K)
  - French National Research Agency - ANR +5.8% (from 91.5K to 96.8K)
  - National Institute of Health - NIH +5% (from 594K to 626K)
  - UK Research and Innovation - UKRI +3.7% (from 434K to 450K)
- General increase of the scientific products with author affiliation information +2% (from 83.12Mi to 84.88Mi)
#### Changed
- Updated Crossref publications to include contents until March 2023
- Updated Datacite contents until March 2024
- Updated ORCID contents until March 2024
### v7.1.2
_Start Date: 2024-03-15 &bull; Release Date: 2024-03-27 &bull; Dataset release: **no**_
#### Added
- General increase of the funded scientific outputs, thanks to the full-text mining scanning new OpenAccess publications
#### Changed
- Updated Crossref publications to include contents until February 2023
- Updated Datacite contents until February 2024
- Updated ORCID contents until February 2024
### v7.1.1
_Start Date: 2024-02-23 &bull; Release Date: 2024-03-06 &bull; Dataset release: **no**_
#### Added
- Updated the content import criteria applied to Datacite, resulting in +13Mi Other Research Products (+167%)
- Introduced project PIDs; DOI currently available for grants funded by FCT and TWCF
#### Changed
- Scientific products typed as "Collection" categorized under "Research Data" instead of "Other Research Product".
- Updated Crossref publications to include contents until January 2023
- Updated Datacite contents until January 2024
### v7.1.0
_Start Date: 2024-01-30 &bull; Release Date: 2024-02-20 &bull; Dataset release: **no**_
#### Added
- The scientific products aggregated increased by ~5Mi records (+1.6%)
#### Changed
- A refined version of the deduplication strategy caught more duplicates among the scientific products, decreasing their total number by ~3.2Mi (-1.35%). More details about the deduplication algorithm are available [here](graph-production-workflow/deduplication/research-products).
- Updated Crossref publications to include contents until November 2023
- Updated Datacite contents until December 2023
### v7.0.0
_Start Date: 2023-12-18 &bull; Release Date: 2024-01-06 &bull; Dataset release: **yes**_
#### Added
- the scientific products increased by ~3Mi records (+1.26%)
- the number of relations increased by 28.6Mi (+1%)
- the funded contents increased by 5%, from 3.6Mi to 3.8Mi. Funders that recorded the highest increase include, for example, EC with +120K linked research products, and SFI with +1K products.
#### Changed
This graph release also introduces new fields to identify research products published using specific open access models, in diamond journals, and those that received public funding. These fields will also be added to the graph dataset in Zenodo. In detail:
- `ResearchProduct.isGreen (true, false)`: indicates whether or not the research product was published following the green open access model;
- `ResearchProduct.openAccessColor (bronze, gold, hybrid)`: indicates the specific open access model used for the publication;
- `ResearchProduct.isInDiamondJournal (true, false)`: indicates whether or not the research product was published in a diamond journal;
- `ResearchProduct.publicly-funded (true, false)`: indicates whether or not the grants acknowledged by the publication come from public funds.
### v6.2.2
_Start Date: 2023-11-07 &bull; Release Date: 2023-11-23 &bull; Dataset release: **no**_
#### Added
- Imported OpenCitations' POCI dataset, containing citations among publications in PubMed
- Imported Affiliations from Crossref and from PubMed
- Imported Software Heritage identifiers for Software records
- Extended coverage of Irish funders imported from Crossref
- Peer reviewed material identified with a revised heuristic that improved the coverage
- Project references identified by TDM increased by ~10%
- Introduced new Field of Science classifications for ~40Mi publications
#### Changed
- Updated Crossref publications to include contents until October 2023
- Updated Datacite contents until October 2023
- Indicators regarding data source downloads and views taken by usage counts from September 2023
### v6.1.1
_Start Date: 2023-09-11 &bull; Release Date: 2023-10-15 &bull; Dataset release: **no**_
#### Added
- Affiliation (research product to organization) relations from Crossref
- Links to the full text of research products
- Cleaning for author and publisher names (get rid of tabs, CR characters, \n(s), escape double quotes)
#### Changed
- Projects without a grant code are removed
- Crossref dump from July 2023
- ORCID works without a DOI from March 2023
- Usage counts from July 2023
- Datacite contents from early July 2023
- OpenCitations relations from December 2022
### v6.0.0
_Start Date: 2023-07-26 &bull; Release Date: 2023-08-16 &bull; Dataset release: **yes**_
#### Changed
- [Relationship data model](./data-model/relationships/relationship-object): flattened properties source, sourceType, target, targetType
- BIP! indicators are now serialised as an array; see the updated model [here](./data-model/entities/other#bipindicators)
- Crossref dump from June 2023
- ORCID works without a DOI from June 2023
- Usage counts from June 2023
- Datacite contents from June 2023
- OpenCitations relations from January 2023
- BIP! indicators from June 2023
- New Datasources/Services were added, collected from an updated EOSC Service catalogue endpoint
### v5.2.0
_Start Date: 2023-07-03 &bull; Release Date: 2023-07-17 &bull; Dataset release: **no**_
#### Added
- Citations imported from Crossref & MAG
- FoS and SDG classifications introduced for ~16Mi research products
#### Changed
- Removed the numerical prefix from the OpenAIRE identifiers (```"20|openorgs____::..." --> "openorgs____::..."```)
- Dataset file names in the Zenodo depositions changed from `dump` to `dataset`
- Crossref dump from May 2023
- ORCID works without a DOI from June 2023
- Usage counts from April 2023
- Datacite contents from June 2023
- OpenCitations relations from January 2023
- Deduplication of the datasource
- Avoid duplicated organisation PIDs
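
The identifier change introduced in v5.2.0 (removal of the numerical prefix) can be mirrored on the client side when comparing identifiers from older and newer releases. The sketch below assumes the prefix is always two digits followed by `|`, as in the examples in this documentation; the helper name is hypothetical.

```python
import re

# Assumption: the legacy prefix is exactly two digits followed by a pipe, e.g. "20|".
_TYPE_PREFIX = re.compile(r"^\d{2}\|")


def strip_type_prefix(identifier):
    """Convert a pre-v5.2.0 OpenAIRE identifier to the prefix-free form."""
    return _TYPE_PREFIX.sub("", identifier, count=1)


print(strip_type_prefix("00|context_____::5b7f9fa40bdc12072249204cedfa7808"))
```

Identifiers that already lack the prefix pass through unchanged, so the helper can be applied to mixed inputs.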
### v5.1.3
_Start Date: 2023-05-22 &bull; Release Date: 2023-06-12 &bull; Dataset release: **no**_
#### Added
- Datasource and project level usage counts
#### Changed
- Crossref dump from April 2023
- ORCID works without a DOI from May 2023
- Usage counts from April 2023
- Datacite contents from May 2023
- OpenCitations relations from January 2023
- Deduplication of the datasource
### v5.1.2
_Start Date: 2023-03-20 &bull; Release Date: 2023-04-04 &bull; Dataset release: **no**_
#### Changed
- Crossref dump from February 2023
- ORCID works without a DOI from March 2023
- Usage counts from February 2023 (+76% Downloads per Datasource for 2023)
- Datacite contents from mid March 2023
- OpenCitations relations from January 2023
### v5.1.1
_Start Date: 2023-02-13 &bull; Release Date: 2023-03-01 &bull; Dataset release: **no**_
#### Added
- Revised SDG classification: improved coverage (+600K classified DOIs)
- General increase of the funded scientific outputs, thanks to the full text mining scanning new OpenAccess publications
- Integrated contents from
  - [EMBL-EBIs Protein Data Bank in Europe](./graph-production-workflow/aggregation/non-compatible-sources/ebi)
  - [UniProtKB/Swiss-Prot](./graph-production-workflow/aggregation/non-compatible-sources/uniprot)
#### Changed
- Crossref dump from January 2023
- ORCID works without a DOI from January 2023
- Usage counts from January 2023
- Datacite contents from mid February 2023
- OpenCitations relations from December 2022
### v5.1.0
_Start Date: 2023-01-16 &bull; Release Date: 2023-01-30 &bull; Dataset release: **no**_
#### Added
- Revised SDG classification: better accuracy, lower coverage (will improve in the next months)
#### Changed
- Crossref dump from December 2022
- ORCID works without a DOI from January 2023
- Usage counts from December 2022
- DataCite contents from January 2023
---
### v5.0.0
_Start Date: 2022-12-19 &bull; Release Date: 2022-12-28 &bull; Dataset release: **yes**_
#### Added
- [Impact & Usage indicators](./data-model/entities/research-product.md#indicators) at the level of the research product
- [Beginner's kit](./downloads/beginners-kit) in the Downloads section
- New relationship types were introduced; see the complete list [here](./data-model/relationships/relationship-types)
#### Changed
- FOS and SDGs were removed from the [ResearchProduct.subjects](./data-model/entities/research-product#subjects)
- Measures were removed from the [ResearchProduct.instance](./data-model/entities/research-product#instance)
- Updated DOIBoost to include publications from Crossref and the works from ORCID with a DOI until November 2022
- Added ORCID works without a DOI from November 2022

@ -1,25 +1,26 @@
# Data model
The OpenAIRE Research Graph comprises several types of [entities](../category/entities) and [relationships](./relationships) among them.
The OpenAIRE Graph comprises several types of [entities](../category/entities) and [relationships](/category/relationships) among them.
The latest version of the JSON schema can be found on [Bulk downloads](../download).
The latest version of the JSON schema can be found on the [Downloads](../downloads/full-graph) section.
<p align="center">
<img loading="lazy" alt="Data model" src="/img/docs/data-model.png" width="80%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
<img loading="lazy" alt="Data model" src={require('../assets/img/data-model-3.png').default} width="80%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
The figure above presents the graph's data model.
Its main entities are described in brief below:
* [Results](entities/result) represent the outcomes of research activities.
* [Data Sources](entities/data-source) are the resources used to collect metadata for the graph objects
* [Organizations](entities/organization) correspond to companies or research institutions involved in projects,
* [Research products](./entities/research-product) represent the outcomes (or products) of research activities.
* [Data sources](./entities/data-source) are the sources from which the metadata of graph objects are collected.
* [Organizations](./entities/organization) correspond to companies or research institutions involved in projects,
responsible for operating data sources, or serving as the affiliations of research product creators.
* [Projects](entities/project) are research projects funded by a Funding Stream of a Funder.
* [Communities](entities/community) are groups of people with a common research intent.
* [Projects](./entities/project) are research project grants funded by a Funding Stream of a Funder.
* [Communities](./entities/community) are groups of people with a common research intent (e.g. research infrastructures, university alliances).
* Persons correspond to individual researchers who are involved in the design, creation or maintenance of research products. Currently, this is a non-materialized entity type in the Graph, which means that the respective metadata (and relationships) are encapsulated in the author field of the respective research products.
:::note Further reading
A detailed report on the OpenAIRE Research Graph Data Model can be found on [Zenodo](https://zenodo.org/record/2643199).
A detailed report on the OpenAIRE Graph Data Model can be found on [Zenodo](https://zenodo.org/record/2643199).
:::

View File

@ -3,6 +3,6 @@
"position": 1,
"link": {
"type": "generated-index",
"description": "The main entities of the OpenAIRE Research Graph are listed below."
"description": "The main entities of the OpenAIRE Graph are listed below."
}
}

View File

@ -2,7 +2,7 @@
sidebar_position: 6
---
# Community
# Communities
Communities are groups of people with a common research intent and can be of two types, research initiatives or research communities:
@ -19,7 +19,7 @@ _Type: String &bull; Cardinality: ONE_
The OpenAIRE id for the community/research infrastructure, created according to the [OpenAIRE entity identifier and PID mapping policy](../pids-and-identifiers).
```json
"id": "00|context_____::5b7f9fa40bdc12072249204cedfa7808"
"id": "context_____::5b7f9fa40bdc12072249204cedfa7808"
```
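The two `id` values above differ only in the leading two-digit entity-type prefix (`00|`). As a minimal sketch, assuming that older identifiers always start with two digits followed by a pipe, a consumer could normalise them as shown below. The helper name `strip_entity_prefix` is hypothetical and not part of any OpenAIRE library.

```python
import re

# Assumption: a legacy prefix is exactly two digits followed by "|", e.g. "00|", "10|", "20|", "40|".
_LEGACY_PREFIX = re.compile(r"^\d{2}\|")


def strip_entity_prefix(identifier: str) -> str:
    """Return the identifier without a leading legacy entity-type prefix, if present."""
    return _LEGACY_PREFIX.sub("", identifier, count=1)


print(strip_entity_prefix("00|context_____::5b7f9fa40bdc12072249204cedfa7808"))
print(strip_entity_prefix("context_____::5b7f9fa40bdc12072249204cedfa7808"))
```

Identifiers that already lack the prefix pass through unchanged, so the function can be applied to mixed data.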
### acronym
@ -37,7 +37,7 @@ _Type: String &bull; Cardinality: ONE_
Description of the research community/research infrastructure
```json
"description": "This portal provides access to publications, research data, projects and software that may be relevant to the Corona Virus Disease (COVID-19). The OpenAIRE COVID-19 Gateway aggregates COVID-19 related records, links them and provides a single access point for discovery and navigation. We tag content from the OpenAIRE Research Graph (10,000+ data sources) and additional sources. All COVID-19 related research results are linked to people, organizations and projects, providing a contextualized navigation."
"description": "This portal provides access to publications, research data, projects and software that may be relevant to the Corona Virus Disease (COVID-19). The OpenAIRE COVID-19 Gateway aggregates COVID-19 related records, links them and provides a single access point for discovery and navigation. We tag content from the OpenAIRE Graph (10,000+ data sources) and additional sources. All COVID-19 related research results are linked to people, organizations and projects, providing a contextualized navigation."
```
### name

View File

@ -2,7 +2,7 @@
sidebar_position: 2
---
# Data source
# Data sources
OpenAIRE entity instances are created out of data collected from various data sources of different kinds, such as publication repositories, dataset archives, CRIS systems, funder databases, etc. Data sources export information packages (e.g., XML records, HTTP responses, RDF data, JSON) that may contain information on one or more of such entities and possibly relationships between them.
@ -18,7 +18,7 @@ _Type: String &bull; Cardinality: ONE_
The OpenAIRE id of the data source, created according to the [OpenAIRE entity identifier and PID mapping policy](../pids-and-identifiers).
```json
"id": "10|issn___print::22c514d022b199c346e7f29ca06efc95"
"id": "issn___print::22c514d022b199c346e7f29ca06efc95"
```
### originalId
@ -64,7 +64,7 @@ The datasource type; see the vocabulary [dnet:datasource_typologies](https://api
### openairecompatibility
_Type: String &bull; Cardinality: ONE_
The OpenAIRE compatibility of the ingested results, indicates which guidelines they are compliant according to the vocabulary [dnet:datasourceCompatibilityLevel](https://api.openaire.eu/vocabularies/dnet:datasourceCompatibilityLevel).
The OpenAIRE compatibility of the ingested research products; it indicates which guidelines they comply with, according to the vocabulary [dnet:datasourceCompatibilityLevel](https://api.openaire.eu/vocabularies/dnet:datasourceCompatibilityLevel).
```json
"openairecompatibility": "collected from a compatible aggregator"

View File

@ -2,7 +2,7 @@
sidebar_position: 3
---
# Organization
# Organizations
Organizations include companies, research centers or institutions involved as project partners or responsible for operating data sources. Information about organizations is collected from funder databases like CORDA, registries of data sources like OpenDOAR and re3Data, and CRIS systems, as being related to projects or data sources.
@ -17,7 +17,7 @@ _Type: String &bull; Cardinality: ONE_
The OpenAIRE id for the organization, created according to the [OpenAIRE entity identifier and PID mapping policy](../pids-and-identifiers).
```json
"id": "20|openorgs____::b84450f9864182c67b8611b5593f4250"
"id": "openorgs____::b84450f9864182c67b8611b5593f4250"
```
### legalshortname

View File

@ -20,7 +20,7 @@ Indicates the OpenAccess status. Values are set according to the [Unpaywall meth
```
## AlternateIdentifier
Type used to represent the information associated to persistent identifiers associated to the result that have not been forged by an authority for that pid type. For example we collect metadata from an institutional repository that provides as identifier for the result also the doi.
Type used to represent the information about persistent identifiers associated with the research product that have not been forged by an authority for that PID type. For example, we collect metadata from an institutional repository that also provides the DOI as an identifier for the research product.
### scheme
_Type: String &bull; Cardinality: ONE_
@ -63,7 +63,7 @@ The quantity of money.
## Author
Represents the result author.
Represents the research product author.
### fullname
_Type: String &bull; Cardinality: ONE_
@ -95,7 +95,7 @@ Author's family name.
### rank
_Type: String &bull; Cardinality: ONE_
Author's order in the list of authors for the given result.
Author's order in the list of authors for the given research product.
```json
"rank": 1
@ -167,7 +167,7 @@ The author's pid value in that scheme.
```
## BestAccessRight
Indicates the most open access rights \*available among the result Instances.
Indicates the most open access rights \*available among the research product instances.
\* where the openness is defined by the ordering of the access right terms in the following.
```
@ -201,8 +201,57 @@ Scheme of reference for access right code. Currently, always set to COAR access
"scheme": "http://vocabularies.coar-repositories.org/documentation/access_rights/"
```
## BipIndicator
The different citation-based impact indicators as computed by [BIP!](https://bip.imsi.athenarc.gr/).
### indicator
_Type: String &bull; Cardinality: ONE_
The name of the indicator; it can be one of:
* `influence`: it reflects the overall/total (citation-based) impact of an article in the research community at large, based on the underlying citation network (diachronically).
* `influence_alt`: it is an alternative to the "Influence" indicator, which also reflects the overall/total (citation-based) impact of an article in the research community at large, based on the underlying citation network (diachronically).
* `popularity`: it reflects the "current" (citation-based) impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
* `popularity_alt`: it is an alternative to the "Popularity" indicator, which also reflects the "current" (citation-based) impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
* `impulse`: it reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
For more details on how these indicators are calculated, please refer [here](/graph-production-workflow/indicators-ingestion/impact-indicators).
```json
"indicator": "influence"
```
### class
_Type: String &bull; Cardinality: ONE_
The impact class assigned based on the indicator score.
To facilitate comprehension, BIP! also offers impact classes for articles, to group together those that have similar impact. The following 5 classes are provided:
* `C1`: Top 0.01%
* `C2`: Top 0.1%
* `C3`: Top 1%
* `C4`: Top 10%
* `C5`: Bottom 90%
```json
"class": "C2"
```
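The class boundaries listed above can be sketched as a simple rank-based lookup. This is only an illustration: it assumes that classes are derived from an article's rank among all scored articles (rank 1 being the highest score) and that boundaries are inclusive. How BIP! handles ties and computes the actual cut-offs is not described here, and the function name `impact_class` is hypothetical.

```python
# (label, denominator): an article is in the class if rank / total <= 1 / denominator.
# Integer arithmetic avoids floating-point rounding at the boundaries.
_CLASS_CUTOFFS = [
    ("C1", 10_000),  # Top 0.01%
    ("C2", 1_000),   # Top 0.1%
    ("C3", 100),     # Top 1%
    ("C4", 10),      # Top 10%
]


def impact_class(rank: int, total: int) -> str:
    """Map a 1-based rank among `total` articles to an impact class label."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    for label, denominator in _CLASS_CUTOFFS:
        if rank * denominator <= total:
            return label
    return "C5"  # Bottom 90%


print(impact_class(100, 1_000_000))
print(impact_class(500_000, 1_000_000))
```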
### score
_Type: String &bull; Cardinality: ONE_
The actual indicator score.
```json
"score": "1234"
```
## Container
This field has information about the conference or journal where the result has been presented or published.
This field has information about the conference or journal where the research product has been presented or published.
### name
_Type: String &bull; Cardinality: ONE_
@ -484,7 +533,7 @@ The description of the programme.
```
## Instance
An instance is one specific materialization or version of the result. For example, you can have one result with three instances as result of deduplication:
An instance is one specific materialization or version of the research product. For example, you can have one research product with three instances due to deduplication:
* one is the pre-print
* one is the post-print
@ -509,7 +558,7 @@ Maps [dc:rights](https://www.dublincore.org/specifications/dublin-core/dcmi-term
### alternateIdentifier
_Type: [AlternateIdentifier](#alternateidentifier) &bull; Cardinality: MANY_
All the identifiers associated to the result other than the authoritative ones.
All the identifiers associated to the research product other than the authoritative ones.
```json
"alternateIdentifier": [
@ -542,21 +591,6 @@ The license URL.
"license": "http://creativecommons.org/licenses/by-nc/4.0"
```
### measures
_Type: [Measure](#measure) &bull; Cardinality: MANY_
The measures computed for this instance (e.g. those provided by [BIP! Finder](https://bip.imsi.athenarc.gr/)).
```json
"measures": [
{
"key": "influence",
"value": "6.45335454246e-09"
},
...
]
```
### pid
_Type: [ResultPid](#resultpid) &bull; Cardinality: MANY_
@ -619,8 +653,64 @@ URLs to the instance. They may link to the actual full-text or to the landing pa
]
```
## Indicator
These are indicators computed for a specific OpenAIRE research product.
Each Indicator object is composed of the following properties:
### bipIndicators
_Type: [BipIndicator](#bipindicator) &bull; Cardinality: MANY_
These indicators, provided by [BIP!](https://bip.imsi.athenarc.gr/), estimate the citation-based impact of a research product.
For details about their calculation, please refer [here](/graph-production-workflow/indicators-ingestion/impact-indicators).
```json
"bipIndicators": [
{
"indicator": "influence",
"score": "123",
"class": "C2"
},
{
"indicator": "influence_alt",
"score": "456",
"class": "C3"
},
{
"indicator": "popularity",
"score": "234",
"class": "C1"
},
{
"indicator": "popularity_alt",
"score": "345",
"class": "C5"
},
{
"indicator": "impulse",
"score": "987",
"class": "C3"
}
]
```
### usageCounts
_Type: [UsageCounts](#usagecounts-1) &bull; Cardinality: ONE_
These measures, computed by the [UsageCounts Service](https://usagecounts.openaire.eu/), are based on usage statistics.
Please refer [here](/graph-production-workflow/indicators-ingestion/usage-counts) for more details.
```json
"usageCounts":{
"downloads": "10",
"views": "20"
}
```
## Language
Represents information for the language of the result
Represents information for the language of the research product.
### code
_Type: String &bull; Cardinality: ONE_
@ -640,26 +730,6 @@ Language label in English.
"label": "English"
```
## Measure
A measure computed for this instance (e.g. those provided by [BIP! Finder](https://bip.imsi.athenarc.gr/))
### key
_Type: String &bull; Cardinality: ONE_
The specified measure. Currently supported one of: `{ influence, influence_alt, popularity, popularity_alt, impulse, cc }` (see [the dedicated page](../../data-provision/enrichment/impact-scores) for more details).
```json
"key": "influence"
```
### value
_Type: String &bull; Cardinality: ONE_
```json
"value": "6.45335454246e-09"
```
The value for that measure.
## OrganizationPid
@ -705,13 +775,13 @@ Trust, expressed as a number in the range [0-1].
```
## ResultCountry
It is for the country associated to the result.
This is the country associated to the research product.
It is a subclass of [Country](#country) and extends it with provenance information.
### provenance
_Type: [Provenance](#provenance-2) &bull; Cardinality: ONE_
Indicates the reason why this country is associated to this result.
Indicates the reason why this country is associated to this research product.
```json
"provenance": {
@ -721,14 +791,14 @@ Indicates the reason why this country is associated to this result.
```
## ResultPid
Type used to represent the information associated to persistent identifiers for the result that have been forged by an authority for that pid type.
Type used to represent the information associated to persistent identifiers for the research product that have been forged by an authority for that pid type.
<!-- <span className="todo">Seems to be similar to the AlternateIdentifier. What is the difference?</span> -->
### scheme
_Type: String &bull; Cardinality: ONE_
The scheme of the persistent identifier for the result (i.e. doi). If the pid is here it means the information for the pid has been collected from an authority for that pid type (i.e. Crossref/Datacite for doi). The set of authoritative pid is: `doi` when collected from Crossref or Datacite, `pmid` when collected from EuroPubmed, `arxiv` when collected from arXiv, `handle` from the repositories.
The scheme of the persistent identifier for the research product (e.g. doi). If the pid is here, it means the information for the pid has been collected from an authority for that pid type (e.g. Crossref/Datacite for doi). The authoritative pids are: `doi` when collected from Crossref or Datacite, `pmid` when collected from EuroPubmed, `arxiv` when collected from arXiv, `handle` from the repositories.
```json
"scheme": "doi"
@ -744,7 +814,7 @@ The value expressed in the scheme (i.e. 10.1000/182).
```
## Subject
Represents keywords associated to the result.
Represents keywords associated to the research product.
### subject
_Type: [SubjectSchemeValue](#subjectschemevalue) &bull; Cardinality: ONE_
@ -790,3 +860,25 @@ The value for the subject in the selected scheme. When the scheme is 'keyword',
```json
"value" : "pyrolysis-oil"
```
## UsageCounts
The usage counts indicator computed for this research product.
### views
_Type: String &bull; Cardinality: ONE_
The number of views for this research product.
```json
"views": "10"
```
### downloads
_Type: String &bull; Cardinality: ONE_
The number of downloads for this research product.
```json
"downloads": "5"
```

View File

@ -2,9 +2,9 @@
sidebar_position: 4
---
# Project
# Projects
Of crucial interest to OpenAIRE is also the identification of the funders (e.g. European Commission, WellcomeTrust, FCT Portugal, NWO The Netherlands) that co-funded the projects that have led to a given result. Projects are characterized by a list of funding streams (e.g. FP7, H2020 for the EC), which identify the strands of fundings. Funding streams can be nested to form a tree of sub-funding streams.
Of crucial interest to OpenAIRE is also the identification of the funders (e.g. European Commission, Wellcome Trust, FCT Portugal, NWO The Netherlands) that co-funded the projects that have led to a given research product. Projects are characterized by a list of funding streams (e.g. FP7, H2020 for the EC), which identify the strands of funding. Funding streams can be nested to form a tree of sub-funding streams.
---
@ -16,7 +16,7 @@ _Type: String &bull; Cardinality: ONE_
Main entity identifier, created according to the [OpenAIRE entity identifier and PID mapping policy](../pids-and-identifiers).
```json
"id": "40|corda__h2020::70ea22400fd890c5033cb31642c4ae68"
"id": "corda__h2020::70ea22400fd890c5033cb31642c4ae68"
```
### code

View File

@ -0,0 +1,520 @@
---
sidebar_position: 1
---
# Research products
Research products are intended as digital objects, described by metadata, resulting from a scientific process.
On this page, we describe the properties of the `ResearchProduct` object.
Moreover, there are the following sub-types of a `ResearchProduct`, that inherit all its properties and further extend it:
* [Publication](#publication)
* [Dataset](#dataset)
* [Software](#software)
* [Other research product](#other-research-product)
---
## The `ResearchProduct` object
### id
_Type: String &bull; Cardinality: ONE_
Main entity identifier, created according to the [OpenAIRE entity identifier and PID mapping policy](../pids-and-identifiers).
```json
"id": "doi_dedup___::80f29c8c8ba18c46c88a285b7e739dc3"
```
### type
_Type: String &bull; Cardinality: ONE_
Type of the research products. Possible types:
* `publication`
* `dataset`
* `software`
* `other`
as declared in the terms from the [dnet:result_typologies vocabulary](https://api.openaire.eu/vocabularies/dnet:result_typologies).
```json
"type": "publication"
```
### originalId
_Type: String &bull; Cardinality: MANY_
Identifiers of the record at the original sources.
```json
"originalId": [
"oai:pubmedcentral.nih.gov:8024784",
"S0048733321000305",
"10.1016/j.respol.2021.104226",
"3136742816"
]
```
### maintitle
_Type: String &bull; Cardinality: ONE_
A name or title by which a research product is known. May be the title of a publication, of a dataset or the name of a piece of software.
```json
"maintitle": "The fall of the innovation empire and its possible rise through open science"
```
### subtitle
_Type: String &bull; Cardinality: ONE_
Explanatory or alternative name by which a research product is known.
```json
"subtitle": "An analysis of cases from 1980 - 2020"
```
### author
_Type: [Author](other#author) &bull; Cardinality: MANY_
The main researchers involved in producing the data, or the authors of the publication.
```json
"author": [
{
"fullname": "E. Richard Gold",
"rank": 1,
"name": "Richard",
"surname": "Gold",
"pid": {
"id": {
"scheme": "orcid",
"value": "0000-0002-3789-9238"
},
"provenance": {
"provenance": "Harvested",
"trust": "0.9"
}
}
},
...
]
```
### bestaccessright
_Type: [BestAccessRight](other#bestaccessright) &bull; Cardinality: ONE_
The most open access right associated to the manifestations of this research product.
```json
"bestaccessright": {
"code": "c_abf2",
"label": "OPEN",
"scheme": "http://vocabularies.coar-repositories.org/documentation/access_rights/"
}
```
### contributor
_Type: String &bull; Cardinality: MANY_
The institution or person responsible for collecting, managing, distributing, or otherwise contributing to the development of the resource.
```json
"contributor": [
"University of Zurich",
"Wright, Aidan G C",
"Hallquist, Michael",
...
]
```
### country
_Type: [ResultCountry](other#resultcountry) &bull; Cardinality: MANY_
Country associated with the research product: it is the country of the organisation that manages the institutional repository or national aggregator or CRIS system from which this record was collected.
Country of affiliations of authors can be found instead in the affiliation relation.
```json
"country": [
{
"code": "CH",
"label": "Switzerland",
"provenance": {
"provenance": "Inferred by OpenAIRE",
"trust": "0.85"
}
},
...
]
```
### coverage
_Type: String &bull; Cardinality: MANY_
### dateofcollection
_Type: String &bull; Cardinality: ONE_
The date when OpenAIRE last collected the record.
```json
"dateofcollection": "2021-06-09T11:37:56.248Z"
```
### description
_Type: String &bull; Cardinality: MANY_
A brief description of the resource and the context in which the resource was created.
```json
"description": [
"Open science partnerships (OSPs) are one mechanism to reverse declining efficiency. OSPs are public-private partnerships that openly share publications, data and materials.",
"There is growing concern that the innovation system's ability to create wealth and attain social benefit is declining in effectiveness. This article explores the reasons for this decline and suggests a structure, the open science partnership, as one mechanism through which to slow down or reverse this decline.",
"The article examines the empirical literature of the last century to document the decline. This literature suggests that the cost of research and innovation is increasing exponentially, that researcher productivity is declining, and, third, that these two phenomena have led to an overall flat or declining level of innovation productivity.",
...
]
```
### embargoenddate
_Type: String &bull; Cardinality: ONE_
Date when the embargo ends and this research product turns Open Access.
```json
"embargoenddate": "2017-01-01"
```
### indicators
_Type: [Indicator](other#indicator-1) &bull; Cardinality: ONE_
The indicators computed for this research product;
currently, the following types of indicators are supported:
* [Citation-based impact indicators by BIP!](other#bipindicators)
* [Usage Statistics indicators](other#usagecounts)
```json
"indicators": {
"bipIndicators": [
{
"indicator": "influence",
"score": "123",
"class": "C2"
},
{
"indicator": "influence_alt",
"score": "456",
"class": "C3"
},
{
"indicator": "popularity",
"score": "234",
"class": "C1"
},
{
"indicator": "popularity_alt",
"score": "345",
"class": "C5"
},
{
"indicator": "impulse",
"score": "987",
"class": "C3"
}
],
"usageCounts": {
"downloads": "10",
"views": "20"
}
}
```
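As a minimal sketch of how a consumer might read this field, the snippet below flattens the `indicators` object of one record into plain Python values. It assumes the structure shown in the example above, including the fact that scores and counts are serialised as strings. The function name `summarize_indicators` and the shortened sample record are illustrative only.

```python
import json

# Shortened sample record, shaped like the example above.
record = json.loads("""
{
  "indicators": {
    "bipIndicators": [
      {"indicator": "influence", "score": "123", "class": "C2"},
      {"indicator": "popularity", "score": "234", "class": "C1"}
    ],
    "usageCounts": {"downloads": "10", "views": "20"}
  }
}
""")


def summarize_indicators(product: dict) -> dict:
    """Index BIP! indicators by name and convert string values to numbers."""
    indicators = product.get("indicators") or {}
    bip = {
        entry["indicator"]: {"score": float(entry["score"]), "class": entry["class"]}
        for entry in indicators.get("bipIndicators", [])
    }
    usage = indicators.get("usageCounts") or {}
    return {
        "bip": bip,
        # Assumption: a missing count is treated as zero.
        "downloads": int(usage.get("downloads", 0)),
        "views": int(usage.get("views", 0)),
    }


summary = summarize_indicators(record)
print(summary["bip"]["influence"]["class"], summary["downloads"], summary["views"])
```

Records without an `indicators` field yield an empty `bip` mapping and zero counts rather than raising an error.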
### instance
_Type: [Instance](other#instance) &bull; Cardinality: MANY_
Specific materialization or version of the research product. For example, you can have one research product with three instances: one is the pre-print, one is the post-print, one is the published version.
```json
"instance": [
{
"accessright": {
"code": "c_abf2",
"label": "OPEN",
"openAccessRoute": "gold",
"scheme": "http://vocabularies.coar-repositories.org/documentation/access_rights/"
},
"alternateIdentifier": [
{
"scheme": "doi",
"value": "10.1016/j.respol.2021.104226"
},
...
],
"articleprocessingcharge": {
"amount": "4063.93",
"currency": "EUR"
},
"license": "http://creativecommons.org/licenses/by-nc/4.0",
"pid": [
{
"scheme": "pmc",
"value": "PMC8024784"
},
...
],
"publicationdate": "2021-01-01",
"refereed": "UNKNOWN",
"type": "Article",
"url": [
"http://europepmc.org/articles/PMC8024784"
]
},
...
]
```
### language
_Type: [Language](other#language) &bull; Cardinality: ONE_
The alpha-3/ISO 639-2 code of the language. Values controlled by the [dnet:languages vocabulary](https://api.openaire.eu/vocabularies/dnet:languages).
```json
"language": {
"code": "eng",
"label": "English"
}
```
### lastupdatetimestamp
_Type: Long &bull; Cardinality: ONE_
Timestamp of last update of the record in OpenAIRE.
```json
"lastupdatetimestamp": 1652722279987
```
### pid
_Type: [ResultPid](other#resultpid) &bull; Cardinality: MANY_
Persistent identifiers of the research product. See also the [OpenAIRE entity identifier and PID mapping policy](../pids-and-identifiers) to learn more.
```json
"pid": [
{
"scheme": "pmc",
"value": "PMC8024784"
},
{
"scheme": "doi",
"value": "10.1016/j.respol.2021.104226"
},
...
]
```
### publicationdate
_Type: String &bull; Cardinality: ONE_
Main date of the research product: typically the publication or issued date. If a research product has several versions with different dates, the most frequent well-formatted date is selected. If no such date is available, the most recent and complete date among the well-formatted ones is used. For statistics, the year is extracted and the research product is counted only among the research products of that year. Example: given a pre-print date of 2019-02-03, an article date of 2020-02 provided by a repository, and an article date of 2020 provided by Crossref, OpenAIRE sets the date to 2019-02-03, because it is the most recent among the complete and well-formed dates. If the repository then updates the metadata and sets a complete date (e.g. 2020-02-12), this becomes the new date for the research product, because it is now the most recent complete date. However, if OpenAIRE then collects the pre-print from another repository with date 2019-02-03, this becomes the “winning date”, because it is now the most frequent well-formatted date.
```json
"publicationdate": "2021-03-18"
```
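The selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not the actual OpenAIRE implementation: a date counts as "complete and well-formatted" only if it is a valid `YYYY-MM-DD` string, a date wins on frequency only if it strictly outnumbers every other complete date, and ties fall back to the most recent complete date. The function name `select_publication_date` is hypothetical.

```python
import re
from collections import Counter
from datetime import date
from typing import Iterable, Optional

_FULL_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def _is_complete(value: str) -> bool:
    """True if the value is a valid calendar date in YYYY-MM-DD form."""
    if not _FULL_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def select_publication_date(candidates: Iterable[str]) -> Optional[str]:
    """Pick the most frequent complete date, else the most recent complete one."""
    complete = [value for value in candidates if _is_complete(value)]
    if not complete:
        return None  # Assumption: no fallback to partial dates in this sketch.
    ranked = Counter(complete).most_common()
    top_value, top_count = ranked[0]
    if len(ranked) == 1 or top_count > ranked[1][1]:
        return top_value
    # Tie on frequency: ISO dates sort chronologically as strings.
    return max(complete)


print(select_publication_date(["2019-02-03", "2020-02", "2020"]))
```

The three scenarios from the description can be replayed by passing `["2019-02-03", "2020-02", "2020"]`, then replacing `2020-02` with `2020-02-12`, then adding a second `2019-02-03`.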
### publisher
_Type: String &bull; Cardinality: ONE_
The name of the entity that holds, archives, publishes, prints, distributes, releases, issues, or produces the resource.
```json
"publisher": "Elsevier, North-Holland Pub. Co"
```
### source
_Type: String &bull; Cardinality: MANY_
A related resource from which the described resource is derived. See definition of Dublin Core field [dc:source](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/elements11/source).
```json
"source": [
"Research Policy",
"Crossref",
...
]
```
### subjects
_Type: [Subject](other#subject) &bull; Cardinality: MANY_
Subject, keyword, classification code, or key phrase describing the resource.
```json
"subjects": [
{
"provenance": {
"provenance": "Harvested",
"trust": "0.9"
},
"subject": {
"scheme": "keyword",
"value": "Open science"
}
},
...
]
```
### isGreen
_Type: Boolean &bull; Cardinality: ONE_
Indicates whether or not the scientific result was published following the green open access model.
### openAccessColor
_Type: String &bull; Cardinality: ONE_
Indicates the specific open access model used for the publication; possible value is one of `bronze, gold, hybrid`.
### isInDiamondJournal
_Type: Boolean &bull; Cardinality: ONE_
Indicates whether or not the publication was published in a diamond journal.
### publiclyFunded
_Type: String &bull; Cardinality: ONE_
Discloses whether the publication acknowledges grants from public sources.
---
## Sub-types
There are the following sub-types of `ResearchProduct`. Each inherits all of its fields and extends them with the following.
### Publication
Metadata records about research literature (includes types of publications listed [here](http://api.openaire.eu/vocabularies/dnet:result_typologies/publication)).
#### container
_Type: [Container](other#container) &bull; Cardinality: ONE_
Container has information about the conference or journal where the research product has been presented or published.
```json
"container": {
"edition": "",
"iss": "5",
"issnLinking": "",
"issnOnline": "1873-7625",
"issnPrinted": "0048-7333",
"name": "Research Policy",
"sp": "12",
"ep": "22",
"vol": "50"
}
```
### Dataset
Metadata records about research data (includes the subtypes listed [here](http://api.openaire.eu/vocabularies/dnet:result_typologies/dataset)).
#### size
_Type: String &bull; Cardinality: ONE_
The declared size of the dataset.
```json
"size": "10129818"
```
#### version
_Type: String &bull; Cardinality: ONE_
The version of the dataset.
```json
"version": "v1.3"
```
#### geolocation
_Type: [GeoLocation](other#geolocation) &bull; Cardinality: MANY_
The list of geolocations associated with the dataset.
```json
"geolocation": [
{
"box": "18.569386 54.468973 18.066832 54.83707",
"place": "Tübingen, Baden-Württemberg, Southern Germany",
"point": "7.72486 50.1084"
},
...
]
```
### Software
Metadata records about research software (includes the subtypes listed [here](http://api.openaire.eu/vocabularies/dnet:result_typologies/software)).
#### documentationUrl
_Type: String &bull; Cardinality: MANY_
The URLs to the software documentation.
```json
"documentationUrl": [
"https://github.com/openaire/iis/blob/master/README.markdown",
...
]
```
#### codeRepositoryUrl
_Type: String &bull; Cardinality: ONE_
The URL to the repository with the source code.
```json
"codeRepositoryUrl": "https://github.com/openaire/iis"
```
#### programmingLanguage
_Type: String &bull; Cardinality: ONE_
The programming language.
```json
"programmingLanguage": "Java"
```
### Other research product
Metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed [here](http://api.openaire.eu/vocabularies/dnet:result_typologies/other)).
#### contactperson
_Type: String &bull; Cardinality: MANY_
Information on the person responsible for providing further information regarding the resource.
```json
"contactperson": [
"Noémie Dominguez",
...
]
```
#### contactgroup
_Type: String &bull; Cardinality: MANY_
Information on the group responsible for providing further information regarding the resource.
```json
"contactgroup": [
"Networked Multimedia Information Systems (NeMIS)",
...
]
```
#### tool
_Type: String &bull; Cardinality: MANY_
Information about tools useful for the interpretation and/or re-use of the research product.

View File

@ -1,6 +1,6 @@
# PIDs and identifiers
One of the challenges towards the stability of the contents in the OpenAIRE Research Graph consists of making its identifiers and records stable over time.
One of the challenges towards the stability of the contents in the OpenAIRE Graph consists of making its identifiers and records stable over time.
The barriers to this scenario are many, as the Graph keeps a map of data sources that is subject to constant variations: records in repositories vary in content,
original IDs, and PIDs, may disappear or reappear, and the same holds for the repository or the metadata collection it exposes.
Moreover, the mappings applied to the original contents may also change and improve over time to catch up with the changes in the input records.
@ -18,6 +18,10 @@ Such a policy defines a list of data sources that are considered authoritative f
| doi | [Crossref](https://www.crossref.org), [Datacite](https://datacite.org) |
| pmc, pmid | [Europe PubMed Central](https://europepmc.org/), [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc) |
| arXiv | [arXiv.org e-Print Archive](https://arxiv.org/) |
| uniprot | [Protein Data Bank](http://www.pdb.org/) |
| ena | [Protein Data Bank](http://www.pdb.org/) |
| pdb | [Protein Data Bank](http://www.pdb.org/) |
There is an exception though: Handle(s) are minted by several repositories; as listing them all would not be a viable option, to avoid losing them as PIDs, Handles bypass the PID authority filtering rule.
In all other cases, PIDs are included in the graph as alternate identifiers.
@ -43,7 +47,7 @@ OpenAIRE assigns internal identifiers for each object it collects.
By default, the internal identifier is generated as `sourcePrefix::md5(localId)` where:
* `sourcePrefix` is a namespace prefix of 12 chars assigned to the data source at registration time
* `localid` is the identifier assigned to the object by the data source
* `localId` is the identifier assigned to the object by the data source
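The default rule `sourcePrefix::md5(localId)` can be sketched as below. This is a minimal illustration, assuming the local identifier is hashed as UTF-8 text without any normalisation and that the MD5 digest is written in lowercase hexadecimal (as the identifiers shown elsewhere in these pages suggest). The function name `openaire_id` and the prefix `example_____` are hypothetical.

```python
import hashlib


def openaire_id(source_prefix: str, local_id: str) -> str:
    """Build an internal identifier of the form sourcePrefix::md5(localId)."""
    if len(source_prefix) != 12:
        raise ValueError("sourcePrefix must be exactly 12 characters long")
    # Assumption: the local identifier is hashed as-is, encoded as UTF-8.
    digest = hashlib.md5(local_id.encode("utf-8")).hexdigest()
    return f"{source_prefix}::{digest}"


print(openaire_id("example_____", "oai:repository.example.org:12345"))
```

The result is always the 12-character prefix, the `::` separator and a 32-character hexadecimal digest.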
After years of operation, we can say that:
@ -63,12 +67,15 @@ When the record is collected from a source which is not authoritative for any ty
Currently, the following data sources are used as "PID authorities":
| PID Type | Prefix (12 chars) | Authority |
|-----------|------------------------|-----------------------------------------|
|-----------|------------------------|-------------------------------------------|
| doi | `doi_________` | Crossref, Datacite, Zenodo |
| pmc | `pmc_________` | Europe PubMed Central, PubMed Central |
| pmid | `pmid________` | Europe PubMed Central, PubMed Central |
| arXiv | `arXiv_______` | arXiv.org e-Print Archive |
| handle | `handle______` | any repository |
| ena | `ena_________` | EMBL-EBI |
| pdb | `pdb_________` | EMBL-EBI |
| uniprot | `uniprot_____` | EMBL-EBI |
OpenAIRE also performs duplicate identification (see the [dedicated section for details](../../data-provision/deduplication/)).
OpenAIRE also performs duplicate identification (see the [dedicated section for details](/graph-production-workflow/deduplication)).
All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated records).

View File

@ -1,146 +0,0 @@
---
sidebar_position: 2
---
# Relationships
A relationship in the graph is represented by the following data type, which models a directed edge between two nodes and provides information about the semantics of the relation, its provenance and its validation.
---
## The `Relationship` object
### source
_Type: [Node](#the-node-object) &bull; Cardinality: ONE_
Represents the source node in the relation.
```json
"source": {
"id": "20|openorgs____::1cb75a3ad756e4c83e455e3e7347643b",
"type": "organization"
}
```
### target
_Type: [Node](#the-node-object) &bull; Cardinality: ONE_
Represents the target node in the relation.
```json
"target": {
"id": "10|doajarticles::022409068174087a003647ff46070f7f",
"type": "datasource"
}
```
### reltype
_Type: [RelType](#the-reltype-object) &bull; Cardinality: ONE_
Represents the semantics of the relation between two nodes of the graph.
```json
"reltype": {
"name": "provides",
"type": "provision"
}
```
### provenance
_Type: [Provenance](entities/other#provenance-1) &bull; Cardinality: ONE_
Indicates the process that produced (or provided) the information.
```json
"provenance": {
"provenance": "Harvested",
"trust":"0.900"
}
```
### validated
_Type: Boolean &bull; Cardinality: ONE_
Indicates whether or not the relation was validated.
```json
"validated": true
```
### validationDate
_Type: String &bull; Cardinality: ONE_
Indicates the validation date of the relation - applies only when the validated flag is set to true.
```json
"validationDate": "2022-09-02"
```
---
## The `Node` object
The Node data type contains the minimum information needed to identify a graph node, its identifier and entity type.
### id
_Type: String &bull; Cardinality: ONE_
OpenAIRE identifier of the node in the graph.
```json
"id": "10|doajarticles::022409068174087a003647ff46070f7f"
```
### type
_Type: String &bull; Cardinality: ONE_
Graph node type.
```json
"type": "datasource"
```
## The `RelType` object
The RelType data type models the semantics of the relationship between two nodes.
### type
_Type: String &bull; Cardinality: ONE_
Relation category, e.g. affiliation, citation, see table Relation typologies.
```json
"type": "provision"
```
### name
_Type: String &bull; Cardinality: ONE_
Further specifies the relation semantics, indicating the relation direction, e.g. Cites, isCitedBy.
```json
"name": "provides"
```
---
## Relationship types
The following table lists all the possible relation semantics found in the graph dump.
| # | Source entity type | Target entity type | Relation type | Relation name | Inverse relation name |
|:--:|:------------------:|:-------------------:|:-------------:|:---------------------------:|:----------------------------:|
| 1 | [Project](entities/project) | [Result](entities/result) | outcome | produces | isProducedBy |
| 2 | [Result](entities/result) | [Organization](entities/organization) | affiliation | hasAuthorInstitution | isAuthorInstitutionOf |
| 3 | [Result](entities/result) | [Result](entities/result) | similarity | isAmongTopNSimilarDocuments | HasAmongTopNSimilarDocuments |
| 4 | [Project](entities/project) | [Organization](entities/organization) | participation | isParticipant | hasParticipant |
| 5 | [Result](entities/result) | [Result](entities/result) | supplement | isSupplementTo | isSupplementedBy |
| 6 | [Result](entities/result) | [Result](entities/result) | relationship | isRelatedTo | isRelatedTo |
| 7 | [Data source](entities/data-source) | [Organization](entities/organization) | provision | provides | isProvidedBy |
| 8 | [Result](entities/result) | [Data source](entities/data-source) | provision | isHostedBy | hosts |
| 9 | [Result](entities/result) | [Data source](entities/data-source) | provision | isProvidedBy | provides |
| 10 | [Result](entities/result) | [Community](entities/community) | relationship | isRelatedTo | isRelatedTo |
| 11 | [Organization](entities/organization) | [Community](entities/community) | relationship | isRelatedTo | isRelatedTo |
| 12 | [Data source](entities/data-source) | [Community](entities/community) | relationship | isRelatedTo | isRelatedTo |
| 13 | [Project](entities/project) | [Community](entities/community) | relationship | isRelatedTo | isRelatedTo |

View File

@ -0,0 +1,109 @@
---
title: The Relationship object
---
# The `Relationship` object
A relationship in the Graph is represented with the data type presented in this page, which aims to model a directed edge between two nodes, providing information about its semantics, provenance and validation.
### source
_Type: String &bull; Cardinality: ONE_
OpenAIRE identifier of the source node in the graph.
```json
"source": "openorgs____::1cb75a3ad756e4c83e455e3e7347643b"
```
### sourceType
_Type: String &bull; Cardinality: ONE_
Entity type of the source node.
```json
"sourceType": "organization"
```
### target
_Type: String &bull; Cardinality: ONE_
OpenAIRE identifier of the target node in the graph.
```json
"target": "doajarticles::022409068174087a003647ff46070f7f"
```
### targetType
_Type: String &bull; Cardinality: ONE_
Entity type of the target node.
```json
"targetType": "datasource"
```
### reltype
_Type: [RelType](#the-reltype-object) &bull; Cardinality: ONE_
Represents the semantics of the relationship between two nodes of the graph.
```json
"reltype": {
"name": "provides",
"type": "provision"
}
```
### provenance
_Type: [Provenance](/data-model/entities/other#provenance-1) &bull; Cardinality: ONE_
Indicates the process that produced (or provided) the information.
```json
"provenance": {
"provenance": "Harvested",
"trust":"0.900"
}
```
### validated
_Type: Boolean &bull; Cardinality: ONE_
Indicates whether or not the relationship was validated.
```json
"validated": true
```
### validationDate
_Type: String &bull; Cardinality: ONE_
Indicates the validation date of the relationship - applies only when the validated flag is set to true.
```json
"validationDate": "2022-09-02"
```
---
## The `RelType` object
The RelType data type models the semantics of the relationship between two nodes.
### type
_Type: String &bull; Cardinality: ONE_
The relationship category, e.g. affiliation, citation (see [relationship types](./relationship-types)).
```json
"type": "provision"
```
### name
_Type: String &bull; Cardinality: ONE_
Further specifies the relationship semantics, indicating the relationship direction, e.g. Cites, isCitedBy.
```json
"name": "provides"
```
---
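As an illustration of how the fields above fit together, the following sketch assembles the example values from this page into one record and runs a few consistency checks on it. The helper `check_relationship` and its messages are hypothetical, not part of any OpenAIRE library; the rule that `validationDate` must be an ISO date when `validated` is true is an interpretation of the field descriptions.

```python
import json
from datetime import date

REQUIRED = ("source", "sourceType", "target", "targetType",
            "reltype", "provenance", "validated")


def check_relationship(rel):
    """Return a list of problems found in a Relationship record."""
    problems = [f"missing field: {name}" for name in REQUIRED if name not in rel]
    # The RelType object carries both the category (type) and the name.
    for key in ("name", "type"):
        if key not in rel.get("reltype", {}):
            problems.append(f"missing reltype.{key}")
    # validationDate only applies when the validated flag is true.
    if rel.get("validated") is True:
        try:
            date.fromisoformat(rel.get("validationDate", ""))
        except (TypeError, ValueError):
            problems.append("validated relationship without a valid validationDate")
    return problems


record = json.loads("""
{
  "source": "openorgs____::1cb75a3ad756e4c83e455e3e7347643b",
  "sourceType": "organization",
  "target": "doajarticles::022409068174087a003647ff46070f7f",
  "targetType": "datasource",
  "reltype": {"name": "provides", "type": "provision"},
  "provenance": {"provenance": "Harvested", "trust": "0.900"},
  "validated": true,
  "validationDate": "2022-09-02"
}
""")
print(check_relationship(record))
```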

View File

@ -0,0 +1,37 @@
# Relationship types
The following table lists all the possible relation semantics found in the Graph Dataset.
Note: the labels used to specify the semantics of the relationships are (for the most part) inherited from the [DataCite metadata kernel](https://schema.datacite.org/meta/kernel-4.4/doc/DataCite-MetadataKernel_v4.4.pdf), which provides a description for them.
| # | Source entity type | Target entity type | Relation name / inverse | Provenance |
|:--:|:--------------------------------------:|:--------------------------------------:|:----------------------------------------------------------:|:-----------------------------------------------:|
| 1 | [Project](/data-model/entities/project) | [ResearchProduct](../../data-model/entities/research-product) | produces / isProducedBy | Harvested, Inferred by OpenAIRE, Linked by user |
| 2 | [Project](/data-model/entities/project) | [Organization](/data-model/entities/organization) | hasParticipant / isParticipant | Harvested |
| 3 | [Project](/data-model/entities/project) | [Community](/data-model/entities/community) | IsRelatedTo / IsRelatedTo | Linked by user |
| 4 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsAmongTopNSimilarDocuments / HasAmongTopNSimilarDocuments | Inferred by OpenAIRE |
| 5 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsSupplementTo / IsSupplementedBy | Harvested |
| 6 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsRelatedTo / IsRelatedTo | Harvested, Inferred by OpenAIRE, Linked by user |
| 7 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsPartOf / HasPart | Harvested |
| 8 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsDocumentedBy / Documents | Harvested |
| 9 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsObsoletedBy / Obsoletes | Harvested |
| 10 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsSourceOf / IsDerivedFrom | Harvested |
| 11 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsCompiledBy / Compiles | Harvested |
| 12 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsRequiredBy / Requires | Harvested |
| 13 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsCitedBy / Cites | Harvested, Inferred by OpenAIRE |
| 14 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsReferencedBy / References | Harvested |
| 15 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsReviewedBy / Reviews | Harvested |
| 16 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsOriginalFormOf / IsVariantFormOf | Harvested |
| 17 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsVersionOf / HasVersion | Harvested |
| 18 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsIdenticalTo / IsIdenticalTo | Harvested |
| 19 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsPreviousVersionOf / IsNewVersionOf | Harvested |
| 20 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsContinuedBy / Continues | Harvested |
| 21 | [ResearchProduct](../../data-model/entities/research-product) | [ResearchProduct](../../data-model/entities/research-product) | IsDescribedBy / Describes | Harvested |
| 22 | [ResearchProduct](../../data-model/entities/research-product) | [Organization](/data-model/entities/organization) | hasAuthorInstitution / isAuthorInstitutionOf | Harvested, Inferred by OpenAIRE |
| 23 | [ResearchProduct](../../data-model/entities/research-product) | [Data source](/data-model/entities/data-source) | isHostedBy / hosts | Harvested, Inferred by OpenAIRE |
| 24 | [ResearchProduct](../../data-model/entities/research-product) | [Data source](/data-model/entities/data-source) | isProvidedBy / provides | Harvested |
| 25 | [ResearchProduct](../../data-model/entities/research-product) | [Community](/data-model/entities/community) | IsRelatedTo / IsRelatedTo | Harvested, Inferred by OpenAIRE, Linked by user |
| 26 | [Organization](/data-model/entities/organization) | [Community](/data-model/entities/community) | IsRelatedTo / IsRelatedTo | Linked by user |
| 27 | [Organization](/data-model/entities/organization) | [Organization](/data-model/entities/organization) | IsChildOf / IsParentOf | Linked by user |
| 28 | [Data source](/data-model/entities/data-source) | [Community](/data-model/entities/community) | IsRelatedTo / IsRelatedTo | Linked by user |
| 29 | [Data source](/data-model/entities/data-source) | [Organization](/data-model/entities/organization) | isProvidedBy / provides | Harvested |
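
Since every row pairs a relation name with its inverse, the table can be turned into a simple lookup. The sketch below covers only a handful of rows and uses a hypothetical helper `inverse_of`; the lookup is case-insensitive because the capitalisation of the labels is not uniform across the table.

```python
# (name, inverse) pairs copied from a few rows of the table above.
RELATION_PAIRS = [
    ("produces", "isProducedBy"),
    ("hasParticipant", "isParticipant"),
    ("IsSupplementTo", "IsSupplementedBy"),
    ("IsPartOf", "HasPart"),
    ("IsCitedBy", "Cites"),
    ("IsRelatedTo", "IsRelatedTo"),
    ("isProvidedBy", "provides"),
]

# Index both directions, keyed on the lowercased label.
INVERSE = {}
for name, inverse in RELATION_PAIRS:
    INVERSE[name.lower()] = inverse
    INVERSE[inverse.lower()] = name


def inverse_of(relation_name: str) -> str:
    """Return the inverse relation name; raises KeyError if unknown."""
    return INVERSE[relation_name.lower()]
```

Symmetric relations such as `IsRelatedTo` simply map to themselves.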

View File

@ -1,23 +0,0 @@
---
sidebar_position: 3
---
# Extraction of Acknowledged Concepts
| Property | Description |
| --- | --- |
| Short description | Scans the plaintexts of publications for acknowledged concepts, including grant identifiers (projects) of funders, accession numbers of bioentities, EPO patent mentions, as well as custom concepts that can link research objects to specific research communities and initiatives in OpenAIRE. |
| Authority | ATHENA Research Center, Greece |
| Licence | CC-BY/CC-0 |
| Algorithmic details | The algorithm processes the publication's fulltext and extracts references to acknowledged concepts. It applies pattern matching and string join between the fulltext and a target database which contains the title, the acronym and the identifier of the searched concept. |
| Parameters | Concept titles, acronyms, and identifiers, publication's identifiers and fulltexts |
| Limitations | N/A |
| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
| References & resources | [Foufoulas, Y., Zacharia, E., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2022). DETEXA: Declarative Extensible Text Exploration and Analysis. In: , et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham.](https://doi.org/10.1007/978-3-031-16802-4_9) |

View File

@ -1,23 +0,0 @@
---
sidebar_position: 4
---
# Extraction of Cited Concepts
| Property | Description |
| --- | --- |
| Short description | Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs. |
| Authority | ATHENA Research Center, Greece |
| Licence | CC-BY/CC-0 |
| Algorithmic details | The algorithm extracts citations to specific datasets and software. It extracts the citation section of a publication's fulltext and applies string matching against a target database which includes an inverted index with dataset/software titles, urls and other metadata. |
| Parameters | Title, URL, creator names, publisher names and publication year for each concept to create the target database. Identifier and publication's fulltext to extract the cited concepts. |
| Limitations | N/A |
| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
| References & resources | [Foufoulas Y., Stamatogiannakis L., Dimitropoulos H., Ioannidis Y. (2017) “High-Pass Text Filtering for Citation Matching”. In: Kamps J., Tsakonas G., Manolopoulos Y., Iliadis L., Karydis I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham.](https://doi.org/10.1007/978-3-319-67008-9_28) |

View File

@ -1,23 +0,0 @@
---
sidebar_position: 5
---
# Classifiers
| Property | Description |
| --- | --- |
| Short description | A document classification algorithm that employs analysis of free text stemming from the abstracts of the publications. The purpose of applying a document classification module is to assign a scientific text to one or more predefined content classes. |
| Authority | ATHENA Research Center, Greece |
| Licence | CC-BY/CC-0 |
| Algorithmic details | The algorithm classifies publication's fulltexts using a Bayesian classifier and weighted terms according to an offline training phase. The training has been done using the following taxonomies: arXiv, MeSH (Medical Subject Headings), ACM, and DDC (Dewey Decimal Classification, or Dewey Decimal System). |
| Parameters | Publication's identifier and fulltext |
| Limitations | N/A |
| Code repository | https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction |
| Environment | Python, madIS (https://github.com/madgik/madis), APSW (https://github.com/rogerbinns/apsw) |
| References & resources | [Giannakopoulos, T., Stamatogiannakis, E., Foufoulas, I., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2014). Content Visualization of Scientific Corpora Using an Extensible Relational Database Implementation. In: Bolikowski, Ł., Casarosa, V., Goodale, P., Houssos, N., Manghi, P., Schirrwagen, J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham.](https://doi.org/10.1007/978-3-319-08425-1_10) |

View File

@ -1,44 +0,0 @@
# Enrichment
## Mining
The OpenAIRE Research Graph is enriched by links mined by OpenAIRE's full-text mining algorithms that scan the plaintexts of publications for funding information, references to datasets, software URIs, accession numbers of bioentities, and EPO patent mentions. Custom mining modules also link research objects to specific research communities, initiatives and infrastructures. In addition, other inference modules provide content-based document classification, document similarity, citation matching, and author affiliation matching.
**Project mining** in OpenAIRE text mines the full-texts of publications in order to extract matches to funding project codes/IDs. The mining algorithm works by utilising (i) the grant identifier, and (ii) the project acronym (if available) of each project. The mining algorithm: (1) Preprocesses/normalizes the full-texts using several functions, which depend on the characteristics of each funder (i.e., the format of the grant identifiers), such as stopword and/or punctuation removal, tokenization, stemming, converting to lowercase; then (2) String matching of grant identifiers against the normalized text is done using database techniques; and (3) The results are validated and cleaned using the context near the match by looking at the context around the matched ID for relevant metadata and positive or negative words/phrases, in order to calculate a confidence value for each publication-->project link. A confidence threshold is set to optimise high accuracy while minimising false positives, such as matches with page or report numbers, post/zip codes, parts of telephone numbers, DOIs or URLs, accession numbers. The algorithm also applies rules for disambiguating results, as different funders can share identical project IDs; for example, grant number 633172 could refer to H2020 project EuroMix but also to Australian-funded NHMRC project “Brain activity (EEG) analysis and brain imaging techniques to measure the neurobiological effects of sleep apnea”. Project mining works very well and was the first Text & Data Mining (TDM) service of OpenAIRE. Performance results vary from funder to funder but precision is higher than 98% for all funders and 99.5% for EC projects. Recall is higher than 95% (99% for EC projects), when projects are properly acknowledged using project/grant IDs.
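The three steps above (normalisation, string matching, context-based validation) can be sketched roughly as follows. This is not the OpenAIRE implementation: the word lists, the scoring weights, the window size and the threshold are all hypothetical values chosen for illustration, and the real algorithm applies funder-specific rules that are not reproduced here.

```python
import re

# Hypothetical context cues; the real lists are funder-specific.
POSITIVE = ("grant", "funded", "project", "agreement")
NEGATIVE = ("doi", "tel", "phone", "page", "http")


def normalise(text):
    """Step 1: lowercase and collapse whitespace."""
    return re.sub(r"\s+", " ", text.lower())


def find_grant_mentions(fulltext, grant_ids, window=60, threshold=0.5):
    """Return (grant_id, confidence) pairs whose confidence exceeds the threshold."""
    text = normalise(fulltext)
    links = []
    for gid in grant_ids:
        # Step 2: match the identifier only when it is not part of a longer number.
        pattern = r"(?<!\d)" + re.escape(gid) + r"(?!\d)"
        for match in re.finditer(pattern, text):
            # Step 3: score the match using the words around it.
            context = text[max(0, match.start() - window):match.end() + window]
            score = 0.5
            score += 0.25 * sum(word in context for word in POSITIVE)
            score -= 0.25 * sum(word in context for word in NEGATIVE)
            score = max(0.0, min(1.0, score))
            if score > threshold:
                links.append((gid, score))
    return links
```

With these toy settings, an acknowledgement such as "funded by the EC under grant agreement 633172" is accepted, whereas the same digits appearing next to "Tel:" and "page" are rejected.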
**Dataset extraction** runs on publications full-texts as described in “High pass text-filtering for Citation matching”, TPDL 2017[1]. In particular, we search for citations to datasets using their DOIs, titles and other metadata (i.e., dates, creator names, publishers, etc.). We extract parts of the text which look like citations and search for datasets using database join and pattern matching techniques. Based on the experiments described in the paper, precision of the dataset extraction module is 98.5% and recall is 97.4% but it is also probably overestimated since it does not take into account corruptions that may take place during pdf to text extraction. It is calculated on the extracted full-texts of small samples from PubMed and arXiv.
**Software extraction** runs also on parts of the text which look like citations. We search the citations for links to software in open software repositories, specifically github, sourceforge, bitbucket and the google code archive. After that, we search for links that are included in Software Heritage (SH, https://www.softwareheritage.org) and return the permanent URL that SH provides for each software project. We also enrich this content with user names, titles and descriptions of the software projects using web mining techniques. Since software mining is based on URL matching, our precision is 100% (we return a software link only if we find it in the text and there is no need to disambiguate). As for recall rate, this is not calculable for this mining task. Although we apply all the necessary normalizations to the URLs in order to overcome usual issues (e.g., http or https, existence of www or not, lower/upper case), we do not calculate cases where a software is mentioned using its name and not by a link from the supported software repositories.
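The URL normalisations mentioned above (scheme, `www` prefix, letter case) can be sketched as follows. The host names in `SUPPORTED_HOSTS` are assumptions about how the four repositories are addressed, the function names are hypothetical, and lowercasing the whole path is a simplification that may not be appropriate for every repository.

```python
import re
from urllib.parse import urlsplit

# Assumed host names for the supported software repositories.
SUPPORTED_HOSTS = {"github.com", "sourceforge.net", "bitbucket.org", "code.google.com"}


def normalise_url(url):
    """Drop the scheme, a leading 'www.', a trailing slash and case differences."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to detect the host
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/").lower()


def extract_software_links(citation_text):
    """Return normalised links to supported repositories found in the text."""
    links = []
    for raw in re.findall(r"https?://[^\s)>\]]+", citation_text, flags=re.IGNORECASE):
        link = normalise_url(raw.rstrip(".,;"))
        if link.split("/", 1)[0] in SUPPORTED_HOSTS and link not in links:
            links.append(link)
    return links
```

Two spellings of the same link, such as `HTTPS://WWW.GitHub.com/Foo/Bar` and `http://github.com/foo/bar/`, normalise to the same string and can therefore be matched against a single entry.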
**For the extraction of bio-entities**, we focus on Protein Data Bank (PDB) entries. We have downloaded the database with PDB codes and we update it regularly. We search the whole full-text of each publication for references to PDB codes. We apply disambiguation rules (e.g., some PDB codes coincide with antibody codes) so that we return valid results. Current precision is 98%. Although it is risky to report recall rates, since these are usually overestimated, we have calculated a recall rate of 98% using small samples of PubMed publications. Moreover, our technique is able to identify about 30% more links to proteins than the ones that are tagged in the PubMed XML records.
**Other text-mining modules** include mining for links to EPO patents, or custom mining modules for linking research objects to specific research communities, initiatives and infrastructures, e.g. COVID-19 mining module. Apart from text-mining modules, OpenAIRE also provides a document classification service that employs analysis of free text stemming from the abstracts of the publications. The purpose of applying a document classification module is to assign a scientific text one or more predefined content classes. In OpenAIRE, the currently used taxonomies are arXiv, MeSH (Medical Subject Headings), ACM and DDC (Dewey Decimal Classification, or Dewey Decimal System).
## Bulk Tagging/Deduction
The Deduction process (also known as “bulk tagging”) enriches each record with new information that can be derived from the existing property values.
As of September 2020, three procedures are in place to relate a research product to a research initiative, infrastructure (RI) or community (RC) based on:
* subjects (2.7M results tagged)
* Zenodo community (16K results tagged)
* the data source it comes from (250K results tagged)
The list of subjects, Zenodo communities and data sources used to enrich the products are defined by the managers of the community gateway or infrastructure monitoring dashboard associated with the RC/RI.
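A minimal sketch of the three tagging criteria, assuming each community is configured with sets of subjects, Zenodo communities and data sources. The field names, the function `bulk_tag` and the example values are hypothetical, and exact string matching is a simplification.

```python
CRITERIA_KEYS = ("subjects", "zenodo_communities", "datasources")


def bulk_tag(product, criteria):
    """Return the communities whose criteria match the product on at least one key."""
    tags = []
    for community, rules in criteria.items():
        # A single overlap on any of the three criteria is enough to tag the product.
        if any(set(product.get(key, ())) & set(rules.get(key, ()))
               for key in CRITERIA_KEYS):
            tags.append(community)
    return sorted(tags)


# Illustrative configuration, as a community manager might define it.
criteria = {
    "sdsn-gr": {"subjects": {"sustainable development"}},
    "covid-19": {"zenodo_communities": {"covid-19"}},
}
product = {"subjects": ["sustainable development"], "datasources": ["d1"]}
print(bulk_tag(product, criteria))
```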
## Propagation
This process “propagates” properties and links from one product to another if between the two there is a “strong” semantic relationship.
As of September 2020, the following procedures are in place:
* Propagation of the property “country” to results from institutional repositories: e.g. publication collected from an institutional repository maintained by an Italian university will be enriched with the property “country = IT”.
* Propagation of links to projects: e.g. publication linked to project P “is supplemented by” a dataset D. Dataset D will get the link to project P. The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
* Propagation of related community/infrastructure/initiative from organizations to products via affiliation relationships: e.g. a publication with an author affiliated with organization O. The manager of the community gateway C declared that the outputs of O are all relevant for his/her community C. The publication is tagged as relevant for C.
* Propagation of related community/infrastructure/initiative to related products: e.g. publication associated to community C is supplemented by a dataset D. Dataset D will get the association to C. The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
* Propagation of ORCID identifiers to related products, if the products have the same authors: e.g. publication has ORCID for its authors and is supplemented by a dataset D. Dataset D has the same authors as the publication. Authors of D are enriched with the ORCIDs available in the publication. The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
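The propagation of project links can be sketched as a single hop over supplement relationships. The data layout and the function name `propagate_projects` are hypothetical, and the sketch assumes each edge is given as a (product, supplement) pair; the other procedures in the list follow the same pattern with a different property.

```python
def propagate_projects(project_links, supplement_edges):
    """Copy project links from a product to the products that supplement it.

    project_links: dict mapping a product id to a set of project ids.
    supplement_edges: iterable of (product, supplement) pairs, i.e. the
    product "is supplemented by" the supplement.
    """
    result = {product: set(projects) for product, projects in project_links.items()}
    for product, supplement in supplement_edges:
        # Read from the original links so that propagation stays one hop deep.
        inherited = project_links.get(product, set())
        result.setdefault(supplement, set()).update(inherited)
    return result


links = {"publication-1": {"project-P"}}
edges = [("publication-1", "dataset-D")]
print(propagate_projects(links, edges))
```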

View File

@ -1,73 +0,0 @@
---
sidebar_position: 2
---
# Impact scores
<span className="todo">TODO - add intro</span>
## Citation Count (CC)
This is the most widely used scientific impact indicator, which sums all citations received by each article. The citation count of a
publication $i$ corresponds to the in-degree of the corresponding node in the underlying citation network: $s_i = \sum_{j} A_{i,j}$,
where $A$ is the adjacency matrix of the network (i.e., $A_{i,j}=1$ when paper $j$ cites paper $i$, while $A_{i,j}=0$ otherwise).
Citation count can be viewed as a measure of a publication's overall impact, since it conveys the number of other works that directly
drew on it.
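As a small illustration, the in-degree can be computed directly from a list of citation edges without building the adjacency matrix. The edge representation as (citing, cited) pairs and the function name are assumptions made for this sketch.

```python
from collections import Counter


def citation_counts(citations):
    """Count incoming citations; citations is a list of (citing, cited) pairs."""
    return Counter(cited for _citing, cited in citations)


edges = [("b", "a"), ("c", "a"), ("c", "b")]
counts = citation_counts(edges)
print(counts["a"], counts["b"], counts["c"])
```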
## "Incubation" Citation Count (iCC)
This measure is essentially a time-restricted version of the citation count, where the time window is distinct for each paper, i.e.,
only citations received within $y$ years of its publication are counted (usually, $y=3$). The "incubation" citation count of a paper $i$ is
calculated as: $s_i = \sum_{j,t_j \leq t_i+y} A_{i,j}$, where $A$ is the adjacency matrix and $t_j, t_i$ are the citing and cited paper's
publication years, respectively. iCC can be seen as an indicator of a paper's initial momentum
(impulse) directly after its publication.
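The time restriction can be sketched by filtering the citation edges on the publication years of the two papers. The function name and the data layout, (citing, cited) pairs plus a dictionary of publication years, are assumptions made for this illustration.

```python
def incubation_citation_counts(citations, year, y=3):
    """Count citations made at most y years after the cited paper's publication."""
    counts = {paper: 0 for paper in year}
    for citing, cited in citations:
        # Keep the citation only if t_j <= t_i + y.
        if year[citing] <= year[cited] + y:
            counts[cited] += 1
    return counts


year = {"a": 2010, "b": 2012, "c": 2020}
edges = [("b", "a"), ("c", "a"), ("c", "b")]
print(incubation_citation_counts(edges, year))
```

In this toy network, paper `a` keeps the citation from 2012 but not the one from 2020.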
## PageRank (PR)
Originally developed to rank Web pages, PageRank has been also widely used to rank publications in citation
networks. In this latter context, a publication's PageRank
score also serves as a measure of its influence. In particular, the PageRank score of a publication is calculated
as its probability of being read by a researcher that either randomly selects publications to read or selects
publications based on the references of her latest read. Formally, the score of a publication $i$ is given by:
$$
s_i = \alpha \cdot \sum_{j} P_{i,j} \cdot s_j + (1-\alpha) \cdot \frac{1}{N}
$$
where $P$ is the stochastic transition matrix, which corresponds to the column normalised version of adjacency
matrix $A$, $\alpha \in [0,1]$, and $N$ is the number of publications in the citation network. The first addend
of the equation corresponds to the selection (with probability $\alpha$) of following a reference, while the
second one to the selection of randomly choosing any publication in the network. It should be noted that the
score of each publication relies on the score of publications citing it (the algorithm is executed iteratively
until all scores converge). As a result, PageRank differentiates citations based on the importance of citing
articles, thus alleviating the corresponding issue of the Citation Count.
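The iterative computation can be sketched with a plain power iteration. The value of $\alpha$, the tolerance and the uniform redistribution of the score of publications without references are assumptions of this sketch, not necessarily the settings used for the Graph.

```python
def pagerank(citations, papers, alpha=0.5, tol=1e-10, max_iter=1000):
    """Power iteration for s_i = alpha * sum_j P_ij * s_j + (1 - alpha) / N."""
    n = len(papers)
    refs = {p: [] for p in papers}
    for citing, cited in citations:
        refs[citing].append(cited)
    scores = {p: 1.0 / n for p in papers}
    for _ in range(max_iter):
        # Assumption: papers without references spread their score uniformly.
        dangling = sum(scores[p] for p in papers if not refs[p])
        new = {p: (1 - alpha) / n + alpha * dangling / n for p in papers}
        for j in papers:
            for i in refs[j]:
                # Column normalisation: j splits its score among its references.
                new[i] += alpha * scores[j] / len(refs[j])
        if sum(abs(new[p] - scores[p]) for p in papers) < tol:
            return new
        scores = new
    return scores


scores = pagerank([("b", "a"), ("c", "a")], ["a", "b", "c"])
```

In this three-paper example, `a` is cited by both other papers and therefore ends up with the highest score, while `b` and `c` receive equal scores.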
## RAM
RAM is essentially a modified Citation Count, where recent citations are considered of higher importance compared
to older ones. Hence, it better captures the popularity of publications. This "time-awareness" of citations
alleviates the bias of methods like Citation Count and PageRank against recently published articles, which have
not had "enough" time to gather as many citations. The RAM score of each paper $i$ is calculated as follows:
$$
s_i = \sum_j{R_{i,j}}
$$
where $R$ is the so-called Retained Adjacency Matrix (RAM) and $R_{i,j}=\gamma^{t_c-t_j}$ when publication $j$ cites publication
$i$, and $R_{i,j}=0$ otherwise. Parameter $\gamma \in (0,1)$, $t_c$ corresponds to the current year and $t_j$ corresponds to the
publication year of citing article $j$.
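A short sketch of the RAM score, where each citation is weighted by $\gamma^{t_c-t_j}$. The function name, the data layout and the value of $\gamma$ are chosen for illustration only.

```python
def ram_scores(citations, year, current_year, gamma=0.5):
    """Sum gamma ** (t_c - t_j) over the citing papers j of each paper i."""
    scores = {paper: 0.0 for paper in year}
    for citing, cited in citations:
        # Older citations contribute less, since 0 < gamma < 1.
        scores[cited] += gamma ** (current_year - year[citing])
    return scores


year = {"a": 2015, "b": 2016, "c": 2023}
edges = [("b", "a"), ("c", "a")]
scores = ram_scores(edges, year, current_year=2024)
```

Here the citation from 2023 contributes $0.5^1$ to the score of `a`, while the one from 2016 contributes only $0.5^8$.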
## AttRank
AttRank is a PageRank variant that alleviates its bias against recent publications (i.e., it is tailored to capture popularity).
AttRank achieves this by modifying PageRank's probability of randomly selecting a publication. Instead of using a uniform probability,
AttRank defines it based on a combination of the publication's age and the citations it received in recent years. The AttRank score
of each publication $i$ is calculated based on:
$$
s_i = \alpha \cdot \sum_{j} P_{i,j} \cdot s_j
+ \beta \cdot Att(i)+ \gamma \cdot c \cdot e^{-\rho \cdot (t_c-t_i)}
$$
where $\alpha + \beta + \gamma =1$ and $\alpha,\beta,\gamma \in [0,1]$. $Att(i)$ denotes a recent attention-based score for publication $i$,
which reflects its share of citations in the $y$ most recent years, $t_i$ is the publication year of article $i$, $t_c$ denotes the current
year, and $c$ is a normalisation constant. Finally, $P$ is the stochastic transition matrix.
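The formula can be sketched by extending a PageRank-style iteration with the two extra terms. All parameter values below are hypothetical, as are the exact boundary of the "recent years" window, the uniform fallback when there are no recent citations, and the uniform handling of publications without references.

```python
import math


def attrank(citations, year, current_year, alpha=0.2, beta=0.5, gamma=0.3,
            rho=0.1, y=3, iterations=100):
    """Iterate s_i = alpha * sum_j P_ij s_j + beta * Att(i) + gamma * c * exp(-rho (t_c - t_i))."""
    papers = list(year)
    n = len(papers)
    refs = {p: [] for p in papers}
    for citing, cited in citations:
        refs[citing].append(cited)

    # Att(i): share of the citations made during the y most recent years.
    recent = {p: 0 for p in papers}
    for citing, cited in citations:
        if year[citing] > current_year - y:
            recent[cited] += 1
    total_recent = sum(recent.values())
    att = {p: recent[p] / total_recent if total_recent else 1.0 / n for p in papers}

    # Age term, with c chosen so that the terms sum to 1 over all papers.
    age = {p: math.exp(-rho * (current_year - year[p])) for p in papers}
    c = 1.0 / sum(age.values())

    scores = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        # Assumption: papers without references spread their score uniformly.
        dangling = sum(scores[p] for p in papers if not refs[p])
        new = {p: alpha * dangling / n + beta * att[p] + gamma * c * age[p]
               for p in papers}
        for j in papers:
            for i in refs[j]:
                new[i] += alpha * scores[j] / len(refs[j])
        scores = new
    return scores


year = {"p_old": 2010, "p_new": 2021, "c_old": 2012, "c_new": 2023}
edges = [("c_old", "p_old"), ("c_new", "p_new")]
scores = attrank(edges, year, current_year=2024)
```

Both cited papers have one citation each, but the recently cited `p_new` receives the attention term and the larger age term, so it ranks above `p_old`.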

View File

@ -1,21 +0,0 @@
---
sidebar_position: 1
---
# Mining algorithms
The Text and Data Mining (TDM) algorithms used for enriching the OpenAIRE Graph are grouped in the following main categories:
[Extraction of acknowledged concepts](acks.md)
[Extraction of cited concepts](cites.md)
[Document Classification](classified.md)
<span className="todo">TODO</span>

View File

@ -1,7 +0,0 @@
---
sidebar_position: 6
---
# Stats analysis
The OpenAIRE Research Graph is also processed by a pipeline that extracts statistics and produces the charts for funders, research initiatives, infrastructures, and policy makers that you can see on MONITOR. Based on the information available in the graph, OpenAIRE provides a set of indicators for monitoring funding and research impact and the uptake of Open Science publishing practices, such as Open Access publishing of publications and datasets, availability of interlinks between research products, availability of post-print versions in institutional or thematic Open Access repositories, etc.

View File

@ -1,17 +0,0 @@
---
sidebar_position: 4
---
# Bulk downloads
To accommodate different needs, several dumps are available, all published under the Zenodo community called [OpenAIRE Research Graph](https://zenodo.org/communities/openaire-research-graph).
Here we provide detailed documentation about the full dump:
* JSON dump: https://doi.org/10.5281/zenodo.3516917
* JSON schema: https://doi.org/10.5281/zenodo.4238938
:::note Tip!
For a visual and interactive overview of the JSON schema, we suggest using a JSON schema viewer like [jsonschemaviewer](https://navneethg.github.io/jsonschemaviewer/) (you just need to copy the schema and then you can easily navigate through the nodes).
:::

View File

@ -0,0 +1,30 @@
---
sidebar_position: 1
---
# CfHbKeyValue
Information about the sources from which the record has been collected.
### key
_Type: String &bull; Cardinality: ONE_
The OpenAIRE identifier of the data source.
```json
"key":"openaire____::081b82f96300b6a6e3d282bad31cb6e2"
```
### value
_Type: String &bull; Cardinality: ONE_
The name of the data source.
```json
"value":"Crossref"
```

View File

@ -0,0 +1,37 @@
---
sidebar_position: 1
---
# CommunityInstance
It is a subclass of [Instance](../../data-model/entities/research-product#instance) extended with information regarding the collection and hosting source for this materialization of the research product.
### hostedby
_Type: [CfHbKeyValue](./cfhb) &bull; Cardinality: ONE_
Information about the source from which the instance can be viewed or downloaded.
```json
"hostedby": {
"key": "issn___print::35ee75a5ad42581d604be113a8f56427",
"value": "New Phytologist"
},
```
### collectedfrom
_Type: [CfHbKeyValue](./cfhb) &bull; Cardinality: ONE_
Information about the source from which the record has been collected.
```json
"collectedfrom": {
"key": "openaire____::081b82f96300b6a6e3d282bad31cb6e2",
"value": "Crossref"
}
```

View File

@ -0,0 +1,46 @@
---
sidebar_position: 1
---
# Context
Information about the research initiative/community (RI/RC) related to the research product.
### code
_Type: String &bull; Cardinality: ONE_
Code identifying the RI/RC.
```json
"code":"sdsn-gr"
```
### label
_Type: String &bull; Cardinality: ONE_
Label of the RI/RC.
```json
"label":"SDSN - Greece"
```
### provenance
_Type: [Provenance](/data-model/entities/other#provenance-2) &bull; Cardinality: MANY_
The reason why this research product is associated with the RI/RC.
```json
"provenance":[{
"provenance":"Inferred by OpenAIRE",
"trust":"0.9"
},
...
]
```

View File

@ -0,0 +1,140 @@
---
sidebar_position: 1
---
# Extended Research Product
It is a subclass of [ResearchProduct](../../data-model/entities/research-product) extended with information regarding projects (and funders), research communities/infrastructure and related data sources.
### projects
_Type: [Project](project.md) &bull; Cardinality: MANY_
List of projects (i.e. grants) that (co-)funded the production of the research products.
```json
"projects": [
{
"id": "corda__h2020::94c4a066401e22002c4811a301bb4655",
"code": "727929",
"acronym": "TomRes",
"title": "A NOVEL AND INTEGRATED APPROACH TO INCREASE MULTIPLE AND COMBINED STRESS TOLERANCE IN PLANTS USING TOMATO AS A MODEL",
"funder": {
"shortName": "EC",
"name": "European Commission",
"jurisdiction": "EU",
"fundingStream": "H2020"
},
"provenance": {
"provenance": "Harvested",
"trust": "0.900000000000000022"
},
"validated": {
"validationDate": "2021-0101",
"validatedByFunder": true
}
},
...
]
```
### context
_Type: [Context](./context) &bull; Cardinality: MANY_
Reference to the relevant research infrastructures, initiatives or communities (RI/RC) among those collaborating with OpenAIRE. Only publicly visible RI/RCs are included; please see https://connect.openaire.eu.
```json
"context":[
{
"code":"sdsn-gr",
"label":"SDSN - Greece",
"provenance":[
{
"provenance":"Inferred by OpenAIRE",
"trust":"0.9"
}
]
},
...
]
```
### collectedfrom
_Type: [CfHbKeyValue](./cfhb) &bull; Cardinality: MANY_
Information about the sources from which the record has been collected.
```json
"collectedfrom":[
{
"key":"openaire____::081b82f96300b6a6e3d282bad31cb6e2",
"value":"Crossref"
},
...
]
```
### instance
_Type: [CommunityInstance](./communityInstance) &bull; Cardinality: MANY_
Information about the source from which the instance can be viewed or downloaded.
```json
"instance": [
{
"license": "http://doi.wiley.com/10.1002/tdm_license_1.1",
"accessright": {
"code": "c_16ec",
"label": "RESTRICTED",
"scheme": "http://vocabularies.coar-repositories.org/documentation/access_rights/",
"openAccessRoute": null
},
"type": "Article",
"url": [
"https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1111%2Fnph.15014",
"http://onlinelibrary.wiley.com/wol1/doi/10.1111/nph.15014/fullpdf",
"http://dx.doi.org/10.1111/nph.15014"
],
"publicationdate": "2018-02-09",
"refereed": "UNKNOWN",
"hostedby": {
"key": "issn___print::35ee75a5ad42581d604be113a8f56427",
"value": "New Phytologist"
},
"collectedfrom": {
"key": "openaire____::081b82f96300b6a6e3d282bad31cb6e2",
"value": "Crossref"
}
},
...
]
```

View File

@ -0,0 +1,72 @@
---
sidebar_position: 1
---
# Funder
Information about the funder funding the project.
### fundingStream
_Type: String &bull; Cardinality: ONE_
Funding information for the project.
```json
"fundingStream": "H2020"
```
### jurisdiction
_Type: String &bull; Cardinality: ONE_
Geographical jurisdiction (e.g. EU for the European Commission, HR for the Croatian Science Foundation).
```json
"jurisdiction": "EU"
```
### name
_Type: String &bull; Cardinality: ONE_
The name of the funder.
```json
"name": "European Commission"
```
### shortName
_Type: String &bull; Cardinality: ONE_
The short name of the funder.
```json
"shortName": "EC"
```

View File

@ -0,0 +1,134 @@
---
sidebar_position: 1
---
# Project
The information about the projects related to a research product.
### id
_Type: String &bull; Cardinality: ONE_
Main entity identifier, created according to the [OpenAIRE entity identifier and PID mapping policy](../../data-model/pids-and-identifiers).
```json
"id": "corda__h2020::70ea22400fd890c5033cb31642c4ae68"
```
### code
_Type: String &bull; Cardinality: ONE_
The grant agreement code of the project.
```json
"code": "777541"
```
### acronym
_Type: String &bull; Cardinality: ONE_
Project's acronym.
```json
"acronym": "OpenAIRE-Advance"
```
### title
_Type: String &bull; Cardinality: ONE_
Project's title.
```json
"title": "OpenAIRE Advancing Open Scholarship"
```
### funder
_Type: [Funder](funder.md) &bull; Cardinality: ONE_
Information about the funder funding the project.
```json
"funder": {
"shortName": "EC",
"name": "European Commission",
"jurisdiction": "EU",
"fundingStream": "H2020"
}
```
### provenance
_Type: [Provenance](../../data-model/entities/other#provenance-2) &bull; Cardinality: ONE_
The reason why the project is associated with the research product.
```json
"provenance": {
"provenance": "Harvested",
"trust": "0.900000000000000022"
}
```
### validated
_Type: [Validated](validated.md) &bull; Cardinality: ONE_
Specifies whether the association between the project and the research product was validated.
```json
"validated": {
"validationDate": "2021-0101",
"validatedByFunder": true
}
```

View File

@ -0,0 +1,41 @@
---
sidebar_position: 1
---
# Validated
Information about the validation of the association between the research product and the funding information.
### validationDate
_Type: String &bull; Cardinality: ONE_
The date when OpenAIRE collected the association between the funding and the research product from an authoritative source (i.e. Sygma).
```json
"validationDate": "2021-0101"
```
### validatedByFunder
_Type: Boolean &bull; Cardinality: ONE_
Specifies if the validation comes from the funder.
```json
"validatedByFunder": true
```

View File

@ -0,0 +1,16 @@
---
sidebar_position: 2
---
# Beginner's kit
The large size of the OpenAIRE Graph is a major impediment for beginners who want to familiarise themselves with the underlying data model and explore its contents.
Working with the Graph in its full size typically requires access to a large distributed computing infrastructure, which is not easily accessible to everyone.
[The OpenAIRE Beginners Kit](https://doi.org/10.5281/zenodo.7490191) aims to address this issue. It consists of two components:
<!-- :::caution
This version is not accompanied with public dataset files, hence the files in this section are based on [v6.0.0](/docs/6.0.0/) of the Graph. The current data are only exposed via the [OpenAIRE Graph API](https://graph.openaire.eu/develop/) and added-value services that are built on top of this version of the Graph (e.g., the [OpenAIRE Explore](https://explore.openaire.eu/)). If you are interested to get bulk access to our latest data, please contact us via our [helpdesk](https://graph.openaire.eu/support).
::: -->
* A subset of the Graph composed of the research products published between 2022-06-29 and 2022-12-29, all the entities connected to them and the respective relationships.
* A Zeppelin notebook that demonstrates how you can use PySpark to analyse the Graph and get answers to some interesting research questions. A guide to Apache Zeppelin can be found [here](https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_zeppelin-component-guide/content/ch_overview.html).

View File

@ -0,0 +1,50 @@
---
sidebar_position: 1
---
# Full graph dataset
You can download the full OpenAIRE Graph Dataset as well as its schema from the following links:
<!-- :::caution
This version is not accompanied with public dump files, hence the files in this section are based on [v6.0.0](/docs/6.0.0/) of the Graph. The current data are only exposed via the [OpenAIRE Graph API](https://graph.openaire.eu/develop/) and added-value services that are built on top of this version of the Graph (e.g., the [OpenAIRE Explore](https://explore.openaire.eu/)). If you are interested to get bulk access to our latest data, please contact us via our [helpdesk](https://graph.openaire.eu/support).
::: -->
Dataset: https://doi.org/10.5281/zenodo.3516917
Schema: https://doi.org/10.5281/zenodo.4238938
The schema used to create this dataset mirrors the one described in the [Data Model](/data-model).
This dataset is licensed under a Creative Commons Attribution 4.0 International License.
It is composed of several files so that you can download only the parts you are interested in. The files are named after the entity they store (e.g. publication, dataset). Each file is at most 10GB and is a tar archive containing gz files, each with one JSON record per line.
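The archive layout described above (a tar archive of gz files, one JSON record per line) can be read in a streaming fashion without unpacking everything to disk. The following is a minimal sketch using only the Python standard library; the function name `iter_records` and the archive file name in the usage note are illustrative assumptions, not part of the dataset documentation.

```python
import gzip
import json
import tarfile


def iter_records(tar_path):
    """Yield one parsed JSON record per line from every .gz member of a tar archive."""
    with tarfile.open(tar_path, mode="r") as archive:
        for member in archive:
            # Skip directories and anything that is not a gzip part file.
            if not member.isfile() or not member.name.endswith(".gz"):
                continue
            raw = archive.extractfile(member)
            # Decompress the member on the fly and read it line by line.
            with gzip.open(raw, mode="rt", encoding="utf-8") as lines:
                for line in lines:
                    if line.strip():
                        yield json.loads(line)
```

For example, `for record in iter_records("publication_1.tar"): ...` would iterate over the records of a downloaded part (the file name here is hypothetical). Because records are yielded one at a time, memory usage stays low even for the largest files.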
## How to acknowledge this work
Open Science services are open and transparent and survive thanks to your active support and to the visibility and reward they gather. If you use one of the [OpenAIRE Graph datasets](https://doi.org/10.5281/zenodo.3516917) for your research, please provide a proper citation following the recommendation that you find on the dataset's Zenodo page or as provided below.
:::note How to cite
Manghi P., Atzori C., Bardi A., Baglioni M., Schirrwagen J., Dimitropoulos H., La Bruzzo S., Foufoulas I., Mannocci A., Horst M., Czerniak A., Iatropoulou K., Kokogiannaki A., De Bonis M., Artini M., Lempesis A., Ioannidis A., Manola N., Principe P., Vergoulis T., Chatzopoulos S., Pierrakos D. (2022). "OpenAIRE Research Graph Dataset", *Dataset*, Zenodo. [doi:10.5281/zenodo.3516917](https://doi.org/10.5281/zenodo.3516917) ([BibTex](/bibtex/OpenAIRE_Research_Graph_dataset.bib))
:::
Please also consider citing [other relevant research products](/publications#relevant-research-products) that can be of interest.
Also consider adding one of the following badges to your service with the appropriate link to [our website](https://graph.openaire.eu); click on the badges below to download the respective badge image files.
<div className="row">
<div className="col col--4 left-badge">
<a target="_blank" href={require('../assets/badges/openaire-badge-1.zip').default} download>
<img loading="lazy" alt="Openaire badge" src={require('../assets/badges/openaire-badge-1.png').default} className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module pagination-nav__link" style={{ paddingTop: '1.2em', paddingBottom: '1.2em'}} title="Click to download"/>
</a>
</div>
<div className="col col--4 mid-badge">
<a target="_blank" href={require('../assets/badges/openaire-badge-2.zip').default} download>
<img loading="lazy" alt="Openaire badge" src={require('../assets/badges/openaire-badge-2.png').default} className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module pagination-nav__link dark-badge" style={{ paddingTop: '1.2em', paddingBottom: '1.2em'}} title="Click to download"/>
</a>
</div>
<div className="col col--4 right-badge">
<a target="_blank" href={require('../assets/badges/openaire-badge-3.zip').default} download>
<img loading="lazy" alt="Openaire badge" src={require('../assets/badges/openaire-badge-3.png').default} className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module pagination-nav__link" style={{ paddingTop: '1.2em', paddingBottom: '1.2em'}} title="Click to download"/>
</a>
</div>
</div>

View File

@ -0,0 +1,34 @@
---
sidebar_position: 4
---
# Other related datasets
On this page, we list other related datasets; please refer to their respective schema definitions for the data model they follow.
## The dataset of ScholeXplorer
Dataset: https://zenodo.org/doi/10.5281/zenodo.1200252
Schema (Scholix version 3): https://doi.org/10.5281/zenodo.1120275
Schema (Scholix version 4): https://doi.org/10.5281/zenodo.6351557
This dataset is licensed under a CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.
The dataset contains the GZ-compressed dataset of the Scholix links exposed by the OpenAIRE ScholeXplorer service.
## The OpenAIRE LOD dataset
:::caution
The OpenAIRE LOD dataset has been discontinued. The SPARQL Endpoint is no longer supported but old LOD datasets can be found in the link below.
:::
Dataset (RDF): https://doi.org/10.5281/zenodo.609943
<!-- LOD Ontology: http://lod.openaire.eu/vocab
SPARQL Endpoint: http://lod.openaire.eu/sparql -->
The OpenAIRE Linked Open Data (LOD) Services and their integration with the OpenAIRE information space have been released as a beta version. The LOD exporting process started with a specification of the OpenAIRE data model as an RDF vocabulary, followed by a mapping of the OpenAIRE data to the graph-based RDF data model. To interlink the OpenAIRE data with related data on the Web, we have identified a list of potential datasets to interlink with, including the DBpedia dataset extracted from Wikipedia and the publication databases DBLP and CiteSeer.
<!-- Please refer [here](http://lod.openaire.eu/documentation) for more details on the LOD documentation. -->

View File

@ -0,0 +1,70 @@
---
sidebar_position: 3
---
# Sub-graph datasets
To make access easier, different datasets are available under the Zenodo community called [OpenAIRE Graph](https://zenodo.org/communities/openaire-research-graph).
This page lists all alternative datasets currently available.
<!-- :::caution
This version is not accompanied with public dataset files, hence the files in this section are based on [v6.0.0](/docs/6.0.0/) of the Graph. The current data are only exposed via the [OpenAIRE Graph API](https://graph.openaire.eu/develop/) and added-value services that are built on top of this version of the Graph (e.g., the [OpenAIRE Explore](https://explore.openaire.eu/)). If you are interested to get bulk access to our latest data, please contact us via our [helpdesk](https://graph.openaire.eu/support).
::: -->
## The OpenAIRE COVID-19 dataset
Dataset: https://doi.org/10.5281/zenodo.3980490
Schema: https://doi.org/10.5281/zenodo.3974225
This dataset is licensed under a Creative Commons Attribution 4.0 International License.
It contains metadata records of publications, research data, software and projects on the topic of Corona Virus and COVID-19.
This dataset is part of the activities of OpenAIRE to support the fight against COVID-19 together with the OpenAIRE COVID-19 Gateway.
The dataset consists of a tar archive containing gzip files with one json per line. Please refer [here](#alternative-sub-graph-data-model) for details on the data model of this dataset.
## The dataset of funded products
Dataset: https://doi.org/10.5281/zenodo.4559725
Schema: https://doi.org/10.5281/zenodo.3974225
This dataset is licensed under a Creative Commons Attribution 4.0 International License.
It contains metadata records of research products (research literature, data, software, other types of research products) with funding
information available in the OpenAIRE Graph. Records are grouped by funder in a dedicated archive file. Each tar archive contains
gzip files, each with one json record per line. The model of this dataset differs from the one of the whole graph.
Please refer [here](#alternative-sub-graph-data-model) for details on the data model of this dataset.
## The dataset of delta projects
Dataset: https://doi.org/10.5281/zenodo.6419021
Schema: https://doi.org/10.5281/zenodo.4238938
This dataset is licensed under a Creative Commons Attribution 4.0 International License.
It contains the metadata records of projects collected by OpenAIRE in a given time frame. Usually, one deposition of collected projects is done for each release of the OpenAIRE Graph.
The deposition is one tar archive containing gzip files, each with one json record per line.
## The datasets about research communities, initiatives and infrastructures
Dataset: https://doi.org/10.5281/zenodo.3974604
Schema: https://doi.org/10.5281/zenodo.3974225
This dataset is licensed under a Creative Commons Attribution 4.0 International License.
The dataset contains one file per community/initiative/infrastructure collaborating with OpenAIRE. Check out also their community gateways on
CONNECT. Each file is a tar archive containing gzip files with one json per line. Only publicly visible communities, research initiatives and infrastructures are included.
The model of this dataset differs from the one of the whole graph.
Please refer [here](#alternative-sub-graph-data-model) for details on the data model of this dataset.
---
## Alternative sub-graph data model
It should be noted that the datasets for research communities, infrastructures, and products related to projects do not strictly follow the main data model of the OpenAIRE Graph. In particular, they differ in the following:
* only research products are included (no relations or other entities)
* the research products are extended with information that can be inferred from the whole dataset, namely:
  * funding information, if present
  * associated research community/infrastructure
  * associated data sources
So they have just one entity type, that is the [Extended Research Product](./alternative-model/extended-research-product.md).
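Since every line of these sub-graph datasets is a single Extended Research Product, the extra information can be read directly from each record. The sketch below assumes the `projects`, `funder`, `shortName`, `context` and `code` fields documented for the alternative model; the helper names `funders_of` and `community_codes` and the sample record are illustrative only.

```python
import json


def funders_of(product):
    """Return the sorted short names of the funders linked to an extended research product."""
    names = {
        project["funder"]["shortName"]
        for project in product.get("projects") or []
        if project.get("funder") and project["funder"].get("shortName")
    }
    return sorted(names)


def community_codes(product):
    """Return the codes of the RI/RCs (the `context` field) linked to the product."""
    return [ctx["code"] for ctx in product.get("context") or [] if "code" in ctx]


# A shortened, hand-written record used only for illustration.
line = (
    '{"projects": [{"code": "727929", "funder": {"shortName": "EC"}}], '
    '"context": [{"code": "sdsn-gr", "label": "SDSN - Greece"}]}'
)
record = json.loads(line)
print(funders_of(record), community_codes(record))
```

Both helpers tolerate missing fields, since funding and community information is only present when it could be inferred for the product.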

View File

@ -0,0 +1,8 @@
{
"label": "Graph production workflow",
"position": 6,
"link": {
"type": "doc",
"id": "graph-production-workflow"
}
}

View File

@ -0,0 +1,58 @@
---
sidebar_position: 1
---
# Aggregation
OpenAIRE materializes an open, participatory research graph (the OpenAIRE Graph) where products of the research life-cycle (e.g. scientific literature, research data, project, software) are semantically linked to each other and carry information about their access rights (i.e. if they are Open Access, Restricted, Embargoed, or Closed) and the sources from which they have been collected and where they are hosted. The OpenAIRE Graph is materialised via a set of autonomic, orchestrated workflows operating in a regimen of continuous data aggregation and integration. [1]
## What does OpenAIRE collect?
OpenAIRE aggregates metadata records describing objects of the research life-cycle from content providers compliant to the [OpenAIRE guidelines](https://guidelines.openaire.eu/) and from entity registries (i.e. data sources offering authoritative lists of entities, like [OpenDOAR](https://v2.sherpa.ac.uk/opendoar/), [re3data](https://www.re3data.org/), [DOAJ](https://doaj.org/), and various funder databases). After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is used to generate the final OpenAIRE Graph, accessible from the [OpenAIRE EXPLORE portal](https://explore.openaire.eu) and the [APIs](https://graph.openaire.eu/develop/).
The transformation process includes the application of cleaning functions whose goal is to ensure that values are harmonised according to a common format (e.g. dates as YYYY-MM-dd) and, whenever applicable, to a common controlled vocabulary. The controlled vocabularies used for cleansing are accessible at [api.openaire.eu/vocabularies](https://api.openaire.eu/vocabularies/). Each vocabulary features a set of controlled terms, each with one code, one label, and a set of synonyms. If a synonym is found as field value, the value is updated with the corresponding term.
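The synonym-based cleaning described above can be sketched as a simple lookup. This is only an illustration of the idea, not the actual OpenAIRE implementation: the `Term` class, the function names, and the case-insensitive, whitespace-trimmed matching are assumptions, and the sample vocabulary is hand-written.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A controlled term: one code, one label, and a set of synonyms."""
    code: str
    label: str
    synonyms: set = field(default_factory=set)


def build_lookup(terms):
    """Index every code, label and synonym (normalised) by the term it belongs to."""
    lookup = {}
    for term in terms:
        for key in (term.code, term.label, *term.synonyms):
            lookup[key.strip().lower()] = term
    return lookup


def clean(value, lookup):
    """Return the controlled term matching the value, or None when nothing matches."""
    return lookup.get(value.strip().lower())


# Hand-written sample vocabulary, for illustration only.
languages = build_lookup([Term("eng", "English", {"en", "english language"})])
match = clean("EN ", languages)
print(match.code if match else "no match")
```

Returning `None` for unmatched values leaves the decision about what to do with them (keep, drop, or flag) to the caller.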
In addition, the OpenAIRE Graph is extended with other relevant scholarly communication sources that need special handling, either because they do not strictly follow the OpenAIRE Guidelines or due to the vast amount of data they offer (e.g. DOIBoost, which merges Crossref, ORCID, Microsoft Academic Graph, and Unpaywall).
<p align="center">
<img loading="lazy" alt="Aggregation" src={require('../../assets/img/aggregation.png').default} width="65%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
The OpenAIRE aggregation system collects information about objects of the research life-cycle compliant to the [OpenAIRE acquisition policy](https://www.openaire.eu/content-acquisition-policy) from [different types of data sources](https://explore.openaire.eu/search/find/dataproviders):
1. Scientific literature metadata and full-texts from institutional and thematic repositories, CRIS (Current Research Information Systems), Open Access journals and publishers;
2. Dataset metadata from data repositories and data journals;
3. Scientific literature, data and software metadata from Zenodo;
4. Metadata about data sources, organizations, projects, and funding programs from entity registries, i.e. authoritative sources such as CORDA and other funder databases for projects, OpenDOAR for publication repositories, re3data for data repositories, DOAJ for Open Access journals;
5. Metadata of open source research software from software repositories and Software Heritage;
6. Metadata about other types of research products, like workflows, protocols, methods, research packages.
Relationships between objects are collected from the data sources, but also automatically detected by [inference algorithms](https://www.openaire.eu/blogs/text-mining-services-in-openaire-1) and added by authenticated users, who can insert links between literature, datasets, software and projects via [the “Link” procedure available from the OpenAIRE explore portal](https://explore.openaire.eu). More information about the linking functionality can be found [here](https://www.openaire.eu/linking).
## What kind of data sources are in OpenAIRE?
Objects and relationships in the OpenAIRE Graph are extracted from information packages, i.e. metadata records, collected from data sources of the following kinds:
- *Literature, Institutional and thematic repositories*: Information systems where scientists upload the bibliographic metadata and full-texts of their articles, due to obligations from their organization or due to community practices (e.g. ArXiv, Europe PMC);
- *Open Access Publishers and journals*: Information systems of open access publishers or their journals, which offer bibliographic metadata and PDFs of their published articles;
- *Data archives*: Information systems where scientists deposit descriptive metadata and files about their research data (also known as scientific data, datasets, etc.);
- *Hybrid repositories/archives*: Information systems where scientists deposit metadata and files of any kind of scientific product, including scientific literature, research data and research software (e.g. Zenodo);
- *Aggregator services*: Information systems that collect descriptive metadata about publications or datasets from multiple sources in order to enable cross-data source discovery of given research products. Examples are DataCite, BASE, DOAJ;
- *Entity Registries*: Information systems created with the intent of maintaining authoritative registries of given entities in the scholarly communication, such as OpenDOAR for the institutional repositories, re3data for the data repositories, CORDA and other funder databases for projects and funding information;
- *CRIS*: Information systems adopted by research and academic organizations to keep track of their research administration records and related research products; examples of CRIS content are articles or datasets funded by projects, their principal investigators, facilities acquired thanks to funding, etc.;
- *Research Graphs*: Services that maintain an information space of (possibly interlinked) scholarly communication objects. Examples are Crossref, ScholeXplorer and OpenAIRE itself.
## How does OpenAIRE collect metadata records?
OpenAIRE collects metadata records describing objects of the research life-cycle from content providers compliant to the OpenAIRE guidelines and from entity registries (i.e. data sources offering authoritative lists of entities, like OpenDOAR, re3data, DOAJ, and funder databases).
The OpenAIRE aggregator collects metadata records in the majority of cases via [OAI-PMH](https://www.openarchives.org/pmh/), but also supports other standard exchange protocols like FTP(S), SFTP, and some RESTful APIs.
The whole list of available and used collectors can be found in the [RedMine Wiki - API Protocols](https://support.openaire.eu/projects/openaire/wiki/API_protocols).
For additional details about the aggregation workflows, please refer to [2].
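As a rough illustration of what an OAI-PMH harvest looks like from the client side, the sketch below builds `ListRecords` request URLs following the OAI-PMH protocol: the first request carries a `metadataPrefix` (and optionally a `from` date), while follow-up requests carry only the `resumptionToken` returned by the previous response. The function name and the endpoint in the usage note are hypothetical, and this is not the code of the OpenAIRE collectors.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords URL for the first page or for a follow-up page."""
    if resumption_token:
        # In OAI-PMH the resumption token is an exclusive argument:
        # no other argument besides the verb may be sent with it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"
```

A harvester would call `list_records_url("https://repository.example.org/oai", from_date="2024-01-01")` once, then keep requesting pages with the returned resumption token until the response no longer contains one.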
## References
[1] Manghi, P., Artini, M., Atzori, C., Bardi, A., Mannocci, A., La Bruzzo, S., Candela, L., Castelli, D. and Pagano, P. (2014), “The D-NET software toolkit: A framework for the realization, maintenance, and operation of aggregative infrastructures”, Program: electronic library and information systems, Vol. 48 No. 4, pp. 322-354. [doi:10.1108/prog-08-2013-0045](http://doi.org/10.1108/prog-08-2013-0045)
[2] Atzori, C., Bardi, A., Manghi, P., & Mannocci, A. (2017, January). "The OpenAIRE workflows for data management". In Italian Research Conference on Digital Libraries (pp. 95-107). Springer, Cham. [doi:10.1007/978-3-319-68130-6_8](https://doi.org/10.1007/978-3-319-68130-6_8)

View File

@ -0,0 +1,11 @@
---
sidebar_position: 1
---
# OpenAIRE compatible sources
The OpenAIRE aggregator collects metadata records from content providers compliant to the OpenAIRE guidelines.
The OpenAIRE Guidelines help repository managers expose publications, datasets and CRIS metadata via the OAI-PMH protocol in order to integrate with OpenAIRE infrastructure.
You can find more information in https://guidelines.openaire.eu/en/latest/

View File

@ -0,0 +1,77 @@
# Datacite
This section describes the aggregation workflow used to gather the bibliographic material from Datacite and the relative mapping.
## Datacite datasource
[Datacite](https://datacite.org/index.html) is a leading global non-profit organisation that provides persistent identifiers (DOIs) for research data and other research outputs.
## Datacite API
The [DataCite REST API](https://support.datacite.org/docs/api) allows users to retrieve, query, and browse Datacite metadata records. In particular, it exposes a method for harvesting new records incrementally.
```
https://api.datacite.org/dois?page[cursor]=$CURSOR&page[size]=$NUMBER_OF_ITEM_PER_PAGE&query=updated:[$FROM_DATE_TIMESTAMP TO $TO_DATE_TIMESTAMP]
```
This API request uses the following variables:
- **CURSOR**: the value of the cursor used to iterate over the pages; the cursor is extracted from each API response and used in the next request.
- **NUMBER_OF_ITEM_PER_PAGE**: (max 1000) defines how many records must be returned within each API response.
- **FROM_DATE_TIMESTAMP, TO_DATE_TIMESTAMP**: the time interval in which the records were updated.
Each record contains two pieces of information needed for incremental harvesting:
- **isActive**: tells whether the record has been deleted (`isActive:false`)
- **updated**: timestamp of the last update
## Collection Workflow
The collection workflow is responsible for aggregating new records. Each record is stored locally in a table with the following schema:
- **DOI**: the DOI of the Datacite record (it is the primary key)
- **update_timestamp**: the timestamp of the last update
- **json**: the native record JSON
The metadata collection process identifies the most recent record date available locally and uses that date to request records from the Datacite API, populating the **FROM_DATE_TIMESTAMP** variable. The records in the API response are added to the local storage in upsert mode.
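The incremental collection loop described above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the production workflow: SQLite stands in for the local table, the table name `datacite` and all function names are made up for the example, and each record is assumed to expose its DOI and last update under `attributes.doi` and `attributes.updated`. No network call is made here; only the request URL is built.

```python
import json
import sqlite3
from urllib.parse import urlencode

API = "https://api.datacite.org/dois"


def build_url(cursor, page_size, from_ts, to_ts):
    """Build the harvesting URL for one page of records updated in the given interval."""
    params = {
        "page[cursor]": cursor,
        "page[size]": page_size,
        "query": f"updated:[{from_ts} TO {to_ts}]",
    }
    return f"{API}?{urlencode(params)}"


def create_table(conn):
    """Create the local table: DOI (primary key), last update timestamp, native JSON."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS datacite "
        "(doi TEXT PRIMARY KEY, update_timestamp INTEGER, json TEXT)"
    )


def upsert(conn, records):
    """Insert new records and overwrite existing ones that share the same DOI."""
    rows = [
        (r["attributes"]["doi"], r["attributes"]["updated"], json.dumps(r))
        for r in records
    ]
    conn.executemany("INSERT OR REPLACE INTO datacite VALUES (?, ?, ?)", rows)
    conn.commit()


def latest_timestamp(conn):
    """Return the most recent update timestamp stored locally, or None if the table is empty."""
    return conn.execute("SELECT MAX(update_timestamp) FROM datacite").fetchone()[0]
```

A harvesting run would call `latest_timestamp` to obtain the lower bound of the interval, request pages with `build_url` while following the cursor returned by each response, and pass every page of records to `upsert`.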
## Datacite Mapping
### Entity Mapping
The table below describes the mapping from the Datacite JSON records to the OpenAIRE Graph dump format.
| OpenAIRE Research Product field path | Datacite record JSON path | Notes |
|--------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `id` | `\attributes\doi` | id in the form `doi_________::md5(doi)` |
| <ul><li>`instance`</li> <li>`instance.type`</li></ul> | <ul><li>`\attributes\types\resourceType`</li> <li> `\attributes\types\resourceTypeGeneral` </li> <li>`attributes\types\schemaOrg`</li></ul> | Use the vocabulary **_dnet:publication_resource_** to find a synonym to one of these terms and get the `instance.type`. |
| `type` | <ul><li>`\attributes\types\resourceType`</li> <li> `\attributes\types\resourceTypeGeneral` </li> <li>`attributes\types\schemaOrg`</li></ul> | Using the **_dnet:result_typologies_** vocabulary, we look up the `instance.type` synonym to generate one of the following main entities: <ul><li>`publication`</li> <li>`dataset`</li> <li> `software`</li> <li>`otherresearchproduct`</li></ul> |
| `pid` | `\attributes\doi` | `scheme = doi` |
| `originalid` | `\attributes\doi` | |
| `dateofcollection` | `attributes\updated` | the timestamp is expressed in milliseconds and is converted to the `yyyy-MM-dd'T'HH:mm:ssZ` format |
| `author` | `\attributes\creators` | Each creator is mapped to an author entity through the subfields below. **If the record has no creator, it is skipped** |
| `author.fullname` | `\attributes\creators\name` | if the name is not defined, it is constructed from the given and family names |
| `author.rank` | | Incremental index starting from 1 |
| `author.name` | `\attributes\creators\givenName` | |
| `author.surname` | `\attributes\creators\familyName` | |
| `author.pid` | `\attributes\creators\nameIdentifiers` | this is a list of pids associated to the creator |
| `author.pid.scheme` | `\attributes\creators\nameIdentifiers` | mapping with vocabulary **dnet:pid_types** |
| `author.pid.value` | `\attributes\creators\nameIdentifiers/nameIdentifier` | the pid value |
| `maintitle` | `\attributes\titles` | Titles whose title type is null or Main |
| `subtitle` | `\attributes\titles` | Titles whose title type is Subtitle (the title type vocabulary in OpenAIRE follows the Datacite title type vocabulary) |
| **date section** | | For each date of records whose DOI starts with _10.14457_, a Thai date fix is applied: the date is converted to a `ThaiBuddhistDate` and reformatted to the local one; see ticket [#6791](https://support.openaire.eu/issues/6791) |
| `publicationdate` | `\attributes\dates` | where `dateType` is **issued** |
| `publicationdate` | `\attributes\publicationYear` | we create this date format `01-01-publicationYear` |
| `embargoenddate` | `\attributes\dates` | where `dateType` is **available** |
| `subjects` | `\attributes\subject` | `scheme=keywords` |
| `description` | `\attributes\descriptions` | |
| `publisher` | `\attributes\publisher` | |
| `language` | `\attributes\language` | cleaned by using vocabulary `dnet:languages` |
| `instance.license` | `\attributes\rightsList` | if the rights value starts with http and matches a particular regex |
| `instance.accessright` | `\attributes\rightsList` | <ul><li>if not present :`unknown`</li><li>if datasource is Figshare:`open`</li><li>If `embargo_date < today()`: OPEN</li></ul> |
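Two of the mapping rules in the table above, the identifier construction and the publication date, can be illustrated with a short sketch. Several details are assumptions rather than documented behaviour: the DOI is hashed exactly as received (no normalisation), the `dateType` comparison is case-insensitive, and an **issued** date takes precedence over the `publicationYear` fallback. The function names are illustrative.

```python
import hashlib

ID_PREFIX = "doi_________::"


def openaire_id(doi):
    """Build an identifier in the form `doi_________::md5(doi)`."""
    return ID_PREFIX + hashlib.md5(doi.encode("utf-8")).hexdigest()


def publication_date(attributes):
    """Pick the date whose dateType is 'issued', else fall back to 01-01-publicationYear."""
    for entry in attributes.get("dates") or []:
        if (entry.get("dateType") or "").lower() == "issued" and entry.get("date"):
            return entry["date"]
    year = attributes.get("publicationYear")
    return f"01-01-{year}" if year else None


print(openaire_id("10.5281/zenodo.3516917"))
print(publication_date({"publicationYear": 2018}))
```

The second call prints `01-01-2018`, the fallback format named in the table.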
### Relation Mapping
| OpenAIRE Relation Semantic and inverse | Datacite record JSON path | Source/Target type | Notes |
|----------------------------------------|---------------------------------------|---------------------|------------------------------------------------------------------------------------------------------------|
| `isProducedBy/produces` | `\attributes\fundingReferences` | `ResearchProduct/Project` | only when the funding reference matches the pattern `(info:eu-repo/grantagreement/ec/h2020/)(\d{6})(.*)` |
| `IsProvidedBy/provides` | | `ResearchProduct/Datasource` | Datasource is always set to `Datacite` |
| `isHostedBy/host` | `\attributes\relationships\client\id` | `ResearchProduct/Datasource` | we defined a curated clientId/Datasource map; when a match is found, we create a _hostedBy relation_ |
| `isRelatedTo` | `\attributes\relatedIdentifiers` | `ResearchProduct/ResearchProduct` | we create relationships whenever the pid of the target is resolved on the Research Graph |

View File

@ -0,0 +1,253 @@
# DOIBoost: Crossref, Unpaywall, Microsoft Academic Graph, ORCID
DOIBoost is a dataset that combines research outputs and links among them from a selection of data sources.
It enriches the records available in Crossref with information from Unpaywall, Microsoft Academic Graph (MAG), and ORCID, intersecting all those datasets by DOI.
As a consequence, DOIBoost does not contain any record from MAG, Unpaywall, or ORCID whose DOI is not available in Crossref.
Each Crossref record is enriched with:
* ORCID identifiers of authors from ORCID
* Open Access instance (with OA color/route and license) from Unpaywall
* the following information from MAG:
  * abstracts
  * MAG identifiers of authors
  * affiliation (research product - organization) relationships
  * subjects (MAG FieldsOfStudy)
  * conference or journal information
The Open Access status is also set by intersecting the journal information of a record with the journal lists available from DOAJ and the Gold ISSN list.
## Inputs
* *Crossref*: dump available to Crossref subscribers via the MetadataPlus service, updated once a month.
* *Microsoft Academic Graph*: version downloaded on 2021-02-15. We plan to take the latest version in Dec 2021, before MAG is retired.
* *ORCID*: baseline dump obtained on 2020-10-13, updated every week from the [ORCID public API](https://info.orcid.org/documentation/features/public-api).
* *Unpaywall*: public database snapshot downloaded in March 2021. Unpaywall updates it twice a year (https://unpaywall.org/products/snapshot).
The construction of the DOIBoost dataset consists of the following phases:
## Process
The following section describes the processing steps needed to build DOIBoost starting from the input data.
### Crossref filtering
Records in Crossref are ruled out according to the following criteria:
* have a blank title, examples:
  * `10.1093/rheumatology/41.7.837`
  * `10.1093/qjmed/95.7.430`
  * `10.1371/journal.pone.0171434.g005`
* have one of the following publishers: `"Test accounts"`, `"CrossRef Test Account"`
  * Examples from https://api.crossref.org/works?query.publisher-name=%22Test%20accounts%22
    * `10.1007/bf00344543`
    * `10.1007/bf00186154`
    * `10.1306/64ed947a-1724-11d7-8645000102c1865d`
* have no authors with valid names, where valid means: not blank and different from all strings in this list: `List(",", "none none", "none, none", "none &na;", "(:null)", "test test test", "test test", "test", "&na; &na;")`
  * Examples for blank authors:
    * `10.1108/00070709810247807`
    * `10.1016/s1074-9098(02)00346-5`
    * `10.1136/heart.88.1.6`
  * Examples for `"none"` author from https://api.crossref.org/works?query.author=%22none%22
    * `10.4007/annals.2016.184.3.11`
    * `10.4007/annals.2012.176.1.6`
    * `10.2172/6393585`
  * Examples for `"test"` author from https://api.crossref.org/works?query.author=%22test%22
    * `10.5116/ijme.54ca.a5ae`
    * `10.5755/j01.ss.71.2.544`
    * `10.5755/j01.ee.22.2.319`
* have `"Addie Jackson"` as author and `"Elsevier BV"` as publisher (empirically, these are test records)
  * Examples from https://api.crossref.org/works?query.author=Addie+Jackson&query.publisher-name=%22Elsevier%20BV%22
    * `10.2139/ssrn.2082156`
    * `10.2139/ssrn.2202300`
    * `10.2139/ssrn.2255657`
* do not have one of the following values in the field `type`: `"book-section"`, `"book"`, `"book-chapter"`, `"book-part"`, `"book-series"`, `"book-set"`, `"book-track"`, `"edited-book"`, `"reference-book"`, `"monograph"`, `"journal-article"`, `"dissertation"`, `"other"`, `"peer-review"`, `"proceedings"`, `"proceedings-article"`, `"reference-entry"`, `"report"`, `"report-series"`, `"standard"`, `"standard-series"`, `"posted-content"`, `"dataset"`
  * Examples:
    * `10.1371/journal.pone.0171434.g005`
    * `10.7554/elife.21052.049`
    * `10.1371/journal.pcbi.1005379.s006`
Records with `type=dataset` are mapped into OpenAIRE research products of type dataset. All others are mapped as OpenAIRE research products of type publication.
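
The filtering criteria above can be sketched as a single predicate. This is only an illustration, not the production code: it assumes records shaped like the Crossref REST API JSON (`title` as a list, `author` as a list of objects with `given` and `family`), and the function names and the case-insensitive comparison of author names are assumptions.

```python
# Illustrative sketch of the Crossref filtering rules; not the production code.
INVALID_AUTHOR_NAMES = {
    ",", "none none", "none, none", "none &na;", "(:null)",
    "test test test", "test test", "test", "&na; &na;",
}
TEST_PUBLISHERS = {"Test accounts", "CrossRef Test Account"}
ACCEPTED_TYPES = {
    "book-section", "book", "book-chapter", "book-part", "book-series",
    "book-set", "book-track", "edited-book", "reference-book", "monograph",
    "journal-article", "dissertation", "other", "peer-review", "proceedings",
    "proceedings-article", "reference-entry", "report", "report-series",
    "standard", "standard-series", "posted-content", "dataset",
}


def full_name(author: dict) -> str:
    """Join the given and family names, skipping missing parts."""
    parts = (author.get("given"), author.get("family"))
    return " ".join(p.strip() for p in parts if p and p.strip())


def has_valid_author(record: dict) -> bool:
    names = (full_name(a) for a in record.get("author", []))
    # Assumption: the comparison with the invalid names is case-insensitive.
    return any(n and n.lower() not in INVALID_AUTHOR_NAMES for n in names)


def keep_record(record: dict) -> bool:
    """Return True when none of the exclusion criteria applies."""
    if not any(t and t.strip() for t in record.get("title", [])):
        return False
    publisher = record.get("publisher")
    if publisher in TEST_PUBLISHERS:
        return False
    if not has_valid_author(record):
        return False
    authors = {full_name(a) for a in record.get("author", [])}
    if "Addie Jackson" in authors and publisher == "Elsevier BV":
        return False
    return record.get("type") in ACCEPTED_TYPES


def product_type(record: dict) -> str:
    """Map the Crossref type to the OpenAIRE research product type."""
    return "dataset" if record.get("type") == "dataset" else "publication"
```
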
### Mapping Crossref properties into the OpenAIRE Graph
Properties in OpenAIRE research products are set based on the logic described in the following table:
| OpenAIRE Research Product field path | Crossref path(s) | Notes |
|----------------------------------------|--------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `id` | `doi` | id in the form `doi_________::md5(doi)` |
| `dateofcollection` | `indexed.datetime` | |
| `lastupdatetimestamp` | `indexed.timestamp` | |
| `type` | `type` | `dataset` if the Crossref type is dataset, `publication` otherwise (based on the filtering logic described above) |
| `originalId` | `doi, clinical-trial-number, alternative-id` | |
| `pid` | | The scheme tells the type of PID, the value contains the actual value |
| `pid.scheme` | | Default value: doi |
| `pid.value` | `doi` | The doi is normalised and lower-cased |
| `maintitle` | `title` | |
| `subtitle` | `subtitle` | |
| `author` | `author` | if available the sequence is mapped to rank and the ORCID is also mapped |
| `author.name` | `author.given` | |
| `author.surname` | `author.family` | |
| `author.fullname` | `author.given author.family` | |
| `author.rank` | | based on the order, starts from 1 |
| `author.pid` | | only if the ORCID is available |
| `author.pid.id.scheme` | | Default `'orcid_pending'` (meaning that it is not an id confirmed by ORCID) |
| `author.pid.id.value` | `author.ORCID` | |
| `author.pid.provenance.provenance` | | Default 'Harvested' |
| `author.pid.provenance.trust` | | Default '0.9' |
| `description` | `abstract` | |
| `subject` | `subject` | with `classid='keywords'`, i.e. no controlled vocabularies for Crossref subjects |
| `publicationdate` | `issued.datetime` or, if not available, `created.datetime` | |
| `publisher` | `publisher` | |
| `source` | `source` | only if the record is not of type `book` |
| `source` | concatenation of `container-title.head` + `"ISBN: "` + `ISBN.head` | only if the record is of type `book` |
| `container` | | It is set only for publications with information about the journal it was published in. |
| `container.name` | `container-title.head` | |
| `container.issnOnline` | `issn-type.value` | if `issn-type.type='electronic'` |
| `container.issnPrinted` | `issn-type.value` | if `issn-type.type='print'` |
| `container.vol` | `volume` | |
| `container.sp` | `page` | before `'-'` |
| `container.ep` | `page` | after `'-'` |
| `instance` | | One instance is created with the DOI URL |
| `instance.accessright` | | Values in `instance.accessright.code` and `instance.accessright.label` are set based on license and dateofacceptance:<br/>- `UNKNOWN`: if the license is blank<br/>- `OPEN ACCESS`: if the license is a CC license or an ACS license or an APA license (considered OPEN also by Unpaywall, see [Unpaywall FAQ](https://support.unpaywall.org/support/solutions/articles/44002063718-what-is-an-oa-license-) for details) or if OUP license, but only after 12 months from the publication date<br/>- `EMBARGO`: OUP license, before 12 months from the publication date<br/>- `CLOSED`: if there is a license not covered by the previous cases |
| `instance.accessright.code` | | Code from the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
| `instance.accessright.label` | | One of: `OPEN`, `RESTRICTED`, `CLOSED`, `EMBARGO` |
| `instance.accessright.scheme` | | Scheme that defines the code and label, i.e. the URL to the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
| `instance.accessright.openAccessRoute` | | only if `instance.accessright.value = 'OPEN ACCESS'`. Default is `hybrid`. The route is fixed in subsequent phases of DOIBoost, namely when intersecting with Unpaywall and patching the hostedby via DOAJ and the Gold-ISSN list. |
| `instance.license` | `license.URL ` | If there is a `license.content-version='vor'`, then this is used. Otherwise the first license entry is used. |
| `instance.pid` | | The scheme tells the type of PID, the value contains the actual value |
| `instance.pid.scheme` | | Default value: `doi` |
| `instance.pid.value` | `doi` | The doi is normalised and lower-cased |
| `instance.publicationdate` | `issued.datetime` or, if not available, `created.datetime` | |
| `instance.refereed` | | set to `peerReviewed` only if `relation.has-review.id` is not empty, `UNKNOWN` otherwise. |
| `instance.type` | `subtype` | mapped using the [OpenAIRE vocabulary for research products typologies](https://api.openaire.eu/vocabularies/dnet:result_typologies) |
| `instance.url` | `doi` | Full URL of the DOI |
All other fields of the JSON schema not mentioned in the table contain empty values.
All the records from Crossref are related to the datasource with `name=Crossref` and `id=openaire____::081b82f96300b6a6e3d282bad31cb6e2`.
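
A few of the rules in the table can be illustrated with a short sketch. It is not the actual implementation: the helper names are hypothetical, DOI normalisation is reduced here to trimming and lower-casing, and the page range is assumed to be split at the first `'-'`.

```python
import hashlib


def openaire_id(doi: str) -> str:
    """Build an id in the form doi_________::md5(doi)."""
    normalised = doi.strip().lower()  # assumption: simplified normalisation
    return "doi_________::" + hashlib.md5(normalised.encode("utf-8")).hexdigest()


def split_page(page: str):
    """Return (container.sp, container.ep): the text before and after '-'."""
    start, sep, end = page.partition("-")
    return (start.strip() or None, (end.strip() or None) if sep else None)


def pick_license(licenses: list):
    """Prefer the license with content-version 'vor', else the first entry."""
    for entry in licenses:
        if entry.get("content-version") == "vor":
            return entry.get("URL")
    return licenses[0].get("URL") if licenses else None
```
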
Possible improvements:
* map `clinical-trial-number` and `alternative-id` in `alternateIdentifiers`?
* Verify if Crossref has a property for `language`, `country`, `container.issnLinking`, `container.iss`, `container.edition`, `container.conferenceplace` and `container.conferencedate`
* Different approach to set the `refereed` field and improve its coverage?
### Map Crossref links to projects/funders
Links to funding available in Crossref are mapped as funding relationships (`ResearchProduct -- isProducedBy -- Project`) applying the following mapping:
| Funder | Grant code | Link to |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|
| DOI: `{10.13039/100010663, 10.13039/100010661, 10.13039/501100007601, 10.13039/501100000780, 10.13039/100010665}` or name: `'European Unions Horizon 2020 research and innovation program'` | series of `4-9` digits in `award` | Link to H2020 project |
| DOI: `{10.13039/100011199, 10.13039/100004431, 10.13039/501100004963, 10.13039/501100000780}` | series of `4-9` digits in `award` | Link to FP7 project |
| DOI: `10.13039/501100000781` OR name: `'European Union's'` | series of `4-9` digits in `award` | Link to FP7 or H2020 project |
| DOI: `10.13039/100000001` | `award` | Link to NSF project |
| DOI: `10.13039/501100001665` OR name: `{'The French National Research Agency (ANR)', 'The French National Research Agency'}` | `award` | Link to ANR project |
| DOI: `10.13039/501100002341` | `award` | Link to Academy of Finland project |
| DOI: `10.13039/501100001602` | `award`, removing the initial 'SFI' if present | Link to SFI project |
| DOI: `10.13039/501100000923` | `award` | Link to ARC project |
| DOI: `10.13039/501100000038` | `award` is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to NSERC (`unidentified` project) |
| DOI: `10.13039/501100000155` | `award` is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to SSHRC (`unidentified` project) |
| DOI: `10.13039/501100000024` | `award` is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to CIHR (`unidentified` project) |
| DOI: `10.13039/501100002848` OR name: `'CONICYT, Programa de Formación de Capital Humano Avanzado'` | `award` | Link to CONICYT project |
| DOI: `10.13039/501100003448` | series of `4-9` digits in `award` | Link to GSRT project |
| DOI: `10.13039/501100010198` | `award` | Link to SGOV project |
| DOI: `10.13039/501100004564` | series of `4-9` digits in `award` | Link to MESTD project |
| DOI: `10.13039/501100003407` | `award` | Link to MIUR project. Since OpenAIRE has a small subset of MIUR projects, a link to the MIUR funder (`unidentified` project) is also generated |
| DOI: `{10.13039/501100006588, 10.13039/501100004488}` | `award`, removing `'Project No'` and `'HRZZ'` prefix, if present | Link to HRZZ or MZOS project |
| DOI: `10.13039/501100006769` | `award` | Link to Russian Science Foundation project |
| DOI: `10.13039/501100001711` | `award` after `'_'` and before `'/'` | Link to SNSF project |
| DOI: `10.13039/501100004410` | `award` | Link to TUBITAK project |
| DOI: `10.13039/100004440` or name: `Wellcome Trust Masters Fellowship` | `award` | Link to Wellcome Trust specific project and to the `unidentified` project. |
### Intersect Crossref with Unpaywall by DOI
The fields we consider from Unpaywall are:
* `is_oa`
* `best_oa_location`
* `oa_status`
The Crossref records that intersect by DOI with Unpaywall records are enriched with one additional `instance` with the following properties:
| OpenAIRE Research Product field path | Unpaywall field path | Notes |
|----------------------------------------|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `instance` | | created only if `is_oa` and a `best_oa_location` is available |
| `instance.accessright` | | default value `Open Access`: we do not add instances if Unpaywall says there is no open version |
| `instance.accessright.code` | | Open Access code from the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
| `instance.accessright.label` | | Always `OPEN` |
| `instance.accessright.scheme` | | Scheme that defines the code and label, i.e. the URL to the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
| `instance.accessright.openAccessRoute` | `oa_status` | |
| `instance.url` | `best_oa_location` | |
| `instance.license` | `best_oa_location.license` | |
| `instance.pid` | | The scheme tells the type of PID, the value contains the actual value |
| `instance.pid.scheme` | | Default value: `doi` |
| `instance.pid.value` | `doi` | The doi is normalised and lower-cased |
For the definition of Unpaywall's `oa_status`, refer to the [Unpaywall FAQ](https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-).
The record will also feature a relation to the Unpaywall data source: `name="UnpayWall"`, `id=openaire____::8ac8380272269217cb09a928c8caa993`.
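
As a rough sketch of the table above, the additional instance could be built as follows. The function name and the output dictionary layout are hypothetical, and reading the URL from a `url` key inside `best_oa_location` is an assumption (the table only names `best_oa_location`).

```python
def unpaywall_instance(doi: str, unpaywall: dict):
    """Return the extra instance, or None when no open version is reported."""
    best = unpaywall.get("best_oa_location")
    if not unpaywall.get("is_oa") or not best:
        return None
    return {
        "accessright": {
            "label": "OPEN",
            "openAccessRoute": unpaywall.get("oa_status"),
        },
        "url": best.get("url"),  # assumption: URL stored under 'url'
        "license": best.get("license"),
        "pid": {"scheme": "doi", "value": doi.strip().lower()},
    }
```
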
### Intersect with ORCID
The fields we consider from ORCID are:
* `doi`
* `authors`, a list of authors, each with optional `name`, `surname`, `creditName`, `oid`
| OpenAIRE field path | ORCID path | Notes |
|------------------------------------|-----------------------|--------------------------------------------------------------------------------------------------------------------------------------|
| `pid` | `doi` | |
| `author.name` | `capitalize(name)` | only mapped if not blank |
| `author.surname` | `capitalize(surname)` | only mapped if not blank |
| `author.fullname` | | if name and surname are not blank, they are concatenated (`capitalize(name) capitalize(surname)`), otherwise we use the `creditName` |
| `author.pid` | | only if the `ORCID` is available |
| `author.pid.id.scheme` | | Default `orcid` (meaning that it is confirmed by ORCID, in contrast to the `orcid_pending` set from Crossref and Unpaywall) |
| `author.pid.id.value` | `oid` | |
| `author.pid.provenance.provenance` | | Default `Harvested` |
| `author.pid.provenance.trust` | | Default `0.9` |
The records are enriched with the ORCID identifiers of their authors.
[//]: # (TODO: Update with the new approach implemented by Miriam.)
The current approach is:
* if the number of authors from Crossref equals the number of authors from ORCID, then we pick the list of authors with more PIDs and try to enrich it with the PIDs from the other list, based on the Jaro-Winkler distance on authors' names, surnames, or fullnames, depending on which properties are available;
* if the numbers of authors differ, then we take the longest list and try to enrich it with the PIDs from the other author list, based on the Jaro-Winkler distance on authors' names, surnames, or fullnames, depending on which properties are available.
Miriam will modify the process to ensure that:
* the list of authors from Crossref always "wins"
* the identifiers from ORCID "win"
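
The current approach can be sketched as follows. This is a simplified illustration, not the DOIBoost code: it compares only `fullname` values, the `0.9` threshold and the tie-break in favour of Crossref are assumptions, and all function names are hypothetical.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings (1.0 means identical)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    hit1, hit2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not hit2[j] and s2[j] == ch:
                hit1[i] = hit2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    m1 = [c for c, h in zip(s1, hit1) if h]
    m2 = [c for c, h in zip(s2, hit2) if h]
    # Half the number of matched characters that are out of order.
    transpositions = sum(a != b for a, b in zip(m1, m2)) / 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Boost the Jaro similarity for a common prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def enrich(base: list, other: list, threshold: float = 0.9) -> list:
    """Copy the PID of the most similar author in `other`, if similar enough."""
    candidates = [o for o in other if o.get("pid") and o.get("fullname")]
    result = []
    for author in base:
        author = dict(author)  # do not modify the input
        name = author.get("fullname", "").lower()
        if not author.get("pid") and name and candidates:
            best = max(candidates,
                       key=lambda o: jaro_winkler(name, o["fullname"].lower()))
            if jaro_winkler(name, best["fullname"].lower()) >= threshold:
                author["pid"] = best["pid"]
        result.append(author)
    return result


def pid_count(authors: list) -> int:
    return sum(1 for a in authors if a.get("pid"))


def merge_authors(crossref: list, orcid: list) -> list:
    if len(crossref) == len(orcid):
        crossref_first = pid_count(crossref) >= pid_count(orcid)  # assumed tie-break
    else:
        crossref_first = len(crossref) > len(orcid)
    base, other = (crossref, orcid) if crossref_first else (orcid, crossref)
    return enrich(base, other)
```
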
### Intersect with Microsoft Academic Graph
*Important Notes*
* Only papers with DOI are considered
* Since the same DOI can have multiple versions of an item with different MAG PaperIds, we only take one per DOI (the last one we process). We call this dataset `Papers_distinct`
When mapping MAG records to the OpenAIRE Graph, we consider the following MAG tables:
* `PaperAbstractsInvertedIndex`: for the paper abstracts
* `Authors`: for the authors. The MAG data is pre-processed by grouping authors by PaperId
* `Affiliations` and `PaperAuthorAffiliations`: to generate links between publications and organisations
* `Journals` and `ConferenceInstances`: joined with `Papers_distinct` to have the information about the venues where the paper was published
* [TO BE REMOVED] `PaperUrls`: to create one instance for the OpenAIRE publication
* `FieldsOfStudy`: to add subjects
The records are enriched with:
* abstracts
* MAG identifiers of authors
* affiliation relationships
* subjects (MAG FieldsOfStudy)
* conference or journal information (in the `journal` field) TODO: or `container`, in case of the dump?
* [TO BE REMOVED] instances with URL from MAG
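
Two of the MAG pre-processing steps can be sketched in a few lines. This is an illustration under assumptions: the inverted index is taken to be a JSON object with `IndexLength` and `InvertedIndex` (word to list of positions), papers are plain dictionaries with a `Doi` key, DOIs are compared lower-cased, and the function names are hypothetical.

```python
import json


def rebuild_abstract(inverted_index_json: str) -> str:
    """Turn a PaperAbstractsInvertedIndex entry back into plain text."""
    data = json.loads(inverted_index_json)
    words = [""] * data["IndexLength"]
    for word, positions in data["InvertedIndex"].items():
        for position in positions:
            words[position] = word
    return " ".join(w for w in words if w)


def papers_distinct(papers: list) -> list:
    """Keep one paper per DOI: the last one processed wins."""
    by_doi = {}
    for paper in papers:
        doi = paper.get("Doi")
        if doi:  # only papers with a DOI are considered
            by_doi[doi.lower()] = paper
    return list(by_doi.values())
```
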
### Enrich DOIBoost3 with hosting data sources (`hostedby`) and access right information
In this phase, we intersect DOIBoost3 with a dataset composed of journals from OpenAIRE, Crossref, and the ISSN gold list. Each journal comes with its International Standard Serial Numbers (`issn`, `eissn`, `lissn`) and, when available, a flag that tells if the journal is open access. The intersection is done on the basis of the International Standard Serial Numbers. The records with a matching `journal.[l|e]issn` are enriched as follows:
* Each instance gains the `hostedby` information corresponding to the journal
* If the journal is open access, the access rights of the instances are also set to `Open Access` with `gold` route (because by construction, the journals we know are open are from DOAJ or Gold ISSN list)
The `hostedby` of records that do not match is set to the `Unknown Repository`.
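
A minimal sketch of this intersection, assuming journals and records are plain dictionaries: the key names (`name`, `open_access`, `journal`, `instance`) and the function names are hypothetical and only serve the illustration.

```python
UNKNOWN_REPOSITORY = "Unknown Repository"
ISSN_KEYS = ("issn", "eissn", "lissn")


def build_issn_index(journals: list) -> dict:
    """Index every journal by each of its ISSNs."""
    index = {}
    for journal in journals:
        for key in ISSN_KEYS:
            if journal.get(key):
                index[journal[key]] = journal
    return index


def enrich_hostedby(record: dict, issn_index: dict) -> dict:
    """Set hostedby (and, for open journals, the access right) on each instance."""
    container = record.get("journal") or {}
    match = next(
        (issn_index[container[k]] for k in ISSN_KEYS
         if container.get(k) in issn_index),
        None,
    )
    for instance in record.get("instance", []):
        if match is None:
            instance["hostedby"] = UNKNOWN_REPOSITORY
            continue
        instance["hostedby"] = match["name"]
        if match.get("open_access"):
            instance["accessright"] = {
                "label": "Open Access",
                "openAccessRoute": "gold",
            }
    return record
```
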
## References
The idea behind DOIBoost and its origin can be found in the paper (and related resources) at:
* La Bruzzo S., Manghi P., Mannocci A. (2019) OpenAIRE's DOIBoost - Boosting CrossRef for Research. In: Manghi P., Candela L., Silvello G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, doi:10.1007/978-3-030-11226-4_11 . Open Access version available at: [10.5281/zenodo.1441071](https://doi.org/10.5281/zenodo.1441071)

View File

@ -0,0 +1,94 @@
# EMBL-EBI's Protein Data Bank in Europe
This section describes the mapping implemented for [EMBL-EBI's Protein Data Bank in Europe](https://www.ebi.ac.uk/).
The Europe PMC RESTful Web Service provides the [datalinks API](https://europepmc.org/RestfulWebService#!/Europe32PMC32Articles32RESTful32API) to retrieve data-literature links in Scholix format.
## How the data is collected
Starting from the PubMed collection, the API below is used to obtain the bioentities related to publications for each PubMed identifier.
Example:
```commandline
curl -s "https://www.ebi.ac.uk/europepmc/webservices/rest/MED/33024307/datalinks?format=json" | jq '.'
{
  "version": "6.8",
  "hitCount": 9,
  "request": {
    "id": "33024307",
    "source": "MED"
  },
  "dataLinkList": {
    "Category": [
      {
        "Name": "Nucleotide Sequences",
        "CategoryLinkCount": 5,
        "Section": [
          {
            "ObtainedBy": "tm_accession",
            "Tags": [
              "supporting_data"
            ],
            "SectionLinkCount": 5,
            "Linklist": {
              "Link": [
                {
                  "ObtainedBy": "tm_accession",
                  "PublicationDate": "04-11-2022",
                  "LinkProvider": {
                    "Name": "Europe PMC"
                  },
                  "RelationshipType": {
                    "Name": "References"
                  },
                  "Source": {
                    "Type": {
                      "Name": "literature"
                    },
                    "Identifier": {
                      "ID": "33024307",
                      "IDScheme": "MED"
                    }
                  },
                  "Target": {
                    "Type": {
                      "Name": "dataset"
                    },
                    "Identifier": {
                      "ID": "AY278488",
                      "IDScheme": "ENA",
                      "IDURL": "http://identifiers.org/ebi/ena.embl:AY278488"
                    },
                    "Title": "AY278488",
                    "Publisher": {
                      "Name": "Europe PMC"
                    }
                  },
                  [...]
```
## Mapping
The table below describes the mapping from the EBI link records to the OpenAIRE Graph Dataset format.
We select only the target links with pid type **ena**, **pdb** or **uniprot**.
For each target we construct a Bioentity with the following mapping:
| OpenAIRE Research Product field path | EBI record field xpath | Notes |
|-----------------------------|----------------------------------------------------------|---------------------------------------------------------------|
| `id` | `target/identifier/ID` and `target/identifier/IDScheme` | id in the form `SCHEMA_________::md5(pid)` |
| `pid` | `target/identifier/ID` and `target/identifier/IDScheme` | `classid = classname = schema` |
| `publicationdate` | `target/PublicationDate` | clean and normalize the format of the date to be `YYYY-mm-dd` |
| `maintitle` | `target/Title` | |
| **Instance Mapping** | | |
| `instance.type` | | `Bioentity` |
| `type` | | `Dataset` |
| `instance.pid` | `target/identifier/ID` and `target/identifier/IDScheme` | `classid = classname = schema` |
| `instance.url` | `target/identifier/IDURL` | Copy the value as it is |
| `instance.publicationdate` | `//PubmedPubDate` | clean and normalize the format of the date to be YYYY-mm-dd |
### Relation Mapping
| OpenAIRE Relation Semantic and inverse | Source/Target type | Notes |
|----------------------------------------|---------------------|--------------------------------------------------------------------------|
| `IsRelatedTo` | `ResearchProduct/ResearchProduct` | we create relationships between the BioEntity and the pubmed publication |
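
The mapping above can be sketched against the JSON response shown earlier. This is an illustration under several assumptions: the publication date is read from the link object (where it appears in the example response) and parsed as `dd-mm-YYYY`, the id prefix is the lower-cased scheme padded with underscores to 12 characters, the hash is computed on the target ID, and the function names are hypothetical.

```python
import hashlib
from datetime import datetime

ACCEPTED_SCHEMES = {"ena", "pdb", "uniprot"}


def iter_links(response: dict):
    """Yield every link found in a datalinks response."""
    for category in response.get("dataLinkList", {}).get("Category", []):
        for section in category.get("Section", []):
            yield from section.get("Linklist", {}).get("Link", [])


def to_bioentity(link: dict):
    """Map one link to a Bioentity, or None when the pid type is not accepted."""
    identifier = link["Target"]["Identifier"]
    scheme = identifier["IDScheme"].lower()
    if scheme not in ACCEPTED_SCHEMES:
        return None
    # Assumption: dates are dd-mm-YYYY and are normalised to YYYY-mm-dd.
    date = datetime.strptime(link["PublicationDate"], "%d-%m-%Y")
    digest = hashlib.md5(identifier["ID"].encode("utf-8")).hexdigest()
    return {
        "id": f"{scheme.ljust(12, '_')}::{digest}",
        "pid": {"scheme": scheme, "value": identifier["ID"]},
        "maintitle": link["Target"].get("Title"),
        "publicationdate": date.strftime("%Y-%m-%d"),
        "type": "Dataset",
        "instance": {"type": "Bioentity", "url": identifier.get("IDURL")},
    }
```
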

View File

@ -0,0 +1,44 @@
# PubMed
This section describes the mapping implemented for [MEDLINE/PubMed](https://pubmed.ncbi.nlm.nih.gov/).
## Input
The native data is collected from the [ftp baseline](https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/) site.
It contains XML records compliant with the schema available at [www.nlm.nih.gov](https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html).
## Incremental harvesting
PubMed also exposes an FTP entry point with the update files published on top of the baseline: [ftp baseline update](https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/). We collect the new files and generate the new dataset by upserting the existing items.
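
The upsert can be pictured as a keyed merge. The sketch below assumes records are dictionaries carrying a `pmid` key and that a record from an update file fully replaces the existing one; both the function name and the record layout are hypothetical.

```python
def upsert(dataset: dict, updates: list) -> dict:
    """Return a new dataset where updated records replace or extend the old ones."""
    merged = dict(dataset)
    for record in updates:
        merged[record["pmid"]] = record  # insert if new, overwrite if present
    return merged
```
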
## Entity Mapping
The table below describes the mapping from the XML baseline records to the OpenAIRE Graph dump format.
| OpenAIRE Research Product field path | PubMed record field xpath | Notes |
|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Publication Mapping** | | |
| `id` | `//PMID` | id in the form `pmid_________::md5(pmid)` |
| `pid` | `//PMID` | `classid = classname = pmid` |
| `publicationdate` | `//PubmedPubDate` | clean and normalize the format of the date to be YYYY-mm-dd |
| `maintitle` | `//Title` | |
| `description` | `//AbstractText` | |
| `language` | `//Language` | cleaned by using vocabulary `dnet:languages` |
| `subjects` | `//DescriptorName` | classId, className = keyword |
| **Author Mapping** | | |
| `author.surname` | `//Author/LastName` | |
| `author.name` | `//Author/ForeName` | |
| `author.fullname` | `//Author/FullName` | Concatenation of forename and last name, if they exist |
| `author.rank` | FOR ALL AUTHORS | sequential number starting from 1 |
| **Journal Mapping** | | |
| `container.conferencedate` | `//Journal/PubDate` | map the date of the Journal |
| `container.name` | `//Journal/Title` | name of the journal |
| `container.vol` | `//Journal/Volume` | journal volume |
| `container.issnPrinted` | `//Journal/ISSN` | the journal issn |
| `container.iss` | `//Journal/Issue` | The journal issue |
| **Instance Mapping** | | |
| `instance.type` | `//PublicationType` | if the article has the typology `Journal Article`, this type is applied; otherwise we look for a term that matches the vocabulary, and if none matches we discard it |
| `type` | <ul><li>`\attributes\types\resourceType`</li> <li> `\attributes\types\resourceTypeGeneral` </li> <li>`attributes\types\schemaOrg`</li></ul> | Using the **_dnet:result_typologies_** vocabulary, we look up the `instance.type` synonym to generate one of the following main entities: <ul><li>`publication`</li> <li>`dataset`</li> <li> `software`</li> <li>`otherresearchproduct`</li></ul> |
| `instance.pid` | `//PMID` | map the pmid in the pid in the instance |
| `instance.url` | `//PMID` | creates the URL by prepending `https://pubmed.ncbi.nlm.nih.gov/` to the PMID |
| `instance.alternateIdentifier` | `//ArticleId[./@IdType="doi"]` | |
| `instance.publicationdate` | `//PubmedPubDate` | clean and normalize the format of the date to be YYYY-mm-dd |
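
Part of the mapping above can be sketched with the standard library XML parser. The sketch covers only the PMID, author, URL and DOI rows; the function name and the output layout are hypothetical, and the input is assumed to be a single article element.

```python
import xml.etree.ElementTree as ET

PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/"


def map_article(xml_text: str) -> dict:
    """Extract pid, authors, URL and DOI from one PubMed article record."""
    root = ET.fromstring(xml_text)
    pmid = root.findtext(".//PMID")
    authors = []
    # The rank is a sequential number starting from 1.
    for rank, author in enumerate(root.iter("Author"), start=1):
        surname = author.findtext("LastName")
        name = author.findtext("ForeName")
        authors.append({
            "name": name,
            "surname": surname,
            "fullname": " ".join(p for p in (name, surname) if p),
            "rank": rank,
        })
    return {
        "pid": {"scheme": "pmid", "value": pmid},
        "author": authors,
        "instance": {
            "url": PUBMED_URL + pmid,
            "alternateIdentifier": root.findtext(".//ArticleId[@IdType='doi']"),
        },
    }
```
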

View File

@ -0,0 +1,31 @@
# UniProtKB/Swiss-Prot
This section describes the mapping implemented to integrate metadata and links from [UniProtKB/Swiss-Prot](https://www.uniprot.org/).
The complete data dump "Reviewed (Swiss-Prot)" can be downloaded from [here](https://www.uniprot.org/help/downloads).
From this dataset, only the protein records linked to a PubMed publication are extracted.
## Entity Mapping
The table below describes the mapping from the TEXT metadata format to the OpenAIRE Graph Dataset format.
You can check an example of the text metadata [here](https://rest.uniprot.org/uniprotkb/A0A0C5B5G6.txt)
| OpenAIRE Research Product field path | FASTA record field xpath | Notes |
|------------------------------|--------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
| **BIOEntity Mapping** | | |
| `id` | `LINE Starts with AC` | id in the form `uniprot_____::md5(id)` |
| `pid` | `LINE Starts with AC` | example `AC A0A0C5B5G6;` classid=classname=`uniprot` the value is the text after `AC` |
| `publicationdate` | `LINE START WITH DT containing text integrated into UniProtKB/Swiss-Prot` | clean and normalize the format of the date to be `YYYY-mm-dd` |
| `maintitle` | `LINE START WITH GN` | main title |
| **Instance Mapping** | | |
| `instance.type` | | `Bioentity` |
| `type` | | `Dataset` |
| `instance.pid` | `LINE Starts with AC` | `classid = classname = uniprot` |
| `instance.url` | `pid` | prepend to the value `https://www.uniprot.org/uniprot/` |
| `instance.publicationdate` | `LINE START WITH DT containing text integrated into UniProtKB/Swiss-Prot` | clean and normalize the format of the date to be YYYY-mm-dd |
### Relation Mapping
| OpenAIRE Relation Semantic and inverse | Source/Target type | Notes |
|----------------------------------------|----------------------|--------------------------------------------------------------------------------------------------------------------------|
| `IsRelatedTo` | `LINE START WITH RX` | the mapping creates relationships between the BioEntity and the PubMed or DOI generating an unresolved target identifier |
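
The line-based mapping above can be sketched as a small parser of the text format. It is an illustration only: it assumes two-letter line codes, takes the first accession on the first `AC` line, keeps the first `GN` line as title, expects `DT` dates such as `25-OCT-2017`, and uses hypothetical function and key names.

```python
import hashlib
import re

MONTHS = {name: number for number, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}


def parse_entry(text: str) -> dict:
    """Map one UniProt text entry to a Bioentity with unresolved relations."""
    accession = title = date = None
    targets = []
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        if code == "AC" and accession is None:
            accession = value.split(";")[0].strip()
        elif code == "DT" and "integrated into UniProtKB/Swiss-Prot" in value:
            day, month, year = value.split(",")[0].split("-")
            date = f"{year}-{MONTHS[month.upper()]:02d}-{int(day):02d}"
        elif code == "GN" and title is None:
            title = value
        elif code == "RX":
            # Each RX line may reference a PubMed id and/or a DOI.
            targets += re.findall(r"(PubMed|DOI)=([^;]+);", value)
    digest = hashlib.md5(accession.encode("utf-8")).hexdigest()
    return {
        "id": "uniprot_____::" + digest,
        "pid": {"scheme": "uniprot", "value": accession},
        "publicationdate": date,
        "maintitle": title,
        "instance": {"type": "Bioentity",
                     "url": "https://www.uniprot.org/uniprot/" + accession},
        "relations": [{"scheme": s.lower(), "value": v} for s, v in targets],
    }
```
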

View File

@ -1,15 +1,10 @@
---
sidebar_position: 4
---
# Cleaning
# Post cleaning
At the very end of the processing pipeline, a step is dedicated to perform cleaning operations aimed at improving the overall quality of the data.
The output of this final cleansing step is the final version of the OpenAIRE Research Graph.
## Vocabulary based cleaning
<!-- ## Vocabulary based cleaning -->
The aggregation processes run independently one from another and continuously. Each aggregation process, depending on the characteristics of the records exposed by the data source, makes use of one or more vocabularies to harmonise the values available in a given field.
In this page, we describe the *vocabulary-based cleaning* operation performed to harmonise the data of the different data sources.
A vocabulary is a data structure that defines a list of terms, and for each term defines a list of synonyms:
```xml
@ -32,24 +27,11 @@ A vocabulary is a data structure that defines a list of terms, and for each term
[...]
```
Each vocabulary is typically used to control and harmonise the values available in a specific field characterising the bibliographic records. The example above provides a preview of the vocabulary used to clean the [result's instance typology](/data-model/entities/result#instance).
Each vocabulary is typically used to control and harmonise the values available in a specific field characterising the bibliographic records. The example above provides a preview of the vocabulary used to clean the [research product's instance typology](../data-model/entities/research-product#instance).
The content of the vocabularies can be accessed on [api.openaire.eu/vocabularies](https://api.openaire.eu/vocabularies/).
Given a value provided in the original records, the cleaning process looks for a synonym and, when found, resolves the corresponding term which is used in turn to build the cleaned record.
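
The synonym lookup can be sketched as follows. The vocabulary content below is made up for the example, matching is assumed to be case-insensitive, and returning `None` for values without a synonym is a simplification (the text does not say what happens in that case).

```python
# Hypothetical vocabulary: each term comes with its list of synonyms.
VOCABULARY = {
    "Article": ["article", "journal article"],
    "Dataset": ["dataset", "research data"],
}


def build_synonym_index(vocabulary: dict) -> dict:
    """Map every synonym (lower-cased) to its term."""
    return {
        synonym.lower(): term
        for term, synonyms in vocabulary.items()
        for synonym in synonyms
    }


def clean(value: str, synonym_index: dict):
    """Resolve the term for a raw value, or None when no synonym matches."""
    return synonym_index.get(value.strip().lower())
```
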
Each aggregation process applies vocabularies according to their definitions in a given moment of time, however, it could be the case that a vocabulary changes after the aggregation of one data source has finished, thus the aggregated content does not reflect the current status of the controlled vocabularies.
In addition, the integration of ScholeXplorer and DOIBoost and some enrichment processes applied on the raw and on the de-duplicated graph may introduce values that do not comply with the current status of the OpenAIRE controlled vocabularies. For these reasons, we included a final step of cleansing at the end of the workflow materialisation.
## Filtering
Bibliographic records that do not meet minimal requirements for being part of the OpenAIRE Research Graph are eliminated during this phase.
Currently, the only criteria applied horizontally to the entire graph aims at excluding scientific results whose title is not meaningful for citation purposes.
Then, different criteria are applied in the pre-processing of specific sub-collections:
* [Crossref filtering](/data-provision/aggregation/doiboost#crossref-filtering)
## Country cleaning
This phase is responsible for removing the country information from result records that match specific criteria. The need for this phase is driven by the fact that some datasources, although referred of national pertinence, they contain material that is not always related to the given country.

View File

@ -0,0 +1,37 @@
# Deduction
The Deduction process (also known as “bulk tagging”) enriches each record with new information that can be derived from the existing property values.
This process is used to associate research products to community/research initiatives that are part of OpenAIRE.
As of November 2022, three procedures are in place to relate a research product to a research initiative, infrastructure (RI) or community (RC) based on:
* subjects: it is possible to specify a list of subjects that are relevant for the RC/RI. Every time one of the subjects is found among the subjects of a research product, the research product is linked to the RC/RI.
<p align="center">
<img loading="lazy" alt="Bulktagging Subject" src={require('../../assets/img/enrichment/bulktagging_subject.png').default} width="50%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
* data sources: it is possible to list a set of data sources relevant for the RC/RI. All research products collected from these data sources will be linked to the RC/RI
<p align="center">
<img loading="lazy" alt="Bulktagging Data source" src={require('../../assets/img/enrichment/bulktagging_datasource.png').default} width="50%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
When only some research products collected from a data source are relevant for the RC/RI, it is possible to specify a set of selection constraints (SC) that have to be verified before linking the research product to the
community. The selection constraint has the form <strong>SC = S1 or S2 or ... or Sn</strong>. The generic Si has the form <strong>Si = s<sub>i1</sub> and s<sub>i2</sub> and ... and s<sub>in</sub></strong>, and each s<sub>ij</sub> is a condition on a specific field of the research product. The set of fields that can be specified is <strong>F={title, author, contributor, description, orcid}</strong>,
while the condition can be chosen among <strong>V={contains, equals, not_contains, not_equals, contains_ignorecase, equals_ignorecase, not_contains_ignorecase, not_equal_ignorecase}</strong>, and the value is free text.
A possible selection criterion is: “All the products whose contributor contains DARIAH”.
<p align="center">
<img loading="lazy" alt="Bulktagging Data source" src={require('../../assets/img/enrichment/bulktagging_selconstraints.png').default} width="70%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
* Zenodo community: it is possible to list a set of Zenodo communities relevant for the RC/RI. All the products collected from the listed Zenodo communities are linked to the RC/RI
<p align="center">
<img loading="lazy" alt="Bulktagging Zenodo Community" src={require('../../assets/img/enrichment/bulktagging_zenodo.png').default} width="50%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
The list of subjects, Zenodo communities and data sources used to enrich the products are defined by the managers of the community gateway or infrastructure monitoring dashboard associated with the RC/RI.
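The selection constraints described above are a disjunction of conjunctions over field conditions. The following is a minimal sketch of how such a constraint could be evaluated, not the actual bulk tagging implementation. The dictionary layout (`field`, `verb`, `value`), the function names, and the treatment of multi-valued fields (a positive condition holds if any value satisfies it) are assumptions made for illustration.

```python
def condition_holds(condition, product):
    """Evaluate one condition s_ij on a product (hypothetical data layout)."""
    verb = condition["verb"]
    wanted = condition["value"]
    values = product.get(condition["field"], [])

    # *_ignorecase verbs compare lower-cased strings.
    if verb.endswith("_ignorecase"):
        wanted = wanted.lower()
        values = [v.lower() for v in values]

    # "contains" verbs test for a substring, the others test for equality.
    if "contains" in verb:
        hit = any(wanted in v for v in values)
    else:
        hit = any(wanted == v for v in values)

    # not_* verbs negate the positive test.
    return not hit if verb.startswith("not_") else hit


def satisfies(selection_constraint, product):
    """SC = S1 or S2 or ... or Sn, where each Si is an 'and' of conditions."""
    return any(
        all(condition_holds(c, product) for c in conjunction)
        for conjunction in selection_constraint
    )


# "All the products whose contributor contains DARIAH"
sc = [[{"field": "contributor", "verb": "contains", "value": "DARIAH"}]]
print(satisfies(sc, {"contributor": ["DARIAH-EU"]}))
```

With this layout, adding a second inner list to `sc` adds an alternative (`or`), while adding a condition to an existing inner list tightens it (`and`).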

View File

@ -0,0 +1,55 @@
# Propagation
This process enriches the graph by adding new links and/or new properties. The new information is added by exploiting existing semantic
relationships and values between the involved entities.
As of November 2022, the following procedures are in place:
* Country propagation: updates the property “country” of a research product. This happens when the research product is collected from an institutional data source or when the data source hosting the research product is included in a whitelist. For all the research products whose hosting data source satisfies one of the conditions above, the country of the organization providing the data source is added to the countries of the research product: e.g. a publication collected from an institutional repository maintained by an Italian university will be enriched with the property “country = IT”.
<p align="center">
<img loading="lazy" alt="Country Propagation" src={require('../../assets/img/enrichment/propagation_country.png').default} width="50%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
* Project propagation: adds an "isProducedBy" relationship (and its inverse) between a project P and a research product R1, if R1 has a strong semantic relationship with another research product R2 and P produces R2: e.g. if a publication linked to project P “is supplemented by” a dataset D, then dataset D will get the link to project P. The relationships considered for this procedure are “isSupplementedBy” and “isSupplementTo”.
<p align="center">
<img loading="lazy" alt="Project Propagation" src={require('../../assets/img/enrichment/propagation_resulttoproject.png').default} width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
* Research product to RC/RI through organization propagation. The manager of the RC/RI can specify a set of organizations whose products are relevant for the
community.
Each research product having an affiliation relation with at least one organization relevant for the RC/RI will be linked to it.
<p align="center">
<img loading="lazy" alt="Research product to community through organization propagation" src={require('../../assets/img/enrichment/propagation_resulttocommunitythroughorganization.png').default}
width="50%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
* Research product to RC/RI through semantic relation: extends the set of products linked to a RC/RI by exploiting strong semantic relationships between the research products;
e.g. if a research product R1 is associated to the community C and is supplemented by a research product R2 then R2 will be linked to the community. The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
<p align="center">
<img loading="lazy" alt="Research product to community through semantic relation propagation" src={require('../../assets/img/enrichment/propagation_resulttocommunitythroughsemrel.png').default} width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
* ORCID identifiers to research product through semantic relation. This propagation enriches the research products by adding ORCID identifiers to authors. The added ORCID identifiers are marked as "potential", since they have been inserted through propagation.
The process considers the set of overlapping authors between research products (R1 and R2) linked with a strong semantic relationship (IsSupplementedBy, IsSupplementTo).
For each author A in the overlapping set, if R1 provides the ORCID value for A and R2 does not, then the author A in R2 will be enriched with the information of the ORCID found in R1.
<p align="center">
<img loading="lazy" alt="Orcid propagation through semantic relation" src={require('../../assets/img/enrichment/propagation_orcid.png').default} width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
* affiliation to organization through institutional repository. This propagation adds one "hasAuthorInstitution" relationship (and its inverse)
between a research product R and Organization O,
if R was collected from a datasource D with type institutional repository, and D was provided by O.
<p align="center">
<img loading="lazy" alt="Affiliation propagation through institutional repository" src={require('../../assets/img/enrichment/propagation_affiliationistrepo.png').default} width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
* affiliation to organization through semantic relation. This propagation adds one "hasAuthorInstitution" relationship (and its inverse) between a
research product R and an Organization O,
if R has an affiliation relation with an organization O1 that is in relation "isChildOf" with O.
<p align="center">
<img loading="lazy" alt="Affiliation propagation through semantic relation" src={require('../../assets/img/enrichment/propagation_organizationsemrel.png').default} width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
The algorithm exploits only the organization leaves that are in an "IsChildOf" relation with another organization. So far, only a single propagation step is performed.
<p align="center">
<img loading="lazy" alt="propagation strategy" src={require('../../assets/img/enrichment/organization_tree.png').default} width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
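As an illustration of the ORCID propagation rule listed above, the following minimal sketch enriches the authors of a product R2 with the ORCID identifiers known for the same authors in a related product R1. The field names (`name`, `orcid`, `orcid_status`) are hypothetical, and matching authors by case-insensitive name equality is an assumption: the text only states that an overlapping set of authors is considered, not how the overlap is computed.

```python
def propagate_orcids(r1_authors, r2_authors):
    """Return a copy of r2_authors enriched with ORCIDs found in r1_authors."""
    # ORCIDs provided by R1, indexed by a simplified author key (assumption).
    known = {
        a["name"].lower(): a["orcid"]
        for a in r1_authors
        if a.get("orcid")
    }

    enriched = []
    for author in r2_authors:
        author = dict(author)  # do not mutate the caller's records
        orcid = known.get(author["name"].lower())
        # Only fill in ORCIDs that R2 does not already provide.
        if orcid and not author.get("orcid"):
            author["orcid"] = orcid
            author["orcid_status"] = "potential"  # inserted through propagation
        enriched.append(author)
    return enriched


r1 = [{"name": "Ada Lovelace", "orcid": "0000-0000-0000-0001"}]
r2 = [{"name": "ada lovelace"}, {"name": "Charles Babbage"}]
print(propagate_orcids(r1, r2))
```

Authors that appear only in R2 are returned unchanged, and an ORCID already present in R2 is never overwritten.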

View File

@ -0,0 +1,51 @@
# Deduplication
The OpenAIRE Graph is populated by aggregating metadata records from distinct data sources whose content typically overlaps; for example, article metadata records are collected both from publishers' archives (e.g. Frontiers, Elsevier, Copernicus) and from pre-print platforms (e.g. ArXiv.org, UKPubMed, BioarXiv.org). To support the monitoring of science, the OpenAIRE Graph implements record deduplication and merge strategies, so that scientific production can be represented consistently in statistics. Such strategies reflect the following intuition behind OpenAIRE monitoring: "Two metadata records are equivalent when they describe the same research product, hence they feature compatible resource types, have the same title, the same authors, or, alternatively, the same PID". Finally, groups of duplicates can be whitelisted or blacklisted, in order to manually refine the quality of this strategy.
It should be noted that publication dates do not make a difference, as different versions of the same product can be published at different times (e.g. the pre-print and the published version of a scientific article, which should be counted as one object). Abstracts, subjects, and other related fields are not used to strengthen similarity, due to their heterogeneity or absence across different data sources. Moreover, even when one product is indicated as a new version of another, the presence of different authors keeps them in separate groups, to avoid an unfair distribution of scientific reward.
Groups of duplicates are finally merged into a new "dedup" record that embeds all properties of the merged records and carries provenance information about the data sources and the related "instances", i.e. manifestations of the products, together with their resource type, access rights, and publishing date.
## Methodology overview
The deduplication process can be divided into five different phases:
* Collection import
* Candidate identification (clustering)
* Duplicates identification (pair-wise comparisons)
* Duplicates grouping (transitive closure)
* Relation redistribution
<p align="center">
<img loading="lazy" alt="Deduplication Workflow" src={require('../../assets/img/deduplication-workflow.png').default} width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/1lLLSU3wsWighmxGQMNMZbgP3mg3BfDVAGVLwt4_OFA8/edit?usp=sharing)
### Collection import
The nodes in the graph represent entities of different types. This phase is responsible for identifying all the nodes of a given type and making them available to the subsequent phases, representing them in the deduplication record model.
### Candidate identification (clustering)
Clustering is a common heuristic used to avoid the N x N comparisons required to match all pairs of objects in order to identify the equivalent ones. The challenge is to identify a [clustering function](./clustering-functions) that maximizes the chance of comparing only records that may lead to a match, while minimizing the number of equivalent records that are never compared. Since the equivalence function is to some degree tolerant to minimal errors (e.g. switching of characters in the title, or minimal differences in letters), this function must be neither too precise (e.g. a hash of the title) nor too flexible (e.g. random ngrams of the title). On the other hand, in some cases the equality of two records can only be determined by their PIDs (e.g. DOI), as the metadata properties are very different across different versions and no [clustering function](./clustering-functions) will ever bring them into the same cluster.
### Duplicates identification (pair-wise comparisons)
Pair-wise comparisons are conducted over records in the same cluster following the strategy defined in the decision tree. A different decision tree is adopted depending on the type of the entity being processed.
To further limit the number of comparisons, a sliding window mechanism is used: (i) records in the same cluster are lexicographically sorted by their title, (ii) a window of K records slides over the cluster, and (iii) records ending up in the same window are pair-wise compared. Each comparison produces a similarity relation when the pair of records matches. Such relations are subsequently used as input for the duplicates grouping stage.
### Duplicates grouping (transitive closure)
Once the similarity relations between pairs of records are drawn, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new representative object is obtained, which inherits all properties from the merged records and keeps track of their provenance.
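The grouping step can be pictured as computing the connected components of the graph whose edges are the similarity relations. The sketch below uses a union-find structure for this; it is an illustration of the transitive closure idea, not the code used in the OpenAIRE workflow, and the function name and input format (pairs of record identifiers) are assumptions.

```python
def group_duplicates(similarity_relations):
    """Group record ids connected, directly or transitively, by similarity."""
    parent = {}

    def find(x):
        # Follow parent links to the group representative, compressing paths.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each similarity relation merges the groups of its two endpoints.
    for a, b in similarity_relations:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# A~B and B~C imply that A, B and C end up in the same group.
print(group_duplicates([("A", "B"), ("B", "C"), ("D", "E")]))
```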
### Relation redistribution
Relations involving nodes identified as duplicates are eventually marked as virtually deleted and used as templates for creating new relations pointing to the new representative record.
Note that nodes and relationships marked as virtually deleted are not exported.
<p align="center">
<img loading="lazy" alt="Deduplication Workflow" src={require('../../assets/img/dedup-relation-fixup.png').default} width="75%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/1cDEuVhWnSO8lUZs_Nd748vKfIPxg10jbwKSVZlv33Mg/edit?usp=sharing)

View File

@ -0,0 +1,70 @@
---
sidebar_position: 2
---
# Organizations
The organizations in OpenAIRE are aggregated from different registries (e.g. CORDA, OpenDOAR, Re3data, ROR). In some cases, a registry provides organizations as entities with their own persistent identifier. In other cases, those organizations are extracted from other main entities provided by the registry (e.g. datasources, projects, etc.).
The deduplication of organizations is enhanced by [OpenOrgs](https://orgs.openaire.eu), a tool that combines an automated approach for identifying duplicated instances
of the same organization record with a "humans in the loop" approach, in which the equivalences produced by a duplicate identification algorithm are suggested to data curators, who are in charge of validating them.
The data curation activity is twofold: on the one hand it revolves around the disambiguation task, on the other hand it aims to improve the metadata describing the organization records
(e.g. including the translated name, or a different PID) and to define the hierarchical structure of existing large organizations (i.e. universities comprising their departments, or large research centers with all their sub-units or sub-institutes).
Duplicates among organizations are therefore managed through three different stages:
* *Creation of Suggestions*: executes an automatic workflow that performs the deduplication and prepares new suggestions to be processed by the curators;
* *Curation*: manual editing of the organization records performed by the data curators;
* *Creation of Representative Organizations*: executes an automatic workflow that creates curated organizations and exposes them on the OpenAIRE Graph by using the curators' feedback from the OpenOrgs underlying database.
The next sections describe the above mentioned stages.
### Creation of Suggestions
This stage executes an automatic workflow that covers the *candidate identification* and the *duplicates identification* stages of the deduplication to provide suggestions for the curators in OpenOrgs.
#### Candidate identification (clustering)
To limit the number of comparisons, OpenAIRE clustering for organizations aims at grouping records that are more likely to be comparable.
It works with four functions:
* *URL-based function*: the function generates the URL domain when this is provided as part of the record properties from the organization's `websiteurl` field;
* *Title-based functions*:
* generate strings dependent on the keywords in the `legalname` field;
* generate strings obtained as an alternation of the function prefix(3) and suffix(3) (and vice versa) on the first 3 words of the `legalname` field;
* generate strings obtained as a concatenation of ngrams of the `legalname` field;
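As a rough illustration of the URL-based function, the sketch below derives a clustering key from an organization's `websiteurl` value using Python's standard `urllib.parse`. The function name is hypothetical, and the actual function may normalize the domain differently (for instance, whether a leading `www.` is stripped is not stated in the text, so it is kept here).

```python
from urllib.parse import urlparse


def url_domain_key(websiteurl):
    """Return the host name of a URL as a clustering key, or None if absent."""
    if not websiteurl:
        return None
    # urlparse only recognises the host when the URL carries a scheme;
    # hostname is returned in lower case.
    return urlparse(websiteurl.strip()).hostname


# Two records whose URLs share the same host fall into the same cluster.
print(url_domain_key("https://www.openaire.eu/about"))
print(url_domain_key("https://www.openaire.eu/contact-us"))
```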
#### Duplicates identification (pair-wise comparisons)
For each pair of organizations in a cluster the following strategy (depicted in the figure below) is applied.
The comparison goes through the following decision tree:
1. *grid id check*: comparison of the grid ids. If the grid id is equivalent, then the similarity relation is drawn. If the grid id is not available, the comparison proceeds to the next stage;
2. *early exits*: comparison of the numbers extracted from the `legalname`, the `country` and the `website` url. No similarity relation is drawn in this stage; the comparison proceeds only if the compared fields satisfy the conditions of equivalence;
3. *city check*: comparison of the city names in the `legalname`. The comparison proceeds only if the legalnames share at least 10% of cities;
4. *keyword check*: comparison of the keywords in the `legalname`. The comparison proceeds only if the legalnames share at least 70% of keywords;
5. *legalname check*: comparison of the normalized `legalnames` with the `Jaro-Winkler` similarity to determine whether it is higher than `0.9`. If so, a similarity relation is drawn. Otherwise, no similarity relation is drawn.
<p align="center">
<img loading="lazy" alt="Organization Decision Tree" src={require('../../assets/img/decisiontree-organization.png').default} width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/1YKInGGtHu09QG4pT2gRLEum4LxU82d4nKkvGNvRQmrg/edit?usp=sharing)
### Data Curation
All the similarity relations drawn by the algorithm involving the decision tree are exposed in OpenOrgs, where they are made available to the data curators, who can give feedback and improve the organizations' metadata.
A data curator can:
* *edit organization metadata*: legalname, pid, country, url, parent relations, etc.;
* *approve suggested duplicates*: establish if an equivalence relation is valid;
* *discard suggested duplicates*: establish if an equivalence relation is wrong;
* *create similarity relations*: add a new equivalence relation not drawn by the algorithm.
Note that if a curator does not provide feedback on a similarity relation suggested by the algorithm, then such a relation is considered valid.
### Creation of Representative Organizations
This stage executes an automatic workflow that covers the *duplicates grouping* stage to create representative organizations and to update them on the OpenAIRE Graph. Such organizations are obtained via transitive closure, and the relations used come from the curators' feedback gathered in the OpenOrgs underlying database.
#### Duplicates grouping (transitive closure)
Once the similarity relations between pairs of organizations have been gathered, the groups of equivalent organizations are obtained (transitive closure, i.e. “mesh”). From such sets a new representative organization is obtained, which inherits all properties from the merged records and keeps track of their provenance.
The IDs of the representative organizations are obtained from the OpenOrgs database, which creates a unique ``openorgs`` ID for each approved organization. In case an organization is not approved by the curators, the ID is obtained by prepending the prefix ``pending_org`` to the MD5 of the first ID (given their lexicographical ordering).

View File

@ -0,0 +1,170 @@
---
sidebar_position: 1
---
# Research products
Duplicates among research products are identified among results of the same
type (publications, datasets, software, other research products). If two
duplicate research products are aggregated one as a dataset and one as a
software, for example, they will never be compared and they will never be
identified as duplicates.
OpenAIRE supports different deduplication strategies based on the type of
results.
The next sections describe how each stage of the deduplication workflow is handled
for research products.
### Candidate identification (clustering)
To match the requirements of limiting the number of comparisons, OpenAIRE
clustering for research products works with two different strategies based on
entity types:
#### Software
* *Title extraction functions*:
two clustering functions are applied to the title (normalized, stemmed, etc.)
* *stats and suffix prefix of words*: the function generates a key that
depends on (i) number of significant words in the title, (ii) modulo 10 of
the number of characters of such words, and (iii) a string
obtained as an alternation of the function prefix(3) and suffix(3) (and
vice-versa) on the first 3 words (2 words if the title only has 2). For
example, the title ``Search for the Standard Model Higgs Boson``
becomes the two keys ``5-3-seaardmod`` and ``5-3-rchstadel``
* *n-grams*: the function generates ngrams from the
title. For example, the
title ``Search for the Standard Model Higgs Boson``
becomes the keys ``tan``, ``sta``, ``ode``, ``mod``, ``ear``, ``hig``,
``igg``, ``sea``
* *DOI extraction function*: the function generates the DOI when this is
provided as part of the record properties
* *URL extraction function*: the function generates the hostname part provided
by the URL of the software, if any
#### Publication, Dataset and Other Research Product
* *PID extraction function*: the function generates the PIDs when at least one
is provided as part of the ``pid`` record properties
* *Author and Title extraction function*: the function generates a key that
depends on (i) the number of authors of the product, with a cap of 21
authors (ii) number of significant words in the title (normalized, stemming,
etc.), divided by 10, and (iii) a string obtained as an alternation of the
function prefix(3) and suffix(3) (and vice versa) on the first 3 words (2
words if the title only has 2).
<br />
For example, a product composed by 197 authors and
titled ``Search for the Standard Model Higgs Boson``
becomes the two keys ``21-0-seaardmod`` and ``21-0-rchstadel``
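The construction of the author and title keys can be sketched as follows. This is a simplified illustration, not the actual clustering function: the stop-word list used to pick "significant" words is a small hypothetical one, stemming is omitted, and the function names are made up for the example.

```python
import re

# Hypothetical stop-word list; the real normalization is more elaborate.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def significant_words(title):
    """Lower-case the title and drop stop words (stemming is omitted here)."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return [w for w in words if w not in STOPWORDS]


def author_title_keys(n_authors, title):
    """Build the two keys: <authors capped at 21>-<words // 10>-<affixes>."""
    words = significant_words(title)
    head = words[:3]  # first 3 significant words (fewer if the title is short)
    capped = min(n_authors, 21)
    bucket = len(words) // 10

    # Alternate prefix(3)/suffix(3) over the words, then the reverse order.
    prefix_first = "".join(
        w[:3] if i % 2 == 0 else w[-3:] for i, w in enumerate(head)
    )
    suffix_first = "".join(
        w[-3:] if i % 2 == 0 else w[:3] for i, w in enumerate(head)
    )
    return [f"{capped}-{bucket}-{prefix_first}", f"{capped}-{bucket}-{suffix_first}"]


print(author_title_keys(197, "Search for the Standard Model Higgs Boson"))
# ['21-0-seaardmod', '21-0-rchstadel']
```

Under these assumptions the sketch reproduces the two keys given in the example above for a product with 197 authors.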
### Duplicates identification (pair-wise comparisons)
Comparisons in a block are performed using a *sliding window* set to 50 records.
The records are sorted lexicographically on the normalized version of their
titles. The 1st record is compared against all the 50 following ones using the
decision tree, then the second, etc.
Local information about matching records is kept and possibly used to prune
unneeded comparisons: for example, once it is known that A is equal to both B and
C, B will not be compared against C, because the A,B,C group will be discovered
anyway by the global transitive closure step later.
<br />
A different decision tree is adopted depending on the type of the entity being
processed.
Similarity relations drawn in this stage will be consequently used to perform
the duplicates grouping.
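The sliding window and the pruning of unneeded comparisons can be sketched as below. The record layout (`id`, `title`), the `is_match` callback standing in for the decision tree, and the way local match information is stored are all assumptions made for illustration; only the window size of 50 and the sorting on normalized titles come from the description above.

```python
def sliding_window_matches(records, is_match, window=50):
    """Compare each record with the following `window` records in title order."""
    # Sort on a normalized title (plain lower-casing here, as a simplification).
    ordered = sorted(records, key=lambda r: r["title"].lower())

    anchor = {}     # record id -> id of the record it is already known to equal
    relations = []  # similarity relations drawn in this block

    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + 1 + window]:
            left_anchor = anchor.get(left["id"])
            right_anchor = anchor.get(right["id"])
            # Prune: both records already match the same earlier record, so the
            # transitive closure will put them in the same group anyway.
            if left_anchor is not None and left_anchor == right_anchor:
                continue
            if is_match(left, right):
                root = left_anchor if left_anchor is not None else left["id"]
                anchor[left["id"]] = root
                anchor[right["id"]] = root
                relations.append((left["id"], right["id"]))
    return relations


block = [{"id": "A", "title": "x"}, {"id": "B", "title": "x"}, {"id": "C", "title": "x"}]
print(sliding_window_matches(block, lambda r1, r2: r1["title"] == r2["title"]))
```

In this example B and C are never compared, because both are already known to be equal to A.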
#### Publications
For each pair of publications in a cluster the following strategy (depicted in
the figure below) is applied.
The comparison goes through different stages:
1. *trusted pids check*: comparison of the trusted pid lists (in the `pid` field
of the record). If at least 1 pid is equivalent, records match and the
similarity relation is drawn.
2. *instance type check*: comparison of the instance types (indicating the
subtype of the record, i.e. presentation, conference object, etc.). If the
instance types are not compatible then the records do not match. Otherwise,
the comparison proceeds to the next stage
3. *untrusted pids check*: comparison of all the available pids (in the `pid`
and the `alternateid` fields of the record). In every case, no similarity
relation is drawn in this stage. If at least one pid is equivalent, the next
stage will be a *soft check*, otherwise the next stage is a *strong check*.
4. *soft check*: comparison of the record titles with the Levenshtein distance.
If the distance measure is above 0.9 then the similarity relation is drawn.
5. *strong check*: comparison composed by three substages involving the (i)
comparison of the author list sizes and the version of the record to
determine if they are coherent, (ii) comparison of the record titles with the
Levenshtein distance to determine if it is higher than 0.95, (iii) "smart"
comparison of the author lists to check if common authors are more than 60%
in case of titles whose length is greater than 30 chars or more than 90%
otherwise.
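The stages listed above can be read as a short chain of checks. The sketch below is one possible reading, under several assumptions that the text does not settle: the title score is taken to be a similarity in [0, 1] derived from the Levenshtein distance, instance types are "compatible" only when equal, the "smart" author comparison is replaced by case-insensitive name equality, and the coherence check on author list sizes and versions is left out because its rule is not specified. Record field names other than `pid` and `alternateid` are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def title_similarity(a, b):
    """Similarity in [0, 1]; 1.0 means identical titles (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def shared_author_ratio(authors1, authors2):
    """Stand-in for the 'smart' author comparison: exact names, case-insensitive."""
    s1 = {a.lower() for a in authors1}
    s2 = {a.lower() for a in authors2}
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / min(len(s1), len(s2))


def publications_match(r1, r2):
    # 1. trusted pids check: one shared pid is enough.
    if set(r1["pid"]) & set(r2["pid"]):
        return True
    # 2. instance type check (equality used as a stand-in for compatibility).
    if r1["type"] != r2["type"]:
        return False
    # 3. untrusted pids check: only selects the next stage.
    ids1 = set(r1["pid"]) | set(r1["alternateid"])
    ids2 = set(r2["pid"]) | set(r2["alternateid"])
    sim = title_similarity(r1["title"].lower(), r2["title"].lower())
    if ids1 & ids2:
        return sim > 0.9  # 4. soft check
    # 5. strong check (the size/version coherence substage is omitted here).
    if sim <= 0.95:
        return False
    shortest_title = min(len(r1["title"]), len(r2["title"]))
    threshold = 0.6 if shortest_title > 30 else 0.9
    return shared_author_ratio(r1["authors"], r2["authors"]) > threshold
```

For instance, two records of the same type with identical long titles and identical author lists match through the strong check even when they share no identifier.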
<p align="center">
<img loading="lazy" alt="Publications Decision Tree" src={require('../../assets/img/decisiontree-publication.png').default} width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/19SIilTp1vukw6STMZuPMdc0pv0ODYCiOxP7OU3iPWK8/edit?usp=sharing)
#### Datasets and Other types of research products
For each pair of datasets or other types of research products in a cluster the
strategy depicted in the figure below is applied.
The decision tree is almost identical to the publication decision tree, with the
only exception of the *instance type check* stage. Since these types of records
do not have a relatable instance type, the check is not performed and the
decision tree node is skipped.
<p align="center">
<img loading="lazy" alt="Dataset and Other types of research products Decision Tree" src={require('../../assets/img/decisiontree-dataset-orp.png').default} width="90%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/1uBa7Bw2KwBRDUYIfyRr_Keol7UOeyvMNN7MPXYLg4qw/edit?usp=sharing)
#### Software
For each pair of software in a cluster the following strategy (depicted in the
figure below) is applied.
The comparison goes through different stages:
1. *DOI pids and URLs check*: comparison of the pids of type DOI and URLs in the
records. If at least 1 DOI is equivalent or 1 URL is equivalent, then records
match and the similarity relation is drawn
2. *title check*: comparison of the record titles with Levenshtein distance,
excluding versioning information.
If the distance is below 0.95 then the records do not match. Otherwise, the
comparison proceeds to the next stage
3. *untrusted DOI check*: comparison of all the available DOIs (in the `pid` and
the `alternateid` fields of the record). If at least 1 DOI is equivalent,
records match and the similarity relation is drawn
4. *authors check*: "smart" comparison of the author lists to check if the two
products share all authors
<p align="center">
<img loading="lazy" alt="Software Decision Tree" src={require('../../assets/img/decisiontree-software.png').default} width="85%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/19gd1-GTOEEo6awMObGRkYFhpAlO_38mfbDFFX0HAkuo/edit?usp=sharing)
### Duplicates grouping
The aim of the final stage is the creation of objects that group all the equivalent
entities discovered by the previous step. This is done in two phases.
#### Transitive closure
As a final step of duplicate identification a transitive closure
is run against similarity relations to find groups of duplicates not directly
caught by the previous steps. If a group is larger than 200 elements, only the
first 200 elements are included in the group, while the remaining ones are
left ungrouped.
#### Creation of representative record (dedup record)
The general concept is that the field coming from the record with higher "trust"
value is used as reference for the field of the representative record.
The IDs of the representative records are obtained by prepending the
prefix ``dedup_`` to the MD5 of the first ID (given their lexicographical
ordering). If the group of merged records contains a trusted ID type (i.e. the
DOI), also the type keyword (i.e. ``DOI``) is added to the prefix.

View File

@ -0,0 +1,30 @@
---
sidebar_position: 3
---
# Extraction of acknowledged concepts
***Short description:*** Scans the plaintexts of publications for acknowledged concepts, including grant identifiers (projects) of funders, accession numbers of bioentities, EPO patent mentions, as well as custom concepts that can link research objects to specific research communities and initiatives in OpenAIRE.
***Algorithmic details:***
The algorithm processes the publication's fulltext and extracts references to acknowledged concepts. It applies pattern matching and string join between the fulltext and a target database which contains the title, the acronym and the identifier of the searched concept.
***Parameters:***
Concept titles, acronyms, and identifiers, publication's identifiers and fulltexts
***Limitations:*** -
***Environment:***
Python, [madIS](https://github.com/madgik/madis), [APSW](https://github.com/rogerbinns/apsw)
***References:***
* Foufoulas, Y., Zacharia, E., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2022). DETEXA: Declarative Extensible Text Exploration and Analysis. In: , et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. [doi:10.1007/978-3-031-16802-4_9](https://doi.org/10.1007/978-3-031-16802-4_9)
***Authority:*** ATHENA RC &bull; ***License:*** CC-BY/CC-0 &bull; ***Code:*** [iis/referenceextraction](https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction)

View File

@ -0,0 +1,107 @@
---
sidebar_position: 1
---
# Affiliation matching
***Short description:*** The goal of the affiliation matching module is to match affiliation strings (identified in full-text PDFs or in scholarly databases, such as Crossref) with persistent organization identifiers (e.g., ROR identifiers).
Depending on the data source, we currently employ two distinct methodologies:
- The [first](#algorithmic-details-of-the-first-method) method revolves around affiliations extracted from PDF and XML documents, which are subsequently matched with organizations within the OpenAIRE database.
- The [second](#algorithmic-details-of-the-second-method) concerns affiliations retrieved from platforms such as Crossref, PubMed, and Datacite, which are matched to organizations of the ROR database.
## Algorithmic details of the first method
*The buckets concept*
In order to get the best possible results, the algorithm should compare every affiliation with every organization. However, this approach would be very inefficient and slow, because it would involve the processing of the cartesian product (all possible pairs) of millions of affiliations and thousands of organizations. To avoid this, IIS has introduced the concept of buckets. A bucket is a smaller group of affiliations and organizations that have been selected to be matched with one another. The matching algorithm compares only these affiliations and organizations that belong to the same bucket.
*Affiliation matching process*
Every affiliation in a given *bucket* is compared with every organization in the same bucket multiple times, each time using a different algorithm (*voter*). Each *voter* is assigned a number (match strength) that describes the estimated correctness of the result of its comparison. All affiliation-organization pairs that have been matched by at least one *voter* are assigned a match strength > 0 (the actual number depends on the voters; its calculation is described later).
It is very important for the algorithm to group the affiliations and organizations properly, i.e. the ones that have a chance to match should be in the same *bucket*. To guarantee this, the affiliation matching module allows defining different methods of dividing the affiliations and organizations into *buckets*, and using all of these methods in a single matching process. The specific method of grouping the affiliations and organizations into *buckets* and then joining them into pairs is carried out by a service called *joiner*.
Every *joiner* can be linked with many different *voters* that decide whether the joined affiliation-organization pairs match or not. By providing new *joiners* and *voters* one can extend the matching algorithm with countless new methods for matching affiliations with organizations, thus adjusting the algorithm to their needs.
All the affiliations and organizations are processed sequentially by all the *matchers*. In every *matcher* they are grouped into pairs by some *joiner*, and then these pairs are processed by all the *voters* in the *matcher*. Every affiliation-organization pair that has been matched at least once is assigned a match strength that depends on the match strengths of the *voters* that indicated that the given pair is a match.
**NOTE:** There can be many organizations matched with a given affiliation, each of them matched with a different match strength. The user of the module can set a match strength threshold which will limit the results to only those matches that have a match strength greater than the specified threshold.
*Calculation of the match strength of the affiliation-organization pair matched by multiple matchers*
It often happens that the given affiliation-organization pair is returned as a match by more than one matcher, each time with a different match strength. In such a case **the match with the highest match strength will be selected**.
*Calculation of the match strength of the affiliation-organization pair within a single matcher*
Every voter has a match strength that is in the range (0, 1]. **The voter match strength says what the quotient of correct matches to all matches guessed by this voter is, and is based on real data and hundreds of matches prepared by hand.**
The match strength of the given affiliation-organization pair is based on the match strengths of all the voters in the matcher that have indicated that the pair is a match. It will always be less than or equal to 1 and at least as large as the match strength of any single voter that matched the given pair.
The total match strength is calculated in such a way that each consecutive voter reduces (by its match strength) the gap of uncertainty about the correctness of the given match.
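One way to read the rule above is as a product formula: start with an uncertainty gap of 1 and let every voter that matched the pair shrink the gap by a factor of (1 - its match strength). The sketch below follows that reading. It is an assumption consistent with the description, not code taken from the module, and the function names are hypothetical.

```python
def combine_voter_strengths(strengths):
    """Combine the strengths of the voters that matched a pair in one matcher.

    Assumed formula: total = 1 - prod(1 - s_i).
    """
    gap = 1.0  # remaining uncertainty about the correctness of the match
    for s in strengths:
        if not 0.0 < s <= 1.0:
            raise ValueError("voter match strength must be in (0, 1]")
        gap *= 1.0 - s  # each voter reduces the gap by its match strength
    return 1.0 - gap


def best_across_matchers(per_matcher_strengths):
    """A pair returned by several matchers keeps the highest match strength."""
    return max(per_matcher_strengths)


# Two voters with strengths 0.8 and 0.5 in the same matcher.
strength = combine_voter_strengths([0.8, 0.5])
```

Under this reading, a pair matched by a single voter receives exactly that voter's match strength, and every additional voter can only increase the total, which never exceeds 1.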
***Parameters:***
* input
  * input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. Document metadata is the source of affiliations.
  * input_organizations: [Organization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/Organization.avdl) avro datastore location.
  * input_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/DocumentToProject.avdl) avro datastore location with **imported** document-to-project relations. These relations (along with inferred document-project and project-organization relations) are used to generate document-organization pairs which are used as a hint for matching affiliations.
  * input_inferred_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/referenceextraction/project/DocumentToProject.avdl) avro datastore location with **inferred** document-to-project relations.
  * input_project_to_organization: [ProjectToOrganization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/ProjectToOrganization.avdl) avro datastore location. These relations (along with imported and inferred document-project relations) are used to generate document-organization pairs which are used as a hint for matching affiliations.
* output
  * [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with publications matched to organizations.
***Limitations:*** -
***Environment:***
Java, Spark
***References:*** -
***Authority:*** ICM &bull; ***License:*** AGPL-3.0 &bull; ***Code:*** [CoAnSys/affiliation-organization-matching](https://github.com/CeON/CoAnSys/tree/master/affiliation-organization-matching)
## Algorithmic details of the second method
*Categorization*
The affiliation strings are imported and undergo cleaning, tokenization, and removal of stopwords. Similar to the “buckets concept” of the first method, the goal is to split the affiliation strings, as well as the ROR organizations, into coherent groups. To achieve this, data preprocessing has already been conducted on ROR's data, involving the analysis of word frequency ('keywords') within the legal names of ROR's organizations to define specific categories. These categories include universities and institutes, laboratories, hospitals, companies, museums, governments, foundations, and other organizations. ROR's organizations have subsequently been assigned to these categories based on their legal names. The algorithm employs a similar approach to categorize affiliations into these same groups.
*String Shortening*
The objective is to extract pertinent details from each affiliation string. The algorithm divides the string whenever a comma (,) or semicolon (;) is detected. It then applies specific 'rules' to these segments and retains only those containing relevant keywords. Additionally, it trims down the segments by preserving words in proximity to particular keywords like "university," "institute," "laboratory," or "hospital." As a result, the average string length is reduced from 90 to 35 characters.
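The splitting and filtering step can be sketched as follows. This is a simplified illustration only: the keyword list is limited to the four examples named above, the function name is hypothetical, and the additional trimming of words around each keyword is not reproduced.

```python
import re

# Only the keywords named in the text; the real rule set is larger.
KEYWORDS = ("university", "institute", "laboratory", "hospital")


def shorten_affiliation(affiliation):
    """Split on commas and semicolons, keep segments that contain a keyword."""
    segments = [s.strip() for s in re.split(r"[,;]", affiliation)]
    return [s for s in segments if any(k in s.lower() for k in KEYWORDS)]


shortened = shorten_affiliation(
    "Department of Physics, University of Crete, 70013 Heraklion; Greece"
)
```

For this input, only the segment `University of Crete` is retained.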
*Matching with ROR's Database*
The algorithm checks whether a substring containing a keyword is linked to a legal name or to an alternative name of an organization listed in the ROR database. In order to identify the most accurate match, the algorithm employs cosine similarity. Although alternative methods like Levenshtein Distance or Jaro-Winkler Distance were considered for measuring string similarity, it was concluded that cosine similarity was the most appropriate choice for this specific application.
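As an illustration of the similarity measure, the following sketch computes cosine similarity between two strings represented as word-count vectors. The vectorization (lower-cased, whitespace-separated word counts) is an assumption made for the example; the actual implementation may tokenize or weight terms differently.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


score = cosine_similarity("University of Crete", "University of Athens")
```

A candidate would then be accepted only if its score reaches the relevant similarity threshold.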
*Refinement*
If multiple matches are found above the desired similarity thresholds, the algorithm performs another check. It applies cosine similarity between the organizations found in the ROR database and the original affiliation string. This comparison takes into account additional information present in the original affiliation, such as addresses or city names. The algorithm aims to identify the best fit among the potential matches. Note that the case where two or more different organizations share the same name is also considered.
***Parameters:***
* input
  * source of affiliations: JSON Crossref or XML PubMed or Parquet DataCite files.
  * organizations: [dix_acad.pkl](https://github.com/openaire/affro/blob/main/dictionaries/dix_acad.pkl), [dix_mult](https://github.com/openaire/affro/blob/main/dictionaries/dix_mult.pkl), [dix_city](https://github.com/openaire/affro/blob/main/dictionaries/dix_city.pkl), [dix_country](https://github.com/openaire/affro/blob/main/dictionaries/dix_country.pkl) (four pickled dictionaries with keys legalnames and alternativenames of organizations in the ROR database).
  * similarity thresholds: simU for universities, simG for other organizations (default values are simU = 0.64, simG = 0.87).
* output
  * JSON file with ROR ids of organizations and corresponding similarity scores for each DOI.
***Limitations:*** -
***Environment:***
Python
***References:*** -
***Authority:*** OpenAIRE &bull; ***License:*** AGPL-3.0 &bull; ***Code:*** [AffRo](https://github.com/openaire/affro)

View File

@ -0,0 +1,41 @@
# Citation matching
***Short description:*** During a citation matching task, bibliographic entries are linked to the documents that they reference. The citation matching module, one of the modules of the Information Inference Service (IIS), receives as input a list of documents accompanied by their metadata and bibliography. It discovers the links described above among these documents and returns them as a list. The implemented algorithm is described in detail in [arXiv:1303.6906](https://arxiv.org/abs/1303.6906)[1].
***Algorithmic details:***
*General description*
The algorithm used in the citation matching task consists of two phases. In the first one, for each citation string a set of potentially matching documents is retrieved using a heuristic. In the second one, the metadata of these documents is analysed in order to assess which of them is the most similar to the given citation. We assume that citations are parsed, i.e. fragments containing meaningful pieces of metadata information are marked in a special way. Note that in the IIS system, the citation parsing step is executed by another module. The following metadata fields are used by the described solution:
* an author,
* a title,
* a journal name,
* pages,
* a year of publication.
*Heuristic matching*
The heuristic is based on indexing document metadata by author names. For each citation we extract author names and try to find documents in the index which have the same author entries. As spelling errors and inaccuracies commonly occur in citations, we have implemented an approximate index which enables retrieval of entries with an edit distance less than or equal to 1.
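The idea of an approximate author index can be sketched as below. This is a naive illustration that scans all indexed names; a production index would use a more efficient lookup structure. All function names are hypothetical.

```python
from collections import defaultdict


def within_one_edit(a, b):
    """True if a and b differ by at most one insertion, deletion or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equally long) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution
    return a[i:] == b[i + 1:]  # one extra character in b


def build_author_index(documents):
    """Map each lower-cased author name to the ids of documents listing it."""
    index = defaultdict(set)
    for doc_id, authors in documents.items():
        for author in authors:
            index[author.lower()].add(doc_id)
    return index


def candidate_documents(index, citation_authors):
    """Return ids of documents with an author within edit distance 1."""
    found = set()
    for cited in citation_authors:
        cited = cited.lower()
        for name, doc_ids in index.items():
            if within_one_edit(cited, name):
                found |= doc_ids
    return found


index = build_author_index({"d1": ["Fedoryszak", "Tkaczyk"], "d2": ["Bolikowski"]})
hits = candidate_documents(index, ["Fedoryszek"])  # misspelled author name
```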
*Strict matching*
In this step, all the potentially matching pairs obtained in the heuristic step are evaluated and only the most probable ones are returned as the final result. As citations tend to contain spelling errors and differ in style, there is a need to introduce fuzzy similarity measures fitted to the specifics of various metadata fields. Most of them compute a fraction of tokens or trigrams that occur in both fields being compared. When comparing journal names, we take the longest common subsequence (LCS) of the two strings into consideration. This can be seen as an instance of the assignment problem with some refinements added. The overall similarity of two citation strings is obtained by applying a linear Support Vector Machine (SVM) using field similarities as features.
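The kinds of field similarities mentioned above can be sketched as follows. The exact definitions are assumptions made for illustration: the trigram measure is computed here as a Jaccard-style fraction, the LCS is normalized by the longer string, and the weights passed to `linear_score` are hypothetical placeholders for the parameters of a trained linear SVM.

```python
def trigrams(text):
    """Set of character trigrams of a lower-cased string."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def trigram_similarity(a, b):
    """Fraction of trigrams shared by both strings (Jaccard-style)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_similarity(a, b):
    """LCS length normalized by the length of the longer string."""
    longest = max(len(a), len(b))
    return lcs_length(a.lower(), b.lower()) / longest if longest else 0.0


def linear_score(features, weights, bias):
    """Decision value of a linear model; weights and bias are hypothetical."""
    return sum(w * x for w, x in zip(weights, features)) + bias


features = [
    trigram_similarity("Large Scale Citation Matching", "Large-scale citation matching"),
    lcs_similarity("Lect. Notes Comput. Sci.", "Lecture Notes in Computer Science"),
]
```

A pair would be accepted when the decision value of the trained model is positive; the feature vector above only shows how such inputs might be assembled.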
***Parameters:***
* input:
  * input_metadata: [ExtractedDocumentMetadataMergedWithOriginal](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/transformers/metadatamerger/ExtractedDocumentMetadataMergedWithOriginal.avdl) avro datastore location with the metadata of both publications and bibliographic references to be matched
  * input_matched_citations: [Citation](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/common/citations/Citation.avdl) avro datastore location with citations which were already matched and should be excluded from fuzzy matching
* output: [Citation](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/common/citations/Citation.avdl) avro datastore location with matched publications
***Limitations:*** -
***Environment:***
Java, Spark
***References:***
* Fedoryszak, M., Tkaczyk, D., Bolikowski, Ł. (2013). Large Scale Citation Matching Using Apache Hadoop. In: Aalberg, T., Papatheodorou, C., Dobreva, M., Tsakonas, G., Farrugia, C.J. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2013. Lecture Notes in Computer Science, vol 8092. Springer, Berlin, Heidelberg. [https://doi.org/10.1007/978-3-642-40501-3_37](https://doi.org/10.1007/978-3-642-40501-3_37)
***Authority:*** ICM &bull; ***License:*** AGPL-3.0 &bull; ***Code:*** [CoAnSys/citation-matching](https://github.com/CeON/CoAnSys/tree/master/citation-matching)

View File

@ -0,0 +1,23 @@
---
sidebar_position: 4
---
# Extraction of cited concepts
***Short description:*** Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs.
***Algorithmic details:***
The algorithm extracts citations to specific datasets and software. It extracts the citation section of a publication's fulltext and applies string matching against a target database which includes an inverted index with dataset/software titles, URLs and other metadata.
***Parameters:***
Title, URL, creator names, publisher names and publication year for each concept are used to create the target database. The publication's identifier and fulltext are used to extract the cited concepts.
***Limitations:*** -
***Environment:***
Python, [madIS](https://github.com/madgik/madis), [APSW](https://github.com/rogerbinns/apsw)
***References:***
* Foufoulas Y., Stamatogiannakis L., Dimitropoulos H., Ioannidis Y. (2017) “High-Pass Text Filtering for Citation Matching”. In: Kamps J., Tsakonas G., Manolopoulos Y., Iliadis L., Karydis I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham. [doi:10.1007/978-3-319-67008-9_28](https://doi.org/10.1007/978-3-319-67008-9_28)
***Authority:*** ATHENA RC &bull; ***License:*** CC-BY/CC-0 &bull; ***Code:*** [iis/referenceextraction](https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction)

View File

@ -0,0 +1,22 @@
---
sidebar_position: 5
---
# Classifiers
***Short description:*** A document classification algorithm that employs analysis of free text stemming from the abstracts of the publications. The purpose of applying a document classification module is to assign a scientific text to one or more predefined content classes.
***Algorithmic details:***
The algorithm classifies publications' fulltexts using a Bayesian classifier with terms weighted during an offline training phase. The training has been done using the following taxonomies: arXiv, MeSH (Medical Subject Headings), ACM, and DDC (Dewey Decimal Classification, or Dewey Decimal System).
***Parameters:*** Publication's identifier and fulltext
***Limitations:*** -
***Environment:***
Python, [madIS](https://github.com/madgik/madis), [APSW](https://github.com/rogerbinns/apsw)
***References:***
* Giannakopoulos, T., Stamatogiannakis, E., Foufoulas, I., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2014). Content Visualization of Scientific Corpora Using an Extensible Relational Database Implementation. In: Bolikowski, Ł., Casarosa, V., Goodale, P., Houssos, N., Manghi, P., Schirrwagen, J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham. [doi:10.1007/978-3-319-08425-1_10](https://doi.org/10.1007/978-3-319-08425-1_10)
***Authority:*** ATHENA RC &bull; ***License:*** CC-BY/CC-0 &bull; ***Code:*** [iis/referenceextraction](https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction)

View File

@ -0,0 +1,48 @@
# Documents similarity
***Short description:*** The document similarity module is responsible for finding similar documents among the ones available in the OpenAIRE Information Space. It produces "similarity" links between the documents stored in the OpenAIRE Information Space. Each link is assigned a similarity score in the [0, 1] range; it is expected that the higher the score, the more similar the documents are with respect to their content.
***Algorithmic details:***
The similarity between two documents is expressed as the similarity between weights of their common terms (i.e., words being reduced to their root form) within a context of all terms from the first and the second document. In this approach, the computation can be divided into three consecutive steps:
1. selection of proper terms,
2. calculation of weights of terms for each document,
3. calculation of a given similarity function on weights of terms corresponding to each pair of documents.
 
The document similarity module uses the term frequency inverse-document frequency (TFIDF) measure to produce weights for terms and the cosine similarity to calculate the similarity of documents.
*Steps of execution*
Computation of similarity between documents is executed in the following steps.
1. First, we create a text representation of each document. The text is a concatenation of 3 attributes of document object coming from Information Space: title, abstract, and keywords.
2. The text representation of each document is split into words. Next, stop words, words which occur in more than N percent of documents (say 99%), and words occurring in fewer than M documents (say 5) are discarded, as we assume that they carry no important information.
3. Next, the words are stemmed (reduced to their root form) and thus converted to terms. The importance of each term in each document is calculated using the TFIDF measure (resulting in a vector of term weights for each document). Only the top P (say 20) most important terms per document are kept for further computations.
4. In order to calculate the cosine similarity value for the documents, we execute the following steps.
a. Triples [document id, term, term weight] are grouped by a common term and, for each pair of triples from the group, the term importance is recalculated as the product of the two term weights, producing quads [document id 1, document id 2, term, multiplied term weight].
b. Quads are grouped by [document id 1, document id 2] and the values of the multiplied term weight are summed up, resulting in the creation of triples [document id 1, document id 2, total common weight].
c. Finally, the triples are normalized by dividing the total common weight by the product of the norms of the two documents' term-weight vectors. The normalized value is the final similarity measure, with a value between 0 and 1.
5. For a given document, only the top R (say 20) links to similar documents are returned. The links that are thrown away are assumed to be uninteresting for the end-user and thus storing them would only needlessly take disk space.
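Steps 4a to 4c above can be sketched in a few lines, assuming the TFIDF weights have already been computed. The input layout (a dictionary of term weights per document) and the function name are assumptions made for illustration; the actual module runs as Pig actions.

```python
import math
from collections import defaultdict
from itertools import combinations


def pairwise_cosine(weights):
    """weights: {document id: {term: TFIDF weight}} -> {(id1, id2): similarity}."""
    # 4a: group [document id, term, weight] triples by term.
    by_term = defaultdict(list)
    for doc_id, terms in weights.items():
        for term, w in terms.items():
            by_term[term].append((doc_id, w))

    # 4a/4b: multiply weights of each document pair sharing a term, then sum.
    totals = defaultdict(float)
    for postings in by_term.values():
        for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
            totals[(d1, d2)] += w1 * w2

    # 4c: normalize by the product of the norms of both weight vectors.
    norms = {
        d: math.sqrt(sum(w * w for w in terms.values()))
        for d, terms in weights.items()
    }
    return {
        (d1, d2): total / (norms[d1] * norms[d2])
        for (d1, d2), total in totals.items()
        if norms[d1] and norms[d2]
    }


similarities = pairwise_cosine(
    {"a": {"x": 1.0, "y": 1.0}, "b": {"x": 1.0}, "c": {"z": 2.0}}
)
```

Documents that share no term never form a pair, so no similarity is computed for them; this is what keeps the grouping approach cheaper than comparing every document with every other one.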
***Parameters:***
* input:
* input_document: [DocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/documentssimilarity/DocumentMetadata.avdl) avro datastore location
* parallel: sets parameter parallel for Pig actions (default=80)
* mapredChildJavaOpts: MapReduce map and reduce child Java options applied to all Pig actions (default=Xmx12g)
* tfidfTopnTermPerDocument: number of the most important terms taken into account (default=20)
* similarityTopnDocumentPerDocument: maximum number of similar documents for each publication (default=20)
* removal_rate: removal rate (default=0.99)
* removal_least_used: removal of the least used terms (default=20)
* threshold_num_of_vector_elems_length: vector elements length threshold, when set to less than 2 all documents will be included in similarity matching (default=2)
* output: [DocumentSimilarity](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/documentssimilarity/DocumentSimilarity.avdl) avro datastore location
***Limitations:*** -
***Environment:***
Pig, Java
***References:***
* P. J. Dendek, A. Czeczko, M. Fedoryszak, A. Kawa, and L. Bolikowski, "Content Analysis of Scientific Articles in Apache Hadoop Ecosystem", Stud. Comp.Intelligence, vol. 541, 2014.
***Authority:*** ICM &bull; ***License:*** AGPL-3.0 &bull; ***Code:*** [CoAnSys/document-similarity](https://github.com/CeON/CoAnSys/tree/master/document-similarity)

View File

@ -0,0 +1,18 @@
import DocCardList from '@theme/DocCardList';
# Enrichment by mining
**OpenAIRE** collects the full-texts of publications in order to apply TDM (Text and Data Mining) algorithms on them and enrich the Graph with inference links.
The collection of the full-texts is handled by the internal **PDF Aggregation Service**. This service uses the publications' URLs from the OpenAIRE Graph, together with state-of-the-art algorithms, to crawl the web and try to locate and download the full-texts of the open access publications, while focusing on the most recent ones. It respects the servers of the repositories and publishers and avoids overloading them.
The service orchestrates a distributed execution system on the cloud, with multiple microservices running in parallel, in order to efficiently process and download a large number of publications. The microservices store the generated report records for the publications in a database, and the full-texts in an S3 Object Store.
On the publication-page level, it applies text-mining algorithms to analyze the structure of the page, extract the full-text URL and download the file. Additionally, it tracks various performance indicators to optimize the crawling speed during execution.
The PDF Aggregation Service is also capable of bulk-importing full-texts from compatible data sources, which increases the collection speed of full-texts.
The different Text and Data Mining (TDM) algorithms used in the graph-enrichment process are grouped in the following categories.
<DocCardList></DocCardList>

Some files were not shown because too many files have changed in this diff