Compare commits

...

176 Commits

Author SHA1 Message Date
Serafeim Chatzopoulos 053f708fe8 Merge pull request 'main' (#1) from D-Net/openaire-graph-docs:main into main
Reviewed-on: #1
2023-01-18 14:26:20 +01:00
Serafeim Chatzopoulos b2a0130ce5 Merge pull request 'Link to the DataCite metadata kernel' (#42) from relation_datacite into main
Reviewed-on: D-Net/openaire-graph-docs#42
2023-01-13 14:16:11 +01:00
Claudio Atzori e34d565882 adding link to the DataCite metadata kernel 2023-01-13 14:14:07 +01:00
Miriam Baglioni 8c2e0e0022 fixed issues on relationship table 2023-01-10 12:34:00 +01:00
Serafeim Chatzopoulos 87ef2724da Add helpdesk in sidebar 2023-01-09 20:05:03 +02:00
Serafeim Chatzopoulos 22e90827e2 Update links on Zenodo for dumps 2023-01-05 18:07:32 +02:00
Serafeim Chatzopoulos 148564e098 Update citation of OpenAIRE Research Graph 2023-01-05 17:58:50 +02:00
Serafeim Chatzopoulos 5921b13dc7 Update 'docs/downloads/full-graph.md' 2022-12-30 22:07:17 +01:00
Serafeim Chatzopoulos 20d9cea33b Update 'docs/changelog.md' 2022-12-30 22:00:07 +01:00
Miriam Baglioni e5574b8490 Merge pull request 'Add versioning section & changelog' (#10) from changelog into main
Reviewed-on: D-Net/openaire-graph-docs#10
2022-12-30 16:35:01 +01:00
Miriam Baglioni 7035ad6878 remove the set of the added relationships. 2022-12-30 16:34:07 +01:00
Miriam Baglioni e234c3630a Merge pull request 'Add beginner's kit text' (#38) from beginners-kit into main
Reviewed-on: D-Net/openaire-graph-docs#38
2022-12-30 16:30:45 +01:00
Miriam Baglioni cb3509ba38 added link to the beginner's kit uploaded on Zenodo 2022-12-30 16:27:50 +01:00
Serafeim Chatzopoulos 5dd5cd836d Change architecture diagram 2022-12-30 17:19:03 +02:00
Serafeim Chatzopoulos 8d78ebc5db Add Beginner's kit in changelog 2022-12-30 16:50:48 +02:00
Serafeim Chatzopoulos 8aa4183dd9 Merge branch 'main' of https://code-repo.d4science.org/D-Net/openaire-graph-docs into changelog 2022-12-30 16:42:52 +02:00
Serafeim Chatzopoulos f15912051f Add beginner's kit text 2022-12-30 16:40:16 +02:00
Serafeim Chatzopoulos b943be8ee3 Fix links from impact measures page to specific properties/objects in the result 2022-12-27 21:22:30 +02:00
Serafeim Chatzopoulos 4c23bb429b Merge pull request 'graph-data-model-revision' (#37) from graph-data-model-revision into main
Reviewed-on: D-Net/openaire-graph-docs#37
2022-12-27 19:24:10 +01:00
Serafeim Chatzopoulos ccf3ea1529 Add info for impact indicators 2022-12-27 19:34:46 +02:00
Miriam Baglioni 709b5f49bd updated changelog 2022-12-27 14:54:28 +01:00
Miriam Baglioni 5f75cd4011 merging with main 2022-12-27 14:47:12 +01:00
Miriam Baglioni f170f72d8d indentation for json 2022-12-27 12:57:36 +01:00
Miriam Baglioni 7bf48ea976 added new relationships 2022-12-27 12:51:52 +01:00
Miriam Baglioni 248e758a94 merge with main 2022-12-27 11:59:15 +01:00
Miriam Baglioni 489bfef146 added the serialization of Indicators at the level of the result. Removed the serialization of measures at the level of the instance 2022-12-27 11:55:54 +01:00
Serafeim Chatzopoulos 0e7b14c0af Merge pull request 'Restructuring data provision section' (#34) from restructure_data_provision into main
Reviewed-on: D-Net/openaire-graph-docs#34
2022-12-23 12:32:17 +01:00
Claudio Atzori 29731b7be7 added links to the explore, connect, provide portals. Further adoption of the OpenAIRE Graph shorter wording 2022-12-23 12:13:43 +01:00
Claudio Atzori 070219b095 added synthetic stats page 2022-12-23 12:11:59 +01:00
Claudio Atzori 8e4172c1f7 usage count text from Dimitris 2022-12-22 16:25:25 +01:00
Claudio Atzori 099a500e88 added merge by id description 2022-12-22 16:21:00 +01:00
Thanasis Vergoulis c66de2b9e7 Merge pull request 'Adds a searchbox in the navbar' (#35) from enable_search into main
Reviewed-on: D-Net/openaire-graph-docs#35
2022-12-22 09:51:44 +01:00
Serafeim Chatzopoulos 078ec28a6a Set docsRouteBasePath for search plugin 2022-12-21 22:53:44 +02:00
Serafeim Chatzopoulos 61d62ddab3 Install plugin 2022-12-21 22:12:26 +02:00
Serafeim Chatzopoulos 6e56aa1a4d Add text to compatible sources - aggregation 2022-12-21 21:44:50 +02:00
Serafeim Chatzopoulos 8e9295947c Rename back to OpenAIRE Research Graph 2022-12-21 20:52:33 +02:00
Serafeim Chatzopoulos b9bdda24b7 Merge pull request 'Fixes images path when using a BASE_URL = "/docs"' (#33) from fix-image-links into main
Reviewed-on: D-Net/openaire-graph-docs#33
2022-12-21 18:15:19 +01:00
Serafeim Chatzopoulos 79e3a5b563 Merge with main 2022-12-21 19:13:15 +02:00
Serafeim Chatzopoulos fdc331641d Merge pull request 'Restructure data provision section' (#32) from restructure_data_provision into main
Reviewed-on: D-Net/openaire-graph-docs#32
2022-12-21 17:56:43 +01:00
Serafeim Chatzopoulos 69ff846180 Move text from finalisation to cleaning; minor changes in mining; fix typo in sidebar 2022-12-21 17:03:44 +02:00
Serafeim Chatzopoulos f1f011210c Rename folder deduction-and-propagation 2022-12-21 14:40:34 +02:00
Serafeim Chatzopoulos 387fd97e24 Remove FAQ 2022-12-21 14:40:05 +02:00
Serafeim Chatzopoulos 53b955a373 Add usage counts text 2022-12-21 14:39:41 +02:00
Serafeim Chatzopoulos 484d6cb82b Restructure data provision section 2022-12-20 17:55:04 +02:00
Serafeim Chatzopoulos 1506ce928a Update 'release.properties' 2022-12-20 15:16:23 +01:00
Serafeim Chatzopoulos 4e3806e05e Change footer 2022-12-20 15:20:50 +02:00
Serafeim Chatzopoulos db8bdc4a08 Fix broken links 2022-12-20 14:05:55 +02:00
Serafeim Chatzopoulos e3126ec32d Merge pull request 'Add support for ENV variables' (#27) from parameter_config_with_env into main
Reviewed-on: D-Net/openaire-graph-docs#27
2022-12-16 17:56:07 +01:00
Serafeim Chatzopoulos 0b57188a58 Merge main into branch 2022-12-16 18:55:55 +02:00
Serafeim Chatzopoulos 6686a7ec50 Merge pull request 'Add LOD dump in other related datasets section' (#29) from update_related_datasets into main
Reviewed-on: D-Net/openaire-graph-docs#29
2022-12-16 07:12:41 +01:00
Serafeim Chatzopoulos 69a2a92909 Merge pull request 'Add new badges for ack' (#30) from update_badges into main
Reviewed-on: D-Net/openaire-graph-docs#30
2022-12-16 07:12:31 +01:00
Serafeim Chatzopoulos f8fde1dba8 Merge pull request 'Disable color theme switch' (#31) from disable_color_theme_switch into main
Reviewed-on: D-Net/openaire-graph-docs#31
2022-12-16 07:12:19 +01:00
Serafeim Chatzopoulos 440e8c5b9c Disable color theme switch & remove code filtering sidebar items 2022-12-15 20:04:15 +02:00
Serafeim Chatzopoulos c1cf65e2d3 Add extra padding in badges 2022-12-15 17:13:08 +02:00
Serafeim Chatzopoulos 6281938c81 Add new badges for ack 2022-12-15 17:01:15 +02:00
Serafeim Chatzopoulos 2839958e38 Add LOD dump in other related datasets section 2022-12-15 14:31:41 +02:00
Alessia Bardi 159f50c9ef simplified some sentences 2022-12-14 19:01:35 +01:00
Claudio Atzori e3fb581270 Update 'README.md' 2022-12-14 14:23:22 +01:00
Serafeim Chatzopoulos 24af35739e Merge pull request 'Release information' (#26) from release_properties into main
Reviewed-on: D-Net/openaire-graph-docs#26
2022-12-13 12:56:56 +01:00
Serafeim Chatzopoulos 17bd13446b Add support for ENV variables 2022-12-12 09:35:53 +02:00
Serafeim Chatzopoulos caa6f7d196 Merge pull request '[Bulk Download] first versione of the documentation' (#19) from bulk_downloads into main
Reviewed-on: D-Net/openaire-graph-docs#19
2022-12-08 19:30:26 +01:00
Serafeim Chatzopoulos 83f28816b8 Fix link to downloads 2022-12-08 20:26:24 +02:00
Serafeim Chatzopoulos 67d2e38f6d Merge branch 'main' of https://code-repo.d4science.org/D-Net/openaire-graph-docs into bulk_downloads 2022-12-08 20:18:08 +02:00
Serafeim Chatzopoulos 750d57a110 Format the publications in the same way 2022-12-08 20:12:29 +02:00
Serafeim Chatzopoulos b4cd25b8db Add how to cite & badge in download page 2022-12-08 19:40:12 +02:00
Serafeim Chatzopoulos 6c283bde25 Minor fix on alternative sub-graph data model description 2022-12-06 18:52:35 +02:00
Serafeim Chatzopoulos bee82cbd4c Re-arrange downloads section 2022-12-06 18:43:54 +02:00
Claudio Atzori ede1bd98ea added release.properties file 2022-12-06 15:46:40 +01:00
Miriam Baglioni 47394afd5e [CommunityModel] added comments on subgraphs 2022-12-05 15:39:25 +01:00
Miriam Baglioni a61a407c14 [CommunityModel] first version of the community model 2022-12-05 12:45:57 +01:00
Serafeim Chatzopoulos 029429fcc5 Merge pull request 'Smaller images' (#24) from reduced_image_file_size into main
Reviewed-on: D-Net/openaire-graph-docs#24
2022-12-02 17:21:06 +01:00
Miriam Baglioni 48895caa3c merging with main 2022-12-02 16:28:42 +01:00
Claudio Atzori 965785e183 reduced image sizes for a lower build footprint 2022-12-02 15:56:52 +01:00
Serafeim Chatzopoulos ab4f9afe31 Merge pull request 'fix_issues_raised_in_PR_7' (#23) from fix_issues_raised_in_PR_7 into main
Reviewed-on: D-Net/openaire-graph-docs#23
2022-12-02 12:58:06 +01:00
Serafeim Chatzopoulos ac7554cb8a Minor rephrasing 2022-12-02 13:57:16 +02:00
Serafeim Chatzopoulos 1e2e95cc08 Merge branch 'main' into 'fix_issues_raised_in_PR_7' 2022-12-02 13:39:13 +02:00
Serafeim Chatzopoulos 9c45c0533e Merge pull request 'Add sitemap.xml generation during build' (#21) from Add_sitemap_generation into main
Reviewed-on: D-Net/openaire-graph-docs#21
2022-12-02 12:27:52 +01:00
Serafeim Chatzopoulos 0c0352048f Merge pull request 'Attempt to match the look and feel of graph.openaire.eu' (#22) from styling into main
Reviewed-on: D-Net/openaire-graph-docs#22
2022-12-02 12:27:39 +01:00
Serafeim Chatzopoulos 2cd5c4d686 Remove stats page for now 2022-12-02 12:37:54 +02:00
Serafeim Chatzopoulos 7a22db2ad1 Merge branch 'main' into styling 2022-12-02 11:29:58 +01:00
Serafeim Chatzopoulos 10c6330c1b Merge branch 'main' into Add_sitemap_generation 2022-12-02 11:29:47 +01:00
Serafeim Chatzopoulos eb2364b8f4 Add missing enrichment files 2022-12-02 12:26:21 +02:00
Serafeim Chatzopoulos 01e4744550 Merge pull request 'enrichment' (#20) from enrichment into main
Reviewed-on: D-Net/openaire-graph-docs#20
2022-12-01 13:41:25 +01:00
Serafeim Chatzopoulos 7c12a37f11 Split the enrichment section in sub-pages 2022-11-30 18:09:31 +02:00
Serafeim Chatzopoulos 74eab8b908 Merge branch 'main' into Add_sitemap_generation 2022-11-30 13:18:31 +01:00
Serafeim Chatzopoulos 9a8c0f6923 Change the update frequency of the sitemap.xml to monthly 2022-11-30 14:15:45 +02:00
Serafeim Chatzopoulos 79c516f21c Change background color 2022-11-29 18:16:52 +02:00
Serafeim Chatzopoulos 7b5d9eae82 Update logo & styling 2022-11-29 16:43:06 +02:00
Serafeim Chatzopoulos df6b49bd8f Remove services page 2022-11-29 15:49:11 +02:00
Serafeim Chatzopoulos a844ac459c Align references in aggregation section with those in relevant pubs 2022-11-29 14:21:52 +02:00
Serafeim Chatzopoulos f4f84a5a31 Fix typos 2022-11-29 14:16:22 +02:00
Serafeim Chatzopoulos 4b63ab0ace Add sitemap.xml generation during build 2022-11-29 13:18:34 +02:00
Serafeim Chatzopoulos 989d9ea34c Split bulk downloads page in sub-pages 2022-11-28 14:19:40 +02:00
Alessia Bardi d96049a3ab ignore intellij project file 2022-11-23 14:33:34 +01:00
Miriam Baglioni b14f89a845 merging with main 2022-11-23 14:01:30 +01:00
Miriam Baglioni 2f3e832d4d [Bulk Download] first versione of the documentation 2022-11-18 17:53:03 +01:00
Miriam Baglioni 6a773cfe1a [Enrichment] first version for propagation finished 2022-11-18 17:15:02 +01:00
Serafeim Chatzopoulos b32ee99cdf Merge pull request 'Initial OpenAIRE Graph license description' (#8) from license into main
Reviewed-on: D-Net/openaire-graph-docs#8
2022-11-17 16:44:45 +01:00
Serafeim Chatzopoulos 964bb10439 Merge pull request 'Format mining algorithms' (#17) from formating_enrichment_section into main
Reviewed-on: D-Net/openaire-graph-docs#17
2022-11-17 14:49:47 +01:00
Serafeim Chatzopoulos af2589274a Merge pull request 'Update to docusaurus v2.2.0 && npm audit fix' (#16) from update_docusaurus into main
Reviewed-on: D-Net/openaire-graph-docs#16
2022-11-17 14:48:57 +01:00
Serafeim Chatzopoulos 96912ea7ec Merge pull request 'Add formating to impact indicators page' (#9) from impact_indicators into main
Reviewed-on: D-Net/openaire-graph-docs#9
2022-11-17 14:24:53 +01:00
Serafeim Chatzopoulos 14c995a362 Format mining algorithms 2022-11-17 15:21:38 +02:00
Serafeim Chatzopoulos 3a7578fe16 Merge pull request 'Introducing the description of mining algorithms developed by ICM' (#15) from enrichment_mining_icm into main
Reviewed-on: D-Net/openaire-graph-docs#15
2022-11-17 13:44:26 +01:00
Serafeim Chatzopoulos 7526743ef6 Merge branch main into enrichment_mining_icm 2022-11-17 14:44:09 +02:00
Serafeim Chatzopoulos 5684d7bff7 Merge pull request 'update mining docs' (#14) from ioannis.foufoulas/openaire-graph-docs:mining_docs into main
Reviewed-on: D-Net/openaire-graph-docs#14
2022-11-17 13:41:49 +01:00
Serafeim Chatzopoulos c77ac867e0 Add changelog for v5 2022-11-17 14:28:09 +02:00
Serafeim Chatzopoulos 36a3bc35f0 Update to docusaurus v2.2.0 && npm audit fix 2022-11-17 13:50:45 +02:00
Marek Horst 32864f74c6 Small structural corrections. 2022-11-16 19:13:38 +01:00
Marek Horst 0e96fae405 Introducing the description of mining algorithms developed by ICM. 2022-11-16 19:04:32 +01:00
Harry Dimitropoulos 8cddb71098 Update 'docs/data-provision/enrichment/classifies.md' 2022-11-16 16:56:42 +01:00
Harry Dimitropoulos e562936a18 Update 'docs/data-provision/enrichment/cites.md' 2022-11-16 16:56:16 +01:00
Harry Dimitropoulos 96c7a6d87c Update 'docs/data-provision/enrichment/acks.md' 2022-11-16 16:56:02 +01:00
Harry Dimitropoulos a48f5a263d Update 'docs/data-provision/enrichment/mining.md' 2022-11-16 16:54:28 +01:00
Harry Dimitropoulos 2d75ea529f Update 'docs/data-provision/enrichment/classified.md' 2022-11-16 16:48:47 +01:00
Harry Dimitropoulos 8f9184146c Update 'docs/data-provision/enrichment/classified.md' 2022-11-16 16:41:42 +01:00
Yannis Foufoulas 8fda5c81cf Update 'docs/data-provision/enrichment/mining.md' 2022-11-16 16:36:42 +01:00
Harry Dimitropoulos e40fee8408 Update 'docs/data-provision/enrichment/acks.md' 2022-11-16 16:36:15 +01:00
Yannis Foufoulas aa35a239f3 Update 'docs/data-provision/enrichment/classified.md' 2022-11-16 16:34:04 +01:00
Harry Dimitropoulos fcedfc1d9d Update 'docs/data-provision/enrichment/acks.md' 2022-11-16 16:31:37 +01:00
Harry Dimitropoulos 1f5856ecf4 Add 'docs/data-provision/enrichment/classified.md' 2022-11-16 16:27:43 +01:00
Yannis Foufoulas 45d3b152dc Update 'docs/data-provision/enrichment/acks.md' 2022-11-16 16:25:05 +01:00
Harry Dimitropoulos 163c5a6bca Update 'docs/data-provision/enrichment/cites.md' 2022-11-16 16:20:45 +01:00
Yannis Foufoulas 44815cc8e1 Update 'docs/data-provision/enrichment/cites.md' 2022-11-16 16:19:14 +01:00
Yannis Foufoulas 0732dd5df6 Update 'docs/data-provision/enrichment/cites.md' 2022-11-16 16:15:51 +01:00
Yannis Foufoulas d5dd2f6d0b Update 'docs/data-provision/enrichment/cites.md' 2022-11-16 16:13:15 +01:00
Harry Dimitropoulos 4458952a2e Update 'docs/data-provision/enrichment/cites.md'
Added Reference and link to High-Pass Text Filtering paper
2022-11-16 15:37:58 +01:00
Harry Dimitropoulos 544808c7cd Update 'docs/data-provision/enrichment/acks.md' 2022-11-16 15:28:43 +01:00
Harry Dimitropoulos 5dec33d26f Update 'docs/data-provision/enrichment/cites.md'
added short description
2022-11-16 15:22:35 +01:00
Harry Dimitropoulos c9228633ec Update 'docs/data-provision/enrichment/acks.md'
Added a brief description
2022-11-16 15:17:49 +01:00
Harry Dimitropoulos ca9a8f75c3 Update 'docs/data-provision/enrichment/acks.md' 2022-11-16 15:07:29 +01:00
Harry Dimitropoulos 6b48a13bc1 Update 'docs/data-provision/enrichment/cites.md' 2022-11-16 14:59:39 +01:00
Harry Dimitropoulos c2dbf0536b Update 'docs/data-provision/enrichment/acks.md' 2022-11-16 14:59:02 +01:00
Harry Dimitropoulos f933f541fe Update 'docs/data-provision/enrichment/mining.md' 2022-11-16 14:58:12 +01:00
Yannis Foufoulas 39d3f47fa0 Update 'docs/data-provision/enrichment/mining.md' 2022-11-16 14:45:41 +01:00
Yannis Foufoulas 5fc5032537 Update 'docs/data-provision/enrichment/mining.md' 2022-11-16 14:44:46 +01:00
Harry Dimitropoulos 002bfdd851 Add 'docs/data-provision/enrichment/cites.md' 2022-11-16 14:18:40 +01:00
Harry Dimitropoulos ad4c4f909e Update 'docs/data-provision/enrichment/acks.md' 2022-11-16 14:16:46 +01:00
Harry Dimitropoulos 1cf79bc30c Add 'docs/data-provision/enrichment/acks.md' 2022-11-16 14:08:56 +01:00
Claudio Atzori 2de2ed1932 fixed section title formatting 2022-11-16 14:07:07 +01:00
Serafeim Chatzopoulos 7d9c7b214c Minor change in impact-scores.md 2022-11-15 16:54:44 +02:00
Harry Dimitropoulos b739759e3a Update 'docs/data-provision/enrichment/mining.md' 2022-11-15 15:53:21 +01:00
Harry Dimitropoulos c5b84be1d3 Update 'docs/data-provision/enrichment/mining.md' 2022-11-15 15:49:57 +01:00
Serafeim Chatzopoulos 90eb3d4380 Merge brnach 'main' into impact_indicators 2022-11-15 16:47:18 +02:00
Serafeim Chatzopoulos f688d64dd5 Merge pull request 'rpl 'OpenAIRE Research Graph 'OpenAIRE Graph'' (#13) from openaire_graph_rename into main
Reviewed-on: D-Net/openaire-graph-docs#13
2022-11-15 15:39:33 +01:00
Claudio Atzori 7b454f70d4 restored the original 'OpenAIRE Research Graph' Zenodo community name 2022-11-15 15:31:12 +01:00
Serafeim Chatzopoulos ce31a6d5c7 Address review comments 2022-11-15 16:29:39 +02:00
Claudio Atzori 681be1e2f8 Replaced 'OpenAIRE Research Graph' with 'OpenAIRE Graph' 2022-11-15 15:26:58 +01:00
Harry Dimitropoulos e6b02ffc32 Update 'docs/data-provision/enrichment/mining.md' 2022-11-15 14:50:45 +01:00
Yannis Foufoulas 47e112420e edit md file 2022-11-15 15:38:34 +02:00
Serafeim Chatzopoulos 76ffd07839 Merge pull request 'expanded indexing section' (#12) from indexing into main
Reviewed-on: D-Net/openaire-graph-docs#12
2022-11-15 12:27:07 +01:00
Serafeim Chatzopoulos 1456f4f045 Update typo in '/docs/data-model/entities/result.md' 2022-11-15 12:25:51 +01:00
Claudio Atzori b0598daa72 expanded indexing section 2022-11-15 11:46:46 +01:00
Claudio Atzori edaffdef8c added link to the entities section 2022-11-15 09:56:27 +01:00
Serafeim Chatzopoulos 673e2579fc Merge pull request 'Deduplication section: decision trees updated and link of images added in comments' (#11) from deduplication into main
Reviewed-on: D-Net/openaire-graph-docs#11
2022-11-14 11:19:02 +01:00
Michele De Bonis 3419c0ee40 decision trees updated and link of images added in comments 2022-11-14 11:13:29 +01:00
Serafeim Chatzopoulos 0db019e51a Add versioning section 2022-11-11 19:15:55 +02:00
Serafeim Chatzopoulos 7717d883ee Add formating to impact indicators page 2022-11-11 18:07:24 +02:00
Marek Horst d5f68e5348 Initial OpenAIRE Graph license description. 2022-11-10 18:55:27 +01:00
Andreas Czerniak ce17228075 contributing APIs wiki page, CAP, DRIS 2022-11-10 12:26:43 +01:00
Andreas Czerniak 849901f231 add redmine page 2022-11-10 12:15:55 +01:00
Miriam Baglioni 1669c7a5fe [Enrichment] first version of documentation for the bulktagging and part of the propagation 2022-11-09 18:03:55 +01:00
Serafeim Chatzopoulos f581623ce0 Merge pull request 'Redirect to OpenPlato' (#6) from learning-center into main
Reviewed-on: D-Net/openaire-graph-docs#6
2022-11-09 12:28:23 +01:00
Serafeim Chatzopoulos 243ec8ced9 Redirect to OpenPlato 2022-11-09 13:27:45 +02:00
Serafeim Chatzopoulos 3f967bed99 Merge pull request 'update of the deduplication section' (#4) from deduplication into main
Reviewed-on: D-Net/openaire-graph-docs#4
2022-11-09 12:12:24 +01:00
Serafeim Chatzopoulos 137eda72c5 Merge pull request 'post cleaning' (#5) from cleaning into main
Reviewed-on: D-Net/openaire-graph-docs#5
2022-11-09 12:11:49 +01:00
Serafeim Chatzopoulos 3d6d2f3523 Minor (link) fixes 2022-11-09 13:11:33 +02:00
Serafeim Chatzopoulos e21949a82c Merge branch 'main' of https://code-repo.d4science.org/D-Net/openaire-graph-docs into cleaning 2022-11-09 13:02:32 +02:00
Serafeim Chatzopoulos 3c489d45ef Merge pull request 'Update 'docs/publications.md' with extra information (DOIs) and new papers' (#3) from paolo.manghi/openaire-graph-docs:main into main
Reviewed-on: D-Net/openaire-graph-docs#3
2022-11-09 12:02:02 +01:00
Serafeim Chatzopoulos 372ee33111 Merge pull request 'aggregation section' (#2) from aggregation into main
Reviewed-on: D-Net/openaire-graph-docs#2
2022-11-09 12:01:12 +01:00
Claudio Atzori 31592f4e84 text describing the bulk cleaning process 2022-11-08 15:40:16 +01:00
Claudio Atzori 83ca3121a1 Merge branch 'main' into deduplication 2022-11-08 10:53:11 +01:00
Claudio Atzori 6cc93eb939 organization dedup 2022-11-08 10:13:02 +01:00
Michele De Bonis 44ee711b38 organization deduplication doc updated 2022-11-07 11:01:55 +01:00
Michele De Bonis 3bc4f9883d beginning of the description of organization deduplication 2022-11-03 17:18:08 +01:00
miconis c9cafec4da deduplication section revised, decision trees for research products added 2022-11-03 13:16:44 +01:00
Paolo Manghi ceb8a070b5 Update 'docs/publications.md' 2022-10-12 12:21:14 +02:00
106 changed files with 3098 additions and 1506 deletions

2
.env Normal file
View File

@ -0,0 +1,2 @@
URL="https://graph.openaire.eu"
BASE_URL="/docs"

3
.gitignore vendored
View File

@ -19,4 +19,5 @@ npm-debug.log*
yarn-debug.log*
yarn-error.log*
.idea/
.idea/
openaire-graph-docs.iml

View File

@ -9,6 +9,11 @@ $ git clone https://code-repo.d4science.org/D-Net/openaire-graph-docs.git
## Local installation and deployment
From https://docusaurus.io/docs/installation#requirements
> Node.js version 16.14 or above (which can be checked by running node -v)
To install the required packages use:
```
$ npm install

Binary image files not shown (sizes range from 30 KiB to 474 KiB).

View File

@ -2,5 +2,34 @@
sidebar_position: 12
---
# Changelog
<span className="todo">TODO</span>
# Versions & changelog
## Versioning
Our versioning policy follows the [Semantic Versioning specification](https://semver.org/).
In our case, given a version `MAJOR.MINOR.PATCH`, we increment the:
* `MAJOR` version when the data model of the Graph changes
* `MINOR` version when the pipeline (e.g., different deduplication method, different implementation for an enrichment process) or major data sources change
* `PATCH` version when the graph data are updated
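The policy above can be sketched as a small helper. This is only an illustrative sketch: the function name `bump_version` and the change labels (`data_model`, `pipeline`, `data_update`) are hypothetical, and resetting the lower components to zero follows the usual Semantic Versioning convention rather than anything stated in this page.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next MAJOR.MINOR.PATCH string for a given kind of change.

    The change labels are hypothetical names for the three cases in the policy.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "data_model":      # the data model of the Graph changes
        return f"{major + 1}.0.0"
    if change == "pipeline":        # the pipeline or major data sources change
        return f"{major}.{minor + 1}.0"
    if change == "data_update":     # the graph data are updated
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")


print(bump_version("5.0.0", "data_update"))
```

Under these assumptions, a plain data refresh of `5.0.0` would yield `5.0.1`, while a data model change would yield `6.0.0`.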
## Changelog
This section will document all notable changes for each graph version.
### v5.0.0
#### Added
- [Impact indicators](/data-model/entities/result#indicators) at the level of the Result
- [Beginner's kit](/downloads/beginners-kit) in the Downloads section
- New relationship types were introduced; see the complete list [here](/data-model/relationships#relationship-types)
#### Changed
- FOS and SDGs were removed from the [result subjects](/data-model/entities/result#subjects)
- Measures were removed from the [result instance](/data-model/entities/result#instance)

View File

@ -1,22 +1,22 @@
# Data model
The OpenAIRE Research Graph comprises several types of entities and [relationships](./relationships) among them.
The OpenAIRE Research Graph comprises several types of [entities](../category/entities) and [relationships](./relationships) among them.
The latest version of the JSON schema can be found on [Bulk downloads](../download).
The latest version of the JSON schema can be found in the [Downloads](../downloads/full-graph) section.
<p align="center">
<img loading="lazy" alt="Data model" src="/img/docs/data-model.png" width="80%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
<img loading="lazy" alt="Data model" src={require('../assets/img/data-model.png').default} width="80%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
The figure above presents the graph's data model.
Its main entities are described in brief below:
* [Results](entities/result) represent the outcomes of research activities.
* [Data Sources](entities/data-source) are the resources used to collect metadata for the graph objects
* [Results](entities/result) represent the outcomes (or products) of research activities.
* [Data Sources](entities/data-source) are the sources from which the metadata of graph objects are collected.
* [Organizations](entities/organization) correspond to companies or research institutions that are involved in projects,
are responsible for operating data sources, or are the affiliations of Product creators.
* [Projects](entities/project) are research projects funded by a Funding Stream of a Funder.
* [Communities](entities/community) are groups of people with a common research intent.
* [Projects](entities/project) are research project grants funded by a Funding Stream of a Funder.
* [Communities](entities/community) are groups of people with a common research intent (e.g. research infrastructures, university alliances).
:::note Further reading

View File

@ -37,7 +37,7 @@ _Type: String &bull; Cardinality: ONE_
Description of the research community/research infrastructure
```json
"description": "This portal provides access to publications, research data, projects and software that may be relevant to the Corona Virus Disease (COVID-19). The OpenAIRE COVID-19 Gateway aggregates COVID-19 related records, links them and provides a single access point for discovery and navigation. We tag content from the OpenAIRE Research Graph (10,000+ data sources) and additional sources. All COVID-19 related research results are linked to people, organizations and projects, providing a contextualized navigation."
"description": "This portal provides access to publications, research data, projects and software that may be relevant to the Corona Virus Disease (COVID-19). The OpenAIRE COVID-19 Gateway aggregates COVID-19 related records, links them and provides a single access point for discovery and navigation. We tag content from the OpenAIRE Graph (10,000+ data sources) and additional sources. All COVID-19 related research results are linked to people, organizations and projects, providing a contextualized navigation."
```
### name

View File

@ -542,21 +542,6 @@ The license URL.
"license": "http://creativecommons.org/licenses/by-nc/4.0"
```
### measures
_Type: [Measure](#measure) &bull; Cardinality: MANY_
The measures computed for this instance (e.g. those provided by [BIP! Finder](https://bip.imsi.athenarc.gr/)).
```json
"measures": [
{
"key": "influence",
"value": "6.45335454246e-09"
},
...
]
```
### pid
_Type: [ResultPid](#resultpid) &bull; Cardinality: MANY_
@ -619,6 +604,55 @@ URLs to the instance. They may link to the actual full-text or to the landing pa
]
```
## Indicator
These are indicators computed for a specific OpenAIRE result.
Each Indicator object is composed of the following properties:
### impactMeasures
_Type: [ImpactMeasures](#impactmeasures-1) &bull; Cardinality: ONE_
These impact-based indicators, provided by [BIP!](https://bip.imsi.athenarc.gr/), estimate the impact of a result.
For details about their calculation, please refer to [this page](/data-provision/indicators-ingestion/impact-scores).
```json
"impactMeasures": {
"influence": {
"score": "123",
"class": "C2"
},
"influence_alt" : {
"score": "456",
"class": "C3"
},
"popularity": {
"score": "234",
"class": "C1"
},
"popularity_alt": {
"score": "345",
"class": "C5"
},
"impulse": {
"score": "987",
"class": "C3"
}
}
```
### usageCounts
_Type: [UsageCounts](#usagecounts-1) &bull; Cardinality: ONE_
These measures, computed by the [UsageCounts Service](https://usagecounts.openaire.eu/), are based on usage statistics.
```json
"usageCounts":{
"downloads": "10",
"views": "20"
}
```
## Language
Represents information for the language of the result
@ -640,26 +674,76 @@ Language label in English.
"label": "English"
```
## Measure
A measure computed for this instance (e.g. those provided by [BIP! Finder](https://bip.imsi.athenarc.gr/))
## ImpactMeasures
### key
_Type: String &bull; Cardinality: ONE_
The different impact-based indicators as computed by [BIP!](https://bip.imsi.athenarc.gr/).
The specified measure. Currently supported one of: `{ influence, influence_alt, popularity, popularity_alt, impulse, cc }` (see [the dedicated page](../../data-provision/enrichment/impact-scores) for more details).
### influence
_Type: [Score](#score) &bull; Cardinality: ONE_
This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
For more details, please refer to [this section](/data-provision/indicators-ingestion/impact-scores#pagerank-pr).
```json
"key": "influence"
"influence": {
"score": "123",
"class": "C2"
}
```
### value
_Type: String &bull; Cardinality: ONE_
### influence_alt
_Type: [Score](#score) &bull; Cardinality: ONE_
This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
For more details, please refer to [this section](/data-provision/indicators-ingestion/impact-scores#citation-count-cc).
```json
"value": "6.45335454246e-09"
"influence_alt" :{
"score": "456",
"class": "C3"
}
```
The value for that measure.
### popularity
_Type: [Score](#score) &bull; Cardinality: ONE_
This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
For more details, please refer to [this section](/data-provision/indicators-ingestion/impact-scores#attrank).
```json
"popularity":{
"score": "234",
"class": "C1"
}
```
### popularity_alt
_Type: [Score](#score) &bull; Cardinality: ONE_
This is an alternative to the "Popularity" indicator, which also reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
For more details, please refer to [this section](/data-provision/indicators-ingestion/impact-scores#ram).
```json
"popularity_alt":{
"score": "345",
"class": "C5"
}
```
### impulse
_Type: [Score](#score) &bull; Cardinality: ONE_
This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
For more details, please refer to [this section](/data-provision/indicators-ingestion/impact-scores#incubation-citation-count-icc).
```json
"impulse":{
"score": "987",
"class": "C3"
}
```
## OrganizationPid
@ -743,6 +827,33 @@ The value expressed in the scheme (i.e. 10.1000/182).
"value": "10.21511/bbs.13(3).2018.13"
```
## Score
The score object for each impact measure calculated by [BIP!](https://bip.imsi.athenarc.gr/).
### score
_Type: String &bull; Cardinality: ONE_
The actual indicator score.
```json
"score": "1234"
```
### class
_Type: String &bull; Cardinality: ONE_
The impact class assigned based on the indicator score.
To facilitate comprehension, BIP! also offers impact classes for articles, to group together those that have similar impact. The following 5 classes are provided:
* `C1`: Top 0.01%
* `C2`: Top 0.1%
* `C3`: Top 1%
* `C4`: Top 10%
* `C5`: Bottom 90%
```json
"class": "C2"
```
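The class boundaries listed above can be read as cut-offs on an article's rank. The following is a minimal sketch, not BIP!'s actual procedure: it assumes articles are ranked by score (rank 1 is the highest) and that boundaries are inclusive. The function name `impact_class` is hypothetical.

```python
def impact_class(rank: int, total: int) -> str:
    """Return the impact class of the article ranked `rank` among `total` articles.

    Assumption: rank 1 is the highest score; tie handling is not covered here.
    """
    if total <= 0 or not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    # (denominator, class): rank * denominator <= total means "within the top 1/denominator".
    cutoffs = [(10_000, "C1"), (1_000, "C2"), (100, "C3"), (10, "C4")]
    for denominator, label in cutoffs:
        if rank * denominator <= total:
            return label
    return "C5"  # bottom 90%
```

Integer arithmetic is used for the comparison so that boundary cases do not depend on floating-point rounding.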
## Subject
Represents keywords associated with the result.
@ -790,3 +901,25 @@ The value for the subject in the selected scheme. When the scheme is 'keyword',
```json
"value" : "pyrolysis-oil"
```
## UsageCounts
The usage counts indicator computed for this result.
### views
_Type: String &bull; Cardinality: ONE_
The number of views for this result.
```json
"views": "10"
```
### downloads
_Type: String &bull; Cardinality: ONE_
The number of downloads for this result.
```json
"downloads": "5"
```


@ -183,6 +183,43 @@ Date when the embargo ends and this result turns Open Access.
"embargoenddate": "2017-01-01"
```
### indicators
_Type: [Indicator](other#indicator) &bull; Cardinality: ONE_
The indicators computed for this result;
currently, the following two types of indicators are supported: [impact indicators](/data-provision/indicators-ingestion/impact-scores) and [usage statistics indicators](/data-provision/indicators-ingestion/usage-counts).
```json
"indicators": {
"impactMeasures": {
"influence": {
"score": "123",
"class": "C2"
},
"influence_alt" : {
"score": "456",
"class": "C3"
},
"popularity": {
"score": "234",
"class": "C1"
},
"popularity_alt": {
"score": "345",
"class": "C5"
},
"impulse": {
"score": "987",
"class": "C3"
}
},
"usageCounts": {
"downloads": "10",
"views": "20"
}
}
```
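As an illustration of how a consumer of the dump might read this field, here is a short sketch. It only assumes the JSON layout shown in the example above; the helper name `summarize_indicators` and the shape of its return value are hypothetical.

```python
import json

# A trimmed-down record using the same layout as the example above.
record_json = """
{
  "indicators": {
    "impactMeasures": {
      "influence": {"score": "123", "class": "C2"},
      "popularity": {"score": "234", "class": "C1"}
    },
    "usageCounts": {"downloads": "10", "views": "20"}
  }
}
"""

def summarize_indicators(record: dict) -> dict:
    """Convert the string-typed indicator values into numbers, tolerating missing parts."""
    indicators = record.get("indicators") or {}
    impact = indicators.get("impactMeasures") or {}
    usage = indicators.get("usageCounts") or {}
    return {
        # measure name -> (numeric score, impact class)
        "impact": {name: (float(m["score"]), m["class"]) for name, m in impact.items()},
        # usage counter name -> integer count
        "usage": {name: int(count) for name, count in usage.items()},
    }

summary = summarize_indicators(json.loads(record_json))
```

Scores and counts are serialised as strings in the dump, so the sketch converts them explicitly before use.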
### instance
_Type: [Instance](other#instance) &bull; Cardinality: MANY_
@ -209,13 +246,6 @@ Specific materialization or version of the result. For example, you can have one
"currency": "EUR"
},
"license": "http://creativecommons.org/licenses/by-nc/4.0",
"measures":[
{
"key": "influence",
"value": "6.45335454246e-09"
},
...
],
"pid": [
{
"scheme": "pmc",
@ -311,7 +341,7 @@ _Type: [Subject](other#subject) &bull; Cardinality: MANY_
Subject, keyword, classification code, or key phrase describing the resource.
```json
"subjecsts": [
"subjects": [
{
"provenance": {
"provenance": "Harvested",


@ -70,5 +70,5 @@ Currently, the following data sources are used as "PID authorities":
| arXiv | `arXiv_______` | arXiv.org e-Print Archive |
| handle | `handle______` | any repository |
OpenAIRE also perform duplicate identification (see the [dedicated section for details](../../data-provision/deduplication/)).
OpenAIRE also performs duplicate identification (see the [dedicated section for details](/data-provision/deduplication)).
All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated records).


@ -127,20 +127,36 @@ Further specifies the relation semantic, indicating the relation direction, e.g.
The following table lists all the possible relation semantics found in the graph dump.
Note: the labels used to specify the semantics of the relationships are (for the most part) inherited from the [DataCite metadata kernel](https://schema.datacite.org/meta/kernel-4.4/doc/DataCite-MetadataKernel_v4.4.pdf), which provides a description for them.
| # | Source entity type | Target entity type | Relation type | Relation name | Inverse relation name |
|:--:|:------------------:|:-------------------:|:-------------:|:---------------------------:|:----------------------------:|
| 1 | [Project](entities/project) | [Result](entities/result) | outcome | produces | isProducedBy |
| 2 | [Result](entities/result) | [Organization](entities/organization) | affiliation | hasAuthorInstitution | isAuthorInstitutionOf |
| 3 | [Result](entities/result) | [Result](entities/result) | similarity | isAmongTopNSimilarDocuments | HasAmongTopNSimilarDocuments |
| 4 | [Project](entities/project) | [Organization](entities/organization) | participation | isParticipant | hasParticipant |
| 2 | [Project](entities/project) | [Organization](entities/organization) | participation | hasParticipant | isParticipant |
| 3 | [Project](entities/project) | [Community](entities/community) | relationship | isRelatedTo | isRelatedTo |
| 4 | [Result](entities/result) | [Result](entities/result) | similarity | isAmongTopNSimilarDocuments | HasAmongTopNSimilarDocuments |
| 5 | [Result](entities/result) | [Result](entities/result) | supplement | isSupplementTo | isSupplementedBy |
| 6 | [Result](entities/result) | [Result](entities/result) | relationship | isRelatedTo | isRelatedTo |
| 7 | [Data source](entities/data-source) | [Organization](entities/organization) | provision | provides | isProvidedBy |
| 8 | [Result](entities/result) | [Data source](entities/data-source) | provision | isHostedBy | hosts |
| 9 | [Result](entities/result) | [Data source](entities/data-source) | provision | isProvidedBy | provides |
| 10 | [Result](entities/result) | [Community](entities/community) | relationship | isRelatedTo | isRelatedTo |
| 11 | [Organization](entities/organization) | [Community](entities/community) | relationship | isRelatedTo | isRelatedTo |
| 12 | [Data source](entities/data-source) | [Community](entities/community) | relationship | isRelatedTo | isRelatedTo |
| 13 | [Project](entities/project) | [Community](entities/community) | relationship | isRelatedTo | isRelatedTo |
| 7 | [Result](entities/result) | [Result](entities/result) | relationship | IsPartOf | HasPart |
| 8 | [Result](entities/result) | [Result](entities/result) | relationship | IsDocumentedBy | Documents |
| 9 | [Result](entities/result) | [Result](entities/result) | relationship | IsObsoletedBy | Obsoletes |
| 10 | [Result](entities/result) | [Result](entities/result) | relationship | IsSourceOf | IsDerivedFrom |
| 11 | [Result](entities/result) | [Result](entities/result) | relationship | IsCompiledBy | Compiles |
| 12 | [Result](entities/result) | [Result](entities/result) | relationship | IsRequiredBy | Requires |
| 13 | [Result](entities/result) | [Result](entities/result) | citation | IsCitedBy | Cites |
| 14 | [Result](entities/result) | [Result](entities/result) | relationship | IsReferencedBy | References |
| 15 | [Result](entities/result) | [Result](entities/result) | relationship | IsReviewedBy | Reviews |
| 16 | [Result](entities/result) | [Result](entities/result) | relationship | IsOriginalFormOf | IsVariantFormOf |
| 17 | [Result](entities/result) | [Result](entities/result) | relationship | IsVersionOf | HasVersion |
| 18 | [Result](entities/result) | [Result](entities/result) | relationship | IsIdenticalTo | IsIdenticalTo |
| 19 | [Result](entities/result) | [Result](entities/result) | relationship | IsPreviousVersionOf | IsNewVersionOf |
| 20 | [Result](entities/result) | [Result](entities/result) | relationship | IsContinuedBy | Continues |
| 21 | [Result](entities/result) | [Result](entities/result) | relationship | IsDescribedBy | Describes |
| 22 | [Result](entities/result) | [Organization](entities/organization) | affiliation | hasAuthorInstitution | isAuthorInstitutionOf |
| 23 | [Result](entities/result) | [Data source](entities/data-source) | provision | isHostedBy | hosts |
| 24 | [Result](entities/result) | [Data source](entities/data-source) | provision | isProvidedBy | provides |
| 25 | [Result](entities/result) | [Community](entities/community) | relationship | isRelatedTo | isRelatedTo |
| 26 | [Organization](entities/organization) | [Community](entities/community) | relationship | isRelatedTo | isRelatedTo |
| 27 | [Data source](entities/data-source) | [Community](entities/community) | relationship | isRelatedTo | isRelatedTo |
| 28 | [Data source](entities/data-source) | [Organization](entities/organization) | provision | isProvidedBy | provides |
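When traversing the graph it can be handy to resolve the inverse of a relation name. The sketch below builds a symmetric lookup from a handful of rows copied from the table above; it is illustrative only, and both the `INVERSE` dictionary and the `inverse_relation` helper are hypothetical names.

```python
# Relation name -> inverse relation name, for a few rows of the table above.
INVERSE = {
    "produces": "isProducedBy",
    "isSupplementTo": "isSupplementedBy",
    "IsCitedBy": "Cites",
    "isHostedBy": "hosts",
    "IsPartOf": "HasPart",
    "isRelatedTo": "isRelatedTo",  # symmetric relation
}
# Make the lookup work in both directions.
INVERSE.update({inverse: name for name, inverse in list(INVERSE.items())})

def inverse_relation(name: str):
    """Return the inverse relation name, or None if the name is not in the lookup."""
    return INVERSE.get(name)
```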


@ -4,17 +4,17 @@ sidebar_position: 1
# Aggregation
OpenAIRE materializes an open, participatory research graph (the OpenAIRE Research graph) where products of the research life-cycle (e.g. scientific literature, research data, project, software) are semantically linked to each other and carry information about their access rights (i.e. if they are Open Access, Restricted, Embargoed, or Closed) and the sources from which they have been collected and where they are hosted. The OpenAIRE research graph is materialised via a set of autonomic, orchestrated workflows operating in a regimen of continuous data aggregation and integration. [1]
OpenAIRE materializes an open, participatory research graph (the OpenAIRE Research Graph) where products of the research life-cycle (e.g. scientific literature, research data, project, software) are semantically linked to each other and carry information about their access rights (i.e. if they are Open Access, Restricted, Embargoed, or Closed) and the sources from which they have been collected and where they are hosted. The OpenAIRE Research Graph is materialised via a set of autonomic, orchestrated workflows operating in a regimen of continuous data aggregation and integration. [1]
## What does OpenAIRE collect?
OpenAIRE aggregates metadata records describing objects of the research life-cycle from content providers compliant to the [OpenAIRE guidelines](https://guidelines.openaire.eu/) and from entity registries (i.e. data sources offering authoritative lists of entities, like [OpenDOAR](https://v2.sherpa.ac.uk/opendoar/), [re3data](https://www.re3data.org/), [DOAJ](https://doaj.org/), and various funder databases). After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is used to generate the final OpenAIRE Research Graph, accessible from the [OpenAIRE EXPLORE portal](https://explore.openaire.eu) and the [APIs](https://graph.openaire.eu/develop/).
The transformation process includes the application of cleaning functions whose goal is to ensure that values are harmonised according to a common format (e.g. dates as YYYY-MM-dd) and, whenever applicable, to a common controlled vocabulary. The controlled vocabularies used for cleansing are accessible at [api.openaire.eu/vocabularies](https://api.openaire.eu/vocabularies/). Each vocabulary features a set of controlled terms, each with one code, one label, and a set of synonyms. If a synonym is found as field value, the value is updated with the corresponding term.
Also, the OpenAIRE Research Graph is extended with other relevant scholarly communication sources that do not follow the OpenAIRE Guidelines and/or are too large to be integrated via the “normal” aggregation mechanism: DOIBoost (which merges Crossref, ORCID, Microsoft Academic Graph, and Unpaywall).
In addition, the OpenAIRE Research Graph is extended with other relevant scholarly communication sources that need special handling, either because they do not strictly follow the OpenAIRE Guidelines or because of the vast amount of data they offer (e.g. DOIBoost, which merges Crossref, ORCID, Microsoft Academic Graph, and Unpaywall).
<p align="center">
<img loading="lazy" alt="Aggregation" src="/img/docs/aggregation.png" width="65%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
<img loading="lazy" alt="Aggregation" src={require('../../assets/img/aggregation.png').default} width="65%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
The OpenAIRE aggregation system collects information about objects of the research life-cycle compliant to the [OpenAIRE acquisition policy](https://www.openaire.eu/content-acquisition-policy) from [different types of data sources](https://explore.openaire.eu/search/find/dataproviders):
@ -32,7 +32,7 @@ Relationships between objects are collected from the data sources, but also auto
Objects and relationships in the OpenAIRE Research Graph are extracted from information packages, i.e. metadata records, collected from data sources of the following kinds:
- *Institutional or thematic repositories*: Information systems where scientists upload the bibliographic metadata and full-texts of their articles, due to obligations from their organization or due to community practices (e.g. ArXiv, Europe PMC);
- *Literature, Institutional and thematic repositories*: Information systems where scientists upload the bibliographic metadata and full-texts of their articles, due to obligations from their organization or due to community practices (e.g. ArXiv, Europe PMC);
- *Open Access Publishers and journals*: Information systems of open access publishers or related journals, which offer bibliographic metadata and PDFs of their published articles;
- *Data archives*: Information systems where scientists deposit descriptive metadata and files about their research data (also known as scientific data, datasets, etc.);
- *Hybrid repositories/archives*: Information systems where scientists deposit metadata and files of any kind of scientific product, including scientific literature, research data and research software (e.g. Zenodo)
@ -46,11 +46,13 @@ Objects and relationships in the OpenAIRE Research Graph are extracted from info
OpenAIRE collects metadata records describing objects of the research life-cycle from content providers compliant to the OpenAIRE guidelines and from entity registries (i.e. data sources offering authoritative lists of entities, like OpenDOAR, re3data, DOAJ, and funder databases).
The OpenAIRE aggregator collects metadata records in the majority of cases via [OAI-PMH](https://www.openarchives.org/pmh/), but also supports other standard exchange protocols like FTP(S), SFTP, and some RESTful API.
The full list of available collectors can be found in the [RedMine Wiki - API Protocols](https://support.openaire.eu/projects/openaire/wiki/API_protocols).
For additional details about the aggregation workflows, please refer to [2].
## References
[1] Manghi P. et al. (2014) "The D-NET software toolkit: A framework for the realization, maintenance, and operation of aggregative infrastructures", Program, Vol. 48 Issue: 4, pp.322-354, [10.1108/PROG-08-2013-0045](https://doi.org/10.1108/PROG-08-2013-0045)
[1] Manghi, P., Artini, M., Atzori, C., Bardi, A., Mannocci, A., La Bruzzo, S., Candela, L., Castelli, D. and Pagano, P. (2014), “The D-NET software toolkit: A framework for the realization, maintenance, and operation of aggregative infrastructures”, Program: electronic library and information systems, Vol. 48 No. 4, pp. 322-354. [doi:10.1108/prog-08-2013-0045](http://doi.org/10.1108/prog-08-2013-0045)
[2] Atzori, Claudio, Bardi, Alessia, Manghi, Paolo, & Mannocci, Andrea. (2017). The OpenAIRE workflows for data management. Zenodo. [10.5281/zenodo.996006](http://doi.org/10.5281/zenodo.996006)
[2] Atzori, C., Bardi, A., Manghi, P., & Mannocci, A. (2017, January). "The OpenAIRE workflows for data management". In Italian Research Conference on Digital Libraries (pp. 95-107). Springer, Cham. [doi:10.1007/978-3-319-68130-6_8](https://doi.org/10.1007/978-3-319-68130-6_8)


@ -0,0 +1,11 @@
---
sidebar_position: 1
---
# OpenAIRE compatible sources
The OpenAIRE aggregator collects metadata records from content providers compliant to the OpenAIRE guidelines.
The OpenAIRE Guidelines help repository managers expose publications, datasets and CRIS metadata via the OAI-PMH protocol in order to integrate with OpenAIRE infrastructure.
You can find more information at https://guidelines.openaire.eu/en/latest/


@ -33,7 +33,7 @@ The metadata collection process identifies the most recent record date available
### Entity Mapping
The table below describes the mapping from the XML baseline records to the OpenAIRE Graph dump format.
The table below describes the mapping from the XML baseline records to the OpenAIRE Research Graph dump format.
| OpenAIRE Result field path | Datacite record JSON path | # Notes |
|--------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|


@ -131,7 +131,7 @@ Possible improvements:
* Verify if Crossref has a property for `language`, `country`, `container.issnLinking`, `container.iss`, `container.edition`, `container.conferenceplace` and `container.conferencedate`
* Different approach to set the `refereed` field and improve its coverage?
h3. 2 Map Crossref links to projects/funders
### Map Crossref links to projects/funders
Links to funding available in Crossref are mapped as funding relationships (`result -- isProducedBy -- project`) applying the following mapping:


@ -69,7 +69,7 @@ curl -s "https://www.ebi.ac.uk/europepmc/webservices/rest/MED/33024307/datalinks
```
## Mapping
The table below describes the mapping from the EBI links records to the OpenAIRE Graph dump format.
The table below describes the mapping from the EBI links records to the OpenAIRE Research Graph dump format.
We filter all the target links with pid type **ena**, **pdb** or **uniprot**
For each target we construct a Bioentity with the following mapping


@ -12,7 +12,7 @@ Pubmed exposes an entry point FTP with all the updates for each one. [ftp baseli
## Entity Mapping
The table below describes the mapping from the XML baseline records to the OpenAIRE Graph dump format.
The table below describes the mapping from the XML baseline records to the OpenAIRE Research Graph dump format.
| OpenAIRE Result field path | PubMed record field xpath | Notes |
|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|


@ -0,0 +1,37 @@
# Cleaning
<!-- ## Vocabulary based cleaning -->
The aggregation processes run continuously and independently of one another. Each aggregation process, depending on the characteristics of the records exposed by the data source, makes use of one or more vocabularies to harmonise the values available in a given field.
In this page, we describe the *vocabulary-based cleaning* operation performed to harmonise the data of the different data sources.
A vocabulary is a data structure that defines a list of terms, and for each term defines a list of synonyms:
```xml
<TERMS>
<TERM native_name="Annotation" code="0018" english_name="Annotation" encoding="OPENAIRE">
<SYNONYMS>
<SYNONYM term="Comentario" encoding="CSIC"/>
<SYNONYM term="Comment/debate" encoding="Aaltodoc Publication Archive"/>
<SYNONYM term="annotation" encoding="OPENAIRE-PR202112"/>
[...]
</SYNONYMS>
<RELATIONS/>
</TERM>
<TERM native_name="Article" code="0001" english_name="Article" encoding="OPENAIRE">
<SYNONYMS>
<SYNONYM term="A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä" encoding="Aaltodoc Publication Archive"/>
<SYNONYM term="A4 Artikkeli konferenssijulkaisussa" encoding="Aaltodoc Publication Archive"/>
<SYNONYM term="Article" encoding="OTHER"/>
<SYNONYM term="Article (author)" encoding="OTHER"/>
[...]
```
Each vocabulary is typically used to control and harmonise the values available in a specific field characterising the bibliographic records. The example above provides a preview of the vocabulary used to clean the [result's instance typology](/data-model/entities/result#instance).
The content of the vocabularies can be accessed on [api.openaire.eu/vocabularies](https://api.openaire.eu/vocabularies/).
Given a value provided in the original records, the cleaning process looks for a synonym and, when found, resolves the corresponding term which is used in turn to build the cleaned record.
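The synonym lookup described above can be sketched as follows. This is a minimal illustration, not the actual cleaning code: it assumes the XML layout shown in the example, and it assumes that synonyms are matched case-insensitively after trimming whitespace. The names `build_synonym_index` and `clean_value` are hypothetical.

```python
import xml.etree.ElementTree as ET

# A well-formed excerpt of the vocabulary shown above.
VOCABULARY_XML = """
<TERMS>
  <TERM native_name="Annotation" code="0018" english_name="Annotation" encoding="OPENAIRE">
    <SYNONYMS>
      <SYNONYM term="Comentario" encoding="CSIC"/>
      <SYNONYM term="annotation" encoding="OPENAIRE-PR202112"/>
    </SYNONYMS>
  </TERM>
  <TERM native_name="Article" code="0001" english_name="Article" encoding="OPENAIRE">
    <SYNONYMS>
      <SYNONYM term="Article (author)" encoding="OTHER"/>
    </SYNONYMS>
  </TERM>
</TERMS>
"""

def build_synonym_index(vocabulary_xml: str) -> dict:
    """Map each normalised synonym to the (code, label) of its term."""
    index = {}
    for term in ET.fromstring(vocabulary_xml).iter("TERM"):
        resolved = (term.get("code"), term.get("english_name"))
        for synonym in term.iter("SYNONYM"):
            index[synonym.get("term").strip().lower()] = resolved
    return index

def clean_value(value: str, index: dict):
    """Return the (code, label) of the matching term, or None if no synonym matches."""
    return index.get(value.strip().lower())

index = build_synonym_index(VOCABULARY_XML)
```

Building the index once per vocabulary keeps each lookup a constant-time dictionary access.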
Each aggregation process applies the vocabularies as they are defined at that moment in time. A vocabulary may change after the aggregation of a data source has finished, in which case the aggregated content no longer reflects the current status of the controlled vocabularies.
In addition, the integration of ScholeXplorer and DOIBoost and some enrichment processes applied on the raw and on the de-duplicated graph may introduce values that do not comply with the current status of the OpenAIRE controlled vocabularies. For these reasons, we included a final step of cleansing at the end of the workflow materialisation.


@ -1,7 +1,8 @@
# Data provision
# Graph production workflow
OpenAIRE collects metadata records from more than 70K scholarly communication sources from all over the world, including Open Access institutional repositories, data archives, journals. All the metadata records (i.e. descriptions of research products) are put together in a data lake, together with records from Crossref, Unpaywall, ORCID, Grid.ac, and information about projects provided by national and international funders. Dedicated inference algorithms applied to metadata and to the full-texts of Open Access publications enrich the content of the data lake with links between research results and projects, author affiliations, subject classification, links to entries from domain-specific databases. Duplicated organisations and results are identified and merged together to obtain an open, trusted, public resource enabling explorations of the scholarly communication landscape like never before.
<p align="center">
<img loading="lazy" alt="Data provision" src="/img/docs/architecture.png" width="80%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
<img loading="lazy" alt="Data provision" src={require('../assets/img/architecture.png').default} width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>


@ -0,0 +1,37 @@
# Deduction
The Deduction process (also known as “bulk tagging”) enriches each record with new information that can be derived from the existing property values.
This process is used to associate results to community/research initiatives that are part of OpenAIRE.
As of November 2022, three procedures are in place to relate a research product to a research initiative, infrastructure (RI) or community (RC) based on:
* subjects: it is possible to specify a list of subjects that are relevant for the RC/RI. Every time one of the subjects is found among the subjects of a result, the result is linked to the RC/RI.
<p align="center">
<img loading="lazy" alt="Bulktagging Subject" src={require('../../assets/img/enrichment/bulktagging_subject.png').default} width="50%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
* data sources: it is possible to list a set of data sources relevant for the RC/RI. All the results collected from these data sources will be linked to the RC/RI
<p align="center">
<img loading="lazy" alt="Bulktagging Data source" src={require('../../assets/img/enrichment/bulktagging_datasource.png').default} width="50%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
When only some results collected from a datasource are relevant for the RC/RI, it is possible to specify a set of selection constraints (SC) that have to be verified before linking the result to the
community. The selection constraint has the form <strong>SC = S1 or S2 or ... or Sn</strong>. The generic Si has the form <strong>Si = s<sub>i1</sub> and s<sub>i2</sub> and ...and s<sub>in</sub></strong> and each s<sub>ij</sub> is a condition on a specific field of the result. The set of fields that can be specified is <strong>F={title, author, contributor, description, orcid}</strong>,
while the condition can be chosen among <strong>V={contains, equals, not_contains, not_equals, contains_ignorecase, equals_ignorecase, not_contains_ignorecase, not_equal_ignorecase}</strong>, and the value is free text.
A possible selection criterion is: “All the products whose contributor contains DARIAH”.
<p align="center">
<img loading="lazy" alt="Bulktagging Data source" src={require('../../assets/img/enrichment/bulktagging_selconstraints.png').default} width="70%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
* Zenodo community: it is possible to list a set of Zenodo communities relevant for the RC/RI. All the products collected from the listed Zenodo communities are linked to the RC/RI
<p align="center">
<img loading="lazy" alt="Bulktagging Zenodo Community" src={require('../../assets/img/enrichment/bulktagging_zenodo.png').default} width="50%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
The list of subjects, Zenodo communities and data sources used to enrich the products are defined by the managers of the community gateway or infrastructure monitoring dashboard associated with the RC/RI.
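The selection constraints described above (a disjunction of conjunctions of field conditions) can be evaluated as in the following sketch. It is an illustration under stated assumptions, not the production logic: a result is modelled as a dictionary from field name to a list of string values, a positive condition holds if any value of the field matches, and a negated condition holds if no value matches. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

Condition = Tuple[str, str, str]   # (field, verb, value)
Conjunction = List[Condition]      # Si = si1 and si2 and ...
Constraint = List[Conjunction]     # SC = S1 or S2 or ...

def condition_holds(result: Dict[str, List[str]], field: str, verb: str, value: str) -> bool:
    """Evaluate one condition against all the values of a field."""
    values = result.get(field, [])
    negated = verb.startswith("not_")
    core = verb[len("not_"):] if negated else verb
    if core.endswith("_ignorecase"):
        core = core[: -len("_ignorecase")]
        values = [v.lower() for v in values]
        value = value.lower()
    if core == "contains":
        hit = any(value in v for v in values)
    elif core in ("equals", "equal"):  # the list above spells one verb "not_equal_ignorecase"
        hit = any(value == v for v in values)
    else:
        raise ValueError(f"unsupported verb: {verb}")
    return hit != negated

def satisfies(result: Dict[str, List[str]], constraint: Constraint) -> bool:
    """True if at least one conjunction has all of its conditions satisfied."""
    return any(
        all(condition_holds(result, *condition) for condition in conjunction)
        for conjunction in constraint
    )

# "All the products whose contributor contains DARIAH"
dariah_constraint = [[("contributor", "contains", "DARIAH")]]
```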


@ -0,0 +1,55 @@
# Propagation
This process enriches the graph by adding new links and/or new properties. The new information is added by exploiting existing semantic
relationships and values between the involved entities.
As of November 2022, the following procedures are in place:
* Country propagation: updates the property “country” of a result. This happens when the result is collected from an institutional datasource or when the datasource hosting the result is included in a whitelist. For all the results whose hosting datasource satisfies one of the conditions above, the country of the organization providing the datasource is added to the countries of the result: e.g. a publication collected from an institutional repository maintained by an Italian university will be enriched with the property “country = IT”.
<p align="center">
<img loading="lazy" alt="Country Propagation" src={require('../../assets/img/enrichment/propagation_country.png').default} width="50%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
* Project propagation: adds an "isProducedBy" relationship (and its inverse) between a Project P and a Result R1, if R1 has a strong semantic relationship with another Result R2 and P produces R2: e.g. a publication linked to project P “is supplemented by” a dataset D. Dataset D will get the link to project P. The relationships considered for this procedure are “isSupplementedBy” and “isSupplementTo”.
<p align="center">
<img loading="lazy" alt="Project Propagation" src={require('../../assets/img/enrichment/propagation_resulttoproject.png').default} width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
* Result to RC/RI through organization propagation. The manager of the RC/RI can specify a set of organizations whose products are relevant for the
community.
Each result having an affiliation relation with at least one organization relevant for the RC/RI will be linked to it.
<p align="center">
<img loading="lazy" alt="Result to community through organization propagation" src={require('../../assets/img/enrichment/propagation_resulttocommunitythroughorganization.png').default}
width="50%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
* Result to RC/RI through semantic relation: extends the set of products linked to a RC/RI by exploiting strong semantic relationships between the results;
e.g. if a result R1 is associated to the community C and is supplemented by a result R2 then the result R2 will be linked to the community. The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
<p align="center">
<img loading="lazy" alt="Result to community through semantic relation propagation" src={require('../../assets/img/enrichment/propagation_resulttocommunitythroughsemrel.png').default} width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
* ORCID identifiers to result through semantic relation. This propagation enriches the results by adding ORCID identifiers to authors. The added ORCID will be marked as "potential" since they have been inserted through propagation.
The process considers the set of overlapping authors between results (R1 and R2) linked with a strong semantic relationship (IsSupplementedBy, IsSupplementTo).
For each author A in the overlapping set, if R1 provides the ORCID value for A and R2 does not, then the author A in R2 will be enriched with the information of the ORCID found in R1.
<p align="center">
<img loading="lazy" alt="Orcid propagation through semantic relation" src={require('../../assets/img/enrichment/propagation_orcid.png').default} width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
* affiliation to organization through institutional repository. This propagation adds one "hasAuthorInstitution" relationship (and its inverse)
between a Result R and Organization O,
if R was collected from a datasource D with type institutional repository, and D was provided by O.
<p align="center">
<img loading="lazy" alt="Affiliation propagation through institutional repository" src={require('../../assets/img/enrichment/propagation_affiliationistrepo.png').default} width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
* affiliation to organization through semantic relation. This propagation adds one "hasAuthorInstitution" relationship (and its inverse) between a
Result R and an Organization O,
if R has an affiliation relation with an organization O1 that is in relation "isChildOf" with O.
<p align="center">
<img loading="lazy" alt="Affiliation propagation through semantic relation" src={require('../../assets/img/enrichment/propagation_organizationsemrel.png').default} width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
The algorithm exploits only the organization leaves that are in an "IsChildOf" relation with another organization. So far, only a single propagation step is performed.
<p align="center">
<img loading="lazy" alt="propagation strategy" src={require('../../assets/img/enrichment/organization_tree.png').default} width="40%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
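As an illustration of the ORCID propagation procedure listed above, here is a minimal sketch. It rests on several assumptions that the text does not specify: authors are modelled as dictionaries with a `fullname` and an optional `orcid`, two authors are considered the same when their normalised full names are equal, and the "potential" marker is stored in a hypothetical `orcid_provenance` field. The identifier used in the usage note is a placeholder.

```python
def normalise(name: str) -> str:
    """Lowercase the name and collapse whitespace (assumed matching rule)."""
    return " ".join(name.lower().split())

def propagate_orcid(source_authors: list, target_authors: list) -> list:
    """Copy ORCIDs from the authors of R1 to matching authors of R2 that lack one.

    R1 and R2 are assumed to be linked by a strong semantic relationship
    (IsSupplementedBy / IsSupplementTo); that check is outside this sketch.
    """
    known = {normalise(a["fullname"]): a["orcid"] for a in source_authors if a.get("orcid")}
    enriched = []
    for author in target_authors:
        author = dict(author)  # do not modify the caller's data
        key = normalise(author["fullname"])
        if not author.get("orcid") and key in known:
            author["orcid"] = known[key]
            author["orcid_provenance"] = "potential"  # hypothetical field name
        enriched.append(author)
    return enriched
```

For example, `propagate_orcid([{"fullname": "Ada Lovelace", "orcid": "0000-0000-0000-0001"}], [{"fullname": "ada lovelace"}])` returns the second author enriched with the same identifier, marked as potential.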


@ -3,18 +3,91 @@ sidebar_position: 3
---
# Clustering functions
## Ngrams
It creates ngrams from the input field. <br />
```
Example:
Input string: “Search for the Standard Model Higgs Boson”
Parameters: ngram length = 3, maximum number = 4
List of ngrams: “sea”, “sta”, “mod”, “hig”
```
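A minimal sketch that reproduces the example above. It is not the actual implementation: it assumes that the input is lowercased, that stopwords are dropped (the `STOPWORDS` set is a hypothetical stand-in), and that only the leading ngram of each remaining word is kept.

```python
# Hypothetical stopword list; the real one is not given in this page.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the"}

def ngrams(text: str, ngram_length: int = 3, max_ngrams: int = 4) -> list:
    """Return the leading ngram of each non-stopword word, up to max_ngrams keys."""
    keys = []
    for word in text.lower().split():
        if word in STOPWORDS or len(word) < ngram_length:
            continue
        key = word[:ngram_length]
        if key not in keys:  # keep keys unique, preserving order
            keys.append(key)
        if len(keys) == max_ngrams:
            break
    return keys
```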
## NgramPairs
It produces a list of concatenations of a pair of ngrams generated from different words.<br />
*Example:*<br />
Input string: `“Search for the Standard Model Higgs Boson”`<br />
Parameters: ngram length = 3<br />
List of ngrams: `“sea”`, `“sta”`, `“mod”`, `“hig”`<br />
Ngram pairs: `“seasta”`, `“stamod”`, `“modhig”`
```
Example:
Input string: “Search for the Standard Model Higgs Boson”
Parameters: ngram length = 3
Ngram pairs: “seasta”, “stamod”, “modhig”
```
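The pairing step can be sketched as below, assuming the list of ngrams has already been computed and that only adjacent ngrams are concatenated, which is what the example suggests. The function name `ngram_pairs` is hypothetical.

```python
def ngram_pairs(ngram_list: list) -> list:
    """Concatenate each ngram with the one that follows it."""
    return [first + second for first, second in zip(ngram_list, ngram_list[1:])]
```

With the ngrams `["sea", "sta", "mod", "hig"]` this yields the three pairs listed in the example.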
## SuffixPrefix
It produces ngrams pairs in a particular way: it concatenates the suffix of a string with the prefix of the next in the input string.<br />
*Example:*<br />
Input string: `“Search for the Standard Model Higgs Boson”`<br />
Parameters: suffix and prefix length = 3<br />
Output list: `“ardmod”` (suffix of the word `“Standard”` + prefix of the word `“Model”`), `“rchsta”` (suffix of the word `“Search”` + prefix of the word `“Standard”`)
It produces ngrams pairs in a particular way: it concatenates the suffix of a string with the prefix of the next in the input string. A specialization of this function is available as SortedSuffixPrefix. It returns a sorted list. <br />
```
Example:
Input string: “Search for the Standard Model Higgs Boson”
Parameters: suffix and prefix length = 3, maximum number = 2
Output list: “ardmod” (suffix of the word “Standard” + prefix of the word “Model”), “rchsta” (suffix of the word “Search” + prefix of the word “Standard”)
```
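
A hedged Python sketch of the same idea. It assumes that stopwords are removed before pairing and that, when a maximum is set, the first pairs in title order are kept; both choices are assumptions that happen to reproduce the example, and the function names are illustrative.

```python
import re

# Hypothetical stopword list, used only to reproduce the documented example.
STOPWORDS = {"for", "the", "of", "a", "an", "and", "in", "on"}


def suffix_prefix(text, length=3, max_keys=2):
    """Join the suffix of each word with the prefix of the next one."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]
    keys = []
    for current, following in zip(words, words[1:]):
        key = current[-length:] + following[:length]
        if key not in keys:
            keys.append(key)
    return set(keys[:max_keys])


def sorted_suffix_prefix(text, length=3, max_keys=2):
    """Same keys as suffix_prefix, returned as a sorted list."""
    return sorted(suffix_prefix(text, length, max_keys))


print(sorted_suffix_prefix("Search for the Standard Model Higgs Boson"))
# ['ardmod', 'rchsta']
```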
## Acronyms
It creates a number of acronyms out of the words in the input field. <br />
```
Example:
Input string: “Search for the Standard Model Higgs Boson”
Output: "ssmhb"
```
## KeywordsClustering
It creates keys by extracting keywords, out of a customizable list, from the input field. <br />
```
Example:
Input string: “University of Pisa”
Output: "key::001" (code that identifies the keyword "University" in the customizable list)
```
## LowercaseClustering
It creates keys by lowercasing the input field. <br />
```
Example:
Input string: “10.001/ABCD”
Output: "10.001/abcd"
```
## RandomClusteringFunction
It creates random keys from the input field. <br />
## SpaceTrimmingFieldValue
It creates keys by trimming spaces in the input field. <br />
```
Example:
Input string: “Search for the Standard Model Higgs Boson”
Output: "searchstandardmodelhiggsboson"
```
## UrlClustering
It creates keys for a URL field by extracting the domain. <br />
```
Example:
Input string: “http://www.google.it/page”
Output: "www.google.it"
```
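
For illustration, the domain extraction can be sketched with Python's standard `urllib.parse` module. The function name `url_key` is hypothetical, and returning an empty key for values without a host is an assumption of this sketch.

```python
from urllib.parse import urlparse


def url_key(url):
    """Return the host part of a URL, or an empty string if none is found."""
    return urlparse(url).hostname or ""


print(url_key("http://www.google.it/page"))
# www.google.it
```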
## WordsStatsSuffixPrefixChain
It creates keys containing concatenated statistics of the field, i.e. number of words, number of letters and a chain of suffixes and prefixes of the words. <br />
```
Example:
Input string: “Search for the Standard Model Higgs Boson”
Parameters: mod = 10
Output list: "5-3-seaardmod" (number of words + number of letters % 10 + prefix of the word "Search" + suffix of the word "Standard" + prefix of the word "Model"), "5-3-rchstadel" (number of words + number of letters % 10 + suffix of the word "Search" + prefix of the word "Standard" + suffix of the word "Model")
```
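
The key construction above can be sketched in Python as follows. To reproduce the documented output (`5-3-...`), this sketch counts the characters of the normalized string including the separating spaces (33 characters, hence 3) and removes a small hypothetical stopword list; both are assumptions, as are the function and parameter names.

```python
import re

# Hypothetical stopword list, used only to reproduce the documented example.
STOPWORDS = {"for", "the", "of", "a", "an", "and", "in", "on"}


def words_stats_keys(text, mod=10, affix_len=3):
    """Build the two 'words-letters-chain' keys for a title."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]
    normalized = " ".join(words)
    stats = f"{len(words)}-{len(normalized) % mod}"
    head = words[:3]  # only the first 3 words (fewer if the title is shorter)

    def chain(start_with_prefix):
        # Alternate prefix and suffix along the words, starting with either one.
        parts = []
        for i, word in enumerate(head):
            use_prefix = (i % 2 == 0) == start_with_prefix
            parts.append(word[:affix_len] if use_prefix else word[-affix_len:])
        return "".join(parts)

    return [f"{stats}-{chain(True)}", f"{stats}-{chain(False)}"]


print(words_stats_keys("Search for the Standard Model Higgs Boson"))
# ['5-3-seaardmod', '5-3-rchstadel']
```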


@ -1,29 +1,28 @@
# Deduplication
## Clustering
Metadata records about the same scholarly work can be collected from different providers. Each metadata record can possibly carry different information because, for example, some providers are not aware of links to projects, keywords or other details. Another common case is when OpenAIRE collects one metadata record from a repository about a pre-print and another record from a journal about the published article. For the provision of statistics, OpenAIRE must identify those cases and “merge” the two metadata records, so that the scholarly work is counted only once in the statistics OpenAIRE produces.
Clustering is a common heuristic used to overcome the N x N complexity required to match all pairs of objects to identify the equivalent ones. The challenge is to identify a clustering function that maximizes the chance of comparing only records that may lead to a match, while minimizing the number of records that will not be matched while being equivalent. Since the equivalence function is to some extent tolerant to minimal errors (e.g. switching of characters in the title, or minimal difference in letters), we need this function to be not too precise (e.g. a hash of the title), but also not too flexible (e.g. random ngrams of the title). On the other hand, reality tells us that in some cases equality of two records can only be determined by their PIDs (e.g. DOI) as the metadata properties are very different across different versions and no clustering function will ever bring them into the same cluster. To match these requirements OpenAIRE clustering for products works with two functions:
* DOI: the function generates the DOI when this is provided as part of the record properties;
* Title-based function: the function generates a key that depends on (i) number of significant words in the title (normalized, stemming, etc.), (ii) modulo 10 of the number of characters of such words, and (iii) a string obtained as an alternation of the function prefix(3) and suffix(3) (and vice versa) on the first 3 words (2 words if the title only has 2). For example, the title “Entity deduplication in big data graphs for scholarly communication” becomes “entity deduplication big data graphs scholarly communication” with two keys “7.1entionbig” and “7.1itydedbig” (where 1 is modulo 10 of 54 characters of the normalized title).
To give an idea, this configuration generates around 77Mi blocks, which we limited to 200 records each (only 15K blocks are affected by the cut), and entails 260Bi matches. Matches in a block are performed using a “sliding window” set to 80 records. The records are sorted lexicographically on a normalized version of their titles. The 1st record is matched against all the 80 following ones, then the second, etc. for an NlogN complexity.
## Methodology overview
## Matching and election
The deduplication process can be divided into three different phases:
* Candidate identification (clustering)
* Duplicates identification (pair-wise comparisons)
* Duplicates grouping (transitive closure)
Once the clusters have been built, the algorithm proceeds with the comparisons. Comparisons are driven by a decisional tree that:
1. Tries to capture equivalence via PIDs: if records share a PID then they are equivalent
<p align="center">
<img loading="lazy" alt="Deduplication Workflow" src={require('../../assets/img/deduplication-workflow.png').default} width="85%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
2. Tries to capture difference:
### Candidate identification (clustering)
a. If record titles contain different “numbers” then they are different (this rule is subject to different feelings, and should be fine-tuned);
Clustering is a common heuristic used to overcome the N x N complexity required to match all pairs of objects to identify the equivalent ones. The challenge is to identify a [clustering function](./clustering-functions) that maximizes the chance of comparing only records that may lead to a match, while minimizing the number of records that will not be matched while being equivalent. Since the equivalence function is to some extent tolerant to minimal errors (e.g. switching of characters in the title, or minimal difference in letters), we need this function to be not too precise (e.g. a hash of the title), but also not too flexible (e.g. random ngrams of the title). On the other hand, reality tells us that in some cases equality of two records can only be determined by their PIDs (e.g. DOI) as the metadata properties are very different across different versions and no [clustering function](./clustering-functions) will ever bring them into the same cluster.
b. If records contain a different number of authors then they are different;
### Duplicates identification (pair-wise comparisons)
c. Note that different PIDs do not imply different records, as different versions may have different PIDs.
Pair-wise comparisons are conducted over records in the same cluster following the strategy defined in the decision tree. A different decision tree is adopted depending on the type of the entity being processed.
3. Measures equivalence:
To further limit the number of comparisons, a sliding window mechanism is used: (i) records in the same cluster are lexicographically sorted by their title, (ii) a window of K records slides over the cluster, and (iii) records ending up in the same window are pair-wise compared. Each comparison produces a similarity relation when the pair of records matches. Such relations are subsequently used as input for the duplicates grouping stage.
a. The titles of the two records are normalised and compared for similarity by applying the Levenshtein distance algorithm. The algorithm returns a number in the range [0,1], where 0 means “very different” and 1 means “equal”. If the distance is greater than or equal to 0.99 the two records are identified as duplicates.
### Duplicates grouping (transitive closure)
b. Dates are not regarded for equivalence matching because different versions of the same records should be merged and may be published on different dates, e.g. pre-print and published version of an article.
Once the equivalence relationships between pairs of records are set, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new representative object is obtained, which inherits all properties from the merged records and keeps track of their provenance. The ID of the record is obtained by appending the prefix “dedup_” to the MD5 of the first ID (given their lexicographical ordering). A new, more stable function to generate the ID is under development, which exploits the DOI when one of the records to be merged includes a Crossref or a DataCite record.
Once the similarity relations between pairs of records are drawn, the groups of equivalent records are obtained (transitive closure, i.e. “mesh”). From such sets a new representative object is obtained, which inherits all properties from the merged records and keeps track of their provenance.
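
The grouping step amounts to computing the connected components of the graph whose edges are the similarity relations. A minimal Python sketch using a union-find structure is given below; the function name and the representation of relations as pairs of record identifiers are assumptions made for illustration, and the merging of properties into the representative record is not covered.

```python
def group_duplicates(similarity_relations):
    """Return the groups of equivalent record ids (transitive closure)."""
    parent = {}

    def find(x):
        # Find the root of x, shortening the path on the way up.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right in similarity_relations:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = {}
    for record_id in list(parent):
        groups.setdefault(find(record_id), []).append(record_id)
    return sorted(sorted(group) for group in groups.values())


relations = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
print(group_duplicates(relations))
# [['r1', 'r2', 'r3'], ['r4', 'r5']]
```

Here `r1` and `r3` end up in the same group although they were never directly matched, which is the effect of the transitive closure.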


@ -3,4 +3,68 @@ sidebar_position: 2
---
# Organizations
<span className="todo">TODO</span>
The organizations in OpenAIRE are aggregated from different registries (e.g. CORDA, OpenDOAR, Re3data, ROR). In some cases, a registry provides organizations as entities with their own persistent identifier. In other cases, those organizations are extracted from other main entities provided by the registry (e.g. datasources, projects, etc.).
The deduplication of organizations is enhanced by [OpenOrgs](https://orgs.openaire.eu), a tool that combines an automated approach for identifying duplicated instances
of the same organization record with a "humans in the loop" approach, in which the equivalences produced by a duplicate identification algorithm are suggested to data curators, who are in charge of validating them.
The data curation activity is twofold: on the one hand it pivots around the disambiguation task, on the other hand it aims to improve the metadata describing the organization records
(e.g. including the translated name, or a different PID) as well as to define the hierarchical structure of existing large organizations (i.e. universities comprising their departments, or large research centers with all their sub-units or sub-institutes).
Duplicates among organizations are therefore managed through three different stages:
* *Creation of Suggestions*: executes an automatic workflow that performs the deduplication and prepares new suggestions to be processed by the curators;
* *Curation*: manual editing of the organization records performed by the data curators;
* *Creation of Representative Organizations*: executes an automatic workflow that creates curated organizations and exposes them on the OpenAIRE Research Graph by using the curators' feedback from the OpenOrgs underlying database.
The next sections describe the above mentioned stages.
### Creation of Suggestions
This stage executes an automatic workflow that addresses the *candidate identification* and the *duplicates identification* stages of the deduplication to provide suggestions for the curators in OpenOrgs.
#### Candidate identification (clustering)
To match the requirements of limiting the number of comparisons, OpenAIRE clustering for organizations aims at grouping records that would more likely be comparable.
It works with four functions:
* *URL-based function*: the function generates the URL domain when this is provided as part of the record properties from the organization's `websiteurl` field;
* *Title-based functions*:
  * generate strings dependent on the keywords in the `legalname` field;
  * generate strings obtained as an alternation of the function prefix(3) and suffix(3) (and vice versa) on the first 3 words of the `legalname` field;
  * generate strings obtained as a concatenation of ngrams of the `legalname` field;
#### Duplicates identification (pair-wise comparisons)
For each pair of organizations in a cluster the following strategy (depicted in the figure below) is applied.
The comparison goes through the following decision tree:
1. *grid id check*: comparison of the grid ids. If the grid id is equivalent, then the similarity relation is drawn. If the grid id is not available, the comparison proceeds to the next stage;
2. *early exits*: comparison of the numbers extracted from the `legalname`, the `country` and the `website` url. No similarity relation is drawn in this stage; the comparison proceeds only if the compared fields satisfy the conditions of equivalence;
3. *city check*: comparison of the city names in the `legalname`. The comparison proceeds only if the legalnames share at least 10% of cities;
4. *keyword check*: comparison of the keywords in the `legalname`. The comparison proceeds only if the legalnames share at least 70% of keywords;
5. *legalname check*: comparison of the normalized `legalnames` with the `Jaro-Winkler` distance to determine if it is higher than `0.9`. If so, a similarity relation is drawn. Otherwise, no similarity relation is drawn.
<p align="center">
<img loading="lazy" alt="Organization Decision Tree" src={require('../../assets/img/decisiontree-organization.png').default} width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/1YKInGGtHu09QG4pT2gRLEum4LxU82d4nKkvGNvRQmrg/edit?usp=sharing)
### Data Curation
All the similarity relations drawn by the algorithm involving the decision tree are exposed in OpenOrgs, where they are made available to the data curators, who can give feedback and improve the organizations' metadata.
A data curator can:
* *edit organization metadata*: legalname, pid, country, url, parent relations, etc.;
* *approve suggested duplicates*: establish if an equivalence relation is valid;
* *discard suggested duplicates*: establish if an equivalence relation is wrong;
* *create similarity relations*: add a new equivalence relation not drawn by the algorithm.
Note that if a curator does not provide feedback on a similarity relation suggested by the algorithm, then such relation is considered as valid.
### Creation of Representative Organizations
This stage executes an automatic workflow that addresses the *duplicates grouping* stage to create representative organizations and to update them on the OpenAIRE Research Graph. Such organizations are obtained via transitive closure, and the relations used come from the curators' feedback gathered in the OpenOrgs underlying database.
#### Duplicates grouping (transitive closure)
Once the similarity relations between pairs of organizations have been gathered, the groups of equivalent organizations are obtained (transitive closure, i.e. “mesh”). From such sets a new representative organization is obtained, which inherits all properties from the merged records and keeps track of their provenance.
The IDs of the representative organizations are obtained by the OpenOrgs Database that creates a unique ``openorgs`` ID for each approved organization. In case an organization is not approved by the curators, the ID is obtained by appending the prefix ``pending_org`` to the MD5 of the first ID (given their lexicographical ordering).


@ -4,48 +4,66 @@ sidebar_position: 1
# Research results
Metadata records about the same scholarly work can be collected from different providers. Each metadata record can possibly carry different information because, for example, some providers are not aware of links to projects, keywords or other details. Another common case is when OpenAIRE collects one metadata record from a repository about a pre-print and another record from a journal about the published article. For the provision of statistics, OpenAIRE must identify those cases and “merge” the two metadata records, so that the scholarly work is counted only once in the statistics OpenAIRE produces.
Duplicates among research results are identified among results of the same type (publications, datasets, software, other research products). If two duplicate results are aggregated one as a dataset and one as a software, for example, they will never be compared and they will never be identified as duplicates.
OpenAIRE supports different deduplication strategies based on the type of results.
## Methodology overview
The next sections describe how each stage of the deduplication workflow is faced for research results.
The deduplication process can be divided into three different phases:
* Candidate identification (clustering)
* Decision tree
* Creation of representative record
### Candidate identification (clustering)
The implementation of each phase is different based on the type of results that are being processed.
To match the requirements of limiting the number of comparisons, OpenAIRE clustering for research products works with two functions:
* *DOI-based function*: the function generates the DOI when this is provided as part of the record properties;
* *Title-based function*: the function generates a key that depends on (i) number of significant words in the title (normalized, stemming, etc.), (ii) modulo 10 of the number of characters of such words, and (iii) a string obtained as an alternation of the function prefix(3) and suffix(3) (and vice versa) on the first 3 words (2 words if the title only has 2). For example, the title ``Search for the Standard Model Higgs Boson`` becomes ``search standard model higgs boson`` with two keys, ``5-3-seaardmod`` and ``5-3-rchstadel``.
### Publications
To give an idea, this configuration generates around 77Mi blocks, which we limited to 200 records each (only 15K blocks are affected by the cut), and entails 260Bi matches.
#### Candidate identification (clustering)
### Duplicates identification (pair-wise comparisons)
Clustering is a common heuristic used to overcome the N x N complexity required to match all pairs of objects to identify the equivalent ones. The challenge is to identify a [clustering function](./clustering-functions) that maximizes the chance of comparing only records that may lead to a match, while minimizing the number of records that will not be matched while being equivalent. Since the equivalence function is to some extent tolerant to minimal errors (e.g. switching of characters in the title, or minimal difference in letters), we need this function to be not too precise (e.g. a hash of the title), but also not too flexible (e.g. random ngrams of the title). On the other hand, reality tells us that in some cases equality of two records can only be determined by their PIDs (e.g. DOI) as the metadata properties are very different across different versions and no [clustering function](./clustering-functions) will ever bring them into the same cluster. To match these requirements OpenAIRE clustering for products works with two functions:
DOI: the function generates the DOI when this is provided as part of the record properties;
Title-based function: the function generates a key that depends on (i) number of significant words in the title (normalized, stemming, etc.), (ii) modulo 10 of the number of characters of such words, and (iii) a string obtained as an alternation of the function prefix(3) and suffix(3) (and vice versa) on the first 3 words (2 words if the title only has 2). For example, the title “Entity deduplication in big data graphs for scholarly communication” becomes “entity deduplication big data graphs scholarly communication” with two keys “7.1entionbig” and “7.1itydedbig” (where 1 is modulo 10 of 54 characters of the normalized title).
Comparisons in a block are performed using a *sliding window* set to 50 records. The records are sorted lexicographically on a normalized version of their titles. The 1st record is compared against all the 50 following ones using the decision tree, then the second, etc. for an NlogN complexity.
A different decision tree is adopted depending on the type of the entity being processed.
Similarity relations drawn in this stage will be consequently used to perform the duplicates grouping.
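
The sliding window can be sketched in Python as a generator of candidate pairs. The sketch assumes that records are dictionaries with an already normalized `title` field; the function name and the record layout are illustrative, and the decision tree that is applied to each pair is not included.

```python
def sliding_window_pairs(records, window=50):
    """Yield the pairs of records that fall within the same window.

    Records are sorted by title, then each record is paired with
    the `window` records that follow it in that order.
    """
    ordered = sorted(records, key=lambda record: record["title"])
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + 1 + window]:
            yield left, right


block = [{"id": n, "title": t} for n, t in enumerate(["delta", "alpha", "echo", "bravo", "charlie"])]
pairs = [(a["title"], b["title"]) for a, b in sliding_window_pairs(block, window=2)]
print(len(pairs))
# 7
```

With 5 records and a window of 2, only 7 of the 10 possible pairs are compared; the saving grows with the size of the block.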
#### Decision tree
#### Publications
For each pair of publications in a cluster the following strategy (depicted in the figure below) is applied.
Cross comparison of the pid lists (in the `pid` and `alternateid` elements). If 50% common pids, levenshtein distance on titles with low threshold (0.9).
Otherwise, check if the number of authors and the title version is equal. If so, levenshtein distance on titles with higher threshold (0.99).
The publications are matched as duplicate if the distance is higher than the threshold, in every other case they are considered as distinct publications.
The comparison goes through different stages:
1. *trusted pids check*: comparison of the trusted pid lists (in the `pid` field of the record). If at least 1 pid is equivalent, records match and the similarity relation is drawn.
2. *instance type check*: comparison of the instance types (indicating the subtype of the record, i.e. presentation, conference object, etc.). If the instance types are not compatible then the records do not match. Otherwise, the comparison proceeds to the next stage.
3. *untrusted pids check*: comparison of all the available pids (in the `pid` and the `alternateid` fields of the record). In every case, no similarity relation is drawn in this stage. If at least one pid is equivalent, the next stage will be a *soft check*, otherwise the next stage is a *strong check*.
4. *soft check*: comparison of the record titles with the Levenshtein distance. If the distance measure is above 0.9 then the similarity relation is drawn.
5. *strong check*: comparison composed by three substages involving the (i) comparison of the author list sizes and the version of the record to determine if they are coherent, (ii) comparison of the record titles with the Levenshtein distance to determine if it is higher than 0.99, (iii) "smart" comparison of the author lists to check if common authors are more than 60%.
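
The stages above can be sketched in Python as follows. This is a simplified illustration, not the actual implementation: it assumes that "compatible" instance types means equal types, it leaves out the version comparison of the *strong check*, it replaces the "smart" author comparison with a plain overlap of author strings, and it uses a Levenshtein similarity normalized by the longer title. The record layout (dictionaries with `pid`, `alternateid`, `instancetype`, `title`, `authors`) and all function names are hypothetical.

```python
def levenshtein_similarity(a, b):
    """Levenshtein distance turned into a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b))) # substitution
        previous = current
    return 1.0 - previous[-1] / max(len(a), len(b))


def publications_match(r1, r2):
    """Return True when a similarity relation would be drawn."""
    # 1. trusted pids check
    if set(r1["pid"]) & set(r2["pid"]):
        return True
    # 2. instance type check (assumption: compatible means equal)
    if r1["instancetype"] != r2["instancetype"]:
        return False
    # 3. untrusted pids check: only selects the next stage
    ids1 = set(r1["pid"]) | set(r1["alternateid"])
    ids2 = set(r2["pid"]) | set(r2["alternateid"])
    similarity = levenshtein_similarity(r1["title"].lower(), r2["title"].lower())
    if ids1 & ids2:
        # 4. soft check
        return similarity > 0.9
    # 5. strong check (version comparison omitted in this sketch)
    if len(r1["authors"]) != len(r2["authors"]):
        return False
    if similarity <= 0.99:
        return False
    common = len(set(r1["authors"]) & set(r2["authors"]))
    return common / max(len(r1["authors"]), 1) > 0.6
```

Two records with nearly identical titles are therefore matched under the soft check only when they also share an identifier; without one, the stricter 0.99 threshold and the author comparison apply.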
<p align="center">
<img loading="lazy" alt="Deduplication workflow" src="/img/docs/dedup-results.png" width="80%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
<img loading="lazy" alt="Publications Decision Tree" src={require('../../assets/img/decisiontree-publication.png').default} width="100%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
#### Creation of representative record
<span className="todo">TODO</span>
[//]: # (Link to the image: https://docs.google.com/drawings/d/19SIilTp1vukw6STMZuPMdc0pv0ODYCiOxP7OU3iPWK8/edit?usp=sharing)
### Datasets
<span className="todo">TODO</span>
#### Software
For each pair of software in a cluster the following strategy (depicted in the figure below) is applied.
The comparison goes through different stages:
1. *pids check*: comparison of the pids in the records. No similarity relation is drawn in this stage, it is only used to establish the final threshold to be used to compare record titles. If there is at least one common pid, then the next stage is a *soft check*. Otherwise, the next stage is a *strong check*
2. *soft check*: comparison of the record titles with Levenshtein distance. If the measure is above 0.9, then the similarity relation is drawn
3. *strong check*: comparison of the record titles with Levenshtein distance. If the measure is above 0.99, then the similarity relation is drawn
### Software
<span className="todo">TODO</span>
<p align="center">
<img loading="lazy" alt="Software Decision Tree" src={require('../../assets/img/decisiontree-software.png').default} width="85%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
### Other types of research products
<span className="todo">TODO</span>
[//]: # (Link to the image: https://docs.google.com/drawings/d/19gd1-GTOEEo6awMObGRkYFhpAlO_38mfbDFFX0HAkuo/edit?usp=sharing)
#### Datasets and Other types of research products
For each pair of datasets or other types of research products in a cluster the strategy depicted in the figure below is applied.
The decision tree is almost identical to the publication decision tree, with the only exception of the *instance type check* stage. Since these types of records do not have a relatable instance type, the check is not performed and the decision tree node is skipped.
<p align="center">
<img loading="lazy" alt="Dataset and Other types of research products Decision Tree" src={require('../../assets/img/decisiontree-dataset-orp.png').default} width="90%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
</p>
[//]: # (Link to the image: https://docs.google.com/drawings/d/1uBa7Bw2KwBRDUYIfyRr_Keol7UOeyvMNN7MPXYLg4qw/edit?usp=sharing)
### Duplicates grouping (transitive closure)
The general concept is that the field coming from the record with higher "trust" value is used as reference for the field of the representative record.
The IDs of the representative records are obtained by appending the prefix ``dedup_`` to the MD5 of the first ID (given their lexicographical ordering). If the group of merged records contains a trusted ID (i.e. the DOI), also the ``doi`` keyword is added to the prefix.


@ -0,0 +1,30 @@
---
sidebar_position: 3
---
# Extraction of acknowledged concepts
***Short description:*** Scans the plaintexts of publications for acknowledged concepts, including grant identifiers (projects) of funders, accession numbers of bioentities, EPO patent mentions, as well as custom concepts that can link research objects to specific research communities and initiatives in OpenAIRE.
***Algorithmic details:***
The algorithm processes the publication's fulltext and extracts references to acknowledged concepts. It applies pattern matching and string join between the fulltext and a target database which contains the title, the acronym and the identifier of the searched concept.
***Parameters:***
Concept titles, acronyms, and identifiers, publication's identifiers and fulltexts
***Limitations:*** -
***Environment:***
Python, [madIS](https://github.com/madgik/madis), [APSW](https://github.com/rogerbinns/apsw)
***References:***
* Foufoulas, Y., Zacharia, E., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2022). DETEXA: Declarative Extensible Text Exploration and Analysis. In: , et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. [doi:10.1007/978-3-031-16802-4_9](https://doi.org/10.1007/978-3-031-16802-4_9)
***Authority:*** ATHENA RC &bull; ***License:*** CC-BY/CC-0 &bull; ***Code:*** [iis/referenceextraction](https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction)


@ -0,0 +1,57 @@
---
sidebar_position: 1
---
# Affiliation matching
***Short description:*** The goal of the affiliation matching module is to match affiliations extracted from the pdf and xml documents with organizations from the OpenAIRE organization database.
***Algorithmic details:***
*The buckets concept*
In order to get the best possible results, the algorithm should compare every affiliation with every organization. However, this approach would be very inefficient and slow, because it would involve the processing of the cartesian product (all possible pairs) of millions of affiliations and thousands of organizations. To avoid this, IIS has introduced the concept of buckets. A bucket is a smaller group of affiliations and organizations that have been selected to be matched with one another. The matching algorithm compares only these affiliations and organizations that belong to the same bucket.
*Affiliation matching process*
Every affiliation in a given *bucket* is compared with every organization in the same bucket multiple times, each time by using a different algorithm (*voter*). Each *voter* is assigned a number (match strength) that describes the estimated correctness of the result of its comparison. All the affiliation-organization pairs that have been matched by at least one *voter*, will be assigned the match strength > 0 (the actual number depends on the voters, its calculation method will be shown later).
It is very important for the algorithm to group the affiliations and organizations properly, i.e. the ones that have a chance to match should be in the same *bucket*. To guarantee this, the affiliation matching module allows different methods of dividing the affiliations and organizations into *buckets* to be defined, and all of these methods to be used in a single matching process. The specific method of grouping the affiliations and organizations into *buckets* and then joining them into pairs is carried out by a service called *Joiner*.
Every *joiner* can be linked with many different *voters* that will tell if the affiliation-organization pairs joined match or not. By providing new *joiners* and *voters* one can extend the matching algorithm with countless new methods for matching affiliations with organizations, thus adjusting the algorithm to his or her needs.
All the affiliations and organizations are sequentially computed by all the *matchers*. In every *matcher* they are grouped by some *joiner* in pairs, and then these pairs are processed by all the *voters* in the *matcher*. Every affiliation-organization pair that has been matched at least once is assigned the match strength that depends on the match strengths of the *voters* that indicated that the given pair is a match.
**NOTE:** There can be many organizations matched with a given affiliation, each of them matched with a different match strength. The user of the module can set a match strength threshold which will limit the results to only those matches that have the match strength greater than the specified threshold.
*Calculation of the match strength of the affiliation-organization pair matched by multiple matchers*
It often happens that the given affiliation-organization pair is returned as a match by more than one matcher, each time with a different match strength. In such a case **the match with the highest match strength will be selected**.
*Calculation of the match strength of the affiliation-organization pair within a single matcher*
Every voter has a match strength that is in the range (0, 1]. **The voter match strength says what the quotient of correct matches to all matches guessed by this voter is, and is based on real data and hundreds of matches prepared by hand.**
The match strength of the given affiliation-organization pair is based on the match strengths of all the voters in the matcher that have indicated that the pair is a match. It will always be less than or equal to 1 and greater than the match strength of each single voter that matched the given pair.
The total match strength is calculated in such a way that each consecutive voter reduces (by its match strength) the gap of uncertainty about the correctness of the given match.
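
One formula consistent with this description is `1 - (1 - s1) * (1 - s2) * ...`, where each `s` is the match strength of a voter that matched the pair. The Python sketch below illustrates it together with the "highest match strength wins" rule across matchers; it is an interpretation of the text above, and the function names are hypothetical.

```python
from math import prod


def matcher_strength(voter_strengths):
    """Combine the strengths of the voters that matched a pair.

    Each voter shrinks the remaining uncertainty (initially 1)
    by a factor of (1 - its match strength).
    """
    return 1.0 - prod(1.0 - s for s in voter_strengths)


def pair_strength(strengths_by_matcher):
    """Keep the highest match strength obtained across all matchers."""
    return max((matcher_strength(v) for v in strengths_by_matcher if v), default=0.0)


print(round(matcher_strength([0.8, 0.5]), 6))
# 0.9
print(round(pair_strength([[0.8, 0.5], [0.95]]), 6))
# 0.95
```

With two voters of strength 0.8 and 0.5 the combined value is 0.9, which is above either single strength and still at most 1, as stated above.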
***Parameters:***
* input
* input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. Document metadata is the source of affiliations.
* input_organizations: [Organization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/Organization.avdl) avro datastore location.
* input_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/DocumentToProject.avdl) avro datastore location with **imported** document-to-project relations. These relations (together with inferred document-project and project-organization relations) are used to generate document-organization pairs which are used as a hint for matching affiliations.
* input_inferred_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/referenceextraction/project/DocumentToProject.avdl) avro datastore location with **inferred** document-to-project relations.
* input_project_to_organization: [ProjectToOrganization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/ProjectToOrganization.avdl) avro datastore location. These relations (together with imported and inferred document-project relations) are used to generate document-organization pairs which are used as a hint for matching affiliations.
* output
* [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with publications matched to organizations.
***Limitations:*** -
***Environment:***
Java, Spark
***References:*** -
***Authority:*** ICM &bull; ***License:*** AGPL-3.0 &bull; ***Code:*** [CoAnSys/affiliation-organization-matching](https://github.com/CeON/CoAnSys/tree/master/affiliation-organization-matching)

@ -0,0 +1,41 @@
# Citation matching
***Short description:*** During a citation matching task, bibliographic entries are linked to the documents that they reference. The citation matching module - one of the modules of the Information Inference Service (IIS) - receives as input a list of documents accompanied by their metadata and bibliography. Among them, it discovers the links described above and returns them as a list. In this document we shall evaluate whether the module has been properly integrated with the whole
system and assess the accuracy of the algorithm used. It is worth mentioning that the implemented algorithm has been described in detail in arXiv:1303.6906 [cs.IR]. However, in the referenced paper the algorithm was tested on small datasets, whereas here we focus on larger datasets, which are expected to be analysed by the system in the production environment.
***Algorithmic details:***
*General description*
The algorithm used in the citation matching task consists of two phases. In the first one, for each citation string a set of potentially matching documents is retrieved using a heuristic. In the second one, the metadata of these documents is analysed in order to assess which of them is the most similar to the given citation. We assume that citations are parsed, i.e. fragments containing meaningful pieces of metadata information are marked in a special way. Note that in the IIS system, the citation parsing step is executed by another module. The following metadata fields are used by the described solution:
* an author,
* a title,
* a journal name,
* pages,
* a year of publication.
*Heuristic matching*
The heuristic is based on indexing document metadata by author names. For each citation we extract author names and try to find documents in the index which have the same author entries. As spelling errors and inaccuracies commonly occur in citations, we have implemented an approximate index which enables retrieval of entries within an edit distance of at most 1.
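An approximate index of this kind can be sketched in Python as follows. This is an illustration only, not the module's implementation: it assumes Levenshtein edit distance, a simple deletion-neighbourhood lookup, and hypothetical class and method names.

```python
from collections import defaultdict


def _variants(name):
    """The name itself plus every string obtained by deleting one character."""
    return {name} | {name[:i] + name[i + 1:] for i in range(len(name))}


def within_one_edit(a, b):
    """True if the Levenshtein distance between a and b is at most 1."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equally long) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # single substitution
    return a[i:] == b[i + 1:]  # single insertion into the longer string


class ApproximateAuthorIndex:
    """Maps author names to document ids, tolerating one spelling error."""

    def __init__(self):
        self._buckets = defaultdict(set)  # variant -> {(author name, doc id)}

    def add(self, doc_id, author_names):
        for name in author_names:
            key = name.lower()
            for variant in _variants(key):
                self._buckets[variant].add((key, doc_id))

    def lookup(self, cited_author):
        query = cited_author.lower()
        hits = set()
        for variant in _variants(query):
            for name, doc_id in self._buckets.get(variant, ()):
                # Shared variants are only candidates; verify the distance.
                if within_one_edit(query, name):
                    hits.add(doc_id)
        return hits
```

Indexing the deletion variants keeps lookups cheap: a query only inspects buckets that share a variant with it, and the explicit distance check removes false candidates such as transposed letters.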
*Strict matching*
In this step, all the potentially matching pairs obtained in the heuristic step are evaluated and only the most probable ones are returned as the final result. As citations tend to contain spelling errors and differ in style, there is a need to introduce fuzzy similarity measures fitted to the specifics of various metadata fields. Most of them compute the fraction of tokens or trigrams that occur in both fields being compared. When comparing journal
names, we have taken the longest common subsequence (LCS) of the two strings into consideration. This can be seen as an instance of the assignment problem with some refinements added. The overall similarity of two citation strings is obtained by applying a linear Support Vector Machine (SVM) using field similarities as features.
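The field similarities mentioned above could look roughly like the following Python sketch. The normalisation by the larger of the two sets or strings is an assumption made for illustration, and the function names are hypothetical; the trained SVM that combines the features is not reproduced here.

```python
def trigrams(text):
    """Set of character trigrams of a lower-cased string."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def shared_fraction(a_items, b_items):
    """Fraction of items (tokens or trigrams) that occur in both fields."""
    a, b = set(a_items), set(b_items)
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_similarity(a, b):
    """LCS length relative to the longer string, e.g. for journal names."""
    if not a or not b:
        return 0.0
    return lcs_length(a.lower(), b.lower()) / max(len(a), len(b))
```

For instance, `shared_fraction(trigrams(title_a), trigrams(title_b))` would give a title feature and `lcs_similarity(journal_a, journal_b)` a journal feature, which a linear classifier can then weigh.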
***Parameters:***
* input:
* input_metadata: [ExtractedDocumentMetadataMergedWithOriginal](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/transformers/metadatamerger/ExtractedDocumentMetadataMergedWithOriginal.avdl) avro datastore location with the metadata of both publications and bibliographic references to be matched
* input_matched_citations: [Citation](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/common/citations/Citation.avdl) avro datastore location with citations which were already matched and should be excluded from fuzzy matching
* output: [Citation](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/common/citations/Citation.avdl) avro datastore location with matched publications
***Limitations:*** -
***Environment:***
Java, Spark
***References:*** -
***Authority:*** ICM &bull; ***License:*** AGPL-3.0 &bull; ***Code:*** [CoAnSys/citation-matching](https://github.com/CeON/CoAnSys/tree/master/citation-matching)

@ -0,0 +1,23 @@
---
sidebar_position: 4
---
# Extraction of cited concepts
***Short description:*** Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs.
***Algorithmic details:***
The algorithm extracts citations to specific datasets and software. It extracts the citation section of a publication's fulltext and applies string matching against a target database which includes an inverted index with dataset/software titles, URLs and other metadata.
***Parameters:***
Title, URL, creator names, publisher names and publication year of each concept, used to create the target database. Identifier and fulltext of each publication, from which the cited concepts are extracted.
***Limitations:*** -
***Environment:***
Python, [madIS](https://github.com/madgik/madis), [APSW](https://github.com/rogerbinns/apsw)
***References:***
* Foufoulas Y., Stamatogiannakis L., Dimitropoulos H., Ioannidis Y. (2017) “High-Pass Text Filtering for Citation Matching”. In: Kamps J., Tsakonas G., Manolopoulos Y., Iliadis L., Karydis I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham. [doi:10.1007/978-3-319-67008-9_28](https://doi.org/10.1007/978-3-319-67008-9_28)
***Authority:*** ATHENA RC &bull; ***License:*** CC-BY/CC-0 &bull; ***Code:*** [iis/referenceextraction](https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction)

@ -0,0 +1,22 @@
---
sidebar_position: 5
---
# Classifiers
***Short description:*** A document classification algorithm that employs analysis of free text stemming from the abstracts of the publications. The purpose of applying a document classification module is to assign a scientific text to one or more predefined content classes.
***Algorithmic details:***
The algorithm classifies publications' fulltexts using a Bayesian classifier and weighted terms according to an offline training phase. The training has been done using the following taxonomies: arXiv, MeSH (Medical Subject Headings), ACM, and DDC (Dewey Decimal Classification, or Dewey Decimal System).
***Parameters:*** Publication's identifier and fulltext
***Limitations:*** -
***Environment:***
Python, [madIS](https://github.com/madgik/madis), [APSW](https://github.com/rogerbinns/apsw)
***References:***
* Giannakopoulos, T., Stamatogiannakis, E., Foufoulas, I., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2014). Content Visualization of Scientific Corpora Using an Extensible Relational Database Implementation. In: Bolikowski, Ł., Casarosa, V., Goodale, P., Houssos, N., Manghi, P., Schirrwagen, J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham. [doi:10.1007/978-3-319-08425-1_10](https://doi.org/10.1007/978-3-319-08425-1_10)
***Authority:*** ATHENA RC &bull; ***License:*** CC-BY/CC-0 &bull; ***Code:*** [iis/referenceextraction](https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction)

@ -0,0 +1,48 @@
# Documents similarity
***Short description:*** The document similarity module is responsible for finding similar documents among the ones available in the OpenAIRE Information Space. It produces "similarity" links between the documents stored in the OpenAIRE Information Space. Each link is assigned a similarity score in the [0, 1] range; it is expected that the higher the score, the more similar the documents are with respect to their content.
***Algorithmic details:***
The similarity between two documents is expressed as the similarity between weights of their common terms (i.e., words being reduced to their root form) within a context of all terms from the first and the second document. In this approach, the computation can be divided into three consecutive steps:
1. selection of proper terms,
2. calculation of weights of terms for each document,
3. calculation of a given similarity function on weights of terms corresponding to each pair of documents.
 
The document similarity module uses the term frequency-inverse document frequency (TFIDF) measure to produce weights for terms and the cosine similarity to calculate the similarity between the resulting weight vectors.
*Steps of execution*
Computation of similarity between documents is executed in the following steps.
1. First, we create a text representation of each document. The text is a concatenation of 3 attributes of document object coming from Information Space: title, abstract, and keywords.
2. The text representation of each document is split into words. Next, stop words, words which occur in more than N percent of documents (say 99%) and those occurring in fewer than M documents (say 5) are discarded, as we assume that they carry no important information.
3. Next, the words are stemmed (reduced to their root form) and thus converted to terms. The importance of each term in each document is calculated using the TFIDF measure (resulting in a vector of weights of terms for each document). Only the top P (say 20) important terms per document remain for further computations.
4. In order to calculate the cosine similarity value for the documents, we execute the following steps.
a. Triples [document id, term, term weight] are grouped by a common term and for each pair of triples from the group, term importance is recalculated as the multiplication of terms weights, producing quads [document id 1, document id 2, term, multiplied term weight].
b. Quads are grouped by [document id 1, document id 2] and the values of the multiplied term weight are summed up, resulting in the creation of triples [document id 1, document id 2, total common weight].
c. Finally, triples are normalized by the product of the norms of the two documents' term-weight vectors. The normalized value is the final similarity measure, with a value between 0 and 1.
5. For a given document, only the top R (say 20) links to similar documents are returned. The links that are thrown away are assumed to be uninteresting for the end-user and thus storing them would only needlessly take disk space.
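The steps above can be sketched in plain Python as follows. This is a simplified illustration, not the Pig workflow itself: documents are assumed to be already tokenised, filtered and stemmed, the TFIDF variant (relative term frequency times the natural log of inverse document frequency) is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tfidf_vectors(docs, top_terms=20):
    """docs: {doc id: list of terms}. Returns {doc id: {term: weight}}."""
    n = len(docs)
    doc_freq = Counter(t for terms in docs.values() for t in set(terms))
    vectors = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        weights = {t: (c / len(terms)) * math.log(n / doc_freq[t])
                   for t, c in tf.items()}
        # Step 3: keep only the most important terms of each document.
        top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
        vectors[doc_id] = dict(top[:top_terms])
    return vectors


def cosine_similarities(vectors):
    """Returns {(doc id 1, doc id 2): similarity} for pairs sharing a term."""
    # Step 4a: group [document id, term, weight] triples by term.
    by_term = defaultdict(list)
    for doc_id, weights in vectors.items():
        for term, w in weights.items():
            by_term[term].append((doc_id, w))
    # Steps 4a-4b: multiply the weights of shared terms, sum them per pair.
    dot = defaultdict(float)
    for postings in by_term.values():
        for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
            dot[(d1, d2)] += w1 * w2
    # Step 4c: normalise by the product of the vector norms.
    norm = {d: math.sqrt(sum(w * w for w in ws.values()))
            for d, ws in vectors.items()}
    return {(d1, d2): value / (norm[d1] * norm[d2])
            for (d1, d2), value in dot.items() if norm[d1] and norm[d2]}
```

Grouping by term first means that only document pairs sharing at least one term are ever compared, which is what makes the computation tractable on a large corpus. Keeping the top R links per document (step 5) would be a final sort over the returned dictionary.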
***Parameters:***
* input:
* input_document: [DocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/documentssimilarity/DocumentMetadata.avdl) avro datastore location
* parallel: sets parameter parallel for Pig actions (default=80)
* mapredChildJavaOpts: mapreduce's map and reduce child java opts set to all PIG actions (default=Xmx12g)
* tfidfTopnTermPerDocument: number of the most important terms taken into account (default=20)
* similarityTopnDocumentPerDocument: maximum number of similar documents for each publication (default=20)
* removal_rate: removal rate (default=0.99)
* removal_least_used: removal of the least used terms (default=20)
* threshold_num_of_vector_elems_length: vector elements length threshold, when set to less than 2 all documents will be included in similarity matching (default=2)
* output: [DocumentSimilarity](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/documentssimilarity/DocumentSimilarity.avdl) avro datastore location
***Limitations:*** -
***Environment:***
Pig, Java
***References:***
* P. J. Dendek, A. Czeczko, M. Fedoryszak, A. Kawa, and L. Bolikowski, "Content Analysis of Scientific Articles in Apache Hadoop Ecosystem", Stud. Comp.Intelligence, vol. 541, 2014.
***Authority:*** ICM &bull; ***License:*** AGPL-3.0 &bull; ***Code:*** [CoAnSys/document-similarity](https://github.com/CeON/CoAnSys/tree/master/document-similarity)

Binary file not shown.

@ -0,0 +1,36 @@
# Metadata extraction
***Short description:*** The metadata extraction algorithm is responsible for plaintext and metadata extraction out of PDF documents. It is based on the [CERMINE](http://cermine.ceon.pl/about.html) project.
CERMINE is a comprehensive open source system for extracting metadata and content from scientific articles in born-digital form. The system is able to process documents in PDF format and extracts:
* document's metadata, including title, authors, affiliations, abstract, keywords, journal name, volume and issue,
* parsed bibliographic references
* the structure of document's sections, section titles and paragraphs
CERMINE is based on a modular workflow, whose architecture ensures that individual workflow steps can be maintained separately. As a result it is easy to evaluate, train, improve or replace the implementation of one step without changing other parts of the workflow. Most step implementations utilize supervised and unsupervised machine-learning techniques, which increases the maintainability of the system, as well as its ability to adapt to new document layouts.
***Algorithmic details:***
CERMINE workflow is composed of four main parts:
* Basic structure extraction takes a PDF file on the input and produces a geometric hierarchical structure representing the document. The structure is composed of pages, zones, lines, words and characters. The reading order of all elements is determined. Every zone is labelled with one of four general categories: METADATA, REFERENCES, BODY and OTHER.
* Metadata extraction part analyses parts of the geometric hierarchical structure labelled as METADATA and extracts a rich set of document's metadata from it.
* References extraction part analyses parts of the geometric hierarchical structure labelled as REFERENCES and the result is a list of document's parsed bibliographic references.
* Text extraction part analyses parts of the geometric hierarchical structure labelled as BODY and extracts document's body structure composed of sections, subsections and paragraphs.
CERMINE uses supervised and unsupervised machine-learning techniques, such as Support Vector Machines, K-means clustering and Conditional Random Fields. Content classifiers are trained on the [GROTOAP2 dataset](http://cermine.ceon.pl/grotoap2/). More information about CERMINE can be found in the [presentation](http://cermine.ceon.pl/static/docs/slides.pdf).
***Parameters:***
* input: [DocumentText](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/DocumentText.avdl) avro datastore location
* output: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location
***Limitations:***
Only born-digital PDF documents are supported. Large PDF documents may require more than the 4g of memory assigned by default.
***Environment:***
Java, Hadoop
***References:***
* Dominika Tkaczyk, Pawel Szostek, Mateusz Fedoryszak, Piotr Jan Dendek and Lukasz Bolikowski. CERMINE: automatic extraction of structured metadata from scientific literature. In International Journal on Document Analysis and Recognition, 2015, vol. 18, no. 4, pp. 317-335, doi: 10.1007/s10032-015-0249-8.
***Authority:*** ICM &bull; ***License:*** AGPL-3.0 &bull; ***Code:*** [CERMINE](https://github.com/CeON/CERMINE)

@ -1,44 +0,0 @@
# Enrichment
## Mining
The OpenAIRE Research Graph is enriched by links mined by OpenAIRE's full-text mining algorithms that scan the plaintexts of publications for funding information, references to datasets, software URIs, accession numbers of bio-entities, and EPO patent mentions. Custom mining modules also link research objects to specific research communities, initiatives and infrastructures. In addition, other inference modules provide content-based document classification, document similarity, citation matching, and author affiliation matching.
**Project mining** in OpenAIRE text mines the full-texts of publications in order to extract matches to funding project codes/IDs. The mining algorithm works by utilising (i) the grant identifier, and (ii) the project acronym (if available) of each project. The mining algorithm: (1) preprocesses/normalizes the full-texts using several functions, which depend on the characteristics of each funder (i.e., the format of the grant identifiers), such as stopword and/or punctuation removal, tokenization, stemming, converting to lowercase; then (2) matches grant identifiers against the normalized text using database techniques; and (3) validates and cleans the results by looking at the context around the matched ID for relevant metadata and positive or negative words/phrases, in order to calculate a confidence value for each publication-to-project link. A confidence threshold is set to keep accuracy high while minimising false positives, such as matches with page or report numbers, post/zip codes, parts of telephone numbers, DOIs or URLs, and accession numbers. The algorithm also applies rules for disambiguating results, as different funders can share identical project IDs; for example, grant number 633172 could refer to H2020 project EuroMix but also to Australian-funded NHMRC project “Brain activity (EEG) analysis and brain imaging techniques to measure the neurobiological effects of sleep apnea”. Project mining was the first Text & Data Mining (TDM) service of OpenAIRE. Performance results vary from funder to funder, but precision is higher than 98% for all funders and 99.5% for EC projects. Recall is higher than 95% (99% for EC projects) when projects are properly acknowledged using project/grant IDs.
**Dataset extraction** runs on publications' full-texts as described in “High pass text-filtering for Citation matching”, TPDL 2017[1]. In particular, we search for citations to datasets using their DOIs, titles and other metadata (i.e., dates, creator names, publishers, etc.). We extract parts of the text which look like citations and search for datasets using database join and pattern matching techniques. Based on the experiments described in the paper, the precision of the dataset extraction module is 98.5% and its recall is 97.4%. Both were calculated on the extracted full-texts of small samples from PubMed and arXiv, and the recall is probably overestimated since it does not take into account corruptions that may take place during PDF-to-text extraction.
**Software extraction** also runs on parts of the text which look like citations. We search the citations for links to software in open software repositories, specifically GitHub, SourceForge, Bitbucket and the Google Code Archive. After that, we search for links that are included in Software Heritage (SH, https://www.softwareheritage.org) and return the permanent URL that SH provides for each software project. We also enrich this content with user names, titles and descriptions of the software projects using web mining techniques. Since software mining is based on URL matching, our precision is 100% (we return a software link only if we find it in the text and there is no need to disambiguate). Recall cannot be calculated for this mining task: although we apply all the necessary normalizations to the URLs in order to overcome usual issues (e.g., http or https, existence of www or not, lower/upper case), we do not count cases where software is mentioned by its name rather than by a link to one of the supported software repositories.
**For the extraction of bio-entities**, we focus on Protein Data Bank (PDB) entries. We have downloaded the database with PDB codes and we update it regularly. We search through the whole full-text of each publication for references to PDB codes. We apply disambiguation rules (e.g., some PDB codes coincide with antibody codes) so that we return valid results. Current precision is 98%. Although it is risky to mention recall rates since these are usually overestimated, we have calculated a recall rate of 98% using small samples of PubMed publications. Moreover, our technique is able to identify about 30% more links to proteins than the ones that are tagged in PubMed XMLs.
**Other text-mining modules** include mining for links to EPO patents, or custom mining modules for linking research objects to specific research communities, initiatives and infrastructures, e.g. the COVID-19 mining module. Apart from text-mining modules, OpenAIRE also provides a document classification service that employs analysis of free text stemming from the abstracts of the publications. The purpose of applying a document classification module is to assign a scientific text to one or more predefined content classes. In OpenAIRE, the currently used taxonomies are arXiv, MeSH (Medical Subject Headings), ACM and DDC (Dewey Decimal Classification, or Dewey Decimal System).
## Bulk Tagging/Deduction
The Deduction process (also known as “bulk tagging”) enriches each record with new information that can be derived from the existing property values.
As of September 2020, three procedures are in place to relate a research product to a research initiative, infrastructure (RI) or community (RC) based on:
* subjects (2.7M results tagged)
* Zenodo community (16K results tagged)
* the data source it comes from (250K results tagged)
The list of subjects, Zenodo communities and data sources used to enrich the products are defined by the managers of the community gateway or infrastructure monitoring dashboard associated with the RC/RI.
## Propagation
This process “propagates” properties and links from one product to another if between the two there is a “strong” semantic relationship.
As of September 2020, the following procedures are in place:
* Propagation of the property “country” to results from institutional repositories: e.g. publication collected from an institutional repository maintained by an Italian university will be enriched with the property “country = IT”.
* Propagation of links to projects: e.g. publication linked to project P “is supplemented by” a dataset D. Dataset D will get the link to project P. The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
* Propagation of related community/infrastructure/initiative from organizations to products via affiliation relationships: e.g. a publication with an author affiliated with organization O. The manager of the community gateway C declared that the outputs of O are all relevant for his/her community C. The publication is tagged as relevant for C.
* Propagation of related community/infrastructure/initiative to related products: e.g. publication associated to community C is supplemented by a dataset D. Dataset D will get the association to C. The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
* Propagation of ORCID identifiers to related products, if the products have the same authors: e.g. publication has ORCID for its authors and is supplemented by a dataset D. Dataset D has the same authors as the publication. Authors of D are enriched with the ORCIDs available in the publication. The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
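As an illustration of the second procedure above (links to projects), the following Python sketch propagates project links across “isSupplementedBy”/“supplements” pairs. It assumes a single, non-transitive hop applied in both directions; the data shapes and the function name are hypothetical.

```python
def propagate_project_links(project_links, supplement_pairs):
    """Copy project links between products joined by a supplement relation.

    project_links: {product id: set of project ids}
    supplement_pairs: iterable of (publication id, dataset id), meaning the
    publication "isSupplementedBy" the dataset (and the dataset "supplements"
    the publication).
    """
    enriched = {product: set(projects)
                for product, projects in project_links.items()}
    for publication, dataset in supplement_pairs:
        for source, target in ((publication, dataset), (dataset, publication)):
            # Only the original links are propagated (one hop, no chaining).
            enriched.setdefault(target, set()).update(
                project_links.get(source, ()))
    return enriched
```

With `project_links = {"pub": {"P"}}` and `supplement_pairs = [("pub", "D")]`, dataset `D` receives the link to project `P`, as in the example given in the list.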

@ -1,73 +0,0 @@
---
sidebar_position: 2
---
# Impact scores
<span className="todo">TODO - add intro</span>
## Citation Count (CC)
This is the most widely used scientific impact indicator, which sums all citations received by each article. The citation count of a
publication $i$ corresponds to the in-degree of the corresponding node in the underlying citation network: $s_i = \sum_{j} A_{i,j}$,
where $A$ is the adjacency matrix of the network (i.e., $A_{i,j}=1$ when paper $j$ cites paper $i$, while $A_{i,j}=0$ otherwise).
Citation count can be viewed as a measure of a publication's overall impact, since it conveys the number of other works that directly
drew on it.
## "Incubation" Citation Count (iCC)
This measure is essentially a time-restricted version of the citation count, where the time window is distinct for each paper, i.e.,
only citations received within $y$ years of its publication are counted (usually, $y=3$). The "incubation" citation count of a paper $i$ is
calculated as: $s_i = \sum_{j,t_j \leq t_i+y} A_{i,j}$, where $A$ is the adjacency matrix and $t_j, t_i$ are the citing and cited paper's
publication years, respectively. iCC can be seen as an indicator of a paper's initial momentum
(impulse) directly after its publication.
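Both counts can be computed directly from a list of citation edges. The Python sketch below assumes citations are given as hypothetical `(citing, cited)` pairs and publication years as a dictionary; it only illustrates the two definitions.

```python
def citation_count(citations, paper):
    """Number of citations received by `paper` (its in-degree)."""
    return sum(1 for _, cited in citations if cited == paper)


def incubation_citation_count(citations, years, paper, window=3):
    """Citations received up to `window` years after the paper's publication."""
    limit = years[paper] + window
    return sum(1 for citing, cited in citations
               if cited == paper and years[citing] <= limit)
```

A paper published in 2015 and cited by papers from 2016, 2018 and 2021 has a citation count of 3 but an "incubation" citation count of 2 for a window of 3 years.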
## PageRank (PR)
Originally developed to rank Web pages, PageRank has also been widely used to rank publications in citation
networks. In this latter context, a publication's PageRank
score also serves as a measure of its influence. In particular, the PageRank score of a publication is calculated
as its probability of being read by a researcher that either randomly selects publications to read or selects
publications based on the references of her latest read. Formally, the score of a publication $i$ is given by:
$$
s_i = \alpha \cdot \sum_{j} P_{i,j} \cdot s_j + (1-\alpha) \cdot \frac{1}{N}
$$
where $P$ is the stochastic transition matrix, which corresponds to the column normalised version of adjacency
matrix $A$, $\alpha \in [0,1]$, and $N$ is the number of publications in the citation network. The first addend
of the equation corresponds to the selection (with probability $\alpha$) of following a reference, while the
second one to the selection of randomly choosing any publication in the network. It should be noted that the
score of each publication relies on the scores of the publications citing it (the algorithm is executed iteratively
until all scores converge). As a result, PageRank differentiates citations based on the importance of citing
articles, thus alleviating the corresponding issue of the Citation Count.
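The iteration can be sketched in plain Python as below. The default $\alpha = 0.5$, the convergence tolerance, and the uniform redistribution of the score of publications without references are assumptions made so that the example runs; they are not taken from this page.

```python
def pagerank(references, alpha=0.5, tol=1e-12, max_iter=1000):
    """references: {paper: list of papers it cites}. Returns {paper: score}."""
    papers = sorted(set(references)
                    | {c for refs in references.values() for c in refs})
    n = len(papers)
    score = {p: 1.0 / n for p in papers}
    for _ in range(max_iter):
        # Assumption: papers without references spread their score uniformly.
        dangling = sum(score[p] for p in papers if not references.get(p))
        new = {p: (1 - alpha) / n + alpha * dangling / n for p in papers}
        for citing, refs in references.items():
            for cited in refs:
                # P[cited, citing] = 1 / (number of references of citing)
                new[cited] += alpha * score[citing] / len(refs)
        delta = sum(abs(new[p] - score[p]) for p in papers)
        score = new
        if delta < tol:
            break
    return score
```

Each pass applies the equation once; the loop stops when the scores no longer change, which corresponds to the iterative execution until convergence mentioned above.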
## RAM
RAM is essentially a modified Citation Count, where recent citations are considered of higher importance compared
to older ones. Hence, it better captures the popularity of publications. This "time-awareness" of citations
alleviates the bias of methods like Citation Count and PageRank against recently published articles, which have
not had "enough" time to gather as many citations. The RAM score of each paper $i$ is calculated as follows:
$$
s_i = \sum_j{R_{i,j}}
$$
where $R$ is the so-called Retained Adjacency Matrix (RAM) and $R_{i,j}=\gamma^{t_c-t_j}$ when publication $j$ cites publication
$i$, and $R_{i,j}=0$ otherwise. Parameter $\gamma \in (0,1)$, $t_c$ corresponds to the current year and $t_j$ corresponds to the
publication year of citing article $j$.
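A minimal Python sketch of the RAM score follows. The `(citing, cited)` edge list, the year dictionary and the default $\gamma = 0.5$ are illustrative assumptions.

```python
def ram_scores(citations, years, current_year, gamma=0.5):
    """Sum gamma ** (t_c - t_j) over the citing papers j of each cited paper."""
    scores = {}
    for citing, cited in citations:
        weight = gamma ** (current_year - years[citing])
        scores[cited] = scores.get(cited, 0.0) + weight
    return scores
```

With $\gamma = 0.5$ and the current year 2022, a citation made in 2022 contributes 1 while a citation made in 2020 contributes only 0.25, so recent attention dominates the score.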
## AttRank
AttRank is a PageRank variant that alleviates its bias against recent publications (i.e., it is tailored to capture popularity).
AttRank achieves this by modifying PageRank's probability of randomly selecting a publication. Instead of using a uniform probability,
AttRank defines it based on a combination of the publication's age and the citations it received in recent years. The AttRank score
of each publication $i$ is calculated based on:
$$
s_i = \alpha \cdot \sum_{j} P_{i,j} \cdot s_j
+ \beta \cdot Att(i)+ \gamma \cdot c \cdot e^{-\rho \cdot (t_c-t_i)}
$$
where $\alpha + \beta + \gamma =1$ and $\alpha,\beta,\gamma \in [0,1]$. $Att(i)$ denotes a recent attention-based score for publication $i$,
which reflects its share of citations in the $y$ most recent years, $t_i$ is the publication year of article $i$, $t_c$ denotes the current
year, and $c$ is a normalisation constant. Finally, $P$ is the stochastic transition matrix.
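The AttRank equation can be iterated in the same way as PageRank. In the Python sketch below, the parameter defaults, the fixed number of iterations, the definition of $Att(i)$ as the share of citations made by papers published in the last `recent_years` years, and the uniform handling of publications without references are all assumptions chosen for illustration.

```python
import math


def attrank(references, years, current_year, alpha=0.2, beta=0.5, gamma=0.3,
            rho=0.16, recent_years=3, iterations=100):
    """references: {paper: list of cited papers}; years: {paper: year}."""
    papers = sorted(years)
    n = len(papers)
    # Att(i): share of the citations made by recently published papers.
    recent = {p: 0 for p in papers}
    for citing, refs in references.items():
        if current_year - years[citing] < recent_years:
            for cited in refs:
                recent[cited] += 1
    total = sum(recent.values())
    att = {p: (recent[p] / total if total else 1.0 / n) for p in papers}
    # Age term c * exp(-rho * (t_c - t_i)), with c normalising it to sum to 1.
    decay = {p: math.exp(-rho * (current_year - years[p])) for p in papers}
    c = 1.0 / sum(decay.values())
    score = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        # Assumption: papers without references spread their score uniformly.
        dangling = sum(score[p] for p in papers if not references.get(p))
        new = {p: alpha * dangling / n + beta * att[p] + gamma * c * decay[p]
               for p in papers}
        for citing, refs in references.items():
            for cited in refs:
                new[cited] += alpha * score[citing] / len(refs)
        score = new
    return score
```

Because $\alpha + \beta + \gamma = 1$ and both the attention and the age terms sum to 1 over all publications, the scores keep summing to 1 after every iteration.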

@ -1,6 +0,0 @@
---
sidebar_position: 1
---
# Mining algorithms
<span className="todo">TODO</span>

@ -0,0 +1,18 @@
# Finalisation
At the very end of the graph production workflow, a step is dedicated to performing certain finalisation operations, described on this page,
which aim to improve the overall quality of the data.
The output of this final step is the final version of the OpenAIRE Research Graph.
## Filtering
Bibliographic records that do not meet minimal requirements for being part of the OpenAIRE Research Graph are eliminated during this phase.
Currently, the only criterion applied horizontally to the entire graph aims at excluding scientific results whose title is not meaningful for citation purposes.
Then, different criteria are applied in the pre-processing of specific sub-collections:
* [Crossref filtering](/data-provision/aggregation/non-compatible-sources/doiboost#crossref-filtering)
## Country cleaning
This phase is responsible for removing the country information from result records that match specific criteria. The need for this phase is driven by the fact that some datasources, although regarded as being of national pertinence, contain material that is not always related to the given country.

@ -1,13 +1,17 @@
---
sidebar_position: 5
---
# Indexing
The final version of the OpenAIRE Research Graph is indexed on a Solr server that is used by the OpenAIRE portals (EXPLORE, CONNECT, PROVIDE) and APIs, the latter adopted by several third-party applications and organizations, such as:
The final version of the OpenAIRE Research Graph is indexed on a Solr server that is used by the OpenAIRE portals ([EXPLORE](https://explore.openaire.eu), [CONNECT](https://connect.openaire.eu), [PROVIDE](https://provide.openaire.eu)) and APIs, the latter adopted by several third-party applications and organizations, such as:
* EOSC --The OpenAIRE Research Graph APIs and Portals will offer to the EOSC an Open Science Resource Catalogue, keeping an up to date map of all research results (publications, datasets, software), services, organizations, projects, funders in Europe and beyond.
* The OpenAIRE Graph APIs and Portals will offer to the EOSC (European Open Science Cloud) an Open Science Resource Catalogue, keeping an up to date map of all research results (publications, datasets, software), services, organizations, projects, funders in Europe and beyond.
* DSpace & EPrints repositories can install the OpenAIRE plugin to expose OpenAIRE compliant metadata records via their OAI-PMH endpoint and offer to researchers the possibility to link their depositions to the funding project, by selecting it from the list of project provided by OpenAIRE
* DSpace & EPrints repositories can install the OpenAIRE plugin to expose OpenAIRE compliant metadata records via their OAI-PMH endpoint and offer to researchers the possibility to link their depositions to the funding project, by selecting it from the list of projects provided by OpenAIRE.
* EC participant portal (Sygma - System for Grant Management) uses the OpenAIRE API in the “Continuous Reporting” section. Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in the OpenAIRE Research Graph that are linked to the project. The user can select the research products from the list and easily compile the continuous reporting data of the project.
* EC participant portal (Sygma - System for Grant Management) uses the OpenAIRE API in the “Continuous Reporting” section. Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in the OpenAIRE Research Graph that are linked to the project. The user can select the research products from the list and easily compile the continuous reporting data of the project.
* ScholExplorer is used by different players of the scholarly communication ecosystem. For example, [Elsevier](https://www.elsevier.com/authors/tools-and-resources/research-data/data-base-linking) uses its API to make the links between
publications and datasets automatically appear on ScienceDirect.
ScholExplorer indexes the links among the four major types of research products (API v3) available in the OpenAIRE Research Graph and makes them available through an HTTP API that allows
searching them by the following criteria:
* Links whose source object has a given PID or PID type;
* Links whose source object has been published by a given data source ("data source as publisher");
* Links that were collected from a given data source ("data source as provider").

View File

@ -0,0 +1,169 @@
# Impact indicators
This page summarises all calculated impact indicators, which are included in the [impactMeasures](/data-model/entities/other#impactmeasures) property which is part of the [indicators](/data-model/entities/result#indicators) property of the result.
It should be noted that the impact indicators are calculated at the level of the research output.
Below we explain their main intuition, the way they are calculated, and their most important limitations, in an attempt to help avoid common pitfalls and misuses.
## Citation Count (CC)
***Short description:***
This is the most widely used scientific impact indicator, which sums all citations received by each article.
Citation count can be viewed as a measure of a publication's overall impact, since it conveys the number of other works that directly
drew on it.
***Algorithmic details:***
The citation count of a
publication $i$ corresponds to the in-degree of the corresponding node in the underlying citation network: $s_i = \sum_{j} A_{i,j}$,
where $A$ is the adjacency matrix of the network (i.e., $A_{i,j}=1$ when paper $j$ cites paper $i$, while $A_{i,j}=0$ otherwise).
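The in-degree computation above can be illustrated with a minimal Python sketch. It is not the BIP! Ranker implementation (which runs on PySpark); the function name `citation_counts` and the toy identifiers `p1`, `p2`, `p3` are assumptions made only for this example.

```python
from collections import Counter


def citation_counts(citations):
    """Return the citation count (in-degree) of every cited paper.

    `citations` is an iterable of (citing, cited) pairs, i.e. the
    non-zero entries A[cited, citing] of the adjacency matrix.
    """
    # Deduplicate pairs so that each citing/cited pair contributes exactly 1,
    # mirroring the 0/1 entries of the adjacency matrix.
    return Counter(cited for _citing, cited in set(citations))


# Hypothetical toy network: p2 and p3 cite p1, p3 also cites p2.
toy_edges = [("p2", "p1"), ("p3", "p1"), ("p3", "p2")]
toy_counts = citation_counts(toy_edges)
print(toy_counts["p1"], toy_counts["p2"], toy_counts["p3"])
```

A `Counter` returns 0 for papers that were never cited, which matches an all-zero row of $A$.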
***Parameters:*** -
***Limitations:***
OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator.
Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
***Environment:*** PySpark
***References:*** -
***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
## "Incubation" Citation Count (iCC)
***Short description:***
This measure is essentially a time-restricted version of the citation count, where the time window is distinct for each paper, i.e.,
only citations received within $y$ years after its publication are counted.
***Algorithmic details:***
The "incubation" citation count of a paper $i$ is
calculated as: $s_i = \sum_{j,t_j \leq t_i+y} A_{i,j}$, where $A$ is the adjacency matrix and $t_j, t_i$ are the citing and cited paper's
publication years, respectively. iCC can be seen as an indicator of a paper's initial momentum
(impulse) directly after its publication.
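The time-restricted sum can be sketched in plain Python as follows. This is an illustrative sketch rather than the production PySpark code; the function name, the `pub_year` dictionary and the toy data are assumptions. Citations involving a paper without a known publication year are skipped, in line with the limitation described for these indicators.

```python
def incubation_citation_counts(citations, pub_year, y=3):
    """Count, for each cited paper i, the citations from papers j
    with t_j <= t_i + y, where t is the publication year."""
    counts = {}
    for citing, cited in set(citations):
        # Skip pairs for which a publication year is missing.
        if citing not in pub_year or cited not in pub_year:
            continue
        if pub_year[citing] <= pub_year[cited] + y:
            counts[cited] = counts.get(cited, 0) + 1
    return counts


# Hypothetical toy data: p1 is cited in 2012 (inside the window) and 2015 (outside).
toy_years = {"p1": 2010, "p2": 2012, "p3": 2015}
toy_icc = incubation_citation_counts([("p2", "p1"), ("p3", "p1")], toy_years)
print(toy_icc)
```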
***Parameters:***
$y=3$
***Limitations:***
OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator.
Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
***Environment:*** PySpark
***References:***
* Vergoulis, T., Kanellos, I., Atzori, C., Mannocci, A., Chatzopoulos, S., Bruzzo, S. L., Manola, N., & Manghi, P. (2021, April). Bip! db: A dataset of impact measures for scientific publications. In Companion Proceedings of the Web Conference 2021 (pp. 456-460).
***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
## PageRank (PR)
***Short description:***
Originally developed to rank Web pages, PageRank has also been widely used to rank publications in citation
networks. In this latter context, a publication's PageRank
score also serves as a measure of its influence.
***Algorithmic details:***
The PageRank score of a publication is calculated
as its probability of being read by a researcher that either randomly selects publications to read or selects
publications based on the references of her latest read. Formally, the score of a publication $i$ is given by:
$$
s_i = \alpha \cdot \sum_{j} P_{i,j} \cdot s_j + (1-\alpha) \cdot \frac{1}{N}
$$
where $P$ is the stochastic transition matrix, which corresponds to the column normalised version of adjacency
matrix $A$, $\alpha \in [0,1]$, and $N$ is the number of publications in the citation network. The first addend
of the equation corresponds to the selection (with probability $\alpha$) of following a reference, while the
second one to the selection of randomly choosing any publication in the network. It should be noted that the
score of each publication relies on the scores of the publications citing it (the algorithm is executed iteratively
until all scores converge). As a result, PageRank differentiates citations based on the importance of citing
articles, thus alleviating the corresponding issue of the Citation Count.
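The iterative computation can be sketched with a small power-iteration loop in plain Python. This is a minimal sketch under stated assumptions, not the BIP! Ranker code: the function name and toy network are hypothetical, convergence is measured as the sum of absolute score changes, and publications without references (for which the column of $P$ is undefined) are assumed to spread their score uniformly over all publications.

```python
def pagerank(citations, alpha=0.5, tol=1e-12, max_iter=1000):
    """Power iteration for s_i = alpha * sum_j P[i, j] * s_j + (1 - alpha) / N.

    `citations` is an iterable of (citing, cited) pairs.
    """
    edges = set(citations)
    nodes = sorted({paper for edge in edges for paper in edge})
    if not nodes:
        return {}
    n = len(nodes)
    refs = {v: [] for v in nodes}  # citing paper -> papers it cites
    for citing, cited in edges:
        refs[citing].append(cited)

    scores = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Assumption: papers without references distribute their score uniformly.
        dangling = sum(scores[v] for v in nodes if not refs[v])
        new = {v: (1 - alpha) / n + alpha * dangling / n for v in nodes}
        for citing, cited_list in refs.items():
            if cited_list:
                # Column normalisation: split the score over all references.
                share = alpha * scores[citing] / len(cited_list)
                for cited in cited_list:
                    new[cited] += share
        delta = sum(abs(new[v] - scores[v]) for v in nodes)
        scores = new
        if delta < tol:
            break
    return scores


# Hypothetical toy network: p2 and p3 cite p1, p3 also cites p2.
toy_pr = pagerank([("p2", "p1"), ("p3", "p1"), ("p3", "p2")])
print(sorted(toy_pr, key=toy_pr.get, reverse=True))
```

With the uniform handling of reference-less papers, the scores keep summing to 1 across iterations, so they can be read as probabilities.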
***Parameters:***
$\alpha = 0.5, convergence\_error = 10^{-12}$
***Limitations:***
OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator.
Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
***Environment:*** PySpark
***References:***
* Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab.
***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
## RAM
***Short description:***
RAM is essentially a modified Citation Count, where recent citations are considered of higher importance compared to older ones.
Hence, it better captures the popularity of publications. This "time-awareness" of citations
alleviates the bias of methods like Citation Count and PageRank against recently published articles, which have
not had "enough" time to gather as many citations.
***Algorithmic details:***
The RAM score of each paper $i$ is calculated as follows:
$$
s_i = \sum_j{R_{i,j}}
$$
where $R$ is the so-called Retained Adjacency Matrix (RAM) and $R_{i,j}=\gamma^{t_c-t_j}$ when publication $j$ cites publication
$i$, and $R_{i,j}=0$ otherwise. Parameter $\gamma \in (0,1)$, $t_c$ corresponds to the current year and $t_j$ corresponds to the
publication year of citing article $j$.
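A minimal Python sketch of the RAM score follows. It only illustrates the formula above and is not the PySpark implementation; the function name, the `pub_year` dictionary and the toy data are assumptions, and citations from papers without a known publication year are skipped.

```python
def ram_scores(citations, pub_year, current_year, gamma=0.6):
    """Sum gamma ** (t_c - t_j) over all papers j citing paper i."""
    scores = {}
    for citing, cited in set(citations):
        if citing not in pub_year:
            continue  # the age of the citation cannot be determined
        weight = gamma ** (current_year - pub_year[citing])
        scores[cited] = scores.get(cited, 0.0) + weight
    return scores


# Hypothetical toy data: one citation from the current year, one from two years ago.
toy_ram = ram_scores(
    [("p2", "p1"), ("p3", "p1")],
    pub_year={"p2": 2023, "p3": 2021},
    current_year=2023,
)
print(toy_ram)
```

In this toy case the citation from the current year contributes $0.6^0 = 1$ and the two-year-old one contributes $0.6^2 = 0.36$.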
***Parameters:***
$\gamma = 0.6$
***Limitations:***
OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator.
Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
***Environment:*** PySpark
***References:***
* Ghosh, R., Kuo, T. T., Hsu, C. N., Lin, S. D., & Lerman, K. (2011, December). Time-aware ranking in dynamic citation networks. In 2011 IEEE 11th International Conference on Data Mining Workshops (pp. 373-380). IEEE.
***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)
## AttRank
***Short description:***
AttRank is a PageRank variant that alleviates its bias against recent publications (i.e., it is tailored to capture popularity).
AttRank achieves this by modifying PageRank's probability of randomly selecting a publication. Instead of using a uniform probability,
AttRank defines it based on a combination of the publication's age and the citations it received in recent years.
***Algorithmic details:***
The AttRank score
of each publication $i$ is calculated based on:
$$
s_i = \alpha \cdot \sum_{j} P_{i,j} \cdot s_j
+ \beta \cdot Att(i)+ \gamma \cdot c \cdot e^{-\rho \cdot (t_c-t_i)}
$$
where $\alpha + \beta + \gamma =1$ and $\alpha,\beta,\gamma \in [0,1]$. $Att(i)$ denotes a recent attention-based score for publication $i$,
which reflects its share of citations in the $y$ most recent years, $t_i$ is the publication year of article $i$, $t_c$ denotes the current
year, and $c$ is a normalisation constant. Finally, $P$ is the stochastic transition matrix.
***Parameters:***
$\alpha = 0.2, \beta = 0.5, \gamma = 0.3, \rho = -0.16, convergence\_error = 10^{-12}$
Note that recent attention is based on the 3 most recent years (including the current one).
***Limitations:***
OpenAIRE collects data from specific data sources which means that part of the existing literature may not be considered when computing this indicator.
Also, since some indicators require the publication year for their calculation, we consider only research products for which we can gather this information from at least one data source.
***Environment:*** PySpark
***References:***
* Kanellos, I., Vergoulis, T., Sacharidis, D., Dalamagas, T., & Vassiliou, Y. (2021, April). Ranking papers by their short-term scientific impact. In 2021 IEEE 37th International Conference on Data Engineering (ICDE) (pp. 1997-2002). IEEE.
***Authority:*** ATHENA RC &bull; ***License:*** GPL-2.0 &bull; ***Code:*** [BIP! Ranker](https://github.com/athenarc/Bip-Ranker)

View File

@ -0,0 +1,7 @@
# Usage Statistics Indicators
Usage Statistics indicators for research products, like publications, datasets, etc., are an important complement to other (traditional and alternative) bibliometric indicators, providing a comprehensive and recent view of the impact of such resources as well as of their authors, institutions and the platforms themselves. They take into account different levels of information: the usage of data sources, the usage of individual items in the context of their resource type and the usage of individual web resources or files.
Usage Statistics Indicators are built by OpenAIRE's UsageCounts Service. The service collects usage data and consolidated usage statistics reports from its distributed network of data providers (repositories, e-journals, CRIS) by utilizing open standards and protocols, and delivers reliable, consolidated and comparable usage metrics, like counts of item downloads and metadata views, conformant to the COUNTER Code of Practice.
You can find more information about the UsageCounts service [here](https://usagecounts.openaire.eu/).

View File

@ -0,0 +1,28 @@
# Merge by id
In the metadata aggregation system it is common to find the same record provided by
different datasources and, sometimes, even inside the same datasource (especially in
case of aggregators). As the harmonisation processes are performed per datasource
contents, the relative records are the output of different mapping implementations.
This approach has the advantage of being deeply customisable to capture datasource-specific
aspects, but it leaves room for inconsistencies when evaluating the different mappings
across the various datasources.
This phase is therefore responsible for compensating for such inconsistencies and performs
a global grouping of every record available in the graph:
- entities are grouped by [`id`](../data-model/entities/result#id)
- relations are grouped by [`source`, `target`, `reltype`](../data-model/relationships#the-relationship-object)
This ensures that the same record, possibly assigned to different types by different
mappings, appears only once in the graph and under a single typing. In case of clashing
identifiers, the properties are merged (including the provenance information), considering
the following precedence order for the result typing:
```
publication > dataset > software > other
```
The same holds for relationships: as the same (e.g.) DOI-to-DOI citation relation could
be aggregated from multiple sources, this grouping phase collapses all the
duplicates onto a single relation that still includes all the individual provenances.
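The grouping of entities by `id` with the typing precedence above can be sketched in plain Python. This is only an illustration, not the actual workflow code: the simplified record layout (`id`, `type`, `collectedfrom`), the function name and the sample identifier and source names are assumptions made for the example.

```python
TYPE_PRECEDENCE = ["publication", "dataset", "software", "other"]


def merge_by_id(records):
    """Group records by id, keep the highest-precedence type and
    accumulate the provenance (here: the `collectedfrom` sources)."""
    merged = {}
    for rec in records:
        current = merged.get(rec["id"])
        if current is None:
            merged[rec["id"]] = {
                "id": rec["id"],
                "type": rec["type"],
                "collectedfrom": list(rec["collectedfrom"]),
            }
            continue
        # A lower index in TYPE_PRECEDENCE means a higher precedence.
        if TYPE_PRECEDENCE.index(rec["type"]) < TYPE_PRECEDENCE.index(current["type"]):
            current["type"] = rec["type"]
        for source in rec["collectedfrom"]:
            if source not in current["collectedfrom"]:
                current["collectedfrom"].append(source)
    return list(merged.values())


# Hypothetical records: the same id typed differently by two mappings.
toy_records = [
    {"id": "50|example_____::0001", "type": "other", "collectedfrom": ["Source A"]},
    {"id": "50|example_____::0001", "type": "publication", "collectedfrom": ["Source B"]},
]
toy_merged = merge_by_id(toy_records)
print(toy_merged)
```

Relations could be collapsed in the same way by using the tuple (`source`, `target`, `reltype`) as the grouping key.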

View File

@ -1,9 +0,0 @@
---
sidebar_position: 4
---
# Post-cleaning
The aggregation processes are continuously running and apply vocabularies as they are in a given moment of time. It could be the case that a vocabulary changes after the aggregation of one data source has finished, thus the aggregated content does not reflect the current status of the controlled vocabularies.
In addition, the integration of ScholeXplorer and DOIBoost and some enrichment processes applied on the raw and on the de-duplicated graph may introduce values that do not comply with the current status of the OpenAIRE controlled vocabularies. For these reasons, we included a final step of cleansing at the end of the workflow materialisation. The output of the final cleansing step is the final version of the OpenAIRE Research Graph.

View File

@ -1,7 +1,12 @@
---
sidebar_position: 6
---
# Stats analysis
The OpenAIRE Research Graph is also processed by a pipeline for extracting the statistics and producing the charts for funders, research initiative, infrastructures, and policy makers that you can see on MONITOR. Based on the information available on the graph, OpenAIRE provides a set of indicators for monitoring the funding and research impact and the uptake of Open Science publishing practices, such as Open Access publishing of publications and datasets, availability of interlinks between research products, availability of post-print versions in institutional or thematic Open Access repositories, etc.
The OpenAIRE Graph is also processed by a pipeline for extracting the statistics
and producing the charts for funders, research initiatives, research infrastructures,
and policymakers available on [MONITOR](https://monitor.openaire.eu).
Based on the information available on the graph, OpenAIRE provides a set of
indicators for monitoring the funding and research impact and the uptake of
Open Science publishing practices, such as Open Access publishing of publications
and datasets, availability of interlinks between research products, availability
of post-print versions in institutional or thematic Open Access repositories, etc.

View File

@ -1,17 +0,0 @@
---
sidebar_position: 4
---
# Bulk downloads
In order to facilitate users, different dumps are available. All are available under the Zenodo community called [OpenAIRE Research Graph](https://zenodo.org/communities/openaire-research-graph).
Here we provide detailed documentation about the full dump:
* JSON dump: https://doi.org/10.5281/zenodo.3516917
* JSON schema: https://doi.org/10.5281/zenodo.4238938
:::note Tip!
For a visual and interactive overview of the JSON schema, we suggest to use a JSON schema viewer like [jsonschemaviewer](https://navneethg.github.io/jsonschemaviewer/) (you just need to copy the schema and then you can easily navigate through the nodes).
:::

View File

@ -0,0 +1,30 @@
---
sidebar_position: 1
---
# CfHbKeyValue
Information about the sources from which the record has been collected.
### key
_Type: String &bull; Cardinality: ONE_
The OpenAIRE identifier of the data source.
```json
"key":"10|openaire____::081b82f96300b6a6e3d282bad31cb6e2"
```
### value
_Type: String &bull; Cardinality: ONE_
The name of the data source.
```json
"value":"Crossref"
```

View File

@ -0,0 +1,37 @@
---
sidebar_position: 1
---
# CommunityInstance
It is a subclass of [Instance](../../data-model/entities/result#instance) extended with information regarding the collection and hosting source for this materialization of the result.
### hostedby
_Type: [CfHbKeyValue](./cfhb) &bull; Cardinality: ONE_
Information about the source from which the instance can be viewed or downloaded.
```json
"hostedby": {
"key": "10|issn___print::35ee75a5ad42581d604be113a8f56427",
"value": "New Phytologist"
},
```
### collectedfrom
_Type: [CfHbKeyValue](./cfhb) &bull; Cardinality: ONE_
Information about the source from which the record has been collected.
```json
"collectedfrom": {
"key": "10|openaire____::081b82f96300b6a6e3d282bad31cb6e2",
"value": "Crossref"
}
```

View File

@ -0,0 +1,46 @@
---
sidebar_position: 1
---
# Context
Information about the research initiative/community (RI/RC) related to the result.
### code
_Type: String &bull; Cardinality: ONE_
Code identifying the RI/RC.
```json
"code":"sdsn-gr"
```
### label
_Type: String &bull; Cardinality: ONE_
Label of the RI/RC.
```json
"label":"SDSN - Greece"
```
### provenance
_Type: [Provenance](/data-model/entities/other#provenance-2) &bull; Cardinality: MANY_
Why this result is associated to the RI/RC.
```json
"provenance":[{
"provenance":"Inferred by OpenAIRE",
"trust":"0.9"
},
...
]
```

View File

@ -0,0 +1,141 @@
---
sidebar_position: 1
---
# Extended Result
It is a subclass of [Result](/data-model/entities/result) extended with information regarding projects (and funders), research communities/infrastructure and related data sources.
### projects
_Type: [Project](project.md) &bull; Cardinality: MANY_
List of projects (i.e. grants) that (co-)funded the production of the research results.
```json
"projects": [
{
"id": "40|corda__h2020::94c4a066401e22002c4811a301bb4655",
"code": "727929",
"acronym": "TomRes",
"title": "A NOVEL AND INTEGRATED APPROACH TO INCREASE MULTIPLE AND COMBINED STRESS TOLERANCE IN PLANTS USING TOMATO AS A MODEL",
"funder": {
"shortName": "EC",
"name": "European Commission",
"jurisdiction": "EU",
"fundingStream": "H2020"
},
"provenance": {
"provenance": "Harvested",
"trust": "0.900000000000000022"
},
"validated": {
"validationDate": "2021-0101",
"validatedByFunder": true
}
},
...
]
```
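As a hedged illustration of how the `projects` property might be consumed, the following Python sketch collects the funders of a single Extended Result record. The helper name `funders_of` and the trimmed-down sample record are assumptions; only field names shown in the example above are used.

```python
import json


def funders_of(record):
    """Return the sorted short names of the funders of all projects of a record."""
    return sorted(
        {
            project["funder"]["shortName"]
            for project in record.get("projects", [])
            if "funder" in project
        }
    )


# Trimmed-down sample record based on the example above.
sample_record = json.loads(
    """
    {
      "projects": [
        {
          "code": "727929",
          "acronym": "TomRes",
          "funder": {"shortName": "EC", "name": "European Commission",
                     "jurisdiction": "EU", "fundingStream": "H2020"}
        }
      ]
    }
    """
)
sample_funders = funders_of(sample_record)
print(sample_funders)
```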
### context
_Type: [Context](./context) &bull; Cardinality: MANY_
Reference to the relevant research infrastructures, initiatives or communities (RI/RC) among those collaborating with OpenAIRE. Please see https://connect.openaire.eu for those that are publicly visible.
```json
"context":[
{
"code":"sdsn-gr",
"label":"SDSN - Greece",
"provenance":[
{
"provenance":"Inferred by OpenAIRE",
"trust":"0.9"
}
]
},
...
]
```
### collectedfrom
_Type: [CfHbKeyValue](./cfhb) &bull; Cardinality: MANY_
Information about the sources from which the record has been collected.
```json
"collectedfrom":[
{
"key":"10|openaire____::081b82f96300b6a6e3d282bad31cb6e2",
"value":"Crossref"
},
...
]
```
### instance
_Type: [CommunityInstance](./communityInstance) &bull; Cardinality: MANY_
Information about the source from which the instance can be viewed or downloaded.
```json
"instance": [
{
"license": "http://doi.wiley.com/10.1002/tdm_license_1.1",
"accessright": {
"code": "c_16ec",
"label": "RESTRICTED",
"scheme": "http://vocabularies.coar-repositories.org/documentation/access_rights/",
"openAccessRoute": null
},
"type": "Article",
"url": [
"https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1111%2Fnph.15014",
"http://onlinelibrary.wiley.com/wol1/doi/10.1111/nph.15014/fullpdf",
"http://dx.doi.org/10.1111/nph.15014"
],
"publicationdate": "2018-02-09",
"refereed": "UNKNOWN",
"hostedby": {
"key": "10|issn___print::35ee75a5ad42581d604be113a8f56427",
"value": "New Phytologist"
},
"collectedfrom": {
"key": "10|openaire____::081b82f96300b6a6e3d282bad31cb6e2",
"value": "Crossref"
}
},
...
]
```

View File

@ -0,0 +1,72 @@
---
sidebar_position: 1
---
# Funder
Information about the funder funding the project.
### fundingStream
_Type: String &bull; Cardinality: ONE_
Funding information for the project.
```json
"fundingStream": "H2020"
```
### jurisdiction
_Type: String &bull; Cardinality: ONE_
Geographical jurisdiction (e.g. for European Commission is EU, for Croatian Science Foundation is HR).
```json
"jurisdiction": "EU"
```
### name
_Type: String &bull; Cardinality: ONE_
The name of the funder.
```json
"name": "European Commission"
```
### shortName
_Type: String &bull; Cardinality: ONE_
The short name of the funder.
```json
"shortName": "EC"
```

View File

@ -0,0 +1,134 @@
---
sidebar_position: 1
---
# Project
The information about the projects related to the result.
### id
_Type: String &bull; Cardinality: ONE_
Main entity identifier, created according to the [OpenAIRE entity identifier and PID mapping policy](../../data-model/pids-and-identifiers).
```json
"id": "40|corda__h2020::70ea22400fd890c5033cb31642c4ae68"
```
### code
_Type: String &bull; Cardinality: ONE_
The grant agreement code of the project.
```json
"code": "777541"
```
### acronym
_Type: String &bull; Cardinality: ONE_
Project's acronym.
```json
"acronym": "OpenAIRE-Advance"
```
### title
_Type: String &bull; Cardinality: ONE_
Project's title.
```json
"title": "OpenAIRE Advancing Open Scholarship"
```
### funder
_Type: [Funder](funder.md) &bull; Cardinality: ONE_
Information about the funder funding the project.
```json
"funder": {
"shortName": "EC",
"name": "European Commission",
"jurisdiction": "EU",
"fundingStream": "H2020"
}
```
### provenance
_Type: [Provenance](../../data-model/entities/other#provenance-2) &bull; Cardinality: ONE_
The reason why the project is associated to the result.
```json
"provenance": {
"provenance": "Harvested",
"trust": "0.900000000000000022"
}
```
### validated
_Type: [Validated](validated.md) &bull; Cardinality: ONE_
Specifies if the association between the project and the result was validated.
```json
"validated": {
"validationDate": "2021-0101",
"validatedByFunder": true
}
```

View File

@ -0,0 +1,41 @@
---
sidebar_position: 1
---
# Validated
Information about the validation of the association between the result and the funding information.
### validationDate
_Type: String &bull; Cardinality: ONE_
When OpenAIRE collected the association between the funding and the result from an authoritative source (i.e. Sygma).
```json
"validationDate": "2021-0101"
```
### validatedByFunder
_Type: Boolean &bull; Cardinality: ONE_
Specifies if the validation comes from the funder.
```json
"validatedByFunder": true
```

View File

@ -0,0 +1,12 @@
---
sidebar_position: 2
---
# Beginner's kit
The large size of the OpenAIRE Research Graph is a major impediment for beginners who want to familiarise themselves with the underlying data model and explore its contents.
Working with the Graph in its full size typically requires access to a huge distributed computing infrastructure, which is not easily accessible to everyone.
[The OpenAIRE Beginners Kit](https://doi.org/10.5281/zenodo.7490192) aims to address this issue. It consists of two components:
* A subset of the Graph composed of the research products published between 2022-06-29 and 2022-12-29, all the entities connected to them and the respective relationships.
* A Zeppelin notebook that demonstrates how you can use PySpark to analyse the Graph and get answers to some interesting research questions.

View File

@ -0,0 +1,48 @@
---
sidebar_position: 1
---
# Full graph dump
You can download the full OpenAIRE Research Graph Dump as well as its schema from the following links:
Dataset: https://doi.org/10.5281/zenodo.3516917
Schema: https://doi.org/10.5281/zenodo.4238938
The schema used to dump this dataset mirrors the one described in the [Data Model](/data-model).
This dataset is licensed under a Creative Commons Attribution 4.0 International License.
It is composed of several files so that you can download the parts you are interested in. The files are named after the entity they store (e.g. publication, dataset). Each file is at most 10GB and is
a tar archive containing gz files, each with one json per line.
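The layout described above (a tar archive of gz files with one json per line) can be read with the Python standard library alone. The following is a minimal sketch, not an official client: the function name `iter_dump_records` and the member name used in the demo are assumptions, and the demo builds a tiny in-memory archive instead of downloading a real dump file.

```python
import gzip
import io
import json
import tarfile


def iter_dump_records(fileobj):
    """Yield one parsed JSON record per line from every .gz member of a tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(".gz"):
                continue
            with gzip.open(tar.extractfile(member), mode="rt", encoding="utf-8") as lines:
                for line in lines:
                    if line.strip():
                        yield json.loads(line)


def build_demo_archive():
    """Build a tiny in-memory archive with the same layout (hypothetical content)."""
    payload = gzip.compress(b'{"id": "r1"}\n{"id": "r2"}\n')
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        info = tarfile.TarInfo(name="publication/part-00000.json.gz")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


demo_ids = [record["id"] for record in iter_dump_records(build_demo_archive())]
print(demo_ids)
```

For a downloaded file, the same function could be called on a file opened in binary mode, e.g. `open(path, "rb")`; records are streamed one at a time, so the archive does not need to fit in memory.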
## How to acknowledge this work
Open Science services are open and transparent and survive thanks to your active support and to the visibility and reward they gather. If you use one of the [OpenAIRE Research Graph dumps](https://doi.org/10.5281/zenodo.3516917) for your research, please provide a proper citation following the recommendation that you find on the dump's Zenodo page or as provided below.
:::note How to cite
Manghi P., Atzori C., Bardi A., Baglioni M., Schirrwagen J., Dimitropoulos H., La Bruzzo S., Foufoulas I., Mannocci A., Horst M., Czerniak A., Iatropoulou K., Kokogiannaki A., De Bonis M., Artini M., Lempesis A., Ioannidis A., Manola N., Principe P., Vergoulis T., Chatzopoulos S., Pierrakos D. (2022). "OpenAIRE Research Graph Dump", *Dataset*, Zenodo. [doi:10.5281/zenodo.3516917](https://doi.org/10.5281/zenodo.3516917) ([BibTex](/bibtex/OpenAIRE_Research_Graph_dump.bib))
:::
Please also consider citing [other relevant research products](/publications#relevant-research-products) that can be of interest.
Also consider adding one of the following badges to your service with the appropriate link to [our website](https://graph.openaire.eu); click on the badges below to download the respective badge image files.
<div className="row">
<div className="col col--4 left-badge">
<a target="_blank" href={require('../assets/badges/openaire-badge-1.zip').default} download>
<img loading="lazy" alt="Openaire badge" src={require('../assets/badges/openaire-badge-1.png').default} className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module pagination-nav__link" style={{ paddingTop: '1.2em', paddingBottom: '1.2em'}} title="Click to download"/>
</a>
</div>
<div className="col col--4 mid-badge">
<a target="_blank" href={require('../assets/badges/openaire-badge-2.zip').default} download>
<img loading="lazy" alt="Openaire badge" src={require('../assets/badges/openaire-badge-2.png').default} className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module pagination-nav__link dark-badge" style={{ paddingTop: '1.2em', paddingBottom: '1.2em'}} title="Click to download"/>
</a>
</div>
<div className="col col--4 right-badge">
<a target="_blank" href={require('../assets/badges/openaire-badge-3.zip').default} download>
<img loading="lazy" alt="Openaire badge" src={require('../assets/badges/openaire-badge-3.png').default} className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module pagination-nav__link" style={{ paddingTop: '1.2em', paddingBottom: '1.2em'}} title="Click to download"/>
</a>
</div>
</div>

View File

@ -0,0 +1,30 @@
---
sidebar_position: 4
---
# Other related datasets
In this page, we list other related datasets; please refer to their respective schema definitions for the data model they follow.
## The dump of ScholeXplorer
Dataset: https://doi.org/10.5281/zenodo.6338616
Schema (Scholix version 3): https://doi.org/10.5281/zenodo.1120275
Schema (Scholix version 4): https://doi.org/10.5281/zenodo.6351557
This dataset is licensed under a CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.
The dataset contains the GZ-compressed dump of the Scholix links exposed by the OpenAIRE ScholeXplorer service.
## The OpenAIRE LOD dump
Dataset (RDF dump): https://doi.org/10.5281/zenodo.609943
LOD Ontology: http://lod.openaire.eu/vocab
SPARQL Endpoint: http://lod.openaire.eu/sparql
The OpenAIRE Linked Open Data (LOD) Services and their integration with the OpenAIRE information space have been released as a beta version. The LOD exporting process started with a specification of the OpenAIRE data model as an RDF vocabulary, followed by a mapping of the OpenAIRE data to the graph-based RDF data model. To interlink the OpenAIRE data with related data on the Web, we have identified a list of potential datasets to interlink with, including the DBpedia dataset extracted from Wikipedia and the publication databases DBLP and CiteSeer.
Please refer [here](http://lod.openaire.eu/documentation) for more details on the LOD documentation.

View File

@ -0,0 +1,68 @@
---
sidebar_position: 3
---
# Sub-graph dumps
In order to facilitate users, different dumps are available under the Zenodo community called [OpenAIRE Research Graph](https://zenodo.org/communities/openaire-research-graph).
This page lists all alternative dumps currently available.
## The OpenAIRE COVID-19 dump
Dataset: https://doi.org/10.5281/zenodo.3980490
Schema: https://doi.org/10.5281/zenodo.3974225
This dataset is licensed under a Creative Commons Attribution 4.0 International License.
It contains metadata records of publications, research data, software and projects on the topic of Corona Virus and COVID-19.
This dump is part of the activities of OpenAIRE to support the fight against COVID-19 together with the OpenAIRE COVID-19 Gateway.
The dump consists of a tar archive containing gzip files with one json per line. Please refer [here](#alternative-sub-graph-data-model) for details on the data model of this dump.
## The dump of funded products
Dataset: https://doi.org/10.5281/zenodo.4559725
Schema: https://doi.org/10.5281/zenodo.3974225
This dataset is licensed under a Creative Commons Attribution 4.0 International License.
It contains metadata records of research products (research literature, data, software, other types of research products) with funding
information available in the OpenAIRE Research Graph. Records are grouped by funder in a dedicated archive file. Each tar archive contains
gzip files, each with one json record per line. The model of this dump differs from the one of the whole graph.
Please refer [here](#alternative-sub-graph-data-model) for details on the data model of this dump.
## The dump of delta projects
Dataset: https://doi.org/10.5281/zenodo.6419021
Schema: https://doi.org/10.5281/zenodo.4238938
This dataset is licensed under a Creative Commons Attribution 4.0 International License.
It contains the metadata records of projects collected by OpenAIRE in a given time frame. Usually one deposition of collected projects is done for each release of the OpenAIRE Research Graph.
The deposition is one tar archive containing gzip files, each with one json record per line.
## The dumps about research communities, initiatives and infrastructures
Dataset: https://doi.org/10.5281/zenodo.3974604
Schema: https://doi.org/10.5281/zenodo.3974225
This dataset is licensed under a Creative Commons Attribution 4.0 International License.
The dataset contains one file per community/initiative/infrastructure collaborating with OpenAIRE. Check out also their community gateways on
CONNECT. Each file is a tar archive containing gzip files with one json per line. Only the communities, research initiatives and infrastructures that are visible to everyone are dumped.
The model of this dump differs from the one of the whole graph.
Please refer [here](#alternative-sub-graph-data-model) for details on the data model of this dump.
---
## Alternative sub-graph data model
It should be noted that the dumps for research communities, infrastructures, and products related to projects do not strictly follow the main data model of the OpenAIRE Research Graph. In particular, they differ in the following:
* only research products are dumped (no relations and no entities other than results)
* the dumped results are extended with information that can be inferred from the whole dump, namely:
* funding information if present
* associated research community/infrastructure
* associated data sources
So they have just one entity type, that is the [Extended Result](alternative-model/extendedresult.md).

@ -1,8 +0,0 @@
{
"label": "Learning center",
"position": 9,
"link": {
"type": "generated-index",
"description": "5 minutes to learn the most important Docusaurus concepts."
}
}

Binary file not shown (before: 25 KiB).

Binary file not shown (before: 27 KiB).

@ -1,7 +0,0 @@
---
sidebar_position: 1
---
# OpenPlato Webinars
<span className="todo">TODO</span>

@ -1,7 +0,0 @@
---
sidebar_position: 2
---
# Tutorials
<span className="todo">TODO</span>

@ -3,4 +3,6 @@ sidebar_position: 11
---
# License
<span className="todo">TODO</span>
The OpenAIRE Research Graph is available for download and re-use under CC-BY (because some of its input sources are licensed under CC-BY). Parts of the graph can be re-used under CC-0.

@ -2,56 +2,79 @@
sidebar_position: 7
---
# How to cite
# Relevant publications
If you use one of the [OpenAIRE Research Graph dumps](https://zenodo.org/record/6616871), please cite it following the recommendation that you find on the Zenodo page.
Open Science services are open and transparent and survive thanks to your active support and to the visibility and reward they gather. If you use one of the [OpenAIRE Research Graph dumps](https://doi.org/10.5281/zenodo.3516917) for your research, please provide a proper citation following the recommendation that you find on the dump's Zenodo page or as provided below.
## Other relevant publications
:::note How to cite
Manghi P., Atzori C., Bardi A., Baglioni M., Schirrwagen J., Dimitropoulos H., La Bruzzo S., Foufoulas I., Mannocci A., Horst M., Czerniak A., Iatropoulou K., Kokogiannaki A., De Bonis M., Artini M., Lempesis A., Ioannidis A., Manola N., Principe P., Vergoulis T., Chatzopoulos S., Pierrakos D. (2022). "OpenAIRE Research Graph Dump", *Dataset*, Zenodo. [doi:10.5281/zenodo.3516917](https://doi.org/10.5281/zenodo.3516917) ([BibTex](/bibtex/OpenAIRE_Research_Graph_dump.bib))
:::
## Other relevant research products
Please also consider citing the related research products listed below.
### Aggregation system
Manghi, P., Artini, M., Atzori, C., Bardi, A., Mannocci, A., La Bruzzo, S., Candela, L., Castelli, D. and Pagano, P. (2014), “The D-NET software toolkit: A framework for the realization, maintenance, and operation of aggregative infrastructures”, Program: electronic library and information systems, Vol. 48 No. 4, pp. 322-354.
Michele Artini, Claudio Atzori, Alessia Bardi, Sandro La Bruzzo, Paolo Manghi, & Andrea Mannocci. (2016, November 24). The D-NET software toolkit: dnet-basic-aggregator (Version 1.3.0). Zenodo. <i className="fa-solid fa-arrow-up-right-from-square"></i>
Atzori, C., Bardi, A., Manghi, P., & Mannocci, A. (2017, January). The OpenAIRE workflows for data management. In Italian Research Conference on Digital Libraries (pp. 95-107). Springer, Cham.
Manghi P., Artini M., Atzori C., Bardi A., Mannocci A., La Bruzzo S., Candela L., Castelli D., Pagano P. (2014). "The D-NET software toolkit: A framework for the realization, maintenance, and operation of aggregative infrastructures", Program: electronic library and information systems, Vol. 48 No. 4, pp. 322-354. [doi:10.1108/prog-08-2013-0045](http://doi.org/10.1108/prog-08-2013-0045)
Mannocci, A., & Manghi, P. (2016, September). DataQ: a data flow quality monitoring system for aggregative data infrastructures. In International Conference on Theory and Practice of Digital Libraries (pp. 357-369). Springer, Cham.
Atzori C., Bardi A., Manghi P., Mannocci A. (2017). "The OpenAIRE workflows for data management", In Italian Research Conference on Digital Libraries (IRCDL), pp. 95-107, Springer, Cham. [doi:10.1007/978-3-319-68130-6_8](https://doi.org/10.1007/978-3-319-68130-6_8)
Artini M., Atzori C., Bardi A., La Bruzzo S., Manghi P., Mannocci A. (2016). "The D-NET software toolkit: dnet-basic-aggregator (Version 1.3.0)". *Software*, Zenodo. [doi:10.5281/zenodo.168356](https://doi.org/10.5281/zenodo.168356) <i className="fa-solid fa-arrow-up-right-from-square"></i>
Mannocci A., Manghi P. (2016). "DataQ: a data flow quality monitoring system for aggregative data infrastructures", International Conference on Theory and Practice of Digital Libraries (TPDL), pp. 357-369, Springer, Cham. [doi:10.1007/978-3-319-43997-6_28](https://doi.org/10.1007/978-3-319-43997-6_28)
### Deduplication
Claudio Atzori, & Paolo Manghi. (2017, February 17). gdup: a big graph entity deduplication system (Version 4.0.5). Zenodo. https://code-repo.d4science.org/D-Net/dnet-dedup/releases
Manghi, Paolo, Marko Mikulicic, and Claudio Atzori. "De-duplication of aggregation authority files." International Journal of Metadata, Semantics and Ontologies 7.2 (2012): 114-130.
Vichos K., De Bonis M., Kanellos I., Chatzopoulos S., Atzori C., Manola N., Manghi P., Vergoulis T. (2022). "A preliminary assessment of the article deduplication algorithm used for the OpenAIRE Research Graph", In Italian Research Conference on Digital Libraries (IRCDL), Padua, Italy, CEUR-WS Proceedings. [http://ceur-ws.org/Vol-3160](http://ceur-ws.org/Vol-3160/)
Manghi, P., Atzori, C., De Bonis, M., & Bardi, A. (2020). Entity deduplication in big data graphs for scholarly communication. Data Technologies and Applications.
Manghi, P., & Mikulicic, M. (2011, October). PACE: A general-purpose tool for authority control. In Research Conference on Metadata and Semantic Research (pp. 80-92). Springer, Berlin, Heidelberg.
De Bonis M., Manghi P., Atzori C. (2022). "FDup: a framework for general-purpose and efficient entity deduplication of record collections", PeerJ Computer Science, 8, e1058. [https://peerj.com/articles/cs-1058](https://peerj.com/articles/cs-1058)
Atzori, C., Manghi, P., & Bardi, A. (2018, December). GDup: de-duplication of scholarly communication big graphs. In 2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT) (pp. 142-151). IEEE.
Atzori, Claudio. "GDup: an Integrated, Scalable Big Graph Deduplication System." (2016).
Manghi P., Atzori C., De Bonis M., Bardi A. (2020). "Entity deduplication in big data graphs for scholarly communication", Data Technologies and Applications. [doi:10.1108/dta-09-2019-0163](https://doi.org/10.1108/dta-09-2019-0163)
Atzori C., Manghi P., Bardi A. (2018). "GDup: de-duplication of scholarly communication big graphs", In 2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT) (pp. 142-151). IEEE. [doi:10.1109/bdcat.2018.00025](https://doi.org/10.1109/bdcat.2018.00025)
Atzori C., Manghi P. (2017). "GDup: a big graph entity deduplication system" (Version 4.0.5), *Software*, Zenodo. [doi:10.5281/zenodo.292980](https://doi.org/10.5281/zenodo.292980)
Atzori C. (2016). "GDup: an Integrated, Scalable Big Graph Deduplication System.". [doi:10.5281/zenodo.1454879](https://doi.org/10.5281/zenodo.1454879)
Manghi P., Mikulicic M., Atzori C. (2012). "De-duplication of aggregation authority files." International Journal of Metadata, Semantics and Ontologies 7.2: 114-130. [doi:10.1504/ijmso.2012.050014](https://doi.org/10.1504/ijmso.2012.050014)
Manghi P., Mikulicic M. (2011). "PACE: A general-purpose tool for authority control", In Research Conference on Metadata and Semantic Research, pp. 80-92, Springer, Berlin, Heidelberg. [doi:10.1007/978-3-642-24731-6_8](https://doi.org/10.1007/978-3-642-24731-6_8)
### Mining
M. Kobos, Ł. Bolikowski, M. Horst, P. Manghi, N. Manola, J. Schirrwagen, “Information inference in scholarly communication infrastructures: the OpenAIREplus project experience”, Procedia Computer Science 38, 92-99.
Giannakopoulos T., Foufoulas Y., Dimitropoulos H., Manola N. (2019). "Interactive Text Analysis and Information Extraction", In Italian Research Conference on Digital Libraries (IRCDL), vol 988. Springer, Cham. [doi:10.1007/978-3-030-11226-4_27](https://doi.org/10.1007/978-3-030-11226-4_27)
Tkaczyk, D., Szostek, P., Fedoryszak, M. et al. CERMINE: automatic extraction of structured metadata from scientific literature. IJDAR 18, 317-335 (2015).
Giannakopoulos T., Foufoulas Y., Dimitropoulos H., Manola N. (2019) “Interactive Text Analysis and Information Extraction”. In: Manghi P., Candela L., Silvello G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, Cham.
Foufoulas Y., Stamatogiannakis L., Dimitropoulos H., Ioannidis Y. (2017). "High-Pass Text Filtering for Citation Matching", In International Conference on Theory and Practice of Digital Libraries (TPDL). Springer, Cham. [doi:10.1007/978-3-319-67008-9_28](https://doi.org/10.1007/978-3-319-67008-9_28)
Foufoulas Y., Stamatogiannakis L., Dimitropoulos H., Ioannidis Y. (2017) “High-Pass Text Filtering for Citation Matching”. In: Kamps J., Tsakonas G., Manolopoulos Y., Iliadis L., Karydis I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham.
Chronis Y., Foufoulas Y., Nikolopoulos V., Papadopoulos A., Stamatogiannakis L., Svingos C., Ioannidis Y. E. (2016). "A Relational Approach to Complex Dataflows", In Workshop Proceedings of the EDBT/ICDT 2016 (MEDAL 2016) Joint Conference on CEUR-WS.org (ISSN 1613-0073) [http://ceur-ws.org/Vol-1558/paper45.pdf](http://ceur-ws.org/Vol-1558/paper45.pdf)
T. Giannakopoulos, I. Foufoulas, E. Stamatogiannakis, H. Dimitropoulos, N. Manola, and Y. Ioannidis. 2015. “Visual-Based Classification of Figures from Scientific Literature”. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion). Association for Computing Machinery, New York, NY, USA, 1059-1060.
Giannakopoulos T., Foufoulas I., Stamatogiannakis E., Dimitropoulos H., Manola N., Ioannidis Y. (2015). "Visual-Based Classification of Figures from Scientific Literature", In Proceedings of the 24th International Conference on World Wide Web (WWW), Association for Computing Machinery, New York, NY, USA, 1059-1060. [doi:10.1145/2740908.2742024](https://doi.org/10.1145/2740908.2742024)
Giannakopoulos, T., Foufoulas, I., Stamatogiannakis, E., Dimitropoulos, H., Manola, N., & Ioannidis, Y. (2014). “Discovering and Visualizing Interdisciplinary Content Classes in Scientific Publications”. D-Lib Mag., Volume 20, Number 11/12.
Giannakopoulos T., Foufoulas I., Stamatogiannakis E., Dimitropoulos H., Manola N., Ioannidis Y. (2014). "Discovering and Visualizing Interdisciplinary Content Classes in Scientific Publications". D-Lib Mag., Volume 20, Number 11/12. [doi:10.1045/november14-giannakopoulos](https://doi.org/10.1045/november14-giannakopoulos)
Giannakopoulos T., Stamatogiannakis E., Foufoulas I., Dimitropoulos H., Manola N., Ioannidis Y. (2014) “Content Visualization of Scientific Corpora Using an Extensible Relational Database Implementation”. In: Bolikowski Ł., Casarosa V., Goodale P., Houssos N., Manghi P., Schirrwagen J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham. Also in: Google Books
Giannakopoulos T., Stamatogiannakis E., Foufoulas I., Dimitropoulos H., Manola N., Ioannidis Y. (2014). "Content Visualization of Scientific Corpora Using an Extensible Relational Database Implementation", International Conference on Theory and Practice of Digital Libraries (TPDL), Springer, Cham. [doi:10.1007/978-3-319-08425-1_10](https://doi.org/10.1007/978-3-319-08425-1_10)
Giannakopoulos T., Dimitropoulos H., Metaxas O., Manola N., Ioannidis Y. (2013) “Supervised Content Visualization of Scientific Publications: A Case Study on the ArXiv Dataset”. In: Kłopotek M.A., Koronacki J., Marciniak M., Mykowiecka A., Wierzchoń S.T. (eds) Language Processing and Intelligent Information Systems. IIS 2013. Lecture Notes in Computer Science, vol 7912. Springer, Berlin, Heidelberg.
Giannakopoulos T., Dimitropoulos H., Metaxas O., Manola N., Ioannidis Y. (2013). "Supervised Content Visualization of Scientific Publications: A Case Study on the ArXiv Dataset", Intelligent Information Systems Symposium (IIS) vol 7912, Springer, Berlin, Heidelberg. [doi:10.1007/978-3-642-38634-3_23](https://doi.org/10.1007/978-3-642-38634-3_23)
Y. Chronis, Y. Foufoulas, V. Nikolopoulos, A. Papadopoulos, L. Stamatogiannakis, C. Svingos, Y. E. Ioannidis, "A Relational Approach to Complex Dataflows", in Workshop Proceedings of the EDBT/ICDT 2016 (MEDAL 2016) Joint Conference (March 15, 2016, Bordeaux, France) on CEUR-WS.org (ISSN 1613-0073)
Tkaczyk D., Szostek P., Fedoryszak M., Dendek P. J., Bolikowski Ł. (2015). "CERMINE: automatic extraction of structured metadata from scientific literature", International Journal on Document Analysis and Recognition (IJDAR), 18, 317-335. [doi:10.1007/s10032-015-0249-8](https://doi.org/10.1007/s10032-015-0249-8)
Kobos M., Bolikowski Ł., Horst M., Manghi P., Manola N., Schirrwagen J. (2014). "Information inference in scholarly communication infrastructures: the OpenAIREplus project experience", Procedia Computer Science 38, 92-99. [doi:10.1016/j.procs.2014.10.016](https://doi.org/10.1016/j.procs.2014.10.016)
### Portals
Baglioni M. et al. (2019) The OpenAIRE Research Community Dashboard: On Blending Scientific Workflows and Scientific Publishing. In: Doucet A., Isaac A., Golub K., Aalberg T., Jatowt A. (eds) Digital Libraries for Open Knowledge. TPDL 2019. Lecture Notes in Computer Science, vol 11799. Springer, Cham.
Baglioni M., Bardi A., Kokogiannaki A., Manghi P., Iatropoulou K., Principe P., Vieira A., Nielsen L. H., Dimitropoulos H., Foufoulas I., Manola N., Atzori C., La Bruzzo S., Lazzeri E., Artini M., De Bonis M., Dell'Amico A. (2019). "The OpenAIRE Research Community Dashboard: On Blending Scientific Workflows and Scientific Publishing",
International Conference on Theory and Practice of Digital Libraries (TPDL). Lecture Notes in Computer Science, vol 11799. Springer, Cham. [doi:10.1007/978-3-030-30760-8_5](https://doi.org/10.1007/978-3-030-30760-8_5)
### Broker Service
Artini, M., Atzori, C., Bardi, A., La Bruzzo, S., Manghi, P., & Mannocci, A. (2015). The OpenAIRE literature broker service for institutional repositories. D-Lib Magazine, 21(11/12), 1.
Manghi, P., Atzori, C., Bardi, A., La Bruzzo, S., & Artini, M. (2016, February). Realizing a Scalable and History-Aware Literature Broker Service for OpenAIRE. In Italian Research Conference on Digital Libraries (pp. 92-103). Springer, Cham.
Manghi P., Atzori C., Bardi A., La Bruzzo S., Artini M. (2016). "Realizing a Scalable and History-Aware Literature Broker Service for OpenAIRE", Italian Research Conference on Digital Libraries (IRCDL), pp. 92-103, Springer, Cham. [doi:10.1007/978-3-319-56300-8_9](https://doi.org/10.1007/978-3-319-56300-8_9)
Artini M., Atzori C., Bardi A., La Bruzzo S., Manghi P., Mannocci A. (2015). "The OpenAIRE literature broker service for institutional repositories", D-Lib Magazine, 21(11/12), 1. [doi:10.1045/november2015-artini](https://doi.org/10.1045/november2015-artini)

@ -1,20 +0,0 @@
---
sidebar_position: 8
---
# Graph-based services
## Explore
<span className="todo">TODO</span>
## Provide
<span className="todo">TODO</span>
## Connect
<span className="todo">TODO</span>
## Monitor
<span className="todo">TODO</span>
## Develop
<span className="todo">TODO</span>

@ -5,14 +5,23 @@ const lightCodeTheme = require('prism-react-renderer/themes/github');
const darkCodeTheme = require('prism-react-renderer/themes/dracula');
const math = require('remark-math');
const katex = require('rehype-katex');
const { filterItems } = require('./sidebar-utils');
const dotenv = require('dotenv');
// load env variables (see .env file)
const env = dotenv.config();
if (env.error) {
throw env.error;
}
console.info("ENV VARIABLES:");
console.info(env.parsed);
/** @type {import('@docusaurus/types').Config} */
const config = {
title: 'OpenAIRE Documentation',
title: 'OpenAIRE Research Graph Documentation',
tagline: 'Open Access Infrastructure for Research in Europe',
url: 'http://snf-23385.ok-kno.grnetcloud.net',
baseUrl: '/', // serve the website at route
url: process.env.URL,
baseUrl: process.env.BASE_URL, // serve the website at route
onBrokenLinks: 'throw',
onBrokenMarkdownLinks: 'warn',
favicon: 'img/favicon.ico',
@ -29,7 +38,19 @@ const config = {
defaultLocale: 'en',
locales: ['en'],
},
themes: [
[
require.resolve("@easyops-cn/docusaurus-search-local"),
/** @type {import("@easyops-cn/docusaurus-search-local").PluginOptions} */
({
language: ["en"],
indexBlog: false,
highlightSearchTermsOnTargetPage: true,
searchBarShortcutHint: false,
docsRouteBasePath: "/",
}),
],
],
presets: [
[
'classic',
@ -37,18 +58,7 @@ const config = {
({
docs: {
routeBasePath: '/', // serve the docs at the site's route
sidebarPath: require.resolve('./sidebars.js'),
async sidebarItemsGenerator({ defaultSidebarItemsGenerator, ...args }) {
const sidebarItems = await defaultSidebarItemsGenerator(args);
const itemsToFilterOut = [
'data-model/entities/entity-identifiers',
'data-model/entities/other'
];
return filterItems(sidebarItems, itemsToFilterOut);
},
// Please change this to your repo.
// Remove this to remove the "edit this page" links.
// editUrl:
@ -63,6 +73,12 @@ const config = {
// },
theme: {
customCss: require.resolve('./src/css/custom.css'),
},
sitemap: {
changefreq: 'monthly',
priority: 0.5,
ignorePatterns: ['/tags/**'],
filename: 'sitemap.xml',
},
}),
],
@ -81,98 +97,45 @@ const config = {
/** @type {import('@docusaurus/preset-classic').ThemeConfig} */
({
navbar: {
// title: 'OpenAIRE Documentation',
title: 'documentation',
logo: {
alt: 'OpenAIRE',
src: 'img/logo.png',
},
items: [
{
type: 'doc',
docId: 'intro',
position: 'left',
label: 'Research graph v5.0',
},
//
// documentation version in the navbar
// {
// type: 'docsVersionDropdown',
// position: 'right'
// type: 'doc',
// docId: 'intro',
// position: 'left',
// label: 'Research graph v5.0',
// },
//
// documentation version in the navbar
{
type: 'docsVersionDropdown',
position: 'right'
},
// link to blog, the blog must be enabled first
// {to: '/blog', label: 'Blog', position: 'left'},
//
// link to github repo
// {
// href: 'https://github.com/facebook/docusaurus',
// label: 'GitHub',
// label: 'Issues',
// position: 'right',
// },
],
},
footer: {
style: 'dark',
links: [
{
title: 'Docs',
items: [
{
label: 'Research Graph',
to: '/',
},
],
},
{
title: 'Dashboards',
items: [
{
label: 'Explore',
href: 'https://explore.openaire.eu/',
},
{
label: 'Provide',
href: 'https://provide.openaire.eu/',
},
{
label: 'Connect',
href: 'https://connect.openaire.eu/',
},
{
label: 'Monitor',
href: 'https://monitor.openaire.eu/',
},
{
label: 'Develop',
href: 'https://graph.openaire.eu/',
},
],
},
{
title: 'Community',
items: [
{
label: 'Facebook',
href: 'http://www.facebook.com/groups/openaire/'
},
{
label: 'Linkedin',
href: 'https://www.linkedin.com/company/openaire-eu/',
},
{
label: 'Twitter',
href: 'https://twitter.com/OpenAIRE_eu',
},
{
label: 'Youtube',
href: 'https://www.youtube.com/channel/UChFYqizc-S6asNjQSoWuwjw',
},
],
},
],
style: 'light',
copyright: `Copyright © ${new Date().getFullYear()} OpenAIRE`,
},
colorMode: {
defaultMode: 'light',
disableSwitch: true,
respectPrefersColorScheme: false,
},
prism: {
theme: lightCodeTheme,
darkTheme: darkCodeTheme,

1605
package-lock.json generated

File diff suppressed because it is too large

@ -4,20 +4,22 @@
"private": true,
"scripts": {
"docusaurus": "docusaurus",
"start": "docusaurus start",
"start": "docusaurus start --host 0.0.0.0",
"build": "docusaurus build",
"swizzle": "docusaurus swizzle",
"deploy": "docusaurus deploy",
"clear": "docusaurus clear",
"serve": "docusaurus serve",
"serve": "docusaurus serve --host 0.0.0.0",
"write-translations": "docusaurus write-translations",
"write-heading-ids": "docusaurus write-heading-ids"
},
"dependencies": {
"@docusaurus/core": "2.0.1",
"@docusaurus/preset-classic": "2.0.1",
"@docusaurus/core": "^2.2.0",
"@docusaurus/preset-classic": "^2.2.0",
"@easyops-cn/docusaurus-search-local": "^0.33.6",
"@mdx-js/react": "^1.6.22",
"clsx": "^1.2.1",
"dotenv": "^16.0.3",
"hast-util-is-element": "^1.1.0",
"prism-react-renderer": "^1.3.5",
"react": "^17.0.2",
@ -26,7 +28,7 @@
"remark-math": "^3.0.1"
},
"devDependencies": {
"@docusaurus/module-type-aliases": "2.0.1"
"@docusaurus/module-type-aliases": "^2.2.0"
},
"browserslist": {
"production": [

8
release.properties Normal file

@ -0,0 +1,8 @@
#The name of the tag
tag_name=1.1
# A description of the tag
tag_description=1.1 is our 1st tag
#The release name
release_name=release-1.1
#The release description
release_description=this is the release 1.1

@ -1,18 +0,0 @@
// filter out specific items from the sidebar
function filterItems(items, itemsToFilter) {
// filter out items of categories
let result = items.map((item) => {
if (item.type === 'category') {
return {...item, items: filterItems(item.items, itemsToFilter)};
}
return item;
});
// filter out items in current level
return result.filter( item => !itemsToFilter.includes(item.id) );
}
module.exports = {
filterItems
};

@ -51,12 +51,22 @@ const sidebars = {
href: "https://graph.openaire.eu/develop/overview.html"
},
{
type: 'doc',
id: 'download'
},
type: 'category',
label: "Downloads",
link: {
type: 'generated-index',
description: 'All resources, available for download, are listed below.'
},
items: [
{ type: 'doc', id: 'downloads/full-graph'},
{ type: 'doc', id: 'downloads/beginners-kit' },
{ type: 'doc', id: 'downloads/subgraphs' },
{ type: 'doc', id: 'downloads/related-datasets' },
]
},
{
type: 'category',
label: "Data provision",
label: "Graph production workflow",
link: {type: 'doc', id: 'data-provision/data-provision'},
items: [
{
@ -64,12 +74,46 @@ const sidebars = {
label: "Aggregation",
link: {type: 'doc', id: 'data-provision/aggregation/aggregation'},
items: [
{ type: 'doc', id: 'data-provision/aggregation/doiboost', label: 'DOIBoost' },
{ type: 'doc', id: 'data-provision/aggregation/pubmed' },
{ type: 'doc', id: 'data-provision/aggregation/datacite' },
{ type: 'doc', id: 'data-provision/aggregation/ebi', label: 'EMBL-EBI' },
{
type: 'doc',
label: "OpenAIRE compatible sources",
id: 'data-provision/aggregation/compatible-sources',
},
{
type: 'category',
label: "Non-compatible sources",
link: { type: 'generated-index' },
items: [
{ type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/doiboost', label: 'DOIBoost' },
{ type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/pubmed' },
{ type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/datacite' },
{ type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/ebi', label: 'EMBL-EBI' },
]
}
]
},
{
type: 'doc',
id: 'data-provision/merge-by-id'
},
{
type: 'category',
label: "Enrichment by mining",
link: {
type: 'generated-index',
description: 'The OpenAIRE Research Graph is enriched using the different Text and Data Mining (TDM) algorithms that are grouped in the following categories.'
},
items: [
{ type: 'doc', id: 'data-provision/enrichment-by-mining/affiliation_matching' },
{ type: 'doc', id: 'data-provision/enrichment-by-mining/citation_matching' },
{ type: 'doc', id: 'data-provision/enrichment-by-mining/classifies' },
{ type: 'doc', id: 'data-provision/enrichment-by-mining/documents_similarity' },
{ type: 'doc', id: 'data-provision/enrichment-by-mining/acks' },
{ type: 'doc', id: 'data-provision/enrichment-by-mining/cites' },
{ type: 'doc', id: 'data-provision/enrichment-by-mining/metadata_extraction' },
]
},
{ type: 'doc', id: 'data-provision/cleaning' },
{
type: 'category',
label: "Deduplication",
@ -80,41 +124,50 @@ const sidebars = {
]
},
{
type: 'category',
label: "Enrichment",
link: {type: 'doc', id: 'data-provision/enrichment/enrichment'},
type: 'category',
label: "Deduction & propagation",
link: {
type: 'generated-index' ,
description: 'The OpenAIRE Research Graph is further enriched by the deduction and propagation processes described in this section.'
},
items: [
{ type: 'doc', id: 'data-provision/enrichment/mining' },
{ type: 'doc', id: 'data-provision/enrichment/impact-scores' },
{ type: 'doc', id: 'data-provision/deduction-and-propagation/bulk-tagging' },
{ type: 'doc', id: 'data-provision/deduction-and-propagation/propagation' },
]
},
{ type: 'doc', id: 'data-provision/post-cleaning' },
{
type: 'category',
label: "Indicators ingestion",
link: {
type: 'generated-index' ,
description: 'In this step, the following types of indicators are ingested in the OpenAIRE Research Graph.'
},
items: [
{ type: 'doc', id: 'data-provision/indicators-ingestion/impact-scores' },
{ type: 'doc', id: 'data-provision/indicators-ingestion/usage-counts' },
]
},
{ type: 'doc', id: 'data-provision/finalisation' },
{ type: 'doc', id: 'data-provision/indexing' },
{ type: 'doc', id: 'data-provision/stats' },
{ type: 'doc', id: 'data-provision/stats' }
]
},
{
type: 'doc',
id: 'services'
},
{
type: 'category',
type: "link",
label: "Learning center",
link: { type: 'generated-index' },
items: [
{ type: 'doc', id: 'learning-center/open-plato' },
{ type: 'doc', id: 'learning-center/tutorials' },
]
href: "https://openplato.eu/"
},
{
type: 'doc',
id: 'publications',
label: "Relevant publications"
},
{
type: 'doc',
id: 'faq'
},
// {
// type: 'doc',
// id: 'faq'
// },
{
type: 'doc',
id: 'license'
@ -123,6 +176,11 @@ const sidebars = {
type: 'doc',
id: 'changelog'
},
{
type: "link",
label: "Helpdesk",
href: "https://graph.openaire.eu/support"
},
]
};

@ -5,58 +5,66 @@
*/
/* You can override the default Infima variables here. */
/*
:root {
--ifm-color-primary: #2e8555;
--ifm-color-primary-dark: #29784c;
--ifm-color-primary-darker: #277148;
--ifm-color-primary-darkest: #205d3b;
--ifm-color-primary-light: #33925d;
--ifm-color-primary-lighter: #359962;
--ifm-color-primary-lightest: #3cad6e;
--ifm-code-font-size: 95%;
--docusaurus-highlighted-code-line-bg: rgba(0, 0, 0, 0.1);
}
*/
/* For readability concerns, you should choose a lighter palette in dark mode. */
/*
[data-theme='dark'] {
--ifm-color-primary: #25c2a0;
--ifm-color-primary-dark: #21af90;
--ifm-color-primary-darker: #1fa588;
--ifm-color-primary-darkest: #1a8870;
--ifm-color-primary-light: #29d5b0;
--ifm-color-primary-lighter: #32d8b4;
--ifm-color-primary-lightest: #4fddbf;
--docusaurus-highlighted-code-line-bg: rgba(0, 0, 0, 0.3);
}
*/
:root {
--ifm-color-primary: #4666ca;
--ifm-color-primary-dark: #3757be;
--ifm-color-primary-darker: #3353b4;
--ifm-color-primary-darkest: #2a4494;
--ifm-color-primary-light: #5b77d0;
--ifm-color-primary-lighter: #6680d3;
--ifm-color-primary-lightest: #859adc;
--ifm-color-primary: #e6122e;
--ifm-color-primary-dark: #cf1029;
--ifm-color-primary-darker: #c30f27;
--ifm-color-primary-darkest: #a10d20;
--ifm-color-primary-light: #ee233e;
--ifm-color-primary-lighter: #ef2f48;
--ifm-color-primary-lightest: #f15166;
--ifm-background-color: #F5F5F5;
--ifm-navbar-background-color: #fff;
--ifm-code-font-size: 95%;
--docusaurus-highlighted-code-line-bg: rgba(0, 0, 0, 0.1);
}
[data-theme='dark'] {
--ifm-color-primary: #5dade2;
--ifm-color-primary-dark: #429fdd;
--ifm-color-primary-darker: #3498db;
--ifm-color-primary-darkest: #227fbd;
--ifm-color-primary-light: #78bbe7;
--ifm-color-primary-lighter: #86c2e9;
--ifm-color-primary-lightest: #aed6f1;
--ifm-color-primary: #f15166;
--ifm-color-primary-dark: #ef334c;
--ifm-color-primary-darker: #ed243f;
--ifm-color-primary-darkest: #d1112a;
--ifm-color-primary-light: #f36f80;
--ifm-color-primary-lighter: #f57e8d;
--ifm-color-primary-lightest: #f8aab5;
--ifm-background-color: #2c2e3a;
--ifm-navbar-background-color: #2c2e3a;
--docusaurus-highlighted-code-line-bg: rgba(0, 0, 0, 0.3);
}
.navbar__logo {
height: 2.5rem;
}
.todo {
background-color: yellow;
}
}
@media (min-width: 996px) {
.left-badge {
padding-right: 5px;
}
.mid-badge {
padding-left: 0;
padding-right: 5px;
}
.right-badge {
padding-left: 0;
}
}
.dark-badge {
background-color: #c6c6c6;
}
.footer {
background-color: var(--ifm-navbar-background-color);
padding-bottom: 2em;
padding-top: 1em;
height: var(--ifm-navbar-height);
}

@ -0,0 +1,35 @@
@dataset{manghi_paolo_2022_7488618,
author = {Manghi, Paolo and
Atzori, Claudio and
Bardi, Alessia and
Baglioni, Miriam and
Schirrwagen, Jochen and
Dimitropoulos, Harry and
La Bruzzo, Sandro and
Foufoulas, Ioannis and
Mannocci, Andrea and
Horst, Marek and
Czerniak, Andreas and
Iatropoulou, Katerina and
Kokogiannaki, Argiro and
De Bonis, Michele and
Artini, Michele and
Lempesis, Antonis and
Ioannidis, Alexandros and
Manola, Natalia and
Principe, Pedro and
Vergoulis, Thanasis and
Chatzopoulos, Serafeim and
Pierrakos, Dimitris},
title = {OpenAIRE Research Graph Dump},
month = dec,
year = 2022,
note = {{A new version of this dataset is published every 6
months. The content available on the OpenAIRE
EXPLORE and CONNECT portals might be more up-to-
date with respect to the data you find here.}},
publisher = {Zenodo},
version = {5.0.0},
doi = {10.5281/zenodo.7488618},
url = {https://doi.org/10.5281/zenodo.7488618}
}

Binary file not shown (before: 236 KiB).

Binary file not shown (before: 649 KiB).

Some files were not shown because too many files have changed in this diff