Compare commits

604 Commits

Author SHA1 Message Date
dimitrispie 163b2ee2a8 Changes
1. Monitor updates
2. Bug fixes during copy to impala cluster
2023-07-13 15:25:00 +03:00
dimitrispie 76901a25f9 Updates Promotion DBs
- Add a step for promoting the split monitor DBs
2023-07-12 22:49:08 +03:00
dimitrispie 2b6370eaee Update step15_5.sql
Bug fix
2023-06-21 11:31:10 +03:00
dimitrispie 74cb060bfe Update step15_5.sql
Add "if not exists" clause
2023-06-21 11:24:06 +03:00
dimitrispie a475cfcb7b Update step16-createIndicatorsTables.sql
Rename a field in indi_pub_interdisciplinarity
2023-06-21 10:42:02 +03:00
dimitrispie 4648cd88d4 Update step15.sql
Cast score to double
2023-06-21 10:02:19 +03:00
dimitrispie 94d2573c77 Update step15.sql
Bug Fix
2023-06-21 09:22:39 +03:00
dimitrispie be2caedb04 Update step20-createMonitorDB_institutions.sql
Add openorgs____::1624ff7c01bb641b91f4518539a0c28a Vrije Universiteit Amsterdam
2023-06-19 12:12:17 +03:00
dimitrispie 36e0a8fec4 Changes to Promotion Stats WF
1. Add new cluster host at impala-shell commands
2. Add a step for splitting monitor dbs
3. Update workflow.xml to include the new splitting monitor dbs step
2023-06-19 09:44:34 +03:00
dimitrispie 4c770a5e29 Update finalizeImpalaCluster.sh
Drop views in shadow dbs before dropping the db
2023-06-15 13:25:37 +03:00
dimitrispie e06d962a6a Update step15.sql 2023-06-15 12:20:35 +03:00
dimitrispie afcad08396 Update step20-createMonitorDB_institutions.sql
Added openorgs____::c0b262bd6eab819e4c994914f9c010e2   -- National Institute of Geophysics and Volcanology
2023-06-15 10:28:49 +03:00
dimitrispie 42b8ce2ba4 Update copyDataToImpalaCluster.sh 2023-06-14 19:23:42 +03:00
dimitrispie 2032b0df40 Bug fixes
1. Remove tables/views from old databases in the new cluster, before dropping the dbs
2. Fix id in result_accessroute, indi_impact_measures, indi_pub_bronze_oa
2023-06-14 19:09:09 +03:00
dimitrispie c5f42c7f5b Added memory to hive 2023-06-07 18:18:23 +03:00
dimitrispie fa24e2e18f Bug fix on indicators step
indi_pub_gold_oa table was missing during the creation of other indicators
2023-06-07 17:43:37 +03:00
dimitrispie 28272c1b0e Bug fix 2023-06-07 15:34:01 +03:00
dimitrispie ad07fbf053 Add names to organizations for collaboration indicators 2023-06-02 14:13:10 +03:00
dimitrispie 2324670714 Split Monitor DBs-Interdisciplinarity indicators
- Split DBs Monitor for faster rendering of visualizations
- Add interdisciplinarity indicators from result_fos
2023-06-02 13:34:16 +03:00
dimitrispie ebe586b1d1 Impact indicators/Unpaywall
- Added Impact indicators
- Added unpaywall open access colours
2023-05-26 10:25:28 +03:00
dimitrispie d6102dd576 Update step16-createIndicatorsTables.sql
- Add org names to indi_project_collab_org
- Add indi_pub_bronze_oa
- Changes to indi_pub_hybrid_oa_with_cc
2023-05-25 14:52:34 +03:00
dimitrispie 86f4f63daf Updates to steps related to transfer data to impala cluster
1. Remove external table definitions in stats_ext
2. Fix the issue where some views are not created.
3. Added two workflow parameters for copying also the usage stats dbs
2023-05-18 09:33:05 +03:00
dimitrispie b3f9633205 Update copyDataToImpalaCluster.sh
Added option --user to impala-shell command
2023-05-15 12:51:44 +03:00
dimitrispie 00d0d162b6 Update copyDataToImpalaCluster.sh
Added a temporary folder to copy the files to avoid permission issues
2023-05-12 12:31:13 +03:00
dimitrispie c3d58e58e1 Bug fixes 2023-05-02 11:54:07 +03:00
dimitrispie e57ecdaf98 Update step20-createMonitorDB.sql
Add University of Manitoba
2023-04-30 17:52:23 +03:00
dimitrispie fdb5d2b39f Bug fixes 2023-04-23 18:29:00 +03:00
dimitrispie 53ce023035 Bug fixes 2023-04-23 18:23:45 +03:00
dimitrispie 4fa750b719 Bug fixes on monitor-update 2023-04-19 17:39:53 +03:00
dimitrispie 5247cb7115 Bug fix 2023-04-19 11:11:19 +03:00
dimitrispie 25dafccc24 Merge branch 'hive' into beta 2023-04-12 11:36:59 +03:00
dimitrispie c85de8fa1f -Added Technological University Dublin
-Added project_organization_contribution table
-Add Delft University of Technology
2023-04-07 09:22:59 +03:00
dimitrispie 9b41dff33c Update step20-createMonitorDB.sql
Added Delft University of Technology
2023-04-07 09:21:38 +03:00
dimitrispie 91e18ac7f4 Added project_organization_contribution table 2023-04-06 10:53:11 +03:00
dimitrispie 9e1335df4c -Added Technological University Dublin
-Added project_organization_contribution table
2023-04-04 13:22:40 +03:00
dimitrispie fad7fa4af8 Added Technological University Dublin 2023-03-22 09:44:00 +02:00
dimitrispie 43b23a9bf3 Update step20-createMonitorDB.sql
Added Technological University Dublin
2023-03-15 09:57:12 +02:00
dimitrispie 1547611246 Merge branch 'beta' into hive 2023-02-22 16:57:12 +02:00
dimitrispie 90807b60c7 Changes to monitor wf 2023-02-20 10:42:24 +02:00
dimitrispie d2f9ccf934 Changes to separate monitor wf 2023-02-20 10:41:21 +02:00
dimitrispie 032a401cbf Bug fixes 2023-02-20 09:29:20 +02:00
dimitrispie 595192d510 Bug fix 2023-02-14 16:24:08 +02:00
dimitrispie f3aaff3688 Remove duplicate orgs 2023-02-14 09:48:36 +02:00
dimitrispie 3400133c2f Bug fix 2023-02-13 09:44:00 +02:00
dimitrispie 935db0ab25 Added organizations for Monitor 2023-02-13 09:29:09 +02:00
dimitrispie 7b78b15c81 Changes for copying to Impala Cluster 2023-02-13 09:27:00 +02:00
dimitrispie d71f5672d3 Add monitor post step 2023-02-09 13:44:14 +02:00
dimitrispie 35ba8bb328 Bug fixes 2023-02-09 12:57:57 +02:00
dimitrispie 3ba11d64a1 Changes 07022023 2023-02-07 12:53:51 +02:00
dimitrispie 98c34263ed Update step20-createMonitorDB.sql
Add University of Cape Town organization
2023-02-07 08:14:48 +02:00
dimitrispie 2dc6d47270 Changes 06022023 2023-02-06 13:18:53 +02:00
dimitrispie 973d78a4d6 Update step15_5.sql
Added Unpaywall open access colors
2023-02-02 08:03:54 +02:00
dimitrispie cf58e4a5e4 Added Arts et Métiers ParisTech 2023-01-25 16:03:16 +02:00
dimitrispie db7d625ba9 Added Arts et Métiers ParisTech organization 2023-01-25 12:22:21 +02:00
dimitrispie 4d7553c9f1 Bug fixes 2023-01-12 17:19:19 +02:00
dimitrispie dd70c32ad7 Bug fixes 2023-01-12 17:18:05 +02:00
dimitrispie 51f7ab5864 Bug fixes 2023-01-12 17:15:06 +02:00
dimitrispie 34d4bf727c Bug fixes 2023-01-12 11:28:37 +02:00
dimitrispie 43f6d4f296 -Monitor DB workflow 2023-01-12 11:26:47 +02:00
dimitrispie 686580a220 - New Monitor DB workflow
- New Organization added
2023-01-12 11:18:03 +02:00
dimitrispie becb242c17 Monitor DB only Workflow 2023-01-04 16:50:29 +02:00
dimitrispie dcb958e146 Changes to execute the stats wf only in hive 2023-01-04 11:39:01 +02:00
dimitrispie 592013d5dd Added more steps in decision node 2022-12-23 09:43:16 +02:00
dimitrispie 2a4bf32d4c Merge branch 'hive' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into hive
# Conflicts:
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step10.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step13.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step14.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step16_1-definitions.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step7.sql
2022-12-22 10:22:46 +02:00
dimitrispie 6449ff4207 1. Added a decision node that enables the workflow to select the execution path to follow
2. Added new organization
3. Added 5 new tables from Eurostat
2022-12-22 10:18:21 +02:00
Antonis Lempesis c8309fe18e added command line params to allow hive actions to run 2022-12-21 12:41:33 +02:00
Antonis Lempesis 028873cc51 added new hive opts 2022-12-21 12:41:33 +02:00
Antonis Lempesis 1ddea4f442 removed 'stored as parquet' from views.. 2022-12-21 12:41:33 +02:00
Antonis Lempesis 2754c3dd62 moving data to impala cluster and creating shadow databases there 2022-12-21 12:41:29 +02:00
Antonis Lempesis 778a1a724f finished migration to hive only 2022-12-21 12:41:25 +02:00
Antonis Lempesis e84dd5fe26 first 2022-12-21 12:41:23 +02:00
dimitrispie 2a52a42169 Added 4 institutions:
-University of Modena and Reggio Emilia
-Bilkent University
-Saints Cyril and Methodius University of Skopje
-University of Milan
2022-12-06 10:10:21 +02:00
dimitrispie 992fc5b628 Added McMaster University Institution 2022-11-03 11:02:18 +02:00
dimitrispie 7fda05e380 Added Autonomous University of Barcelona 2022-11-01 13:59:40 +02:00
dimitrispie 7861c472e0 Hive memory parameters 2022-10-28 19:00:32 +03:00
dimitrispie 5df9c63963 Added fields: totalcost, fundedamount, currency, in project table 2022-10-27 16:44:26 +03:00
dimitrispie 2c0c3f1806 Cast amount to float for table result_apcs 2022-09-28 19:33:24 +03:00
dimitrispie bdc46e3eaa Remove denormalization of results to fix downloads numbers in monitor 2022-09-28 14:59:08 +03:00
dimitrispie 2ebb1459a9 Fixed type in no_downloads 2022-09-28 14:36:57 +03:00
dimitrispie dcd85f8cd7 - Synchronize indicators in stats-db with monitor-db
- added new openorg id for Nanyang Technological University
- changed openorg id for University of Helsinki #8088 ticket
2022-09-22 13:33:07 +03:00
dimitrispie 3bf3127251 Changes to monitor and indicator scripts 2022-09-14 16:36:19 +03:00
dimitrispie 71b069ca90 Changes to indicator and monitor scripts 2022-09-09 13:15:58 +03:00
dimitrispie 2b5f8c9c9a comment out duplicate table creation 2022-09-06 12:27:53 +03:00
Antonis Lempesis fcef5294e2 restored some collab indicators 2022-08-05 13:45:01 +03:00
Antonis Lempesis 227e10f4b3 commenting out the collab indicators because they still fail 2022-08-05 12:54:36 +03:00
Antonis Lempesis 8b0407d8ec fixed the datasourceOrganization relations 2022-08-03 12:26:59 +03:00
Antonis Lempesis 1778d40c40 latest version of indicators 2022-08-02 13:39:34 +03:00
Antonis Lempesis 6fc9ef53f6 added command line params to allow hive actions to run 2022-07-29 16:36:20 +03:00
Antonis Lempesis 9886fe87ec - Added FOS classification
- Added extra orgs in monitor
- Fixed result-project and organization-project tables
2022-07-29 16:34:50 +03:00
Antonis Lempesis ab18c9daa9 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta 2022-06-09 15:48:21 +03:00
Antonis Lempesis 574492c659 removed double result_apc table creation from monitor 2022-06-09 15:48:13 +03:00
Antonis Lempesis db088cc69c fixed *_organization tables 2022-06-07 04:04:28 +03:00
Antonis Lempesis 8160763330 fixed conflict 2022-05-13 14:29:31 +03:00
Antonis Lempesis 3fc9efeab6 fixed typo, added open citations and apcs in monitor 2022-05-13 14:28:13 +03:00
Antonis Lempesis c25134f28d fixed typo 2022-05-12 14:55:47 +03:00
Antonis Lempesis 23334479bb removed yet another collab, added more orgs in monitor 2022-05-11 13:05:52 +03:00
Antonis Lempesis 61b4c19e65 restored indi_result_org_country_collab, removed indi_result_org_collab 2022-05-06 12:52:10 +03:00
Antonis Lempesis cfbbcaf7c4 commented out indi_result_org_country_collab 2022-05-06 12:49:36 +03:00
Antonis Lempesis 0353f93d54 added new hive opts 2022-04-29 12:49:27 +03:00
Antonis Lempesis b7cd2c6ca1 added open citations 2022-04-20 14:46:55 +03:00
Antonis Lempesis c442c91f89 computing stats in each step 2022-04-06 12:40:02 +03:00
Antonis Lempesis 7112806a73 views cannot be stored as parquet... 2022-03-29 16:37:29 +03:00
Antonis Lempesis fff0b3cc19 added apcs in monitor db 2022-03-29 14:15:31 +03:00
Antonis Lempesis ee24f3eb2c views cannot be stored as parquet... 2022-03-29 13:47:48 +03:00
Antonis Lempesis d8503cd191 added moooar organizations 2022-03-24 14:02:36 +02:00
Antonis Lempesis 62f91b0869 cleanup 2022-03-22 16:17:49 +02:00
Antonis Lempesis 2e8394ecf8 creating aaall tables as parquet 2022-03-22 16:16:08 +02:00
Antonis Lempesis dcfbeb8142 yet more typos 2022-03-21 12:36:03 +02:00
Antonis Lempesis ad78e505da yet another fix 2022-03-03 12:28:12 +02:00
Antonis Lempesis efeeebfee1 fixed query after the change in the indicator table 2022-03-02 13:29:25 +02:00
Antonis Lempesis 3b92a2ab9c added the rest of sprint 6 in monitor db 2022-02-23 12:05:57 +02:00
dimitrispie 9a75ca1ae4 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta 2022-02-22 14:47:33 +02:00
Antonis Lempesis 87c91f70a2 added sprint 6 indicators to monitor db 2022-02-22 14:41:48 +02:00
Antonis Lempesis 0bff45e739 added sprint 6 indicators to monitor db 2022-02-18 17:11:23 +02:00
dimitrispie 58c59f46eb Added Sprint 6 2022-02-17 10:21:09 +02:00
Antonis Lempesis 5772f92dba merged beta changes in hive branch 2022-02-15 13:24:51 +02:00
Antonis Lempesis 393a4ee956 fixed yet another typo... 2022-02-15 12:56:50 +02:00
Antonis Lempesis 5f762cbd09 fixed yet another typo 2022-02-07 12:09:12 +02:00
Antonis Lempesis ae633c566b fixed the result_result table 2022-02-04 15:04:19 +02:00
Antonis Lempesis c2b44530a3 typo... 2022-02-03 13:44:07 +02:00
Antonis Lempesis dbd2646d59 fixed the result_result creation for monitor 2022-02-03 12:37:10 +02:00
Antonis Lempesis 81ee654271 added result_result relations 2021-12-23 15:46:17 +02:00
Antonis Lempesis 7551e52e95 fixed a typo 2021-12-23 15:33:53 +02:00
Antonis Lempesis 16539d7360 added usage stats 2021-12-22 02:54:42 +02:00
Antonis Lempesis 3edd661608 fixed column names 2021-12-21 22:55:04 +02:00
Antonis Lempesis a4c0cbb98c fixed typos in indicators. Added extra views in monitor 2021-12-21 15:54:38 +02:00
Antonis Lempesis 58996972d9 added first indicator of sprint 5 2021-12-21 03:35:04 +02:00
dimitrispie c1cdec09a9 Sprint 5 and other changes 2021-12-20 19:23:57 +02:00
Antonis Lempesis ddd34087c2 removed 'stored as parquet' from views.. 2021-12-13 23:05:00 +02:00
Antonis Lempesis 915f758c82 moving data to impala cluster and creating shadow databases there 2021-12-13 16:26:14 +02:00
Antonis Lempesis d05210ba99 finished migration to hive only 2021-11-30 19:01:48 +02:00
dimitrispie 09fc2afdca Added indi_funder_country_collab
Kept only indi_pub_has_cc_licence
2021-11-26 16:13:10 +02:00
dimitrispie 8750a71502 Merge remote-tracking branch 'origin/beta' into beta 2021-11-26 16:11:26 +02:00
dimitrispie 25fc8abf77 Sprint 4 2021-11-26 16:10:58 +02:00
Antonis Lempesis 0b4163ee0b added sprint3,4, removed 2, chaos 2021-11-26 15:58:01 +02:00
Antonis Lempesis 12749a0a77 first 2021-11-26 15:40:40 +02:00
dimitrispie 29f69f2f89 Sprint 4 2021-11-26 15:22:04 +02:00
Antonis Lempesis cb3adb90f4 Merge branch 'beta' into beta 2021-11-17 14:33:45 +01:00
Antonis Lempesis c283406829 added Universidad Politécnica de Madrid 2021-11-17 15:33:00 +02:00
Claudio Atzori e0395719d7 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2021-11-17 14:17:27 +01:00
Claudio Atzori 82a4e4efae [cleaning wf] fixed methodology to rule out invalid result titles, based on https://support.openaire.eu/issues/7206 2021-11-17 14:17:22 +01:00
Miriam Baglioni 6d4a1c57ee [Resolve Entities] Change test dataset to mirror the modification in the creation of the map between the pids and the unresolved 2021-11-17 12:41:52 +01:00
Claudio Atzori 49f897ef29 [cleaning wf] fixed regex used to spot garbage in result titles; adjusted threshold for filtering titles 2021-11-16 15:24:23 +01:00
Claudio Atzori 0a727d325d [dedup] increased number of partitions in the consistency phase 2021-11-16 08:43:41 +01:00
Claudio Atzori bafa2990f3 code formatting 2021-11-15 17:07:16 +01:00
Claudio Atzori 668ac25224 [graph resolution] using existing argument parser file name 2021-11-15 17:02:45 +01:00
Claudio Atzori 7d0a03f607 [graph resolution] minor 2021-11-15 14:45:54 +01:00
Claudio Atzori 941a50a2fc Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2021-11-15 14:42:49 +01:00
Claudio Atzori 7c804acda8 [graph resolution] minor 2021-11-15 14:42:43 +01:00
Sandro La Bruzzo efa09057db Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into beta 2021-11-15 14:32:09 +01:00
Sandro La Bruzzo 48923e46a1 added documentation to Pubmed Class and also added mvn site for dhp-aggregations 2021-11-15 14:32:01 +01:00
Claudio Atzori d2c787d416 [graph resolution] fixed sequence of the workflow steps 2021-11-15 14:31:15 +01:00
Claudio Atzori 975b10b711 [actionmanager] increased spark.sql.shuffle.partitions to 5000 2021-11-15 12:31:45 +01:00
Claudio Atzori 1ecceea788 Merge pull request 'Open Citations' (#158) from openCitations into beta
Reviewed-on: D-Net/dnet-hadoop#158
2021-11-15 10:59:19 +01:00
Miriam Baglioni 4ec88c718c merge with beta - resolved conflict in pom 2021-11-15 10:52:16 +01:00
Miriam Baglioni 6f1a434e90 [Bypass Action Set] Fixed test to consider the new identifier utils 2021-11-15 09:59:23 +01:00
Miriam Baglioni 157d33ebf9 [Bypass Action Set] Refactoring 2021-11-15 09:58:48 +01:00
Claudio Atzori 7b81607035 Merge pull request 'PR: Bypass Action Set' (#157) from bypass_acstionset into beta
Reviewed-on: D-Net/dnet-hadoop#157
2021-11-12 12:01:05 +01:00
Miriam Baglioni 92d0e18b55 [Bypass Action Set] used constant DOI instead of "doi" 2021-11-12 10:56:58 +01:00
Miriam Baglioni 881113743f [Bypass Action Set] refactoring 2021-11-12 10:55:50 +01:00
Miriam Baglioni 47ccb53c4f [Bypass Action Set] modification for comment D-Net/dnet-hadoop#157 (comment) 2021-11-12 10:54:09 +01:00
Miriam Baglioni ffb0ce1d59 merge with beta - resolved conflict in pom 2021-11-12 10:19:59 +01:00
Miriam Baglioni 716021546e [Bypass Action Set] minor fix 2021-11-12 10:18:01 +01:00
Claudio Atzori 1f2a3d1af0 depending on dhp-schemas:2.8.22 (release) 2021-11-12 10:15:11 +01:00
Sandro La Bruzzo 3469cc2b1d Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into beta 2021-11-12 09:56:52 +01:00
Sandro La Bruzzo a7763d2492 removed alternate identifier in resolutionMap 2021-11-12 09:56:45 +01:00
Miriam Baglioni 935062edec [Bypass Action Set] creation of unresolved entities 2021-11-11 16:11:25 +01:00
Antonis Lempesis 26f086dd64 removed the too restrictive clause. will discuss again 2021-11-11 12:57:19 +02:00
Claudio Atzori 8bdca3413f Merge pull request 'DOIBoost Mapping: change the creation of the instance in the DOIBoost result' (#155) from doiboost_url into beta
Reviewed-on: D-Net/dnet-hadoop#155
2021-11-11 10:40:32 +01:00
Claudio Atzori 148289150f Merge branch 'beta' into doiboost_url 2021-11-11 10:40:19 +01:00
Sandro La Bruzzo 2ca0a436ad added SparkResolveEntities node to the oozie wf 2021-11-11 10:25:42 +01:00
Sandro La Bruzzo 9cb195314f implemented and tested resolution of entities 2021-11-11 10:17:40 +01:00
Miriam Baglioni 6d3c4c4abe merging with branch beta 2021-11-11 08:59:53 +01:00
Miriam Baglioni c371b23077 - 2021-11-10 17:00:37 +01:00
Miriam Baglioni 9e214ce0eb [BypassAS] addition of OC relations 2021-11-09 12:07:19 +01:00
Sandro La Bruzzo 6477a40670 implement filter of openCitation 2021-11-09 11:27:12 +01:00
Miriam Baglioni 6f7ca539c6 [BypassAS] update of results for bipFinder and FOS 2021-11-09 11:25:41 +01:00
Miriam Baglioni a7d50c499b [BypassAS] prepare FOS subject, test and model for FOS and BipFinder scores 2021-11-08 16:44:19 +01:00
Antonis Lempesis 91354c6068 - fetching all context related results
- storing tables as parquet
2021-11-08 15:15:46 +02:00
Miriam Baglioni df7ee77c7a [DOIBoost Mapping] removed not needed comments 2021-11-04 16:24:07 +01:00
Miriam Baglioni de63d29b6f [DOIBoost Mapping] Fix to avoid producing results with null as identifier (probably due to the filtering function in the factory for the creation of the id) 2021-11-04 16:16:40 +01:00
Miriam Baglioni d50057b2d9 [DOIBoost Mapping] changed the way to create the url for the instance: we use the Crossref guidelines https://doi.org/doi 2021-11-03 16:59:37 +01:00
Miriam Baglioni edf55395e9 added test resource 2021-11-03 16:49:30 +01:00
Miriam Baglioni d97ea82a29 [DOIBoost Mapping] Added test to verify the instance created for Crossref will have just the url related to the doi 2021-11-03 16:45:15 +01:00
Miriam Baglioni 96769b4481 [DOIBoost - Mapping] Changed the logic which brought into the instance urls that should not be there: the url of the doi in the json is reachable from the root (json/"URL"); other urls were added from the links element. Now the mapping from the link element has been removed 2021-11-03 16:43:36 +01:00
Miriam Baglioni 683fe093cf [DOIBoost - Mapping] Remove the addition of the instance to the MAG publication record 2021-11-03 15:51:26 +01:00
Miriam Baglioni b2bb8d9d79 [DOIBoost - Mapping] selecting the url from Crossref containing the doi 2021-11-03 15:44:57 +01:00
Miriam Baglioni 779318961c [DOIBoost - Mapping] removed the url from crossref containing the api.elsevier.com... string in the url 2021-11-03 14:38:52 +01:00
Miriam Baglioni 2480e590d1 [DOIBoost - Mapping] changed the type on which to map dissertation from Crossref: from 006 Doctoral thesis to 0044 Thesis since dissertation could be either Doctoral or master thesis 2021-11-03 14:25:23 +01:00
Sandro La Bruzzo 7bd224f051 implement first version of scholexplorer integration for the generation of final graph 2021-11-02 15:58:15 +01:00
Claudio Atzori 7fa49f6956 Merge pull request 'removed hardcoded reference' (#154) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#154
2021-11-02 09:11:30 +01:00
Antonis Lempesis f78afb5ef9 removed hardcoded reference 2021-11-01 15:42:29 +02:00
Claudio Atzori 1225ba0b92 [resolution] increasing number of partitions to avoid OOM 2021-10-28 16:18:17 +02:00
Sandro La Bruzzo d9cbca83f7 moved filter on next phase 2021-10-28 16:13:24 +02:00
Sandro La Bruzzo 1be9aa0a5f Removed filter of datacite items from the raw graph merging phase, Datacite is not an actionset anymore in beta 2021-10-26 17:52:20 +02:00
Sandro La Bruzzo 4acfa8fa2e Scholexplorer Datasource Aggregation:
- Added collectedfrom in the inverse relation generated
Relation resolution:
- increased number of partitions in workflow.xml
- using classid instead of classname to build the pid-dnetId mapping
2021-10-26 17:51:20 +02:00
Sandro La Bruzzo aafdffa6b3 resolved conflict 2021-10-26 09:45:46 +02:00
Sandro La Bruzzo 034304b33a conflict resolved on merge 2021-10-26 09:40:47 +02:00
Claudio Atzori 6b34ba737e minor 2021-10-21 14:16:18 +02:00
Claudio Atzori d147295c2f avoiding java.io.NotSerializableException: java.util.HashMap 2021-10-21 14:15:57 +02:00
Claudio Atzori 3702fe478d cleanup 2021-10-21 12:05:02 +02:00
Sandro La Bruzzo ac36aa7d1c fixed wrong Encoding during a map phase 2021-10-21 11:35:02 +02:00
Sandro La Bruzzo aeeebd573b code refactor renamed datacite package 2021-10-20 17:37:42 +02:00
Sandro La Bruzzo ab3a99d3e9 removed old datacite oozie workflow 2021-10-20 17:19:47 +02:00
Sandro La Bruzzo ae4e99a471 Adapted workflow of resolution of PID to work into OpenAIRE data workflow
- Added relations in both directions on all Scholexplorer datasources
2021-10-20 17:12:16 +02:00
Claudio Atzori 4f8970f8ed [stats] reducing the step22 wait time 2021-10-20 14:14:53 +02:00
Claudio Atzori 00b78b9c58 cleanup: mapping contents in the graph already defined in the OAF graph model doesn't require to be aware of the vocabularies 2021-10-20 14:04:45 +02:00
Claudio Atzori c01dd0c925 registered oaf model classes for the KryoSerializer 2021-10-20 13:55:07 +02:00
Claudio Atzori d0cf2963f0 Merge pull request 'hierarchical_orgs_relations' (#150) from hierarchical_orgs_relations into beta
Reviewed-on: D-Net/dnet-hadoop#150
2021-10-20 10:13:47 +02:00
Claudio Atzori 59f76b50d4 Merge branch 'beta' into hierarchical_orgs_relations 2021-10-20 09:42:35 +02:00
Claudio Atzori bc3372093e Merge pull request '[stats] affiliations in stats and monitor dbs' (#152) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#152
2021-10-20 09:40:34 +02:00
Antonis Lempesis 241dcf6df1 Merge branch 'beta' into beta 2021-10-19 23:54:21 +02:00
Claudio Atzori 515e068a78 Merge branch 'beta' into hierarchical_orgs_relations 2021-10-19 16:46:06 +02:00
Claudio Atzori 512e7b0170 code formatting 2021-10-19 16:19:29 +02:00
Claudio Atzori d517c71458 Merge branch 'dump' into beta 2021-10-19 16:15:42 +02:00
Claudio Atzori e9157c67aa Merge branch 'beta' into dump 2021-10-19 16:15:03 +02:00
Claudio Atzori 98f37c8d81 WIP: workflow nodes for including Scholexplorer records in the RAW graph 2021-10-19 16:14:40 +02:00
Claudio Atzori c8850456e9 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2021-10-19 16:09:54 +02:00
Sandro La Bruzzo c9870c5122 code formatted 2021-10-19 15:24:59 +02:00
Sandro La Bruzzo f8329bc110 since dhp-schemas changed, introducing new Relation inverse model, this class has been updated 2021-10-19 15:24:22 +02:00
Claudio Atzori 7a73010acd WIP: workflow nodes for including Scholexplorer records in the RAW graph 2021-10-19 11:59:16 +02:00
Miriam Baglioni c7f6cd2591 added again the setting for saXReader 2021-10-19 10:15:26 +02:00
Sandro La Bruzzo a894d7adf3 updated version of dhp-schemas 2021-10-19 10:02:55 +02:00
miconis 5f780a6ba1 bug fix in migrate entities: parameter name was wrong 2021-10-18 23:30:40 +02:00
Miriam Baglioni 1315952702 merge with branch beta 2021-10-18 14:17:09 +02:00
Miriam Baglioni 1cc09adfaa Opencitations: changed the test class to mirror the creation or not of duplicate dois for .refs oc original plus added optional parameter to duplicate the relation 2021-10-18 14:11:27 +02:00
Miriam Baglioni 76d41602be Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2021-10-18 10:53:22 +02:00
Miriam Baglioni 46f82c7c8f removed not needed folder deletion 2021-10-18 10:53:16 +02:00
Sandro La Bruzzo 7b15b88d4c renamed wrong package, implemented last aggregation workflow for scholexplorer 2021-10-15 15:00:15 +02:00
Antonis Lempesis 41ecb1eb61 invalidating metadata before context thingies 2021-10-15 13:42:55 +03:00
Antonis Lempesis 4b7c8dff2d fetching affiliated results for 4 orgs in monitor. fixed affiliated orgs in stats db 2021-10-14 18:53:35 +03:00
Sandro La Bruzzo 51a03c0a50 refactor code for EBI from dhp-graph-mapper into dhp-aggregation 2021-10-14 14:23:13 +02:00
Claudio Atzori dd568ec88b Merge pull request 'Refactoring Solr Configuration' (#148) from beta_solr_config into beta
Reviewed-on: D-Net/dnet-hadoop#148
2021-10-14 12:45:11 +02:00
Claudio Atzori 14fbf92ad6 Merge branch 'beta' into beta_solr_config 2021-10-14 11:08:44 +02:00
Claudio Atzori b292e4a700 [stats wf] added extra logging in the context data retrieval phase 2021-10-13 17:31:53 +02:00
miconis 995c1eddaf minor change 2021-10-13 17:07:10 +02:00
Miriam Baglioni 5d9cc2452d changed the working path parameter value to depend on the dnet-workflow working dir parameter 2021-10-13 15:33:50 +02:00
miconis 326bf63775 integration of parent child orgs relations 2021-10-13 12:24:48 +02:00
Miriam Baglioni 16b28494a9 added new parameter in the doiboost process workflow to specify a folder for the process of MAG dataset 2021-10-13 11:34:24 +02:00
Miriam Baglioni 63933808d4 added fix for mixing result types, added configuration default to funder subworkflow 2021-10-13 11:28:28 +02:00
Sandro La Bruzzo f2c8356ccf Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into beta 2021-10-12 12:36:40 +02:00
Sandro La Bruzzo 7387416e90 added params skip update to direct transform in OAF, this should be set to true in production 2021-10-12 12:36:30 +02:00
Claudio Atzori 914b3e92cb updating graph schema module dependency to version 2.8.20 to include organization parent/child relation constants 2021-10-12 12:00:45 +02:00
Sandro La Bruzzo 511da98d0c - fixed bug on download pmc Article
- removed unused line of code in SparkCreateActionset
2021-10-12 11:47:49 +02:00
Miriam Baglioni fec40bdd95 merging with branch beta - resolved conflicts 2021-10-12 09:16:36 +02:00
Miriam Baglioni 83f51f1812 refactoring 2021-10-12 09:14:43 +02:00
Sandro La Bruzzo 5606014b17 code refactor see ticket #7065 2021-10-12 08:11:53 +02:00
Serafeim Chatzopoulos 201ce71cc1 Add resultsubject, relprojectname and resultacceptanceyear to __all field 2021-10-11 13:16:39 +03:00
Serafeim Chatzopoulos e468a7b96b Add tests to query Solr with different configurations 2021-10-08 16:58:51 +03:00
Serafeim Chatzopoulos de81007302 Add exploreTestConfig, a new Solr configuration folder 2021-10-08 16:54:56 +03:00
Sandro La Bruzzo 8f99d2af86 Make the node of doiBoost point to the correct OpenAIRE Organization in relations 2021-10-08 08:35:12 +02:00
Alessia Bardi c48c43fa9e Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2021-10-07 17:30:53 +02:00
Alessia Bardi 8d3b60f446 test for patching records for EOSC Future 2021-10-07 17:30:45 +02:00
miconis 611ca511db set configuration property in openorgs duplicates wf 2021-10-07 15:39:55 +02:00
miconis 9646b9fd98 implementation of the http call for the update of openorgs suggestions 2021-10-07 11:29:11 +02:00
Sandro La Bruzzo 2557bb41f5 Implemented new method for update baseline inside scala node 2021-10-06 16:41:08 +02:00
Sandro La Bruzzo b84e0cabeb Implemented new method for update baseline 2021-10-05 16:34:47 +02:00
Sandro La Bruzzo f258bbb927 Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into beta 2021-10-05 10:21:50 +02:00
Sandro La Bruzzo 991b06bd0b removed generation of EBI links from old dump, now EBI link dump is created by another wf 2021-10-05 10:21:33 +02:00
Claudio Atzori cb7efe12ac Merge pull request 'beta' (#146) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#146
2021-10-05 10:09:37 +02:00
Miriam Baglioni e653756e3d applied some suggestions from Sonar Lint 2021-10-04 18:40:07 +02:00
dimitrispie 3f25d2efb2 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta 2021-10-01 16:03:48 +03:00
dimitrispie 13687fd887 Sprint 3 indicators update 2021-10-01 16:02:02 +03:00
Miriam Baglioni 9814c3e700 merging with branch beta 2021-10-01 13:00:03 +02:00
Miriam Baglioni c4ccd7b32c - 2021-10-01 12:59:47 +02:00
Miriam Baglioni c8321ad31a merge with branch beta 2021-10-01 12:59:08 +02:00
Claudio Atzori 60a6a9a583 [graph2hive] added field 'measures' to the result view 2021-09-30 09:27:26 +02:00
Sandro La Bruzzo 66702b1973 Added node to update datacite 2021-09-28 08:59:06 +02:00
Sandro La Bruzzo 477cb10715 Merge remote-tracking branch 'origin/beta' into beta 2021-09-27 16:57:23 +02:00
Sandro La Bruzzo be79d74e3d Fixed DoiBoost generation to point to correct organization in affiliation relation 2021-09-27 16:57:04 +02:00
Claudio Atzori 35619b93ee Merge pull request 'implementation of the whitelist for similarity relations' (#144) from dedup_whitelist into beta
Reviewed-on: D-Net/dnet-hadoop#144
2021-09-27 16:47:40 +02:00
Claudio Atzori 474117c2e8 Merge branch 'beta' into dedup_whitelist 2021-09-27 16:41:25 +02:00
Miriam Baglioni 476a4708d6 merging with branch beta 2021-09-27 16:02:32 +02:00
Miriam Baglioni 5ec69889db OpenCitations: creation of AS from OC 2021-09-27 16:02:06 +02:00
Claudio Atzori a53acfbc06 Merge pull request '[stats] updates in the mapping, indicators, wf' (#145) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#145
2021-09-27 15:59:54 +02:00
Alessia Bardi b924276e18 tests to generate records for the EOSC-Future demo with the EOSC Jupyter Notebook subject 2021-09-24 17:11:56 +02:00
Antonis Lempesis a1e1cf32d7 fixed an impala error 2021-09-24 12:57:24 +03:00
Antonis Lempesis f358cabb2b fixed typo 2021-09-22 21:50:37 +03:00
Miriam Baglioni eedf7c3310 merging with branch beta 2021-09-22 15:18:34 +02:00
Miriam Baglioni f2118d771a first steps in the implementation of the integration of opencitations 2021-09-22 15:18:05 +02:00
Claudio Atzori df15a4dc9f Merge pull request 'UnknownHostException handling for orcid collector api' (#141) from enrico.ottonello/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#141
2021-09-22 11:51:13 +02:00
Claudio Atzori 7fa60e166e Merge branch 'beta' into dedup_whitelist 2021-09-22 11:31:18 +02:00
Antonis Lempesis 421d55265d created hive action for observatory queries 2021-09-21 03:07:58 +03:00
Enrico Ottonello 92a63f78fe multiple download attempts handling if a connection to orcid server fails 2021-09-20 18:25:00 +02:00
Enrico Ottonello 0c74f5667e Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2021-09-20 18:12:31 +02:00
miconis 853333bdde implementation of the whitelist for similarity relations 2021-09-20 16:21:47 +02:00
Antonis Lempesis 8b681dcf1b attempt to make the observatory wf run in hive 2021-09-18 00:35:14 +03:00
Claudio Atzori 71cfa386bc Merge pull request 'cleaning for relation fields' (#142) from clean_relations into beta
Reviewed-on: D-Net/dnet-hadoop#142
2021-09-17 16:01:03 +02:00
Antonis Lempesis 2943287d10 fixed the definition of cc_licence, part II 2021-09-16 15:59:06 +03:00
Antonis Lempesis dd2329849f fixed the definition of cc_licence 2021-09-16 13:50:34 +03:00
Claudio Atzori 09c2eb7f62 Merge branch 'beta' into clean_relations 2021-09-16 11:09:47 +02:00
Claudio Atzori 954a16c213 Merge pull request 'Propagation relations not Cleaned' (#143) from enrichment into beta
Reviewed-on: D-Net/dnet-hadoop#143
2021-09-15 19:14:38 +02:00
Miriam Baglioni e9ccdf853f related to D-Net/dnet-hadoop#132 2021-09-15 18:44:54 +02:00
Claudio Atzori 12766bf5f2 Merge branch 'beta' into clean_relations 2021-09-15 17:18:15 +02:00
Claudio Atzori 663b1556d7 manually integrating PR#140 D-Net/dnet-hadoop#140 2021-09-15 16:40:25 +02:00
Claudio Atzori ebf53a1616 added cleaning for relation fields: subRelType & relClass according to dedicated vocabs 2021-09-15 16:10:37 +02:00
Enrico Ottonello 8b804e7fe1 removed unused imports 2021-09-14 17:30:52 +02:00
Enrico Ottonello aefa36c54b other task executions go ahead if UnknownHostException happens on a single task 2021-09-14 17:26:15 +02:00
Antonis Lempesis de9bf3a161 added cc_licences and abstracts in observatory db 2021-09-14 01:29:08 +03:00
Antonis Lempesis 9b1936701c fixed yet another typo 2021-09-13 21:07:44 +03:00
Antonis Lempesis 8fc89ae822 moved context table creation before indicators 2021-09-13 14:33:23 +03:00
Antonis Lempesis 461bf90ca6 fixed the gold_oa definition 2021-09-13 11:10:30 +03:00
Antonis Lempesis 43852bac0e creating other::other concept for all contexts 2021-09-13 01:36:41 +03:00
Antonis Lempesis f13cca7e83 moved dependencies of indicators before them... 2021-09-08 23:07:58 +03:00
Antonis Lempesis c6ada217a1 fixed typo 2021-09-08 22:34:59 +03:00
Antonis Lempesis 1250ae197f using new indicators for the definition of peerreviewed, gold, and green 2021-09-08 14:08:43 +03:00
Antonis Lempesis ccee451dde added indicators of sprint 2 in monitor db 2021-09-07 23:17:13 +03:00
Sandro La Bruzzo aed29156c7 changed behavior in transformation job, that doesn't fail at first error 2021-09-07 19:05:46 +02:00
Sandro La Bruzzo 3c6fc2096c fix bug on oai iterator that skip record cleaned 2021-09-07 10:46:26 +02:00
Sandro La Bruzzo d4dadf6d77 reduced max number of PID in Relatedentity 2021-09-02 14:21:24 +02:00
Sandro La Bruzzo 9f8a80deb7 fixed wrong import of unresolved relation in openaire 2021-09-01 14:16:27 +02:00
Alessia Bardi 3762b17f7b added VERSION and PART relationship and re-ordered according to my personal and obviously possibly biased
ordering
2021-08-31 20:20:05 +02:00
Sandro La Bruzzo e8b3cb9147 Implemented method to download delta updates in EBI Links 2021-08-30 09:32:45 +02:00
Alessia Bardi ccf4103a25 keep the original url if the decoder fails for any reason 2021-08-25 10:07:58 +02:00
Sandro La Bruzzo 45898c71ac fixed wrong doi in pubmed 2021-08-24 15:20:04 +02:00
Alessia Bardi 00a28c0080 originalId was renamed to acronym 2021-08-23 15:02:21 +02:00
Alessia Bardi f19b04d41b code formatting after mvn compile 2021-08-23 14:33:39 +02:00
Alessia Bardi 412d2cb16a added dependencies to classgraph and opencsv. Bumped version of dhp-schemas 2021-08-23 14:32:00 +02:00
Alessia Bardi 3bcac7e88c Merge pull request 'towards EOSC datasource profiles' (#130) from datasource_model_eosc_beta into beta
Reviewed-on: D-Net/dnet-hadoop#130
2021-08-23 11:58:34 +02:00
Alessia Bardi 931f430129 Merge branch 'beta' into datasource_model_eosc_beta 2021-08-23 11:57:21 +02:00
Alessia Bardi 4c1474e693 Dealing with #6859#note-2: we have to decode URLs to avoid & and other chars encoded because of the original XML representation of data 2021-08-20 17:03:30 +02:00
Miriam Baglioni 5f8ccbc365 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2021-08-20 11:13:47 +02:00
Miriam Baglioni 882abb40e4 CrossrefDump - 2021-08-20 11:12:53 +02:00
Miriam Baglioni 45c62609af CrossrefDump - modified because parameter file was moved 2021-08-20 11:12:31 +02:00
Miriam Baglioni 35880c0e7b CrossrefDump - changed the wf to be able to resume from one of the steps 2021-08-20 11:11:35 +02:00
Miriam Baglioni f3b6c392c1 CrossrefDump - moving parameter file under folder crossref_dump_reader 2021-08-20 11:10:58 +02:00
Miriam Baglioni 65822400ce CrossrefDump - added new parameter file that was missing 2021-08-20 11:10:35 +02:00
Alessia Bardi a053e1513c different funders in blacklist from BETA and PROD aggregator 2021-08-19 11:32:27 +02:00
Alessia Bardi 812bd54c57 different funders in blacklist from BETA and PROD aggregator 2021-08-19 11:30:14 +02:00
Miriam Baglioni a65d3caaea Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2021-08-19 10:29:10 +02:00
Miriam Baglioni e5cf11d088 change open access route to result matching hbm to gold 2021-08-19 10:29:04 +02:00
Claudio Atzori 7c0c67bdd6 added mock pom 2021-08-13 17:45:53 +02:00
Claudio Atzori 82086f3422 fixed directory name 2021-08-13 17:42:14 +02:00
Claudio Atzori bc7068106c added crossref download oozie workflow 2021-08-13 17:19:44 +02:00
Claudio Atzori 2c0a05f11a manually merged PR#139 2021-08-13 17:15:53 +02:00
Claudio Atzori d43667d857 Merge pull request 'Automatic download of Crossref' (#138) from crossref_dw_wf into beta
Reviewed-on: D-Net/dnet-hadoop#138
2021-08-13 17:10:10 +02:00
Miriam Baglioni 5856ca8a7b merging with branch beta - resolved conflicts 2021-08-13 16:45:45 +02:00
Miriam Baglioni 6fec71e8d2 removed the specific of the infra we are running the wf from the wf name 2021-08-13 16:39:02 +02:00
Miriam Baglioni ed7e28490a change in sh 2021-08-13 16:19:01 +02:00
Claudio Atzori 7743d0f919 consolidated dnet wf profiles into the same submodule 2021-08-13 16:14:54 +02:00
Miriam Baglioni 6eb7508995 merging with branch beta 2021-08-13 16:07:04 +02:00
Claudio Atzori f74adc4752 added DownloadCSV2 as alternative implementation of the same download procedure 2021-08-13 15:52:15 +02:00
Claudio Atzori 5f0903d50d fixed CSV downloader & tests 2021-08-13 14:17:54 +02:00
Claudio Atzori 17cefe6a97 [HBM] removed stale replace option 2021-08-13 12:43:59 +02:00
Claudio Atzori 7ee2757fcd fixed DownloadCSV parameters spec; workflow patching the hostedby replaces the graph content (publication, datasource) rather than creating a copy 2021-08-13 12:41:01 +02:00
Claudio Atzori c3ad4ab701 minor fixes 2021-08-13 12:23:15 +02:00
Claudio Atzori baed5e3337 test classes moved in specific components 2021-08-13 12:14:47 +02:00
Claudio Atzori 3359f73fcf cleanup & best practices 2021-08-13 12:00:42 +02:00
Claudio Atzori 4e6575a428 Merge pull request 'Moving Download CSV' (#137) from refactoring_download_csv into beta
Reviewed-on: D-Net/dnet-hadoop#137
2021-08-13 10:41:01 +02:00
Miriam Baglioni f4ec81c92c merging with branch beta 2021-08-13 10:31:35 +02:00
Miriam Baglioni dc8b05b39e Hosted By Map - changed the association with the datasource id for the hostedby element: it no longer needs to be computed. With the new HBM it is already the id in the graph 2021-08-13 10:18:25 +02:00
Miriam Baglioni 32fd75691f refactoring 2021-08-13 10:15:42 +02:00
Miriam Baglioni dfd1e53c69 added external dependency for version 2021-08-13 10:15:12 +02:00
Miriam Baglioni 01db1f8bc4 GetCSV refactoring - removed not needed import 2021-08-13 10:14:17 +02:00
Miriam Baglioni 964a46ca21 GetCSV refactoring - modified due to movement of classes 2021-08-13 10:11:18 +02:00
Miriam Baglioni eaf077fc34 GetCSV refactoring - removed not needed dependency 2021-08-13 10:08:58 +02:00
Miriam Baglioni 5f674efb0c moved dependency version in external pom 2021-08-13 10:07:53 +02:00
Miriam Baglioni 5cd5714530 GetCSV refactoring - added ignore annotation for fields not in input csv 2021-08-13 10:06:49 +02:00
Miriam Baglioni 58f241f4a2 GetCSV refactoring - changed due to change of input resource 2021-08-13 10:04:44 +02:00
Miriam Baglioni f3d575f749 GetCSV refactoring - changed due to changes in input resource 2021-08-13 10:03:57 +02:00
Miriam Baglioni a5f6edfa6c GetCSV refactoring - changed to mirror the original model class 2021-08-13 09:30:03 +02:00
Miriam Baglioni ed183d878e GetCSV refactoring - modified test classes due to change in the model of projects and programme 2021-08-13 09:28:51 +02:00
Miriam Baglioni 8769dd8eef GetCSV refactoring - refactoring due to movement of classes 2021-08-12 18:20:56 +02:00
Miriam Baglioni 6b9e1bf2e3 GetCSV refactoring - removing not needed dependency 2021-08-12 18:17:50 +02:00
Miriam Baglioni d57b2bb927 GetCSV refactoring - removing not needed dependency 2021-08-12 18:12:51 +02:00
Miriam Baglioni 9da74b544a GetCSV refactoring - refactoring due to movement of classes 2021-08-12 18:12:15 +02:00
Miriam Baglioni ab8abd61bb GetCSV refactoring - refactoring due to movement of classes 2021-08-12 18:11:07 +02:00
Miriam Baglioni 335a824e34 GetCSV refactoring - fixed issue 2021-08-12 18:10:10 +02:00
Miriam Baglioni f0845e9865 GetCSV refactoring - refactoring due to movement of classes 2021-08-12 18:04:58 +02:00
Miriam Baglioni 7a789423aa GetCSV refactoring - refactoring due to movement of classes 2021-08-12 18:04:27 +02:00
Miriam Baglioni e9fc3ef3bc GetCSV refactoring - changed to use the new class to get and write the csv file 2021-08-12 18:03:41 +02:00
Miriam Baglioni 4317211a2b GetCSV refactoring - refactoring due to movement 2021-08-12 18:03:14 +02:00
Miriam Baglioni b62cd656a7 GetCSV refactoring - changed the model to store only the information needed 2021-08-12 18:01:10 +02:00
Miriam Baglioni d36e925277 GetCSV refactoring - moved under model package 2021-08-12 18:00:21 +02:00
Miriam Baglioni 7402daf51a GetCSV refactoring - added dependency to open-csv lib 2021-08-12 17:59:19 +02:00
Miriam Baglioni 733bcaecf6 GetCSV refactoring - added test class (all the tests are disabled since they refer to remote resource) 2021-08-12 17:58:52 +02:00
Miriam Baglioni bfe8f5335c GetCSV refactoring - copied model classes in test path 2021-08-12 17:58:14 +02:00
Miriam Baglioni 6e84b3951f GetCSV refactoring - moving classes to dhp-common that have dependency with GetCSV class (that was located in graph-mapper) 2021-08-12 17:57:41 +02:00
Claudio Atzori e91ffcd2f3 Merge pull request 'hostedbymap' (#136) from hostedbymap into beta
Reviewed-on: D-Net/dnet-hadoop#136
2021-08-12 17:10:55 +02:00
Claudio Atzori 9587d4aee8 Merge branch 'beta' into hostedbymap 2021-08-12 17:04:30 +02:00
Claudio Atzori 86d940044c added test to verify bad records from FWF-E-Book-Library 2021-08-12 11:32:56 +02:00
Claudio Atzori 8cdce59e0e [graph raw] let the mapping exceptions propagate 2021-08-12 11:32:26 +02:00
Miriam Baglioni 08dd2b2102 moving the dependency version to the external pom file 2021-08-11 18:09:41 +02:00
Miriam Baglioni ac417ca798 removed not needed test resource 2021-08-11 17:50:33 +02:00
Miriam Baglioni e33daaeee8 reverting 2021-08-11 17:46:19 +02:00
Miriam Baglioni 9650eea497 reverting 2021-08-11 17:45:48 +02:00
Miriam Baglioni 785db1d5b2 refactoring 2021-08-11 17:44:07 +02:00
Miriam Baglioni 95e5482bbb removing not needed dependency 2021-08-11 17:42:26 +02:00
Miriam Baglioni cc3d72df0e removing not needed dependency 2021-08-11 17:42:01 +02:00
Miriam Baglioni b966329833 reverting 2021-08-11 17:37:00 +02:00
Miriam Baglioni 8ad7c71417 reverting 2021-08-11 17:36:12 +02:00
Miriam Baglioni 0e1a6bec20 reverting 2021-08-11 17:32:29 +02:00
Miriam Baglioni c6a2a780a9 reverting 2021-08-11 17:30:17 +02:00
Miriam Baglioni b6b58bba28 reverting 2021-08-11 17:25:37 +02:00
Miriam Baglioni 804589eb30 reverting 2021-08-11 17:23:35 +02:00
Miriam Baglioni d688749ad9 reverting 2021-08-11 17:22:28 +02:00
Miriam Baglioni 524c06e028 reverting 2021-08-11 17:20:30 +02:00
Miriam Baglioni 7aa3260729 reverting 2021-08-11 17:18:45 +02:00
Miriam Baglioni 55fc500d8d reverting 2021-08-11 17:17:48 +02:00
Miriam Baglioni f9b6b45d85 reverting 2021-08-11 17:04:48 +02:00
Miriam Baglioni 8229632839 adding assertions to the mapping of the unibi part of gold list 2021-08-11 16:36:01 +02:00
Miriam Baglioni b1c6140ebf removed all comments in Italian 2021-08-11 16:23:33 +02:00
Miriam Baglioni 52c18c2697 removed not needed test class. The functionality has been moved 2021-08-11 16:16:55 +02:00
Miriam Baglioni 8da3a25cf6 merging with branch beta 2021-08-11 15:55:34 +02:00
Claudio Atzori 9f4db73f30 updated/fixed unit tests 2021-08-11 15:02:51 +02:00
Claudio Atzori 61d811ba53 suggestions from intellij 2021-08-11 12:18:20 +02:00
Claudio Atzori 2ee21da43b suggestions from SonarLint 2021-08-11 12:13:22 +02:00
Miriam Baglioni b954fe9ba8 merging with branch beta 2021-08-11 10:12:46 +02:00
Miriam Baglioni b688567db5 hostedbymap - modified part of test to check the bestaccessright changed 2021-08-11 10:12:10 +02:00
Miriam Baglioni 9731a6144a hostedbymap - in case the journal is open access the access may be changed also for the best access right in the result 2021-08-10 17:49:45 +02:00
Miriam Baglioni a90bac3bc9 Graph Dump - added method to test class to verify addition of validation date in projects for community result 2021-08-09 16:36:54 +02:00
Miriam Baglioni bd0d7bfba7 Graph Dump - added resources for testing addition of validation date in project for communityresult 2021-08-09 16:36:17 +02:00
Miriam Baglioni 8daaa32e90 Graph Dump - added resources for testing 2021-08-09 15:46:29 +02:00
Miriam Baglioni bc9e3a06ba Graph Dump - extended the test class 2021-08-09 15:46:06 +02:00
Miriam Baglioni 2efa5abda5 refactoring 2021-08-09 12:28:36 +02:00
Claudio Atzori 577f3b1ac8 added dnet workflows responsible for the graph construction, enrichment, provision 2021-08-09 11:53:58 +02:00
Miriam Baglioni da20fceaf7 removed all the part related to the crossref dump download since it is done in a separate workflow 2021-08-09 11:53:45 +02:00
Claudio Atzori 964f97ed4d cleanup 2021-08-09 11:53:06 +02:00
Miriam Baglioni 54a6cbb244 CrossrefDump - put token among the parameters 2021-08-09 11:41:10 +02:00
Miriam Baglioni b7079804cb CrossrefDump - put token among the parameters 2021-08-09 11:34:35 +02:00
Miriam Baglioni a5f82f442b Merge branch 'beta' into doiboost_wf 2021-08-09 11:17:51 +02:00
Miriam Baglioni b6dcf89d22 merging with branch beta 2021-08-09 11:14:43 +02:00
Miriam Baglioni eff499af9f added new tests and changed the test example 2021-08-09 11:12:30 +02:00
Miriam Baglioni 5d70f842eb merging with branch beta 2021-08-06 18:57:09 +02:00
Miriam Baglioni c3931557e3 extended the logic of the dump to consider the validation date in the relation (also in the dumped result for communities and funders at the level of the project), the extension on the instance for the APC, the pid, the alternate identifiers, and the extension of the AccessRight to store the OpenAccessRoute. Added new resource for testing and extended the old class to verify the new dump. Also fixed an issue in the relation dump: only relations whose source and target are entities in the graph are dumped. The same holds for references to projects 2021-08-06 18:56:18 +02:00
Claudio Atzori 66f398fe6f Merge pull request '[stats] fixed a typo' (#133) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#133
2021-08-06 14:29:57 +02:00
Miriam Baglioni 6bd1eca7e0 merge branch with beta 2021-08-05 15:23:32 +02:00
Miriam Baglioni 73dc082927 added new dumped field (openaccessroute, pid and alternate identifier at the level of the instance) and the bipFinder measure at the level of the result 2021-08-05 15:20:50 +02:00
Miriam Baglioni ee13da9258 merge branch with master 2021-08-05 11:34:20 +02:00
Miriam Baglioni bd096f5170 removed not needed param file 2021-08-05 10:55:43 +02:00
Miriam Baglioni 5faeefbda8 added script to download the dump, changed the workflow input parameters 2021-08-05 10:54:03 +02:00
Miriam Baglioni 1965e4eece new workflow for downloading the dump of crossref and unpack it 2021-08-04 18:29:03 +02:00
Claudio Atzori 83c04e5d28 mapping test for dataset records adapted to reflect the delegated pid authority (zenodo) 2021-08-04 10:37:57 +02:00
Miriam Baglioni b4eb026c8b merging with branch beta 2021-08-04 10:21:37 +02:00
Miriam Baglioni c7b71647c6 Hosted By Map - modification of the resource for testing the presence of only one entry per datasource id 2021-08-04 10:20:02 +02:00
Miriam Baglioni eb8c3f8594 Hosted By Map - test modified because of the application of the new aggregator on datasources 2021-08-04 10:19:17 +02:00
Miriam Baglioni e94ae0b1de Hosted By Map - extension of the workflow to consider also the application of the map to publications and datasources 2021-08-04 10:18:11 +02:00
Miriam Baglioni 67ba4c40e0 Hosted By Map - added parameter resources 2021-08-04 10:17:28 +02:00
Miriam Baglioni eccf3851b0 Hosted By Map - refactoring 2021-08-04 10:16:30 +02:00
Sandro La Bruzzo 74afe43c3a fixed wrong test file 2021-08-04 10:16:17 +02:00
Miriam Baglioni 1e952cccf6 Hosted By Map - refactoring and deletion of not needed methods 2021-08-04 10:15:43 +02:00
Miriam Baglioni 8ba8c77f92 Hosted By Map - refactoring 2021-08-04 10:14:57 +02:00
Miriam Baglioni 8f7623e77a Hosted By Map - refactoring and application of the new aggregator 2021-08-04 10:14:20 +02:00
Sandro La Bruzzo 3fc820203b fixed wrong test file 2021-08-04 10:13:59 +02:00
Miriam Baglioni a7bf314fd2 Hosted By Map - added new aggregator to get just one result per datasource id 2021-08-04 10:13:30 +02:00
Miriam Baglioni 9831725073 Hosted By Map - remove from workflow a step not needed. The hbm will also take care of the integration of the unibi list of gold openaccess journals 2021-08-03 11:02:17 +02:00
Miriam Baglioni 100e54e6c8 merging with branch beta 2021-08-03 10:47:11 +02:00
Miriam Baglioni 461b8a29a0 removed not needed class 2021-08-03 10:46:51 +02:00
Miriam Baglioni 327cddde33 Hosted By Map - refactoring 2021-08-03 10:44:13 +02:00
Miriam Baglioni 17292c6641 Hosted By Map - resources for testing purposes 2021-08-02 19:37:08 +02:00
Miriam Baglioni ee7ccb98dc Hosted By Map - test class to verify the application of the hbm to results and datasource 2021-08-02 19:36:18 +02:00
Miriam Baglioni 90e91486e2 Hosted By Map - test class to verify each step in the preparation process 2021-08-02 19:35:52 +02:00
Miriam Baglioni 1e859706a3 Hosted By Map - Classes to apply the HBM to results and datasources 2021-08-02 19:35:23 +02:00
Miriam Baglioni 72df8f9232 Hosted By Map - removed the aggregator for the datasource (it is no longer needed) and added a new aggregator for the results. Changed also the hostedBYMap aggregator 2021-08-02 19:34:44 +02:00
Miriam Baglioni ff1ce75e33 Hosted By Map - modification in the code to prepare the info needed to apply the HostedByMap. There is no need to join datasources with the hbm: all the information needed is in the hosted by map already 2021-08-02 19:32:59 +02:00
Claudio Atzori e826aae848 using constants from ModelConstants 2021-08-02 14:28:59 +02:00
Claudio Atzori fd55c77d97 updated dependency dhp-schemas:2.7.15 2021-08-02 13:48:42 +02:00
Antonis Lempesis 117c3d5c67 fixed a typo 2021-08-02 12:15:58 +03:00
Miriam Baglioni 1695d45bd4 Hosted By Map - Test class to verify the preparation of the intermediate information 2021-07-30 17:57:01 +02:00
Miriam Baglioni 7c6ea2f4c7 Hosted By Map - first attempt at the creation of intermediate information to be used to apply the hosted by map on the graph entities 2021-07-30 17:56:27 +02:00
Miriam Baglioni d8b9b0553b Hosted By Map - model classes to store the intermediate information to be used to apply the hosted by map 2021-07-30 17:55:39 +02:00
Miriam Baglioni 613bd3bde0 Hosted By Map - refactor of the first attempt to prepare a new hosted by map dependent on the datasource in the graph and on two external sources: the gold list from unibi and the doaj list of open access journals. Both lists are downloaded from the provided url parameter 2021-07-30 17:54:45 +02:00
Miriam Baglioni d1807781c0 merging with branch beta 2021-07-30 14:34:07 +02:00
Miriam Baglioni 1d6ac3715b merge branch with beta 2021-07-30 11:58:29 +02:00
Claudio Atzori e244f73165 Update 'README.md' 2021-07-30 11:54:38 +02:00
Claudio Atzori 11e26c020a Update 'README.md' 2021-07-30 11:54:13 +02:00
Claudio Atzori 19620eed46 applying PR#131, Patch the identifiers (source/target) in the relations, refinements 2021-07-30 11:09:32 +02:00
Claudio Atzori 5219d56be5 Merge pull request 'Patch the identifiers (source/target) in the relations, refinements' (#131) from fct_project_id_replacement into master
Reviewed-on: D-Net/dnet-hadoop#131
2021-07-30 11:07:54 +02:00
Claudio Atzori 4f78565c04 fixed implementation of PatchRelationsApplication, refined the relative unit test 2021-07-30 11:07:09 +02:00
Claudio Atzori a6a38cca9e fixed implementation of PatchRelationsApplication, refined the relative unit test 2021-07-30 11:06:11 +02:00
Miriam Baglioni 9bc4fd3b69 Patch FCT relations - fixed issue with join 2021-07-30 10:34:05 +02:00
Miriam Baglioni 2fc89fc9b5 Merge branch 'fct_project_id_replacement' of https://code-repo.d4science.org/D-Net/dnet-hadoop into fct_project_id_replacement 2021-07-30 10:20:43 +02:00
Claudio Atzori 081fe92a21 Merge branch 'fct_project_id_replacement' of https://code-repo.d4science.org/D-Net/dnet-hadoop into fct_project_id_replacement 2021-07-30 10:13:56 +02:00
Claudio Atzori 576693d782 added unit test for PatchRelationsApplication 2021-07-30 10:13:33 +02:00
Claudio Atzori 55e6470f44 Merge pull request 'added the sprint 2 indicators in monitor db' (#129) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#129
2021-07-30 10:11:46 +02:00
Sandro La Bruzzo 6358f92c3a added sleep to solve problem of lost request of creating index 2021-07-30 08:54:37 +02:00
Antonis Lempesis 26af0320d0 added the sprint 2 indicators in monitor db 2021-07-30 00:31:33 +03:00
Claudio Atzori 7b172e7cd9 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2021-07-29 13:57:06 +02:00
Claudio Atzori c53d106e80 [provision] lowercase relation filter 2021-07-29 13:57:00 +02:00
Claudio Atzori 6e3554a45e [provision] lowercase relation filter 2021-07-29 13:56:37 +02:00
Sandro La Bruzzo b1b0cc3f15 fixed wrong package name 2021-07-29 13:55:08 +02:00
Miriam Baglioni baad01cadc hostedbymap 2021-07-29 13:04:39 +02:00
Claudio Atzori e725c88ebb [raw_all] patching relation identifier phase to be run at the end, i.e. includes also claimed relations 2021-07-29 13:03:43 +02:00
Claudio Atzori 5d08ad86ae [raw_all] patching relation identifier phase to be run at the end, i.e. includes also claimed relations 2021-07-29 13:03:16 +02:00
Claudio Atzori e87e1805c4 [raw_all] added extra workflow step for patching the identifiers in the relations, given an id mapping dataset 2021-07-29 12:13:06 +02:00
Claudio Atzori f83dd70e1c Merge pull request 'Patch the identifiers (source/target) in the relations' (#125) from fct_project_id_replacement into master
Reviewed-on: D-Net/dnet-hadoop#125
2021-07-29 12:11:27 +02:00
Claudio Atzori 5f7330d407 Merge branch 'master' into fct_project_id_replacement 2021-07-29 11:38:22 +02:00
Claudio Atzori 1923c1ce21 replaced full join + filtering with a left join 2021-07-29 11:36:20 +02:00
Claudio Atzori dc55ed4acd Merge pull request '[beta] stats update workflow' (#128) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#128
2021-07-29 11:13:21 +02:00
Claudio Atzori 908f57a475 code formatting 2021-07-29 10:49:39 +02:00
Sandro La Bruzzo 3721df7aa6 refactoring create actionset of scholexplorer, moved on package dhp-aggregation 2021-07-29 10:45:35 +02:00
Michele Artini 6aef3e8f46 Merge pull request '[broker] updated relation descriptors' (#127) from broker_relations_upgrade into beta
Reviewed-on: D-Net/dnet-hadoop#127
2021-07-29 08:16:49 +02:00
Antonis Lempesis 4afa5215a9 fixed a NPE? 2021-07-28 21:59:12 +03:00
Antonis Lempesis 3d1580fa9b fixed a typo 2021-07-28 18:50:31 +03:00
Claudio Atzori 4c5a71ba2f [broker] updated relation descriptors, making use of constant values 2021-07-28 17:11:18 +02:00
Claudio Atzori a9961a1835 [cleaning] title cleaning based on the me.xuender:unidecode library 2021-07-28 16:36:33 +02:00
Claudio Atzori e1797c0a42 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2021-07-28 16:21:36 +02:00
Claudio Atzori 6dddad86ee [cleaning] title cleaning based on the me.xuender:unidecode library 2021-07-28 16:21:29 +02:00
Sandro La Bruzzo 3d8f0f629b implemented workflow of creation action set for scholexplorer 2021-07-28 16:15:34 +02:00
Antonis Lempesis 9b181ffa73 added the h2020 classification scheme for projects 2021-07-28 16:31:29 +03:00
Alessia Bardi df8715a1ec format code after mvn compile 2021-07-28 11:58:26 +02:00
Michele Artini 3e2a2d6e71 added new fields in xml 2021-07-28 11:56:55 +02:00
Alessia Bardi c806387d4b tests for enermaps 2021-07-28 11:54:36 +02:00
Alessia Bardi 9594343725 code formatting after mvn compile 2021-07-28 11:41:34 +02:00
Claudio Atzori 2fff24df55 code formatting 2021-07-28 11:34:19 +02:00
Michele Artini 9f1c7b8e17 tests 2021-07-28 11:32:34 +02:00
Claudio Atzori b346feed36 Merge pull request 'Change the access right in DoiBoost' (#126) from doiboosi_accessright into beta
Reviewed-on: D-Net/dnet-hadoop#126
2021-07-28 11:29:15 +02:00
Antonis Lempesis 4a9741825d added result_orcid, result_project provenance, issn in datasources 2021-07-28 12:28:04 +03:00
Miriam Baglioni 3d2bba3d5d removing not needed classes 2021-07-28 11:25:43 +02:00
Miriam Baglioni cc0d3d8a7b merging with branch beta 2021-07-28 11:24:46 +02:00
Michele Artini e6f1773d63 mapping of new eosc fields 2021-07-28 11:17:11 +02:00
Miriam Baglioni 80d5b3b4de DoiBoost AccessRigh #4362 - removing commented code 2021-07-28 11:16:49 +02:00
Miriam Baglioni 5fe016dcbc DoiBoost AccessRigh #4362 - related to https://code-repo.d4science.org/D-Net/dnet-hadoop/pulls/126/files#issuecomment-4194 2021-07-28 11:14:28 +02:00
Miriam Baglioni 73ed7374a9 merging with branch beta 2021-07-28 11:05:16 +02:00
Miriam Baglioni 43e62fcae9 DoiBoost AccessRigh #4362 - related to https://code-repo.d4science.org/D-Net/dnet-hadoop/pulls/126/files#issuecomment-4193 2021-07-28 11:04:55 +02:00
Michele Artini c72c960ffb added eosc fields 2021-07-28 11:03:15 +02:00
Michele Artini 1fb572a33a added eosc fields 2021-07-28 10:52:24 +02:00
Miriam Baglioni 708d0ade34 Merge branch 'beta' into hostedbymap 2021-07-28 10:37:22 +02:00
Sandro La Bruzzo 16c91203bd implemented workflow of creation action set for scholexplorer 2021-07-28 10:30:49 +02:00
Miriam Baglioni 6c936943aa merging with branch beta 2021-07-28 10:24:48 +02:00
Miriam Baglioni 0424f47494 HostedByMap fixing issues 2021-07-28 10:24:13 +02:00
Michele Artini 52e2315ba2 removed trick for datasourcetypeui 2021-07-28 10:23:00 +02:00
Claudio Atzori d267dce520 [raw_all] added extra workflow step for patching the identifiers in the relations, given an id mapping dataset 2021-07-27 17:18:29 +02:00
Sandro La Bruzzo 825d9f0289 fixed datacite workflow starting from Importing delta 2021-07-27 16:09:46 +02:00
Claudio Atzori 5aa7d16d1b updated assertions in eu.dnetlib.dhp.oa.graph.raw.MappersTest 2021-07-27 15:11:58 +02:00
Claudio Atzori 998b66855a updated assertions in eu.dnetlib.dhp.oa.graph.raw.MappersTest 2021-07-27 15:11:37 +02:00
Antonis Lempesis 1a28a69cac changed the citeee in *_citations to cites 2021-07-27 15:14:09 +03:00
Miriam Baglioni 74f801b689 merging with branch beta 2021-07-27 13:18:31 +02:00
Miriam Baglioni 35e395eae8 merge with master 2021-07-27 12:34:59 +02:00
Miriam Baglioni eb07f7f40f Hosted By Map 2021-07-27 12:27:26 +02:00
Antonis Lempesis ed185fd7ed added missing colons 2021-07-27 11:42:47 +03:00
Antonis Lempesis f3b9570354 properly invalidating metadata 2021-07-26 13:00:16 +03:00
Sandro La Bruzzo 848aabbb6c minor fix 2021-07-25 12:06:41 +02:00
Sandro La Bruzzo 8fac10c91e fixed definition wf of creation final infospace of scholexplorer 2021-07-25 11:15:37 +02:00
Sandro La Bruzzo 3920c69bc8 change implementation of resolve Relation to generate jsonRdd in output 2021-07-25 09:51:36 +02:00
Antonis Lempesis f9fbb0f261 added indicators second sprint 2021-07-24 16:40:28 +03:00
Claudio Atzori a0393607a7 mapping funding relations from Datacite should be done according to the actual result identifier 2021-07-23 18:15:08 +02:00
Sandro La Bruzzo d9e3b89937 implemented last part of workflows to generate scholixGraph 2021-07-23 16:38:32 +02:00
Sandro La Bruzzo cfde63a7c3 fixed resolve relation join 2021-07-23 14:17:29 +02:00
Sandro La Bruzzo 4a439c3863 NPE fixed 2021-07-23 14:17:29 +02:00
Claudio Atzori bc835d2024 [cleaning] fixed filtering function for missing titles 2021-07-23 11:56:13 +02:00
Sandro La Bruzzo ca74e8dd02 create a separate wf for resolving relation 2021-07-23 11:40:06 +02:00
Sandro La Bruzzo 43e9380cd3 update resolve relation to use the same format of openaire graph 2021-07-23 11:25:18 +02:00
Sandro La Bruzzo 058b636d4d added control to check if the entity exists 2021-07-22 16:08:54 +02:00
Sandro La Bruzzo 62ae36a3d2 fixed NPE 2021-07-22 15:41:38 +02:00
Miriam Baglioni 63553a76b3 added code to download gold issn list from unibi 2021-07-22 12:01:48 +02:00
Miriam Baglioni 1a5b114906 DoiBoost AccessRigh #4362 - refactoring 2021-07-22 12:00:23 +02:00
Sandro La Bruzzo d94565862a fixed NPE 2021-07-21 21:23:11 +02:00
Sandro La Bruzzo 31d2d6d41e Scholexplorer: introduction of dedup openaire 2021-07-21 18:09:32 +02:00
Miriam Baglioni b226ba4439 merging with branch beta 2021-07-21 09:46:40 +02:00
Claudio Atzori 10d7b4f0b4 filtering 'old' OpenAIRE ids from the entity.originalId[] array in the OAF -> XML serialization procedure 2021-07-20 11:52:05 +02:00
Miriam Baglioni 83fe31c92e changed the name of the workflows 2021-07-19 18:19:14 +02:00
Miriam Baglioni dd81c36b60 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2021-07-19 18:18:14 +02:00
Miriam Baglioni 54acc5373b changed the name of the workflows 2021-07-19 18:18:09 +02:00
Miriam Baglioni b420b11ed3 duplicate the number of partitions in ProcessMag 2021-07-19 18:16:23 +02:00
Claudio Atzori 65934888a1 adding record identifier among the originalIds regardless of what IdentifierFactory produces 2021-07-19 17:52:52 +02:00
Claudio Atzori 0977baf41d contents mapped from the stores with 'claim' interpretation will not change their identifier along their way towards the graph 2021-07-19 17:43:52 +02:00
Miriam Baglioni 662c396354 duplicate the number of partitions in ConvertCrossrefToOaf 2021-07-19 12:41:14 +02:00
Miriam Baglioni 59530a14fb DoiBoost AccessRigh #4362 - set BestAccessRight with the usual comparator 2021-07-19 12:34:35 +02:00
Miriam Baglioni 199123b74b DoiBoost AccessRigh #4362 - Fixed issue on date formatting. Added test method and associated resource 2021-07-16 17:30:27 +02:00
Miriam Baglioni c4b18e6ccb changed the download.sh, added skip step to allow to not execute one phase and changed the workflow sequence of steps 2021-07-16 15:01:25 +02:00
Miriam Baglioni acd6056330 added shell action to automatically download the new dump and put it in a specified hdfs location 2021-07-16 12:47:10 +02:00
Miriam Baglioni 3bc9a05bc9 merging with branch beta 2021-07-16 10:32:27 +02:00
Miriam Baglioni 34506df1b6 DoiBoost AccessRigh #4362 - if the journal is open, the OPEN access right is set to all instances and color is GOLD (overwrite if the color was already set in one of the previous steps) 2021-07-16 10:29:51 +02:00
Claudio Atzori bf9e0d2d4f Merge pull request 'orcid-no-doi' (#123) from enrico.ottonello/dnet-hadoop:orcid-no-doi into beta
Reviewed-on: D-Net/dnet-hadoop#123
2021-07-15 17:59:41 +02:00
Sandro La Bruzzo 7e2caafe84 Scholexplorer: fixed mapping typologies 2021-07-15 09:53:12 +02:00
Miriam Baglioni 4da46bb62f merging with branch beta 2021-07-14 15:08:52 +02:00
Miriam Baglioni 09ad7b2a9e DoiBoost AccessRigh #4362 - Unpaywall mapped to OAF with OPEN instance (non oa are filtered out) (unknown hostedby) + map the color as it is 2021-07-14 14:45:21 +02:00
Miriam Baglioni f4f7c6f9d3 DoiBoost AccessRigh #4362 - Unpaywall mapped to OAF with OPEN instance (non oa are filtered out) (unknown hostedby) + map the color as it is 2021-07-14 14:44:54 +02:00
Miriam Baglioni 6222adf176 DoiBoost AccessRigh #4362 - added resources and test for crossref mapping (licence part included) 2021-07-14 14:42:34 +02:00
Miriam Baglioni 981b1018f6 DoiBoost AccessRigh #4362 - decide access right according to licence. Default access right is Unknown 2021-07-14 14:42:06 +02:00
Sandro La Bruzzo 3d8e2aa146 Code refactor:
- removed old workflows in doiboost
- split the doiboost workflow into preprocess and process
2021-07-14 14:37:06 +02:00
Miriam Baglioni 441701c85c DoiBoost AccessRigh #4362 - If multiple licenses are available, take the one applied to 'vor' 2021-07-14 14:14:50 +02:00
Sandro La Bruzzo c35c117601 fixed process doiboost workflow:
- split OrcidToOAF into two phases, preprocess and process
- updated workflow used in production
2021-07-14 12:48:01 +02:00
Miriam Baglioni 774cdb190e changes to mirror the last dump of the graph with the ols data model. 2021-07-13 18:57:24 +02:00
Miriam Baglioni 886617afd0 One result linked to more than one project is saved just once 2021-07-13 18:15:35 +02:00
Miriam Baglioni 320cf02d96 Changed the way to find results linked to projects. We verify to actually have the project on the graph before selecting the result 2021-07-13 18:13:32 +02:00
Miriam Baglioni 52ce35d57b - 2021-07-13 18:08:46 +02:00
Miriam Baglioni 970b387b8d modification to allow dump of a single community 2021-07-13 18:08:10 +02:00
Miriam Baglioni eae10c5894 modification to allow the dump for a single community 2021-07-13 18:07:25 +02:00
Miriam Baglioni c028feef4f workflow for the dump as sub workflows 2021-07-13 18:06:44 +02:00
Miriam Baglioni d70f8c96fd funding contains and not starts with h2020 2021-07-13 17:34:53 +02:00
Miriam Baglioni 5e38c7f42d dumping only communities with status all 2021-07-13 17:32:38 +02:00
Miriam Baglioni d418c309f5 removed the part after part-x- in the file name generated by spark. It was too long and created problems while creating the tar entries 2021-07-13 17:11:49 +02:00
Miriam Baglioni 618d2de2da minor changes and refactoring 2021-07-13 17:10:02 +02:00
Miriam Baglioni 59615da65e Add test to verify the creation of relation between context and projects 2021-07-13 17:09:15 +02:00
Miriam Baglioni 084b4ef999 added the creation of the openaireId from funder and grant number if the element is not present in the context profile 2021-07-13 17:07:46 +02:00
Miriam Baglioni 8f322a73cb change because of the renaming of originalId in acronym 2021-07-13 16:22:58 +02:00
Miriam Baglioni 72397ea1ba Added fix for community of arbitrary name length 2021-07-13 16:18:35 +02:00
Miriam Baglioni 5295d10691 added check not to dump deletedByInference entities 2021-07-13 16:11:46 +02:00
Miriam Baglioni e9a17ec899 added check to verify not to add void APC 2021-07-13 15:53:35 +02:00
Miriam Baglioni 8429aed6c6 Added resource for testing selection of valid relations 2021-07-13 15:49:38 +02:00
Miriam Baglioni 39b1a6edf6 added test class for the selection of valid relations and description 2021-07-13 15:23:09 +02:00
Miriam Baglioni 9a58f1b93d added logic to select only the valid relations: those not deletedbyinference and having both part of the relation as entities in the graph 2021-07-13 15:20:39 +02:00
Miriam Baglioni 13c66e16be changed logic to split for communities 2021-07-13 15:15:27 +02:00
Miriam Baglioni 6410ab71d8 added APC in the dump and test method 2021-07-13 15:13:58 +02:00
Miriam Baglioni 65a242646d added resource for APC dump 2021-07-13 14:45:25 +02:00
Miriam Baglioni 4b432fbee8 extended test class 2021-07-13 14:40:39 +02:00
Miriam Baglioni 87a6e2b967 extended test class 2021-07-13 14:38:28 +02:00
Miriam Baglioni 69fd40fd30 modified code to split the Croatian funder 2021-07-13 14:35:26 +02:00
Miriam Baglioni 86e50f7311 modified code to split the Croatian funder 2021-07-13 14:31:45 +02:00
Miriam Baglioni da88c850c6 changed the logic to verify if a community is contained in the list of context of a result 2021-07-13 14:22:44 +02:00
Miriam Baglioni 2f66fedfec changed the logic to verify if a community is contained in the list of context of a result 2021-07-13 14:22:23 +02:00
Claudio Atzori bc4b86c27c updated URL in the issueManagement tag 2021-07-13 11:54:32 +02:00
760 changed files with 43207 additions and 6955 deletions

View File

@ -1,2 +1,2 @@
# dnet-hadoop
Dnet-hadoop is a tool for
Dnet-hadoop is the project that defines all the OOZIE workflows for the OpenAIRE Graph construction, processing, and provisioning.

View File

@ -8,8 +8,6 @@ import java.util.List;
import org.apache.commons.lang.ArrayUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.maven.plugin.AbstractMojo;
import org.apache.maven.plugin.MojoExecutionException;
import org.apache.maven.plugin.MojoFailureException;
/**
* Generates oozie properties which were not provided from commandline.
@ -27,7 +25,7 @@ public class GenerateOoziePropertiesMojo extends AbstractMojo {
};
@Override
public void execute() throws MojoExecutionException, MojoFailureException {
public void execute() {
if (System.getProperties().containsKey(PROPERTY_NAME_WF_SOURCE_DIR)
&& !System.getProperties().containsKey(PROPERTY_NAME_SANDBOX_NAME)) {
String generatedSandboxName = generateSandboxName(
@ -46,24 +44,24 @@ public class GenerateOoziePropertiesMojo extends AbstractMojo {
/**
* Generates sandbox name from workflow source directory.
*
* @param wfSourceDir
* @param wfSourceDir workflow source directory
* @return generated sandbox name
*/
private String generateSandboxName(String wfSourceDir) {
// utilize all dir names until finding one of the limiters
List<String> sandboxNameParts = new ArrayList<String>();
List<String> sandboxNameParts = new ArrayList<>();
String[] tokens = StringUtils.split(wfSourceDir, File.separatorChar);
ArrayUtils.reverse(tokens);
if (tokens.length > 0) {
for (String token : tokens) {
for (String limiter : limiters) {
if (limiter.equals(token)) {
return sandboxNameParts.size() > 0
return !sandboxNameParts.isEmpty()
? StringUtils.join(sandboxNameParts.toArray())
: null;
}
}
if (sandboxNameParts.size() > 0) {
if (!sandboxNameParts.isEmpty()) {
sandboxNameParts.add(0, File.separator);
}
sandboxNameParts.add(0, token);
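
The hunk above walks the workflow source directory from its last path segment backwards and stops at the first "limiter". A minimal sketch of that idea follows, using only the JDK. The limiter values, the fixed "/" separator, and the behaviour when no limiter is found are assumptions made for illustration; none of them is visible in this diff. The class name SandboxNameSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SandboxNameSketch {

    // Assumption: the real limiter list is not shown in the diff; these values are illustrative.
    static final List<String> LIMITERS = Arrays.asList("dhp", "dnetlib", "eu");

    /** Returns the path segments that follow the last limiter, or null if there are none. */
    static String generateSandboxName(String wfSourceDir) {
        // Split into non-empty segments and walk them from the end of the path.
        List<String> tokens = new ArrayList<>();
        for (String t : wfSourceDir.split("/")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        if (tokens.isEmpty()) {
            return null; // assumption: an empty directory yields no sandbox name
        }
        Collections.reverse(tokens);

        List<String> parts = new ArrayList<>();
        for (String token : tokens) {
            if (LIMITERS.contains(token)) {
                // Stop at the first limiter met while walking backwards.
                return parts.isEmpty() ? null : String.join("/", parts);
            }
            parts.add(0, token); // prepend, so the original order is restored
        }
        // Assumption: without any limiter the whole path is used.
        return String.join("/", parts);
    }

    public static void main(String[] args) {
        System.out.println(generateSandboxName("eu/dnetlib/dhp/wf/transformers"));
    }
}
```

Prepending each segment keeps the result in path order without a second reversal, which is the same trick the original loop uses with add(0, token).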

View File

@ -16,6 +16,7 @@ import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
@ -289,7 +290,7 @@ public class WritePredefinedProjectProperties extends AbstractMojo {
*/
protected List<String> getEscapeChars(String escapeChars) {
List<String> tokens = getListFromCSV(escapeChars);
List<String> realTokens = new ArrayList<String>();
List<String> realTokens = new ArrayList<>();
for (String token : tokens) {
String realToken = getRealToken(token);
realTokens.add(realToken);
@ -324,7 +325,7 @@ public class WritePredefinedProjectProperties extends AbstractMojo {
* @return content
*/
protected String getContent(String comment, Properties properties, List<String> escapeTokens) {
List<String> names = new ArrayList<String>(properties.stringPropertyNames());
List<String> names = new ArrayList<>(properties.stringPropertyNames());
Collections.sort(names);
StringBuilder sb = new StringBuilder();
if (!StringUtils.isBlank(comment)) {
@ -352,7 +353,7 @@ public class WritePredefinedProjectProperties extends AbstractMojo {
throws MojoExecutionException {
try {
String content = getContent(comment, properties, escapeTokens);
FileUtils.writeStringToFile(file, content, ENCODING_UTF8);
FileUtils.writeStringToFile(file, content, StandardCharsets.UTF_8);
} catch (IOException e) {
throw new MojoExecutionException("Error creating properties file", e);
}
@ -399,9 +400,9 @@ public class WritePredefinedProjectProperties extends AbstractMojo {
*/
protected static final List<String> getListFromCSV(String csv) {
if (StringUtils.isBlank(csv)) {
return new ArrayList<String>();
return new ArrayList<>();
}
List<String> list = new ArrayList<String>();
List<String> list = new ArrayList<>();
String[] tokens = StringUtils.split(csv, ",");
for (String token : tokens) {
list.add(token.trim());

View File

@ -9,18 +9,18 @@ import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
/** @author mhorst, claudio.atzori */
public class GenerateOoziePropertiesMojoTest {
class GenerateOoziePropertiesMojoTest {
private final GenerateOoziePropertiesMojo mojo = new GenerateOoziePropertiesMojo();
@BeforeEach
public void clearSystemProperties() {
void clearSystemProperties() {
System.clearProperty(PROPERTY_NAME_SANDBOX_NAME);
System.clearProperty(PROPERTY_NAME_WF_SOURCE_DIR);
}
@Test
public void testExecuteEmpty() throws Exception {
void testExecuteEmpty() throws Exception {
// execute
mojo.execute();
@ -29,7 +29,7 @@ public class GenerateOoziePropertiesMojoTest {
}
@Test
public void testExecuteSandboxNameAlreadySet() throws Exception {
void testExecuteSandboxNameAlreadySet() throws Exception {
// given
String workflowSourceDir = "eu/dnetlib/dhp/wf/transformers";
String sandboxName = "originalSandboxName";
@ -44,7 +44,7 @@ public class GenerateOoziePropertiesMojoTest {
}
@Test
public void testExecuteEmptyWorkflowSourceDir() throws Exception {
void testExecuteEmptyWorkflowSourceDir() throws Exception {
// given
String workflowSourceDir = "";
System.setProperty(PROPERTY_NAME_WF_SOURCE_DIR, workflowSourceDir);
@ -57,7 +57,7 @@ public class GenerateOoziePropertiesMojoTest {
}
@Test
public void testExecuteNullSandboxNameGenerated() throws Exception {
void testExecuteNullSandboxNameGenerated() throws Exception {
// given
String workflowSourceDir = "eu/dnetlib/dhp/";
System.setProperty(PROPERTY_NAME_WF_SOURCE_DIR, workflowSourceDir);
@ -70,7 +70,7 @@ public class GenerateOoziePropertiesMojoTest {
}
@Test
public void testExecute() throws Exception {
void testExecute() throws Exception {
// given
String workflowSourceDir = "eu/dnetlib/dhp/wf/transformers";
System.setProperty(PROPERTY_NAME_WF_SOURCE_DIR, workflowSourceDir);
@ -83,7 +83,7 @@ public class GenerateOoziePropertiesMojoTest {
}
@Test
public void testExecuteWithoutRoot() throws Exception {
void testExecuteWithoutRoot() throws Exception {
// given
String workflowSourceDir = "wf/transformers";
System.setProperty(PROPERTY_NAME_WF_SOURCE_DIR, workflowSourceDir);

View File

@ -20,7 +20,7 @@ import org.mockito.junit.jupiter.MockitoExtension;
/** @author mhorst, claudio.atzori */
@ExtendWith(MockitoExtension.class)
public class WritePredefinedProjectPropertiesTest {
class WritePredefinedProjectPropertiesTest {
@Mock
private MavenProject mavenProject;
@ -39,7 +39,7 @@ public class WritePredefinedProjectPropertiesTest {
// ----------------------------------- TESTS ---------------------------------------------
@Test
public void testExecuteEmpty() throws Exception {
void testExecuteEmpty() throws Exception {
// execute
mojo.execute();
@ -50,7 +50,7 @@ public class WritePredefinedProjectPropertiesTest {
}
@Test
public void testExecuteWithProjectProperties() throws Exception {
void testExecuteWithProjectProperties() throws Exception {
// given
String key = "projectPropertyKey";
String value = "projectPropertyValue";
@ -70,7 +70,7 @@ public class WritePredefinedProjectPropertiesTest {
}
@Test()
public void testExecuteWithProjectPropertiesAndInvalidOutputFile(@TempDir File testFolder) {
void testExecuteWithProjectPropertiesAndInvalidOutputFile(@TempDir File testFolder) {
// given
String key = "projectPropertyKey";
String value = "projectPropertyValue";
@ -84,7 +84,7 @@ public class WritePredefinedProjectPropertiesTest {
}
@Test
public void testExecuteWithProjectPropertiesExclusion(@TempDir File testFolder) throws Exception {
void testExecuteWithProjectPropertiesExclusion(@TempDir File testFolder) throws Exception {
// given
String key = "projectPropertyKey";
String value = "projectPropertyValue";
@ -108,7 +108,7 @@ public class WritePredefinedProjectPropertiesTest {
}
@Test
public void testExecuteWithProjectPropertiesInclusion(@TempDir File testFolder) throws Exception {
void testExecuteWithProjectPropertiesInclusion(@TempDir File testFolder) throws Exception {
// given
String key = "projectPropertyKey";
String value = "projectPropertyValue";
@ -132,7 +132,7 @@ public class WritePredefinedProjectPropertiesTest {
}
@Test
public void testExecuteIncludingPropertyKeysFromFile(@TempDir File testFolder) throws Exception {
void testExecuteIncludingPropertyKeysFromFile(@TempDir File testFolder) throws Exception {
// given
String key = "projectPropertyKey";
String value = "projectPropertyValue";
@ -164,7 +164,7 @@ public class WritePredefinedProjectPropertiesTest {
}
@Test
public void testExecuteIncludingPropertyKeysFromClasspathResource(@TempDir File testFolder)
void testExecuteIncludingPropertyKeysFromClasspathResource(@TempDir File testFolder)
throws Exception {
// given
String key = "projectPropertyKey";
@ -194,7 +194,7 @@ public class WritePredefinedProjectPropertiesTest {
}
@Test
public void testExecuteIncludingPropertyKeysFromBlankLocation() {
void testExecuteIncludingPropertyKeysFromBlankLocation() {
// given
String key = "projectPropertyKey";
String value = "projectPropertyValue";
@ -214,7 +214,7 @@ public class WritePredefinedProjectPropertiesTest {
}
@Test
public void testExecuteIncludingPropertyKeysFromXmlFile(@TempDir File testFolder)
void testExecuteIncludingPropertyKeysFromXmlFile(@TempDir File testFolder)
throws Exception {
// given
String key = "projectPropertyKey";
@ -247,7 +247,7 @@ public class WritePredefinedProjectPropertiesTest {
}
@Test
public void testExecuteIncludingPropertyKeysFromInvalidXmlFile(@TempDir File testFolder)
void testExecuteIncludingPropertyKeysFromInvalidXmlFile(@TempDir File testFolder)
throws Exception {
// given
String key = "projectPropertyKey";
@ -273,7 +273,7 @@ public class WritePredefinedProjectPropertiesTest {
}
@Test
public void testExecuteWithQuietModeOn(@TempDir File testFolder) throws Exception {
void testExecuteWithQuietModeOn(@TempDir File testFolder) throws Exception {
// given
mojo.setQuiet(true);
mojo.setIncludePropertyKeysFromFiles(new String[] {
@ -290,7 +290,7 @@ public class WritePredefinedProjectPropertiesTest {
}
@Test
public void testExecuteIncludingPropertyKeysFromInvalidFile() {
void testExecuteIncludingPropertyKeysFromInvalidFile() {
// given
mojo.setIncludePropertyKeysFromFiles(new String[] {
"invalid location"
@ -301,7 +301,7 @@ public class WritePredefinedProjectPropertiesTest {
}
@Test
public void testExecuteWithEnvironmentProperties(@TempDir File testFolder) throws Exception {
void testExecuteWithEnvironmentProperties(@TempDir File testFolder) throws Exception {
// given
mojo.setIncludeEnvironmentVariables(true);
@ -318,7 +318,7 @@ public class WritePredefinedProjectPropertiesTest {
}
@Test
public void testExecuteWithSystemProperties(@TempDir File testFolder) throws Exception {
void testExecuteWithSystemProperties(@TempDir File testFolder) throws Exception {
// given
String key = "systemPropertyKey";
String value = "systemPropertyValue";
@ -337,7 +337,7 @@ public class WritePredefinedProjectPropertiesTest {
}
@Test
public void testExecuteWithSystemPropertiesAndEscapeChars(@TempDir File testFolder)
void testExecuteWithSystemPropertiesAndEscapeChars(@TempDir File testFolder)
throws Exception {
// given
String key = "systemPropertyKey ";

View File

@ -25,6 +25,11 @@
<groupId>com.github.sisyphsu</groupId>
<artifactId>dateparser</artifactId>
</dependency>
<dependency>
<groupId>me.xuender</groupId>
<artifactId>unidecode</artifactId>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
@ -112,6 +117,11 @@
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-schemas</artifactId>
</dependency>
<dependency>
<groupId>com.opencsv</groupId>
<artifactId>opencsv</artifactId>
</dependency>
</dependencies>
</project>

View File

@ -1,14 +0,0 @@
package eu.dnetlib.dhp.application;
import java.io.*;
import java.util.Map;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import com.google.common.collect.Maps;
public class ApplicationUtils {
}

View File

@ -56,13 +56,13 @@ public class ArgumentApplicationParser implements Serializable {
final StringWriter stringWriter = new StringWriter();
IOUtils.copy(gis, stringWriter);
return stringWriter.toString();
} catch (Throwable e) {
log.error("Wrong value to decompress:" + abstractCompressed);
throw new RuntimeException(e);
} catch (IOException e) {
log.error("Wrong value to decompress: {}", abstractCompressed);
throw new IllegalArgumentException(e);
}
}
public static String compressArgument(final String value) throws Exception {
public static String compressArgument(final String value) throws IOException {
ByteArrayOutputStream out = new ByteArrayOutputStream();
GZIPOutputStream gzip = new GZIPOutputStream(out);
gzip.write(value.getBytes());

View File

@ -9,9 +9,6 @@ public class OptionsParameter {
private boolean paramRequired;
private boolean compressed;
public OptionsParameter() {
}
public String getParamName() {
return paramName;
}

View File

@ -34,7 +34,7 @@ public class ApiDescriptor {
return params;
}
public void setParams(final HashMap<String, String> params) {
public void setParams(final Map<String, String> params) {
this.params = params;
}

View File

@ -12,6 +12,9 @@ public class Constants {
public static String COAR_ACCESS_RIGHT_SCHEMA = "http://vocabularies.coar-repositories.org/documentation/access_rights/";
private Constants() {
}
static {
accessRightsCoarMap.put("OPEN", "c_abf2");
accessRightsCoarMap.put("RESTRICTED", "c_16ec");
@ -49,4 +52,10 @@ public class Constants {
public static final String CONTENT_INVALIDRECORDS = "InvalidRecords";
public static final String CONTENT_TRANSFORMEDRECORDS = "transformedItems";
// IETF Draft and used by Repositories like ZENODO , not included in APACHE HTTP java packages
// see https://ietf-wg-httpapi.github.io/ratelimit-headers/draft-ietf-httpapi-ratelimit-headers.html
public static final String HTTPHEADER_IETF_DRAFT_RATELIMIT_LIMIT = "X-RateLimit-Limit";
public static final String HTTPHEADER_IETF_DRAFT_RATELIMIT_REMAINING = "X-RateLimit-Remaining";
public static final String HTTPHEADER_IETF_DRAFT_RATELIMIT_RESET = "X-RateLimit-Reset";
}
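
The rate-limit header constants added above are consumed further down in this comparison by HttpConnector2, which pauses when the remaining quota drops below 2. A minimal sketch of that decision as a pure function follows. The class and method names (RateLimitSketch, shouldBackOff) are hypothetical, and treating an unparsable header as "do not back off" is an assumption; the code in the diff calls Integer.parseInt without such a guard.

```java
final class RateLimitSketch {

    // Threshold taken from the HttpConnector2 hunk: pause when fewer than 2 requests remain.
    static final int MIN_REMAINING = 2;

    /**
     * Decides whether the client should pause before the next request.
     * Both header values may be null when the server does not send them.
     */
    static boolean shouldBackOff(String rateLimit, String rateRemaining) {
        if (rateLimit == null || rateRemaining == null) {
            return false;
        }
        try {
            return Integer.parseInt(rateRemaining.trim()) < MIN_REMAINING;
        } catch (NumberFormatException e) {
            // Assumption: a malformed header is ignored rather than failing the request.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldBackOff("60", "1"));
        System.out.println(shouldBackOff("60", "30"));
    }
}
```

Keeping the check free of I/O makes it easy to unit test separately from the connection handling.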

View File

@ -84,7 +84,7 @@ public class GraphResultMapper implements Serializable {
.setDocumentationUrl(
value
.stream()
.map(v -> v.getValue())
.map(Field::getValue)
.collect(Collectors.toList())));
Optional
@ -100,20 +100,20 @@ public class GraphResultMapper implements Serializable {
.setContactgroup(
Optional
.ofNullable(ir.getContactgroup())
.map(value -> value.stream().map(cg -> cg.getValue()).collect(Collectors.toList()))
.map(value -> value.stream().map(Field::getValue).collect(Collectors.toList()))
.orElse(null));
out
.setContactperson(
Optional
.ofNullable(ir.getContactperson())
.map(value -> value.stream().map(cp -> cp.getValue()).collect(Collectors.toList()))
.map(value -> value.stream().map(Field::getValue).collect(Collectors.toList()))
.orElse(null));
out
.setTool(
Optional
.ofNullable(ir.getTool())
.map(value -> value.stream().map(t -> t.getValue()).collect(Collectors.toList()))
.map(value -> value.stream().map(Field::getValue).collect(Collectors.toList()))
.orElse(null));
out.setType(ModelConstants.ORP_DEFAULT_RESULTTYPE.getClassname());
@ -123,7 +123,8 @@ public class GraphResultMapper implements Serializable {
Optional
.ofNullable(input.getAuthor())
.ifPresent(ats -> out.setAuthor(ats.stream().map(at -> getAuthor(at)).collect(Collectors.toList())));
.ifPresent(
ats -> out.setAuthor(ats.stream().map(GraphResultMapper::getAuthor).collect(Collectors.toList())));
// I do not map Access Right UNKNOWN or OTHER
@ -210,7 +211,7 @@ public class GraphResultMapper implements Serializable {
if (oInst.isPresent()) {
out
.setInstance(
oInst.get().stream().map(i -> getInstance(i)).collect(Collectors.toList()));
oInst.get().stream().map(GraphResultMapper::getInstance).collect(Collectors.toList()));
}
@ -230,7 +231,7 @@ public class GraphResultMapper implements Serializable {
.stream()
.filter(t -> t.getQualifier().getClassid().equalsIgnoreCase("main title"))
.collect(Collectors.toList());
if (iTitle.size() > 0) {
if (!iTitle.isEmpty()) {
out.setMaintitle(iTitle.get(0).getValue());
}
@ -239,7 +240,7 @@ public class GraphResultMapper implements Serializable {
.stream()
.filter(t -> t.getQualifier().getClassid().equalsIgnoreCase("subtitle"))
.collect(Collectors.toList());
if (iTitle.size() > 0) {
if (!iTitle.isEmpty()) {
out.setSubtitle(iTitle.get(0).getValue());
}

View File

@ -28,7 +28,7 @@ public class HdfsSupport {
* @param configuration Configuration of hadoop env
*/
public static boolean exists(String path, Configuration configuration) {
logger.info("Removing path: {}", path);
logger.info("Checking existence for path: {}", path);
return rethrowAsRuntimeException(
() -> {
Path f = new Path(path);

View File

@ -14,38 +14,33 @@ public class MakeTarArchive implements Serializable {
private static TarArchiveOutputStream getTar(FileSystem fileSystem, String outputPath) throws IOException {
Path hdfsWritePath = new Path(outputPath);
FSDataOutputStream fsDataOutputStream = null;
if (fileSystem.exists(hdfsWritePath)) {
fileSystem.delete(hdfsWritePath, true);
}
fsDataOutputStream = fileSystem.create(hdfsWritePath);
return new TarArchiveOutputStream(fsDataOutputStream.getWrappedStream());
return new TarArchiveOutputStream(fileSystem.create(hdfsWritePath).getWrappedStream());
}
private static void write(FileSystem fileSystem, String inputPath, String outputPath, String dir_name)
throws IOException {
Path hdfsWritePath = new Path(outputPath);
FSDataOutputStream fsDataOutputStream = null;
if (fileSystem.exists(hdfsWritePath)) {
fileSystem.delete(hdfsWritePath, true);
}
fsDataOutputStream = fileSystem.create(hdfsWritePath);
try (TarArchiveOutputStream ar = new TarArchiveOutputStream(
fileSystem.create(hdfsWritePath).getWrappedStream())) {
TarArchiveOutputStream ar = new TarArchiveOutputStream(fsDataOutputStream.getWrappedStream());
RemoteIterator<LocatedFileStatus> iterator = fileSystem
.listFiles(
new Path(inputPath), true);
RemoteIterator<LocatedFileStatus> fileStatusListIterator = fileSystem
.listFiles(
new Path(inputPath), true);
while (iterator.hasNext()) {
writeCurrentFile(fileSystem, dir_name, iterator, ar, 0);
}
while (fileStatusListIterator.hasNext()) {
writeCurrentFile(fileSystem, dir_name, fileStatusListIterator, ar, 0);
}
ar.close();
}
public static void tarMaxSize(FileSystem fileSystem, String inputPath, String outputPath, String dir_name,
@ -90,6 +85,13 @@ public class MakeTarArchive implements Serializable {
String p_string = p.toString();
if (!p_string.endsWith("_SUCCESS")) {
String name = p_string.substring(p_string.lastIndexOf("/") + 1);
if (name.startsWith("part-") && name.length() > 10) {
String tmp = name.substring(0, 10);
if (name.contains(".")) {
tmp += name.substring(name.indexOf("."));
}
name = tmp;
}
TarArchiveEntry entry = new TarArchiveEntry(dir_name + "/" + name);
entry.setSize(fileStatus.getLen());
current_size += fileStatus.getLen();
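
The name-shortening step in the hunk above (described in commit d418c309f5 as removing the part after part-x- in Spark-generated file names) can be isolated as a small function. This is a sketch of the same logic outside the Hadoop context; the class and method names (TarEntryNameSketch, shorten) and the sample file name are hypothetical.

```java
final class TarEntryNameSketch {

    /** Keeps the first 10 characters of a "part-" file name plus everything from the first dot. */
    static String shorten(String name) {
        if (name.startsWith("part-") && name.length() > 10) {
            String tmp = name.substring(0, 10); // e.g. "part-00000"
            if (name.contains(".")) {
                tmp += name.substring(name.indexOf('.')); // keep the extension(s)
            }
            return tmp;
        }
        return name; // other names are left untouched
    }

    public static void main(String[] args) {
        System.out.println(shorten("part-00000-3f2a9c1e-aaaa-bbbb.json.gz"));
    }
}
```

Because the extension is taken from the first dot, a multi-part suffix such as .json.gz survives intact.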

View File

@ -10,8 +10,6 @@ import java.util.Optional;
import java.util.stream.StreamSupport;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.bson.Document;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@ -21,6 +19,7 @@ import com.mongodb.BasicDBObject;
import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;
import com.mongodb.QueryBuilder;
import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
@ -46,7 +45,7 @@ public class MdstoreClient implements Closeable {
final String currentId = Optional
.ofNullable(getColl(db, COLL_METADATA_MANAGER, true).find(query))
.map(r -> r.first())
.map(FindIterable::first)
.map(d -> d.getString("currentId"))
.orElseThrow(() -> new IllegalArgumentException("cannot find current mdstore id for: " + mdId));
@ -84,7 +83,7 @@ public class MdstoreClient implements Closeable {
if (!Iterables.contains(client.listDatabaseNames(), dbName)) {
final String err = String.format("Database '%s' not found in %s", dbName, client.getAddress());
log.warn(err);
throw new RuntimeException(err);
throw new IllegalArgumentException(err);
}
return client.getDatabase(dbName);
}
@ -97,7 +96,7 @@ public class MdstoreClient implements Closeable {
String.format("Missing collection '%s' in database '%s'", collName, db.getName()));
log.warn(err);
if (abortIfMissing) {
throw new RuntimeException(err);
throw new IllegalArgumentException(err);
} else {
return null;
}

View File

@ -24,7 +24,6 @@ import com.google.common.hash.Hashing;
*/
public class PacePerson {
private static final String UTF8 = "UTF-8";
private List<String> name = Lists.newArrayList();
private List<String> surname = Lists.newArrayList();
private List<String> fullname = Lists.newArrayList();

View File

@ -1,5 +1,5 @@
package eu.dnetlib.dhp.aggregation.common;
package eu.dnetlib.dhp.common.aggregation;
import java.io.Closeable;
import java.io.IOException;
@ -11,8 +11,6 @@ import java.util.Objects;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.gson.Gson;
import eu.dnetlib.dhp.message.MessageSender;
import eu.dnetlib.dhp.utils.DHPUtils;
@ -20,12 +18,12 @@ public class AggregatorReport extends LinkedHashMap<String, String> implements C
private static final Logger log = LoggerFactory.getLogger(AggregatorReport.class);
private MessageSender messageSender;
private transient MessageSender messageSender;
public AggregatorReport() {
}
public AggregatorReport(MessageSender messageSender) throws IOException {
public AggregatorReport(MessageSender messageSender) {
this.messageSender = messageSender;
}

View File

@ -5,6 +5,9 @@ import java.io.*;
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import org.apache.http.HttpHeaders;
import org.apache.http.entity.ContentType;
import com.google.gson.Gson;
import eu.dnetlib.dhp.common.api.zenodo.ZenodoModel;
@ -43,7 +46,7 @@ public class ZenodoAPIClient implements Serializable {
this.deposition_id = deposition_id;
}
public ZenodoAPIClient(String urlString, String access_token) throws IOException {
public ZenodoAPIClient(String urlString, String access_token) {
this.urlString = urlString;
this.access_token = access_token;
@ -63,8 +66,8 @@ public class ZenodoAPIClient implements Serializable {
Request request = new Request.Builder()
.url(urlString)
.addHeader("Content-Type", "application/json") // add request headers
.addHeader("Authorization", "Bearer " + access_token)
.addHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON.toString()) // add request headers
.addHeader(HttpHeaders.AUTHORIZATION, "Bearer " + access_token)
.post(body)
.build();
@ -103,8 +106,8 @@ public class ZenodoAPIClient implements Serializable {
Request request = new Request.Builder()
.url(bucket + "/" + file_name)
.addHeader("Content-Type", "application/zip") // add request headers
.addHeader("Authorization", "Bearer " + access_token)
.addHeader(HttpHeaders.CONTENT_TYPE, "application/zip") // add request headers
.addHeader(HttpHeaders.AUTHORIZATION, "Bearer " + access_token)
.put(InputStreamRequestBody.create(MEDIA_TYPE_ZIP, is, len))
.build();
@ -130,8 +133,8 @@ public class ZenodoAPIClient implements Serializable {
Request request = new Request.Builder()
.url(urlString + "/" + deposition_id)
.addHeader("Content-Type", "application/json") // add request headers
.addHeader("Authorization", "Bearer " + access_token)
.addHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON.toString()) // add request headers
.addHeader(HttpHeaders.AUTHORIZATION, "Bearer " + access_token)
.put(body)
.build();
@ -197,7 +200,7 @@ public class ZenodoAPIClient implements Serializable {
Request request = new Request.Builder()
.url(urlString + "/" + deposition_id + "/actions/newversion")
.addHeader("Authorization", "Bearer " + access_token)
.addHeader(HttpHeaders.AUTHORIZATION, "Bearer " + access_token)
.post(body)
.build();
@ -270,8 +273,8 @@ public class ZenodoAPIClient implements Serializable {
Request request = new Request.Builder()
.url(urlString)
.addHeader("Content-Type", "application/json") // add request headers
.addHeader("Authorization", "Bearer " + access_token)
.addHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON.toString()) // add request headers
.addHeader(HttpHeaders.AUTHORIZATION, "Bearer " + access_token)
.get()
.build();
@ -293,8 +296,8 @@ public class ZenodoAPIClient implements Serializable {
Request request = new Request.Builder()
.url(url)
.addHeader("Content-Type", "application/json") // add request headers
.addHeader("Authorization", "Bearer " + access_token)
.addHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON.toString()) // add request headers
.addHeader(HttpHeaders.AUTHORIZATION, "Bearer " + access_token)
.get()
.build();

View File

@ -32,13 +32,13 @@ public class Creator {
public static Creator newInstance(String name, String affiliation, String orcid) {
Creator c = new Creator();
if (!(name == null)) {
if (name != null) {
c.name = name;
}
if (!(affiliation == null)) {
if (affiliation != null) {
c.affiliation = affiliation;
}
if (!(orcid == null)) {
if (orcid != null) {
c.orcid = orcid;
}

View File

@ -3,17 +3,12 @@ package eu.dnetlib.dhp.common.api.zenodo;
import java.io.Serializable;
import net.minidev.json.annotate.JsonIgnore;
public class File implements Serializable {
private String checksum;
private String filename;
private long filesize;
private String id;
@JsonIgnore
// private Links links;
public String getChecksum() {
return checksum;
}
@ -46,13 +41,4 @@ public class File implements Serializable {
this.id = id;
}
// @JsonIgnore
// public Links getLinks() {
// return links;
// }
//
// @JsonIgnore
// public void setLinks(Links links) {
// this.links = links;
// }
}

View File

@ -1,5 +1,5 @@
package eu.dnetlib.dhp.collection;
package eu.dnetlib.dhp.common.collection;
public class CollectorException extends Exception {

View File

@ -0,0 +1,56 @@
package eu.dnetlib.dhp.common.collection;
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.opencsv.bean.CsvToBeanBuilder;
public class GetCSV {
public static final char DEFAULT_DELIMITER = ',';
private GetCSV() {
}
public static void getCsv(FileSystem fileSystem, BufferedReader reader, String hdfsPath,
String modelClass) throws IOException, ClassNotFoundException {
getCsv(fileSystem, reader, hdfsPath, modelClass, DEFAULT_DELIMITER);
}
public static void getCsv(FileSystem fileSystem, Reader reader, String hdfsPath,
String modelClass, char delimiter) throws IOException, ClassNotFoundException {
Path hdfsWritePath = new Path(hdfsPath);
FSDataOutputStream fsDataOutputStream = null;
if (fileSystem.exists(hdfsWritePath)) {
fileSystem.delete(hdfsWritePath, false);
}
fsDataOutputStream = fileSystem.create(hdfsWritePath);
try (BufferedWriter writer = new BufferedWriter(
new OutputStreamWriter(fsDataOutputStream, StandardCharsets.UTF_8))) {
final ObjectMapper mapper = new ObjectMapper();
@SuppressWarnings("unchecked")
final List lines = new CsvToBeanBuilder(reader)
.withType(Class.forName(modelClass))
.withSeparator(delimiter)
.build()
.parse();
for (Object line : lines) {
writer.write(mapper.writeValueAsString(line));
writer.newLine();
}
}
}
}

View File

@ -1,5 +1,5 @@
package eu.dnetlib.dhp.collection;
package eu.dnetlib.dhp.common.collection;
/**
* Bundles the http connection parameters driving the client behaviour.

View File

@ -1,5 +1,5 @@
package eu.dnetlib.dhp.collection;
package eu.dnetlib.dhp.common.collection;
import static eu.dnetlib.dhp.utils.DHPUtils.*;
@ -15,12 +15,13 @@ import org.apache.http.HttpHeaders;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import eu.dnetlib.dhp.aggregation.common.AggregatorReport;
import eu.dnetlib.dhp.common.Constants;
import eu.dnetlib.dhp.common.aggregation.AggregatorReport;
/**
* Migrated from https://svn.driver.research-infrastructures.eu/driver/dnet45/modules/dnet-modular-collector-service/trunk/src/main/java/eu/dnetlib/data/collector/plugins/HttpConnector.java
*
* @author jochen, michele, andrea, alessia, claudio
* @author jochen, michele, andrea, alessia, claudio, andreas
*/
public class HttpConnector2 {
@ -32,7 +33,7 @@ public class HttpConnector2 {
private String responseType = null;
private final String userAgent = "Mozilla/5.0 (compatible; OAI; +http://www.openaire.eu)";
private static final String userAgent = "Mozilla/5.0 (compatible; OAI; +http://www.openaire.eu)";
public HttpConnector2() {
this(new HttpClientParams());
@ -112,6 +113,17 @@ public class HttpConnector2 {
}
int retryAfter = obtainRetryAfter(urlConn.getHeaderFields());
String rateLimit = urlConn.getHeaderField(Constants.HTTPHEADER_IETF_DRAFT_RATELIMIT_LIMIT);
String rateRemaining = urlConn.getHeaderField(Constants.HTTPHEADER_IETF_DRAFT_RATELIMIT_REMAINING);
if ((rateLimit != null) && (rateRemaining != null) && (Integer.parseInt(rateRemaining) < 2)) {
if (retryAfter > 0) {
backoffAndSleep(retryAfter);
} else {
backoffAndSleep(1000);
}
}
if (is2xx(urlConn.getResponseCode())) {
input = urlConn.getInputStream();
responseType = urlConn.getContentType();
@ -120,7 +132,7 @@ public class HttpConnector2 {
if (is3xx(urlConn.getResponseCode())) {
// REDIRECTS
final String newUrl = obtainNewLocation(urlConn.getHeaderFields());
log.info(String.format("The requested url has been moved to %s", newUrl));
log.info("The requested url has been moved to {}", newUrl);
report
.put(
REPORT_PREFIX + urlConn.getResponseCode(),
@ -140,14 +152,14 @@ public class HttpConnector2 {
if (retryAfter > 0) {
log
.warn(
requestUrl + " - waiting and repeating request after suggested retry-after "
+ retryAfter + " sec.");
"{} - waiting and repeating request after suggested retry-after {} sec.",
requestUrl, retryAfter);
backoffAndSleep(retryAfter * 1000);
} else {
log
.warn(
requestUrl + " - waiting and repeating request after default delay of "
+ getClientParams().getRetryDelay() + " sec.");
"{} - waiting and repeating request after default delay of {} sec.",
requestUrl, getClientParams().getRetryDelay());
backoffAndSleep(retryNumber * getClientParams().getRetryDelay() * 1000);
}
report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl);
@ -181,12 +193,12 @@ public class HttpConnector2 {
}
private void logHeaderFields(final HttpURLConnection urlConn) throws IOException {
log.debug("StatusCode: " + urlConn.getResponseMessage());
log.debug("StatusCode: {}", urlConn.getResponseMessage());
for (Map.Entry<String, List<String>> e : urlConn.getHeaderFields().entrySet()) {
if (e.getKey() != null) {
for (String v : e.getValue()) {
log.debug(" key: " + e.getKey() + " - value: " + v);
log.debug(" key: {} - value: {}", e.getKey(), v);
}
}
}
@ -204,7 +216,7 @@ public class HttpConnector2 {
private int obtainRetryAfter(final Map<String, List<String>> headerMap) {
for (String key : headerMap.keySet()) {
if ((key != null) && key.equalsIgnoreCase(HttpHeaders.RETRY_AFTER) && (headerMap.get(key).size() > 0)
if ((key != null) && key.equalsIgnoreCase(HttpHeaders.RETRY_AFTER) && (!headerMap.get(key).isEmpty())
&& NumberUtils.isCreatable(headerMap.get(key).get(0))) {
return Integer.parseInt(headerMap.get(key).get(0)) + 10;
}
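
Review note on the rate-limit hunk above: the new branch calls `Integer.parseInt(rateRemaining)` directly, which would throw on a malformed header, and it passes `retryAfter` to `backoffAndSleep` unscaled while the later hunk passes `retryAfter * 1000`. The unit expected by `backoffAndSleep` is not visible in this diff, so the following is only a minimal sketch of the decision logic as a pure function. `RateLimitBackoffSketch`, `pauseFor` and `DEFAULT_PAUSE` are hypothetical names, not part of `HttpConnector2`.

```java
// Sketch only: hypothetical helper, not code from HttpConnector2.
class RateLimitBackoffSketch {

	// Assumed fallback pause, mirroring the literal 1000 passed to backoffAndSleep in the hunk.
	static final int DEFAULT_PAUSE = 1000;

	/**
	 * Returns how long to pause before the next request, or 0 when no pause is needed.
	 * The unit is whatever the caller's sleep routine expects; it is not visible in the diff.
	 */
	static int pauseFor(String rateLimit, String rateRemaining, int retryAfter) {
		if (rateLimit == null || rateRemaining == null) {
			return 0; // the server does not advertise a quota
		}
		final int remaining;
		try {
			remaining = Integer.parseInt(rateRemaining.trim());
		} catch (NumberFormatException e) {
			return 0; // malformed header: treat it as "no information" instead of failing
		}
		if (remaining >= 2) {
			return 0; // quota not yet exhausted
		}
		return retryAfter > 0 ? retryAfter : DEFAULT_PAUSE;
	}

	public static void main(String[] args) {
		System.out.println(pauseFor("100", "1", 30));
		System.out.println(pauseFor("100", "1", 0));
		System.out.println(pauseFor("100", "50", 30));
	}
}
```

Keeping the decision separate from the sleep call makes the branch testable without opening a connection.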

View File

@ -1,11 +1,11 @@
package eu.dnetlib.dhp.common.rest;
import java.io.IOException;
import java.util.Arrays;
import java.util.stream.Collectors;
import org.apache.commons.io.IOUtils;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpUriRequest;
@ -23,17 +23,20 @@ public class DNetRestClient {
private static final ObjectMapper mapper = new ObjectMapper();
private DNetRestClient() {
}
public static <T> T doGET(final String url, Class<T> clazz) throws Exception {
final HttpGet httpGet = new HttpGet(url);
return doHTTPRequest(httpGet, clazz);
}
public static String doGET(final String url) throws Exception {
public static String doGET(final String url) throws IOException {
final HttpGet httpGet = new HttpGet(url);
return doHTTPRequest(httpGet);
}
public static <V> String doPOST(final String url, V objParam) throws Exception {
public static <V> String doPOST(final String url, V objParam) throws IOException {
final HttpPost httpPost = new HttpPost(url);
if (objParam != null) {
@ -45,25 +48,25 @@ public class DNetRestClient {
return doHTTPRequest(httpPost);
}
public static <T, V> T doPOST(final String url, V objParam, Class<T> clazz) throws Exception {
public static <T, V> T doPOST(final String url, V objParam, Class<T> clazz) throws IOException {
return mapper.readValue(doPOST(url, objParam), clazz);
}
private static String doHTTPRequest(final HttpUriRequest r) throws Exception {
CloseableHttpClient client = HttpClients.createDefault();
private static String doHTTPRequest(final HttpUriRequest r) throws IOException {
try (CloseableHttpClient client = HttpClients.createDefault()) {
log.info("performing HTTP request, method {} on URI {}", r.getMethod(), r.getURI().toString());
log
.info(
"request headers: {}",
Arrays
.asList(r.getAllHeaders())
.stream()
.map(h -> h.getName() + ":" + h.getValue())
.collect(Collectors.joining(",")));
log.info("performing HTTP request, method {} on URI {}", r.getMethod(), r.getURI().toString());
log
.info(
"request headers: {}",
Arrays
.asList(r.getAllHeaders())
.stream()
.map(h -> h.getName() + ":" + h.getValue())
.collect(Collectors.joining(",")));
CloseableHttpResponse response = client.execute(r);
return IOUtils.toString(response.getEntity().getContent());
return IOUtils.toString(client.execute(r).getEntity().getContent());
}
}
private static <T> T doHTTPRequest(final HttpUriRequest r, Class<T> clazz) throws Exception {
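
Review note on the `doHTTPRequest` hunk above: the client is now managed by try-with-resources, but the response returned by `client.execute(r)` is not declared as a resource, whereas `mdstorePaths` in `DHPUtils` (later in this diff) manages both. Whether that matters in practice depends on how the HTTP client releases connections when it is closed, which this diff does not show. The sketch below uses a hypothetical `Tracked` stand-in, not the HttpClient API, and only illustrates the language rule: resources declared in one `try` are closed in reverse order after the body runs.

```java
// Sketch only: Tracked is a hypothetical stand-in for a closeable client or response.
class NestedResourcesSketch {

	static final class Tracked implements AutoCloseable {
		private final String name;
		private final StringBuilder log;

		Tracked(String name, StringBuilder log) {
			this.name = name;
			this.log = log;
		}

		@Override
		public void close() {
			// Record the moment this resource is released.
			log.append(name).append(" closed;");
		}
	}

	static String run() {
		final StringBuilder log = new StringBuilder();
		// Both resources are closed automatically, last declared first.
		try (Tracked client = new Tracked("client", log);
			Tracked response = new Tracked("response", log)) {
			log.append("body;");
		}
		return log.toString();
	}

	public static void main(String[] args) {
		System.out.println(run());
	}
}
```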

View File

@ -46,7 +46,7 @@ public class Vocabulary implements Serializable {
}
public VocabularyTerm getTerm(final String id) {
return Optional.ofNullable(id).map(s -> s.toLowerCase()).map(s -> terms.get(s)).orElse(null);
return Optional.ofNullable(id).map(String::toLowerCase).map(terms::get).orElse(null);
}
protected void addTerm(final String id, final String name) {
@ -81,7 +81,6 @@ public class Vocabulary implements Serializable {
.ofNullable(getTermBySynonym(syn))
.map(term -> getTermAsQualifier(term.getId()))
.orElse(null);
// .orElse(OafMapperUtils.unknown(getId(), getName()));
}
}

View File

@ -46,7 +46,6 @@ public class VocabularyGroup implements Serializable {
}
vocs.addTerm(vocId, termId, termName);
// vocs.addSynonyms(vocId, termId, termId);
}
}
@ -58,7 +57,6 @@ public class VocabularyGroup implements Serializable {
final String syn = arr[2].trim();
vocs.addSynonyms(vocId, termId, syn);
// vocs.addSynonyms(vocId, termId, termId);
}
}
@ -98,7 +96,7 @@ public class VocabularyGroup implements Serializable {
.getTerms()
.values()
.stream()
.map(t -> t.getId())
.map(VocabularyTerm::getId)
.collect(Collectors.toCollection(HashSet::new));
}
@ -154,16 +152,19 @@ public class VocabularyGroup implements Serializable {
return Optional
.ofNullable(vocId)
.map(String::toLowerCase)
.map(id -> vocs.containsKey(id))
.map(vocs::containsKey)
.orElse(false);
}
private void addSynonyms(final String vocId, final String termId, final String syn) {
String id = Optional
.ofNullable(vocId)
.map(s -> s.toLowerCase())
.map(String::toLowerCase)
.orElseThrow(
() -> new IllegalArgumentException(String.format("empty vocabulary id for [term:%s, synonym:%s]")));
() -> new IllegalArgumentException(
String
.format(
"empty vocabulary id for [term:%s, synonym:%s]", termId, syn)));
Optional
.ofNullable(vocs.get(id))
.orElseThrow(() -> new IllegalArgumentException("missing vocabulary id: " + vocId))
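
Review note on the `addSynonyms` hunk above: the removed line called `String.format` with two `%s` placeholders and no arguments, so building the error message would itself have failed instead of producing the intended `IllegalArgumentException`. A minimal sketch of the difference, using only the JDK (`FormatArgumentsSketch` and its method names are hypothetical):

```java
// Sketch only: reproduces the two String.format calls from the hunk in isolation.
class FormatArgumentsSketch {

	static final String TEMPLATE = "empty vocabulary id for [term:%s, synonym:%s]";

	// Calls format without arguments and reports the simple name of the exception it raises.
	static String formatWithoutArguments() {
		try {
			return String.format(TEMPLATE);
		} catch (java.util.IllegalFormatException e) {
			return e.getClass().getSimpleName();
		}
	}

	// The corrected call supplies one argument per placeholder.
	static String formatWithArguments(String termId, String syn) {
		return String.format(TEMPLATE, termId, syn);
	}

	public static void main(String[] args) {
		System.out.println(formatWithoutArguments());
		System.out.println(formatWithArguments("t1", "s1"));
	}
}
```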

View File

@ -2,7 +2,6 @@
package eu.dnetlib.dhp.message;
import java.io.Serializable;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
@ -10,8 +9,8 @@ public class Message implements Serializable {
private static final long serialVersionUID = 401753881204524893L;
public static String CURRENT_PARAM = "current";
public static String TOTAL_PARAM = "total";
public static final String CURRENT_PARAM = "current";
public static final String TOTAL_PARAM = "total";
private MessageType messageType;

View File

@ -4,7 +4,6 @@ package eu.dnetlib.dhp.oa.merge;
import java.text.Normalizer;
import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import org.apache.commons.lang3.StringUtils;
@ -19,6 +18,9 @@ public class AuthorMerger {
private static final Double THRESHOLD = 0.95;
private AuthorMerger() {
}
public static List<Author> merge(List<List<Author>> authors) {
authors.sort((o1, o2) -> -Integer.compare(countAuthorsPids(o1), countAuthorsPids(o2)));
@ -36,7 +38,8 @@ public class AuthorMerger {
public static List<Author> mergeAuthor(final List<Author> a, final List<Author> b, Double threshold) {
int pa = countAuthorsPids(a);
int pb = countAuthorsPids(b);
List<Author> base, enrich;
List<Author> base;
List<Author> enrich;
int sa = authorsSize(a);
int sb = authorsSize(b);
@ -62,22 +65,24 @@ public class AuthorMerger {
// <pidComparableString, Author> (if an Author has more than 1 pid, it appears 2 times in the list)
final Map<String, Author> basePidAuthorMap = base
.stream()
.filter(a -> a.getPid() != null && a.getPid().size() > 0)
.filter(a -> a.getPid() != null && !a.getPid().isEmpty())
.flatMap(
a -> a
.getPid()
.stream()
.filter(Objects::nonNull)
.map(p -> new Tuple2<>(pidToComparableString(p), a)))
.collect(Collectors.toMap(Tuple2::_1, Tuple2::_2, (x1, x2) -> x1));
// <pid, Author> (list of pid that are missing in the other list)
final List<Tuple2<StructuredProperty, Author>> pidToEnrich = enrich
.stream()
.filter(a -> a.getPid() != null && a.getPid().size() > 0)
.filter(a -> a.getPid() != null && !a.getPid().isEmpty())
.flatMap(
a -> a
.getPid()
.stream()
.filter(Objects::nonNull)
.filter(p -> !basePidAuthorMap.containsKey(pidToComparableString(p)))
.map(p -> new Tuple2<>(p, a)))
.collect(Collectors.toList());
@ -115,9 +120,9 @@ public class AuthorMerger {
}
public static String pidToComparableString(StructuredProperty pid) {
return (pid.getQualifier() != null
? pid.getQualifier().getClassid() != null ? pid.getQualifier().getClassid().toLowerCase() : ""
: "")
final String classid = pid.getQualifier().getClassid() != null ? pid.getQualifier().getClassid().toLowerCase()
: "";
return (pid.getQualifier() != null ? classid : "")
+ (pid.getValue() != null ? pid.getValue().toLowerCase() : "");
}
@ -150,7 +155,7 @@ public class AuthorMerger {
}
private static boolean hasPid(Author a) {
if (a == null || a.getPid() == null || a.getPid().size() == 0)
if (a == null || a.getPid() == null || a.getPid().isEmpty())
return false;
return a.getPid().stream().anyMatch(p -> p != null && StringUtils.isNotBlank(p.getValue()));
}
@ -159,7 +164,10 @@ public class AuthorMerger {
if (StringUtils.isNotBlank(author.getSurname())) {
return new Person(author.getSurname() + ", " + author.getName(), false);
} else {
return new Person(author.getFullname(), false);
if (StringUtils.isNotBlank(author.getFullname()))
return new Person(author.getFullname(), false);
else
return new Person("", false);
}
}
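
Review note on the `pidToComparableString` hunk above: the new version reads `pid.getQualifier().getClassid()` before the `pid.getQualifier() != null` check, so a pid without a qualifier would raise a `NullPointerException`, which the removed nested ternary avoided. One possible null-safe form is sketched below. `Qualifier` and `StructuredProperty` here are minimal hypothetical stand-ins for the schema classes, modelling only the getters used.

```java
// Sketch only: the nested classes are stand-ins, not the eu.dnetlib.dhp.schema.oaf types.
class PidComparableSketch {

	static final class Qualifier {
		private final String classid;

		Qualifier(String classid) {
			this.classid = classid;
		}

		String getClassid() {
			return classid;
		}
	}

	static final class StructuredProperty {
		private final Qualifier qualifier;
		private final String value;

		StructuredProperty(Qualifier qualifier, String value) {
			this.qualifier = qualifier;
			this.value = value;
		}

		Qualifier getQualifier() {
			return qualifier;
		}

		String getValue() {
			return value;
		}
	}

	static String pidToComparableString(StructuredProperty pid) {
		// Check the qualifier before touching its classid.
		final Qualifier q = pid.getQualifier();
		final String classid = (q != null && q.getClassid() != null) ? q.getClassid().toLowerCase() : "";
		final String value = pid.getValue() != null ? pid.getValue().toLowerCase() : "";
		return classid + value;
	}

	public static void main(String[] args) {
		System.out.println(pidToComparableString(new StructuredProperty(new Qualifier("ORCID"), "0000-0001")));
		System.out.println(pidToComparableString(new StructuredProperty(null, "ABC")));
	}
}
```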

View File

@ -12,6 +12,9 @@ import com.ximpleware.VTDNav;
/** Created by sandro on 9/29/16. */
public class VtdUtilityParser {
private VtdUtilityParser() {
}
public static List<Node> getTextValuesWithAttributes(
final AutoPilot ap, final VTDNav vn, final String xpath, final List<String> attributes)
throws VtdException {

View File

@ -7,22 +7,19 @@ import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.*;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import org.apache.commons.lang3.StringUtils;
import org.jetbrains.annotations.NotNull;
import com.github.sisyphsu.dateparser.DateParserUtils;
import com.google.common.collect.Lists;
import com.google.common.collect.Maps;
import com.google.common.collect.Sets;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.*;
import me.xuender.unidecode.Unidecode;
public class GraphCleaningFunctions extends CleaningFunctions {
@ -30,8 +27,11 @@ public class GraphCleaningFunctions extends CleaningFunctions {
public static final int ORCID_LEN = 19;
public static final String CLEANING_REGEX = "(?:\\n|\\r|\\t)";
public static final String INVALID_AUTHOR_REGEX = ".*deactivated.*";
public static final String TITLE_FILTER_REGEX = "[.*test.*\\W\\d]";
public static final int TITLE_FILTER_RESIDUAL_LENGTH = 10;
public static final String TITLE_TEST = "test";
public static final String TITLE_FILTER_REGEX = String.format("(%s)|\\W|\\d", TITLE_TEST);
public static final int TITLE_FILTER_RESIDUAL_LENGTH = 5;
public static <T extends Oaf> T fixVocabularyNames(T value) {
if (value instanceof Datasource) {
@ -194,11 +194,21 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.filter(Objects::nonNull)
.filter(sp -> StringUtils.isNotBlank(sp.getValue()))
.filter(
sp -> sp
.getValue()
.toLowerCase()
.replaceAll(TITLE_FILTER_REGEX, "")
.length() > TITLE_FILTER_RESIDUAL_LENGTH)
sp -> {
final String title = sp
.getValue()
.toLowerCase();
final String decoded = Unidecode.decode(title);
if (StringUtils.contains(decoded, TITLE_TEST)) {
return decoded
.replaceAll(TITLE_FILTER_REGEX, "")
.length() > TITLE_FILTER_RESIDUAL_LENGTH;
}
return !decoded
.replaceAll("\\W|\\d", "")
.isEmpty();
})
.map(GraphCleaningFunctions::cleanValue)
.collect(Collectors.toList()));
}
@ -283,7 +293,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
r
.getAuthor()
.stream()
.filter(a -> Objects.nonNull(a))
.filter(Objects::nonNull)
.filter(a -> StringUtils.isNotBlank(a.getFullname()))
.filter(a -> StringUtils.isNotBlank(a.getFullname().replaceAll("[\\W]", "")))
.collect(Collectors.toList()));
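
Review note on the title filter hunks above: the removed `TITLE_FILTER_REGEX` is a single character class, so it strips every `t`, `e` and `s` individually (plus `.`, `*`, non-word characters and digits) rather than the word "test". The replacement removes the literal "test" and only applies the residual-length check when the title contains it. Because `\W` in Java is ASCII-based by default, non-Latin titles would be stripped entirely, which is presumably why the new code transliterates with `Unidecode` first. The sketch below leaves the `Unidecode` step out so it runs on the JDK alone; `TitleFilterSketch`, `residual` and `keep` are hypothetical names.

```java
// Sketch only: mirrors the old and new title predicates without the Unidecode step.
class TitleFilterSketch {

	static final String OLD_REGEX = "[.*test.*\\W\\d]";
	static final String TITLE_TEST = "test";
	static final String NEW_REGEX = String.format("(%s)|\\W|\\d", TITLE_TEST);
	static final int RESIDUAL_LENGTH = 5;

	// What is left of a lowercased title after applying the given regex.
	static String residual(String title, String regex) {
		return title.toLowerCase().replaceAll(regex, "");
	}

	// New predicate: the length check applies only to titles containing "test".
	static boolean keep(String title) {
		final String lower = title.toLowerCase();
		if (lower.contains(TITLE_TEST)) {
			return lower.replaceAll(NEW_REGEX, "").length() > RESIDUAL_LENGTH;
		}
		return !lower.replaceAll("\\W|\\d", "").isEmpty();
	}

	public static void main(String[] args) {
		System.out.println(residual("Estimates", OLD_REGEX));
		System.out.println(residual("Estimates", NEW_REGEX));
		System.out.println(keep("test 123"));
		System.out.println(keep("Testing protocols"));
	}
}
```

With the old regex, "Estimates" is reduced to "ima"; with the new one it is left intact, since it does not contain the word "test".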

View File

@ -17,13 +17,16 @@ import eu.dnetlib.dhp.schema.oaf.*;
public class OafMapperUtils {
private OafMapperUtils() {
}
public static Oaf merge(final Oaf left, final Oaf right) {
if (ModelSupport.isSubClass(left, OafEntity.class)) {
return mergeEntities((OafEntity) left, (OafEntity) right);
} else if (ModelSupport.isSubClass(left, Relation.class)) {
((Relation) left).mergeFrom((Relation) right);
} else {
throw new RuntimeException("invalid Oaf type:" + left.getClass().getCanonicalName());
throw new IllegalArgumentException("invalid Oaf type:" + left.getClass().getCanonicalName());
}
return left;
}
@ -38,7 +41,7 @@ public class OafMapperUtils {
} else if (ModelSupport.isSubClass(left, Project.class)) {
left.mergeFrom(right);
} else {
throw new RuntimeException("invalid OafEntity subtype:" + left.getClass().getCanonicalName());
throw new IllegalArgumentException("invalid OafEntity subtype:" + left.getClass().getCanonicalName());
}
return left;
}
@ -62,7 +65,7 @@ public class OafMapperUtils {
public static List<KeyValue> listKeyValues(final String... s) {
if (s.length % 2 > 0) {
throw new RuntimeException("Invalid number of parameters (k,v,k,v,....)");
throw new IllegalArgumentException("Invalid number of parameters (k,v,k,v,....)");
}
final List<KeyValue> list = new ArrayList<>();
@ -88,7 +91,7 @@ public class OafMapperUtils {
.stream(values)
.map(v -> field(v, info))
.filter(Objects::nonNull)
.filter(distinctByKey(f -> f.getValue()))
.filter(distinctByKey(Field::getValue))
.collect(Collectors.toList());
}
@ -97,7 +100,7 @@ public class OafMapperUtils {
.stream()
.map(v -> field(v, info))
.filter(Objects::nonNull)
.filter(distinctByKey(f -> f.getValue()))
.filter(distinctByKey(Field::getValue))
.collect(Collectors.toList());
}
@ -342,10 +345,10 @@ public class OafMapperUtils {
if (instanceList != null) {
final Optional<AccessRight> min = instanceList
.stream()
.map(i -> i.getAccessright())
.map(Instance::getAccessright)
.min(new AccessRightComparator<>());
final Qualifier rights = min.isPresent() ? qualifier(min.get()) : new Qualifier();
final Qualifier rights = min.map(OafMapperUtils::qualifier).orElseGet(Qualifier::new);
if (StringUtils.isBlank(rights.getClassid())) {
rights.setClassid(UNKNOWN);

View File

@ -4,19 +4,19 @@ package eu.dnetlib.dhp.utils;
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.*;
import java.util.stream.Collectors;
import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.binary.Base64OutputStream;
import org.apache.commons.codec.binary.Hex;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SaveMode;
import org.slf4j.Logger;
@ -26,6 +26,8 @@ import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.collect.Maps;
import com.jayway.jsonpath.JsonPath;
import eu.dnetlib.dhp.schema.mdstore.MDStoreWithInfo;
import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions;
import net.minidev.json.JSONArray;
import scala.collection.JavaConverters;
import scala.collection.Seq;
@ -34,6 +36,9 @@ public class DHPUtils {
private static final Logger log = LoggerFactory.getLogger(DHPUtils.class);
private DHPUtils() {
}
public static Seq<String> toSeq(List<String> list) {
return JavaConverters.asScalaIteratorConverter(list.iterator()).asScala().toSeq();
}
@ -44,40 +49,59 @@ public class DHPUtils {
md.update(s.getBytes(StandardCharsets.UTF_8));
return new String(Hex.encodeHex(md.digest()));
} catch (final Exception e) {
System.err.println("Error creating id");
log.error("Error creating id from {}", s);
return null;
}
}
/**
* Retrieves from the metadata store manager application the list of paths associated with mdstores characterized
* by the given format, layout, interpretation
* @param mdstoreManagerUrl the URL of the mdstore manager service
* @param format the mdstore format
* @param layout the mdstore layout
* @param interpretation the mdstore interpretation
* @param includeEmpty whether to include empty mdstores
* @return the set of hdfs paths
* @throws IOException in case of HTTP communication issues
*/
public static Set<String> mdstorePaths(final String mdstoreManagerUrl,
final String format,
final String layout,
final String interpretation,
boolean includeEmpty) throws IOException {
final String url = mdstoreManagerUrl + "/mdstores/";
final ObjectMapper objectMapper = new ObjectMapper();
final HttpGet req = new HttpGet(url);
try (final CloseableHttpClient client = HttpClients.createDefault()) {
try (final CloseableHttpResponse response = client.execute(req)) {
final String json = IOUtils.toString(response.getEntity().getContent());
final MDStoreWithInfo[] mdstores = objectMapper.readValue(json, MDStoreWithInfo[].class);
return Arrays
.stream(mdstores)
.filter(md -> md.getFormat().equalsIgnoreCase(format))
.filter(md -> md.getLayout().equalsIgnoreCase(layout))
.filter(md -> md.getInterpretation().equalsIgnoreCase(interpretation))
.filter(md -> StringUtils.isNotBlank(md.getHdfsPath()))
.filter(md -> StringUtils.isNotBlank(md.getCurrentVersion()))
.filter(md -> includeEmpty || md.getSize() > 0)
.map(md -> md.getHdfsPath() + "/" + md.getCurrentVersion() + "/store")
.collect(Collectors.toSet());
}
}
}
public static String generateIdentifier(final String originalId, final String nsPrefix) {
return String.format("%s::%s", nsPrefix, DHPUtils.md5(originalId));
}
public static String compressString(final String input) {
try (ByteArrayOutputStream out = new ByteArrayOutputStream();
Base64OutputStream b64os = new Base64OutputStream(out)) {
GZIPOutputStream gzip = new GZIPOutputStream(b64os);
gzip.write(input.getBytes(StandardCharsets.UTF_8));
gzip.close();
return out.toString();
} catch (Throwable e) {
return null;
}
}
public static String generateUnresolvedIdentifier(final String pid, final String pidType) {
public static String decompressString(final String input) {
byte[] byteArray = Base64.decodeBase64(input.getBytes());
int len;
try (GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream((byteArray)));
ByteArrayOutputStream bos = new ByteArrayOutputStream(byteArray.length)) {
byte[] buffer = new byte[1024];
while ((len = gis.read(buffer)) != -1) {
bos.write(buffer, 0, len);
}
return bos.toString();
} catch (Exception e) {
return null;
}
final String cleanedPid = CleaningFunctions.normalizePidValue(pidType, pid);
return String.format("unresolved::%s::%s", cleanedPid, pidType.toLowerCase().trim());
}
public static String getJPathString(final String jsonPath, final String json) {

View File

@ -18,13 +18,16 @@ public class ISLookupClientFactory {
private static final int requestTimeout = 60000 * 10;
private static final int connectTimeout = 60000 * 10;
private ISLookupClientFactory() {
}
public static ISLookUpService getLookUpService(final String isLookupUrl) {
return getServiceStub(ISLookUpService.class, isLookupUrl);
}
@SuppressWarnings("unchecked")
private static <T> T getServiceStub(final Class<T> clazz, final String endpoint) {
log.info(String.format("creating %s stub from %s", clazz.getName(), endpoint));
log.info("creating {} stub from {}", clazz.getName(), endpoint);
final JaxWsProxyFactoryBean jaxWsProxyFactory = new JaxWsProxyFactoryBean();
jaxWsProxyFactory.setServiceClass(clazz);
jaxWsProxyFactory.setAddress(endpoint);
@ -38,12 +41,10 @@ public class ISLookupClientFactory {
log
.info(
String
.format(
"setting connectTimeout to %s, requestTimeout to %s for service %s",
connectTimeout,
requestTimeout,
clazz.getCanonicalName()));
"setting connectTimeout to {}, requestTimeout to {} for service {}",
connectTimeout,
requestTimeout,
clazz.getCanonicalName());
policy.setConnectionTimeout(connectTimeout);
policy.setReceiveTimeout(requestTimeout);

View File

@ -10,7 +10,7 @@ import net.sf.saxon.trans.XPathException;
public abstract class AbstractExtensionFunction extends ExtensionFunctionDefinition {
public static String DEFAULT_SAXON_EXT_NS_URI = "http://www.d-net.research-infrastructures.eu/saxon-extension";
public static final String DEFAULT_SAXON_EXT_NS_URI = "http://www.d-net.research-infrastructures.eu/saxon-extension";
public abstract String getName();

View File

@ -26,7 +26,7 @@ public class ExtractYear extends AbstractExtensionFunction {
@Override
public Sequence doCall(XPathContext context, Sequence[] arguments) throws XPathException {
if (arguments == null | arguments.length == 0) {
if (arguments == null || arguments.length == 0) {
return new StringValue("");
}
final Item item = arguments[0].head();
@ -63,8 +63,7 @@ public class ExtractYear extends AbstractExtensionFunction {
for (String format : dateFormats) {
try {
c.setTime(new SimpleDateFormat(format).parse(s));
String year = String.valueOf(c.get(Calendar.YEAR));
return year;
return String.valueOf(c.get(Calendar.YEAR));
} catch (ParseException e) {
}
}
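
Review note on the `doCall` hunk above (the same change appears in `NormalizeDate`): the removed condition used the non-short-circuit `|`, which always evaluates both operands, so `arguments.length` was read even when `arguments` was `null`. A minimal sketch of the difference, with hypothetical method names:

```java
// Sketch only: isolates the two conditions from the hunk.
class ShortCircuitSketch {

	// Non-short-circuit OR: the right operand is evaluated even when the left is true.
	static String eager(Object[] arguments) {
		try {
			return (arguments == null | arguments.length == 0) ? "empty" : "non-empty";
		} catch (NullPointerException e) {
			return "NPE";
		}
	}

	// Short-circuit OR: the right operand is skipped when arguments is null.
	static String lazy(Object[] arguments) {
		return (arguments == null || arguments.length == 0) ? "empty" : "non-empty";
	}

	public static void main(String[] args) {
		System.out.println(eager(null));
		System.out.println(lazy(null));
	}
}
```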

View File

@ -30,7 +30,7 @@ public class NormalizeDate extends AbstractExtensionFunction {
@Override
public Sequence doCall(XPathContext context, Sequence[] arguments) throws XPathException {
if (arguments == null | arguments.length == 0) {
if (arguments == null || arguments.length == 0) {
return new StringValue(BLANK);
}
String s = arguments[0].head().getStringValue();

View File

@ -1,6 +1,8 @@
package eu.dnetlib.dhp.utils.saxon;
import static org.apache.commons.lang3.StringUtils.isNotBlank;
import org.apache.commons.lang3.StringUtils;
import net.sf.saxon.expr.XPathContext;
@ -26,7 +28,8 @@ public class PickFirst extends AbstractExtensionFunction {
final String s1 = getValue(arguments[0]);
final String s2 = getValue(arguments[1]);
return new StringValue(StringUtils.isNotBlank(s1) ? s1 : StringUtils.isNotBlank(s2) ? s2 : "");
final String value = isNotBlank(s1) ? s1 : isNotBlank(s2) ? s2 : "";
return new StringValue(value);
}
private String getValue(final Sequence arg) throws XPathException {

View File

@ -12,6 +12,9 @@ import net.sf.saxon.TransformerFactoryImpl;
public class SaxonTransformerFactory {
private SaxonTransformerFactory() {
}
/**
* Creates the index record transformer from the given XSLT
*

View File

@ -7,10 +7,10 @@ import static org.junit.jupiter.api.Assertions.assertNotNull;
import org.apache.commons.io.IOUtils;
import org.junit.jupiter.api.Test;
public class ArgumentApplicationParserTest {
class ArgumentApplicationParserTest {
@Test
public void testParseParameter() throws Exception {
void testParseParameter() throws Exception {
final String jsonConfiguration = IOUtils
.toString(
this.getClass().getResourceAsStream("/eu/dnetlib/application/parameters.json"));

View File

@ -21,13 +21,13 @@ public class HdfsSupportTest {
class Remove {
@Test
public void shouldThrowARuntimeExceptionOnError() {
void shouldThrowARuntimeExceptionOnError() {
// when
assertThrows(RuntimeException.class, () -> HdfsSupport.remove(null, new Configuration()));
}
@Test
public void shouldRemoveADirFromHDFS(@TempDir Path tempDir) {
void shouldRemoveADirFromHDFS(@TempDir Path tempDir) {
// when
HdfsSupport.remove(tempDir.toString(), new Configuration());
@ -36,7 +36,7 @@ public class HdfsSupportTest {
}
@Test
public void shouldRemoveAFileFromHDFS(@TempDir Path tempDir) throws IOException {
void shouldRemoveAFileFromHDFS(@TempDir Path tempDir) throws IOException {
// given
Path file = Files.createTempFile(tempDir, "p", "s");
@ -52,13 +52,13 @@ public class HdfsSupportTest {
class ListFiles {
@Test
public void shouldThrowARuntimeExceptionOnError() {
void shouldThrowARuntimeExceptionOnError() {
// when
assertThrows(RuntimeException.class, () -> HdfsSupport.listFiles(null, new Configuration()));
}
@Test
public void shouldListFilesLocatedInPath(@TempDir Path tempDir) throws IOException {
void shouldListFilesLocatedInPath(@TempDir Path tempDir) throws IOException {
Path subDir1 = Files.createTempDirectory(tempDir, "list_me");
Path subDir2 = Files.createTempDirectory(tempDir, "list_me");

View File

@ -5,10 +5,10 @@ import static org.junit.jupiter.api.Assertions.*;
import org.junit.jupiter.api.Test;
public class PacePersonTest {
class PacePersonTest {
@Test
public void pacePersonTest1() {
void pacePersonTest1() {
PacePerson p = new PacePerson("Artini, Michele", false);
assertEquals("Artini", p.getSurnameString());
@ -17,7 +17,7 @@ public class PacePersonTest {
}
@Test
public void pacePersonTest2() {
void pacePersonTest2() {
PacePerson p = new PacePerson("Michele G. Artini", false);
assertEquals("Artini, Michele G.", p.getNormalisedFullname());
assertEquals("Michele G", p.getNameString());

View File

@ -18,7 +18,8 @@ public class SparkSessionSupportTest {
class RunWithSparkSession {
@Test
public void shouldExecuteFunctionAndNotStopSparkSessionWhenSparkSessionIsNotManaged()
@SuppressWarnings("unchecked")
void shouldExecuteFunctionAndNotStopSparkSessionWhenSparkSessionIsNotManaged()
throws Exception {
// given
SparkSession spark = mock(SparkSession.class);
@ -37,7 +38,8 @@ public class SparkSessionSupportTest {
}
@Test
public void shouldExecuteFunctionAndStopSparkSessionWhenSparkSessionIsManaged()
@SuppressWarnings("unchecked")
void shouldExecuteFunctionAndStopSparkSessionWhenSparkSessionIsManaged()
throws Exception {
// given
SparkSession spark = mock(SparkSession.class);

View File

@ -12,7 +12,7 @@ import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;
@Disabled
public class ZenodoAPIClientTest {
class ZenodoAPIClientTest {
private final String URL_STRING = "https://sandbox.zenodo.org/api/deposit/depositions";
private final String ACCESS_TOKEN = "";
@ -22,7 +22,7 @@ public class ZenodoAPIClientTest {
private final String depositionId = "674915";
@Test
public void testUploadOldDeposition() throws IOException, MissingConceptDoiException {
void testUploadOldDeposition() throws IOException, MissingConceptDoiException {
ZenodoAPIClient client = new ZenodoAPIClient(URL_STRING,
ACCESS_TOKEN);
Assertions.assertEquals(200, client.uploadOpenDeposition(depositionId));
@ -44,7 +44,7 @@ public class ZenodoAPIClientTest {
}
@Test
public void testNewDeposition() throws IOException {
void testNewDeposition() throws IOException {
ZenodoAPIClient client = new ZenodoAPIClient(URL_STRING,
ACCESS_TOKEN);
@ -67,7 +67,7 @@ public class ZenodoAPIClientTest {
}
@Test
public void testNewVersionNewName() throws IOException, MissingConceptDoiException {
void testNewVersionNewName() throws IOException, MissingConceptDoiException {
ZenodoAPIClient client = new ZenodoAPIClient(URL_STRING,
ACCESS_TOKEN);
@ -87,7 +87,7 @@ public class ZenodoAPIClientTest {
}
@Test
public void testNewVersionOldName() throws IOException, MissingConceptDoiException {
void testNewVersionOldName() throws IOException, MissingConceptDoiException {
ZenodoAPIClient client = new ZenodoAPIClient(URL_STRING,
ACCESS_TOKEN);

View File

@ -21,7 +21,7 @@ import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
import eu.dnetlib.pace.util.MapDocumentUtil;
import scala.Tuple2;
public class AuthorMergerTest {
class AuthorMergerTest {
private String publicationsBasePath;
@ -43,7 +43,7 @@ public class AuthorMergerTest {
}
@Test
public void mergeTest() { // used in the dedup: threshold set to 0.95
void mergeTest() { // used in the dedup: threshold set to 0.95
for (List<Author> authors1 : authors) {
System.out.println("List " + (authors.indexOf(authors1) + 1));

View File

@ -4,12 +4,8 @@ package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.*;
import java.io.IOException;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.stream.Collectors;
import org.apache.commons.io.IOUtils;
@ -19,15 +15,34 @@ import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.schema.oaf.Dataset;
import eu.dnetlib.dhp.schema.oaf.KeyValue;
import eu.dnetlib.dhp.schema.oaf.Publication;
import eu.dnetlib.dhp.schema.oaf.Result;
import me.xuender.unidecode.Unidecode;
public class OafMapperUtilsTest {
class OafMapperUtilsTest {
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
@Test
public void testDateValidation() {
public void testUnidecode() {
assertEquals("Liu Ben Mu hiruzuSen tawa", Unidecode.decode("六本木ヒルズ森タワ"));
assertEquals("Nan Wu A Mi Tuo Fo", Unidecode.decode("南无阿弥陀佛"));
assertEquals("Yi Tiao Hui Zou Lu De Yu", Unidecode.decode("一条会走路的鱼"));
assertEquals("amidaniyorai", Unidecode.decode("あみだにょらい"));
assertEquals("T`owrk`iayi", Unidecode.decode("Թուրքիայի"));
assertEquals("Obzor tematiki", Unidecode.decode("Обзор тематики"));
assertEquals("GERMANSKIE IaZYKI", Unidecode.decode("ГЕРМАНСКИЕ ЯЗЫКИ"));
assertEquals("Diereunese tes ikanopoieses", Unidecode.decode("Διερεύνηση της ικανοποίησης"));
assertEquals("lqDy l'wly@", Unidecode.decode("القضايا الأولية"));
assertEquals("abc def ghi", Unidecode.decode("abc def ghi"));
}
@Test
void testDateValidation() {
assertTrue(GraphCleaningFunctions.doCleanDate("2016-05-07T12:41:19.202Z ").isPresent());
assertTrue(GraphCleaningFunctions.doCleanDate("2020-09-10 11:08:52 ").isPresent());
@ -132,44 +147,46 @@ public class OafMapperUtilsTest {
}
@Test
public void testDate() {
System.out.println(GraphCleaningFunctions.cleanDate("23-FEB-1998"));
void testDate() {
final String date = GraphCleaningFunctions.cleanDate("23-FEB-1998");
assertNotNull(date);
System.out.println(date);
}
@Test
public void testMergePubs() throws IOException {
void testMergePubs() throws IOException {
Publication p1 = read("publication_1.json", Publication.class);
Publication p2 = read("publication_2.json", Publication.class);
Dataset d1 = read("dataset_1.json", Dataset.class);
Dataset d2 = read("dataset_2.json", Dataset.class);
assertEquals(p1.getCollectedfrom().size(), 1);
assertEquals(p1.getCollectedfrom().get(0).getKey(), ModelConstants.CROSSREF_ID);
assertEquals(d2.getCollectedfrom().size(), 1);
assertEquals(1, p1.getCollectedfrom().size());
assertEquals(ModelConstants.CROSSREF_ID, p1.getCollectedfrom().get(0).getKey());
assertEquals(1, d2.getCollectedfrom().size());
assertFalse(cfId(d2.getCollectedfrom()).contains(ModelConstants.CROSSREF_ID));
assertTrue(
assertEquals(
ModelConstants.PUBLICATION_RESULTTYPE_CLASSID,
OafMapperUtils
.mergeResults(p1, d2)
.getResulttype()
.getClassid()
.equals(ModelConstants.PUBLICATION_RESULTTYPE_CLASSID));
.getClassid());
assertEquals(p2.getCollectedfrom().size(), 1);
assertEquals(1, p2.getCollectedfrom().size());
assertFalse(cfId(p2.getCollectedfrom()).contains(ModelConstants.CROSSREF_ID));
assertEquals(d1.getCollectedfrom().size(), 1);
assertEquals(1, d1.getCollectedfrom().size());
assertTrue(cfId(d1.getCollectedfrom()).contains(ModelConstants.CROSSREF_ID));
assertTrue(
assertEquals(
ModelConstants.DATASET_RESULTTYPE_CLASSID,
OafMapperUtils
.mergeResults(p2, d1)
.getResulttype()
.getClassid()
.equals(ModelConstants.DATASET_RESULTTYPE_CLASSID));
.getClassid());
}
protected HashSet<String> cfId(List<KeyValue> collectedfrom) {
return collectedfrom.stream().map(c -> c.getKey()).collect(Collectors.toCollection(HashSet::new));
return collectedfrom.stream().map(KeyValue::getKey).collect(Collectors.toCollection(HashSet::new));
}
protected <T extends Result> T read(String filename, Class<T> clazz) throws IOException {


@ -3,10 +3,10 @@ package eu.dnetlib.scholexplorer.relation;
import org.junit.jupiter.api.Test;
public class RelationMapperTest {
class RelationMapperTest {
@Test
public void testLoadRels() throws Exception {
void testLoadRels() throws Exception {
RelationMapper relationMapper = RelationMapper.load();
relationMapper.keySet().forEach(System.out::println);


@ -3,40 +3,37 @@ package eu.dnetlib.dhp.actionmanager;
import java.io.Serializable;
import java.io.StringReader;
import java.util.*;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.commons.lang3.tuple.Triple;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;
import org.jetbrains.annotations.NotNull;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.xml.sax.SAXException;
import com.google.common.base.Joiner;
import com.google.common.base.Splitter;
import com.google.common.collect.Iterables;
import com.google.common.collect.Lists;
import com.google.common.collect.Sets;
import eu.dnetlib.actionmanager.rmi.ActionManagerException;
import eu.dnetlib.actionmanager.set.ActionManagerSet;
import eu.dnetlib.actionmanager.set.ActionManagerSet.ImpactTypes;
import eu.dnetlib.dhp.actionmanager.partition.PartitionActionSetsByPayloadTypeJob;
import eu.dnetlib.dhp.utils.ISLookupClientFactory;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
import scala.Tuple2;
public class ISClient implements Serializable {
private static final Logger log = LoggerFactory.getLogger(PartitionActionSetsByPayloadTypeJob.class);
private static final Logger log = LoggerFactory.getLogger(ISClient.class);
private static final String INPUT_ACTION_SET_ID_SEPARATOR = ",";
private final ISLookUpService isLookup;
private final transient ISLookUpService isLookup;
public ISClient(String isLookupUrl) {
isLookup = ISLookupClientFactory.getLookUpService(isLookupUrl);
@ -63,7 +60,7 @@ public class ISClient implements Serializable {
.map(
sets -> sets
.stream()
.map(set -> parseSetInfo(set))
.map(ISClient::parseSetInfo)
.filter(t -> ids.contains(t.getLeft()))
.map(t -> buildDirectory(basePath, t))
.collect(Collectors.toList()))
@ -73,15 +70,17 @@ public class ISClient implements Serializable {
}
}
private Triple<String, String, String> parseSetInfo(String set) {
private static Triple<String, String, String> parseSetInfo(String set) {
try {
Document doc = new SAXReader().read(new StringReader(set));
final SAXReader reader = new SAXReader();
reader.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
Document doc = reader.read(new StringReader(set));
return Triple
.of(
doc.valueOf("//SET/@id"),
doc.valueOf("//SET/@directory"),
doc.valueOf("//SET/@latest"));
} catch (DocumentException e) {
} catch (DocumentException | SAXException e) {
throw new IllegalStateException(e);
}
}
@ -99,7 +98,7 @@ public class ISClient implements Serializable {
final String q = "for $x in /RESOURCE_PROFILE[.//RESOURCE_TYPE/@value='ActionManagerServiceResourceType'] return $x//SERVICE_PROPERTIES/PROPERTY[./@ key='"
+ propertyName
+ "']/@value/string()";
log.debug("quering for service property: " + q);
log.debug("querying for service property: {}", q);
try {
final List<String> value = isLookup.quickSearchProfile(q);
return Iterables.getOnlyElement(value);
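
Note on the `parseSetInfo` hunk above: it enables the parser feature `http://apache.org/xml/features/disallow-doctype-decl` before reading the set profile, so any input carrying a DOCTYPE is rejected. The following is a minimal sketch of the same hardening using the JDK's built-in JAXP parser rather than dom4j's `SAXReader`; the class `DoctypeGuardSketch` and its `parses` helper are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the change set.
final class DoctypeGuardSketch {

	private static final String DISALLOW_DOCTYPE = "http://apache.org/xml/features/disallow-doctype-decl";

	// Returns true when the XML parses, false when the parser rejects it.
	static boolean parses(String xml) {
		try {
			DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
			// Same feature flag as in the hunk: any DOCTYPE declaration becomes a fatal error.
			factory.setFeature(DISALLOW_DOCTYPE, true);
			DocumentBuilder builder = factory.newDocumentBuilder();
			builder.parse(new InputSource(new StringReader(xml)));
			return true;
		} catch (SAXException e) {
			return false;
		} catch (ParserConfigurationException | IOException e) {
			throw new IllegalStateException(e);
		}
	}
}
```

A plain `<SET .../>` element is accepted, while the same element preceded by a `<!DOCTYPE ...>` declaration is refused before any entity can be expanded.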


@ -62,6 +62,7 @@ public class MergeAndGet {
x.getClass().getCanonicalName(), y.getClass().getCanonicalName()));
}
@SuppressWarnings("unchecked")
private static <G extends Oaf, A extends Oaf> G selectNewerAndGet(G x, A y) {
if (x.getClass().equals(y.getClass())
&& x.getLastupdatetimestamp() > y.getLastupdatetimestamp()) {


@ -74,7 +74,9 @@ public class PromoteActionPayloadForGraphTableJob {
.orElse(true);
logger.info("shouldGroupById: {}", shouldGroupById);
@SuppressWarnings("unchecked")
Class<? extends Oaf> rowClazz = (Class<? extends Oaf>) Class.forName(graphTableClassName);
@SuppressWarnings("unchecked")
Class<? extends Oaf> actionPayloadClazz = (Class<? extends Oaf>) Class.forName(actionPayloadClassName);
throwIfGraphTableClassIsNotSubClassOfActionPayloadClass(rowClazz, actionPayloadClazz);
@ -152,7 +154,7 @@ public class PromoteActionPayloadForGraphTableJob {
return spark
.read()
.parquet(path)
.map((MapFunction<Row, String>) value -> extractPayload(value), Encoders.STRING())
.map((MapFunction<Row, String>) PromoteActionPayloadForGraphTableJob::extractPayload, Encoders.STRING())
.map(
(MapFunction<String, A>) value -> decodePayload(actionPayloadClazz, value),
Encoders.bean(actionPayloadClazz));


@ -107,7 +107,7 @@
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=2560
--conf spark.sql.shuffle.partitions=5000
</spark-opts>
<arg>--inputGraphTablePath</arg><arg>${inputGraphRootPath}/publication</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Publication</arg>
@ -159,7 +159,7 @@
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=2560
--conf spark.sql.shuffle.partitions=5000
</spark-opts>
<arg>--inputGraphTablePath</arg><arg>${workingDir}/publication</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Publication</arg>


@ -99,7 +99,7 @@
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=2560
--conf spark.sql.shuffle.partitions=5000
</spark-opts>
<arg>--inputGraphTablePath</arg><arg>${inputGraphRootPath}/relation</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Relation</arg>


@ -80,7 +80,7 @@ public class PartitionActionSetsByPayloadTypeJobTest {
private ISClient isClient;
@Test
public void shouldPartitionActionSetsByPayloadType(@TempDir Path workingDir) throws Exception {
void shouldPartitionActionSetsByPayloadType(@TempDir Path workingDir) throws Exception {
// given
Path inputActionSetsBaseDir = workingDir.resolve("input").resolve("action_sets");
Path outputDir = workingDir.resolve("output");


@ -20,7 +20,7 @@ public class MergeAndGetTest {
class MergeFromAndGetStrategy {
@Test
public void shouldThrowForOafAndOaf() {
void shouldThrowForOafAndOaf() {
// given
Oaf a = mock(Oaf.class);
Oaf b = mock(Oaf.class);
@ -33,7 +33,7 @@ public class MergeAndGetTest {
}
@Test
public void shouldThrowForOafAndRelation() {
void shouldThrowForOafAndRelation() {
// given
Oaf a = mock(Oaf.class);
Relation b = mock(Relation.class);
@ -46,7 +46,7 @@ public class MergeAndGetTest {
}
@Test
public void shouldThrowForOafAndOafEntity() {
void shouldThrowForOafAndOafEntity() {
// given
Oaf a = mock(Oaf.class);
OafEntity b = mock(OafEntity.class);
@ -59,7 +59,7 @@ public class MergeAndGetTest {
}
@Test
public void shouldThrowForRelationAndOaf() {
void shouldThrowForRelationAndOaf() {
// given
Relation a = mock(Relation.class);
Oaf b = mock(Oaf.class);
@ -72,7 +72,7 @@ public class MergeAndGetTest {
}
@Test
public void shouldThrowForRelationAndOafEntity() {
void shouldThrowForRelationAndOafEntity() {
// given
Relation a = mock(Relation.class);
OafEntity b = mock(OafEntity.class);
@ -85,7 +85,7 @@ public class MergeAndGetTest {
}
@Test
public void shouldBehaveProperlyForRelationAndRelation() {
void shouldBehaveProperlyForRelationAndRelation() {
// given
Relation a = mock(Relation.class);
Relation b = mock(Relation.class);
@ -101,7 +101,7 @@ public class MergeAndGetTest {
}
@Test
public void shouldThrowForOafEntityAndOaf() {
void shouldThrowForOafEntityAndOaf() {
// given
OafEntity a = mock(OafEntity.class);
Oaf b = mock(Oaf.class);
@ -114,7 +114,7 @@ public class MergeAndGetTest {
}
@Test
public void shouldThrowForOafEntityAndRelation() {
void shouldThrowForOafEntityAndRelation() {
// given
OafEntity a = mock(OafEntity.class);
Relation b = mock(Relation.class);
@ -127,7 +127,7 @@ public class MergeAndGetTest {
}
@Test
public void shouldThrowForOafEntityAndOafEntityButNotSubclasses() {
void shouldThrowForOafEntityAndOafEntityButNotSubclasses() {
// given
class OafEntitySub1 extends OafEntity {
}
@ -145,7 +145,7 @@ public class MergeAndGetTest {
}
@Test
public void shouldBehaveProperlyForOafEntityAndOafEntity() {
void shouldBehaveProperlyForOafEntityAndOafEntity() {
// given
OafEntity a = mock(OafEntity.class);
OafEntity b = mock(OafEntity.class);
@ -165,7 +165,7 @@ public class MergeAndGetTest {
class SelectNewerAndGetStrategy {
@Test
public void shouldThrowForOafEntityAndRelation() {
void shouldThrowForOafEntityAndRelation() {
// given
OafEntity a = mock(OafEntity.class);
Relation b = mock(Relation.class);
@ -178,7 +178,7 @@ public class MergeAndGetTest {
}
@Test
public void shouldThrowForRelationAndOafEntity() {
void shouldThrowForRelationAndOafEntity() {
// given
Relation a = mock(Relation.class);
OafEntity b = mock(OafEntity.class);
@ -191,7 +191,7 @@ public class MergeAndGetTest {
}
@Test
public void shouldThrowForOafEntityAndResult() {
void shouldThrowForOafEntityAndResult() {
// given
OafEntity a = mock(OafEntity.class);
Result b = mock(Result.class);
@ -204,7 +204,7 @@ public class MergeAndGetTest {
}
@Test
public void shouldThrowWhenSuperTypeIsNewerForResultAndOafEntity() {
void shouldThrowWhenSuperTypeIsNewerForResultAndOafEntity() {
// given
// real types must be used because subclass-superclass resolution does not work for
// mocks
@ -221,7 +221,7 @@ public class MergeAndGetTest {
}
@Test
public void shouldShouldReturnLeftForOafEntityAndOafEntity() {
void shouldShouldReturnLeftForOafEntityAndOafEntity() {
// given
OafEntity a = mock(OafEntity.class);
when(a.getLastupdatetimestamp()).thenReturn(1L);
@ -238,7 +238,7 @@ public class MergeAndGetTest {
}
@Test
public void shouldShouldReturnRightForOafEntityAndOafEntity() {
void shouldShouldReturnRightForOafEntityAndOafEntity() {
// given
OafEntity a = mock(OafEntity.class);
when(a.getLastupdatetimestamp()).thenReturn(2L);


@ -77,7 +77,7 @@ public class PromoteActionPayloadForGraphTableJobTest {
class Main {
@Test
public void shouldThrowWhenGraphTableClassIsNotASubClassOfActionPayloadClass() {
void shouldThrowWhenGraphTableClassIsNotASubClassOfActionPayloadClass() {
// given
Class<Relation> rowClazz = Relation.class;
Class<OafEntity> actionPayloadClazz = OafEntity.class;
@ -116,7 +116,7 @@ public class PromoteActionPayloadForGraphTableJobTest {
@ParameterizedTest(name = "strategy: {0}, graph table: {1}, action payload: {2}")
@MethodSource("eu.dnetlib.dhp.actionmanager.promote.PromoteActionPayloadForGraphTableJobTest#promoteJobTestParams")
public void shouldPromoteActionPayloadForGraphTable(
void shouldPromoteActionPayloadForGraphTable(
MergeAndGet.Strategy strategy,
Class<? extends Oaf> rowClazz,
Class<? extends Oaf> actionPayloadClazz)


@ -44,7 +44,7 @@ public class PromoteActionPayloadFunctionsTest {
class JoinTableWithActionPayloadAndMerge {
@Test
public void shouldThrowWhenTableTypeIsNotSubtypeOfActionPayloadType() {
void shouldThrowWhenTableTypeIsNotSubtypeOfActionPayloadType() {
// given
class OafImpl extends Oaf {
}
@ -58,7 +58,7 @@ public class PromoteActionPayloadFunctionsTest {
}
@Test
public void shouldRunProperlyWhenActionPayloadTypeAndTableTypeAreTheSame() {
void shouldRunProperlyWhenActionPayloadTypeAndTableTypeAreTheSame() {
// given
String id0 = "id0";
String id1 = "id1";
@ -138,7 +138,7 @@ public class PromoteActionPayloadFunctionsTest {
}
@Test
public void shouldRunProperlyWhenActionPayloadTypeIsSuperTypeOfTableType() {
void shouldRunProperlyWhenActionPayloadTypeIsSuperTypeOfTableType() {
// given
String id0 = "id0";
String id1 = "id1";
@ -218,7 +218,7 @@ public class PromoteActionPayloadFunctionsTest {
class GroupTableByIdAndMerge {
@Test
public void shouldRunProperly() {
void shouldRunProperly() {
// given
String id1 = "id1";
String id2 = "id2";


@ -29,6 +29,13 @@
<goal>testCompile</goal>
</goals>
</execution>
<execution>
<id>scala-doc</id>
<phase>process-resources</phase> <!-- or wherever -->
<goals>
<goal>doc</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
@ -84,14 +91,6 @@
<artifactId>json</artifactId>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.commons/commons-csv -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-csv</artifactId>
<version>1.8</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml -->
<dependency>
<groupId>org.apache.poi</groupId>


@ -4,6 +4,7 @@ package eu.dnetlib.dhp.actionmanager.bipfinder;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import java.io.Serializable;
import java.util.Objects;
import java.util.Optional;
import org.apache.commons.io.IOUtils;
@ -28,15 +29,16 @@ import eu.dnetlib.dhp.schema.oaf.Result;
public class CollectAndSave implements Serializable {
private static final Logger log = LoggerFactory.getLogger(CollectAndSave.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
public static <I extends Result> void main(String[] args) throws Exception {
public static void main(String[] args) throws Exception {
String jsonConfiguration = IOUtils
.toString(
CollectAndSave.class
.getResourceAsStream(
"/eu/dnetlib/dhp/actionmanager/bipfinder/input_actionset_parameter.json"));
Objects
.requireNonNull(
CollectAndSave.class
.getResourceAsStream(
"/eu/dnetlib/dhp/actionmanager/bipfinder/input_actionset_parameter.json")));
final ArgumentApplicationParser parser = new ArgumentApplicationParser(jsonConfiguration);


@ -87,7 +87,7 @@ public class SparkAtomicActionScoreJob implements Serializable {
private static <I extends Result> void prepareResults(SparkSession spark, String inputPath, String outputPath,
String bipScorePath, Class<I> inputClazz) {
final JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
final JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
JavaRDD<BipDeserialize> bipDeserializeJavaRDD = sc
.textFile(bipScorePath)
@ -101,8 +101,6 @@ public class SparkAtomicActionScoreJob implements Serializable {
return bs;
}).collect(Collectors.toList()).iterator()).rdd(), Encoders.bean(BipScore.class));
System.out.println(bipScores.count());
Dataset<I> results = readPath(spark, inputPath, inputClazz);
results.createOrReplaceTempView("result");
@ -124,7 +122,7 @@ public class SparkAtomicActionScoreJob implements Serializable {
ret.setId(value._2().getId());
return ret;
}, Encoders.bean(BipScore.class))
.groupByKey((MapFunction<BipScore, String>) value -> value.getId(), Encoders.STRING())
.groupByKey((MapFunction<BipScore, String>) BipScore::getId, Encoders.STRING())
.mapGroups((MapGroupsFunction<String, BipScore, Result>) (k, it) -> {
Result ret = new Result();
ret.setDataInfo(getDataInfo());


@ -0,0 +1,49 @@
package eu.dnetlib.dhp.actionmanager.createunresolvedentities;
import java.util.Optional;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
public class Constants {
public static final String DOI = "doi";
public static final String UPDATE_DATA_INFO_TYPE = "update";
public static final String UPDATE_SUBJECT_FOS_CLASS_ID = "subject:fos";
public static final String UPDATE_CLASS_NAME = "Inferred by OpenAIRE";
public static final String UPDATE_MEASURE_BIP_CLASS_ID = "measure:bip";
public static final String FOS_CLASS_ID = "FOS";
public static final String FOS_CLASS_NAME = "Fields of Science and Technology classification";
public static final String NULL = "NULL";
public static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private Constants() {
}
public static Boolean isSparkSessionManaged(ArgumentApplicationParser parser) {
return Optional
.ofNullable(parser.get("isSparkSessionManaged"))
.map(Boolean::valueOf)
.orElse(Boolean.TRUE);
}
public static <R> Dataset<R> readPath(
SparkSession spark, String inputPath, Class<R> clazz) {
return spark
.read()
.textFile(inputPath)
.map((MapFunction<String, R>) value -> OBJECT_MAPPER.readValue(value, clazz), Encoders.bean(clazz));
}
}
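
Note on `isSparkSessionManaged` above: the flag defaults to `TRUE` when the parameter is absent, and `Boolean.valueOf` maps anything other than a case-insensitive "true" to `false`. A minimal sketch of that behaviour with plain strings in place of `ArgumentApplicationParser` (the class `ManagedFlagSketch` and the method `isManaged` are hypothetical):

```java
import java.util.Optional;

// Hypothetical stand-in for Constants.isSparkSessionManaged, taking the raw parameter value.
final class ManagedFlagSketch {

	static Boolean isManaged(String rawValue) {
		return Optional
			.ofNullable(rawValue) // absent parameter -> empty Optional
			.map(Boolean::valueOf) // "true" (any case) -> TRUE, everything else -> FALSE
			.orElse(Boolean.TRUE); // default when the parameter was not supplied
	}
}
```

One consequence worth keeping in mind: a mistyped value such as "yes" silently disables session management instead of failing.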


@ -0,0 +1,77 @@
package eu.dnetlib.dhp.actionmanager.createunresolvedentities;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Serializable;
import java.util.Objects;
import java.util.Optional;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.collection.GetCSV;
public class GetFOSData implements Serializable {
private static final Logger log = LoggerFactory.getLogger(GetFOSData.class);
public static final char DEFAULT_DELIMITER = '\t';
public static void main(final String[] args) throws Exception {
final ArgumentApplicationParser parser = new ArgumentApplicationParser(
IOUtils
.toString(
Objects
.requireNonNull(
GetFOSData.class
.getResourceAsStream(
"/eu/dnetlib/dhp/actionmanager/createunresolvedentities/get_fos_parameters.json"))));
parser.parseArgument(args);
// the path where the original fos csv file is stored
final String sourcePath = parser.get("sourcePath");
log.info("sourcePath {}", sourcePath);
// the path where to put the file as json
final String outputPath = parser.get("outputPath");
log.info("outputPath {}", outputPath);
final String hdfsNameNode = parser.get("hdfsNameNode");
log.info("hdfsNameNode {}", hdfsNameNode);
final String classForName = parser.get("classForName");
log.info("classForName {}", classForName);
final char delimiter = Optional
.ofNullable(parser.get("delimiter"))
.map(s -> s.charAt(0))
.orElse(DEFAULT_DELIMITER);
log.info("delimiter {}", delimiter);
Configuration conf = new Configuration();
conf.set("fs.defaultFS", hdfsNameNode);
FileSystem fileSystem = FileSystem.get(conf);
new GetFOSData().doRewrite(sourcePath, outputPath, classForName, delimiter, fileSystem);
}
public void doRewrite(String inputPath, String outputFile, String classForName, char delimiter, FileSystem fs)
throws IOException, ClassNotFoundException {
// reads the csv and writes it as its json equivalent
try (InputStreamReader reader = new InputStreamReader(fs.open(new Path(inputPath)))) {
GetCSV.getCsv(fs, reader, outputFile, classForName, delimiter);
}
}
}
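
Note on the `delimiter` handling in `GetFOSData.main` above: `s.charAt(0)` throws `StringIndexOutOfBoundsException` if the parameter is present but empty. The sketch below shows one possible guard, assuming an empty value should fall back to the default; `DelimiterSketch` and `resolve` are hypothetical names, and the fallback policy is an assumption rather than something the change set states.

```java
import java.util.Optional;

// Hypothetical helper illustrating delimiter resolution with an empty-string guard.
final class DelimiterSketch {

	static final char DEFAULT_DELIMITER = '\t';

	static char resolve(String raw) {
		return Optional
			.ofNullable(raw)
			.filter(s -> !s.isEmpty()) // assumption: treat "" like a missing parameter
			.map(s -> s.charAt(0)) // only the first character is used
			.orElse(DEFAULT_DELIMITER);
	}
}
```

As in the original code, only the first character of a longer value is kept, so ";;" resolves to ';'.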


@ -0,0 +1,145 @@
package eu.dnetlib.dhp.actionmanager.createunresolvedentities;
import static eu.dnetlib.dhp.actionmanager.createunresolvedentities.Constants.*;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import java.io.Serializable;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import org.apache.commons.io.IOUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.actionmanager.createunresolvedentities.model.BipDeserialize;
import eu.dnetlib.dhp.actionmanager.createunresolvedentities.model.BipScore;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.KeyValue;
import eu.dnetlib.dhp.schema.oaf.Measure;
import eu.dnetlib.dhp.schema.oaf.Result;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
import eu.dnetlib.dhp.utils.DHPUtils;
public class PrepareBipFinder implements Serializable {
private static final Logger log = LoggerFactory.getLogger(PrepareBipFinder.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
public static void main(String[] args) throws Exception {
String jsonConfiguration = IOUtils
.toString(
PrepareBipFinder.class
.getResourceAsStream(
"/eu/dnetlib/dhp/actionmanager/createunresolvedentities/prepare_parameters.json"));
final ArgumentApplicationParser parser = new ArgumentApplicationParser(jsonConfiguration);
parser.parseArgument(args);
Boolean isSparkSessionManaged = Optional
.ofNullable(parser.get("isSparkSessionManaged"))
.map(Boolean::valueOf)
.orElse(Boolean.TRUE);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
final String sourcePath = parser.get("sourcePath");
log.info("sourcePath: {}", sourcePath);
final String outputPath = parser.get("outputPath");
log.info("outputPath: {}", outputPath);
SparkConf conf = new SparkConf();
runWithSparkSession(
conf,
isSparkSessionManaged,
spark -> {
HdfsSupport.remove(outputPath, spark.sparkContext().hadoopConfiguration());
prepareResults(spark, sourcePath, outputPath);
});
}
private static void prepareResults(SparkSession spark, String inputPath, String outputPath) {
final JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
JavaRDD<BipDeserialize> bipDeserializeJavaRDD = sc
.textFile(inputPath)
.map(item -> OBJECT_MAPPER.readValue(item, BipDeserialize.class));
spark
.createDataset(bipDeserializeJavaRDD.flatMap(entry -> entry.keySet().stream().map(key -> {
BipScore bs = new BipScore();
bs.setId(key);
bs.setScoreList(entry.get(key));
return bs;
}).collect(Collectors.toList()).iterator()).rdd(), Encoders.bean(BipScore.class))
.map((MapFunction<BipScore, Result>) v -> {
Result r = new Result();
r.setId(DHPUtils.generateUnresolvedIdentifier(v.getId(), DOI));
r.setMeasures(getMeasure(v));
return r;
}, Encoders.bean(Result.class))
.write()
.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(outputPath + "/bip");
}
private static List<Measure> getMeasure(BipScore value) {
return value
.getScoreList()
.stream()
.map(score -> {
Measure m = new Measure();
m.setId(score.getId());
m
.setUnit(
score
.getUnit()
.stream()
.map(unit -> {
KeyValue kv = new KeyValue();
kv.setValue(unit.getValue());
kv.setKey(unit.getKey());
kv
.setDataInfo(
OafMapperUtils
.dataInfo(
false,
UPDATE_DATA_INFO_TYPE,
true,
false,
OafMapperUtils
.qualifier(
UPDATE_MEASURE_BIP_CLASS_ID,
UPDATE_CLASS_NAME,
ModelConstants.DNET_PROVENANCE_ACTIONS,
ModelConstants.DNET_PROVENANCE_ACTIONS),
""));
return kv;
})
.collect(Collectors.toList()));
return m;
})
.collect(Collectors.toList());
}
}


@ -0,0 +1,133 @@
package eu.dnetlib.dhp.actionmanager.createunresolvedentities;
import static eu.dnetlib.dhp.actionmanager.createunresolvedentities.Constants.*;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import java.io.Serializable;
import java.util.*;
import java.util.stream.Collectors;
import org.apache.commons.io.IOUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import eu.dnetlib.dhp.actionmanager.createunresolvedentities.model.FOSDataModel;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.Result;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
import eu.dnetlib.dhp.utils.DHPUtils;
public class PrepareFOSSparkJob implements Serializable {
private static final Logger log = LoggerFactory.getLogger(PrepareFOSSparkJob.class);
public static void main(String[] args) throws Exception {
String jsonConfiguration = IOUtils
.toString(
PrepareFOSSparkJob.class
.getResourceAsStream(
"/eu/dnetlib/dhp/actionmanager/createunresolvedentities/prepare_parameters.json"));
final ArgumentApplicationParser parser = new ArgumentApplicationParser(jsonConfiguration);
parser.parseArgument(args);
Boolean isSparkSessionManaged = isSparkSessionManaged(parser);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
String sourcePath = parser.get("sourcePath");
log.info("sourcePath: {}", sourcePath);
final String outputPath = parser.get("outputPath");
log.info("outputPath: {}", outputPath);
SparkConf conf = new SparkConf();
runWithSparkSession(
conf,
isSparkSessionManaged,
spark -> {
distributeFOSdois(
spark,
sourcePath,
outputPath);
});
}
private static void distributeFOSdois(SparkSession spark, String sourcePath, String outputPath) {
Dataset<FOSDataModel> fosDataset = readPath(spark, sourcePath, FOSDataModel.class);
fosDataset.flatMap((FlatMapFunction<FOSDataModel, FOSDataModel>) v -> {
List<FOSDataModel> fosList = new ArrayList<>();
final String level1 = v.getLevel1();
final String level2 = v.getLevel2();
final String level3 = v.getLevel3();
Arrays
.stream(v.getDoi().split("\u0002"))
.forEach(d -> fosList.add(FOSDataModel.newInstance(d, level1, level2, level3)));
return fosList.iterator();
}, Encoders.bean(FOSDataModel.class))
.map((MapFunction<FOSDataModel, Result>) value -> {
Result r = new Result();
r.setId(DHPUtils.generateUnresolvedIdentifier(value.getDoi(), DOI));
r.setSubject(getSubjects(value));
return r;
}, Encoders.bean(Result.class))
.write()
.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(outputPath + "/fos");
}
private static List<StructuredProperty> getSubjects(FOSDataModel fos) {
return Arrays
.asList(getSubject(fos.getLevel1()), getSubject(fos.getLevel2()), getSubject(fos.getLevel3()))
.stream()
.filter(Objects::nonNull)
.collect(Collectors.toList());
}
private static StructuredProperty getSubject(String sbj) {
if (sbj == null || sbj.equals(NULL))
return null;
StructuredProperty sp = new StructuredProperty();
sp.setValue(sbj);
sp
.setQualifier(
OafMapperUtils
.qualifier(
FOS_CLASS_ID,
FOS_CLASS_NAME,
ModelConstants.DNET_SUBJECT_TYPOLOGIES,
ModelConstants.DNET_SUBJECT_TYPOLOGIES));
sp
.setDataInfo(
OafMapperUtils
.dataInfo(
false,
UPDATE_DATA_INFO_TYPE,
true,
false,
OafMapperUtils
.qualifier(
UPDATE_SUBJECT_FOS_CLASS_ID,
UPDATE_CLASS_NAME,
ModelConstants.DNET_PROVENANCE_ACTIONS,
ModelConstants.DNET_PROVENANCE_ACTIONS),
""));
return sp;
}
}
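
Note on `distributeFOSdois` and `getSubjects` above: each input row may carry several DOIs joined by the control character `\u0002`, and a level equal to the literal string "NULL" produces no subject. The following sketch reproduces those two steps on plain strings, without Spark or the OAF model; `FosSplitSketch`, `expand`, and `subjects` are hypothetical names, and rows are modelled as `String[]` purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical, Spark-free illustration of the FOS row expansion.
final class FosSplitSketch {

	static final String DOI_SEPARATOR = "\u0002";
	static final String NULL = "NULL";

	// One output row {doi, level1, level2, level3} per DOI found in the joined field.
	static List<String[]> expand(String dois, String level1, String level2, String level3) {
		List<String[]> rows = new ArrayList<>();
		for (String doi : dois.split(DOI_SEPARATOR)) {
			rows.add(new String[] { doi, level1, level2, level3 });
		}
		return rows;
	}

	// Keeps only the levels that carry a value, dropping nulls and the "NULL" marker.
	static List<String> subjects(String... levels) {
		return Arrays
			.stream(levels)
			.filter(Objects::nonNull)
			.filter(level -> !NULL.equals(level))
			.collect(Collectors.toList());
	}
}
```

Unlike `getSubject` in the hunk, this sketch also tolerates a `null` level; whether the real input can contain one is not something the diff shows.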


@ -0,0 +1,79 @@
package eu.dnetlib.dhp.actionmanager.createunresolvedentities;
import static eu.dnetlib.dhp.actionmanager.createunresolvedentities.Constants.*;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import java.io.Serializable;
import org.apache.commons.io.IOUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsFunction;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.schema.oaf.Result;
public class SparkSaveUnresolved implements Serializable {
private static final Logger log = LoggerFactory.getLogger(SparkSaveUnresolved.class);
public static void main(String[] args) throws Exception {
String jsonConfiguration = IOUtils
.toString(
SparkSaveUnresolved.class
.getResourceAsStream(
"/eu/dnetlib/dhp/actionmanager/createunresolvedentities/produce_unresolved_parameters.json"));
final ArgumentApplicationParser parser = new ArgumentApplicationParser(jsonConfiguration);
parser.parseArgument(args);
Boolean isSparkSessionManaged = isSparkSessionManaged(parser);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
String sourcePath = parser.get("sourcePath");
log.info("sourcePath: {}", sourcePath);
final String outputPath = parser.get("outputPath");
log.info("outputPath: {}", outputPath);
SparkConf conf = new SparkConf();
runWithSparkSession(
conf,
isSparkSessionManaged,
spark -> {
saveUnresolved(
spark,
sourcePath,
outputPath);
});
}
private static void saveUnresolved(SparkSession spark, String sourcePath, String outputPath) {
spark
.read()
.textFile(sourcePath + "/*")
.map(
(MapFunction<String, Result>) l -> OBJECT_MAPPER.readValue(l, Result.class),
Encoders.bean(Result.class))
.groupByKey((MapFunction<Result, String>) Result::getId, Encoders.STRING())
.mapGroups((MapGroupsFunction<String, Result, Result>) (k, it) -> {
Result ret = it.next();
it.forEachRemaining(r -> ret.mergeFrom(r));
return ret;
}, Encoders.bean(Result.class))
.write()
.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(outputPath);
}
}
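
Note on the file above: saveUnresolved groups the deserialized records by id and folds each group into its first element with mergeFrom. The same group-and-merge step can be sketched without Spark as below. This is only an illustration under stated assumptions: Item, its subjects field, and the duplicate-skipping merge are hypothetical stand-ins for Result and Result.mergeFrom, whose real merge rules are not shown in this diff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Spark-free sketch of "group by id, then merge each group into one record".
class MergeByIdSketch {

    // Hypothetical stand-in for Result: an id plus a list of values.
    static final class Item {
        final String id;
        final List<String> subjects = new ArrayList<>();

        Item(String id, String... subjects) {
            this.id = id;
            this.subjects.addAll(Arrays.asList(subjects));
        }

        // Assumed merge rule: append the other item's subjects, skipping duplicates.
        void mergeFrom(Item other) {
            for (String s : other.subjects) {
                if (!subjects.contains(s)) {
                    subjects.add(s);
                }
            }
        }
    }

    // Keeps the first item seen for each id and merges every later item into it.
    static Map<String, Item> mergeById(List<Item> items) {
        Map<String, Item> merged = new LinkedHashMap<>();
        for (Item item : items) {
            Item first = merged.get(item.id);
            if (first == null) {
                merged.put(item.id, item);
            } else {
                first.mergeFrom(item);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Item> input = Arrays.asList(
            new Item("a", "x"),
            new Item("b", "y"),
            new Item("a", "x", "z"));
        for (Item item : mergeById(input).values()) {
            System.out.println(item.id + " -> " + item.subjects);
        }
    }
}
```

As in the Spark version, the first record of each group is mutated in place, so the result depends on which record arrives first whenever the merge is not symmetric.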


@ -0,0 +1,28 @@
package eu.dnetlib.dhp.actionmanager.createunresolvedentities.model;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
/**
* Class that maps the model of the bipFinder! input data.
* Only needed for deserialization purposes
*/
public class BipDeserialize extends HashMap<String, List<Score>> implements Serializable {
public BipDeserialize() {
super();
}
public List<Score> get(String key) {
if (super.get(key) == null) {
return new ArrayList<>();
}
return super.get(key);
}
}
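
Note on the file above: BipDeserialize declares get(String), while HashMap declares get(Object), so the new method is an overload rather than an override. A minimal sketch of what that implies for callers follows, assuming a plain List<String> value type in place of List<Score>; the class name DefaultingMapSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;

// Sketch of a HashMap subclass that returns an empty list for absent keys.
class DefaultingMapSketch extends HashMap<String, List<String>> {

    // Overload (not override): chosen only when the argument's static type is String.
    public List<String> get(String key) {
        List<String> value = super.get(key);
        // The empty list is returned to the caller but not stored in the map.
        return value == null ? new ArrayList<>() : value;
    }

    public static void main(String[] args) {
        DefaultingMapSketch map = new DefaultingMapSketch();
        map.put("10.1000/known", new ArrayList<>(Arrays.asList("score")));

        System.out.println(map.get("10.1000/known"));
        System.out.println(map.get("10.1000/missing"));

        // A key typed as Object resolves to HashMap.get(Object) and can return null.
        Object key = "10.1000/missing";
        System.out.println(map.get(key));
    }
}
```

Code that reaches the map through the Map interface, or with a key typed as Object, bypasses the defaulting behaviour, which may matter if the class is ever used outside deserialization.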


@ -0,0 +1,30 @@
package eu.dnetlib.dhp.actionmanager.createunresolvedentities.model;
import java.io.Serializable;
import java.util.List;
/**
* Rewriting of the bipFinder input data by extracting the identifier of the result (doi)
*/
public class BipScore implements Serializable {
private String id; // doi
private List<Score> scoreList; // unit as given in the inputfile
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public List<Score> getScoreList() {
return scoreList;
}
public void setScoreList(List<Score> scoreList) {
this.scoreList = scoreList;
}
}


@ -0,0 +1,71 @@
package eu.dnetlib.dhp.actionmanager.createunresolvedentities.model;
import java.io.Serializable;
import com.opencsv.bean.CsvBindByPosition;
public class FOSDataModel implements Serializable {
@CsvBindByPosition(position = 1)
// @CsvBindByName(column = "doi")
private String doi;
@CsvBindByPosition(position = 2)
// @CsvBindByName(column = "level1")
private String level1;
@CsvBindByPosition(position = 3)
// @CsvBindByName(column = "level2")
private String level2;
@CsvBindByPosition(position = 4)
// @CsvBindByName(column = "level3")
private String level3;
public FOSDataModel() {
}
public FOSDataModel(String doi, String level1, String level2, String level3) {
this.doi = doi;
this.level1 = level1;
this.level2 = level2;
this.level3 = level3;
}
public static FOSDataModel newInstance(String d, String level1, String level2, String level3) {
return new FOSDataModel(d, level1, level2, level3);
}
public String getDoi() {
return doi;
}
public void setDoi(String doi) {
this.doi = doi;
}
public String getLevel1() {
return level1;
}
public void setLevel1(String level1) {
this.level1 = level1;
}
public String getLevel2() {
return level2;
}
public void setLevel2(String level2) {
this.level2 = level2;
}
public String getLevel3() {
return level3;
}
public void setLevel3(String level3) {
this.level3 = level3;
}
}


@ -0,0 +1,26 @@
package eu.dnetlib.dhp.actionmanager.createunresolvedentities.model;
import java.io.Serializable;
public class KeyValue implements Serializable {
private String key;
private String value;
public String getKey() {
return key;
}
public void setKey(String key) {
this.key = key;
}
public String getValue() {
return value;
}
public void setValue(String value) {
this.value = value;
}
}


@ -0,0 +1,30 @@
package eu.dnetlib.dhp.actionmanager.createunresolvedentities.model;
import java.io.Serializable;
import java.util.List;
/**
* represents the score in the input file
*/
public class Score implements Serializable {
private String id;
private List<KeyValue> unit;
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public List<KeyValue> getUnit() {
return unit;
}
public void setUnit(List<KeyValue> unit) {
this.unit = unit;
}
}


@ -1,86 +0,0 @@
package eu.dnetlib.dhp.actionmanager.datacite
import org.apache.commons.io.IOUtils
import org.apache.http.client.methods.{HttpGet, HttpPost, HttpRequestBase, HttpUriRequest}
import org.apache.http.entity.StringEntity
import org.apache.http.impl.client.HttpClients
import java.io.IOException
abstract class AbstractRestClient extends Iterator[String]{
var buffer: List[String] = List()
var current_index:Int = 0
var scroll_value: Option[String] = None
var complete:Boolean = false
def extractInfo(input: String): Unit
protected def getBufferData(): Unit
def doHTTPGETRequest(url:String): String = {
val httpGet = new HttpGet(url)
doHTTPRequest(httpGet)
}
def doHTTPPOSTRequest(url:String, json:String): String = {
val httpPost = new HttpPost(url)
if (json != null) {
val entity = new StringEntity(json)
httpPost.setEntity(entity)
httpPost.setHeader("Accept", "application/json")
httpPost.setHeader("Content-type", "application/json")
}
doHTTPRequest(httpPost)
}
def hasNext: Boolean = {
buffer.nonEmpty && current_index < buffer.size
}
override def next(): String = {
val next_item:String = buffer(current_index)
current_index = current_index + 1
if (current_index == buffer.size)
getBufferData()
next_item
}
private def doHTTPRequest[A <: HttpUriRequest](r: A) :String ={
val client = HttpClients.createDefault
var tries = 4
try {
while (tries > 0) {
println(s"requesting ${r.getURI}")
val response = client.execute(r)
println(s"get response with status${response.getStatusLine.getStatusCode}")
if (response.getStatusLine.getStatusCode > 400) {
tries -= 1
}
else
return IOUtils.toString(response.getEntity.getContent)
}
""
} catch {
case e: Throwable =>
throw new RuntimeException("Error on executing request ", e)
} finally try client.close()
catch {
case e: IOException =>
throw new RuntimeException("Unable to close client ", e)
}
}
getBufferData()
}


@ -1,41 +0,0 @@
package eu.dnetlib.dhp.actionmanager.datacite
import eu.dnetlib.dhp.application.ArgumentApplicationParser
import eu.dnetlib.dhp.schema.oaf.Oaf
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapred.SequenceFileOutputFormat
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SaveMode, SparkSession}
import org.slf4j.{Logger, LoggerFactory}
import scala.io.Source
object ExportActionSetJobNode {
val log: Logger = LoggerFactory.getLogger(ExportActionSetJobNode.getClass)
def main(args: Array[String]): Unit = {
val conf = new SparkConf
val parser = new ArgumentApplicationParser(Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/actionmanager/datacite/exportDataset_parameters.json")).mkString)
parser.parseArgument(args)
val master = parser.get("master")
val sourcePath = parser.get("sourcePath")
val targetPath = parser.get("targetPath")
val spark: SparkSession = SparkSession.builder().config(conf)
.appName(ExportActionSetJobNode.getClass.getSimpleName)
.master(master)
.getOrCreate()
implicit val resEncoder: Encoder[Oaf] = Encoders.kryo[Oaf]
implicit val tEncoder:Encoder[(String,String)] = Encoders.tuple(Encoders.STRING,Encoders.STRING)
spark.read.load(sourcePath).as[Oaf]
.map(o =>DataciteToOAFTransformation.toActionSet(o))
.filter(o => o!= null)
.rdd.map(s => (new Text(s._1), new Text(s._2))).saveAsHadoopFile(s"$targetPath", classOf[Text], classOf[Text], classOf[SequenceFileOutputFormat[Text,Text]], classOf[GzipCodec])
}
}


@ -1,46 +0,0 @@
package eu.dnetlib.dhp.actionmanager.datacite
import eu.dnetlib.dhp.application.ArgumentApplicationParser
import eu.dnetlib.dhp.common.vocabulary.VocabularyGroup
import eu.dnetlib.dhp.schema.mdstore.MetadataRecord
import eu.dnetlib.dhp.schema.oaf.{Oaf, Result}
import eu.dnetlib.dhp.utils.ISLookupClientFactory
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SaveMode, SparkSession}
import org.slf4j.{Logger, LoggerFactory}
import scala.io.Source
object FilterCrossrefEntitiesSpark {
val log: Logger = LoggerFactory.getLogger(getClass.getClass)
def main(args: Array[String]): Unit = {
val conf = new SparkConf
val parser = new ArgumentApplicationParser(Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/actionmanager/datacite/filter_crossref_param.json")).mkString)
parser.parseArgument(args)
val master = parser.get("master")
val sourcePath = parser.get("sourcePath")
log.info("sourcePath: {}", sourcePath)
val targetPath = parser.get("targetPath")
log.info("targetPath: {}", targetPath)
val spark: SparkSession = SparkSession.builder().config(conf)
.appName(getClass.getSimpleName)
.master(master)
.getOrCreate()
implicit val oafEncoder: Encoder[Oaf] = Encoders.kryo[Oaf]
implicit val resEncoder: Encoder[Result] = Encoders.kryo[Result]
val d:Dataset[Oaf]= spark.read.load(sourcePath).as[Oaf]
d.filter(r => r.isInstanceOf[Result]).map(r => r.asInstanceOf[Result]).write.mode(SaveMode.Overwrite).save(targetPath)
}
}


@ -0,0 +1,181 @@
package eu.dnetlib.dhp.actionmanager.opencitations;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import java.io.IOException;
import java.io.Serializable;
import java.util.*;
import org.apache.commons.cli.ParseException;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions;
import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory;
import scala.Tuple2;
public class CreateActionSetSparkJob implements Serializable {
public static final String OPENCITATIONS_CLASSID = "sysimport:crosswalk:opencitations";
public static final String OPENCITATIONS_CLASSNAME = "Imported from OpenCitations";
private static final String ID_PREFIX = "50|doi_________::";
private static final String TRUST = "0.91";
private static final Logger log = LoggerFactory.getLogger(CreateActionSetSparkJob.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
public static void main(final String[] args) throws IOException, ParseException {
final ArgumentApplicationParser parser = new ArgumentApplicationParser(
IOUtils
.toString(
Objects
.requireNonNull(
CreateActionSetSparkJob.class
.getResourceAsStream(
"/eu/dnetlib/dhp/actionmanager/opencitations/as_parameters.json"))));
parser.parseArgument(args);
Boolean isSparkSessionManaged = Optional
.ofNullable(parser.get("isSparkSessionManaged"))
.map(Boolean::valueOf)
.orElse(Boolean.TRUE);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
final String inputPath = parser.get("inputPath");
log.info("inputPath {}", inputPath);
final String outputPath = parser.get("outputPath");
log.info("outputPath {}", outputPath);
final boolean shouldDuplicateRels = Optional
.ofNullable(parser.get("shouldDuplicateRels"))
.map(Boolean::valueOf)
.orElse(Boolean.FALSE);
SparkConf conf = new SparkConf();
runWithSparkSession(
conf,
isSparkSessionManaged,
spark -> {
extractContent(spark, inputPath, outputPath, shouldDuplicateRels);
});
}
private static void extractContent(SparkSession spark, String inputPath, String outputPath,
boolean shouldDuplicateRels) {
spark
.sqlContext()
.createDataset(spark.sparkContext().textFile(inputPath + "/*", 6000), Encoders.STRING())
.flatMap(
(FlatMapFunction<String, Relation>) value -> createRelation(value, shouldDuplicateRels).iterator(),
Encoders.bean(Relation.class))
.filter((FilterFunction<Relation>) value -> value != null)
.toJavaRDD()
.map(p -> new AtomicAction(p.getClass(), p))
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))))
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class);
}
private static List<Relation> createRelation(String value, boolean duplicate) {
String[] line = value.split(",");
if (!line[1].startsWith("10.")) {
return new ArrayList<>();
}
List<Relation> relationList = new ArrayList<>();
String citing = ID_PREFIX + IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", line[1]));
final String cited = ID_PREFIX + IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", line[2]));
relationList
.addAll(
getRelations(
citing,
cited));
if (duplicate && line[1].endsWith(".refs")) {
citing = ID_PREFIX + IdentifierFactory
.md5(CleaningFunctions.normalizePidValue("doi", line[1].substring(0, line[1].indexOf(".refs"))));
relationList.addAll(getRelations(citing, cited));
}
return relationList;
}
private static Collection<Relation> getRelations(String citing, String cited) {
return Arrays
.asList(
getRelation(citing, cited, ModelConstants.CITES),
getRelation(cited, citing, ModelConstants.IS_CITED_BY));
}
public static Relation getRelation(
String source,
String target,
String relclass) {
Relation r = new Relation();
r.setCollectedfrom(getCollectedFrom());
r.setSource(source);
r.setTarget(target);
r.setRelClass(relclass);
r.setRelType(ModelConstants.RESULT_RESULT);
r.setSubRelType(ModelConstants.CITATION);
r
.setDataInfo(
getDataInfo());
return r;
}
public static List<KeyValue> getCollectedFrom() {
KeyValue kv = new KeyValue();
kv.setKey(ModelConstants.OPENOCITATIONS_ID);
kv.setValue(ModelConstants.OPENOCITATIONS_NAME);
return Arrays.asList(kv);
}
public static DataInfo getDataInfo() {
DataInfo di = new DataInfo();
di.setInferred(false);
di.setDeletedbyinference(false);
di.setTrust(TRUST);
di
.setProvenanceaction(
getQualifier(OPENCITATIONS_CLASSID, OPENCITATIONS_CLASSNAME, ModelConstants.DNET_PROVENANCE_ACTIONS));
return di;
}
public static Qualifier getQualifier(String class_id, String class_name,
String qualifierSchema) {
Qualifier pa = new Qualifier();
pa.setClassid(class_id);
pa.setClassname(class_name);
pa.setSchemeid(qualifierSchema);
pa.setSchemename(qualifierSchema);
return pa;
}
}
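
Note on the file above: createRelation emits a Cites/IsCitedBy pair for each input line and, when shouldDuplicateRels is set and the citing DOI ends with ".refs", a second pair for the DOI with that suffix stripped. The sketch below reproduces only that control flow under stated assumptions: relations are plain string triples, the md5-based identifiers and DOI normalization are omitted, the relation labels are placeholders for the ModelConstants values, and the field-count guard is an addition that the original does not have.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified sketch of the per-line relation generation for citation data.
class CitationRelationSketch {

    // Placeholder labels; the real job takes them from ModelConstants.
    static final String CITES = "Cites";
    static final String IS_CITED_BY = "IsCitedBy";

    // Returns {source, relClass, target} triples for one comma-separated line.
    static List<String[]> createRelations(String csvLine, boolean duplicate) {
        List<String[]> relations = new ArrayList<>();
        String[] fields = csvLine.split(",");
        // Added guard (not in the original): skip lines with fewer than three fields.
        if (fields.length < 3 || !fields[1].startsWith("10.")) {
            return relations;
        }
        String citing = fields[1];
        String cited = fields[2];
        addPair(relations, citing, cited);
        if (duplicate && citing.endsWith(".refs")) {
            // Same stripping as the original: cut at the first ".refs".
            String stripped = citing.substring(0, citing.indexOf(".refs"));
            addPair(relations, stripped, cited);
        }
        return relations;
    }

    // Adds both directions of one citation link.
    private static void addPair(List<String[]> out, String citing, String cited) {
        out.add(new String[] { citing, CITES, cited });
        out.add(new String[] { cited, IS_CITED_BY, citing });
    }

    public static void main(String[] args) {
        for (String[] rel : createRelations("oci-1,10.1000/a.refs,10.2000/b", true)) {
            System.out.println(Arrays.toString(rel));
        }
    }
}
```

Because every citation yields two relations, and up to four with duplication enabled, the output of this step is roughly two to four times the size of its input.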


@ -0,0 +1,93 @@
package eu.dnetlib.dhp.actionmanager.opencitations;
import java.io.*;
import java.io.Serializable;
import java.util.Objects;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import org.apache.commons.cli.ParseException;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
public class GetOpenCitationsRefs implements Serializable {
private static final Logger log = LoggerFactory.getLogger(GetOpenCitationsRefs.class);
public static void main(final String[] args) throws IOException, ParseException {
final ArgumentApplicationParser parser = new ArgumentApplicationParser(
IOUtils
.toString(
Objects
.requireNonNull(
GetOpenCitationsRefs.class
.getResourceAsStream(
"/eu/dnetlib/dhp/actionmanager/opencitations/input_parameters.json"))));
parser.parseArgument(args);
final String[] inputFile = parser.get("inputFile").split(";");
log.info("inputFile {}", String.join(";", inputFile));
final String workingPath = parser.get("workingPath");
log.info("workingPath {}", workingPath);
final String hdfsNameNode = parser.get("hdfsNameNode");
log.info("hdfsNameNode {}", hdfsNameNode);
Configuration conf = new Configuration();
conf.set("fs.defaultFS", hdfsNameNode);
FileSystem fileSystem = FileSystem.get(conf);
GetOpenCitationsRefs ocr = new GetOpenCitationsRefs();
for (String file : inputFile) {
ocr.doExtract(workingPath + "/Original/" + file, workingPath, fileSystem);
}
}
private void doExtract(String inputFile, String workingPath, FileSystem fileSystem)
throws IOException {
final Path path = new Path(inputFile);
FSDataInputStream oc_zip = fileSystem.open(path);
int count = 1;
try (ZipInputStream zis = new ZipInputStream(oc_zip)) {
ZipEntry entry = null;
while ((entry = zis.getNextEntry()) != null) {
if (!entry.isDirectory()) {
String fileName = entry.getName();
fileName = fileName.substring(0, fileName.indexOf("T")) + "_" + count;
count++;
try (
FSDataOutputStream out = fileSystem
.create(new Path(workingPath + "/COCI/" + fileName + ".gz"));
GZIPOutputStream gzipOs = new GZIPOutputStream(new BufferedOutputStream(out))) {
IOUtils.copy(zis, gzipOs);
}
}
}
}
}
}


@ -20,7 +20,7 @@ import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.actionmanager.project.utils.CSVProgramme;
import eu.dnetlib.dhp.actionmanager.project.utils.model.CSVProgramme;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import scala.Tuple2;
@ -171,26 +171,23 @@ public class PrepareProgramme {
}
private static CSVProgramme groupProgrammeByCode(CSVProgramme a, CSVProgramme b) {
if (!a.getLanguage().equals("en")) {
if (b.getLanguage().equalsIgnoreCase("en")) {
a.setTitle(b.getTitle());
a.setLanguage(b.getLanguage());
}
if (!a.getLanguage().equals("en") && b.getLanguage().equalsIgnoreCase("en")) {
a.setTitle(b.getTitle());
a.setLanguage(b.getLanguage());
}
if (StringUtils.isEmpty(a.getShortTitle())) {
if (!StringUtils.isEmpty(b.getShortTitle())) {
a.setShortTitle(b.getShortTitle());
}
if (StringUtils.isEmpty(a.getShortTitle()) && !StringUtils.isEmpty(b.getShortTitle())) {
a.setShortTitle(b.getShortTitle());
}
return a;
}
@SuppressWarnings("unchecked")
private static List<CSVProgramme> prepareClassification(JavaRDD<CSVProgramme> h2020Programmes) {
Object[] codedescription = h2020Programmes
.map(
value -> new Tuple2<>(value.getCode(),
new Tuple2<String, String>(value.getTitle(), value.getShortTitle())))
new Tuple2<>(value.getTitle(), value.getShortTitle())))
.collect()
.toArray();
@ -216,7 +213,7 @@ public class PrepareProgramme {
String[] tmp = ent.split("\\.");
if (tmp.length <= 2) {
if (StringUtils.isEmpty(entry._2()._2())) {
map.put(entry._1(), new Tuple2<String, String>(entry._2()._1(), entry._2()._1()));
map.put(entry._1(), new Tuple2<>(entry._2()._1(), entry._2()._1()));
} else {
map.put(entry._1(), entry._2());
}


@ -18,7 +18,7 @@ import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.actionmanager.project.utils.CSVProject;
import eu.dnetlib.dhp.actionmanager.project.utils.model.CSVProject;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import scala.Tuple2;
@ -29,7 +29,7 @@ import scala.Tuple2;
*/
public class PrepareProjects {
private static final Logger log = LoggerFactory.getLogger(PrepareProgramme.class);
private static final Logger log = LoggerFactory.getLogger(PrepareProjects.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
public static void main(String[] args) throws Exception {


@ -31,15 +31,16 @@ import eu.dnetlib.dhp.common.DbClient;
*/
public class ReadProjectsFromDB implements Closeable {
private final DbClient dbClient;
private static final Log log = LogFactory.getLog(ReadProjectsFromDB.class);
private static final String query = "SELECT code " +
"from projects where id like 'corda__h2020%' ";
private final DbClient dbClient;
private final Configuration conf;
private final BufferedWriter writer;
private final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private final static String query = "SELECT code " +
"from projects where id like 'corda__h2020%' ";
public static void main(final String[] args) throws Exception {
final ArgumentApplicationParser parser = new ArgumentApplicationParser(
IOUtils
@ -65,9 +66,9 @@ public class ReadProjectsFromDB implements Closeable {
}
}
public void execute(final String sql, final Function<ResultSet, List<ProjectSubset>> producer) throws Exception {
public void execute(final String sql, final Function<ResultSet, List<ProjectSubset>> producer) {
final Consumer<ResultSet> consumer = rs -> producer.apply(rs).forEach(r -> writeProject(r));
final Consumer<ResultSet> consumer = rs -> producer.apply(rs).forEach(this::writeProject);
dbClient.processResults(sql, consumer);
}
@ -94,20 +95,20 @@ public class ReadProjectsFromDB implements Closeable {
public ReadProjectsFromDB(
final String hdfsPath, String hdfsNameNode, final String dbUrl, final String dbUser, final String dbPassword)
throws Exception {
throws IOException {
this.dbClient = new DbClient(dbUrl, dbUser, dbPassword);
this.conf = new Configuration();
this.conf.set("fs.defaultFS", hdfsNameNode);
FileSystem fileSystem = FileSystem.get(this.conf);
Path hdfsWritePath = new Path(hdfsPath);
FSDataOutputStream fsDataOutputStream = null;
if (fileSystem.exists(hdfsWritePath)) {
fileSystem.delete(hdfsWritePath, false);
}
fsDataOutputStream = fileSystem.create(hdfsWritePath);
FSDataOutputStream fos = fileSystem.create(hdfsWritePath);
this.writer = new BufferedWriter(new OutputStreamWriter(fsDataOutputStream, StandardCharsets.UTF_8));
this.writer = new BufferedWriter(new OutputStreamWriter(fos, StandardCharsets.UTF_8));
}
@Override


@ -4,7 +4,6 @@ package eu.dnetlib.dhp.actionmanager.project;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Objects;
import java.util.Optional;
@ -22,15 +21,16 @@ import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.actionmanager.project.utils.CSVProgramme;
import eu.dnetlib.dhp.actionmanager.project.utils.CSVProject;
import eu.dnetlib.dhp.actionmanager.project.utils.EXCELTopic;
import eu.dnetlib.dhp.actionmanager.project.utils.model.CSVProgramme;
import eu.dnetlib.dhp.actionmanager.project.utils.model.CSVProject;
import eu.dnetlib.dhp.actionmanager.project.utils.model.EXCELTopic;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.H2020Classification;
import eu.dnetlib.dhp.schema.oaf.H2020Programme;
import eu.dnetlib.dhp.schema.oaf.OafEntity;
import eu.dnetlib.dhp.schema.oaf.Project;
import eu.dnetlib.dhp.utils.DHPUtils;
import scala.Tuple2;
@ -47,13 +47,10 @@ import scala.Tuple2;
*
* To produce a single entry for each project code a grouping step is needed: each project can be associated with more
* than one programme.
*
*
*/
public class SparkAtomicActionJob {
private static final Logger log = LoggerFactory.getLogger(SparkAtomicActionJob.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private static final HashMap<String, String> programmeMap = new HashMap<>();
public static void main(String[] args) throws Exception {
@ -137,7 +134,6 @@ public class SparkAtomicActionJob {
h2020classification.setClassification(csvProgramme.getClassification());
h2020classification.setH2020Programme(pm);
setLevelsandProgramme(h2020classification, csvProgramme.getClassification_short());
// setProgramme(h2020classification, ocsvProgramme.get().getClassification());
pp.setH2020classification(Arrays.asList(h2020classification));
return pp;
@ -152,20 +148,16 @@ public class SparkAtomicActionJob {
.map((MapFunction<Tuple2<Project, EXCELTopic>, Project>) p -> {
Optional<EXCELTopic> op = Optional.ofNullable(p._2());
Project rp = p._1();
if (op.isPresent()) {
rp.setH2020topicdescription(op.get().getTitle());
}
op.ifPresent(excelTopic -> rp.setH2020topicdescription(excelTopic.getTitle()));
return rp;
}, Encoders.bean(Project.class))
.filter(Objects::nonNull)
.groupByKey(
(MapFunction<Project, String>) p -> p.getId(),
(MapFunction<Project, String>) OafEntity::getId,
Encoders.STRING())
.mapGroups((MapGroupsFunction<String, Project, Project>) (s, it) -> {
Project first = it.next();
it.forEachRemaining(p -> {
first.mergeFrom(p);
});
it.forEachRemaining(first::mergeFrom);
return first;
}, Encoders.bean(Project.class))
.toJavaRDD()
@ -189,12 +181,6 @@ public class SparkAtomicActionJob {
h2020Classification.getH2020Programme().setDescription(tmp[tmp.length - 1]);
}
// private static void setProgramme(H2020Classification h2020Classification, String classification) {
// String[] tmp = classification.split(" \\| ");
//
// h2020Classification.getH2020Programme().setDescription(tmp[tmp.length - 1]);
// }
public static <R> Dataset<R> readPath(
SparkSession spark, String inputPath, Class<R> clazz) {
return spark


@ -1,40 +0,0 @@
package eu.dnetlib.dhp.actionmanager.project.utils;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;
import org.apache.commons.lang.reflect.FieldUtils;
/**
* Reads a generic csv and maps it into classes that mirror its schema
*/
public class CSVParser {
public <R> List<R> parse(String csvFile, String classForName)
throws ClassNotFoundException, IOException, IllegalAccessException, InstantiationException {
final CSVFormat format = CSVFormat.EXCEL
.withHeader()
.withDelimiter(';')
.withQuote('"')
.withTrim();
List<R> ret = new ArrayList<>();
final org.apache.commons.csv.CSVParser parser = org.apache.commons.csv.CSVParser.parse(csvFile, format);
final Set<String> headers = parser.getHeaderMap().keySet();
Class<?> clazz = Class.forName(classForName);
for (CSVRecord csvRecord : parser.getRecords()) {
final Object cc = clazz.newInstance();
for (String header : headers) {
FieldUtils.writeField(cc, header, csvRecord.get(header), true);
}
ret.add((R) cc);
}
return ret;
}
}


@ -1,200 +0,0 @@
package eu.dnetlib.dhp.actionmanager.project.utils;
import java.io.Serializable;
/**
* the model for the projects csv file
*/
public class CSVProject implements Serializable {
private String rcn;
private String id;
private String acronym;
private String status;
private String programme;
private String topics;
private String frameworkProgramme;
private String title;
private String startDate;
private String endDate;
private String projectUrl;
private String objective;
private String totalCost;
private String ecMaxContribution;
private String call;
private String fundingScheme;
private String coordinator;
private String coordinatorCountry;
private String participants;
private String participantCountries;
private String subjects;
public String getRcn() {
return rcn;
}
public void setRcn(String rcn) {
this.rcn = rcn;
}
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getAcronym() {
return acronym;
}
public void setAcronym(String acronym) {
this.acronym = acronym;
}
public String getStatus() {
return status;
}
public void setStatus(String status) {
this.status = status;
}
public String getProgramme() {
return programme;
}
public void setProgramme(String programme) {
this.programme = programme;
}
public String getTopics() {
return topics;
}
public void setTopics(String topics) {
this.topics = topics;
}
public String getFrameworkProgramme() {
return frameworkProgramme;
}
public void setFrameworkProgramme(String frameworkProgramme) {
this.frameworkProgramme = frameworkProgramme;
}
public String getTitle() {
return title;
}
public void setTitle(String title) {
this.title = title;
}
public String getStartDate() {
return startDate;
}
public void setStartDate(String startDate) {
this.startDate = startDate;
}
public String getEndDate() {
return endDate;
}
public void setEndDate(String endDate) {
this.endDate = endDate;
}
public String getProjectUrl() {
return projectUrl;
}
public void setProjectUrl(String projectUrl) {
this.projectUrl = projectUrl;
}
public String getObjective() {
return objective;
}
public void setObjective(String objective) {
this.objective = objective;
}
public String getTotalCost() {
return totalCost;
}
public void setTotalCost(String totalCost) {
this.totalCost = totalCost;
}
public String getEcMaxContribution() {
return ecMaxContribution;
}
public void setEcMaxContribution(String ecMaxContribution) {
this.ecMaxContribution = ecMaxContribution;
}
public String getCall() {
return call;
}
public void setCall(String call) {
this.call = call;
}
public String getFundingScheme() {
return fundingScheme;
}
public void setFundingScheme(String fundingScheme) {
this.fundingScheme = fundingScheme;
}
public String getCoordinator() {
return coordinator;
}
public void setCoordinator(String coordinator) {
this.coordinator = coordinator;
}
public String getCoordinatorCountry() {
return coordinatorCountry;
}
public void setCoordinatorCountry(String coordinatorCountry) {
this.coordinatorCountry = coordinatorCountry;
}
public String getParticipants() {
return participants;
}
public void setParticipants(String participants) {
this.participants = participants;
}
public String getParticipantCountries() {
return participantCountries;
}
public void setParticipantCountries(String participantCountries) {
this.participantCountries = participantCountries;
}
public String getSubjects() {
return subjects;
}
public void setSubjects(String subjects) {
this.subjects = subjects;
}
}


@ -17,6 +17,8 @@ import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import eu.dnetlib.dhp.actionmanager.project.utils.model.EXCELTopic;
/**
* Reads a generic excel file and maps it into classes that mirror its schema
*/
@ -26,52 +28,52 @@ public class EXCELParser {
throws ClassNotFoundException, IOException, IllegalAccessException, InstantiationException,
InvalidFormatException {
OPCPackage pkg = OPCPackage.open(file);
XSSFWorkbook wb = new XSSFWorkbook(pkg);
try (OPCPackage pkg = OPCPackage.open(file); XSSFWorkbook wb = new XSSFWorkbook(pkg)) {
XSSFSheet sheet = wb.getSheet(sheetName);
if (sheetName == null) {
throw new RuntimeException("Sheet name " + sheetName + " not present in current file");
}
List<R> ret = new ArrayList<>();
DataFormatter dataFormatter = new DataFormatter();
Iterator<Row> rowIterator = sheet.rowIterator();
List<String> headers = new ArrayList<>();
int count = 0;
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
if (count == 0) {
Iterator<Cell> cellIterator = row.cellIterator();
while (cellIterator.hasNext()) {
Cell cell = cellIterator.next();
headers.add(dataFormatter.formatCellValue(cell));
}
} else {
Class<?> clazz = Class.forName(classForName);
final Object cc = clazz.newInstance();
for (int i = 0; i < headers.size(); i++) {
Cell cell = row.getCell(i);
FieldUtils.writeField(cc, headers.get(i), dataFormatter.formatCellValue(cell), true);
}
EXCELTopic et = (EXCELTopic) cc;
if (StringUtils.isNotBlank(et.getRcn())) {
ret.add((R) cc);
}
XSSFSheet sheet = wb.getSheet(sheetName);
if (sheet == null) {
throw new IllegalArgumentException("Sheet name " + sheetName + " not present in current file");
}
count += 1;
}
List<R> ret = new ArrayList<>();
return ret;
DataFormatter dataFormatter = new DataFormatter();
Iterator<Row> rowIterator = sheet.rowIterator();
List<String> headers = new ArrayList<>();
int count = 0;
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
if (count == 0) {
Iterator<Cell> cellIterator = row.cellIterator();
while (cellIterator.hasNext()) {
Cell cell = cellIterator.next();
headers.add(dataFormatter.formatCellValue(cell));
}
} else {
Class<?> clazz = Class.forName(classForName);
final Object cc = clazz.newInstance();
for (int i = 0; i < headers.size(); i++) {
Cell cell = row.getCell(i);
FieldUtils.writeField(cc, headers.get(i), dataFormatter.formatCellValue(cell), true);
}
EXCELTopic et = (EXCELTopic) cc;
if (StringUtils.isNotBlank(et.getRcn())) {
ret.add((R) cc);
}
}
count += 1;
}
return ret;
}
}
}


@ -1,34 +1,21 @@
package eu.dnetlib.dhp.actionmanager.project.utils;
import java.io.BufferedWriter;
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.io.*;
import java.util.Optional;
import org.apache.commons.io.IOUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.collection.HttpConnector2;
import eu.dnetlib.dhp.common.collection.GetCSV;
import eu.dnetlib.dhp.common.collection.HttpConnector2;
/**
* Applies the parsing of a csv file and writes the Serialization of it in hdfs
*/
public class ReadCSV implements Closeable {
private static final Log log = LogFactory.getLog(ReadCSV.class);
private final Configuration conf;
private final BufferedWriter writer;
private final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private final String csvFile;
public class ReadCSV {
public static void main(final String[] args) throws Exception {
final ArgumentApplicationParser parser = new ArgumentApplicationParser(
@ -44,56 +31,22 @@ public class ReadCSV implements Closeable {
final String hdfsPath = parser.get("hdfsPath");
final String hdfsNameNode = parser.get("hdfsNameNode");
final String classForName = parser.get("classForName");
Optional<String> delimiter = Optional.ofNullable(parser.get("delimiter"));
char del = ';';
if (delimiter.isPresent())
del = delimiter.get().charAt(0);
try (final ReadCSV readCSV = new ReadCSV(hdfsPath, hdfsNameNode, fileURL)) {
Configuration conf = new Configuration();
conf.set("fs.defaultFS", hdfsNameNode);
log.info("Getting CSV file...");
readCSV.execute(classForName);
FileSystem fileSystem = FileSystem.get(conf);
BufferedReader reader = new BufferedReader(
new InputStreamReader(new HttpConnector2().getInputSourceAsStream(fileURL)));
}
}
GetCSV.getCsv(fileSystem, reader, hdfsPath, classForName, del);
public void execute(final String classForName) throws Exception {
CSVParser csvParser = new CSVParser();
csvParser
.parse(csvFile, classForName)
.stream()
.forEach(p -> write(p));
reader.close();
}
@Override
public void close() throws IOException {
writer.close();
}
public ReadCSV(
final String hdfsPath,
final String hdfsNameNode,
final String fileURL)
throws Exception {
this.conf = new Configuration();
this.conf.set("fs.defaultFS", hdfsNameNode);
HttpConnector2 httpConnector = new HttpConnector2();
FileSystem fileSystem = FileSystem.get(this.conf);
Path hdfsWritePath = new Path(hdfsPath);
FSDataOutputStream fsDataOutputStream = null;
if (fileSystem.exists(hdfsWritePath)) {
fileSystem.delete(hdfsWritePath, false);
}
fsDataOutputStream = fileSystem.create(hdfsWritePath);
this.writer = new BufferedWriter(new OutputStreamWriter(fsDataOutputStream, StandardCharsets.UTF_8));
this.csvFile = httpConnector.getInputSource(fileURL);
}
protected void write(final Object p) {
try {
writer.write(OBJECT_MAPPER.writeValueAsString(p));
writer.newLine();
} catch (final Exception e) {
throw new RuntimeException(e);
}
}
}
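Review note: the new `main` reads an optional `delimiter` argument and falls back to `';'`. A minimal sketch of the same default handling as a single `Optional` chain is shown below. `DelimiterDefault` and `resolve` are hypothetical names, and the empty-string guard is an addition of this sketch: the code in the diff calls `charAt(0)` directly, which throws on an empty value.

```java
import java.util.Optional;

// Hypothetical helper, not part of the diff.
final class DelimiterDefault {

    // Returns the first character of raw, or ';' when raw is null or empty.
    static char resolve(String raw) {
        return Optional.ofNullable(raw)
            .filter(s -> !s.isEmpty())
            .map(s -> s.charAt(0))
            .orElse(';');
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));
        System.out.println(resolve(","));
        System.out.println(resolve(""));
    }
}
```

Whether an empty delimiter should fall back silently or be rejected is a design decision for the workflow owner; the sketch only shows the fallback variant.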

View File

@ -11,18 +11,20 @@ import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.collection.HttpConnector2;
import eu.dnetlib.dhp.common.collection.CollectorException;
import eu.dnetlib.dhp.common.collection.HttpConnector2;
/**
* Applies the parsing of an excel file and writes the Serialization of it in hdfs
*/
public class ReadExcel implements Closeable {
private static final Log log = LogFactory.getLog(ReadCSV.class);
private final Configuration conf;
private static final Log log = LogFactory.getLog(ReadExcel.class);
private final BufferedWriter writer;
private final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private final InputStream excelFile;
@ -31,7 +33,7 @@ public class ReadExcel implements Closeable {
final ArgumentApplicationParser parser = new ArgumentApplicationParser(
IOUtils
.toString(
ReadCSV.class
ReadExcel.class
.getResourceAsStream(
"/eu/dnetlib/dhp/actionmanager/project/parameters.json")));
@ -51,13 +53,15 @@ public class ReadExcel implements Closeable {
}
}
public void execute(final String classForName, final String sheetName) throws Exception {
public void execute(final String classForName, final String sheetName)
throws IOException, ClassNotFoundException, InvalidFormatException, IllegalAccessException,
InstantiationException {
EXCELParser excelParser = new EXCELParser();
excelParser
.parse(excelFile, classForName, sheetName)
.stream()
.forEach(p -> write(p));
.forEach(this::write);
}
@Override
@ -68,20 +72,20 @@ public class ReadExcel implements Closeable {
public ReadExcel(
final String hdfsPath,
final String hdfsNameNode,
final String fileURL)
throws Exception {
this.conf = new Configuration();
this.conf.set("fs.defaultFS", hdfsNameNode);
final String fileURL) throws CollectorException, IOException {
final Configuration conf = new Configuration();
conf.set("fs.defaultFS", hdfsNameNode);
HttpConnector2 httpConnector = new HttpConnector2();
FileSystem fileSystem = FileSystem.get(this.conf);
FileSystem fileSystem = FileSystem.get(conf);
Path hdfsWritePath = new Path(hdfsPath);
FSDataOutputStream fsDataOutputStream = null;
if (fileSystem.exists(hdfsWritePath)) {
fileSystem.delete(hdfsWritePath, false);
}
fsDataOutputStream = fileSystem.create(hdfsWritePath);
FSDataOutputStream fos = fileSystem.create(hdfsWritePath);
this.writer = new BufferedWriter(new OutputStreamWriter(fsDataOutputStream, StandardCharsets.UTF_8));
this.writer = new BufferedWriter(new OutputStreamWriter(fos, StandardCharsets.UTF_8));
this.excelFile = httpConnector.getInputSourceAsStream(fileURL);
}

View File

@ -1,20 +1,32 @@
package eu.dnetlib.dhp.actionmanager.project.utils;
package eu.dnetlib.dhp.actionmanager.project.utils.model;
import java.io.Serializable;
import com.opencsv.bean.CsvBindByName;
import com.opencsv.bean.CsvIgnore;
/**
* The model for the programme csv file
*/
public class CSVProgramme implements Serializable {
private String rcn;
@CsvBindByName(column = "code")
private String code;
@CsvBindByName(column = "title")
private String title;
@CsvBindByName(column = "shortTitle")
private String shortTitle;
@CsvBindByName(column = "language")
private String language;
@CsvIgnore
private String classification;
@CsvIgnore
private String classification_short;
public String getClassification_short() {
@ -33,14 +45,6 @@ public class CSVProgramme implements Serializable {
this.classification = classification;
}
public String getRcn() {
return rcn;
}
public void setRcn(String rcn) {
this.rcn = rcn;
}
public String getCode() {
return code;
}
@ -73,5 +77,4 @@ public class CSVProgramme implements Serializable {
this.language = language;
}
//
}

View File

@ -0,0 +1,46 @@
package eu.dnetlib.dhp.actionmanager.project.utils.model;
import java.io.Serializable;
import com.opencsv.bean.CsvBindByName;
/**
* The model for the projects csv file
*/
public class CSVProject implements Serializable {
@CsvBindByName(column = "id")
private String id;
@CsvBindByName(column = "programme")
private String programme;
@CsvBindByName(column = "topics")
private String topics;
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getProgramme() {
return programme;
}
public void setProgramme(String programme) {
this.programme = programme;
}
public String getTopics() {
return topics;
}
public void setTopics(String topics) {
this.topics = topics;
}
}

View File

@ -1,5 +1,5 @@
package eu.dnetlib.dhp.actionmanager.project.utils;
package eu.dnetlib.dhp.actionmanager.project.utils.model;
import java.io.Serializable;

View File

@ -9,6 +9,7 @@ import static eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils.listKeyValues;
import static eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils.qualifier;
import static eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils.structuredProperty;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
@ -74,7 +75,7 @@ public class GenerateRorActionSetJob {
final String jsonConfiguration = IOUtils
.toString(
SparkAtomicActionJob.class
GenerateRorActionSetJob.class
.getResourceAsStream("/eu/dnetlib/dhp/actionmanager/ror/action_set_parameters.json"));
final ArgumentApplicationParser parser = new ArgumentApplicationParser(jsonConfiguration);
@ -108,7 +109,7 @@ public class GenerateRorActionSetJob {
private static void processRorOrganizations(final SparkSession spark,
final String inputPath,
final String outputPath) throws Exception {
final String outputPath) throws IOException {
readInputPath(spark, inputPath)
.map(
@ -203,7 +204,7 @@ public class GenerateRorActionSetJob {
private static Dataset<RorOrganization> readInputPath(
final SparkSession spark,
final String path) throws Exception {
final String path) throws IOException {
try (final FileSystem fileSystem = FileSystem.get(new Configuration());
final InputStream is = fileSystem.open(new Path(path))) {

View File

@ -7,6 +7,8 @@ import com.fasterxml.jackson.annotation.JsonProperty;
public class Address implements Serializable {
private static final long serialVersionUID = 2444635485253443195L;
@JsonProperty("lat")
private Float lat;
@ -37,8 +39,6 @@ public class Address implements Serializable {
@JsonProperty("line")
private String line;
private final static long serialVersionUID = 2444635485253443195L;
public Float getLat() {
return lat;
}

View File

@ -7,14 +7,14 @@ import com.fasterxml.jackson.annotation.JsonProperty;
public class Country implements Serializable {
private static final long serialVersionUID = 4357848706229493627L;
@JsonProperty("country_code")
private String countryCode;
@JsonProperty("country_name")
private String countryName;
private final static long serialVersionUID = 4357848706229493627L;
public String getCountryCode() {
return countryCode;
}

View File

@ -13,7 +13,7 @@ public class ExternalIdType implements Serializable {
private String preferred;
private final static long serialVersionUID = 2616688352998387611L;
private static final long serialVersionUID = 2616688352998387611L;
public ExternalIdType() {
}

View File

@ -15,8 +15,7 @@ import com.fasterxml.jackson.databind.JsonNode;
public class ExternalIdTypeDeserializer extends JsonDeserializer<ExternalIdType> {
@Override
public ExternalIdType deserialize(final JsonParser p, final DeserializationContext ctxt)
throws IOException, JsonProcessingException {
public ExternalIdType deserialize(final JsonParser p, final DeserializationContext ctxt) throws IOException {
final ObjectCodec oc = p.getCodec();
final JsonNode node = oc.readTree(p);

View File

@ -19,7 +19,7 @@ public class GeonamesAdmin implements Serializable {
@JsonProperty("code")
private String code;
private final static long serialVersionUID = 7294958526269195673L;
private static final long serialVersionUID = 7294958526269195673L;
public String getAsciiName() {
return asciiName;

View File

@ -31,7 +31,7 @@ public class GeonamesCity implements Serializable {
@JsonProperty("license")
private License license;
private final static long serialVersionUID = -8389480201526252955L;
private static final long serialVersionUID = -8389480201526252955L;
public NameAndCode getNutsLevel2() {
return nutsLevel2;

View File

@ -13,7 +13,7 @@ public class Label implements Serializable {
@JsonProperty("label")
private String label;
private final static long serialVersionUID = -6576156103297850809L;
private static final long serialVersionUID = -6576156103297850809L;
public String getIso639() {
return iso639;

View File

@ -13,7 +13,7 @@ public class License implements Serializable {
@JsonProperty("license")
private String license;
private final static long serialVersionUID = -194308261058176439L;
private static final long serialVersionUID = -194308261058176439L;
public String getAttribution() {
return attribution;

View File

@ -7,14 +7,14 @@ import com.fasterxml.jackson.annotation.JsonProperty;
public class NameAndCode implements Serializable {
private static final long serialVersionUID = 5459836979206140843L;
@JsonProperty("name")
private String name;
@JsonProperty("code")
private String code;
private final static long serialVersionUID = 5459836979206140843L;
public String getName() {
return name;
}

View File

@ -7,6 +7,8 @@ import com.fasterxml.jackson.annotation.JsonProperty;
public class Relationship implements Serializable {
private static final long serialVersionUID = 7847399503395576960L;
@JsonProperty("type")
private String type;
@ -16,8 +18,6 @@ public class Relationship implements Serializable {
@JsonProperty("label")
private String label;
private final static long serialVersionUID = 7847399503395576960L;
public String getType() {
return type;
}

View File

@ -11,6 +11,8 @@ import com.fasterxml.jackson.annotation.JsonProperty;
public class RorOrganization implements Serializable {
private static final long serialVersionUID = -2658312087616043225L;
@JsonProperty("ip_addresses")
private List<String> ipAddresses = new ArrayList<>();
@ -59,8 +61,6 @@ public class RorOrganization implements Serializable {
@JsonProperty("status")
private String status;
private final static long serialVersionUID = -2658312087616043225L;
public List<String> getIpAddresses() {
return ipAddresses;
}

View File

@ -0,0 +1,69 @@
package eu.dnetlib.dhp.actionmanager.scholix
import eu.dnetlib.dhp.application.ArgumentApplicationParser
import eu.dnetlib.dhp.schema.oaf.{Oaf, Relation, Result}
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.slf4j.{Logger, LoggerFactory}
import scala.io.Source
object SparkCreateActionset {
def main(args: Array[String]): Unit = {
val log: Logger = LoggerFactory.getLogger(getClass)
val conf: SparkConf = new SparkConf()
val parser = new ArgumentApplicationParser(Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/actionset/generate_actionset.json")).mkString)
parser.parseArgument(args)
val spark: SparkSession =
SparkSession
.builder()
.config(conf)
.appName(getClass.getSimpleName)
.master(parser.get("master")).getOrCreate()
val sourcePath = parser.get("sourcePath")
log.info(s"sourcePath -> $sourcePath")
val targetPath = parser.get("targetPath")
log.info(s"targetPath -> $targetPath")
val workingDirFolder = parser.get("workingDirFolder")
log.info(s"workingDirFolder -> $workingDirFolder")
implicit val oafEncoders: Encoder[Oaf] = Encoders.kryo[Oaf]
implicit val resultEncoders: Encoder[Result] = Encoders.kryo[Result]
implicit val relationEncoders: Encoder[Relation] = Encoders.kryo[Relation]
import spark.implicits._
val relation = spark.read.load(s"$sourcePath/relation").as[Relation]
relation.filter(r => (r.getDataInfo == null || r.getDataInfo.getDeletedbyinference == false) && !r.getRelClass.toLowerCase.contains("merge"))
.flatMap(r => List(r.getSource, r.getTarget)).distinct().write.mode(SaveMode.Overwrite).save(s"$workingDirFolder/id_relation")
val idRelation = spark.read.load(s"$workingDirFolder/id_relation").as[String]
log.info("extract source and target Identifier involved in relations")
log.info("save relation filtered")
relation.filter(r => (r.getDataInfo == null || r.getDataInfo.getDeletedbyinference == false) && !r.getRelClass.toLowerCase.contains("merge"))
.write.mode(SaveMode.Overwrite).save(s"$workingDirFolder/actionSetOaf")
log.info("saving entities")
val entities: Dataset[(String, Result)] = spark.read.load(s"$sourcePath/entities/*").as[Result].map(p => (p.getId, p))(Encoders.tuple(Encoders.STRING, resultEncoders))
entities
.joinWith(idRelation, entities("_1").equalTo(idRelation("value")))
.map(p => p._1._2)
.write.mode(SaveMode.Append).save(s"$workingDirFolder/actionSetOaf")
}
}

View File

@ -0,0 +1,86 @@
package eu.dnetlib.dhp.actionmanager.scholix
import com.fasterxml.jackson.databind.ObjectMapper
import eu.dnetlib.dhp.application.ArgumentApplicationParser
import eu.dnetlib.dhp.schema.action.AtomicAction
import eu.dnetlib.dhp.schema.oaf.{Oaf, Dataset => OafDataset,Publication, Software, OtherResearchProduct, Relation}
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapred.SequenceFileOutputFormat
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.slf4j.{Logger, LoggerFactory}
import scala.io.Source
object SparkSaveActionSet {
def toActionSet(item: Oaf): (String, String) = {
val mapper = new ObjectMapper()
item match {
case dataset: OafDataset =>
val a: AtomicAction[OafDataset] = new AtomicAction[OafDataset]
a.setClazz(classOf[OafDataset])
a.setPayload(dataset)
(dataset.getClass.getCanonicalName, mapper.writeValueAsString(a))
case publication: Publication =>
val a: AtomicAction[Publication] = new AtomicAction[Publication]
a.setClazz(classOf[Publication])
a.setPayload(publication)
(publication.getClass.getCanonicalName, mapper.writeValueAsString(a))
case software: Software =>
val a: AtomicAction[Software] = new AtomicAction[Software]
a.setClazz(classOf[Software])
a.setPayload(software)
(software.getClass.getCanonicalName, mapper.writeValueAsString(a))
case orp: OtherResearchProduct =>
val a: AtomicAction[OtherResearchProduct] = new AtomicAction[OtherResearchProduct]
a.setClazz(classOf[OtherResearchProduct])
a.setPayload(orp)
(orp.getClass.getCanonicalName, mapper.writeValueAsString(a))
case relation: Relation =>
val a: AtomicAction[Relation] = new AtomicAction[Relation]
a.setClazz(classOf[Relation])
a.setPayload(relation)
(relation.getClass.getCanonicalName, mapper.writeValueAsString(a))
case _ =>
null
}
}
def main(args: Array[String]): Unit = {
val log: Logger = LoggerFactory.getLogger(getClass)
val conf: SparkConf = new SparkConf()
val parser = new ArgumentApplicationParser(Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/actionset/save_actionset.json")).mkString)
parser.parseArgument(args)
val spark: SparkSession =
SparkSession
.builder()
.config(conf)
.appName(getClass.getSimpleName)
.master(parser.get("master")).getOrCreate()
val sourcePath = parser.get("sourcePath")
log.info(s"sourcePath -> $sourcePath")
val targetPath = parser.get("targetPath")
log.info(s"targetPath -> $targetPath")
implicit val oafEncoders: Encoder[Oaf] = Encoders.kryo[Oaf]
implicit val tEncoder: Encoder[(String, String)] = Encoders.tuple(Encoders.STRING, Encoders.STRING)
spark.read.load(sourcePath).as[Oaf]
.map(o => toActionSet(o))
.filter(o => o != null)
.rdd.map(s => (new Text(s._1), new Text(s._2))).saveAsHadoopFile(s"$targetPath", classOf[Text], classOf[Text], classOf[SequenceFileOutputFormat[Text, Text]], classOf[GzipCodec])
}
}

Some files were not shown because too many files have changed in this diff