Compare commits


460 Commits

Author SHA1 Message Date
Claudio Atzori 242d647146 cleanup & docs 2023-10-12 12:23:44 +02:00
Claudio Atzori af3ffad6c4 [AMF] docs 2023-10-12 10:07:52 +02:00
Claudio Atzori ba5475ed4c Merge pull request 'Fix cleaning of Pmid where parsing of numbers stopped at first non-leading 0 (zero) character' (#345) from fix_truncated_pmid into master
Reviewed-on: D-Net/dnet-hadoop#345
2023-10-06 14:19:49 +02:00
Giambattista Bloisi 2c235e82ad Fix cleaning of Pmid where parsing of numbers stopped at first non-leading '0' character 2023-10-06 12:35:54 +02:00
Claudio Atzori 4ac06c9e37 Merge pull request 'Fix bug in conversion from dedup json model to Spark Dataset of Rows (instanceTypeMatch no longer working)' (#339) from fix_dedupfailsonmatchinginstances into master
Reviewed-on: D-Net/dnet-hadoop#339
2023-10-02 11:34:20 +02:00
Claudio Atzori fa692b3629 Merge branch 'master' into fix_dedupfailsonmatchinginstances 2023-10-02 11:28:16 +02:00
Claudio Atzori ef02648399 Merge pull request 'fixed dedup configuration management in the Broker workflow' (#341) from fix_8997 into master
Reviewed-on: D-Net/dnet-hadoop#341
2023-10-02 11:03:50 +02:00
Claudio Atzori d13bb534f0 Merge branch 'master' into fix_8997 2023-10-02 11:03:18 +02:00
Giambattista Bloisi 775c3f704a Fix bug in conversion from dedup json model to Spark Dataset of Rows: list of strings contained the json escaped representation of the value instead of the plain value, this caused instanceTypeMatch failures because of the leading and trailing double quotes 2023-09-27 22:30:47 +02:00
Sandro La Bruzzo 9c3ab11d5b Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2023-09-25 15:29:19 +02:00
Sandro La Bruzzo 423ef30676 minor fix on the aggregation of uniprot and pdb 2023-09-25 15:28:58 +02:00
Giambattista Bloisi 7152d47f84 Use asScala to convert java List to Scala Sequence 2023-09-20 16:14:27 +02:00
Claudio Atzori 4853c19b5e code formatting 2023-09-20 15:53:21 +02:00
Giambattista Bloisi 1f226d1dce Fix defect #8997: GenerateEventsJob is generating huge amounts of logs because broker entity similarity calculation consistently failed 2023-09-20 15:42:00 +02:00
Alessia Bardi 6186cdc2cc Use v5 of the UNIBI Gold ISSN list in test 2023-09-19 14:47:01 +02:00
Alessia Bardi d94b9bebf7 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2023-09-19 13:38:45 +02:00
Alessia Bardi 19abba8fa7 tests for d4science catalog 2023-09-19 13:38:25 +02:00
Claudio Atzori c2f179800c Merge pull request 'Run CC and RAM sequentially in dhp-impact-indicators WF' (#338) from run_cc_and_ram_sequentially into master
Reviewed-on: D-Net/dnet-hadoop#338
2023-09-13 08:52:53 +02:00
Serafeim Chatzopoulos 2aed5a74be Run CC and RAM sequentially in dhp-impact-indicators WF 2023-09-12 22:31:50 +03:00
Claudio Atzori 4dc4862011 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2023-09-12 14:34:34 +02:00
Claudio Atzori dc80ab14d3 [graph dedup] consistency wf should not remove the relations while dispatching the entities 2023-09-12 14:34:28 +02:00
Alessia Bardi 77a2199837 updated test for EOSC community 2023-09-08 11:05:49 +02:00
Claudio Atzori 265180bfd2 added Archive ouverte UNIGE (ETHZ.UNIGENF, opendoar____::1400) to the Datacite hostedBy_map 2023-09-07 11:20:35 +02:00
Claudio Atzori da0e9828f7 resolved conflicts for PR#337 2023-09-06 11:28:46 +02:00
Claudio Atzori 9f5d16624c Merge pull request '[graph raw] datainfo.invisible set as true only for entities' (#336) from invisible_relations into beta
Reviewed-on: D-Net/dnet-hadoop#336
2023-09-04 16:14:47 +02:00
Claudio Atzori adec6692ca Merge branch 'beta' into invisible_relations 2023-09-04 16:13:06 +02:00
Claudio Atzori 15666e86a8 added collectedfrom to the affiliation relations imported from Crossref 2023-09-04 15:56:06 +02:00
Claudio Atzori 7d6bd4f20b Merge pull request 'Fix import of affiliations relations from Crossref' (#335) from 8876_fix_crossref_affiliation_relations_import into beta
Reviewed-on: D-Net/dnet-hadoop#335
2023-09-04 15:19:58 +02:00
Claudio Atzori 5b06c9d06f [graph raw] datainfo.invisible set as true only for entities 2023-09-04 15:15:24 +02:00
Serafeim Chatzopoulos 7de0164c26 Fix import of affiliations relations from Crossref 2023-09-04 16:04:41 +03:00
Claudio Atzori 488d9a1cea Merge pull request 'Add sparkExecutorMemoryOverhead workflow config to set off-heap memory for Spark actions. If not explicitly set it is defaulted to 1Gb' (#331) from consistencywf_memoryoverhead_conf into beta
Reviewed-on: D-Net/dnet-hadoop#331
2023-08-29 16:31:36 +02:00
Giambattista Bloisi 6b1c05d118 Add sparkExecutorMemoryOverhead workflow config to set off-heap memory for Spark actions. If not explicitly set it is defaulted to 1Gb 2023-08-29 16:04:19 +02:00
Claudio Atzori bf35280ea6 code formatting 2023-08-29 11:11:00 +02:00
Claudio Atzori 0515d81c7c Merge pull request 'Rewrite SparkPropagateRelation exploiting Dataframe API' (#330) from propagate_relation_rewrite into beta
Reviewed-on: D-Net/dnet-hadoop#330
2023-08-29 10:47:14 +02:00
Claudio Atzori 58665a246c Merge branch 'beta' into propagate_relation_rewrite 2023-08-29 10:47:02 +02:00
Claudio Atzori f437be80ad [impact indicators] adjusted paths in the bip ranker wf parameters 2023-08-29 09:03:03 +02:00
Giambattista Bloisi d012aec0b3 Revert PropagateRelation's argument name from outputPath to graphOutputPath in consistency workflow (#8964) 2023-08-28 22:44:54 +02:00
Giambattista Bloisi a860e19423 Fix: ensure all relations are written out, not only those managed by dedup 2023-08-28 15:36:02 +02:00
Giambattista Bloisi 0d7b2bf83d Rewrite SparkPropagateRelation exploiting Dataframe API 2023-08-28 10:34:54 +02:00
Miriam Baglioni 9c8b41475a Merge pull request '8172_impact_indicators_workflow' (#284) from 8172_impact_indicators_workflow into beta
Reviewed-on: D-Net/dnet-hadoop#284
2023-08-14 15:50:48 +02:00
Serafeim Chatzopoulos 97c1ba8918 Merge actionsets of results and projects 2023-08-11 15:56:53 +03:00
Miriam Baglioni 35b8deb2c6 Merge pull request 'DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag' (#329) from dispatch_filter_invisible_entities into beta
Reviewed-on: D-Net/dnet-hadoop#329
2023-08-10 12:56:18 +02:00
Giambattista Bloisi 95cd2b9b1e Make filterInvisible a mandatory parameter of DispatchEntitiesSparkJob
Make filterInvisible a mandatory parameter of both dedup/consistency and graph/group oozie workflows
2023-08-10 11:53:48 +02:00
Giambattista Bloisi fab9920271 DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag 2023-08-09 15:41:43 +02:00
Miriam Baglioni c25ac21e5e Merge pull request 'graph cleaning, suggestions from ticket 8898' (#325) from cleaning_8898 into beta
Reviewed-on: D-Net/dnet-hadoop#325
2023-08-08 11:14:19 +02:00
Miriam Baglioni c334fe2438 Merge pull request 'Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities' (#328) from cleanup_relations_after_dedup into beta
Reviewed-on: D-Net/dnet-hadoop#328
2023-08-08 09:49:12 +02:00
Miriam Baglioni 0e2f855807 Merge pull request 'Updates Promotion DBs' (#321) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#321
2023-08-07 12:09:16 +02:00
Miriam Baglioni 18fbe52b20 Merge pull request 'Import affiliation relations from Crossref' (#320) from 8876 into beta
Reviewed-on: D-Net/dnet-hadoop#320
2023-08-07 10:45:30 +02:00
Giambattista Bloisi 97b6d1dc45 Filter ids by dataInfo.deletedbyinference and DataInfo.invisible flags
Filter relations also by dataInfo.invisible flag
2023-08-07 10:24:11 +02:00
Giambattista Bloisi af49424b59 Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities 2023-08-04 14:27:39 +02:00
Claudio Atzori 0bc74e2000 code formatting 2023-08-02 11:52:10 +02:00
Claudio Atzori 7180911ded [graph cleaning] fixed regex behaviour for cleaning ROR and GRID identifiers, added tests 2023-08-02 11:44:14 +02:00
Claudio Atzori b9dddbfe54 rule out records with NULL dataInfo, except for Relations 2023-07-31 17:53:54 +02:00
Claudio Atzori da1727f93f rule out records with NULL dataInfo, except for Relations 2023-07-31 17:52:56 +02:00
Claudio Atzori 11ffb9bd68 rule out records with NULL dataInfo 2023-07-31 12:35:33 +02:00
Claudio Atzori ccac6a7f75 rule out records with NULL dataInfo 2023-07-31 12:35:05 +02:00
Serafeim Chatzopoulos 7cefe2665b Remove unnecessary classes 2023-07-28 19:14:39 +03:00
Serafeim Chatzopoulos 26a92ce762 Merge branch '8876' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8876 2023-07-28 19:03:57 +03:00
Serafeim Chatzopoulos ebfba38ab6 Add changes from code review 2023-07-28 19:03:47 +03:00
Serafeim Chatzopoulos eb8684a8cf Merge branch 'beta' into 8876 2023-07-28 13:39:33 +02:00
Claudio Atzori 1275a07d45 Merge pull request '[graph indexing] expand the instance level fulltext in the XML records' (#326) from instance_fulltext_xml into beta
Reviewed-on: D-Net/dnet-hadoop#326
2023-07-27 15:02:07 +02:00
Claudio Atzori a72b9e96ac expand the instance level fulltext in the XML records 2023-07-27 14:57:38 +02:00
Claudio Atzori d512df8612 code formatting 2023-07-26 09:14:08 +02:00
Claudio Atzori d8435a6512 inverted condition 2023-07-25 17:39:57 +02:00
Claudio Atzori 59764145bb cherry picked & fixed commit 270df939c4 2023-07-25 17:39:00 +02:00
Claudio Atzori 270df939c4 partial implementation of the suggestions from https://support.openaire.eu/issues/8898 2023-07-25 17:29:50 +02:00
Claudio Atzori 8c63e4a864 Merge pull request 'Refactor Dedup using Spark Dataframe API, initial support for scala 2.12 and Spark 3.4' (#324) from dedup-with-dataframe-2 into beta
Reviewed-on: D-Net/dnet-hadoop#324
2023-07-25 10:17:17 +02:00
Giambattista Bloisi e64c2854a3 Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)
2023-07-24 15:36:24 +02:00
Giambattista Bloisi bb5b845e3c Use scala.binary.version property to resolve scala maven dependencies
Ensure consistent usage of maven properties
Profile for compiling with scala 2.12 and Spark 3.4
2023-07-24 11:13:48 +02:00
Claudio Atzori 002b24e06f Merge pull request '[graph cleaning] fixed regex behaviour for cleaning ROR and GRID identifiers, added tests' (#315) from pid_cleaning into beta
Reviewed-on: D-Net/dnet-hadoop#315
2023-07-24 10:49:44 +02:00
Claudio Atzori c754397a19 Merge branch 'beta' into pid_cleaning 2023-07-24 10:49:31 +02:00
Claudio Atzori f0678cda09 Merge pull request 'fix_beta_tests' (#323) from fix_beta_tests into beta
Reviewed-on: D-Net/dnet-hadoop#323
2023-07-24 10:47:35 +02:00
Serafeim Chatzopoulos 3a0f09774a Add script to find score limits 2023-07-21 17:55:41 +03:00
Ilias Kanellos 06b9b71c4e Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-07-21 17:42:49 +03:00
Ilias Kanellos 2374f445a9 Produce additional bip update specific files 2023-07-21 17:42:46 +03:00
Serafeim Chatzopoulos cb0f3c50f6 Format workflow.xml 2023-07-21 16:07:10 +03:00
Serafeim Chatzopoulos c64e5e588f Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-07-21 15:27:02 +03:00
Serafeim Chatzopoulos 2cc5b1a39b Fixes in workflow.xml 2023-07-21 15:26:50 +03:00
Ilias Kanellos 0f96af5d56 Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-07-21 13:42:35 +03:00
Ilias Kanellos 03da965162 Format bip-score based file without doi references 2023-07-21 13:42:30 +03:00
Giambattista Bloisi f03153823a Update testCitationRelations number of expected citations according to changes made in 0559d8b4 (monodirectional citations) 2023-07-21 10:48:28 +02:00
Giambattista Bloisi 54c1eacef1 SparkJobTest was failing because the testing workingdir was not cleaned up after each test 2023-07-21 10:42:24 +02:00
Giambattista Bloisi 5e15f20e6e Fix entityMerger that was excluding the authors of the first entity in the list to merge 2023-07-21 00:46:54 +02:00
Giambattista Bloisi 0210a14e43 Ignore timestamp differences in PromoteActionPayloadForGraphTableJobTest 2023-07-20 23:45:57 +02:00
Giambattista Bloisi dba34505de Fix SparkStatsTest bug where parquet tables were incorrectly read as text files leading to unpredictable count() values 2023-07-19 14:24:52 +02:00
Giambattista Bloisi e47ed1fdb2 Use DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES in json mapper to avoid that tests fail if they encounter unmapped properties 2023-07-19 14:21:40 +02:00
Giambattista Bloisi 38dfebfbe6 Disable MdStoreClientTest test as it requires a local mongodb running and it does not perform any assertions 2023-07-19 14:18:56 +02:00
Claudio Atzori 373a5f2c83 Merge pull request 'Master branch updates from beta July 2023' (#317) from master_july23 into master
Reviewed-on: D-Net/dnet-hadoop#317
2023-07-18 18:22:04 +02:00
Serafeim Chatzopoulos db4ca43ee8 Resolve conflict 2023-07-18 18:38:26 +03:00
Serafeim Chatzopoulos be320ba3c1 Indentation fixes 2023-07-17 16:04:21 +03:00
Serafeim Chatzopoulos bc1a4611aa Minor changes 2023-07-17 11:17:53 +03:00
Claudio Atzori 8af129b0c7 merged stats promotion step from antonis/promotion-prod-only 2023-07-13 15:03:28 +02:00
dimitrispie 706092bc19 Update updateProductionViews.sh 2023-07-13 15:48:12 +03:00
dimitrispie aedd279f78 Updates Promotion DBs
- Add a step for promoting the split monitor DBs
2023-07-13 15:35:46 +03:00
dimitrispie 76901a25f9 Updates Promotion DBs
- Add a step for promoting the split monitor DBs
2023-07-12 22:49:08 +03:00
Giambattista Bloisi ef493681d9 Merge pull request 'Import dnet-pace-core module in this project and use it after renaming to dhp-pace-core' (#319) from beta_with_pace_core into beta
Reviewed-on: D-Net/dnet-hadoop#319
2023-07-11 14:03:15 +02:00
Serafeim Chatzopoulos 4eba14a80e Add oozie workflow 2023-07-06 21:07:50 +03:00
Serafeim Chatzopoulos c2998a14e8 Add basic tests for affiliation relations 2023-07-06 20:28:16 +03:00
Serafeim Chatzopoulos bc7b00bcd1 Add bi-directional affiliation relations 2023-07-06 18:29:15 +03:00
Serafeim Chatzopoulos 12528ed2ef Refactor PrepareAffiliationRelations.java to use OafMapperUtils common functions 2023-07-06 18:08:33 +03:00
Serafeim Chatzopoulos bbc245696e Prepare actionsets for BIP affiliations 2023-07-06 15:56:12 +03:00
Ilias Kanellos 0c433eccdd Fix scores & Workflow 2023-07-06 15:06:28 +03:00
Ilias Kanellos d5c39a1059 Fix map scores to doi 2023-07-06 15:04:48 +03:00
Ilias Kanellos 772d5f0aab Make PR and AttRank serial 2023-07-06 13:47:51 +03:00
Giambattista Bloisi 801da2fd4a New sources formatted by maven plugin 2023-07-06 10:28:53 +02:00
Giambattista Bloisi bd3fcf869a rename dnet-pace-core into dhp-pace-core module and use it as dependency in other modules 2023-07-06 10:02:23 +02:00
Serafeim Chatzopoulos 347a889b20 Read affiliation relations 2023-07-06 00:51:01 +03:00
Giambattista Bloisi 3b35db5fbd Import dnet-pace-core module from dnet-dedup repository 2023-07-05 22:23:06 +02:00
Miriam Baglioni 8dcd028eed [UsageCount] fixed typo in attribute name for datasource table 2023-07-01 16:07:22 +02:00
Miriam Baglioni 7738372125 [UsageCount] fixed typo in attribute name for datasource table 2023-06-30 18:56:41 +02:00
Claudio Atzori f3a85e224b merged from branch beta the bulk tagging (single step, negative constraints), the cleaning workflow (single step, pid type based cleaning), instance level fulltext 2023-06-28 13:33:57 +02:00
Claudio Atzori 4ef0f2ec26 added dependency commons-validator:commons-validator:1.7 2023-06-28 13:32:01 +02:00
Claudio Atzori 288ec0b7d6 [doiboost] merged workflow from branch beta 2023-06-28 09:15:37 +02:00
Claudio Atzori 5f32edd9bf adopting dhp-schema:3.17.1 2023-06-27 16:57:17 +02:00
Claudio Atzori e10ce92fe5 [stats wf] merged workflows from branch beta 2023-06-27 14:32:48 +02:00
Claudio Atzori b93e1541aa Merge pull request 'update sql query to return distinct pids' (#301) from distinct_pids_from_openorgs into master
Reviewed-on: D-Net/dnet-hadoop#301
2023-06-27 12:24:47 +02:00
Claudio Atzori d029bf0b94 Merge branch 'master' into distinct_pids_from_openorgs 2023-06-27 12:24:35 +02:00
Claudio Atzori 0f5a819f44 [graph cleaning] fixed regex behaviour for cleaning ROR and GRID identifiers, added tests 2023-06-23 16:10:49 +02:00
Serafeim Chatzopoulos 60f25b780d Minor fixes in workflow.xml and job.properties 2023-06-23 12:51:50 +03:00
Michele Artini 009d7f312f fixed a datasource Id 2023-06-21 16:17:34 +02:00
Giambattista Bloisi 758e662ab8 Revert "Remove duplicated code and ensure that load and initialization is done through "DedupConfig.load" method"
This reverts commit 485f9d18cb.
2023-06-19 13:08:10 +02:00
Giambattista Bloisi 485f9d18cb Remove duplicated code and ensure that load and initialization is done through "DedupConfig.load" method 2023-06-19 13:00:02 +02:00
Claudio Atzori 6210f6ee48 Merge pull request 'Precompile blacklists patterns before evaluating clustering criteria' (#1) from optimized-clustering into master
Reviewed-on: D-Net/dnet-dedup#1
2023-06-19 12:43:49 +02:00
Giambattista Bloisi b0ade43608 Precompile blacklists patterns before evaluating clustering criteria
Enable Junit 5 tests in maven builds
Make path comparisons platform-independent
Read String resource files assuming they are encoded in UTF-8
Fix a few test conditions
2023-06-16 09:41:11 +02:00
Michele Artini a92206dab5 re-added the name of a column (pid) 2023-06-13 11:43:10 +02:00
Alessia Bardi 118e72d7db Updated officialname of pangaea in hostedbymap for Datacite to avoid duplicate entries in the source filter of the portal 2023-06-06 14:39:12 +02:00
Alessia Bardi 5befd93d7d test records for Solr indexing 2023-06-06 14:34:33 +02:00
Michele Artini cae92cf811 update sql query to return distinct pids 2023-06-06 14:06:06 +02:00
Ilias Kanellos a1b9187039 Fix syntax error on workflow.xml 2023-05-23 17:17:12 +03:00
Ilias Kanellos 6a7e370a21 Remove unnecessary counts in graph creation 2023-05-23 16:48:58 +03:00
Ilias Kanellos ec4e010687 End after rankings | Create graph debugged 2023-05-23 16:44:04 +03:00
Claudio Atzori 654ffcba60 Merge pull request '[UsageCount] addition of usagecount for Projects and datasources' (#296) from master_datasource_project_usagecounts into master
Reviewed-on: D-Net/dnet-hadoop#296
2023-05-22 16:13:24 +02:00
Claudio Atzori db625e548d [UsageCount] addition of usagecount for Projects and datasources 2023-05-22 15:00:46 +02:00
Alessia Bardi 04141fe259 tests for records from D4Science catalogues 2023-05-19 14:28:24 +02:00
Ilias Kanellos 38020e242a Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-05-16 17:34:53 +03:00
Ilias Kanellos 3d69f33c84 Fix selection of columns in graph creation 2023-05-16 17:34:42 +03:00
Ilias Kanellos 3c38f7ba6f Fix selection of columns in graph creation 2023-05-16 17:32:53 +03:00
Serafeim Chatzopoulos 8ef718c363 Fix workflow application path 2023-05-16 16:28:48 +03:00
Serafeim Chatzopoulos 26328e2a0d Move job.properties 2023-05-16 14:39:53 +03:00
Serafeim Chatzopoulos 4eec3e7052 Add jobTracker, nameNode && spark2Lib as global params in oozie wf 2023-05-15 22:28:48 +03:00
Serafeim Chatzopoulos b83135c252 Add missing kill nodes in workflow.xml 2023-05-15 19:55:35 +03:00
Serafeim Chatzopoulos 45f2aa0867 Move end node ... at the end in workflow.xml 2023-05-15 17:52:20 +03:00
Serafeim Chatzopoulos 12a57e1f58 Resolve conflicts 2023-05-15 16:20:11 +03:00
Serafeim Chatzopoulos 82e2a96f51 Resolve conflicts 2023-05-15 15:53:12 +03:00
Serafeim Chatzopoulos b8e8c959fe Update workflow.xml && job.properties 2023-05-15 15:50:23 +03:00
Ilias Kanellos 4a905932a3 Spark properties from job.properties 2023-05-15 15:24:22 +03:00
Serafeim Chatzopoulos 07818131ef Update documentation 2023-05-15 13:04:44 +03:00
Ilias Kanellos 1788ac2d4d Correct filtering for MAG records 2023-05-12 12:55:43 +03:00
Ilias Kanellos 5ddbb4ad10 Spark properties no longer hardcoded 2023-05-11 15:36:47 +03:00
Ilias Kanellos 3de35fd6a3 Produce 5 classes of ranking scores 2023-05-11 14:42:25 +03:00
Ilias Kanellos 90332439ad Remove deletion of synonym folder 2023-04-28 13:45:19 +03:00
Ilias Kanellos a98da54896 Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-04-28 13:23:49 +03:00
Ilias Kanellos 09485fbee3 Fixed unicode bug. Workflow ends after first script 2023-04-28 13:09:13 +03:00
Serafeim Chatzopoulos 614cc1089b Add separate folder for results && project actionsets 2023-04-27 12:37:15 +03:00
Serafeim Chatzopoulos 815a4ddbba Add actionset creation for project bip indicators in workflow 2023-04-26 20:40:06 +03:00
Serafeim Chatzopoulos ee04cf92bf Add actionsets for project impact indicators 2023-04-26 20:23:46 +03:00
Alessia Bardi b88f009d9f combined level 4 and 6 for the demo 2023-04-24 12:10:33 +02:00
Alessia Bardi 5ffe82ffd8 aligned to current DMF index layout on production 2023-04-24 12:09:55 +02:00
Alessia Bardi 1c173642f0 removed level5 from test records 2023-04-24 09:32:32 +02:00
Alessia Bardi 382f46a8e4 tests to generate the XML records for the index for the EDITH demo on digital twins, integrating output from the FoS classifier 2023-04-21 16:46:30 +02:00
Serafeim Chatzopoulos 23f58a86f1 Change jar param in project impact indicators action 2023-04-18 12:26:01 +03:00
Miriam Baglioni 24c41806ac [ZenodoApiClientTest] change test to mirror change in the implementation 2023-04-18 09:08:09 +02:00
Miriam Baglioni 087b5a7973 [ZenodoAPIClient] new version of the API to connect to Zenodo (changed the http client) 2023-04-17 18:59:22 +02:00
Michele De Bonis cb595c87bb implementation of the support for authors deduplication: cosinesimilarity comparator and double array json parser 2023-04-17 11:06:27 +02:00
Claudio Atzori 688e3b7936 added eoscifguidelines in the result view; removed compute statistics statements 2023-04-11 11:45:56 +02:00
Claudio Atzori 2e465915b4 [graph to Solr] using dedicated sparkExecutorCores, sparkExecutorMemory, sparkDriverMemory in convert_to_xml 2023-04-11 10:43:44 +02:00
Serafeim Chatzopoulos 7256c8d3c7 Add script for aggregating impact indicators at the project level 2023-04-07 16:30:12 +03:00
Claudio Atzori 4a4ca634f0 Merge pull request 'advConstraintsInBeta' (#288) from advConstraintsInBeta into master
Reviewed-on: D-Net/dnet-hadoop#288
2023-04-06 15:24:23 +02:00
Miriam Baglioni c6a7602b3e refactoring after compilation 2023-04-06 14:45:01 +02:00
Miriam Baglioni 831055a1fc change of the property for test purposes, addition of two new verbs, and fix of issue for advanced constraints 2023-04-06 14:41:32 +02:00
Miriam Baglioni cf3d0f4f83 fixed issue on bulktagging for the advanced constraints 2023-04-06 12:17:35 +02:00
Claudio Atzori 4f67225fbc Merge pull request 'doiboostMappingExtention' (#286) from doiboostMappingExtention into master
Reviewed-on: D-Net/dnet-hadoop#286
2023-04-06 09:25:08 +02:00
Claudio Atzori e093f04874 Merge pull request 'AdvancedConstraint' (#285) from advConstraintsInBeta into master
Reviewed-on: D-Net/dnet-hadoop#285
2023-04-06 09:24:54 +02:00
Miriam Baglioni c5a9f39141 Extended the association project - result in the mapping from CrossRef 2023-04-05 16:48:36 +02:00
Miriam Baglioni ecc05fe0f3 Added the code for the advancedConstraint implementation during the bulkTagging 2023-04-05 16:40:29 +02:00
Claudio Atzori 42442ccd39 Merge pull request 'updated the order of the compatibilities' (#275) from compatibility_order into master
Reviewed-on: D-Net/dnet-hadoop#275
2023-04-05 12:44:14 +02:00
Michele De Bonis 297eb207a5 minor change in the author match which now can compute count and percentage 2023-04-04 17:10:37 +02:00
Miriam Baglioni 9a9cc6a1dd changed the way the tar archive is built to support renaming in case we need to change .tt.gz into .json.gz 2023-04-04 11:40:58 +02:00
Serafeim Chatzopoulos 102aa5ab81 Add dependency to dhp-aggregation 2023-03-21 19:25:29 +02:00
Serafeim Chatzopoulos f3e5abf63b Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-03-21 18:26:09 +02:00
Serafeim Chatzopoulos 3e8a4cf952 Rearrange resources folder structure 2023-03-21 18:25:55 +02:00
Serafeim Chatzopoulos f992ecb657 Checkout BIP-Ranker during 'prepare-package' && add it in the oozie-package.tar.gz 2023-03-21 18:03:55 +02:00
Ilias Kanellos 9dc8f0f05f Add ActionSet step 2023-03-21 16:14:15 +02:00
Ilias Kanellos b5c252865c Add filtering based on citation source 2023-03-20 15:38:36 +02:00
Serafeim Chatzopoulos 720fd19b39 Add dhp-impact-indicators workflow files 2023-03-14 19:28:27 +02:00
Serafeim Chatzopoulos c6e39b7f33 Add dhp-impact-indicators 2023-03-14 18:50:54 +02:00
Michele Artini 200098b683 updated the order of the compatibilities 2023-02-22 11:52:59 +01:00
Michele Artini 9c1df15071 null values in date range conditions 2023-02-13 16:05:58 +01:00
Miriam Baglioni 32870339f5 refactoring after compile 2023-02-13 13:06:48 +01:00
Miriam Baglioni 7184cc0804 [FoS] added check for null on level1 subject 2023-02-13 13:03:49 +01:00
Miriam Baglioni 7473093c84 [FoS] changed the default separator from comma to tab to solve the issue in subject value split 2023-02-10 15:34:52 +01:00
Miriam Baglioni 5f0906be60 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2023-02-02 17:13:14 +01:00
Michele De Bonis 6a6c266dde implementation of author dedup configuration and lnfi clustering function 2023-01-31 11:53:10 +01:00
Claudio Atzori 1b37516578 [bulk tagging] better node naming 2023-01-20 16:11:26 +01:00
Claudio Atzori c1e2460293 [cleaning] the datasource master-duplicate fixup should not be brought to production yet 2023-01-20 09:20:26 +01:00
Claudio Atzori 3800361033 [country propagation] fixes error 'cannot resolve countrySet given input columns: []' when there is no prepared information driving the propagation process for a given result type 2023-01-19 15:57:43 +01:00
Michele Artini 699736addc NPE prevention 2023-01-11 13:14:44 +01:00
Claudio Atzori f86e19b282 code formatting 2023-01-11 09:53:19 +01:00
Michele Artini d40e20f437 Considering instance pids and alternative identifiers 2023-01-11 09:37:34 +01:00
Michele Artini 4953ae5649 fixed an invalid char 2023-01-11 08:35:53 +01:00
Miriam Baglioni c60d3a2b46 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2023-01-09 17:28:27 +01:00
Claudio Atzori 7becdaf31d Merge pull request 'Workaround to use new version of intellij on Master' (#266) from master_intellij into master
Reviewed-on: D-Net/dnet-hadoop#266
2022-12-23 10:32:21 +01:00
Miriam Baglioni b713132db7 [Cleaning] adding missing classes 2022-12-21 12:49:08 +01:00
Miriam Baglioni 11f2b470d3 [Cleaning] adding missing classes 2022-12-21 12:42:19 +01:00
Sandro La Bruzzo 91c70b15a5 updated lines function to its implementation linesWithSeparators.map(l => l.stripLineEnd); in this way we force the scala plugin compiler to consider this pipeline scala code and not the java.string.lines() pipeline 2022-12-21 11:14:42 +01:00
Claudio Atzori f910b7379d [cleaning] recovering missing resources from D-Net/dnet-hadoop#265 2022-12-21 09:26:34 +01:00
Claudio Atzori 33bdad104e [cleaning] align parameter names 2022-12-20 21:43:59 +01:00
Claudio Atzori 5816ded93f code formatting 2022-12-20 10:41:40 +01:00
Claudio Atzori 46972f8393 [orcid propagation] skip empty directory 2022-12-20 10:28:22 +01:00
Claudio Atzori da85ca697d Merge pull request 'cleanCountryOnMaster' (#265) from cleanCountryOnMaster into master
Reviewed-on: D-Net/dnet-hadoop#265
2022-12-16 15:58:44 +01:00
Miriam Baglioni 059e100ec7 [Clean Country] moving other resources for testing purposes 2022-12-16 15:48:21 +01:00
Miriam Baglioni fc95a550c3 [Clean Country] moving other resources for testing purposes 2022-12-16 15:46:32 +01:00
Miriam Baglioni 6901ac91b1 [Clean Country] moving source and resources to master 2022-12-16 15:42:49 +01:00
Claudio Atzori 08c4588d47 Merge pull request 'Changes from beta stats wf to prod' (#264) from antonis.lempesis/dnet-hadoop:beta into master
Reviewed-on: D-Net/dnet-hadoop#264
2022-12-07 15:56:22 +01:00
Miriam Baglioni 29d3da85f1 [EOSC DUMP] added resources needed for the review as test 2022-11-25 17:16:20 +01:00
Miriam Baglioni 33a2b1b5dc [Bulk Tag] fixed typo in test configuration 2022-11-23 11:31:17 +01:00
Miriam Baglioni c6df8327b3 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2022-11-23 11:26:57 +01:00
Miriam Baglioni 935aa367d8 [BulkTag] removed commented code 2022-11-23 11:16:39 +01:00
Miriam Baglioni 43aedbdfe5 [BulkTag] changed verb name in configuration 2022-11-23 11:14:23 +01:00
Miriam Baglioni b6da9b67ff [BulkTag] fixed typo in annotation for verb name 2022-11-23 11:13:58 +01:00
Michele De Bonis 14f6346676 implementation of the new software configuration 2022-11-22 17:48:34 +01:00
Claudio Atzori a34c8b6f81 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2022-11-22 10:22:31 +01:00
Miriam Baglioni 122e75aa17 fixed conflicts 2022-11-21 18:13:12 +01:00
Miriam Baglioni cee7a45b1d [Bulk Tag Datasource] fixed issue with verb name and add new test for neanias selection for orcid 2022-11-21 18:10:20 +01:00
Michele De Bonis 9fee2ed611 minor changes 2022-11-21 14:35:46 +01:00
Claudio Atzori ed64618235 increased spark.sql.shuffle.partitions in the last join phase of the result (publication) to community through semantic relation propagation 2022-11-18 16:06:51 +01:00
Claudio Atzori 8742934843 added spark.sql.shuffle.partitions in the last join phase of the result to community through semantic relation propagation 2022-11-18 11:32:22 +01:00
Claudio Atzori 13cc592f39 code formatting 2022-11-15 09:37:57 +01:00
Claudio Atzori af15b1e48d [eosc tag] extending criteria for Jupyter Notebook (adding to ORP the same constraint) 2022-11-14 18:30:43 +01:00
Claudio Atzori eb45ba7af0 extended mapping from ODF relations (PR#251) 2022-11-14 18:26:13 +01:00
Claudio Atzori a929dc5fee integrated changes for mapping ROHub contents in the Graph 2022-11-14 18:15:35 +01:00
Miriam Baglioni 5f9383b2d9 [EOSC TAG] remove redundant check for jupyter notebook 2022-11-11 14:06:19 +01:00
Miriam Baglioni b18bbca8af [EOSC TAG] adding search in orp for jupyter notebook criteria 2022-11-11 12:42:58 +01:00
dimitrispie 55fa3b2a17 Hive memory parameters 2022-11-03 15:21:04 +01:00
Claudio Atzori 80c5e0f637 code formatting 2022-09-27 12:51:51 +02:00
Claudio Atzori c01d528ab2 suppressing hyper verbose spark logs during unit test execution 2022-09-23 15:19:50 +02:00
Claudio Atzori e6d788d27a [stats wf] adding missing changes lost in PR#248 2022-09-23 14:38:42 +02:00
Claudio Atzori 930f118673 fixed semantic (subreltype) for ServiceOrganization relations 2022-09-22 16:24:44 +02:00
Claudio Atzori b2c3071e72 Merge branch 'master' into beta2master_sept_2022 2022-09-22 14:39:15 +02:00
Claudio Atzori 10ec074f79 Merge remote-tracking branch 'antonis.lempesis/beta' into beta2master_sept_2022 2022-09-22 14:12:19 +02:00
Claudio Atzori 7225fe9cbe integrated changes from discard-non-wellformed 2022-09-22 10:06:07 +02:00
Miriam Baglioni 869e129288 [EOSC BulkTag] refactoring 2022-09-20 16:13:18 +02:00
Miriam Baglioni 840465958b [EOSC BulkTag] filtering out the datasources registered in the eosc with compatibility different from 3.0, 4.0 for literature, data and CRIS to add the context eosc to the results 2022-09-20 10:30:41 +02:00
Claudio Atzori bdc8f993d0 [Patch Hosted By] check also the presence of datasource.officialname.value 2022-09-19 15:28:03 +02:00
Miriam Baglioni ec87149cb3 [Patch Hosted By] added fix to avoid NPE error when datasource official name is not provided. Removing datasources if no officialname has been provided 2022-09-19 14:06:52 +02:00
Miriam Baglioni b42e2c9df6 [Patch Hosted By] added fix to avoid NPE error when datasource official name is not provided 2022-09-19 12:30:32 +02:00
Miriam Baglioni 1329aa8479 [EOSC BulkTag] modified test to remove association of result to eosc when eoscifguidelines are set 2022-09-19 11:59:48 +02:00
Miriam Baglioni a0ee1a8640 [EOSC BulkTag] remove addition of eosc context for result with eosc if guidelines set 2022-09-19 11:44:10 +02:00
Claudio Atzori 96062164f9 Merge pull request '[Aggregator graph|master] Discard invalid records' (#245) from discard-non-wellformed into master
Reviewed-on: D-Net/dnet-hadoop#245
2022-09-19 09:48:16 +02:00
Claudio Atzori 35bb7c423f updated dhp-schemas version to 2.12.1 2022-09-16 16:13:15 +02:00
Claudio Atzori fd87571506 code formatting 2022-09-16 16:05:03 +02:00
Claudio Atzori c527112e33 Merge commit 'ff6f789b6d9be0567b6ad72f8a0e75fe3f52726a' into beta2master_sept_2022 2022-09-16 15:59:10 +02:00
Claudio Atzori 65209359bc Merge commit 'b5f7bd30be7f7adaaa28170740da0484b50a77ed' into beta2master_sept_2022 2022-09-16 15:58:11 +02:00
Claudio Atzori d72a64ded3 Merge commit '690be4482fc84327dc7617acbc8d976d559df512' into beta2master_sept_2022 2022-09-16 15:57:44 +02:00
Claudio Atzori 3e8499ce47 Merge commit '71b069ca90a2f7ec09d64241c60917d3636fc81e' into beta2master_sept_2022 2022-09-16 15:57:20 +02:00
Claudio Atzori 61aacb3271 Merge commit '1203378441dc6d8e8435cacd42e76e11746f6d1b' into beta2master_sept_2022 2022-09-16 15:56:55 +02:00
Claudio Atzori dbb567251a merged 853c996fa2 2022-09-16 15:56:28 +02:00
Claudio Atzori c7e8ad853e Merge commit '2b5f8c9c9a3611c57ee5febfe262a455a39ad801' into beta2master_sept_2022 2022-09-16 15:55:04 +02:00
Claudio Atzori 0849ebfd80 merged a11eb38065 2022-09-16 15:54:32 +02:00
Claudio Atzori 281239249e Merge commit 'b7c387c21f946adbc9da90ded95166205195edb0' into beta2master_sept_2022 2022-09-16 15:49:20 +02:00
Claudio Atzori 45fc5e12be Merge commit 'cb7c07c54e59675e8dffe42b7f2a13f16c956068' into beta2master_sept_2022 2022-09-16 15:48:55 +02:00
Claudio Atzori 1c05aaaa2e Merge commit '3418ce50ac9b28fed4fa949919e6c8208738cdcf' into beta2master_sept_2022 2022-09-16 15:48:36 +02:00
Claudio Atzori 01d5ad6361 Merge commit 'd85ba3c1a9d7f0e80565742161ff6c9ecffd52b7' into beta2master_sept_2022 2022-09-16 15:48:16 +02:00
Claudio Atzori d872d1cdd9 Merge commit 'a4815f6bec87f05be8cd740d236707949a0f746e' into beta2master_sept_2022 2022-09-16 15:47:49 +02:00
Claudio Atzori ab0efecab4 Merge commit '84598c75356cf580de6c81653a9351e9b8173639' into beta2master_sept_2022 2022-09-16 15:47:05 +02:00
Claudio Atzori 725c3c68d0 Merge commit '844f6eb46533cdd4be3210401b10401322079640' into beta2master_sept_2022 2022-09-16 15:46:40 +02:00
Claudio Atzori 300ae6221c Merge commit '32cee1f619eb30d2e2ac6083435b76b1aba7db09' into beta2master_sept_2022 2022-09-16 15:45:57 +02:00
Claudio Atzori 0ec2eaba35 Merge commit 'c1f2ffc53dc41f1fac3855b2d2df7d6a5ea15e3e' into beta2master_sept_2022 2022-09-16 15:45:27 +02:00
Claudio Atzori a387807d43 Merge commit 'b78889a0ce27a79c7ab2d8da05b118ee4f1bcb36' into beta2master_sept_2022 2022-09-16 15:44:17 +02:00
Claudio Atzori 2abe2bc137 Merge commit '08ce2cadc2d84aa982726e429c280a905536a715' into beta2master_sept_2022 2022-09-16 15:43:49 +02:00
Claudio Atzori a07c876922 Merge commit '27a91841e7fa2a1b615b4d1e161d606db5bead96' into beta2master_sept_2022 2022-09-16 15:43:02 +02:00
Claudio Atzori cbd48bc645 Merge commit 'efd96e7e664e4139321e35e8d172b884ba4b61a1' into beta2master_sept_2022 2022-09-16 15:38:56 +02:00
miconis 9ddd24ba36 implementation of comparators and clustering function for the author deduplication 2022-04-19 10:18:09 +02:00
miconis 97a32faf9b test implementation for the new fdup version 2022-04-13 09:48:56 +02:00
miconis 10172553ab [maven-release-plugin] prepare for next development iteration 2022-03-15 15:06:18 +01:00
miconis bd919ac98d [maven-release-plugin] prepare release dnet-dedup-4.1.12 2022-03-15 15:06:12 +01:00
miconis a965233dd0 bug fix in the normalization of a legalname, city map updated and transliteration support added 2022-03-15 14:59:13 +01:00
miconis ac9708e31b [maven-release-plugin] prepare for next development iteration 2022-03-09 13:43:48 +01:00
miconis a5a6054039 [maven-release-plugin] prepare release dnet-dedup-4.1.11 2022-03-09 13:43:44 +01:00
miconis 3bc07c5881 bug fix in the AuthorMatch, implementation of the concat function in the model creation with jpath query 2022-03-09 12:53:09 +01:00
miconis 699612dd17 implementation of the size threshold on authors list match 2022-03-08 16:49:28 +01:00
miconis 8f07f0c537 [maven-release-plugin] prepare for next development iteration 2022-01-13 17:22:16 +01:00
miconis 620e35db28 [maven-release-plugin] prepare release dnet-dedup-4.1.10 2022-01-13 17:22:12 +01:00
miconis 2ff97781d2 minor change 2022-01-13 17:20:20 +01:00
miconis 1ff6a3dc11 [maven-release-plugin] prepare for next development iteration 2022-01-13 15:15:05 +01:00
miconis 003bcf1699 [maven-release-plugin] prepare release dnet-dedup-4.1.9 2022-01-13 15:15:00 +01:00
miconis 2f1ba56f61 bug fix in the authormatch comparator, implementation of tests 2022-01-13 11:58:28 +01:00
miconis cea8440153 [maven-release-plugin] prepare for next development iteration 2021-12-30 13:11:57 +01:00
miconis eb48d31ea6 [maven-release-plugin] prepare release dnet-dedup-4.1.8 2021-12-30 13:11:52 +01:00
miconis a224bf70a4 implementation of new comparators for publication dedup configuration update 2021-12-27 17:35:02 +01:00
miconis 8f1db32921 implementation of the instance type comparator and its tests 2021-11-04 15:20:57 +01:00
miconis fbb1b66bfb dedup test implementation & graph drawing tools 2021-09-13 14:53:19 +02:00
miconis 1144d50a11 [maven-release-plugin] prepare for next development iteration 2021-05-03 16:09:56 +02:00
miconis f33a18ca9d [maven-release-plugin] prepare release dnet-dedup-4.1.7 2021-05-03 16:09:08 +02:00
miconis 4bce4f2e8e minor change: version updated 2021-05-03 16:05:39 +02:00
miconis c6266242e3 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-dedup 2021-05-03 15:38:00 +02:00
miconis 4988e9f80d implementation of cross comparison for different fields, addition of clustering mechanism to collapse keys from different clustering functions on the same cluster 2021-05-03 15:37:41 +02:00
Claudio Atzori 58d013e24f [maven-release-plugin] prepare for next development iteration 2021-04-12 16:12:15 +02:00
Claudio Atzori 3a7336157b [maven-release-plugin] prepare release dnet-dedup-4.0.6 2021-04-12 16:12:10 +02:00
miconis ed0d5d3e1d implementation of the wf to dedup entities, addition of the module to run the wf on the cluster 2020-12-04 15:41:31 +01:00
miconis 72116446ec [maven-release-plugin] prepare for next development iteration 2020-09-29 12:06:38 +02:00
miconis 05a03d97cd [maven-release-plugin] prepare release dnet-dedup-4.0.5 2020-09-29 12:06:35 +02:00
miconis 2a01022712 minor changes 2020-09-29 12:05:50 +02:00
miconis dd34e371d7 fixed error in the treeprocessor: it used th=-1 as default value, now it uses th=1 2020-09-29 12:01:25 +02:00
miconis 19c3c90d7b fixed error in the block processor: entities with orderField=null were not considered 2020-09-19 17:43:41 +02:00
Sandro La Bruzzo a109ebe287 fixed NPE 2020-08-06 10:27:05 +02:00
miconis a5a3ea24f8 [maven-release-plugin] prepare for next development iteration 2020-07-16 18:59:25 +02:00
miconis 840fe8f4d3 [maven-release-plugin] prepare release dnet-dedup-4.0.4 2020-07-16 18:59:22 +02:00
miconis 07ab904d60 implementation of the clustering function for the suffixprefix chain 2020-07-16 18:57:55 +02:00
Claudio Atzori eaf7defe0c [maven-release-plugin] prepare for next development iteration 2020-07-15 17:57:09 +02:00
Claudio Atzori ff2c8eba12 [maven-release-plugin] prepare release dnet-dedup-4.0.3 2020-07-15 17:57:04 +02:00
Claudio Atzori 7cc3742a26 removed maven release.property 2020-07-15 17:52:27 +02:00
Claudio Atzori 14611ea450 reverted to 4.0.3-SNAPSHOT 2020-07-15 17:37:36 +02:00
Claudio Atzori 9f20f23870 Revert "wordssuffixprefix: adjust the token length according to the number of words; removed maven release temporary files"
This reverts commit 51d91fa520.
2020-07-15 17:35:56 +02:00
Claudio Atzori 9efcd8e245 Revert "reverted to 4.0.3-SNAPSHOT"
This reverts commit ec97983ce1.
2020-07-15 17:28:37 +02:00
Claudio Atzori ba493f9ab8 [maven-release-plugin] rollback the release of dnet-dedup-4.0.3 2020-07-15 17:24:43 +02:00
Claudio Atzori 6c98d4c436 [maven-release-plugin] prepare release dnet-dedup-4.0.3 2020-07-15 17:24:25 +02:00
Claudio Atzori ec97983ce1 reverted to 4.0.3-SNAPSHOT 2020-07-15 17:20:12 +02:00
Claudio Atzori 51d91fa520 wordssuffixprefix: adjust the token length according to the number of words; removed maven release temporary files 2020-07-15 17:13:45 +02:00
Claudio Atzori b79ea97107 Revert "wordssuffixprefix: adjust the token length according to the number of words; removed maven release temporary files"
This reverts commit d2861950ac.
2020-07-15 17:11:46 +02:00
Claudio Atzori 92aadbfc7b [maven-release-plugin] prepare release dnet-dedup-4.0.3 2020-07-15 17:04:20 +02:00
Claudio Atzori d2861950ac wordssuffixprefix: adjust the token length according to the number of words; removed maven release temporary files 2020-07-15 16:49:47 +02:00
miconis 244a037a90 implementation of a class to test the clustering functions 2020-07-12 10:13:54 +02:00
miconis 7aa2001a8b [maven-release-plugin] prepare for next development iteration 2020-07-02 17:06:38 +02:00
miconis c72055f543 [maven-release-plugin] prepare release dnet-dedup-4.0.2 2020-07-02 17:06:36 +02:00
miconis f933fd33e0 implemented new function for clustering 2020-07-02 17:04:17 +02:00
miconis 411d1cc24f implementation of the test for the dedup and addition of new support classes 2020-06-11 10:46:46 +02:00
miconis 48c094f599 [maven-release-plugin] prepare for next development iteration 2020-04-24 14:39:01 +02:00
miconis 4365ba41c9 [maven-release-plugin] prepare release dnet-dedup-4.0.1 2020-04-24 14:38:58 +02:00
miconis 6e9b27f37d implementation of the mechanism to truncate the string and the lists 2020-04-24 14:36:42 +02:00
Sandro La Bruzzo 8e4211708e [maven-release-plugin] prepare for next development iteration 2020-02-10 12:51:04 +01:00
Sandro La Bruzzo 24e2ab9092 [maven-release-plugin] prepare release dnet-dedup-4.0.0 2020-02-10 12:50:45 +01:00
Sandro La Bruzzo 46727f5c76 upgraded maven version of commons-lang 2020-02-10 12:38:40 +01:00
miconis 5c8f6febee minor changes in comparators 2020-01-24 10:01:11 +01:00
miconis 4dce785375 update in the implementation of the tree: addition of new logic aggregations and statistics 2020-01-14 11:42:43 +02:00
miconis b3748b8d77 minor changes 2019-12-18 16:20:35 +01:00
miconis b21b1b8f61 implementation of new aggregation in the tree node processing 2019-12-18 16:19:36 +01:00
miconis 20fcfe6328 implementation of new aggregation in the tree node processing 2019-12-18 16:19:26 +01:00
Sandro La Bruzzo d924f28b93 fixed wrong use of jspath 2019-12-18 09:29:44 +01:00
miconis 84aaa65501 implementation of new json comparator and update of the publication configuration 2019-12-17 09:16:26 +01:00
Sandro La Bruzzo 5c01ae4c92 merged JqMapping branch into tree2 2019-12-13 11:30:02 +01:00
Sandro La Bruzzo 35008fdbf9 fix stuff 2019-12-06 15:28:30 +01:00
Sandro La Bruzzo 16c670a5d5 Improved deduplication 2019-12-05 14:14:25 +01:00
miconis 49f9beb4a8 implementation of romansmatch and re-implementation of the getNumber function. New terms in the translation map and update of the configuration 2019-11-28 16:54:44 +01:00
miconis f791730330 addition of one term to the translation maps in the configurations 2019-11-27 15:48:37 +01:00
miconis d2278fe358 minor change in the citymatch 2019-11-21 10:54:02 +01:00
miconis 8c0d346005 the param map has been updated: now it accepts string parameters 2019-11-21 09:37:56 +01:00
miconis ddd40540aa jarowinklernormalizedname split into 3 different comparators: citymatch, keywordmatch and jarowinkler. Implementation of the TreeStatistic support functions 2019-11-20 10:45:00 +01:00
miconis c687956371 code cleaning and implementation of the TreeDedup + minor changes 2019-11-14 10:01:21 +01:00
miconis 0973899865 code cleaning, distribution of the classes in packages and implementation of the new configuration 2019-11-07 12:47:12 +01:00
miconis 30a873265f put the last modification of the master branch into the tree2. Addition of the configuration as parameter of the comparator. This is to allow the comparator to access it 2019-10-29 16:38:42 +01:00
miconis 1beb776691 minor changes 2019-10-29 15:58:21 +01:00
miconis 075f741d28 [maven-release-plugin] prepare for next development iteration 2019-10-24 11:34:19 +02:00
miconis ced4bcdd59 [maven-release-plugin] prepare release dnet-dedup-3.0.15 2019-10-24 11:34:12 +02:00
miconis 13f93e6055 Revert "[maven-release-plugin] prepare release dnet-dedup-3.0.15"
This reverts commit cf93515d94.
2019-10-24 11:23:01 +02:00
miconis cf93515d94 [maven-release-plugin] prepare release dnet-dedup-3.0.15 2019-10-24 11:17:07 +02:00
miconis 285ec3ca17 release rollback 2019-10-24 11:11:07 +02:00
miconis 5f249fd56c minor changes 2019-10-23 16:37:20 +02:00
miconis c9863debfa minor changes and configuration updates (synonym field added) 2019-10-23 16:31:45 +02:00
miconis 5499ca17c3 minor changes 2019-10-08 16:49:07 +02:00
miconis 50b7a12b3f normalization of the term in the translation map added 2019-10-08 15:13:45 +02:00
miconis 26b383fea2 translation map moved in json configuration, support for synonyms added in the configuration, now the configuration is argument of conditions, distancealgos and clusteringfunctions 2019-10-08 14:53:52 +02:00
Claudio Atzori 07355d2811 [maven-release-plugin] prepare for next development iteration 2019-09-25 10:39:46 +02:00
Claudio Atzori 254eb46809 [maven-release-plugin] prepare release dnet-dedup-3.0.14 2019-09-25 10:39:39 +02:00
Claudio Atzori 74c6462b49 updated translation map and some tests 2019-09-25 10:15:13 +02:00
miconis aed81e4cfa translation map updated 2019-09-25 09:53:06 +02:00
miconis afd2b398d5 optimize imports 2019-08-09 15:42:41 +02:00
miconis d71dae5fd2 implementation of the conditions in tree nodes. get rid of the conditions part of the configuration 2019-08-09 15:41:49 +02:00
miconis a5c5d2f01b implementation of the decision tree. It takes the place of the distance algos; necessaryConditions and sufficientConditions are still there. The model contains only path, type and name of the field. ignoreMissing is still in the model because it is used by the conditions. 2019-08-09 10:08:34 +02:00
miconis f2136e1024 code refactoring: useless module removed 2019-08-07 15:16:59 +02:00
miconis 8c867101ef addition of a fixSpecial function to address the problem with special character in organization names, addition of new terms in translation maps 2019-08-06 17:06:05 +02:00
miconis 4502b44337 addition of the BlockUtils class for meta-blocking, implementation of a new local test with edge filtering example 2019-08-06 12:09:34 +02:00
miconis cffb712a99 Merge branch 'master' of https://github.com/dnet-team/dnet-dedup 2019-07-19 17:10:53 +02:00
miconis a85576c27e restyling of the JaroWinklerNormalizedName comparator, now it is optimized. Addition of some translations in the translation maps, addition of a clustering based on keywords in organizations legalnames 2019-07-19 17:10:29 +02:00
Claudio Atzori 6cb846331a [maven-release-plugin] prepare for next development iteration 2019-07-08 11:12:52 +02:00
Claudio Atzori c04d2232c2 [maven-release-plugin] prepare release dnet-dedup-3.0.13 2019-07-08 11:12:45 +02:00
miconis fb5e38db26 Merge branch 'master' of https://github.com/dnet-team/dnet-dedup 2019-07-08 11:02:29 +02:00
miconis 3c6f8d1e44 bug fixing in the keywordsclustering class 2019-07-08 11:01:49 +02:00
Claudio Atzori a69022617d [maven-release-plugin] prepare for next development iteration 2019-07-08 10:11:24 +02:00
Claudio Atzori c6baeb93d4 [maven-release-plugin] prepare release dnet-dedup-3.0.12 2019-07-08 10:11:17 +02:00
miconis f5de20a508 [maven-release-plugin] rollback the release of dnet-dedup-3.0.12 2019-07-08 10:00:48 +02:00
miconis ba50aa8654 [maven-release-plugin] prepare for next development iteration 2019-07-08 09:48:10 +02:00
miconis 7065110a21 [maven-release-plugin] prepare release dnet-dedup-3.0.12 2019-07-08 09:48:03 +02:00
miconis 15bec5e876 addition of doi normalization in PidMatch comparator, addition of keywordsclustering (clustering based on terms in the translation maps for the organizations), minor changes 2019-07-08 09:44:02 +02:00
Claudio Atzori 2dcffb965f [maven-release-plugin] prepare for next development iteration 2019-06-19 10:02:39 +02:00
Claudio Atzori 85126c59f7 [maven-release-plugin] prepare release dnet-dedup-3.0.11 2019-06-19 10:02:32 +02:00
Claudio Atzori 15d7b584f3 optimized classpath resolvers 2019-06-19 10:01:35 +02:00
Claudio Atzori ff4956def9 [maven-release-plugin] prepare for next development iteration 2019-06-18 14:46:34 +02:00
Claudio Atzori eb5ce312a3 [maven-release-plugin] prepare release dnet-dedup-3.0.10 2019-06-18 14:46:27 +02:00
Claudio Atzori f2bc665403 avoid division by zero: in case of missing values, return undefined response 2019-06-18 14:45:15 +02:00
Claudio Atzori e3f86b92c8 cleanup 2019-06-18 14:44:42 +02:00
miconis 54e4d0af04 exact match condition gives undefined if a field is missing; ignoremissing semantics changed: now it performs the comparison in any case if =true; if =false, it gives -1 in case of missing 2019-06-18 14:05:31 +02:00
miconis e8db8f2abb implementation of the integration test, addition of document blocks to group entities after clustering 2019-05-21 16:38:26 +02:00
Claudio Atzori f7a3bdf3f8 [maven-release-plugin] prepare for next development iteration 2019-04-03 12:35:00 +02:00
Claudio Atzori 98c179c8fb [maven-release-plugin] prepare release dnet-dedup-3.0.9 2019-04-03 12:34:52 +02:00
miconis 3e61a90c8f [maven-release-plugin] rollback the release of dnet-dedup-3.0.9 2019-04-03 12:27:28 +02:00
miconis 15fb9eb883 [maven-release-plugin] prepare for next development iteration 2019-04-03 12:26:05 +02:00
miconis a1ff4daa7f [maven-release-plugin] prepare release dnet-dedup-3.0.9 2019-04-03 12:25:56 +02:00
miconis 1d29bae47c branch cities merged into master 2019-04-03 12:22:33 +02:00
miconis 7e7018c51f addition of a sparktester test, implementation of 2 different classes for testing in dnet-dedup-test module, addition of new terms in the vocabulary and change in the implementation of the JaroWinklerNormalizedName comparator 2019-04-03 09:40:14 +02:00
miconis 4bd5a9beee minor changes 2019-03-26 15:48:21 +01:00
Michele De Bonis 662448e584 update of the comparator for legalnames of organizations 2019-03-21 14:27:27 +01:00
Claudio Atzori f2394fcd9f [maven-release-plugin] prepare for next development iteration 2019-02-18 09:09:14 +01:00
Claudio Atzori 722431dde1 [maven-release-plugin] prepare release dnet-dedup-3.0.8 2019-02-18 09:09:07 +01:00
Claudio Atzori 470c4b0f20 default configuration includes configurationId 2019-02-18 09:07:23 +01:00
Claudio Atzori ccb7e83196 [maven-release-plugin] prepare for next development iteration 2019-02-17 12:56:19 +01:00
Claudio Atzori 7d8e62d4cc [maven-release-plugin] prepare release dnet-dedup-3.0.7 2019-02-17 12:56:11 +01:00
Claudio Atzori 968cd47436 replace existing attributes when loading default configuration 2019-02-17 12:48:25 +01:00
Michele De Bonis 0735f3a822 implementation of the test classes and minor changes 2019-02-08 12:56:47 +01:00
Michele De Bonis 7a8d28991f implementation of the decision tree for the deduplication of the authors, implementation of multiple comparators to be used in a tree node and definition of the proto for person entity 2018-12-20 09:54:41 +01:00
Michele De Bonis 39613dbbd6 implementation of the decisional tree, addition of the dnet-openaire-data-protos module, definition of the person proto, blockprocessor and paceconfig modified with addition of support for the tree processing 2018-12-12 16:30:03 +01:00
Claudio Atzori f1c68d8ba3 apply limits (length, size) to pace Fields 2018-11-20 10:51:38 +01:00
Claudio Atzori c5979ffe18 [maven-release-plugin] prepare for next development iteration 2018-11-19 17:41:45 +01:00
Claudio Atzori 9869dff1d2 [maven-release-plugin] prepare release dnet-dedup-3.0.6 2018-11-19 17:41:37 +01:00
Claudio Atzori c2d4cb3ba6 added new properties to FieldDef (size, length) to limit the information mapped onto each MapDocument 2018-11-19 17:37:57 +01:00
Claudio Atzori 394fcafd41 [maven-release-plugin] prepare for next development iteration 2018-11-17 09:13:16 +01:00
Claudio Atzori 397554130c [maven-release-plugin] prepare release dnet-dedup-3.0.5 2018-11-17 09:13:09 +01:00
Claudio Atzori 0dfb2ea600 added distance function for software titles 2018-11-17 09:11:38 +01:00
Michele De Bonis 3d4372ced9 addition of cities check 2018-11-16 16:11:03 +01:00
Claudio Atzori 55a9b4f501 [maven-release-plugin] prepare for next development iteration 2018-11-16 09:18:00 +01:00
Claudio Atzori 35ab630493 [maven-release-plugin] prepare release dnet-dedup-3.0.4 2018-11-16 09:17:53 +01:00
Claudio Atzori 399e4bc80f default (empty) configuration should be aligned with the updated model 2018-11-15 16:52:56 +01:00
Claudio Atzori 59bab8dba4 less verbose logging 2018-11-13 09:07:45 +01:00
Claudio Atzori 478ad72cb8 propagate exceptions in case of serialization errors, removed configuration pretty printing, removed unused class ScoredResult 2018-11-12 15:52:18 +01:00
Claudio Atzori f7616c7a8a [maven-release-plugin] prepare for next development iteration 2018-11-12 14:23:36 +01:00
Claudio Atzori df4b871c8b [maven-release-plugin] prepare release dnet-dedup-3.0.3 2018-11-12 14:23:29 +01:00
Michele De Bonis 72a9b3139e Merge branch 'master' of https://github.com/dnet-team/dnet-dedup 2018-11-12 14:11:26 +01:00
Michele De Bonis b5062f5429 configuration file updated, addition of condition on domain 2018-11-12 14:11:15 +01:00
Claudio Atzori 2a509b18fa [maven-release-plugin] prepare for next development iteration 2018-11-12 12:46:50 +01:00
Claudio Atzori e247218987 [maven-release-plugin] prepare release dnet-dedup-3.0.2 2018-11-12 12:46:42 +01:00
Claudio Atzori b7bc7f0401 getting rid of spark libs from dnet-pace-core 2018-11-12 12:46:06 +01:00
Claudio Atzori 3dacba37ea [maven-release-plugin] prepare for next development iteration 2018-11-12 11:40:42 +01:00
Claudio Atzori 8cc2517f5d [maven-release-plugin] prepare release dnet-dedup-3.0.1 2018-11-12 11:40:34 +01:00
Claudio Atzori 851ae5eec3 [maven-release-plugin] rollback the release of dnet-dedup-3.0.1 2018-11-12 11:39:07 +01:00
Claudio Atzori f283d58a6e [maven-release-plugin] prepare release dnet-dedup-3.0.1 2018-11-12 11:38:52 +01:00
Claudio Atzori 6d09041288 [maven-release-plugin] rollback the release of dnet-dedup-3.0.1 2018-11-12 11:28:28 +01:00
Claudio Atzori 46cee13596 [maven-release-plugin] prepare for next development iteration 2018-11-12 11:24:06 +01:00
Claudio Atzori e1c69ad24e [maven-release-plugin] prepare release dnet-dedup-3.0.1 2018-11-12 11:23:57 +01:00
Michele De Bonis b247a86e69 configuration files changed: dedupRun instead of run, assertion updated in tests 2018-11-06 11:02:00 +01:00
Michele De Bonis 4c8485d0bb deleted useless imports 2018-11-06 09:48:22 +01:00
Michele De Bonis 748189af10 implementation of JaroWinklerNormalizedName, addition of various stopwords in different languages and configuration test 2018-11-05 17:22:59 +01:00
Claudio Atzori e296f7a81c added DiffPatchMatch utility. Resumed commented tests! 2018-10-31 10:49:11 +01:00
Michele De Bonis dc41b76643 serialization test added. useless getter methods ignored by json serialization 2018-10-29 16:16:11 +01:00
Michele De Bonis ea36007d1f DedupConf parsed using Jackson library 2018-10-29 11:13:55 +01:00
Michele De Bonis 8b4762bf54 implementation of the toString methonds changed: from Gson to Jackson 2018-10-26 14:55:59 +02:00
Michele De Bonis 3cf3dc1934 modification in the initialization of clustering functions, distance algos and conditions. 2018-10-25 15:15:40 +02:00
Michele De Bonis 1cbbc3f15a update in the discovery of clustering, conditions and distance functions (annotated with custom annotations) 2018-10-24 12:09:41 +02:00
Claudio Atzori 4d379c2227 revised PidMatch implementation, cleanup 2018-10-20 08:38:19 +02:00
Claudio Atzori 3197f26691 [maven-release-plugin] prepare for next development iteration 2018-10-18 12:17:34 +02:00
Claudio Atzori 63815be2d6 [maven-release-plugin] prepare release dnet-dedup-3.0.0 2018-10-18 12:17:27 +02:00
Claudio Atzori ed14476b06 [maven-release-plugin] rollback the release of dnet-dedup-3.0.0 2018-10-18 12:13:03 +02:00
Claudio Atzori 82d5dce114 [maven-release-plugin] prepare release dnet-dedup-3.0.0 2018-10-18 12:12:45 +02:00
Claudio Atzori 4f29124607 [maven-release-plugin] rollback the release of dnet-dedup-3.0.0 2018-10-18 12:00:45 +02:00
Claudio Atzori 5a48937ae1 [maven-release-plugin] prepare for next development iteration 2018-10-18 11:58:43 +02:00
Claudio Atzori 5aec80345f [maven-release-plugin] prepare release dnet-dedup-3.0.0 2018-10-18 11:58:36 +02:00
Claudio Atzori 1b46966383 updated maven project structure 2018-10-18 11:56:26 +02:00
Michele De Bonis 72ebf7c0f3 update of the spark test 2018-10-18 10:12:44 +02:00
Sandro La Bruzzo 1bb5c26e6d Added first Spark implementation of dedup 2018-10-11 15:19:20 +02:00
Sandro La Bruzzo d1c73bcf90 Added First Implementation of Spark Test 2018-10-02 17:07:17 +02:00
Sandro La Bruzzo 476c3d7b07 added d-net pace core module and ignored target folder 2018-10-02 10:37:54 +02:00
162 changed files with 7823 additions and 2557 deletions

1
.gitignore vendored
View File

@ -26,3 +26,4 @@ spark-warehouse
/**/*.log
/**/.factorypath
/**/.scalafmt.conf
/.java-version

128
README.md
View File

@ -1,2 +1,128 @@
# dnet-hadoop
Dnet-hadoop is the project that defined all the OOZIE workflows for the OpenAIRE Graph construction, processing, provisioning.
Dnet-hadoop is the project that defines all the [OOZIE workflows](https://oozie.apache.org/) for the OpenAIRE Graph construction, processing, and provisioning.
How to build, package and run oozie workflows
====================
Oozie-installer is a utility for building, uploading and running oozie workflows. In practice, it creates a `*.tar.gz`
package that contains the resources defining a workflow and some helper scripts.
This module is automatically executed when running:
`mvn package -Poozie-package -Dworkflow.source.dir=classpath/to/parent/directory/of/oozie_app`
on a module having set:
```
<parent>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-workflows</artifactId>
</parent>
```
in the `pom.xml` file. The `oozie-package` profile initializes oozie workflow packaging, while the `workflow.source.dir` property points to
a workflow (notice: this is not a relative path but a classpath to the directory usually holding the `oozie_app` subdirectory).
The outcome of this packaging is an `oozie-package.tar.gz` file containing all the resources required to run the Oozie workflow:
- jar packages
- workflow definitions
- job properties
- maintenance scripts
Required properties
====================
In order to include the proper workflow within the package, the `workflow.source.dir` property has to be set. It can be provided
by setting the `-Dworkflow.source.dir=some/job/dir` maven parameter.
In order to define the full set of cluster environment properties, one should create a `~/.dhp/application.properties` file with
the following properties:
- `dhp.hadoop.frontend.user.name` - your user name on the hadoop cluster and frontend machine
- `dhp.hadoop.frontend.host.name` - frontend host name
- `dhp.hadoop.frontend.temp.dir` - frontend directory for temporary files
- `dhp.hadoop.frontend.port.ssh` - frontend machine ssh port
- `oozieServiceLoc` - oozie service location, required by the `run_workflow.sh` script executing the oozie job
- `nameNode` - name node address
- `jobTracker` - job tracker address
- `oozie.execution.log.file.location` - location of the file that will be created when executing the oozie job; it contains the output
produced by the `run_workflow.sh` script (needed to obtain the oozie job id)
- `maven.executable` - mvn command location; requires parameterization due to a different setup of the CI cluster
- `sparkDriverMemory` - amount of memory assigned to the driver of spark jobs
- `sparkExecutorMemory` - amount of memory assigned to the executors of spark jobs
- `sparkExecutorCores` - number of cores assigned to the executors of spark jobs
All values will be overridden by the ones from `job.properties` and, optionally, `job-override.properties` stored in the module's
main folder.
To override properties from `job.properties`, a `job-override.properties` file can be created in the main module directory
(the one containing the `pom.xml` file), defining all the new properties that will override the existing ones.
Alternatively, those properties can be provided one by one as command line `-D` arguments.
Properties overriding order is the following:
1. `pom.xml` defined properties (located in the project root dir)
2. `~/.dhp/application.properties` defined properties
3. `${workflow.source.dir}/job.properties`
4. `job-override.properties` (located in the project root dir)
5. `maven -Dparam=value`
where the maven `-Dparam` property overrides all the other ones.
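For reference, a `~/.dhp/application.properties` file built from the properties above might look as follows; all values are illustrative placeholders, not defaults:
```
dhp.hadoop.frontend.user.name=jsmith
dhp.hadoop.frontend.host.name=hadoop-frontend.example.org
dhp.hadoop.frontend.temp.dir=/home/jsmith/tmp
dhp.hadoop.frontend.port.ssh=22
oozieServiceLoc=http://hadoop-frontend.example.org:11000/oozie
nameNode=hdfs://namenode.example.org:8020
jobTracker=yarnRM
oozie.execution.log.file.location=target/extract-and-run-on-remote-host.log
maven.executable=mvn
sparkDriverMemory=4G
sparkExecutorMemory=6G
sparkExecutorCores=2
```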
Workflow definition requirements
====================
`workflow.source.dir` property should point to the following directory structure:
[${workflow.source.dir}]
|
|-job.properties (optional)
|
\-[oozie_app]
  |
  \-workflow.xml
This property can be set using the maven `-D` switch.
`[oozie_app]` is the default directory name; however, it can be set to any value, as long as the `oozieAppDir` property is
provided with the directory name as its value.
Sub-workflows are supported as well, and sub-workflow directories should be nested within the `[oozie_app]` directory; a minimal example of the workflow descriptor itself is sketched below.
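For orientation only, a minimal `oozie_app/workflow.xml` satisfying the structure above could look like the following sketch (a single hypothetical filesystem action; the workflow name and action body are illustrative):
```xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="example_wf">
    <start to="make_dir"/>
    <action name="make_dir">
        <fs>
            <mkdir path="${nameNode}/tmp/example_wf"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```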
Creating oozie installer step-by-step
=====================================
The automated oozie-installer steps are the following:
1. creating jar packages: `*.jar` and `*tests.jar`, along with copying all dependencies into `target/dependencies`
2. reading properties from maven, `~/.dhp/application.properties`, `job.properties` and `job-override.properties`
3. invoking the priming mechanism linking resources from the `import.txt` file (currently resolving sub-workflow resources)
4. assembling shell scripts for preparing the Hadoop filesystem, uploading the Oozie application and starting the workflow
5. copying the whole `${workflow.source.dir}` content to `target/${oozie.package.file.name}`
6. generating an updated `job.properties` file in `target/${oozie.package.file.name}` based on maven,
`~/.dhp/application.properties`, `job.properties` and `job-override.properties`
7. creating a `lib` directory (or multiple directories, one for each nested sub-workflow directory) and copying the jar packages
created at step (1) into each of them
8. bundling the whole `${oozie.package.file.name}` directory into a single tar.gz package
Uploading oozie package and running workflow on cluster
=======================================================
In order to simplify the deployment and execution process, two dedicated profiles were introduced:
- `deploy`
- `run`
to be used along with the `oozie-package` profile, e.g. by providing the `-Poozie-package,deploy,run` maven parameters.
The `deploy` profile supplements the packaging process with:
1) uploading the oozie package via scp to the `/home/${user.name}/oozie-packages` directory on the `${dhp.hadoop.frontend.host.name}` machine
2) extracting the uploaded package
3) uploading the oozie content to the hadoop cluster HDFS location defined in the `oozie.wf.application.path` property (generated dynamically by the maven build process, based on the `${dhp.hadoop.frontend.user.name}` and `workflow.source.dir` properties)
The `run` profile introduces:
1) executing the oozie application uploaded to the HDFS cluster by the `deploy` step; it triggers the `run_workflow.sh` script, providing the runtime properties defined in the `job.properties` file.
Notice: ssh access to the frontend machine has to be configured at the system level, and it is preferable to set up key-based authentication in order to simplify remote operations.
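Putting it together, a complete build-deploy-run invocation therefore looks like (reusing the placeholder path from above):
`mvn package -Poozie-package,deploy,run -Dworkflow.source.dir=classpath/to/parent/directory/of/oozie_app`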

View File

@ -62,6 +62,11 @@
</build>
<dependencies>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-pace-core</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
@ -124,12 +129,6 @@
<dependency>
<groupId>eu.dnetlib</groupId>
<artifactId>cnr-rmi-api</artifactId>
<exclusions>
<exclusion>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
@ -150,11 +149,6 @@
<artifactId>okhttp</artifactId>
</dependency>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-pace-core</artifactId>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
@ -167,7 +161,7 @@
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-schemas_${scala.binary.version}</artifactId>
<artifactId>${dhp-schemas.artifact}</artifactId>
</dependency>
<dependency>

View File

@ -11,25 +11,18 @@ import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.oaf.Oaf;
import eu.dnetlib.dhp.schema.oaf.OafEntity;
import eu.dnetlib.dhp.schema.common.ModelSupport;
public class DispatchEntitiesSparkJob {
private static final Logger log = LoggerFactory.getLogger(DispatchEntitiesSparkJob.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
public static void main(String[] args) throws Exception {
String jsonConfiguration = IOUtils
@ -54,44 +47,51 @@ public class DispatchEntitiesSparkJob {
String outputPath = parser.get("outputPath");
log.info("outputPath: {}", outputPath);
String graphTableClassName = parser.get("graphTableClassName");
log.info("graphTableClassName: {}", graphTableClassName);
@SuppressWarnings("unchecked")
Class<? extends OafEntity> entityClazz = (Class<? extends OafEntity>) Class.forName(graphTableClassName);
boolean filterInvisible = Boolean.parseBoolean(parser.get("filterInvisible"));
log.info("filterInvisible: {}", filterInvisible);
SparkConf conf = new SparkConf();
runWithSparkSession(
conf,
isSparkSessionManaged,
spark -> {
HdfsSupport.remove(outputPath, spark.sparkContext().hadoopConfiguration());
dispatchEntities(spark, inputPath, entityClazz, outputPath);
});
spark -> dispatchEntities(spark, inputPath, outputPath, filterInvisible));
}
private static <T extends Oaf> void dispatchEntities(
private static void dispatchEntities(
SparkSession spark,
String inputPath,
Class<T> clazz,
String outputPath) {
String outputPath,
boolean filterInvisible) {
spark
.read()
.textFile(inputPath)
.filter((FilterFunction<String>) s -> isEntityType(s, clazz))
.map((MapFunction<String, String>) s -> StringUtils.substringAfter(s, "|"), Encoders.STRING())
.map(
(MapFunction<String, T>) value -> OBJECT_MAPPER.readValue(value, clazz),
Encoders.bean(clazz))
.write()
.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(outputPath);
Dataset<String> df = spark.read().textFile(inputPath);
ModelSupport.oafTypes.entrySet().parallelStream().forEach(entry -> {
String entityType = entry.getKey();
Class<?> clazz = entry.getValue();
final String entityPath = outputPath + "/" + entityType;
if (!entityType.equalsIgnoreCase("relation")) {
HdfsSupport.remove(entityPath, spark.sparkContext().hadoopConfiguration());
Dataset<Row> entityDF = spark
.read()
.schema(Encoders.bean(clazz).schema())
.json(
df
.filter((FilterFunction<String>) s -> s.startsWith(clazz.getName()))
.map(
(MapFunction<String, String>) s -> StringUtils.substringAfter(s, "|"),
Encoders.STRING()));
if (filterInvisible) {
entityDF = entityDF.filter("dataInfo.invisible != true");
}
entityDF
.write()
.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(entityPath);
}
});
}
private static <T extends Oaf> boolean isEntityType(final String s, final Class<T> clazz) {
return StringUtils.substringBefore(s, "|").equals(clazz.getName());
}
}

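A note on the change above: both the old and the new implementation rely on the same intermediate record format, where each text line carries the fully qualified entity class name, a `|` separator and the JSON payload. A minimal sketch of that convention (the sample line is hypothetical, not taken from the patch):
```java
import org.apache.commons.lang3.StringUtils;

public class RecordPrefixSketch {
	public static void main(String[] args) {
		// hypothetical input line in the "<class name>|<json>" convention
		String line = "eu.dnetlib.dhp.schema.oaf.Publication|{\"id\":\"50|doi_________::abc\"}";
		String entityClass = StringUtils.substringBefore(line, "|"); // routing key
		String json = StringUtils.substringAfter(line, "|"); // payload; inner '|' chars are preserved
		System.out.println(entityClass + " -> " + json);
	}
}
```
Splitting on the first `|` only is what keeps OpenAIRE identifiers such as `50|doi_________::abc` intact inside the payload.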
View File

@ -6,14 +6,16 @@ import java.util.regex.Pattern;
public class FundRefCleaningRule {
public static String clean(final String fundrefId) {
public static final Pattern PATTERN = Pattern.compile("\\d+");
String s = fundrefId
public static String clean(final String fundRefId) {
String s = fundRefId
.toLowerCase()
.replaceAll("\\s", "");
Matcher m = Pattern.compile("\\d+").matcher(s);
if (m.matches()) {
Matcher m = PATTERN.matcher(s);
if (m.find()) {
return m.group();
} else {
return "";

View File

@ -13,11 +13,7 @@ import java.util.stream.Collectors;
import java.util.stream.Stream;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.github.sisyphsu.dateparser.DateParserUtils;
import com.google.common.collect.Lists;
import com.google.common.collect.Sets;
@ -39,6 +35,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
public static final String TITLE_FILTER_REGEX = String.format("(%s)|\\W|\\d", TITLE_TEST);
public static final int TITLE_FILTER_RESIDUAL_LENGTH = 5;
private static final String NAME_CLEANING_REGEX = "[\\r\\n\\t\\s]+";
public static <T extends Oaf> T cleanContext(T value, String contextId, String verifyParam) {
if (ModelSupport.isSubClass(value, Result.class)) {
@ -228,7 +225,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
}
public static <T extends Oaf> boolean filter(T value) {
if (Boolean.TRUE
if (!(value instanceof Relation) && (Boolean.TRUE
.equals(
Optional
.ofNullable(value)
@ -239,15 +236,16 @@ public class GraphCleaningFunctions extends CleaningFunctions {
d -> Optional
.ofNullable(d.getInvisible())
.orElse(true))
.orElse(true))
.orElse(true))) {
.orElse(false))
.orElse(true)))) {
return true;
}
if (value instanceof Datasource) {
// nothing to evaluate here
} else if (value instanceof Project) {
// nothing to evaluate here
final Project p = (Project) value;
return Objects.nonNull(p.getCode()) && StringUtils.isNotBlank(p.getCode().getValue());
} else if (value instanceof Organization) {
// nothing to evaluate here
} else if (value instanceof Relation) {
@ -294,6 +292,13 @@ public class GraphCleaningFunctions extends CleaningFunctions {
} else if (value instanceof Result) {
Result r = (Result) value;
if (Objects.nonNull(r.getFulltext())
&& (ModelConstants.SOFTWARE_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()) ||
ModelConstants.DATASET_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()))) {
r.setFulltext(null);
}
if (Objects.nonNull(r.getDateofacceptance())) {
Optional<String> date = cleanDateField(r.getDateofacceptance());
if (date.isPresent()) {
@ -318,8 +323,18 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.filter(sp -> StringUtils.isNotBlank(sp.getValue()))
.collect(Collectors.toList()));
}
if (Objects.nonNull(r.getPublisher()) && StringUtils.isBlank(r.getPublisher().getValue())) {
r.setPublisher(null);
if (Objects.nonNull(r.getPublisher())) {
if (StringUtils.isBlank(r.getPublisher().getValue())) {
r.setPublisher(null);
} else {
r
.getPublisher()
.setValue(
r
.getPublisher()
.getValue()
.replaceAll(NAME_CLEANING_REGEX, " "));
}
}
if (Objects.isNull(r.getLanguage()) || StringUtils.isBlank(r.getLanguage().getClassid())) {
r
@ -486,6 +501,11 @@ public class GraphCleaningFunctions extends CleaningFunctions {
i.setDateofacceptance(null);
}
}
if (StringUtils.isNotBlank(i.getFulltext()) &&
(ModelConstants.SOFTWARE_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()) ||
ModelConstants.DATASET_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()))) {
i.setFulltext(null);
}
}
}
if (Objects.isNull(r.getBestaccessright())
@ -510,6 +530,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.filter(Objects::nonNull)
.filter(a -> StringUtils.isNotBlank(a.getFullname()))
.filter(a -> StringUtils.isNotBlank(a.getFullname().replaceAll("[\\W]", "")))
.map(GraphCleaningFunctions::cleanupAuthor)
.collect(Collectors.toList()));
boolean nullRank = r
@ -604,6 +625,35 @@ public class GraphCleaningFunctions extends CleaningFunctions {
return value;
}
private static Author cleanupAuthor(Author author) {
if (StringUtils.isNotBlank(author.getFullname())) {
author
.setFullname(
author
.getFullname()
.replaceAll(NAME_CLEANING_REGEX, " ")
.replace("\"", "\\\""));
}
if (StringUtils.isNotBlank(author.getName())) {
author
.setName(
author
.getName()
.replaceAll(NAME_CLEANING_REGEX, " ")
.replace("\"", "\\\""));
}
if (StringUtils.isNotBlank(author.getSurname())) {
author
.setSurname(
author
.getSurname()
.replaceAll(NAME_CLEANING_REGEX, " ")
.replace("\"", "\\\""));
}
return author;
}
private static Optional<String> cleanDateField(Field<String> dateofacceptance) {
return Optional
.ofNullable(dateofacceptance)

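A minimal sketch of the whitespace normalization that the publisher and author hunks above introduce via `NAME_CLEANING_REGEX` (the regex is taken from the patch; the sample value is hypothetical):
```java
public class NameCleaningSketch {
	private static final String NAME_CLEANING_REGEX = "[\\r\\n\\t\\s]+";

	public static void main(String[] args) {
		String raw = "Societa'\tEditrice \n Il Mulino";
		// any run of whitespace characters collapses into a single blank
		System.out.println(raw.replaceAll(NAME_CLEANING_REGEX, " "));
		// -> "Societa' Editrice Il Mulino"
	}
}
```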
View File

@ -6,13 +6,19 @@ import java.util.regex.Pattern;
public class GridCleaningRule {
public static final Pattern PATTERN = Pattern.compile("(?<grid>\\d{4,6}\\.[0-9a-z]{1,2})");
public static String clean(String grid) {
String s = grid
.replaceAll("\\s", "")
.toLowerCase();
Matcher m = Pattern.compile("\\d{4,6}\\.[0-9a-z]{1,2}").matcher(s);
return m.matches() ? "grid." + m.group() : "";
Matcher m = PATTERN.matcher(s);
if (m.find()) {
return "grid." + m.group("grid");
}
return "";
}
}
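This and the following identifier cleaning rules (ISNI, PIC, PMC, PMID) all apply the same refactor: the `Pattern` is compiled once into a static constant, and `Matcher.find()` replaces `Matcher.matches()` so the identifier is extracted even when it is embedded in a longer string. A minimal sketch of the behavioural difference, reusing the GRID pattern from the patch (inputs are hypothetical):
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindVsMatchesSketch {
	// compiled once, as in the patched rules
	private static final Pattern GRID = Pattern.compile("(?<grid>\\d{4,6}\\.[0-9a-z]{1,2})");

	public static void main(String[] args) {
		String s = "grid.493784.5";
		System.out.println(GRID.matcher(s).matches()); // false: the whole string must match
		Matcher m = GRID.matcher(s);
		if (m.find()) { // true: locates the embedded id
			System.out.println(m.group("grid")); // 493784.5
		}
	}
}
```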

View File

@ -7,10 +7,12 @@ import java.util.regex.Pattern;
// https://www.wikidata.org/wiki/Property:P213
public class ISNICleaningRule {
public static final Pattern PATTERN = Pattern.compile("([0]{4}) ?([0-9]{4}) ?([0-9]{4}) ?([0-9]{3}[0-9X])");
public static String clean(final String isni) {
Matcher m = Pattern.compile("([0]{4}) ?([0-9]{4}) ?([0-9]{4}) ?([0-9]{3}[0-9X])").matcher(isni);
if (m.matches()) {
Matcher m = PATTERN.matcher(isni);
if (m.find()) {
return String.join("", m.group(1), m.group(2), m.group(3), m.group(4));
} else {
return "";

View File

@ -6,10 +6,12 @@ import java.util.regex.Pattern;
public class PICCleaningRule {
public static final Pattern PATTERN = Pattern.compile("\\d{9}");
public static String clean(final String pic) {
Matcher m = Pattern.compile("\\d{9}").matcher(pic);
if (m.matches()) {
Matcher m = PATTERN.matcher(pic);
if (m.find()) {
return m.group();
} else {
return "";

View File

@ -1,13 +1,24 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class PmcCleaningRule {
public static final Pattern PATTERN = Pattern.compile("PMC\\d{1,8}");
public static String clean(String pmc) {
String s = pmc
.replaceAll("\\s", "")
.toUpperCase();
return s.matches("^PMC\\d{1,8}$") ? s : "";
final Matcher m = PATTERN.matcher(s);
if (m.find()) {
return m.group();
}
return "";
}
}

View File

@ -1,16 +1,25 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// https://researchguides.stevens.edu/c.php?g=442331&p=6577176
public class PmidCleaningRule {
public static final Pattern PATTERN = Pattern.compile("0*(\\d{1,8})");
public static String clean(String pmid) {
String s = pmid
.toLowerCase()
.replaceAll("\\s", "")
.trim()
.replaceAll("^0+", "");
return s.matches("^\\d{1,8}$") ? s : "";
.replaceAll("\\s", "");
final Matcher m = PATTERN.matcher(s);
if (m.find()) {
return m.group(1);
}
return "";
}
}

View File

@ -7,12 +7,21 @@ import java.util.regex.Pattern;
// https://ror.readme.io/docs/ror-identifier-pattern
public class RorCleaningRule {
public static final String ROR_PREFIX = "https://ror.org/";
private static final Pattern PATTERN = Pattern.compile("(?<ror>0[a-hj-km-np-tv-z|0-9]{6}[0-9]{2})");
public static String clean(String ror) {
String s = ror
.replaceAll("\\s", "")
.toLowerCase();
Matcher m = Pattern.compile("0[a-hj-km-np-tv-z|0-9]{6}[0-9]{2}").matcher(s);
return m.matches() ? "https://ror.org/" + m.group() : "";
Matcher m = PATTERN.matcher(s);
if (m.find()) {
return ROR_PREFIX + m.group("ror");
}
return "";
}
}

View File

@ -18,9 +18,9 @@
"paramRequired": true
},
{
"paramName": "c",
"paramLongName": "graphTableClassName",
"paramDescription": "the graph entity class name",
"paramName": "fi",
"paramLongName": "filterInvisible",
"paramDescription": "if true filters out invisible entities",
"paramRequired": true
}
]

View File

@ -50,10 +50,13 @@ object ScholixUtils extends Serializable {
}
}
def extractRelationDate(summary: ScholixResource): String = {
summary.getPublicationDate
def extractRelationDate(summary: ScholixSummary): String = {
if (summary.getDate == null || summary.getDate.isEmpty)
null
else {
summary.getDate.get(0)
}
}
def inverseRelationShip(rel: ScholixRelationship): ScholixRelationship = {
@ -141,7 +144,11 @@ object ScholixUtils extends Serializable {
s.setRelationship(inverseRelationShip(scholix.getRelationship))
s.setSource(scholix.getTarget)
s.setTarget(scholix.getSource)
updateId(s)
s.setIdentifier(
DHPUtils.md5(
s"${s.getSource.getIdentifier}::${s.getRelationship.getName}::${s.getTarget.getIdentifier}"
)
)
s
}
@ -180,21 +187,6 @@ object ScholixUtils extends Serializable {
} else List()
}
def updateId(scholix: Scholix): Scholix = {
scholix.setIdentifier(
generateIdentifier(
scholix.getSource.getDnetIdentifier,
scholix.getTarget.getDnetIdentifier,
scholix.getRelationship.getName
)
)
scholix
}
def generateIdentifier(sourceId: String, targetId: String, relation: String): String = {
DHPUtils.md5(s"$sourceId::$relation::$targetId")
}
def generateCompleteScholix(scholix: Scholix, target: ScholixSummary): Scholix = {
val s = new Scholix
s.setPublicationDate(scholix.getPublicationDate)
@ -203,7 +195,11 @@ object ScholixUtils extends Serializable {
s.setRelationship(scholix.getRelationship)
s.setSource(scholix.getSource)
s.setTarget(generateScholixResourceFromSummary(target))
updateId(s)
s.setIdentifier(
DHPUtils.md5(
s"${s.getSource.getIdentifier}::${s.getRelationship.getName}::${s.getTarget.getIdentifier}"
)
)
s
}
@ -215,7 +211,11 @@ object ScholixUtils extends Serializable {
s.setRelationship(scholix.getRelationship)
s.setSource(scholix.getSource)
s.setTarget(target)
updateId(s)
s.setIdentifier(
DHPUtils.md5(
s"${s.getSource.getIdentifier}::${s.getRelationship.getName}::${s.getTarget.getIdentifier}"
)
)
s
}
@ -232,7 +232,7 @@ object ScholixUtils extends Serializable {
if (summaryObject.getAuthor != null && !summaryObject.getAuthor.isEmpty) {
val l: List[ScholixEntityId] =
summaryObject.getAuthor.asScala.map(a => new ScholixEntityId(a, null)).take(100).toList
summaryObject.getAuthor.asScala.map(a => new ScholixEntityId(a, null)).toList
if (l.nonEmpty)
r.setCreator(l.asJava)
}
@ -241,7 +241,7 @@ object ScholixUtils extends Serializable {
r.setPublicationDate(summaryObject.getDate.get(0))
if (summaryObject.getPublisher != null && !summaryObject.getPublisher.isEmpty) {
val plist: List[ScholixEntityId] =
summaryObject.getPublisher.asScala.map(p => new ScholixEntityId(p, null)).take(100).toList
summaryObject.getPublisher.asScala.map(p => new ScholixEntityId(p, null)).toList
if (plist.nonEmpty)
r.setPublisher(plist.asJava)
@ -260,7 +260,6 @@ object ScholixUtils extends Serializable {
"complete"
)
)
.take(100)
.toList
if (l.nonEmpty)
@ -270,38 +269,38 @@ object ScholixUtils extends Serializable {
r
}
// def scholixFromSource(relation: Relation, source: ScholixResource): Scholix = {
// if (relation == null || source == null)
// return null
// val s = new Scholix
// var l: List[ScholixEntityId] = extractCollectedFrom(relation)
// if (l.isEmpty)
// l = extractCollectedFrom(source)
// if (l.isEmpty)
// return null
// s.setLinkprovider(l.asJava)
// var d = extractRelationDate(relation)
// if (d == null)
// d = source.getPublicationDate
//
// s.setPublicationDate(d)
//
// if (source.getPublisher != null && !source.getPublisher.isEmpty) {
// s.setPublisher(source.getPublisher)
// }
//
// val semanticRelation = relations.getOrElse(relation.getRelClass.toLowerCase, null)
// if (semanticRelation == null)
// return null
// s.setRelationship(
// new ScholixRelationship(semanticRelation.original, "datacite", semanticRelation.inverse)
// )
// s.setSource(source)
//
// s
// }
def scholixFromSource(relation: Relation, source: ScholixResource): Scholix = {
if (relation == null || source == null)
return null
val s = new Scholix
var l: List[ScholixEntityId] = extractCollectedFrom(relation)
if (l.isEmpty)
l = extractCollectedFrom(source)
if (l.isEmpty)
return null
s.setLinkprovider(l.asJava)
var d = extractRelationDate(relation)
if (d == null)
d = source.getPublicationDate
s.setPublicationDate(d)
if (source.getPublisher != null && !source.getPublisher.isEmpty) {
s.setPublisher(source.getPublisher)
}
val semanticRelation = relations.getOrElse(relation.getRelClass.toLowerCase, null)
if (semanticRelation == null)
return null
s.setRelationship(
new ScholixRelationship(semanticRelation.original, "datacite", semanticRelation.inverse)
)
s.setSource(source)
s
}
def scholixFromSource(relation: Relation, source: ScholixSummary): Scholix = {
if (relation == null || source == null)
return null
@ -323,8 +322,11 @@ object ScholixUtils extends Serializable {
s.setPublicationDate(d)
if (source.getPublisher != null && !source.getPublisher.isEmpty) {
source.getPublisher
val l: List[ScholixEntityId] = source.getPublisher.asScala.toList
val l: List[ScholixEntityId] = source.getPublisher.asScala
.map { p =>
new ScholixEntityId(p, null)
}(collection.breakOut)
if (l.nonEmpty)
s.setPublisher(l.asJava)
}
@ -335,7 +337,7 @@ object ScholixUtils extends Serializable {
s.setRelationship(
new ScholixRelationship(semanticRelation.original, "datacite", semanticRelation.inverse)
)
s.setSource(source)
s.setSource(generateScholixResourceFromSummary(source))
s
}

View File

@ -15,7 +15,7 @@ import com.fasterxml.jackson.databind.ObjectMapper;
public class MdStoreClientTest {
@Test
// @Test
public void testMongoCollection() throws IOException {
final MdstoreClient client = new MdstoreClient("mongodb://localhost:27017", "mdstore");

View File

@ -0,0 +1,18 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class GridCleaningRuleTest {
@Test
void testCleaning() {
assertEquals("grid.493784.5", GridCleaningRule.clean("grid.493784.5"));
assertEquals("grid.493784.5x", GridCleaningRule.clean("grid.493784.5x"));
assertEquals("grid.493784.5x", GridCleaningRule.clean("493784.5x"));
assertEquals("", GridCleaningRule.clean("493x784.5x"));
}
}

View File

@ -0,0 +1,19 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class ISNICleaningRuleTest {
@Test
void testCleaning() {
assertEquals("0000000463436020", ISNICleaningRule.clean("0000 0004 6343 6020"));
assertEquals("0000000463436020", ISNICleaningRule.clean("0000000463436020"));
assertEquals("", ISNICleaningRule.clean("Q30256598"));
assertEquals("0000000493403529", ISNICleaningRule.clean("ISNI:0000000493403529"));
assertEquals("000000008614884X", ISNICleaningRule.clean("0000 0000 8614 884X"));
}
}

View File

@ -0,0 +1,19 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class PICCleaningRuleTest {
@Test
void testCleaning() {
assertEquals("887624982", PICCleaningRule.clean("887624982"));
assertEquals("", PICCleaningRule.clean("887 624982"));
assertEquals("887624982", PICCleaningRule.clean(" 887624982 "));
assertEquals("887624982", PICCleaningRule.clean(" 887624982x "));
assertEquals("887624982", PICCleaningRule.clean(" 88762498200 "));
}
}

View File

@ -0,0 +1,19 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class PmcCleaningRuleTest {
@Test
void testCleaning() {
assertEquals("PMC1234", PmcCleaningRule.clean("PMC1234"));
assertEquals("PMC1234", PmcCleaningRule.clean(" PMC1234"));
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC12345678"));
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC123456789"));
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC 12345678"));
}
}

View File

@ -0,0 +1,24 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class PmidCleaningRuleTest {
@Test
void testCleaning() {
// leading zeros are removed
assertEquals("1234", PmidCleaningRule.clean("01234"));
// tolerant to spaces in the middle
assertEquals("1234567", PmidCleaningRule.clean("0123 4567"));
// stop parsing at first not numerical char
assertEquals("123", PmidCleaningRule.clean("0123x4567"));
// invalid id leading to empty result
assertEquals("", PmidCleaningRule.clean("abc"));
// valid id with zeroes in the number
assertEquals("20794075", PmidCleaningRule.clean("20794075"));
}
}

View File

@ -0,0 +1,17 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class RorCleaningRuleTest {
@Test
void testCleaning() {
assertEquals("https://ror.org/05rpz9w55", RorCleaningRule.clean("https://ror.org/05rpz9w55"));
assertEquals("https://ror.org/05rpz9w55", RorCleaningRule.clean("05rpz9w55"));
assertEquals("", RorCleaningRule.clean("05rpz9w_55"));
}
}

View File

@ -7,7 +7,7 @@
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp</artifactId>
<version>1.2.5-SNAPSHOT</version>
<relativePath>../pom.xml</relativePath>
<relativePath>../pom.xml</relativePath>
</parent>
<groupId>eu.dnetlib.dhp</groupId>

View File

@ -16,8 +16,9 @@ public class NGramUtils extends AbstractPaceFunctions {
.loadFromClasspath("/eu/dnetlib/pace/config/stopwords_en.txt");
public static String cleanupForOrdering(String s) {
String result = NGRAMUTILS.filterStopWords(NGRAMUTILS.normalize(s), stopwords);
return result.isEmpty() ? result : result.replace(" ", "");
return (NGRAMUTILS.filterStopWords(NGRAMUTILS.normalize(s), stopwords) + StringUtils.repeat(" ", SIZE))
.substring(0, SIZE)
.replaceAll(" ", "");
}
}

View File

@ -2,7 +2,6 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

View File

@ -13,7 +13,19 @@ import eu.dnetlib.pace.config.Config;
public class SortedNgramPairs extends NgramPairs {
public SortedNgramPairs(Map<String, Integer> params) {
super(params, true);
super(params, false);
}
@Override
protected Collection<String> doApply(Config conf, String s) {
final List<String> tokens = Lists.newArrayList(Splitter.on(" ").omitEmptyStrings().trimResults().split(s));
Collections.sort(tokens);
return ngramPairs(
Lists.newArrayList(getNgrams(Joiner.on(" ").join(tokens), param("ngramLen"), param("max") * 2, 1, 2)),
param("max"));
}
}

View File

@ -49,18 +49,18 @@ public abstract class AbstractPaceFunctions {
protected static Set<String> ngramBlacklist = loadFromClasspath("/eu/dnetlib/pace/config/ngram_blacklist.txt");
// html regex for normalization
public final Pattern HTML_REGEX = Pattern.compile("<[^>]*>");
public static final Pattern HTML_REGEX = Pattern.compile("<[^>]*>");
private static final String alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ";
private static final String aliases_from = "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎àáâäæãåāèéêëēėęəîïíīįìôöòóœøōõûüùúūßśšłžźżçćčñń";
private static final String aliases_to = "0123456789+-=()n0123456789+-=()aaaaaaaaeeeeeeeeiiiiiioooooooouuuuussslzzzcccnn";
// doi prefix for normalization
public final Pattern DOI_PREFIX = Pattern.compile("(https?:\\/\\/dx\\.doi\\.org\\/)|(doi:)");
public static final Pattern DOI_PREFIX = Pattern.compile("(https?:\\/\\/dx\\.doi\\.org\\/)|(doi:)");
private Pattern numberPattern = Pattern.compile("-?\\d+(\\.\\d+)?");
private static Pattern numberPattern = Pattern.compile("-?\\d+(\\.\\d+)?");
private Pattern hexUnicodePattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
private static Pattern hexUnicodePattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
protected String concat(final List<String> l) {
return Joiner.on(" ").skipNulls().join(l);
@ -130,10 +130,12 @@ public abstract class AbstractPaceFunctions {
protected static String fixAliases(final String s) {
final StringBuilder sb = new StringBuilder();
for (final char ch : Lists.charactersOf(s)) {
s.chars().forEach(ch -> {
final int i = StringUtils.indexOf(aliases_from, ch);
sb.append(i >= 0 ? aliases_to.charAt(i) : ch);
}
sb.append(i >= 0 ? aliases_to.charAt(i) : (char) ch);
});
return sb.toString();
}
@ -148,9 +150,10 @@ public abstract class AbstractPaceFunctions {
protected String removeSymbols(final String s) {
final StringBuilder sb = new StringBuilder();
for (final char ch : Lists.charactersOf(s)) {
sb.append(StringUtils.contains(alpha, ch) ? ch : " ");
}
s.chars().forEach(ch -> {
sb.append(StringUtils.contains(alpha, ch) ? (char) ch : ' ');
});
return sb.toString().replaceAll("\\s+", " ");
}
@ -234,7 +237,8 @@ public abstract class AbstractPaceFunctions {
final Set<String> h = Sets.newHashSet();
try {
for (final String s : IOUtils.readLines(NGramUtils.class.getResourceAsStream(classpath))) {
for (final String s : IOUtils
.readLines(NGramUtils.class.getResourceAsStream(classpath), StandardCharsets.UTF_8)) {
h.add(fixAliases(transliterator.transliterate(s))); // transliteration of the stopwords
}
} catch (final Throwable e) {
@ -249,7 +253,8 @@ public abstract class AbstractPaceFunctions {
final Map<String, String> m = new HashMap<>();
try {
for (final String s : IOUtils.readLines(AbstractPaceFunctions.class.getResourceAsStream(classpath))) {
for (final String s : IOUtils
.readLines(AbstractPaceFunctions.class.getResourceAsStream(classpath), StandardCharsets.UTF_8)) {
// string is like this: code;word1;word2;word3
String[] line = s.split(";");
String value = line[0];
@ -342,7 +347,7 @@ public abstract class AbstractPaceFunctions {
public static <T> String readFromClasspath(final String filename, final Class<T> clazz) {
final StringWriter sw = new StringWriter();
try {
IOUtils.copy(clazz.getResourceAsStream(filename), sw);
IOUtils.copy(clazz.getResourceAsStream(filename), sw, StandardCharsets.UTF_8);
return sw.toString();
} catch (final IOException e) {
throw new RuntimeException("cannot load resource from classpath: " + filename);

View File

@ -4,7 +4,6 @@ package eu.dnetlib.pace.config;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import eu.dnetlib.pace.model.ClusteringDef;
import eu.dnetlib.pace.model.FieldDef;

View File

@ -13,7 +13,8 @@ import eu.dnetlib.pace.clustering.NGramUtils;
public class RowDataOrderingComparator implements Comparator<Row> {
/** The comparator field. */
private int comparatorField;
private final int comparatorField;
private final int identityFieldPosition;
/**
* Instantiates a new map document comparator.
@ -21,8 +22,9 @@ public class RowDataOrderingComparator implements Comparator<Row> {
* @param comparatorField
* the comparator field
*/
public RowDataOrderingComparator(final int comparatorField) {
public RowDataOrderingComparator(final int comparatorField, int identityFieldPosition) {
this.comparatorField = comparatorField;
this.identityFieldPosition = identityFieldPosition;
}
/*
@ -51,7 +53,10 @@ public class RowDataOrderingComparator implements Comparator<Row> {
int res = to1.compareTo(to2);
if (res == 0) {
return o1.compareTo(o2);
res = o1.compareTo(o2);
if (res == 0) {
return d1.getString(identityFieldPosition).compareTo(d2.getString(identityFieldPosition));
}
}
return res;

View File

@ -1,644 +0,0 @@
package eu.dnetlib.pace.model
import com.jayway.jsonpath.{Configuration, JsonPath, Option}
import eu.dnetlib.pace.config.{DedupConfig, Type}
import eu.dnetlib.pace.tree.support.TreeProcessor
import eu.dnetlib.pace.util.MapDocumentUtil.truncateValue
import eu.dnetlib.pace.util.{BlockProcessor, MapDocumentUtil, SparkReporter}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
import org.apache.spark.sql.{Column, Dataset, Encoder, Encoders, Row, functions}
import org.apache.spark.sql.catalyst.expressions.{GenericRowWithSchema, Literal}
import org.apache.spark.sql.expressions.{Aggregator, MutableAggregationBuffer, UserDefinedAggregateFunction, UserDefinedFunction, Window}
import org.apache.spark.sql.types.{ArrayType, DataType, DataTypes, Metadata, StructField, StructType}
import java.util
import java.util.function.Predicate
import java.util.regex.Pattern
import scala.collection.JavaConverters._
import scala.collection.mutable
import org.apache.spark.sql.functions.{col, lit, udf}
import java.util.Collections
import java.util.stream.Collectors
case class SparkDedupConfig(conf: DedupConfig, numPartitions: Int) extends Serializable {
private val URL_REGEX: Pattern = Pattern.compile("^\\s*(http|https|ftp)\\://.*")
private val CONCAT_REGEX: Pattern = Pattern.compile("\\|\\|\\|")
private val urlFilter = (s: String) => URL_REGEX.matcher(s).matches
val modelExtractor: (Dataset[String] => Dataset[Row]) = df => {
df.withColumn("mapDocument", rowFromJsonUDF.apply(df.col(df.columns(0))))
.withColumn("identifier", new Column("mapDocument.identifier"))
//.repartition(new Column("identifier"))
.dropDuplicates("identifier")
.select("mapDocument.*")
df.map(r => rowFromJson(r))(RowEncoder(rowDataType))
.dropDuplicates("identifier")
}
val generateClusters: (Dataset[Row] => Dataset[Row]) = df => {
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
val df_with_keys = conf
.clusterings()
.asScala
.foldLeft(df_with_filters)((res, cd) => {
res.withColumn(
cd.getName + "_clustered",
functions.explode_outer(
clusterValuesUDF(cd).apply(
functions.array(
cd.getFields.asScala
.map(f => res.col(if (conf.blacklists.containsKey(f)) f.concat("_filtered") else f)): _*
)
)
)
)
})
// filter blacklisted values
// create one column per cluster prefix
// GROUPING sets approach
val tempTable = this.getClass.getSimpleName + "__generateClusters";
df_with_keys.createOrReplaceTempView(this.getClass.getSimpleName + "__generateClusters")
val keys = conf.clusterings().asScala.map(_.getName + "_clustered").mkString(",")
val fields = rowDataType.fieldNames.mkString(",")
// Using SQL because GROUPING SETS are not available through Scala/Java DSL
df_with_keys.sqlContext.sql(
("SELECT coalesce(" + keys + ") as key, sort_array(collect_sort_slice(" + fields + ")) as block FROM " + tempTable + " WHERE coalesce(" + keys + ") IS NOT NULL GROUP BY GROUPING SETS (" + keys + ") HAVING size(block) > 1")
)
}
val generateClustersWithDFAPI: (Dataset[Row] => Dataset[Row]) = df => {
System.out.println(conf.getWf.getEntityType + "::" +conf.getWf.getSubEntityType)
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
var relBlocks: Dataset[Row] = null
import scala.collection.JavaConversions._
for (cd <- conf.clusterings()) {
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
for (fName <- cd.getFields()) {
if (conf.blacklists.containsKey(fName))
columns.add(new Column(fName + "_filtered"))
else
columns.add(new Column(fName))
}
val tmp: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
/*.select((Seq(rowDataType.fieldNames: _*) ++ Seq("key")).map(col): _*)
.groupByKey(r => r.getAs[String]("key"))(Encoders.STRING)
.agg(collectSortSliceAggregator.toColumn)
.toDF("key", "block")
.select(col("block.block").as("block"))*/
System.out.println(cd.getName)
val ds = tmp.groupBy("key")
// .agg(functions.sort_array(collectSortSliceUDAF(rowDataType.fieldNames.map(col): _*)).as("block"))
.agg(functions.collect_set(functions.struct(rowDataType.fieldNames.map(col): _*)).as("block"))
//.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
//df_with_filters.printSchema()
//ds.printSchema()
if (relBlocks == null) relBlocks = ds
else relBlocks = relBlocks.union(ds)
}
// System.out.println()
relBlocks
}
val generateClustersWithWindows: (Dataset[Row] => Dataset[Row]) = df => {
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
var relBlocks: Dataset[Row] = null
import scala.collection.JavaConversions._
for (cd <- conf.clusterings()) {
System.out.println(conf.getWf.getEntityType + "::" + conf.getWf.getSubEntityType+ ": " + cd.getName + " " + cd.toString)
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
for (fName <- cd.getFields()) {
if (conf.blacklists.containsKey(fName))
columns.add(new Column(fName + "_filtered"))
else
columns.add(new Column(fName))
}
// Add 'key' column with the value generated by the given clustering definition
val ds: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
// Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
.withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(col(conf.getWf.getOrderField))))
// filter out rows with position exceeding the maxqueuesize parameter
.filter(col("position").leq(conf.getWf.getQueueMaxSize))
.groupBy("key")
.agg(functions.collect_set(functions.struct(rowDataType.fieldNames.map(col): _*)).as("block"))
.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
if (relBlocks == null) relBlocks = ds
else relBlocks = relBlocks.union(ds)
}
relBlocks
}
val generateClustersWithDFAPIMerged: (Dataset[Row] => Dataset[Row]) = df => {
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
import scala.collection.JavaConversions._
val keys = conf.clusterings().foldLeft(null : Column)((res, cd) => {
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
for (fName <- cd.getFields()) {
if (conf.blacklists.containsKey(fName))
columns.add(new Column(fName + "_filtered"))
else
columns.add(new Column(fName))
}
if (res != null)
functions.array_union(res, clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*)))
else
clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))
})
val ds: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(keys))
.select((Seq(rowDataType.fieldNames: _*) ++ Seq("key")).map(col): _*)
.groupByKey(r => r.getAs[String]("key"))(Encoders.STRING)
.agg(collectSortSliceAggregator.toColumn)
.toDF("key", "block")
.select(col("block.block").as("block"))
/*.groupBy("key")
.agg(collectSortSliceUDAF(rowDataType.fieldNames.map(col): _*).as("block"))*/
.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
ds
}
val generateClustersWithRDDReduction: (Dataset[Row] => Dataset[Row]) = df => {
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
var relBlocks: Dataset[Row] = null
import scala.collection.JavaConversions._
for (cd <- conf.clusterings()) {
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
for (fName <- cd.getFields()) {
if (conf.blacklists.containsKey(fName))
columns.add(new Column(fName + "_filtered"))
else
columns.add(new Column(fName))
}
val ds: Dataset[Row] = df.sparkSession.createDataFrame(df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
.select(col("key"), functions.array(functions.struct(rowDataType.fieldNames.map(col): _*).as("value")))
.rdd.keyBy(_.getString(0))
.reduceByKey((a, b) => {
val b1 = a.getSeq[Row](1)
val b2 = b.getSeq[Row](1)
if (b1.size + b2.size > conf.getWf.getQueueMaxSize)
Row(a.get(0), b1.union(b2).sortBy(_.getString(orderingFieldPosition)).slice(0, conf.getWf.getQueueMaxSize))
else
Row(a.get(0), b1.union(b2))
})
.map(_._2)
.filter(k => k.getSeq(1).size > 1),
new StructType().add(StructField("key", DataTypes.StringType)).add(StructField("block", ArrayType(rowDataType)))
)
if (relBlocks == null) relBlocks = ds
else relBlocks = relBlocks.union(ds)
}
relBlocks
}
val printAnalytics: (Dataset[Row] => Dataset[Row]) = df => {
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
var relBlocks: Dataset[Row] = null
import scala.collection.JavaConversions._
for (cd <- conf.clusterings()) {
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
for (fName <- cd.getFields()) {
if (conf.blacklists.containsKey(fName))
columns.add(new Column(fName + "_filtered"))
else
columns.add(new Column(fName))
}
// Add 'key' column with the value generated by the given clustering definition
val ds: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
// Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
.withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(conf.getWf.getOrderField)))
// filter out rows with position exceeding the maxqueuesize parameter
.filter(col("position").lt(conf.getWf.getQueueMaxSize))
// inner join to compute all combination of rows to compare
// note the condition on position to obtain 'windowing': given a row this is compared at most with the next
// SlidingWindowSize rows following the sort order
val dsWithMatch = ds.as("l").join(ds.as("r"),
col("l.key").equalTo(col("r.key")),
"inner"
)
.filter((col("l.position").lt(col("r.position")))
&& (col("r.position").lt(col("l.position").plus(lit(conf.getWf.getSlidingWindowSize)))))
// Add match column with the result of comparison
// dsWithMatch.show(false)
if (relBlocks == null)
relBlocks = dsWithMatch
else
relBlocks = relBlocks.union(dsWithMatch)
}
System.out.println(conf.getWf.getEntityType + "::" + conf.getWf.getSubEntityType)
System.out.println("Total number of comparations: " + relBlocks.count())
df
}
val generateAndProcessClustersWithJoins: (Dataset[Row] => Dataset[Row]) = df => {
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
var relBlocks: Dataset[Row] = null
import scala.collection.JavaConversions._
for (cd <- conf.clusterings()) {
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
for (fName <- cd.getFields()) {
if (conf.blacklists.containsKey(fName))
columns.add(new Column(fName + "_filtered"))
else
columns.add(new Column(fName))
}
// Add 'key' column with the value generated by the given clustering definition
val ds: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
// Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
.withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(conf.getWf.getOrderField)))
// filter out rows with position exceeding the maxqueuesize parameter
.filter(col("position").lt(conf.getWf.getQueueMaxSize))
// inner join to compute all combination of rows to compare
// note the condition on position to obtain 'windowing': given a row this is compared at most with the next
// SlidingWindowSize rows following the sort order
val dsWithMatch = ds.as("l").join(ds.as("r"),
col("l.key").equalTo(col("r.key")),
"inner"
)
.filter((col("l.position").lt(col("r.position")))
&& (col("r.position").lt(col("l.position").plus(lit(conf.getWf.getSlidingWindowSize)))))
// Add match column with the result of comparison
.withColumn("match", udf[Boolean, Row, Row]((a, b) => {
val treeProcessor = new TreeProcessor(conf)
treeProcessor.compare(a, b)
}).apply(functions.struct(rowDataType.fieldNames.map(s => col("l.".concat(s))): _*), functions.struct(rowDataType.fieldNames.map(s => col("r.".concat(s))): _*)))
.filter(col("match").equalTo(true))
.select(col("l.identifier").as("from"), col("r.identifier").as("to"))
// dsWithMatch.show(false)
if (relBlocks == null)
relBlocks = dsWithMatch
else
relBlocks = relBlocks.union(dsWithMatch)
}
val res = relBlocks
//.select(col("l.identifier").as("from"), col("r.identifier").as("to"))
//.repartition()
.distinct()
// res.show(false)
res.select(functions.struct("from", "to"))
}
val processClusters: (Dataset[Row] => Dataset[Row]) = df => {
val entity = conf.getWf.getEntityType
df.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
.withColumn("relations", processBlock(df.sqlContext.sparkContext).apply(new Column("block")))
.select(functions.explode(new Column("relations")).as("relation"))
//.repartition(new Column("relation"))
.dropDuplicates("relation")
}
val rowDataType: StructType = {
// val unordered = conf.getPace.getModel.asScala.foldLeft(
// new StructType()
// )((resType, fdef) => {
// resType.add(fdef.getType match {
// case Type.List | Type.JSON =>
// StructField(fdef.getName, DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty)
// case Type.DoubleArray =>
// StructField(fdef.getName, DataTypes.createArrayType(DataTypes.DoubleType), true, Metadata.empty)
// case _ =>
// StructField(fdef.getName, DataTypes.StringType, true, Metadata.empty)
// })
// })
//
// conf.getPace.getModel.asScala.filterNot(_.getName.equals(conf.getWf.getOrderField)).foldLeft(
// new StructType()
// .add(unordered(conf.getWf.getOrderField))
// .add(StructField("identifier", DataTypes.StringType, false, Metadata.empty))
// )((resType, fdef) => resType.add(unordered(fdef.getName)))
val identifier = new FieldDef()
identifier.setName("identifier")
identifier.setType(Type.String)
(conf.getPace.getModel.asScala ++ Seq(identifier)).sortBy(_.getName)
.foldLeft(
new StructType()
)((resType, fdef) => {
resType.add(fdef.getType match {
case Type.List | Type.JSON =>
StructField(fdef.getName, DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty)
case Type.DoubleArray =>
StructField(fdef.getName, DataTypes.createArrayType(DataTypes.DoubleType), true, Metadata.empty)
case _ =>
StructField(fdef.getName, DataTypes.StringType, true, Metadata.empty)
})
})
}
val identityFieldPosition: Int = rowDataType.fieldIndex("identifier")
val orderingFieldPosition: Int = rowDataType.fieldIndex(conf.getWf.getOrderField)
def rowFromJson(json: String) : Row = {
val documentContext =
JsonPath.using(Configuration.defaultConfiguration.addOptions(Option.SUPPRESS_EXCEPTIONS)).parse(json)
val values = new Array[Any](rowDataType.size)
values(identityFieldPosition) = MapDocumentUtil.getJPathString(conf.getWf.getIdPath, documentContext)
rowDataType.fieldNames.zipWithIndex.foldLeft(values) {
case ((res, (fname, index))) => {
val fdef = conf.getPace.getModelMap.get(fname)
if (fdef != null) {
res(index) = fdef.getType match {
case Type.String | Type.Int =>
MapDocumentUtil.truncateValue(
MapDocumentUtil.getJPathString(fdef.getPath, documentContext),
fdef.getLength
)
case Type.URL =>
var uv = MapDocumentUtil.getJPathString(fdef.getPath, documentContext)
if (!urlFilter(uv)) uv = ""
uv
case Type.List | Type.JSON =>
MapDocumentUtil.truncateList(
MapDocumentUtil.getJPathList(fdef.getPath, documentContext, fdef.getType),
fdef.getSize
).toArray
case Type.StringConcat =>
val jpaths = CONCAT_REGEX.split(fdef.getPath)
truncateValue(
jpaths
.map(jpath => MapDocumentUtil.getJPathString(jpath, documentContext))
.mkString(" "),
fdef.getLength
)
case Type.DoubleArray =>
MapDocumentUtil.getJPathArray(fdef.getPath, json)
}
}
res
}
}
new GenericRowWithSchema(values, rowDataType)
}
val rowFromJsonUDF = udf(rowFromJson(_), rowDataType)
def filterColumnUDF(fdef: FieldDef): UserDefinedFunction = {
val blacklist: Predicate[String] = conf.blacklists().get(fdef.getName)
if (blacklist == null) {
throw new IllegalArgumentException("Column: " + fdef.getName + " does not have any filter")
} else {
fdef.getType match {
case Type.List | Type.JSON =>
udf[Array[String], Array[String]](values => {
values.filter((v: String) => !blacklist.test(v))
})
case _ =>
udf[String, String](v => {
if (blacklist.test(v)) ""
else v
})
}
}
}
def clusterValuesUDF(cd: ClusteringDef) = {
udf[mutable.WrappedArray[String], mutable.WrappedArray[Object]](values => {
values.flatMap(f => cd.clusteringFunction().apply(conf, Seq(f.toString).asJava).asScala).map(cd.getName.concat(_))
})
}
def processBlock(implicit sc: SparkContext) = {
val accumulators = SparkReporter.constructAccumulator(conf, sc)
udf[Array[Tuple2[String, String]], mutable.WrappedArray[Row]](block => {
val reporter = new SparkReporter(accumulators)
val mapDocuments = block.asJava.stream
.sorted(new RowDataOrderingComparator(orderingFieldPosition))
.limit(conf.getWf.getQueueMaxSize)
.collect(Collectors.toList[Row]())
new BlockProcessor(conf, identityFieldPosition, orderingFieldPosition).processSortedRows(mapDocuments, reporter)
reporter.getRelations.asScala.toArray
}).asNondeterministic()
}
val collectSortSliceAggregator : Aggregator[Row,Seq[Row], Row] = new Aggregator[Row, Seq[Row], Row] () {
override def zero: Seq[Row] = Seq[Row]()
override def reduce(buffer: Seq[Row], input: Row): Seq[Row] = {
merge(buffer, Seq(input))
}
override def merge(buffer: Seq[Row], toMerge: Seq[Row]): Seq[Row] = {
val newBlock = buffer ++ toMerge
if (newBlock.size > conf.getWf.getQueueMaxSize)
newBlock.sortBy(_.getString(orderingFieldPosition)).slice(0, conf.getWf.getQueueMaxSize)
else
newBlock
}
override def finish(reduction: Seq[Row]): Row = {
Row(reduction.toArray)
}
override def bufferEncoder: Encoder[Seq[Row]] = Encoders.kryo[Seq[Row]]
override def outputEncoder: Encoder[Row] = RowEncoder.apply(new StructType().add("block", DataTypes.createArrayType(rowDataType), nullable = true))
}
val collectSortSliceUDAF : UserDefinedAggregateFunction = new UserDefinedAggregateFunction {
override def inputSchema: StructType = rowDataType
override def bufferSchema: StructType = {
new StructType().add("block", DataTypes.createArrayType(rowDataType), nullable = true)
}
override def dataType: DataType = DataTypes.createArrayType(rowDataType)
override def deterministic: Boolean = true
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = Seq[Row]()
}
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val newBlock = buffer.getSeq[Row](0) ++ Seq(input)
if (newBlock.size > conf.getWf.getQueueMaxSize)
buffer(0) = newBlock.sortBy(_.getString(orderingFieldPosition)).slice(0, conf.getWf.getQueueMaxSize)
else
buffer(0) = newBlock
}
override def merge(buffer: MutableAggregationBuffer, row: Row): Unit = {
val newBlock = buffer.getSeq[Row](0) ++ row.getSeq[Row](0)
if (newBlock.size > conf.getWf.getQueueMaxSize)
buffer(0) = newBlock.sortBy(_.getString(orderingFieldPosition)).slice(0, conf.getWf.getQueueMaxSize)
else
buffer(0) = newBlock
}
override def evaluate(buffer: Row): Any = {
buffer.getSeq[Row](0)
}
}
}

View File

@ -0,0 +1,131 @@
package eu.dnetlib.pace.model
import eu.dnetlib.pace.config.{DedupConfig, Type}
import eu.dnetlib.pace.util.{BlockProcessor, SparkReporter}
import org.apache.spark.SparkContext
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions.{col, lit, udf}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, Dataset, Row, functions}
import java.util.function.Predicate
import java.util.stream.Collectors
import scala.collection.JavaConversions._
import scala.collection.JavaConverters._
import scala.collection.mutable
case class SparkDeduper(conf: DedupConfig) extends Serializable {
val model: SparkModel = SparkModel(conf)
val dedup: (Dataset[Row] => Dataset[Row]) = df => {
df.transform(filterAndCleanup)
.transform(generateClustersWithCollect)
.transform(processBlocks)
}
val filterAndCleanup: (Dataset[Row] => Dataset[Row]) = df => {
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
df_with_filters
}
def filterColumnUDF(fdef: FieldDef): UserDefinedFunction = {
val blacklist: Predicate[String] = conf.blacklists().get(fdef.getName)
if (blacklist == null) {
throw new IllegalArgumentException("Column: " + fdef.getName + " does not have any filter")
} else {
fdef.getType match {
case Type.List | Type.JSON =>
udf[Array[String], Array[String]](values => {
values.filter((v: String) => !blacklist.test(v))
})
case _ =>
udf[String, String](v => {
if (blacklist.test(v)) ""
else v
})
}
}
}
val generateClustersWithCollect: (Dataset[Row] => Dataset[Row]) = df_with_filters => {
var df_with_clustering_keys: Dataset[Row] = null
for ((cd, idx) <- conf.clusterings().zipWithIndex) {
val inputColumns = cd.getFields().foldLeft(Seq[Column]())((acc, fName) => {
val column = if (conf.blacklists.containsKey(fName))
Seq(col(fName + "_filtered"))
else
Seq(col(fName))
acc ++ column
})
// Add 'key' column with the value generated by the given clustering definition
val ds: Dataset[Row] = df_with_filters
.withColumn("clustering", lit(cd.getName + "::" + idx))
.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(inputColumns: _*))))
// Add a 'position' column holding the row's rank within the set of rows sharing the same key, ordered by the sorting value
.withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(col(model.orderingFieldName), col(model.identifierFieldName))))
if (df_with_clustering_keys == null)
df_with_clustering_keys = ds
else
df_with_clustering_keys = df_with_clustering_keys.union(ds)
}
//TODO: analytics
val df_with_blocks = df_with_clustering_keys
// filter out rows with position exceeding the maxqueuesize parameter
.filter(col("position").leq(conf.getWf.getQueueMaxSize))
.groupBy("clustering", "key")
.agg(functions.collect_set(functions.struct(model.schema.fieldNames.map(col): _*)).as("block"))
.filter(functions.size(new Column("block")).gt(1))
df_with_blocks
}
def clusterValuesUDF(cd: ClusteringDef) = {
udf[mutable.WrappedArray[String], mutable.WrappedArray[Any]](values => {
values.flatMap(f => cd.clusteringFunction().apply(conf, Seq(f.toString).asJava).asScala)
})
}
val processBlocks: (Dataset[Row] => Dataset[Row]) = df => {
df.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
.withColumn("relations", processBlock(df.sqlContext.sparkContext).apply(new Column("block")))
.select(functions.explode(new Column("relations")).as("relation"))
}
def processBlock(implicit sc: SparkContext) = {
val accumulators = SparkReporter.constructAccumulator(conf, sc)
udf[Array[(String, String)], mutable.WrappedArray[Row]](block => {
val reporter = new SparkReporter(accumulators)
val mapDocuments = block.asJava.stream()
.sorted(new RowDataOrderingComparator(model.orderingFieldPosition, model.identityFieldPosition))
.limit(conf.getWf.getQueueMaxSize)
.collect(Collectors.toList[Row]())
new BlockProcessor(conf, model.identityFieldPosition, model.orderingFieldPosition).processSortedRows(mapDocuments, reporter)
reporter.getRelations.asScala.toArray
}).asNondeterministic()
}
}

View File

@ -0,0 +1,108 @@
package eu.dnetlib.pace.model
import com.jayway.jsonpath.{Configuration, JsonPath}
import eu.dnetlib.pace.config.{DedupConfig, Type}
import eu.dnetlib.pace.util.MapDocumentUtil
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
import org.apache.spark.sql.{Dataset, Row}
import java.util.regex.Pattern
import scala.collection.JavaConverters._
case class SparkModel(conf: DedupConfig) {
private val URL_REGEX: Pattern = Pattern.compile("^\\s*(http|https|ftp)\\://.*")
private val CONCAT_REGEX: Pattern = Pattern.compile("\\|\\|\\|")
val identifierFieldName = "identifier"
val orderingFieldName = if (!conf.getWf.getOrderField.isEmpty) conf.getWf.getOrderField else identifierFieldName
val schema: StructType = {
// create an implicit identifier field
val identifier = new FieldDef()
identifier.setName(identifierFieldName)
identifier.setType(Type.String)
// Construct a Spark StructType representing the schema of the model
(Seq(identifier) ++ conf.getPace.getModel.asScala)
.foldLeft(
new StructType()
)((resType, fieldDef) => {
resType.add(fieldDef.getType match {
case Type.List | Type.JSON =>
StructField(fieldDef.getName, DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty)
case Type.DoubleArray =>
StructField(fieldDef.getName, DataTypes.createArrayType(DataTypes.DoubleType), true, Metadata.empty)
case _ =>
StructField(fieldDef.getName, DataTypes.StringType, true, Metadata.empty)
})
})
}
val identityFieldPosition: Int = schema.fieldIndex(identifierFieldName)
val orderingFieldPosition: Int = schema.fieldIndex(orderingFieldName)
val parseJsonDataset: (Dataset[String] => Dataset[Row]) = df => {
df.map(r => rowFromJson(r))(RowEncoder(schema))
}
def rowFromJson(json: String): Row = {
val documentContext =
JsonPath.using(Configuration.defaultConfiguration.addOptions(com.jayway.jsonpath.Option.SUPPRESS_EXCEPTIONS)).parse(json)
val values = new Array[Any](schema.size)
values(identityFieldPosition) = MapDocumentUtil.getJPathString(conf.getWf.getIdPath, documentContext)
schema.fieldNames.zipWithIndex.foldLeft(values) {
case ((res, (fname, index))) => {
val fdef = conf.getPace.getModelMap.get(fname)
if (fdef != null) {
res(index) = fdef.getType match {
case Type.String | Type.Int =>
MapDocumentUtil.truncateValue(
MapDocumentUtil.getJPathString(fdef.getPath, documentContext),
fdef.getLength
)
case Type.URL =>
var uv = MapDocumentUtil.getJPathString(fdef.getPath, documentContext)
if (!URL_REGEX.matcher(uv).matches)
uv = ""
uv
case Type.List | Type.JSON =>
MapDocumentUtil.truncateList(
MapDocumentUtil.getJPathList(fdef.getPath, documentContext, fdef.getType),
fdef.getSize
).asScala
case Type.StringConcat =>
val jpaths = CONCAT_REGEX.split(fdef.getPath)
MapDocumentUtil.truncateValue(
jpaths
.map(jpath => MapDocumentUtil.getJPathString(jpath, documentContext))
.mkString(" "),
fdef.getLength
)
case Type.DoubleArray =>
MapDocumentUtil.getJPathArray(fdef.getPath, json)
}
}
res
}
}
new GenericRowWithSchema(values, schema)
}
}

View File

@ -1,11 +1,8 @@
package eu.dnetlib.pace.tree;
import java.util.List;
import java.util.Map;
import com.google.common.base.Joiner;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;

View File

@ -5,7 +5,6 @@ import java.util.Map;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.tree.support.AbstractComparator;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;

View File

@ -4,7 +4,6 @@ package eu.dnetlib.pace.tree;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractComparator;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;

View File

@ -4,7 +4,6 @@ package eu.dnetlib.pace.tree;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractComparator;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;

View File

@ -4,8 +4,6 @@ package eu.dnetlib.pace.tree;
import java.util.List;
import java.util.Map;
import com.google.common.collect.Lists;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractListComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;

View File

@ -4,7 +4,6 @@ package eu.dnetlib.pace.tree;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractComparator;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ -44,22 +43,25 @@ public class StringContainsMatch extends AbstractStringComparator {
STRING = STRING.toLowerCase();
}
switch (AGGREGATOR) {
case "AND":
if (ca.contains(STRING) && cb.contains(STRING))
return 1.0;
break;
case "OR":
if (ca.contains(STRING) || cb.contains(STRING))
return 1.0;
break;
case "XOR":
if (ca.contains(STRING) ^ cb.contains(STRING))
return 1.0;
break;
default:
return 0.0;
if (AGGREGATOR != null) {
switch (AGGREGATOR) {
case "AND":
if (ca.contains(STRING) && cb.contains(STRING))
return 1.0;
break;
case "OR":
if (ca.contains(STRING) || cb.contains(STRING))
return 1.0;
break;
case "XOR":
if (ca.contains(STRING) ^ cb.contains(STRING))
return 1.0;
break;
default:
return 0.0;
}
}
return 0.0;
}
}

View File

@ -1,7 +1,6 @@
package eu.dnetlib.pace.tree.support;
import java.util.Collections;
import java.util.List;
import java.util.Map;

View File

@ -4,11 +4,9 @@ package eu.dnetlib.pace.tree.support;
import java.util.List;
import java.util.Map;
import com.google.common.collect.Lists;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.config.Type;
abstract public class AbstractListComparator extends AbstractComparator<List<String>> {
protected AbstractListComparator(Map<String, String> params) {

View File

@ -1,7 +1,6 @@
package eu.dnetlib.pace.tree.support;
import java.util.AbstractList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

View File

@ -2,8 +2,6 @@
package eu.dnetlib.pace.util;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import org.apache.commons.lang3.StringUtils;
@ -13,7 +11,6 @@ import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StringType;
import org.apache.spark.sql.types.StructType;
import eu.dnetlib.pace.config.DedupConfig;
import eu.dnetlib.pace.config.WfConfig;

View File

@ -18,6 +18,7 @@ package eu.dnetlib.pace.util;
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Diff Match and Patch
* Copyright 2018 The diff-match-patch Authors.

View File

@ -2,20 +2,20 @@
package eu.dnetlib.pace.util;
import java.math.BigDecimal;
import java.util.*;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.jayway.jsonpath.Configuration;
import com.jayway.jsonpath.DocumentContext;
import com.jayway.jsonpath.JsonPath;
import com.jayway.jsonpath.Option;
import com.jayway.jsonpath.spi.cache.Cache;
import com.jayway.jsonpath.spi.cache.CacheProvider;
import eu.dnetlib.pace.config.DedupConfig;
import eu.dnetlib.pace.config.Type;
import eu.dnetlib.pace.model.*;
import net.minidev.json.JSONArray;
public class MapDocumentUtil {
@ -23,47 +23,20 @@ public class MapDocumentUtil {
public static final String URL_REGEX = "^(http|https|ftp)\\://.*";
public static Predicate<String> urlFilter = s -> s.trim().matches(URL_REGEX);
public static List<String> getJPathList(String path, String json, Type type) {
if (type == Type.List)
return JsonPath
.using(
Configuration
.defaultConfiguration()
.addOptions(Option.ALWAYS_RETURN_LIST, Option.SUPPRESS_EXCEPTIONS))
.parse(json)
.read(path);
Object jresult;
List<String> result = new ArrayList<>();
try {
jresult = JsonPath.read(json, path);
} catch (Throwable e) {
return result;
}
if (jresult instanceof JSONArray) {
((JSONArray) jresult).forEach(it -> {
try {
result.add(new ObjectMapper().writeValueAsString(it));
} catch (JsonProcessingException e) {
}
});
return result;
}
if (jresult instanceof LinkedHashMap) {
try {
result.add(new ObjectMapper().writeValueAsString(jresult));
} catch (JsonProcessingException e) {
static {
CacheProvider.setCache(new Cache() {
private final ConcurrentHashMap<String, JsonPath> jsonPathCache = new ConcurrentHashMap<>();
@Override
public JsonPath get(String key) {
return jsonPathCache.get(key);
}
return result;
}
if (jresult instanceof String) {
result.add((String) jresult);
}
return result;
@Override
public void put(String key, JsonPath value) {
jsonPathCache.put(key, value);
}
});
}
public static String getJPathString(final String jsonPath, final String json) {
@ -144,6 +117,11 @@ public class MapDocumentUtil {
return result;
}
if (type == Type.List && jresult instanceof List) {
((List<?>) jresult).forEach(x -> result.add(x.toString()));
return result;
}
if (jresult instanceof JSONArray) {
((JSONArray) jresult).forEach(it -> {
try {

View File

@ -10,7 +10,6 @@ import org.apache.spark.SparkContext;
import org.apache.spark.util.LongAccumulator;
import eu.dnetlib.pace.config.DedupConfig;
import eu.dnetlib.pace.util.Reporter;
import scala.Serializable;
import scala.Tuple2;

View File

@ -2,14 +2,12 @@
package eu.dnetlib.pace.clustering;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import org.junit.jupiter.api.*;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;
import com.google.common.collect.Lists;
import com.google.common.collect.Maps;
import com.google.common.collect.Sets;
import eu.dnetlib.pace.AbstractPaceTest;
import eu.dnetlib.pace.common.AbstractPaceFunctions;

View File

@ -6,6 +6,7 @@ import static org.junit.jupiter.api.Assertions.assertEquals;
import java.util.*;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.TestInstance;
@ -22,14 +23,18 @@ public class ComparatorTest extends AbstractPaceTest {
@BeforeAll
public void setup() {
conf = DedupConfig
.load(readFromClasspath("/eu/dnetlib/pace/config/organization.current.conf.json", ComparatorTest.class));
}
@BeforeEach
public void beforeEachTest() {
params = new HashMap<>();
params.put("weight", "1.0");
params.put("surname_th", "0.99");
params.put("name_th", "0.95");
params.put("jpath_value", "$.value");
params.put("jpath_classid", "$.qualifier.classid");
conf = DedupConfig
.load(readFromClasspath("/eu/dnetlib/pace/config/organization.current.conf.json", ComparatorTest.class));
}
@Test
@ -63,7 +68,10 @@ public class ComparatorTest extends AbstractPaceTest {
.distance(
"Politechniki Warszawskiej (Warsaw University of Technology)", "Warsaw University of Technology",
conf));
assertEquals(-1.0, cityMatch.distance("Allen (United States)", "United States Military Academy", conf));
// failing because 'Allen' is a transliterated Greek stopword
// assertEquals(-1.0, cityMatch.distance("Allen (United States)", "United States Military Academy", conf));
assertEquals(-1.0, cityMatch.distance("Washington (United States)", "United States Military Academy", conf));
}
@Test
@ -78,7 +86,7 @@ public class ComparatorTest extends AbstractPaceTest {
assertEquals(1.0, keywordMatch.distance("Polytechnic University of Turin", "POLITECNICO DI TORINO", conf));
assertEquals(1.0, keywordMatch.distance("Istanbul Commerce University", "İstanbul Ticarət Universiteti", conf));
assertEquals(1.0, keywordMatch.distance("Franklin College", "Concordia College", conf));
assertEquals(0.5, keywordMatch.distance("University of Georgia", "Georgia State University", conf));
assertEquals(2.0 / 3.0, keywordMatch.distance("University of Georgia", "Georgia State University", conf));
assertEquals(0.5, keywordMatch.distance("University College London", "University of London", conf));
assertEquals(0.5, keywordMatch.distance("Washington State University", "University of Washington", conf));
assertEquals(-1.0, keywordMatch.distance("Allen (United States)", "United States Military Academy", conf));
@ -112,7 +120,7 @@ public class ComparatorTest extends AbstractPaceTest {
public void stringContainsMatchTest() {
params.put("string", "openorgs");
params.put("bool", "XOR");
params.put("aggregator", "XOR");
params.put("caseSensitive", "false");
StringContainsMatch stringContainsMatch = new StringContainsMatch(params);
@ -120,7 +128,7 @@ public class ComparatorTest extends AbstractPaceTest {
assertEquals(0.0, stringContainsMatch.distance("openorgs", "openorgs", conf));
params.put("string", "openorgs");
params.put("bool", "AND");
params.put("aggregator", "AND");
params.put("caseSensitive", "false");
stringContainsMatch = new StringContainsMatch(params);

View File

@ -6,7 +6,8 @@ import static org.junit.jupiter.api.Assertions.assertEquals;
import java.util.HashMap;
import java.util.Map;
import org.junit.jupiter.api.*;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;
import eu.dnetlib.pace.model.Person;
import jdk.nashorn.internal.ir.annotations.Ignore;

View File

@ -0,0 +1,72 @@
# Action Management Framework
This module implements the Oozie workflow for the integration of pre-built contents into the OpenAIRE Graph.
Such contents can be
* brand new, non-existing records to be introduced as nodes of the graph
* updates (or enrichments) for records that already exist in the graph (e.g. a new subject term for a publication)
* relations among existing nodes
The actionset contents are organised into logical containers; each of them can contain multiple versions of the contents and is characterised by
* a name
* an identifier
* the paths on HDFS where each version of the contents is stored
Each version is then characterised by
* the creation date
* the last update date
* the indication whether it is the latest version or an expired one, candidate for garbage collection
## ActionSet serialization
Each actionset version contains records compliant with the graph internal data model, i.e. subclasses of `eu.dnetlib.dhp.schema.oaf.Oaf`,
defined in the external schemas module
```
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>${dhp-schemas.artifact}</artifactId>
<version>${dhp-schemas.version}</version>
</dependency>
```
When the actionset contains relationships, the model class to use is `eu.dnetlib.dhp.schema.oaf.Relation`; when it
contains entities, the class is `eu.dnetlib.dhp.schema.oaf.OafEntity` or one of its subclasses
`Datasource`, `Organization`, `Project`, `Result` (or one of its subclasses `Publication`, `Dataset`, etc.).
Then, each OpenAIRE Graph model class instance must be wrapped using the class `eu.dnetlib.dhp.schema.action.AtomicAction`, a generic
container that defines two attributes
* `T payload`: the OpenAIRE Graph class instance containing the data;
* `Class<T> clazz`: the class of the instance contained in the payload.
Each AtomicAction can then be serialised in JSON format using `com.fasterxml.jackson.databind.ObjectMapper` from
```
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>${dhp.jackson.version}</version>
</dependency>
```
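For illustration, here is a minimal sketch of the wrapping step; the `Publication` instance and its identifier are hypothetical placeholders:
```
// sketch: wrap a (placeholder) Publication into an AtomicAction and serialise it
ObjectMapper mapper = new ObjectMapper();

Publication publication = new Publication();
publication.setId("50|doi_________::0123456789abcdef0123456789abcdef"); // placeholder id

AtomicAction<Publication> aa = new AtomicAction<>(Publication.class, publication);
String json = mapper.writeValueAsString(aa); // carries both the clazz and the payload attributes
```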
The JSON serialization must then be stored as a GZip-compressed sequence file (`org.apache.hadoop.mapred.SequenceFileOutputFormat`).
Such a file contains a set of key/value tuples, both of type `org.apache.hadoop.io.Text`, where
* the `key` must be set to the canonical name of the class contained in the `AtomicAction`;
* the `value` must be set to the JSON serialization of the AtomicAction.
The following snippet provides an example of how to create an actionset version of Relation records:
```
rels // JavaRDD<Relation>
.map(relation -> new AtomicAction<Relation>(Relation.class, relation))
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))))
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, GzipCodec.class);
```
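Reading an actionset version back follows the inverse path. A sketch, mirroring the pattern used in the module's tests (`sc` is a `JavaSparkContext`, `OBJECT_MAPPER` a Jackson `ObjectMapper`):
```
// sketch: read an actionset version back into a JavaRDD<Relation>
JavaRDD<Relation> rels = sc
    .sequenceFile(outputPath, Text.class, Text.class)
    .map(value -> OBJECT_MAPPER.readValue(value._2().toString(), AtomicAction.class))
    .map(aa -> (Relation) aa.getPayload());
```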

View File

@ -63,10 +63,6 @@
<groupId>eu.dnetlib</groupId>
<artifactId>dnet-openaireplus-mapping-utils</artifactId>
</exclusion>
<exclusion>
<groupId>eu.dnetlib</groupId>
<artifactId>dnet-index-solr-common</artifactId>
</exclusion>
<exclusion>
<groupId>saxonica</groupId>
<artifactId>saxon</artifactId>

View File

@ -20,6 +20,7 @@ import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException;
@ -33,7 +34,8 @@ import eu.dnetlib.dhp.schema.oaf.*;
public class PromoteActionPayloadForGraphTableJob {
private static final Logger logger = LoggerFactory.getLogger(PromoteActionPayloadForGraphTableJob.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
public static void main(String[] args) throws Exception {
String jsonConfiguration = IOUtils

View File

@ -31,6 +31,7 @@ import org.mockito.Mock;
import org.mockito.Mockito;
import org.mockito.junit.jupiter.MockitoExtension;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.actionmanager.ISClient;
@ -46,7 +47,8 @@ public class PartitionActionSetsByPayloadTypeJobTest {
private static Configuration configuration;
private static SparkSession spark;
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
private static final StructType ATOMIC_ACTION_SCHEMA = StructType$.MODULE$
.apply(

View File

@ -25,6 +25,7 @@ import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.schema.common.ModelSupport;
@ -41,7 +42,8 @@ public class PromoteActionPayloadForGraphTableJobTest {
private Path inputActionPayloadRootDir;
private Path outputDir;
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
@BeforeAll
public static void beforeAll() {
@ -154,6 +156,10 @@ public class PromoteActionPayloadForGraphTableJobTest {
List<? extends Oaf> actualOutputRows = readGraphTableFromJobOutput(outputGraphTableDir.toString(), rowClazz)
.collectAsList()
.stream()
.map(s -> {
s.setLastupdatetimestamp(0L);
return s;
})
.sorted(Comparator.comparingInt(Object::hashCode))
.collect(Collectors.toList());
String expectedOutputGraphTableJsonDumpPath = resultFileLocation(strategy, rowClazz, actionPayloadClazz);
@ -166,6 +172,10 @@ public class PromoteActionPayloadForGraphTableJobTest {
expectedOutputGraphTableJsonDumpFile.toString(), rowClazz)
.collectAsList()
.stream()
.map(s -> {
s.setLastupdatetimestamp(0L);
return s;
})
.sorted(Comparator.comparingInt(Object::hashCode))
.collect(Collectors.toList());
assertIterableEquals(expectedOutputRows, actualOutputRows);

View File

@ -79,8 +79,8 @@
</dependency>
<dependency>
<groupId>org.scala-lang.modules</groupId>
<artifactId>scala-xml_2.12</artifactId>
<version>2.1.0</version>
<artifactId>scala-xml_${scala.binary.version}</artifactId>
<version>${scala-xml.version}</version>
</dependency>
<dependency>

View File

@ -11,6 +11,7 @@ import org.apache.spark.sql.SparkSession;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
import eu.dnetlib.dhp.schema.oaf.Subject;
@ -93,4 +94,9 @@ public class Constants {
return s;
}
public static void removeOutputDir(SparkSession spark, String path) {
HdfsSupport.remove(path, spark.sparkContext().hadoopConfiguration());
}
}

View File

@ -0,0 +1,162 @@
package eu.dnetlib.dhp.actionmanager.bipaffiliations;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.*;
import org.apache.spark.sql.Dataset;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.actionmanager.Constants;
import eu.dnetlib.dhp.actionmanager.ror.GenerateRorActionSetJob;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions;
import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
import scala.Tuple2;
/**
* Creates action sets for Crossref affiliation relations inferred by BIP!
*/
public class PrepareAffiliationRelations implements Serializable {
private static final Logger log = LoggerFactory.getLogger(PrepareAffiliationRelations.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private static final String ID_PREFIX = "50|doi_________::";
public static final String BIP_AFFILIATIONS_CLASSID = "result:organization:bipinference";
public static final String BIP_AFFILIATIONS_CLASSNAME = "Affiliation relation inferred by BIP!";
public static final String BIP_INFERENCE_PROVENANCE = "bip:affiliation:crossref";
public static <I extends Result> void main(String[] args) throws Exception {
String jsonConfiguration = IOUtils
.toString(
PrepareAffiliationRelations.class
.getResourceAsStream(
"/eu/dnetlib/dhp/actionmanager/bipaffiliations/input_actionset_parameter.json"));
final ArgumentApplicationParser parser = new ArgumentApplicationParser(jsonConfiguration);
parser.parseArgument(args);
Boolean isSparkSessionManaged = Constants.isSparkSessionManaged(parser);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
final String inputPath = parser.get("inputPath");
log.info("inputPath {}: ", inputPath);
final String outputPath = parser.get("outputPath");
log.info("outputPath {}: ", outputPath);
SparkConf conf = new SparkConf();
runWithSparkSession(
conf,
isSparkSessionManaged,
spark -> {
Constants.removeOutputDir(spark, outputPath);
prepareAffiliationRelations(spark, inputPath, outputPath);
});
}
private static <I extends Result> void prepareAffiliationRelations(SparkSession spark, String inputPath,
String outputPath) {
// load and parse affiliation relations from HDFS
Dataset<Row> df = spark
.read()
.schema("`DOI` STRING, `Matchings` ARRAY<STRUCT<`RORid`:STRING,`Confidence`:DOUBLE>>")
.json(inputPath);
// unroll nested arrays
df = df
.withColumn("matching", functions.explode(new Column("Matchings")))
.select(
new Column("DOI").as("doi"),
new Column("matching.RORid").as("rorid"),
new Column("matching.Confidence").as("confidence"));
// prepare action sets for affiliation relations
df
.toJavaRDD()
.flatMap((FlatMapFunction<Row, Relation>) row -> {
// DOI to OpenAIRE id
final String paperId = ID_PREFIX
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", row.getAs("doi")));
// ROR id to OpenAIRE id
final String affId = GenerateRorActionSetJob.calculateOpenaireId(row.getAs("rorid"));
Qualifier qualifier = OafMapperUtils
.qualifier(
BIP_AFFILIATIONS_CLASSID,
BIP_AFFILIATIONS_CLASSNAME,
ModelConstants.DNET_PROVENANCE_ACTIONS,
ModelConstants.DNET_PROVENANCE_ACTIONS);
// format data info; setting `confidence` into relation's `trust`
DataInfo dataInfo = OafMapperUtils
.dataInfo(
false,
BIP_INFERENCE_PROVENANCE,
true,
false,
qualifier,
Double.toString(row.getAs("confidence")));
List<KeyValue> collectedfrom = OafMapperUtils.listKeyValues(ModelConstants.CROSSREF_ID, "Crossref");
// return bi-directional relations
return getAffiliationRelationPair(paperId, affId, collectedfrom, dataInfo).iterator();
})
.map(p -> new AtomicAction(Relation.class, p))
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))))
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, GzipCodec.class);
}
private static List<Relation> getAffiliationRelationPair(String paperId, String affId, List<KeyValue> collectedfrom,
DataInfo dataInfo) {
return Arrays
.asList(
OafMapperUtils
.getRelation(
paperId,
affId,
ModelConstants.RESULT_ORGANIZATION,
ModelConstants.AFFILIATION,
ModelConstants.HAS_AUTHOR_INSTITUTION,
collectedfrom,
dataInfo,
null),
OafMapperUtils
.getRelation(
affId,
paperId,
ModelConstants.RESULT_ORGANIZATION,
ModelConstants.AFFILIATION,
ModelConstants.IS_AUTHOR_INSTITUTION_OF,
collectedfrom,
dataInfo,
null));
}
}

View File

@ -6,13 +6,14 @@ import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import java.io.Serializable;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.MapFunction;
@ -24,8 +25,9 @@ import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.actionmanager.bipmodel.BipDeserialize;
import eu.dnetlib.dhp.actionmanager.bipmodel.BipScore;
import eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers.BipProjectModel;
import eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers.BipResultModel;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.action.AtomicAction;
@ -40,7 +42,6 @@ import scala.Tuple2;
*/
public class SparkAtomicActionScoreJob implements Serializable {
private static final String DOI = "doi";
private static final Logger log = LoggerFactory.getLogger(SparkAtomicActionScoreJob.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
@ -56,18 +57,17 @@ public class SparkAtomicActionScoreJob implements Serializable {
parser.parseArgument(args);
Boolean isSparkSessionManaged = Optional
.ofNullable(parser.get("isSparkSessionManaged"))
.map(Boolean::valueOf)
.orElse(Boolean.TRUE);
Boolean isSparkSessionManaged = isSparkSessionManaged(parser);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
final String inputPath = parser.get("inputPath");
log.info("inputPath {}: ", inputPath);
final String resultsInputPath = parser.get("resultsInputPath");
log.info("resultsInputPath: {}", resultsInputPath);
final String projectsInputPath = parser.get("projectsInputPath");
log.info("projectsInputPath: {}", projectsInputPath);
final String outputPath = parser.get("outputPath");
log.info("outputPath {}: ", outputPath);
log.info("outputPath: {}", outputPath);
SparkConf conf = new SparkConf();
@ -76,17 +76,45 @@ public class SparkAtomicActionScoreJob implements Serializable {
isSparkSessionManaged,
spark -> {
removeOutputDir(spark, outputPath);
prepareResults(spark, inputPath, outputPath);
JavaPairRDD<Text, Text> resultsRDD = prepareResults(spark, resultsInputPath, outputPath);
JavaPairRDD<Text, Text> projectsRDD = prepareProjects(spark, projectsInputPath, outputPath);
resultsRDD
.union(projectsRDD)
.saveAsHadoopFile(
outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, GzipCodec.class);
});
}
private static <I extends Result> void prepareResults(SparkSession spark, String bipScorePath, String outputPath) {
private static <I extends Project> JavaPairRDD<Text, Text> prepareProjects(SparkSession spark, String inputPath,
String outputPath) {
// read input bip project scores
Dataset<BipProjectModel> projectScores = readPath(spark, inputPath, BipProjectModel.class);
return projectScores.map((MapFunction<BipProjectModel, Project>) bipProjectScores -> {
Project project = new Project();
project.setId(bipProjectScores.getProjectId());
project.setMeasures(bipProjectScores.toMeasures());
return project;
}, Encoders.bean(Project.class))
.toJavaRDD()
.map(p -> new AtomicAction(Project.class, p))
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))));
}
private static <I extends Result> JavaPairRDD<Text, Text> prepareResults(SparkSession spark, String bipScorePath,
String outputPath) {
final JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
JavaRDD<BipDeserialize> bipDeserializeJavaRDD = sc
JavaRDD<BipResultModel> bipDeserializeJavaRDD = sc
.textFile(bipScorePath)
.map(item -> OBJECT_MAPPER.readValue(item, BipDeserialize.class));
.map(item -> OBJECT_MAPPER.readValue(item, BipResultModel.class));
Dataset<BipScore> bipScores = spark
.createDataset(bipDeserializeJavaRDD.flatMap(entry -> entry.keySet().stream().map(key -> {
@ -96,24 +124,20 @@ public class SparkAtomicActionScoreJob implements Serializable {
return bs;
}).collect(Collectors.toList()).iterator()).rdd(), Encoders.bean(BipScore.class));
bipScores
return bipScores.map((MapFunction<BipScore, Result>) bs -> {
Result ret = new Result();
.map((MapFunction<BipScore, Result>) bs -> {
Result ret = new Result();
ret.setId(bs.getId());
ret.setId(bs.getId());
ret.setMeasures(getMeasure(bs));
ret.setMeasures(getMeasure(bs));
return ret;
}, Encoders.bean(Result.class))
return ret;
}, Encoders.bean(Result.class))
.toJavaRDD()
.map(p -> new AtomicAction(Result.class, p))
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))))
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class);
new Text(OBJECT_MAPPER.writeValueAsString(aa))));
}
private static List<Measure> getMeasure(BipScore value) {
@ -159,12 +183,4 @@ public class SparkAtomicActionScoreJob implements Serializable {
HdfsSupport.remove(path, spark.sparkContext().hadoopConfiguration());
}
public static <R> Dataset<R> readPath(
SparkSession spark, String inputPath, Class<R> clazz) {
return spark
.read()
.textFile(inputPath)
.map((MapFunction<String, R>) value -> OBJECT_MAPPER.readValue(value, clazz), Encoders.bean(clazz));
}
}

View File

@ -0,0 +1,74 @@
package eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers;
import static eu.dnetlib.dhp.actionmanager.Constants.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import com.opencsv.bean.CsvBindByPosition;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.KeyValue;
import eu.dnetlib.dhp.schema.oaf.Measure;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
import lombok.AllArgsConstructor;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;
@NoArgsConstructor
@AllArgsConstructor
@Getter
@Setter
public class BipProjectModel {
String projectId;
String numOfInfluentialResults;
String numOfPopularResults;
String totalImpulse;
String totalCitationCount;
// each project bip measure has exactly one value, hence one key-value pair
private Measure createMeasure(String measureId, String measureValue) {
KeyValue kv = new KeyValue();
kv.setKey("score");
kv.setValue(measureValue);
kv
.setDataInfo(
OafMapperUtils
.dataInfo(
false,
UPDATE_DATA_INFO_TYPE,
true,
false,
OafMapperUtils
.qualifier(
UPDATE_MEASURE_BIP_CLASS_ID,
UPDATE_CLASS_NAME,
ModelConstants.DNET_PROVENANCE_ACTIONS,
ModelConstants.DNET_PROVENANCE_ACTIONS),
""));
Measure measure = new Measure();
measure.setId(measureId);
measure.setUnit(Collections.singletonList(kv));
return measure;
}
public List<Measure> toMeasures() {
return Arrays
.asList(
createMeasure("numOfInfluentialResults", numOfInfluentialResults),
createMeasure("numOfPopularResults", numOfPopularResults),
createMeasure("totalImpulse", totalImpulse),
createMeasure("totalCitationCount", totalCitationCount));
}
}

View File

@ -1,19 +1,21 @@
package eu.dnetlib.dhp.actionmanager.bipmodel;
package eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import eu.dnetlib.dhp.actionmanager.bipmodel.Score;
/**
* Class that maps the model of the bipFinder! input data.
* Only needed for deserialization purposes
*/
public class BipDeserialize extends HashMap<String, List<Score>> implements Serializable {
public class BipResultModel extends HashMap<String, List<Score>> implements Serializable {
public BipDeserialize() {
public BipResultModel() {
super();
}

View File

@ -24,8 +24,8 @@ import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.actionmanager.bipmodel.BipDeserialize;
import eu.dnetlib.dhp.actionmanager.bipmodel.BipScore;
import eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers.BipResultModel;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.common.ModelConstants;
@ -82,9 +82,9 @@ public class PrepareBipFinder implements Serializable {
final JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
JavaRDD<BipDeserialize> bipDeserializeJavaRDD = sc
JavaRDD<BipResultModel> bipDeserializeJavaRDD = sc
.textFile(inputPath)
.map(item -> OBJECT_MAPPER.readValue(item, BipDeserialize.class));
.map(item -> OBJECT_MAPPER.readValue(item, BipResultModel.class));
spark
.createDataset(bipDeserializeJavaRDD.flatMap(entry -> entry.keySet().stream().map(key -> {

View File

@ -6,7 +6,6 @@ import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.io.IOUtils;
@ -23,7 +22,6 @@ import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.actionmanager.project.PrepareProjects;
import eu.dnetlib.dhp.actionmanager.project.utils.model.JsonTopic;
import eu.dnetlib.dhp.actionmanager.project.utils.model.Project;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
/**

View File

@ -168,7 +168,7 @@ public class GenerateRorActionSetJob {
}
private static String calculateOpenaireId(final String rorId) {
public static String calculateOpenaireId(final String rorId) {
return String.format("20|%s::%s", Constants.ROR_NS_PREFIX, DHPUtils.md5(rorId));
}

View File

@ -75,7 +75,7 @@ public class SparkAtomicActionUsageJob implements Serializable {
removeOutputDir(spark, outputPath);
prepareData(dbname, spark, workingPath + "/usageDb", "usage_stats", "result_id");
prepareData(dbname, spark, workingPath + "/projectDb", "project_stats", "id");
prepareData(dbname, spark, workingPath + "/datasourceDb", "datasource_stats", "repositor_id");
prepareData(dbname, spark, workingPath + "/datasourceDb", "datasource_stats", "repository_id");
writeActionSet(spark, workingPath, outputPath);
});
}

View File

@ -0,0 +1,20 @@
[
{
"paramName": "issm",
"paramLongName": "isSparkSessionManaged",
"paramDescription": "when true will stop SparkSession after job execution",
"paramRequired": false
},
{
"paramName": "ip",
"paramLongName": "inputPath",
"paramDescription": "the URL from where to get the programme file",
"paramRequired": true
},
{
"paramName": "o",
"paramLongName": "outputPath",
"paramDescription": "the path of the new ActionSet",
"paramRequired": true
}
]

View File

@ -0,0 +1,35 @@
# --- You can override the following properties (if needed) coming from your ~/.dhp/application.properties ---
# dhp.hadoop.frontend.temp.dir=/home/ilias.kanellos
# dhp.hadoop.frontend.user.name=ilias.kanellos
# dhp.hadoop.frontend.host.name=iis-cdh5-test-gw.ocean.icm.edu.pl
# dhp.hadoop.frontend.port.ssh=22
# oozieServiceLoc=http://iis-cdh5-test-m3:11000/oozie
# jobTracker=yarnRM
# nameNode=hdfs://nameservice1
# oozie.execution.log.file.location = target/extract-and-run-on-remote-host.log
# maven.executable=mvn
# Some memory and driver settings for more demanding tasks
sparkDriverMemory=10G
sparkExecutorMemory=10G
sparkExecutorCores=4
sparkShufflePartitions=7680
# The above is given differently in an example I found online
oozie.action.sharelib.for.spark=spark2
oozieActionShareLibForSpark2=spark2
spark2YarnHistoryServerAddress=http://iis-cdh5-test-gw.ocean.icm.edu.pl:18089
spark2EventLogDir=/user/spark/spark2ApplicationHistory
sparkSqlWarehouseDir=/user/hive/warehouse
hiveMetastoreUris=thrift://iis-cdh5-test-m3.ocean.icm.edu.pl:9083
# This MAY avoid the 'no library used' error
oozie.use.system.libpath=true
# Some stuff copied from openaire's jobs
spark2ExtraListeners=com.cloudera.spark.lineage.NavigatorAppListener
spark2SqlQueryExecutionListeners=com.cloudera.spark.lineage.NavigatorQueryListener
# The following is needed as a property of a workflow
oozie.wf.application.path=${oozieTopWfApplicationPath}
inputPath=/data/bip-affiliations/data.json
outputPath=/tmp/crossref-affiliations-output-v5

View File

@ -0,0 +1,30 @@
<configuration>
<property>
<name>jobTracker</name>
<value>yarnRM</value>
</property>
<property>
<name>nameNode</name>
<value>hdfs://nameservice1</value>
</property>
<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
</property>
<property>
<name>hiveMetastoreUris</name>
<value>thrift://iis-cdh5-test-m3.ocean.icm.edu.pl:9083</value>
</property>
<property>
<name>hiveJdbcUrl</name>
<value>jdbc:hive2://iis-cdh5-test-m3.ocean.icm.edu.pl:10000</value>
</property>
<property>
<name>hiveDbName</name>
<value>openaire</value>
</property>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>

View File

@ -0,0 +1,107 @@
<workflow-app name="BipAffiliations" xmlns="uri:oozie:workflow:0.5">
<parameters>
<property>
<name>inputPath</name>
<description>the path where to find the inferred affiliation relations</description>
</property>
<property>
<name>outputPath</name>
<description>the path where to store the actionset</description>
</property>
<property>
<name>sparkDriverMemory</name>
<description>memory for driver process</description>
</property>
<property>
<name>sparkExecutorMemory</name>
<description>memory for individual executor</description>
</property>
<property>
<name>sparkExecutorCores</name>
<description>number of cores used by single executor</description>
</property>
<property>
<name>oozieActionShareLibForSpark2</name>
<description>oozie action sharelib for spark 2.*</description>
</property>
<property>
<name>spark2ExtraListeners</name>
<value>com.cloudera.spark.lineage.NavigatorAppListener</value>
<description>spark 2.* extra listeners classname</description>
</property>
<property>
<name>spark2SqlQueryExecutionListeners</name>
<value>com.cloudera.spark.lineage.NavigatorQueryListener</value>
<description>spark 2.* sql query execution listeners classname</description>
</property>
<property>
<name>spark2YarnHistoryServerAddress</name>
<description>spark 2.* yarn history server address</description>
</property>
<property>
<name>spark2EventLogDir</name>
<description>spark 2.* event log dir location</description>
</property>
</parameters>
<global>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapreduce.job.queuename</name>
<value>${queueName}</value>
</property>
<property>
<name>oozie.launcher.mapred.job.queue.name</name>
<value>${oozieLauncherQueueName}</value>
</property>
<property>
<name>oozie.action.sharelib.for.spark</name>
<value>${oozieActionShareLibForSpark2}</value>
</property>
</configuration>
</global>
<start to="deleteoutputpath"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="deleteoutputpath">
<fs>
<delete path="${outputPath}"/>
<mkdir path="${outputPath}"/>
<delete path="${workingDir}"/>
<mkdir path="${workingDir}"/>
</fs>
<ok to="atomicactions"/>
<error to="Kill"/>
</action>
<action name="atomicactions">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Produces the atomic actions with the affiliation relations from Crossref inferred by BIP!</name>
<class>eu.dnetlib.dhp.actionmanager.bipaffiliations.PrepareAffiliationRelations</class>
<jar>dhp-aggregation-${projectVersion}.jar</jar>
<spark-opts>
--executor-memory=${sparkExecutorMemory}
--executor-cores=${sparkExecutorCores}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.warehouse.dir=${sparkSqlWarehouseDir}
</spark-opts>
<arg>--inputPath</arg><arg>${inputPath}</arg>
<arg>--outputPath</arg><arg>${outputPath}</arg>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

View File

@ -6,9 +6,15 @@
"paramRequired": false
},
{
"paramName": "ip",
"paramLongName": "inputPath",
"paramDescription": "the URL from where to get the programme file",
"paramName": "rip",
"paramLongName": "resultsInputPath",
"paramDescription": "the URL from where to get the input file for results",
"paramRequired": true
},
{
"paramName": "pip",
"paramLongName": "projectsInputPath",
"paramDescription": "the URL from where to get the input file for projects",
"paramRequired": true
},
{

View File

@ -1,4 +1,9 @@
{
"ETHZ.UNIGENF": {
"openaire_id": "opendoar____::1400",
"datacite_name": "Uni Genf",
"official_name": "Archive ouverte UNIGE"
},
"GESIS.RKI": {
"openaire_id": "re3data_____::r3d100010436",
"datacite_name": "Forschungsdatenzentrum am Robert Koch Institut",

View File

@ -222,7 +222,7 @@ object BioDBToOAF {
def uniprotToOAF(input: String): List[Oaf] = {
implicit lazy val formats: DefaultFormats.type = org.json4s.DefaultFormats
lazy val json = parse(input)
val pid = (json \ "pid").extract[String]
val pid = (json \ "pid").extract[String].trim()
val d = new Dataset

View File

@ -18,9 +18,9 @@ import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql._
import org.slf4j.{Logger, LoggerFactory}
import java.io.{ByteArrayInputStream, InputStream}
import java.io.InputStream
import scala.io.Source
//import scala.xml.pull.XMLEventReader
import scala.xml.pull.XMLEventReader
object SparkCreateBaselineDataFrame {
@ -197,8 +197,8 @@ object SparkCreateBaselineDataFrame {
val ds: Dataset[PMArticle] = spark.createDataset(
k.filter(i => i._1.endsWith(".gz"))
.flatMap(i => {
// val xml = new XMLEventReader(Source.fromBytes(i._2.getBytes()))
new PMParser(new ByteArrayInputStream(i._2.getBytes()))
val xml = new XMLEventReader(Source.fromBytes(i._2.getBytes()))
new PMParser(xml)
})
)
ds.map(p => (p.getPmid, p))(Encoders.tuple(Encoders.STRING, PMEncoder))

View File

@ -1,20 +1,11 @@
package eu.dnetlib.dhp.sx.bio.pubmed
import javax.xml.stream.{XMLEventReader, XMLInputFactory, XMLStreamConstants}
import scala.language.postfixOps
import scala.xml.MetaData
//import scala.xml.pull.{EvElemEnd, EvElemStart, EvText, XMLEventReader}
import scala.xml.pull.{EvElemEnd, EvElemStart, EvText, XMLEventReader}
/** @param xml
*/
class PMParser(stream: java.io.InputStream) extends Iterator[PMArticle] {
private val reader: XMLEventReader = {
println("INSTANTIATE READER")
val factory = XMLInputFactory.newInstance()
factory.createXMLEventReader(stream)
}
class PMParser(xml: XMLEventReader) extends Iterator[PMArticle] {
var currentArticle: PMArticle = generateNextArticle()
@ -58,142 +49,85 @@ class PMParser(stream: java.io.InputStream) extends Iterator[PMArticle] {
var currentMonth = "01"
var currentDay = "01"
var currentArticleType: String = null
var sb = new StringBuilder()
var insideChar = false
var complete = false
while (reader.hasNext && !complete) {
val next = reader.nextEvent()
while (xml.hasNext) {
xml.next match {
case EvElemStart(_, label, attrs, _) =>
currNode = label
if (next.isStartElement) {
if (insideChar) {
if (sb.nonEmpty)
println(s"got data ${sb.toString.trim}")
insideChar = false
}
val name = next.asStartElement().getName.getLocalPart
println(s"Start Element $name")
next.asStartElement().getAttributes.forEachRemaining(e => print(e.toString))
label match {
case "PubmedArticle" => currentArticle = new PMArticle
case "Author" => currentAuthor = new PMAuthor
case "Journal" => currentJournal = new PMJournal
case "Grant" => currentGrant = new PMGrant
case "PublicationType" | "DescriptorName" =>
currentSubject = new PMSubject
currentSubject.setMeshId(extractAttributes(attrs, "UI"))
case "ArticleId" => currentArticleType = extractAttributes(attrs, "IdType")
case _ =>
}
case EvElemEnd(_, label) =>
label match {
case "PubmedArticle" => return currentArticle
case "Author" => currentArticle.getAuthors.add(currentAuthor)
case "Journal" => currentArticle.setJournal(currentJournal)
case "Grant" => currentArticle.getGrants.add(currentGrant)
case "PubMedPubDate" =>
if (currentArticle.getDate == null)
currentArticle.setDate(validate_Date(currentYear, currentMonth, currentDay))
case "PubDate" => currentJournal.setDate(s"$currentYear-$currentMonth-$currentDay")
case "DescriptorName" => currentArticle.getSubjects.add(currentSubject)
case "PublicationType" => currentArticle.getPublicationTypes.add(currentSubject)
case _ =>
}
case EvText(text) =>
if (currNode != null && text.trim.nonEmpty)
currNode match {
case "ArticleTitle" => {
if (currentArticle.getTitle == null)
currentArticle.setTitle(text.trim)
else
currentArticle.setTitle(currentArticle.getTitle + text.trim)
}
case "AbstractText" => {
if (currentArticle.getDescription == null)
currentArticle.setDescription(text.trim)
else
currentArticle.setDescription(currentArticle.getDescription + text.trim)
}
case "PMID" => currentArticle.setPmid(text.trim)
case "ArticleId" =>
if ("doi".equalsIgnoreCase(currentArticleType)) currentArticle.setDoi(text.trim)
if ("pmc".equalsIgnoreCase(currentArticleType)) currentArticle.setPmcId(text.trim)
case "Language" => currentArticle.setLanguage(text.trim)
case "ISSN" => currentJournal.setIssn(text.trim)
case "GrantID" => currentGrant.setGrantID(text.trim)
case "Agency" => currentGrant.setAgency(text.trim)
case "Country" => if (currentGrant != null) currentGrant.setCountry(text.trim)
case "Year" => currentYear = text.trim
case "Month" => currentMonth = text.trim
case "Day" => currentDay = text.trim
case "Volume" => currentJournal.setVolume(text.trim)
case "Issue" => currentJournal.setIssue(text.trim)
case "PublicationType" | "DescriptorName" => currentSubject.setValue(text.trim)
case "LastName" => {
if (currentAuthor != null)
currentAuthor.setLastName(text.trim)
}
case "ForeName" =>
if (currentAuthor != null)
currentAuthor.setForeName(text.trim)
case "Title" =>
if (currentJournal.getTitle == null)
currentJournal.setTitle(text.trim)
else
currentJournal.setTitle(currentJournal.getTitle + text.trim)
case _ =>
} else if (next.isEndElement) {
if (insideChar) {
if (sb.nonEmpty)
println(s"got data ${sb.toString.trim}")
insideChar = false
}
val name = next.asEndElement().getName.getLocalPart
println(s"End Element $name")
if (name.equalsIgnoreCase("PubmedArticle")) {
complete = true
println("Condizione di uscita")
}
} else if (next.isCharacters) {
if (!insideChar) {
insideChar = true
sb.clear()
}
val d = next.asCharacters().getData
if (d.trim.nonEmpty)
sb.append(d.trim)
}
case _ =>
}
// next match {
// case _ if (next.isStartElement) =>
// val name = next.asStartElement().getName.getLocalPart
// println(s"Start Element $name")
// case _ if (next.isEndElement) =>
// val name = next.asStartElement().getName.getLocalPart
// println(s"End Element $name")
// case _ if (next.isCharacters) =>
// val c = next.asCharacters()
// val data = c.getData
// println(s"Text value $data")
//
// }
//
//
// reader.next match {
//
// case
//
// case EvElemStart(_, label, attrs, _) =>
// currNode = label
//
// label match {
// case "PubmedArticle" => currentArticle = new PMArticle
// case "Author" => currentAuthor = new PMAuthor
// case "Journal" => currentJournal = new PMJournal
// case "Grant" => currentGrant = new PMGrant
// case "PublicationType" | "DescriptorName" =>
// currentSubject = new PMSubject
// currentSubject.setMeshId(extractAttributes(attrs, "UI"))
// case "ArticleId" => currentArticleType = extractAttributes(attrs, "IdType")
// case _ =>
// }
// case EvElemEnd(_, label) =>
// label match {
// case "PubmedArticle" => return currentArticle
// case "Author" => currentArticle.getAuthors.add(currentAuthor)
// case "Journal" => currentArticle.setJournal(currentJournal)
// case "Grant" => currentArticle.getGrants.add(currentGrant)
// case "PubMedPubDate" =>
// if (currentArticle.getDate == null)
// currentArticle.setDate(validate_Date(currentYear, currentMonth, currentDay))
// case "PubDate" => currentJournal.setDate(s"$currentYear-$currentMonth-$currentDay")
// case "DescriptorName" => currentArticle.getSubjects.add(currentSubject)
// case "PublicationType" => currentArticle.getPublicationTypes.add(currentSubject)
// case _ =>
// }
// case EvText(text) =>
// if (currNode != null && text.trim.nonEmpty)
// currNode match {
// case "ArticleTitle" => {
// if (currentArticle.getTitle == null)
// currentArticle.setTitle(text.trim)
// else
// currentArticle.setTitle(currentArticle.getTitle + text.trim)
// }
// case "AbstractText" => {
// if (currentArticle.getDescription == null)
// currentArticle.setDescription(text.trim)
// else
// currentArticle.setDescription(currentArticle.getDescription + text.trim)
// }
// case "PMID" => currentArticle.setPmid(text.trim)
// case "ArticleId" =>
// if ("doi".equalsIgnoreCase(currentArticleType)) currentArticle.setDoi(text.trim)
// if ("pmc".equalsIgnoreCase(currentArticleType)) currentArticle.setPmcId(text.trim)
// case "Language" => currentArticle.setLanguage(text.trim)
// case "ISSN" => currentJournal.setIssn(text.trim)
// case "GrantID" => currentGrant.setGrantID(text.trim)
// case "Agency" => currentGrant.setAgency(text.trim)
// case "Country" => if (currentGrant != null) currentGrant.setCountry(text.trim)
// case "Year" => currentYear = text.trim
// case "Month" => currentMonth = text.trim
// case "Day" => currentDay = text.trim
// case "Volume" => currentJournal.setVolume(text.trim)
// case "Issue" => currentJournal.setIssue(text.trim)
// case "PublicationType" | "DescriptorName" => currentSubject.setValue(text.trim)
// case "LastName" => {
// if (currentAuthor != null)
// currentAuthor.setLastName(text.trim)
// }
// case "ForeName" =>
// if (currentAuthor != null)
// currentAuthor.setForeName(text.trim)
// case "Title" =>
// if (currentJournal.getTitle == null)
// currentJournal.setTitle(text.trim)
// else
// currentJournal.setTitle(currentJournal.getTitle + text.trim)
// case _ =>
//
// }
// case _ =>
// }
}
null
}

View File

@ -0,0 +1,145 @@
package eu.dnetlib.dhp.actionmanager.bipaffiliations;
import static org.junit.jupiter.api.Assertions.*;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.Relation;
import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions;
import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory;
public class PrepareAffiliationRelationsTest {
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private static SparkSession spark;
private static Path workingDir;
private static final String ID_PREFIX = "50|doi_________::";
private static final Logger log = LoggerFactory
.getLogger(PrepareAffiliationRelationsTest.class);
@BeforeAll
public static void beforeAll() throws IOException {
workingDir = Files.createTempDirectory(PrepareAffiliationRelationsTest.class.getSimpleName());
log.info("Using work dir {}", workingDir);
SparkConf conf = new SparkConf();
conf.setAppName(PrepareAffiliationRelationsTest.class.getSimpleName());
conf.setMaster("local[*]");
conf.set("spark.driver.host", "localhost");
conf.set("hive.metastore.local", "true");
conf.set("spark.ui.enabled", "false");
conf.set("spark.sql.warehouse.dir", workingDir.toString());
conf.set("hive.metastore.warehouse.dir", workingDir.resolve("warehouse").toString());
spark = SparkSession
.builder()
.appName(PrepareAffiliationRelationsTest.class.getSimpleName())
.config(conf)
.getOrCreate();
}
@AfterAll
public static void afterAll() throws IOException {
FileUtils.deleteDirectory(workingDir.toFile());
spark.stop();
}
@Test
void testMatch() throws Exception {
String affiliationRelationsPath = getClass()
.getResource("/eu/dnetlib/dhp/actionmanager/bipaffiliations/doi_to_ror.json")
.getPath();
String outputPath = workingDir.toString() + "/actionSet";
PrepareAffiliationRelations
.main(
new String[] {
"-isSparkSessionManaged", Boolean.FALSE.toString(),
"-inputPath", affiliationRelationsPath,
"-outputPath", outputPath
});
final JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
JavaRDD<Relation> tmp = sc
.sequenceFile(outputPath, Text.class, Text.class)
.map(value -> OBJECT_MAPPER.readValue(value._2().toString(), AtomicAction.class))
.map(aa -> ((Relation) aa.getPayload()));
// for (Relation r : tmp.collect()) {
// System.out.println(
// r.getSource() + "\t" + r.getTarget() + "\t" + r.getRelType() + "\t" + r.getRelClass() + "\t" + r.getSubRelType() + "\t" + r.getValidationDate() + "\t" + r.getDataInfo().getTrust() + "\t" + r.getDataInfo().getInferred()
// );
// }
// count the number of relations
assertEquals(20, tmp.count());
Dataset<Relation> dataset = spark.createDataset(tmp.rdd(), Encoders.bean(Relation.class));
dataset.createOrReplaceTempView("result");
Dataset<Row> execVerification = spark
.sql("select r.relType, r.relClass, r.source, r.target, r.dataInfo.trust from result r");
// verify that we have an equal number of relations in each direction
Assertions
.assertEquals(
10, execVerification
.filter(
"relClass='" + ModelConstants.HAS_AUTHOR_INSTITUTION + "'")
.collectAsList()
.size());
Assertions
.assertEquals(
10, execVerification
.filter(
"relClass='" + ModelConstants.IS_AUTHOR_INSTITUTION_OF + "'")
.collectAsList()
.size());
// check confidence value of a specific relation
String sourceDOI = "10.1061/(asce)0733-9399(2002)128:7(759)";
final String sourceOpenaireId = ID_PREFIX
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", sourceDOI));
Assertions
.assertEquals(
"0.7071067812", execVerification
.filter(
"source='" + sourceOpenaireId + "'")
.collectAsList()
.get(0)
.getString(4));
}
}
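The doi_to_ror.json fixture below holds 10 DOI-to-ROR matchings spread over 7 records, which is why the test expects 20 relations: each matching fans out into one relation per direction. A hedged sketch of that fan-out (the case classes and the id-mapping function are illustrative placeholders, not the job's real types; the exact trust derivation is assumed to live in the job):

case class Matching(rorId: String, confidence: Double)
case class Rel(source: String, target: String, relClass: String, trust: String)

// one matching -> two symmetric relations, trust carrying the matching confidence
def fanOut(doiOpenaireId: String, m: Matching, rorToOpenaireId: String => String): Seq[Rel] = {
  val orgId = rorToOpenaireId(m.rorId)
  Seq(
    Rel(doiOpenaireId, orgId, "hasAuthorInstitution", m.confidence.toString),
    Rel(orgId, doiOpenaireId, "isAuthorInstitutionOf", m.confidence.toString)
  )
}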

View File

@@ -6,7 +6,8 @@ import static org.junit.jupiter.api.Assertions.*;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import javax.xml.crypto.Data;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.io.Text;
@@ -27,7 +28,9 @@ import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.oaf.Publication;
import eu.dnetlib.dhp.schema.oaf.KeyValue;
import eu.dnetlib.dhp.schema.oaf.OafEntity;
import eu.dnetlib.dhp.schema.oaf.Project;
import eu.dnetlib.dhp.schema.oaf.Result;
public class SparkAtomicActionScoreJobTest {
@@ -37,8 +40,8 @@ public class SparkAtomicActionScoreJobTest {
private static SparkSession spark;
private static Path workingDir;
private static final Logger log = LoggerFactory
.getLogger(SparkAtomicActionScoreJobTest.class);
private static final Logger log = LoggerFactory.getLogger(SparkAtomicActionScoreJobTest.class);
@BeforeAll
public static void beforeAll() throws IOException {
@@ -69,47 +72,64 @@ public class SparkAtomicActionScoreJobTest {
spark.stop();
}
@Test
void testMatch() throws Exception {
String bipScoresPath = getClass()
.getResource("/eu/dnetlib/dhp/actionmanager/bipfinder/bip_scores_oid.json")
.getPath();
private void runJob(String resultsInputPath, String projectsInputPath, String outputPath) throws Exception {
SparkAtomicActionScoreJob
.main(
new String[] {
"-isSparkSessionManaged",
Boolean.FALSE.toString(),
"-inputPath",
bipScoresPath,
"-outputPath",
workingDir.toString() + "/actionSet"
"-isSparkSessionManaged", Boolean.FALSE.toString(),
"-resultsInputPath", resultsInputPath,
"-projectsInputPath", projectsInputPath,
"-outputPath", outputPath,
});
}
@Test
void testScores() throws Exception {
String resultsInputPath = getClass()
.getResource("/eu/dnetlib/dhp/actionmanager/bipfinder/result_bip_scores.json")
.getPath();
String projectsInputPath = getClass()
.getResource("/eu/dnetlib/dhp/actionmanager/bipfinder/project_bip_scores.json")
.getPath();
String outputPath = workingDir.toString() + "/actionSet";
// execute the job to generate the action sets for result and project scores
runJob(resultsInputPath, projectsInputPath, outputPath);
final JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
JavaRDD<Result> tmp = sc
.sequenceFile(workingDir.toString() + "/actionSet", Text.class, Text.class)
JavaRDD<OafEntity> tmp = sc
.sequenceFile(outputPath, Text.class, Text.class)
.map(value -> OBJECT_MAPPER.readValue(value._2().toString(), AtomicAction.class))
.map(aa -> ((Result) aa.getPayload()));
.map(aa -> ((OafEntity) aa.getPayload()));
assertEquals(4, tmp.count());
assertEquals(8, tmp.count());
Dataset<Result> verificationDataset = spark.createDataset(tmp.rdd(), Encoders.bean(Result.class));
Dataset<OafEntity> verificationDataset = spark.createDataset(tmp.rdd(), Encoders.bean(OafEntity.class));
verificationDataset.createOrReplaceTempView("result");
Dataset<Row> execVerification = spark
Dataset<Row> testDataset = spark
.sql(
"Select p.id oaid, mes.id, mUnit.value from result p " +
"lateral view explode(measures) m as mes " +
"lateral view explode(mes.unit) u as mUnit ");
Assertions.assertEquals(12, execVerification.count());
// execVerification.show();
Assertions.assertEquals(28, testDataset.count());
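// 28 = 4 results x 3 result-level scores + 4 projects x 4 project-level scores (per the fixtures)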
assertResultImpactScores(testDataset);
assertProjectImpactScores(testDataset);
}
void assertResultImpactScores(Dataset<Row> testDataset) {
Assertions
.assertEquals(
"6.63451994567e-09", execVerification
"6.63451994567e-09", testDataset
.filter(
"oaid='50|arXiv_dedup_::4a2d5fd8d71daec016c176ec71d957b1' " +
"and id = 'influence'")
@@ -119,7 +139,7 @@ public class SparkAtomicActionScoreJobTest {
.getString(0));
Assertions
.assertEquals(
"0.348694533145", execVerification
"0.348694533145", testDataset
.filter(
"oaid='50|arXiv_dedup_::4a2d5fd8d71daec016c176ec71d957b1' " +
"and id = 'popularity_alt'")
@@ -129,7 +149,7 @@ public class SparkAtomicActionScoreJobTest {
.getString(0));
Assertions
.assertEquals(
"2.16094680115e-09", execVerification
"2.16094680115e-09", testDataset
.filter(
"oaid='50|arXiv_dedup_::4a2d5fd8d71daec016c176ec71d957b1' " +
"and id = 'popularity'")
@@ -137,7 +157,49 @@ public class SparkAtomicActionScoreJobTest {
.collectAsList()
.get(0)
.getString(0));
}
void assertProjectImpactScores(Dataset<Row> testDataset) throws Exception {
Assertions
.assertEquals(
"0", testDataset
.filter(
"oaid='40|nih_________::c02a8233e9b60f05bb418f0c9b714833' " +
"and id = 'numOfInfluentialResults'")
.select("value")
.collectAsList()
.get(0)
.getString(0));
Assertions
.assertEquals(
"1", testDataset
.filter(
"oaid='40|nih_________::c02a8233e9b60f05bb418f0c9b714833' " +
"and id = 'numOfPopularResults'")
.select("value")
.collectAsList()
.get(0)
.getString(0));
Assertions
.assertEquals(
"25", testDataset
.filter(
"oaid='40|nih_________::c02a8233e9b60f05bb418f0c9b714833' " +
"and id = 'totalImpulse'")
.select("value")
.collectAsList()
.get(0)
.getString(0));
Assertions
.assertEquals(
"43", testDataset
.filter(
"oaid='40|nih_________::c02a8233e9b60f05bb418f0c9b714833' " +
"and id = 'totalCitationCount'")
.select("value")
.collectAsList()
.get(0)
.getString(0));
}
}
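The project-side assertions above traverse the same measures/unit explosion as the result side. A hedged sketch of how one record of the project_bip_scores.json fixture below could be flattened into the (oaid, id, value) rows the SQL query yields (the field names come straight from the fixture; the Measure/KeyValue wiring inside the job is assumed):

case class ProjectScores(projectId: String,
                         numOfInfluentialResults: Int,
                         numOfPopularResults: Int,
                         totalImpulse: Int,
                         totalCitationCount: Int)

// four score fields -> four (oaid, measure id, value) rows per project
def toRows(p: ProjectScores): Seq[(String, String, String)] = Seq(
  (p.projectId, "numOfInfluentialResults", p.numOfInfluentialResults.toString),
  (p.projectId, "numOfPopularResults", p.numOfPopularResults.toString),
  (p.projectId, "totalImpulse", p.totalImpulse.toString),
  (p.projectId, "totalCitationCount", p.totalCitationCount.toString)
)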

View File

@@ -0,0 +1,7 @@
{"DOI":"10.1061\/(asce)0733-9399(2002)128:7(759)","Matchings":[{"RORid":"https:\/\/ror.org\/03yxnpp24","Confidence":0.7071067812},{"RORid":"https:\/\/ror.org\/01teme464","Confidence":0.89}]}
{"DOI":"10.1105\/tpc.8.3.343","Matchings":[{"RORid":"https:\/\/ror.org\/02k40bc56","Confidence":0.7071067812}]}
{"DOI":"10.1161\/01.cir.0000013305.01850.37","Matchings":[{"RORid":"https:\/\/ror.org\/00qjgza05","Confidence":1}]}
{"DOI":"10.1142\/s021821650200186x","Matchings":[{"RORid":"https:\/\/ror.org\/035xkbk20","Confidence":1},{"RORid":"https:\/\/ror.org\/05apxxy63","Confidence":1}]}
{"DOI":"10.1061\/(asce)0733-9372(2002)128:7(575)","Matchings":[{"RORid":"https:\/\/ror.org\/04j198w64","Confidence":0.82}]}
{"DOI":"10.1061\/(asce)0733-9372(2002)128:7(588)","Matchings":[{"RORid":"https:\/\/ror.org\/03m8km719","Confidence":0.8660254038},{"RORid":"https:\/\/ror.org\/02aze4h65","Confidence":0.87}]}
{"DOI":"10.1161\/hy0202.103001","Matchings":[{"RORid":"https:\/\/ror.org\/057xtrt18","Confidence":0.7071067812}]}

View File

@@ -0,0 +1,4 @@
{"projectId":"40|nsf_________::d93e50d22374a1cf59f6a232413ea027","numOfInfluentialResults":0,"numOfPopularResults":10,"totalImpulse":181,"totalCitationCount":235}
{"projectId":"40|nih_________::1c93debc7085e440f245fbe70b2e8b21","numOfInfluentialResults":14,"numOfPopularResults":17,"totalImpulse":1558,"totalCitationCount":4226}
{"projectId":"40|nih_________::c02a8233e9b60f05bb418f0c9b714833","numOfInfluentialResults":0,"numOfPopularResults":1,"totalImpulse":25,"totalCitationCount":43}
{"projectId":"40|corda_______::d91dcf3a87dd7f72248fab0b8a4ba273","numOfInfluentialResults":2,"numOfPopularResults":3,"totalImpulse":78,"totalCitationCount":178}

View File

@@ -1,15 +1,44 @@
{"pdb": "1CW0", "title": "crystal structure analysis of very short patch repair (vsr) endonuclease in complex with a duplex dna", "authors": ["S.E.Tsutakawa", "H.Jingami", "K.Morikawa"], "doi": "10.1016/S0092-8674(00)81550-0", "pmid": "10612397"}
{"pdb": "2CWW", "title": "crystal structure of thermus thermophilus ttha1280, a putative sam- dependent rna methyltransferase, in complex with s-adenosyl-l- homocysteine", "authors": ["A.A.Pioszak", "K.Murayama", "N.Nakagawa", "A.Ebihara", "S.Kuramitsu", "M.Shirouzu", "S.Yokoyama", "Riken Structural Genomics/proteomics Initiative (Rsgi)"], "doi": "10.1107/S1744309105029842", "pmid": "16511182"}
{"pdb": "6CWE", "title": "structure of alpha-gsa[8,6p] bound by cd1d and in complex with the va14vb8.2 tcr", "authors": ["J.Wang", "D.Zajonc"], "doi": null, "pmid": null}
{"pdb": "5CWS", "title": "crystal structure of the intact chaetomium thermophilum nsp1-nup49- nup57 channel nucleoporin heterotrimer bound to its nic96 nuclear pore complex attachment site", "authors": ["C.J.Bley", "S.Petrovic", "M.Paduch", "V.Lu", "A.A.Kossiakoff", "A.Hoelz"], "doi": "10.1126/SCIENCE.AAC9176", "pmid": "26316600"}
{"pdb": "5CWE", "title": "structure of cyp107l2 from streptomyces avermitilis with lauric acid", "authors": ["T.-V.Pham", "S.-H.Han", "J.-H.Kim", "D.-H.Kim", "L.-W.Kang"], "doi": null, "pmid": null}
{"pdb": "7CW4", "title": "acetyl-coa acetyltransferase from bacillus cereus atcc 14579", "authors": ["J.Hong", "K.J.Kim"], "doi": "10.1016/J.BBRC.2020.09.048", "pmid": "32972748"}
{"pdb": "2CWP", "title": "crystal structure of metrs related protein from pyrococcus horikoshii", "authors": ["K.Murayama", "M.Kato-Murayama", "M.Shirouzu", "S.Yokoyama", "Riken StructuralGenomics/proteomics Initiative (Rsgi)"], "doi": null, "pmid": null}
{"pdb": "2CW7", "title": "crystal structure of intein homing endonuclease ii", "authors": ["H.Matsumura", "H.Takahashi", "T.Inoue", "H.Hashimoto", "M.Nishioka", "S.Fujiwara", "M.Takagi", "T.Imanaka", "Y.Kai"], "doi": "10.1002/PROT.20858", "pmid": "16493661"}
{"pdb": "1CWU", "title": "brassica napus enoyl acp reductase a138g mutant complexed with nad+ and thienodiazaborine", "authors": ["A.Roujeinikova", "J.B.Rafferty", "D.W.Rice"], "doi": "10.1074/JBC.274.43.30811", "pmid": "10521472"}
{"pdb": "3CWN", "title": "escherichia coli transaldolase b mutant f178y", "authors": ["T.Sandalova", "G.Schneider", "A.Samland"], "doi": "10.1074/JBC.M803184200", "pmid": "18687684"}
{"pdb": "1CWL", "title": "human cyclophilin a complexed with 4 4-hydroxy-meleu cyclosporin", "authors": ["V.Mikol", "J.Kallen", "P.Taylor", "M.D.Walkinshaw"], "doi": "10.1006/JMBI.1998.2108", "pmid": "9769216"}
{"pdb": "3CW2", "title": "crystal structure of the intact archaeal translation initiation factor 2 from sulfolobus solfataricus .", "authors": ["E.A.Stolboushkina", "S.V.Nikonov", "A.D.Nikulin", "U.Blaesi", "D.J.Manstein", "R.V.Fedorov", "M.B.Garber", "O.S.Nikonov"], "doi": "10.1016/J.JMB.2008.07.039", "pmid": "18675278"}
{"pdb": "3CW9", "title": "4-chlorobenzoyl-coa ligase/synthetase in the thioester-forming conformation, bound to 4-chlorophenacyl-coa", "authors": ["A.S.Reger", "J.Cao", "R.Wu", "D.Dunaway-Mariano", "A.M.Gulick"], "doi": "10.1021/BI800696Y", "pmid": "18620418"}
{"pdb": "3CWU", "title": "crystal structure of an alka host/guest complex 2'-fluoro-2'-deoxy-1, n6-ethenoadenine:thymine base pair", "authors": ["B.R.Bowman", "S.Lee", "S.Wang", "G.L.Verdine"], "doi": "10.1016/J.STR.2008.04.012", "pmid": "18682218"}
{"pdb": "5CWF", "title": "crystal structure of de novo designed helical repeat protein dhr8", "authors": ["G.Bhabha", "D.C.Ekiert"], "doi": "10.1038/NATURE16162", "pmid": "26675729"}
{"classification": "Signaling protein", "pdb": "5NM4", "deposition_date": "2017-04-05", "title": "A2a adenosine receptor room-temperature structure determined by serial Femtosecond crystallography", "Keywords": ["Oom-temperature", " serial crystallography", " signaling protein"], "authors": ["T.weinert", "R.cheng", "D.james", "D.gashi", "P.nogly", "K.jaeger", "M.hennig", "", "J.standfuss"], "pmid": "28912485", "doi": "10.1038/S41467-017-00630-4"}
{"classification": "Oxidoreductase/oxidoreductase inhibitor", "pdb": "4KN3", "deposition_date": "2013-05-08", "title": "Structure of the y34ns91g double mutant of dehaloperoxidase from Amphitrite ornata with 2,4,6-trichlorophenol", "Keywords": ["Lobin", " oxygen storage", " peroxidase", " oxidoreductase", " oxidoreductase-", "Oxidoreductase inhibitor complex"], "authors": ["C.wang", "L.lovelace", "L.lebioda"], "pmid": "23952341", "doi": "10.1021/BI400627W"}
{"classification": "Transport protein", "pdb": "8HKM", "deposition_date": "2022-11-27", "title": "Ion channel", "Keywords": ["On channel", " transport protein"], "authors": ["D.h.jiang", "J.t.zhang"], "pmid": "37494189", "doi": "10.1016/J.CELREP.2023.112858"}
{"classification": "Signaling protein", "pdb": "6JT1", "deposition_date": "2019-04-08", "title": "Structure of human soluble guanylate cyclase in the heme oxidised State", "Keywords": ["Oluble guanylate cyclase", " signaling protein"], "authors": ["L.chen", "Y.kang", "R.liu", "J.-x.wu"], "pmid": "31514202", "doi": "10.1038/S41586-019-1584-6"}
{"classification": "Immune system", "pdb": "7OW6", "deposition_date": "2021-06-16", "title": "Crystal structure of a tcr in complex with hla-a*11:01 bound to kras G12d peptide (vvvgadgvgk)", "Keywords": ["La", " kras", " tcr", " immune system"], "authors": ["V.karuppiah", "R.a.robinson"], "doi": "10.1038/S41467-022-32811-1"}
{"classification": "Biosynthetic protein", "pdb": "5EQ8", "deposition_date": "2015-11-12", "title": "Crystal structure of medicago truncatula histidinol-phosphate Phosphatase (mthpp) in complex with l-histidinol", "Keywords": ["Istidine biosynthesis", " metabolic pathways", " dimer", " plant", "", "Biosynthetic protein"], "authors": ["M.ruszkowski", "Z.dauter"], "pmid": "26994138", "doi": "10.1074/JBC.M115.708727"}
{"classification": "De novo protein", "pdb": "8CWA", "deposition_date": "2022-05-18", "title": "Solution nmr structure of 8-residue rosetta-designed cyclic peptide D8.21 in cdcl3 with cis/trans switching (tc conformation, 53%)", "Keywords": ["Yclic peptide", " non natural amino acids", " cis/trans", " switch peptides", "", "De novo design", "Membrane permeability", "De novo protein"], "authors": ["T.a.ramelot", "R.tejero", "G.t.montelione"], "pmid": "36041435", "doi": "10.1016/J.CELL.2022.07.019"}
{"classification": "Hydrolase", "pdb": "3R6M", "deposition_date": "2011-03-21", "title": "Crystal structure of vibrio parahaemolyticus yeaz", "Keywords": ["Ctin/hsp70 nucleotide-binding fold", " bacterial resuscitation", " viable", "But non-culturable state", "Resuscitation promoting factor", "Ygjd", "", "Yjee", "Vibrio parahaemolyticus", "Hydrolase"], "authors": ["A.roujeinikova", "I.aydin"], "pmid": "21858042", "doi": "10.1371/JOURNAL.PONE.0023245"}
{"classification": "Hydrolase", "pdb": "2W5J", "deposition_date": "2008-12-10", "title": "Structure of the c14-rotor ring of the proton translocating Chloroplast atp synthase", "Keywords": ["Ydrolase", " chloroplast", " atp synthase", " lipid-binding", " cf(0)", " membrane", "", "Transport", "Formylation", "Energy transduction", "Hydrogen ion transport", "", "Ion transport", "Transmembrane", "Membrane protein"], "authors": ["M.vollmar", "D.schlieper", "M.winn", "C.buechner", "G.groth"], "pmid": "19423706", "doi": "10.1074/JBC.M109.006916"}
{"classification": "De novo protein", "pdb": "4GLU", "deposition_date": "2012-08-14", "title": "Crystal structure of the mirror image form of vegf-a", "Keywords": ["-protein", " covalent dimer", " cysteine knot protein", " growth factor", " de", "Novo protein"], "authors": ["K.mandal", "M.uppalapati", "D.ault-riche", "J.kenney", "J.lowitz", "S.sidhu", "", "S.b.h.kent"], "pmid": "22927390", "doi": "10.1073/PNAS.1210483109"}
{"classification": "Hydrolase/hydrolase inhibitor", "pdb": "3WYL", "deposition_date": "2014-09-01", "title": "Crystal structure of the catalytic domain of pde10a complexed with 5- Methoxy-3-(1-phenyl-1h-pyrazol-5-yl)-1-(3-(trifluoromethyl)phenyl) Pyridazin-4(1h)-one", "Keywords": ["Ydrolase-hydrolase inhibitor complex"], "authors": ["H.oki", "Y.hayano"], "pmid": "25384088", "doi": "10.1021/JM5013648"}
{"classification": "Isomerase", "pdb": "5BOR", "deposition_date": "2015-05-27", "title": "Structure of acetobacter aceti pure-s57c, sulfonate form", "Keywords": ["Cidophile", " pure", " purine biosynthesis", " isomerase"], "authors": ["K.l.sullivan", "T.j.kappock"]}
{"classification": "Hydrolase", "pdb": "1X0C", "deposition_date": "2005-03-17", "title": "Improved crystal structure of isopullulanase from aspergillus niger Atcc 9642", "Keywords": ["Ullulan", " glycoside hydrolase family 49", " glycoprotein", " hydrolase"], "authors": ["M.mizuno", "T.tonozuka", "A.yamamura", "Y.miyasaka", "H.akeboshi", "S.kamitori", "", "A.nishikawa", "Y.sakano"], "pmid": "18155243", "doi": "10.1016/J.JMB.2007.11.098"}
{"classification": "Oxidoreductase", "pdb": "7CUP", "deposition_date": "2020-08-23", "title": "Structure of 2,5-dihydroxypridine dioxygenase from pseudomonas putida Kt2440", "Keywords": ["On-heme dioxygenase", " oxidoreductase"], "authors": ["G.q.liu", "H.z.tang"]}
{"classification": "Ligase", "pdb": "1VCN", "deposition_date": "2004-03-10", "title": "Crystal structure of t.th. hb8 ctp synthetase complex with sulfate Anion", "Keywords": ["Etramer", " riken structural genomics/proteomics initiative", " rsgi", "", "Structural genomics", "Ligase"], "authors": ["M.goto", "Riken structural genomics/proteomics initiative (rsgi)"], "pmid": "15296735", "doi": "10.1016/J.STR.2004.05.013"}
{"classification": "Transferase/transferase inhibitor", "pdb": "6C9V", "deposition_date": "2018-01-28", "title": "Mycobacterium tuberculosis adenosine kinase bound to (2r,3s,4r,5r)-2- (hydroxymethyl)-5-(6-(4-phenylpiperazin-1-yl)-9h-purin-9-yl) Tetrahydrofuran-3,4-diol", "Keywords": ["Ucleoside analog", " complex", " inhibitor", " structural genomics", " psi-2", "", "Protein structure initiative", "Tb structural genomics consortium", "", "Tbsgc", "Transferase-transferase inhibitor complex"], "authors": ["R.a.crespo", "Tb structural genomics consortium (tbsgc)"], "pmid": "31002508", "doi": "10.1021/ACS.JMEDCHEM.9B00020"}
{"classification": "De novo protein", "pdb": "4LPY", "deposition_date": "2013-07-16", "title": "Crystal structure of tencon variant g10", "Keywords": ["Ibronectin type iii fold", " alternate scaffold", " de novo protein"], "authors": ["A.teplyakov", "G.obmolova", "G.l.gilliland"], "pmid": "24375666", "doi": "10.1002/PROT.24502"}
{"classification": "Isomerase", "pdb": "2Y88", "deposition_date": "2011-02-03", "title": "Crystal structure of mycobacterium tuberculosis phosphoribosyl Isomerase (variant d11n) with bound prfar", "Keywords": ["Romatic amino acid biosynthesis", " isomerase", " tim-barrel", " histidine", "Biosynthesis", "Tryptophan biosynthesis"], "authors": ["J.kuper", "A.v.due", "A.geerlof", "M.wilmanns"], "pmid": "21321225", "doi": "10.1073/PNAS.1015996108"}
{"classification": "Unknown function", "pdb": "1SR0", "deposition_date": "2004-03-22", "title": "Crystal structure of signalling protein from sheep(sps-40) at 3.0a Resolution using crystal grown in the presence of polysaccharides", "Keywords": ["Ignalling protein", " involution", " unknown function"], "authors": ["D.b.srivastava", "A.s.ethayathulla", "N.singh", "J.kumar", "S.sharma", "T.p.singh"]}
{"classification": "Dna binding protein", "pdb": "3RH2", "deposition_date": "2011-04-11", "title": "Crystal structure of a tetr-like transcriptional regulator (sama_0099) From shewanella amazonensis sb2b at 2.42 a resolution", "Keywords": ["Na/rna-binding 3-helical bundle", " structural genomics", " joint center", "For structural genomics", "Jcsg", "Protein structure initiative", "Psi-", "Biology", "Dna binding protein"], "authors": ["Joint center for structural genomics (jcsg)"]}
{"classification": "Transferase", "pdb": "2WK5", "deposition_date": "2009-06-05", "title": "Structural features of native human thymidine phosphorylase And in complex with 5-iodouracil", "Keywords": ["Lycosyltransferase", " developmental protein", " angiogenesis", "", "5-iodouracil", "Growth factor", "Enzyme kinetics", "", "Differentiation", "Disease mutation", "Thymidine", "Phosphorylase", "Chemotaxis", "Transferase", "Mutagenesis", "", "Polymorphism"], "authors": ["E.mitsiki", "A.c.papageorgiou", "S.iyer", "N.thiyagarajan", "S.h.prior", "", "D.sleep", "C.finnis", "K.r.acharya"], "pmid": "19555658", "doi": "10.1016/J.BBRC.2009.06.104"}
{"classification": "Hydrolase", "pdb": "3P9Y", "deposition_date": "2010-10-18", "title": "Crystal structure of the drosophila melanogaster ssu72-pctd complex", "Keywords": ["Hosphatase", " cis proline", " lmw ptp-like fold", " rna polymerase ii ctd", "", "Hydrolase"], "authors": ["J.w.werner-allen", "P.zhou"], "pmid": "21159777", "doi": "10.1074/JBC.M110.197129"}
{"classification": "Recombination/dna", "pdb": "6OEO", "deposition_date": "2019-03-27", "title": "Cryo-em structure of mouse rag1/2 nfc complex (dna1)", "Keywords": ["(d)j recombination", " dna transposition", " rag", " scid", " recombination", "", "Recombination-dna complex"], "authors": ["X.chen", "Y.cui", "Z.h.zhou", "W.yang", "M.gellert"], "pmid": "32015552", "doi": "10.1038/S41594-019-0363-2"}
{"classification": "Hydrolase", "pdb": "4ECA", "deposition_date": "1997-02-21", "title": "Asparaginase from e. coli, mutant t89v with covalently bound aspartate", "Keywords": ["Ydrolase", " acyl-enzyme intermediate", " threonine amidohydrolase"], "authors": ["G.j.palm", "J.lubkowski", "A.wlodawer"], "pmid": "8706862", "doi": "10.1016/0014-5793(96)00660-6"}
{"classification": "Transcription/protein binding", "pdb": "3UVX", "deposition_date": "2011-11-30", "title": "Crystal structure of the first bromodomain of human brd4 in complex With a diacetylated histone 4 peptide (h4k12ack16ac)", "Keywords": ["Romodomain", " bromodomain containing protein 4", " cap", " hunk1", " mcap", "", "Mitotic chromosome associated protein", "Peptide complex", "Structural", "Genomics consortium", "Sgc", "Transcription-protein binding complex"], "authors": ["P.filippakopoulos", "S.picaud", "T.keates", "E.ugochukwu", "F.von delft", "", "C.h.arrowsmith", "A.m.edwards", "J.weigelt", "C.bountra", "S.knapp", "Structural", "Genomics consortium (sgc)"], "pmid": "22464331", "doi": "10.1016/J.CELL.2012.02.013"}
{"classification": "Membrane protein", "pdb": "1TLZ", "deposition_date": "2004-06-10", "title": "Tsx structure complexed with uridine", "Keywords": ["Ucleoside transporter", " beta barrel", " uridine", " membrane", "Protein"], "authors": ["J.ye", "B.van den berg"], "pmid": "15272310", "doi": "10.1038/SJ.EMBOJ.7600330"}
{"classification": "Dna binding protein", "pdb": "7AZD", "deposition_date": "2020-11-16", "title": "Dna polymerase sliding clamp from escherichia coli with peptide 20 Bound", "Keywords": ["Ntibacterial drug", " dna binding protein"], "authors": ["C.monsarrat", "G.compain", "C.andre", "I.martiel", "S.engilberge", "V.olieric", "", "P.wolff", "K.brillet", "M.landolfo", "C.silva da veiga", "J.wagner", "G.guichard", "", "D.y.burnouf"], "pmid": "34806883", "doi": "10.1021/ACS.JMEDCHEM.1C00918"}
{"classification": "Transferase", "pdb": "5N3K", "deposition_date": "2017-02-08", "title": "Camp-dependent protein kinase a from cricetulus griseus in complex With fragment like molecule o-guanidino-l-homoserine", "Keywords": ["Ragment", " complex", " transferase", " serine threonine kinase", " camp", "", "Kinase", "Pka"], "authors": ["C.siefker", "A.heine", "G.klebe"]}
{"classification": "Biosynthetic protein", "pdb": "8H52", "deposition_date": "2022-10-11", "title": "Crystal structure of helicobacter pylori carboxyspermidine Dehydrogenase in complex with nadp", "Keywords": ["Arboxyspermidine dehydrogenase", " biosynthetic protein"], "authors": ["K.y.ko", "S.c.park", "S.y.cho", "S.i.yoon"], "pmid": "36283333", "doi": "10.1016/J.BBRC.2022.10.049"}
{"classification": "Metal binding protein", "pdb": "6DYC", "deposition_date": "2018-07-01", "title": "Co(ii)-bound structure of the engineered cyt cb562 variant, ch3", "Keywords": ["Esigned protein", " 4-helix bundle", " electron transport", " metal binding", "Protein"], "authors": ["F.a.tezcan", "J.rittle"], "pmid": "30778140", "doi": "10.1038/S41557-019-0218-9"}
{"classification": "Protein fibril", "pdb": "6A6B", "deposition_date": "2018-06-27", "title": "Cryo-em structure of alpha-synuclein fiber", "Keywords": ["Lpha-syn fiber", " parkinson disease", " protein fibril"], "authors": ["Y.w.li", "C.y.zhao", "F.luo", "Z.liu", "X.gui", "Z.luo", "X.zhang", "D.li", "C.liu", "X.li"], "pmid": "30065316", "doi": "10.1038/S41422-018-0075-X"}
{"classification": "Dna", "pdb": "7D5E", "deposition_date": "2020-09-25", "title": "Left-handed g-quadruplex containing two bulges", "Keywords": ["-quadruplex", " bulge", " dna", " left-handed"], "authors": ["P.das", "A.maity", "K.h.ngo", "F.r.winnerdy", "B.bakalar", "Y.mechulam", "E.schmitt", "", "A.t.phan"], "pmid": "33503265", "doi": "10.1093/NAR/GKAA1259"}
{"classification": "Transferase", "pdb": "3RSY", "deposition_date": "2011-05-02", "title": "Cellobiose phosphorylase from cellulomonas uda in complex with sulfate And glycerol", "Keywords": ["H94", " alpha barrel", " cellobiose phosphorylase", " disaccharide", "Phosphorylase", "Transferase"], "authors": ["A.van hoorebeke", "J.stout", "W.soetaert", "J.van beeumen", "T.desmet", "S.savvides"]}
{"classification": "Oxidoreductase", "pdb": "7MCI", "deposition_date": "2021-04-02", "title": "Mofe protein from azotobacter vinelandii with a sulfur-replenished Cofactor", "Keywords": ["Zotobacter vinelandii", " mofe-protein", " nitrogenase", " oxidoreductase"], "authors": ["W.kang", "C.lee", "Y.hu", "M.w.ribbe"], "doi": "10.1038/S41929-022-00782-7"}
{"classification": "Dna", "pdb": "1XUW", "deposition_date": "2004-10-26", "title": "Structural rationalization of a large difference in rna affinity Despite a small difference in chemistry between two 2'-o-modified Nucleic acid analogs", "Keywords": ["Na mimetic methylcarbamate amide analog", " dna"], "authors": ["R.pattanayek", "L.sethaphong", "C.pan", "M.prhavc", "T.p.prakash", "M.manoharan", "", "M.egli"], "pmid": "15547979", "doi": "10.1021/JA044637K"}
{"classification": "Lyase", "pdb": "7C0D", "deposition_date": "2020-05-01", "title": "Crystal structure of azospirillum brasilense l-2-keto-3-deoxyarabonate Dehydratase (hydroxypyruvate-bound form)", "Keywords": ["-2-keto-3-deoxyarabonate dehydratase", " lyase"], "authors": ["Y.watanabe", "S.watanabe"], "pmid": "32697085", "doi": "10.1021/ACS.BIOCHEM.0C00515"}
{"classification": "Signaling protein", "pdb": "5LYK", "deposition_date": "2016-09-28", "title": "Crystal structure of intracellular b30.2 domain of btn3a1 bound to Citrate", "Keywords": ["30.2", " butyrophilin", " signaling protein"], "authors": ["F.mohammed", "A.t.baker", "M.salim", "B.e.willcox"], "pmid": "28862425", "doi": "10.1021/ACSCHEMBIO.7B00694"}
{"classification": "Toxin", "pdb": "4IZL", "deposition_date": "2013-01-30", "title": "Structure of the n248a mutant of the panton-valentine leucocidin s Component from staphylococcus aureus", "Keywords": ["I-component leucotoxin", " staphylococcus aureus", " s component", "Leucocidin", "Beta-barrel pore forming toxin", "Toxin"], "authors": ["L.maveyraud", "B.j.laventie", "G.prevost", "L.mourey"], "pmid": "24643034", "doi": "10.1371/JOURNAL.PONE.0092094"}
{"classification": "Dna", "pdb": "6F3C", "deposition_date": "2017-11-28", "title": "The cytotoxic [pt(h2bapbpy)] platinum complex interacting with the Cgtacg hexamer", "Keywords": ["Rug-dna complex", " four-way junction", " dna"], "authors": ["M.ferraroni", "C.bazzicalupi", "P.gratteri", "F.papi"], "pmid": "31046177", "doi": "10.1002/ANIE.201814532"}
{"classification": "Signaling protein/inhibitor", "pdb": "4L5M", "deposition_date": "2013-06-11", "title": "Complexe of arno sec7 domain with the protein-protein interaction Inhibitor n-(4-hydroxy-2,6-dimethylphenyl)benzenesulfonamide at ph6.5", "Keywords": ["Ec-7domain", " signaling protein-inhibitor complex"], "authors": ["F.hoh", "J.rouhana"], "pmid": "24112024", "doi": "10.1021/JM4009357"}
{"classification": "Signaling protein", "pdb": "5I6J", "deposition_date": "2016-02-16", "title": "Crystal structure of srgap2 f-barx", "Keywords": ["Rgap2", " f-bar", " fx", " signaling protein"], "authors": ["M.sporny", "J.guez-haddad", "M.n.isupov", "Y.opatowsky"], "pmid": "28333212", "doi": "10.1093/MOLBEV/MSX094"}
{"classification": "Metal binding protein", "pdb": "1Q80", "deposition_date": "2003-08-20", "title": "Solution structure and dynamics of nereis sarcoplasmic calcium binding Protein", "Keywords": ["Ll-alpha", " metal binding protein"], "authors": ["G.rabah", "R.popescu", "J.a.cox", "Y.engelborghs", "C.t.craescu"], "pmid": "15819893", "doi": "10.1111/J.1742-4658.2005.04629.X"}
{"classification": "Transferase", "pdb": "1TW1", "deposition_date": "2004-06-30", "title": "Beta-1,4-galactosyltransferase mutant met344his (m344h-gal-t1) complex With udp-galactose and magnesium", "Keywords": ["Et344his mutation; closed conformation; mn binding", " transferase"], "authors": ["B.ramakrishnan", "E.boeggeman", "P.k.qasba"], "pmid": "15449940", "doi": "10.1021/BI049007+"}
{"classification": "Rna", "pdb": "2PN4", "deposition_date": "2007-04-23", "title": "Crystal structure of hepatitis c virus ires subdomain iia", "Keywords": ["Cv", " ires", " subdoamin iia", " rna", " strontium", " hepatitis"], "authors": ["Q.zhao", "Q.han", "C.r.kissinger", "P.a.thompson"], "pmid": "18391410", "doi": "10.1107/S0907444908002011"}

View File

@@ -1,6 +1,36 @@
{"pid": "Q6GZX4", "dates": [{"date": "28-JUN-2011", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "19-JUL-2004", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 41."}], "title": "Putative transcription factor 001R;", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus."], "references": [{"PubMed": "15165820"}, {" DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": "Q6GZX3", "dates": [{"date": "28-JUN-2011", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "19-JUL-2004", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 42."}], "title": "Uncharacterized protein 002L;", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus."], "references": [{"PubMed": "15165820"}, {" DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": "Q197F8", "dates": [{"date": "16-JUN-2009", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "11-JUL-2006", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 27."}], "title": "Uncharacterized protein 002R;", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus."], "references": [{"PubMed": "16912294"}, {" DOI": "10.1128/jvi.00464-06"}]}
{"pid": "Q197F7", "dates": [{"date": "16-JUN-2009", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "11-JUL-2006", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 23."}], "title": "Uncharacterized protein 003L;", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus."], "references": [{"PubMed": "16912294"}, {" DOI": "10.1128/jvi.00464-06"}]}
{"pid": "Q6GZX2", "dates": [{"date": "28-JUN-2011", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "19-JUL-2004", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 36."}], "title": "Uncharacterized protein 3R;", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus."], "references": [{"PubMed": "15165820"}, {" DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": "Q6GZX1", "dates": [{"date": "28-JUN-2011", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "19-JUL-2004", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 34."}], "title": "Uncharacterized protein 004R;", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus."], "references": [{"PubMed": "15165820"}, {" DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZX4", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 43"}], "title": "Putative transcription factor 001R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZX3", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 45"}], "title": "Uncharacterized protein 002L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q197F8", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-02-23", "date_info": "entry version 29"}], "title": "Uncharacterized protein 002R", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q197F7", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2020-08-12", "date_info": "entry version 23"}], "title": "Uncharacterized protein 003L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q6GZX2", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 37"}], "title": "Uncharacterized protein 3R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZX1", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 38"}], "title": "Uncharacterized protein 004R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q197F5", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-10-12", "date_info": "entry version 32"}], "title": "Uncharacterized protein 005L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q6GZX0", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 47"}], "title": "Uncharacterized protein 005R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q91G88", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2001-12-01", "date_info": "sequence version 1"}, {"date": "2023-06-28", "date_info": "entry version 53"}], "title": "Putative KilA-N domain-containing protein 006L", "organism_species": "Invertebrate iridescent virus 6 (IIV-6) (Chilo iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Iridovirus"], "references": [{"PubMed": "17239238"}, {"DOI": "10.1186/1743-422x-4-11"}]}
{"pid": " Q6GZW9", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 34"}], "title": "Uncharacterized protein 006R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZW8", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 32"}], "title": "Uncharacterized protein 007R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q197F3", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 28"}], "title": "Uncharacterized protein 007R", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q197F2", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-02-23", "date_info": "entry version 22"}], "title": "Uncharacterized protein 008L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q6GZW6", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 67"}], "title": "Putative helicase 009L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q91G85", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2001-12-01", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 38"}], "title": "Uncharacterized protein 009R", "organism_species": "Invertebrate iridescent virus 6 (IIV-6) (Chilo iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Iridovirus"], "references": [{"PubMed": "17239238"}, {"DOI": "10.1186/1743-422x-4-11"}]}
{"pid": " Q6GZW5", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 37"}], "title": "Uncharacterized protein 010R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q197E9", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 28"}], "title": "Uncharacterized protein 011L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q6GZW4", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 37"}], "title": "Uncharacterized protein 011R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZW3", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 35"}], "title": "Uncharacterized protein 012L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q197E7", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 37"}], "title": "Uncharacterized protein IIV3-013L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q6GZW2", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 30"}], "title": "Uncharacterized protein 013R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZW1", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 35"}], "title": "Uncharacterized protein 014R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZW0", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 50"}], "title": "Uncharacterized protein 015R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZV8", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 35"}], "title": "Uncharacterized protein 017L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZV7", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 33"}], "title": "Uncharacterized protein 018L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZV6", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 87"}], "title": "Putative serine/threonine-protein kinase 019R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZV5", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 40"}], "title": "Uncharacterized protein 020R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZV4", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 35"}], "title": "Uncharacterized protein 021L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q197D8", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-12-14", "date_info": "entry version 35"}], "title": "Transmembrane protein 022L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q6GZV2", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 33"}], "title": "Uncharacterized protein 023R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q197D7", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 25"}], "title": "Uncharacterized protein 023R", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q6GZV1", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 37"}], "title": "Uncharacterized protein 024R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q197D5", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-10-12", "date_info": "entry version 24"}], "title": "Uncharacterized protein 025R", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q91G70", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2001-12-01", "date_info": "sequence version 1"}, {"date": "2020-08-12", "date_info": "entry version 32"}], "title": "Uncharacterized protein 026R", "organism_species": "Invertebrate iridescent virus 6 (IIV-6) (Chilo iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Iridovirus"], "references": [{"PubMed": "17239238"}, {"DOI": "10.1186/1743-422x-4-11"}]}
{"pid": " Q6GZU9", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 49"}], "title": "Uncharacterized protein 027R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZU8", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 55"}], "title": "Uncharacterized protein 028R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}

View File

@@ -14,10 +14,12 @@ import org.junit.jupiter.api.extension.ExtendWith
import org.junit.jupiter.api.{BeforeEach, Test}
import org.mockito.junit.jupiter.MockitoExtension
import java.io.{BufferedReader, FileInputStream, InputStream, InputStreamReader}
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.util.zip.GZIPInputStream
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer
import scala.io.Source
import scala.xml.pull.XMLEventReader
@ExtendWith(Array(classOf[MockitoExtension]))
class BioScholixTest extends AbstractVocabularyTest {
@@ -47,11 +49,11 @@ class BioScholixTest extends AbstractVocabularyTest {
@Test
def testEBIData() = {
val inputXML = getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml")
// new PubmedParser(new GZIPInputStream(new FileInputStream("/Users/sandro/Downloads/pubmed23n1078.xml.gz")))
new PMParser(new GZIPInputStream(new FileInputStream("/Users/sandro/Downloads/pubmed23n1078.xml.gz")))
print("DONE")
val inputXML = Source
.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
.mkString
val xml = new XMLEventReader(Source.fromBytes(inputXML.getBytes()))
new PMParser(xml).foreach(s => println(mapper.writeValueAsString(s)))
}
@Test
@ -87,14 +89,14 @@ class BioScholixTest extends AbstractVocabularyTest {
}
// @Test
// def testParsingPubmedXML(): Unit = {
// val xml = new XMLEventReader(
// Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
// )
// val parser = new PMParser(xml)
// parser.foreach(checkPMArticle)
// }
@Test
def testParsingPubmedXML(): Unit = {
val xml = new XMLEventReader(
Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
)
val parser = new PMParser(xml)
parser.foreach(checkPMArticle)
}
private def checkPubmedPublication(o: Oaf): Unit = {
assertTrue(o.isInstanceOf[Publication])
@ -151,19 +153,19 @@ class BioScholixTest extends AbstractVocabularyTest {
assertTrue(hasOldOpenAIREID)
}
// @Test
// def testPubmedMapping(): Unit = {
//
// val xml = new XMLEventReader(
// Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
// )
// val parser = new PMParser(xml)
// val results = ListBuffer[Oaf]()
// parser.foreach(x => results += PubMedToOaf.convert(x, vocabularies))
//
// results.foreach(checkPubmedPublication)
//
// }
@Test
def testPubmedMapping(): Unit = {
val xml = new XMLEventReader(
Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
)
val parser = new PMParser(xml)
val results = ListBuffer[Oaf]()
parser.foreach(x => results += PubMedToOaf.convert(x, vocabularies))
results.foreach(checkPubmedPublication)
}
@Test
def testPDBToOAF(): Unit = {

View File

@ -2,7 +2,9 @@
package eu.dnetlib.dhp.broker.oa.util;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;
import org.apache.spark.sql.Row;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@ -11,7 +13,7 @@ import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.broker.objects.OaBrokerMainEntity;
import eu.dnetlib.pace.config.DedupConfig;
import eu.dnetlib.pace.model.SparkDedupConfig;
import eu.dnetlib.pace.model.SparkDeduper;
import eu.dnetlib.pace.tree.support.TreeProcessor;
public class TrustUtils {
@ -20,18 +22,22 @@ public class TrustUtils {
private static DedupConfig dedupConfig;
private static SparkDedupConfig sparkDedupConfig;
private static SparkDeduper deduper;
private static final ObjectMapper mapper;
static {
mapper = new ObjectMapper();
try {
dedupConfig = mapper
.readValue(
DedupConfig.class.getResourceAsStream("/eu/dnetlib/dhp/broker/oa/dedupConfig/dedupConfig.json"),
DedupConfig.class);
sparkDedupConfig = new SparkDedupConfig(dedupConfig, 1);
dedupConfig = DedupConfig
.load(
IOUtils
.toString(
DedupConfig.class
.getResourceAsStream("/eu/dnetlib/dhp/broker/oa/dedupConfig/dedupConfig.json"),
StandardCharsets.UTF_8));
deduper = new SparkDeduper(dedupConfig);
} catch (final IOException e) {
log.error("Error loading dedupConfig, e");
}
@ -47,8 +53,8 @@ public class TrustUtils {
}
try {
final Row doc1 = sparkDedupConfig.rowFromJson(mapper.writeValueAsString(r2));
final Row doc2 = sparkDedupConfig.rowFromJson(mapper.writeValueAsString(r2));
final Row doc1 = deduper.model().rowFromJson(mapper.writeValueAsString(r1));
final Row doc2 = deduper.model().rowFromJson(mapper.writeValueAsString(r2));
final double score = new TreeProcessor(dedupConfig).computeScore(doc1, doc2);
@ -57,7 +63,7 @@ public class TrustUtils {
return TrustUtils.rescale(score, threshold);
} catch (final Exception e) {
log.error("Error computing score between results", e);
return BrokerConstants.MIN_TRUST;
throw new RuntimeException(e);
}
}
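
The rewritten block computes a pairwise similarity with TreeProcessor and then rescales it against the dedup threshold. The actual rescale implementation is not part of this diff; the sketch below only illustrates one plausible linear mapping, with the MIN_TRUST and MAX_TRUST bounds assumed:

object RescaleSketch {
  val MIN_TRUST = 0.0 // assumed lower bound, standing in for BrokerConstants.MIN_TRUST
  val MAX_TRUST = 1.0 // assumed upper bound

  // map scores in [threshold, 1] linearly onto [MIN_TRUST, MAX_TRUST]
  def rescale(score: Double, threshold: Double): Double =
    if (score >= 1.0) MAX_TRUST
    else if (score < threshold) MIN_TRUST
    else MIN_TRUST + (score - threshold) * (MAX_TRUST - MIN_TRUST) / (1.0 - threshold)

  def main(args: Array[String]): Unit =
    println(rescale(0.95, 0.9)) // 0.5 with the assumed bounds
}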

View File

@ -83,7 +83,7 @@ public class SimpleVariableJobTest {
final long n = spark
.createDataset(inputList, Encoders.STRING())
.filter((FilterFunction<String>) s -> filter(map.get(s)))
.filter((FilterFunction<String>) s -> filter(map.get(s)))
.map((MapFunction<String, String>) String::toLowerCase, Encoders.STRING())
.count();
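
The filter above, applied to a Dataset of strings, captures the driver-side map variable inside its closure. A standalone sketch of the same pattern (session setup and values are illustrative, not taken from the test):

import org.apache.spark.sql.SparkSession

object CapturedMapFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("captured-map").getOrCreate()
    import spark.implicits._
    val map = Map("a" -> 1, "b" -> 2) // captured by the closure and shipped to executors
    val n = spark.createDataset(Seq("a", "b", "c"))
      .filter(s => map.getOrElse(s, 0) > 1) // analogous to filter(map.get(s)) in the test
      .count()
    println(n) // 1
    spark.stop()
  }
}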

View File

@ -41,54 +41,18 @@
</build>
<dependencyManagement>
<dependencies>
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-bom</artifactId>
<version>1.16.0</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
<dependencies>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-common</artifactId>
<version>${project.version}</version>
<exclusions>
<exclusion>
<artifactId>log4j</artifactId>
<groupId>log4j</groupId>
</exclusion>
<exclusion>
<artifactId>annotations</artifactId>
<groupId>org.jetbrains</groupId>
</exclusion>
<exclusion>
<artifactId>slf4j-api</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-pace-core</artifactId>
<version>${project.version}</version>
<exclusions>
<exclusion>
<artifactId>jsr305</artifactId>
<groupId>com.google.code.findbugs</groupId>
</exclusion>
<exclusion>
<artifactId>javassist</artifactId>
<groupId>org.javassist</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
@ -126,17 +90,14 @@
<groupId>com.arakelian</groupId>
<artifactId>java-jq</artifactId>
</dependency>
<dependency>
<groupId>dom4j</groupId>
<artifactId>dom4j</artifactId>
</dependency>
<dependency>
<groupId>jaxen</groupId>
<artifactId>jaxen</artifactId>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
@ -155,7 +116,6 @@
<version>1.4.200</version>
<scope>test</scope>
</dependency>
</dependencies>

View File

@ -1,103 +0,0 @@
package eu.dnetlib.dhp.oa.dedup
import eu.dnetlib.dhp.application.ArgumentApplicationParser
import eu.dnetlib.dhp.oa.dedup.dsl.{Clustering, Deduper}
import eu.dnetlib.dhp.oa.dedup.model.BlockStats
import eu.dnetlib.dhp.utils.ISLookupClientFactory
import eu.dnetlib.enabling.is.lookup.rmi.{ISLookUpException, ISLookUpService}
import eu.dnetlib.pace.model.{RowDataOrderingComparator, SparkDedupConfig}
import org.apache.commons.io.IOUtils
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.DataTypes
import org.dom4j.DocumentException
import org.slf4j.LoggerFactory
import org.xml.sax.SAXException
import java.io.IOException
import java.util.stream.Collectors
object DSLExample {
private val log = LoggerFactory.getLogger(classOf[DSLExample])
@throws[Exception]
def main(args: Array[String]): Unit = {
val parser = new ArgumentApplicationParser(
IOUtils
.toString(classOf[DSLExample].getResourceAsStream("/eu/dnetlib/dhp/oa/dedup/createBlockStats_parameters.json"))
)
parser.parseArgument(args)
val conf = new SparkConf
new DSLExample(parser, AbstractSparkAction.getSparkSession(conf)).run(ISLookupClientFactory.getLookUpService(parser.get("isLookUpUrl")))
}
}
class DSLExample(parser: ArgumentApplicationParser, spark: SparkSession) extends AbstractSparkAction(parser, spark) {
def computeComparisons(blockSize: Long, slidingWindowSize: Long): Long =
if (slidingWindowSize >= blockSize) (slidingWindowSize * (slidingWindowSize - 1)) / 2
else (blockSize - slidingWindowSize + 1) * (slidingWindowSize * (slidingWindowSize - 1)) / 2
@throws[DocumentException]
@throws[IOException]
@throws[ISLookUpException]
@throws[SAXException]
override def run(isLookUpService: ISLookUpService): Unit = {
// read oozie parameters
val graphBasePath = parser.get("graphBasePath")
val isLookUpUrl = parser.get("isLookUpUrl")
val actionSetId = parser.get("actionSetId")
val workingPath = parser.get("workingPath")
val numPartitions : Int = Option(parser.get("numPartitions")).map(_.toInt).getOrElse(AbstractSparkAction.NUM_PARTITIONS)
DSLExample.log.info("graphBasePath: '{}'", graphBasePath)
DSLExample.log.info("isLookUpUrl: '{}'", isLookUpUrl)
DSLExample.log.info("actionSetId: '{}'", actionSetId)
DSLExample.log.info("workingPath: '{}'", workingPath)
// for each dedup configuration
import scala.collection.JavaConversions._
for (dedupConf <- getConfigurations(isLookUpService, actionSetId).subList(0, 1)) {
val subEntity = dedupConf.getWf.getSubEntityValue
DSLExample.log.info("Creating blockstats for: '{}'", subEntity)
val outputPath = DedupUtility.createBlockStatsPath(workingPath, actionSetId, subEntity)
AbstractSparkAction.removeOutputDir(spark, outputPath)
val sparkConfig = SparkDedupConfig(dedupConf, numPartitions)
val inputDF = spark.read
.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
.transform(sparkConfig.modelExtractor)
val simRels = inputDF
.transform(sparkConfig.generateClusters)
.filter(functions.size(new Column("block")).geq(new Literal(1, DataTypes.IntegerType)))
val deduper = Deduper(inputDF.schema)
.withClustering( Clustering("sortedngrampairs"),
Clustering("sortedngrampairs", Seq("legalname"), Map("max" -> 2, "ngramLen" -> 3)),
Clustering("suffixprefix", Seq("legalname"), Map("max" -> 1, "len" -> 3)),
Clustering("urlclustering", Seq("websiteurl")),
Clustering("keywordsclustering", Seq("fields"), Map("max" -> 2, "windowSize" -> 4))
)
simRels
.map[BlockStats](
(b:Row) => {
val documents = b.getList(1)
val mapDocuments = documents.stream
.sorted(new RowDataOrderingComparator(sparkConfig.orderingFieldPosition))
.limit(dedupConf.getWf.getQueueMaxSize)
.collect(Collectors.toList)
new BlockStats(
b.getString(0),
mapDocuments.size.toLong,
computeComparisons(mapDocuments.size.toLong, dedupConf.getWf.getSlidingWindowSize.toLong)
)
})(Encoders.bean[BlockStats](classOf[BlockStats]))
.write
.mode(SaveMode.Overwrite)
.save(outputPath)
}
}
}
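
The computeComparisons helper above bounds the number of candidate pairs produced per block under the sliding-window strategy. A quick standalone check of the formula, with sizes chosen only for illustration:

object ComputeComparisonsCheck {
  def computeComparisons(blockSize: Long, slidingWindowSize: Long): Long =
    if (slidingWindowSize >= blockSize) (slidingWindowSize * (slidingWindowSize - 1)) / 2
    else (blockSize - slidingWindowSize + 1) * (slidingWindowSize * (slidingWindowSize - 1)) / 2

  def main(args: Array[String]): Unit = {
    println(computeComparisons(100, 10)) // (100 - 10 + 1) * (10 * 9) / 2 = 91 * 45 = 4095
    println(computeComparisons(5, 10))   // window covers the whole block: (10 * 9) / 2 = 45
  }
}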

View File

@ -110,6 +110,10 @@ public class DedupRecordFactory {
// set authors and date
if (ModelSupport.isSubClass(entity, Result.class)) {
Optional
.ofNullable(((Result) entity).getAuthor())
.ifPresent(a -> authors.add(a));
((Result) entity).setAuthor(AuthorMerger.merge(authors));
}

View File

@ -3,12 +3,8 @@ package eu.dnetlib.dhp.oa.dedup;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.spark.SparkContext;
import org.apache.spark.util.LongAccumulator;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;

View File

@ -3,17 +3,13 @@ package eu.dnetlib.dhp.oa.dedup;
import java.io.IOException;
import java.util.Collection;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import org.apache.commons.io.IOUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.*;
import org.apache.spark.sql.catalyst.expressions.Literal;
import org.apache.spark.sql.types.DataTypes;
import org.dom4j.DocumentException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@ -25,8 +21,7 @@ import eu.dnetlib.dhp.utils.ISLookupClientFactory;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
import eu.dnetlib.pace.config.DedupConfig;
import eu.dnetlib.pace.model.RowDataOrderingComparator;
import eu.dnetlib.pace.model.SparkDedupConfig;
import eu.dnetlib.pace.model.SparkDeduper;
public class SparkBlockStats extends AbstractSparkAction {
@ -90,27 +85,28 @@ public class SparkBlockStats extends AbstractSparkAction {
JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
SparkDedupConfig sparkConfig = new SparkDedupConfig(dedupConf, numPartitions);
SparkDeduper deduper = new SparkDeduper(dedupConf);
Dataset<Row> inputDF = spark
Dataset<Row> simRels = spark
.read()
.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
.transform(sparkConfig.modelExtractor());
Dataset<Row> simRels = inputDF
.transform(sparkConfig.generateClusters())
.filter(functions.size(new Column("block")).geq(new Literal(1, DataTypes.IntegerType)));
.transform(deduper.model().parseJsonDataset())
.transform(deduper.filterAndCleanup())
.transform(deduper.generateClustersWithCollect())
.filter(functions.size(new Column("block")).geq(1));
simRels.map((MapFunction<Row, BlockStats>) b -> {
Collection<Row> documents = b.getList(1);
simRels.map((MapFunction<Row, BlockStats>) row -> {
Collection<Row> mapDocuments = row.getList(row.fieldIndex("block"));
List<Row> mapDocuments = documents
.stream()
.sorted(new RowDataOrderingComparator(sparkConfig.orderingFieldPosition()))
.limit(dedupConf.getWf().getQueueMaxSize())
.collect(Collectors.toList());
/*
* List<Row> mapDocuments = documents .stream() .sorted( new
* RowDataOrderingComparator(deduper.model().orderingFieldPosition(),
* deduper.model().identityFieldPosition())) .limit(dedupConf.getWf().getQueueMaxSize())
* .collect(Collectors.toList());
*/
return new BlockStats(
b.getString(0),
row.getString(row.fieldIndex("key")),
(long) mapDocuments.size(),
computeComparisons(
(long) mapDocuments.size(), (long) dedupConf.getWf().getSlidingWindowSize()));

View File

@ -0,0 +1,78 @@
package eu.dnetlib.dhp.oa.dedup
import eu.dnetlib.dhp.application.ArgumentApplicationParser
import eu.dnetlib.dhp.common.HdfsSupport
import eu.dnetlib.dhp.schema.oaf.Relation
import eu.dnetlib.dhp.utils.ISLookupClientFactory
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService
import org.apache.commons.io.IOUtils
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import org.slf4j.LoggerFactory
object SparkCleanRelation {
private val log = LoggerFactory.getLogger(classOf[SparkCleanRelation])
@throws[Exception]
def main(args: Array[String]): Unit = {
val parser = new ArgumentApplicationParser(
IOUtils.toString(
classOf[SparkCleanRelation].getResourceAsStream("/eu/dnetlib/dhp/oa/dedup/cleanRelation_parameters.json")
)
)
parser.parseArgument(args)
val conf = new SparkConf
new SparkCleanRelation(parser, AbstractSparkAction.getSparkSession(conf))
.run(ISLookupClientFactory.getLookUpService(parser.get("isLookUpUrl")))
}
}
class SparkCleanRelation(parser: ArgumentApplicationParser, spark: SparkSession)
extends AbstractSparkAction(parser, spark) {
override def run(isLookUpService: ISLookUpService): Unit = {
val graphBasePath = parser.get("graphBasePath")
val inputPath = parser.get("inputPath")
val outputPath = parser.get("outputPath")
SparkCleanRelation.log.info("graphBasePath: '{}'", graphBasePath)
SparkCleanRelation.log.info("inputPath: '{}'", inputPath)
SparkCleanRelation.log.info("outputPath: '{}'", outputPath)
AbstractSparkAction.removeOutputDir(spark, outputPath)
val entities =
Seq("datasource", "project", "organization", "publication", "dataset", "software", "otherresearchproduct")
val idsSchema = StructType.fromDDL("`id` STRING, `dataInfo` STRUCT<`deletedbyinference`:BOOLEAN,`invisible`:BOOLEAN>")
val emptyIds = spark.createDataFrame(spark.sparkContext.emptyRDD[Row].setName("empty"),
idsSchema)
val ids = entities
.foldLeft(emptyIds)((ds, entity) => {
val entityPath = graphBasePath + '/' + entity
if (HdfsSupport.exists(entityPath, spark.sparkContext.hadoopConfiguration)) {
ds.union(spark.read.schema(idsSchema).json(entityPath))
} else {
ds
}
})
.filter("dataInfo.deletedbyinference != true AND dataInfo.invisible != true")
.select("id")
.distinct()
val relations = spark.read.schema(Encoders.bean(classOf[Relation]).schema).json(inputPath)
.filter("dataInfo.deletedbyinference != true AND dataInfo.invisible != true")
AbstractSparkAction.save(
relations
.join(ids, col("source") === ids("id"), "leftsemi")
.join(ids, col("target") === ids("id"), "leftsemi"),
outputPath,
SaveMode.Overwrite
)
}
}
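
SparkCleanRelation keeps a relation only when both its source and target ids survive in some entity table, which is exactly what the two leftsemi joins express. A toy standalone sketch of that pruning (the ids and relations are made-up values):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object LeftSemiPruneSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("leftsemi-prune").getOrCreate()
    import spark.implicits._
    val ids = Seq("id1", "id2").toDF("id")
    val rels = Seq(("id1", "id2"), ("id1", "id9"), ("id9", "id2")).toDF("source", "target")
    val kept = rels
      .join(ids, col("source") === ids("id"), "leftsemi") // source must exist
      .join(ids, col("target") === ids("id"), "leftsemi") // target must exist
    kept.show() // only (id1, id2) survives; dangling relations are dropped
    spark.stop()
  }
}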

View File

@ -23,7 +23,7 @@ import eu.dnetlib.dhp.utils.ISLookupClientFactory;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
import eu.dnetlib.pace.config.DedupConfig;
import eu.dnetlib.pace.model.SparkDedupConfig;
import eu.dnetlib.pace.model.SparkDeduper;
public class SparkCreateSimRels extends AbstractSparkAction {
@ -84,20 +84,14 @@ public class SparkCreateSimRels extends AbstractSparkAction {
JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
SparkDedupConfig sparkConfig = new SparkDedupConfig(dedupConf, numPartitions);
spark.udf().register("collect_sort_slice", sparkConfig.collectSortSliceUDAF());
SparkDeduper deduper = new SparkDeduper(dedupConf);
Dataset<?> simRels = spark
.read()
.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
.transform(sparkConfig.modelExtractor()) // Extract fields from input json column according to model
// definition
.transform(sparkConfig.generateClustersWithWindows()) // generate <key,block> pairs according to
// filters, clusters, and model
// definition
.transform(sparkConfig.processClusters()) // process blocks and emits <from,to> pairs of found
// similarities
.transform(deduper.model().parseJsonDataset())
.transform(deduper.dedup())
.distinct()
.map(
(MapFunction<Row, Relation>) t -> DedupUtility
.createSimRel(t.getStruct(0).getString(0), t.getStruct(0).getString(1), entity),

View File

@ -3,13 +3,18 @@ package eu.dnetlib.dhp.oa.dedup;
import static org.apache.spark.sql.functions.col;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.Objects;
import org.apache.commons.beanutils.BeanUtils;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.ReduceFunction;
import org.apache.spark.sql.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@ -28,9 +33,9 @@ public class SparkPropagateRelation extends AbstractSparkAction {
private static final Logger log = LoggerFactory.getLogger(SparkPropagateRelation.class);
enum FieldType {
SOURCE, TARGET
}
private static Encoder<Relation> REL_BEAN_ENC = Encoders.bean(Relation.class);
private static Encoder<Relation> REL_KRYO_ENC = Encoders.kryo(Relation.class);
public SparkPropagateRelation(ArgumentApplicationParser parser, SparkSession spark) {
super(parser, spark);
@ -71,38 +76,62 @@ public class SparkPropagateRelation extends AbstractSparkAction {
Dataset<Relation> mergeRels = spark
.read()
.load(DedupUtility.createMergeRelPath(workingPath, "*", "*"))
.as(Encoders.bean(Relation.class));
.as(REL_BEAN_ENC);
// <mergedObjectID, dedupID>
Dataset<Tuple2<String, String>> mergedIds = mergeRels
Dataset<Row> mergedIds = mergeRels
.where(col("relClass").equalTo(ModelConstants.MERGES))
.select(col("source"), col("target"))
.select(col("source").as("dedupID"), col("target").as("mergedObjectID"))
.distinct()
.map(
(MapFunction<Row, Tuple2<String, String>>) r -> new Tuple2<>(r.getString(1), r.getString(0)),
Encoders.tuple(Encoders.STRING(), Encoders.STRING()))
.cache();
final String relationPath = DedupUtility.createEntityPath(graphBasePath, "relation");
Dataset<Row> allRels = spark
.read()
.schema(REL_BEAN_ENC.schema())
.json(DedupUtility.createEntityPath(graphBasePath, "relation"));
Dataset<Relation> rels = spark.read().textFile(relationPath).map(patchRelFn(), Encoders.bean(Relation.class));
Dataset<Relation> dedupedRels = allRels
.joinWith(mergedIds, allRels.col("source").equalTo(mergedIds.col("mergedObjectID")), "left_outer")
.joinWith(mergedIds, col("_1.target").equalTo(mergedIds.col("mergedObjectID")), "left_outer")
.select("_1._1", "_1._2.dedupID", "_2.dedupID")
.as(Encoders.tuple(REL_BEAN_ENC, Encoders.STRING(), Encoders.STRING()))
.flatMap(SparkPropagateRelation::addInferredRelations, REL_KRYO_ENC);
Dataset<Relation> newRels = createNewRels(rels, mergedIds, getFixRelFn());
Dataset<Relation> processedRelations = distinctRelations(
dedupedRels.union(mergeRels.map((MapFunction<Relation, Relation>) r -> r, REL_KRYO_ENC)))
.filter((FilterFunction<Relation>) r -> !Objects.equals(r.getSource(), r.getTarget()));
Dataset<Relation> updated = processDataset(
processDataset(rels, mergedIds, FieldType.SOURCE, getDeletedFn()),
mergedIds,
FieldType.TARGET,
getDeletedFn());
save(processedRelations, outputRelationPath, SaveMode.Overwrite);
}
save(
distinctRelations(
newRels
.union(updated)
.union(mergeRels)
.map((MapFunction<Relation, Relation>) r -> r, Encoders.kryo(Relation.class)))
.filter((FilterFunction<Relation>) r -> !Objects.equals(r.getSource(), r.getTarget())),
outputRelationPath, SaveMode.Overwrite);
private static Iterator<Relation> addInferredRelations(Tuple3<Relation, String, String> t) throws Exception {
Relation existingRel = t._1();
String newSource = t._2();
String newTarget = t._3();
if (newSource == null && newTarget == null) {
return Collections.singleton(t._1()).iterator();
}
// update existing relation
if (existingRel.getDataInfo() == null) {
existingRel.setDataInfo(new DataInfo());
}
existingRel.getDataInfo().setDeletedbyinference(true);
// Create new relation inferred by dedupIDs
Relation inferredRel = (Relation) BeanUtils.cloneBean(existingRel);
inferredRel.setDataInfo((DataInfo) BeanUtils.cloneBean(existingRel.getDataInfo()));
inferredRel.getDataInfo().setDeletedbyinference(false);
if (newSource != null)
inferredRel.setSource(newSource);
if (newTarget != null)
inferredRel.setTarget(newTarget);
return Arrays.asList(existingRel, inferredRel).iterator();
}
private Dataset<Relation> distinctRelations(Dataset<Relation> rels) {
@ -110,54 +139,13 @@ public class SparkPropagateRelation extends AbstractSparkAction {
.filter(getRelationFilterFunction())
.groupByKey(
(MapFunction<Relation, String>) r -> String
.join(r.getSource(), r.getTarget(), r.getRelType(), r.getSubRelType(), r.getRelClass()),
.join(" ", r.getSource(), r.getTarget(), r.getRelType(), r.getSubRelType(), r.getRelClass()),
Encoders.STRING())
.agg(new RelationAggregator().toColumn())
.map((MapFunction<Tuple2<String, Relation>, Relation>) Tuple2::_2, Encoders.bean(Relation.class));
}
// redirect the relations to the dedupID
private static Dataset<Relation> createNewRels(
Dataset<Relation> rels, // all the relations to be redirected
Dataset<Tuple2<String, String>> mergedIds, // merge rels: <mergedObjectID, dedupID>
MapFunction<Tuple2<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>, Tuple2<String, String>>, Relation> mapRel) {
// <sourceID, relation, targetID>
Dataset<Tuple3<String, Relation, String>> mapped = rels
.map(
(MapFunction<Relation, Tuple3<String, Relation, String>>) r -> new Tuple3<>(getId(r, FieldType.SOURCE),
r, getId(r, FieldType.TARGET)),
Encoders.tuple(Encoders.STRING(), Encoders.kryo(Relation.class), Encoders.STRING()));
// < <sourceID, relation, target>, <sourceID, dedupID> >
Dataset<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>> relSource = mapped
.joinWith(mergedIds, mapped.col("_1").equalTo(mergedIds.col("_1")), "left_outer");
// < <<sourceID, relation, targetID>, <sourceID, dedupID>>, <targetID, dedupID> >
Dataset<Tuple2<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>, Tuple2<String, String>>> relSourceTarget = relSource
.joinWith(mergedIds, relSource.col("_1._3").equalTo(mergedIds.col("_1")), "left_outer");
return relSourceTarget
.filter(
(FilterFunction<Tuple2<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>, Tuple2<String, String>>>) r -> r
._1()
._1() != null || r._2() != null)
.map(mapRel, Encoders.bean(Relation.class))
.distinct();
}
private static Dataset<Relation> processDataset(
Dataset<Relation> rels,
Dataset<Tuple2<String, String>> mergedIds,
FieldType type,
MapFunction<Tuple2<Tuple2<String, Relation>, Tuple2<String, String>>, Relation> mapFn) {
final Dataset<Tuple2<String, Relation>> mapped = rels
.map(
(MapFunction<Relation, Tuple2<String, Relation>>) r -> new Tuple2<>(getId(r, type), r),
Encoders.tuple(Encoders.STRING(), Encoders.kryo(Relation.class)));
return mapped
.joinWith(mergedIds, mapped.col("_1").equalTo(mergedIds.col("_1")), "left_outer")
.map(mapFn, Encoders.bean(Relation.class));
.reduceGroups((ReduceFunction<Relation>) (b, a) -> {
b.mergeFrom(a);
return b;
})
.map((MapFunction<Tuple2<String, Relation>, Relation>) Tuple2::_2, REL_BEAN_ENC);
}
private FilterFunction<Relation> getRelationFilterFunction() {
@ -167,52 +155,4 @@ public class SparkPropagateRelation extends AbstractSparkAction {
StringUtils.isNotBlank(r.getSubRelType()) ||
StringUtils.isNotBlank(r.getRelClass());
}
private static String getId(Relation r, FieldType type) {
switch (type) {
case SOURCE:
return r.getSource();
case TARGET:
return r.getTarget();
default:
throw new IllegalArgumentException("");
}
}
private static MapFunction<Tuple2<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>, Tuple2<String, String>>, Relation> getFixRelFn() {
return value -> {
Relation r = value._1()._1()._2();
String newSource = value._1()._2() != null ? value._1()._2()._2() : null;
String newTarget = value._2() != null ? value._2()._2() : null;
if (r.getDataInfo() == null) {
r.setDataInfo(new DataInfo());
}
r.getDataInfo().setDeletedbyinference(false);
if (newSource != null)
r.setSource(newSource);
if (newTarget != null)
r.setTarget(newTarget);
return r;
};
}
private static MapFunction<Tuple2<Tuple2<String, Relation>, Tuple2<String, String>>, Relation> getDeletedFn() {
return value -> {
if (value._2() != null) {
Relation r = value._1()._2();
if (r.getDataInfo() == null) {
r.setDataInfo(new DataInfo());
}
r.getDataInfo().setDeletedbyinference(true);
return r;
}
return value._1()._2();
};
}
}
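
One detail worth calling out in distinctRelations: java.lang.String.join treats its first argument as the delimiter, so the previous grouping key String.join(r.getSource(), r.getTarget(), ...) silently used the source id as the separator between the remaining fields. A two-line demonstration:

object StringJoinKeyCheck {
  def main(args: Array[String]): Unit = {
    val (source, target, relClass) = ("s1", "t1", "cites")
    println(String.join(source, target, relClass))      // "t1s1cites": source became the delimiter
    println(String.join(" ", source, target, relClass)) // "s1 t1 cites": the intended grouping key
  }
}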

View File

@ -1,118 +0,0 @@
package eu.dnetlib.dhp.oa.dedup;
import java.io.IOException;
import java.util.Optional;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.*;
import org.dom4j.DocumentException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.xml.sax.SAXException;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.application.dedup.log.DedupLogModel;
import eu.dnetlib.dhp.application.dedup.log.DedupLogWriter;
import eu.dnetlib.dhp.schema.oaf.Relation;
import eu.dnetlib.dhp.utils.ISLookupClientFactory;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
import eu.dnetlib.pace.config.DedupConfig;
import eu.dnetlib.pace.model.SparkDedupConfig;
public class SparkSimRelsAnalytics extends AbstractSparkAction {
private static final Logger log = LoggerFactory.getLogger(SparkSimRelsAnalytics.class);
public SparkSimRelsAnalytics(ArgumentApplicationParser parser, SparkSession spark) {
super(parser, spark);
spark.sparkContext().setLogLevel("WARN");
}
public static void main(String[] args) throws Exception {
ArgumentApplicationParser parser = new ArgumentApplicationParser(
IOUtils
.toString(
SparkSimRelsAnalytics.class
.getResourceAsStream(
"/eu/dnetlib/dhp/oa/dedup/createSimRels_parameters.json")));
parser.parseArgument(args);
SparkConf conf = new SparkConf();
new SparkSimRelsAnalytics(parser, getSparkSession(conf))
.run(ISLookupClientFactory.getLookUpService(parser.get("isLookUpUrl")));
}
@Override
public void run(ISLookUpService isLookUpService)
throws DocumentException, IOException, ISLookUpException, SAXException {
// read oozie parameters
final String graphBasePath = parser.get("graphBasePath");
final String isLookUpUrl = parser.get("isLookUpUrl");
final String actionSetId = parser.get("actionSetId");
final String workingPath = parser.get("workingPath");
final int numPartitions = Optional
.ofNullable(parser.get("numPartitions"))
.map(Integer::valueOf)
.orElse(NUM_PARTITIONS);
log.info("numPartitions: '{}'", numPartitions);
log.info("graphBasePath: '{}'", graphBasePath);
log.info("isLookUpUrl: '{}'", isLookUpUrl);
log.info("actionSetId: '{}'", actionSetId);
log.info("workingPath: '{}'", workingPath);
final String dfLogPath = parser.get("dataframeLog");
final String runTag = Optional.ofNullable(parser.get("runTAG")).orElse("UNKNOWN");
// for each dedup configuration
for (DedupConfig dedupConf : getConfigurations(isLookUpService, actionSetId)) {
final long start = System.currentTimeMillis();
final String entity = dedupConf.getWf().getEntityType();
final String subEntity = dedupConf.getWf().getSubEntityValue();
log.info("Creating simrels for: '{}'", subEntity);
final String outputPath = DedupUtility.createSimRelPath(workingPath, actionSetId, subEntity);
removeOutputDir(spark, outputPath);
JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
SparkDedupConfig sparkConfig = new SparkDedupConfig(dedupConf, numPartitions);
spark.udf().register("collect_sort_slice", sparkConfig.collectSortSliceUDAF());
Dataset<?> simRels = spark
.read()
.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
.transform(sparkConfig.modelExtractor()) // Extract fields from input json column according to model
// definition
.transform(sparkConfig.generateClustersWithWindows()) // generate <key,block> pairs according to
// filters, clusters, and model
// definition
.transform(sparkConfig.processClusters()) // process blocks and emits <from,to> pairs of found
// similarities
.map(
(MapFunction<Row, Relation>) t -> DedupUtility
.createSimRel(t.getStruct(0).getString(0), t.getStruct(0).getString(1), entity),
Encoders.bean(Relation.class));
saveParquet(simRels, outputPath, SaveMode.Overwrite);
final long end = System.currentTimeMillis();
if (StringUtils.isNotBlank(dfLogPath)) {
final DedupLogModel model = new DedupLogModel(runTag, dedupConf.toString(), subEntity, start, end,
end - start);
new DedupLogWriter(dfLogPath).appendLog(model, spark);
}
}
}
}

View File

@ -104,18 +104,6 @@ public class SparkWhitelistSimRels extends AbstractSparkAction {
.join(entities, whiteListRels1.col("to").equalTo(entities.col("id")), "inner")
.select("from", "to");
// Dataset<Tuple2<String, String>> whiteListRels1 = whiteListRels
// .joinWith(entities, whiteListRels.col("_1").equalTo(entities.col("_1")), "inner")
// .map(
// (MapFunction<Tuple2<Tuple2<String, String>, Tuple2<String, String>>, Tuple2<String, String>>) Tuple2::_1,
// Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
//
// Dataset<Tuple2<String, String>> whiteListRels2 = whiteListRels1
// .joinWith(entities, whiteListRels1.col("_2").equalTo(entities.col("_1")), "inner")
// .map(
// (MapFunction<Tuple2<Tuple2<String, String>, Tuple2<String, String>>, Tuple2<String, String>>) Tuple2::_1,
// Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
Dataset<Relation> whiteListSimRels = whiteListRels2
.map(
(MapFunction<Row, Relation>) r -> DedupUtility

View File

@ -1,15 +0,0 @@
package eu.dnetlib.dhp.oa.dedup.dsl
case class Clustering(name: String = "",
fields: Seq[String] = Seq(),
params: Map[String,Int] = Map()) {
def withName(name: String) : Clustering =
copy(name = name)
def withFields(fields: String*): Clustering =
copy(fields = fields)
def withParams(params: Map[String,Int]): Clustering =
copy(params = params)
}
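
For reference, the builder methods of this removed DSL compose as follows (a usage sketch against the case class above; the clustering name and params echo those in DSLExample earlier in this diff):

object ClusteringDslCheck {
  def main(args: Array[String]): Unit = {
    val byLegalName = Clustering("suffixprefix")
      .withFields("legalname")
      .withParams(Map("max" -> 1, "len" -> 3))
    // equivalent to Clustering("suffixprefix", Seq("legalname"), Map("max" -> 1, "len" -> 3))
    println(byLegalName)
  }
}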

View File

@ -1,11 +0,0 @@
package eu.dnetlib.dhp.oa.dedup.dsl
import org.apache.spark.sql.types.StructType
case class Deduper (schema: StructType,
clusterings: Seq[Clustering] = Seq()) {
def withClustering(clusterings: Clustering*) =
copy(clusterings = clusterings)
}

View File

@ -0,0 +1,20 @@
[
{
"paramName": "i",
"paramLongName": "graphBasePath",
"paramDescription": "the base path of raw graph",
"paramRequired": true
},
{
"paramName": "w",
"paramLongName": "inputPath",
"paramDescription": "the path to the input relation to cleanup",
"paramRequired": true
},
{
"paramName": "o",
"paramLongName": "outputPath",
"paramDescription": "the path of the output relation cleaned",
"paramRequired": true
}
]

View File

@ -15,4 +15,8 @@
<name>oozie.action.sharelib.for.spark</name>
<value>spark2</value>
</property>
<property>
<name>sparkExecutorMemoryOverhead</name>
<value>1G</value>
</property>
</configuration>

View File

@ -12,19 +12,26 @@
<name>graphOutputPath</name>
<description>path of the output graph</description>
</property>
<property>
<name>filterInvisible</name>
<description>whether to filter out invisible entities after the merge</description>
</property>
<property>
<name>sparkDriverMemory</name>
<description>memory for driver process</description>
<description>heap memory for driver process</description>
</property>
<property>
<name>sparkExecutorMemory</name>
<description>memory for individual executor</description>
<description>heap memory for individual executor</description>
</property>
<property>
<name>sparkExecutorMemoryOverhead</name>
<description>off-heap memory for individual executor</description>
</property>
<property>
<name>sparkExecutorCores</name>
<description>number of cores used by a single executor</description>
</property>
<property>
<name>oozieActionShareLibForSpark2</name>
<description>oozie action sharelib for spark 2.*</description>
@ -83,6 +90,7 @@
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-memory=${sparkExecutorMemory}
--conf spark.executor.memoryOverhead=${sparkExecutorMemoryOverhead}
--executor-cores=${sparkExecutorCores}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
@ -92,9 +100,35 @@
--conf spark.sql.shuffle.partitions=15000
</spark-opts>
<arg>--graphBasePath</arg><arg>${graphBasePath}</arg>
<arg>--o</arg><arg>${graphOutputPath}</arg>
<arg>--graphOutputPath</arg><arg>${workingPath}/propagaterelation/</arg>
<arg>--workingPath</arg><arg>${workingPath}</arg>
</spark>
<ok to="CleanRelation"/>
<error to="Kill"/>
</action>
<action name="CleanRelation">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Clean Relations</name>
<class>eu.dnetlib.dhp.oa.dedup.SparkCleanRelation</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-memory=${sparkExecutorMemory}
--conf spark.executor.memoryOverhead=${sparkExecutorMemoryOverhead}
--executor-cores=${sparkExecutorCores}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=15000
</spark-opts>
<arg>--graphBasePath</arg><arg>${graphBasePath}</arg>
<arg>--inputPath</arg><arg>${workingPath}/propagaterelation/relation</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}/relation</arg>
</spark>
<ok to="group_entities"/>
<error to="Kill"/>
</action>
@ -107,8 +141,9 @@
<class>eu.dnetlib.dhp.oa.merge.GroupEntitiesSparkJob</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--conf spark.executor.memoryOverhead=${sparkExecutorMemoryOverhead}
--executor-cores=${sparkExecutorCores}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
@ -119,30 +154,21 @@
<arg>--graphInputPath</arg><arg>${graphBasePath}</arg>
<arg>--outputPath</arg><arg>${workingPath}/grouped_entities</arg>
</spark>
<ok to="fork_dispatch_entities"/>
<ok to="dispatch_entities"/>
<error to="Kill"/>
</action>
<fork name="fork_dispatch_entities">
<path start="dispatch_datasource"/>
<path start="dispatch_project"/>
<path start="dispatch_organization"/>
<path start="dispatch_publication"/>
<path start="dispatch_dataset"/>
<path start="dispatch_software"/>
<path start="dispatch_otherresearchproduct"/>
</fork>
<action name="dispatch_datasource">
<action name="dispatch_entities">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Dispatch publications</name>
<name>Dispatch grouped entities</name>
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--conf spark.executor.memoryOverhead=${sparkExecutorMemoryOverhead}
--executor-cores=${sparkExecutorCores}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
@ -151,164 +177,12 @@
--conf spark.sql.shuffle.partitions=7680
</spark-opts>
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}/datasource</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Datasource</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}</arg>
<arg>--filterInvisible</arg><arg>${filterInvisible}</arg>
</spark>
<ok to="wait_dispatch"/>
<ok to="End"/>
<error to="Kill"/>
</action>
<action name="dispatch_project">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Dispatch project</name>
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=7680
</spark-opts>
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}/project</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Project</arg>
</spark>
<ok to="wait_dispatch"/>
<error to="Kill"/>
</action>
<action name="dispatch_organization">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Dispatch organization</name>
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=7680
</spark-opts>
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}/organization</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Organization</arg>
</spark>
<ok to="wait_dispatch"/>
<error to="Kill"/>
</action>
<action name="dispatch_publication">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Dispatch publication</name>
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=7680
</spark-opts>
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}/publication</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Publication</arg>
</spark>
<ok to="wait_dispatch"/>
<error to="Kill"/>
</action>
<action name="dispatch_dataset">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Dispatch dataset</name>
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=7680
</spark-opts>
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}/dataset</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Dataset</arg>
</spark>
<ok to="wait_dispatch"/>
<error to="Kill"/>
</action>
<action name="dispatch_software">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Dispatch software</name>
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=7680
</spark-opts>
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}/software</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Software</arg>
</spark>
<ok to="wait_dispatch"/>
<error to="Kill"/>
</action>
<action name="dispatch_otherresearchproduct">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Dispatch otherresearchproduct</name>
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=7680
</spark-opts>
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}/otherresearchproduct</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.OtherResearchProduct</arg>
</spark>
<ok to="wait_dispatch"/>
<error to="Kill"/>
</action>
<join name="wait_dispatch" to="End"/>
<end name="End"/>
</workflow-app>
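
The fork of seven per-type dispatch actions collapses into the single dispatch_entities action above, parameterized by filterInvisible. The sketch below only illustrates the shape of such a single pass; the discriminator column, the paths, and the internals of DispatchEntitiesSparkJob are assumptions, not the actual implementation:

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

object DispatchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dispatch-sketch").getOrCreate()
    val entities = Seq("datasource", "project", "organization", "publication",
      "dataset", "software", "otherresearchproduct")
    val filterInvisible = true
    val grouped = spark.read.json("/tmp/grouped_entities") // placeholder input path
    entities.foreach { entity =>
      val perType = grouped.filter(col("entityType") === entity) // assumed discriminator column
      val out = if (filterInvisible) perType.filter(col("dataInfo.invisible") =!= true) else perType
      out.write.mode(SaveMode.Overwrite).json(s"/tmp/graph/$entity") // placeholder output path
    }
    spark.stop()
  }
}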

View File

@ -13,10 +13,6 @@
</property>
<property>
<name>oozie.action.sharelib.for.spark</name>
<value>spark342</value>
</property>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
<value>spark2</value>
</property>
</configuration>

View File

@ -126,25 +126,15 @@
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=5000
--conf spark.driver.extraJavaOptions="-Xss256k"
--conf spark.executor.extraJavaOptions="-Dlog4j.configuration=spark-log4j.properties -Xss256k"
--conf spark.extraListeners=
--conf spark.sql.queryExecutionListeners=
--conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.minExecutors=100 --conf spark.dynamicAllocation.shuffleTracking.enabled=true
--conf spark.network.io.preferDirectBufs=true --conf spark.memory.fraction=0.4 --conf spark.sql.adaptive.coalescePartitions.minPartitionNum=5000
--conf spark.shuffle.useOldFetchProtocol=true --conf spark.shuffle.service.enabled=true --conf spark.eventLog.enabled=true
--conf spark.executor.heartbeatInterval=60s
--conf spark.network.timeout=640s
--conf spark.sql.legacy.allowUntypedScalaUDF=true
--conf spark.sql.shuffle.partitions=15000
</spark-opts>
<arg>--graphBasePath</arg><arg>${graphBasePath}</arg>
<arg>--isLookUpUrl</arg><arg>${isLookUpUrl}</arg>
<arg>--actionSetId</arg><arg>${actionSetId}</arg>
<arg>--workingPath</arg><arg>${workingPath}</arg>
<arg>--numPartitions</arg><arg>5000</arg>
<arg>--numPartitions</arg><arg>15000</arg>
</spark>
<ok to="End"/>
<ok to="WhitelistSimRels"/>
<error to="Kill"/>
</action>

View File

@ -9,7 +9,8 @@ import java.io.IOException;
import java.io.Serializable;
import java.lang.reflect.InvocationTargetException;
import java.nio.file.Paths;
import java.util.*;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import org.codehaus.jackson.map.ObjectMapper;
@ -17,7 +18,10 @@ import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import eu.dnetlib.dhp.oa.merge.AuthorMerger;
import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.schema.oaf.DataInfo;
import eu.dnetlib.dhp.schema.oaf.Publication;
import eu.dnetlib.dhp.schema.oaf.Software;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
import eu.dnetlib.pace.util.MapDocumentUtil;
import scala.Tuple2;

View File

@ -1,125 +0,0 @@
package eu.dnetlib.dhp.oa.dedup;
import static java.nio.file.Files.createTempDirectory;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.Mockito.lenient;
import java.io.File;
import java.io.IOException;
import java.io.Serializable;
import java.net.URISyntaxException;
import java.nio.file.Paths;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.Mock;
import org.mockito.Mockito;
import org.mockito.junit.jupiter.MockitoExtension;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
@ExtendWith(MockitoExtension.class)
public class SparkDSLExampleTest implements Serializable {
@Mock(serializable = true)
ISLookUpService isLookUpService;
private static SparkSession spark;
private static JavaSparkContext jsc;
private static String testGraphBasePath;
private static String testOutputBasePath;
private static final String testActionSetId = "test-orchestrator";
@BeforeAll
public static void beforeAll() throws IOException, URISyntaxException {
testGraphBasePath = Paths
.get(SparkDedupTest.class.getResource("/eu/dnetlib/dhp/dedup/entities").toURI())
.toFile()
.getAbsolutePath();
testOutputBasePath = createTempDirectory(SparkDedupTest.class.getSimpleName() + "-")
.toAbsolutePath()
.toString();
FileUtils.deleteDirectory(new File(testOutputBasePath));
final SparkConf conf = new SparkConf();
conf.set("spark.sql.shuffle.partitions", "200");
spark = SparkSession
.builder()
.appName(SparkDedupTest.class.getSimpleName())
.master("local[*]")
.config(conf)
.getOrCreate();
jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
}
@BeforeEach
public void setUp() throws IOException, ISLookUpException {
lenient()
.when(isLookUpService.getResourceProfileByQuery(Mockito.contains(testActionSetId)))
.thenReturn(
IOUtils
.toString(
SparkDSLExampleTest.class
.getResourceAsStream(
"/eu/dnetlib/dhp/dedup/profiles/mock_orchestrator.xml")));
lenient()
.when(isLookUpService.getResourceProfileByQuery(Mockito.contains("organization")))
.thenReturn(
IOUtils
.toString(
SparkDSLExampleTest.class
.getResourceAsStream(
"/eu/dnetlib/dhp/dedup/conf/org.curr.conf.json")));
}
@Test
void createBlockStatsTest() throws Exception {
ArgumentApplicationParser parser = new ArgumentApplicationParser(
IOUtils
.toString(
SparkDSLExampleTest.class
.getResourceAsStream(
"/eu/dnetlib/dhp/oa/dedup/createBlockStats_parameters.json")));
parser
.parseArgument(
new String[] {
"-i", testGraphBasePath,
"-asi", testActionSetId,
"-la", "lookupurl",
"-w", testOutputBasePath
});
new DSLExample(parser, spark).run(isLookUpService);
long orgs_blocks = spark
.read()
.textFile(testOutputBasePath + "/" + testActionSetId + "/organization_blockstats")
.count();
assertEquals(480, orgs_blocks);
}
@AfterAll
public static void tearDown() {
spark.close();
}
}

Some files were not shown because too many files have changed in this diff.