Compare commits

...

780 Commits

Author SHA1 Message Date
Claudio Atzori 242d647146 cleanup & docs 2023-10-12 12:23:44 +02:00
Claudio Atzori af3ffad6c4 [AMF] docs 2023-10-12 10:07:52 +02:00
Claudio Atzori ba5475ed4c Merge pull request 'Fix cleaning of Pmid where parsing of numbers stopped at the first non-leading 0 (zero) character' (#345) from fix_truncated_pmid into master
Reviewed-on: D-Net/dnet-hadoop#345
2023-10-06 14:19:49 +02:00
Giambattista Bloisi 2c235e82ad Fix cleaning of Pmid where parsing of numbers stopped at the first non-leading '0' character 2023-10-06 12:35:54 +02:00
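A minimal sketch of the kind of fix this commit describes, using a hypothetical PmidCleaner helper (not the actual dnet-hadoop cleaning code): keep every digit of the PMID and strip only the leading zeros, instead of stopping at the first non-leading zero.

```scala
// Hypothetical sketch: normalize a PMID by stripping only the leading zeros,
// so internal zeros no longer truncate the parsed number.
object PmidCleaner {
  private val PmidPattern = "^0*(\\d+)$".r

  def clean(pmid: String): Option[String] =
    pmid.trim match {
      case PmidPattern(digits) => Some(digits) // "0010802" -> "10802"
      case _                   => None         // reject values that are not all digits
    }
}
```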
Claudio Atzori 4ac06c9e37 Merge pull request 'Fix bug in conversion from dedup json model to Spark Dataset of Rows (instanceTypeMatch no longer working)' (#339) from fix_dedupfailsonmatchinginstances into master
Reviewed-on: D-Net/dnet-hadoop#339
2023-10-02 11:34:20 +02:00
Claudio Atzori fa692b3629 Merge branch 'master' into fix_dedupfailsonmatchinginstances 2023-10-02 11:28:16 +02:00
Claudio Atzori ef02648399 Merge pull request 'fixed dedup configuration management in the Broker workflow' (#341) from fix_8997 into master
Reviewed-on: D-Net/dnet-hadoop#341
2023-10-02 11:03:50 +02:00
Claudio Atzori d13bb534f0 Merge branch 'master' into fix_8997 2023-10-02 11:03:18 +02:00
Giambattista Bloisi 775c3f704a Fix bug in conversion from dedup json model to Spark Dataset of Rows: list of strings contained the json escaped representation of the value instead of the plain value, this caused instanceTypeMatch failures because of the leading and trailing double quotes 2023-09-27 22:30:47 +02:00
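The defect above can be illustrated with a short Jackson snippet (the JSON field name is made up for the example):

```scala
import com.fasterxml.jackson.databind.ObjectMapper

// toString on a string node keeps the JSON escaping and the surrounding
// double quotes, while asText returns the plain value the comparator expects.
val node = new ObjectMapper()
  .readTree("""{"instancetype":"Article"}""")
  .get("instancetype")

val escaped = node.toString // "\"Article\"" -> quoted, breaks instanceTypeMatch equality
val plain   = node.asText   // "Article"     -> the plain value
```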
Sandro La Bruzzo 9c3ab11d5b Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2023-09-25 15:29:19 +02:00
Sandro La Bruzzo 423ef30676 minor fix on the aggregation of uniprot and pdb 2023-09-25 15:28:58 +02:00
Giambattista Bloisi 7152d47f84 Use asScala to convert java List to Scala Sequence 2023-09-20 16:14:27 +02:00
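For reference, the conversion mentioned in this commit looks like the following (illustrative snippet using the Scala 2.11/2.12 JavaConverters; Scala 2.13+ provides the same via scala.jdk.CollectionConverters):

```scala
import java.util.Arrays
import scala.collection.JavaConverters._

// Convert a Java List into a Scala Seq via asScala.
val javaList: java.util.List[String] = Arrays.asList("a", "b", "c")
val scalaSeq: Seq[String] = javaList.asScala.toSeq
```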
Claudio Atzori 4853c19b5e code formatting 2023-09-20 15:53:21 +02:00
Giambattista Bloisi 1f226d1dce Fix defect #8997: GenerateEventsJob is generating huge amounts of logs because broker entity similarity calculation consistently failed 2023-09-20 15:42:00 +02:00
Alessia Bardi 6186cdc2cc Use v5 of the UNIBI Gold ISSN list in test 2023-09-19 14:47:01 +02:00
Alessia Bardi d94b9bebf7 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2023-09-19 13:38:45 +02:00
Alessia Bardi 19abba8fa7 tests for d4science catalog 2023-09-19 13:38:25 +02:00
Claudio Atzori c2f179800c Merge pull request 'Run CC and RAM sequentially in dhp-impact-indicators WF' (#338) from run_cc_and_ram_sequentially into master
Reviewed-on: D-Net/dnet-hadoop#338
2023-09-13 08:52:53 +02:00
Serafeim Chatzopoulos 2aed5a74be Run CC and RAM sequentially in dhp-impact-indicators WF 2023-09-12 22:31:50 +03:00
Claudio Atzori 4dc4862011 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2023-09-12 14:34:34 +02:00
Claudio Atzori dc80ab14d3 [graph dedup] consistency wf should not remove the relations while dispatching the entities 2023-09-12 14:34:28 +02:00
Alessia Bardi 77a2199837 updated test for EOSC community 2023-09-08 11:05:49 +02:00
Claudio Atzori 265180bfd2 added Archive ouverte UNIGE (ETHZ.UNIGENF, opendoar____::1400) to the Datacite hostedBy_map 2023-09-07 11:20:35 +02:00
Claudio Atzori da0e9828f7 resolved conflicts for PR#337 2023-09-06 11:28:46 +02:00
Claudio Atzori 9f5d16624c Merge pull request '[graph raw] datainfo.invisible set as true only for entities' (#336) from invisible_relations into beta
Reviewed-on: D-Net/dnet-hadoop#336
2023-09-04 16:14:47 +02:00
Claudio Atzori adec6692ca Merge branch 'beta' into invisible_relations 2023-09-04 16:13:06 +02:00
Claudio Atzori 15666e86a8 added collectedfrom to the affiliation relations imported from Crossref 2023-09-04 15:56:06 +02:00
Claudio Atzori 7d6bd4f20b Merge pull request 'Fix import of affiliations relations from Crossref' (#335) from 8876_fix_crossref_affiliation_relations_import into beta
Reviewed-on: D-Net/dnet-hadoop#335
2023-09-04 15:19:58 +02:00
Claudio Atzori 5b06c9d06f [graph raw] datainfo.invisible set as true only for entities 2023-09-04 15:15:24 +02:00
Serafeim Chatzopoulos 7de0164c26 Fix import of affiliations relations from Crossref 2023-09-04 16:04:41 +03:00
Claudio Atzori 488d9a1cea Merge pull request 'Add sparkExecutorMemoryOverhead workflow config to set off-heap memory for Spark actions. If not explicitly set, it defaults to 1Gb' (#331) from consistencywf_memoryoverhead_conf into beta
Reviewed-on: D-Net/dnet-hadoop#331
2023-08-29 16:31:36 +02:00
Giambattista Bloisi 6b1c05d118 Add sparkExecutorMemoryOverhead workflow config to set off-heap memory for Spark actions. If not explicitly set, it defaults to 1Gb 2023-08-29 16:04:19 +02:00
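The Spark-side effect of the new workflow parameter can be sketched as follows (illustrative app name; in the workflow the value is wired in through the oozie configuration):

```scala
import org.apache.spark.sql.SparkSession

// Off-heap headroom per executor; the workflow defaults it to 1Gb when unset.
val spark = SparkSession.builder()
  .appName("consistency-step")
  .config("spark.executor.memoryOverhead", "1g")
  .getOrCreate()
```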
Claudio Atzori bf35280ea6 code formatting 2023-08-29 11:11:00 +02:00
Claudio Atzori 0515d81c7c Merge pull request 'Rewrite SparkPropagateRelation exploiting Dataframe API' (#330) from propagate_relation_rewrite into beta
Reviewed-on: D-Net/dnet-hadoop#330
2023-08-29 10:47:14 +02:00
Claudio Atzori 58665a246c Merge branch 'beta' into propagate_relation_rewrite 2023-08-29 10:47:02 +02:00
Claudio Atzori f437be80ad [impact indicators] adjusted paths in the bip ranker wf parameters 2023-08-29 09:03:03 +02:00
Giambattista Bloisi d012aec0b3 Revert PropagateRelation's argument name from outputPath to graphOutputPath in consistency workflow (#8964) 2023-08-28 22:44:54 +02:00
Giambattista Bloisi a860e19423 Fix: ensure all relations are written out, not only those managed by dedup 2023-08-28 15:36:02 +02:00
Giambattista Bloisi 0d7b2bf83d Rewrite SparkPropagateRelation exploiting Dataframe API 2023-08-28 10:34:54 +02:00
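A hedged sketch of what a Dataframe-based relation propagation can look like (the mergedId/dedupId column names are assumptions, not the actual SparkPropagateRelation code): left-join the dedup merge map on both endpoints and fall back to the original identifier where no merge applies.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col}

// mergeMap: (mergedId, dedupId) pairs produced by the dedup phase.
def propagateRelations(rels: DataFrame, mergeMap: DataFrame): DataFrame =
  rels
    .join(mergeMap.withColumnRenamed("mergedId", "source"), Seq("source"), "left")
    .withColumn("source", coalesce(col("dedupId"), col("source")))
    .drop("dedupId")
    .join(mergeMap.withColumnRenamed("mergedId", "target"), Seq("target"), "left")
    .withColumn("target", coalesce(col("dedupId"), col("target")))
    .drop("dedupId")
```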
Miriam Baglioni 9c8b41475a Merge pull request '8172_impact_indicators_workflow' (#284) from 8172_impact_indicators_workflow into beta
Reviewed-on: D-Net/dnet-hadoop#284
2023-08-14 15:50:48 +02:00
Serafeim Chatzopoulos 97c1ba8918 Merge actionsets of results and projects 2023-08-11 15:56:53 +03:00
Miriam Baglioni 35b8deb2c6 Merge pull request 'DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag' (#329) from dispatch_filter_invisible_entities into beta
Reviewed-on: D-Net/dnet-hadoop#329
2023-08-10 12:56:18 +02:00
Giambattista Bloisi 95cd2b9b1e Make filterInvisible a mandatory parameter of DispatchEntitiesSparkJob
Make filterInvisible a mandatory parameter of both dedup/consistency and graph/group oozie workflows
2023-08-10 11:53:48 +02:00
Giambattista Bloisi fab9920271 DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag 2023-08-09 15:41:43 +02:00
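The invisible-flag filtering can be sketched in a few lines (illustrative signature, not the actual DispatchEntitiesSparkJob):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// When filterInvisible is set, drop entities flagged dataInfo.invisible = true.
def dispatch(entities: DataFrame, filterInvisible: Boolean): DataFrame =
  if (filterInvisible) entities.filter(col("dataInfo.invisible") === false)
  else entities
```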
Miriam Baglioni c25ac21e5e Merge pull request 'graph cleaning, suggestions from ticket 8898' (#325) from cleaning_8898 into beta
Reviewed-on: D-Net/dnet-hadoop#325
2023-08-08 11:14:19 +02:00
Miriam Baglioni c334fe2438 Merge pull request 'Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities' (#328) from cleanup_relations_after_dedup into beta
Reviewed-on: D-Net/dnet-hadoop#328
2023-08-08 09:49:12 +02:00
Miriam Baglioni 0e2f855807 Merge pull request 'Updates Promotion DBs' (#321) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#321
2023-08-07 12:09:16 +02:00
Miriam Baglioni 18fbe52b20 Merge pull request 'Import affiliation relations from Crossref' (#320) from 8876 into beta
Reviewed-on: D-Net/dnet-hadoop#320
2023-08-07 10:45:30 +02:00
Giambattista Bloisi 97b6d1dc45 Filter ids by dataInfo.deletedbyinference and dataInfo.invisible flags
Filter relations also by dataInfo.invisible flag
2023-08-07 10:24:11 +02:00
Giambattista Bloisi af49424b59 Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities 2023-08-04 14:27:39 +02:00
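An illustrative sketch of such a relation-cleaning pass (hypothetical helper, assuming an entityIds dataframe with a single id column):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Keep relations that were not deleted by inference and whose endpoints
// both resolve to existing entities (the semi-joins drop the dangling ones).
def cleanRelations(rels: DataFrame, entityIds: DataFrame): DataFrame =
  rels
    .filter(col("dataInfo.deletedbyinference") === false)
    .join(entityIds.withColumnRenamed("id", "source"), Seq("source"), "left_semi")
    .join(entityIds.withColumnRenamed("id", "target"), Seq("target"), "left_semi")
```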
Claudio Atzori 0bc74e2000 code formatting 2023-08-02 11:52:10 +02:00
Claudio Atzori 7180911ded [graph cleaning] fixed regex behaviour for cleaning ROR and GRID identifiers, added tests 2023-08-02 11:44:14 +02:00
Claudio Atzori b9dddbfe54 rule out records with NULL dataInfo, except for Relations 2023-07-31 17:53:54 +02:00
Claudio Atzori da1727f93f rule out records with NULL dataInfo, except for Relations 2023-07-31 17:52:56 +02:00
Claudio Atzori 11ffb9bd68 rule out records with NULL dataInfo 2023-07-31 12:35:33 +02:00
Claudio Atzori ccac6a7f75 rule out records with NULL dataInfo 2023-07-31 12:35:05 +02:00
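The rule from the four commits above, spelled out as a predicate (assuming the dhp-schemas Oaf/Relation model classes; illustrative, not the actual filtering code):

```scala
import eu.dnetlib.dhp.schema.oaf.{Oaf, Relation}

// Keep Relations unconditionally; any other record must carry a non-NULL dataInfo.
def keep(record: Oaf): Boolean =
  record.isInstanceOf[Relation] || record.getDataInfo != null
```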
Serafeim Chatzopoulos 7cefe2665b Remove unnecessary classes 2023-07-28 19:14:39 +03:00
Serafeim Chatzopoulos 26a92ce762 Merge branch '8876' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8876 2023-07-28 19:03:57 +03:00
Serafeim Chatzopoulos ebfba38ab6 Add changes from code review 2023-07-28 19:03:47 +03:00
Serafeim Chatzopoulos eb8684a8cf Merge branch 'beta' into 8876 2023-07-28 13:39:33 +02:00
Claudio Atzori 1275a07d45 Merge pull request '[graph indexing] expand the instance level fulltext in the XML records' (#326) from instance_fulltext_xml into beta
Reviewed-on: D-Net/dnet-hadoop#326
2023-07-27 15:02:07 +02:00
Claudio Atzori a72b9e96ac expand the instance level fulltext in the XML records 2023-07-27 14:57:38 +02:00
Claudio Atzori d512df8612 code formatting 2023-07-26 09:14:08 +02:00
Claudio Atzori d8435a6512 inverted condition 2023-07-25 17:39:57 +02:00
Claudio Atzori 59764145bb cherry picked & fixed commit 270df939c4 2023-07-25 17:39:00 +02:00
Claudio Atzori 270df939c4 partial implementation of the suggestions from https://support.openaire.eu/issues/8898 2023-07-25 17:29:50 +02:00
Claudio Atzori 8c63e4a864 Merge pull request 'Refactor Dedup using Spark Dataframe API, initial support for scala 2.12 and Spark 3.4' (#324) from dedup-with-dataframe-2 into beta
Reviewed-on: D-Net/dnet-hadoop#324
2023-07-25 10:17:17 +02:00
Giambattista Bloisi e64c2854a3 Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)
2023-07-24 15:36:24 +02:00
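The JsonPath cache fix mentioned above amounts to caching compiled expressions in a ConcurrentHashMap (illustrative object name):

```scala
import java.util.concurrent.ConcurrentHashMap
import com.jayway.jsonpath.JsonPath

object JsonPathCache {
  private val cache = new ConcurrentHashMap[String, JsonPath]()

  // computeIfAbsent compiles each path at most once without a global lock,
  // removing the contention of a synchronized cache.
  def get(path: String): JsonPath =
    cache.computeIfAbsent(path, p => JsonPath.compile(p))
}
```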
Giambattista Bloisi bb5b845e3c Use scala.binary.version property to resolve scala maven dependencies
Ensure consistent usage of maven properties
Profile for compiling with scala 2.12 and Spark 3.4
2023-07-24 11:13:48 +02:00
Claudio Atzori 002b24e06f Merge pull request '[graph cleaning] fixed regex behaviour for cleaning ROR and GRID identifiers, added tests' (#315) from pid_cleaning into beta
Reviewed-on: D-Net/dnet-hadoop#315
2023-07-24 10:49:44 +02:00
Claudio Atzori c754397a19 Merge branch 'beta' into pid_cleaning 2023-07-24 10:49:31 +02:00
Claudio Atzori f0678cda09 Merge pull request 'fix_beta_tests' (#323) from fix_beta_tests into beta
Reviewed-on: D-Net/dnet-hadoop#323
2023-07-24 10:47:35 +02:00
Serafeim Chatzopoulos 3a0f09774a Add script to find score limits 2023-07-21 17:55:41 +03:00
Ilias Kanellos 06b9b71c4e Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-07-21 17:42:49 +03:00
Ilias Kanellos 2374f445a9 Produce additional bip update specific files 2023-07-21 17:42:46 +03:00
Serafeim Chatzopoulos cb0f3c50f6 Format workflow.xml 2023-07-21 16:07:10 +03:00
Serafeim Chatzopoulos c64e5e588f Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-07-21 15:27:02 +03:00
Serafeim Chatzopoulos 2cc5b1a39b Fixes in workflow.xml 2023-07-21 15:26:50 +03:00
Ilias Kanellos 0f96af5d56 Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-07-21 13:42:35 +03:00
Ilias Kanellos 03da965162 Format bip-score based file without doi references 2023-07-21 13:42:30 +03:00
Giambattista Bloisi f03153823a Update testCitationRelations number of expected citations according to changes made in 0559d8b4 (monodirectional citations) 2023-07-21 10:48:28 +02:00
Giambattista Bloisi 54c1eacef1 SparkJobTest was failing because the testing workingdir was not cleaned up after each test 2023-07-21 10:42:24 +02:00
Giambattista Bloisi 5e15f20e6e Fix entityMerger that was excluding the authors of the first entity in the list to merge 2023-07-21 00:46:54 +02:00
Giambattista Bloisi 0210a14e43 Ignore timestamp differences in PromoteActionPayloadForGraphTableJobTest 2023-07-20 23:45:57 +02:00
Giambattista Bloisi dba34505de Fix SparkStatsTest bug where parquet tables were incorrectly read as text files leading to unpredictable count() values 2023-07-19 14:24:52 +02:00
Giambattista Bloisi e47ed1fdb2 Use DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES in json mapper to avoid that tests fail if they encounter unmapped properties 2023-07-19 14:21:40 +02:00
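For reference, disabling that Jackson feature is a one-liner:

```scala
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}

// Unknown JSON properties are silently ignored instead of failing deserialization.
val mapper = new ObjectMapper()
  .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
```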
Giambattista Bloisi 38dfebfbe6 Disable MdStoreClientTest test as it requires a local mongodb running and it does not perform any assertions 2023-07-19 14:18:56 +02:00
Claudio Atzori 373a5f2c83 Merge pull request 'Master branch updates from beta July 2023' (#317) from master_july23 into master
Reviewed-on: D-Net/dnet-hadoop#317
2023-07-18 18:22:04 +02:00
Serafeim Chatzopoulos db4ca43ee8 Resolve conflict 2023-07-18 18:38:26 +03:00
Serafeim Chatzopoulos be320ba3c1 Indentation fixes 2023-07-17 16:04:21 +03:00
Serafeim Chatzopoulos bc1a4611aa Minor changes 2023-07-17 11:17:53 +03:00
Claudio Atzori 8af129b0c7 merged stats promotion step from antonis/promotion-prod-only 2023-07-13 15:03:28 +02:00
dimitrispie 706092bc19 Update updateProductionViews.sh 2023-07-13 15:48:12 +03:00
dimitrispie aedd279f78 Updates Promotion DBs
- Add a step for promoting the split monitor DBs
2023-07-13 15:35:46 +03:00
dimitrispie 76901a25f9 Updates Promotion DBs
- Add a step for promoting the split monitor DBs
2023-07-12 22:49:08 +03:00
Giambattista Bloisi ef493681d9 Merge pull request 'Import dnet-pace-core module in this project and use it after renaming to dhp-pace-core' (#319) from beta_with_pace_core into beta
Reviewed-on: D-Net/dnet-hadoop#319
2023-07-11 14:03:15 +02:00
Serafeim Chatzopoulos 4eba14a80e Add oozie workflow 2023-07-06 21:07:50 +03:00
Serafeim Chatzopoulos c2998a14e8 Add basic tests for affiliation relations 2023-07-06 20:28:16 +03:00
Serafeim Chatzopoulos bc7b00bcd1 Add bi-directional affiliation relations 2023-07-06 18:29:15 +03:00
Serafeim Chatzopoulos 12528ed2ef Refactor PrepareAffiliationRelations.java to use OafMapperUtils common functions 2023-07-06 18:08:33 +03:00
Serafeim Chatzopoulos bbc245696e Prepare actionsets for BIP affiliations 2023-07-06 15:56:12 +03:00
Ilias Kanellos 0c433eccdd Fix scores & Workflow 2023-07-06 15:06:28 +03:00
Ilias Kanellos d5c39a1059 Fix map scores to doi 2023-07-06 15:04:48 +03:00
Ilias Kanellos 772d5f0aab Make PR and AttRank serial 2023-07-06 13:47:51 +03:00
Giambattista Bloisi 801da2fd4a New sources formatted by maven plugin 2023-07-06 10:28:53 +02:00
Giambattista Bloisi bd3fcf869a rename dnet-pace-core into dhp-pace-core module and use it as dependency in other modules 2023-07-06 10:02:23 +02:00
Serafeim Chatzopoulos 347a889b20 Read affiliation relations 2023-07-06 00:51:01 +03:00
Giambattista Bloisi 3b35db5fbd Import dnet-pace-core module from dnet-dedup repository 2023-07-05 22:23:06 +02:00
Miriam Baglioni 8dcd028eed [UsageCount] fixed typo in attribute name for datasource table 2023-07-01 16:07:22 +02:00
Miriam Baglioni 7738372125 [UsageCount] fixed typo in attribute name for datasource table 2023-06-30 18:56:41 +02:00
Sandro La Bruzzo 9963fd6d29 updated log to add subentity 2023-06-28 13:36:05 +02:00
Claudio Atzori f3a85e224b merged from branch beta the bulk tagging (single step, negative constraints), the cleaning workflow (single step, pid type based cleaning), instance level fulltext 2023-06-28 13:33:57 +02:00
Claudio Atzori 4ef0f2ec26 added dependency commons-validator:commons-validator:1.7 2023-06-28 13:32:01 +02:00
Sandro La Bruzzo ed7e2ab6d1 reverted mistake on commit workflow.xml 2023-06-28 11:40:19 +02:00
Sandro La Bruzzo 9910ce06ae added to CreateSimRel the feature to write time log 2023-06-28 11:38:16 +02:00
Miriam Baglioni 2717edafb7 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2023-06-28 11:25:14 +02:00
Miriam Baglioni 2f04c9d149 [BulkTagging] fixing left over for test 2023-06-28 11:24:42 +02:00
Sandro La Bruzzo bd17c3edc8 added to CreateSimRel the feature to write time log 2023-06-28 11:20:58 +02:00
Sandro La Bruzzo b195da3a83 Added utility to write time logs during the deduplication phase 2023-06-28 11:20:09 +02:00
Claudio Atzori 288ec0b7d6 [doiboost] merged workflow from branch beta 2023-06-28 09:15:37 +02:00
Claudio Atzori 5f32edd9bf adopting dhp-schema:3.17.1 2023-06-27 16:57:17 +02:00
Claudio Atzori e10ce92fe5 [stats wf] merged workflows from branch beta 2023-06-27 14:32:48 +02:00
Claudio Atzori b93e1541aa Merge pull request 'update sql query to return distinct pids' (#301) from distinct_pids_from_openorgs into master
Reviewed-on: D-Net/dnet-hadoop#301
2023-06-27 12:24:47 +02:00
Claudio Atzori d029bf0b94 Merge branch 'master' into distinct_pids_from_openorgs 2023-06-27 12:24:35 +02:00
Claudio Atzori 0f5a819f44 [graph cleaning] fixed regex behaviour for cleaning ROR and GRID identifiers, added tests 2023-06-23 16:10:49 +02:00
Serafeim Chatzopoulos 60f25b780d Minor fixes in workflow.xml and job.properties 2023-06-23 12:51:50 +03:00
Michele Artini 88a1cbc37d fixed a datasource id 2023-06-22 07:56:33 +02:00
Michele Artini 009d7f312f fixed a datasource Id 2023-06-21 16:17:34 +02:00
Claudio Atzori b0ebf56367 Merge pull request 'Update step15_5.sql' (#314) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#314
2023-06-21 10:33:22 +02:00
dimitrispie 2b6370eaee Update step15_5.sql
Bug fix
2023-06-21 11:31:10 +03:00
Claudio Atzori 35e42a86ed Merge pull request 'Update step15_5.sql' (#313) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#313
2023-06-21 10:26:16 +02:00
dimitrispie 74cb060bfe Update step15_5.sql
Add "if not exists" clause
2023-06-21 11:24:06 +03:00
Claudio Atzori 85e016df17 Merge pull request 'Update step16-createIndicatorsTables.sql' (#312) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#312
2023-06-21 09:52:33 +02:00
dimitrispie a475cfcb7b Update step16-createIndicatorsTables.sql
Rename a field in indi_pub_interdisciplinarity
2023-06-21 10:42:02 +03:00
Claudio Atzori 979cf9cd87 Merge pull request 'Update step15.sql' (#311) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#311
2023-06-21 09:20:01 +02:00
dimitrispie 4648cd88d4 Update step15.sql
Cast score to double
2023-06-21 10:02:19 +03:00
dimitrispie 94d2573c77 Update step15.sql
Bug Fix
2023-06-21 09:22:39 +03:00
Claudio Atzori 0561362de2 Merge pull request 'Update step20-createMonitorDB_institutions.sql' (#309) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#309
2023-06-20 15:07:09 +02:00
Claudio Atzori 50d7dc0078 [graph enrichment] fixed projectOrganizationPath not being passed to the apply_resulttoorganization_propagation node 2023-06-19 15:42:44 +02:00
Claudio Atzori fbd9bf704e indent 2023-06-19 15:41:22 +02:00
Giambattista Bloisi 758e662ab8 Revert "Remove duplicated code and ensure that loading and initialization are done through the "DedupConfig.load" method"
This reverts commit 485f9d18cb.
2023-06-19 13:08:10 +02:00
Giambattista Bloisi 485f9d18cb Remove duplicated code and ensure that loading and initialization are done through the "DedupConfig.load" method 2023-06-19 13:00:02 +02:00
Claudio Atzori 6210f6ee48 Merge pull request 'Precompile blacklists patterns before evaluating clustering criteria' (#1) from optimized-clustering into master
Reviewed-on: D-Net/dnet-dedup#1
2023-06-19 12:43:49 +02:00
dimitrispie be2caedb04 Update step20-createMonitorDB_institutions.sql
Add openorgs____::1624ff7c01bb641b91f4518539a0c28a Vrije Universiteit Amsterdam
2023-06-19 12:12:17 +03:00
dimitrispie 36e0a8fec4 Changes to Promotion Stats WF
1. Add new cluster host at impala-shell commands
2. Add a step for splitting monitor dbs
3. Update workflow.xml to include the new splitting monitor dbs step
2023-06-19 09:44:34 +03:00
Giambattista Bloisi b0ade43608 Precompile blacklist patterns before evaluating clustering criteria
Enable Junit 5 tests in maven builds
Make path comparisons platform-independent
Read String resource files assuming they are encoded in UTF-8
Fix a few test conditions
2023-06-16 09:41:11 +02:00
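The precompilation in the first item above can be sketched as follows (hypothetical Blacklist class, not the dhp-pace-core sources):

```scala
import java.util.regex.Pattern

// Compile the blacklist regexes once at construction time instead of
// recompiling them on every clustering-criteria evaluation.
class Blacklist(patterns: Seq[String]) {
  private val compiled: Seq[Pattern] = patterns.map(p => Pattern.compile(p))

  def matches(value: String): Boolean =
    compiled.exists(_.matcher(value).matches())
}
```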
dimitrispie 4c770a5e29 Update finalizeImpalaCluster.sh
Drop views in shadow dbs before dropping the db
2023-06-15 13:25:37 +03:00
dimitrispie e06d962a6a Update step15.sql 2023-06-15 12:20:35 +03:00
dimitrispie afcad08396 Update step20-createMonitorDB_institutions.sql
Added openorgs____::c0b262bd6eab819e4c994914f9c010e2 -- National Institute of Geophysics and Volcanology
2023-06-15 10:28:49 +03:00
Claudio Atzori b9748763e2 Merge pull request '[stats wf] Bug fixes' (#308) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#308
2023-06-14 21:57:03 +02:00
dimitrispie 42b8ce2ba4 Update copyDataToImpalaCluster.sh 2023-06-14 19:23:42 +03:00
dimitrispie 2032b0df40 Bug fixes
1. Remove tables/views from old databases in the new cluster, before dropping the dbs
2. Fix id in result_accessroute, indi_impact_measures, indi_pub_bronze_oa
2023-06-14 19:09:09 +03:00
Michele Artini a92206dab5 re-added the name of a column (pid) 2023-06-13 11:43:10 +02:00
Claudio Atzori b76a47b103 [aggregator graph] added column alias when mapping organization PIDs from the OpenOrgs database 2023-06-13 11:38:10 +02:00
Claudio Atzori 744a61a030 depending on dhp-schema:3.17.1 2023-06-12 13:49:44 +02:00
Claudio Atzori 2e4616a251 Merge pull request '[graph cleaning] pid cleaning' (#307) from pid_cleaning into beta
Reviewed-on: D-Net/dnet-hadoop#307
2023-06-12 13:32:29 +02:00
Claudio Atzori d6a8b24711 Merge branch 'beta' into pid_cleaning 2023-06-12 13:32:22 +02:00
Claudio Atzori fdbfb25614 Merge pull request 'update sql query to return distinct pids [beta]' (#306) from distinct_pids_from_openorgs_beta into beta
Reviewed-on: D-Net/dnet-hadoop#306
2023-06-12 09:59:00 +02:00
Claudio Atzori ad04f14b81 Merge branch 'beta' into distinct_pids_from_openorgs_beta 2023-06-12 09:58:21 +02:00
Claudio Atzori a98e6591e2 Merge pull request 'propagation of projects through parent-child relations' (#299) from propagationProjectThroughParentChils into beta
Reviewed-on: D-Net/dnet-hadoop#299
2023-06-12 09:57:20 +02:00
Claudio Atzori 55f002f1e9 Merge branch 'beta' into propagationProjectThroughParentChils 2023-06-12 09:56:53 +02:00
Claudio Atzori daa21ddbb5 Merge pull request '[aggregator graph] validation for URLs from oaf:fulltext' (#298) from fulltext_url_validation into beta
Reviewed-on: D-Net/dnet-hadoop#298
2023-06-12 09:55:35 +02:00
Claudio Atzori 4b00a76271 Merge branch 'beta' into fulltext_url_validation 2023-06-12 09:55:25 +02:00
Claudio Atzori eb2fa8556b Merge pull request 'removeTaggingCondition' (#297) from removeTaggingCondition into beta
Reviewed-on: D-Net/dnet-hadoop#297
2023-06-12 09:53:05 +02:00
Claudio Atzori de225c71cd Merge branch 'beta' into removeTaggingCondition 2023-06-12 09:50:40 +02:00
Claudio Atzori e1409ffe80 update sql query to return distinct pids 2023-06-12 09:47:45 +02:00
Claudio Atzori 1d33074fd1 WIP: pid cleaning 2023-06-09 16:47:25 +02:00
Claudio Atzori da7b66c542 Merge pull request '[stats wf] Added memory to hive' (#305) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#305
2023-06-08 08:58:48 +02:00
dimitrispie c5f42c7f5b Added memory to hive 2023-06-07 18:18:23 +03:00
Claudio Atzori afb76ebf0f Merge pull request '[stats wf] Bug fix on indicators step' (#304) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#304
2023-06-07 16:49:09 +02:00
dimitrispie fa24e2e18f Bug fix on indicators step
indi_pub_gold_oa table was missing during the creation of other indicators
2023-06-07 17:43:37 +03:00
Claudio Atzori 01c67e697d Merge pull request '[ stats wf] Bug fix' (#303) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#303
2023-06-07 14:41:44 +02:00
dimitrispie 28272c1b0e Bug fix 2023-06-07 15:34:01 +03:00
Alessia Bardi d5be6a13e9 Updated officialname of Pangaea in hostedbymap for Datacite to avoid duplicate entries in the source filter of the portal 2023-06-06 14:43:32 +02:00
Alessia Bardi 118e72d7db Updated officialname of Pangaea in hostedbymap for Datacite to avoid duplicate entries in the source filter of the portal 2023-06-06 14:39:12 +02:00
Alessia Bardi 5befd93d7d test records for Solr indexing 2023-06-06 14:34:33 +02:00
Michele Artini cae92cf811 update sql query to return distinct pids 2023-06-06 14:06:06 +02:00
Claudio Atzori 8f651f1225 Merge pull request 'Changes to beta stats wf' (#300) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#300
2023-06-06 11:41:36 +02:00
dimitrispie ad07fbf053 Add names to organizations for collaboration indicators 2023-06-02 14:13:10 +03:00
dimitrispie 2324670714 Split Monitor DBs-Interdisciplinarity indicators
- Split DBs Monitor for faster rendering of visualizations
- Add interdisciplinarity indicators from result_fos
2023-06-02 13:34:16 +03:00
Miriam Baglioni daf4d7971b refactoring 2023-05-31 18:56:58 +02:00
Miriam Baglioni 97d72d41c3 finalization of implementation and testing 2023-05-31 18:53:22 +02:00
Miriam Baglioni 0389b57ca7 added propagation for project to organization 2023-05-31 11:06:58 +02:00
Claudio Atzori e45777e7e1 [aggregator graph] added validation for URLs mapped from oaf:fulltext 2023-05-26 11:33:42 +02:00
dimitrispie ebe586b1d1 Impact indicators/Unpaywall
- Added Impact indicators
- Added unpaywall open access colours
2023-05-26 10:25:28 +03:00
dimitrispie d6102dd576 Update step16-createIndicatorsTables.sql
- Add org names to indi_project_collab_org
- Add indi_pub_bronze_oa
- Changes to indi_pub_hybrid_oa_with_cc
2023-05-25 14:52:34 +03:00
Miriam Baglioni 9097e71853 Added assertion in test 2023-05-24 16:30:53 +02:00
Miriam Baglioni 9567c13bc3 refactoring 2023-05-24 16:20:05 +02:00
Miriam Baglioni 34172455d1 [BulkTag] Adding remove constraints to specify when a community must not appear in the context of a result. 2023-05-24 09:56:23 +02:00
Ilias Kanellos a1b9187039 Fix syntax error on workflow.xml 2023-05-23 17:17:12 +03:00
Ilias Kanellos 6a7e370a21 Remove unnecessary counts in graph creation 2023-05-23 16:48:58 +03:00
Ilias Kanellos ec4e010687 End after rankings | Create graph debugged 2023-05-23 16:44:04 +03:00
Claudio Atzori 654ffcba60 Merge pull request '[UsageCount] addition of usagecount for Projects and datasources' (#296) from master_datasource_project_usagecounts into master
Reviewed-on: D-Net/dnet-hadoop#296
2023-05-22 16:13:24 +02:00
Claudio Atzori db625e548d [UsageCount] addition of usagecount for Projects and datasources 2023-05-22 15:00:46 +02:00
Alessia Bardi 04141fe259 tests for records from D4Science catalogues 2023-05-19 14:28:24 +02:00
Claudio Atzori a235d2a24a Merge pull request 'Updates to steps related to transfer data to impala cluster' (#295) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#295
2023-05-18 08:46:15 +02:00
dimitrispie 86f4f63daf Updates to steps related to transfer data to impala cluster
1. Remove external table definitions in stats_ext
2. Fix the issue where some views are not created.
3. Added two workflow parameters for also copying the usage stats dbs
2023-05-18 09:33:05 +03:00
Claudio Atzori 909729a2fc [dedup] tweaking num partitions, minor changes 2023-05-17 10:16:22 +02:00
Ilias Kanellos 38020e242a Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-05-16 17:34:53 +03:00
Ilias Kanellos 3d69f33c84 Fix selection of columns in graph creation 2023-05-16 17:34:42 +03:00
Ilias Kanellos 3c38f7ba6f Fix selection of columns in graph creation 2023-05-16 17:32:53 +03:00
Serafeim Chatzopoulos 8ef718c363 Fix workflow application path 2023-05-16 16:28:48 +03:00
Serafeim Chatzopoulos 26328e2a0d Move job.properties 2023-05-16 14:39:53 +03:00
Serafeim Chatzopoulos 4eec3e7052 Add jobTracker, nameNode && spark2Lib as global params in oozie wf 2023-05-15 22:28:48 +03:00
Serafeim Chatzopoulos b83135c252 Add missing kill nodes in workflow.xml 2023-05-15 19:55:35 +03:00
Serafeim Chatzopoulos 45f2aa0867 Move end node ... at the end in workflow.xml 2023-05-15 17:52:20 +03:00
Claudio Atzori e309688711 Merge pull request 'fix APC affiliation links' (#294) from apc_affiliation into beta
Reviewed-on: D-Net/dnet-hadoop#294
2023-05-15 15:47:57 +02:00
Claudio Atzori 8acad52a0c Merge branch 'beta' into apc_affiliation 2023-05-15 15:47:33 +02:00
Claudio Atzori 8a463cc3e8 fixed organization id created when mapping APC affiliations. Factored out ROR constants in dhp-common 2023-05-15 15:44:46 +02:00
Serafeim Chatzopoulos 12a57e1f58 Resolve conflicts 2023-05-15 16:20:11 +03:00
Serafeim Chatzopoulos 82e2a96f51 Resolve conflicts 2023-05-15 15:53:12 +03:00
Serafeim Chatzopoulos b8e8c959fe Update workflow.xml && job.properties 2023-05-15 15:50:23 +03:00
Ilias Kanellos 4a905932a3 Spark properties from job.properties 2023-05-15 15:24:22 +03:00
Claudio Atzori 0c314d5e09 Merge pull request 'Update copyDataToImpalaCluster.sh' (#293) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#293
2023-05-15 12:05:54 +02:00
Serafeim Chatzopoulos 07818131ef Update documentation 2023-05-15 13:04:44 +03:00
dimitrispie b3f9633205 Update copyDataToImpalaCluster.sh
Added option --user to impala-shell command
2023-05-15 12:51:44 +03:00
Miriam Baglioni 021321ae06 Merge pull request 'removed the inverse of the Citing relation' (#292) from citeOnly into beta
Reviewed-on: D-Net/dnet-hadoop#292
2023-05-15 11:37:39 +02:00
Miriam Baglioni 78b07400c0 changed test classes 2023-05-15 11:37:08 +02:00
Miriam Baglioni 86fe886c1a removed the inverse of the Citing relation 2023-05-15 11:20:51 +02:00
Ilias Kanellos 1788ac2d4d Correct filtering for MAG records 2023-05-12 12:55:43 +03:00
Miriam Baglioni 12cd179d2d Merge pull request 'Update copyDataToImpalaCluster.sh' (#291) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#291
2023-05-12 11:36:34 +02:00
dimitrispie 00d0d162b6 Update copyDataToImpalaCluster.sh
Added a temporary folder to copy the files to avoid permission issues
2023-05-12 12:31:13 +03:00
Ilias Kanellos 5ddbb4ad10 Spark properties no longer hardcoded 2023-05-11 15:36:47 +03:00
Ilias Kanellos 3de35fd6a3 Produce 5 classes of ranking scores 2023-05-11 14:42:25 +03:00
Miriam Baglioni 8c05f49665 restored the version to what it was before the change 2023-05-09 10:48:34 +02:00
Miriam Baglioni 99ac5bab46 added check to avoid NPE when checking the organization country 2023-05-04 19:38:39 +02:00
Claudio Atzori 0704e186f6 Merge pull request 'Stats wf executed on hive only' (#283) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#283
2023-05-02 14:05:12 +02:00
Claudio Atzori cd80b200ee Merge pull request 'Affiliation links from APC' (#290) from apc_affiliation into beta
Reviewed-on: D-Net/dnet-hadoop#290
2023-05-02 12:00:04 +02:00
Claudio Atzori d8882c4481 extended the mapping applied to Datacite records to produce affiliations using the ROR ids. In case of APCs it includes the amount and the currency in the relation 2023-05-02 11:56:51 +02:00
Claudio Atzori d02916ef82 code formatting 2023-05-02 11:05:37 +02:00
Claudio Atzori f653640cd9 Merge pull request 'Bulk Tagging single step' (#289) from bulkTagRefactor into beta
Reviewed-on: D-Net/dnet-hadoop#289
2023-05-02 10:54:14 +02:00
dimitrispie c3d58e58e1 Bug fixes 2023-05-02 11:54:07 +03:00
Claudio Atzori abd7ca0c18 Merge branch 'beta' into bulkTagRefactor 2023-05-02 10:50:01 +02:00
Claudio Atzori de36c7b083 Merge pull request 'Enrichment - result to community through organization' (#255) from organizationToRepresentative into beta
Reviewed-on: D-Net/dnet-hadoop#255
2023-05-02 10:47:07 +02:00
Claudio Atzori 45f625d14f Merge branch 'beta' into organizationToRepresentative 2023-05-02 10:46:55 +02:00
Claudio Atzori cdd33f7445 Merge pull request 'graph cleaning refactoring' (#282) from graph_cleaning_refactoring into beta
Reviewed-on: D-Net/dnet-hadoop#282
2023-05-02 10:40:02 +02:00
Claudio Atzori de11edca98 Merge branch 'beta' into organizationToRepresentative 2023-05-02 09:59:41 +02:00
Claudio Atzori 851f664bd9 Merge branch 'beta' into graph_cleaning_refactoring 2023-05-02 09:55:40 +02:00
dimitrispie e57ecdaf98 Update step20-createMonitorDB.sql
Add University of Manitoba
2023-04-30 17:52:23 +03:00
Ilias Kanellos 90332439ad Remove deletion of synonym folder 2023-04-28 13:45:19 +03:00
Ilias Kanellos a98da54896 Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-04-28 13:23:49 +03:00
Ilias Kanellos 09485fbee3 Fixed unicode bug. Workflow ends after first script 2023-04-28 13:09:13 +03:00
Serafeim Chatzopoulos 614cc1089b Add separate folder for results && project actionsets 2023-04-27 12:37:15 +03:00
Serafeim Chatzopoulos 815a4ddbba Add actionset creation for project bip indicators in workflow 2023-04-26 20:40:06 +03:00
Serafeim Chatzopoulos ee04cf92bf Add actionsets for project impact indicators 2023-04-26 20:23:46 +03:00
Alessia Bardi b88f009d9f combined level 4 and 6 for the demo 2023-04-24 12:10:33 +02:00
Alessia Bardi 5ffe82ffd8 aligned to current DMF index layout on production 2023-04-24 12:09:55 +02:00
Alessia Bardi 1c173642f0 removed level5 from test records 2023-04-24 09:32:32 +02:00
dimitrispie fdb5d2b39f Bug fixes 2023-04-23 18:29:00 +03:00
dimitrispie 53ce023035 Bug fixes 2023-04-23 18:23:45 +03:00
Alessia Bardi 382f46a8e4 tests to generate the XML records for the index for the EDITH demo on digital twins, integrating output from the FoS classifier 2023-04-21 16:46:30 +02:00
Miriam Baglioni ce03f3ee62 merging with branch beta 2023-04-20 14:50:47 +02:00
dimitrispie 4fa750b719 Bug fixes on monitor-update 2023-04-19 17:39:53 +03:00
dimitrispie 5247cb7115 Bug fix 2023-04-19 11:11:19 +03:00
Miriam Baglioni efc4f6a658 [bulkTag] refactor to enrich each result single step 2023-04-18 17:39:31 +02:00
Serafeim Chatzopoulos 23f58a86f1 Change jar param in project impact indicators action 2023-04-18 12:26:01 +03:00
Miriam Baglioni 73f77575bd [ZenodoApiClient] align with master version 2023-04-18 10:25:27 +02:00
Miriam Baglioni 697a134504 - 2023-04-18 10:21:12 +02:00
Miriam Baglioni 6cc95c96a2 - 2023-04-18 09:53:11 +02:00
Miriam Baglioni 24c41806ac [ZenodoApiClientTest] change test to mirror change in the implementation 2023-04-18 09:08:09 +02:00
Miriam Baglioni 087b5a7973 [ZenodoAPIClient] new version of the API to connect to Zenodo (changed the HTTP client) 2023-04-17 18:59:22 +02:00
Michele De Bonis cb595c87bb implementation of the support for authors deduplication: cosinesimilarity comparator and double array json parser 2023-04-17 11:06:27 +02:00
dimitrispie 25dafccc24 Merge branch 'hive' into beta 2023-04-12 11:36:59 +03:00
Claudio Atzori 688e3b7936 added eoscifguidelines in the result view; removed compute statistics statements 2023-04-11 11:45:56 +02:00
Claudio Atzori 2e465915b4 [graph to Solr] using dedicated sparkExecutorCores, sparkExecutorMemory, sparkDriverMemory in convert_to_xml 2023-04-11 10:43:44 +02:00
Claudio Atzori a2dcb06daf added eoscifguidelines in the result view; removed compute statistics statements 2023-04-11 10:43:32 +02:00
Serafeim Chatzopoulos 7256c8d3c7 Add script for aggregating impact indicators at the project level 2023-04-07 16:30:12 +03:00
dimitrispie c85de8fa1f -Added Technological University Dublin
-Added project_organization_contribution table
-Added Delft University of Technology
2023-04-07 09:22:59 +03:00
dimitrispie 9b41dff33c Update step20-createMonitorDB.sql
Added Delft University of Technology
2023-04-07 09:21:38 +03:00
Claudio Atzori 4a4ca634f0 Merge pull request 'advConstraintsInBeta' (#288) from advConstraintsInBeta into master
Reviewed-on: D-Net/dnet-hadoop#288
2023-04-06 15:24:23 +02:00
Miriam Baglioni 932d07d2dd [bulkTag] added filtering for datasources in eosctag 2023-04-06 15:08:27 +02:00
Miriam Baglioni c6a7602b3e refactoring after compilation 2023-04-06 14:45:01 +02:00
Miriam Baglioni 831055a1fc change of the property for test purposes, addition of two new verbs, and fix of issue for advanced constraints 2023-04-06 14:41:32 +02:00
Miriam Baglioni 287753417d better implementation for the fix 2023-04-06 12:22:38 +02:00
Miriam Baglioni cf3d0f4f83 fixed issue on bulktagging for the advanced constraints 2023-04-06 12:17:35 +02:00
Miriam Baglioni b42abc9904 fixed issue on bulktagging for the advanced constraints 2023-04-06 12:15:00 +02:00
dimitrispie 91e18ac7f4 Added project_organization_contribution table 2023-04-06 10:53:11 +03:00
Claudio Atzori 4f67225fbc Merge pull request 'doiboostMappingExtention' (#286) from doiboostMappingExtention into master
Reviewed-on: D-Net/dnet-hadoop#286
2023-04-06 09:25:08 +02:00
Claudio Atzori e093f04874 Merge pull request 'AdvancedConstraint' (#285) from advConstraintsInBeta into master
Reviewed-on: D-Net/dnet-hadoop#285
2023-04-06 09:24:54 +02:00
Miriam Baglioni c5a9f39141 Extended the association project - result in the mapping from CrossRef 2023-04-05 16:48:36 +02:00
Miriam Baglioni ecc05fe0f3 Added the code for the advancedConstraint implementation during the bulkTagging 2023-04-05 16:40:29 +02:00
Claudio Atzori 42442ccd39 Merge pull request 'updated the order of the compatibilities' (#275) from compatibility_order into master
Reviewed-on: D-Net/dnet-hadoop#275
2023-04-05 12:44:14 +02:00
Miriam Baglioni b25b401065 added test to verify the advanced constraints for the dth community. Inserted some additional logs. 2023-04-05 12:18:39 +02:00
Claudio Atzori 864f4051d3 [graph cleaning] added missing case 2023-04-05 11:35:47 +02:00
Michele De Bonis 297eb207a5 minor change in the author match which now can compute count and percentage 2023-04-04 17:10:37 +02:00
Claudio Atzori dead87917f [graph cleaning] cleanup 2023-04-04 13:13:43 +02:00
Claudio Atzori 2a6ba29b64 [graph cleaning] unit tests & cleanup 2023-04-04 12:34:51 +02:00
dimitrispie 9e1335df4c -Added Technological University Dublin
-Added project_organization_contribution table
2023-04-04 13:22:40 +03:00
Miriam Baglioni 9a9cc6a1dd changed the way the tar archive is built to support renaming in case we need to change .tt.gz into .json.gz 2023-04-04 11:40:58 +02:00
Claudio Atzori 63b8bbc015 [graph to Solr] using dedicated sparkExecutorCores, sparkExecutorMemory, sparkDriverMemory in convert_to_xml 2023-03-24 13:43:20 +01:00
Claudio Atzori b502f86523 fixed input path supplied to GetDatasourceFromCountry; adjusted the various spark.sql.shuffle.partitions 2023-03-24 13:09:12 +01:00
Claudio Atzori c07857fa37 [graph cleaning] unit tests & cleanup 2023-03-23 15:57:47 +01:00
Claudio Atzori 90e61a8aba [graph cleaning] WIP: refactoring of the cleaning stages, unit tests 2023-03-23 15:03:26 +01:00
Claudio Atzori 308e10d102 serialising: 1. measures for all the entity types and 2. result level fulltext 2023-03-23 11:23:22 +01:00
Claudio Atzori 488d9a5eaa [graph cleaning] WIP: refactoring of the cleaning stages, unit tests 2023-03-23 10:41:13 +01:00
dimitrispie fad7fa4af8 Added Technological University Dublin 2023-03-22 09:44:00 +02:00
Serafeim Chatzopoulos 102aa5ab81 Add dependency to dhp-aggregation 2023-03-21 19:25:29 +02:00
Serafeim Chatzopoulos f3e5abf63b Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-03-21 18:26:09 +02:00
Serafeim Chatzopoulos 3e8a4cf952 Rearrange resources folder structure 2023-03-21 18:25:55 +02:00
Serafeim Chatzopoulos f992ecb657 Checkout BIP-Ranker during 'prepare-package' && add it in the oozie-package.tar.gz 2023-03-21 18:03:55 +02:00
Ilias Kanellos 9dc8f0f05f Add ActionSet step 2023-03-21 16:14:15 +02:00
Claudio Atzori 4f5ba0ed52 [graph cleaning] WIP: refactoring of the cleaning stages, unit tests 2023-03-21 14:41:20 +01:00
Ilias Kanellos b5c252865c Add filtering based on citation source 2023-03-20 15:38:36 +02:00
Claudio Atzori 6d3d18d8b5 [graph cleaning] WIP: refactoring of the cleaning stages 2023-03-16 17:23:36 +01:00
dimitrispie 43b23a9bf3 Update step20-createMonitorDB.sql
Added Technological University Dublin
2023-03-15 09:57:12 +02:00
Serafeim Chatzopoulos 720fd19b39 Add dhp-impact-indicators workflow files 2023-03-14 19:28:27 +02:00
Serafeim Chatzopoulos c6e39b7f33 Add dhp-impact-indicators 2023-03-14 18:50:54 +02:00
Claudio Atzori 518618f1a9 [graph cleaning] avoid to overwrite the subject class to 'keyword' for those with provenance 'subject:fos' 2023-03-14 15:22:47 +01:00
Claudio Atzori 41e00bcd07 [graph provision] avoid to parse again the XML records, apparently the escaped XML characters get unescaped invalidating the record 2023-03-13 15:19:49 +01:00
Claudio Atzori 46d2df1c90 Merge pull request '[aggregator graph] handle paths including wildcards' (#281) from aggregator_graph into beta
Reviewed-on: D-Net/dnet-hadoop#281
2023-03-08 21:17:39 +01:00
Claudio Atzori 24e2fd828b code formatting 2023-03-08 21:17:08 +01:00
Claudio Atzori e28d395e87 [aggregator graph] using dedicated path to sync claims, adjusted paths with wildcards 2023-03-08 21:16:52 +01:00
Claudio Atzori 5b8fd37314 [aggregator graph] using dedicated path to sync claims 2023-03-08 15:28:14 +01:00
Claudio Atzori 7fd89566c2 [aggregator graph] handle paths including wildcards 2023-03-08 12:43:00 +01:00
Miriam Baglioni 588aca5ce4 Merge pull request 'h2020classification' (#280) from h2020classification into beta
Reviewed-on: D-Net/dnet-hadoop#280
2023-03-03 09:29:10 +01:00
Claudio Atzori 8ec0d62d91 pre-group the records in each table before joining the contents from BETA and PROD together 2023-03-02 14:49:19 +01:00
Miriam Baglioni 0fff98a14c [ECclassification] removed print 2023-03-02 11:46:57 +01:00
Miriam Baglioni b0c2f7e526 [ECclassification] removed not needed resources 2023-03-02 11:44:48 +01:00
Miriam Baglioni d4fc62c2f6 merging with branch beta 2023-03-02 11:14:54 +01:00
Miriam Baglioni de8ad1caef [ECclassification] new implementation for the H2020 classification 2023-03-02 11:14:03 +01:00
Claudio Atzori db9dad4aa7 [actionmanager] increased spark.sql.shuffle.partitions for publication, dataset, relation records 2023-03-02 09:11:37 +01:00
Miriam Baglioni c1f9848953 [ECclassification] added new classes 2023-03-01 15:29:11 +01:00
Claudio Atzori 6f488547a7 ignore non processable records 2023-03-01 14:49:51 +01:00
Claudio Atzori 7d263f265e adjusted logs 2023-03-01 11:58:07 +01:00
Claudio Atzori 16ad42e8f3 code formatting 2023-03-01 10:22:13 +01:00
Claudio Atzori 9c59dac859 followup changes reorganising the mdstore synchronisation mechanism 2023-03-01 10:16:20 +01:00
Miriam Baglioni 49737f1087 Merge pull request '[CrossrefFunderMapping] fixed issues on funder name' (#279) from doiboostFunderExtention into beta
Reviewed-on: D-Net/dnet-hadoop#279
2023-02-28 15:08:07 +01:00
Miriam Baglioni ad745c0aa3 [CrossrefFunderMapping] fixed issues on funder name 2023-02-28 14:58:27 +01:00
Miriam Baglioni 4f2df876cd [ECclassification] new implementation first try 2023-02-28 14:44:00 +01:00
Claudio Atzori bc986f66ec Merge pull request 'monodirectional citations' (#278) from citations_monodirectional into beta
Reviewed-on: D-Net/dnet-hadoop#278
2023-02-28 13:33:52 +01:00
Claudio Atzori 2f7346e9cf WIP monodirectional citations, Datacite 2023-02-28 13:30:51 +01:00
Claudio Atzori 0559d8b412 WIP monodirectional citations 2023-02-28 10:57:32 +01:00
Sandro La Bruzzo 69fa616490 removed wrong content 2023-02-28 10:27:38 +01:00
Sandro La Bruzzo 832a75d012 added mapping for crossref funder 2023-02-28 10:16:34 +01:00
Sandro La Bruzzo 78e51c182a Added missing parameter to raw all workflow 2023-02-28 10:16:01 +01:00
Claudio Atzori 7aebedb43c code formatting 2023-02-27 11:51:27 +01:00
Miriam Baglioni 80987801d7 [FoS] added check for null on level1 subject 2023-02-27 11:40:22 +01:00
Claudio Atzori 31e97c2a6b [unresolved entities] updated oozie wf node labels 2023-02-27 11:38:29 +01:00
Miriam Baglioni 23112929e9 [FoS] changed the default separator from comma to tab to solve the issue in subject value split 2023-02-27 10:18:39 +01:00
Claudio Atzori c4856b4eaa Merge pull request 'Remove unnecessary indexed fields from Solr' (#277) from 8099_lighten_solr_index into beta
Reviewed-on: D-Net/dnet-hadoop#277
2023-02-23 11:50:29 +01:00
Serafeim Chatzopoulos 0b5bf53b45 Remove unnecessary indexed fields from Solr 2023-02-23 12:42:42 +02:00
dimitrispie 1547611246 Merge branch 'beta' into hive 2023-02-22 16:57:12 +02:00
Claudio Atzori 9e4ec0023c Merge pull request 'updated the order of the compatibilities (BETA)' (#276) from compatibility_order_beta into beta
Reviewed-on: D-Net/dnet-hadoop#276
2023-02-22 14:47:32 +01:00
Michele Artini fddcf701e9 updated the order of the compatibilities 2023-02-22 12:07:09 +01:00
Michele Artini 200098b683 updated the order of the compatibilities 2023-02-22 11:52:59 +01:00
Claudio Atzori 0c1be41b30 code formatting 2023-02-22 10:15:25 +01:00
Claudio Atzori 3b876d9327 depending on dhp-schemas v. 3.16.0 2023-02-22 10:15:10 +01:00
Claudio Atzori 99cd7761aa cleanup of the no longer necessary dhp-monitor-update workflow 2023-02-22 10:10:22 +01:00
Claudio Atzori a590c371a9 Merge pull request '8232-mdstore-synch-improve' (#272) from 8232-mdstore-synch-improve into beta
Reviewed-on: D-Net/dnet-hadoop#272
2023-02-22 10:02:26 +01:00
Claudio Atzori cd3a51a15f Merge branch 'beta' into 8232-mdstore-synch-improve 2023-02-22 09:57:07 +01:00
Claudio Atzori 42b6b5d5ce Merge pull request 'UsageCountOnProjectAndDatasource' (#271) from UsageCountOnProjectAndDatasource into beta
Reviewed-on: D-Net/dnet-hadoop#271
2023-02-22 09:56:08 +01:00
Claudio Atzori 477a7c416f Merge branch 'beta' into UsageCountOnProjectAndDatasource 2023-02-22 09:55:51 +01:00
Claudio Atzori c20c1c9159 Merge pull request 'Added 4 institutions:' (#261) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#261
2023-02-22 09:53:45 +01:00
Miriam Baglioni d617c3e812 [DOIBoost] extended mapping for funder #8407 2023-02-20 14:45:27 +01:00
dimitrispie 90807b60c7 Changes to monitor wf 2023-02-20 10:42:24 +02:00
dimitrispie d2f9ccf934 Changes to separate monitor wf 2023-02-20 10:41:21 +02:00
dimitrispie 032a401cbf Bug fixes 2023-02-20 09:29:20 +02:00
Miriam Baglioni 016337a0f9 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2023-02-16 15:54:59 +01:00
Sandro La Bruzzo 118c1fc3b3 Merge remote-tracking branch 'origin/beta' into beta 2023-02-15 10:29:28 +01:00
Sandro La Bruzzo a8ac79fa25 Added citation relation on crossref Mapping 2023-02-15 10:29:13 +01:00
dimitrispie 595192d510 Bug fix 2023-02-14 16:24:08 +02:00
dimitrispie f3aaff3688 Remove duplicate orgs 2023-02-14 09:48:36 +02:00
Claudio Atzori 9a03f71db1 code formatting 2023-02-13 16:25:47 +01:00
Michele Artini 554df257ab null values in date range conditions 2023-02-13 16:15:32 +01:00
Michele Artini 9c1df15071 null values in date range conditions 2023-02-13 16:05:58 +01:00
Miriam Baglioni 32870339f5 refactoring after compile 2023-02-13 13:06:48 +01:00
Miriam Baglioni 7184cc0804 [FoS] added check for null on level1 subject 2023-02-13 13:03:49 +01:00
dimitrispie 3400133c2f Bug fix 2023-02-13 09:44:00 +02:00
dimitrispie 935db0ab25 Added organizations for Monitor 2023-02-13 09:29:09 +02:00
dimitrispie 7b78b15c81 Changes for copying to Impala Cluster 2023-02-13 09:27:00 +02:00
Miriam Baglioni 5cf902a2b0 [UsageCount] changed query to make the sum be computed via sql instead of grouping 2023-02-10 16:16:37 +01:00
Miriam Baglioni f803530df6 [UsageCount] fixed query 2023-02-10 15:50:56 +01:00
Miriam Baglioni 7473093c84 [FoS] changed the default separator from comma to tab to solve the issue in subject value split 2023-02-10 15:34:52 +01:00
Miriam Baglioni bb5bba51b3 [UsageCount] extended test 2023-02-09 19:08:30 +01:00
Miriam Baglioni 85e53fad00 [UsageCount] addition of usagecount for Projects and datasources. Extension of the action set created for the results with new entities for projects and datasources. Extension of the resource set and modification of the testing class 2023-02-09 18:59:45 +01:00
dimitrispie d71f5672d3 Add monitor post step 2023-02-09 13:44:14 +02:00
dimitrispie 35ba8bb328 Bug fixes 2023-02-09 12:57:57 +02:00
Sandro La Bruzzo 8920932dd8 Code formatted 2023-02-08 11:34:18 +01:00
Sandro La Bruzzo 0b9819f1ab Code formatted 2023-02-08 10:32:33 +01:00
Sandro La Bruzzo 6c81a161d2 Merge remote-tracking branch 'origin/beta' into 8231-mdstore-synch-improve 2023-02-08 10:29:09 +01:00
dimitrispie 3ba11d64a1 Changes 07022023 2023-02-07 12:53:51 +02:00
dimitrispie 98c34263ed Update step20-createMonitorDB.sql
Add University of Cape Town organization
2023-02-07 08:14:48 +02:00
dimitrispie 2dc6d47270 Changes 06022023 2023-02-06 13:18:53 +02:00
Miriam Baglioni 5f0906be60 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2023-02-02 17:13:14 +01:00
dimitrispie 973d78a4d6 Update step15_5.sql
Added Unpaywall open access colors
2023-02-02 08:03:54 +02:00
Claudio Atzori d05ca53a14 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2023-01-31 14:39:53 +01:00
Michele De Bonis 6a6c266dde implementation of author dedup configuration and lnfi clustering function 2023-01-31 11:53:10 +01:00
Miriam Baglioni e82e009b46 added missing close tag for XML produced by the xquery to get information for the community from the IS 2023-01-31 10:19:34 +01:00
Miriam Baglioni b254a0375f [Affiliation from institutionalrepo] changed the field to check to verify the datasource type. Now it is in the field jurisdiction 2023-01-26 16:51:20 +01:00
dimitrispie cf58e4a5e4 Added Arts et Métiers ParisTech 2023-01-25 16:03:16 +02:00
dimitrispie db7d625ba9 Added Arts et Métiers ParisTech organization 2023-01-25 12:22:21 +02:00
Claudio Atzori 505867bce9 [bulk tagging] better node naming 2023-01-20 16:13:16 +01:00
Claudio Atzori 1b37516578 [bulk tagging] better node naming 2023-01-20 16:11:26 +01:00
Miriam Baglioni ecd398fe51 refactoring 2023-01-20 14:23:45 +01:00
Claudio Atzori c1e2460293 [cleaning] the datasource master-duplicate fixup should not be brought to production yet 2023-01-20 09:20:26 +01:00
Claudio Atzori 3800361033 [country propagation] fixes error 'cannot resolve countrySet given input columns: []' when there is no prepared information driving the propagation process for a given result type 2023-01-19 15:57:43 +01:00
Miriam Baglioni 0a5c6010b0 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2023-01-13 16:14:46 +01:00
dimitrispie 4d7553c9f1 Bug fixes 2023-01-12 17:19:19 +02:00
dimitrispie dd70c32ad7 Bug fixes 2023-01-12 17:18:05 +02:00
dimitrispie 51f7ab5864 Bug fixes 2023-01-12 17:15:06 +02:00
dimitrispie 34d4bf727c Bug fixes 2023-01-12 11:28:37 +02:00
dimitrispie 43f6d4f296 -Monitor DB workflow 2023-01-12 11:26:47 +02:00
dimitrispie 686580a220 - New Monitor DB workflow
- New Organization added
2023-01-12 11:18:03 +02:00
Claudio Atzori 0a58bc7ba7 [broker] prevent NPEs 2023-01-11 14:44:14 +01:00
Michele Artini 699736addc NPE prevention 2023-01-11 13:14:44 +01:00
Claudio Atzori 04cb96001c [broker] d40e20f437 adapted to the beta graph model 2023-01-11 10:10:12 +01:00
Michele Artini 91b845f611 Considering instance pids and alternative identifiers 2023-01-11 09:58:54 +01:00
Claudio Atzori f86e19b282 code formatting 2023-01-11 09:53:19 +01:00
Miriam Baglioni 1f367122e4 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2023-01-11 09:47:44 +01:00
Michele Artini d40e20f437 Considering instance pids and alternative identifiers 2023-01-11 09:37:34 +01:00
Michele Artini 7b7520850b fixed an invalid char 2023-01-11 09:22:18 +01:00
Michele Artini 4953ae5649 fixed an invalid char 2023-01-11 08:35:53 +01:00
Miriam Baglioni d6895f0387 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2023-01-09 17:28:38 +01:00
Miriam Baglioni c60d3a2b46 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2023-01-09 17:28:27 +01:00
dimitrispie becb242c17 Monitor DB only Workflow 2023-01-04 16:50:29 +02:00
dimitrispie dcb958e146 Changes to execute the stats wf only in hive 2023-01-04 11:39:01 +02:00
Claudio Atzori 7becdaf31d Merge pull request 'Workaround to use new version of intellij on Master' (#266) from master_intellij into master
Reviewed-on: D-Net/dnet-hadoop#266
2022-12-23 10:32:21 +01:00
Claudio Atzori 18a7aa2d78 Merge pull request 'Workaround to use new version of intellij on Beta' (#267) from beta_intellij into beta
Reviewed-on: D-Net/dnet-hadoop#267
2022-12-23 10:32:01 +01:00
dimitrispie 592013d5dd Added more steps in decision node 2022-12-23 09:43:16 +02:00
dimitrispie 2a4bf32d4c Merge branch 'hive' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into hive
# Conflicts:
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step10.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step13.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step14.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step16_1-definitions.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step7.sql
2022-12-22 10:22:46 +02:00
dimitrispie 6449ff4207 1. Added a decision node that enables the workflow to select the execution path to follow
2. Added new organization
3. Added 5 new tables from Eurostat
2022-12-22 10:18:21 +02:00
Miriam Baglioni b713132db7 [Cleaning] adding missing classes 2022-12-21 12:49:08 +01:00
Miriam Baglioni 8893389895 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2022-12-21 12:42:27 +01:00
Miriam Baglioni 11f2b470d3 [Cleaning] adding missing classes 2022-12-21 12:42:19 +01:00
Antonis Lempesis c8309fe18e added command line params to allow hive actions to run 2022-12-21 12:41:33 +02:00
Antonis Lempesis 028873cc51 added new hive opts 2022-12-21 12:41:33 +02:00
Antonis Lempesis 1ddea4f442 removed 'stored as parquet' from views. 2022-12-21 12:41:33 +02:00
Antonis Lempesis 2754c3dd62 moving data to impala cluster and creating shadow databases there 2022-12-21 12:41:29 +02:00
Antonis Lempesis 778a1a724f finished migration to hive only 2022-12-21 12:41:25 +02:00
Antonis Lempesis e84dd5fe26 first 2022-12-21 12:41:23 +02:00
Sandro La Bruzzo 3c9826f186 updated the lines function to its implementation linesWithSeparators.map(l => l.stripLineEnd); this way we force the Scala compiler plugin to treat this pipeline as Scala code and not as the Java String.lines() pipeline 2022-12-21 11:21:17 +01:00
Sandro La Bruzzo 91c70b15a5 updated the lines function to its implementation linesWithSeparators.map(l => l.stripLineEnd); this way we force the Scala compiler plugin to treat this pipeline as Scala code and not as the Java String.lines() pipeline 2022-12-21 11:14:42 +01:00
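The replacement described in the two commits above, spelled out:

```scala
// linesWithSeparators.map(_.stripLineEnd) reproduces what Java 11's
// String.lines() does, while keeping the pipeline unambiguously Scala.
val text = "first\nsecond\nthird"
val lines = text.linesWithSeparators.map(_.stripLineEnd).toList
// => List("first", "second", "third")
```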
Claudio Atzori f910b7379d [cleaning] recovering missing resources from D-Net/dnet-hadoop#265 2022-12-21 09:26:34 +01:00
Claudio Atzori 33bdad104e [cleaning] align parameter names 2022-12-20 21:43:59 +01:00
Claudio Atzori 6aa91204a5 [orcid propagation] skip empty directories 2022-12-20 14:15:46 +01:00
Claudio Atzori 5816ded93f code formatting 2022-12-20 10:41:40 +01:00
Claudio Atzori 46972f8393 [orcid propagation] skip empty directory 2022-12-20 10:28:22 +01:00
Claudio Atzori 9cf0a98699 [cleaning] set the common subject classid/name 2022-12-20 10:17:33 +01:00
Claudio Atzori da85ca697d Merge pull request 'cleanCountryOnMaster' (#265) from cleanCountryOnMaster into master
Reviewed-on: D-Net/dnet-hadoop#265
2022-12-16 15:58:44 +01:00
Miriam Baglioni 059e100ec7 [Clean Country] moving other resources for testing purposes 2022-12-16 15:48:21 +01:00
Miriam Baglioni fc95a550c3 [Clean Country] moving other resources for testing purposes 2022-12-16 15:46:32 +01:00
Miriam Baglioni 6901ac91b1 [Clean Country] moving source and resources to master 2022-12-16 15:42:49 +01:00
Miriam Baglioni 6674cccb94 [BulkTag] description of parameters more comprehensive for those who do not implement it 2022-12-16 15:33:20 +01:00
Miriam Baglioni f37113a941 [BulkTag] moving xquery to get community configuration in dedicated file 2022-12-16 15:32:26 +01:00
Miriam Baglioni 8685eaa706 [Clean Country] added test to verify remove of country 2022-12-16 15:31:25 +01:00
Miriam Baglioni dc0ec88a58 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2022-12-16 13:18:32 +01:00
Miriam Baglioni d791840b82 [Clean Country] added test to verify remove of country: 2022-12-16 13:18:29 +01:00
Claudio Atzori 7b80b24f82 [cleaning] country cleaning must use both PID and AlternateIdentifier fields 2022-12-15 14:49:04 +01:00
Claudio Atzori b8bafab8a0 [cleaning] improved vocabulary based mapping, specialization for the strict vocab cleaning 2022-12-12 14:43:03 +01:00
Sandro La Bruzzo 5e4866d033 implemented synch for single mdstore 2022-12-12 11:29:46 +01:00
Claudio Atzori c18b8048c3 [cleaning] avoid NPE 2022-12-10 11:41:38 +01:00
Claudio Atzori 8b44afe5e5 [cleaning] avoid NPE 2022-12-09 15:44:57 +01:00
Claudio Atzori 389dd25430 [cleaning] avoid NPE 2022-12-08 18:40:48 +01:00
Claudio Atzori 730228d73d [cleaning] align wf parameter names in test 2022-12-08 18:40:22 +01:00
Claudio Atzori 2094fa6db0 [cleaning] align wf parameter names 2022-12-08 17:22:26 +01:00
Miriam Baglioni a485a94956 [Cleaning] fixed parameter name in property file 2022-12-08 16:59:34 +01:00
Miriam Baglioni 3d99b78d94 [Cleaning] fixed error in parameter (workingPath to workingDir) 2022-12-08 10:25:02 +01:00
Claudio Atzori 08c4588d47 Merge pull request 'Changes from beta stats wf to prod' (#264) from antonis.lempesis/dnet-hadoop:beta into master
Reviewed-on: D-Net/dnet-hadoop#264
2022-12-07 15:56:22 +01:00
Claudio Atzori 1b8488976b code formatting 2022-12-07 10:45:38 +01:00
Claudio Atzori cd1b58483e [bulk tag] fixed Community configuration parsing to avoid NPE 2022-12-07 10:39:00 +01:00
Claudio Atzori 062abfd669 fixed NPE, removed unused stuff 2022-12-06 12:04:00 +01:00
dimitrispie 2a52a42169 Added 4 institutions:
-University of Modena and Reggio Emilia
-Bilkent University
-Saints Cyril and Methodius University of Skopje
-University of Milan
2022-12-06 10:10:21 +02:00
Claudio Atzori 71b121e9f8 Merge pull request '[graph cleaning] update collectedfrom & hostedby references as consequence of the datasource deduplication' (#260) from graph_cleaning into beta
Reviewed-on: D-Net/dnet-hadoop#260
2022-12-02 14:49:15 +01:00
Claudio Atzori 8248da40d9 Merge branch 'beta' into graph_cleaning 2022-12-02 14:49:00 +01:00
Claudio Atzori ddf065756f Merge pull request 'Two organizations are added for monitor' (#258) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#258
2022-12-02 14:45:27 +01:00
Claudio Atzori 41f7f1bbc5 Merge pull request '[graph dedup] records stability and testing' (#44) from deduptesting into beta
Reviewed-on: D-Net/dnet-hadoop#44
2022-12-02 14:43:05 +01:00
Sandro La Bruzzo 5a48a2fb18 implemented synch for single mdstore 2022-12-01 11:34:43 +01:00
Claudio Atzori a38116546d Merge branch 'beta' into deduptesting 2022-11-30 11:27:29 +01:00
Miriam Baglioni ce020f2c83 [EOSC FUTURE] added resources and test for review 2022-11-30 09:57:30 +01:00
Miriam Baglioni bb0ddc1c44 [BulkTag] adding verb starts_with 2022-11-30 09:56:24 +01:00
Claudio Atzori 8e3edba318 [graph cleaning] testing the collectedfrom and hostedby patch procedure 2022-11-29 16:07:09 +01:00
Claudio Atzori 58c05731f9 [graph cleaning] WIP: testing the collectedfrom and hostedby patch procedure 2022-11-29 11:21:51 +01:00
Miriam Baglioni 7d264a1d69 Merge pull request 'horizontalConstraints' (#259) from horizontalConstraints into beta
Reviewed-on: D-Net/dnet-hadoop#259
2022-11-28 18:20:17 +01:00
Miriam Baglioni 9c70c5dbd6 [Bulk Tag horizontal] added new path in definition of constraint (to recognize fos subjects) - changed test and resource class to test this new aspect 2022-11-28 14:51:20 +01:00
Miriam Baglioni 0628df7a3a resolving conflicts 2022-11-28 10:44:56 +01:00
Claudio Atzori 11695ba649 [graph cleaning] patch also the result's collectedfrom and hostedby datasource name according to the datasource master-duplicate mapping 2022-11-28 10:18:43 +01:00
Claudio Atzori 6082d235d3 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into graph_cleaning 2022-11-28 09:54:48 +01:00
Claudio Atzori 24ef301cc1 [graph cleaning] patch the result's collectedfrom and hostedby identifiers according to the datasource master-duplicate mapping 2022-11-28 09:54:18 +01:00
Miriam Baglioni 29d3da85f1 [EOSC DUMP] added resources needed for the review as test 2022-11-25 17:16:20 +01:00
Alessia Bardi 90c8f9cb61 tests for EOSC Future 2022-11-23 12:18:44 +01:00
Miriam Baglioni 33a2b1b5dc [Bulk Tag] fixed typo in test configuration 2022-11-23 11:31:17 +01:00
Miriam Baglioni c6df8327b3 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2022-11-23 11:26:57 +01:00
Miriam Baglioni 0e3edc5018 [Bulk Tag] fixed issue in verb name 2022-11-23 11:26:36 +01:00
Miriam Baglioni 935aa367d8 [BulkTag] removed commented code 2022-11-23 11:16:39 +01:00
Miriam Baglioni 43aedbdfe5 [BulkTag] changed verb name in configuration 2022-11-23 11:14:23 +01:00
Miriam Baglioni b6da9b67ff [BulkTag] fixed typo in annotation for verb name 2022-11-23 11:13:58 +01:00
Claudio Atzori a79c47522d updated ORCID datasource identifier 2022-11-23 10:17:49 +01:00
Alessia Bardi 2832117f23 added eoscifguidelines in test 2022-11-22 18:01:12 +01:00
Michele De Bonis 14f6346676 implementation of the new software configuration 2022-11-22 17:48:34 +01:00
Alessia Bardi 3c08269a4d Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2022-11-22 17:31:00 +01:00
Alessia Bardi 2687fc9f73 tests for EOSC Future review - ROhub 2022-11-22 17:30:56 +01:00
Claudio Atzori a34c8b6f81 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2022-11-22 10:22:31 +01:00
Claudio Atzori 1d5143b0b6 Merge branch 'beta' into deduptesting 2022-11-22 10:21:30 +01:00
Miriam Baglioni 122e75aa17 fixed conflicts 2022-11-21 18:13:12 +01:00
Miriam Baglioni cee7a45b1d [Bulk Tag Datasource] fixed issue with verb name and add new test for neanias selection for orcid 2022-11-21 18:10:20 +01:00
Michele De Bonis 9fee2ed611 minor changes 2022-11-21 14:35:46 +01:00
Claudio Atzori ed64618235 increased spark.sql.shuffle.partitions in the last join phase of the result (publication) to community through semantic relation propagation 2022-11-18 16:06:51 +01:00
Claudio Atzori 8742934843 added spark.sql.shuffle.partitions in the last join phase of the result to community through semantic relation propagation 2022-11-18 11:32:22 +01:00
Claudio Atzori 0aa725083f extended dedup testing 2022-11-17 16:13:43 +01:00
Claudio Atzori 3dbc637d3e code formatting 2022-11-17 09:55:41 +01:00
Claudio Atzori 13cc592f39 code formatting 2022-11-15 09:37:57 +01:00
Claudio Atzori af15b1e48d [eosc tag] extending criteria for Jupyter Notebook (adding to ORP the same constraint) 2022-11-14 18:30:43 +01:00
Claudio Atzori eb45ba7af0 extended mapping from ODF relations (PR#251) 2022-11-14 18:26:13 +01:00
Claudio Atzori a929dc5fee integrated changes for mapping ROHub contents in the Graph 2022-11-14 18:15:35 +01:00
Claudio Atzori 24f99d7310 Merge pull request 'Map oaf:eoscifguidelines from mdstore to the graph' (#256) from eoscifguidelines-from-mdstores into beta
Reviewed-on: D-Net/dnet-hadoop#256
2022-11-14 15:40:34 +01:00
Claudio Atzori ddff0e8999 merging duplicates using IdentifierComparator 2022-11-11 16:10:25 +01:00
Miriam Baglioni 5f9383b2d9 [EOSC TAG] remove redundant check for jupyter notebook 2022-11-11 14:06:19 +01:00
Miriam Baglioni b18bbca8af [EOSC TAG] adding search in orp for jupyter notebook criteria 2022-11-11 12:42:58 +01:00
Claudio Atzori 5af5a8ae42 added IdentifierComparator 2022-11-09 14:20:59 +01:00
Claudio Atzori 0419953470 merge from beta 2022-11-07 12:22:35 +01:00
Claudio Atzori 7c3390ac10 Merge branch 'beta' into eoscifguidelines-from-mdstores 2022-11-07 12:18:40 +01:00
dimitrispie 55fa3b2a17 Hive memory parameters 2022-11-03 15:21:04 +01:00
dimitrispie 992fc5b628 Added McMaster University Institution 2022-11-03 11:02:18 +02:00
dimitrispie 7fda05e380 Added Autonomous University of Barcelona 2022-11-01 13:59:40 +02:00
Claudio Atzori 22873c9172 Merge pull request 'Added fields: totalcost, fundedamount, currency, in project table' (#257) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#257
2022-10-31 13:49:27 +01:00
dimitrispie 7861c472e0 Hive memory parameters 2022-10-28 19:00:32 +03:00
dimitrispie 5df9c63963 Added fields: totalcost, fundedamount, currency, in project table 2022-10-27 16:44:26 +03:00
Sandro La Bruzzo 2b9a20a4a3 Changed the way Scholexplorer filters the relationships: I found that filtering out all relations coming from OpenCitations is wrong, because we lose a lot of relations that intersect OpenCitations but do not come only from there 2022-10-24 12:53:47 +02:00
Alessia Bardi 208ed32315 fixed xpath for semantic relation 2022-10-23 18:18:13 +02:00
Alessia Bardi ee759ac92d file format after mvn compile 2022-10-23 18:09:47 +02:00
Alessia Bardi 31a10f000b Map the field oaf:eoscifguidelines from mdstores. Currently we can find it in ROHub metadata 2022-10-23 18:05:37 +02:00
Claudio Atzori ec39b84898 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2022-10-19 15:21:02 +02:00
Claudio Atzori bca4a61710 suppressing hyper verbose spark logs during unit test execution 2022-10-19 15:20:58 +02:00
Sandro La Bruzzo 72f0d88d6c formatted code 2022-10-19 14:18:42 +02:00
Claudio Atzori 9b449110c6 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2022-10-14 15:48:04 +02:00
Claudio Atzori ae7cd0735a [graph2hive] more partitions 2022-10-14 15:47:58 +02:00
Sandro La Bruzzo 135cf81151 Merge remote-tracking branch 'origin/beta' into beta 2022-10-13 11:47:25 +02:00
Sandro La Bruzzo a1f94530a3 added documentation 2022-10-13 11:47:11 +02:00
Claudio Atzori b47aaf4dd1 [cleaning] subjects declared as belonging to specific vocabularies whose values are not found in the vocab are set to type keyword 2022-10-13 11:23:43 +02:00
Claudio Atzori 6163ecbf63 [cleaning] renamed parameters in wf action 2022-10-11 11:20:03 +02:00
Claudio Atzori b301e9fdff [cleaning] renamed action name/description 2022-10-11 11:08:52 +02:00
Claudio Atzori ece40adc09 [cleaning] fixing NPE in the country cleaning phase 2022-10-11 10:10:20 +02:00
Claudio Atzori d51275a965 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2022-10-07 09:52:49 +02:00
Claudio Atzori 8d97949316 [cleaning] fixed loop in wf nodes 2022-10-07 09:52:45 +02:00
Miriam Baglioni a653e1b3ea [Enrichment - result to community through organization] reimplementation of the data preparation step using spark 2022-10-04 15:01:28 +02:00
Miriam Baglioni 4d8339614b Revert "[BipFinder] Fixed issue for wrong escaped char in doi"
This reverts commit 188f25eefa.
2022-10-04 14:29:47 +02:00
Miriam Baglioni 7324853a17 Revert "[BipFinder] refactoring"
This reverts commit 28dc317350.
2022-10-04 14:29:39 +02:00
Miriam Baglioni 28dc317350 [BipFinder] refactoring 2022-10-04 09:47:27 +02:00
Miriam Baglioni 188f25eefa [BipFinder] Fixed issue for wrong escaped char in doi 2022-10-03 12:42:52 +02:00
Claudio Atzori 89f7007080 Merge pull request '[stats wf] misc changes' (#254) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#254
2022-10-03 10:32:05 +02:00
dimitrispie 2c0c3f1806 Cast amount to float for table result_apcs 2022-09-28 19:33:24 +03:00
dimitrispie bdc46e3eaa Remove denormalization of results to fix downloads numbers in monitor 2022-09-28 14:59:08 +03:00
dimitrispie 2ebb1459a9 Fixed type in no_downloads 2022-09-28 14:36:57 +03:00
Miriam Baglioni 3ec044600d [BulkTag] fixed conflicts 2022-09-28 11:58:28 +02:00
Miriam Baglioni 1cb79719a7 [BulkTag] fixed issues 2022-09-28 11:44:55 +02:00
Claudio Atzori 80c5e0f637 code formatting 2022-09-27 12:51:51 +02:00
Claudio Atzori c01d528ab2 suppressing hyper verbose spark logs during unit test execution 2022-09-23 15:19:50 +02:00
Claudio Atzori e6d788d27a [stats wf] adding missing changes lost in PR#248 2022-09-23 14:38:42 +02:00
Claudio Atzori 930f118673 fixed semantic (subreltype) for ServiceOrganization relations 2022-09-22 16:24:44 +02:00
Claudio Atzori b2c3071e72 Merge branch 'master' into beta2master_sept_2022 2022-09-22 14:39:15 +02:00
Claudio Atzori 10ec074f79 Merge remote-tracking branch 'antonis.lempesis/beta' into beta2master_sept_2022 2022-09-22 14:12:19 +02:00
Claudio Atzori 7225fe9cbe integrated changes from discard-non-wellformed 2022-09-22 10:06:07 +02:00
Miriam Baglioni 869e129288 [EOSC BulkTag] refactoring 2022-09-20 16:13:18 +02:00
Miriam Baglioni 840465958b [EOSC BulkTag] filtering out the datasources registered in the eosc with compatibility different from 3.0, 4.0 for literature, data and CRIS, to add the context eosc to the results 2022-09-20 10:30:41 +02:00
Claudio Atzori bdc8f993d0 [Patch Hosted By] check also the presence of datasource.officialname.value 2022-09-19 15:28:03 +02:00
Miriam Baglioni ec87149cb3 [Patch Hosted By] added fix to avoid NPE when the datasource official name is not provided. Removing datasources if no officialname has been provided 2022-09-19 14:06:52 +02:00
Miriam Baglioni b42e2c9df6 [Patch Hosted By] added fix to avoid NPE when the datasource official name is not provided 2022-09-19 12:30:32 +02:00
Miriam Baglioni 1329aa8479 [EOSC BulkTag] modified test to remove association of result to eosc when eoscifguidelines are set 2022-09-19 11:59:48 +02:00
Miriam Baglioni a0ee1a8640 [EOSC BulkTag] remove addition of eosc context for result with eosc if guidelines set 2022-09-19 11:44:10 +02:00
Claudio Atzori 96062164f9 Merge pull request '[Aggregator graph|master] Discard invalid records' (#245) from discard-non-wellformed into master
Reviewed-on: D-Net/dnet-hadoop#245
2022-09-19 09:48:16 +02:00
Claudio Atzori 35bb7c423f updated dhp-schemas version to 2.12.1 2022-09-16 16:13:15 +02:00
Claudio Atzori fd87571506 code formatting 2022-09-16 16:05:03 +02:00
Claudio Atzori c527112e33 Merge commit 'ff6f789b6d9be0567b6ad72f8a0e75fe3f52726a' into beta2master_sept_2022 2022-09-16 15:59:10 +02:00
Claudio Atzori 65209359bc Merge commit 'b5f7bd30be7f7adaaa28170740da0484b50a77ed' into beta2master_sept_2022 2022-09-16 15:58:11 +02:00
Claudio Atzori d72a64ded3 Merge commit '690be4482fc84327dc7617acbc8d976d559df512' into beta2master_sept_2022 2022-09-16 15:57:44 +02:00
Claudio Atzori 3e8499ce47 Merge commit '71b069ca90a2f7ec09d64241c60917d3636fc81e' into beta2master_sept_2022 2022-09-16 15:57:20 +02:00
Claudio Atzori 61aacb3271 Merge commit '1203378441dc6d8e8435cacd42e76e11746f6d1b' into beta2master_sept_2022 2022-09-16 15:56:55 +02:00
Claudio Atzori dbb567251a merged 853c996fa2 2022-09-16 15:56:28 +02:00
Claudio Atzori c7e8ad853e Merge commit '2b5f8c9c9a3611c57ee5febfe262a455a39ad801' into beta2master_sept_2022 2022-09-16 15:55:04 +02:00
Claudio Atzori 0849ebfd80 merged a11eb38065 2022-09-16 15:54:32 +02:00
Claudio Atzori 281239249e Merge commit 'b7c387c21f946adbc9da90ded95166205195edb0' into beta2master_sept_2022 2022-09-16 15:49:20 +02:00
Claudio Atzori 45fc5e12be Merge commit 'cb7c07c54e59675e8dffe42b7f2a13f16c956068' into beta2master_sept_2022 2022-09-16 15:48:55 +02:00
Claudio Atzori 1c05aaaa2e Merge commit '3418ce50ac9b28fed4fa949919e6c8208738cdcf' into beta2master_sept_2022 2022-09-16 15:48:36 +02:00
Claudio Atzori 01d5ad6361 Merge commit 'd85ba3c1a9d7f0e80565742161ff6c9ecffd52b7' into beta2master_sept_2022 2022-09-16 15:48:16 +02:00
Claudio Atzori d872d1cdd9 Merge commit 'a4815f6bec87f05be8cd740d236707949a0f746e' into beta2master_sept_2022 2022-09-16 15:47:49 +02:00
Claudio Atzori ab0efecab4 Merge commit '84598c75356cf580de6c81653a9351e9b8173639' into beta2master_sept_2022 2022-09-16 15:47:05 +02:00
Claudio Atzori 725c3c68d0 Merge commit '844f6eb46533cdd4be3210401b10401322079640' into beta2master_sept_2022 2022-09-16 15:46:40 +02:00
Claudio Atzori 300ae6221c Merge commit '32cee1f619eb30d2e2ac6083435b76b1aba7db09' into beta2master_sept_2022 2022-09-16 15:45:57 +02:00
Claudio Atzori 0ec2eaba35 Merge commit 'c1f2ffc53dc41f1fac3855b2d2df7d6a5ea15e3e' into beta2master_sept_2022 2022-09-16 15:45:27 +02:00
Claudio Atzori a387807d43 Merge commit 'b78889a0ce27a79c7ab2d8da05b118ee4f1bcb36' into beta2master_sept_2022 2022-09-16 15:44:17 +02:00
Claudio Atzori 2abe2bc137 Merge commit '08ce2cadc2d84aa982726e429c280a905536a715' into beta2master_sept_2022 2022-09-16 15:43:49 +02:00
Claudio Atzori a07c876922 Merge commit '27a91841e7fa2a1b615b4d1e161d606db5bead96' into beta2master_sept_2022 2022-09-16 15:43:02 +02:00
Claudio Atzori cbd48bc645 Merge commit 'efd96e7e664e4139321e35e8d172b884ba4b61a1' into beta2master_sept_2022 2022-09-16 15:38:56 +02:00
Antonis Lempesis 6fc9ef53f6 added command line params to allow hive actions to run 2022-07-29 16:36:20 +03:00
Antonis Lempesis 0353f93d54 added new hive opts 2022-04-29 12:49:27 +03:00
miconis 9ddd24ba36 implementation of comparators and clustering function for the author deduplication 2022-04-19 10:18:09 +02:00
miconis 97a32faf9b test implementation for the new fdup version 2022-04-13 09:48:56 +02:00
miconis 10172553ab [maven-release-plugin] prepare for next development iteration 2022-03-15 15:06:18 +01:00
miconis bd919ac98d [maven-release-plugin] prepare release dnet-dedup-4.1.12 2022-03-15 15:06:12 +01:00
miconis a965233dd0 bug fix in the normalization of a legalname, city map updated and transliteration support added 2022-03-15 14:59:13 +01:00
miconis ac9708e31b [maven-release-plugin] prepare for next development iteration 2022-03-09 13:43:48 +01:00
miconis a5a6054039 [maven-release-plugin] prepare release dnet-dedup-4.1.11 2022-03-09 13:43:44 +01:00
miconis 3bc07c5881 bug fix in the AuthorMatch, implementation of the concat function in the model creation with jpath query 2022-03-09 12:53:09 +01:00
miconis 699612dd17 implementation of the size threshold on authors list match 2022-03-08 16:49:28 +01:00
Antonis Lempesis 5772f92dba merged beta changes in hive branch 2022-02-15 13:24:51 +02:00
miconis 8f07f0c537 [maven-release-plugin] prepare for next development iteration 2022-01-13 17:22:16 +01:00
miconis 620e35db28 [maven-release-plugin] prepare release dnet-dedup-4.1.10 2022-01-13 17:22:12 +01:00
miconis 2ff97781d2 minor change 2022-01-13 17:20:20 +01:00
miconis 1ff6a3dc11 [maven-release-plugin] prepare for next development iteration 2022-01-13 15:15:05 +01:00
miconis 003bcf1699 [maven-release-plugin] prepare release dnet-dedup-4.1.9 2022-01-13 15:15:00 +01:00
miconis 2f1ba56f61 bug fix in the authormatch comparator, implementation of tests 2022-01-13 11:58:28 +01:00
miconis cea8440153 [maven-release-plugin] prepare for next development iteration 2021-12-30 13:11:57 +01:00
miconis eb48d31ea6 [maven-release-plugin] prepare release dnet-dedup-4.1.8 2021-12-30 13:11:52 +01:00
miconis a224bf70a4 implementation of new comparators for publication dedup configuration update 2021-12-27 17:35:02 +01:00
Antonis Lempesis ddd34087c2 removed 'stored as parquet' from views.. 2021-12-13 23:05:00 +02:00
Antonis Lempesis 915f758c82 moving data to impala cluster and creating shadow databases there 2021-12-13 16:26:14 +02:00
Antonis Lempesis d05210ba99 finished migration to hive only 2021-11-30 19:01:48 +02:00
Antonis Lempesis 12749a0a77 first 2021-11-26 15:40:40 +02:00
miconis 8f1db32921 implementation of the instance type comparator and its tests 2021-11-04 15:20:57 +01:00
miconis fbb1b66bfb dedup test implementation & graph drawing tools 2021-09-13 14:53:19 +02:00
miconis 1144d50a11 [maven-release-plugin] prepare for next development iteration 2021-05-03 16:09:56 +02:00
miconis f33a18ca9d [maven-release-plugin] prepare release dnet-dedup-4.1.7 2021-05-03 16:09:08 +02:00
miconis 4bce4f2e8e minor change: version updated 2021-05-03 16:05:39 +02:00
miconis c6266242e3 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-dedup 2021-05-03 15:38:00 +02:00
miconis 4988e9f80d implementation of cross comparison for different fields, addition of clustering mechanism to collapse keys from different clustering functions on the same cluster 2021-05-03 15:37:41 +02:00
Claudio Atzori 58d013e24f [maven-release-plugin] prepare for next development iteration 2021-04-12 16:12:15 +02:00
Claudio Atzori 3a7336157b [maven-release-plugin] prepare release dnet-dedup-4.0.6 2021-04-12 16:12:10 +02:00
miconis ed0d5d3e1d implementation of the wf to dedup entities, addition of the module to run the wf on the cluster 2020-12-04 15:41:31 +01:00
miconis 3f2d3253e4 Merge branch 'stable_ids' into deduptesting 2020-11-05 15:52:57 +01:00
miconis 1699d41d39 relations for openorgs: now it chooses only one master 2020-11-05 15:48:42 +01:00
miconis 72116446ec [maven-release-plugin] prepare for next development iteration 2020-09-29 12:06:38 +02:00
miconis 05a03d97cd [maven-release-plugin] prepare release dnet-dedup-4.0.5 2020-09-29 12:06:35 +02:00
miconis 2a01022712 minor changes 2020-09-29 12:05:50 +02:00
miconis dd34e371d7 fixed error in the treeprocessor: it used th=-1 as default value, now it uses th=1 2020-09-29 12:01:25 +02:00
miconis 19c3c90d7b fixed error in the block processor: entities with orderField=null were not considered 2020-09-19 17:43:41 +02:00
Sandro La Bruzzo a109ebe287 fixed NPE 2020-08-06 10:27:05 +02:00
miconis a5a3ea24f8 [maven-release-plugin] prepare for next development iteration 2020-07-16 18:59:25 +02:00
miconis 840fe8f4d3 [maven-release-plugin] prepare release dnet-dedup-4.0.4 2020-07-16 18:59:22 +02:00
miconis 07ab904d60 implementation of the clustering function for the suffixprefix chain 2020-07-16 18:57:55 +02:00
Claudio Atzori eaf7defe0c [maven-release-plugin] prepare for next development iteration 2020-07-15 17:57:09 +02:00
Claudio Atzori ff2c8eba12 [maven-release-plugin] prepare release dnet-dedup-4.0.3 2020-07-15 17:57:04 +02:00
Claudio Atzori 7cc3742a26 removed maven release.property 2020-07-15 17:52:27 +02:00
Claudio Atzori 14611ea450 reverted to 4.0.3-SNAPSHOT 2020-07-15 17:37:36 +02:00
Claudio Atzori 9f20f23870 Revert "wordssuffixprefix: adjust the token length according to the number of words; removed maven release temporary files"
This reverts commit 51d91fa520.
2020-07-15 17:35:56 +02:00
Claudio Atzori 9efcd8e245 Revert "reverted to 4.0.3-SNAPSHOT"
This reverts commit ec97983ce1.
2020-07-15 17:28:37 +02:00
Claudio Atzori ba493f9ab8 [maven-release-plugin] rollback the release of dnet-dedup-4.0.3 2020-07-15 17:24:43 +02:00
Claudio Atzori 6c98d4c436 [maven-release-plugin] prepare release dnet-dedup-4.0.3 2020-07-15 17:24:25 +02:00
Claudio Atzori ec97983ce1 reverted to 4.0.3-SNAPSHOT 2020-07-15 17:20:12 +02:00
Claudio Atzori 51d91fa520 wordssuffixprefix: adjust the token length according to the number of words; removed maven release temporary files 2020-07-15 17:13:45 +02:00
Claudio Atzori b79ea97107 Revert "wordssuffixprefix: adjust the token length according to the number of words; removed maven release temporary files"
This reverts commit d2861950ac.
2020-07-15 17:11:46 +02:00
Claudio Atzori 92aadbfc7b [maven-release-plugin] prepare release dnet-dedup-4.0.3 2020-07-15 17:04:20 +02:00
Claudio Atzori d2861950ac wordssuffixprefix: adjust the token length according to the number of words; removed maven release temporary files 2020-07-15 16:49:47 +02:00
miconis 244a037a90 implementation of a class to test the clustering functions 2020-07-12 10:13:54 +02:00
miconis 7aa2001a8b [maven-release-plugin] prepare for next development iteration 2020-07-02 17:06:38 +02:00
miconis c72055f543 [maven-release-plugin] prepare release dnet-dedup-4.0.2 2020-07-02 17:06:36 +02:00
miconis f933fd33e0 implemented new function for clustering 2020-07-02 17:04:17 +02:00
miconis 411d1cc24f implementation of the test for the dedup and addition of new support classes 2020-06-11 10:46:46 +02:00
miconis 48c094f599 [maven-release-plugin] prepare for next development iteration 2020-04-24 14:39:01 +02:00
miconis 4365ba41c9 [maven-release-plugin] prepare release dnet-dedup-4.0.1 2020-04-24 14:38:58 +02:00
miconis 6e9b27f37d implementation of the mechanism to truncate the string and the lists 2020-04-24 14:36:42 +02:00
Sandro La Bruzzo 8e4211708e [maven-release-plugin] prepare for next development iteration 2020-02-10 12:51:04 +01:00
Sandro La Bruzzo 24e2ab9092 [maven-release-plugin] prepare release dnet-dedup-4.0.0 2020-02-10 12:50:45 +01:00
Sandro La Bruzzo 46727f5c76 upgraded maven version of commons-lang 2020-02-10 12:38:40 +01:00
miconis 5c8f6febee minor changes in comparators 2020-01-24 10:01:11 +01:00
miconis 4dce785375 update in the implementation of the tree: addition of new logic aggregations and statistics 2020-01-14 11:42:43 +02:00
miconis b3748b8d77 minor changes 2019-12-18 16:20:35 +01:00
miconis b21b1b8f61 implementation of new aggregation in the tree node processing 2019-12-18 16:19:36 +01:00
miconis 20fcfe6328 implementation of new aggregation in the tree node processing 2019-12-18 16:19:26 +01:00
Sandro La Bruzzo d924f28b93 fixed wrong use of jspath 2019-12-18 09:29:44 +01:00
miconis 84aaa65501 implementation of new json comparator and update of the publication configuration 2019-12-17 09:16:26 +01:00
Sandro La Bruzzo 5c01ae4c92 merged JqMapping branch into tree2 2019-12-13 11:30:02 +01:00
Sandro La Bruzzo 35008fdbf9 fix stuff 2019-12-06 15:28:30 +01:00
Sandro La Bruzzo 16c670a5d5 Improved deduplication 2019-12-05 14:14:25 +01:00
miconis 49f9beb4a8 implementation of romansmatch and re-implementation of the getNumber function. New terms in the translation map and update of the configuration 2019-11-28 16:54:44 +01:00
miconis f791730330 addition of one term to the translation maps in the configurations 2019-11-27 15:48:37 +01:00
miconis d2278fe358 minor change in the citymatch 2019-11-21 10:54:02 +01:00
miconis 8c0d346005 the param map has been updated: now it accepts string parameters 2019-11-21 09:37:56 +01:00
miconis ddd40540aa jarowinklernormalizedname split into 3 different comparators: citymatch, keywordmatch and jarowinkler. Implementation of the TreeStatistic support functions 2019-11-20 10:45:00 +01:00
miconis c687956371 code cleaning and implementation of the TreeDedup + minor changes 2019-11-14 10:01:21 +01:00
miconis 0973899865 code cleaning, distribution of the classes in packages and implementation of the new configuration 2019-11-07 12:47:12 +01:00
miconis 30a873265f put the last modification of the master branch into the tree2. Addition of the configuration as parameter of the comparator. This is to allow the comparator to access it 2019-10-29 16:38:42 +01:00
miconis 1beb776691 minor changes 2019-10-29 15:58:21 +01:00
miconis 075f741d28 [maven-release-plugin] prepare for next development iteration 2019-10-24 11:34:19 +02:00
miconis ced4bcdd59 [maven-release-plugin] prepare release dnet-dedup-3.0.15 2019-10-24 11:34:12 +02:00
miconis 13f93e6055 Revert "[maven-release-plugin] prepare release dnet-dedup-3.0.15"
This reverts commit cf93515d94.
2019-10-24 11:23:01 +02:00
miconis cf93515d94 [maven-release-plugin] prepare release dnet-dedup-3.0.15 2019-10-24 11:17:07 +02:00
miconis 285ec3ca17 release rollback 2019-10-24 11:11:07 +02:00
miconis 5f249fd56c minor changes 2019-10-23 16:37:20 +02:00
miconis c9863debfa minor changes and configuration updates (synonym field added) 2019-10-23 16:31:45 +02:00
miconis 5499ca17c3 minor changes 2019-10-08 16:49:07 +02:00
miconis 50b7a12b3f normalization of the term in the translation map added 2019-10-08 15:13:45 +02:00
miconis 26b383fea2 translation map moved in json configuration, support for synonyms added in the configuration, now the configuration is argument of conditions, distancealgos and clusteringfunctions 2019-10-08 14:53:52 +02:00
Claudio Atzori 07355d2811 [maven-release-plugin] prepare for next development iteration 2019-09-25 10:39:46 +02:00
Claudio Atzori 254eb46809 [maven-release-plugin] prepare release dnet-dedup-3.0.14 2019-09-25 10:39:39 +02:00
Claudio Atzori 74c6462b49 updated translation map and some tests 2019-09-25 10:15:13 +02:00
miconis aed81e4cfa translation map updated 2019-09-25 09:53:06 +02:00
miconis afd2b398d5 optimize imports 2019-08-09 15:42:41 +02:00
miconis d71dae5fd2 implementation of the conditions in tree nodes. get rid of the conditions part of the configuration 2019-08-09 15:41:49 +02:00
miconis a5c5d2f01b implementation of the decision tree. It takes the place of the distance algos; necessaryConditions and sufficientConditions are still there. The model contains only path, type and name of the field. ignoreMissing is still in the model because it is used by the conditions. 2019-08-09 10:08:34 +02:00
miconis f2136e1024 code refactoring: useless module removed 2019-08-07 15:16:59 +02:00
miconis 8c867101ef addition of a fixSpecial function to address the problem with special character in organization names, addition of new terms in translation maps 2019-08-06 17:06:05 +02:00
miconis 4502b44337 addition of the BlockUtils class for meta-blocking, implementation of a new local test with edge filtering example 2019-08-06 12:09:34 +02:00
miconis cffb712a99 Merge branch 'master' of https://github.com/dnet-team/dnet-dedup 2019-07-19 17:10:53 +02:00
miconis a85576c27e restyling of the JaroWinklerNormalizedName comparator, now it is optimized. Addition of some translations in the translation maps, addition of a clustering based on keywords in organizations legalnames 2019-07-19 17:10:29 +02:00
Claudio Atzori 6cb846331a [maven-release-plugin] prepare for next development iteration 2019-07-08 11:12:52 +02:00
Claudio Atzori c04d2232c2 [maven-release-plugin] prepare release dnet-dedup-3.0.13 2019-07-08 11:12:45 +02:00
miconis fb5e38db26 Merge branch 'master' of https://github.com/dnet-team/dnet-dedup 2019-07-08 11:02:29 +02:00
miconis 3c6f8d1e44 bug fixing in the keywordsclustering class 2019-07-08 11:01:49 +02:00
Claudio Atzori a69022617d [maven-release-plugin] prepare for next development iteration 2019-07-08 10:11:24 +02:00
Claudio Atzori c6baeb93d4 [maven-release-plugin] prepare release dnet-dedup-3.0.12 2019-07-08 10:11:17 +02:00
miconis f5de20a508 [maven-release-plugin] rollback the release of dnet-dedup-3.0.12 2019-07-08 10:00:48 +02:00
miconis ba50aa8654 [maven-release-plugin] prepare for next development iteration 2019-07-08 09:48:10 +02:00
miconis 7065110a21 [maven-release-plugin] prepare release dnet-dedup-3.0.12 2019-07-08 09:48:03 +02:00
miconis 15bec5e876 addition of doi normalization in PidMatch comparator, addition of keywordsclustering (clustering based on terms in the translation maps for the organizations), minor changes 2019-07-08 09:44:02 +02:00
Claudio Atzori 2dcffb965f [maven-release-plugin] prepare for next development iteration 2019-06-19 10:02:39 +02:00
Claudio Atzori 85126c59f7 [maven-release-plugin] prepare release dnet-dedup-3.0.11 2019-06-19 10:02:32 +02:00
Claudio Atzori 15d7b584f3 optimized classpath resolvers 2019-06-19 10:01:35 +02:00
Claudio Atzori ff4956def9 [maven-release-plugin] prepare for next development iteration 2019-06-18 14:46:34 +02:00
Claudio Atzori eb5ce312a3 [maven-release-plugin] prepare release dnet-dedup-3.0.10 2019-06-18 14:46:27 +02:00
Claudio Atzori f2bc665403 avoid to divide by zero: in case of missing values, return undefined response 2019-06-18 14:45:15 +02:00
Claudio Atzori e3f86b92c8 cleanup 2019-06-18 14:44:42 +02:00
miconis 54e4d0af04 exact match condition gives undefined if a field is missing; ignoremissing semantics changed: if =true it now performs the comparison in any case, if =false it gives -1 in case of a missing field 2019-06-18 14:05:31 +02:00
miconis e8db8f2abb implementation of the integration test, addition of document blocks to group entities after clustering 2019-05-21 16:38:26 +02:00
Claudio Atzori f7a3bdf3f8 [maven-release-plugin] prepare for next development iteration 2019-04-03 12:35:00 +02:00
Claudio Atzori 98c179c8fb [maven-release-plugin] prepare release dnet-dedup-3.0.9 2019-04-03 12:34:52 +02:00
miconis 3e61a90c8f [maven-release-plugin] rollback the release of dnet-dedup-3.0.9 2019-04-03 12:27:28 +02:00
miconis 15fb9eb883 [maven-release-plugin] prepare for next development iteration 2019-04-03 12:26:05 +02:00
miconis a1ff4daa7f [maven-release-plugin] prepare release dnet-dedup-3.0.9 2019-04-03 12:25:56 +02:00
miconis 1d29bae47c branch cities merged into master 2019-04-03 12:22:33 +02:00
miconis 7e7018c51f addition of a sparktester test, implementation of 2 different classes for testing in dnet-dedup-test module, addition of new terms in the vocabulary and change in the implementation of the JaroWinklerNormalizedName comparator 2019-04-03 09:40:14 +02:00
miconis 4bd5a9beee minor changes 2019-03-26 15:48:21 +01:00
Michele De Bonis 662448e584 update of the comparator for legalnames of organizations 2019-03-21 14:27:27 +01:00
Claudio Atzori f2394fcd9f [maven-release-plugin] prepare for next development iteration 2019-02-18 09:09:14 +01:00
Claudio Atzori 722431dde1 [maven-release-plugin] prepare release dnet-dedup-3.0.8 2019-02-18 09:09:07 +01:00
Claudio Atzori 470c4b0f20 default configuration includes configurationId 2019-02-18 09:07:23 +01:00
Claudio Atzori ccb7e83196 [maven-release-plugin] prepare for next development iteration 2019-02-17 12:56:19 +01:00
Claudio Atzori 7d8e62d4cc [maven-release-plugin] prepare release dnet-dedup-3.0.7 2019-02-17 12:56:11 +01:00
Claudio Atzori 968cd47436 replace existing attributes when loading default configuration 2019-02-17 12:48:25 +01:00
Michele De Bonis 0735f3a822 implementation of the test classes and minor changes 2019-02-08 12:56:47 +01:00
Michele De Bonis 7a8d28991f implementation of the decision tree for the deduplication of the authors, implementation of multiple comparators to be used in a tree node and definition of the proto for person entity 2018-12-20 09:54:41 +01:00
Michele De Bonis 39613dbbd6 implementation of the decisional tree, addition of the dnet-openaire-data-protos module, definition of the person proto, blockprocessor and paceconfig modified with addition of support for the tree processing 2018-12-12 16:30:03 +01:00
Claudio Atzori f1c68d8ba3 apply limits (length, size) to pace Fields 2018-11-20 10:51:38 +01:00
Claudio Atzori c5979ffe18 [maven-release-plugin] prepare for next development iteration 2018-11-19 17:41:45 +01:00
Claudio Atzori 9869dff1d2 [maven-release-plugin] prepare release dnet-dedup-3.0.6 2018-11-19 17:41:37 +01:00
Claudio Atzori c2d4cb3ba6 added new properties to FieldDef (size, length) to limit the information mapped onto each MapDocument 2018-11-19 17:37:57 +01:00
Claudio Atzori 394fcafd41 [maven-release-plugin] prepare for next development iteration 2018-11-17 09:13:16 +01:00
Claudio Atzori 397554130c [maven-release-plugin] prepare release dnet-dedup-3.0.5 2018-11-17 09:13:09 +01:00
Claudio Atzori 0dfb2ea600 added distance function for software titles 2018-11-17 09:11:38 +01:00
Michele De Bonis 3d4372ced9 addition of cities check 2018-11-16 16:11:03 +01:00
Claudio Atzori 55a9b4f501 [maven-release-plugin] prepare for next development iteration 2018-11-16 09:18:00 +01:00
Claudio Atzori 35ab630493 [maven-release-plugin] prepare release dnet-dedup-3.0.4 2018-11-16 09:17:53 +01:00
Claudio Atzori 399e4bc80f default (empty) configuration should be aligned with the updated model 2018-11-15 16:52:56 +01:00
Claudio Atzori 59bab8dba4 less verbose logging 2018-11-13 09:07:45 +01:00
Claudio Atzori 478ad72cb8 propagate exceptions in case of serialization errors, removed configuration pretty printing, removed unused class ScoredResult 2018-11-12 15:52:18 +01:00
Claudio Atzori f7616c7a8a [maven-release-plugin] prepare for next development iteration 2018-11-12 14:23:36 +01:00
Claudio Atzori df4b871c8b [maven-release-plugin] prepare release dnet-dedup-3.0.3 2018-11-12 14:23:29 +01:00
Michele De Bonis 72a9b3139e Merge branch 'master' of https://github.com/dnet-team/dnet-dedup 2018-11-12 14:11:26 +01:00
Michele De Bonis b5062f5429 configuration file updated, addition of condition on domain 2018-11-12 14:11:15 +01:00
Claudio Atzori 2a509b18fa [maven-release-plugin] prepare for next development iteration 2018-11-12 12:46:50 +01:00
Claudio Atzori e247218987 [maven-release-plugin] prepare release dnet-dedup-3.0.2 2018-11-12 12:46:42 +01:00
Claudio Atzori b7bc7f0401 getting rid of spark libs from dnet-pace-core 2018-11-12 12:46:06 +01:00
Claudio Atzori 3dacba37ea [maven-release-plugin] prepare for next development iteration 2018-11-12 11:40:42 +01:00
Claudio Atzori 8cc2517f5d [maven-release-plugin] prepare release dnet-dedup-3.0.1 2018-11-12 11:40:34 +01:00
Claudio Atzori 851ae5eec3 [maven-release-plugin] rollback the release of dnet-dedup-3.0.1 2018-11-12 11:39:07 +01:00
Claudio Atzori f283d58a6e [maven-release-plugin] prepare release dnet-dedup-3.0.1 2018-11-12 11:38:52 +01:00
Claudio Atzori 6d09041288 [maven-release-plugin] rollback the release of dnet-dedup-3.0.1 2018-11-12 11:28:28 +01:00
Claudio Atzori 46cee13596 [maven-release-plugin] prepare for next development iteration 2018-11-12 11:24:06 +01:00
Claudio Atzori e1c69ad24e [maven-release-plugin] prepare release dnet-dedup-3.0.1 2018-11-12 11:23:57 +01:00
Michele De Bonis b247a86e69 configuration files changed: dedupRun instead of run, assertion updated in tests 2018-11-06 11:02:00 +01:00
Michele De Bonis 4c8485d0bb deleted useless imports 2018-11-06 09:48:22 +01:00
Michele De Bonis 748189af10 implementation of JaroWinklerNormalizedName, addition of various stopwords in different languages and configuration test 2018-11-05 17:22:59 +01:00
Claudio Atzori e296f7a81c added DiffPatchMatch utility. Resumed commented tests! 2018-10-31 10:49:11 +01:00
Michele De Bonis dc41b76643 serialization test added. useless getter methods ignored by json serialization 2018-10-29 16:16:11 +01:00
Michele De Bonis ea36007d1f DedupConf parsed using Jackson library 2018-10-29 11:13:55 +01:00
Michele De Bonis 8b4762bf54 implementation of the toString methods changed: from Gson to Jackson 2018-10-26 14:55:59 +02:00
Michele De Bonis 3cf3dc1934 modification in the initialization of clustering functions, distance algos and conditions. 2018-10-25 15:15:40 +02:00
Michele De Bonis 1cbbc3f15a update in the discovery of clustering, conditions and distance functions (annotated with custom annotations) 2018-10-24 12:09:41 +02:00
Claudio Atzori 4d379c2227 revised PidMatch implementation, cleanup 2018-10-20 08:38:19 +02:00
Claudio Atzori 3197f26691 [maven-release-plugin] prepare for next development iteration 2018-10-18 12:17:34 +02:00
Claudio Atzori 63815be2d6 [maven-release-plugin] prepare release dnet-dedup-3.0.0 2018-10-18 12:17:27 +02:00
Claudio Atzori ed14476b06 [maven-release-plugin] rollback the release of dnet-dedup-3.0.0 2018-10-18 12:13:03 +02:00
Claudio Atzori 82d5dce114 [maven-release-plugin] prepare release dnet-dedup-3.0.0 2018-10-18 12:12:45 +02:00
Claudio Atzori 4f29124607 [maven-release-plugin] rollback the release of dnet-dedup-3.0.0 2018-10-18 12:00:45 +02:00
Claudio Atzori 5a48937ae1 [maven-release-plugin] prepare for next development iteration 2018-10-18 11:58:43 +02:00
Claudio Atzori 5aec80345f [maven-release-plugin] prepare release dnet-dedup-3.0.0 2018-10-18 11:58:36 +02:00
Claudio Atzori 1b46966383 updated maven project structure 2018-10-18 11:56:26 +02:00
Michele De Bonis 72ebf7c0f3 update of the spark test 2018-10-18 10:12:44 +02:00
Sandro La Bruzzo 1bb5c26e6d Added FSpark Implementation of dedup 2018-10-11 15:19:20 +02:00
Sandro La Bruzzo d1c73bcf90 Added First Implementation of Spark Test 2018-10-02 17:07:17 +02:00
Sandro La Bruzzo 476c3d7b07 added d-net pace core module and ignored target folder 2018-10-02 10:37:54 +02:00
524 changed files with 54722 additions and 9821 deletions

.gitignore (vendored), 1 change

@@ -26,3 +26,4 @@ spark-warehouse
/**/*.log
/**/.factorypath
/**/.scalafmt.conf
/.java-version

README.md, 128 changes

@@ -1,2 +1,128 @@
# dnet-hadoop
Dnet-hadoop is the project that defined all the OOZIE workflows for the OpenAIRE Graph construction, processing, provisioning.
Dnet-hadoop is the project that defines all the [OOZIE workflows](https://oozie.apache.org/) for the OpenAIRE Graph construction, processing and provisioning.
How to build, package and run oozie workflows
====================
The oozie-installer is a utility for building, uploading and running oozie workflows. In practice, it creates a `*.tar.gz`
package that contains the resources defining a workflow and some helper scripts.
This module is automatically executed when running:
`mvn package -Poozie-package -Dworkflow.source.dir=classpath/to/parent/directory/of/oozie_app`
on a module having set:
```
<parent>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-workflows</artifactId>
</parent>
```
in the `pom.xml` file. The `oozie-package` profile initializes oozie workflow packaging, while the `workflow.source.dir` property points to
a workflow (notice: this is not a relative path but a classpath to the directory usually holding the `oozie_app` subdirectory).
The outcome of this packaging is an `oozie-package.tar.gz` file containing all the resources required to run the Oozie workflow:
- jar packages
- workflow definitions
- job properties
- maintenance scripts
Required properties
====================
In order to include the proper workflow within the package, the `workflow.source.dir` property has to be set. It can be provided
by setting the `-Dworkflow.source.dir=some/job/dir` maven parameter.
In order to define the full set of cluster environment properties one should create a `~/.dhp/application.properties` file with
the following properties:
- `dhp.hadoop.frontend.user.name` - your user name on hadoop cluster and frontend machine
- `dhp.hadoop.frontend.host.name` - frontend host name
- `dhp.hadoop.frontend.temp.dir` - frontend directory for temporary files
- `dhp.hadoop.frontend.port.ssh` - frontend machine ssh port
- `oozieServiceLoc` - oozie service location required by run_workflow.sh script executing oozie job
- `nameNode` - name node address
- `jobTracker` - job tracker address
- `oozie.execution.log.file.location` - location of the file created when executing the oozie job; it contains the output
produced by the `run_workflow.sh` script (needed to obtain the oozie job id)
- `maven.executable` - mvn command location, requires parameterization due to a different setup of CI cluster
- `sparkDriverMemory` - amount of memory assigned to spark jobs driver
- `sparkExecutorMemory` - amount of memory assigned to spark jobs executors
- `sparkExecutorCores` - number of cores assigned to spark jobs executors
All values will be overridden with the ones from `job.properties` and eventually `job-override.properties` stored in the module's
main folder.
To override properties from `job.properties`, a `job-override.properties` file can be created in the main module directory
(the one containing the `pom.xml` file), defining all the new properties which will override the existing ones.
One can also provide those properties one by one as command line `-D` arguments.
Properties overriding order is the following:
1. `pom.xml` defined properties (located in the project root dir)
2. `~/.dhp/application.properties` defined properties
3. `${workflow.source.dir}/job.properties`
4. `job-override.properties` (located in the project root dir)
5. `maven -Dparam=value`
where the maven `-Dparam` property overrides all the other ones.
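For reference, a minimal `~/.dhp/application.properties` could look like the following; every value below is a placeholder, to be replaced with the actual settings of your cluster:
```
# placeholder values - replace with your actual cluster settings
dhp.hadoop.frontend.user.name=myuser
dhp.hadoop.frontend.host.name=hadoop-frontend.example.org
dhp.hadoop.frontend.temp.dir=/home/myuser/tmp
dhp.hadoop.frontend.port.ssh=22
oozieServiceLoc=http://hadoop-frontend.example.org:11000/oozie
nameNode=hdfs://namenode.example.org:8020
jobTracker=yarnRM
oozie.execution.log.file.location=/tmp/oozie-execution.log
maven.executable=mvn
sparkDriverMemory=7G
sparkExecutorMemory=7G
sparkExecutorCores=4
```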
Workflow definition requirements
====================
`workflow.source.dir` property should point to the following directory structure:
[${workflow.source.dir}]
|
|-job.properties (optional)
|
\-[oozie_app]
  |
  \-workflow.xml
This property can be set using the maven `-D` switch.
`[oozie_app]` is the default directory name; however, it can be set to any value as long as the `oozieAppDir` property is
provided with the directory name as value.
Sub-workflows are supported as well; sub-workflow directories should be nested within the `[oozie_app]` directory.
Creating oozie installer step-by-step
=====================================
Automated oozie-installer steps are the following:
1. creating jar packages: `*.jar` and `*tests.jar` along with copying all dependencies in `target/dependencies`
2. reading properties from maven, `~/.dhp/application.properties`, `job.properties`, `job-override.properties`
3. invoking priming mechanism linking resources from import.txt file (currently resolving subworkflow resources)
4. assembling shell scripts for preparing Hadoop filesystem, uploading Oozie application and starting workflow
5. copying whole `${workflow.source.dir}` content to `target/${oozie.package.file.name}`
6. generating updated `job.properties` file in `target/${oozie.package.file.name}` based on maven,
`~/.dhp/application.properties`, `job.properties` and `job-override.properties`
7. creating `lib` directory (or multiple directories for sub-workflows for each nested directory) and copying jar packages
created at step (1) to each one of them
8. bundling whole `${oozie.package.file.name}` directory into single tar.gz package
Uploading oozie package and running workflow on cluster
=======================================================
In order to simplify the deployment and execution process, two dedicated profiles were introduced:
- `deploy`
- `run`
to be used along with `oozie-package` profile e.g. by providing `-Poozie-package,deploy,run` maven parameters.
The `deploy` profile supplements the packaging process with:
1) uploading the oozie-package via scp to the `/home/${user.name}/oozie-packages` directory on the `${dhp.hadoop.frontend.host.name}` machine
2) extracting the uploaded package
3) uploading the oozie content to the hadoop cluster HDFS location defined in the `oozie.wf.application.path` property (generated dynamically by the maven build process, based on the `${dhp.hadoop.frontend.user.name}` and `workflow.source.dir` properties)
The `run` profile introduces:
1) executing the oozie application uploaded to the HDFS cluster by the `deploy` step. It triggers the `run_workflow.sh` script, providing the runtime properties defined in the `job.properties` file.
Notice: ssh access to the frontend machine has to be configured at system level, and key-based authentication is preferable in order to simplify remote operations.
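For example, packaging, deploying and running a workflow in one go could look like the following; the workflow path is a placeholder, not an actual module of this repository:
```
mvn clean package -Poozie-package,deploy,run \
  -Dworkflow.source.dir=eu/dnetlib/dhp/examples/myworkflow
```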


@@ -52,6 +52,8 @@
</execution>
</executions>
<configuration>
<failOnMultipleScalaVersions>true</failOnMultipleScalaVersions>
<scalaCompatVersion>${scala.binary.version}</scalaCompatVersion>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
@@ -60,6 +62,11 @@
</build>
<dependencies>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-pace-core</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
@@ -76,11 +83,11 @@
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
</dependency>
<dependency>
@@ -142,11 +149,6 @@
<artifactId>okhttp</artifactId>
</dependency>
<dependency>
<groupId>eu.dnetlib</groupId>
<artifactId>dnet-pace-core</artifactId>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
@@ -159,7 +161,7 @@
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-schemas</artifactId>
<artifactId>${dhp-schemas.artifact}</artifactId>
</dependency>
<dependency>


@@ -10,6 +10,12 @@ public class Constants {
public static final Map<String, String> accessRightsCoarMap = Maps.newHashMap();
public static final Map<String, String> coarCodeLabelMap = Maps.newHashMap();
public static final String ROR_NS_PREFIX = "ror_________";
public static final String ROR_OPENAIRE_ID = "10|openaire____::993a7ae7a863813cf95028b50708e222";
public static final String ROR_DATASOURCE_NAME = "Research Organization Registry (ROR)";
public static String COAR_ACCESS_RIGHT_SCHEMA = "http://vocabularies.coar-repositories.org/documentation/access_rights/";
private Constants() {


@@ -0,0 +1,100 @@
package eu.dnetlib.dhp.common;
/**
* This utility represents the Metadata Store information
* needed during the migration from mongo to HDFS.
*/
public class MDStoreInfo {
private String mdstore;
private String currentId;
private Long latestTimestamp;
/**
* Instantiates a new Md store info.
*/
public MDStoreInfo() {
}
/**
* Instantiates a new Md store info.
*
* @param mdstore the mdstore
* @param currentId the current id
* @param latestTimestamp the latest timestamp
*/
public MDStoreInfo(String mdstore, String currentId, Long latestTimestamp) {
this.mdstore = mdstore;
this.currentId = currentId;
this.latestTimestamp = latestTimestamp;
}
/**
* Gets mdstore.
*
* @return the mdstore
*/
public String getMdstore() {
return mdstore;
}
/**
* Sets mdstore.
*
* @param mdstore the mdstore
* @return the mdstore
*/
public MDStoreInfo setMdstore(String mdstore) {
this.mdstore = mdstore;
return this;
}
/**
* Gets current id.
*
* @return the current id
*/
public String getCurrentId() {
return currentId;
}
/**
* Sets current id.
*
* @param currentId the current id
* @return the current id
*/
public MDStoreInfo setCurrentId(String currentId) {
this.currentId = currentId;
return this;
}
/**
* Gets latest timestamp.
*
* @return the latest timestamp
*/
public Long getLatestTimestamp() {
return latestTimestamp;
}
/**
* Sets latest timestamp.
*
* @param latestTimestamp the latest timestamp
* @return the latest timestamp
*/
public MDStoreInfo setLatestTimestamp(Long latestTimestamp) {
this.latestTimestamp = latestTimestamp;
return this;
}
@Override
public String toString() {
return "MDStoreInfo{" +
"mdstore='" + mdstore + '\'' +
", currentId='" + currentId + '\'' +
", latestTimestamp=" + latestTimestamp +
'}';
}
}
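The fluent setters above return `this`, so they chain; a minimal usage sketch (the identifiers and timestamp below are invented placeholders):
```
// All values are illustrative placeholders.
MDStoreInfo info = new MDStoreInfo()
	.setMdstore("md-00000000-0000-0000-0000-000000000000")
	.setCurrentId("md-00000000-0000-0000-0000-000000000000-1234")
	.setLatestTimestamp(1666000000000L);
System.out.println(info); // prints the MDStoreInfo{...} form defined by toString() above
```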


@@ -1,12 +1,12 @@
package eu.dnetlib.dhp.common;
import static com.mongodb.client.model.Sorts.descending;
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;
import org.apache.commons.lang3.StringUtils;
@@ -38,6 +38,26 @@ public class MdstoreClient implements Closeable {
this.db = getDb(client, dbName);
}
private Long parseTimestamp(Document f) {
if (f == null || !f.containsKey("timestamp"))
return null;
Object ts = f.get("timestamp");
return Long.parseLong(ts.toString());
}
public Long getLatestTimestamp(final String collectionId) {
MongoCollection<Document> collection = db.getCollection(collectionId);
FindIterable<Document> result = collection.find().sort(descending("timestamp")).limit(1);
if (result == null) {
return null;
}
Document f = result.first();
return parseTimestamp(f);
}
public MongoCollection<Document> mdStore(final String mdId) {
BasicDBObject query = (BasicDBObject) QueryBuilder.start("mdId").is(mdId).get();
@@ -54,6 +74,16 @@
return getColl(db, currentId, true);
}
public List<MDStoreInfo> mdStoreWithTimestamp(final String mdFormat, final String mdLayout,
final String mdInterpretation) {
Map<String, String> res = validCollections(mdFormat, mdLayout, mdInterpretation);
return res
.entrySet()
.stream()
.map(e -> new MDStoreInfo(e.getKey(), e.getValue(), getLatestTimestamp(e.getValue())))
.collect(Collectors.toList());
}
public Map<String, String> validCollections(
final String mdFormat, final String mdLayout, final String mdInterpretation) {

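To illustrate the new methods, a hedged usage sketch: the connection string, database name and the format/layout/interpretation triple are placeholders, and the two-argument constructor is an assumption not shown in this diff:
```
// Sketch only: all parameters are illustrative placeholders.
try (MdstoreClient client = new MdstoreClient("mongodb://localhost:27017", "mdstore")) {
	for (MDStoreInfo info : client.mdStoreWithTimestamp("ODF", "store", "cleaned")) {
		System.out.println(info.getMdstore() + " -> " + info.getLatestTimestamp());
	}
}
```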

@@ -0,0 +1,81 @@
package eu.dnetlib.dhp.common.action;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.common.DbClient;
import eu.dnetlib.dhp.common.action.model.MasterDuplicate;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
public class ReadDatasourceMasterDuplicateFromDB {
private static final Logger log = LoggerFactory.getLogger(ReadDatasourceMasterDuplicateFromDB.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private static final String QUERY = "SELECT distinct dd.id as masterId, d.officialname as masterName, dd.duplicate as duplicateId "
+
"FROM dsm_dedup_services dd join dsm_services d on (dd.id = d.id);";
public static int execute(String dbUrl, String dbUser, String dbPassword, String hdfsPath, String hdfsNameNode)
throws IOException {
int count = 0;
try (DbClient dbClient = new DbClient(dbUrl, dbUser, dbPassword)) {
Configuration conf = new Configuration();
conf.set("fs.defaultFS", hdfsNameNode);
FileSystem fileSystem = FileSystem.get(conf);
FSDataOutputStream fos = fileSystem.create(new Path(hdfsPath));
log.info("running query: {}", QUERY);
log.info("storing results in: {}", hdfsPath);
try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(fos, StandardCharsets.UTF_8))) {
dbClient.processResults(QUERY, rs -> writeMap(datasourceMasterMap(rs), writer));
count++;
}
}
return count;
}
private static MasterDuplicate datasourceMasterMap(ResultSet rs) {
try {
final MasterDuplicate md = new MasterDuplicate();
final String duplicateId = rs.getString("duplicateId");
final String masterId = rs.getString("masterId");
final String masterName = rs.getString("masterName");
md.setDuplicateId(OafMapperUtils.createOpenaireId(10, duplicateId, true));
md.setMasterId(OafMapperUtils.createOpenaireId(10, masterId, true));
md.setMasterName(masterName);
return md;
} catch (final SQLException e) {
throw new RuntimeException(e);
}
}
private static void writeMap(final MasterDuplicate dm, final BufferedWriter writer) {
try {
writer.write(OBJECT_MAPPER.writeValueAsString(dm));
writer.newLine();
} catch (final IOException e) {
throw new RuntimeException(e);
}
}
}
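A hedged invocation sketch of the class above; the database URL, credentials and paths are placeholders:
```
// All connection parameters and paths below are placeholders.
int count = ReadDatasourceMasterDuplicateFromDB
	.execute("jdbc:postgresql://db.example.org:5432/dnet", "dnetuser", "secret",
		"/user/dnet/masterduplicate", "hdfs://namenode.example.org:8020");
```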


@@ -0,0 +1,38 @@
package eu.dnetlib.dhp.common.action.model;
import java.io.Serializable;
/**
* @author miriam.baglioni
* @Date 21/07/22
*/
public class MasterDuplicate implements Serializable {
private String duplicateId;
private String masterId;
private String masterName;
public String getDuplicateId() {
return duplicateId;
}
public void setDuplicateId(String duplicateId) {
this.duplicateId = duplicateId;
}
public String getMasterId() {
return masterId;
}
public void setMasterId(String masterId) {
this.masterId = masterId;
}
public String getMasterName() {
return masterName;
}
public void setMasterName(String masterName) {
this.masterName = masterName;
}
}
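Each of these beans ends up as one JSON line on HDFS, written by the `writeMap` method shown above; a sketch of the serialized form, with invented identifiers:
```
// Requires com.fasterxml.jackson.databind.ObjectMapper; identifiers are invented.
MasterDuplicate md = new MasterDuplicate();
md.setDuplicateId("10|re3data_____::0000000000000000000000000000cafe");
md.setMasterId("10|openaire____::00000000000000000000000000000abc");
md.setMasterName("Example Data Archive");
System.out.println(new ObjectMapper().writeValueAsString(md));
// {"duplicateId":"10|re3data_____::...cafe","masterId":"10|openaire____::...abc","masterName":"Example Data Archive"}
```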


@@ -3,10 +3,13 @@ package eu.dnetlib.dhp.common.api;
import java.io.*;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.TimeUnit;
import org.apache.http.HttpHeaders;
import org.apache.http.entity.ContentType;
import org.jetbrains.annotations.NotNull;
import com.google.gson.Gson;
@@ -60,33 +63,31 @@
*/
public int newDeposition() throws IOException {
String json = "{}";
OkHttpClient httpClient = new OkHttpClient.Builder().connectTimeout(600, TimeUnit.SECONDS).build();
RequestBody body = RequestBody.create(json, MEDIA_TYPE_JSON);
Request request = new Request.Builder()
.url(urlString)
.addHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON.toString()) // add request headers
.addHeader(HttpHeaders.AUTHORIZATION, "Bearer " + access_token)
.post(body)
.build();
try (Response response = httpClient.newCall(request).execute()) {
if (!response.isSuccessful())
throw new IOException("Unexpected code " + response + response.body().string());
// Get response body
json = response.body().string();
ZenodoModel newSubmission = new Gson().fromJson(json, ZenodoModel.class);
this.bucket = newSubmission.getLinks().getBucket();
this.deposition_id = newSubmission.getId();
return response.code();
URL url = new URL(urlString);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON.toString());
conn.setRequestProperty(HttpHeaders.AUTHORIZATION, "Bearer " + access_token);
conn.setRequestMethod("POST");
conn.setDoOutput(true);
try (OutputStream os = conn.getOutputStream()) {
byte[] input = json.getBytes("utf-8");
os.write(input, 0, input.length);
}
String body = getBody(conn);
int responseCode = conn.getResponseCode();
conn.disconnect();
if (!checkOKStatus(responseCode))
throw new IOException("Unexpected code " + responseCode + body);
ZenodoModel newSubmission = new Gson().fromJson(body, ZenodoModel.class);
this.bucket = newSubmission.getLinks().getBucket();
this.deposition_id = newSubmission.getId();
return responseCode;
}
/**
@ -94,28 +95,48 @@ public class ZenodoAPIClient implements Serializable {
*
* @param is the inputStream for the file to upload
* @param file_name the name of the file as it will appear on Zenodo
* @param len the size of the file
* @return the response code
*/
public int uploadIS(InputStream is, String file_name, long len) throws IOException {
OkHttpClient httpClient = new OkHttpClient.Builder()
.writeTimeout(600, TimeUnit.SECONDS)
.readTimeout(600, TimeUnit.SECONDS)
.connectTimeout(600, TimeUnit.SECONDS)
.build();
Request request = new Request.Builder()
.url(bucket + "/" + file_name)
.addHeader(HttpHeaders.CONTENT_TYPE, "application/zip") // add request headers
.addHeader(HttpHeaders.AUTHORIZATION, "Bearer " + access_token)
.put(InputStreamRequestBody.create(MEDIA_TYPE_ZIP, is, len))
.build();
try (Response response = httpClient.newCall(request).execute()) {
if (!response.isSuccessful())
throw new IOException("Unexpected code " + response + response.body().string());
return response.code();
}
}
public int uploadIS(InputStream is, String file_name) throws IOException {
URL url = new URL(bucket + "/" + file_name);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty(HttpHeaders.CONTENT_TYPE, "application/zip");
conn.setRequestProperty(HttpHeaders.AUTHORIZATION, "Bearer " + access_token);
conn.setDoOutput(true);
conn.setRequestMethod("PUT");
byte[] buf = new byte[8192];
int length;
try (OutputStream os = conn.getOutputStream()) {
while ((length = is.read(buf)) != -1) {
os.write(buf, 0, length);
}
}
int responseCode = conn.getResponseCode();
if (!checkOKStatus(responseCode)) {
throw new IOException("Unexpected code " + responseCode + getBody(conn));
}
return responseCode;
}
@NotNull
private String getBody(HttpURLConnection conn) throws IOException {
String body = "{}";
try (BufferedReader br = new BufferedReader(
new InputStreamReader(conn.getInputStream(), "utf-8"))) {
StringBuilder response = new StringBuilder();
String responseLine = null;
while ((responseLine = br.readLine()) != null) {
response.append(responseLine.trim());
}
body = response.toString();
}
return body;
}
/**
@ -127,26 +148,34 @@ public class ZenodoAPIClient implements Serializable {
*/
public int sendMretadata(String metadata) throws IOException {
OkHttpClient httpClient = new OkHttpClient.Builder().connectTimeout(600, TimeUnit.SECONDS).build();
RequestBody body = RequestBody.create(metadata, MEDIA_TYPE_JSON);
Request request = new Request.Builder()
.url(urlString + "/" + deposition_id)
.addHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON.toString()) // add request headers
.addHeader(HttpHeaders.AUTHORIZATION, "Bearer " + access_token)
.put(body)
.build();
try (Response response = httpClient.newCall(request).execute()) {
if (!response.isSuccessful())
throw new IOException("Unexpected code " + response + response.body().string());
return response.code();
}
URL url = new URL(urlString + "/" + deposition_id);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON.toString());
conn.setRequestProperty(HttpHeaders.AUTHORIZATION, "Bearer " + access_token);
conn.setDoOutput(true);
conn.setRequestMethod("PUT");
try (OutputStream os = conn.getOutputStream()) {
byte[] input = metadata.getBytes("utf-8");
os.write(input, 0, input.length);
}
final int responseCode = conn.getResponseCode();
conn.disconnect();
if (!checkOKStatus(responseCode))
throw new IOException("Unexpected code " + responseCode + getBody(conn));
return responseCode;
}
private boolean checkOKStatus(int responseCode) {
return HttpURLConnection.HTTP_OK == responseCode ||
HttpURLConnection.HTTP_CREATED == responseCode;
}
/**
@ -155,6 +184,7 @@ public class ZenodoAPIClient implements Serializable {
* @return response code
* @throws IOException
*/
@Deprecated
public int publish() throws IOException {
String json = "{}";
@ -194,28 +224,34 @@ public class ZenodoAPIClient implements Serializable {
setDepositionId(concept_rec_id, 1);
String json = "{}";
OkHttpClient httpClient = new OkHttpClient.Builder().connectTimeout(600, TimeUnit.SECONDS).build();
RequestBody body = RequestBody.create(json, MEDIA_TYPE_JSON);
Request request = new Request.Builder()
.url(urlString + "/" + deposition_id + "/actions/newversion")
.addHeader(HttpHeaders.AUTHORIZATION, "Bearer " + access_token)
.post(body)
.build();
try (Response response = httpClient.newCall(request).execute()) {
if (!response.isSuccessful())
throw new IOException("Unexpected code " + response + response.body().string());
ZenodoModel zenodoModel = new Gson().fromJson(response.body().string(), ZenodoModel.class);
String latest_draft = zenodoModel.getLinks().getLatest_draft();
deposition_id = latest_draft.substring(latest_draft.lastIndexOf("/") + 1);
bucket = getBucket(latest_draft);
return response.code();
}
URL url = new URL(urlString + "/" + deposition_id + "/actions/newversion");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty(HttpHeaders.AUTHORIZATION, "Bearer " + access_token);
conn.setDoOutput(true);
conn.setRequestMethod("POST");
try (OutputStream os = conn.getOutputStream()) {
byte[] input = json.getBytes("utf-8");
os.write(input, 0, input.length);
}
String body = getBody(conn);
int responseCode = conn.getResponseCode();
conn.disconnect();
if (!checkOKStatus(responseCode))
throw new IOException("Unexpected code " + responseCode + body);
ZenodoModel zenodoModel = new Gson().fromJson(body, ZenodoModel.class);
String latest_draft = zenodoModel.getLinks().getLatest_draft();
deposition_id = latest_draft.substring(latest_draft.lastIndexOf("/") + 1);
bucket = getBucket(latest_draft);
return responseCode;
}
/**
@ -233,24 +269,32 @@ public class ZenodoAPIClient implements Serializable {
this.deposition_id = deposition_id;
OkHttpClient httpClient = new OkHttpClient.Builder().connectTimeout(600, TimeUnit.SECONDS).build();
Request request = new Request.Builder()
.url(urlString + "/" + deposition_id)
.addHeader("Authorization", "Bearer " + access_token)
.build();
try (Response response = httpClient.newCall(request).execute()) {
if (!response.isSuccessful())
throw new IOException("Unexpected code " + response + response.body().string());
ZenodoModel zenodoModel = new Gson().fromJson(response.body().string(), ZenodoModel.class);
bucket = zenodoModel.getLinks().getBucket();
return response.code();
}
String json = "{}";
URL url = new URL(urlString + "/" + deposition_id);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty(HttpHeaders.AUTHORIZATION, "Bearer " + access_token);
conn.setRequestMethod("POST");
conn.setDoOutput(true);
try (OutputStream os = conn.getOutputStream()) {
byte[] input = json.getBytes("utf-8");
os.write(input, 0, input.length);
}
String body = getBody(conn);
int responseCode = conn.getResponseCode();
conn.disconnect();
if (!checkOKStatus(responseCode))
throw new IOException("Unexpected code " + responseCode + body);
ZenodoModel zenodoModel = new Gson().fromJson(body, ZenodoModel.class);
bucket = zenodoModel.getLinks().getBucket();
return responseCode;
}
private void setDepositionId(String concept_rec_id, Integer page) throws IOException, MissingConceptDoiException {
@ -273,53 +317,48 @@ public class ZenodoAPIClient implements Serializable {
private String getPrevDepositions(String page) throws IOException {
HttpUrl.Builder urlBuilder = HttpUrl.parse(urlString).newBuilder();
urlBuilder.addQueryParameter("page", page);
OkHttpClient httpClient = new OkHttpClient.Builder().connectTimeout(600, TimeUnit.SECONDS).build();
String url = urlBuilder.build().toString();
Request request = new Request.Builder()
.url(url)
.addHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON.toString()) // add request headers
.addHeader(HttpHeaders.AUTHORIZATION, "Bearer " + access_token)
.get()
.build();
try (Response response = httpClient.newCall(request).execute()) {
if (!response.isSuccessful())
throw new IOException("Unexpected code " + response + response.body().string());
return response.body().string();
}
URL url = new URL(urlBuilder.build().toString());
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON.toString());
conn.setRequestProperty(HttpHeaders.AUTHORIZATION, "Bearer " + access_token);
conn.setRequestMethod("GET");
String body = getBody(conn);
int responseCode = conn.getResponseCode();
conn.disconnect();
if (!checkOKStatus(responseCode))
throw new IOException("Unexpected code " + responseCode + body);
return body;
}
private String getBucket(String url) throws IOException {
OkHttpClient httpClient = new OkHttpClient.Builder()
.connectTimeout(600, TimeUnit.SECONDS)
.build();
Request request = new Request.Builder()
.url(url)
.addHeader(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON.toString()) // add request headers
.addHeader(HttpHeaders.AUTHORIZATION, "Bearer " + access_token)
.get()
.build();
try (Response response = httpClient.newCall(request).execute()) {
if (!response.isSuccessful())
throw new IOException("Unexpected code " + response + response.body().string());
// Get response body
ZenodoModel zenodoModel = new Gson().fromJson(response.body().string(), ZenodoModel.class);
return zenodoModel.getLinks().getBucket();
}
}
private String getBucket(String inputUrl) throws IOException {
URL url = new URL(inputUrl);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON.toString());
conn.setRequestProperty(HttpHeaders.AUTHORIZATION, "Bearer " + access_token);
conn.setRequestMethod("GET");
String body = getBody(conn);
int responseCode = conn.getResponseCode();
conn.disconnect();
if (!checkOKStatus(responseCode))
throw new IOException("Unexpected code " + responseCode + body);
ZenodoModel zenodoModel = new Gson().fromJson(body, ZenodoModel.class);
return zenodoModel.getLinks().getBucket();
}

View File

@ -4,6 +4,7 @@ package eu.dnetlib.dhp.common.vocabulary;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import org.apache.commons.lang3.StringUtils;
@ -66,27 +67,39 @@ public class Vocabulary implements Serializable {
}
public Qualifier getTermAsQualifier(final String termId) {
if (StringUtils.isBlank(termId)) {
return OafMapperUtils.unknown(getId(), getName());
} else if (termExists(termId)) {
final VocabularyTerm t = getTerm(termId);
return OafMapperUtils.qualifier(t.getId(), t.getName(), getId(), getName());
} else {
return OafMapperUtils.qualifier(termId, termId, getId(), getName());
}
}
public Qualifier getTermAsQualifier(final String termId) {
return getTermAsQualifier(termId, false);
}
public Qualifier getTermAsQualifier(final String termId, boolean strict) {
final VocabularyTerm term = getTerm(termId);
if (Objects.nonNull(term)) {
return OafMapperUtils.qualifier(term.getId(), term.getName(), getId(), getName());
} else if (Objects.isNull(term) && strict) {
return OafMapperUtils.unknown(getId(), getName());
} else {
return OafMapperUtils.qualifier(termId, termId, getId(), getName());
}
}
public Qualifier getSynonymAsQualifier(final String syn) {
return getSynonymAsQualifier(syn, false);
}
public Qualifier getSynonymAsQualifier(final String syn, boolean strict) {
return Optional
.ofNullable(getTermBySynonym(syn))
.map(term -> getTermAsQualifier(term.getId()))
.map(term -> getTermAsQualifier(term.getId(), strict))
.orElse(null);
}
public Qualifier lookup(String id) {
return lookup(id, false);
}
public Qualifier lookup(String id, boolean strict) {
return Optional
.ofNullable(getSynonymAsQualifier(id))
.orElse(getTermAsQualifier(id));
.ofNullable(getSynonymAsQualifier(id, strict))
.orElse(getTermAsQualifier(id, strict));
}
}
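
A short sketch of the strict vs. lenient semantics introduced above; the vocabulary instance and term id are hypothetical. With strict = true a term missing from the vocabulary resolves to the scheme's unknown qualifier, while the lenient path echoes the id back as both classid and classname.

// vocabulary: a hypothetical Vocabulary instance, e.g. for dnet:languages
Qualifier lenient = vocabulary.lookup("not-a-term");      // classid/classname = "not-a-term"
Qualifier strict = vocabulary.lookup("not-a-term", true); // unknown qualifier for this scheme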

View File

@ -11,25 +11,18 @@ import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.oaf.Oaf;
import eu.dnetlib.dhp.schema.oaf.OafEntity;
import eu.dnetlib.dhp.schema.common.ModelSupport;
public class DispatchEntitiesSparkJob {
private static final Logger log = LoggerFactory.getLogger(DispatchEntitiesSparkJob.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
public static void main(String[] args) throws Exception {
String jsonConfiguration = IOUtils
@ -54,44 +47,51 @@ public class DispatchEntitiesSparkJob {
String outputPath = parser.get("outputPath");
log.info("outputPath: {}", outputPath);
String graphTableClassName = parser.get("graphTableClassName");
log.info("graphTableClassName: {}", graphTableClassName);
@SuppressWarnings("unchecked")
Class<? extends OafEntity> entityClazz = (Class<? extends OafEntity>) Class.forName(graphTableClassName);
boolean filterInvisible = Boolean.parseBoolean(parser.get("filterInvisible"));
log.info("filterInvisible: {}", filterInvisible);
SparkConf conf = new SparkConf();
runWithSparkSession(
conf,
isSparkSessionManaged,
spark -> {
HdfsSupport.remove(outputPath, spark.sparkContext().hadoopConfiguration());
dispatchEntities(spark, inputPath, entityClazz, outputPath);
});
spark -> dispatchEntities(spark, inputPath, outputPath, filterInvisible));
}
private static <T extends Oaf> void dispatchEntities(
private static void dispatchEntities(
SparkSession spark,
String inputPath,
Class<T> clazz,
String outputPath) {
String outputPath,
boolean filterInvisible) {
spark
.read()
.textFile(inputPath)
.filter((FilterFunction<String>) s -> isEntityType(s, clazz))
.map((MapFunction<String, String>) s -> StringUtils.substringAfter(s, "|"), Encoders.STRING())
.map(
(MapFunction<String, T>) value -> OBJECT_MAPPER.readValue(value, clazz),
Encoders.bean(clazz))
.write()
.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(outputPath);
Dataset<String> df = spark.read().textFile(inputPath);
ModelSupport.oafTypes.entrySet().parallelStream().forEach(entry -> {
String entityType = entry.getKey();
Class<?> clazz = entry.getValue();
final String entityPath = outputPath + "/" + entityType;
if (!entityType.equalsIgnoreCase("relation")) {
HdfsSupport.remove(entityPath, spark.sparkContext().hadoopConfiguration());
Dataset<Row> entityDF = spark
.read()
.schema(Encoders.bean(clazz).schema())
.json(
df
.filter((FilterFunction<String>) s -> s.startsWith(clazz.getName()))
.map(
(MapFunction<String, String>) s -> StringUtils.substringAfter(s, "|"),
Encoders.STRING()));
if (filterInvisible) {
entityDF = entityDF.filter("dataInfo.invisible != true");
}
entityDF
.write()
.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(entityPath);
}
});
}
private static <T extends Oaf> boolean isEntityType(final String s, final Class<T> clazz) {
return StringUtils.substringBefore(s, "|").equals(clazz.getName());
}
}
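
For context, the rewritten job reads a single text input where each line carries the entity class name and the entity JSON separated by a pipe, then writes one output folder per entity type. A hypothetical, abridged input line:

// eu.dnetlib.dhp.schema.oaf.Publication|{"id":"50|...","dataInfo":{"invisible":false},...}
// The class-name prefix selects the type, substringAfter(s, "|") strips it, and the record
// lands under outputPath/publication; with filterInvisible=true, entities whose
// dataInfo.invisible is true are dropped before writing.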

View File

@ -0,0 +1,14 @@
package eu.dnetlib.dhp.schema.oaf.utils;
public class DoiCleaningRule {
public static String clean(final String doi) {
return doi
.toLowerCase()
.replaceAll("\\s", "")
.replaceAll("^doi:", "")
.replaceFirst(CleaningFunctions.DOI_PREFIX_REGEX, CleaningFunctions.DOI_PREFIX);
}
}
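
Illustrative expectations for the rule above, on hypothetical inputs chosen so that the final prefix normalisation is a no-op:

DoiCleaningRule.clean("doi:10.1000/ABC"); // -> "10.1000/abc" (lowercased, "doi:" prefix stripped)
DoiCleaningRule.clean(" 10.1000/x yz ");  // -> "10.1000/xyz" (all whitespace removed)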

View File

@ -0,0 +1,25 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class FundRefCleaningRule {
public static final Pattern PATTERN = Pattern.compile("\\d+");
public static String clean(final String fundRefId) {
String s = fundRefId
.toLowerCase()
.replaceAll("\\s", "");
Matcher m = PATTERN.matcher(s);
if (m.find()) {
return m.group();
} else {
return "";
}
}
}
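
Hypothetical inputs illustrating the extraction above: after lowercasing and whitespace removal, the first run of digits is returned, or the empty string when none is found.

FundRefCleaningRule.clean("501100000780");   // -> "501100000780"
FundRefCleaningRule.clean("id 100000002");   // -> "100000002"
FundRefCleaningRule.clean("no-digits-here"); // -> ""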

View File

@ -35,61 +35,197 @@ public class GraphCleaningFunctions extends CleaningFunctions {
public static final String TITLE_FILTER_REGEX = String.format("(%s)|\\W|\\d", TITLE_TEST);
public static final int TITLE_FILTER_RESIDUAL_LENGTH = 5;
private static final String NAME_CLEANING_REGEX = "[\\r\\n\\t\\s]+";
public static <T extends Oaf> T cleanContext(T value, String contextId, String verifyParam) {
if (ModelSupport.isSubClass(value, Result.class)) {
final Result res = (Result) value;
if (shouldCleanContext(res, verifyParam)) {
res
.setContext(
res
.getContext()
.stream()
.filter(c -> !StringUtils.startsWith(c.getId().toLowerCase(), contextId))
.collect(Collectors.toList()));
}
return (T) res;
} else {
return value;
}
}
private static boolean shouldCleanContext(Result res, String verifyParam) {
boolean titleMatch = res
.getTitle()
.stream()
.filter(
t -> t
.getQualifier()
.getClassid()
.equalsIgnoreCase(ModelConstants.MAIN_TITLE_QUALIFIER.getClassid()))
.anyMatch(t -> t.getValue().toLowerCase().startsWith(verifyParam.toLowerCase()));
return titleMatch && Objects.nonNull(res.getContext());
}
public static <T extends Oaf> T cleanCountry(T value, String[] verifyParam, Set<String> hostedBy,
String collectedfrom, String country) {
if (ModelSupport.isSubClass(value, Result.class)) {
final Result res = (Result) value;
if (res.getInstance().stream().anyMatch(i -> hostedBy.contains(i.getHostedby().getKey())) ||
!res.getCollectedfrom().stream().anyMatch(cf -> cf.getValue().equals(collectedfrom))) {
return (T) res;
}
List<StructuredProperty> ids = getPidsAndAltIds(res).collect(Collectors.toList());
if (ids
.stream()
.anyMatch(
p -> p
.getQualifier()
.getClassid()
.equals(PidType.doi.toString()) && pidInParam(p.getValue(), verifyParam))) {
res
.setCountry(
res
.getCountry()
.stream()
.filter(
c -> toTakeCountry(c, country))
.collect(Collectors.toList()));
}
return (T) res;
} else {
return value;
}
}
private static <T extends Result> Stream<StructuredProperty> getPidsAndAltIds(T r) {
final Stream<StructuredProperty> resultPids = Optional
.ofNullable(r.getPid())
.map(Collection::stream)
.orElse(Stream.empty());
final Stream<StructuredProperty> instancePids = Optional
.ofNullable(r.getInstance())
.map(
instance -> instance
.stream()
.flatMap(
i -> Optional
.ofNullable(i.getPid())
.map(Collection::stream)
.orElse(Stream.empty())))
.orElse(Stream.empty());
final Stream<StructuredProperty> instanceAltIds = Optional
.ofNullable(r.getInstance())
.map(
instance -> instance
.stream()
.flatMap(
i -> Optional
.ofNullable(i.getAlternateIdentifier())
.map(Collection::stream)
.orElse(Stream.empty())))
.orElse(Stream.empty());
return Stream
.concat(
Stream.concat(resultPids, instancePids),
instanceAltIds);
}
private static boolean pidInParam(String value, String[] verifyParam) {
for (String s : verifyParam)
if (value.startsWith(s))
return true;
return false;
}
private static boolean toTakeCountry(Country c, String country) {
// If dataInfo is not set, or dataInfo.inferenceprovenance is not present, then the country
// cannot have been inserted via propagation, so it is kept
if (!Optional.ofNullable(c.getDataInfo()).isPresent())
return true;
if (!Optional.ofNullable(c.getDataInfo().getInferenceprovenance()).isPresent())
return true;
return !(c
.getClassid()
.equalsIgnoreCase(country) &&
c.getDataInfo().getInferenceprovenance().equals("propagation"));
}
public static <T extends Oaf> T fixVocabularyNames(T value) {
if (value instanceof Datasource) {
// nothing to clean here
} else if (value instanceof Project) {
// nothing to clean here
} else if (value instanceof Organization) {
Organization o = (Organization) value;
if (Objects.nonNull(o.getCountry())) {
fixVocabName(o.getCountry(), ModelConstants.DNET_COUNTRY_TYPE);
if (value instanceof OafEntity) {
OafEntity e = (OafEntity) value;
Optional
.ofNullable(e.getPid())
.ifPresent(pid -> pid.forEach(p -> fixVocabName(p.getQualifier(), ModelConstants.DNET_PID_TYPES)));
if (value instanceof Result) {
Result r = (Result) value;
fixVocabName(r.getLanguage(), ModelConstants.DNET_LANGUAGES);
fixVocabName(r.getResourcetype(), ModelConstants.DNET_DATA_CITE_RESOURCE);
fixVocabName(r.getBestaccessright(), ModelConstants.DNET_ACCESS_MODES);
if (Objects.nonNull(r.getSubject())) {
r.getSubject().forEach(s -> fixVocabName(s.getQualifier(), ModelConstants.DNET_SUBJECT_TYPOLOGIES));
}
if (Objects.nonNull(r.getInstance())) {
for (Instance i : r.getInstance()) {
fixVocabName(i.getAccessright(), ModelConstants.DNET_ACCESS_MODES);
fixVocabName(i.getRefereed(), ModelConstants.DNET_REVIEW_LEVELS);
Optional
.ofNullable(i.getPid())
.ifPresent(
pid -> pid.forEach(p -> fixVocabName(p.getQualifier(), ModelConstants.DNET_PID_TYPES)));
}
}
if (Objects.nonNull(r.getAuthor())) {
r.getAuthor().stream().filter(Objects::nonNull).forEach(a -> {
if (Objects.nonNull(a.getPid())) {
a.getPid().stream().filter(Objects::nonNull).forEach(p -> {
fixVocabName(p.getQualifier(), ModelConstants.DNET_PID_TYPES);
});
}
});
}
if (value instanceof Publication) {
} else if (value instanceof Dataset) {
} else if (value instanceof OtherResearchProduct) {
} else if (value instanceof Software) {
}
} else if (value instanceof Datasource) {
// nothing to clean here
} else if (value instanceof Project) {
// nothing to clean here
} else if (value instanceof Organization) {
Organization o = (Organization) value;
if (Objects.nonNull(o.getCountry())) {
fixVocabName(o.getCountry(), ModelConstants.DNET_COUNTRY_TYPE);
}
}
} else if (value instanceof Relation) {
// nothing to clean here
} else if (value instanceof Result) {
Result r = (Result) value;
fixVocabName(r.getLanguage(), ModelConstants.DNET_LANGUAGES);
fixVocabName(r.getResourcetype(), ModelConstants.DNET_DATA_CITE_RESOURCE);
fixVocabName(r.getBestaccessright(), ModelConstants.DNET_ACCESS_MODES);
if (Objects.nonNull(r.getSubject())) {
r.getSubject().forEach(s -> fixVocabName(s.getQualifier(), ModelConstants.DNET_SUBJECT_TYPOLOGIES));
}
if (Objects.nonNull(r.getInstance())) {
for (Instance i : r.getInstance()) {
fixVocabName(i.getAccessright(), ModelConstants.DNET_ACCESS_MODES);
fixVocabName(i.getRefereed(), ModelConstants.DNET_REVIEW_LEVELS);
}
}
if (Objects.nonNull(r.getAuthor())) {
r.getAuthor().stream().filter(Objects::nonNull).forEach(a -> {
if (Objects.nonNull(a.getPid())) {
a.getPid().stream().filter(Objects::nonNull).forEach(p -> {
fixVocabName(p.getQualifier(), ModelConstants.DNET_PID_TYPES);
});
}
});
}
if (value instanceof Publication) {
} else if (value instanceof Dataset) {
} else if (value instanceof OtherResearchProduct) {
} else if (value instanceof Software) {
}
}
return value;
}
public static <T extends Oaf> boolean filter(T value) {
if (Boolean.TRUE
if (!(value instanceof Relation) && (Boolean.TRUE
.equals(
Optional
.ofNullable(value)
@ -100,15 +236,16 @@ public class GraphCleaningFunctions extends CleaningFunctions {
d -> Optional
.ofNullable(d.getInvisible())
.orElse(true))
.orElse(true))
.orElse(true))) {
.orElse(false))
.orElse(true)))) {
return true;
}
if (value instanceof Datasource) {
// nothing to evaluate here
} else if (value instanceof Project) {
// nothing to evaluate here
final Project p = (Project) value;
return Objects.nonNull(p.getCode()) && StringUtils.isNotBlank(p.getCode().getValue());
} else if (value instanceof Organization) {
// nothing to evaluate here
} else if (value instanceof Relation) {
@ -135,15 +272,343 @@ public class GraphCleaningFunctions extends CleaningFunctions {
}
public static <T extends Oaf> T cleanup(T value, VocabularyGroup vocs) {
if (value instanceof Datasource) {
// nothing to clean here
} else if (value instanceof Project) {
// nothing to clean here
} else if (value instanceof Organization) {
Organization o = (Organization) value;
if (Objects.isNull(o.getCountry()) || StringUtils.isBlank(o.getCountry().getClassid())) {
o.setCountry(ModelConstants.UNKNOWN_COUNTRY);
if (value instanceof OafEntity) {
OafEntity e = (OafEntity) value;
if (Objects.nonNull(e.getPid())) {
e.setPid(processPidCleaning(e.getPid()));
}
if (value instanceof Datasource) {
// nothing to clean here
} else if (value instanceof Project) {
// nothing to clean here
} else if (value instanceof Organization) {
Organization o = (Organization) value;
if (Objects.isNull(o.getCountry()) || StringUtils.isBlank(o.getCountry().getClassid())) {
o.setCountry(ModelConstants.UNKNOWN_COUNTRY);
}
} else if (value instanceof Result) {
Result r = (Result) value;
if (Objects.nonNull(r.getFulltext())
&& (ModelConstants.SOFTWARE_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()) ||
ModelConstants.DATASET_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()))) {
r.setFulltext(null);
}
if (Objects.nonNull(r.getDateofacceptance())) {
Optional<String> date = cleanDateField(r.getDateofacceptance());
if (date.isPresent()) {
r.getDateofacceptance().setValue(date.get());
} else {
r.setDateofacceptance(null);
}
}
if (Objects.nonNull(r.getRelevantdate())) {
r
.setRelevantdate(
r
.getRelevantdate()
.stream()
.filter(Objects::nonNull)
.filter(sp -> Objects.nonNull(sp.getQualifier()))
.filter(sp -> StringUtils.isNotBlank(sp.getQualifier().getClassid()))
.map(sp -> {
sp.setValue(GraphCleaningFunctions.cleanDate(sp.getValue()));
return sp;
})
.filter(sp -> StringUtils.isNotBlank(sp.getValue()))
.collect(Collectors.toList()));
}
if (Objects.nonNull(r.getPublisher())) {
if (StringUtils.isBlank(r.getPublisher().getValue())) {
r.setPublisher(null);
} else {
r
.getPublisher()
.setValue(
r
.getPublisher()
.getValue()
.replaceAll(NAME_CLEANING_REGEX, " "));
}
}
if (Objects.isNull(r.getLanguage()) || StringUtils.isBlank(r.getLanguage().getClassid())) {
r
.setLanguage(
qualifier("und", "Undetermined", ModelConstants.DNET_LANGUAGES));
}
if (Objects.nonNull(r.getSubject())) {
List<Subject> subjects = Lists
.newArrayList(
r
.getSubject()
.stream()
.filter(Objects::nonNull)
.filter(sp -> StringUtils.isNotBlank(sp.getValue()))
.filter(sp -> Objects.nonNull(sp.getQualifier()))
.filter(sp -> StringUtils.isNotBlank(sp.getQualifier().getClassid()))
.map(s -> {
if ("dnet:result_subject".equals(s.getQualifier().getClassid())) {
s.getQualifier().setClassid(ModelConstants.DNET_SUBJECT_TYPOLOGIES);
s.getQualifier().setClassname(ModelConstants.DNET_SUBJECT_TYPOLOGIES);
}
return s;
})
.map(GraphCleaningFunctions::cleanValue)
.collect(
Collectors
.toMap(
s -> Optional
.ofNullable(s.getQualifier())
.map(q -> q.getClassid() + s.getValue())
.orElse(s.getValue()),
Function.identity(),
(s1, s2) -> Collections
.min(Lists.newArrayList(s1, s2), new SubjectProvenanceComparator())))
.values());
r.setSubject(subjects);
}
if (Objects.nonNull(r.getTitle())) {
r
.setTitle(
r
.getTitle()
.stream()
.filter(Objects::nonNull)
.filter(sp -> StringUtils.isNotBlank(sp.getValue()))
.filter(
sp -> {
final String title = sp
.getValue()
.toLowerCase();
final String decoded = Unidecode.decode(title);
if (StringUtils.contains(decoded, TITLE_TEST)) {
return decoded
.replaceAll(TITLE_FILTER_REGEX, "")
.length() > TITLE_FILTER_RESIDUAL_LENGTH;
}
return !decoded
.replaceAll("\\W|\\d", "")
.isEmpty();
})
.map(GraphCleaningFunctions::cleanValue)
.collect(Collectors.toList()));
}
if (Objects.nonNull(r.getFormat())) {
r
.setFormat(
r
.getFormat()
.stream()
.map(GraphCleaningFunctions::cleanValue)
.collect(Collectors.toList()));
}
if (Objects.nonNull(r.getDescription())) {
r
.setDescription(
r
.getDescription()
.stream()
.filter(Objects::nonNull)
.filter(sp -> StringUtils.isNotBlank(sp.getValue()))
.map(GraphCleaningFunctions::cleanValue)
.collect(Collectors.toList()));
}
if (Objects.isNull(r.getResourcetype()) || StringUtils.isBlank(r.getResourcetype().getClassid())) {
r
.setResourcetype(
qualifier(ModelConstants.UNKNOWN, "Unknown", ModelConstants.DNET_DATA_CITE_RESOURCE));
}
if (Objects.nonNull(r.getInstance())) {
for (Instance i : r.getInstance()) {
if (!vocs
.termExists(ModelConstants.DNET_PUBLICATION_RESOURCE, i.getInstancetype().getClassid())) {
if (r instanceof Publication) {
i
.setInstancetype(
OafMapperUtils
.qualifier(
"0038", "Other literature type",
ModelConstants.DNET_PUBLICATION_RESOURCE,
ModelConstants.DNET_PUBLICATION_RESOURCE));
} else if (r instanceof Dataset) {
i
.setInstancetype(
OafMapperUtils
.qualifier(
"0039", "Other dataset type", ModelConstants.DNET_PUBLICATION_RESOURCE,
ModelConstants.DNET_PUBLICATION_RESOURCE));
} else if (r instanceof Software) {
i
.setInstancetype(
OafMapperUtils
.qualifier(
"0040", "Other software type", ModelConstants.DNET_PUBLICATION_RESOURCE,
ModelConstants.DNET_PUBLICATION_RESOURCE));
} else if (r instanceof OtherResearchProduct) {
i
.setInstancetype(
OafMapperUtils
.qualifier(
"0020", "Other ORP type", ModelConstants.DNET_PUBLICATION_RESOURCE,
ModelConstants.DNET_PUBLICATION_RESOURCE));
}
}
if (Objects.nonNull(i.getPid())) {
i.setPid(processPidCleaning(i.getPid()));
}
if (Objects.nonNull(i.getAlternateIdentifier())) {
i.setAlternateIdentifier(processPidCleaning(i.getAlternateIdentifier()));
}
Optional
.ofNullable(i.getPid())
.ifPresent(pid -> {
final Set<StructuredProperty> pids = Sets.newHashSet(pid);
Optional
.ofNullable(i.getAlternateIdentifier())
.ifPresent(altId -> {
final Set<StructuredProperty> altIds = Sets.newHashSet(altId);
i.setAlternateIdentifier(Lists.newArrayList(Sets.difference(altIds, pids)));
});
});
if (Objects.isNull(i.getAccessright())
|| StringUtils.isBlank(i.getAccessright().getClassid())) {
i
.setAccessright(
accessRight(
ModelConstants.UNKNOWN, ModelConstants.NOT_AVAILABLE,
ModelConstants.DNET_ACCESS_MODES));
}
if (Objects.isNull(i.getHostedby()) || StringUtils.isBlank(i.getHostedby().getKey())) {
i.setHostedby(ModelConstants.UNKNOWN_REPOSITORY);
}
if (Objects.isNull(i.getRefereed()) || StringUtils.isBlank(i.getRefereed().getClassid())) {
i.setRefereed(qualifier("0000", "Unknown", ModelConstants.DNET_REVIEW_LEVELS));
}
if (Objects.nonNull(i.getDateofacceptance())) {
Optional<String> date = cleanDateField(i.getDateofacceptance());
if (date.isPresent()) {
i.getDateofacceptance().setValue(date.get());
} else {
i.setDateofacceptance(null);
}
}
if (StringUtils.isNotBlank(i.getFulltext()) &&
(ModelConstants.SOFTWARE_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()) ||
ModelConstants.DATASET_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()))) {
i.setFulltext(null);
}
}
}
if (Objects.isNull(r.getBestaccessright())
|| StringUtils.isBlank(r.getBestaccessright().getClassid())) {
Qualifier bestaccessrights = OafMapperUtils.createBestAccessRights(r.getInstance());
if (Objects.isNull(bestaccessrights)) {
r
.setBestaccessright(
qualifier(
ModelConstants.UNKNOWN, ModelConstants.NOT_AVAILABLE,
ModelConstants.DNET_ACCESS_MODES));
} else {
r.setBestaccessright(bestaccessrights);
}
}
if (Objects.nonNull(r.getAuthor())) {
r
.setAuthor(
r
.getAuthor()
.stream()
.filter(Objects::nonNull)
.filter(a -> StringUtils.isNotBlank(a.getFullname()))
.filter(a -> StringUtils.isNotBlank(a.getFullname().replaceAll("[\\W]", "")))
.map(GraphCleaningFunctions::cleanupAuthor)
.collect(Collectors.toList()));
boolean nullRank = r
.getAuthor()
.stream()
.anyMatch(a -> Objects.isNull(a.getRank()));
if (nullRank) {
int i = 1;
for (Author author : r.getAuthor()) {
author.setRank(i++);
}
}
for (Author a : r.getAuthor()) {
if (Objects.isNull(a.getPid())) {
a.setPid(Lists.newArrayList());
} else {
a
.setPid(
a
.getPid()
.stream()
.filter(Objects::nonNull)
.filter(p -> Objects.nonNull(p.getQualifier()))
.filter(p -> StringUtils.isNotBlank(p.getValue()))
.map(p -> {
// hack to distinguish orcid from orcid_pending
String pidProvenance = getProvenance(p.getDataInfo());
if (p
.getQualifier()
.getClassid()
.toLowerCase()
.contains(ModelConstants.ORCID)) {
if (pidProvenance
.equals(ModelConstants.SYSIMPORT_CROSSWALK_ENTITYREGISTRY)) {
p.getQualifier().setClassid(ModelConstants.ORCID);
} else {
p.getQualifier().setClassid(ModelConstants.ORCID_PENDING);
}
final String orcid = p
.getValue()
.trim()
.toLowerCase()
.replaceAll(ORCID_CLEANING_REGEX, "$1-$2-$3-$4");
if (orcid.length() == ORCID_LEN) {
p.setValue(orcid);
} else {
p.setValue("");
}
}
return p;
})
.filter(p -> StringUtils.isNotBlank(p.getValue()))
.collect(
Collectors
.toMap(
p -> p.getQualifier().getClassid() + p.getValue(),
Function.identity(),
(p1, p2) -> p1,
LinkedHashMap::new))
.values()
.stream()
.collect(Collectors.toList()));
}
}
}
if (value instanceof Publication) {
} else if (value instanceof Dataset) {
} else if (value instanceof OtherResearchProduct) {
} else if (value instanceof Software) {
}
}
} else if (value instanceof Relation) {
Relation r = (Relation) value;
@ -155,298 +620,40 @@ public class GraphCleaningFunctions extends CleaningFunctions {
r.setValidationDate(null);
r.setValidated(false);
}
} else if (value instanceof Result) {
Result r = (Result) value;
if (Objects.nonNull(r.getDateofacceptance())) {
Optional<String> date = cleanDateField(r.getDateofacceptance());
if (date.isPresent()) {
r.getDateofacceptance().setValue(date.get());
} else {
r.setDateofacceptance(null);
}
}
if (Objects.nonNull(r.getRelevantdate())) {
r
.setRelevantdate(
r
.getRelevantdate()
.stream()
.filter(Objects::nonNull)
.filter(sp -> Objects.nonNull(sp.getQualifier()))
.filter(sp -> StringUtils.isNotBlank(sp.getQualifier().getClassid()))
.map(sp -> {
sp.setValue(GraphCleaningFunctions.cleanDate(sp.getValue()));
return sp;
})
.filter(sp -> StringUtils.isNotBlank(sp.getValue()))
.collect(Collectors.toList()));
}
if (Objects.nonNull(r.getPublisher()) && StringUtils.isBlank(r.getPublisher().getValue())) {
r.setPublisher(null);
}
if (Objects.isNull(r.getLanguage()) || StringUtils.isBlank(r.getLanguage().getClassid())) {
r
.setLanguage(
qualifier("und", "Undetermined", ModelConstants.DNET_LANGUAGES));
}
if (Objects.nonNull(r.getSubject())) {
List<Subject> subjects = Lists
.newArrayList(
r
.getSubject()
.stream()
.filter(Objects::nonNull)
.filter(sp -> StringUtils.isNotBlank(sp.getValue()))
.filter(sp -> Objects.nonNull(sp.getQualifier()))
.filter(sp -> StringUtils.isNotBlank(sp.getQualifier().getClassid()))
.map(GraphCleaningFunctions::cleanValue)
.collect(
Collectors
.toMap(
s -> Optional
.ofNullable(s.getQualifier())
.map(q -> q.getClassid() + s.getValue())
.orElse(s.getValue()),
Function.identity(),
(s1, s2) -> Collections
.min(Lists.newArrayList(s1, s1), new SubjectProvenanceComparator())))
.values());
r.setSubject(subjects);
}
if (Objects.nonNull(r.getTitle())) {
r
.setTitle(
r
.getTitle()
.stream()
.filter(Objects::nonNull)
.filter(sp -> StringUtils.isNotBlank(sp.getValue()))
.filter(
sp -> {
final String title = sp
.getValue()
.toLowerCase();
final String decoded = Unidecode.decode(title);
if (StringUtils.contains(decoded, TITLE_TEST)) {
return decoded
.replaceAll(TITLE_FILTER_REGEX, "")
.length() > TITLE_FILTER_RESIDUAL_LENGTH;
}
return !decoded
.replaceAll("\\W|\\d", "")
.isEmpty();
})
.map(GraphCleaningFunctions::cleanValue)
.collect(Collectors.toList()));
}
if (Objects.nonNull(r.getFormat())) {
r
.setFormat(
r
.getFormat()
.stream()
.map(GraphCleaningFunctions::cleanValue)
.collect(Collectors.toList()));
}
if (Objects.nonNull(r.getDescription())) {
r
.setDescription(
r
.getDescription()
.stream()
.filter(Objects::nonNull)
.filter(sp -> StringUtils.isNotBlank(sp.getValue()))
.map(GraphCleaningFunctions::cleanValue)
.collect(Collectors.toList()));
}
if (Objects.nonNull(r.getPid())) {
r.setPid(processPidCleaning(r.getPid()));
}
if (Objects.isNull(r.getResourcetype()) || StringUtils.isBlank(r.getResourcetype().getClassid())) {
r
.setResourcetype(
qualifier(ModelConstants.UNKNOWN, "Unknown", ModelConstants.DNET_DATA_CITE_RESOURCE));
}
if (Objects.nonNull(r.getInstance())) {
for (Instance i : r.getInstance()) {
if (!vocs.termExists(ModelConstants.DNET_PUBLICATION_RESOURCE, i.getInstancetype().getClassid())) {
if (r instanceof Publication) {
i
.setInstancetype(
OafMapperUtils
.qualifier(
"0038", "Other literature type", ModelConstants.DNET_PUBLICATION_RESOURCE,
ModelConstants.DNET_PUBLICATION_RESOURCE));
} else if (r instanceof Dataset) {
i
.setInstancetype(
OafMapperUtils
.qualifier(
"0039", "Other dataset type", ModelConstants.DNET_PUBLICATION_RESOURCE,
ModelConstants.DNET_PUBLICATION_RESOURCE));
} else if (r instanceof Software) {
i
.setInstancetype(
OafMapperUtils
.qualifier(
"0040", "Other software type", ModelConstants.DNET_PUBLICATION_RESOURCE,
ModelConstants.DNET_PUBLICATION_RESOURCE));
} else if (r instanceof OtherResearchProduct) {
i
.setInstancetype(
OafMapperUtils
.qualifier(
"0020", "Other ORP type", ModelConstants.DNET_PUBLICATION_RESOURCE,
ModelConstants.DNET_PUBLICATION_RESOURCE));
}
}
if (Objects.nonNull(i.getPid())) {
i.setPid(processPidCleaning(i.getPid()));
}
if (Objects.nonNull(i.getAlternateIdentifier())) {
i.setAlternateIdentifier(processPidCleaning(i.getAlternateIdentifier()));
}
Optional
.ofNullable(i.getPid())
.ifPresent(pid -> {
final Set<StructuredProperty> pids = Sets.newHashSet(pid);
Optional
.ofNullable(i.getAlternateIdentifier())
.ifPresent(altId -> {
final Set<StructuredProperty> altIds = Sets.newHashSet(altId);
i.setAlternateIdentifier(Lists.newArrayList(Sets.difference(altIds, pids)));
});
});
if (Objects.isNull(i.getAccessright()) || StringUtils.isBlank(i.getAccessright().getClassid())) {
i
.setAccessright(
accessRight(
ModelConstants.UNKNOWN, ModelConstants.NOT_AVAILABLE,
ModelConstants.DNET_ACCESS_MODES));
}
if (Objects.isNull(i.getHostedby()) || StringUtils.isBlank(i.getHostedby().getKey())) {
i.setHostedby(ModelConstants.UNKNOWN_REPOSITORY);
}
if (Objects.isNull(i.getRefereed())) {
i.setRefereed(qualifier("0000", "Unknown", ModelConstants.DNET_REVIEW_LEVELS));
}
if (Objects.nonNull(i.getDateofacceptance())) {
Optional<String> date = cleanDateField(i.getDateofacceptance());
if (date.isPresent()) {
i.getDateofacceptance().setValue(date.get());
} else {
i.setDateofacceptance(null);
}
}
}
}
if (Objects.isNull(r.getBestaccessright()) || StringUtils.isBlank(r.getBestaccessright().getClassid())) {
Qualifier bestaccessrights = OafMapperUtils.createBestAccessRights(r.getInstance());
if (Objects.isNull(bestaccessrights)) {
r
.setBestaccessright(
qualifier(
ModelConstants.UNKNOWN, ModelConstants.NOT_AVAILABLE,
ModelConstants.DNET_ACCESS_MODES));
} else {
r.setBestaccessright(bestaccessrights);
}
}
if (Objects.nonNull(r.getAuthor())) {
r
.setAuthor(
r
.getAuthor()
.stream()
.filter(Objects::nonNull)
.filter(a -> StringUtils.isNotBlank(a.getFullname()))
.filter(a -> StringUtils.isNotBlank(a.getFullname().replaceAll("[\\W]", "")))
.collect(Collectors.toList()));
boolean nullRank = r
.getAuthor()
.stream()
.anyMatch(a -> Objects.isNull(a.getRank()));
if (nullRank) {
int i = 1;
for (Author author : r.getAuthor()) {
author.setRank(i++);
}
}
for (Author a : r.getAuthor()) {
if (Objects.isNull(a.getPid())) {
a.setPid(Lists.newArrayList());
} else {
a
.setPid(
a
.getPid()
.stream()
.filter(Objects::nonNull)
.filter(p -> Objects.nonNull(p.getQualifier()))
.filter(p -> StringUtils.isNotBlank(p.getValue()))
.map(p -> {
// hack to distinguish orcid from orcid_pending
String pidProvenance = getProvenance(p.getDataInfo());
if (p
.getQualifier()
.getClassid()
.toLowerCase()
.contains(ModelConstants.ORCID)) {
if (pidProvenance
.equals(ModelConstants.SYSIMPORT_CROSSWALK_ENTITYREGISTRY)) {
p.getQualifier().setClassid(ModelConstants.ORCID);
} else {
p.getQualifier().setClassid(ModelConstants.ORCID_PENDING);
}
final String orcid = p
.getValue()
.trim()
.toLowerCase()
.replaceAll(ORCID_CLEANING_REGEX, "$1-$2-$3-$4");
if (orcid.length() == ORCID_LEN) {
p.setValue(orcid);
} else {
p.setValue("");
}
}
return p;
})
.filter(p -> StringUtils.isNotBlank(p.getValue()))
.collect(
Collectors
.toMap(
p -> p.getQualifier().getClassid() + p.getValue(),
Function.identity(),
(p1, p2) -> p1,
LinkedHashMap::new))
.values()
.stream()
.collect(Collectors.toList()));
}
}
}
if (value instanceof Publication) {
} else if (value instanceof Dataset) {
} else if (value instanceof OtherResearchProduct) {
} else if (value instanceof Software) {
}
}
return value;
}
private static Author cleanupAuthor(Author author) {
if (StringUtils.isNotBlank(author.getFullname())) {
author
.setFullname(
author
.getFullname()
.replaceAll(NAME_CLEANING_REGEX, " ")
.replace("\"", "\\\""));
}
if (StringUtils.isNotBlank(author.getName())) {
author
.setName(
author
.getName()
.replaceAll(NAME_CLEANING_REGEX, " ")
.replace("\"", "\\\""));
}
if (StringUtils.isNotBlank(author.getSurname())) {
author
.setSurname(
author
.getSurname()
.replaceAll(NAME_CLEANING_REGEX, " ")
.replace("\"", "\\\""));
}
return author;
}
private static Optional<String> cleanDateField(Field<String> dateofacceptance) {
return Optional
.ofNullable(dateofacceptance)
@ -496,7 +703,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.filter(sp -> !PID_BLACKLIST.contains(sp.getValue().trim().toLowerCase()))
.filter(sp -> Objects.nonNull(sp.getQualifier()))
.filter(sp -> StringUtils.isNotBlank(sp.getQualifier().getClassid()))
.map(CleaningFunctions::normalizePidValue)
.map(PidCleaner::normalizePidValue)
.filter(CleaningFunctions::pidFilter)
.collect(Collectors.toList());
}

View File

@ -0,0 +1,24 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class GridCleaningRule {
public static final Pattern PATTERN = Pattern.compile("(?<grid>\\d{4,6}\\.[0-9a-z]{1,2})");
public static String clean(String grid) {
String s = grid
.replaceAll("\\s", "")
.toLowerCase();
Matcher m = PATTERN.matcher(s);
if (m.find()) {
return "grid." + m.group("grid");
}
return "";
}
}

View File

@ -0,0 +1,21 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// https://www.wikidata.org/wiki/Property:P213
public class ISNICleaningRule {
public static final Pattern PATTERN = Pattern.compile("([0]{4}) ?([0-9]{4}) ?([0-9]{4}) ?([0-9]{3}[0-9X])");
public static String clean(final String isni) {
Matcher m = PATTERN.matcher(isni);
if (m.find()) {
return String.join("", m.group(1), m.group(2), m.group(3), m.group(4));
} else {
return "";
}
}
}

View File

@ -0,0 +1,21 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class PICCleaningRule {
public static final Pattern PATTERN = Pattern.compile("\\d{9}");
public static String clean(final String pic) {
Matcher m = PATTERN.matcher(pic);
if (m.find()) {
return m.group();
} else {
return "";
}
}
}

View File

@ -0,0 +1,62 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.Optional;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
public class PidCleaner {
/**
* Utility method that normalises PID values on a per-type basis.
* @param pid the PID whose value will be normalised.
* @return the PID containing the normalised value.
*/
public static StructuredProperty normalizePidValue(StructuredProperty pid) {
pid
.setValue(
normalizePidValue(
pid.getQualifier().getClassid(),
pid.getValue()));
return pid;
}
public static String normalizePidValue(String pidType, String pidValue) {
String value = Optional
.ofNullable(pidValue)
.map(String::trim)
.orElseThrow(() -> new IllegalArgumentException("PID value cannot be empty"));
switch (pidType) {
// TODO add cleaning for more PID types as needed
// Result
case "doi":
return DoiCleaningRule.clean(value);
case "pmid":
return PmidCleaningRule.clean(value);
case "pmc":
return PmcCleaningRule.clean(value);
case "handle":
case "arXiv":
return value;
// Organization
case "GRID":
return GridCleaningRule.clean(value);
case "ISNI":
return ISNICleaningRule.clean(value);
case "ROR":
return RorCleaningRule.clean(value);
case "PIC":
return PICCleaningRule.clean(value);
case "FundRef":
return FundRefCleaningRule.clean(value);
default:
return value;
}
}
}
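
A few dispatch examples for the switch above, with hypothetical input values; the expected outputs follow from the per-type rules in this package.

PidCleaner.normalizePidValue("doi", "doi:10.1000/ABC"); // -> "10.1000/abc"
PidCleaner.normalizePidValue("pmid", "0123x4567");      // -> "123" (leading zeros and trailing garbage dropped)
PidCleaner.normalizePidValue("GRID", "grid.493784.5");  // -> "grid.493784.5"
PidCleaner.normalizePidValue("arXiv", "2103.01234");    // returned unchanged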

View File

@ -0,0 +1,24 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class PmcCleaningRule {
public static final Pattern PATTERN = Pattern.compile("PMC\\d{1,8}");
public static String clean(String pmc) {
String s = pmc
.replaceAll("\\s", "")
.toUpperCase();
final Matcher m = PATTERN.matcher(s);
if (m.find()) {
return m.group();
}
return "";
}
}

View File

@ -0,0 +1,25 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// https://researchguides.stevens.edu/c.php?g=442331&p=6577176
public class PmidCleaningRule {
public static final Pattern PATTERN = Pattern.compile("0*(\\d{1,8})");
public static String clean(String pmid) {
String s = pmid
.toLowerCase()
.replaceAll("\\s", "");
final Matcher m = PATTERN.matcher(s);
if (m.find()) {
return m.group(1);
}
return "";
}
}

View File

@ -0,0 +1,27 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// https://ror.readme.io/docs/ror-identifier-pattern
public class RorCleaningRule {
public static final String ROR_PREFIX = "https://ror.org/";
private static final Pattern PATTERN = Pattern.compile("(?<ror>0[a-hj-km-np-tv-z|0-9]{6}[0-9]{2})");
public static String clean(String ror) {
String s = ror
.replaceAll("\\s", "")
.toLowerCase();
Matcher m = PATTERN.matcher(s);
if (m.find()) {
return ROR_PREFIX + m.group("ror");
}
return "";
}
}

View File

@ -18,9 +18,9 @@
"paramRequired": true
},
{
"paramName": "c",
"paramLongName": "graphTableClassName",
"paramDescription": "the graph entity class name",
"paramName": "fi",
"paramLongName": "filterInvisible",
"paramDescription": "if true filters out invisible entities",
"paramRequired": true
}
]

View File

@ -0,0 +1,10 @@
package eu.dnetlib.dhp.application.dedup.log
case class DedupLogModel(
tag: String,
configuration: String,
entity: String,
startTS: Long,
endTS: Long,
totalMs: Long
) {}

View File

@ -0,0 +1,14 @@
package eu.dnetlib.dhp.application.dedup.log
import org.apache.spark.sql.{SaveMode, SparkSession}
class DedupLogWriter(path: String) {
def appendLog(dedupLogModel: DedupLogModel, spark: SparkSession): Unit = {
import spark.implicits._
val df = spark.createDataset[DedupLogModel](data = List(dedupLogModel))
df.write.mode(SaveMode.Append).save(path)
}
}
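
A hypothetical Java-side usage of the two Scala helpers above, timing a single dedup stage; the tag, configuration string, and output path are illustrative.

long start = System.currentTimeMillis();
// ... run the dedup stage for one entity type ...
long end = System.currentTimeMillis();
DedupLogModel model = new DedupLogModel("openorgs", dedupConfJson, "organization", start, end, end - start);
new DedupLogWriter("/tmp/dedup_log").appendLog(model, spark); // spark: an active SparkSession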

View File

@ -0,0 +1,36 @@
package eu.dnetlib.dhp.common;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import org.junit.jupiter.api.Test;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
public class MdStoreClientTest {
// @Test
public void testMongoCollection() throws IOException {
final MdstoreClient client = new MdstoreClient("mongodb://localhost:27017", "mdstore");
final ObjectMapper mapper = new ObjectMapper();
final List<MDStoreInfo> infos = client.mdStoreWithTimestamp("ODF", "store", "cleaned");
infos.forEach(System.out::println);
final String s = mapper.writeValueAsString(infos);
Path fileName = Paths.get("/Users/sandro/mdstore_info.json");
// Writing into the file
Files.write(fileName, s.getBytes(StandardCharsets.UTF_8));
}
}

View File

@ -33,7 +33,7 @@ class ZenodoAPIClientTest {
InputStream is = new FileInputStream(file);
Assertions.assertEquals(200, client.uploadIS(is, "COVID-19.json.gz", file.length()));
Assertions.assertEquals(200, client.uploadIS(is, "COVID-19.json.gz"));
String metadata = IOUtils.toString(getClass().getResourceAsStream("/eu/dnetlib/dhp/common/api/metadata.json"));
@ -56,7 +56,7 @@ class ZenodoAPIClientTest {
InputStream is = new FileInputStream(file);
Assertions.assertEquals(200, client.uploadIS(is, "COVID-19.json.gz", file.length()));
Assertions.assertEquals(200, client.uploadIS(is, "COVID-19.json.gz"));
String metadata = IOUtils.toString(getClass().getResourceAsStream("/eu/dnetlib/dhp/common/api/metadata.json"));
@ -80,7 +80,7 @@ class ZenodoAPIClientTest {
InputStream is = new FileInputStream(file);
Assertions.assertEquals(200, client.uploadIS(is, "newVersion_deposition", file.length()));
Assertions.assertEquals(200, client.uploadIS(is, "newVersion_deposition"));
Assertions.assertEquals(202, client.publish());
@ -100,7 +100,7 @@ class ZenodoAPIClientTest {
InputStream is = new FileInputStream(file);
Assertions.assertEquals(200, client.uploadIS(is, "newVersion_deposition", file.length()));
Assertions.assertEquals(200, client.uploadIS(is, "newVersion_deposition"));
Assertions.assertEquals(202, client.publish());

View File

@ -1,100 +0,0 @@
package eu.dnetlib.dhp.oa.merge;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.schema.oaf.Author;
import eu.dnetlib.dhp.schema.oaf.Publication;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
import eu.dnetlib.pace.util.MapDocumentUtil;
import scala.Tuple2;
class AuthorMergerTest {
private String publicationsBasePath;
private List<List<Author>> authors;
@BeforeEach
public void setUp() throws Exception {
publicationsBasePath = Paths
.get(AuthorMergerTest.class.getResource("/eu/dnetlib/dhp/oa/merge").toURI())
.toFile()
.getAbsolutePath();
authors = readSample(publicationsBasePath + "/publications_with_authors.json", Publication.class)
.stream()
.map(p -> p._2().getAuthor())
.collect(Collectors.toList());
}
@Test
void mergeTest() { // used in the dedup: threshold set to 0.95
for (List<Author> authors1 : authors) {
System.out.println("List " + (authors.indexOf(authors1) + 1));
for (Author author : authors1) {
System.out.println(authorToString(author));
}
}
List<Author> merge = AuthorMerger.merge(authors);
System.out.println("Merge ");
for (Author author : merge) {
System.out.println(authorToString(author));
}
Assertions.assertEquals(7, merge.size());
}
public <T> List<Tuple2<String, T>> readSample(String path, Class<T> clazz) {
List<Tuple2<String, T>> res = new ArrayList<>();
BufferedReader reader;
try {
reader = new BufferedReader(new FileReader(path));
String line = reader.readLine();
while (line != null) {
res
.add(
new Tuple2<>(
MapDocumentUtil.getJPathString("$.id", line),
new ObjectMapper().readValue(line, clazz)));
// read next line
line = reader.readLine();
}
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
return res;
}
public String authorToString(Author a) {
String print = "Fullname = ";
print += a.getFullname() + " pid = [";
if (a.getPid() != null)
for (StructuredProperty sp : a.getPid()) {
print += sp.toComparableString() + " ";
}
print += "]";
return print;
}
}

View File

@ -0,0 +1,18 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class GridCleaningRuleTest {
@Test
void testCleaning() {
assertEquals("grid.493784.5", GridCleaningRule.clean("grid.493784.5"));
assertEquals("grid.493784.5x", GridCleaningRule.clean("grid.493784.5x"));
assertEquals("grid.493784.5x", GridCleaningRule.clean("493784.5x"));
assertEquals("", GridCleaningRule.clean("493x784.5x"));
}
}

View File

@ -0,0 +1,19 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class ISNICleaningRuleTest {
@Test
void testCleaning() {
assertEquals("0000000463436020", ISNICleaningRule.clean("0000 0004 6343 6020"));
assertEquals("0000000463436020", ISNICleaningRule.clean("0000000463436020"));
assertEquals("", ISNICleaningRule.clean("Q30256598"));
assertEquals("0000000493403529", ISNICleaningRule.clean("ISNI:0000000493403529"));
assertEquals("000000008614884X", ISNICleaningRule.clean("0000 0000 8614 884X"));
}
}

View File

@ -0,0 +1,19 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class PICCleaningRuleTest {
@Test
void testCleaning() {
assertEquals("887624982", PICCleaningRule.clean("887624982"));
assertEquals("", PICCleaningRule.clean("887 624982"));
assertEquals("887624982", PICCleaningRule.clean(" 887624982 "));
assertEquals("887624982", PICCleaningRule.clean(" 887624982x "));
assertEquals("887624982", PICCleaningRule.clean(" 88762498200 "));
}
}

View File

@ -0,0 +1,19 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class PmcCleaningRuleTest {
@Test
void testCleaning() {
assertEquals("PMC1234", PmcCleaningRule.clean("PMC1234"));
assertEquals("PMC1234", PmcCleaningRule.clean(" PMC1234"));
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC12345678"));
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC123456789"));
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC 12345678"));
}
}

View File

@ -0,0 +1,24 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class PmidCleaningRuleTest {
@Test
void testCleaning() {
// leading zeros are removed
assertEquals("1234", PmidCleaningRule.clean("01234"));
// tolerant to spaces in the middle
assertEquals("1234567", PmidCleaningRule.clean("0123 4567"));
// parsing stops at the first non-numeric character
assertEquals("123", PmidCleaningRule.clean("0123x4567"));
// invalid id leading to empty result
assertEquals("", PmidCleaningRule.clean("abc"));
// valid id with zeroes in the number
assertEquals("20794075", PmidCleaningRule.clean("20794075"));
}
}

View File

@ -0,0 +1,17 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class RorCleaningRuleTest {
@Test
void testCleaning() {
assertEquals("https://ror.org/05rpz9w55", RorCleaningRule.clean("https://ror.org/05rpz9w55"));
assertEquals("https://ror.org/05rpz9w55", RorCleaningRule.clean("05rpz9w55"));
assertEquals("", RorCleaningRule.clean("05rpz9w_55"));
}
}

File diff suppressed because one or more lines are too long

dhp-pace-core/pom.xml Normal file
View File

@ -0,0 +1,110 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp</artifactId>
<version>1.2.5-SNAPSHOT</version>
<relativePath>../pom.xml</relativePath>
</parent>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-pace-core</artifactId>
<version>1.2.5-SNAPSHOT</version>
<packaging>jar</packaging>
<build>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>${net.alchim31.maven.version}</version>
<executions>
<execution>
<id>scala-compile-first</id>
<phase>initialize</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>scala-test-compile</id>
<phase>process-test-resources</phase>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<failOnMultipleScalaVersions>true</failOnMultipleScalaVersions>
<scalaCompatVersion>${scala.binary.version}</scalaCompatVersion>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>edu.cmu</groupId>
<artifactId>secondstring</artifactId>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
</dependency>
<dependency>
<groupId>org.antlr</groupId>
<artifactId>stringtemplate</artifactId>
</dependency>
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>
</dependency>
<dependency>
<groupId>org.reflections</groupId>
<artifactId>reflections</artifactId>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-math3</artifactId>
</dependency>
<dependency>
<groupId>com.jayway.jsonpath</groupId>
<artifactId>json-path</artifactId>
</dependency>
<dependency>
<groupId>com.ibm.icu</groupId>
<artifactId>icu4j</artifactId>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
</dependency>
</dependencies>
</project>

View File

@ -0,0 +1,46 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.commons.lang3.StringUtils;
import eu.dnetlib.pace.common.AbstractPaceFunctions;
import eu.dnetlib.pace.config.Config;
public abstract class AbstractClusteringFunction extends AbstractPaceFunctions implements ClusteringFunction {
protected Map<String, Integer> params;
public AbstractClusteringFunction(final Map<String, Integer> params) {
this.params = params;
}
protected abstract Collection<String> doApply(Config conf, String s);
@Override
public Collection<String> apply(Config conf, List<String> fields) {
return fields
.stream()
.filter(f -> !f.isEmpty())
.map(this::normalize)
.map(s -> filterAllStopWords(s))
.map(s -> doApply(conf, s))
.map(c -> filterBlacklisted(c, ngramBlacklist))
.flatMap(c -> c.stream())
.filter(StringUtils::isNotBlank)
.collect(Collectors.toCollection(HashSet::new));
}
public Map<String, Integer> getParams() {
return params;
}
protected Integer param(String name) {
return params.get(name);
}
}
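
To illustrate the extension point, a hypothetical clustering function built on the abstract class above; the "firstchars" name, the class and its "len" parameter are invented for this example.

package eu.dnetlib.pace.clustering;

import java.util.Collection;
import java.util.Map;

import com.google.common.collect.Lists;

import eu.dnetlib.pace.config.Config;

// Hypothetical example: cluster records on the first 'len' characters of the normalized value.
@ClusteringClass("firstchars")
public class FirstChars extends AbstractClusteringFunction {

	public FirstChars(final Map<String, Integer> params) {
		super(params);
	}

	@Override
	protected Collection<String> doApply(final Config conf, final String s) {
		final int len = Math.min(param("len"), s.length());
		return Lists.newArrayList(s.substring(0, len));
	}
}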

View File

@@ -0,0 +1,51 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import com.google.common.collect.Sets;
import eu.dnetlib.pace.config.Config;
@ClusteringClass("acronyms")
public class Acronyms extends AbstractClusteringFunction {
public Acronyms(Map<String, Integer> params) {
super(params);
}
@Override
protected Collection<String> doApply(Config conf, String s) {
return extractAcronyms(s, param("max"), param("minLen"), param("maxLen"));
}
private Set<String> extractAcronyms(final String s, int maxAcronyms, int minLen, int maxLen) {
final Set<String> acronyms = Sets.newLinkedHashSet();
for (int i = 0; i < maxAcronyms; i++) {
final StringTokenizer st = new StringTokenizer(s);
final StringBuilder sb = new StringBuilder();
while (st.hasMoreTokens()) {
final String token = st.nextToken();
if (sb.length() > maxLen) {
break;
}
if (token.length() > 1 && i < token.length()) {
sb.append(token.charAt(i));
}
}
String acronym = sb.toString();
if (acronym.length() > minLen) {
acronyms.add(acronym);
}
}
return acronyms;
}
}
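
Traced by hand on an assumed input with max=2, minLen=1, maxLen=4: extractAcronyms("open access infrastructure research", 2, 1, 4) builds one candidate per character position, collecting the first letters into "oair" at i=0 and the second letters into "pcne" at i=1, so the function returns {"oair", "pcne"}.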

View File

@@ -0,0 +1,14 @@
package eu.dnetlib.pace.clustering;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
public @interface ClusteringClass {
public String value();
}

View File

@@ -0,0 +1,16 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
public interface ClusteringFunction {
public Collection<String> apply(Config config, List<String> fields);
public Map<String, Integer> getParams();
}

View File

@@ -0,0 +1,28 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import com.google.common.collect.Lists;
import eu.dnetlib.pace.config.Config;
@ClusteringClass("immutablefieldvalue")
public class ImmutableFieldValue extends AbstractClusteringFunction {
public ImmutableFieldValue(final Map<String, Integer> params) {
super(params);
}
@Override
protected Collection<String> doApply(final Config conf, final String s) {
final List<String> res = Lists.newArrayList();
res.add(s);
return res;
}
}

View File

@@ -0,0 +1,54 @@
package eu.dnetlib.pace.clustering;
import java.util.*;
import java.util.stream.Collectors;
import org.apache.commons.lang3.StringUtils;
import eu.dnetlib.pace.config.Config;
@ClusteringClass("keywordsclustering")
public class KeywordsClustering extends AbstractClusteringFunction {
public KeywordsClustering(Map<String, Integer> params) {
super(params);
}
@Override
protected Collection<String> doApply(final Config conf, String s) {
// takes the city codes and keyword codes, without duplicates
Set<String> keywords = getKeywords(s, conf.translationMap(), params.getOrDefault("windowSize", 4));
Set<String> cities = getCities(s, params.getOrDefault("windowSize", 4));
// list of combinations to return as the result
final Collection<String> combinations = new LinkedHashSet<String>();
for (String keyword : keywordsToCodes(keywords, conf.translationMap())) {
for (String city : citiesToCodes(cities)) {
combinations.add(keyword + "-" + city);
if (combinations.size() >= params.getOrDefault("max", 2)) {
return combinations;
}
}
}
return combinations;
}
@Override
public Collection<String> apply(final Config conf, List<String> fields) {
return fields
.stream()
.filter(f -> !f.isEmpty())
.map(this::cleanup)
.map(this::normalize)
.map(s -> filterAllStopWords(s))
.map(s -> doApply(conf, s))
.map(c -> filterBlacklisted(c, ngramBlacklist))
.flatMap(c -> c.stream())
.filter(StringUtils::isNotBlank)
.collect(Collectors.toCollection(HashSet::new));
}
}
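
As a worked example with hypothetical map entries: if the translation map sends "university" to the code key::001 and the city map sends "berlin" to city::002, then a field such as "university of berlin" produces the single combination "key::001-city::002"; at most max combinations (default 2) are emitted per field.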

View File

@@ -0,0 +1,79 @@
package eu.dnetlib.pace.clustering;
import java.util.*;
import java.util.stream.Collectors;
import org.apache.commons.lang3.StringUtils;
import com.google.common.collect.Lists;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.model.Person;
@ClusteringClass("lnfi")
public class LastNameFirstInitial extends AbstractClusteringFunction {
private boolean DEFAULT_AGGRESSIVE = true;
public LastNameFirstInitial(final Map<String, Integer> params) {
super(params);
}
@Override
public Collection<String> apply(Config conf, List<String> fields) {
return fields
.stream()
.filter(f -> !f.isEmpty())
.map(this::normalize)
.map(s -> doApply(conf, s))
.map(c -> filterBlacklisted(c, ngramBlacklist))
.flatMap(c -> c.stream())
.filter(StringUtils::isNotBlank)
.collect(Collectors.toCollection(HashSet::new));
}
@Override
protected String normalize(final String s) {
return fixAliases(transliterate(nfd(unicodeNormalization(s))))
// do not compact the regexes into a single expression: it would cause a StackOverflowError on large input
// strings
.replaceAll("[^ \\w]+", "")
.replaceAll("(\\p{InCombiningDiacriticalMarks})+", "")
.replaceAll("(\\p{Punct})+", " ")
.replaceAll("(\\d)+", " ")
.replaceAll("(\\n)+", " ")
.trim();
}
@Override
protected Collection<String> doApply(final Config conf, final String s) {
final List<String> res = Lists.newArrayList();
final boolean aggressive = (Boolean) (getParams().containsKey("aggressive") ? getParams().get("aggressive")
: DEFAULT_AGGRESSIVE);
Person p = new Person(s, aggressive);
if (p.isAccurate()) {
String lastName = p.getNormalisedSurname().toLowerCase();
String firstInitial = p.getNormalisedFirstName().toLowerCase().substring(0, 1);
res.add(firstInitial.concat(lastName));
} else { // not accurate, meaning no distinct name and surname could be derived
List<String> fullname = Arrays.asList(p.getNormalisedFullname().split(" "));
if (fullname.size() == 1) {
res.add(p.getNormalisedFullname().toLowerCase());
} else if (fullname.size() == 2) {
res.add(fullname.get(0).substring(0, 1).concat(fullname.get(1)).toLowerCase());
res.add(fullname.get(1).substring(0, 1).concat(fullname.get(0)).toLowerCase());
} else {
res.add(fullname.get(0).substring(0, 1).concat(fullname.get(fullname.size() - 1)).toLowerCase());
res.add(fullname.get(fullname.size() - 1).substring(0, 1).concat(fullname.get(0)).toLowerCase());
}
}
return res;
}
}
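
Two assumed inputs illustrate the branches above: "Artini, Michele" parses as an accurate person and clusters to "martini" (first initial plus surname), while "john smith" is not accurate and falls into the two-token branch, yielding both "jsmith" and "sjohn".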

View File

@@ -0,0 +1,38 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import org.apache.commons.lang3.StringUtils;
import com.google.common.collect.Lists;
import com.google.common.collect.Sets;
import eu.dnetlib.pace.config.Config;
@ClusteringClass("lowercase")
public class LowercaseClustering extends AbstractClusteringFunction {
public LowercaseClustering(final Map<String, Integer> params) {
super(params);
}
@Override
public Collection<String> apply(Config conf, List<String> fields) {
Collection<String> c = Sets.newLinkedHashSet();
for (String f : fields) {
c.addAll(doApply(conf, f));
}
return c;
}
@Override
protected Collection<String> doApply(final Config conf, final String s) {
if (StringUtils.isBlank(s)) {
return Lists.newArrayList();
}
return Lists.newArrayList(s.toLowerCase().trim());
}
}

View File

@@ -0,0 +1,24 @@
package eu.dnetlib.pace.clustering;
import java.util.Set;
import org.apache.commons.lang3.StringUtils;
import eu.dnetlib.pace.common.AbstractPaceFunctions;
public class NGramUtils extends AbstractPaceFunctions {
static private final NGramUtils NGRAMUTILS = new NGramUtils();
private static final int SIZE = 100;
private static final Set<String> stopwords = AbstractPaceFunctions
.loadFromClasspath("/eu/dnetlib/pace/config/stopwords_en.txt");
public static String cleanupForOrdering(String s) {
return (NGRAMUTILS.filterStopWords(NGRAMUTILS.normalize(s), stopwords) + StringUtils.repeat(" ", SIZE))
.substring(0, SIZE)
.replaceAll(" ", "");
}
}

View File

@@ -0,0 +1,41 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import com.google.common.collect.Lists;
import eu.dnetlib.pace.config.Config;
@ClusteringClass("ngrampairs")
public class NgramPairs extends Ngrams {
public NgramPairs(Map<String, Integer> params) {
super(params, false);
}
public NgramPairs(Map<String, Integer> params, boolean sorted) {
super(params, sorted);
}
@Override
protected Collection<String> doApply(Config conf, String s) {
return ngramPairs(Lists.newArrayList(getNgrams(s, param("ngramLen"), param("max") * 2, 1, 2)), param("max"));
}
protected Collection<String> ngramPairs(final List<String> ngrams, int maxNgrams) {
Collection<String> res = Lists.newArrayList();
int j = 0;
for (int i = 0; i < ngrams.size() && res.size() < maxNgrams; i++) {
if (++j >= ngrams.size()) {
break;
}
res.add(ngrams.get(i) + ngrams.get(j));
// System.out.println("-- " + concatNgrams);
}
return res;
}
}
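
With assumed parameters ngramLen=3 and max=4, the value "search engine optimization" first yields the ngrams "sea", "eng" and "opt" (at most one per token), which ngramPairs then concatenates pairwise into the keys "seaeng" and "engopt".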

View File

@@ -0,0 +1,52 @@
package eu.dnetlib.pace.clustering;
import java.util.*;
import eu.dnetlib.pace.config.Config;
@ClusteringClass("ngrams")
public class Ngrams extends AbstractClusteringFunction {
private final boolean sorted;
public Ngrams(Map<String, Integer> params) {
this(params, false);
}
public Ngrams(Map<String, Integer> params, boolean sorted) {
super(params);
this.sorted = sorted;
}
@Override
protected Collection<String> doApply(Config conf, String s) {
return getNgrams(s, param("ngramLen"), param("max"), param("maxPerToken"), param("minNgramLen"));
}
protected Collection<String> getNgrams(String s, int ngramLen, int max, int maxPerToken, int minNgramLen) {
final Collection<String> ngrams = sorted ? new TreeSet<>() : new LinkedHashSet<String>();
final StringTokenizer st = new StringTokenizer(s);
while (st.hasMoreTokens()) {
final String token = st.nextToken();
if (!token.isEmpty()) {
for (int i = 0; i < maxPerToken && ngramLen + i <= token.length(); i++) {
String ngram = token.substring(i, Math.min(ngramLen + i, token.length())).trim();
if (ngram.length() >= minNgramLen) {
ngrams.add(ngram);
if (ngrams.size() >= max) {
return ngrams;
}
}
}
}
}
// System.out.println(ngrams + " n: " + ngrams.size());
return ngrams;
}
}

View File

@@ -0,0 +1,84 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import org.apache.commons.lang3.StringUtils;
import com.google.common.collect.Sets;
import eu.dnetlib.pace.common.AbstractPaceFunctions;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.model.Person;
@ClusteringClass("personClustering")
public class PersonClustering extends AbstractPaceFunctions implements ClusteringFunction {
private Map<String, Integer> params;
private static final int MAX_TOKENS = 5;
public PersonClustering(final Map<String, Integer> params) {
this.params = params;
}
@Override
public Collection<String> apply(final Config conf, final List<String> fields) {
final Set<String> hashes = Sets.newHashSet();
for (final String f : fields) {
final Person person = new Person(f, false);
if (StringUtils.isNotBlank(person.getNormalisedFirstName())
&& StringUtils.isNotBlank(person.getNormalisedSurname())) {
hashes.add(firstLC(person.getNormalisedFirstName()) + person.getNormalisedSurname().toLowerCase());
} else {
for (final String token1 : tokens(f, MAX_TOKENS)) {
for (final String token2 : tokens(f, MAX_TOKENS)) {
if (!token1.equals(token2)) {
hashes.add(firstLC(token1) + token2);
}
}
}
}
}
return hashes;
}
// @Override
// public Collection<String> apply(final List<Field> fields) {
// final Set<String> hashes = Sets.newHashSet();
//
// for (final Field f : fields) {
//
// final GTAuthor gta = GTAuthor.fromOafJson(f.stringValue());
//
// final Author a = gta.getAuthor();
//
// if (StringUtils.isNotBlank(a.getFirstname()) && StringUtils.isNotBlank(a.getSecondnames())) {
// hashes.add(firstLC(a.getFirstname()) + a.getSecondnames().toLowerCase());
// } else {
// for (final String token1 : tokens(f.stringValue(), MAX_TOKENS)) {
// for (final String token2 : tokens(f.stringValue(), MAX_TOKENS)) {
// if (!token1.equals(token2)) {
// hashes.add(firstLC(token1) + token2);
// }
// }
// }
// }
// }
//
// return hashes;
// }
@Override
public Map<String, Integer> getParams() {
return params;
}
}

View File

@@ -0,0 +1,34 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import com.google.common.collect.Lists;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.model.Person;
@ClusteringClass("personHash")
public class PersonHash extends AbstractClusteringFunction {
private boolean DEFAULT_AGGRESSIVE = false;
public PersonHash(final Map<String, Integer> params) {
super(params);
}
@Override
protected Collection<String> doApply(final Config conf, final String s) {
final List<String> res = Lists.newArrayList();
final boolean aggressive = (Boolean) (getParams().containsKey("aggressive") ? getParams().get("aggressive")
: DEFAULT_AGGRESSIVE);
res.add(new Person(s, aggressive).hash());
return res;
}
}

View File

@@ -0,0 +1,20 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
public class RandomClusteringFunction extends AbstractClusteringFunction {
public RandomClusteringFunction(Map<String, Integer> params) {
super(params);
}
@Override
protected Collection<String> doApply(final Config conf, String s) {
return null;
}
}

View File

@@ -0,0 +1,31 @@
package eu.dnetlib.pace.clustering;
import java.util.*;
import com.google.common.base.Joiner;
import com.google.common.base.Splitter;
import com.google.common.collect.Lists;
import eu.dnetlib.pace.config.Config;
@ClusteringClass("sortedngrampairs")
public class SortedNgramPairs extends NgramPairs {
public SortedNgramPairs(Map<String, Integer> params) {
super(params, false);
}
@Override
protected Collection<String> doApply(Config conf, String s) {
final List<String> tokens = Lists.newArrayList(Splitter.on(" ").omitEmptyStrings().trimResults().split(s));
Collections.sort(tokens);
return ngramPairs(
Lists.newArrayList(getNgrams(Joiner.on(" ").join(tokens), param("ngramLen"), param("max") * 2, 1, 2)),
param("max"));
}
}
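
Sorting the tokens first makes the keys order-insensitive: "Michele Artini" and "Artini Michele" are both rebuilt as "Artini Michele" before ngram extraction, so the two variants produce identical clustering keys.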

View File

@@ -0,0 +1,34 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import org.apache.commons.lang3.RandomStringUtils;
import org.apache.commons.lang3.StringUtils;
import com.google.common.collect.Lists;
import eu.dnetlib.pace.config.Config;
@ClusteringClass("spacetrimmingfieldvalue")
public class SpaceTrimmingFieldValue extends AbstractClusteringFunction {
public SpaceTrimmingFieldValue(final Map<String, Integer> params) {
super(params);
}
@Override
protected Collection<String> doApply(final Config conf, final String s) {
final List<String> res = Lists.newArrayList();
res
.add(
StringUtils.isBlank(s) ? RandomStringUtils.random(getParams().get("randomLength"))
: s.toLowerCase().replaceAll("\\s+", ""));
return res;
}
}

View File

@@ -0,0 +1,42 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import com.google.common.collect.Sets;
import eu.dnetlib.pace.config.Config;
@ClusteringClass("suffixprefix")
public class SuffixPrefix extends AbstractClusteringFunction {
public SuffixPrefix(Map<String, Integer> params) {
super(params);
}
@Override
protected Collection<String> doApply(Config conf, String s) {
return suffixPrefix(s, param("len"), param("max"));
}
private Collection<String> suffixPrefix(String s, int len, int max) {
final Set<String> bigrams = Sets.newLinkedHashSet();
int i = 0;
while (++i < s.length() && bigrams.size() < max) {
int j = s.indexOf(" ", i);
int offset = j + len + 1 < s.length() ? j + len + 1 : s.length();
if (j - len > 0) {
String bigram = s.substring(j - len, offset).replaceAll(" ", "").trim();
if (bigram.length() >= 4) {
bigrams.add(bigram);
}
}
}
return bigrams;
}
}

View File

@@ -0,0 +1,52 @@
package eu.dnetlib.pace.clustering;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import eu.dnetlib.pace.common.AbstractPaceFunctions;
import eu.dnetlib.pace.config.Config;
@ClusteringClass("urlclustering")
public class UrlClustering extends AbstractPaceFunctions implements ClusteringFunction {
protected Map<String, Integer> params;
public UrlClustering(final Map<String, Integer> params) {
this.params = params;
}
@Override
public Collection<String> apply(final Config conf, List<String> fields) {
try {
return fields
.stream()
.filter(f -> !f.isEmpty())
.map(this::asUrl)
.map(URL::getHost)
.collect(Collectors.toCollection(HashSet::new));
} catch (IllegalStateException e) {
return new HashSet<>();
}
}
@Override
public Map<String, Integer> getParams() {
return null;
}
private URL asUrl(String value) {
try {
return new URL(value);
} catch (MalformedURLException e) {
// should not happen as checked by pace typing
throw new IllegalStateException("invalid URL: " + value);
}
}
}
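
A short usage sketch of the function above; since the conf argument is not used by this implementation, the example passes null (an assumption worth noting).

import java.util.Arrays;
import java.util.HashMap;

import eu.dnetlib.pace.clustering.UrlClustering;

class UrlClusteringDemo {
	public static void main(String[] args) {
		final UrlClustering f = new UrlClustering(new HashMap<>());
		// one key per host, e.g. [www.example.org, example.com] (set order not guaranteed)
		System.out.println(f.apply(null, Arrays.asList("https://www.example.org/article/1", "http://example.com/a")));
	}
}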

View File

@@ -0,0 +1,91 @@
package eu.dnetlib.pace.clustering;
import java.util.*;
import java.util.stream.Collectors;
import com.google.common.collect.Sets;
import eu.dnetlib.pace.config.Config;
@ClusteringClass("wordsStatsSuffixPrefixChain")
public class WordsStatsSuffixPrefixChain extends AbstractClusteringFunction {
public WordsStatsSuffixPrefixChain(Map<String, Integer> params) {
super(params);
}
@Override
protected Collection<String> doApply(Config conf, String s) {
return suffixPrefixChain(s, param("mod"));
}
private Collection<String> suffixPrefixChain(String s, int mod) {
// create the list of words from the string (remove short words)
List<String> wordsList = Arrays
.stream(s.split(" "))
.filter(si -> si.length() > 3)
.collect(Collectors.toList());
final int words = wordsList.size();
final int letters = s.length();
// create the prefix: number of words + number of letters/mod
String prefix = words + "-" + letters / mod + "-";
return doSuffixPrefixChain(wordsList, prefix);
}
private Collection<String> doSuffixPrefixChain(List<String> wordsList, String prefix) {
Set<String> set = Sets.newLinkedHashSet();
switch (wordsList.size()) {
case 0:
case 1:
break;
case 2:
set
.add(
prefix +
suffix(wordsList.get(0), 3) +
prefix(wordsList.get(1), 3));
set
.add(
prefix +
prefix(wordsList.get(0), 3) +
suffix(wordsList.get(1), 3));
break;
default:
set
.add(
prefix +
suffix(wordsList.get(0), 3) +
prefix(wordsList.get(1), 3) +
suffix(wordsList.get(2), 3));
set
.add(
prefix +
prefix(wordsList.get(0), 3) +
suffix(wordsList.get(1), 3) +
prefix(wordsList.get(2), 3));
break;
}
return set;
}
private String suffix(String s, int len) {
return s.substring(s.length() - len);
}
private String prefix(String s, int len) {
return s.substring(0, len);
}
}
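
Traced on an assumed input with mod=10: "search engine optimization techniques" has 4 words longer than 3 characters and 37 letters overall, so the prefix is "4-3-" and the default branch emits the two keys "4-3-rchengion" (suffix-prefix-suffix) and "4-3-seaineopt" (prefix-suffix-prefix).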

View File

@@ -0,0 +1,59 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import com.google.common.collect.Sets;
import eu.dnetlib.pace.config.Config;
@ClusteringClass("wordssuffixprefix")
public class WordsSuffixPrefix extends AbstractClusteringFunction {
public WordsSuffixPrefix(Map<String, Integer> params) {
super(params);
}
@Override
protected Collection<String> doApply(Config conf, String s) {
return suffixPrefix(s, param("len"), param("max"));
}
private Collection<String> suffixPrefix(String s, int len, int max) {
final int words = s.split(" ").length;
// adjust the token length according to the number of words
switch (words) {
case 1:
return Sets.newLinkedHashSet();
case 2:
return doSuffixPrefix(s, len + 2, max, words);
case 3:
return doSuffixPrefix(s, len + 1, max, words);
default:
return doSuffixPrefix(s, len, max, words);
}
}
private Collection<String> doSuffixPrefix(String s, int len, int max, int words) {
final Set<String> bigrams = Sets.newLinkedHashSet();
int i = 0;
while (++i < s.length() && bigrams.size() < max) {
int j = s.indexOf(" ", i);
int offset = j + len + 1 < s.length() ? j + len + 1 : s.length();
if (j - len > 0) {
String bigram = s.substring(j - len, offset).replaceAll(" ", "").trim();
if (bigram.length() >= 4) {
bigrams.add(words + bigram);
}
}
}
return bigrams;
}
}

View File

@@ -0,0 +1,357 @@
package eu.dnetlib.pace.common;
import java.io.IOException;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import com.google.common.base.Joiner;
import com.google.common.base.Splitter;
import com.google.common.collect.Iterables;
import com.google.common.collect.Lists;
import com.google.common.collect.Sets;
import com.ibm.icu.text.Transliterator;
import eu.dnetlib.pace.clustering.NGramUtils;
/**
* Set of common functions for the framework
*
* @author claudio
*/
public abstract class AbstractPaceFunctions {
// city map to be used when translating the city names into codes
private static Map<String, String> cityMap = AbstractPaceFunctions
.loadMapFromClasspath("/eu/dnetlib/pace/config/city_map.csv");
// list of stopwords in different languages
protected static Set<String> stopwords_gr = loadFromClasspath("/eu/dnetlib/pace/config/stopwords_gr.txt");
protected static Set<String> stopwords_en = loadFromClasspath("/eu/dnetlib/pace/config/stopwords_en.txt");
protected static Set<String> stopwords_de = loadFromClasspath("/eu/dnetlib/pace/config/stopwords_de.txt");
protected static Set<String> stopwords_es = loadFromClasspath("/eu/dnetlib/pace/config/stopwords_es.txt");
protected static Set<String> stopwords_fr = loadFromClasspath("/eu/dnetlib/pace/config/stopwords_fr.txt");
protected static Set<String> stopwords_it = loadFromClasspath("/eu/dnetlib/pace/config/stopwords_it.txt");
protected static Set<String> stopwords_pt = loadFromClasspath("/eu/dnetlib/pace/config/stopwords_pt.txt");
// transliterator
protected static Transliterator transliterator = Transliterator.getInstance("Any-Eng");
// blacklist of ngrams: to avoid generic keys
protected static Set<String> ngramBlacklist = loadFromClasspath("/eu/dnetlib/pace/config/ngram_blacklist.txt");
// html regex for normalization
public static final Pattern HTML_REGEX = Pattern.compile("<[^>]*>");
private static final String alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ";
private static final String aliases_from = "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎àáâäæãåāèéêëēėęəîïíīįìôöòóœøōõûüùúūßśšłžźżçćčñń";
private static final String aliases_to = "0123456789+-=()n0123456789+-=()aaaaaaaaeeeeeeeeiiiiiioooooooouuuuussslzzzcccnn";
// doi prefix for normalization
public static final Pattern DOI_PREFIX = Pattern.compile("(https?:\\/\\/dx\\.doi\\.org\\/)|(doi:)");
private static Pattern numberPattern = Pattern.compile("-?\\d+(\\.\\d+)?");
private static Pattern hexUnicodePattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
protected String concat(final List<String> l) {
return Joiner.on(" ").skipNulls().join(l);
}
protected String cleanup(final String s) {
final String s1 = HTML_REGEX.matcher(s).replaceAll("");
final String s2 = unicodeNormalization(s1.toLowerCase());
final String s3 = nfd(s2);
final String s4 = fixXML(s3);
final String s5 = s4.replaceAll("([0-9]+)", " $1 ");
final String s6 = transliterate(s5);
final String s7 = fixAliases(s6);
final String s8 = s7.replaceAll("[^\\p{ASCII}]", "");
final String s9 = s8.replaceAll("[\\p{Punct}]", " ");
final String s10 = s9.replaceAll("\\n", " ");
final String s11 = s10.replaceAll("(?m)\\s+", " ");
final String s12 = s11.trim();
return s12;
}
protected String fixXML(final String a) {
return a
.replaceAll("&ndash;", " ")
.replaceAll("&amp;", " ")
.replaceAll("&quot;", " ")
.replaceAll("&minus;", " ");
}
protected boolean checkNumbers(final String a, final String b) {
final String numbersA = getNumbers(a);
final String numbersB = getNumbers(b);
final String romansA = getRomans(a);
final String romansB = getRomans(b);
return !numbersA.equals(numbersB) || !romansA.equals(romansB);
}
protected String getRomans(final String s) {
final StringBuilder sb = new StringBuilder();
for (final String t : s.split(" ")) {
sb.append(isRoman(t) ? t : "");
}
return sb.toString();
}
protected boolean isRoman(final String s) {
return s
.replaceAll("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", "qwertyuiop")
.equals("qwertyuiop");
}
protected String getNumbers(final String s) {
final StringBuilder sb = new StringBuilder();
for (final String t : s.split(" ")) {
sb.append(isNumber(t) ? t : "");
}
return sb.toString();
}
public boolean isNumber(String strNum) {
if (strNum == null) {
return false;
}
return numberPattern.matcher(strNum).matches();
}
protected static String fixAliases(final String s) {
final StringBuilder sb = new StringBuilder();
s.chars().forEach(ch -> {
final int i = StringUtils.indexOf(aliases_from, ch);
sb.append(i >= 0 ? aliases_to.charAt(i) : (char) ch);
});
return sb.toString();
}
protected static String transliterate(final String s) {
try {
return transliterator.transliterate(s);
} catch (Exception e) {
return s;
}
}
protected String removeSymbols(final String s) {
final StringBuilder sb = new StringBuilder();
s.chars().forEach(ch -> {
sb.append(StringUtils.contains(alpha, ch) ? (char) ch : ' ');
});
return sb.toString().replaceAll("\\s+", " ");
}
protected boolean notNull(final String s) {
return s != null;
}
protected String normalize(final String s) {
return fixAliases(transliterate(nfd(unicodeNormalization(s))))
.toLowerCase()
// do not compact the regexes into a single expression: it would cause a StackOverflowError on large input
// strings
.replaceAll("[^ \\w]+", "")
.replaceAll("(\\p{InCombiningDiacriticalMarks})+", "")
.replaceAll("(\\p{Punct})+", " ")
.replaceAll("(\\d)+", " ")
.replaceAll("(\\n)+", " ")
.trim();
}
public String nfd(final String s) {
return Normalizer.normalize(s, Normalizer.Form.NFD);
}
public String utf8(final String s) {
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
return new String(bytes, StandardCharsets.UTF_8);
}
public String unicodeNormalization(final String s) {
Matcher m = hexUnicodePattern.matcher(s);
StringBuffer buf = new StringBuffer(s.length());
while (m.find()) {
String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
m.appendReplacement(buf, Matcher.quoteReplacement(ch));
}
m.appendTail(buf);
return buf.toString();
}
protected String filterStopWords(final String s, final Set<String> stopwords) {
final StringTokenizer st = new StringTokenizer(s);
final StringBuilder sb = new StringBuilder();
while (st.hasMoreTokens()) {
final String token = st.nextToken();
if (!stopwords.contains(token)) {
sb.append(token);
sb.append(" ");
}
}
return sb.toString().trim();
}
public String filterAllStopWords(String s) {
s = filterStopWords(s, stopwords_en);
s = filterStopWords(s, stopwords_de);
s = filterStopWords(s, stopwords_it);
s = filterStopWords(s, stopwords_fr);
s = filterStopWords(s, stopwords_pt);
s = filterStopWords(s, stopwords_es);
s = filterStopWords(s, stopwords_gr);
return s;
}
protected Collection<String> filterBlacklisted(final Collection<String> set, final Set<String> ngramBlacklist) {
final Set<String> newset = Sets.newLinkedHashSet();
for (final String s : set) {
if (!ngramBlacklist.contains(s)) {
newset.add(s);
}
}
return newset;
}
public static Set<String> loadFromClasspath(final String classpath) {
Transliterator transliterator = Transliterator.getInstance("Any-Eng");
final Set<String> h = Sets.newHashSet();
try {
for (final String s : IOUtils
.readLines(NGramUtils.class.getResourceAsStream(classpath), StandardCharsets.UTF_8)) {
h.add(fixAliases(transliterator.transliterate(s))); // transliteration of the stopwords
}
} catch (final Throwable e) {
return Sets.newHashSet();
}
return h;
}
public static Map<String, String> loadMapFromClasspath(final String classpath) {
Transliterator transliterator = Transliterator.getInstance("Any-Eng");
final Map<String, String> m = new HashMap<>();
try {
for (final String s : IOUtils
.readLines(AbstractPaceFunctions.class.getResourceAsStream(classpath), StandardCharsets.UTF_8)) {
// each line is of the form: code;word1;word2;word3
String[] line = s.split(";");
String value = line[0];
for (int i = 1; i < line.length; i++) {
m.put(fixAliases(transliterator.transliterate(line[i].toLowerCase())), value);
}
}
} catch (final Throwable e) {
return new HashMap<>();
}
return m;
}
public String removeKeywords(String s, Set<String> keywords) {
s = " " + s + " ";
for (String k : keywords) {
s = s.replaceAll(k.toLowerCase(), "");
}
return s.trim();
}
public double commonElementsPercentage(Set<String> s1, Set<String> s2) {
double longer = Math.max(s1.size(), s2.size());
return (double) s1.stream().filter(s2::contains).count() / longer;
}
// convert the set of keywords to codes
public Set<String> toCodes(Set<String> keywords, Map<String, String> translationMap) {
return keywords.stream().map(s -> translationMap.get(s)).collect(Collectors.toSet());
}
public Set<String> keywordsToCodes(Set<String> keywords, Map<String, String> translationMap) {
return toCodes(keywords, translationMap);
}
public Set<String> citiesToCodes(Set<String> keywords) {
return toCodes(keywords, cityMap);
}
protected String firstLC(final String s) {
return StringUtils.substring(s, 0, 1).toLowerCase();
}
protected Iterable<String> tokens(final String s, final int maxTokens) {
return Iterables.limit(Splitter.on(" ").omitEmptyStrings().trimResults().split(s), maxTokens);
}
public String normalizePid(String pid) {
return DOI_PREFIX.matcher(pid.toLowerCase()).replaceAll("");
}
// get the list of keywords into the input string
public Set<String> getKeywords(String s1, Map<String, String> translationMap, int windowSize) {
String s = s1;
List<String> tokens = Arrays.asList(s.toLowerCase().split(" "));
Set<String> codes = new HashSet<>();
if (tokens.size() < windowSize)
windowSize = tokens.size();
int length = windowSize;
while (length != 0) {
for (int i = 0; i <= tokens.size() - length; i++) {
String candidate = concat(tokens.subList(i, i + length));
if (translationMap.containsKey(candidate)) {
codes.add(candidate);
s = s.replace(candidate, "").trim();
}
}
tokens = Arrays.asList(s.split(" "));
length -= 1;
}
return codes;
}
public Set<String> getCities(String s1, int windowSize) {
return getKeywords(s1, cityMap, windowSize);
}
public static <T> String readFromClasspath(final String filename, final Class<T> clazz) {
final StringWriter sw = new StringWriter();
try {
IOUtils.copy(clazz.getResourceAsStream(filename), sw, StandardCharsets.UTF_8);
return sw.toString();
} catch (final IOException e) {
throw new RuntimeException("cannot load resource from classpath: " + filename);
}
}
}
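
As an assumed end-to-end example of the helpers above: cleanup("Café <b>Style</b> Research 2023") strips the markup, transliterates the accented character, lower-cases the rest and collapses whitespace, returning "cafe style research 2023"; similarly, normalizePid("https://dx.doi.org/10.1000/XYZ") drops the resolver prefix and returns "10.1000/xyz".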

View File

@@ -0,0 +1,53 @@
package eu.dnetlib.pace.config;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import eu.dnetlib.pace.model.ClusteringDef;
import eu.dnetlib.pace.model.FieldDef;
import eu.dnetlib.pace.tree.support.TreeNodeDef;
/**
* Interface for PACE configuration bean.
*
* @author claudio
*/
public interface Config {
/**
* Field configuration definitions.
*
* @return the list of definitions
*/
public List<FieldDef> model();
/**
* Decision Tree definition
*
* @return the map representing the decision tree
*/
public Map<String, TreeNodeDef> decisionTree();
/**
* Clusterings.
*
* @return the list
*/
public List<ClusteringDef> clusterings();
/**
* Blacklists.
*
* @return the map
*/
public Map<String, Predicate<String>> blacklists();
/**
* Translation map.
*
* @return the map
*/
public Map<String, String> translationMap();
}

View File

@@ -0,0 +1,178 @@
package eu.dnetlib.pace.config;
import java.io.IOException;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
import java.util.stream.Collectors;
import org.antlr.stringtemplate.StringTemplate;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import com.fasterxml.jackson.annotation.JsonIgnore;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.collect.Maps;
import eu.dnetlib.pace.model.ClusteringDef;
import eu.dnetlib.pace.model.FieldDef;
import eu.dnetlib.pace.tree.support.TreeNodeDef;
import eu.dnetlib.pace.util.PaceException;
public class DedupConfig implements Config, Serializable {
private static String CONFIG_TEMPLATE = "dedupConfig.st";
private PaceConfig pace;
private WfConfig wf;
@JsonIgnore
private Map<String, Predicate<String>> blacklists;
private static Map<String, String> defaults = Maps.newHashMap();
static {
defaults.put("dedupRun", "001");
defaults.put("entityType", "result");
defaults.put("subEntityType", "resulttype");
defaults.put("subEntityValue", "publication");
defaults.put("orderField", "title");
defaults.put("queueMaxSize", "2000");
defaults.put("groupMaxSize", "10");
defaults.put("slidingWindowSize", "200");
defaults.put("rootBuilder", "result");
defaults.put("includeChildren", "true");
defaults.put("maxIterations", "20");
defaults.put("idPath", "$.id");
}
public DedupConfig() {
}
public static DedupConfig load(final String json) {
final DedupConfig config;
try {
config = new ObjectMapper().readValue(json, DedupConfig.class);
config.getPace().initModel();
config.getPace().initTranslationMap();
config.blacklists = config
.getPace()
.getBlacklists()
.entrySet()
.stream()
.map(
e -> new AbstractMap.SimpleEntry<String, List<Pattern>>(e.getKey(),
e
.getValue()
.stream()
.filter(s -> !StringUtils.isBlank(s))
.map(Pattern::compile)
.collect(Collectors.toList())))
.collect(
Collectors
.toMap(
e -> e.getKey(),
e -> (Predicate<String> & Serializable) s -> e
.getValue()
.stream()
.filter(p -> p.matcher(s).matches())
.findFirst()
.isPresent()))
;
return config;
} catch (IOException | PatternSyntaxException e) {
throw new PaceException("Error in parsing configuration json", e);
}
}
public static DedupConfig loadDefault() throws IOException {
return loadDefault(new HashMap<String, String>());
}
public static DedupConfig loadDefault(final Map<String, String> params) throws IOException {
final StringTemplate template = new StringTemplate(new DedupConfig().readFromClasspath(CONFIG_TEMPLATE));
for (final Entry<String, String> e : defaults.entrySet()) {
template.setAttribute(e.getKey(), e.getValue());
}
for (final Entry<String, String> e : params.entrySet()) {
if (template.getAttribute(e.getKey()) != null) {
template.getAttributes().computeIfPresent(e.getKey(), (o, o2) -> e.getValue());
} else {
template.setAttribute(e.getKey(), e.getValue());
}
}
final String json = template.toString();
return load(json);
}
private String readFromClasspath(final String resource) throws IOException {
return IOUtils.toString(getClass().getResource(resource), StandardCharsets.UTF_8);
}
public PaceConfig getPace() {
return pace;
}
public void setPace(final PaceConfig pace) {
this.pace = pace;
}
public WfConfig getWf() {
return wf;
}
public void setWf(final WfConfig wf) {
this.wf = wf;
}
@Override
public String toString() {
try {
return new ObjectMapper().writeValueAsString(this);
} catch (IOException e) {
throw new PaceException("unable to serialise configuration", e);
}
}
@Override
public Map<String, TreeNodeDef> decisionTree() {
return getPace().getDecisionTree();
}
@Override
public List<FieldDef> model() {
return getPace().getModel();
}
@Override
public List<ClusteringDef> clusterings() {
return getPace().getClustering();
}
@Override
public Map<String, Predicate<String>> blacklists() {
return blacklists;
}
@Override
public Map<String, String> translationMap() {
return getPace().translationMap();
}
}
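
Typical usage of the two factory methods above; the JSON string and the overridden default are placeholders, not a working configuration.

import java.io.IOException;
import java.util.Collections;

import eu.dnetlib.pace.config.DedupConfig;

class DedupConfigDemo {

	static DedupConfig fromJson(String json) {
		// parses the JSON and initializes the model, the translation map and the compiled blacklists
		return DedupConfig.load(json);
	}

	static DedupConfig fromTemplate() throws IOException {
		// fills the bundled dedupConfig.st template, overriding one of the defaults listed above
		return DedupConfig.loadDefault(Collections.singletonMap("entityType", "organization"));
	}
}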

View File

@@ -0,0 +1,108 @@
package eu.dnetlib.pace.config;
import java.io.Serializable;
import java.util.List;
import java.util.Map;
import com.fasterxml.jackson.annotation.JsonIgnore;
import com.google.common.collect.Maps;
import com.ibm.icu.text.Transliterator;
import eu.dnetlib.pace.common.AbstractPaceFunctions;
import eu.dnetlib.pace.model.ClusteringDef;
import eu.dnetlib.pace.model.FieldDef;
import eu.dnetlib.pace.tree.support.TreeNodeDef;
import eu.dnetlib.pace.util.PaceResolver;
public class PaceConfig extends AbstractPaceFunctions implements Serializable {
private List<FieldDef> model;
private List<ClusteringDef> clustering;
private Map<String, TreeNodeDef> decisionTree;
private Map<String, List<String>> blacklists;
private Map<String, List<String>> synonyms;
@JsonIgnore
private Map<String, String> translationMap;
public Map<String, FieldDef> getModelMap() {
return modelMap;
}
@JsonIgnore
private Map<String, FieldDef> modelMap;
@JsonIgnore
public static PaceResolver resolver = new PaceResolver();
public PaceConfig() {
}
public void initModel() {
modelMap = Maps.newHashMap();
for (FieldDef fd : getModel()) {
modelMap.put(fd.getName(), fd);
}
}
public void initTranslationMap() {
translationMap = Maps.newHashMap();
Transliterator transliterator = Transliterator.getInstance("Any-Eng");
for (String key : synonyms.keySet()) {
for (String term : synonyms.get(key)) {
translationMap
.put(
fixAliases(transliterator.transliterate(term.toLowerCase())),
key);
}
}
}
public Map<String, String> translationMap() {
return translationMap;
}
public List<FieldDef> getModel() {
return model;
}
public void setModel(final List<FieldDef> model) {
this.model = model;
}
public List<ClusteringDef> getClustering() {
return clustering;
}
public void setClustering(final List<ClusteringDef> clustering) {
this.clustering = clustering;
}
public Map<String, TreeNodeDef> getDecisionTree() {
return decisionTree;
}
public void setDecisionTree(Map<String, TreeNodeDef> decisionTree) {
this.decisionTree = decisionTree;
}
public Map<String, List<String>> getBlacklists() {
return blacklists;
}
public void setBlacklists(final Map<String, List<String>> blacklists) {
this.blacklists = blacklists;
}
public Map<String, List<String>> getSynonyms() {
return synonyms;
}
public void setSynonyms(Map<String, List<String>> synonyms) {
this.synonyms = synonyms;
}
}

View File

@@ -0,0 +1,6 @@
package eu.dnetlib.pace.config;
public enum Type {
String, Int, List, JSON, URL, StringConcat, DoubleArray
}

View File

@@ -0,0 +1,294 @@
package eu.dnetlib.pace.config;
import java.io.IOException;
import java.io.Serializable;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.commons.lang3.StringUtils;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.collect.Lists;
import com.google.common.collect.Sets;
import eu.dnetlib.pace.util.PaceException;
public class WfConfig implements Serializable {
/**
* Entity type.
*/
private String entityType = "";
/**
* Sub-Entity type refers to one of the fields declared in the model. See eu.dnetlib.pace.config.PaceConfig.modelMap
*/
private String subEntityType = "";
/**
* Sub-Entity value declares a value for subTypes to be considered.
*/
private String subEntityValue = "";
/**
* Field name used to sort the values in the reducer phase.
*/
private String orderField = "";
/**
* Column Families involved in the relations redirection.
*/
private List<String> rootBuilder = Lists.newArrayList();
/**
* Set of datasource namespace prefixes that won't be deduplicated.
*/
private Set<String> skipList = Sets.newHashSet();
/**
* Subprefix used to build the root id, allows multiple dedup runs.
*/
private String dedupRun = "";
/**
* Similarity threshold.
*/
private double threshold = 0;
/** The queue max size. */
private int queueMaxSize = 2000;
/** The group max size. */
private int groupMaxSize;
/** The sliding window size. */
private int slidingWindowSize;
/** The configuration id. */
private String configurationId;
/** The include children. */
private boolean includeChildren;
/** Default maximum number of allowed children. */
private final static int MAX_CHILDREN = 10;
/** Maximum number of allowed children. */
private int maxChildren = MAX_CHILDREN;
/** Default maximum number of iterations. */
private final static int MAX_ITERATIONS = 20;
/** Maximum number of iterations */
private int maxIterations = MAX_ITERATIONS;
/** The Jquery path to retrieve the identifier */
private String idPath = "$.id";
public WfConfig() {
}
/**
* Instantiates a new dedup config.
*
* @param entityType
* the entity type
* @param orderField
* the order field
* @param rootBuilder
* the root builder families
* @param dedupRun
* the dedup run
* @param skipList
* the skip list
* @param queueMaxSize
* the queue max size
* @param groupMaxSize
* the group max size
* @param slidingWindowSize
* the sliding window size
* @param includeChildren
* whether to include the children in the representative records.
* @param maxIterations
* the maximum number of iterations
* @param idPath
* the path for the id of the entity
*/
public WfConfig(final String entityType, final String orderField, final List<String> rootBuilder,
final String dedupRun,
final Set<String> skipList, final int queueMaxSize, final int groupMaxSize, final int slidingWindowSize,
final boolean includeChildren, final int maxIterations, final String idPath) {
super();
this.entityType = entityType;
this.orderField = orderField;
this.rootBuilder = rootBuilder;
this.dedupRun = cleanupStringNumber(dedupRun);
this.skipList = skipList;
this.queueMaxSize = queueMaxSize;
this.groupMaxSize = groupMaxSize;
this.slidingWindowSize = slidingWindowSize;
this.includeChildren = includeChildren;
this.maxIterations = maxIterations;
this.idPath = idPath;
}
/**
* Cleanup string number.
*
* @param s
* the s
* @return the string
*/
private String cleanupStringNumber(final String s) {
return s.contains("'") ? s.replaceAll("'", "") : s;
}
public boolean hasSubType() {
return StringUtils.isNotBlank(getSubEntityType()) && StringUtils.isNotBlank(getSubEntityValue());
}
public String getEntityType() {
return entityType;
}
public void setEntityType(final String entityType) {
this.entityType = entityType;
}
public String getSubEntityType() {
return subEntityType;
}
public void setSubEntityType(final String subEntityType) {
this.subEntityType = subEntityType;
}
public String getSubEntityValue() {
return subEntityValue;
}
public void setSubEntityValue(final String subEntityValue) {
this.subEntityValue = subEntityValue;
}
public String getOrderField() {
return orderField;
}
public void setOrderField(final String orderField) {
this.orderField = orderField;
}
public List<String> getRootBuilder() {
return rootBuilder;
}
public void setRootBuilder(final List<String> rootBuilder) {
this.rootBuilder = rootBuilder;
}
public Set<String> getSkipList() {
return skipList != null ? skipList : new HashSet<String>();
}
public void setSkipList(final Set<String> skipList) {
this.skipList = skipList;
}
public String getDedupRun() {
return dedupRun;
}
public void setDedupRun(final String dedupRun) {
this.dedupRun = dedupRun;
}
public double getThreshold() {
return threshold;
}
public void setThreshold(final double threshold) {
this.threshold = threshold;
}
public int getQueueMaxSize() {
return queueMaxSize;
}
public void setQueueMaxSize(final int queueMaxSize) {
this.queueMaxSize = queueMaxSize;
}
public int getGroupMaxSize() {
return groupMaxSize;
}
public void setGroupMaxSize(final int groupMaxSize) {
this.groupMaxSize = groupMaxSize;
}
public int getSlidingWindowSize() {
return slidingWindowSize;
}
public void setSlidingWindowSize(final int slidingWindowSize) {
this.slidingWindowSize = slidingWindowSize;
}
public String getConfigurationId() {
return configurationId;
}
public void setConfigurationId(final String configurationId) {
this.configurationId = configurationId;
}
public boolean isIncludeChildren() {
return includeChildren;
}
public void setIncludeChildren(final boolean includeChildren) {
this.includeChildren = includeChildren;
}
public int getMaxChildren() {
return maxChildren;
}
public void setMaxChildren(final int maxChildren) {
this.maxChildren = maxChildren;
}
public int getMaxIterations() {
return maxIterations;
}
public void setMaxIterations(int maxIterations) {
this.maxIterations = maxIterations;
}
public String getIdPath() {
return idPath;
}
public void setIdPath(String idPath) {
this.idPath = idPath;
}
/*
* (non-Javadoc)
* @see java.lang.Object#toString()
*/
@Override
public String toString() {
try {
return new ObjectMapper().writeValueAsString(this);
} catch (IOException e) {
throw new PaceException("unable to serialise " + this.getClass().getName(), e);
}
}
}

View File

@@ -0,0 +1,63 @@
package eu.dnetlib.pace.model;
import java.io.IOException;
import java.io.Serializable;
import java.util.List;
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.pace.clustering.ClusteringFunction;
import eu.dnetlib.pace.config.PaceConfig;
import eu.dnetlib.pace.util.PaceException;
public class ClusteringDef implements Serializable {
private String name;
private List<String> fields;
private Map<String, Integer> params;
public ClusteringDef() {
}
public String getName() {
return name;
}
public void setName(final String name) {
this.name = name;
}
public ClusteringFunction clusteringFunction() {
return PaceConfig.resolver.getClusteringFunction(getName(), params);
}
public List<String> getFields() {
return fields;
}
public void setFields(final List<String> fields) {
this.fields = fields;
}
public Map<String, Integer> getParams() {
return params;
}
public void setParams(final Map<String, Integer> params) {
this.params = params;
}
@Override
public String toString() {
try {
return new ObjectMapper().writeValueAsString(this);
} catch (IOException e) {
throw new PaceException("unable to serialise " + this.getClass().getName(), e);
}
}
}

View File

@@ -0,0 +1,103 @@
package eu.dnetlib.pace.model;
import java.io.Serializable;
import java.util.List;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.base.Splitter;
import com.google.common.collect.Lists;
import eu.dnetlib.pace.config.Type;
/**
* The schema is composed of field definitions (FieldDef). Each field has a type, a name, and an associated comparison algorithm.
*/
public class FieldDef implements Serializable {
public final static String PATH_SEPARATOR = "/";
private String name;
private String path;
private Type type;
private boolean overrideMatch;
/**
* Sets maximum size for the repeatable fields in the model. -1 for unbounded size.
*/
private int size = -1;
/**
* Sets maximum length for field values in the model. -1 for unbounded length.
*/
private int length = -1;
public FieldDef() {
}
public String getName() {
return name;
}
public String getPath() {
return path;
}
public List<String> getPathList() {
return Lists.newArrayList(Splitter.on(PATH_SEPARATOR).split(getPath()));
}
public Type getType() {
return type;
}
public void setType(final Type type) {
this.type = type;
}
public boolean isOverrideMatch() {
return overrideMatch;
}
public void setOverrideMatch(final boolean overrideMatch) {
this.overrideMatch = overrideMatch;
}
public int getSize() {
return size;
}
public void setSize(int size) {
this.size = size;
}
public int getLength() {
return length;
}
public void setLength(int length) {
this.length = length;
}
public void setName(String name) {
this.name = name;
}
public void setPath(String path) {
this.path = path;
}
@Override
public String toString() {
try {
return new ObjectMapper().writeValueAsString(this);
} catch (JsonProcessingException e) {
return null;
}
}
}

View File

@@ -0,0 +1,156 @@
package eu.dnetlib.pace.model;
import java.nio.charset.Charset;
import java.text.Normalizer;
import java.util.List;
import java.util.Set;
import com.google.common.base.Joiner;
import com.google.common.base.Splitter;
import com.google.common.collect.Iterables;
import com.google.common.collect.Lists;
import com.google.common.hash.Hashing;
import eu.dnetlib.pace.common.AbstractPaceFunctions;
import eu.dnetlib.pace.util.Capitalise;
import eu.dnetlib.pace.util.DotAbbreviations;
public class Person {
private static final String UTF8 = "UTF-8";
private List<String> name = Lists.newArrayList();
private List<String> surname = Lists.newArrayList();
private List<String> fullname = Lists.newArrayList();
private final String original;
private static Set<String> particles = null;
public Person(String s, final boolean aggressive) {
original = s;
s = Normalizer.normalize(s, Normalizer.Form.NFD);
s = s.replaceAll("\\(.+\\)", "");
s = s.replaceAll("\\[.+\\]", "");
s = s.replaceAll("\\{.+\\}", "");
s = s.replaceAll("\\s+-\\s+", "-");
s = s.replaceAll("[\\p{Punct}&&[^,-]]", " ");
s = s.replaceAll("\\d", " ");
s = s.replaceAll("\\n", " ");
s = s.replaceAll("\\.", " ");
s = s.replaceAll("\\s+", " ");
if (aggressive) {
s = s.replaceAll("[\\p{InCombiningDiacriticalMarks}&&[^,-]]", "");
// s = s.replaceAll("[\\W&&[^,-]]", "");
}
if (s.contains(",")) { // if the name contains a comma, the name and the surname are easy to derive
final String[] arr = s.split(",");
if (arr.length == 1) {
fullname = splitTerms(arr[0]);
} else if (arr.length > 1) {
surname = splitTerms(arr[0]);
name = splitTerms(arr[1]);
fullname.addAll(surname);
fullname.addAll(name);
}
} else {
fullname = splitTerms(s);
int lastInitialPosition = fullname.size();
boolean hasSurnameInUpperCase = false;
for (int i = 0; i < fullname.size(); i++) {
final String term = fullname.get(i);
if (term.length() == 1) {
lastInitialPosition = i;
} else if (term.equals(term.toUpperCase())) {
hasSurnameInUpperCase = true;
}
}
if (lastInitialPosition < (fullname.size() - 1)) { // Case: Michele G. Artini
name = fullname.subList(0, lastInitialPosition + 1);
surname = fullname.subList(lastInitialPosition + 1, fullname.size());
} else if (hasSurnameInUpperCase) { // Case: Michele ARTINI
for (final String term : fullname) {
if ((term.length() > 1) && term.equals(term.toUpperCase())) {
surname.add(term);
} else {
name.add(term);
}
}
}
}
}
private List<String> splitTerms(final String s) {
if (particles == null) {
particles = AbstractPaceFunctions.loadFromClasspath("/eu/dnetlib/pace/config/name_particles.txt");
}
final List<String> list = Lists.newArrayList();
for (final String part : Splitter.on(" ").omitEmptyStrings().split(s)) {
if (!particles.contains(part.toLowerCase())) {
list.add(part);
}
}
return list;
}
public List<String> getName() {
return name;
}
public String getNameString() {
return Joiner.on(" ").join(getName());
}
public List<String> getSurname() {
return surname;
}
public List<String> getFullname() {
return fullname;
}
public String getOriginal() {
return original;
}
public String hash() {
return Hashing.murmur3_128().hashString(getNormalisedFullname(), Charset.forName(UTF8)).toString();
}
public String getNormalisedFirstName() {
return Joiner.on(" ").join(getCapitalFirstnames());
}
public String getNormalisedSurname() {
return Joiner.on(" ").join(getCapitalSurname());
}
public String getSurnameString() {
return Joiner.on(" ").join(getSurname());
}
public String getNormalisedFullname() {
return isAccurate() ? getNormalisedSurname() + ", " + getNormalisedFirstName() : Joiner.on(" ").join(fullname);
}
public List<String> getCapitalFirstnames() {
return Lists.newArrayList(Iterables.transform(getNameWithAbbreviations(), new Capitalise()));
}
public List<String> getCapitalSurname() {
return Lists.newArrayList(Iterables.transform(surname, new Capitalise()));
}
public List<String> getNameWithAbbreviations() {
return Lists.newArrayList(Iterables.transform(name, new DotAbbreviations()));
}
public boolean isAccurate() {
return ((name != null) && (surname != null) && !name.isEmpty() && !surname.isEmpty());
}
}
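
A few assumed inputs show how the constructor branches: "Artini, Michele" splits on the comma (surname [Artini], name [Michele], accurate); "Michele G. Artini" uses the single-letter token as the split point (name [Michele, G], surname [Artini]); "Michele ARTINI" treats the upper-case token as the surname; "michele artini" matches neither heuristic, so only the fullname is populated and the person is not accurate.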

View File

@@ -0,0 +1,119 @@
package eu.dnetlib.pace.model;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import com.google.common.collect.Lists;
import com.google.common.collect.Sets;
public class PersonComparatorUtils {
private static final int MAX_FULLNAME_LENGTH = 50;
public static Set<String> getNgramsForPerson(String fullname) {
Set<String> set = Sets.newHashSet();
if (fullname.length() > MAX_FULLNAME_LENGTH) {
return set;
}
Person p = new Person(fullname, true);
if (p.isAccurate()) {
for (String name : p.getName()) {
for (String surname : p.getSurname()) {
set.add((name.charAt(0) + "_" + surname).toLowerCase());
}
}
} else {
List<String> list = p.getFullname();
for (int i = 0; i < list.size(); i++) {
if (list.get(i).length() > 1) {
for (int j = 0; j < list.size(); j++) {
if (i != j) {
set.add((list.get(j).charAt(0) + "_" + list.get(i)).toLowerCase());
}
}
}
}
}
return set;
}
public static boolean areSimilar(String s1, String s2) {
Person p1 = new Person(s1, true);
Person p2 = new Person(s2, true);
if (p1.isAccurate() && p2.isAccurate()) {
return verifyNames(p1.getName(), p2.getName()) && verifySurnames(p1.getSurname(), p2.getSurname());
} else {
return verifyFullnames(p1.getFullname(), p2.getFullname());
}
}
private static boolean verifyNames(List<String> list1, List<String> list2) {
return verifySimilarity(extractExtendedNames(list1), extractExtendedNames(list2))
&& verifySimilarity(extractInitials(list1), extractInitials(list2));
}
private static boolean verifySurnames(List<String> list1, List<String> list2) {
if (list1.size() != list2.size()) {
return false;
}
for (int i = 0; i < list1.size(); i++) {
if (!list1.get(i).equalsIgnoreCase(list2.get(i))) {
return false;
}
}
return true;
}
private static boolean verifyFullnames(List<String> list1, List<String> list2) {
Collections.sort(list1);
Collections.sort(list2);
return verifySimilarity(extractExtendedNames(list1), extractExtendedNames(list2))
&& verifySimilarity(extractInitials(list1), extractInitials(list2));
}
private static List<String> extractExtendedNames(List<String> list) {
ArrayList<String> res = Lists.newArrayList();
for (String s : list) {
if (s.length() > 1) {
res.add(s.toLowerCase());
}
}
return res;
}
private static List<String> extractInitials(List<String> list) {
ArrayList<String> res = Lists.newArrayList();
for (String s : list) {
res.add(s.substring(0, 1).toLowerCase());
}
return res;
}
private static boolean verifySimilarity(List<String> list1, List<String> list2) {
if (list1.size() > list2.size()) {
return verifySimilarity(list2, list1);
}
// NB: list2 is at least as long as list1
int pos = -1;
for (String s : list1) {
int curr = list2.indexOf(s);
if (curr > pos) {
list2.set(curr, "*"); // invalidate the matched element so it cannot be reused, e.g. "amm - amm"
pos = curr;
} else {
return false;
}
}
return true;
}
}

View File

@@ -0,0 +1,65 @@
package eu.dnetlib.pace.model;
import java.util.Comparator;
import org.apache.spark.sql.Row;
import eu.dnetlib.pace.clustering.NGramUtils;
/**
* The Class RowDataOrderingComparator.
*/
public class RowDataOrderingComparator implements Comparator<Row> {
/** The comparator field. */
private final int comparatorField;
private final int identityFieldPosition;
/**
* Instantiates a new row data ordering comparator.
*
* @param comparatorField
* the comparator field
* @param identityFieldPosition
* the position of the identity field, used to break ties
*/
public RowDataOrderingComparator(final int comparatorField, int identityFieldPosition) {
this.comparatorField = comparatorField;
this.identityFieldPosition = identityFieldPosition;
}
/*
* (non-Javadoc)
* @see java.util.Comparator#compare(java.lang.Object, java.lang.Object)
*/
@Override
public int compare(final Row d1, final Row d2) {
if (d1 == null)
return d2 == null ? 0 : -1;
else if (d2 == null) {
return 1;
}
final String o1 = d1.getString(comparatorField);
final String o2 = d2.getString(comparatorField);
if (o1 == null)
return o2 == null ? 0 : -1;
else if (o2 == null) {
return 1;
}
final String to1 = NGramUtils.cleanupForOrdering(o1);
final String to2 = NGramUtils.cleanupForOrdering(o2);
int res = to1.compareTo(to2);
if (res == 0) {
res = o1.compareTo(o2);
if (res == 0) {
return d1.getString(identityFieldPosition).compareTo(d2.getString(identityFieldPosition));
}
}
return res;
}
}
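As a hedged sketch, sorting plain Rows with the comparator; treating position 1 as the ordering column and position 0 as the identity column is an assumption for illustration:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

List<Row> rows = new ArrayList<>();
rows.add(RowFactory.create("id2", "The Exercise"));
rows.add(RowFactory.create("id1", "An exercise"));
rows.sort(new RowDataOrderingComparator(1, 0)); // orders by cleaned title, then raw title, then id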

View File

@@ -0,0 +1,131 @@
package eu.dnetlib.pace.model
import eu.dnetlib.pace.config.{DedupConfig, Type}
import eu.dnetlib.pace.util.{BlockProcessor, SparkReporter}
import org.apache.spark.SparkContext
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions.{col, lit, udf}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, Dataset, Row, functions}
import java.util.function.Predicate
import java.util.stream.Collectors
import scala.collection.JavaConversions._
import scala.collection.JavaConverters._
import scala.collection.mutable
case class SparkDeduper(conf: DedupConfig) extends Serializable {
val model: SparkModel = SparkModel(conf)
val dedup: (Dataset[Row] => Dataset[Row]) = df => {
df.transform(filterAndCleanup)
.transform(generateClustersWithCollect)
.transform(processBlocks)
}
val filterAndCleanup: (Dataset[Row] => Dataset[Row]) = df => {
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
df_with_filters
}
def filterColumnUDF(fdef: FieldDef): UserDefinedFunction = {
val blacklist: Predicate[String] = conf.blacklists().get(fdef.getName)
if (blacklist == null) {
throw new IllegalArgumentException("Column: " + fdef.getName + " does not have any filter")
} else {
fdef.getType match {
case Type.List | Type.JSON =>
udf[Array[String], Array[String]](values => {
values.filter((v: String) => !blacklist.test(v))
})
case _ =>
udf[String, String](v => {
if (blacklist.test(v)) ""
else v
})
}
}
}
val generateClustersWithCollect: (Dataset[Row] => Dataset[Row]) = df_with_filters => {
var df_with_clustering_keys: Dataset[Row] = null
for ((cd, idx) <- conf.clusterings().zipWithIndex) {
val inputColumns = cd.getFields().foldLeft(Seq[Column]())((acc, fName) => {
val column = if (conf.blacklists.containsKey(fName))
Seq(col(fName + "_filtered"))
else
Seq(col(fName))
acc ++ column
})
// Add 'key' column with the value generated by the given clustering definition
val ds: Dataset[Row] = df_with_filters
.withColumn("clustering", lit(cd.getName + "::" + idx))
.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(inputColumns: _*))))
// Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
.withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(col(model.orderingFieldName), col(model.identifierFieldName))))
if (df_with_clustering_keys == null)
df_with_clustering_keys = ds
else
df_with_clustering_keys = df_with_clustering_keys.union(ds)
}
//TODO: analytics
val df_with_blocks = df_with_clustering_keys
// filter out rows with position exceeding the maxqueuesize parameter
.filter(col("position").leq(conf.getWf.getQueueMaxSize))
.groupBy("clustering", "key")
.agg(functions.collect_set(functions.struct(model.schema.fieldNames.map(col): _*)).as("block"))
.filter(functions.size(new Column("block")).gt(1))
df_with_blocks
}
def clusterValuesUDF(cd: ClusteringDef) = {
udf[mutable.WrappedArray[String], mutable.WrappedArray[Any]](values => {
values.flatMap(f => cd.clusteringFunction().apply(conf, Seq(f.toString).asJava).asScala)
})
}
val processBlocks: (Dataset[Row] => Dataset[Row]) = df => {
df.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
.withColumn("relations", processBlock(df.sqlContext.sparkContext).apply(new Column("block")))
.select(functions.explode(new Column("relations")).as("relation"))
}
def processBlock(implicit sc: SparkContext) = {
val accumulators = SparkReporter.constructAccumulator(conf, sc)
udf[Array[(String, String)], mutable.WrappedArray[Row]](block => {
val reporter = new SparkReporter(accumulators)
val mapDocuments = block.asJava.stream()
.sorted(new RowDataOrderingComparator(model.orderingFieldPosition, model.identityFieldPosition))
.limit(conf.getWf.getQueueMaxSize)
.collect(Collectors.toList[Row]())
new BlockProcessor(conf, model.identityFieldPosition, model.orderingFieldPosition).processSortedRows(mapDocuments, reporter)
reporter.getRelations.asScala.toArray
}).asNondeterministic()
}
}
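The clustering stage above explodes each record into (clustering, key) rows and ranks them with a window before capping the queue. The same construct in the plain Spark Java API, as a sketch outside this codebase (column names and the cap are illustrative):

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;

static Dataset<Row> rankWithinBlocks(Dataset<Row> df, int queueMaxSize) {
    return df
        .withColumn("position", row_number().over(
            Window.partitionBy("key").orderBy(col("title"), col("identifier"))))
        .filter(col("position").leq(queueMaxSize)); // mirrors the wf.getQueueMaxSize cap
}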

View File

@@ -0,0 +1,108 @@
package eu.dnetlib.pace.model
import com.jayway.jsonpath.{Configuration, JsonPath}
import eu.dnetlib.pace.config.{DedupConfig, Type}
import eu.dnetlib.pace.util.MapDocumentUtil
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
import org.apache.spark.sql.{Dataset, Row}
import java.util.regex.Pattern
import scala.collection.JavaConverters._
case class SparkModel(conf: DedupConfig) {
private val URL_REGEX: Pattern = Pattern.compile("^\\s*(http|https|ftp)\\://.*")
private val CONCAT_REGEX: Pattern = Pattern.compile("\\|\\|\\|")
val identifierFieldName = "identifier"
val orderingFieldName = if (!conf.getWf.getOrderField.isEmpty) conf.getWf.getOrderField else identifierFieldName
val schema: StructType = {
// create an implicit identifier field
val identifier = new FieldDef()
identifier.setName(identifierFieldName)
identifier.setType(Type.String)
// Construct a Spark StructType representing the schema of the model
(Seq(identifier) ++ conf.getPace.getModel.asScala)
.foldLeft(
new StructType()
)((resType, fieldDef) => {
resType.add(fieldDef.getType match {
case Type.List | Type.JSON =>
StructField(fieldDef.getName, DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty)
case Type.DoubleArray =>
StructField(fieldDef.getName, DataTypes.createArrayType(DataTypes.DoubleType), true, Metadata.empty)
case _ =>
StructField(fieldDef.getName, DataTypes.StringType, true, Metadata.empty)
})
})
}
val identityFieldPosition: Int = schema.fieldIndex(identifierFieldName)
val orderingFieldPosition: Int = schema.fieldIndex(orderingFieldName)
val parseJsonDataset: (Dataset[String] => Dataset[Row]) = df => {
df.map(r => rowFromJson(r))(RowEncoder(schema))
}
def rowFromJson(json: String): Row = {
val documentContext =
JsonPath.using(Configuration.defaultConfiguration.addOptions(com.jayway.jsonpath.Option.SUPPRESS_EXCEPTIONS)).parse(json)
val values = new Array[Any](schema.size)
values(identityFieldPosition) = MapDocumentUtil.getJPathString(conf.getWf.getIdPath, documentContext)
schema.fieldNames.zipWithIndex.foldLeft(values) {
case ((res, (fname, index))) => {
val fdef = conf.getPace.getModelMap.get(fname)
if (fdef != null) {
res(index) = fdef.getType match {
case Type.String | Type.Int =>
MapDocumentUtil.truncateValue(
MapDocumentUtil.getJPathString(fdef.getPath, documentContext),
fdef.getLength
)
case Type.URL =>
var uv = MapDocumentUtil.getJPathString(fdef.getPath, documentContext)
if (!URL_REGEX.matcher(uv).matches)
uv = ""
uv
case Type.List | Type.JSON =>
MapDocumentUtil.truncateList(
MapDocumentUtil.getJPathList(fdef.getPath, documentContext, fdef.getType),
fdef.getSize
).asScala
case Type.StringConcat =>
val jpaths = CONCAT_REGEX.split(fdef.getPath)
MapDocumentUtil.truncateValue(
jpaths
.map(jpath => MapDocumentUtil.getJPathString(jpath, documentContext))
.mkString(" "),
fdef.getLength
)
case Type.DoubleArray =>
MapDocumentUtil.getJPathArray(fdef.getPath, json)
}
}
res
}
}
new GenericRowWithSchema(values, schema)
}
}
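rowFromJson resolves each model field through jayway JsonPath; stripped to its essence, the extraction looks roughly like this (path and payload are illustrative):

import com.jayway.jsonpath.Configuration;
import com.jayway.jsonpath.DocumentContext;
import com.jayway.jsonpath.JsonPath;
import com.jayway.jsonpath.Option;

DocumentContext ctx = JsonPath
    .using(Configuration.defaultConfiguration().addOptions(Option.SUPPRESS_EXCEPTIONS))
    .parse("{\"id\":\"50|doi_________::1\",\"title\":\"A sample title\"}");
String id = ctx.read("$.id");       // "50|doi_________::1"
String title = ctx.read("$.title"); // "A sample title"; missing paths yield null instead of throwing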

View File

@@ -0,0 +1,42 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("alwaysMatch")
public class AlwaysMatch<T> extends AbstractComparator<T> {
public AlwaysMatch(final Map<String, String> params) {
super(params, new com.wcohen.ss.JaroWinkler());
}
public AlwaysMatch(final double weight) {
super(weight, new com.wcohen.ss.JaroWinkler());
}
protected AlwaysMatch(final double weight, final AbstractStringDistance ssalgo) {
super(weight, ssalgo);
}
@Override
public double compare(final Object a, final Object b, final Config conf) {
return 1.0;
}
@Override
public double getWeight() {
return super.weight;
}
@Override
protected double normalize(final double d) {
return d;
}
}

View File

@@ -0,0 +1,157 @@
package eu.dnetlib.pace.tree;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.model.Person;
import eu.dnetlib.pace.tree.support.AbstractListComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("authorsMatch")
public class AuthorsMatch extends AbstractListComparator {
Map<String, String> params;
private double SURNAME_THRESHOLD;
private double NAME_THRESHOLD;
private double FULLNAME_THRESHOLD;
private String MODE; // full or surname
private int SIZE_THRESHOLD;
private String TYPE; // count or percentage
private int common;
public AuthorsMatch(Map<String, String> params) {
super(params, new com.wcohen.ss.JaroWinkler());
this.params = params;
MODE = params.getOrDefault("mode", "full");
SURNAME_THRESHOLD = Double.parseDouble(params.getOrDefault("surname_th", "0.95"));
NAME_THRESHOLD = Double.parseDouble(params.getOrDefault("name_th", "0.95"));
FULLNAME_THRESHOLD = Double.parseDouble(params.getOrDefault("fullname_th", "0.9"));
SIZE_THRESHOLD = Integer.parseInt(params.getOrDefault("size_th", "20"));
TYPE = params.getOrDefault("type", "percentage");
common = 0;
}
protected AuthorsMatch(double w, AbstractStringDistance ssalgo) {
super(w, ssalgo);
}
@Override
public double compare(final List<String> a, final List<String> b, final Config conf) {
if (a.isEmpty() || b.isEmpty())
return -1;
if (a.size() > SIZE_THRESHOLD || b.size() > SIZE_THRESHOLD)
return 1.0;
List<Person> aList = a.stream().map(author -> new Person(author, false)).collect(Collectors.toList());
List<Person> bList = b.stream().map(author -> new Person(author, false)).collect(Collectors.toList());
common = 0;
// compare each element of List1 with each element of List2
for (Person p1 : aList)
for (Person p2 : bList) {
// both persons are inaccurate
if (!p1.isAccurate() && !p2.isAccurate()) {
// compare just normalized fullnames
String fullname1 = normalization(
p1.getNormalisedFullname().isEmpty() ? p1.getOriginal() : p1.getNormalisedFullname());
String fullname2 = normalization(
p2.getNormalisedFullname().isEmpty() ? p2.getOriginal() : p2.getNormalisedFullname());
if (ssalgo.score(fullname1, fullname2) > FULLNAME_THRESHOLD) {
common += 1;
break;
}
}
// one person is inaccurate
if (p1.isAccurate() ^ p2.isAccurate()) {
// prepare data
// data for the accurate person
String name = normalization(
p1.isAccurate() ? p1.getNormalisedFirstName() : p2.getNormalisedFirstName());
String surname = normalization(
p1.isAccurate() ? p1.getNormalisedSurname() : p2.getNormalisedSurname());
// data for the inaccurate person
String fullname = normalization(
p1.isAccurate()
? ((p2.getNormalisedFullname().isEmpty()) ? p2.getOriginal() : p2.getNormalisedFullname())
: (p1.getNormalisedFullname().isEmpty() ? p1.getOriginal() : p1.getNormalisedFullname()));
if (fullname.contains(surname)) {
if (MODE.equals("full")) {
if (fullname.contains(name)) {
common += 1;
break;
}
} else { // MODE equals "surname"
common += 1;
break;
}
}
}
// both persons are accurate
if (p1.isAccurate() && p2.isAccurate()) {
if (compareSurname(p1, p2)) {
if (MODE.equals("full")) {
if (compareFirstname(p1, p2)) {
common += 1;
break;
}
} else { // MODE equals "surname"
common += 1;
break;
}
}
}
}
// normalization factor to compute the score
int normFactor = aList.size() == bList.size() ? aList.size() : (aList.size() + bList.size() - common);
if (TYPE.equals("percentage")) {
return (double) common / normFactor;
} else {
return (double) common;
}
}
public boolean compareSurname(Person p1, Person p2) {
return ssalgo
.score(
normalization(p1.getNormalisedSurname()), normalization(p2.getNormalisedSurname())) > SURNAME_THRESHOLD;
}
public boolean compareFirstname(Person p1, Person p2) {
if (p1.getNormalisedFirstName().length() <= 2 || p2.getNormalisedFirstName().length() <= 2) {
if (firstLC(p1.getNormalisedFirstName()).equals(firstLC(p2.getNormalisedFirstName())))
return true;
}
return ssalgo
.score(
normalization(p1.getNormalisedFirstName()),
normalization(p2.getNormalisedFirstName())) > NAME_THRESHOLD;
}
public String normalization(String s) {
return normalize(utf8(cleanup(s)));
}
}
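In "percentage" mode the score divides the matched pairs by a normalization factor that discounts the overlap only when the list sizes differ; a worked example with 2 matches across lists of 3 and 4 authors:

int sizeA = 3, sizeB = 4, common = 2;
int normFactor = (sizeA == sizeB) ? sizeA : (sizeA + sizeB - common); // 5
double score = (double) common / normFactor;                          // 0.4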

View File

@@ -0,0 +1,48 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import java.util.Set;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("cityMatch")
public class CityMatch extends AbstractStringComparator {
private Map<String, String> params;
public CityMatch(Map<String, String> params) {
super(params);
this.params = params;
}
@Override
public double distance(final String a, final String b, final Config conf) {
String ca = cleanup(a);
String cb = cleanup(b);
ca = normalize(ca);
cb = normalize(cb);
ca = filterAllStopWords(ca);
cb = filterAllStopWords(cb);
Set<String> cities1 = getCities(ca, Integer.parseInt(params.getOrDefault("windowSize", "4")));
Set<String> cities2 = getCities(cb, Integer.parseInt(params.getOrDefault("windowSize", "4")));
Set<String> codes1 = citiesToCodes(cities1);
Set<String> codes2 = citiesToCodes(cities2);
// if no cities are detected, the comparator gives 1.0
if (codes1.isEmpty() && codes2.isEmpty())
return 1.0;
else {
if (codes1.isEmpty() ^ codes2.isEmpty())
return -1; // undefined if one of the two has no cities
return commonElementsPercentage(codes1, codes2);
}
}
}

View File

@@ -0,0 +1,47 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("cosineSimilarity")
public class CosineSimilarity extends AbstractComparator<double[]> {
Map<String, String> params;
public CosineSimilarity(Map<String, String> params) {
super(params);
}
@Override
public double compare(Object a, Object b, Config config) {
return compare((double[]) a, (double[]) b, config);
}
public double compare(final double[] a, final double[] b, final Config conf) {
if (a.length == 0 || b.length == 0)
return -1;
return cosineSimilarity(a, b);
}
double cosineSimilarity(double[] a, double[] b) {
double dotProduct = 0;
double normASum = 0;
double normBSum = 0;
for (int i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normASum += a[i] * a[i];
normBSum += b[i] * b[i];
}
// product of the two Euclidean norms (the cosine denominator)
double normProduct = Math.sqrt(normASum) * Math.sqrt(normBSum);
return dotProduct / normProduct;
}
}
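A worked instance of the formula above: for a = (1, 0) and b = (1, 1) the dot product is 1 and the norms are 1 and √2, so the similarity is ≈ 0.7071:

double[] a = {1, 0};
double[] b = {1, 1};
double dot = 0, na = 0, nb = 0;
for (int i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
}
double cos = dot / (Math.sqrt(na) * Math.sqrt(nb)); // 0.7071...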

View File

@@ -0,0 +1,27 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import eu.dnetlib.pace.tree.support.ComparatorClass;
/**
* The Class DoiExactMatch.
*
* @author claudio
*/
@ComparatorClass("doiExactMatch")
public class DoiExactMatch extends ExactMatchIgnoreCase {
public final String PREFIX = "(http:\\/\\/dx\\.doi\\.org\\/)|(doi:)";
public DoiExactMatch(final Map<String, String> params) {
super(params);
}
@Override
protected String toString(final Object f) {
return super.toString(f).replaceAll(PREFIX, "");
}
}

View File

@@ -0,0 +1,30 @@
package eu.dnetlib.pace.tree;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Map;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("domainExactMatch")
public class DomainExactMatch extends ExactMatchIgnoreCase {
public DomainExactMatch(final Map<String, String> params) {
super(params);
}
@Override
protected String toString(final Object f) {
try {
return asUrl(super.toString(f)).getHost();
} catch (MalformedURLException e) {
return "";
}
}
private URL asUrl(final String value) throws MalformedURLException {
return new URL(value);
}
}

View File

@@ -0,0 +1,44 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("exactMatch")
public class ExactMatch extends AbstractStringComparator {
public ExactMatch(Map<String, String> params) {
super(params, new com.wcohen.ss.JaroWinkler());
}
public ExactMatch(final double weight) {
super(weight, new com.wcohen.ss.JaroWinkler());
}
protected ExactMatch(final double weight, final AbstractStringDistance ssalgo) {
super(weight, ssalgo);
}
@Override
public double distance(final String a, final String b, final Config conf) {
if (a.isEmpty() || b.isEmpty()) {
return -1.0; // return -1 if a field is missing
}
return a.equals(b) ? 1.0 : 0;
}
@Override
public double getWeight() {
return super.weight;
}
@Override
protected double normalize(final double d) {
return d;
}
}

View File

@@ -0,0 +1,29 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("exactMatchIgnoreCase")
public class ExactMatchIgnoreCase extends AbstractStringComparator {
public ExactMatchIgnoreCase(Map<String, String> params) {
super(params);
}
@Override
public double compare(String a, String b, final Config conf) {
if (a.isEmpty() || b.isEmpty())
return -1;
return a.equalsIgnoreCase(b) ? 1 : 0;
}
protected String toString(final Object object) {
return toFirstString(object);
}
}

View File

@@ -0,0 +1,80 @@
package eu.dnetlib.pace.tree;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import com.google.common.collect.Sets;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractListComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("instanceTypeMatch")
public class InstanceTypeMatch extends AbstractListComparator {
final Map<String, String> translationMap = new HashMap<>();
public InstanceTypeMatch(Map<String, String> params) {
super(params);
// jolly types
translationMap.put("Conference object", "*");
translationMap.put("Other literature type", "*");
translationMap.put("Unknown", "*");
// article types
translationMap.put("Article", "Article");
translationMap.put("Data Paper", "Article");
translationMap.put("Software Paper", "Article");
translationMap.put("Preprint", "Article");
// thesis types
translationMap.put("Thesis", "Thesis");
translationMap.put("Master thesis", "Thesis");
translationMap.put("Bachelor thesis", "Thesis");
translationMap.put("Doctoral thesis", "Thesis");
}
@Override
public double compare(final List<String> a, final List<String> b, final Config conf) {
if (a == null || b == null) {
return -1;
}
if (a.isEmpty() || b.isEmpty()) {
return -1;
}
final Set<String> ca = a.stream().map(this::translate).collect(Collectors.toSet());
final Set<String> cb = b.stream().map(this::translate).collect(Collectors.toSet());
// if at least one is a jolly type, it must produce a match
if (ca.contains("*") || cb.contains("*"))
return 1.0;
int incommon = Sets.intersection(ca, cb).size();
// if at least one is in common, it must produce a match
return incommon >= 1 ? 1 : 0;
}
public String translate(String term) {
return translationMap.getOrDefault(term, term);
}
@Override
public double getWeight() {
return super.weight;
}
@Override
protected double normalize(final double d) {
return d;
}
}
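A brief sketch of the translation behaviour; the Config argument is not used by this compare, so null is passed purely for illustration:

import java.util.Arrays;
import java.util.HashMap;

InstanceTypeMatch m = new InstanceTypeMatch(new HashMap<>());
double s1 = m.compare(Arrays.asList("Preprint"), Arrays.asList("Article"), null); // 1.0: both map to "Article"
double s2 = m.compare(Arrays.asList("Unknown"), Arrays.asList("Thesis"), null);   // 1.0: "Unknown" is a jolly type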

View File

@@ -0,0 +1,46 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
//case class JaroWinkler(w: Double) extends SecondStringDistanceAlgo(w, new com.wcohen.ss.JaroWinkler())
@ComparatorClass("jaroWinkler")
public class JaroWinkler extends AbstractStringComparator {
public JaroWinkler(Map<String, String> params) {
super(params, new com.wcohen.ss.JaroWinkler());
}
public JaroWinkler(double weight) {
super(weight, new com.wcohen.ss.JaroWinkler());
}
protected JaroWinkler(double weight, AbstractStringDistance ssalgo) {
super(weight, ssalgo);
}
@Override
public double distance(String a, String b, final Config conf) {
String ca = cleanup(a);
String cb = cleanup(b);
return normalize(ssalgo.score(ca, cb));
}
@Override
public double getWeight() {
return super.weight;
}
@Override
protected double normalize(double d) {
return d;
}
}

View File

@@ -0,0 +1,74 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import java.util.Set;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("jaroWinklerNormalizedName")
public class JaroWinklerNormalizedName extends AbstractStringComparator {
private Map<String, String> params;
public JaroWinklerNormalizedName(Map<String, String> params) {
super(params, new com.wcohen.ss.JaroWinkler());
this.params = params;
}
public JaroWinklerNormalizedName(double weight) {
super(weight, new com.wcohen.ss.JaroWinkler());
}
protected JaroWinklerNormalizedName(double weight, AbstractStringDistance ssalgo) {
super(weight, ssalgo);
}
@Override
public double distance(String a, String b, final Config conf) {
String ca = cleanup(a);
String cb = cleanup(b);
ca = normalize(ca);
cb = normalize(cb);
ca = filterAllStopWords(ca);
cb = filterAllStopWords(cb);
Set<String> keywords1 = getKeywords(
ca, conf.translationMap(), Integer.parseInt(params.getOrDefault("windowSize", "4")));
Set<String> keywords2 = getKeywords(
cb, conf.translationMap(), Integer.parseInt(params.getOrDefault("windowSize", "4")));
Set<String> cities1 = getCities(ca, Integer.parseInt(params.getOrDefault("windowSize", "4")));
Set<String> cities2 = getCities(cb, Integer.parseInt(params.getOrDefault("windowSize", "4")));
ca = removeKeywords(ca, keywords1);
ca = removeKeywords(ca, cities1);
cb = removeKeywords(cb, keywords2);
cb = removeKeywords(cb, cities2);
ca = ca.replaceAll("[ ]{2,}", " ");
cb = cb.replaceAll("[ ]{2,}", " ");
if (ca.isEmpty() && cb.isEmpty())
return 1.0;
else
return normalize(ssalgo.score(ca, cb));
}
@Override
public double getWeight() {
return super.weight;
}
@Override
protected double normalize(double d) {
return d;
}
}

View File

@@ -0,0 +1,47 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
//case class JaroWinkler(w: Double) extends SecondStringDistanceAlgo(w, new com.wcohen.ss.JaroWinkler())
@ComparatorClass("jaroWinklerTitle")
public class JaroWinklerTitle extends AbstractStringComparator {
public JaroWinklerTitle(Map<String, String> params) {
super(params, new com.wcohen.ss.JaroWinkler());
}
public JaroWinklerTitle(double weight) {
super(weight, new com.wcohen.ss.JaroWinkler());
}
protected JaroWinklerTitle(double weight, AbstractStringDistance ssalgo) {
super(weight, ssalgo);
}
@Override
public double distance(String a, String b, final Config conf) {
String ca = cleanup(a);
String cb = cleanup(b);
boolean check = checkNumbers(ca, cb);
return check ? 0.5 : normalize(ssalgo.score(ca, cb));
}
@Override
public double getWeight() {
return super.weight;
}
@Override
protected double normalize(double d) {
return d;
}
}

View File

@@ -0,0 +1,82 @@
package eu.dnetlib.pace.tree;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import com.google.common.collect.Sets;
import com.jayway.jsonpath.Configuration;
import com.jayway.jsonpath.DocumentContext;
import com.jayway.jsonpath.JsonPath;
import com.jayway.jsonpath.Option;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractListComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
import eu.dnetlib.pace.util.MapDocumentUtil;
@ComparatorClass("jsonListMatch")
public class JsonListMatch extends AbstractListComparator {
private static final Log log = LogFactory.getLog(JsonListMatch.class);
private Map<String, String> params;
private String MODE; // "percentage" or "count"
public JsonListMatch(final Map<String, String> params) {
super(params);
this.params = params;
MODE = params.getOrDefault("mode", "percentage");
}
@Override
public double compare(final List<String> sa, final List<String> sb, final Config conf) {
if (sa.isEmpty() || sb.isEmpty()) {
return -1;
}
final Set<String> ca = sa.stream().map(this::toComparableString).collect(Collectors.toSet());
final Set<String> cb = sb.stream().map(this::toComparableString).collect(Collectors.toSet());
int incommon = Sets.intersection(ca, cb).size();
int simDiff = Sets.symmetricDifference(ca, cb).size();
if (incommon + simDiff == 0) {
return 0.0;
}
if (MODE.equals("percentage"))
return (double) incommon / (incommon + simDiff);
else
return incommon;
}
// converts every json into a comparable string, based on the configured jpath parameters
private String toComparableString(String json) {
StringBuilder st = new StringBuilder(); // builds the string used for the comparison from the jpath values
final DocumentContext documentContext = JsonPath
.using(Configuration.defaultConfiguration().addOptions(Option.SUPPRESS_EXCEPTIONS))
.parse(json);
// for each path in the param list
for (String key : params.keySet().stream().filter(k -> k.contains("jpath")).collect(Collectors.toList())) {
String path = params.get(key);
String value = MapDocumentUtil.getJPathString(path, documentContext);
if (value == null || value.isEmpty())
value = "";
st.append(value);
st.append("::");
}
st.setLength(st.length() - 2);
return st.toString();
}
}
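A sketch of how the jpath parameters drive the comparison: any parameter whose key contains "jpath" is picked up, and the keys and paths below are illustrative:

import java.util.HashMap;
import java.util.Map;

Map<String, String> params = new HashMap<>();
params.put("jpath_classid", "$.qualifier.classid");
params.put("jpath_value", "$.value");
JsonListMatch m = new JsonListMatch(params);
// each json element is reduced to "<classid>::<value>" before the set comparison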

View File

@@ -0,0 +1,50 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import java.util.Set;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("keywordMatch")
public class KeywordMatch extends AbstractStringComparator {
Map<String, String> params;
public KeywordMatch(Map<String, String> params) {
super(params);
this.params = params;
}
@Override
public double distance(final String a, final String b, final Config conf) {
String ca = cleanup(a);
String cb = cleanup(b);
ca = normalize(ca);
cb = normalize(cb);
ca = filterAllStopWords(ca);
cb = filterAllStopWords(cb);
Set<String> keywords1 = getKeywords(
ca, conf.translationMap(), Integer.parseInt(params.getOrDefault("windowSize", "4")));
Set<String> keywords2 = getKeywords(
cb, conf.translationMap(), Integer.parseInt(params.getOrDefault("windowSize", "4")));
Set<String> codes1 = toCodes(keywords1, conf.translationMap());
Set<String> codes2 = toCodes(keywords2, conf.translationMap());
// if no keywords are detected, the comparator gives 1.0
if (codes1.isEmpty() && codes2.isEmpty())
return 1.0;
else {
if (codes1.isEmpty() ^ codes2.isEmpty())
return -1.0; // undefined if one of the two has no keywords
return commonElementsPercentage(codes1, codes2);
}
}
}

View File

@@ -0,0 +1,36 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("level2JaroWinkler")
public class Level2JaroWinkler extends AbstractStringComparator {
public Level2JaroWinkler(Map<String, String> params) {
super(params, new com.wcohen.ss.Level2JaroWinkler());
}
public Level2JaroWinkler(double w) {
super(w, new com.wcohen.ss.Level2JaroWinkler());
}
protected Level2JaroWinkler(double w, AbstractStringDistance ssalgo) {
super(w, ssalgo);
}
@Override
public double getWeight() {
return super.weight;
}
@Override
protected double normalize(double d) {
return d;
}
}

View File

@@ -0,0 +1,50 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("level2JaroWinklerTitle")
public class Level2JaroWinklerTitle extends AbstractStringComparator {
public Level2JaroWinklerTitle(Map<String, String> params) {
super(params, new com.wcohen.ss.Level2JaroWinkler());
}
public Level2JaroWinklerTitle(final double w) {
super(w, new com.wcohen.ss.Level2JaroWinkler());
}
protected Level2JaroWinklerTitle(final double w, final AbstractStringDistance ssalgo) {
super(w, ssalgo);
}
@Override
public double distance(final String a, final String b, final Config conf) {
final String ca = cleanup(a);
final String cb = cleanup(b);
final boolean check = checkNumbers(ca, cb);
if (check)
return 0.5;
return ssalgo.score(ca, cb);
}
@Override
public double getWeight() {
return super.weight;
}
@Override
protected double normalize(final double d) {
return d;
}
}

View File

@@ -0,0 +1,36 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("level2Levenstein")
public class Level2Levenstein extends AbstractStringComparator {
public Level2Levenstein(Map<String, String> params) {
super(params, new com.wcohen.ss.Level2Levenstein());
}
public Level2Levenstein(double w) {
super(w, new com.wcohen.ss.Level2Levenstein());
}
protected Level2Levenstein(double w, AbstractStringDistance ssalgo) {
super(w, ssalgo);
}
@Override
public double getWeight() {
return super.weight;
}
@Override
protected double normalize(double d) {
return 1 / Math.pow(Math.abs(d) + 1, 0.1);
}
}

View File

@@ -0,0 +1,36 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("levenstein")
public class Levenstein extends AbstractStringComparator {
public Levenstein(Map<String, String> params) {
super(params, new com.wcohen.ss.Levenstein());
}
public Levenstein(double w) {
super(w, new com.wcohen.ss.Levenstein());
}
protected Levenstein(double w, AbstractStringDistance ssalgo) {
super(w, ssalgo);
}
@Override
public double getWeight() {
return super.weight;
}
@Override
protected double normalize(double d) {
return 1 / Math.pow(Math.abs(d) + 1, 0.1);
}
}

View File

@@ -0,0 +1,59 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("levensteinTitle")
public class LevensteinTitle extends AbstractStringComparator {
private static final Log log = LogFactory.getLog(LevensteinTitle.class);
public LevensteinTitle(Map<String, String> params) {
super(params, new com.wcohen.ss.Levenstein());
}
public LevensteinTitle(final double w) {
super(w, new com.wcohen.ss.Levenstein());
}
protected LevensteinTitle(final double w, final AbstractStringDistance ssalgo) {
super(w, ssalgo);
}
@Override
public double distance(final String a, final String b, final Config conf) {
final String ca = cleanup(a);
final String cb = cleanup(b);
final boolean check = checkNumbers(ca, cb);
if (check)
return 0.5;
return normalize(ssalgo.score(ca, cb), ca.length(), cb.length());
}
private double normalize(final double score, final int la, final int lb) {
return 1 - (Math.abs(score) / Math.max(la, lb));
}
@Override
public double getWeight() {
return super.weight;
}
@Override
protected double normalize(final double d) {
return 1 / Math.pow(Math.abs(d) + 1, 0.1);
}
}
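The secondstring score is a signed edit distance, so normalize maps it onto [0, 1] relative to the longer title; e.g. 3 edits between titles of length 10 and 12:

double score = -3;    // 3 edits (Math.abs neutralizes the sign convention)
int la = 10, lb = 12;
double sim = 1 - (Math.abs(score) / Math.max(la, lb)); // 0.75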

View File

@@ -0,0 +1,58 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
/**
* Compares two titles after stripping digits and roman numerals (version markers). Suitable for Software entities.
*/
@ComparatorClass("levensteinTitleIgnoreVersion")
public class LevensteinTitleIgnoreVersion extends AbstractStringComparator {
public LevensteinTitleIgnoreVersion(Map<String, String> params) {
super(params, new com.wcohen.ss.Levenstein());
}
public LevensteinTitleIgnoreVersion(final double w) {
super(w, new com.wcohen.ss.Levenstein());
}
protected LevensteinTitleIgnoreVersion(final double w, final AbstractStringDistance ssalgo) {
super(w, ssalgo);
}
@Override
public double distance(final String a, final String b, final Config conf) {
String ca = cleanup(a);
String cb = cleanup(b);
ca = ca.replaceAll("\\d", "").replaceAll(getRomans(ca), "").trim();
cb = cb.replaceAll("\\d", "").replaceAll(getRomans(cb), "").trim();
ca = filterAllStopWords(ca);
cb = filterAllStopWords(cb);
return normalize(ssalgo.score(ca, cb), ca.length(), cb.length());
}
private double normalize(final double score, final int la, final int lb) {
return 1 - (Math.abs(score) / Math.max(la, lb));
}
@Override
public double getWeight() {
return super.weight;
}
@Override
protected double normalize(final double d) {
return 1 / Math.pow(Math.abs(d) + 1, 0.1);
}
}

View File

@@ -0,0 +1,66 @@
package eu.dnetlib.pace.tree;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractListComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
/**
* The Class ListContainsMatch.
*
* @author miconis
*/
@ComparatorClass("listContainsMatch")
public class ListContainsMatch extends AbstractListComparator {
private Map<String, String> params;
private boolean CASE_SENSITIVE;
private String STRING;
private String AGGREGATOR;
public ListContainsMatch(Map<String, String> params) {
super(params);
this.params = params;
// read parameters
CASE_SENSITIVE = Boolean.parseBoolean(params.getOrDefault("caseSensitive", "false"));
STRING = params.get("string");
AGGREGATOR = params.get("bool");
}
@Override
public double compare(List<String> sa, List<String> sb, Config conf) {
if (sa.isEmpty() || sb.isEmpty()) {
return -1;
}
if (!CASE_SENSITIVE) {
sa = sa.stream().map(String::toLowerCase).collect(Collectors.toList());
sb = sb.stream().map(String::toLowerCase).collect(Collectors.toList());
STRING = STRING.toLowerCase();
}
switch (AGGREGATOR) {
case "AND":
if (sa.contains(STRING) && sb.contains(STRING))
return 1.0;
break;
case "OR":
if (sa.contains(STRING) || sb.contains(STRING))
return 1.0;
break;
case "XOR":
if (sa.contains(STRING) ^ sb.contains(STRING))
return 1.0;
break;
default:
return 0.0;
}
return 0.0;
}
}
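A usage sketch; note that the aggregator is read from the "bool" parameter here (StringContainsMatch below reads "aggregator" instead), that List.contains is an exact element match rather than a substring test, and that the Config argument is unused:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

Map<String, String> params = new HashMap<>();
params.put("string", "orcid");
params.put("bool", "XOR");
params.put("caseSensitive", "false");
ListContainsMatch m = new ListContainsMatch(params);
double r = m.compare(Arrays.asList("orcid"), Arrays.asList("mag"), null); // 1.0: exactly one side contains it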

View File

@@ -0,0 +1,42 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("mustBeDifferent")
public class MustBeDifferent extends AbstractStringComparator {
public MustBeDifferent(Map<String, String> params) {
super(params, new com.wcohen.ss.Levenstein());
}
public MustBeDifferent(final double weight) {
super(weight, new com.wcohen.ss.JaroWinkler());
}
protected MustBeDifferent(final double weight, final AbstractStringDistance ssalgo) {
super(weight, ssalgo);
}
@Override
public double distance(final String a, final String b, final Config conf) {
return !a.equals(b) ? 1.0 : 0;
}
@Override
public double getWeight() {
return super.weight;
}
@Override
protected double normalize(final double d) {
return d;
}
}

View File

@@ -0,0 +1,24 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.Comparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
/**
* Not all fields of a document need to participate in the comparison. We model those fields as having a
* NullDistanceAlgo.
*/
@ComparatorClass("null")
public class NullDistanceAlgo<T> implements Comparator<T> {
public NullDistanceAlgo(Map<String, String> params) {
}
@Override
public double compare(Object a, Object b, Config config) {
return 0;
}
}

View File

@@ -0,0 +1,35 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("numbersComparator")
public class NumbersComparator extends AbstractStringComparator {
Map<String, String> params;
public NumbersComparator(Map<String, String> params) {
super(params);
this.params = params;
}
@Override
public double distance(String a, String b, Config conf) {
// extracts numbers from the field
String numbers1 = getNumbers(nfd(a));
String numbers2 = getNumbers(nfd(b));
if (numbers1.isEmpty() || numbers2.isEmpty())
return -1.0;
int n1 = Integer.parseInt(numbers1);
int n2 = Integer.parseInt(numbers2);
return Math.abs(n1 - n2);
}
}
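The inherited getNumbers/nfd helpers are not shown in this diff; as a plain-Java stand-in for the idea, strip everything but digits and compare the resulting integers:

String a = "Volume 12";
String b = "Volume 13";
String n1 = a.replaceAll("\\D+", ""); // "12" (stand-in for getNumbers(nfd(a)))
String n2 = b.replaceAll("\\D+", ""); // "13"
double dist = Math.abs(Integer.parseInt(n1) - Integer.parseInt(n2)); // 1.0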

View File

@@ -0,0 +1,35 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("numbersMatch")
public class NumbersMatch extends AbstractStringComparator {
public NumbersMatch(Map<String, String> params) {
super(params);
}
@Override
public double distance(String a, String b, Config conf) {
// extracts numbers from the field
String numbers1 = getNumbers(nfd(a));
String numbers2 = getNumbers(nfd(b));
if (numbers1.isEmpty() && numbers2.isEmpty())
return 1.0;
if (numbers1.isEmpty() || numbers2.isEmpty())
return -1.0;
if (numbers1.equals(numbers2))
return 1.0;
return 0.0;
}
}

View File

@@ -0,0 +1,35 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("romansMatch")
public class RomansMatch extends AbstractStringComparator {
public RomansMatch(Map<String, String> params) {
super(params);
}
@Override
public double distance(String a, String b, Config conf) {
// extracts romans from the field
String romans1 = getRomans(nfd(a));
String romans2 = getRomans(nfd(b));
if (romans1.isEmpty() && romans2.isEmpty())
return 1.0;
if (romans1.isEmpty() || romans2.isEmpty())
return -1.0;
if (romans1.equals(romans2))
return 1.0;
return 0.0;
}
}

View File

@@ -0,0 +1,38 @@
package eu.dnetlib.pace.tree;
import java.util.List;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractListComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
/**
* Returns 1.0 if the two lists contain the same number of values, 0.0 otherwise, -1.0 when either list is empty.
*
* @author claudio
*/
@ComparatorClass("sizeMatch")
public class SizeMatch extends AbstractListComparator {
/**
* Instantiates a new size match.
*
* @param params
* the parameters
*/
public SizeMatch(final Map<String, String> params) {
super(params);
}
@Override
public double compare(final List<String> a, final List<String> b, final Config conf) {
if (a.isEmpty() || b.isEmpty())
return -1.0;
return a.size() == b.size() ? 1.0 : 0.0;
}
}

View File

@@ -0,0 +1,61 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.tree.support.AbstractSortedComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
/**
* The Class SortedJaroWinkler.
*/
@ComparatorClass("sortedJaroWinkler")
public class SortedJaroWinkler extends AbstractSortedComparator {
public SortedJaroWinkler(Map<String, String> params) {
super(params, new com.wcohen.ss.JaroWinkler());
}
/**
* Instantiates a new sorted jaro winkler.
*
* @param weight
* the weight
*/
public SortedJaroWinkler(final double weight) {
super(weight, new com.wcohen.ss.JaroWinkler());
}
/**
* Instantiates a new sorted jaro winkler.
*
* @param weight
* the weight
* @param ssalgo
* the ssalgo
*/
protected SortedJaroWinkler(final double weight, final AbstractStringDistance ssalgo) {
super(weight, ssalgo);
}
/*
* (non-Javadoc)
* @see eu.dnetlib.pace.compare.DistanceAlgo#getWeight()
*/
@Override
public double getWeight() {
return super.weight;
}
/*
* (non-Javadoc)
* @see eu.dnetlib.pace.compare.SecondStringDistanceAlgo#normalize(double)
*/
@Override
protected double normalize(final double d) {
return d;
}
}

View File

@@ -0,0 +1,61 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.tree.support.AbstractSortedComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
/**
* The Class SortedLevel2JaroWinkler.
*/
@ComparatorClass("sortedLevel2JaroWinkler")
public class SortedLevel2JaroWinkler extends AbstractSortedComparator {
/**
* Instantiates a new sorted level 2 jaro winkler.
*
* @param weight
* the weight
*/
public SortedLevel2JaroWinkler(final double weight) {
super(weight, new com.wcohen.ss.Level2JaroWinkler());
}
public SortedLevel2JaroWinkler(final Map<String, String> params) {
super(params, new com.wcohen.ss.Level2JaroWinkler());
}
/**
* Instantiates a new sorted level 2 jaro winkler.
*
* @param weight
* the weight
* @param ssalgo
* the ssalgo
*/
protected SortedLevel2JaroWinkler(final double weight, final AbstractStringDistance ssalgo) {
super(weight, ssalgo);
}
/*
* (non-Javadoc)
* @see eu.dnetlib.pace.compare.DistanceAlgo#getWeight()
*/
@Override
public double getWeight() {
return super.weight;
}
/*
* (non-Javadoc)
* @see eu.dnetlib.pace.compare.SecondStringDistanceAlgo#normalize(double)
*/
@Override
protected double normalize(final double d) {
return d;
}
}

View File

@@ -0,0 +1,67 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
/**
* The Class StringContainsMatch.
*
* @author miconis
*/
@ComparatorClass("stringContainsMatch")
public class StringContainsMatch extends AbstractStringComparator {
private Map<String, String> params;
private boolean CASE_SENSITIVE;
private String STRING;
private String AGGREGATOR;
public StringContainsMatch(Map<String, String> params) {
super(params);
this.params = params;
// read parameters
CASE_SENSITIVE = Boolean.parseBoolean(params.getOrDefault("caseSensitive", "false"));
STRING = params.get("string");
AGGREGATOR = params.get("aggregator");
}
@Override
public double distance(final String a, final String b, final Config conf) {
String ca = a;
String cb = b;
if (!CASE_SENSITIVE) {
ca = a.toLowerCase();
cb = b.toLowerCase();
STRING = STRING.toLowerCase();
}
if (AGGREGATOR != null) {
switch (AGGREGATOR) {
case "AND":
if (ca.contains(STRING) && cb.contains(STRING))
return 1.0;
break;
case "OR":
if (ca.contains(STRING) || cb.contains(STRING))
return 1.0;
break;
case "XOR":
if (ca.contains(STRING) ^ cb.contains(STRING))
return 1.0;
break;
default:
return 0.0;
}
}
return 0.0;
}
}

View File

@@ -0,0 +1,56 @@
package eu.dnetlib.pace.tree;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import com.google.common.collect.Sets;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractListComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("stringListMatch")
public class StringListMatch extends AbstractListComparator {
private static final Log log = LogFactory.getLog(StringListMatch.class);
private Map<String, String> params;
final private String TYPE; // percentage or count
public StringListMatch(final Map<String, String> params) {
super(params);
this.params = params;
TYPE = params.getOrDefault("type", "percentage");
}
@Override
public double compare(final List<String> a, final List<String> b, final Config conf) {
final Set<String> pa = new HashSet<>(a);
final Set<String> pb = new HashSet<>(b);
if (pa.isEmpty() || pb.isEmpty()) {
return -1; // return undefined if one of the two lists is empty
}
int incommon = Sets.intersection(pa, pb).size();
int simDiff = Sets.symmetricDifference(pa, pb).size();
if (incommon + simDiff == 0) {
return 0.0;
}
if (TYPE.equals("percentage"))
return (double) incommon / (incommon + simDiff);
else
return incommon;
}
}
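The percentage mode is the Jaccard index expressed through Guava set operations; a worked example:

import com.google.common.collect.Sets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

Set<String> pa = new HashSet<>(Arrays.asList("a", "b", "c"));
Set<String> pb = new HashSet<>(Arrays.asList("b", "c", "d"));
int incommon = Sets.intersection(pa, pb).size();         // 2
int simDiff = Sets.symmetricDifference(pa, pb).size();   // 2
double score = (double) incommon / (incommon + simDiff); // 0.5, i.e. |A ∩ B| / |A ∪ B|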

View File

@@ -0,0 +1,90 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import org.apache.commons.lang3.StringUtils;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
/**
* The Class SubStringLevenstein.
*/
@ComparatorClass("subStringLevenstein")
public class SubStringLevenstein extends AbstractStringComparator {
/**
* The maximum number of leading characters to compare.
*/
protected int limit;
/**
* Instantiates a new sub string levenstein.
*
* @param w the weight
*/
public SubStringLevenstein(final double w) {
super(w, new com.wcohen.ss.Levenstein());
}
public SubStringLevenstein(Map<String, String> params) {
super(params, new com.wcohen.ss.Levenstein());
this.limit = Integer.parseInt(params.getOrDefault("limit", "1"));
}
/**
* Instantiates a new sub string levenstein.
*
* @param w the weight
* @param limit the limit
*/
public SubStringLevenstein(final double w, final int limit) {
super(w, new com.wcohen.ss.Levenstein());
this.limit = limit;
}
/**
* Instantiates a new sub string levenstein.
*
* @param w the weight
* @param limit the limit
* @param ssalgo the ssalgo
*/
protected SubStringLevenstein(final double w, final int limit, final AbstractStringDistance ssalgo) {
super(w, ssalgo);
this.limit = limit;
}
/*
* (non-Javadoc)
* @see eu.dnetlib.pace.compare.SecondStringDistanceAlgo#compare(eu.dnetlib.pace.model.Field,
* eu.dnetlib.pace.model.Field)
*/
@Override
public double distance(final String a, final String b, final Config conf) {
return distance(StringUtils.left(a, limit), StringUtils.left(b, limit), conf);
}
/*
* (non-Javadoc)
* @see eu.dnetlib.pace.compare.DistanceAlgo#getWeight()
*/
@Override
public double getWeight() {
return super.weight;
}
/*
* (non-Javadoc)
* @see eu.dnetlib.pace.compare.SecondStringDistanceAlgo#normalize(double)
*/
@Override
protected double normalize(final double d) {
return 1 / Math.pow(Math.abs(d) + 1, 0.1);
}
}
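The truncation rests on commons-lang3's null-safe StringUtils.left; for example, with limit = 7 only the leading characters reach the Levenstein pass:

import org.apache.commons.lang3.StringUtils;

String left = StringUtils.left("10.1234/abcd", 7); // "10.1234"
String none = StringUtils.left(null, 7);           // null, no exception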

Some files were not shown because too many files have changed in this diff.