Compare commits

...

154 Commits

Author SHA1 Message Date
Michele Artini 4752d60421 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2024-05-29 14:23:14 +02:00
Claudio Atzori 795e1b2629 Merge pull request '[graph indexing] sets spark memoryOverhead in the join operations to the same value used for the memory executor' (#426) from provision_memoryOverhead into master
Reviewed-on: #426
2024-04-19 16:59:45 +02:00
Claudio Atzori 0c05abe50b [graph indexing] sets spark memoryOverhead in the join operations to the same value used for the memory executor 2024-04-19 16:57:55 +02:00
Claudio Atzori 8fdd0244ad Merge pull request 'Various fixes for the stats DB update workflow, step16-createIndicatorsTables.sql' (#425) from stats_step16_fix into master
Reviewed-on: #425
2024-04-18 11:25:24 +02:00
Claudio Atzori 18fdaaf548 integrating suggestion from #9699 to improve the result_country table construction 2024-04-18 11:23:43 +02:00
Claudio Atzori 43e123c624 added column alias 2024-04-17 16:40:29 +02:00
Claudio Atzori 62a07b7add added missing end of statement /*EOS*/ 2024-04-17 15:13:28 +02:00
Claudio Atzori 96bddcc921 revised query implementation for indi_pub_gold_oa 2024-04-17 15:06:50 +02:00
Miriam Baglioni 0486cea4c4 removed the funder id : 100011062 Asian Spinal Cord Network, wrongly associated to Ireland 2024-04-16 15:36:40 +02:00
Claudio Atzori 013935c593 Merge pull request 'Improvements to copying data from ocean to impala' (#420) from antonis.lempesis/dnet-hadoop:beta into master
Reviewed-on: #420
2024-04-16 14:17:47 +02:00
Lampros Smyrnaios d7da4f814b Minor updates to the copying operation to Impala Cluster:
- Improve logging.
- Code optimization/polishing.
2024-04-12 18:12:06 +03:00
Lampros Smyrnaios 14719dcd62 Miscellaneous updates to the copying operation to Impala Cluster:
- Update the algorithm for creating views that depend on other views.
- Add check for successful execution of the "hadoop distcp" command.
- Add a check for successful copy operation of all entities.
- Upon facing an error in a DB, exit the method, instead of the whole script.
- Improve logging.
- Code polishing.
2024-04-12 15:36:13 +03:00
Lampros Smyrnaios 22745027c8 Use the "HADOOP_USER_NAME" value from the "workflow-property", in "copyDataToImpalaCluster.sh", in "stats-monitor-updates". 2024-04-11 17:46:33 +03:00
Lampros Smyrnaios abf0b69f29 Upgrade the copying operation to Impala Cluster:
- Use only hive commands in the Ocean Cluster, as the "impala-shell" will be removed from there to free-up resources.
- Hugely improve the performance in every aspect of the copying process: a) speedup file-transferring and DB-deletion, b) eliminate permissions-assignment, "load" operations and "use $db" queries, c) retry only the "create view" statements and only as long as they depend on other non-created views, instead of trying to recreate all tables and views 5 consecutive times.
- Add error-checks for the creation of tables and views.
2024-04-11 17:12:12 +03:00
Claudio Atzori 6132bd028e Merge pull request 'Extend Crossref-funders mapping and datacite hostedbymap' (#417) from CrossrefFundersMap into master
Reviewed-on: #417
2024-04-09 10:30:53 +02:00
Miriam Baglioni 519db1ddef Extended mapping of funder from crossref (#9169, #9277) and change the correspondece files for the irish fundrs (#9635). Extended the datacite map to include the association between metadata and the EBRAINS datasource (SciLake) 2024-04-09 09:33:09 +02:00
Claudio Atzori 5add51f38c Merge pull request 'fixed the result_country definition and updated the stats DB copy procedure' (#412) from antonis.lempesis/dnet-hadoop:beta into master
Reviewed-on: #412
2024-04-03 12:34:17 +02:00
Lampros Smyrnaios b7c8acc563 - Update the code which acquires the "IMPALA_HDFS_NODE", to test the "tmp"-dir, instead of the base-dir and introduce retries, to overcome potential file-system failures. This change was suggested by "Sebastian Tymkow" and "Grzegorz Bakalarski".
- Fix typos.
2024-04-03 13:15:37 +03:00
Antonis Lempesis df6e3bda04 added new orgs in monitor 2024-04-01 22:45:29 +03:00
Antonis Lempesis 573b081f1d added new orgs in monitor 2024-04-01 22:24:46 +03:00
Antonis Lempesis 0bf2a7a359 fixed the result_country definition 2024-04-01 15:23:22 +03:00
Claudio Atzori f01390702e Merge pull request 'fixed typo in indicator query' (#410) from antonis.lempesis/dnet-hadoop:beta into master
Reviewed-on: #410
2024-03-27 13:42:07 +01:00
Antonis Lempesis 9ff44eed96 fixed typo in indicator query
added more institutions
2024-03-27 14:39:01 +02:00
Claudio Atzori 5592ccc37a Merge pull request 'added missing EOS, Generate tables with parquet-files, instead of csv in the contexts.sh script' (#408) from antonis.lempesis/dnet-hadoop:beta into master
Reviewed-on: #408
2024-03-27 12:02:57 +01:00
Antonis Lempesis 1fee4124e0 added missing EOS 2024-03-27 12:58:25 +02:00
Claudio Atzori d16c15da8d adjusted pom files 2024-03-26 14:00:44 +01:00
Lampros Smyrnaios 036ba03fcd Generate tables with parquet-files, instead of csv, in "dhp-stats-update/.../contexts.sh" script. 2024-03-26 13:29:04 +02:00
Claudio Atzori 09a6d17059 Merge pull request '[Stats wf] #372, #405 to production' (#406) from antonis.lempesis/dnet-hadoop:beta into master
Reviewed-on: #406
2024-03-26 12:18:26 +01:00
Claudio Atzori d70793847d resolving conflicts on step16-createIndicatorsTables.sql 2024-03-26 12:17:52 +01:00
Lampros Smyrnaios bc8c97182d Automatically select the ACTIVE HDFS NODE for Impala cluster, in all "copyDataToImpalaCluster.sh" scripts. 2024-03-26 13:01:12 +02:00
Lampros Smyrnaios 92cc27e7eb Use the ACTIVE HDFS NODE for Impala cluster, in "copyDataToImpalaCluster.sh" script. 2024-03-26 12:34:11 +02:00
Michele De Bonis f6601ea7d1 default parameters for openorgs updated 2024-03-25 13:07:04 +01:00
Michele De Bonis cd4c3c934d openorgs wf updated 2024-03-22 15:42:37 +01:00
Antonis Lempesis 4c40c96e30 code cleanup 2024-03-22 10:16:49 +02:00
Antonis Lempesis 459167ac2f Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta 2024-03-21 12:44:58 +02:00
Antonis Lempesis 07f634a46d code cleanup 2024-03-21 12:44:30 +02:00
Antonis Lempesis 9521625a07 code cleanup 2024-03-21 11:45:08 +02:00
Antonis Lempesis 67a5aa0a38 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta 2024-03-19 11:24:54 +02:00
dimitrispie a3a570e9a0 Commit monitor-updates-wf 2024-03-19 09:42:21 +02:00
Michele Artini 7a934affd8 oaf:country mapping 2024-03-15 10:57:12 +01:00
Michele Artini a99942f7cf filter by base types 2024-03-13 12:12:42 +01:00
Michele Artini 7f7083f53e updated sql query for filtering BASE records 2024-03-13 11:57:26 +01:00
Michele Artini d9b23a76c5 comments 2024-03-12 14:53:34 +01:00
Michele Artini 841ca92246 Merge pull request 'new plugin to collect from a dump of BASE' (#400) from base-collector-plugin into master
Reviewed-on: #400
2024-03-12 12:22:42 +01:00
Michele Artini 3bcfc40293 new plugin to collect from a dump of BASE 2024-03-12 12:17:58 +01:00
Antonis Lempesis f74c7e8689 selecting distinct peer_reviewed 2024-03-12 02:13:04 +02:00
Antonis Lempesis 3c79720342 fixed the irish result subset 2024-03-07 14:08:57 +02:00
Antonis Lempesis 5ae4b4286c Merge branch 'beta' of https://code-repo.d3science.org/antonis.lempesis/dnet-hadoop into beta 2024-03-07 12:15:19 +02:00
Antonis Lempesis 316d585c8a using distinct apcs per publication to avoid huge sums 2024-03-07 02:07:59 +02:00
Giambattista Bloisi 3067ea390d Use SparkSQL in place of Hive for executing step16-createIndicatorsTables.sql of stats update wf 2024-03-04 11:13:34 +01:00
Miriam Baglioni c94d94035c [BulkTagging] added check to verify if field is present in the pathMap 2024-02-28 09:41:42 +01:00
Michele Artini 4374d7449e mapping of project PIDs 2024-02-22 14:44:35 +01:00
Claudio Atzori 07d009007b Merge pull request 'Fixed problem on missing author in crossref Mapping' (#384) from crossref_missing_author_fix_master into master
Reviewed-on: #384
2024-02-15 15:06:17 +01:00
Claudio Atzori 071d044971 Merge branch 'master' into crossref_missing_author_fix_master 2024-02-15 15:04:19 +01:00
Claudio Atzori b3ddbaed58 fixed import of ORPs stored on HDFS in the internal graph format (e.g. Datacite) 2024-02-15 15:02:48 +01:00
Claudio Atzori 1416f16b35 [graph raw] fixed mapping of the original resource type from the Datacite format 2024-02-09 10:19:53 +01:00
Giambattista Bloisi ba1a0e7b4f Merge pull request 'Set deletedbyinference =true to dedup aliases, created when a dedup in a previous build has been merged in a new dedup' (#392) from fix_dedupaliases_deletedbyinference into master
Reviewed-on: #392
2024-02-08 15:29:29 +01:00
Giambattista Bloisi 079085286c Merge branch 'master' into fix_dedupaliases_deletedbyinference 2024-02-08 15:29:13 +01:00
Giambattista Bloisi 8dd666aedd Dedup aliases, created when a dedup in a previous build has been merged in a new dedup, need to be marked as "deletedbyinference", since they are "merged" in the new dedup 2024-02-08 15:27:57 +01:00
Claudio Atzori f21133229a Merge pull request 'Support for the PromoteAction strategy [master]' (#391) from promote_actions_join_type_master into master
Reviewed-on: #391
2024-02-08 15:12:16 +01:00
Claudio Atzori d86b909db2 [actiosets] fixed join type 2024-02-08 15:10:55 +01:00
Claudio Atzori 08162902ab [actiosets] introduced support for the PromoteAction strategy 2024-02-08 15:10:40 +01:00
Antonis Lempesis dd4c27f4f3 added 2 new institutions in monitor 2024-02-08 12:57:57 +02:00
Claudio Atzori e8630a6d03 [graph cleaning] rule out datasources without an officialname 2024-02-05 14:59:06 +02:00
Claudio Atzori f28c63d5ef [orcid enrichment] fixed directory cleanup before distcp 2024-02-05 09:44:56 +02:00
Antonis Lempesis a512ead447 changed orcid ids to all capital 2024-01-30 16:54:47 +02:00
Claudio Atzori 1a8b609ed2 code formatting 2024-01-30 11:34:16 +01:00
Antonis Lempesis bb10a22290 merged changes from dnet-hadoop 2024-01-29 21:51:47 +02:00
Miriam Baglioni 4c8706efee [orcid-enrichment] change the value of parameters. 2024-01-29 18:21:36 +01:00
Claudio Atzori f804c58bc7 Merge pull request 'Use SparkSQL in place of Hive for executing step16-createIndicatorsTables.sql of stats update wf' (#386) from stats_with_spark_sql into beta
Reviewed-on: #386
2024-01-29 09:11:59 +01:00
Claudio Atzori 926903b06b Merge branch 'beta' into stats_with_spark_sql 2024-01-29 09:11:45 +01:00
Giambattista Bloisi 078df0b4d1 Use SparkSQL in place of Hive for executing step16-createIndicatorsTables.sql of stats update wf 2024-01-26 21:56:55 +01:00
Claudio Atzori 4d0c59669b merged changes from beta 2024-01-26 16:08:54 +01:00
Claudio Atzori bf99c424fa Merge pull request 'Fixed problem on missing author in crossref Mapping' (#383) from crossref_missing_author_fix into beta
Reviewed-on: #383
2024-01-26 15:57:23 +01:00
Claudio Atzori ce3200263e Merge branch 'beta' into crossref_missing_author_fix 2024-01-26 15:57:04 +01:00
Sandro La Bruzzo 3c8c88bdd3 Fixed problem on missing author in crossref Mapping 2024-01-26 12:29:30 +01:00
Sandro La Bruzzo e889808daa Fixed problem on missing author in crossref Mapping 2024-01-26 12:19:04 +01:00
Claudio Atzori 9e8fc6aa88 [collection] increased logging from the oai-pmh metadata collection process 2024-01-26 09:17:20 +01:00
Antonis Lempesis c548796463 Changed step16-createIndicatorsTables to use a spark oozie action instead of hive 2024-01-26 02:04:48 +02:00
Antonis Lempesis a7115cfa9e max mem of joins (hive.mapjoin.followby.gby.localtask.max.memory.usage) now 80%, up from 55%. 2024-01-25 15:13:16 +01:00
Antonis Lempesis fd43b0e84a max mem of joins (hive.mapjoin.followby.gby.localtask.max.memory.usage) now 80%, up from 55%. 2024-01-25 15:06:34 +01:00
Claudio Atzori 2838a9b630 Update 'CONTRIBUTING.md' 2024-01-24 16:07:05 +01:00
Claudio Atzori da944a5c55 Merge pull request 'code of conduct and contributing' (#382) from contributing into beta
Reviewed-on: #382
2024-01-24 15:40:26 +01:00
Claudio Atzori 0c97a3a81a minor 2024-01-24 10:56:33 +01:00
Claudio Atzori 2c1e6849f0 added code of conduct and contributing files 2024-01-24 10:36:41 +01:00
Claudio Atzori 9b13c22e5d [graph provision] retrieve all the context information by adding all=true to the requests issued to thr API 2024-01-23 15:36:08 +01:00
Claudio Atzori 3e96777cc4 [collection] increased logging from the oai-pmh metadata collection process 2024-01-23 15:21:03 +01:00
Claudio Atzori 9812406589 Merge pull request '[graph provision] updated param specification for the XML converter job' (#380) from provision_community_api into beta
Reviewed-on: #380
2024-01-23 08:55:59 +01:00
Claudio Atzori f87f3a6483 [graph provision] updated param specification for the XML converter job 2024-01-23 08:54:37 +01:00
Claudio Atzori 6fd25cf549 code formatting 2024-01-23 08:47:12 +01:00
Claudio Atzori bd187ec6e7 Merge pull request 'Implements pivots table update oozie workflow' (#376) from update_pivots_table into beta
Reviewed-on: #376
2024-01-22 16:37:30 +01:00
Claudio Atzori f76852f385 Merge branch 'beta' into update_pivots_table 2024-01-22 16:37:22 +01:00
Claudio Atzori b9fcc5ad5e Merge pull request 'Context API update' (#379) from provision_community_api into beta
Reviewed-on: #379
2024-01-22 15:55:33 +01:00
Claudio Atzori 1c6db320f4 [graph provision] obtain context info from the context API instead from the ISLookUp service 2024-01-22 15:53:17 +01:00
Claudio Atzori 2655eea5bc [orcid enrichment] drop paths before copying the non-modifyed contents 2024-01-19 16:28:05 +01:00
Claudio Atzori c6b3401596 increased shuffle partitions for publications in the country propagation workflow 2024-01-19 10:15:39 +01:00
Miriam Baglioni bcc0a13981 [enrichment single step] adding <end> element in wf definition 2024-01-18 17:39:14 +01:00
Miriam Baglioni 6af536541d [enrichment single step] moving parameter file in correct location 2024-01-18 15:35:40 +01:00
Miriam Baglioni a12a3eb143 - 2024-01-18 15:18:10 +01:00
Claudio Atzori 628fdfb5eb Merge pull request '[enrichment single step]' (#378) from enrichmentSingleStepFixed into beta
Reviewed-on: #378
2024-01-18 09:41:09 +01:00
Miriam Baglioni 82e9e262ee [enrichment single step] remove parameter from execution 2024-01-17 17:38:03 +01:00
Miriam Baglioni 67ce2d54be [enrichment single step] refactoring to fix issues in disappeared result type 2024-01-17 16:50:00 +01:00
Miriam Baglioni 59eaccbd87 [enrichment single step] refactoring to fix issue in disappeared result type 2024-01-15 17:49:54 +01:00
Giambattista Bloisi 21a14fcd80 Reusable RunSQLSparkJob for executing SQL in Spark through Oozie Spark Actions
Implements pivots table update oozie workflow
2024-01-15 10:18:14 +01:00
Claudio Atzori 2d302e6827 Merge pull request '[FoS integration]fix issue on FoS integration. Removing the null values from FoS' (#375) from fosPreparationBeta into beta
Reviewed-on: #375
2024-01-12 10:27:28 +01:00
Miriam Baglioni f612125939 fix issue on FoS integration. Removing the null values from FoS 2024-01-12 10:20:28 +01:00
Claudio Atzori c67467723b Merge pull request 'refined mapping for the extraction of the original resource type' (#374) from resource_types into beta
Reviewed-on: #374
2024-01-11 16:29:47 +01:00
Claudio Atzori cb9e739484 Merge branch 'beta' into resource_types 2024-01-11 16:29:41 +01:00
Claudio Atzori 2753044d13 refined mapping for the extraction of the original resource type 2024-01-11 16:28:26 +01:00
Giambattista Bloisi a88dce5bf3 Merge pull request 'Improvements and refactoring in Dedup' (#367) from dedup_increasenumofblocks into beta
Reviewed-on: #367
2024-01-11 11:24:06 +01:00
Giambattista Bloisi 3c66e3bd7b Create dedup record for "merged" pivots
Do not create dedup records for group that have more than 20 different acceptance date
2024-01-10 22:59:52 +01:00
Giambattista Bloisi 10e135db1e Use dedup_wf_002 in place of dedup_wf_001 to make explicit a different algorithm has been used to generate those kind of ids 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 831cc1fdde Generate "merged" dedup id relations also for records that are filtered out by the cut parameters 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 1287315ffb Do no longer use dedupId information from pivotHistory Database 2024-01-10 22:59:52 +01:00
Giambattista Bloisi 02636e802c SparkCreateSimRels:
- Create dedup blocks from the complete queue of records matching cluster key instead of truncating the results
- Clean titles once before clustering and similarity comparisons
- Added support for filtered fields in model
- Added support for sorting List fields in model
- Added new JSONListClustering and numAuthorsTitleSuffixPrefixChain clustering functions
- Added new maxLengthMatch comparator function
- Use reduced complexity Levenshtein with threshold in levensteinTitle
- Use reduced complexity AuthorsMatch with threshold early-quit
- Use incremental Connected Component to decrease comparisons in similarity match in BlockProcessor
- Use new clusterings configuration in Dedup tests

SparkWhitelistSimRels: use left semi join for clarity and performance

SparkCreateMergeRels:
- Use new connected component algorithm that converge faster than Spark GraphX provided algorithm
- Refactored to use Windowing sorting rather than groupBy to reduce memory pressure
- Use historical pivot table to generate singleton rels, merged rels and keep continuity with dedupIds used in the past
- Comparator for pivot record selection now uses "tomorrow" as filler for missing or incorrect date instead of "2000-01-01"
- Changed generation of ids of type dedup_wf_001 to avoid collisions

DedupRecordFactory: use reduceGroups instead of mapGroups to decrease memory pressure
2024-01-10 22:59:52 +01:00
Antonis Lempesis e024718f73 creating result_instances even when no pids exist for the instance 2024-01-10 22:25:50 +01:00
Claudio Atzori 16d858fbf0 Merge pull request 'enrichmentSingleStep' (#373) from enrichmentSingleStep into beta
Reviewed-on: #373
2024-01-10 16:58:49 +01:00
Miriam Baglioni e711a05229 fixed conflicts 2024-01-10 11:03:42 +01:00
Miriam Baglioni 71d6f30711 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2024-01-10 10:59:58 +01:00
dimitrispie b920307bdd Changes to indicators 2024-01-09 00:47:09 +02:00
dimitrispie 8b2cbb611e Changes to beta db names 2024-01-09 00:40:56 +02:00
Antonis Lempesis 2e4cab026c fixed the result_country definition 2024-01-08 16:01:26 +02:00
dimitrispie 6b823100ae Update buildIrishMonitorDB.sql
New indicators added
2024-01-07 22:54:39 +02:00
dimitrispie 75bfde043c Historical Snapshots Workflow
Create historical snapshots db with parameters:

hist_db_name=openaire_beta_historical_snapshots_xxx
hist_db_name_prev=openaire_beta_historical_snapshots_xxx (previous run of wf)
stats_db_name=openaire_beta_stats_xxx
stats_irish_db_name=openaire_beta_stats_monitor_ie_xxx
monitor_db_name=openaire_beta_stats_monitor_xxx
monitor_db_prod_name=openaire_beta_stats_monitor
monitor_irish_db_name=openaire_beta_stats_monitor_ie_xxx
monitor_irish_db_prod_name=openaire_beta_stats_monitor_ie
hist_db_prod_name=openaire_beta_historical_snapshots
hist_db_shadow_name=openaire_beta_historical_snapshots_shadow
hist_date=122023
hive_timeout=150000
hadoop_user_name=xxx
resumeFrom=CreateDB
2024-01-04 15:11:04 +02:00
Miriam Baglioni cb14470ba6 added properties file in the forlder for the workflow of result to organization from inst repo propagation. Changes the path in the classes implementing the propagation 2023-12-22 14:50:05 +01:00
Miriam Baglioni 9f966b59d4 added properties file in the forlder for the workflow of result to community from semrel propagation. Changes the path in the classes implementing the propagation 2023-12-22 14:11:47 +01:00
Miriam Baglioni 2f3b5a133d added properties file in the forlder for the workflow of result to community from organization propagation. Changes the path in the classes implementing the propagation 2023-12-22 13:56:40 +01:00
Miriam Baglioni 2f7b9ad815 added properties file in the forlder for the workflow of project to result propagation. Changes the path in the classes implementing the propagation 2023-12-22 11:46:15 +01:00
Miriam Baglioni f2352e8a78 changed in the classes the path for the property files for the propagation of community from project 2023-12-22 11:43:34 +01:00
Miriam Baglioni 009730b3d1 added properties file in the forlder for the workflow of orcid propagation. Changes the path in the classes implementing the propagationchanged the path to the parameter file in the class for entitytoorganization propagation 2023-12-22 11:42:09 +01:00
Miriam Baglioni 89f269c7f4 changed the path to the parameter file in the class for entitytoorganization propagation 2023-12-22 11:37:50 +01:00
Miriam Baglioni b06aea0adf adding the bulkTag parameter file in the folder for the oozie workflow for bulkTagging. Changes the path in the class 2023-12-22 11:35:37 +01:00
Miriam Baglioni 3afd4aa57b adjustments for country propagation 2023-12-22 11:27:30 +01:00
dimitrispie ffdd03d2f4 Monitor Irish Stats WF
Parameters (with examples):
stats_db_name=openaire_beta_stats_20231208
monitor_irish_db_name=openaire_beta_stats_monitor_ie_20231208b
monitor_irish_db_prod_name=openaire_beta_stats_monitor_ie
graph_db_name=openaire_beta_20231208
monitor_irish_db_shadow_name=openaire_beta_stats_monitor_ie_shadow
hive_timeout=150000
hadoop_user_name=dnet.beta
resumeFrom=Step1-buildIrishMonitorDB
2023-12-22 11:05:24 +02:00
dimitrispie 40b98d8182 Changes to indicators and funders definition
- Changes result_refereed definition
- Added result_country indicator
- Added indi_pub_green_with_license indicator
- Added country from jurisdiction to funders
2023-12-22 10:29:20 +02:00
Claudio Atzori 62104790ae added metaresourcetype to the result hive DB view 2023-12-21 12:27:10 +01:00
Claudio Atzori 106968adaa Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2023-12-21 12:26:29 +01:00
Claudio Atzori a8a4db96f0 added metaresourcetype to the result hive DB view 2023-12-21 12:26:19 +01:00
Miriam Baglioni 5011c4d11a refactoring after compiletion 2023-12-20 15:57:26 +01:00
Miriam Baglioni 4740c808f7 - 2023-12-20 14:26:54 +01:00
Miriam Baglioni d410ea8a41 added needed parameter 2023-12-19 12:15:01 +01:00
Sandro La Bruzzo 37e36baf76 updated workflow for generation of Scholix Datasource's to use mdstore transactions 2023-12-18 16:05:35 +01:00
Sandro La Bruzzo 9d39845d1f uploaded input parameters on CreateBaseline WF 2023-12-18 12:23:12 +01:00
Sandro La Bruzzo 15fd93a2b6 uploaded input parameters on CreateBaseline WF 2023-12-18 12:21:55 +01:00
Sandro La Bruzzo 9d342a47da updated the transformation Baseline workflow to include mdstore rollback/commit action 2023-12-18 11:48:57 +01:00
Sandro La Bruzzo 1fbd4325f5 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2023-12-18 11:47:17 +01:00
Sandro La Bruzzo 1f1a6a5f5f updated the transformation Baseline workflow to include mdstore rollback/commit action 2023-12-18 11:47:00 +01:00
Miriam Baglioni 3eca5d2e1c - 2023-12-18 09:55:27 +01:00
Miriam Baglioni 01ce0b9c76 [doiboost - preprocess] remove transition to orcid preparation from sequence of steps at the beginning of the workflow 2023-12-15 12:24:55 +01:00
Miriam Baglioni 0d8e496a63 - 2023-12-15 12:16:43 +01:00
Claudio Atzori c4ec35b6cd Merge pull request 'Master branch updates from beta December 2023' (#369) from beta_to_master_dicember2023 into master
Reviewed-on: #369
2023-12-15 11:18:30 +01:00
Miriam Baglioni 8752d275fa removed not needed parameter 2023-12-09 15:24:45 +01:00
Miriam Baglioni d4eedada71 adjusting workflow definition 2023-12-09 15:20:11 +01:00
Miriam Baglioni 616622d2bb first version of the workflow single step 2023-12-07 09:59:52 +01:00
254 changed files with 9473 additions and 2656 deletions

43
CODE_OF_CONDUCT.md Normal file
View File

@ -0,0 +1,43 @@
# Contributor Code of Conduct
Openness, transparency and our community-driven participatory approach guide us in our day-to-day interactions and decision-making. Our open source projects are no exception. Trust, respect, collaboration and transparency are core values we believe should live and breathe within our projects. Our community welcomes participants from around the world with different experiences, unique perspectives, and great ideas to share.
## Our Pledge
In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment include:
- Using welcoming and inclusive language
- Being respectful of differing viewpoints and experiences
- Gracefully accepting constructive criticism
- Attempting collaboration before conflict
- Focusing on what is best for the community
- Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
- Violence, threats of violence, or inciting others to commit self-harm
- The use of sexualized language or imagery and unwelcome sexual attention or advances
- Trolling, intentionally spreading misinformation, insulting/derogatory comments, and personal or political attacks
- Public or private harassment
- Publishing others' private information, such as a physical or electronic address, without explicit permission
- Abuse of the reporting process to intentionally harass or exclude others
- Advocating for, or encouraging, any of the above behavior
- Other conduct which could reasonably be considered inappropriate in a professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.
## Scope
This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org/), [version 1.4](https://www.contributor-covenant.org/version/1/4/code-of-conduct.html).

10
CONTRIBUTING.md Normal file
View File

@ -0,0 +1,10 @@
# Contributing to D-Net Hadoop
:+1::tada: First off, thanks for taking the time to contribute! :tada::+1:
This project and everyone participating in it is governed by our [Code of Conduct](CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code. Please report unacceptable behavior to [dnet-team@isti.cnr.it](mailto:dnet-team@isti.cnr.it).
The following is a set of guidelines for contributing to this project and its packages. These are mostly guidelines, not rules, which applies to this project as a while, including all its sub-modules.
Use your best judgment, and feel free to propose changes to this document in a pull request.
All contributions are welcome, all contributions will be considered to be contributed under the [project license](LICENSE.md).

View File

View File

@ -2,6 +2,11 @@
Dnet-hadoop is the project that defined all the [OOZIE workflows](https://oozie.apache.org/) for the OpenAIRE Graph construction, processing, provisioning.
This project adheres to the Contributor Covenant [code of conduct](CODE_OF_CONDUCT.md).
By participating, you are expected to uphold this code. Please report unacceptable behavior to [dnet-team@isti.cnr.it](mailto:dnet-team@isti.cnr.it).
This project is licensed under the [AGPL v3 or later version](#LICENSE.md).
How to build, package and run oozie workflows
====================

View File

@ -0,0 +1,39 @@
package eu.dnetlib.dhp.common.api.context;
public class CategorySummary {
private String id;
private String label;
private boolean hasConcept;
public String getId() {
return id;
}
public String getLabel() {
return label;
}
public boolean isHasConcept() {
return hasConcept;
}
public CategorySummary setId(final String id) {
this.id = id;
return this;
}
public CategorySummary setLabel(final String label) {
this.label = label;
return this;
}
public CategorySummary setHasConcept(final boolean hasConcept) {
this.hasConcept = hasConcept;
return this;
}
}

View File

@ -0,0 +1,7 @@
package eu.dnetlib.dhp.common.api.context;
import java.util.ArrayList;
public class CategorySummaryList extends ArrayList<CategorySummary> {
}

View File

@ -0,0 +1,52 @@
package eu.dnetlib.dhp.common.api.context;
import java.util.List;
public class ConceptSummary {
private String id;
private String label;
public boolean hasSubConcept;
private List<ConceptSummary> concepts;
public String getId() {
return id;
}
public String getLabel() {
return label;
}
public List<ConceptSummary> getConcepts() {
return concepts;
}
public ConceptSummary setId(final String id) {
this.id = id;
return this;
}
public ConceptSummary setLabel(final String label) {
this.label = label;
return this;
}
public boolean isHasSubConcept() {
return hasSubConcept;
}
public ConceptSummary setHasSubConcept(final boolean hasSubConcept) {
this.hasSubConcept = hasSubConcept;
return this;
}
public ConceptSummary setConcept(final List<ConceptSummary> concepts) {
this.concepts = concepts;
return this;
}
}

View File

@ -0,0 +1,7 @@
package eu.dnetlib.dhp.common.api.context;
import java.util.ArrayList;
public class ConceptSummaryList extends ArrayList<ConceptSummary> {
}

View File

@ -0,0 +1,50 @@
package eu.dnetlib.dhp.common.api.context;
public class ContextSummary {
private String id;
private String label;
private String type;
private String status;
public String getId() {
return id;
}
public String getLabel() {
return label;
}
public String getType() {
return type;
}
public String getStatus() {
return status;
}
public ContextSummary setId(final String id) {
this.id = id;
return this;
}
public ContextSummary setLabel(final String label) {
this.label = label;
return this;
}
public ContextSummary setType(final String type) {
this.type = type;
return this;
}
public ContextSummary setStatus(final String status) {
this.status = status;
return this;
}
}

View File

@ -0,0 +1,7 @@
package eu.dnetlib.dhp.common.api.context;
import java.util.ArrayList;
public class ContextSummaryList extends ArrayList<ContextSummary> {
}

View File

@ -8,10 +8,13 @@ import java.io.InputStream;
import java.net.*;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.math.NumberUtils;
import org.apache.commons.lang3.time.DateUtils;
import org.apache.http.HttpHeaders;
import org.joda.time.Instant;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@ -94,14 +97,16 @@ public class HttpConnector2 {
throw new CollectorException(msg);
}
log.info("Request attempt {} [{}]", retryNumber, requestUrl);
InputStream input = null;
long start = System.currentTimeMillis();
try {
if (getClientParams().getRequestDelay() > 0) {
backoffAndSleep(getClientParams().getRequestDelay());
}
log.info("Request attempt {} [{}]", retryNumber, requestUrl);
final HttpURLConnection urlConn = (HttpURLConnection) new URL(requestUrl).openConnection();
urlConn.setInstanceFollowRedirects(false);
urlConn.setReadTimeout(getClientParams().getReadTimeOut() * 1000);
@ -115,9 +120,8 @@ public class HttpConnector2 {
urlConn.addRequestProperty(headerEntry.getKey(), headerEntry.getValue());
}
}
if (log.isDebugEnabled()) {
logHeaderFields(urlConn);
}
logHeaderFields(urlConn);
int retryAfter = obtainRetryAfter(urlConn.getHeaderFields());
String rateLimit = urlConn.getHeaderField(Constants.HTTPHEADER_IETF_DRAFT_RATELIMIT_LIMIT);
@ -132,9 +136,7 @@ public class HttpConnector2 {
}
if (is2xx(urlConn.getResponseCode())) {
input = urlConn.getInputStream();
responseType = urlConn.getContentType();
return input;
return getInputStream(urlConn, start);
}
if (is3xx(urlConn.getResponseCode())) {
// REDIRECTS
@ -144,6 +146,7 @@ public class HttpConnector2 {
.put(
REPORT_PREFIX + urlConn.getResponseCode(),
String.format("Moved to: %s", newUrl));
logRequestTime(start);
urlConn.disconnect();
if (retryAfter > 0) {
backoffAndSleep(retryAfter);
@ -159,26 +162,50 @@ public class HttpConnector2 {
if (retryAfter > 0) {
log
.warn(
"{} - waiting and repeating request after suggested retry-after {} sec.",
requestUrl, retryAfter);
"waiting and repeating request after suggested retry-after {} sec for URL {}",
retryAfter, requestUrl);
backoffAndSleep(retryAfter * 1000);
} else {
log
.warn(
"{} - waiting and repeating request after default delay of {} sec.",
requestUrl, getClientParams().getRetryDelay());
backoffAndSleep(retryNumber * getClientParams().getRetryDelay() * 1000);
"waiting and repeating request after default delay of {} sec for URL {}",
getClientParams().getRetryDelay(), requestUrl);
backoffAndSleep(retryNumber * getClientParams().getRetryDelay());
}
report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl);
logRequestTime(start);
urlConn.disconnect();
return attemptDownload(requestUrl, retryNumber + 1, report);
case 422: // UNPROCESSABLE ENTITY
report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl);
log.warn("waiting and repeating request after 10 sec for URL {}", requestUrl);
backoffAndSleep(10000);
urlConn.disconnect();
logRequestTime(start);
try {
return getInputStream(urlConn, start);
} catch (IOException e) {
log
.error(
"server returned 422 and got IOException accessing the response body from URL {}",
requestUrl);
log.error("IOException:", e);
return attemptDownload(requestUrl, retryNumber + 1, report);
}
default:
log.error("gor error {} from URL: {}", urlConn.getResponseCode(), urlConn.getURL());
log.error("response message: {}", urlConn.getResponseMessage());
report
.put(
REPORT_PREFIX + urlConn.getResponseCode(),
String
.format(
"%s Error: %s", requestUrl, urlConn.getResponseMessage()));
logRequestTime(start);
urlConn.disconnect();
throw new CollectorException(urlConn.getResponseCode() + " error " + report);
}
}
@ -199,13 +226,27 @@ public class HttpConnector2 {
}
}
private InputStream getInputStream(HttpURLConnection urlConn, long start) throws IOException {
InputStream input = urlConn.getInputStream();
responseType = urlConn.getContentType();
logRequestTime(start);
return input;
}
private static void logRequestTime(long start) {
log
.info(
"request time elapsed: {}sec",
TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis() - start));
}
private void logHeaderFields(final HttpURLConnection urlConn) throws IOException {
log.debug("StatusCode: {}", urlConn.getResponseMessage());
log.info("Response: {} - {}", urlConn.getResponseCode(), urlConn.getResponseMessage());
for (Map.Entry<String, List<String>> e : urlConn.getHeaderFields().entrySet()) {
if (e.getKey() != null) {
for (String v : e.getValue()) {
log.debug(" key: {} - value: {}", e.getKey(), v);
log.info(" key: {} - value: {}", e.getKey(), v);
}
}
}
@ -225,7 +266,7 @@ public class HttpConnector2 {
for (String key : headerMap.keySet()) {
if ((key != null) && key.equalsIgnoreCase(HttpHeaders.RETRY_AFTER) && (!headerMap.get(key).isEmpty())
&& NumberUtils.isCreatable(headerMap.get(key).get(0))) {
return Integer.parseInt(headerMap.get(key).get(0)) + 10;
return Integer.parseInt(headerMap.get(key).get(0));
}
}
return -1;

View File

@ -0,0 +1,77 @@
package eu.dnetlib.dhp.oozie;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkHiveSession;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import org.apache.commons.lang3.time.DurationFormatUtils;
import org.apache.commons.text.StringSubstitutor;
import org.apache.spark.SparkConf;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.common.io.Resources;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
public class RunSQLSparkJob {
private static final Logger log = LoggerFactory.getLogger(RunSQLSparkJob.class);
private final ArgumentApplicationParser parser;
public RunSQLSparkJob(ArgumentApplicationParser parser) {
this.parser = parser;
}
public static void main(String[] args) throws Exception {
Map<String, String> params = new HashMap<>();
for (int i = 0; i < args.length - 1; i++) {
if (args[i].startsWith("--")) {
params.put(args[i].substring(2), args[++i]);
}
}
/*
* String jsonConfiguration = IOUtils .toString( Objects .requireNonNull( RunSQLSparkJob.class
* .getResourceAsStream( "/eu/dnetlib/dhp/oozie/run_sql_parameters.json"))); final ArgumentApplicationParser
* parser = new ArgumentApplicationParser(jsonConfiguration); parser.parseArgument(args);
*/
Boolean isSparkSessionManaged = Optional
.ofNullable(params.get("isSparkSessionManaged"))
.map(Boolean::valueOf)
.orElse(Boolean.TRUE);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
URL url = com.google.common.io.Resources.getResource(params.get("sql"));
String raw_sql = Resources.toString(url, StandardCharsets.UTF_8);
String sql = StringSubstitutor.replace(raw_sql, params);
log.info("sql: {}", sql);
SparkConf conf = new SparkConf();
conf.set("hive.metastore.uris", params.get("hiveMetastoreUris"));
runWithSparkHiveSession(
conf,
isSparkSessionManaged,
spark -> {
for (String statement : sql.split(";\\s*/\\*\\s*EOS\\s*\\*/\\s*")) {
log.info("executing: {}", statement);
long startTime = System.currentTimeMillis();
spark.sql(statement).show();
log
.info(
"executed in {}",
DurationFormatUtils.formatDuration(System.currentTimeMillis() - startTime, "HH:mm:ss.S"));
}
});
}
}

View File

@ -312,7 +312,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
}
if (value instanceof Datasource) {
// nothing to evaluate here
final Datasource d = (Datasource) value;
return Objects.nonNull(d.getOfficialname()) && StringUtils.isNotBlank(d.getOfficialname().getValue());
} else if (value instanceof Project) {
final Project p = (Project) value;
return Objects.nonNull(p.getCode()) && StringUtils.isNotBlank(p.getCode().getValue());

View File

@ -0,0 +1,20 @@
[
{
"paramName": "issm",
"paramLongName": "isSparkSessionManaged",
"paramDescription": "when true will stop SparkSession after job execution",
"paramRequired": false
},
{
"paramName": "hmu",
"paramLongName": "hiveMetastoreUris",
"paramDescription": "the hive metastore uris",
"paramRequired": true
},
{
"paramName": "sql",
"paramLongName": "sql",
"paramDescription": "sql script to execute",
"paramRequired": true
}
]

View File

@ -14,9 +14,9 @@ import eu.dnetlib.pace.config.Config;
public abstract class AbstractClusteringFunction extends AbstractPaceFunctions implements ClusteringFunction {
protected Map<String, Integer> params;
protected Map<String, Object> params;
public AbstractClusteringFunction(final Map<String, Integer> params) {
public AbstractClusteringFunction(final Map<String, Object> params) {
this.params = params;
}
@ -27,7 +27,7 @@ public abstract class AbstractClusteringFunction extends AbstractPaceFunctions i
return fields
.stream()
.filter(f -> !f.isEmpty())
.map(this::normalize)
.map(s -> normalize(s))
.map(s -> filterAllStopWords(s))
.map(s -> doApply(conf, s))
.map(c -> filterBlacklisted(c, ngramBlacklist))
@ -36,11 +36,24 @@ public abstract class AbstractClusteringFunction extends AbstractPaceFunctions i
.collect(Collectors.toCollection(HashSet::new));
}
public Map<String, Integer> getParams() {
public Map<String, Object> getParams() {
return params;
}
protected Integer param(String name) {
return params.get(name);
Object val = params.get(name);
if (val == null)
return null;
if (val instanceof Number) {
return ((Number) val).intValue();
}
return Integer.parseInt(val.toString());
}
protected int paramOrDefault(String name, int i) {
Integer res = param(name);
if (res == null)
res = i;
return res;
}
}

View File

@ -13,7 +13,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("acronyms")
public class Acronyms extends AbstractClusteringFunction {
public Acronyms(Map<String, Integer> params) {
public Acronyms(Map<String, Object> params) {
super(params);
}

View File

@ -11,6 +11,6 @@ public interface ClusteringFunction {
public Collection<String> apply(Config config, List<String> fields);
public Map<String, Integer> getParams();
public Map<String, Object> getParams();
}

View File

@ -12,7 +12,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("immutablefieldvalue")
public class ImmutableFieldValue extends AbstractClusteringFunction {
public ImmutableFieldValue(final Map<String, Integer> params) {
public ImmutableFieldValue(final Map<String, Object> params) {
super(params);
}

View File

@ -0,0 +1,69 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.commons.lang3.StringUtils;
import com.jayway.jsonpath.Configuration;
import com.jayway.jsonpath.DocumentContext;
import com.jayway.jsonpath.JsonPath;
import com.jayway.jsonpath.Option;
import eu.dnetlib.pace.common.AbstractPaceFunctions;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.util.MapDocumentUtil;
@ClusteringClass("jsonlistclustering")
public class JSONListClustering extends AbstractPaceFunctions implements ClusteringFunction {
private Map<String, Object> params;
public JSONListClustering(Map<String, Object> params) {
this.params = params;
}
@Override
public Map<String, Object> getParams() {
return params;
}
@Override
public Collection<String> apply(Config conf, List<String> fields) {
return fields
.stream()
.filter(f -> !f.isEmpty())
.map(s -> doApply(conf, s))
.filter(StringUtils::isNotBlank)
.collect(Collectors.toCollection(HashSet::new));
}
private String doApply(Config conf, String json) {
StringBuilder st = new StringBuilder(); // to build the string used for comparisons basing on the jpath into
// parameters
final DocumentContext documentContext = JsonPath
.using(Configuration.defaultConfiguration().addOptions(Option.SUPPRESS_EXCEPTIONS))
.parse(json);
// for each path in the param list
for (String key : params.keySet().stream().filter(k -> k.contains("jpath")).collect(Collectors.toList())) {
String path = params.get(key).toString();
String value = MapDocumentUtil.getJPathString(path, documentContext);
if (value == null || value.isEmpty())
value = "";
st.append(value);
st.append(" ");
}
st.setLength(st.length() - 1);
if (StringUtils.isBlank(st)) {
return "1";
}
return st.toString();
}
}

View File

@ -11,7 +11,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("keywordsclustering")
public class KeywordsClustering extends AbstractClusteringFunction {
public KeywordsClustering(Map<String, Integer> params) {
public KeywordsClustering(Map<String, Object> params) {
super(params);
}
@ -19,8 +19,8 @@ public class KeywordsClustering extends AbstractClusteringFunction {
protected Collection<String> doApply(final Config conf, String s) {
// takes city codes and keywords codes without duplicates
Set<String> keywords = getKeywords(s, conf.translationMap(), params.getOrDefault("windowSize", 4));
Set<String> cities = getCities(s, params.getOrDefault("windowSize", 4));
Set<String> keywords = getKeywords(s, conf.translationMap(), paramOrDefault("windowSize", 4));
Set<String> cities = getCities(s, paramOrDefault("windowSize", 4));
// list of combination to return as result
final Collection<String> combinations = new LinkedHashSet<String>();
@ -28,7 +28,7 @@ public class KeywordsClustering extends AbstractClusteringFunction {
for (String keyword : keywordsToCodes(keywords, conf.translationMap())) {
for (String city : citiesToCodes(cities)) {
combinations.add(keyword + "-" + city);
if (combinations.size() >= params.getOrDefault("max", 2)) {
if (combinations.size() >= paramOrDefault("max", 2)) {
return combinations;
}
}
@ -42,8 +42,8 @@ public class KeywordsClustering extends AbstractClusteringFunction {
return fields
.stream()
.filter(f -> !f.isEmpty())
.map(this::cleanup)
.map(this::normalize)
.map(KeywordsClustering::cleanup)
.map(KeywordsClustering::normalize)
.map(s -> filterAllStopWords(s))
.map(s -> doApply(conf, s))
.map(c -> filterBlacklisted(c, ngramBlacklist))

View File

@ -16,7 +16,7 @@ public class LastNameFirstInitial extends AbstractClusteringFunction {
private boolean DEFAULT_AGGRESSIVE = true;
public LastNameFirstInitial(final Map<String, Integer> params) {
public LastNameFirstInitial(final Map<String, Object> params) {
super(params);
}
@ -25,7 +25,7 @@ public class LastNameFirstInitial extends AbstractClusteringFunction {
return fields
.stream()
.filter(f -> !f.isEmpty())
.map(this::normalize)
.map(LastNameFirstInitial::normalize)
.map(s -> doApply(conf, s))
.map(c -> filterBlacklisted(c, ngramBlacklist))
.flatMap(c -> c.stream())
@ -33,8 +33,7 @@ public class LastNameFirstInitial extends AbstractClusteringFunction {
.collect(Collectors.toCollection(HashSet::new));
}
@Override
protected String normalize(final String s) {
public static String normalize(final String s) {
return fixAliases(transliterate(nfd(unicodeNormalization(s))))
// do not compact the regexes in a single expression, would cause StackOverflowError in case of large input
// strings

View File

@ -15,7 +15,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("lowercase")
public class LowercaseClustering extends AbstractClusteringFunction {
public LowercaseClustering(final Map<String, Integer> params) {
public LowercaseClustering(final Map<String, Object> params) {
super(params);
}

View File

@ -12,11 +12,11 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("ngrampairs")
public class NgramPairs extends Ngrams {
public NgramPairs(Map<String, Integer> params) {
public NgramPairs(Map<String, Object> params) {
super(params, false);
}
public NgramPairs(Map<String, Integer> params, boolean sorted) {
public NgramPairs(Map<String, Object> params, boolean sorted) {
super(params, sorted);
}

View File

@ -10,11 +10,11 @@ public class Ngrams extends AbstractClusteringFunction {
private final boolean sorted;
public Ngrams(Map<String, Integer> params) {
public Ngrams(Map<String, Object> params) {
this(params, false);
}
public Ngrams(Map<String, Integer> params, boolean sorted) {
public Ngrams(Map<String, Object> params, boolean sorted) {
super(params);
this.sorted = sorted;
}

View File

@ -0,0 +1,113 @@
package eu.dnetlib.pace.clustering;
import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;
import com.google.common.base.Splitter;
import com.google.common.collect.Sets;
import eu.dnetlib.pace.config.Config;
@ClusteringClass("numAuthorsTitleSuffixPrefixChain")
public class NumAuthorsTitleSuffixPrefixChain extends AbstractClusteringFunction {
public NumAuthorsTitleSuffixPrefixChain(Map<String, Object> params) {
super(params);
}
@Override
public Collection<String> apply(Config conf, List<String> fields) {
try {
int num_authors = Math.min(Integer.parseInt(fields.get(0)), 21); // SIZE threshold is 20, +1
if (num_authors > 0) {
return super.apply(conf, fields.subList(1, fields.size()))
.stream()
.map(s -> num_authors + "-" + s)
.collect(Collectors.toList());
}
} catch (NumberFormatException e) {
// missing or null authors array
}
return Collections.emptyList();
}
@Override
protected Collection<String> doApply(Config conf, String s) {
return suffixPrefixChain(cleanup(s), param("mod"));
}
private Collection<String> suffixPrefixChain(String s, int mod) {
// create the list of words from the string (remove short words)
List<String> wordsList = Arrays
.stream(s.split(" "))
.filter(si -> si.length() > 3)
.collect(Collectors.toList());
final int words = wordsList.size();
final int letters = s.length();
// create the prefix: number of words + number of letters/mod
String prefix = words / mod + "-";
return doSuffixPrefixChain(wordsList, prefix);
}
private Collection<String> doSuffixPrefixChain(List<String> wordsList, String prefix) {
Set<String> set = Sets.newLinkedHashSet();
switch (wordsList.size()) {
case 0:
break;
case 1:
set.add(wordsList.get(0));
break;
case 2:
set
.add(
prefix +
suffix(wordsList.get(0), 3) +
prefix(wordsList.get(1), 3));
set
.add(
prefix +
prefix(wordsList.get(0), 3) +
suffix(wordsList.get(1), 3));
break;
default:
set
.add(
prefix +
suffix(wordsList.get(0), 3) +
prefix(wordsList.get(1), 3) +
suffix(wordsList.get(2), 3));
set
.add(
prefix +
prefix(wordsList.get(0), 3) +
suffix(wordsList.get(1), 3) +
prefix(wordsList.get(2), 3));
break;
}
return set;
}
private String suffix(String s, int len) {
return s.substring(s.length() - len);
}
private String prefix(String s, int len) {
return s.substring(0, len);
}
}

View File

@ -17,11 +17,11 @@ import eu.dnetlib.pace.model.Person;
@ClusteringClass("personClustering")
public class PersonClustering extends AbstractPaceFunctions implements ClusteringFunction {
private Map<String, Integer> params;
private Map<String, Object> params;
private static final int MAX_TOKENS = 5;
public PersonClustering(final Map<String, Integer> params) {
public PersonClustering(final Map<String, Object> params) {
this.params = params;
}
@ -77,7 +77,7 @@ public class PersonClustering extends AbstractPaceFunctions implements Clusterin
// }
@Override
public Map<String, Integer> getParams() {
public Map<String, Object> getParams() {
return params;
}

View File

@ -15,7 +15,7 @@ public class PersonHash extends AbstractClusteringFunction {
private boolean DEFAULT_AGGRESSIVE = false;
public PersonHash(final Map<String, Integer> params) {
public PersonHash(final Map<String, Object> params) {
super(params);
}

View File

@ -8,7 +8,7 @@ import eu.dnetlib.pace.config.Config;
public class RandomClusteringFunction extends AbstractClusteringFunction {
public RandomClusteringFunction(Map<String, Integer> params) {
public RandomClusteringFunction(Map<String, Object> params) {
super(params);
}

View File

@ -1,7 +1,10 @@
package eu.dnetlib.pace.clustering;
import java.util.*;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import com.google.common.base.Joiner;
import com.google.common.base.Splitter;
@ -12,7 +15,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("sortedngrampairs")
public class SortedNgramPairs extends NgramPairs {
public SortedNgramPairs(Map<String, Integer> params) {
public SortedNgramPairs(Map<String, Object> params) {
super(params, false);
}

View File

@ -15,7 +15,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("spacetrimmingfieldvalue")
public class SpaceTrimmingFieldValue extends AbstractClusteringFunction {
public SpaceTrimmingFieldValue(final Map<String, Integer> params) {
public SpaceTrimmingFieldValue(final Map<String, Object> params) {
super(params);
}
@ -25,7 +25,7 @@ public class SpaceTrimmingFieldValue extends AbstractClusteringFunction {
res
.add(
StringUtils.isBlank(s) ? RandomStringUtils.random(getParams().get("randomLength"))
StringUtils.isBlank(s) ? RandomStringUtils.random(param("randomLength"))
: s.toLowerCase().replaceAll("\\s+", ""));
return res;

View File

@ -12,7 +12,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("suffixprefix")
public class SuffixPrefix extends AbstractClusteringFunction {
public SuffixPrefix(Map<String, Integer> params) {
public SuffixPrefix(Map<String, Object> params) {
super(params);
}

View File

@ -15,12 +15,17 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("urlclustering")
public class UrlClustering extends AbstractPaceFunctions implements ClusteringFunction {
protected Map<String, Integer> params;
protected Map<String, Object> params;
public UrlClustering(final Map<String, Integer> params) {
public UrlClustering(final Map<String, Object> params) {
this.params = params;
}
@Override
public Map<String, Object> getParams() {
return params;
}
@Override
public Collection<String> apply(final Config conf, List<String> fields) {
try {
@ -35,11 +40,6 @@ public class UrlClustering extends AbstractPaceFunctions implements ClusteringFu
}
}
@Override
public Map<String, Integer> getParams() {
return null;
}
private URL asUrl(String value) {
try {
return new URL(value);

View File

@ -11,7 +11,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("wordsStatsSuffixPrefixChain")
public class WordsStatsSuffixPrefixChain extends AbstractClusteringFunction {
public WordsStatsSuffixPrefixChain(Map<String, Integer> params) {
public WordsStatsSuffixPrefixChain(Map<String, Object> params) {
super(params);
}

View File

@ -12,7 +12,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("wordssuffixprefix")
public class WordsSuffixPrefix extends AbstractClusteringFunction {
public WordsSuffixPrefix(Map<String, Integer> params) {
public WordsSuffixPrefix(Map<String, Object> params) {
super(params);
}

View File

@ -16,7 +16,6 @@ import org.apache.commons.lang3.StringUtils;
import com.google.common.base.Joiner;
import com.google.common.base.Splitter;
import com.google.common.collect.Iterables;
import com.google.common.collect.Lists;
import com.google.common.collect.Sets;
import com.ibm.icu.text.Transliterator;
@ -27,7 +26,7 @@ import eu.dnetlib.pace.clustering.NGramUtils;
*
* @author claudio
*/
public abstract class AbstractPaceFunctions {
public class AbstractPaceFunctions {
// city map to be used when translating the city names into codes
private static Map<String, String> cityMap = AbstractPaceFunctions
@ -62,11 +61,14 @@ public abstract class AbstractPaceFunctions {
private static Pattern hexUnicodePattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
protected String concat(final List<String> l) {
private static Pattern romanNumberPattern = Pattern
.compile("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$");
protected static String concat(final List<String> l) {
return Joiner.on(" ").skipNulls().join(l);
}
protected String cleanup(final String s) {
public static String cleanup(final String s) {
final String s1 = HTML_REGEX.matcher(s).replaceAll("");
final String s2 = unicodeNormalization(s1.toLowerCase());
final String s3 = nfd(s2);
@ -82,7 +84,7 @@ public abstract class AbstractPaceFunctions {
return s12;
}
protected String fixXML(final String a) {
protected static String fixXML(final String a) {
return a
.replaceAll("&ndash;", " ")
@ -91,7 +93,7 @@ public abstract class AbstractPaceFunctions {
.replaceAll("&minus;", " ");
}
protected boolean checkNumbers(final String a, final String b) {
protected static boolean checkNumbers(final String a, final String b) {
final String numbersA = getNumbers(a);
final String numbersB = getNumbers(b);
final String romansA = getRomans(a);
@ -99,7 +101,7 @@ public abstract class AbstractPaceFunctions {
return !numbersA.equals(numbersB) || !romansA.equals(romansB);
}
protected String getRomans(final String s) {
protected static String getRomans(final String s) {
final StringBuilder sb = new StringBuilder();
for (final String t : s.split(" ")) {
sb.append(isRoman(t) ? t : "");
@ -107,13 +109,12 @@ public abstract class AbstractPaceFunctions {
return sb.toString();
}
protected boolean isRoman(final String s) {
return s
.replaceAll("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", "qwertyuiop")
.equals("qwertyuiop");
protected static boolean isRoman(final String s) {
Matcher m = romanNumberPattern.matcher(s);
return m.matches() && m.hitEnd();
}
protected String getNumbers(final String s) {
protected static String getNumbers(final String s) {
final StringBuilder sb = new StringBuilder();
for (final String t : s.split(" ")) {
sb.append(isNumber(t) ? t : "");
@ -121,7 +122,7 @@ public abstract class AbstractPaceFunctions {
return sb.toString();
}
public boolean isNumber(String strNum) {
public static boolean isNumber(String strNum) {
if (strNum == null) {
return false;
}
@ -147,7 +148,7 @@ public abstract class AbstractPaceFunctions {
}
}
protected String removeSymbols(final String s) {
protected static String removeSymbols(final String s) {
final StringBuilder sb = new StringBuilder();
s.chars().forEach(ch -> {
@ -157,11 +158,11 @@ public abstract class AbstractPaceFunctions {
return sb.toString().replaceAll("\\s+", " ");
}
protected boolean notNull(final String s) {
protected static boolean notNull(final String s) {
return s != null;
}
protected String normalize(final String s) {
public static String normalize(final String s) {
return fixAliases(transliterate(nfd(unicodeNormalization(s))))
.toLowerCase()
// do not compact the regexes in a single expression, would cause StackOverflowError in case of large input
@ -174,16 +175,16 @@ public abstract class AbstractPaceFunctions {
.trim();
}
public String nfd(final String s) {
public static String nfd(final String s) {
return Normalizer.normalize(s, Normalizer.Form.NFD);
}
public String utf8(final String s) {
public static String utf8(final String s) {
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
return new String(bytes, StandardCharsets.UTF_8);
}
public String unicodeNormalization(final String s) {
public static String unicodeNormalization(final String s) {
Matcher m = hexUnicodePattern.matcher(s);
StringBuffer buf = new StringBuffer(s.length());
@ -195,7 +196,7 @@ public abstract class AbstractPaceFunctions {
return buf.toString();
}
protected String filterStopWords(final String s, final Set<String> stopwords) {
protected static String filterStopWords(final String s, final Set<String> stopwords) {
final StringTokenizer st = new StringTokenizer(s);
final StringBuilder sb = new StringBuilder();
while (st.hasMoreTokens()) {
@ -208,7 +209,7 @@ public abstract class AbstractPaceFunctions {
return sb.toString().trim();
}
public String filterAllStopWords(String s) {
public static String filterAllStopWords(String s) {
s = filterStopWords(s, stopwords_en);
s = filterStopWords(s, stopwords_de);
@ -221,7 +222,8 @@ public abstract class AbstractPaceFunctions {
return s;
}
protected Collection<String> filterBlacklisted(final Collection<String> set, final Set<String> ngramBlacklist) {
protected static Collection<String> filterBlacklisted(final Collection<String> set,
final Set<String> ngramBlacklist) {
final Set<String> newset = Sets.newLinkedHashSet();
for (final String s : set) {
if (!ngramBlacklist.contains(s)) {
@ -268,7 +270,7 @@ public abstract class AbstractPaceFunctions {
return m;
}
public String removeKeywords(String s, Set<String> keywords) {
public static String removeKeywords(String s, Set<String> keywords) {
s = " " + s + " ";
for (String k : keywords) {
@ -278,39 +280,39 @@ public abstract class AbstractPaceFunctions {
return s.trim();
}
public double commonElementsPercentage(Set<String> s1, Set<String> s2) {
public static double commonElementsPercentage(Set<String> s1, Set<String> s2) {
double longer = Math.max(s1.size(), s2.size());
return (double) s1.stream().filter(s2::contains).count() / longer;
}
// convert the set of keywords to codes
public Set<String> toCodes(Set<String> keywords, Map<String, String> translationMap) {
public static Set<String> toCodes(Set<String> keywords, Map<String, String> translationMap) {
return keywords.stream().map(s -> translationMap.get(s)).collect(Collectors.toSet());
}
public Set<String> keywordsToCodes(Set<String> keywords, Map<String, String> translationMap) {
public static Set<String> keywordsToCodes(Set<String> keywords, Map<String, String> translationMap) {
return toCodes(keywords, translationMap);
}
public Set<String> citiesToCodes(Set<String> keywords) {
public static Set<String> citiesToCodes(Set<String> keywords) {
return toCodes(keywords, cityMap);
}
protected String firstLC(final String s) {
protected static String firstLC(final String s) {
return StringUtils.substring(s, 0, 1).toLowerCase();
}
protected Iterable<String> tokens(final String s, final int maxTokens) {
protected static Iterable<String> tokens(final String s, final int maxTokens) {
return Iterables.limit(Splitter.on(" ").omitEmptyStrings().trimResults().split(s), maxTokens);
}
public String normalizePid(String pid) {
public static String normalizePid(String pid) {
return DOI_PREFIX.matcher(pid.toLowerCase()).replaceAll("");
}
// get the list of keywords into the input string
public Set<String> getKeywords(String s1, Map<String, String> translationMap, int windowSize) {
public static Set<String> getKeywords(String s1, Map<String, String> translationMap, int windowSize) {
String s = s1;
@ -340,7 +342,7 @@ public abstract class AbstractPaceFunctions {
return codes;
}
public Set<String> getCities(String s1, int windowSize) {
public static Set<String> getCities(String s1, int windowSize) {
return getKeywords(s1, cityMap, windowSize);
}

View File

@ -18,7 +18,7 @@ public class ClusteringDef implements Serializable {
private List<String> fields;
private Map<String, Integer> params;
private Map<String, Object> params;
public ClusteringDef() {
}
@ -43,11 +43,11 @@ public class ClusteringDef implements Serializable {
this.fields = fields;
}
public Map<String, Integer> getParams() {
public Map<String, Object> getParams() {
return params;
}
public void setParams(final Map<String, Integer> params) {
public void setParams(final Map<String, Object> params) {
this.params = params;
}

View File

@ -2,6 +2,7 @@
package eu.dnetlib.pace.model;
import java.io.Serializable;
import java.util.HashSet;
import java.util.List;
import com.fasterxml.jackson.core.JsonProcessingException;
@ -36,6 +37,16 @@ public class FieldDef implements Serializable {
*/
private int length = -1;
private HashSet<String> filter;
private boolean sorted;
public boolean isSorted() {
return sorted;
}
private String clean;
public FieldDef() {
}
@ -91,6 +102,30 @@ public class FieldDef implements Serializable {
this.path = path;
}
public HashSet<String> getFilter() {
return filter;
}
public void setFilter(HashSet<String> filter) {
this.filter = filter;
}
public boolean getSorted() {
return sorted;
}
public void setSorted(boolean sorted) {
this.sorted = sorted;
}
public String getClean() {
return clean;
}
public void setClean(String clean) {
this.clean = clean;
}
@Override
public String toString() {
try {

View File

@ -5,9 +5,9 @@ import eu.dnetlib.pace.util.{BlockProcessor, SparkReporter}
import org.apache.spark.SparkContext
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions.{col, lit, udf}
import org.apache.spark.sql.functions.{col, desc, expr, lit, udf}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, Dataset, Row, functions}
import org.apache.spark.sql.{Column, Dataset, Row, SaveMode, functions}
import java.util.function.Predicate
import java.util.stream.Collectors
@ -80,6 +80,8 @@ case class SparkDeduper(conf: DedupConfig) extends Serializable {
.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(inputColumns: _*))))
// Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
.withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(col(model.orderingFieldName), col(model.identifierFieldName))))
// .withColumn("count", functions.max("position").over(Window.partitionBy("key").orderBy(col(model.orderingFieldName), col(model.identifierFieldName)).rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing) ))
// .filter("count > 1")
if (df_with_clustering_keys == null)
df_with_clustering_keys = ds
@ -88,20 +90,44 @@ case class SparkDeduper(conf: DedupConfig) extends Serializable {
}
//TODO: analytics
/*df_with_clustering_keys.groupBy(col("clustering"), col("key"))
.agg(expr("max(count) AS size"))
.orderBy(desc("size"))
.show*/
val df_with_blocks = df_with_clustering_keys
// filter out rows with position exceeding the maxqueuesize parameter
.filter(col("position").leq(conf.getWf.getQueueMaxSize))
.groupBy("clustering", "key")
// split the clustering block into smaller blocks of queuemaxsize
.groupBy(col("clustering"), col("key"), functions.floor(col("position").divide(lit(conf.getWf.getQueueMaxSize))))
.agg(functions.collect_set(functions.struct(model.schema.fieldNames.map(col): _*)).as("block"))
.filter(functions.size(new Column("block")).gt(1))
.union(
//adjacency blocks
df_with_clustering_keys
// filter out leading and trailing elements
.filter(col("position").gt(conf.getWf.getSlidingWindowSize/2))
//.filter(col("position").lt(col("count").minus(conf.getWf.getSlidingWindowSize/2)))
// create small blocks of records on "the border" of maxqueuesize: getSlidingWindowSize/2 elements before and after
.filter(
col("position").mod(conf.getWf.getQueueMaxSize).lt(conf.getWf.getSlidingWindowSize/2) // slice of the start of block
|| col("position").mod(conf.getWf.getQueueMaxSize).gt(conf.getWf.getQueueMaxSize - (conf.getWf.getSlidingWindowSize/2)) //slice of the end of the block
)
.groupBy(col("clustering"), col("key"), functions.floor((col("position") + lit(conf.getWf.getSlidingWindowSize/2)).divide(lit(conf.getWf.getQueueMaxSize))))
.agg(functions.collect_set(functions.struct(model.schema.fieldNames.map(col): _*)).as("block"))
.filter(functions.size(new Column("block")).gt(1))
)
df_with_blocks
}
def clusterValuesUDF(cd: ClusteringDef) = {
udf[mutable.WrappedArray[String], mutable.WrappedArray[Any]](values => {
values.flatMap(f => cd.clusteringFunction().apply(conf, Seq(f.toString).asJava).asScala)
val valueList = values.flatMap {
case a: mutable.WrappedArray[Any] => a.map(_.toString)
case s: Any => Seq(s.toString)
}.asJava;
mutable.WrappedArray.make(cd.clusteringFunction().apply(conf, valueList).toArray())
})
}

View File

@ -1,13 +1,16 @@
package eu.dnetlib.pace.model
import com.jayway.jsonpath.{Configuration, JsonPath}
import eu.dnetlib.pace.common.AbstractPaceFunctions
import eu.dnetlib.pace.config.{DedupConfig, Type}
import eu.dnetlib.pace.util.MapDocumentUtil
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
import org.apache.spark.sql.{Dataset, Row}
import java.util.Locale
import java.util.regex.Pattern
import scala.collection.JavaConverters._
@ -60,7 +63,7 @@ case class SparkModel(conf: DedupConfig) {
values(identityFieldPosition) = MapDocumentUtil.getJPathString(conf.getWf.getIdPath, documentContext)
schema.fieldNames.zipWithIndex.foldLeft(values) {
case ((res, (fname, index))) => {
case ((res, (fname, index))) =>
val fdef = conf.getPace.getModelMap.get(fname)
if (fdef != null) {
@ -96,13 +99,52 @@ case class SparkModel(conf: DedupConfig) {
case Type.DoubleArray =>
MapDocumentUtil.getJPathArray(fdef.getPath, json)
}
val filter = fdef.getFilter
if (StringUtils.isNotBlank(fdef.getClean)) {
res(index) = res(index) match {
case x: Seq[String] => x.map(clean(_, fdef.getClean)).toSeq
case _ => clean(res(index).toString, fdef.getClean)
}
}
if (filter != null && !filter.isEmpty) {
res(index) = res(index) match {
case x: String if filter.contains(x.toLowerCase(Locale.ROOT)) => null
case x: Seq[String] => x.filter(s => !filter.contains(s.toLowerCase(Locale.ROOT))).toSeq
case _ => res(index)
}
}
if (fdef.getSorted) {
res(index) = res(index) match {
case x: Seq[String] => x.sorted.toSeq
case _ => res(index)
}
}
}
res
}
}
new GenericRowWithSchema(values, schema)
}
def clean(value: String, cleantype: String) : String = {
val res = cleantype match {
case "title" => AbstractPaceFunctions.cleanup(value)
case _ => value
}
// if (!res.equals(AbstractPaceFunctions.normalize(value))) {
// println(res)
// println(AbstractPaceFunctions.normalize(value))
// println()
// }
res
}
}

View File

@ -23,7 +23,6 @@ public class AuthorsMatch extends AbstractListComparator {
private String MODE; // full or surname
private int SIZE_THRESHOLD;
private String TYPE; // count or percentage
private int common;
public AuthorsMatch(Map<String, String> params) {
super(params, new com.wcohen.ss.JaroWinkler());
@ -35,7 +34,6 @@ public class AuthorsMatch extends AbstractListComparator {
FULLNAME_THRESHOLD = Double.parseDouble(params.getOrDefault("fullname_th", "0.9"));
SIZE_THRESHOLD = Integer.parseInt(params.getOrDefault("size_th", "20"));
TYPE = params.getOrDefault("type", "percentage");
common = 0;
}
protected AuthorsMatch(double w, AbstractStringDistance ssalgo) {
@ -44,22 +42,27 @@ public class AuthorsMatch extends AbstractListComparator {
@Override
public double compare(final List<String> a, final List<String> b, final Config conf) {
if (a.isEmpty() || b.isEmpty())
return -1;
if (a.size() > SIZE_THRESHOLD || b.size() > SIZE_THRESHOLD)
return 1.0;
List<Person> aList = a.stream().map(author -> new Person(author, false)).collect(Collectors.toList());
int maxMiss = Integer.MAX_VALUE;
List<Person> bList = b.stream().map(author -> new Person(author, false)).collect(Collectors.toList());
common = 0;
Double threshold = getDoubleParam("threshold");
if (threshold != null && threshold >= 0.0 && threshold <= 1.0 && a.size() == b.size()) {
maxMiss = (int) Math.floor((1 - threshold) * Math.max(a.size(), b.size()));
}
int common = 0;
// compare each element of List1 with each element of List2
for (Person p1 : aList)
for (int i = 0; i < a.size(); i++) {
Person p1 = new Person(a.get(i), false);
for (Person p2 : bList) {
// both persons are inaccurate
if (!p1.isAccurate() && !p2.isAccurate()) {
// compare just normalized fullnames
@ -118,11 +121,15 @@ public class AuthorsMatch extends AbstractListComparator {
}
}
}
if (i - common > maxMiss) {
return 0.0;
}
}
// normalization factor to compute the score
int normFactor = aList.size() == bList.size() ? aList.size() : (aList.size() + bList.size() - common);
int normFactor = a.size() == b.size() ? a.size() : (a.size() + b.size() - common);
if (TYPE.equals("percentage")) {
return (double) common / normFactor;

View File

@ -25,6 +25,7 @@ public class InstanceTypeMatch extends AbstractListComparator {
translationMap.put("Conference object", "*");
translationMap.put("Other literature type", "*");
translationMap.put("Unknown", "*");
translationMap.put("UNKNOWN", "*");
// article types
translationMap.put("Article", "Article");
@ -76,5 +77,4 @@ public class InstanceTypeMatch extends AbstractListComparator {
protected double normalize(final double d) {
return d;
}
}

View File

@ -3,6 +3,7 @@ package eu.dnetlib.pace.tree;
import java.util.Map;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
@ -30,16 +31,25 @@ public class LevensteinTitle extends AbstractStringComparator {
}
@Override
public double distance(final String a, final String b, final Config conf) {
final String ca = cleanup(a);
final String cb = cleanup(b);
public double distance(final String ca, final String cb, final Config conf) {
final boolean check = checkNumbers(ca, cb);
if (check)
return 0.5;
return normalize(ssalgo.score(ca, cb), ca.length(), cb.length());
Double threshold = getDoubleParam("threshold");
// reduce Levenshtein algo complexity when target threshold is known
if (threshold != null && threshold >= 0.0 && threshold <= 1.0) {
int maxdistance = (int) Math.floor((1 - threshold) * Math.max(ca.length(), cb.length()));
int score = StringUtils.getLevenshteinDistance(ca, cb, maxdistance);
if (score == -1) {
return 0;
}
return normalize(score, ca.length(), cb.length());
} else {
return normalize(StringUtils.getLevenshteinDistance(ca, cb), ca.length(), cb.length());
}
}
private double normalize(final double score, final int la, final int lb) {

View File

@ -0,0 +1,29 @@
package eu.dnetlib.pace.tree;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ComparatorClass("maxLengthMatch")
public class MaxLengthMatch extends AbstractStringComparator {
private final int limit;
public MaxLengthMatch(Map<String, String> params) {
super(params);
limit = Integer.parseInt(params.getOrDefault("limit", "200"));
}
@Override
public double compare(String a, String b, final Config conf) {
return a.length() < limit && b.length() < limit ? 1.0 : -1.0;
}
protected String toString(final Object object) {
return toFirstString(object);
}
}

View File

@ -127,4 +127,14 @@ public abstract class AbstractComparator<T> extends AbstractPaceFunctions implem
return this.weight;
}
public Double getDoubleParam(String name) {
String svalue = params.get(name);
try {
return Double.parseDouble(svalue);
} catch (Throwable t) {
}
return null;
}
}

View File

@ -67,8 +67,10 @@ public class BlockProcessor {
private void processRows(final List<Row> queue, final Reporter context) {
for (int pivotPos = 0; pivotPos < queue.size(); pivotPos++) {
final Row pivot = queue.get(pivotPos);
IncrementalConnectedComponents icc = new IncrementalConnectedComponents(queue.size());
for (int i = 0; i < queue.size(); i++) {
final Row pivot = queue.get(i);
final String idPivot = pivot.getString(identifierFieldPos); // identifier
final Object fieldsPivot = getJavaValue(pivot, orderFieldPos);
@ -76,9 +78,9 @@ public class BlockProcessor {
final WfConfig wf = dedupConf.getWf();
if (fieldPivot != null) {
int i = 0;
for (int windowPos = pivotPos + 1; windowPos < queue.size(); windowPos++) {
final Row curr = queue.get(windowPos);
for (int j = icc.nextUnconnected(i, i + 1); j >= 0
&& j < queue.size(); j = icc.nextUnconnected(i, j + 1)) {
final Row curr = queue.get(j);
final String idCurr = curr.getString(identifierFieldPos); // identifier
if (mustSkip(idCurr)) {
@ -86,7 +88,7 @@ public class BlockProcessor {
break;
}
if (++i > wf.getSlidingWindowSize()) {
if (wf.getSlidingWindowSize() > 0 && (j - i) > wf.getSlidingWindowSize()) {
break;
}
@ -97,7 +99,9 @@ public class BlockProcessor {
final TreeProcessor treeProcessor = new TreeProcessor(dedupConf);
emitOutput(treeProcessor.compare(pivot, curr), idPivot, idCurr, context);
if (emitOutput(treeProcessor.compare(pivot, curr), idPivot, idCurr, context)) {
icc.connect(i, j);
}
}
}
}
@ -115,7 +119,8 @@ public class BlockProcessor {
return null;
}
private void emitOutput(final boolean result, final String idPivot, final String idCurr, final Reporter context) {
private boolean emitOutput(final boolean result, final String idPivot, final String idCurr,
final Reporter context) {
if (result) {
if (idPivot.compareTo(idCurr) <= 0) {
@ -127,6 +132,8 @@ public class BlockProcessor {
} else {
context.incrementCounter(dedupConf.getWf().getEntityType(), "d < " + dedupConf.getWf().getThreshold(), 1);
}
return result;
}
private boolean mustSkip(final String idPivot) {
@ -142,5 +149,4 @@ public class BlockProcessor {
context.emit(type, from, to);
}
}

View File

@ -0,0 +1,50 @@
package eu.dnetlib.pace.util;
import java.util.BitSet;
public class IncrementalConnectedComponents {
final private int size;
final private BitSet[] indexes;
IncrementalConnectedComponents(int size) {
this.size = size;
this.indexes = new BitSet[size];
}
public void connect(int i, int j) {
if (indexes[i] == null) {
if (indexes[j] == null) {
indexes[i] = new BitSet(size);
} else {
indexes[i] = indexes[j];
}
} else {
if (indexes[j] != null && indexes[i] != indexes[j]) {
// merge adjacency lists for i and j
indexes[i].or(indexes[j]);
}
}
indexes[i].set(i);
indexes[i].set(j);
indexes[j] = indexes[i];
}
public int nextUnconnected(int i, int j) {
if (indexes[i] == null) {
return j;
}
int result = indexes[i].nextClearBit(j);
return (result >= size) ? -1 : result;
}
public BitSet getConnections(int i) {
if (indexes[i] == null) {
return null;
}
return indexes[i];
}
}

View File

@ -97,6 +97,8 @@ public class MapDocumentUtil {
Object o = json.read(jsonPath);
if (o instanceof String)
return (String) o;
if (o instanceof Number)
return (String) o.toString();
if (o instanceof JSONArray && ((JSONArray) o).size() > 0)
return (String) ((JSONArray) o).get(0);
return "";

View File

@ -40,7 +40,7 @@ public class PaceResolver implements Serializable {
Collectors.toMap(cl -> cl.getAnnotation(ComparatorClass.class).value(), cl -> (Class<Comparator>) cl));
}
public ClusteringFunction getClusteringFunction(String name, Map<String, Integer> params) throws PaceException {
public ClusteringFunction getClusteringFunction(String name, Map<String, Object> params) throws PaceException {
try {
return clusteringFunctions.get(name).getDeclaredConstructor(Map.class).newInstance(params);
} catch (InstantiationException | IllegalAccessException | InvocationTargetException

View File

@ -15,7 +15,7 @@ import eu.dnetlib.pace.config.DedupConfig;
public class ClusteringFunctionTest extends AbstractPaceTest {
private static Map<String, Integer> params;
private static Map<String, Object> params;
private static DedupConfig conf;
@BeforeAll
@ -40,10 +40,10 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test
public void testNgram() {
params.put("ngramLen", 3);
params.put("max", 8);
params.put("maxPerToken", 2);
params.put("minNgramLen", 1);
params.put("ngramLen", "3");
params.put("max", "8");
params.put("maxPerToken", "2");
params.put("minNgramLen", "1");
final ClusteringFunction ngram = new Ngrams(params);
@ -54,8 +54,8 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test
public void testNgramPairs() {
params.put("ngramLen", 3);
params.put("max", 2);
params.put("ngramLen", "3");
params.put("max", "2");
final ClusteringFunction np = new NgramPairs(params);
@ -66,8 +66,8 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test
public void testSortedNgramPairs() {
params.put("ngramLen", 3);
params.put("max", 2);
params.put("ngramLen", "3");
params.put("max", "2");
final ClusteringFunction np = new SortedNgramPairs(params);
@ -87,9 +87,9 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test
public void testAcronym() {
params.put("max", 4);
params.put("minLen", 1);
params.put("maxLen", 3);
params.put("max", "4");
params.put("minLen", "1");
params.put("maxLen", "3");
final ClusteringFunction acro = new Acronyms(params);
@ -100,8 +100,8 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test
public void testSuffixPrefix() {
params.put("len", 3);
params.put("max", 4);
params.put("len", "3");
params.put("max", "4");
final ClusteringFunction sp = new SuffixPrefix(params);
@ -109,8 +109,8 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
System.out.println(s);
System.out.println(sp.apply(conf, Lists.newArrayList(s)));
params.put("len", 3);
params.put("max", 1);
params.put("len", "3");
params.put("max", "1");
System.out.println(sp.apply(conf, Lists.newArrayList("Framework for general-purpose deduplication")));
}
@ -118,8 +118,8 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test
public void testWordsSuffixPrefix() {
params.put("len", 3);
params.put("max", 4);
params.put("len", "3");
params.put("max", "4");
final ClusteringFunction sp = new WordsSuffixPrefix(params);
@ -130,7 +130,7 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test
public void testWordsStatsSuffixPrefix() {
params.put("mod", 10);
params.put("mod", "10");
final ClusteringFunction sp = new WordsStatsSuffixPrefixChain(params);
@ -167,7 +167,7 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
@Test
public void testFieldValue() {
params.put("randomLength", 5);
params.put("randomLength", "5");
final ClusteringFunction sp = new SpaceTrimmingFieldValue(params);

View File

@ -0,0 +1,40 @@
package eu.dnetlib.pace.util;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNull;
import org.junit.jupiter.api.Test;
public class IncrementalConnectedComponentsTest {
@Test
public void transitiveClosureTest() {
IncrementalConnectedComponents icc = new IncrementalConnectedComponents(10);
icc.connect(0, 1);
icc.connect(0, 2);
icc.connect(0, 3);
icc.connect(1, 2);
icc.connect(1, 4);
icc.connect(1, 5);
icc.connect(6, 7);
icc.connect(6, 9);
assertEquals(icc.getConnections(0).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(1).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(2).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(3).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(4).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(5).toString(), "{0, 1, 2, 3, 4, 5}");
assertEquals(icc.getConnections(6).toString(), "{6, 7, 9}");
assertEquals(icc.getConnections(7).toString(), "{6, 7, 9}");
assertEquals(icc.getConnections(9).toString(), "{6, 7, 9}");
assertNull(icc.getConnections(8));
}
}

View File

@ -0,0 +1,39 @@
/*
* Copyright (c) 2024.
* SPDX-FileCopyrightText: © 2023 Consiglio Nazionale delle Ricerche
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
package eu.dnetlib.dhp.actionmanager.promote;
/** Encodes the Actionset promotion strategies */
public class PromoteAction {
/** The supported actionset promotion strategies
*
* ENRICH: promotes only records in the actionset matching another record in the
* graph and enriches them applying the given MergeAndGet strategy
* UPSERT: promotes all the records in an actionset, matching records are updated
* using the given MergeAndGet strategy, the non-matching record as inserted as they are.
*/
public enum Strategy {
ENRICH, UPSERT
}
/**
* Returns the string representation of the join type implementing the given PromoteAction.
*
* @param strategy the strategy to be used to promote the Actionset contents
* @return the join type used to implement the promotion strategy
*/
public static String joinTypeForStrategy(PromoteAction.Strategy strategy) {
switch (strategy) {
case ENRICH:
return "left_outer";
case UPSERT:
return "full_outer";
default:
throw new IllegalStateException("unsupported PromoteAction: " + strategy.toString());
}
}
}

View File

@ -67,8 +67,9 @@ public class PromoteActionPayloadForGraphTableJob {
String outputGraphTablePath = parser.get("outputGraphTablePath");
logger.info("outputGraphTablePath: {}", outputGraphTablePath);
MergeAndGet.Strategy strategy = MergeAndGet.Strategy.valueOf(parser.get("mergeAndGetStrategy").toUpperCase());
logger.info("strategy: {}", strategy);
MergeAndGet.Strategy mergeAndGetStrategy = MergeAndGet.Strategy
.valueOf(parser.get("mergeAndGetStrategy").toUpperCase());
logger.info("mergeAndGetStrategy: {}", mergeAndGetStrategy);
Boolean shouldGroupById = Optional
.ofNullable(parser.get("shouldGroupById"))
@ -76,6 +77,12 @@ public class PromoteActionPayloadForGraphTableJob {
.orElse(true);
logger.info("shouldGroupById: {}", shouldGroupById);
PromoteAction.Strategy promoteActionStrategy = Optional
.ofNullable(parser.get("promoteActionStrategy"))
.map(PromoteAction.Strategy::valueOf)
.orElse(PromoteAction.Strategy.UPSERT);
logger.info("promoteActionStrategy: {}", promoteActionStrategy);
@SuppressWarnings("unchecked")
Class<? extends Oaf> rowClazz = (Class<? extends Oaf>) Class.forName(graphTableClassName);
@SuppressWarnings("unchecked")
@ -97,7 +104,8 @@ public class PromoteActionPayloadForGraphTableJob {
inputGraphTablePath,
inputActionPayloadPath,
outputGraphTablePath,
strategy,
mergeAndGetStrategy,
promoteActionStrategy,
rowClazz,
actionPayloadClazz,
shouldGroupById);
@ -124,14 +132,16 @@ public class PromoteActionPayloadForGraphTableJob {
String inputGraphTablePath,
String inputActionPayloadPath,
String outputGraphTablePath,
MergeAndGet.Strategy strategy,
MergeAndGet.Strategy mergeAndGetStrategy,
PromoteAction.Strategy promoteActionStrategy,
Class<G> rowClazz,
Class<A> actionPayloadClazz, Boolean shouldGroupById) {
Dataset<G> rowDS = readGraphTable(spark, inputGraphTablePath, rowClazz);
Dataset<A> actionPayloadDS = readActionPayload(spark, inputActionPayloadPath, actionPayloadClazz);
Dataset<G> result = promoteActionPayloadForGraphTable(
rowDS, actionPayloadDS, strategy, rowClazz, actionPayloadClazz, shouldGroupById)
rowDS, actionPayloadDS, mergeAndGetStrategy, promoteActionStrategy, rowClazz, actionPayloadClazz,
shouldGroupById)
.map((MapFunction<G, G>) value -> value, Encoders.bean(rowClazz));
saveGraphTable(result, outputGraphTablePath);
@ -183,7 +193,8 @@ public class PromoteActionPayloadForGraphTableJob {
private static <G extends Oaf, A extends Oaf> Dataset<G> promoteActionPayloadForGraphTable(
Dataset<G> rowDS,
Dataset<A> actionPayloadDS,
MergeAndGet.Strategy strategy,
MergeAndGet.Strategy mergeAndGetStrategy,
PromoteAction.Strategy promoteActionStrategy,
Class<G> rowClazz,
Class<A> actionPayloadClazz,
Boolean shouldGroupById) {
@ -195,8 +206,9 @@ public class PromoteActionPayloadForGraphTableJob {
SerializableSupplier<Function<G, String>> rowIdFn = ModelSupport::idFn;
SerializableSupplier<Function<A, String>> actionPayloadIdFn = ModelSupport::idFn;
SerializableSupplier<BiFunction<G, A, G>> mergeRowWithActionPayloadAndGetFn = MergeAndGet.functionFor(strategy);
SerializableSupplier<BiFunction<G, G, G>> mergeRowsAndGetFn = MergeAndGet.functionFor(strategy);
SerializableSupplier<BiFunction<G, A, G>> mergeRowWithActionPayloadAndGetFn = MergeAndGet
.functionFor(mergeAndGetStrategy);
SerializableSupplier<BiFunction<G, G, G>> mergeRowsAndGetFn = MergeAndGet.functionFor(mergeAndGetStrategy);
SerializableSupplier<G> zeroFn = zeroFn(rowClazz);
SerializableSupplier<Function<G, Boolean>> isNotZeroFn = PromoteActionPayloadForGraphTableJob::isNotZeroFnUsingIdOrSourceAndTarget;
@ -207,6 +219,7 @@ public class PromoteActionPayloadForGraphTableJob {
rowIdFn,
actionPayloadIdFn,
mergeRowWithActionPayloadAndGetFn,
promoteActionStrategy,
rowClazz,
actionPayloadClazz);

View File

@ -34,6 +34,7 @@ public class PromoteActionPayloadFunctions {
* @param rowIdFn Function used to get the id of graph table row
* @param actionPayloadIdFn Function used to get id of action payload instance
* @param mergeAndGetFn Function used to merge graph table row and action payload instance
* @param promoteActionStrategy the Actionset promotion strategy
* @param rowClazz Class of graph table
* @param actionPayloadClazz Class of action payload
* @param <G> Type of graph table row
@ -46,6 +47,7 @@ public class PromoteActionPayloadFunctions {
SerializableSupplier<Function<G, String>> rowIdFn,
SerializableSupplier<Function<A, String>> actionPayloadIdFn,
SerializableSupplier<BiFunction<G, A, G>> mergeAndGetFn,
PromoteAction.Strategy promoteActionStrategy,
Class<G> rowClazz,
Class<A> actionPayloadClazz) {
if (!isSubClass(rowClazz, actionPayloadClazz)) {
@ -61,7 +63,7 @@ public class PromoteActionPayloadFunctions {
.joinWith(
actionPayloadWithIdDS,
rowWithIdDS.col("_1").equalTo(actionPayloadWithIdDS.col("_1")),
"full_outer")
PromoteAction.joinTypeForStrategy(promoteActionStrategy))
.map(
(MapFunction<Tuple2<Tuple2<String, G>, Tuple2<String, A>>, G>) value -> {
Optional<G> rowOpt = Optional.ofNullable(value._1()).map(Tuple2::_2);

View File

@ -41,6 +41,12 @@
"paramDescription": "strategy for merging graph table objects with action payload instances, MERGE_FROM_AND_GET or SELECT_NEWER_AND_GET",
"paramRequired": true
},
{
"paramName": "pas",
"paramLongName": "promoteActionStrategy",
"paramDescription": "strategy for promoting the actionset contents into the graph tables, ENRICH or UPSERT (default)",
"paramRequired": false
},
{
"paramName": "sgid",
"paramLongName": "shouldGroupById",

View File

@ -115,6 +115,7 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Dataset</arg>
<arg>--outputGraphTablePath</arg><arg>${workingDir}/dataset</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
</spark>
<ok to="DecisionPromoteResultActionPayloadForDatasetTable"/>
@ -167,6 +168,7 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Result</arg>
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/dataset</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
</spark>
<ok to="End"/>

View File

@ -106,6 +106,7 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Datasource</arg>
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/datasource</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
</spark>
<ok to="End"/>
<error to="Kill"/>

View File

@ -106,6 +106,7 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Organization</arg>
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/organization</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
</spark>
<ok to="End"/>
<error to="Kill"/>

View File

@ -114,6 +114,7 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.OtherResearchProduct</arg>
<arg>--outputGraphTablePath</arg><arg>${workingDir}/otherresearchproduct</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
</spark>
<ok to="DecisionPromoteResultActionPayloadForOtherResearchProductTable"/>
@ -166,6 +167,7 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Result</arg>
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/otherresearchproduct</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
</spark>
<ok to="End"/>

View File

@ -106,6 +106,7 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Project</arg>
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/project</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
</spark>
<ok to="End"/>
<error to="Kill"/>

View File

@ -115,6 +115,7 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Publication</arg>
<arg>--outputGraphTablePath</arg><arg>${workingDir}/publication</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
</spark>
<ok to="DecisionPromoteResultActionPayloadForPublicationTable"/>
@ -167,6 +168,7 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Result</arg>
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/publication</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
</spark>
<ok to="End"/>

View File

@ -107,6 +107,7 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Relation</arg>
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/relation</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
</spark>
<ok to="End"/>
<error to="Kill"/>

View File

@ -114,6 +114,7 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Software</arg>
<arg>--outputGraphTablePath</arg><arg>${workingDir}/software</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
</spark>
<ok to="DecisionPromoteResultActionPayloadForSoftwareTable"/>
@ -166,6 +167,7 @@
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Result</arg>
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/software</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
</spark>
<ok to="End"/>

View File

@ -54,7 +54,7 @@ public class PromoteActionPayloadFunctionsTest {
RuntimeException.class,
() -> PromoteActionPayloadFunctions
.joinGraphTableWithActionPayloadAndMerge(
null, null, null, null, null, OafImplSubSub.class, OafImpl.class));
null, null, null, null, null, null, OafImplSubSub.class, OafImpl.class));
}
@Test
@ -104,6 +104,7 @@ public class PromoteActionPayloadFunctionsTest {
rowIdFn,
actionPayloadIdFn,
mergeAndGetFn,
PromoteAction.Strategy.UPSERT,
OafImplSubSub.class,
OafImplSubSub.class)
.collectAsList();
@ -183,6 +184,7 @@ public class PromoteActionPayloadFunctionsTest {
rowIdFn,
actionPayloadIdFn,
mergeAndGetFn,
PromoteAction.Strategy.UPSERT,
OafImplSubSub.class,
OafImplSub.class)
.collectAsList();

View File

@ -124,8 +124,19 @@ public class PrepareFOSSparkJob implements Serializable {
FOSDataModel first) {
level1.add(first.getLevel1());
level2.add(first.getLevel2());
level3.add(first.getLevel3() + "@@" + first.getScoreL3());
level4.add(first.getLevel4() + "@@" + first.getScoreL4());
if (Optional.ofNullable(first.getLevel3()).isPresent() &&
!first.getLevel3().equalsIgnoreCase(NA) && !first.getLevel3().equalsIgnoreCase(NULL)
&& first.getLevel3() != null)
level3.add(first.getLevel3() + "@@" + first.getScoreL3());
else
level3.add(NULL);
if (Optional.ofNullable(first.getLevel4()).isPresent() &&
!first.getLevel4().equalsIgnoreCase(NA) &&
!first.getLevel4().equalsIgnoreCase(NULL) &&
first.getLevel4() != null)
level4.add(first.getLevel4() + "@@" + first.getScoreL4());
else
level4.add(NULL);
}
}

View File

@ -19,6 +19,7 @@ import org.slf4j.LoggerFactory;
import eu.dnetlib.dhp.aggregation.common.ReporterCallback;
import eu.dnetlib.dhp.aggregation.common.ReportingJob;
import eu.dnetlib.dhp.collection.plugin.CollectorPlugin;
import eu.dnetlib.dhp.collection.plugin.base.BaseCollectorPlugin;
import eu.dnetlib.dhp.collection.plugin.file.FileCollectorPlugin;
import eu.dnetlib.dhp.collection.plugin.file.FileGZipCollectorPlugin;
import eu.dnetlib.dhp.collection.plugin.mongodb.MDStoreCollectorPlugin;
@ -120,6 +121,8 @@ public class CollectorWorker extends ReportingJob {
return new FileCollectorPlugin(fileSystem);
case fileGzip:
return new FileGZipCollectorPlugin(fileSystem);
case baseDump:
return new BaseCollectorPlugin(this.fileSystem);
case other:
final CollectorPlugin.NAME.OTHER_NAME plugin = Optional
.ofNullable(api.getParams().get("other_plugin_type"))

View File

@ -10,7 +10,8 @@ import eu.dnetlib.dhp.common.collection.CollectorException;
public interface CollectorPlugin {
enum NAME {
oai, other, rest_json2xml, file, fileGzip;
oai, other, rest_json2xml, file, fileGzip, baseDump;
public enum OTHER_NAME {
mdstore_mongodb_dump, mdstore_mongodb

View File

@ -0,0 +1,171 @@
package eu.dnetlib.dhp.collection.plugin.base;
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.util.Iterator;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.CompressorInputStream;
import org.apache.commons.compress.compressors.CompressorStreamFactory;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import eu.dnetlib.dhp.common.aggregation.AggregatorReport;
public class BaseCollectorIterator implements Iterator<String> {
private String nextElement;
private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(100);
private static final Logger log = LoggerFactory.getLogger(BaseCollectorIterator.class);
private static final String END_ELEM = "__END__";
public BaseCollectorIterator(final FileSystem fs, final Path filePath, final AggregatorReport report) {
new Thread(() -> importHadoopFile(fs, filePath, report)).start();
try {
this.nextElement = this.queue.take();
} catch (final InterruptedException e) {
throw new RuntimeException(e);
}
}
protected BaseCollectorIterator(final String resourcePath, final AggregatorReport report) {
new Thread(() -> importTestFile(resourcePath, report)).start();
try {
this.nextElement = this.queue.take();
} catch (final InterruptedException e) {
throw new RuntimeException(e);
}
}
@Override
public synchronized boolean hasNext() {
return (this.nextElement != null) & !END_ELEM.equals(this.nextElement);
}
@Override
public synchronized String next() {
try {
return END_ELEM.equals(this.nextElement) ? null : this.nextElement;
} finally {
try {
this.nextElement = this.queue.take();
} catch (final InterruptedException e) {
throw new RuntimeException(e);
}
}
}
private void importHadoopFile(final FileSystem fs, final Path filePath, final AggregatorReport report) {
log.info("I start to read the TAR stream");
try (InputStream origInputStream = fs.open(filePath);
final TarArchiveInputStream tarInputStream = new TarArchiveInputStream(origInputStream)) {
importTarStream(tarInputStream, report);
} catch (final Throwable e) {
throw new RuntimeException("Error processing BASE records", e);
}
}
private void importTestFile(final String resourcePath, final AggregatorReport report) {
try (final InputStream origInputStream = BaseCollectorIterator.class.getResourceAsStream(resourcePath);
final TarArchiveInputStream tarInputStream = new TarArchiveInputStream(origInputStream)) {
importTarStream(tarInputStream, report);
} catch (final Throwable e) {
throw new RuntimeException("Error processing BASE records", e);
}
}
private void importTarStream(final TarArchiveInputStream tarInputStream, final AggregatorReport report) {
long count = 0;
final XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
final XMLOutputFactory xmlOutputFactory = XMLOutputFactory.newInstance();
try {
TarArchiveEntry entry;
while ((entry = (TarArchiveEntry) tarInputStream.getNextEntry()) != null) {
final String name = entry.getName();
if (!entry.isDirectory() && name.contains("ListRecords") && name.endsWith(".bz2")) {
log.info("Processing file (BZIP): " + name);
final byte[] bzipData = new byte[(int) entry.getSize()];
IOUtils.readFully(tarInputStream, bzipData);
try (InputStream bzipIs = new ByteArrayInputStream(bzipData);
final BufferedInputStream bzipBis = new BufferedInputStream(bzipIs);
final CompressorInputStream bzipInput = new CompressorStreamFactory()
.createCompressorInputStream(bzipBis)) {
final XMLEventReader reader = xmlInputFactory.createXMLEventReader(bzipInput);
XMLEventWriter eventWriter = null;
StringWriter xmlWriter = null;
while (reader.hasNext()) {
final XMLEvent nextEvent = reader.nextEvent();
if (nextEvent.isStartElement()) {
final StartElement startElement = nextEvent.asStartElement();
if ("record".equals(startElement.getName().getLocalPart())) {
xmlWriter = new StringWriter();
eventWriter = xmlOutputFactory.createXMLEventWriter(xmlWriter);
}
}
if (eventWriter != null) {
eventWriter.add(nextEvent);
}
if (nextEvent.isEndElement()) {
final EndElement endElement = nextEvent.asEndElement();
if ("record".equals(endElement.getName().getLocalPart())) {
eventWriter.flush();
eventWriter.close();
this.queue.put(xmlWriter.toString());
eventWriter = null;
xmlWriter = null;
count++;
}
}
}
}
}
}
this.queue.put(END_ELEM); // TO INDICATE THE END OF THE QUEUE
} catch (final Throwable e) {
log.error("Error processing BASE records", e);
report.put(e.getClass().getName(), e.getMessage());
throw new RuntimeException("Error processing BASE records", e);
} finally {
log.info("Total records (written in queue): " + count);
}
}
}

View File

@ -0,0 +1,159 @@
package eu.dnetlib.dhp.collection.plugin.base;
import java.io.IOException;
import java.sql.SQLException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Optional;
import java.util.Set;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.DocumentHelper;
import org.dom4j.Node;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import eu.dnetlib.dhp.collection.ApiDescriptor;
import eu.dnetlib.dhp.collection.plugin.CollectorPlugin;
import eu.dnetlib.dhp.collection.plugin.file.AbstractSplittedRecordPlugin;
import eu.dnetlib.dhp.common.DbClient;
import eu.dnetlib.dhp.common.aggregation.AggregatorReport;
import eu.dnetlib.dhp.common.collection.CollectorException;
public class BaseCollectorPlugin implements CollectorPlugin {
private final FileSystem fs;
private static final Logger log = LoggerFactory.getLogger(AbstractSplittedRecordPlugin.class);
// MAPPING AND FILTERING ARE DEFINED HERE:
// https://docs.google.com/document/d/1Aj-ZAV11b44MCrAAUCPiS2TUlXb6PnJEu1utCMAcCOU/edit
public BaseCollectorPlugin(final FileSystem fs) {
this.fs = fs;
}
@Override
public Stream<String> collect(final ApiDescriptor api, final AggregatorReport report) throws CollectorException {
// the path of the dump file on HDFS
// http://oai.base-search.net/initial_load/base_oaipmh_dump-current.tar
// it could be downloaded from iis-cdh5-test-gw.ocean.icm.edu.pl and then copied on HDFS
final Path filePath = Optional
.ofNullable(api.getBaseUrl())
.map(Path::new)
.orElseThrow(() -> new CollectorException("missing baseUrl"));
// get the parameters for the connection to the OpenAIRE database.
// the database is used to obtain the list of the datasources that the plugin will collect
final String dbUrl = api.getParams().get("dbUrl");
final String dbUser = api.getParams().get("dbUser");
final String dbPassword = api.getParams().get("dbPassword");
// the types(comma separated, empty value for all) that the plugin will collect,
// the types should be expressed in the format of the normalized types of BASE (for example 1,121,...)
final String acceptedNormTypesString = api.getParams().get("acceptedNormTypes");
log.info("baseUrl: {}", filePath);
log.info("dbUrl: {}", dbUrl);
log.info("dbUser: {}", dbUser);
log.info("dbPassword: {}", "***");
log.info("acceptedNormTypes: {}", acceptedNormTypesString);
try {
if (!this.fs.exists(filePath)) {
throw new CollectorException("path does not exist: " + filePath);
}
} catch (final Throwable e) {
throw new CollectorException(e);
}
final Set<String> acceptedOpendoarIds = findAcceptedOpendoarIds(dbUrl, dbUser, dbPassword);
final Set<String> acceptedNormTypes = new HashSet<>();
if (StringUtils.isNotBlank(acceptedNormTypesString)) {
for (final String s : StringUtils.split(acceptedNormTypesString, ",")) {
if (StringUtils.isNotBlank(s)) {
acceptedNormTypes.add(s.trim());
}
}
}
final Iterator<String> iterator = new BaseCollectorIterator(this.fs, filePath, report);
final Spliterator<String> spliterator = Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED);
return StreamSupport
.stream(spliterator, false)
.filter(doc -> filterXml(doc, acceptedOpendoarIds, acceptedNormTypes));
}
private Set<String> findAcceptedOpendoarIds(final String dbUrl, final String dbUser, final String dbPassword)
throws CollectorException {
final Set<String> accepted = new HashSet<>();
try (final DbClient dbClient = new DbClient(dbUrl, dbUser, dbPassword)) {
final String sql = IOUtils
.toString(
getClass().getResourceAsStream("/eu/dnetlib/dhp/collection/plugin/base/sql/opendoar-accepted.sql"));
dbClient.processResults(sql, row -> {
try {
final String dsId = row.getString("id");
log.info("Accepted Datasource: " + dsId);
accepted.add(dsId);
} catch (final SQLException e) {
log.error("Error in SQL", e);
throw new RuntimeException("Error in SQL", e);
}
});
} catch (final IOException e) {
log.error("Error accessong SQL", e);
throw new CollectorException("Error accessong SQL", e);
}
log.info("Accepted Datasources (TOTAL): " + accepted.size());
return accepted;
}
protected static boolean filterXml(final String xml,
final Set<String> acceptedOpendoarIds,
final Set<String> acceptedNormTypes) {
try {
final Document doc = DocumentHelper.parseText(xml);
final String id = doc.valueOf("//*[local-name()='collection']/@opendoar_id").trim();
if (StringUtils.isBlank(id) || !acceptedOpendoarIds.contains("opendoar____::" + id)) {
return false;
}
if (acceptedNormTypes.isEmpty()) {
return true;
}
for (final Object s : doc.selectNodes("//*[local-name()='typenorm']")) {
if (acceptedNormTypes.contains(((Node) s).getText().trim())) {
return true;
}
}
return false;
} catch (final DocumentException e) {
log.error("Error parsing document", e);
throw new RuntimeException("Error parsing document", e);
}
}
}

View File

@ -0,0 +1,114 @@
BEGIN;
INSERT INTO dsm_services(
_dnet_resource_identifier_,
id,
officialname,
englishname,
namespaceprefix,
websiteurl,
logourl,
platform,
contactemail,
collectedfrom,
provenanceaction,
_typology_to_remove_,
eosc_type,
eosc_datasource_type,
research_entity_types,
thematic
) VALUES (
'openaire____::base_search',
'openaire____::base_search',
'Bielefeld Academic Search Engine (BASE)',
'Bielefeld Academic Search Engine (BASE)',
'base_search_',
'https://www.base-search.net',
'https://www.base-search.net/about/download/logo_224x57_white.gif',
'BASE',
'openaire-helpdesk@uni-bielefeld.de',
'infrastruct_::openaire',
'user:insert',
'aggregator::pubsrepository::unknown',
'Data Source',
'Aggregator',
ARRAY['Research Products'],
false
);
INSERT INTO dsm_service_organization(
_dnet_resource_identifier_,
organization,
service
) VALUES (
'fairsharing_::org::214@@openaire____::base_search',
'fairsharing_::org::214',
'openaire____::base_search'
);
INSERT INTO dsm_api(
_dnet_resource_identifier_,
id,
service,
protocol,
baseurl,
metadata_identifier_path
) VALUES (
'api_________::openaire____::base_search::dump',
'api_________::openaire____::base_search::dump',
'openaire____::base_search',
'baseDump',
'/user/michele.artini/base-import/base_oaipmh_dump-current.tar',
'//*[local-name()=''header'']/*[local-name()=''identifier'']'
);
INSERT INTO dsm_apiparams(
_dnet_resource_identifier_,
api,
param,
value
) VALUES (
'api_________::openaire____::base_search::dump@@dbUrl',
'api_________::openaire____::base_search::dump',
'dbUrl',
'jdbc:postgresql://postgresql.services.openaire.eu:5432/dnet_openaireplus'
);
INSERT INTO dsm_apiparams(
_dnet_resource_identifier_,
api,
param,
value
) VALUES (
'api_________::openaire____::base_search::dump@@dbUser',
'api_________::openaire____::base_search::dump',
'dbUser',
'dnet'
);
INSERT INTO dsm_apiparams(
_dnet_resource_identifier_,
api,
param,
value
) VALUES (
'api_________::openaire____::base_search::dump@@dbPassword',
'api_________::openaire____::base_search::dump',
'dbPassword',
'***'
);
INSERT INTO dsm_apiparams(
_dnet_resource_identifier_,
api,
param,
value
) VALUES (
'api_________::openaire____::base_search::dump@@acceptedNormTypes',
'api_________::openaire____::base_search::dump',
'acceptedNormTypes',
'1,11,111,121,13,14,15,18,181,182,183,1A,6,7'
);
COMMIT;

View File

@ -0,0 +1,9 @@
select s.id as id
from dsm_services s
where collectedfrom = 'openaire____::opendoar'
and jurisdiction = 'Institutional'
and s.id in (
select service from dsm_api where coalesce(compatibility_override, compatibility) = 'driver' or coalesce(compatibility_override, compatibility) = 'UNKNOWN'
) and s.id not in (
select service from dsm_api where coalesce(compatibility_override, compatibility) like '%openaire%'
);

View File

@ -0,0 +1,11 @@
select
s.id as id,
s.jurisdiction as jurisdiction,
array_remove(array_agg(a.id || ' (compliance: ' || coalesce(a.compatibility_override, a.compatibility, 'UNKNOWN') || ')@@@' || coalesce(a.last_collection_total, 0)), NULL) as aggregations
from
dsm_services s
join dsm_api a on (s.id = a.service)
where
collectedfrom = 'openaire____::opendoar'
group by
s.id;

View File

@ -0,0 +1,180 @@
<RESOURCE_PROFILE>
<HEADER>
<RESOURCE_IDENTIFIER value="c67911d6-9988-4a3b-b965-7d39bdd4a31d_Vm9jYWJ1bGFyeURTUmVzb3VyY2VzL1ZvY2FidWxhcnlEU1Jlc291cmNlVHlwZQ==" />
<RESOURCE_TYPE value="VocabularyDSResourceType" />
<RESOURCE_KIND value="VocabularyDSResources" />
<RESOURCE_URI value="" />
<DATE_OF_CREATION value="2024-02-13T11:15:48+00:00" />
</HEADER>
<BODY>
<CONFIGURATION>
<VOCABULARY_NAME code="base:normalized_types">base:normalized_types</VOCABULARY_NAME>
<VOCABULARY_DESCRIPTION>base:normalized_types</VOCABULARY_DESCRIPTION>
<TERMS>
<TERM native_name="Text" code="Text" english_name="Text" encoding="BASE">
<SYNONYMS>
<SYNONYM term="1" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Book" code="Book" english_name="Book" encoding="BASE">
<SYNONYMS>
<SYNONYM term="11" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Book part" code="Book part" english_name="Book part" encoding="BASE">
<SYNONYMS>
<SYNONYM term="111" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Journal/Newspaper" code="Journal/Newspaper" english_name="Journal/Newspaper" encoding="BASE">
<SYNONYMS>
<SYNONYM term="12" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Article contribution" code="Article contribution" english_name="Article contribution" encoding="BASE">
<SYNONYMS>
<SYNONYM term="121" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Other non-article" code="Other non-article" english_name="Other non-article" encoding="BASE">
<SYNONYMS>
<SYNONYM term="122" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Conference object" code="Conference object" english_name="Conference object" encoding="BASE">
<SYNONYMS>
<SYNONYM term="13" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Report" code="Report" english_name="Report" encoding="BASE">
<SYNONYMS>
<SYNONYM term="14" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Review" code="Review" english_name="Review" encoding="BASE">
<SYNONYMS>
<SYNONYM term="15" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Course material" code="Course material" english_name="Course material" encoding="BASE">
<SYNONYMS>
<SYNONYM term="16" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Lecture" code="Lecture" english_name="Lecture" encoding="BASE">
<SYNONYMS>
<SYNONYM term="17" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Thesis" code="Thesis" english_name="Thesis" encoding="BASE">
<SYNONYMS>
<SYNONYM term="18" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Bachelor's thesis" code="Bachelor's thesis" english_name="Bachelor's thesis" encoding="BASE">
<SYNONYMS>
<SYNONYM term="181" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Master's thesis" code="Master's thesis" english_name="Master's thesis" encoding="BASE">
<SYNONYMS>
<SYNONYM term="182" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Doctoral and postdoctoral thesis" code="Doctoral and postdoctoral thesis" english_name="Doctoral and postdoctoral thesis" encoding="BASE">
<SYNONYMS>
<SYNONYM term="183" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Manuscript" code="Manuscript" english_name="Manuscript" encoding="BASE">
<SYNONYMS>
<SYNONYM term="19" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Patent" code="Patent" english_name="Patent" encoding="BASE">
<SYNONYMS>
<SYNONYM term="1A" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Musical notation" code="Musical notation" english_name="Musical notation" encoding="BASE">
<SYNONYMS>
<SYNONYM term="2" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Map" code="Map" english_name="Map" encoding="BASE">
<SYNONYMS>
<SYNONYM term="3" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Audio" code="Audio" english_name="Audio" encoding="BASE">
<SYNONYMS>
<SYNONYM term="4" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Image/Video" code="Image/Video" english_name="Image/Video" encoding="BASE">
<SYNONYMS>
<SYNONYM term="5" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Still image" code="Still image" english_name="Still image" encoding="BASE">
<SYNONYMS>
<SYNONYM term="51" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Moving image/Video" code="Moving image/Video" english_name="Moving image/Video" encoding="BASE">
<SYNONYMS>
<SYNONYM term="52" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Software" code="Software" english_name="Software" encoding="BASE">
<SYNONYMS>
<SYNONYM term="6" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Dataset" code="Dataset" english_name="Dataset" encoding="BASE">
<SYNONYMS>
<SYNONYM term="7" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
<TERM native_name="Unknown" code="Unknown" english_name="Unknown" encoding="BASE">
<SYNONYMS>
<SYNONYM term="F" encoding="BASE" />
</SYNONYMS>
<RELATIONS />
</TERM>
</TERMS>
</CONFIGURATION>
<STATUS>
<LAST_UPDATE value="2013-11-18T10:46:36Z" />
</STATUS>
<SECURITY_PARAMETERS>String</SECURITY_PARAMETERS>
</BODY>
</RESOURCE_PROFILE>

View File

@ -0,0 +1,302 @@
<RESOURCE_PROFILE>
<HEADER>
<RESOURCE_IDENTIFIER value="" />
<RESOURCE_TYPE value="TransformationRuleDSResourceType" />
<RESOURCE_KIND value="TransformationRuleDSResources" />
<RESOURCE_URI value="" />
<DATE_OF_CREATION value="2024-03-05T11:23:00+00:00" />
</HEADER>
<BODY>
<CONFIGURATION>
<SOURCE_METADATA_FORMAT interpretation="cleaned" layout="store" name="dc" />
<SINK_METADATA_FORMAT name="oaf_hbase" />
<IMPORTED />
<SCRIPT>
<TITLE>xslt_base2oaf_hadoop</TITLE>
<CODE>
<xsl:stylesheet xmlns:oaire="http://namespace.openaire.eu/schema/oaire/" xmlns:dateCleaner="http://eu/dnetlib/transform/dateISO"
xmlns:base_dc="http://oai.base-search.net/base_dc/"
xmlns:datacite="http://datacite.org/schema/kernel-4" xmlns:dr="http://www.driver-repository.eu/namespace/dr" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:vocabulary="http://eu/dnetlib/transform/clean" xmlns:oaf="http://namespace.openaire.eu/oaf"
xmlns:oai="http://www.openarchives.org/OAI/2.0/" xmlns:dri="http://www.driver-repository.eu/namespace/dri" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dc="http://purl.org/dc/elements/1.1/"
exclude-result-prefixes="xsl vocabulary dateCleaner base_dc" version="2.0">
<xsl:param name="varOfficialName" />
<xsl:param name="varDataSourceId" />
<xsl:param name="varFP7" select="'corda_______::'" />
<xsl:param name="varH2020" select="'corda__h2020::'" />
<xsl:param name="repoCode" select="substring-before(//*[local-name() = 'header']/*[local-name()='recordIdentifier'], ':')" />
<xsl:param name="index" select="0" />
<xsl:param name="transDate" select="current-dateTime()" />
<xsl:template name="terminate">
<xsl:message terminate="yes">
record is not compliant, transformation is interrupted.
</xsl:message>
</xsl:template>
<xsl:template match="/">
<record>
<xsl:apply-templates select="//*[local-name() = 'header']" />
<!-- TO EVALUATE
base_dc:authod_id
base_dc:authod_id/base_dc:creator_id
base_dc:authod_id/base_dc:creator_name
example:
<dc:creator>ALBU, Svetlana</dc:creator>
<base_dc:authod_id>
<base_dc:creator_name>ALBU, Svetlana</base_dc:creator_name>
<base_dc:creator_id>https://orcid.org/0000-0002-8648-950X</base_dc:creator_id>
</base_dc:authod_id>
-->
<!-- NOT USED
base_dc:global_id (I used oai:identifier)
base_dc:collection/text()
base_dc:continent
base_dc:year (I used dc:date)
dc:coverage
dc:language (I used base_dc:lang)
base_dc:link (I used dc:identifier)
-->
<xsl:variable name="varBaseNormType" select="vocabulary:clean(//base_dc:typenorm, 'base:normalized_types')" />
<metadata>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:title" />
<xsl:with-param name="targetElement" select="'dc:title'" />
</xsl:call-template>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:creator/replace(., '^(.*)\|.*$', '$1')" />
<xsl:with-param name="targetElement" select="'dc:creator'" />
</xsl:call-template>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:contributor" />
<xsl:with-param name="targetElement" select="'dc:contributor'" />
</xsl:call-template>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:description" />
<xsl:with-param name="targetElement" select="'dc:description'" />
</xsl:call-template>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:subject" />
<xsl:with-param name="targetElement" select="'dc:subject'" />
</xsl:call-template>
<!-- TODO: I'm not sure if this is the correct encoding -->
<xsl:for-each select="//base_dc:classcode|//base_dc:autoclasscode">
<dc:subject><xsl:value-of select="concat(@type, ':', .)" /></dc:subject>
</xsl:for-each>
<!-- END TODO -->
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:publisher" />
<xsl:with-param name="targetElement" select="'dc:publisher'" />
</xsl:call-template>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:format" />
<xsl:with-param name="targetElement" select="'dc:format'" />
</xsl:call-template>
<dc:type>
<xsl:value-of select="$varBaseNormType" />
</dc:type>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:type" />
<xsl:with-param name="targetElement" select="'dc:type'" />
</xsl:call-template>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:source" />
<xsl:with-param name="targetElement" select="'dc:source'" />
</xsl:call-template>
<dc:language>
<xsl:value-of select="vocabulary:clean( //base_dc:lang, 'dnet:languages')" />
</dc:language>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:rights" />
<xsl:with-param name="targetElement" select="'dc:rights'" />
</xsl:call-template>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:relation" />
<xsl:with-param name="targetElement" select="'dc:relation'" />
</xsl:call-template>
<xsl:if test="not(//dc:identifier[starts-with(., 'http')])">
<xsl:call-template name="terminate" />
</xsl:if>
<xsl:call-template name="allElements">
<xsl:with-param name="sourceElement" select="//dc:identifier[starts-with(., 'http')]" />
<xsl:with-param name="targetElement" select="'dc:identifier'" />
</xsl:call-template>
<xsl:for-each select="//dc:relation">
<xsl:if test="matches(normalize-space(.), '(info:eu-repo/grantagreement/ec/fp7/)(\d\d\d\d\d\d)(.*)', 'i')">
<oaf:projectid>
<xsl:value-of select="concat($varFP7, replace(normalize-space(.), '(info:eu-repo/grantagreement/ec/fp7/)(\d\d\d\d\d\d)(.*)', '$2', 'i'))" />
</oaf:projectid>
</xsl:if>
<xsl:if test="matches(normalize-space(.), '(info:eu-repo/grantagreement/ec/h2020/)(\d\d\d\d\d\d)(.*)', 'i')">
<oaf:projectid>
<xsl:value-of select="concat($varH2020, replace(normalize-space(.), '(info:eu-repo/grantagreement/ec/h2020/)(\d\d\d\d\d\d)(.*)', '$2', 'i'))" />
</oaf:projectid>
</xsl:if>
</xsl:for-each>
<dr:CobjCategory>
<xsl:variable name="varCobjCategory" select="vocabulary:clean($varBaseNormType, 'dnet:publication_resource')" />
<xsl:variable name="varSuperType" select="vocabulary:clean($varCobjCategory, 'dnet:result_typologies')" />
<xsl:attribute name="type" select="$varSuperType" />
<xsl:value-of select="$varCobjCategory" />
</dr:CobjCategory>
<oaf:accessrights>
<xsl:choose>
<xsl:when test="//base_dc:oa[.='1']">OPEN</xsl:when>
<xsl:when test="//base_dc:rightsnorm">
<xsl:value-of select="vocabulary:clean(//base_dc:rightsnorm, 'dnet:access_modes')" />
</xsl:when>
<xsl:when test="//dc:rights">
<xsl:value-of select="vocabulary:clean( //dc:rights, 'dnet:access_modes')" />
</xsl:when>
<xsl:otherwise>UNKNOWN</xsl:otherwise>
</xsl:choose>
</oaf:accessrights>
<xsl:for-each select="//base_dc:doi">
<oaf:identifier identifierType="doi">
<xsl:value-of select="." />
</oaf:identifier>
</xsl:for-each>
<xsl:for-each select="distinct-values(//dc:identifier[starts-with(., 'http') and (not(contains(., '://dx.doi.org/') or contains(., '://doi.org/') or contains(., '://hdl.handle.net/')))])">
<oaf:identifier identifierType="url">
<xsl:value-of select="." />
</oaf:identifier>
</xsl:for-each>
<xsl:for-each select="distinct-values(//dc:identifier[starts-with(., 'http') and contains(., '://hdl.handle.net/')]/substring-after(., 'hdl.handle.net/'))">
<oaf:identifier identifierType="handle">
<xsl:value-of select="." />
</oaf:identifier>
</xsl:for-each>
<xsl:for-each select="distinct-values(//dc:identifier[starts-with(., 'urn:nbn:nl:') or starts-with(., 'URN:NBN:NL:')])">
<oaf:identifier identifierType='urn'>
<xsl:value-of select="." />
</oaf:identifier>
</xsl:for-each>
<oaf:identifier identifierType="oai-original">
<xsl:value-of
select="//*[local-name() = 'about']/*[local-name() = 'provenance']//*[local-name() = 'originDescription' and not(./*[local-name() = 'originDescription'])]/*[local-name() = 'identifier']" />
</oaf:identifier>
<oaf:hostedBy>
<xsl:attribute name="name">
<xsl:value-of select="//base_dc:collname" />
</xsl:attribute>
<xsl:attribute name="id">
<xsl:value-of select="concat('opendoar____::', //base_dc:collection/@opendoar_id)" />
</xsl:attribute>
</oaf:hostedBy>
<oaf:collectedFrom>
<xsl:attribute name="name">
<xsl:value-of select="$varOfficialName" />
</xsl:attribute>
<xsl:attribute name="id">
<xsl:value-of select="$varDataSourceId" />
</xsl:attribute>
</oaf:collectedFrom>
<oaf:dateAccepted>
<xsl:value-of select="dateCleaner:dateISO( //dc:date[1] )" />
</oaf:dateAccepted>
<xsl:if test="//base_dc:oa[.='1']">
<xsl:for-each select="//dc:relation[starts-with(., 'http')]">
<oaf:fulltext>
<xsl:value-of select="normalize-space(.)" />
</oaf:fulltext>
</xsl:for-each>
</xsl:if>
<xsl:for-each select="//base_dc:collection/@ror_id">
<oaf:relation relType="resultOrganization"
subRelType="affiliation"
relClass="hasAuthorInstitution"
targetType="organization">
<xsl:choose>
<xsl:when test="contains(.,'https://ror.org/')">
<xsl:value-of select="concat('ror_________::', normalize-space(.))" />
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="concat('ror_________::https://ror.org/', normalize-space(.))" />
</xsl:otherwise>
</xsl:choose>
</oaf:relation>
</xsl:for-each>
<xsl:for-each select="//base_dc:country">
<oaf:country><xsl:value-of select="vocabulary:clean(., 'dnet:countries')" /></oaf:country>
</xsl:for-each>
</metadata>
<xsl:copy-of select="//*[local-name() = 'about']" />
</record>
</xsl:template>
<xsl:template name="allElements">
<xsl:param name="sourceElement" />
<xsl:param name="targetElement" />
<xsl:for-each select="$sourceElement">
<xsl:element name="{$targetElement}">
<xsl:value-of select="normalize-space(.)" />
</xsl:element>
</xsl:for-each>
</xsl:template>
<xsl:template match="//*[local-name() = 'header']">
<xsl:if test="//oai:header/@status='deleted'">
<xsl:call-template name="terminate" />
</xsl:if>
<xsl:copy>
<xsl:apply-templates select="node()|@*" />
<xsl:element name="dr:dateOfTransformation">
<xsl:value-of select="$transDate" />
</xsl:element>
</xsl:copy>
</xsl:template>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
</CODE>
</SCRIPT>
</CONFIGURATION>
<STATUS />
<SECURITY_PARAMETERS />
</BODY>
</RESOURCE_PROFILE>

View File

@ -0,0 +1,326 @@
<RESOURCE_PROFILE>
<HEADER>
<RESOURCE_IDENTIFIER value="2ad0cdd9-c96c-484c-8b0e-ed56d86891fe_VHJhbnNmb3JtYXRpb25SdWxlRFNSZXNvdXJjZXMvVHJhbnNmb3JtYXRpb25SdWxlRFNSZXNvdXJjZVR5cGU=" />
<RESOURCE_TYPE value="TransformationRuleDSResourceType" />
<RESOURCE_KIND value="TransformationRuleDSResources" />
<RESOURCE_URI value="" />
<DATE_OF_CREATION value="2024-03-05T11:23:00+00:00" />
</HEADER>
<BODY>
<CONFIGURATION>
<SOURCE_METADATA_FORMAT interpretation="cleaned" layout="store" name="dc" />
<SINK_METADATA_FORMAT name="odf_hbase" />
<IMPORTED />
<SCRIPT>
<TITLE>xslt_base2odf_hadoop</TITLE>
<CODE>
<xsl:stylesheet xmlns:oaire="http://namespace.openaire.eu/schema/oaire/" xmlns:dateCleaner="http://eu/dnetlib/transform/dateISO" xmlns:base_dc="http://oai.base-search.net/base_dc/"
xmlns:datacite="http://datacite.org/schema/kernel-4" xmlns:dr="http://www.driver-repository.eu/namespace/dr" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:vocabulary="http://eu/dnetlib/transform/clean" xmlns:oaf="http://namespace.openaire.eu/oaf"
xmlns:oai="http://www.openarchives.org/OAI/2.0/" xmlns:dri="http://www.driver-repository.eu/namespace/dri" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dc="http://purl.org/dc/elements/1.1/"
exclude-result-prefixes="xsl vocabulary dateCleaner base_dc" version="2.0">
<xsl:param name="varOfficialName" />
<xsl:param name="varDataSourceId" />
<xsl:param name="varFP7" select="'corda_______::'" />
<xsl:param name="varH2020" select="'corda__h2020::'" />
<xsl:param name="repoCode" select="substring-before(//*[local-name() = 'header']/*[local-name()='recordIdentifier'], ':')" />
<xsl:param name="index" select="0" />
<xsl:param name="transDate" select="current-dateTime()" />
<xsl:template name="terminate">
<xsl:message terminate="yes">
record is not compliant, transformation is interrupted.
</xsl:message>
</xsl:template>
<xsl:template match="/">
<record>
<xsl:apply-templates select="//*[local-name() = 'header']" />
<!-- NOT USED
base_dc:global_id (I used oai:identifier)
base_dc:collection/text()
base_dc:continent
dc:coverage
dc:source
dc:relation
dc:type (I used //base_dc:typenorm)
dc:language (I used base_dc:lang)
base_dc:link (I used dc:identifier)
-->
<xsl:variable name="varBaseNormType" select="vocabulary:clean(//base_dc:typenorm, 'base:normalized_types')" />
<metadata>
<datacite:resource>
<xsl:for-each select="//base_dc:doi">
<datacite:identifier identifierType="DOI">
<xsl:value-of select="." />
</datacite:identifier>
</xsl:for-each>
<datacite:alternateIdentifiers>
<xsl:for-each
select="distinct-values(//dc:identifier[starts-with(., 'http') and (not(contains(., '://dx.doi.org/') or contains(., '://doi.org/') or contains(., '://hdl.handle.net/')))])">
<datacite:identifier alternateIdentifierType="url">
<xsl:value-of select="." />
</datacite:identifier>
</xsl:for-each>
<xsl:for-each select="distinct-values(//dc:identifier[starts-with(., 'http') and contains(., '://hdl.handle.net/')]/substring-after(., 'hdl.handle.net/'))">
<datacite:identifier alternateIdentifierType="handle">
<xsl:value-of select="." />
</datacite:identifier>
</xsl:for-each>
<xsl:for-each select="distinct-values(//dc:identifier[starts-with(., 'urn:nbn:nl:') or starts-with(., 'URN:NBN:NL:')])">
<datacite:identifier alternateIdentifierType='urn'>
<xsl:value-of select="." />
</datacite:identifier>
</xsl:for-each>
<datacite:identifier alternateIdentifierType="oai-original">
<xsl:value-of
select="//*[local-name() = 'about']/*[local-name() = 'provenance']//*[local-name() = 'originDescription' and not(./*[local-name() = 'originDescription'])]/*[local-name() = 'identifier']" />
</datacite:identifier>
</datacite:alternateIdentifiers>
<datacite:relatedIdentifiers />
<datacite:resourceType><xsl:value-of select="$varBaseNormType" /></datacite:resourceType>
<datacite:titles>
<xsl:for-each select="//dc:title">
<datacite:title>
<xsl:value-of select="normalize-space(.)" />
</datacite:title>
</xsl:for-each>
</datacite:titles>
<datacite:creators>
<xsl:for-each select="//dc:creator">
<xsl:variable name="author" select="normalize-space(.)" />
<datacite:creator>
<datacite:creatorName>
<xsl:value-of select="$author" />
</datacite:creatorName>
<xsl:for-each select="//base_dc:authod_id[normalize-space(./base_dc:creator_name) = $author]/base_dc:creator_id ">
<xsl:if test="contains(.,'https://orcid.org/')">
<nameIdentifier schemeURI="https://orcid.org/" nameIdentifierScheme="ORCID">
<xsl:value-of select="substring-after(., 'https://orcid.org/')" />
</nameIdentifier>
</xsl:if>
</xsl:for-each>
</datacite:creator>
</xsl:for-each>
</datacite:creators>
<datacite:contributors>
<xsl:for-each select="//dc:contributor">
<datacite:contributor>
<datacite:contributorName>
<xsl:value-of select="normalize-space(.)" />
</datacite:contributorName>
</datacite:contributor>
</xsl:for-each>
</datacite:contributors>
<datacite:descriptions>
<xsl:for-each select="//dc:description">
<datacite:description descriptionType="Abstract">
<xsl:value-of select="normalize-space(.)" />
</datacite:description>
</xsl:for-each>
</datacite:descriptions>
<datacite:subjects>
<xsl:for-each select="//dc:subject">
<datacite:subject>
<xsl:value-of select="normalize-space(.)" />
</datacite:subject>
</xsl:for-each>
<xsl:for-each select="//base_dc:classcode|//base_dc:autoclasscode">
<datacite:subject subjectScheme="{@type}" classificationCode="{normalize-space(.)}">
<!-- TODO the value should be obtained by the Code -->
<xsl:value-of select="normalize-space(.)" />
</datacite:subject>
</xsl:for-each>
</datacite:subjects>
<datacite:publisher>
<xsl:value-of select="normalize-space(//dc:publisher)" />
</datacite:publisher>
<datacite:publicationYear>
<xsl:value-of select="normalize-space(//base_dc:year)" />
</datacite:publicationYear>
<datacite:formats>
<xsl:for-each select="//dc:format">
<datacite:format>
<xsl:value-of select="normalize-space(.)" />
</datacite:format>
</xsl:for-each>
</datacite:formats>
<datacite:language>
<xsl:value-of select="vocabulary:clean( //base_dc:lang, 'dnet:languages')" />
</datacite:language>
<oaf:accessrights>
<xsl:if test="//base_dc:oa[.='1']">
<datacite:rights rightsURI="http://purl.org/coar/access_right/c_abf2">open access</datacite:rights>
</xsl:if>
<xsl:for-each select="//dc:rights|//base_dc:rightsnorm">
<datacite:rights><xsl:value-of select="vocabulary:clean(., 'dnet:access_modes')" /></datacite:rights>
</xsl:for-each>
</oaf:accessrights>
</datacite:resource>
<xsl:for-each select="//dc:relation">
<xsl:if test="matches(normalize-space(.), '(info:eu-repo/grantagreement/ec/fp7/)(\d\d\d\d\d\d)(.*)', 'i')">
<oaf:projectid>
<xsl:value-of select="concat($varFP7, replace(normalize-space(.), '(info:eu-repo/grantagreement/ec/fp7/)(\d\d\d\d\d\d)(.*)', '$2', 'i'))" />
</oaf:projectid>
</xsl:if>
<xsl:if test="matches(normalize-space(.), '(info:eu-repo/grantagreement/ec/h2020/)(\d\d\d\d\d\d)(.*)', 'i')">
<oaf:projectid>
<xsl:value-of select="concat($varH2020, replace(normalize-space(.), '(info:eu-repo/grantagreement/ec/h2020/)(\d\d\d\d\d\d)(.*)', '$2', 'i'))" />
</oaf:projectid>
</xsl:if>
</xsl:for-each>
<dr:CobjCategory>
<xsl:variable name="varCobjCategory" select="vocabulary:clean($varBaseNormType, 'dnet:publication_resource')" />
<xsl:variable name="varSuperType" select="vocabulary:clean($varCobjCategory, 'dnet:result_typologies')" />
<xsl:attribute name="type" select="$varSuperType" />
<xsl:value-of select="$varCobjCategory" />
</dr:CobjCategory>
<oaf:accessrights>
<xsl:choose>
<xsl:when test="//base_dc:oa[.='1']">OPEN</xsl:when>
<xsl:when test="//base_dc:rightsnorm">
<xsl:value-of select="vocabulary:clean(//base_dc:rightsnorm, 'dnet:access_modes')" />
</xsl:when>
<xsl:when test="//dc:rights">
<xsl:value-of select="vocabulary:clean( //dc:rights, 'dnet:access_modes')" />
</xsl:when>
<xsl:otherwise>UNKNOWN</xsl:otherwise>
</xsl:choose>
</oaf:accessrights>
<xsl:for-each select="//base_dc:doi">
<oaf:identifier identifierType="doi">
<xsl:value-of select="." />
</oaf:identifier>
</xsl:for-each>
<xsl:for-each
select="distinct-values(//dc:identifier[starts-with(., 'http') and ( not(contains(., '://dx.doi.org/') or contains(., '://doi.org/') or contains(., '://hdl.handle.net/')))])">
<oaf:identifier identifierType="url">
<xsl:value-of select="." />
</oaf:identifier>
</xsl:for-each>
<xsl:for-each select="distinct-values(//dc:identifier[starts-with(., 'http') and contains(., '://hdl.handle.net/')]/substring-after(., 'hdl.handle.net/'))">
<oaf:identifier identifierType="handle">
<xsl:value-of select="." />
</oaf:identifier>
</xsl:for-each>
<xsl:for-each select="distinct-values(//dc:identifier[starts-with(., 'urn:nbn:nl:') or starts-with(., 'URN:NBN:NL:')])">
<oaf:identifier identifierType='urn'>
<xsl:value-of select="." />
</oaf:identifier>
</xsl:for-each>
<oaf:identifier identifierType="oai-original">
<xsl:value-of
select="//*[local-name() = 'about']/*[local-name() = 'provenance']//*[local-name() = 'originDescription' and not(./*[local-name() = 'originDescription'])]/*[local-name() = 'identifier']" />
</oaf:identifier>
<oaf:hostedBy>
<xsl:attribute name="name">
<xsl:value-of select="//base_dc:collname" />
</xsl:attribute>
<xsl:attribute name="id">
<xsl:value-of select="concat('opendoar____::', //base_dc:collection/@opendoar_id)" />
</xsl:attribute>
</oaf:hostedBy>
<oaf:collectedFrom>
<xsl:attribute name="name">
<xsl:value-of select="$varOfficialName" />
</xsl:attribute>
<xsl:attribute name="id">
<xsl:value-of select="$varDataSourceId" />
</xsl:attribute>
</oaf:collectedFrom>
<oaf:dateAccepted>
<xsl:value-of select="dateCleaner:dateISO( //dc:date[1] )" />
</oaf:dateAccepted>
<xsl:if test="//base_dc:oa[.='1']">
<xsl:for-each select="//dc:relation[starts-with(., 'http')]">
<oaf:fulltext>
<xsl:value-of select="normalize-space(.)" />
</oaf:fulltext>
</xsl:for-each>
</xsl:if>
<xsl:for-each select="//base_dc:collection/@ror_id">
<oaf:relation relType="resultOrganization" subRelType="affiliation" relClass="hasAuthorInstitution" targetType="organization">
<xsl:choose>
<xsl:when test="contains(.,'https://ror.org/')">
<xsl:value-of select="concat('ror_________::', normalize-space(.))" />
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="concat('ror_________::https://ror.org/', normalize-space(.))" />
</xsl:otherwise>
</xsl:choose>
</oaf:relation>
</xsl:for-each>
<xsl:for-each select="//base_dc:country">
<oaf:country><xsl:value-of select="vocabulary:clean(., 'dnet:countries')" /></oaf:country>
</xsl:for-each>
</metadata>
<xsl:copy-of select="//*[local-name() = 'about']" />
</record>
</xsl:template>
<xsl:template match="//*[local-name() = 'header']">
<xsl:if test="//oai:header/@status='deleted'">
<xsl:call-template name="terminate" />
</xsl:if>
<xsl:copy>
<xsl:apply-templates select="node()|@*" />
<xsl:element name="dr:dateOfTransformation">
<xsl:value-of select="$transDate" />
</xsl:element>
</xsl:copy>
</xsl:template>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
</CODE>
</SCRIPT>
</CONFIGURATION>
<STATUS />
<SECURITY_PARAMETERS />
</BODY>
</RESOURCE_PROFILE>

View File

@ -1048,5 +1048,10 @@
"openaire_id": "re3data_____::r3d100010399",
"datacite_name": "ZEW Forschungsdatenzentrum",
"official_name": "ZEW Forschungsdatenzentrum"
},
"HBP.NEUROINF": {
"openaire_id": "fairsharing_::2975",
"datacite_name": "EBRAINS",
"official_name": "EBRAINS"
}
}

View File

@ -1,4 +1,4 @@
<workflow-app name="Transform_BioEntity_Workflow" xmlns="uri:oozie:workflow:0.5">
<workflow-app name="Transform_BioEntity_Workflow" xmlns="uri:oozie:workflow:0.5">
<parameters>
<property>
<name>sourcePath</name>
@ -8,19 +8,40 @@
<name>database</name>
<description>the PDB Database Working Path</description>
</property>
<property>
<name>targetPath</name>
<description>the Target Working dir path</description>
<name>mdStoreOutputId</name>
<description>the identifier of the cleaned MDStore</description>
</property>
<property>
<name>mdStoreManagerURI</name>
<description>the path of the cleaned mdstore</description>
</property>
</parameters>
<start to="ConvertDB"/>
<start to="StartTransaction"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="StartTransaction">
<java>
<configuration>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>
<main-class>eu.dnetlib.dhp.aggregation.mdstore.MDStoreActionNode</main-class>
<arg>--action</arg><arg>NEW_VERSION</arg>
<arg>--mdStoreID</arg><arg>${mdStoreOutputId}</arg>
<arg>--mdStoreManagerURI</arg><arg>${mdStoreManagerURI}</arg>
<capture-output/>
</java>
<ok to="ConvertDB"/>
<error to="RollBack"/>
</action>
<action name="ConvertDB">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
@ -41,11 +62,48 @@
<arg>--master</arg><arg>yarn</arg>
<arg>--dbPath</arg><arg>${sourcePath}</arg>
<arg>--database</arg><arg>${database}</arg>
<arg>--targetPath</arg><arg>${targetPath}</arg>
<arg>--mdstoreOutputVersion</arg><arg>${wf:actionData('StartTransaction')['mdStoreVersion']}</arg>
</spark>
<ok to="End"/>
<error to="Kill"/>
<ok to="CommitVersion"/>
<error to="RollBack"/>
</action>
<end name="End"/>
<action name="CommitVersion">
<java>
<configuration>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>
<main-class>eu.dnetlib.dhp.aggregation.mdstore.MDStoreActionNode</main-class>
<arg>--action</arg><arg>COMMIT</arg>
<arg>--namenode</arg><arg>${nameNode}</arg>
<arg>--mdStoreVersion</arg><arg>${wf:actionData('StartTransaction')['mdStoreVersion']}</arg>
<arg>--mdStoreManagerURI</arg><arg>${mdStoreManagerURI}</arg>
</java>
<ok to="End"/>
<error to="Kill"/>
</action>
<action name="RollBack">
<java>
<configuration>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>
<main-class>eu.dnetlib.dhp.aggregation.mdstore.MDStoreActionNode</main-class>
<arg>--action</arg><arg>ROLLBACK</arg>
<arg>--mdStoreVersion</arg><arg>${wf:actionData('StartTransaction')['mdStoreVersion']}</arg>
<arg>--mdStoreManagerURI</arg><arg>${mdStoreManagerURI}</arg>
</java>
<ok to="Kill"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

View File

@ -2,7 +2,7 @@
{"paramName":"mt", "paramLongName":"master", "paramDescription": "should be local or yarn", "paramRequired": true},
{"paramName":"i", "paramLongName":"isLookupUrl", "paramDescription": "isLookupUrl", "paramRequired": true},
{"paramName":"w", "paramLongName":"workingPath", "paramDescription": "the path of the sequencial file to read", "paramRequired": true},
{"paramName":"t", "paramLongName":"targetPath", "paramDescription": "the oaf path ", "paramRequired": true},
{"paramName":"mo", "paramLongName":"mdstoreOutputVersion", "paramDescription": "the oaf path ", "paramRequired": true},
{"paramName":"s", "paramLongName":"skipUpdate", "paramDescription": "skip update ", "paramRequired": false},
{"paramName":"h", "paramLongName":"hdfsServerUri", "paramDescription": "the working path ", "paramRequired": true}
]

View File

@ -2,5 +2,5 @@
{"paramName":"mt", "paramLongName":"master", "paramDescription": "should be local or yarn", "paramRequired": true},
{"paramName":"db", "paramLongName":"database", "paramDescription": "should be PDB or UNIPROT", "paramRequired": true},
{"paramName":"p", "paramLongName":"dbPath", "paramDescription": "the path of the database to transform", "paramRequired": true},
{"paramName":"t", "paramLongName":"targetPath", "paramDescription": "the OAF target path ", "paramRequired": true}
{"paramName":"mo", "paramLongName":"mdstoreOutputVersion", "paramDescription": "the oaf path ", "paramRequired": true}
]

View File

@ -1,5 +1,20 @@
[
{"paramName":"mt", "paramLongName":"master", "paramDescription": "should be local or yarn", "paramRequired": true},
{"paramName":"s", "paramLongName":"sourcePath","paramDescription": "the source Path", "paramRequired": true},
{"paramName":"t", "paramLongName":"targetPath","paramDescription": "the oaf path ", "paramRequired": true}
{
"paramName": "mt",
"paramLongName": "master",
"paramDescription": "should be local or yarn",
"paramRequired": true
},
{
"paramName": "s",
"paramLongName": "sourcePath",
"paramDescription": "the source Path",
"paramRequired": true
},
{
"paramName": "mo",
"paramLongName": "mdstoreOutputVersion",
"paramDescription": "the oaf path ",
"paramRequired": true
}
]

View File

@ -9,34 +9,26 @@
<description>the Working Path</description>
</property>
<property>
<name>targetPath</name>
<description>the OAF MDStore Path</description>
<name>mdStoreOutputId</name>
<description>the identifier of the cleaned MDStore</description>
</property>
<property>
<name>sparkDriverMemory</name>
<description>memory for driver process</description>
</property>
<property>
<name>sparkExecutorMemory</name>
<description>memory for individual executor</description>
</property>
<property>
<name>sparkExecutorCores</name>
<description>number of cores used by single executor</description>
<name>mdStoreManagerURI</name>
<description>the path of the cleaned mdstore</description>
</property>
<property>
<name>resumeFrom</name>
<value>DownloadEBILinks</value>
<value>CreateEBIDataSet</value>
<description>node to start</description>
</property>
</parameters>
<start to="resume_from"/>
<start to="StartTransaction"/>
<decision name="resume_from">
<switch>
<case to="DownloadEBILinks">${wf:conf('resumeFrom') eq 'DownloadEBILinks'}</case>
<case to="CreateEBIDataSet">${wf:conf('resumeFrom') eq 'CreateEBIDataSet'}</case>
<case to="StartTransaction">${wf:conf('resumeFrom') eq 'CreateEBIDataSet'}</case>
<default to="DownloadEBILinks"/>
</switch>
</decision>
@ -77,9 +69,29 @@
<move source="${sourcePath}/ebi_links_dataset" target="${sourcePath}/ebi_links_dataset_old"/>
<move source="${workingPath}/links_final" target="${sourcePath}/ebi_links_dataset"/>
</fs>
<ok to="CreateEBIDataSet"/>
<ok to="StartTransaction"/>
<error to="Kill"/>
</action>
<action name="StartTransaction">
<java>
<configuration>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>
<main-class>eu.dnetlib.dhp.aggregation.mdstore.MDStoreActionNode</main-class>
<arg>--action</arg><arg>NEW_VERSION</arg>
<arg>--mdStoreID</arg><arg>${mdStoreOutputId}</arg>
<arg>--mdStoreManagerURI</arg><arg>${mdStoreManagerURI}</arg>
<capture-output/>
</java>
<ok to="CreateEBIDataSet"/>
<error to="RollBack"/>
</action>
<action name="CreateEBIDataSet">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn-cluster</master>
@ -95,11 +107,49 @@
${sparkExtraOPT}
</spark-opts>
<arg>--sourcePath</arg><arg>${sourcePath}/ebi_links_dataset</arg>
<arg>--targetPath</arg><arg>${targetPath}</arg>
<arg>--mdstoreOutputVersion</arg><arg>${wf:actionData('StartTransaction')['mdStoreVersion']}</arg>
<arg>--master</arg><arg>yarn</arg>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<action name="CommitVersion">
<java>
<configuration>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>
<main-class>eu.dnetlib.dhp.aggregation.mdstore.MDStoreActionNode</main-class>
<arg>--action</arg><arg>COMMIT</arg>
<arg>--namenode</arg><arg>${nameNode}</arg>
<arg>--mdStoreVersion</arg><arg>${wf:actionData('StartTransaction')['mdStoreVersion']}</arg>
<arg>--mdStoreManagerURI</arg><arg>${mdStoreManagerURI}</arg>
</java>
<ok to="End"/>
<error to="Kill"/>
</action>
<action name="RollBack">
<java>
<configuration>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>
<main-class>eu.dnetlib.dhp.aggregation.mdstore.MDStoreActionNode</main-class>
<arg>--action</arg><arg>ROLLBACK</arg>
<arg>--mdStoreVersion</arg><arg>${wf:actionData('StartTransaction')['mdStoreVersion']}</arg>
<arg>--mdStoreManagerURI</arg><arg>${mdStoreManagerURI}</arg>
</java>
<ok to="Kill"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

View File

@ -1,4 +1,4 @@
<workflow-app name="Download_Transform_Pubmed_Workflow" xmlns="uri:oozie:workflow:0.5">
<workflow-app name="Download_Transform_Pubmed_Workflow" xmlns="uri:oozie:workflow:0.5">
<parameters>
<property>
<name>baselineWorkingPath</name>
@ -9,8 +9,12 @@
<description>The IS lookUp service endopoint</description>
</property>
<property>
<name>targetPath</name>
<description>The target path</description>
<name>mdStoreOutputId</name>
<description>the identifier of the cleaned MDStore</description>
</property>
<property>
<name>mdStoreManagerURI</name>
<description>the path of the cleaned mdstore</description>
</property>
<property>
<name>skipUpdate</name>
@ -19,12 +23,31 @@
</property>
</parameters>
<start to="ConvertDataset"/>
<start to="StartTransaction"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="StartTransaction">
<java>
<configuration>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>
<main-class>eu.dnetlib.dhp.aggregation.mdstore.MDStoreActionNode</main-class>
<arg>--action</arg><arg>NEW_VERSION</arg>
<arg>--mdStoreID</arg><arg>${mdStoreOutputId}</arg>
<arg>--mdStoreManagerURI</arg><arg>${mdStoreManagerURI}</arg>
<capture-output/>
</java>
<ok to="ConvertDataset"/>
<error to="RollBack"/>
</action>
<action name="ConvertDataset">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
@ -43,16 +66,52 @@
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
</spark-opts>
<arg>--workingPath</arg><arg>${baselineWorkingPath}</arg>
<arg>--targetPath</arg><arg>${targetPath}</arg>
<arg>--mdstoreOutputVersion</arg><arg>${wf:actionData('StartTransaction')['mdStoreVersion']}</arg>
<arg>--master</arg><arg>yarn</arg>
<arg>--isLookupUrl</arg><arg>${isLookupUrl}</arg>
<arg>--hdfsServerUri</arg><arg>${nameNode}</arg>
<arg>--skipUpdate</arg><arg>${skipUpdate}</arg>
</spark>
<ok to="CommitVersion"/>
<error to="RollBack"/>
</action>
<action name="CommitVersion">
<java>
<configuration>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>
<main-class>eu.dnetlib.dhp.aggregation.mdstore.MDStoreActionNode</main-class>
<arg>--action</arg><arg>COMMIT</arg>
<arg>--namenode</arg><arg>${nameNode}</arg>
<arg>--mdStoreVersion</arg><arg>${wf:actionData('StartTransaction')['mdStoreVersion']}</arg>
<arg>--mdStoreManagerURI</arg><arg>${mdStoreManagerURI}</arg>
</java>
<ok to="End"/>
<error to="Kill"/>
</action>
<action name="RollBack">
<java>
<configuration>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>
<main-class>eu.dnetlib.dhp.aggregation.mdstore.MDStoreActionNode</main-class>
<arg>--action</arg><arg>ROLLBACK</arg>
<arg>--mdStoreVersion</arg><arg>${wf:actionData('StartTransaction')['mdStoreVersion']}</arg>
<arg>--mdStoreManagerURI</arg><arg>${mdStoreManagerURI}</arg>
</java>
<ok to="Kill"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

View File

@ -2,12 +2,15 @@ package eu.dnetlib.dhp.sx.bio
import eu.dnetlib.dhp.application.ArgumentApplicationParser
import eu.dnetlib.dhp.collection.CollectionUtils
import eu.dnetlib.dhp.common.Constants.{MDSTORE_DATA_PATH, MDSTORE_SIZE_PATH}
import eu.dnetlib.dhp.schema.mdstore.MDStoreVersion
import eu.dnetlib.dhp.schema.oaf.Oaf
import eu.dnetlib.dhp.sx.bio.BioDBToOAF.ScholixResolved
import org.apache.commons.io.IOUtils
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.slf4j.{Logger, LoggerFactory}
import eu.dnetlib.dhp.utils.DHPUtils.{MAPPER, writeHdfsFile}
object SparkTransformBioDatabaseToOAF {
@ -25,8 +28,13 @@ object SparkTransformBioDatabaseToOAF {
val dbPath: String = parser.get("dbPath")
log.info("dbPath: {}", database)
val targetPath: String = parser.get("targetPath")
log.info("targetPath: {}", database)
val mdstoreOutputVersion = parser.get("mdstoreOutputVersion")
log.info("mdstoreOutputVersion: {}", mdstoreOutputVersion)
val cleanedMdStoreVersion = MAPPER.readValue(mdstoreOutputVersion, classOf[MDStoreVersion])
val outputBasePath = cleanedMdStoreVersion.getHdfsPath
log.info("outputBasePath: {}", outputBasePath)
val spark: SparkSession =
SparkSession
@ -43,24 +51,28 @@ object SparkTransformBioDatabaseToOAF {
case "UNIPROT" =>
CollectionUtils.saveDataset(
spark.createDataset(sc.textFile(dbPath).flatMap(i => BioDBToOAF.uniprotToOAF(i))),
targetPath
s"$outputBasePath/$MDSTORE_DATA_PATH"
)
case "PDB" =>
CollectionUtils.saveDataset(
spark.createDataset(sc.textFile(dbPath).flatMap(i => BioDBToOAF.pdbTOOaf(i))),
targetPath
s"$outputBasePath/$MDSTORE_DATA_PATH"
)
case "SCHOLIX" =>
CollectionUtils.saveDataset(
spark.read.load(dbPath).as[ScholixResolved].map(i => BioDBToOAF.scholixResolvedToOAF(i)),
targetPath
s"$outputBasePath/$MDSTORE_DATA_PATH"
)
case "CROSSREF_LINKS" =>
CollectionUtils.saveDataset(
spark.createDataset(sc.textFile(dbPath).map(i => BioDBToOAF.crossrefLinksToOaf(i))),
targetPath
s"$outputBasePath/$MDSTORE_DATA_PATH"
)
}
val df = spark.read.text(s"$outputBasePath/$MDSTORE_DATA_PATH")
val mdStoreSize = df.count
writeHdfsFile(spark.sparkContext.hadoopConfiguration, s"$mdStoreSize", s"$outputBasePath/$MDSTORE_SIZE_PATH")
}
}

View File

@ -2,9 +2,12 @@ package eu.dnetlib.dhp.sx.bio.ebi
import eu.dnetlib.dhp.application.ArgumentApplicationParser
import eu.dnetlib.dhp.collection.CollectionUtils
import eu.dnetlib.dhp.common.Constants.{MDSTORE_DATA_PATH, MDSTORE_SIZE_PATH}
import eu.dnetlib.dhp.common.vocabulary.VocabularyGroup
import eu.dnetlib.dhp.schema.mdstore.MDStoreVersion
import eu.dnetlib.dhp.schema.oaf.{Oaf, Result}
import eu.dnetlib.dhp.sx.bio.pubmed._
import eu.dnetlib.dhp.utils.DHPUtils.{MAPPER, writeHdfsFile}
import eu.dnetlib.dhp.utils.ISLookupClientFactory
import org.apache.commons.io.IOUtils
import org.apache.hadoop.conf.Configuration
@ -164,11 +167,15 @@ object SparkCreateBaselineDataFrame {
val workingPath = parser.get("workingPath")
log.info("workingPath: {}", workingPath)
val targetPath = parser.get("targetPath")
log.info("targetPath: {}", targetPath)
val mdstoreOutputVersion = parser.get("mdstoreOutputVersion")
log.info("mdstoreOutputVersion: {}", mdstoreOutputVersion)
val cleanedMdStoreVersion = MAPPER.readValue(mdstoreOutputVersion, classOf[MDStoreVersion])
val outputBasePath = cleanedMdStoreVersion.getHdfsPath
log.info("outputBasePath: {}", outputBasePath)
val hdfsServerUri = parser.get("hdfsServerUri")
log.info("hdfsServerUri: {}", targetPath)
log.info("hdfsServerUri: {}", hdfsServerUri)
val skipUpdate = parser.get("skipUpdate")
log.info("skipUpdate: {}", skipUpdate)
@ -216,8 +223,11 @@ object SparkCreateBaselineDataFrame {
.map(a => PubMedToOaf.convert(a, vocabularies))
.as[Oaf]
.filter(p => p != null),
targetPath
s"$outputBasePath/$MDSTORE_DATA_PATH"
)
val df = spark.read.text(s"$outputBasePath/$MDSTORE_DATA_PATH")
val mdStoreSize = df.count
writeHdfsFile(spark.sparkContext.hadoopConfiguration, s"$mdStoreSize", s"$outputBasePath/$MDSTORE_SIZE_PATH")
}
}

View File

@ -9,6 +9,9 @@ import org.apache.commons.io.IOUtils
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.slf4j.{Logger, LoggerFactory}
import eu.dnetlib.dhp.common.Constants.{MDSTORE_DATA_PATH, MDSTORE_SIZE_PATH}
import eu.dnetlib.dhp.schema.mdstore.MDStoreVersion
import eu.dnetlib.dhp.utils.DHPUtils.{MAPPER, writeHdfsFile}
object SparkEBILinksToOaf {
@ -32,8 +35,13 @@ object SparkEBILinksToOaf {
import spark.implicits._
val sourcePath = parser.get("sourcePath")
log.info(s"sourcePath -> $sourcePath")
val targetPath = parser.get("targetPath")
log.info(s"targetPath -> $targetPath")
val mdstoreOutputVersion = parser.get("mdstoreOutputVersion")
log.info("mdstoreOutputVersion: {}", mdstoreOutputVersion)
val cleanedMdStoreVersion = MAPPER.readValue(mdstoreOutputVersion, classOf[MDStoreVersion])
val outputBasePath = cleanedMdStoreVersion.getHdfsPath
log.info("outputBasePath: {}", outputBasePath)
implicit val PMEncoder: Encoder[Oaf] = Encoders.kryo(classOf[Oaf])
val ebLinks: Dataset[EBILinkItem] = spark.read
@ -46,7 +54,10 @@ object SparkEBILinksToOaf {
.flatMap(j => BioDBToOAF.parse_ebi_links(j.links))
.filter(p => BioDBToOAF.EBITargetLinksFilter(p))
.flatMap(p => BioDBToOAF.convertEBILinksToOaf(p)),
targetPath
s"$outputBasePath/$MDSTORE_DATA_PATH"
)
val df = spark.read.text(s"$outputBasePath/$MDSTORE_DATA_PATH")
val mdStoreSize = df.count
writeHdfsFile(spark.sparkContext.hadoopConfiguration, s"$mdStoreSize", s"$outputBasePath/$MDSTORE_SIZE_PATH")
}
}

View File

@ -0,0 +1,38 @@
package eu.dnetlib.dhp.collection.plugin.base;
import java.io.Serializable;
public class BaseCollectionInfo implements Serializable {
private static final long serialVersionUID = 5766333937429419647L;
private String id;
private String opendoarId;
private String rorId;
public String getId() {
return this.id;
}
public void setId(final String id) {
this.id = id;
}
public String getOpendoarId() {
return this.opendoarId;
}
public void setOpendoarId(final String opendoarId) {
this.opendoarId = opendoarId;
}
public String getRorId() {
return this.rorId;
}
public void setRorId(final String rorId) {
this.rorId = rorId;
}
}

View File

@ -0,0 +1,184 @@
package eu.dnetlib.dhp.collection.plugin.base;
import static org.junit.jupiter.api.Assertions.assertEquals;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.dom4j.Attribute;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;
import org.dom4j.Node;
import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.common.aggregation.AggregatorReport;
@Disabled
public class BaseCollectorIteratorTest {
@Test
void testImportFile() throws Exception {
long count = 0;
final BaseCollectorIterator iterator = new BaseCollectorIterator("base-sample.tar", new AggregatorReport());
final Map<String, Map<String, String>> collections = new HashMap<>();
final Map<String, AtomicInteger> fields = new HashMap<>();
final Set<String> types = new HashSet<>();
while (iterator.hasNext()) {
final Document record = DocumentHelper.parseText(iterator.next());
count++;
if ((count % 1000) == 0) {
System.out.println("# Read records: " + count);
}
// System.out.println(record.asXML());
for (final Object o : record.selectNodes("//*|//@*")) {
final String path = ((Node) o).getPath();
if (fields.containsKey(path)) {
fields.get(path).incrementAndGet();
} else {
fields.put(path, new AtomicInteger(1));
}
if (o instanceof Element) {
final Element n = (Element) o;
if ("collection".equals(n.getName())) {
final String collName = n.getText().trim();
if (StringUtils.isNotBlank(collName) && !collections.containsKey(collName)) {
final Map<String, String> collAttrs = new HashMap<>();
for (final Object ao : n.attributes()) {
collAttrs.put(((Attribute) ao).getName(), ((Attribute) ao).getValue());
}
collections.put(collName, collAttrs);
}
} else if ("type".equals(n.getName())) {
types.add(n.getText().trim());
}
}
}
}
final ObjectMapper mapper = new ObjectMapper();
for (final Entry<String, Map<String, String>> e : collections.entrySet()) {
System.out.println(e.getKey() + ": " + mapper.writeValueAsString(e.getValue()));
}
for (final Entry<String, AtomicInteger> e : fields.entrySet()) {
System.out.println(e.getKey() + ": " + e.getValue().get());
}
System.out.println("TYPES: ");
for (final String s : types) {
System.out.println(s);
}
assertEquals(30000, count);
}
@Test
public void testParquet() throws Exception {
final String xml = IOUtils.toString(getClass().getResourceAsStream("record.xml"));
final SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();
final List<BaseRecordInfo> ls = new ArrayList<>();
for (int i = 0; i < 10; i++) {
ls.add(extractInfo(xml));
}
final JavaRDD<BaseRecordInfo> rdd = JavaSparkContext
.fromSparkContext(spark.sparkContext())
.parallelize(ls);
final Dataset<BaseRecordInfo> df = spark
.createDataset(rdd.rdd(), Encoders.bean(BaseRecordInfo.class));
df.printSchema();
df.show(false);
}
private BaseRecordInfo extractInfo(final String s) {
try {
final Document record = DocumentHelper.parseText(s);
final BaseRecordInfo info = new BaseRecordInfo();
final Set<String> paths = new LinkedHashSet<>();
final Set<String> types = new LinkedHashSet<>();
final List<BaseCollectionInfo> colls = new ArrayList<>();
for (final Object o : record.selectNodes("//*|//@*")) {
paths.add(((Node) o).getPath());
if (o instanceof Element) {
final Element n = (Element) o;
final String nodeName = n.getName();
if ("collection".equals(nodeName)) {
final String collName = n.getText().trim();
if (StringUtils.isNotBlank(collName)) {
final BaseCollectionInfo coll = new BaseCollectionInfo();
coll.setId(collName);
coll.setOpendoarId(n.valueOf("@opendoar_id").trim());
coll.setRorId(n.valueOf("@ror_id").trim());
colls.add(coll);
}
} else if ("type".equals(nodeName)) {
types.add("TYPE: " + n.getText().trim());
} else if ("typenorm".equals(nodeName)) {
types.add("TYPE_NORM: " + n.getText().trim());
}
}
}
info.setId(record.valueOf("//*[local-name() = 'header']/*[local-name() = 'identifier']").trim());
info.getTypes().addAll(types);
info.getPaths().addAll(paths);
info.setCollections(colls);
return info;
} catch (final DocumentException e) {
throw new RuntimeException(e);
}
}
}

View File

@ -0,0 +1,32 @@
package eu.dnetlib.dhp.collection.plugin.base;
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.commons.io.IOUtils;
import org.junit.jupiter.api.Test;
class BaseCollectorPluginTest {
@Test
void testFilterXml() throws Exception {
final String xml = IOUtils.toString(getClass().getResourceAsStream("record.xml"));
final Set<String> validIds = new HashSet<>(Arrays.asList("opendoar____::1234", "opendoar____::4567"));
final Set<String> validTypes = new HashSet<>(Arrays.asList("1", "121"));
final Set<String> validTypes2 = new HashSet<>(Arrays.asList("1", "11"));
assertTrue(BaseCollectorPlugin.filterXml(xml, validIds, validTypes));
assertTrue(BaseCollectorPlugin.filterXml(xml, validIds, new HashSet<>()));
assertFalse(BaseCollectorPlugin.filterXml(xml, new HashSet<>(), validTypes));
assertFalse(BaseCollectorPlugin.filterXml(xml, validIds, validTypes2));
}
}

View File

@ -0,0 +1,49 @@
package eu.dnetlib.dhp.collection.plugin.base;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
public class BaseRecordInfo implements Serializable {
private static final long serialVersionUID = -8848232018350074593L;
private String id;
private List<BaseCollectionInfo> collections = new ArrayList<>();
private List<String> paths = new ArrayList<>();
private List<String> types = new ArrayList<>();
public String getId() {
return this.id;
}
public void setId(final String id) {
this.id = id;
}
public List<String> getPaths() {
return this.paths;
}
public void setPaths(final List<String> paths) {
this.paths = paths;
}
public List<String> getTypes() {
return this.types;
}
public void setTypes(final List<String> types) {
this.types = types;
}
public List<BaseCollectionInfo> getCollections() {
return this.collections;
}
public void setCollections(final List<BaseCollectionInfo> collections) {
this.collections = collections;
}
}

View File

@ -0,0 +1,77 @@
package eu.dnetlib.dhp.collection.plugin.base;
import java.io.IOException;
import org.apache.commons.io.IOUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.util.LongAccumulator;
import org.dom4j.io.SAXReader;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.junit.jupiter.MockitoExtension;
import eu.dnetlib.dhp.aggregation.AbstractVocabularyTest;
import eu.dnetlib.dhp.aggregation.common.AggregationCounter;
import eu.dnetlib.dhp.schema.mdstore.MetadataRecord;
import eu.dnetlib.dhp.schema.mdstore.Provenance;
import eu.dnetlib.dhp.transformation.xslt.XSLTTransformationFunction;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
// @Disabled
@ExtendWith(MockitoExtension.class)
public class BaseTransfomationTest extends AbstractVocabularyTest {
private SparkConf sparkConf;
@BeforeEach
public void setUp() throws IOException, ISLookUpException {
setUpVocabulary();
this.sparkConf = new SparkConf();
this.sparkConf.setMaster("local[*]");
this.sparkConf.set("spark.driver.host", "localhost");
this.sparkConf.set("spark.ui.enabled", "false");
}
@Test
void testBase2ODF() throws Exception {
final MetadataRecord mr = new MetadataRecord();
mr.setProvenance(new Provenance("DSID", "DSNAME", "PREFIX"));
mr.setBody(IOUtils.toString(getClass().getResourceAsStream("record.xml")));
final XSLTTransformationFunction tr = loadTransformationRule("xml/base2odf.transformationRule.xml");
final MetadataRecord result = tr.call(mr);
System.out.println(result.getBody());
}
@Test
void testBase2OAF() throws Exception {
final MetadataRecord mr = new MetadataRecord();
mr.setProvenance(new Provenance("DSID", "DSNAME", "PREFIX"));
mr.setBody(IOUtils.toString(getClass().getResourceAsStream("record.xml")));
final XSLTTransformationFunction tr = loadTransformationRule("xml/base2oaf.transformationRule.xml");
final MetadataRecord result = tr.call(mr);
System.out.println(result.getBody());
}
private XSLTTransformationFunction loadTransformationRule(final String path) throws Exception {
final String xslt = new SAXReader()
.read(this.getClass().getResourceAsStream(path))
.selectSingleNode("//CODE/*")
.asXML();
final LongAccumulator la = new LongAccumulator();
return new XSLTTransformationFunction(new AggregationCounter(la, la, la), xslt, 0, this.vocabularies);
}
}

View File

@ -0,0 +1,58 @@
<record>
<header xmlns="http://www.openarchives.org/OAI/2.0/">
<identifier>ftdoajarticles:oai:doaj.org/article:e2d5b5126b2d4e479933cc7f9a9ae0c1</identifier>
<datestamp>2022-12-31T11:48:55Z</datestamp>
</header>
<metadata xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:base_dc="http://oai.base-search.net/base_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/">
<base_dc:dc xsi:schemaLocation="http://oai.base-search.net/base_dc/ http://oai.base-search.net/base_dc/base_dc.xsd">
<base_dc:global_id>ftdoajarticles:oai:doaj.org/article:e2d5b5126b2d4e479933cc7f9a9ae0c1</base_dc:global_id>
<base_dc:continent>cww</base_dc:continent>
<base_dc:country>org</base_dc:country>
<base_dc:collection opendoar_id="1234" ror_id="ror1234">ftdoajarticles</base_dc:collection>
<base_dc:collname>TEST REPO</base_dc:collname>
<dc:title>Assessment of cultural heritage: the legislative and methodological framework of Russian Federation</dc:title>
<dc:creator>ALBU, Svetlana</dc:creator>
<dc:creator>LEȘAN, Anna</dc:creator>
<dc:subject>architectural heritage</dc:subject>
<dc:subject>evaluation of architectural heritage</dc:subject>
<dc:subject>types of values</dc:subject>
<dc:subject>experience of russian federation</dc:subject>
<dc:subject>Social Sciences</dc:subject>
<dc:subject>H</dc:subject>
<dc:description>Architectural heritage is the real estate inheritance by population of a country becoming an extremely valuable and specific category, preserving and capitalizing on those assets requires considerable effort. The state does not have sufficient means to maintain and preserve cultural heritage, as a result it is included in the civil circuit. The transfer of property right or of some partial rights over the architectural patrimony is accompanied by the necessity to estimate the value of goods. In this article, the authors examine the experience of Russian Federation (one of the largest countries with a huge architectural heritage) on the legislative framework of architectural and methodological heritage of architectural heritage assessment. The particularities of cultural assets valuation compared to other categories of real estate are examined, as well as the methodological aspects (types of values, methods applied in valuation, approaches according to the purpose of valuation) regarding the valuation of real estate with architectural value in Russian Federation.</dc:description>
<dc:publisher>Technical University of Moldova</dc:publisher>
<dc:date>2020-09-01T00:00:00Z</dc:date>
<base_dc:year>2020</base_dc:year>
<dc:type>article</dc:type>
<base_dc:typenorm>121</base_dc:typenorm>
<dc:identifier>https://doi.org/10.5281/zenodo.3971988</dc:identifier>
<dc:identifier>https://doaj.org/article/e2d5b5126b2d4e479933cc7f9a9ae0c1</dc:identifier>
<base_dc:link>https://doi.org/10.5281/zenodo.3971988</base_dc:link>
<dc:source>Journal of Social Sciences, Vol 3, Iss 3, Pp 134-143 (2020)</dc:source>
<dc:language>EN</dc:language>
<dc:language>FR</dc:language>
<dc:language>RO</dc:language>
<dc:relation>http://ibn.idsi.md/sites/default/files/imag_file/JSS-3-2020_134-143.pdf</dc:relation>
<dc:relation>https://doaj.org/toc/2587-3490</dc:relation>
<dc:relation>https://doaj.org/toc/2587-3504</dc:relation>
<dc:relation>doi:10.5281/zenodo.3971988</dc:relation>
<dc:relation>2587-3490</dc:relation>
<dc:relation>2587-3504</dc:relation>
<dc:relation>https://doaj.org/article/e2d5b5126b2d4e479933cc7f9a9ae0c1</dc:relation>
<base_dc:autoclasscode type="ddc">720</base_dc:autoclasscode>
<base_dc:authod_id>
<base_dc:creator_name>ALBU, Svetlana</base_dc:creator_name>
<base_dc:creator_id>https://orcid.org/0000-0002-8648-950X</base_dc:creator_id>
</base_dc:authod_id>
<base_dc:authod_id>
<base_dc:creator_name>LEȘAN, Anna</base_dc:creator_name>
<base_dc:creator_id>https://orcid.org/0000-0003-3284-0525</base_dc:creator_id>
</base_dc:authod_id>
<base_dc:doi>https://doi.org/10.5281/zenodo.3971988</base_dc:doi>
<base_dc:oa>1</base_dc:oa>
<base_dc:lang>eng</base_dc:lang>
<base_dc:lang>fre</base_dc:lang>
<base_dc:lang>rum</base_dc:lang>
</base_dc:dc>
</metadata>
</record>

View File

@ -1496,4 +1496,30 @@ cnr:institutes @=@ __CDS131__ @=@ IBE - Istituto per la BioEconomia
cnr:institutes @=@ https://ror.org/0263zy895 @=@ CDS132
cnr:institutes @=@ https://ror.org/0263zy895 @=@ SCITEC - Istituto di Scienze e Tecnologie Chimiche \"Giulio Natta\"
cnr:institutes @=@ __CDS133__ @=@ CDS133
cnr:institutes @=@ __CDS133__ @=@ STEMS - Istituto di Scienze e Tecnologie per l'Energia e la Mobilità Sostenibili
cnr:institutes @=@ __CDS133__ @=@ STEMS - Istituto di Scienze e Tecnologie per l'Energia e la Mobilità Sostenibili
base:normalized_types @=@ Text @=@ 1
base:normalized_types @=@ Book @=@ 11
base:normalized_types @=@ Book part @=@ 111
base:normalized_types @=@ Journal/Newspaper @=@ 12
base:normalized_types @=@ Article contribution @=@ 121
base:normalized_types @=@ Other non-article @=@ 122
base:normalized_types @=@ Conference object @=@ 13
base:normalized_types @=@ Report @=@ 14
base:normalized_types @=@ Review @=@ 15
base:normalized_types @=@ Course material @=@ 16
base:normalized_types @=@ Lecture @=@ 17
base:normalized_types @=@ Thesis @=@ 18
base:normalized_types @=@ Bachelor's thesis @=@ 181
base:normalized_types @=@ Master's thesis @=@ 182
base:normalized_types @=@ Doctoral and postdoctoral thesis @=@ 183
base:normalized_types @=@ Manuscript @=@ 19
base:normalized_types @=@ Patent @=@ 1A
base:normalized_types @=@ Musical notation @=@ 2
base:normalized_types @=@ Map @=@ 3
base:normalized_types @=@ Audio @=@ 4
base:normalized_types @=@ Image/Video @=@ 5
base:normalized_types @=@ Still image @=@ 51
base:normalized_types @=@ Moving image/Video @=@ 52
base:normalized_types @=@ Software @=@ 6
base:normalized_types @=@ Dataset @=@ 7
base:normalized_types @=@ Unknown @=@ F

View File

@ -1210,4 +1210,29 @@ cnr:institutes @=@ cnr:institutes @=@ __CDS130__ @=@ __CDS130__
cnr:institutes @=@ cnr:institutes @=@ __CDS131__ @=@ __CDS131__
cnr:institutes @=@ cnr:institutes @=@ https://ror.org/0263zy895 @=@ https://ror.org/0263zy895
cnr:institutes @=@ cnr:institutes @=@ __CDS133__ @=@ __CDS133__
base:normalized_types @=@ base:normalized_types @=@ Text @=@ Text
base:normalized_types @=@ base:normalized_types @=@ Book @=@ Book
base:normalized_types @=@ base:normalized_types @=@ Book part @=@ Book part
base:normalized_types @=@ base:normalized_types @=@ Journal/Newspaper @=@ Journal/Newspaper
base:normalized_types @=@ base:normalized_types @=@ Article contribution @=@ Article contribution
base:normalized_types @=@ base:normalized_types @=@ Other non-article @=@ Other non-article
base:normalized_types @=@ base:normalized_types @=@ Conference object @=@ Conference object
base:normalized_types @=@ base:normalized_types @=@ Report @=@ Report
base:normalized_types @=@ base:normalized_types @=@ Review @=@ Review
base:normalized_types @=@ base:normalized_types @=@ Course material @=@ Course material
base:normalized_types @=@ base:normalized_types @=@ Lecture @=@ Lecture
base:normalized_types @=@ base:normalized_types @=@ Thesis @=@ Thesis
base:normalized_types @=@ base:normalized_types @=@ Bachelor's thesis @=@ Bachelor's thesis
base:normalized_types @=@ base:normalized_types @=@ Master's thesis @=@ Master's thesis
base:normalized_types @=@ base:normalized_types @=@ Doctoral and postdoctoral thesis @=@ Doctoral and postdoctoral thesis
base:normalized_types @=@ base:normalized_types @=@ Manuscript @=@ Manuscript
base:normalized_types @=@ base:normalized_types @=@ Patent @=@ Patent
base:normalized_types @=@ base:normalized_types @=@ Musical notation @=@ Musical notation
base:normalized_types @=@ base:normalized_types @=@ Map @=@ Map
base:normalized_types @=@ base:normalized_types @=@ Audio @=@ Audio
base:normalized_types @=@ base:normalized_types @=@ Image/Video @=@ Image/Video
base:normalized_types @=@ base:normalized_types @=@ Still image @=@ Still image
base:normalized_types @=@ base:normalized_types @=@ Moving image/Video @=@ Moving image/Video
base:normalized_types @=@ base:normalized_types @=@ Software @=@ Software
base:normalized_types @=@ base:normalized_types @=@ Dataset @=@ Dataset
base:normalized_types @=@ base:normalized_types @=@ Unknown @=@ Unknown

View File

@ -101,6 +101,10 @@ abstract class AbstractSparkAction implements Serializable {
return SparkSession.builder().config(conf).getOrCreate();
}
protected static SparkSession getSparkWithHiveSession(SparkConf conf) {
return SparkSession.builder().enableHiveSupport().config(conf).getOrCreate();
}
protected static <T> void save(Dataset<T> dataset, String outPath, SaveMode mode) {
dataset.write().option("compression", "gzip").mode(mode).json(outPath);
}

View File

@ -1,128 +1,212 @@
package eu.dnetlib.dhp.oa.dedup;
import java.lang.reflect.InvocationTargetException;
import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import org.apache.commons.beanutils.BeanUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.collect.Lists;
import org.apache.spark.api.java.function.ReduceFunction;
import org.apache.spark.sql.*;
import eu.dnetlib.dhp.oa.dedup.model.Identifier;
import eu.dnetlib.dhp.oa.merge.AuthorMerger;
import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.schema.oaf.Author;
import eu.dnetlib.dhp.schema.oaf.DataInfo;
import eu.dnetlib.dhp.schema.oaf.OafEntity;
import eu.dnetlib.dhp.schema.oaf.Result;
import scala.Tuple2;
import scala.Tuple3;
import scala.collection.JavaConversions;
public class DedupRecordFactory {
public static final class DedupRecordReduceState {
public final String dedupId;
protected static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
public final ArrayList<String> aliases = new ArrayList<>();
public final HashSet<String> acceptanceDate = new HashSet<>();
public OafEntity entity;
public DedupRecordReduceState(String dedupId, String id, OafEntity entity) {
this.dedupId = dedupId;
this.entity = entity;
if (entity == null) {
aliases.add(id);
} else {
if (Result.class.isAssignableFrom(entity.getClass())) {
Result result = (Result) entity;
if (result.getDateofacceptance() != null
&& StringUtils.isNotBlank(result.getDateofacceptance().getValue())) {
acceptanceDate.add(result.getDateofacceptance().getValue());
}
}
}
}
public String getDedupId() {
return dedupId;
}
}
private static final int MAX_ACCEPTANCE_DATE = 20;
private DedupRecordFactory() {
}
public static <T extends OafEntity> Dataset<T> createDedupRecord(
public static Dataset<OafEntity> createDedupRecord(
final SparkSession spark,
final DataInfo dataInfo,
final String mergeRelsInputPath,
final String entitiesInputPath,
final Class<T> clazz) {
final Class<OafEntity> clazz) {
long ts = System.currentTimeMillis();
final long ts = System.currentTimeMillis();
final Encoder<OafEntity> beanEncoder = Encoders.bean(clazz);
final Encoder<OafEntity> kryoEncoder = Encoders.kryo(clazz);
// <id, json_entity>
Dataset<Tuple2<String, T>> entities = spark
Dataset<Row> entities = spark
.read()
.textFile(entitiesInputPath)
.schema(Encoders.bean(clazz).schema())
.json(entitiesInputPath)
.as(beanEncoder)
.map(
(MapFunction<String, Tuple2<String, T>>) it -> {
T entity = OBJECT_MAPPER.readValue(it, clazz);
(MapFunction<OafEntity, Tuple2<String, OafEntity>>) entity -> {
return new Tuple2<>(entity.getId(), entity);
},
Encoders.tuple(Encoders.STRING(), Encoders.kryo(clazz)));
Encoders.tuple(Encoders.STRING(), kryoEncoder))
.selectExpr("_1 AS id", "_2 AS kryoObject");
// <source, target>: source is the dedup_id, target is the id of the mergedIn
Dataset<Tuple2<String, String>> mergeRels = spark
Dataset<Row> mergeRels = spark
.read()
.load(mergeRelsInputPath)
.as(Encoders.bean(Relation.class))
.where("relClass == 'merges'")
.map(
(MapFunction<Relation, Tuple2<String, String>>) r -> new Tuple2<>(r.getSource(), r.getTarget()),
Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
.selectExpr("source as dedupId", "target as id");
return mergeRels
.joinWith(entities, mergeRels.col("_2").equalTo(entities.col("_1")), "inner")
.join(entities, JavaConversions.asScalaBuffer(Collections.singletonList("id")), "left")
.select("dedupId", "id", "kryoObject")
.as(Encoders.tuple(Encoders.STRING(), Encoders.STRING(), kryoEncoder))
.map(
(MapFunction<Tuple2<Tuple2<String, String>, Tuple2<String, T>>, Tuple2<String, T>>) value -> new Tuple2<>(
value._1()._1(), value._2()._2()),
Encoders.tuple(Encoders.STRING(), Encoders.kryo(clazz)))
(MapFunction<Tuple3<String, String, OafEntity>, DedupRecordReduceState>) t -> new DedupRecordReduceState(
t._1(), t._2(), t._3()),
Encoders.kryo(DedupRecordReduceState.class))
.groupByKey(
(MapFunction<Tuple2<String, T>, String>) Tuple2::_1, Encoders.STRING())
.mapGroups(
(MapGroupsFunction<String, Tuple2<String, T>, T>) (key,
values) -> entityMerger(key, values, ts, dataInfo, clazz),
Encoders.bean(clazz));
(MapFunction<DedupRecordReduceState, String>) DedupRecordReduceState::getDedupId, Encoders.STRING())
.reduceGroups(
(ReduceFunction<DedupRecordReduceState>) (t1, t2) -> {
if (t1.entity == null) {
t2.aliases.addAll(t1.aliases);
return t2;
}
if (t1.acceptanceDate.size() < MAX_ACCEPTANCE_DATE) {
t1.acceptanceDate.addAll(t2.acceptanceDate);
}
t1.aliases.addAll(t2.aliases);
t1.entity = reduceEntity(t1.entity, t2.entity);
return t1;
})
.flatMap((FlatMapFunction<Tuple2<String, DedupRecordReduceState>, OafEntity>) t -> {
String dedupId = t._1();
DedupRecordReduceState agg = t._2();
if (agg.acceptanceDate.size() >= MAX_ACCEPTANCE_DATE) {
return Collections.emptyIterator();
}
return Stream
.concat(
Stream
.of(agg.getDedupId())
.map(id -> createDedupOafEntity(id, agg.entity, dataInfo, ts)),
agg.aliases
.stream()
.map(id -> createMergedDedupAliasOafEntity(id, agg.entity, dataInfo, ts)))
.iterator();
}, beanEncoder);
}
public static <T extends OafEntity> T entityMerger(
String id, Iterator<Tuple2<String, T>> entities, long ts, DataInfo dataInfo, Class<T> clazz)
throws IllegalAccessException, InstantiationException, InvocationTargetException {
private static OafEntity createDedupOafEntity(String id, OafEntity base, DataInfo dataInfo, long ts) {
try {
OafEntity res = (OafEntity) BeanUtils.cloneBean(base);
res.setId(id);
res.setDataInfo(dataInfo);
res.setLastupdatetimestamp(ts);
return res;
} catch (Exception e) {
throw new RuntimeException(e);
}
}
final Comparator<Identifier<T>> idComparator = new IdentifierComparator<>();
private static OafEntity createMergedDedupAliasOafEntity(String id, OafEntity base, DataInfo dataInfo, long ts) {
try {
OafEntity res = createDedupOafEntity(id, base, dataInfo, ts);
DataInfo ds = (DataInfo) BeanUtils.cloneBean(dataInfo);
ds.setDeletedbyinference(true);
res.setDataInfo(ds);
return res;
} catch (Exception e) {
throw new RuntimeException(e);
}
}
final LinkedList<T> entityList = Lists
.newArrayList(entities)
.stream()
.map(t -> Identifier.newInstance(t._2()))
.sorted(idComparator)
.map(Identifier::getEntity)
.collect(Collectors.toCollection(LinkedList::new));
private static OafEntity reduceEntity(OafEntity entity, OafEntity duplicate) {
final T entity = clazz.newInstance();
final T first = entityList.removeFirst();
BeanUtils.copyProperties(entity, first);
final List<List<Author>> authors = Lists.newArrayList();
entityList
.forEach(
duplicate -> {
entity.mergeFrom(duplicate);
if (ModelSupport.isSubClass(duplicate, Result.class)) {
Result r1 = (Result) duplicate;
Optional
.ofNullable(r1.getAuthor())
.ifPresent(a -> authors.add(a));
}
});
// set authors and date
if (ModelSupport.isSubClass(entity, Result.class)) {
Optional
.ofNullable(((Result) entity).getAuthor())
.ifPresent(a -> authors.add(a));
((Result) entity).setAuthor(AuthorMerger.merge(authors));
if (duplicate == null) {
return entity;
}
entity.setId(id);
int compare = new IdentifierComparator<>()
.compare(Identifier.newInstance(entity), Identifier.newInstance(duplicate));
entity.setLastupdatetimestamp(ts);
entity.setDataInfo(dataInfo);
if (compare > 0) {
OafEntity swap = duplicate;
duplicate = entity;
entity = swap;
}
entity.mergeFrom(duplicate);
if (ModelSupport.isSubClass(duplicate, Result.class)) {
Result re = (Result) entity;
Result rd = (Result) duplicate;
List<List<Author>> authors = new ArrayList<>();
if (re.getAuthor() != null) {
authors.add(re.getAuthor());
}
if (rd.getAuthor() != null) {
authors.add(rd.getAuthor());
}
re.setAuthor(AuthorMerger.merge(authors));
}
return entity;
}
public static <T extends OafEntity> T entityMerger(
String id, Iterator<Tuple2<String, T>> entities, long ts, DataInfo dataInfo, Class<T> clazz) {
T base = entities.next()._2();
while (entities.hasNext()) {
T duplicate = entities.next()._2();
if (duplicate != null)
base = (T) reduceEntity(base, duplicate);
}
base.setId(id);
base.setDataInfo(dataInfo);
base.setLastupdatetimestamp(ts);
return base;
}
}

View File

@ -1,6 +1,7 @@
package eu.dnetlib.dhp.oa.dedup;
import static eu.dnetlib.dhp.utils.DHPUtils.md5;
import static org.apache.commons.lang3.StringUtils.substringAfter;
import static org.apache.commons.lang3.StringUtils.substringBefore;
@ -14,33 +15,36 @@ import eu.dnetlib.dhp.schema.oaf.utils.PidType;
public class IdGenerator implements Serializable {
// pick the best pid from the list (consider date and pidtype)
public static <T extends OafEntity> String generate(List<Identifier<T>> pids, String defaultID) {
public static <T extends OafEntity> String generate(List<? extends Identifier> pids, String defaultID) {
if (pids == null || pids.isEmpty())
return defaultID;
return generateId(pids);
}
private static <T extends OafEntity> String generateId(List<Identifier<T>> pids) {
Identifier<T> bp = pids
private static String generateId(List<? extends Identifier> pids) {
Identifier bp = pids
.stream()
.min(Identifier::compareTo)
.orElseThrow(() -> new IllegalStateException("unable to generate id"));
String prefix = substringBefore(bp.getOriginalID(), "|");
String ns = substringBefore(substringAfter(bp.getOriginalID(), "|"), "::");
String suffix = substringAfter(bp.getOriginalID(), "::");
return generate(bp.getOriginalID());
}
public static String generate(String originalId) {
String prefix = substringBefore(originalId, "|");
String ns = substringBefore(substringAfter(originalId, "|"), "::");
String suffix = substringAfter(originalId, "::");
final String pidType = substringBefore(ns, "_");
if (PidType.isValid(pidType)) {
return prefix + "|" + dedupify(ns) + "::" + suffix;
} else {
return prefix + "|dedup_wf_001::" + suffix;
return prefix + "|dedup_wf_002::" + md5(originalId); // hash the whole originalId to avoid collisions
}
}
private static String dedupify(String ns) {
StringBuilder prefix;
if (PidType.valueOf(substringBefore(ns, "_")) == PidType.openorgs) {
prefix = new StringBuilder(substringBefore(ns, "_"));
@ -53,5 +57,4 @@ public class IdGenerator implements Serializable {
}
return prefix.substring(0, 12);
}
}

View File

@ -3,49 +3,47 @@ package eu.dnetlib.dhp.oa.dedup;
import static eu.dnetlib.dhp.schema.common.ModelConstants.DNET_PROVENANCE_ACTIONS;
import static eu.dnetlib.dhp.schema.common.ModelConstants.PROVENANCE_DEDUP;
import static org.apache.spark.sql.functions.*;
import java.io.IOException;
import java.util.*;
import java.util.stream.Collectors;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Optional;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsFunction;
import org.apache.spark.graphx.Edge;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.dom4j.DocumentException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.xml.sax.SAXException;
import com.google.common.collect.Lists;
import com.google.common.hash.Hashing;
import com.kwartile.lib.cc.ConnectedComponent;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.oa.dedup.graph.ConnectedComponent;
import eu.dnetlib.dhp.oa.dedup.graph.GraphProcessor;
import eu.dnetlib.dhp.oa.dedup.model.Identifier;
import eu.dnetlib.dhp.schema.common.EntityType;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.DataInfo;
import eu.dnetlib.dhp.schema.oaf.OafEntity;
import eu.dnetlib.dhp.schema.oaf.Qualifier;
import eu.dnetlib.dhp.schema.oaf.Relation;
import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.schema.oaf.utils.PidType;
import eu.dnetlib.dhp.utils.ISLookupClientFactory;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
import eu.dnetlib.pace.config.DedupConfig;
import eu.dnetlib.pace.util.MapDocumentUtil;
import scala.Tuple2;
import scala.Tuple3;
import scala.collection.JavaConversions;
public class SparkCreateMergeRels extends AbstractSparkAction {
@ -68,10 +66,12 @@ public class SparkCreateMergeRels extends AbstractSparkAction {
log.info("isLookupUrl {}", isLookUpUrl);
SparkConf conf = new SparkConf();
conf.set("hive.metastore.uris", parser.get("hiveMetastoreUris"));
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
conf.registerKryoClasses(ModelSupport.getOafModelClasses());
new SparkCreateMergeRels(parser, getSparkSession(conf))
new SparkCreateMergeRels(parser, getSparkWithHiveSession(conf))
.run(ISLookupClientFactory.getLookUpService(isLookUpUrl));
}
@ -87,14 +87,15 @@ public class SparkCreateMergeRels extends AbstractSparkAction {
.ofNullable(parser.get("cutConnectedComponent"))
.map(Integer::valueOf)
.orElse(0);
final String pivotHistoryDatabase = parser.get("pivotHistoryDatabase");
log.info("connected component cut: '{}'", cut);
log.info("graphBasePath: '{}'", graphBasePath);
log.info("isLookUpUrl: '{}'", isLookUpUrl);
log.info("actionSetId: '{}'", actionSetId);
log.info("workingPath: '{}'", workingPath);
final JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
for (DedupConfig dedupConf : getConfigurations(isLookUpService, actionSetId)) {
final String subEntity = dedupConf.getWf().getSubEntityValue();
final Class<OafEntity> clazz = ModelSupport.entityTypes.get(EntityType.valueOf(subEntity));
@ -106,113 +107,173 @@ public class SparkCreateMergeRels extends AbstractSparkAction {
final String mergeRelPath = DedupUtility.createMergeRelPath(workingPath, actionSetId, subEntity);
// <hash(id), id>
JavaPairRDD<Object, String> vertexes = createVertexes(sc, graphBasePath, subEntity, dedupConf);
final RDD<Edge<String>> edgeRdd = spark
final Dataset<Row> simRels = spark
.read()
.load(DedupUtility.createSimRelPath(workingPath, actionSetId, subEntity))
.as(Encoders.bean(Relation.class))
.javaRDD()
.map(it -> new Edge<>(hash(it.getSource()), hash(it.getTarget()), it.getRelClass()))
.rdd();
.select("source", "target");
Dataset<Tuple2<String, String>> rawMergeRels = spark
.createDataset(
GraphProcessor
.findCCs(vertexes.rdd(), edgeRdd, maxIterations, cut)
.toJavaRDD()
.filter(k -> k.getIds().size() > 1)
.flatMap(this::ccToRels)
.rdd(),
Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
UserDefinedFunction hashUDF = functions
.udf(
(String s) -> hash(s), DataTypes.LongType);
Dataset<Tuple2<String, OafEntity>> entities = spark
// <hash(id), id>
Dataset<Row> vertexIdMap = simRels
.selectExpr("source as id")
.union(simRels.selectExpr("target as id"))
.distinct()
.withColumn("vertexId", hashUDF.apply(functions.col("id")));
// transform simrels into pairs of numeric ids
final Dataset<Row> edges = spark
.read()
.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
.map(
(MapFunction<String, Tuple2<String, OafEntity>>) it -> {
OafEntity entity = OBJECT_MAPPER.readValue(it, clazz);
return new Tuple2<>(entity.getId(), entity);
},
Encoders.tuple(Encoders.STRING(), Encoders.kryo(clazz)));
.load(DedupUtility.createSimRelPath(workingPath, actionSetId, subEntity))
.select("source", "target")
.withColumn("source", hashUDF.apply(functions.col("source")))
.withColumn("target", hashUDF.apply(functions.col("target")));
Dataset<Relation> mergeRels = rawMergeRels
.joinWith(entities, rawMergeRels.col("_2").equalTo(entities.col("_1")), "inner")
// <tmp_source,target>,<target,entity>
.map(
(MapFunction<Tuple2<Tuple2<String, String>, Tuple2<String, OafEntity>>, Tuple2<String, OafEntity>>) value -> new Tuple2<>(
value._1()._1(), value._2()._2()),
Encoders.tuple(Encoders.STRING(), Encoders.kryo(clazz)))
// <tmp_source,entity>
.groupByKey(
(MapFunction<Tuple2<String, OafEntity>, String>) Tuple2::_1, Encoders.STRING())
.mapGroups(
(MapGroupsFunction<String, Tuple2<String, OafEntity>, ConnectedComponent>) this::generateID,
Encoders.bean(ConnectedComponent.class))
// <root_id, list(target)>
// resolve connected components
// ("vertexId", "groupId")
Dataset<Row> cliques = ConnectedComponent
.runOnPairs(edges, 50, spark);
// transform "vertexId" back to its original string value
// groupId is kept numeric as its string value is not used
// ("id", "groupId")
Dataset<Row> rawMergeRels = cliques
.join(vertexIdMap, JavaConversions.asScalaBuffer(Collections.singletonList("vertexId")), "inner")
.drop("vertexId")
.distinct();
// empty dataframe if historydatabase is not used
Dataset<Row> pivotHistory = spark
.createDataset(
Collections.emptyList(),
RowEncoder
.apply(StructType.fromDDL("id STRING, lastUsage STRING")));
if (StringUtils.isNotBlank(pivotHistoryDatabase)) {
pivotHistory = spark
.read()
.table(pivotHistoryDatabase + "." + subEntity)
.selectExpr("id", "lastUsage");
}
// depending on resulttype collectefrom and dateofacceptance are evaluated differently
String collectedfromExpr = "false AS collectedfrom";
String dateExpr = "'' AS date";
if (Result.class.isAssignableFrom(clazz)) {
if (Publication.class.isAssignableFrom(clazz)) {
collectedfromExpr = "array_contains(collectedfrom.key, '" + ModelConstants.CROSSREF_ID
+ "') AS collectedfrom";
} else if (eu.dnetlib.dhp.schema.oaf.Dataset.class.isAssignableFrom(clazz)) {
collectedfromExpr = "array_contains(collectedfrom.key, '" + ModelConstants.DATACITE_ID
+ "') AS collectedfrom";
}
dateExpr = "dateofacceptance.value AS date";
}
// cap pidType at w3id as from there on they are considered equal
UserDefinedFunction mapPid = udf(
(String s) -> Math.min(PidType.tryValueOf(s).ordinal(), PidType.w3id.ordinal()), DataTypes.IntegerType);
UserDefinedFunction validDate = udf((String date) -> {
if (StringUtils.isNotBlank(date)
&& date.matches(DatePicker.DATE_PATTERN) && DatePicker.inRange(date)) {
return date;
}
return LocalDate.now().plusWeeks(1).toString();
}, DataTypes.StringType);
Dataset<Row> pivotingData = spark
.read()
.schema(Encoders.bean(clazz).schema())
.json(DedupUtility.createEntityPath(graphBasePath, subEntity))
.selectExpr(
"id",
"regexp_extract(id, '^\\\\d+\\\\|([^_]+).*::', 1) AS pidType",
collectedfromExpr,
dateExpr)
.withColumn("pidType", mapPid.apply(col("pidType"))) // ordinal of pid type
.withColumn("date", validDate.apply(col("date")));
// ordering to selected pivot id
WindowSpec w = Window
.partitionBy("groupId")
.orderBy(
col("lastUsage").desc_nulls_last(),
col("pidType").asc_nulls_last(),
col("collectedfrom").desc_nulls_last(),
col("date").asc_nulls_last(),
col("id").asc_nulls_last());
Dataset<Relation> output = rawMergeRels
.join(pivotHistory, JavaConversions.asScalaBuffer(Collections.singletonList("id")), "full")
.join(pivotingData, JavaConversions.asScalaBuffer(Collections.singletonList("id")), "left")
.withColumn("pivot", functions.first("id").over(w))
.withColumn("position", functions.row_number().over(w))
.flatMap(
(FlatMapFunction<ConnectedComponent, Relation>) cc -> ccToMergeRel(cc, dedupConf),
Encoders.bean(Relation.class));
(FlatMapFunction<Row, Tuple3<String, String, String>>) (Row r) -> {
String id = r.getAs("id");
String dedupId = IdGenerator.generate(id);
saveParquet(mergeRels, mergeRelPath, SaveMode.Overwrite);
String pivot = r.getAs("pivot");
String pivotDedupId = IdGenerator.generate(pivot);
// filter out id == pivotDedupId
// those are caused by claim expressed on pivotDedupId
// information will be merged after creating deduprecord
if (id.equals(pivotDedupId)) {
return Collections.emptyIterator();
}
ArrayList<Tuple3<String, String, String>> res = new ArrayList<>();
// singleton pivots have null groupId as they do not match rawMergeRels
if (r.isNullAt(r.fieldIndex("groupId"))) {
// the record is existing if it matches pivotingData
if (!r.isNullAt(r.fieldIndex("collectedfrom"))) {
// create relation with old dedup id
res.add(new Tuple3<>(id, dedupId, null));
}
return res.iterator();
}
// this was a pivot in a previous graph but it has been merged into a new group with different
// pivot
if (!r.isNullAt(r.fieldIndex("lastUsage")) && !pivot.equals(id)
&& !dedupId.equals(pivotDedupId)) {
// materialize the previous dedup record as a merge relation with the new one
res.add(new Tuple3<>(dedupId, pivotDedupId, null));
}
// add merge relations
if (cut <= 0 || r.<Integer> getAs("position") <= cut) {
res.add(new Tuple3<>(id, pivotDedupId, pivot));
}
return res.iterator();
}, Encoders.tuple(Encoders.STRING(), Encoders.STRING(), Encoders.STRING()))
.distinct()
.flatMap(
(FlatMapFunction<Tuple3<String, String, String>, Relation>) (Tuple3<String, String, String> r) -> {
String id = r._1();
String dedupId = r._2();
String pivot = r._3();
ArrayList<Relation> res = new ArrayList<>();
res.add(rel(pivot, dedupId, id, ModelConstants.MERGES, dedupConf));
res.add(rel(pivot, id, dedupId, ModelConstants.IS_MERGED_IN, dedupConf));
return res.iterator();
}, Encoders.bean(Relation.class));
saveParquet(output, mergeRelPath, SaveMode.Overwrite);
}
}
private <T extends OafEntity> ConnectedComponent generateID(String key, Iterator<Tuple2<String, T>> values) {
List<Identifier<T>> identifiers = Lists
.newArrayList(values)
.stream()
.map(v -> Identifier.newInstance(v._2()))
.collect(Collectors.toList());
String rootID = IdGenerator.generate(identifiers, key);
if (Objects.equals(rootID, key))
throw new IllegalStateException("generated default ID: " + rootID);
return new ConnectedComponent(rootID,
identifiers.stream().map(i -> i.getEntity().getId()).collect(Collectors.toSet()));
}
private JavaPairRDD<Object, String> createVertexes(JavaSparkContext sc, String graphBasePath, String subEntity,
DedupConfig dedupConf) {
return sc
.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
.mapToPair(json -> {
String id = MapDocumentUtil.getJPathString(dedupConf.getWf().getIdPath(), json);
return new Tuple2<>(hash(id), id);
});
}
private Iterator<Tuple2<String, String>> ccToRels(ConnectedComponent cc) {
return cc
.getIds()
.stream()
.map(id -> new Tuple2<>(cc.getCcId(), id))
.iterator();
}
private Iterator<Relation> ccToMergeRel(ConnectedComponent cc, DedupConfig dedupConf) {
return cc
.getIds()
.stream()
.flatMap(
id -> {
List<Relation> tmp = new ArrayList<>();
tmp.add(rel(cc.getCcId(), id, ModelConstants.MERGES, dedupConf));
tmp.add(rel(id, cc.getCcId(), ModelConstants.IS_MERGED_IN, dedupConf));
return tmp.stream();
})
.iterator();
}
private Relation rel(String source, String target, String relClass, DedupConfig dedupConf) {
private static Relation rel(String pivot, String source, String target, String relClass, DedupConfig dedupConf) {
String entityType = dedupConf.getWf().getEntityType();
@ -238,6 +299,14 @@ public class SparkCreateMergeRels extends AbstractSparkAction {
// TODO calculate the trust value based on the similarity score of the elements in the CC
r.setDataInfo(info);
if (pivot != null) {
KeyValue pivotKV = new KeyValue();
pivotKV.setKey("pivot");
pivotKV.setValue(pivot);
r.setProperties(Arrays.asList(pivotKV));
}
return r;
}

View File

@ -91,18 +91,12 @@ public class SparkWhitelistSimRels extends AbstractSparkAction {
Dataset<Row> entities = spark
.read()
.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
.repartition(numPartitions)
.withColumn("id", functions.get_json_object(new Column("value"), dedupConf.getWf().getIdPath()));
.select(functions.get_json_object(new Column("value"), dedupConf.getWf().getIdPath()).as("id"))
.distinct();
Dataset<Row> whiteListRels1 = whiteListRels
.join(entities, entities.col("id").equalTo(whiteListRels.col("from")), "inner")
.select("from", "to");
Dataset<Row> whiteListRels2 = whiteListRels1
.join(entities, whiteListRels1.col("to").equalTo(entities.col("id")), "inner")
.select("from", "to");
Dataset<Relation> whiteListSimRels = whiteListRels2
Dataset<Relation> whiteListSimRels = whiteListRels
.join(entities, entities.col("id").equalTo(whiteListRels.col("from")), "leftsemi")
.join(entities, functions.col("to").equalTo(entities.col("id")), "leftsemi")
.map(
(MapFunction<Row, Relation>) r -> DedupUtility
.createSimRel(r.getString(0), r.getString(1), entity),

View File

@ -1,100 +0,0 @@
package eu.dnetlib.dhp.oa.dedup.graph;
import java.io.IOException;
import java.io.Serializable;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.commons.lang3.StringUtils;
import org.codehaus.jackson.annotate.JsonIgnore;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.utils.DHPUtils;
import eu.dnetlib.pace.util.PaceException;
public class ConnectedComponent implements Serializable {
private String ccId;
private Set<String> ids;
private static final String CONNECTED_COMPONENT_ID_PREFIX = "connect_comp";
public ConnectedComponent(Set<String> ids, final int cut) {
this.ids = ids;
this.ccId = createDefaultID();
if (cut > 0 && ids.size() > cut) {
this.ids = ids
.stream()
.filter(id -> !ccId.equalsIgnoreCase(id))
.limit(cut - 1)
.collect(Collectors.toSet());
// this.ids.add(ccId); ??
}
}
public ConnectedComponent(String ccId, Set<String> ids) {
this.ccId = ccId;
this.ids = ids;
}
public String createDefaultID() {
if (ids.size() > 1) {
final String s = getMin();
String prefix = s.split("\\|")[0];
ccId = prefix + "|" + CONNECTED_COMPONENT_ID_PREFIX + "::" + DHPUtils.md5(s);
return ccId;
} else {
return ids.iterator().next();
}
}
@JsonIgnore
public String getMin() {
final StringBuilder min = new StringBuilder();
ids
.forEach(
id -> {
if (StringUtils.isBlank(min.toString())) {
min.append(id);
} else {
if (min.toString().compareTo(id) > 0) {
min.setLength(0);
min.append(id);
}
}
});
return min.toString();
}
@Override
public String toString() {
ObjectMapper mapper = new ObjectMapper();
try {
return mapper.writeValueAsString(this);
} catch (IOException e) {
throw new PaceException("Failed to create Json: ", e);
}
}
public Set<String> getIds() {
return ids;
}
public void setIds(Set<String> ids) {
this.ids = ids;
}
public String getCcId() {
return ccId;
}
public void setCcId(String ccId) {
this.ccId = ccId;
}
}

View File

@ -1,37 +0,0 @@
package eu.dnetlib.dhp.oa.dedup.graph
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import scala.collection.JavaConversions;
object GraphProcessor {
def findCCs(vertexes: RDD[(VertexId, String)], edges: RDD[Edge[String]], maxIterations: Int, cut:Int): RDD[ConnectedComponent] = {
val graph: Graph[String, String] = Graph(vertexes, edges).partitionBy(PartitionStrategy.RandomVertexCut) //TODO remember to remove partitionby
val cc = graph.connectedComponents(maxIterations).vertices
val joinResult = vertexes.leftOuterJoin(cc).map {
case (id, (openaireId, cc)) => {
if (cc.isEmpty) {
(id, openaireId)
}
else {
(cc.get, openaireId)
}
}
}
val connectedComponents = joinResult.groupByKey()
.map[ConnectedComponent](cc => asConnectedComponent(cc, cut))
connectedComponents
}
def asConnectedComponent(group: (VertexId, Iterable[String]), cut:Int): ConnectedComponent = {
val docs = group._2.toSet[String]
val connectedComponent = new ConnectedComponent(JavaConversions.setAsJavaSet[String](docs), cut);
connectedComponent
}
}

Some files were not shown because too many files have changed in this diff Show More