Compare commits

...

171 Commits

Author SHA1 Message Date
sandro.labruzzo 730a7751b6 added zenodoDump to enum of CollectorPlugin 2024-12-04 15:03:59 +01:00
sandro.labruzzo 5f134c4045 Merge remote-tracking branch 'origin/beta' into zenodo_dump_collection 2024-12-04 13:41:25 +01:00
sandro.labruzzo 4034da7579 code formatted 2024-12-04 13:37:14 +01:00
sandro.labruzzo 32e2a8b340 implemented zenodo dump collector plugin 2024-12-04 13:36:21 +01:00
sandro.labruzzo cc6bbbb804 make setter void 2024-12-03 14:31:11 +01:00
sandro.labruzzo 0517e452e3 Fixed error on empty affiliation 2024-12-02 14:00:59 +01:00
Miriam Baglioni ca2d480df3 [BulkTagging] added fix to consider when the set of constraints for the datasource is empty. Added check for remove constraints and advanced constraints to verify if the constraints list is empty and in that case do nothing 2024-11-26 15:56:52 +01:00
Claudio Atzori 2e54715d71 Applying PR#512 - Sequential ActionSet promotion 2024-11-26 15:56:46 +01:00
Claudio Atzori 15227f82b8 added related author's given name and family name in the solr json payload serialisation 2024-11-20 15:52:40 +01:00
Claudio Atzori 4e55ddc547 [PubMed aggregation] storing contents into mdStoreVersion/store 2024-11-19 16:50:42 +01:00
Claudio Atzori ef51a60f19 Merge pull request 'dedup_new_comparators' (#509) from dedup_new_comparators into beta
Reviewed-on: #509
2024-11-19 15:13:40 +01:00
Claudio Atzori ff5cb32067 Merge pull request 'abstracts in ODF records from the datacite and the dc nsPrefixes' (#508) from abtracts_guidelines4 into beta
Reviewed-on: #508
2024-11-19 15:12:53 +01:00
Claudio Atzori a48d080e08 Merge pull request 'Improve OAF Generation from Baseline PubMed Collection' (#504) from pubmed_fix into beta
Reviewed-on: #504
2024-11-19 15:12:37 +01:00
Claudio Atzori 5d34432398 align MergeUtils with beta branch 2024-11-19 15:12:04 +01:00
Michele De Bonis c97facf5e6 conflict resolution in the comparator test class 2024-11-18 14:59:30 +01:00
Claudio Atzori 9e439f5eca map the abstracts considering both the datacite and the dc nsPrefix 2024-11-15 12:19:26 +01:00
Claudio Atzori cf7d9a32ab disable autoBroadcastJoin in the cleaning workflow 2024-11-15 09:17:28 +01:00
Claudio Atzori 5f512f510e code formatting 2024-11-15 09:16:51 +01:00
Claudio Atzori b95672b420 mergeUtils set the result identifier when enforcing the result type 2024-11-15 09:16:18 +01:00
Claudio Atzori 9e8849b753 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2024-11-13 20:41:51 +01:00
sandro.labruzzo 4778a70478 Merge remote-tracking branch 'origin/beta' into pubmed_fix 2024-11-13 16:28:39 +01:00
Claudio Atzori 4a3b173ca2 defaults to 0000 - Unknown in case the instance type lookup in the dnet:result_typologies doesn't find a corresponding result type binding 2024-11-13 16:27:00 +01:00
sandro.labruzzo ac0a94d62d updated pubmed parser to add also ORCID id and affiliation string to authors 2024-11-13 16:26:59 +01:00
Giambattista Bloisi 5ee8881646 Merge pull request '[danishfunders] added link for danish funders versus the unidentified project for IRFD (501100004836) CF (501100002808) and NNF(501100009708)' (#502) from danishFunders_crossrefmap into beta
Reviewed-on: #502
2024-11-13 12:01:38 +01:00
Miriam Baglioni fb1f0f8850 [danishfunders] added the possibility to link also versus a specif award if present in the metadata 2024-11-13 12:00:33 +01:00
Giambattista Bloisi 5b4d821bf9 Merge pull request 'Crossref: generate canonical openaire id for results in affiliation relationship' (#507) from fix_crossref_affiliations into beta
Reviewed-on: #507
2024-11-13 11:01:37 +01:00
Giambattista Bloisi 03c262ccb9 Crossref: generate canonical openaire id for results in affiliation relationship 2024-11-13 10:56:17 +01:00
sandro.labruzzo a1d5ad5c26 code formatted 2024-11-13 09:51:13 +01:00
sandro.labruzzo b0478c380e merged conflicts on beta 2024-11-13 09:43:16 +01:00
Claudio Atzori 07f267bb10 fix vocabulary lookup in mergeutils 2024-11-13 08:14:26 +01:00
Claudio Atzori 8088943399 Merge pull request 'enforce resulttype' (#506) from merge_resulttypes into beta
Reviewed-on: #506
2024-11-12 14:20:22 +01:00
Claudio Atzori 6c5df761e2 enforce resulttype based on the dnet:result_typologies vocabulary and upon merge 2024-11-12 14:18:04 +01:00
Claudio Atzori 9f7a606ddd Merge pull request 'betaFixPerson' (#505) from betaFixPerson into beta
Reviewed-on: #505
2024-11-12 14:09:22 +01:00
Miriam Baglioni 250f101779 [person] fixed issue in creating project identifier for the graph for person->project relations 2024-11-11 16:04:06 +01:00
Miriam Baglioni f1ea9da5bc [person] checked type in inferenceprovenance 2024-11-11 15:37:56 +01:00
Miriam Baglioni b0283fe94c [person] fix provenance of pid in person when it is orcid (classid entityregistry to avoid the cleaning put orcid_pending) 2024-11-11 14:57:57 +01:00
sandro.labruzzo 474f365286 removed wrong test 2024-11-11 12:37:27 +01:00
sandro.labruzzo 19ce783e58 renamed workflow 2024-11-11 12:28:02 +01:00
Sandro La Bruzzo 0d0904f4ec updated workflow baseline to direct transform on OAF 2024-11-11 10:27:23 +01:00
Giambattista Bloisi f31f22801f Merge pull request 'Remove ORCID information when the same ORCID ID is used multiple times in the same result for different authors' (#503) from clean_clashing_orcids into beta
Reviewed-on: #503
2024-11-08 09:31:11 +01:00
Miriam Baglioni 6fd9ec8566 [danishfunders] added link for danish funders versus the unidentified project for IRFD (501100004836) CF (501100002808) and NNF(501100009708) 2024-11-07 13:55:31 +01:00
Giambattista Bloisi 8f5171557e Remove ORCID information when the same ORCID ID is used multiple times in the same result for different authors 2024-11-07 12:22:34 +01:00
Claudio Atzori f7bb53fe78 [orcid enrichment] added missing workflow parameter: workingDir 2024-11-07 01:04:43 +01:00
Claudio Atzori 973aa7dca6 [dedup] force the Relation schema when reading the merge rels 2024-11-06 12:29:06 +01:00
Sandro La Bruzzo c1cef5d685 removed old library joda time replaced with standard java.time introduced in java 8 2024-11-05 10:38:40 +01:00
Sandro La Bruzzo a8ed5a3b04 Organized getters and setters in the PMArticle class for better readability and maintainability. 2024-11-04 17:45:28 +01:00
Claudio Atzori a42c8b7c85 person table directory produced by the workflows raw_all and merge graphs 2024-10-30 11:25:17 +01:00
Claudio Atzori a877c76d70 make MergeUtils.selectOldestDate less prone to errors when receiving invalid date formats 2024-10-30 11:24:25 +01:00
Claudio Atzori 26cdc7e439 Avoid NPEs in MergeUtils 2024-10-30 07:35:47 +01:00
Claudio Atzori 323c76eafc patch relations job: removed non necessary logging 2024-10-30 07:35:30 +01:00
Miriam Baglioni 69aee609ef [bulktag] align type to community api 2024-10-29 15:53:04 +01:00
Claudio Atzori 5ca031c8d6 [graph raw] rule out empty PIDs 2024-10-29 13:48:41 +01:00
Claudio Atzori 499892b67c [graph raw] rule out empty PIDs 2024-10-29 09:51:30 +01:00
Claudio Atzori e4504fd98d [Person] fixed project identifier creation 2024-10-28 15:32:09 +01:00
Claudio Atzori 9b4415cb67 using _the right_ scala 2.11 converters 2024-10-28 13:56:25 +01:00
Claudio Atzori e6ca382deb using scala 2.11 converters 2024-10-28 13:52:06 +01:00
Claudio Atzori 940735921f Merge pull request 'Fill mergedIds field and filter mergerels with dedup records actually created' (#500) from mergedids into beta
Reviewed-on: #500
2024-10-28 13:43:09 +01:00
Giambattista Bloisi 56224e034a Fill the new mergedIds field when generating dedup records
Filter out dedup records composed of invisible records only
Filter out mergerels that have not been used when creating the dedup record (ungrouping of cliques)
2024-10-28 13:31:01 +01:00
Miriam Baglioni 5916346ba1 [TransformativeAgreement] fix to remove the file downloaded from a previous run of the workflow 2024-10-28 12:18:50 +01:00
Claudio Atzori e4abe55988 merged person_through_the_graph & code formatting 2024-10-28 11:01:49 +01:00
Claudio Atzori d71df6de19 Merge pull request 'affroNewModelonBeta' (#494) from affroNewModelonBeta into beta
Reviewed-on: #494
2024-10-28 10:48:34 +01:00
Claudio Atzori 1cdcd07a7e Merge pull request 'dhp-schema upgrade & provision mapping 2' (#499) from beta_provision_alignment_9.0.0 into beta
Reviewed-on: #499
2024-10-28 10:44:08 +01:00
Claudio Atzori 6fd50266f1 translate 'otherresearchproduct' into 'other' when setting the related record type 2024-10-28 10:42:46 +01:00
Claudio Atzori dffa376eb6 Merge pull request 'dhp-schema upgrade & provision mapping' (#498) from beta_provision_alignment_9.0.0 into beta
Reviewed-on: #498
2024-10-28 10:03:24 +01:00
Claudio Atzori 32fa579b80 [graph provision] select the longest abstract 2024-10-28 10:03:02 +01:00
Claudio Atzori 67e37f41fb Merge pull request 'blacklist filtering moved before the cleanup phase in order to have case sensitive regex' (#485) from dedup_blacklist_fix into beta
Reviewed-on: #485
2024-10-28 09:42:51 +01:00
Miriam Baglioni 0fb6af5586 Updated main pom dependency against dhp-schema, from 8.0.1 to 9.0.0. The new fields included in the updated schema module are populated by the Solr JSON payload mapping, which also limits the number of authors serialised to 200. 2024-10-25 16:28:50 +02:00
Claudio Atzori dcba5ad32a Merge pull request 'person_through_the_graph_newDevelopments' (#497) from person_through_the_graph_newDevelopments into person_through_the_graph
Reviewed-on: #497
2024-10-25 10:20:40 +02:00
Claudio Atzori 46dbb62598 Merge pull request '#9839: include claimed affiliation relationships' (#476) from claim-orgs into beta
Reviewed-on: #476
2024-10-25 10:12:59 +02:00
Claudio Atzori d3764265d5 Merge pull request '[dedup] avoid NPEs in the countryInference dedup utility' (#475) from dedup_countryInference_NPE into beta
Reviewed-on: #475
2024-10-25 10:12:06 +02:00
Claudio Atzori 4a9aeb6238 Merge pull request '9126-impact-indicators-wf-optimisation' (#471) from 9126-impact-indicators-wf-optimisation into beta
Reviewed-on: #471
2024-10-25 10:10:44 +02:00
Claudio Atzori 8172bee8c8 Merge pull request 'Minor fixes' (#496) from beta_fixes_oct into beta
Reviewed-on: #496
2024-10-25 10:09:56 +02:00
Miriam Baglioni 1fce7d5a0f [Person] remove the isolated nodes from the person set 2024-10-25 10:05:17 +02:00
Miriam Baglioni 842cc75dae [AffRo] fix name 2024-10-25 09:42:52 +02:00
Miriam Baglioni e75326d6ec [FundersMatchFromCrossref] added match from CrossRef to DFG unidentified project 2024-10-25 09:13:54 +02:00
Miriam Baglioni 32f444984e [person] - 2024-10-24 17:51:42 +02:00
Miriam Baglioni cab8f1135f [affroNewModel] - 2024-10-24 17:44:33 +02:00
Miriam Baglioni c93bf82487 [affroNewModel] extended wf definition 2024-10-24 17:34:34 +02:00
Miriam Baglioni a7699558ed [person] - 2024-10-24 16:15:12 +02:00
Miriam Baglioni 01679c935a [person] added test class to be implemented 2024-10-24 15:27:06 +02:00
Miriam Baglioni c773421cc7 [person] added new substep in propagation worflow main 2024-10-24 14:44:13 +02:00
Miriam Baglioni cf07ed9058 [person] refactoring 2024-10-24 14:35:14 +02:00
Miriam Baglioni c921cf7ee0 [personEntity] removed the deletedbyinference results (not indexed, but still in the graph). Changed the writing mode: append instead of overwrite 2024-10-24 09:57:20 +02:00
Giambattista Bloisi 6bc741715c Fix OafMapperUtilsTest.testMergePubs 2024-10-23 14:02:45 +02:00
Giambattista Bloisi aa7b8fd014 Use workingDir parameter for temporary data of ORCID enrichment 2024-10-23 14:02:17 +02:00
Giambattista Bloisi 0e34b0ece1 Fix imports: point them from the main distribution packages 2024-10-23 14:01:52 +02:00
Miriam Baglioni aac5eb3499 [personEntity] changed the data info for the relations with projects. added missing parameters to the job.properties file 2024-10-22 11:54:16 +02:00
Miriam Baglioni 821540f94a [personEntity] updated the property file to include also the db parameters. The same for the wf definition. Refactoring for compilation 2024-10-22 10:13:30 +02:00
Miriam Baglioni 09a2c93fc7 [personEntity] added relations with projects extracting the info from the database 2024-10-21 16:21:15 +02:00
Miriam Baglioni ce4ee1189f [personEntity] create entity for each profile in orcid even without works. Added validated true to each relation coming from orcid data 2024-10-21 14:38:15 +02:00
Miriam Baglioni 2b27afaec8 [createASfromAffRo] refactoring after compilation 2024-10-18 16:22:51 +02:00
Miriam Baglioni 0e5dd14538 [createASfromAffRo] adding the provenance datasource used to get the relation (no datasource can be webcrawl = publisher, rawaff means oalex) 2024-10-18 16:22:21 +02:00
Michele De Bonis 6c17993d16 Merge branch 'beta' into dedup_new_comparators 2024-10-14 15:24:38 +02:00
Michele De Bonis eab623ddfa implementation of date matcher 2024-10-14 10:24:19 +02:00
Michele De Bonis 5015ba10eb addition of date comparator 2024-10-14 10:23:42 +02:00
Giambattista Bloisi 56b05cde0b Revert the changes for IgnoreUndefined management in tree evaluation 2024-10-11 10:35:15 +02:00
Michele De Bonis 62c4c3ed29 implementation of new comparators for organization and dataset disambiguation 2024-10-09 12:26:03 +02:00
Claudio Atzori 62ff843334 adopting dhp-schemas:8.0.1 to support Auhtor's rawAffiliationString(s). Improved graph2hive implementation 2024-10-08 16:22:54 +02:00
Claudio Atzori d5867a1992 merged #490 2024-10-08 15:39:59 +02:00
Claudio Atzori e5df68772d [graph provision] fixed serialisation of the usage counts as measures in the XML records 2024-10-02 09:35:21 +02:00
Miriam Baglioni 7e6d12fa77 [UsageCount] fixed error
(cherry picked from commit 9c9a9562ae)
2024-10-01 15:55:07 +02:00
Miriam Baglioni 191fc3a461 [UsageCount] add check in case the datasource is not matched against those present in the graph
(cherry picked from commit b42bdd5fb3)
2024-10-01 15:54:31 +02:00
Claudio Atzori 10696f2a44 reverted procedure for creating the UsageCounts actionset 2024-10-01 15:54:13 +02:00
Claudio Atzori 5734b80861 Merge pull request 'datasource table creation split in steps' (#489) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: #489
2024-09-30 16:34:38 +02:00
Antonis Lempesis f3c179658a datasource table creation split in steps 2024-09-30 17:12:21 +03:00
Miriam Baglioni b18ad035c1 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2024-09-30 15:10:44 +02:00
Miriam Baglioni e430826e00 [ImportOC] fix to move original folder instead of extracted ones 2024-09-30 15:10:10 +02:00
Giambattista Bloisi c45cae447a Fix: invert the "natural" order when ordering by id lexicographically 2024-09-26 17:08:02 +02:00
Claudio Atzori 3fcafc7ed6 Merge pull request 'Latest institutions in monitor dbs' (#472) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: #472
2024-09-26 09:49:01 +02:00
Miriam Baglioni 599e56dbc6 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2024-09-25 17:28:23 +02:00
Claudio Atzori 6397141e56 code formatting 2024-09-25 15:27:32 +02:00
Claudio Atzori e354f9853a [OpenCitations] move the extracted contents under a backup path to avoid needing to re-download it in case of errors 2024-09-25 15:27:02 +02:00
Claudio Atzori 535a7b99f1 the metadata collection plugins using the HttpConnector2 class shall now retry instead of failing in case of UnknownHostException 2024-09-25 11:35:34 +02:00
Sandro La Bruzzo 6a097abc89 as described on ticket #9525
1. Changed the mapping applied to Crossref records: anything that has a relationship "is-review-of" must be mapped as publication of type "Review".
2. Force the hostedby of Crossref records with DOI prefix 10.3410 and 10.12703 to the H1 Connect data source.
2024-09-25 11:32:54 +02:00
Michele Artini 9754521847 Merge pull request 'fixed a bug with id' (#486) from osfPreprints_plugin into beta
Reviewed-on: #486
2024-09-25 10:02:24 +02:00
Michele Artini 54f8b4da39 Merge pull request 'fixed a bug with 'null' string' (#484) from osfPreprints_plugin into beta
Reviewed-on: #484
2024-09-24 15:19:54 +02:00
Claudio Atzori 4f0463d779 [graph provision] person serialisation, limit the number of authorships and coauthorships before expanding the payloads 2024-09-24 14:54:34 +02:00
Miriam Baglioni 4d3e079590 Merge remote-tracking branch 'origin/beta' into beta 2024-09-24 14:26:29 +02:00
Claudio Atzori d1cadc77c9 [graph provision] person serialisation, limit the number of authorships and coauthorships before expanding the payloads 2024-09-24 10:57:20 +02:00
Michele Artini 0e89d4a1cf fixed a bug with topic ENRICH/MORE/SUBJECT/ARXIV 2024-09-24 08:57:49 +02:00
Michele Artini e941adbe2b fixed a bug with topic ENRICH/MORE/SUBJECT/ARXIV 2024-09-24 08:57:37 +02:00
Michele Artini 7f81673f3c removed the deletedByInference=true filter 2024-09-23 15:27:43 +02:00
Michele Artini fdbe629f49 removed the deletedByInference=true filter 2024-09-23 15:27:28 +02:00
Antonis Lempesis 619aa34a15 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into beta 2024-09-23 15:25:59 +03:00
Antonis Lempesis dbea7a4072 removed duplicate line 2024-09-23 14:57:11 +03:00
Antonis Lempesis c9241dba0d Merge pull request 'convert_hive_to_spark_actions' (#1) from convert_hive_to_spark_actions into beta
Reviewed-on: antonis.lempesis/dnet-hadoop#1
2024-09-23 13:53:28 +02:00
Claudio Atzori e0ff84baf0 [graph provision] person serialisation, limit the number of authorships and coauthorships before expanding the payloads 2024-09-23 10:29:46 +02:00
Michele Artini 755a5aefcf Merge pull request 'osfPreprints_plugin' (#482) from osfPreprints_plugin into beta
Reviewed-on: #482
2024-09-23 10:21:34 +02:00
Claudio Atzori 5f86c93be6 [graph provision] person serialisation 2024-09-20 12:20:00 +02:00
Michele Artini db6f137cf9 Merge pull request 'osfPreprints_plugin' (#480) from osfPreprints_plugin into beta
Reviewed-on: #480
2024-09-20 09:56:50 +02:00
Claudio Atzori 23e0ab3a7c run mergeResultsOfDifferentTypes only when checkDelegatedAuthority is true 2024-09-17 15:36:10 +02:00
Michele De Bonis 6df6b4583e blacklist filtering moved before the cleanup phase in order to have case sensitive regex 2024-09-16 14:04:59 +02:00
Alessia 07e6e7b4d6 #9839: include claimed affiliation relationships 2024-09-16 13:41:56 +02:00
Antonis Lempesis 37ad259296 cleanup 2024-09-05 16:02:44 +03:00
Antonis Lempesis b64c144abf added new institutions 2024-09-05 16:00:09 +03:00
Serafeim Chatzopoulos b043f8a963 Remove redundant error messages from impact indicators workflow 2024-09-04 14:28:43 +03:00
Serafeim Chatzopoulos db03f85366 Remove steps for updating BIP! from the impact indicators workflow 2024-09-04 14:25:44 +03:00
Miriam Baglioni 468f2aa5a5 [AffiliationAffRo]align beta with new affiliation from publisher webpage introduced in production. AffRo collectedfrom OpenAIRE to discriminate against WebCrawl 2024-08-12 18:10:46 +02:00
Miriam Baglioni 89fcf4086c [Person]fix issue in affiliation relation id construction for person (missing ::) 2024-08-12 18:04:43 +02:00
Miriam Baglioni 45605f93ae merging with branch beta 2024-08-12 18:03:10 +02:00
Miriam Baglioni 5a7ba77271 [Person]fix issue in affiliation relation id construction for person (missing ::) 2024-08-12 18:01:15 +02:00
Miriam Baglioni 8c185a7b1a resolving conflicts 2024-08-05 17:14:11 +02:00
Claudio Atzori e16616b964 added dataInfo to person records 2024-08-05 15:57:37 +02:00
Miriam Baglioni 985ca15264 [openaire-affiliation]removes matchings without DOI 2024-08-05 12:10:40 +02:00
Claudio Atzori 0bf76f2a34 [graph provision] added person to the graph2hive workflow 2024-08-05 09:35:07 +02:00
Claudio Atzori 975d44cac7 [graph provision] added person to the provision workflow 2024-08-02 16:14:10 +02:00
Claudio Atzori 6bdb8643e6 ActionManager promote: allow to ingest person records in a graph that did not contain them, bumped dhp-schemas version 2024-07-31 11:02:22 +02:00
Claudio Atzori 9486e21a44 copy or process the person records throughout the graph pipeline 2024-07-30 14:25:31 +02:00
Claudio Atzori 75a11d0ba5 [dedup] avoid NPEs in the countryInference dedup utility 2024-07-25 16:34:32 +02:00
Antonis Lempesis d0590e0e49 added latest institutions 2024-07-23 15:17:15 +03:00
Antonis Lempesis 7d2c0a3723 added new institutions 2024-07-23 15:10:17 +03:00
Lampros Smyrnaios e9686365a2 Improve performance of creating the "result_fos" table, by using a temp-table to cache data, which is requested multiple times. 2024-07-03 20:24:36 +03:00
Lampros Smyrnaios ce0aee21cc Improve performance of transferring the stats-DBs to another cluster and querying the DBs' tables, by ordering Spark to create up to 100 files per table, instead of thousands. 2024-07-03 20:15:33 +03:00
Lampros Smyrnaios 7b7dd32ad5 - Fix placement of some "set mapred.job.queue.name=analytics" statements and remove their unused "/*EOS*/" indicator.
- Add stacktrace-info to failed actions.
2024-07-03 19:53:24 +03:00
Lampros Smyrnaios 7ce051d766 - Update the remaining hive-actions to spark-actions.
- Update the version of shell-actions.
- Fix missing "/*EOS*/" indicators.
2024-07-03 19:49:19 +03:00
Lampros Smyrnaios aa4d7d5e20 Prioritize the rest of the stats-queries over other tasks on the cluster, by putting them in the "analytics" queue. 2024-07-03 19:14:25 +03:00
Lampros Smyrnaios 54e11b6a43 Improve performance and efficiency by rewriting the creation process of "publication", "project", "dataset", "datasource", "software", "otherresearchproduct" and "result" tables, to be performed in a single query, for each one. 2024-07-03 13:03:15 +03:00
Lampros Smyrnaios fe2275a9b0 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions
# Conflicts:
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step14.sql
2024-06-25 20:17:47 +03:00
Lampros Smyrnaios a644a6f4fe Catch Spark-sql errors and show a log with the statement that failed. 2024-05-29 12:10:11 +03:00
Lampros Smyrnaios 888637773c Add missing "/*EOS*/" comments. 2024-05-27 12:34:49 +03:00
Lampros Smyrnaios e0ac494859 Merge branch 'beta' into convert_hive_to_spark_actions
# Conflicts:
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step15.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step15_5.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step16_1-definitions.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step16_5.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step2.sql
#	dhp-workflows/dhp-stats-update/src/main/resources/eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step7.sql
2024-05-27 12:27:40 +03:00
Lampros Smyrnaios 3c17183d10 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions 2024-04-23 17:18:16 +03:00
Lampros Smyrnaios 69a9ac7393 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions 2024-04-22 17:07:11 +03:00
Lampros Smyrnaios 342223f75c Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions 2024-04-19 13:18:34 +03:00
Lampros Smyrnaios 2616971e2b dhp-stats-update: remove leftover duplicate line 2024-04-18 16:18:16 +03:00
Lampros Smyrnaios ba533d9f34 Merge branch 'beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into convert_hive_to_spark_actions 2024-04-18 15:47:56 +03:00
Lampros Smyrnaios d46b78b659 dhp-stats-update:
- Set Steps 2-7 and 9 to limit the amount of files generated by Spark, from 8000, down to 100, to improve file-transfer and querying performance.
- Allow the workflow to run up to Step10. The Step11 seems to have some issues even when using hive-action.
2024-04-18 15:40:27 +03:00
Lampros Smyrnaios 6f2ebb2a52 Revert Step8 and Step11 to use Hive again, since their "UPDATE" statements are not supported by Spark. 2024-04-18 15:35:03 +03:00
Lampros Smyrnaios ca091c0f1e dhp-stats-update:
- Fix not passing some parameters to some Spark actions.
- Allow the workflow to run up to Step7. The first 7 steps seem to work out of the box.
2024-04-17 14:03:59 +03:00
Lampros Smyrnaios 0b897f2f66 Fix and add missing "DROP TABLE" statements, in "dhp-stats-update" sql-scripts. 2024-04-16 18:17:54 +03:00
Lampros Smyrnaios db33f7727c Update "dhp-stats-update" workflow to use "spark"-actions, instead of "hive" ones.
Note: Currently the code is set to only test the "Step1".
2024-04-15 16:22:40 +03:00
200 changed files with 5308 additions and 2608 deletions

1
.gitignore vendored
View File

@ -28,3 +28,4 @@ spark-warehouse
/**/.scalafmt.conf /**/.scalafmt.conf
/.java-version /.java-version
/dhp-shade-package/dependency-reduced-pom.xml /dhp-shade-package/dependency-reduced-pom.xml
/**/job.properties

View File

@ -212,11 +212,11 @@ public class HttpConnector2 {
.format( .format(
"Unexpected status code: %s errors: %s", urlConn.getResponseCode(), "Unexpected status code: %s errors: %s", urlConn.getResponseCode(),
MAPPER.writeValueAsString(report))); MAPPER.writeValueAsString(report)));
} catch (MalformedURLException | UnknownHostException e) { } catch (MalformedURLException e) {
log.error(e.getMessage(), e); log.error(e.getMessage(), e);
report.put(e.getClass().getName(), e.getMessage()); report.put(e.getClass().getName(), e.getMessage());
throw new CollectorException(e.getMessage(), e); throw new CollectorException(e.getMessage(), e);
} catch (SocketTimeoutException | SocketException e) { } catch (SocketTimeoutException | SocketException | UnknownHostException e) {
log.error(e.getMessage(), e); log.error(e.getMessage(), e);
report.put(e.getClass().getName(), e.getMessage()); report.put(e.getClass().getName(), e.getMessage());
backoffAndSleep(getClientParams().getRetryDelay() * retryNumber * 1000); backoffAndSleep(getClientParams().getRetryDelay() * retryNumber * 1000);

View File

@ -1,5 +1,5 @@
package eu.dnetlib.dhp.actionmanager.personentity; package eu.dnetlib.dhp.common.person;
import java.util.Arrays; import java.util.Arrays;
import java.util.Iterator; import java.util.Iterator;
@ -61,7 +61,7 @@ public class CoAuthorshipIterator implements Iterator<Relation> {
private Relation getRelation(String orcid1, String orcid2) { private Relation getRelation(String orcid1, String orcid2) {
String source = PERSON_PREFIX + IdentifierFactory.md5(orcid1); String source = PERSON_PREFIX + IdentifierFactory.md5(orcid1);
String target = PERSON_PREFIX + IdentifierFactory.md5(orcid2); String target = PERSON_PREFIX + IdentifierFactory.md5(orcid2);
return OafMapperUtils Relation relation = OafMapperUtils
.getRelation( .getRelation(
source, target, ModelConstants.PERSON_PERSON_RELTYPE, source, target, ModelConstants.PERSON_PERSON_RELTYPE,
ModelConstants.PERSON_PERSON_SUBRELTYPE, ModelConstants.PERSON_PERSON_SUBRELTYPE,
@ -76,5 +76,7 @@ public class CoAuthorshipIterator implements Iterator<Relation> {
ModelConstants.DNET_PROVENANCE_ACTIONS, ModelConstants.DNET_PROVENANCE_ACTIONS), ModelConstants.DNET_PROVENANCE_ACTIONS, ModelConstants.DNET_PROVENANCE_ACTIONS),
"0.91"), "0.91"),
null); null);
relation.setValidated(true);
return relation;
} }
} }

View File

@ -1,12 +1,9 @@
package eu.dnetlib.dhp.actionmanager.personentity; package eu.dnetlib.dhp.common.person;
import java.io.Serializable; import java.io.Serializable;
import java.util.ArrayList;
import java.util.List; import java.util.List;
import eu.dnetlib.dhp.schema.oaf.Relation;
public class Coauthors implements Serializable { public class Coauthors implements Serializable {
private List<String> coauthors; private List<String> coauthors;

View File

@ -2,8 +2,7 @@
package eu.dnetlib.dhp.oa.merge; package eu.dnetlib.dhp.oa.merge;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession; import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import static org.apache.spark.sql.functions.col; import static org.apache.spark.sql.functions.*;
import static org.apache.spark.sql.functions.when;
import java.util.Map; import java.util.Map;
import java.util.Optional; import java.util.Optional;
@ -135,7 +134,9 @@ public class GroupEntitiesSparkJob {
.applyCoarVocabularies(entity, vocs), .applyCoarVocabularies(entity, vocs),
OAFENTITY_KRYO_ENC) OAFENTITY_KRYO_ENC)
.groupByKey((MapFunction<OafEntity, String>) OafEntity::getId, Encoders.STRING()) .groupByKey((MapFunction<OafEntity, String>) OafEntity::getId, Encoders.STRING())
.mapGroups((MapGroupsFunction<String, OafEntity, OafEntity>) MergeUtils::mergeById, OAFENTITY_KRYO_ENC) .mapGroups(
(MapGroupsFunction<String, OafEntity, OafEntity>) (key, group) -> MergeUtils.mergeById(group, vocs),
OAFENTITY_KRYO_ENC)
.map( .map(
(MapFunction<OafEntity, Tuple2<String, OafEntity>>) t -> new Tuple2<>( (MapFunction<OafEntity, Tuple2<String, OafEntity>>) t -> new Tuple2<>(
t.getClass().getName(), t), t.getClass().getName(), t),

View File

@ -65,7 +65,13 @@ public class RunSQLSparkJob {
for (String statement : sql.split(";\\s*/\\*\\s*EOS\\s*\\*/\\s*")) { for (String statement : sql.split(";\\s*/\\*\\s*EOS\\s*\\*/\\s*")) {
log.info("executing: {}", statement); log.info("executing: {}", statement);
long startTime = System.currentTimeMillis(); long startTime = System.currentTimeMillis();
try {
spark.sql(statement).show(); spark.sql(statement).show();
} catch (Exception e) {
log.error("Error executing statement: {}", statement, e);
System.err.println("Error executing statement: " + statement + "\n" + e);
throw e;
}
log log
.info( .info(
"executed in {}", "executed in {}",

View File

@ -0,0 +1,70 @@
/*
* Copyright (c) 2024.
* SPDX-FileCopyrightText: © 2023 Consiglio Nazionale delle Ricerche
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
package eu.dnetlib.dhp.schema.oaf;
import org.apache.commons.lang3.builder.EqualsBuilder;
import org.apache.commons.lang3.builder.HashCodeBuilder;
public class HashableStructuredProperty extends StructuredProperty {
private static final long serialVersionUID = 8371670185221126045L;
public static HashableStructuredProperty newInstance(String value, Qualifier qualifier, DataInfo dataInfo) {
if (value == null) {
return null;
}
final HashableStructuredProperty sp = new HashableStructuredProperty();
sp.setValue(value);
sp.setQualifier(qualifier);
sp.setDataInfo(dataInfo);
return sp;
}
public static HashableStructuredProperty newInstance(StructuredProperty sp) {
HashableStructuredProperty hsp = new HashableStructuredProperty();
hsp.setQualifier(sp.getQualifier());
hsp.setValue(sp.getValue());
hsp.setQualifier(sp.getQualifier());
return hsp;
}
public static StructuredProperty toStructuredProperty(HashableStructuredProperty hsp) {
StructuredProperty sp = new StructuredProperty();
sp.setQualifier(hsp.getQualifier());
sp.setValue(hsp.getValue());
sp.setQualifier(hsp.getQualifier());
return sp;
}
@Override
public int hashCode() {
return new HashCodeBuilder(11, 91)
.append(getQualifier().getClassid())
.append(getQualifier().getSchemeid())
.append(getValue())
.hashCode();
}
@Override
public boolean equals(Object obj) {
if (obj == null) {
return false;
}
if (obj == this) {
return true;
}
if (obj.getClass() != getClass()) {
return false;
}
final HashableStructuredProperty rhs = (HashableStructuredProperty) obj;
return new EqualsBuilder()
.append(getQualifier().getClassid(), rhs.getQualifier().getClassid())
.append(getQualifier().getSchemeid(), rhs.getQualifier().getSchemeid())
.append(getValue(), rhs.getValue())
.isEquals();
}
}

View File

@ -43,34 +43,4 @@ public class CleaningFunctions {
return !PidBlacklistProvider.getBlacklist(s.getQualifier().getClassid()).contains(pidValue); return !PidBlacklistProvider.getBlacklist(s.getQualifier().getClassid()).contains(pidValue);
} }
/**
* Utility method that normalises PID values on a per-type basis.
* @param pid the PID whose value will be normalised.
* @return the PID containing the normalised value.
*/
public static StructuredProperty normalizePidValue(StructuredProperty pid) {
pid
.setValue(
normalizePidValue(
pid.getQualifier().getClassid(),
pid.getValue()));
return pid;
}
public static String normalizePidValue(String pidType, String pidValue) {
String value = Optional
.ofNullable(pidValue)
.map(String::trim)
.orElseThrow(() -> new IllegalArgumentException("PID value cannot be empty"));
switch (pidType) {
// TODO add cleaning for more PID types as needed
case "doi":
return value.toLowerCase().replaceFirst(DOI_PREFIX_REGEX, DOI_PREFIX);
}
return value;
}
} }

View File

@ -6,18 +6,11 @@ import org.apache.commons.lang3.StringUtils;
public class DoiCleaningRule { public class DoiCleaningRule {
public static String clean(final String doi) { public static String clean(final String doi) {
return doi if (doi == null)
.toLowerCase()
.replaceAll("\\s", "")
.replaceAll("^doi:", "")
.replaceFirst(CleaningFunctions.DOI_PREFIX_REGEX, CleaningFunctions.DOI_PREFIX);
}
public static String normalizeDoi(final String input) {
if (input == null)
return null; return null;
final String replaced = input final String replaced = doi
.replaceAll("\\n|\\r|\\t|\\s", "") .replaceAll("\\n|\\r|\\t|\\s", "")
.replaceAll("^doi:", "")
.toLowerCase() .toLowerCase()
.replaceFirst(CleaningFunctions.DOI_PREFIX_REGEX, CleaningFunctions.DOI_PREFIX); .replaceFirst(CleaningFunctions.DOI_PREFIX_REGEX, CleaningFunctions.DOI_PREFIX);
if (StringUtils.isEmpty(replaced)) if (StringUtils.isEmpty(replaced))
@ -32,7 +25,6 @@ public class DoiCleaningRule {
return null; return null;
return ret; return ret;
} }
} }

View File

@ -2,7 +2,6 @@
package eu.dnetlib.dhp.schema.oaf.utils; package eu.dnetlib.dhp.schema.oaf.utils;
import static eu.dnetlib.dhp.schema.common.ModelConstants.*; import static eu.dnetlib.dhp.schema.common.ModelConstants.*;
import static eu.dnetlib.dhp.schema.common.ModelConstants.OPENAIRE_META_RESOURCE_TYPE;
import static eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils.getProvenance; import static eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils.getProvenance;
import java.net.MalformedURLException; import java.net.MalformedURLException;
@ -363,6 +362,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
// nothing to clean here // nothing to clean here
} else if (value instanceof Project) { } else if (value instanceof Project) {
// nothing to clean here // nothing to clean here
} else if (value instanceof Person) {
// nothing to clean here
} else if (value instanceof Organization) { } else if (value instanceof Organization) {
Organization o = (Organization) value; Organization o = (Organization) value;
if (Objects.isNull(o.getCountry()) || StringUtils.isBlank(o.getCountry().getClassid())) { if (Objects.isNull(o.getCountry()) || StringUtils.isBlank(o.getCountry().getClassid())) {
@ -563,12 +564,24 @@ public class GraphCleaningFunctions extends CleaningFunctions {
Optional Optional
.ofNullable(i.getPid()) .ofNullable(i.getPid())
.ifPresent(pid -> { .ifPresent(pid -> {
final Set<StructuredProperty> pids = Sets.newHashSet(pid); final Set<HashableStructuredProperty> pids = pid
.stream()
.map(HashableStructuredProperty::newInstance)
.collect(Collectors.toCollection(HashSet::new));
Optional Optional
.ofNullable(i.getAlternateIdentifier()) .ofNullable(i.getAlternateIdentifier())
.ifPresent(altId -> { .ifPresent(altId -> {
final Set<StructuredProperty> altIds = Sets.newHashSet(altId); final Set<HashableStructuredProperty> altIds = altId
i.setAlternateIdentifier(Lists.newArrayList(Sets.difference(altIds, pids))); .stream()
.map(HashableStructuredProperty::newInstance)
.collect(Collectors.toCollection(HashSet::new));
i
.setAlternateIdentifier(
Sets
.difference(altIds, pids)
.stream()
.map(HashableStructuredProperty::toStructuredProperty)
.collect(Collectors.toList()));
}); });
}); });
@ -682,6 +695,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
} }
} }
// set ORCID_PENDING to all orcid values that are not coming from ORCID provenance
for (Author a : r.getAuthor()) { for (Author a : r.getAuthor()) {
if (Objects.isNull(a.getPid())) { if (Objects.isNull(a.getPid())) {
a.setPid(Lists.newArrayList()); a.setPid(Lists.newArrayList());
@ -738,6 +752,40 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.collect(Collectors.toList())); .collect(Collectors.toList()));
} }
} }
// Identify clashing ORCIDS:that is same ORCID associated to multiple authors in this result
Map<String, Integer> clashing_orcid = new HashMap<>();
for (Author a : r.getAuthor()) {
a
.getPid()
.stream()
.filter(
p -> StringUtils
.contains(StringUtils.lowerCase(p.getQualifier().getClassid()), ORCID_PENDING))
.map(StructuredProperty::getValue)
.distinct()
.forEach(orcid -> clashing_orcid.compute(orcid, (k, v) -> (v == null) ? 1 : v + 1));
}
Set<String> clashing = clashing_orcid
.entrySet()
.stream()
.filter(ee -> ee.getValue() > 1)
.map(Map.Entry::getKey)
.collect(Collectors.toSet());
// filter out clashing orcids
for (Author a : r.getAuthor()) {
a
.setPid(
a
.getPid()
.stream()
.filter(p -> !clashing.contains(p.getValue()))
.collect(Collectors.toList()));
}
} }
if (value instanceof Publication) { if (value instanceof Publication) {
@ -796,7 +844,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
return author; return author;
} }
private static Optional<String> cleanDateField(Field<String> dateofacceptance) { public static Optional<String> cleanDateField(Field<String> dateofacceptance) {
return Optional return Optional
.ofNullable(dateofacceptance) .ofNullable(dateofacceptance)
.map(Field::getValue) .map(Field::getValue)

View File

@ -175,7 +175,7 @@ public class IdentifierFactory implements Serializable {
return entity return entity
.getPid() .getPid()
.stream() .stream()
.map(CleaningFunctions::normalizePidValue) .map(PidCleaner::normalizePidValue)
.filter(CleaningFunctions::pidFilter) .filter(CleaningFunctions::pidFilter)
.collect( .collect(
Collectors Collectors
@ -204,10 +204,11 @@ public class IdentifierFactory implements Serializable {
.map( .map(
pp -> pp pp -> pp
.stream() .stream()
.filter(p -> StringUtils.isNotBlank(p.getValue()))
// filter away PIDs provided by a DS that is not considered an authority for the // filter away PIDs provided by a DS that is not considered an authority for the
// given PID Type // given PID Type
.filter(p -> shouldFilterPidByCriteria(collectedFrom, p, mapHandles)) .filter(p -> shouldFilterPidByCriteria(collectedFrom, p, mapHandles))
.map(CleaningFunctions::normalizePidValue) .map(PidCleaner::normalizePidValue)
.filter(p -> isNotFromDelegatedAuthority(collectedFrom, p)) .filter(p -> isNotFromDelegatedAuthority(collectedFrom, p))
.filter(CleaningFunctions::pidFilter)) .filter(CleaningFunctions::pidFilter))
.orElse(Stream.empty()); .orElse(Stream.empty());

View File

@ -96,7 +96,7 @@ public class MergeEntitiesComparator implements Comparator<Oaf> {
// id // id
if (res == 0) { if (res == 0) {
if (left instanceof OafEntity && right instanceof OafEntity) { if (left instanceof OafEntity && right instanceof OafEntity) {
res = ((OafEntity) left).getId().compareTo(((OafEntity) right).getId()); res = ((OafEntity) right).getId().compareTo(((OafEntity) left).getId());
} }
} }

View File

@ -23,24 +23,30 @@ import org.apache.commons.lang3.tuple.Pair;
import com.github.sisyphsu.dateparser.DateParserUtils; import com.github.sisyphsu.dateparser.DateParserUtils;
import com.google.common.base.Joiner; import com.google.common.base.Joiner;
import eu.dnetlib.dhp.common.vocabulary.VocabularyGroup;
import eu.dnetlib.dhp.oa.merge.AuthorMerger; import eu.dnetlib.dhp.oa.merge.AuthorMerger;
import eu.dnetlib.dhp.schema.common.AccessRightComparator; import eu.dnetlib.dhp.schema.common.AccessRightComparator;
import eu.dnetlib.dhp.schema.common.EntityType;
import eu.dnetlib.dhp.schema.common.ModelConstants; import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.common.ModelSupport; import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.*; import eu.dnetlib.dhp.schema.oaf.*;
public class MergeUtils { public class MergeUtils {
public static <T extends Oaf> T mergeById(String s, Iterator<T> oafEntityIterator) { public static <T extends Oaf> T mergeById(Iterator<T> oafEntityIterator, VocabularyGroup vocs) {
return mergeGroup(s, oafEntityIterator, true); return mergeGroup(oafEntityIterator, true, vocs);
} }
public static <T extends Oaf> T mergeGroup(String s, Iterator<T> oafEntityIterator) { public static <T extends Oaf> T mergeGroup(Iterator<T> oafEntityIterator) {
return mergeGroup(s, oafEntityIterator, false); return mergeGroup(oafEntityIterator, false);
} }
public static <T extends Oaf> T mergeGroup(String s, Iterator<T> oafEntityIterator, public static <T extends Oaf> T mergeGroup(Iterator<T> oafEntityIterator, boolean checkDelegateAuthority) {
boolean checkDelegateAuthority) { return mergeGroup(oafEntityIterator, checkDelegateAuthority, null);
}
public static <T extends Oaf> T mergeGroup(Iterator<T> oafEntityIterator,
boolean checkDelegateAuthority, VocabularyGroup vocs) {
ArrayList<T> sortedEntities = new ArrayList<>(); ArrayList<T> sortedEntities = new ArrayList<>();
oafEntityIterator.forEachRemaining(sortedEntities::add); oafEntityIterator.forEachRemaining(sortedEntities::add);
@ -49,13 +55,55 @@ public class MergeUtils {
Iterator<T> it = sortedEntities.iterator(); Iterator<T> it = sortedEntities.iterator();
T merged = it.next(); T merged = it.next();
if (!it.hasNext() && merged instanceof Result && vocs != null) {
return enforceResultType(vocs, (Result) merged);
} else {
while (it.hasNext()) { while (it.hasNext()) {
merged = checkedMerge(merged, it.next(), checkDelegateAuthority); merged = checkedMerge(merged, it.next(), checkDelegateAuthority);
} }
}
return merged; return merged;
} }
private static <T extends Oaf> T enforceResultType(VocabularyGroup vocs, Result mergedResult) {
if (Optional.ofNullable(mergedResult.getInstance()).map(List::isEmpty).orElse(true)) {
return (T) mergedResult;
} else {
final Instance i = mergedResult.getInstance().get(0);
if (!vocs.vocabularyExists(ModelConstants.DNET_RESULT_TYPOLOGIES)) {
return (T) mergedResult;
} else {
final String expectedResultType = Optional
.ofNullable(
vocs
.lookupTermBySynonym(
ModelConstants.DNET_RESULT_TYPOLOGIES, i.getInstancetype().getClassid()))
.orElse(ModelConstants.ORP_DEFAULT_RESULTTYPE)
.getClassid();
// there is a clash among the result types
if (!expectedResultType.equals(mergedResult.getResulttype().getClassid())) {
Result result = (Result) Optional
.ofNullable(ModelSupport.oafTypes.get(expectedResultType))
.map(r -> {
try {
return r.newInstance();
} catch (InstantiationException | IllegalAccessException e) {
throw new IllegalStateException(e);
}
})
.orElse(new OtherResearchProduct());
result.setId(mergedResult.getId());
return (T) mergeResultFields(result, mergedResult);
} else {
return (T) mergedResult;
}
}
}
}
public static <T extends Oaf> T checkedMerge(final T left, final T right, boolean checkDelegateAuthority) { public static <T extends Oaf> T checkedMerge(final T left, final T right, boolean checkDelegateAuthority) {
return (T) merge(left, right, checkDelegateAuthority); return (T) merge(left, right, checkDelegateAuthority);
} }
@ -106,7 +154,7 @@ public class MergeUtils {
return mergeSoftware((Software) left, (Software) right); return mergeSoftware((Software) left, (Software) right);
} }
return mergeResultFields((Result) left, (Result) right); return left;
} else if (sameClass(left, right, Datasource.class)) { } else if (sameClass(left, right, Datasource.class)) {
// TODO // TODO
final int trust = compareTrust(left, right); final int trust = compareTrust(left, right);
@ -654,16 +702,9 @@ public class MergeUtils {
} }
private static Field<String> selectOldestDate(Field<String> d1, Field<String> d2) { private static Field<String> selectOldestDate(Field<String> d1, Field<String> d2) {
if (d1 == null || StringUtils.isBlank(d1.getValue())) { if (!GraphCleaningFunctions.cleanDateField(d1).isPresent()) {
return d2; return d2;
} else if (d2 == null || StringUtils.isBlank(d2.getValue())) { } else if (!GraphCleaningFunctions.cleanDateField(d2).isPresent()) {
return d1;
}
if (StringUtils.contains(d1.getValue(), "null")) {
return d2;
}
if (StringUtils.contains(d2.getValue(), "null")) {
return d1; return d1;
} }
@ -715,7 +756,11 @@ public class MergeUtils {
private static String spKeyExtractor(StructuredProperty sp) { private static String spKeyExtractor(StructuredProperty sp) {
return Optional return Optional
.ofNullable(sp) .ofNullable(sp)
.map(s -> Joiner.on("||").join(qualifierKeyExtractor(s.getQualifier()), s.getValue())) .map(
s -> Joiner
.on("||")
.useForNull("")
.join(qualifierKeyExtractor(s.getQualifier()), s.getValue()))
.orElse(null); .orElse(null);
} }
@ -972,7 +1017,7 @@ public class MergeUtils {
private static String extractKeyFromPid(final StructuredProperty pid) { private static String extractKeyFromPid(final StructuredProperty pid) {
if (pid == null) if (pid == null)
return null; return null;
final StructuredProperty normalizedPid = CleaningFunctions.normalizePidValue(pid); final StructuredProperty normalizedPid = PidCleaner.normalizePidValue(pid);
return String.format("%s::%s", normalizedPid.getQualifier().getClassid(), normalizedPid.getValue()); return String.format("%s::%s", normalizedPid.getQualifier().getClassid(), normalizedPid.getValue());
} }

View File

@ -1,6 +1,12 @@
package eu.dnetlib.dhp.schema.oaf.utils; package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.Map;
import com.google.common.collect.Maps;
import eu.dnetlib.dhp.schema.common.ModelConstants;
public class ModelHardLimits { public class ModelHardLimits {
private ModelHardLimits() { private ModelHardLimits() {
@ -12,6 +18,7 @@ public class ModelHardLimits {
public static final int MAX_EXTERNAL_ENTITIES = 50; public static final int MAX_EXTERNAL_ENTITIES = 50;
public static final int MAX_AUTHORS = 200; public static final int MAX_AUTHORS = 200;
public static final int MAX_RELATED_AUTHORS = 20;
public static final int MAX_AUTHOR_FULLNAME_LENGTH = 1000; public static final int MAX_AUTHOR_FULLNAME_LENGTH = 1000;
public static final int MAX_TITLE_LENGTH = 5000; public static final int MAX_TITLE_LENGTH = 5000;
public static final int MAX_TITLES = 10; public static final int MAX_TITLES = 10;
@ -19,6 +26,12 @@ public class ModelHardLimits {
public static final int MAX_ABSTRACT_LENGTH = 150000; public static final int MAX_ABSTRACT_LENGTH = 150000;
public static final int MAX_RELATED_ABSTRACT_LENGTH = 500; public static final int MAX_RELATED_ABSTRACT_LENGTH = 500;
public static final int MAX_INSTANCES = 10; public static final int MAX_INSTANCES = 10;
public static final Map<String, Long> MAX_RELATIONS_BY_RELCLASS = Maps.newHashMap();
static {
MAX_RELATIONS_BY_RELCLASS.put(ModelConstants.PERSON_PERSON_HASCOAUTHORED, 500L);
MAX_RELATIONS_BY_RELCLASS.put(ModelConstants.RESULT_PERSON_HASAUTHORED, 500L);
}
public static String getCollectionName(String format) { public static String getCollectionName(String format) {
return format + SEPARATOR + LAYOUT + SEPARATOR + INTERPRETATION; return format + SEPARATOR + LAYOUT + SEPARATOR + INTERPRETATION;

View File

@ -26,7 +26,7 @@ public class PidCleaner {
String value = Optional String value = Optional
.ofNullable(pidValue) .ofNullable(pidValue)
.map(String::trim) .map(String::trim)
.orElseThrow(() -> new IllegalArgumentException("PID value cannot be empty")); .orElseThrow(() -> new IllegalArgumentException("PID (" + pidType + ") value cannot be empty"));
switch (pidType) { switch (pidType) {

View File

@ -18,8 +18,8 @@ public class PidValueComparator implements Comparator<StructuredProperty> {
if (right == null) if (right == null)
return -1; return -1;
StructuredProperty l = CleaningFunctions.normalizePidValue(left); StructuredProperty l = PidCleaner.normalizePidValue(left);
StructuredProperty r = CleaningFunctions.normalizePidValue(right); StructuredProperty r = PidCleaner.normalizePidValue(right);
return Optional return Optional
.ofNullable(l.getValue()) .ofNullable(l.getValue())

View File

@ -28,6 +28,7 @@ import com.jayway.jsonpath.JsonPath;
import eu.dnetlib.dhp.schema.mdstore.MDStoreWithInfo; import eu.dnetlib.dhp.schema.mdstore.MDStoreWithInfo;
import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions; import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions;
import eu.dnetlib.dhp.schema.oaf.utils.PidCleaner;
import net.minidev.json.JSONArray; import net.minidev.json.JSONArray;
import scala.collection.JavaConverters; import scala.collection.JavaConverters;
import scala.collection.Seq; import scala.collection.Seq;
@ -104,7 +105,7 @@ public class DHPUtils {
public static String generateUnresolvedIdentifier(final String pid, final String pidType) { public static String generateUnresolvedIdentifier(final String pid, final String pidType) {
final String cleanedPid = CleaningFunctions.normalizePidValue(pidType, pid); final String cleanedPid = PidCleaner.normalizePidValue(pidType, pid);
return String.format("unresolved::%s::%s", cleanedPid, pidType.toLowerCase().trim()); return String.format("unresolved::%s::%s", cleanedPid, pidType.toLowerCase().trim());
} }

View File

@ -29,7 +29,7 @@ class IdentifierFactoryTest {
"publication_doi2.json", "50|doi_________::79dbc7a2a56dc1532659f9038843256e", true); "publication_doi2.json", "50|doi_________::79dbc7a2a56dc1532659f9038843256e", true);
verifyIdentifier( verifyIdentifier(
"publication_doi3.json", "50|pmc_________::94e4cb08c93f8733b48e2445d04002ac", true); "publication_doi3.json", "50|pmc_________::e2a339e0e11bfbf55462e14a07f1b304", true);
verifyIdentifier( verifyIdentifier(
"publication_doi4.json", "50|od______2852::38861c44e6052a8d49f59a4c39ba5e66", true); "publication_doi4.json", "50|od______2852::38861c44e6052a8d49f59a4c39ba5e66", true);
@ -41,7 +41,7 @@ class IdentifierFactoryTest {
"publication_pmc1.json", "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f", true); "publication_pmc1.json", "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f", true);
verifyIdentifier( verifyIdentifier(
"publication_pmc2.json", "50|pmc_________::94e4cb08c93f8733b48e2445d04002ac", true); "publication_pmc2.json", "50|pmc_________::e2a339e0e11bfbf55462e14a07f1b304", true);
verifyIdentifier( verifyIdentifier(
"publication_openapc.json", "50|doi_________::79dbc7a2a56dc1532659f9038843256e", true); "publication_openapc.json", "50|doi_________::79dbc7a2a56dc1532659f9038843256e", true);

View File

@ -179,7 +179,7 @@ class OafMapperUtilsTest {
assertEquals( assertEquals(
ModelConstants.DATASET_RESULTTYPE_CLASSID, ModelConstants.DATASET_RESULTTYPE_CLASSID,
((Result) MergeUtils ((Result) MergeUtils
.merge(p2, d1)) .merge(p2, d1, true))
.getResulttype() .getResulttype()
.getClassid()); .getClassid());
} }

View File

@ -29,7 +29,7 @@
}, },
{ {
"qualifier": {"classid": "pmc"}, "qualifier": {"classid": "pmc"},
"value": "21459329" "value": "PMC21459329"
} }
] ]
} }

View File

@ -13,7 +13,7 @@
}, },
{ {
"qualifier":{"classid":"pmc"}, "qualifier":{"classid":"pmc"},
"value":"21459329" "value":"PMC21459329"
} }
] ]
} }

View File

@ -38,7 +38,7 @@ public class NumAuthorsTitleSuffixPrefixChain extends AbstractClusteringFunction
@Override @Override
protected Collection<String> doApply(Config conf, String s) { protected Collection<String> doApply(Config conf, String s) {
return suffixPrefixChain(cleanup(s), param("mod")); return suffixPrefixChain(cleanup(s), paramOrDefault("mod", 10));
} }
private Collection<String> suffixPrefixChain(String s, int mod) { private Collection<String> suffixPrefixChain(String s, int mod) {

View File

@ -90,7 +90,7 @@ public class AbstractPaceFunctions extends PaceCommonUtils {
inferFrom = normalize(inferFrom); inferFrom = normalize(inferFrom);
inferFrom = filterAllStopWords(inferFrom); inferFrom = filterAllStopWords(inferFrom);
Set<String> cities = getCities(inferFrom, 4); Set<String> cities = getCities(inferFrom, 4);
return citiesToCountry(cities).stream().findFirst().orElse("UNKNOWN"); return citiesToCountry(cities).stream().filter(Objects::nonNull).findFirst().orElse("UNKNOWN");
} }
public static String cityInference(String original) { public static String cityInference(String original) {

View File

@ -54,6 +54,22 @@ public class FieldDef implements Serializable {
public FieldDef() { public FieldDef() {
} }
public FieldDef clone() {
FieldDef fieldDef = new FieldDef();
fieldDef.setName(this.name);
fieldDef.setPath(this.path);
fieldDef.setType(this.type);
fieldDef.setOverrideMatch(this.overrideMatch);
fieldDef.setSize(this.size);
fieldDef.setLength(this.length);
fieldDef.setFilter(this.filter);
fieldDef.setSorted(this.sorted);
fieldDef.setClean(this.clean);
fieldDef.setInfer(this.infer);
fieldDef.setInferenceFrom(this.inferenceFrom);
return fieldDef;
}
public String getInferenceFrom() { public String getInferenceFrom() {
return inferenceFrom; return inferenceFrom;
} }

View File

@ -19,48 +19,10 @@ case class SparkDeduper(conf: DedupConfig) extends Serializable {
val model: SparkModel = SparkModel(conf) val model: SparkModel = SparkModel(conf)
val dedup: (Dataset[Row] => Dataset[Row]) = df => { val dedup: (Dataset[Row] => Dataset[Row]) = df => {
df.transform(filterAndCleanup) df.transform(generateClustersWithCollect)
.transform(generateClustersWithCollect)
.transform(processBlocks) .transform(processBlocks)
} }
val filterAndCleanup: (Dataset[Row] => Dataset[Row]) = df => {
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
df_with_filters
}
def filterColumnUDF(fdef: FieldDef): UserDefinedFunction = {
val blacklist: Predicate[String] = conf.blacklists().get(fdef.getName)
if (blacklist == null) {
throw new IllegalArgumentException("Column: " + fdef.getName + " does not have any filter")
} else {
fdef.getType match {
case Type.List | Type.JSON =>
udf[Array[String], Array[String]](values => {
values.filter((v: String) => !blacklist.test(v))
})
case _ =>
udf[String, String](v => {
if (blacklist.test(v)) ""
else v
})
}
}
}
val generateClustersWithCollect: (Dataset[Row] => Dataset[Row]) = df_with_filters => { val generateClustersWithCollect: (Dataset[Row] => Dataset[Row]) = df_with_filters => {
var df_with_clustering_keys: Dataset[Row] = null var df_with_clustering_keys: Dataset[Row] = null

View File

@ -5,12 +5,12 @@ import eu.dnetlib.pace.common.AbstractPaceFunctions
import eu.dnetlib.pace.config.{DedupConfig, Type} import eu.dnetlib.pace.config.{DedupConfig, Type}
import eu.dnetlib.pace.util.{MapDocumentUtil, SparkCompatUtils} import eu.dnetlib.pace.util.{MapDocumentUtil, SparkCompatUtils}
import org.apache.commons.lang3.StringUtils import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType} import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
import org.apache.spark.sql.{Dataset, Row} import org.apache.spark.sql.{Dataset, Row}
import java.util.Locale import java.util.Locale
import java.util.function.Predicate
import java.util.regex.Pattern import java.util.regex.Pattern
import scala.collection.JavaConverters._ import scala.collection.JavaConverters._
@ -29,8 +29,20 @@ case class SparkModel(conf: DedupConfig) {
identifier.setName(identifierFieldName) identifier.setName(identifierFieldName)
identifier.setType(Type.String) identifier.setType(Type.String)
// create fields for blacklist
val filtered = conf.getPace.getModel.asScala.flatMap(fdef => {
if (conf.blacklists().containsKey(fdef.getName)) {
val fdef_filtered = fdef.clone()
fdef_filtered.setName(fdef.getName + "_filtered")
Seq(fdef, fdef_filtered)
}
else {
Seq(fdef)
}
})
// Construct a Spark StructType representing the schema of the model // Construct a Spark StructType representing the schema of the model
(Seq(identifier) ++ conf.getPace.getModel.asScala) (Seq(identifier) ++ filtered)
.foldLeft( .foldLeft(
new StructType() new StructType()
)((resType, fieldDef) => { )((resType, fieldDef) => {
@ -44,7 +56,6 @@ case class SparkModel(conf: DedupConfig) {
}) })
}) })
} }
val identityFieldPosition: Int = schema.fieldIndex(identifierFieldName) val identityFieldPosition: Int = schema.fieldIndex(identifierFieldName)
@ -52,7 +63,8 @@ case class SparkModel(conf: DedupConfig) {
val orderingFieldPosition: Int = schema.fieldIndex(orderingFieldName) val orderingFieldPosition: Int = schema.fieldIndex(orderingFieldName)
val parseJsonDataset: (Dataset[String] => Dataset[Row]) = df => { val parseJsonDataset: (Dataset[String] => Dataset[Row]) = df => {
df.map(r => rowFromJson(r))(SparkCompatUtils.encoderFor(schema)) df
.map(r => rowFromJson(r))(SparkCompatUtils.encoderFor(schema))
} }
def rowFromJson(json: String): Row = { def rowFromJson(json: String): Row = {
@ -64,9 +76,11 @@ case class SparkModel(conf: DedupConfig) {
schema.fieldNames.zipWithIndex.foldLeft(values) { schema.fieldNames.zipWithIndex.foldLeft(values) {
case ((res, (fname, index))) => case ((res, (fname, index))) =>
val fdef = conf.getPace.getModelMap.get(fname)
val fdef = conf.getPace.getModelMap.get(fname.split("_filtered")(0))
if (fdef != null) { if (fdef != null) {
if (!fname.contains("_filtered")) { //process fields with no blacklist
res(index) = fdef.getType match { res(index) = fdef.getType match {
case Type.String | Type.Int => case Type.String | Type.Int =>
MapDocumentUtil.truncateValue( MapDocumentUtil.truncateValue(
@ -99,6 +113,26 @@ case class SparkModel(conf: DedupConfig) {
case Type.DoubleArray => case Type.DoubleArray =>
MapDocumentUtil.getJPathArray(fdef.getPath, json) MapDocumentUtil.getJPathArray(fdef.getPath, json)
} }
}
else { //process fields with blacklist
val blacklist: Predicate[String] = conf.blacklists().get(fdef.getName)
res(index) = fdef.getType match {
case Type.List | Type.JSON =>
MapDocumentUtil.truncateList(
MapDocumentUtil.getJPathList(fdef.getPath, documentContext, fdef.getType),
fdef.getSize
).asScala.filter((v: String) => !blacklist.test(v))
case _ =>
val value: String = MapDocumentUtil.truncateValue(
MapDocumentUtil.getJPathString(fdef.getPath, documentContext),
fdef.getLength
)
if (blacklist.test(value)) "" else value
}
}
val filter = fdef.getFilter val filter = fdef.getFilter
@ -125,13 +159,12 @@ case class SparkModel(conf: DedupConfig) {
} }
if (StringUtils.isNotBlank(fdef.getInfer)) { if (StringUtils.isNotBlank(fdef.getInfer)) {
val inferFrom : String = if (StringUtils.isNotBlank(fdef.getInferenceFrom)) fdef.getInferenceFrom else fdef.getPath val inferFrom: String = if (StringUtils.isNotBlank(fdef.getInferenceFrom)) fdef.getInferenceFrom else fdef.getPath
res(index) = res(index) match { res(index) = res(index) match {
case x: Seq[String] => x.map(inference(_, MapDocumentUtil.getJPathString(inferFrom, documentContext), fdef.getInfer)) case x: Seq[String] => x.map(inference(_, MapDocumentUtil.getJPathString(inferFrom, documentContext), fdef.getInfer))
case _ => inference(res(index).toString, MapDocumentUtil.getJPathString(inferFrom, documentContext), fdef.getInfer) case _ => inference(res(index).toString, MapDocumentUtil.getJPathString(inferFrom, documentContext), fdef.getInfer)
} }
} }
} }
res res
@ -139,6 +172,7 @@ case class SparkModel(conf: DedupConfig) {
} }
new GenericRowWithSchema(values, schema) new GenericRowWithSchema(values, schema)
} }
def clean(value: String, cleantype: String) : String = { def clean(value: String, cleantype: String) : String = {

View File

@ -21,7 +21,7 @@ public class CodeMatch extends AbstractStringComparator {
public CodeMatch(Map<String, String> params) { public CodeMatch(Map<String, String> params) {
super(params); super(params);
this.params = params; this.params = params;
this.CODE_REGEX = Pattern.compile(params.getOrDefault("codeRegex", "[a-zA-Z]::\\d+")); this.CODE_REGEX = Pattern.compile(params.getOrDefault("codeRegex", "[a-zA-Z]+::\\d+"));
} }
public Set<String> getRegexList(String input) { public Set<String> getRegexList(String input) {

View File

@ -0,0 +1,67 @@
package eu.dnetlib.pace.tree;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
import org.joda.time.DateTime;
import java.time.DateTimeException;
import java.time.LocalDate;
import java.time.Period;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Map;
@ComparatorClass("dateRange")
public class DateRange extends AbstractStringComparator {
int YEAR_RANGE;
public DateRange(Map<String, String> params) {
super(params, new com.wcohen.ss.JaroWinkler());
YEAR_RANGE = Integer.parseInt(params.getOrDefault("year_range", "3"));
}
public DateRange(final double weight) {
super(weight, new com.wcohen.ss.JaroWinkler());
}
protected DateRange(final double weight, final AbstractStringDistance ssalgo) {
super(weight, ssalgo);
}
public static boolean isNumeric(String str) {
return str.matches("\\d+"); //match a number with optional '-' and decimal.
}
@Override
public double distance(final String a, final String b, final Config conf) {
if (a.isEmpty() || b.isEmpty()) {
return -1.0; // return -1 if a field is missing
}
try {
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd", Locale.ENGLISH);
LocalDate d1 = LocalDate.parse(a, formatter);
LocalDate d2 = LocalDate.parse(b, formatter);
Period period = Period.between(d1, d2);
return period.getYears() <= YEAR_RANGE? 1.0 : 0.0;
}
catch (DateTimeException e) {
return -1.0;
}
}
@Override
public double getWeight() {
return super.weight;
}
@Override
protected double normalize(final double d) {
return d;
}
}

View File

@ -41,21 +41,38 @@ public class JsonListMatch extends AbstractListComparator {
return -1; return -1;
} }
final Set<String> ca = sa.stream().map(this::toComparableString).collect(Collectors.toSet()); Set<String> ca = sa.stream().map(this::toComparableString).collect(Collectors.toSet());
final Set<String> cb = sb.stream().map(this::toComparableString).collect(Collectors.toSet()); Set<String> cb = sb.stream().map(this::toComparableString).collect(Collectors.toSet());
switch (MODE) {
case "count":
return Sets.intersection(ca, cb).size();
case "percentage":
int incommon = Sets.intersection(ca, cb).size(); int incommon = Sets.intersection(ca, cb).size();
int simDiff = Sets.symmetricDifference(ca, cb).size(); int simDiff = Sets.symmetricDifference(ca, cb).size();
if (incommon + simDiff == 0) { if (incommon + simDiff == 0) {
return 0.0; return 0.0;
} }
if (MODE.equals("percentage"))
return (double) incommon / (incommon + simDiff); return (double) incommon / (incommon + simDiff);
else
return incommon;
case "type":
Set<String> typesA = ca.stream().map(s -> s.split("::")[0]).collect(Collectors.toSet());
Set<String> typesB = cb.stream().map(s -> s.split("::")[0]).collect(Collectors.toSet());
Set<String> types = Sets.intersection(typesA, typesB);
if (types.isEmpty()) // if no common type, it is impossible to compare
return -1;
ca = ca.stream().filter(s -> types.contains(s.split("::")[0])).collect(Collectors.toSet());
cb = cb.stream().filter(s -> types.contains(s.split("::")[0])).collect(Collectors.toSet());
return (double) Sets.intersection(ca, cb).size() / types.size();
default:
return -1;
}
} }
// converts every json into a comparable string basing on parameters // converts every json into a comparable string basing on parameters
@ -69,7 +86,7 @@ public class JsonListMatch extends AbstractListComparator {
// for each path in the param list // for each path in the param list
for (String key : params.keySet().stream().filter(k -> k.contains("jpath")).collect(Collectors.toList())) { for (String key : params.keySet().stream().filter(k -> k.contains("jpath")).collect(Collectors.toList())) {
String path = params.get(key); String path = params.get(key);
String value = MapDocumentUtil.getJPathString(path, documentContext); String value = MapDocumentUtil.getJPathString(path, documentContext).toLowerCase();
if (value == null || value.isEmpty()) if (value == null || value.isEmpty())
value = ""; value = "";
st.append(value); st.append(value);

View File

@ -48,7 +48,7 @@ public class TreeNodeDef implements Serializable {
// function for the evaluation of the node // function for the evaluation of the node
public TreeNodeStats evaluate(Row doc1, Row doc2, Config conf) { public TreeNodeStats evaluate(Row doc1, Row doc2, Config conf) {
TreeNodeStats stats = new TreeNodeStats(ignoreUndefined); TreeNodeStats stats = new TreeNodeStats();
// for each field in the node, it computes the // for each field in the node, it computes the
for (FieldConf fieldConf : fields) { for (FieldConf fieldConf : fields) {

View File

@ -9,11 +9,8 @@ public class TreeNodeStats implements Serializable {
private Map<String, FieldStats> results; // this is an accumulator for the results of the node private Map<String, FieldStats> results; // this is an accumulator for the results of the node
private final boolean ignoreUndefined; public TreeNodeStats() {
public TreeNodeStats(boolean ignoreUndefined) {
this.results = new HashMap<>(); this.results = new HashMap<>();
this.ignoreUndefined = ignoreUndefined;
} }
public Map<String, FieldStats> getResults() { public Map<String, FieldStats> getResults() {
@ -25,10 +22,7 @@ public class TreeNodeStats implements Serializable {
} }
public int fieldsCount() { public int fieldsCount() {
if (ignoreUndefined)
return this.results.size(); return this.results.size();
else
return this.results.size() - undefinedCount(); // do not count undefined
} }
public int undefinedCount() { public int undefinedCount() {
@ -84,23 +78,12 @@ public class TreeNodeStats implements Serializable {
double min = 100.0; // random high value double min = 100.0; // random high value
for (FieldStats fs : this.results.values()) { for (FieldStats fs : this.results.values()) {
if (fs.getResult() < min) { if (fs.getResult() < min) {
if (fs.getResult() == -1) { if (fs.getResult() >= 0.0 || (fs.getResult() == -1 && fs.isCountIfUndefined()))
if (fs.isCountIfUndefined()) {
min = 0.0;
} else {
min = -1;
}
} else {
min = fs.getResult(); min = fs.getResult();
} }
} }
}
if (ignoreUndefined) {
return min == -1.0 ? 0.0 : min;
} else {
return min; return min;
} }
}
// if at least one is true, return 1.0 // if at least one is true, return 1.0
public double or() { public double or() {
@ -108,12 +91,8 @@ public class TreeNodeStats implements Serializable {
if (fieldStats.getResult() >= fieldStats.getThreshold()) if (fieldStats.getResult() >= fieldStats.getThreshold())
return 1.0; return 1.0;
} }
if (!ignoreUndefined && undefinedCount() > 0) {
return -1.0;
} else {
return 0.0; return 0.0;
} }
}
// if at least one is false, return 0.0 // if at least one is false, return 0.0
public double and() { public double and() {
@ -121,7 +100,7 @@ public class TreeNodeStats implements Serializable {
if (fieldStats.getResult() == -1) { if (fieldStats.getResult() == -1) {
if (fieldStats.isCountIfUndefined()) if (fieldStats.isCountIfUndefined())
return ignoreUndefined ? 0.0 : -1.0; return 0.0;
} else { } else {
if (fieldStats.getResult() < fieldStats.getThreshold()) if (fieldStats.getResult() < fieldStats.getThreshold())
return 0.0; return 0.0;

View File

@ -44,10 +44,12 @@ public class TreeProcessor {
TreeNodeStats stats = currentNode.evaluate(doc1, doc2, config); TreeNodeStats stats = currentNode.evaluate(doc1, doc2, config);
treeStats.addNodeStats(nextNodeName, stats); treeStats.addNodeStats(nextNodeName, stats);
double finalScore = stats.getFinalScore(currentNode.getAggregation()); // if ignoreUndefined=false the miss is considered as undefined
if (finalScore == -1.0) if (!currentNode.isIgnoreUndefined() && stats.undefinedCount() > 0) {
nextNodeName = currentNode.getUndefined(); nextNodeName = currentNode.getUndefined();
else if (finalScore >= currentNode.getThreshold()) { }
// if ignoreUndefined=true the miss is ignored and the score computed anyway
else if (stats.getFinalScore(currentNode.getAggregation()) >= currentNode.getThreshold()) {
nextNodeName = currentNode.getPositive(); nextNodeName = currentNode.getPositive();
} else { } else {
nextNodeName = currentNode.getNegative(); nextNodeName = currentNode.getNegative();

View File

@ -227,4 +227,17 @@ public class ClusteringFunctionTest extends AbstractPaceTest {
System.out.println(cf.apply(conf, Lists.newArrayList(s))); System.out.println(cf.apply(conf, Lists.newArrayList(s)));
} }
@Test
public void testNumAuthorsTitleSuffixPrefixChain() {
final ClusteringFunction cf = new NumAuthorsTitleSuffixPrefixChain(params);
params.put("mod", 10);
final String title = "PARP-2 Regulates SIRT1 Expression and Whole-Body Energy Expenditure";
final String num_authors = "10";
System.out.println("title = " + title);
System.out.println("num_authors = " + num_authors);
System.out.println(cf.apply(conf, Lists.newArrayList(num_authors, title)));
}
} }

View File

@ -1,8 +1,7 @@
package eu.dnetlib.pace.common; package eu.dnetlib.pace.common;
import static org.junit.jupiter.api.Assertions.assertEquals; import static org.junit.jupiter.api.Assertions.*;
import static org.junit.jupiter.api.Assertions.assertTrue;
import org.junit.jupiter.api.*; import org.junit.jupiter.api.*;
@ -54,8 +53,17 @@ public class PaceFunctionTest extends AbstractPaceFunctions {
System.out.println("Fixed aliases : " + fixAliases(TEST_STRING)); System.out.println("Fixed aliases : " + fixAliases(TEST_STRING));
} }
@Test()
public void countryInferenceTest_NPE() {
assertThrows(
NullPointerException.class,
() -> countryInference("UNKNOWN", null),
"Expected countryInference() to throw an NPE");
}
@Test @Test
public void countryInferenceTest() { public void countryInferenceTest() {
assertEquals("UNKNOWN", countryInference("UNKNOWN", ""));
assertEquals("IT", countryInference("UNKNOWN", "Università di Bologna")); assertEquals("IT", countryInference("UNKNOWN", "Università di Bologna"));
assertEquals("UK", countryInference("UK", "Università di Bologna")); assertEquals("UK", countryInference("UK", "Università di Bologna"));
assertEquals("IT", countryInference("UNKNOWN", "Universiteé de Naples")); assertEquals("IT", countryInference("UNKNOWN", "Universiteé de Naples"));

View File

@ -65,6 +65,23 @@ public class ComparatorTest extends AbstractPaceTest {
} }
@Test
public void datasetVersionCodeMatchTest() {
params.put("codeRegex", "(?=[\\w-]*[a-zA-Z])(?=[\\w-]*\\d)[\\w-]+");
CodeMatch codeMatch = new CodeMatch(params);
// names have different codes
assertEquals(0.0, codeMatch.distance("physical oceanography at ctd station june 1998 ev02a", "physical oceanography at ctd station june 1998 ir02", conf));
// names have same code
assertEquals(1.0, codeMatch.distance("physical oceanography at ctd station june 1998 ev02a", "physical oceanography at ctd station june 1998 ev02a", conf));
// code is not in both names
assertEquals(-1, codeMatch.distance("physical oceanography at ctd station june 1998", "physical oceanography at ctd station june 1998 ev02a", conf));
assertEquals(1.0, codeMatch.distance("physical oceanography at ctd station june 1998", "physical oceanography at ctd station june 1998", conf));
}
@Test @Test
public void listContainsMatchTest() { public void listContainsMatchTest() {
@ -257,15 +274,15 @@ public class ComparatorTest extends AbstractPaceTest {
List<String> a = createFieldList( List<String> a = createFieldList(
Arrays Arrays
.asList( .asList(
"{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":null,\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"sysimport:actionset\",\"classname\":\"Harvested\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"doi\",\"classname\":\"Digital Object Identifier\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"10.1111/pbi.12655\"}"), "{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":null,\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"sysimport:actionset\",\"classname\":\"Harvested\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"grid\",\"classname\":\"GRID Identifier\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"grid_1\"}",
"{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":null,\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"sysimport:actionset\",\"classname\":\"Harvested\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"ror\",\"classname\":\"Research Organization Registry\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"ror_1\"}"),
"authors"); "authors");
List<String> b = createFieldList( List<String> b = createFieldList(
Arrays Arrays
.asList( .asList(
"{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":\"\",\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"sysimport:crosswalk:repository\",\"classname\":\"Harvested\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"pmc\",\"classname\":\"PubMed Central ID\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"PMC5399005\"}", "{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":\"\",\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"sysimport:crosswalk:repository\",\"classname\":\"Harvested\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"grid\",\"classname\":\"GRID Identifier\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"grid_1\"}",
"{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":\"\",\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"sysimport:crosswalk:repository\",\"classname\":\"Harvested\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"pmid\",\"classname\":\"PubMed ID\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"27775869\"}", "{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":\"\",\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"sysimport:crosswalk:repository\",\"classname\":\"Harvested\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"ror\",\"classname\":\"Research Organization Registry\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"ror_2\"}",
"{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":\"\",\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"user:claim\",\"classname\":\"Linked by user\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"doi\",\"classname\":\"Digital Object Identifier\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"10.1111/pbi.12655\"}", "{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":\"\",\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"user:claim\",\"classname\":\"Linked by user\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"isni\",\"classname\":\"ISNI Identifier\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"isni_1\"}"),
"{\"datainfo\":{\"deletedbyinference\":false,\"inferenceprovenance\":\"\",\"inferred\":false,\"invisible\":false,\"provenanceaction\":{\"classid\":\"sysimport:crosswalk:repository\",\"classname\":\"Harvested\",\"schemeid\":\"dnet:provenanceActions\",\"schemename\":\"dnet:provenanceActions\"},\"trust\":\"0.9\"},\"qualifier\":{\"classid\":\"handle\",\"classname\":\"Handle\",\"schemeid\":\"dnet:pid_types\",\"schemename\":\"dnet:pid_types\"},\"value\":\"1854/LU-8523529\"}"),
"authors"); "authors");
double result = jsonListMatch.compare(a, b, conf); double result = jsonListMatch.compare(a, b, conf);
@ -277,6 +294,13 @@ public class ComparatorTest extends AbstractPaceTest {
result = jsonListMatch.compare(a, b, conf); result = jsonListMatch.compare(a, b, conf);
assertEquals(1.0, result); assertEquals(1.0, result);
params.put("mode", "type");
jsonListMatch = new JsonListMatch(params);
result = jsonListMatch.compare(a, b, conf);
assertEquals(0.5, result);
} }
@Test @Test
@ -327,4 +351,34 @@ public class ComparatorTest extends AbstractPaceTest {
} }
@Test
public void dateMatch() {
DateRange dateRange = new DateRange(params);
double result = dateRange.distance("2021-05-13", "2023-05-13", conf);
assertEquals(1.0, result);
result = dateRange.distance("2021-05-13", "2025-05-13", conf);
assertEquals(0.0, result);
result = dateRange.distance("", "2020-05-05", conf);
assertEquals(-1.0, result);
result = dateRange.distance("invalid date", "2021-05-02", conf);
assertEquals(-1.0, result);
}
@Test
public void titleVersionMatchTest() {
TitleVersionMatch titleVersionMatch = new TitleVersionMatch(params);
double result = titleVersionMatch
.compare(
"parp 2 regulates sirt 1 expression and whole body energy expenditure",
"parp 2 regulates sirt 1 expression and whole body energy expenditure", conf);
assertEquals(1.0, result);
}
} }

View File

@ -11,7 +11,6 @@ import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test; import org.junit.jupiter.api.Test;
import eu.dnetlib.pace.model.Person; import eu.dnetlib.pace.model.Person;
import jdk.nashorn.internal.ir.annotations.Ignore;
public class UtilTest { public class UtilTest {

View File

@ -26,16 +26,16 @@
<dependencies> <dependencies>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-actionmanager</artifactId>
<version>${project.version}</version>
</dependency>
<!-- <dependency>--> <!-- <dependency>-->
<!-- <groupId>eu.dnetlib.dhp</groupId>--> <!-- <groupId>eu.dnetlib.dhp</groupId>-->
<!-- <artifactId>dhp-aggregation</artifactId>--> <!-- <artifactId>dhp-actionmanager</artifactId>-->
<!-- <version>${project.version}</version>--> <!-- <version>${project.version}</version>-->
<!-- </dependency>--> <!-- </dependency>-->
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-aggregation</artifactId>
<version>${project.version}</version>
</dependency>
<!-- <dependency>--> <!-- <dependency>-->
<!-- <groupId>eu.dnetlib.dhp</groupId>--> <!-- <groupId>eu.dnetlib.dhp</groupId>-->
<!-- <artifactId>dhp-blacklist</artifactId>--> <!-- <artifactId>dhp-blacklist</artifactId>-->
@ -56,61 +56,61 @@
<!-- <artifactId>dhp-enrichment</artifactId>--> <!-- <artifactId>dhp-enrichment</artifactId>-->
<!-- <version>${project.version}</version>--> <!-- <version>${project.version}</version>-->
<!-- </dependency>--> <!-- </dependency>-->
<dependency> <!-- <dependency>-->
<groupId>eu.dnetlib.dhp</groupId> <!-- <groupId>eu.dnetlib.dhp</groupId>-->
<artifactId>dhp-graph-mapper</artifactId> <!-- <artifactId>dhp-graph-mapper</artifactId>-->
<version>${project.version}</version> <!-- <version>${project.version}</version>-->
</dependency> <!-- </dependency>-->
<dependency> <!-- <dependency>-->
<groupId>eu.dnetlib.dhp</groupId> <!-- <groupId>eu.dnetlib.dhp</groupId>-->
<artifactId>dhp-graph-provision</artifactId> <!-- <artifactId>dhp-graph-provision</artifactId>-->
<version>${project.version}</version> <!-- <version>${project.version}</version>-->
</dependency> <!-- </dependency>-->
<dependency> <!-- <dependency>-->
<groupId>eu.dnetlib.dhp</groupId> <!-- <groupId>eu.dnetlib.dhp</groupId>-->
<artifactId>dhp-impact-indicators</artifactId> <!-- <artifactId>dhp-impact-indicators</artifactId>-->
<version>${project.version}</version> <!-- <version>${project.version}</version>-->
</dependency> <!-- </dependency>-->
<dependency> <!-- <dependency>-->
<groupId>eu.dnetlib.dhp</groupId> <!-- <groupId>eu.dnetlib.dhp</groupId>-->
<artifactId>dhp-stats-actionsets</artifactId> <!-- <artifactId>dhp-stats-actionsets</artifactId>-->
<version>${project.version}</version> <!-- <version>${project.version}</version>-->
</dependency> <!-- </dependency>-->
<dependency> <!-- <dependency>-->
<groupId>eu.dnetlib.dhp</groupId> <!-- <groupId>eu.dnetlib.dhp</groupId>-->
<artifactId>dhp-stats-hist-snaps</artifactId> <!-- <artifactId>dhp-stats-hist-snaps</artifactId>-->
<version>${project.version}</version> <!-- <version>${project.version}</version>-->
</dependency> <!-- </dependency>-->
<dependency> <!-- <dependency>-->
<groupId>eu.dnetlib.dhp</groupId> <!-- <groupId>eu.dnetlib.dhp</groupId>-->
<artifactId>dhp-stats-monitor-irish</artifactId> <!-- <artifactId>dhp-stats-monitor-irish</artifactId>-->
<version>${project.version}</version> <!-- <version>${project.version}</version>-->
</dependency> <!-- </dependency>-->
<dependency> <!-- <dependency>-->
<groupId>eu.dnetlib.dhp</groupId> <!-- <groupId>eu.dnetlib.dhp</groupId>-->
<artifactId>dhp-stats-promote</artifactId> <!-- <artifactId>dhp-stats-promote</artifactId>-->
<version>${project.version}</version> <!-- <version>${project.version}</version>-->
</dependency> <!-- </dependency>-->
<dependency> <!-- <dependency>-->
<groupId>eu.dnetlib.dhp</groupId> <!-- <groupId>eu.dnetlib.dhp</groupId>-->
<artifactId>dhp-stats-update</artifactId> <!-- <artifactId>dhp-stats-update</artifactId>-->
<version>${project.version}</version> <!-- <version>${project.version}</version>-->
</dependency> <!-- </dependency>-->
<dependency> <!-- <dependency>-->
<groupId>eu.dnetlib.dhp</groupId> <!-- <groupId>eu.dnetlib.dhp</groupId>-->
<artifactId>dhp-swh</artifactId> <!-- <artifactId>dhp-swh</artifactId>-->
<version>${project.version}</version> <!-- <version>${project.version}</version>-->
</dependency> <!-- </dependency>-->
<dependency> <!-- <dependency>-->
<groupId>eu.dnetlib.dhp</groupId> <!-- <groupId>eu.dnetlib.dhp</groupId>-->
<artifactId>dhp-usage-raw-data-update</artifactId> <!-- <artifactId>dhp-usage-raw-data-update</artifactId>-->
<version>${project.version}</version> <!-- <version>${project.version}</version>-->
</dependency> <!-- </dependency>-->
<dependency> <!-- <dependency>-->
<groupId>eu.dnetlib.dhp</groupId> <!-- <groupId>eu.dnetlib.dhp</groupId>-->
<artifactId>dhp-usage-stats-build</artifactId> <!-- <artifactId>dhp-usage-stats-build</artifactId>-->
<version>${project.version}</version> <!-- <version>${project.version}</version>-->
</dependency> <!-- </dependency>-->
</dependencies> </dependencies>

View File

@ -151,12 +151,17 @@ public class PromoteActionPayloadForGraphTableJob {
SparkSession spark, String path, Class<G> rowClazz) { SparkSession spark, String path, Class<G> rowClazz) {
logger.info("Reading graph table from path: {}", path); logger.info("Reading graph table from path: {}", path);
if (HdfsSupport.exists(path, spark.sparkContext().hadoopConfiguration())) {
return spark return spark
.read() .read()
.textFile(path) .textFile(path)
.map( .map(
(MapFunction<String, G>) value -> OBJECT_MAPPER.readValue(value, rowClazz), (MapFunction<String, G>) value -> OBJECT_MAPPER.readValue(value, rowClazz),
Encoders.bean(rowClazz)); Encoders.bean(rowClazz));
} else {
logger.info("Found empty graph table from path: {}", path);
return spark.emptyDataset(Encoders.bean(rowClazz));
}
} }
private static <A extends Oaf> Dataset<A> readActionPayload( private static <A extends Oaf> Dataset<A> readActionPayload(
@ -223,7 +228,7 @@ public class PromoteActionPayloadForGraphTableJob {
rowClazz, rowClazz,
actionPayloadClazz); actionPayloadClazz);
if (shouldGroupById) { if (Boolean.TRUE.equals(shouldGroupById)) {
return PromoteActionPayloadFunctions return PromoteActionPayloadFunctions
.groupGraphTableByIdAndMerge( .groupGraphTableByIdAndMerge(
joinedAndMerged, rowIdFn, mergeRowsAndGetFn, zeroFn, isNotZeroFn, rowClazz); joinedAndMerged, rowIdFn, mergeRowsAndGetFn, zeroFn, isNotZeroFn, rowClazz);
@ -250,6 +255,8 @@ public class PromoteActionPayloadForGraphTableJob {
return () -> clazz.cast(new eu.dnetlib.dhp.schema.oaf.Relation()); return () -> clazz.cast(new eu.dnetlib.dhp.schema.oaf.Relation());
case "eu.dnetlib.dhp.schema.oaf.Software": case "eu.dnetlib.dhp.schema.oaf.Software":
return () -> clazz.cast(new eu.dnetlib.dhp.schema.oaf.Software()); return () -> clazz.cast(new eu.dnetlib.dhp.schema.oaf.Software());
case "eu.dnetlib.dhp.schema.oaf.Person":
return () -> clazz.cast(new eu.dnetlib.dhp.schema.oaf.Person());
default: default:
throw new RuntimeException("unknown class: " + clazz.getCanonicalName()); throw new RuntimeException("unknown class: " + clazz.getCanonicalName());
} }

View File

@ -50,7 +50,7 @@ public class PromoteActionPayloadFunctions {
PromoteAction.Strategy promoteActionStrategy, PromoteAction.Strategy promoteActionStrategy,
Class<G> rowClazz, Class<G> rowClazz,
Class<A> actionPayloadClazz) { Class<A> actionPayloadClazz) {
if (!isSubClass(rowClazz, actionPayloadClazz)) { if (Boolean.FALSE.equals(isSubClass(rowClazz, actionPayloadClazz))) {
throw new RuntimeException( throw new RuntimeException(
"action payload type must be the same or be a super type of table row type"); "action payload type must be the same or be a super type of table row type");
} }

View File

@ -7,3 +7,4 @@ promote_action_payload_for_project_table classpath eu/dnetlib/dhp/actionmanager/
promote_action_payload_for_publication_table classpath eu/dnetlib/dhp/actionmanager/wf/publication/oozie_app promote_action_payload_for_publication_table classpath eu/dnetlib/dhp/actionmanager/wf/publication/oozie_app
promote_action_payload_for_relation_table classpath eu/dnetlib/dhp/actionmanager/wf/relation/oozie_app promote_action_payload_for_relation_table classpath eu/dnetlib/dhp/actionmanager/wf/relation/oozie_app
promote_action_payload_for_software_table classpath eu/dnetlib/dhp/actionmanager/wf/software/oozie_app promote_action_payload_for_software_table classpath eu/dnetlib/dhp/actionmanager/wf/software/oozie_app
promote_action_payload_for_person_table classpath eu/dnetlib/dhp/actionmanager/wf/person/oozie_app

View File

@ -135,21 +135,10 @@
<arg>--outputPath</arg><arg>${workingDir}/action_payload_by_type</arg> <arg>--outputPath</arg><arg>${workingDir}/action_payload_by_type</arg>
<arg>--isLookupUrl</arg><arg>${isLookupUrl}</arg> <arg>--isLookupUrl</arg><arg>${isLookupUrl}</arg>
</spark> </spark>
<ok to="ForkPromote"/> <ok to="PromoteActionPayloadForDatasetTable"/>
<error to="Kill"/> <error to="Kill"/>
</action> </action>
<fork name="ForkPromote">
<path start="PromoteActionPayloadForDatasetTable"/>
<path start="PromoteActionPayloadForDatasourceTable"/>
<path start="PromoteActionPayloadForOrganizationTable"/>
<path start="PromoteActionPayloadForOtherResearchProductTable"/>
<path start="PromoteActionPayloadForProjectTable"/>
<path start="PromoteActionPayloadForPublicationTable"/>
<path start="PromoteActionPayloadForRelationTable"/>
<path start="PromoteActionPayloadForSoftwareTable"/>
</fork>
<action name="PromoteActionPayloadForDatasetTable"> <action name="PromoteActionPayloadForDatasetTable">
<sub-workflow> <sub-workflow>
<app-path>${wf:appPath()}/promote_action_payload_for_dataset_table</app-path> <app-path>${wf:appPath()}/promote_action_payload_for_dataset_table</app-path>
@ -161,7 +150,7 @@
</property> </property>
</configuration> </configuration>
</sub-workflow> </sub-workflow>
<ok to="JoinPromote"/> <ok to="PromoteActionPayloadForDatasourceTable"/>
<error to="Kill"/> <error to="Kill"/>
</action> </action>
@ -176,7 +165,7 @@
</property> </property>
</configuration> </configuration>
</sub-workflow> </sub-workflow>
<ok to="JoinPromote"/> <ok to="PromoteActionPayloadForOrganizationTable"/>
<error to="Kill"/> <error to="Kill"/>
</action> </action>
@ -191,7 +180,7 @@
</property> </property>
</configuration> </configuration>
</sub-workflow> </sub-workflow>
<ok to="JoinPromote"/> <ok to="PromoteActionPayloadForOtherResearchProductTable"/>
<error to="Kill"/> <error to="Kill"/>
</action> </action>
@ -206,7 +195,7 @@
</property> </property>
</configuration> </configuration>
</sub-workflow> </sub-workflow>
<ok to="JoinPromote"/> <ok to="PromoteActionPayloadForProjectTable"/>
<error to="Kill"/> <error to="Kill"/>
</action> </action>
@ -221,7 +210,7 @@
</property> </property>
</configuration> </configuration>
</sub-workflow> </sub-workflow>
<ok to="JoinPromote"/> <ok to="PromoteActionPayloadForPublicationTable"/>
<error to="Kill"/> <error to="Kill"/>
</action> </action>
@ -236,7 +225,7 @@
</property> </property>
</configuration> </configuration>
</sub-workflow> </sub-workflow>
<ok to="JoinPromote"/> <ok to="PromoteActionPayloadForRelationTable"/>
<error to="Kill"/> <error to="Kill"/>
</action> </action>
@ -251,7 +240,7 @@
</property> </property>
</configuration> </configuration>
</sub-workflow> </sub-workflow>
<ok to="JoinPromote"/> <ok to="PromoteActionPayloadForSoftwareTable"/>
<error to="Kill"/> <error to="Kill"/>
</action> </action>
@ -266,11 +255,9 @@
</property> </property>
</configuration> </configuration>
</sub-workflow> </sub-workflow>
<ok to="JoinPromote"/> <ok to="End"/>
<error to="Kill"/> <error to="Kill"/>
</action> </action>
<join name="JoinPromote" to="End"/>
<end name="End"/> <end name="End"/>
</workflow-app> </workflow-app>

View File

@ -0,0 +1,129 @@
<workflow-app name="promote_action_payload_for_person_table" xmlns="uri:oozie:workflow:0.5">
<parameters>
<property>
<name>activePromotePersonActionPayload</name>
<description>when true will promote actions with eu.dnetlib.dhp.schema.oaf.Person payload</description>
</property>
<property>
<name>inputGraphRootPath</name>
<description>root location of input materialized graph</description>
</property>
<property>
<name>inputActionPayloadRootPath</name>
<description>root location of action payloads to promote</description>
</property>
<property>
<name>outputGraphRootPath</name>
<description>root location for output materialized graph</description>
</property>
<property>
<name>mergeAndGetStrategy</name>
<description>strategy for merging graph table objects with action payload instances, MERGE_FROM_AND_GET or SELECT_NEWER_AND_GET</description>
</property>
<property>
<name>sparkDriverMemory</name>
<description>memory for driver process</description>
</property>
<property>
<name>sparkExecutorMemory</name>
<description>memory for individual executor</description>
</property>
<property>
<name>sparkExecutorCores</name>
<description>number of cores used by single executor</description>
</property>
<property>
<name>oozieActionShareLibForSpark2</name>
<description>oozie action sharelib for spark 2.*</description>
</property>
<property>
<name>spark2ExtraListeners</name>
<value>com.cloudera.spark.lineage.NavigatorAppListener</value>
<description>spark 2.* extra listeners classname</description>
</property>
<property>
<name>spark2SqlQueryExecutionListeners</name>
<value>com.cloudera.spark.lineage.NavigatorQueryListener</value>
<description>spark 2.* sql query execution listeners classname</description>
</property>
<property>
<name>spark2YarnHistoryServerAddress</name>
<description>spark 2.* yarn history server address</description>
</property>
<property>
<name>spark2EventLogDir</name>
<description>spark 2.* event log dir location</description>
</property>
</parameters>
<global>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>oozie.action.sharelib.for.spark</name>
<value>${oozieActionShareLibForSpark2}</value>
</property>
</configuration>
</global>
<start to="DecisionPromotePersonActionPayload"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<decision name="DecisionPromotePersonActionPayload">
<switch>
<case to="PromotePersonActionPayloadForPersonTable">
${(activePromotePersonActionPayload eq "true") and
(fs:exists(concat(concat(concat(concat(wf:conf('nameNode'),'/'),wf:conf('inputActionPayloadRootPath')),'/'),'clazz=eu.dnetlib.dhp.schema.oaf.Person')) eq "true")}
</case>
<default to="SkipPromotePersonActionPayloadForPersonTable"/>
</switch>
</decision>
<action name="PromotePersonActionPayloadForPersonTable">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn-cluster</master>
<mode>cluster</mode>
<name>PromotePersonActionPayloadForPersonTable</name>
<class>eu.dnetlib.dhp.actionmanager.promote.PromoteActionPayloadForGraphTableJob</class>
<jar>dhp-actionmanager-${projectVersion}.jar</jar>
<spark-opts>
--executor-memory=${sparkExecutorMemory}
--executor-cores=${sparkExecutorCores}
--driver-memory=${sparkDriverMemory}
--conf spark.executor.memoryOverhead=${sparkExecutorMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
</spark-opts>
<arg>--inputGraphTablePath</arg><arg>${inputGraphRootPath}/person</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Person</arg>
<arg>--inputActionPayloadPath</arg><arg>${inputActionPayloadRootPath}/clazz=eu.dnetlib.dhp.schema.oaf.Person</arg>
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Person</arg>
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/person</arg>
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<action name="SkipPromotePersonActionPayloadForPersonTable">
<distcp xmlns="uri:oozie:distcp-action:0.2">
<prepare>
<delete path="${outputGraphRootPath}/person"/>
</prepare>
<arg>-pb</arg>
<arg>${inputGraphRootPath}/person</arg>
<arg>${outputGraphRootPath}/person</arg>
</distcp>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

View File

@ -10,7 +10,6 @@ import java.util.List;
import org.apache.commons.io.IOUtils; import org.apache.commons.io.IOUtils;
import org.apache.hadoop.io.Text; import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec; import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.SequenceFileOutputFormat; import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf; import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD; import org.apache.spark.api.java.JavaPairRDD;
@ -29,12 +28,13 @@ import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.common.ModelConstants; import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.*; import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions; import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions;
import eu.dnetlib.dhp.schema.oaf.utils.DoiCleaningRule;
import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory; import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils; import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
import scala.Tuple2; import scala.Tuple2;
/** /**
* Creates action sets for Crossref affiliation relations inferred by BIP! * Creates action sets for Crossref affiliation relations inferred by OpenAIRE
*/ */
public class PrepareAffiliationRelations implements Serializable { public class PrepareAffiliationRelations implements Serializable {
@ -44,6 +44,10 @@ public class PrepareAffiliationRelations implements Serializable {
public static final String BIP_AFFILIATIONS_CLASSID = "result:organization:openaireinference"; public static final String BIP_AFFILIATIONS_CLASSID = "result:organization:openaireinference";
public static final String BIP_AFFILIATIONS_CLASSNAME = "Affiliation relation inferred by OpenAIRE"; public static final String BIP_AFFILIATIONS_CLASSNAME = "Affiliation relation inferred by OpenAIRE";
public static final String BIP_INFERENCE_PROVENANCE = "openaire:affiliation"; public static final String BIP_INFERENCE_PROVENANCE = "openaire:affiliation";
public static final String OPENAIRE_DATASOURCE_ID = "10|infrastruct_::f66f1bd369679b5b077dcdf006089556";
public static final String OPENAIRE_DATASOURCE_NAME = "OpenAIRE";
public static final String DOI_URL_PREFIX = "https://doi.org/";
public static final int DOI_URL_PREFIX_LENGTH = 16;
public static <I extends Result> void main(String[] args) throws Exception { public static <I extends Result> void main(String[] args) throws Exception {
@ -74,6 +78,9 @@ public class PrepareAffiliationRelations implements Serializable {
final String webcrawlInputPath = parser.get("webCrawlInputPath"); final String webcrawlInputPath = parser.get("webCrawlInputPath");
log.info("webcrawlInputPath: {}", webcrawlInputPath); log.info("webcrawlInputPath: {}", webcrawlInputPath);
final String publisherInputPath = parser.get("publisherInputPath");
log.info("publisherInputPath: {}", publisherInputPath);
final String outputPath = parser.get("outputPath"); final String outputPath = parser.get("outputPath");
log.info("outputPath: {}", outputPath); log.info("outputPath: {}", outputPath);
@ -84,46 +91,80 @@ public class PrepareAffiliationRelations implements Serializable {
isSparkSessionManaged, isSparkSessionManaged,
spark -> { spark -> {
Constants.removeOutputDir(spark, outputPath); Constants.removeOutputDir(spark, outputPath);
createActionSet(
spark, crossrefInputPath, pubmedInputPath, openapcInputPath, dataciteInputPath, webcrawlInputPath,
publisherInputPath, outputPath);
});
}
List<KeyValue> collectedFromCrossref = OafMapperUtils private static void createActionSet(SparkSession spark, String crossrefInputPath, String pubmedInputPath,
.listKeyValues(ModelConstants.CROSSREF_ID, "Crossref"); String openapcInputPath, String dataciteInputPath, String webcrawlInputPath, String publisherlInputPath,
JavaPairRDD<Text, Text> crossrefRelations = prepareAffiliationRelations( String outputPath) {
spark, crossrefInputPath, collectedFromCrossref); List<KeyValue> collectedfromOpenAIRE = OafMapperUtils
.listKeyValues(OPENAIRE_DATASOURCE_ID, OPENAIRE_DATASOURCE_NAME);
JavaPairRDD<Text, Text> crossrefRelations = prepareAffiliationRelationsNewModel(
spark, crossrefInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + ":crossref");
List<KeyValue> collectedFromPubmed = OafMapperUtils
.listKeyValues(ModelConstants.PUBMED_CENTRAL_ID, "Pubmed");
JavaPairRDD<Text, Text> pubmedRelations = prepareAffiliationRelations( JavaPairRDD<Text, Text> pubmedRelations = prepareAffiliationRelations(
spark, pubmedInputPath, collectedFromPubmed); spark, pubmedInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + ":pubmed");
List<KeyValue> collectedFromOpenAPC = OafMapperUtils JavaPairRDD<Text, Text> openAPCRelations = prepareAffiliationRelationsNewModel(
.listKeyValues(ModelConstants.OPEN_APC_ID, "OpenAPC"); spark, openapcInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + ":openapc");
JavaPairRDD<Text, Text> openAPCRelations = prepareAffiliationRelations(
spark, openapcInputPath, collectedFromOpenAPC);
List<KeyValue> collectedFromDatacite = OafMapperUtils JavaPairRDD<Text, Text> dataciteRelations = prepareAffiliationRelationsNewModel(
.listKeyValues(ModelConstants.DATACITE_ID, "Datacite"); spark, dataciteInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + ":datacite");
JavaPairRDD<Text, Text> dataciteRelations = prepareAffiliationRelations(
spark, dataciteInputPath, collectedFromDatacite);
List<KeyValue> collectedFromWebCrawl = OafMapperUtils JavaPairRDD<Text, Text> webCrawlRelations = prepareAffiliationRelationsNewModel(
.listKeyValues(Constants.WEB_CRAWL_ID, Constants.WEB_CRAWL_NAME); spark, webcrawlInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + ":rawaff");
JavaPairRDD<Text, Text> webCrawlRelations = prepareAffiliationRelations(
spark, webcrawlInputPath, collectedFromWebCrawl); JavaPairRDD<Text, Text> publisherRelations = prepareAffiliationRelationFromPublisherNewModel(
spark, publisherlInputPath, collectedfromOpenAIRE, BIP_INFERENCE_PROVENANCE + ":webcrawl");
crossrefRelations crossrefRelations
.union(pubmedRelations) .union(pubmedRelations)
.union(openAPCRelations) .union(openAPCRelations)
.union(dataciteRelations) .union(dataciteRelations)
.union(webCrawlRelations) .union(webCrawlRelations)
.union(publisherRelations)
.saveAsHadoopFile( .saveAsHadoopFile(
outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, BZip2Codec.class); outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, BZip2Codec.class);
}
private static JavaPairRDD<Text, Text> prepareAffiliationRelationFromPublisherNewModel(SparkSession spark,
String inputPath,
List<KeyValue> collectedfrom,
String dataprovenance) {
Dataset<Row> df = spark
.read()
.schema(
"`DOI` STRING, `Organizations` ARRAY<STRUCT<`PID`:STRING, `Value`:STRING,`Confidence`:DOUBLE, `Status`:STRING>>")
.json(inputPath)
.where("DOI is not null");
return getTextTextJavaPairRDDNew(
collectedfrom, df.selectExpr("DOI", "Organizations as Matchings"), dataprovenance);
}
private static JavaPairRDD<Text, Text> prepareAffiliationRelationFromPublisher(SparkSession spark, String inputPath,
List<KeyValue> collectedfrom, String dataprovenance) {
Dataset<Row> df = spark
.read()
.schema("`DOI` STRING, `Organizations` ARRAY<STRUCT<`RORid`:STRING,`Confidence`:DOUBLE>>")
.json(inputPath)
.where("DOI is not null");
return getTextTextJavaPairRDD(
collectedfrom, df.selectExpr("DOI", "Organizations as Matchings"), dataprovenance);
});
} }
private static <I extends Result> JavaPairRDD<Text, Text> prepareAffiliationRelations(SparkSession spark, private static <I extends Result> JavaPairRDD<Text, Text> prepareAffiliationRelations(SparkSession spark,
String inputPath, String inputPath,
List<KeyValue> collectedfrom) { List<KeyValue> collectedfrom, String dataprovenance) {
// load and parse affiliation relations from HDFS // load and parse affiliation relations from HDFS
Dataset<Row> df = spark Dataset<Row> df = spark
@ -132,6 +173,25 @@ public class PrepareAffiliationRelations implements Serializable {
.json(inputPath) .json(inputPath)
.where("DOI is not null"); .where("DOI is not null");
return getTextTextJavaPairRDD(collectedfrom, df, dataprovenance);
}
private static <I extends Result> JavaPairRDD<Text, Text> prepareAffiliationRelationsNewModel(SparkSession spark,
String inputPath,
List<KeyValue> collectedfrom, String dataprovenance) {
// load and parse affiliation relations from HDFS
Dataset<Row> df = spark
.read()
.schema(
"`DOI` STRING, `Matchings` ARRAY<STRUCT<`PID`:STRING, `Value`:STRING,`Confidence`:DOUBLE, `Status`:STRING>>")
.json(inputPath)
.where("DOI is not null");
return getTextTextJavaPairRDDNew(collectedfrom, df, dataprovenance);
}
private static JavaPairRDD<Text, Text> getTextTextJavaPairRDD(List<KeyValue> collectedfrom, Dataset<Row> df,
String dataprovenance) {
// unroll nested arrays // unroll nested arrays
df = df df = df
.withColumn("matching", functions.explode(new Column("Matchings"))) .withColumn("matching", functions.explode(new Column("Matchings")))
@ -147,7 +207,7 @@ public class PrepareAffiliationRelations implements Serializable {
// DOI to OpenAIRE id // DOI to OpenAIRE id
final String paperId = ID_PREFIX final String paperId = ID_PREFIX
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", row.getAs("doi"))); + IdentifierFactory.md5(DoiCleaningRule.clean(removePrefix(row.getAs("doi"))));
// ROR id to OpenAIRE id // ROR id to OpenAIRE id
final String affId = GenerateRorActionSetJob.calculateOpenaireId(row.getAs("rorid")); final String affId = GenerateRorActionSetJob.calculateOpenaireId(row.getAs("rorid"));
@ -163,7 +223,7 @@ public class PrepareAffiliationRelations implements Serializable {
DataInfo dataInfo = OafMapperUtils DataInfo dataInfo = OafMapperUtils
.dataInfo( .dataInfo(
false, false,
BIP_INFERENCE_PROVENANCE, dataprovenance,
true, true,
false, false,
qualifier, qualifier,
@ -179,6 +239,70 @@ public class PrepareAffiliationRelations implements Serializable {
new Text(OBJECT_MAPPER.writeValueAsString(aa)))); new Text(OBJECT_MAPPER.writeValueAsString(aa))));
} }
private static JavaPairRDD<Text, Text> getTextTextJavaPairRDDNew(List<KeyValue> collectedfrom, Dataset<Row> df,
String dataprovenance) {
// unroll nested arrays
df = df
.withColumn("matching", functions.explode(new Column("Matchings")))
.select(
new Column("DOI").as("doi"),
new Column("matching.PID").as("pidtype"),
new Column("matching.Value").as("pidvalue"),
new Column("matching.Confidence").as("confidence"),
new Column("matching.Status").as("status"))
.where("status = 'active'");
// prepare action sets for affiliation relations
return df
.toJavaRDD()
.flatMap((FlatMapFunction<Row, Relation>) row -> {
// DOI to OpenAIRE id
final String paperId = ID_PREFIX
+ IdentifierFactory.md5(DoiCleaningRule.clean(removePrefix(row.getAs("doi"))));
// Organization to OpenAIRE identifier
String affId = null;
if (row.getAs("pidtype").equals("ROR"))
// ROR id to OpenIARE id
affId = GenerateRorActionSetJob.calculateOpenaireId(row.getAs("pidvalue"));
else
// getting the OpenOrgs identifier for the organization
affId = row.getAs("pidvalue");
Qualifier qualifier = OafMapperUtils
.qualifier(
BIP_AFFILIATIONS_CLASSID,
BIP_AFFILIATIONS_CLASSNAME,
ModelConstants.DNET_PROVENANCE_ACTIONS,
ModelConstants.DNET_PROVENANCE_ACTIONS);
// format data info; setting `confidence` into relation's `trust`
DataInfo dataInfo = OafMapperUtils
.dataInfo(
false,
dataprovenance,
true,
false,
qualifier,
Double.toString(row.getAs("confidence")));
// return bi-directional relations
return getAffiliationRelationPair(paperId, affId, collectedfrom, dataInfo).iterator();
})
.map(p -> new AtomicAction(Relation.class, p))
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))));
}
private static String removePrefix(String doi) {
if (doi.startsWith(DOI_URL_PREFIX))
return doi.substring(DOI_URL_PREFIX_LENGTH);
return doi;
}
private static List<Relation> getAffiliationRelationPair(String paperId, String affId, List<KeyValue> collectedfrom, private static List<Relation> getAffiliationRelationPair(String paperId, String affId, List<KeyValue> collectedfrom,
DataInfo dataInfo) { DataInfo dataInfo) {
return Arrays return Arrays

View File

@ -46,6 +46,9 @@ public class GetOpenCitationsRefs implements Serializable {
final String outputPath = parser.get("outputPath"); final String outputPath = parser.get("outputPath");
log.info("outputPath {}", outputPath); log.info("outputPath {}", outputPath);
final String backupPath = parser.get("backupPath");
log.info("backupPath {}", backupPath);
Configuration conf = new Configuration(); Configuration conf = new Configuration();
conf.set("fs.defaultFS", hdfsNameNode); conf.set("fs.defaultFS", hdfsNameNode);
@ -53,11 +56,11 @@ public class GetOpenCitationsRefs implements Serializable {
GetOpenCitationsRefs ocr = new GetOpenCitationsRefs(); GetOpenCitationsRefs ocr = new GetOpenCitationsRefs();
ocr.doExtract(inputPath, outputPath, fileSystem); ocr.doExtract(inputPath, outputPath, backupPath, fileSystem);
} }
private void doExtract(String inputPath, String outputPath, FileSystem fileSystem) private void doExtract(String inputPath, String outputPath, String backupPath, FileSystem fileSystem)
throws IOException { throws IOException {
RemoteIterator<LocatedFileStatus> fileStatusListIterator = fileSystem RemoteIterator<LocatedFileStatus> fileStatusListIterator = fileSystem
@ -89,6 +92,7 @@ public class GetOpenCitationsRefs implements Serializable {
} }
} }
fileSystem.rename(fileStatus.getPath(), new Path(backupPath));
} }
} }

View File

@ -107,7 +107,8 @@ public class ReadCOCI implements Serializable {
.mode(SaveMode.Append) .mode(SaveMode.Append)
.option("compression", "gzip") .option("compression", "gzip")
.json(outputPath); .json(outputPath);
fileSystem.rename(fileStatus.getPath(), new Path("/tmp/miriam/OC/DONE"));
fileSystem.delete(fileStatus.getPath());
} }
} }

View File

@ -2,25 +2,34 @@
package eu.dnetlib.dhp.actionmanager.personentity; package eu.dnetlib.dhp.actionmanager.personentity;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession; import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import static org.apache.spark.sql.functions.*;
import java.io.BufferedWriter;
import java.io.IOException; import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Serializable; import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.*; import java.util.*;
import java.util.stream.Collectors; import java.util.stream.Collectors;
import org.apache.commons.cli.ParseException; import org.apache.commons.cli.ParseException;
import org.apache.commons.io.IOUtils; import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text; import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec; import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapred.SequenceFileOutputFormat; import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf; import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.*; import org.apache.spark.api.java.function.*;
import org.apache.spark.sql.*; import org.apache.spark.sql.*;
import org.apache.spark.sql.Dataset;
import org.jetbrains.annotations.NotNull; import org.jetbrains.annotations.NotNull;
import org.slf4j.Logger; import org.slf4j.Logger;
import org.slf4j.LoggerFactory; import org.slf4j.LoggerFactory;
import org.spark_project.jetty.util.StringUtil;
import com.fasterxml.jackson.databind.ObjectMapper; import com.fasterxml.jackson.databind.ObjectMapper;
@ -28,13 +37,14 @@ import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.collection.orcid.model.Author; import eu.dnetlib.dhp.collection.orcid.model.Author;
import eu.dnetlib.dhp.collection.orcid.model.Employment; import eu.dnetlib.dhp.collection.orcid.model.Employment;
import eu.dnetlib.dhp.collection.orcid.model.Work; import eu.dnetlib.dhp.collection.orcid.model.Work;
import eu.dnetlib.dhp.common.DbClient;
import eu.dnetlib.dhp.common.HdfsSupport; import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.common.person.CoAuthorshipIterator;
import eu.dnetlib.dhp.common.person.Coauthors;
import eu.dnetlib.dhp.schema.action.AtomicAction; import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.common.ModelConstants; import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.common.ModelSupport; import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.KeyValue; import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.schema.oaf.Person;
import eu.dnetlib.dhp.schema.oaf.Relation;
import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory; import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils; import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
import eu.dnetlib.dhp.schema.oaf.utils.PidCleaner; import eu.dnetlib.dhp.schema.oaf.utils.PidCleaner;
@ -44,7 +54,7 @@ import scala.Tuple2;
public class ExtractPerson implements Serializable { public class ExtractPerson implements Serializable {
private static final Logger log = LoggerFactory.getLogger(ExtractPerson.class); private static final Logger log = LoggerFactory.getLogger(ExtractPerson.class);
private static final String QUERY = "SELECT * FROM project_person WHERE pid_type = 'ORCID'";
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper(); private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private static final String OPENAIRE_PREFIX = "openaire____"; private static final String OPENAIRE_PREFIX = "openaire____";
private static final String SEPARATOR = "::"; private static final String SEPARATOR = "::";
@ -58,9 +68,48 @@ public class ExtractPerson implements Serializable {
private static final String PMCID_PREFIX = "50|pmcid_______::"; private static final String PMCID_PREFIX = "50|pmcid_______::";
private static final String ROR_PREFIX = "20|ror_________::"; private static final String ROR_PREFIX = "20|ror_________::";
private static final String PERSON_PREFIX = ModelSupport.getIdPrefix(Person.class) + "|orcid_______"; private static final String PERSON_PREFIX = ModelSupport.getIdPrefix(Person.class)
+ IdentifierFactory.ID_PREFIX_SEPARATOR + ModelConstants.ORCID + "_______";
private static final String PROJECT_ID_PREFIX = ModelSupport.getIdPrefix(Project.class)
+ IdentifierFactory.ID_PREFIX_SEPARATOR;
public static final String ORCID_AUTHORS_CLASSID = "sysimport:crosswalk:orcid"; public static final String ORCID_AUTHORS_CLASSID = "sysimport:crosswalk:orcid";
public static final String ORCID_AUTHORS_CLASSNAME = "Imported from ORCID"; public static final String ORCID_AUTHORS_CLASSNAME = "Imported from ORCID";
public static final String FUNDER_AUTHORS_CLASSID = "sysimport:crosswalk:funderdatabase";
public static final String FUNDER_AUTHORS_CLASSNAME = "Imported from Funder Database";
public static final String OPENAIRE_DATASOURCE_ID = "10|infrastruct_::f66f1bd369679b5b077dcdf006089556";
public static final String OPENAIRE_DATASOURCE_NAME = "OpenAIRE";
public static List<KeyValue> collectedfromOpenAIRE = OafMapperUtils
.listKeyValues(OPENAIRE_DATASOURCE_ID, OPENAIRE_DATASOURCE_NAME);
public static final DataInfo ORCIDDATAINFO = OafMapperUtils
.dataInfo(
false,
null,
false,
false,
OafMapperUtils
.qualifier(
ORCID_AUTHORS_CLASSID,
ORCID_AUTHORS_CLASSNAME,
ModelConstants.DNET_PROVENANCE_ACTIONS,
ModelConstants.DNET_PROVENANCE_ACTIONS),
"0.91");
public static final DataInfo FUNDERDATAINFO = OafMapperUtils
.dataInfo(
false,
null,
false,
false,
OafMapperUtils
.qualifier(
FUNDER_AUTHORS_CLASSID,
FUNDER_AUTHORS_CLASSNAME,
ModelConstants.DNET_PROVENANCE_ACTIONS,
ModelConstants.DNET_PROVENANCE_ACTIONS),
"0.91");
public static void main(final String[] args) throws IOException, ParseException { public static void main(final String[] args) throws IOException, ParseException {
@ -91,19 +140,130 @@ public class ExtractPerson implements Serializable {
final String workingDir = parser.get("workingDir"); final String workingDir = parser.get("workingDir");
log.info("workingDir {}", workingDir); log.info("workingDir {}", workingDir);
final String dbUrl = parser.get("postgresUrl");
final String dbUser = parser.get("postgresUser");
final String dbPassword = parser.get("postgresPassword");
final String hdfsNameNode = parser.get("hdfsNameNode");
SparkConf conf = new SparkConf(); SparkConf conf = new SparkConf();
runWithSparkSession( runWithSparkSession(
conf, conf,
isSparkSessionManaged, isSparkSessionManaged,
spark -> { spark -> {
HdfsSupport.remove(outputPath, spark.sparkContext().hadoopConfiguration()); HdfsSupport.remove(outputPath, spark.sparkContext().hadoopConfiguration());
createActionSet(spark, inputPath, outputPath, workingDir); extractInfoForActionSetFromORCID(spark, inputPath, workingDir);
extractInfoForActionSetFromProjects(
spark, inputPath, workingDir, dbUrl, dbUser, dbPassword, workingDir + "/project", hdfsNameNode);
createActionSet(spark, outputPath, workingDir);
}); });
} }
private static void createActionSet(SparkSession spark, String inputPath, String outputPath, String workingDir) { private static void extractInfoForActionSetFromProjects(SparkSession spark, String inputPath, String workingDir,
String dbUrl, String dbUser, String dbPassword, String hdfsPath, String hdfsNameNode) throws IOException {
Configuration conf = new Configuration();
conf.set("fs.defaultFS", hdfsNameNode);
FileSystem fileSystem = FileSystem.get(conf);
Path hdfsWritePath = new Path(hdfsPath);
FSDataOutputStream fos = fileSystem.create(hdfsWritePath);
try (DbClient dbClient = new DbClient(dbUrl, dbUser, dbPassword)) {
try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(fos, StandardCharsets.UTF_8))) {
dbClient.processResults(QUERY, rs -> writeRelation(getRelationWithProject(rs), writer));
}
} catch (IOException e) {
throw new RuntimeException(e);
}
}
public static Relation getRelationWithProject(ResultSet rs) {
try {
return getProjectRelation(
rs.getString("project"), rs.getString("pid"),
rs.getString("role"));
} catch (final SQLException e) {
throw new RuntimeException(e);
}
}
private static Relation getProjectRelation(String project, String orcid, String role) {
String source = PERSON_PREFIX + "::" + IdentifierFactory.md5(orcid);
String target = PROJECT_ID_PREFIX + StringUtils.substringBefore(project, "::") + "::"
+ IdentifierFactory.md5(StringUtils.substringAfter(project, "::"));
List<KeyValue> properties = new ArrayList<>();
Relation relation = OafMapperUtils
.getRelation(
source, target, ModelConstants.PROJECT_PERSON_RELTYPE, ModelConstants.PROJECT_PERSON_SUBRELTYPE,
ModelConstants.PROJECT_PERSON_PARTICIPATES,
collectedfromOpenAIRE,
FUNDERDATAINFO,
null);
relation.setValidated(true);
if (StringUtils.isNotBlank(role)) {
KeyValue kv = new KeyValue();
kv.setKey("role");
kv.setValue(role);
properties.add(kv);
}
if (!properties.isEmpty())
relation.setProperties(properties);
return relation;
}
protected static void writeRelation(final Relation relation, BufferedWriter writer) {
try {
writer.write(OBJECT_MAPPER.writeValueAsString(relation));
writer.newLine();
} catch (final IOException e) {
throw new RuntimeException(e);
}
}
private static void createActionSet(SparkSession spark, String outputPath, String workingDir) {
Dataset<Person> people;
people = spark
.read()
.textFile(workingDir + "/people")
.map(
(MapFunction<String, Person>) value -> OBJECT_MAPPER
.readValue(value, Person.class),
Encoders.bean(Person.class));
people
.toJavaRDD()
.map(p -> new AtomicAction(p.getClass(), p))
.union(
getRelations(spark, workingDir + "/authorship").toJavaRDD().map(r -> new AtomicAction(r.getClass(), r)))
.union(
getRelations(spark, workingDir + "/coauthorship")
.toJavaRDD()
.map(r -> new AtomicAction(r.getClass(), r)))
.union(
getRelations(spark, workingDir + "/affiliation")
.toJavaRDD()
.map(r -> new AtomicAction(r.getClass(), r)))
.union(
getRelations(spark, workingDir + "/project")
.toJavaRDD()
.map(r -> new AtomicAction(r.getClass(), r)))
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))))
.saveAsHadoopFile(
outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, BZip2Codec.class);
}
private static void extractInfoForActionSetFromORCID(SparkSession spark, String inputPath, String workingDir) {
Dataset<Author> authors = spark Dataset<Author> authors = spark
.read() .read()
.parquet(inputPath + "Authors") .parquet(inputPath + "Authors")
@ -129,18 +289,13 @@ public class ExtractPerson implements Serializable {
.parquet(inputPath + "Employments") .parquet(inputPath + "Employments")
.as(Encoders.bean(Employment.class)); .as(Encoders.bean(Employment.class));
Dataset<Author> peopleToMap = authors
.joinWith(works, authors.col("orcid").equalTo(works.col("orcid")))
.map((MapFunction<Tuple2<Author, Work>, Author>) t2 -> t2._1(), Encoders.bean(Author.class))
.groupByKey((MapFunction<Author, String>) a -> a.getOrcid(), Encoders.STRING())
.mapGroups((MapGroupsFunction<String, Author, Author>) (k, it) -> it.next(), Encoders.bean(Author.class));
Dataset<Employment> employment = employmentDataset Dataset<Employment> employment = employmentDataset
.joinWith(peopleToMap, employmentDataset.col("orcid").equalTo(peopleToMap.col("orcid"))) .joinWith(authors, employmentDataset.col("orcid").equalTo(authors.col("orcid")))
.map((MapFunction<Tuple2<Employment, Author>, Employment>) t2 -> t2._1(), Encoders.bean(Employment.class)); .map((MapFunction<Tuple2<Employment, Author>, Employment>) t2 -> t2._1(), Encoders.bean(Employment.class));
Dataset<Person> people; // Mapping all the orcid profiles even if the profile has no visible works
peopleToMap.map((MapFunction<Author, Person>) op -> {
authors.map((MapFunction<Author, Person>) op -> {
Person person = new Person(); Person person = new Person();
person.setId(DHPUtils.generateIdentifier(op.getOrcid(), PERSON_PREFIX)); person.setId(DHPUtils.generateIdentifier(op.getOrcid(), PERSON_PREFIX));
person person
@ -190,9 +345,23 @@ public class ExtractPerson implements Serializable {
OafMapperUtils OafMapperUtils
.structuredProperty( .structuredProperty(
op.getOrcid(), ModelConstants.ORCID, ModelConstants.ORCID_CLASSNAME, op.getOrcid(), ModelConstants.ORCID, ModelConstants.ORCID_CLASSNAME,
ModelConstants.DNET_PID_TYPES, ModelConstants.DNET_PID_TYPES, null)); ModelConstants.DNET_PID_TYPES, ModelConstants.DNET_PID_TYPES,
OafMapperUtils
.dataInfo(
false,
null,
false,
false,
OafMapperUtils
.qualifier(
ModelConstants.SYSIMPORT_CROSSWALK_ENTITYREGISTRY,
ModelConstants.SYSIMPORT_CROSSWALK_ENTITYREGISTRY,
ModelConstants.DNET_PID_TYPES,
ModelConstants.DNET_PID_TYPES),
"0.91")));
person.setDateofcollection(op.getLastModifiedDate()); person.setDateofcollection(op.getLastModifiedDate());
person.setOriginalId(Arrays.asList(op.getOrcid())); person.setOriginalId(Arrays.asList(op.getOrcid()));
person.setDataInfo(ORCIDDATAINFO);
return person; return person;
}, Encoders.bean(Person.class)) }, Encoders.bean(Person.class))
.write() .write()
@ -246,34 +415,6 @@ public class ExtractPerson implements Serializable {
.option("compression", "gzip") .option("compression", "gzip")
.mode(SaveMode.Overwrite) .mode(SaveMode.Overwrite)
.json(workingDir + "/affiliation"); .json(workingDir + "/affiliation");
people = spark
.read()
.textFile(workingDir + "/people")
.map(
(MapFunction<String, Person>) value -> OBJECT_MAPPER
.readValue(value, Person.class),
Encoders.bean(Person.class));
people.show(false);
people
.toJavaRDD()
.map(p -> new AtomicAction(p.getClass(), p))
.union(
getRelations(spark, workingDir + "/authorship").toJavaRDD().map(r -> new AtomicAction(r.getClass(), r)))
.union(
getRelations(spark, workingDir + "/coauthorship")
.toJavaRDD()
.map(r -> new AtomicAction(r.getClass(), r)))
.union(
getRelations(spark, workingDir + "/affiliation")
.toJavaRDD()
.map(r -> new AtomicAction(r.getClass(), r)))
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))))
.saveAsHadoopFile(
outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, BZip2Codec.class);
} }
private static Dataset<Relation> getRelations(SparkSession spark, String path) { private static Dataset<Relation> getRelations(SparkSession spark, String path) {
@ -297,7 +438,7 @@ public class ExtractPerson implements Serializable {
} }
private static Relation getAffiliationRelation(Employment row) { private static Relation getAffiliationRelation(Employment row) {
String source = PERSON_PREFIX + IdentifierFactory.md5(row.getOrcid()); String source = PERSON_PREFIX + "::" + IdentifierFactory.md5(row.getOrcid());
String target = ROR_PREFIX String target = ROR_PREFIX
+ IdentifierFactory.md5(PidCleaner.normalizePidValue("ROR", row.getAffiliationId().getValue())); + IdentifierFactory.md5(PidCleaner.normalizePidValue("ROR", row.getAffiliationId().getValue()));
List<KeyValue> properties = new ArrayList<>(); List<KeyValue> properties = new ArrayList<>();
@ -307,23 +448,17 @@ public class ExtractPerson implements Serializable {
source, target, ModelConstants.ORG_PERSON_RELTYPE, ModelConstants.ORG_PERSON_SUBRELTYPE, source, target, ModelConstants.ORG_PERSON_RELTYPE, ModelConstants.ORG_PERSON_SUBRELTYPE,
ModelConstants.ORG_PERSON_PARTICIPATES, ModelConstants.ORG_PERSON_PARTICIPATES,
Arrays.asList(OafMapperUtils.keyValue(orcidKey, ModelConstants.ORCID_DS)), Arrays.asList(OafMapperUtils.keyValue(orcidKey, ModelConstants.ORCID_DS)),
OafMapperUtils ORCIDDATAINFO,
.dataInfo(
false, null, false, false,
OafMapperUtils
.qualifier(
ORCID_AUTHORS_CLASSID, ORCID_AUTHORS_CLASSNAME, ModelConstants.DNET_PROVENANCE_ACTIONS,
ModelConstants.DNET_PROVENANCE_ACTIONS),
"0.91"),
null); null);
relation.setValidated(true);
if (Optional.ofNullable(row.getStartDate()).isPresent() && StringUtil.isNotBlank(row.getStartDate())) { if (Optional.ofNullable(row.getStartDate()).isPresent() && StringUtils.isNotBlank(row.getStartDate())) {
KeyValue kv = new KeyValue(); KeyValue kv = new KeyValue();
kv.setKey("startDate"); kv.setKey("startDate");
kv.setValue(row.getStartDate()); kv.setValue(row.getStartDate());
properties.add(kv); properties.add(kv);
} }
if (Optional.ofNullable(row.getEndDate()).isPresent() && StringUtil.isNotBlank(row.getEndDate())) { if (Optional.ofNullable(row.getEndDate()).isPresent() && StringUtils.isNotBlank(row.getEndDate())) {
KeyValue kv = new KeyValue(); KeyValue kv = new KeyValue();
kv.setKey("endDate"); kv.setKey("endDate");
kv.setValue(row.getEndDate()); kv.setValue(row.getEndDate());
@ -336,45 +471,6 @@ public class ExtractPerson implements Serializable {
} }
private static Collection<? extends Relation> getCoAuthorshipRelations(String orcid1, String orcid2) {
String source = PERSON_PREFIX + "::" + IdentifierFactory.md5(orcid1);
String target = PERSON_PREFIX + "::" + IdentifierFactory.md5(orcid2);
return Arrays
.asList(
OafMapperUtils
.getRelation(
source, target, ModelConstants.PERSON_PERSON_RELTYPE,
ModelConstants.PERSON_PERSON_SUBRELTYPE,
ModelConstants.PERSON_PERSON_HASCOAUTHORED,
Arrays.asList(OafMapperUtils.keyValue(orcidKey, ModelConstants.ORCID_DS)),
OafMapperUtils
.dataInfo(
false, null, false, false,
OafMapperUtils
.qualifier(
ORCID_AUTHORS_CLASSID, ORCID_AUTHORS_CLASSNAME,
ModelConstants.DNET_PROVENANCE_ACTIONS, ModelConstants.DNET_PROVENANCE_ACTIONS),
"0.91"),
null),
OafMapperUtils
.getRelation(
target, source, ModelConstants.PERSON_PERSON_RELTYPE,
ModelConstants.PERSON_PERSON_SUBRELTYPE,
ModelConstants.PERSON_PERSON_HASCOAUTHORED,
Arrays.asList(OafMapperUtils.keyValue(orcidKey, ModelConstants.ORCID_DS)),
OafMapperUtils
.dataInfo(
false, null, false, false,
OafMapperUtils
.qualifier(
ORCID_AUTHORS_CLASSID, ORCID_AUTHORS_CLASSNAME,
ModelConstants.DNET_PROVENANCE_ACTIONS, ModelConstants.DNET_PROVENANCE_ACTIONS),
"0.91"),
null));
}
private static @NotNull Iterator<Relation> getAuthorshipRelationIterator(Work w) { private static @NotNull Iterator<Relation> getAuthorshipRelationIterator(Work w) {
if (Optional.ofNullable(w.getPids()).isPresent()) if (Optional.ofNullable(w.getPids()).isPresent())
@ -417,21 +513,15 @@ public class ExtractPerson implements Serializable {
default: default:
return null; return null;
} }
Relation relation = OafMapperUtils
return OafMapperUtils
.getRelation( .getRelation(
source, target, ModelConstants.RESULT_PERSON_RELTYPE, source, target, ModelConstants.RESULT_PERSON_RELTYPE,
ModelConstants.RESULT_PERSON_SUBRELTYPE, ModelConstants.RESULT_PERSON_SUBRELTYPE,
ModelConstants.RESULT_PERSON_HASAUTHORED, ModelConstants.RESULT_PERSON_HASAUTHORED,
Arrays.asList(OafMapperUtils.keyValue(orcidKey, ModelConstants.ORCID_DS)), Arrays.asList(OafMapperUtils.keyValue(orcidKey, ModelConstants.ORCID_DS)),
OafMapperUtils ORCIDDATAINFO,
.dataInfo(
false, null, false, false,
OafMapperUtils
.qualifier(
ORCID_AUTHORS_CLASSID, ORCID_AUTHORS_CLASSNAME, ModelConstants.DNET_PROVENANCE_ACTIONS,
ModelConstants.DNET_PROVENANCE_ACTIONS),
"0.91"),
null); null);
relation.setValidated(true);
return relation;
} }
} }

View File

@ -7,6 +7,7 @@ import java.io.IOException;
import java.util.Optional; import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger; import java.util.concurrent.atomic.AtomicInteger;
import eu.dnetlib.dhp.collection.plugin.zenodo.CollectZenodoDumpCollectorPlugin;
import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path; import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.IntWritable;
@ -69,9 +70,12 @@ public class CollectorWorker extends ReportingJob {
scheduleReport(counter); scheduleReport(counter);
try (SequenceFile.Writer writer = SequenceFile try (SequenceFile.Writer writer = SequenceFile
.createWriter(this.fileSystem.getConf(), SequenceFile.Writer.file(new Path(outputPath)), SequenceFile.Writer .createWriter(
.keyClass(IntWritable.class), SequenceFile.Writer this.fileSystem.getConf(), SequenceFile.Writer.file(new Path(outputPath)), SequenceFile.Writer
.valueClass(Text.class), SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new DeflateCodec()))) { .keyClass(IntWritable.class),
SequenceFile.Writer
.valueClass(Text.class),
SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new DeflateCodec()))) {
final IntWritable key = new IntWritable(counter.get()); final IntWritable key = new IntWritable(counter.get());
final Text value = new Text(); final Text value = new Text();
plugin plugin
@ -126,6 +130,8 @@ public class CollectorWorker extends ReportingJob {
return new Gtr2PublicationsCollectorPlugin(this.clientParams); return new Gtr2PublicationsCollectorPlugin(this.clientParams);
case osfPreprints: case osfPreprints:
return new OsfPreprintsCollectorPlugin(this.clientParams); return new OsfPreprintsCollectorPlugin(this.clientParams);
case zenodoDump:
return new CollectZenodoDumpCollectorPlugin();
case other: case other:
final CollectorPlugin.NAME.OTHER_NAME plugin = Optional final CollectorPlugin.NAME.OTHER_NAME plugin = Optional
.ofNullable(this.api.getParams().get("other_plugin_type")) .ofNullable(this.api.getParams().get("other_plugin_type"))

View File

@ -11,7 +11,7 @@ public interface CollectorPlugin {
enum NAME { enum NAME {
oai, other, rest_json2xml, file, fileGzip, baseDump, gtr2Publications, osfPreprints; oai, other, rest_json2xml, file, fileGzip, baseDump, gtr2Publications, osfPreprints, zenodoDump;
public enum OTHER_NAME { public enum OTHER_NAME {
mdstore_mongodb_dump, mdstore_mongodb mdstore_mongodb_dump, mdstore_mongodb

View File

@ -1,6 +1,9 @@
package eu.dnetlib.dhp.collection.plugin.gtr2; package eu.dnetlib.dhp.collection.plugin.gtr2;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList; import java.util.ArrayList;
import java.util.HashMap; import java.util.HashMap;
import java.util.Iterator; import java.util.Iterator;
@ -16,9 +19,6 @@ import org.dom4j.Document;
import org.dom4j.DocumentException; import org.dom4j.DocumentException;
import org.dom4j.DocumentHelper; import org.dom4j.DocumentHelper;
import org.dom4j.Element; import org.dom4j.Element;
import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;
import org.slf4j.Logger; import org.slf4j.Logger;
import org.slf4j.LoggerFactory; import org.slf4j.LoggerFactory;
@ -33,7 +33,7 @@ public class Gtr2PublicationsIterator implements Iterator<String> {
private static final Logger log = LoggerFactory.getLogger(Gtr2PublicationsIterator.class); private static final Logger log = LoggerFactory.getLogger(Gtr2PublicationsIterator.class);
private final HttpConnector2 connector; private final HttpConnector2 connector;
private static final DateTimeFormatter simpleDateTimeFormatter = DateTimeFormat.forPattern("yyyy-MM-dd"); private static final DateTimeFormatter simpleDateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");
private static final int MAX_ATTEMPTS = 10; private static final int MAX_ATTEMPTS = 10;
@ -41,7 +41,7 @@ public class Gtr2PublicationsIterator implements Iterator<String> {
private int currPage; private int currPage;
private int endPage; private int endPage;
private boolean incremental = false; private boolean incremental = false;
private DateTime fromDate; private LocalDate fromDate;
private final Map<String, String> cache = new HashMap<>(); private final Map<String, String> cache = new HashMap<>();
@ -188,28 +188,28 @@ public class Gtr2PublicationsIterator implements Iterator<String> {
private Document loadURL(final String cleanUrl, final int attempt) { private Document loadURL(final String cleanUrl, final int attempt) {
try { try {
log.debug(" * Downloading Url: " + cleanUrl); log.debug(" * Downloading Url: {}", cleanUrl);
final byte[] bytes = this.connector.getInputSource(cleanUrl).getBytes("UTF-8"); final byte[] bytes = this.connector.getInputSource(cleanUrl).getBytes(StandardCharsets.UTF_8);
return DocumentHelper.parseText(new String(bytes)); return DocumentHelper.parseText(new String(bytes));
} catch (final Throwable e) { } catch (final Throwable e) {
log.error("Error dowloading url: " + cleanUrl + ", attempt = " + attempt, e); log.error("Error dowloading url: {}, attempt = {}", cleanUrl, attempt, e);
if (attempt >= MAX_ATTEMPTS) { if (attempt >= MAX_ATTEMPTS) {
throw new RuntimeException("Error dowloading url: " + cleanUrl, e); throw new RuntimeException("Error downloading url: " + cleanUrl, e);
} }
try { try {
Thread.sleep(60000); // I wait for a minute Thread.sleep(60000); // I wait for a minute
} catch (final InterruptedException e1) { } catch (final InterruptedException e1) {
throw new RuntimeException("Error dowloading url: " + cleanUrl, e); throw new RuntimeException("Error downloading url: " + cleanUrl, e);
} }
return loadURL(cleanUrl, attempt + 1); return loadURL(cleanUrl, attempt + 1);
} }
} }
private DateTime parseDate(final String s) { private LocalDate parseDate(final String s) {
return DateTime.parse(s.contains("T") ? s.substring(0, s.indexOf("T")) : s, simpleDateTimeFormatter); return LocalDate.parse(s.contains("T") ? s.substring(0, s.indexOf("T")) : s, simpleDateTimeFormatter);
} }
private boolean isAfter(final String d, final DateTime fromDate) { private boolean isAfter(final String d, final LocalDate fromDate) {
return StringUtils.isNotBlank(d) && parseDate(d).isAfter(fromDate); return StringUtils.isNotBlank(d) && parseDate(d).isAfter(fromDate);
} }
} }

View File

@ -36,7 +36,9 @@ public class OsfPreprintsCollectorPlugin implements CollectorPlugin {
.map(s -> NumberUtils.toInt(s, PAGE_SIZE_VALUE_DEFAULT)) .map(s -> NumberUtils.toInt(s, PAGE_SIZE_VALUE_DEFAULT))
.orElse(PAGE_SIZE_VALUE_DEFAULT); .orElse(PAGE_SIZE_VALUE_DEFAULT);
if (StringUtils.isBlank(baseUrl)) { throw new CollectorException("Param 'baseUrl' is null or empty"); } if (StringUtils.isBlank(baseUrl)) {
throw new CollectorException("Param 'baseUrl' is null or empty");
}
final OsfPreprintsIterator it = new OsfPreprintsIterator(baseUrl, pageSize, getClientParams()); final OsfPreprintsIterator it = new OsfPreprintsIterator(baseUrl, pageSize, getClientParams());

View File

@ -54,7 +54,8 @@ public class OsfPreprintsIterator implements Iterator<String> {
@Override @Override
public boolean hasNext() { public boolean hasNext() {
synchronized (this.recordQueue) { synchronized (this.recordQueue) {
while (this.recordQueue.isEmpty() && StringUtils.isNotBlank(this.currentUrl) && this.currentUrl.startsWith("http")) { while (this.recordQueue.isEmpty() && StringUtils.isNotBlank(this.currentUrl)
&& this.currentUrl.startsWith("http")) {
try { try {
this.currentUrl = downloadPage(this.currentUrl); this.currentUrl = downloadPage(this.currentUrl);
} catch (final CollectorException e) { } catch (final CollectorException e) {
@ -63,7 +64,9 @@ public class OsfPreprintsIterator implements Iterator<String> {
} }
} }
if (!this.recordQueue.isEmpty()) { return true; } if (!this.recordQueue.isEmpty()) {
return true;
}
return false; return false;
} }
@ -112,7 +115,9 @@ public class OsfPreprintsIterator implements Iterator<String> {
} }
private Document downloadUrl(final String url, final int attempt) throws CollectorException { private Document downloadUrl(final String url, final int attempt) throws CollectorException {
if (attempt > MAX_ATTEMPTS) { throw new CollectorException("Max Number of attempts reached, url:" + url); } if (attempt > MAX_ATTEMPTS) {
throw new CollectorException("Max Number of attempts reached, url:" + url);
}
if (attempt > 0) { if (attempt > 0) {
final int delay = (attempt * 5000); final int delay = (attempt * 5000);

View File

@ -0,0 +1,96 @@
package eu.dnetlib.dhp.collection.plugin.zenodo;
import static eu.dnetlib.dhp.utils.DHPUtils.getHadoopConfiguration;
import java.io.IOException;
import java.io.InputStream;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import eu.dnetlib.dhp.collection.ApiDescriptor;
import eu.dnetlib.dhp.collection.plugin.CollectorPlugin;
import eu.dnetlib.dhp.common.aggregation.AggregatorReport;
import eu.dnetlib.dhp.common.collection.CollectorException;
public class CollectZenodoDumpCollectorPlugin implements CollectorPlugin {
final private Logger log = LoggerFactory.getLogger(getClass());
private void downloadItem(final String name, final String itemURL, final String basePath,
final FileSystem fileSystem) {
try {
final Path hdfsWritePath = new Path(String.format("%s/%s", basePath, name));
final FSDataOutputStream fsDataOutputStream = fileSystem.create(hdfsWritePath, true);
final HttpGet request = new HttpGet(itemURL);
final int timeout = 60; // seconds
final RequestConfig config = RequestConfig
.custom()
.setConnectTimeout(timeout * 1000)
.setConnectionRequestTimeout(timeout * 1000)
.setSocketTimeout(timeout * 1000)
.build();
log.info("Downloading url {} into {}", itemURL, hdfsWritePath.getName());
try (CloseableHttpClient client = HttpClientBuilder.create().setDefaultRequestConfig(config).build();
CloseableHttpResponse response = client.execute(request)) {
int responseCode = response.getStatusLine().getStatusCode();
log.info("Response code is {}", responseCode);
if (responseCode >= 200 && responseCode < 400) {
IOUtils.copy(response.getEntity().getContent(), fsDataOutputStream);
}
} catch (Throwable eu) {
throw new RuntimeException(eu);
}
} catch (Throwable e) {
throw new RuntimeException(e);
}
}
@Override
public Stream<String> collect(ApiDescriptor api, AggregatorReport report) throws CollectorException {
try {
final String zenodoURL = api.getBaseUrl();
final String hdfsURI = api.getParams().get("hdfsURI");
final FileSystem fileSystem = FileSystem.get(getHadoopConfiguration(hdfsURI));
downloadItem("zenodoDump.tar.gz", zenodoURL, "/tmp", fileSystem);
CompressionCodecFactory factory = new CompressionCodecFactory(fileSystem.getConf());
Path sourcePath = new Path("/tmp/zenodoDump.tar.gz");
CompressionCodec codec = factory.getCodec(sourcePath);
InputStream gzipInputStream = null;
try {
gzipInputStream = codec.createInputStream(fileSystem.open(sourcePath));
return iterateTar(gzipInputStream);
} catch (IOException e) {
throw new CollectorException(e);
} finally {
log.info("Closing gzip stream");
org.apache.hadoop.io.IOUtils.closeStream(gzipInputStream);
}
} catch (Exception e) {
throw new CollectorException(e);
}
}
private Stream<String> iterateTar(InputStream gzipInputStream) throws Exception {
Iterable<String> iterable = () -> new ZenodoTarIterator(gzipInputStream);
return StreamSupport.stream(iterable.spliterator(), false);
}
}

View File

@ -0,0 +1,59 @@
package eu.dnetlib.dhp.collection.plugin.zenodo;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Iterator;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.io.IOUtils;
public class ZenodoTarIterator implements Iterator<String>, Closeable {
private final InputStream gzipInputStream;
private final StringBuilder currentItem = new StringBuilder();
private TarArchiveInputStream tais;
private boolean hasNext;
public ZenodoTarIterator(InputStream gzipInputStream) {
this.gzipInputStream = gzipInputStream;
tais = new TarArchiveInputStream(gzipInputStream);
hasNext = getNextItem();
}
private boolean getNextItem() {
try {
TarArchiveEntry entry;
while ((entry = tais.getNextTarEntry()) != null) {
if (entry.isFile()) {
currentItem.setLength(0);
currentItem.append(IOUtils.toString(new InputStreamReader(tais)));
return true;
}
}
return false;
} catch (Throwable e) {
throw new RuntimeException(e);
}
}
@Override
public boolean hasNext() {
return hasNext;
}
@Override
public String next() {
final String data = currentItem.toString();
hasNext = getNextItem();
return data;
}
@Override
public void close() throws IOException {
gzipInputStream.close();
}
}

View File

@ -0,0 +1,39 @@
package eu.dnetlib.dhp.sx.bio.pubmed;
/**
* The type Pubmed Affiliation.
*
* @author Sandro La Bruzzo
*/
public class PMAffiliation {
private String name;
private PMIdentifier identifier;
public PMAffiliation() {
}
public PMAffiliation(String name, PMIdentifier identifier) {
this.name = name;
this.identifier = identifier;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public PMIdentifier getIdentifier() {
return identifier;
}
public void setIdentifier(PMIdentifier identifier) {
this.identifier = identifier;
}
}

View File

@ -8,259 +8,115 @@ import java.util.List;
/** /**
* This class represent an instance of Pubmed Article extracted from the native XML * This class represent an instance of Pubmed Article extracted from the native XML
* *
* @author Sandro La Bruzzo
*/ */
public class PMArticle implements Serializable { public class PMArticle implements Serializable {
/**
* the Pubmed Identifier
*/
private String pmid; private String pmid;
private String pmcId; private String pmcId;
/**
* the DOI
*/
private String doi; private String doi;
/**
* the Pubmed Date extracted from <PubmedPubDate> Specifies a date significant to either the article's history or the citation's processing.
* All <History> dates will have a <Year>, <Month>, and <Day> elements. Some may have an <Hour>, <Minute>, and <Second> element(s).
*/
private String date; private String date;
/**
* This is an 'envelop' element that contains various elements describing the journal cited; i.e., ISSN, Volume, Issue, and PubDate and author name(s), however, it does not contain data itself.
*/
private PMJournal journal; private PMJournal journal;
/**
* The full journal title (taken from NLM cataloging data following NLM rules for how to compile a serial name) is exported in this element. Some characters that are not part of the NLM MEDLINE/PubMed Character Set reside in a relatively small number of full journal titles. The NLM journal title abbreviation is exported in the <MedlineTA> element.
*/
private String title; private String title;
/**
* English-language abstracts are taken directly from the published article.
* If the article does not have a published abstract, the National Library of Medicine does not create one,
* thus the record lacks the <Abstract> and <AbstractText> elements. However, in the absence of a formally
* labeled abstract in the published article, text from a substantive "summary", "summary and conclusions" or "conclusions and summary" may be used.
*/
private String description; private String description;
/**
* the language in which an article was published is recorded in <Language>.
* All entries are three letter abbreviations stored in lower case, such as eng, fre, ger, jpn, etc. When a single
* record contains more than one language value the XML export program extracts the languages in alphabetic order by the 3-letter language value.
* Some records provided by collaborating data producers may contain the value und to identify articles whose language is undetermined.
*/
private String language; private String language;
private List<PMSubject> subjects;
/** private List<PMSubject> publicationTypes = new ArrayList<>();
* NLM controlled vocabulary, Medical Subject Headings (MeSH®), is used to characterize the content of the articles represented by MEDLINE citations. *
*/
private final List<PMSubject> subjects = new ArrayList<>();
/**
* This element is used to identify the type of article indexed for MEDLINE;
* it characterizes the nature of the information or the manner in which it is conveyed as well as the type of
* research support received (e.g., Review, Letter, Retracted Publication, Clinical Conference, Research Support, N.I.H., Extramural).
*/
private final List<PMSubject> publicationTypes = new ArrayList<>();
/**
* Personal and collective (corporate) author names published with the article are found in <AuthorList>.
*/
private List<PMAuthor> authors = new ArrayList<>(); private List<PMAuthor> authors = new ArrayList<>();
private List<PMGrant> grants = new ArrayList<>();
/**
* <GrantID> contains the research grant or contract number (or both) that designates financial support by any agency of the United States Public Health Service
* or any institute of the National Institutes of Health. Additionally, beginning in late 2005, grant numbers are included for many other US and non-US funding agencies and organizations.
*/
private final List<PMGrant> grants = new ArrayList<>();
/**
* get the DOI
* @return a DOI
*/
public String getDoi() {
return doi;
}
/**
* Set the DOI
* @param doi a DOI
*/
public void setDoi(String doi) {
this.doi = doi;
}
/**
* get the Pubmed Identifier
* @return the PMID
*/
public String getPmid() { public String getPmid() {
return pmid; return pmid;
} }
/**
* set the Pubmed Identifier
* @param pmid the Pubmed Identifier
*/
public void setPmid(String pmid) { public void setPmid(String pmid) {
this.pmid = pmid; this.pmid = pmid;
} }
/**
* the Pubmed Date extracted from <PubmedPubDate> Specifies a date significant to either the article's history or the citation's processing.
* All <History> dates will have a <Year>, <Month>, and <Day> elements. Some may have an <Hour>, <Minute>, and <Second> element(s).
*
* @return the Pubmed Date
*/
public String getDate() {
return date;
}
/**
* Set the pubmed Date
* @param date
*/
public void setDate(String date) {
this.date = date;
}
/**
* The full journal title (taken from NLM cataloging data following NLM rules for how to compile a serial name) is exported in this element.
* Some characters that are not part of the NLM MEDLINE/PubMed Character Set reside in a relatively small number of full journal titles.
* The NLM journal title abbreviation is exported in the <MedlineTA> element.
*
* @return the pubmed Journal Extracted
*/
public PMJournal getJournal() {
return journal;
}
/**
* Set the mapped pubmed Journal
* @param journal
*/
public void setJournal(PMJournal journal) {
this.journal = journal;
}
/**
* <ArticleTitle> contains the entire title of the journal article. <ArticleTitle> is always in English;
* those titles originally published in a non-English language and translated for <ArticleTitle> are enclosed in square brackets.
* All titles end with a period unless another punctuation mark such as a question mark or bracket is present.
* Explanatory information about the title itself is enclosed in parentheses, e.g.: (author's transl).
* Corporate/collective authors may appear at the end of <ArticleTitle> for citations up to about the year 2000.
*
* @return the extracted pubmed Title
*/
public String getTitle() {
return title;
}
/**
* set the pubmed title
* @param title
*/
public void setTitle(String title) {
this.title = title;
}
/**
* English-language abstracts are taken directly from the published article.
* If the article does not have a published abstract, the National Library of Medicine does not create one,
* thus the record lacks the <Abstract> and <AbstractText> elements. However, in the absence of a formally
* labeled abstract in the published article, text from a substantive "summary", "summary and conclusions" or "conclusions and summary" may be used.
*
* @return the Mapped Pubmed Article Abstracts
*/
public String getDescription() {
return description;
}
/**
* Set the Mapped Pubmed Article Abstracts
* @param description
*/
public void setDescription(String description) {
this.description = description;
}
/**
* Personal and collective (corporate) author names published with the article are found in <AuthorList>.
*
* @return get the Mapped Authors lists
*/
public List<PMAuthor> getAuthors() {
return authors;
}
/**
* Set the Mapped Authors lists
* @param authors
*/
public void setAuthors(List<PMAuthor> authors) {
this.authors = authors;
}
/**
* This element is used to identify the type of article indexed for MEDLINE;
* it characterizes the nature of the information or the manner in which it is conveyed as well as the type of
* research support received (e.g., Review, Letter, Retracted Publication, Clinical Conference, Research Support, N.I.H., Extramural).
*
* @return the mapped Subjects
*/
public List<PMSubject> getSubjects() {
return subjects;
}
/**
*
* the language in which an article was published is recorded in <Language>.
* All entries are three letter abbreviations stored in lower case, such as eng, fre, ger, jpn, etc. When a single
* record contains more than one language value the XML export program extracts the languages in alphabetic order by the 3-letter language value.
* Some records provided by collaborating data producers may contain the value und to identify articles whose language is undetermined.
*
* @return The mapped Language
*/
public String getLanguage() {
return language;
}
/**
*
* Set The mapped Language
*
* @param language the mapped Language
*/
public void setLanguage(String language) {
this.language = language;
}
/**
* This element is used to identify the type of article indexed for MEDLINE;
* it characterizes the nature of the information or the manner in which it is conveyed as well as the type of
* research support received (e.g., Review, Letter, Retracted Publication, Clinical Conference, Research Support, N.I.H., Extramural).
*
* @return the mapped Publication Type
*/
public List<PMSubject> getPublicationTypes() {
return publicationTypes;
}
/**
* <GrantID> contains the research grant or contract number (or both) that designates financial support by any agency of the United States Public Health Service
* or any institute of the National Institutes of Health. Additionally, beginning in late 2005, grant numbers are included for many other US and non-US funding agencies and organizations.
* @return the mapped grants
*/
public List<PMGrant> getGrants() {
return grants;
}
public String getPmcId() { public String getPmcId() {
return pmcId; return pmcId;
} }
public PMArticle setPmcId(String pmcId) { public void setPmcId(String pmcId) {
this.pmcId = pmcId; this.pmcId = pmcId;
return this; }
public String getDoi() {
return doi;
}
public void setDoi(String doi) {
this.doi = doi;
}
public String getDate() {
return date;
}
public void setDate(String date) {
this.date = date;
}
public PMJournal getJournal() {
return journal;
}
public void setJournal(PMJournal journal) {
this.journal = journal;
}
public String getTitle() {
return title;
}
public void setTitle(String title) {
this.title = title;
}
public String getDescription() {
return description;
}
public void setDescription(String description) {
this.description = description;
}
public String getLanguage() {
return language;
}
public void setLanguage(String language) {
this.language = language;
}
public List<PMSubject> getSubjects() {
return subjects;
}
public void setSubjects(List<PMSubject> subjects) {
this.subjects = subjects;
}
public List<PMSubject> getPublicationTypes() {
return publicationTypes;
}
public void setPublicationTypes(List<PMSubject> publicationTypes) {
this.publicationTypes = publicationTypes;
}
public List<PMAuthor> getAuthors() {
return authors;
}
public void setAuthors(List<PMAuthor> authors) {
this.authors = authors;
}
public List<PMGrant> getGrants() {
return grants;
}
public void setGrants(List<PMGrant> grants) {
this.grants = grants;
} }
} }

View File

@ -12,6 +12,8 @@ public class PMAuthor implements Serializable {
private String lastName; private String lastName;
private String foreName; private String foreName;
private PMIdentifier identifier;
private PMAffiliation affiliation;
/** /**
* Gets last name. * Gets last name.
@ -59,4 +61,40 @@ public class PMAuthor implements Serializable {
.format("%s, %s", this.foreName != null ? this.foreName : "", this.lastName != null ? this.lastName : ""); .format("%s, %s", this.foreName != null ? this.foreName : "", this.lastName != null ? this.lastName : "");
} }
/**
* Gets identifier.
*
* @return the identifier
*/
public PMIdentifier getIdentifier() {
return identifier;
}
/**
* Sets identifier.
*
* @param identifier the identifier
*/
public void setIdentifier(PMIdentifier identifier) {
this.identifier = identifier;
}
/**
* Gets affiliation.
*
* @return the affiliation
*/
public PMAffiliation getAffiliation() {
return affiliation;
}
/**
* Sets affiliation.
*
* @param affiliation the affiliation
*/
public void setAffiliation(PMAffiliation affiliation) {
this.affiliation = affiliation;
}
} }

View File

@ -0,0 +1,53 @@
package eu.dnetlib.dhp.sx.bio.pubmed;
public class PMIdentifier {
private String pid;
private String type;
public PMIdentifier(String pid, String type) {
this.pid = cleanPid(pid);
this.type = type;
}
public PMIdentifier() {
}
private String cleanPid(String pid) {
if (pid == null) {
return null;
}
// clean ORCID ID in the form 0000000163025705 to 0000-0001-6302-5705
if (pid.matches("[0-9]{15}[0-9X]")) {
return pid.replaceAll("(.{4})(.{4})(.{4})(.{4})", "$1-$2-$3-$4");
}
// clean ORCID in the form http://orcid.org/0000-0001-8567-3543 to 0000-0001-8567-3543
if (pid.matches("http://orcid.org/[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}")) {
return pid.replaceAll("http://orcid.org/", "");
}
return pid;
}
public String getPid() {
return pid;
}
public PMIdentifier setPid(String pid) {
this.pid = cleanPid(pid);
return this;
}
public String getType() {
return type;
}
public PMIdentifier setType(String type) {
this.type = type;
return this;
}
}

View File

@ -28,13 +28,19 @@
"paramLongName": "dataciteInputPath", "paramLongName": "dataciteInputPath",
"paramDescription": "the path to get the input data from Datacite", "paramDescription": "the path to get the input data from Datacite",
"paramRequired": true "paramRequired": true
},{ },
{
"paramName": "wip", "paramName": "wip",
"paramLongName": "webCrawlInputPath", "paramLongName": "webCrawlInputPath",
"paramDescription": "the path to get the input data from Web Crawl", "paramDescription": "the path to get the input data from Web Crawl",
"paramRequired": true "paramRequired": true
} },
, {
"paramName": "pub",
"paramLongName": "publisherInputPath",
"paramDescription": "the path to get the input data from publishers",
"paramRequired": true
},
{ {
"paramName": "o", "paramName": "o",
"paramLongName": "outputPath", "paramLongName": "outputPath",

View File

@ -31,9 +31,11 @@ spark2SqlQueryExecutionListeners=com.cloudera.spark.lineage.NavigatorQueryListen
# The following is needed as a property of a workflow # The following is needed as a property of a workflow
oozie.wf.application.path=${oozieTopWfApplicationPath} oozie.wf.application.path=${oozieTopWfApplicationPath}
crossrefInputPath=/data/bip-affiliations/crossref-data.json crossrefInputPath=/data/openaire-affiliations/crossref-data.json
pubmedInputPath=/data/bip-affiliations/pubmed-data.json pubmedInputPath=/data/openaire-affiliations/pubmed-data-v4.json
openapcInputPath=/data/bip-affiliations/openapc-data.json openapcInputPath=/data/openaire-affiliations/openapc-data.json
dataciteInputPath=/data/bip-affiliations/datacite-data.json dataciteInputPath=/data/openaire-affiliations/datacite-data.json
webCrawlInputPath=/data/openaire-affiliations/webCrawl
publisherInputPath=/data/openaire-affiliations/publishers
outputPath=/tmp/crossref-affiliations-output-v5 outputPath=/tmp/affRoAS

View File

@ -1,4 +1,4 @@
<workflow-app name="BipAffiliations" xmlns="uri:oozie:workflow:0.5"> <workflow-app name="OpenAIREAffiliations" xmlns="uri:oozie:workflow:0.5">
<parameters> <parameters>
<property> <property>
@ -21,6 +21,10 @@
<name>webCrawlInputPath</name> <name>webCrawlInputPath</name>
<description>the path where to find the inferred affiliation relations from webCrawl</description> <description>the path where to find the inferred affiliation relations from webCrawl</description>
</property> </property>
<property>
<name>publisherInputPath</name>
<description>the path where to find the inferred affiliation relations from publisher websites</description>
</property>
<property> <property>
<name>outputPath</name> <name>outputPath</name>
<description>the path where to store the actionset</description> <description>the path where to store the actionset</description>
@ -99,7 +103,7 @@
<spark xmlns="uri:oozie:spark-action:0.2"> <spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master> <master>yarn</master>
<mode>cluster</mode> <mode>cluster</mode>
<name>Produces the atomic action with the inferred by BIP! affiliation relations (from Crossref and Pubmed)</name> <name>Produces the atomic action with the inferred by OpenAIRE affiliation relations</name>
<class>eu.dnetlib.dhp.actionmanager.bipaffiliations.PrepareAffiliationRelations</class> <class>eu.dnetlib.dhp.actionmanager.bipaffiliations.PrepareAffiliationRelations</class>
<jar>dhp-aggregation-${projectVersion}.jar</jar> <jar>dhp-aggregation-${projectVersion}.jar</jar>
<spark-opts> <spark-opts>
@ -117,6 +121,7 @@
<arg>--openapcInputPath</arg><arg>${openapcInputPath}</arg> <arg>--openapcInputPath</arg><arg>${openapcInputPath}</arg>
<arg>--dataciteInputPath</arg><arg>${dataciteInputPath}</arg> <arg>--dataciteInputPath</arg><arg>${dataciteInputPath}</arg>
<arg>--webCrawlInputPath</arg><arg>${webCrawlInputPath}</arg> <arg>--webCrawlInputPath</arg><arg>${webCrawlInputPath}</arg>
<arg>--publisherInputPath</arg><arg>${publisherInputPath}</arg>
<arg>--outputPath</arg><arg>${outputPath}</arg> <arg>--outputPath</arg><arg>${outputPath}</arg>
</spark> </spark>
<ok to="End"/> <ok to="End"/>

View File

@ -16,5 +16,11 @@
"paramLongName": "hdfsNameNode", "paramLongName": "hdfsNameNode",
"paramDescription": "the hdfs name node", "paramDescription": "the hdfs name node",
"paramRequired": true "paramRequired": true
},
{
"paramName": "bp",
"paramLongName": "backupPath",
"paramDescription": "the hdfs path to move the OC data after the extraction",
"paramRequired": true
} }
] ]

View File

@ -24,12 +24,13 @@
"paramLongName": "outputPath", "paramLongName": "outputPath",
"paramDescription": "the hdfs name node", "paramDescription": "the hdfs name node",
"paramRequired": true "paramRequired": true
}, { },
{
"paramName": "nn", "paramName": "nn",
"paramLongName": "hdfsNameNode", "paramLongName": "hdfsNameNode",
"paramDescription": "the hdfs name node", "paramDescription": "the hdfs name node",
"paramRequired": true "paramRequired": true
} }
] ]

View File

@ -94,17 +94,7 @@
<arg>--hdfsNameNode</arg><arg>${nameNode}</arg> <arg>--hdfsNameNode</arg><arg>${nameNode}</arg>
<arg>--inputPath</arg><arg>${inputPath}/Original</arg> <arg>--inputPath</arg><arg>${inputPath}/Original</arg>
<arg>--outputPath</arg><arg>${inputPath}/Extracted</arg> <arg>--outputPath</arg><arg>${inputPath}/Extracted</arg>
</java> <arg>--backupPath</arg><arg>${inputPath}/backup</arg>
<ok to="read"/>
<error to="Kill"/>
</action>
<action name="extract_correspondence">
<java>
<main-class>eu.dnetlib.dhp.actionmanager.opencitations.GetOpenCitationsRefs</main-class>
<arg>--hdfsNameNode</arg><arg>${nameNode}</arg>
<arg>--inputPath</arg><arg>${inputPath}/correspondence</arg>
<arg>--outputPath</arg><arg>${inputPath}/correspondence_extracted</arg>
</java> </java>
<ok to="read"/> <ok to="read"/>
<error to="Kill"/> <error to="Kill"/>

View File

@ -21,5 +21,30 @@
"paramLongName": "workingDir", "paramLongName": "workingDir",
"paramDescription": "the hdfs name node", "paramDescription": "the hdfs name node",
"paramRequired": false "paramRequired": false
},
{
"paramName": "pu",
"paramLongName": "postgresUrl",
"paramDescription": "the hdfs name node",
"paramRequired": false
},
{
"paramName": "ps",
"paramLongName": "postgresUser",
"paramDescription": "the hdfs name node",
"paramRequired": false
},
{
"paramName": "pp",
"paramLongName": "postgresPassword",
"paramDescription": "the hdfs name node",
"paramRequired": false
},{
"paramName": "nn",
"paramLongName": "hdfsNameNode",
"paramDescription": "the hdfs name node",
"paramRequired": false
} }
] ]

View File

@ -1,2 +1,5 @@
inputPath=/data/orcid_2023/tables/ inputPath=/data/orcid_2023/tables/
outputPath=/user/miriam.baglioni/peopleAS outputPath=/user/miriam.baglioni/peopleAS
postgresUrl=jdbc:postgresql://beta.services.openaire.eu:5432/dnet_openaireplus
postgresUser=dnet
postgresPassword=dnetPwd

View File

@ -9,6 +9,18 @@
<name>outputPath</name> <name>outputPath</name>
<description>the path where to store the actionset</description> <description>the path where to store the actionset</description>
</property> </property>
<property>
<name>postgresUrl</name>
<description>the path where to store the actionset</description>
</property>
<property>
<name>postgresUser</name>
<description>the path where to store the actionset</description>
</property>
<property>
<name>postgresPassword</name>
<description>the path where to store the actionset</description>
</property>
<property> <property>
<name>sparkDriverMemory</name> <name>sparkDriverMemory</name>
<description>memory for driver process</description> <description>memory for driver process</description>
@ -102,6 +114,10 @@
<arg>--inputPath</arg><arg>${inputPath}</arg> <arg>--inputPath</arg><arg>${inputPath}</arg>
<arg>--outputPath</arg><arg>${outputPath}</arg> <arg>--outputPath</arg><arg>${outputPath}</arg>
<arg>--workingDir</arg><arg>${workingDir}</arg> <arg>--workingDir</arg><arg>${workingDir}</arg>
<arg>--hdfsNameNode</arg><arg>${nameNode}</arg>
<arg>--postgresUrl</arg><arg>${postgresUrl}</arg>
<arg>--postgresUser</arg><arg>${postgresUser}</arg>
<arg>--postgresPassword</arg><arg>${postgresPassword}</arg>
</spark> </spark>
<ok to="End"/> <ok to="End"/>
<error to="Kill"/> <error to="Kill"/>

View File

@ -24,7 +24,7 @@
<decision name="resume_from"> <decision name="resume_from">
<switch> <switch>
<case to="download">${wf:conf('resumeFrom') eq 'DownloadDump'}</case> <case to="reset_workingDir">${wf:conf('resumeFrom') eq 'DownloadDump'}</case>
<default to="create_actionset"/> <!-- first action to be done when downloadDump is to be performed --> <default to="create_actionset"/> <!-- first action to be done when downloadDump is to be performed -->
</switch> </switch>
</decision> </decision>
@ -33,6 +33,14 @@
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill> </kill>
<action name="reset_workingDir">
<fs>
<delete path="${workingDir}"/>
<mkdir path="${workingDir}"/>
</fs>
<ok to="download"/>
<error to="Kill"/>
</action>
<action name="download"> <action name="download">
<shell xmlns="uri:oozie:shell-action:0.2"> <shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker> <job-tracker>${jobTracker}</job-tracker>

View File

@ -1,8 +1,7 @@
[ [
{"paramName":"mt", "paramLongName":"master", "paramDescription": "should be local or yarn", "paramRequired": true}, {"paramName":"mt", "paramLongName":"master", "paramDescription": "should be local or yarn", "paramRequired": true},
{"paramName":"i", "paramLongName":"isLookupUrl", "paramDescription": "isLookupUrl", "paramRequired": true}, {"paramName":"i", "paramLongName":"isLookupUrl", "paramDescription": "isLookupUrl", "paramRequired": true},
{"paramName":"w", "paramLongName":"workingPath", "paramDescription": "the path of the sequencial file to read", "paramRequired": true}, {"paramName":"s", "paramLongName":"sourcePath", "paramDescription": "the baseline path", "paramRequired": true},
{"paramName":"mo", "paramLongName":"mdstoreOutputVersion", "paramDescription": "the oaf path ", "paramRequired": true}, {"paramName":"mo", "paramLongName":"mdstoreOutputVersion", "paramDescription": "the mdstore path to save", "paramRequired": true}
{"paramName":"s", "paramLongName":"skipUpdate", "paramDescription": "skip update ", "paramRequired": false},
{"paramName":"h", "paramLongName":"hdfsServerUri", "paramDescription": "the working path ", "paramRequired": true}
] ]

View File

@ -1,4 +1,4 @@
<workflow-app name="Download_Transform_Pubmed_Workflow" xmlns="uri:oozie:workflow:0.5"> <workflow-app name="Transform_Pubmed_Workflow" xmlns="uri:oozie:workflow:0.5">
<parameters> <parameters>
<property> <property>
<name>baselineWorkingPath</name> <name>baselineWorkingPath</name>
@ -16,11 +16,6 @@
<name>mdStoreManagerURI</name> <name>mdStoreManagerURI</name>
<description>the path of the cleaned mdstore</description> <description>the path of the cleaned mdstore</description>
</property> </property>
<property>
<name>skipUpdate</name>
<value>false</value>
<description>The request block size</description>
</property>
</parameters> </parameters>
<start to="StartTransaction"/> <start to="StartTransaction"/>
@ -44,16 +39,16 @@
<arg>--mdStoreManagerURI</arg><arg>${mdStoreManagerURI}</arg> <arg>--mdStoreManagerURI</arg><arg>${mdStoreManagerURI}</arg>
<capture-output/> <capture-output/>
</java> </java>
<ok to="ConvertDataset"/> <ok to="TransformPubMed"/>
<error to="RollBack"/> <error to="RollBack"/>
</action> </action>
<action name="ConvertDataset"> <action name="TransformPubMed">
<spark xmlns="uri:oozie:spark-action:0.2"> <spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master> <master>yarn</master>
<mode>cluster</mode> <mode>cluster</mode>
<name>Convert Baseline to OAF Dataset</name> <name>Convert Baseline Pubmed to OAF Dataset</name>
<class>eu.dnetlib.dhp.sx.bio.ebi.SparkCreateBaselineDataFrame</class> <class>eu.dnetlib.dhp.sx.bio.ebi.SparkCreatePubmedDump</class>
<jar>dhp-aggregation-${projectVersion}.jar</jar> <jar>dhp-aggregation-${projectVersion}.jar</jar>
<spark-opts> <spark-opts>
--executor-memory=${sparkExecutorMemory} --executor-memory=${sparkExecutorMemory}
@ -65,12 +60,10 @@
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress} --conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir} --conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
</spark-opts> </spark-opts>
<arg>--workingPath</arg><arg>${baselineWorkingPath}</arg> <arg>--sourcePath</arg><arg>${baselineWorkingPath}</arg>
<arg>--mdstoreOutputVersion</arg><arg>${wf:actionData('StartTransaction')['mdStoreVersion']}</arg> <arg>--mdstoreOutputVersion</arg><arg>${wf:actionData('StartTransaction')['mdStoreVersion']}</arg>
<arg>--master</arg><arg>yarn</arg> <arg>--master</arg><arg>yarn</arg>
<arg>--isLookupUrl</arg><arg>${isLookupUrl}</arg> <arg>--isLookupUrl</arg><arg>${isLookupUrl}</arg>
<arg>--hdfsServerUri</arg><arg>${nameNode}</arg>
<arg>--skipUpdate</arg><arg>${skipUpdate}</arg>
</spark> </spark>
<ok to="CommitVersion"/> <ok to="CommitVersion"/>
<error to="RollBack"/> <error to="RollBack"/>

View File

@ -14,7 +14,7 @@ import eu.dnetlib.dhp.schema.oaf.utils.{
PidType PidType
} }
import eu.dnetlib.dhp.utils.DHPUtils import eu.dnetlib.dhp.utils.DHPUtils
import org.apache.commons.lang.StringUtils import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.Row import org.apache.spark.sql.Row
import org.json4s import org.json4s
import org.json4s.DefaultFormats import org.json4s.DefaultFormats
@ -332,7 +332,7 @@ case object Crossref2Oaf {
implicit lazy val formats: DefaultFormats.type = org.json4s.DefaultFormats implicit lazy val formats: DefaultFormats.type = org.json4s.DefaultFormats
//MAPPING Crossref DOI into PID //MAPPING Crossref DOI into PID
val doi: String = DoiCleaningRule.normalizeDoi((json \ "DOI").extract[String]) val doi: String = DoiCleaningRule.clean((json \ "DOI").extract[String])
result.setPid( result.setPid(
List( List(
structuredProperty( structuredProperty(
@ -504,6 +504,24 @@ case object Crossref2Oaf {
) )
} }
val is_review = json \ "relation" \ "is-review-of" \ "id"
if (is_review != JNothing) {
instance.setInstancetype(
OafMapperUtils.qualifier(
"0015",
"peerReviewed",
ModelConstants.DNET_REVIEW_LEVELS,
ModelConstants.DNET_REVIEW_LEVELS
)
)
}
if (doi.startsWith("10.3410") || doi.startsWith("10.12703"))
instance.setHostedby(
OafMapperUtils.keyValue(OafMapperUtils.createOpenaireId(10, "openaire____::H1Connect", true), "H1Connect")
)
instance.setAccessright( instance.setAccessright(
decideAccessRight(instance.getLicense, result.getDateofacceptance.getValue) decideAccessRight(instance.getLicense, result.getDateofacceptance.getValue)
) )
@ -655,11 +673,11 @@ case object Crossref2Oaf {
val doi = input.getString(0) val doi = input.getString(0)
val rorId = input.getString(1) val rorId = input.getString(1)
val pubId = s"50|${PidType.doi.toString.padTo(12, "_")}::${DoiCleaningRule.normalizeDoi(doi)}" val pubId = IdentifierFactory.idFromPid("50", "doi", DoiCleaningRule.clean(doi), true)
val affId = GenerateRorActionSetJob.calculateOpenaireId(rorId) val affId = GenerateRorActionSetJob.calculateOpenaireId(rorId)
val r: Relation = new Relation val r: Relation = new Relation
DoiCleaningRule.clean(doi)
r.setSource(pubId) r.setSource(pubId)
r.setTarget(affId) r.setTarget(affId)
r.setRelType(ModelConstants.RESULT_ORGANIZATION) r.setRelType(ModelConstants.RESULT_ORGANIZATION)
@ -960,7 +978,26 @@ case object Crossref2Oaf {
case "10.13039/501100010790" => case "10.13039/501100010790" =>
generateSimpleRelationFromAward(funder, "erasmusplus_", a => a) generateSimpleRelationFromAward(funder, "erasmusplus_", a => a)
case _ => logger.debug("no match for " + funder.DOI.get) case _ => logger.debug("no match for " + funder.DOI.get)
//Add for Danish funders
//Independent Research Fund Denmark (IRFD)
case "10.13039/501100004836" =>
generateSimpleRelationFromAward(funder, "irfd________", a => a)
val targetId = getProjectId("irfd________", "1e5e62235d094afd01cd56e65112fc63")
queue += generateRelation(sourceId, targetId, ModelConstants.IS_PRODUCED_BY)
queue += generateRelation(targetId, sourceId, ModelConstants.PRODUCES)
//Carlsberg Foundation (CF)
case "10.13039/501100002808" =>
generateSimpleRelationFromAward(funder, "cf__________", a => a)
val targetId = getProjectId("cf__________", "1e5e62235d094afd01cd56e65112fc63")
queue += generateRelation(sourceId, targetId, ModelConstants.IS_PRODUCED_BY)
queue += generateRelation(targetId, sourceId, ModelConstants.PRODUCES)
//Novo Nordisk Foundation (NNF)
case "10.13039/501100009708" =>
generateSimpleRelationFromAward(funder, "nnf___________", a => a)
val targetId = getProjectId("nnf_________", "1e5e62235d094afd01cd56e65112fc63")
queue += generateRelation(sourceId, targetId, ModelConstants.IS_PRODUCED_BY)
queue += generateRelation(targetId, sourceId, ModelConstants.PRODUCES)
case _ => logger.debug("no match for " + funder.DOI.get)
} }
} else { } else {

View File

@ -407,10 +407,9 @@ object DataciteToOAFTransformation {
) )
} }
if (c.affiliation.isDefined) if (c.affiliation.isDefined)
a.setAffiliation( a.setRawAffiliationString(
c.affiliation.get c.affiliation.get
.filter(af => af.nonEmpty) .filter(af => af.nonEmpty)
.map(af => OafMapperUtils.field(af, dataInfo))
.asJava .asJava
) )
a.setRank(idx + 1) a.setRank(idx + 1)

View File

@ -0,0 +1,104 @@
package eu.dnetlib.dhp.sx.bio.ebi
import com.fasterxml.jackson.databind.ObjectMapper
import eu.dnetlib.dhp.application.AbstractScalaApplication
import eu.dnetlib.dhp.common.Constants
import eu.dnetlib.dhp.common.Constants.{MDSTORE_DATA_PATH, MDSTORE_SIZE_PATH}
import eu.dnetlib.dhp.common.vocabulary.VocabularyGroup
import eu.dnetlib.dhp.schema.mdstore.MDStoreVersion
import eu.dnetlib.dhp.sx.bio.pubmed.{PMArticle, PMParser2, PubMedToOaf}
import eu.dnetlib.dhp.transformation.TransformSparkJobNode
import eu.dnetlib.dhp.utils.DHPUtils.writeHdfsFile
import eu.dnetlib.dhp.utils.ISLookupClientFactory
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.slf4j.{Logger, LoggerFactory}
class SparkCreatePubmedDump(propertyPath: String, args: Array[String], log: Logger)
extends AbstractScalaApplication(propertyPath, args, log: Logger) {
/** Here all the spark applications runs this method
* where the whole logic of the spark node is defined
*/
override def run(): Unit = {
val isLookupUrl: String = parser.get("isLookupUrl")
log.info("isLookupUrl: {}", isLookupUrl)
val sourcePath = parser.get("sourcePath")
log.info(s"SourcePath is '$sourcePath'")
val mdstoreOutputVersion = parser.get("mdstoreOutputVersion")
log.info(s"mdstoreOutputVersion is '$mdstoreOutputVersion'")
val mapper = new ObjectMapper()
val cleanedMdStoreVersion = mapper.readValue(mdstoreOutputVersion, classOf[MDStoreVersion])
val outputBasePath = cleanedMdStoreVersion.getHdfsPath
log.info(s"outputBasePath is '$outputBasePath'")
val isLookupService = ISLookupClientFactory.getLookUpService(isLookupUrl)
val vocabularies = VocabularyGroup.loadVocsFromIS(isLookupService)
createPubmedDump(spark, sourcePath, outputBasePath, vocabularies)
}
/** This method creates a dump of the pubmed articles
* @param spark the spark session
* @param sourcePath the path of the source file
* @param targetPath the path of the target file
* @param vocabularies the vocabularies
*/
def createPubmedDump(
spark: SparkSession,
sourcePath: String,
targetPath: String,
vocabularies: VocabularyGroup
): Unit = {
require(spark != null)
implicit val PMEncoder: Encoder[PMArticle] = Encoders.bean(classOf[PMArticle])
import spark.implicits._
val df = spark.read.option("lineSep", "</PubmedArticle>").text(sourcePath)
val mapper = new ObjectMapper()
df.as[String]
.map(s => {
val id = s.indexOf("<PubmedArticle>")
if (id >= 0) s"${s.substring(id)}</PubmedArticle>" else null
})
.filter(s => s != null)
.map { i =>
//remove try catch
try {
new PMParser2().parse(i)
} catch {
case _: Exception => {
throw new RuntimeException(s"Error parsing article: $i")
}
}
}
.dropDuplicates("pmid")
.map { a =>
val oaf = PubMedToOaf.convert(a, vocabularies)
if (oaf != null)
mapper.writeValueAsString(oaf)
else
null
}
.as[String]
.filter(s => s != null)
.write
.option("compression", "gzip")
.mode("overwrite")
.text(targetPath + MDSTORE_DATA_PATH)
val mdStoreSize = spark.read.text(targetPath + MDSTORE_DATA_PATH).count
writeHdfsFile(spark.sparkContext.hadoopConfiguration, "" + mdStoreSize, targetPath + MDSTORE_SIZE_PATH)
}
}
object SparkCreatePubmedDump {
def main(args: Array[String]): Unit = {
val log: Logger = LoggerFactory.getLogger(getClass)
new SparkCreatePubmedDump("/eu/dnetlib/dhp/sx/bio/ebi/baseline_to_oaf_params.json", args, log).initialize().run()
}
}

View File

@ -0,0 +1,277 @@
package eu.dnetlib.dhp.sx.bio.pubmed
import org.apache.commons.lang3.StringUtils
import javax.xml.stream.XMLEventReader
import scala.collection.JavaConverters._
import scala.xml.{MetaData, NodeSeq}
import scala.xml.pull.{EvElemEnd, EvElemStart, EvText}
class PMParser2 {
/** Extracts the value of an attribute from a MetaData object.
* @param attrs the MetaData object
* @param key the key of the attribute
* @return the value of the attribute or null if the attribute is not found
*/
private def extractAttributes(attrs: MetaData, key: String): String = {
val res = attrs.get(key)
if (res.isDefined) {
val s = res.get
if (s != null && s.nonEmpty)
s.head.text
else
null
} else null
}
/** Validates and formats a date given the year, month, and day as strings.
*
* @param year the year as a string
* @param month the month as a string
* @param day the day as a string
* @return the formatted date as "YYYY-MM-DD" or null if the date is invalid
*/
private def validate_Date(year: String, month: String, day: String): String = {
try {
f"${year.toInt}-${month.toInt}%02d-${day.toInt}%02d"
} catch {
case _: Throwable => null
}
}
/** Extracts the grant information from a NodeSeq object.
*
* @param gNode the NodeSeq object
* @return the grant information or an empty list if the grant information is not found
*/
private def extractGrant(gNode: NodeSeq): List[PMGrant] = {
gNode
.map(node => {
val grantId = (node \ "GrantID").text
val agency = (node \ "Agency").text
val country = (node \ "Country").text
new PMGrant(grantId, agency, country)
})
.toList
}
/** Extracts the journal information from a NodeSeq object.
*
* @param jNode the NodeSeq object
* @return the journal information or null if the journal information is not found
*/
private def extractJournal(jNode: NodeSeq): PMJournal = {
val journal = new PMJournal
journal.setTitle((jNode \ "Title").text)
journal.setIssn((jNode \ "ISSN").text)
journal.setVolume((jNode \ "JournalIssue" \ "Volume").text)
journal.setIssue((jNode \ "JournalIssue" \ "Issue").text)
if (journal.getTitle != null && StringUtils.isNotEmpty(journal.getTitle))
journal
else
null
}
private def extractAuthors(aNode: NodeSeq): List[PMAuthor] = {
aNode
.map(author => {
val a = new PMAuthor
a.setLastName((author \ "LastName").text)
a.setForeName((author \ "ForeName").text)
val id = (author \ "Identifier").text
val idType = (author \ "Identifier" \ "@Source").text
if (id != null && id.nonEmpty && idType != null && idType.nonEmpty) {
a.setIdentifier(new PMIdentifier(id, idType))
}
val affiliation = (author \ "AffiliationInfo" \ "Affiliation").text
val affiliationId = (author \ "AffiliationInfo" \ "Identifier").text
val affiliationIdType = (author \ "AffiliationInfo" \ "Identifier" \ "@Source").text
if (affiliation != null && affiliation.nonEmpty) {
val aff = new PMAffiliation()
aff.setName(affiliation)
if (
affiliationId != null && affiliationId.nonEmpty && affiliationIdType != null && affiliationIdType.nonEmpty
) {
aff.setIdentifier(new PMIdentifier(affiliationId, affiliationIdType))
}
a.setAffiliation(aff)
}
a
})
.toList
}
def parse(input: String): PMArticle = {
val xml = scala.xml.XML.loadString(input)
val article = new PMArticle
val grantNodes = xml \ "MedlineCitation" \\ "Grant"
article.setGrants(extractGrant(grantNodes).asJava)
val journal = xml \ "MedlineCitation" \ "Article" \ "Journal"
article.setJournal(extractJournal(journal))
val authors = xml \ "MedlineCitation" \ "Article" \ "AuthorList" \ "Author"
article.setAuthors(
extractAuthors(authors).asJava
)
val pmId = xml \ "MedlineCitation" \ "PMID"
val articleIds = xml \ "PubmedData" \ "ArticleIdList" \ "ArticleId"
articleIds.foreach(articleId => {
val idType = (articleId \ "@IdType").text
val id = articleId.text
if ("doi".equalsIgnoreCase(idType)) article.setDoi(id)
if ("pmc".equalsIgnoreCase(idType)) article.setPmcId(id)
})
article.setPmid(pmId.text)
val pubMedPubDate = xml \ "MedlineCitation" \ "DateCompleted"
val currentDate =
validate_Date((pubMedPubDate \ "Year").text, (pubMedPubDate \ "Month").text, (pubMedPubDate \ "Day").text)
if (currentDate != null) article.setDate(currentDate)
val articleTitle = xml \ "MedlineCitation" \ "Article" \ "ArticleTitle"
article.setTitle(articleTitle.text)
val abstractText = xml \ "MedlineCitation" \ "Article" \ "Abstract" \ "AbstractText"
if (abstractText != null && abstractText.text != null && abstractText.text.nonEmpty)
article.setDescription(abstractText.text.split("\n").map(s => s.trim).mkString(" ").trim)
val language = xml \ "MedlineCitation" \ "Article" \ "Language"
article.setLanguage(language.text)
val subjects = xml \ "MedlineCitation" \ "MeshHeadingList" \ "MeshHeading"
article.setSubjects(
subjects
.take(20)
.map(subject => {
val descriptorName = (subject \ "DescriptorName").text
val ui = (subject \ "DescriptorName" \ "@UI").text
val s = new PMSubject
s.setValue(descriptorName)
s.setMeshId(ui)
s
})
.toList
.asJava
)
val publicationTypes = xml \ "MedlineCitation" \ "Article" \ "PublicationTypeList" \ "PublicationType"
article.setPublicationTypes(
publicationTypes
.map(pt => {
val s = new PMSubject
s.setValue(pt.text)
s
})
.toList
.asJava
)
article
}
def parse2(xml: XMLEventReader): PMArticle = {
var currentArticle: PMArticle = null
var currentSubject: PMSubject = null
var currentAuthor: PMAuthor = null
var currentJournal: PMJournal = null
var currentGrant: PMGrant = null
var currNode: String = null
var currentYear = "0"
var currentMonth = "01"
var currentDay = "01"
var currentArticleType: String = null
while (xml.hasNext) {
val ne = xml.next
ne match {
case EvElemStart(_, label, attrs, _) =>
currNode = label
label match {
case "PubmedArticle" => currentArticle = new PMArticle
case "Author" => currentAuthor = new PMAuthor
case "Journal" => currentJournal = new PMJournal
case "Grant" => currentGrant = new PMGrant
case "PublicationType" | "DescriptorName" =>
currentSubject = new PMSubject
currentSubject.setMeshId(extractAttributes(attrs, "UI"))
case "ArticleId" => currentArticleType = extractAttributes(attrs, "IdType")
case _ =>
}
case EvElemEnd(_, label) =>
label match {
case "PubmedArticle" => return currentArticle
case "Author" => currentArticle.getAuthors.add(currentAuthor)
case "Journal" => currentArticle.setJournal(currentJournal)
case "Grant" => currentArticle.getGrants.add(currentGrant)
case "PubMedPubDate" =>
if (currentArticle.getDate == null)
currentArticle.setDate(validate_Date(currentYear, currentMonth, currentDay))
case "PubDate" => currentJournal.setDate(s"$currentYear-$currentMonth-$currentDay")
case "DescriptorName" => currentArticle.getSubjects.add(currentSubject)
case "PublicationType" => currentArticle.getPublicationTypes.add(currentSubject)
case _ =>
}
case EvText(text) =>
if (currNode != null && text.trim.nonEmpty)
currNode match {
case "ArticleTitle" => {
if (currentArticle.getTitle == null)
currentArticle.setTitle(text.trim)
else
currentArticle.setTitle(currentArticle.getTitle + text.trim)
}
case "AbstractText" => {
if (currentArticle.getDescription == null)
currentArticle.setDescription(text.trim)
else
currentArticle.setDescription(currentArticle.getDescription + text.trim)
}
case "PMID" => currentArticle.setPmid(text.trim)
case "ArticleId" =>
if ("doi".equalsIgnoreCase(currentArticleType)) currentArticle.setDoi(text.trim)
if ("pmc".equalsIgnoreCase(currentArticleType)) currentArticle.setPmcId(text.trim)
case "Language" => currentArticle.setLanguage(text.trim)
case "ISSN" => currentJournal.setIssn(text.trim)
case "GrantID" => currentGrant.setGrantID(text.trim)
case "Agency" => currentGrant.setAgency(text.trim)
case "Country" => if (currentGrant != null) currentGrant.setCountry(text.trim)
case "Year" => currentYear = text.trim
case "Month" => currentMonth = text.trim
case "Day" => currentDay = text.trim
case "Volume" => currentJournal.setVolume(text.trim)
case "Issue" => currentJournal.setIssue(text.trim)
case "PublicationType" | "DescriptorName" => currentSubject.setValue(text.trim)
case "LastName" => {
if (currentAuthor != null)
currentAuthor.setLastName(text.trim)
}
case "ForeName" =>
if (currentAuthor != null)
currentAuthor.setForeName(text.trim)
case "Title" =>
if (currentJournal.getTitle == null)
currentJournal.setTitle(text.trim)
else
currentJournal.setTitle(currentJournal.getTitle + text.trim)
case _ =>
}
case _ =>
}
}
null
}
}

View File

@ -294,6 +294,24 @@ object PubMedToOaf {
author.setName(a.getForeName) author.setName(a.getForeName)
author.setSurname(a.getLastName) author.setSurname(a.getLastName)
author.setFullname(a.getFullName) author.setFullname(a.getFullName)
if (a.getIdentifier != null) {
author.setPid(
List(
OafMapperUtils.structuredProperty(
a.getIdentifier.getPid,
OafMapperUtils.qualifier(
a.getIdentifier.getType,
a.getIdentifier.getType,
ModelConstants.DNET_PID_TYPES,
ModelConstants.DNET_PID_TYPES
),
dataInfo
)
).asJava
)
}
if (a.getAffiliation != null)
author.setRawAffiliationString(List(a.getAffiliation.getName).asJava)
author.setRank(index + 1) author.setRank(index + 1)
author author
}(collection.breakOut) }(collection.breakOut)

View File

@ -28,8 +28,8 @@ import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.schema.action.AtomicAction; import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.common.ModelConstants; import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.Relation; import eu.dnetlib.dhp.schema.oaf.Relation;
import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions;
import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory; import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory;
import eu.dnetlib.dhp.schema.oaf.utils.PidCleaner;
public class PrepareAffiliationRelationsTest { public class PrepareAffiliationRelationsTest {
@ -39,8 +39,7 @@ public class PrepareAffiliationRelationsTest {
private static Path workingDir; private static Path workingDir;
private static final String ID_PREFIX = "50|doi_________::"; private static final String ID_PREFIX = "50|doi_________::";
private static final Logger log = LoggerFactory private static final Logger log = LoggerFactory.getLogger(PrepareAffiliationRelationsTest.class);
.getLogger(PrepareAffiliationRelationsTest.class);
@BeforeAll @BeforeAll
public static void beforeAll() throws IOException { public static void beforeAll() throws IOException {
@ -74,21 +73,34 @@ public class PrepareAffiliationRelationsTest {
@Test @Test
void testMatch() throws Exception { void testMatch() throws Exception {
String crossrefAffiliationRelationPath = getClass() String crossrefAffiliationRelationPathNew = getClass()
.getResource("/eu/dnetlib/dhp/actionmanager/bipaffiliations/doi_to_ror.json") .getResource("/eu/dnetlib/dhp/actionmanager/bipaffiliations/doi_to_ror.json")
.getPath(); .getPath();
String crossrefAffiliationRelationPath = getClass()
.getResource("/eu/dnetlib/dhp/actionmanager/bipaffiliations/doi_to_ror_old.json")
.getPath();
String publisherAffiliationRelationPath = getClass()
.getResource("/eu/dnetlib/dhp/actionmanager/bipaffiliations/publishers")
.getPath();
String publisherAffiliationRelationOldPath = getClass()
.getResource("/eu/dnetlib/dhp/actionmanager/bipaffiliations/publichers_old")
.getPath();
String outputPath = workingDir.toString() + "/actionSet"; String outputPath = workingDir.toString() + "/actionSet";
PrepareAffiliationRelations PrepareAffiliationRelations
.main( .main(
new String[] { new String[] {
"-isSparkSessionManaged", Boolean.FALSE.toString(), "-isSparkSessionManaged", Boolean.FALSE.toString(),
"-crossrefInputPath", crossrefAffiliationRelationPath, "-crossrefInputPath", crossrefAffiliationRelationPathNew,
"-pubmedInputPath", crossrefAffiliationRelationPath, "-pubmedInputPath", crossrefAffiliationRelationPath,
"-openapcInputPath", crossrefAffiliationRelationPath, "-openapcInputPath", crossrefAffiliationRelationPathNew,
"-dataciteInputPath", crossrefAffiliationRelationPath, "-dataciteInputPath", crossrefAffiliationRelationPathNew,
"-webCrawlInputPath", crossrefAffiliationRelationPath, "-webCrawlInputPath", crossrefAffiliationRelationPathNew,
"-publisherInputPath", publisherAffiliationRelationPath,
"-outputPath", outputPath "-outputPath", outputPath
}); });
@ -99,13 +111,8 @@ public class PrepareAffiliationRelationsTest {
.map(value -> OBJECT_MAPPER.readValue(value._2().toString(), AtomicAction.class)) .map(value -> OBJECT_MAPPER.readValue(value._2().toString(), AtomicAction.class))
.map(aa -> ((Relation) aa.getPayload())); .map(aa -> ((Relation) aa.getPayload()));
// for (Relation r : tmp.collect()) {
// System.out.println(
// r.getSource() + "\t" + r.getTarget() + "\t" + r.getRelType() + "\t" + r.getRelClass() + "\t" + r.getSubRelType() + "\t" + r.getValidationDate() + "\t" + r.getDataInfo().getTrust() + "\t" + r.getDataInfo().getInferred()
// );
// }
// count the number of relations // count the number of relations
assertEquals(120, tmp.count()); assertEquals(162, tmp.count());// 18 + 24 + 30 * 4 =
Dataset<Relation> dataset = spark.createDataset(tmp.rdd(), Encoders.bean(Relation.class)); Dataset<Relation> dataset = spark.createDataset(tmp.rdd(), Encoders.bean(Relation.class));
dataset.createOrReplaceTempView("result"); dataset.createOrReplaceTempView("result");
@ -116,7 +123,7 @@ public class PrepareAffiliationRelationsTest {
// verify that we have equal number of bi-directional relations // verify that we have equal number of bi-directional relations
Assertions Assertions
.assertEquals( .assertEquals(
60, execVerification 81, execVerification
.filter( .filter(
"relClass='" + ModelConstants.HAS_AUTHOR_INSTITUTION + "'") "relClass='" + ModelConstants.HAS_AUTHOR_INSTITUTION + "'")
.collectAsList() .collectAsList()
@ -124,26 +131,56 @@ public class PrepareAffiliationRelationsTest {
Assertions Assertions
.assertEquals( .assertEquals(
60, execVerification 81, execVerification
.filter( .filter(
"relClass='" + ModelConstants.IS_AUTHOR_INSTITUTION_OF + "'") "relClass='" + ModelConstants.IS_AUTHOR_INSTITUTION_OF + "'")
.collectAsList() .collectAsList()
.size()); .size());
// check confidence value of a specific relation // check confidence value of a specific relation
String sourceDOI = "10.1061/(asce)0733-9399(2002)128:7(759)"; String sourceDOI = "10.1089/10872910260066679";
final String sourceOpenaireId = ID_PREFIX final String sourceOpenaireId = ID_PREFIX
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", sourceDOI)); + IdentifierFactory.md5(PidCleaner.normalizePidValue("doi", sourceDOI));
Assertions Assertions
.assertEquals( .assertEquals(
"0.7071067812", execVerification "1.0", execVerification
.filter( .filter(
"source='" + sourceOpenaireId + "'") "source='" + sourceOpenaireId + "'")
.collectAsList() .collectAsList()
.get(0) .get(0)
.getString(4)); .getString(4));
final String publisherid = ID_PREFIX
+ IdentifierFactory.md5(PidCleaner.normalizePidValue("doi", "10.1089/10872910260066679"));
final String rorId = "20|ror_________::" + IdentifierFactory.md5("https://ror.org/05cf8a891");
Assertions
.assertEquals(
4, execVerification.filter("source = '" + publisherid + "' and target = '" + rorId + "'").count());
Assertions
.assertEquals(
1, execVerification
.filter(
"source = '" + ID_PREFIX
+ IdentifierFactory
.md5(PidCleaner.normalizePidValue("doi", "10.1007/s00217-010-1268-9"))
+ "' and target = '" + "20|ror_________::"
+ IdentifierFactory.md5("https://ror.org/03265fv13") + "'")
.count());
Assertions
.assertEquals(
1, execVerification
.filter(
"source = '" + ID_PREFIX
+ IdentifierFactory
.md5(PidCleaner.normalizePidValue("doi", "10.1007/3-540-47984-8_14"))
+ "' and target = '" + "20|ror_________::"
+ IdentifierFactory.md5("https://ror.org/00a0n9e72") + "'")
.count());
} }
} }

View File

@ -31,6 +31,7 @@ import eu.dnetlib.dhp.schema.oaf.Publication;
import eu.dnetlib.dhp.schema.oaf.Relation; import eu.dnetlib.dhp.schema.oaf.Relation;
import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions; import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions;
import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory; import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory;
import eu.dnetlib.dhp.schema.oaf.utils.PidCleaner;
public class CreateOpenCitationsASTest { public class CreateOpenCitationsASTest {
@ -280,17 +281,17 @@ public class CreateOpenCitationsASTest {
@Test @Test
void testRelationsSourceTargetCouple() throws Exception { void testRelationsSourceTargetCouple() throws Exception {
final String doi1 = "50|doi_________::" final String doi1 = "50|doi_________::"
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", "10.1007/s10854-015-3684-x")); + IdentifierFactory.md5(PidCleaner.normalizePidValue("doi", "10.1007/s10854-015-3684-x"));
final String doi2 = "50|doi_________::" final String doi2 = "50|doi_________::"
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", "10.1111/j.1551-2916.2008.02408.x")); + IdentifierFactory.md5(PidCleaner.normalizePidValue("doi", "10.1111/j.1551-2916.2008.02408.x"));
final String doi3 = "50|doi_________::" final String doi3 = "50|doi_________::"
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", "10.1007/s10854-014-2114-9")); + IdentifierFactory.md5(PidCleaner.normalizePidValue("doi", "10.1007/s10854-014-2114-9"));
final String doi4 = "50|doi_________::" final String doi4 = "50|doi_________::"
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", "10.1016/j.ceramint.2013.09.069")); + IdentifierFactory.md5(PidCleaner.normalizePidValue("doi", "10.1016/j.ceramint.2013.09.069"));
final String doi5 = "50|doi_________::" final String doi5 = "50|doi_________::"
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", "10.1007/s10854-009-9913-4")); + IdentifierFactory.md5(PidCleaner.normalizePidValue("doi", "10.1007/s10854-009-9913-4"));
final String doi6 = "50|doi_________::" final String doi6 = "50|doi_________::"
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", "10.1016/0038-1098(72)90370-5")); + IdentifierFactory.md5(PidCleaner.normalizePidValue("doi", "10.1016/0038-1098(72)90370-5"));
String inputPath = getClass() String inputPath = getClass()
.getResource( .getResource(

View File

@ -28,6 +28,7 @@ import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.Relation; import eu.dnetlib.dhp.schema.oaf.Relation;
import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions; import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions;
import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory; import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory;
import eu.dnetlib.dhp.schema.oaf.utils.PidCleaner;
/** /**
* @author miriam.baglioni * @author miriam.baglioni
@ -270,17 +271,17 @@ public class CreateTAActionSetTest {
@Test @Test
void testRelationsSourceTargetCouple() throws Exception { void testRelationsSourceTargetCouple() throws Exception {
final String doi1 = "50|doi_________::" final String doi1 = "50|doi_________::"
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", "10.1007/s10854-015-3684-x")); + IdentifierFactory.md5(PidCleaner.normalizePidValue("doi", "10.1007/s10854-015-3684-x"));
final String doi2 = "50|doi_________::" final String doi2 = "50|doi_________::"
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", "10.1111/j.1551-2916.2008.02408.x")); + IdentifierFactory.md5(PidCleaner.normalizePidValue("doi", "10.1111/j.1551-2916.2008.02408.x"));
final String doi3 = "50|doi_________::" final String doi3 = "50|doi_________::"
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", "10.1007/s10854-014-2114-9")); + IdentifierFactory.md5(PidCleaner.normalizePidValue("doi", "10.1007/s10854-014-2114-9"));
final String doi4 = "50|doi_________::" final String doi4 = "50|doi_________::"
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", "10.1016/j.ceramint.2013.09.069")); + IdentifierFactory.md5(PidCleaner.normalizePidValue("doi", "10.1016/j.ceramint.2013.09.069"));
final String doi5 = "50|doi_________::" final String doi5 = "50|doi_________::"
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", "10.1007/s10854-009-9913-4")); + IdentifierFactory.md5(PidCleaner.normalizePidValue("doi", "10.1007/s10854-009-9913-4"));
final String doi6 = "50|doi_________::" final String doi6 = "50|doi_________::"
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", "10.1016/0038-1098(72)90370-5")); + IdentifierFactory.md5(PidCleaner.normalizePidValue("doi", "10.1016/0038-1098(72)90370-5"));
String inputPath = getClass() String inputPath = getClass()
.getResource( .getResource(

View File

@ -50,7 +50,8 @@ public class OsfPreprintsCollectorPluginTest {
@Test @Test
@Disabled @Disabled
void test_one() throws CollectorException { void test_one() throws CollectorException {
this.plugin.collect(this.api, new AggregatorReport()) this.plugin
.collect(this.api, new AggregatorReport())
.limit(1) .limit(1)
.forEach(log::info); .forEach(log::info);
} }
@ -95,7 +96,8 @@ public class OsfPreprintsCollectorPluginTest {
final HttpConnector2 connector = new HttpConnector2(); final HttpConnector2 connector = new HttpConnector2();
try { try {
final String res = connector.getInputSource("https://api.osf.io/v2/preprints/ydtzx/contributors/?format=json"); final String res = connector
.getInputSource("https://api.osf.io/v2/preprints/ydtzx/contributors/?format=json");
System.out.println(res); System.out.println(res);
fail(); fail();
} catch (final Throwable e) { } catch (final Throwable e) {

View File

@ -0,0 +1,35 @@
package eu.dnetlib.dhp.collection.plugin.zenodo;
import static org.junit.jupiter.api.Assertions.assertNotNull;
import java.util.zip.GZIPInputStream;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.collection.ApiDescriptor;
import eu.dnetlib.dhp.common.collection.CollectorException;
public class ZenodoPluginCollectionTest {
@Test
public void testZenodoIterator() throws Exception {
final GZIPInputStream gis = new GZIPInputStream(
getClass().getResourceAsStream("/eu/dnetlib/dhp/collection/zenodo/zenodo.tar.gz"));
try (ZenodoTarIterator it = new ZenodoTarIterator(gis)) {
Assertions.assertTrue(it.hasNext());
int i = 0;
while (it.hasNext()) {
Assertions.assertNotNull(it.next());
i++;
}
Assertions.assertEquals(10, i);
}
}
}

View File

@ -1,9 +1,10 @@
{"DOI":"10.1061\/(asce)0733-9399(2002)128:7(759)","Matchings":[{"RORid":"https:\/\/ror.org\/03yxnpp24","Confidence":0.7071067812},{"RORid":"https:\/\/ror.org\/01teme464","Confidence":0.89}]} {"DOI":"10.1021\/ac020069k","Matchings":[{"PID":"ROR","Value":"https:\/\/ror.org\/01f5ytq51","Status":"active","Confidence":1}]}
{"DOI":"10.1105\/tpc.8.3.343","Matchings":[{"RORid":"https:\/\/ror.org\/02k40bc56","Confidence":0.7071067812}]} {"DOI":"10.1161\/01.cir.0000013846.72805.7e","Matchings":[{"PID":"ROR","Value":"https:\/\/ror.org\/02pttbw34","Status":"active","Confidence":1}]}
{"DOI":"10.1161\/01.cir.0000013305.01850.37","Matchings":[{"RORid":"https:\/\/ror.org\/00qjgza05","Confidence":1}]} {"DOI":"10.1161\/hy02t2.102992","Matchings":[{"PID":"ROR","Value":"https:\/\/ror.org\/00qqv6244","Status":"active","Confidence":1},{"PID":"ROR","Value":"https:\/\/ror.org\/00p991c53","Status":"active","Confidence":1}]}
{"DOI":"10.1142\/s021821650200186x","Matchings":[{"RORid":"https:\/\/ror.org\/035xkbk20","Confidence":1},{"RORid":"https:\/\/ror.org\/05apxxy63","Confidence":1}]} {"DOI":"10.1126\/science.1073633","Matchings":[{"PID":"ROR","Value":"https:\/\/ror.org\/03xez1567","Status":"active","Confidence":1},{"PID":"ROR","Value":"https:\/\/ror.org\/006w34k90","Status":"active","Confidence":1}]}
{"DOI":"10.1061\/(asce)0733-9372(2002)128:7(575)","Matchings":[{"RORid":"https:\/\/ror.org\/04j198w64","Confidence":0.82}]} {"DOI":"10.1089\/10872910260066679","Matchings":[{"PID":"ROR","Value":"https:\/\/ror.org\/05cf8a891","Status":"active","Confidence":1}]}
{"DOI":"10.1061\/(asce)0733-9372(2002)128:7(588)","Matchings":[{"RORid":"https:\/\/ror.org\/03m8km719","Confidence":0.8660254038},{"RORid":"https:\/\/ror.org\/02aze4h65","Confidence":0.87}]} {"DOI":"10.1108\/02656719610116117","Matchings":[{"PID":"ROR","Value":"https:\/\/ror.org\/03mnm0t94","Status":"active","Confidence":1},{"PID":"ROR","Value":"https:\/\/ror.org\/007tn5k56","Status":"active","Confidence":1}]}
{"DOI":"10.1161\/hy0202.103001","Matchings":[{"RORid":"https:\/\/ror.org\/057xtrt18","Confidence":0.7071067812}]} {"DOI":"10.1080\/01443610050111986","Matchings":[{"PID":"ROR","Value":"https:\/\/ror.org\/001x4vz59","Status":"active","Confidence":1},{"PID":"ROR","Value":"https:\/\/ror.org\/01tmqtf75","Status":"active","Confidence":1}]}
{"DOI": "10.1080/13669877.2015.1042504", "Matchings": [{"Confidence": 1.0, "RORid": "https://ror.org/03265fv13"}]} {"DOI":"10.1021\/cm020118+","Matchings":[{"PID":"ROR","Value":"https:\/\/ror.org\/02cf1je33","Confidence":1,"Status":"inactive"},{"PID":"ROR","Value":"https:\/\/ror.org\/01hvx5h04","Confidence":1,"Status":"active"}]}
{"DOI": "10.1007/3-540-47984-8_14", "Matchings": [{"Confidence": 1.0, "RORid": "https://ror.org/00a0n9e72"}]} {"DOI":"10.1161\/hc1202.104524","Matchings":[{"PID":"ROR","Value":"https:\/\/ror.org\/040r8fr65","Status":"active","Confidence":1},{"PID":"ROR","Value":"https:\/\/ror.org\/04fctr677","Status":"active","Confidence":1}]}
{"DOI":"10.1021\/ma011134f","Matchings":[{"PID":"ROR","Value":"https:\/\/ror.org\/04tj63d06","Status":"active","Confidence":1}]}

View File

@ -0,0 +1,9 @@
{"DOI":"10.1061\/(asce)0733-9399(2002)128:7(759)","Matchings":[{"RORid":"https:\/\/ror.org\/03yxnpp24","Confidence":0.7071067812},{"RORid":"https:\/\/ror.org\/01teme464","Confidence":0.89}]}
{"DOI":"10.1105\/tpc.8.3.343","Matchings":[{"RORid":"https:\/\/ror.org\/02k40bc56","Confidence":0.7071067812}]}
{"DOI":"10.1161\/01.cir.0000013305.01850.37","Matchings":[{"RORid":"https:\/\/ror.org\/00qjgza05","Confidence":1}]}
{"DOI":"10.1142\/s021821650200186x","Matchings":[{"RORid":"https:\/\/ror.org\/035xkbk20","Confidence":1},{"RORid":"https:\/\/ror.org\/05apxxy63","Confidence":1}]}
{"DOI":"10.1061\/(asce)0733-9372(2002)128:7(575)","Matchings":[{"RORid":"https:\/\/ror.org\/04j198w64","Confidence":0.82}]}
{"DOI":"10.1061\/(asce)0733-9372(2002)128:7(588)","Matchings":[{"RORid":"https:\/\/ror.org\/03m8km719","Confidence":0.8660254038},{"RORid":"https:\/\/ror.org\/02aze4h65","Confidence":0.87}]}
{"DOI":"10.1161\/hy0202.103001","Matchings":[{"RORid":"https:\/\/ror.org\/057xtrt18","Confidence":0.7071067812}]}
{"DOI": "10.1080/13669877.2015.1042504", "Matchings": [{"Confidence": 1.0, "RORid": "https://ror.org/03265fv13"}]}
{"DOI": "https://doi.org/10.1007/3-540-47984-8_14", "Matchings": [{"Confidence": 1.0, "RORid": "https://ror.org/00a0n9e72"}]}

View File

@ -0,0 +1,6 @@
{"DOI": "10.1007/s00217-010-1268-9", "Authors": [{"Name": {"Full": "Martin Zarnkow", "First": null, "Last": null}, "Raw_affiliations": ["TU M\u00fcnchen, Lehrstuhl f\u00fcr Brau- und Getr\u00e4nketechnologie"], "Organization_PIDs": []}, {"Name": {"Full": "Andrea Faltermaier", "First": null, "Last": null}, "Raw_affiliations": ["Lehrstuhl f\u00fcr Brau- und Getr\u00e4nketechnologie"], "Organization_PIDs": []}, {"Name": {"Full": "Werner Back", "First": null, "Last": null}, "Raw_affiliations": ["Lehrstuhl f\u00fcr Technologie der Brauerei I"], "Organization_PIDs": []}, {"Name": {"Full": "Martina Gastl", "First": null, "Last": null}, "Raw_affiliations": ["Lehrstuhl f\u00fcr Brau- und Getr\u00e4nketechnologie"], "Organization_PIDs": []}, {"Name": {"Full": "Elkek K. Arendt", "First": null, "Last": null}, "Raw_affiliations": ["University College Cork"], "Organization_PIDs": [{"RORid": "https://ror.org/03265fv13", "Confidence": 1}]}], "Organizations": [{"RORid": "https://ror.org/03265fv13", "Confidence": 1}]}
{"DOI": "10.1007/BF01154707", "Authors": [{"Name": {"Full": "Buggy, M.", "First": null, "Last": null}, "Raw_affiliations": ["Department of Materials Science and Technology, University of Limerick, Limerick, Ireland"], "Organization_PIDs": [{"RORid": "https://ror.org/00a0n9e72", "Confidence": 1}]}, {"Name": {"Full": "Carew, A.", "First": null, "Last": null}, "Raw_affiliations": ["Department of Materials Science and Technology, University of Limerick, Limerick, Ireland"], "Organization_PIDs": [{"RORid": "https://ror.org/00a0n9e72", "Confidence": 1}]}], "Organizations": [{"RORid": "https://ror.org/00a0n9e72", "Confidence": 1}]}
{"DOI": "10.1007/s10237-017-0974-7", "Authors": [{"Name": {"Full": "Donnacha J. McGrath", "First": null, "Last": null}, "Raw_affiliations": ["Biomechanics Research Centre (BMEC), Biomedical Engineering, College of Engineering and Informatics, NUI Galway, Galway, Ireland"], "Organization_PIDs": [{"RORid": "https://ror.org/03bea9k73", "Confidence": 1}]}, {"Name": {"Full": "Anja Lena Thiebes", "First": null, "Last": null}, "Raw_affiliations": ["Department of Biohybrid and Medical Textiles (BioTex), AME-Helmholtz Institute for Biomedical Engineering, ITA-Institut f\u00fcr Textiltechnik, RWTH Aachen University and at AMIBM Maastricht University, Maastricht, The Netherlands, Aachen, Germany"], "Organization_PIDs": [{"RORid": "https://ror.org/02jz4aj89", "Confidence": 0.82}, {"RORid": "https://ror.org/04xfq0f34", "Confidence": 0.87}]}, {"Name": {"Full": "Christian G. Cornelissen", "First": null, "Last": null}, "Raw_affiliations": ["Department of Biohybrid and Medical Textiles (BioTex), AME-Helmholtz Institute for Biomedical Engineering, ITA-Institut f\u00fcr Textiltechnik, RWTH Aachen University and at AMIBM Maastricht University, Maastricht, The Netherlands, Aachen, Germany"], "Organization_PIDs": [{"RORid": "https://ror.org/02jz4aj89", "Confidence": 0.82}, {"RORid": "https://ror.org/04xfq0f34", "Confidence": 0.87}]}, {"Name": {"Full": "Barry O\u2019Brien", "First": null, "Last": null}, "Raw_affiliations": ["Department for Internal Medicine \u2013 Section for Pneumology, Medical Faculty, RWTH Aachen University, Aachen, Germany"], "Organization_PIDs": [{"RORid": "https://ror.org/04xfq0f34", "Confidence": 1}]}, {"Name": {"Full": "Stefan Jockenhoevel", "First": null, "Last": null}, "Raw_affiliations": ["Biomechanics Research Centre (BMEC), Biomedical Engineering, College of Engineering and Informatics, NUI Galway, Galway, Ireland"], "Organization_PIDs": [{"RORid": "https://ror.org/03bea9k73", "Confidence": 1}]}, {"Name": {"Full": "Mark Bruzzi", "First": null, "Last": null}, "Raw_affiliations": ["Department of Biohybrid and Medical Textiles (BioTex), AME-Helmholtz Institute for Biomedical Engineering, ITA-Institut f\u00fcr Textiltechnik, RWTH Aachen University and at AMIBM Maastricht University, Maastricht, The Netherlands, Aachen, Germany"], "Organization_PIDs": [{"RORid": "https://ror.org/02jz4aj89", "Confidence": 0.82}, {"RORid": "https://ror.org/04xfq0f34", "Confidence": 0.87}]}, {"Name": {"Full": "Peter E. McHugh", "First": null, "Last": null}, "Raw_affiliations": ["Biomechanics Research Centre (BMEC), Biomedical Engineering, College of Engineering and Informatics, NUI Galway, Galway, Ireland"], "Organization_PIDs": [{"RORid": "https://ror.org/03bea9k73", "Confidence": 1}]}], "Organizations": [{"RORid": "https://ror.org/03bea9k73", "Confidence": 1}, {"RORid": "https://ror.org/02jz4aj89", "Confidence": 0.82}, {"RORid": "https://ror.org/04xfq0f34", "Confidence": 0.87}, {"RORid": "https://ror.org/04xfq0f34", "Confidence": 1}]}
{"DOI": "10.1007/BF03168973", "Authors": [{"Name": {"Full": "Sheehan, G.", "First": null, "Last": null}, "Raw_affiliations": ["Dept of Infectious Diseases, Mater Misercordiae Hospital, Dublin 7"], "Organization_PIDs": []}, {"Name": {"Full": "Chew, N.", "First": null, "Last": null}, "Raw_affiliations": ["Dept of Infectious Diseases, Mater Misercordiae Hospital, Dublin 7"], "Organization_PIDs": []}], "Organizations": []}
{"DOI": "10.1007/s00338-009-0480-1", "Authors": [{"Name": {"Full": "Gleason, D. F.", "First": null, "Last": null}, "Raw_affiliations": ["Department of Biology, Georgia Southern University, Statesboro, USA"], "Organization_PIDs": [{"RORid": "https://ror.org/04agmb972", "Confidence": 1}]}, {"Name": {"Full": "Danilowicz, B. S.", "First": null, "Last": null}, "Raw_affiliations": ["Department of Biology, Georgia Southern University, Statesboro, USA"], "Organization_PIDs": [{"RORid": "https://ror.org/04agmb972", "Confidence": 1}]}, {"Name": {"Full": "Nolan, C. J.", "First": null, "Last": null}, "Raw_affiliations": ["School of Biology and Environmental Science, University College Dublin, Dublin 4, Ireland"], "Organization_PIDs": [{"RORid": "https://ror.org/05m7pjf47", "Confidence": 1}]}], "Organizations": [{"RORid": "https://ror.org/04agmb972", "Confidence": 1}, {"RORid": "https://ror.org/05m7pjf47", "Confidence": 1}]}
{"DOI": "10.1007/s10993-010-9187-y", "Authors": [{"Name": {"Full": "Martin Howard", "First": null, "Last": null}, "Raw_affiliations": ["University College Cork"], "Organization_PIDs": [{"RORid": "https://ror.org/03265fv13", "Confidence": 1}]}], "Organizations": [{"RORid": "https://ror.org/03265fv13", "Confidence": 1}]}

View File

@ -0,0 +1,6 @@
{"DOI": "10.1007/s00217-010-1268-9", "Authors": [{"Name": {"Full": "Martin Zarnkow", "First": null, "Last": null}, "Raw_affiliations": ["TU M\u00fcnchen, Lehrstuhl f\u00fcr Brau- und Getr\u00e4nketechnologie"], "Organization_PIDs": []}, {"Name": {"Full": "Andrea Faltermaier", "First": null, "Last": null}, "Raw_affiliations": ["Lehrstuhl f\u00fcr Brau- und Getr\u00e4nketechnologie"], "Organization_PIDs": []}, {"Name": {"Full": "Werner Back", "First": null, "Last": null}, "Raw_affiliations": ["Lehrstuhl f\u00fcr Technologie der Brauerei I"], "Organization_PIDs": []}, {"Name": {"Full": "Martina Gastl", "First": null, "Last": null}, "Raw_affiliations": ["Lehrstuhl f\u00fcr Brau- und Getr\u00e4nketechnologie"], "Organization_PIDs": []}, {"Name": {"Full": "Elkek K. Arendt", "First": null, "Last": null}, "Raw_affiliations": ["University College Cork"], "Organization_PIDs": [{"Value": "https://ror.org/03265fv13", "Confidence": 1}]}], "Organizations": [{"Provenance":"AffRo","PID":"ROR","Status":"active","Value": "https://ror.org/03265fv13", "Confidence": 1}]}
{"DOI": "10.1007/BF01154707", "Authors": [{"Name": {"Full": "Buggy, M.", "First": null, "Last": null}, "Raw_affiliations": ["Department of Materials Science and Technology, University of Limerick, Limerick, Ireland"], "Organization_PIDs": [{"Value": "https://ror.org/00a0n9e72", "Confidence": 1}]}, {"Name": {"Full": "Carew, A.", "First": null, "Last": null}, "Raw_affiliations": ["Department of Materials Science and Technology, University of Limerick, Limerick, Ireland"], "Organization_PIDs": [{"Value": "https://ror.org/00a0n9e72", "Confidence": 1}]}], "Organizations": [{"Provenance":"AffRo","PID":"ROR","Status":"active","Value": "https://ror.org/00a0n9e72", "Confidence": 1}]}
{"DOI": "10.1007/s10237-017-0974-7", "Authors": [{"Name": {"Full": "Donnacha J. McGrath", "First": null, "Last": null}, "Raw_affiliations": ["Biomechanics Research Centre (BMEC), Biomedical Engineering, College of Engineering and Informatics, NUI Galway, Galway, Ireland"], "Organization_PIDs": [{"Value": "https://ror.org/03bea9k73", "Confidence": 1}]}, {"Name": {"Full": "Anja Lena Thiebes", "First": null, "Last": null}, "Raw_affiliations": ["Department of Biohybrid and Medical Textiles (BioTex), AME-Helmholtz Institute for Biomedical Engineering, ITA-Institut f\u00fcr Textiltechnik, RWTH Aachen University and at AMIBM Maastricht University, Maastricht, The Netherlands, Aachen, Germany"], "Organization_PIDs": [{"Value": "https://ror.org/02jz4aj89", "Confidence": 0.82}, {"Value": "https://ror.org/04xfq0f34", "Confidence": 0.87}]}, {"Name": {"Full": "Christian G. Cornelissen", "First": null, "Last": null}, "Raw_affiliations": ["Department of Biohybrid and Medical Textiles (BioTex), AME-Helmholtz Institute for Biomedical Engineering, ITA-Institut f\u00fcr Textiltechnik, RWTH Aachen University and at AMIBM Maastricht University, Maastricht, The Netherlands, Aachen, Germany"], "Organization_PIDs": [{"Value": "https://ror.org/02jz4aj89", "Confidence": 0.82}, {"Value": "https://ror.org/04xfq0f34", "Confidence": 0.87}]}, {"Name": {"Full": "Barry O\u2019Brien", "First": null, "Last": null}, "Raw_affiliations": ["Department for Internal Medicine \u2013 Section for Pneumology, Medical Faculty, RWTH Aachen University, Aachen, Germany"], "Organization_PIDs": [{"Value": "https://ror.org/04xfq0f34", "Confidence": 1}]}, {"Name": {"Full": "Stefan Jockenhoevel", "First": null, "Last": null}, "Raw_affiliations": ["Biomechanics Research Centre (BMEC), Biomedical Engineering, College of Engineering and Informatics, NUI Galway, Galway, Ireland"], "Organization_PIDs": [{"Value": "https://ror.org/03bea9k73", "Confidence": 1}]}, {"Name": {"Full": "Mark Bruzzi", "First": null, "Last": null}, "Raw_affiliations": ["Department of Biohybrid and Medical Textiles (BioTex), AME-Helmholtz Institute for Biomedical Engineering, ITA-Institut f\u00fcr Textiltechnik, RWTH Aachen University and at AMIBM Maastricht University, Maastricht, The Netherlands, Aachen, Germany"], "Organization_PIDs": [{"Value": "https://ror.org/02jz4aj89", "Confidence": 0.82}, {"Value": "https://ror.org/04xfq0f34", "Confidence": 0.87}]}, {"Name": {"Full": "Peter E. McHugh", "First": null, "Last": null}, "Raw_affiliations": ["Biomechanics Research Centre (BMEC), Biomedical Engineering, College of Engineering and Informatics, NUI Galway, Galway, Ireland"], "Organization_PIDs": [{"Value": "https://ror.org/03bea9k73", "Confidence": 1}]}], "Organizations": [{"Provenance":"AffRo","PID":"ROR","Status":"active","Value": "https://ror.org/03bea9k73", "Confidence": 1}, {"Provenance":"AffRo","PID":"ROR","Status":"active","Value": "https://ror.org/02jz4aj89", "Confidence": 0.82}, {"Provenance":"AffRo","PID":"ROR","Status":"active","Value": "https://ror.org/04xfq0f34", "Confidence": 0.87}, {"Provenance":"AffRo","PID":"ROR","Status":"active","Value": "https://ror.org/04xfq0f34", "Confidence": 1}]}
{"DOI": "10.1007/BF03168973", "Authors": [{"Name": {"Full": "Sheehan, G.", "First": null, "Last": null}, "Raw_affiliations": ["Dept of Infectious Diseases, Mater Misercordiae Hospital, Dublin 7"], "Organization_PIDs": []}, {"Name": {"Full": "Chew, N.", "First": null, "Last": null}, "Raw_affiliations": ["Dept of Infectious Diseases, Mater Misercordiae Hospital, Dublin 7"], "Organization_PIDs": []}], "Organizations": []}
{"DOI": "10.1007/s00338-009-0480-1", "Authors": [{"Name": {"Full": "Gleason, D. F.", "First": null, "Last": null}, "Raw_affiliations": ["Department of Biology, Georgia Southern University, Statesboro, USA"], "Organization_PIDs": [{"Value": "https://ror.org/04agmb972", "Confidence": 1}]}, {"Name": {"Full": "Danilowicz, B. S.", "First": null, "Last": null}, "Raw_affiliations": ["Department of Biology, Georgia Southern University, Statesboro, USA"], "Organization_PIDs": [{"Value": "https://ror.org/04agmb972", "Confidence": 1}]}, {"Name": {"Full": "Nolan, C. J.", "First": null, "Last": null}, "Raw_affiliations": ["School of Biology and Environmental Science, University College Dublin, Dublin 4, Ireland"], "Organization_PIDs": [{"Value": "https://ror.org/05m7pjf47", "Confidence": 1}]}], "Organizations": [{"Provenance":"AffRo","PID":"ROR","Status":"active","Value": "https://ror.org/04agmb972", "Confidence": 1}, {"Provenance":"AffRo","PID":"ROR","Status":"active","Value": "https://ror.org/05m7pjf47", "Confidence": 1}]}
{"DOI": "10.1007/s10993-010-9187-y", "Authors": [{"Name": {"Full": "Martin Howard", "First": null, "Last": null}, "Raw_affiliations": ["University College Cork"], "Organization_PIDs": [{"Value": "https://ror.org/03265fv13", "Confidence": 1}]}], "Organizations": [{"PID":"ROR","Status":"active","Value": "https://ror.org/03265fv13", "Confidence": 1}]}

View File

@ -0,0 +1,157 @@
<PubmedArticle>
<MedlineCitation Status="MEDLINE" IndexingMethod="Curated" Owner="NLM">
<PMID Version="1">37318999</PMID>
<DateCompleted>
<Year>2024</Year>
<Month>02</Month>
<Day>09</Day>
</DateCompleted>
<DateRevised>
<Year>2024</Year>
<Month>02</Month>
<Day>09</Day>
</DateRevised>
<Article PubModel="Print-Electronic">
<Journal>
<ISSN IssnType="Electronic">1522-1229</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>47</Volume>
<Issue>3</Issue>
<PubDate>
<Year>2023</Year>
<Month>Sep</Month>
<Day>01</Day>
</PubDate>
</JournalIssue>
<Title>Advances in physiology education</Title>
<ISOAbbreviation>Adv Physiol Educ</ISOAbbreviation>
</Journal>
<ArticleTitle>Providing the choice of in-person or videoconference attendance in a clinical physiology course may harm learning outcomes for the entire cohort.</ArticleTitle>
<Pagination>
<MedlinePgn>548-556</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1152/advan.00160.2022</ELocationID>
<Abstract>
<AbstractText>Clinical Physiology 1 and 2 are flipped classes in which students watch prerecorded videos before class. During the 3-h class, students take practice assessments, work in groups on critical thinking exercises, work through case studies, and engage in drawing exercises. Due to the COVID pandemic, these courses were transitioned from in-person classes to online classes. Despite the university's return-to-class policy, some students were reluctant to return to in-person classes; therefore during the 2021-2022 academic year, Clinical Physiology 1 and 2 were offered as flipped, hybrid courses. In a hybrid format, students either attended the synchronous class in person or online. Here we evaluate the learning outcomes and the perceptions of the learning experience for students who attended Clinical Physiology 1 and 2 either online (2020-2021) or in a hybrid format (2021-2022). In addition to exam scores, in-class surveys and end of course evaluations were compiled to describe the student experience in the flipped hybrid setting. Retrospective linear mixed-model regression analysis of exam scores revealed that a hybrid modality (2021-2022) was associated with lower exam scores when controlling for sex, graduate/undergraduate status, delivery method, and the order in which the courses were taken (<i>F</i> test: <i>F</i> = 8.65, df1 = 2, df2 = 179.28, <i>P</i> = 0.0003). In addition, being a Black Indigenous Person of Color (BIPOC) student is associated with a lower exam score, controlling for the same previous factors (<i>F</i> test: <i>F</i> = 4.23, df1 = 1, df2 = 130.28, <i>P</i> = 0.04), albeit with lower confidence; the BIPOC representation in this sample is small (BIPOC: <i>n</i> = 144; total: <i>n</i> = 504). There is no significant interaction between the hybrid modality and race, meaning that BIPOC and White students are both negatively affected in a hybrid flipped course. Instructors should consider carefully about offering hybrid courses and build in extra student support.<b>NEW &amp; NOTEWORTHY</b> The transition from online to in-person teaching has been as challenging as the original transition to remote teaching with the onset of the pandemic. Since not all students were ready to return to the classroom, students could choose to take this course in person or online. This arrangement provided flexibility and opportunities for innovative class activities for students but introduced tradeoffs in lower test scores from the hybrid modality than fully online or fully in-person modalities.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Anderson</LastName>
<ForeName>Lisa Carney</ForeName>
<Initials>LC</Initials>
<Identifier Source="ORCID">0000-0003-2261-1921</Identifier>
<AffiliationInfo>
<Affiliation>Department of Integrative Biology and Physiology, University of Minnesota, Minneapolis, Minnesota, United States.</Affiliation>
<Identifier Source="ROR">https://ror.org/017zqws13</Identifier>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Jacobson</LastName>
<ForeName>Tate</ForeName>
<Initials>T</Initials>
<AffiliationInfo>
<Affiliation>Department of Statistics, University of Minnesota, Minneapolis, Minnesota, United States.</Affiliation>
</AffiliationInfo>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2023</Year>
<Month>06</Month>
<Day>15</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>United States</Country>
<MedlineTA>Adv Physiol Educ</MedlineTA>
<NlmUniqueID>100913944</NlmUniqueID>
<ISSNLinking>1043-4046</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D010827" MajorTopicYN="Y">Physiology</DescriptorName>
<QualifierName UI="Q000193" MajorTopicYN="N">education</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012189" MajorTopicYN="N">Retrospective Studies</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D007858" MajorTopicYN="N">Learning</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D058873" MajorTopicYN="N">Pandemics</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D000086382" MajorTopicYN="N">COVID-19</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012044" MajorTopicYN="N">Regression Analysis</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D013334" MajorTopicYN="N">Students</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D008297" MajorTopicYN="N">Male</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D005260" MajorTopicYN="N">Female</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D044465" MajorTopicYN="N">White People</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D044383" MajorTopicYN="N">Black People</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D020375" MajorTopicYN="N">Education, Distance</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D003479" MajorTopicYN="N">Curriculum</DescriptorName>
</MeshHeading>
</MeshHeadingList>
<KeywordList Owner="NOTNLM">
<Keyword MajorTopicYN="N">flipped teaching</Keyword>
<Keyword MajorTopicYN="N">hybrid teaching</Keyword>
<Keyword MajorTopicYN="N">inequity</Keyword>
<Keyword MajorTopicYN="N">learning outcomes</Keyword>
<Keyword MajorTopicYN="N">responsive teaching</Keyword>
</KeywordList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="medline">
<Year>2023</Year>
<Month>7</Month>
<Day>21</Day>
<Hour>6</Hour>
<Minute>44</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2023</Year>
<Month>6</Month>
<Day>15</Day>
<Hour>19</Hour>
<Minute>14</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2023</Year>
<Month>6</Month>
<Day>15</Day>
<Hour>12</Hour>
<Minute>53</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">37318999</ArticleId>
<ArticleId IdType="doi">10.1152/advan.00160.2022</ArticleId>
</ArticleIdList>
</PubmedData>
</PubmedArticle>

View File

@ -5,7 +5,10 @@ import eu.dnetlib.dhp.aggregation.AbstractVocabularyTest
import eu.dnetlib.dhp.schema.oaf.utils.PidType import eu.dnetlib.dhp.schema.oaf.utils.PidType
import eu.dnetlib.dhp.schema.oaf.{Oaf, Publication, Relation, Result} import eu.dnetlib.dhp.schema.oaf.{Oaf, Publication, Relation, Result}
import eu.dnetlib.dhp.sx.bio.BioDBToOAF.ScholixResolved import eu.dnetlib.dhp.sx.bio.BioDBToOAF.ScholixResolved
import eu.dnetlib.dhp.sx.bio.pubmed.{PMArticle, PMParser, PMSubject, PubMedToOaf} import eu.dnetlib.dhp.sx.bio.ebi.SparkCreatePubmedDump
import eu.dnetlib.dhp.sx.bio.pubmed._
import org.apache.commons.io.IOUtils
import org.apache.spark.sql.SparkSession
import org.json4s.DefaultFormats import org.json4s.DefaultFormats
import org.json4s.JsonAST.{JField, JObject, JString} import org.json4s.JsonAST.{JField, JObject, JString}
import org.json4s.jackson.JsonMethods.parse import org.json4s.jackson.JsonMethods.parse
@ -13,14 +16,16 @@ import org.junit.jupiter.api.Assertions._
import org.junit.jupiter.api.extension.ExtendWith import org.junit.jupiter.api.extension.ExtendWith
import org.junit.jupiter.api.{BeforeEach, Test} import org.junit.jupiter.api.{BeforeEach, Test}
import org.mockito.junit.jupiter.MockitoExtension import org.mockito.junit.jupiter.MockitoExtension
import org.slf4j.LoggerFactory
import java.io.{BufferedReader, InputStream, InputStreamReader} import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.util.regex.Pattern
import java.util.zip.GZIPInputStream import java.util.zip.GZIPInputStream
import javax.xml.stream.XMLInputFactory import javax.xml.stream.XMLInputFactory
import scala.collection.JavaConverters._ import scala.collection.JavaConverters._
import scala.collection.mutable
import scala.collection.mutable.ListBuffer import scala.collection.mutable.ListBuffer
import scala.io.Source import scala.io.Source
import scala.xml.pull.XMLEventReader
@ExtendWith(Array(classOf[MockitoExtension])) @ExtendWith(Array(classOf[MockitoExtension]))
class BioScholixTest extends AbstractVocabularyTest { class BioScholixTest extends AbstractVocabularyTest {
@ -48,6 +53,76 @@ class BioScholixTest extends AbstractVocabularyTest {
} }
} }
@Test
def testPid(): Unit = {
val pids = List(
"0000000163025705",
"000000018494732X",
"0000000308873343",
"0000000335964515",
"0000000333457333",
"0000000335964515",
"0000000302921949",
"http://orcid.org/0000-0001-8567-3543",
"http://orcid.org/0000-0001-7868-8528",
"0000-0001-9189-1440",
"0000-0003-3727-9247",
"0000-0001-7246-1058",
"000000033962389X",
"0000000330371470",
"0000000171236123",
"0000000272569752",
"0000000293231371",
"http://orcid.org/0000-0003-3345-7333",
"0000000340145688",
"http://orcid.org/0000-0003-4894-1689"
)
pids.foreach(pid => {
val pidCleaned = new PMIdentifier(pid, "ORCID").getPid
// assert pid is in the format of ORCID
println(pidCleaned)
assertTrue(pidCleaned.matches("[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]"))
})
}
def extractAffiliation(s: String): List[String] = {
val regex: String = "<Affiliation>(.*)<\\/Affiliation>"
val pattern = Pattern.compile(regex, Pattern.MULTILINE)
val matcher = pattern.matcher(s)
val l: mutable.ListBuffer[String] = mutable.ListBuffer()
while (matcher.find()) {
l += matcher.group(1)
}
l.toList
}
case class AuthorPID(pidType: String, pid: String) {}
def extractAuthorIdentifier(s: String): List[AuthorPID] = {
val regex: String = "<Identifier Source=\"(.*)\">(.*)<\\/Identifier>"
val pattern = Pattern.compile(regex, Pattern.MULTILINE)
val matcher = pattern.matcher(s)
val l: mutable.ListBuffer[AuthorPID] = mutable.ListBuffer()
while (matcher.find()) {
l += AuthorPID(pidType = matcher.group(1), pid = matcher.group(2))
}
l.toList
}
@Test
def testParsingPubmed2(): Unit = {
val mapper = new ObjectMapper()
val xml = IOUtils.toString(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/single_pubmed.xml"))
val parser = new PMParser2()
val article = parser.parse(xml)
// println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(article))
println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(PubMedToOaf.convert(article, vocabularies)))
}
@Test @Test
def testEBIData() = { def testEBIData() = {
val inputFactory = XMLInputFactory.newInstance val inputFactory = XMLInputFactory.newInstance
@ -124,6 +199,14 @@ class BioScholixTest extends AbstractVocabularyTest {
} }
} }
def testPubmedSplitting(): Unit = {
val spark: SparkSession = SparkSession.builder().appName("test").master("local").getOrCreate()
new SparkCreatePubmedDump("", Array.empty, LoggerFactory.getLogger(getClass))
.createPubmedDump(spark, "/home/sandro/Downloads/pubmed", "/home/sandro/Downloads/pubmed_mapped", vocabularies)
}
@Test @Test
def testPubmedOriginalID(): Unit = { def testPubmedOriginalID(): Unit = {
val article: PMArticle = new PMArticle val article: PMArticle = new PMArticle

View File

@ -63,6 +63,7 @@
<path start="copy_software"/> <path start="copy_software"/>
<path start="copy_datasource"/> <path start="copy_datasource"/>
<path start="copy_project"/> <path start="copy_project"/>
<path start="copy_person"/>
<path start="copy_organization"/> <path start="copy_organization"/>
</fork> </fork>
@ -120,6 +121,15 @@
<error to="Kill"/> <error to="Kill"/>
</action> </action>
<action name="copy_person">
<distcp xmlns="uri:oozie:distcp-action:0.2">
<arg>${nameNode}/${sourcePath}/person</arg>
<arg>${nameNode}/${outputPath}/person</arg>
</distcp>
<ok to="wait"/>
<error to="Kill"/>
</action>
<action name="copy_datasource"> <action name="copy_datasource">
<distcp xmlns="uri:oozie:distcp-action:0.2"> <distcp xmlns="uri:oozie:distcp-action:0.2">
<arg>${nameNode}/${sourcePath}/datasource</arg> <arg>${nameNode}/${sourcePath}/datasource</arg>

View File

@ -70,9 +70,8 @@ public class PrepareRelatedProjectsJob {
final Dataset<Relation> rels = ClusterUtils final Dataset<Relation> rels = ClusterUtils
.loadRelations(graphPath, spark) .loadRelations(graphPath, spark)
.filter((FilterFunction<Relation>) r -> r.getDataInfo().getDeletedbyinference()) .filter((FilterFunction<Relation>) r -> ModelConstants.RESULT_PROJECT.equals(r.getRelType()))
.filter((FilterFunction<Relation>) r -> r.getRelType().equals(ModelConstants.RESULT_PROJECT)) .filter((FilterFunction<Relation>) r -> !BrokerConstants.IS_MERGED_IN_CLASS.equals(r.getRelClass()))
.filter((FilterFunction<Relation>) r -> !r.getRelClass().equals(BrokerConstants.IS_MERGED_IN_CLASS))
.filter((FilterFunction<Relation>) r -> !ClusterUtils.isDedupRoot(r.getSource())) .filter((FilterFunction<Relation>) r -> !ClusterUtils.isDedupRoot(r.getSource()))
.filter((FilterFunction<Relation>) r -> !ClusterUtils.isDedupRoot(r.getTarget())); .filter((FilterFunction<Relation>) r -> !ClusterUtils.isDedupRoot(r.getTarget()));

View File

@ -53,7 +53,7 @@ public class EnrichMoreSubject extends UpdateMatcher<OaBrokerTypedValue> {
.collect(Collectors.toSet()); .collect(Collectors.toSet());
return source return source
.getPids() .getSubjects()
.stream() .stream()
.filter(s -> !existingSubjects.contains(subjectAsString(s))) .filter(s -> !existingSubjects.contains(subjectAsString(s)))
.collect(Collectors.toList()); .collect(Collectors.toList());

View File

@ -0,0 +1,60 @@
package eu.dnetlib.dhp.broker.oa.matchers.simple;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;
import java.util.Arrays;
import java.util.List;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import eu.dnetlib.broker.objects.OaBrokerMainEntity;
import eu.dnetlib.broker.objects.OaBrokerTypedValue;
public class EnrichMoreSubjectTest {
final EnrichMoreSubject matcher = new EnrichMoreSubject();
@BeforeEach
void setUp() throws Exception {
}
@Test
void testFindDifferences_1() {
final OaBrokerMainEntity source = new OaBrokerMainEntity();
final OaBrokerMainEntity target = new OaBrokerMainEntity();
final List<OaBrokerTypedValue> list = this.matcher.findDifferences(source, target);
assertTrue(list.isEmpty());
}
@Test
void testFindDifferences_2() {
final OaBrokerMainEntity source = new OaBrokerMainEntity();
final OaBrokerMainEntity target = new OaBrokerMainEntity();
source.setSubjects(Arrays.asList(new OaBrokerTypedValue("arxiv", "subject_01")));
final List<OaBrokerTypedValue> list = this.matcher.findDifferences(source, target);
assertEquals(1, list.size());
}
@Test
void testFindDifferences_3() {
final OaBrokerMainEntity source = new OaBrokerMainEntity();
final OaBrokerMainEntity target = new OaBrokerMainEntity();
target.setSubjects(Arrays.asList(new OaBrokerTypedValue("arxiv", "subject_01")));
final List<OaBrokerTypedValue> list = this.matcher.findDifferences(source, target);
assertTrue(list.isEmpty());
}
@Test
void testFindDifferences_4() {
final OaBrokerMainEntity source = new OaBrokerMainEntity();
final OaBrokerMainEntity target = new OaBrokerMainEntity();
source.setSubjects(Arrays.asList(new OaBrokerTypedValue("arxiv", "subject_01")));
target.setSubjects(Arrays.asList(new OaBrokerTypedValue("arxiv", "subject_01")));
final List<OaBrokerTypedValue> list = this.matcher.findDifferences(source, target);
assertTrue(list.isEmpty());
}
}

View File

@ -2,14 +2,13 @@
package eu.dnetlib.dhp.oa.dedup; package eu.dnetlib.dhp.oa.dedup;
import java.util.*; import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.Stream; import java.util.stream.Stream;
import org.apache.commons.beanutils.BeanUtils; import org.apache.commons.beanutils.BeanUtils;
import org.apache.commons.lang3.StringUtils; import org.apache.commons.lang3.StringUtils;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.FlatMapGroupsFunction; import org.apache.spark.api.java.function.FlatMapGroupsFunction;
import org.apache.spark.api.java.function.MapFunction; import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.ReduceFunction;
import org.apache.spark.sql.*; import org.apache.spark.sql.*;
import eu.dnetlib.dhp.oa.dedup.model.Identifier; import eu.dnetlib.dhp.oa.dedup.model.Identifier;
@ -107,6 +106,8 @@ public class DedupRecordFactory {
final HashSet<String> acceptanceDate = new HashSet<>(); final HashSet<String> acceptanceDate = new HashSet<>();
boolean isVisible = false;
while (it.hasNext()) { while (it.hasNext()) {
Tuple3<String, String, OafEntity> t = it.next(); Tuple3<String, String, OafEntity> t = it.next();
OafEntity entity = t._3(); OafEntity entity = t._3();
@ -114,6 +115,7 @@ public class DedupRecordFactory {
if (entity == null) { if (entity == null) {
aliases.add(t._2()); aliases.add(t._2());
} else { } else {
isVisible = isVisible || !entity.getDataInfo().getInvisible();
cliques.add(entity); cliques.add(entity);
if (acceptanceDate.size() < MAX_ACCEPTANCE_DATE) { if (acceptanceDate.size() < MAX_ACCEPTANCE_DATE) {
@ -129,13 +131,20 @@ public class DedupRecordFactory {
} }
if (acceptanceDate.size() >= MAX_ACCEPTANCE_DATE || cliques.isEmpty()) { if (!isVisible || acceptanceDate.size() >= MAX_ACCEPTANCE_DATE || cliques.isEmpty()) {
return Collections.emptyIterator(); return Collections.emptyIterator();
} }
OafEntity mergedEntity = MergeUtils.mergeGroup(dedupId, cliques.iterator()); OafEntity mergedEntity = MergeUtils.mergeGroup(cliques.iterator());
// dedup records do not have date of transformation attribute // dedup records do not have date of transformation attribute
mergedEntity.setDateoftransformation(null); mergedEntity.setDateoftransformation(null);
mergedEntity
.setMergedIds(
Stream
.concat(cliques.stream().map(OafEntity::getId), aliases.stream())
.distinct()
.sorted()
.collect(Collectors.toList()));
return Stream return Stream
.concat( .concat(

View File

@ -91,7 +91,6 @@ public class SparkBlockStats extends AbstractSparkAction {
.read() .read()
.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity)) .textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
.transform(deduper.model().parseJsonDataset()) .transform(deduper.model().parseJsonDataset())
.transform(deduper.filterAndCleanup())
.transform(deduper.generateClustersWithCollect()) .transform(deduper.generateClustersWithCollect())
.filter(functions.size(new Column("block")).geq(1)); .filter(functions.size(new Column("block")).geq(1));

View File

@ -5,11 +5,11 @@ import static eu.dnetlib.dhp.schema.common.ModelConstants.DNET_PROVENANCE_ACTION
import static eu.dnetlib.dhp.schema.common.ModelConstants.PROVENANCE_DEDUP; import static eu.dnetlib.dhp.schema.common.ModelConstants.PROVENANCE_DEDUP;
import java.io.IOException; import java.io.IOException;
import java.util.Arrays;
import org.apache.commons.io.IOUtils; import org.apache.commons.io.IOUtils;
import org.apache.spark.SparkConf; import org.apache.spark.SparkConf;
import org.apache.spark.sql.SaveMode; import org.apache.spark.sql.*;
import org.apache.spark.sql.SparkSession;
import org.dom4j.DocumentException; import org.dom4j.DocumentException;
import org.slf4j.Logger; import org.slf4j.Logger;
import org.slf4j.LoggerFactory; import org.slf4j.LoggerFactory;
@ -17,6 +17,7 @@ import org.xml.sax.SAXException;
import eu.dnetlib.dhp.application.ArgumentApplicationParser; import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.schema.common.EntityType; import eu.dnetlib.dhp.schema.common.EntityType;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.common.ModelSupport; import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.DataInfo; import eu.dnetlib.dhp.schema.oaf.DataInfo;
import eu.dnetlib.dhp.schema.oaf.OafEntity; import eu.dnetlib.dhp.schema.oaf.OafEntity;
@ -25,6 +26,8 @@ import eu.dnetlib.dhp.utils.ISLookupClientFactory;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException; import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService; import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
import eu.dnetlib.pace.config.DedupConfig; import eu.dnetlib.pace.config.DedupConfig;
import scala.collection.JavaConversions;
import scala.collection.JavaConverters;
public class SparkCreateDedupRecord extends AbstractSparkAction { public class SparkCreateDedupRecord extends AbstractSparkAction {
@ -85,6 +88,36 @@ public class SparkCreateDedupRecord extends AbstractSparkAction {
.mode(SaveMode.Overwrite) .mode(SaveMode.Overwrite)
.option("compression", "gzip") .option("compression", "gzip")
.json(outputPath); .json(outputPath);
log.info("Updating mergerels for: '{}'", subEntity);
final Dataset<Row> dedupIds = spark
.read()
.schema("`id` STRING, `mergedIds` ARRAY<STRING>")
.json(outputPath)
.selectExpr("id as source", "explode(mergedIds) as target");
spark
.read()
.load(mergeRelPath)
.where("relClass == 'merges'")
.join(dedupIds, JavaConversions.asScalaBuffer(Arrays.asList("source", "target")), "left_semi")
.write()
.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.save(workingPath + "/mergerel_filtered");
final Dataset<Row> validRels = spark.read().load(workingPath + "/mergerel_filtered");
final Dataset<Row> filteredMergeRels = validRels
.union(
validRels
.withColumnRenamed("source", "source_tmp")
.withColumnRenamed("target", "target_tmp")
.withColumn("relClass", functions.lit(ModelConstants.IS_MERGED_IN))
.withColumnRenamed("target_tmp", "source")
.withColumnRenamed("source_tmp", "target"));
saveParquet(filteredMergeRels, mergeRelPath, SaveMode.Overwrite);
removeOutputDir(spark, workingPath + "/mergerel_filtered");
} }
} }

View File

@ -69,6 +69,7 @@ public class SparkPropagateRelation extends AbstractSparkAction {
Dataset<Relation> mergeRels = spark Dataset<Relation> mergeRels = spark
.read() .read()
.schema(REL_BEAN_ENC.schema())
.load(DedupUtility.createMergeRelPath(workingPath, "*", "*")) .load(DedupUtility.createMergeRelPath(workingPath, "*", "*"))
.as(REL_BEAN_ENC); .as(REL_BEAN_ENC);

View File

@ -46,8 +46,8 @@ class DatasetMergerTest implements Serializable {
} }
@Test @Test
void datasetMergerTest() throws InstantiationException, IllegalAccessException, InvocationTargetException { void datasetMergerTest() {
Dataset pub_merged = MergeUtils.mergeGroup(dedupId, datasets.stream().map(Tuple2::_2).iterator()); Dataset pub_merged = MergeUtils.mergeGroup(datasets.stream().map(Tuple2::_2).iterator());
// verify id // verify id
assertEquals(dedupId, pub_merged.getId()); assertEquals(dedupId, pub_merged.getId());

View File

@ -96,7 +96,7 @@
"aggregation": "MAX", "aggregation": "MAX",
"positive": "layer4", "positive": "layer4",
"negative": "NO_MATCH", "negative": "NO_MATCH",
"undefined": "MATCH", "undefined": "layer4",
"ignoreUndefined": "true" "ignoreUndefined": "true"
}, },
"layer4": { "layer4": {

View File

@ -7,7 +7,7 @@ import eu.dnetlib.dhp.schema.oaf.utils.{GraphCleaningFunctions, IdentifierFactor
import eu.dnetlib.dhp.utils.DHPUtils import eu.dnetlib.dhp.utils.DHPUtils
import eu.dnetlib.doiboost.DoiBoostMappingUtil import eu.dnetlib.doiboost.DoiBoostMappingUtil
import eu.dnetlib.doiboost.DoiBoostMappingUtil._ import eu.dnetlib.doiboost.DoiBoostMappingUtil._
import org.apache.commons.lang.StringUtils import org.apache.commons.lang3.StringUtils
import org.json4s import org.json4s
import org.json4s.DefaultFormats import org.json4s.DefaultFormats
import org.json4s.JsonAST._ import org.json4s.JsonAST._
@ -560,9 +560,32 @@ case object Crossref2Oaf {
"10.13039/501100000266" | "10.13039/501100006041" | "10.13039/501100000265" | "10.13039/501100000270" | "10.13039/501100000266" | "10.13039/501100006041" | "10.13039/501100000265" | "10.13039/501100000270" |
"10.13039/501100013589" | "10.13039/501100000271" => "10.13039/501100013589" | "10.13039/501100000271" =>
generateSimpleRelationFromAward(funder, "ukri________", a => a) generateSimpleRelationFromAward(funder, "ukri________", a => a)
//DFG
case "10.13039/501100001659" =>
val targetId = getProjectId("dfgf________", "1e5e62235d094afd01cd56e65112fc63")
queue += generateRelation(sourceId, targetId, ModelConstants.IS_PRODUCED_BY)
queue += generateRelation(targetId, sourceId, ModelConstants.PRODUCES)
case _ => logger.debug("no match for " + funder.DOI.get)
//Add for Danish funders
//Independent Research Fund Denmark (IRFD)
case "10.13039/501100004836" =>
generateSimpleRelationFromAward(funder, "irfd________", a => a)
val targetId = getProjectId("irfd________", "1e5e62235d094afd01cd56e65112fc63")
queue += generateRelation(sourceId, targetId, ModelConstants.IS_PRODUCED_BY)
queue += generateRelation(targetId, sourceId, ModelConstants.PRODUCES)
//Carlsberg Foundation (CF)
case "10.13039/501100002808" =>
generateSimpleRelationFromAward(funder, "cf__________", a => a)
val targetId = getProjectId("cf__________", "1e5e62235d094afd01cd56e65112fc63")
queue += generateRelation(sourceId, targetId, ModelConstants.IS_PRODUCED_BY)
queue += generateRelation(targetId, sourceId, ModelConstants.PRODUCES)
//Novo Nordisk Foundation (NNF)
case "10.13039/501100009708" =>
generateSimpleRelationFromAward(funder, "nnf___________", a => a)
val targetId = getProjectId("nnf_________", "1e5e62235d094afd01cd56e65112fc63")
queue += generateRelation(sourceId, targetId, ModelConstants.IS_PRODUCED_BY)
queue += generateRelation(targetId, sourceId, ModelConstants.PRODUCES)
case _ => logger.debug("no match for " + funder.DOI.get) case _ => logger.debug("no match for " + funder.DOI.get)
} }
} else { } else {

View File

@ -313,7 +313,7 @@ case object ConversionUtil {
if (f.author.DisplayName.isDefined) if (f.author.DisplayName.isDefined)
a.setFullname(f.author.DisplayName.get) a.setFullname(f.author.DisplayName.get)
if (f.affiliation != null) if (f.affiliation != null)
a.setAffiliation(List(asField(f.affiliation)).asJava) a.setRawAffiliationString(List(f.affiliation).asJava)
a.setPid( a.setPid(
List( List(
createSP( createSP(
@ -386,7 +386,7 @@ case object ConversionUtil {
a.setFullname(f.author.DisplayName.get) a.setFullname(f.author.DisplayName.get)
if (f.affiliation != null) if (f.affiliation != null)
a.setAffiliation(List(asField(f.affiliation)).asJava) a.setRawAffiliationString(List(f.affiliation).asJava)
a.setPid( a.setPid(
List( List(

View File

@ -6,7 +6,7 @@ import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory
import eu.dnetlib.dhp.schema.oaf.{Author, DataInfo, Publication} import eu.dnetlib.dhp.schema.oaf.{Author, DataInfo, Publication}
import eu.dnetlib.doiboost.DoiBoostMappingUtil import eu.dnetlib.doiboost.DoiBoostMappingUtil
import eu.dnetlib.doiboost.DoiBoostMappingUtil.{createSP, generateDataInfo} import eu.dnetlib.doiboost.DoiBoostMappingUtil.{createSP, generateDataInfo}
import org.apache.commons.lang.StringUtils import org.apache.commons.lang3.StringUtils
import org.json4s import org.json4s
import org.json4s.DefaultFormats import org.json4s.DefaultFormats
import org.json4s.JsonAST._ import org.json4s.JsonAST._

View File

@ -48,12 +48,7 @@
<groupId>io.github.classgraph</groupId> <groupId>io.github.classgraph</groupId>
<artifactId>classgraph</artifactId> <artifactId>classgraph</artifactId>
</dependency> </dependency>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-aggregation</artifactId>
<version>1.2.5-SNAPSHOT</version>
<scope>compile</scope>
</dependency>
</dependencies> </dependencies>

Some files were not shown because too many files have changed in this diff Show More