Commit Graph

2785 Commits

Author SHA1 Message Date
Miriam Baglioni 084b4ef999 added the creation of the openaireId from funder and grant number if the element is not present in the context profile 2021-07-13 17:07:46 +02:00
Claudio Atzori fa720c1da4 [doiboost] added workflow for the ActionSet update dedicated to production 2021-07-13 16:59:30 +02:00
Miriam Baglioni 8f322a73cb change because of the renaming of originalId in acronym 2021-07-13 16:22:58 +02:00
Miriam Baglioni 72397ea1ba Added fix for community of arbitrary name length 2021-07-13 16:18:35 +02:00
Miriam Baglioni 5295d10691 added check not to dump deletedByInference entities 2021-07-13 16:11:46 +02:00
Claudio Atzori 9629569e22 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2021-07-13 16:04:08 +02:00
Claudio Atzori f13e11e3f7 [aggregation] datacite wf: defined parameter declaring the path used to store the OAF objects produced by the transformation phase 2021-07-13 16:04:02 +02:00
Miriam Baglioni e9a17ec899 added check to verify not to add void APC 2021-07-13 15:53:35 +02:00
Miriam Baglioni 8429aed6c6 Added resource for testing selection of valid relations 2021-07-13 15:49:38 +02:00
Miriam Baglioni 39b1a6edf6 added test class for the selection of valid relations and description 2021-07-13 15:23:09 +02:00
Miriam Baglioni 9a58f1b93d added logic to select only the valid relations: those not deletedbyinference and having both part of the relation as entities in the graph 2021-07-13 15:20:39 +02:00
Miriam Baglioni 13c66e16be changed logic to split for communities 2021-07-13 15:15:27 +02:00
Miriam Baglioni 6410ab71d8 added APC in the dump and test method 2021-07-13 15:13:58 +02:00
Miriam Baglioni 65a242646d added resource for APC dump 2021-07-13 14:45:25 +02:00
Miriam Baglioni 4b432fbee8 extended test class 2021-07-13 14:40:39 +02:00
Miriam Baglioni 87a6e2b967 extended test class 2021-07-13 14:38:28 +02:00
Miriam Baglioni 69fd40fd30 modified code to split the Croatian funder 2021-07-13 14:35:26 +02:00
Miriam Baglioni 86e50f7311 modified code to split the Croatian funder 2021-07-13 14:31:45 +02:00
Miriam Baglioni da88c850c6 changed the logic to verify if a community is contained in the list of context of a result 2021-07-13 14:22:44 +02:00
Miriam Baglioni 2f66fedfec changed the logic to verify if a community is contained in the list of context of a result 2021-07-13 14:22:23 +02:00
Miriam Baglioni f5486ffb14 Fixed issues to tests 2021-07-13 14:07:45 +02:00
Claudio Atzori e0061232e9 [aggregation] datacite wf: conditional creation of links, optional resume from intermediate phases 2021-07-13 13:41:21 +02:00
Sandro La Bruzzo bbe8193930 merged stable ids 2021-07-12 17:00:43 +02:00
Claudio Atzori ae2b47b29d [broker] added coalesce(1) on the stats dataset before storing it on postgres 2021-07-09 15:47:51 +02:00
Sandro La Bruzzo 57c74c73c6 fixed mistakes in oozie workflow 2021-07-09 12:28:09 +02:00
Sandro La Bruzzo 61ccb54fde removed wrong loop on oozie wf 2021-07-09 12:17:57 +02:00
Sandro La Bruzzo 9f5a0f3ab6 moved wf indexing of Scholexplorer in dhp-graph-provision 2021-07-09 12:06:43 +02:00
Sandro La Bruzzo 09fccf8000 added workflow to serialize scholix and summary in json 2021-07-09 11:01:42 +02:00
Sandro La Bruzzo 0ea576745f updated CreateInputGraph because ggenerics don't work on Spark Dataset 2021-07-09 10:29:24 +02:00
Sandro La Bruzzo cd17e19044 implemented branch workflow to import datacite and crossref in scholexplorer 2021-07-08 21:20:19 +02:00
Miriam Baglioni c30f3ce647 merge doi normalization 2021-07-08 19:20:02 +02:00
Sandro La Bruzzo 8a034e46e1 updated baseline workflow 2021-07-08 11:11:41 +02:00
Claudio Atzori b7b8e0986e [raw_all] The claim merge procedure includes the claimed contexts in the merged result 2021-07-08 10:42:31 +02:00
Sandro La Bruzzo 0799ac9fb6 fixed wrong path 2021-07-08 10:36:37 +02:00
Sandro La Bruzzo 4d53402712 extended ebiLinks to create a dataset before generation of OAF 2021-07-08 10:26:21 +02:00
Sandro La Bruzzo a4a54a3786 code refactor 2021-07-08 09:08:25 +02:00
Sandro La Bruzzo a01dbe0ab0 completed workflow of generation of scholix and summaries 2021-07-07 23:10:34 +02:00
Claudio Atzori fdcff42e46 [raw_all] Aggregator graph creation merges claims (updates) with the corresponding entity 2021-07-07 19:01:59 +02:00
Claudio Atzori 777536ce91 [aggregation] string values used as regular expressions in the OAI collection classes are defined in a single point as constants, to be reused across the code (PR#122) 2021-07-07 11:23:48 +02:00
Claudio Atzori bc014023c8 Merge pull request 'to solve the scala SI-3623' (#122) from andreas.czerniak/BrStableId_dnet-hadoop:stable_ids into stable_ids
Reviewed-on: #122
2021-07-07 11:13:51 +02:00
Claudio Atzori 32bdfdccbc [raw_all] Aggregator graph creation merges claims (updates) with the corresponding entity 2021-07-07 11:08:27 +02:00
Andreas Czerniak ebf3f47a02 from&until more OAI2.0 compl., adding tfs 2021-07-07 09:29:49 +02:00
Claudio Atzori f580cb77e1 added mapping for claim relation 'resultResult_publicationDataset_isRelatedTo' (present on BETA) 2021-07-06 21:11:11 +02:00
Sandro La Bruzzo ed684874f2 deleted old scholix project 2021-07-06 17:20:08 +02:00
Sandro La Bruzzo 8535506c22 added scholix generation 2021-07-06 17:18:06 +02:00
Sandro La Bruzzo 4c54bd8742 add test to verify merge scholix on source 2021-07-06 11:32:14 +02:00
Andreas Czerniak 3531802710 to solve the scala SI-3623 2021-07-06 11:30:56 +02:00
Sandro La Bruzzo 7d8db2eb8a betterRenamingMethod 2021-07-06 09:56:32 +02:00
Sandro La Bruzzo c952c8d236 generate first side of scholix mapping 2021-07-06 09:53:14 +02:00
Claudio Atzori 70ded407bb HttpClient used in metadata collection retries also on 404 2021-07-05 18:04:30 +02:00
Miriam Baglioni 7177c25261 added check for null value during doi normalization 2021-07-05 16:22:38 +02:00
Miriam Baglioni 0892cad4e8 the normalization of the content of value was not visible outside the block. Moved doi normalization operation while returning value 2021-07-05 16:21:42 +02:00
Antonis Lempesis 89e6f46682 using organization ids instead of names in monitor db creation 2021-07-05 12:00:00 +03:00
Sandro La Bruzzo e4b84ef5d6 fixed mapping OAF to Scholix summary 2021-07-02 16:48:48 +02:00
Sandro La Bruzzo c6fa8598e1 massive code refactor:
removed modules dhp-*-scholexplorer
2021-07-01 22:13:45 +02:00
Antonis Lempesis 829caee4fd added the missing indicators files 2021-06-30 17:31:33 +02:00
Sandro La Bruzzo 84b834c893 added test dataset test for pangaea 2021-06-30 17:31:09 +02:00
Sandro La Bruzzo 1a6b398968 implemented Creation of Raw Graph and Resolution 2021-06-30 17:27:55 +02:00
Miriam Baglioni bc34347643 added assertions to verify doi normalization 2021-06-30 14:37:08 +02:00
Miriam Baglioni 86f47afcc7 slight modification of the resource to accomodate also doi normalization tests 2021-06-30 14:36:49 +02:00
Miriam Baglioni 03767ea8e6 slight modification of the resource to accomodate also doi normalization tests 2021-06-30 13:21:24 +02:00
Miriam Baglioni f8eec0ca9a added resource to test the normalization of doi during the import of MAG 2021-06-30 13:19:54 +02:00
Miriam Baglioni 149f85ddf5 added tests for the normalization of the dois 2021-06-30 13:00:52 +02:00
Miriam Baglioni e487b5544c added tests for the normalization of the dois 2021-06-30 12:57:11 +02:00
Miriam Baglioni 1503ccbbb5 added tests for the normalization of the dois 2021-06-30 12:55:37 +02:00
Miriam Baglioni 1299bfb357 Added class to test the normalization of doi 2021-06-30 12:53:27 +02:00
Sandro La Bruzzo 623a0c4edb code Refactor, renaming packages 2021-06-30 11:09:30 +02:00
Miriam Baglioni cf758f4f91 added normalization step for the doi 2021-06-30 10:03:15 +02:00
Miriam Baglioni 801763a0fa there is no more the need to lower case the doi since it is done in the first step. Also changed the creation of the id by using the factory 2021-06-29 19:07:23 +02:00
Miriam Baglioni a74de1cda2 added normalization step to the doi 2021-06-29 18:51:11 +02:00
Miriam Baglioni 06074ea7d3 added normalization step to the doi 2021-06-29 18:46:08 +02:00
Miriam Baglioni 8b8ffe82dc added step of normalization for the doi 2021-06-29 18:41:39 +02:00
Miriam Baglioni 50cc21d92e Added method to normalize doi values (lower case, remove all preceeding 10., filtering out doi not starting with 10.) 2021-06-29 18:35:28 +02:00
Antonis Lempesis 87f14a3899 added the missing indicators files 2021-06-29 16:31:51 +03:00
Sandro La Bruzzo db933ebd21 Merge remote-tracking branch 'origin/stable_ids' into stable_id_scholexplorer 2021-06-29 14:16:12 +02:00
Sandro La Bruzzo 7e08655e5f added relation dates in all scholexplorer Datasources 2021-06-29 12:02:03 +02:00
Sandro La Bruzzo 075055eaca added relation dates in bio mapping 2021-06-29 10:33:09 +02:00
Sandro La Bruzzo f36f92287d implemented mapping from Crossref Event Data to Oaf 2021-06-29 10:21:23 +02:00
Antonis Lempesis 018c4eb52c copied latest changes from old fork: indicators+monitor institutions 2021-06-28 23:46:52 +03:00
Sandro La Bruzzo 511ec14c63 implemented mapping from EBI and Scholix Resolved to OAF 2021-06-28 22:04:22 +02:00
Claudio Atzori af42377d0e HttpClient used in metadata collection retries on 502, 503, 504 2021-06-28 09:34:30 +02:00
Sandro La Bruzzo ad50415167 Merge remote-tracking branch 'origin/stable_ids' into stable_id_scholexplorer 2021-06-24 17:20:50 +02:00
Sandro La Bruzzo 80e15cc455 implemented mapping from uniprot, pdb and ebi links 2021-06-24 17:20:00 +02:00
Claudio Atzori 2e8fd2c531 cleanup 2021-06-23 14:38:24 +02:00
Claudio Atzori 4dc9ebf217 [raw_all] fixed unit test 2021-06-23 14:38:07 +02:00
Claudio Atzori 50fc5a64a0 [raw_all] Aggregator graph creation merges claims (updates) with the corresponding entity 2021-06-23 11:49:42 +02:00
Claudio Atzori 5edcc6832a applying sonarLint suggestions 2021-06-23 09:53:29 +02:00
Sandro La Bruzzo 080a280bea added pdb to Oaf Transformation 2021-06-21 16:23:59 +02:00
Sandro La Bruzzo 1dc0c59e20 merged fix thai dates from stable_ids 2021-06-21 10:39:46 +02:00
Sandro La Bruzzo dc66cf615b Merge branch 'stable_id_scholexplorer' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_id_scholexplorer 2021-06-21 09:38:33 +02:00
Sandro La Bruzzo 507e42102a added pdb to oaf class 2021-06-21 09:36:40 +02:00
Sandro La Bruzzo a167543637 Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_id_scholexplorer 2021-06-21 09:14:11 +02:00
Sandro La Bruzzo 4fe7b75644 renamed packages 2021-06-18 16:41:24 +02:00
Sandro La Bruzzo 3990165d05 changed typologies of unresolved relation 2021-06-18 11:43:59 +02:00
Miriam Baglioni 180d671127 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-06-18 09:46:18 +02:00
Miriam Baglioni 13c96622c9 - 2021-06-18 09:45:16 +02:00
Miriam Baglioni b486ae498f added test and test resource to verify the generation of the date of acceptance from the input extracted from the dump 2021-06-18 09:43:32 +02:00
Miriam Baglioni 464c2ddde3 changed to split in two steps the generation of the crossref dataset 2021-06-18 09:42:31 +02:00
Miriam Baglioni 6aca0d8ebb added kryo encoding for input files 2021-06-18 09:42:07 +02:00
Miriam Baglioni 3585e53da3 changed to split in two steps the generation of the crossref dataset 2021-06-18 09:41:23 +02:00
Claudio Atzori 41b551562e applying PR#115 (DatePicker) on stable_ids 2021-06-17 09:33:50 +02:00
Sandro La Bruzzo 3100166d29 Merge remote-tracking branch 'origin/stable_ids' into stable_id_scholexplorer 2021-06-16 16:22:16 +02:00
Claudio Atzori 74833d04f1 Merge branch 'pids_beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into stable_ids 2021-06-16 15:54:18 +02:00
Claudio Atzori 7243a40c88 code formatting 2021-06-16 15:03:03 +02:00
Sandro La Bruzzo dfcf78cf24 removed wrong code 2021-06-16 14:57:42 +02:00
Sandro La Bruzzo cc0f2b11fb Implemented mapping from pubmed baseline to OAF 2021-06-16 14:56:24 +02:00
Miriam Baglioni 95885bcf12 forces executor Executor memory and driver executor memory to be 7G (trying to avoid OOM) 2021-06-16 10:17:52 +02:00
Miriam Baglioni 2550a73981 - 2021-06-16 10:04:41 +02:00
Miriam Baglioni 1c47c0d786 modified the number of executors trying to avoid OOM exception 2021-06-15 21:05:39 +02:00
Miriam Baglioni 7deac55138 added one option for resume from in the wf 2021-06-15 18:38:20 +02:00
Antonis Lempesis f7c0b80e35 storing result_instance as parquet 2021-06-15 14:45:48 +03:00
Miriam Baglioni 66e7ef892f changed the parameter name 2021-06-15 11:08:54 +02:00
Miriam Baglioni 4f47ad0891 no need to rename the folders, just write in overwrite mode, so I changed the name of the output folder 2021-06-15 09:28:31 +02:00
Miriam Baglioni 9f9dd00b94 refactoring 2021-06-15 09:24:46 +02:00
Miriam Baglioni 63d74ee379 refactoring 2021-06-15 09:24:11 +02:00
Miriam Baglioni 6ebc236657 added needed property: outputPath 2021-06-15 09:23:24 +02:00
Miriam Baglioni f7379255b6 changed the workflow to extract info from the dump 2021-06-15 09:22:54 +02:00
Miriam Baglioni d6e21bb6ea creates the crossref dataset used for doiboost together with unpacking part from tar 2021-06-14 17:27:19 +02:00
Miriam Baglioni 4da141bd7c Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-06-14 13:41:02 +02:00
Miriam Baglioni ce0cfd79e0 creates the crossref dataset used for doiboost 2021-06-14 13:40:19 +02:00
Miriam Baglioni 93efe4de82 split the construction of crossref dataset in two parts. This one just unpacks the tar entries 2021-06-14 13:39:40 +02:00
Michele Artini ada063ce70 fixed a problem with empty mdstore list (2) 2021-06-14 12:04:47 +02:00
Michele Artini 83132ee99a fixed a problem with empty mdstore list 2021-06-14 11:57:00 +02:00
Miriam Baglioni cf360d7c97 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-06-14 10:19:49 +02:00
Miriam Baglioni 8873e6b6d1 workflow and parameter 2021-06-14 10:15:57 +02:00
Miriam Baglioni 0f1acdf6b6 workflow and parameter 2021-06-14 10:08:55 +02:00
Sandro La Bruzzo aeb8132627 Merged branch stable_ids 2021-06-14 10:07:29 +02:00
Sandro La Bruzzo efbea1e01a minor fix 2021-06-14 09:45:14 +02:00
Miriam Baglioni 75780fc636 extraction of the tar for the dump of crossref, and creation of the dataset 2021-06-14 09:45:07 +02:00
Claudio Atzori 2039bb9f5f orcid / orcid_pending cleaning backported from master branch 2021-06-14 09:40:50 +02:00
Claudio Atzori dd19c4ac5a Merge pull request 'import_new_mdstores' (#112) from import_new_mdstores into stable_ids
Reviewed-on: #112
2021-06-14 09:23:55 +02:00
Claudio Atzori e9e86a237d Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-06-11 17:00:02 +02:00
Claudio Atzori a900bfb874 delegating the date parsing to https://github.com/sisyphsu/dateparser 2021-06-11 16:53:01 +02:00
Sandro La Bruzzo dd997c49e0 fix wrong relation id
fix date thai ticket #6791
2021-06-10 14:47:18 +02:00
Antonis Lempesis d413b24611 added instances, orgs for monitor, totalcost for projects, apcs 2021-06-10 02:35:46 +03:00
Claudio Atzori 741077dbca Merge pull request 'Fix in Affiliation Propagation' (#113) from miriam.baglioni/dnet-hadoop:master into stable_ids
Reviewed-on: #113
2021-06-09 18:42:42 +02:00
Miriam Baglioni 32b0c27217 Aggiornare 'dhp-workflows/dhp-enrichment/src/main/java/eu/dnetlib/dhp/resulttoorganizationfrominstrepo/PrepareResultInstRepoAssociation.java'
fix in SQL query: while writing the blacklist constraint it used d.id to indicate the datasource id, but no alias for the datasource was defined. So I removed the alias
2021-06-09 18:36:11 +02:00
Sandro La Bruzzo 0d1f37302f Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_id_scholexplorer 2021-06-09 09:35:16 +02:00
Miriam Baglioni dc07f1079b added check in case the author set to be enriched is null 2021-06-08 12:06:10 +02:00
Miriam Baglioni 8d2e086e48 changes to avoid reassignment to val 2021-06-07 17:50:37 +02:00
Miriam Baglioni f33521d338 Aggiornare 'dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcid/SparkConvertORCIDToOAF.scala'
to be able to replace the aboject assigned to author val has been replaced by var
2021-06-07 17:27:07 +02:00
Miriam Baglioni bc12e9819e Aggiornare 'dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcid/SparkConvertORCIDToOAF.scala'
The change is to fix the issue that arises when the same work appears more than once on the same ORCID profile. The change avoid to replicate the association doi -> author when the orcid id is already associated to the doi.
2021-06-07 16:37:01 +02:00
Sandro La Bruzzo 0cdb7ccdaa added inverse relations to datacite mapping 2021-06-04 15:10:20 +02:00
Sandro La Bruzzo 5b724d9972 added relations to datacite mapping 2021-06-04 10:14:22 +02:00
Sandro La Bruzzo e57294ac99 implemented changes on PUBMed dataflow 2021-06-03 10:52:09 +02:00
Michele Artini ede2749822 orcid pid type 2021-06-01 12:42:43 +02:00
Michele Artini f0fbfdcfae Merge branch 'stable_ids' into import_new_mdstores 2021-06-01 12:03:00 +02:00
Michele Artini e950750262 add nodes to import hdfs mdstores 2021-06-01 10:48:50 +02:00
Michele Artini 03a510859a removed coalesce(1) 2021-05-31 14:10:51 +02:00
Michele Artini e9f2b6037c patch of mdstore records 2021-05-31 11:36:26 +02:00
Sandro La Bruzzo 02ef46535f Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids 2021-05-31 09:50:15 +02:00
Sandro La Bruzzo aeadc5a366 updated wf Datacite Import to retrieve the block size as parameter 2021-05-31 09:49:53 +02:00
Claudio Atzori 96238152cb added serialization for alternateIdentifiers and pids within each record instance 2021-05-28 16:57:30 +02:00
Michele Artini ad56a44fda save as gzipped sequence file 2021-05-28 14:45:39 +02:00
Claudio Atzori 83722ebc47 pull #111 replied on stable_ids 2021-05-28 14:11:46 +02:00
Claudio Atzori 6e3a4e9237 updated test expectations 2021-05-28 09:37:50 +02:00
Michele Artini 4fa5671d16 first implementation of Hdfs Mdstores Importer 2021-05-27 16:22:07 +02:00
Claudio Atzori d512062b58 integrating pull #109, H2020Classification 2021-05-27 12:22:47 +02:00
Claudio Atzori 5e4b91d9ef more pervasive use of constants from ModelConstants, especially for ORCID 2021-05-26 18:20:23 +02:00
Sandro La Bruzzo bced804151 updated wf Datacite Import to retrieve the block size as parameter 2021-05-26 17:06:50 +02:00
Miriam Baglioni abd88f663d changed test resource to mirror change in the input file 2021-05-21 15:20:47 +02:00
Miriam Baglioni c844877de2 changed workflow flow to possibly parallelize also the programme and project preparation steps 2021-05-21 14:41:57 +02:00
Miriam Baglioni 073d76864d refactoring 2021-05-21 14:41:03 +02:00
Miriam Baglioni 4c8b4a774c removed not needed code 2021-05-21 14:40:07 +02:00
Enrico Ottonello abdd0ade1f added temporary output folder as workflow parameter 2021-05-21 12:08:16 +02:00
Miriam Baglioni 53b9d87fec new prepareProgramme according to the new file 2021-05-21 11:49:31 +02:00
Miriam Baglioni 1ee8f13580 refactoring and added "left" as join type to be 100% sure to get the whole set of projects 2021-05-21 11:49:05 +02:00
Miriam Baglioni e07c3ba089 due to change in the input file the filtering step is no more needed 2021-05-21 11:47:43 +02:00
Miriam Baglioni 54f6e2f693 changed to get the needed information to build the action set as parallel jobs 2021-05-21 11:47:00 +02:00
Miriam Baglioni 7180505519 removed non needed variable 2021-05-21 11:46:13 +02:00
Miriam Baglioni 2eb1a8b344 changed because the input file changed 2021-05-21 11:40:20 +02:00
Enrico Ottonello d0945c3c78 added temporary output folder, because of folder access rights are different on beta and prod 2021-05-20 19:14:31 +02:00
Enrico Ottonello 1265dadc90 workflow aligned with stable_ids 2021-05-20 19:01:28 +02:00
Enrico Ottonello 0821d8e97d Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2021-05-20 18:33:18 +02:00
Enrico Ottonello ae7bd24d79 removed old workflows 2021-05-20 18:32:22 +02:00
Claudio Atzori 9d725efdc1 reverted implementation of the mdstore client 2021-05-20 18:26:09 +02:00
Miriam Baglioni 9610224671 added param to workflow property 2021-05-20 18:21:12 +02:00
Claudio Atzori 863b56b6ce using constants from ModelConstants 2021-05-20 16:23:58 +02:00
Claudio Atzori ae5c28e54f code formatting 2021-05-20 16:13:06 +02:00
Miriam Baglioni aa45b4df9b - 2021-05-20 15:57:40 +02:00
Miriam Baglioni 052c837843 - 2021-05-20 15:54:44 +02:00
Claudio Atzori b695932ae4 integrated pull#108 2021-05-20 15:34:04 +02:00
Claudio Atzori ea9b00ce56 adjusted test 2021-05-20 15:31:42 +02:00
Claudio Atzori b572f56763 Merge branch 'master' into master 2021-05-20 15:22:35 +02:00
Claudio Atzori 2578b7fbb3 code formatting 2021-05-20 14:59:02 +02:00
Miriam Baglioni dc0ad8d2e0 fixed issue related to change in the file name downloaded. Added sheet name as parameter and also a check if the name should change 2021-05-20 14:53:53 +02:00
Claudio Atzori 232dce83db fixes #6701: xpath for titles to support both datacite and Guidelines v4 mapping 2021-05-20 14:41:15 +02:00
Claudio Atzori aef2977ad0 fixes #6701: xpath for titles to support both datacite and Guidelines v4 mapping 2021-05-20 14:40:22 +02:00
Miriam Baglioni 02b80cf24f resolved conflicts 2021-05-20 10:59:39 +02:00
Claudio Atzori c4a23c2f4d fix: preserving the old identifier among the originalIds in the doiboost construction process, trying to avoid UnsupportedOperationException while adding elements to the originalIds 2021-05-19 16:01:52 +02:00
Claudio Atzori ba03f549d7 fix: preserving the old identifier among the originalIds in the doiboost construction process 2021-05-19 15:43:26 +02:00
Claudio Atzori 239d0f0a9a ROR actionset import workflow backported from branch stable_ids 2021-05-18 16:12:11 +02:00
Antonis Lempesis 168edcbde3 added the final steps for the observatory promote wf and some cleanup 2021-05-18 15:23:20 +03:00
Michele Artini e56ccec536 Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids 2021-05-18 14:00:28 +02:00
Michele Artini c1e20de7cf fixed the deserialization of a json property 2021-05-18 14:00:14 +02:00
Claudio Atzori a9f512103b using constants from ModelConstants 2021-05-18 11:19:07 +02:00
Claudio Atzori eeb8bcf075 using constants from ModelConstants 2021-05-18 11:10:07 +02:00
Claudio Atzori 2cbf15f4fb using ModelConstants 2021-05-17 09:54:45 +02:00
Enrico Ottonello e13926cdd0 merged with master 2021-05-14 18:10:31 +02:00
Claudio Atzori f19feceaf0 set the old identifier before switching to the new one 2021-05-14 12:53:40 +02:00
Claudio Atzori 1bd70fa2c6 preserving the old identifier among the originalIds in the doiboost construction process 2021-05-14 11:30:41 +02:00
Claudio Atzori ca3f3a7687 using ModelConstants 2021-05-14 11:29:49 +02:00
Claudio Atzori 23b8883ab1 applied intellij code cleanup 2021-05-14 10:58:12 +02:00
Claudio Atzori 609eb711b3 IndexRecordTransformerTest for producing a record that can be manually submitted to solr 2021-05-13 16:13:28 +02:00
Claudio Atzori 1517bf7c92 IndexRecordTransformerTest for producing a record that can be manually submitted to solr 2021-05-13 16:11:22 +02:00
Sandro La Bruzzo d9a0bbda7b implemented new phase in doiboost to make the dataset Distinct by ID 2021-05-13 12:25:14 +02:00
Sandro La Bruzzo 6424cd9062 Added passing of the following parameters:
-varDataSourceId
-varOfficialName

in Each transformation Rule
2021-05-11 15:17:38 +02:00
Sandro La Bruzzo 073dcea2aa Added passing of the following parameters:
-varDataSourceId
-varOfficialName

in Each transformation Rule
2021-05-11 15:05:58 +02:00
Claudio Atzori d4c3476152 mapping datasource.journal only when an issn is available, null otherwhise 2021-05-11 11:08:54 +02:00
Claudio Atzori da9d6f3887 mapping datasource.journal only when an issn is available, null otherwhise 2021-05-11 10:45:30 +02:00
Sandro La Bruzzo 54217d73ff removed old parameters from oozie workflow 2021-05-11 09:59:02 +02:00
Claudio Atzori d1cbee8413 imported methods from CleaningFunctions, defined in GraphCleaningFunctions 2021-05-10 16:43:39 +02:00
Claudio Atzori 3797543600 MDStoreManager model classes moved in dhp-schemas 2021-05-10 14:32:05 +02:00
Claudio Atzori 25254885b9 [ActionManagement] reduced number of xqueries used to access ActionSet info 2021-05-07 17:32:03 +02:00
Claudio Atzori 8a0de2fc18 [ActionManagement] reduced number of xqueries used to access ActionSet info 2021-05-07 17:31:32 +02:00
Sandro La Bruzzo 7dc824fc23 imported changes in stable_id into master 2021-05-07 12:53:50 +02:00
Michele Artini d82071ba6c originalId with prefix 2021-05-06 15:34:48 +02:00
Claudio Atzori d4a30fabe3 clean up tests 2021-05-05 17:28:15 +02:00
Claudio Atzori dccaf173cf fixed mapping applied to ODF records. Added unit test to verify the mapping for OpenTrials 2021-05-05 16:36:15 +02:00
Claudio Atzori 8c96a82a03 fixed mapping applied to ODF records. Added unit test to verify the mapping for OpenTrials 2021-05-05 15:30:06 +02:00
Claudio Atzori 2e1eb96f9a code formatting 2021-05-05 11:23:57 +02:00
Sandro La Bruzzo 1adfc41d23 merged manually changes on stable_id for doiboost into master 2021-05-05 10:23:32 +02:00
Claudio Atzori fb930b84d3 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-05-04 18:06:30 +02:00
Claudio Atzori 923d19ea8e mdstore read lock/unlock when bulk copying records from mongodb to hdfs 2021-05-04 18:06:21 +02:00
Sandro La Bruzzo 714b71bd21 updated pubmed 2021-05-04 14:54:12 +02:00
Claudio Atzori ba86835951 using common constants from ModelConstants 2021-05-04 11:51:52 +02:00
Michele Artini f4bd2b5619 recert file SparkDedupTest.java 2021-05-04 10:26:14 +02:00
Michele Artini b4877da363 Merge branch 'stable_ids' into prepare_ror_actionset 2021-05-03 08:13:55 +02:00
Alessia Bardi 9a20057615 fixed query for organisations' pids 2021-04-29 15:23:39 +02:00
Michele Artini 6692128234 Merge branch 'stable_ids' into prepare_ror_actionset 2021-04-29 13:24:08 +02:00
Alessia Bardi a801999e75 fixed query for organisations' pids 2021-04-29 12:18:42 +02:00
Michele Artini a278d67175 parse input file 2021-04-29 11:34:47 +02:00
Claudio Atzori f6ccd54d87 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-04-29 10:10:01 +02:00
Claudio Atzori 91e7220f20 cleaned up workflow for actionset migration, adjusted dnet|cnr* dependency versions 2021-04-29 10:09:52 +02:00
Michele Artini f77ba34126 pid types 2021-04-29 09:50:05 +02:00
Michele Artini 7c5cd86927 annotations and tests 2021-04-29 09:29:19 +02:00
Michele Artini b5cf505cc6 partial implementation of the ROR->actionset workflow 2021-04-28 16:00:24 +02:00
Enrico Ottonello c537986b7c deleted folders with merged data immediately before merge phases 2021-04-28 11:25:25 +02:00
Sandro La Bruzzo 2129e9caa7 updated pangaea transformation to parse directly the xml 2021-04-28 10:21:03 +02:00
Claudio Atzori 5afa7d3e0c core utilities in dhp-common moved in external module dhp-schemas 2021-04-27 15:44:01 +02:00
Alessia Bardi e6075bb917 updated json schema for results - added instances and accessright definition 2021-04-27 15:15:08 +02:00
Sandro La Bruzzo 63c0303137 removed unused import, add log 2021-04-27 12:17:23 +02:00
Sandro La Bruzzo 74484d2823 bug fixing 2021-04-27 12:13:44 +02:00
Sandro La Bruzzo c74b03d59c Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids 2021-04-27 11:31:07 +02:00
Sandro La Bruzzo 7f8848ecdd added first implementation of Pangaea Mapping 2021-04-27 11:30:37 +02:00
Claudio Atzori 27ab8a704d adjusted poms to align with the external dhp-schema module 2021-04-27 10:12:27 +02:00
Claudio Atzori a7cf449b36 cleanup 2021-04-27 10:11:26 +02:00
Claudio Atzori fa42026590 fixed PersonCleaner extension functions 2021-04-27 10:10:06 +02:00
Claudio Atzori ef4bfd82e2 code formatting 2021-04-27 10:09:31 +02:00
Claudio Atzori faa8f6f4e2 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-04-27 09:57:03 +02:00
miconis 6d5c14e030 assertions updated in entity merger test 2021-04-27 09:47:49 +02:00
Claudio Atzori c2bb03c8b5 depending on external dhp-schemas module 2021-04-23 17:57:35 +02:00
Claudio Atzori 7ed107be53 depending on external dhp-schemas module 2021-04-23 17:52:36 +02:00
Claudio Atzori c25238480c making ODF record parsing namespace unaware (#6629) 2021-04-23 17:34:57 +02:00
Claudio Atzori 99cfb027fa making ODF record parsing namespace unaware (#6629) 2021-04-23 17:09:36 +02:00
Miriam Baglioni 72e5aa3b42 refactoring 2021-04-23 12:10:30 +02:00
Miriam Baglioni 7d1b8b7f64 merge upstream 2021-04-23 11:55:49 +02:00
miconis d0e3366c34 Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids 2021-04-22 11:45:19 +02:00
miconis 3c12eeadce bug fix in propagation of relations 2021-04-22 11:44:33 +02:00
Claudio Atzori e5abbec2ba [orcid] download of the lambda file defined in a script 2021-04-22 11:22:10 +02:00
Claudio Atzori 55964cbd81 [orcid] large oozie workflow cleanup; updated workflow for the orcidnodoi actionset creation 2021-04-22 10:18:09 +02:00
Claudio Atzori 8f309b72ff [dedup] using node names consistently across the workflow 2021-04-21 17:54:51 +02:00
Claudio Atzori 52244f813a merging from enrico.ottonello/dnet-hadoop:orcid-no-doi 2021-04-21 12:24:09 +02:00
Sandro La Bruzzo fd29307b84 updated workflow name 2021-04-21 09:21:41 +02:00
Claudio Atzori 815b9f4d56 [openorgs dedup] fixed workflow parameter declarations. Introduced support for resuming the execution from intermediate steps 2021-04-20 17:24:45 +02:00
Claudio Atzori d0d477cca3 code formatting 2021-04-20 12:50:34 +02:00
miconis 0393cdce42 addition of alternative names in export queries 2021-04-20 12:45:21 +02:00
miconis cadd0a5de8 modification of the queries for openorgs: they now consider also pending orgs 2021-04-20 12:06:56 +02:00
Sandro La Bruzzo e06c7f32f6 updated id figshare as described in #6377 2021-04-20 10:18:07 +02:00
Sandro La Bruzzo dbe0d0378e resolved ticket #6377 2021-04-20 09:44:44 +02:00
Antonis Lempesis 625d993cd9 added step for observatory db 2021-04-20 02:31:06 +03:00
Antonis Lempesis 25d0512fbd code cleanup 2021-04-20 01:43:23 +03:00
Sandro La Bruzzo 524e5f3092 Improved parallelization on transformation wf on hadoop 2021-04-19 15:17:25 +02:00
Sandro La Bruzzo cdfe01bbae improved parallelization on transformation job 2021-04-19 15:14:52 +02:00
Sandro La Bruzzo 3ae67b7a1d Merge remote-tracking branch 'origin/stable_ids' into stable_ids 2021-04-16 17:36:57 +02:00
Sandro La Bruzzo a16e5299f9 applied unique function on the final dataset 2021-04-16 17:36:48 +02:00
Claudio Atzori 45057440c1 code formatting 2021-04-16 17:28:25 +02:00
Enrico Ottonello 34ca792a55 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2021-04-16 17:18:46 +02:00
Enrico Ottonello 27068aacd1 wf to move orcid-no-doi dataset on the folder ready the import 2021-04-16 17:17:47 +02:00
miconis 7ad573d023 bug fix: changed join in propagaterelations without applying filter on the id 2021-04-16 16:40:42 +02:00
Sandro La Bruzzo 67085da305 fixed NPE 2021-04-16 11:05:58 +02:00
Sandro La Bruzzo 644aa8f40c Merge remote-tracking branch 'origin/stable_ids' into stable_ids 2021-04-16 09:14:26 +02:00
Sandro La Bruzzo 7d6a80e2f2 added new type on MAG mapping 2021-04-16 09:14:15 +02:00
Claudio Atzori 906d50563c Merge pull request 'properly invalidating impala metadata' (#105) from antonis.lempesis/dnet-hadoop:master into master
Reviewed-on: #105
2021-04-15 15:06:22 +02:00
Claudio Atzori 3d58f95522 [stats update] properly invalidating impala metadata 2021-04-15 15:03:05 +02:00
Antonis Lempesis 03d36fadea properly invalidating impala metadata 2021-04-15 13:34:22 +03:00
miconis f64e57c112 refactoring of the id generation, sparkcreatemergerels collects entities to create root id after a join 2021-04-15 10:59:24 +02:00
miconis 176a5e493d Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids 2021-04-14 18:06:34 +02:00
miconis 3525a8f504 id generation of representative record moved to the SparkCreateMergeRel job 2021-04-14 18:06:07 +02:00
Sandro La Bruzzo 3f77bfceb0 fixed test failure on jenkins 2021-04-14 10:03:01 +02:00
Claudio Atzori 3125cef545 code formatting 2021-04-14 09:11:54 +02:00
Sandro La Bruzzo 44a0064df6 Merge remote-tracking branch 'origin/stable_ids' into stable_ids 2021-04-13 17:48:12 +02:00
Sandro La Bruzzo 479abd10cb Add into ORCID workflow a method that extracts orcid directly to the dump generated by Enrico 2021-04-13 17:47:43 +02:00
Claudio Atzori 710cd1e8f2 Merge pull request 'add xslt, personname cleaner' (#104) from andreas.czerniak/BrStableId_dnet-hadoop:stable_ids into stable_ids
Reviewed-on: #104

LGTM
2021-04-13 14:43:05 +02:00
Claudio Atzori d1ca025b0b [cleaning] remiving authors without fullname or providing 'deactivated' keyword. Removing test test titles 2021-04-13 14:32:41 +02:00
miconis 1542196a33 bug fix: starting node of duplicate scan wf changed 2021-04-13 10:15:43 +02:00
miconis 369ed1cd8a bug fix: lookupurl parameter added to dedup record job 2021-04-13 09:08:05 +02:00
Andreas Czerniak 3b694074ff add xslt, personname cleaner 2021-04-13 07:04:27 +02:00
Claudio Atzori 511c0521e5 [dedup] avoiding NPEs handling OpenOrg relations 2021-04-12 17:45:11 +02:00
miconis d442e25cbc bug fix: ids in self mergerels are not marked deletedbyinference=true 2021-04-12 15:56:22 +02:00
miconis dcff9cecdf bug fix: ids in self mergerels are not marked deletedbyinference=true 2021-04-12 15:55:27 +02:00
miconis 11b22b2d23 bug fix in the query, it now exports only relations with non-hidden organizations 2021-04-08 11:51:47 +02:00
miconis 0857100fb8 implementation of the tests for the openorgs integration in the openaire provision 2021-04-07 18:42:16 +02:00
miconis bf685d849f addition of pids in the query for the export of openorgs for the provision, addition of ec_fields in the openorgs model 2021-04-07 14:27:43 +02:00
Miriam Baglioni 70e391d427 merge upstream 2021-04-07 10:38:08 +02:00
miconis eaaefb8b4c implementation of the procedure to reuse content of different dbs when creating the raw graph 2021-04-06 14:35:51 +02:00
miconis c39c82dfe9 modification of the jobs for the integration of openorgs in the provision, dedup records are no more created by merging but simply taking results of openorgs portal 2021-04-06 14:31:00 +02:00
Claudio Atzori 37b65cc3ad Merge pull request 'updates on stats-update workflow' (#100) from antonis.lempesis/dnet-hadoop:master into master
The workflow integrated in the _stable_ids_ branch has been run correctly on the BETA content, thus IMO this PR can be integrated in the master branch.

Reviewed-on: #100
2021-04-02 16:13:35 +02:00
Claudio Atzori 1e7e5180fa [Graph model] updated definition of ExternalReference: added alternateLabel, removed description (#6503) 2021-04-02 12:32:12 +02:00
Claudio Atzori e686b8de8d [ORCID-no-doi] integrating PR#98 #98 2021-04-01 17:11:03 +02:00
Claudio Atzori ee34cc51c3 [ORCID-no-doi] integrating PR#98 #98 2021-04-01 17:07:49 +02:00
Claudio Atzori 70e49ed53c [OpenOrgsWf] trivial refactoring 2021-04-01 10:30:51 +02:00
Claudio Atzori 7941d7be29 WIP: using common definitions from ModelConstants 2021-03-31 18:33:57 +02:00
Claudio Atzori 879e8cc7ef WIP: using common definitions from ModelConstants 2021-03-31 17:12:01 +02:00
Claudio Atzori 72ce741ea6 WIP: using common definitions from ModelConstants 2021-03-31 17:07:13 +02:00
Enrico Ottonello 59ec5137e1 improvement related to https://issue.openaire.research-infrastructures.eu/issues/6501 2021-03-31 16:25:41 +02:00
Sandro La Bruzzo 616d2ecce2 splitted workflow collecting datacite into two workflows.
Released on beta
2021-03-31 15:45:58 +02:00
Miriam Baglioni 4b6e514f02 merge upstream 2021-03-30 10:27:12 +02:00
Claudio Atzori 9237d55d7f [OpenOrgsWf] cleanup 2021-03-29 17:40:34 +02:00
Claudio Atzori 7f4e9479ec [OpenOrgsWf] graph construction wf: allow to skip the import openorgs node (importOpenorgs true|false) 2021-03-29 16:59:16 +02:00
miconis 2709d08fc2 Merge branch 'stable_ids' into openorgswf 2021-03-29 16:39:07 +02:00
miconis f446580e9f code refactoring (useless classes and wf removed), implementation of the test for the openorgs dedup 2021-03-29 16:10:46 +02:00
Claudio Atzori a0837ac357 [Stats update] integrating PR#100 for testing #100 2021-03-29 15:59:58 +02:00
miconis 2355cc4e9b minor changes and bug fix 2021-03-29 10:07:12 +02:00
Sandro La Bruzzo 1dfda3624e improved workflow importing datacite 2021-03-26 13:56:29 +01:00
Enrico Ottonello 91d8660982 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2021-03-25 11:21:20 +01:00
Enrico Ottonello ebd67b8c8f removed duplicates orcid data on authors set 2021-03-25 11:20:52 +01:00
Claudio Atzori 827e7e37db [Cleaning] drop instance.alternateIdentifier elements when they are available among instance.pid 2021-03-25 11:07:59 +01:00
miconis 28c1cdd132 merged stable_ids into openorgswf 2021-03-25 10:44:49 +01:00
miconis 5dfb66b0fa minor changes 2021-03-25 10:29:34 +01:00
miconis 348b0ef921 bug fix, implementation of the workflow for the creation of raw_organizations (openorgs dedup), addition of the pid lists to the openorgs postgres db 2021-03-24 15:51:27 +01:00
Claudio Atzori 751125fdf9 [Actionmanager] zero function considers empty entity.id as well as rel.source/rel.target 2021-03-23 17:34:32 +01:00
Claudio Atzori 1e423fdc07 [Actionmanager] remove invalid records from the input graph before groupGraphTableByIdAndMerge 2021-03-23 13:39:24 +01:00
Claudio Atzori e5ebb500cf fixed pom versions; included missing workflow modules in dhp-workflows/pom.xml 2021-03-23 12:13:53 +01:00
Claudio Atzori b75ad76f79 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-03-23 09:59:12 +01:00
Claudio Atzori 8db248aa13 avoiding error on jenkins compilations: java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)! 2021-03-23 09:56:34 +01:00
Sandro La Bruzzo 625e4c29c4 added model constants 2021-03-23 09:39:56 +01:00
Claudio Atzori b4febed138 updated mapping tests as consequence of the special treatment reserved to Handle PIDs 2021-03-23 09:37:48 +01:00
Claudio Atzori 431cbe9955 handle missing instance.pid during bulk cleaning 2021-03-23 09:28:58 +01:00
Sandro La Bruzzo c392936b97 fixed error on best access right 2021-03-23 09:23:22 +01:00
Sandro La Bruzzo c73072079d fix conflicts 2021-03-22 16:36:31 +01:00
Sandro La Bruzzo 098914dcff fix wrong relation with source null 2021-03-22 11:35:02 +01:00
miconis 0fe40b08e4 addition of deduplication profiles for the results, double check on pids and the title with a lower threshold 2021-03-19 17:12:05 +01:00
miconis 98854b0124 minor changes 2021-03-19 16:57:40 +01:00
Claudio Atzori 5a043e95ea code formatting 2021-03-19 11:37:27 +01:00
Claudio Atzori a4e82a65aa integrated filter applied when merging BETA & PROD graphs to rule our records from Datacite 2021-03-19 11:34:44 +01:00
Claudio Atzori 75144dacb3 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-03-19 09:07:40 +01:00
Claudio Atzori 972d5a3d98 [dedup] Datacite should be authoritative for datasets 2021-03-19 09:04:20 +01:00
Sandro La Bruzzo 25d5663d97 added filter 2021-03-18 10:24:42 +01:00
Sandro La Bruzzo 5f98ea74a9 Added fix for pid generation in stableIds 2021-03-17 15:53:24 +01:00