Commit Graph

2671 Commits

Author SHA1 Message Date
Miriam Baglioni 2550a73981 - 2021-06-16 10:04:41 +02:00
Miriam Baglioni 1c47c0d786 modified the number of executors trying to avoid OOM exception 2021-06-15 21:05:39 +02:00
Miriam Baglioni 7deac55138 added one option for resume from in the wf 2021-06-15 18:38:20 +02:00
Antonis Lempesis f7c0b80e35 storing result_instance as parquet 2021-06-15 14:45:48 +03:00
Miriam Baglioni 66e7ef892f changed the parameter name 2021-06-15 11:08:54 +02:00
Miriam Baglioni 4f47ad0891 no need to rename the folders, just write in overwrite mode, so I changed the name of the output folder 2021-06-15 09:28:31 +02:00
Miriam Baglioni 9f9dd00b94 refactoring 2021-06-15 09:24:46 +02:00
Miriam Baglioni 63d74ee379 refactoring 2021-06-15 09:24:11 +02:00
Miriam Baglioni 6ebc236657 added needed property: outputPath 2021-06-15 09:23:24 +02:00
Miriam Baglioni f7379255b6 changed the workflow to extract info from the dump 2021-06-15 09:22:54 +02:00
Miriam Baglioni d6e21bb6ea creates the crossref dataset used for doiboost together with unpacking part from tar 2021-06-14 17:27:19 +02:00
Miriam Baglioni 4da141bd7c Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-06-14 13:41:02 +02:00
Miriam Baglioni ce0cfd79e0 creates the crossref dataset used for doiboost 2021-06-14 13:40:19 +02:00
Miriam Baglioni 93efe4de82 split the construction of crossref dataset in two parts. This one just unpacks the tar entries 2021-06-14 13:39:40 +02:00
Michele Artini ada063ce70 fixed a problem with empty mdstore list (2) 2021-06-14 12:04:47 +02:00
Michele Artini 83132ee99a fixed a problem with empty mdstore list 2021-06-14 11:57:00 +02:00
Miriam Baglioni cf360d7c97 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-06-14 10:19:49 +02:00
Miriam Baglioni 8873e6b6d1 workflow and parameter 2021-06-14 10:15:57 +02:00
Miriam Baglioni 0f1acdf6b6 workflow and parameter 2021-06-14 10:08:55 +02:00
Sandro La Bruzzo aeb8132627 Merged branch stable_ids 2021-06-14 10:07:29 +02:00
Sandro La Bruzzo efbea1e01a minor fix 2021-06-14 09:45:14 +02:00
Miriam Baglioni 75780fc636 extraction of the tar for the dump of crossref, and creation of the dataset 2021-06-14 09:45:07 +02:00
Claudio Atzori 2039bb9f5f orcid / orcid_pending cleaning backported from master branch 2021-06-14 09:40:50 +02:00
Claudio Atzori dd19c4ac5a Merge pull request 'import_new_mdstores' (#112) from import_new_mdstores into stable_ids
Reviewed-on: D-Net/dnet-hadoop#112
2021-06-14 09:23:55 +02:00
Claudio Atzori e9e86a237d Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-06-11 17:00:02 +02:00
Claudio Atzori a900bfb874 delegating the date parsing to https://github.com/sisyphsu/dateparser 2021-06-11 16:53:01 +02:00
Sandro La Bruzzo dd997c49e0 fix wrong relation id
fix date thai ticket #6791
2021-06-10 14:47:18 +02:00
Antonis Lempesis d413b24611 added instances, orgs for monitor, totalcost for projects, apcs 2021-06-10 02:35:46 +03:00
Claudio Atzori 741077dbca Merge pull request 'Fix in Affiliation Propagation' (#113) from miriam.baglioni/dnet-hadoop:master into stable_ids
Reviewed-on: D-Net/dnet-hadoop#113
2021-06-09 18:42:42 +02:00
Miriam Baglioni 32b0c27217 Aggiornare 'dhp-workflows/dhp-enrichment/src/main/java/eu/dnetlib/dhp/resulttoorganizationfrominstrepo/PrepareResultInstRepoAssociation.java'
fix in SQL query: while writing the blacklist constraint it used d.id to indicate the datasource id, but no alias for the datasource was defined. So I removed the alias
2021-06-09 18:36:11 +02:00
Sandro La Bruzzo 0d1f37302f Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_id_scholexplorer 2021-06-09 09:35:16 +02:00
Miriam Baglioni dc07f1079b added check in case the author set to be enriched is null 2021-06-08 12:06:10 +02:00
Miriam Baglioni 8d2e086e48 changes to avoid reassignment to val 2021-06-07 17:50:37 +02:00
Miriam Baglioni f33521d338 Aggiornare 'dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcid/SparkConvertORCIDToOAF.scala'
to be able to replace the aboject assigned to author val has been replaced by var
2021-06-07 17:27:07 +02:00
Miriam Baglioni bc12e9819e Aggiornare 'dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcid/SparkConvertORCIDToOAF.scala'
The change is to fix the issue that arises when the same work appears more than once on the same ORCID profile. The change avoid to replicate the association doi -> author when the orcid id is already associated to the doi.
2021-06-07 16:37:01 +02:00
Sandro La Bruzzo 0cdb7ccdaa added inverse relations to datacite mapping 2021-06-04 15:10:20 +02:00
Sandro La Bruzzo 5b724d9972 added relations to datacite mapping 2021-06-04 10:14:22 +02:00
Sandro La Bruzzo e57294ac99 implemented changes on PUBMed dataflow 2021-06-03 10:52:09 +02:00
Michele Artini ede2749822 orcid pid type 2021-06-01 12:42:43 +02:00
Michele Artini f0fbfdcfae Merge branch 'stable_ids' into import_new_mdstores 2021-06-01 12:03:00 +02:00
Michele Artini e950750262 add nodes to import hdfs mdstores 2021-06-01 10:48:50 +02:00
Michele Artini 03a510859a removed coalesce(1) 2021-05-31 14:10:51 +02:00
Michele Artini e9f2b6037c patch of mdstore records 2021-05-31 11:36:26 +02:00
Sandro La Bruzzo 02ef46535f Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids 2021-05-31 09:50:15 +02:00
Sandro La Bruzzo aeadc5a366 updated wf Datacite Import to retrieve the block size as parameter 2021-05-31 09:49:53 +02:00
Claudio Atzori 96238152cb added serialization for alternateIdentifiers and pids within each record instance 2021-05-28 16:57:30 +02:00
Michele Artini ad56a44fda save as gzipped sequence file 2021-05-28 14:45:39 +02:00
Claudio Atzori 83722ebc47 pull #111 replied on stable_ids 2021-05-28 14:11:46 +02:00
Claudio Atzori 6e3a4e9237 updated test expectations 2021-05-28 09:37:50 +02:00
Michele Artini 4fa5671d16 first implementation of Hdfs Mdstores Importer 2021-05-27 16:22:07 +02:00
Claudio Atzori d512062b58 integrating pull #109, H2020Classification 2021-05-27 12:22:47 +02:00
Claudio Atzori 5e4b91d9ef more pervasive use of constants from ModelConstants, especially for ORCID 2021-05-26 18:20:23 +02:00
Sandro La Bruzzo bced804151 updated wf Datacite Import to retrieve the block size as parameter 2021-05-26 17:06:50 +02:00
Miriam Baglioni abd88f663d changed test resource to mirror change in the input file 2021-05-21 15:20:47 +02:00
Miriam Baglioni c844877de2 changed workflow flow to possibly parallelize also the programme and project preparation steps 2021-05-21 14:41:57 +02:00
Miriam Baglioni 073d76864d refactoring 2021-05-21 14:41:03 +02:00
Miriam Baglioni 4c8b4a774c removed not needed code 2021-05-21 14:40:07 +02:00
Miriam Baglioni 53b9d87fec new prepareProgramme according to the new file 2021-05-21 11:49:31 +02:00
Miriam Baglioni 1ee8f13580 refactoring and added "left" as join type to be 100% sure to get the whole set of projects 2021-05-21 11:49:05 +02:00
Miriam Baglioni e07c3ba089 due to change in the input file the filtering step is no more needed 2021-05-21 11:47:43 +02:00
Miriam Baglioni 54f6e2f693 changed to get the needed information to build the action set as parallel jobs 2021-05-21 11:47:00 +02:00
Miriam Baglioni 7180505519 removed non needed variable 2021-05-21 11:46:13 +02:00
Miriam Baglioni 2eb1a8b344 changed because the input file changed 2021-05-21 11:40:20 +02:00
Claudio Atzori 9d725efdc1 reverted implementation of the mdstore client 2021-05-20 18:26:09 +02:00
Miriam Baglioni 9610224671 added param to workflow property 2021-05-20 18:21:12 +02:00
Claudio Atzori 863b56b6ce using constants from ModelConstants 2021-05-20 16:23:58 +02:00
Claudio Atzori ae5c28e54f code formatting 2021-05-20 16:13:06 +02:00
Miriam Baglioni aa45b4df9b - 2021-05-20 15:57:40 +02:00
Miriam Baglioni 052c837843 - 2021-05-20 15:54:44 +02:00
Claudio Atzori b695932ae4 integrated pull#108 2021-05-20 15:34:04 +02:00
Claudio Atzori b572f56763 Merge branch 'master' into master 2021-05-20 15:22:35 +02:00
Claudio Atzori 2578b7fbb3 code formatting 2021-05-20 14:59:02 +02:00
Miriam Baglioni dc0ad8d2e0 fixed issue related to change in the file name downloaded. Added sheet name as parameter and also a check if the name should change 2021-05-20 14:53:53 +02:00
Claudio Atzori 232dce83db fixes #6701: xpath for titles to support both datacite and Guidelines v4 mapping 2021-05-20 14:41:15 +02:00
Claudio Atzori aef2977ad0 fixes #6701: xpath for titles to support both datacite and Guidelines v4 mapping 2021-05-20 14:40:22 +02:00
Miriam Baglioni 02b80cf24f resolved conflicts 2021-05-20 10:59:39 +02:00
Claudio Atzori c4a23c2f4d fix: preserving the old identifier among the originalIds in the doiboost construction process, trying to avoid UnsupportedOperationException while adding elements to the originalIds 2021-05-19 16:01:52 +02:00
Claudio Atzori ba03f549d7 fix: preserving the old identifier among the originalIds in the doiboost construction process 2021-05-19 15:43:26 +02:00
Claudio Atzori 239d0f0a9a ROR actionset import workflow backported from branch stable_ids 2021-05-18 16:12:11 +02:00
Antonis Lempesis 168edcbde3 added the final steps for the observatory promote wf and some cleanup 2021-05-18 15:23:20 +03:00
Michele Artini e56ccec536 Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids 2021-05-18 14:00:28 +02:00
Michele Artini c1e20de7cf fixed the deserialization of a json property 2021-05-18 14:00:14 +02:00
Claudio Atzori a9f512103b using constants from ModelConstants 2021-05-18 11:19:07 +02:00
Claudio Atzori eeb8bcf075 using constants from ModelConstants 2021-05-18 11:10:07 +02:00
Claudio Atzori 2cbf15f4fb using ModelConstants 2021-05-17 09:54:45 +02:00
Claudio Atzori f19feceaf0 set the old identifier before switching to the new one 2021-05-14 12:53:40 +02:00
Claudio Atzori 1bd70fa2c6 preserving the old identifier among the originalIds in the doiboost construction process 2021-05-14 11:30:41 +02:00
Claudio Atzori ca3f3a7687 using ModelConstants 2021-05-14 11:29:49 +02:00
Claudio Atzori 23b8883ab1 applied intellij code cleanup 2021-05-14 10:58:12 +02:00
Claudio Atzori 609eb711b3 IndexRecordTransformerTest for producing a record that can be manually submitted to solr 2021-05-13 16:13:28 +02:00
Claudio Atzori 1517bf7c92 IndexRecordTransformerTest for producing a record that can be manually submitted to solr 2021-05-13 16:11:22 +02:00
Sandro La Bruzzo d9a0bbda7b implemented new phase in doiboost to make the dataset Distinct by ID 2021-05-13 12:25:14 +02:00
Sandro La Bruzzo 6424cd9062 Added passing of the following parameters:
-varDataSourceId
-varOfficialName

in Each transformation Rule
2021-05-11 15:17:38 +02:00
Sandro La Bruzzo 073dcea2aa Added passing of the following parameters:
-varDataSourceId
-varOfficialName

in Each transformation Rule
2021-05-11 15:05:58 +02:00
Claudio Atzori d4c3476152 mapping datasource.journal only when an issn is available, null otherwhise 2021-05-11 11:08:54 +02:00
Claudio Atzori da9d6f3887 mapping datasource.journal only when an issn is available, null otherwhise 2021-05-11 10:45:30 +02:00
Sandro La Bruzzo 54217d73ff removed old parameters from oozie workflow 2021-05-11 09:59:02 +02:00
Claudio Atzori d1cbee8413 imported methods from CleaningFunctions, defined in GraphCleaningFunctions 2021-05-10 16:43:39 +02:00
Claudio Atzori 3797543600 MDStoreManager model classes moved in dhp-schemas 2021-05-10 14:32:05 +02:00
Claudio Atzori 25254885b9 [ActionManagement] reduced number of xqueries used to access ActionSet info 2021-05-07 17:32:03 +02:00
Claudio Atzori 8a0de2fc18 [ActionManagement] reduced number of xqueries used to access ActionSet info 2021-05-07 17:31:32 +02:00
Sandro La Bruzzo 7dc824fc23 imported changes in stable_id into master 2021-05-07 12:53:50 +02:00
Michele Artini d82071ba6c originalId with prefix 2021-05-06 15:34:48 +02:00
Claudio Atzori d4a30fabe3 clean up tests 2021-05-05 17:28:15 +02:00
Claudio Atzori dccaf173cf fixed mapping applied to ODF records. Added unit test to verify the mapping for OpenTrials 2021-05-05 16:36:15 +02:00
Claudio Atzori 8c96a82a03 fixed mapping applied to ODF records. Added unit test to verify the mapping for OpenTrials 2021-05-05 15:30:06 +02:00
Claudio Atzori 2e1eb96f9a code formatting 2021-05-05 11:23:57 +02:00
Sandro La Bruzzo 1adfc41d23 merged manually changes on stable_id for doiboost into master 2021-05-05 10:23:32 +02:00
Claudio Atzori fb930b84d3 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-05-04 18:06:30 +02:00
Claudio Atzori 923d19ea8e mdstore read lock/unlock when bulk copying records from mongodb to hdfs 2021-05-04 18:06:21 +02:00
Sandro La Bruzzo 714b71bd21 updated pubmed 2021-05-04 14:54:12 +02:00
Claudio Atzori ba86835951 using common constants from ModelConstants 2021-05-04 11:51:52 +02:00
Michele Artini f4bd2b5619 recert file SparkDedupTest.java 2021-05-04 10:26:14 +02:00
Michele Artini b4877da363 Merge branch 'stable_ids' into prepare_ror_actionset 2021-05-03 08:13:55 +02:00
Alessia Bardi 9a20057615 fixed query for organisations' pids 2021-04-29 15:23:39 +02:00
Michele Artini 6692128234 Merge branch 'stable_ids' into prepare_ror_actionset 2021-04-29 13:24:08 +02:00
Alessia Bardi a801999e75 fixed query for organisations' pids 2021-04-29 12:18:42 +02:00
Michele Artini a278d67175 parse input file 2021-04-29 11:34:47 +02:00
Claudio Atzori f6ccd54d87 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-04-29 10:10:01 +02:00
Claudio Atzori 91e7220f20 cleaned up workflow for actionset migration, adjusted dnet|cnr* dependency versions 2021-04-29 10:09:52 +02:00
Michele Artini f77ba34126 pid types 2021-04-29 09:50:05 +02:00
Michele Artini 7c5cd86927 annotations and tests 2021-04-29 09:29:19 +02:00
Michele Artini b5cf505cc6 partial implementation of the ROR->actionset workflow 2021-04-28 16:00:24 +02:00
Enrico Ottonello c537986b7c deleted folders with merged data immediately before merge phases 2021-04-28 11:25:25 +02:00
Sandro La Bruzzo 2129e9caa7 updated pangaea transformation to parse directly the xml 2021-04-28 10:21:03 +02:00
Claudio Atzori 5afa7d3e0c core utilities in dhp-common moved in external module dhp-schemas 2021-04-27 15:44:01 +02:00
Alessia Bardi e6075bb917 updated json schema for results - added instances and accessright definition 2021-04-27 15:15:08 +02:00
Sandro La Bruzzo 63c0303137 removed unused import, add log 2021-04-27 12:17:23 +02:00
Sandro La Bruzzo 74484d2823 bug fixing 2021-04-27 12:13:44 +02:00
Sandro La Bruzzo c74b03d59c Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids 2021-04-27 11:31:07 +02:00
Sandro La Bruzzo 7f8848ecdd added first implementation of Pangaea Mapping 2021-04-27 11:30:37 +02:00
Claudio Atzori 27ab8a704d adjusted poms to align with the external dhp-schema module 2021-04-27 10:12:27 +02:00
Claudio Atzori a7cf449b36 cleanup 2021-04-27 10:11:26 +02:00
Claudio Atzori fa42026590 fixed PersonCleaner extension functions 2021-04-27 10:10:06 +02:00
Claudio Atzori ef4bfd82e2 code formatting 2021-04-27 10:09:31 +02:00
Claudio Atzori faa8f6f4e2 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-04-27 09:57:03 +02:00
miconis 6d5c14e030 assertions updated in entity merger test 2021-04-27 09:47:49 +02:00
Claudio Atzori c2bb03c8b5 depending on external dhp-schemas module 2021-04-23 17:57:35 +02:00
Claudio Atzori 7ed107be53 depending on external dhp-schemas module 2021-04-23 17:52:36 +02:00
Claudio Atzori c25238480c making ODF record parsing namespace unaware (#6629) 2021-04-23 17:34:57 +02:00
Claudio Atzori 99cfb027fa making ODF record parsing namespace unaware (#6629) 2021-04-23 17:09:36 +02:00
Miriam Baglioni 72e5aa3b42 refactoring 2021-04-23 12:10:30 +02:00
Miriam Baglioni 7d1b8b7f64 merge upstream 2021-04-23 11:55:49 +02:00
miconis d0e3366c34 Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids 2021-04-22 11:45:19 +02:00
miconis 3c12eeadce bug fix in propagation of relations 2021-04-22 11:44:33 +02:00
Claudio Atzori e5abbec2ba [orcid] download of the lambda file defined in a script 2021-04-22 11:22:10 +02:00
Claudio Atzori 55964cbd81 [orcid] large oozie workflow cleanup; updated workflow for the orcidnodoi actionset creation 2021-04-22 10:18:09 +02:00
Claudio Atzori 8f309b72ff [dedup] using node names consistently across the workflow 2021-04-21 17:54:51 +02:00
Claudio Atzori 52244f813a merging from enrico.ottonello/dnet-hadoop:orcid-no-doi 2021-04-21 12:24:09 +02:00
Sandro La Bruzzo fd29307b84 updated workflow name 2021-04-21 09:21:41 +02:00
Claudio Atzori 815b9f4d56 [openorgs dedup] fixed workflow parameter declarations. Introduced support for resuming the execution from intermediate steps 2021-04-20 17:24:45 +02:00
Claudio Atzori d0d477cca3 code formatting 2021-04-20 12:50:34 +02:00
miconis 0393cdce42 addition of alternative names in export queries 2021-04-20 12:45:21 +02:00
miconis cadd0a5de8 modification of the queries for openorgs: they now consider also pending orgs 2021-04-20 12:06:56 +02:00
Sandro La Bruzzo e06c7f32f6 updated id figshare as described in #6377 2021-04-20 10:18:07 +02:00
Sandro La Bruzzo dbe0d0378e resolved ticket #6377 2021-04-20 09:44:44 +02:00
Antonis Lempesis 625d993cd9 added step for observatory db 2021-04-20 02:31:06 +03:00
Antonis Lempesis 25d0512fbd code cleanup 2021-04-20 01:43:23 +03:00
Sandro La Bruzzo 524e5f3092 Improved parallelization on transformation wf on hadoop 2021-04-19 15:17:25 +02:00
Sandro La Bruzzo cdfe01bbae improved parallelization on transformation job 2021-04-19 15:14:52 +02:00
Sandro La Bruzzo 3ae67b7a1d Merge remote-tracking branch 'origin/stable_ids' into stable_ids 2021-04-16 17:36:57 +02:00
Sandro La Bruzzo a16e5299f9 applied unique function on the final dataset 2021-04-16 17:36:48 +02:00
Claudio Atzori 45057440c1 code formatting 2021-04-16 17:28:25 +02:00
Enrico Ottonello 34ca792a55 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2021-04-16 17:18:46 +02:00
Enrico Ottonello 27068aacd1 wf to move orcid-no-doi dataset on the folder ready the import 2021-04-16 17:17:47 +02:00
miconis 7ad573d023 bug fix: changed join in propagaterelations without applying filter on the id 2021-04-16 16:40:42 +02:00
Sandro La Bruzzo 67085da305 fixed NPE 2021-04-16 11:05:58 +02:00
Sandro La Bruzzo 644aa8f40c Merge remote-tracking branch 'origin/stable_ids' into stable_ids 2021-04-16 09:14:26 +02:00
Sandro La Bruzzo 7d6a80e2f2 added new type on MAG mapping 2021-04-16 09:14:15 +02:00
Claudio Atzori 906d50563c Merge pull request 'properly invalidating impala metadata' (#105) from antonis.lempesis/dnet-hadoop:master into master
Reviewed-on: D-Net/dnet-hadoop#105
2021-04-15 15:06:22 +02:00
Claudio Atzori 3d58f95522 [stats update] properly invalidating impala metadata 2021-04-15 15:03:05 +02:00
Antonis Lempesis 03d36fadea properly invalidating impala metadata 2021-04-15 13:34:22 +03:00
miconis f64e57c112 refactoring of the id generation, sparkcreatemergerels collects entities to create root id after a join 2021-04-15 10:59:24 +02:00
miconis 176a5e493d Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids 2021-04-14 18:06:34 +02:00
miconis 3525a8f504 id generation of representative record moved to the SparkCreateMergeRel job 2021-04-14 18:06:07 +02:00
Sandro La Bruzzo 3f77bfceb0 fixed test failure on jenkins 2021-04-14 10:03:01 +02:00
Claudio Atzori 3125cef545 code formatting 2021-04-14 09:11:54 +02:00
Sandro La Bruzzo 44a0064df6 Merge remote-tracking branch 'origin/stable_ids' into stable_ids 2021-04-13 17:48:12 +02:00
Sandro La Bruzzo 479abd10cb Add into ORCID workflow a method that extracts orcid directly to the dump generated by Enrico 2021-04-13 17:47:43 +02:00
Claudio Atzori 710cd1e8f2 Merge pull request 'add xslt, personname cleaner' (#104) from andreas.czerniak/BrStableId_dnet-hadoop:stable_ids into stable_ids
Reviewed-on: D-Net/dnet-hadoop#104

LGTM
2021-04-13 14:43:05 +02:00
Claudio Atzori d1ca025b0b [cleaning] remiving authors without fullname or providing 'deactivated' keyword. Removing test test titles 2021-04-13 14:32:41 +02:00
miconis 1542196a33 bug fix: starting node of duplicate scan wf changed 2021-04-13 10:15:43 +02:00
miconis 369ed1cd8a bug fix: lookupurl parameter added to dedup record job 2021-04-13 09:08:05 +02:00
Andreas Czerniak 3b694074ff add xslt, personname cleaner 2021-04-13 07:04:27 +02:00
Claudio Atzori 511c0521e5 [dedup] avoiding NPEs handling OpenOrg relations 2021-04-12 17:45:11 +02:00
miconis d442e25cbc bug fix: ids in self mergerels are not marked deletedbyinference=true 2021-04-12 15:56:22 +02:00
miconis dcff9cecdf bug fix: ids in self mergerels are not marked deletedbyinference=true 2021-04-12 15:55:27 +02:00
miconis 11b22b2d23 bug fix in the query, it now exports only relations with non-hidden organizations 2021-04-08 11:51:47 +02:00
miconis 0857100fb8 implementation of the tests for the openorgs integration in the openaire provision 2021-04-07 18:42:16 +02:00
miconis bf685d849f addition of pids in the query for the export of openorgs for the provision, addition of ec_fields in the openorgs model 2021-04-07 14:27:43 +02:00
Miriam Baglioni 70e391d427 merge upstream 2021-04-07 10:38:08 +02:00
miconis eaaefb8b4c implementation of the procedure to reuse content of different dbs when creating the raw graph 2021-04-06 14:35:51 +02:00
miconis c39c82dfe9 modification of the jobs for the integration of openorgs in the provision, dedup records are no more created by merging but simply taking results of openorgs portal 2021-04-06 14:31:00 +02:00
Claudio Atzori 37b65cc3ad Merge pull request 'updates on stats-update workflow' (#100) from antonis.lempesis/dnet-hadoop:master into master
The workflow integrated in the _stable_ids_ branch has been run correctly on the BETA content, thus IMO this PR can be integrated in the master branch.

Reviewed-on: D-Net/dnet-hadoop#100
2021-04-02 16:13:35 +02:00
Claudio Atzori 1e7e5180fa [Graph model] updated definition of ExternalReference: added alternateLabel, removed description (#6503) 2021-04-02 12:32:12 +02:00
Claudio Atzori e686b8de8d [ORCID-no-doi] integrating PR#98 D-Net/dnet-hadoop#98 2021-04-01 17:11:03 +02:00
Claudio Atzori ee34cc51c3 [ORCID-no-doi] integrating PR#98 D-Net/dnet-hadoop#98 2021-04-01 17:07:49 +02:00
Claudio Atzori 70e49ed53c [OpenOrgsWf] trivial refactoring 2021-04-01 10:30:51 +02:00
Claudio Atzori 7941d7be29 WIP: using common definitions from ModelConstants 2021-03-31 18:33:57 +02:00
Claudio Atzori 879e8cc7ef WIP: using common definitions from ModelConstants 2021-03-31 17:12:01 +02:00
Claudio Atzori 72ce741ea6 WIP: using common definitions from ModelConstants 2021-03-31 17:07:13 +02:00
Enrico Ottonello 59ec5137e1 improvement related to https://issue.openaire.research-infrastructures.eu/issues/6501 2021-03-31 16:25:41 +02:00
Sandro La Bruzzo 616d2ecce2 splitted workflow collecting datacite into two workflows.
Released on beta
2021-03-31 15:45:58 +02:00
Miriam Baglioni 4b6e514f02 merge upstream 2021-03-30 10:27:12 +02:00
Claudio Atzori 9237d55d7f [OpenOrgsWf] cleanup 2021-03-29 17:40:34 +02:00
Claudio Atzori 7f4e9479ec [OpenOrgsWf] graph construction wf: allow to skip the import openorgs node (importOpenorgs true|false) 2021-03-29 16:59:16 +02:00
miconis 2709d08fc2 Merge branch 'stable_ids' into openorgswf 2021-03-29 16:39:07 +02:00
miconis f446580e9f code refactoring (useless classes and wf removed), implementation of the test for the openorgs dedup 2021-03-29 16:10:46 +02:00
Claudio Atzori a0837ac357 [Stats update] integrating PR#100 for testing D-Net/dnet-hadoop#100 2021-03-29 15:59:58 +02:00
miconis 2355cc4e9b minor changes and bug fix 2021-03-29 10:07:12 +02:00
Sandro La Bruzzo 1dfda3624e improved workflow importing datacite 2021-03-26 13:56:29 +01:00
Enrico Ottonello 91d8660982 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2021-03-25 11:21:20 +01:00
Enrico Ottonello ebd67b8c8f removed duplicates orcid data on authors set 2021-03-25 11:20:52 +01:00
Claudio Atzori 827e7e37db [Cleaning] drop instance.alternateIdentifier elements when they are available among instance.pid 2021-03-25 11:07:59 +01:00
miconis 28c1cdd132 merged stable_ids into openorgswf 2021-03-25 10:44:49 +01:00
miconis 5dfb66b0fa minor changes 2021-03-25 10:29:34 +01:00
miconis 348b0ef921 bug fix, implementation of the workflow for the creation of raw_organizations (openorgs dedup), addition of the pid lists to the openorgs postgres db 2021-03-24 15:51:27 +01:00
Claudio Atzori 751125fdf9 [Actionmanager] zero function considers empty entity.id as well as rel.source/rel.target 2021-03-23 17:34:32 +01:00
Claudio Atzori 1e423fdc07 [Actionmanager] remove invalid records from the input graph before groupGraphTableByIdAndMerge 2021-03-23 13:39:24 +01:00
Claudio Atzori e5ebb500cf fixed pom versions; included missing workflow modules in dhp-workflows/pom.xml 2021-03-23 12:13:53 +01:00
Claudio Atzori b75ad76f79 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-03-23 09:59:12 +01:00
Claudio Atzori 8db248aa13 avoiding error on jenkins compilations: java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)! 2021-03-23 09:56:34 +01:00
Sandro La Bruzzo 625e4c29c4 added model constants 2021-03-23 09:39:56 +01:00
Claudio Atzori b4febed138 updated mapping tests as consequence of the special treatment reserved to Handle PIDs 2021-03-23 09:37:48 +01:00
Claudio Atzori 431cbe9955 handle missing instance.pid during bulk cleaning 2021-03-23 09:28:58 +01:00
Sandro La Bruzzo c392936b97 fixed error on best access right 2021-03-23 09:23:22 +01:00
Sandro La Bruzzo c73072079d fix conflicts 2021-03-22 16:36:31 +01:00
Sandro La Bruzzo 098914dcff fix wrong relation with source null 2021-03-22 11:35:02 +01:00
miconis 0fe40b08e4 addition of deduplication profiles for the results, double check on pids and the title with a lower threshold 2021-03-19 17:12:05 +01:00
miconis 98854b0124 minor changes 2021-03-19 16:57:40 +01:00
Claudio Atzori 5a043e95ea code formatting 2021-03-19 11:37:27 +01:00
Claudio Atzori a4e82a65aa integrated filter applied when merging BETA & PROD graphs to rule our records from Datacite 2021-03-19 11:34:44 +01:00
Claudio Atzori 75144dacb3 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-03-19 09:07:40 +01:00
Claudio Atzori 972d5a3d98 [dedup] Datacite should be authoritative for datasets 2021-03-19 09:04:20 +01:00
Sandro La Bruzzo 25d5663d97 added filter 2021-03-18 10:24:42 +01:00
Sandro La Bruzzo 5f98ea74a9 Added fix for pid generation in stableIds 2021-03-17 15:53:24 +01:00
Sandro La Bruzzo 2be0428047 Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids 2021-03-17 14:54:28 +01:00
Claudio Atzori 8257f9a2bc result.pid: adjusted the mapping applied to the contents from the aggregator 2021-03-17 12:45:38 +01:00
Sandro La Bruzzo 7c97a4d900 Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids 2021-03-17 12:13:03 +01:00
Sandro La Bruzzo cc5bbafa5d some fix to make workflows runs 2021-03-17 12:12:56 +01:00
Claudio Atzori 640b885706 added instance.alternativeIdentifiers to the graph model, adjusted the mapping applied to the contents from the aggregator 2021-03-16 14:19:32 +01:00
Claudio Atzori 61a2551e74 migrated last changes from svn (dnet45) 2021-03-15 17:17:55 +01:00
Antonis Lempesis 0ba0a6b9da update promote wf to support monitor&production 2021-03-12 16:42:59 +02:00
Antonis Lempesis 60ebdf2dbe update promote wf to support monitor&production 2021-03-12 16:34:53 +02:00
Antonis Lempesis 236435b470 following redirects 2021-03-12 14:11:21 +02:00
Antonis Lempesis 3c75a05044 fixed a ton of typos 2021-03-12 13:47:04 +02:00
Sandro La Bruzzo 4bb3bcafa5 add author sequence number 2021-03-11 11:32:32 +01:00
Sandro La Bruzzo a8e5d0ea0d updated test and fixed assign of access right 2021-03-11 10:41:24 +01:00
Sandro La Bruzzo f5e7c57654 Fixed ticket 6282 2021-03-11 10:32:45 +01:00
Antonis Lempesis fa1ec5b5e9 fixed typo... 2021-03-10 14:05:58 +02:00
Claudio Atzori 01630f638d IdentifierFactory implementation based on the list of datasources authoritative for a given pid type 2021-03-09 17:11:50 +01:00
Claudio Atzori 59532b0919 [#6281 Provenance of product PIDs] Added PIDs to the Instance type; extended mapping for OAF/ODF records 2021-03-09 11:14:45 +01:00
Claudio Atzori d525785497 [#6282 open access status in the Graph] Result.Instance.accessRight defined with dedicated data type that includes the open access color. 2021-03-09 11:12:55 +01:00
Sandro La Bruzzo bbe1a7c69a [#6281 Provenance of product PIDs] Added PIDs to the Instance type in Scholexplorer Export 2021-03-09 10:46:36 +01:00
Sandro La Bruzzo a2169ccf07 // implemented Ticket #6281 added pid to Instance in doiBoost 2021-03-09 10:46:36 +01:00
Claudio Atzori f468c7f0d7 merged from master 2021-03-09 09:12:41 +01:00
Claudio Atzori 8d2bb24512 merged from master 2021-03-08 15:44:34 +01:00
Claudio Atzori acbe3119a4 RestCollectorPlugin imported from dne45 2021-03-08 09:44:09 +01:00
Antonis Lempesis f40c150a0d fixed steps... 2021-03-06 00:35:57 +02:00
Claudio Atzori fa7930d2e2 merging contributions from PR#97 2021-03-05 15:45:28 +01:00
Antonis Lempesis 6147ee4950 assigning correctly hive contexts to concepts 2021-03-05 14:12:18 +02:00
Antonis Lempesis c5fbad8093 Contexts are now downloaded instead of using the stats_ext db 2021-03-04 00:42:21 +02:00
Claudio Atzori 55f6ff5f55 README.md for aggregation workflows 2021-03-03 16:18:34 +01:00
Claudio Atzori e8789b0cdb Merge pull request 'stats DB for monitor' (#99) from antonis.lempesis/dnet-hadoop:master into master
Looks good to me, just a note on the parsing of the citations: since the last version, IIS produces citations as proper relationships among results. This is what we got already in the BETA graph

```
count		r.reltype	r.subreltype	r.relclass
62.129.254	resultResult	citation	cites
62.043.309	resultResult	citation	isCitedBy
```

Thus, I suggest to move away from the current property based implementation for the extraction of the citation links and start relying on the relationships instead.
2021-03-03 10:29:09 +01:00
Claudio Atzori 36f750cd1d removed unused classes 2021-03-03 10:22:29 +01:00
Claudio Atzori b73dce3e3a more logging on the MDStore mongodb client. Forcing UTF_8 encoding on the content 2021-03-03 10:17:16 +01:00
Antonis Lempesis 27796343ca crude sleep. hardcoded value 2021-03-03 01:37:47 +02:00
Enrico Ottonello 70cb100647 added updating last orcid dataset folders after completion 2021-03-01 10:17:04 +01:00
Enrico Ottonello bd3b16402b added result typologies 2021-03-01 10:16:02 +01:00
Claudio Atzori e76c4f62c1 MetadataRecord moved in dhp-schemas 2021-02-26 10:58:48 +01:00
miconis 1a85020572 bug fix in graph-mapper, changes in the implementation of the openorgs wf to create relations and populate openorgs db 2021-02-26 10:19:28 +01:00
Enrico Ottonello ca1800510a Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2021-02-25 18:45:02 +01:00
Enrico Ottonello 53d7023460 dateOfCollection taken from orcid last_update.txt on hdfs; cleaned wf parameters 2021-02-25 18:43:29 +01:00
Claudio Atzori 7df2461ccc indent XML records collected from oai-pmh endpoints 2021-02-25 16:19:12 +01:00
Enrico Ottonello d43ea88caf aligned orcid result typologies with openaire vocabulary 2021-02-25 15:02:10 +01:00
Claudio Atzori b830e33392 mdstore collector plugin 2021-02-25 12:30:30 +01:00
Claudio Atzori 271e88537b code formatting 2021-02-25 12:28:56 +01:00
Claudio Atzori 9c899f4433 cleanup on transformation functions and the relative tests 2021-02-24 15:07:59 +01:00
Claudio Atzori fc3fa5e343 implemented mdstore collector plugin 2021-02-24 15:07:24 +01:00
Enrico Ottonello 975823b968 data from last updated orcid 2021-02-23 15:35:04 +01:00
Miriam Baglioni 896919e735 merge upstream 2021-02-23 10:45:29 +01:00
Antonis Lempesis d90767c733 correctly invalidating metadata 2021-02-19 03:18:47 +02:00
Antonis Lempesis 3681afbe04 typo 2021-02-19 03:04:27 +02:00
Antonis Lempesis c5502eba8f actually moved stats computation in impala instead of hive... 2021-02-19 02:54:39 +02:00
Antonis Lempesis 33c85d4e66 moved stats computation in impala instead of hive 2021-02-18 17:23:34 +02:00
Antonis Lempesis b8e96c8ae7 moved cache update to the end 2021-02-18 16:42:22 +02:00
Antonis Lempesis bcbfc052b1 fixed last errors in step 21 2021-02-18 16:32:54 +02:00
Antonis Lempesis 10a29a4b9a fixes in monitor step 2021-02-18 15:05:59 +02:00
Antonis Lempesis 8ef66452d5 fixed typo 2021-02-17 22:24:44 +02:00
Antonis Lempesis a8836e2f5f fixed typo 2021-02-17 19:27:07 +02:00
Claudio Atzori e7eba9f7e7 WIP: transformation workflow error reporting; cleanup 2021-02-17 16:54:08 +01:00
Claudio Atzori 58467aaf1e WIP: transformation workflow error reporting 2021-02-17 16:14:41 +01:00
Claudio Atzori cc88701f29 retry for any Socket exception 2021-02-17 16:13:54 +01:00
Antonis Lempesis a445c1ac3d fixed variable names in monitor script 2021-02-17 16:45:09 +02:00
Antonis Lempesis 00d516360f added missing ; 2021-02-17 16:41:10 +02:00
Claudio Atzori 545f8f3e48 using jackson objectmapper instead of GSon to serialise the aggregation report 2021-02-17 12:15:00 +01:00
Claudio Atzori b592d78bb4 WIP: collectorWorker error reporting, generalised reported implementation 2021-02-17 10:28:01 +01:00
Antonis Lempesis cd1b794409 added the monitor db wf 2021-02-17 02:11:55 +02:00
Claudio Atzori cf27905a71 WIP: collectorWorker error reporting, added report messages 2021-02-16 16:53:14 +01:00
Alessia Bardi 32e81c2d89 non validated rel has null value in validated field 2021-02-16 11:01:42 +01:00
Claudio Atzori 1abe6d1ad7 WIP: collectorWorker error reporting, added report messages 2021-02-15 15:08:59 +01:00
Claudio Atzori 523a6bfa97 Merge pull request 'first commit to the correct branch' (#94) from andreas.czerniak/BrAggr_dnet-hadoop:hadoop_aggregator into hadoop_aggregator
Looks good to me, thanks Andreas!
2021-02-15 12:15:31 +01:00
Antonis Lempesis 1c029b9fc0 fixed formatting 2021-02-14 03:14:24 +02:00
Antonis Lempesis 2c4dcc90ba analyzing tables to produce stats 2021-02-14 02:54:55 +02:00
Sandro La Bruzzo 7edcc87ed4 changed xslt behaviour on failure 2021-02-12 17:27:08 +01:00
Sandro La Bruzzo 6a37c7f175 merge fixed 2021-02-12 16:38:47 +01:00
Sandro La Bruzzo b3f5c2351d Merge branch 'hadoop_aggregator' of code-repo.d4science.org:D-Net/dnet-hadoop into hadoop_aggregator
 Conflicts:
	dhp-workflows/dhp-aggregation/src/test/java/eu/dnetlib/dhp/transformation/TransformationJobTest.java
2021-02-12 16:37:14 +01:00
Sandro La Bruzzo f216277219 Implemented cleaning date 2021-02-12 16:34:52 +01:00
Andreas Czerniak 5a9017cf18 clone, min. changes, test, run 2021-02-12 14:32:36 +01:00
Claudio Atzori aa55dedb8a Merge branch 'hadoop_aggregator' of https://code-repo.d4science.org/D-Net/dnet-hadoop into hadoop_aggregator 2021-02-12 12:31:05 +01:00
Claudio Atzori 29c6f7e255 classes related to the collection workflow moved into common package; implemented MongoDB collection plugins 2021-02-12 12:31:02 +01:00
Sandro La Bruzzo 17e6f1934e fixed NPE on cleaner 2021-02-12 11:48:11 +01:00
Sandro La Bruzzo ebcc3ec14f updated wrong datacite identifier in trasformation 2021-02-11 16:25:51 +01:00
Michele Artini 83d815d0bc only stats 2021-02-11 10:57:23 +01:00
Michele Artini 8c836bf930 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2021-02-11 10:54:41 +01:00
Michele Artini 8c1600398a added resumeFrom parameter 2021-02-11 10:54:16 +01:00
Claudio Atzori 3f8f78cbfb Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2021-02-11 09:36:10 +01:00
Claudio Atzori b34b5a39ca index field authoridtypevalue mixes up different author id-type value pairs, dropped in favour of orcidtypevalue 2021-02-11 09:36:04 +01:00
Michele Artini 7249cceb53 switch of 2 nodes 2021-02-11 09:27:08 +01:00
Alessia Bardi 986dd969d3 use the proper import for Lists 2021-02-10 12:03:54 +01:00
miconis 4b2124a18e implementation of the openorgs wfs, implementation of the raw_all wf to migrate openorgs db entities 2021-02-10 11:51:50 +01:00
Alessia Bardi c4d1feca74 mapper test with validated link to project 2021-02-10 11:22:54 +01:00
Alessia Bardi 09fc7e2f78 serialization of validated flag on relationships 2021-02-10 11:22:09 +01:00
Enrico Ottonello ee4ba7298b fix last update read/write from file on hdfs 2021-02-09 23:24:57 +01:00
Claudio Atzori bc458d1b54 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2021-02-09 16:27:30 +01:00
Claudio Atzori 82e6c50f3f updated solr fields (authoridtypevalue, resultsubject, resultresourcetypename) 2021-02-09 16:27:04 +01:00
Claudio Atzori 62bd3c53ee Merge branch 'master' into provision_indexing 2021-02-09 15:46:26 +01:00
Claudio Atzori bae029f828 collection_java_xmx allows to declare the heap size allocated for the java actions involved in the metadata collectionw workflow 2021-02-08 18:07:23 +01:00
Claudio Atzori bebc54d5bf seq file storing native records is now compressed 2021-02-08 18:06:25 +01:00
Claudio Atzori 50add4c61b added requestDelay to HttpConnector2 configuration; Aggregation workflow constants moved in dhp-common 2021-02-08 12:19:38 +01:00
Miriam Baglioni 2f5e6647c6 merge upstream 2021-02-08 10:33:11 +01:00
Claudio Atzori 40df0f987d better logging, WIP: collectorWorker error reporting; common functions moved in DHPUtils 2021-02-06 20:12:00 +01:00
Claudio Atzori a8a758925e better logging, WIP: collectorWorker error reporting 2021-02-05 19:18:05 +01:00
Claudio Atzori 730973679a Merge branch 'hadoop_aggregator' of https://code-repo.d4science.org/D-Net/dnet-hadoop into hadoop_aggregator 2021-02-04 17:25:00 +01:00
Claudio Atzori deb85706db imported HttpConnector from https://svn.driver.research-infrastructures.eu/driver/dnet45/modules/dnet-modular-collector-service/trunk/src/main/java/eu/dnetlib/data/collector/plugins/HttpConnector.java as HttpConnector2 2021-02-04 17:24:52 +01:00
Sandro La Bruzzo 4dae5e605d implemented messaging btween collection worker and Dnet 2021-02-04 15:51:15 +01:00
Claudio Atzori 72c57b28fa switched project version to 1.2.4-branch_hadoop_aggregator-SNAPSHOT 2021-02-04 14:08:18 +01:00
Claudio Atzori 40764cf626 better logging, WIP: collectorWorker error reporting 2021-02-04 14:06:02 +01:00
Enrico Ottonello c238561001 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2021-02-04 10:44:21 +01:00
Enrico Ottonello 465ce39f75 job execution now based on file last_update.txt on hdfs 2021-02-04 10:44:04 +01:00
Sandro La Bruzzo 69c253710b fixed test 2021-02-04 10:30:49 +01:00
Claudio Atzori e04045089f better logging, WIP: collectorWorker error reporting 2021-02-03 17:58:22 +01:00
Alessia Bardi c67329d3ad updated test for EU Open Data portal datasets 2021-02-03 17:06:48 +01:00
Claudio Atzori 0e8a4f9f1a better logging, WIP: collectorWorker error reporting 2021-02-03 12:33:41 +01:00
Alessia Bardi fd705404a1 tests for EU Open Data portal dataset mapping 2021-02-03 10:28:17 +01:00
Miriam Baglioni 6190465851 merge upstream 2021-02-03 10:27:27 +01:00
Claudio Atzori 53884d12c2 code formatting 2021-02-02 14:38:03 +01:00
Claudio Atzori ac46c247d2 code formatting 2021-02-02 14:24:00 +01:00
Claudio Atzori bde14b149a fixed transformation target paths 2021-02-02 12:49:29 +01:00
Claudio Atzori ca4391aa1c minor changes 2021-02-02 12:44:04 +01:00