Commit Graph

2742 Commits

Author SHA1 Message Date
Miriam Baglioni 8b8ffe82dc added step of normalization for the doi 2021-06-29 18:41:39 +02:00
Miriam Baglioni 50cc21d92e Added method to normalize doi values (lower case, remove all preceding 10., filtering out doi not starting with 10.) 2021-06-29 18:35:28 +02:00
Claudio Atzori 6d3f960238 Merge pull request 'added the missing indicators files' (#120) from antonis.lempesis/dnet-hadoop:stable_ids into stable_ids (Reviewed-on: D-Net/dnet-hadoop#120) 2021-06-29 15:57:39 +02:00
Antonis Lempesis ae18171212 Merge branch 'stable_ids' into stable_ids 2021-06-29 15:33:39 +02:00
Antonis Lempesis 87f14a3899 added the missing indicators files 2021-06-29 16:31:51 +03:00
Claudio Atzori 986a8011ec Merge pull request 'copied latest changes from old fork: indicators+monitor institutions' (#119) from antonis.lempesis/dnet-hadoop:stable_ids into stable_ids (Reviewed-on: D-Net/dnet-hadoop#119) 2021-06-29 08:49:12 +02:00
Antonis Lempesis 018c4eb52c copied latest changes from old fork: indicators+monitor institutions 2021-06-28 23:46:52 +03:00
Claudio Atzori af42377d0e HttpClient used in metadata collection retries on 502, 503, 504 2021-06-28 09:34:30 +02:00
Claudio Atzori 67afd06cd1 [cleaning] cleaning instance.pid and instance.alternateidentifier using the same procedure used to clean result.pid 2021-06-24 12:10:17 +02:00
Claudio Atzori 2e8fd2c531 cleanup 2021-06-23 14:38:24 +02:00
Claudio Atzori 4dc9ebf217 [raw_all] fixed unit test 2021-06-23 14:38:07 +02:00
Claudio Atzori 50fc5a64a0 [raw_all] Aggregator graph creation merges claims (updates) with the corresponding entity 2021-06-23 11:49:42 +02:00
Claudio Atzori 5edcc6832a applying sonarLint suggestions 2021-06-23 09:53:29 +02:00
Claudio Atzori 2dd5449c13 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-06-18 10:08:15 +02:00
Claudio Atzori fd54ecf7bd bumped dhp-schemas dependency version 2021-06-18 10:08:07 +02:00
Miriam Baglioni 180d671127 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-06-18 09:46:18 +02:00
Miriam Baglioni 13c96622c9 - 2021-06-18 09:45:16 +02:00
Miriam Baglioni b486ae498f added test and test resource to verify the generation of the date of acceptance from the input extracted from the dump 2021-06-18 09:43:32 +02:00
Miriam Baglioni 464c2ddde3 changed to split in two steps the generation of the crossref dataset 2021-06-18 09:42:31 +02:00
Miriam Baglioni 6aca0d8ebb added kryo encoding for input files 2021-06-18 09:42:07 +02:00
Miriam Baglioni 3585e53da3 changed to split in two steps the generation of the crossref dataset 2021-06-18 09:41:23 +02:00
Claudio Atzori 41b551562e applying PR#115 (DatePicker) on stable_ids 2021-06-17 09:33:50 +02:00
Claudio Atzori 74833d04f1 Merge branch 'pids_beta' of https://code-repo.d4science.org/antonis.lempesis/dnet-hadoop into stable_ids 2021-06-16 15:54:18 +02:00
Claudio Atzori 7243a40c88 code formatting 2021-06-16 15:03:03 +02:00
Miriam Baglioni 95885bcf12 forces executor memory and driver memory to be 7G (trying to avoid OOM) 2021-06-16 10:17:52 +02:00
Miriam Baglioni 2550a73981 - 2021-06-16 10:04:41 +02:00
Miriam Baglioni 1c47c0d786 modified the number of executors trying to avoid OOM exception 2021-06-15 21:05:39 +02:00
Miriam Baglioni 7deac55138 added one option for resume from in the wf 2021-06-15 18:38:20 +02:00
Antonis Lempesis f7c0b80e35 storing result_instance as parquet 2021-06-15 14:45:48 +03:00
Miriam Baglioni 66e7ef892f changed the parameter name 2021-06-15 11:08:54 +02:00
Miriam Baglioni 4f47ad0891 no need to rename the folders, just write in overwrite mode, so I changed the name of the output folder 2021-06-15 09:28:31 +02:00
Miriam Baglioni 9f9dd00b94 refactoring 2021-06-15 09:24:46 +02:00
Miriam Baglioni 63d74ee379 refactoring 2021-06-15 09:24:11 +02:00
Miriam Baglioni 6ebc236657 added needed property: outputPath 2021-06-15 09:23:24 +02:00
Miriam Baglioni f7379255b6 changed the workflow to extract info from the dump 2021-06-15 09:22:54 +02:00
Miriam Baglioni d6e21bb6ea creates the crossref dataset used for doiboost together with unpacking part from tar 2021-06-14 17:27:19 +02:00
Miriam Baglioni 4da141bd7c Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-06-14 13:41:02 +02:00
Miriam Baglioni ce0cfd79e0 creates the crossref dataset used for doiboost 2021-06-14 13:40:19 +02:00
Miriam Baglioni 93efe4de82 split the construction of crossref dataset in two parts. This one just unpacks the tar entries 2021-06-14 13:39:40 +02:00
Michele Artini ada063ce70 fixed a problem with empty mdstore list (2) 2021-06-14 12:04:47 +02:00
Michele Artini 83132ee99a fixed a problem with empty mdstore list 2021-06-14 11:57:00 +02:00
Miriam Baglioni cf360d7c97 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-06-14 10:19:49 +02:00
Miriam Baglioni 8873e6b6d1 workflow and parameter 2021-06-14 10:15:57 +02:00
Miriam Baglioni 0f1acdf6b6 workflow and parameter 2021-06-14 10:08:55 +02:00
Miriam Baglioni 75780fc636 extraction of the tar for the dump of crossref, and creation of the dataset 2021-06-14 09:45:07 +02:00
Claudio Atzori 2039bb9f5f orcid / orcid_pending cleaning backported from master branch 2021-06-14 09:40:50 +02:00
Claudio Atzori dd19c4ac5a Merge pull request 'import_new_mdstores' (#112) from import_new_mdstores into stable_ids (Reviewed-on: D-Net/dnet-hadoop#112) 2021-06-14 09:23:55 +02:00
Claudio Atzori e9e86a237d Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-06-11 17:00:02 +02:00
Claudio Atzori 10bd6ca194 depending on dhp-schemas:2.5.12 (release) 2021-06-11 16:59:56 +02:00
Claudio Atzori a900bfb874 delegating the date parsing to https://github.com/sisyphsu/dateparser 2021-06-11 16:53:01 +02:00
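
Note on commit 50cc21d92e: the message describes DOI normalization as lowercasing the value, dropping whatever precedes "10.", and filtering out values that do not contain a "10." prefix. The sketch below is only one possible reading of that message, not the code from the commit. The function name `normalize_doi` and the choice to return `None` for filtered-out values are assumptions made for illustration.

```python
from typing import Optional

# Every DOI begins with the directory indicator "10."
DOI_PREFIX = "10."


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Hypothetical helper: lowercase a DOI and strip any text before "10.".

    Returns None when the value is missing or contains no "10." at all,
    which is how this sketch models "filtering out" an invalid DOI.
    """
    if raw is None:
        return None
    value = raw.strip().lower()
    start = value.find(DOI_PREFIX)
    if start < 0:
        # No "10." anywhere in the value: treat it as not a DOI.
        return None
    # Keep everything from the first "10." onward (drops resolver URLs, "doi:" labels).
    return value[start:]
```

With this reading, a value such as "https://doi.org/10.1000/ABC" would be reduced to "10.1000/abc", while a string without "10." would be discarded.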
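
Note on commit af42377d0e: the message states that the HttpClient used in metadata collection retries on 502, 503 and 504. The following is a minimal sketch of that general idea, not the project's implementation. The names `fetch_with_retry` and `send`, the attempt limit, and the fixed delay are all assumptions. The request itself is passed in as a callable so the retry logic can be shown without any real network call.

```python
import time
from typing import Callable, Tuple

# Status codes named in the commit message as retryable.
RETRYABLE_STATUSES = {502, 503, 504}


def fetch_with_retry(
    send: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Tuple[int, bytes]:
    """Hypothetical helper: call send() until it returns a non-retryable status.

    send() performs one request and returns (status_code, body).
    The last response is returned even if it is still retryable.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status not in RETRYABLE_STATUSES or attempt == max_attempts:
            return status, body
        # Wait before the next attempt (fixed delay in this sketch).
        time.sleep(delay_seconds)
    raise AssertionError("unreachable: the loop always returns")
```

Other status codes, including client errors such as 404, are returned to the caller after a single attempt in this sketch.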