Commit Graph

144 Commits

Author SHA1 Message Date
Miriam Baglioni 0f1acdf6b6 workflow and parameter 2021-06-14 10:08:55 +02:00
Enrico Ottonello c537986b7c deleted folders with merged data immediately before merge phases 2021-04-28 11:25:25 +02:00
Claudio Atzori e5abbec2ba [orcid] download of the lambda file defined in a script 2021-04-22 11:22:10 +02:00
Claudio Atzori 55964cbd81 [orcid] large oozie workflow cleanup; updated workflow for the orcidnodoi actionset creation 2021-04-22 10:18:09 +02:00
Claudio Atzori 52244f813a merging from enrico.ottonello/dnet-hadoop:orcid-no-doi 2021-04-21 12:24:09 +02:00
Enrico Ottonello 27068aacd1 wf to move orcid-no-doi dataset on the folder ready the import 2021-04-16 17:17:47 +02:00
Sandro La Bruzzo 67085da305 fixed NPE 2021-04-16 11:05:58 +02:00
Sandro La Bruzzo 479abd10cb Add into ORCID workflow a method that extracts orcid directly to the dump generated by Enrico 2021-04-13 17:47:43 +02:00
Claudio Atzori ee34cc51c3 [ORCID-no-doi] integrating PR#98 #98 2021-04-01 17:07:49 +02:00
Enrico Ottonello ebd67b8c8f removed duplicates orcid data on authors set 2021-03-25 11:20:52 +01:00
Sandro La Bruzzo cc5bbafa5d some fix to make workflows runs 2021-03-17 12:12:56 +01:00
Enrico Ottonello 70cb100647 added updating last orcid dataset folders after completion 2021-03-01 10:17:04 +01:00
Enrico Ottonello bd3b16402b added result typologies 2021-03-01 10:16:02 +01:00
Enrico Ottonello 53d7023460 dateOfCollection taken from orcid last_update.txt on hdfs; cleaned wf parameters 2021-02-25 18:43:29 +01:00
Enrico Ottonello d43ea88caf aligned orcid result typologies with openaire vocabulary 2021-02-25 15:02:10 +01:00
Enrico Ottonello 975823b968 data from last updated orcid 2021-02-23 15:35:04 +01:00
Enrico Ottonello c238561001 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2021-02-04 10:44:21 +01:00
Enrico Ottonello 465ce39f75 job execution now based on file last_update.txt on hdfs 2021-02-04 10:44:04 +01:00
Claudio Atzori ab2fe9266a [DOIBoost] minor fixes in workflow definition 2021-01-05 10:26:39 +01:00
Claudio Atzori 7c722f3fdc [DOIBoost] fixed typo 2021-01-05 10:25:54 +01:00
Claudio Atzori 8879704ba0 [DOIBoost] configurable ES server url and index name in crossref importer 2021-01-05 10:00:13 +01:00
Sandro La Bruzzo e79445a8b4 minor fix for claudio polemica 2021-01-04 17:39:25 +01:00
Sandro La Bruzzo 8765020b85 minor fix 2021-01-04 17:37:08 +01:00
Sandro La Bruzzo b0dc92786f defined a single oozie workflow for the generation of doiboost 2021-01-04 17:01:35 +01:00
Enrico Ottonello b2de598c1a all actions from download lambda file to merge updated data into one wf 2020-12-15 10:42:55 +01:00
Enrico Ottonello efe4c2a9c5 authors and works are now updated in two separate spark actions of the wf 2020-12-12 02:06:21 +01:00
Enrico Ottonello 858efbfad1 fix dataset creation for downloaded works 2020-12-11 16:49:54 +01:00
Enrico Ottonello 2233750a37 original orcid xml data are stored in a field of the class that models orcid data 2020-12-09 09:45:19 +01:00
Sandro La Bruzzo 302baab67b fixed doiboost mapping and workflows 2020-12-07 19:59:33 +01:00
Enrico Ottonello 5c65e602d3 wf doi_authors generates one json data foreach row 2020-12-07 15:28:10 +01:00
Enrico Ottonello fa1855a4b8 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2020-12-07 11:02:59 +01:00
Enrico Ottonello b1b589ada1 wf to generate orcid dataset 2020-12-07 11:02:32 +01:00
Sandro La Bruzzo b31dd126fb fixed crossref workflow added common ORCID Class 2020-12-07 10:42:38 +01:00
Enrico Ottonello 8812ab65e1 completed download function to wf; added accumulators 2020-12-04 21:13:49 +01:00
Enrico Ottonello 1b1e9ea67c wf to generate doi_author_list for doiboost; wf to download updated works 2020-12-02 23:20:16 +01:00
Enrico Ottonello f2df3ead74 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2020-11-30 14:22:46 +01:00
Sandro La Bruzzo ec3e238de6 Fixed problem on duplicated identifier 2020-11-25 17:15:54 +01:00
Enrico Ottonello 99a086f0c6 max concurrent executors set to 10, according to ORCID Director of Technology mail request 2020-11-24 17:49:32 +01:00
Enrico Ottonello 5c17e768b2 set wf configuration with spark.dynamicAllocation.maxExecutors 20 over 20 input partitions 2020-11-23 16:01:23 +01:00
Enrico Ottonello 97c8111847 action to convert lambda file in seq file; spark action to download updated authors 2020-11-23 09:49:22 +01:00
Enrico Ottonello c0c2e05eae added wf to extracting authors and works xml data from orcid dump to hdfs; added wf to download the lamda file (containing last orcid update informations) from orcid to hdfs 2020-11-17 18:23:12 +01:00
Enrico Ottonello 13f28fa225 moved AuthorData to dhp-schemas; added other names to author data 2020-11-12 17:43:32 +01:00
Enrico Ottonello 6bc7dbeca7 first version of dataset successful generated from orcid dump 2020 2020-11-06 13:47:50 +01:00
Enrico Ottonello 9818e74a70 added dependency version in main pom.xml for orcid no doi 2020-10-22 16:38:00 +02:00
Sandro La Bruzzo cd9c377d18 adpted scholexplorer Dump generation to the new Dataset definition 2020-10-08 10:10:13 +02:00
Enrico Ottonello c82b15b5f4 migrate configuration to ocean, fix publication dataset creation 2020-07-28 15:23:52 +02:00
Enrico Ottonello ca37d3427b separate workflow to parse orcid summaries, activities and generate dataset with no doi publications; test 2020-07-03 23:30:31 +02:00
Enrico Ottonello 5525f57ec8 converter from orcid work json to oaf 2020-07-01 18:36:14 +02:00
Enrico Ottonello b7b6be12a5 fixed enriched works generation 2020-06-29 18:03:16 +02:00
Enrico Ottonello b2213b6435 merged with dnet version 2020-06-26 17:27:34 +02:00
Enrico Ottonello c5e149c46e Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2020-06-26 16:15:38 +02:00
Enrico Ottonello d6498278ed added workflow to generate seq(orcidId,work) and seq(orcidId,enrichedWork) 2020-06-25 18:43:29 +02:00
Sandro La Bruzzo a6c0faac70 added test to verify secondary sorting 2020-06-25 10:48:15 +02:00
Sandro La Bruzzo 1d4275acc4 implemented first version of exportation of Scholexplorer into ActionSet 2020-06-17 09:10:38 +02:00
Alessia Bardi 4551c1082f mapping csv for orcid 2020-06-09 18:08:47 +02:00
Alessia Bardi a3a6755d58 mapping csv for Unpaywall 2020-06-09 17:45:44 +02:00
Alessia Bardi f3b033cf09 added csv line for funders from Crossref 2020-06-09 17:08:26 +02:00
Alessia Bardi 33b130ec43 Mapping instructions for MAG 2020-06-09 15:57:15 +02:00
Alessia Bardi 181f52b9bc Added mapping table for Crossref 2020-06-08 19:33:47 +02:00
Sandro La Bruzzo 7ac1ba2e35 improvement DOIBoost 2020-06-04 14:39:20 +02:00
Sandro La Bruzzo 13815d5d13 improvement DOIBoost 2020-06-01 17:52:12 +02:00
Sandro La Bruzzo b87b3ddb6b changed mapping ORCIDToOAF 2020-05-29 09:32:04 +02:00
Sandro La Bruzzo 7d29b61c62 code refactor 2020-05-28 09:57:46 +02:00
Sandro La Bruzzo 25f52e19a4 implemented generation of ActionSet 2020-05-26 09:15:33 +02:00
Sandro La Bruzzo 2408083566 implemented filtering step 2020-05-23 08:46:49 +02:00
Sandro La Bruzzo 147dd389bf minor fix 2020-05-22 20:51:42 +02:00
Sandro La Bruzzo 22936d0877 Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost 2020-05-22 15:15:17 +02:00
Sandro La Bruzzo 9fbb221457 completed mapping of UnpayWall and ORCID 2020-05-22 15:15:09 +02:00
Enrico Ottonello 1109d3b3fc Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost 2020-05-21 00:41:27 +02:00
Enrico Ottonello 869a53040e save to text file format 2020-05-21 00:41:21 +02:00
Sandro La Bruzzo b771d67e9d next step of MAG conversion implemented 2020-05-20 08:14:03 +02:00
Enrico Ottonello 934ad570e0 joined summaries and activities dataset 2020-05-19 12:57:21 +02:00
Enrico Ottonello ca722d4d18 merged 2020-05-19 09:43:12 +02:00
Enrico Ottonello 7362bc3e9d workflow to generate seq(doi,AuthorList) 2020-05-19 09:34:44 +02:00
Sandro La Bruzzo 486e850bcc next step of MAG conversion implemented 2020-05-19 09:24:45 +02:00
Enrico Ottonello d4e9075f22 Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost 2020-05-18 19:51:36 +02:00
Enrico Ottonello fc80e8c7de added accumulator; last modified date of the record is added to saved data; lambda file is partitioned into 20 parts before starting downloading 2020-05-18 19:51:29 +02:00
Enrico Ottonello 0b29bb7e3b spark job to download orcid record modified after a fixed date 2020-05-15 19:49:26 +02:00
Sandro La Bruzzo d876f47d06 next step of MAG conversion implemented 2020-05-13 10:38:04 +02:00
Enrico Ottonello 08040cef80 spark action to analyze orcid lambda file 2020-05-12 16:57:43 +02:00
Sandro La Bruzzo 2b48a2c32c Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost 2020-05-11 09:38:36 +02:00
Sandro La Bruzzo 4cebca09d2 start implementing MAG mapping 2020-05-11 09:38:27 +02:00
Enrico Ottonello 9d812788e4 added job to download from orcid the records modified after a fixed date, the info are taken from last_modified.csv on hdfs 2020-05-08 14:49:39 +02:00
Enrico Ottonello 1edcd53581 added shell actions to download all 11 activities files from ORCID 2020-04-28 20:25:09 +02:00
Enrico Ottonello a1861b9eaa workflow works in parallel on 2 activity files 2020-04-24 18:33:37 +02:00
Enrico Ottonello 941e94af06 added workflow for generating authors with dois data sequence file 2020-04-24 15:50:40 +02:00
Sandro La Bruzzo 4ba386d996 improved crossref mapping 2020-04-23 09:33:48 +02:00
Enrico Ottonello 7d759947ae used vtd for parsing orcid xml record, set 4g heapspace 2020-04-22 14:41:19 +02:00
Sandro La Bruzzo 5d46ec7d5f fixed name of wrong package 2020-04-20 14:49:32 +02:00
Sandro La Bruzzo 82cc3b707d fixed name of wrong package 2020-04-20 14:47:06 +02:00
Enrico Ottonello 4ae55e3891 added workflow parameters 2020-04-20 12:00:04 +02:00
Sandro La Bruzzo eef60bb9f4 created structure of oozie wf for ORCID 2020-04-20 10:24:57 +02:00
Sandro La Bruzzo 618bc1fc72 first implementation of crossrefMapping 2020-04-20 09:53:34 +02:00
Sandro La Bruzzo 205e9521c6 implemented import crossref job 2020-04-01 14:12:33 +02:00