Commit Graph

142 Commits

Author SHA1 Message Date
Enrico Ottonello c58db1c8ea added filter on null value after map function 2020-10-22 15:11:02 +02:00
Enrico Ottonello 846ba30873 if typologies mapping fails, an exception will be propagated 2020-10-22 14:36:18 +02:00
Enrico Ottonello c3114ba0ae replaced null as return value with a more safe empty string 2020-10-22 14:21:31 +02:00
Enrico Ottonello c295c71ca0 added comment 2020-10-22 14:07:26 +02:00
Enrico Ottonello ab083f9946 propagate exception on parsing work (PR request) 2020-10-22 14:02:32 +02:00
sandro 3a81a940b7 solved bug on merge publication 2020-10-21 22:41:55 +02:00
Sandro La Bruzzo cd9c377d18 adapted scholexplorer Dump generation to the new Dataset definition 2020-10-08 10:10:13 +02:00
Sandro La Bruzzo c4a3c52e45 fixed Doiboost bug in the identifier 2020-10-01 15:46:44 +02:00
Enrico Ottonello a97ad20c7b exception is now propagated (PR review) 2020-09-22 10:46:34 +02:00
Enrico Ottonello 9e8e7fe6ef add comments 2020-09-15 11:32:49 +02:00
Enrico Ottonello 0377b40fba output to one parquet file 2020-07-30 18:38:07 +02:00
Enrico Ottonello 196f36c6ed fix publication dataset creation 2020-07-30 13:38:33 +02:00
Enrico Ottonello c82b15b5f4 migrate configuration to ocean, fix publication dataset creation 2020-07-28 15:23:52 +02:00
Enrico Ottonello ca37d3427b separate workflow to parse orcid summaries, activities and generate dataset with no doi publications; test 2020-07-03 23:30:31 +02:00
Enrico Ottonello 1729cc5cf3 publication conversion from json to oaf test 2020-07-02 18:46:20 +02:00
Enrico Ottonello 5525f57ec8 converter from orcid work json to oaf 2020-07-01 18:36:14 +02:00
Enrico Ottonello b7b6be12a5 fixed enriched works generation 2020-06-29 18:03:16 +02:00
Enrico Ottonello b2213b6435 merged with dnet version 2020-06-26 17:27:34 +02:00
Enrico Ottonello c5e149c46e Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi 2020-06-26 16:15:38 +02:00
Enrico Ottonello d6498278ed added workflow to generate seq(orcidId,work) and seq(orcidId,enrichedWork) 2020-06-25 18:43:29 +02:00
Sandro La Bruzzo a6c0faac70 added test to verify secondary sorting 2020-06-25 10:48:15 +02:00
Enrico Ottonello fcbb4c1489 parser of orcid publication data from xml original dump 2020-06-24 16:29:32 +02:00
Sandro La Bruzzo 9bf67f5de1 resolved conflicts 2020-06-17 09:15:43 +02:00
Sandro La Bruzzo 1d4275acc4 implemented first version of exportation of Scholexplorer into ActionSet 2020-06-17 09:10:38 +02:00
Claudio Atzori 67c7b31ba6 Merge branch 'master' into graph_cleaning 2020-06-10 15:00:35 +02:00
Claudio Atzori a2fdf85ba1 WIP: graph cleaner implementation 2020-06-09 19:52:53 +02:00
Alessia Bardi 4551c1082f mapping csv for orcid 2020-06-09 18:08:47 +02:00
Alessia Bardi 2d3f7d1eb4 fixed log classes to make the ORCID test run 2020-06-09 18:07:14 +02:00
Alessia Bardi a3a6755d58 mapping csv for Unpaywall 2020-06-09 17:45:44 +02:00
Alessia Bardi f3b033cf09 added csv line for funders from Crossref 2020-06-09 17:08:26 +02:00
Alessia Bardi fc4d220964 updated function name for SNSF 2020-06-09 17:05:31 +02:00
Alessia Bardi 33b130ec43 Mapping instructions for MAG 2020-06-09 15:57:15 +02:00
Alessia Bardi d6de406e11 fixed classid for subjects 2020-06-09 14:43:34 +02:00
Alessia Bardi f072125152 map volume and issue in journal information from MAG 2020-06-09 14:32:10 +02:00
Alessia Bardi b7cb1163ea identifiers always start with 50 2020-06-09 10:39:11 +02:00
Alessia Bardi 181f52b9bc Added mapping table for Crossref 2020-06-08 19:33:47 +02:00
Alessia Bardi 9fd25887f7 Result identifiers all start with 50| 2020-06-08 19:32:24 +02:00
Alessia Bardi 16cb073b15 set the instance dateofacceptance with the Crossref createdDate in case the issuedDate is blank 2020-06-08 19:06:03 +02:00
Sandro La Bruzzo 7ac1ba2e35 improvement DOIBoost 2020-06-04 14:39:20 +02:00
Sandro La Bruzzo 13815d5d13 improvement DOIBoost 2020-06-01 17:52:12 +02:00
Sandro La Bruzzo b87b3ddb6b changed mapping ORCIDToOAF 2020-05-29 09:32:04 +02:00
Sandro La Bruzzo 7d29b61c62 code refactor 2020-05-28 09:57:46 +02:00
Sandro La Bruzzo 25f52e19a4 implemented generation of ActionSet 2020-05-26 09:15:33 +02:00
Sandro La Bruzzo 2408083566 implemented filtering step 2020-05-23 08:46:49 +02:00
Sandro La Bruzzo 147dd389bf minor fix 2020-05-22 20:51:42 +02:00
Sandro La Bruzzo 22936d0877 Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost 2020-05-22 15:15:17 +02:00
Sandro La Bruzzo 9fbb221457 completed mapping of UnpayWall and ORCID 2020-05-22 15:15:09 +02:00
Enrico Ottonello 1109d3b3fc Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost 2020-05-21 00:41:27 +02:00
Enrico Ottonello 869a53040e save to text file format 2020-05-21 00:41:21 +02:00
Sandro La Bruzzo 5818abaab4 fixed Crossref Mapping 2020-05-20 17:05:46 +02:00
Sandro La Bruzzo b771d67e9d next step of MAG conversion implemented 2020-05-20 08:14:03 +02:00
Enrico Ottonello 934ad570e0 joined summaries and activities dataset 2020-05-19 12:57:21 +02:00
Enrico Ottonello ca722d4d18 merged 2020-05-19 09:43:12 +02:00
Enrico Ottonello 7362bc3e9d workflow to generate seq(doi,AuthorList) 2020-05-19 09:34:44 +02:00
Sandro La Bruzzo 486e850bcc next step of MAG conversion implemented 2020-05-19 09:24:45 +02:00
Enrico Ottonello d4e9075f22 Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost 2020-05-18 19:51:36 +02:00
Enrico Ottonello fc80e8c7de added accumulator; last modified date of the record is added to saved data; lambda file is partitioned into 20 parts before starting downloading 2020-05-18 19:51:29 +02:00
Enrico Ottonello 0b29bb7e3b spark job to download orcid record modified after a fixed date 2020-05-15 19:49:26 +02:00
Sandro La Bruzzo d876f47d06 next step of MAG conversion implemented 2020-05-13 10:38:04 +02:00
Enrico Ottonello 08040cef80 spark action to analyze orcid lambda file 2020-05-12 16:57:43 +02:00
Enrico Ottonello f53e42bda7 merged 2020-05-11 14:49:28 +02:00
Enrico Ottonello 7990894454 different date format in lambda file parsing 2020-05-11 14:41:11 +02:00
Sandro La Bruzzo 0c6774e4da updated pom version 2020-05-11 14:35:14 +02:00
Sandro La Bruzzo 2b48a2c32c Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost 2020-05-11 09:38:36 +02:00
Sandro La Bruzzo 4cebca09d2 start implementing MAG mapping 2020-05-11 09:38:27 +02:00
Enrico Ottonello b9d126dd1f formatting modified after commit 2020-05-08 14:54:37 +02:00
Enrico Ottonello 7e1c987370 Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost 2020-05-08 14:49:50 +02:00
Enrico Ottonello 9d812788e4 added job to download from orcid the records modified after a fixed date, the info are taken from last_modified.csv on hdfs 2020-05-08 14:49:39 +02:00
Sandro La Bruzzo 4a89465740 reformatted code 2020-04-29 13:24:29 +02:00
Sandro La Bruzzo a6b1a59d0a merged with master 2020-04-29 13:20:57 +02:00
Sandro La Bruzzo 920c0f19c3 Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost 2020-04-29 13:13:16 +02:00
Sandro La Bruzzo 09f161f1f4 implemented unit test 2020-04-29 13:13:02 +02:00
Enrico Ottonello 1edcd53581 added shell actions to download all 11 activities files from ORCID 2020-04-28 20:25:09 +02:00
Enrico Ottonello a1861b9eaa workflow works in parallel on 2 activity files 2020-04-24 18:33:37 +02:00
Enrico Ottonello 941e94af06 added workflow for generating authors with dois data sequence file 2020-04-24 15:50:40 +02:00
Sandro La Bruzzo 4ba386d996 improved crossref mapping 2020-04-23 09:33:48 +02:00
Sandro La Bruzzo 157915988c improved crossref mapping 2020-04-22 15:00:44 +02:00
Enrico Ottonello 5977f08e92 merged 2020-04-22 14:50:50 +02:00
Enrico Ottonello 7d759947ae used vtd for parsing orcid xml record, set 4g heapspace 2020-04-22 14:41:19 +02:00
Sandro La Bruzzo e4b105cece improved crossref mapping 2020-04-20 18:10:07 +02:00
Sandro La Bruzzo 5d46ec7d5f fixed name of wrong package 2020-04-20 14:49:32 +02:00
Sandro La Bruzzo 82cc3b707d fixed name of wrong package 2020-04-20 14:47:06 +02:00
Sandro La Bruzzo 7029942e06 Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost 2020-04-20 13:26:41 +02:00
Sandro La Bruzzo 0e45f4d450 continue mapping from crossref to OAF 2020-04-20 13:26:29 +02:00
Enrico Ottonello a466648b4b renamed output file 2020-04-20 12:32:03 +02:00
Enrico Ottonello 4ae55e3891 added workflow parameters 2020-04-20 12:00:04 +02:00
Sandro La Bruzzo eef60bb9f4 created structure of oozie wf for ORCID 2020-04-20 10:24:57 +02:00
Sandro La Bruzzo 4d0d9de07e reorganized package and fixed test 2020-04-20 10:02:42 +02:00
Sandro La Bruzzo 618bc1fc72 first implementation of crossrefMapping 2020-04-20 09:53:34 +02:00
Enrico Ottonello 1d44a359ea renamed package folder 2020-04-20 09:25:40 +02:00
Enrico Ottonello 7011d4203e parser of orcid summaries from tar gz file on hdfs, that creates a sequence file with author information (oid, name, surname, credit name) 2020-04-17 18:52:39 +02:00
Sandro La Bruzzo 205e9521c6 implemented import crossref job 2020-04-01 14:12:33 +02:00