Commit Graph

1198 Commits

Author SHA1 Message Date
Claudio Atzori 5441f01586 Merge pull request 'missing landingPage urls in instances' (#22) from instances-with-landing-page into master
Looks good, thanks!
2020-06-16 15:32:44 +02:00
Claudio Atzori 89859111ee Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-06-16 15:28:29 +02:00
Claudio Atzori 4ec262db53 included externalreference(s) in the result view on the Hive graph DB 2020-06-16 15:28:20 +02:00
Michele Artini 8a4f84f8c0 refactoring 2020-06-16 12:34:13 +02:00
Claudio Atzori 2a4f65795f WIP: graph cleaner implementation 2020-06-15 18:32:24 +02:00
Claudio Atzori c15c8c0ad0 map datasource identities (including piwik ids) as original IDs 2020-06-15 16:07:30 +02:00
Claudio Atzori 0d52816244 WIP: graph cleaner implementation 2020-06-13 13:06:04 +02:00
Claudio Atzori bed65a1be6 WIP: graph cleaner implementation 2020-06-12 18:25:47 +02:00
Claudio Atzori c4d9f1837f [maven-release-plugin] prepare for next development iteration 2020-06-12 12:21:08 +02:00
Claudio Atzori f0746a7605 [maven-release-plugin] prepare release dhp-1.2.2 2020-06-12 12:21:03 +02:00
Claudio Atzori 463489f59f code formatting 2020-06-12 12:03:25 +02:00
Claudio Atzori 4bcad1c9c3 Merge branch 'graph_cleaning' 2020-06-12 11:40:25 +02:00
Claudio Atzori cdb1956fe9 WIP: graph cleaner implementation 2020-06-12 11:36:59 +02:00
Alessia Bardi b347499745 do not use deprecated subreltype 2020-06-12 10:58:02 +02:00
Claudio Atzori 97b1c4057c WIP: graph cleaner implementation 2020-06-12 10:45:18 +02:00
Claudio Atzori ba8a024af9 avoid NPEs merging titles 2020-06-12 10:45:11 +02:00
Michele Artini 30ea1bda88 oozie workflow 2020-06-12 10:42:35 +02:00
Michele Artini c22cb5a3c6 refactoring 2020-06-12 09:47:55 +02:00
Michele Artini 472cf77639 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-11 14:30:47 +02:00
Michele Artini c6b5bb3f17 orcid events 2020-06-11 14:30:24 +02:00
Michele Artini c2e1b66e83 Revert "orcid events"
This reverts commit 48959e9a17.
2020-06-11 14:28:03 +02:00
Michele Artini 48959e9a17 orcid events 2020-06-11 14:24:02 +02:00
Alessia Bardi e79943965b Fixes #5604: field oamandatepublications in XML 2020-06-11 12:49:31 +02:00
Michele Artini a41e0cb648 missing landingPage urls in instances 2020-06-11 12:28:34 +02:00
Michele Artini 04fdcacd83 results with all joined entities 2020-06-11 11:25:18 +02:00
Michele Artini 99f88e1cb8 fixed generation entities from claims 2020-06-11 10:51:57 +02:00
Claudio Atzori d1d92c4d8c fixed integration of claims in the graph 2020-06-11 10:12:00 +02:00
Claudio Atzori 953da4a427 Merge branch 'master' into graph_cleaning 2020-06-10 21:36:56 +02:00
Claudio Atzori f1bce64391 WIP: graph cleaner implementation 2020-06-10 21:36:31 +02:00
Claudio Atzori 67c7b31ba6 Merge branch 'master' into graph_cleaning 2020-06-10 15:00:35 +02:00
Claudio Atzori 3ebf81d2b0 Merge pull request 'oaf-store-interpretation' (#21) from oaf-store-interpretation into master
Looks good, thanks Michele!
2020-06-10 14:58:09 +02:00
Michele Artini 5869cb76b3 reformatting 2020-06-10 12:11:16 +02:00
Michele Artini c08e66e01e fixed a workflow parameter 2020-06-10 10:11:56 +02:00
Michele Artini 7177a32d75 import of invisible stores 2020-06-10 10:04:00 +02:00
Claudio Atzori ce12f236bb disabled test, need to update the joined_entity.json file 2020-06-09 20:07:36 +02:00
Claudio Atzori a2fdf85ba1 WIP: graph cleaner implementation 2020-06-09 19:52:53 +02:00
Alessia Bardi 4551c1082f mapping csv for orcid 2020-06-09 18:08:47 +02:00
Alessia Bardi 2d3f7d1eb4 fixed log classes to make the ORCID test run 2020-06-09 18:07:14 +02:00
Alessia Bardi a3a6755d58 mapping csv for Unpaywall 2020-06-09 17:45:44 +02:00
Claudio Atzori d9f33582c5 WIP: graph cleaner implementation 2020-06-09 17:20:40 +02:00
Alessia Bardi f3b033cf09 added csv line for funders from Crossref 2020-06-09 17:08:26 +02:00
Alessia Bardi 79969d78b9 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-06-09 17:05:39 +02:00
Alessia Bardi fc4d220964 updated function name for SNSF 2020-06-09 17:05:31 +02:00
Michele Artini baaa55f4a3 use of pace to calculate trusts 2020-06-09 16:01:31 +02:00
Alessia Bardi 33b130ec43 Mapping instructions for MAG 2020-06-09 15:57:15 +02:00
Alessia Bardi d6de406e11 fixed classid for subjects 2020-06-09 14:43:34 +02:00
Alessia Bardi f072125152 map volume and issue in journal information from MAG 2020-06-09 14:32:10 +02:00
Alessia Bardi b7cb1163ea identifiers always start with 50 2020-06-09 10:39:11 +02:00
Alessia Bardi 181f52b9bc Added mapping table for Crossref 2020-06-08 19:33:47 +02:00
Alessia Bardi 9fd25887f7 Result identifiers all start with 50| 2020-06-08 19:32:24 +02:00
Alessia Bardi 16cb073b15 set the instance dateofacceptance with the Crossref createdDate in case the issuedDate is blank 2020-06-08 19:06:03 +02:00
Michele Artini bb659d870c join simrels 2020-06-08 16:29:01 +02:00
Michele Artini 81e85465d8 join simrels 2020-06-08 16:26:16 +02:00
Claudio Atzori 3d871c6651 Merge branch 'master' into graph_cleaning 2020-06-08 15:23:24 +02:00
Claudio Atzori 25a093b1a4 integrated changes from master 2020-06-08 15:04:00 +02:00
Sandro La Bruzzo e34e7d6728 merge DOIBoost 2020-06-08 08:32:22 +02:00
Sandro La Bruzzo e46e2a4776 Merge remote-tracking branch 'origin/master' into doiboost 2020-06-08 08:17:14 +02:00
Spyros Zoupanos 3576dd186b Adding hive timeout as workflow parameter 2020-06-05 22:29:54 +03:00
Claudio Atzori b2349659cf WIP: graph property fixing implementation 2020-06-05 18:37:38 +02:00
Michele Artini a73973a74b partial implementation of broker events generation 2020-06-05 11:43:00 +02:00
Michele Artini 7e82996e7c partial implementation of broker events generation 2020-06-04 17:10:43 +02:00
Sandro La Bruzzo b57e8ba374 Merge remote-tracking branch 'origin/master' into doiboost 2020-06-04 14:39:41 +02:00
Sandro La Bruzzo 7ac1ba2e35 improvement DOIBoost 2020-06-04 14:39:20 +02:00
Michele Artini 97177d7f7b partial refactoring 2020-06-04 10:26:34 +02:00
Sandro La Bruzzo 13815d5d13 improvement DOIBoost 2020-06-01 17:52:12 +02:00
Claudio Atzori 05f269a1c0 kryo based parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob 2020-06-01 00:32:42 +02:00
Claudio Atzori 5e23fb3a74 code formatting 2020-05-30 10:52:56 +02:00
Claudio Atzori 54ca8ed6c3 uniformed param name (isLookupUrl), Vocab model classes defined as Serializable 2020-05-29 18:17:30 +02:00
Claudio Atzori 1577bd5b8b added IsLookupUrl to the raw_db workflow parameters 2020-05-29 16:18:16 +02:00
Claudio Atzori 91d78b825b Merge pull request 'import from db using is vocabularies' (#17) from result_pids into master
Looks good, thanks Michele!
2020-05-29 16:02:40 +02:00
Michele Artini adb798faa5 import from db using is vocabularies 2020-05-29 12:03:51 +02:00
Claudio Atzori 6f5f498c78 restored common properties driving executor-cores and executor-memory in join_organization_relations wf node 2020-05-29 11:22:00 +02:00
Claudio Atzori b2f9564f13 WIP: fixed PrepareRelationsJob; parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob 2020-05-29 10:58:15 +02:00
Miriam Baglioni dfa4997a4f removed commented code 2020-05-29 10:45:18 +02:00
Miriam Baglioni 6f1eea28b6 changed message in log 2020-05-29 10:41:39 +02:00
Sandro La Bruzzo b87b3ddb6b changed mapping ORCIDToOAF 2020-05-29 09:32:04 +02:00
Miriam Baglioni 8b6e886fb6 added new resource for testing 2020-05-28 23:54:31 +02:00
Miriam Baglioni 6989fb9c8a changed the project test according to the newly introduced join with the db project codes 2020-05-28 23:53:24 +02:00
Miriam Baglioni 782984d8e5 added needed parameter 2020-05-28 23:52:41 +02:00
Miriam Baglioni 01f7876595 fix issue with flatMap - the return type must not be null 2020-05-28 23:50:32 +02:00
Claudio Atzori a57965a3ea limiting the dimensions of outliers 2020-05-28 17:36:37 +02:00
Miriam Baglioni 773735f870 added the path to the file containing the projects code from the db 2020-05-28 17:30:45 +02:00
Miriam Baglioni 6a15067a64 added one step in the workflow 2020-05-28 17:30:09 +02:00
Miriam Baglioni 5309a99a70 modified the PrepareProjects to consider those in the db 2020-05-28 17:29:53 +02:00
Miriam Baglioni b737ed8236 added part to read projects from the openaire db to filter out those in the csv file that are not in the db 2020-05-28 17:29:21 +02:00
Claudio Atzori 821be1f8b6 experimental implementation of custom aggregation using kryo encoders 2020-05-28 13:53:13 +02:00
Claudio Atzori 83504ecace limiting the maximum number of authors allowed in XML records to MAX_AUTHORS = 200; authors with ORCID can exceed that limit 2020-05-28 13:52:30 +02:00
Claudio Atzori ef11593068 JoinedEntity.links defined as empty list by default 2020-05-28 13:50:44 +02:00
Claudio Atzori 5dea155a87 increased number of partitions produced by the join_all_entities phase as well as spark.sql.shuffle.partitions in adjancency_lists phase 2020-05-28 13:49:59 +02:00
Miriam Baglioni 35b7279147 changed test because data are saved as SequenceFile now, and because of the group by the number of produced updates decreases 2020-05-28 10:26:12 +02:00
Miriam Baglioni 37c155b86a merge branch with fork master 2020-05-28 10:09:51 +02:00
Miriam Baglioni df44db686a refactoring 2020-05-28 10:07:00 +02:00
Miriam Baglioni 87b07f4af8 removed unused variables 2020-05-28 10:05:43 +02:00
Miriam Baglioni 1060977272 added fs actions to remove and then create the workingDir 2020-05-28 10:04:36 +02:00
Miriam Baglioni 96d1a3c431 deleted the file used to store the csv files 2020-05-28 10:04:10 +02:00
Miriam Baglioni 669c05c771 added groupBy before creating Actions 2020-05-28 10:00:45 +02:00
Sandro La Bruzzo 02f90eeb07 Merge remote-tracking branch 'origin/master' into doiboost 2020-05-28 09:58:32 +02:00
Sandro La Bruzzo 7d29b61c62 code refactor 2020-05-28 09:57:46 +02:00
Claudio Atzori fdd54bad1c code formatting 2020-05-27 19:31:54 +02:00
Miriam Baglioni 1855453434 changed the outputdir of the last step 2020-05-27 17:59:36 +02:00
Claudio Atzori b9b1bc9967 Merge branch 'master' into provision_indexing 2020-05-27 12:55:20 +02:00
Claudio Atzori aac1515b58 Merge pull request 'result_pids without conflicts ???' (#16) from result_pids into master
Looks good, thanks Michele
2020-05-27 12:54:52 +02:00
Michele Artini f5ce7d76e1 resolve conflicts 2020-05-27 12:49:17 +02:00
Claudio Atzori cfd753217c repartition the join_entities in 24k files 2020-05-27 12:44:01 +02:00
Claudio Atzori 2f1a623d09 sync from master branch 2020-05-27 12:39:58 +02:00
Claudio Atzori 9e4ec1543b updated test 2020-05-27 12:38:42 +02:00
Claudio Atzori 8047d16dd9 added RDD based adjacency list creation procedure 2020-05-27 12:38:12 +02:00
Claudio Atzori f057dcdf65 limit the max number of externalreferences to MAX_EXTERNAL_ENTITIES 2020-05-27 12:37:33 +02:00
Michele Artini b81f2741d2 xquery 2020-05-27 12:10:20 +02:00
Michele Artini a25598140a result pids (new xpaths + IS vocabularies) 2020-05-27 12:10:20 +02:00
Michele Artini 7a7272d9ec result pids (new xpaths + IS vocabularies) 2020-05-27 12:10:20 +02:00
Michele Artini 3ceb2d2853 match terms with vocabularies 2020-05-27 11:34:13 +02:00
Claudio Atzori 4e36d689dd fixed XML serialization for children sub-elements (duplicates & externalreferences) 2020-05-26 18:30:40 +02:00
Miriam Baglioni 92e3a52e91 merge branch with fork master 2020-05-26 15:57:51 +02:00
Michele Artini c15d997925 xquery 2020-05-26 13:13:17 +02:00
Michele Artini c6af36496a result pids (new xpaths + IS vocabularies) 2020-05-26 13:11:09 +02:00
Michele Artini 093f1aff03 result pids (new xpaths + IS vocabularies) 2020-05-26 13:06:55 +02:00
Claudio Atzori b8e541a454 fixing repeated organization.websiteurl in organization entities (#5645) as well as project.ecinternationalorganizationeurinterests 2020-05-26 10:30:09 +02:00
Claudio Atzori 55595d7235 HACK: patch NULL values with defaults found in result.datainfo.deletedbyinference and result.context 2020-05-26 10:28:35 +02:00
Claudio Atzori 7b288a94cb code formatting 2020-05-26 09:54:13 +02:00
Miriam Baglioni 54d869e618 merge upstream 2020-05-26 09:22:04 +02:00
Miriam Baglioni eea07f4c42 refactoring 2020-05-26 09:21:49 +02:00
Sandro La Bruzzo 79c26382da Merge remote-tracking branch 'origin/master' into doiboost 2020-05-26 09:15:50 +02:00
Sandro La Bruzzo 25f52e19a4 implemented generation of ActionSet 2020-05-26 09:15:33 +02:00
Michele Artini d6aada4957 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-26 08:44:31 +02:00
Michele Artini b1546605e3 updated version of a dependency 2020-05-26 08:44:15 +02:00
Claudio Atzori 7582532e73 [maven-release-plugin] prepare for next development iteration 2020-05-25 19:48:18 +02:00
Claudio Atzori 01c2e93395 [maven-release-plugin] prepare release dhp-1.2.1 2020-05-25 19:48:14 +02:00
miconis da1e5cf557 implementation of the result title merge. main title with higher trust, distinct between the others 2020-05-25 18:02:57 +02:00
Miriam Baglioni d3d36647d2 merge upstream 2020-05-25 10:38:22 +02:00
Miriam Baglioni 74215f6d9f refactoring 2020-05-25 10:38:16 +02:00
Miriam Baglioni dbde2d243a changed due to move of PacePerson from dhp-graph-mapper to dhp-common 2020-05-25 10:35:39 +02:00
Miriam Baglioni f754c424bd changed logic to compute PacePerson only once for each Author to be enriched 2020-05-25 10:35:02 +02:00
Miriam Baglioni 8f51af4e9b added PacePerson to get name surname for authors having only fullname set 2020-05-25 10:34:30 +02:00
Miriam Baglioni b258f99ece fix for issue that duplicated result 2020-05-25 10:26:48 +02:00
Miriam Baglioni 8f6ce970f9 moved PacePerson to dhp-common to avoid conflict in dependency with graph-mapper 2020-05-25 10:25:55 +02:00
Claudio Atzori de108f54d6 code formatting 2020-05-23 10:21:19 +02:00
Claudio Atzori 6b56cae57d added mapping for bestaccessrights 2020-05-23 09:57:39 +02:00
Claudio Atzori 7181807e64 code formatting 2020-05-23 09:51:48 +02:00
Sandro La Bruzzo 2408083566 implemented filtering step 2020-05-23 08:46:49 +02:00
Sandro La Bruzzo 244f6e50cf Merge remote-tracking branch 'origin/master' into doiboost 2020-05-22 20:52:15 +02:00
Sandro La Bruzzo 147dd389bf minor fix 2020-05-22 20:51:42 +02:00
Miriam Baglioni 0d1ec1913f added fix to avoid duplication of results 2020-05-22 18:42:25 +02:00
miconis 5d7ac78c41 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-22 17:25:08 +02:00
miconis 0fd0c7d725 reimplementation of the sim between two authors. now it takes into account both name and surname. threshold incremented to 1.0 if the name is too short 2020-05-22 17:24:57 +02:00
Michele Artini eb606dc1e2 partial implementation of events with rels 2020-05-22 17:17:41 +02:00
Miriam Baglioni 29066a6b46 applied code cleanup 2020-05-22 15:38:50 +02:00
Miriam Baglioni 8610ad5142 added groupby id to fix multiple result with same id at join step 2020-05-22 15:32:55 +02:00
Miriam Baglioni 1e44703e3e merge upstream 2020-05-22 15:30:07 +02:00
Miriam Baglioni ac8025f469 - 2020-05-22 15:29:41 +02:00
Miriam Baglioni 50ad83b97f - 2020-05-22 15:27:19 +02:00
Miriam Baglioni 473c6d3a23 produces AtomicActions instead of Projects 2020-05-22 15:26:57 +02:00
Sandro La Bruzzo 72278b9375 Merge remote-tracking branch 'origin/master' into doiboost 2020-05-22 15:17:13 +02:00
Sandro La Bruzzo 22936d0877 Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost 2020-05-22 15:15:17 +02:00
Sandro La Bruzzo 9fbb221457 completed mapping of UnpayWall and ORCID 2020-05-22 15:15:09 +02:00
Miriam Baglioni 70389b0a30 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-22 13:53:23 +02:00
Miriam Baglioni 4308f31165 added fix to make test run 2020-05-22 13:13:01 +02:00
Claudio Atzori 946598cfba Merge branch 'master' into provision_indexing 2020-05-22 12:35:41 +02:00
Claudio Atzori 3cf2796ac6 code formatting 2020-05-22 12:34:00 +02:00
Michele Artini dc4621b3cb filter ORCID and MAG identifiers 2020-05-22 12:25:01 +02:00
Michele Artini 9f2d0f1b08 filter ORCID and MAG identifiers 2020-05-22 11:00:27 +02:00
Michele Artini 9de71e54a8 filter ORCID and MAG identifiers 2020-05-22 10:47:39 +02:00
Michele Artini c5f7e17348 author fullnames 2020-05-22 10:08:02 +02:00
Claudio Atzori ad40470040 Merge branch 'master' into provision_indexing 2020-05-22 08:51:22 +02:00
Claudio Atzori 925d933204 making XmlRecordFactory immune to graph encoding changes (mostly to avoid NPEs) 2020-05-22 08:50:44 +02:00
Claudio Atzori b33dd58be4 replaced parameter 'reuseRecords' with 'resumeFrom', allowing to restart the provision workflow execution from any step, useful for manual submissions or debugging 2020-05-22 08:50:06 +02:00
Michele Artini c7ca3cf35b Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-21 16:48:20 +02:00
Michele Artini 3e34517479 partial implementation of events with rels 2020-05-21 16:47:53 +02:00
Miriam Baglioni eae12a6586 Merge branch 'master' into dhp_oaf_model 2020-05-21 16:31:22 +02:00
Miriam Baglioni 6750075fbd merge upstream 2020-05-21 16:31:09 +02:00
Miriam Baglioni 4589c428b1 generate action sets and saves them in the hdfs path for the actions sets 2020-05-21 16:30:39 +02:00
miconis 8b35e0e7f0 reimplementation of the author merging in deduprecord creation. implementation of the test class. minor changes 2020-05-21 12:02:44 +02:00
miconis 8bbd1d0501 reimplementation of the author merging in deduprecord creation. implementation of the test class. 2020-05-21 11:52:14 +02:00
Michele Artini e43d4d7778 added a coalesce in sql query 2020-05-21 11:08:07 +02:00
Claudio Atzori dbfb9c19fe minor changes 2020-05-21 10:00:14 +02:00
Michele Artini b3bcbb3129 resolve name of organization countries 2020-05-21 08:41:32 +02:00
Enrico Ottonello 1109d3b3fc Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost 2020-05-21 00:41:27 +02:00
Enrico Ottonello 869a53040e save to text file format 2020-05-21 00:41:21 +02:00
Sandro La Bruzzo 5818abaab4 fixed Crossref Mapping 2020-05-20 17:05:46 +02:00
Claudio Atzori da4267d0fe Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-05-20 14:58:22 +02:00
Claudio Atzori d7d2a0637f added extra parameters to the provision indexing workflow 2020-05-20 14:55:38 +02:00
Miriam Baglioni 055eec5a77 added resource for prepare project test 2020-05-20 13:54:10 +02:00
Miriam Baglioni 9079bc1f61 - 2020-05-20 13:53:32 +02:00
Miriam Baglioni 67ba4fde57 added test for prepare projects step 2020-05-20 13:53:08 +02:00
Miriam Baglioni 5e0e554000 Merge branch 'master' into dhp_oaf_model 2020-05-20 10:57:30 +02:00
Miriam Baglioni 76f3f73caa merge upstream 2020-05-20 10:31:40 +02:00
Miriam Baglioni 3c0eb12d3e removed the not zipped files 2020-05-20 10:31:05 +02:00
Miriam Baglioni c0d9e02340 zipped test resources that are too big 2020-05-20 10:30:25 +02:00
Miriam Baglioni 5e9c9fa87c tests 2020-05-20 10:29:57 +02:00
Miriam Baglioni faed7521bf added resources for testing 2020-05-20 10:29:29 +02:00
Miriam Baglioni 75491482de added a new preparation step to replicate each project for the programme it is associated to 2020-05-20 10:28:56 +02:00
Miriam Baglioni eb0e47ba53 parameters for h2020 programme 2020-05-20 10:26:44 +02:00
Sandro La Bruzzo b771d67e9d next step of MAG conversion implemented 2020-05-20 08:14:03 +02:00
Miriam Baglioni 08218d2f3f new workflow with added steps 2020-05-19 18:44:25 +02:00
Miriam Baglioni 457293ccc0 test for the various steps of project update with programme 2020-05-19 18:43:42 +02:00
Miriam Baglioni 9447d78ef3 added preparation classes 2020-05-19 18:42:50 +02:00
Michele Artini 85ca5622d4 partial implementation of generation of simple events 2020-05-19 16:17:35 +02:00
Claudio Atzori 0bdfbb0a57 reintroduced RDD based relation cut off procedure 2020-05-19 15:02:21 +02:00
Enrico Ottonello 934ad570e0 joined summaries and activities dataset 2020-05-19 12:57:21 +02:00
Enrico Ottonello ca722d4d18 merged 2020-05-19 09:43:12 +02:00
Enrico Ottonello 7362bc3e9d workflow to generate seq(doi,AuthorList) 2020-05-19 09:34:44 +02:00
Sandro La Bruzzo 8c95b50f26 Merge remote-tracking branch 'origin/master' into doiboost 2020-05-19 09:25:04 +02:00
Sandro La Bruzzo 486e850bcc next step of MAG conversion implemented 2020-05-19 09:24:45 +02:00
Enrico Ottonello d4e9075f22 Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost 2020-05-18 19:51:36 +02:00
Enrico Ottonello fc80e8c7de added accumulator; last modified date of the record is added to saved data; lambda file is partitioned into 20 parts before starting downloading 2020-05-18 19:51:29 +02:00
Claudio Atzori f3bc8aed31 lifted memory requirements for country propagation wf 2020-05-18 15:29:10 +02:00
Miriam Baglioni b71fbb68b1 removed the removeOutputDir command from code. Relations are written in Append mode. Erasing the output dir meant removing all the relations computed in the previous steps 2020-05-18 13:57:20 +02:00
Miriam Baglioni 629af7cb79 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-18 13:07:36 +02:00
Miriam Baglioni f0f14caf99 removed script files for shell actions not performed 2020-05-18 13:06:16 +02:00
Miriam Baglioni 23bbac7d7c - 2020-05-18 13:05:03 +02:00
Miriam Baglioni 4f1ff7ba73 added dependency to org.apache.commons common-csv 2020-05-18 13:04:39 +02:00
Miriam Baglioni abc45f2708 added dnet-45 HttpConnector and related Classes, produced the POJO for projects and programme 2020-05-18 13:04:06 +02:00
Claudio Atzori ef9a9a9f1a remove the output path when starting 2020-05-15 22:34:19 +02:00
Enrico Ottonello 0b29bb7e3b spark job to download orcid record modified after a fixed date 2020-05-15 19:49:26 +02:00
Miriam Baglioni 5a648016ef parameters from the GetFile class 2020-05-15 18:18:50 +02:00
Miriam Baglioni 83c262a483 workflow to download the files 2020-05-15 18:18:31 +02:00
Miriam Baglioni 22cb9e0da7 simple code to get file from URL 2020-05-15 18:18:01 +02:00
Claudio Atzori 7838f2c63f init the empty list for author pids mapped from OAF 2020-05-15 17:06:01 +02:00
Claudio Atzori 82b615ab33 NPE check 2020-05-15 16:04:46 +02:00
Miriam Baglioni e26a67c3eb merge with upstream 2020-05-15 15:53:05 +02:00
Claudio Atzori 7a89507ab1 code formatting 2020-05-15 15:16:54 +02:00
Miriam Baglioni 5ec8c49ad5 removed serialization points 2020-05-15 12:49:58 +02:00
Claudio Atzori 1d35836a58 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-05-15 12:26:31 +02:00
Claudio Atzori cfc8948717 fixed mapping OdfToGraph: pick the correct element to map author pids and author affiliations; extended mapping Oaf2Graph: added support for author pids 2020-05-15 12:26:16 +02:00
Michele Artini 2a4e68a292 events recognition 2020-05-15 12:25:37 +02:00
Claudio Atzori a832658296 code formatting 2020-05-15 10:21:09 +02:00
Claudio Atzori 50d6a2ad3c added output directory removal in the blacklist spark actions; included common global properties in blacklist's workflow.xml 2020-05-15 09:53:37 +02:00
Claudio Atzori 18f46e47b9 added relations to the graph2hive import workflow 2020-05-15 09:34:48 +02:00
Claudio Atzori 9d028ffe1c cleanup 2020-05-15 09:28:55 +02:00
Claudio Atzori fd62359538 cleanup 2020-05-15 09:28:15 +02:00
Claudio Atzori eb64335a54 parallel implementation for graph Hive importer 2020-05-15 09:05:26 +02:00
Miriam Baglioni 94571c9a51 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-14 18:29:55 +02:00
Miriam Baglioni f25db01664 changed in the constant from propagationconstants to modelconstants 2020-05-14 18:29:24 +02:00
Miriam Baglioni d05630d979 removed the constants added in ModelConstants 2020-05-14 18:22:50 +02:00
Claudio Atzori f044d09315 revised mapping: more accurate mapping for name/surname from datacite format; improved mapping of null values 2020-05-14 15:07:24 +02:00
Miriam Baglioni e7eb4f377e Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-14 10:34:17 +02:00
Miriam Baglioni 8828458acf minor changes 2020-05-14 10:34:12 +02:00
Claudio Atzori ab37953332 added global properties in wf definitions to avoid repeating name-node and job-tracker in the (many) distcp actions; reintroduced output directory removal at the beginning of each spark action 2020-05-14 10:25:41 +02:00
Claudio Atzori 12bfa6702e Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-05-13 17:01:17 +02:00
Claudio Atzori 5ecacad70a fixed default resource typing in Oaf/Odf mapping 2020-05-13 17:01:11 +02:00
Enrico Ottonello 12756f9d41 multithread (4 threads) test to feed elastic search 2020-05-13 16:11:40 +02:00
Michele Artini c0265213a0 partial implementation 2020-05-13 12:00:27 +02:00
Sandro La Bruzzo a92ee0f41e Merge remote-tracking branch 'origin/master' into doiboost 2020-05-13 10:38:13 +02:00
Sandro La Bruzzo d876f47d06 next step of MAG conversion implemented 2020-05-13 10:38:04 +02:00
Claudio Atzori 1ddd33de41 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-05-13 09:04:41 +02:00
Claudio Atzori 85f3c55992 fixed node names in blacklist workflow 2020-05-13 09:04:33 +02:00
Miriam Baglioni 43f127448d changed the package name from dhp-propagation to dhp-enrichment for the preparation phase of funding propagation 2020-05-12 18:24:26 +02:00
Enrico Ottonello 08040cef80 spark action to analyze orcid lambda file 2020-05-12 16:57:43 +02:00
Claudio Atzori ec0782e582 renamed jar containing the bulktagging and propagation workflows from dhp-[bulktagging|propagation] to dhp-enrichment; adjusted xml formatting 2020-05-12 15:49:28 +02:00
Miriam Baglioni 1547ca7e15 added blacklist step to the end of the provision wf 2020-05-12 12:17:27 +02:00