Commit Graph

1335 Commits

Author SHA1 Message Date
Michele Artini dffa0b01a2 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-07 15:37:29 +02:00
Michele Artini efadbdb2bc fixed a bug with duplicated events 2020-07-07 15:37:13 +02:00
Claudio Atzori 8af8e7481a code formatting 2020-07-07 14:23:34 +02:00
Claudio Atzori b383ed42fa pass optional parameter relationFilter to the PrepareRelationJob implementation 2020-07-07 14:21:28 +02:00
Claudio Atzori 911894a987 Merge branch 'deduptesting' 2020-07-07 14:20:43 +02:00
Miriam Baglioni c19818a3f8 merge branch with fork master 2020-07-06 13:58:23 +02:00
Miriam Baglioni d22240c0ba merge upstream 2020-07-06 13:58:02 +02:00
Michele Artini edf6c6c4dc Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-03 11:48:24 +02:00
Michele Artini 04bebb708c some fixes 2020-07-03 11:48:12 +02:00
Claudio Atzori c3d67f709a adjusted dedup configuration for result entities: using new wordssuffixprefix clustering function, removed ngrampairs, adjusted queueMaxSize (800) and slidingWindowSize (80) 2020-07-02 17:35:22 +02:00
Miriam Baglioni f8bf4acd76 - 2020-07-02 16:03:11 +02:00
Miriam Baglioni e6c79d44e6 - 2020-07-02 16:02:02 +02:00
Miriam Baglioni d7f6f0c216 changed code to use other lib 2020-07-02 16:01:34 +02:00
Miriam Baglioni 8fdc9e070c added dependency to OkHttp 2020-07-02 16:01:08 +02:00
Miriam Baglioni 94500a581b merge branch with fork master 2020-07-02 14:25:39 +02:00
Miriam Baglioni c133a23cf0 merge upstream 2020-07-02 14:24:57 +02:00
Claudio Atzori 1d39f7901c Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-07-02 12:45:01 +02:00
Claudio Atzori 0f77cac4b5 fix: deduper must use queueMaxSize instead of groupMaxSize for the block definition 2020-07-02 12:43:51 +02:00
Sandro La Bruzzo 18b9330312 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-02 12:43:19 +02:00
Michele Artini b413db0bff white/blacklists 2020-07-02 12:43:03 +02:00
Claudio Atzori d380b85246 unit test for the preparation of the relations 2020-07-02 12:42:13 +02:00
Claudio Atzori ed1c7e5d75 fixed workflow for the import of the claims alone 2020-07-02 12:40:21 +02:00
Sandro La Bruzzo 07f0723fa7 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-02 12:37:49 +02:00
Sandro La Bruzzo 1d420eedb4 added generation of EBI Dataset 2020-07-02 12:37:43 +02:00
Claudio Atzori e4a29a4513 fixed workflow for the import of the claims alone 2020-07-02 12:36:33 +02:00
Michele Artini 3bcdfbabe9 list with limits 2020-07-01 08:42:39 +02:00
Michele Artini 59a5421c24 indexing, accumulators, limited lists 2020-06-30 16:17:09 +02:00
Michele Artini 6f13673464 accumulators 2020-06-29 16:33:32 +02:00
Sandro La Bruzzo dab783b173 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-29 09:05:00 +02:00
Michele Artini a6ea432435 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-29 08:44:20 +02:00
Michele Artini 35ae381d28 all events matchers 2020-06-29 08:43:56 +02:00
Claudio Atzori 7817338e05 added test to verify the relation pre-processing 2020-06-26 17:58:33 +02:00
Claudio Atzori 8d59fdf34e WIP: dataset based PrepareRelationsJob 2020-06-26 14:32:58 +02:00
Michele Artini 2393d9da2f limits 2020-06-26 11:20:45 +02:00
Sandro La Bruzzo 96ce124b59 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-25 17:00:43 +02:00
Miriam Baglioni 4a7de07ea2 refactoring 2020-06-25 16:32:40 +02:00
Miriam Baglioni 54a12978d3 fixed issue in xquery 2020-06-25 16:30:20 +02:00
Michele Artini 408165a756 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-25 15:53:35 +02:00
Michele Artini e8fb305f18 compilation of event map 2020-06-25 15:53:20 +02:00
Michele Artini 4eb3e109d7 compilation of event map 2020-06-25 15:45:50 +02:00
Claudio Atzori d839e88783 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-06-25 14:06:30 +02:00
Claudio Atzori 6f5771c1c9 sets author.rank when null 2020-06-25 14:06:21 +02:00
Michele Artini e28033c6d8 some fixes 2020-06-25 13:01:09 +02:00
Claudio Atzori 216975c4ec restored complete provision workflow 2020-06-25 12:55:52 +02:00
Claudio Atzori 2d77d3a388 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-06-25 12:54:30 +02:00
Claudio Atzori 93f627ea51 code formatting 2020-06-25 12:54:21 +02:00
Miriam Baglioni 05a99cfb61 change the position of value and description elements in the workflow definition 2020-06-25 12:36:08 +02:00
Claudio Atzori 7df2712824 Merge branch 'provision_indexing' 2020-06-25 12:22:41 +02:00
Claudio Atzori e62333192c WIP: prepare relation job 2020-06-25 12:22:18 +02:00
Claudio Atzori 6933ec11fb WIP: prepare relation job 2020-06-25 11:04:12 +02:00
Sandro La Bruzzo a6c0faac70 added test to verify secondary sorting 2020-06-25 10:48:15 +02:00
Claudio Atzori 69b0391708 WIP: prepare relation job 2020-06-25 10:19:56 +02:00
Michele Artini abcbebcbb4 fixed generation of ids 2020-06-25 09:50:46 +02:00
Michele Artini 77d2a1b1c4 params to choose sql queries for beta or production 2020-06-25 09:28:13 +02:00
Claudio Atzori 46e76affeb WIP: prepare relation job 2020-06-24 19:01:15 +02:00
Claudio Atzori 0e723d378b added default from vocab for missing instance.refereed; remove spurious prefixes from orcid values; WIP: prepare relation job 2020-06-24 18:34:42 +02:00
Michele Artini 202f6e62ff Splitted join wf 2020-06-24 15:47:06 +02:00
Sandro La Bruzzo 96689a8994 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-24 14:06:50 +02:00
Sandro La Bruzzo 46631a4421 updated mapping scholexplorer to OAF 2020-06-24 14:06:38 +02:00
Michele Artini e53dd62e87 minot changes 2020-06-24 09:24:45 +02:00
Michele Artini 8b9933b934 refactoring aggregators 2020-06-24 08:57:13 +02:00
Miriam Baglioni 3e5570de7a - 2020-06-23 15:44:54 +02:00
Michele Artini d13e3d3f68 fixed paths 2020-06-23 11:01:42 +02:00
Michele Artini 8386c6f90d filter of valid resultResult relations 2020-06-23 10:24:15 +02:00
Michele Artini 38bb45d0b6 test osf:refereed 2020-06-23 10:14:39 +02:00
Michele Artini c3286f4c37 fixed relType 2020-06-23 09:32:32 +02:00
Miriam Baglioni 507f7a94a8 added one of the main zenodo communities to the tagging conf for testing purposes 2020-06-23 08:45:27 +02:00
Michele Artini af2f7705fc partial refactoring of some joins 2020-06-23 08:37:35 +02:00
Miriam Baglioni af1d40351b changed XQuery to add also the main Zenodo community among the communities associated to the openaire community 2020-06-22 19:20:54 +02:00
Miriam Baglioni e4b21be004 - 2020-06-22 17:31:50 +02:00
Miriam Baglioni afa19b0c84 changed the way to PUT the files to the rest API 2020-06-22 17:20:07 +02:00
Miriam Baglioni 250fd1c854 merge branch with fork master 2020-06-22 16:25:48 +02:00
Claudio Atzori 8a3bc7c183 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-06-22 14:12:33 +02:00
Claudio Atzori e162ba5075 added dnet workflows to orchestrate the execution of graph2hive, updateSolr and updateStats oozie wfs 2020-06-22 14:12:28 +02:00
Michele Artini 3ce20c198e reformatting 2020-06-22 12:14:25 +02:00
Michele Artini ed787398b3 refactoring wf 2020-06-22 11:45:14 +02:00
Claudio Atzori 9cd27183b6 [maven-release-plugin] prepare for next development iteration 2020-06-22 11:27:44 +02:00
Claudio Atzori 1e3dab0631 [maven-release-plugin] prepare release dhp-1.2.3 2020-06-22 11:27:39 +02:00
Miriam Baglioni df80ae5c1b merge branch with fork master 2020-06-22 10:51:23 +02:00
Miriam Baglioni e8f914f8b3 - 2020-06-22 10:50:41 +02:00
Miriam Baglioni edeb862476 excluded dependency in module that generates conflict 2020-06-22 10:49:56 +02:00
Miriam Baglioni 185facb8e5 change the deprecated DefaultHttpClient with the CLoseableHttpClient 2020-06-22 10:49:10 +02:00
Claudio Atzori 961a0d0b49 [actionset promotion] log debugging info in case of error in the action payload extraction or parsing the data 2020-06-22 10:20:45 +02:00
Claudio Atzori 5e8b922962 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-06-22 09:50:47 +02:00
Claudio Atzori 7d416f08d8 graph cleaning workflow: set hostedby to unknown repository when defined as NULL 2020-06-22 09:50:43 +02:00
Michele Artini 16c7a18435 refactoring 2020-06-22 08:51:31 +02:00
Miriam Baglioni 669a509430 - 2020-06-19 17:39:46 +02:00
Michele Artini f9fc64ffaf Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-19 15:24:43 +02:00
Michele Artini d88fe0ac84 join methods 2020-06-19 15:24:30 +02:00
Sandro La Bruzzo 464eeeec87 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-19 15:11:53 +02:00
Sandro La Bruzzo 1681de672d updated mapping scholexplorer to OAF 2020-06-19 15:11:46 +02:00
Michele Artini 4822747313 some fixes 2020-06-19 13:53:56 +02:00
Michele Artini 834f139e6e fixed some NPE 2020-06-19 12:33:29 +02:00
Claudio Atzori d0ac7514b2 cleaning workflow to include cleaning of default values 2020-06-18 19:37:25 +02:00
Miriam Baglioni 44a12d244f - 2020-06-18 18:38:54 +02:00
Michele Artini 52f62d5d8c events 2020-06-18 14:49:13 +02:00
Miriam Baglioni fb80353018 - 2020-06-18 14:21:36 +02:00
Michele Artini 61634fbfe0 removed kryo encoding 2020-06-18 14:09:58 +02:00
Michele Artini 8d2b199dd2 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-18 13:15:34 +02:00
Michele Artini e659b02e6b some wf fixing 2020-06-18 13:15:13 +02:00
Michele Artini 9a847b4557 some wf fixing 2020-06-18 13:14:10 +02:00
Miriam Baglioni 65bf312360 merge branch with fork master 2020-06-18 11:35:27 +02:00
Miriam Baglioni 3953f56bd3 added dependency to pom 2020-06-18 11:34:47 +02:00
Miriam Baglioni a118b66858 - 2020-06-18 11:34:30 +02:00
Miriam Baglioni f9578312b5 - 2020-06-18 11:34:15 +02:00
Miriam Baglioni 8b145e6aba - 2020-06-18 11:25:28 +02:00
Miriam Baglioni e8b3e972f2 changed the input params and the workflow definition to tackle the Result as all result product produced 2020-06-18 11:25:05 +02:00
Miriam Baglioni 3233b01089 changes due to adding all the result type under Result 2020-06-18 11:22:58 +02:00
Miriam Baglioni 5c8533d1a1 changed in the testing classes 2020-06-18 11:20:08 +02:00
Miriam Baglioni bc8611a95a added new resources for testing 2020-06-18 11:19:20 +02:00
Sandro La Bruzzo 9bf67f5de1 resolved conflicts 2020-06-17 09:15:43 +02:00
Sandro La Bruzzo 1d4275acc4 implemented first version of exportation of Scholexplorer into ActionSet 2020-06-17 09:10:38 +02:00
miconis 5233b15265 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-16 18:31:19 +02:00
miconis 11b77b9f4e json dumps for entity merge test modified to fit the new model. title merge adjusted to fix the error 2020-06-16 18:31:11 +02:00
Claudio Atzori 64f02de5d3 updated workflow definition to include the cleaning step 2020-06-16 17:48:51 +02:00
Claudio Atzori 306669209f code formatting 2020-06-16 16:54:44 +02:00
Claudio Atzori 1bc1d15eaf stubbing for mock datasource.identities must be typed as array 2020-06-16 16:54:28 +02:00
Claudio Atzori 631fef12a7 Merge branch 'master' into dhp_oaf_model 2020-06-16 16:11:19 +02:00
Michele Artini 9e2c23e391 partial refactoring 2020-06-16 15:55:42 +02:00
Michele Artini 113c9b1de0 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-16 15:53:39 +02:00
Michele Artini 76ea7607f7 partial refactoring 2020-06-16 15:53:13 +02:00
Claudio Atzori 603b1bd0bb Merge branch 'master' into dhp_oaf_model 2020-06-16 15:43:59 +02:00
Claudio Atzori 5441f01586 Merge pull request 'missing landingPage urls in instances' (#22) from instances-with-landing-page into master. Looks good, thanks! 2020-06-16 15:32:44 +02:00
Claudio Atzori 89859111ee Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-06-16 15:28:29 +02:00
Claudio Atzori 4ec262db53 included externalreference(s) in the result view on the Hive graph DB 2020-06-16 15:28:20 +02:00
Michele Artini 8a4f84f8c0 refactoring 2020-06-16 12:34:13 +02:00
Claudio Atzori 2a4f65795f WIP: graph cleaner implementation 2020-06-15 18:32:24 +02:00
Claudio Atzori c15c8c0ad0 map datasource identities (including piwik ids) as original IDs 2020-06-15 16:07:30 +02:00
Miriam Baglioni 9dd3ef22c5 merge branch with fork master 2020-06-15 11:23:26 +02:00
Miriam Baglioni 68cf0fd03f test input 2020-06-15 11:14:42 +02:00
Miriam Baglioni 0467145ae3 test for graph dump 2020-06-15 11:13:51 +02:00
Miriam Baglioni e43eedb5b0 added resources and workflow for dump of community products 2020-06-15 11:13:21 +02:00
Miriam Baglioni f96ca900e1 fixed issues while running on cluster 2020-06-15 11:12:14 +02:00
Miriam Baglioni 20b9e67728 added new class funder 2020-06-15 11:06:18 +02:00
Claudio Atzori 0d52816244 WIP: graph cleaner implementation 2020-06-13 13:06:04 +02:00
Claudio Atzori bed65a1be6 WIP: graph cleaner implementation 2020-06-12 18:25:47 +02:00
Claudio Atzori c4d9f1837f [maven-release-plugin] prepare for next development iteration 2020-06-12 12:21:08 +02:00
Claudio Atzori f0746a7605 [maven-release-plugin] prepare release dhp-1.2.2 2020-06-12 12:21:03 +02:00
Claudio Atzori 463489f59f code formatting 2020-06-12 12:03:25 +02:00
Claudio Atzori 4bcad1c9c3 Merge branch 'graph_cleaning' 2020-06-12 11:40:25 +02:00
Claudio Atzori cdb1956fe9 WIP: graph cleaner implementation 2020-06-12 11:36:59 +02:00
Alessia Bardi b347499745 do not use deprecated subreltype 2020-06-12 10:58:02 +02:00
Claudio Atzori 97b1c4057c WIP: graph cleaner implementation 2020-06-12 10:45:18 +02:00
Claudio Atzori ba8a024af9 avoid NPEs merging titles 2020-06-12 10:45:11 +02:00
Michele Artini 30ea1bda88 oozie workflow 2020-06-12 10:42:35 +02:00
Michele Artini c22cb5a3c6 refactoring 2020-06-12 09:47:55 +02:00
Michele Artini 472cf77639 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-06-11 14:30:47 +02:00
Michele Artini c6b5bb3f17 orcid events 2020-06-11 14:30:24 +02:00
Michele Artini c2e1b66e83 Revert "orcid events". This reverts commit 48959e9a17. 2020-06-11 14:28:03 +02:00
Michele Artini 48959e9a17 orcid events 2020-06-11 14:24:02 +02:00
Miriam Baglioni e145972962 - 2020-06-11 13:08:39 +02:00
Miriam Baglioni a01800224c - 2020-06-11 13:02:04 +02:00
Miriam Baglioni 356dd582a3 map construction moved in class 2020-06-11 12:59:22 +02:00
Alessia Bardi e79943965b Fixes #5604: field oamandatepublications in XML 2020-06-11 12:49:31 +02:00
Michele Artini a41e0cb648 missing landingPage urls in instances 2020-06-11 12:28:34 +02:00
Michele Artini 04fdcacd83 results with all joined entities 2020-06-11 11:25:18 +02:00
Michele Artini 99f88e1cb8 fixed generation entities from claims 2020-06-11 10:51:57 +02:00
Miriam Baglioni db27663750 - 2020-06-11 10:49:01 +02:00
Miriam Baglioni bb9f21d0e7 job test for class producing first step of results dump 2020-06-11 10:20:05 +02:00
Claudio Atzori d1d92c4d8c fixed integration of claims in the graph 2020-06-11 10:12:00 +02:00
Claudio Atzori 953da4a427 Merge branch 'master' into graph_cleaning 2020-06-10 21:36:56 +02:00
Claudio Atzori f1bce64391 WIP: graph cleaner implementation 2020-06-10 21:36:31 +02:00
Claudio Atzori 67c7b31ba6 Merge branch 'master' into graph_cleaning 2020-06-10 15:00:35 +02:00
Claudio Atzori 3ebf81d2b0 Merge pull request 'oaf-store-interpretation' (#21) from oaf-store-interpretation into master. Looks good, thanks Michele! 2020-06-10 14:58:09 +02:00
Michele Artini 5869cb76b3 reformatting 2020-06-10 12:11:16 +02:00
Michele Artini c08e66e01e fixed a workflow parameter 2020-06-10 10:11:56 +02:00
Michele Artini 7177a32d75 import of invisible stores 2020-06-10 10:04:00 +02:00
Claudio Atzori ce12f236bb disabled test, need to need to update the joined_entity.json file 2020-06-09 20:07:36 +02:00
Claudio Atzori a2fdf85ba1 WIP: graph cleaner implementation 2020-06-09 19:52:53 +02:00
Alessia Bardi 4551c1082f mapping csv for orcid 2020-06-09 18:08:47 +02:00
Alessia Bardi 2d3f7d1eb4 fixed log classes to make the ORCID test run 2020-06-09 18:07:14 +02:00
Alessia Bardi a3a6755d58 mapping csv for Unpaywall 2020-06-09 17:45:44 +02:00
Claudio Atzori d9f33582c5 WIP: graph cleaner implementation 2020-06-09 17:20:40 +02:00
Alessia Bardi f3b033cf09 added csv line for funders from Crossref 2020-06-09 17:08:26 +02:00
Alessia Bardi 79969d78b9 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-06-09 17:05:39 +02:00
Alessia Bardi fc4d220964 updated function name for SNSF 2020-06-09 17:05:31 +02:00
Michele Artini baaa55f4a3 use of pace to calculate trusts 2020-06-09 16:01:31 +02:00
Alessia Bardi 33b130ec43 Mapping instructions for MAG 2020-06-09 15:57:15 +02:00
Miriam Baglioni 206abba48c merge branch with fork master 2020-06-09 15:41:14 +02:00
Miriam Baglioni a089db18f1 workflow and parameters to exucute the dump 2020-06-09 15:39:38 +02:00
Miriam Baglioni 6bbe27587f new classes to execute the dump for products associated to community, enrich each result with project information and assign the result to each community it belongs to 2020-06-09 15:39:03 +02:00
Miriam Baglioni 5121cbaf6a new classes for external dump. Only classes functional to dump products 2020-06-09 15:37:46 +02:00
Alessia Bardi d6de406e11 fixed classid for subjects 2020-06-09 14:43:34 +02:00
Alessia Bardi f072125152 map volume and issue in journal information from MAG 2020-06-09 14:32:10 +02:00
Alessia Bardi b7cb1163ea identifiers always start with 50 2020-06-09 10:39:11 +02:00
Alessia Bardi 181f52b9bc Added mapping table for Crossref 2020-06-08 19:33:47 +02:00
Alessia Bardi 9fd25887f7 Result identifiers all start with 50| 2020-06-08 19:32:24 +02:00
Alessia Bardi 16cb073b15 set the instance datepfacceptance with the Crossref createdDate in case the issuedDate is blank 2020-06-08 19:06:03 +02:00
Michele Artini bb659d870c join simrels 2020-06-08 16:29:01 +02:00
Michele Artini 81e85465d8 join simrels 2020-06-08 16:26:16 +02:00
Claudio Atzori 3d871c6651 Merge branch 'master' into graph_cleaning 2020-06-08 15:23:24 +02:00
Claudio Atzori 25a093b1a4 integrated changes from master 2020-06-08 15:04:00 +02:00
Sandro La Bruzzo e34e7d6728 merge DOIBoost 2020-06-08 08:32:22 +02:00
Sandro La Bruzzo e46e2a4776 Merge remote-tracking branch 'origin/master' into doiboost 2020-06-08 08:17:14 +02:00
Spyros Zoupanos 3576dd186b Adding hive timeout as workflow parameter 2020-06-05 22:29:54 +03:00
Claudio Atzori b2349659cf WIP: graph property fixing implementation 2020-06-05 18:37:38 +02:00
Michele Artini a73973a74b partial implemantation of broker events generation 2020-06-05 11:43:00 +02:00
Michele Artini 7e82996e7c partial implemantation of broker events generation 2020-06-04 17:10:43 +02:00
Sandro La Bruzzo b57e8ba374 Merge remote-tracking branch 'origin/master' into doiboost 2020-06-04 14:39:41 +02:00
Sandro La Bruzzo 7ac1ba2e35 improvement DOIBoost 2020-06-04 14:39:20 +02:00
Michele Artini 97177d7f7b partial refactoring 2020-06-04 10:26:34 +02:00
Sandro La Bruzzo 13815d5d13 improvement DOIBoost 2020-06-01 17:52:12 +02:00
Claudio Atzori 05f269a1c0 kryo based parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob 2020-06-01 00:32:42 +02:00
Claudio Atzori 5e23fb3a74 code formatting 2020-05-30 10:52:56 +02:00
Claudio Atzori 54ca8ed6c3 uniformed param name (isLookupUrl), Vocab model classes defined as Serializable 2020-05-29 18:17:30 +02:00
Claudio Atzori 1577bd5b8b added IsLookupUrl to the raw_db workflow parameters 2020-05-29 16:18:16 +02:00
Claudio Atzori 91d78b825b Merge pull request 'import from db using is vocabularies' (#17) from result_pids into master. Looks good, thanks Michele! 2020-05-29 16:02:40 +02:00
Michele Artini adb798faa5 import from db using is vocabularies 2020-05-29 12:03:51 +02:00
Claudio Atzori 6f5f498c78 restored common properties driving executor-cores and executor-memory in join_organization_relations wf node 2020-05-29 11:22:00 +02:00
Claudio Atzori b2f9564f13 WIP: fixed PrepareRelationsJob; parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob 2020-05-29 10:58:15 +02:00
Miriam Baglioni dfa4997a4f removed commented code 2020-05-29 10:45:18 +02:00
Miriam Baglioni 6f1eea28b6 changed message in log 2020-05-29 10:41:39 +02:00
Sandro La Bruzzo b87b3ddb6b changed mapping ORCIDToOAF 2020-05-29 09:32:04 +02:00
Miriam Baglioni 8b6e886fb6 added new resource for testing 2020-05-28 23:54:31 +02:00
Miriam Baglioni 6989fb9c8a changed the project test according to the newly introduced join with the db project codes 2020-05-28 23:53:24 +02:00
Miriam Baglioni 782984d8e5 added needed parameter 2020-05-28 23:52:41 +02:00
Miriam Baglioni 01f7876595 fix issue with flatMap - the return type must not be null 2020-05-28 23:50:32 +02:00
Claudio Atzori a57965a3ea limiting the dimensions of outliers 2020-05-28 17:36:37 +02:00
Miriam Baglioni 773735f870 added the path to the file containing the projects code from the db 2020-05-28 17:30:45 +02:00
Miriam Baglioni 6a15067a64 added one step in the workflow 2020-05-28 17:30:09 +02:00
Miriam Baglioni 5309a99a70 modified the PrepareProjects to consider those in the db 2020-05-28 17:29:53 +02:00
Miriam Baglioni b737ed8236 added part to read projects from the openaire db to filter out those in the csv file that are not in the db 2020-05-28 17:29:21 +02:00
Claudio Atzori 821be1f8b6 experimental implementation of custom aggregation using kryo encoders 2020-05-28 13:53:13 +02:00
Claudio Atzori 83504ecace limiting the maximum number of authors allowed in XML records to MAX_AUTHORS = 200; authors with ORCID can exceed that limit 2020-05-28 13:52:30 +02:00
Claudio Atzori ef11593068 JoinedEntity.links defined as empty list by default 2020-05-28 13:50:44 +02:00
Claudio Atzori 5dea155a87 increased number of partitions produced by the join_all_entities phase as well as spark.sql.shuffle.partitions in adjancency_lists phase 2020-05-28 13:49:59 +02:00
Miriam Baglioni 35b7279147 changed test because data are saved as SequenceFile now, and because of the group by the umber of produced update decrease 2020-05-28 10:26:12 +02:00
Miriam Baglioni 37c155b86a merge branch with fork master 2020-05-28 10:09:51 +02:00
Miriam Baglioni df44db686a refactoring 2020-05-28 10:07:00 +02:00
Miriam Baglioni 87b07f4af8 removed unused variables 2020-05-28 10:05:43 +02:00
Miriam Baglioni 1060977272 added fs actions to remove and the create the workingDir 2020-05-28 10:04:36 +02:00
Miriam Baglioni 96d1a3c431 deleted the file were to store the csv files 2020-05-28 10:04:10 +02:00
Miriam Baglioni 669c05c771 added groupBy before creating Actions 2020-05-28 10:00:45 +02:00
Sandro La Bruzzo 02f90eeb07 Merge remote-tracking branch 'origin/master' into doiboost 2020-05-28 09:58:32 +02:00
Sandro La Bruzzo 7d29b61c62 code refactor 2020-05-28 09:57:46 +02:00
Claudio Atzori fdd54bad1c code formatting 2020-05-27 19:31:54 +02:00
Miriam Baglioni 1855453434 changed the outputdir of the last step 2020-05-27 17:59:36 +02:00
Claudio Atzori b9b1bc9967 Merge branch 'master' into provision_indexing 2020-05-27 12:55:20 +02:00
Claudio Atzori aac1515b58 Merge pull request 'result_pids without conflicts ???' (#16) from result_pids into master. Looks good, thanks Michele 2020-05-27 12:54:52 +02:00
Michele Artini f5ce7d76e1 resolve conflicts 2020-05-27 12:49:17 +02:00
Claudio Atzori cfd753217c repartition the join_entities in 24k files 2020-05-27 12:44:01 +02:00
Claudio Atzori 2f1a623d09 sync from master branch 2020-05-27 12:39:58 +02:00
Claudio Atzori 9e4ec1543b updated test 2020-05-27 12:38:42 +02:00
Claudio Atzori 8047d16dd9 added RDD based adjacency list creation procedure 2020-05-27 12:38:12 +02:00
Claudio Atzori f057dcdf65 limit the max number of externalreferences to MAX_EXTERNAL_ENTITIES 2020-05-27 12:37:33 +02:00
Michele Artini b81f2741d2 xquery 2020-05-27 12:10:20 +02:00
Michele Artini a25598140a result pids (new xpaths + IS vocabularies) 2020-05-27 12:10:20 +02:00
Michele Artini 7a7272d9ec result pids (new xpaths + IS vocabularies) 2020-05-27 12:10:20 +02:00
Michele Artini 3ceb2d2853 match terms with vocabularies 2020-05-27 11:34:13 +02:00
Claudio Atzori 4e36d689dd fixed XML serialization for children sub-elements (duplicates & externalreferences) 2020-05-26 18:30:40 +02:00