Commit Graph

412 Commits

Author SHA1 Message Date
Sandro La Bruzzo 8c9a56a0c8 refactored package name 2020-03-27 13:19:33 +01:00
Sandro La Bruzzo 2bd2d6f202 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-27 13:16:36 +01:00
Sandro La Bruzzo a9935f80d4 refactor class name and workflow name for graph mapper, added javadoc 2020-03-27 13:16:24 +01:00
Michele Artini ae03948eed Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-27 11:47:07 +01:00
Michele Artini f6e86b44a6 tests 2020-03-27 11:46:37 +01:00
Michele Artini 408be3c632 test and fixed a problem with datacite namespaces 2020-03-27 11:44:50 +01:00
Claudio Atzori 673e744649 moved openaire specific implementations under dedicated package eu.dnetlib.dhp.oa 2020-03-27 10:42:17 +01:00
Claudio Atzori 098fabab3f reorganizing content under dhp-workflows/dhp-graph-mapper 2020-03-26 19:44:19 +01:00
Claudio Atzori 77c4294924 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-26 18:26:52 +01:00
Claudio Atzori 43cbcda7ef unit test for SparkGraphImporterJob 2020-03-26 18:26:40 +01:00
Sandro La Bruzzo e04da6d66a merged all oozie wf in one 2020-03-26 14:17:07 +01:00
Sandro La Bruzzo e71e001b58 commented test that doesn't work 2020-03-26 14:15:21 +01:00
Sandro La Bruzzo 0cd022ad6a merge with master 2020-03-26 14:08:29 +01:00
Claudio Atzori abcd3f5bf5 added sample data for unit tests 2020-03-26 11:12:52 +01:00
Sandro La Bruzzo d5f11e27be renamed wf 2020-03-26 09:49:23 +01:00
Sandro La Bruzzo 9a37ad0127 renamed modules 2020-03-26 09:46:46 +01:00
Sandro La Bruzzo a768226e52 updated generate scholix to generate json 2020-03-26 09:40:50 +01:00
Claudio Atzori 9dff4adbc3 dhp-graph-mapper workflow tests upgraded to junit5 2020-03-25 18:25:12 +01:00
Claudio Atzori cd7dc3e1ae dhp-dedup-openaire workflow tests upgraded to junit5 2020-03-25 18:04:23 +01:00
Claudio Atzori c0e825e713 dhp-aggregation workflow tests upgraded to junit5 2020-03-25 17:59:45 +01:00
Michele Artini ebe45003d9 fixed some junit packages 2020-03-25 16:45:03 +01:00
Michele Artini d9bfdcd607 updated poms 2020-03-25 16:31:12 +01:00
Michele Artini 120e823cd1 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-25 16:00:10 +01:00
Claudio Atzori 71ae7dd272 renamed module dnet-dedup to dnet-dedup-openaire 2020-03-25 15:57:09 +01:00
Michele Artini fd57722c69 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-25 15:56:49 +01:00
Claudio Atzori f441f823dd fixed path referencing a test resource file 2020-03-25 15:21:46 +01:00
Claudio Atzori 51d0c9bdd7 integrated changes from branch dedupTest 2020-03-25 15:15:41 +01:00
Claudio Atzori 36f8f2ea66 master set to 'yarn' in spark actions, removed path to rawSet from the dedup scan workflow 2020-03-25 14:16:06 +01:00
Michele Artini 2559299da4 tests 2020-03-25 12:25:00 +01:00
Claudio Atzori 2180cc4fe7 more fields included in result view definition 2020-03-25 11:21:46 +01:00
Claudio Atzori efb0b7d660 master set to 'yarn' in spark actions 2020-03-25 11:15:35 +01:00
Michele Artini 0fda2c3a30 some tests on db records 2020-03-25 09:43:58 +01:00
miconis 02320de371 minor changes 2020-03-24 17:43:51 +01:00
miconis 8e8b5e8f30 roots wf merged in scan wf 2020-03-24 17:40:58 +01:00
Miriam Baglioni 19d7f8b51d uncommented execution for some of the result types for testing purposes 2020-03-24 16:49:46 +01:00
Miriam Baglioni ad24c8478f added missing parameter 2020-03-24 16:19:59 +01:00
Miriam Baglioni 46094a3eec bug fixing for implementation with dataset 2020-03-24 16:19:36 +01:00
Claudio Atzori 51ff68db66 Merge branch 'dedupTest' of https://code-repo.d4science.org/D-Net/dnet-hadoop into dedupTest 2020-03-24 11:18:19 +01:00
Claudio Atzori 1e869e7bed using method available from currently used library 2020-03-24 11:17:44 +01:00
miconis f0d72b76a8 package structure fixed 2020-03-24 10:51:40 +01:00
Claudio Atzori aaedbb1b8b WIP: dedup workflow, stage 2 2020-03-24 09:59:28 +01:00
Michele Artini e3760c7f39 fix a bug with organization countries 2020-03-24 08:43:56 +01:00
Claudio Atzori 8b0ba3d76a postprocessing script correctly run as hive2 action 2020-03-23 17:40:39 +01:00
miconis 93e2291291 minor changes 2020-03-23 17:17:56 +01:00
miconis f7890a90df implementation of the mechanism that checks the existence of a mergerel file 2020-03-23 17:13:30 +01:00
Miriam Baglioni ad712f2d79 added the needed variables in the config and read the variables in the workflow 2020-03-23 17:11:36 +01:00
Miriam Baglioni f1e9fe9752 changed implementation using dataset and query on hive 2020-03-23 17:11:00 +01:00
Miriam Baglioni f09cd1e911 removed unneeded variable in the configuration 2020-03-23 17:10:14 +01:00
Miriam Baglioni 9418e3d4fa read dataset from files instead of using hive tables 2020-03-23 17:09:27 +01:00
Miriam Baglioni a7bf037306 remove unused class 2020-03-23 14:36:43 +01:00
Miriam Baglioni 8ab8b6b0bf minor 2020-03-23 14:35:23 +01:00
Miriam Baglioni 30d58fd98c change the configuration of the workflow 2020-03-23 14:32:49 +01:00
Miriam Baglioni a440152b46 refactoring 2020-03-23 14:30:56 +01:00
Miriam Baglioni 47561f3597 changed the implementation from rdd to dataset got from sql queries (on hive) 2020-03-23 11:58:32 +01:00
miconis c20e179f5a structure of the workflows updated 2020-03-23 11:43:49 +01:00
Claudio Atzori 658d40ccbe WIP trying to use hive2 actions 2020-03-23 11:14:54 +01:00
Claudio Atzori ecb64e4998 Merge branch 'migration_wfs_regular_all_steps' 2020-03-23 08:57:01 +01:00
Michele Artini 15160032bd fixed a bug setting some organization fields 2020-03-23 08:39:14 +01:00
Claudio Atzori a4c52661a0 WIP: fixing dedup workflows 2020-03-20 19:17:24 +01:00
Claudio Atzori 6cb0a9bff0 dedup wf directory structure aligned with project commons 2020-03-20 16:48:14 +01:00
miconis e16e644faf implementation of the workflow for entity update and for relations update 2020-03-20 13:01:56 +01:00
przemek 638b78f96a Merge remote-tracking branch 'origin/master' into przemyslawjacewicz_actionmanager_impl_prototype 2020-03-19 15:12:56 +01:00
miconis 4e82a24af2 minor changes and implementation of the create connected components action 2020-03-19 15:01:07 +01:00
Claudio Atzori 36236dd1c1 action migration workflow produces eu.dnetlib.dhp.schema.action.AtomicAction(s) 2020-03-19 14:00:38 +01:00
Claudio Atzori a0ab15a64c need to stick on using guava:11.0.2 as it is the version used by the hadoop components (oozie client for sure). The last version (28.2-jre) breaks the oozie workflow submission 2020-03-19 13:58:58 +01:00
Sandro La Bruzzo 0594b92a6d implemented relation with dataset 2020-03-19 11:11:07 +01:00
miconis 679b5869e5 implementation of the lookup procedure to take dedup conf from the resource profiles 2020-03-18 17:41:56 +01:00
Claudio Atzori abe8fb69a2 added global properties, moved postprocessing script inside the oozie_app directory 2020-03-18 15:43:54 +01:00
miconis f32eae5ce9 implementation of the spark action for the simrel creation 2020-03-18 14:27:49 +01:00
Claudio Atzori c7e0730720 compress the output produced by migration steps 1 and 2 2020-03-18 09:34:57 +01:00
Claudio Atzori 2f11e37602 fixed expansion of path variables 2020-03-17 19:41:07 +01:00
Claudio Atzori 2795b0b096 no need to mkdir the all_entities file 2020-03-17 17:22:14 +01:00
Claudio Atzori 19746ad308 when reuseContent, reset ${workingPath}/all_entities 2020-03-17 17:17:06 +01:00
Claudio Atzori 2f0c85eeb3 updated parameters for regular_all_steps workflow, introduced flag 'reuseContent' 2020-03-17 17:04:58 +01:00
Miriam Baglioni 67ea3cf3ed changed the way to read the file with info on resource or relation. From sequenceFile to textFile 2020-03-17 16:32:05 +01:00
Miriam Baglioni b4652d018c moved the creation of new dir to common class. 2020-03-17 16:31:24 +01:00
Claudio Atzori b8290b5851 updated parameters for regular_all_steps workflow 2020-03-17 15:45:30 +01:00
Claudio Atzori 4706f24ec5 updated parameters for regular_all_steps workflow 2020-03-17 15:23:54 +01:00
Claudio Atzori aeb01fa353 reading from newline delimited json textfiles instead of sequence files 2020-03-17 11:57:24 +01:00
Miriam Baglioni 92f4e0001d Merge branch 'bulktag' 2020-03-16 13:33:27 +01:00
Miriam Baglioni ab08a37024 Merge remote-tracking branch 'upstream/master' 2020-03-16 12:45:23 +01:00
Claudio Atzori af835f2f98 when migrating actionsets from DM cluster, populate the AtomicAction.targetValue when empty (dedup similarities) 2020-03-15 18:07:59 +01:00
Claudio Atzori 9c84e21b87 added workflow to migrate latest version of each actionset content from DM to OCEAN cluster, mapping the targetValues from the old protobuf data model to the dhp.OAF datamodel 2020-03-13 15:56:52 +01:00
Claudio Atzori 8fe7ae1482 xml formatting 2020-03-13 15:53:56 +01:00
Przemysław Jacewicz d0c9b0cdd6 WIP promote job functions updated 2020-03-13 12:36:42 +01:00
Przemysław Jacewicz 8d9b3c5de2 WIP action payload mapping into OAF type moved, (local) graph table name enum created, tests fixed 2020-03-13 10:01:39 +01:00
Przemysław Jacewicz 5cc560c7e5 Removed unnecessary dependency on old OAF model 2020-03-13 09:57:46 +01:00
Sandro La Bruzzo addaaa091f migrate relation from RDD to Dataset 2020-03-13 09:13:20 +01:00
Przemysław Jacewicz 3f24593e51 WIP: promote job tests and test resources implementation snapshot 2020-03-11 17:06:29 +01:00
Przemysław Jacewicz 2e996d610f WIP: promote job functions implementation snapshot 2020-03-11 17:02:57 +01:00
Przemysław Jacewicz cc63cdc9e6 WIP: promote job implementation snapshot 2020-03-11 17:02:06 +01:00
Przemysław Jacewicz 69540f6f78 Serialization-safe supplier added 2020-03-11 16:59:05 +01:00
Przemysław Jacewicz e6e214dab5 Oaf merge and get strategy added 2020-03-11 16:58:17 +01:00
Claudio Atzori 7b6f0c8756 reading graph dump as text files, encoded as newline-delimited JSON records, as indicated in the wiki 2020-03-10 17:19:17 +01:00
Claudio Atzori 60aedb1110 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-10 17:09:44 +01:00
Claudio Atzori a3f184fd3f added field websiteurl in related organizations 2020-03-10 17:08:58 +01:00
Claudio Atzori 0e95544495 fixed serialization for datasource subjects 2020-03-10 17:07:44 +01:00
Sandro La Bruzzo 7b28783fb4 updated unpaywall mapping 2020-03-08 17:00:19 +01:00
Michele Artini b6efa9d6ab Configuration of the SequenceFile Writer 2020-03-05 15:49:14 +01:00
Claudio Atzori 5e342a555c no need to compute the inverse relClass, fixed text() in xpath expressions 2020-03-05 12:51:48 +01:00
Claudio Atzori 6ec04d4e02 specified column used to perform the join operation in the javadoc 2020-03-05 12:50:38 +01:00
Michele Artini 7a2a466161 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-04 14:50:59 +01:00
Michele Artini 755eade2fb fix creation ids 2020-03-04 14:49:45 +01:00
Claudio Atzori 6379f32466 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-04 10:57:06 +01:00
Claudio Atzori 0233987603 introduced post processing step following the hive DB creation/population 2020-03-04 10:56:50 +01:00
Claudio Atzori 1e563bc15e introduced distinct properties driving the resource usage for the XML record creation and the indexing phase 2020-03-04 10:55:11 +01:00
Claudio Atzori 9af3e904be close the SparkSession at the end 2020-03-04 10:53:31 +01:00
Michele Artini e7167b996a logs and closeable 2020-03-04 10:46:36 +01:00
Claudio Atzori 25ceec29ab code formatting 2020-03-04 10:44:24 +01:00
Claudio Atzori 63c00c5e88 fixed typo 2020-03-04 10:43:44 +01:00
Miriam Baglioni c37f2bd1b5 moved some classes to package to make code clearer 2020-03-03 16:42:23 +01:00
Miriam Baglioni d9d2060561 implementation for bulk tagging 2020-03-03 16:38:50 +01:00
Miriam Baglioni e80f80ca93 properties and workflow for new propagation 2020-03-02 17:03:31 +01:00
Claudio Atzori 9cf5ce2e66 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-02 17:03:10 +01:00
Claudio Atzori bc7cfd5975 indexing workflow WIP: fixed projects fundingtree xml conversion, prioritized links between results and projects when limiting them to 100 in the join procedure 2020-03-02 17:03:07 +01:00
Miriam Baglioni 50080c1b3c changed the implementation of addAll method. Before adding all the items in a collection, we check if the accumulator set is not empty 2020-03-02 16:41:37 +01:00
Miriam Baglioni 02815dd2cf update result for community moved in propagationconstants 2020-03-02 16:40:56 +01:00
Miriam Baglioni 95f8c3092f update for new propagation implementation and moving of updateResult for community business logic since the same can be used for result to community from organization and result to community from semrel 2020-03-02 16:40:17 +01:00
Miriam Baglioni 3d63f35dcb implementation of new propagation. Result to community for results linked to given organization. We exploit the hasAuthorInstitution semantic link to discover which results are related to institutions 2020-03-02 16:39:03 +01:00
Michele Artini 4b29a121b0 migration using spark in step2 2020-03-02 16:12:14 +01:00
Michele Artini 5445a57102 migration using spark in step2 2020-03-02 16:11:59 +01:00
Miriam Baglioni 3a4ccb26c0 New properties for the orcid to result propagation through semantic relation 2020-02-28 18:26:04 +01:00
Miriam Baglioni b50166b9ad None 2020-02-28 18:25:28 +01:00
Miriam Baglioni 550cb21c23 None 2020-02-28 18:24:39 +01:00
Miriam Baglioni b098ee0bae Changed the structure of typed row to also contain the list of authors with orcid 2020-02-28 18:23:51 +01:00
Miriam Baglioni 841f5523fe Added information and methods for the new propagation of orcid to result through semrel 2020-02-28 18:23:16 +01:00
Miriam Baglioni 2b7b05fb29 New propagation of ORCID to result exploiting the semantic relation connecting them. R has author with orcid o, R is bound by strong semantic relationship with R1 that has the same author without orcid, then o is also associated to the author in R1 2020-02-28 18:22:41 +01:00
Miriam Baglioni 833c83c694 Wrong file name 2020-02-28 18:21:01 +01:00
Miriam Baglioni a86426776a Changed from Oaf to Result the type of the updateResult method parameter, not to be forced to cast each time 2020-02-28 18:20:19 +01:00
Sandro La Bruzzo b32655e48e changed code to save intermediate result 2020-02-27 10:18:46 +01:00
Claudio Atzori 60bc2b1a20 drop the hive DB before populating it from scratch 2020-02-27 10:10:55 +01:00
Sandro La Bruzzo f09e065865 incremented number of repartition 2020-02-26 19:26:19 +01:00
Sandro La Bruzzo 071f5c3e52 fixed NPE 2020-02-26 15:42:20 +01:00
Sandro La Bruzzo a1a6fc8315 fixed NPE 2020-02-26 15:42:13 +01:00
Sandro La Bruzzo 1edf02a3ce added log 2020-02-26 15:25:03 +01:00
Sandro La Bruzzo c3ecabd8e8 fixed NPE 2020-02-26 14:40:02 +01:00
Sandro La Bruzzo 5d0f46651b fixed NPE 2020-02-26 14:31:34 +01:00
Sandro La Bruzzo bc342bf73a fixed wrong generation type in summary 2020-02-26 12:49:47 +01:00
Sandro La Bruzzo 3112e21858 fixed typo 2020-02-26 12:22:43 +01:00
Sandro La Bruzzo 119ae6eef5 fixed wrong loop in the workflow 2020-02-26 12:18:50 +01:00
Sandro La Bruzzo 7936583a3d added generation of Scholix collection 2020-02-26 12:09:06 +01:00
Przemysław Jacewicz 02db368dc5 Merge branch 'master' into przemyslawjacewicz_actionmanager_impl_prototype 2020-02-26 11:50:20 +01:00
Sandro La Bruzzo 2ef3705b2c Added Provision workflow 2020-02-26 10:51:35 +01:00
Michele Artini 689908b2e9 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-02-25 16:00:51 +01:00
Michele Artini 93665773ea Fixed a problem with JavaRDD Union 2020-02-25 15:59:21 +01:00
Sandro La Bruzzo b021b8a2e1 Added index wf 2020-02-24 10:15:55 +01:00
Claudio Atzori 6a73fd5da5 in order to reuse the same XmlRecordFactory across different tasks, the state of contexts must be one per record built 2020-02-21 09:17:19 +01:00
Michele Artini d49cd2fdc6 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-02-20 11:21:54 +01:00
Miriam Baglioni 3f941a2af4 Merge branch 'master' into propagationCommunityToResult 2020-02-19 18:05:22 +01:00
Miriam Baglioni b2bdc9b99b merging project to result propagation logic to master 2020-02-19 18:04:59 +01:00