Commit Graph

3267 Commits

Author SHA1 Message Date
Miriam Baglioni ad24c8478f added missing parameter 2020-03-24 16:19:59 +01:00
Miriam Baglioni 46094a3eec bug fixing for implementation with dataset 2020-03-24 16:19:36 +01:00
Claudio Atzori 51ff68db66 Merge branch 'dedupTest' of https://code-repo.d4science.org/D-Net/dnet-hadoop into dedupTest 2020-03-24 11:18:19 +01:00
Claudio Atzori 1e869e7bed using method available from currently used library 2020-03-24 11:17:44 +01:00
miconis f0d72b76a8 package structure fixed 2020-03-24 10:51:40 +01:00
Claudio Atzori aaedbb1b8b WIP: dedup workflow, stage 2 2020-03-24 09:59:28 +01:00
Michele Artini e3760c7f39 fix a bug with organization countries 2020-03-24 08:43:56 +01:00
Claudio Atzori 8b0ba3d76a postprocessing script correctly run as hive2 action 2020-03-23 17:40:39 +01:00
miconis 93e2291291 minor changes 2020-03-23 17:17:56 +01:00
miconis f7890a90df implementation of the mechanism that checks the existence of a mergerel file 2020-03-23 17:13:30 +01:00
Miriam Baglioni ad712f2d79 added the needed variables in the config and read the variables in the workflow 2020-03-23 17:11:36 +01:00
Miriam Baglioni f1e9fe9752 changed implementation using dataset and query on hive 2020-03-23 17:11:00 +01:00
Miriam Baglioni f09cd1e911 removed useless variable in the configuration 2020-03-23 17:10:14 +01:00
Miriam Baglioni 9418e3d4fa read dataset from files instead of using hive tables 2020-03-23 17:09:27 +01:00
Miriam Baglioni a7bf037306 remove unused class 2020-03-23 14:36:43 +01:00
Miriam Baglioni 8ab8b6b0bf minor 2020-03-23 14:35:23 +01:00
Miriam Baglioni 30d58fd98c change the configuration of the workflow 2020-03-23 14:32:49 +01:00
Miriam Baglioni a440152b46 refactoring 2020-03-23 14:30:56 +01:00
Miriam Baglioni 47561f3597 changed the implementation from rdd to dataset got from sql queries (on hive) 2020-03-23 11:58:32 +01:00
miconis c20e179f5a structure of the workflows updated 2020-03-23 11:43:49 +01:00
Claudio Atzori 658d40ccbe WIP trying to use hive2 actions 2020-03-23 11:14:54 +01:00
Claudio Atzori ecb64e4998 Merge branch 'migration_wfs_regular_all_steps' 2020-03-23 08:57:01 +01:00
Michele Artini 15160032bd fixed a bug setting some organization fields 2020-03-23 08:39:14 +01:00
Claudio Atzori a4c52661a0 WIP: fixing dedup workflows 2020-03-20 19:17:24 +01:00
Claudio Atzori 6cb0a9bff0 dedup wf directory structure aligned with project commons 2020-03-20 16:48:14 +01:00
miconis e16e644faf implementation of the workflow for entity update and for relations update 2020-03-20 13:01:56 +01:00
przemek 638b78f96a Merge remote-tracking branch 'origin/master' into przemyslawjacewicz_actionmanager_impl_prototype 2020-03-19 15:12:56 +01:00
miconis 6d879e2ee1 integration of the new AtomicAction class 2020-03-19 15:10:42 +01:00
miconis 6e0fb8efa0 minor changes 2020-03-19 15:08:03 +01:00
miconis 4e82a24af2 minor changes and implementation of the create connected components action 2020-03-19 15:01:07 +01:00
Claudio Atzori 36236dd1c1 action migration workflow produces eu.dnetlib.dhp.schema.action.AtomicAction(s) 2020-03-19 14:00:38 +01:00
Claudio Atzori a0ab15a64c need to stick on using guava:11.0.2 as it is the version used by the hadoop components (oozie client for sure). The last version (28.2-jre) breaks the oozie workflow submission 2020-03-19 13:58:58 +01:00
Sandro La Bruzzo 0594b92a6d implemented relation with dataset 2020-03-19 11:11:07 +01:00
Claudio Atzori 1850a02ae4 added simpler, AtomicAction replacement, based on the dhp.Oaf model 2020-03-19 10:44:16 +01:00
miconis 679b5869e5 implementation of the lookup procedure to take dedup conf from the resource profiles 2020-03-18 17:41:56 +01:00
Claudio Atzori abe8fb69a2 added global properties, moved postprocessing script inside the oozie_app directory 2020-03-18 15:43:54 +01:00
miconis f32eae5ce9 implementation of the spark action for the simrel creation 2020-03-18 14:27:49 +01:00
Claudio Atzori c7e0730720 compress the output produced by migration steps 1 and 2 2020-03-18 09:34:57 +01:00
Claudio Atzori 2f11e37602 fixed expansion of path variables 2020-03-17 19:41:07 +01:00
Claudio Atzori 2795b0b096 no need to mkdir the all_entities file 2020-03-17 17:22:14 +01:00
Claudio Atzori 19746ad308 when reuseContent, reset ${workingPath}/all_entities 2020-03-17 17:17:06 +01:00
Claudio Atzori 2f0c85eeb3 updated parameters for regular_all_steps workflow, introduced flag 'reuseContent' 2020-03-17 17:04:58 +01:00
Miriam Baglioni 67ea3cf3ed changed the way to read the file with info on resource or relation. From sequenceFile to textFile 2020-03-17 16:32:05 +01:00
Miriam Baglioni b4652d018c moved the creation of new dir to common class. 2020-03-17 16:31:24 +01:00
Claudio Atzori b8290b5851 updated parameters for regular_all_steps workflow 2020-03-17 15:45:30 +01:00
Claudio Atzori 4706f24ec5 updated parameters for regular_all_steps workflow 2020-03-17 15:23:54 +01:00
Claudio Atzori aeb01fa353 reading from newline delimited json textfiles instead of sequence files 2020-03-17 11:57:24 +01:00
Miriam Baglioni 92f4e0001d Merge branch 'bulktag' 2020-03-16 13:33:27 +01:00
Miriam Baglioni ab08a37024 Merge remote-tracking branch 'upstream/master' 2020-03-16 12:45:23 +01:00
Claudio Atzori af835f2f98 when migrating actionsets from DM cluster, populate the AtomicAction.targetValue when empty (dedup similarities) 2020-03-15 18:07:59 +01:00