1
0
Fork 0
Commit Graph

176 Commits

Author SHA1 Message Date
Claudio Atzori 1e869e7bed using method available from currently used library 2020-03-24 11:17:44 +01:00
Claudio Atzori 8b0ba3d76a posprocessing script correctly run as hive2 action 2020-03-23 17:40:39 +01:00
Claudio Atzori 658d40ccbe WIP trying to use hive2 actions 2020-03-23 11:14:54 +01:00
Claudio Atzori ecb64e4998 Merge branch 'migration_wfs_regular_all_steps' 2020-03-23 08:57:01 +01:00
Michele Artini 15160032bd fixed a bug setting some organization fields 2020-03-23 08:39:14 +01:00
Claudio Atzori 36236dd1c1 action migration workflow produces eu.dnetlib.dhp.schema.action.AtomicAction(s) 2020-03-19 14:00:38 +01:00
Claudio Atzori a0ab15a64c need to stick on using guava:11.0.2 as it is the version used by the hadoop components (oozie client for sure). The last version (28.2-jre) breaks the oozie workflow submission 2020-03-19 13:58:58 +01:00
Claudio Atzori abe8fb69a2 added global properties, moved postprocessing script inside the oozie_app directory 2020-03-18 15:43:54 +01:00
Claudio Atzori c7e0730720 compress the output produced by migration steps 1 and 2 2020-03-18 09:34:57 +01:00
Claudio Atzori 2f11e37602 fixed expansion of path variables 2020-03-17 19:41:07 +01:00
Claudio Atzori 2795b0b096 no need to mkdir a the all_entities file 2020-03-17 17:22:14 +01:00
Claudio Atzori 19746ad308 when reuseContent, reset ${workingPath}/all_entities 2020-03-17 17:17:06 +01:00
Claudio Atzori 2f0c85eeb3 updated parameters for regular_all_steps worfklow, introduced flag 'reuseContent' 2020-03-17 17:04:58 +01:00
Claudio Atzori b8290b5851 updated parameters for regular_all_steps worfklow 2020-03-17 15:45:30 +01:00
Claudio Atzori 4706f24ec5 updated parameters for regular_all_steps worfklow 2020-03-17 15:23:54 +01:00
Claudio Atzori aeb01fa353 reading from newline delimited json textfiles instead of sequence files 2020-03-17 11:57:24 +01:00
Claudio Atzori af835f2f98 when migrating actionsets from DM cluster, populate the AtomicAction.targetValue when empty (dedup similarities) 2020-03-15 18:07:59 +01:00
Claudio Atzori 9c84e21b87 added workflow to migrate latest version of each actionset content from DM to OCEAN cluster, mapping the targetValues from the old protobuf data model to the dhp.OAF datamodel 2020-03-13 15:56:52 +01:00
Claudio Atzori 8fe7ae1482 xml formatting 2020-03-13 15:53:56 +01:00
Claudio Atzori 7b6f0c8756 reading graph dump as text files, encoded as newline-delimited JSON records, as indicated in the wiki 2020-03-10 17:19:17 +01:00
Claudio Atzori 60aedb1110 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-10 17:09:44 +01:00
Claudio Atzori a3f184fd3f added field websiteurl in related organizations 2020-03-10 17:08:58 +01:00
Claudio Atzori 0e95544495 fixed serialization for datasource subjects 2020-03-10 17:07:44 +01:00
Michele Artini b6efa9d6ab Configuration of the SequenceFile Writer 2020-03-05 15:49:14 +01:00
Claudio Atzori 5e342a555c no need to compute the inverse relClass, fixed text() in xpath expressions 2020-03-05 12:51:48 +01:00
Claudio Atzori 6ec04d4e02 specified column used to perform the join operation in the javadoc 2020-03-05 12:50:38 +01:00
Michele Artini 7a2a466161 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-03-04 14:50:59 +01:00
Michele Artini 755eade2fb fix creation ids 2020-03-04 14:49:45 +01:00
Claudio Atzori 6379f32466 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-04 10:57:06 +01:00
Claudio Atzori 0233987603 introduced post processing step following the hive DB creation/population 2020-03-04 10:56:50 +01:00
Claudio Atzori 1e563bc15e introduced distinct properties driving the resouce usage for the XML record creation and the indexing phase 2020-03-04 10:55:11 +01:00
Claudio Atzori 9af3e904be close the SparkSession at the end 2020-03-04 10:53:31 +01:00
Michele Artini e7167b996a logs and closeable 2020-03-04 10:46:36 +01:00
Claudio Atzori 25ceec29ab code formatting 2020-03-04 10:44:24 +01:00
Claudio Atzori 63c00c5e88 fixed typo 2020-03-04 10:43:44 +01:00
Claudio Atzori 9cf5ce2e66 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-03-02 17:03:10 +01:00
Claudio Atzori bc7cfd5975 indexing workflow WIP: fixed projects fundingtree xml conversion, prioritized links between results and projects when limiting them to 100 in the join procedure 2020-03-02 17:03:07 +01:00
Michele Artini 4b29a121b0 migration using spark in step2 2020-03-02 16:12:14 +01:00
Michele Artini 5445a57102 migration using spark in step2 2020-03-02 16:11:59 +01:00
Claudio Atzori 60bc2b1a20 drop the hive DB before populating it from scratch 2020-02-27 10:10:55 +01:00
Michele Artini 689908b2e9 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-02-25 16:00:51 +01:00
Michele Artini 93665773ea Fixed a problem with JavaRDD Union 2020-02-25 15:59:21 +01:00
Claudio Atzori 6a73fd5da5 in order to reuse the same XmlRecordFactory across different tasks, the state of contexts must be one per record built 2020-02-21 09:17:19 +01:00
Michele Artini d49cd2fdc6 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-02-20 11:21:54 +01:00
Claudio Atzori 5e5e32cb48 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-02-19 16:56:52 +01:00
Claudio Atzori 33185fd0b7 ISLookupClientFactory moved in dhp-common 2020-02-19 16:56:38 +01:00
Michele Artini 5d3739b5cf migration of claims 2020-02-19 15:11:17 +01:00
Michele Artini 173f1df1e5 saved a query for openaire production database 2020-02-19 10:15:08 +01:00
Sandro La Bruzzo 9a2d74ac82 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-02-19 10:13:45 +01:00
Sandro La Bruzzo e5d7cdf422 fixed sql query 2020-02-19 10:13:36 +01:00