dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Claudio Atzori	5dea155a87	increased number of partitions produced by the join_all_entities phase as well as spark.sql.shuffle.partitions in adjancency_lists phase	2020-05-28 13:49:59 +02:00
Claudio Atzori	cfd753217c	repartition the join_entities in 24k files	2020-05-27 12:44:01 +02:00
Claudio Atzori	4e36d689dd	fixed XML serialization for children sub-elements (duplicates & externalreferences)	2020-05-26 18:30:40 +02:00
Claudio Atzori	b33dd58be4	replaced parameter 'reuseRecords' with 'resumeFrom', which allows restarting the provision workflow execution from any step; useful for manual submissions or debugging	2020-05-22 08:50:06 +02:00
Claudio Atzori	dbfb9c19fe	minor changes	2020-05-21 10:00:14 +02:00
Claudio Atzori	d7d2a0637f	added extra parameters to the provision indexing workflow	2020-05-20 14:55:38 +02:00
Claudio Atzori	0bdfbb0a57	reintroduced RDD based relation cut off procedure	2020-05-19 15:02:21 +02:00
Claudio Atzori	8c67073a07	force speculative execution to false	2020-05-08 09:42:21 +02:00
Claudio Atzori	bac37b3973	fixed children expansion in XML records	2020-05-04 11:51:17 +02:00
Claudio Atzori	0b55795d4d	small adjustments in the provisioning workflow	2020-04-21 16:15:04 +02:00
Claudio Atzori	77f59b1b10	dataset based provision WIP	2020-04-06 19:37:27 +02:00
Claudio Atzori	ca345aaad3	dataset based provision WIP	2020-04-06 15:33:31 +02:00
Claudio Atzori	c8f4b95464	dataset based provision WIP	2020-04-06 08:59:58 +02:00
Claudio Atzori	eb2f5f3198	dataset based provision WIP	2020-04-04 17:41:31 +02:00
Claudio Atzori	3d1b637cab	dataset based provision WIP	2020-04-04 14:03:43 +02:00
Claudio Atzori	24b2c9012e	dataset based provision WIP	2020-04-02 18:44:09 +02:00
Claudio Atzori	daa26acc9d	dataset based provision WIP, fixed spark2EventLogDir	2020-04-02 16:15:50 +02:00
Claudio Atzori	9c7092416a	dataset based provision WIP	2020-04-01 19:07:30 +02:00
Claudio Atzori	adcdd2d05e	WIP: reimplementing the adjacency list construction process using spark Datasets	2020-04-01 14:56:57 +02:00
Claudio Atzori	0fbec69b82	use oozie prepare statement to cleanup working directories	2020-03-30 19:48:41 +02:00
Claudio Atzori	f3f9affd49	allow dynamic executors to build XML records	2020-03-30 13:12:11 +02:00
Claudio Atzori	673e744649	moved openaire specific implementations under dedicated package eu.dnetlib.dhp.oa	2020-03-27 10:42:17 +01:00
Claudio Atzori	abe8fb69a2	added global properties, moved postprocessing script inside the oozie_app directory	2020-03-18 15:43:54 +01:00
Claudio Atzori	1e563bc15e	introduced distinct properties driving the resource usage for the XML record creation and the indexing phase	2020-03-04 10:55:11 +01:00
Claudio Atzori	bc7cfd5975	indexing workflow WIP: fixed projects fundingtree xml conversion, prioritized links between results and projects when limiting them to 100 in the join procedure	2020-03-02 17:03:07 +01:00
Claudio Atzori	56d1810a66	working procedure for records indexing using Spark, via lib com.lucidworks.spark:spark-solr	2020-02-14 12:28:52 +01:00
Claudio Atzori	1fee6e2b7e	implemented XML records construction and serialization, indexing WIP	2020-02-13 16:53:27 +01:00
Claudio Atzori	49ef2f4eb1	removed input parameter specification, SparkXmlRecordBuilderJob doesn't need hive	2020-01-30 18:20:26 +01:00
Claudio Atzori	b2691a3b0a	save adjacency list as JoinedEntity	2020-01-30 17:46:29 +01:00
Claudio Atzori	799929c1e3	joining entities using T x R x S method with groupByKey	2020-01-21 16:35:44 +01:00
Claudio Atzori	97c239ee0d	WIP: trying to find a way to build the records for the index	2020-01-16 12:02:28 +02:00
Claudio Atzori	7ba586d2e5	oozie workflow aimed to build the adjacency lists representation of the graph, needed to build the records to be indexed	2019-12-17 16:24:49 +01:00

32 Commits