dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Alessia Bardi	09fc7e2f78	serialization of validated flag on relationships	2021-02-10 11:22:09 +01:00
Claudio Atzori	0374d34c3e	introduced configuration param outputFormat: HDFS | SOLR	2020-11-19 10:34:28 +01:00
Claudio Atzori	d9e07a242b	extended XmlIndexingJob to accept an optional parameter: outputPath. When present, forces the job to write its output on the specified HDFS location	2020-11-18 14:34:55 +01:00
Claudio Atzori	1871d1c6f6	solve error java.lang.NoSuchFieldError: INSTANCE when instantiating Solr client	2020-08-14 11:18:30 +02:00
Claudio Atzori	3a11a387a9	data provision workflow enhancement: added nodes to perform DELETE BY QUERY before the indexing begins and COMMIT after the indexing is completed	2020-08-03 14:28:08 +02:00
Claudio Atzori	cc5d13da85	introduced parameter shouldIndex (true|false)	2020-07-16 13:46:39 +02:00
Claudio Atzori	06c1913062	added different limits for grouping by source and by target, incremented spark.sql.shuffle.partitions for the join operations	2020-07-10 19:03:33 +02:00
Claudio Atzori	b21866a2da	allow setting different relation cut points by source and by target; adjusted weight assigned to relationship types	2020-07-10 13:59:48 +02:00
Claudio Atzori	b383ed42fa	pass optional parameter relationFilter to the PrepareRelationJob implementation	2020-07-07 14:21:28 +02:00
Claudio Atzori	7817338e05	added test to verify the relation pre-processing	2020-06-26 17:58:33 +02:00
Claudio Atzori	216975c4ec	restored complete provision workflow	2020-06-25 12:55:52 +02:00
Claudio Atzori	0e723d378b	added default from vocab for missing instance.refereed; remove spurious prefixes from orcid values; WIP: prepare relation job	2020-06-24 18:34:42 +02:00
Claudio Atzori	05f269a1c0	kryo based parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob	2020-06-01 00:32:42 +02:00
Claudio Atzori	6f5f498c78	restored common properties driving executor-cores and executor-memory in join_organization_relations wf node	2020-05-29 11:22:00 +02:00
Claudio Atzori	b2f9564f13	WIP: fixed PrepareRelationsJob; parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob	2020-05-29 10:58:15 +02:00
Claudio Atzori	5dea155a87	increased number of partitions produced by the join_all_entities phase as well as spark.sql.shuffle.partitions in adjancency_lists phase	2020-05-28 13:49:59 +02:00
Claudio Atzori	cfd753217c	repartition the join_entities in 24k files	2020-05-27 12:44:01 +02:00
Claudio Atzori	4e36d689dd	fixed XML serialization for children sub-elements (duplicates & externalreferences)	2020-05-26 18:30:40 +02:00
Claudio Atzori	b33dd58be4	replaced parameter 'reuseRecords' with 'resumeFrom', allowing to restart the provision workflow execution from any step, useful for manual submissions or debugging	2020-05-22 08:50:06 +02:00
Claudio Atzori	dbfb9c19fe	minor changes	2020-05-21 10:00:14 +02:00
Claudio Atzori	d7d2a0637f	added extra parameters to the provision indexing workflow	2020-05-20 14:55:38 +02:00
Claudio Atzori	0bdfbb0a57	reintroduced RDD based relation cut off procedure	2020-05-19 15:02:21 +02:00
Claudio Atzori	8c67073a07	force speculative execution to false	2020-05-08 09:42:21 +02:00
Claudio Atzori	bac37b3973	fixed children expansion in XML records	2020-05-04 11:51:17 +02:00
Claudio Atzori	0b55795d4d	small adjustments in the provisioning workflow	2020-04-21 16:15:04 +02:00
Claudio Atzori	77f59b1b10	dataset based provision WIP	2020-04-06 19:37:27 +02:00
Claudio Atzori	ca345aaad3	dataset based provision WIP	2020-04-06 15:33:31 +02:00
Claudio Atzori	c8f4b95464	dataset based provision WIP	2020-04-06 08:59:58 +02:00
Claudio Atzori	eb2f5f3198	dataset based provision WIP	2020-04-04 17:41:31 +02:00
Claudio Atzori	3d1b637cab	dataset based provision WIP	2020-04-04 14:03:43 +02:00
Claudio Atzori	24b2c9012e	dataset based provision WIP	2020-04-02 18:44:09 +02:00
Claudio Atzori	daa26acc9d	dataset based provision WIP, fixed spark2EventLogDir	2020-04-02 16:15:50 +02:00
Claudio Atzori	9c7092416a	dataset based provision WIP	2020-04-01 19:07:30 +02:00
Claudio Atzori	adcdd2d05e	WIP: reimplementing the adjacency list construction process using spark Datasets	2020-04-01 14:56:57 +02:00
Claudio Atzori	0fbec69b82	use oozie prepare statement to cleanup working directories	2020-03-30 19:48:41 +02:00
Claudio Atzori	f3f9affd49	allow dynamic executors to build XML records	2020-03-30 13:12:11 +02:00
Claudio Atzori	673e744649	moved openaire specific implementations under dedicated package eu.dnetlib.dhp.oa	2020-03-27 10:42:17 +01:00
Claudio Atzori	abe8fb69a2	added global properties, moved postprocessing script inside the oozie_app directory	2020-03-18 15:43:54 +01:00
Claudio Atzori	1e563bc15e	introduced distinct properties driving the resource usage for the XML record creation and the indexing phase	2020-03-04 10:55:11 +01:00
Claudio Atzori	bc7cfd5975	indexing workflow WIP: fixed projects fundingtree xml conversion, prioritized links between results and projects when limiting them to 100 in the join procedure	2020-03-02 17:03:07 +01:00
Claudio Atzori	56d1810a66	working procedure for records indexing using Spark, via lib com.lucidworks.spark:spark-solr	2020-02-14 12:28:52 +01:00
Claudio Atzori	1fee6e2b7e	implemented XML records construction and serialization, indexing WIP	2020-02-13 16:53:27 +01:00
Claudio Atzori	49ef2f4eb1	removed input parameter specification, SparkXmlRecordBuilderJob doesn't need hive	2020-01-30 18:20:26 +01:00
Claudio Atzori	b2691a3b0a	save adjacency list as JoinedEntity	2020-01-30 17:46:29 +01:00
Claudio Atzori	799929c1e3	joining entities using T x R x S method with groupByKey	2020-01-21 16:35:44 +01:00
Claudio Atzori	97c239ee0d	WIP: trying to find a way to build the records for the index	2020-01-16 12:02:28 +02:00
Claudio Atzori	7ba586d2e5	oozie workflow aimed to build the adjacency lists representation of the graph, needed to build the records to be indexed	2019-12-17 16:24:49 +01:00

47 Commits