dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Claudio Atzori	ff72fcd91a	allow orcid_pending to be percolated to the XML graph serialization	2020-12-09 19:04:50 +01:00
Claudio Atzori	211aa04726	allow orcid_pending to be percolated to the XML graph serialization	2020-12-09 18:08:51 +01:00
Claudio Atzori	026ad40633	disabled test	2020-12-07 13:50:01 +01:00
Claudio Atzori	cfb55effd9	code formatting	2020-12-02 11:23:49 +01:00
Alessia Bardi	2d15667b4a	testing XML generation from json object (case AMS ACTA)	2020-12-02 10:16:26 +01:00
Claudio Atzori	d48f388fb2	Merge branch 'provision_indexing'	2020-11-19 15:59:55 +01:00
Claudio Atzori	7c9feaf9e7	project attributes removed from the XML record serialization: contactfullname, contactfax, contactphone, contactemail	2020-11-19 15:26:20 +01:00
Claudio Atzori	3f34757c63	merged from master	2020-11-19 14:34:54 +01:00
Claudio Atzori	0374d34c3e	introduced configuration param outputFormat: HDFS | SOLR	2020-11-19 10:34:28 +01:00
Claudio Atzori	5218718e8b	updated set of fields from the MDFormatDSResourceType on PROD	2020-11-18 15:00:41 +01:00
Claudio Atzori	d9e07a242b	extended XmlIndexingJob to accept an optional parameter: outputPath. When present, forces the job to write its output on the specified HDFS location	2020-11-18 14:34:55 +01:00
Claudio Atzori	29dcff0f34	spark complains about missing classes, so here they are again	2020-11-18 14:32:32 +01:00
Claudio Atzori	8177ce7939	test for XmlIndexingJob based on a local miniSolrCluster	2020-11-18 10:58:05 +01:00
Claudio Atzori	2bed29eb09	WIP: added oozie workflow for grouping graph entities by id	2020-11-13 10:05:12 +01:00
Claudio Atzori	9b0fb9e958	merged from master	2020-11-12 09:27:12 +01:00
Claudio Atzori	822971f54f	no need to filter relations in CreateRelatedEntitiesJob_phase1; replaced 'left outer' join with 'left' join in CreateRelatedEntitiesJob_phase2; cleanup;	2020-11-12 09:22:59 +01:00
Claudio Atzori	18d9aad70c	improved documentation in dhp-graph-provision	2020-11-10 11:48:55 +01:00
Claudio Atzori	58f28296ea	ProvisionConstants moved as ModelHardLimits in dhp-common and applied to truncate long abstracts (len > 150000). Further filtering for empty PID values	2020-10-30 10:56:42 +01:00
Claudio Atzori	1871d1c6f6	solve error java.lang.NoSuchFieldError: INSTANCE when instantiating Solr client	2020-08-14 11:18:30 +02:00
Claudio Atzori	3a11a387a9	data provision workflow enhancement: added nodes to perform DELETE BY QUERY before the indexing begins and COMMIT after the indexing is completed	2020-08-03 14:28:08 +02:00
Claudio Atzori	cc5d13da85	introduced parameter shouldIndex (true|false)	2020-07-16 13:46:39 +02:00
Claudio Atzori	b098cc3cbe	avoid repeating identical values for fields: source, description	2020-07-16 13:45:53 +02:00
Claudio Atzori	7d6e269b40	reverted CreateRelatedEntitiesJob_phase1 to its previous state	2020-07-13 22:54:04 +02:00
Claudio Atzori	8e97598eb4	avoid NPE in case of null instances	2020-07-13 20:46:14 +02:00
Claudio Atzori	06c1913062	added different limits for grouping by source and by target, incremented spark.sql.shuffle.partitions for the join operations	2020-07-10 19:03:33 +02:00
Claudio Atzori	4c3836f62e	materialize the related entities before joining them	2020-07-10 19:00:44 +02:00
Claudio Atzori	b21866a2da	allow setting different relations cut points by source and by target; adjusted weight assigned to relationship types	2020-07-10 13:59:48 +02:00
Claudio Atzori	ff4d6214f1	experimenting with pruning of relations	2020-07-10 10:06:41 +02:00
Claudio Atzori	b383ed42fa	pass optional parameter relationFilter to the PrepareRelationJob implementation	2020-07-07 14:21:28 +02:00
Claudio Atzori	d380b85246	unit test for the preparation of the relations	2020-07-02 12:42:13 +02:00
Claudio Atzori	7817338e05	added test to verify the relation pre-processing	2020-06-26 17:58:33 +02:00
Claudio Atzori	8d59fdf34e	WIP: dataset based PrepareRelationsJob	2020-06-26 14:32:58 +02:00
Claudio Atzori	216975c4ec	restored complete provision workflow	2020-06-25 12:55:52 +02:00
Claudio Atzori	93f627ea51	code formatting	2020-06-25 12:54:21 +02:00
Claudio Atzori	e62333192c	WIP: prepare relation job	2020-06-25 12:22:18 +02:00
Claudio Atzori	6933ec11fb	WIP: prepare relation job	2020-06-25 11:04:12 +02:00
Sandro La Bruzzo	a6c0faac70	added test to verify secondary sorting	2020-06-25 10:48:15 +02:00
Claudio Atzori	69b0391708	WIP: prepare relation job	2020-06-25 10:19:56 +02:00
Claudio Atzori	46e76affeb	WIP: prepare relation job	2020-06-24 19:01:15 +02:00
Claudio Atzori	0e723d378b	added default from vocab for missing instance.refereed; remove spurious prefixes from orcid values; WIP: prepare relation job	2020-06-24 18:34:42 +02:00
Claudio Atzori	463489f59f	code formatting	2020-06-12 12:03:25 +02:00
Claudio Atzori	4bcad1c9c3	Merge branch 'graph_cleaning'	2020-06-12 11:40:25 +02:00
Alessia Bardi	e79943965b	Fixes #5604 : field oamandatepublications in XML	2020-06-11 12:49:31 +02:00
Claudio Atzori	67c7b31ba6	Merge branch 'master' into graph_cleaning	2020-06-10 15:00:35 +02:00
Claudio Atzori	ce12f236bb	disabled test, need to update the joined_entity.json file	2020-06-09 20:07:36 +02:00
Claudio Atzori	a2fdf85ba1	WIP: graph cleaner implementation	2020-06-09 19:52:53 +02:00
Claudio Atzori	05f269a1c0	kryo based parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob	2020-06-01 00:32:42 +02:00
Claudio Atzori	6f5f498c78	restored common properties driving executor-cores and executor-memory in join_organization_relations wf node	2020-05-29 11:22:00 +02:00
Claudio Atzori	b2f9564f13	WIP: fixed PrepareRelationsJob; parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob	2020-05-29 10:58:15 +02:00
Claudio Atzori	a57965a3ea	limiting the dimensions of outliers	2020-05-28 17:36:37 +02:00
Claudio Atzori	821be1f8b6	experimental implementation of custom aggregation using kryo encoders	2020-05-28 13:53:13 +02:00
Claudio Atzori	83504ecace	limiting the maximum number of authors allowed in XML records to MAX_AUTHORS = 200; authors with ORCID can exceed that limit	2020-05-28 13:52:30 +02:00
Claudio Atzori	ef11593068	JoinedEntity.links defined as empty list by default	2020-05-28 13:50:44 +02:00
Claudio Atzori	5dea155a87	increased number of partitions produced by the join_all_entities phase as well as spark.sql.shuffle.partitions in adjancency_lists phase	2020-05-28 13:49:59 +02:00
Claudio Atzori	fdd54bad1c	code formatting	2020-05-27 19:31:54 +02:00
Claudio Atzori	cfd753217c	repartition the join_entities in 24k files	2020-05-27 12:44:01 +02:00
Claudio Atzori	2f1a623d09	sync from master branch	2020-05-27 12:39:58 +02:00
Claudio Atzori	9e4ec1543b	updated test	2020-05-27 12:38:42 +02:00
Claudio Atzori	8047d16dd9	added RDD based adjacency list creation procedure	2020-05-27 12:38:12 +02:00
Claudio Atzori	f057dcdf65	limit the max number of externalreferences to MAX_EXTERNAL_ENTITIES	2020-05-27 12:37:33 +02:00
Claudio Atzori	4e36d689dd	fixed XML serialization for children sub-elements (duplicates & externalreferences)	2020-05-26 18:30:40 +02:00
Claudio Atzori	b8e541a454	fixing repeated organization.websiteurl in organization entities (#5645 ) as well as project.ecinternationalorganizationeurinterests	2020-05-26 10:30:09 +02:00
Claudio Atzori	925d933204	making XmlRecordFactory immune to graph encoding changes (mostly to avoid NPEs)	2020-05-22 08:50:44 +02:00
Claudio Atzori	b33dd58be4	replaced parameter 'reuseRecords' with 'resumeFrom', allowing to restart the provision workflow execution from any step, useful for manual submissions or debugging	2020-05-22 08:50:06 +02:00
Claudio Atzori	dbfb9c19fe	minor changes	2020-05-21 10:00:14 +02:00
Claudio Atzori	d7d2a0637f	added extra parameters to the provision indexing workflow	2020-05-20 14:55:38 +02:00
Claudio Atzori	0bdfbb0a57	reintroduced RDD based relation cut off procedure	2020-05-19 15:02:21 +02:00
Claudio Atzori	8c67073a07	force speculative execution to false	2020-05-08 09:42:21 +02:00
Claudio Atzori	bac37b3973	fixed children expansion in XML records	2020-05-04 11:51:17 +02:00
Claudio Atzori	077ccd8743	stats wf properties cleanup	2020-05-04 11:41:46 +02:00
Claudio Atzori	439c6255a2	cleanup	2020-04-29 19:09:07 +02:00
Claudio Atzori	6f5b899038	reformatted code according to the updated style descriptor	2020-04-28 11:23:29 +02:00
Claudio Atzori	a0bdbacdae	switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin	2020-04-27 14:52:31 +02:00
Claudio Atzori	7a3f8085f7	switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin	2020-04-27 14:45:40 +02:00
Claudio Atzori	1e7583c5a6	filtered invisible records in data provision workflow	2020-04-23 07:51:34 +02:00
Claudio Atzori	0b55795d4d	small adjustments in the provisioning workflow	2020-04-21 16:15:04 +02:00
Claudio Atzori	d772d967aa	restored changes from master branch	2020-04-20 18:53:06 +02:00
miconis	4da13e4570	Revert "Merge branch 'master' into deduptesting" This reverts commit `772f75d167`, reversing changes made to `5f45f2c77f`.	2020-04-20 16:04:49 +02:00
Claudio Atzori	d714bfb4d4	collectedfrom field moved in common parent class Oaf.java	2020-04-20 12:25:19 +02:00
Claudio Atzori	ad7a131b18	introduced common project code formatting plugin, works on the commit hook, based on https://github.com/Cosium/git-code-format-maven-plugin , applied to each java class in the project	2020-04-18 12:42:58 +02:00
Claudio Atzori	d74e128aa6	Utility classes moved in dhp-common and dhp-schemas	2020-04-07 11:56:22 +02:00
Claudio Atzori	1a1a026a18	we do expect to find field bestaccessright already defined. No need to add it again	2020-04-07 08:55:33 +02:00
Claudio Atzori	fbdd18a96b	using dataset based relation preparation procedure	2020-04-07 08:54:39 +02:00
Claudio Atzori	77f59b1b10	dataset based provision WIP	2020-04-06 19:37:27 +02:00
Claudio Atzori	e355961997	dataset based provision WIP	2020-04-06 17:34:25 +02:00
Claudio Atzori	ca345aaad3	dataset based provision WIP	2020-04-06 15:33:31 +02:00
Claudio Atzori	c8f4b95464	dataset based provision WIP	2020-04-06 08:59:58 +02:00
Claudio Atzori	eb2f5f3198	dataset based provision WIP	2020-04-04 17:41:31 +02:00
Claudio Atzori	3d1b637cab	dataset based provision WIP	2020-04-04 14:03:43 +02:00
Claudio Atzori	24b2c9012e	dataset based provision WIP	2020-04-02 18:44:09 +02:00
Claudio Atzori	daa26acc9d	dataset based provision WIP, fixed spark2EventLogDir	2020-04-02 16:15:50 +02:00
Claudio Atzori	9c7092416a	dataset based provision WIP	2020-04-01 19:07:30 +02:00
Claudio Atzori	1402eb1fe7	cleanup	2020-04-01 15:38:50 +02:00
Claudio Atzori	adcdd2d05e	WIP: reimplementing the adjacency list construction process using spark Datasets	2020-04-01 14:56:57 +02:00
Claudio Atzori	0fbec69b82	use oozie prepare statement to cleanup working directories	2020-03-30 19:48:41 +02:00
Claudio Atzori	f3f9affd49	allow dynamic executors to build XML records	2020-03-30 13:12:11 +02:00
Claudio Atzori	2e2d4c4c68	adjusted path to template resource	2020-03-30 13:11:49 +02:00
Claudio Atzori	673e744649	moved openaire specific implementations under dedicated package eu.dnetlib.dhp.oa	2020-03-27 10:42:17 +01:00
Michele Artini	ebe45003d9	fixed some junit packages	2020-03-25 16:45:03 +01:00
Claudio Atzori	abe8fb69a2	added global properties, moved postprocessing script inside the oozie_app directory	2020-03-18 15:43:54 +01:00
Claudio Atzori	aeb01fa353	reading from newline delimited json textfiles instead of sequence files	2020-03-17 11:57:24 +01:00
Claudio Atzori	a3f184fd3f	added field websiteurl in related organizations	2020-03-10 17:08:58 +01:00
Claudio Atzori	0e95544495	fixed serialization for datasource subjects	2020-03-10 17:07:44 +01:00
Claudio Atzori	5e342a555c	no need to compute the inverse relClass, fixed text() in xpath expressions	2020-03-05 12:51:48 +01:00
Claudio Atzori	6ec04d4e02	specified column used to perform the join operation in the javadoc	2020-03-05 12:50:38 +01:00
Claudio Atzori	1e563bc15e	introduced distinct properties driving the resource usage for the XML record creation and the indexing phase	2020-03-04 10:55:11 +01:00
Claudio Atzori	bc7cfd5975	indexing workflow WIP: fixed projects fundingtree xml conversion, prioritized links between results and projects when limiting them to 100 in the join procedure	2020-03-02 17:03:07 +01:00
Claudio Atzori	60bc2b1a20	drop the hive DB before populating it from scratch	2020-02-27 10:10:55 +01:00
Claudio Atzori	6a73fd5da5	in order to reuse the same XmlRecordFactory across different tasks, the state of contexts must be one per record built	2020-02-21 09:17:19 +01:00
Claudio Atzori	33185fd0b7	ISLookupClientFactory moved in dhp-common	2020-02-19 16:56:38 +01:00
Claudio Atzori	ed76521d9b	removed stale test resources, will be re-added later on	2020-02-18 11:51:08 +01:00
Claudio Atzori	0f364605ff	removed stale tests, need to reimplement them anyway	2020-02-18 11:48:19 +01:00
Claudio Atzori	56d1810a66	working procedure for records indexing using Spark, via lib com.lucidworks.spark:spark-solr	2020-02-14 12:28:52 +01:00
Claudio Atzori	1fee6e2b7e	implemented XML records construction and serialization, indexing WIP	2020-02-13 16:53:27 +01:00
Claudio Atzori	7ba0f44d05	WIP	2020-01-30 18:21:07 +01:00
Claudio Atzori	49ef2f4eb1	removed input parameter specification, SparkXmlRecordBuilderJob doesn't need hive	2020-01-30 18:20:26 +01:00
Claudio Atzori	b5e1e2e5b2	reintegrated changes from `fcbc4ccd70`	2020-01-30 18:11:04 +01:00
Claudio Atzori	7bacd6812e	Merge branch 'provision_indexing' of https://code-repo.d4science.org/D-Net/dnet-hadoop into HEAD Conflicts: dhp-workflows/dhp-graph-provision/src/main/java/eu/dnetlib/dhp/graph/GraphJoiner.java dhp-workflows/dhp-graph-provision/src/main/java/eu/dnetlib/dhp/graph/MappingUtils.java dhp-workflows/dhp-graph-provision/src/main/java/eu/dnetlib/dhp/graph/RelatedEntity.java dhp-workflows/dhp-graph-provision/src/main/java/eu/dnetlib/dhp/graph/SparkXmlRecordBuilderJob.java	2020-01-30 17:59:46 +01:00
Claudio Atzori	b2691a3b0a	save adjacency list as JoinedEntity	2020-01-30 17:46:29 +01:00
Claudio Atzori	8c2aff99b0	joining entities using T x R x S, WIP: last representation based on LinkedEntity type	2020-01-29 15:40:33 +01:00
Claudio Atzori	fcbc4ccd70	a bit of docs doesn't hurt	2020-01-24 08:43:23 +01:00
Claudio Atzori	a55f5fecc6	joining entities using T x R x S method with groupByKey, WIP: making target objects (T) have lower memory footprint	2020-01-24 08:17:53 +01:00
Claudio Atzori	799929c1e3	joining entities using T x R x S method with groupByKey	2020-01-21 16:35:44 +01:00
Claudio Atzori	97c239ee0d	WIP: trying to find a way to build the records for the index	2020-01-16 12:02:28 +02:00
Claudio Atzori	7ba586d2e5	oozie workflow aimed to build the adjacency lists representation of the graph, needed to build the records to be indexed	2019-12-17 16:24:49 +01:00


225 Commits