dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Sandro La Bruzzo	16ae3c9ccf	updated changing in the workflow of provision in the phase of aggregation. Removed serialization in JSON RDD and used spark Dataset	2020-07-30 09:25:32 +02:00
Claudio Atzori	56bbfdc65d	introduced parameter 'numParitions', driving the hive DB table data partitioning. Currently specified only for table 'project'	2020-07-23 08:54:10 +02:00
Sandro La Bruzzo	9ab594ccf6	fixed test	2020-07-21 10:36:21 +02:00
Claudio Atzori	ebf60020ac	map results as OPRs in case of missing //CobjCategory/@type and the vocabulary dnet:result_typologies doesn't resolve the super type	2020-07-20 19:01:10 +02:00
Claudio Atzori	32f5e466e3	imports cleanup	2020-07-20 17:42:58 +02:00
Claudio Atzori	54ac583923	code formatting	2020-07-20 17:37:08 +02:00
Claudio Atzori	124e7ce19c	in case of missing attribute //dr:CobjCategory/@type the resulttype is derived by looking up the vocabulary dnet:result_typologies with the 1st instance type available	2020-07-20 17:33:37 +02:00
Claudio Atzori	050dda223d	Merge pull request 'removed duplicated fields' (#25) from unique_field_in_lists into master Looks good as a temporary workaround. I agree the model could seamlessly make the distinct operation by using HashSets instead of Linked (or Array) Lists. The task to update the model in such a way is added on #9#issuecomment-1583 Thanks!	2020-07-20 12:12:50 +02:00
Claudio Atzori	e0c4cf6f7b	added parameter to drive the graph merge strategy: priority (BETA|PROD)	2020-07-20 10:48:01 +02:00
Claudio Atzori	94ccdb4852	Merge branch 'master' into merge_graph	2020-07-20 10:14:55 +02:00
Claudio Atzori	0937c9998f	Merge branch 'deduptesting'	2020-07-20 10:00:20 +02:00
Claudio Atzori	de72b1c859	cleanup	2020-07-20 09:59:11 +02:00
Michele Artini	331a3cbdd0	fixed originalId	2020-07-20 09:50:29 +02:00
Michele Artini	c59c5369b1	Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop	2020-07-18 09:40:54 +02:00
Michele Artini	346a1d2b5a	update eventId generator	2020-07-18 09:40:36 +02:00
Sandro La Bruzzo	9116d75b3e	Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop	2020-07-17 18:01:30 +02:00
Miriam Baglioni	47c7122773	changed priority from beta to production	2020-07-17 12:56:35 +02:00
Michele Artini	442f30930c	removed duplicated fields	2020-07-17 12:25:36 +02:00
Claudio Atzori	1781609508	code formatting	2020-07-16 19:06:56 +02:00
Claudio Atzori	db8b90a156	renamed CORE -> BETA	2020-07-16 19:05:13 +02:00
Claudio Atzori	878f2b931c	Merge branch 'master' into merge_graph	2020-07-16 16:34:24 +02:00
Claudio Atzori	cc5d13da85	introduced parameter shouldIndex (true|false)	2020-07-16 13:46:39 +02:00
Claudio Atzori	b098cc3cbe	avoid repeating identical values for fields: source, description	2020-07-16 13:45:53 +02:00
Claudio Atzori	805de4eca1	fix: filter the blocks with size = 1	2020-07-16 10:11:32 +02:00
Claudio Atzori	4b9fb2ffb8	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop	2020-07-15 11:26:04 +02:00
Claudio Atzori	b90389bac4	code formatting	2020-07-15 11:24:48 +02:00
Claudio Atzori	4e6f46e8fa	filter blocks with one record only	2020-07-15 11:22:20 +02:00
Michele Artini	262c29463e	relations with multiple datasources	2020-07-15 09:18:40 +02:00
Claudio Atzori	7d6e269b40	reverted CreateRelatedEntitiesJob_phase1 to its previous state	2020-07-13 22:54:04 +02:00
Claudio Atzori	8e97598eb4	avoid to NPE in case of null instances	2020-07-13 20:46:14 +02:00
Claudio Atzori	06def0c0cb	SparkBlockStats allows to repartition the input rdd via the numPartitions workflow parameter	2020-07-13 20:09:06 +02:00
miconis	b52c246aed	merge done	2020-07-13 19:57:02 +02:00
miconis	b8a45041fd	minor changes	2020-07-13 19:53:18 +02:00
Claudio Atzori	66f9f6d323	adjusted parameters for the dedup stats workflow	2020-07-13 19:26:46 +02:00
miconis	03ecfa5ebd	implementation of the test class for the new block stats spark action	2020-07-13 18:48:23 +02:00
miconis	10e08ccf45	Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop	2020-07-13 18:22:45 +02:00
miconis	9258e4f095	implementation of a new workflow to compute statistics on the blocks	2020-07-13 18:22:34 +02:00
Claudio Atzori	c6f6fb0f28	code formatting	2020-07-13 16:46:13 +02:00
Claudio Atzori	8d2102d7d2	Merge branch 'deduptesting'	2020-07-13 16:32:43 +02:00
Claudio Atzori	344a90c2e6	updated assertions in propagateRelationTest	2020-07-13 16:32:04 +02:00
Claudio Atzori	1143f426aa	WIP SparkCreateMergeRels distinct relations	2020-07-13 16:13:36 +02:00
Claudio Atzori	8c67938ad0	configurable number of partitions used in the SparkCreateSimRels phase	2020-07-13 16:07:07 +02:00
Claudio Atzori	c73168b18e	Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting	2020-07-13 15:54:58 +02:00
Claudio Atzori	c8284bab06	WIP SparkCreateMergeRels distinct relations	2020-07-13 15:54:51 +02:00
Sandro La Bruzzo	1d133b7fe6	update test	2020-07-13 15:52:41 +02:00
Michele Artini	3635d05061	poms	2020-07-13 15:52:23 +02:00
Claudio Atzori	7dd91edf43	parsing of optional parameter	2020-07-13 15:40:41 +02:00
Claudio Atzori	4c101a9d66	WIP SparkCreateMergeRels distinct relations	2020-07-13 15:31:38 +02:00
Claudio Atzori	8a612d861a	WIP SparkCreateMergeRels distinct relations	2020-07-13 15:30:57 +02:00
Sandro La Bruzzo	9ef2385022	implemented test for cut of connected component	2020-07-13 15:28:17 +02:00

1167 Commits