BrBETA_dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Claudio Atzori	cce21eafc2	WIP: materialize graph as Hive DB, configured spark actions to include hive support]	2020-08-06 21:48:29 +02:00
Claudio Atzori	3be0e5c2cd	Merge branch 'master' into graph_db	2020-08-05 12:54:48 +02:00
Alessia Bardi	a29565ff57	code formatting	2020-08-04 12:55:27 +02:00
Alessia Bardi	01db29e208	fixes redmine issue #5846 : datacite and its different namespace declarations	2020-08-04 12:53:48 +02:00
Alessia Bardi	b4e4e5f858	do not duplicate result PIDs	2020-08-04 12:52:14 +02:00
Alessia Bardi	09a323d18d	testing a dataset from Nakala	2020-08-04 12:50:52 +02:00
Alessia Bardi	c35bf486cc	added handle among the possible PIDs	2020-08-04 12:50:12 +02:00
Claudio Atzori	f3ce97ecf9	WIP: materialize graph as Hive DB, mergeAggregatorGraphs [added workflow node to drop the DB]	2020-08-04 12:29:42 +02:00
Claudio Atzori	771bf8bcc4	WIP: materialize graph as Hive DB, mergeAggregatorGraphs	2020-08-04 12:26:09 +02:00
Claudio Atzori	0da1d2c0c9	introduced GraphFormat.DEFAULT, indicating a common value to be used across the workflows	2020-08-04 12:25:31 +02:00
Claudio Atzori	1fcc28968e	integrated changes from master	2020-08-04 10:57:44 +02:00
Claudio Atzori	da2f8af72d	adjusted MergeClaimsApplication param specs	2020-08-03 19:56:16 +02:00
Alessia Bardi	8cc067fe76	specific test for claims	2020-08-03 11:17:50 +02:00
Michele Artini	652b13abb6	Merge branch 'master' into nsprefix_blacklist	2020-07-31 07:58:37 +02:00
Claudio Atzori	cd631bb5bc	defaults fixed in the cleaning workflow forces result.publisher to NULL when result.publisher.value is empty	2020-07-30 17:03:53 +02:00
Claudio Atzori	4bbfcf1ac6	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop	2020-07-30 16:25:06 +02:00
Claudio Atzori	4ff8007518	added function to set the missing vocabulary names, used in the cleaning workflow as a pre-cleaning step	2020-07-30 16:24:39 +02:00
Michele Artini	bdece15ca0	blacklist of nsprefix	2020-07-30 16:13:38 +02:00
Sandro La Bruzzo	c97c8f0c44	implemented new oozie job to extract entities in a separate dataset	2020-07-30 12:13:58 +02:00
Sandro La Bruzzo	3010a362bc	updated changing in the workflow of provision in the phase of aggregation. Removed serialization in JSON RDD and used spark Dataset	2020-07-30 09:25:56 +02:00
Sandro La Bruzzo	487226f669	Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop	2020-07-30 09:25:39 +02:00
Sandro La Bruzzo	16ae3c9ccf	updated changing in the workflow of provision in the phase of aggregation. Removed serialization in JSON RDD and used spark Dataset	2020-07-30 09:25:32 +02:00
Claudio Atzori	9e594cf4c2	WIP: materialize graph as Hive DB, aggregator graph	2020-07-29 19:25:11 +02:00
Claudio Atzori	2dbac631c9	WIP: factoring out utilities into dhp-workflows-common	2020-07-29 13:08:20 +02:00
Michele Artini	35e6e9c064	tests	2020-07-28 12:02:15 +02:00
Claudio Atzori	56bbfdc65d	introduced parameter 'numParitions', driving the hive DB table data partitioning. Currently specified only for table 'project'	2020-07-23 08:54:10 +02:00
Sandro La Bruzzo	9ab594ccf6	fixed test	2020-07-21 10:36:21 +02:00
Claudio Atzori	ebf60020ac	map results as OPRs in case of missing //CobjCategory/@type and the vocabulary dnet:result_typologies doesn't resolve the super type	2020-07-20 19:01:10 +02:00
Claudio Atzori	32f5e466e3	imports cleanup	2020-07-20 17:42:58 +02:00
Claudio Atzori	54ac583923	code formatting	2020-07-20 17:37:08 +02:00
Claudio Atzori	124e7ce19c	in case of missing attribute //dr:CobjCategory/@type the resulttype is derived by looking up the vocabulary dnet:result_typologies with the 1st instance type available	2020-07-20 17:33:37 +02:00
Claudio Atzori	050dda223d	Merge pull request 'removed duplicated fields' (#25 ) from unique_field_in_lists into master Looks good as a temporary workaround. I agree the model could seamlessly make the distinct operation by using HashSets instead of Linked (or Array) Lists. The task to update the model in such a way is added on #9#issuecomment-1583 Thanks!	2020-07-20 12:12:50 +02:00
Claudio Atzori	e0c4cf6f7b	added parameter to drive the graph merge strategy: priority (BETA|PROD)	2020-07-20 10:48:01 +02:00
Claudio Atzori	94ccdb4852	Merge branch 'master' into merge_graph	2020-07-20 10:14:55 +02:00
Michele Artini	331a3cbdd0	fixed originalId	2020-07-20 09:50:29 +02:00
Sandro La Bruzzo	9116d75b3e	Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop	2020-07-17 18:01:30 +02:00
Miriam Baglioni	47c7122773	changed priority from beta to production	2020-07-17 12:56:35 +02:00
Michele Artini	442f30930c	removed duplicated fields	2020-07-17 12:25:36 +02:00
Claudio Atzori	1781609508	code formatting	2020-07-16 19:06:56 +02:00
Claudio Atzori	878f2b931c	Merge branch 'master' into merge_graph	2020-07-16 16:34:24 +02:00
Claudio Atzori	31071e363f	Merge branch 'provision_indexing'	2020-07-10 19:03:57 +02:00
Claudio Atzori	cc77446dc4	added dbSchema parameter to the raw_db workflow	2020-07-10 19:01:50 +02:00
Michele Artini	e1ae964bc4	stats	2020-07-10 16:12:08 +02:00
Sandro La Bruzzo	c01efed79b	Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop	2020-07-10 14:44:57 +02:00
Sandro La Bruzzo	a7d3977481	added generation of EBI Dataset	2020-07-10 14:44:50 +02:00
Claudio Atzori	67e1d222b6	bulk cleaning when found null or empty, sets bestaccessrights evaluating the result instances	2020-07-08 17:53:35 +02:00
Claudio Atzori	610d377d57	first implementation of the BETA & PROD graphs merge procedure	2020-07-08 16:54:26 +02:00
Claudio Atzori	ed1c7e5d75	fixed workflow for the import of the claims alone	2020-07-02 12:40:21 +02:00
Sandro La Bruzzo	1d420eedb4	added generation of EBI Dataset	2020-07-02 12:37:43 +02:00
Claudio Atzori	e4a29a4513	fixed workflow for the import of the claims alone	2020-07-02 12:36:33 +02:00

297 Commits