BrBETA_dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Claudio Atzori	3be0e5c2cd	Merge branch 'master' into graph_db	2020-08-05 12:54:48 +02:00
Alessia Bardi	a29565ff57	code formatting	2020-08-04 12:55:27 +02:00
Alessia Bardi	01db29e208	fixes redmine issue #5846 : datacite and its different namespace declarations	2020-08-04 12:53:48 +02:00
Alessia Bardi	b4e4e5f858	do not duplicate result PIDs	2020-08-04 12:52:14 +02:00
Claudio Atzori	771bf8bcc4	WIP: materialize graph as Hive DB, mergeAggregatorGraphs	2020-08-04 12:26:09 +02:00
Claudio Atzori	0da1d2c0c9	introduced GraphFormat.DEFAULT, indicating a common value to be used across the workflows	2020-08-04 12:25:31 +02:00
Claudio Atzori	1fcc28968e	integrated changes from master	2020-08-04 10:57:44 +02:00
Michele Artini	652b13abb6	Merge branch 'master' into nsprefix_blacklist	2020-07-31 07:58:37 +02:00
Claudio Atzori	cd631bb5bc	defaults fixed in the cleaning workflow: forces result.publisher to NULL when result.publisher.value is empty	2020-07-30 17:03:53 +02:00
Claudio Atzori	4bbfcf1ac6	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop	2020-07-30 16:25:06 +02:00
Claudio Atzori	4ff8007518	added function to set the missing vocabulary names, used in the cleaning workflow as a pre-cleaning step	2020-07-30 16:24:39 +02:00
Michele Artini	bdece15ca0	blacklist of nsprefix	2020-07-30 16:13:38 +02:00
Sandro La Bruzzo	c97c8f0c44	implemented new oozie job to extract entities in a separate dataset	2020-07-30 12:13:58 +02:00
Sandro La Bruzzo	3010a362bc	updated changing in the workflow of provision in the phase of aggregation. Removed serialization in JSON RDD and used spark Dataset	2020-07-30 09:25:56 +02:00
Sandro La Bruzzo	16ae3c9ccf	updated changing in the workflow of provision in the phase of aggregation. Removed serialization in JSON RDD and used spark Dataset	2020-07-30 09:25:32 +02:00
Claudio Atzori	9e594cf4c2	WIP: materialize graph as Hive DB, aggregator graph	2020-07-29 19:25:11 +02:00
Claudio Atzori	2dbac631c9	WIP: factoring out utilities into dhp-workflows-common	2020-07-29 13:08:20 +02:00
Claudio Atzori	56bbfdc65d	introduced parameter 'numParitions', driving the hive DB table data partitioning. Currently specified only for table 'project'	2020-07-23 08:54:10 +02:00
Claudio Atzori	ebf60020ac	map results as OPRs in case of missing //CobjCategory/@type and the vocabulary dnet:result_typologies doesn't resolve the super type	2020-07-20 19:01:10 +02:00
Claudio Atzori	32f5e466e3	imports cleanup	2020-07-20 17:42:58 +02:00
Claudio Atzori	54ac583923	code formatting	2020-07-20 17:37:08 +02:00
Claudio Atzori	124e7ce19c	in case of missing attribute //dr:CobjCategory/@type the resulttype is derived by looking up the vocabulary dnet:result_typologies with the 1st instance type available	2020-07-20 17:33:37 +02:00
Claudio Atzori	050dda223d	Merge pull request 'removed duplicated fields' (#25) from unique_field_in_lists into master. Looks good as a temporary workaround. I agree the model could seamlessly make the distinct operation by using HashSets instead of Linked (or Array) Lists. The task to update the model in such a way is added on #9#issuecomment-1583 Thanks!	2020-07-20 12:12:50 +02:00
Claudio Atzori	e0c4cf6f7b	added parameter to drive the graph merge strategy: priority (BETA|PROD)	2020-07-20 10:48:01 +02:00
Claudio Atzori	94ccdb4852	Merge branch 'master' into merge_graph	2020-07-20 10:14:55 +02:00
Michele Artini	331a3cbdd0	fixed originalId	2020-07-20 09:50:29 +02:00
Sandro La Bruzzo	9116d75b3e	Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop	2020-07-17 18:01:30 +02:00
Miriam Baglioni	47c7122773	changed priority from beta to production	2020-07-17 12:56:35 +02:00
Michele Artini	442f30930c	removed duplicated fields	2020-07-17 12:25:36 +02:00
Claudio Atzori	1781609508	code formatting	2020-07-16 19:06:56 +02:00
Claudio Atzori	878f2b931c	Merge branch 'master' into merge_graph	2020-07-16 16:34:24 +02:00
Michele Artini	e1ae964bc4	stats	2020-07-10 16:12:08 +02:00
Sandro La Bruzzo	c01efed79b	Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop	2020-07-10 14:44:57 +02:00
Sandro La Bruzzo	a7d3977481	added generation of EBI Dataset	2020-07-10 14:44:50 +02:00
Claudio Atzori	67e1d222b6	bulk cleaning when found null or empty, sets bestaccessrights evaluating the result instances	2020-07-08 17:53:35 +02:00
Claudio Atzori	610d377d57	first implementation of the BETA & PROD graphs merge procedure	2020-07-08 16:54:26 +02:00
Sandro La Bruzzo	1d420eedb4	added generation of EBI Dataset	2020-07-02 12:37:43 +02:00
Claudio Atzori	6f5771c1c9	sets author.rank when null	2020-06-25 14:06:21 +02:00
Claudio Atzori	7df2712824	Merge branch 'provision_indexing'	2020-06-25 12:22:41 +02:00
Michele Artini	abcbebcbb4	fixed generation of ids	2020-06-25 09:50:46 +02:00
Michele Artini	77d2a1b1c4	params to choose sql queries for beta or production	2020-06-25 09:28:13 +02:00
Claudio Atzori	0e723d378b	added default from vocab for missing instance.refereed; remove spurious prefixes from orcid values; WIP: prepare relation job	2020-06-24 18:34:42 +02:00
Claudio Atzori	7d416f08d8	graph cleaning workflow: set hostedby to unknown repository when defined as NULL	2020-06-22 09:50:43 +02:00
Claudio Atzori	d0ac7514b2	cleaning workflow to include cleaning of default values	2020-06-18 19:37:25 +02:00
Sandro La Bruzzo	9bf67f5de1	resolved conflicts	2020-06-17 09:15:43 +02:00
Sandro La Bruzzo	1d4275acc4	implemented first version of exportation of Scholexplorer into ActionSet	2020-06-17 09:10:38 +02:00
Claudio Atzori	5441f01586	Merge pull request 'missing landingPage urls in instances' (#22) from instances-with-landing-page into master. Looks good, thanks!	2020-06-16 15:32:44 +02:00
Claudio Atzori	2a4f65795f	WIP: graph cleaner implementation	2020-06-15 18:32:24 +02:00
Claudio Atzori	c15c8c0ad0	map datasource identities (including piwik ids) as original IDs	2020-06-15 16:07:30 +02:00
Claudio Atzori	0d52816244	WIP: graph cleaner implementation	2020-06-13 13:06:04 +02:00

199 Commits