BrBETA_dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Alessia Bardi	c35bf486cc	added handle among the possible PIDs	2020-08-04 12:50:12 +02:00
Miriam Baglioni	5b651abf82	merge branch with master	2020-08-04 10:14:07 +02:00
Miriam Baglioni	88e4c3b751	added default trust to context bulktagged	2020-08-04 10:13:25 +02:00
Miriam Baglioni	f9342cb484	added constant	2020-08-03 18:32:35 +02:00
Miriam Baglioni	96c3c891f4	added trust	2020-08-03 18:32:17 +02:00
Miriam Baglioni	53656600ad	changed XQuery to select only community and ri with status not hidden	2020-08-03 18:29:30 +02:00
Miriam Baglioni	b34177d8ef	merge upstream	2020-08-03 18:13:42 +02:00
Miriam Baglioni	901ae37f7b	added step to workflow	2020-08-03 18:12:54 +02:00
Miriam Baglioni	fa38cdb10b	added resource	2020-08-03 18:11:12 +02:00
Miriam Baglioni	e9fcc0b2f1	commented test unit - to decide change for mirroring the changed logics	2020-08-03 18:10:53 +02:00
Miriam Baglioni	e43aeb139a	added new property file and changed some parameter to old files	2020-08-03 18:07:28 +02:00
Miriam Baglioni	aa9f3d9698	changed logic for save in s3 directly	2020-08-03 18:06:18 +02:00
Miriam Baglioni	d465f0eec9	added fulltext to result	2020-08-03 18:03:27 +02:00
Miriam Baglioni	ec4b392d12	added new dependencies for writing on s3	2020-08-03 17:57:04 +02:00
Miriam Baglioni	c892c7dfa7	changed to query for community map just once and save the result for remaining executions	2020-08-03 17:56:31 +02:00
Claudio Atzori	3a11a387a9	data provision workflow enhancement: added nodes to perform DELETE BY QUERY before the indexing begins and COMMIT after the indexing is completed	2020-08-03 14:28:08 +02:00
Alessia Bardi	8cc067fe76	specific test for claims	2020-08-03 11:17:50 +02:00
Claudio Atzori	a89b6cc3ba	Merge pull request 'nsprefix_blacklist' (#34) from nsprefix_blacklist into master	2020-07-31 11:52:23 +02:00
Sandro La Bruzzo	0c3bc9ea4b	Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop	2020-07-31 09:07:18 +02:00
Sandro La Bruzzo	168bfb496a	adopted dedup to the new schema	2020-07-31 09:06:57 +02:00
Michele Artini	652b13abb6	Merge branch 'master' into nsprefix_blacklist	2020-07-31 07:58:37 +02:00
Claudio Atzori	cd631bb5bc	defaults fixed in the cleaning workflow forces result.publisher to NULL when result.publisher.value is empty	2020-07-30 17:03:53 +02:00
Miriam Baglioni	872d7783fc	-	2020-07-30 16:50:36 +02:00
Miriam Baglioni	57c87b7653	re-implemented to fix issue on not serializable Set<String> variable	2020-07-30 16:43:43 +02:00
Miriam Baglioni	ef8e5957b5	added specific directory where to save results	2020-07-30 16:42:46 +02:00
Miriam Baglioni	75f3361c85	-	2020-07-30 16:41:31 +02:00
Miriam Baglioni	3f695b25fa	refactoring	2020-07-30 16:40:15 +02:00
Miriam Baglioni	e623f12bef	refactoring	2020-07-30 16:32:59 +02:00
Miriam Baglioni	ff7d05abb4	added support class to store the couple organizationId representativeId got from sql query on hive	2020-07-30 16:32:04 +02:00
Miriam Baglioni	cf6d80b2ab	added command to close the writer	2020-07-30 16:31:22 +02:00
Miriam Baglioni	f985bca37b	added USER_CLAIM constant value	2020-07-30 16:25:26 +02:00
Claudio Atzori	4bbfcf1ac6	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop	2020-07-30 16:25:06 +02:00
Claudio Atzori	4ff8007518	added function to set the missing vocabulary names, used in the cleaning workflow as a pre-cleaning step	2020-07-30 16:24:39 +02:00
Miriam Baglioni	6f1c40a933	-	2020-07-30 16:24:28 +02:00
Miriam Baglioni	2b66a93f9e	added property file that was missing	2020-07-30 16:24:17 +02:00
Michele Artini	bdece15ca0	blacklist of nsprefix	2020-07-30 16:13:38 +02:00
Sandro La Bruzzo	c97c8f0c44	implemented new oozie job to extract entities in a separate dataset	2020-07-30 12:13:58 +02:00
Sandro La Bruzzo	3010a362bc	updated changing in the workflow of provision in the phase of aggregation. Removed serialization in JSON RDD and used spark Dataset	2020-07-30 09:25:56 +02:00
Sandro La Bruzzo	487226f669	Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop	2020-07-30 09:25:39 +02:00
Sandro La Bruzzo	16ae3c9ccf	updated changing in the workflow of provision in the phase of aggregation. Removed serialization in JSON RDD and used spark Dataset	2020-07-30 09:25:32 +02:00
Miriam Baglioni	ee8420c6b3	added resource for datasource test	2020-07-29 18:28:43 +02:00
Miriam Baglioni	76bcab98ce	added code to filter out null originalId from the dump	2020-07-29 18:28:21 +02:00
Miriam Baglioni	ef1d8aef17	added one test to verify the dump for the datasources	2020-07-29 18:27:46 +02:00
Miriam Baglioni	86bab79512	-	2020-07-29 18:20:22 +02:00
Miriam Baglioni	31791dcf3d	fixed wrong property file path name	2020-07-29 18:20:08 +02:00
Miriam Baglioni	9e722aa1ef	-	2020-07-29 18:00:08 +02:00
Miriam Baglioni	d22f106f27	added constant to identify datasource associated to funders	2020-07-29 17:56:55 +02:00
Miriam Baglioni	40e194fe2f	added check to not dump datasources related to funders	2020-07-29 17:56:18 +02:00
Miriam Baglioni	b48934f6df	changed the workflow name	2020-07-29 17:43:43 +02:00
Miriam Baglioni	1433db825d	refactoring	2020-07-29 17:43:24 +02:00
Miriam Baglioni	074e9ab75e	refactoring	2020-07-29 17:42:50 +02:00
Miriam Baglioni	8ad8dac7d4	merge branch with fork master	2020-07-29 17:38:28 +02:00
Miriam Baglioni	9e997e63a2	merge upstream	2020-07-29 17:38:14 +02:00
Miriam Baglioni	9fa82dc93b	fixed issue	2020-07-29 17:36:16 +02:00
Miriam Baglioni	8907648d6a	-	2020-07-29 17:35:47 +02:00
Miriam Baglioni	536e7f6352	added and changed resources for testing of the whole graph dump and of community related products dumps	2020-07-29 17:33:34 +02:00
Miriam Baglioni	4d7f590493	testings for the whole graph dump	2020-07-29 17:32:37 +02:00
Miriam Baglioni	a2f73ec2c7	changed due to changes in the model	2020-07-29 17:32:02 +02:00
Miriam Baglioni	481585e9d3	-	2020-07-29 17:31:41 +02:00
Miriam Baglioni	40a8dafbdc	-	2020-07-29 17:30:44 +02:00
Miriam Baglioni	de2ebb467e	changed due to changes in the model	2020-07-29 17:08:02 +02:00
Miriam Baglioni	d0ff2a56fb	-	2020-07-29 17:06:53 +02:00
Miriam Baglioni	b96dedb56b	changed due to changes in the model	2020-07-29 17:05:31 +02:00
Miriam Baglioni	6d0f08277b	classes to implement the dump of the whole graph.	2020-07-29 17:03:19 +02:00
Miriam Baglioni	8d4327b292	input parameters and workflow definition for the dump of the whole graph	2020-07-29 17:00:34 +02:00
Miriam Baglioni	b5f995ab12	refactoring	2020-07-29 16:59:48 +02:00
Miriam Baglioni	f7a87cc447	added new constants value	2020-07-29 16:58:40 +02:00
Miriam Baglioni	b71d12cf26	refactoring	2020-07-29 16:52:44 +02:00
Miriam Baglioni	a8d65b68cb	changed to delete the part to check if it was a test or a real execution	2020-07-29 16:47:57 +02:00
Miriam Baglioni	3ec2392904	Added new class to move the place the split is effectively run	2020-07-29 16:46:50 +02:00
Michele Artini	8ba94833bd	added an es prop	2020-07-29 14:16:08 +02:00
Miriam Baglioni	178c2729a7	changed the path to reach the java class to be executed	2020-07-29 12:29:51 +02:00
Miriam Baglioni	437ac12139	removed unused parameter	2020-07-29 12:28:16 +02:00
Claudio Atzori	6f11c0496e	fixed typo in module name dhp-worfklow-profiles -> dhp-workflow-profiles	2020-07-28 15:01:58 +02:00
Claudio Atzori	f680eb3e12	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop	2020-07-28 14:10:56 +02:00
Claudio Atzori	985b360c31	fixed typo in module name dhp-worfklow-profiles -> dhp-workflow-profiles	2020-07-28 14:10:52 +02:00
Michele Artini	3acd632123	Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop	2020-07-28 12:02:30 +02:00
Michele Artini	35e6e9c064	tests	2020-07-28 12:02:15 +02:00
Claudio Atzori	ee832f358e	Merge pull request 'stats_wf_extensions_and_corrections' (#28) from spyros/dnet-hadoop:stats_wf_extensions_and_corrections into master Thank you Guys! The update workflow will be made available to the beta & production orchestration systems under the HDFS path ```/lib/dnet/oa/graph/stats/oozie_app```	2020-07-27 16:02:03 +02:00
Antonis Lempesis	4ac8ebe427	correctly calculating the project duration	2020-07-24 19:50:40 +03:00
Antonis Lempesis	18d9464b52	creating shadow db only if it not exists...	2020-07-24 19:50:40 +03:00
Antonis Lempesis	e217d496ab	added the dest db...	2020-07-24 19:50:40 +03:00
Antonis Lempesis	b16bb68b9f	added the target db name...	2020-07-24 19:50:40 +03:00
Antonis Lempesis	1ee7eeedf3	added the source db name...	2020-07-24 19:50:40 +03:00
Antonis Lempesis	cecbbfa0fc	added missing tables and views: contexts, creation_date, funder	2020-07-24 19:50:40 +03:00
Antonis Lempesis	25b7a615f5	moved datasource_sources table creating in the datasource section	2020-07-24 19:50:40 +03:00
Antonis Lempesis	a8da4ab9c0	years in projects are now integers	2020-07-24 19:50:40 +03:00
Antonis Lempesis	c9cfc165d9	not using impala since the resulting tables are not visible	2020-07-24 19:50:40 +03:00
Antonis Lempesis	dd3d6a6e15	compute stats for the used and new impala tables	2020-07-24 19:50:40 +03:00
Antonis Lempesis	e6f50de6ef	Separated impala from hive steps	2020-07-24 19:50:40 +03:00
Antonis Lempesis	de49173420	fixed a typo in queries	2020-07-24 19:50:40 +03:00
antleb	391cf80fb8	Added peer-reviewed, green, gold tables and fields in result. Added shortcuts from result-country	2020-07-24 19:50:40 +03:00
antleb	68389d0125	Corrected the script used by the last step of the wf	2020-07-24 19:50:40 +03:00
antleb	ec52141f1a	changed refereed type from value to clssname	2020-07-24 19:50:40 +03:00
Spyros Zoupanos	63cd797aba	Comment out step 15 to make it work with the new schema of Claudio	2020-07-24 19:50:40 +03:00
Spyros Zoupanos	138c6ddffa	Insert statement to datasource table that takes into account the piwik_id of the openAIRE graph	2020-07-24 19:50:40 +03:00
Spyros Zoupanos	3630794cef	Fix to consider the relationships that have been 'virtually deleted' for project_results - defect #5607	2020-07-24 19:50:40 +03:00
Spyros Zoupanos	5546f29e63	Corrections on the shadow schema and the impala table stats calculation	2020-07-24 19:50:40 +03:00
Spyros Zoupanos	adf8a025d2	Adding more relations (Sources, Licences, Additional) and shadow schema as provided and discussed with Antonis Lempesis	2020-07-24 19:50:40 +03:00
Spyros Zoupanos	657a40536b	Corrections by Spyros: Script cleanup, corrections and re-arrangement	2020-07-24 19:50:40 +03:00
Giorgos Alexiou	477fa6234d	Script re-organisation and adding table invalidations needed for impala	2020-07-24 19:50:40 +03:00
Miriam Baglioni	6c2223d1fc	added code to get the openaire id for contexts	2020-07-24 17:30:15 +02:00
Miriam Baglioni	afd54c1684	removed not needed upload and refactoring	2020-07-24 17:28:56 +02:00
Miriam Baglioni	7b0569d989	changed to map also the result associated to the whole graph	2020-07-24 17:28:11 +02:00
Miriam Baglioni	082225ad61	-	2020-07-24 17:27:26 +02:00
Miriam Baglioni	968c59d97a	added the logic to dump also the products for the whole graph. They will miss collected from and context information that will be materialized as new relations	2020-07-24 17:25:19 +02:00
Miriam Baglioni	332258d199	split the classes related to the communities dump and to the whole graph dump	2020-07-24 17:21:48 +02:00
Claudio Atzori	56bbfdc65d	introduced parameter 'numParitions', driving the hive DB table data partitioning. Currently specified only for table 'project'	2020-07-23 08:54:10 +02:00
Sandro La Bruzzo	9ab594ccf6	fixed test	2020-07-21 10:36:21 +02:00
Claudio Atzori	ebf60020ac	map results as OPRs in case of missing //CobjCategory/@type and the vocabulary dnet:result_typologies doesn't resolve the super type	2020-07-20 19:01:10 +02:00
Miriam Baglioni	355d7e426e	added dump for project - not finished	2020-07-20 18:54:43 +02:00
Miriam Baglioni	a2f01e5259	added getter and setter	2020-07-20 18:54:17 +02:00
Miriam Baglioni	40bbe94f7c	merge with master fork	2020-07-20 18:10:03 +02:00
Miriam Baglioni	2a15494b16	merge upstream	2020-07-20 18:05:01 +02:00
Miriam Baglioni	23160b4d29	realignment of the workflow classes with the changes in the structure of the module	2020-07-20 18:04:30 +02:00
Miriam Baglioni	b904e0699a	-	2020-07-20 18:02:53 +02:00
Miriam Baglioni	3aab7680f6	changed the test results	2020-07-20 18:00:43 +02:00
Miriam Baglioni	cde0300801	moved from projects to project	2020-07-20 17:57:35 +02:00
Miriam Baglioni	5076e4f320	changed test to comply with the modifications	2020-07-20 17:55:18 +02:00
Miriam Baglioni	08dbd99455	changed to dump the whole results graph by using classes already implemented for communities. Added class to dump also organization	2020-07-20 17:54:28 +02:00
Miriam Baglioni	e47ea9349c	extended some types by adding provenance as the couple (provenance, trust) and moved some classes to be used by the complete graph dump also	2020-07-20 17:46:27 +02:00
Claudio Atzori	32f5e466e3	imports cleanup	2020-07-20 17:42:58 +02:00
Claudio Atzori	54ac583923	code formatting	2020-07-20 17:37:08 +02:00
Claudio Atzori	124e7ce19c	in case of missing attribute //dr:CobjCategory/@type the resulttype is derived by looking up the vocabulary dnet:result_typologies with the 1st instance type available	2020-07-20 17:33:37 +02:00
Claudio Atzori	050dda223d	Merge pull request 'removed duplicated fields' (#25) from unique_field_in_lists into master Looks good as a temporary workaround. I agree the model could seamlessly make the distinct operation by using HashSets instead of Linked (or Array) Lists. The task to update the model in such a way is added on #9#issuecomment-1583 Thanks!	2020-07-20 12:12:50 +02:00
Claudio Atzori	e0c4cf6f7b	added parameter to drive the graph merge strategy: priority (BETA|PROD)	2020-07-20 10:48:01 +02:00
Claudio Atzori	94ccdb4852	Merge branch 'master' into merge_graph	2020-07-20 10:14:55 +02:00
Claudio Atzori	0937c9998f	Merge branch 'deduptesting'	2020-07-20 10:00:20 +02:00
Claudio Atzori	de72b1c859	cleanup	2020-07-20 09:59:11 +02:00
Michele Artini	331a3cbdd0	fixed originalId	2020-07-20 09:50:29 +02:00
Michele Artini	c59c5369b1	Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop	2020-07-18 09:40:54 +02:00
Michele Artini	346a1d2b5a	update eventId generator	2020-07-18 09:40:36 +02:00
Sandro La Bruzzo	9116d75b3e	Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop	2020-07-17 18:01:30 +02:00
Miriam Baglioni	d7d84c8217	-	2020-07-17 14:03:23 +02:00
Miriam Baglioni	47c7122773	changed priority from beta to production	2020-07-17 12:56:35 +02:00
Michele Artini	442f30930c	removed duplicated fields	2020-07-17 12:25:36 +02:00
Claudio Atzori	1781609508	code formatting	2020-07-16 19:06:56 +02:00
Claudio Atzori	db8b90a156	renamed CORE -> BETA	2020-07-16 19:05:13 +02:00
Miriam Baglioni	44e1c40c42	merge upstream	2020-07-16 18:49:38 +02:00
Claudio Atzori	878f2b931c	Merge branch 'master' into merge_graph	2020-07-16 16:34:24 +02:00
Claudio Atzori	cc5d13da85	introduced parameter shouldIndex (true|false)	2020-07-16 13:46:39 +02:00
Claudio Atzori	b098cc3cbe	avoid repeating identical values for fields: source, description	2020-07-16 13:45:53 +02:00
Claudio Atzori	805de4eca1	fix: filter the blocks with size = 1	2020-07-16 10:11:32 +02:00
Claudio Atzori	4b9fb2ffb8	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop	2020-07-15 11:26:04 +02:00
Claudio Atzori	b90389bac4	code formatting	2020-07-15 11:24:48 +02:00
Claudio Atzori	4e6f46e8fa	filter blocks with one record only	2020-07-15 11:22:20 +02:00
Michele Artini	262c29463e	relations with multiple datasources	2020-07-15 09:18:40 +02:00
Claudio Atzori	7d6e269b40	reverted CreateRelatedEntitiesJob_phase1 to its previous state	2020-07-13 22:54:04 +02:00
Claudio Atzori	8e97598eb4	avoid to NPE in case of null instances	2020-07-13 20:46:14 +02:00
Claudio Atzori	06def0c0cb	SparkBlockStats allows to repartition the input rdd via the numPartitions workflow parameter	2020-07-13 20:09:06 +02:00

1456 Commits