dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Miriam Baglioni	b828587252	prevent the code from cycling indefinitely	2020-10-30 15:01:25 +01:00
Miriam Baglioni	f747e303ac	classes for dumping of the graph as ttl file	2020-10-30 14:13:45 +01:00
Miriam Baglioni	16baf5b69e	formatting	2020-10-30 14:13:14 +01:00
Miriam Baglioni	a9eef9c852	added check for possible Optional value in relation dataInfo	2020-10-30 14:12:28 +01:00
Miriam Baglioni	5f4de9a962	formatting	2020-10-30 14:11:40 +01:00
Miriam Baglioni	14bf2e7238	added option to split dumps bigger than 40Gb into different files	2020-10-30 14:09:04 +01:00
Claudio Atzori	58f28296ea	ProvisionConstants moved as ModelHardLimits in dhp-common and applied to truncate long abstracts (len > 150000). Further filtering for empty PID values	2020-10-30 10:56:42 +01:00
Miriam Baglioni	78fdb11c3f	merge branch with master	2020-10-29 12:55:22 +01:00
Sandro La Bruzzo	1d9fdb7367	fixed spark memory issue in SparkSplitOafTODLIEntities	2020-10-28 12:30:32 +01:00
Miriam Baglioni	d2374e3b9e	added code to handle cases where the funding tree is not existing	2020-10-27 16:15:21 +01:00
Miriam Baglioni	5d3012eeb4	changed code to dump only the programme list and not the classification list	2020-10-27 16:14:18 +01:00
Miriam Baglioni	3241ec1777	added connection timeout and socket timeout 600 sec	2020-10-27 16:12:11 +01:00
Claudio Atzori	266bf1a221	common IdentifierFactory in use on the mapping from the aggregator data; merge the entities sharing the same id; code formatting	2020-10-16 17:02:10 +02:00
Claudio Atzori	34f1d0904b	common IdentifierFactory in use on the mapping from the aggregator data	2020-10-16 16:00:19 +02:00
Sandro La Bruzzo	fed711da80	Merge remote-tracking branch 'origin/master' into merge_record_to_common	2020-10-13 15:32:45 +02:00
Alessia Bardi	8775a64bc1	Merge pull request 'Merging different compatibility levels (pinocchio operator)' (#47 ) from merge_graph into master	2020-10-09 14:44:52 +02:00
Sandro La Bruzzo	eec418cd26	moved AuthoreMerger into dhp-common	2020-10-08 10:33:55 +02:00
Sandro La Bruzzo	cd9c377d18	adapted scholexplorer Dump generation to the new Dataset definition	2020-10-08 10:10:13 +02:00
Claudio Atzori	a3f37a9414	javadoc	2020-10-07 16:44:22 +02:00
Claudio Atzori	8d85a2fced	[BETA wf only] datasources involved in the merge operation doesn't obey to the infra precedence policy, but relies on a custom behaviour that, given two datasources from beta and prod returns the one from prod with the highest compatibility among the two	2020-10-07 16:28:52 +02:00
Miriam Baglioni	ae08b3c0dd	merge branch with master	2020-10-05 11:35:55 +02:00
Miriam Baglioni	32bffb0134	changed the name from communities_infrastructures to communities_infrastuctures.json	2020-10-05 11:24:17 +02:00
Miriam Baglioni	25cbcf6114	changed to solve issues about names. context renamed communities_infrastructure.json and removed the double json.gz extension from the name of the part in the tar	2020-10-02 12:17:46 +02:00
Claudio Atzori	49ae3450a9	code formatting	2020-10-02 09:43:24 +02:00
Claudio Atzori	c2a6e2a9bf	fixed mapping for datasource journal info (ISSNs)	2020-10-02 09:37:08 +02:00
Miriam Baglioni	cfb5766c6b	removed double json.gz from names of files in the tar	2020-10-01 17:18:34 +02:00
Miriam Baglioni	fcaedac980	merge branch with master	2020-10-01 16:46:59 +02:00
Miriam Baglioni	c6e6ed1bd8	merge branch with master	2020-10-01 16:24:41 +02:00
Claudio Atzori	2e9e13444d	author pids made unique by value	2020-10-01 12:50:40 +02:00
Claudio Atzori	e265c3e125	cleaning functions factored out in a dedicated class	2020-10-01 10:50:15 +02:00
Miriam Baglioni	7b6a7333e6	merge branch with master	2020-09-25 16:42:07 +02:00
Miriam Baglioni	ed5239f9ec	added new code to handle the new possibility to upload files to an already open deposition	2020-09-25 16:34:32 +02:00
Miriam Baglioni	3a8c524fce	refactor	2020-09-25 16:34:02 +02:00
Miriam Baglioni	de6c4d46d8	fixed conflicts	2020-09-24 15:35:01 +02:00
Claudio Atzori	9e3e93c6b6	setting the correct issn type in the datasource.journal element	2020-09-24 10:39:16 +02:00
Miriam Baglioni	39eb8ab25b	changed the dump to move from h2020programme to h2020classification	2020-09-23 17:33:00 +02:00
Miriam Baglioni	1f893e63dc	-	2020-09-14 14:33:10 +02:00
Claudio Atzori	8a523474b7	code formatting	2020-09-07 11:40:16 +02:00
Miriam Baglioni	8694bb9b31	refactoring due to compilation	2020-08-24 17:07:34 +02:00
Miriam Baglioni	8a069a4fea	-	2020-08-24 17:01:30 +02:00
Miriam Baglioni	34fa96f3b1	-	2020-08-24 17:00:20 +02:00
Miriam Baglioni	5fb2949cb8	added utils methods	2020-08-24 17:00:09 +02:00
Miriam Baglioni	2a540b6c01	added constants for the pid graph dump	2020-08-24 16:55:35 +02:00
Miriam Baglioni	bef79d3bdf	first attempt to the dump of pids graph	2020-08-24 16:49:38 +02:00
Miriam Baglioni	85203c16e3	merge branch with master	2020-08-19 11:49:03 +02:00
Miriam Baglioni	1c593a9cfe	-	2020-08-19 11:29:51 +02:00
Miriam Baglioni	e42b2f5ae2	-	2020-08-19 11:29:09 +02:00
Miriam Baglioni	f81ee22418	changed to mirror the changes in the model (Instance, CommunityInstance, GraphResult)	2020-08-19 11:28:26 +02:00
Miriam Baglioni	387be43fd4	changed to discriminate if dumping all the results type together or each one in its own archive	2020-08-19 11:25:27 +02:00
Miriam Baglioni	dc5096a327	refactoring due to compilation	2020-08-19 10:57:36 +02:00
Miriam Baglioni	09f5b92763	added specific reference to class	2020-08-14 20:00:09 +02:00
Miriam Baglioni	a5043de5da	added method to get the mapped instance	2020-08-13 18:45:50 +02:00
Miriam Baglioni	fcd10f452c	changed because of D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:55:32 +02:00
Miriam Baglioni	bfd1fcde6d	removed not useful method and changed because of D-Net/dnet-hadoop#40 (comment) and D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:14:37 +02:00
Miriam Baglioni	7fd8397123	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:13:15 +02:00
Miriam Baglioni	753d448cc9	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:12:58 +02:00
Miriam Baglioni	c0e071fa26	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:12:40 +02:00
Miriam Baglioni	526db915bc	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:12:16 +02:00
Miriam Baglioni	b0fab0d138	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:11:57 +02:00
Miriam Baglioni	1b6320b251	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:11:41 +02:00
Miriam Baglioni	743d31be22	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:11:22 +02:00
Miriam Baglioni	65b48df652	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:11:06 +02:00
Miriam Baglioni	90b54d3efb	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:08:24 +02:00
Miriam Baglioni	69bbb9592a	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:07:39 +02:00
Miriam Baglioni	945323299a	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:07:24 +02:00
Miriam Baglioni	e04c993247	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:07:07 +02:00
Miriam Baglioni	ed0812d0ce	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:06:49 +02:00
Miriam Baglioni	d55cfe0ea5	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:06:20 +02:00
Miriam Baglioni	80866bec7d	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:06:05 +02:00
Miriam Baglioni	1400978c0a	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:05:44 +02:00
Miriam Baglioni	7b941a2e0a	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:05:17 +02:00
Miriam Baglioni	f7474f50fe	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:04:52 +02:00
Miriam Baglioni	367203f412	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:04:33 +02:00
Miriam Baglioni	3ab4809d31	apply changes in D-Net/dnet-hadoop#40 (comment)	2020-08-13 12:04:10 +02:00
Miriam Baglioni	235d4e4d6e	moved Context as relevant for Communities dump	2020-08-12 18:16:45 +02:00
Miriam Baglioni	7400cd019d	removed not needed variable	2020-08-12 10:03:33 +02:00
Miriam Baglioni	98d28bab5c	fixed missing _ in context nsprefix	2020-08-12 10:00:18 +02:00
Miriam Baglioni	2d67476417	merge branch with master	2020-08-11 15:46:04 +02:00
Miriam Baglioni	0603ec4757	changed test to upload the dump for covid-19 community	2020-08-11 15:43:25 +02:00
Miriam Baglioni	cf4d918787	added description, changed parameter name and added method	2020-08-11 15:27:31 +02:00
Miriam Baglioni	dc5fc5366d	Creation of an archive for each related dump part	2020-08-11 15:26:06 +02:00
Miriam Baglioni	0ce49049d6	added description	2020-08-11 15:25:11 +02:00
Miriam Baglioni	9bae991167	added description of the class	2020-08-11 11:20:43 +02:00
Miriam Baglioni	341dc59ead	removed the repartition(1). Added code for the creation of an archive containing all the parts dumped for each community	2020-08-11 11:18:58 +02:00
Miriam Baglioni	1991a49f70	removed reference to isLookUp to get the communityMap	2020-08-10 18:02:56 +02:00
Miriam Baglioni	fe88904df0	changed the wf definition	2020-08-10 12:01:14 +02:00
Miriam Baglioni	87856467e2	removed isLookUpUrl and added code to read the communitymap from HDFS	2020-08-10 11:38:41 +02:00
Miriam Baglioni	3aedfdf0d6	added option to do a new deposition or new version of an old deposition	2020-08-07 17:49:14 +02:00
Miriam Baglioni	1b3ad1bce6	filter out authors pid (only orcid). Added check to get unique provenance for context id. filter out countries with code UNKNOWN	2020-08-07 17:48:18 +02:00
Miriam Baglioni	5ceb8c5f0a	moved constants from graph.Constants	2020-08-07 17:46:47 +02:00
Miriam Baglioni	6c65c93c0e	refactoring	2020-08-07 17:45:35 +02:00
Miriam Baglioni	68adf86fe4	refactoring	2020-08-07 17:43:20 +02:00
Miriam Baglioni	26d2ad6ebb	refactoring	2020-08-07 17:41:56 +02:00
Miriam Baglioni	9675af7965	refactoring	2020-08-07 17:41:07 +02:00
Miriam Baglioni	346a91f4d9	Added constants	2020-08-07 17:35:39 +02:00
Miriam Baglioni	d52b0e1797	no use of IsLookUp. The query is done once and its result stored on HDFS. The path to the result is given instead of the isLookUpUrl	2020-08-07 17:34:40 +02:00
Miriam Baglioni	ae1b7fbfdb	changed method signature from set of mapkey entries to String representing path on file system where to find the map	2020-08-07 17:32:27 +02:00
Miriam Baglioni	545ea9f77e	moved in common. Zenodo response model and APIClient to deposit in Zenodo	2020-08-07 16:44:51 +02:00
Sandro La Bruzzo	4fb1821fab	Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop	2020-08-06 10:28:31 +02:00
Sandro La Bruzzo	9d9e9edbd2	improved extractEntity Relation workflows using dataset	2020-08-06 10:28:24 +02:00
Miriam Baglioni	14eda4f46e	added method to try to put inputstream to zenodo	2020-08-05 14:18:25 +02:00
Miriam Baglioni	e737a47270	added classes to try to send input stream to zenodo for the upload	2020-08-05 14:17:40 +02:00
Miriam Baglioni	873e9cd50c	changed hadoop setting to connect to s3	2020-08-04 15:37:25 +02:00
Alessia Bardi	a29565ff57	code formatting	2020-08-04 12:55:27 +02:00
Alessia Bardi	01db29e208	fixes redmine issue #5846 : datacite and its different namespace declarations	2020-08-04 12:53:48 +02:00
Alessia Bardi	b4e4e5f858	do not duplicate result PIDs	2020-08-04 12:52:14 +02:00
Miriam Baglioni	5b651abf82	merge branch with master	2020-08-04 10:14:07 +02:00
Miriam Baglioni	aa9f3d9698	changed logic for save in s3 directly	2020-08-03 18:06:18 +02:00
Miriam Baglioni	d465f0eec9	added fulltext to result	2020-08-03 18:03:27 +02:00
Miriam Baglioni	c892c7dfa7	changed to query for community map just once and save the result for remaining executions	2020-08-03 17:56:31 +02:00
Michele Artini	652b13abb6	Merge branch 'master' into nsprefix_blacklist	2020-07-31 07:58:37 +02:00
Claudio Atzori	cd631bb5bc	defaults fixed in the cleaning workflow forces result.publisher to NULL when result.publisher.value is empty	2020-07-30 17:03:53 +02:00
Miriam Baglioni	57c87b7653	re-implemented to fix issue on not serializable Set<String> variable	2020-07-30 16:43:43 +02:00
Miriam Baglioni	ef8e5957b5	added specific directory where to save results	2020-07-30 16:42:46 +02:00
Miriam Baglioni	75f3361c85	-	2020-07-30 16:41:31 +02:00
Miriam Baglioni	3f695b25fa	refactoring	2020-07-30 16:40:15 +02:00
Miriam Baglioni	e623f12bef	refactoring	2020-07-30 16:32:59 +02:00
Miriam Baglioni	ff7d05abb4	added support class to store the couple organizationId representativeId got from sql query on hive	2020-07-30 16:32:04 +02:00
Miriam Baglioni	cf6d80b2ab	added command to close the writer	2020-07-30 16:31:22 +02:00
Miriam Baglioni	f985bca37b	added USER_CLAIM constant value	2020-07-30 16:25:26 +02:00
Claudio Atzori	4bbfcf1ac6	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop	2020-07-30 16:25:06 +02:00
Claudio Atzori	4ff8007518	added function to set the missing vocabulary names, used in the cleaning workflow as a pre-cleaning step	2020-07-30 16:24:39 +02:00
Michele Artini	bdece15ca0	blacklist of nsprefix	2020-07-30 16:13:38 +02:00
Sandro La Bruzzo	c97c8f0c44	implemented new oozie job to extract entities in a separate dataset	2020-07-30 12:13:58 +02:00
Sandro La Bruzzo	3010a362bc	updated changing in the workflow of provision in the phase of aggregation. Removed serialization in JSON RDD and used spark Dataset	2020-07-30 09:25:56 +02:00
Sandro La Bruzzo	16ae3c9ccf	updated changing in the workflow of provision in the phase of aggregation. Removed serialization in JSON RDD and used spark Dataset	2020-07-30 09:25:32 +02:00
Miriam Baglioni	76bcab98ce	added code to filter out null originalId from the dump	2020-07-29 18:28:21 +02:00
Miriam Baglioni	86bab79512	-	2020-07-29 18:20:22 +02:00
Miriam Baglioni	31791dcf3d	fixed wrong property file path name	2020-07-29 18:20:08 +02:00
Miriam Baglioni	9e722aa1ef	-	2020-07-29 18:00:08 +02:00
Miriam Baglioni	d22f106f27	added constant to identify datasource associated to funders	2020-07-29 17:56:55 +02:00
Miriam Baglioni	40e194fe2f	added check to not dump datasources related to funders	2020-07-29 17:56:18 +02:00
Miriam Baglioni	074e9ab75e	refactoring	2020-07-29 17:42:50 +02:00
Miriam Baglioni	8ad8dac7d4	merge branch with fork master	2020-07-29 17:38:28 +02:00
Miriam Baglioni	9fa82dc93b	fixed issue	2020-07-29 17:36:16 +02:00
Miriam Baglioni	8907648d6a	-	2020-07-29 17:35:47 +02:00
Miriam Baglioni	6d0f08277b	classes to implement the dump of the whole graph.	2020-07-29 17:03:19 +02:00
Miriam Baglioni	b5f995ab12	refactoring	2020-07-29 16:59:48 +02:00
Miriam Baglioni	f7a87cc447	added new constants value	2020-07-29 16:58:40 +02:00
Miriam Baglioni	b71d12cf26	refactoring	2020-07-29 16:52:44 +02:00
Miriam Baglioni	a8d65b68cb	changed to delete the part to check if it was a test or a real execution	2020-07-29 16:47:57 +02:00
Miriam Baglioni	3ec2392904	Added new class to move the place the split is effectively run	2020-07-29 16:46:50 +02:00
Miriam Baglioni	6c2223d1fc	added code to get the openaire id for contexts	2020-07-24 17:30:15 +02:00
Miriam Baglioni	afd54c1684	removed not needed upload and refactoring	2020-07-24 17:28:56 +02:00
Miriam Baglioni	7b0569d989	changed to map also the result associated to the whole graph	2020-07-24 17:28:11 +02:00
Miriam Baglioni	082225ad61	-	2020-07-24 17:27:26 +02:00
Miriam Baglioni	968c59d97a	added the logic to dump also the products for the whole graph. They will miss collected from and context information that will be materialized as new relations	2020-07-24 17:25:19 +02:00
Miriam Baglioni	332258d199	split the classes related to the communities dump and to the whole graph dump	2020-07-24 17:21:48 +02:00
Claudio Atzori	56bbfdc65d	introduced parameter 'numParitions', driving the hive DB table data partitioning. Currently specified only for table 'project'	2020-07-23 08:54:10 +02:00
Claudio Atzori	ebf60020ac	map results as OPRs in case of missing //CobjCategory/@type and the vocabulary dnet:result_typologies doesn't resolve the super type	2020-07-20 19:01:10 +02:00

464 Commits