dnet-hadoop

sabeel

Author	SHA1	Message	Date
Claudio Atzori	8d2bb24512	merged from master	2021-03-08 15:44:34 +01:00
Enrico Ottonello	70cb100647	added updating last orcid dataset folders after completion	2021-03-01 10:17:04 +01:00
Enrico Ottonello	bd3b16402b	added result typologies	2021-03-01 10:16:02 +01:00
Enrico Ottonello	53d7023460	dateOfCollection taken from orcid last_update.txt on hdfs; cleaned wf parameters	2021-02-25 18:43:29 +01:00
Enrico Ottonello	d43ea88caf	aligned orcid result typologies with openaire vocabulary	2021-02-25 15:02:10 +01:00
Enrico Ottonello	975823b968	data from last updated orcid	2021-02-23 15:35:04 +01:00
Enrico Ottonello	ee4ba7298b	fix last update read/write from file on hdfs	2021-02-09 23:24:57 +01:00
Claudio Atzori	72c57b28fa	switched project version to 1.2.4-branch_hadoop_aggregator-SNAPSHOT	2021-02-04 14:08:18 +01:00
Enrico Ottonello	c238561001	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi	2021-02-04 10:44:21 +01:00
Enrico Ottonello	465ce39f75	job execution now based on file last_update.txt on hdfs	2021-02-04 10:44:04 +01:00
Sandro La Bruzzo	99cf3a8ea4	Merged Datacite transform into this branch	2021-01-28 16:34:46 +01:00
Claudio Atzori	ab2fe9266a	[DOIBoost] minor fixes in workflow definition	2021-01-05 10:26:39 +01:00
Claudio Atzori	7c722f3fdc	[DOIBoost] fixed typo	2021-01-05 10:25:54 +01:00
Claudio Atzori	8879704ba0	[DOIBoost] configurable ES server url and index name in crossref importer	2021-01-05 10:00:13 +01:00
Sandro La Bruzzo	7834a35768	avoid to save intermediate dataset before generation of Sequence file	2021-01-04 17:54:57 +01:00
Sandro La Bruzzo	e79445a8b4	minor fix for claudio polemica	2021-01-04 17:39:25 +01:00
Sandro La Bruzzo	8765020b85	minor fix	2021-01-04 17:37:08 +01:00
Sandro La Bruzzo	b0dc92786f	defined a single oozie workflow for the generation of doiboost	2021-01-04 17:01:35 +01:00
Claudio Atzori	28460c2cd1	using com.fasterxml.jackson.databind.ObjectMapper instead of org.codehaus.jackson.map.ObjectMapper	2020-12-23 16:59:52 +01:00
Sandro La Bruzzo	1f6c8a9e83	added orcid_pending type to records coming from Crossref	2020-12-15 11:47:15 +01:00
Enrico Ottonello	b2de598c1a	all actions from download lambda file to merge updated data into one wf	2020-12-15 10:42:55 +01:00
Enrico Ottonello	efe4c2a9c5	authors and works are now updated in two separate spark actions of the wf	2020-12-12 02:06:21 +01:00
Enrico Ottonello	858efbfad1	fix dataset creation for downloaded works	2020-12-11 16:49:54 +01:00
Claudio Atzori	d9532446eb	imported more diffs from master branch; code formatting	2020-12-10 16:14:16 +01:00
Claudio Atzori	12e2f930c8	resolved conflicts	2020-12-10 10:57:39 +01:00
Enrico Ottonello	2233750a37	original orcid xml data are stored in a field of the class that models orcid data	2020-12-09 09:45:19 +01:00
Sandro La Bruzzo	302baab67b	fixed doiboost mapping and workflows	2020-12-07 19:59:33 +01:00
Enrico Ottonello	5c65e602d3	wf doi_authors generates one json data foreach row	2020-12-07 15:28:10 +01:00
Enrico Ottonello	fa1855a4b8	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi	2020-12-07 11:02:59 +01:00
Enrico Ottonello	b1b589ada1	wf to generate orcid dataset	2020-12-07 11:02:32 +01:00
Sandro La Bruzzo	b31dd126fb	fixed crossref workflow added common ORCID Class	2020-12-07 10:42:38 +01:00
Enrico Ottonello	8812ab65e1	completed download function to wf; added accumulators	2020-12-04 21:13:49 +01:00
Enrico Ottonello	53b22c1937	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi	2020-12-02 23:21:27 +01:00
Enrico Ottonello	1b1e9ea67c	wf to generate doi_author_list for doiboost; wf to download updated works	2020-12-02 23:20:16 +01:00
Sandro La Bruzzo	7da679542f	fixed wrong projectId	2020-12-02 14:28:09 +01:00
Sandro La Bruzzo	6ba8037cc7	fixed failure to test due to changing of input	2020-12-02 11:34:46 +01:00
Claudio Atzori	cfb55effd9	code formatting	2020-12-02 11:23:49 +01:00
Enrico Ottonello	f2df3ead74	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi	2020-11-30 14:22:46 +01:00
Enrico Ottonello	40c4559e92	added datainfo on authors pid with "sysimport:crosswalk:entityregistry",	2020-11-30 14:19:22 +01:00
Claudio Atzori	a104d2b6ad	cleanup	2020-11-26 11:12:00 +01:00
Claudio Atzori	db0181b8af	Merge pull request 'added bidirectionality to relations from project and result coming from crossref' (#60) from miriam.baglioni/dnet-hadoop:sxBidirectionality into master	2020-11-25 17:17:40 +01:00
Sandro La Bruzzo	ec3e238de6	Fixed problem on duplicated identifier	2020-11-25 17:15:54 +01:00
Sandro La Bruzzo	264723ffd8	updated stuff for zenodo upload	2020-11-25 11:56:07 +01:00
Enrico Ottonello	99a086f0c6	max concurrent executors set to 10, according to ORCID Director of Technology mail request	2020-11-24 17:49:32 +01:00
Miriam Baglioni	00874a8ce6	added bidirectionality to relations from project and result	2020-11-24 15:17:23 +01:00
Enrico Ottonello	5c17e768b2	set wf configuration with spark.dynamicAllocation.maxExecutors 20 over 20 input partitions	2020-11-23 16:01:23 +01:00
Enrico Ottonello	97c8111847	action to convert lambda file in seq file; spark action to download updated authors	2020-11-23 09:49:22 +01:00
Enrico Ottonello	c0c2e05eae	added wf to extract authors and works xml data from orcid dump to hdfs; added wf to download the lambda file (containing last orcid update information) from orcid to hdfs	2020-11-17 18:23:12 +01:00
Enrico Ottonello	005f849674	added compression to output dataset	2020-11-13 12:45:31 +01:00
Enrico Ottonello	9a2fa9dc2f	added test for other names parsing from summaries dump	2020-11-13 10:25:34 +01:00
Enrico Ottonello	13f28fa225	moved AuthorData to dhp-schemas; added other names to author data	2020-11-12 17:43:32 +01:00
Claudio Atzori	9b0fb9e958	merged from master	2020-11-12 09:27:12 +01:00
Enrico Ottonello	1f861f2b0d	now wf output is a sequence file with the format seq("eu.dnetlib.dhp.schema.oaf.Publication",eu.dnetlib.dhp.schema.action.AtomicActions)	2020-11-11 17:38:50 +01:00
Enrico Ottonello	fea2451658	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi	2020-11-10 11:49:43 +01:00
Enrico Ottonello	1513174d7e	added further test case	2020-11-10 11:44:55 +01:00
Sandro La Bruzzo	8e1d43aab2	Implemented ID generation using IdentifierRecordFactory on DOIBoost	2020-11-09 11:53:55 +01:00
Sandro La Bruzzo	cd27df91a1	fixed bug on missing relation in ANDS	2020-11-06 17:12:31 +01:00
Enrico Ottonello	6bc7dbeca7	first version of dataset successfully generated from orcid dump 2020	2020-11-06 13:47:50 +01:00
Sandro La Bruzzo	39337d8a8a	fixed test	2020-11-02 09:26:25 +01:00
Enrico Ottonello	9818e74a70	added dependency version in main pom.xml for orcid no doi	2020-10-22 16:38:00 +02:00
Enrico Ottonello	210a50e4f4	replaced null value	2020-10-22 16:24:42 +02:00
Enrico Ottonello	b0290dbcb7	moved all dependencies version to main pom.xml	2020-10-22 16:20:46 +02:00
Enrico Ottonello	a38ab57062	let run test methods	2020-10-22 15:43:50 +02:00
Enrico Ottonello	1139d6568d	replaced null value with a more safe empty string as return value	2020-10-22 15:32:26 +02:00
Enrico Ottonello	c58db1c8ea	added filter on null value after map function	2020-10-22 15:11:02 +02:00
Enrico Ottonello	846ba30873	if typologies mapping fails, an exception will be propagated	2020-10-22 14:36:18 +02:00
Enrico Ottonello	c3114ba0ae	replaced null as return value with a more safe empty string	2020-10-22 14:21:31 +02:00
Enrico Ottonello	c295c71ca0	added comment	2020-10-22 14:07:26 +02:00
Enrico Ottonello	ab083f9946	propagate exception on parsing work (PR request)	2020-10-22 14:02:32 +02:00
sandro	3a81a940b7	solved bug on merge publication	2020-10-21 22:41:55 +02:00
Sandro La Bruzzo	34bf64c94f	fixed export Scholexplorer to OpenAire	2020-10-13 08:47:58 +02:00
Sandro La Bruzzo	cd9c377d18	adapted scholexplorer Dump generation to the new Dataset definition	2020-10-08 10:10:13 +02:00
Sandro La Bruzzo	c4a3c52e45	fixed Doiboost bug in the identifier	2020-10-01 15:46:44 +02:00
Enrico Ottonello	a97ad20c7b	exception is now propagated (PR review)	2020-09-22 10:46:34 +02:00
Enrico Ottonello	fefbcfb106	dependency version moved to main pom (PR review)	2020-09-22 10:20:25 +02:00
Enrico Ottonello	9e8e7fe6ef	add comments	2020-09-15 11:32:49 +02:00
Enrico Ottonello	0377b40fba	output to one parquet file	2020-07-30 18:38:07 +02:00
Enrico Ottonello	196f36c6ed	fix publication dataset creation	2020-07-30 13:38:33 +02:00
Enrico Ottonello	c82b15b5f4	migrate configuration to ocean, fix publication dataset creation	2020-07-28 15:23:52 +02:00
Enrico Ottonello	ca37d3427b	separate workflow to parse orcid summaries, activities and generate dataset with no doi publications; test	2020-07-03 23:30:31 +02:00
Enrico Ottonello	1729cc5cf3	publication conversion from json to oaf test	2020-07-02 18:46:20 +02:00
Enrico Ottonello	5525f57ec8	converter from orcid work json to oaf	2020-07-01 18:36:14 +02:00
Enrico Ottonello	b7b6be12a5	fixed enriched works generation	2020-06-29 18:03:16 +02:00
Enrico Ottonello	b2213b6435	merged with dnet version	2020-06-26 17:27:34 +02:00
Enrico Ottonello	c5e149c46e	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi	2020-06-26 16:15:38 +02:00
Enrico Ottonello	d6498278ed	added workflow to generate seq(orcidId,work) and seq(orcidId,enrichedWork)	2020-06-25 18:43:29 +02:00
Sandro La Bruzzo	a6c0faac70	added test to verify secondary sorting	2020-06-25 10:48:15 +02:00
Enrico Ottonello	fcbb4c1489	parser of orcid publication data from xml original dump	2020-06-24 16:29:32 +02:00
Claudio Atzori	9cd27183b6	[maven-release-plugin] prepare for next development iteration	2020-06-22 11:27:44 +02:00
Claudio Atzori	1e3dab0631	[maven-release-plugin] prepare release dhp-1.2.3	2020-06-22 11:27:39 +02:00
Sandro La Bruzzo	9bf67f5de1	resolved conflicts	2020-06-17 09:15:43 +02:00
Sandro La Bruzzo	1d4275acc4	implemented first version of exportation of Scholexplorer into ActionSet	2020-06-17 09:10:38 +02:00
Claudio Atzori	c4d9f1837f	[maven-release-plugin] prepare for next development iteration	2020-06-12 12:21:08 +02:00
Claudio Atzori	f0746a7605	[maven-release-plugin] prepare release dhp-1.2.2	2020-06-12 12:21:03 +02:00
Claudio Atzori	67c7b31ba6	Merge branch 'master' into graph_cleaning	2020-06-10 15:00:35 +02:00
Claudio Atzori	a2fdf85ba1	WIP: graph cleaner implementation	2020-06-09 19:52:53 +02:00
Alessia Bardi	4551c1082f	mapping csv for orcid	2020-06-09 18:08:47 +02:00
Alessia Bardi	2d3f7d1eb4	fixed log classes to make the ORCID test run	2020-06-09 18:07:14 +02:00
Alessia Bardi	a3a6755d58	mapping csv for Unpaywall	2020-06-09 17:45:44 +02:00
Alessia Bardi	f3b033cf09	added csv line for funders from Crossref	2020-06-09 17:08:26 +02:00
Alessia Bardi	fc4d220964	updated function name for SNSF	2020-06-09 17:05:31 +02:00
Alessia Bardi	33b130ec43	Mapping instructions for MAG	2020-06-09 15:57:15 +02:00
Alessia Bardi	d6de406e11	fixed classid for subjects	2020-06-09 14:43:34 +02:00
Alessia Bardi	f072125152	map volume and issue in journal information from MAG	2020-06-09 14:32:10 +02:00
Alessia Bardi	b7cb1163ea	identifiers always start with 50	2020-06-09 10:39:11 +02:00
Alessia Bardi	181f52b9bc	Added mapping table for Crossref	2020-06-08 19:33:47 +02:00
Alessia Bardi	9fd25887f7	Result identifiers all start with 50|	2020-06-08 19:32:24 +02:00
Alessia Bardi	16cb073b15	set the instance dateofacceptance with the Crossref createdDate in case the issuedDate is blank	2020-06-08 19:06:03 +02:00
Sandro La Bruzzo	7ac1ba2e35	improvement DOIBoost	2020-06-04 14:39:20 +02:00
Sandro La Bruzzo	13815d5d13	improvement DOIBoost	2020-06-01 17:52:12 +02:00
Sandro La Bruzzo	b87b3ddb6b	changed mapping ORCIDToOAF	2020-05-29 09:32:04 +02:00
Sandro La Bruzzo	7d29b61c62	code refactor	2020-05-28 09:57:46 +02:00
Sandro La Bruzzo	25f52e19a4	implemented generation of ActionSet	2020-05-26 09:15:33 +02:00
Sandro La Bruzzo	2408083566	implemented filtering step	2020-05-23 08:46:49 +02:00
Sandro La Bruzzo	147dd389bf	minor fix	2020-05-22 20:51:42 +02:00
Sandro La Bruzzo	22936d0877	Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost	2020-05-22 15:15:17 +02:00
Sandro La Bruzzo	9fbb221457	completed mapping of UnpayWall and ORCID	2020-05-22 15:15:09 +02:00
Enrico Ottonello	1109d3b3fc	Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost	2020-05-21 00:41:27 +02:00
Enrico Ottonello	869a53040e	save to text file format	2020-05-21 00:41:21 +02:00
Sandro La Bruzzo	5818abaab4	fixed Crossref Mapping	2020-05-20 17:05:46 +02:00
Sandro La Bruzzo	b771d67e9d	next step of MAG conversion implemented	2020-05-20 08:14:03 +02:00
Enrico Ottonello	934ad570e0	joined summaries and activities dataset	2020-05-19 12:57:21 +02:00
Enrico Ottonello	ca722d4d18	merged	2020-05-19 09:43:12 +02:00
Enrico Ottonello	7362bc3e9d	workflow to generate seq(doi,AuthorList)	2020-05-19 09:34:44 +02:00
Sandro La Bruzzo	486e850bcc	next step of MAG conversion implemented	2020-05-19 09:24:45 +02:00
Enrico Ottonello	d4e9075f22	Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost	2020-05-18 19:51:36 +02:00
Enrico Ottonello	fc80e8c7de	added accumulator; last modified date of the record is added to saved data; lambda file is partitioned into 20 parts before starting downloading	2020-05-18 19:51:29 +02:00
Enrico Ottonello	0b29bb7e3b	spark job to download orcid record modified after a fixed date	2020-05-15 19:49:26 +02:00
Enrico Ottonello	12756f9d41	multithread (4 threads) test to feed elastic search	2020-05-13 16:11:40 +02:00
Sandro La Bruzzo	d876f47d06	next step of MAG conversion implemented	2020-05-13 10:38:04 +02:00
Enrico Ottonello	08040cef80	spark action to analyze orcid lambda file	2020-05-12 16:57:43 +02:00
Enrico Ottonello	3b1a68cbf5	elastic search feed test	2020-05-11 14:53:52 +02:00
Enrico Ottonello	f53e42bda7	merged	2020-05-11 14:49:28 +02:00
Enrico Ottonello	7990894454	different date format in lambda file parsing	2020-05-11 14:41:11 +02:00
Sandro La Bruzzo	0c6774e4da	updated pom version	2020-05-11 14:35:14 +02:00
Sandro La Bruzzo	4062eafbdb	merged from branch	2020-05-11 14:08:16 +02:00
Sandro La Bruzzo	1662f221f5	added test class	2020-05-11 09:39:11 +02:00
Sandro La Bruzzo	2b48a2c32c	Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost	2020-05-11 09:38:36 +02:00
Sandro La Bruzzo	4cebca09d2	start implementing MAG mapping	2020-05-11 09:38:27 +02:00
Enrico Ottonello	b9d126dd1f	formatting modified after commit	2020-05-08 14:54:37 +02:00
Enrico Ottonello	7e1c987370	Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost	2020-05-08 14:49:50 +02:00
Enrico Ottonello	9d812788e4	added job to download from orcid the records modified after a fixed date, the info are taken from last_modified.csv on hdfs	2020-05-08 14:49:39 +02:00
Sandro La Bruzzo	1e06bbaee8	fixed test	2020-04-30 11:38:58 +02:00
Sandro La Bruzzo	4a89465740	reformatted code	2020-04-29 13:24:29 +02:00
Sandro La Bruzzo	a6b1a59d0a	merged with master	2020-04-29 13:20:57 +02:00
Sandro La Bruzzo	920c0f19c3	Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost	2020-04-29 13:13:16 +02:00
Sandro La Bruzzo	09f161f1f4	implemented unit test	2020-04-29 13:13:02 +02:00
Enrico Ottonello	1edcd53581	added shell actions to download all 11 activities files from ORCID	2020-04-28 20:25:09 +02:00
Enrico Ottonello	a1861b9eaa	workflow works in parallel on 2 activity files	2020-04-24 18:33:37 +02:00
Enrico Ottonello	941e94af06	added workflow for generating authors with dois data sequence file	2020-04-24 15:50:40 +02:00
Sandro La Bruzzo	4ba386d996	improved crossref mapping	2020-04-23 09:33:48 +02:00
Sandro La Bruzzo	157915988c	improved crossref mapping	2020-04-22 15:00:44 +02:00
Enrico Ottonello	5977f08e92	merged	2020-04-22 14:50:50 +02:00
Enrico Ottonello	7d759947ae	used vtd for parsing orcid xml record, set 4g heapspace	2020-04-22 14:41:19 +02:00
Sandro La Bruzzo	e4b105cece	improved crossref mapping	2020-04-20 18:10:07 +02:00
Sandro La Bruzzo	5d46ec7d5f	fixed name of wrong package	2020-04-20 14:49:32 +02:00
Sandro La Bruzzo	82cc3b707d	fixed name of wrong package	2020-04-20 14:47:06 +02:00
Sandro La Bruzzo	7029942e06	Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost	2020-04-20 13:26:41 +02:00
Sandro La Bruzzo	0e45f4d450	continue mapping from crossref to OAF	2020-04-20 13:26:29 +02:00
Enrico Ottonello	a466648b4b	renamed output file	2020-04-20 12:32:03 +02:00
Enrico Ottonello	4ae55e3891	added workflow parameters	2020-04-20 12:00:04 +02:00
Sandro La Bruzzo	eef60bb9f4	created structure of oozie wf for ORCID	2020-04-20 10:24:57 +02:00
Sandro La Bruzzo	4d0d9de07e	reorganized package and fixed test	2020-04-20 10:02:42 +02:00
Sandro La Bruzzo	618bc1fc72	first implementation of crossrefMapping	2020-04-20 09:53:34 +02:00
Enrico Ottonello	1d44a359ea	renamed package folder	2020-04-20 09:25:40 +02:00
Enrico Ottonello	7011d4203e	parser of orcid summaries from tar gz file on hdfs, that creates a sequence file with author information (oid, name, surname, credit name)	2020-04-17 18:52:39 +02:00
Sandro La Bruzzo	a329ea5575	merged with master branch	2020-04-17 12:23:54 +02:00
Sandro La Bruzzo	205e9521c6	implemented import crossref job	2020-04-01 14:12:33 +02:00

318 Commits