dnet-hadoop

Commit Graph

Author	SHA1	Message	Date
Miriam Baglioni	86f47afcc7	slight modification of the resource to accommodate also doi normalization tests	2021-06-30 14:36:49 +02:00
Miriam Baglioni	03767ea8e6	slight modification of the resource to accommodate also doi normalization tests	2021-06-30 13:21:24 +02:00
Miriam Baglioni	f8eec0ca9a	added resource to test the normalization of doi during the import of MAG	2021-06-30 13:19:54 +02:00
Miriam Baglioni	149f85ddf5	added tests for the normalization of the dois	2021-06-30 13:00:52 +02:00
Miriam Baglioni	e487b5544c	added tests for the normalization of the dois	2021-06-30 12:57:11 +02:00
Miriam Baglioni	1503ccbbb5	added tests for the normalization of the dois	2021-06-30 12:55:37 +02:00
Miriam Baglioni	1299bfb357	Added class to test the normalization of doi	2021-06-30 12:53:27 +02:00
Miriam Baglioni	cf758f4f91	added normalization step for the doi	2021-06-30 10:03:15 +02:00
Miriam Baglioni	801763a0fa	there is no longer any need to lowercase the doi since it is done in the first step. Also changed the creation of the id to use the factory	2021-06-29 19:07:23 +02:00
Miriam Baglioni	a74de1cda2	added normalization step to the doi	2021-06-29 18:51:11 +02:00
Miriam Baglioni	06074ea7d3	added normalization step to the doi	2021-06-29 18:46:08 +02:00
Miriam Baglioni	8b8ffe82dc	added step of normalization for the doi	2021-06-29 18:41:39 +02:00
Miriam Baglioni	50cc21d92e	Added method to normalize doi values (lower case, remove everything preceding 10., filter out dois not starting with 10.)	2021-06-29 18:35:28 +02:00
Sandro La Bruzzo	80e15cc455	implemented mapping from uniprot, pdb and ebi links	2021-06-24 17:20:00 +02:00
Sandro La Bruzzo	a167543637	Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_id_scholexplorer	2021-06-21 09:14:11 +02:00
Miriam Baglioni	13c96622c9	-	2021-06-18 09:45:16 +02:00
Miriam Baglioni	b486ae498f	added test and test resource to verify the generation of the date of acceptance from the input extracted from the dump	2021-06-18 09:43:32 +02:00
Miriam Baglioni	464c2ddde3	changed to split in two steps the generation of the crossref dataset	2021-06-18 09:42:31 +02:00
Miriam Baglioni	6aca0d8ebb	added kryo encoding for input files	2021-06-18 09:42:07 +02:00
Miriam Baglioni	3585e53da3	changed to split in two steps the generation of the crossref dataset	2021-06-18 09:41:23 +02:00
Sandro La Bruzzo	3100166d29	Merge remote-tracking branch 'origin/stable_ids' into stable_id_scholexplorer	2021-06-16 16:22:16 +02:00
Miriam Baglioni	95885bcf12	forces executor memory and driver memory to be 7G (trying to avoid OOM)	2021-06-16 10:17:52 +02:00
Miriam Baglioni	2550a73981	-	2021-06-16 10:04:41 +02:00
Miriam Baglioni	1c47c0d786	modified the number of executors trying to avoid OOM exception	2021-06-15 21:05:39 +02:00
Miriam Baglioni	7deac55138	added one option for resume from in the wf	2021-06-15 18:38:20 +02:00
Miriam Baglioni	66e7ef892f	changed the parameter name	2021-06-15 11:08:54 +02:00
Miriam Baglioni	4f47ad0891	no need to rename the folders, just write in overwrite mode, so I changed the name of the output folder	2021-06-15 09:28:31 +02:00
Miriam Baglioni	9f9dd00b94	refactoring	2021-06-15 09:24:46 +02:00
Miriam Baglioni	63d74ee379	refactoring	2021-06-15 09:24:11 +02:00
Miriam Baglioni	6ebc236657	added needed property: outputPath	2021-06-15 09:23:24 +02:00
Miriam Baglioni	f7379255b6	changed the workflow to extract info from the dump	2021-06-15 09:22:54 +02:00
Miriam Baglioni	d6e21bb6ea	creates the crossref dataset used for doiboost together with unpacking part from tar	2021-06-14 17:27:19 +02:00
Miriam Baglioni	ce0cfd79e0	creates the crossref dataset used for doiboost	2021-06-14 13:40:19 +02:00
Miriam Baglioni	93efe4de82	split the construction of crossref dataset in two parts. This one just unpacks the tar entries	2021-06-14 13:39:40 +02:00
Miriam Baglioni	8873e6b6d1	workflow and parameter	2021-06-14 10:15:57 +02:00
Miriam Baglioni	0f1acdf6b6	workflow and parameter	2021-06-14 10:08:55 +02:00
Sandro La Bruzzo	efbea1e01a	minor fix	2021-06-14 09:45:14 +02:00
Miriam Baglioni	75780fc636	extraction of the tar for the dump of crossref, and creation of the dataset	2021-06-14 09:45:07 +02:00
Miriam Baglioni	8d2e086e48	changes to avoid reassignment to val	2021-06-07 17:50:37 +02:00
Miriam Baglioni	f33521d338	Update 'dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcid/SparkConvertORCIDToOAF.scala': to be able to replace the object assigned to author, val has been replaced by var	2021-06-07 17:27:07 +02:00
Miriam Baglioni	bc12e9819e	Update 'dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcid/SparkConvertORCIDToOAF.scala': the change fixes the issue that arises when the same work appears more than once on the same ORCID profile. It avoids replicating the doi -> author association when the orcid id is already associated with the doi.	2021-06-07 16:37:01 +02:00
Claudio Atzori	5e4b91d9ef	more pervasive use of constants from ModelConstants, especially for ORCID	2021-05-26 18:20:23 +02:00
Claudio Atzori	c4a23c2f4d	fix: preserving the old identifier among the originalIds in the doiboost construction process, trying to avoid UnsupportedOperationException while adding elements to the originalIds	2021-05-19 16:01:52 +02:00
Claudio Atzori	ba03f549d7	fix: preserving the old identifier among the originalIds in the doiboost construction process	2021-05-19 15:43:26 +02:00
Claudio Atzori	2cbf15f4fb	using ModelConstants	2021-05-17 09:54:45 +02:00
Claudio Atzori	f19feceaf0	set the old identifier before switching to the new one	2021-05-14 12:53:40 +02:00
Claudio Atzori	1bd70fa2c6	preserving the old identifier among the originalIds in the doiboost construction process	2021-05-14 11:30:41 +02:00
Claudio Atzori	ca3f3a7687	using ModelConstants	2021-05-14 11:29:49 +02:00
Claudio Atzori	23b8883ab1	applied intellij code cleanup	2021-05-14 10:58:12 +02:00
Enrico Ottonello	c537986b7c	deleted folders with merged data immediately before merge phases	2021-04-28 11:25:25 +02:00
Claudio Atzori	5afa7d3e0c	core utilities in dhp-common moved in external module dhp-schemas	2021-04-27 15:44:01 +02:00
Claudio Atzori	27ab8a704d	adjusted poms to align with the external dhp-schema module	2021-04-27 10:12:27 +02:00
Claudio Atzori	c2bb03c8b5	depending on external dhp-schemas module	2021-04-23 17:57:35 +02:00
Claudio Atzori	e5abbec2ba	[orcid] download of the lambda file defined in a script	2021-04-22 11:22:10 +02:00
Claudio Atzori	55964cbd81	[orcid] large oozie workflow cleanup; updated workflow for the orcidnodoi actionset creation	2021-04-22 10:18:09 +02:00
Claudio Atzori	52244f813a	merging from enrico.ottonello/dnet-hadoop:orcid-no-doi	2021-04-21 12:24:09 +02:00
Sandro La Bruzzo	a16e5299f9	applied unique function on the final dataset	2021-04-16 17:36:48 +02:00
Enrico Ottonello	27068aacd1	wf to move orcid-no-doi dataset to the folder ready for the import	2021-04-16 17:17:47 +02:00
Sandro La Bruzzo	67085da305	fixed NPE	2021-04-16 11:05:58 +02:00
Sandro La Bruzzo	7d6a80e2f2	added new type on MAG mapping	2021-04-16 09:14:15 +02:00
Sandro La Bruzzo	3f77bfceb0	fixed test failure on jenkins	2021-04-14 10:03:01 +02:00
Sandro La Bruzzo	479abd10cb	Add into ORCID workflow a method that extracts orcid directly to the dump generated by Enrico	2021-04-13 17:47:43 +02:00
Claudio Atzori	e686b8de8d	[ORCID-no-doi] integrating PR#98	2021-04-01 17:11:03 +02:00
Claudio Atzori	ee34cc51c3	[ORCID-no-doi] integrating PR#98	2021-04-01 17:07:49 +02:00
Claudio Atzori	7941d7be29	WIP: using common definitions from ModelConstants	2021-03-31 18:33:57 +02:00
Enrico Ottonello	59ec5137e1	improvement related to https://issue.openaire.research-infrastructures.eu/issues/6501	2021-03-31 16:25:41 +02:00
Sandro La Bruzzo	616d2ecce2	split the workflow collecting datacite into two workflows. Released on beta	2021-03-31 15:45:58 +02:00
Sandro La Bruzzo	1dfda3624e	improved workflow importing datacite	2021-03-26 13:56:29 +01:00
Enrico Ottonello	ebd67b8c8f	removed duplicates orcid data on authors set	2021-03-25 11:20:52 +01:00
Sandro La Bruzzo	625e4c29c4	added model constants	2021-03-23 09:39:56 +01:00
Sandro La Bruzzo	c392936b97	fixed error on best access right	2021-03-23 09:23:22 +01:00
Sandro La Bruzzo	c73072079d	fix conflicts	2021-03-22 16:36:31 +01:00
Sandro La Bruzzo	098914dcff	fix wrong relation with source null	2021-03-22 11:35:02 +01:00
Sandro La Bruzzo	25d5663d97	added filter	2021-03-18 10:24:42 +01:00
Sandro La Bruzzo	5f98ea74a9	Added fix for pid generation in stableIds	2021-03-17 15:53:24 +01:00
Sandro La Bruzzo	cc5bbafa5d	some fixes to make workflows run	2021-03-17 12:12:56 +01:00
Sandro La Bruzzo	4bb3bcafa5	add author sequence number	2021-03-11 11:32:32 +01:00
Sandro La Bruzzo	a8e5d0ea0d	updated test and fixed assign of access right	2021-03-11 10:41:24 +01:00
Sandro La Bruzzo	f5e7c57654	Fixed ticket 6282	2021-03-11 10:32:45 +01:00
Claudio Atzori	d525785497	[#6282 open access status in the Graph] Result.Instance.accessRight defined with dedicated data type that includes the open access color.	2021-03-09 11:12:55 +01:00
Sandro La Bruzzo	a2169ccf07	// implemented Ticket #6281 added pid to Instance in doiBoost	2021-03-09 10:46:36 +01:00
Claudio Atzori	8d2bb24512	merged from master	2021-03-08 15:44:34 +01:00
Enrico Ottonello	70cb100647	added updating last orcid dataset folders after completion	2021-03-01 10:17:04 +01:00
Enrico Ottonello	bd3b16402b	added result typologies	2021-03-01 10:16:02 +01:00
Enrico Ottonello	53d7023460	dateOfCollection taken from orcid last_update.txt on hdfs; cleaned wf parameters	2021-02-25 18:43:29 +01:00
Enrico Ottonello	d43ea88caf	aligned orcid result typologies with openaire vocabulary	2021-02-25 15:02:10 +01:00
Enrico Ottonello	975823b968	data from last updated orcid	2021-02-23 15:35:04 +01:00
Enrico Ottonello	ee4ba7298b	fix last update read/write from file on hdfs	2021-02-09 23:24:57 +01:00
Claudio Atzori	72c57b28fa	switched project version to 1.2.4-branch_hadoop_aggregator-SNAPSHOT	2021-02-04 14:08:18 +01:00
Enrico Ottonello	c238561001	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi	2021-02-04 10:44:21 +01:00
Enrico Ottonello	465ce39f75	job execution now based on file last_update.txt on hdfs	2021-02-04 10:44:04 +01:00
Sandro La Bruzzo	99cf3a8ea4	Merged Datacite transform into this branch	2021-01-28 16:34:46 +01:00
Claudio Atzori	ab2fe9266a	[DOIBoost] minor fixes in workflow definition	2021-01-05 10:26:39 +01:00
Claudio Atzori	7c722f3fdc	[DOIBoost] fixed typo	2021-01-05 10:25:54 +01:00
Claudio Atzori	8879704ba0	[DOIBoost] configurable ES server url and index name in crossref importer	2021-01-05 10:00:13 +01:00
Sandro La Bruzzo	7834a35768	avoid saving the intermediate dataset before generation of the Sequence file	2021-01-04 17:54:57 +01:00
Sandro La Bruzzo	e79445a8b4	minor fix for claudio's complaint	2021-01-04 17:39:25 +01:00
Sandro La Bruzzo	8765020b85	minor fix	2021-01-04 17:37:08 +01:00
Sandro La Bruzzo	b0dc92786f	defined a single oozie workflow for the generation of doiboost	2021-01-04 17:01:35 +01:00
Claudio Atzori	28460c2cd1	using com.fasterxml.jackson.databind.ObjectMapper instead of org.codehaus.jackson.map.ObjectMapper	2020-12-23 16:59:52 +01:00
Sandro La Bruzzo	1f6c8a9e83	added orcid_pending type to records coming from Crossref	2020-12-15 11:47:15 +01:00
Enrico Ottonello	b2de598c1a	all actions from download lambda file to merge updated data into one wf	2020-12-15 10:42:55 +01:00
Enrico Ottonello	efe4c2a9c5	authors and works are now updated in two separate spark actions of the wf	2020-12-12 02:06:21 +01:00
Enrico Ottonello	858efbfad1	fix dataset creation for downloaded works	2020-12-11 16:49:54 +01:00
Claudio Atzori	d9532446eb	imported more diffs from master branch; code formatting	2020-12-10 16:14:16 +01:00
Claudio Atzori	12e2f930c8	resolved conflicts	2020-12-10 10:57:39 +01:00
Enrico Ottonello	2233750a37	original orcid xml data are stored in a field of the class that models orcid data	2020-12-09 09:45:19 +01:00
Sandro La Bruzzo	302baab67b	fixed doiboost mapping and workflows	2020-12-07 19:59:33 +01:00
Enrico Ottonello	5c65e602d3	wf doi_authors generates one json record for each row	2020-12-07 15:28:10 +01:00
Enrico Ottonello	fa1855a4b8	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi	2020-12-07 11:02:59 +01:00
Enrico Ottonello	b1b589ada1	wf to generate orcid dataset	2020-12-07 11:02:32 +01:00
Sandro La Bruzzo	b31dd126fb	fixed crossref workflow added common ORCID Class	2020-12-07 10:42:38 +01:00
Enrico Ottonello	8812ab65e1	completed download function to wf; added accumulators	2020-12-04 21:13:49 +01:00
Enrico Ottonello	53b22c1937	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi	2020-12-02 23:21:27 +01:00
Enrico Ottonello	1b1e9ea67c	wf to generate doi_author_list for doiboost; wf to download updated works	2020-12-02 23:20:16 +01:00
Sandro La Bruzzo	7da679542f	fixed wrong projectId	2020-12-02 14:28:09 +01:00
Sandro La Bruzzo	6ba8037cc7	fixed failure to test due to changing of input	2020-12-02 11:34:46 +01:00
Claudio Atzori	cfb55effd9	code formatting	2020-12-02 11:23:49 +01:00
Enrico Ottonello	f2df3ead74	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi	2020-11-30 14:22:46 +01:00
Enrico Ottonello	40c4559e92	added datainfo on authors pid with "sysimport:crosswalk:entityregistry",	2020-11-30 14:19:22 +01:00
Claudio Atzori	a104d2b6ad	cleanup	2020-11-26 11:12:00 +01:00
Claudio Atzori	db0181b8af	Merge pull request 'added bidirectionality to relations from project and result coming from crossref' (#60 ) from miriam.baglioni/dnet-hadoop:sxBidirectionality into master	2020-11-25 17:17:40 +01:00
Sandro La Bruzzo	ec3e238de6	Fixed problem on duplicated identifier	2020-11-25 17:15:54 +01:00
Sandro La Bruzzo	264723ffd8	updated stuff for zenodo upload	2020-11-25 11:56:07 +01:00
Enrico Ottonello	99a086f0c6	max concurrent executors set to 10, according to ORCID Director of Technology mail request	2020-11-24 17:49:32 +01:00
Miriam Baglioni	00874a8ce6	added bidirectionality to relations from project and result	2020-11-24 15:17:23 +01:00
Enrico Ottonello	5c17e768b2	set wf configuration with spark.dynamicAllocation.maxExecutors 20 over 20 input partitions	2020-11-23 16:01:23 +01:00
Enrico Ottonello	97c8111847	action to convert lambda file in seq file; spark action to download updated authors	2020-11-23 09:49:22 +01:00
Enrico Ottonello	c0c2e05eae	added wf to extract authors and works xml data from orcid dump to hdfs; added wf to download the lambda file (containing last orcid update information) from orcid to hdfs	2020-11-17 18:23:12 +01:00
Enrico Ottonello	005f849674	added compression to output dataset	2020-11-13 12:45:31 +01:00
Enrico Ottonello	9a2fa9dc2f	added test for other names parsing from summaries dump	2020-11-13 10:25:34 +01:00
Enrico Ottonello	13f28fa225	moved AuthorData to dhp-schemas; added other names to author data	2020-11-12 17:43:32 +01:00
Claudio Atzori	9b0fb9e958	merged from master	2020-11-12 09:27:12 +01:00
Enrico Ottonello	1f861f2b0d	now wf output is a sequence file with the format seq("eu.dnetlib.dhp.schema.oaf.Publication",eu.dnetlib.dhp.schema.action.AtomicActions)	2020-11-11 17:38:50 +01:00
Enrico Ottonello	fea2451658	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi	2020-11-10 11:49:43 +01:00
Enrico Ottonello	1513174d7e	added further test case	2020-11-10 11:44:55 +01:00
Sandro La Bruzzo	8e1d43aab2	Implemented ID generation using IdentifierRecordFactory on DOIBoost	2020-11-09 11:53:55 +01:00
Sandro La Bruzzo	cd27df91a1	fixed bug on missing relation in ANDS	2020-11-06 17:12:31 +01:00
Enrico Ottonello	6bc7dbeca7	first version of dataset successfully generated from orcid dump 2020	2020-11-06 13:47:50 +01:00
Sandro La Bruzzo	39337d8a8a	fixed test	2020-11-02 09:26:25 +01:00
Enrico Ottonello	9818e74a70	added dependency version in main pom.xml for orcid no doi	2020-10-22 16:38:00 +02:00
Enrico Ottonello	210a50e4f4	replaced null value	2020-10-22 16:24:42 +02:00
Enrico Ottonello	b0290dbcb7	moved all dependencies version to main pom.xml	2020-10-22 16:20:46 +02:00
Enrico Ottonello	a38ab57062	let run test methods	2020-10-22 15:43:50 +02:00
Enrico Ottonello	1139d6568d	replaced null value with a safer empty string as return value	2020-10-22 15:32:26 +02:00
Enrico Ottonello	c58db1c8ea	added filter on null value after map function	2020-10-22 15:11:02 +02:00
Enrico Ottonello	846ba30873	if typologies mapping fails, an exception will be propagated	2020-10-22 14:36:18 +02:00
Enrico Ottonello	c3114ba0ae	replaced null as return value with a safer empty string	2020-10-22 14:21:31 +02:00
Enrico Ottonello	c295c71ca0	added comment	2020-10-22 14:07:26 +02:00
Enrico Ottonello	ab083f9946	propagate exception on parsing work (PR request)	2020-10-22 14:02:32 +02:00

349 Commits