dnet-hadoop

Author	SHA1	Message	Date
Miriam Baglioni	cf758f4f91	added normalization step for the doi	2021-06-30 10:03:15 +02:00
Miriam Baglioni	801763a0fa	there is no longer any need to lower-case the doi since it is done in the first step. Also changed the creation of the id to use the factory	2021-06-29 19:07:23 +02:00
Miriam Baglioni	a74de1cda2	added normalization step to the doi	2021-06-29 18:51:11 +02:00
Miriam Baglioni	06074ea7d3	added normalization step to the doi	2021-06-29 18:46:08 +02:00
Miriam Baglioni	8b8ffe82dc	added step of normalization for the doi	2021-06-29 18:41:39 +02:00
Miriam Baglioni	50cc21d92e	Added method to normalize doi values (lower case, remove all preceding 10., filtering out doi not starting with 10.)	2021-06-29 18:35:28 +02:00
Miriam Baglioni	13c96622c9	-	2021-06-18 09:45:16 +02:00
Miriam Baglioni	b486ae498f	added test and test resource to verify the generation of the date of acceptance from the input extracted from the dump	2021-06-18 09:43:32 +02:00
Miriam Baglioni	464c2ddde3	changed to split in two steps the generation of the crossref dataset	2021-06-18 09:42:31 +02:00
Miriam Baglioni	6aca0d8ebb	added kryo encoding for input files	2021-06-18 09:42:07 +02:00
Miriam Baglioni	3585e53da3	changed to split in two steps the generation of the crossref dataset	2021-06-18 09:41:23 +02:00
Miriam Baglioni	95885bcf12	forces executor memory and driver memory to be 7G (trying to avoid OOM)	2021-06-16 10:17:52 +02:00
Miriam Baglioni	2550a73981	-	2021-06-16 10:04:41 +02:00
Miriam Baglioni	1c47c0d786	modified the number of executors trying to avoid OOM exception	2021-06-15 21:05:39 +02:00
Miriam Baglioni	7deac55138	added one option for 'resume from' in the wf	2021-06-15 18:38:20 +02:00
Miriam Baglioni	66e7ef892f	changed the parameter name	2021-06-15 11:08:54 +02:00
Miriam Baglioni	4f47ad0891	no need to rename the folders, just write in overwrite mode, so I changed the name of the output folder	2021-06-15 09:28:31 +02:00
Miriam Baglioni	9f9dd00b94	refactoring	2021-06-15 09:24:46 +02:00
Miriam Baglioni	63d74ee379	refactoring	2021-06-15 09:24:11 +02:00
Miriam Baglioni	6ebc236657	added needed property: outputPath	2021-06-15 09:23:24 +02:00
Miriam Baglioni	f7379255b6	changed the workflow to extract info from the dump	2021-06-15 09:22:54 +02:00
Miriam Baglioni	d6e21bb6ea	creates the crossref dataset used for doiboost together with unpacking part from tar	2021-06-14 17:27:19 +02:00
Miriam Baglioni	ce0cfd79e0	creates the crossref dataset used for doiboost	2021-06-14 13:40:19 +02:00
Miriam Baglioni	93efe4de82	split the construction of crossref dataset in two parts. This one just unpacks the tar entries	2021-06-14 13:39:40 +02:00
Miriam Baglioni	8873e6b6d1	workflow and parameter	2021-06-14 10:15:57 +02:00
Miriam Baglioni	0f1acdf6b6	workflow and parameter	2021-06-14 10:08:55 +02:00
Miriam Baglioni	75780fc636	extraction of the tar for the dump of crossref, and creation of the dataset	2021-06-14 09:45:07 +02:00
Miriam Baglioni	8d2e086e48	changes to avoid reassignment to val	2021-06-07 17:50:37 +02:00
Miriam Baglioni	f33521d338	Update 'dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcid/SparkConvertORCIDToOAF.scala': to be able to replace the object assigned to author, val has been replaced by var	2021-06-07 17:27:07 +02:00
Miriam Baglioni	bc12e9819e	Update 'dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcid/SparkConvertORCIDToOAF.scala'. The change fixes the issue that arises when the same work appears more than once on the same ORCID profile. It avoids replicating the association doi -> author when the orcid id is already associated with the doi.	2021-06-07 16:37:01 +02:00
Claudio Atzori	5e4b91d9ef	more pervasive use of constants from ModelConstants, especially for ORCID	2021-05-26 18:20:23 +02:00
Claudio Atzori	c4a23c2f4d	fix: preserving the old identifier among the originalIds in the doiboost construction process, trying to avoid UnsupportedOperationException while adding elements to the originalIds	2021-05-19 16:01:52 +02:00
Claudio Atzori	ba03f549d7	fix: preserving the old identifier among the originalIds in the doiboost construction process	2021-05-19 15:43:26 +02:00
Claudio Atzori	2cbf15f4fb	using ModelConstants	2021-05-17 09:54:45 +02:00
Claudio Atzori	f19feceaf0	set the old identifier before switching to the new one	2021-05-14 12:53:40 +02:00
Claudio Atzori	1bd70fa2c6	preserving the old identifier among the originalIds in the doiboost construction process	2021-05-14 11:30:41 +02:00
Claudio Atzori	ca3f3a7687	using ModelConstants	2021-05-14 11:29:49 +02:00
Claudio Atzori	23b8883ab1	applied intellij code cleanup	2021-05-14 10:58:12 +02:00
Enrico Ottonello	c537986b7c	deleted folders with merged data immediately before merge phases	2021-04-28 11:25:25 +02:00
Claudio Atzori	5afa7d3e0c	core utilities in dhp-common moved in external module dhp-schemas	2021-04-27 15:44:01 +02:00
Claudio Atzori	27ab8a704d	adjusted poms to align with the external dhp-schema module	2021-04-27 10:12:27 +02:00
Claudio Atzori	c2bb03c8b5	depending on external dhp-schemas module	2021-04-23 17:57:35 +02:00
Claudio Atzori	e5abbec2ba	[orcid] download of the lambda file defined in a script	2021-04-22 11:22:10 +02:00
Claudio Atzori	55964cbd81	[orcid] large oozie workflow cleanup; updated workflow for the orcidnodoi actionset creation	2021-04-22 10:18:09 +02:00
Claudio Atzori	52244f813a	merging from enrico.ottonello/dnet-hadoop:orcid-no-doi	2021-04-21 12:24:09 +02:00
Sandro La Bruzzo	a16e5299f9	applied unique function on the final dataset	2021-04-16 17:36:48 +02:00
Enrico Ottonello	27068aacd1	wf to move the orcid-no-doi dataset to the folder ready for the import	2021-04-16 17:17:47 +02:00
Sandro La Bruzzo	67085da305	fixed NPE	2021-04-16 11:05:58 +02:00
Sandro La Bruzzo	7d6a80e2f2	added new type on MAG mapping	2021-04-16 09:14:15 +02:00
Sandro La Bruzzo	3f77bfceb0	fixed test failure on jenkins	2021-04-14 10:03:01 +02:00

238 Commits