dnet-hadoop

Author	SHA1	Message	Date
Miriam Baglioni	a74de1cda2	added normalization step to the doi	2021-06-29 18:51:11 +02:00
Miriam Baglioni	06074ea7d3	added normalization step to the doi	2021-06-29 18:46:08 +02:00
Miriam Baglioni	8b8ffe82dc	added step of normalization for the doi	2021-06-29 18:41:39 +02:00
Miriam Baglioni	50cc21d92e	Added method to normalize doi values (lower case, remove all text preceding 10., filtering out doi not starting with 10.)	2021-06-29 18:35:28 +02:00
Sandro La Bruzzo	80e15cc455	implemented mapping from uniprot, pdb and ebi links	2021-06-24 17:20:00 +02:00
Sandro La Bruzzo	a167543637	Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_id_scholexplorer	2021-06-21 09:14:11 +02:00
Miriam Baglioni	13c96622c9	-	2021-06-18 09:45:16 +02:00
Miriam Baglioni	b486ae498f	added test and test resource to verify the generation of the date of acceptance from the input extracted from the dump	2021-06-18 09:43:32 +02:00
Miriam Baglioni	464c2ddde3	changed to split in two steps the generation of the crossref dataset	2021-06-18 09:42:31 +02:00
Miriam Baglioni	6aca0d8ebb	added kryo encoding for input files	2021-06-18 09:42:07 +02:00
Miriam Baglioni	3585e53da3	changed to split in two steps the generation of the crossref dataset	2021-06-18 09:41:23 +02:00
Sandro La Bruzzo	3100166d29	Merge remote-tracking branch 'origin/stable_ids' into stable_id_scholexplorer	2021-06-16 16:22:16 +02:00
Miriam Baglioni	95885bcf12	forces executor memory and driver memory to be 7G (trying to avoid OOM)	2021-06-16 10:17:52 +02:00
Miriam Baglioni	2550a73981	-	2021-06-16 10:04:41 +02:00
Miriam Baglioni	1c47c0d786	modified the number of executors trying to avoid OOM exception	2021-06-15 21:05:39 +02:00
Miriam Baglioni	7deac55138	added one option for resume from in the wf	2021-06-15 18:38:20 +02:00
Miriam Baglioni	66e7ef892f	changed the parameter name	2021-06-15 11:08:54 +02:00
Miriam Baglioni	4f47ad0891	no need to rename the folders, just write in overwrite mode, so I changed the name of the output folder	2021-06-15 09:28:31 +02:00
Miriam Baglioni	9f9dd00b94	refactoring	2021-06-15 09:24:46 +02:00
Miriam Baglioni	63d74ee379	refactoring	2021-06-15 09:24:11 +02:00
Miriam Baglioni	6ebc236657	added needed property: outputPath	2021-06-15 09:23:24 +02:00
Miriam Baglioni	f7379255b6	changed the workflow to extract info from the dump	2021-06-15 09:22:54 +02:00
Miriam Baglioni	d6e21bb6ea	creates the crossref dataset used for doiboost together with unpacking part from tar	2021-06-14 17:27:19 +02:00
Miriam Baglioni	ce0cfd79e0	creates the crossref dataset used for doiboost	2021-06-14 13:40:19 +02:00
Miriam Baglioni	93efe4de82	split the construction of crossref dataset in two parts. This one just unpacks the tar entries	2021-06-14 13:39:40 +02:00
Miriam Baglioni	8873e6b6d1	workflow and parameter	2021-06-14 10:15:57 +02:00
Miriam Baglioni	0f1acdf6b6	workflow and parameter	2021-06-14 10:08:55 +02:00
Sandro La Bruzzo	efbea1e01a	minor fix	2021-06-14 09:45:14 +02:00
Miriam Baglioni	75780fc636	extraction of the tar for the dump of crossref, and creation of the dataset	2021-06-14 09:45:07 +02:00
Miriam Baglioni	8d2e086e48	changes to avoid reassignment to val	2021-06-07 17:50:37 +02:00
Miriam Baglioni	f33521d338	Update 'dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcid/SparkConvertORCIDToOAF.scala': to be able to replace the object assigned to author, val has been replaced by var	2021-06-07 17:27:07 +02:00
Miriam Baglioni	bc12e9819e	Update 'dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcid/SparkConvertORCIDToOAF.scala': the change fixes the issue that arises when the same work appears more than once on the same ORCID profile. It avoids replicating the association doi -> author when the orcid id is already associated to the doi.	2021-06-07 16:37:01 +02:00
Claudio Atzori	5e4b91d9ef	more pervasive use of constants from ModelConstants, especially for ORCID	2021-05-26 18:20:23 +02:00
Claudio Atzori	c4a23c2f4d	fix: preserving the old identifier among the originalIds in the doiboost construction process, trying to avoid UnsupportedOperationException while adding elements to the originalIds	2021-05-19 16:01:52 +02:00
Claudio Atzori	ba03f549d7	fix: preserving the old identifier among the originalIds in the doiboost construction process	2021-05-19 15:43:26 +02:00
Claudio Atzori	2cbf15f4fb	using ModelConstants	2021-05-17 09:54:45 +02:00
Claudio Atzori	f19feceaf0	set the old identifier before switching to the new one	2021-05-14 12:53:40 +02:00
Claudio Atzori	1bd70fa2c6	preserving the old identifier among the originalIds in the doiboost construction process	2021-05-14 11:30:41 +02:00
Claudio Atzori	ca3f3a7687	using ModelConstants	2021-05-14 11:29:49 +02:00
Claudio Atzori	23b8883ab1	applied intellij code cleanup	2021-05-14 10:58:12 +02:00
Enrico Ottonello	c537986b7c	deleted folders with merged data immediately before merge phases	2021-04-28 11:25:25 +02:00
Claudio Atzori	5afa7d3e0c	core utilities in dhp-common moved in external module dhp-schemas	2021-04-27 15:44:01 +02:00
Claudio Atzori	27ab8a704d	adjusted poms to align with the external dhp-schema module	2021-04-27 10:12:27 +02:00
Claudio Atzori	c2bb03c8b5	depending on external dhp-schemas module	2021-04-23 17:57:35 +02:00
Claudio Atzori	e5abbec2ba	[orcid] download of the lambda file defined in a script	2021-04-22 11:22:10 +02:00
Claudio Atzori	55964cbd81	[orcid] large oozie workflow cleanup; updated workflow for the orcidnodoi actionset creation	2021-04-22 10:18:09 +02:00
Claudio Atzori	52244f813a	merging from enrico.ottonello/dnet-hadoop:orcid-no-doi	2021-04-21 12:24:09 +02:00
Sandro La Bruzzo	a16e5299f9	applied unique function on the final dataset	2021-04-16 17:36:48 +02:00
Enrico Ottonello	27068aacd1	wf to move orcid-no-doi dataset to the folder ready for the import	2021-04-16 17:17:47 +02:00
Sandro La Bruzzo	67085da305	fixed NPE	2021-04-16 11:05:58 +02:00
Sandro La Bruzzo	7d6a80e2f2	added new type on MAG mapping	2021-04-16 09:14:15 +02:00
Sandro La Bruzzo	3f77bfceb0	fixed test failure on jenkins	2021-04-14 10:03:01 +02:00
Sandro La Bruzzo	479abd10cb	Added to the ORCID workflow a method that extracts orcid directly from the dump generated by Enrico	2021-04-13 17:47:43 +02:00
Claudio Atzori	e686b8de8d	[ORCID-no-doi] integrating PR#98 D-Net/dnet-hadoop#98	2021-04-01 17:11:03 +02:00
Claudio Atzori	ee34cc51c3	[ORCID-no-doi] integrating PR#98 D-Net/dnet-hadoop#98	2021-04-01 17:07:49 +02:00
Claudio Atzori	7941d7be29	WIP: using common definitions from ModelConstants	2021-03-31 18:33:57 +02:00
Enrico Ottonello	59ec5137e1	improvement related to https://issue.openaire.research-infrastructures.eu/issues/6501	2021-03-31 16:25:41 +02:00
Sandro La Bruzzo	616d2ecce2	split workflow collecting datacite into two workflows. Released on beta	2021-03-31 15:45:58 +02:00
Sandro La Bruzzo	1dfda3624e	improved workflow importing datacite	2021-03-26 13:56:29 +01:00
Enrico Ottonello	ebd67b8c8f	removed duplicate orcid data from authors set	2021-03-25 11:20:52 +01:00
Sandro La Bruzzo	625e4c29c4	added model constants	2021-03-23 09:39:56 +01:00
Sandro La Bruzzo	c392936b97	fixed error on best access right	2021-03-23 09:23:22 +01:00
Sandro La Bruzzo	c73072079d	fix conflicts	2021-03-22 16:36:31 +01:00
Sandro La Bruzzo	098914dcff	fix wrong relation with source null	2021-03-22 11:35:02 +01:00
Sandro La Bruzzo	25d5663d97	added filter	2021-03-18 10:24:42 +01:00
Sandro La Bruzzo	5f98ea74a9	Added fix for pid generation in stableIds	2021-03-17 15:53:24 +01:00
Sandro La Bruzzo	cc5bbafa5d	some fixes to make workflows run	2021-03-17 12:12:56 +01:00
Sandro La Bruzzo	4bb3bcafa5	add author sequence number	2021-03-11 11:32:32 +01:00
Sandro La Bruzzo	a8e5d0ea0d	updated test and fixed assign of access right	2021-03-11 10:41:24 +01:00
Sandro La Bruzzo	f5e7c57654	Fixed ticket 6282	2021-03-11 10:32:45 +01:00
Claudio Atzori	d525785497	[#6282 open access status in the Graph] Result.Instance.accessRight defined with dedicated data type that includes the open access color.	2021-03-09 11:12:55 +01:00
Sandro La Bruzzo	a2169ccf07	// implemented Ticket #6281 added pid to Instance in doiBoost	2021-03-09 10:46:36 +01:00
Claudio Atzori	8d2bb24512	merged from master	2021-03-08 15:44:34 +01:00
Enrico Ottonello	70cb100647	added updating last orcid dataset folders after completion	2021-03-01 10:17:04 +01:00
Enrico Ottonello	bd3b16402b	added result typologies	2021-03-01 10:16:02 +01:00
Enrico Ottonello	53d7023460	dateOfCollection taken from orcid last_update.txt on hdfs; cleaned wf parameters	2021-02-25 18:43:29 +01:00
Enrico Ottonello	d43ea88caf	aligned orcid result typologies with openaire vocabulary	2021-02-25 15:02:10 +01:00
Enrico Ottonello	975823b968	data from last updated orcid	2021-02-23 15:35:04 +01:00
Enrico Ottonello	ee4ba7298b	fix last update read/write from file on hdfs	2021-02-09 23:24:57 +01:00
Claudio Atzori	72c57b28fa	switched project version to 1.2.4-branch_hadoop_aggregator-SNAPSHOT	2021-02-04 14:08:18 +01:00
Enrico Ottonello	c238561001	Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi	2021-02-04 10:44:21 +01:00
Enrico Ottonello	465ce39f75	job execution now based on file last_update.txt on hdfs	2021-02-04 10:44:04 +01:00
Sandro La Bruzzo	99cf3a8ea4	Merged Datacite transform into this branch	2021-01-28 16:34:46 +01:00
Claudio Atzori	ab2fe9266a	[DOIBoost] minor fixes in workflow definition	2021-01-05 10:26:39 +01:00
Claudio Atzori	7c722f3fdc	[DOIBoost] fixed typo	2021-01-05 10:25:54 +01:00
Claudio Atzori	8879704ba0	[DOIBoost] configurable ES server url and index name in crossref importer	2021-01-05 10:00:13 +01:00
Sandro La Bruzzo	7834a35768	avoid saving intermediate dataset before generation of Sequence file	2021-01-04 17:54:57 +01:00
Sandro La Bruzzo	e79445a8b4	minor fix for Claudio's objection	2021-01-04 17:39:25 +01:00
Sandro La Bruzzo	8765020b85	minor fix	2021-01-04 17:37:08 +01:00
Sandro La Bruzzo	b0dc92786f	defined a single oozie workflow for the generation of doiboost	2021-01-04 17:01:35 +01:00
Claudio Atzori	28460c2cd1	using com.fasterxml.jackson.databind.ObjectMapper instead of org.codehaus.jackson.map.ObjectMapper	2020-12-23 16:59:52 +01:00
Sandro La Bruzzo	1f6c8a9e83	added orcid_pending type to records coming from Crossref	2020-12-15 11:47:15 +01:00
Enrico Ottonello	b2de598c1a	all actions from download lambda file to merge updated data into one wf	2020-12-15 10:42:55 +01:00
Enrico Ottonello	efe4c2a9c5	authors and works are now updated in two separate spark actions of the wf	2020-12-12 02:06:21 +01:00
Enrico Ottonello	858efbfad1	fix dataset creation for downloaded works	2020-12-11 16:49:54 +01:00
Claudio Atzori	d9532446eb	imported more diffs from master branch; code formatting	2020-12-10 16:14:16 +01:00
Claudio Atzori	12e2f930c8	resolved conflicts	2020-12-10 10:57:39 +01:00
Enrico Ottonello	2233750a37	original orcid xml data are stored in a field of the class that models orcid data	2020-12-09 09:45:19 +01:00
Sandro La Bruzzo	302baab67b	fixed doiboost mapping and workflows	2020-12-07 19:59:33 +01:00
Enrico Ottonello	5c65e602d3	wf doi_authors generates one json record for each row	2020-12-07 15:28:10 +01:00

290 Commits