Miriam Baglioni
50cc21d92e
Added method to normalize doi values (lower case, remove everything preceding 10., filter out doi not starting with 10.)
2021-06-29 18:35:28 +02:00
Sandro La Bruzzo
80e15cc455
implemented mapping from uniprot, pdb and ebi links
2021-06-24 17:20:00 +02:00
Sandro La Bruzzo
a167543637
Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_id_scholexplorer
2021-06-21 09:14:11 +02:00
Miriam Baglioni
13c96622c9
-
2021-06-18 09:45:16 +02:00
Miriam Baglioni
b486ae498f
added test and test resource to verify the generation of the date of acceptance from the input extracted from the dump
2021-06-18 09:43:32 +02:00
Miriam Baglioni
464c2ddde3
split the generation of the crossref dataset into two steps
2021-06-18 09:42:31 +02:00
Miriam Baglioni
6aca0d8ebb
added kryo encoding for input files
2021-06-18 09:42:07 +02:00
Miriam Baglioni
3585e53da3
split the generation of the crossref dataset into two steps
2021-06-18 09:41:23 +02:00
Sandro La Bruzzo
3100166d29
Merge remote-tracking branch 'origin/stable_ids' into stable_id_scholexplorer
2021-06-16 16:22:16 +02:00
Miriam Baglioni
95885bcf12
forces executor memory and driver memory to be 7G (trying to avoid OOM)
2021-06-16 10:17:52 +02:00
Miriam Baglioni
2550a73981
-
2021-06-16 10:04:41 +02:00
Miriam Baglioni
1c47c0d786
modified the number of executors trying to avoid OOM exception
2021-06-15 21:05:39 +02:00
Miriam Baglioni
7deac55138
added a "resume from" option to the wf
2021-06-15 18:38:20 +02:00
Miriam Baglioni
66e7ef892f
changed the parameter name
2021-06-15 11:08:54 +02:00
Miriam Baglioni
4f47ad0891
no need to rename the folders, just write in overwrite mode, so I changed the name of the output folder
2021-06-15 09:28:31 +02:00
Miriam Baglioni
9f9dd00b94
refactoring
2021-06-15 09:24:46 +02:00
Miriam Baglioni
63d74ee379
refactoring
2021-06-15 09:24:11 +02:00
Miriam Baglioni
6ebc236657
added needed property: outputPath
2021-06-15 09:23:24 +02:00
Miriam Baglioni
f7379255b6
changed the workflow to extract info from the dump
2021-06-15 09:22:54 +02:00
Miriam Baglioni
d6e21bb6ea
creates the crossref dataset used for doiboost together with unpacking part from tar
2021-06-14 17:27:19 +02:00
Miriam Baglioni
ce0cfd79e0
creates the crossref dataset used for doiboost
2021-06-14 13:40:19 +02:00
Miriam Baglioni
93efe4de82
split the construction of the crossref dataset into two parts. This one just unpacks the tar entries
2021-06-14 13:39:40 +02:00
Miriam Baglioni
8873e6b6d1
workflow and parameter
2021-06-14 10:15:57 +02:00
Miriam Baglioni
0f1acdf6b6
workflow and parameter
2021-06-14 10:08:55 +02:00
Sandro La Bruzzo
efbea1e01a
minor fix
2021-06-14 09:45:14 +02:00
Miriam Baglioni
75780fc636
extraction of the tar for the dump of crossref, and creation of the dataset
2021-06-14 09:45:07 +02:00
Miriam Baglioni
8d2e086e48
changes to avoid reassignment to val
2021-06-07 17:50:37 +02:00
Miriam Baglioni
f33521d338
Update 'dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcid/SparkConvertORCIDToOAF.scala'
...
to be able to replace the object assigned to author, val has been replaced by var
2021-06-07 17:27:07 +02:00
Miriam Baglioni
bc12e9819e
Update 'dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcid/SparkConvertORCIDToOAF.scala'
...
The change fixes the issue that arises when the same work appears more than once on the same ORCID profile. It avoids replicating the association doi -> author when the orcid id is already associated with the doi.
2021-06-07 16:37:01 +02:00
Claudio Atzori
5e4b91d9ef
more pervasive use of constants from ModelConstants, especially for ORCID
2021-05-26 18:20:23 +02:00
Claudio Atzori
c4a23c2f4d
fix: preserving the old identifier among the originalIds in the doiboost construction process, trying to avoid UnsupportedOperationException while adding elements to the originalIds
2021-05-19 16:01:52 +02:00
Claudio Atzori
ba03f549d7
fix: preserving the old identifier among the originalIds in the doiboost construction process
2021-05-19 15:43:26 +02:00
Claudio Atzori
2cbf15f4fb
using ModelConstants
2021-05-17 09:54:45 +02:00
Claudio Atzori
f19feceaf0
set the old identifier before switching to the new one
2021-05-14 12:53:40 +02:00
Claudio Atzori
1bd70fa2c6
preserving the old identifier among the originalIds in the doiboost construction process
2021-05-14 11:30:41 +02:00
Claudio Atzori
ca3f3a7687
using ModelConstants
2021-05-14 11:29:49 +02:00
Claudio Atzori
23b8883ab1
applied intellij code cleanup
2021-05-14 10:58:12 +02:00
Enrico Ottonello
c537986b7c
deleted folders with merged data immediately before merge phases
2021-04-28 11:25:25 +02:00
Claudio Atzori
5afa7d3e0c
core utilities in dhp-common moved in external module dhp-schemas
2021-04-27 15:44:01 +02:00
Claudio Atzori
27ab8a704d
adjusted poms to align with the external dhp-schema module
2021-04-27 10:12:27 +02:00
Claudio Atzori
c2bb03c8b5
depending on external dhp-schemas module
2021-04-23 17:57:35 +02:00
Claudio Atzori
e5abbec2ba
[orcid] download of the lambda file defined in a script
2021-04-22 11:22:10 +02:00
Claudio Atzori
55964cbd81
[orcid] large oozie workflow cleanup; updated workflow for the orcidnodoi actionset creation
2021-04-22 10:18:09 +02:00
Claudio Atzori
52244f813a
merging from enrico.ottonello/dnet-hadoop:orcid-no-doi
2021-04-21 12:24:09 +02:00
Sandro La Bruzzo
a16e5299f9
applied unique function on the final dataset
2021-04-16 17:36:48 +02:00
Enrico Ottonello
27068aacd1
wf to move the orcid-no-doi dataset to the folder ready for the import
2021-04-16 17:17:47 +02:00
Sandro La Bruzzo
67085da305
fixed NPE
2021-04-16 11:05:58 +02:00
Sandro La Bruzzo
7d6a80e2f2
added new type on MAG mapping
2021-04-16 09:14:15 +02:00
Sandro La Bruzzo
3f77bfceb0
fixed test failure on jenkins
2021-04-14 10:03:01 +02:00
Sandro La Bruzzo
479abd10cb
Added to the ORCID workflow a method that extracts orcid directly from the dump generated by Enrico
2021-04-13 17:47:43 +02:00
Claudio Atzori
e686b8de8d
[ORCID-no-doi] integrating PR#98 D-Net/dnet-hadoop#98
2021-04-01 17:11:03 +02:00
Claudio Atzori
ee34cc51c3
[ORCID-no-doi] integrating PR#98 D-Net/dnet-hadoop#98
2021-04-01 17:07:49 +02:00
Claudio Atzori
7941d7be29
WIP: using common definitions from ModelConstants
2021-03-31 18:33:57 +02:00
Enrico Ottonello
59ec5137e1
improvement related to https://issue.openaire.research-infrastructures.eu/issues/6501
2021-03-31 16:25:41 +02:00
Sandro La Bruzzo
616d2ecce2
split the workflow collecting datacite into two workflows.
...
Released on beta
2021-03-31 15:45:58 +02:00
Sandro La Bruzzo
1dfda3624e
improved workflow importing datacite
2021-03-26 13:56:29 +01:00
Enrico Ottonello
ebd67b8c8f
removed duplicate orcid data in the authors set
2021-03-25 11:20:52 +01:00
Sandro La Bruzzo
625e4c29c4
added model constants
2021-03-23 09:39:56 +01:00
Sandro La Bruzzo
c392936b97
fixed error on best access right
2021-03-23 09:23:22 +01:00
Sandro La Bruzzo
c73072079d
fix conflicts
2021-03-22 16:36:31 +01:00
Sandro La Bruzzo
098914dcff
fix wrong relation with source null
2021-03-22 11:35:02 +01:00
Sandro La Bruzzo
25d5663d97
added filter
2021-03-18 10:24:42 +01:00
Sandro La Bruzzo
5f98ea74a9
Added fix for pid generation in stableIds
2021-03-17 15:53:24 +01:00
Sandro La Bruzzo
cc5bbafa5d
some fixes to make the workflows run
2021-03-17 12:12:56 +01:00
Sandro La Bruzzo
4bb3bcafa5
add author sequence number
2021-03-11 11:32:32 +01:00
Sandro La Bruzzo
a8e5d0ea0d
updated test and fixed assignment of access right
2021-03-11 10:41:24 +01:00
Sandro La Bruzzo
f5e7c57654
Fixed ticket 6282
2021-03-11 10:32:45 +01:00
Claudio Atzori
d525785497
[ #6282 open access status in the Graph] Result.Instance.accessRight defined with dedicated data type that includes the open access color.
2021-03-09 11:12:55 +01:00
Sandro La Bruzzo
a2169ccf07
implemented Ticket #6281: added pid to Instance in doiBoost
2021-03-09 10:46:36 +01:00
Claudio Atzori
8d2bb24512
merged from master
2021-03-08 15:44:34 +01:00
Enrico Ottonello
70cb100647
added updating last orcid dataset folders after completion
2021-03-01 10:17:04 +01:00
Enrico Ottonello
bd3b16402b
added result typologies
2021-03-01 10:16:02 +01:00
Enrico Ottonello
53d7023460
dateOfCollection taken from orcid last_update.txt on hdfs; cleaned wf parameters
2021-02-25 18:43:29 +01:00
Enrico Ottonello
d43ea88caf
aligned orcid result typologies with openaire vocabulary
2021-02-25 15:02:10 +01:00
Enrico Ottonello
975823b968
data from last updated orcid
2021-02-23 15:35:04 +01:00
Enrico Ottonello
ee4ba7298b
fix last update read/write from file on hdfs
2021-02-09 23:24:57 +01:00
Claudio Atzori
72c57b28fa
switched project version to 1.2.4-branch_hadoop_aggregator-SNAPSHOT
2021-02-04 14:08:18 +01:00
Enrico Ottonello
c238561001
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi
2021-02-04 10:44:21 +01:00
Enrico Ottonello
465ce39f75
job execution now based on file last_update.txt on hdfs
2021-02-04 10:44:04 +01:00
Sandro La Bruzzo
99cf3a8ea4
Merged Datacite transform into this branch
2021-01-28 16:34:46 +01:00
Claudio Atzori
ab2fe9266a
[DOIBoost] minor fixes in workflow definition
2021-01-05 10:26:39 +01:00
Claudio Atzori
7c722f3fdc
[DOIBoost] fixed typo
2021-01-05 10:25:54 +01:00
Claudio Atzori
8879704ba0
[DOIBoost] configurable ES server url and index name in crossref importer
2021-01-05 10:00:13 +01:00
Sandro La Bruzzo
7834a35768
avoid saving the intermediate dataset before generation of the Sequence file
2021-01-04 17:54:57 +01:00
Sandro La Bruzzo
e79445a8b4
minor fix addressing Claudio's objection
2021-01-04 17:39:25 +01:00
Sandro La Bruzzo
8765020b85
minor fix
2021-01-04 17:37:08 +01:00
Sandro La Bruzzo
b0dc92786f
defined a single oozie workflow for the generation of doiboost
2021-01-04 17:01:35 +01:00
Claudio Atzori
28460c2cd1
using com.fasterxml.jackson.databind.ObjectMapper instead of org.codehaus.jackson.map.ObjectMapper
2020-12-23 16:59:52 +01:00
Sandro La Bruzzo
1f6c8a9e83
added orcid_pending type to records coming from Crossref
2020-12-15 11:47:15 +01:00
Enrico Ottonello
b2de598c1a
all actions from download lambda file to merge updated data into one wf
2020-12-15 10:42:55 +01:00
Enrico Ottonello
efe4c2a9c5
authors and works are now updated in two separate spark actions of the wf
2020-12-12 02:06:21 +01:00
Enrico Ottonello
858efbfad1
fix dataset creation for downloaded works
2020-12-11 16:49:54 +01:00
Claudio Atzori
d9532446eb
imported more diffs from master branch; code formatting
2020-12-10 16:14:16 +01:00
Claudio Atzori
12e2f930c8
resolved conflicts
2020-12-10 10:57:39 +01:00
Enrico Ottonello
2233750a37
original orcid xml data are stored in a field of the class that models orcid data
2020-12-09 09:45:19 +01:00
Sandro La Bruzzo
302baab67b
fixed doiboost mapping and workflows
2020-12-07 19:59:33 +01:00
Enrico Ottonello
5c65e602d3
wf doi_authors generates one json record for each row
2020-12-07 15:28:10 +01:00
Enrico Ottonello
fa1855a4b8
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi
2020-12-07 11:02:59 +01:00
Enrico Ottonello
b1b589ada1
wf to generate orcid dataset
2020-12-07 11:02:32 +01:00
Sandro La Bruzzo
b31dd126fb
fixed crossref workflow; added common ORCID class
2020-12-07 10:42:38 +01:00