Miriam Baglioni
6aca0d8ebb
added kryo encoding for input files
2021-06-18 09:42:07 +02:00
Miriam Baglioni
3585e53da3
changed to split in two steps the generation of the crossref dataset
2021-06-18 09:41:23 +02:00
Sandro La Bruzzo
3100166d29
Merge remote-tracking branch 'origin/stable_ids' into stable_id_scholexplorer
2021-06-16 16:22:16 +02:00
Miriam Baglioni
95885bcf12
forces executor memory and driver memory to be 7G (trying to avoid OOM)
2021-06-16 10:17:52 +02:00
Miriam Baglioni
2550a73981
-
2021-06-16 10:04:41 +02:00
Miriam Baglioni
1c47c0d786
modified the number of executors trying to avoid OOM exception
2021-06-15 21:05:39 +02:00
Miriam Baglioni
7deac55138
added one option for resume from in the wf
2021-06-15 18:38:20 +02:00
Miriam Baglioni
66e7ef892f
changed the parameter name
2021-06-15 11:08:54 +02:00
Miriam Baglioni
4f47ad0891
no need to rename the folders, just write in overwrite mode, so I changed the name of the output folder
2021-06-15 09:28:31 +02:00
Miriam Baglioni
9f9dd00b94
refactoring
2021-06-15 09:24:46 +02:00
Miriam Baglioni
63d74ee379
refactoring
2021-06-15 09:24:11 +02:00
Miriam Baglioni
6ebc236657
added needed property: outputPath
2021-06-15 09:23:24 +02:00
Miriam Baglioni
f7379255b6
changed the workflow to extract info from the dump
2021-06-15 09:22:54 +02:00
Miriam Baglioni
d6e21bb6ea
creates the crossref dataset used for doiboost together with unpacking part from tar
2021-06-14 17:27:19 +02:00
Miriam Baglioni
ce0cfd79e0
creates the crossref dataset used for doiboost
2021-06-14 13:40:19 +02:00
Miriam Baglioni
93efe4de82
split the construction of crossref dataset in two parts. This one just unpacks the tar entries
2021-06-14 13:39:40 +02:00
Miriam Baglioni
8873e6b6d1
workflow and parameter
2021-06-14 10:15:57 +02:00
Miriam Baglioni
0f1acdf6b6
workflow and parameter
2021-06-14 10:08:55 +02:00
Sandro La Bruzzo
efbea1e01a
minor fix
2021-06-14 09:45:14 +02:00
Miriam Baglioni
75780fc636
extraction of the tar for the dump of crossref, and creation of the dataset
2021-06-14 09:45:07 +02:00
Miriam Baglioni
8d2e086e48
changes to avoid reassignment to val
2021-06-07 17:50:37 +02:00
Miriam Baglioni
f33521d338
Update 'dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcid/SparkConvertORCIDToOAF.scala'
to be able to replace the object assigned to author, val has been replaced by var
2021-06-07 17:27:07 +02:00
Miriam Baglioni
bc12e9819e
Update 'dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/orcid/SparkConvertORCIDToOAF.scala'
The change fixes the issue that arises when the same work appears more than once on the same ORCID profile. It avoids replicating the doi -> author association when the ORCID id is already associated with the doi.
2021-06-07 16:37:01 +02:00
Claudio Atzori
5e4b91d9ef
more pervasive use of constants from ModelConstants, especially for ORCID
2021-05-26 18:20:23 +02:00
Claudio Atzori
c4a23c2f4d
fix: preserving the old identifier among the originalIds in the doiboost construction process, trying to avoid UnsupportedOperationException while adding elements to the originalIds
2021-05-19 16:01:52 +02:00
Claudio Atzori
ba03f549d7
fix: preserving the old identifier among the originalIds in the doiboost construction process
2021-05-19 15:43:26 +02:00
Claudio Atzori
2cbf15f4fb
using ModelConstants
2021-05-17 09:54:45 +02:00
Claudio Atzori
f19feceaf0
set the old identifier before switching to the new one
2021-05-14 12:53:40 +02:00
Claudio Atzori
1bd70fa2c6
preserving the old identifier among the originalIds in the doiboost construction process
2021-05-14 11:30:41 +02:00
Claudio Atzori
ca3f3a7687
using ModelConstants
2021-05-14 11:29:49 +02:00
Claudio Atzori
23b8883ab1
applied intellij code cleanup
2021-05-14 10:58:12 +02:00
Enrico Ottonello
c537986b7c
deleted folders with merged data immediately before merge phases
2021-04-28 11:25:25 +02:00
Claudio Atzori
5afa7d3e0c
core utilities in dhp-common moved in external module dhp-schemas
2021-04-27 15:44:01 +02:00
Claudio Atzori
27ab8a704d
adjusted poms to align with the external dhp-schema module
2021-04-27 10:12:27 +02:00
Claudio Atzori
c2bb03c8b5
depending on external dhp-schemas module
2021-04-23 17:57:35 +02:00
Claudio Atzori
e5abbec2ba
[orcid] download of the lambda file defined in a script
2021-04-22 11:22:10 +02:00
Claudio Atzori
55964cbd81
[orcid] large oozie workflow cleanup; updated workflow for the orcidnodoi actionset creation
2021-04-22 10:18:09 +02:00
Claudio Atzori
52244f813a
merging from enrico.ottonello/dnet-hadoop:orcid-no-doi
2021-04-21 12:24:09 +02:00
Sandro La Bruzzo
a16e5299f9
applied unique function on the final dataset
2021-04-16 17:36:48 +02:00
Enrico Ottonello
27068aacd1
wf to move the orcid-no-doi dataset to the folder ready for the import
2021-04-16 17:17:47 +02:00
Sandro La Bruzzo
67085da305
fixed NPE
2021-04-16 11:05:58 +02:00
Sandro La Bruzzo
7d6a80e2f2
added new type on MAG mapping
2021-04-16 09:14:15 +02:00
Sandro La Bruzzo
3f77bfceb0
fixed test failure on jenkins
2021-04-14 10:03:01 +02:00
Sandro La Bruzzo
479abd10cb
Added to the ORCID workflow a method that extracts orcid directly from the dump generated by Enrico
2021-04-13 17:47:43 +02:00
Claudio Atzori
e686b8de8d
[ORCID-no-doi] integrating PR#98 D-Net/dnet-hadoop#98
2021-04-01 17:11:03 +02:00
Claudio Atzori
ee34cc51c3
[ORCID-no-doi] integrating PR#98 D-Net/dnet-hadoop#98
2021-04-01 17:07:49 +02:00
Claudio Atzori
7941d7be29
WIP: using common definitions from ModelConstants
2021-03-31 18:33:57 +02:00
Enrico Ottonello
59ec5137e1
improvement related to https://issue.openaire.research-infrastructures.eu/issues/6501
2021-03-31 16:25:41 +02:00
Sandro La Bruzzo
616d2ecce2
split the workflow collecting datacite into two workflows.
Released on beta
2021-03-31 15:45:58 +02:00
Sandro La Bruzzo
1dfda3624e
improved workflow importing datacite
2021-03-26 13:56:29 +01:00
Enrico Ottonello
ebd67b8c8f
removed duplicate orcid data in the authors set
2021-03-25 11:20:52 +01:00
Sandro La Bruzzo
625e4c29c4
added model constants
2021-03-23 09:39:56 +01:00
Sandro La Bruzzo
c392936b97
fixed error on best access right
2021-03-23 09:23:22 +01:00
Sandro La Bruzzo
c73072079d
fix conflicts
2021-03-22 16:36:31 +01:00
Sandro La Bruzzo
098914dcff
fix wrong relation with source null
2021-03-22 11:35:02 +01:00
Sandro La Bruzzo
25d5663d97
added filter
2021-03-18 10:24:42 +01:00
Sandro La Bruzzo
5f98ea74a9
Added fix for pid generation in stableIds
2021-03-17 15:53:24 +01:00
Sandro La Bruzzo
cc5bbafa5d
some fixes to make the workflows run
2021-03-17 12:12:56 +01:00
Sandro La Bruzzo
4bb3bcafa5
add author sequence number
2021-03-11 11:32:32 +01:00
Sandro La Bruzzo
a8e5d0ea0d
updated test and fixed assign of access right
2021-03-11 10:41:24 +01:00
Sandro La Bruzzo
f5e7c57654
Fixed ticket 6282
2021-03-11 10:32:45 +01:00
Claudio Atzori
d525785497
[#6282 open access status in the Graph] Result.Instance.accessRight defined with dedicated data type that includes the open access color.
2021-03-09 11:12:55 +01:00
Sandro La Bruzzo
a2169ccf07
implemented Ticket #6281: added pid to Instance in doiBoost
2021-03-09 10:46:36 +01:00
Claudio Atzori
8d2bb24512
merged from master
2021-03-08 15:44:34 +01:00
Enrico Ottonello
70cb100647
added updating last orcid dataset folders after completion
2021-03-01 10:17:04 +01:00
Enrico Ottonello
bd3b16402b
added result typologies
2021-03-01 10:16:02 +01:00
Enrico Ottonello
53d7023460
dateOfCollection taken from orcid last_update.txt on hdfs; cleaned wf parameters
2021-02-25 18:43:29 +01:00
Enrico Ottonello
d43ea88caf
aligned orcid result typologies with openaire vocabulary
2021-02-25 15:02:10 +01:00
Enrico Ottonello
975823b968
data from last updated orcid
2021-02-23 15:35:04 +01:00
Enrico Ottonello
ee4ba7298b
fix last update read/write from file on hdfs
2021-02-09 23:24:57 +01:00
Claudio Atzori
72c57b28fa
switched project version to 1.2.4-branch_hadoop_aggregator-SNAPSHOT
2021-02-04 14:08:18 +01:00
Enrico Ottonello
c238561001
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi
2021-02-04 10:44:21 +01:00
Enrico Ottonello
465ce39f75
job execution now based on file last_update.txt on hdfs
2021-02-04 10:44:04 +01:00
Sandro La Bruzzo
99cf3a8ea4
Merged Datacite transform into this branch
2021-01-28 16:34:46 +01:00
Claudio Atzori
ab2fe9266a
[DOIBoost] minor fixes in workflow definition
2021-01-05 10:26:39 +01:00
Claudio Atzori
7c722f3fdc
[DOIBoost] fixed typo
2021-01-05 10:25:54 +01:00
Claudio Atzori
8879704ba0
[DOIBoost] configurable ES server url and index name in crossref importer
2021-01-05 10:00:13 +01:00
Sandro La Bruzzo
7834a35768
avoid saving the intermediate dataset before generating the Sequence file
2021-01-04 17:54:57 +01:00
Sandro La Bruzzo
e79445a8b4
minor fix for Claudio's objection
2021-01-04 17:39:25 +01:00
Sandro La Bruzzo
8765020b85
minor fix
2021-01-04 17:37:08 +01:00
Sandro La Bruzzo
b0dc92786f
defined a single oozie workflow for the generation of doiboost
2021-01-04 17:01:35 +01:00
Claudio Atzori
28460c2cd1
using com.fasterxml.jackson.databind.ObjectMapper instead of org.codehaus.jackson.map.ObjectMapper
2020-12-23 16:59:52 +01:00
Sandro La Bruzzo
1f6c8a9e83
added orcid_pending type to records coming from Crossref
2020-12-15 11:47:15 +01:00
Enrico Ottonello
b2de598c1a
all actions from download lambda file to merge updated data into one wf
2020-12-15 10:42:55 +01:00
Enrico Ottonello
efe4c2a9c5
authors and works are now updated in two separate spark actions of the wf
2020-12-12 02:06:21 +01:00
Enrico Ottonello
858efbfad1
fix dataset creation for downloaded works
2020-12-11 16:49:54 +01:00
Claudio Atzori
d9532446eb
imported more diffs from master branch; code formatting
2020-12-10 16:14:16 +01:00
Claudio Atzori
12e2f930c8
resolved conflicts
2020-12-10 10:57:39 +01:00
Enrico Ottonello
2233750a37
original orcid xml data are stored in a field of the class that models orcid data
2020-12-09 09:45:19 +01:00
Sandro La Bruzzo
302baab67b
fixed doiboost mapping and workflows
2020-12-07 19:59:33 +01:00
Enrico Ottonello
5c65e602d3
wf doi_authors generates one json record for each row
2020-12-07 15:28:10 +01:00
Enrico Ottonello
fa1855a4b8
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi
2020-12-07 11:02:59 +01:00
Enrico Ottonello
b1b589ada1
wf to generate orcid dataset
2020-12-07 11:02:32 +01:00
Sandro La Bruzzo
b31dd126fb
fixed crossref workflow added common ORCID Class
2020-12-07 10:42:38 +01:00
Enrico Ottonello
8812ab65e1
completed download function to wf; added accumulators
2020-12-04 21:13:49 +01:00
Enrico Ottonello
53b22c1937
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi
2020-12-02 23:21:27 +01:00
Enrico Ottonello
1b1e9ea67c
wf to generate doi_author_list for doiboost; wf to download updated works
2020-12-02 23:20:16 +01:00
Sandro La Bruzzo
7da679542f
fixed wrong projectId
2020-12-02 14:28:09 +01:00
Sandro La Bruzzo
6ba8037cc7
fixed test failure due to a change in the input
2020-12-02 11:34:46 +01:00
Claudio Atzori
cfb55effd9
code formatting
2020-12-02 11:23:49 +01:00
Enrico Ottonello
f2df3ead74
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi
2020-11-30 14:22:46 +01:00
Enrico Ottonello
40c4559e92
added datainfo on authors pid with "sysimport:crosswalk:entityregistry"
2020-11-30 14:19:22 +01:00
Claudio Atzori
a104d2b6ad
cleanup
2020-11-26 11:12:00 +01:00
Claudio Atzori
db0181b8af
Merge pull request 'added bidirectionality to relations from project and result coming from crossref' (#60) from miriam.baglioni/dnet-hadoop:sxBidirectionality into master
2020-11-25 17:17:40 +01:00
Sandro La Bruzzo
ec3e238de6
Fixed problem on duplicated identifier
2020-11-25 17:15:54 +01:00
Sandro La Bruzzo
264723ffd8
updated stuff for zenodo upload
2020-11-25 11:56:07 +01:00
Enrico Ottonello
99a086f0c6
max concurrent executors set to 10, according to ORCID Director of Technology mail request
2020-11-24 17:49:32 +01:00
Miriam Baglioni
00874a8ce6
added bidirectionality to relations from project and result
2020-11-24 15:17:23 +01:00
Enrico Ottonello
5c17e768b2
set wf configuration with spark.dynamicAllocation.maxExecutors 20 over 20 input partitions
2020-11-23 16:01:23 +01:00
Enrico Ottonello
97c8111847
action to convert lambda file in seq file; spark action to download updated authors
2020-11-23 09:49:22 +01:00
Enrico Ottonello
c0c2e05eae
added wf to extract authors and works xml data from the orcid dump to hdfs; added wf to download the lambda file (containing the latest orcid update information) from orcid to hdfs
2020-11-17 18:23:12 +01:00
Enrico Ottonello
005f849674
added compression to output dataset
2020-11-13 12:45:31 +01:00
Enrico Ottonello
9a2fa9dc2f
added test for other names parsing from summaries dump
2020-11-13 10:25:34 +01:00
Enrico Ottonello
13f28fa225
moved AuthorData to dhp-schemas; added other names to author data
2020-11-12 17:43:32 +01:00
Claudio Atzori
9b0fb9e958
merged from master
2020-11-12 09:27:12 +01:00
Enrico Ottonello
1f861f2b0d
now wf output is a sequence file with the format seq("eu.dnetlib.dhp.schema.oaf.Publication",eu.dnetlib.dhp.schema.action.AtomicActions)
2020-11-11 17:38:50 +01:00
Enrico Ottonello
fea2451658
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi
2020-11-10 11:49:43 +01:00
Enrico Ottonello
1513174d7e
added further test case
2020-11-10 11:44:55 +01:00
Sandro La Bruzzo
8e1d43aab2
Implemented ID generation using IdentifierRecordFactory on DOIBoost
2020-11-09 11:53:55 +01:00
Sandro La Bruzzo
cd27df91a1
fixed bug on missing relation in ANDS
2020-11-06 17:12:31 +01:00
Enrico Ottonello
6bc7dbeca7
first version of the dataset successfully generated from orcid dump 2020
2020-11-06 13:47:50 +01:00
Sandro La Bruzzo
39337d8a8a
fixed test
2020-11-02 09:26:25 +01:00
Enrico Ottonello
9818e74a70
added dependency version in main pom.xml for orcid no doi
2020-10-22 16:38:00 +02:00
Enrico Ottonello
210a50e4f4
replaced null value
2020-10-22 16:24:42 +02:00
Enrico Ottonello
b0290dbcb7
moved all dependencies version to main pom.xml
2020-10-22 16:20:46 +02:00
Enrico Ottonello
a38ab57062
let the test methods run
2020-10-22 15:43:50 +02:00
Enrico Ottonello
1139d6568d
replaced null value with a safer empty string as return value
2020-10-22 15:32:26 +02:00
Enrico Ottonello
c58db1c8ea
added filter on null value after map function
2020-10-22 15:11:02 +02:00
Enrico Ottonello
846ba30873
if typologies mapping fails, an exception will be propagated
2020-10-22 14:36:18 +02:00
Enrico Ottonello
c3114ba0ae
replaced null as return value with a safer empty string
2020-10-22 14:21:31 +02:00
Enrico Ottonello
c295c71ca0
added comment
2020-10-22 14:07:26 +02:00
Enrico Ottonello
ab083f9946
propagate exception on parsing work (PR request)
2020-10-22 14:02:32 +02:00
sandro
3a81a940b7
solved bug on merge publication
2020-10-21 22:41:55 +02:00
Sandro La Bruzzo
34bf64c94f
fixed export Scholexplorer to OpenAire
2020-10-13 08:47:58 +02:00
Sandro La Bruzzo
cd9c377d18
adapted scholexplorer Dump generation to the new Dataset definition
2020-10-08 10:10:13 +02:00
Sandro La Bruzzo
c4a3c52e45
fixed Doiboost bug in the identifier
2020-10-01 15:46:44 +02:00
Enrico Ottonello
a97ad20c7b
exception is now propagated (PR review)
2020-09-22 10:46:34 +02:00
Enrico Ottonello
fefbcfb106
dependency version moved to main pom (PR review)
2020-09-22 10:20:25 +02:00
Enrico Ottonello
9e8e7fe6ef
add comments
2020-09-15 11:32:49 +02:00
Enrico Ottonello
0377b40fba
output to one parquet file
2020-07-30 18:38:07 +02:00
Enrico Ottonello
196f36c6ed
fix publication dataset creation
2020-07-30 13:38:33 +02:00
Enrico Ottonello
c82b15b5f4
migrate configuration to ocean, fix publication dataset creation
2020-07-28 15:23:52 +02:00
Enrico Ottonello
ca37d3427b
separate workflow to parse orcid summaries, activities and generate dataset with no doi publications; test
2020-07-03 23:30:31 +02:00
Enrico Ottonello
1729cc5cf3
publication conversion from json to oaf test
2020-07-02 18:46:20 +02:00
Enrico Ottonello
5525f57ec8
converter from orcid work json to oaf
2020-07-01 18:36:14 +02:00
Enrico Ottonello
b7b6be12a5
fixed enriched works generation
2020-06-29 18:03:16 +02:00
Enrico Ottonello
b2213b6435
merged with dnet version
2020-06-26 17:27:34 +02:00
Enrico Ottonello
c5e149c46e
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi
2020-06-26 16:15:38 +02:00
Enrico Ottonello
d6498278ed
added workflow to generate seq(orcidId,work) and seq(orcidId,enrichedWork)
2020-06-25 18:43:29 +02:00
Sandro La Bruzzo
a6c0faac70
added test to verify secondary sorting
2020-06-25 10:48:15 +02:00