Sandro La Bruzzo
7f8848ecdd
added first implementation of Pangaea Mapping
2021-04-27 11:30:37 +02:00
Claudio Atzori
27ab8a704d
adjusted poms to align with the external dhp-schema module
2021-04-27 10:12:27 +02:00
Claudio Atzori
a7cf449b36
cleanup
2021-04-27 10:11:26 +02:00
Claudio Atzori
fa42026590
fixed PersonCleaner extension functions
2021-04-27 10:10:06 +02:00
Claudio Atzori
ef4bfd82e2
code formatting
2021-04-27 10:09:31 +02:00
Claudio Atzori
faa8f6f4e2
Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids
2021-04-27 09:57:03 +02:00
miconis
6d5c14e030
assertions updated in entity merger test
2021-04-27 09:47:49 +02:00
Claudio Atzori
c2bb03c8b5
depending on external dhp-schemas module
2021-04-23 17:57:35 +02:00
Claudio Atzori
7ed107be53
depending on external dhp-schemas module
2021-04-23 17:52:36 +02:00
Claudio Atzori
c25238480c
making ODF record parsing namespace unaware ( #6629 )
2021-04-23 17:34:57 +02:00
Claudio Atzori
99cfb027fa
making ODF record parsing namespace unaware ( #6629 )
2021-04-23 17:09:36 +02:00
Miriam Baglioni
72e5aa3b42
refactoring
2021-04-23 12:10:30 +02:00
Miriam Baglioni
7d1b8b7f64
merge upstream
2021-04-23 11:55:49 +02:00
miconis
d0e3366c34
Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids
2021-04-22 11:45:19 +02:00
miconis
3c12eeadce
bug fix in propagation of relations
2021-04-22 11:44:33 +02:00
Claudio Atzori
e5abbec2ba
[orcid] download of the lambda file defined in a script
2021-04-22 11:22:10 +02:00
Claudio Atzori
55964cbd81
[orcid] large oozie workflow cleanup; updated workflow for the orcidnodoi actionset creation
2021-04-22 10:18:09 +02:00
Claudio Atzori
8f309b72ff
[dedup] using node names consistently across the workflow
2021-04-21 17:54:51 +02:00
Claudio Atzori
52244f813a
merging from enrico.ottonello/dnet-hadoop:orcid-no-doi
2021-04-21 12:24:09 +02:00
Sandro La Bruzzo
fd29307b84
updated workflow name
2021-04-21 09:21:41 +02:00
Claudio Atzori
815b9f4d56
[openorgs dedup] fixed workflow parameter declarations. Introduced support for resuming the execution from intermediate steps
2021-04-20 17:24:45 +02:00
Claudio Atzori
d0d477cca3
code formatting
2021-04-20 12:50:34 +02:00
miconis
0393cdce42
addition of alternative names in export queries
2021-04-20 12:45:21 +02:00
miconis
cadd0a5de8
modification of the queries for openorgs: they now consider also pending orgs
2021-04-20 12:06:56 +02:00
Sandro La Bruzzo
e06c7f32f6
updated id figshare as described in #6377
2021-04-20 10:18:07 +02:00
Sandro La Bruzzo
dbe0d0378e
resolved ticket #6377
2021-04-20 09:44:44 +02:00
Antonis Lempesis
625d993cd9
added step for observatory db
2021-04-20 02:31:06 +03:00
Antonis Lempesis
25d0512fbd
code cleanup
2021-04-20 01:43:23 +03:00
Sandro La Bruzzo
524e5f3092
Improved parallelization on transformation wf on hadoop
2021-04-19 15:17:25 +02:00
Sandro La Bruzzo
cdfe01bbae
improved parallelization on transformation job
2021-04-19 15:14:52 +02:00
Sandro La Bruzzo
3ae67b7a1d
Merge remote-tracking branch 'origin/stable_ids' into stable_ids
2021-04-16 17:36:57 +02:00
Sandro La Bruzzo
a16e5299f9
applied unique function on the final dataset
2021-04-16 17:36:48 +02:00
Claudio Atzori
45057440c1
code formatting
2021-04-16 17:28:25 +02:00
Enrico Ottonello
34ca792a55
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi
2021-04-16 17:18:46 +02:00
Enrico Ottonello
27068aacd1
wf to move the orcid-no-doi dataset to the folder ready for the import
2021-04-16 17:17:47 +02:00
miconis
7ad573d023
bug fix: changed join in propagaterelations without applying filter on the id
2021-04-16 16:40:42 +02:00
Sandro La Bruzzo
67085da305
fixed NPE
2021-04-16 11:05:58 +02:00
Sandro La Bruzzo
644aa8f40c
Merge remote-tracking branch 'origin/stable_ids' into stable_ids
2021-04-16 09:14:26 +02:00
Sandro La Bruzzo
7d6a80e2f2
added new type on MAG mapping
2021-04-16 09:14:15 +02:00
Claudio Atzori
906d50563c
Merge pull request 'properly invalidating impala metadata' ( #105 ) from antonis.lempesis/dnet-hadoop:master into master
Reviewed-on: D-Net/dnet-hadoop#105
2021-04-15 15:06:22 +02:00
Claudio Atzori
3d58f95522
[stats update] properly invalidating impala metadata
2021-04-15 15:03:05 +02:00
Antonis Lempesis
03d36fadea
properly invalidating impala metadata
2021-04-15 13:34:22 +03:00
miconis
f64e57c112
refactoring of the id generation, sparkcreatemergerels collects entities to create root id after a join
2021-04-15 10:59:24 +02:00
miconis
176a5e493d
Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids
2021-04-14 18:06:34 +02:00
miconis
3525a8f504
id generation of representative record moved to the SparkCreateMergeRel job
2021-04-14 18:06:07 +02:00
Sandro La Bruzzo
3f77bfceb0
fixed test failure on jenkins
2021-04-14 10:03:01 +02:00
Claudio Atzori
3125cef545
code formatting
2021-04-14 09:11:54 +02:00
Sandro La Bruzzo
44a0064df6
Merge remote-tracking branch 'origin/stable_ids' into stable_ids
2021-04-13 17:48:12 +02:00
Sandro La Bruzzo
479abd10cb
Add into ORCID workflow a method that extracts orcid directly to the dump generated by Enrico
2021-04-13 17:47:43 +02:00
Claudio Atzori
710cd1e8f2
Merge pull request 'add xslt, personname cleaner' ( #104 ) from andreas.czerniak/BrStableId_dnet-hadoop:stable_ids into stable_ids
Reviewed-on: D-Net/dnet-hadoop#104
LGTM
2021-04-13 14:43:05 +02:00
Claudio Atzori
d1ca025b0b
[cleaning] removing authors without fullname or providing 'deactivated' keyword. Removing test test titles
2021-04-13 14:32:41 +02:00
miconis
1542196a33
bug fix: starting node of duplicate scan wf changed
2021-04-13 10:15:43 +02:00
miconis
369ed1cd8a
bug fix: lookupurl parameter added to dedup record job
2021-04-13 09:08:05 +02:00
Andreas Czerniak
3b694074ff
add xslt, personname cleaner
2021-04-13 07:04:27 +02:00
Claudio Atzori
511c0521e5
[dedup] avoiding NPEs handling OpenOrg relations
2021-04-12 17:45:11 +02:00
miconis
d442e25cbc
bug fix: ids in self mergerels are not marked deletedbyinference=true
2021-04-12 15:56:22 +02:00
miconis
dcff9cecdf
bug fix: ids in self mergerels are not marked deletedbyinference=true
2021-04-12 15:55:27 +02:00
miconis
11b22b2d23
bug fix in the query, it now exports only relations with non-hidden organizations
2021-04-08 11:51:47 +02:00
miconis
0857100fb8
implementation of the tests for the openorgs integration in the openaire provision
2021-04-07 18:42:16 +02:00
miconis
bf685d849f
addition of pids in the query for the export of openorgs for the provision, addition of ec_fields in the openorgs model
2021-04-07 14:27:43 +02:00
Miriam Baglioni
70e391d427
merge upstream
2021-04-07 10:38:08 +02:00
miconis
eaaefb8b4c
implementation of the procedure to reuse content of different dbs when creating the raw graph
2021-04-06 14:35:51 +02:00
miconis
c39c82dfe9
modification of the jobs for the integration of openorgs in the provision, dedup records are no longer created by merging but simply by taking results of the openorgs portal
2021-04-06 14:31:00 +02:00
Claudio Atzori
37b65cc3ad
Merge pull request 'updates on stats-update workflow' ( #100 ) from antonis.lempesis/dnet-hadoop:master into master
The workflow integrated in the _stable_ids_ branch has been run correctly on the BETA content, thus IMO this PR can be integrated in the master branch.
Reviewed-on: D-Net/dnet-hadoop#100
2021-04-02 16:13:35 +02:00
Claudio Atzori
1e7e5180fa
[Graph model] updated definition of ExternalReference: added alternateLabel, removed description ( #6503 )
2021-04-02 12:32:12 +02:00
Claudio Atzori
e686b8de8d
[ORCID-no-doi] integrating PR#98 D-Net/dnet-hadoop#98
2021-04-01 17:11:03 +02:00
Claudio Atzori
ee34cc51c3
[ORCID-no-doi] integrating PR#98 D-Net/dnet-hadoop#98
2021-04-01 17:07:49 +02:00
Claudio Atzori
70e49ed53c
[OpenOrgsWf] trivial refactoring
2021-04-01 10:30:51 +02:00
Claudio Atzori
7941d7be29
WIP: using common definitions from ModelConstants
2021-03-31 18:33:57 +02:00
Claudio Atzori
879e8cc7ef
WIP: using common definitions from ModelConstants
2021-03-31 17:12:01 +02:00
Claudio Atzori
72ce741ea6
WIP: using common definitions from ModelConstants
2021-03-31 17:07:13 +02:00
Enrico Ottonello
59ec5137e1
improvement related to https://issue.openaire.research-infrastructures.eu/issues/6501
2021-03-31 16:25:41 +02:00
Sandro La Bruzzo
616d2ecce2
split the workflow collecting datacite into two workflows.
Released on beta
2021-03-31 15:45:58 +02:00
Miriam Baglioni
4b6e514f02
merge upstream
2021-03-30 10:27:12 +02:00
Claudio Atzori
9237d55d7f
[OpenOrgsWf] cleanup
2021-03-29 17:40:34 +02:00
Claudio Atzori
7f4e9479ec
[OpenOrgsWf] graph construction wf: allow to skip the import openorgs node (importOpenorgs true|false)
2021-03-29 16:59:16 +02:00
miconis
2709d08fc2
Merge branch 'stable_ids' into openorgswf
2021-03-29 16:39:07 +02:00
miconis
f446580e9f
code refactoring (useless classes and wf removed), implementation of the test for the openorgs dedup
2021-03-29 16:10:46 +02:00
Claudio Atzori
a0837ac357
[Stats update] integrating PR#100 for testing D-Net/dnet-hadoop#100
2021-03-29 15:59:58 +02:00
miconis
2355cc4e9b
minor changes and bug fix
2021-03-29 10:07:12 +02:00
Sandro La Bruzzo
1dfda3624e
improved workflow importing datacite
2021-03-26 13:56:29 +01:00
Enrico Ottonello
91d8660982
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi
2021-03-25 11:21:20 +01:00
Enrico Ottonello
ebd67b8c8f
removed duplicate orcid data from the authors set
2021-03-25 11:20:52 +01:00
Claudio Atzori
827e7e37db
[Cleaning] drop instance.alternateIdentifier elements when they are available among instance.pid
2021-03-25 11:07:59 +01:00
miconis
28c1cdd132
merged stable_ids into openorgswf
2021-03-25 10:44:49 +01:00
miconis
5dfb66b0fa
minor changes
2021-03-25 10:29:34 +01:00
miconis
348b0ef921
bug fix, implementation of the workflow for the creation of raw_organizations (openorgs dedup), addition of the pid lists to the openorgs postgres db
2021-03-24 15:51:27 +01:00
Claudio Atzori
751125fdf9
[Actionmanager] zero function considers empty entity.id as well as rel.source/rel.target
2021-03-23 17:34:32 +01:00
Claudio Atzori
1e423fdc07
[Actionmanager] remove invalid records from the input graph before groupGraphTableByIdAndMerge
2021-03-23 13:39:24 +01:00
Claudio Atzori
e5ebb500cf
fixed pom versions; included missing workflow modules in dhp-workflows/pom.xml
2021-03-23 12:13:53 +01:00
Claudio Atzori
b75ad76f79
Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids
2021-03-23 09:59:12 +01:00
Claudio Atzori
8db248aa13
avoiding error on jenkins compilations: java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)!
2021-03-23 09:56:34 +01:00
Sandro La Bruzzo
625e4c29c4
added model constants
2021-03-23 09:39:56 +01:00
Claudio Atzori
b4febed138
updated mapping tests as consequence of the special treatment reserved to Handle PIDs
2021-03-23 09:37:48 +01:00
Claudio Atzori
431cbe9955
handle missing instance.pid during bulk cleaning
2021-03-23 09:28:58 +01:00
Sandro La Bruzzo
c392936b97
fixed error on best access right
2021-03-23 09:23:22 +01:00
Sandro La Bruzzo
c73072079d
fix conflicts
2021-03-22 16:36:31 +01:00
Sandro La Bruzzo
098914dcff
fix wrong relation with source null
2021-03-22 11:35:02 +01:00
miconis
0fe40b08e4
addition of deduplication profiles for the results, double check on pids and the title with a lower threshold
2021-03-19 17:12:05 +01:00
miconis
98854b0124
minor changes
2021-03-19 16:57:40 +01:00
Claudio Atzori
5a043e95ea
code formatting
2021-03-19 11:37:27 +01:00
Claudio Atzori
a4e82a65aa
integrated filter applied when merging BETA & PROD graphs to rule out records from Datacite
2021-03-19 11:34:44 +01:00
Claudio Atzori
75144dacb3
Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids
2021-03-19 09:07:40 +01:00
Claudio Atzori
972d5a3d98
[dedup] Datacite should be authoritative for datasets
2021-03-19 09:04:20 +01:00
Sandro La Bruzzo
25d5663d97
added filter
2021-03-18 10:24:42 +01:00
Sandro La Bruzzo
5f98ea74a9
Added fix for pid generation in stableIds
2021-03-17 15:53:24 +01:00
Sandro La Bruzzo
2be0428047
Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids
2021-03-17 14:54:28 +01:00
Claudio Atzori
8257f9a2bc
result.pid: adjusted the mapping applied to the contents from the aggregator
2021-03-17 12:45:38 +01:00
Sandro La Bruzzo
7c97a4d900
Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids
2021-03-17 12:13:03 +01:00
Sandro La Bruzzo
cc5bbafa5d
some fixes to make the workflows run
2021-03-17 12:12:56 +01:00
Claudio Atzori
640b885706
added instance.alternativeIdentifiers to the graph model, adjusted the mapping applied to the contents from the aggregator
2021-03-16 14:19:32 +01:00
Claudio Atzori
61a2551e74
migrated last changes from svn (dnet45)
2021-03-15 17:17:55 +01:00
Antonis Lempesis
0ba0a6b9da
update promote wf to support monitor&production
2021-03-12 16:42:59 +02:00
Antonis Lempesis
60ebdf2dbe
update promote wf to support monitor&production
2021-03-12 16:34:53 +02:00
Antonis Lempesis
236435b470
following redirects
2021-03-12 14:11:21 +02:00
Antonis Lempesis
3c75a05044
fixed a ton of typos
2021-03-12 13:47:04 +02:00
Sandro La Bruzzo
4bb3bcafa5
add author sequence number
2021-03-11 11:32:32 +01:00
Sandro La Bruzzo
a8e5d0ea0d
updated test and fixed assign of access right
2021-03-11 10:41:24 +01:00
Sandro La Bruzzo
f5e7c57654
Fixed ticket 6282
2021-03-11 10:32:45 +01:00
Antonis Lempesis
fa1ec5b5e9
fixed typo...
2021-03-10 14:05:58 +02:00
Claudio Atzori
01630f638d
IdentifierFactory implementation based on the list of datasources authoritative for a given pid type
2021-03-09 17:11:50 +01:00
Claudio Atzori
59532b0919
[ #6281 Provenance of product PIDs] Added PIDs to the Instance type; extended mapping for OAF/ODF records
2021-03-09 11:14:45 +01:00
Claudio Atzori
d525785497
[ #6282 open access status in the Graph] Result.Instance.accessRight defined with dedicated data type that includes the open access color.
2021-03-09 11:12:55 +01:00
Sandro La Bruzzo
bbe1a7c69a
[ #6281 Provenance of product PIDs] Added PIDs to the Instance type in Scholexplorer Export
2021-03-09 10:46:36 +01:00
Sandro La Bruzzo
a2169ccf07
// implemented Ticket #6281 added pid to Instance in doiBoost
2021-03-09 10:46:36 +01:00
Claudio Atzori
f468c7f0d7
merged from master
2021-03-09 09:12:41 +01:00
Claudio Atzori
8d2bb24512
merged from master
2021-03-08 15:44:34 +01:00
Claudio Atzori
acbe3119a4
RestCollectorPlugin imported from dnet45
2021-03-08 09:44:09 +01:00
Antonis Lempesis
f40c150a0d
fixed steps...
2021-03-06 00:35:57 +02:00
Claudio Atzori
fa7930d2e2
merging contributions from PR#97
2021-03-05 15:45:28 +01:00
Antonis Lempesis
6147ee4950
assigning correctly hive contexts to concepts
2021-03-05 14:12:18 +02:00
Antonis Lempesis
c5fbad8093
Contexts are now downloaded instead of using the stats_ext db
2021-03-04 00:42:21 +02:00
Claudio Atzori
55f6ff5f55
README.md for aggregation workflows
2021-03-03 16:18:34 +01:00
Claudio Atzori
e8789b0cdb
Merge pull request 'stats DB for monitor' ( #99 ) from antonis.lempesis/dnet-hadoop:master into master
Looks good to me, just a note on the parsing of the citations: since the last version, IIS produces citations as proper relationships among results. This is what we got already in the BETA graph
```
count r.reltype r.subreltype r.relclass
62.129.254 resultResult citation cites
62.043.309 resultResult citation isCitedBy
```
Thus, I suggest moving away from the current property-based implementation for the extraction of the citation links and relying on the relationships instead.
2021-03-03 10:29:09 +01:00
Claudio Atzori
36f750cd1d
removed unused classes
2021-03-03 10:22:29 +01:00
Claudio Atzori
b73dce3e3a
more logging on the MDStore mongodb client. Forcing UTF_8 encoding on the content
2021-03-03 10:17:16 +01:00
Antonis Lempesis
27796343ca
crude sleep. hardcoded value
2021-03-03 01:37:47 +02:00
Enrico Ottonello
70cb100647
added updating last orcid dataset folders after completion
2021-03-01 10:17:04 +01:00
Enrico Ottonello
bd3b16402b
added result typologies
2021-03-01 10:16:02 +01:00
Claudio Atzori
e76c4f62c1
MetadataRecord moved in dhp-schemas
2021-02-26 10:58:48 +01:00
miconis
1a85020572
bug fix in graph-mapper, changes in the implementation of the openorgs wf to create relations and populate openorgs db
2021-02-26 10:19:28 +01:00
Enrico Ottonello
ca1800510a
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi
2021-02-25 18:45:02 +01:00
Enrico Ottonello
53d7023460
dateOfCollection taken from orcid last_update.txt on hdfs; cleaned wf parameters
2021-02-25 18:43:29 +01:00
Claudio Atzori
7df2461ccc
indent XML records collected from oai-pmh endpoints
2021-02-25 16:19:12 +01:00
Enrico Ottonello
d43ea88caf
aligned orcid result typologies with openaire vocabulary
2021-02-25 15:02:10 +01:00
Claudio Atzori
b830e33392
mdstore collector plugin
2021-02-25 12:30:30 +01:00
Claudio Atzori
271e88537b
code formatting
2021-02-25 12:28:56 +01:00
Claudio Atzori
9c899f4433
cleanup on transformation functions and the relative tests
2021-02-24 15:07:59 +01:00
Claudio Atzori
fc3fa5e343
implemented mdstore collector plugin
2021-02-24 15:07:24 +01:00
Enrico Ottonello
975823b968
data from last updated orcid
2021-02-23 15:35:04 +01:00
Miriam Baglioni
896919e735
merge upstream
2021-02-23 10:45:29 +01:00
Antonis Lempesis
d90767c733
correctly invalidating metadata
2021-02-19 03:18:47 +02:00
Antonis Lempesis
3681afbe04
typo
2021-02-19 03:04:27 +02:00
Antonis Lempesis
c5502eba8f
actually moved stats computation in impala instead of hive...
2021-02-19 02:54:39 +02:00
Antonis Lempesis
33c85d4e66
moved stats computation in impala instead of hive
2021-02-18 17:23:34 +02:00
Antonis Lempesis
b8e96c8ae7
moved cache update to the end
2021-02-18 16:42:22 +02:00
Antonis Lempesis
bcbfc052b1
fixed last errors in step 21
2021-02-18 16:32:54 +02:00
Antonis Lempesis
10a29a4b9a
fixes in monitor step
2021-02-18 15:05:59 +02:00
Antonis Lempesis
8ef66452d5
fixed typo
2021-02-17 22:24:44 +02:00
Antonis Lempesis
a8836e2f5f
fixed typo
2021-02-17 19:27:07 +02:00
Claudio Atzori
e7eba9f7e7
WIP: transformation workflow error reporting; cleanup
2021-02-17 16:54:08 +01:00
Claudio Atzori
58467aaf1e
WIP: transformation workflow error reporting
2021-02-17 16:14:41 +01:00
Claudio Atzori
cc88701f29
retry for any Socket exception
2021-02-17 16:13:54 +01:00
Antonis Lempesis
a445c1ac3d
fixed variable names in monitor script
2021-02-17 16:45:09 +02:00
Antonis Lempesis
00d516360f
added missing ;
2021-02-17 16:41:10 +02:00
Claudio Atzori
545f8f3e48
using jackson objectmapper instead of GSon to serialise the aggregation report
2021-02-17 12:15:00 +01:00
Claudio Atzori
b592d78bb4
WIP: collectorWorker error reporting, generalised reported implementation
2021-02-17 10:28:01 +01:00
Antonis Lempesis
cd1b794409
added the monitor db wf
2021-02-17 02:11:55 +02:00
Claudio Atzori
cf27905a71
WIP: collectorWorker error reporting, added report messages
2021-02-16 16:53:14 +01:00
Alessia Bardi
32e81c2d89
non validated rel has null value in validated field
2021-02-16 11:01:42 +01:00
Claudio Atzori
1abe6d1ad7
WIP: collectorWorker error reporting, added report messages
2021-02-15 15:08:59 +01:00
Claudio Atzori
523a6bfa97
Merge pull request 'first commit to the correct branch' ( #94 ) from andreas.czerniak/BrAggr_dnet-hadoop:hadoop_aggregator into hadoop_aggregator
Looks good to me, thanks Andreas!
2021-02-15 12:15:31 +01:00
Antonis Lempesis
1c029b9fc0
fixed formatting
2021-02-14 03:14:24 +02:00
Antonis Lempesis
2c4dcc90ba
analyzing tables to produce stats
2021-02-14 02:54:55 +02:00
Sandro La Bruzzo
7edcc87ed4
changed xslt behaviour on failure
2021-02-12 17:27:08 +01:00
Sandro La Bruzzo
6a37c7f175
merge fixed
2021-02-12 16:38:47 +01:00
Sandro La Bruzzo
b3f5c2351d
Merge branch 'hadoop_aggregator' of code-repo.d4science.org:D-Net/dnet-hadoop into hadoop_aggregator
Conflicts:
dhp-workflows/dhp-aggregation/src/test/java/eu/dnetlib/dhp/transformation/TransformationJobTest.java
2021-02-12 16:37:14 +01:00
Sandro La Bruzzo
f216277219
Implemented cleaning date
2021-02-12 16:34:52 +01:00
Andreas Czerniak
5a9017cf18
clone, min. changes, test, run
2021-02-12 14:32:36 +01:00
Claudio Atzori
aa55dedb8a
Merge branch 'hadoop_aggregator' of https://code-repo.d4science.org/D-Net/dnet-hadoop into hadoop_aggregator
2021-02-12 12:31:05 +01:00
Claudio Atzori
29c6f7e255
classes related to the collection workflow moved into common package; implemented MongoDB collection plugins
2021-02-12 12:31:02 +01:00
Sandro La Bruzzo
17e6f1934e
fixed NPE on cleaner
2021-02-12 11:48:11 +01:00
Sandro La Bruzzo
ebcc3ec14f
updated wrong datacite identifier in transformation
2021-02-11 16:25:51 +01:00
Michele Artini
83d815d0bc
only stats
2021-02-11 10:57:23 +01:00
Michele Artini
8c836bf930
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2021-02-11 10:54:41 +01:00
Michele Artini
8c1600398a
added resumeFrom parameter
2021-02-11 10:54:16 +01:00
Claudio Atzori
3f8f78cbfb
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2021-02-11 09:36:10 +01:00
Claudio Atzori
b34b5a39ca
index field authoridtypevalue mixes up different author id-type value pairs, dropped in favour of orcidtypevalue
2021-02-11 09:36:04 +01:00
Michele Artini
7249cceb53
switch of 2 nodes
2021-02-11 09:27:08 +01:00
Alessia Bardi
986dd969d3
use the proper import for Lists
2021-02-10 12:03:54 +01:00
miconis
4b2124a18e
implementation of the openorgs wfs, implementation of the raw_all wf to migrate openorgs db entities
2021-02-10 11:51:50 +01:00
Alessia Bardi
c4d1feca74
mapper test with validated link to project
2021-02-10 11:22:54 +01:00
Alessia Bardi
09fc7e2f78
serialization of validated flag on relationships
2021-02-10 11:22:09 +01:00
Enrico Ottonello
ee4ba7298b
fix last update read/write from file on hdfs
2021-02-09 23:24:57 +01:00
Claudio Atzori
bc458d1b54
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2021-02-09 16:27:30 +01:00
Claudio Atzori
82e6c50f3f
updated solr fields (authoridtypevalue, resultsubject, resultresourcetypename)
2021-02-09 16:27:04 +01:00
Claudio Atzori
62bd3c53ee
Merge branch 'master' into provision_indexing
2021-02-09 15:46:26 +01:00
Claudio Atzori
bae029f828
collection_java_xmx allows to declare the heap size allocated for the java actions involved in the metadata collection workflow
2021-02-08 18:07:23 +01:00
Claudio Atzori
bebc54d5bf
seq file storing native records is now compressed
2021-02-08 18:06:25 +01:00
Claudio Atzori
50add4c61b
added requestDelay to HttpConnector2 configuration; Aggregation workflow constants moved in dhp-common
2021-02-08 12:19:38 +01:00
Miriam Baglioni
2f5e6647c6
merge upstream
2021-02-08 10:33:11 +01:00
Claudio Atzori
40df0f987d
better logging, WIP: collectorWorker error reporting; common functions moved in DHPUtils
2021-02-06 20:12:00 +01:00
Claudio Atzori
a8a758925e
better logging, WIP: collectorWorker error reporting
2021-02-05 19:18:05 +01:00
Claudio Atzori
730973679a
Merge branch 'hadoop_aggregator' of https://code-repo.d4science.org/D-Net/dnet-hadoop into hadoop_aggregator
2021-02-04 17:25:00 +01:00
Claudio Atzori
deb85706db
imported HttpConnector from https://svn.driver.research-infrastructures.eu/driver/dnet45/modules/dnet-modular-collector-service/trunk/src/main/java/eu/dnetlib/data/collector/plugins/HttpConnector.java as HttpConnector2
2021-02-04 17:24:52 +01:00
Sandro La Bruzzo
4dae5e605d
implemented messaging between collection worker and Dnet
2021-02-04 15:51:15 +01:00
Claudio Atzori
72c57b28fa
switched project version to 1.2.4-branch_hadoop_aggregator-SNAPSHOT
2021-02-04 14:08:18 +01:00
Claudio Atzori
40764cf626
better logging, WIP: collectorWorker error reporting
2021-02-04 14:06:02 +01:00
Enrico Ottonello
c238561001
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi
2021-02-04 10:44:21 +01:00
Enrico Ottonello
465ce39f75
job execution now based on file last_update.txt on hdfs
2021-02-04 10:44:04 +01:00
Sandro La Bruzzo
69c253710b
fixed test
2021-02-04 10:30:49 +01:00
Claudio Atzori
e04045089f
better logging, WIP: collectorWorker error reporting
2021-02-03 17:58:22 +01:00
Alessia Bardi
c67329d3ad
updated test for EU Open Data portal datasets
2021-02-03 17:06:48 +01:00
Claudio Atzori
0e8a4f9f1a
better logging, WIP: collectorWorker error reporting
2021-02-03 12:33:41 +01:00
Alessia Bardi
fd705404a1
tests for EU Open Data portal dataset mapping
2021-02-03 10:28:17 +01:00
Miriam Baglioni
6190465851
merge upstream
2021-02-03 10:27:27 +01:00
Claudio Atzori
53884d12c2
code formatting
2021-02-02 14:38:03 +01:00
Claudio Atzori
ac46c247d2
code formatting
2021-02-02 14:24:00 +01:00
Claudio Atzori
bde14b149a
fixed transformation target paths
2021-02-02 12:49:29 +01:00
Claudio Atzori
ca4391aa1c
minor changes
2021-02-02 12:44:04 +01:00
Claudio Atzori
bb89b99b24
code formatting
2021-02-02 12:34:14 +01:00
Claudio Atzori
75807ea5ae
factored out constants
2021-02-02 12:28:21 +01:00
Sandro La Bruzzo
0634674add
implemented transformation test
2021-02-02 12:12:14 +01:00
Claudio Atzori
8eaa1fd4b4
WIP: metadata collection in INCREMENTAL mode and relative test
2021-02-01 19:29:10 +01:00
Sandro La Bruzzo
bead34d11a
code refactor
2021-02-01 14:58:06 +01:00
Sandro La Bruzzo
6ff234d81b
Implemented a first prototype of incremental harvesting and transformation using readlock
2021-02-01 13:56:05 +01:00
Sandro La Bruzzo
b6b835ef49
update transformation Factory to get Transformation Rule by Id and not by Title
2021-02-01 08:49:42 +01:00
Sandro La Bruzzo
e423634cb6
RollBack in case of error WORKS!!!
2021-01-29 17:21:42 +01:00
Sandro La Bruzzo
8ee82576c6
Collection on Refresh WORKS!!!
2021-01-29 17:02:46 +01:00
Sandro La Bruzzo
0276180039
WIP mdstore
...
transaction implemented on hadoop side
2021-01-29 16:42:41 +01:00
Sandro La Bruzzo
0f8e2ecce6
Merged Datacite transform into this branch
2021-01-29 10:45:07 +01:00
Sandro La Bruzzo
99cf3a8ea4
Merged Datacite transform into this branch
2021-01-28 16:34:46 +01:00
Sandro La Bruzzo
686e7b507c
Merge branch 'hadoop_aggregator' of code-repo.d4science.org:D-Net/dnet-hadoop into aggregation_on_hadoop
2021-01-28 10:02:13 +01:00
Sandro La Bruzzo
98b9498b57
Removed old messaging system not quite used from collection and Transformation workflow
code refactor
2021-01-28 09:51:17 +01:00
Sandro La Bruzzo
184e7b3856
Implemented new Transformation using spark
2021-01-27 15:43:08 +01:00
Sandro La Bruzzo
150a617bd1
Merge pull request 'aggregation_on_hadoop' ( #90 ) from sandro.labruzzo/dnet-hadoop:aggregation_on_hadoop into hadoop_aggregator
Wonderful code... You're the Best Sandro
2021-01-26 16:00:47 +01:00
Claudio Atzori
f1a852f278
align usage-stats workflow poms with latest snapshot version
2021-01-26 15:42:42 +01:00
Claudio Atzori
9c32119dc2
Merge pull request 'usage-stats-export-wf-v2' ( #89 ) from dimitris.pierrakos/dnet-hadoop:usage-stats-export-wf-v2 into master
Thank you Dimitris!
2021-01-26 15:01:41 +01:00
Claudio Atzori
885e0dd926
[Cleaning] filter authors not providing word characters in the fullname
2021-01-26 09:48:53 +01:00
Claudio Atzori
2890511613
[Cleaning] normalise missing Result.country
2021-01-26 09:41:44 +01:00
Claudio Atzori
4eb9ed35b1
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2021-01-25 18:12:24 +01:00
Claudio Atzori
cd379eb5e3
[Cleaning] trying to avoid NPEs, this time by ruling out authors without a defined fullname
2021-01-25 18:11:49 +01:00
Alessia Bardi
505477f36f
format code
2021-01-25 18:02:49 +01:00
Alessia Bardi
ded6ed8d7d
no ',' author, if there are no authors in ODF records
2021-01-25 17:57:51 +01:00
Claudio Atzori
3465c8ccee
[Cleaning] trying to avoid NPEs
2021-01-25 16:54:53 +01:00
Sandro La Bruzzo
a54848a59c
Moved Vocabulary stuff to common module
2021-01-25 15:43:04 +01:00
Sandro La Bruzzo
ffb092b8d3
removed duplicate code HttpConnector.java
2021-01-25 15:05:37 +01:00
Sandro La Bruzzo
cda210a2ca
changed documentation since it didn't reflect the current status
2021-01-25 14:17:42 +01:00
Claudio Atzori
07a0ccfc96
[Cleaning] trying to avoid NPEs
2021-01-25 13:36:01 +01:00
miconis
c7e2d5a59a
minor changes
2021-01-25 12:40:45 +01:00
Claudio Atzori
34d653de41
[Cleaning] updated cleaning rule for DOIs
2021-01-22 14:16:33 +01:00
Miriam Baglioni
fe36895c53
added datasource blacklist for the organization to result propagation through institutional repositories
2021-01-22 11:55:10 +01:00
miconis
8fea29177c
refactoring, minor changes and implementation of the wf for openorgs with integration of organization phases into the scan wf
2021-01-18 16:48:08 +01:00
Dimitris
3e8d2a6b2d
Clean workflows
2021-01-15 16:19:12 +02:00
Michele Artini
cfbcdc95bc
fixed a wf param
2021-01-14 14:45:23 +01:00
Michele Artini
69ba3203c0
fixed a conflict
2021-01-14 14:43:25 +01:00
Michele Artini
b230d44411
fixed conflict
2021-01-14 14:32:31 +01:00
Michele Artini
b9d90e95b8
Added eventId to ShortEventMessage
2021-01-14 14:32:31 +01:00
Michele Artini
64b0b0bfb3
fixed a bug with invalid subject topic
2021-01-14 14:32:31 +01:00
Michele Artini
e3e0ab1de1
fixed a problem with join
2021-01-14 14:32:31 +01:00
Michele Artini
26a941315a
openaireId
2021-01-14 14:32:31 +01:00
Michele Artini
6f4d1a37f0
ES wf properties
2021-01-14 14:32:31 +01:00
Michele Artini
1391341d06
mkdir of output dir
2021-01-14 14:32:31 +01:00
Michele Artini
3c9cbd19f3
whitelist of topics
2021-01-14 14:32:31 +01:00
Michele Artini
467aa77279
workingDir and outputDir
2021-01-14 14:32:31 +01:00
Michele Artini
10f3f7eca7
workingDir and outputDir
2021-01-14 14:32:31 +01:00
Michele Artini
ff41a7b3a4
gzipped output
2021-01-14 14:32:31 +01:00
Claudio Atzori
80cf55ef2e
[Broker] fixed partitionEventsByOpendoarIds workflow parameter names
2021-01-13 16:24:30 +01:00
Claudio Atzori
41500669e2
[BIP! Scores integration] merged missing classes from bipFinder branch
2021-01-11 14:39:47 +01:00
Claudio Atzori
2a7a10809e
[BIP! Scores integration] merged missing classes from bipFinder branch
2021-01-11 10:05:02 +01:00
Claudio Atzori
d6686dd7cf
merged from master
2021-01-08 18:16:12 +01:00
Claudio Atzori
34229970e6
[BIP! Scores integration] Create updates as Result rather than subclasses; Result considers also metrics in the mergeFrom operation
2021-01-08 16:29:17 +01:00
Claudio Atzori
1361c9eb0c
[BIP! Scores integration] Create updates as Result rather than subclasses; Result considers also metrics in the mergeFrom operation
2021-01-07 10:07:30 +01:00
Claudio Atzori
ab2fe9266a
[DOIBoost] minor fixes in workflow definition
2021-01-05 10:26:39 +01:00
Claudio Atzori
7c722f3fdc
[DOIBoost] fixed typo
2021-01-05 10:25:54 +01:00
Claudio Atzori
8879704ba0
[DOIBoost] configurable ES server url and index name in crossref importer
2021-01-05 10:00:13 +01:00
Claudio Atzori
26e9d55c13
code formatting
2021-01-05 09:59:26 +01:00
Sandro La Bruzzo
7834a35768
avoid saving the intermediate dataset before generating the Sequence file
2021-01-04 17:54:57 +01:00
Sandro La Bruzzo
e79445a8b4
minor fix for claudio polemica
2021-01-04 17:39:25 +01:00
Sandro La Bruzzo
8765020b85
minor fix
2021-01-04 17:37:08 +01:00
Sandro La Bruzzo
b0dc92786f
defined a single oozie workflow for the generation of doiboost
2021-01-04 17:01:35 +01:00
Claudio Atzori
7185158942
ignore missing properties
2020-12-29 11:06:28 +01:00
Claudio Atzori
28460c2cd1
using com.fasterxml.jackson.databind.ObjectMapper instead of org.codehaus.jackson.map.ObjectMapper
2020-12-23 16:59:52 +01:00
Claudio Atzori
60649ac7d2
swapped expected and actual in tests, updated expected number of authors
2020-12-23 12:26:04 +01:00
Claudio Atzori
723b01f9e9
trivial: the less magic numbers and values around, the better
2020-12-23 12:22:48 +01:00
Claudio Atzori
7bfc35df5e
Merge pull request 'Changed typo in script names' ( #82 ) from antonis.lempesis/dnet-hadoop:master into master
no need to! :)
2020-12-22 12:36:21 +01:00
Antonis Lempesis
be5969a8c2
Changed typo in script names
2020-12-22 13:33:32 +02:00
miconis
1e1aab83e3
implementation of the raw wf for openorgs: still not complete, some functionalities are missing
2020-12-21 11:58:21 +01:00
Claudio Atzori
6cb0dc3f43
extended ORCID cleaning procedure
2020-12-21 11:40:17 +01:00
Claudio Atzori
573a8a3272
Merge pull request 'Changed typo in script names' ( #81 ) from antonis.lempesis/dnet-hadoop:master into master
ok! LGTM
2020-12-18 17:44:26 +01:00
Antonis Lempesis
2a074c3b2b
Changed typo in script names
2020-12-18 18:40:48 +02:00
Claudio Atzori
47270d9af5
lenient mock can be lenient
2020-12-18 15:38:59 +01:00
Claudio Atzori
2e503ee101
code formatting
2020-12-17 13:47:38 +01:00
Claudio Atzori
5a3e2199b2
Merge pull request 'Creation of the action set to include the bipFinder! score' ( #80 ) from miriam.baglioni/dnet-hadoop:bipFinder into bipFinder_master_test
2020-12-17 12:26:38 +01:00
Claudio Atzori
03319d3bd9
Revert "Merge pull request 'Creation of the action set to include the bipFinder! score' ( #62 ) from miriam.baglioni/dnet-hadoop:bipFinder into master"
This reverts commit add7e1693b, reversing changes made to f9a8fd8bbd.
2020-12-17 12:23:58 +01:00
Claudio Atzori
add7e1693b
Merge pull request 'Creation of the action set to include the bipFinder! score' ( #62 ) from miriam.baglioni/dnet-hadoop:bipFinder into master
2020-12-17 12:09:03 +01:00
Alessia Bardi
f9a8fd8bbd
updated test record for textgrid
2020-12-17 11:59:45 +01:00
Claudio Atzori
4766495f5b
[orcid_to_result_from_semrel_propagation] fixed typo in SQL
2020-12-17 09:15:50 +01:00
Claudio Atzori
de00094ebc
Merge pull request 'FIX on the creation of subject based broker enrichments' ( #79 ) from broker into master
2020-12-15 14:58:31 +01:00
Michele Artini
f9dc1e45fd
fixed a bug with invalid subject topic
2020-12-15 14:54:11 +01:00
Sandro La Bruzzo
f92bd56f56
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-12-15 11:47:29 +01:00
Sandro La Bruzzo
1f6c8a9e83
added orcid_pending type to records coming from Crossref
2020-12-15 11:47:15 +01:00
Enrico Ottonello
b2de598c1a
all actions from download lambda file to merge updated data into one wf
2020-12-15 10:42:55 +01:00
Claudio Atzori
9f1181290e
Merge pull request 'broker' ( #78 ) from broker into master
The changes look good to me.
2020-12-15 10:03:45 +01:00
Michele Artini
0a0f62bd01
Merge branch 'master' into broker
2020-12-15 08:30:52 +01:00
Michele Artini
12fa5d122a
fixed a problem with join
2020-12-15 08:30:26 +01:00
Michele Artini
991e675dc6
validation in claim rels
2020-12-14 15:41:25 +01:00
Michele Artini
3e19cf7b4a
openaireId
2020-12-14 15:24:33 +01:00
Claudio Atzori
b6f08ce226
re-adding the old junit:junit dep as solr-test-framework needs it
2020-12-14 15:07:31 +01:00
Claudio Atzori
7d325e2c57
using actual result subclasses instead of their parent class
2020-12-14 14:40:54 +01:00
Claudio Atzori
152916890f
renamed test name
2020-12-14 14:40:05 +01:00
Michele Artini
a203aee32a
ES wf properties
2020-12-14 12:02:33 +01:00
Claudio Atzori
1506f49052
Xml record serialization for author PIDs: 1) only one value per PID type is allowed; 2) orcid prevails over orcid_pending
2020-12-14 11:14:03 +01:00
Michele Artini
d03756c962
mkdir of output dir
2020-12-14 11:11:41 +01:00
Michele Artini
399548f221
whitelist of topics
2020-12-14 11:03:55 +01:00
Michele Artini
38da1c282a
Merge branch 'master' into broker
2020-12-14 09:14:02 +01:00
Dimitris
dc9c2f3272
Commit 12122020
2020-12-12 12:00:14 +02:00
Enrico Ottonello
efe4c2a9c5
authors and works are now updated in two separate spark actions of the wf
2020-12-12 02:06:21 +01:00
Enrico Ottonello
858efbfad1
fix dataset creation for downloaded works
2020-12-11 16:49:54 +01:00
Claudio Atzori
61cd129ded
XML serialisation test
2020-12-11 12:44:53 +01:00
Claudio Atzori
ce7a319e01
using the correct assertion import
2020-12-11 12:44:17 +01:00
Claudio Atzori
7fe2433137
excluded transitive older junit dependencies, they can compromise the unit test executions
2020-12-11 12:42:55 +01:00
Claudio Atzori
d9532446eb
imported more diffs from master branch; code formatting
2020-12-10 16:14:16 +01:00
Claudio Atzori
1eaad89a3c
do not fail on unknown properties when grouping entities by ID
2020-12-10 15:56:11 +01:00
Michele Artini
933b4c1ada
workingDir and outputDir
2020-12-10 14:47:51 +01:00
Michele Artini
2e7df07328
workingDir and outputDir
2020-12-10 14:47:22 +01:00
Michele Artini
94bfed1c84
gzipped output
2020-12-10 11:59:28 +01:00
Claudio Atzori
12e2f930c8
resolved conflicts
2020-12-10 10:57:39 +01:00
Miriam Baglioni
b7adbc7c3e
merge branch with master
2020-12-10 10:35:27 +01:00
Alessia Bardi
112da6d76a
in theory, just auto-formatting after mvn compile
2020-12-09 20:00:27 +01:00
Alessia Bardi
bece04b330
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-12-09 19:54:43 +01:00
Alessia Bardi
426b76ee8e
more asserts for TextGrid record
2020-12-09 19:46:11 +01:00
Claudio Atzori
ff72fcd91a
allow orcid_pending to percolate to the XML graph serialization
2020-12-09 19:04:50 +01:00
Claudio Atzori
4705144918
Merge pull request 'rel_project_validation' ( #69 ) from rel_project_validation into master
LGTM
2020-12-09 19:01:20 +01:00
Claudio Atzori
211aa04726
allow orcid_pending to percolate to the XML graph serialization
2020-12-09 18:08:51 +01:00
Claudio Atzori
ada21ad920
Merge pull request 'dump of the results related to at least one project' ( #61 ) from miriam.baglioni/dnet-hadoop:dump into master
LGTM
2020-12-09 17:22:56 +01:00
Claudio Atzori
3c5ce1dada
code formatting
2020-12-09 17:07:20 +01:00
Michele Artini
1bc9adc10d
default trust for validated rels
2020-12-09 16:18:37 +01:00
Claudio Atzori
fcd7689b50
promote actions: shouldGroupById parameter marked as optional (default is true)
2020-12-09 13:10:16 +01:00
Michele Artini
5f21a356fd
reindent
2020-12-09 11:24:30 +01:00
Michele Artini
370a5e650b
validation attributes in resultProject relations
2020-12-09 11:18:26 +01:00
Antonis Lempesis
aead9efd24
added the new parameter (stats_tool_api_url) in the workflow parameters
2020-12-09 10:45:24 +01:00
Antonis Lempesis
77a3a6d82e
added the new parameter (stats_tool_api_url) in the workflow parameters
2020-12-09 10:45:24 +01:00
Antonis Lempesis
91226117b3
ignoring deletedbyinference relations
2020-12-09 10:45:24 +01:00
Antonis Lempesis
b7f29db126
finished first implementation of wf
2020-12-09 10:45:24 +01:00
Antonis Lempesis
ded2392275
initial implementation of the promote wf
2020-12-09 10:45:24 +01:00
Antonis Lempesis
1a87a1effd
added last step to update cache
2020-12-09 10:45:24 +01:00
Enrico Ottonello
2233750a37
original orcid xml data are stored in a field of the class that models orcid data
2020-12-09 09:45:19 +01:00
Claudio Atzori
27e96767e0
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-12-07 21:53:22 +01:00
Claudio Atzori
fba11eef2a
cleanup
2020-12-07 21:53:13 +01:00