antleb
68389d0125
Corrected the script used by the last step of the wf
2020-07-24 19:50:40 +03:00
antleb
ec52141f1a
changed refereed type from value to classname
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
63cd797aba
Comment out step 15 to make it work with the new schema of Claudio
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
138c6ddffa
Insert statement to datasource table that takes into account the piwik_id of the openAIRE graph
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
3630794cef
Fix to consider the relationships that have been 'virtually deleted' for project_results - defect #5607
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
5546f29e63
Corrections on the shadow schema and the impala table stats calculation
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
adf8a025d2
Adding more relations (Sources, Licences, Additional) and shadow schema as provided and discussed with Antonis Lempesis
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
657a40536b
Corrections by Spyros: Script cleanup, corrections and re-arrangement
2020-07-24 19:50:40 +03:00
Giorgos Alexiou
477fa6234d
Script re-organisation and adding table invalidations needed for impala
2020-07-24 19:50:40 +03:00
Miriam Baglioni
6c2223d1fc
added code to get the openaire id for contexts
2020-07-24 17:30:15 +02:00
Miriam Baglioni
afd54c1684
removed unneeded upload and refactoring
2020-07-24 17:28:56 +02:00
Miriam Baglioni
7b0569d989
changed to map also the result associated to the whole graph
2020-07-24 17:28:11 +02:00
Miriam Baglioni
082225ad61
-
2020-07-24 17:27:26 +02:00
Miriam Baglioni
968c59d97a
added the logic to dump also the products for the whole graph. They will miss collected from and context information that will be materialized as new relations
2020-07-24 17:25:19 +02:00
Miriam Baglioni
332258d199
split the classes related to the communities dump and to the whole graph dump
2020-07-24 17:21:48 +02:00
Claudio Atzori
56bbfdc65d
introduced parameter 'numPartitions', driving the hive DB table data partitioning. Currently specified only for table 'project'
2020-07-23 08:54:10 +02:00
miconis
b260fee787
implementation of the dedup_id generation using pids to make the graph more stable
2020-07-22 17:29:48 +02:00
Sandro La Bruzzo
9ab594ccf6
fixed test
2020-07-21 10:36:21 +02:00
Claudio Atzori
ebf60020ac
map results as OPRs in case of missing //CobjCategory/@type and the vocabulary dnet:result_typologies doesn't resolve the super type
2020-07-20 19:01:10 +02:00
Miriam Baglioni
355d7e426e
added dump for project - not finished
2020-07-20 18:54:43 +02:00
Miriam Baglioni
a2f01e5259
added getter and setter
2020-07-20 18:54:17 +02:00
Miriam Baglioni
40bbe94f7c
merge with master fork
2020-07-20 18:10:03 +02:00
Miriam Baglioni
2a15494b16
merge upstream
2020-07-20 18:05:01 +02:00
Miriam Baglioni
23160b4d29
realignment of the workflow classes with the changes in the structure of the module
2020-07-20 18:04:30 +02:00
Miriam Baglioni
b904e0699a
-
2020-07-20 18:02:53 +02:00
Miriam Baglioni
3aab7680f6
changed the test results
2020-07-20 18:00:43 +02:00
Miriam Baglioni
cde0300801
moved from projects to project
2020-07-20 17:57:35 +02:00
Miriam Baglioni
5076e4f320
changed test to comply with the modifications
2020-07-20 17:55:18 +02:00
Miriam Baglioni
08dbd99455
changed to dump the whole results graph by using classes already implemented for communities. Added class to dump also organization
2020-07-20 17:54:28 +02:00
Miriam Baglioni
e47ea9349c
extended some types by adding provenance as the couple (provenance, trust) and moved some classes to be used by the complete graph dump also
2020-07-20 17:46:27 +02:00
Claudio Atzori
32f5e466e3
imports cleanup
2020-07-20 17:42:58 +02:00
Claudio Atzori
54ac583923
code formatting
2020-07-20 17:37:08 +02:00
Claudio Atzori
124e7ce19c
in case of missing attribute //dr:CobjCategory/@type the resulttype is derived by looking up the vocabulary dnet:result_typologies with the 1st instance type available
2020-07-20 17:33:37 +02:00
Claudio Atzori
050dda223d
Merge pull request 'removed duplicated fields' ( #25 ) from unique_field_in_lists into master
...
Looks good as a temporary workaround. I agree the model could seamlessly make the distinct operation by using HashSets instead of Linked (or Array) Lists.
The task to update the model in such a way is added on #9#issuecomment-1583
Thanks!
2020-07-20 12:12:50 +02:00
Claudio Atzori
e0c4cf6f7b
added parameter to drive the graph merge strategy: priority (BETA|PROD)
2020-07-20 10:48:01 +02:00
Claudio Atzori
94ccdb4852
Merge branch 'master' into merge_graph
2020-07-20 10:14:55 +02:00
Claudio Atzori
0937c9998f
Merge branch 'deduptesting'
2020-07-20 10:00:20 +02:00
Claudio Atzori
de72b1c859
cleanup
2020-07-20 09:59:11 +02:00
Michele Artini
331a3cbdd0
fixed originalId
2020-07-20 09:50:29 +02:00
Michele Artini
c59c5369b1
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-18 09:40:54 +02:00
Michele Artini
346a1d2b5a
update eventId generator
2020-07-18 09:40:36 +02:00
Sandro La Bruzzo
9116d75b3e
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-17 18:01:30 +02:00
Miriam Baglioni
d7d84c8217
-
2020-07-17 14:03:23 +02:00
Miriam Baglioni
47c7122773
changed priority from beta to production
2020-07-17 12:56:35 +02:00
Michele Artini
442f30930c
removed duplicated fields
2020-07-17 12:25:36 +02:00
Michele Artini
3adedd0a68
trust truncated to 3 decimals
2020-07-17 11:58:11 +02:00
Claudio Atzori
1781609508
code formatting
2020-07-16 19:06:56 +02:00
Claudio Atzori
db8b90a156
renamed CORE -> BETA
2020-07-16 19:05:13 +02:00
Miriam Baglioni
44e1c40c42
merge upstream
2020-07-16 18:49:38 +02:00
Claudio Atzori
878f2b931c
Merge branch 'master' into merge_graph
2020-07-16 16:34:24 +02:00
Claudio Atzori
cc5d13da85
introduced parameter shouldIndex (true|false)
2020-07-16 13:46:39 +02:00
Claudio Atzori
b098cc3cbe
avoid repeating identical values for fields: source, description
2020-07-16 13:45:53 +02:00
Claudio Atzori
805de4eca1
fix: filter the blocks with size = 1
2020-07-16 10:11:32 +02:00
Claudio Atzori
4b9fb2ffb8
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-07-15 11:26:04 +02:00
Claudio Atzori
b90389bac4
code formatting
2020-07-15 11:24:48 +02:00
Claudio Atzori
4e6f46e8fa
filter blocks with one record only
2020-07-15 11:22:20 +02:00
Michele Artini
262c29463e
relations with multiple datasources
2020-07-15 09:18:40 +02:00
Claudio Atzori
7d6e269b40
reverted CreateRelatedEntitiesJob_phase1 to its previous state
2020-07-13 22:54:04 +02:00
Claudio Atzori
8e97598eb4
avoid NPE in case of null instances
2020-07-13 20:46:14 +02:00
Claudio Atzori
06def0c0cb
SparkBlockStats allows repartitioning the input rdd via the numPartitions workflow parameter
2020-07-13 20:09:06 +02:00
miconis
b52c246aed
merge done
2020-07-13 19:57:02 +02:00
miconis
b8a45041fd
minor changes
2020-07-13 19:53:18 +02:00
Claudio Atzori
66f9f6d323
adjusted parameters for the dedup stats workflow
2020-07-13 19:26:46 +02:00
miconis
03ecfa5ebd
implementation of the test class for the new block stats spark action
2020-07-13 18:48:23 +02:00
miconis
10e08ccf45
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-13 18:22:45 +02:00
miconis
9258e4f095
implementation of a new workflow to compute statistics on the blocks
2020-07-13 18:22:34 +02:00
Claudio Atzori
c6f6fb0f28
code formatting
2020-07-13 16:46:13 +02:00
Claudio Atzori
8d2102d7d2
Merge branch 'deduptesting'
2020-07-13 16:32:43 +02:00
Claudio Atzori
344a90c2e6
updated assertions in propagateRelationTest
2020-07-13 16:32:04 +02:00
Claudio Atzori
1143f426aa
WIP SparkCreateMergeRels distinct relations
2020-07-13 16:13:36 +02:00
Claudio Atzori
8c67938ad0
configurable number of partitions used in the SparkCreateSimRels phase
2020-07-13 16:07:07 +02:00
Claudio Atzori
c73168b18e
Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting
2020-07-13 15:54:58 +02:00
Claudio Atzori
c8284bab06
WIP SparkCreateMergeRels distinct relations
2020-07-13 15:54:51 +02:00
Sandro La Bruzzo
1d133b7fe6
update test
2020-07-13 15:52:41 +02:00
Michele Artini
3635d05061
poms
2020-07-13 15:52:23 +02:00
Claudio Atzori
7dd91edf43
parsing of optional parameter
2020-07-13 15:40:41 +02:00
Claudio Atzori
4c101a9d66
WIP SparkCreateMergeRels distinct relations
2020-07-13 15:31:38 +02:00
Claudio Atzori
8a612d861a
WIP SparkCreateMergeRels distinct relations
2020-07-13 15:30:57 +02:00
Sandro La Bruzzo
9ef2385022
implemented test for cut of connected component
2020-07-13 15:28:17 +02:00
Sandro La Bruzzo
d561b2dd21
implemented cut of connected component
2020-07-13 14:18:42 +02:00
Miriam Baglioni
8e0e090d7a
merge upstream
2020-07-13 12:46:55 +02:00
Claudio Atzori
e2093e42db
Merge branch 'master' into deduptesting
2020-07-13 10:57:49 +02:00
Michele Artini
2c4ed9a043
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-13 10:55:39 +02:00
Michele Artini
ccbe5c5658
fixed import of eu.dnetlib.dhp:dnet-openaire-broker-common
2020-07-13 10:55:27 +02:00
Claudio Atzori
7a3fd9f54c
dedup relation aggregator moved into dedicated class
2020-07-13 10:11:36 +02:00
Alessia Bardi
7e96105947
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-07-12 19:29:12 +02:00
Alessia Bardi
b7a39731a6
assert, not print
2020-07-12 19:28:56 +02:00
Miriam Baglioni
f9ad6f3255
Merge branch 'dump' of code-repo.d4science.org:miriam.baglioni/dnet-hadoop into dump
2020-07-10 19:42:53 +02:00
Miriam Baglioni
c27f12d6e8
avoid considering the _SUCCESS file
2020-07-10 19:42:23 +02:00
Claudio Atzori
770adc26e9
WIP aggregator to make relationships unique
2020-07-10 19:35:10 +02:00
Claudio Atzori
ecf119f37a
Merge branch 'master' into deduptesting
2020-07-10 19:04:16 +02:00
Claudio Atzori
31071e363f
Merge branch 'provision_indexing'
2020-07-10 19:03:57 +02:00
Claudio Atzori
06c1913062
added different limits for grouping by source and by target, incremented spark.sql.shuffle.partitions for the join operations
2020-07-10 19:03:33 +02:00
Claudio Atzori
cc77446dc4
added dbSchema parameter to the raw_db workflow
2020-07-10 19:01:50 +02:00
Claudio Atzori
4c3836f62e
materialize the related entities before joining them
2020-07-10 19:00:44 +02:00
Michele Artini
e1ae964bc4
stats
2020-07-10 16:12:08 +02:00
Claudio Atzori
752d28f8eb
make the relations produced by the dedup SparkPropagateRelation job unique
2020-07-10 15:09:50 +02:00
Sandro La Bruzzo
c01efed79b
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-10 14:44:57 +02:00
Sandro La Bruzzo
a7d3977481
added generation of EBI Dataset
2020-07-10 14:44:50 +02:00
Claudio Atzori
b21866a2da
allow setting different relation cut points by source and by target; adjusted weight assigned to relationship types
2020-07-10 13:59:48 +02:00
Claudio Atzori
ff4d6214f1
experimenting with pruning of relations
2020-07-10 10:06:41 +02:00
Miriam Baglioni
faea30cda0
-
2020-07-09 14:05:21 +02:00
Michele Artini
2d742a84ae
DedupConfig as json file
2020-07-09 12:53:46 +02:00
Miriam Baglioni
a634794242
merge upstream
2020-07-09 11:46:51 +02:00
Michele Artini
a44b9b36b9
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-09 11:02:31 +02:00
Michele Artini
1c6a171633
updated pom
2020-07-09 11:02:09 +02:00
Claudio Atzori
3c728aaa0c
trying to overcome OOM errors during duplicate scan phase
2020-07-08 22:39:51 +02:00
Claudio Atzori
18c555cd79
Merge branch 'master' into deduptesting
2020-07-08 22:32:01 +02:00
Claudio Atzori
4365cf41d7
trying to overcome OOM errors during duplicate scan phase
2020-07-08 22:31:46 +02:00
Claudio Atzori
67e1d222b6
bulk cleaning when found null or empty, sets bestaccessrights evaluating the result instances
2020-07-08 17:53:35 +02:00
Alessia Bardi
853e8d7987
test for software merge
2020-07-08 17:03:53 +02:00
Claudio Atzori
610d377d57
first implementation of the BETA & PROD graphs merge procedure
2020-07-08 16:54:26 +02:00
Alessia Bardi
9a898c0e4c
Json schema generator
2020-07-08 12:52:00 +02:00
Alessia Bardi
636f9ce7d6
json schema generator lib
2020-07-08 12:50:57 +02:00
Alessia Bardi
8f83b726fa
Dump json schema compliant to json schema Draft 7
2020-07-08 12:48:46 +02:00
Claudio Atzori
e2ea30f89d
updated graph construction workflow definition: cleaning wf moved at the bottom to include cleaning of the information produced by the enrichment workflows
2020-07-08 12:16:24 +02:00
Miriam Baglioni
1b0b968548
fixed issue on substring
2020-07-08 12:11:51 +02:00
Miriam Baglioni
7fe00cb4fb
-
2020-07-08 10:29:37 +02:00
Miriam Baglioni
375ef07d7b
changed the description for the upload
2020-07-07 18:41:27 +02:00
Miriam Baglioni
35c8265793
added the json extension to filename
2020-07-07 18:29:49 +02:00
Miriam Baglioni
81434f8e5e
added method newInstance
2020-07-07 18:26:10 +02:00
Miriam Baglioni
817cddfc52
-
2020-07-07 18:25:12 +02:00
Miriam Baglioni
a66aa9bd83
removed unneeded tests
2020-07-07 18:25:00 +02:00
Miriam Baglioni
9b20a21b24
removed unneeded tests
2020-07-07 18:23:37 +02:00
Miriam Baglioni
8a1b42ff21
added check to verify that dump contains at least one product
2020-07-07 18:21:35 +02:00
Miriam Baglioni
d86adb82a7
-
2020-07-07 18:20:51 +02:00
Miriam Baglioni
b2782025f6
enabled the whole workflow to run. Added property to give priority to dependency in the classpath - to solve conflicts
2020-07-07 18:10:47 +02:00
Miriam Baglioni
83d2c84b77
added constraints to xquery so as to get only profiles with status manager or all
2020-07-07 18:09:48 +02:00
Miriam Baglioni
4c8d86493c
-
2020-07-07 18:09:06 +02:00
Miriam Baglioni
0208bc18f3
added new resource for testing
2020-07-07 17:47:24 +02:00
Miriam Baglioni
f5bb65c9ef
the json schema for the dump of the results
2020-07-07 17:34:40 +02:00
Michele Artini
dffa0b01a2
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-07 15:37:29 +02:00
Michele Artini
efadbdb2bc
fixed a bug with duplicated events
2020-07-07 15:37:13 +02:00
Claudio Atzori
8af8e7481a
code formatting
2020-07-07 14:23:34 +02:00
Claudio Atzori
b383ed42fa
pass optional parameter relationFilter to the PrepareRelationJob implementation
2020-07-07 14:21:28 +02:00
Claudio Atzori
911894a987
Merge branch 'deduptesting'
2020-07-07 14:20:43 +02:00
Miriam Baglioni
c19818a3f8
merge branch with fork master
2020-07-06 13:58:23 +02:00
Miriam Baglioni
d22240c0ba
merge upstream
2020-07-06 13:58:02 +02:00
Enrico Ottonello
ca37d3427b
separate workflow to parse orcid summaries, activities and generate dataset with no doi publications; test
2020-07-03 23:30:31 +02:00
Michele Artini
edf6c6c4dc
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-03 11:48:24 +02:00
Michele Artini
04bebb708c
some fixes
2020-07-03 11:48:12 +02:00
Enrico Ottonello
1729cc5cf3
publication conversion from json to oaf test
2020-07-02 18:46:20 +02:00
Claudio Atzori
c3d67f709a
adjusted dedup configuration for result entities: using new wordssuffixprefix clustering function, removed ngrampairs, adjusted queueMaxSize (800) and slidingWindowSize (80)
2020-07-02 17:35:22 +02:00
Miriam Baglioni
f8bf4acd76
-
2020-07-02 16:03:11 +02:00
Miriam Baglioni
e6c79d44e6
-
2020-07-02 16:02:02 +02:00
Miriam Baglioni
d7f6f0c216
changed code to use other lib
2020-07-02 16:01:34 +02:00
Miriam Baglioni
8fdc9e070c
added dependency to OkHttp
2020-07-02 16:01:08 +02:00
Miriam Baglioni
94500a581b
merge branch with fork master
2020-07-02 14:25:39 +02:00
Miriam Baglioni
c133a23cf0
merge upstream
2020-07-02 14:24:57 +02:00
Claudio Atzori
1d39f7901c
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-07-02 12:45:01 +02:00
Claudio Atzori
0f77cac4b5
fix: deduper must use queueMaxSize instead of groupMaxSize for the block definition
2020-07-02 12:43:51 +02:00
Sandro La Bruzzo
18b9330312
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-02 12:43:19 +02:00
Michele Artini
b413db0bff
white/blacklists
2020-07-02 12:43:03 +02:00
Claudio Atzori
d380b85246
unit test for the preparation of the relations
2020-07-02 12:42:13 +02:00
Claudio Atzori
ed1c7e5d75
fixed workflow for the import of the claims alone
2020-07-02 12:40:21 +02:00
Sandro La Bruzzo
07f0723fa7
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-02 12:37:49 +02:00
Sandro La Bruzzo
1d420eedb4
added generation of EBI Dataset
2020-07-02 12:37:43 +02:00
Claudio Atzori
e4a29a4513
fixed workflow for the import of the claims alone
2020-07-02 12:36:33 +02:00
Enrico Ottonello
5525f57ec8
converter from orcid work json to oaf
2020-07-01 18:36:14 +02:00
Michele Artini
3bcdfbabe9
list with limits
2020-07-01 08:42:39 +02:00
Michele Artini
59a5421c24
indexing, accumulators, limited lists
2020-06-30 16:17:09 +02:00
Enrico Ottonello
b7b6be12a5
fixed enriched works generation
2020-06-29 18:03:16 +02:00
Michele Artini
6f13673464
accumulators
2020-06-29 16:33:32 +02:00
Sandro La Bruzzo
dab783b173
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-29 09:05:00 +02:00
Michele Artini
a6ea432435
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-29 08:44:20 +02:00
Michele Artini
35ae381d28
all events matchers
2020-06-29 08:43:56 +02:00
Claudio Atzori
7817338e05
added test to verify the relation pre-processing
2020-06-26 17:58:33 +02:00
Enrico Ottonello
b2213b6435
merged with dnet version
2020-06-26 17:27:34 +02:00
Enrico Ottonello
c5e149c46e
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop into orcid-no-doi
2020-06-26 16:15:38 +02:00
Claudio Atzori
8d59fdf34e
WIP: dataset based PrepareRelationsJob
2020-06-26 14:32:58 +02:00
Michele Artini
2393d9da2f
limits
2020-06-26 11:20:45 +02:00
Enrico Ottonello
d6498278ed
added workflow to generate seq(orcidId,work) and seq(orcidId,enrichedWork)
2020-06-25 18:43:29 +02:00
Sandro La Bruzzo
96ce124b59
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-25 17:00:43 +02:00
Miriam Baglioni
4a7de07ea2
refactoring
2020-06-25 16:32:40 +02:00
Miriam Baglioni
54a12978d3
fixed issue in xquery
2020-06-25 16:30:20 +02:00
Michele Artini
408165a756
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-25 15:53:35 +02:00
Michele Artini
e8fb305f18
compilation of event map
2020-06-25 15:53:20 +02:00
Michele Artini
4eb3e109d7
compilation of event map
2020-06-25 15:45:50 +02:00
Claudio Atzori
d839e88783
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-06-25 14:06:30 +02:00
Claudio Atzori
6f5771c1c9
sets author.rank when null
2020-06-25 14:06:21 +02:00
Michele Artini
e28033c6d8
some fixes
2020-06-25 13:01:09 +02:00
Claudio Atzori
216975c4ec
restored complete provision workflow
2020-06-25 12:55:52 +02:00
Claudio Atzori
2d77d3a388
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-06-25 12:54:30 +02:00
Claudio Atzori
93f627ea51
code formatting
2020-06-25 12:54:21 +02:00
Miriam Baglioni
05a99cfb61
change the position of value and description elements in the workflow definition
2020-06-25 12:36:08 +02:00
Claudio Atzori
7df2712824
Merge branch 'provision_indexing'
2020-06-25 12:22:41 +02:00
Claudio Atzori
e62333192c
WIP: prepare relation job
2020-06-25 12:22:18 +02:00
Claudio Atzori
6933ec11fb
WIP: prepare relation job
2020-06-25 11:04:12 +02:00
Sandro La Bruzzo
a6c0faac70
added test to verify secondary sorting
2020-06-25 10:48:15 +02:00
Claudio Atzori
69b0391708
WIP: prepare relation job
2020-06-25 10:19:56 +02:00
Michele Artini
abcbebcbb4
fixed generation of ids
2020-06-25 09:50:46 +02:00
Michele Artini
77d2a1b1c4
params to choose sql queries for beta or production
2020-06-25 09:28:13 +02:00
Claudio Atzori
46e76affeb
WIP: prepare relation job
2020-06-24 19:01:15 +02:00
Claudio Atzori
0e723d378b
added default from vocab for missing instance.refereed; remove spurious prefixes from orcid values; WIP: prepare relation job
2020-06-24 18:34:42 +02:00
Enrico Ottonello
fcbb4c1489
parser of orcid publication data from xml original dump
2020-06-24 16:29:32 +02:00
Michele Artini
202f6e62ff
Split join wf
2020-06-24 15:47:06 +02:00
Sandro La Bruzzo
96689a8994
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-24 14:06:50 +02:00
Sandro La Bruzzo
46631a4421
updated mapping scholexplorer to OAF
2020-06-24 14:06:38 +02:00
Michele Artini
e53dd62e87
minor changes
2020-06-24 09:24:45 +02:00
Michele Artini
8b9933b934
refactoring aggregators
2020-06-24 08:57:13 +02:00
Miriam Baglioni
3e5570de7a
-
2020-06-23 15:44:54 +02:00
Michele Artini
d13e3d3f68
fixed paths
2020-06-23 11:01:42 +02:00
Michele Artini
8386c6f90d
filter of valid resultResult relations
2020-06-23 10:24:15 +02:00
Michele Artini
38bb45d0b6
test osf:refereed
2020-06-23 10:14:39 +02:00
Michele Artini
c3286f4c37
fixed relType
2020-06-23 09:32:32 +02:00
Miriam Baglioni
507f7a94a8
added one of the main zenodo communities to the tagging conf for testing purposes
2020-06-23 08:45:27 +02:00
Michele Artini
af2f7705fc
partial refactoring of some joins
2020-06-23 08:37:35 +02:00
Miriam Baglioni
af1d40351b
changed XQuery to add also the main Zenodo community among the communities associated to the openaire community
2020-06-22 19:20:54 +02:00
Miriam Baglioni
e4b21be004
-
2020-06-22 17:31:50 +02:00
Miriam Baglioni
afa19b0c84
changed the way to PUT the files to the rest API
2020-06-22 17:20:07 +02:00
Miriam Baglioni
250fd1c854
merge branch with fork master
2020-06-22 16:25:48 +02:00
Claudio Atzori
8a3bc7c183
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-06-22 14:12:33 +02:00
Claudio Atzori
e162ba5075
added dnet workflows to orchestrate the execution of graph2hive, updateSolr and updateStats oozie wfs
2020-06-22 14:12:28 +02:00
Michele Artini
3ce20c198e
reformatting
2020-06-22 12:14:25 +02:00
Michele Artini
ed787398b3
refactoring wf
2020-06-22 11:45:14 +02:00
Claudio Atzori
9cd27183b6
[maven-release-plugin] prepare for next development iteration
2020-06-22 11:27:44 +02:00
Claudio Atzori
1e3dab0631
[maven-release-plugin] prepare release dhp-1.2.3
2020-06-22 11:27:39 +02:00
Miriam Baglioni
df80ae5c1b
merge branch with fork master
2020-06-22 10:51:23 +02:00
Miriam Baglioni
e8f914f8b3
-
2020-06-22 10:50:41 +02:00
Miriam Baglioni
edeb862476
excluded dependency in module that generates conflict
2020-06-22 10:49:56 +02:00
Miriam Baglioni
185facb8e5
replaced the deprecated DefaultHttpClient with the CloseableHttpClient
2020-06-22 10:49:10 +02:00
Claudio Atzori
961a0d0b49
[actionset promotion] log debugging info in case of error in the action payload extraction or parsing the data
2020-06-22 10:20:45 +02:00
Claudio Atzori
5e8b922962
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-06-22 09:50:47 +02:00
Claudio Atzori
7d416f08d8
graph cleaning workflow: set hostedby to unknown repository when defined as NULL
2020-06-22 09:50:43 +02:00
Michele Artini
16c7a18435
refactoring
2020-06-22 08:51:31 +02:00
Miriam Baglioni
669a509430
-
2020-06-19 17:39:46 +02:00
Michele Artini
f9fc64ffaf
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-19 15:24:43 +02:00
Michele Artini
d88fe0ac84
join methods
2020-06-19 15:24:30 +02:00
Sandro La Bruzzo
464eeeec87
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-19 15:11:53 +02:00
Sandro La Bruzzo
1681de672d
updated mapping scholexplorer to OAF
2020-06-19 15:11:46 +02:00
Michele Artini
4822747313
some fixes
2020-06-19 13:53:56 +02:00
Michele Artini
834f139e6e
fixed some NPE
2020-06-19 12:33:29 +02:00
Claudio Atzori
d0ac7514b2
cleaning workflow to include cleaning of default values
2020-06-18 19:37:25 +02:00
Miriam Baglioni
44a12d244f
-
2020-06-18 18:38:54 +02:00
Michele Artini
52f62d5d8c
events
2020-06-18 14:49:13 +02:00
Miriam Baglioni
fb80353018
-
2020-06-18 14:21:36 +02:00
Michele Artini
61634fbfe0
removed kryo encoding
2020-06-18 14:09:58 +02:00
Michele Artini
8d2b199dd2
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-18 13:15:34 +02:00
Michele Artini
e659b02e6b
some wf fixing
2020-06-18 13:15:13 +02:00
Michele Artini
9a847b4557
some wf fixing
2020-06-18 13:14:10 +02:00
Miriam Baglioni
65bf312360
merge branch with fork master
2020-06-18 11:35:27 +02:00
Miriam Baglioni
3953f56bd3
added dependency to pom
2020-06-18 11:34:47 +02:00
Miriam Baglioni
a118b66858
-
2020-06-18 11:34:30 +02:00
Miriam Baglioni
f9578312b5
-
2020-06-18 11:34:15 +02:00
Miriam Baglioni
8b145e6aba
-
2020-06-18 11:25:28 +02:00
Miriam Baglioni
e8b3e972f2
changed the input params and the workflow definition to tackle the Result as all result product produced
2020-06-18 11:25:05 +02:00
Miriam Baglioni
3233b01089
changes due to adding all the result type under Result
2020-06-18 11:22:58 +02:00
Miriam Baglioni
5c8533d1a1
changed in the testing classes
2020-06-18 11:20:08 +02:00
Miriam Baglioni
bc8611a95a
added new resources for testing
2020-06-18 11:19:20 +02:00
Sandro La Bruzzo
9bf67f5de1
resolved conflicts
2020-06-17 09:15:43 +02:00
Sandro La Bruzzo
1d4275acc4
implemented first version of exportation of Scholexplorer into ActionSet
2020-06-17 09:10:38 +02:00
miconis
5233b15265
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-16 18:31:19 +02:00
miconis
11b77b9f4e
json dumps for entity merge test modified to fit the new model. title merge adjusted to fix the error
2020-06-16 18:31:11 +02:00
Claudio Atzori
64f02de5d3
updated workflow definition to include the cleaning step
2020-06-16 17:48:51 +02:00
Claudio Atzori
306669209f
code formatting
2020-06-16 16:54:44 +02:00
Claudio Atzori
1bc1d15eaf
stubbing for mock datasource.identities must be typed as array
2020-06-16 16:54:28 +02:00
Claudio Atzori
631fef12a7
Merge branch 'master' into dhp_oaf_model
2020-06-16 16:11:19 +02:00
Michele Artini
9e2c23e391
partial refactoring
2020-06-16 15:55:42 +02:00
Michele Artini
113c9b1de0
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-16 15:53:39 +02:00
Michele Artini
76ea7607f7
partial refactoring
2020-06-16 15:53:13 +02:00
Claudio Atzori
603b1bd0bb
Merge branch 'master' into dhp_oaf_model
2020-06-16 15:43:59 +02:00
Claudio Atzori
5441f01586
Merge pull request 'missing landingPage urls in instances' ( #22 ) from instances-with-landing-page into master
...
Looks good, thanks!
2020-06-16 15:32:44 +02:00
Claudio Atzori
89859111ee
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-06-16 15:28:29 +02:00
Claudio Atzori
4ec262db53
included externalreference(s) in the result view on the Hive graph DB
2020-06-16 15:28:20 +02:00
Michele Artini
8a4f84f8c0
refactoring
2020-06-16 12:34:13 +02:00
Claudio Atzori
2a4f65795f
WIP: graph cleaner implementation
2020-06-15 18:32:24 +02:00
Claudio Atzori
c15c8c0ad0
map datasource identities (including piwik ids) as original IDs
2020-06-15 16:07:30 +02:00
Miriam Baglioni
9dd3ef22c5
merge branch with fork master
2020-06-15 11:23:26 +02:00
Miriam Baglioni
68cf0fd03f
test input
2020-06-15 11:14:42 +02:00
Miriam Baglioni
0467145ae3
test for graph dump
2020-06-15 11:13:51 +02:00
Miriam Baglioni
e43eedb5b0
added resources and workflow for dump of community products
2020-06-15 11:13:21 +02:00
Miriam Baglioni
f96ca900e1
fixed issues while running on cluster
2020-06-15 11:12:14 +02:00
Miriam Baglioni
20b9e67728
added new class funder
2020-06-15 11:06:18 +02:00
Claudio Atzori
0d52816244
WIP: graph cleaner implementation
2020-06-13 13:06:04 +02:00
Claudio Atzori
bed65a1be6
WIP: graph cleaner implementation
2020-06-12 18:25:47 +02:00
Claudio Atzori
c4d9f1837f
[maven-release-plugin] prepare for next development iteration
2020-06-12 12:21:08 +02:00
Claudio Atzori
f0746a7605
[maven-release-plugin] prepare release dhp-1.2.2
2020-06-12 12:21:03 +02:00
Claudio Atzori
463489f59f
code formatting
2020-06-12 12:03:25 +02:00
Claudio Atzori
4bcad1c9c3
Merge branch 'graph_cleaning'
2020-06-12 11:40:25 +02:00
Claudio Atzori
cdb1956fe9
WIP: graph cleaner implementation
2020-06-12 11:36:59 +02:00
Alessia Bardi
b347499745
do not use deprecated subreltype
2020-06-12 10:58:02 +02:00
Claudio Atzori
97b1c4057c
WIP: graph cleaner implementation
2020-06-12 10:45:18 +02:00
Claudio Atzori
ba8a024af9
avoid NPEs merging titles
2020-06-12 10:45:11 +02:00
Michele Artini
30ea1bda88
oozie workflow
2020-06-12 10:42:35 +02:00
Michele Artini
c22cb5a3c6
refactoring
2020-06-12 09:47:55 +02:00
Michele Artini
472cf77639
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-11 14:30:47 +02:00
Michele Artini
c6b5bb3f17
orcid events
2020-06-11 14:30:24 +02:00
Michele Artini
c2e1b66e83
Revert "orcid events"
...
This reverts commit 48959e9a17.
2020-06-11 14:28:03 +02:00
Michele Artini
48959e9a17
orcid events
2020-06-11 14:24:02 +02:00
Miriam Baglioni
e145972962
-
2020-06-11 13:08:39 +02:00
Miriam Baglioni
a01800224c
-
2020-06-11 13:02:04 +02:00
Miriam Baglioni
356dd582a3
map construction moved in class
2020-06-11 12:59:22 +02:00
Alessia Bardi
e79943965b
Fixes #5604 : field oamandatepublications in XML
2020-06-11 12:49:31 +02:00
Michele Artini
a41e0cb648
missing landingPage urls in instances
2020-06-11 12:28:34 +02:00
Michele Artini
04fdcacd83
results with all joined entities
2020-06-11 11:25:18 +02:00
Michele Artini
99f88e1cb8
fixed generation entities from claims
2020-06-11 10:51:57 +02:00
Miriam Baglioni
db27663750
-
2020-06-11 10:49:01 +02:00
Miriam Baglioni
bb9f21d0e7
job test for class producing first step of results dump
2020-06-11 10:20:05 +02:00
Claudio Atzori
d1d92c4d8c
fixed integration of claims in the graph
2020-06-11 10:12:00 +02:00
Claudio Atzori
953da4a427
Merge branch 'master' into graph_cleaning
2020-06-10 21:36:56 +02:00
Claudio Atzori
f1bce64391
WIP: graph cleaner implementation
2020-06-10 21:36:31 +02:00
Claudio Atzori
67c7b31ba6
Merge branch 'master' into graph_cleaning
2020-06-10 15:00:35 +02:00
Claudio Atzori
3ebf81d2b0
Merge pull request 'oaf-store-interpretation' ( #21 ) from oaf-store-interpretation into master
...
Looks good, thanks Michele!
2020-06-10 14:58:09 +02:00
Michele Artini
5869cb76b3
reformatting
2020-06-10 12:11:16 +02:00
Michele Artini
c08e66e01e
fixed a workflow parameter
2020-06-10 10:11:56 +02:00
Michele Artini
7177a32d75
import of invisible stores
2020-06-10 10:04:00 +02:00
Claudio Atzori
ce12f236bb
disabled test, need to update the joined_entity.json file
2020-06-09 20:07:36 +02:00
Claudio Atzori
a2fdf85ba1
WIP: graph cleaner implementation
2020-06-09 19:52:53 +02:00
Alessia Bardi
4551c1082f
mapping csv for orcid
2020-06-09 18:08:47 +02:00
Alessia Bardi
2d3f7d1eb4
fixed log classes to make the ORCID test run
2020-06-09 18:07:14 +02:00
Alessia Bardi
a3a6755d58
mapping csv for Unpaywall
2020-06-09 17:45:44 +02:00
Claudio Atzori
d9f33582c5
WIP: graph cleaner implementation
2020-06-09 17:20:40 +02:00
Alessia Bardi
f3b033cf09
added csv line for funders from Crossref
2020-06-09 17:08:26 +02:00
Alessia Bardi
79969d78b9
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-06-09 17:05:39 +02:00
Alessia Bardi
fc4d220964
updated function name for SNSF
2020-06-09 17:05:31 +02:00
Michele Artini
baaa55f4a3
use of pace to calculate trusts
2020-06-09 16:01:31 +02:00
Alessia Bardi
33b130ec43
Mapping instructions for MAG
2020-06-09 15:57:15 +02:00
Miriam Baglioni
206abba48c
merge branch with fork master
2020-06-09 15:41:14 +02:00
Miriam Baglioni
a089db18f1
workflow and parameters to execute the dump
2020-06-09 15:39:38 +02:00
Miriam Baglioni
6bbe27587f
new classes to execute the dump for products associated to community, enrich each result with project information and assign the result to each community it belongs to
2020-06-09 15:39:03 +02:00
Miriam Baglioni
5121cbaf6a
new classes for external dump. Only classes functional to dump products
2020-06-09 15:37:46 +02:00
Alessia Bardi
d6de406e11
fixed classid for subjects
2020-06-09 14:43:34 +02:00
Alessia Bardi
f072125152
map volume and issue in journal information from MAG
2020-06-09 14:32:10 +02:00
Alessia Bardi
b7cb1163ea
identifiers always start with 50
2020-06-09 10:39:11 +02:00
Alessia Bardi
181f52b9bc
Added mapping table for Crossref
2020-06-08 19:33:47 +02:00
Alessia Bardi
9fd25887f7
Result identifiers all start with 50|
2020-06-08 19:32:24 +02:00
Alessia Bardi
16cb073b15
set the instance dateofacceptance with the Crossref createdDate in case the issuedDate is blank
2020-06-08 19:06:03 +02:00
Michele Artini
bb659d870c
join simrels
2020-06-08 16:29:01 +02:00
Michele Artini
81e85465d8
join simrels
2020-06-08 16:26:16 +02:00
Claudio Atzori
3d871c6651
Merge branch 'master' into graph_cleaning
2020-06-08 15:23:24 +02:00
Claudio Atzori
25a093b1a4
integrated changes from master
2020-06-08 15:04:00 +02:00
Sandro La Bruzzo
e34e7d6728
merge DOIBoost
2020-06-08 08:32:22 +02:00
Sandro La Bruzzo
e46e2a4776
Merge remote-tracking branch 'origin/master' into doiboost
2020-06-08 08:17:14 +02:00
Spyros Zoupanos
3576dd186b
Adding hive timeout as workflow parameter
2020-06-05 22:29:54 +03:00
Claudio Atzori
b2349659cf
WIP: graph property fixing implementation
2020-06-05 18:37:38 +02:00
Michele Artini
a73973a74b
partial implementation of broker events generation
2020-06-05 11:43:00 +02:00
Michele Artini
7e82996e7c
partial implementation of broker events generation
2020-06-04 17:10:43 +02:00
Sandro La Bruzzo
b57e8ba374
Merge remote-tracking branch 'origin/master' into doiboost
2020-06-04 14:39:41 +02:00
Sandro La Bruzzo
7ac1ba2e35
improvement DOIBoost
2020-06-04 14:39:20 +02:00
Michele Artini
97177d7f7b
partial refactoring
2020-06-04 10:26:34 +02:00
Sandro La Bruzzo
13815d5d13
improvement DOIBoost
2020-06-01 17:52:12 +02:00
Claudio Atzori
05f269a1c0
kryo based parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob
2020-06-01 00:32:42 +02:00
Claudio Atzori
5e23fb3a74
code formatting
2020-05-30 10:52:56 +02:00
Claudio Atzori
54ca8ed6c3
uniformed param name (isLookupUrl), Vocab model classes defined as Serializable
2020-05-29 18:17:30 +02:00
Claudio Atzori
1577bd5b8b
added IsLookupUrl to the raw_db workflow parameters
2020-05-29 16:18:16 +02:00
Claudio Atzori
91d78b825b
Merge pull request 'import from db using is vocabularies' ( #17 ) from result_pids into master
...
Looks good, thanks Michele!
2020-05-29 16:02:40 +02:00
Michele Artini
adb798faa5
import from db using is vocabularies
2020-05-29 12:03:51 +02:00
Claudio Atzori
6f5f498c78
restored common properties driving executor-cores and executor-memory in join_organization_relations wf node
2020-05-29 11:22:00 +02:00
Claudio Atzori
b2f9564f13
WIP: fixed PrepareRelationsJob; parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob
2020-05-29 10:58:15 +02:00
Miriam Baglioni
dfa4997a4f
removed commented code
2020-05-29 10:45:18 +02:00
Miriam Baglioni
6f1eea28b6
changed message in log
2020-05-29 10:41:39 +02:00
Sandro La Bruzzo
b87b3ddb6b
changed mapping ORCIDToOAF
2020-05-29 09:32:04 +02:00
Miriam Baglioni
8b6e886fb6
added new resource for testing
2020-05-28 23:54:31 +02:00
Miriam Baglioni
6989fb9c8a
changed the project test according to the newly introduced join with the db project codes
2020-05-28 23:53:24 +02:00
Miriam Baglioni
782984d8e5
added needed parameter
2020-05-28 23:52:41 +02:00
Miriam Baglioni
01f7876595
fix issue with flatMap - the return type must not be null
2020-05-28 23:50:32 +02:00
Claudio Atzori
a57965a3ea
limiting the dimensions of outliers
2020-05-28 17:36:37 +02:00
Miriam Baglioni
773735f870
added the path to the file containing the projects code from the db
2020-05-28 17:30:45 +02:00
Miriam Baglioni
6a15067a64
added one step in the workflow
2020-05-28 17:30:09 +02:00
Miriam Baglioni
5309a99a70
modified the PrepareProjects to consider those in the db
2020-05-28 17:29:53 +02:00
Miriam Baglioni
b737ed8236
added part to read projects from the openaire db to filter out those in the csv file that are not in the db
2020-05-28 17:29:21 +02:00
Claudio Atzori
821be1f8b6
experimental implementation of custom aggregation using kryo encoders
2020-05-28 13:53:13 +02:00
Claudio Atzori
83504ecace
limiting the maximum number of authors allowed in XML records to MAX_AUTHORS = 200; authors with ORCID can exceed that limit
2020-05-28 13:52:30 +02:00
Claudio Atzori
ef11593068
JoinedEntity.links defined as empty list by default
2020-05-28 13:50:44 +02:00
Claudio Atzori
5dea155a87
increased number of partitions produced by the join_all_entities phase as well as spark.sql.shuffle.partitions in adjancency_lists phase
2020-05-28 13:49:59 +02:00
Miriam Baglioni
35b7279147
changed test because data are saved as SequenceFile now, and because of the group by the number of produced updates decreases
2020-05-28 10:26:12 +02:00
Miriam Baglioni
37c155b86a
merge branch with fork master
2020-05-28 10:09:51 +02:00
Miriam Baglioni
df44db686a
refactoring
2020-05-28 10:07:00 +02:00
Miriam Baglioni
87b07f4af8
removed unused variables
2020-05-28 10:05:43 +02:00
Miriam Baglioni
1060977272
added fs actions to remove and then create the workingDir
2020-05-28 10:04:36 +02:00
Miriam Baglioni
96d1a3c431
deleted the file where the csv files were stored
2020-05-28 10:04:10 +02:00
Miriam Baglioni
669c05c771
added groupBy before creating Actions
2020-05-28 10:00:45 +02:00
Sandro La Bruzzo
02f90eeb07
Merge remote-tracking branch 'origin/master' into doiboost
2020-05-28 09:58:32 +02:00
Sandro La Bruzzo
7d29b61c62
code refactor
2020-05-28 09:57:46 +02:00
Claudio Atzori
fdd54bad1c
code formatting
2020-05-27 19:31:54 +02:00
Miriam Baglioni
1855453434
changed the outputdir of the last step
2020-05-27 17:59:36 +02:00
Claudio Atzori
b9b1bc9967
Merge branch 'master' into provision_indexing
2020-05-27 12:55:20 +02:00
Claudio Atzori
aac1515b58
Merge pull request 'result_pids without conflicts ???' ( #16 ) from result_pids into master
...
Looks good, thanks Michele
2020-05-27 12:54:52 +02:00
Michele Artini
f5ce7d76e1
resolve conflicts
2020-05-27 12:49:17 +02:00
Claudio Atzori
cfd753217c
repartition the join_entities in 24k files
2020-05-27 12:44:01 +02:00
Claudio Atzori
2f1a623d09
sync from master branch
2020-05-27 12:39:58 +02:00
Claudio Atzori
9e4ec1543b
updated test
2020-05-27 12:38:42 +02:00
Claudio Atzori
8047d16dd9
added RDD based adjacency list creation procedure
2020-05-27 12:38:12 +02:00
Claudio Atzori
f057dcdf65
limit the max number of externalreferences to MAX_EXTERNAL_ENTITIES
2020-05-27 12:37:33 +02:00
Michele Artini
b81f2741d2
xquery
2020-05-27 12:10:20 +02:00
Michele Artini
a25598140a
result pids (new xpaths + IS vocabularies)
2020-05-27 12:10:20 +02:00
Michele Artini
7a7272d9ec
result pids (new xpaths + IS vocabularies)
2020-05-27 12:10:20 +02:00
Michele Artini
3ceb2d2853
match terms with vocabularies
2020-05-27 11:34:13 +02:00
Claudio Atzori
4e36d689dd
fixed XML serialization for children sub-elements (duplicates & externalreferences)
2020-05-26 18:30:40 +02:00
Miriam Baglioni
92e3a52e91
merge branch with fork master
2020-05-26 15:57:51 +02:00
Michele Artini
c15d997925
xquery
2020-05-26 13:13:17 +02:00
Michele Artini
c6af36496a
result pids (new xpaths + IS vocabularies)
2020-05-26 13:11:09 +02:00
Michele Artini
093f1aff03
result pids (new xpaths + IS vocabularies)
2020-05-26 13:06:55 +02:00
Claudio Atzori
b8e541a454
fixing repeated organization.websiteurl in organization entities ( #5645 ) as well as project.ecinternationalorganizationeurinterests
2020-05-26 10:30:09 +02:00
Claudio Atzori
55595d7235
HACK: patch NULL values with defaults found in result.datainfo.deletedbyinference and result.context
2020-05-26 10:28:35 +02:00
Claudio Atzori
7b288a94cb
code formatting
2020-05-26 09:54:13 +02:00
Miriam Baglioni
54d869e618
merge upstream
2020-05-26 09:22:04 +02:00
Miriam Baglioni
eea07f4c42
refactoring
2020-05-26 09:21:49 +02:00
Sandro La Bruzzo
79c26382da
Merge remote-tracking branch 'origin/master' into doiboost
2020-05-26 09:15:50 +02:00
Sandro La Bruzzo
25f52e19a4
implemented generation of ActionSet
2020-05-26 09:15:33 +02:00
Michele Artini
d6aada4957
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-05-26 08:44:31 +02:00
Michele Artini
b1546605e3
updated version of a dependency
2020-05-26 08:44:15 +02:00
Claudio Atzori
7582532e73
[maven-release-plugin] prepare for next development iteration
2020-05-25 19:48:18 +02:00
Claudio Atzori
01c2e93395
[maven-release-plugin] prepare release dhp-1.2.1
2020-05-25 19:48:14 +02:00
miconis
da1e5cf557
implementation of the result title merge. main title with higher trust, distinct between the others
2020-05-25 18:02:57 +02:00
Miriam Baglioni
d3d36647d2
merge upstream
2020-05-25 10:38:22 +02:00
Miriam Baglioni
74215f6d9f
refactoring
2020-05-25 10:38:16 +02:00
Miriam Baglioni
dbde2d243a
changed due to move of PacePerson from dhp-graph-mapper to dhp-common
2020-05-25 10:35:39 +02:00
Miriam Baglioni
f754c424bd
changed logic to compute PacePerson only once for each Author to be enriched
2020-05-25 10:35:02 +02:00
Miriam Baglioni
8f51af4e9b
added PacePerson to get name surname for authors having only fullname set
2020-05-25 10:34:30 +02:00
Miriam Baglioni
b258f99ece
fix for issue that duplicated result
2020-05-25 10:26:48 +02:00
Miriam Baglioni
8f6ce970f9
moved PacePerson to dhp-common to avoid conflict in dependency with graph-mapper
2020-05-25 10:25:55 +02:00
Claudio Atzori
de108f54d6
code formatting
2020-05-23 10:21:19 +02:00
Claudio Atzori
6b56cae57d
added mapping for bestaccessrights
2020-05-23 09:57:39 +02:00
Claudio Atzori
7181807e64
code formatting
2020-05-23 09:51:48 +02:00
Sandro La Bruzzo
2408083566
implemented filtering step
2020-05-23 08:46:49 +02:00
Sandro La Bruzzo
244f6e50cf
Merge remote-tracking branch 'origin/master' into doiboost
2020-05-22 20:52:15 +02:00
Sandro La Bruzzo
147dd389bf
minor fix
2020-05-22 20:51:42 +02:00
Miriam Baglioni
0d1ec1913f
added fix to avoid duplication of results
2020-05-22 18:42:25 +02:00
miconis
5d7ac78c41
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-05-22 17:25:08 +02:00
miconis
0fd0c7d725
reimplementation of the sim between two authors. now it takes into account both name and surname. threshold incremented to 1.0 if the name is too short
2020-05-22 17:24:57 +02:00
Michele Artini
eb606dc1e2
partial implementation of events with rels
2020-05-22 17:17:41 +02:00
Miriam Baglioni
29066a6b46
applied code cleanup
2020-05-22 15:38:50 +02:00
Miriam Baglioni
8610ad5142
added groupby id to fix multiple result with same id at join step
2020-05-22 15:32:55 +02:00
Miriam Baglioni
1e44703e3e
merge upstream
2020-05-22 15:30:07 +02:00
Miriam Baglioni
ac8025f469
-
2020-05-22 15:29:41 +02:00
Miriam Baglioni
50ad83b97f
-
2020-05-22 15:27:19 +02:00
Miriam Baglioni
473c6d3a23
produces AtomicActions instead of Projects
2020-05-22 15:26:57 +02:00
Sandro La Bruzzo
72278b9375
Merge remote-tracking branch 'origin/master' into doiboost
2020-05-22 15:17:13 +02:00
Sandro La Bruzzo
22936d0877
Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost
2020-05-22 15:15:17 +02:00
Sandro La Bruzzo
9fbb221457
completed mapping of UnpayWall and ORCID
2020-05-22 15:15:09 +02:00
Miriam Baglioni
70389b0a30
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-05-22 13:53:23 +02:00
Miriam Baglioni
4308f31165
added fix to make test run
2020-05-22 13:13:01 +02:00
Claudio Atzori
946598cfba
Merge branch 'master' into provision_indexing
2020-05-22 12:35:41 +02:00
Claudio Atzori
3cf2796ac6
code formatting
2020-05-22 12:34:00 +02:00
Michele Artini
dc4621b3cb
filter ORCID and MAG identifiers
2020-05-22 12:25:01 +02:00
Michele Artini
9f2d0f1b08
filter ORCID and MAG identifiers
2020-05-22 11:00:27 +02:00
Michele Artini
9de71e54a8
filter ORCID and MAG identifiers
2020-05-22 10:47:39 +02:00
Michele Artini
c5f7e17348
author fullnames
2020-05-22 10:08:02 +02:00
Claudio Atzori
ad40470040
Merge branch 'master' into provision_indexing
2020-05-22 08:51:22 +02:00
Claudio Atzori
925d933204
making XmlRecordFactory immune to graph encoding changes (mostly to avoid NPEs)
2020-05-22 08:50:44 +02:00
Claudio Atzori
b33dd58be4
replaced parameter 'reuseRecords' with 'resumeFrom', allowing to restart the provision workflow execution from any step, useful for manual submissions or debugging
2020-05-22 08:50:06 +02:00
Michele Artini
c7ca3cf35b
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-05-21 16:48:20 +02:00
Michele Artini
3e34517479
partial implementation of events with rels
2020-05-21 16:47:53 +02:00
Miriam Baglioni
eae12a6586
Merge branch 'master' into dhp_oaf_model
2020-05-21 16:31:22 +02:00
Miriam Baglioni
6750075fbd
merge upstream
2020-05-21 16:31:09 +02:00
Miriam Baglioni
4589c428b1
generate action sets and saves them in the hdfs path for the actions sets
2020-05-21 16:30:39 +02:00
miconis
8b35e0e7f0
reimplementation of the author merging in deduprecord creation. implementation of the test class. minor changes
2020-05-21 12:02:44 +02:00
miconis
8bbd1d0501
reimplementation of the author merging in deduprecord creation. implementation of the test class.
2020-05-21 11:52:14 +02:00
Michele Artini
e43d4d7778
added a coalesce in sql query
2020-05-21 11:08:07 +02:00
Claudio Atzori
dbfb9c19fe
minor changes
2020-05-21 10:00:14 +02:00
Michele Artini
b3bcbb3129
resolve name of organization countries
2020-05-21 08:41:32 +02:00
Enrico Ottonello
1109d3b3fc
Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost
2020-05-21 00:41:27 +02:00
Enrico Ottonello
869a53040e
save to text file format
2020-05-21 00:41:21 +02:00
Sandro La Bruzzo
5818abaab4
fixed Crossref Mapping
2020-05-20 17:05:46 +02:00
Claudio Atzori
da4267d0fe
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-05-20 14:58:22 +02:00
Claudio Atzori
d7d2a0637f
added extra parameters to the provision indexing workflow
2020-05-20 14:55:38 +02:00
Miriam Baglioni
055eec5a77
added resource for prepare project test
2020-05-20 13:54:10 +02:00
Miriam Baglioni
9079bc1f61
-
2020-05-20 13:53:32 +02:00
Miriam Baglioni
67ba4fde57
added test for prepare projects step
2020-05-20 13:53:08 +02:00
Miriam Baglioni
5e0e554000
Merge branch 'master' into dhp_oaf_model
2020-05-20 10:57:30 +02:00
Miriam Baglioni
76f3f73caa
merge upstream
2020-05-20 10:31:40 +02:00
Miriam Baglioni
3c0eb12d3e
removed the not zipped files
2020-05-20 10:31:05 +02:00
Miriam Baglioni
c0d9e02340
zipped test resources that are too big
2020-05-20 10:30:25 +02:00
Miriam Baglioni
5e9c9fa87c
tests
2020-05-20 10:29:57 +02:00
Miriam Baglioni
faed7521bf
added resources for testing
2020-05-20 10:29:29 +02:00
Miriam Baglioni
75491482de
added a new preparation step to replicate each project for the programme it is associated to
2020-05-20 10:28:56 +02:00
Miriam Baglioni
eb0e47ba53
parameters for h2020 programme
2020-05-20 10:26:44 +02:00
Sandro La Bruzzo
b771d67e9d
next step of MAG conversion implemented
2020-05-20 08:14:03 +02:00
Miriam Baglioni
08218d2f3f
new workflow with added steps
2020-05-19 18:44:25 +02:00
Miriam Baglioni
457293ccc0
test for the various steps of project update with programme
2020-05-19 18:43:42 +02:00
Miriam Baglioni
9447d78ef3
added preparation classes
2020-05-19 18:42:50 +02:00
Michele Artini
85ca5622d4
partial implementation of generation of simple events
2020-05-19 16:17:35 +02:00
Claudio Atzori
0bdfbb0a57
reintroduced RDD based relation cut off procedure
2020-05-19 15:02:21 +02:00
Enrico Ottonello
934ad570e0
joined summaries and activities dataset
2020-05-19 12:57:21 +02:00
Enrico Ottonello
ca722d4d18
merged
2020-05-19 09:43:12 +02:00
Enrico Ottonello
7362bc3e9d
workflow to generate seq(doi,AuthorList)
2020-05-19 09:34:44 +02:00
Sandro La Bruzzo
8c95b50f26
Merge remote-tracking branch 'origin/master' into doiboost
2020-05-19 09:25:04 +02:00
Sandro La Bruzzo
486e850bcc
next step of MAG conversion implemented
2020-05-19 09:24:45 +02:00
Enrico Ottonello
d4e9075f22
Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost
2020-05-18 19:51:36 +02:00
Enrico Ottonello
fc80e8c7de
added accumulator; last modified date of the record is added to saved data; lambda file is partitioned into 20 parts before starting downloading
2020-05-18 19:51:29 +02:00
Claudio Atzori
f3bc8aed31
lifted memory requirements for country propagation wf
2020-05-18 15:29:10 +02:00
Miriam Baglioni
b71fbb68b1
removed the removeOutputDir command from code. Relations are written in Append mode. Erasing the output dir meant removing all the relations computed in the previous steps
2020-05-18 13:57:20 +02:00
Miriam Baglioni
629af7cb79
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-05-18 13:07:36 +02:00
Miriam Baglioni
f0f14caf99
removed script files for shell actions not performed
2020-05-18 13:06:16 +02:00
Miriam Baglioni
23bbac7d7c
-
2020-05-18 13:05:03 +02:00
Miriam Baglioni
4f1ff7ba73
added dependency to org.apache.commons common-csv
2020-05-18 13:04:39 +02:00
Miriam Baglioni
abc45f2708
added dnet-45 HttpConnector and related Classes, produced the POJO for projects and programme
2020-05-18 13:04:06 +02:00
Claudio Atzori
ef9a9a9f1a
remove the output path when starting
2020-05-15 22:34:19 +02:00
Enrico Ottonello
0b29bb7e3b
spark job to download orcid record modified after a fixed date
2020-05-15 19:49:26 +02:00
Miriam Baglioni
5a648016ef
parameters from the GetFile class
2020-05-15 18:18:50 +02:00
Miriam Baglioni
83c262a483
workflow to download the files
2020-05-15 18:18:31 +02:00
Miriam Baglioni
22cb9e0da7
simple code to get file from URL
2020-05-15 18:18:01 +02:00
Claudio Atzori
7838f2c63f
init the empty list for author pids mapped from OAF
2020-05-15 17:06:01 +02:00
Claudio Atzori
82b615ab33
NPE check
2020-05-15 16:04:46 +02:00
Miriam Baglioni
e26a67c3eb
merge with upstream
2020-05-15 15:53:05 +02:00
Claudio Atzori
7a89507ab1
code formatting
2020-05-15 15:16:54 +02:00
Miriam Baglioni
5ec8c49ad5
removed serialization points
2020-05-15 12:49:58 +02:00
Claudio Atzori
1d35836a58
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-05-15 12:26:31 +02:00
Claudio Atzori
cfc8948717
fixed mapping OdfToGraph: pick the correct element to map author pids and author affiliations; extended mapping Oaf2Graph: added support for author pids
2020-05-15 12:26:16 +02:00
Michele Artini
2a4e68a292
events recognition
2020-05-15 12:25:37 +02:00
Claudio Atzori
a832658296
code formatting
2020-05-15 10:21:09 +02:00
Claudio Atzori
50d6a2ad3c
added output directory removal in the blacklist spark actions; included common global properties in blacklist's workflow.xml
2020-05-15 09:53:37 +02:00
Claudio Atzori
18f46e47b9
added relations to the graph2hive import workflow
2020-05-15 09:34:48 +02:00
Claudio Atzori
9d028ffe1c
cleanup
2020-05-15 09:28:55 +02:00
Claudio Atzori
fd62359538
cleanup
2020-05-15 09:28:15 +02:00
Claudio Atzori
eb64335a54
parallel implementation for graph Hive importer
2020-05-15 09:05:26 +02:00
Miriam Baglioni
94571c9a51
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-05-14 18:29:55 +02:00
Miriam Baglioni
f25db01664
changed in the constant from propagationconstants to modelconstants
2020-05-14 18:29:24 +02:00
Miriam Baglioni
d05630d979
removed the constants added in ModelConstants
2020-05-14 18:22:50 +02:00
Claudio Atzori
f044d09315
revised mapping: more accurate mapping for name/surname from datacite format; improved mapping of null values
2020-05-14 15:07:24 +02:00
Miriam Baglioni
e7eb4f377e
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-05-14 10:34:17 +02:00
Miriam Baglioni
8828458acf
minor changes
2020-05-14 10:34:12 +02:00
Claudio Atzori
ab37953332
added global properties in wf definitions to avoid repeating name-node and job-tracker in the (many) distcp actions; reintroduced output directory removal at the beginning of each spark action
2020-05-14 10:25:41 +02:00
Claudio Atzori
12bfa6702e
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-05-13 17:01:17 +02:00
Claudio Atzori
5ecacad70a
fixed default resource typing in Oaf/Odf mapping
2020-05-13 17:01:11 +02:00
Enrico Ottonello
12756f9d41
multithread (4 threads) test to feed elastic search
2020-05-13 16:11:40 +02:00
Michele Artini
c0265213a0
partial implementation
2020-05-13 12:00:27 +02:00
Sandro La Bruzzo
a92ee0f41e
Merge remote-tracking branch 'origin/master' into doiboost
2020-05-13 10:38:13 +02:00
Sandro La Bruzzo
d876f47d06
next step of MAG conversion implemented
2020-05-13 10:38:04 +02:00
Claudio Atzori
1ddd33de41
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-05-13 09:04:41 +02:00
Claudio Atzori
85f3c55992
fixed node names in blacklist workflow
2020-05-13 09:04:33 +02:00
Miriam Baglioni
43f127448d
changed the package name from dhp-propagation to dhp-enrichment for the preparation phase of funding propagation
2020-05-12 18:24:26 +02:00
Enrico Ottonello
08040cef80
spark action to analyze orcid lambda file
2020-05-12 16:57:43 +02:00
Claudio Atzori
ec0782e582
renamed jar containing the bulktagging and propagation workflows from dhp-[bulktagging|propagation] to dhp-enrichment; adjusted xml formatting
2020-05-12 15:49:28 +02:00
Miriam Baglioni
1547ca7e15
added blacklist step to the end of the provision wf
2020-05-12 12:17:27 +02:00
Miriam Baglioni
14979f299e
changed the configuration factory
2020-05-12 11:28:38 +02:00
Miriam Baglioni
f8aef6161a
minor modification
2020-05-12 11:28:07 +02:00
Miriam Baglioni
7387f3449a
changed the route to find the verb resolver classes
2020-05-12 11:27:38 +02:00
Miriam Baglioni
7687519f00
merged conflicts with upstream branch
2020-05-12 10:03:44 +02:00
Miriam Baglioni
8ffc050b8a
fixed problem in communityconfigurationfactory test
2020-05-12 10:01:09 +02:00
Claudio Atzori
527e8169a8
adjusted paths pointing to test configurations, cleanup
2020-05-11 18:17:05 +02:00
Claudio Atzori
f9a62ba63b
added wf nodes to copy entities to the output path
2020-05-11 18:16:39 +02:00
Miriam Baglioni
ad63effb4e
removed deletion of working dir
2020-05-11 17:48:22 +02:00
Claudio Atzori
c6b028f2af
code formatting
2020-05-11 17:38:08 +02:00
Claudio Atzori
6d0b11252e
bulktagging wfs moved into common dhp-enrichment module
2020-05-11 17:32:06 +02:00
Miriam Baglioni
50659011eb
refactoring
2020-05-11 16:14:26 +02:00
Miriam Baglioni
e883daf87e
added the outputPath parameter and the reset path to remove the outputPath directory
2020-05-11 16:10:24 +02:00
Miriam Baglioni
5ab3424c77
removed unused dependencies
2020-05-11 16:09:37 +02:00
Miriam Baglioni
6a3b081263
added the last step of blacklisting
2020-05-11 16:09:20 +02:00
Enrico Ottonello
3b1a68cbf5
elastic search feed test
2020-05-11 14:53:52 +02:00
Enrico Ottonello
f53e42bda7
merged
2020-05-11 14:49:28 +02:00
Enrico Ottonello
7990894454
different date format in lambda file parsing
2020-05-11 14:41:11 +02:00
Sandro La Bruzzo
0c6774e4da
updated pom version
2020-05-11 14:35:14 +02:00
Miriam Baglioni
bbc9b4f329
removed unused imports
2020-05-11 14:28:55 +02:00
Miriam Baglioni
757bae53ea
removed useless serialization points
2020-05-11 14:28:37 +02:00
Miriam Baglioni
b35d57a1ac
added resources for test
2020-05-11 14:15:30 +02:00
Miriam Baglioni
e563e65335
moved check from join to method
2020-05-11 14:11:44 +02:00
Sandro La Bruzzo
b90609848b
Merge remote-tracking branch 'origin/master' into doiboost
2020-05-11 14:08:31 +02:00
Sandro La Bruzzo
4062eafbdb
merged from branch
2020-05-11 14:08:16 +02:00
Miriam Baglioni
f5d785e096
used the DbClient moved in dhp-common
2020-05-11 13:59:42 +02:00
Miriam Baglioni
112b2cb3c3
added the test class
2020-05-11 13:58:58 +02:00
Miriam Baglioni
9a7ae523c9
update to version 1.2.1-SNAPSHOT
2020-05-11 13:57:47 +02:00
Miriam Baglioni
2abb84877d
Merge branch 'master' into blacklist
2020-05-11 10:37:49 +02:00
Miriam Baglioni
b0f0b24263
update to version 1.2.1-SNAPSHOT
2020-05-11 10:37:31 +02:00
Miriam Baglioni
a7e91e23ba
update to version 1.2.1-SNAPSHOT
2020-05-11 10:34:30 +02:00
Miriam Baglioni
bb59bdd60f
merge upstream
2020-05-11 10:33:17 +02:00
Miriam Baglioni
5e3548add6
-
2020-05-11 10:33:08 +02:00
Miriam Baglioni
dc8c8fa480
changed the version
2020-05-11 10:20:48 +02:00
Miriam Baglioni
871e079b45
merged with master
2020-05-11 10:20:00 +02:00
Claudio Atzori
60c40618d3
[maven-release-plugin] prepare for next development iteration
2020-05-11 10:17:14 +02:00
Claudio Atzori
c267d958d5
[maven-release-plugin] prepare release dhp-1.2.0
2020-05-11 10:17:10 +02:00
Miriam Baglioni
622ba87ec2
changed the version
2020-05-11 10:10:36 +02:00
Miriam Baglioni
391b2399cc
merge upstream
2020-05-11 10:08:51 +02:00
Claudio Atzori
42f1a2bf94
bumped project version to 1.2.0-SNAPSHOT
2020-05-11 10:05:57 +02:00
Sandro La Bruzzo
1412158a6f
merged from branch
2020-05-11 09:45:50 +02:00
Miriam Baglioni
32301451ec
merge upstream
2020-05-11 09:42:23 +02:00
Miriam Baglioni
7e66bc2527
fix a typo in the compression keyword and added some logging info in the spark job
2020-05-11 09:40:58 +02:00
Sandro La Bruzzo
1662f221f5
added test class
2020-05-11 09:39:11 +02:00
Sandro La Bruzzo
2b48a2c32c
Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost
2020-05-11 09:38:36 +02:00
Sandro La Bruzzo
4cebca09d2
start implementing MAG mapping
2020-05-11 09:38:27 +02:00
Spyros Zoupanos
ae0f535c73
Fixing hardcoded reference to main openAIRE graph db
2020-05-09 22:34:48 +03:00
Claudio Atzori
fd519df616
new rels produced by dedup workflow must be unique
2020-05-08 19:00:38 +02:00
Claudio Atzori
0ccc864ad9
[maven-release-plugin] prepare for next development iteration
2020-05-08 17:01:31 +02:00
Claudio Atzori
6e47c724c6
[maven-release-plugin] prepare release dhp-1.1.7
2020-05-08 17:01:27 +02:00
Claudio Atzori
5b28bb4131
code formatting
2020-05-08 16:49:47 +02:00
Claudio Atzori
8fd1952f16
code formatting
2020-05-08 16:01:09 +02:00
miconis
3420998bb4
reltype set in mergerels
2020-05-08 15:43:30 +02:00
Enrico Ottonello
b9d126dd1f
formatting modified after commit
2020-05-08 14:54:37 +02:00
Enrico Ottonello
7e1c987370
Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost
2020-05-08 14:49:50 +02:00
Enrico Ottonello
9d812788e4
added job to download from orcid the records modified after a fixed date, the info are taken from last_modified.csv on hdfs
2020-05-08 14:49:39 +02:00
Miriam Baglioni
9a29ab7508
got back to the readPath we have before
2020-05-08 13:08:56 +02:00
Miriam Baglioni
28556507e7
-
2020-05-08 12:54:52 +02:00
Claudio Atzori
b2192fdcdc
simplified reset_outputpath nodes across the workflows, applied common xml formatting
2020-05-08 12:33:31 +02:00
Miriam Baglioni
4c94231cad
merge with master fork
2020-05-08 12:25:57 +02:00
Miriam Baglioni
9b4c0d4b3a
-
2020-05-08 11:51:45 +02:00
Miriam Baglioni
53952707b6
modified test because of new step of data preparation. It now expects to find ResultCountrySet serialization instead of DatasourceCountry
2020-05-08 11:49:19 +02:00
Claudio Atzori
62ea19f1d3
introduced mapping for ExternalReferences, made urls defined within an instance unique
2020-05-08 09:43:26 +02:00
Claudio Atzori
8c67073a07
force speculative execution to false
2020-05-08 09:42:21 +02:00
Miriam Baglioni
d6b9de9f46
Merge branch 'master' of https://code-repo.d4science.org/miriam.baglioni/dnet-hadoop
2020-05-07 18:22:59 +02:00
Miriam Baglioni
f95d288681
fixed switch of parameters
2020-05-07 18:22:32 +02:00
Claudio Atzori
166aafd936
heavy cleanup
2020-05-07 18:22:26 +02:00
Michele Artini
ac0da5a7ee
Partial implementation of broker events
2020-05-07 12:31:26 +02:00
Miriam Baglioni
fb405275f7
merged with master
2020-05-07 11:48:21 +02:00
Miriam Baglioni
e124278934
-
2020-05-07 11:47:11 +02:00
Claudio Atzori
5111671e62
cleanup
2020-05-07 11:47:00 +02:00
Miriam Baglioni
9f8855991c
changed Encoders.bean to Encoders.kryo
2020-05-07 11:44:35 +02:00
Miriam Baglioni
207b899d6d
merged with upstream
2020-05-07 11:43:53 +02:00
Claudio Atzori
5b3f8a0e90
using Encoders.bean instead of kryo
2020-05-07 11:41:41 +02:00
Miriam Baglioni
182225becb
Merge branch 'master' of https://code-repo.d4science.org/miriam.baglioni/dnet-hadoop
2020-05-07 11:38:17 +02:00
Miriam Baglioni
5efae3acb9
new workflow for job3
2020-05-07 11:38:10 +02:00
Claudio Atzori
73243793b2
Dataset based implementation for SparkCountryPropagationJob3
2020-05-07 11:15:24 +02:00
Claudio Atzori
128c3bf1c8
restored Author bean with simple getter/setter, author pid addition moved into dedicated implementation SparkOrcidToResultFromSemRelJob3
2020-05-07 11:14:56 +02:00
Miriam Baglioni
b2fec32c87
new workflow for job3
2020-05-07 10:01:57 +02:00
Miriam Baglioni
29bc8c44b1
changes in the construction of new country set
2020-05-07 10:01:34 +02:00
Miriam Baglioni
55e825acd4
changed the test according to changes in SparkCountryPropagationJob2
2020-05-07 10:01:00 +02:00
Miriam Baglioni
16193cf0ba
new workflow and parameter for country propagation
2020-05-07 09:59:58 +02:00
Miriam Baglioni
5a476c7a13
changed the xquery for the cfhb table
2020-05-07 09:58:17 +02:00
Miriam Baglioni
42ad51577a
new implementation with one more serialization step
2020-05-07 09:57:49 +02:00
Claudio Atzori
17860d3ab6
general changes in the RAW graph mapping: missing collectedfrom/hostedby causes records to be skipped; factored out most of the constants in ModelConstants class (dhp-schemas)
2020-05-06 13:20:02 +02:00
Claudio Atzori
fdfecc9578
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-05-06 11:28:01 +02:00
Claudio Atzori
c79e2f5977
drop workingPath before starting the dedup workflow
2020-05-06 11:27:44 +02:00
Michele Artini
8f30a09d84
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-05-05 17:12:22 +02:00
Michele Artini
ccc609f909
new module for the production of broker events
2020-05-05 17:09:00 +02:00
Miriam Baglioni
dd2e698a72
added a sequentialization step on the spark job. Added new parameter
2020-05-05 17:03:43 +02:00
Claudio Atzori
0825321d0b
improved unit tests in dhp-aggregation
2020-05-05 12:39:04 +02:00
Miriam Baglioni
252b219dd5
changed the name of some properties
2020-05-05 10:03:32 +02:00
Claudio Atzori
4a8487165c
using long param names in wf definition
2020-05-04 19:19:29 +02:00
Claudio Atzori
a2fc37df5f
adjusted parameters
2020-05-04 19:18:59 +02:00
Claudio Atzori
f1b7e14036
code formatting
2020-05-04 19:18:34 +02:00
Miriam Baglioni
78578c3ccf
fixed wrong transition name in workflow
2020-05-04 15:46:24 +02:00
Miriam Baglioni
cc7d9b6b19
merge upstream
2020-05-04 13:59:09 +02:00
Miriam Baglioni
3957c815b9
changed the name of some parameters
2020-05-04 13:58:52 +02:00
Miriam Baglioni
e218360f8a
changed code for the mode of DbClient and also removed the dependency to graph-mapper
2020-05-04 12:26:17 +02:00
Miriam Baglioni
31ea05297d
moved the DbClient to common and added needed dependency to pom
2020-05-04 12:22:28 +02:00
miconis
085cf173d7
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-05-04 12:08:20 +02:00
miconis
3df703f67d
mergerels added to propagate relations
2020-05-04 12:08:12 +02:00
Claudio Atzori
bac37b3973
fixed children expansion in XML records
2020-05-04 11:51:17 +02:00
Claudio Atzori
077ccd8743
stats wf properties cleanup
2020-05-04 11:41:46 +02:00
Miriam Baglioni
b7dd400e51
added check if author.pid exists or is null
2020-05-01 15:09:02 +02:00
Miriam Baglioni
dbf3ba051a
minor
2020-04-30 20:22:07 +02:00
Miriam Baglioni
43053a286d
workflow pom with added blacklist module
2020-04-30 18:30:21 +02:00
Miriam Baglioni
0631fe548a
pom.xml
2020-04-30 18:29:46 +02:00
Miriam Baglioni
38ecfd5785
the wf with all the three steps for blacklisting relations
2020-04-30 18:28:46 +02:00
Miriam Baglioni
95433e1087
parameters for the preparation phase and blacklist phase
2020-04-30 18:28:13 +02:00
Miriam Baglioni
1070790c19
minor
2020-04-30 18:26:58 +02:00
Miriam Baglioni
b9d56b3ced
applies the actual removal of the relations
2020-04-30 18:26:25 +02:00
Miriam Baglioni
d6d6ebeae5
preparation step: creates the subset of the merges relations
2020-04-30 18:25:33 +02:00
Miriam Baglioni
13f30664ea
minor
2020-04-30 15:23:49 +02:00
Miriam Baglioni
276b95b7b3
add create file instruction
2020-04-30 15:05:17 +02:00
Miriam Baglioni
65a5d67b8b
minor modifications
2020-04-30 14:45:27 +02:00
Miriam Baglioni
418595fec2
removed the saveGraph parameter
2020-04-30 14:45:00 +02:00
Miriam Baglioni
ce8b1d0bc3
new workflow definition to be inserted in the provision pipeline
2020-04-30 14:38:54 +02:00
Miriam Baglioni
4b0bd91012
-
2020-04-30 12:45:28 +02:00
Miriam Baglioni
2349bfd8b8
changed the job test to remove the writeUpdate option
2020-04-30 11:43:33 +02:00
Sandro La Bruzzo
1e06bbaee8
fixed test
2020-04-30 11:38:58 +02:00
Miriam Baglioni
951517f9ec
new input parameters and workflow definition to be used in the provision pipeline
2020-04-30 11:32:50 +02:00
Miriam Baglioni
026f297e49
removed the writeUpdate option
2020-04-30 11:31:59 +02:00
Sandro La Bruzzo
b8e95295e2
merged from master
2020-04-30 11:27:59 +02:00
Miriam Baglioni
c89fe762b1
modified relation datasource organization
2020-04-30 11:17:03 +02:00
Miriam Baglioni
3abb76ff7a
merge with upstream
2020-04-30 11:15:54 +02:00
Michele Artini
eb9bd42970
fixed a problem with journals
2020-04-30 11:06:05 +02:00
Miriam Baglioni
638a3c465b
-
2020-04-30 11:05:17 +02:00
Michele Artini
a0a6109bbc
fixed a problem with journals
2020-04-30 11:03:46 +02:00
Miriam Baglioni
354f0162be
changes in the blacklist and workflow definition
2020-04-30 10:26:50 +02:00
Claudio Atzori
439c6255a2
cleanup
2020-04-29 19:09:07 +02:00
Claudio Atzori
77ac995770
cleaned up poms, added descriptions
2020-04-29 18:44:17 +02:00
Miriam Baglioni
3cffee74b9
merge with upstream
2020-04-29 18:25:29 +02:00
Miriam Baglioni
9ab46535e7
pom with the new blacklist module added
2020-04-29 18:17:15 +02:00
Miriam Baglioni
6a47e6191d
read from blacklist and write the result as relations on hdfs
2020-04-29 18:16:01 +02:00
Miriam Baglioni
869f576273
added hash map for relationship entityType id prefix, and relation inverse
2020-04-29 18:14:52 +02:00
Miriam Baglioni
b85ad7012a
reads the blacklist from the blacklist db and writes it as a set of relations on hdfs
2020-04-29 17:29:49 +02:00
Claudio Atzori
8fd81e863d
added default value for the external_stats_db_name
2020-04-29 15:36:24 +02:00
Claudio Atzori
c6f3ff4462
stats workflow content relocated into common package; added <global> property definitions in stats workflow.xml
2020-04-29 14:29:27 +02:00
Sandro La Bruzzo
4a89465740
reformatted code
2020-04-29 13:24:29 +02:00
Sandro La Bruzzo
a6b1a59d0a
merged with master
2020-04-29 13:20:57 +02:00
Sandro La Bruzzo
920c0f19c3
Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost
2020-04-29 13:13:16 +02:00
Sandro La Bruzzo
09f161f1f4
implemented unit test
2020-04-29 13:13:02 +02:00
miconis
e0d14fe4f8
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-04-29 13:02:53 +02:00
miconis
0352d3b0ba
entity dumps in dedup compressed
2020-04-29 13:02:34 +02:00
Michele Artini
c43b4c8962
formatting
2020-04-29 12:56:58 +02:00
Michele Artini
a5d7007005
Fix relations in migration
...
Fix pom.xml in dhp-stats-update
2020-04-29 12:05:41 +02:00
Miriam Baglioni
f7695e833c
resolved conflicts
2020-04-29 11:41:31 +02:00
Claudio Atzori
3616d0f88d
Merge pull request 'Adding the stats workflow to the dnet-hadoop hierarchy' ( #6 ) from spyros/dnet-hadoop:master into master
...
Integrating stats update workflow.
2020-04-29 10:35:02 +02:00
Claudio Atzori
964972d29a
added data provision workflow definition WIP
2020-04-29 09:25:50 +02:00
Enrico Ottonello
1edcd53581
added shell actions to download all 11 activities files from ORCID
2020-04-28 20:25:09 +02:00
miconis
62e467eb0c
assertion numbers updated to fit the new implementation of the pace-core
2020-04-28 11:46:23 +02:00
Claudio Atzori
6f5b899038
reformatted code according to the updated style descriptor
2020-04-28 11:23:29 +02:00
Claudio Atzori
ac25f2d8d1
integrated changes from master
2020-04-28 08:55:28 +02:00
Miriam Baglioni
2980e50edf
merge upstream
2020-04-27 15:06:48 +02:00
Claudio Atzori
a0bdbacdae
switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin
2020-04-27 14:52:31 +02:00
Claudio Atzori
7a3f8085f7
switched automatic code formatting plugin to net.revelc.code.formatter:formatter-maven-plugin
2020-04-27 14:45:40 +02:00
Michele Artini
1260d03eba
skip empty projects
2020-04-27 13:51:13 +02:00
Miriam Baglioni
df34a4ebcc
changed the configuration to add ignorecase option to each verb related to covid-19 community
2020-04-27 12:32:56 +02:00
Miriam Baglioni
7a59324ccf
changed the test to check for the new ignorecase option
2020-04-27 12:31:46 +02:00
Miriam Baglioni
986c97348d
added the ignorecase option to each selection verb
2020-04-27 12:31:05 +02:00
Miriam Baglioni
a303fc9f73
resources for testing propagation of result to community from organization and from semrel
2020-04-27 11:14:16 +02:00
Miriam Baglioni
c093d764a3
-
2020-04-27 11:12:38 +02:00
Miriam Baglioni
c925e2be16
test for propagation of result to community from organization and result to community from semrel
2020-04-27 10:59:53 +02:00
Miriam Baglioni
ec7f166690
changed the bl because of changes to the examples for the re-implementation of the propagation step
2020-04-27 10:58:41 +02:00
Miriam Baglioni
6135096ef1
refactoring
2020-04-27 10:57:50 +02:00
Miriam Baglioni
d30e710165
fixed duplicates action name in the workflow
2020-04-27 10:52:30 +02:00
Miriam Baglioni
f9ee343fc0
new parametrized workflow with preparation steps and new parameter input files
2020-04-27 10:48:31 +02:00
Miriam Baglioni
e2093644dc
changed in the workflow the directory where to store the preparedInfo and the graph generated at this step
2020-04-27 10:46:44 +02:00
Miriam Baglioni
8a58bf2744
removed the writeUpdate option
2020-04-27 10:45:06 +02:00
Miriam Baglioni
5dccbe13db
merge with upstream
2020-04-27 10:43:59 +02:00
Miriam Baglioni
7b6505ec69
new resources for testing propagation of project to result after the re-implementation
2020-04-27 10:42:16 +02:00
Miriam Baglioni
1b0e0bd1b5
refactoring
2020-04-27 10:40:26 +02:00
Miriam Baglioni
e5a177f0a7
refactoring
2020-04-27 10:36:21 +02:00
Miriam Baglioni
e000754c92
refactoring
2020-04-27 10:34:03 +02:00
Miriam Baglioni
95a54d5460
removed the writeUpdate option. The update is available in the preparedInfo path
2020-04-27 10:30:32 +02:00
Miriam Baglioni
8802e4126b
re-implemented inverting the couple: from (projectId, relatedResultList) to (resultId, relatedProjectList)
2020-04-27 10:26:55 +02:00
Enrico Ottonello
a1861b9eaa
workflow works in parallel on 2 activity files
2020-04-24 18:33:37 +02:00
Enrico Ottonello
941e94af06
added workflow for generating authors with dois data sequence file
2020-04-24 15:50:40 +02:00
Claudio Atzori
268462623a
refined definition of equals and hash methods for Oaf model classes, now based on entity identifier, while relations consider sourceid, targetid and relationship semantic; Factored out function to group Oaf objects in grouping operations; Raw graph creation procedure merges entities and relationships providing the same identity
2020-04-24 14:42:01 +02:00
Claudio Atzori
a3e480d1c9
implemented DispatchEntitiesApplication using spark2 datasets
2020-04-24 14:36:53 +02:00
Claudio Atzori
48157e0fc4
GraphHiveImporterJob moved to a dedicated package
2020-04-24 14:32:28 +02:00
Miriam Baglioni
adcbf0e29a
refactoring
2020-04-24 10:47:43 +02:00
Claudio Atzori
278fc9d276
code formatting
2020-04-23 18:51:38 +02:00
miconis
5414236644
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-04-23 18:17:23 +02:00
miconis
8d258c85ff
spark dedup test fixed, sample for dataset and orp added, test implemented
2020-04-23 18:16:20 +02:00
Michele Artini
072eae3803
fixed a problem with missing contexts
2020-04-23 16:42:49 +02:00
Michele Artini
b164d96874
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-04-23 16:19:16 +02:00
Michele Artini
d920ce501e
fixed a problem with missing instances
2020-04-23 16:18:40 +02:00
Miriam Baglioni
0e447add66
removed unuseful classes
2020-04-23 12:59:43 +02:00
Miriam Baglioni
edb00db86a
refactoring
2020-04-23 12:57:35 +02:00
Miriam Baglioni
44fab140de
-
2020-04-23 12:42:07 +02:00
Miriam Baglioni
769aa8178a
refactoring
2020-04-23 12:40:44 +02:00
Miriam Baglioni
d8dc31d4af
refactoring
2020-04-23 12:35:49 +02:00
Miriam Baglioni
8c5dac5cc3
removed unuseful classes
2020-04-23 12:30:58 +02:00
Miriam Baglioni
15656684b9
added properties for the preparation step and actual propagation. Added the new parametrized workflow
2020-04-23 12:13:34 +02:00
Miriam Baglioni
6f35f5ca42
added the steps of reset output dir and copy information not changed by the propagation step
2020-04-23 12:12:07 +02:00
Miriam Baglioni
19cd5b85c0
changed the classname to execute
2020-04-23 12:07:41 +02:00
Miriam Baglioni
fa2ff5c6f5
refactoring
2020-04-23 11:58:26 +02:00
Miriam Baglioni
540f70298b
added missing property
2020-04-23 11:51:48 +02:00
Miriam Baglioni
e431fe4f5b
added the implements Serializable to each class
2020-04-23 11:48:47 +02:00
Miriam Baglioni
24fa81d7e8
implementation parametrized for result type
2020-04-23 11:44:19 +02:00
Miriam Baglioni
ab2a24cc2b
changed the dependency to use reflections to find annotated classes
2020-04-23 11:08:47 +02:00
Miriam Baglioni
5153d88bd3
definition of workflow and properties for bulktagging
2020-04-23 11:04:53 +02:00
Miriam Baglioni
3b2e4ab670
test for bulktag
2020-04-23 10:00:10 +02:00
Sandro La Bruzzo
fdc0523e4c
Merge remote-tracking branch 'origin/master' into doiboost
2020-04-23 09:34:13 +02:00
Sandro La Bruzzo
4ba386d996
improved crossref mapping
2020-04-23 09:33:48 +02:00
Claudio Atzori
8851050814
replaced hive_db_name with hiveDbName
2020-04-23 08:36:40 +02:00
Claudio Atzori
91f81107b1
applying code formatting
2020-04-23 07:52:32 +02:00
Claudio Atzori
1e7583c5a6
filtered invisible records in data provision workflow
2020-04-23 07:51:34 +02:00
Claudio Atzori
9ddafd46ca
fixed dedup record id prefix, set the correct dataInfo in the DedupRecordFactory
2020-04-23 07:50:18 +02:00
Claudio Atzori
ade4cb97af
fixed parameters passed to the postprocessing action in the workflow mapping the graph as hive DB
2020-04-22 18:24:06 +02:00
Sandro La Bruzzo
bb6c9785b4
Merge remote-tracking branch 'origin/master' into doiboost
2020-04-22 15:00:57 +02:00
Sandro La Bruzzo
157915988c
improved crossref mapping
2020-04-22 15:00:44 +02:00
Enrico Ottonello
5977f08e92
merged
2020-04-22 14:50:50 +02:00
Enrico Ottonello
7d759947ae
used vtd for parsing orcid xml record, set 4g heapspace
2020-04-22 14:41:19 +02:00
Claudio Atzori
e81960335c
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-04-22 10:46:37 +02:00
Michele Artini
9e4d58f505
ResultType
2020-04-22 10:07:26 +02:00
Claudio Atzori
c891661822
small adjustments in the graph2hive workflow
2020-04-21 18:52:23 +02:00
Miriam Baglioni
259525cb93
Merge remote-tracking branch 'upstream/master'
2020-04-21 18:33:46 +02:00
Miriam Baglioni
30e53261d0
minor
2020-04-21 18:00:53 +02:00
Claudio Atzori
0b55795d4d
small adjustments in the provisioning workflow
2020-04-21 16:15:04 +02:00
Claudio Atzori
88fbb3a353
added sparkSqlWarehouseDir to the default extra spark options passed to each workflow
2020-04-21 16:13:43 +02:00
Claudio Atzori
cd320efa96
added extra spark options to graph to hive workflow
2020-04-21 16:12:20 +02:00
Miriam Baglioni
90c768dde6
added shaded libs module
2020-04-21 16:03:51 +02:00
Claudio Atzori
91e72a6944
Dataset based implementation for SparkCreateDedupRecord phase, fixed datasource entity dump supplementing dedup unit tests
2020-04-21 12:06:08 +02:00
miconis
5c9ef08a8e
spark dedup test fixed
2020-04-21 10:19:04 +02:00
Sandro La Bruzzo
3624947a7f
Merge remote-tracking branch 'origin/master' into doiboost
2020-04-21 08:34:24 +02:00
Claudio Atzori
d772d967aa
restored changes from master branch
2020-04-20 18:53:06 +02:00
Claudio Atzori
eb8a020859
fixed behaviour of DedupRecordFactory
2020-04-20 18:44:06 +02:00
Sandro La Bruzzo
039f9b7871
Merge remote-tracking branch 'origin/master' into doiboost
2020-04-20 18:10:29 +02:00
Sandro La Bruzzo
e4b105cece
improved crossref mapping
2020-04-20 18:10:07 +02:00
Claudio Atzori
ede1af3d85
Merge branch 'master' into deduptesting
2020-04-20 16:52:14 +02:00
miconis
1102e32462
SparkDedupTest updated and organization dump fixed
2020-04-20 16:49:01 +02:00
Claudio Atzori
667d23c58b
finalising Actionset migration workflow
2020-04-20 16:45:21 +02:00
miconis
4da13e4570
Revert "Merge branch 'master' into deduptesting"
...
This reverts commit 772f75d167
, reversing
changes made to 5f45f2c77f
.
2020-04-20 16:04:49 +02:00
Claudio Atzori
9147af7fed
actionsets migration workflow moved in dhp-workflows/dhp-actionmanager
2020-04-20 15:24:33 +02:00
miconis
772f75d167
Merge branch 'master' into deduptesting
2020-04-20 14:50:12 +02:00
Sandro La Bruzzo
5d46ec7d5f
fixed name of wrong package
2020-04-20 14:49:32 +02:00
Sandro La Bruzzo
82cc3b707d
fixed name of wrong package
2020-04-20 14:47:06 +02:00
Sandro La Bruzzo
b2c872cb4d
merged master
2020-04-20 14:04:40 +02:00
Sandro La Bruzzo
7029942e06
Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost
2020-04-20 13:26:41 +02:00
Sandro La Bruzzo
0e45f4d450
continue mapping from crossref to OAF
2020-04-20 13:26:29 +02:00
Enrico Ottonello
a466648b4b
renamed output file
2020-04-20 12:32:03 +02:00
Claudio Atzori
d714bfb4d4
collectedfrom field moved in common parent class Oaf.java
2020-04-20 12:25:19 +02:00
Enrico Ottonello
4ae55e3891
added workflow parameters
2020-04-20 12:00:04 +02:00
Michele Artini
8ff7facfa3
fixed collectedFrom ID
2020-04-20 11:09:27 +02:00
Sandro La Bruzzo
eef60bb9f4
created structure of oozie wf for ORCID
2020-04-20 10:24:57 +02:00
Sandro La Bruzzo
4d0d9de07e
reorganized package and fixed test
2020-04-20 10:02:42 +02:00
Sandro La Bruzzo
618bc1fc72
first implementation of crossrefMapping
2020-04-20 09:53:34 +02:00
Michele Artini
25307965d2
add a default datainfo if missing
2020-04-20 09:43:27 +02:00
Michele Artini
d2058fdc47
tests
2020-04-20 09:31:14 +02:00
Enrico Ottonello
1d44a359ea
renamed package folder
2020-04-20 09:25:40 +02:00
Michele Artini
478a958f09
tests
2020-04-20 09:15:27 +02:00
Miriam Baglioni
e1848b7603
minor
2020-04-18 14:16:42 +02:00
Miriam Baglioni
0ff9b1ef05
added needed parameter
2020-04-18 14:16:29 +02:00
Miriam Baglioni
e2dfe8b656
removed not used action
2020-04-18 14:16:07 +02:00
Miriam Baglioni
437ebbad76
refactoring
2020-04-18 14:15:09 +02:00
Miriam Baglioni
9a8876ac86
added needed parameter
2020-04-18 14:14:08 +02:00
Miriam Baglioni
9854852878
refactoring
2020-04-18 14:13:16 +02:00
Miriam Baglioni
454b8a6a29
Merge remote-tracking branch 'upstream/master'
2020-04-18 14:09:44 +02:00
Miriam Baglioni
890ec28f0f
input parameters for preparation step1
2020-04-18 14:09:37 +02:00
Miriam Baglioni
fbf5c27c27
Added preparation classes before actual propagation
2020-04-18 14:09:03 +02:00
Claudio Atzori
5f45f2c77f
Merge branch 'master' into deduptesting
2020-04-18 12:46:40 +02:00
Claudio Atzori
ad7a131b18
introduced common project code formatting plugin, works on the commit hook, based on https://github.com/Cosium/git-code-format-maven-plugin , applied to each java class in the project
2020-04-18 12:42:58 +02:00
Claudio Atzori
a2938dd059
cleanup
2020-04-18 12:24:22 +02:00
Claudio Atzori
9374ff03ea
Merge branch 'master' into deduptesting
2020-04-18 12:06:58 +02:00
Claudio Atzori
71813795f6
various refactorings on the dnet-dedup-openaire workflow
2020-04-18 12:06:23 +02:00
Enrico Ottonello
7011d4203e
parser of orcid summaries from tar gz file on hdfs, that creates a sequence file with author information (oid, name, surname, credit name)
2020-04-17 18:52:39 +02:00
miconis
6450bb0daa
test for softwares dedup added. definition of orp, dataset and sw dedup configurations
2020-04-17 17:31:59 +02:00
Miriam Baglioni
72c63a326e
removed unuseful class
2020-04-17 17:14:51 +02:00
Miriam Baglioni
00c2ca3ee5
-
2020-04-17 17:14:25 +02:00
Miriam Baglioni
5cd092114f
use mergeFrom method to add the new community contexts
2020-04-17 17:13:18 +02:00
Miriam Baglioni
264c82f21e
minor
2020-04-17 16:54:46 +02:00
Miriam Baglioni
8c079c7a49
unit test for orcid to result propagation from semrel
2020-04-17 16:53:03 +02:00
Miriam Baglioni
eacd140a98
added missing parameter(s)
2020-04-17 16:52:30 +02:00
Miriam Baglioni
390e250faf
use the addPid method of the Author class to add a new pid
2020-04-17 16:52:02 +02:00
Miriam Baglioni
b46b080ddc
use mergeFrom method call to add the country(ies) instead of modify the result directly.
2020-04-17 16:50:54 +02:00
Miriam Baglioni
c4987dd12a
minor
2020-04-17 16:49:08 +02:00
Claudio Atzori
038ac7afd7
relation consistency workflow separated from dedup scan and creation of CCs
2020-04-17 13:12:44 +02:00
Claudio Atzori
c92bfeeaee
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-04-17 13:07:52 +02:00
Miriam Baglioni
adc11c97a7
Merge remote-tracking branch 'upstream/master'
2020-04-17 12:34:31 +02:00
Sandro La Bruzzo
a329ea5575
merged with master branch
2020-04-17 12:23:54 +02:00
Sandro La Bruzzo
01ea7721f3
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-04-17 12:12:25 +02:00
Sandro La Bruzzo
5e2fa996aa
fixed problem with conversion of long into string
2020-04-17 12:11:51 +02:00
miconis
418cf94642
implementation of the deletedbyinference test in propagating relations
2020-04-17 10:40:21 +02:00
Miriam Baglioni
5d772e5263
new implementation of propagation of community to result from organization that exploits the prepared info
2020-04-16 18:45:22 +02:00
Miriam Baglioni
fff1e5ec39
classes to (de)serialize the data provided in the preparation step
2020-04-16 18:44:43 +02:00
Miriam Baglioni
3fd9d6b02f
preparation phase for the propagation of community to result from organization
2020-04-16 18:43:55 +02:00
Miriam Baglioni
a9120164aa
added hive parameter and a step of reset of the working dir in the workflow
2020-04-16 18:42:04 +02:00
Miriam Baglioni
6afbd542ca
changed the save mode to avoid NegativeArraySize... error. Needed to modify also the preparationstep2
2020-04-16 18:40:14 +02:00
Miriam Baglioni
d60fd36046
changed the save method
2020-04-16 16:14:15 +02:00
Miriam Baglioni
951b13ac46
input parameters and workflow for new implementation of propagation of orcid to result from semrel and preparation phases
2020-04-16 16:13:10 +02:00
Miriam Baglioni
4d89f3dfed
removed unuseful classes
2020-04-16 16:11:44 +02:00
Miriam Baglioni
5e72a51f11
-
2020-04-16 16:11:20 +02:00
Miriam Baglioni
c33a593381
renamed
2020-04-16 16:09:47 +02:00
Miriam Baglioni
0e5399bf74
second phase of data preparation. Groups all the possible updates by id
2020-04-16 16:08:51 +02:00
Miriam Baglioni
548ba915ac
first phase of data preparation. For each result type (parallel) it produces the possible updates
2020-04-16 15:58:42 +02:00
Miriam Baglioni
243013cea3
to (de)serialize the association from the resultId and the list of authoritative authors with orcid to possibly propagate
2020-04-16 15:57:29 +02:00
Miriam Baglioni
ac3ad25b36
to (de)serialize needed information of the author to determine if the orcid can be passed (name, surname, fullname (?), orcid)
2020-04-16 15:56:33 +02:00
Miriam Baglioni
d6cd700a32
new implementation that exploits prepared information (the list of possible updates: resultId - possible list of orcid to be added)
2020-04-16 15:55:25 +02:00
Miriam Baglioni
f077f22f73
minor
2020-04-16 15:54:16 +02:00
Miriam Baglioni
fd5d792e35
refactoring
2020-04-16 15:53:34 +02:00
Claudio Atzori
cb0952428e
Merge branch 'master' into deduptesting
2020-04-16 14:42:25 +02:00
Claudio Atzori
cc21bbfb1a
Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting
2020-04-16 14:41:37 +02:00
Claudio Atzori
ec5dfc068d
added spark.sql.shuffle.partitions=3840 to dedup scan wf
2020-04-16 14:41:28 +02:00
Claudio Atzori
09f356b047
Merge pull request 'Closes #7 : subdirs inside graph table dirs' ( #8 ) from przemyslaw.jacewicz/dnet-hadoop:przemyslawjacewicz_7_distcp_configuration_fix into master
...
Ran the code from this PR in isolation and it worked fine. Thanks!
2020-04-16 14:38:46 +02:00
Claudio Atzori
3437383112
Merge branch 'master' into deduptesting
2020-04-16 12:46:14 +02:00
miconis
0eccbc318b
Deduper class (utilities for dedup) cleaned. Useless methods removed
2020-04-16 12:36:37 +02:00
Claudio Atzori
76d23895e6
Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting
2020-04-16 12:18:32 +02:00
miconis
6a089ec287
minor changes
2020-04-16 12:15:38 +02:00
Claudio Atzori
376efd67de
removed prepare statement in spark action
2020-04-16 12:14:16 +02:00
miconis
9b36458b6a
Merge branch 'deduptesting' of code-repo.d4science.org:D-Net/dnet-hadoop into deduptesting
2020-04-16 12:13:58 +02:00
miconis
cd4d9a148f
creating temporary directories in dedup test
2020-04-16 12:13:26 +02:00
Claudio Atzori
b39ff36c16
improving the wf definitions
2020-04-16 12:11:37 +02:00
Claudio Atzori
011b342bc9
trying to avoid OOM in SparkPropagateRelation
2020-04-16 11:13:51 +02:00
Miriam Baglioni
08227cfcbd
resources needed for running the test on propagation of result to organization from institutional repositories
2020-04-16 11:06:10 +02:00
Miriam Baglioni
a97e915c24
test unit for propagation of result to organization from institutional repository
2020-04-16 11:05:21 +02:00
Miriam Baglioni
b078710924
modification to the test due to the removal of unused parameters
2020-04-16 11:04:39 +02:00
Miriam Baglioni
a5e5c81a2c
input parameters and workflow definition for propagation of result to organization from institutional repositories
2020-04-16 11:03:41 +02:00
Miriam Baglioni
5e1bd67680
removed unuseful parameter
2020-04-16 11:02:01 +02:00
Miriam Baglioni
eaf19ce01b
removed unuseful class
2020-04-16 10:59:33 +02:00
Miriam Baglioni
7bd49abbef
commit to delete
2020-04-16 10:59:09 +02:00
Miriam Baglioni
53f418098b
added the isTest checkpoint
2020-04-16 10:53:48 +02:00
Miriam Baglioni
c28333d43f
minor
2020-04-16 10:52:50 +02:00
Miriam Baglioni
a8100baed6
changed the way to save the results to avoid NegativeArray... error
2020-04-16 10:50:09 +02:00
Miriam Baglioni
79b978ec57
refactoring
2020-04-16 10:48:41 +02:00
Claudio Atzori
069ef5eaed
trying to avoid OOM in SparkPropagateRelation
2020-04-15 21:23:21 +02:00
Claudio Atzori
8eedfefc98
try to introduce intermediate serialization on hdfs to avoid OOM
2020-04-15 18:35:35 +02:00
Przemysław Jacewicz
da019495d7
[dhp-actionmanager] target dir removal added for distcp actions
2020-04-15 17:56:57 +02:00
miconis
5689d49689
minor changes
2020-04-15 16:34:06 +02:00
Claudio Atzori
c439d0c6bb
PromoteActionPayloadForGraphTableJob reads directly the content pointed by the input path, adjusted promote action tests (ISLookup mock)
2020-04-15 16:18:33 +02:00
Claudio Atzori
ff30f99c65
using newline delimited json files for the raw graph materialization. Introduced contentPath parameter
2020-04-15 16:16:20 +02:00
Sandro La Bruzzo
3d3ac76dda
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-04-15 15:24:01 +02:00
Sandro La Bruzzo
74a7fac774
fixed problem with timestamp
2020-04-15 15:23:54 +02:00
Miriam Baglioni
3577219127
removed unuseful classes
2020-04-15 12:45:49 +02:00
Miriam Baglioni
964b22d418
modified the writing of the new relations. Before: read old rels, add the new ones to them, write all the relations in the new location. Now: the first step of the wf copies the old relations to the new location. If new relations are found, they are saved in the new location in append mode.
2020-04-15 12:32:01 +02:00
Miriam Baglioni
43f0590d4b
change in the testing because the business logic is changed.
2020-04-15 12:29:50 +02:00
Miriam Baglioni
473d17767c
new business logic for the actual propagation. It exploits previously computed information
2020-04-15 12:25:44 +02:00
Miriam Baglioni
6a377a7582
class to compute some information needed for the actual propagation
2020-04-15 12:25:11 +02:00
Miriam Baglioni
5a3487280d
classes to serialize/deserialize the prepared data
2020-04-15 12:24:36 +02:00
Miriam Baglioni
62b09be43c
added correct description for parameter isSparkSessionManaged
2020-04-15 12:23:06 +02:00
Miriam Baglioni
1859ce8902
minor refactoring
2020-04-15 12:21:31 +02:00
Miriam Baglioni
27f1d3ee8f
minor refactoring
2020-04-15 12:21:05 +02:00
Alessia Bardi
550a9f82ed
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-04-14 17:53:01 +02:00
Alessia Bardi
a68fae9bcb
now supporting openaire 4.0 compliance
2020-04-14 17:52:48 +02:00
Sandro La Bruzzo
c36239e693
fixed incremental indexing
2020-04-14 17:47:36 +02:00
Miriam Baglioni
3f4b579e7f
new workflow. It is composed of four steps. The first removes the directory where to store the results. The second copies the relation to the new location, the third is the preparation phase and then the actual propagation
2020-04-14 16:49:24 +02:00
Miriam Baglioni
ca2b40952e
minor changes
2020-04-14 16:48:02 +02:00
Miriam Baglioni
61d39e659e
parameters for the project2result propagation phase
2020-04-14 16:47:39 +02:00
Miriam Baglioni
92f19fa0a0
parameters for the project2result preparation phase
2020-04-14 16:46:57 +02:00
Miriam Baglioni
cadab9b81d
new implementation for result to project propagation. Use the prepared info in propagation
2020-04-14 16:46:07 +02:00
Miriam Baglioni
ceb1f299bf
minor changes
2020-04-14 16:45:12 +02:00
Claudio Atzori
82e8341f50
reorganizing parameter names in the provision workflow
2020-04-14 15:54:41 +02:00
Miriam Baglioni
e0038bde5b
Support class to serialize/deserialize the association project, set of linked results
2020-04-14 15:32:12 +02:00
Miriam Baglioni
c0bebb7c35
code to compute the prepared information used in the actual propagation step. This step will produce two files: one with potential updates (association between projects and a list of results), the other with already linked entities (association between projects and the list of results already linked to them)
2020-04-14 15:31:26 +02:00
Miriam Baglioni
f47ee5b78e
directory where to store the prepared info before actual propagation will take place
2020-04-14 15:29:21 +02:00
Miriam Baglioni
36cc9516d8
the starting relation set for testing
2020-04-14 15:28:34 +02:00
Miriam Baglioni
4b01dc60e6
test unit for result to project propagation
2020-04-14 15:28:00 +02:00
Miriam Baglioni
8f12292daa
changed the way to save the results on filesystem
2020-04-11 16:47:34 +02:00
Miriam Baglioni
87f802821e
new workflow for country propagation: it is composed of the preparation step and the propagation step. The propagation part runs in parallel on the result types
2020-04-11 16:40:22 +02:00
Miriam Baglioni
a562080b0b
parameters to be used in the prepared Job and in the actual country propagation job
2020-04-11 16:39:17 +02:00
Miriam Baglioni
1251ad4455
removed unuseful class
2020-04-11 16:38:13 +02:00
Miriam Baglioni
aef9b3aa90
new parametric implementation of country propagation. Exploits information computed before and broadcasts it to each executor
2020-04-11 16:36:59 +02:00
Miriam Baglioni
a2d833d5dd
step of data preparation before actual country propagation will take place
2020-04-11 16:36:03 +02:00
Miriam Baglioni
6897c920a2
classes in support of new implementation of country propagation
2020-04-11 16:35:26 +02:00
Miriam Baglioni
85766a02d8
added dependency to use hive on local machine
2020-04-11 16:34:22 +02:00
Miriam Baglioni
79b8ea4fed
prepared information to be used in actual country propagation. Subset of info
2020-04-11 16:29:41 +02:00
Miriam Baglioni
1822476613
Test for country propagation
2020-04-11 16:28:09 +02:00
Miriam Baglioni
7783b09c5b
new implementation for result to project propagation. Prepare some info to be used in propagation
2020-04-11 16:26:23 +02:00
Claudio Atzori
6b5f9ca9cb
raw graph creation workflow moved under dhp-graph-mapper, claims integration is included
2020-04-10 17:53:07 +02:00
Miriam Baglioni
90469789b9
two new classes for new implementation of project to result propagation
2020-04-09 13:29:01 +02:00
Miriam Baglioni
627ad58a8b
new wf definition
2020-04-09 11:33:19 +02:00
Miriam Baglioni
9c63c4840d
new workflow and parameters for country propagation
2020-04-08 19:13:42 +02:00
Miriam Baglioni
a2d309545b
new parametrized implementation for country propagation
2020-04-08 19:12:59 +02:00
Miriam Baglioni
6dfdba9ef7
new parametrized implementation for country propagation
2020-04-08 18:14:37 +02:00
Miriam Baglioni
03f7cb6402
new parametrized implementation for country propagation
2020-04-08 18:08:41 +02:00
Miriam Baglioni
df2fc4a6d7
Merge remote-tracking branch 'upstream/master'
2020-04-08 18:07:26 +02:00
Miriam Baglioni
fcfef4632f
input parameters for country propagation preparation job
2020-04-08 18:07:18 +02:00
miconis
0be2e72be5
further implementation of tests for the deduplication of each entity. publication dump added, empty entity files created
2020-04-08 18:02:30 +02:00
Miriam Baglioni
61045e84d9
merged conflict in pom
2020-04-08 14:23:30 +02:00
Claudio Atzori
47f3d9b757
unit test for GraphHiveImporterJob
2020-04-08 13:24:43 +02:00
Sandro La Bruzzo
ba9f07a6fe
fixed wrong test
2020-04-08 13:18:20 +02:00
Miriam Baglioni
540da4ab61
new business logic with prepared info before actual job run
2020-04-08 13:04:04 +02:00
Miriam Baglioni
8438702b3d
addition in propagation constants
2020-04-08 10:54:01 +02:00
Miriam Baglioni
2afe971816
new implementation for country propagation
2020-04-08 10:49:09 +02:00
Miriam Baglioni
beebbcf66b
new config for countrypropagation
2020-04-08 10:31:29 +02:00
Claudio Atzori
d74e128aa6
Utility classes moved in dhp-common and dhp-schemas
2020-04-07 11:56:22 +02:00
Claudio Atzori
c57cf679ca
Merge branch 'provision_dataset'
2020-04-07 08:56:58 +02:00
Claudio Atzori
1a1a026a18
we do expect to find field bestaccessright already defined. No need to add it again
2020-04-07 08:55:33 +02:00
Claudio Atzori
fbdd18a96b
using dataset based relation preparation procedure
2020-04-07 08:54:39 +02:00
Claudio Atzori
77f59b1b10
dataset based provision WIP
2020-04-06 19:37:27 +02:00
Claudio Atzori
6177cf36fb
Merge pull request 'Closes #4 : New action manager implementation' ( #5 ) from przemyslaw.jacewicz/dnet-hadoop:przemyslawjacewicz_actionmanager_impl_prototype into master
...
Nothing more to add here. Thanks for your contribution!
2020-04-06 17:35:07 +02:00
Claudio Atzori
e355961997
dataset based provision WIP
2020-04-06 17:34:25 +02:00
miconis
56fbe689f0
implementation of the tests for each spark action
2020-04-06 16:30:31 +02:00
Claudio Atzori
ca345aaad3
dataset based provision WIP
2020-04-06 15:33:31 +02:00
Claudio Atzori
c8f4b95464
dataset based provision WIP
2020-04-06 08:59:58 +02:00
Claudio Atzori
eb2f5f3198
dataset based provision WIP
2020-04-04 17:41:31 +02:00
Claudio Atzori
3d1b637cab
dataset based provision WIP
2020-04-04 14:03:43 +02:00
miconis
53fd624c34
implemented test for sparkcreatesimrels
2020-04-03 18:32:25 +02:00
Claudio Atzori
24b2c9012e
dataset based provision WIP
2020-04-02 18:44:09 +02:00
miconis
a61763d149
structure for sparksimrel changed to be compliant with mockito testing
2020-04-02 18:37:53 +02:00
Claudio Atzori
daa26acc9d
dataset based provision WIP, fixed spark2EventLogDir
2020-04-02 16:15:50 +02:00
Przemysław Jacewicz
7b2a7e2417
[dhp-actionmanager] missing descriptions added and minor naming and formatting fixes
2020-04-02 11:48:40 +02:00
Spyros Zoupanos
1ab97bbe00
Adding the full stats workflow to the dnet-hadoop hierarchy
2020-04-01 22:22:05 +03:00
Claudio Atzori
9c7092416a
dataset based provision WIP
2020-04-01 19:07:30 +02:00
miconis
bfa5bc74df
minor changes
2020-04-01 19:05:48 +02:00
Przemysław Jacewicz
80cf43b9c8
[dhp-actionmanager] promoting workflow added
2020-04-01 18:51:25 +02:00
Przemysław Jacewicz
5b459bcc47
[dhp-actionmanager] promoting spark job added
2020-04-01 18:49:08 +02:00
miconis
9802bcb9fe
dedup testing
2020-04-01 18:48:31 +02:00
Przemysław Jacewicz
e21bb89dbd
[dhp-actionmanager] partitioning spark job added
2020-04-01 18:41:29 +02:00
Przemysław Jacewicz
f9f7350bb9
[dhp-actionmanager] common package added with utility classes supporting hadoop and spark envs
2020-04-01 18:39:26 +02:00
Przemysław Jacewicz
ad70c23b2e
[dhp-actionmanager] pom updated
2020-04-01 18:36:00 +02:00
Przemysław Jacewicz
4e910a78d4
[dhp-workflows] spark 2 connection properties added
2020-04-01 18:29:26 +02:00
Claudio Atzori
1402eb1fe7
cleanup
2020-04-01 15:38:50 +02:00
Claudio Atzori
7061d07727
ActionSets migration serialize the output as plain text files instead of SequenceFiles
2020-04-01 14:58:22 +02:00
Claudio Atzori
adcdd2d05e
WIP: reimplementing the adjacency list construction process using spark Datasets
2020-04-01 14:56:57 +02:00
Sandro La Bruzzo
205e9521c6
implemented import crossref job
2020-04-01 14:12:33 +02:00
Sandro La Bruzzo
201d79021e
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-03-31 14:41:41 +02:00
Sandro La Bruzzo
cd7416ae4c
first implementation of incremental update of scholix index
2020-03-31 14:41:35 +02:00
przemek
9d1d18d4b9
Merge branch 'master' into przemyslawjacewicz_actionmanager_impl_prototype
2020-03-31 12:04:58 +02:00
Claudio Atzori
377e1ba840
[maven-release-plugin] prepare for next development iteration
2020-03-30 20:06:00 +02:00
Claudio Atzori
76d9315129
[maven-release-plugin] prepare release dhp-1.1.6
2020-03-30 20:05:56 +02:00
Claudio Atzori
ef429010ee
removed log file and job-override.properties
2020-03-30 20:00:58 +02:00
Claudio Atzori
0fbec69b82
use oozie prepare statement to cleanup working directories
2020-03-30 19:48:41 +02:00
Claudio Atzori
3af2b8d700
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-03-30 13:12:21 +02:00
Claudio Atzori
f3f9affd49
allow dynamic executors to build XML records
2020-03-30 13:12:11 +02:00
Claudio Atzori
2e2d4c4c68
adjusted path to template resource
2020-03-30 13:11:49 +02:00
Miriam Baglioni
dd011f4a95
to make them visible to Claudio
2020-03-30 10:55:47 +02:00
Miriam Baglioni
b1af90a45f
to make it visible to Claudio
2020-03-30 10:50:03 +02:00
Sandro La Bruzzo
62cc257e5c
fixed step1 workflow
2020-03-27 17:07:34 +01:00
Sandro La Bruzzo
1a7a866861
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-03-27 15:11:48 +01:00
Sandro La Bruzzo
7cef698f36
reformat code
2020-03-27 15:11:34 +01:00
Claudio Atzori
1767dfaa3f
method can be protected, it is meant to be used only in tests
2020-03-27 14:31:26 +01:00
Sandro La Bruzzo
a4b6a51168
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-03-27 13:48:56 +01:00
Sandro La Bruzzo
15d9106b3f
Fixed merge of dhp dedup
2020-03-27 13:48:44 +01:00
Claudio Atzori
e196fff212
adjusted path for source resource in unit test
2020-03-27 13:45:10 +01:00
Sandro La Bruzzo
8c9a56a0c8
refactored package name
2020-03-27 13:19:33 +01:00
Sandro La Bruzzo
2bd2d6f202
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-03-27 13:16:36 +01:00
Sandro La Bruzzo
a9935f80d4
refactor class name and workflow name for graph mapper, added javadoc
2020-03-27 13:16:24 +01:00
Michele Artini
ae03948eed
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-03-27 11:47:07 +01:00
Michele Artini
f6e86b44a6
tests
2020-03-27 11:46:37 +01:00
Michele Artini
408be3c632
test and fixed a problem with datacite namespaces
2020-03-27 11:44:50 +01:00
Claudio Atzori
673e744649
moved openaire specific implementations under dedicated package eu.dnetlib.dhp.oa
2020-03-27 10:42:17 +01:00
Claudio Atzori
098fabab3f
reorganizing content under dhp-workflows/dhp-graph-mapper
2020-03-26 19:44:19 +01:00
Claudio Atzori
77c4294924
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-03-26 18:26:52 +01:00
Claudio Atzori
43cbcda7ef
unit test for SparkGraphImporterJob
2020-03-26 18:26:40 +01:00
Sandro La Bruzzo
e04da6d66a
merged all oozie wf in one
2020-03-26 14:17:07 +01:00
Sandro La Bruzzo
e71e001b58
commented test that doesn't work
2020-03-26 14:15:21 +01:00
Sandro La Bruzzo
0cd022ad6a
merge with master
2020-03-26 14:08:29 +01:00
Claudio Atzori
abcd3f5bf5
added sample data for unit tests
2020-03-26 11:12:52 +01:00
Sandro La Bruzzo
d5f11e27be
renamed wf
2020-03-26 09:49:23 +01:00
Sandro La Bruzzo
9a37ad0127
renamed modules
2020-03-26 09:46:46 +01:00
Sandro La Bruzzo
a768226e52
updated generate scholix to generate json
2020-03-26 09:40:50 +01:00
Claudio Atzori
9dff4adbc3
dhp-graph-mapper workflow tests upgraded to junit5
2020-03-25 18:25:12 +01:00
Claudio Atzori
cd7dc3e1ae
dhp-dedup-openaire workflow tests upgraded to junit5
2020-03-25 18:04:23 +01:00
Claudio Atzori
c0e825e713
dhp-aggregation workflow tests upgraded to junit5
2020-03-25 17:59:45 +01:00
Michele Artini
ebe45003d9
fixed some junit packages
2020-03-25 16:45:03 +01:00
Michele Artini
d9bfdcd607
updated poms
2020-03-25 16:31:12 +01:00
Michele Artini
120e823cd1
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-03-25 16:00:10 +01:00
Claudio Atzori
71ae7dd272
renamed module dnet-dedup to dnet-dedup-openaire
2020-03-25 15:57:09 +01:00
Michele Artini
fd57722c69
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-03-25 15:56:49 +01:00
Claudio Atzori
f441f823dd
fixed path referencing a test resource file
2020-03-25 15:21:46 +01:00
Claudio Atzori
51d0c9bdd7
integrated changes from branch dedupTest
2020-03-25 15:15:41 +01:00
Claudio Atzori
36f8f2ea66
master set to 'yarn' in spark actions, removed path to rawSet from the dedup scan workflow
2020-03-25 14:16:06 +01:00
Michele Artini
2559299da4
tests
2020-03-25 12:25:00 +01:00
Claudio Atzori
2180cc4fe7
more fields included in result view definition
2020-03-25 11:21:46 +01:00
Claudio Atzori
efb0b7d660
master set to 'yarn' in spark actions
2020-03-25 11:15:35 +01:00
Michele Artini
0fda2c3a30
some tests on db records
2020-03-25 09:43:58 +01:00
miconis
02320de371
minor changes
2020-03-24 17:43:51 +01:00
miconis
8e8b5e8f30
roots wf merged in scan wf
2020-03-24 17:40:58 +01:00
Miriam Baglioni
19d7f8b51d
uncommented execution for some of the result types for testing purposes
2020-03-24 16:49:46 +01:00
Miriam Baglioni
ad24c8478f
added missing parameter
2020-03-24 16:19:59 +01:00
Miriam Baglioni
46094a3eec
bug fixing for implementation with dataset
2020-03-24 16:19:36 +01:00
Claudio Atzori
51ff68db66
Merge branch 'dedupTest' of https://code-repo.d4science.org/D-Net/dnet-hadoop into dedupTest
2020-03-24 11:18:19 +01:00
Claudio Atzori
1e869e7bed
using method available from currently used library
2020-03-24 11:17:44 +01:00
miconis
f0d72b76a8
package structure fixed
2020-03-24 10:51:40 +01:00
Claudio Atzori
aaedbb1b8b
WIP: dedup workflow, stage 2
2020-03-24 09:59:28 +01:00
Michele Artini
e3760c7f39
fix a bug with organization countries
2020-03-24 08:43:56 +01:00
Claudio Atzori
8b0ba3d76a
postprocessing script correctly run as hive2 action
2020-03-23 17:40:39 +01:00
miconis
93e2291291
minor changes
2020-03-23 17:17:56 +01:00
miconis
f7890a90df
implementation of the mechanism that checks the existence of a mergerel file
2020-03-23 17:13:30 +01:00
Miriam Baglioni
ad712f2d79
added the needed variables in the config and read the variables in the workflow
2020-03-23 17:11:36 +01:00
Miriam Baglioni
f1e9fe9752
changed implementation using dataset and query on hive
2020-03-23 17:11:00 +01:00
Miriam Baglioni
f09cd1e911
removed unused variable in the configuration
2020-03-23 17:10:14 +01:00
Miriam Baglioni
9418e3d4fa
read dataset from files instead of using hive tables
2020-03-23 17:09:27 +01:00
Miriam Baglioni
a7bf037306
remove unused class
2020-03-23 14:36:43 +01:00
Miriam Baglioni
8ab8b6b0bf
minor
2020-03-23 14:35:23 +01:00
Miriam Baglioni
30d58fd98c
change the configuration of the workflow
2020-03-23 14:32:49 +01:00
Miriam Baglioni
a440152b46
refactoring
2020-03-23 14:30:56 +01:00
Miriam Baglioni
47561f3597
changed the implementation from rdd to dataset got from sql queries (on hive)
2020-03-23 11:58:32 +01:00
miconis
c20e179f5a
structure of the workflows updated
2020-03-23 11:43:49 +01:00
Claudio Atzori
658d40ccbe
WIP trying to use hive2 actions
2020-03-23 11:14:54 +01:00
Claudio Atzori
ecb64e4998
Merge branch 'migration_wfs_regular_all_steps'
2020-03-23 08:57:01 +01:00
Michele Artini
15160032bd
fixed a bug setting some organization fields
2020-03-23 08:39:14 +01:00
Claudio Atzori
a4c52661a0
WIP: fixing dedup workflows
2020-03-20 19:17:24 +01:00
Claudio Atzori
6cb0a9bff0
dedup wf directory structure aligned with project commons
2020-03-20 16:48:14 +01:00
miconis
e16e644faf
implementation of the workflow for entity update and for relations update
2020-03-20 13:01:56 +01:00
przemek
638b78f96a
Merge remote-tracking branch 'origin/master' into przemyslawjacewicz_actionmanager_impl_prototype
2020-03-19 15:12:56 +01:00
miconis
4e82a24af2
minor changes and implementation of the create connected components action
2020-03-19 15:01:07 +01:00
Claudio Atzori
36236dd1c1
action migration workflow produces eu.dnetlib.dhp.schema.action.AtomicAction(s)
2020-03-19 14:00:38 +01:00
Claudio Atzori
a0ab15a64c
need to stick on using guava:11.0.2 as it is the version used by the hadoop components (oozie client for sure). The last version (28.2-jre) breaks the oozie workflow submission
2020-03-19 13:58:58 +01:00
Sandro La Bruzzo
0594b92a6d
implemented relation with dataset
2020-03-19 11:11:07 +01:00
miconis
679b5869e5
implementation of the lookup procedure to take dedup conf from the resource profiles
2020-03-18 17:41:56 +01:00
Claudio Atzori
abe8fb69a2
added global properties, moved postprocessing script inside the oozie_app directory
2020-03-18 15:43:54 +01:00
miconis
f32eae5ce9
implementation of the spark action for the simrel creation
2020-03-18 14:27:49 +01:00
Claudio Atzori
c7e0730720
compress the output produced by migration steps 1 and 2
2020-03-18 09:34:57 +01:00
Claudio Atzori
2f11e37602
fixed expansion of path variables
2020-03-17 19:41:07 +01:00
Claudio Atzori
2795b0b096
no need to mkdir the all_entities file
2020-03-17 17:22:14 +01:00
Claudio Atzori
19746ad308
when reuseContent, reset ${workingPath}/all_entities
2020-03-17 17:17:06 +01:00
Claudio Atzori
2f0c85eeb3
updated parameters for regular_all_steps workflow, introduced flag 'reuseContent'
2020-03-17 17:04:58 +01:00
Miriam Baglioni
67ea3cf3ed
changed the way to read the file with info on resource or relation. From sequenceFile to textFile
2020-03-17 16:32:05 +01:00
Miriam Baglioni
b4652d018c
moved the creation of new dir to common class.
2020-03-17 16:31:24 +01:00
Claudio Atzori
b8290b5851
updated parameters for regular_all_steps workflow
2020-03-17 15:45:30 +01:00
Claudio Atzori
4706f24ec5
updated parameters for regular_all_steps workflow
2020-03-17 15:23:54 +01:00
Claudio Atzori
aeb01fa353
reading from newline delimited json textfiles instead of sequence files
2020-03-17 11:57:24 +01:00
Miriam Baglioni
92f4e0001d
Merge branch 'bulktag'
2020-03-16 13:33:27 +01:00
Miriam Baglioni
ab08a37024
Merge remote-tracking branch 'upstream/master'
2020-03-16 12:45:23 +01:00
Claudio Atzori
af835f2f98
when migrating actionsets from DM cluster, populate the AtomicAction.targetValue when empty (dedup similarities)
2020-03-15 18:07:59 +01:00
Claudio Atzori
9c84e21b87
added workflow to migrate latest version of each actionset content from DM to OCEAN cluster, mapping the targetValues from the old protobuf data model to the dhp.OAF datamodel
2020-03-13 15:56:52 +01:00
Claudio Atzori
8fe7ae1482
xml formatting
2020-03-13 15:53:56 +01:00
Przemysław Jacewicz
d0c9b0cdd6
WIP promote job functions updated
2020-03-13 12:36:42 +01:00
Przemysław Jacewicz
8d9b3c5de2
WIP action payload mapping into OAF type moved, (local) graph table name enum created, tests fixed
2020-03-13 10:01:39 +01:00
Przemysław Jacewicz
5cc560c7e5
Removed unnecessary dependency on old OAF model
2020-03-13 09:57:46 +01:00
Sandro La Bruzzo
addaaa091f
migrate relation from RDD to Dataset
2020-03-13 09:13:20 +01:00
Przemysław Jacewicz
3f24593e51
WIP: promote job tests and test resources implementation snapshot
2020-03-11 17:06:29 +01:00
Przemysław Jacewicz
2e996d610f
WIP: promote job functions implementation snapshot
2020-03-11 17:02:57 +01:00
Przemysław Jacewicz
cc63cdc9e6
WIP: promote job implementation snapshot
2020-03-11 17:02:06 +01:00
Przemysław Jacewicz
69540f6f78
Serialization-safe supplier added
2020-03-11 16:59:05 +01:00
Przemysław Jacewicz
e6e214dab5
Oaf merge and get strategy added
2020-03-11 16:58:17 +01:00
Claudio Atzori
7b6f0c8756
reading graph dump as text files, encoded as newline-delimited JSON records, as indicated in the wiki
2020-03-10 17:19:17 +01:00
Claudio Atzori
60aedb1110
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-03-10 17:09:44 +01:00
Claudio Atzori
a3f184fd3f
added field websiteurl in related organizations
2020-03-10 17:08:58 +01:00
Claudio Atzori
0e95544495
fixed serialization for datasource subjects
2020-03-10 17:07:44 +01:00
Sandro La Bruzzo
7b28783fb4
updated unpaywall mapping
2020-03-08 17:00:19 +01:00
Michele Artini
b6efa9d6ab
Configuration of the SequenceFile Writer
2020-03-05 15:49:14 +01:00
Claudio Atzori
5e342a555c
no need to compute the inverse relClass, fixed text() in xpath expressions
2020-03-05 12:51:48 +01:00
Claudio Atzori
6ec04d4e02
specified column used to perform the join operation in the javadoc
2020-03-05 12:50:38 +01:00
Michele Artini
7a2a466161
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-03-04 14:50:59 +01:00
Michele Artini
755eade2fb
fix creation ids
2020-03-04 14:49:45 +01:00
Claudio Atzori
6379f32466
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-03-04 10:57:06 +01:00
Claudio Atzori
0233987603
introduced post processing step following the hive DB creation/population
2020-03-04 10:56:50 +01:00
Claudio Atzori
1e563bc15e
introduced distinct properties driving the resource usage for the XML record creation and the indexing phase
2020-03-04 10:55:11 +01:00
Claudio Atzori
9af3e904be
close the SparkSession at the end
2020-03-04 10:53:31 +01:00
Michele Artini
e7167b996a
logs and closeable
2020-03-04 10:46:36 +01:00
Claudio Atzori
25ceec29ab
code formatting
2020-03-04 10:44:24 +01:00
Claudio Atzori
63c00c5e88
fixed typo
2020-03-04 10:43:44 +01:00
Miriam Baglioni
c37f2bd1b5
moved some classes to package to make code clearer
2020-03-03 16:42:23 +01:00
Miriam Baglioni
d9d2060561
implementation for bulk tagging
2020-03-03 16:38:50 +01:00
Miriam Baglioni
e80f80ca93
properties and workflow for new propagation
2020-03-02 17:03:31 +01:00
Claudio Atzori
9cf5ce2e66
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-03-02 17:03:10 +01:00
Claudio Atzori
bc7cfd5975
indexing workflow WIP: fixed projects fundingtree xml conversion, prioritized links between results and projects when limiting them to 100 in the join procedure
2020-03-02 17:03:07 +01:00
Miriam Baglioni
50080c1b3c
changed the implementation of addAll method. Before adding all the items in a collection, we check if the accumulator set is not empty
2020-03-02 16:41:37 +01:00
Miriam Baglioni
02815dd2cf
update result for community moved in propagationconstants
2020-03-02 16:40:56 +01:00
Miriam Baglioni
95f8c3092f
update for new propagation implementation and moving of updateResult for community business logic since the same can be used for result to community from organization and result to community from semrel
2020-03-02 16:40:17 +01:00
Miriam Baglioni
3d63f35dcb
implementation of new propagation. Result to community for results linked to given organization. We exploit the hasAuthorInstitution semantic link to discover which results are related to institutions
2020-03-02 16:39:03 +01:00
Michele Artini
4b29a121b0
migration using spark in step2
2020-03-02 16:12:14 +01:00
Michele Artini
5445a57102
migration using spark in step2
2020-03-02 16:11:59 +01:00
Miriam Baglioni
3a4ccb26c0
New properties for the orcid to result propagation through semantic relation
2020-02-28 18:26:04 +01:00
Miriam Baglioni
b50166b9ad
None
2020-02-28 18:25:28 +01:00
Miriam Baglioni
550cb21c23
None
2020-02-28 18:24:39 +01:00
Miriam Baglioni
b098ee0bae
Changed the structure of typed row to also contain the list of authors with orcid
2020-02-28 18:23:51 +01:00
Miriam Baglioni
841f5523fe
Added information and methods for the new propagation of orcid to result through semrel
2020-02-28 18:23:16 +01:00
Miriam Baglioni
2b7b05fb29
New propagation of ORCID to result exploiting the semantic relation connecting them. If R has an author with orcid o, and R is bound by a strong semantic relationship to R1, which has the same author without an orcid, then o is also associated with the author in R1
2020-02-28 18:22:41 +01:00
Miriam Baglioni
833c83c694
Wrong file name
2020-02-28 18:21:01 +01:00
Miriam Baglioni
a86426776a
Changed from Oaf to Result the type of the updateResult method parameter, not to be forced to cast each time
2020-02-28 18:20:19 +01:00
Sandro La Bruzzo
b32655e48e
changed code to save intermediate result
2020-02-27 10:18:46 +01:00
Claudio Atzori
60bc2b1a20
drop the hive DB before populating it from scratch
2020-02-27 10:10:55 +01:00
Sandro La Bruzzo
f09e065865
incremented number of repartition
2020-02-26 19:26:19 +01:00
Sandro La Bruzzo
071f5c3e52
fixed NPE
2020-02-26 15:42:20 +01:00
Sandro La Bruzzo
a1a6fc8315
fixed NPE
2020-02-26 15:42:13 +01:00
Sandro La Bruzzo
1edf02a3ce
added log
2020-02-26 15:25:03 +01:00
Sandro La Bruzzo
c3ecabd8e8
fixed NPE
2020-02-26 14:40:02 +01:00
Sandro La Bruzzo
5d0f46651b
fixed NPE
2020-02-26 14:31:34 +01:00
Sandro La Bruzzo
bc342bf73a
fixed wrong generation type in summary
2020-02-26 12:49:47 +01:00
Sandro La Bruzzo
3112e21858
fixed typo
2020-02-26 12:22:43 +01:00
Sandro La Bruzzo
119ae6eef5
fixed wrong loop in the workflow
2020-02-26 12:18:50 +01:00
Sandro La Bruzzo
7936583a3d
added generation of Scholix collection
2020-02-26 12:09:06 +01:00
Przemysław Jacewicz
02db368dc5
Merge branch 'master' into przemyslawjacewicz_actionmanager_impl_prototype
2020-02-26 11:50:20 +01:00
Sandro La Bruzzo
2ef3705b2c
Added Provision workflow
2020-02-26 10:51:35 +01:00
Michele Artini
689908b2e9
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-02-25 16:00:51 +01:00
Michele Artini
93665773ea
Fixed a problem with JavaRDD Union
2020-02-25 15:59:21 +01:00
Sandro La Bruzzo
b021b8a2e1
Added index wf
2020-02-24 10:15:55 +01:00
Claudio Atzori
6a73fd5da5
in order to reuse the same XmlRecordFactory across different tasks, the state of contexts must be one per record built
2020-02-21 09:17:19 +01:00
Michele Artini
d49cd2fdc6
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-02-20 11:21:54 +01:00
Miriam Baglioni
3f941a2af4
Merge branch 'master' into propagationCommunityToResult
2020-02-19 18:05:22 +01:00
Miriam Baglioni
b2bdc9b99b
merging project to result propagation logic to master
2020-02-19 18:04:59 +01:00
Miriam Baglioni
a153a07997
none
2020-02-19 18:03:13 +01:00
Miriam Baglioni
d0279af630
start to implement the business logic
2020-02-19 17:59:24 +01:00
Miriam Baglioni
5f63ab1416
to query the information system to get the list of communities up to now. It will have a more general usage when introducing bulk tagging
2020-02-19 17:59:02 +01:00
Miriam Baglioni
5ceb174d24
Merge branch 'master' into propagationCommunityToResult
2020-02-19 17:13:38 +01:00
Miriam Baglioni
e8af7a6b64
Merge remote-tracking branch 'upstream/master'
2020-02-19 17:03:10 +01:00
Miriam Baglioni
79ff79b0cd
propagation of result to community through semantic relation: C -> R and R -> isSupplementedBy R1 => C -> R1
2020-02-19 17:02:39 +01:00
Claudio Atzori
5e5e32cb48
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-02-19 16:56:52 +01:00
Claudio Atzori
33185fd0b7
ISLookupClientFactory moved in dhp-common
2020-02-19 16:56:38 +01:00
Michele Artini
5d3739b5cf
migration of claims
2020-02-19 15:11:17 +01:00
Miriam Baglioni
ab84163bb3
added set accumulator in TypedRow and used it to accumulate country information in Country Propagation
2020-02-19 15:02:50 +01:00
Miriam Baglioni
bb0fdf5e0a
fix wrong source target in new relation
2020-02-19 15:00:46 +01:00
Miriam Baglioni
9e1678ccf8
fix error in workflow name
2020-02-19 14:59:24 +01:00
Miriam Baglioni
8aa3b4d7c0
adding to propagation constants the ones needed for propagation of project to result and addition of new accumulator Set in typed row to collect values of a type
2020-02-19 14:55:54 +01:00
Miriam Baglioni
7167673a58
implementation and configuration for propagation of project to result through semantic relation: P -> R1 and R1 -> supplemented by -> R2 => P -> R2
2020-02-19 14:54:18 +01:00
Michele Artini
173f1df1e5
saved a query for openaire production database
2020-02-19 10:15:08 +01:00
Sandro La Bruzzo
9a2d74ac82
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-02-19 10:13:45 +01:00
Sandro La Bruzzo
e5d7cdf422
fixed sql query
2020-02-19 10:13:36 +01:00
Sandro La Bruzzo
2b8675462f
refactoring code
2020-02-19 10:07:08 +01:00
Miriam Baglioni
b81e6af429
added config for new propagation
2020-02-18 17:30:44 +01:00
Miriam Baglioni
b736a9581c
changed relclass and reltype in relation specification for country propagation and implementation of propagation of result affiliation through institutional repositories
2020-02-18 17:27:28 +01:00
Miriam Baglioni
ed262293a6
aligned to new snapshot version 1.1.6
2020-02-18 17:25:32 +01:00
Miriam Baglioni
2688a89c21
changed relclass and reltype in relation specification
2020-02-18 17:24:40 +01:00
Miriam Baglioni
c0022fec9f
moved on upper package to serve other propagations
2020-02-18 17:24:11 +01:00
Miriam Baglioni
e0a777028a
fix problem in parameters
2020-02-18 17:23:34 +01:00
Claudio Atzori
ed76521d9b
removed stale test resources, will be re-added later on
2020-02-18 11:51:08 +01:00
Claudio Atzori
0f364605ff
removed stale tests, need to reimplement them anyway
2020-02-18 11:48:19 +01:00
Miriam Baglioni
5868ff8a86
synch fork with master
2020-02-17 18:22:27 +01:00
Przemysław Jacewicz
958f0693d6
WIP: logic for promoting action sets added
2020-02-17 18:19:19 +01:00
Miriam Baglioni
18e4092d5c
change name of properties dir
2020-02-17 18:07:06 +01:00
Miriam Baglioni
bd0e504b42
changes to the wf configuration
2020-02-17 18:04:15 +01:00
Miriam Baglioni
3a9d723655
adding default parameters in code
2020-02-17 16:30:52 +01:00
Przemysław Jacewicz
bea1a94346
Merge branch 'master' into przemyslawjacewicz_actionmanager_impl_prototype
...
# Conflicts:
# dhp-workflows/pom.xml
2020-02-17 15:07:23 +01:00
Claudio Atzori
6a288625e5
fixed workflow outgoing node
2020-02-17 15:04:33 +01:00
Miriam Baglioni
a5517eee35
adding the mkdirs for creation of propagation folder under provision on tmp
2020-02-17 14:20:42 +01:00
Miriam Baglioni
9abde5cfac
removed outputPath from job parameters
2020-02-17 14:19:53 +01:00
Claudio Atzori
1b18fd4d54
sync with master branch
2020-02-17 13:49:46 +01:00
Sandro La Bruzzo
4f04759738
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-02-17 12:31:58 +01:00
Sandro La Bruzzo
76ee85141a
added oozie job for DNET migration and implemented Spark job for extracting entities
2020-02-17 12:31:44 +01:00
Miriam Baglioni
be2421d5d8
removed wrongly pushed file
2020-02-17 12:07:26 +01:00
Claudio Atzori
c460e2d281
Update 'dhp-workflows/docs/oozie-installer.markdown'
2020-02-17 11:54:48 +01:00
Miriam Baglioni
c7bc73aedf
country propagation for results collected from institutional repositories
2020-02-17 11:44:48 +01:00
Michele Artini
176c5606bd
aligned with origin/master, aligned model and mapping
2020-02-17 10:40:53 +01:00
Claudio Atzori
56d1810a66
working procedure for records indexing using Spark, via lib com.lucidworks.spark:spark-solr
2020-02-14 12:28:52 +01:00
Claudio Atzori
1ee1baa8c0
Merge branch 'master' into provision_indexing
2020-02-13 18:17:07 +01:00
Claudio Atzori
a3d0b57b25
[maven-release-plugin] prepare for next development iteration
2020-02-13 18:11:33 +01:00
Claudio Atzori
6ed9a15bc8
[maven-release-plugin] prepare release dhp-1.1.5
2020-02-13 18:11:31 +01:00
Claudio Atzori
49e648f7c3
bumped version
2020-02-13 18:09:31 +01:00
Claudio Atzori
f9fae97e09
test json files aligned with the latest model changes
2020-02-13 18:05:59 +01:00
Claudio Atzori
1fee6e2b7e
implemented XML records construction and serialization, indexing WIP
2020-02-13 16:53:27 +01:00
Michele Artini
80cb52593f
bug fixing
2020-02-13 15:34:13 +01:00
Michele Artini
cdea0dae75
bug fixing
2020-02-12 16:34:00 +01:00
Michele Artini
69336195d3
simplifications
2020-02-12 11:12:38 +01:00
Michele Artini
06c2fd6df9
bug fixing
2020-02-11 15:29:50 +01:00
Michele Artini
5fc09b179c
bug fixing
2020-02-11 12:48:03 +01:00
Michele Artini
95740767e0
Ready for tests
2020-02-10 16:04:06 +01:00
Michele Artini
181e8498d4
...
2020-02-07 16:02:49 +01:00
Przemysław Jacewicz
86b60268bb
actionmanager implementation prototyping
2020-02-06 19:14:41 +01:00
Michele Artini
bb1533a07e
partial commit
2020-02-05 15:35:40 +01:00
Michele Artini
fbb0fc140b
partial implementation of migration
2020-02-04 15:25:47 +01:00
Claudio Atzori
7ba0f44d05
WIP
2020-01-30 18:21:07 +01:00
Claudio Atzori
49ef2f4eb1
removed input parameter specification, SparkXmlRecordBuilderJob doesn't need hive
2020-01-30 18:20:26 +01:00
Claudio Atzori
b5e1e2e5b2
reintegrated changes from fcbc4ccd70
2020-01-30 18:11:04 +01:00
Claudio Atzori
7bacd6812e
Merge branch 'provision_indexing' of https://code-repo.d4science.org/D-Net/dnet-hadoop into HEAD
...
Conflicts:
dhp-workflows/dhp-graph-provision/src/main/java/eu/dnetlib/dhp/graph/GraphJoiner.java
dhp-workflows/dhp-graph-provision/src/main/java/eu/dnetlib/dhp/graph/MappingUtils.java
dhp-workflows/dhp-graph-provision/src/main/java/eu/dnetlib/dhp/graph/RelatedEntity.java
dhp-workflows/dhp-graph-provision/src/main/java/eu/dnetlib/dhp/graph/SparkXmlRecordBuilderJob.java
2020-01-30 17:59:46 +01:00
Claudio Atzori
b2691a3b0a
save adjacency list as JoinedEntity
2020-01-30 17:46:29 +01:00
Claudio Atzori
8c2aff99b0
joining entities using T x R x S, WIP: last representation based on LinkedEntity type
2020-01-29 15:40:33 +01:00
Sandro La Bruzzo
19a80e4638
implemented workflow for aggregation and generation of infospace graph
2020-01-24 09:58:55 +01:00
Claudio Atzori
fcbc4ccd70
a bit of docs doesn't hurt
2020-01-24 08:43:23 +01:00
Claudio Atzori
a55f5fecc6
joining entities using T x R x S method with groupByKey, WIP: making target objects (T) have lower memory footprint
2020-01-24 08:17:53 +01:00
Michele Artini
6bfe2dc96e
partial implementation
2020-01-22 16:00:23 +01:00
Claudio Atzori
799929c1e3
joining entities using T x R x S method with groupByKey
2020-01-21 16:35:44 +01:00
Michele Artini
f6eccdde33
partial implementation
2020-01-21 14:17:05 +01:00
Michele Artini
cd114f1c3b
partial update
2020-01-21 12:32:10 +01:00
Michele Artini
b35c59eb42
partial implementation of entities from db
2020-01-20 16:04:19 +01:00
Michele Artini
81f82b5d34
partial implementation of applications to migrate entities
2020-01-17 15:26:21 +01:00
Claudio Atzori
1cd6899480
merged from master
2020-01-17 14:25:57 +01:00
Claudio Atzori
97c239ee0d
WIP: trying to find a way to build the records for the index
2020-01-16 12:02:28 +02:00
miconis
4955be0197
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-01-14 15:03:44 +02:00
miconis
f61adfc2bb
minor changes
2020-01-14 15:03:27 +02:00
miconis
9bdcb02179
minor changes and update of the configuration for publications
2020-01-14 15:01:03 +02:00
Michele Artini
f7b9a7a9af
entity migration (partial implementation)
2020-01-10 15:55:23 +01:00
Michele Artini
7229fecbcf
fix warnings in poms
2019-12-20 13:41:08 +01:00
Sandro La Bruzzo
dd21db7036
fixed stuff
2019-12-18 16:28:22 +01:00
Claudio Atzori
7ba586d2e5
oozie workflow aimed to build the adjacency lists representation of the graph, needed to build the records to be indexed
2019-12-17 16:24:49 +01:00
Sandro La Bruzzo
76efcde4fd
using new branch decisionTreeDedup
2019-12-13 12:20:35 +01:00
Sandro La Bruzzo
b4392f9f43
implemented DedupRecord factory for missing entities
2019-12-13 09:40:02 +01:00
Sandro La Bruzzo
39367676d7
implemented DedupRecord factory with the merge of project
2019-12-12 15:18:48 +01:00
Sandro La Bruzzo
6b45e37e22
implemented DedupRecord factory with the merge of organizations
2019-12-11 16:57:37 +01:00
Sandro La Bruzzo
abd9034da0
implemented DedupRecord factory with the merge of publications
2019-12-11 15:43:24 +01:00
miconis
4b66b471a4
implementation of the sorting by trust mechanism and the merge of oaf entities
2019-12-10 14:57:16 +01:00
Sandro La Bruzzo
cc63706347
Implemented deduplication on spark
2019-12-06 13:38:00 +01:00
Sandro La Bruzzo
aad0cb40b7
Added schema Scholexplorer
2019-11-14 10:34:09 +01:00
Claudio Atzori
5711e75f67
use ${project.version} whenever possible
2019-11-08 17:41:51 +01:00
Claudio Atzori
245b4cbbb3
removed import limit
2019-11-08 17:41:01 +01:00
Claudio Atzori
7fe6835b47
[maven-release-plugin] prepare for next development iteration
2019-11-07 17:39:30 +01:00
Claudio Atzori
58918967d9
[maven-release-plugin] prepare release dhp-1.0.4
2019-11-07 17:39:27 +01:00
Claudio Atzori
5308f05a02
allow to specify the target hive DB name in the infospace import workflow
2019-11-07 17:38:09 +01:00
Claudio Atzori
a52d5bde4f
simplified import procedure, maps the infospace as hive tables
2019-11-06 17:45:52 +01:00
Claudio Atzori
1e7a2ac41d
align parameter names, graph import procedure WIP
2019-11-04 17:41:01 +01:00
Claudio Atzori
f39148dab8
[maven-release-plugin] prepare for next development iteration
2019-11-04 12:34:48 +01:00
Claudio Atzori
34b0e7b40a
[maven-release-plugin] prepare release dhp-1.0.3
2019-11-04 12:34:46 +01:00
Claudio Atzori
439ad80d81
conversion utilities from protobuffer model to DHP model moved in dnet-mapreduce-jobs. Removed also the relative protobuf dependencies
2019-11-04 12:33:23 +01:00
Claudio Atzori
32ed4ae8d6
conversion utilities from protobuffer model to DHP model moved in dnet-mapreduce-jobs. Removed also the relative protobuf dependencies
2019-11-04 12:28:56 +01:00
Sandro La Bruzzo
fd0ad82111
[maven-release-plugin] prepare for next development iteration
2019-10-31 12:08:51 +01:00
Sandro La Bruzzo
f224613b40
[maven-release-plugin] prepare release dhp-1.0.2
2019-10-31 12:08:49 +01:00
Sandro La Bruzzo
e13c30cc96
[maven-release-plugin] rollback the release of dhp-1.0.2
2019-10-31 12:07:04 +01:00
Sandro La Bruzzo
4da5239203
[maven-release-plugin] prepare release dhp-1.0.2
2019-10-31 12:06:14 +01:00
Sandro La Bruzzo
db8b346edd
[maven-release-plugin] rollback the release of 1.0.1
2019-10-31 11:49:05 +01:00
Sandro La Bruzzo
fc80052173
[maven-release-plugin] prepare for next development iteration
2019-10-31 11:47:42 +01:00
Sandro La Bruzzo
3150c7ce6d
[maven-release-plugin] prepare release 1.0.1
2019-10-31 11:47:40 +01:00
Sandro La Bruzzo
18ec8e8147
moved protoutils function to dhp-schemas
2019-10-31 11:31:37 +01:00
Sandro La Bruzzo
997e57d45b
Added entity filter to spark class
2019-10-30 12:19:03 +01:00
Sandro La Bruzzo
a336956708
added default property to job
2019-10-30 12:01:42 +01:00
Claudio Atzori
78b5b57e86
trying to make the spark action to be run as spark2
2019-10-29 18:56:34 +01:00
Claudio Atzori
c8bb81cd9a
align dependencies with IIS cluster
2019-10-29 18:10:20 +01:00
Sandro La Bruzzo
fe62ccd6dd
implemented oozie wf
2019-10-28 12:12:50 +01:00
Sandro La Bruzzo
9ee4e5a196
remove a bit of syntactic sugar on the object inheritance :(
2019-10-25 18:10:30 +02:00
Sandro La Bruzzo
c74335ebc7
resolved conflict
2019-10-25 14:34:50 +02:00
Sandro La Bruzzo
8c902c500a
minor fix
2019-10-25 14:33:54 +02:00
miconis
9fa5aebe9c
minor changes
2019-10-25 12:52:28 +02:00
miconis
551eda1600
dataset, orp and software mapping implemented. addition of test resources for results. implementation of tests to check the result of the mapping
2019-10-25 12:48:25 +02:00
Sandro La Bruzzo
eef14fade3
fixed conflict
2019-10-25 11:58:20 +02:00
Sandro La Bruzzo
0ea7e861ab
added organizations test
2019-10-25 11:56:28 +02:00
miconis
4908165e05
implementation of the createPublication method to map publications
2019-10-25 11:54:14 +02:00
miconis
df37bd6aaf
placeholders for setters in createpublication
2019-10-25 10:57:19 +02:00
Sandro La Bruzzo
c8d6d6bbd1
implemented organization mapping
2019-10-25 10:23:51 +02:00
miconis
b525b54130
starting implementing the createPublication class
2019-10-25 09:55:31 +02:00
Claudio Atzori
4b331790e7
resolved conflicts
2019-10-25 09:45:12 +02:00
Claudio Atzori
c929c1dfac
more proto 2 graph model mappings
2019-10-25 09:25:36 +02:00
Sandro La Bruzzo
09ffda03a2
removed circular dependencies
2019-10-25 09:24:18 +02:00
Sandro La Bruzzo
a10d071cf4
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2019-10-24 17:55:44 +02:00
Sandro La Bruzzo
3a8bb11695
mapped first part
2019-10-24 17:55:40 +02:00
Claudio Atzori
d46371ceab
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2019-10-24 17:43:55 +02:00
Claudio Atzori
0d88f9a6a4
added mapping for projects
2019-10-24 17:43:42 +02:00
Sandro La Bruzzo
2dd9572f41
added Mapping of OriginalDescription
2019-10-24 17:36:44 +02:00
miconis
351d850ad3
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2019-10-24 17:29:07 +02:00
miconis
b66a7e3030
publication test added
2019-10-24 17:29:01 +02:00
Sandro La Bruzzo
6c32d418ac
added conversion of ExtraInfo
2019-10-24 17:26:55 +02:00
Claudio Atzori
5f339a2c24
added mappings for basic types
2019-10-24 17:21:45 +02:00
Sandro La Bruzzo
9d04111391
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2019-10-24 17:05:52 +02:00
Sandro La Bruzzo
0902bac7dd
fixed conflict
2019-10-24 17:05:42 +02:00
Claudio Atzori
d8bfaa3687
added mapping for relations
2019-10-24 17:04:13 +02:00
Sandro La Bruzzo
d2965636e0
created test for converting json into new OAF data model
2019-10-24 17:02:35 +02:00
Claudio Atzori
79c4f1bbd8
Protobuf to internal graph model, early steps
2019-10-24 16:56:13 +02:00
Claudio Atzori
d38aeb8c6e
DataInfo.provenanceaction not repeatable, fluent setters
2019-10-24 16:55:38 +02:00
Sandro La Bruzzo
5744a64478
added module dhp-graph-mapper
2019-10-24 16:00:28 +02:00
Sandro La Bruzzo
5a8a323f2a
dhp-collection-worker integrated in dhp-workflows
2019-10-24 11:36:59 +02:00
Claudio Atzori
dd1d6fcb01
moved libs in main pom file
2019-10-18 10:50:55 +02:00
Claudio Atzori
176a13601b
commented out maven plugin for integration tests
2019-10-18 10:50:32 +02:00
Claudio Atzori
0c284e0a51
doc
2019-10-18 09:42:41 +02:00
Claudio Atzori
c7654b6fe3
renamed collection & transformation oozie workflow files
2019-10-18 09:42:20 +02:00
Claudio Atzori
44d7e85797
imported oozie-installer.markdown docs from https://github.com/openaire/iis/blob/master/iis-wf/docs/oozie-installer.markdown
2019-10-17 18:43:43 +02:00
Claudio Atzori
27db5afdad
integrating the oozie workflow build/deploy/run mechanism, took inspiration from iis
2019-10-17 18:38:30 +02:00
Sandro La Bruzzo
bbb87d0e3d
implemented saxonHE on transformation spark job
2019-10-10 11:33:51 +02:00
Sandro La Bruzzo
4b8c7c279d
Added documentation on a class, and reused ArgumentApplicationParser on dhp-aggregation
2019-10-07 17:02:53 +02:00
Sandro La Bruzzo
53ec9bccca
changed the implementation of RabbitMQ communication
2019-04-16 12:28:01 +02:00
Sandro La Bruzzo
403c13eebf
Implemented message manager, Fixed bug on collection worker, implemented Collection and Transform spark job
2019-04-11 15:39:29 +02:00
Sandro La Bruzzo
ded6aef5e1
moved collector worker
2019-04-03 16:05:16 +02:00
Sandro La Bruzzo
c2ecbf5572
moved collector worker
2019-04-03 16:03:36 +02:00
enricoottonello
b316467608
added common module
2019-04-03 10:53:54 +02:00
Sandro La Bruzzo
12c65eab4c
implemented command line
2019-03-25 15:18:31 +01:00
Sandro La Bruzzo
6156562893
Added test
2019-03-18 10:47:28 +01:00
Sandro La Bruzzo
e67d9ee1a9
added first implementation of dnet-workflows
2019-03-18 10:44:35 +01:00