Miriam Baglioni
a8d65b68cb
changed to delete the part to check if it was a test or a real execution
2020-07-29 16:47:57 +02:00
Miriam Baglioni
3ec2392904
Added new class to move the place the split is effectively run
2020-07-29 16:46:50 +02:00
Michele Artini
8ba94833bd
added an es prop
2020-07-29 14:16:08 +02:00
Miriam Baglioni
178c2729a7
changed the path to reach the java class to be executed
2020-07-29 12:29:51 +02:00
Miriam Baglioni
437ac12139
removed unused parameter
2020-07-29 12:28:16 +02:00
Claudio Atzori
6f11c0496e
fixed typo in module name dhp-worfklow-profiles -> dhp-workflow-profiles
2020-07-28 15:01:58 +02:00
Claudio Atzori
f680eb3e12
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-07-28 14:10:56 +02:00
Claudio Atzori
985b360c31
fixed typo in module name dhp-worfklow-profiles -> dhp-workflow-profiles
2020-07-28 14:10:52 +02:00
Michele Artini
3acd632123
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-28 12:02:30 +02:00
Michele Artini
35e6e9c064
tests
2020-07-28 12:02:15 +02:00
Claudio Atzori
ee832f358e
Merge pull request 'stats_wf_extensions_and_corrections' ( #28 ) from spyros/dnet-hadoop:stats_wf_extensions_and_corrections into master
...
Thank you Guys! The update workflow will be made available to the beta & production orchestration systems under the HDFS path
```/lib/dnet/oa/graph/stats/oozie_app```
2020-07-27 16:02:03 +02:00
Antonis Lempesis
4ac8ebe427
correctly calculating the project duration
2020-07-24 19:50:40 +03:00
Antonis Lempesis
18d9464b52
creating shadow db only if it does not exist...
2020-07-24 19:50:40 +03:00
Antonis Lempesis
e217d496ab
added the dest db...
2020-07-24 19:50:40 +03:00
Antonis Lempesis
b16bb68b9f
added the target db name...
2020-07-24 19:50:40 +03:00
Antonis Lempesis
1ee7eeedf3
added the source db name...
2020-07-24 19:50:40 +03:00
Antonis Lempesis
cecbbfa0fc
added missing tables and views: contexts, creation_date, funder
2020-07-24 19:50:40 +03:00
Antonis Lempesis
25b7a615f5
moved datasource_sources table creation to the datasource section
2020-07-24 19:50:40 +03:00
Antonis Lempesis
a8da4ab9c0
years in projects are now integers
2020-07-24 19:50:40 +03:00
Antonis Lempesis
c9cfc165d9
not using impala since the resulting tables are not visible
2020-07-24 19:50:40 +03:00
Antonis Lempesis
dd3d6a6e15
compute stats for the used and new impala tables
2020-07-24 19:50:40 +03:00
Antonis Lempesis
e6f50de6ef
Separated impala from hive steps
2020-07-24 19:50:40 +03:00
Antonis Lempesis
de49173420
fixed a typo in queries
2020-07-24 19:50:40 +03:00
antleb
391cf80fb8
Added peer-reviewed, green, gold tables and fields in result. Added shortcuts from result-country
2020-07-24 19:50:40 +03:00
antleb
68389d0125
Corrected the script used by the last step of the wf
2020-07-24 19:50:40 +03:00
antleb
ec52141f1a
changed refereed type from value to classname
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
63cd797aba
Comment out step 15 to make it work with the new schema of Claudio
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
138c6ddffa
Insert statement to datasource table that takes into account the piwik_id of the openAIRE graph
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
3630794cef
Fix to consider the relationships that have been 'virtually deleted' for project_results - defect #5607
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
5546f29e63
Corrections on the shadow schema and the impala table stats calculation
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
adf8a025d2
Adding more relations (Sources, Licences, Additional) and shadow schema as provided and discussed with Antonis Lempesis
2020-07-24 19:50:40 +03:00
Spyros Zoupanos
657a40536b
Corrections by Spyros: Script cleanup, corrections and re-arrangement
2020-07-24 19:50:40 +03:00
Giorgos Alexiou
477fa6234d
Script re-organisation and adding table invalidations needed for impala
2020-07-24 19:50:40 +03:00
Miriam Baglioni
6c2223d1fc
added code to get the openaire id for contexts
2020-07-24 17:30:15 +02:00
Miriam Baglioni
afd54c1684
removed not needed upload and refactoring
2020-07-24 17:28:56 +02:00
Miriam Baglioni
7b0569d989
changed to map also the result associated to the whole graph
2020-07-24 17:28:11 +02:00
Miriam Baglioni
082225ad61
-
2020-07-24 17:27:26 +02:00
Miriam Baglioni
968c59d97a
added the logic to dump also the products for the whole graph. They will miss collected from and context information that will be materialized as new relations
2020-07-24 17:25:19 +02:00
Miriam Baglioni
332258d199
split the classes related to the communities dump and to the whole graph dump
2020-07-24 17:21:48 +02:00
Claudio Atzori
56bbfdc65d
introduced parameter 'numPartitions', driving the hive DB table data partitioning. Currently specified only for table 'project'
2020-07-23 08:54:10 +02:00
Sandro La Bruzzo
9ab594ccf6
fixed test
2020-07-21 10:36:21 +02:00
Claudio Atzori
ebf60020ac
map results as OPRs in case of missing //CobjCategory/@type and the vocabulary dnet:result_typologies doesn't resolve the super type
2020-07-20 19:01:10 +02:00
Miriam Baglioni
355d7e426e
added dump for project - not finished
2020-07-20 18:54:43 +02:00
Miriam Baglioni
a2f01e5259
added getter and setter
2020-07-20 18:54:17 +02:00
Miriam Baglioni
40bbe94f7c
merge with master fork
2020-07-20 18:10:03 +02:00
Miriam Baglioni
2a15494b16
merge upstream
2020-07-20 18:05:01 +02:00
Miriam Baglioni
23160b4d29
realignment of the workflow classes with the changes in the structure of the module
2020-07-20 18:04:30 +02:00
Miriam Baglioni
b904e0699a
-
2020-07-20 18:02:53 +02:00
Miriam Baglioni
3aab7680f6
changed the test results
2020-07-20 18:00:43 +02:00
Miriam Baglioni
cde0300801
moved from projects to project
2020-07-20 17:57:35 +02:00
Miriam Baglioni
5076e4f320
changed test to comply with the modifications
2020-07-20 17:55:18 +02:00
Miriam Baglioni
08dbd99455
changed to dump the whole results graph by using classes already implemented for communities. Added class to dump also organization
2020-07-20 17:54:28 +02:00
Miriam Baglioni
e47ea9349c
extended some types by adding provenance as the couple (provenance, trust) and moved some classes to be used by the complete graph dump also
2020-07-20 17:46:27 +02:00
Claudio Atzori
32f5e466e3
imports cleanup
2020-07-20 17:42:58 +02:00
Claudio Atzori
54ac583923
code formatting
2020-07-20 17:37:08 +02:00
Claudio Atzori
124e7ce19c
in case of missing attribute //dr:CobjCategory/@type the resulttype is derived by looking up the vocabulary dnet:result_typologies with the 1st instance type available
2020-07-20 17:33:37 +02:00
Claudio Atzori
050dda223d
Merge pull request 'removed duplicated fields' ( #25 ) from unique_field_in_lists into master
...
Looks good as a temporary workaround. I agree the model could seamlessly make the distinct operation by using HashSets instead of Linked (or Array) Lists.
The task to update the model in such a way is added on #9#issuecomment-1583
Thanks!
2020-07-20 12:12:50 +02:00
Claudio Atzori
e0c4cf6f7b
added parameter to drive the graph merge strategy: priority (BETA|PROD)
2020-07-20 10:48:01 +02:00
Claudio Atzori
94ccdb4852
Merge branch 'master' into merge_graph
2020-07-20 10:14:55 +02:00
Claudio Atzori
0937c9998f
Merge branch 'deduptesting'
2020-07-20 10:00:20 +02:00
Claudio Atzori
de72b1c859
cleanup
2020-07-20 09:59:11 +02:00
Michele Artini
331a3cbdd0
fixed originalId
2020-07-20 09:50:29 +02:00
Michele Artini
c59c5369b1
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-18 09:40:54 +02:00
Michele Artini
346a1d2b5a
update eventId generator
2020-07-18 09:40:36 +02:00
Sandro La Bruzzo
9116d75b3e
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-17 18:01:30 +02:00
Miriam Baglioni
d7d84c8217
-
2020-07-17 14:03:23 +02:00
Miriam Baglioni
47c7122773
changed priority from beta to production
2020-07-17 12:56:35 +02:00
Michele Artini
442f30930c
removed duplicated fields
2020-07-17 12:25:36 +02:00
Michele Artini
3adedd0a68
trust truncated to 3 decimals
2020-07-17 11:58:11 +02:00
Claudio Atzori
1781609508
code formatting
2020-07-16 19:06:56 +02:00
Claudio Atzori
db8b90a156
renamed CORE -> BETA
2020-07-16 19:05:13 +02:00
Miriam Baglioni
44e1c40c42
merge upstream
2020-07-16 18:49:38 +02:00
Claudio Atzori
878f2b931c
Merge branch 'master' into merge_graph
2020-07-16 16:34:24 +02:00
Claudio Atzori
cc5d13da85
introduced parameter shouldIndex (true|false)
2020-07-16 13:46:39 +02:00
Claudio Atzori
b098cc3cbe
avoid repeating identical values for fields: source, description
2020-07-16 13:45:53 +02:00
Claudio Atzori
805de4eca1
fix: filter the blocks with size = 1
2020-07-16 10:11:32 +02:00
Claudio Atzori
4b9fb2ffb8
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-07-15 11:26:04 +02:00
Claudio Atzori
b90389bac4
code formatting
2020-07-15 11:24:48 +02:00
Claudio Atzori
4e6f46e8fa
filter blocks with one record only
2020-07-15 11:22:20 +02:00
Michele Artini
262c29463e
relations with multiple datasources
2020-07-15 09:18:40 +02:00
Claudio Atzori
7d6e269b40
reverted CreateRelatedEntitiesJob_phase1 to its previous state
2020-07-13 22:54:04 +02:00
Claudio Atzori
8e97598eb4
avoid NPE in case of null instances
2020-07-13 20:46:14 +02:00
Claudio Atzori
06def0c0cb
SparkBlockStats allows to repartition the input rdd via the numPartitions workflow parameter
2020-07-13 20:09:06 +02:00
miconis
b52c246aed
merge done
2020-07-13 19:57:02 +02:00
miconis
b8a45041fd
minor changes
2020-07-13 19:53:18 +02:00
Claudio Atzori
66f9f6d323
adjusted parameters for the dedup stats workflow
2020-07-13 19:26:46 +02:00
miconis
03ecfa5ebd
implementation of the test class for the new block stats spark action
2020-07-13 18:48:23 +02:00
miconis
10e08ccf45
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-13 18:22:45 +02:00
miconis
9258e4f095
implementation of a new workflow to compute statistics on the blocks
2020-07-13 18:22:34 +02:00
Claudio Atzori
c6f6fb0f28
code formatting
2020-07-13 16:46:13 +02:00
Claudio Atzori
8d2102d7d2
Merge branch 'deduptesting'
2020-07-13 16:32:43 +02:00
Claudio Atzori
344a90c2e6
updated assertions in propagateRelationTest
2020-07-13 16:32:04 +02:00
Claudio Atzori
1143f426aa
WIP SparkCreateMergeRels distinct relations
2020-07-13 16:13:36 +02:00
Claudio Atzori
8c67938ad0
configurable number of partitions used in the SparkCreateSimRels phase
2020-07-13 16:07:07 +02:00
Claudio Atzori
c73168b18e
Merge branch 'deduptesting' of https://code-repo.d4science.org/D-Net/dnet-hadoop into deduptesting
2020-07-13 15:54:58 +02:00
Claudio Atzori
c8284bab06
WIP SparkCreateMergeRels distinct relations
2020-07-13 15:54:51 +02:00
Sandro La Bruzzo
1d133b7fe6
update test
2020-07-13 15:52:41 +02:00
Michele Artini
3635d05061
poms
2020-07-13 15:52:23 +02:00
Claudio Atzori
7dd91edf43
parsing of optional parameter
2020-07-13 15:40:41 +02:00
Claudio Atzori
4c101a9d66
WIP SparkCreateMergeRels distinct relations
2020-07-13 15:31:38 +02:00
Claudio Atzori
8a612d861a
WIP SparkCreateMergeRels distinct relations
2020-07-13 15:30:57 +02:00
Sandro La Bruzzo
9ef2385022
implemented test for cut of connected component
2020-07-13 15:28:17 +02:00
Sandro La Bruzzo
d561b2dd21
implemented cut of connected component
2020-07-13 14:18:42 +02:00
Miriam Baglioni
8e0e090d7a
merge upstream
2020-07-13 12:46:55 +02:00
Claudio Atzori
e2093e42db
Merge branch 'master' into deduptesting
2020-07-13 10:57:49 +02:00
Michele Artini
2c4ed9a043
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-13 10:55:39 +02:00
Michele Artini
ccbe5c5658
fixed import of eu.dnetlib.dhp:dnet-openaire-broker-common
2020-07-13 10:55:27 +02:00
Claudio Atzori
7a3fd9f54c
dedup relation aggregator moved into dedicated class
2020-07-13 10:11:36 +02:00
Alessia Bardi
7e96105947
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-07-12 19:29:12 +02:00
Alessia Bardi
b7a39731a6
assert, not print
2020-07-12 19:28:56 +02:00
Miriam Baglioni
f9ad6f3255
Merge branch 'dump' of code-repo.d4science.org:miriam.baglioni/dnet-hadoop into dump
2020-07-10 19:42:53 +02:00
Miriam Baglioni
c27f12d6e8
avoid considering the _SUCCESS file
2020-07-10 19:42:23 +02:00
Claudio Atzori
770adc26e9
WIP aggregator to make relationships unique
2020-07-10 19:35:10 +02:00
Claudio Atzori
ecf119f37a
Merge branch 'master' into deduptesting
2020-07-10 19:04:16 +02:00
Claudio Atzori
31071e363f
Merge branch 'provision_indexing'
2020-07-10 19:03:57 +02:00
Claudio Atzori
06c1913062
added different limits for grouping by source and by target, incremented spark.sql.shuffle.partitions for the join operations
2020-07-10 19:03:33 +02:00
Claudio Atzori
cc77446dc4
added dbSchema parameter to the raw_db workflow
2020-07-10 19:01:50 +02:00
Claudio Atzori
4c3836f62e
materialize the related entities before joining them
2020-07-10 19:00:44 +02:00
Michele Artini
e1ae964bc4
stats
2020-07-10 16:12:08 +02:00
Claudio Atzori
752d28f8eb
make the relations produced by the dedup SparkPropagateRelation job unique
2020-07-10 15:09:50 +02:00
Sandro La Bruzzo
c01efed79b
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-10 14:44:57 +02:00
Sandro La Bruzzo
a7d3977481
added generation of EBI Dataset
2020-07-10 14:44:50 +02:00
Claudio Atzori
b21866a2da
allow setting different relation cut points by source and by target; adjusted weight assigned to relationship types
2020-07-10 13:59:48 +02:00
Claudio Atzori
ff4d6214f1
experimenting with pruning of relations
2020-07-10 10:06:41 +02:00
Miriam Baglioni
faea30cda0
-
2020-07-09 14:05:21 +02:00
Michele Artini
2d742a84ae
DedupConfig as json file
2020-07-09 12:53:46 +02:00
Miriam Baglioni
a634794242
merge upstream
2020-07-09 11:46:51 +02:00
Michele Artini
a44b9b36b9
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-09 11:02:31 +02:00
Michele Artini
1c6a171633
updated pom
2020-07-09 11:02:09 +02:00
Claudio Atzori
3c728aaa0c
trying to overcome OOM errors during duplicate scan phase
2020-07-08 22:39:51 +02:00
Claudio Atzori
18c555cd79
Merge branch 'master' into deduptesting
2020-07-08 22:32:01 +02:00
Claudio Atzori
4365cf41d7
trying to overcome OOM errors during duplicate scan phase
2020-07-08 22:31:46 +02:00
Claudio Atzori
67e1d222b6
bulk cleaning when found null or empty, sets bestaccessrights evaluating the result instances
2020-07-08 17:53:35 +02:00
Alessia Bardi
853e8d7987
test for software merge
2020-07-08 17:03:53 +02:00
Claudio Atzori
610d377d57
first implementation of the BETA & PROD graphs merge procedure
2020-07-08 16:54:26 +02:00
Alessia Bardi
9a898c0e4c
Json schema generator
2020-07-08 12:52:00 +02:00
Alessia Bardi
636f9ce7d6
json schema generator lib
2020-07-08 12:50:57 +02:00
Alessia Bardi
8f83b726fa
Dump json schema compliant to json schema Draft 7
2020-07-08 12:48:46 +02:00
Claudio Atzori
e2ea30f89d
updated graph construction workflow definition: cleaning wf moved at the bottom to include cleaning of the information produced by the enrichment workflows
2020-07-08 12:16:24 +02:00
Miriam Baglioni
1b0b968548
fixed issue on substring
2020-07-08 12:11:51 +02:00
Miriam Baglioni
7fe00cb4fb
-
2020-07-08 10:29:37 +02:00
Miriam Baglioni
375ef07d7b
changed the description for the upload
2020-07-07 18:41:27 +02:00
Miriam Baglioni
35c8265793
added the json extension to filename
2020-07-07 18:29:49 +02:00
Miriam Baglioni
81434f8e5e
added method newInstance
2020-07-07 18:26:10 +02:00
Miriam Baglioni
817cddfc52
-
2020-07-07 18:25:12 +02:00
Miriam Baglioni
a66aa9bd83
removed useless tests
2020-07-07 18:25:00 +02:00
Miriam Baglioni
9b20a21b24
removed useless tests
2020-07-07 18:23:37 +02:00
Miriam Baglioni
8a1b42ff21
added check to verify that dump contains at least one product
2020-07-07 18:21:35 +02:00
Miriam Baglioni
d86adb82a7
-
2020-07-07 18:20:51 +02:00
Miriam Baglioni
b2782025f6
enabled the whole workflow to run. Added property to give priority to dependency in the classpath - to solve conflicts
2020-07-07 18:10:47 +02:00
Miriam Baglioni
83d2c84b77
added constraints to xquery so as to get only profiles with status manager or all
2020-07-07 18:09:48 +02:00
Miriam Baglioni
4c8d86493c
-
2020-07-07 18:09:06 +02:00
Miriam Baglioni
0208bc18f3
added new resource for testing
2020-07-07 17:47:24 +02:00
Miriam Baglioni
f5bb65c9ef
the json schema for the dump of the results
2020-07-07 17:34:40 +02:00
Michele Artini
dffa0b01a2
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-07 15:37:29 +02:00
Michele Artini
efadbdb2bc
fixed a bug with duplicated events
2020-07-07 15:37:13 +02:00
Claudio Atzori
8af8e7481a
code formatting
2020-07-07 14:23:34 +02:00
Claudio Atzori
b383ed42fa
pass optional parameter relationFilter to the PrepareRelationJob implementation
2020-07-07 14:21:28 +02:00
Claudio Atzori
911894a987
Merge branch 'deduptesting'
2020-07-07 14:20:43 +02:00
Miriam Baglioni
c19818a3f8
merge branch with fork master
2020-07-06 13:58:23 +02:00
Miriam Baglioni
d22240c0ba
merge upstream
2020-07-06 13:58:02 +02:00
Michele Artini
edf6c6c4dc
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-03 11:48:24 +02:00
Michele Artini
04bebb708c
some fixes
2020-07-03 11:48:12 +02:00
Claudio Atzori
c3d67f709a
adjusted dedup configuration for result entities: using new wordssuffixprefix clustering function, removed ngrampairs, adjusted queueMaxSize (800) and slidingWindowSize (80)
2020-07-02 17:35:22 +02:00
Miriam Baglioni
f8bf4acd76
-
2020-07-02 16:03:11 +02:00
Miriam Baglioni
e6c79d44e6
-
2020-07-02 16:02:02 +02:00
Miriam Baglioni
d7f6f0c216
changed code to use other lib
2020-07-02 16:01:34 +02:00
Miriam Baglioni
8fdc9e070c
added dependency to OkHttp
2020-07-02 16:01:08 +02:00
Miriam Baglioni
94500a581b
merge branch with fork master
2020-07-02 14:25:39 +02:00
Miriam Baglioni
c133a23cf0
merge upstream
2020-07-02 14:24:57 +02:00
Claudio Atzori
1d39f7901c
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-07-02 12:45:01 +02:00
Claudio Atzori
0f77cac4b5
fix: deduper must use queueMaxSize instead of groupMaxSize for the block definition
2020-07-02 12:43:51 +02:00
Sandro La Bruzzo
18b9330312
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-02 12:43:19 +02:00
Michele Artini
b413db0bff
white/blacklists
2020-07-02 12:43:03 +02:00
Claudio Atzori
d380b85246
unit test for the preparation of the relations
2020-07-02 12:42:13 +02:00
Claudio Atzori
ed1c7e5d75
fixed workflow for the import of the claims alone
2020-07-02 12:40:21 +02:00
Sandro La Bruzzo
07f0723fa7
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-02 12:37:49 +02:00
Sandro La Bruzzo
1d420eedb4
added generation of EBI Dataset
2020-07-02 12:37:43 +02:00
Claudio Atzori
e4a29a4513
fixed workflow for the import of the claims alone
2020-07-02 12:36:33 +02:00
Michele Artini
3bcdfbabe9
list with limits
2020-07-01 08:42:39 +02:00
Michele Artini
59a5421c24
indexing, accumulators, limited lists
2020-06-30 16:17:09 +02:00
Michele Artini
6f13673464
accumulators
2020-06-29 16:33:32 +02:00
Sandro La Bruzzo
dab783b173
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-29 09:05:00 +02:00
Michele Artini
a6ea432435
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-29 08:44:20 +02:00
Michele Artini
35ae381d28
all events matchers
2020-06-29 08:43:56 +02:00
Claudio Atzori
7817338e05
added test to verify the relation pre-processing
2020-06-26 17:58:33 +02:00
Claudio Atzori
8d59fdf34e
WIP: dataset based PrepareRelationsJob
2020-06-26 14:32:58 +02:00
Michele Artini
2393d9da2f
limits
2020-06-26 11:20:45 +02:00
Sandro La Bruzzo
96ce124b59
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-25 17:00:43 +02:00
Miriam Baglioni
4a7de07ea2
refactoring
2020-06-25 16:32:40 +02:00
Miriam Baglioni
54a12978d3
fixed issue in xquery
2020-06-25 16:30:20 +02:00
Michele Artini
408165a756
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-25 15:53:35 +02:00
Michele Artini
e8fb305f18
compilation of event map
2020-06-25 15:53:20 +02:00
Michele Artini
4eb3e109d7
compilation of event map
2020-06-25 15:45:50 +02:00
Claudio Atzori
d839e88783
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-06-25 14:06:30 +02:00
Claudio Atzori
6f5771c1c9
sets author.rank when null
2020-06-25 14:06:21 +02:00
Michele Artini
e28033c6d8
some fixes
2020-06-25 13:01:09 +02:00
Claudio Atzori
216975c4ec
restored complete provision workflow
2020-06-25 12:55:52 +02:00
Claudio Atzori
2d77d3a388
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-06-25 12:54:30 +02:00
Claudio Atzori
93f627ea51
code formatting
2020-06-25 12:54:21 +02:00
Miriam Baglioni
05a99cfb61
change the position of value and description elements in the workflow definition
2020-06-25 12:36:08 +02:00
Claudio Atzori
7df2712824
Merge branch 'provision_indexing'
2020-06-25 12:22:41 +02:00
Claudio Atzori
e62333192c
WIP: prepare relation job
2020-06-25 12:22:18 +02:00
Claudio Atzori
6933ec11fb
WIP: prepare relation job
2020-06-25 11:04:12 +02:00
Sandro La Bruzzo
a6c0faac70
added test to verify secondary sorting
2020-06-25 10:48:15 +02:00
Claudio Atzori
69b0391708
WIP: prepare relation job
2020-06-25 10:19:56 +02:00
Michele Artini
abcbebcbb4
fixed generation of ids
2020-06-25 09:50:46 +02:00
Michele Artini
77d2a1b1c4
params to choose sql queries for beta or production
2020-06-25 09:28:13 +02:00
Claudio Atzori
46e76affeb
WIP: prepare relation job
2020-06-24 19:01:15 +02:00
Claudio Atzori
0e723d378b
added default from vocab for missing instance.refereed; remove spurious prefixes from orcid values; WIP: prepare relation job
2020-06-24 18:34:42 +02:00
Michele Artini
202f6e62ff
Split join wf
2020-06-24 15:47:06 +02:00
Sandro La Bruzzo
96689a8994
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-24 14:06:50 +02:00
Sandro La Bruzzo
46631a4421
updated mapping scholexplorer to OAF
2020-06-24 14:06:38 +02:00
Michele Artini
e53dd62e87
minor changes
2020-06-24 09:24:45 +02:00
Michele Artini
8b9933b934
refactoring aggregators
2020-06-24 08:57:13 +02:00
Miriam Baglioni
3e5570de7a
-
2020-06-23 15:44:54 +02:00
Michele Artini
d13e3d3f68
fixed paths
2020-06-23 11:01:42 +02:00
Michele Artini
8386c6f90d
filter of valid resultResult relations
2020-06-23 10:24:15 +02:00
Michele Artini
38bb45d0b6
test osf:refereed
2020-06-23 10:14:39 +02:00
Michele Artini
c3286f4c37
fixed relType
2020-06-23 09:32:32 +02:00
Miriam Baglioni
507f7a94a8
added one of the main zenodo communities to the tagging conf for testing purposes
2020-06-23 08:45:27 +02:00
Michele Artini
af2f7705fc
partial refactoring of some joins
2020-06-23 08:37:35 +02:00
Miriam Baglioni
af1d40351b
changed XQuery to add also the main Zenodo community among the communities associated to the openaire community
2020-06-22 19:20:54 +02:00
Miriam Baglioni
e4b21be004
-
2020-06-22 17:31:50 +02:00
Miriam Baglioni
afa19b0c84
changed the way to PUT the files to the rest API
2020-06-22 17:20:07 +02:00
Miriam Baglioni
250fd1c854
merge branch with fork master
2020-06-22 16:25:48 +02:00
Claudio Atzori
8a3bc7c183
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-06-22 14:12:33 +02:00
Claudio Atzori
e162ba5075
added dnet workflows to orchestrate the execution of graph2hive, updateSolr and updateStats oozie wfs
2020-06-22 14:12:28 +02:00
Michele Artini
3ce20c198e
reformatting
2020-06-22 12:14:25 +02:00
Michele Artini
ed787398b3
refactoring wf
2020-06-22 11:45:14 +02:00
Claudio Atzori
9cd27183b6
[maven-release-plugin] prepare for next development iteration
2020-06-22 11:27:44 +02:00
Claudio Atzori
1e3dab0631
[maven-release-plugin] prepare release dhp-1.2.3
2020-06-22 11:27:39 +02:00
Miriam Baglioni
df80ae5c1b
merge branch with fork master
2020-06-22 10:51:23 +02:00
Miriam Baglioni
e8f914f8b3
-
2020-06-22 10:50:41 +02:00
Miriam Baglioni
edeb862476
excluded dependency in module that generates conflict
2020-06-22 10:49:56 +02:00
Miriam Baglioni
185facb8e5
replaced the deprecated DefaultHttpClient with the CloseableHttpClient
2020-06-22 10:49:10 +02:00
Claudio Atzori
961a0d0b49
[actionset promotion] log debugging info in case of error in the action payload extraction or parsing the data
2020-06-22 10:20:45 +02:00
Claudio Atzori
5e8b922962
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-06-22 09:50:47 +02:00
Claudio Atzori
7d416f08d8
graph cleaning workflow: set hostedby to unknown repository when defined as NULL
2020-06-22 09:50:43 +02:00
Michele Artini
16c7a18435
refactoring
2020-06-22 08:51:31 +02:00
Miriam Baglioni
669a509430
-
2020-06-19 17:39:46 +02:00
Michele Artini
f9fc64ffaf
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-19 15:24:43 +02:00
Michele Artini
d88fe0ac84
join methods
2020-06-19 15:24:30 +02:00
Sandro La Bruzzo
464eeeec87
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-19 15:11:53 +02:00
Sandro La Bruzzo
1681de672d
updated mapping scholexplorer to OAF
2020-06-19 15:11:46 +02:00
Michele Artini
4822747313
some fixes
2020-06-19 13:53:56 +02:00
Michele Artini
834f139e6e
fixed some NPE
2020-06-19 12:33:29 +02:00
Claudio Atzori
d0ac7514b2
cleaning workflow to include cleaning of default values
2020-06-18 19:37:25 +02:00
Miriam Baglioni
44a12d244f
-
2020-06-18 18:38:54 +02:00
Michele Artini
52f62d5d8c
events
2020-06-18 14:49:13 +02:00
Miriam Baglioni
fb80353018
-
2020-06-18 14:21:36 +02:00
Michele Artini
61634fbfe0
removed kryo encoding
2020-06-18 14:09:58 +02:00
Michele Artini
8d2b199dd2
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-18 13:15:34 +02:00
Michele Artini
e659b02e6b
some wf fixing
2020-06-18 13:15:13 +02:00
Michele Artini
9a847b4557
some wf fixing
2020-06-18 13:14:10 +02:00
Miriam Baglioni
65bf312360
merge branch with fork master
2020-06-18 11:35:27 +02:00
Miriam Baglioni
3953f56bd3
added dependency to pom
2020-06-18 11:34:47 +02:00
Miriam Baglioni
a118b66858
-
2020-06-18 11:34:30 +02:00
Miriam Baglioni
f9578312b5
-
2020-06-18 11:34:15 +02:00
Miriam Baglioni
8b145e6aba
-
2020-06-18 11:25:28 +02:00
Miriam Baglioni
e8b3e972f2
changed the input params and the workflow definition to tackle the Result as all result product produced
2020-06-18 11:25:05 +02:00
Miriam Baglioni
3233b01089
changes due to adding all the result types under Result
2020-06-18 11:22:58 +02:00
Miriam Baglioni
5c8533d1a1
changes in the testing classes
2020-06-18 11:20:08 +02:00
Miriam Baglioni
bc8611a95a
added new resources for testing
2020-06-18 11:19:20 +02:00
Sandro La Bruzzo
9bf67f5de1
resolved conflicts
2020-06-17 09:15:43 +02:00
Sandro La Bruzzo
1d4275acc4
implemented first version of exportation of Scholexplorer into ActionSet
2020-06-17 09:10:38 +02:00
miconis
5233b15265
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-16 18:31:19 +02:00
miconis
11b77b9f4e
json dumps for entity merge test modified to fit the new model. title merge adjusted to fix the error
2020-06-16 18:31:11 +02:00
Claudio Atzori
64f02de5d3
updated workflow definition to include the cleaning step
2020-06-16 17:48:51 +02:00
Claudio Atzori
306669209f
code formatting
2020-06-16 16:54:44 +02:00
Claudio Atzori
1bc1d15eaf
stubbing for mock datasource.identities must be typed as array
2020-06-16 16:54:28 +02:00
Claudio Atzori
631fef12a7
Merge branch 'master' into dhp_oaf_model
2020-06-16 16:11:19 +02:00
Michele Artini
9e2c23e391
partial refactoring
2020-06-16 15:55:42 +02:00
Michele Artini
113c9b1de0
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-16 15:53:39 +02:00
Michele Artini
76ea7607f7
partial refactoring
2020-06-16 15:53:13 +02:00
Claudio Atzori
603b1bd0bb
Merge branch 'master' into dhp_oaf_model
2020-06-16 15:43:59 +02:00
Claudio Atzori
5441f01586
Merge pull request 'missing landingPage urls in instances' ( #22 ) from instances-with-landing-page into master
...
Looks good, thanks!
2020-06-16 15:32:44 +02:00
Claudio Atzori
89859111ee
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-06-16 15:28:29 +02:00
Claudio Atzori
4ec262db53
included externalreference(s) in the result view on the Hive graph DB
2020-06-16 15:28:20 +02:00
Michele Artini
8a4f84f8c0
refactoring
2020-06-16 12:34:13 +02:00
Claudio Atzori
2a4f65795f
WIP: graph cleaner implementation
2020-06-15 18:32:24 +02:00
Claudio Atzori
c15c8c0ad0
map datasource identities (including piwik ids) as original IDs
2020-06-15 16:07:30 +02:00
Miriam Baglioni
9dd3ef22c5
merge branch with fork master
2020-06-15 11:23:26 +02:00
Miriam Baglioni
68cf0fd03f
test input
2020-06-15 11:14:42 +02:00
Miriam Baglioni
0467145ae3
test for graph dump
2020-06-15 11:13:51 +02:00
Miriam Baglioni
e43eedb5b0
added resources and workflow for dump of community products
2020-06-15 11:13:21 +02:00
Miriam Baglioni
f96ca900e1
fixed issues while running on cluster
2020-06-15 11:12:14 +02:00
Miriam Baglioni
20b9e67728
added new class funder
2020-06-15 11:06:18 +02:00
Claudio Atzori
0d52816244
WIP: graph cleaner implementation
2020-06-13 13:06:04 +02:00
Claudio Atzori
bed65a1be6
WIP: graph cleaner implementation
2020-06-12 18:25:47 +02:00
Claudio Atzori
c4d9f1837f
[maven-release-plugin] prepare for next development iteration
2020-06-12 12:21:08 +02:00
Claudio Atzori
f0746a7605
[maven-release-plugin] prepare release dhp-1.2.2
2020-06-12 12:21:03 +02:00
Claudio Atzori
463489f59f
code formatting
2020-06-12 12:03:25 +02:00
Claudio Atzori
4bcad1c9c3
Merge branch 'graph_cleaning'
2020-06-12 11:40:25 +02:00
Claudio Atzori
cdb1956fe9
WIP: graph cleaner implementation
2020-06-12 11:36:59 +02:00
Alessia Bardi
b347499745
do not use deprecated subreltype
2020-06-12 10:58:02 +02:00
Claudio Atzori
97b1c4057c
WIP: graph cleaner implementation
2020-06-12 10:45:18 +02:00
Claudio Atzori
ba8a024af9
avoid NPEs merging titles
2020-06-12 10:45:11 +02:00
Michele Artini
30ea1bda88
oozie workflow
2020-06-12 10:42:35 +02:00
Michele Artini
c22cb5a3c6
refactoring
2020-06-12 09:47:55 +02:00
Michele Artini
472cf77639
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-06-11 14:30:47 +02:00
Michele Artini
c6b5bb3f17
orcid events
2020-06-11 14:30:24 +02:00
Michele Artini
c2e1b66e83
Revert "orcid events"
...
This reverts commit 48959e9a17.
2020-06-11 14:28:03 +02:00
Michele Artini
48959e9a17
orcid events
2020-06-11 14:24:02 +02:00
Miriam Baglioni
e145972962
-
2020-06-11 13:08:39 +02:00
Miriam Baglioni
a01800224c
-
2020-06-11 13:02:04 +02:00
Miriam Baglioni
356dd582a3
map construction moved in class
2020-06-11 12:59:22 +02:00
Alessia Bardi
e79943965b
Fixes #5604 : field oamandatepublications in XML
2020-06-11 12:49:31 +02:00
Michele Artini
a41e0cb648
missing landingPage urls in instances
2020-06-11 12:28:34 +02:00
Michele Artini
04fdcacd83
results with all joined entities
2020-06-11 11:25:18 +02:00
Michele Artini
99f88e1cb8
fixed generation entities from claims
2020-06-11 10:51:57 +02:00
Miriam Baglioni
db27663750
-
2020-06-11 10:49:01 +02:00
Miriam Baglioni
bb9f21d0e7
job test for class producing first step of results dump
2020-06-11 10:20:05 +02:00
Claudio Atzori
d1d92c4d8c
fixed integration of claims in the graph
2020-06-11 10:12:00 +02:00
Claudio Atzori
953da4a427
Merge branch 'master' into graph_cleaning
2020-06-10 21:36:56 +02:00
Claudio Atzori
f1bce64391
WIP: graph cleaner implementation
2020-06-10 21:36:31 +02:00
Claudio Atzori
67c7b31ba6
Merge branch 'master' into graph_cleaning
2020-06-10 15:00:35 +02:00
Claudio Atzori
3ebf81d2b0
Merge pull request 'oaf-store-interpretation' ( #21 ) from oaf-store-interpretation into master
...
Looks good, thanks Michele!
2020-06-10 14:58:09 +02:00
Michele Artini
5869cb76b3
reformatting
2020-06-10 12:11:16 +02:00
Michele Artini
c08e66e01e
fixed a workflow parameter
2020-06-10 10:11:56 +02:00
Michele Artini
7177a32d75
import of invisible stores
2020-06-10 10:04:00 +02:00
Claudio Atzori
ce12f236bb
disabled test, need to update the joined_entity.json file
2020-06-09 20:07:36 +02:00
Claudio Atzori
a2fdf85ba1
WIP: graph cleaner implementation
2020-06-09 19:52:53 +02:00
Alessia Bardi
4551c1082f
mapping csv for orcid
2020-06-09 18:08:47 +02:00
Alessia Bardi
2d3f7d1eb4
fixed log classes to make the ORCID test run
2020-06-09 18:07:14 +02:00
Alessia Bardi
a3a6755d58
mapping csv for Unpaywall
2020-06-09 17:45:44 +02:00
Claudio Atzori
d9f33582c5
WIP: graph cleaner implementation
2020-06-09 17:20:40 +02:00
Alessia Bardi
f3b033cf09
added csv line for funders from Crossref
2020-06-09 17:08:26 +02:00
Alessia Bardi
79969d78b9
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-06-09 17:05:39 +02:00
Alessia Bardi
fc4d220964
updated function name for SNSF
2020-06-09 17:05:31 +02:00
Michele Artini
baaa55f4a3
use of pace to calculate trusts
2020-06-09 16:01:31 +02:00
Alessia Bardi
33b130ec43
Mapping instructions for MAG
2020-06-09 15:57:15 +02:00
Miriam Baglioni
206abba48c
merge branch with fork master
2020-06-09 15:41:14 +02:00
Miriam Baglioni
a089db18f1
workflow and parameters to execute the dump
2020-06-09 15:39:38 +02:00
Miriam Baglioni
6bbe27587f
new classes to execute the dump for products associated to community, enrich each result with project information and assign the result to each community it belongs to
2020-06-09 15:39:03 +02:00
Miriam Baglioni
5121cbaf6a
new classes for external dump. Only classes functional to dump products
2020-06-09 15:37:46 +02:00
Alessia Bardi
d6de406e11
fixed classid for subjects
2020-06-09 14:43:34 +02:00
Alessia Bardi
f072125152
map volume and issue in journal information from MAG
2020-06-09 14:32:10 +02:00
Alessia Bardi
b7cb1163ea
identifiers always start with 50
2020-06-09 10:39:11 +02:00
Alessia Bardi
181f52b9bc
Added mapping table for Crossref
2020-06-08 19:33:47 +02:00
Alessia Bardi
9fd25887f7
Result identifiers all start with 50|
2020-06-08 19:32:24 +02:00
Alessia Bardi
16cb073b15
set the instance dateofacceptance with the Crossref createdDate in case the issuedDate is blank
2020-06-08 19:06:03 +02:00
Michele Artini
bb659d870c
join simrels
2020-06-08 16:29:01 +02:00
Michele Artini
81e85465d8
join simrels
2020-06-08 16:26:16 +02:00
Claudio Atzori
3d871c6651
Merge branch 'master' into graph_cleaning
2020-06-08 15:23:24 +02:00
Claudio Atzori
25a093b1a4
integrated changes from master
2020-06-08 15:04:00 +02:00
Sandro La Bruzzo
e34e7d6728
merge DOIBoost
2020-06-08 08:32:22 +02:00
Sandro La Bruzzo
e46e2a4776
Merge remote-tracking branch 'origin/master' into doiboost
2020-06-08 08:17:14 +02:00
Spyros Zoupanos
3576dd186b
Adding hive timeout as workflow parameter
2020-06-05 22:29:54 +03:00
Claudio Atzori
b2349659cf
WIP: graph property fixing implementation
2020-06-05 18:37:38 +02:00
Michele Artini
a73973a74b
partial implementation of broker events generation
2020-06-05 11:43:00 +02:00
Michele Artini
7e82996e7c
partial implementation of broker events generation
2020-06-04 17:10:43 +02:00
Sandro La Bruzzo
b57e8ba374
Merge remote-tracking branch 'origin/master' into doiboost
2020-06-04 14:39:41 +02:00
Sandro La Bruzzo
7ac1ba2e35
improvement DOIBoost
2020-06-04 14:39:20 +02:00
Michele Artini
97177d7f7b
partial refactoring
2020-06-04 10:26:34 +02:00
Sandro La Bruzzo
13815d5d13
improvement DOIBoost
2020-06-01 17:52:12 +02:00
Claudio Atzori
05f269a1c0
kryo based parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob
2020-06-01 00:32:42 +02:00
Claudio Atzori
5e23fb3a74
code formatting
2020-05-30 10:52:56 +02:00
Claudio Atzori
54ca8ed6c3
uniformed param name (isLookupUrl), Vocab model classes defined as Serializable
2020-05-29 18:17:30 +02:00
Claudio Atzori
1577bd5b8b
added IsLookupUrl to the raw_db workflow parameters
2020-05-29 16:18:16 +02:00
Claudio Atzori
91d78b825b
Merge pull request 'import from db using is vocabularies' ( #17 ) from result_pids into master
...
Looks good, thanks Michele!
2020-05-29 16:02:40 +02:00
Michele Artini
adb798faa5
import from db using is vocabularies
2020-05-29 12:03:51 +02:00
Claudio Atzori
6f5f498c78
restored common properties driving executor-cores and executor-memory in join_organization_relations wf node
2020-05-29 11:22:00 +02:00
Claudio Atzori
b2f9564f13
WIP: fixed PrepareRelationsJob; parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob
2020-05-29 10:58:15 +02:00
Miriam Baglioni
dfa4997a4f
removed commented code
2020-05-29 10:45:18 +02:00
Miriam Baglioni
6f1eea28b6
changed message in log
2020-05-29 10:41:39 +02:00
Sandro La Bruzzo
b87b3ddb6b
changed mapping ORCIDToOAF
2020-05-29 09:32:04 +02:00
Miriam Baglioni
8b6e886fb6
added new resource for testing
2020-05-28 23:54:31 +02:00
Miriam Baglioni
6989fb9c8a
changed the project test according to the newly introduced join with the db project codes
2020-05-28 23:53:24 +02:00
Miriam Baglioni
782984d8e5
added needed parameter
2020-05-28 23:52:41 +02:00
Miriam Baglioni
01f7876595
fix issue with flatMap - the return type must not be null
2020-05-28 23:50:32 +02:00
Claudio Atzori
a57965a3ea
limiting the dimensions of outliers
2020-05-28 17:36:37 +02:00
Miriam Baglioni
773735f870
added the path to the file containing the projects code from the db
2020-05-28 17:30:45 +02:00
Miriam Baglioni
6a15067a64
added one step in the workflow
2020-05-28 17:30:09 +02:00
Miriam Baglioni
5309a99a70
modified the PrepareProjects to consider those in the db
2020-05-28 17:29:53 +02:00
Miriam Baglioni
b737ed8236
added part to read projects from the openaire db to filter out those in the csv file that are not in the db
2020-05-28 17:29:21 +02:00
Claudio Atzori
821be1f8b6
experimental implementation of custom aggregation using kryo encoders
2020-05-28 13:53:13 +02:00
Claudio Atzori
83504ecace
limiting the maximum number of authors allowed in XML records to MAX_AUTHORS = 200; authors with ORCID can exceed that limit
2020-05-28 13:52:30 +02:00
Claudio Atzori
ef11593068
JoinedEntity.links defined as empty list by default
2020-05-28 13:50:44 +02:00
Claudio Atzori
5dea155a87
increased number of partitions produced by the join_all_entities phase as well as spark.sql.shuffle.partitions in adjancency_lists phase
2020-05-28 13:49:59 +02:00
Miriam Baglioni
35b7279147
changed test because data are saved as SequenceFile now, and because of the group by the number of produced updates decreases
2020-05-28 10:26:12 +02:00
Miriam Baglioni
37c155b86a
merge branch with fork master
2020-05-28 10:09:51 +02:00
Miriam Baglioni
df44db686a
refactoring
2020-05-28 10:07:00 +02:00
Miriam Baglioni
87b07f4af8
removed unused variables
2020-05-28 10:05:43 +02:00
Miriam Baglioni
1060977272
added fs actions to remove and then create the workingDir
2020-05-28 10:04:36 +02:00
Miriam Baglioni
96d1a3c431
deleted the file where to store the csv files
2020-05-28 10:04:10 +02:00
Miriam Baglioni
669c05c771
added groupBy before creating Actions
2020-05-28 10:00:45 +02:00
Sandro La Bruzzo
02f90eeb07
Merge remote-tracking branch 'origin/master' into doiboost
2020-05-28 09:58:32 +02:00
Sandro La Bruzzo
7d29b61c62
code refactor
2020-05-28 09:57:46 +02:00
Claudio Atzori
fdd54bad1c
code formatting
2020-05-27 19:31:54 +02:00
Miriam Baglioni
1855453434
changed the outputdir of the last step
2020-05-27 17:59:36 +02:00
Claudio Atzori
b9b1bc9967
Merge branch 'master' into provision_indexing
2020-05-27 12:55:20 +02:00
Claudio Atzori
aac1515b58
Merge pull request 'result_pids without conflicts ???' ( #16 ) from result_pids into master
...
Looks good, thanks Michele
2020-05-27 12:54:52 +02:00
Michele Artini
f5ce7d76e1
resolve conflicts
2020-05-27 12:49:17 +02:00
Claudio Atzori
cfd753217c
repartition the join_entities in 24k files
2020-05-27 12:44:01 +02:00
Claudio Atzori
2f1a623d09
sync from master branch
2020-05-27 12:39:58 +02:00
Claudio Atzori
9e4ec1543b
updated test
2020-05-27 12:38:42 +02:00
Claudio Atzori
8047d16dd9
added RDD based adjacency list creation procedure
2020-05-27 12:38:12 +02:00
Claudio Atzori
f057dcdf65
limit the max number of externalreferences to MAX_EXTERNAL_ENTITIES
2020-05-27 12:37:33 +02:00
Michele Artini
b81f2741d2
xquery
2020-05-27 12:10:20 +02:00
Michele Artini
a25598140a
result pids (new xpaths + IS vocabularies)
2020-05-27 12:10:20 +02:00
Michele Artini
7a7272d9ec
result pids (new xpaths + IS vocabularies)
2020-05-27 12:10:20 +02:00
Michele Artini
3ceb2d2853
match terms with vocabularies
2020-05-27 11:34:13 +02:00
Claudio Atzori
4e36d689dd
fixed XML serialization for children sub-elements (duplicates & externalreferences)
2020-05-26 18:30:40 +02:00
Miriam Baglioni
92e3a52e91
merge branch with fork master
2020-05-26 15:57:51 +02:00
Michele Artini
c15d997925
xquery
2020-05-26 13:13:17 +02:00
Michele Artini
c6af36496a
result pids (new xpaths + IS vocabularies)
2020-05-26 13:11:09 +02:00
Michele Artini
093f1aff03
result pids (new xpaths + IS vocabularies)
2020-05-26 13:06:55 +02:00
Claudio Atzori
b8e541a454
fixing repeated organization.websiteurl in organization entities ( #5645 ) as well as project.ecinternationalorganizationeurinterests
2020-05-26 10:30:09 +02:00
Claudio Atzori
55595d7235
HACK: patch NULL values with defaults found in result.datainfo.deletedbyinference and result.context
2020-05-26 10:28:35 +02:00
Claudio Atzori
7b288a94cb
code formatting
2020-05-26 09:54:13 +02:00
Miriam Baglioni
54d869e618
merge upstream
2020-05-26 09:22:04 +02:00
Miriam Baglioni
eea07f4c42
refactoring
2020-05-26 09:21:49 +02:00
Sandro La Bruzzo
79c26382da
Merge remote-tracking branch 'origin/master' into doiboost
2020-05-26 09:15:50 +02:00
Sandro La Bruzzo
25f52e19a4
implemented generation of ActionSet
2020-05-26 09:15:33 +02:00
Michele Artini
d6aada4957
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-05-26 08:44:31 +02:00
Michele Artini
b1546605e3
updated version of a dependency
2020-05-26 08:44:15 +02:00
Claudio Atzori
7582532e73
[maven-release-plugin] prepare for next development iteration
2020-05-25 19:48:18 +02:00
Claudio Atzori
01c2e93395
[maven-release-plugin] prepare release dhp-1.2.1
2020-05-25 19:48:14 +02:00
miconis
da1e5cf557
implementation of the result title merge. main title with higher trust, distinct between the others
2020-05-25 18:02:57 +02:00
Miriam Baglioni
d3d36647d2
merge upstream
2020-05-25 10:38:22 +02:00
Miriam Baglioni
74215f6d9f
refactoring
2020-05-25 10:38:16 +02:00
Miriam Baglioni
dbde2d243a
changed due to move of PacePerson from dhp-graph-mapper to dhp-common
2020-05-25 10:35:39 +02:00
Miriam Baglioni
f754c424bd
changed logic to compute PacePerson only once for each Author to be enriched
2020-05-25 10:35:02 +02:00
Miriam Baglioni
8f51af4e9b
added PacePerson to get name surname for authors having only fullname set
2020-05-25 10:34:30 +02:00
Miriam Baglioni
b258f99ece
fix for issue that duplicated result
2020-05-25 10:26:48 +02:00
Miriam Baglioni
8f6ce970f9
moved PacePerson to dhp-common to avoid conflict in dependency with graph-mapper
2020-05-25 10:25:55 +02:00
Claudio Atzori
de108f54d6
code formatting
2020-05-23 10:21:19 +02:00
Claudio Atzori
6b56cae57d
added mapping for bestaccessrights
2020-05-23 09:57:39 +02:00
Claudio Atzori
7181807e64
code formatting
2020-05-23 09:51:48 +02:00
Sandro La Bruzzo
2408083566
implemented filtering step
2020-05-23 08:46:49 +02:00
Sandro La Bruzzo
244f6e50cf
Merge remote-tracking branch 'origin/master' into doiboost
2020-05-22 20:52:15 +02:00
Sandro La Bruzzo
147dd389bf
minor fix
2020-05-22 20:51:42 +02:00
Miriam Baglioni
0d1ec1913f
added fix to avoid duplication of results
2020-05-22 18:42:25 +02:00
miconis
5d7ac78c41
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-05-22 17:25:08 +02:00
miconis
0fd0c7d725
reimplementation of the sim between two authors. now it takes into account both name and surname. threshold incremented to 1.0 if the name is too short
2020-05-22 17:24:57 +02:00
Michele Artini
eb606dc1e2
partial implementation of events with rels
2020-05-22 17:17:41 +02:00
Miriam Baglioni
29066a6b46
applied code cleanup
2020-05-22 15:38:50 +02:00
Miriam Baglioni
8610ad5142
added groupby id to fix multiple results with same id at join step
2020-05-22 15:32:55 +02:00
Miriam Baglioni
1e44703e3e
merge upstream
2020-05-22 15:30:07 +02:00
Miriam Baglioni
ac8025f469
-
2020-05-22 15:29:41 +02:00
Miriam Baglioni
50ad83b97f
-
2020-05-22 15:27:19 +02:00
Miriam Baglioni
473c6d3a23
produces AtomicActions instead of Projects
2020-05-22 15:26:57 +02:00
Sandro La Bruzzo
72278b9375
Merge remote-tracking branch 'origin/master' into doiboost
2020-05-22 15:17:13 +02:00
Sandro La Bruzzo
22936d0877
Merge branch 'doiboost' of code-repo.d4science.org:D-Net/dnet-hadoop into doiboost
2020-05-22 15:15:17 +02:00
Sandro La Bruzzo
9fbb221457
completed mapping of UnpayWall and ORCID
2020-05-22 15:15:09 +02:00
Miriam Baglioni
70389b0a30
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-05-22 13:53:23 +02:00
Miriam Baglioni
4308f31165
added fix to make test run
2020-05-22 13:13:01 +02:00
Claudio Atzori
946598cfba
Merge branch 'master' into provision_indexing
2020-05-22 12:35:41 +02:00
Claudio Atzori
3cf2796ac6
code formatting
2020-05-22 12:34:00 +02:00
Michele Artini
dc4621b3cb
filter ORCID and MAG identifiers
2020-05-22 12:25:01 +02:00
Michele Artini
9f2d0f1b08
filter ORCID and MAG identifiers
2020-05-22 11:00:27 +02:00
Michele Artini
9de71e54a8
filter ORCID and MAG identifiers
2020-05-22 10:47:39 +02:00
Michele Artini
c5f7e17348
author fullnames
2020-05-22 10:08:02 +02:00
Claudio Atzori
ad40470040
Merge branch 'master' into provision_indexing
2020-05-22 08:51:22 +02:00
Claudio Atzori
925d933204
making XmlRecordFactory immune to graph encoding changes (mostly to avoid NPEs)
2020-05-22 08:50:44 +02:00
Claudio Atzori
b33dd58be4
replaced parameter 'reuseRecords' with 'resumeFrom', allowing to restart the provision workflow execution from any step, useful for manual submissions or debugging
2020-05-22 08:50:06 +02:00
Michele Artini
c7ca3cf35b
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-05-21 16:48:20 +02:00
Michele Artini
3e34517479
partial implementation of events with rels
2020-05-21 16:47:53 +02:00
Miriam Baglioni
eae12a6586
Merge branch 'master' into dhp_oaf_model
2020-05-21 16:31:22 +02:00
Miriam Baglioni
6750075fbd
merge upstream
2020-05-21 16:31:09 +02:00
Miriam Baglioni
4589c428b1
generate action sets and saves them in the hdfs path for the actions sets
2020-05-21 16:30:39 +02:00
miconis
8b35e0e7f0
reimplementation of the author merging in deduprecord creation. implementation of the test class. minor changes
2020-05-21 12:02:44 +02:00
miconis
8bbd1d0501
reimplementation of the author merging in deduprecord creation. implementation of the test class.
2020-05-21 11:52:14 +02:00
Michele Artini
e43d4d7778
added a coalesce in sql query
2020-05-21 11:08:07 +02:00
Claudio Atzori
dbfb9c19fe
minor changes
2020-05-21 10:00:14 +02:00
Michele Artini
b3bcbb3129
resolve name of organization countries
2020-05-21 08:41:32 +02:00
Enrico Ottonello
1109d3b3fc
Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost
2020-05-21 00:41:27 +02:00
Enrico Ottonello
869a53040e
save to text file format
2020-05-21 00:41:21 +02:00
Sandro La Bruzzo
5818abaab4
fixed Crossref Mapping
2020-05-20 17:05:46 +02:00
Claudio Atzori
da4267d0fe
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-05-20 14:58:22 +02:00
Claudio Atzori
d7d2a0637f
added extra parameters to the provision indexing workflow
2020-05-20 14:55:38 +02:00
Miriam Baglioni
055eec5a77
added resource for prepare project test
2020-05-20 13:54:10 +02:00
Miriam Baglioni
9079bc1f61
-
2020-05-20 13:53:32 +02:00
Miriam Baglioni
67ba4fde57
added test for prepare projects step
2020-05-20 13:53:08 +02:00
Miriam Baglioni
5e0e554000
Merge branch 'master' into dhp_oaf_model
2020-05-20 10:57:30 +02:00
Miriam Baglioni
76f3f73caa
merge upstream
2020-05-20 10:31:40 +02:00
Miriam Baglioni
3c0eb12d3e
removed the not zipped files
2020-05-20 10:31:05 +02:00
Miriam Baglioni
c0d9e02340
zipped test resources that are too big
2020-05-20 10:30:25 +02:00
Miriam Baglioni
5e9c9fa87c
tests
2020-05-20 10:29:57 +02:00
Miriam Baglioni
faed7521bf
added resources for testing
2020-05-20 10:29:29 +02:00
Miriam Baglioni
75491482de
added a new preparation step to replicate each project for the programme it is associated to
2020-05-20 10:28:56 +02:00
Miriam Baglioni
eb0e47ba53
parameters for h2020 programme
2020-05-20 10:26:44 +02:00
Sandro La Bruzzo
b771d67e9d
next step of MAG conversion implemented
2020-05-20 08:14:03 +02:00
Miriam Baglioni
08218d2f3f
new workflow with added steps
2020-05-19 18:44:25 +02:00
Miriam Baglioni
457293ccc0
test for the various steps of project update with programme
2020-05-19 18:43:42 +02:00
Miriam Baglioni
9447d78ef3
added preparation classes
2020-05-19 18:42:50 +02:00
Michele Artini
85ca5622d4
partial implementation of generation of simple events
2020-05-19 16:17:35 +02:00
Claudio Atzori
0bdfbb0a57
reintroduced RDD based relation cut off procedure
2020-05-19 15:02:21 +02:00
Enrico Ottonello
934ad570e0
joined summaries and activities dataset
2020-05-19 12:57:21 +02:00
Enrico Ottonello
ca722d4d18
merged
2020-05-19 09:43:12 +02:00
Enrico Ottonello
7362bc3e9d
workflow to generate seq(doi,AuthorList)
2020-05-19 09:34:44 +02:00
Sandro La Bruzzo
8c95b50f26
Merge remote-tracking branch 'origin/master' into doiboost
2020-05-19 09:25:04 +02:00
Sandro La Bruzzo
486e850bcc
next step of MAG conversion implemented
2020-05-19 09:24:45 +02:00
Enrico Ottonello
d4e9075f22
Merge branch 'doiboost' of https://code-repo.d4science.org/D-Net/dnet-hadoop into doiboost
2020-05-18 19:51:36 +02:00
Enrico Ottonello
fc80e8c7de
added accumulator; last modified date of the record is added to saved data; lambda file is partitioned into 20 parts before starting downloading
2020-05-18 19:51:29 +02:00
Claudio Atzori
f3bc8aed31
lifted memory requirements for country propagation wf
2020-05-18 15:29:10 +02:00
Miriam Baglioni
b71fbb68b1
removed the removeOutputDir command from code. Relations are written in Append mode. Erasing the output dir meant removing all the relations computed in the previous steps
2020-05-18 13:57:20 +02:00
Miriam Baglioni
629af7cb79
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-05-18 13:07:36 +02:00
Miriam Baglioni
f0f14caf99
removed script files for shell actions not performed
2020-05-18 13:06:16 +02:00