Claudio Atzori
3be0e5c2cd
Merge branch 'master' into graph_db
2020-08-05 12:54:48 +02:00
Alessia Bardi
a29565ff57
code formatting
2020-08-04 12:55:27 +02:00
Alessia Bardi
01db29e208
fixes redmine issue #5846 : datacite and its different namespace declarations
2020-08-04 12:53:48 +02:00
Alessia Bardi
b4e4e5f858
do not duplicate result PIDs
2020-08-04 12:52:14 +02:00
Claudio Atzori
771bf8bcc4
WIP: materialize graph as Hive DB, mergeAggregatorGraphs
2020-08-04 12:26:09 +02:00
Claudio Atzori
0da1d2c0c9
introduced GraphFormat.DEFAULT, indicating a common value to be used across the workflows
2020-08-04 12:25:31 +02:00
Claudio Atzori
1fcc28968e
integrated changes from master
2020-08-04 10:57:44 +02:00
Michele Artini
652b13abb6
Merge branch 'master' into nsprefix_blacklist
2020-07-31 07:58:37 +02:00
Claudio Atzori
cd631bb5bc
defaults fixed in the cleaning workflow forces result.publisher to NULL when result.publisher.value in empty
2020-07-30 17:03:53 +02:00
Claudio Atzori
4bbfcf1ac6
Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop
2020-07-30 16:25:06 +02:00
Claudio Atzori
4ff8007518
added function to set the missing vocabulary names, used in the cleaning workflow as a pre-cleaning step
2020-07-30 16:24:39 +02:00
Michele Artini
bdece15ca0
blacklist of nsprefix
2020-07-30 16:13:38 +02:00
Sandro La Bruzzo
c97c8f0c44
implemented new oozie job to extract entities in a separate dataset
2020-07-30 12:13:58 +02:00
Sandro La Bruzzo
3010a362bc
updated changing in the workflow of provision in the phase of aggregation. Removed serialization in JSON RDD and used spark Dataset
2020-07-30 09:25:56 +02:00
Sandro La Bruzzo
16ae3c9ccf
updated changing in the workflow of provision in the phase of aggregation. Removed serialization in JSON RDD and used spark Dataset
2020-07-30 09:25:32 +02:00
Claudio Atzori
9e594cf4c2
WIP: materialize graph as Hive DB, aggregator graph
2020-07-29 19:25:11 +02:00
Claudio Atzori
2dbac631c9
WIP: factoring out utilities into dhp-workflows-common
2020-07-29 13:08:20 +02:00
Claudio Atzori
56bbfdc65d
introduced parameter 'numParitions', driving the hive DB table data partitioning. Currently specified only for table 'project'
2020-07-23 08:54:10 +02:00
Claudio Atzori
ebf60020ac
map results as OPRs in case of missing //CobjCategory/@type and the vocabulary dnet:result_typologies doesn't resolve the super type
2020-07-20 19:01:10 +02:00
Claudio Atzori
32f5e466e3
imports cleanup
2020-07-20 17:42:58 +02:00
Claudio Atzori
54ac583923
code formatting
2020-07-20 17:37:08 +02:00
Claudio Atzori
124e7ce19c
in case of missing attribute //dr:CobjCategory/@type the resulttype is derived by looking up the vocabulary dnet:result_typologies with the 1st instance type available
2020-07-20 17:33:37 +02:00
Claudio Atzori
050dda223d
Merge pull request 'removed duplicated fields' ( #25 ) from unique_field_in_lists into master
...
Looks good as a temporary workaround. I agree the model could seamlessly make the distinct operation by using HashSets instead of Linked (or Array) Lists.
The task to update the model in such a way is added on #9#issuecomment-1583
Thanks!
2020-07-20 12:12:50 +02:00
Claudio Atzori
e0c4cf6f7b
added parameter to drive the graph merge strategy: priority (BETA|PROD)
2020-07-20 10:48:01 +02:00
Claudio Atzori
94ccdb4852
Merge branch 'master' into merge_graph
2020-07-20 10:14:55 +02:00
Michele Artini
331a3cbdd0
fixed originalId
2020-07-20 09:50:29 +02:00
Sandro La Bruzzo
9116d75b3e
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-17 18:01:30 +02:00
Miriam Baglioni
47c7122773
changed priority from beta to production
2020-07-17 12:56:35 +02:00
Michele Artini
442f30930c
removed duplicated fields
2020-07-17 12:25:36 +02:00
Claudio Atzori
1781609508
code formatting
2020-07-16 19:06:56 +02:00
Claudio Atzori
878f2b931c
Merge branch 'master' into merge_graph
2020-07-16 16:34:24 +02:00
Michele Artini
e1ae964bc4
stats
2020-07-10 16:12:08 +02:00
Sandro La Bruzzo
c01efed79b
Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop
2020-07-10 14:44:57 +02:00
Sandro La Bruzzo
a7d3977481
added generation of EBI Dataset
2020-07-10 14:44:50 +02:00
Claudio Atzori
67e1d222b6
bulk cleaning when found null or empty, sets bestaccessrights evaluating the result instances
2020-07-08 17:53:35 +02:00
Claudio Atzori
610d377d57
first implementation of the BETA & PROD graphs merge procedure
2020-07-08 16:54:26 +02:00
Sandro La Bruzzo
1d420eedb4
added generation of EBI Dataset
2020-07-02 12:37:43 +02:00
Claudio Atzori
6f5771c1c9
sets author.rank when null
2020-06-25 14:06:21 +02:00
Claudio Atzori
7df2712824
Merge branch 'provision_indexing'
2020-06-25 12:22:41 +02:00
Michele Artini
abcbebcbb4
fixed generation of ids
2020-06-25 09:50:46 +02:00
Michele Artini
77d2a1b1c4
params to choose sql queries for beta or production
2020-06-25 09:28:13 +02:00
Claudio Atzori
0e723d378b
added default from vocab for missing instance.refereed; remove spurious prefixes from orcid values; WIP: prepare relation job
2020-06-24 18:34:42 +02:00
Claudio Atzori
7d416f08d8
graph cleaning workflow: set hostedby to unknown repository when defined as NULL
2020-06-22 09:50:43 +02:00
Claudio Atzori
d0ac7514b2
cleaning workflow to include cleaning of default values
2020-06-18 19:37:25 +02:00
Sandro La Bruzzo
9bf67f5de1
resolved conflicts
2020-06-17 09:15:43 +02:00
Sandro La Bruzzo
1d4275acc4
implemented first version of exportation of Scholexplorer into ActionSet
2020-06-17 09:10:38 +02:00
Claudio Atzori
5441f01586
Merge pull request 'missing landingPage urls in instances' ( #22 ) from instances-with-landing-page into master
...
Looks good, thanks!
2020-06-16 15:32:44 +02:00
Claudio Atzori
2a4f65795f
WIP: graph cleaner implementation
2020-06-15 18:32:24 +02:00
Claudio Atzori
c15c8c0ad0
map datasource identities (including piwik ids) as original IDs
2020-06-15 16:07:30 +02:00
Claudio Atzori
0d52816244
WIP: graph cleaner implementation
2020-06-13 13:06:04 +02:00