812 Commits

Author SHA1 Message Date
Claudio Atzori 05f269a1c0 kryo based parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob 2020-06-01 00:32:42 +02:00
Claudio Atzori 6f5f498c78 restored common properties driving executor-cores and executor-memory in join_organization_relations wf node 2020-05-29 11:22:00 +02:00
Claudio Atzori b2f9564f13 WIP: fixed PrepareRelationsJob; parallel implementation of CreateRelatedEntitiesJob_phase2, now works by OafType; introduced custom aggregator in AdjacencyListBuilderJob 2020-05-29 10:58:15 +02:00
Claudio Atzori a57965a3ea limiting the dimensions of outliers 2020-05-28 17:36:37 +02:00
Claudio Atzori 821be1f8b6 experimental implementation of custom aggregation using kryo encoders 2020-05-28 13:53:13 +02:00
Claudio Atzori 83504ecace limiting the maximum number of authors allowed in XML records to MAX_AUTHORS = 200; authors with ORCID can exceed that limit 2020-05-28 13:52:30 +02:00
Claudio Atzori ef11593068 JoinedEntity.links defined as empty list by default 2020-05-28 13:50:44 +02:00
Claudio Atzori 5dea155a87 increased number of partitions produced by the join_all_entities phase as well as spark.sql.shuffle.partitions in adjancency_lists phase 2020-05-28 13:49:59 +02:00
Claudio Atzori fdd54bad1c code formatting 2020-05-27 19:31:54 +02:00
Claudio Atzori b9b1bc9967 Merge branch 'master' into provision_indexing 2020-05-27 12:55:20 +02:00
Claudio Atzori aac1515b58 Merge pull request 'result_pids without conflicts ???' (#16) from result_pids into master. Looks good, thanks Michele 2020-05-27 12:54:52 +02:00
Michele Artini f5ce7d76e1 resolve conflicts 2020-05-27 12:49:17 +02:00
Claudio Atzori cfd753217c repartition the join_entities in 24k files 2020-05-27 12:44:01 +02:00
Claudio Atzori 2f1a623d09 sync from master branch 2020-05-27 12:39:58 +02:00
Claudio Atzori 9e4ec1543b updated test 2020-05-27 12:38:42 +02:00
Claudio Atzori 8047d16dd9 added RDD based adjacency list creation procedure 2020-05-27 12:38:12 +02:00
Claudio Atzori f057dcdf65 limit the max number of externalreferences to MAX_EXTERNAL_ENTITIES 2020-05-27 12:37:33 +02:00
Michele Artini b81f2741d2 xquery 2020-05-27 12:10:20 +02:00
Michele Artini a25598140a result pids (new xpaths + IS vocabularies) 2020-05-27 12:10:20 +02:00
Michele Artini 7a7272d9ec result pids (new xpaths + IS vocabularies) 2020-05-27 12:10:20 +02:00
Michele Artini 3ceb2d2853 match terms with vocabularies 2020-05-27 11:34:13 +02:00
Claudio Atzori 4e36d689dd fixed XML serialization for children sub-elements (duplicates & externalreferences) 2020-05-26 18:30:40 +02:00
Michele Artini c15d997925 xquery 2020-05-26 13:13:17 +02:00
Michele Artini c6af36496a result pids (new xpaths + IS vocabularies) 2020-05-26 13:11:09 +02:00
Michele Artini 093f1aff03 result pids (new xpaths + IS vocabularies) 2020-05-26 13:06:55 +02:00
Claudio Atzori b8e541a454 fixing repeated organization.websiteurl in organization entities (#5645) as well as project.ecinternationalorganizationeurinterests 2020-05-26 10:30:09 +02:00
Claudio Atzori 55595d7235 HACK: patch NULL values with defaults found in result.datainfo.deletedbyinference and result.context 2020-05-26 10:28:35 +02:00
Claudio Atzori 7b288a94cb code formatting 2020-05-26 09:54:13 +02:00
Miriam Baglioni 54d869e618 merge upstream 2020-05-26 09:22:04 +02:00
Miriam Baglioni eea07f4c42 refactoring 2020-05-26 09:21:49 +02:00
Michele Artini d6aada4957 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-26 08:44:31 +02:00
Michele Artini b1546605e3 updated version of a dependency 2020-05-26 08:44:15 +02:00
Claudio Atzori 7582532e73 [maven-release-plugin] prepare for next development iteration 2020-05-25 19:48:18 +02:00
Claudio Atzori 01c2e93395 [maven-release-plugin] prepare release dhp-1.2.1 2020-05-25 19:48:14 +02:00
miconis da1e5cf557 implementation of the result title merge. main title with higher trust, distinct between the others 2020-05-25 18:02:57 +02:00
Miriam Baglioni d3d36647d2 merge upstream 2020-05-25 10:38:22 +02:00
Miriam Baglioni 74215f6d9f refactoring 2020-05-25 10:38:16 +02:00
Miriam Baglioni dbde2d243a changed due to move of PacePerson from dhp-graph-mapper to dhp-common 2020-05-25 10:35:39 +02:00
Miriam Baglioni f754c424bd changed logic to compute PacePerson only once for each Author to be enriched 2020-05-25 10:35:02 +02:00
Miriam Baglioni 8f51af4e9b added PacePerson to get name surname for authors having only fullname set 2020-05-25 10:34:30 +02:00
Miriam Baglioni b258f99ece fix for issue that duplicated result 2020-05-25 10:26:48 +02:00
Miriam Baglioni 8f6ce970f9 moved PacePerson to dhp-common to avoid conflict in dependency with graph-mapper 2020-05-25 10:25:55 +02:00
Claudio Atzori de108f54d6 code formatting 2020-05-23 10:21:19 +02:00
Claudio Atzori 6b56cae57d added mapping for bestaccessrights 2020-05-23 09:57:39 +02:00
Claudio Atzori 7181807e64 code formatting 2020-05-23 09:51:48 +02:00
Miriam Baglioni 0d1ec1913f added fix to avoid duplication of results 2020-05-22 18:42:25 +02:00
miconis 5d7ac78c41 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-05-22 17:25:08 +02:00
miconis 0fd0c7d725 reimplementation of the sim between two authors. now it takes into account both name and surname. threshold incremented to 1.0 if the name is too short 2020-05-22 17:24:57 +02:00
Michele Artini eb606dc1e2 partial implementation of events with rels 2020-05-22 17:17:41 +02:00
Miriam Baglioni 29066a6b46 applied code cleanup 2020-05-22 15:38:50 +02:00