Commit Graph

1130 Commits

Author SHA1 Message Date
Sandro La Bruzzo 31d2d6d41e Scholexplorer: introduction of dedup openaire 2021-07-21 18:09:32 +02:00
Alessia Bardi 9069958479 tests for enermaps 2021-07-20 19:31:43 +02:00
Claudio Atzori 65934888a1 adding record identifier among the originalIds regardless of what IdentifierFactory produces 2021-07-19 17:52:52 +02:00
Claudio Atzori 5947cddafc adding record identifier among the originalIds regardless of what IdentifierFactory produces 2021-07-19 17:52:24 +02:00
Claudio Atzori 0977baf41d contents mapped from the stores with 'claim' interpretation will not change their identifier along their way towards the graph 2021-07-19 17:43:52 +02:00
Claudio Atzori 5e5f65a3c3 contents mapped from the stores with 'claim' interpretation will not change their identifier along their way towards the graph 2021-07-19 15:56:55 +02:00
Sandro La Bruzzo 7e2caafe84 Scholexplorer: fixed mapping typologies 2021-07-15 09:53:12 +02:00
Miriam Baglioni 774cdb190e changes to mirror the last dump of the graph with the ols data model. 2021-07-13 18:57:24 +02:00
Miriam Baglioni 886617afd0 One result linked to more than on project is saved just once 2021-07-13 18:15:35 +02:00
Miriam Baglioni 320cf02d96 Changed the way to find results linked to projects. We verify to actually have the project on the graph before selecting the result 2021-07-13 18:13:32 +02:00
Miriam Baglioni 52ce35d57b - 2021-07-13 18:08:46 +02:00
Miriam Baglioni 970b387b8d modification to allow dump of a single community 2021-07-13 18:08:10 +02:00
Miriam Baglioni eae10c5894 modification to allow the dump for a single community 2021-07-13 18:07:25 +02:00
Miriam Baglioni c028feef4f workflow for the dump as sub workflows 2021-07-13 18:06:44 +02:00
Miriam Baglioni d70f8c96fd funding contains and not starts with h2020 2021-07-13 17:34:53 +02:00
Miriam Baglioni 5e38c7f42d dumping only communities with status all 2021-07-13 17:32:38 +02:00
Miriam Baglioni 618d2de2da minor changes and refactoring 2021-07-13 17:10:02 +02:00
Miriam Baglioni 59615da65e Add test to verify the creation of relation between context and projects 2021-07-13 17:09:15 +02:00
Miriam Baglioni 084b4ef999 added the creation of the openaireId from funder and grant number if the element is not present in the context profile 2021-07-13 17:07:46 +02:00
Miriam Baglioni 8f322a73cb change because of the renaming of originalId in acronym 2021-07-13 16:22:58 +02:00
Miriam Baglioni 72397ea1ba Added fix for community of arbitrary name length 2021-07-13 16:18:35 +02:00
Miriam Baglioni 5295d10691 added check not to dump deletedByInference entities 2021-07-13 16:11:46 +02:00
Miriam Baglioni e9a17ec899 added check to verify not to add void APC 2021-07-13 15:53:35 +02:00
Miriam Baglioni 8429aed6c6 Added resource for testing selection of valid relations 2021-07-13 15:49:38 +02:00
Miriam Baglioni 39b1a6edf6 added test class for the selection of valid relations and description 2021-07-13 15:23:09 +02:00
Miriam Baglioni 9a58f1b93d added logic to select only the valid relations: those not deletedbyinference and having both part of the relation as entities in the graph 2021-07-13 15:20:39 +02:00
Miriam Baglioni 13c66e16be changed logic to split for communities 2021-07-13 15:15:27 +02:00
Miriam Baglioni 6410ab71d8 added APC in the dump and test method 2021-07-13 15:13:58 +02:00
Miriam Baglioni 65a242646d added resource for APC dump 2021-07-13 14:45:25 +02:00
Miriam Baglioni 4b432fbee8 extended test class 2021-07-13 14:40:39 +02:00
Miriam Baglioni 87a6e2b967 extended test class 2021-07-13 14:38:28 +02:00
Miriam Baglioni 69fd40fd30 modified code to split the Croatian funder 2021-07-13 14:35:26 +02:00
Miriam Baglioni 86e50f7311 modified code to split the Croatian funder 2021-07-13 14:31:45 +02:00
Miriam Baglioni da88c850c6 changed the logic to verify if a community is contained in the list of context of a result 2021-07-13 14:22:44 +02:00
Miriam Baglioni 2f66fedfec changed the logic to verify if a community is contained in the list of context of a result 2021-07-13 14:22:23 +02:00
Sandro La Bruzzo bbe8193930 merged stable ids 2021-07-12 17:00:43 +02:00
Sandro La Bruzzo 09fccf8000 added workflow to serialize scholix and summary in json 2021-07-09 11:01:42 +02:00
Sandro La Bruzzo 0ea576745f updated CreateInputGraph because ggenerics don't work on Spark Dataset 2021-07-09 10:29:24 +02:00
Sandro La Bruzzo cd17e19044 implemented branch workflow to import datacite and crossref in scholexplorer 2021-07-08 21:20:19 +02:00
Sandro La Bruzzo 8a034e46e1 updated baseline workflow 2021-07-08 11:11:41 +02:00
Claudio Atzori b7b8e0986e [raw_all] The claim merge procedure includes the claimed contexts in the merged result 2021-07-08 10:42:31 +02:00
Sandro La Bruzzo 0799ac9fb6 fixed wrong path 2021-07-08 10:36:37 +02:00
Sandro La Bruzzo 4d53402712 extended ebiLinks to create a dataset before generation of OAF 2021-07-08 10:26:21 +02:00
Sandro La Bruzzo a4a54a3786 code refactor 2021-07-08 09:08:25 +02:00
Sandro La Bruzzo a01dbe0ab0 completed workflow of generation of scholix and summaries 2021-07-07 23:10:34 +02:00
Claudio Atzori fdcff42e46 [raw_all] Aggregator graph creation merges claims (updates) with the corresponding entity 2021-07-07 19:01:59 +02:00
Claudio Atzori bc014023c8 Merge pull request 'to solve the scala SI-3623' (#122) from andreas.czerniak/BrStableId_dnet-hadoop:stable_ids into stable_ids
Reviewed-on: #122
2021-07-07 11:13:51 +02:00
Claudio Atzori 32bdfdccbc [raw_all] Aggregator graph creation merges claims (updates) with the corresponding entity 2021-07-07 11:08:27 +02:00
Claudio Atzori f580cb77e1 added mapping for claim relation 'resultResult_publicationDataset_isRelatedTo' (present on BETA) 2021-07-06 21:11:11 +02:00
Sandro La Bruzzo 8535506c22 added scholix generation 2021-07-06 17:18:06 +02:00
Sandro La Bruzzo 4c54bd8742 add test to verify merge scholix on source 2021-07-06 11:32:14 +02:00
Andreas Czerniak 3531802710 to solve the scala SI-3623 2021-07-06 11:30:56 +02:00
Sandro La Bruzzo 7d8db2eb8a betterRenamingMethod 2021-07-06 09:56:32 +02:00
Sandro La Bruzzo c952c8d236 generate first side of scholix mapping 2021-07-06 09:53:14 +02:00
Sandro La Bruzzo e4b84ef5d6 fixed mapping OAF to Scholix summary 2021-07-02 16:48:48 +02:00
Sandro La Bruzzo c6fa8598e1 massive code refactor:
removed modules dhp-*-scholexplorer
2021-07-01 22:13:45 +02:00
Sandro La Bruzzo 84b834c893 added test dataset test for pangaea 2021-06-30 17:31:09 +02:00
Sandro La Bruzzo 1a6b398968 implemented Creation of Raw Graph and Resolution 2021-06-30 17:27:55 +02:00
Sandro La Bruzzo 623a0c4edb code Refactor, renaming packages 2021-06-30 11:09:30 +02:00
Sandro La Bruzzo 7e08655e5f added relation dates in all scholexplorer Datasources 2021-06-29 12:02:03 +02:00
Sandro La Bruzzo 075055eaca added relation dates in bio mapping 2021-06-29 10:33:09 +02:00
Sandro La Bruzzo f36f92287d implemented mapping from Crossref Event Data to Oaf 2021-06-29 10:21:23 +02:00
Sandro La Bruzzo 511ec14c63 implemented mapping from EBI and Scholix Resolved to OAF 2021-06-28 22:04:22 +02:00
Sandro La Bruzzo ad50415167 Merge remote-tracking branch 'origin/stable_ids' into stable_id_scholexplorer 2021-06-24 17:20:50 +02:00
Sandro La Bruzzo 80e15cc455 implemented mapping from uniprot, pdb and ebi links 2021-06-24 17:20:00 +02:00
Claudio Atzori 2e8fd2c531 cleanup 2021-06-23 14:38:24 +02:00
Claudio Atzori 4dc9ebf217 [raw_all] fixed unit test 2021-06-23 14:38:07 +02:00
Claudio Atzori 50fc5a64a0 [raw_all] Aggregator graph creation merges claims (updates) with the corresponding entity 2021-06-23 11:49:42 +02:00
Sandro La Bruzzo 080a280bea added pdb to Oaf Transformation 2021-06-21 16:23:59 +02:00
Sandro La Bruzzo 507e42102a added pdb to oaf class 2021-06-21 09:36:40 +02:00
Sandro La Bruzzo 4fe7b75644 renamed packages 2021-06-18 16:41:24 +02:00
Sandro La Bruzzo 3100166d29 Merge remote-tracking branch 'origin/stable_ids' into stable_id_scholexplorer 2021-06-16 16:22:16 +02:00
Claudio Atzori 7243a40c88 code formatting 2021-06-16 15:03:03 +02:00
Sandro La Bruzzo dfcf78cf24 removed wrong code 2021-06-16 14:57:42 +02:00
Sandro La Bruzzo cc0f2b11fb Implemented mapping from pubmed baseline to OAF 2021-06-16 14:56:24 +02:00
Michele Artini ada063ce70 fixed a problem with empty mdstore list (2) 2021-06-14 12:04:47 +02:00
Michele Artini 83132ee99a fixed a problem with empty mdstore list 2021-06-14 11:57:00 +02:00
Sandro La Bruzzo aeb8132627 Merged branch stable_ids 2021-06-14 10:07:29 +02:00
Claudio Atzori 2039bb9f5f orcid / orcid_pending cleaning backported from master branch 2021-06-14 09:40:50 +02:00
Claudio Atzori dd19c4ac5a Merge pull request 'import_new_mdstores' (#112) from import_new_mdstores into stable_ids
Reviewed-on: #112
2021-06-14 09:23:55 +02:00
Claudio Atzori a900bfb874 delegating the date parsing to https://github.com/sisyphsu/dateparser 2021-06-11 16:53:01 +02:00
Sandro La Bruzzo 5b724d9972 added relations to datacite mapping 2021-06-04 10:14:22 +02:00
Sandro La Bruzzo e57294ac99 implemented changes on PUBMed dataflow 2021-06-03 10:52:09 +02:00
Michele Artini ede2749822 orcid pid type 2021-06-01 12:42:43 +02:00
Michele Artini f0fbfdcfae Merge branch 'stable_ids' into import_new_mdstores 2021-06-01 12:03:00 +02:00
Michele Artini e950750262 add nodes to import hdfs mdstores 2021-06-01 10:48:50 +02:00
Michele Artini 03a510859a removed coalesce(1) 2021-05-31 14:10:51 +02:00
Michele Artini e9f2b6037c patch of mdstore records 2021-05-31 11:36:26 +02:00
Michele Artini ad56a44fda save as gzipped sequence file 2021-05-28 14:45:39 +02:00
Claudio Atzori 6e3a4e9237 updated test expectations 2021-05-28 09:37:50 +02:00
Michele Artini 4fa5671d16 first implementation of Hdfs Mdstores Importer 2021-05-27 16:22:07 +02:00
Claudio Atzori 5e4b91d9ef more pervasive use of constants from ModelConstants, especially for ORCID 2021-05-26 18:20:23 +02:00
Claudio Atzori 9d725efdc1 reverted implementation of the mdstore client 2021-05-20 18:26:09 +02:00
Claudio Atzori ae5c28e54f code formatting 2021-05-20 16:13:06 +02:00
Claudio Atzori 232dce83db fixes #6701: xpath for titles to support both datacite and Guidelines v4 mapping 2021-05-20 14:41:15 +02:00
Claudio Atzori 23b8883ab1 applied intellij code cleanup 2021-05-14 10:58:12 +02:00
Claudio Atzori d4c3476152 mapping datasource.journal only when an issn is available, null otherwhise 2021-05-11 11:08:54 +02:00
Claudio Atzori d1cbee8413 imported methods from CleaningFunctions, defined in GraphCleaningFunctions 2021-05-10 16:43:39 +02:00
Claudio Atzori d4a30fabe3 clean up tests 2021-05-05 17:28:15 +02:00
Claudio Atzori dccaf173cf fixed mapping applied to ODF records. Added unit test to verify the mapping for OpenTrials 2021-05-05 16:36:15 +02:00
Claudio Atzori 2e1eb96f9a code formatting 2021-05-05 11:23:57 +02:00
Claudio Atzori fb930b84d3 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-05-04 18:06:30 +02:00
Claudio Atzori 923d19ea8e mdstore read lock/unlock when bulk copying records from mongodb to hdfs 2021-05-04 18:06:21 +02:00
Sandro La Bruzzo 714b71bd21 updated pubmed 2021-05-04 14:54:12 +02:00
Alessia Bardi 9a20057615 fixed query for organisations' pids 2021-04-29 15:23:39 +02:00
Sandro La Bruzzo 2129e9caa7 updated pangaea transformation to parse directly the xml 2021-04-28 10:21:03 +02:00
Claudio Atzori 5afa7d3e0c core utilities in dhp-common moved in external module dhp-schemas 2021-04-27 15:44:01 +02:00
Sandro La Bruzzo 74484d2823 bug fixing 2021-04-27 12:13:44 +02:00
Sandro La Bruzzo c74b03d59c Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids 2021-04-27 11:31:07 +02:00
Sandro La Bruzzo 7f8848ecdd added first implementation of Pangaea Mapping 2021-04-27 11:30:37 +02:00
Claudio Atzori 27ab8a704d adjusted poms to align with the external dhp-schema module 2021-04-27 10:12:27 +02:00
Claudio Atzori c2bb03c8b5 depending on external dhp-schemas module 2021-04-23 17:57:35 +02:00
Claudio Atzori c25238480c making ODF record parsing namespace unaware (#6629) 2021-04-23 17:34:57 +02:00
Claudio Atzori d0d477cca3 code formatting 2021-04-20 12:50:34 +02:00
miconis 0393cdce42 addition of alternative names in export queries 2021-04-20 12:45:21 +02:00
miconis cadd0a5de8 modification of the queries for openorgs: they now consider also pending orgs 2021-04-20 12:06:56 +02:00
Claudio Atzori d1ca025b0b [cleaning] remiving authors without fullname or providing 'deactivated' keyword. Removing test test titles 2021-04-13 14:32:41 +02:00
miconis 11b22b2d23 bug fix in the query, it now exports only relations with non-hidden organizations 2021-04-08 11:51:47 +02:00
miconis 0857100fb8 implementation of the tests for the openorgs integration in the openaire provision 2021-04-07 18:42:16 +02:00
miconis bf685d849f addition of pids in the query for the export of openorgs for the provision, addition of ec_fields in the openorgs model 2021-04-07 14:27:43 +02:00
miconis eaaefb8b4c implementation of the procedure to reuse content of different dbs when creating the raw graph 2021-04-06 14:35:51 +02:00
miconis c39c82dfe9 modification of the jobs for the integration of openorgs in the provision, dedup records are no more created by merging but simply taking results of openorgs portal 2021-04-06 14:31:00 +02:00
Claudio Atzori 7941d7be29 WIP: using common definitions from ModelConstants 2021-03-31 18:33:57 +02:00
Claudio Atzori 72ce741ea6 WIP: using common definitions from ModelConstants 2021-03-31 17:07:13 +02:00
Claudio Atzori 9237d55d7f [OpenOrgsWf] cleanup 2021-03-29 17:40:34 +02:00
Claudio Atzori 7f4e9479ec [OpenOrgsWf] graph construction wf: allow to skip the import openorgs node (importOpenorgs true|false) 2021-03-29 16:59:16 +02:00
miconis 2709d08fc2 Merge branch 'stable_ids' into openorgswf 2021-03-29 16:39:07 +02:00
miconis f446580e9f code refactoring (useless classes and wf removed), implementation of the test for the openorgs dedup 2021-03-29 16:10:46 +02:00
miconis 2355cc4e9b minor changes and bug fix 2021-03-29 10:07:12 +02:00
Claudio Atzori 827e7e37db [Cleaning] drop instance.alternateIdentifier elements when they are available among instance.pid 2021-03-25 11:07:59 +01:00
miconis 28c1cdd132 merged stable_ids into openorgswf 2021-03-25 10:44:49 +01:00
miconis 348b0ef921 bug fix, implementation of the workflow for the creation of raw_organizations (openorgs dedup), addition of the pid lists to the openorgs postgres db 2021-03-24 15:51:27 +01:00
Claudio Atzori 751125fdf9 [Actionmanager] zero function considers empty entity.id as well as rel.source/rel.target 2021-03-23 17:34:32 +01:00
Claudio Atzori b4febed138 updated mapping tests as consequence of the special treatment reserved to Handle PIDs 2021-03-23 09:37:48 +01:00
Claudio Atzori 431cbe9955 handle missing instance.pid during bulk cleaning 2021-03-23 09:28:58 +01:00
Sandro La Bruzzo c73072079d fix conflicts 2021-03-22 16:36:31 +01:00
Claudio Atzori 5a043e95ea code formatting 2021-03-19 11:37:27 +01:00
Claudio Atzori a4e82a65aa integrated filter applied when merging BETA & PROD graphs to rule our records from Datacite 2021-03-19 11:34:44 +01:00
Claudio Atzori 8257f9a2bc result.pid: adjusted the mapping applied to the contents from the aggregator 2021-03-17 12:45:38 +01:00
Claudio Atzori 640b885706 added instance.alternativeIdentifiers to the graph model, adjusted the mapping applied to the contents from the aggregator 2021-03-16 14:19:32 +01:00
Claudio Atzori 01630f638d IdentifierFactory implementation based on the list of datasources authoritative for a given pid type 2021-03-09 17:11:50 +01:00
Claudio Atzori 59532b0919 [#6281 Provenance of product PIDs] Added PIDs to the Instance type; extended mapping for OAF/ODF records 2021-03-09 11:14:45 +01:00
Claudio Atzori d525785497 [#6282 open access status in the Graph] Result.Instance.accessRight defined with dedicated data type that includes the open access color. 2021-03-09 11:12:55 +01:00
Claudio Atzori f468c7f0d7 merged from master 2021-03-09 09:12:41 +01:00
Claudio Atzori 8d2bb24512 merged from master 2021-03-08 15:44:34 +01:00
Claudio Atzori fa7930d2e2 merging contributions from PR#97 2021-03-05 15:45:28 +01:00
miconis 1a85020572 bug fix in graph-mapper, changes in the implementation of the openorgs wf to create relations and populate openorgs db 2021-02-26 10:19:28 +01:00
Claudio Atzori b830e33392 mdstore collector plugin 2021-02-25 12:30:30 +01:00
Claudio Atzori fc3fa5e343 implemented mdstore collector plugin 2021-02-24 15:07:24 +01:00
miconis 4b2124a18e implementation of the openorgs wfs, implementation of the raw_all wf to migrate openorgs db entities 2021-02-10 11:51:50 +01:00
Alessia Bardi c4d1feca74 mapper test with validated link to project 2021-02-10 11:22:54 +01:00
Claudio Atzori 72c57b28fa switched project version to 1.2.4-branch_hadoop_aggregator-SNAPSHOT 2021-02-04 14:08:18 +01:00
Alessia Bardi c67329d3ad updated test for EU Open Data portal datasets 2021-02-03 17:06:48 +01:00
Alessia Bardi fd705404a1 tests for EU Open Data portal dataset mapping 2021-02-03 10:28:17 +01:00
Sandro La Bruzzo 686e7b507c Merge branch 'hadoop_aggregator' of code-repo.d4science.org:D-Net/dnet-hadoop into aggregation_on_hadoop 2021-01-28 10:02:13 +01:00
Sandro La Bruzzo 98b9498b57 Removed old messaging system not quite used from collection and Transformation workflow
code refactor
2021-01-28 09:51:17 +01:00
Sandro La Bruzzo 150a617bd1 Merge pull request 'aggregation_on_hadoop' (#90) from sandro.labruzzo/dnet-hadoop:aggregation_on_hadoop into hadoop_aggregator
Wonderfull code... You're the Best Sandro
2021-01-26 16:00:47 +01:00
Claudio Atzori 885e0dd926 [Cleaning] filter authors not providing word characters in the fullname 2021-01-26 09:48:53 +01:00
Claudio Atzori 2890511613 [Cleaning] normalise missing Result.country 2021-01-26 09:41:44 +01:00
Claudio Atzori 4eb9ed35b1 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2021-01-25 18:12:24 +01:00
Claudio Atzori cd379eb5e3 [Cleaning] trying to avoid NPEs, this time by ruling out authors without a defined fullname 2021-01-25 18:11:49 +01:00
Alessia Bardi 505477f36f format code 2021-01-25 18:02:49 +01:00
Alessia Bardi ded6ed8d7d no ',' author, if there are no author in ODF records 2021-01-25 17:57:51 +01:00
Claudio Atzori 3465c8ccee [Cleaning] trying to avoid NPEs 2021-01-25 16:54:53 +01:00
Sandro La Bruzzo a54848a59c Moved Vocabulary stuff to common module 2021-01-25 15:43:04 +01:00
Claudio Atzori 07a0ccfc96 [Cleaning] trying to avoid NPEs 2021-01-25 13:36:01 +01:00
Claudio Atzori 34d653de41 [Cleaning] updated cleaning rule for DOIs 2021-01-22 14:16:33 +01:00
Claudio Atzori 26e9d55c13 code formatting 2021-01-05 09:59:26 +01:00
Claudio Atzori 7185158942 ignore missing properties 2020-12-29 11:06:28 +01:00
Claudio Atzori 28460c2cd1 using com.fasterxml.jackson.databind.ObjectMapper instead of org.codehaus.jackson.map.ObjectMapper 2020-12-23 16:59:52 +01:00
Claudio Atzori 723b01f9e9 trivial: the less magic numbers and values around, the better 2020-12-23 12:22:48 +01:00
Claudio Atzori 6cb0dc3f43 extended OCRID cleaning procedure 2020-12-21 11:40:17 +01:00
Claudio Atzori 47270d9af5 lenient mock can be lenient 2020-12-18 15:38:59 +01:00
Alessia Bardi f9a8fd8bbd updated test record for textgrid 2020-12-17 11:59:45 +01:00
Michele Artini 991e675dc6 validation in claim rels 2020-12-14 15:41:25 +01:00
Claudio Atzori 12e2f930c8 resolved conflicts 2020-12-10 10:57:39 +01:00
Alessia Bardi 112da6d76a in theory, just auto-formatting after mvn compile 2020-12-09 20:00:27 +01:00
Alessia Bardi bece04b330 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-12-09 19:54:43 +01:00
Alessia Bardi 426b76ee8e more asserts for TextGrid record 2020-12-09 19:46:11 +01:00
Claudio Atzori 4705144918 Merge pull request 'rel_project_validation' (#69) from rel_project_validation into master
LGTM
2020-12-09 19:01:20 +01:00
Claudio Atzori ada21ad920 Merge pull request 'dump of the results related to at least one project' (#61) from miriam.baglioni/dnet-hadoop:dump into master
LGTM
2020-12-09 17:22:56 +01:00
Michele Artini 1bc9adc10d default trust for validated rels 2020-12-09 16:18:37 +01:00
Michele Artini 5f21a356fd reindent 2020-12-09 11:24:30 +01:00
Michele Artini 370a5e650b validation attributes in resultProject relations 2020-12-09 11:18:26 +01:00
Claudio Atzori a104a632df cleanup 2020-12-04 16:32:47 +01:00
Miriam Baglioni 5fb65ffc4a merge branch with master 2020-12-03 11:24:35 +01:00
Miriam Baglioni ea88dc3401 fixed issue in property name 2020-12-03 11:24:23 +01:00
Claudio Atzori cfb55effd9 code formatting 2020-12-02 11:23:49 +01:00
Claudio Atzori 57f448b7a4 graph cleaning workflow separate orcid_pending from orcid, depending on the author pid provenance 2020-12-02 10:44:05 +01:00
Alessia Bardi a417624670 tests for raw graph mapping 2020-12-02 10:15:26 +01:00
Claudio Atzori 893ac4a77b GenerateEntitiesApplication can be configured to hash the id value or not 2020-12-02 09:30:06 +01:00
Claudio Atzori 2c407e775e GenerateEntitiesApplication can be configured to hash the id value or not 2020-11-30 12:00:38 +01:00
Claudio Atzori e731a7658d cleaning texts to remove tab characters too 2020-11-27 09:00:04 +01:00
Claudio Atzori c1b9a4045a grouping of records will be performed by the dedup workflow 2020-11-26 10:59:10 +01:00
Miriam Baglioni 124591a7f3 refactoring 2020-11-25 18:23:28 +01:00
Miriam Baglioni 1a89f8211c #61 (comment) 2020-11-25 18:12:40 +01:00
Miriam Baglioni 5fbe54ef54 #61 (comment) 2020-11-25 18:10:28 +01:00
Miriam Baglioni ed01e5a5e1 #61 (comment) 2020-11-25 18:09:34 +01:00
Miriam Baglioni d4ddde2ef2 changed because of #61 (comment) 2020-11-25 18:01:01 +01:00
Miriam Baglioni f5e5e92a10 changed because of #61 (comment) 2020-11-25 17:58:53 +01:00
Miriam Baglioni 1df94b85b4 changed because of #61 (comment) 2020-11-25 17:57:43 +01:00
Claudio Atzori dfd6205b95 Consistency graph workflow merges all the entities by ID 2020-11-25 14:55:32 +01:00
Miriam Baglioni 90d4369fd2 added test to verify the compression in writing community info on hdfs 2020-11-25 14:34:58 +01:00
Miriam Baglioni 6750e33d69 merge branch with master 2020-11-25 14:09:01 +01:00
Miriam Baglioni b2c455f883 added java doc 2020-11-25 14:08:09 +01:00
Miriam Baglioni 1f130cdf92 changed the relation (produces -> isProducedBy) due to the change in the code 2020-11-25 14:04:26 +01:00
Miriam Baglioni e758d5d9b4 refactoring 2020-11-25 13:46:39 +01:00
Miriam Baglioni 87a9f616ae refactoring and addition of the funder nsp first part as nome for the dump insteasd of the whole nsp 2020-11-25 13:45:41 +01:00
Miriam Baglioni e7e418e444 added decision node to verify if to upload in Zenodo 2020-11-25 13:44:10 +01:00
Miriam Baglioni 305e3d0c9c added resource file for relation with relClass = isProducedBy 2020-11-25 13:43:41 +01:00
Miriam Baglioni 21ce175d17 added FilterFunction specification if filter operation 2020-11-25 13:42:31 +01:00
Miriam Baglioni bde6d337dd test classes for dump of results related to funders 2020-11-25 13:42:01 +01:00
Miriam Baglioni b37b9352d7 added constant value for semantic relationship between projects and results 2020-11-25 13:41:08 +01:00
Claudio Atzori 36173c13a5 reverted filters in the clening process 2020-11-25 10:24:42 +01:00
Claudio Atzori eeebd5a920 Cleanig workflow: remove newlines from titles, descriptions, subjects 2020-11-24 18:40:25 +01:00
Claudio Atzori e1a1bb3ee4 moved class CleaningFunctions in the correct package. Remove newlines from titles, descriptions, subjects 2020-11-24 18:34:03 +01:00
Miriam Baglioni 72bb0fe360 changed directory name 2020-11-24 16:47:07 +01:00
Miriam Baglioni 39f4a20873 chenged the path and the name for saving the communities_infrastructures dump file 2020-11-24 14:47:32 +01:00
Miriam Baglioni 7e14452a87 final versione of the wf to get the dump of results associated to at least one funder per funder 2020-11-24 14:46:34 +01:00
Miriam Baglioni c167a18057 added new parameter for the dumpType 2020-11-24 14:45:50 +01:00
Miriam Baglioni 54a309bb6b refactoring 2020-11-24 14:45:30 +01:00
Miriam Baglioni 35ecea8842 changed to consider the modification for the specification of the type of dump 2020-11-24 14:45:15 +01:00
Miriam Baglioni b9b6bdb2e6 fixing issue on previous implementation 2020-11-24 14:44:53 +01:00
Miriam Baglioni 7e940f1991 changed to consider the modification for the specification of the type of dump 2020-11-24 14:43:34 +01:00
Miriam Baglioni 62928ef7a5 changed to save the communities_infrastructures information as the other entity dumps: in a json.gz file 2020-11-24 14:42:41 +01:00
Claudio Atzori 33bae02451 reverted behaviour of the cleaning workflow: grouping entities by ID will be managed differently 2020-11-24 14:42:33 +01:00
Miriam Baglioni 3319440c53 changed the direction of the relation between projects and result considered to select the results linked to projects 2020-11-24 14:41:09 +01:00
Miriam Baglioni 00c377dac2 added specification of MapFunction types in map 2020-11-24 14:40:22 +01:00
Miriam Baglioni 44db258dc4 added enumerated for the dump type 2020-11-24 14:38:06 +01:00
Miriam Baglioni 1832708c42 modified boolean variable with string one whcih specify the type of dump we are performing: complete, community or funder 2020-11-24 14:37:36 +01:00
Miriam Baglioni 259c67ce36 fixed issue in path name 2020-11-20 12:32:23 +01:00
Miriam Baglioni 0a9db67eec - 2020-11-20 12:21:33 +01:00
Miriam Baglioni d362f2637d merge branch with master 2020-11-19 19:17:20 +01:00
Miriam Baglioni cf3f47563f new parameter files 2020-11-19 19:16:05 +01:00
Miriam Baglioni 24c56fa7a3 new logic and workflow for dump of results with link to projects. In this implementation the result match the model of the communityresult. 2020-11-19 19:15:39 +01:00
Claudio Atzori fcbb05eb21 cleanup 2020-11-19 15:14:33 +01:00
Claudio Atzori 3f34757c63 merged from master 2020-11-19 14:34:54 +01:00
Miriam Baglioni fafb688887 - 2020-11-18 18:56:48 +01:00
Miriam Baglioni 906db690d2 - 2020-11-18 17:43:08 +01:00
Claudio Atzori ede7fae6c8 Merge pull request 'XML record indexing test' (#58) from provision_indexing into master 2020-11-18 17:04:34 +01:00
Miriam Baglioni 5402062ff5 changed parameter file with the ono associated to the job 2020-11-18 16:58:20 +01:00
Miriam Baglioni a172a37ad1 fixed typo 2020-11-18 16:55:07 +01:00
Miriam Baglioni 46ba3793f6 code, workflow and parameters for the dump of the results associated to funders 2020-11-18 16:47:31 +01:00
Miriam Baglioni 57cac36898 changed the workflow name 2020-11-18 13:38:03 +01:00
Claudio Atzori 8177ce7939 test for XmlIndexingJob based on a local miniSolrCluster 2020-11-18 10:58:05 +01:00
Alessia Bardi 10e673660f Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-11-18 10:01:23 +01:00
Alessia Bardi be7b310cef rel semantcis ignore case 2020-11-18 10:01:20 +01:00
Michele Artini 33da2e3d6c xpaths for dateOfCollection and dateOfTransformation 2020-11-18 09:26:20 +01:00
Alessia Bardi 8f87020a50 #56: map relevantDates from aggregated ODF records 2020-11-17 18:42:09 +01:00
Alessia Bardi 7e0a76a8ac test fr TextGrid 2020-11-17 18:39:25 +01:00
Claudio Atzori cfc01f136e PID filtering based on a blacklist 2020-11-17 12:27:06 +01:00
Claudio Atzori 6ab1ce53c9 fixed condition in result pid cleaning; cleanup 2020-11-16 10:09:17 +01:00
Claudio Atzori 4de8c8b237 fixed workflow variable name 2020-11-16 10:03:11 +01:00
Claudio Atzori 331d621800 added test resource 2020-11-14 12:16:15 +01:00
Claudio Atzori 5d4e34e26a fixed typo in variable name 2020-11-14 10:32:26 +01:00
Claudio Atzori 768bc5304c Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-11-13 15:40:34 +01:00
Claudio Atzori 93f7b7974f Merge pull request 'trust truncated to 3 decimals' (#24) from trunc_trust into master
LGTM
2020-11-13 15:40:02 +01:00
Claudio Atzori 528231a287 grouping graph entities by id turned out to be an easy extension for the already existing cleaning workflow 2020-11-13 15:37:48 +01:00
Claudio Atzori 2bed29eb09 WIP: added oozie workflow for grouping graph entities by id 2020-11-13 10:05:12 +01:00
Claudio Atzori 13e36a4da0 WIP: added oozie workflow for grouping graph entities by id 2020-11-13 10:05:02 +01:00
Claudio Atzori 9b0fb9e958 merged from master 2020-11-12 09:27:12 +01:00
Michele Artini 40160d171f organizations pids 2020-11-09 12:58:36 +01:00
Sandro La Bruzzo 027ef2326c Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-11-06 17:12:42 +01:00
Sandro La Bruzzo cd27df91a1 fixed bug on missing relation in ANDS 2020-11-06 17:12:31 +01:00
Claudio Atzori d10447e747 re-packaged graph dump workflow sources 2020-11-05 17:38:18 +01:00
Claudio Atzori 2d76497488 cleanup 2020-11-05 17:10:24 +01:00
Miriam Baglioni f8e9bda24c merge branch with master 2020-11-05 16:31:18 +01:00
Miriam Baglioni be5ed8f554 added check to avoid sending empty metadata. 2020-11-05 16:10:17 +01:00
Claudio Atzori 2148a51fae minor changes 2020-11-05 11:24:12 +01:00
Claudio Atzori 4625b7486e code formatting 2020-11-04 18:12:43 +01:00
Miriam Baglioni e9ac471ae9 removed dependency from classes for the pid graph dump 2020-11-04 18:04:42 +01:00
Miriam Baglioni b90a945c49 removed property files for pid graph dump 2020-11-04 17:28:33 +01:00
Miriam Baglioni bac307155a removed properties specific for pid graph dump 2020-11-04 17:28:04 +01:00
Miriam Baglioni 9c9d50f486 removed code specific for pid graph dump 2020-11-04 17:26:22 +01:00
Miriam Baglioni 5669890934 removed commented lines 2020-11-04 17:15:21 +01:00
Miriam Baglioni 6a89f59be9 removed commented lines 2020-11-04 17:13:59 +01:00
Miriam Baglioni 56150d7e5e removed all code related to the dump of pids graph 2020-11-04 17:13:12 +01:00
Miriam Baglioni 16c54a96f8 removed pid dump 2020-11-04 17:11:32 +01:00
Miriam Baglioni 0cac5436ff Merge branch 'dump' of code-repo.d4science.org:miriam.baglioni/dnet-hadoop into dump 2020-11-04 13:21:11 +01:00
Alessia Bardi 51808b5afd Updated descriptions 2020-11-04 12:29:48 +01:00
Alessia Bardi e6becf8659 Updated descriptions 2020-11-04 12:17:57 +01:00
Alessia Bardi 0abe0eee33 Updated descriptions 2020-11-04 12:15:30 +01:00
Alessia Bardi f6ab238f5d Updated descriptions 2020-11-04 11:50:47 +01:00
Miriam Baglioni c010a8442f fixed issue on test code 2020-11-03 17:26:51 +01:00
Miriam Baglioni 8ec7a61188 merge branch with master 2020-11-03 16:59:08 +01:00
Miriam Baglioni c209284ca7 new schemas for the entities in the dump with added descriptions 2020-11-03 16:58:08 +01:00
Miriam Baglioni 08806deddf added the splitSize non mandatory parameter. Default size 10G 2020-11-03 16:57:34 +01:00
Miriam Baglioni 7d2eda43ca added new non mandatory property publish to determine if to publish the upload or leave it pending. Default value flase 2020-11-03 16:57:01 +01:00
Miriam Baglioni cbbb1bdc54 moved business logic to new class in common for handling the zip of hte archives 2020-11-03 16:55:50 +01:00
Miriam Baglioni d4382b54df moved the tar archive with maz size on common module 2020-11-03 16:54:50 +01:00
Claudio Atzori 86d6fbe95b refactoring: CleaningFunctions and OafMapperUtils moved in dhp-commong 2020-11-03 12:19:46 +01:00
Claudio Atzori 8471888ad3 Merge branch 'graph_cleaning' into stable_ids 2020-11-03 11:52:47 +01:00
Claudio Atzori 5310e56dba remove empy PIDs 2020-11-03 11:52:10 +01:00
Claudio Atzori 3fcd669e99 result merge operation leverage on custom ResultTypeComparator in the aggregator graph construction 2020-11-03 10:53:23 +01:00
Claudio Atzori 09e44dabff Merge branch 'master' into stable_ids 2020-11-02 12:16:01 +01:00
Sandro La Bruzzo 754c86f33e fixed test to work on jenkins 2020-11-02 09:35:01 +01:00
Miriam Baglioni dabb33e018 changed the discriminant for which split the file 2020-10-30 17:52:22 +01:00
Miriam Baglioni 0fba08eae4 max allowed size per file 10 Gb 2020-10-30 16:05:55 +01:00
Claudio Atzori 4ca75d6951 Merge pull request 'Dedup ID creation policy' (#48) from deduptesting into stable_ids 2020-10-30 15:15:32 +01:00
Miriam Baglioni b828587252 prevent the code to cicle indefinetly 2020-10-30 15:01:25 +01:00