Commit Graph

815 Commits

Author SHA1 Message Date
miconis c39c82dfe9 modification of the jobs for the integration of openorgs in the provision, dedup records are no more created by merging but simply taking results of openorgs portal 2021-04-06 14:31:00 +02:00
Claudio Atzori 7941d7be29 WIP: using common definitions from ModelConstants 2021-03-31 18:33:57 +02:00
Claudio Atzori 72ce741ea6 WIP: using common definitions from ModelConstants 2021-03-31 17:07:13 +02:00
Claudio Atzori 9237d55d7f [OpenOrgsWf] cleanup 2021-03-29 17:40:34 +02:00
Claudio Atzori 7f4e9479ec [OpenOrgsWf] graph construction wf: allow to skip the import openorgs node (importOpenorgs true|false) 2021-03-29 16:59:16 +02:00
miconis 2709d08fc2 Merge branch 'stable_ids' into openorgswf 2021-03-29 16:39:07 +02:00
miconis f446580e9f code refactoring (useless classes and wf removed), implementation of the test for the openorgs dedup 2021-03-29 16:10:46 +02:00
miconis 2355cc4e9b minor changes and bug fix 2021-03-29 10:07:12 +02:00
Claudio Atzori 827e7e37db [Cleaning] drop instance.alternateIdentifier elements when they are available among instance.pid 2021-03-25 11:07:59 +01:00
miconis 28c1cdd132 merged stable_ids into openorgswf 2021-03-25 10:44:49 +01:00
miconis 348b0ef921 bug fix, implementation of the workflow for the creation of raw_organizations (openorgs dedup), addition of the pid lists to the openorgs postgres db 2021-03-24 15:51:27 +01:00
Claudio Atzori 751125fdf9 [Actionmanager] zero function considers empty entity.id as well as rel.source/rel.target 2021-03-23 17:34:32 +01:00
Claudio Atzori b4febed138 updated mapping tests as consequence of the special treatment reserved to Handle PIDs 2021-03-23 09:37:48 +01:00
Claudio Atzori 431cbe9955 handle missing instance.pid during bulk cleaning 2021-03-23 09:28:58 +01:00
Sandro La Bruzzo c73072079d fix conflicts 2021-03-22 16:36:31 +01:00
Claudio Atzori 5a043e95ea code formatting 2021-03-19 11:37:27 +01:00
Claudio Atzori a4e82a65aa integrated filter applied when merging BETA & PROD graphs to rule our records from Datacite 2021-03-19 11:34:44 +01:00
Claudio Atzori 8257f9a2bc result.pid: adjusted the mapping applied to the contents from the aggregator 2021-03-17 12:45:38 +01:00
Claudio Atzori 640b885706 added instance.alternativeIdentifiers to the graph model, adjusted the mapping applied to the contents from the aggregator 2021-03-16 14:19:32 +01:00
Claudio Atzori 01630f638d IdentifierFactory implementation based on the list of datasources authoritative for a given pid type 2021-03-09 17:11:50 +01:00
Claudio Atzori 59532b0919 [#6281 Provenance of product PIDs] Added PIDs to the Instance type; extended mapping for OAF/ODF records 2021-03-09 11:14:45 +01:00
Claudio Atzori d525785497 [#6282 open access status in the Graph] Result.Instance.accessRight defined with dedicated data type that includes the open access color. 2021-03-09 11:12:55 +01:00
Claudio Atzori f468c7f0d7 merged from master 2021-03-09 09:12:41 +01:00
Claudio Atzori 8d2bb24512 merged from master 2021-03-08 15:44:34 +01:00
miconis 1a85020572 bug fix in graph-mapper, changes in the implementation of the openorgs wf to create relations and populate openorgs db 2021-02-26 10:19:28 +01:00
Claudio Atzori b830e33392 mdstore collector plugin 2021-02-25 12:30:30 +01:00
Claudio Atzori fc3fa5e343 implemented mdstore collector plugin 2021-02-24 15:07:24 +01:00
miconis 4b2124a18e implementation of the openorgs wfs, implementation of the raw_all wf to migrate openorgs db entities 2021-02-10 11:51:50 +01:00
Alessia Bardi c4d1feca74 mapper test with validated link to project 2021-02-10 11:22:54 +01:00
Alessia Bardi c67329d3ad updated test for EU Open Data portal datasets 2021-02-03 17:06:48 +01:00
Alessia Bardi fd705404a1 tests for EU Open Data portal dataset mapping 2021-02-03 10:28:17 +01:00
Sandro La Bruzzo 686e7b507c Merge branch 'hadoop_aggregator' of code-repo.d4science.org:D-Net/dnet-hadoop into aggregation_on_hadoop 2021-01-28 10:02:13 +01:00
Sandro La Bruzzo 98b9498b57 Removed old messaging system not quite used from collection and Transformation workflow
code refactor
2021-01-28 09:51:17 +01:00
Sandro La Bruzzo 150a617bd1 Merge pull request 'aggregation_on_hadoop' (#90) from sandro.labruzzo/dnet-hadoop:aggregation_on_hadoop into hadoop_aggregator
Wonderfull code... You're the Best Sandro
2021-01-26 16:00:47 +01:00
Claudio Atzori 885e0dd926 [Cleaning] filter authors not providing word characters in the fullname 2021-01-26 09:48:53 +01:00
Claudio Atzori 2890511613 [Cleaning] normalise missing Result.country 2021-01-26 09:41:44 +01:00
Claudio Atzori 4eb9ed35b1 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2021-01-25 18:12:24 +01:00
Claudio Atzori cd379eb5e3 [Cleaning] trying to avoid NPEs, this time by ruling out authors without a defined fullname 2021-01-25 18:11:49 +01:00
Alessia Bardi 505477f36f format code 2021-01-25 18:02:49 +01:00
Alessia Bardi ded6ed8d7d no ',' author, if there are no author in ODF records 2021-01-25 17:57:51 +01:00
Claudio Atzori 3465c8ccee [Cleaning] trying to avoid NPEs 2021-01-25 16:54:53 +01:00
Sandro La Bruzzo a54848a59c Moved Vocabulary stuff to common module 2021-01-25 15:43:04 +01:00
Claudio Atzori 07a0ccfc96 [Cleaning] trying to avoid NPEs 2021-01-25 13:36:01 +01:00
Claudio Atzori 34d653de41 [Cleaning] updated cleaning rule for DOIs 2021-01-22 14:16:33 +01:00
Claudio Atzori 26e9d55c13 code formatting 2021-01-05 09:59:26 +01:00
Claudio Atzori 7185158942 ignore missing properties 2020-12-29 11:06:28 +01:00
Claudio Atzori 28460c2cd1 using com.fasterxml.jackson.databind.ObjectMapper instead of org.codehaus.jackson.map.ObjectMapper 2020-12-23 16:59:52 +01:00
Claudio Atzori 723b01f9e9 trivial: the less magic numbers and values around, the better 2020-12-23 12:22:48 +01:00
Claudio Atzori 6cb0dc3f43 extended OCRID cleaning procedure 2020-12-21 11:40:17 +01:00
Claudio Atzori 47270d9af5 lenient mock can be lenient 2020-12-18 15:38:59 +01:00
Alessia Bardi f9a8fd8bbd updated test record for textgrid 2020-12-17 11:59:45 +01:00
Michele Artini 991e675dc6 validation in claim rels 2020-12-14 15:41:25 +01:00
Claudio Atzori 12e2f930c8 resolved conflicts 2020-12-10 10:57:39 +01:00
Alessia Bardi 112da6d76a in theory, just auto-formatting after mvn compile 2020-12-09 20:00:27 +01:00
Alessia Bardi bece04b330 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-12-09 19:54:43 +01:00
Alessia Bardi 426b76ee8e more asserts for TextGrid record 2020-12-09 19:46:11 +01:00
Claudio Atzori 4705144918 Merge pull request 'rel_project_validation' (#69) from rel_project_validation into master
LGTM
2020-12-09 19:01:20 +01:00
Claudio Atzori ada21ad920 Merge pull request 'dump of the results related to at least one project' (#61) from miriam.baglioni/dnet-hadoop:dump into master
LGTM
2020-12-09 17:22:56 +01:00
Michele Artini 1bc9adc10d default trust for validated rels 2020-12-09 16:18:37 +01:00
Michele Artini 5f21a356fd reindent 2020-12-09 11:24:30 +01:00
Michele Artini 370a5e650b validation attributes in resultProject relations 2020-12-09 11:18:26 +01:00
Claudio Atzori a104a632df cleanup 2020-12-04 16:32:47 +01:00
Miriam Baglioni 5fb65ffc4a merge branch with master 2020-12-03 11:24:35 +01:00
Miriam Baglioni ea88dc3401 fixed issue in property name 2020-12-03 11:24:23 +01:00
Claudio Atzori cfb55effd9 code formatting 2020-12-02 11:23:49 +01:00
Claudio Atzori 57f448b7a4 graph cleaning workflow separate orcid_pending from orcid, depending on the author pid provenance 2020-12-02 10:44:05 +01:00
Alessia Bardi a417624670 tests for raw graph mapping 2020-12-02 10:15:26 +01:00
Claudio Atzori 893ac4a77b GenerateEntitiesApplication can be configured to hash the id value or not 2020-12-02 09:30:06 +01:00
Claudio Atzori 2c407e775e GenerateEntitiesApplication can be configured to hash the id value or not 2020-11-30 12:00:38 +01:00
Claudio Atzori e731a7658d cleaning texts to remove tab characters too 2020-11-27 09:00:04 +01:00
Claudio Atzori c1b9a4045a grouping of records will be performed by the dedup workflow 2020-11-26 10:59:10 +01:00
Miriam Baglioni 124591a7f3 refactoring 2020-11-25 18:23:28 +01:00
Miriam Baglioni 1a89f8211c #61 (comment) 2020-11-25 18:12:40 +01:00
Miriam Baglioni 5fbe54ef54 #61 (comment) 2020-11-25 18:10:28 +01:00
Miriam Baglioni ed01e5a5e1 #61 (comment) 2020-11-25 18:09:34 +01:00
Miriam Baglioni d4ddde2ef2 changed because of #61 (comment) 2020-11-25 18:01:01 +01:00
Miriam Baglioni f5e5e92a10 changed because of #61 (comment) 2020-11-25 17:58:53 +01:00
Miriam Baglioni 1df94b85b4 changed because of #61 (comment) 2020-11-25 17:57:43 +01:00
Claudio Atzori dfd6205b95 Consistency graph workflow merges all the entities by ID 2020-11-25 14:55:32 +01:00
Miriam Baglioni 90d4369fd2 added test to verify the compression in writing community info on hdfs 2020-11-25 14:34:58 +01:00
Miriam Baglioni 6750e33d69 merge branch with master 2020-11-25 14:09:01 +01:00
Miriam Baglioni b2c455f883 added java doc 2020-11-25 14:08:09 +01:00
Miriam Baglioni 1f130cdf92 changed the relation (produces -> isProducedBy) due to the change in the code 2020-11-25 14:04:26 +01:00
Miriam Baglioni e758d5d9b4 refactoring 2020-11-25 13:46:39 +01:00
Miriam Baglioni 87a9f616ae refactoring and addition of the funder nsp first part as nome for the dump insteasd of the whole nsp 2020-11-25 13:45:41 +01:00
Miriam Baglioni e7e418e444 added decision node to verify if to upload in Zenodo 2020-11-25 13:44:10 +01:00
Miriam Baglioni 305e3d0c9c added resource file for relation with relClass = isProducedBy 2020-11-25 13:43:41 +01:00
Miriam Baglioni 21ce175d17 added FilterFunction specification if filter operation 2020-11-25 13:42:31 +01:00
Miriam Baglioni bde6d337dd test classes for dump of results related to funders 2020-11-25 13:42:01 +01:00
Miriam Baglioni b37b9352d7 added constant value for semantic relationship between projects and results 2020-11-25 13:41:08 +01:00
Claudio Atzori 36173c13a5 reverted filters in the clening process 2020-11-25 10:24:42 +01:00
Claudio Atzori eeebd5a920 Cleanig workflow: remove newlines from titles, descriptions, subjects 2020-11-24 18:40:25 +01:00
Claudio Atzori e1a1bb3ee4 moved class CleaningFunctions in the correct package. Remove newlines from titles, descriptions, subjects 2020-11-24 18:34:03 +01:00
Miriam Baglioni 72bb0fe360 changed directory name 2020-11-24 16:47:07 +01:00
Miriam Baglioni 39f4a20873 chenged the path and the name for saving the communities_infrastructures dump file 2020-11-24 14:47:32 +01:00
Miriam Baglioni 7e14452a87 final versione of the wf to get the dump of results associated to at least one funder per funder 2020-11-24 14:46:34 +01:00
Miriam Baglioni c167a18057 added new parameter for the dumpType 2020-11-24 14:45:50 +01:00
Miriam Baglioni 54a309bb6b refactoring 2020-11-24 14:45:30 +01:00
Miriam Baglioni 35ecea8842 changed to consider the modification for the specification of the type of dump 2020-11-24 14:45:15 +01:00
Miriam Baglioni b9b6bdb2e6 fixing issue on previous implementation 2020-11-24 14:44:53 +01:00
Miriam Baglioni 7e940f1991 changed to consider the modification for the specification of the type of dump 2020-11-24 14:43:34 +01:00
Miriam Baglioni 62928ef7a5 changed to save the communities_infrastructures information as the other entity dumps: in a json.gz file 2020-11-24 14:42:41 +01:00
Claudio Atzori 33bae02451 reverted behaviour of the cleaning workflow: grouping entities by ID will be managed differently 2020-11-24 14:42:33 +01:00
Miriam Baglioni 3319440c53 changed the direction of the relation between projects and result considered to select the results linked to projects 2020-11-24 14:41:09 +01:00
Miriam Baglioni 00c377dac2 added specification of MapFunction types in map 2020-11-24 14:40:22 +01:00
Miriam Baglioni 44db258dc4 added enumerated for the dump type 2020-11-24 14:38:06 +01:00
Miriam Baglioni 1832708c42 modified boolean variable with string one whcih specify the type of dump we are performing: complete, community or funder 2020-11-24 14:37:36 +01:00
Miriam Baglioni 259c67ce36 fixed issue in path name 2020-11-20 12:32:23 +01:00
Miriam Baglioni 0a9db67eec - 2020-11-20 12:21:33 +01:00
Miriam Baglioni d362f2637d merge branch with master 2020-11-19 19:17:20 +01:00
Miriam Baglioni cf3f47563f new parameter files 2020-11-19 19:16:05 +01:00
Miriam Baglioni 24c56fa7a3 new logic and workflow for dump of results with link to projects. In this implementation the result match the model of the communityresult. 2020-11-19 19:15:39 +01:00
Claudio Atzori fcbb05eb21 cleanup 2020-11-19 15:14:33 +01:00
Claudio Atzori 3f34757c63 merged from master 2020-11-19 14:34:54 +01:00
Miriam Baglioni fafb688887 - 2020-11-18 18:56:48 +01:00
Miriam Baglioni 906db690d2 - 2020-11-18 17:43:08 +01:00
Claudio Atzori ede7fae6c8 Merge pull request 'XML record indexing test' (#58) from provision_indexing into master 2020-11-18 17:04:34 +01:00
Miriam Baglioni 5402062ff5 changed parameter file with the ono associated to the job 2020-11-18 16:58:20 +01:00
Miriam Baglioni a172a37ad1 fixed typo 2020-11-18 16:55:07 +01:00
Miriam Baglioni 46ba3793f6 code, workflow and parameters for the dump of the results associated to funders 2020-11-18 16:47:31 +01:00
Miriam Baglioni 57cac36898 changed the workflow name 2020-11-18 13:38:03 +01:00
Claudio Atzori 8177ce7939 test for XmlIndexingJob based on a local miniSolrCluster 2020-11-18 10:58:05 +01:00
Alessia Bardi 10e673660f Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-11-18 10:01:23 +01:00
Alessia Bardi be7b310cef rel semantcis ignore case 2020-11-18 10:01:20 +01:00
Michele Artini 33da2e3d6c xpaths for dateOfCollection and dateOfTransformation 2020-11-18 09:26:20 +01:00
Alessia Bardi 8f87020a50 #56: map relevantDates from aggregated ODF records 2020-11-17 18:42:09 +01:00
Alessia Bardi 7e0a76a8ac test fr TextGrid 2020-11-17 18:39:25 +01:00
Claudio Atzori cfc01f136e PID filtering based on a blacklist 2020-11-17 12:27:06 +01:00
Claudio Atzori 6ab1ce53c9 fixed condition in result pid cleaning; cleanup 2020-11-16 10:09:17 +01:00
Claudio Atzori 4de8c8b237 fixed workflow variable name 2020-11-16 10:03:11 +01:00
Claudio Atzori 331d621800 added test resource 2020-11-14 12:16:15 +01:00
Claudio Atzori 5d4e34e26a fixed typo in variable name 2020-11-14 10:32:26 +01:00
Claudio Atzori 768bc5304c Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-11-13 15:40:34 +01:00
Claudio Atzori 93f7b7974f Merge pull request 'trust truncated to 3 decimals' (#24) from trunc_trust into master
LGTM
2020-11-13 15:40:02 +01:00
Claudio Atzori 528231a287 grouping graph entities by id turned out to be an easy extension for the already existing cleaning workflow 2020-11-13 15:37:48 +01:00
Claudio Atzori 2bed29eb09 WIP: added oozie workflow for grouping graph entities by id 2020-11-13 10:05:12 +01:00
Claudio Atzori 13e36a4da0 WIP: added oozie workflow for grouping graph entities by id 2020-11-13 10:05:02 +01:00
Claudio Atzori 9b0fb9e958 merged from master 2020-11-12 09:27:12 +01:00
Michele Artini 40160d171f organizations pids 2020-11-09 12:58:36 +01:00
Sandro La Bruzzo 027ef2326c Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-11-06 17:12:42 +01:00
Sandro La Bruzzo cd27df91a1 fixed bug on missing relation in ANDS 2020-11-06 17:12:31 +01:00
Claudio Atzori d10447e747 re-packaged graph dump workflow sources 2020-11-05 17:38:18 +01:00
Claudio Atzori 2d76497488 cleanup 2020-11-05 17:10:24 +01:00
Miriam Baglioni f8e9bda24c merge branch with master 2020-11-05 16:31:18 +01:00
Miriam Baglioni be5ed8f554 added check to avoid sending empty metadata. 2020-11-05 16:10:17 +01:00
Claudio Atzori 2148a51fae minor changes 2020-11-05 11:24:12 +01:00
Claudio Atzori 4625b7486e code formatting 2020-11-04 18:12:43 +01:00
Miriam Baglioni e9ac471ae9 removed dependency from classes for the pid graph dump 2020-11-04 18:04:42 +01:00
Miriam Baglioni b90a945c49 removed property files for pid graph dump 2020-11-04 17:28:33 +01:00
Miriam Baglioni bac307155a removed properties specific for pid graph dump 2020-11-04 17:28:04 +01:00