Commit Graph

889 Commits

Author SHA1 Message Date
Claudio Atzori 9237d55d7f [OpenOrgsWf] cleanup 2021-03-29 17:40:34 +02:00
Claudio Atzori 7f4e9479ec [OpenOrgsWf] graph construction wf: allow to skip the import openorgs node (importOpenorgs true|false) 2021-03-29 16:59:16 +02:00
miconis 2709d08fc2 Merge branch 'stable_ids' into openorgswf 2021-03-29 16:39:07 +02:00
miconis f446580e9f code refactoring (useless classes and wf removed), implementation of the test for the openorgs dedup 2021-03-29 16:10:46 +02:00
miconis 2355cc4e9b minor changes and bug fix 2021-03-29 10:07:12 +02:00
miconis 28c1cdd132 merged stable_ids into openorgswf 2021-03-25 10:44:49 +01:00
miconis 348b0ef921 bug fix, implementation of the workflow for the creation of raw_organizations (openorgs dedup), addition of the pid lists to the openorgs postgres db 2021-03-24 15:51:27 +01:00
Sandro La Bruzzo c73072079d fix conflicts 2021-03-22 16:36:31 +01:00
Claudio Atzori 5a043e95ea code formatting 2021-03-19 11:37:27 +01:00
Claudio Atzori a4e82a65aa integrated filter applied when merging BETA & PROD graphs to rule our records from Datacite 2021-03-19 11:34:44 +01:00
Claudio Atzori 8257f9a2bc result.pid: adjusted the mapping applied to the contents from the aggregator 2021-03-17 12:45:38 +01:00
Claudio Atzori 640b885706 added instance.alternativeIdentifiers to the graph model, adjusted the mapping applied to the contents from the aggregator 2021-03-16 14:19:32 +01:00
Claudio Atzori 01630f638d IdentifierFactory implementation based on the list of datasources authoritative for a given pid type 2021-03-09 17:11:50 +01:00
Claudio Atzori 59532b0919 [#6281 Provenance of product PIDs] Added PIDs to the Instance type; extended mapping for OAF/ODF records 2021-03-09 11:14:45 +01:00
Claudio Atzori d525785497 [#6282 open access status in the Graph] Result.Instance.accessRight defined with dedicated data type that includes the open access color. 2021-03-09 11:12:55 +01:00
Claudio Atzori f468c7f0d7 merged from master 2021-03-09 09:12:41 +01:00
Claudio Atzori 8d2bb24512 merged from master 2021-03-08 15:44:34 +01:00
miconis 1a85020572 bug fix in graph-mapper, changes in the implementation of the openorgs wf to create relations and populate openorgs db 2021-02-26 10:19:28 +01:00
Claudio Atzori b830e33392 mdstore collector plugin 2021-02-25 12:30:30 +01:00
Claudio Atzori fc3fa5e343 implemented mdstore collector plugin 2021-02-24 15:07:24 +01:00
miconis 4b2124a18e implementation of the openorgs wfs, implementation of the raw_all wf to migrate openorgs db entities 2021-02-10 11:51:50 +01:00
Sandro La Bruzzo 686e7b507c Merge branch 'hadoop_aggregator' of code-repo.d4science.org:D-Net/dnet-hadoop into aggregation_on_hadoop 2021-01-28 10:02:13 +01:00
Sandro La Bruzzo 98b9498b57 Removed old messaging system not quite used from collection and Transformation workflow
code refactor
2021-01-28 09:51:17 +01:00
Sandro La Bruzzo 150a617bd1 Merge pull request 'aggregation_on_hadoop' (#90) from sandro.labruzzo/dnet-hadoop:aggregation_on_hadoop into hadoop_aggregator
Wonderfull code... You're the Best Sandro
2021-01-26 16:00:47 +01:00
Claudio Atzori 885e0dd926 [Cleaning] filter authors not providing word characters in the fullname 2021-01-26 09:48:53 +01:00
Claudio Atzori 2890511613 [Cleaning] normalise missing Result.country 2021-01-26 09:41:44 +01:00
Claudio Atzori 4eb9ed35b1 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2021-01-25 18:12:24 +01:00
Claudio Atzori cd379eb5e3 [Cleaning] trying to avoid NPEs, this time by ruling out authors without a defined fullname 2021-01-25 18:11:49 +01:00
Alessia Bardi 505477f36f format code 2021-01-25 18:02:49 +01:00
Alessia Bardi ded6ed8d7d no ',' author, if there are no author in ODF records 2021-01-25 17:57:51 +01:00
Claudio Atzori 3465c8ccee [Cleaning] trying to avoid NPEs 2021-01-25 16:54:53 +01:00
Sandro La Bruzzo a54848a59c Moved Vocabulary stuff to common module 2021-01-25 15:43:04 +01:00
Claudio Atzori 07a0ccfc96 [Cleaning] trying to avoid NPEs 2021-01-25 13:36:01 +01:00
Claudio Atzori 34d653de41 [Cleaning] updated cleaning rule for DOIs 2021-01-22 14:16:33 +01:00
Claudio Atzori 26e9d55c13 code formatting 2021-01-05 09:59:26 +01:00
Claudio Atzori 7185158942 ignore missing properties 2020-12-29 11:06:28 +01:00
Claudio Atzori 28460c2cd1 using com.fasterxml.jackson.databind.ObjectMapper instead of org.codehaus.jackson.map.ObjectMapper 2020-12-23 16:59:52 +01:00
Claudio Atzori 723b01f9e9 trivial: the less magic numbers and values around, the better 2020-12-23 12:22:48 +01:00
Claudio Atzori 6cb0dc3f43 extended OCRID cleaning procedure 2020-12-21 11:40:17 +01:00
Michele Artini 991e675dc6 validation in claim rels 2020-12-14 15:41:25 +01:00
Claudio Atzori 12e2f930c8 resolved conflicts 2020-12-10 10:57:39 +01:00
Claudio Atzori 4705144918 Merge pull request 'rel_project_validation' (#69) from rel_project_validation into master
LGTM
2020-12-09 19:01:20 +01:00
Claudio Atzori ada21ad920 Merge pull request 'dump of the results related to at least one project' (#61) from miriam.baglioni/dnet-hadoop:dump into master
LGTM
2020-12-09 17:22:56 +01:00
Michele Artini 1bc9adc10d default trust for validated rels 2020-12-09 16:18:37 +01:00
Michele Artini 5f21a356fd reindent 2020-12-09 11:24:30 +01:00
Michele Artini 370a5e650b validation attributes in resultProject relations 2020-12-09 11:18:26 +01:00
Claudio Atzori a104a632df cleanup 2020-12-04 16:32:47 +01:00
Miriam Baglioni 5fb65ffc4a merge branch with master 2020-12-03 11:24:35 +01:00
Miriam Baglioni ea88dc3401 fixed issue in property name 2020-12-03 11:24:23 +01:00
Claudio Atzori cfb55effd9 code formatting 2020-12-02 11:23:49 +01:00
Claudio Atzori 57f448b7a4 graph cleaning workflow separate orcid_pending from orcid, depending on the author pid provenance 2020-12-02 10:44:05 +01:00
Claudio Atzori 893ac4a77b GenerateEntitiesApplication can be configured to hash the id value or not 2020-12-02 09:30:06 +01:00
Claudio Atzori 2c407e775e GenerateEntitiesApplication can be configured to hash the id value or not 2020-11-30 12:00:38 +01:00
Claudio Atzori e731a7658d cleaning texts to remove tab characters too 2020-11-27 09:00:04 +01:00
Claudio Atzori c1b9a4045a grouping of records will be performed by the dedup workflow 2020-11-26 10:59:10 +01:00
Miriam Baglioni 124591a7f3 refactoring 2020-11-25 18:23:28 +01:00
Miriam Baglioni 5fbe54ef54 D-Net/dnet-hadoop#61 (comment) 2020-11-25 18:10:28 +01:00
Miriam Baglioni ed01e5a5e1 D-Net/dnet-hadoop#61 (comment) 2020-11-25 18:09:34 +01:00
Miriam Baglioni f5e5e92a10 changed because of D-Net/dnet-hadoop#61 (comment) 2020-11-25 17:58:53 +01:00
Miriam Baglioni 1df94b85b4 changed because of D-Net/dnet-hadoop#61 (comment) 2020-11-25 17:57:43 +01:00
Claudio Atzori dfd6205b95 Consistency graph workflow merges all the entities by ID 2020-11-25 14:55:32 +01:00
Miriam Baglioni 6750e33d69 merge branch with master 2020-11-25 14:09:01 +01:00
Miriam Baglioni b2c455f883 added java doc 2020-11-25 14:08:09 +01:00
Miriam Baglioni e758d5d9b4 refactoring 2020-11-25 13:46:39 +01:00
Miriam Baglioni 87a9f616ae refactoring and addition of the funder nsp first part as nome for the dump insteasd of the whole nsp 2020-11-25 13:45:41 +01:00
Miriam Baglioni e7e418e444 added decision node to verify if to upload in Zenodo 2020-11-25 13:44:10 +01:00
Miriam Baglioni 21ce175d17 added FilterFunction specification if filter operation 2020-11-25 13:42:31 +01:00
Miriam Baglioni b37b9352d7 added constant value for semantic relationship between projects and results 2020-11-25 13:41:08 +01:00
Claudio Atzori 36173c13a5 reverted filters in the clening process 2020-11-25 10:24:42 +01:00
Claudio Atzori eeebd5a920 Cleanig workflow: remove newlines from titles, descriptions, subjects 2020-11-24 18:40:25 +01:00
Claudio Atzori e1a1bb3ee4 moved class CleaningFunctions in the correct package. Remove newlines from titles, descriptions, subjects 2020-11-24 18:34:03 +01:00
Miriam Baglioni 72bb0fe360 changed directory name 2020-11-24 16:47:07 +01:00
Miriam Baglioni 39f4a20873 chenged the path and the name for saving the communities_infrastructures dump file 2020-11-24 14:47:32 +01:00
Miriam Baglioni 7e14452a87 final versione of the wf to get the dump of results associated to at least one funder per funder 2020-11-24 14:46:34 +01:00
Miriam Baglioni c167a18057 added new parameter for the dumpType 2020-11-24 14:45:50 +01:00
Miriam Baglioni b9b6bdb2e6 fixing issue on previous implementation 2020-11-24 14:44:53 +01:00
Miriam Baglioni 7e940f1991 changed to consider the modification for the specification of the type of dump 2020-11-24 14:43:34 +01:00
Miriam Baglioni 62928ef7a5 changed to save the communities_infrastructures information as the other entity dumps: in a json.gz file 2020-11-24 14:42:41 +01:00
Claudio Atzori 33bae02451 reverted behaviour of the cleaning workflow: grouping entities by ID will be managed differently 2020-11-24 14:42:33 +01:00
Miriam Baglioni 3319440c53 changed the direction of the relation between projects and result considered to select the results linked to projects 2020-11-24 14:41:09 +01:00
Miriam Baglioni 00c377dac2 added specification of MapFunction types in map 2020-11-24 14:40:22 +01:00
Miriam Baglioni 44db258dc4 added enumerated for the dump type 2020-11-24 14:38:06 +01:00
Miriam Baglioni 1832708c42 modified boolean variable with string one whcih specify the type of dump we are performing: complete, community or funder 2020-11-24 14:37:36 +01:00
Miriam Baglioni 259c67ce36 fixed issue in path name 2020-11-20 12:32:23 +01:00
Miriam Baglioni 0a9db67eec - 2020-11-20 12:21:33 +01:00
Miriam Baglioni cf3f47563f new parameter files 2020-11-19 19:16:05 +01:00
Miriam Baglioni 24c56fa7a3 new logic and workflow for dump of results with link to projects. In this implementation the result match the model of the communityresult. 2020-11-19 19:15:39 +01:00
Claudio Atzori fcbb05eb21 cleanup 2020-11-19 15:14:33 +01:00
Claudio Atzori 3f34757c63 merged from master 2020-11-19 14:34:54 +01:00
Miriam Baglioni fafb688887 - 2020-11-18 18:56:48 +01:00
Miriam Baglioni 906db690d2 - 2020-11-18 17:43:08 +01:00
Miriam Baglioni 5402062ff5 changed parameter file with the ono associated to the job 2020-11-18 16:58:20 +01:00
Miriam Baglioni a172a37ad1 fixed typo 2020-11-18 16:55:07 +01:00
Miriam Baglioni 46ba3793f6 code, workflow and parameters for the dump of the results associated to funders 2020-11-18 16:47:31 +01:00
Miriam Baglioni 57cac36898 changed the workflow name 2020-11-18 13:38:03 +01:00
Alessia Bardi 10e673660f Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-11-18 10:01:23 +01:00
Alessia Bardi be7b310cef rel semantcis ignore case 2020-11-18 10:01:20 +01:00
Michele Artini 33da2e3d6c xpaths for dateOfCollection and dateOfTransformation 2020-11-18 09:26:20 +01:00
Alessia Bardi 8f87020a50 #56: map relevantDates from aggregated ODF records 2020-11-17 18:42:09 +01:00
Claudio Atzori cfc01f136e PID filtering based on a blacklist 2020-11-17 12:27:06 +01:00
Claudio Atzori 6ab1ce53c9 fixed condition in result pid cleaning; cleanup 2020-11-16 10:09:17 +01:00
Claudio Atzori 4de8c8b237 fixed workflow variable name 2020-11-16 10:03:11 +01:00
Claudio Atzori 5d4e34e26a fixed typo in variable name 2020-11-14 10:32:26 +01:00
Claudio Atzori 768bc5304c Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-11-13 15:40:34 +01:00
Claudio Atzori 93f7b7974f Merge pull request 'trust truncated to 3 decimals' (#24) from trunc_trust into master
LGTM
2020-11-13 15:40:02 +01:00
Claudio Atzori 528231a287 grouping graph entities by id turned out to be an easy extension for the already existing cleaning workflow 2020-11-13 15:37:48 +01:00
Claudio Atzori 2bed29eb09 WIP: added oozie workflow for grouping graph entities by id 2020-11-13 10:05:12 +01:00
Claudio Atzori 13e36a4da0 WIP: added oozie workflow for grouping graph entities by id 2020-11-13 10:05:02 +01:00
Claudio Atzori 9b0fb9e958 merged from master 2020-11-12 09:27:12 +01:00
Michele Artini 40160d171f organizations pids 2020-11-09 12:58:36 +01:00
Sandro La Bruzzo 027ef2326c Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-11-06 17:12:42 +01:00
Sandro La Bruzzo cd27df91a1 fixed bug on missing relation in ANDS 2020-11-06 17:12:31 +01:00
Claudio Atzori d10447e747 re-packaged graph dump workflow sources 2020-11-05 17:38:18 +01:00
Claudio Atzori 2d76497488 cleanup 2020-11-05 17:10:24 +01:00
Miriam Baglioni f8e9bda24c merge branch with master 2020-11-05 16:31:18 +01:00
Miriam Baglioni be5ed8f554 added check to avoid sending empty metadata. 2020-11-05 16:10:17 +01:00
Claudio Atzori 2148a51fae minor changes 2020-11-05 11:24:12 +01:00
Claudio Atzori 4625b7486e code formatting 2020-11-04 18:12:43 +01:00
Miriam Baglioni e9ac471ae9 removed dependency from classes for the pid graph dump 2020-11-04 18:04:42 +01:00
Miriam Baglioni b90a945c49 removed property files for pid graph dump 2020-11-04 17:28:33 +01:00
Miriam Baglioni bac307155a removed properties specific for pid graph dump 2020-11-04 17:28:04 +01:00
Miriam Baglioni 9c9d50f486 removed code specific for pid graph dump 2020-11-04 17:26:22 +01:00
Miriam Baglioni 5669890934 removed commented lines 2020-11-04 17:15:21 +01:00
Miriam Baglioni 6a89f59be9 removed commented lines 2020-11-04 17:13:59 +01:00
Miriam Baglioni 56150d7e5e removed all code related to the dump of pids graph 2020-11-04 17:13:12 +01:00
Miriam Baglioni 16c54a96f8 removed pid dump 2020-11-04 17:11:32 +01:00
Alessia Bardi 51808b5afd Updated descriptions 2020-11-04 12:29:48 +01:00
Alessia Bardi e6becf8659 Updated descriptions 2020-11-04 12:17:57 +01:00
Alessia Bardi 0abe0eee33 Updated descriptions 2020-11-04 12:15:30 +01:00
Alessia Bardi f6ab238f5d Updated descriptions 2020-11-04 11:50:47 +01:00
Miriam Baglioni 8ec7a61188 merge branch with master 2020-11-03 16:59:08 +01:00
Miriam Baglioni c209284ca7 new schemas for the entities in the dump with added descriptions 2020-11-03 16:58:08 +01:00
Miriam Baglioni 08806deddf added the splitSize non mandatory parameter. Default size 10G 2020-11-03 16:57:34 +01:00
Miriam Baglioni 7d2eda43ca added new non mandatory property publish to determine if to publish the upload or leave it pending. Default value flase 2020-11-03 16:57:01 +01:00
Miriam Baglioni cbbb1bdc54 moved business logic to new class in common for handling the zip of hte archives 2020-11-03 16:55:50 +01:00
Miriam Baglioni d4382b54df moved the tar archive with maz size on common module 2020-11-03 16:54:50 +01:00
Claudio Atzori 86d6fbe95b refactoring: CleaningFunctions and OafMapperUtils moved in dhp-commong 2020-11-03 12:19:46 +01:00
Claudio Atzori 8471888ad3 Merge branch 'graph_cleaning' into stable_ids 2020-11-03 11:52:47 +01:00
Claudio Atzori 5310e56dba remove empy PIDs 2020-11-03 11:52:10 +01:00
Claudio Atzori 3fcd669e99 result merge operation leverage on custom ResultTypeComparator in the aggregator graph construction 2020-11-03 10:53:23 +01:00
Claudio Atzori 09e44dabff Merge branch 'master' into stable_ids 2020-11-02 12:16:01 +01:00
Miriam Baglioni dabb33e018 changed the discriminant for which split the file 2020-10-30 17:52:22 +01:00
Miriam Baglioni 0fba08eae4 max allowed size per file 10 Gb 2020-10-30 16:05:55 +01:00
Miriam Baglioni b828587252 prevent the code to cicle indefinetly 2020-10-30 15:01:25 +01:00
Miriam Baglioni f747e303ac classes for dumping of the graph as ttl file 2020-10-30 14:13:45 +01:00
Miriam Baglioni 16baf5b69e formatting 2020-10-30 14:13:14 +01:00
Miriam Baglioni a9eef9c852 added check for possible Optional value in relation dataInfo 2020-10-30 14:12:28 +01:00
Miriam Baglioni 5f4de9a962 formatting 2020-10-30 14:11:40 +01:00
Miriam Baglioni 14bf2e7238 added option to split dumps bigger that 40Gb on different files 2020-10-30 14:09:04 +01:00
Claudio Atzori 58f28296ea ProvisionConstants moved as ModelHardLimits in dhp-common and applied to truncate long abstracts (len > 150000). Further filtering for empty PID values 2020-10-30 10:56:42 +01:00
Miriam Baglioni 78fdb11c3f merge branch with master 2020-10-29 12:55:22 +01:00
Sandro La Bruzzo 1d9fdb7367 fixed spark memory issue in SparkSplitOafTODLIEntities 2020-10-28 12:30:32 +01:00
Miriam Baglioni d2374e3b9e added code to handle cases where the funding tree is not existing 2020-10-27 16:15:21 +01:00
Miriam Baglioni 5d3012eeb4 changed code to dump only the programme list and not the classification list 2020-10-27 16:14:18 +01:00
Miriam Baglioni 3241ec1777 added connection timeout and socket timeout 600 sec 2020-10-27 16:12:11 +01:00
Claudio Atzori 266bf1a221 common IdentifierFactory in use on the mapping from the aggregator data; merge the entities sharing the same id; code formatting 2020-10-16 17:02:10 +02:00
Claudio Atzori 34f1d0904b common IdentifierFactory in use on the mapping from the aggregator data 2020-10-16 16:00:19 +02:00
Sandro La Bruzzo fed711da80 Merge remote-tracking branch 'origin/master' into merge_record_to_common 2020-10-13 15:32:45 +02:00
Alessia Bardi 8775a64bc1 Merge pull request 'Merging different compatibility levels (pinocchio operator)' (#47) from merge_graph into master 2020-10-09 14:44:52 +02:00
Claudio Atzori e751c1402f Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-10-09 13:53:21 +02:00
Claudio Atzori b961dc7d1e added originalid to the fields in the result graph view 2020-10-09 13:53:15 +02:00
Sandro La Bruzzo eec418cd26 moved AuthoreMerger into dhp-common 2020-10-08 10:33:55 +02:00
Sandro La Bruzzo cd9c377d18 adpted scholexplorer Dump generation to the new Dataset definition 2020-10-08 10:10:13 +02:00
Claudio Atzori a3f37a9414 javadoc 2020-10-07 16:44:22 +02:00
Claudio Atzori 8d85a2fced [BETA wf only] datasources involved in the merge operation doesn't obey to the infra precedence policy, but relies on a custom behaviour that, given two datasources from beta and prod returns the one from prod with the highest compatibility among the two 2020-10-07 16:28:52 +02:00
Miriam Baglioni ae08b3c0dd merge branch with master 2020-10-05 11:35:55 +02:00
Miriam Baglioni 11b7eaae09 changed the name of the folder where to store the context entity from context to communities_infrastructures 2020-10-05 11:24:54 +02:00
Miriam Baglioni 32bffb0134 changed the name from communities_infrastructures to communities_infrastuctures.json 2020-10-05 11:24:17 +02:00
Miriam Baglioni 25cbcf6114 changed to solve issues about names. context renamed communities_infrastructure.json and removed the double json.gz extention to the name of the part in the tar 2020-10-02 12:17:46 +02:00
Claudio Atzori 49ae3450a9 code formatting 2020-10-02 09:43:24 +02:00
Claudio Atzori c2a6e2a9bf fixed mapping for datasource journal info (ISSNs) 2020-10-02 09:37:08 +02:00
Miriam Baglioni 01117a46e1 whole workflow activated 2020-10-01 17:19:21 +02:00
Miriam Baglioni cfb5766c6b removed double json.gz from names of files in the tar 2020-10-01 17:18:34 +02:00
Miriam Baglioni fcaedac980 merge branch with master 2020-10-01 16:46:59 +02:00
Miriam Baglioni c6e6ed1bd8 merge branch with master 2020-10-01 16:24:41 +02:00
Claudio Atzori 2e9e13444d author pids made unique by value 2020-10-01 12:50:40 +02:00
Claudio Atzori e265c3e125 cleaning functions factored out in a dedicated class 2020-10-01 10:50:15 +02:00
Claudio Atzori 4287164aba include relevantdate field in the result view 2020-10-01 10:28:55 +02:00
Miriam Baglioni 7b6a7333e6 merge branch with master 2020-09-25 16:42:07 +02:00
Miriam Baglioni 983a12ed15 temporary modification to allow the upload of files in the sandbox without the neew to recreate the mapping from scratch 2020-09-25 16:41:51 +02:00
Miriam Baglioni 8b36d19182 added property depositionId and chenage property newVersion that became string from boolean to handle the three possible distinct values 2020-09-25 16:41:15 +02:00
Miriam Baglioni ed5239f9ec added new code to handle the new possibility to upload files to an already open deposition 2020-09-25 16:34:32 +02:00
Miriam Baglioni 3a8c524fce refactor 2020-09-25 16:34:02 +02:00
Miriam Baglioni 54800fb9b0 enabled only the step to upload in zenodo 2020-09-25 14:40:22 +02:00
Miriam Baglioni de6c4d46d8 fixed conflicts 2020-09-24 15:35:01 +02:00
Claudio Atzori 044d3a0214 fixed query used to load datasources in the Graph 2020-09-24 13:48:58 +02:00
Claudio Atzori 42f55395c8 fixed order of the ISSNs returned by the SQL query 2020-09-24 12:09:58 +02:00
Claudio Atzori 9a7e72d528 using concat_ws to join textual columns from PSQL. When using || to perform the concatenation, Null columns makes the operation result to be Null 2020-09-24 10:42:47 +02:00
Claudio Atzori 9e3e93c6b6 setting the correct issn type in the datasource.journal element 2020-09-24 10:39:16 +02:00
Miriam Baglioni 39eb8ab25b changed the dump to move from h2020programme to h2020classification 2020-09-23 17:33:00 +02:00
Miriam Baglioni e2ceefe9be - 2020-09-14 14:33:28 +02:00
Miriam Baglioni 1f893e63dc - 2020-09-14 14:33:10 +02:00
Claudio Atzori 8a523474b7 code formatting 2020-09-07 11:40:16 +02:00
Miriam Baglioni 8694bb9b31 refactoring due to compilation 2020-08-24 17:07:34 +02:00
Miriam Baglioni 8a069a4fea - 2020-08-24 17:01:30 +02:00
Miriam Baglioni 34fa96f3b1 - 2020-08-24 17:00:20 +02:00
Miriam Baglioni 5fb2949cb8 added utils methods 2020-08-24 17:00:09 +02:00
Miriam Baglioni 2a540b6c01 added constants for the pid graph dump 2020-08-24 16:55:35 +02:00
Miriam Baglioni 40c8d2de7b test resources for the dump of the pids graph 2020-08-24 16:50:39 +02:00
Miriam Baglioni bef79d3bdf first attempt to the dump of pids graph 2020-08-24 16:49:38 +02:00
Miriam Baglioni 85203c16e3 merge branch with master 2020-08-19 11:49:03 +02:00
Miriam Baglioni 1c593a9cfe - 2020-08-19 11:29:51 +02:00
Miriam Baglioni e42b2f5ae2 - 2020-08-19 11:29:09 +02:00
Miriam Baglioni f81ee22418 changed to mirror the changes in the model (Instance, CommunityInstance, GraphResult) 2020-08-19 11:28:26 +02:00
Miriam Baglioni 387be43fd4 changed to discriminate if dumping all the results type together or each one in its own archive 2020-08-19 11:25:27 +02:00
Miriam Baglioni c5858afb88 added parameter to guide the dump for the result (resultAggregation). true if all the result types should be dump together, false otherwise. 2020-08-19 11:24:14 +02:00
Miriam Baglioni 5570678c65 changed parameter name from hfdsNameNode to nameNode 2020-08-19 10:59:26 +02:00
Miriam Baglioni dc5096a327 refactoring due to compilation 2020-08-19 10:57:36 +02:00
Miriam Baglioni 09f5b92763 added specific reference to class 2020-08-14 20:00:09 +02:00
Miriam Baglioni 37e7c43652 changed parameter name from hdfsNaemNode to nameNode 2020-08-14 18:18:25 +02:00
Miriam Baglioni a5043de5da added method to get the mapped instance 2020-08-13 18:45:50 +02:00
Miriam Baglioni fcd10f452c changed because of D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:55:32 +02:00
Miriam Baglioni bfd1fcde6d removed not useful method and changed because of D-Net/dnet-hadoop#40 (comment) and D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:14:37 +02:00
Miriam Baglioni 7fd8397123 apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:13:15 +02:00
Miriam Baglioni 753d448cc9 apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:12:58 +02:00
Miriam Baglioni c0e071fa26 apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:12:40 +02:00
Miriam Baglioni 526db915bc apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:12:16 +02:00
Miriam Baglioni b0fab0d138 apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:11:57 +02:00
Miriam Baglioni 1b6320b251 apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:11:41 +02:00
Miriam Baglioni 743d31be22 apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:11:22 +02:00
Miriam Baglioni 65b48df652 apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:11:06 +02:00
Miriam Baglioni 90b54d3efb apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:08:24 +02:00
Miriam Baglioni 69bbb9592a apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:07:39 +02:00
Miriam Baglioni 945323299a apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:07:24 +02:00
Miriam Baglioni e04c993247 apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:07:07 +02:00
Miriam Baglioni ed0812d0ce apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:06:49 +02:00
Miriam Baglioni d55cfe0ea5 apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:06:20 +02:00
Miriam Baglioni 80866bec7d apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:06:05 +02:00
Miriam Baglioni 1400978c0a apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:05:44 +02:00
Miriam Baglioni 7b941a2e0a apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:05:17 +02:00
Miriam Baglioni f7474f50fe apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:04:52 +02:00
Miriam Baglioni 367203f412 apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:04:33 +02:00
Miriam Baglioni 3ab4809d31 apply changes in D-Net/dnet-hadoop#40 (comment) 2020-08-13 12:04:10 +02:00
Miriam Baglioni 235d4e4d6e moved Context as relevant for Communities dump 2020-08-12 18:16:45 +02:00
Miriam Baglioni 7400cd019d removed not needed variable 2020-08-12 10:03:33 +02:00
Miriam Baglioni 98d28bab5c fixed missing _ in context nsprefix 2020-08-12 10:00:18 +02:00
Miriam Baglioni 2d67476417 merge branch with master 2020-08-11 15:46:04 +02:00
Miriam Baglioni 0603ec4757 changed test to upload the dump for covid-19 community 2020-08-11 15:43:25 +02:00
Miriam Baglioni acb0926b2e json schemas for the dumped entities and relation 2020-08-11 15:39:48 +02:00
Miriam Baglioni ff52c51f92 added the communityMapPath parameter and removed the isLookUpUrl parameter 2020-08-11 15:39:22 +02:00
Miriam Baglioni 6f43acda5e added the maketar and send to zenodo step. Adjusted wf parameters 2020-08-11 15:38:20 +02:00
Miriam Baglioni ddc19de2e9 removed the isLookUpUrl among the parameters 2020-08-11 15:37:47 +02:00
Miriam Baglioni 592a8ea573 added parameter file for maketar class 2020-08-11 15:37:14 +02:00
Miriam Baglioni 77a0951b32 added the make archive step in the workflow 2020-08-11 15:32:32 +02:00
Miriam Baglioni cf4d918787 added description, changed parameter name and added method 2020-08-11 15:27:31 +02:00
Miriam Baglioni dc5fc5366d Creation of an archive for each related dump part 2020-08-11 15:26:06 +02:00
Miriam Baglioni 0ce49049d6 added description 2020-08-11 15:25:11 +02:00
Miriam Baglioni 9bae991167 added description of the class 2020-08-11 11:20:43 +02:00
Miriam Baglioni 341dc59ead removed the repartition(1). Added code for the creation of an archive containing all the parts dumped for each community 2020-08-11 11:18:58 +02:00
Miriam Baglioni 1991a49f70 removed reference to isLookUp to get the communityMap 2020-08-10 18:02:56 +02:00
Miriam Baglioni fe88904df0 changed the wf definition 2020-08-10 12:01:14 +02:00
Miriam Baglioni 87856467e2 removed isLookUpUrl and added code to read from HDSF the communitymap 2020-08-10 11:38:41 +02:00
Miriam Baglioni 1cf7043e26 removed isLookUoUrl from the parameters 2020-08-10 11:38:03 +02:00
Miriam Baglioni 46986aae2d added the new parameter for newdeposion/newversion and concept_record_id 2020-08-07 18:00:06 +02:00
Miriam Baglioni 3aedfdf0d6 added option to do a new deposition or new version of an old deposition 2020-08-07 17:49:14 +02:00
Miriam Baglioni 1b3ad1bce6 filter out authors pid (only orcid). Added check to get unique provenance for context id. filtr out countries with code UNKNOWN 2020-08-07 17:48:18 +02:00
Miriam Baglioni 5ceb8c5f0a moved constants from graph.Constants 2020-08-07 17:46:47 +02:00
Miriam Baglioni 6c65c93c0e refactoring 2020-08-07 17:45:35 +02:00
Miriam Baglioni 68adf86fe4 refactoring 2020-08-07 17:43:20 +02:00
Miriam Baglioni 26d2ad6ebb refactoring 2020-08-07 17:41:56 +02:00
Miriam Baglioni 9675af7965 refactoring 2020-08-07 17:41:07 +02:00
Miriam Baglioni 346a91f4d9 Added constants 2020-08-07 17:35:39 +02:00
Miriam Baglioni d52b0e1797 no use of IsLookUp. The query is done once and its result stored on HDFS. The path to the result is given instead of the isLookUpUrl 2020-08-07 17:34:40 +02:00
Miriam Baglioni ae1b7fbfdb changed method signature from set of mapkey entries to String representing path on file system where to find the map 2020-08-07 17:32:27 +02:00
Miriam Baglioni 545ea9f77e moved in common. Zenodo response model and APIClient to deposit in Zenodo 2020-08-07 16:44:51 +02:00
Miriam Baglioni da9b012c15 fixed dewcription 2020-08-06 11:55:44 +02:00
Miriam Baglioni 6dbadcf181 the new schema for the dumped result 2020-08-06 11:05:56 +02:00
Sandro La Bruzzo 4fb1821fab Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-08-06 10:28:31 +02:00
Sandro La Bruzzo 9d9e9edbd2 improved extractEntity Relation workflows using dataset 2020-08-06 10:28:24 +02:00
Miriam Baglioni 14eda4f46e added method to try to put inputstream to zenodo 2020-08-05 14:18:25 +02:00
Miriam Baglioni e737a47270 added classes to try to send input stream to zenodo for the upload 2020-08-05 14:17:40 +02:00
Miriam Baglioni 873e9cd50c changed hadoop setting to connect to s3 2020-08-04 15:37:25 +02:00
Alessia Bardi a29565ff57 code formatting 2020-08-04 12:55:27 +02:00
Alessia Bardi 01db29e208 fixes redmine issue #5846: datacite and its different namespace declarations 2020-08-04 12:53:48 +02:00
Alessia Bardi b4e4e5f858 do not duplicate result PIDs 2020-08-04 12:52:14 +02:00
Miriam Baglioni 5b651abf82 merge branch with master 2020-08-04 10:14:07 +02:00
Miriam Baglioni 901ae37f7b added step to workflow 2020-08-03 18:12:54 +02:00
Miriam Baglioni e43aeb139a added new property file and changed some parameter to old files 2020-08-03 18:07:28 +02:00
Miriam Baglioni aa9f3d9698 changed logic for save in s3 directly 2020-08-03 18:06:18 +02:00
Miriam Baglioni d465f0eec9 added fulltext to result 2020-08-03 18:03:27 +02:00
Miriam Baglioni c892c7dfa7 changed to query for community map just once and save the result for remaining executions 2020-08-03 17:56:31 +02:00
Michele Artini 652b13abb6 Merge branch 'master' into nsprefix_blacklist 2020-07-31 07:58:37 +02:00
Claudio Atzori cd631bb5bc defaults fixed in the cleaning workflow forces result.publisher to NULL when result.publisher.value in empty 2020-07-30 17:03:53 +02:00
Miriam Baglioni 57c87b7653 re-implemented to fix issue on not serializable Set<String> variable 2020-07-30 16:43:43 +02:00
Miriam Baglioni ef8e5957b5 added specific directory where to save results 2020-07-30 16:42:46 +02:00
Miriam Baglioni 75f3361c85 - 2020-07-30 16:41:31 +02:00
Miriam Baglioni 3f695b25fa refactoring 2020-07-30 16:40:15 +02:00
Miriam Baglioni e623f12bef refactoring 2020-07-30 16:32:59 +02:00
Miriam Baglioni ff7d05abb4 added support class to store the couple organizationId representativeId gaot from sql query on hive 2020-07-30 16:32:04 +02:00
Miriam Baglioni cf6d80b2ab added command to close the writer 2020-07-30 16:31:22 +02:00
Miriam Baglioni f985bca37b added USER_CLAIM constant value 2020-07-30 16:25:26 +02:00
Claudio Atzori 4bbfcf1ac6 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2020-07-30 16:25:06 +02:00
Claudio Atzori 4ff8007518 added function to set the missing vocabulary names, used in the cleaning workflow as a pre-cleaning step 2020-07-30 16:24:39 +02:00
Miriam Baglioni 6f1c40a933 - 2020-07-30 16:24:28 +02:00
Miriam Baglioni 2b66a93f9e added property file that was missing 2020-07-30 16:24:17 +02:00
Michele Artini bdece15ca0 blacklist of nsprefix 2020-07-30 16:13:38 +02:00
Sandro La Bruzzo c97c8f0c44 implemented new oozie job to extract entities in a separate dataset 2020-07-30 12:13:58 +02:00
Sandro La Bruzzo 3010a362bc updated changing in the workflow of provision in the phase of aggregation. Removed serialization in JSON RDD and used spark Dataset 2020-07-30 09:25:56 +02:00
Sandro La Bruzzo 16ae3c9ccf updated changing in the workflow of provision in the phase of aggregation. Removed serialization in JSON RDD and used spark Dataset 2020-07-30 09:25:32 +02:00
Miriam Baglioni 76bcab98ce added code to filter out null originalId from the dump 2020-07-29 18:28:21 +02:00
Miriam Baglioni 86bab79512 - 2020-07-29 18:20:22 +02:00
Miriam Baglioni 31791dcf3d fixed wrong property file path name 2020-07-29 18:20:08 +02:00
Miriam Baglioni 9e722aa1ef - 2020-07-29 18:00:08 +02:00
Miriam Baglioni d22f106f27 added constant to identify datasource associated to funders 2020-07-29 17:56:55 +02:00
Miriam Baglioni 40e194fe2f added check to not dump datasources related to funders 2020-07-29 17:56:18 +02:00
Miriam Baglioni b48934f6df changed the workflow name 2020-07-29 17:43:43 +02:00
Miriam Baglioni 074e9ab75e refactoring 2020-07-29 17:42:50 +02:00
Miriam Baglioni 8ad8dac7d4 merge branch with fork master 2020-07-29 17:38:28 +02:00
Miriam Baglioni 9fa82dc93b fixed issue 2020-07-29 17:36:16 +02:00
Miriam Baglioni 8907648d6a - 2020-07-29 17:35:47 +02:00
Miriam Baglioni 40a8dafbdc - 2020-07-29 17:30:44 +02:00
Miriam Baglioni 6d0f08277b classes to implement the dump of the whole graph. 2020-07-29 17:03:19 +02:00
Miriam Baglioni 8d4327b292 input parameters and workflow definition for the dump of the whole graph 2020-07-29 17:00:34 +02:00
Miriam Baglioni b5f995ab12 refactoring 2020-07-29 16:59:48 +02:00
Miriam Baglioni f7a87cc447 added new constants value 2020-07-29 16:58:40 +02:00
Miriam Baglioni b71d12cf26 refactoring 2020-07-29 16:52:44 +02:00
Miriam Baglioni a8d65b68cb changed to delete the part to check if it was a test or a real execution 2020-07-29 16:47:57 +02:00
Miriam Baglioni 3ec2392904 Added new class to move the place the split is effectively run 2020-07-29 16:46:50 +02:00
Miriam Baglioni 178c2729a7 changed the path to reach the java class to be executed 2020-07-29 12:29:51 +02:00
Miriam Baglioni 437ac12139 removed unused parameter 2020-07-29 12:28:16 +02:00
Miriam Baglioni 6c2223d1fc added code to get the openaire id for contexts 2020-07-24 17:30:15 +02:00
Miriam Baglioni afd54c1684 removed not needed upload and refactoring 2020-07-24 17:28:56 +02:00
Miriam Baglioni 7b0569d989 changed to map also the result associated to the whole graph 2020-07-24 17:28:11 +02:00
Miriam Baglioni 082225ad61 - 2020-07-24 17:27:26 +02:00
Miriam Baglioni 968c59d97a added teh logic to dump also the products for the whole graph. They will miss collected from and context information that will be materialized as new relations 2020-07-24 17:25:19 +02:00
Miriam Baglioni 332258d199 split the classes related to the communities dump and to the whole graph dump 2020-07-24 17:21:48 +02:00
Claudio Atzori 56bbfdc65d introduced parameter 'numParitions', driving the hive DB table data partitioning. Currently specified only for table 'project' 2020-07-23 08:54:10 +02:00
Claudio Atzori ebf60020ac map results as OPRs in case of missing //CobjCategory/@type and the vocabulary dnet:result_typologies doesn't resolve the super type 2020-07-20 19:01:10 +02:00
Miriam Baglioni 355d7e426e added dumo for project - not finished 2020-07-20 18:54:43 +02:00
Miriam Baglioni a2f01e5259 added getter and setter 2020-07-20 18:54:17 +02:00
Miriam Baglioni 40bbe94f7c merge with master fork 2020-07-20 18:10:03 +02:00
Miriam Baglioni 23160b4d29 realignment of the workflow classes with the changes in the structure of the module 2020-07-20 18:04:30 +02:00
Miriam Baglioni 08dbd99455 changed to dump the whole results graph by usign classes already implemented for communities. Added class to dump also organization 2020-07-20 17:54:28 +02:00
Miriam Baglioni e47ea9349c extended some types by adding provenance as the couple (provenance, trust) and moved some classes to be used by the complete graph dump also 2020-07-20 17:46:27 +02:00
Claudio Atzori 32f5e466e3 imports cleanup 2020-07-20 17:42:58 +02:00
Claudio Atzori 54ac583923 code formatting 2020-07-20 17:37:08 +02:00
Claudio Atzori 124e7ce19c in case of missing attribute //dr:CobjCategory/@type the resulttype is derived by looking up the vocabulary dnet:result_typologies with the 1st instance type available 2020-07-20 17:33:37 +02:00
Claudio Atzori 050dda223d Merge pull request 'removed duplicated fields' (#25) from unique_field_in_lists into master
Looks good as a temporary workaround. I agree the model could seamlessly make the distinct operation by using HashSets instead of Linked (or Array) Lists.

The task to update the model in such a way is added on #9#issuecomment-1583

Thanks!
2020-07-20 12:12:50 +02:00
Claudio Atzori e0c4cf6f7b added parameter to drive the graph merge strategy: priority (BETA|PROD) 2020-07-20 10:48:01 +02:00
Claudio Atzori 94ccdb4852 Merge branch 'master' into merge_graph 2020-07-20 10:14:55 +02:00
Michele Artini 331a3cbdd0 fixed originalId 2020-07-20 09:50:29 +02:00
Sandro La Bruzzo 9116d75b3e Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2020-07-17 18:01:30 +02:00
Miriam Baglioni 47c7122773 changed priority from beta to production 2020-07-17 12:56:35 +02:00
Michele Artini 442f30930c removed duplicated fields 2020-07-17 12:25:36 +02:00
Michele Artini 3adedd0a68 trust truncated to 3 decimals 2020-07-17 11:58:11 +02:00
Claudio Atzori 1781609508 code formatting 2020-07-16 19:06:56 +02:00
Claudio Atzori 878f2b931c Merge branch 'master' into merge_graph 2020-07-16 16:34:24 +02:00
Miriam Baglioni f9ad6f3255 Merge branch 'dump' of code-repo.d4science.org:miriam.baglioni/dnet-hadoop into dump 2020-07-10 19:42:53 +02:00
Miriam Baglioni c27f12d6e8 avoid to consider _SUCCESS file 2020-07-10 19:42:23 +02:00
Claudio Atzori 31071e363f Merge branch 'provision_indexing' 2020-07-10 19:03:57 +02:00