Commit Graph

335 Commits

Author SHA1 Message Date
Sandro La Bruzzo 31d2d6d41e Scholexplorer: introduction of dedup openaire 2021-07-21 18:09:32 +02:00
Miriam Baglioni d418c309f5 removed the part after part-x- in the file name generated by spark. It was too long and created problems while creating the tar entries 2021-07-13 17:11:49 +02:00
Sandro La Bruzzo ad50415167 Merge remote-tracking branch 'origin/stable_ids' into stable_id_scholexplorer 2021-06-24 17:20:50 +02:00
Claudio Atzori 67afd06cd1 [cleaning] cleaning instance.pid and instance.alternateidentifier using the same procedure used to clean result.pid 2021-06-24 12:10:17 +02:00
Sandro La Bruzzo cc0f2b11fb Implemented mapping from pubmed baseline to OAF 2021-06-16 14:56:24 +02:00
Claudio Atzori 2039bb9f5f orcid / orcid_pending cleaning backported from master branch 2021-06-14 09:40:50 +02:00
Claudio Atzori a900bfb874 delegating the date parsing to https://github.com/sisyphsu/dateparser 2021-06-11 16:53:01 +02:00
Claudio Atzori eb6acfbabc [cleaning] removing non parsable relation.validationDate(s) 2021-05-28 10:50:44 +02:00
Claudio Atzori 9d725efdc1 reverted implementation of the mdstore client 2021-05-20 18:26:09 +02:00
Claudio Atzori 23b8883ab1 applied intellij code cleanup 2021-05-14 10:58:12 +02:00
Claudio Atzori d4c3476152 mapping datasource.journal only when an issn is available, null otherwise 2021-05-11 11:08:54 +02:00
Claudio Atzori d1cbee8413 imported methods from CleaningFunctions, defined in GraphCleaningFunctions 2021-05-10 16:43:39 +02:00
Claudio Atzori 3797543600 MDStoreManager model classes moved in dhp-schemas 2021-05-10 14:32:05 +02:00
Claudio Atzori b1785ba77c alternative way to set timeouts for the ISLookup client 2021-05-05 11:23:46 +02:00
Claudio Atzori 923d19ea8e mdstore read lock/unlock when bulk copying records from mongodb to hdfs 2021-05-04 18:06:21 +02:00
Claudio Atzori 91e7220f20 cleaned up workflow for actionset migration, adjusted dnet|cnr* dependency versions 2021-04-29 10:09:52 +02:00
Claudio Atzori 5afa7d3e0c core utilities in dhp-common moved in external module dhp-schemas 2021-04-27 15:44:01 +02:00
Claudio Atzori f783e60ff7 cleanup 2021-04-27 14:04:50 +02:00
Claudio Atzori 27ab8a704d adjusted poms to align with the external dhp-schema module 2021-04-27 10:12:27 +02:00
Claudio Atzori c2bb03c8b5 depending on external dhp-schemas module 2021-04-23 17:57:35 +02:00
Claudio Atzori 8704d32780 code formatting 2021-04-15 16:52:58 +02:00
Claudio Atzori ba4b4c74d8 do not make the identifier prefix depend on the Handle 2021-04-15 16:48:26 +02:00
Claudio Atzori 710cd1e8f2 Merge pull request 'add xslt, personname cleaner' (#104) from andreas.czerniak/BrStableId_dnet-hadoop:stable_ids into stable_ids (Reviewed-on: #104; LGTM) 2021-04-13 14:43:05 +02:00
Claudio Atzori d1ca025b0b [cleaning] removing authors without fullname or providing 'deactivated' keyword. Removing test test titles 2021-04-13 14:32:41 +02:00
Andreas Czerniak d7614c1f85 introduce new const 2021-04-13 07:04:27 +02:00
Claudio Atzori 902d05f548 [cleaning] avoiding NPEs handling null author PIDs 2021-04-12 17:31:40 +02:00
Claudio Atzori 72ce741ea6 WIP: using common definitions from ModelConstants 2021-03-31 17:07:13 +02:00
Claudio Atzori 27681b876c code formatting 2021-03-29 17:47:11 +02:00
miconis 2709d08fc2 Merge branch 'stable_ids' into openorgswf 2021-03-29 16:39:07 +02:00
Claudio Atzori 3becaa5539 [Cleaning] drop alternate identifiers with empty values 2021-03-29 16:01:35 +02:00
Claudio Atzori 48f2b6127e [Cleaning] drop alternate identifiers with empty values 2021-03-29 14:23:18 +02:00
miconis 2355cc4e9b minor changes and bug fix 2021-03-29 10:07:12 +02:00
Claudio Atzori b5b7dc2104 [Cleaning] drop alternate identifiers with empty values 2021-03-26 12:30:00 +01:00
Claudio Atzori 827e7e37db [Cleaning] drop instance.alternateIdentifier elements when they are available among instance.pid 2021-03-25 11:07:59 +01:00
Claudio Atzori 431cbe9955 handle missing instance.pid during bulk cleaning 2021-03-23 09:28:58 +01:00
Sandro La Bruzzo c73072079d fix conflicts 2021-03-22 16:36:31 +01:00
Claudio Atzori 3256b9c836 code formatting 2021-03-19 09:36:12 +01:00
Claudio Atzori 75144dacb3 Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids 2021-03-19 09:07:40 +01:00
Claudio Atzori 9588bfba81 [cleaning] entries available as PIDs must not appear as alternateIdentifier 2021-03-19 09:07:30 +01:00
Sandro La Bruzzo 25d5663d97 added filter 2021-03-18 10:24:42 +01:00
Sandro La Bruzzo 5f98ea74a9 Added fix for pid generation in stableIds 2021-03-17 15:53:24 +01:00
Claudio Atzori 734232d3b9 identifier factory doesn't depend on pre-existing entity.id 2021-03-17 15:14:53 +01:00
Claudio Atzori a3dac32f16 pidFilter a bit more permissive 2021-03-17 15:06:05 +01:00
Claudio Atzori 8257f9a2bc result.pid: adjusted the mapping applied to the contents from the aggregator 2021-03-17 12:45:38 +01:00
Claudio Atzori 3b2da86f0a added precondition on IdentifierFactory to check the presence of entity.id 2021-03-16 17:05:38 +01:00
Claudio Atzori 640b885706 added instance.alternativeIdentifiers to the graph model, adjusted the mapping applied to the contents from the aggregator 2021-03-16 14:19:32 +01:00
Claudio Atzori f74e464942 create bestaccessright as Qualifier 2021-03-10 15:40:05 +01:00
Claudio Atzori c801ab6c1d minor 2021-03-09 17:22:31 +01:00
Claudio Atzori 9917d7e01c PID authorities include ArXiv 2021-03-09 17:12:52 +01:00
Claudio Atzori 01630f638d IdentifierFactory implementation based on the list of datasources authoritative for a given pid type 2021-03-09 17:11:50 +01:00
Claudio Atzori b3f3b895e5 [#6282 open access status in the Graph] OAStatus renamed as openAccessRoute 2021-03-09 11:41:11 +01:00
Claudio Atzori 765f9bdee7 merged from dhp_oaf_model 2021-03-09 11:37:41 +01:00
Claudio Atzori d525785497 [#6282 open access status in the Graph] Result.Instance.accessRight defined with dedicated data type that includes the open access color. 2021-03-09 11:12:55 +01:00
Claudio Atzori 8d2bb24512 merged from master 2021-03-08 15:44:34 +01:00
Claudio Atzori fa7930d2e2 merging contributions from PR#97 2021-03-05 15:45:28 +01:00
Claudio Atzori ec80b7ade3 code formatting 2021-03-03 10:22:53 +01:00
Claudio Atzori b73dce3e3a more logging on the MDStore mongodb client. Forcing UTF_8 encoding on the content 2021-03-03 10:17:16 +01:00
Claudio Atzori e76c4f62c1 MetadataRecord moved in dhp-schemas 2021-02-26 10:58:48 +01:00
Claudio Atzori b830e33392 mdstore collector plugin 2021-02-25 12:30:30 +01:00
Claudio Atzori dc98c39500 more logging 2021-02-25 12:29:18 +01:00
Claudio Atzori fc3fa5e343 implemented mdstore collector plugin 2021-02-24 15:07:24 +01:00
Claudio Atzori cf27905a71 WIP: collectorWorker error reporting, added report messages 2021-02-16 16:53:14 +01:00
Claudio Atzori 58288a95b8 WIP: collectorWorker error reporting, added report messages 2021-02-15 15:28:53 +01:00
Claudio Atzori 1abe6d1ad7 WIP: collectorWorker error reporting, added report messages 2021-02-15 15:08:59 +01:00
Claudio Atzori 29c6f7e255 classes related to the collection workflow moved into common package; implemented MongoDB collection plugins 2021-02-12 12:31:02 +01:00
Claudio Atzori 50add4c61b added requestDelay to HttpConnector2 configuration; Aggregation workflow constants moved in dhp-common 2021-02-08 12:19:38 +01:00
Claudio Atzori 40df0f987d better logging, WIP: collectorWorker error reporting; common functions moved in DHPUtils 2021-02-06 20:12:00 +01:00
Claudio Atzori a8a758925e better logging, WIP: collectorWorker error reporting 2021-02-05 19:18:05 +01:00
Michele Artini 2ee0c3e47e http entity as json string 2021-02-05 09:45:39 +01:00
Claudio Atzori 730973679a Merge branch 'hadoop_aggregator' of https://code-repo.d4science.org/D-Net/dnet-hadoop into hadoop_aggregator 2021-02-04 17:25:00 +01:00
Claudio Atzori deb85706db imported HttpConnector from https://svn.driver.research-infrastructures.eu/driver/dnet45/modules/dnet-modular-collector-service/trunk/src/main/java/eu/dnetlib/data/collector/plugins/HttpConnector.java as HttpConnector2 2021-02-04 17:24:52 +01:00
Sandro La Bruzzo 4dae5e605d implemented messaging between collection worker and Dnet 2021-02-04 15:51:15 +01:00
Claudio Atzori 72c57b28fa switched project version to 1.2.4-branch_hadoop_aggregator-SNAPSHOT 2021-02-04 14:08:18 +01:00
Claudio Atzori 40764cf626 better logging, WIP: collectorWorker error reporting 2021-02-04 14:06:02 +01:00
Michele Artini 26d2eb946f messages sender 2021-02-04 09:45:46 +01:00
Michele Artini 1b9731632b Message Sender 2021-02-03 16:42:36 +01:00
Michele Artini 820d729e99 recover of Message and MessageType class 2021-02-03 16:20:34 +01:00
Claudio Atzori 0e8a4f9f1a better logging, WIP: collectorWorker error reporting 2021-02-03 12:33:41 +01:00
Claudio Atzori d62ea1490d cleaned up RabbitMQ stuff 2021-02-02 10:53:19 +01:00
Claudio Atzori 73d772a4b4 added method to list the known vocabulary names 2021-02-02 10:39:47 +01:00
Claudio Atzori 8eaa1fd4b4 WIP: metadata collection in INCREMENTAL mode and relative test 2021-02-01 19:29:10 +01:00
Sandro La Bruzzo 6ff234d81b Implemented a first prototype of incremental harvesting and transformation using readlock 2021-02-01 13:56:05 +01:00
Sandro La Bruzzo 0276180039 WIP mdstore: transaction implemented on hadoop side 2021-01-29 16:42:41 +01:00
Michele Artini d942d0c77d methods toString(), hashCode() and equals() 2021-01-29 13:16:48 +01:00
Michele Artini 38f2508c87 new fields in mdstore beans 2021-01-28 08:24:45 +01:00
Sandro La Bruzzo a54848a59c Moved Vocabulary stuff to common module 2021-01-25 15:43:04 +01:00
Claudio Atzori 28460c2cd1 using com.fasterxml.jackson.databind.ObjectMapper instead of org.codehaus.jackson.map.ObjectMapper 2020-12-23 16:59:52 +01:00
Claudio Atzori 6848d0c3d7 trivial: avoid duplicated code 2020-12-23 12:21:58 +01:00
Claudio Atzori d8b5f43a7e code formatting 2020-12-22 14:59:03 +01:00
miconis 794e22b09c bug fix in the authormerge: now authors with higher size have priority, normalization of author name fixed 2020-12-21 17:51:42 +01:00
Claudio Atzori 12e2f930c8 resolved conflicts 2020-12-10 10:57:39 +01:00
Alessia Bardi 112da6d76a in theory, just auto-formatting after mvn compile 2020-12-09 20:00:27 +01:00
Miriam Baglioni 6fbc67a959 using ModelConstant.ORCID and removing not used constants 2020-12-09 17:10:20 +01:00
Claudio Atzori 3c5ce1dada code formatting 2020-12-09 17:07:20 +01:00
Miriam Baglioni 212b52614f added graph mapper versus community result without context and project in common to be used for the doiboost mapping 2020-12-09 16:59:02 +01:00
Claudio Atzori 491ad24750 introduced filtering for DOIs in graph cleaning workflow 2020-12-09 09:10:33 +01:00
Claudio Atzori 943b961cf6 introduced PidBlacklist 2020-12-02 09:30:34 +01:00
Claudio Atzori 893ac4a77b GenerateEntitiesApplication can be configured to hash the id value or not 2020-12-02 09:30:06 +01:00
Claudio Atzori 349e7246aa do not consider NCID, GBIF as PIDs candidate for the ID creation 2020-11-30 16:52:40 +01:00
Claudio Atzori 2c407e775e GenerateEntitiesApplication can be configured to hash the id value or not 2020-11-30 12:00:38 +01:00
Claudio Atzori 758d27745d cleaning tab characters from text fields 2020-11-27 16:07:24 +01:00
Claudio Atzori 596a2a459d added testing class for OafMapperUtils 2020-11-27 12:01:11 +01:00
Claudio Atzori fa66e5b6b8 ResultTypeComparator gives priority to Records collectedfrom Crossref 2020-11-26 13:09:19 +01:00
Claudio Atzori d0d5525d40 minor changes 2020-11-26 11:04:17 +01:00
Miriam Baglioni 66c0e3e574 changed because of #61 (comment) 2020-11-25 17:52:17 +01:00
Claudio Atzori 1372a4d1bf fixed merging method 2020-11-25 16:05:51 +01:00
Claudio Atzori dfd6205b95 Consistency graph workflow merges all the entities by ID 2020-11-25 14:55:32 +01:00
Claudio Atzori e1a1bb3ee4 moved class CleaningFunctions in the correct package. Remove newlines from titles, descriptions, subjects 2020-11-24 18:34:03 +01:00
Claudio Atzori e43ab07af6 code formatting 2020-11-24 14:41:39 +01:00
Miriam Baglioni 73dbb79602 removed the check for the community name in the common version on MakeTar 2020-11-24 14:36:15 +01:00
Claudio Atzori c016cc050a IdentifierFactory: in case a record provides more than one pid of the same type, the lexicographically lower value is chosen as best pick 2020-11-23 19:16:40 +01:00
Claudio Atzori 3f34757c63 merged from master 2020-11-19 14:34:54 +01:00
Claudio Atzori 2bed29eb09 WIP: added oozie workflow for grouping graph entities by id 2020-11-13 10:05:12 +01:00
Claudio Atzori 13e36a4da0 WIP: added oozie workflow for grouping graph entities by id 2020-11-13 10:05:02 +01:00
Claudio Atzori 9b0fb9e958 merged from master 2020-11-12 09:27:12 +01:00
Miriam Baglioni f8e9bda24c merge branch with master 2020-11-05 16:31:18 +01:00
Miriam Baglioni 7ebdfacee9 removed commented code and added documentation to new method 2020-11-05 16:30:36 +01:00
Claudio Atzori 4625b7486e code formatting 2020-11-04 18:12:43 +01:00
Claudio Atzori e5da4ee9b1 dedup workflow using the common PidComparator 2020-11-04 15:02:02 +01:00
Claudio Atzori ea2a0ea949 IdentifierFactory considers only DOIs matching a given regex 2020-11-03 18:43:37 +01:00
Miriam Baglioni d4382b54df moved the tar archive with max size on common module 2020-11-03 16:54:50 +01:00
Claudio Atzori 86d6fbe95b refactoring: CleaningFunctions and OafMapperUtils moved in dhp-common 2020-11-03 12:19:46 +01:00
Claudio Atzori 3fcd669e99 result merge operation leverage on custom ResultTypeComparator in the aggregator graph construction 2020-11-03 10:53:23 +01:00
Claudio Atzori 78c3c1b62b exclude pid values set to 'none' 2020-11-02 14:25:26 +01:00
Claudio Atzori 09e44dabff Merge branch 'master' into stable_ids 2020-11-02 12:16:01 +01:00
Miriam Baglioni 10d8bbada8 changed deprecated method with non deprecated version 2020-10-30 14:10:10 +01:00
Claudio Atzori 58f28296ea ProvisionConstants moved as ModelHardLimits in dhp-common and applied to truncate long abstracts (len > 150000). Further filtering for empty PID values 2020-10-30 10:56:42 +01:00
Miriam Baglioni 4cf4454341 changed from deprecated method to new one 2020-10-27 17:46:19 +01:00
Miriam Baglioni c8f32dd109 - 2020-10-27 17:45:58 +01:00
Miriam Baglioni 3582eba565 - 2020-10-27 17:31:33 +01:00
Miriam Baglioni 3241ec1777 added connection timeout and socket timeout 600 sec 2020-10-27 16:12:11 +01:00
Miriam Baglioni cc68855a1e merge upstream 2020-10-27 15:54:16 +01:00
Miriam Baglioni 1cb60aede4 added connection timeout and socket timeout 600 sec 2020-10-27 15:53:02 +01:00
sandro 3a81a940b7 solved bug on merge publication 2020-10-21 22:41:55 +02:00
Claudio Atzori c188868450 Merge branch 'master' into stable_ids 2020-10-16 12:06:23 +02:00
Miriam Baglioni 959f30811e added connection timeout and socket timeout 600 sec 2020-10-16 10:52:30 +02:00
Sandro La Bruzzo 734934e2eb fixed error on empty intersection with publication and relation on export to OAF 2020-10-08 17:29:29 +02:00
Sandro La Bruzzo eec418cd26 moved AuthoreMerger into dhp-common 2020-10-08 10:33:55 +02:00
Claudio Atzori 8958f20813 code formatting 2020-10-07 13:14:31 +02:00
Claudio Atzori 1abcabb6e6 WIP stable ids: IdentifierFactory & unit test 2020-10-06 18:55:23 +02:00
Claudio Atzori 6ce340bd3d WIP stable ids: IdentifierFactory 2020-10-06 15:44:53 +02:00
Claudio Atzori 49ae3450a9 code formatting 2020-10-02 09:43:24 +02:00
Claudio Atzori 1c44182dea minor changes 2020-10-02 09:41:34 +02:00
Miriam Baglioni ccd48dd78a added new test for new method 2020-09-25 16:33:43 +02:00
Miriam Baglioni 3e5497b336 added new method to handle an open deposition to which upload data 2020-09-25 16:33:15 +02:00
Claudio Atzori 8a523474b7 code formatting 2020-09-07 11:40:16 +02:00
Miriam Baglioni c7f944a533 refactoring due to compilation 2020-08-19 10:01:26 +02:00
Miriam Baglioni 02a4986e7b Applying changes from code reviews #40 (comment) and #40 (comment) and #40 (comment) 2020-08-13 11:53:01 +02:00
Miriam Baglioni 33a6a51333 Disabled Test (impossible to publish without accessToken). And applying changes from code review #40 (comment) 2020-08-13 11:48:32 +02:00
Miriam Baglioni 306603272e removed accession token 2020-08-12 09:39:58 +02:00