Commit Graph

207 Commits

Author SHA1 Message Date
Claudio Atzori 44a937f4ed factored out entity grouping implementation, extended to consider results from delegated authorities rather than identical records from other sources 2022-01-19 12:24:52 +01:00
Miriam Baglioni 42e8f76778 [GraphCleaning] change the return value in the filtering function to avoid losing the APC entities 2022-01-13 16:06:43 +01:00
Miriam Baglioni 56409d1281 [Dump] resolved conflicts with beta and merging 2021-12-14 15:03:45 +01:00
Miriam Baglioni a3592b463a Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2021-12-14 14:58:26 +01:00
Claudio Atzori aff3ddc8d2 added cleaning for the format field, removing carriage return and tab characters 2021-12-14 11:41:46 +01:00
Miriam Baglioni 936578aaf1 Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta 2021-12-13 15:01:47 +01:00
Claudio Atzori 41c70c607d cleaning workflow assigns the proper default instance type when a value could not be cleaned using the vocabularies 2021-12-09 16:44:28 +01:00
Claudio Atzori e6e177dda0 vocabulary based cleaning also considers the term label when looking up a synonym 2021-12-09 13:57:53 +01:00
Miriam Baglioni b113586207 resolved conflicts 2021-12-07 10:16:14 +01:00
Miriam Baglioni 96a7d46278 [Graph Dump] fixed tests 2021-12-06 15:06:32 +01:00
Sandro La Bruzzo 81bf604059 [scala-refactor] Module dhp-common: Moved all scala source into src/main/scala and src/test/scala 2021-12-06 11:29:24 +01:00
Claudio Atzori 863a2f9db3 avoid filtering OAF records defined as invisible = true 2021-12-03 09:08:12 +01:00
Miriam Baglioni 8905a39bf3 merging with branch beta 2021-12-02 13:17:29 +01:00
Sandro La Bruzzo 1e1f5e4fe0 minor fix 2021-11-25 13:03:17 +01:00
Sandro La Bruzzo 2164a2a889 Datacite: Code Refactor generated a general SparkApplication Scala where all the spark scala have to inherit; Commented a little the Datacite transformation code 2021-11-25 10:54:13 +01:00
Miriam Baglioni 9fae872181 [Graph Dump] changed to mirror the changes in the model 2021-11-19 11:25:50 +01:00
Claudio Atzori 82a4e4efae [cleaning wf] fixed methodology to rule out invalid result titles, based on https://support.openaire.eu/issues/7206 2021-11-17 14:17:22 +01:00
Claudio Atzori 49f897ef29 [cleaning wf] fixed regex used to spot garbage in result titles; adjusted threshold for filtering titles 2021-11-16 15:24:23 +01:00
Sandro La Bruzzo aafdffa6b3 resolved conflict 2021-10-26 09:45:46 +02:00
Sandro La Bruzzo 034304b33a conflict resolved on merge 2021-10-26 09:40:47 +02:00
Claudio Atzori 6b34ba737e minor 2021-10-21 14:16:18 +02:00
Sandro La Bruzzo ae4e99a471 Adapted workflow of resolution of PID to work into OpenAIRE data workflow - Added relations in both directions on all Scholexplorer datasources 2021-10-20 17:12:16 +02:00
Miriam Baglioni c8321ad31a merge with branch beta 2021-10-01 12:59:08 +02:00
Claudio Atzori 663b1556d7 manually integrating PR#140 2021-09-15 16:40:25 +02:00
Claudio Atzori 3359f73fcf cleanup & best practices 2021-08-13 12:00:42 +02:00
Miriam Baglioni 6e84b3951f GetCSV refactoring - moving classes to dhp-common that have dependency with GetCSV class (that was located in graph-mapper) 2021-08-12 17:57:41 +02:00
Claudio Atzori 2ee21da43b suggestions from SonarLint 2021-08-11 12:13:22 +02:00
Miriam Baglioni 6bd1eca7e0 merge branch with beta 2021-08-05 15:23:32 +02:00
Miriam Baglioni ee13da9258 merge branch with master 2021-08-05 11:34:20 +02:00
Claudio Atzori a9961a1835 [cleaning] title cleaning based on the me.xuender:unidecode library 2021-07-28 16:36:33 +02:00
Claudio Atzori 6dddad86ee [cleaning] title cleaning based on the me.xuender:unidecode library 2021-07-28 16:21:29 +02:00
Miriam Baglioni 35e395eae8 merge with master 2021-07-27 12:34:59 +02:00
Claudio Atzori bc835d2024 [cleaning] fixed filtering function for missing titles 2021-07-23 11:56:13 +02:00
Claudio Atzori ffdb2a3ea3 [cleaning] fixed filtering function for missing titles 2021-07-23 11:55:55 +02:00
Sandro La Bruzzo 62ae36a3d2 fixed NPE 2021-07-22 15:41:38 +02:00
Sandro La Bruzzo d94565862a fixed NPE 2021-07-21 21:23:11 +02:00
Sandro La Bruzzo 31d2d6d41e Scholexplorer: introduction of dedup openaire 2021-07-21 18:09:32 +02:00
Miriam Baglioni d418c309f5 removed the part after part-x- in the file name generated by spark. It was too long and created problems while creating the tar entries 2021-07-13 17:11:49 +02:00
Claudio Atzori 67afd06cd1 [cleaning] cleaning instance.pid and instance.alternateidentifier using the same procedure used to clean result.pid 2021-06-24 12:10:17 +02:00
Claudio Atzori 2039bb9f5f orcid / orcid_pending cleaning backported from master branch 2021-06-14 09:40:50 +02:00
Claudio Atzori a900bfb874 delegating the date parsing to https://github.com/sisyphsu/dateparser 2021-06-11 16:53:01 +02:00
Claudio Atzori eb6acfbabc [cleaning] removing non parsable relation.validationDate(s) 2021-05-28 10:50:44 +02:00
Claudio Atzori 9d725efdc1 reverted implementation of the mdstore client 2021-05-20 18:26:09 +02:00
Claudio Atzori 23b8883ab1 applied intellij code cleanup 2021-05-14 10:58:12 +02:00
Claudio Atzori d4c3476152 mapping datasource.journal only when an issn is available, null otherwise 2021-05-11 11:08:54 +02:00
Claudio Atzori d1cbee8413 imported methods from CleaningFunctions, defined in GraphCleaningFunctions 2021-05-10 16:43:39 +02:00
Claudio Atzori 3797543600 MDStoreManager model classes moved in dhp-schemas 2021-05-10 14:32:05 +02:00
Claudio Atzori b1785ba77c alternative way to set timeouts for the ISLookup client 2021-05-05 11:23:46 +02:00
Claudio Atzori 923d19ea8e mdstore read lock/unlock when bulk copying records from mongodb to hdfs 2021-05-04 18:06:21 +02:00
Claudio Atzori 5afa7d3e0c core utilities in dhp-common moved in external module dhp-schemas 2021-04-27 15:44:01 +02:00