Commit Graph

71 Commits

Author SHA1 Message Date
Michele Artini 3b5163d8e3 fixed a xpath 2024-03-08 08:23:07 +01:00
Michele Artini db6f774394 selection of the new plugin 2024-03-04 10:43:40 +01:00
Michele Artini 9506d80ddc changed sql to select accepted datasources 2024-03-04 08:25:40 +01:00
Michele Artini c2b6841eb0 opendoar datasource filter 2024-03-01 15:32:56 +01:00
Michele Artini be7f327e88 opendoar datasource filter 2024-03-01 13:38:36 +01:00
Michele Artini 32f4d6f691 reports for types 2024-03-01 11:43:37 +01:00
Michele Artini 71204a8056 some fields in stats 2024-02-29 10:17:31 +01:00
Michele Artini 5ddbef3a5b new stats 2024-02-28 14:34:09 +01:00
Michele Artini 3d14bef381 OpenDoar reports 2024-02-28 10:51:13 +01:00
Michele Artini f8cf7ffbcb stats 2024-02-22 14:01:11 +01:00
Michele Artini d2b7541583 fixed a problem with Dataset model 2024-02-16 11:36:46 +01:00
Michele Artini 8ffdd9747d added id to BaseRecordInfo 2024-02-15 14:27:50 +01:00
Michele Artini da65728afe produce a parquet file 2024-02-15 14:04:17 +01:00
Michele Artini e254720377 fixed path reports 2024-02-15 08:52:28 +01:00
Michele Artini 8d85c1e97e used a parser STAX 2024-02-15 08:21:52 +01:00
Michele Artini b42e2b4d61 fixed log class 2024-02-14 15:52:31 +01:00
Michele Artini 773346f638 increased memory 2024-02-14 14:40:27 +01:00
Michele Artini 2e11197142 removed invalid deletion 2024-02-14 11:59:30 +01:00
Michele Artini ddd6a7ceb3 minor fixes 2024-02-14 11:39:37 +01:00
Michele Artini 963a2500be new reports in hadoop job 2024-02-14 10:37:39 +01:00
Michele Artini 4b1ecad4e2 prepared a job to analyze the BASE records 2024-02-13 13:48:26 +01:00
Michele Artini dd7350ecf2 fixed a problem with xpaths 2024-02-13 08:36:42 +01:00
Michele Artini 265bfd364d refactoing 2024-02-12 15:35:36 +01:00
Michele Artini 16766c514e refactoring 2024-02-12 12:19:57 +01:00
Michele Artini 5add433b74 partial refactoring 2024-02-09 14:33:04 +01:00
Michele Artini c974c75f83 partial refactoring 2024-02-09 12:36:20 +01:00
Michele Artini c6db6335b9 prepare filtering for base import 2024-02-06 15:10:29 +01:00
Michele Artini abcd81bba0 first implementation of the collection plugin for BASE 2024-02-05 15:19:41 +01:00
Giambattista Bloisi e64c2854a3 Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)
2023-07-24 15:36:24 +02:00
Claudio Atzori 92e48f12f7 [metadata collection] updated collector plugin name 2022-07-29 13:54:00 +02:00
Claudio Atzori 929b145130 code formatting 2022-06-21 23:07:06 +02:00
Serafeim Chatzopoulos 623f7be26d Fix reading files from HDFS in FileCollector & FileGZipCollector plugins 2022-04-28 16:31:11 +03:00
Serafeim Chatzopoulos d0b84d3297 Add FileCollectorPlugin and respective test 2022-04-07 15:06:38 +03:00
Serafeim Chatzopoulos bc1bf55507 Add AbstractSplittedRecordPlugin 2022-04-07 14:33:04 +03:00
Serafeim Chatzopoulos e612489670 Add fileGZip collector plugin and respective test 2022-04-06 19:12:44 +03:00
Claudio Atzori 663b1556d7 manually integrating PR#140 #140 2021-09-15 16:40:25 +02:00
Sandro La Bruzzo 3c6fc2096c fix bug on oai iterator that skip record cleaned 2021-09-07 10:46:26 +02:00
Miriam Baglioni 8769dd8eef GetCSV refactoring - refactoring due to movement of classes 2021-08-12 18:20:56 +02:00
Claudio Atzori 9f4db73f30 updated/fixed unit tests 2021-08-11 15:02:51 +02:00
Claudio Atzori 2ee21da43b suggestions from SonarLint 2021-08-11 12:13:22 +02:00
Claudio Atzori 777536ce91 [aggregation] string values used as regular expressions in the OAI collection classes are defined in a single point as constants, to be reused across the code (PR#122) 2021-07-07 11:23:48 +02:00
Andreas Czerniak ebf3f47a02 from&until more OAI2.0 compl., adding tfs 2021-07-07 09:29:49 +02:00
Claudio Atzori 9d725efdc1 reverted implementation of the mdstore client 2021-05-20 18:26:09 +02:00
Claudio Atzori 23b8883ab1 applied intellij code cleanup 2021-05-14 10:58:12 +02:00
Claudio Atzori 923d19ea8e mdstore read lock/unlock when bulk copying records from mongodb to hdfs 2021-05-04 18:06:21 +02:00
Claudio Atzori 5afa7d3e0c core utilities in dhp-common moved in external module dhp-schemas 2021-04-27 15:44:01 +02:00
Sandro La Bruzzo 63c0303137 removed unused import, add log 2021-04-27 12:17:23 +02:00
Sandro La Bruzzo c73072079d fix conflicts 2021-03-22 16:36:31 +01:00
Claudio Atzori 61a2551e74 migrated last changes from svn (dnet45) 2021-03-15 17:17:55 +01:00
Claudio Atzori acbe3119a4 RestCollectorPlugin imported from dne45 2021-03-08 09:44:09 +01:00