Michele Artini
3b5163d8e3
fixed an xpath
2024-03-08 08:23:07 +01:00
Michele Artini
db6f774394
selection of the new plugin
2024-03-04 10:43:40 +01:00
Michele Artini
9506d80ddc
changed sql to select accepted datasources
2024-03-04 08:25:40 +01:00
Michele Artini
c2b6841eb0
opendoar datasource filter
2024-03-01 15:32:56 +01:00
Michele Artini
be7f327e88
opendoar datasource filter
2024-03-01 13:38:36 +01:00
Michele Artini
32f4d6f691
reports for types
2024-03-01 11:43:37 +01:00
Michele Artini
71204a8056
some fields in stats
2024-02-29 10:17:31 +01:00
Michele Artini
5ddbef3a5b
new stats
2024-02-28 14:34:09 +01:00
Michele Artini
3d14bef381
OpenDoar reports
2024-02-28 10:51:13 +01:00
Michele Artini
f8cf7ffbcb
stats
2024-02-22 14:01:11 +01:00
Michele Artini
d2b7541583
fixed a problem with Dataset model
2024-02-16 11:36:46 +01:00
Michele Artini
8ffdd9747d
added id to BaseRecordInfo
2024-02-15 14:27:50 +01:00
Michele Artini
da65728afe
produce a parquet file
2024-02-15 14:04:17 +01:00
Michele Artini
e254720377
fixed path reports
2024-02-15 08:52:28 +01:00
Michele Artini
8d85c1e97e
used a StAX parser
2024-02-15 08:21:52 +01:00
Michele Artini
b42e2b4d61
fixed log class
2024-02-14 15:52:31 +01:00
Michele Artini
773346f638
increased memory
2024-02-14 14:40:27 +01:00
Michele Artini
2e11197142
removed invalid deletion
2024-02-14 11:59:30 +01:00
Michele Artini
ddd6a7ceb3
minor fixes
2024-02-14 11:39:37 +01:00
Michele Artini
963a2500be
new reports in hadoop job
2024-02-14 10:37:39 +01:00
Michele Artini
4b1ecad4e2
prepared a job to analyze the BASE records
2024-02-13 13:48:26 +01:00
Michele Artini
dd7350ecf2
fixed a problem with xpaths
2024-02-13 08:36:42 +01:00
Michele Artini
265bfd364d
refactoring
2024-02-12 15:35:36 +01:00
Michele Artini
16766c514e
refactoring
2024-02-12 12:19:57 +01:00
Michele Artini
5add433b74
partial refactoring
2024-02-09 14:33:04 +01:00
Michele Artini
c974c75f83
partial refactoring
2024-02-09 12:36:20 +01:00
Michele Artini
c6db6335b9
prepare filtering for base import
2024-02-06 15:10:29 +01:00
Michele Artini
abcd81bba0
first implementation of the collection plugin for BASE
2024-02-05 15:19:41 +01:00
Sandro La Bruzzo
6ce36b3e41
Implemented ORCID Workflow on DHP-Aggregation for retrieving ORCID DUMP and generating tables
2023-11-14 12:04:29 +01:00
Giambattista Bloisi
e64c2854a3
Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)
2023-07-24 15:36:24 +02:00
Claudio Atzori
92e48f12f7
[metadata collection] updated collector plugin name
2022-07-29 13:54:00 +02:00
Claudio Atzori
929b145130
code formatting
2022-06-21 23:07:06 +02:00
Serafeim Chatzopoulos
623f7be26d
Fix reading files from HDFS in FileCollector & FileGZipCollector plugins
2022-04-28 16:31:11 +03:00
Serafeim Chatzopoulos
d0b84d3297
Add FileCollectorPlugin and respective test
2022-04-07 15:06:38 +03:00
Serafeim Chatzopoulos
bc1bf55507
Add AbstractSplittedRecordPlugin
2022-04-07 14:33:04 +03:00
Serafeim Chatzopoulos
e612489670
Add fileGZip collector plugin and respective test
2022-04-06 19:12:44 +03:00
Sandro La Bruzzo
7af0bbd0b1
[scala-refactor] Module dhp-aggregation:
Moved all scala source into src/main/scala and src/test/scala
2021-12-06 11:26:36 +01:00
Sandro La Bruzzo
4acfa8fa2e
Scholexplorer Datasource Aggregation:
- Added collectedfrom in the inverse relation generated
Relation resolution:
- increased number of partitions in workflow.xml
- using classid instead of classname to build the pid-dnetId mapping
2021-10-26 17:51:20 +02:00
Sandro La Bruzzo
ae4e99a471
Adapted the PID resolution workflow to work within the OpenAIRE data workflow
- Added relations in both directions on all Scholexplorer datasources
2021-10-20 17:12:16 +02:00
Claudio Atzori
663b1556d7
manually integrating PR#140
2021-09-15 16:40:25 +02:00
Sandro La Bruzzo
3c6fc2096c
fix bug in the OAI iterator that skipped cleaned records
2021-09-07 10:46:26 +02:00
Miriam Baglioni
8769dd8eef
GetCSV refactoring - refactoring due to movement of classes
2021-08-12 18:20:56 +02:00
Miriam Baglioni
6e84b3951f
GetCSV refactoring - moving to dhp-common the classes that depend on the GetCSV class (previously located in graph-mapper)
2021-08-12 17:57:41 +02:00
Claudio Atzori
9f4db73f30
updated/fixed unit tests
2021-08-11 15:02:51 +02:00
Claudio Atzori
2ee21da43b
suggestions from SonarLint
2021-08-11 12:13:22 +02:00
Claudio Atzori
777536ce91
[aggregation] string values used as regular expressions in the OAI collection classes are defined in a single point as constants, to be reused across the code (PR#122)
2021-07-07 11:23:48 +02:00
Claudio Atzori
bc014023c8
Merge pull request 'to solve the scala SI-3623' (#122) from andreas.czerniak/BrStableId_dnet-hadoop:stable_ids into stable_ids
Reviewed-on: #122
2021-07-07 11:13:51 +02:00
Andreas Czerniak
ebf3f47a02
from&until more OAI2.0 compl., adding tfs
2021-07-07 09:29:49 +02:00
Claudio Atzori
70ded407bb
HttpClient used in metadata collection retries also on 404
2021-07-05 18:04:30 +02:00
Claudio Atzori
af42377d0e
HttpClient used in metadata collection retries on 502, 503, 504
2021-06-28 09:34:30 +02:00