UrlsController/src/main/java/eu/openaire/urls_controller/models
Lampros Smyrnaios 8dfb58ee63 Avoid assigning the same publications multiple times to the Workers, after the recent "parallelization enhancement".
After that enhancement, each Worker could request multiple assignment-batches before its previous batches were processed by the Controller. This means that, for each batch that was processed, the Controller was deleting from the "assignment" table all the assignments (-batches) delivered to the Worker that brought that batch, even though the "attempt" and "payload" records for the rest of the batches were not yet inserted in the DB. So, in a new assignments-batch request, the same publications that were already being processed were delivered to the same or other Workers.
Now, for each finished batch, only the assignments of that batch are deleted from the "assignment" table.
2023-07-11 17:27:23 +03:00
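The fix described in the commit message above can be illustrated with a minimal sketch. It is not the Controller's actual code: the class `AssignmentTableSketch`, its fields (`publicationId`, `workerId`, `batchCounter`) and its method names are hypothetical, and an in-memory list stands in for the "assignment" table. It only contrasts deleting every assignment of a Worker with deleting the assignments of one finished batch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for the "assignment" table (the real Controller uses a database).
class AssignmentTableSketch {

    // One row of the table; the field names are illustrative assumptions.
    static final class Row {
        final String publicationId;
        final String workerId;
        final long batchCounter;

        Row(String publicationId, String workerId, long batchCounter) {
            this.publicationId = publicationId;
            this.workerId = workerId;
            this.batchCounter = batchCounter;
        }
    }

    private final List<Row> rows = new ArrayList<>();

    // Record that a publication was handed to a Worker as part of a given batch.
    void assign(String publicationId, String workerId, long batchCounter) {
        rows.add(new Row(publicationId, workerId, batchCounter));
    }

    // Old behaviour: removes every assignment of the Worker, including batches still being processed.
    int deleteByWorker(String workerId) {
        int before = rows.size();
        rows.removeIf(r -> r.workerId.equals(workerId));
        return before - rows.size();
    }

    // New behaviour: removes only the assignments that belong to the finished batch.
    int deleteByBatch(String workerId, long batchCounter) {
        int before = rows.size();
        rows.removeIf(r -> r.workerId.equals(workerId) && r.batchCounter == batchCounter);
        return before - rows.size();
    }

    // A publication that is still assigned must not be delivered again.
    boolean isAssigned(String publicationId) {
        return rows.stream().anyMatch(r -> r.publicationId.equals(publicationId));
    }
}
```

With batch-scoped deletion, a publication from a batch that is still in flight stays in the table, so a later assignments request can exclude it.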
Assignment.java Avoid assigning the same publications multiple times to the Workers, after the recent "parallelization enhancement". 2023-07-11 17:27:23 +03:00
Attempt.java - Upgrade the results-loading process: Instead of making thousands of SQL-insert requests to Impala, we now write the results to parquet files, upload them to HDFS and then import the data into the Impala tables with just 2 requests. This results in a huge performance improvement. 2022-11-10 17:18:21 +02:00
BulkImportReport.java - Optimize the json-conversion of the "BulkImportReport". 2023-05-18 17:30:40 +03:00
Datasource.java Add the "Datasource" inside the "Task" class and include it in the Assignment. 2021-05-20 02:50:50 +03:00
DocFileData.java New feature: BulkImport full-text files from compatible datasources. 2023-05-11 03:07:55 +03:00
Error.java - Process the Error of PDF-aggregation. Distinguish between "couldRetry" and "noRetry" cases. 2021-08-05 15:43:37 +03:00
FileLocationData.java New feature: BulkImport full-text files from compatible datasources. 2023-05-11 03:07:55 +03:00
ParquetReport.java - Apply error-checking on individual CallableTasks and in tasks-batches related to the creation and upload of all the data related to the "attempt" and "payload" table. So, if no data could be uploaded for one or both tables, no "load"-queries will be executed for that/those tables. 2022-12-09 12:46:06 +02:00
Payload.java - Fix a "@JsonProperty" annotation inside "Payload.java". 2022-04-05 00:01:44 +03:00
SumParquetSuccess.java - Apply error-checking on individual CallableTasks and in tasks-batches related to the creation and upload of all the data related to the "attempt" and "payload" table. So, if no data could be uploaded for one or both tables, no "load"-queries will be executed for that/those tables. 2022-12-09 12:46:06 +02:00
Task.java Add the "Datasource" inside the "Task" class and include it in the Assignment. 2021-05-20 02:50:50 +03:00
UrlReport.java - Add the "isControllerAlive"-endpoint. 2021-09-23 15:08:52 +03:00
WorkerInfo.java - Use separate HDFS subdirectories for each worker in order to avoid seeing exceptions about "empty hdfs directory" when "loading" data to the database, because one worker has loaded data generated by multiple workers (since we use only 1 load operation for multiple parquet files). 2023-05-15 13:12:20 +03:00