- Retrieve the assignments by checking only the publication-urls against the "attempt", "assignment" and "payload" tables, not the IDs. This change allow us to: a) avoid re-attempting urls which have already been attempted multiple times (by different id-url pairs), b) avoid aggregating urls which are already inside the "payload" or "assignment" tables, even when they are related with other IDs.
In the end, we only care about the urls when choosing which records should be aggregated.
- Improve performance by using the "anti join" operator, where it fits, in order to allow the engine to use the faster "hash" operations.
- Fix an issue, where assignments, having an above-zero attempt_count, were finding their way to the results, just because they were prioritized based on their boost_level or pub_year. Apart from retrying the old failed assignments sooner, the non-yet-processed boosted-publications were pushed out to the workers much slower.
- Simplify the query, by removing the internal "ordering" and "limit", which had performance benefits when we did not need additional ordering for "level" and "pub_year". Back then, we wanted to apply the final orderings to as few rows as possible.
- Avoid processing publications which will be published in the next 5 years, counting from each "current" year, since they are not providing full-texts yet. Still allow the invalid publication-years like "2566", "9999", etc.
- Upon completing processing a workerReport, the name of the json-file will be appended with "successful" or "failed".
- Avoid deleting immediately the failed workerReports.
- Add a scheduling task to process leftover failed workerReports from previous executions of the service, only once, 12 hours after startup, in order for the workers to have participated and filled the "workersInfoMap".
- Add a scheduling task to process leftover failed workerReports from the current execution, regularly.
- Fix not iterating through the workers' subDirs when checking the last-access-time of workerReports.
- Fix not deleting the assignment records from the DB, when a failed leftover workerReport gets deleted.
- Code refactoring.
- Refactor the "bulkImportReportID".
- Add the "bulk:" prefix in the provenance value, in the DB.
- Fix not using correctly the "Lists.partition()" method.
- Make sure the "bulkImportDir" is removed from the "bulkImportDirsUnderProcessing" Set, in case of an early-error.
- Fix the "numFailedSegments"-calculation.
- Improve some messages.
- Code polishing.
- Increase the "numOfThreadsPerBulkImportProcedure" to 6.
- Fix Bulk import not working from a second-level subdirectory; the report-subDirectory was not created.
- Fix not returning the bulk-import-report as "application/json".
- Add useful messages for missing parameters.
- Change the HTTP-method for the "bulkImportFullTexts" endpoint to "POST".
- Show a structured json-response for the "bulkImportFullTexts" endpoint.
- Fix uncommon date-format.
- Remove single quotes from json-report, since they are returned as bytes, not characters.
- Optimize the generation of the json-bulkImport-report.
After that enchantment, each worker could request multiple assignment-batches, before its previous batches were processed by the Controller. This means that for each batch that was processed, the Controller was deleting from the "assignment" table, all the assignments (-batches) delivered to the Worker that brought that batch, even though the "attempt" and "payload" records for the rest of the batches were not inserted in the DB yet. So in a new assignments-batch request, the same publications that were already under processing, were delivered to the same or other Workers.
Now, for each finished batch, only the assignments of that batch are deleted from the "assignment" table.