- Handle the case when "fileUtils.constructS3FilenameAndUploadToS3()" returns "null", in "processBulkImportedFile()".
- Avoid an "IllegalArgumentException" in "Lists.partition()" when the number of files to bulkImport are fewer than the number of threads available to handle them.
- Include the last directory's "/" separator in the "fileDIR" group of the "FILEPATH_ID_EXTENSION" regex (renamed from "FILENAME_ID_EXTENSION").
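A possible shape of the renamed regex ("fileDIR" appears in the entry above; the other group names and the exact pattern are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FilePathRegexSketch {
    // The "fileDIR" group captures up to and including the last "/",
    // leaving a bare "name.extension" remainder for the other groups.
    static final Pattern FILEPATH_ID_EXTENSION =
            Pattern.compile("(?<fileDIR>.*/)?(?<fileNameID>[^/]+?)(?<fileExtension>\\.\\w+)$");

    public static void main(String[] args) {
        Matcher m = FILEPATH_ID_EXTENSION.matcher("/mnt/bulk_import/dir1/file_1.pdf");
        if (m.matches())
            System.out.println(m.group("fileDIR"));  // "/mnt/bulk_import/dir1/" (with the trailing "/")
    }
}
```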
- Fix an incomplete log-message.
- Provide the "fileLocation" argument in the "DocFileData" constructor, in "processBulkImportedFile()", even though it's not used after.
- Fix a bug where the "fileNameID" was used instead of the "OpenAireID" in the S3 location of bulk-imported files; in aggregation the "fileNameID" is the OpenAireID, but in bulk-import it is not.
- Update dependencies.
- Code polishing.
- Move common "ExecutionException" handling-code into its own method: "GenericUtils.getSelectedStackTraceForCausedException()".
- Avoid logging the same message twice.
- Code polishing.
- Create the test-DBs as views over the "initialDatabase" view and tables, to a) reduce the amount of space used by the test-DBs and b) improve test-DB creation performance (see the sketch below).
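A sketch of the idea, assuming an Impala/Hive-style SQL dialect and illustrative table names:

```java
import java.sql.SQLException;
import java.sql.Statement;

class TestDbViewsSketch {
    // Each test-DB object is a view over "initialDatabase", so no data is
    // copied: creation is near-instant and the views take almost no space.
    static void createTestDb(Statement stmt, String testDbName) throws SQLException {
        stmt.execute("CREATE DATABASE IF NOT EXISTS " + testDbName);
        for (String table : new String[]{"publication", "publication_urls", "attempt", "assignment", "payload"})
            stmt.execute("CREATE VIEW IF NOT EXISTS " + testDbName + "." + table
                    + " AS SELECT * FROM initialDatabase." + table);
    }
}
```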
- Avoid possible failures from outdated metadata.
- Use a single query with a list of the fileHashes, instead of thousands of single-md5hash-check queries (run at most 6 in parallel), which require a lot of I/O.
- Avoid checking the same fileHash multiple times, in case it is related to multiple payloads; a sketch covering this and the previous entry follows.
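A sketch of both points, with illustrative table/column names (a real implementation should also cap or parameterize the IN-list):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class FileHashCheckSketch {
    // Collecting the hashes into a Set de-duplicates hashes shared by multiple
    // payloads; the single IN-list query then replaces thousands of
    // one-hash-per-query round trips.
    static String buildHashCheckQuery(List<String> fileHashes) {
        Set<String> distinctHashes = new HashSet<>(fileHashes);
        String inList = distinctHashes.stream()
                .map(hash -> "'" + hash + "'")
                .collect(Collectors.joining(", "));
        return "SELECT md5, location FROM payload WHERE md5 IN (" + inList + ")";
    }
}
```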
- In case of a database error, avoid completely losing that worker's full-texts; instead, continue processing them.
- Reorder JOINs and predicates to reduce the computational cost.
- Remove the memory-costly "pu.url" predicates from the "where" clause, as the DB no longer contains empty urls.
- Make sure we remove the assignments of all "not-successful", old worker-reports, including those which failed to be renamed to indicate success or failure, and those which the background threads never executed (and thus never reached the renaming stage).
- Submit each task immediately for execution, instead of waiting for a scheduling thread to send all gathered tasks (up to that point) to the ExecutorService (and block until they are finished, before it can start again).
- Hold the Future of each submitted task in a synchronized list, in order to check each task's result at a scheduled time (see the sketch below).
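A sketch of the submission pattern described by the two entries above (pool size and types are illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ImmediateSubmitSketch {
    static final ExecutorService executor = Executors.newFixedThreadPool(6);
    // A synchronized list lets the submitting code add Futures concurrently,
    // while a scheduled job later drains the list and inspects each result.
    static final List<Future<Boolean>> futures = Collections.synchronizedList(new ArrayList<>());

    // Each task is handed to the ExecutorService the moment it is created;
    // nothing blocks waiting for a whole batch to finish.
    static void submitTask(Callable<Boolean> task) {
        futures.add(executor.submit(task));
    }
}
```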
- Reduce the cpu-time needed to assure the Service can shut down, by checking for "actively-executed" and "about-to-be-executed" tasks at the same time, instead of having to additionally check each worker's "shutdown"-status to verify that no active task exists.
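A sketch of the combined check, assuming the ExecutorService is a "ThreadPoolExecutor" (note that "getActiveCount()" returns an approximation, so callers may want to re-check):

```java
import java.util.concurrent.ThreadPoolExecutor;

class ShutdownCheckSketch {
    // One call covers both kinds of pending work: tasks currently running
    // ("actively-executed") and tasks still waiting in the queue
    // ("about-to-be-executed"), with no per-worker "shutdown"-status polling.
    static boolean noTasksLeft(ThreadPoolExecutor executor) {
        return (executor.getActiveCount() == 0) && executor.getQueue().isEmpty();
    }
}
```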
- Improve the threads' shutdown procedure.
- Retrieve the assignments by checking only the publication-urls against the "attempt", "assignment" and "payload" tables, not the IDs. This change allows us to: a) avoid re-attempting urls which have already been attempted multiple times (by different id-url pairs), and b) avoid aggregating urls which already exist in the "payload" or "assignment" tables, even when they are related to other IDs.
In the end, only the urls matter when choosing which records should be aggregated.
- Improve performance by using the "anti join" operator where applicable, allowing the engine to use the faster "hash" operations (see the sketch below).
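A sketch combining the last two entries, with illustrative table/column names and assuming a dialect that supports "LEFT ANTI JOIN" (e.g. Impala):

```java
class AssignmentsQuerySketch {
    // Each anti join keeps only the publication urls with no match in the
    // respective table, and lets the engine plan hash-based anti joins
    // instead of slower NOT IN / NOT EXISTS subqueries.
    static final String FIND_ASSIGNMENTS_QUERY =
            "SELECT pu.id, pu.url FROM publication_urls pu"
            + " LEFT ANTI JOIN attempt a ON a.original_url = pu.url"
            + " LEFT ANTI JOIN assignment asgn ON asgn.original_url = pu.url"
            + " LEFT ANTI JOIN payload p ON p.original_url = pu.url";
}
```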