104 Commits

Author SHA1 Message Date
Lampros Smyrnaios 39c36f9e66 - Resolve a concurrency issue, by enforcing synchronization on the "BulkImportReport.getJsonReport()" method.
- Increase the number of stacktrace-lines to 20, for bulkImport-segment-failures.
- Improve "GenericUtils.getSelectiveStackTrace()".
2024-05-01 01:29:25 +03:00
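The synchronization fix in commit 39c36f9e66 can be illustrated with a minimal sketch. The class, field, and "addEvent" method below are hypothetical stand-ins, not the project's actual "BulkImportReport" code; the assumption is that the report keeps a mutable list of events that worker threads append to while another thread serializes it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a report object shared between threads.
class SynchronizedReportSketch {

    // Assumed internal state: a plain, non-thread-safe list of event messages.
    private final List<String> events = new ArrayList<>();

    // Writers lock on the same monitor ("this") as the reader below.
    public synchronized void addEvent(String event) {
        events.add(event);
    }

    // Synchronized, so the list cannot be modified while it is being serialized.
    public synchronized String getJsonReport() {
        StringBuilder sb = new StringBuilder("{\"events\":[");
        for (int i = 0; i < events.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            // Escape backslashes and double quotes, to keep the JSON string valid.
            String escaped = events.get(i).replace("\\", "\\\\").replace("\"", "\\\"");
            sb.append('"').append(escaped).append('"');
        }
        return sb.append("]}").toString();
    }
}
```

Without the "synchronized" keyword on the reader, iterating the list while another thread appends to it could throw a "ConcurrentModificationException" or produce a partial report.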
Lampros Smyrnaios 8e14d4dbe0 - Fix not counting the files from the bulkImport-segment, which failed due to an exception.
- Write segment-exception-messages to the bulkImport-report.
2024-04-30 23:43:52 +03:00
Lampros Smyrnaios e2d43a9af0 Upgrade the "processBulkImportedFilesSegment" code:
1) Pre-calculate the file-hashes for all files of the segment and perform a single "getHashLocationsQuery", instead of thousands.
2) Write some important events to the bulkImportReport file, as soon as they are added in the list.
2024-04-02 14:35:19 +03:00
Lampros Smyrnaios 08de530f03 Various improvements:
- Handle the case when "fileUtils.constructS3FilenameAndUploadToS3()" returns "null", in "processBulkImportedFile()".
- Avoid an "IllegalArgumentException" in "Lists.partition()" when the number of files to bulkImport is lower than the number of threads available to handle them.
- Include the last directory's "/" divider in the fileDIR group of "FILEPATH_ID_EXTENSION" regex (renamed from "FILENAME_ID_EXTENSION").
- Fix an incomplete log-message.
- Provide the "fileLocation" argument in the "DocFileData" constructor, in "processBulkImportedFile()", even though it's not used afterwards.
2024-03-29 17:23:01 +02:00
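The "Lists.partition()" item in commit 08de530f03 refers to Guava's precondition that the partition size must be greater than zero. A minimal sketch of how the problem can arise and be avoided follows, assuming the segment size is derived by integer division of the file count by the thread count. The class and method names are hypothetical, and the "partition" helper is a stdlib stand-in with the same precondition, not Guava itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the project's actual bulk-import code.
class SegmentSizeSketch {

    // Integer division yields 0 when there are fewer files than threads.
    // Clamping to 1 keeps the partition size valid.
    static int segmentSize(int numFiles, int numThreads) {
        return Math.max(1, numFiles / numThreads);
    }

    // Stdlib stand-in for Guava's Lists.partition(), with the same "size > 0" precondition.
    static <T> List<List<T>> partition(List<T> list, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive: " + size);
        }
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < list.size(); i += size) {
            parts.add(new ArrayList<>(list.subList(i, Math.min(i + size, list.size()))));
        }
        return parts;
    }
}
```

With 3 files and 8 threads, the unclamped size is 3 / 8 = 0, which triggers the exception, while the clamped size of 1 gives three single-file segments.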
Lampros Smyrnaios 1d821ed803 - Prepare version for next release.
- Fix typo of not using the "OpenAireID" in the S3 location of bulkImported files. Instead, the "fileNameID" was used, which in aggregation is the OpenAireID, but not in bulk-import.
- Update dependencies.
- Code polishing.
2024-03-28 06:09:28 +02:00
Lampros Smyrnaios 8bc5cc35e2 - Optimize writing to the Bulk-import-report file.
- Show the IP of the worker which posts a "workerShutdownReport".
- Code polishing.
2024-03-22 17:50:55 +02:00
Lampros Smyrnaios 56d233d38e - Move the "FileUtils.mergeParquetFiles()" method to "ParquetFileUtils.mergeParquetFilesOfTable()".
- Fix a typo.
2024-03-20 15:25:19 +02:00
Lampros Smyrnaios 724eae1514 - Optimize the placement of "DatabaseConnector.databaseLock.unlock()" statements.
- Rename a maven-repository.
2024-03-20 15:08:01 +02:00
Lampros Smyrnaios 9b0818b535 - Add handling for additional/specific exceptions, when checking the "futures".
- Move common "ExecutionException" handling-code into its own method: "GenericUtils.getSelectedStackTraceForCausedException()".
- Avoid a double log.
- Code polishing.
2024-03-14 13:59:23 +02:00
Lampros Smyrnaios f61cae41a1 - Try to get the cause of the exception of the callable-tasks which handle the bulk-import of fileSegments.
- Fix not counting the failedSegments when an exception was thrown.
- Code polishing.
2024-03-13 12:15:59 +02:00
Lampros Smyrnaios 8f9786de09 Upgrade the algorithm for finding the previously-found fulltexts, based on their md5hash:
- Use a single query with a list of the fileHashes, instead of thousands of single-md5hash-check queries (run at most 6 in parallel) which require a lot of I/O.
- Avoid checking the same fileHash multiple times, in case it is related to multiple payloads.
- In case of a database-error, avoid completely losing the full-texts of that worker; instead, continue processing them.
2024-03-13 11:28:37 +02:00
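The approach described in commit 8f9786de09 (one query over a de-duplicated list of hashes, instead of one query per hash) can be sketched as follows. This is an illustration under stated assumptions: the table and column names are hypothetical, and the project's real query and database layer are not shown in this log.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; table and column names are assumptions.
class HashBatchQuerySketch {

    // Collapse repeated hashes so that each one is checked only once,
    // keeping the first-seen order.
    static Set<String> distinctHashes(List<String> hashes) {
        return new LinkedHashSet<>(hashes);
    }

    // Build one parameterized query with a placeholder per hash.
    static String buildQuery(int numHashes) {
        if (numHashes <= 0) {
            throw new IllegalArgumentException("At least one hash is required.");
        }
        String placeholders = String.join(", ", Collections.nCopies(numHashes, "?"));
        return "SELECT hash, location FROM payload WHERE hash IN (" + placeholders + ")";
    }
}
```

The resulting string would be passed to a prepared statement, with the distinct hashes bound to the placeholders in order.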
Lampros Smyrnaios 1048463ca0 - Improve error-handling in "S3ObjectStore.emptyBucket()".
- Change some log-levels.
- Code polishing.
2024-03-11 16:17:32 +02:00
Lampros Smyrnaios 8f18008001 Avoid performing payload-related operations in case no fulltext was received from the worker, due to an error. 2024-03-11 14:57:13 +02:00
Lampros Smyrnaios dd394f18a0 - Optimize the JOIN-order in the "findAssignmentsQuery".
- Optimize the "DOC_URL_FILTER"-regex.
- Update dependencies.
2024-03-11 11:35:38 +02:00
Lampros Smyrnaios 43ea64758d - Improve handling of the case when no fulltexts have been found or none of the found ones were requested from the worker, as they were already retrieved in the past.
- Show the number of files with problematic locations (if any of them exist).
- Code polishing.
2024-02-23 12:39:28 +02:00
Lampros Smyrnaios 5dadb8ad2f - Optimize the "DOC_URL_FILTER"-regex, by using a non-capturing group.
- Remove an extra "File.separator" from the fulltexts-fullFilePath.
2024-01-19 15:46:23 +02:00
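The non-capturing-group optimization in commit 5dadb8ad2f can be illustrated with a small sketch. The patterns below are hypothetical and much simpler than the project's "DOC_URL_FILTER"; they only show that "(?:...)" matches the same input as "(...)" without making the regex engine record a group.

```java
import java.util.regex.Pattern;

// Hypothetical patterns; not the project's actual DOC_URL_FILTER.
class NonCapturingGroupSketch {

    // The alternation is captured as group 1, even if nobody reads it.
    static final Pattern CAPTURING = Pattern.compile(".+\\.(pdf|doc|docx)$");

    // Same alternation, but the engine does not store the matched text.
    static final Pattern NON_CAPTURING = Pattern.compile(".+\\.(?:pdf|doc|docx)$");

    static boolean matches(Pattern pattern, String url) {
        return pattern.matcher(url).matches();
    }

    // Number of capturing groups declared in the pattern.
    static int groupCount(Pattern pattern) {
        return pattern.matcher("").groupCount();
    }
}
```

Both patterns accept and reject the same urls; the difference is only in the bookkeeping, which matters when the filter runs against a very large number of urls.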
Lampros Smyrnaios 3a70b57146 Prioritize the full-text urls over the landing-page ones. 2024-01-15 12:59:50 +02:00
Lampros Smyrnaios 2e60128084 - Allow to easily change the por used by workers.
- Show the number of active background-tasks and bulkImportDirs, which delay the Service's shutdown.
- Update dependencies.
- Code polishing.
2023-12-19 23:31:42 +02:00
Lampros Smyrnaios d90ad51609 Add the "shutdownAllWorkersGracefully" and "cancelShutdownAllWorkersGracefully" endpoints, in order to be able to shut them down at once and update them, without shutting down the whole Service. So in this case the bulk-import procedures will continue to work. 2023-11-29 16:45:58 +02:00
Lampros Smyrnaios 7f789b8ad0 - If we receive an "UnknownHostException" when uploading to the S3ObjectStore, then skip the current full-texts' batch to leave some time for the network to get unstuck.
- Code polishing.
2023-11-22 15:29:18 +02:00
Lampros Smyrnaios 9b1f2c4931 Improve performance and reduce memory usage of the "findAssignmentsQuery":
- Reorder JOINs and predicates to reduce the computational cost.
- Remove the memory-costly "pu.url" predicates from the "where" clause, as the DB has no empty urls anymore.
2023-10-31 15:59:48 +02:00
Lampros Smyrnaios db929d8931 - Add a scheduling job to delete assignments older than 7 days. These may be left behind when the worker throws a "SocketTimeoutException" before it can receive the assignments and process them. No workerReport gets created for those assignments.
- Improve some log-messages.
- Code polishing.
2023-10-30 12:29:54 +02:00
Lampros Smyrnaios 856c62887d - Make sure the "UTF_8" charset is used, when we get a message from the response-body.
- Improve some log-messages.
2023-10-26 11:44:23 +03:00
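The "UTF_8" item in commit 856c62887d concerns decoding a response-body with an explicit charset instead of the platform default. A minimal sketch follows, assuming the body is available as an "InputStream"; the class and method names are hypothetical and the project's HTTP client code is not shown here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not the project's actual response-handling code.
class Utf8BodyReaderSketch {

    // Decode the stream as UTF-8 explicitly, so the result does not depend
    // on the default charset of the machine running the code.
    static String readBody(InputStream body) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(body, StandardCharsets.UTF_8))) {
            char[] buffer = new char[1024];
            int charsRead;
            while ((charsRead = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, charsRead);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

Omitting the charset argument makes "InputStreamReader" fall back to the JVM's default charset, which can garble non-ASCII error messages on some hosts.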
Lampros Smyrnaios bdf834c439 - Refactor the "shutdown" script to do an orderly-shutdown, by default, by calling the "shutdownService" endpoint. In case a "force-shutdown" is needed, that can be requested with a cmd-argument.
- Fix not updating the "UrlsController.numOfWorkers" correctly.
- Code polishing.
2023-10-23 17:19:29 +03:00
Lampros Smyrnaios 40729c6295 Move similar code into the new "ParquetFileUtils.getPayloadParquetRecord()" method. 2023-10-17 12:50:51 +03:00
Lampros Smyrnaios a354da763d - Improve some log-messages.
- Increase app's version.
- Code polishing.
2023-10-06 17:28:54 +03:00
Lampros Smyrnaios 0c79fdea35 Update the "findAssignmentsQuery" to check the "attempt.error_class" field for the current pub_url, not the pub_id. 2023-10-06 14:59:26 +03:00
Lampros Smyrnaios ebf8896005 - Fix getter and setter methods for the "isAuthoritative" field.
- Update Gradle.
2023-10-05 16:31:52 +03:00
Lampros Smyrnaios 96c11ba4b8 - Add a missing change.
- Code optimization and polishing.
- Update dependencies.
2023-10-04 16:17:12 +03:00
Lampros Smyrnaios 7019f7c3c7 Improve aggregation speed, by generating additional "attempt" and "payload" records for the publications which are in the back-log and whose url matches one of the urls of the current payloads. 2023-10-04 15:43:31 +03:00
Lampros Smyrnaios b702cf4484 Upgrade the "findAssignmentsQuery":
- Retrieve the assignments by checking only the publication-urls against the "attempt", "assignment" and "payload" tables, not the IDs. This change allows us to: a) avoid re-attempting urls which have already been attempted multiple times (by different id-url pairs), b) avoid aggregating urls which are already inside the "payload" or "assignment" tables, even when they are related with other IDs.
In the end, we only care about the urls when choosing which records should be aggregated.
- Improve performance by using the "anti join" operator, where it fits, in order to allow the engine to use the faster "hash" operations.
2023-10-04 13:43:15 +03:00
Lampros Smyrnaios c9626de120 Handle the case when the "upload-file-to-S3" operation fails with a "ConnectException". In this case, all remaining upload operations for the files of that particular batch or segment are canceled. 2023-10-04 13:01:13 +03:00
Lampros Smyrnaios 865926fbc3 - Handle the case when some results have been found from the "getAssignmentsQuery", but no data could be extracted from them.
- Code polishing.
2023-10-02 15:46:55 +03:00
Lampros Smyrnaios ede7ca5a89 - Add bulk-import support for non-Authoritative data-sources.
- Update Spring Boot.
- Code polishing.
2023-09-26 18:02:48 +03:00
Lampros Smyrnaios 90a864ea61 Add more info in bulk-import logs. 2023-09-20 17:50:10 +03:00
Lampros Smyrnaios 360731ba72 - Improve handling of the "NO_CONTENT" case, in "getAssignments"-endpoint.
- Code optimization and polishing.
2023-09-14 13:53:01 +03:00
Lampros Smyrnaios b4f91f188e Fix the "retries-num" appearing in log-messages. 2023-09-14 12:08:33 +03:00
Lampros Smyrnaios 02bae38885 - Improve response-time to "getAssignments"-requests, by avoiding merging the parquet files of the "assignment" table, right after acquiring the assignments from the DB. They are already getting merged, when each assignments-batch is deleted after a workerReport has been processed.
- Optimize code-positioning for unlocking the DB when done executing queries.
2023-09-13 17:03:11 +03:00
Lampros Smyrnaios 8fdb8e9137 Add renaming of the workerReport-file, to indicate failure, when the processing failed because no workerInfo was found for the worker-id existing in the report. This way, it can be retried by the scheduler later. 2023-09-13 16:35:41 +03:00
Lampros Smyrnaios c98e8df323 Move the "getRenamedWorkerReport"-code in its own method. 2023-09-13 16:27:18 +03:00
Lampros Smyrnaios 6891c467d4 - Avoid displaying a warning for the "test" HDFS directory, when the Controller is running in PROD mode.
- Add a missing change for the optimization of reading files.
- Update dependencies.
2023-09-13 15:29:30 +03:00
Lampros Smyrnaios 3dd349dd00 Improve the "findAssignmentsQuery":
- Fix an issue, where assignments, having an above-zero attempt_count, were finding their way to the results, just because they were prioritized based on their boost_level or pub_year. Apart from retrying the old failed assignments sooner, the not-yet-processed boosted-publications were pushed out to the workers much more slowly.
- Simplify the query, by removing the internal "ordering" and "limit", which had performance benefits when we did not need additional ordering for "level" and "pub_year". Back then, we wanted to apply the final orderings to as few rows as possible.
2023-09-13 14:38:15 +03:00
Lampros Smyrnaios ee2df19ce1 - Allow "pretty-printing" the json response of the "getBulkImportReport" endpoint.
- Add useful log-messages for various bulk-import stages and improve the current ones.
- Optimize reading and writing the reports.
2023-09-11 17:24:39 +03:00
Lampros Smyrnaios 6944678391 Improve error-handling when renaming workerReport-files. 2023-09-08 17:41:10 +03:00
Lampros Smyrnaios 1c8f3765ca - Fix not acquiring the full workerReport when retrying it, with the scheduler.
- Improve error-handling in the "inspectWorkerReportsAndTakeAction" process.
- Code polishing.
2023-09-08 14:59:48 +03:00
Lampros Smyrnaios e72a4d3d10 - Improve handling of already renamed workerReport-files, which relate to failed workerReports in the 1st try.
- Add check for workerReport-files, which may have been deleted before their time, due to an error.
2023-09-08 14:11:41 +03:00
Lampros Smyrnaios bd9245cc3d Avoid deleting the assignment-records in case of a "parquet-data creation, upload or insertion" problem, in order to avoid double-processing of the urls, until the report has been retried by the scheduler. 2023-09-08 13:44:24 +03:00
Lampros Smyrnaios 718f5cfefb - Improve prioritization of the most recent publications.
- Avoid processing publications which will be published in the next 5 years, counting from each "current" year, since they are not providing full-texts yet. Still allow the invalid publication-years like "2566", "9999", etc.
2023-09-07 14:05:58 +03:00
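The publication-year rule in commit 718f5cfefb can be sketched as a simple predicate. This is only one reading of the commit message: the exact boundaries (whether the window is inclusive of "current year + 5") are an assumption, the names are hypothetical, and the project most likely applies this rule inside its SQL query rather than in Java.

```java
// Hypothetical predicate; boundary handling is an assumption.
class PublicationYearFilterSketch {

    static final int FUTURE_WINDOW_YEARS = 5;

    // Skip years that fall within the next 5 years, but keep past years,
    // the current year, and clearly invalid far-future years (e.g. 2566, 9999).
    static boolean shouldProcess(int pubYear, int currentYear) {
        boolean inNearFuture = pubYear > currentYear
                && pubYear <= currentYear + FUTURE_WINDOW_YEARS;
        return !inNearFuture;
    }
}
```

Under these assumptions, with 2023 as the current year, a publication dated 2025 is skipped, while ones dated 2023, 2566 or 9999 are still processed.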
Lampros Smyrnaios 199105f7f1 Fix not writing some bulk-import error-messages to the logs. Instead, they were only written to the json-reports. 2023-09-04 16:33:27 +03:00
Lampros Smyrnaios acef891167 Improve prioritization of "publication_boost" records, by adding a second ordering in the end. 2023-09-04 15:34:37 +03:00