Commit Graph

244 Commits

Author SHA1 Message Date
Lampros Smyrnaios db929d8931 - Add a scheduling job to delete assignments older than 7 days. These may be left behind when the worker throws a "SocketTimeoutException" before it can receive the assignments and process them. No workerReport gets created for those assignments.
- Improve some log-messages.
- Code polishing.
2023-10-30 12:29:54 +02:00
Lampros Smyrnaios 856c62887d - Make sure the "UTF_8" charset is used, when we get a message from the response-body.
- Improve some log-messages.
2023-10-26 11:44:23 +03:00
Lampros Smyrnaios bdf834c439 - Refactor the "shutdown" script to do an orderly-shutdown, by default, by calling the "shutdownService" endpoint. In case a "force-shutdown" is needed, that can be requested with a cmd-argument.
- Fix not updating the "UrlsController.numOfWorkers" correctly.
- Code polishing.
2023-10-23 17:19:29 +03:00
Lampros Smyrnaios 0c7bf6357b - Improve performance in "FileUtils.addUrlReportsByMatchingRecordsFromBacklog()".
- Make sure we remove the assignments of all "not-successful", old, worker-reports, even for the ones which failed to be renamed to indicate success or failure, or failed to be executed by the background threads (and thus never reached the renaming stage).
2023-10-23 12:21:42 +03:00
Lampros Smyrnaios a7581335f1 - Improve the "getDataForPayloadPrefillQuery".
- Improve some error-messages.
2023-10-21 11:31:31 +03:00
Lampros Smyrnaios 44c2fe7418 - Fix the "IndexOutOfBoundsException", when checking the futures' results.
- Update dependencies.
2023-10-20 14:25:05 +03:00
Lampros Smyrnaios df0ea62a5a - Handle the case when the "webHDFSBaseUrl" does not use HTTPS.
- Improve error-reporting when uploading a file to HDFS.
2023-10-19 11:59:37 +03:00
Lampros Smyrnaios 40729c6295 Move similar code into the new "ParquetFileUtils.getPayloadParquetRecord()" method. 2023-10-17 12:50:51 +03:00
Lampros Smyrnaios f05eee7569 Improve the names of some methods. 2023-10-16 23:39:43 +03:00
Lampros Smyrnaios def21b991d Improve the UX of the "installAndRun.sh" script. 2023-10-09 17:28:22 +03:00
Lampros Smyrnaios fb2877dbe8 Upgrade the execution system for the backgroundTasks:
- Submit each task immediately for execution, instead of waiting for a scheduling thread to send all gathered tasks (up to that point) to the ExecutorService (and block until they are finished, before it can start again).
- Hold the Future of each submitted task to a synchronized-list to check the result of each task at a scheduled time.
- Reduce the cpu-time to assure the Service can shut down, by checking if there are "actively" and "about-to-be-executed" tasks, at the same time. Instead of having to rely on the additional checking of the "shutdown"-status of each worker to verify that no active task exist.
- Improve the threads' shutdown procedure.
2023-10-09 17:23:59 +03:00
Lampros Smyrnaios a354da763d - Improve some log-messages.
- Increase app's version.
- Code polishing.
2023-10-06 17:28:54 +03:00
Lampros Smyrnaios 0c79fdea35 Update the "findAssignmentsQuery" to check the "attempt.error_class" field for the current pub_url, not the pub_id. 2023-10-06 14:59:26 +03:00
Lampros Smyrnaios ebf8896005 - Fix getter and setter methods for the "isAuthoritative" field.
- Update Gradle.
2023-10-05 16:31:52 +03:00
Lampros Smyrnaios b2ce6393c1 - Add check for remaining "bulkImportDirsUnderProcessing", before shutting down the Service.
- Code polishing.
2023-10-05 13:43:47 +03:00
Lampros Smyrnaios 96c11ba4b8 - Add a missing change.
- Code optimization and polishing.
- Update dependencies.
2023-10-04 16:17:12 +03:00
Lampros Smyrnaios 7019f7c3c7 Improve aggregation speed, by generating additional "attempt" and "payload" records for the publications which are in the back-log and their url matches to one of the urls of the current payloads. 2023-10-04 15:43:31 +03:00
Lampros Smyrnaios b702cf4484 Upgrade the "findAssignmentsQuery":
- Retrieve the assignments by checking only the publication-urls against the "attempt", "assignment" and "payload" tables, not the IDs. This change allow us to: a) avoid re-attempting urls which have already been attempted multiple times (by different id-url pairs), b) avoid aggregating urls which are already inside the "payload" or "assignment" tables, even when they are related with other IDs.
In the end, we only care about the urls when choosing which records should be aggregated.
- Improve performance by using the "anti join" operator, where it fits, in order to allow the engine to use the faster "hash" operations.
2023-10-04 13:43:15 +03:00
Lampros Smyrnaios c9626de120 Handle the case when the "upload-file-to-S3" operation fails with a "ConnectException". In this case, all remaining upload operations for the files of that particular batch or segment, are canceled. 2023-10-04 13:01:13 +03:00
Lampros Smyrnaios 865926fbc3 - Handle the case when some results have been found from the "getAssignmentsQuery", but no data could be extracted from them.
- Code polishing.
2023-10-02 15:46:55 +03:00
Lampros Smyrnaios ede7ca5a89 - Add bulk-import support for non-Authoritative data-sources.
- Update Spring Boot.
- Code polishing.
2023-09-26 18:02:48 +03:00
Lampros Smyrnaios 90a864ea61 Add more info in bulk-import logs. 2023-09-20 17:50:10 +03:00
Lampros Smyrnaios 0f5d4dac78 Check and show warning/error message for failed payloads. 2023-09-20 17:38:22 +03:00
Lampros Smyrnaios 068b97dd60 Set Xms and Xmx Java-parameters when running the Jar, in Docker. 2023-09-15 14:19:46 +03:00
Lampros Smyrnaios 903c3e1ffc Add thread-safety when reading the bulkImportReport-files. 2023-09-15 11:54:32 +03:00
Lampros Smyrnaios 846c53913f Add LICENSE. 2023-09-14 16:05:36 +03:00
Lampros Smyrnaios 360731ba72 - Improve handling of the "NO_CONTENT" case, in "getAssignments"-endpoint.
- Code optimization and polishing.
2023-09-14 13:53:01 +03:00
Lampros Smyrnaios b4f91f188e Fix the "retries-num" appearing in log-messages. 2023-09-14 12:08:33 +03:00
Lampros Smyrnaios 02bae38885 - Improve response-time to "getAssignments"-requests, by avoiding merging the parquet files of the "assignment" table, right after acquiring the assignments from the DB. They are already getting merged, when each assignments-batch is deleted after a workerReport has been processed.
- Optimize code-positioning for unlocking the DB when done executing queries.
2023-09-13 17:03:11 +03:00
Lampros Smyrnaios 8fdb8e9137 Add renaming of the workerReport-file, to indicate failure, when the processing failed because no workerInfo was found for the worker-id existing in the report. This way, it can be retried by the scheduler later. 2023-09-13 16:35:41 +03:00
Lampros Smyrnaios c98e8df323 Move the "getRenamedWorkerReport"-code in its own method. 2023-09-13 16:27:18 +03:00
Lampros Smyrnaios 6891c467d4 - Avoid displaying a warning for the "test" HDFS directory, when the Controller is running in PROD mode.
- Add a missing change for the optimization of reading files.
- Update dependencies.
2023-09-13 15:29:30 +03:00
Lampros Smyrnaios 3dd349dd00 Improve the "findAssignmentsQuery":
- Fix an issue, where assignments, having an above-zero attempt_count, were finding their way to the results, just because they were prioritized based on their boost_level or pub_year. Apart from retrying the old failed assignments sooner, the non-yet-processed boosted-publications were pushed out to the workers much slower.
- Simplify the query, by removing the internal "ordering" and "limit", which had performance benefits when we did not need additional ordering for "level" and "pub_year". Back then, we wanted to apply the final orderings to as few rows as possible.
2023-09-13 14:38:15 +03:00
Lampros Smyrnaios ee2df19ce1 - Allow "pretty-printing" the json response of the "getBulkImportReport" endpoint.
- Add useful log-messages for various bulk-import stages and improve the current ones.
- Optimize reading and writing the reports.
2023-09-11 17:24:39 +03:00
Lampros Smyrnaios 6944678391 Improve error-handling when renaming workerReport-files. 2023-09-08 17:41:10 +03:00
Lampros Smyrnaios 1c8f3765ca - Fix not acquiring the full workerReport when retrying it, with the scheduler.
- Improve error-handling in the "inspectWorkerReportsAndTakeAction" process.
- Code polishing.
2023-09-08 14:59:48 +03:00
Lampros Smyrnaios e72a4d3d10 - Improve handling of already renamed workerReport-files, which relate to failed workerReports in the 1st try.
- Add check for workerReport-files, which may have been deleted before their time, due to an error.
2023-09-08 14:11:41 +03:00
Lampros Smyrnaios bd9245cc3d Avoid deleting the assignment-records in case of a "parquet-data creation, upload or insertion" problem, in order to avoid double-processing of the urls, until the report has been retried by the scheduler. 2023-09-08 13:44:24 +03:00
Lampros Smyrnaios 718f5cfefb - Improve prioritization of the most recent publications.
- Avoid processing publications which will be published in the next 5 years, counting from each "current" year, since they are not providing full-texts yet. Still allow the invalid publication-years like "2566", "9999", etc.
2023-09-07 14:05:58 +03:00
Lampros Smyrnaios 4014d1eabb Code polishing. 2023-09-05 15:20:03 +03:00
Lampros Smyrnaios 199105f7f1 Fix not writing some bulk-import error-messages to the logs. Instead, they were only written to the json-reports. 2023-09-04 16:33:27 +03:00
Lampros Smyrnaios acef891167 Improve prioritization of "publication_boost" records, by adding a second ordering in the end. 2023-09-04 15:34:37 +03:00
Lampros Smyrnaios 98516498eb - Increase app's version.
- Code polishing.
2023-09-04 12:46:55 +03:00
Lampros Smyrnaios febe2b212c Upgrade management of failed workerReports:
- Upon completing processing a workerReport, the name of the json-file will be appended with "successful" or "failed".
- Avoid deleting immediately the failed workerReports.
- Add a scheduling task to process leftover failed workerReports from previous executions of the service, only once, 12 hours after startup, in order for the workers to have participated and filled the "workersInfoMap".
- Add a scheduling task to process leftover failed workerReports from the current execution, regularly.
- Fix not iterating through the workers' subDirs when checking the last-access-time of workerReports.
- Fix not deleting the assignment records from the DB, when a failed leftover workerReport gets deleted.
- Code refactoring.
2023-09-01 15:10:58 +03:00
Lampros Smyrnaios 5c459a3a16 Optimize handling of HTTP-4XX errors in "UrlsServiceImpl.postReportResultToWorker()". 2023-08-31 13:20:12 +03:00
Lampros Smyrnaios 601776e81c - Fix not handling for the case where the info about the worker in the WorkerReport, does not exist inside the "workersInfoMap", as that worker is not participating in the Service. (this case may appear in future code)
- Code polishing.
2023-08-30 17:07:51 +03:00
Lampros Smyrnaios c32dfa882e Fix not deleting the assignment-records, for every workerReport, after processing it. 2023-08-30 16:22:58 +03:00
Lampros Smyrnaios aa3f32f3da - Make sure the given number of threads, given by the user is above zero.
- Adjust the number and size of log files.
- Update Spring Boot.
- Code polishing.
2023-08-30 14:02:54 +03:00
Lampros Smyrnaios 44459c8681 - Rename "ImpalaConnector.java" to "DatabaseConnector.java".
- Update dependencies.
- Code polishing.
2023-08-23 16:55:23 +03:00
Lampros Smyrnaios b3e0d214fd Update the BulkImport API:
- Refactor the "bulkImportReportID".
- Add the "bulk:" prefix in the provenance value, in the DB.
- Fix not using correctly the "Lists.partition()" method.
- Make sure the "bulkImportDir" is removed from the "bulkImportDirsUnderProcessing" Set, in case of an early-error.
- Fix the "numFailedSegments"-calculation.
- Improve some messages.
- Code polishing.
2023-08-21 18:19:53 +03:00