Commit Graph

43 Commits

Author SHA1 Message Date
Lampros Smyrnaios 9b0818b535 - Add handling for additional/specific exceptions, when checking the "futures".
- Move common "ExecutionException" handling-code into its own method: "GenericUtils.getSelectedStackTraceForCausedException()".
- Avoid a double log.
- Code polishing.
2024-03-14 13:59:23 +02:00
Lampros Smyrnaios 1048463ca0 - Improve error-handling in "S3ObjectStore.emptyBucket()".
- Change some log-levels.
- Code polishing.
2024-03-11 16:17:32 +02:00
Lampros Smyrnaios 3563fd6e2a - Try to get the cause of the exception of the callable-tasks which handle parquet-files.
- Update License.
- Update dependencies.
2024-02-07 18:34:28 +02:00
Lampros Smyrnaios bdc61c2cda When at least one worker is still active and have to wait for service-shutdown, show a log-message to inform the user, including that worker's IP. 2024-01-15 13:35:22 +02:00
Lampros Smyrnaios 2e60128084 - Allow to easily change the por used by workers.
- Show the number of active background-tasks and bulkImportDirs, which delay the Service's shutdown.
- Update dependencies.
- Code polishing.
2023-12-19 23:31:42 +02:00
Lampros Smyrnaios d20c9a7d2e - Show the original exception thrown by the background-job, not the one thrown in the main-thread, which is useless, except from its message.
- Reduce the interval for deleting the unhandled assignments to once every 3 days.
- Set the upcoming version.
- Update dependencies.
2023-11-27 18:19:53 +02:00
Lampros Smyrnaios 7f789b8ad0 - If we receive an "UnknownHostException" when uploading to the S3ObjectStore, then skip the current full-texts' batch to leave some time for the network to get unstuck.
- Code polishing.
2023-11-22 15:29:18 +02:00
Lampros Smyrnaios db929d8931 - Add a scheduling job to delete assignments older than 7 days. These may be left behind when the worker throws a "SocketTimeoutException" before it can receive the assignments and process them. No workerReport gets created for those assignments.
- Improve some log-messages.
- Code polishing.
2023-10-30 12:29:54 +02:00
Lampros Smyrnaios 856c62887d - Make sure the "UTF_8" charset is used, when we get a message from the response-body.
- Improve some log-messages.
2023-10-26 11:44:23 +03:00
Lampros Smyrnaios 0c7bf6357b - Improve performance in "FileUtils.addUrlReportsByMatchingRecordsFromBacklog()".
- Make sure we remove the assignments of all "not-successful", old, worker-reports, even for the ones which failed to be renamed to indicate success or failure, or failed to be executed by the background threads (and thus never reached the renaming stage).
2023-10-23 12:21:42 +03:00
Lampros Smyrnaios 44c2fe7418 - Fix the "IndexOutOfBoundsException", when checking the futures' results.
- Update dependencies.
2023-10-20 14:25:05 +03:00
Lampros Smyrnaios fb2877dbe8 Upgrade the execution system for the backgroundTasks:
- Submit each task immediately for execution, instead of waiting for a scheduling thread to send all gathered tasks (up to that point) to the ExecutorService (and block until they are finished, before it can start again).
- Hold the Future of each submitted task to a synchronized-list to check the result of each task at a scheduled time.
- Reduce the cpu-time to assure the Service can shut down, by checking if there are "actively" and "about-to-be-executed" tasks, at the same time. Instead of having to rely on the additional checking of the "shutdown"-status of each worker to verify that no active task exist.
- Improve the threads' shutdown procedure.
2023-10-09 17:23:59 +03:00
Lampros Smyrnaios ebf8896005 - Fix getter and setter methods for the "isAuthoritative" field.
- Update Gradle.
2023-10-05 16:31:52 +03:00
Lampros Smyrnaios b2ce6393c1 - Add check for remaining "bulkImportDirsUnderProcessing", before shutting down the Service.
- Code polishing.
2023-10-05 13:43:47 +03:00
Lampros Smyrnaios 96c11ba4b8 - Add a missing change.
- Code optimization and polishing.
- Update dependencies.
2023-10-04 16:17:12 +03:00
Lampros Smyrnaios ede7ca5a89 - Add bulk-import support for non-Authoritative data-sources.
- Update Spring Boot.
- Code polishing.
2023-09-26 18:02:48 +03:00
Lampros Smyrnaios 6891c467d4 - Avoid displaying a warning for the "test" HDFS directory, when the Controller is running in PROD mode.
- Add a missing change for the optimization of reading files.
- Update dependencies.
2023-09-13 15:29:30 +03:00
Lampros Smyrnaios 1c8f3765ca - Fix not acquiring the full workerReport when retrying it, with the scheduler.
- Improve error-handling in the "inspectWorkerReportsAndTakeAction" process.
- Code polishing.
2023-09-08 14:59:48 +03:00
Lampros Smyrnaios febe2b212c Upgrade management of failed workerReports:
- Upon completing processing a workerReport, the name of the json-file will be appended with "successful" or "failed".
- Avoid deleting immediately the failed workerReports.
- Add a scheduling task to process leftover failed workerReports from previous executions of the service, only once, 12 hours after startup, in order for the workers to have participated and filled the "workersInfoMap".
- Add a scheduling task to process leftover failed workerReports from the current execution, regularly.
- Fix not iterating through the workers' subDirs when checking the last-access-time of workerReports.
- Fix not deleting the assignment records from the DB, when a failed leftover workerReport gets deleted.
- Code refactoring.
2023-09-01 15:10:58 +03:00
Lampros Smyrnaios 44459c8681 - Rename "ImpalaConnector.java" to "DatabaseConnector.java".
- Update dependencies.
- Code polishing.
2023-08-23 16:55:23 +03:00
Lampros Smyrnaios b3e0d214fd Update the BulkImport API:
- Refactor the "bulkImportReportID".
- Add the "bulk:" prefix in the provenance value, in the DB.
- Fix not using correctly the "Lists.partition()" method.
- Make sure the "bulkImportDir" is removed from the "bulkImportDirsUnderProcessing" Set, in case of an early-error.
- Fix the "numFailedSegments"-calculation.
- Improve some messages.
- Code polishing.
2023-08-21 18:19:53 +03:00
Lampros Smyrnaios 8d8a387ff2 Reduce the waiting time for new background tasks to be scheduled for processing. 2023-07-24 20:33:56 +03:00
Lampros Smyrnaios 9cbac77c2a - Add check for "shouldShutdownService" before allowing to continue with a bulk-import request.
- Add check for remaining background tasks (including bulkImports), before checking if the workers have shut down and then shut down the Service.
2023-07-21 16:19:00 +03:00
Lampros Smyrnaios cec2531737 - Increase the "numOfBackgroundThreads" to 8.
- Make the "numOfBackgroundThreads" and "numOfThreadsPerBulkImportProcedure" configurable from the "application.yml" file.
- Code polishing.
2023-07-21 11:45:50 +03:00
Lampros Smyrnaios b94c35c66e - Fix double active "@Scheduled" annotation for the "ScheduledTasks.updatePrometheusMetrics()" method.
- Code polishing.
2023-07-13 18:32:45 +03:00
Lampros Smyrnaios e8644cb64f - Optimize the "insertAssignmentsQuery".
- Add documentation about the Prometheus Metrics, in README.
- Update Dependencies.
- Code polishing.
2023-07-05 17:10:30 +03:00
Lampros Smyrnaios 0f4b63c4a9 Expose the following statistics as prometheus-metrics and create/update a stats-endpoint for each one:
- "numOfPayloadsAggregatedByServiceThroughCrawling"
 - "numOfPayloadsAggregatedByServiceThroughBulkImport"
 - "numOfPayloadsAggregatedByService"
 - "numOfLegacyPayloads"
 - "numOfRecordsInspectedByServiceThroughCrawling" (renamed from "numOfInspectedRecords")
2023-06-23 15:22:26 +03:00
Lampros Smyrnaios b9712bed85 - Expose the "numOfAllPayloads" and "numOfInspectedRecords" DB-stats to Prometheus, by using a scheduling task to request the numbers from the DB, every 6 hours.
- Update the "StatsServiceImpl.getNumberOfPayloadsAggregatedByService()" to use the new table "payload_aggregated", instead of casting and checking the date of the records.
- Code polishing.
2023-06-19 14:42:00 +03:00
Lampros Smyrnaios 798fa09d68 - Identify and handle a possible Worker-crash, in "UrlsServiceImpl.postReportResultToWorker()".
- Add/Improve some log messages.
- Update and cleanup dependencies.
- Code polishing.
2023-06-15 23:19:36 +03:00
Lampros Smyrnaios 6669dc61bf - Increase the initialDelay for the "checkIfServiceIsReadyForShutdown" scheduled-task, in production, to 10 minutes.
- Code polishing.
2023-06-06 16:49:53 +03:00
Lampros Smyrnaios a38d6ace79 Code polishing. 2023-05-29 12:21:48 +03:00
Lampros Smyrnaios 74ff31fc64 - Show the workerIPs in the logs.
- Rename the "FullTexts"-files to "BulkImport".
2023-05-29 12:12:08 +03:00
Lampros Smyrnaios 3988eb3a48 - Use a separate HDFS sub-dir for every assignments-batch, in order to avoid any disrruptancies from multiple threads moving parquet-files from the same sub-dir. Multiple batches from the same worker may be processed at the same time. These sub-dirs are deleted afterwards.
- Treat the "contains no visible files" situation as an error. In which case the assignments-data is presumed to not have been inserted to the database tables.
- Code polishing/cleanup.
2023-05-27 02:36:05 +03:00
Lampros Smyrnaios 02cee097d4 Fix an issue, which could cause some background jobs to be executed more than 1 times. The previously executed jobs were not deleted from the global list fast enough, and they would be selected again, in case they were not finished before the scheduler started again. 2023-05-26 13:08:00 +03:00
Lampros Smyrnaios 2b50e08bf6 - Handle the case, were multiple threads may load the same HDFS directory to a database table, thus causing the "directory contains no visible files"-SQLException.
- Improve the values of the delays for some scheduledTasks.
- Improve elapsed time precision for the "lastAccessedOn" metadata of the workerReports.
- Code polishing.
2023-05-25 00:34:36 +03:00
Lampros Smyrnaios 164245cb53 - Automatically delete the unsuccessful WorkerReports, which are more than 7 days old.
- Optimize the Service's startup speed, by setting "initialDelays" to the scheduled tasks.
- Optimize documentation.
2023-05-24 16:59:42 +03:00
Lampros Smyrnaios 551c4acef5 Fix property naming missmatch. 2023-05-24 14:49:29 +03:00
Lampros Smyrnaios 0ea3e2de24 Add the "shutdownService" and "cancelShutdownService" endpoints. The Controller sends the related requests to the Workers and shutdowns gracefully, after all workers have shutdown. 2023-05-24 13:42:29 +03:00
Lampros Smyrnaios b6e8cd1889 New feature: BulkImport full-text files from compatible datasources. 2023-05-11 03:07:55 +03:00
Lampros Smyrnaios 4280f89296 - Set the default value of the "isTestEnvironment" property to "true", in order to avoid undesired outcomes in the production db.
- Code polishing.
2023-03-21 17:04:28 +02:00
Lampros Smyrnaios 33ba3e8d91 - Avoid getting and uploading (to S3), full-texts which are already uploaded by previous assignments-batches.
- Fix not updating the fileLocation with the s3Url for records which share the same full-text.
- Set only one delete-order for each assignments-batch-files, not one (or more, by mistake) per zip-batch.
- Set the HttpStatus to "204 - NO_CONTENT", when no assignments are available to be returned to the Worker.
- Fix not unlocking the "dataBaseLock" in case of a "dataBase-connection"-error, in "addWorkerReport()".
- Improve some log-messages.
- Change the log-level for the "S3-bucket already exists" message.
- Update Gradle.
- Optimize imports.
- Code cleanup.
2021-12-21 15:55:27 +02:00
Lampros Smyrnaios d100af35d0 - Implement the "getUrls" and "addWorkerReport" endpoints with full database-handling.
- Add connectivity with an Impala-database and create a dedicated Controller for future statistics-requests.
- Optimize the "getTestUrls"-endpoint.
- Disable the "reportCurrentTime()" scheduled-task.
- Update dependencies and bump project's version to '1.0.0-SNAPSHOT'.
- Set the logging-appender to "File".
- Code cleanup.
2021-11-09 23:59:27 +02:00
Lampros Smyrnaios 8a4376da9c Initial commit of UrlsController. 2021-03-16 15:25:15 +02:00