Commit Graph

71 Commits

Author SHA1 Message Date
Lampros Smyrnaios 8bc5cc35e2 - Optimize writing to the Bulk-import-report file.
- Show the IP of the worker which posts a "workerShutdownReport".
- Code polishing.
2024-03-22 17:50:55 +02:00
Lampros Smyrnaios 56d233d38e - Move the "FileUtils.mergeParquetFiles()" method to "ParquetFileUtils.mergeParquetFilesOfTable()".
- Fix a typo.
2024-03-20 15:25:19 +02:00
Lampros Smyrnaios 724eae1514 - Optimize the placement of "DatabaseConnector.databaseLock.unlock()" statements.
- Rename a maven-repository.
2024-03-20 15:08:01 +02:00
Lampros Smyrnaios 8f9786de09 Upgrade the algorithm for finding the previously-found fulltexts, based on their md5hash:
- Use a single query with a list of the fileHashes, instead of thousands of single-md5hash-check queries (run at most 6 in parallel) which require a lot of I/O.
- Avoid checking the same fileHash multiple times, in case it is related to multiple payloads.
- In case of a database-error, avoid completely losing the full-texts of that worker; instead, continue processing the full-texts.
2024-03-13 11:28:37 +02:00
Lampros Smyrnaios 1048463ca0 - Improve error-handling in "S3ObjectStore.emptyBucket()".
- Change some log-levels.
- Code polishing.
2024-03-11 16:17:32 +02:00
Lampros Smyrnaios 8f18008001 Avoid performing payload-related operations in case no fulltext was received from the worker, due to an error. 2024-03-11 14:57:13 +02:00
Lampros Smyrnaios dd394f18a0 - Optimize the JOIN-order in the "findAssignmentsQuery".
- Optimize the "DOC_URL_FILTER"-regex.
- Update dependencies.
2024-03-11 11:35:38 +02:00
Lampros Smyrnaios 43ea64758d - Improve handling of the case when no fulltexts have been found or none of the found ones were requested from the worker, as they were already retrieved in the past.
- Show the number of files with problematic locations (if any of them exist).
- Code polishing.
2024-02-23 12:39:28 +02:00
Lampros Smyrnaios 5dadb8ad2f - Optimize the "DOC_URL_FILTER"-regex, by using a non-capturing group.
- Remove an extra "File.separator" from the fulltexts-fullFilePath.
2024-01-19 15:46:23 +02:00
Lampros Smyrnaios 3a70b57146 Prioritize the full-text urls over the landing-page ones. 2024-01-15 12:59:50 +02:00
Lampros Smyrnaios 2e60128084 - Allow easily changing the port used by workers.
- Show the number of active background-tasks and bulkImportDirs, which delay the Service's shutdown.
- Update dependencies.
- Code polishing.
2023-12-19 23:31:42 +02:00
Lampros Smyrnaios 9b1f2c4931 Improve performance and reduce memory usage of the "findAssignmentsQuery":
- Reorder JOINs and predicates to reduce the computational cost.
- Remove the memory-costly "pu.url" predicates from the "where" clause, as the DB has no empty urls anymore.
2023-10-31 15:59:48 +02:00
Lampros Smyrnaios db929d8931 - Add a scheduling job to delete assignments older than 7 days. These may be left behind when the worker throws a "SocketTimeoutException" before it can receive the assignments and process them. No workerReport gets created for those assignments.
- Improve some log-messages.
- Code polishing.
2023-10-30 12:29:54 +02:00
Lampros Smyrnaios 856c62887d - Make sure the "UTF_8" charset is used, when we get a message from the response-body.
- Improve some log-messages.
2023-10-26 11:44:23 +03:00
Lampros Smyrnaios bdf834c439 - Refactor the "shutdown" script to do an orderly-shutdown, by default, by calling the "shutdownService" endpoint. In case a "force-shutdown" is needed, that can be requested with a cmd-argument.
- Fix not updating the "UrlsController.numOfWorkers" correctly.
- Code polishing.
2023-10-23 17:19:29 +03:00
Lampros Smyrnaios a354da763d - Improve some log-messages.
- Increase app's version.
- Code polishing.
2023-10-06 17:28:54 +03:00
Lampros Smyrnaios 0c79fdea35 Update the "findAssignmentsQuery" to check the "attempt.error_class" field for the current pub_url, not the pub_id. 2023-10-06 14:59:26 +03:00
Lampros Smyrnaios 96c11ba4b8 - Add a missing change.
- Code optimization and polishing.
- Update dependencies.
2023-10-04 16:17:12 +03:00
Lampros Smyrnaios 7019f7c3c7 Improve aggregation speed, by generating additional "attempt" and "payload" records for the publications which are in the back-log and whose url matches one of the urls of the current payloads. 2023-10-04 15:43:31 +03:00
Lampros Smyrnaios b702cf4484 Upgrade the "findAssignmentsQuery":
- Retrieve the assignments by checking only the publication-urls against the "attempt", "assignment" and "payload" tables, not the IDs. This change allows us to: a) avoid re-attempting urls which have already been attempted multiple times (by different id-url pairs), b) avoid aggregating urls which are already inside the "payload" or "assignment" tables, even when they are related to other IDs.
In the end, we only care about the urls when choosing which records should be aggregated.
- Improve performance by using the "anti join" operator, where it fits, in order to allow the engine to use the faster "hash" operations.
2023-10-04 13:43:15 +03:00
Lampros Smyrnaios 865926fbc3 - Handle the case when some results have been found from the "getAssignmentsQuery", but no data could be extracted from them.
- Code polishing.
2023-10-02 15:46:55 +03:00
Lampros Smyrnaios ede7ca5a89 - Add bulk-import support for non-Authoritative data-sources.
- Update Spring Boot.
- Code polishing.
2023-09-26 18:02:48 +03:00
Lampros Smyrnaios 360731ba72 - Improve handling of the "NO_CONTENT" case, in "getAssignments"-endpoint.
- Code optimization and polishing.
2023-09-14 13:53:01 +03:00
Lampros Smyrnaios 02bae38885 - Improve response-time to "getAssignments"-requests, by avoiding merging the parquet files of the "assignment" table, right after acquiring the assignments from the DB. They are already getting merged, when each assignments-batch is deleted after a workerReport has been processed.
- Optimize code-positioning for unlocking the DB when done executing queries.
2023-09-13 17:03:11 +03:00
Lampros Smyrnaios 8fdb8e9137 Add renaming of the workerReport-file, to indicate failure, when the processing failed because no workerInfo was found for the worker-id existing in the report. This way, it can be retried by the scheduler later. 2023-09-13 16:35:41 +03:00
Lampros Smyrnaios c98e8df323 Move the "getRenamedWorkerReport"-code in its own method. 2023-09-13 16:27:18 +03:00
Lampros Smyrnaios 6891c467d4 - Avoid displaying a warning for the "test" HDFS directory, when the Controller is running in PROD mode.
- Add a missing change for the optimization of reading files.
- Update dependencies.
2023-09-13 15:29:30 +03:00
Lampros Smyrnaios 3dd349dd00 Improve the "findAssignmentsQuery":
- Fix an issue where assignments having an above-zero attempt_count were finding their way to the results, just because they were prioritized based on their boost_level or pub_year. Apart from retrying the old failed assignments sooner, the not-yet-processed boosted-publications were pushed out to the workers much slower.
- Simplify the query, by removing the internal "ordering" and "limit", which had performance benefits when we did not need additional ordering for "level" and "pub_year". Back then, we wanted to apply the final orderings to as few rows as possible.
2023-09-13 14:38:15 +03:00
Lampros Smyrnaios 6944678391 Improve error-handling when renaming workerReport-files. 2023-09-08 17:41:10 +03:00
Lampros Smyrnaios 1c8f3765ca - Fix not acquiring the full workerReport when retrying it, with the scheduler.
- Improve error-handling in the "inspectWorkerReportsAndTakeAction" process.
- Code polishing.
2023-09-08 14:59:48 +03:00
Lampros Smyrnaios e72a4d3d10 - Improve handling of already renamed workerReport-files, which relate to failed workerReports in the 1st try.
- Add check for workerReport-files, which may have been deleted before their time, due to an error.
2023-09-08 14:11:41 +03:00
Lampros Smyrnaios bd9245cc3d Avoid deleting the assignment-records in case of a "parquet-data creation, upload or insertion" problem, in order to avoid double-processing of the urls, until the report has been retried by the scheduler. 2023-09-08 13:44:24 +03:00
Lampros Smyrnaios 718f5cfefb - Improve prioritization of the most recent publications.
- Avoid processing publications which will be published in the next 5 years, counting from each "current" year, since they are not providing full-texts yet. Still allow the invalid publication-years like "2566", "9999", etc.
2023-09-07 14:05:58 +03:00
Lampros Smyrnaios acef891167 Improve prioritization of "publication_boost" records, by adding a second ordering in the end. 2023-09-04 15:34:37 +03:00
Lampros Smyrnaios 98516498eb - Increase app's version.
- Code polishing.
2023-09-04 12:46:55 +03:00
Lampros Smyrnaios febe2b212c Upgrade management of failed workerReports:
- Upon completing processing a workerReport, the name of the json-file will be appended with "successful" or "failed".
- Avoid deleting immediately the failed workerReports.
- Add a scheduling task to process leftover failed workerReports from previous executions of the service, only once, 12 hours after startup, in order for the workers to have participated and filled the "workersInfoMap".
- Add a scheduling task to process leftover failed workerReports from the current execution, regularly.
- Fix not iterating through the workers' subDirs when checking the last-access-time of workerReports.
- Fix not deleting the assignment records from the DB, when a failed leftover workerReport gets deleted.
- Code refactoring.
2023-09-01 15:10:58 +03:00
Lampros Smyrnaios 5c459a3a16 Optimize handling of HTTP-4XX errors in "UrlsServiceImpl.postReportResultToWorker()". 2023-08-31 13:20:12 +03:00
Lampros Smyrnaios 601776e81c - Fix not handling the case where the info about the worker in the WorkerReport does not exist inside the "workersInfoMap", as that worker is not participating in the Service. (this case may appear in future code)
- Code polishing.
2023-08-30 17:07:51 +03:00
Lampros Smyrnaios c32dfa882e Fix not deleting the assignment-records, for every workerReport, after processing it. 2023-08-30 16:22:58 +03:00
Lampros Smyrnaios aa3f32f3da - Make sure the number of threads given by the user is above zero.
- Adjust the number and size of log files.
- Update Spring Boot.
- Code polishing.
2023-08-30 14:02:54 +03:00
Lampros Smyrnaios 44459c8681 - Rename "ImpalaConnector.java" to "DatabaseConnector.java".
- Update dependencies.
- Code polishing.
2023-08-23 16:55:23 +03:00
Lampros Smyrnaios a524375656 - Create the HDFS-subDirs before generating "callableTasks" for creating and uploading the parquetFiles.
- Delete gradle .zip file after installation.
2023-08-04 15:30:41 +03:00
Lampros Smyrnaios 860c73ea91 - Improve the "shutdownController.sh" script.
- Set names for the Prometheus and Grafana containers.
- Code polishing.
2023-07-27 18:27:48 +03:00
Lampros Smyrnaios d821ae398f Improve performance by applying the merging-procedure for the parquet files of the database tables less often, while keeping the benefits of having a relatively small maximum number of parquet files in search operations. 2023-07-24 20:28:41 +03:00
Lampros Smyrnaios cec2531737 - Increase the "numOfBackgroundThreads" to 8.
- Make the "numOfBackgroundThreads" and "numOfThreadsPerBulkImportProcedure" configurable from the "application.yml" file.
- Code polishing.
2023-07-21 11:45:50 +03:00
Lampros Smyrnaios fd1cf56863 - Avoid passing some duplicate publications to the Workers, by post-processing the assignments retrieved from Impala. The assignments may contain duplicate id-url pairs, which have different datasources, since one publication may be connected to multiple datasources.
- Code polishing.
2023-07-19 18:31:24 +03:00
Lampros Smyrnaios 8dfb58ee63 Avoid assigning the same publications multiple times to the Workers, after the recent "parallelization enhancement".
After that enhancement, each worker could request multiple assignment-batches, before its previous batches were processed by the Controller. This means that for each batch that was processed, the Controller was deleting from the "assignment" table, all the assignments (-batches) delivered to the Worker that brought that batch, even though the "attempt" and "payload" records for the rest of the batches were not inserted in the DB yet. So in a new assignments-batch request, the same publications that were already under processing, were delivered to the same or other Workers.
Now, for each finished batch, only the assignments of that batch are deleted from the "assignment" table.
2023-07-11 17:27:23 +03:00
Lampros Smyrnaios e8644cb64f - Optimize the "insertAssignmentsQuery".
- Add documentation about the Prometheus Metrics, in README.
- Update Dependencies.
- Code polishing.
2023-07-05 17:10:30 +03:00
Lampros Smyrnaios a89abe3f2f Prioritize the publications, which are specified inside the "publication_boost" table, according to their "boost-level". 2023-06-29 12:32:06 +03:00
Lampros Smyrnaios 4c3e2e6b6e - Fix not using the actual "currentAssignmentsBatch" of the workerReport itself, when creating the parquetFileNames and when reporting to the user the initialization of the "addition of the workerReport".
- Code polishing.
2023-06-27 16:08:01 +03:00
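
Illustrative note on commit 8f9786de09 (not part of the repository): that commit replaces thousands of single-md5hash-check queries with one query over a list of fileHashes, and avoids checking the same fileHash more than once. The sketch below only illustrates that general idea under stated assumptions. The class name, the helper methods, and the "payload" table with its "hash" and "location" columns are hypothetical and are not taken from the project's code.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical sketch; names and schema are assumptions, not the project's actual code.
final class FileHashQuerySketch {

    // Collapse the hashes so that each fileHash is checked only once,
    // even when it is related to multiple payloads. Insertion order is kept.
    static Set<String> distinctHashes(List<String> hashes) {
        Set<String> distinct = new LinkedHashSet<>();
        for (String hash : hashes) {
            if (hash != null && !hash.isEmpty()) {
                distinct.add(hash);
            }
        }
        return distinct;
    }

    // Build one parameterized query with an IN-list, instead of one query per hash.
    // The table and column names are assumed for illustration only.
    static String buildQuery(int numOfHashes) {
        if (numOfHashes < 1) {
            throw new IllegalArgumentException("At least one hash is required.");
        }
        StringJoiner placeholders = new StringJoiner(", ", "(", ")");
        for (int i = 0; i < numOfHashes; i++) {
            placeholders.add("?");
        }
        return "select `hash`, `location` from payload where `hash` in " + placeholders;
    }

    public static void main(String[] args) {
        List<String> hashesFromReport = Arrays.asList("aa11", "bb22", "aa11", "cc33");
        Set<String> distinct = distinctHashes(hashesFromReport);
        // The resulting query string would be passed to a PreparedStatement,
        // with each distinct hash bound to one placeholder.
        System.out.println(buildQuery(distinct.size()));
    }
}
```

With a single query, the database is contacted once per worker report rather than once per file, which is the I/O reduction the commit message describes. How the real implementation handles very long hash lists or database errors is not shown here.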