Lampros Smyrnaios
3563fd6e2a
- Try to get the cause of the exception of the callable-tasks which handle parquet-files.
...
- Update License.
- Update dependencies.
2024-02-07 18:34:28 +02:00
Lampros Smyrnaios
34d7a143e7
Add/improve documentation.
2024-02-01 14:37:29 +02:00
Lampros Smyrnaios
5dadb8ad2f
- Optimize the "DOC_URL_FILTER"-regex, by using a non-capturing group.
...
- Remove an extra "File.separator" from the fulltexts-fullFilePath.
2024-01-19 15:46:23 +02:00
Lampros Smyrnaios
bdc61c2cda
When at least one worker is still active and the service-shutdown has to wait, show a log-message to inform the user, including that worker's IP.
2024-01-15 13:35:22 +02:00
Lampros Smyrnaios
3a70b57146
Prioritize the full-text urls over the landing-page ones.
2024-01-15 12:59:50 +02:00
Lampros Smyrnaios
ee1ca8966b
- Stop requesting further workerReport-batches when, already from the 1st batch, the base-directory of that assignments-counter is not found.
...
- Update dependencies.
2024-01-15 12:57:33 +02:00
Lampros Smyrnaios
2e60128084
- Allow easily changing the port used by workers.
...
- Show the number of active background-tasks and bulkImportDirs, which delay the Service's shutdown.
- Update dependencies.
- Code polishing.
2023-12-19 23:31:42 +02:00
Lampros Smyrnaios
d90ad51609
Add the "shutdownAllWorkersGracefully" and "cancelShutdownAllWorkersGracefully" endpoints, in order to be able to shut them down at once and update them, without shutting down the whole Service. So in this case the bulk-import procedures will continue to work.
2023-11-29 16:45:58 +02:00
Lampros Smyrnaios
d20c9a7d2e
- Show the original exception thrown by the background-job, not the one thrown in the main-thread, which is useless, except for its message.
...
- Reduce the interval for deleting the unhandled assignments to once every 3 days.
- Set the upcoming version.
- Update dependencies.
2023-11-27 18:19:53 +02:00
Lampros Smyrnaios
7f789b8ad0
- If we receive an "UnknownHostException" when uploading to the S3ObjectStore, then skip the current full-texts' batch to leave some time for the network to get unstuck.
...
- Code polishing.
2023-11-22 15:29:18 +02:00
Lampros Smyrnaios
9b1f2c4931
Improve performance and reduce memory usage of the "findAssignmentsQuery":
...
- Reorder JOINs and predicates to reduce the computational cost.
- Remove the memory-costly "pu.url" predicates from the "where" clause, as the DB has no empty urls anymore.
2023-10-31 15:59:48 +02:00
Lampros Smyrnaios
db929d8931
- Add a scheduled job to delete assignments older than 7 days. These may be left behind when the worker throws a "SocketTimeoutException" before it can receive the assignments and process them. No workerReport gets created for those assignments.
...
- Improve some log-messages.
- Code polishing.
2023-10-30 12:29:54 +02:00
Lampros Smyrnaios
856c62887d
- Make sure the "UTF_8" charset is used, when we get a message from the response-body.
...
- Improve some log-messages.
2023-10-26 11:44:23 +03:00
Lampros Smyrnaios
bdf834c439
- Refactor the "shutdown" script to do an orderly-shutdown, by default, by calling the "shutdownService" endpoint. In case a "force-shutdown" is needed, that can be requested with a cmd-argument.
...
- Fix not updating the "UrlsController.numOfWorkers" correctly.
- Code polishing.
2023-10-23 17:19:29 +03:00
Lampros Smyrnaios
0c7bf6357b
- Improve performance in "FileUtils.addUrlReportsByMatchingRecordsFromBacklog()".
...
- Make sure we remove the assignments of all "not-successful", old, worker-reports, even for the ones which failed to be renamed to indicate success or failure, or failed to be executed by the background threads (and thus never reached the renaming stage).
2023-10-23 12:21:42 +03:00
Lampros Smyrnaios
a7581335f1
- Improve the "getDataForPayloadPrefillQuery".
...
- Improve some error-messages.
2023-10-21 11:31:31 +03:00
Lampros Smyrnaios
44c2fe7418
- Fix the "IndexOutOfBoundsException", when checking the futures' results.
...
- Update dependencies.
2023-10-20 14:25:05 +03:00
Lampros Smyrnaios
df0ea62a5a
- Handle the case when the "webHDFSBaseUrl" does not use HTTPS.
...
- Improve error-reporting when uploading a file to HDFS.
2023-10-19 11:59:37 +03:00
Lampros Smyrnaios
40729c6295
Move similar code into the new "ParquetFileUtils.getPayloadParquetRecord()" method.
2023-10-17 12:50:51 +03:00
Lampros Smyrnaios
f05eee7569
Improve the names of some methods.
2023-10-16 23:39:43 +03:00
Lampros Smyrnaios
def21b991d
Improve the UX of the "installAndRun.sh" script.
2023-10-09 17:28:22 +03:00
Lampros Smyrnaios
fb2877dbe8
Upgrade the execution system for the backgroundTasks:
...
- Submit each task immediately for execution, instead of waiting for a scheduling thread to send all gathered tasks (up to that point) to the ExecutorService (and block until they are finished, before it can start again).
- Hold the Future of each submitted task in a synchronized-list to check the result of each task at a scheduled time.
- Reduce the cpu-time needed to ensure the Service can shut down, by checking for "actively-executing" and "about-to-be-executed" tasks at the same time, instead of relying on the additional check of the "shutdown"-status of each worker to verify that no active tasks exist.
- Improve the threads' shutdown procedure.
2023-10-09 17:23:59 +03:00
Lampros Smyrnaios
a354da763d
- Improve some log-messages.
...
- Increase app's version.
- Code polishing.
2023-10-06 17:28:54 +03:00
Lampros Smyrnaios
0c79fdea35
Update the "findAssignmentsQuery" to check the "attempt.error_class" field for the current pub_url, not the pub_id.
2023-10-06 14:59:26 +03:00
Lampros Smyrnaios
ebf8896005
- Fix getter and setter methods for the "isAuthoritative" field.
...
- Update Gradle.
2023-10-05 16:31:52 +03:00
Lampros Smyrnaios
b2ce6393c1
- Add check for remaining "bulkImportDirsUnderProcessing", before shutting down the Service.
...
- Code polishing.
2023-10-05 13:43:47 +03:00
Lampros Smyrnaios
96c11ba4b8
- Add a missing change.
...
- Code optimization and polishing.
- Update dependencies.
2023-10-04 16:17:12 +03:00
Lampros Smyrnaios
7019f7c3c7
Improve aggregation speed, by generating additional "attempt" and "payload" records for the publications which are in the back-log and whose url matches one of the urls of the current payloads.
2023-10-04 15:43:31 +03:00
Lampros Smyrnaios
b702cf4484
Upgrade the "findAssignmentsQuery":
...
- Retrieve the assignments by checking only the publication-urls against the "attempt", "assignment" and "payload" tables, not the IDs. This change allows us to: a) avoid re-attempting urls which have already been attempted multiple times (by different id-url pairs), b) avoid aggregating urls which are already inside the "payload" or "assignment" tables, even when they are related to other IDs.
In the end, we only care about the urls when choosing which records should be aggregated.
- Improve performance by using the "anti join" operator, where it fits, in order to allow the engine to use the faster "hash" operations.
2023-10-04 13:43:15 +03:00
Lampros Smyrnaios
c9626de120
Handle the case when the "upload-file-to-S3" operation fails with a "ConnectException". In this case, all remaining upload operations for the files of that particular batch or segment are canceled.
2023-10-04 13:01:13 +03:00
Lampros Smyrnaios
865926fbc3
- Handle the case when some results have been found from the "getAssignmentsQuery", but no data could be extracted from them.
...
- Code polishing.
2023-10-02 15:46:55 +03:00
Lampros Smyrnaios
ede7ca5a89
- Add bulk-import support for non-Authoritative data-sources.
...
- Update Spring Boot.
- Code polishing.
2023-09-26 18:02:48 +03:00
Lampros Smyrnaios
90a864ea61
Add more info in bulk-import logs.
2023-09-20 17:50:10 +03:00
Lampros Smyrnaios
0f5d4dac78
Check and show warning/error message for failed payloads.
2023-09-20 17:38:22 +03:00
Lampros Smyrnaios
068b97dd60
Set Xms and Xmx Java-parameters when running the Jar, in Docker.
2023-09-15 14:19:46 +03:00
Lampros Smyrnaios
903c3e1ffc
Add thread-safety when reading the bulkImportReport-files.
2023-09-15 11:54:32 +03:00
Lampros Smyrnaios
846c53913f
Add LICENSE.
2023-09-14 16:05:36 +03:00
Lampros Smyrnaios
360731ba72
- Improve handling of the "NO_CONTENT" case, in "getAssignments"-endpoint.
...
- Code optimization and polishing.
2023-09-14 13:53:01 +03:00
Lampros Smyrnaios
b4f91f188e
Fix the "retries-num" appearing in log-messages.
2023-09-14 12:08:33 +03:00
Lampros Smyrnaios
02bae38885
- Improve response-time to "getAssignments"-requests, by avoiding merging the parquet files of the "assignment" table, right after acquiring the assignments from the DB. They are already getting merged, when each assignments-batch is deleted after a workerReport has been processed.
...
- Optimize code-positioning for unlocking the DB when done executing queries.
2023-09-13 17:03:11 +03:00
Lampros Smyrnaios
8fdb8e9137
Add renaming of the workerReport-file, to indicate failure, when the processing failed because no workerInfo was found for the worker-id existing in the report. This way, it can be retried by the scheduler later.
2023-09-13 16:35:41 +03:00
Lampros Smyrnaios
c98e8df323
Move the "getRenamedWorkerReport"-code into its own method.
2023-09-13 16:27:18 +03:00
Lampros Smyrnaios
6891c467d4
- Avoid displaying a warning for the "test" HDFS directory, when the Controller is running in PROD mode.
...
- Add a missing change for the optimization of reading files.
- Update dependencies.
2023-09-13 15:29:30 +03:00
Lampros Smyrnaios
3dd349dd00
Improve the "findAssignmentsQuery":
...
- Fix an issue where assignments with an above-zero attempt_count were finding their way into the results, just because they were prioritized based on their boost_level or pub_year. As a result, old failed assignments were retried sooner and the not-yet-processed boosted-publications were pushed out to the workers much more slowly.
- Simplify the query, by removing the internal "ordering" and "limit", which had performance benefits when we did not need additional ordering for "level" and "pub_year". Back then, we wanted to apply the final orderings to as few rows as possible.
2023-09-13 14:38:15 +03:00
Lampros Smyrnaios
ee2df19ce1
- Allow "pretty-printing" the json response of the "getBulkImportReport" endpoint.
...
- Add useful log-messages for various bulk-import stages and improve the current ones.
- Optimize reading and writing the reports.
2023-09-11 17:24:39 +03:00
Lampros Smyrnaios
6944678391
Improve error-handling when renaming workerReport-files.
2023-09-08 17:41:10 +03:00
Lampros Smyrnaios
1c8f3765ca
- Fix not acquiring the full workerReport when retrying it, with the scheduler.
...
- Improve error-handling in the "inspectWorkerReportsAndTakeAction" process.
- Code polishing.
2023-09-08 14:59:48 +03:00
Lampros Smyrnaios
e72a4d3d10
- Improve handling of already renamed workerReport-files, which relate to failed workerReports in the 1st try.
...
- Add check for workerReport-files, which may have been deleted before their time, due to an error.
2023-09-08 14:11:41 +03:00
Lampros Smyrnaios
bd9245cc3d
Avoid deleting the assignment-records in case of a "parquet-data creation, upload or insertion" problem, in order to avoid double-processing of the urls, until the report has been retried by the scheduler.
2023-09-08 13:44:24 +03:00
Lampros Smyrnaios
718f5cfefb
- Improve prioritization of the most recent publications.
...
- Avoid processing publications which will be published in the next 5 years, counting from each "current" year, since they do not provide full-texts yet. Still allow the invalid publication-years like "2566", "9999", etc.
2023-09-07 14:05:58 +03:00