Commit Graph

55 Commits

Author SHA1 Message Date
Lampros Smyrnaios 69ea5b6d19 - Increase the "ReadTimeout" to 2 hours, as the Worker struggles to get the assignments-data in time.
- Revert the change about special handling of the "RestClientException". The exMsg was appearing in a different line, in the logs, and was a "SocketTimeoutException".
2023-10-27 18:39:10 +03:00
Lampros Smyrnaios bfa76e9484 - Show the full stacktrace in the weird case of a "RestClientException" without an exception-message. Also, in this case, retry immediately, as there is no long-lasting network problem that requires some time between requests, but most probably a random interruption.
- Code polishing.
2023-10-27 17:36:54 +03:00
Lampros Smyrnaios 10e39d79a4 - Improve a log-message.
- Update dependencies.
2023-10-20 17:35:39 +03:00
Lampros Smyrnaios 01e378ea66 - Add progress-report-log for assignments-processing.
- Code polishing.
2023-10-05 12:02:52 +03:00
Lampros Smyrnaios 2895668417 - Add LICENSE.
- Code polishing.
2023-09-14 16:09:20 +03:00
Lampros Smyrnaios 49cd0c19c2 - Increase the "hoursToWaitBeforeDeletion" to 48.
- Adjust the number and size of log files.
2023-08-31 17:54:07 +03:00
Lampros Smyrnaios e85282d35b Update the "addReportResultToWorker"-endpoint to check if the given "assignmentsCounter" was handled by that worker, without considering the related full-texts directory, since that may have been deleted in the meantime. 2023-08-31 17:52:52 +03:00
Lampros Smyrnaios b579296ada - Code optimization and polishing.
- Update dependencies.
2023-08-28 16:11:26 +03:00
Lampros Smyrnaios dc97b323c9 - Show a warning, if the "numOfUnretrievedFiles" is over 50.
- Delete gradle .zip file after installation.
- Code polishing.
2023-08-04 15:33:48 +03:00
Lampros Smyrnaios 088cf73b30 - Update dependencies.
- Code optimization and polishing.
2023-07-27 17:46:17 +03:00
Lampros Smyrnaios 952bf7c035 - Update dependencies.
- Code polishing.
2023-07-06 13:22:09 +03:00
Lampros Smyrnaios 33df46f6f5 - Improve README.
- Update and cleanup dependencies.
- Code polishing.
2023-06-22 12:47:36 +03:00
Lampros Smyrnaios 9c897b8bf4 - Make use of the new Normalizer utilized by the PublicationRetriever plugin.
- Code polishing.
2023-06-10 02:40:45 +03:00
Lampros Smyrnaios 2aedae2367 - In case a serious error happened while processing the assignments, instead of shutting down immediately, now the Worker shuts down the executor service, registers that it will shut down soon and waits for the Controller to retrieve the already downloaded full-text files.
- In case the full-texts' subdirectory could not be created, then terminate the "handleAssignment" method immediately. No posting of a faulty workerReport to the Controller should happen.
- Code polishing.
2023-05-31 15:25:36 +03:00
Lampros Smyrnaios 4a95826f58 - Avoid processing the assignments, for which the assignments_full-texts subdirectory cannot be created.
- Avoid a double-log.
2023-05-31 02:27:24 +03:00
Lampros Smyrnaios 7f3ca80959 Bypass url-canonicalization for urls containing certain uncommon characters which cause the urls to get rejected. 2023-05-30 19:45:14 +03:00
Lampros Smyrnaios a9b1b20a51 - Prevent running out of space, by checking the available free space and stalling the acquisition of new assignments until more free space becomes available.
- Fix missing change.
2023-05-30 17:58:29 +03:00
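The free-space check described in commit a9b1b20a51 could look roughly like the sketch below. It is only an illustration, not the Worker's actual code: the class name, the method name and the threshold parameter are all hypothetical, and the only library call it relies on is the standard `File.getUsableSpace()`.

```java
import java.io.File;

public class FreeSpaceCheck {

    // Hypothetical helper: returns true when the partition holding "dir"
    // still offers at least "minFreeBytes" of usable space.
    // A caller could skip requesting new assignments while this returns false.
    static boolean hasEnoughSpace(File dir, long minFreeBytes) {
        return dir.getUsableSpace() >= minFreeBytes;
    }
}
```

A scheduled task might call `hasEnoughSpace(baseDir, threshold)` before each assignments-request and simply return early when it yields false, retrying on the next run.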
Lampros Smyrnaios 0908dcab8a Use a single "restTemplate" object, with the same timeouts (a bit increased from the old requestRestTemplate, to account for a possible overloaded Controller), since we no longer need to wait for hours until the workerReport is processed by the Controller. 2023-05-29 14:15:55 +03:00
Lampros Smyrnaios 2b69733912 - Increase the test-delays of the scheduled tasks.
- Update dependencies.
2023-05-29 12:45:43 +03:00
Lampros Smyrnaios f57314908a - Improve elapsed time precision for the "lastModified" metadata of the assignments-fulltext subDirectories.
- Code polishing.
2023-05-25 00:37:44 +03:00
Lampros Smyrnaios 1bf27a5a4e - Fix a bug, which caused the old full-text files to not be deleted.
- Reduce the "InitialDelay" for the "checkIfShouldShutdown" scheduler.
2023-05-24 16:47:53 +03:00
Lampros Smyrnaios 0ca02f3587 Change the delay values of scheduledTasks to production ones. 2023-05-24 13:56:20 +03:00
Lampros Smyrnaios 9fdaa9503b - Delete any left-over full-texts after 36 hours.
- Upon shutting down, post a "shutdownReport" to the Controller.
2023-05-23 22:22:57 +03:00
Lampros Smyrnaios 903032f454 - After a WorkerReport has been sent, ask for new assignments immediately. So, the Worker does not have to wait for hours for the Controller to check for duplicate files in the DB, retrieve and upload the full-texts and insert the records to the DB.
- Special care is taken to delete the delivered full-texts as soon as possible.
- Write the workerReport to a json-file, in case something goes wrong, and keep it until the Controller notifies the Worker that the processing was successful.
2023-05-23 22:19:41 +03:00
Lampros Smyrnaios 4d90846261 - In case the specified "controllerIP" is actually a domain-name, find its IP-address, so that a proper IP-to-IP comparison can be performed and the "securityChecks" can pass.
- Increase the "read-timeout" when searching for the host's machine public-IP.
- Update dependencies.
- Code polishing.
2023-05-22 21:25:22 +03:00
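The domain-name handling mentioned in commit 4d90846261 can be sketched as follows. This is a minimal illustration under assumptions: the class and method names are hypothetical and the real "securityChecks" may differ. It uses only the standard `InetAddress.getByName()`, which accepts both IP literals and domain names.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class ControllerAddress {

    // Hypothetical helper: resolves an IP literal or a domain name to its
    // textual IP address, or returns null when the name cannot be resolved.
    static String toIp(String hostOrIp) {
        try {
            return InetAddress.getByName(hostOrIp).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    // Compares the configured controller host (IP or domain name) with the
    // IP of an incoming request, after resolving both to IP form.
    static boolean isController(String configuredControllerHost, String remoteIp) {
        String expected = toIp(configuredControllerHost);
        return expected != null && expected.equals(toIp(remoteIp));
    }
}
```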
Lampros Smyrnaios bd0ead816d Make the value of time-out for "restTemplateForReport", to scale along the "maxAssignmentsLimitPerBatch". 2023-05-16 19:08:59 +03:00
Lampros Smyrnaios 714938531b - Add the time-zone in the logs.
- Code polishing.
2023-05-11 03:14:56 +03:00
Lampros Smyrnaios d5a997ad3d Use restTemplates with different read timeouts depending on the operation. For the assignments-request we need a shorter read timeout than the one we need for the worker-report. This guarantees that the connection does not hang for so long when the Controller crashes before sending the assignments. 2023-04-29 17:24:16 +03:00
Lampros Smyrnaios 0ba15dd31a Increase the "requestReadTimeoutDuration" to 10 hours, as the number of full-texts to be transferred to the Controller keeps getting larger. 2023-04-26 15:08:46 +03:00
Lampros Smyrnaios 839a797124 - Improve performance of full-texts transferring to the Controller, by preloading some bytes for faster response to the Controller's read requests.
- Optimize the directories-creation process by eliminating the additional check for existence, as that check already takes place inside the "mkdirs()" method.
- Remove the obsolete code which used a different base-dir in case the specific assignments' subdirectory failed to be created. Since the user-defined baseDir has already been successfully created upon initialization, any problem with creating subdirectories inside that base-directory will most likely persist even when changing the base directory. Additionally, even if the subdirectory creation succeeded with the changed base-directory, the "FullTextsController.getFullTexts()" method would not use it, resulting in errors.
- Code polishing.
2023-03-08 13:12:17 +02:00
Lampros Smyrnaios 4da54e7a7d - Show a warning, in case the number of archived files is different from the number of requested files.
- Code polishing.
- Update Gradle.
2023-03-07 16:25:10 +02:00
Lampros Smyrnaios ec09ecc7ff - Refactor and Spring-ify the File-storage initialization process.
- Fix the problematic file-storage-path (it could not be used when the Controller was requesting the full-texts), which was produced when the user-defined path could not be created.
2023-03-07 16:21:32 +02:00
Lampros Smyrnaios ff4fd3d289 - Show the elapsed time for each assignments-request to be processed by the Worker.
- Update dependencies.
2023-03-02 17:34:44 +02:00
Lampros Smyrnaios 66d3f7bcb2 - Show a warning, in case the number of results is different from the number of the assignments (due to missing / double logging).
- Update Spring.
2023-02-24 23:27:02 +02:00
Lampros Smyrnaios 81b61b530f Drastically improve performance by applying a pre-processing algorithm for the assignments-list to open some "space" between assignments which have the same domain, which in turn causes the threads to block less during execution.
(The threads block, due to the mandatory "politeness-delay" before reconnecting with the same domain, in order to avoid overloading the remote servers.)
2023-02-24 23:23:37 +02:00
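One possible way to open "space" between same-domain assignments, as described in commit 81b61b530f, is a round-robin interleave over per-domain queues. The sketch below is an assumption about the general idea, not the algorithm actually committed. The class and method names are hypothetical, and plain URL strings stand in for the assignment objects.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DomainSpreader {

    // Hypothetical helper: extracts the host of a url, falling back to the raw string.
    static String domainOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return (host != null) ? host : url;
        } catch (IllegalArgumentException e) {
            return url;
        }
    }

    // Groups the urls by domain (keeping first-seen order), then repeatedly
    // takes one url from each domain, so that urls of the same domain end up
    // as far apart as the number of remaining domains allows.
    static List<String> spread(List<String> urls) {
        Map<String, Deque<String>> byDomain = new LinkedHashMap<>();
        for (String url : urls) {
            byDomain.computeIfAbsent(domainOf(url), d -> new ArrayDeque<>()).add(url);
        }
        List<String> result = new ArrayList<>(urls.size());
        while (!byDomain.isEmpty()) {
            Iterator<Deque<String>> it = byDomain.values().iterator();
            while (it.hasNext()) {
                Deque<String> queue = it.next();
                result.add(queue.poll());
                if (queue.isEmpty()) {
                    it.remove(); // this domain has no urls left
                }
            }
        }
        return result;
    }
}
```

With this ordering, a thread that picks up the next assignment is less likely to hit a domain that another thread contacted a moment ago, so it waits on the "politeness-delay" less often. Once only one domain remains, its urls are necessarily consecutive again.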
Lampros Smyrnaios 84a37bd4b7 - Handle the case where an instance of a urlReport record (having the same id and sourceUrl) may have failed to give a docUrl, due to an error, even if another instance gives the docUrl and the docFile. The absence of that handling could lead to a record-instance being assigned a "fileLocation" which was actually an error-message (comment), and as a result the real "fileLocation" would never have been assigned, so the payload would be lost.
- Improve exceptions-handling.
2023-02-21 15:22:49 +02:00
Lampros Smyrnaios 24b52fba63 - Refactor the initialization and configuration process and Spring-ify the project.
- Update Spring dependency.
2023-01-25 18:33:49 +02:00
Lampros Smyrnaios 7dd5719bff - Update a method-call, to reflect the latest changes in the "PublicationsRetriever"-software.
- Add TODOs.
2023-01-17 18:25:49 +02:00
Lampros Smyrnaios d37cd738a0 Refactor the full-texts deletion process to reduce storage space and complexity:
- Delete the assignments-batch full-texts after the whole procedure (for each assignments-batch) is finished, either successfully or not.
- Do not check for remaining files, when the Worker shuts down, since, in case of problematic handling the files are deleted anyway.

The full-texts do not need to be kept in case of an error, since the Controller will reassign the non-downloaded id-url records to some worker (maybe a different one) and these files will be downloaded again and handled there.

Also, change the "assignmentsNumsHandled" to hold data only for assignments which are handled all the way, including the upload of the full-texts from the Controller and also the insertion of the WorkerReport to the database.
2022-12-07 12:29:05 +02:00
Lampros Smyrnaios 90a69686cf - When the Worker is about to shut-down, after deleting all the handled assignments' files, check for remaining full-texts in the local storage and warn the user. If no remaining files were found, then delete the parent fulltexts' directory.
- Polish the code.
2022-11-02 02:27:04 +02:00
Lampros Smyrnaios 373bfa810b - Apply a "shouldShutdownWorker"-check in "ScheduledTasks.handleNewAssignments()", when there was a "connection-error" in the previous request. This makes sure that the Worker will honor the user's shut down request, even if it's "stuck" in a connection-error loop.
- Optimize the input-streams creation in the "FullTextsController".
2022-09-12 16:48:44 +03:00
Lampros Smyrnaios d91732bc16 - Add deletion, of the cookies in the newly-supported CookieManager, after each batch.
- Update the Spring-Security-code to use the "SecurityFilterChain", as the previous code was deprecated.
- Update dependencies.
- Code cleanup.
2022-06-27 17:58:02 +03:00
Lampros Smyrnaios edbf6461d5 - Refactor the scheduling of the "handleNewAssignments()" task. Spring already waits for the last task to get finished, before running the new one (unless Async is specifically enabled), so the "isAvailableForWork" didn't do anything (thus the bug described in a previous commit was never going to appear). Also, now we set to request the new assignments-batch immediately after the last one is finished (not after 15 mins), while dealing with potential continuous connection-errors.
- Avoid running the "deleteHandledAssignmentsFullTexts()" scheduled task on application's start.
- Optimize assignment of "requestUrl".
- Add clarity in the scheduled tasks, by using "fixedDelay" instead of "fixedRate", to signify that the time specified is counted right from the time the last task is finished (even though without enabling the "Async" there is no "danger" of running them in parallel).
- Code cleanup.
2022-02-21 12:48:21 +02:00
Lampros Smyrnaios 92d011e8a0 - Make sure the handled assignments - full-texts are deleted before the application exits.
- When the user sets the "maxAssignmentsBatchesToHandleBeforeRestart" above zero, shutdown immediately after the last assignments-batch. Do not wait for the next scheduled check.
- Allow the user to set the "maxAssignmentsBatchesToHandleBeforeRestart" in the "installAndRun.sh" script.
- Increase the "fixedRate" for the "ScheduledTasks.deleteHandledAssignmentsFullTexts()" method to 12 hours.
- Update README.md
2021-12-31 04:09:05 +02:00
Lampros Smyrnaios 1ddfd34236 - Allow the user to set a maximum number of assignments-batches for the Worker to handle. After handling those batches, the Worker will shut down. A number of < 0 > indicates an infinite number of batches.
- Avoid converting the zero fileSize to < null >. Now, the default value is < null >, so the zero-value will indicate a zero-byte file.
- Update dependencies.
- Code cleanup.
2021-12-24 00:12:34 +02:00
Lampros Smyrnaios c46c8c448a - Upgrade the zip-file delivery by using the "InputStreamResource". This way is more reliable, has better performance and uses less memory.
- Use the "InputStreamResource" also in "get(single)FullText"-endpoint, in order to avoid loading a big full-text file in memory.
- Decrease the system-reserved memory by 128 MB.
- Fix path-variable regexes for "getFullText"-endpoint.
- Optimize imports.
- Code cleanup.
2021-12-17 08:25:54 +02:00
Lampros Smyrnaios 045788c728 - Use the "Timestamp" data-type instead of the "Date", in order to include more information.
- Code cleanup.
2021-11-27 02:37:33 +02:00
Lampros Smyrnaios 20b71164d5 - The worker will store the files in its local file-system and will send them to the controller in batches, after the latter requests them. When all files from a given assignments-num are sent, the files will be deleted from the Worker, in a scheduled-job.
- Implement the "getFullTexts"-endpoint, which returns the requested full-texts in a zip file.
- Implement the "getFullText"-endpoint, which returns the requested full-text.
- Implement the "getHandledAssignmentsCounts"-endpoint which returns the assignments-numbers, which were handled by that worker.
- Make sure each urlReport has the same "Date" for a given assignments-number. Also, make sure the "size" and "hash" have a "null" value, in case the full-text was not found.
- Check and log thread-pool shutdown errors.
- Add the stack-trace in the error-logs, instead of the Stderr.
- Update SpringBoot dependency.
- Change log levels.
- Code cleanup.
2021-11-26 17:04:31 +02:00
Lampros Smyrnaios 0f12a9305c - Decrease the time interval for the scheduled task "handleNewAssignments". This helps to reduce the "dead-time" between reporting the current assignments and requesting the new ones.
- Avoid a potential NPE when giving information about the received AssignmentRequest.
- Log and return, when the received assignments-list is empty.
- Improve some logging-messages.
- Update the logs' fileName and change the preferred appender to "File".
- Code cleanup.
2021-10-14 03:03:47 +03:00
Lampros Smyrnaios 61597d1627 - Read the Controller's url from a file, when starting the Application.
- Switch the "AssignmentsHandler.askForTest" to "false".
- Get the size and the hash of a docFile which was previously downloaded by another ID in that batch.
- Reset the "AssignmentHandler.urlReports" list after posting the results to the Controller.
- Enhance logging and comments.
- Add more guidelines in the README.
- Disable the scheduled test-live job.
- Code cleanup.
2021-09-21 16:21:39 +03:00