Lampros Smyrnaios
13f56d16c0
In the "GeneralController.shutdownWorkerGracefully()" endpoint, inform the user if a previous "shutdownWorker" request has already been given.
2023-02-01 16:41:40 +02:00
Lampros Smyrnaios
b98ea92dec
Update/improve documentation.
2023-01-27 14:27:57 +02:00
Lampros Smyrnaios
24b52fba63
- Refactor the initialization and configuration process and Spring-ify the project.
...
- Update Spring dependency.
2023-01-25 18:33:49 +02:00
Lampros Smyrnaios
d6ff62d2ef
Update the "installAndRun.sh" script:
...
- Add the ability to build and run the app without re-installing the PublicationsRetriever library. This is useful when trying a non-published version of that library.
- Fix a wrong variable-name.
2023-01-20 01:59:26 +02:00
Lampros Smyrnaios
bd0d9eb36f
- Delete the transferred full-texts as soon as possible, in order to mitigate the "No space left on device" error, which may appear when some of the files are very large.
...
- Use the new "GenericUtils.clearBlockingData()" method from the "PublicationsRetriever" library.
- Remove the deprecated "getMultipleFullTexts"-endpoint, along with the Zip-related code.
2023-01-18 16:55:59 +02:00
Lampros Smyrnaios
7dd5719bff
- Update a method-call, to reflect the latest changes in the "PublicationsRetriever"-software.
...
- Add TODOs.
2023-01-17 18:25:49 +02:00
Lampros Smyrnaios
c283cb4365
Improve exception-handling in "AssignmentsHandler.postWorkerReport()".
2023-01-16 15:22:32 +02:00
Lampros Smyrnaios
d96d0c68cd
Make sure the "responseCode" is "200-OK", before trying to get the InputStream in "UriBuilder.getPublicIP()".
2023-01-11 16:02:31 +02:00
Lampros Smyrnaios
fd62ac567e
- Add a new endpoint "getFullTextsImproved", which uses Facebook's [**Zstandard**](https://facebook.github.io/zstd/) compression algorithm and brings large benefits in compression rate and speed.
...
- Remove some dependencies.
2023-01-09 15:48:30 +02:00
Lampros Smyrnaios
778dc6e25c
- Improve the stability of "UriBuilder.getPublicIP()", by using an "HttpURLConnection" to increase the connection and read timeouts and avoid timeout-exceptions.
...
- Show the number of assignments which are requested from the Controller, in the log-message.
- Update Spring.
2023-01-03 18:43:26 +02:00
Lampros Smyrnaios
378db2ff2f
- Add an existence-check for the "publications_retriever"-JAR, before trying to make a backup, inside "installAndRun.sh".
...
- Add a final logging message, right before the app shuts down.
2022-12-15 14:15:24 +02:00
Lampros Smyrnaios
8c1daadad0
- Increase the "requestReadTimeoutDuration" to 5 hours.
...
- Improve gradle's performance.
2022-12-12 17:49:14 +02:00
Lampros Smyrnaios
6c17e86c70
Code polishing.
2022-12-09 12:53:08 +02:00
Lampros Smyrnaios
d37cd738a0
Refactor the full-texts deletion process to reduce storage space and complexity:
...
- Delete the assignments-batch full-texts after the whole procedure (for each assignments-batch) is finished, either successfully or not.
- Do not check for remaining files when the Worker shuts down, since the files are deleted anyway, even when their handling fails.
The full-texts do not need to be kept in case of an error, since the Controller will reassign the non-downloaded id-url records to some worker (maybe a different one) and these files will be downloaded again and handled there.
Also, change the "assignmentsNumsHandled" to hold data only for assignments which are handled all the way, including the upload of the full-texts from the Controller and also the insertion of the WorkerReport into the database.
2022-12-07 12:29:05 +02:00
Lampros Smyrnaios
326af0f12d
- Return a success-message in the response-body, of the "shutdownWorkerGracefully" and "cancelShutdownWorkerGracefully" endpoints.
...
- Apply the checks for the "totalZipBatches" param, before the Worker-related checks, in "FullTextsController.getMultipleFullTexts()"
- Show the Heap-sizes in megabytes.
2022-12-05 21:58:16 +02:00
Lampros Smyrnaios
5f48f72f06
- Add handling for the case where the Controller could not retrieve any assignments from the database (without an error).
...
- Improve exception handling.
- Remove obsolete code.
2022-12-05 16:47:15 +02:00
Lampros Smyrnaios
182d6153d4
- Set some optimization settings for gradle.
...
- Fix error-handling in "installAndRun.sh".
- Update dependencies.
2022-11-30 16:25:57 +02:00
Lampros Smyrnaios
01f12e2fe2
- Align with "PublicationsRetriever's" updated "couldRetry" and "wasValid" logic.
...
- Update dependencies.
2022-11-11 16:02:20 +02:00
Lampros Smyrnaios
90a69686cf
- When the Worker is about to shut down, after deleting all the handled assignments' files, check for remaining full-texts in the local storage and warn the user. If no remaining files are found, delete the parent full-texts directory.
...
- Polish the code.
2022-11-02 02:27:04 +02:00
Lampros Smyrnaios
6450a4b8ac
- Add a check for a zero value of "totalZipBatches", in "FullTextsController.getMultipleFullTexts()".
...
- Improve or comment-out some log-messages.
- Disable the empty SpringBootTest, as it caused building problems.
2022-10-06 16:59:45 +03:00
Lampros Smyrnaios
4b85b092fe
Handle the new "HttpStatus.MULTI_STATUS"-response from the Controller, inside "AssignmentsHandler.postWorkerReport()".
2022-09-28 22:41:43 +03:00
Lampros Smyrnaios
b051e10fd3
- Fix a bug which caused the domainAndPath-tracking data to be deleted after every batch, once the initial threshold was reached. Now the thresholds increase along with the number of processed id-urls, so that the data is cleared, e.g., every 300_000 processed id-urls, as intended.
...
- Use different thresholds for clearing just the "domainAndPath"-blocking-data and all-tracking-data.
2022-09-28 19:10:01 +03:00
Lampros Smyrnaios
373bfa810b
- Apply a "shouldShutdownWorker"-check in "ScheduledTasks.handleNewAssignments()", when there was a "connection-error" in the previous request. This makes sure that the Worker will honor the user's shut down request, even if it's "stuck" in a connection-error loop.
...
- Optimize the input-streams creation in the "FullTextsController".
2022-09-12 16:48:44 +03:00
Lampros Smyrnaios
d73a99b1c0
- Increase the security of the "shutdownWorker" and "cancelShutdownWorker" endpoints, by allowing only the requests which come from the same machine.
...
- Update the "UriBuilder.java" to be able to take the running port of the server, in case the port-number was initially set to "random" (0).
2022-09-12 16:38:44 +03:00
Lampros Smyrnaios
25070d7aba
- Lower the thresholds for how often to clear the data-structures.
...
- Clear the "ConnSupportUtils.domainsWithConnectionData" data-structure, after each batch.
- Move the code for handling the "CookieStore" inside the "PublicationsRetrieverPlugin", as it is more related to that.
2022-07-04 18:42:05 +03:00
Lampros Smyrnaios
5035094e44
- Move the "shutdownOrCancelCode" input into the "inputDataFile" provided by the user, for convenience and to be able to make this "auth-code" mandatory. Previously, it was optional and the app could not be stopped in a normal manner if this code was not provided.
...
- Improve the instructions and the error-messages for the "inputDataFile".
2022-06-28 16:00:11 +03:00
Lampros Smyrnaios
d91732bc16
- Delete the cookies in the newly supported CookieManager, after each batch.
...
- Update the Spring-Security-code to use the "SecurityFilterChain", as the previous code was deprecated.
- Update dependencies.
- Code cleanup.
2022-06-27 17:58:02 +03:00
Lampros Smyrnaios
26cbb83b51
- Add the "shutdownWorker"-endpoint to accept requests for shutting-down the Worker, gracefully, after it completes its current work (including sending the publications-files to the Controller). A user-defined "auth-code" is required.
...
- Add the "cancelShutdownWorker"-endpoint to cancel a previous "shutdownWorker"-request. A user-defined "auth-code" is required.
2022-06-22 18:53:27 +03:00
Lampros Smyrnaios
d6e94912a4
- Optimize zip-file creation.
...
- Update dependencies.
2022-05-26 15:24:36 +03:00
Lampros Smyrnaios
a1f750a0aa
- Handle the case where, in a group of related records, the initial record which led to a publication-url failed to have its full-text downloaded. Now we make sure the file-related data for all those related records is kept "null" and a special error is written.
...
- Code optimization.
2022-04-05 17:51:45 +03:00
Lampros Smyrnaios
d682298850
Improve assignment of "PublicationsRetriever.threadsMultiplier", depending on the total available threads on the system. The previous assignment was not scaling well.
2022-04-05 00:13:52 +03:00
Lampros Smyrnaios
4976afa829
Fix a "@JsonProperty" annotation inside "Payload.java".
2022-04-01 23:43:43 +03:00
Lampros Smyrnaios
31af0a81eb
- Update the Worker's report to include the datasourceID for each record. It is used by the Controller inside the S3-fileNames.
...
- Update dependencies.
2022-04-01 19:42:32 +03:00
Lampros Smyrnaios
5fee05e994
Update dependencies.
2022-03-28 14:29:54 +03:00
Lampros Smyrnaios
8453c742f2
Update Spring dependencies.
2022-02-25 17:41:10 +02:00
Lampros Smyrnaios
760e0ef7e2
Increase "PublicationsRetriever.threadsMultiplier" to 10, which, in turn, increases performance by 41%.
2022-02-23 17:31:32 +02:00
Lampros Smyrnaios
377b98d677
Increase the "requestReadTimeoutDuration" from 1 hour to 3 hours. This way, each worker will handle saturation without aborting the connection, when multiple workers are waiting for the "databaseLock" in the Controller.
2022-02-22 13:29:02 +02:00
Lampros Smyrnaios
edbf6461d5
- Refactor the scheduling of the "handleNewAssignments()" task. Spring already waits for the last task to finish before running the new one (unless Async is specifically enabled), so the "isAvailableForWork" didn't do anything (thus the bug described in a previous commit was never going to appear). Also, the new assignments-batch is now requested immediately after the last one is finished (not after 15 mins), while still dealing with potential continuous connection-errors.
...
- Avoid running the "deleteHandledAssignmentsFullTexts()" scheduled task on application's start.
- Optimize assignment of "requestUrl".
- Add clarity to the scheduled tasks, by using "fixedDelay" instead of "fixedRate", to signify that the specified time is counted from the moment the last task finishes (even though, without enabling "Async", there is no "danger" of running them in parallel).
- Code cleanup.
2022-02-21 12:48:21 +02:00
Lampros Smyrnaios
0d2f0b8b01
Code cleanup.
2022-02-19 17:21:51 +02:00
Lampros Smyrnaios
b63ad87d00
Bug fixes:
...
- Fix a bug where, in case it took too long to get the assignments from the Controller (possible when there are too many workers requesting at the same time or if the database is responding slowly), the Worker's scheduler would request new assignments in the meantime.
- Fix a bug where, if the "maxAssignmentsBatchesToHandleBeforeRestart" was set, the Worker's scheduler could request another batch, right before the Worker was about to shut down.
- Fix a bug where the condition of when to clear the over-sized data-structures was based on the "assignmentRequestCounter" sent by the Controller (which is increased on each request by any worker and not for each individual one), and not on the "numHandledAssignmentsBatches" kept by each individual worker. This would result in much earlier cleanup, relative to the number of the Workers.
2022-02-19 17:09:02 +02:00
Lampros Smyrnaios
3d1faf4a8a
- Reduce memory-consumption in the long-run, by clearing some underlying data-structures after a threshold.
...
- Update Gradle.
2022-02-18 20:02:34 +02:00
Lampros Smyrnaios
4cadaf98fc
Update the README.md
2022-02-07 20:59:10 +02:00
Lampros Smyrnaios
73552ce079
- Handle the latest download-errors provided by the "PublicationsRetriever" program.
...
- Update the "test" requestUrl.
2022-02-07 14:40:33 +02:00
Lampros Smyrnaios
2e4c1323a3
- Add check for null or empty id or url.
...
- Code cleanup.
2022-01-28 03:47:46 +02:00
Lampros Smyrnaios
a428b1d1e6
- Fix the gradle version defined inside the "installAndRun.sh" script not being prioritized.
...
- Update SpringBoot dependency.
2022-01-21 15:19:52 +02:00
Lampros Smyrnaios
8912bb1cf9
Fix adding an invalid error-message in case of an "alreadyDownloaded" full-text being discovered inside the "FileUtils.dataToBeLoggedList".
2022-01-17 23:46:15 +02:00
Lampros Smyrnaios
0032a8018f
- Improve search-accuracy of "alreadyDownloaded" full-texts.
...
- Handle the potential error-case of an "alreadyDownloaded" full-text not being discovered inside the "FileUtils.dataToBeLoggedList".
2022-01-17 10:12:48 +02:00
Lampros Smyrnaios
d61ff4b6dd
Integrate some changes from the "PublicationsRetrieverPlugin".
2022-01-14 15:13:00 +02:00
Lampros Smyrnaios
8abb260d60
- In case of an unknown (non-documented) exception inside "LoaderAndChecker.invokeAllTasksAndWait", it will now be logged and the app will shut down gracefully, with an error-message in the Error-stream.
...
- Avoid double-checking for handledAssignments (in order to delete their full-texts) when the app is about to shut down, in case the "maxAssignmentsBatchesToHandleBeforeRestart" is set above zero.
2022-01-04 00:23:45 +02:00
Lampros Smyrnaios
92d011e8a0
- Make sure the handled assignments' full-texts are deleted before the application exits.
...
- When the user sets the "maxAssignmentsBatchesToHandleBeforeRestart" above zero, shut down immediately after the last assignments-batch. Do not wait for the next scheduled check.
- Allow the user to set the "maxAssignmentsBatchesToHandleBeforeRestart" in the "installAndRun.sh" script.
- Increase the "fixedRate" for the "ScheduledTasks.deleteHandledAssignmentsFullTexts()" method to 12 hours.
- Update README.md
2021-12-31 04:09:05 +02:00
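
Note on commit b051e10fd3 (increasing thresholds): the fix described there can be illustrated with a minimal Java sketch. This is not the Worker's actual code. The class name, the method name and the fixed batch size are hypothetical; only the "clear every 300_000 processed id-urls" example comes from the commit message. The sketch shows why the threshold has to move forward after each cleanup: with a fixed threshold, the check would succeed after every batch once the initial threshold was reached.

```java
public class ClearingThresholdSketch {

    private final long interval;   // hypothetical: clear the data every `interval` processed id-urls
    private long nextThreshold;    // the processed-count at which the next cleanup is due
    private long processedIdUrls = 0;

    public ClearingThresholdSketch(long interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.interval = interval;
        this.nextThreshold = interval;
    }

    /**
     * Records a finished batch and reports whether the tracking data
     * should be cleared now.
     */
    public boolean onBatchFinished(long idUrlsInBatch) {
        processedIdUrls += idUrlsInBatch;
        if (processedIdUrls < nextThreshold) {
            return false;
        }
        // Move the threshold past the current count. Without this step,
        // every later batch would also satisfy the check above.
        while (nextThreshold <= processedIdUrls) {
            nextThreshold += interval;
        }
        return true;
    }

    public static void main(String[] args) {
        ClearingThresholdSketch sketch = new ClearingThresholdSketch(300_000);
        for (int batch = 1; batch <= 7; batch++) {
            boolean clear = sketch.onBatchFinished(100_000);
            System.out.println("batch " + batch + " -> clear data: " + clear);
        }
    }
}
```

With batches of 100_000 id-urls and an interval of 300_000, the cleanup is due after the 3rd and the 6th batch only. Using a separate instance with a different interval would model the separate thresholds for the "domainAndPath"-blocking-data and for all the tracking-data.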