Lampros Smyrnaios
8f9786de09
Upgrade the algorithm for finding the previously-found fulltexts, based on their md5hash:
...
- Use a single query with a list of the fileHashes, instead of thousands of single-md5hash-check queries (run at most 6 in parallel), which require a lot of I/O.
- Avoid checking the same fileHash multiple times, in case it is related to multiple payloads.
- In case of a database-error, avoid completely losing the full-texts of that worker; instead, continue processing the full-texts.
2024-03-13 11:28:37 +02:00
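The first two points of this commit can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class, the method names and the table and column names (`payload`, `hash`, `location`) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;

public class HashLookupQuery {

    // Hypothetical table and column names, used only for illustration.
    static final String TABLE = "payload";
    static final String HASH_COLUMN = "hash";
    static final String LOCATION_COLUMN = "location";

    /** Collapses repeated hashes, so that each fileHash is checked only once (first-seen order is kept). */
    static List<String> distinctHashes(List<String> fileHashes) {
        return new ArrayList<>(new LinkedHashSet<>(fileHashes));
    }

    /** Builds one parameterized query, with one "?" placeholder per distinct hash. */
    static String buildQuery(int numOfHashes) {
        if (numOfHashes < 1)
            throw new IllegalArgumentException("At least one hash is required.");
        String placeholders = String.join(", ", Collections.nCopies(numOfHashes, "?"));
        return "SELECT " + HASH_COLUMN + ", " + LOCATION_COLUMN
                + " FROM " + TABLE
                + " WHERE " + HASH_COLUMN + " IN (" + placeholders + ")";
    }
}
```

The resulting string would then be executed once, with the distinct hashes bound as parameters, instead of issuing one query per hash.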
Lampros Smyrnaios
e4540e7f3c
Handle the case when a urlReports-sublist does not have any payloads inside.
2024-03-12 14:25:00 +02:00
Lampros Smyrnaios
e20c5d2146
- Add error-handling for the case when no payloads could be associated with a specific url which should have been in the hashMultiMap in "addUrlReportsByMatchingRecordsFromBacklog".
...
- Fix not cloning the payload, before changing it and adding it in the "prefilledPayloads"-list; instead, an object-reference was used.
2024-03-11 19:48:04 +02:00
Lampros Smyrnaios
1048463ca0
- Improve error-handling in "S3ObjectStore.emptyBucket()".
...
- Change some log-levels.
- Code polishing.
2024-03-11 16:17:32 +02:00
Lampros Smyrnaios
8f18008001
Avoid performing payload-related operations in case no fulltext was received from the worker, due to an error.
2024-03-11 14:57:13 +02:00
Lampros Smyrnaios
ce3e149a95
Improve the "emptying/deleting" process of the S3-bucket.
2024-03-11 13:34:38 +02:00
Lampros Smyrnaios
dd394f18a0
- Optimize the JOIN-order in the "findAssignmentsQuery".
...
- Optimize the "DOC_URL_FILTER"-regex.
- Update dependencies.
2024-03-11 11:35:38 +02:00
Lampros Smyrnaios
43ea64758d
- Improve handling of the case when no fulltexts have been found or none of the found ones were requested from the worker, as they were already retrieved in the past.
...
- Show the number of files with problematic locations (if any of them exist).
- Code polishing.
2024-02-23 12:39:28 +02:00
Lampros Smyrnaios
749172edd8
Add the Jenkins build-status badge to the README.
2024-02-08 19:49:58 +02:00
Lampros Smyrnaios
b72996c9a9
- Configure the destination of the logs in the "application.properties" file.
...
- Add some gradle files to be used by Jenkins.
2024-02-08 19:47:34 +02:00
Lampros Smyrnaios
3563fd6e2a
- Try to get the cause of the exception of the callable-tasks which handle parquet-files.
...
- Update License.
- Update dependencies.
2024-02-07 18:34:28 +02:00
Lampros Smyrnaios
34d7a143e7
Add/improve documentation.
2024-02-01 14:37:29 +02:00
Lampros Smyrnaios
5dadb8ad2f
- Optimize the "DOC_URL_FILTER"-regex, by using a non-capturing group.
...
- Remove an extra "File.separator" from the fulltexts-fullFilePath.
2024-01-19 15:46:23 +02:00
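The effect of a non-capturing group can be shown with a small sketch. The patterns below are hypothetical and are not the project's actual "DOC_URL_FILTER"; they only show that `(?:...)` groups alternatives without storing a capture.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class NonCapturingGroupDemo {

    // Hypothetical patterns, NOT the project's actual "DOC_URL_FILTER".
    static final Pattern CAPTURING = Pattern.compile(".+\\.(pdf|docx?)$");
    static final Pattern NON_CAPTURING = Pattern.compile(".+\\.(?:pdf|docx?)$");

    /** Returns true if the url ends with one of the (assumed) document extensions. */
    static boolean looksLikeDocUrl(String url) {
        return NON_CAPTURING.matcher(url.toLowerCase(Locale.ROOT)).matches();
    }
}
```

Both patterns accept the same inputs, but the non-capturing one does not have to record the matched extension, which is useful when the group is needed only for the alternation.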
Lampros Smyrnaios
bdc61c2cda
When at least one worker is still active and we have to wait for the service-shutdown, show a log-message to inform the user, including that worker's IP.
2024-01-15 13:35:22 +02:00
Lampros Smyrnaios
3a70b57146
Prioritize the full-text urls over the landing-page ones.
2024-01-15 12:59:50 +02:00
Lampros Smyrnaios
ee1ca8966b
- Stop requesting workerReport-batches when, already from the 1st batch, the base-directory of that assignments-counter is not found.
...
- Update dependencies.
2024-01-15 12:57:33 +02:00
Lampros Smyrnaios
2e60128084
- Allow to easily change the port used by workers.
...
- Show the number of active background-tasks and bulkImportDirs, which delay the Service's shutdown.
- Update dependencies.
- Code polishing.
2023-12-19 23:31:42 +02:00
Lampros Smyrnaios
d90ad51609
Add the "shutdownAllWorkersGracefully" and "cancelShutdownAllWorkersGracefully" endpoints, in order to be able to shut the workers down at once and update them, without shutting down the whole Service. This way, the bulk-import procedures continue to work.
2023-11-29 16:45:58 +02:00
Lampros Smyrnaios
d20c9a7d2e
- Show the original exception thrown by the background-job, not the one thrown in the main-thread, which is useless, except for its message.
...
- Reduce the interval for deleting the unhandled assignments to once every 3 days.
- Set the upcoming version.
- Update dependencies.
2023-11-27 18:19:53 +02:00
Lampros Smyrnaios
7f789b8ad0
- If we receive an "UnknownHostException" when uploading to the S3ObjectStore, then skip the current full-texts' batch to leave some time for the network to get unstuck.
...
- Code polishing.
2023-11-22 15:29:18 +02:00
Lampros Smyrnaios
9b1f2c4931
Improve performance and reduce memory usage of the "findAssignmentsQuery":
...
- Reorder JOINs and predicates to reduce the computational cost.
- Remove the memory-costly "pu.url" predicates from the "where" clause, as the DB has no empty urls anymore.
2023-10-31 15:59:48 +02:00
Lampros Smyrnaios
db929d8931
- Add a scheduled job to delete assignments older than 7 days. These may be left behind when the worker throws a "SocketTimeoutException" before it can receive the assignments and process them. No workerReport gets created for those assignments.
...
- Improve some log-messages.
- Code polishing.
2023-10-30 12:29:54 +02:00
Lampros Smyrnaios
856c62887d
- Make sure the "UTF_8" charset is used, when we get a message from the response-body.
...
- Improve some log-messages.
2023-10-26 11:44:23 +03:00
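Reading a response-body with an explicit charset could look like the sketch below. The class and method names are hypothetical; the point is only that `StandardCharsets.UTF_8` is passed explicitly instead of relying on the platform's default charset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ResponseBodyReader {

    /** Reads the whole stream as text, always decoding it as UTF-8. */
    static String readBody(InputStream bodyStream) {
        StringBuilder message = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(bodyStream, StandardCharsets.UTF_8))) {
            char[] buffer = new char[1024];
            int charsRead;
            while ((charsRead = reader.read(buffer)) != -1)
                message.append(buffer, 0, charsRead);
        } catch (IOException ioe) {
            // Wrapped only to keep this sketch free of checked exceptions.
            throw new UncheckedIOException(ioe);
        }
        return message.toString();
    }
}
```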
Lampros Smyrnaios
bdf834c439
- Refactor the "shutdown" script to do an orderly-shutdown, by default, by calling the "shutdownService" endpoint. In case a "force-shutdown" is needed, that can be requested with a cmd-argument.
...
- Fix not updating the "UrlsController.numOfWorkers" correctly.
- Code polishing.
2023-10-23 17:19:29 +03:00
Lampros Smyrnaios
0c7bf6357b
- Improve performance in "FileUtils.addUrlReportsByMatchingRecordsFromBacklog()".
...
- Make sure we remove the assignments of all "not-successful", old, worker-reports, even for the ones which failed to be renamed to indicate success or failure, or failed to be executed by the background threads (and thus never reached the renaming stage).
2023-10-23 12:21:42 +03:00
Lampros Smyrnaios
a7581335f1
- Improve the "getDataForPayloadPrefillQuery".
...
- Improve some error-messages.
2023-10-21 11:31:31 +03:00
Lampros Smyrnaios
44c2fe7418
- Fix the "IndexOutOfBoundsException", when checking the futures' results.
...
- Update dependencies.
2023-10-20 14:25:05 +03:00
Lampros Smyrnaios
df0ea62a5a
- Handle the case when the "webHDFSBaseUrl" does not use HTTPS.
...
- Improve error-reporting when uploading a file to HDFS.
2023-10-19 11:59:37 +03:00
Lampros Smyrnaios
40729c6295
Move similar code into the new "ParquetFileUtils.getPayloadParquetRecord()" method.
2023-10-17 12:50:51 +03:00
Lampros Smyrnaios
f05eee7569
Improve the names of some methods.
2023-10-16 23:39:43 +03:00
Lampros Smyrnaios
def21b991d
Improve the UX of the "installAndRun.sh" script.
2023-10-09 17:28:22 +03:00
Lampros Smyrnaios
fb2877dbe8
Upgrade the execution system for the backgroundTasks:
...
- Submit each task immediately for execution, instead of waiting for a scheduling thread to send all gathered tasks (up to that point) to the ExecutorService (and block until they are finished, before it can start again).
- Hold the Future of each submitted task in a synchronized-list, to check the result of each task at a scheduled time.
- Reduce the cpu-time needed to make sure the Service can shut down, by checking for "actively-executed" and "about-to-be-executed" tasks at the same time, instead of relying on an additional check of the "shutdown"-status of each worker to verify that no active tasks exist.
- Improve the threads' shutdown procedure.
2023-10-09 17:23:59 +03:00
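The "submit immediately, hold the Future, check later" approach from this commit can be sketched as below. This is an illustration under assumptions, not the Service's real implementation: the class name, the pool size and the `Boolean` result type are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BackgroundTasks {

    // Hypothetical pool size, chosen only for the example.
    private final ExecutorService executor = Executors.newFixedThreadPool(4);
    private final List<Future<Boolean>> futures = Collections.synchronizedList(new ArrayList<>());

    /** Submits the task right away and keeps its Future for a later check. */
    public void submit(Callable<Boolean> task) {
        futures.add(executor.submit(task));
    }

    /** Meant to run at a scheduled time: removes the finished tasks and returns how many of them failed. */
    public int checkFinishedTasks() {
        List<Future<Boolean>> finished = new ArrayList<>();
        synchronized (futures) { // iterating a synchronizedList must be guarded manually
            for (Future<Boolean> future : futures)
                if (future.isDone())
                    finished.add(future);
            futures.removeAll(finished);
        }
        int failed = 0;
        for (Future<Boolean> future : finished) {
            try {
                if (!Boolean.TRUE.equals(future.get()))
                    failed++;
            } catch (ExecutionException ee) {
                failed++; // ee.getCause() holds the original exception thrown inside the task
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        }
        return failed;
    }

    /** Number of tasks which are running or waiting to run (plus finished ones not yet checked). */
    public int pendingTasks() {
        return futures.size();
    }

    public void shutdown() {
        executor.shutdown();
    }
}
```

With this shape, a shutdown check only needs to look at `pendingTasks()`, since the list covers both the running tasks and the ones still waiting in the executor's queue.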
Lampros Smyrnaios
a354da763d
- Improve some log-messages.
...
- Increase app's version.
- Code polishing.
2023-10-06 17:28:54 +03:00
Lampros Smyrnaios
0c79fdea35
Update the "findAssignmentsQuery" to check the "attempt.error_class" field for the current pub_url, not the pub_id.
2023-10-06 14:59:26 +03:00
Lampros Smyrnaios
ebf8896005
- Fix getter and setter methods for the "isAuthoritative" field.
...
- Update Gradle.
2023-10-05 16:31:52 +03:00
Lampros Smyrnaios
b2ce6393c1
- Add check for remaining "bulkImportDirsUnderProcessing", before shutting down the Service.
...
- Code polishing.
2023-10-05 13:43:47 +03:00
Lampros Smyrnaios
96c11ba4b8
- Add a missing change.
...
- Code optimization and polishing.
- Update dependencies.
2023-10-04 16:17:12 +03:00
Lampros Smyrnaios
7019f7c3c7
Improve aggregation speed, by generating additional "attempt" and "payload" records for the publications which are in the back-log and whose url matches one of the urls of the current payloads.
2023-10-04 15:43:31 +03:00
Lampros Smyrnaios
b702cf4484
Upgrade the "findAssignmentsQuery":
...
- Retrieve the assignments by checking only the publication-urls against the "attempt", "assignment" and "payload" tables, not the IDs. This change allows us to: a) avoid re-attempting urls which have already been attempted multiple times (by different id-url pairs), b) avoid aggregating urls which are already inside the "payload" or "assignment" tables, even when they are related to other IDs.
In the end, we only care about the urls when choosing which records should be aggregated.
- Improve performance by using the "anti join" operator, where it fits, in order to allow the engine to use the faster "hash" operations.
2023-10-04 13:43:15 +03:00
Lampros Smyrnaios
c9626de120
Handle the case when the "upload-file-to-S3" operation fails with a "ConnectException". In this case, all remaining upload operations for the files of that particular batch or segment are canceled.
2023-10-04 13:01:13 +03:00
Lampros Smyrnaios
865926fbc3
- Handle the case when some results have been found from the "getAssignmentsQuery", but no data could be extracted from them.
...
- Code polishing.
2023-10-02 15:46:55 +03:00
Lampros Smyrnaios
ede7ca5a89
- Add bulk-import support for non-Authoritative data-sources.
...
- Update Spring Boot.
- Code polishing.
2023-09-26 18:02:48 +03:00
Lampros Smyrnaios
90a864ea61
Add more info in bulk-import logs.
2023-09-20 17:50:10 +03:00
Lampros Smyrnaios
0f5d4dac78
Check and show warning/error message for failed payloads.
2023-09-20 17:38:22 +03:00
Lampros Smyrnaios
068b97dd60
Set Xms and Xmx Java-parameters when running the Jar, in Docker.
2023-09-15 14:19:46 +03:00
Lampros Smyrnaios
903c3e1ffc
Add thread-safety when reading the bulkImportReport-files.
2023-09-15 11:54:32 +03:00
Lampros Smyrnaios
846c53913f
Add LICENSE.
2023-09-14 16:05:36 +03:00
Lampros Smyrnaios
360731ba72
- Improve handling of the "NO_CONTENT" case, in "getAssignments"-endpoint.
...
- Code optimization and polishing.
2023-09-14 13:53:01 +03:00
Lampros Smyrnaios
b4f91f188e
Fix the "retries-num" appearing in log-messages.
2023-09-14 12:08:33 +03:00
Lampros Smyrnaios
02bae38885
- Improve response-time to "getAssignments"-requests, by avoiding merging the parquet files of the "assignment" table, right after acquiring the assignments from the DB. They are already getting merged, when each assignments-batch is deleted after a workerReport has been processed.
...
- Optimize code-positioning for unlocking the DB when done executing queries.
2023-09-13 17:03:11 +03:00