Lampros Smyrnaios
8bc5cc35e2
- Optimize writing to the Bulk-import-report file.
...
- Show the IP of the worker which posts a "workerShutdownReport".
- Code polishing.
2024-03-22 17:50:55 +02:00
Lampros Smyrnaios
b9b29dd51c
Move some code from "FileUtils.getAndUploadFullTexts()" to two separate methods.
2024-03-20 16:53:03 +02:00
Lampros Smyrnaios
56d233d38e
- Move the "FileUtils.mergeParquetFiles()" method to "ParquetFileUtils.mergeParquetFilesOfTable()".
...
- Fix a typo.
2024-03-20 15:25:19 +02:00
Lampros Smyrnaios
9b0818b535
- Add handling for additional/specific exceptions, when checking the "futures".
...
- Move common "ExecutionException" handling-code into its own method: "GenericUtils.getSelectedStackTraceForCausedException()".
- Avoid a double log.
- Code polishing.
2024-03-14 13:59:23 +02:00
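A minimal sketch of the idea behind this commit, assuming the bulk-import tasks are checked through plain "java.util.concurrent.Future" objects. The class and method names below are hypothetical, and this is not the actual "GenericUtils.getSelectedStackTraceForCausedException()" code. It only illustrates unwrapping the "ExecutionException" to reach the exception thrown inside the task:

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

class FutureCauseSketch {

    /**
     * Waits for the task and returns a short description of its failure,
     * or null when the task completed normally.
     * (Hypothetical helper, for illustration only.)
     */
    static String describeFailure(Future<?> future) {
        try {
            future.get();
            return null;
        } catch (ExecutionException ee) {
            // The wrapper itself says little; the useful exception is its cause.
            Throwable cause = ee.getCause();
            Throwable shown = (cause != null) ? cause : ee;
            return shown.getClass().getSimpleName() + ": " + shown.getMessage();
        } catch (CancellationException ce) {
            return "CancellationException: the task was cancelled";
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // Restore the interrupt flag.
            return "InterruptedException: " + ie.getMessage();
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IllegalStateException("segment could not be imported"));
        System.out.println(describeFailure(failed));
    }
}
```

Handling each exception type in one helper is also one way to avoid logging the same failure twice.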
Lampros Smyrnaios
f61cae41a1
- Try to get the cause of the exception of the callable-tasks which handle the bulk-import of fileSegments.
...
- Fix not counting the failedSegments when an exception was thrown.
- Code polishing.
2024-03-13 12:15:59 +02:00
Lampros Smyrnaios
8f9786de09
Upgrade the algorithm for finding the previously-found fulltexts, based on their md5hash:
...
- Use a single query with a list of the fileHashes, instead of thousands of single-md5hash-check queries (run at most 6 in parallel), which require a lot of I/O.
- Avoid checking the same fileHash multiple times, in case it is related to multiple payloads.
- In case of a database-error, avoid completely losing the full-texts of that worker; instead, continue processing the full-texts.
2024-03-13 11:28:37 +02:00
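The single-query approach could look roughly like the sketch below. The table name, the column names and the class itself are assumptions made for illustration and are not taken from the actual code. The hashes are de-duplicated first, then one parameterized query is built with one placeholder per distinct hash:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringJoiner;

class HashLookupQuerySketch {

    // Hypothetical table and column names, for illustration only.
    static final String QUERY_PREFIX = "select `hash`, `location` from payload where `hash` in (";

    /** Drops null and duplicate hashes, keeping the first-seen order. */
    static Set<String> distinctHashes(Collection<String> fileHashes) {
        Set<String> distinct = new LinkedHashSet<>();
        for (String hash : fileHashes) {
            if (hash != null) distinct.add(hash);
        }
        return distinct;
    }

    /** Builds one query with a "?" placeholder for each distinct hash. */
    static String buildQuery(Set<String> distinctHashes) {
        if (distinctHashes.isEmpty())
            throw new IllegalArgumentException("No hashes to look up; skip the query instead.");
        StringJoiner query = new StringJoiner(", ", QUERY_PREFIX, ")");
        for (int i = 0; i < distinctHashes.size(); i++) {
            query.add("?");
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // The same hash may belong to multiple payloads, but it is checked only once.
        Set<String> hashes = distinctHashes(Arrays.asList("hashA", "hashB", "hashA"));
        System.out.println(buildQuery(hashes));
    }
}
```

The distinct hashes would then be bound to the placeholders in order. For very large lists, splitting them into a few chunks may be needed, depending on the limits of the database.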
Lampros Smyrnaios
e4540e7f3c
Handle the case when a urlReports-sublist does not have any payloads inside.
2024-03-12 14:25:00 +02:00
Lampros Smyrnaios
e20c5d2146
- Add error-handling for the case when no payloads could be associated with a specific url which should have been in the hashMultiMap in "addUrlReportsByMatchingRecordsFromBacklog".
...
- Fix not cloning the payload, before changing it and adding it in the "prefilledPayloads"-list; instead, an object-reference was used.
2024-03-11 19:48:04 +02:00
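The object-reference bug fixed here can be illustrated with a small sketch. The "Payload" class below is a simplified, hypothetical stand-in with only two fields, not the project's real model class. Each prefilled entry is built from a copy, so the original payload is never mutated:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PayloadCloneSketch {

    /** Simplified, hypothetical payload with only two fields. */
    static class Payload {
        String id;
        String location;

        Payload(String id, String location) {
            this.id = id;
            this.location = location;
        }

        /** Copy constructor: creates an independent object with the same field values. */
        Payload(Payload other) {
            this(other.id, other.location);
        }
    }

    /** Creates one prefilled payload per new id, each based on a copy of the existing payload. */
    static List<Payload> prefill(Payload existing, List<String> newIds) {
        List<Payload> prefilledPayloads = new ArrayList<>();
        for (String newId : newIds) {
            Payload copy = new Payload(existing); // Not "Payload copy = existing;", which would share the object.
            copy.id = newId;
            prefilledPayloads.add(copy);
        }
        return prefilledPayloads;
    }

    public static void main(String[] args) {
        Payload existing = new Payload("id-1", "s3://bucket/file.pdf");
        List<Payload> prefilled = prefill(existing, Arrays.asList("id-2", "id-3"));
        System.out.println(existing.id + " " + prefilled.get(0).id + " " + prefilled.get(1).id);
    }
}
```

With a shared reference instead of a copy, every entry of the list, as well as the original payload, would end up carrying the last assigned id.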
Lampros Smyrnaios
1048463ca0
- Improve error-handling in "S3ObjectStore.emptyBucket()".
...
- Change some log-levels.
- Code polishing.
2024-03-11 16:17:32 +02:00
Lampros Smyrnaios
ce3e149a95
Improve the "emptying/deleting" process of the S3-bucket.
2024-03-11 13:34:38 +02:00
Lampros Smyrnaios
43ea64758d
- Improve handling of the case when no fulltexts have been found or none of the found ones were requested from the worker, as they were already retrieved in the past.
...
- Show the number of files with problematic locations (if any of them exist).
- Code polishing.
2024-02-23 12:39:28 +02:00
Lampros Smyrnaios
3563fd6e2a
- Try to get the cause of the exception of the callable-tasks which handle parquet-files.
...
- Update License.
- Update dependencies.
2024-02-07 18:34:28 +02:00
Lampros Smyrnaios
34d7a143e7
Add/improve documentation.
2024-02-01 14:37:29 +02:00
Lampros Smyrnaios
5dadb8ad2f
- Optimize the "DOC_URL_FILTER"-regex, by using a non-capturing group.
...
- Remove an extra "File.separator" from the fulltexts-fullFilePath.
2024-01-19 15:46:23 +02:00
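As an illustration of the regex optimization, the sketch below compares a capturing and a non-capturing group. Both patterns are hypothetical, since the real "DOC_URL_FILTER" is not shown in this log. When the matched alternative is never read back, "(?:...)" lets the engine skip storing the group:

```java
import java.util.regex.Pattern;

class NonCapturingGroupSketch {

    // Hypothetical patterns, for illustration only.
    static final Pattern CAPTURING = Pattern.compile(".+\\.(pdf|docx?)$");
    static final Pattern NON_CAPTURING = Pattern.compile(".+\\.(?:pdf|docx?)$");

    /** Returns true when the whole url matches the non-capturing pattern. */
    static boolean looksLikeDocUrl(String url) {
        return NON_CAPTURING.matcher(url).matches();
    }

    public static void main(String[] args) {
        String url = "https://example.org/paper.pdf";
        // Both patterns accept the same urls; only the number of stored groups differs.
        System.out.println(CAPTURING.matcher(url).groupCount());     // 1
        System.out.println(NON_CAPTURING.matcher(url).groupCount()); // 0
        System.out.println(looksLikeDocUrl(url));                    // true
    }
}
```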
Lampros Smyrnaios
ee1ca8966b
- Avoid continuing to request workerReport-batches when, already from the 1st batch, the base-directory of that assignments-counter is not found.
...
- Update dependencies.
2024-01-15 12:57:33 +02:00
Lampros Smyrnaios
2e60128084
- Allow easily changing the port used by workers.
...
- Show the number of active background-tasks and bulkImportDirs, which delay the Service's shutdown.
- Update dependencies.
- Code polishing.
2023-12-19 23:31:42 +02:00
Lampros Smyrnaios
d20c9a7d2e
- Show the original exception thrown by the background-job, not the one thrown in the main-thread, which is useless, except for its message.
...
- Reduce the interval for deleting the unhandled assignments to once every 3 days.
- Set the upcoming version.
- Update dependencies.
2023-11-27 18:19:53 +02:00
Lampros Smyrnaios
7f789b8ad0
- If we receive an "UnknownHostException" when uploading to the S3ObjectStore, then skip the current full-texts' batch to leave some time for the network to get unstuck.
...
- Code polishing.
2023-11-22 15:29:18 +02:00
Lampros Smyrnaios
856c62887d
- Make sure the "UTF_8" charset is used, when we get a message from the response-body.
...
- Improve some log-messages.
2023-10-26 11:44:23 +03:00
Lampros Smyrnaios
0c7bf6357b
- Improve performance in "FileUtils.addUrlReportsByMatchingRecordsFromBacklog()".
...
- Make sure we remove the assignments of all "not-successful", old, worker-reports, even for the ones which failed to be renamed to indicate success or failure, or failed to be executed by the background threads (and thus never reached the renaming stage).
2023-10-23 12:21:42 +03:00
Lampros Smyrnaios
a7581335f1
- Improve the "getDataForPayloadPrefillQuery".
...
- Improve some error-messages.
2023-10-21 11:31:31 +03:00
Lampros Smyrnaios
df0ea62a5a
- Handle the case when the "webHDFSBaseUrl" does not use HTTPS.
...
- Improve error-reporting when uploading a file to HDFS.
2023-10-19 11:59:37 +03:00
Lampros Smyrnaios
40729c6295
Move similar code into the new "ParquetFileUtils.getPayloadParquetRecord()" method.
2023-10-17 12:50:51 +03:00
Lampros Smyrnaios
f05eee7569
Improve the names of some methods.
2023-10-16 23:39:43 +03:00
Lampros Smyrnaios
a354da763d
- Improve some log-messages.
...
- Increase app's version.
- Code polishing.
2023-10-06 17:28:54 +03:00
Lampros Smyrnaios
96c11ba4b8
- Add a missing change.
...
- Code optimization and polishing.
- Update dependencies.
2023-10-04 16:17:12 +03:00
Lampros Smyrnaios
7019f7c3c7
Improve aggregation speed, by generating additional "attempt" and "payload" records for the publications which are in the back-log and whose url matches one of the urls of the current payloads.
2023-10-04 15:43:31 +03:00
Lampros Smyrnaios
c9626de120
Handle the case when the "upload-file-to-S3" operation fails with a "ConnectException". In this case, all remaining upload operations for the files of that particular batch or segment are canceled.
2023-10-04 13:01:13 +03:00
Lampros Smyrnaios
865926fbc3
- Handle the case when some results have been found from the "getAssignmentsQuery", but no data could be extracted from them.
...
- Code polishing.
2023-10-02 15:46:55 +03:00
Lampros Smyrnaios
ede7ca5a89
- Add bulk-import support for non-Authoritative data-sources.
...
- Update Spring Boot.
- Code polishing.
2023-09-26 18:02:48 +03:00
Lampros Smyrnaios
0f5d4dac78
Check and show warning/error message for failed payloads.
2023-09-20 17:38:22 +03:00
Lampros Smyrnaios
903c3e1ffc
Add thread-safety when reading the bulkImportReport-files.
2023-09-15 11:54:32 +03:00
Lampros Smyrnaios
360731ba72
- Improve handling of the "NO_CONTENT" case, in "getAssignments"-endpoint.
...
- Code optimization and polishing.
2023-09-14 13:53:01 +03:00
Lampros Smyrnaios
6891c467d4
- Avoid displaying a warning for the "test" HDFS directory, when the Controller is running in PROD mode.
...
- Add a missing change for the optimization of reading files.
- Update dependencies.
2023-09-13 15:29:30 +03:00
Lampros Smyrnaios
ee2df19ce1
- Allow "pretty-printing" the json response of the "getBulkImportReport" endpoint.
...
- Add useful log-messages for various bulk-import stages and improve the current ones.
- Optimize reading and writing the reports.
2023-09-11 17:24:39 +03:00
Lampros Smyrnaios
1c8f3765ca
- Fix not acquiring the full workerReport when retrying it, with the scheduler.
...
- Improve error-handling in the "inspectWorkerReportsAndTakeAction" process.
- Code polishing.
2023-09-08 14:59:48 +03:00
Lampros Smyrnaios
4014d1eabb
Code polishing.
2023-09-05 15:20:03 +03:00
Lampros Smyrnaios
98516498eb
- Increase app's version.
...
- Code polishing.
2023-09-04 12:46:55 +03:00
Lampros Smyrnaios
febe2b212c
Upgrade management of failed workerReports:
...
- Upon completing processing a workerReport, the name of the json-file will be appended with "successful" or "failed".
- Avoid deleting immediately the failed workerReports.
- Add a scheduling task to process leftover failed workerReports from previous executions of the service, only once, 12 hours after startup, in order for the workers to have participated and filled the "workersInfoMap".
- Add a scheduling task to process leftover failed workerReports from the current execution, regularly.
- Fix not iterating through the workers' subDirs when checking the last-access-time of workerReports.
- Fix not deleting the assignment records from the DB, when a failed leftover workerReport gets deleted.
- Code refactoring.
2023-09-01 15:10:58 +03:00
Lampros Smyrnaios
601776e81c
- Fix not handling the case where the info about the worker in the WorkerReport does not exist inside the "workersInfoMap", as that worker is not participating in the Service (this case may appear in future code).
...
- Code polishing.
2023-08-30 17:07:51 +03:00
Lampros Smyrnaios
aa3f32f3da
- Make sure the number of threads given by the user is above zero.
...
- Adjust the number and size of log files.
- Update Spring Boot.
- Code polishing.
2023-08-30 14:02:54 +03:00
Lampros Smyrnaios
44459c8681
- Rename "ImpalaConnector.java" to "DatabaseConnector.java".
...
- Update dependencies.
- Code polishing.
2023-08-23 16:55:23 +03:00
Lampros Smyrnaios
66a5b3c7da
Update Bulk-Import API:
...
- Increase the "numOfThreadsPerBulkImportProcedure" to 6.
- Fix Bulk import not working from a second-level subdirectory; the report-subDirectory was not created.
- Fix not returning the bulk-import-report as "application/json".
- Add useful messages for missing parameters.
- Change the HTTP-method for the "bulkImportFullTexts" endpoint to "POST".
- Show a structured json-response for the "bulkImportFullTexts" endpoint.
- Fix uncommon date-format.
- Remove single quotes from json-report, since they are returned as bytes, not characters.
- Optimize the generation of the json-bulkImport-report.
2023-07-25 11:59:47 +03:00
Lampros Smyrnaios
fd1cf56863
- Avoid passing some duplicate publications to the Workers, by post-processing the assignments retrieved from Impala. The assignments may contain duplicate id-url pairs, which have different datasources, since one publication may be connected to multiple datasources.
...
- Code polishing.
2023-07-19 18:31:24 +03:00
Lampros Smyrnaios
b94c35c66e
- Fix double active "@Scheduled" annotation for the "ScheduledTasks.updatePrometheusMetrics()" method.
...
- Code polishing.
2023-07-13 18:32:45 +03:00
Lampros Smyrnaios
8dfb58ee63
Avoid assigning the same publications multiple times to the Workers, after the recent "parallelization enhancement".
...
After that enhancement, each worker could request multiple assignment-batches, before its previous batches were processed by the Controller. This means that for each batch that was processed, the Controller was deleting from the "assignment" table all the assignments (-batches) delivered to the Worker that brought that batch, even though the "attempt" and "payload" records for the rest of the batches were not inserted in the DB yet. So, in a new assignments-batch request, the same publications that were already under processing were delivered to the same or other Workers.
Now, for each finished batch, only the assignments of that batch are deleted from the "assignment" table.
2023-07-11 17:27:23 +03:00
Lampros Smyrnaios
2d5643cb0a
Fix missing spaces in some secondary "DROP"-queries.
2023-07-07 20:51:14 +03:00
Lampros Smyrnaios
e8644cb64f
- Optimize the "insertAssignmentsQuery".
...
- Add documentation about the Prometheus Metrics, in README.
- Update Dependencies.
- Code polishing.
2023-07-05 17:10:30 +03:00
Lampros Smyrnaios
4c3e2e6b6e
- Fix not using the actual "currentAssignmentsBatch" of the workerReport itself, when creating the parquetFileNames and when reporting to the user the initialization of the "addition of the workerReport".
...
- Code polishing.
2023-06-27 16:08:01 +03:00
Lampros Smyrnaios
798fa09d68
- Identify and handle a possible Worker-crash, in "UrlsServiceImpl.postReportResultToWorker()".
...
- Add/Improve some log messages.
- Update and cleanup dependencies.
- Code polishing.
2023-06-15 23:19:36 +03:00