Lampros Smyrnaios
0c7bf6357b
- Improve performance in "FileUtils.addUrlReportsByMatchingRecordsFromBacklog()".
...
- Make sure the assignments of all old, "not-successful" worker-reports are removed, including the ones which failed to be renamed to indicate success or failure, or which were never executed by the background threads (and thus never reached the renaming stage).
2023-10-23 12:21:42 +03:00
Lampros Smyrnaios
a7581335f1
- Improve the "getDataForPayloadPrefillQuery".
...
- Improve some error-messages.
2023-10-21 11:31:31 +03:00
Lampros Smyrnaios
a354da763d
- Improve some log-messages.
...
- Increase app's version.
- Code polishing.
2023-10-06 17:28:54 +03:00
Lampros Smyrnaios
96c11ba4b8
- Add a missing change.
...
- Code optimization and polishing.
- Update dependencies.
2023-10-04 16:17:12 +03:00
Lampros Smyrnaios
7019f7c3c7
Improve aggregation speed, by generating additional "attempt" and "payload" records for the publications in the backlog whose url matches one of the urls of the current payloads.
2023-10-04 15:43:31 +03:00
Lampros Smyrnaios
c9626de120
Handle the case when the "upload-file-to-S3" operation fails with a "ConnectException". In this case, all remaining upload operations for the files of that particular batch or segment are canceled.
2023-10-04 13:01:13 +03:00
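The behavior described in this commit could look roughly like the sketch below. It is a minimal illustration, not the project's code: the `FileUploader` interface, the `BatchUploadSketch` class and the sequential loop are assumptions made for the example, and so is the choice to keep going after failures other than a "ConnectException".

```java
import java.net.ConnectException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface; it stands in for the real "upload-file-to-S3" operation.
interface FileUploader {
    void upload(String fileName) throws Exception;
}

class BatchUploadSketch {

    // Uploads the files of one batch in order and returns the names of the uploaded files.
    // A ConnectException cancels all remaining uploads of this batch.
    // Any other failure only skips the current file (an assumption of this sketch).
    static List<String> uploadBatch(List<String> fileNames, FileUploader uploader) {
        List<String> uploaded = new ArrayList<>();
        for (String fileName : fileNames) {
            try {
                uploader.upload(fileName);
                uploaded.add(fileName);
            } catch (ConnectException ce) {
                // The object store is unreachable, so the remaining files would fail as well.
                break;
            } catch (Exception e) {
                // A file-specific problem: continue with the next file.
            }
        }
        return uploaded;
    }
}
```

The caller can compare the returned list with the requested one to find out which files of the batch still need to be uploaded.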
Lampros Smyrnaios
865926fbc3
- Handle the case when some results have been found from the "getAssignmentsQuery", but no data could be extracted from them.
...
- Code polishing.
2023-10-02 15:46:55 +03:00
Lampros Smyrnaios
ede7ca5a89
- Add bulk-import support for non-Authoritative data-sources.
...
- Update Spring Boot.
- Code polishing.
2023-09-26 18:02:48 +03:00
Lampros Smyrnaios
0f5d4dac78
Check and show warning/error message for failed payloads.
2023-09-20 17:38:22 +03:00
Lampros Smyrnaios
903c3e1ffc
Add thread-safety when reading the bulkImportReport-files.
2023-09-15 11:54:32 +03:00
Lampros Smyrnaios
360731ba72
- Improve handling of the "NO_CONTENT" case, in "getAssignments"-endpoint.
...
- Code optimization and polishing.
2023-09-14 13:53:01 +03:00
Lampros Smyrnaios
ee2df19ce1
- Allow "pretty-printing" the json response of the "getBulkImportReport" endpoint.
...
- Add useful log-messages for various bulk-import stages and improve the current ones.
- Optimize reading and writing the reports.
2023-09-11 17:24:39 +03:00
Lampros Smyrnaios
1c8f3765ca
- Fix the scheduler not acquiring the full workerReport when retrying it.
...
- Improve error-handling in the "inspectWorkerReportsAndTakeAction" process.
- Code polishing.
2023-09-08 14:59:48 +03:00
Lampros Smyrnaios
4014d1eabb
Code polishing.
2023-09-05 15:20:03 +03:00
Lampros Smyrnaios
98516498eb
- Increase app's version.
...
- Code polishing.
2023-09-04 12:46:55 +03:00
Lampros Smyrnaios
febe2b212c
Upgrade management of failed workerReports:
...
- Upon completing the processing of a workerReport, "successful" or "failed" will be appended to the name of its json-file.
- Avoid deleting the failed workerReports immediately.
- Add a scheduling task to process leftover failed workerReports from previous executions of the service, only once, 12 hours after startup, so that the workers have had time to participate and fill the "workersInfoMap".
- Add a scheduling task to process leftover failed workerReports from the current execution, regularly.
- Fix not iterating through the workers' subDirs when checking the last-access-time of workerReports.
- Fix not deleting the assignment records from the DB, when a failed leftover workerReport gets deleted.
- Code refactoring.
2023-09-01 15:10:58 +03:00
Lampros Smyrnaios
601776e81c
- Fix not handling the case where the worker referenced in the WorkerReport does not exist inside the "workersInfoMap", as that worker is not participating in the Service (this case may appear in future code).
...
- Code polishing.
2023-08-30 17:07:51 +03:00
Lampros Smyrnaios
aa3f32f3da
- Make sure the number of threads given by the user is above zero.
...
- Adjust the number and size of log files.
- Update Spring Boot.
- Code polishing.
2023-08-30 14:02:54 +03:00
Lampros Smyrnaios
44459c8681
- Rename "ImpalaConnector.java" to "DatabaseConnector.java".
...
- Update dependencies.
- Code polishing.
2023-08-23 16:55:23 +03:00
Lampros Smyrnaios
fd1cf56863
- Avoid passing some duplicate publications to the Workers, by post-processing the assignments retrieved from Impala. The assignments may contain duplicate id-url pairs, which have different datasources, since one publication may be connected to multiple datasources.
...
- Code polishing.
2023-07-19 18:31:24 +03:00
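The post-processing described in this commit can be sketched as follows. This is only an illustration under assumptions: the `Assignment` class is a hypothetical, simplified stand-in with three fields, and keeping the first occurrence of each id-url pair is a choice made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AssignmentDedupSketch {

    // Hypothetical, simplified assignment; the real class is not shown in this log.
    static final class Assignment {
        final String id;
        final String url;
        final String datasourceId;

        Assignment(String id, String url, String datasourceId) {
            this.id = id;
            this.url = url;
            this.datasourceId = datasourceId;
        }
    }

    // Keeps the first assignment for each id-url pair and drops the rest,
    // even when the dropped ones point to a different datasource.
    static List<Assignment> removeDuplicateIdUrlPairs(List<Assignment> assignments) {
        Set<List<String>> seenPairs = new HashSet<>();
        List<Assignment> unique = new ArrayList<>();
        for (Assignment a : assignments) {
            // Set.add() returns false when this id-url pair was already seen.
            if (seenPairs.add(Arrays.asList(a.id, a.url))) {
                unique.add(a);
            }
        }
        return unique;
    }
}
```

The input order is preserved, so the Workers would receive the surviving assignments in the order in which the database returned them.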
Lampros Smyrnaios
8dfb58ee63
Avoid assigning the same publications multiple times to the Workers, after the recent "parallelization enhancement".
...
After that enhancement, each worker could request multiple assignment-batches before its previous batches were processed by the Controller. As a result, for each batch that was processed, the Controller was deleting from the "assignment" table all the assignments (of all batches) delivered to the Worker that brought that batch, even though the "attempt" and "payload" records for the rest of the batches were not inserted in the DB yet. So, in a new assignments-batch request, the same publications that were already under processing were delivered to the same or other Workers.
Now, for each finished batch, only the assignments of that batch are deleted from the "assignment" table.
2023-07-11 17:27:23 +03:00
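The difference between the old and the new deletion scope can be illustrated with a small in-memory model. Everything in it is an assumption made for the example: the real "assignment" table lives in the database, and the `batchCounter` field is a hypothetical batch identifier.

```java
import java.util.ArrayList;
import java.util.List;

class AssignmentTableSketch {

    // Hypothetical row of the "assignment" table, reduced to the fields needed here.
    static final class Row {
        final String workerId;
        final long batchCounter;
        final String publicationId;

        Row(String workerId, long batchCounter, String publicationId) {
            this.workerId = workerId;
            this.batchCounter = batchCounter;
            this.publicationId = publicationId;
        }
    }

    private final List<Row> rows = new ArrayList<>();

    void insert(String workerId, long batchCounter, String publicationId) {
        rows.add(new Row(workerId, batchCounter, publicationId));
    }

    int size() {
        return rows.size();
    }

    // Old scope: removes every assignment delivered to the worker,
    // including the ones of batches which are still being processed.
    int deleteByWorker(String workerId) {
        int before = rows.size();
        rows.removeIf(r -> r.workerId.equals(workerId));
        return before - rows.size();
    }

    // New scope: removes only the assignments of the finished batch.
    int deleteByWorkerAndBatch(String workerId, long batchCounter) {
        int before = rows.size();
        rows.removeIf(r -> r.workerId.equals(workerId) && r.batchCounter == batchCounter);
        return before - rows.size();
    }
}
```

With the narrower scope, the assignments of the batches which are still in progress stay in the table, so their publications are not handed out again.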
Lampros Smyrnaios
2d5643cb0a
Fix missing spaces in some secondary "DROP"-queries.
2023-07-07 20:51:14 +03:00
Lampros Smyrnaios
e8644cb64f
- Optimize the "insertAssignmentsQuery".
...
- Add documentation about the Prometheus Metrics, in README.
- Update Dependencies.
- Code polishing.
2023-07-05 17:10:30 +03:00
Lampros Smyrnaios
798fa09d68
- Identify and handle a possible Worker-crash, in "UrlsServiceImpl.postReportResultToWorker()".
...
- Add/Improve some log messages.
- Update and cleanup dependencies.
- Code polishing.
2023-06-15 23:19:36 +03:00
Lampros Smyrnaios
c37f157f51
Split the full-texts-batch's main handling-code into two separate methods, which can be used in parallel by two threads, in the future.
2023-06-14 17:16:38 +03:00
Lampros Smyrnaios
0b1ab5b991
Attempt to recover from serious failures, by individualizing the error-handling for each of the "table-merging" queries.
2023-06-10 02:28:02 +03:00
Lampros Smyrnaios
cd1fb0af88
- Process the WorkerReports in background Jobs and post the reportResults to the Workers.
...
- Save the workerReports to json files, until they are processed successfully.
- Show some custom metrics in prometheus.
2023-05-24 13:52:28 +03:00
Lampros Smyrnaios
0ab6bae93a
- Optimize the json-conversion of the "BulkImportReport".
...
- Code polishing.
2023-05-18 17:30:40 +03:00
Lampros Smyrnaios
9412391903
- In test-environment mode, check for already existing file-hashes only in the "payload_aggregated" table, instead of the whole "payload" view. This way the investigation for false-positive docUrls is easier, as we avoid checking against the millions of "legacy" payloads.
...
- Improve performance in production, by not creating the string objects for "trace"-logs.
2023-05-15 12:44:16 +03:00
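One common way to avoid building log strings which are never written is to defer the message construction until the log level is known to be enabled. The sketch below uses the JDK's `java.util.logging` with a message supplier, purely as an illustration. The logging framework and the exact technique used by the project are not stated in this log, and the `describe` helper and the `messagesBuilt` counter are hypothetical.

```java
import java.util.logging.Logger;

class LazyTraceLogSketch {

    static final Logger logger = Logger.getLogger("LazyTraceLogSketch");

    // Counts how many message strings were actually built (for demonstration only).
    static int messagesBuilt = 0;

    // Hypothetical helper which builds a trace-level message.
    static String describe(String fileHash) {
        messagesBuilt++;
        return "Checking for an already existing file-hash: " + fileHash;
    }

    // Eager version: the string is built even when the FINEST level is disabled.
    static void logEager(String fileHash) {
        logger.finest(describe(fileHash));
    }

    // Lazy version: the supplier is invoked only when the FINEST level is enabled.
    static void logLazy(String fileHash) {
        logger.finest(() -> describe(fileHash));
    }
}
```

When the logger runs at the INFO level, `logEager` still pays for the string concatenation, while `logLazy` does not build the message at all.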
Lampros Smyrnaios
992d4ffd5e
- Add the time-zone in the logs.
...
- Change some log-levels to "trace", although most of them are still disabled.
2023-05-11 03:10:53 +03:00
Lampros Smyrnaios
b6e8cd1889
New feature: BulkImport full-text files from compatible datasources.
2023-05-11 03:07:55 +03:00
Lampros Smyrnaios
fd15372fd6
Add error-checks for retrieving the status-code from HttpUrlConnections.
2023-05-03 13:30:29 +03:00
Lampros Smyrnaios
55ea5118ac
- Update the "testDatabaseName" property.
...
- Code polishing.
2023-04-26 19:33:28 +03:00
Lampros Smyrnaios
484cf5cefc
- Avoid requesting the remaining full-text batches in case the Worker returns a 5XX error in one of the batches.
...
- Add nullability-checks for "datasourceId" and "hash" before constructing the new filename and uploading the full-text to S3.
- Improve a log-message.
2023-03-29 17:12:37 +03:00
Lampros Smyrnaios
4280f89296
- Set the default value of the "isTestEnvironment" property to "true", in order to avoid undesired outcomes in the production db.
...
- Code polishing.
2023-03-21 17:04:28 +02:00
Lampros Smyrnaios
17a6c120dd
Improve logs for full-texts' metrics.
2023-03-14 20:57:01 +02:00
Lampros Smyrnaios
38643c76a3
- Code polishing.
...
- Update Gradle.
2023-03-07 16:55:41 +02:00
Lampros Smyrnaios
7b217764e0
Improve performance when downloading and decompressing the full-texts archive.
2023-03-02 17:44:53 +02:00
Lampros Smyrnaios
c8485d472e
Code polishing.
2023-02-24 13:53:09 +02:00
Lampros Smyrnaios
b7f6056032
- Improve an error-message.
...
- Update Gradle.
2023-02-21 15:42:07 +02:00
Lampros Smyrnaios
a1c16ffc19
- Exclude empty and null urls in the assignments.
...
- Update the "getFullTextsImproved"-call to "getFullTexts", now that the "improved" version is stable.
- Update Gradle.
- Code polishing.
2023-02-16 14:24:47 +02:00
Lampros Smyrnaios
dc8f0f2bd1
- Heavily reduce the maximum amount of space needed, by deleting the files of each full-texts batch, right after they are uploaded to the S3 Object Store.
...
- Add a check for when the retrieved full-texts-batch is missing some requested files and show a warn-log.
- Update dependencies.
2023-01-23 20:23:21 +02:00
Lampros Smyrnaios
8876089022
- Use Facebook's [**Zstandard**](https://facebook.github.io/zstd/) compression algorithm, which greatly improves the compression rate and speed.
...
- Update the minIO dependency.
- Code polishing.
2023-01-10 13:34:54 +02:00
Lampros Smyrnaios
d1a4c84289
- Make sure the fullPath of the baseFilesLocation is available when the user specifies a non-root directory.
...
- Improve error-checking and exception-handling in some "S3ObjectStore"-methods.
- Make sure the "responseCode" is "200-OK", before trying to get the InputStream in "UriBuilder.getPublicIP()".
2023-01-09 15:44:53 +02:00
Lampros Smyrnaios
4528d1f9be
- Fix the "baseFilesLocation" being null (there was no serious problem, but multiple directories were spawned in the project's directory).
...
- Make sure the given "baseFilesLocation" ends with a file-separator, before using it.
- Optimize the process of unzipping-files.
2022-12-20 18:38:11 +02:00
Lampros Smyrnaios
e11afe5ab2
Improve performance of the hash-checking algorithm by using multithreading.
2022-12-15 18:34:28 +02:00
Lampros Smyrnaios
9cdbbdea67
Refactor the files' storage location.
2022-12-15 18:29:51 +02:00
Lampros Smyrnaios
bfdf06bd09
- Apply error-checking on individual CallableTasks and on task-batches related to the creation and upload of all the data for the "attempt" and "payload" tables. So, if no data could be uploaded for one or both tables, no "load"-queries will be executed for the affected table(s).
...
- Catch the more general "Exception", inside "FileUtils.mergeParquetFiles()", in order to be certain that the "SQLException" can also be caught.
- Code polishing.
2022-12-09 12:46:06 +02:00
Lampros Smyrnaios
c8baf5a5fc
- Fix not finding the parquet-schema files when the app was run inside a Docker Container.
...
- Update the "namespaces" and the "names" inside the parquet schemas.
- Code polishing.
2022-12-08 12:16:05 +02:00
Lampros Smyrnaios
8607594f6d
- Improve exception handling.
...
- Code polishing.
2022-12-07 13:48:00 +02:00