Lampros Smyrnaios
6944678391
Improve error-handling when renaming workerReport-files.
2023-09-08 17:41:10 +03:00
Lampros Smyrnaios
1c8f3765ca
- Fix not acquiring the full workerReport when retrying it, with the scheduler.
...
- Improve error-handling in the "inspectWorkerReportsAndTakeAction" process.
- Code polishing.
2023-09-08 14:59:48 +03:00
Lampros Smyrnaios
e72a4d3d10
- Improve handling of already renamed workerReport-files, which relate to failed workerReports in the 1st try.
...
- Add check for workerReport-files, which may have been deleted before their time, due to an error.
2023-09-08 14:11:41 +03:00
Lampros Smyrnaios
bd9245cc3d
Avoid deleting the assignment-records in case of a "parquet-data creation, upload or insertion" problem, in order to avoid double-processing of the urls, until the report has been retried by the scheduler.
2023-09-08 13:44:24 +03:00
Lampros Smyrnaios
718f5cfefb
- Improve prioritization of the most recent publications.
...
- Avoid processing publications which will be published in the next 5 years, counting from each "current" year, since they are not providing full-texts yet. Still allow the invalid publication-years like "2566", "9999", etc.
2023-09-07 14:05:58 +03:00
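One possible reading of the publication-year rule in commit 718f5cfefb, sketched in Java. The class name, the method name and the inclusive upper bound are assumptions made for illustration; the actual rule probably lives inside the assignments query and may differ.

```java
import java.time.Year;

public class PublicationYearFilter {

    // Assumption: the window size mirrors the "next 5 years" wording of the commit message.
    static final int FUTURE_WINDOW_YEARS = 5;

    // Returns true when a publication with this year should be processed.
    // Years after the current one, up to and including currentYear + 5, are skipped.
    // Years further ahead (e.g. 2566, 9999) are treated as invalid metadata and are still allowed.
    static boolean shouldProcess(int publicationYear, int currentYear) {
        boolean inNearFuture = publicationYear > currentYear
                && publicationYear <= currentYear + FUTURE_WINDOW_YEARS;
        return !inNearFuture;
    }

    public static void main(String[] args) {
        int currentYear = Year.now().getValue();
        for (int year : new int[] {currentYear, currentYear + 2, 9999}) {
            System.out.println(year + " -> " + shouldProcess(year, currentYear));
        }
    }
}
```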
Lampros Smyrnaios
acef891167
Improve prioritization of "publication_boost" records, by adding a second ordering in the end.
2023-09-04 15:34:37 +03:00
Lampros Smyrnaios
98516498eb
- Increase app's version.
...
- Code polishing.
2023-09-04 12:46:55 +03:00
Lampros Smyrnaios
febe2b212c
Upgrade management of failed workerReports:
...
- After a workerReport has been processed, "successful" or "failed" is appended to the name of its json-file.
- Avoid immediately deleting the failed workerReports.
- Add a scheduling task to process leftover failed workerReports from previous executions of the service, only once, 12 hours after startup, in order for the workers to have participated and filled the "workersInfoMap".
- Add a scheduling task to process leftover failed workerReports from the current execution, regularly.
- Fix not iterating through the workers' subDirs when checking the last-access-time of workerReports.
- Fix not deleting the assignment records from the DB, when a failed leftover workerReport gets deleted.
- Code refactoring.
2023-09-01 15:10:58 +03:00
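The renaming step described in commit febe2b212c could look roughly like the sketch below. The class and method names are hypothetical, and the exact file-name format (here the status is appended after a dot) is an assumption, since the commit message only says that "successful" or "failed" is appended.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class WorkerReportRenamer {

    // Renames the report file so that its name ends with the processing status.
    // Assumption: the status is appended as ".successful" or ".failed".
    static Path markReport(Path reportFile, boolean successful) throws IOException {
        String status = successful ? "successful" : "failed";
        String newName = reportFile.getFileName().toString() + "." + status;
        Path target = reportFile.resolveSibling(newName);
        // Replace a leftover file with the same name, e.g. from an earlier attempt.
        return Files.move(reportFile, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("workerReports");
        Path report = Files.createFile(dir.resolve("worker1_assignments_1_report.json"));
        System.out.println(markReport(report, false).getFileName());
    }
}
```

Keeping the failed file on disk under a distinct name is what lets a scheduled task find and retry it later.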
Lampros Smyrnaios
5c459a3a16
Optimize handling of HTTP-4XX errors in "UrlsServiceImpl.postReportResultToWorker()".
2023-08-31 13:20:12 +03:00
Lampros Smyrnaios
601776e81c
- Fix the missing handling of the case where the worker's info from the WorkerReport does not exist inside the "workersInfoMap", because that worker is not participating in the Service (this case may appear in future code).
...
- Code polishing.
2023-08-30 17:07:51 +03:00
Lampros Smyrnaios
c32dfa882e
Fix not deleting the assignment-records, for every workerReport, after processing it.
2023-08-30 16:22:58 +03:00
Lampros Smyrnaios
aa3f32f3da
- Make sure the number of threads given by the user is above zero.
...
- Adjust the number and size of log files.
- Update Spring Boot.
- Code polishing.
2023-08-30 14:02:54 +03:00
Lampros Smyrnaios
44459c8681
- Rename "ImpalaConnector.java" to "DatabaseConnector.java".
...
- Update dependencies.
- Code polishing.
2023-08-23 16:55:23 +03:00
Lampros Smyrnaios
a524375656
- Create the HDFS-subDirs before generating "callableTasks" for creating and uploading the parquetFiles.
...
- Delete gradle .zip file after installation.
2023-08-04 15:30:41 +03:00
Lampros Smyrnaios
860c73ea91
- Improve the "shutdownController.sh" script.
...
- Set names for the Prometheus and Grafana containers.
- Code polishing.
2023-07-27 18:27:48 +03:00
Lampros Smyrnaios
d821ae398f
Improve performance by applying the merging-procedure for the parquet files of the database tables less often, while keeping the benefits of having a relatively small maximum number of parquet files in search operations.
2023-07-24 20:28:41 +03:00
Lampros Smyrnaios
cec2531737
- Increase the "numOfBackgroundThreads" to 8.
...
- Make the "numOfBackgroundThreads" and "numOfThreadsPerBulkImportProcedure" configurable from the "application.yml" file.
- Code polishing.
2023-07-21 11:45:50 +03:00
Lampros Smyrnaios
fd1cf56863
- Avoid passing some duplicate publications to the Workers, by post-processing the assignments retrieved from Impala. The assignments may contain duplicate id-url pairs, which have different datasources, since one publication may be connected to multiple datasources.
...
- Code polishing.
2023-07-19 18:31:24 +03:00
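The post-processing from commit fd1cf56863 can be illustrated with a minimal sketch that keeps the first occurrence of each id-url pair and ignores the datasource when comparing. The `Assignment` class here is a hypothetical stand-in with only three fields, not the project's actual model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AssignmentDeduplicator {

    // Hypothetical, reduced assignment model used only for this sketch.
    static final class Assignment {
        final String id;
        final String url;
        final String datasource;

        Assignment(String id, String url, String datasource) {
            this.id = id;
            this.url = url;
            this.datasource = datasource;
        }
    }

    // Keeps the first assignment for each (id, url) pair, preserving the input order.
    static List<Assignment> removeDuplicateIdUrlPairs(List<Assignment> assignments) {
        Set<List<String>> seenPairs = new HashSet<>();
        List<Assignment> unique = new ArrayList<>();
        for (Assignment assignment : assignments) {
            // The datasource is deliberately left out of the key.
            if (seenPairs.add(Arrays.asList(assignment.id, assignment.url))) {
                unique.add(assignment);
            }
        }
        return unique;
    }
}
```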
Lampros Smyrnaios
8dfb58ee63
Avoid assigning the same publications multiple times to the Workers, after the recent "parallelization enhancement".
...
After that enhancement, each worker could request multiple assignment-batches before its previous batches were processed by the Controller. This means that for each batch that was processed, the Controller was deleting from the "assignment" table all the assignments (-batches) delivered to the Worker that brought that batch, even though the "attempt" and "payload" records for the rest of the batches were not inserted in the DB yet. So, in a new assignments-batch request, the same publications that were already under processing were delivered to the same or other Workers.
Now, for each finished batch, only the assignments of that batch are deleted from the "assignment" table.
2023-07-11 17:27:23 +03:00
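The difference between the old and the new deletion behaviour in commit 8dfb58ee63 can be modelled in memory as below. This is only an illustration: the real "assignment" table lives in the database, and the class, field and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class AssignmentTable {

    // Hypothetical row: which worker received which url, and in which batch.
    static final class Row {
        final String workerId;
        final long batchCounter;
        final String url;

        Row(String workerId, long batchCounter, String url) {
            this.workerId = workerId;
            this.batchCounter = batchCounter;
            this.url = url;
        }
    }

    private final List<Row> rows = new ArrayList<>();

    void insert(String workerId, long batchCounter, String url) {
        rows.add(new Row(workerId, batchCounter, url));
    }

    // Old behaviour: removes every batch of the worker, including batches still in progress.
    void deleteAllForWorker(String workerId) {
        rows.removeIf(row -> row.workerId.equals(workerId));
    }

    // New behaviour: removes only the assignments of the finished batch.
    void deleteForBatch(String workerId, long batchCounter) {
        rows.removeIf(row -> row.workerId.equals(workerId) && row.batchCounter == batchCounter);
    }

    int size() {
        return rows.size();
    }
}
```

With the batch-scoped deletion, the urls of a worker's unfinished batches stay marked as assigned, so a new assignments request does not hand them out again.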
Lampros Smyrnaios
e8644cb64f
- Optimize the "insertAssignmentsQuery".
...
- Add documentation about the Prometheus Metrics, in README.
- Update Dependencies.
- Code polishing.
2023-07-05 17:10:30 +03:00
Lampros Smyrnaios
a89abe3f2f
Prioritize the publications, which are specified inside the "publication_boost" table, according to their "boost-level".
2023-06-29 12:32:06 +03:00
Lampros Smyrnaios
4c3e2e6b6e
- Fix not using the actual "currentAssignmentsBatch" of the workerReport itself, when creating the parquetFileNames and when reporting to the user the initialization of the "addition of the workerReport".
...
- Code polishing.
2023-06-27 16:08:01 +03:00
Lampros Smyrnaios
d52601e819
- Add handling for the "EmptyResultDataAccessException" in "UrlsServiceImpl.getAssignments()", which is thrown when no assignments are returned from the query.
...
- Code polishing.
2023-06-22 12:39:11 +03:00
Lampros Smyrnaios
798fa09d68
- Identify and handle a possible Worker-crash, in "UrlsServiceImpl.postReportResultToWorker()".
...
- Add/Improve some log messages.
- Update and cleanup dependencies.
- Code polishing.
2023-06-15 23:19:36 +03:00
Lampros Smyrnaios
e2776c50d0
- Optimize the "WorkerReportResult" and the "ShutdownWorker" requests.
...
- Improve documentation.
2023-06-10 02:31:57 +03:00
Lampros Smyrnaios
3988eb3a48
- Use a separate HDFS sub-dir for every assignments-batch, in order to avoid any disruptions from multiple threads moving parquet-files from the same sub-dir. Multiple batches from the same worker may be processed at the same time. These sub-dirs are deleted afterwards.
...
- Treat the "contains no visible files" situation as an error, in which case the assignments-data is presumed not to have been inserted into the database tables.
- Code polishing/cleanup.
2023-05-27 02:36:05 +03:00
Lampros Smyrnaios
cd1fb0af88
- Process the WorkerReports in background Jobs and post the reportResults to the Workers.
...
- Save the workerReports to json files, until they are processed successfully.
- Show some custom metrics in prometheus.
2023-05-24 13:52:28 +03:00
Lampros Smyrnaios
5f75b48e95
- Increase the "read-timeout" when searching for the host's machine public-IP.
...
- Update dependencies.
- Code polishing.
2023-05-22 21:33:02 +03:00
Lampros Smyrnaios
a8eea1ccf4
Fix missing changes.
2023-05-15 13:13:24 +03:00
Lampros Smyrnaios
f51a34138f
- Use separate HDFS subdirectories for each worker in order to avoid seeing exceptions about "empty hdfs directory" when "loading" data to the database, because one worker has loaded data generated by multiple workers (since we use only 1 load operation for multiple parquet files).
...
- Store each worker's info in a hash-table, in order to efficiently know if we need to create new hdfs subdirectories. Also, this will help to issue "shutdown" requests to the workers in the future, as well as to know which worker has shut down.
2023-05-15 13:12:20 +03:00
Lampros Smyrnaios
9412391903
- In test-environment mode, check for already existing file-hashes only in the "payload_aggregated" table, instead of the whole "payload" view. This way the investigation for false-positive docUrls is easier, as we avoid checking against the millions of "legacy" payloads.
...
- Improve performance in production, by not creating the string objects for "trace"-logs.
2023-05-15 12:44:16 +03:00
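The "trace"-log optimization from commit 9412391903 usually comes down to checking the log level before building the message. The sketch below uses java.util.logging and its FINEST level only to stay self-contained; the logging framework used by the project is not stated in this log, so the API shown is an assumption, and the counter exists purely to make the effect observable.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LazyTraceLogging {

    static final Logger logger = Logger.getLogger(LazyTraceLogging.class.getName());

    // Counts how many log messages were actually built (for demonstration only).
    static final AtomicInteger messagesBuilt = new AtomicInteger();

    static String describe(String url) {
        messagesBuilt.incrementAndGet();
        return "Checking the file-hash for url: " + url;
    }

    static void process(String url) {
        // The message string is only created when the finest level is enabled.
        if (logger.isLoggable(Level.FINEST)) {
            logger.finest(describe(url));
        }
    }
}
```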
Lampros Smyrnaios
992d4ffd5e
- Add the time-zone in the logs.
...
- Change some log-levels to "trace", although most of them are still disabled.
2023-05-11 03:10:53 +03:00
Lampros Smyrnaios
b6e8cd1889
New feature: BulkImport full-text files from compatible datasources.
2023-05-11 03:07:55 +03:00
Lampros Smyrnaios
49662319a1
- Simplify the creation of local directories.
...
- Improve exception messages.
2023-04-28 14:58:33 +03:00
Lampros Smyrnaios
55ea5118ac
- Update the "testDatabaseName" property.
...
- Code polishing.
2023-04-26 19:33:28 +03:00
Lampros Smyrnaios
4dc34429f8
- Increase the waiting-time before checking the docker containers' status, in order to catch configuration-crashes.
...
- Code polishing.
2023-04-10 22:28:53 +03:00
Lampros Smyrnaios
c39fef2654
Upgrade payload-table to payload-view which consists of three separate payload tables: "payload_legacy", "payload_aggregated" and "payload_bulk_import".
2023-04-10 15:55:50 +03:00
Lampros Smyrnaios
37363100fd
Prioritize most recent publications.
2023-04-10 15:00:23 +03:00
Lampros Smyrnaios
4280f89296
- Set the default value of the "isTestEnvironment" property to "true", in order to avoid undesired outcomes in the production db.
...
- Code polishing.
2023-03-21 17:04:28 +02:00
Lampros Smyrnaios
003c0bf179
- Add support for excluding specific datasources from being crawled. These datasources may be aggregated through bulk-imports, by other pieces of software. Such a datasource is "arXiv.org".
...
- Fix an issue, where the "datasource-type" was retrieved instead of the "datasource-name".
- Polish the "findAssignmentsQuery".
2023-03-21 07:19:35 +02:00
Lampros Smyrnaios
38643c76a3
- Code polishing.
...
- Update Gradle.
2023-03-07 16:55:41 +02:00
Lampros Smyrnaios
c8485d472e
Code polishing.
2023-02-24 13:53:09 +02:00
Lampros Smyrnaios
8893662a81
Refactor the UrlsController: a) offload the business-logic to the dedicated "UrlsService" and b) move the "checkParquetFilesSuccess()"-method to "ParquetFileUtils".
2023-02-21 15:36:35 +02:00