Commit Graph

44 Commits

Author SHA1 Message Date
Lampros Smyrnaios cde6063d63 - Update dependencies.
- Code polishing.
2023-07-25 12:12:56 +03:00
Lampros Smyrnaios b73be6d8da Fix the Stats API returning simple numbers as "application/json". Now they are returned as "text/plain". 2023-07-25 12:03:27 +03:00
Lampros Smyrnaios 66a5b3c7da Update Bulk-Import API:
- Increase the "numOfThreadsPerBulkImportProcedure" to 6.
- Fix Bulk import not working from a second-level subdirectory; the report-subDirectory was not created.
- Fix not returning the bulk-import-report as "application/json".
- Add useful messages for missing parameters.
- Change the HTTP-method for the "bulkImportFullTexts" endpoint to "POST".
- Show a structured json-response for the "bulkImportFullTexts" endpoint.
- Fix uncommon date-format.
- Remove single quotes from json-report, since they are returned as bytes, not characters.
- Optimize the generation of the json-bulkImport-report.
2023-07-25 11:59:47 +03:00
Lampros Smyrnaios d821ae398f Improve performance by applying the merging-procedure for the parquet files of the database tables less often, while keeping the benefits of having a relatively small maximum number of parquet files in search operations. 2023-07-24 20:28:41 +03:00
Lampros Smyrnaios cec2531737 - Increase the "numOfBackgroundThreads" to 8.
- Make the "numOfBackgroundThreads" and "numOfThreadsPerBulkImportProcedure" configurable from the "application.yml" file.
- Code polishing.
2023-07-21 11:45:50 +03:00
Lampros Smyrnaios fd1cf56863 - Avoid passing some duplicate publications to the Workers, by post-processing the assignments retrieved from Impala. The assignments may contain duplicate id-url pairs, which have different datasources, since one publication may be connected to multiple datasources.
- Code polishing.
2023-07-19 18:31:24 +03:00
Lampros Smyrnaios 8dfb58ee63 Avoid assigning the same publications multiple times to the Workers, after the recent "parallelization enchantment".
After that enchantment, each worker could request multiple assignment-batches, before its previous batches were processed by the Controller. This means that for each batch that was processed, the Controller was deleting from the "assignment" table, all the assignments (-batches) delivered to the Worker that brought that batch, even though the "attempt" and "payload" records for the rest of the batches were not inserted in the DB yet. So in a new assignments-batch request, the same publications that were already under processing, were delivered to the same or other Workers.
Now, for each finished batch, only the assignments of that batch are deleted from the "assignment" table.
2023-07-11 17:27:23 +03:00
Lampros Smyrnaios d5c139c410 Handle the case where the "stats"-queries are executed while some table of the DB are in a "merge" state. In this case, the queries fail and the Controller retries up to 10 times. 2023-07-06 18:29:13 +03:00
Lampros Smyrnaios e8644cb64f - Optimize the "insertAssignmentsQuery".
- Add documentation about the Prometheus Metrics, in README.
- Update Dependencies.
- Code polishing.
2023-07-05 17:10:30 +03:00
Lampros Smyrnaios a89abe3f2f Prioritize the publications, which are specified inside the "publication_boost" table, according to their "boost-level". 2023-06-29 12:32:06 +03:00
Lampros Smyrnaios 4c3e2e6b6e - Fix not using actual the "currentAssignmentsBatch" of the workerReport itself, when creating the parquetFileNames and when reporting to the user the initialization of the "addition of the workerReport".
- Code polishing.
2023-06-27 16:08:01 +03:00
Lampros Smyrnaios 0f4b63c4a9 Expose the following statistics as prometheus-metrics and create/update a stats-endpoint for each one:
- "numOfPayloadsAggregatedByServiceThroughCrawling"
 - "numOfPayloadsAggregatedByServiceThroughBulkImport"
 - "numOfPayloadsAggregatedByService"
 - "numOfLegacyPayloads"
 - "numOfRecordsInspectedByServiceThroughCrawling" (renamed from "numOfInspectedRecords")
2023-06-23 15:22:26 +03:00
Lampros Smyrnaios d52601e819 - Add handling for the "EmptyResultDataAccessException" in "UrlsServiceImpl.getAssignments()", which is thrown when no assignments are returned from the query.
- Code polishing.
2023-06-22 12:39:11 +03:00
Lampros Smyrnaios b9712bed85 - Expose the "numOfAllPayloads" and "numOfInspectedRecords" DB-stats to Prometheus, by using a scheduling task to request the numbers from the DB, every 6 hours.
- Update the "StatsServiceImpl.getNumberOfPayloadsAggregatedByService()" to use the new table "payload_aggregated", instead of casting and checking the date of the records.
- Code polishing.
2023-06-19 14:42:00 +03:00
Lampros Smyrnaios 798fa09d68 - Identify and handle a possible Worker-crash, in "UrlsServiceImpl.postReportResultToWorker()".
- Add/Improve some log messages.
- Update and cleanup dependencies.
- Code polishing.
2023-06-15 23:19:36 +03:00
Lampros Smyrnaios 88a74b2c41 Add support for all private addresses, defined in "RFC 1918" standard. This fixes the issue of discarding some "shutdownService" requests due to coming from different local private addresses, when the Controller was run inside a docker container. 2023-06-15 13:26:27 +03:00
Lampros Smyrnaios e2776c50d0 - Optimize the "WorkerReportResult" and the "ShutdownWorker" requests.
- Improve documentation.
2023-06-10 02:31:57 +03:00
Lampros Smyrnaios 6669dc61bf - Increase the initialDelay for the "checkIfServiceIsReadyForShutdown" scheduled-task, in production, to 10 minutes.
- Code polishing.
2023-06-06 16:49:53 +03:00
Lampros Smyrnaios a38d6ace79 Code polishing. 2023-05-29 12:21:48 +03:00
Lampros Smyrnaios 74ff31fc64 - Show the workerIPs in the logs.
- Rename the "FullTexts"-files to "BulkImport".
2023-05-29 12:12:08 +03:00
Lampros Smyrnaios 3988eb3a48 - Use a separate HDFS sub-dir for every assignments-batch, in order to avoid any disrruptancies from multiple threads moving parquet-files from the same sub-dir. Multiple batches from the same worker may be processed at the same time. These sub-dirs are deleted afterwards.
- Treat the "contains no visible files" situation as an error. In which case the assignments-data is presumed to not have been inserted to the database tables.
- Code polishing/cleanup.
2023-05-27 02:36:05 +03:00
Lampros Smyrnaios cd1fb0af88 - Process the WorkerReports in background Jobs and post the reportResults to the Workers.
- Save the workerReports to json files, until they are processed successfully.
- Show some custom metrics in prometheus.
2023-05-24 13:52:28 +03:00
Lampros Smyrnaios 0ea3e2de24 Add the "shutdownService" and "cancelShutdownService" endpoints. The Controller sends the related requests to the Workers and shutdowns gracefully, after all workers have shutdown. 2023-05-24 13:42:29 +03:00
Lampros Smyrnaios 5f75b48e95 - Increase the "read-timeout" when searching for the host's machine public-IP.
- Update dependencies.
- Code polishing.
2023-05-22 21:33:02 +03:00
Lampros Smyrnaios a8eea1ccf4 Fix missing changes. 2023-05-15 13:13:24 +03:00
Lampros Smyrnaios f51a34138f - Use separate HDFS subdirectories for each worker in order to avoid seeing exceptions about "empty hdfs directory" when "loading" data to the database, because one worker has loaded data generated by multiple workers (since we use only 1 load operation for multiple parquet files).
- Store each worker's info in a hash-table, in order to efficiently know if we need to create new hdfs subdirectories. Also, this will help to issue "shutdown" requests to the workers in the future, as well as to know which worker has shutdown.
2023-05-15 13:12:20 +03:00
Lampros Smyrnaios 9412391903 - In test-environment mode, check for already existing file-hashes only in the "payload_aggregated" table, instead of the whole "payload" view. This way the investigation for false-positive docUrls is easier, as we avoid checking against the millions of "legacy" payloads.
- Improve performance in production, by not creating the string objects for "trace"-logs.
2023-05-15 12:44:16 +03:00
Lampros Smyrnaios 8381df70c6 - Improve performance of uploading parquet-files to HDFS.
- Add some logs.
- Code polishing.
2023-05-11 19:40:48 +03:00
Lampros Smyrnaios 992d4ffd5e - Add the time-zone in the logs.
- Change some log-levels to "trace", although most of them are still disabled.
2023-05-11 03:10:53 +03:00
Lampros Smyrnaios b6e8cd1889 New feature: BulkImport full-text files from compatible datasources. 2023-05-11 03:07:55 +03:00
Lampros Smyrnaios 42b93e9429 - Add the "getNumberOfAllDistinctFullTexts" stats-endpoint.
- Add TODOs for more stats endpoints.
- Code polishing.
2023-05-04 15:48:49 +03:00
Lampros Smyrnaios 49662319a1 - Simplify the creation of local directories.
- Improve exception messages.
2023-04-28 14:58:33 +03:00
Lampros Smyrnaios 55ea5118ac - Update the "testDatabaseName" property.
- Code polishing.
2023-04-26 19:33:28 +03:00
Lampros Smyrnaios 4dc34429f8 - Increase the waiting-time before checking the docker containers' status, in order to catch configuration-crashes.
- Code polishing.
2023-04-10 22:28:53 +03:00
Lampros Smyrnaios c39fef2654 Upgrade payload-table to payload-view which consists of three separate payload tables: "payload_legacy", "payload_aggregated" and "payload_bulk_import". 2023-04-10 15:55:50 +03:00
Lampros Smyrnaios 37363100fd Prioritize most recent publications. 2023-04-10 15:00:23 +03:00
Lampros Smyrnaios 4280f89296 - Set the default value of the "isTestEnvironment" property to "true", in order to avoid undesired outcomes in the production db.
- Code polishing.
2023-03-21 17:04:28 +02:00
Lampros Smyrnaios 003c0bf179 - Add support for excluding specific datasources from being crawled. These datasources may be aggregated through bulk-imports, by other pieces of software. Such a datasource is "arXiv.org".
- Fix an issue, where the "datasource-type" was retrieved instead of the "datasource-name".
- Polish the "findAssignmentsQuery".
2023-03-21 07:19:35 +02:00
Lampros Smyrnaios ff13af7abb Use a StatsService interface. 2023-03-13 12:39:39 +02:00
Lampros Smyrnaios 38643c76a3 - Code polishing.
- Update Gradle.
2023-03-07 16:55:41 +02:00
Lampros Smyrnaios c8485d472e Code polishing. 2023-02-24 13:53:09 +02:00
Lampros Smyrnaios 8893662a81 Refactor the UrlsController: a) offload the business-logic to the dedicated "UrlsService" and b) move the "checkParquetFilesSuccess()"-method to "ParquetFileUtils". 2023-02-21 15:36:35 +02:00
Lampros Smyrnaios a1c16ffc19 - Exclude empty and null urls in the assignments.
- Update the "getFullTextsImproved"-call to "getFullTexts", now that the "improved" version is stable.
- Update Gradle.
- Code polishing.
2023-02-16 14:24:47 +02:00
Lampros Smyrnaios 2253f05bf5 Refactor the "StatsController"-code, by offloading it to a dedicated "StatsService". 2023-02-09 19:25:48 +02:00