UrlsController

Commit Graph

Author	SHA1	Message	Date
Lampros Smyrnaios	6891c467d4	- Avoid displaying a warning for the "test" HDFS directory, when the Controller is running in PROD mode. - Add a missing change for the optimization of reading files. - Update dependencies.	2023-09-13 15:29:30 +03:00
Lampros Smyrnaios	3dd349dd00	Improve the "findAssignmentsQuery": - Fix an issue, where assignments, having an above-zero attempt_count, were finding their way to the results, just because they were prioritized based on their boost_level or pub_year. Apart from retrying the old failed assignments sooner, the non-yet-processed boosted-publications were pushed out to the workers much slower. - Simplify the query, by removing the internal "ordering" and "limit", which had performance benefits when we did not need additional ordering for "level" and "pub_year". Back then, we wanted to apply the final orderings to as few rows as possible.	2023-09-13 14:38:15 +03:00
Lampros Smyrnaios	ee2df19ce1	- Allow "pretty-printing" the json response of the "getBulkImportReport" endpoint. - Add useful log-messages for various bulk-import stages and improve the current ones. - Optimize reading and writing the reports.	2023-09-11 17:24:39 +03:00
Lampros Smyrnaios	6944678391	Improve error-handling when renaming workerReport-files.	2023-09-08 17:41:10 +03:00
Lampros Smyrnaios	1c8f3765ca	- Fix not acquiring the full workerReport when retrying it, with the scheduler. - Improve error-handling in the "inspectWorkerReportsAndTakeAction" process. - Code polishing.	2023-09-08 14:59:48 +03:00
Lampros Smyrnaios	e72a4d3d10	- Improve handling of already renamed workerReport-files, which relate to failed workerReports in the 1st try. - Add check for workerReport-files, which may have been deleted before their time, due to an error.	2023-09-08 14:11:41 +03:00
Lampros Smyrnaios	bd9245cc3d	Avoid deleting the assignment-records in case of a "parquet-data creation, upload or insertion" problem, in order to avoid double-processing of the urls, until the report has been retried by the scheduler.	2023-09-08 13:44:24 +03:00
Lampros Smyrnaios	718f5cfefb	- Improve prioritization of the most recent publications. - Avoid processing publications which will be published in the next 5 years, counting from each "current" year, since they are not providing full-texts yet. Still allow the invalid publication-years like "2566", "9999", etc.	2023-09-07 14:05:58 +03:00
Lampros Smyrnaios	4014d1eabb	Code polishing.	2023-09-05 15:20:03 +03:00
Lampros Smyrnaios	199105f7f1	Fix not writing some bulk-import error-messages to the logs. Instead, they were only written to the json-reports.	2023-09-04 16:33:27 +03:00
Lampros Smyrnaios	acef891167	Improve prioritization of "publication_boost" records, by adding a second ordering in the end.	2023-09-04 15:34:37 +03:00
Lampros Smyrnaios	98516498eb	- Increase app's version. - Code polishing.	2023-09-04 12:46:55 +03:00
Lampros Smyrnaios	febe2b212c	Upgrade management of failed workerReports: - Upon completing processing a workerReport, the name of the json-file will be appended with "successful" or "failed". - Avoid deleting immediately the failed workerReports. - Add a scheduling task to process leftover failed workerReports from previous executions of the service, only once, 12 hours after startup, in order for the workers to have participated and filled the "workersInfoMap". - Add a scheduling task to process leftover failed workerReports from the current execution, regularly. - Fix not iterating through the workers' subDirs when checking the last-access-time of workerReports. - Fix not deleting the assignment records from the DB, when a failed leftover workerReport gets deleted. - Code refactoring.	2023-09-01 15:10:58 +03:00
Lampros Smyrnaios	5c459a3a16	Optimize handling of HTTP-4XX errors in "UrlsServiceImpl.postReportResultToWorker()".	2023-08-31 13:20:12 +03:00
Lampros Smyrnaios	601776e81c	- Fix not handling the case where the info about the worker in the WorkerReport does not exist inside the "workersInfoMap", as that worker is not participating in the Service (this case may appear in future code). - Code polishing.	2023-08-30 17:07:51 +03:00
Lampros Smyrnaios	c32dfa882e	Fix not deleting the assignment-records, for every workerReport, after processing it.	2023-08-30 16:22:58 +03:00
Lampros Smyrnaios	aa3f32f3da	- Make sure the number of threads given by the user is above zero. - Adjust the number and size of log files. - Update Spring Boot. - Code polishing.	2023-08-30 14:02:54 +03:00
Lampros Smyrnaios	44459c8681	- Rename "ImpalaConnector.java" to "DatabaseConnector.java". - Update dependencies. - Code polishing.	2023-08-23 16:55:23 +03:00
Lampros Smyrnaios	b3e0d214fd	Update the BulkImport API: - Refactor the "bulkImportReportID". - Add the "bulk:" prefix in the provenance value, in the DB. - Fix not using correctly the "Lists.partition()" method. - Make sure the "bulkImportDir" is removed from the "bulkImportDirsUnderProcessing" Set, in case of an early-error. - Fix the "numFailedSegments"-calculation. - Improve some messages. - Code polishing.	2023-08-21 18:19:53 +03:00
Lampros Smyrnaios	a524375656	- Create the HDFS-subDirs before generating "callableTasks" for creating and uploading the parquetFiles. - Delete gradle .zip file after installation.	2023-08-04 15:30:41 +03:00
Lampros Smyrnaios	860c73ea91	- Improve the "shutdownController.sh" script. - Set names for the Prometheus and Grafana containers. - Code polishing.	2023-07-27 18:27:48 +03:00
Lampros Smyrnaios	0699acc999	Make sure we use the latest version of the "zstd-jni" library, where the core code for the "ZStandard" compression algorithm is. Apache's "commons-compress" package, which wraps it in file-management code, updates the "zstd-jni" less often.	2023-07-27 17:42:57 +03:00
Lampros Smyrnaios	dfb9c8204e	Add useful messages for missing parameters in Stats API.	2023-07-25 15:36:54 +03:00
Lampros Smyrnaios	cde6063d63	- Update dependencies. - Code polishing.	2023-07-25 12:12:56 +03:00
Lampros Smyrnaios	b73be6d8da	Fix the Stats API returning simple numbers as "application/json". Now they are returned as "text/plain".	2023-07-25 12:03:27 +03:00
Lampros Smyrnaios	66a5b3c7da	Update Bulk-Import API: - Increase the "numOfThreadsPerBulkImportProcedure" to 6. - Fix Bulk import not working from a second-level subdirectory; the report-subDirectory was not created. - Fix not returning the bulk-import-report as "application/json". - Add useful messages for missing parameters. - Change the HTTP-method for the "bulkImportFullTexts" endpoint to "POST". - Show a structured json-response for the "bulkImportFullTexts" endpoint. - Fix uncommon date-format. - Remove single quotes from json-report, since they are returned as bytes, not characters. - Optimize the generation of the json-bulkImport-report.	2023-07-25 11:59:47 +03:00
Lampros Smyrnaios	8d8a387ff2	Reduce the waiting time for new background tasks to be scheduled for processing.	2023-07-24 20:33:56 +03:00
Lampros Smyrnaios	d821ae398f	Improve performance by applying the merging-procedure for the parquet files of the database tables less often, while keeping the benefits of having a relatively small maximum number of parquet files in search operations.	2023-07-24 20:28:41 +03:00
Lampros Smyrnaios	7dc72e242e	- Fix missing changes. - Change the HTTP-method of the renamed "test/uploadParquetFile" endpoint to "POST".	2023-07-24 19:55:37 +03:00
Lampros Smyrnaios	9cbac77c2a	- Add check for "shouldShutdownService" before allowing to continue with a bulk-import request. - Add check for remaining background tasks (including bulkImports), before checking if the workers have shut down and then shut down the Service.	2023-07-21 16:19:00 +03:00
Lampros Smyrnaios	cec2531737	- Increase the "numOfBackgroundThreads" to 8. - Make the "numOfBackgroundThreads" and "numOfThreadsPerBulkImportProcedure" configurable from the "application.yml" file. - Code polishing.	2023-07-21 11:45:50 +03:00
Lampros Smyrnaios	fd1cf56863	- Avoid passing some duplicate publications to the Workers, by post-processing the assignments retrieved from Impala. The assignments may contain duplicate id-url pairs, which have different datasources, since one publication may be connected to multiple datasources. - Code polishing.	2023-07-19 18:31:24 +03:00
Lampros Smyrnaios	b94c35c66e	- Fix double active "@Scheduled" annotation for the "ScheduledTasks.updatePrometheusMetrics()" method. - Code polishing.	2023-07-13 18:32:45 +03:00
Lampros Smyrnaios	8dfb58ee63	Avoid assigning the same publications multiple times to the Workers, after the recent "parallelization enhancement". After that enhancement, each worker could request multiple assignment-batches, before its previous batches were processed by the Controller. This means that for each batch that was processed, the Controller was deleting from the "assignment" table, all the assignments (-batches) delivered to the Worker that brought that batch, even though the "attempt" and "payload" records for the rest of the batches were not inserted in the DB yet. So in a new assignments-batch request, the same publications that were already under processing, were delivered to the same or other Workers. Now, for each finished batch, only the assignments of that batch are deleted from the "assignment" table.	2023-07-11 17:27:23 +03:00
Lampros Smyrnaios	2d5643cb0a	Fix missing spaces in some secondary "DROP"-queries.	2023-07-07 20:51:14 +03:00
Lampros Smyrnaios	d5c139c410	Handle the case where the "stats"-queries are executed while some tables of the DB are in a "merge" state. In this case, the queries fail and the Controller retries up to 10 times.	2023-07-06 18:29:13 +03:00
Lampros Smyrnaios	e8644cb64f	- Optimize the "insertAssignmentsQuery". - Add documentation about the Prometheus Metrics, in README. - Update Dependencies. - Code polishing.	2023-07-05 17:10:30 +03:00
Lampros Smyrnaios	a89abe3f2f	Prioritize the publications, which are specified inside the "publication_boost" table, according to their "boost-level".	2023-06-29 12:32:06 +03:00
Lampros Smyrnaios	4c3e2e6b6e	- Fix not using the actual "currentAssignmentsBatch" of the workerReport itself, when creating the parquetFileNames and when reporting to the user the initialization of the "addition of the workerReport". - Code polishing.	2023-06-27 16:08:01 +03:00
Lampros Smyrnaios	0f4b63c4a9	Expose the following statistics as prometheus-metrics and create/update a stats-endpoint for each one: - "numOfPayloadsAggregatedByServiceThroughCrawling" - "numOfPayloadsAggregatedByServiceThroughBulkImport" - "numOfPayloadsAggregatedByService" - "numOfLegacyPayloads" - "numOfRecordsInspectedByServiceThroughCrawling" (renamed from "numOfInspectedRecords")	2023-06-23 15:22:26 +03:00
Lampros Smyrnaios	b6b1cb08b9	Add instructions on how to run the Prometheus and Grafana docker-containers alongside the UrlsController, by using the same script.	2023-06-23 14:52:07 +03:00
Lampros Smyrnaios	d52601e819	- Add handling for the "EmptyResultDataAccessException" in "UrlsServiceImpl.getAssignments()", which is thrown when no assignments are returned from the query. - Code polishing.	2023-06-22 12:39:11 +03:00
Lampros Smyrnaios	b9712bed85	- Expose the "numOfAllPayloads" and "numOfInspectedRecords" DB-stats to Prometheus, by using a scheduling task to request the numbers from the DB, every 6 hours. - Update the "StatsServiceImpl.getNumberOfPayloadsAggregatedByService()" to use the new table "payload_aggregated", instead of casting and checking the date of the records. - Code polishing.	2023-06-19 14:42:00 +03:00
Lampros Smyrnaios	798fa09d68	- Identify and handle a possible Worker-crash, in "UrlsServiceImpl.postReportResultToWorker()". - Add/Improve some log messages. - Update and cleanup dependencies. - Code polishing.	2023-06-15 23:19:36 +03:00
Lampros Smyrnaios	88a74b2c41	Add support for all private addresses, defined in "RFC 1918" standard. This fixes the issue of discarding some "shutdownService" requests due to coming from different local private addresses, when the Controller was run inside a docker container.	2023-06-15 13:26:27 +03:00
Lampros Smyrnaios	c37f157f51	Split the full-texts-batch's main handling-code to two separate methods, which can be used in parallel by two threads, in the future.	2023-06-14 17:16:38 +03:00
Lampros Smyrnaios	e2776c50d0	- Optimize the "WorkerReportResult" and the "ShutdownWorker" requests. - Improve documentation.	2023-06-10 02:31:57 +03:00
Lampros Smyrnaios	0b1ab5b991	Attempt to recover from serious failures, by individualizing the error-handling for each of the "table-merging" queries.	2023-06-10 02:28:02 +03:00
Lampros Smyrnaios	6669dc61bf	- Increase the initialDelay for the "checkIfServiceIsReadyForShutdown" scheduled-task, in production, to 10 minutes. - Code polishing.	2023-06-06 16:49:53 +03:00
Lampros Smyrnaios	5d99a4be5d	- Add the Shutdown Service API documentation. - Improve the BulkImport API documentation. - Fix markdown in README. - Update the app's version.	2023-06-06 16:18:38 +03:00


213 Commits
