UrlsController

Commit Graph

Author	SHA1	Message	Date
Lampros Smyrnaios	0c79fdea35	Update the "findAssignmentsQuery" to check the "attempt.error_class" field for the current pub_url, not the pub_id.	2023-10-06 14:59:26 +03:00
Lampros Smyrnaios	ebf8896005	- Fix getter and setter methods for the "isAuthoritative" field. - Update Gradle.	2023-10-05 16:31:52 +03:00
Lampros Smyrnaios	b2ce6393c1	- Add check for remaining "bulkImportDirsUnderProcessing", before shutting down the Service. - Code polishing.	2023-10-05 13:43:47 +03:00
Lampros Smyrnaios	96c11ba4b8	- Add a missing change. - Code optimization and polishing. - Update dependencies.	2023-10-04 16:17:12 +03:00
Lampros Smyrnaios	7019f7c3c7	Improve aggregation speed, by generating additional "attempt" and "payload" records for the publications which are in the back-log and whose url matches one of the urls of the current payloads.	2023-10-04 15:43:31 +03:00
Lampros Smyrnaios	b702cf4484	Upgrade the "findAssignmentsQuery": - Retrieve the assignments by checking only the publication-urls against the "attempt", "assignment" and "payload" tables, not the IDs. This change allows us to: a) avoid re-attempting urls which have already been attempted multiple times (by different id-url pairs), b) avoid aggregating urls which are already inside the "payload" or "assignment" tables, even when they are related to other IDs. In the end, we only care about the urls when choosing which records should be aggregated. - Improve performance by using the "anti join" operator, where it fits, in order to allow the engine to use the faster "hash" operations.	2023-10-04 13:43:15 +03:00
Lampros Smyrnaios	c9626de120	Handle the case when the "upload-file-to-S3" operation fails with a "ConnectException". In this case, all remaining upload operations for the files of that particular batch or segment, are canceled.	2023-10-04 13:01:13 +03:00
Lampros Smyrnaios	865926fbc3	- Handle the case when some results have been found from the "getAssignmentsQuery", but no data could be extracted from them. - Code polishing.	2023-10-02 15:46:55 +03:00
Lampros Smyrnaios	ede7ca5a89	- Add bulk-import support for non-Authoritative data-sources. - Update Spring Boot. - Code polishing.	2023-09-26 18:02:48 +03:00
Lampros Smyrnaios	90a864ea61	Add more info in bulk-import logs.	2023-09-20 17:50:10 +03:00
Lampros Smyrnaios	0f5d4dac78	Check and show warning/error message for failed payloads.	2023-09-20 17:38:22 +03:00
Lampros Smyrnaios	903c3e1ffc	Add thread-safety when reading the bulkImportReport-files.	2023-09-15 11:54:32 +03:00
Lampros Smyrnaios	360731ba72	- Improve handling of the "NO_CONTENT" case, in "getAssignments"-endpoint. - Code optimization and polishing.	2023-09-14 13:53:01 +03:00
Lampros Smyrnaios	b4f91f188e	Fix the "retries-num" appearing in log-messages.	2023-09-14 12:08:33 +03:00
Lampros Smyrnaios	02bae38885	- Improve response-time to "getAssignments"-requests, by avoiding merging the parquet files of the "assignment" table, right after acquiring the assignments from the DB. They are already getting merged, when each assignments-batch is deleted after a workerReport has been processed. - Optimize code-positioning for unlocking the DB when done executing queries.	2023-09-13 17:03:11 +03:00
Lampros Smyrnaios	8fdb8e9137	Add renaming of the workerReport-file, to indicate failure, when the processing failed because no workerInfo was found for the worker-id existing in the report. This way, it can be retried by the scheduler later.	2023-09-13 16:35:41 +03:00
Lampros Smyrnaios	c98e8df323	Move the "getRenamedWorkerReport"-code in its own method.	2023-09-13 16:27:18 +03:00
Lampros Smyrnaios	6891c467d4	- Avoid displaying a warning for the "test" HDFS directory, when the Controller is running in PROD mode. - Add a missing change for the optimization of reading files. - Update dependencies.	2023-09-13 15:29:30 +03:00
Lampros Smyrnaios	3dd349dd00	Improve the "findAssignmentsQuery": - Fix an issue, where assignments, having an above-zero attempt_count, were finding their way to the results, just because they were prioritized based on their boost_level or pub_year. Apart from retrying the old failed assignments sooner, the not-yet-processed boosted-publications were pushed out to the workers much more slowly. - Simplify the query, by removing the internal "ordering" and "limit", which had performance benefits when we did not need additional ordering for "level" and "pub_year". Back then, we wanted to apply the final orderings to as few rows as possible.	2023-09-13 14:38:15 +03:00
Lampros Smyrnaios	ee2df19ce1	- Allow "pretty-printing" the json response of the "getBulkImportReport" endpoint. - Add useful log-messages for various bulk-import stages and improve the current ones. - Optimize reading and writing the reports.	2023-09-11 17:24:39 +03:00
Lampros Smyrnaios	6944678391	Improve error-handling when renaming workerReport-files.	2023-09-08 17:41:10 +03:00
Lampros Smyrnaios	1c8f3765ca	- Fix not acquiring the full workerReport when retrying it, with the scheduler. - Improve error-handling in the "inspectWorkerReportsAndTakeAction" process. - Code polishing.	2023-09-08 14:59:48 +03:00
Lampros Smyrnaios	e72a4d3d10	- Improve handling of already renamed workerReport-files, which relate to failed workerReports in the 1st try. - Add check for workerReport-files, which may have been deleted before their time, due to an error.	2023-09-08 14:11:41 +03:00
Lampros Smyrnaios	bd9245cc3d	Avoid deleting the assignment-records in case of a "parquet-data creation, upload or insertion" problem, in order to avoid double-processing of the urls, until the report has been retried by the scheduler.	2023-09-08 13:44:24 +03:00
Lampros Smyrnaios	718f5cfefb	- Improve prioritization of the most recent publications. - Avoid processing publications which will be published in the next 5 years, counting from each "current" year, since they are not providing full-texts yet. Still allow the invalid publication-years like "2566", "9999", etc.	2023-09-07 14:05:58 +03:00
Lampros Smyrnaios	4014d1eabb	Code polishing.	2023-09-05 15:20:03 +03:00
Lampros Smyrnaios	199105f7f1	Fix not writing some bulk-import error-messages to the logs. Instead, they were only written to the json-reports.	2023-09-04 16:33:27 +03:00
Lampros Smyrnaios	acef891167	Improve prioritization of "publication_boost" records, by adding a second ordering in the end.	2023-09-04 15:34:37 +03:00
Lampros Smyrnaios	98516498eb	- Increase app's version. - Code polishing.	2023-09-04 12:46:55 +03:00
Lampros Smyrnaios	febe2b212c	Upgrade management of failed workerReports: - Upon completing processing a workerReport, the name of the json-file will be appended with "successful" or "failed". - Avoid deleting immediately the failed workerReports. - Add a scheduling task to process leftover failed workerReports from previous executions of the service, only once, 12 hours after startup, in order for the workers to have participated and filled the "workersInfoMap". - Add a scheduling task to process leftover failed workerReports from the current execution, regularly. - Fix not iterating through the workers' subDirs when checking the last-access-time of workerReports. - Fix not deleting the assignment records from the DB, when a failed leftover workerReport gets deleted. - Code refactoring.	2023-09-01 15:10:58 +03:00
Lampros Smyrnaios	5c459a3a16	Optimize handling of HTTP-4XX errors in "UrlsServiceImpl.postReportResultToWorker()".	2023-08-31 13:20:12 +03:00
Lampros Smyrnaios	601776e81c	- Fix not handling the case where the info about the worker in the WorkerReport does not exist inside the "workersInfoMap", as that worker is not participating in the Service (this case may appear in future code). - Code polishing.	2023-08-30 17:07:51 +03:00
Lampros Smyrnaios	c32dfa882e	Fix not deleting the assignment-records, for every workerReport, after processing it.	2023-08-30 16:22:58 +03:00
Lampros Smyrnaios	aa3f32f3da	- Make sure the number of threads given by the user is above zero. - Adjust the number and size of log files. - Update Spring Boot. - Code polishing.	2023-08-30 14:02:54 +03:00
Lampros Smyrnaios	44459c8681	- Rename "ImpalaConnector.java" to "DatabaseConnector.java". - Update dependencies. - Code polishing.	2023-08-23 16:55:23 +03:00
Lampros Smyrnaios	b3e0d214fd	Update the BulkImport API: - Refactor the "bulkImportReportID". - Add the "bulk:" prefix in the provenance value, in the DB. - Fix not using correctly the "Lists.partition()" method. - Make sure the "bulkImportDir" is removed from the "bulkImportDirsUnderProcessing" Set, in case of an early-error. - Fix the "numFailedSegments"-calculation. - Improve some messages. - Code polishing.	2023-08-21 18:19:53 +03:00
Lampros Smyrnaios	a524375656	- Create the HDFS-subDirs before generating "callableTasks" for creating and uploading the parquetFiles. - Delete gradle .zip file after installation.	2023-08-04 15:30:41 +03:00
Lampros Smyrnaios	860c73ea91	- Improve the "shutdownController.sh" script. - Set names for the Prometheus and Grafana containers. - Code polishing.	2023-07-27 18:27:48 +03:00
Lampros Smyrnaios	dfb9c8204e	Add useful messages for missing parameters in Stats API.	2023-07-25 15:36:54 +03:00
Lampros Smyrnaios	cde6063d63	- Update dependencies. - Code polishing.	2023-07-25 12:12:56 +03:00
Lampros Smyrnaios	b73be6d8da	Fix the Stats API returning simple numbers as "application/json". Now they are returned as "text/plain".	2023-07-25 12:03:27 +03:00
Lampros Smyrnaios	66a5b3c7da	Update Bulk-Import API: - Increase the "numOfThreadsPerBulkImportProcedure" to 6. - Fix Bulk import not working from a second-level subdirectory; the report-subDirectory was not created. - Fix not returning the bulk-import-report as "application/json". - Add useful messages for missing parameters. - Change the HTTP-method for the "bulkImportFullTexts" endpoint to "POST". - Show a structured json-response for the "bulkImportFullTexts" endpoint. - Fix uncommon date-format. - Remove single quotes from json-report, since they are returned as bytes, not characters. - Optimize the generation of the json-bulkImport-report.	2023-07-25 11:59:47 +03:00
Lampros Smyrnaios	8d8a387ff2	Reduce the waiting time for new background tasks to be scheduled for processing.	2023-07-24 20:33:56 +03:00
Lampros Smyrnaios	d821ae398f	Improve performance by applying the merging-procedure for the parquet files of the database tables less often, while keeping the benefits of having a relatively small maximum number of parquet files in search operations.	2023-07-24 20:28:41 +03:00
Lampros Smyrnaios	7dc72e242e	- Fix missing changes. - Change the HTTP-method of the renamed "test/uploadParquetFile" endpoint to "POST".	2023-07-24 19:55:37 +03:00
Lampros Smyrnaios	9cbac77c2a	- Add check for "shouldShutdownService" before allowing to continue with a bulk-import request. - Add check for remaining background tasks (including bulkImports), before checking if the workers have shut down and then shut down the Service.	2023-07-21 16:19:00 +03:00
Lampros Smyrnaios	cec2531737	- Increase the "numOfBackgroundThreads" to 8. - Make the "numOfBackgroundThreads" and "numOfThreadsPerBulkImportProcedure" configurable from the "application.yml" file. - Code polishing.	2023-07-21 11:45:50 +03:00
Lampros Smyrnaios	fd1cf56863	- Avoid passing some duplicate publications to the Workers, by post-processing the assignments retrieved from Impala. The assignments may contain duplicate id-url pairs, which have different datasources, since one publication may be connected to multiple datasources. - Code polishing.	2023-07-19 18:31:24 +03:00
Lampros Smyrnaios	b94c35c66e	- Fix double active "@Scheduled" annotation for the "ScheduledTasks.updatePrometheusMetrics()" method. - Code polishing.	2023-07-13 18:32:45 +03:00
Lampros Smyrnaios	8dfb58ee63	Avoid assigning the same publications multiple times to the Workers, after the recent "parallelization enhancement". After that enhancement, each worker could request multiple assignment-batches, before its previous batches were processed by the Controller. This means that for each batch that was processed, the Controller was deleting from the "assignment" table, all the assignments (-batches) delivered to the Worker that brought that batch, even though the "attempt" and "payload" records for the rest of the batches were not inserted in the DB yet. So in a new assignments-batch request, the same publications that were already under processing, were delivered to the same or other Workers. Now, for each finished batch, only the assignments of that batch are deleted from the "assignment" table.	2023-07-11 17:27:23 +03:00

193 Commits