- Reorder JOINs and predicates to reduce the computational cost.
- Remove the memory-costly "pu.url" predicates from the "where" clause, as the DB has no empty urls anymore.
- Make sure we remove the assignments of all old, non-successful workerReports, including those which failed to be renamed to indicate success or failure, or which failed to be executed by the background threads (and thus never reached the renaming stage).
- Submit each task for execution immediately, instead of waiting for a scheduling thread to send all of the gathered tasks (up to that point) to the ExecutorService and block until they finish before it can continue.
- Keep the Future of each submitted task in a synchronized list, in order to check the result of each task at a scheduled time (see the task-submission sketch after this list).
- Reduce the cpu-time needed to assure that the Service can shut down, by checking for "actively executed" and "about-to-be-executed" tasks at the same time, instead of having to additionally check the "shutdown" status of each worker to verify that no active task exists (see the shutdown sketch after this list).
- Improve the threads' shutdown procedure.
- Retrieve the assignments by checking only the publication-urls against the "attempt", "assignment" and "payload" tables, not the IDs. This change allows us to: a) avoid re-attempting urls which have already been attempted multiple times (by different id-url pairs), b) avoid aggregating urls which are already inside the "payload" or "assignment" tables, even when they are associated with other IDs.
In the end, we only care about the urls when choosing which records should be aggregated.
- Improve performance by using the "anti join" operator, where it fits, in order to allow the query engine to use the faster "hash" operations (see the query sketch after this list).
- Fix an issue where assignments with an above-zero attempt_count found their way into the results, just because they were prioritized based on their boost_level or pub_year. Apart from retrying old failed assignments sooner than intended, this caused the not-yet-processed boosted publications to be pushed to the workers much more slowly.
- Simplify the query by removing the internal "ordering" and "limit", which only had performance benefits when we did not need the additional ordering by "level" and "pub_year"; back then, we wanted to apply the final orderings to as few rows as possible.
- Avoid processing publications which will be published within the next 5 years, counting from each "current" year, since they do not provide full-texts yet. Still allow the clearly invalid publication-years like "2566", "9999", etc. (see the year-predicate sketch after this list).
- Upon completing the processing of a workerReport, the name of its json-file has "successful" or "failed" appended to it (see the renaming sketch after this list).
- Avoid immediately deleting the failed workerReports.
- Add a scheduled task to process leftover failed workerReports from previous executions of the service; it runs only once, 12 hours after startup, so that the workers have had time to participate and fill the "workersInfoMap" (see the scheduling sketch after this list).
- Add a scheduled task to regularly process leftover failed workerReports from the current execution.
- Fix not iterating through the workers' subDirs when checking the last-access-time of workerReports (see the directory-scan sketch after this list).
- Fix not deleting the assignment records from the DB when a failed leftover workerReport gets deleted.
- Code refactoring.
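
Below are the sketches referenced in the items above. They are illustrative, not the service's actual code; every class, method, table and column name that does not appear in the items themselves is an assumption.

The task-submission sketch: each task goes straight to the ExecutorService, and its Future is kept in a synchronized list so a scheduled check can inspect the results later. The pool size and the Boolean result type are assumed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.*;

public class TaskSubmissionSketch {

    private final ExecutorService executor = Executors.newFixedThreadPool(4);

    // Futures of already-submitted tasks, shared between the submitting
    // thread and the scheduled checking thread.
    private final List<Future<Boolean>> futures =
            Collections.synchronizedList(new ArrayList<>());

    // Submit each task the moment it is gathered, instead of batching.
    public void submit(Callable<Boolean> task) {
        futures.add(executor.submit(task));
    }

    // Runs at a scheduled time: inspect finished tasks and drop their Futures.
    public void checkResults() {
        synchronized (futures) { // manual locking is required when iterating
            Iterator<Future<Boolean>> it = futures.iterator();
            while (it.hasNext()) {
                Future<Boolean> future = it.next();
                if (!future.isDone())
                    continue; // still running; re-check on the next run
                try {
                    Boolean success = future.get(); // does not block, the task is done
                    // ... handle the success or failure of this task ...
                } catch (InterruptedException | ExecutionException e) {
                    // ... log the task that failed with an exception ...
                }
                it.remove();
            }
        }
    }
}
```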
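
The shutdown sketch: one combined look at the pool's running and queued tasks replaces the per-worker "shutdown"-status checks. The helper names are hypothetical; note that getActiveCount() is an approximation, so a real shutdown path would re-check before proceeding.

```java
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ShutdownSketch {

    // True while there are "actively executed" OR "about-to-be-executed" tasks.
    static boolean hasUnfinishedTasks(ThreadPoolExecutor pool) {
        return (pool.getActiveCount() > 0) || !pool.getQueue().isEmpty();
    }

    static void shutdownGracefully(ThreadPoolExecutor pool) throws InterruptedException {
        pool.shutdown(); // stop accepting new tasks
        if (!pool.awaitTermination(1, TimeUnit.MINUTES))
            pool.shutdownNow(); // interrupt whatever is still running
    }
}
```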
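
The query sketch: retrieval keyed on urls alone, assuming an engine which supports the "LEFT ANTI JOIN" operator (e.g. Impala or Spark SQL). The "publication_urls" table, the "original_url" columns and the aliases are placeholders; the real query also applies the ordering and year-predicate of the later items.

```java
public class AssignmentQuerySketch {

    // A url already tried, assigned or aggregated under ANY id is filtered
    // out, regardless of which id-url pair it arrived with this time.
    static String buildQuery(int limit) {
        return "SELECT pu.id, pu.url "
             + "FROM publication_urls pu "
             + "LEFT ANTI JOIN attempt    a ON pu.url = a.original_url "
             + "LEFT ANTI JOIN assignment s ON pu.url = s.original_url "
             + "LEFT ANTI JOIN payload    p ON pu.url = p.original_url "
             + "LIMIT " + limit;
    }
}
```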
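
The year-predicate sketch, which also reflects the attempt_count fix: never-attempted records are ordered first so that retries cannot overtake fresh boosted publications, and years up to 5 ahead of the "current" year are excluded while clearly invalid years still pass. Column names, ordering keys and the limit are assumptions.

```java
public class SelectionTailSketch {

    static String selectionTail(int limit) {
        // Keep past/current years, skip the next 5, let invalid years (e.g. 9999) through.
        return " WHERE (pub_year <= year(now()) OR pub_year > (year(now()) + 5)) "
             // attempt_count first: fresh records beat old failed ones,
             // no matter how boosted or recent the failed ones are
             + " ORDER BY attempt_count ASC, level DESC, pub_year DESC "
             + " LIMIT " + limit;
    }
}
```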
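
The renaming sketch, using java.nio; the underscore-based naming scheme and the markReport helper are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ReportRenameSketch {

    // Once a workerReport is processed, append its outcome to the json-file's name.
    static void markReport(Path jsonFile, boolean wasSuccessful) throws IOException {
        String fileName = jsonFile.getFileName().toString();
        String base = fileName.endsWith(".json")
                ? fileName.substring(0, fileName.length() - ".json".length())
                : fileName;
        String outcome = wasSuccessful ? "successful" : "failed";
        Files.move(jsonFile, jsonFile.resolveSibling(base + "_" + outcome + ".json"),
                StandardCopyOption.REPLACE_EXISTING);
    }
}
```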
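
The scheduling sketch, assuming Spring's @Scheduled (with @EnableScheduling configured elsewhere); the one-shot run is approximated via a practically-infinite fixedDelay, and the hourly period of the recurring task is an assumed value.

```java
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class LeftoverReportsSchedulerSketch {

    private static final long HOUR_MS = 3_600_000L;

    // Runs once, 12 hours after startup, by which time the workers will have
    // participated and filled the "workersInfoMap".
    @Scheduled(initialDelay = 12 * HOUR_MS, fixedDelay = Long.MAX_VALUE)
    public void processLeftoversFromPreviousExecutions() {
        // ... handle failed workerReports left behind by previous executions ...
    }

    // Runs regularly, for the failed workerReports of the current execution.
    @Scheduled(initialDelay = HOUR_MS, fixedDelay = HOUR_MS)
    public void processLeftoversFromCurrentExecution() {
        // ... retry or clean up failed workerReports of this execution ...
    }
}
```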
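
The directory-scan sketch: every worker's subDir is visited, and the last-access-time of each json report is read through BasicFileAttributes. The directory layout, the *.json glob and the staleness threshold are assumptions.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.TimeUnit;

public class ReportScanSketch {

    static void scanReports(Path reportsBaseDir, long maxAgeHours) throws IOException {
        // Iterate through each worker's subDir, not just the top-level directory.
        try (DirectoryStream<Path> workerDirs =
                     Files.newDirectoryStream(reportsBaseDir, Files::isDirectory)) {
            for (Path workerDir : workerDirs) {
                try (DirectoryStream<Path> reports =
                             Files.newDirectoryStream(workerDir, "*.json")) {
                    for (Path report : reports) {
                        BasicFileAttributes attrs =
                                Files.readAttributes(report, BasicFileAttributes.class);
                        long ageMs = System.currentTimeMillis()
                                - attrs.lastAccessTime().toMillis();
                        if (ageMs > TimeUnit.HOURS.toMillis(maxAgeHours)) {
                            // ... treat this report as a stale leftover ...
                        }
                    }
                }
            }
        }
    }
}
```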