UrlsController

Commit Graph

Author	SHA1	Message	Date
Lampros Smyrnaios	8bc5cc35e2	- Optimize writing to the Bulk-import-report file. - Show the IP of the worker which posts a "workerShutdownReport". - Code polishing.	2024-03-22 17:50:55 +02:00
Lampros Smyrnaios	1048463ca0	- Improve error-handling in "S3ObjectStore.emptyBucket()". - Change some log-levels. - Code polishing.	2024-03-11 16:17:32 +02:00
Lampros Smyrnaios	2e60128084	- Allow to easily change the port used by workers. - Show the number of active background-tasks and bulkImportDirs, which delay the Service's shutdown. - Update dependencies. - Code polishing.	2023-12-19 23:31:42 +02:00
Lampros Smyrnaios	d90ad51609	Add the "shutdownAllWorkersGracefully" and "cancelShutdownAllWorkersGracefully" endpoints, in order to be able to shut them down at once and update them, without shutting down the whole Service. So in this case the bulk-import procedures will continue to work.	2023-11-29 16:45:58 +02:00
Lampros Smyrnaios	856c62887d	- Make sure the "UTF_8" charset is used, when we get a message from the response-body. - Improve some log-messages.	2023-10-26 11:44:23 +03:00
Lampros Smyrnaios	bdf834c439	- Refactor the "shutdown" script to do an orderly-shutdown, by default, by calling the "shutdownService" endpoint. In case a "force-shutdown" is needed, that can be requested with a cmd-argument. - Fix not updating the "UrlsController.numOfWorkers" correctly. - Code polishing.	2023-10-23 17:19:29 +03:00
Lampros Smyrnaios	fb2877dbe8	Upgrade the execution system for the backgroundTasks: - Submit each task immediately for execution, instead of waiting for a scheduling thread to send all gathered tasks (up to that point) to the ExecutorService (and block until they are finished, before it can start again). - Hold the Future of each submitted task in a synchronized-list to check the result of each task at a scheduled time. - Reduce the cpu-time needed to assure the Service can shut down, by checking if there are "active" and "about-to-be-executed" tasks at the same time, instead of having to rely on the additional checking of the "shutdown"-status of each worker to verify that no active tasks exist. - Improve the threads' shutdown procedure.	2023-10-09 17:23:59 +03:00
Lampros Smyrnaios	96c11ba4b8	- Add a missing change. - Code optimization and polishing. - Update dependencies.	2023-10-04 16:17:12 +03:00
Lampros Smyrnaios	ede7ca5a89	- Add bulk-import support for non-Authoritative data-sources. - Update Spring Boot. - Code polishing.	2023-09-26 18:02:48 +03:00
Lampros Smyrnaios	903c3e1ffc	Add thread-safety when reading the bulkImportReport-files.	2023-09-15 11:54:32 +03:00
Lampros Smyrnaios	ee2df19ce1	- Allow "pretty-printing" the json response of the "getBulkImportReport" endpoint. - Add useful log-messages for various bulk-import stages and improve the current ones. - Optimize reading and writing the reports.	2023-09-11 17:24:39 +03:00
Lampros Smyrnaios	aa3f32f3da	- Make sure the number of threads given by the user is above zero. - Adjust the number and size of log files. - Update Spring Boot. - Code polishing.	2023-08-30 14:02:54 +03:00
Lampros Smyrnaios	44459c8681	- Rename "ImpalaConnector.java" to "DatabaseConnector.java". - Update dependencies. - Code polishing.	2023-08-23 16:55:23 +03:00
Lampros Smyrnaios	b3e0d214fd	Update the BulkImport API: - Refactor the "bulkImportReportID". - Add the "bulk:" prefix in the provenance value, in the DB. - Fix not using correctly the "Lists.partition()" method. - Make sure the "bulkImportDir" is removed from the "bulkImportDirsUnderProcessing" Set, in case of an early-error. - Fix the "numFailedSegments"-calculation. - Improve some messages. - Code polishing.	2023-08-21 18:19:53 +03:00
Lampros Smyrnaios	a524375656	- Create the HDFS-subDirs before generating "callableTasks" for creating and uploading the parquetFiles. - Delete gradle .zip file after installation.	2023-08-04 15:30:41 +03:00
Lampros Smyrnaios	dfb9c8204e	Add useful messages for missing parameters in Stats API.	2023-07-25 15:36:54 +03:00
Lampros Smyrnaios	b73be6d8da	Fix the Stats API returning simple numbers as "application/json". Now they are returned as "text/plain".	2023-07-25 12:03:27 +03:00
Lampros Smyrnaios	66a5b3c7da	Update Bulk-Import API: - Increase the "numOfThreadsPerBulkImportProcedure" to 6. - Fix Bulk import not working from a second-level subdirectory; the report-subDirectory was not created. - Fix not returning the bulk-import-report as "application/json". - Add useful messages for missing parameters. - Change the HTTP-method for the "bulkImportFullTexts" endpoint to "POST". - Show a structured json-response for the "bulkImportFullTexts" endpoint. - Fix uncommon date-format. - Remove single quotes from json-report, since they are returned as bytes, not characters. - Optimize the generation of the json-bulkImport-report.	2023-07-25 11:59:47 +03:00
Lampros Smyrnaios	d821ae398f	Improve performance by applying the merging-procedure for the parquet files of the database tables less often, while keeping the benefits of having a relatively small maximum number of parquet files in search operations.	2023-07-24 20:28:41 +03:00
Lampros Smyrnaios	7dc72e242e	- Fix missing changes. - Change the HTTP-method of the renamed "test/uploadParquetFile" endpoint to "POST".	2023-07-24 19:55:37 +03:00
Lampros Smyrnaios	9cbac77c2a	- Add check for "shouldShutdownService" before allowing to continue with a bulk-import request. - Add check for remaining background tasks (including bulkImports), before checking if the workers have shut down and then shut down the Service.	2023-07-21 16:19:00 +03:00
Lampros Smyrnaios	cec2531737	- Increase the "numOfBackgroundThreads" to 8. - Make the "numOfBackgroundThreads" and "numOfThreadsPerBulkImportProcedure" configurable from the "application.yml" file. - Code polishing.	2023-07-21 11:45:50 +03:00
Lampros Smyrnaios	fd1cf56863	- Avoid passing some duplicate publications to the Workers, by post-processing the assignments retrieved from Impala. The assignments may contain duplicate id-url pairs, which have different datasources, since one publication may be connected to multiple datasources. - Code polishing.	2023-07-19 18:31:24 +03:00
Lampros Smyrnaios	8dfb58ee63	Avoid assigning the same publications multiple times to the Workers, after the recent "parallelization enhancement". After that enhancement, each worker could request multiple assignment-batches, before its previous batches were processed by the Controller. This means that for each batch that was processed, the Controller was deleting from the "assignment" table, all the assignments (-batches) delivered to the Worker that brought that batch, even though the "attempt" and "payload" records for the rest of the batches were not inserted in the DB yet. So in a new assignments-batch request, the same publications that were already under processing, were delivered to the same or other Workers. Now, for each finished batch, only the assignments of that batch are deleted from the "assignment" table.	2023-07-11 17:27:23 +03:00
Lampros Smyrnaios	d5c139c410	Handle the case where the "stats"-queries are executed while some tables of the DB are in a "merge" state. In this case, the queries fail and the Controller retries up to 10 times.	2023-07-06 18:29:13 +03:00
Lampros Smyrnaios	e8644cb64f	- Optimize the "insertAssignmentsQuery". - Add documentation about the Prometheus Metrics, in README. - Update Dependencies. - Code polishing.	2023-07-05 17:10:30 +03:00
Lampros Smyrnaios	0f4b63c4a9	Expose the following statistics as prometheus-metrics and create/update a stats-endpoint for each one: - "numOfPayloadsAggregatedByServiceThroughCrawling" - "numOfPayloadsAggregatedByServiceThroughBulkImport" - "numOfPayloadsAggregatedByService" - "numOfLegacyPayloads" - "numOfRecordsInspectedByServiceThroughCrawling" (renamed from "numOfInspectedRecords")	2023-06-23 15:22:26 +03:00
Lampros Smyrnaios	d52601e819	- Add handling for the "EmptyResultDataAccessException" in "UrlsServiceImpl.getAssignments()", which is thrown when no assignments are returned from the query. - Code polishing.	2023-06-22 12:39:11 +03:00
Lampros Smyrnaios	b9712bed85	- Expose the "numOfAllPayloads" and "numOfInspectedRecords" DB-stats to Prometheus, by using a scheduling task to request the numbers from the DB, every 6 hours. - Update the "StatsServiceImpl.getNumberOfPayloadsAggregatedByService()" to use the new table "payload_aggregated", instead of casting and checking the date of the records. - Code polishing.	2023-06-19 14:42:00 +03:00
Lampros Smyrnaios	798fa09d68	- Identify and handle a possible Worker-crash, in "UrlsServiceImpl.postReportResultToWorker()". - Add/Improve some log messages. - Update and cleanup dependencies. - Code polishing.	2023-06-15 23:19:36 +03:00
Lampros Smyrnaios	6669dc61bf	- Increase the initialDelay for the "checkIfServiceIsReadyForShutdown" scheduled-task, in production, to 10 minutes. - Code polishing.	2023-06-06 16:49:53 +03:00
Lampros Smyrnaios	54685bbe9a	- Avoid sending "cancelShutdown" requests to Workers which have already shut down. - Optimize performance of the code running right before the "postShutdownOrCancelRequestToWorker". - Show which Workers have already shut down and, as a result, will not receive a "postShutdownOrCancelRequestToWorker".	2023-05-29 13:41:37 +03:00
Lampros Smyrnaios	f9c6bad768	Do not send shutDownRequests to workers which have already shut down.	2023-05-29 12:42:54 +03:00
Lampros Smyrnaios	a38d6ace79	Code polishing.	2023-05-29 12:21:48 +03:00
Lampros Smyrnaios	74ff31fc64	- Show the workerIPs in the logs. - Rename the "FullTexts"-files to "BulkImport".	2023-05-29 12:12:08 +03:00
Lampros Smyrnaios	3988eb3a48	- Use a separate HDFS sub-dir for every assignments-batch, in order to avoid any disruptions from multiple threads moving parquet-files from the same sub-dir. Multiple batches from the same worker may be processed at the same time. These sub-dirs are deleted afterwards. - Treat the "contains no visible files" situation as an error, in which case the assignments-data is presumed to not have been inserted to the database tables. - Code polishing/cleanup.	2023-05-27 02:36:05 +03:00
Lampros Smyrnaios	164245cb53	- Automatically delete the unsuccessful WorkerReports, which are more than 7 days old. - Optimize the Service's startup speed, by setting "initialDelays" to the scheduled tasks. - Optimize documentation.	2023-05-24 16:59:42 +03:00
Lampros Smyrnaios	cd1fb0af88	- Process the WorkerReports in background Jobs and post the reportResults to the Workers. - Save the workerReports to json files, until they are processed successfully. - Show some custom metrics in prometheus.	2023-05-24 13:52:28 +03:00
Lampros Smyrnaios	0ea3e2de24	Add the "shutdownService" and "cancelShutdownService" endpoints. The Controller sends the related requests to the Workers and shuts down gracefully, after all workers have shut down.	2023-05-24 13:42:29 +03:00
Lampros Smyrnaios	c7bfd75973	- Add the "getWorkersInfo" endpoint. - Improve startup speed, by using a faster remote server to get the host's machine public IP. This also reduces the risk of not being able to get the public IP at all. - Fix the detection of a different IP for a known worker. - Improve documentation.	2023-05-23 14:57:15 +03:00
Lampros Smyrnaios	5f75b48e95	- Increase the "read-timeout" when searching for the host's machine public-IP. - Update dependencies. - Code polishing.	2023-05-22 21:33:02 +03:00
Lampros Smyrnaios	0ab6bae93a	- Optimize the json-conversion of the "BulkImportReport". - Code polishing.	2023-05-18 17:30:40 +03:00
Lampros Smyrnaios	f7f919cee1	- Make sure we set the "hasShutdown" to "false", for each known worker which was restarted. - Fix markdown of urls in prometheus' readme.	2023-05-16 12:24:14 +03:00
Lampros Smyrnaios	f51a34138f	- Use separate HDFS subdirectories for each worker in order to avoid seeing exceptions about "empty hdfs directory" when "loading" data to the database, because one worker has loaded data generated by multiple workers (since we use only 1 load operation for multiple parquet files). - Store each worker's info in a hash-table, in order to efficiently know if we need to create new hdfs subdirectories. Also, this will help to issue "shutdown" requests to the workers in the future, as well as to know which worker has shutdown.	2023-05-15 13:12:20 +03:00
Lampros Smyrnaios	9412391903	- In test-environment mode, check for already existing file-hashes only in the "payload_aggregated" table, instead of the whole "payload" view. This way the investigation for false-positive docUrls is easier, as we avoid checking against the millions of "legacy" payloads. - Improve performance in production, by not creating the string objects for "trace"-logs.	2023-05-15 12:44:16 +03:00
Lampros Smyrnaios	b6e8cd1889	New feature: BulkImport full-text files from compatible datasources.	2023-05-11 03:07:55 +03:00
Lampros Smyrnaios	42b93e9429	- Add the "getNumberOfAllDistinctFullTexts" stats-endpoint. - Add TODOs for more stats endpoints. - Code polishing.	2023-05-04 15:48:49 +03:00
Lampros Smyrnaios	d7797eaaf6	Add the "getNumberOfPayloadsForDatasource" endpoint.	2023-04-24 09:54:35 +03:00
Lampros Smyrnaios	ff13af7abb	Use a StatsService interface.	2023-03-13 12:39:39 +02:00
Lampros Smyrnaios	38643c76a3	- Code polishing. - Update Gradle.	2023-03-07 16:55:41 +02:00

104 Commits