UrlsController

Commit Graph

Author	SHA1	Message	Date
Lampros Smyrnaios	e20c5d2146	- Add error-handling for the case when no payloads could be associated with a specific url which should have been in the hashMultiMap in "addUrlReportsByMatchingRecordsFromBacklog". - Fix not cloning the payload, before changing it and adding it in the "prefilledPayloads"-list; instead, an object-reference was used.	2024-03-11 19:48:04 +02:00
Lampros Smyrnaios	66a5b3c7da	Update Bulk-Import API: - Increase the "numOfThreadsPerBulkImportProcedure" to 6. - Fix Bulk import not working from a second-level subdirectory; the report-subDirectory was not created. - Fix not returning the bulk-import-report as "application/json". - Add useful messages for missing parameters. - Change the HTTP-method for the "bulkImportFullTexts" endpoint to "POST". - Show a structured json-response for the "bulkImportFullTexts" endpoint. - Fix uncommon date-format. - Remove single quotes from json-report, since they are returned as bytes, not characters. - Optimize the generation of the json-bulkImport-report.	2023-07-25 11:59:47 +03:00
Lampros Smyrnaios	8dfb58ee63	Avoid assigning the same publications multiple times to the Workers, after the recent "parallelization enhancement". After that enhancement, each worker could request multiple assignment-batches, before its previous batches were processed by the Controller. This means that for each batch that was processed, the Controller was deleting from the "assignment" table, all the assignments (-batches) delivered to the Worker that brought that batch, even though the "attempt" and "payload" records for the rest of the batches were not inserted in the DB yet. So in a new assignments-batch request, the same publications that were already under processing, were delivered to the same or other Workers. Now, for each finished batch, only the assignments of that batch are deleted from the "assignment" table.	2023-07-11 17:27:23 +03:00
Lampros Smyrnaios	0ab6bae93a	- Optimize the json-conversion of the "BulkImportReport". - Code polishing.	2023-05-18 17:30:40 +03:00
Lampros Smyrnaios	f51a34138f	- Use separate HDFS subdirectories for each worker in order to avoid seeing exceptions about "empty hdfs directory" when "loading" data to the database, because one worker has loaded data generated by multiple workers (since we use only 1 load operation for multiple parquet files). - Store each worker's info in a hash-table, in order to efficiently know if we need to create new hdfs subdirectories. Also, this will help to issue "shutdown" requests to the workers in the future, as well as to know which worker has shutdown.	2023-05-15 13:12:20 +03:00
Lampros Smyrnaios	b6e8cd1889	New feature: BulkImport full-text files from compatible datasources.	2023-05-11 03:07:55 +03:00
Lampros Smyrnaios	bfdf06bd09	- Apply error-checking on individual CallableTasks and in tasks-batches related to the creation and upload of all the data related to the "attempt" and "payload" table. So, if no data could be uploaded for one or both tables, no "load"-queries will be executed for that/those tables. - Catch the more general "Exception", inside "FileUtils.mergeParquetFiles()", in order to be certain that the "SQLException" can also be caught. - Code polishing.	2022-12-09 12:46:06 +02:00
Lampros Smyrnaios	6226e2298d	- Upgrade the results-loading process: Instead of making thousands of sql-insert requests to Impala, now we write the results to parquet files, upload them to HDFS and then import the data into the Impala tables with just 2 requests. This results in a huge performance improvement. One side effect of using the parquet-files, is that the timestamps are now BIGDECIMAL numbers, instead of "Timestamp" objects, but, converting them to such objects is pretty easy, if we ever need to do it. - Code polishing.	2022-11-10 17:18:21 +02:00
Lampros Smyrnaios	a23c918a42	- Fix a "@JsonProperty" annotation inside "Payload.java". - Fix a "@Value" annotation inside "FileUtils.java". - Add a new database and show its name along with the initial's name in the logs. - Code cleanup and improvement.	2022-04-05 00:01:44 +03:00
Lampros Smyrnaios	5e4fad2479	- Change the fileNames' structure in the S3-ObjectStore. - Update dependencies.	2022-04-01 19:24:04 +03:00
Lampros Smyrnaios	15224c6468	Improve performance in the "getUrls"-endpoint, and more: - Optimize the "findAssignmentsQuery" by using an inner limit (larger than the outer). - Save a ton of time from inserting the assignments into the database, by using a temporary table to hold the new assignments, in order for them to be easily accessible both from the Controller (which processes them and sends them to the Worker) and the database itself, in order to "import" them into the "assignment"-table. - Replace the "Date" with "Timestamp", in order to hold more detailed information. - Code cleanup.	2021-11-30 19:59:46 +02:00
Lampros Smyrnaios	d931315ced	- Add the "isControllerAlive"-endpoint. - Change the data-type of the "UrlReport.status" to be "enum StatusType", in order to increase consistency and comparability. - Change the "Date" datatype in "Payload" to have the SQL's version. - Fix the project's name inside "settings.gradle". - Code cleanup.	2021-09-23 15:08:52 +03:00
Lampros Smyrnaios	d56e988518	- Process the Error of PDF-aggregation. Distinguish between "couldRetry" and "noRetry" cases. - Update the "RequestParam" for the getUrls endpoints. - Fix the "assignmentCounter". - Code cleanup.	2021-08-05 15:43:37 +03:00
Lampros Smyrnaios	27375b9396	- Refactor the Assignment-creation. In order to match the database, now we have a list of Assignments sent through the AssignmentResponse, instead of a single Assignment having a list of tasks. - Cleanup the members of the "Payload" model (also prepare for database integration).	2021-07-05 14:04:39 +03:00
Lampros Smyrnaios	5e7ccbd8c6	Add the "addWorkerReport" endpoint.	2021-06-22 05:38:48 +03:00
Lampros Smyrnaios	40763ec146	Update the "WorkerReport" request and the "UrlReport" and "Payload" models.	2021-06-19 07:07:36 +03:00
Lampros Smyrnaios	6729f51b03	Add an "assignmentId" field in the "Assignment"-class.	2021-06-09 05:48:54 +03:00
Lampros Smyrnaios	787299b5b7	Add the "Datasource" inside the "Task" class and include it in the Assignment.	2021-05-20 02:50:50 +03:00
Lampros Smyrnaios	e2cc320baf	- Add the "getTestUrls"-endpoint which returns an "Assignment" with data retrieved from the added resource-file. - Update the "getUrls"-endpoint to be ready to retrieve data from the database, once it's added. - Update the dependencies. - Code cleanup.	2021-05-18 17:23:20 +03:00
Lampros Smyrnaios	d3588ea36b	Add the "DownloadAttempt" class.	2021-04-24 21:44:51 +03:00
Lampros Smyrnaios	a6ab810ad3	Update classes: "Publication" and "Payload".	2021-04-24 21:40:10 +03:00
Lampros Smyrnaios	85ecc4a36b	Add classes: "AssignmentResponse", "WorkerReport", "WorkerRequest", "UrlReport".	2021-04-24 21:06:52 +03:00
Lampros Smyrnaios	c2ea8a69de	Update classes: "Assignment", "Task", "Error", "Payload", "UrlsRequest".	2021-04-24 21:05:21 +03:00
Lampros Smyrnaios	89c6a73a30	Add "Assignment", "Task" and "Error" classes.	2021-04-15 03:36:08 +03:00
Lampros Smyrnaios	8a4376da9c	Initial commit of UrlsController.	2021-03-16 15:25:15 +02:00

25 Commits