02bae38885- Improve response-time to "getAssignments"-requests, by avoiding merging the parquet files of the "assignment" table, right after acquiring the assignments from the DB. They are already getting merged, when each assignments-batch is deleted after a workerReport has been processed. - Optimize code-positioning for unlocking the DB when done executing queries.Lampros Smyrnaios2023-09-13 17:03:11 +0300
8fdb8e9137Add renaming of the workerReport-file, to indicate failure, when the processing failed because no workerInfo was found for the worker-id existing in the report. This way, it can be retried by the scheduler later.Lampros Smyrnaios2023-09-13 16:35:41 +0300
c98e8df323Move the "getRenamedWorkerReport"-code in its own method.Lampros Smyrnaios2023-09-13 16:27:18 +0300
6891c467d4- Avoid displaying a warning for the "test" HDFS directory, when the Controller is running in PROD mode. - Add a missing change for the optimization of reading files. - Update dependencies.Lampros Smyrnaios2023-09-13 15:29:30 +0300
3dd349dd00Improve the "findAssignmentsQuery": - Fix an issue where assignments with an above-zero attempt_count found their way into the results just because they were prioritized based on their boost_level or pub_year. Apart from retrying the old failed assignments sooner, this also pushed the not-yet-processed boosted publications out to the workers much more slowly. - Simplify the query by removing the internal "ordering" and "limit", which had performance benefits when we did not need additional ordering for "level" and "pub_year". Back then, we wanted to apply the final orderings to as few rows as possible.Lampros Smyrnaios2023-09-13 14:38:15 +0300
ee2df19ce1- Allow "pretty-printing" the json response of the "getBulkImportReport" endpoint. - Add useful log-messages for various bulk-import stages and improve the current ones. - Optimize reading and writing the reports.Lampros Smyrnaios2023-09-11 17:24:39 +0300
1c8f3765ca- Fix not acquiring the full workerReport when retrying it, with the scheduler. - Improve error-handling in the "inspectWorkerReportsAndTakeAction" process. - Code polishing.Lampros Smyrnaios2023-09-08 14:59:48 +0300
e72a4d3d10- Improve handling of already renamed workerReport-files, which relate to failed workerReports in the 1st try. - Add check for workerReport-files, which may have been deleted before their time, due to an error.Lampros Smyrnaios2023-09-08 14:11:41 +0300
bd9245cc3dAvoid deleting the assignment-records in case of a "parquet-data creation, upload or insertion" problem, in order to avoid double-processing of the urls, until the report has been retried by the scheduler.Lampros Smyrnaios2023-09-08 13:44:24 +0300
718f5cfefb- Improve prioritization of the most recent publications. - Avoid processing publications which will be published in the next 5 years, counting from each "current" year, since they are not providing full-texts yet. Still allow the invalid publication-years like "2566", "9999", etc.Lampros Smyrnaios2023-09-07 14:05:58 +0300
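The publication-year rule described in 718f5cfefb can be illustrated with a small sketch. This is written in Python for brevity and is not the Controller's actual code; the function name and the exact window boundaries (strictly after the current year, up to and including current year + 5) are assumptions made for illustration.

```python
def should_process_publication_year(pub_year: int, current_year: int) -> bool:
    """Return False only for years inside the assumed 5-year future window.

    Assumption: the skipped window is current_year < pub_year <= current_year + 5.
    Far-future values such as 2566 or 9999 are treated as invalid years and
    are still allowed through, as the commit message describes.
    """
    in_near_future = current_year < pub_year <= current_year + 5
    return not in_near_future
```

With a current year of 2023, this sketch skips 2024 through 2028 and keeps 2023, 2566 and 9999.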
199105f7f1Fix not writing some bulk-import error-messages to the logs. Instead, they were only written to the json-reports.Lampros Smyrnaios2023-09-04 16:33:27 +0300
acef891167Improve prioritization of "publication_boost" records, by adding a second ordering in the end.Lampros Smyrnaios2023-09-04 15:34:37 +0300
febe2b212cUpgrade management of failed workerReports: - Upon completing the processing of a workerReport, the name of the json-file is suffixed with "successful" or "failed". - Avoid immediately deleting the failed workerReports. - Add a scheduling task to process leftover failed workerReports from previous executions of the service, only once, 12 hours after startup, so that the workers have had time to participate and fill the "workersInfoMap". - Add a scheduling task to regularly process leftover failed workerReports from the current execution. - Fix not iterating through the workers' subDirs when checking the last-access-time of workerReports. - Fix not deleting the assignment records from the DB, when a failed leftover workerReport gets deleted. - Code refactoring.
2.3
Lampros Smyrnaios2023-09-01 15:10:58 +0300
5c459a3a16Optimize handling of HTTP-4XX errors in "UrlsServiceImpl.postReportResultToWorker()".Lampros Smyrnaios2023-08-31 13:20:12 +0300
601776e81c- Fix not handling the case where the info about the worker in the WorkerReport does not exist inside the "workersInfoMap", as that worker is not participating in the Service (this case may appear in future code). - Code polishing.Lampros Smyrnaios2023-08-30 17:07:51 +0300
c32dfa882eFix not deleting the assignment-records, for every workerReport, after processing it.Lampros Smyrnaios2023-08-30 16:22:58 +0300
aa3f32f3da- Make sure the number of threads given by the user is above zero. - Adjust the number and size of log files. - Update Spring Boot. - Code polishing.Lampros Smyrnaios2023-08-30 14:02:54 +0300
b3e0d214fdUpdate the BulkImport API: - Refactor the "bulkImportReportID". - Add the "bulk:" prefix in the provenance value, in the DB. - Fix not using correctly the "Lists.partition()" method. - Make sure the "bulkImportDir" is removed from the "bulkImportDirsUnderProcessing" Set, in case of an early-error. - Fix the "numFailedSegments"-calculation. - Improve some messages. - Code polishing.Lampros Smyrnaios2023-08-21 18:19:53 +0300
a524375656- Create the HDFS-subDirs before generating "callableTasks" for creating and uploading the parquetFiles. - Delete gradle .zip file after installation.Lampros Smyrnaios2023-08-04 15:30:41 +0300
860c73ea91- Improve the "shutdownController.sh" script. - Set names for the Prometheus and Grafana containers. - Code polishing.
2.2
Lampros Smyrnaios2023-07-27 18:27:48 +0300
0699acc999Make sure we use the latest version of the "zstd-jni" library, which contains the core code for the "ZStandard" compression algorithm. Apache's "commons-compress" package, which wraps it in file-management code, updates "zstd-jni" less often.Lampros Smyrnaios2023-07-27 17:42:57 +0300
dfb9c8204eAdd useful messages for missing parameters in Stats API.Lampros Smyrnaios2023-07-25 15:36:54 +0300
b73be6d8daFix the Stats API returning simple numbers as "application/json". Now they are returned as "text/plain".Lampros Smyrnaios2023-07-25 12:03:27 +0300
66a5b3c7daUpdate Bulk-Import API: - Increase the "numOfThreadsPerBulkImportProcedure" to 6. - Fix Bulk import not working from a second-level subdirectory; the report-subDirectory was not created. - Fix not returning the bulk-import-report as "application/json". - Add useful messages for missing parameters. - Change the HTTP-method for the "bulkImportFullTexts" endpoint to "POST". - Show a structured json-response for the "bulkImportFullTexts" endpoint. - Fix uncommon date-format. - Remove single quotes from json-report, since they are returned as bytes, not characters. - Optimize the generation of the json-bulkImport-report.Lampros Smyrnaios2023-07-25 11:59:47 +0300
8d8a387ff2Reduce the waiting time for new background tasks to be scheduled for processing.Lampros Smyrnaios2023-07-24 20:33:56 +0300
d821ae398fImprove performance by applying the merging-procedure for the parquet files of the database tables less often, while keeping the benefits of having a relatively small maximum number of parquet files in search operations.Lampros Smyrnaios2023-07-24 20:28:41 +0300
7dc72e242e- Fix missing changes. - Change the HTTP-method of the renamed "test/uploadParquetFile" endpoint to "POST".Lampros Smyrnaios2023-07-24 19:55:37 +0300
9cbac77c2a- Add check for "shouldShutdownService" before allowing to continue with a bulk-import request. - Add check for remaining background tasks (including bulkImports), before checking if the workers have shut down and then shut down the Service.Lampros Smyrnaios2023-07-21 16:19:00 +0300
cec2531737- Increase the "numOfBackgroundThreads" to 8. - Make the "numOfBackgroundThreads" and "numOfThreadsPerBulkImportProcedure" configurable from the "application.yml" file. - Code polishing.Lampros Smyrnaios2023-07-21 11:45:50 +0300
fd1cf56863- Avoid passing some duplicate publications to the Workers, by post-processing the assignments retrieved from Impala. The assignments may contain duplicate id-url pairs, which have different datasources, since one publication may be connected to multiple datasources. - Code polishing.Lampros Smyrnaios2023-07-19 18:31:24 +0300
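A minimal sketch of the post-processing described in fd1cf56863, written in Python for brevity and not taken from the Controller's code. The `Assignment` type and the function name are hypothetical, and keeping the first occurrence of each id-url pair is an assumption.

```python
from typing import Iterable, List, NamedTuple


class Assignment(NamedTuple):
    # Hypothetical record shape; the real assignment has more fields.
    id: str
    url: str
    datasource: str


def drop_duplicate_id_url_pairs(assignments: Iterable[Assignment]) -> List[Assignment]:
    """Keep the first assignment seen for each (id, url) pair.

    Two rows that differ only in their datasource count as duplicates,
    since they point to the same publication url.
    """
    seen = set()
    unique = []
    for assignment in assignments:
        key = (assignment.id, assignment.url)
        if key in seen:
            continue  # Same id-url pair, different datasource: skip it.
        seen.add(key)
        unique.append(assignment)
    return unique
```

The order of the input is preserved, so any prioritization applied by the query is not disturbed.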
b94c35c66e- Fix double active "@Scheduled" annotation for the "ScheduledTasks.updatePrometheusMetrics()" method. - Code polishing.Lampros Smyrnaios2023-07-13 18:32:45 +0300
8dfb58ee63Avoid assigning the same publications multiple times to the Workers, after the recent "parallelization enhancement". After that enhancement, each worker could request multiple assignment-batches before its previous batches were processed by the Controller. This means that for each batch that was processed, the Controller was deleting from the "assignment" table all the assignments (-batches) delivered to the Worker that brought that batch, even though the "attempt" and "payload" records for the rest of the batches were not inserted in the DB yet. So in a new assignments-batch request, the same publications that were already under processing were delivered to the same or other Workers. Now, for each finished batch, only the assignments of that batch are deleted from the "assignment" table.Lampros Smyrnaios2023-07-11 17:27:23 +0300
d5c139c410Handle the case where the "stats"-queries are executed while some tables of the DB are in a "merge" state. In this case, the queries fail and the Controller retries up to 10 times.Lampros Smyrnaios2023-07-06 18:29:13 +0300
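The retry behaviour mentioned in d5c139c410 could look roughly like the following sketch. It is written in Python for brevity and is not the Controller's implementation; the function name and the choice to re-raise the last error are assumptions, and no delay between attempts is modelled because the log does not mention one.

```python
def run_with_retries(query, max_attempts=10):
    """Call `query` until it succeeds or `max_attempts` calls have failed.

    `query` is any zero-argument callable, for example one that runs a
    stats query which may fail while a table is being merged.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return query()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last failure.
```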
e8644cb64f- Optimize the "insertAssignmentsQuery". - Add documentation about the Prometheus Metrics, in README. - Update Dependencies. - Code polishing.Lampros Smyrnaios2023-07-05 17:10:30 +0300
a89abe3f2fPrioritize the publications, which are specified inside the "publication_boost" table, according to their "boost-level".Lampros Smyrnaios2023-06-29 12:32:06 +0300
4c3e2e6b6e- Fix not using the actual "currentAssignmentsBatch" of the workerReport itself, when creating the parquetFileNames and when reporting to the user the initialization of the "addition of the workerReport". - Code polishing.Lampros Smyrnaios2023-06-27 16:08:01 +0300
0f4b63c4a9Expose the following statistics as prometheus-metrics and create/update a stats-endpoint for each one: - "numOfPayloadsAggregatedByServiceThroughCrawling" - "numOfPayloadsAggregatedByServiceThroughBulkImport" - "numOfPayloadsAggregatedByService" - "numOfLegacyPayloads" - "numOfRecordsInspectedByServiceThroughCrawling" (renamed from "numOfInspectedRecords")Lampros Smyrnaios2023-06-23 15:22:26 +0300
b6b1cb08b9Add instructions on how to run the Prometheus and Grafana docker-containers alongside the UrlsController, by using the same script.Lampros Smyrnaios2023-06-23 14:52:07 +0300
d52601e819- Add handling for the "EmptyResultDataAccessException" in "UrlsServiceImpl.getAssignments()", which is thrown when no assignments are returned from the query. - Code polishing.Lampros Smyrnaios2023-06-22 12:39:11 +0300
b9712bed85- Expose the "numOfAllPayloads" and "numOfInspectedRecords" DB-stats to Prometheus, by using a scheduling task to request the numbers from the DB, every 6 hours. - Update the "StatsServiceImpl.getNumberOfPayloadsAggregatedByService()" to use the new table "payload_aggregated", instead of casting and checking the date of the records. - Code polishing.Lampros Smyrnaios2023-06-19 14:42:00 +0300
798fa09d68- Identify and handle a possible Worker-crash, in "UrlsServiceImpl.postReportResultToWorker()". - Add/Improve some log messages. - Update and cleanup dependencies. - Code polishing.Lampros Smyrnaios2023-06-15 23:19:36 +0300
88a74b2c41Add support for all private addresses, defined in "RFC 1918" standard. This fixes the issue of discarding some "shutdownService" requests due to coming from different local private addresses, when the Controller was run inside a docker container.Lampros Smyrnaios2023-06-15 13:26:27 +0300
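For reference, RFC 1918 defines three private IPv4 ranges: 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16. The check described in 88a74b2c41 can be sketched as below, in Python for brevity; this is an illustration and not the Controller's code, and the function name is hypothetical.

```python
import ipaddress

# The three private IPv4 ranges defined in RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918_address(address: str) -> bool:
    """Return True if `address` is an IPv4 address in an RFC 1918 range."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False  # Not a parsable IP address at all.
    return any(ip in network for network in RFC1918_NETWORKS)
```

Docker's default bridge network typically hands out addresses in 172.17.0.0/16, which falls inside the 172.16.0.0/12 range; a check that only accepted a single private range would reject such requests.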
c37f157f51Split the full-texts-batch's main handling-code to two separate methods, which can be used in parallel by two threads, in the future.Lampros Smyrnaios2023-06-14 17:16:38 +0300
e2776c50d0- Optimize the "WorkerReportResult" and the "ShutdownWorker" requests. - Improve documentation.Lampros Smyrnaios2023-06-10 02:31:57 +0300
0b1ab5b991Attempt to recover from serious failures, by individualizing the error-handling for each of the "table-merging" queries.Lampros Smyrnaios2023-06-10 02:28:02 +0300
6669dc61bf- Increase the initialDelay for the "checkIfServiceIsReadyForShutdown" scheduled-task, in production, to 10 minutes. - Code polishing.Lampros Smyrnaios2023-06-06 16:49:53 +0300
5d99a4be5d- Add the Shutdown Service API documentation. - Improve the BulkImport API documentation. - Fix markdown in README. - Update the app's version.Lampros Smyrnaios2023-06-06 16:18:38 +0300
54685bbe9a- Avoid sending "cancelShutdown" requests to Workers that have already shut down. - Optimize performance of the code running right before the "postShutdownOrCancelRequestToWorker". - Show which Workers have already shut down and, as a result, will not receive a "postShutdownOrCancelRequestToWorker".Lampros Smyrnaios2023-05-29 13:41:37 +0300
03bf4294b8- Add documentation about the "BulkImport API" in the README. - Fix a link in README. - Update dependencies.Lampros Smyrnaios2023-05-29 12:13:39 +0300
74ff31fc64- Show the workerIPs in the logs. - Rename the "FullTexts"-files to "BulkImport".Lampros Smyrnaios2023-05-29 12:12:08 +0300
3988eb3a48- Use a separate HDFS sub-dir for every assignments-batch, in order to avoid any discrepancies from multiple threads moving parquet-files from the same sub-dir. Multiple batches from the same worker may be processed at the same time. These sub-dirs are deleted afterwards. - Treat the "contains no visible files" situation as an error, in which case the assignments-data is presumed to not have been inserted to the database tables. - Code polishing/cleanup.Lampros Smyrnaios2023-05-27 02:36:05 +0300
02cee097d4Fix an issue which could cause some background jobs to be executed more than once. The previously executed jobs were not deleted from the global list fast enough, and they would be selected again, in case they were not finished before the scheduler started again.Lampros Smyrnaios2023-05-26 13:08:00 +0300
2b50e08bf6- Handle the case where multiple threads may load the same HDFS directory to a database table, thus causing the "directory contains no visible files"-SQLException. - Improve the values of the delays for some scheduledTasks. - Improve elapsed time precision for the "lastAccessedOn" metadata of the workerReports. - Code polishing.Lampros Smyrnaios2023-05-25 00:34:36 +0300
164245cb53- Automatically delete the unsuccessful WorkerReports, which are more than 7 days old. - Optimize the Service's startup speed, by setting "initialDelays" to the scheduled tasks. - Optimize documentation.Lampros Smyrnaios2023-05-24 16:59:42 +0300
8b5f143b0aPlace the "workerReports" and the "bulkImportReports" dirs inside the "reports" parent-directory.Lampros Smyrnaios2023-05-24 14:10:57 +0300
cd1fb0af88- Process the WorkerReports in background Jobs and post the reportResults to the Workers. - Save the workerReports to json files, until they are processed successfully. - Show some custom metrics in prometheus.Lampros Smyrnaios2023-05-24 13:52:28 +0300
0ea3e2de24Add the "shutdownService" and "cancelShutdownService" endpoints. The Controller sends the related requests to the Workers and shuts down gracefully, after all workers have shut down.Lampros Smyrnaios2023-05-24 13:42:29 +0300
c2a1b96069- Rename the mounted "mnt/bulkImport/" directory to "/mnt/bulk_import/". - Increase the "awaitTermination" timeout for the ExecutorService to 2 minutes.Lampros Smyrnaios2023-05-23 21:09:34 +0300
c7bfd75973- Add the "getWorkersInfo" endpoint. - Improve startup speed, by using a faster remote server to get the host's machine public IP. This also reduces the risk of not being able to get the public IP at all. - Fix the detection of a different IP for a known worker. - Improve documentation.Lampros Smyrnaios2023-05-23 14:57:15 +0300
5f75b48e95- Increase the "read-timeout" when searching for the host's machine public-IP. - Update dependencies. - Code polishing.Lampros Smyrnaios2023-05-22 21:33:02 +0300
0ab6bae93a- Optimize the json-conversion of the "BulkImportReport". - Code polishing.Lampros Smyrnaios2023-05-18 17:30:40 +0300
f7f919cee1- Make sure we set the "hasShutdown" to "false", for each known worker which was restarted. - Fix markdown of urls in prometheus' readme.Lampros Smyrnaios2023-05-16 12:24:14 +0300
b499209ce3- Move the Prometheus and grafana configuration in a dedicated directory and docker-compose file. - Add documentation about setting-up prometheus and grafana.Lampros Smyrnaios2023-05-15 18:52:31 +0300
f51a34138f- Use separate HDFS subdirectories for each worker in order to avoid seeing exceptions about "empty hdfs directory" when "loading" data to the database, because one worker has loaded data generated by multiple workers (since we use only 1 load operation for multiple parquet files). - Store each worker's info in a hash-table, in order to efficiently know if we need to create new hdfs subdirectories. Also, this will help to issue "shutdown" requests to the workers in the future, as well as to know which workers have shut down.Lampros Smyrnaios2023-05-15 13:12:20 +0300
9412391903- In test-environment mode, check for already existing file-hashes only in the "payload_aggregated" table, instead of the whole "payload" view. This way the investigation for false-positive docUrls is easier, as we avoid checking against the millions of "legacy" payloads. - Improve performance in production, by not creating the string objects for "trace"-logs.Lampros Smyrnaios2023-05-15 12:44:16 +0300
8381df70c6- Improve performance of uploading parquet-files to HDFS. - Add some logs. - Code polishing.Lampros Smyrnaios2023-05-11 19:40:48 +0300
992d4ffd5e- Add the time-zone in the logs. - Change some log-levels to "trace", although most of them are still disabled.Lampros Smyrnaios2023-05-11 03:10:53 +0300
42b93e9429- Add the "getNumberOfAllDistinctFullTexts" stats-endpoint. - Add TODOs for more stats endpoints. - Code polishing.Lampros Smyrnaios2023-05-04 15:48:49 +0300
b3196376ebFix a bug, which caused the full-text files to never close.Lampros Smyrnaios2023-05-04 13:03:28 +0300
1b14a7e554- Add profiles to docker-services to selectively run the additional "Prometheus" and "Grafana" services or not. - Update Gradle.Lampros Smyrnaios2023-04-22 16:50:33 +0300
c2b17163cdAutomatically show the Controller's logs after the docker-container starts running and the status is shown.Lampros Smyrnaios2023-04-11 11:59:10 +0300
4dc34429f8- Increase the waiting-time before checking the docker containers' status, in order to catch configuration-crashes. - Code polishing.Lampros Smyrnaios2023-04-10 22:28:53 +0300
c39fef2654Upgrade payload-table to payload-view which consists of three separate payload tables: "payload_legacy", "payload_aggregated" and "payload_bulk_import".Lampros Smyrnaios2023-04-10 15:55:50 +0300
484cf5cefc- Avoid requesting the remaining full-text batches in case the Worker returns a 5XX error in one of the batches. - Add nullability-checks for "datasourceId" and "hash" before constructing the new filename and uploading the full-text to S3. - Improve a log-message.Lampros Smyrnaios2023-03-29 17:12:37 +0300
495d5de19b- Automatically get the status of the docker containers 30 secs after their initialization. - Add error-handling in "installAndRun.sh". - Update dependencies.Lampros Smyrnaios2023-03-27 19:43:15 +0300
4280f89296- Set the default value of the "isTestEnvironment" property to "true", in order to avoid undesired outcomes in the production db. - Code polishing.Lampros Smyrnaios2023-03-21 17:04:28 +0200
e975bec911- Add Prometheus and Grafana, which help measure various metrics for the Controller's health and performance. - Fix Docker config still using the old (now removed) "application.properties" file. - Simplify the process of building and running the docker image; we now use docker compose to run the Controller, along with Prometheus and Grafana. Also, the user is no longer asked to log in and push the image (this may change in the future).Lampros Smyrnaios2023-03-21 16:46:33 +0200
003c0bf179- Add support for excluding specific datasources from being crawled. These datasources may be aggregated through bulk-imports, by other pieces of software. Such a datasource is "arXiv.org". - Fix an issue, where the "datasource-type" was retrieved instead of the "datasource-name". - Polish the "findAssignmentsQuery".Lampros Smyrnaios2023-03-21 07:19:35 +0200
f835a752bfTransform the "application.properties" file to "application.yml" and optimize the property-trees.Lampros Smyrnaios2023-03-20 15:23:00 +0200