Commit Graph

  • b94c35c66e - Fix double active "@Scheduled" annotation for the "ScheduledTasks.updatePrometheusMetrics()" method. - Code polishing. Lampros Smyrnaios 2023-07-13 18:32:45 +0300
  • 8dfb58ee63 Avoid assigning the same publications multiple times to the Workers, after the recent "parallelization enhancement". After that enhancement, each Worker could request multiple assignment-batches before its previous batches were processed by the Controller. This means that for each batch that was processed, the Controller was deleting from the "assignment" table all the assignments (-batches) delivered to the Worker that brought that batch, even though the "attempt" and "payload" records for the rest of the batches were not inserted in the DB yet. So in a new assignments-batch request, the same publications that were already under processing were delivered to the same or other Workers. Now, for each finished batch, only the assignments of that batch are deleted from the "assignment" table. Lampros Smyrnaios 2023-07-11 17:27:23 +0300
  • 2d5643cb0a Fix missing spaces in some secondary "DROP"-queries. (ref: 2.1) Lampros Smyrnaios 2023-07-07 20:51:14 +0300
  • d5c139c410 Handle the case where the "stats"-queries are executed while some tables of the DB are in a "merge" state. In this case, the queries fail and the Controller retries up to 10 times. Lampros Smyrnaios 2023-07-06 18:29:13 +0300
  • e8644cb64f - Optimize the "insertAssignmentsQuery". - Add documentation about the Prometheus Metrics, in README. - Update Dependencies. - Code polishing. Lampros Smyrnaios 2023-07-05 17:10:30 +0300
  • a89abe3f2f Prioritize the publications, which are specified inside the "publication_boost" table, according to their "boost-level". Lampros Smyrnaios 2023-06-29 12:32:06 +0300
  • 4c3e2e6b6e - Fix not actually using the "currentAssignmentsBatch" of the workerReport itself, when creating the parquetFileNames and when reporting to the user the initialization of the "addition of the workerReport". - Code polishing. Lampros Smyrnaios 2023-06-27 16:08:01 +0300
  • 0f4b63c4a9 Expose the following statistics as prometheus-metrics and create/update a stats-endpoint for each one: - "numOfPayloadsAggregatedByServiceThroughCrawling" - "numOfPayloadsAggregatedByServiceThroughBulkImport" - "numOfPayloadsAggregatedByService" - "numOfLegacyPayloads" - "numOfRecordsInspectedByServiceThroughCrawling" (renamed from "numOfInspectedRecords") Lampros Smyrnaios 2023-06-23 15:22:26 +0300
  • b6b1cb08b9 Add instructions on how to run the Prometheus and Grafana docker-containers alongside the UrlsController, by using the same script. Lampros Smyrnaios 2023-06-23 14:52:07 +0300
  • d52601e819 - Add handling for the "EmptyResultDataAccessException" in "UrlsServiceImpl.getAssignments()", which is thrown when no assignments are returned from the query. - Code polishing. Lampros Smyrnaios 2023-06-22 12:39:11 +0300
  • b9712bed85 - Expose the "numOfAllPayloads" and "numOfInspectedRecords" DB-stats to Prometheus, by using a scheduling task to request the numbers from the DB, every 6 hours. - Update the "StatsServiceImpl.getNumberOfPayloadsAggregatedByService()" to use the new table "payload_aggregated", instead of casting and checking the date of the records. - Code polishing. Lampros Smyrnaios 2023-06-19 14:42:00 +0300
  • 798fa09d68 - Identify and handle a possible Worker-crash, in "UrlsServiceImpl.postReportResultToWorker()". - Add/Improve some log messages. - Update and cleanup dependencies. - Code polishing. Lampros Smyrnaios 2023-06-15 23:19:36 +0300
  • 88a74b2c41 Add support for all private addresses, defined in "RFC 1918" standard. This fixes the issue of discarding some "shutdownService" requests due to coming from different local private addresses, when the Controller was run inside a docker container. Lampros Smyrnaios 2023-06-15 13:26:27 +0300
  • c37f157f51 Split the full-texts-batch's main handling-code to two separate methods, which can be used in parallel by two threads, in the future. Lampros Smyrnaios 2023-06-14 17:16:38 +0300
  • e2776c50d0 - Optimize the "WorkerReportResult" and the "ShutdownWorker" requests. - Improve documentation. Lampros Smyrnaios 2023-06-10 02:31:57 +0300
  • 0b1ab5b991 Attempt to recover from serious failures, by individualizing the error-handling for each of the "table-merging" queries. Lampros Smyrnaios 2023-06-10 02:28:02 +0300
  • 6669dc61bf - Increase the initialDelay for the "checkIfServiceIsReadyForShutdown" scheduled-task, in production, to 10 minutes. - Code polishing. Lampros Smyrnaios 2023-06-06 16:49:53 +0300
  • 5d99a4be5d - Add the Shutdown Service API documentation. - Improve the BulkImport API documentation. - Fix markdown in README. - Update the app's version. Lampros Smyrnaios 2023-06-06 16:18:38 +0300
  • 54685bbe9a - Avoid sending "cancelShutdown" requests to already shutDown Workers. - Optimize performance of the code running right before the "postShutdownOrCancelRequestToWorker". - Show which Workers have already shutdown and as a result a "postShutdownOrCancelRequestToWorker" will not be performed on them. Lampros Smyrnaios 2023-05-29 13:41:37 +0300
  • f9c6bad768 Do not send shutDownRequests to Workers which have already shut down. (ref: 2.0) Lampros Smyrnaios 2023-05-29 12:42:54 +0300
  • a38d6ace79 Code polishing. Lampros Smyrnaios 2023-05-29 12:21:48 +0300
  • 03bf4294b8 - Add documentation about the "BulkImport API" in the README. - Fix a link in README. - Update dependencies. Lampros Smyrnaios 2023-05-29 12:13:39 +0300
  • 74ff31fc64 - Show the workerIPs in the logs. - Rename the "FullTexts"-files to "BulkImport". Lampros Smyrnaios 2023-05-29 12:12:08 +0300
  • 3988eb3a48 - Use a separate HDFS sub-dir for every assignments-batch, in order to avoid any discrepancies from multiple threads moving parquet-files from the same sub-dir. Multiple batches from the same worker may be processed at the same time. These sub-dirs are deleted afterwards. - Treat the "contains no visible files" situation as an error, in which case the assignments-data is presumed not to have been inserted into the database tables. - Code polishing/cleanup. Lampros Smyrnaios 2023-05-27 02:36:05 +0300
  • 02cee097d4 Fix an issue which could cause some background jobs to be executed more than once. The previously executed jobs were not deleted from the global list fast enough, and they would be selected again, in case they were not finished before the scheduler started again. Lampros Smyrnaios 2023-05-26 13:08:00 +0300
  • 2b50e08bf6 - Handle the case where multiple threads may load the same HDFS directory to a database table, thus causing the "directory contains no visible files"-SQLException. - Improve the values of the delays for some scheduledTasks. - Improve elapsed time precision for the "lastAccessedOn" metadata of the workerReports. - Code polishing. Lampros Smyrnaios 2023-05-25 00:34:36 +0300
  • 164245cb53 - Automatically delete the unsuccessful WorkerReports, which are more than 7 days old. - Optimize the Service's startup speed, by setting "initialDelays" to the scheduled tasks. - Optimize documentation. Lampros Smyrnaios 2023-05-24 16:59:42 +0300
  • 551c4acef5 Fix property naming mismatch. Lampros Smyrnaios 2023-05-24 14:49:29 +0300
  • 8b5f143b0a Place the "workerReports" and the "bulkImportReports" dirs inside the "reports" parent-directory. Lampros Smyrnaios 2023-05-24 14:10:57 +0300
  • cd1fb0af88 - Process the WorkerReports in background Jobs and post the reportResults to the Workers. - Save the workerReports to json files, until they are processed successfully. - Show some custom metrics in prometheus. Lampros Smyrnaios 2023-05-24 13:52:28 +0300
  • 0ea3e2de24 Add the "shutdownService" and "cancelShutdownService" endpoints. The Controller sends the related requests to the Workers and shuts down gracefully, after all Workers have shut down. Lampros Smyrnaios 2023-05-24 13:42:29 +0300
  • c2a1b96069 - Rename the mounted "mnt/bulkImport/" directory to "/mnt/bulk_import/". - Increase the "awaitTermination" timeout for the ExecutorService to 2 minutes. Lampros Smyrnaios 2023-05-23 21:09:34 +0300
  • c7bfd75973 - Add the "getWorkersInfo" endpoint. - Improve startup speed, by using a faster remote server to get the host's machine public IP. This also reduces the risk of not being able to get the public IP at all. - Fix the detection of a different IP for a known worker. - Improve documentation. Lampros Smyrnaios 2023-05-23 14:57:15 +0300
  • 5f75b48e95 - Increase the "read-timeout" when searching for the host's machine public-IP. - Update dependencies. - Code polishing. Lampros Smyrnaios 2023-05-22 21:33:02 +0300
  • 0ab6bae93a - Optimize the json-conversion of the "BulkImportReport". - Code polishing. Lampros Smyrnaios 2023-05-18 17:30:40 +0300
  • f7f919cee1 - Make sure we set the "hasShutdown" to "false", for each known worker which was restarted. - Fix markdown of urls in prometheus' readme. Lampros Smyrnaios 2023-05-16 12:24:14 +0300
  • b499209ce3 - Move the Prometheus and grafana configuration in a dedicated directory and docker-compose file. - Add documentation about setting-up prometheus and grafana. Lampros Smyrnaios 2023-05-15 18:52:31 +0300
  • a8eea1ccf4 Fix missing changes. Lampros Smyrnaios 2023-05-15 13:13:24 +0300
  • f51a34138f - Use separate HDFS subdirectories for each worker in order to avoid seeing exceptions about "empty hdfs directory" when "loading" data to the database, because one worker has loaded data generated by multiple workers (since we use only 1 load operation for multiple parquet files). - Store each worker's info in a hash-table, in order to efficiently know if we need to create new hdfs subdirectories. Also, this will help to issue "shutdown" requests to the workers in the future, as well as to know which worker has shutdown. Lampros Smyrnaios 2023-05-15 13:12:20 +0300
  • 9412391903 - In test-environment mode, check for already existing file-hashes only in the "payload_aggregated" table, instead of the whole "payload" view. This way the investigation for false-positive docUrls is easier, as we avoid checking against the millions of "legacy" payloads. - Improve performance in production, by not creating the string objects for "trace"-logs. Lampros Smyrnaios 2023-05-15 12:44:16 +0300
  • 8381df70c6 - Improve performance of uploading parquet-files to HDFS. - Add some logs. - Code polishing. Lampros Smyrnaios 2023-05-11 19:40:48 +0300
  • 992d4ffd5e - Add the time-zone in the logs. - Change some log-levels to "trace", although most of them are still disabled. Lampros Smyrnaios 2023-05-11 03:10:53 +0300
  • b6e8cd1889 New feature: BulkImport full-text files from compatible datasources. Lampros Smyrnaios 2023-05-11 03:07:55 +0300
  • 42b93e9429 - Add the "getNumberOfAllDistinctFullTexts" stats-endpoint. - Add TODOs for more stats endpoints. - Code polishing. Lampros Smyrnaios 2023-05-04 15:48:49 +0300
  • b3196376eb Fix a bug, which caused the full-text files to never close. Lampros Smyrnaios 2023-05-04 13:03:28 +0300
  • 3473d91ce4 Add the "shutdownController.sh" script. Lampros Smyrnaios 2023-05-03 20:43:44 +0300
  • fd15372fd6 Add error-checks for retrieving the status-code from HttpUrlConnections. Lampros Smyrnaios 2023-05-03 13:30:29 +0300
  • 49662319a1 - Simplify the creation of local directories. - Improve exception messages. Lampros Smyrnaios 2023-04-28 14:58:33 +0300
  • 55ea5118ac - Update the "testDatabaseName" property. - Code polishing. Lampros Smyrnaios 2023-04-26 19:33:28 +0300
  • d7797eaaf6 Add the "getNumberOfPayloadsForDatasource" endpoint. Lampros Smyrnaios 2023-04-24 09:54:35 +0300
  • 1b14a7e554 - Add profiles to docker-services to selectively run the additional "Prometheus" and "Grafana" services or not. - Update Gradle. Lampros Smyrnaios 2023-04-22 16:50:33 +0300
  • 68759e3023 Update dependencies. Lampros Smyrnaios 2023-04-20 18:57:16 +0300
  • c2b17163cd Automatically show the Controller's logs after the docker-container starts running and the status is shown. Lampros Smyrnaios 2023-04-11 11:59:10 +0300
  • 4dc34429f8 - Increase the waiting-time before checking the docker containers' status, in order to catch configuration-crashes. - Code polishing. Lampros Smyrnaios 2023-04-10 22:28:53 +0300
  • c39fef2654 Upgrade payload-table to payload-view which consists of three separate payload tables: "payload_legacy", "payload_aggregated" and "payload_bulk_import". Lampros Smyrnaios 2023-04-10 15:55:50 +0300
  • 37363100fd Prioritize most recent publications. Lampros Smyrnaios 2023-04-10 15:00:23 +0300
  • 484cf5cefc - Avoid requesting the remaining full-text batches in case the Worker returns a 5XX error in one of the batches. - Add nullability-checks for "datasourceId" and "hash" before constructing the new filename and uploading the full-text to S3. - Improve a log-message. Lampros Smyrnaios 2023-03-29 17:12:37 +0300
  • 495d5de19b - Automatically get the status of the docker containers after 30 secs of their initialization. - Add error-handling in "installAndRun.sh". - Update dependencies. Lampros Smyrnaios 2023-03-27 19:43:15 +0300
  • 882c6f447b Update the "testDatabaseName". Lampros Smyrnaios 2023-03-21 23:10:21 +0200
  • 4280f89296 - Set the default value of the "isTestEnvironment" property to "true", in order to avoid undesired outcomes in the production db. - Code polishing. Lampros Smyrnaios 2023-03-21 17:04:28 +0200
  • e975bec911 - Add Prometheus and Grafana, which help measure various metrics for the Controller's health and performance. - Fix Docker config still using the old (now removed) "application.properties" file. - Simplify the process of building and running the docker image; now we use docker compose to run the Controller, along with Prometheus and Grafana. Also, the user is no longer asked to log in and push the image (this may change in the future). Lampros Smyrnaios 2023-03-21 16:46:33 +0200
  • 003c0bf179 - Add support for excluding specific datasources from being crawled. These datasources may be aggregated through bulk-imports, by other pieces of software. Such a datasource is "arXiv.org". - Fix an issue, where the "datasource-type" was retrieved instead of the "datasource-name". - Polish the "findAssignmentsQuery". Lampros Smyrnaios 2023-03-21 07:19:35 +0200
  • f835a752bf Transform the "application.properties" file to "application.yml" and optimize the property-trees. Lampros Smyrnaios 2023-03-20 15:23:00 +0200
  • 17a6c120dd Improve logs for full-texts' metrics. Lampros Smyrnaios 2023-03-14 20:57:01 +0200
  • ff13af7abb Use a StatsService interface. Lampros Smyrnaios 2023-03-13 12:39:39 +0200
  • 38643c76a3 - Code polishing. - Update Gradle. Lampros Smyrnaios 2023-03-07 16:55:41 +0200
  • 4af298a52a Revert the version of "libthrift"-dependency to "0.17.0", as the newer version is not compatible with Java 8. Lampros Smyrnaios 2023-03-03 12:57:30 +0200
  • 7b217764e0 Improve performance when downloading and decompressing the full-texts archive. Lampros Smyrnaios 2023-03-02 17:44:53 +0200
  • 62a4279e3b Update dependencies. Lampros Smyrnaios 2023-03-02 17:40:16 +0200
  • c4670073ae - Add missing refactoring-change. - Code polishing. - Update Spring. Lampros Smyrnaios 2023-02-24 23:49:04 +0200
  • c8485d472e Code polishing. Lampros Smyrnaios 2023-02-24 13:53:09 +0200
  • b7f6056032 - Improve an error-message. - Update Gradle. Lampros Smyrnaios 2023-02-21 15:42:07 +0200
  • 8893662a81 Refactor the UrlsController: a) offload the business-logic to the dedicated "UrlsService" and b) move the "checkParquetFilesSuccess()"-method to "ParquetFileUtils". Lampros Smyrnaios 2023-02-21 15:36:35 +0200
  • a1c16ffc19 - Exclude empty and null urls in the assignments. - Update the "getFullTextsImproved"-call to "getFullTexts", now that the "improved" version is stable. - Update Gradle. - Code polishing. Lampros Smyrnaios 2023-02-16 14:24:47 +0200
  • 2253f05bf5 Refactor the "StatsController"-code, by offloading it to a dedicated "StatsService". Lampros Smyrnaios 2023-02-09 19:25:48 +0200
  • 49fefefafd - Refactor the payloads-statistics-code and provide two endpoints: "getNumberOfPayloadsAggregatedByService", which returns the number of payloads aggregated only by the PDF-Aggregation-Service, and the "getNumberOfAllPayloads", which returns the number of all payloads existing in the database, even the ones aggregated in the past, by other pieces of software. - Update README.md. - Make sure the docker image is clean-built, by avoiding the use of cache. Lampros Smyrnaios 2023-02-02 17:58:47 +0200
  • c9f33d3afa Add an extra precaution-check to allow the emptying or deletion of an S3-Object-Store bucket, only when the app runs in "TestEnvironment". Lampros Smyrnaios 2023-02-01 16:42:22 +0200
  • f89730f196 Improve documentation. Lampros Smyrnaios 2023-01-27 14:31:07 +0200
  • dc8f0f2bd1 - Heavily reduce the maximum amount of space needed, by deleting the files of each full-texts batch, right after they are uploaded to the S3 Object Store. - Add a check for when the retrieved full-texts-batch is missing some requested files and show a warn-log. - Update dependencies. Lampros Smyrnaios 2023-01-23 20:23:21 +0200
  • d8773e6ebb - Make sure the test-environment uses a dedicated hdfs-parquet-directory. - Block app-execution in case the hdfs parquet directories failed to be created. - Code polishing. Lampros Smyrnaios 2023-01-18 13:38:05 +0200
  • b0b00c8aed Update the minio dependency. Lampros Smyrnaios 2023-01-11 15:46:34 +0200
  • c08ba1cc89 Revert the update of the "minio" dependency, as it introduces a bug, related to the "okhttp3.HttpUrl"-class. Lampros Smyrnaios 2023-01-10 15:58:23 +0200
  • 8876089022 - Use Facebook's [**Zstandard**](https://facebook.github.io/zstd/) compression algorithm, which brings very big benefits on compression rate and speed. - Update the minIO dependency. - Code polishing. Lampros Smyrnaios 2023-01-10 13:34:54 +0200
  • d1a4c84289 - Make sure the fullPath of the baseFilesLocation is available when the user specifies a non-root directory. - Improve error-checking and exception-handling in some "S3ObjectStore"-methods. - Make sure the "responseCode" is "200-OK", before trying to get the InputStream in "UriBuilder.getPublicIP()". Lampros Smyrnaios 2023-01-09 15:44:53 +0200
  • 9904ea5743 - Improve the stability of "UriBuilder.getPublicIP()", by using a "HttpURLConnection" to increase the connection and read timeouts and avoid timeout-exceptions. - Update Spring. Lampros Smyrnaios 2023-01-03 18:39:50 +0200
  • 4528d1f9be - Fix the "baseFilesLocation" being null (there was no serious problem, but multiple directories were spawned in the project's directory). - Make sure the given "baseFilesLocation" ends with a file-separator, before using it. - Optimize the process of unzipping-files. Lampros Smyrnaios 2022-12-20 18:38:11 +0200
  • e11afe5ab2 Improve performance of the hash-checking algorithm by using multithreading. Lampros Smyrnaios 2022-12-15 18:34:28 +0200
  • 9cdbbdea67 Refactor the files' storage location. Lampros Smyrnaios 2022-12-15 18:29:51 +0200
  • e51ee9dd27 - Add info about the Stats API usage in "README.md". - Optimize performance in "ParquetFileUtils.createAndLoadParquetDataIntoAttemptTable()" and "ParquetFileUtils.createAndLoadParquetDataIntoPayloadTable()". - Handle the "EmptyResultDataAccessException" inside "StatsController". - Optimize gradle's performance. - Code polishing. Lampros Smyrnaios 2022-12-15 14:04:22 +0200
  • bfdf06bd09 - Apply error-checking on individual CallableTasks and in tasks-batches related to the creation and upload of all the data related to the "attempt" and "payload" table. So, if no data could be uploaded for one or both tables, no "load"-queries will be executed for that/those tables. - Catch the more general "Exception", inside "FileUtils.mergeParquetFiles()", in order to be certain that the "SQLException" can also be caught. - Code polishing. Lampros Smyrnaios 2022-12-09 12:46:06 +0200
  • 0209d24068 - Change the parquet compression from "Snappy" to "Gzip", as there is an unhandleable exception when the app is running inside a Docker Container and uses the "Snappy" compression. - Code polishing. Lampros Smyrnaios 2022-12-08 16:28:41 +0200
  • c8baf5a5fc - Fix not finding the parquet-schema files when the app was run inside a Docker Container. - Update the "namespaces" and the "names" inside the parquet schemas. - Code polishing. Lampros Smyrnaios 2022-12-08 12:16:05 +0200
  • 95c38c4a24 - Fix creating the "assignment" table, always in the testDatabase. - Code polishing. Lampros Smyrnaios 2022-12-07 14:58:38 +0200
  • 3c5f4c6464 Fix bytes to MB conversion. Lampros Smyrnaios 2022-12-07 14:32:18 +0200
  • 8607594f6d - Improve exception handling. - Code polishing. Lampros Smyrnaios 2022-12-07 13:48:00 +0200
  • f183df276b - Move the "uploadFullTexts"-code in its own method. - Code polishing. Lampros Smyrnaios 2022-12-06 12:24:34 +0200
  • b0c57d79a5 - When the Controller cannot retrieve any assignments from Impala (without an error), return an HTTP-"MULTI_STATUS" with an empty "AssignmentsResponse", instead of an "INTERNAL_SERVER_ERROR". - Fix an error-message. Lampros Smyrnaios 2022-12-05 16:44:00 +0200
  • 577ea983e8 - Improve some log-messages. - Set some optimization settings for gradle. - Fix error-handling in "installAndRun.sh". - Update dependencies. Lampros Smyrnaios 2022-11-30 16:28:39 +0200
  • 6226e2298d - Upgrade the results-loading process: instead of making thousands of sql-insert requests to Impala, we now write the results to parquet files, upload them to HDFS and then import the data into the Impala tables with just 2 requests. This results in a huge performance improvement. One side effect of using the parquet-files is that the timestamps are now BIGDECIMAL numbers, instead of "Timestamp" objects, but converting them to such objects is pretty easy, if we ever need to do it. - Code polishing. Lampros Smyrnaios 2022-11-10 17:18:21 +0200
  • 6a03103b79 Update dependencies. Lampros Smyrnaios 2022-11-10 16:50:21 +0200
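
Note on 88a74b2c41 (support for all private addresses defined in RFC 1918): the three private IPv4 blocks in that standard are 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16. The following is only a minimal sketch of such a check, not the Controller's actual code; the class name "PrivateAddressCheck" and the method name "isRfc1918" are hypothetical and chosen for illustration.

```java
// Hypothetical helper, not taken from the repository: checks whether an IPv4
// literal lies inside one of the three private blocks defined by RFC 1918.
final class PrivateAddressCheck {

    static boolean isRfc1918(String ip) {
        if (ip == null) return false;
        // Keep trailing empty strings, so that inputs like "10.0.0.1." are rejected.
        String[] parts = ip.split("\\.", -1);
        if (parts.length != 4) return false;

        int[] octets = new int[4];
        try {
            for (int i = 0; i < 4; i++) {
                octets[i] = Integer.parseInt(parts[i]);
                if (octets[i] < 0 || octets[i] > 255) return false;
            }
        } catch (NumberFormatException e) {
            return false; // Not a dotted-decimal IPv4 literal.
        }

        if (octets[0] == 10) return true;                                        // 10.0.0.0/8
        if (octets[0] == 172 && octets[1] >= 16 && octets[1] <= 31) return true; // 172.16.0.0/12
        return octets[0] == 192 && octets[1] == 168;                             // 192.168.0.0/16
    }

    public static void main(String[] args) {
        // 172.17.0.0/16 is a commonly seen default range for Docker's bridge network.
        System.out.println(isRfc1918("172.17.0.2"));
        System.out.println(isRfc1918("8.8.8.8"));
    }
}
```

Checking the full 172.16.0.0/12 block, rather than a single local address, is what would let requests arriving from a container-network address be recognized as local.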