Lampros Smyrnaios
Fix property naming mismatch.
2023-05-24 14:49:29 +03:00
Lampros Smyrnaios
Place the "workerReports" and the "bulkImportReports" dirs inside the "reports" parent-directory.
2023-05-24 14:10:57 +03:00
Lampros Smyrnaios
- Process the WorkerReports in background Jobs and post the reportResults to the Workers.
- Save the workerReports to json files, until they are processed successfully.
- Show some custom metrics in prometheus.
2023-05-24 13:52:28 +03:00
Lampros Smyrnaios
Add the "shutdownService" and "cancelShutdownService" endpoints. The Controller sends the related requests to the Workers and shuts down gracefully, after all Workers have shut down.
2023-05-24 13:42:29 +03:00
Lampros Smyrnaios
- Rename the mounted "mnt/bulkImport/" directory to "/mnt/bulk_import/".
- Increase the "awaitTermination" timeout for the ExecutorService to 2 minutes.
2023-05-23 21:09:34 +03:00
Lampros Smyrnaios
- Add the "getWorkersInfo" endpoint.
- Improve startup speed, by using a faster remote server to get the host machine's public IP. This also reduces the risk of not being able to get the public IP at all.
- Fix the detection of a different IP for a known worker.
- Improve documentation.
2023-05-23 14:57:15 +03:00
Lampros Smyrnaios
- Increase the "read-timeout" when searching for the host machine's public IP.
- Update dependencies.
- Code polishing.
2023-05-22 21:33:02 +03:00
Lampros Smyrnaios
- Optimize the json-conversion of the "BulkImportReport".
- Code polishing.
2023-05-18 17:30:40 +03:00
Lampros Smyrnaios
- Make sure we set the "hasShutdown" to "false", for each known worker which was restarted.
- Fix markdown of urls in prometheus' readme.
2023-05-16 12:24:14 +03:00
Lampros Smyrnaios
- Move the Prometheus and Grafana configuration into a dedicated directory and docker-compose file.
- Add documentation about setting up Prometheus and Grafana.
2023-05-15 18:52:31 +03:00
Lampros Smyrnaios
Fix missing changes.
2023-05-15 13:13:24 +03:00
Lampros Smyrnaios
- Use separate HDFS subdirectories for each worker in order to avoid seeing exceptions about "empty hdfs directory" when "loading" data to the database, because one worker has loaded data generated by multiple workers (since we use only 1 load operation for multiple parquet files).
- Store each worker's info in a hash-table, in order to efficiently know if we need to create new hdfs subdirectories. Also, this will help to issue "shutdown" requests to the workers in the future, as well as to know which worker has shut down.
2023-05-15 13:12:20 +03:00
Lampros Smyrnaios
- In test-environment mode, check for already existing file-hashes only in the "payload_aggregated" table, instead of the whole "payload" view. This way the investigation for false-positive docUrls is easier, as we avoid checking against the millions of "legacy" payloads.
- Improve performance in production, by not creating the string objects for "trace"-logs.
2023-05-15 12:44:16 +03:00
Lampros Smyrnaios
- Improve performance of uploading parquet-files to HDFS.
- Add some logs.
- Code polishing.
2023-05-11 19:40:48 +03:00
Lampros Smyrnaios
- Add the time-zone in the logs.
- Change some log-levels to "trace", although most of them are still disabled.
2023-05-11 03:10:53 +03:00
Lampros Smyrnaios
New feature: BulkImport full-text files from compatible datasources.
2023-05-11 03:07:55 +03:00
Lampros Smyrnaios
- Add the "getNumberOfAllDistinctFullTexts" stats-endpoint.
- Add TODOs for more stats endpoints.
- Code polishing.
2023-05-04 15:48:49 +03:00
Lampros Smyrnaios
Fix a bug, which caused the full-text files to never close.
2023-05-04 13:03:28 +03:00
Lampros Smyrnaios
Add the "" script.
2023-05-03 20:43:44 +03:00
Lampros Smyrnaios
Add error-checks for retrieving the status-code from HttpUrlConnections.
2023-05-03 13:30:29 +03:00
Lampros Smyrnaios
- Simplify the creation of local directories.
- Improve exception messages.
2023-04-28 14:58:33 +03:00
Lampros Smyrnaios
- Update the "testDatabaseName" property.
- Code polishing.
2023-04-26 19:33:28 +03:00
Lampros Smyrnaios
Add the "getNumberOfPayloadsForDatasource" endpoint.
2023-04-24 09:54:35 +03:00
Lampros Smyrnaios
- Add profiles to docker-services to selectively run the additional "Prometheus" and "Grafana" services or not.
- Update Gradle.
2023-04-22 16:50:33 +03:00
Lampros Smyrnaios
Update dependencies.
2023-04-20 18:57:16 +03:00
Lampros Smyrnaios
Automatically show the Controller's logs after the docker-container starts running and the status is shown.
2023-04-11 11:59:10 +03:00
Lampros Smyrnaios
- Increase the waiting-time before checking the docker containers' status, in order to catch configuration-crashes.
- Code polishing.
2023-04-10 22:28:53 +03:00
Lampros Smyrnaios
Upgrade the payload-table to a payload-view, which consists of three separate payload tables: "payload_legacy", "payload_aggregated" and "payload_bulk_import".
2023-04-10 15:55:50 +03:00
Lampros Smyrnaios
Prioritize most recent publications.
2023-04-10 15:00:23 +03:00
Lampros Smyrnaios
- Avoid requesting the remaining full-text batches in case the Worker returns a 5XX error in one of the batches.
- Add nullability-checks for "datasourceId" and "hash" before constructing the new filename and uploading the full-text to S3.
- Improve a log-message.
2023-03-29 17:12:37 +03:00
Lampros Smyrnaios
- Automatically get the status of the docker containers 30 seconds after their initialization.
- Add error-handling in "".
- Update dependencies.
2023-03-27 19:43:15 +03:00
Lampros Smyrnaios
Update the "testDatabaseName".
2023-03-21 23:10:21 +02:00
Lampros Smyrnaios
- Set the default value of the "isTestEnvironment" property to "true", in order to avoid undesired outcomes in the production db.
- Code polishing.
2023-03-21 17:04:28 +02:00
Lampros Smyrnaios
- Add Prometheus and Grafana, which help measure various metrics for the Controller's health and performance.
- Fix Docker config still using the old (now removed) "" file.
- Simplify the process of building and running the docker image; now we use docker compose to run the Controller, along with Prometheus and Grafana. Also, the user is no longer required to log in and push the image (this may change in the future).
2023-03-21 16:46:33 +02:00
Lampros Smyrnaios
- Add support for excluding specific datasources from being crawled. These datasources may be aggregated through bulk-imports, by other pieces of software. Such a datasource is "".
- Fix an issue, where the "datasource-type" was retrieved instead of the "datasource-name".
- Polish the "findAssignmentsQuery".
2023-03-21 07:19:35 +02:00
Lampros Smyrnaios
Transform the "" file to "application.yml" and optimize the property-trees.
2023-03-20 15:23:00 +02:00
Lampros Smyrnaios
Improve logs for full-texts' metrics.
2023-03-14 20:57:01 +02:00
Lampros Smyrnaios
Use a StatsService interface.
2023-03-13 12:39:39 +02:00
Lampros Smyrnaios
- Code polishing.
- Update Gradle.
2023-03-07 16:55:41 +02:00
Lampros Smyrnaios
Revert the version of "libthrift"-dependency to "0.17.0", as the newer version is not compatible with Java 8.
2023-03-03 12:57:30 +02:00
Lampros Smyrnaios
Improve performance when downloading and decompressing the full-texts archive.
2023-03-02 17:44:53 +02:00
Lampros Smyrnaios
Update dependencies.
2023-03-02 17:40:16 +02:00
Lampros Smyrnaios
- Add missing refactoring-change.
- Code polishing.
- Update Spring.
2023-02-24 23:49:04 +02:00
Lampros Smyrnaios
Code polishing.
2023-02-24 13:53:09 +02:00
Lampros Smyrnaios
- Improve an error-message.
- Update Gradle.
2023-02-21 15:42:07 +02:00
Lampros Smyrnaios
Refactor the UrlsController: a) offload the business-logic to the dedicated "UrlsService" and b) move the "checkParquetFilesSuccess()"-method to "ParquetFileUtils".
2023-02-21 15:36:35 +02:00
Lampros Smyrnaios
- Exclude empty and null urls in the assignments.
- Update the "getFullTextsImproved"-call to "getFullTexts", now that the "improved" version is stable.
- Update Gradle.
- Code polishing.
2023-02-16 14:24:47 +02:00
Lampros Smyrnaios
Refactor the "StatsController"-code, by offloading it to a dedicated "StatsService".
2023-02-09 19:25:48 +02:00
Lampros Smyrnaios
- Refactor the payloads-statistics-code and provide two endpoints: "getNumberOfPayloadsAggregatedByService", which returns the number of payloads aggregated only by the PDF-Aggregation-Service, and the "getNumberOfAllPayloads", which returns the number of all payloads existing in the database, even the ones aggregated in the past, by other pieces of software.
- Update
- Make sure the docker image is clean-built, by avoiding the use of cache.
2023-02-02 17:58:47 +02:00
Lampros Smyrnaios
Add an extra precaution-check to allow the emptying or deletion of an S3-Object-Store bucket, only when the app runs in "TestEnvironment".
2023-02-01 16:42:22 +02:00