The Controller app of the PDF Aggregation Service.
Latest commit: 54685bbe9a by Lampros Smyrnaios, 2023-05-29 13:41:37 +03:00
- Avoid sending "cancelShutdown" requests to already shutDown Workers.
- Optimize performance of the code running right before the "postShutdownOrCancelRequestToWorker".
- Show which Workers have already shutdown and as a result a "postShutdownOrCancelRequestToWorker" will not be performed on them.

UrlsController

The Controller application receives requests from the Workers, constructs an assignments-list with data retrieved from a database, and returns the list to the Workers.
Then, it receives the "WorkerReports", requests the full-texts from the Workers in batches, and uploads them to the S3-Object-Store. Finally, it writes the related reports, along with the updated file-locations, into the database.

It can also process Bulk-Import requests, from compatible data sources, in which case it receives the full-text files immediately, without offloading crawling jobs to Workers.

For interacting with the database we use Impala.

BulkImport API:

  • "bulkImportFullTexts" endpoint: http://<IP>:<PORT>/api/bulkImportFullTexts?provenance=<provenance>&bulkImportDir=<bulkImportDir>&shouldDeleteFilesOnFinish={true|false}
    This endpoint loads the appropriate configuration based on the "provenance" parameter and then processes the full-text files inside the given directory as follows: it generates the openAireIDs, uploads the files to the S3 ObjectStore, and generates and stores the "payload" records in the database. If "shouldDeleteFilesOnFinish" is set to "true", it removes the successfully imported full-texts from the directory.
  • "getBulkImportReport" endpoint: http://<IP>:<PORT>/api/getBulkImportReport?id=<bulkImportReportId>
    This endpoint returns the bulkImport report, which corresponds to the given ID, in JSON format.
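
As an illustration of how the two BulkImport URLs above are put together, here is a minimal Python sketch that only builds the request URLs with properly encoded query parameters. The function names, the base URL "http://localhost:8080", the provenance value and the directory are hypothetical placeholders, not values taken from the service. The HTTP method expected by each endpoint is not stated here, so the sketch deliberately stops at URL construction.

```python
from urllib.parse import urlencode


def bulk_import_url(base_url: str, provenance: str, bulk_import_dir: str,
                    delete_files_on_finish: bool) -> str:
    """Build the URL of the "bulkImportFullTexts" endpoint (hypothetical helper)."""
    query = urlencode({
        "provenance": provenance,
        "bulkImportDir": bulk_import_dir,
        # The endpoint expects the literal strings "true" or "false".
        "shouldDeleteFilesOnFinish": "true" if delete_files_on_finish else "false",
    })
    return f"{base_url}/api/bulkImportFullTexts?{query}"


def bulk_import_report_url(base_url: str, report_id: str) -> str:
    """Build the URL of the "getBulkImportReport" endpoint (hypothetical helper)."""
    return f"{base_url}/api/getBulkImportReport?{urlencode({'id': report_id})}"


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    base = "http://localhost:8080"
    print(bulk_import_url(base, "exampleProvenance", "/data/bulk", True))
    print(bulk_import_report_url(base, "exampleReportId"))
```

Encoding the parameters with urlencode matters mostly for "bulkImportDir", since directory paths contain slashes and may contain spaces.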

Statistics API:

  • "getNumberOfAllPayloads" endpoint: http://<IP>:<PORT>/api/stats/getNumberOfAllPayloads
    This endpoint returns the total number of payloads existing in the database, independently of the way they were aggregated. This includes the payloads created by other pieces of software, before the PDF-Aggregation-Service was created.
  • "getNumberOfPayloadsAggregatedByService" endpoint: http://<IP>:<PORT>/api/stats/getNumberOfPayloadsAggregatedByService
    This endpoint returns the number of payloads aggregated by the PDF-Aggregation-Service itself. It excludes the payloads aggregated by other methods, by applying a date-filter which keeps only the records created in 2021 or later.
  • "getNumberOfPayloadsForDatasource" endpoint: http://<IP>:<PORT>/api/stats/getNumberOfPayloadsForDatasource?datasourceId=<givenDatasourceId>
    This endpoint returns the number of payloads which belong to the datasource specified by the given datasourceID.
  • "getNumberOfRecordsInspected" endpoint: http://<IP>:<PORT>/api/stats/getNumberOfRecordsInspected
    This endpoint returns the number of records inspected by the PDF-Aggregation-Service.
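
The four Statistics endpoints listed above share the "/api/stats/" prefix, and only "getNumberOfPayloadsForDatasource" takes a query parameter. A minimal Python sketch of a URL builder that captures this could look as follows. The helper name, the base URL and the datasource ID are hypothetical, and the format of the responses is not assumed.

```python
from typing import Optional
from urllib.parse import urlencode

# Endpoint names as listed in the "Statistics API" section.
STATS_ENDPOINTS = (
    "getNumberOfAllPayloads",
    "getNumberOfPayloadsAggregatedByService",
    "getNumberOfPayloadsForDatasource",
    "getNumberOfRecordsInspected",
)


def stats_url(base_url: str, endpoint: str,
              datasource_id: Optional[str] = None) -> str:
    """Build the URL of a Statistics endpoint (hypothetical helper)."""
    if endpoint not in STATS_ENDPOINTS:
        raise ValueError(f"Unknown statistics endpoint: {endpoint}")
    url = f"{base_url}/api/stats/{endpoint}"
    if endpoint == "getNumberOfPayloadsForDatasource":
        # This is the only endpoint which requires a parameter.
        if datasource_id is None:
            raise ValueError("A datasourceId is required for this endpoint.")
        url += "?" + urlencode({"datasourceId": datasource_id})
    return url


if __name__ == "__main__":
    base = "http://localhost:8080"  # Placeholder host and port.
    for name in STATS_ENDPOINTS:
        print(stats_url(base, name, datasource_id="datasource_1"))
```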

To install and run the application:

  • Run git clone and then cd UrlsController.
  • Set the preferred values inside the application.yml file.
  • Execute the installAndRun.sh script which builds and runs the app.
    If you want to just run the app, then run the script with the argument "1": ./installAndRun.sh 1.
    If you want to build and run the app in a Docker container, then run the script with the argument "0" followed by the argument "1": ./installAndRun.sh 0 1.

Implementation notes:

  • For transferring the full-text files, we use Facebook's Zstandard compression algorithm, which offers a high compression ratio together with fast compression and decompression.
  • The uploaded full-text files follow this naming-scheme: "datasourceID/recordID::fileHash.pdf"
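
To make the naming-scheme above concrete, here is a minimal Python sketch which builds a file name from its parts and splits it back. The function names and the sample values are hypothetical. The parsing side assumes that the datasourceID contains no "/" and that the fileHash contains no "::" or ".", which the naming-scheme itself does not guarantee.

```python
from typing import Tuple


def full_text_object_key(datasource_id: str, record_id: str,
                         file_hash: str, extension: str = "pdf") -> str:
    """Build a name following "datasourceID/recordID::fileHash.pdf"."""
    return f"{datasource_id}/{record_id}::{file_hash}.{extension}"


def parse_full_text_object_key(key: str) -> Tuple[str, str, str, str]:
    """Split a name back into (datasourceID, recordID, fileHash, extension).

    Assumption: the first "/" ends the datasourceID, the last "." starts the
    extension and the last "::" separates the recordID from the fileHash.
    """
    datasource_id, _, rest = key.partition("/")
    record_and_hash, _, extension = rest.rpartition(".")
    record_id, _, file_hash = record_and_hash.rpartition("::")
    return datasource_id, record_id, file_hash, extension


if __name__ == "__main__":
    key = full_text_object_key("datasource_1", "record_1", "abc123")
    print(key)
    print(parse_full_text_object_key(key))
```

Splitting on the last "::" keeps the sketch usable even if a recordID itself contains "::".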