The Controller app of the PDF Aggregation Service.


UrlsController

The Controller application receives requests from the Workers, constructs an assignments-list with data retrieved from a database, and returns the list to the Workers.
Then, it receives the "WorkerReports", requests the full-texts from the Workers in batches, and uploads them to the S3 Object Store. Finally, it writes the related reports, along with the updated file-locations, into the database.
The database used is Impala.
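
As an illustration of the "in batches" step, the sketch below splits a list of full-text file names into fixed-size batches. It is not taken from the UrlsController code base: the class name `FullTextBatcher`, the method `toBatches` and the batch size are hypothetical, chosen only to show the idea.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the UrlsController code base.
final class FullTextBatcher {

    private FullTextBatcher() {}

    /** Splits the file names into consecutive batches of at most batchSize items. */
    static List<List<String>> toBatches(List<String> fileNames, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < fileNames.size(); start += batchSize) {
            int end = Math.min(start + batchSize, fileNames.size());
            // Copy the sub-list so that each batch is independent of the input list.
            batches.add(new ArrayList<>(fileNames.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> files = List.of("a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf");
        // In the real service, each batch would be requested from a Worker
        // and then uploaded to the object store.
        for (List<String> batch : toBatches(files, 2)) {
            System.out.println(batch);
        }
    }
}
```

Working batch by batch keeps the amount of data held by the Controller at any one time bounded by the batch size, rather than by the size of the whole WorkerReport.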

Statistics API:

  • "getNumberOfAllPayloads" endpoint: http://<host>:<port>/api/stats/getNumberOfAllPayloads
    This endpoint returns the total number of payloads in the database, regardless of how they were aggregated. This includes the payloads created by other software before the PDF-Aggregation-Service existed.
  • "getNumberOfPayloadsAggregatedByService" endpoint: http://<host>:<port>/api/stats/getNumberOfPayloadsAggregatedByService
    This endpoint returns the number of payloads aggregated by the PDF-Aggregation-Service itself. It excludes the payloads aggregated by other methods, by applying a date-filter that keeps only the records created in 2021 or later.
  • "getNumberOfRecordsInspected" endpoint: http://<host>:<port>/api/stats/getNumberOfRecordsInspected
    This endpoint returns the number of records inspected by the PDF-Aggregation-Service.
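
The endpoints above could be queried from Java roughly as follows. This is a minimal sketch using the JDK's built-in `java.net.http` client (Java 11+). The class name `StatsClient`, the use of the GET method, the `localhost:8080` default and the treatment of the response body as plain text are assumptions, not details confirmed by this README.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client, not part of the UrlsController code base.
final class StatsClient {

    private StatsClient() {}

    /** Builds the URL of a statistics endpoint, e.g. "getNumberOfAllPayloads". */
    static URI statsUri(String host, int port, String endpoint) {
        return URI.create("http://" + host + ":" + port + "/api/stats/" + endpoint);
    }

    /** Sends a GET request (assumed method) and returns the raw response body. */
    static String fetch(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode() + " from " + uri);
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        String[] endpoints = {
            "getNumberOfAllPayloads",
            "getNumberOfPayloadsAggregatedByService",
            "getNumberOfRecordsInspected"
        };
        if (args.length < 2) {
            // Without a host and port, only show which URLs would be queried.
            for (String endpoint : endpoints) {
                System.out.println(statsUri("localhost", 8080, endpoint)); // assumed defaults
            }
            return;
        }
        int port = Integer.parseInt(args[1]);
        for (String endpoint : endpoints) {
            System.out.println(endpoint + " -> " + fetch(statsUri(args[0], port, endpoint)));
        }
    }
}
```

Running the class with a host and a port as arguments queries all three endpoints in turn; running it without arguments only prints the URLs.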

To install and run the application:

  • Clone the repository with git clone and then run cd UrlsController.
  • Set the preferred values inside the application.properties file.
  • Execute the installAndRun.sh script, which builds and runs the app.
    If you only want to run the app, without building it, run the script with the argument "1": ./installAndRun.sh 1.
    If you want to build and run the app in a Docker container, run the script with the argument "0" followed by the argument "1": ./installAndRun.sh 0 1.

Implementation notes:

  • For transferring the full-text files, we use the Zstandard compression algorithm (developed at Facebook), which offers a high compression ratio together with fast compression and decompression.
  • The uploaded full-text files follow this naming-scheme: "datasourceID/recordId::fileHash.pdf"
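
The naming-scheme can be illustrated with a small helper that builds and parses such a file name. This is a sketch under stated assumptions: the class `FullTextObjectKey` and its methods are hypothetical, and the parser assumes that the datasource ID contains no "/" and that the file hash contains no "::" (the record ID itself may contain "::").

```java
// Hypothetical helper, not part of the UrlsController code base.
final class FullTextObjectKey {

    private static final String EXTENSION = ".pdf";

    private FullTextObjectKey() {}

    /** Builds a name of the form "datasourceID/recordId::fileHash.pdf". */
    static String build(String datasourceId, String recordId, String fileHash) {
        return datasourceId + "/" + recordId + "::" + fileHash + EXTENSION;
    }

    /** Splits a name back into { datasourceId, recordId, fileHash }. */
    static String[] parse(String key) {
        int slash = key.indexOf('/');          // end of the datasource ID
        int separator = key.lastIndexOf("::"); // start of the file hash
        if (slash < 0 || separator < slash || !key.endsWith(EXTENSION)) {
            throw new IllegalArgumentException("Not a valid full-text name: " + key);
        }
        String datasourceId = key.substring(0, slash);
        String recordId = key.substring(slash + 1, separator);
        String fileHash = key.substring(separator + 2, key.length() - EXTENSION.length());
        return new String[] { datasourceId, recordId, fileHash };
    }

    public static void main(String[] args) {
        // The IDs and the hash below are made-up sample values.
        String key = build("datasource1", "prefix::record1", "0123abcd");
        System.out.println(key);
        System.out.println(String.join(" | ", parse(key)));
    }
}
```

Searching for the last "::" rather than the first lets the parser handle record IDs that contain the separator themselves.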