The Controller app of the PDF Aggregation Service.
Lampros Smyrnaios dc8f0f2bd1 - Heavily reduce the maximum amount of space needed, by deleting the files of each full-texts batch, right after they are uploaded to the S3 Object Store.
- Add a check for when the retrieved full-texts-batch is missing some requested files and show a warn-log.
- Update dependencies.
2023-01-23 20:23:21 +02:00
gradle/wrapper - Improve some log-messages. 2022-11-30 16:28:39 +02:00
src/main - Heavily reduce the maximum amount of space needed, by deleting the files of each full-texts batch, right after they are uploaded to the S3 Object Store. 2023-01-23 20:23:21 +02:00
.gitignore springified project 2022-01-30 22:15:13 +02:00
Dockerfile - Allow the user to build, push and run the App in Docker, straight though the "installAndRun.sh" script. 2022-02-04 15:49:56 +02:00
README.md - Use Facebook's [**Zstandard**](https://facebook.github.io/zstd/) compression algorithm, which brings very big benefits on compression rate and speed. 2023-01-10 13:34:54 +02:00
build.gradle - Heavily reduce the maximum amount of space needed, by deleting the files of each full-texts batch, right after they are uploaded to the S3 Object Store. 2023-01-23 20:23:21 +02:00
gradle.properties - Improve some log-messages. 2022-11-30 16:28:39 +02:00
installAndRun.sh - Change the parquet compression from "Snappy" to "Gzip", as there is an unhandleable exception when the app is running inside a Docker Container and uses the "Snappy" compression. 2022-12-08 16:28:41 +02:00
settings.gradle - Add the "isControllerAlive"-endpoint. 2021-09-23 15:08:52 +03:00

README.md

UrlsController

The Controller application receives requests from the Workers, constructs an assignments-list with data retrieved from a database, and returns the list to the Workers.
Then, it receives the "WorkerReports", requests the full-texts from the Workers in batches, and uploads them to the S3 Object Store. Finally, it writes the related reports, along with the updated file-locations, into the database.
The database used is Impala.
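
The batch-wise retrieval described above can be illustrated with a minimal sketch. It only shows how a list of requested file names could be split into fixed-size batches; the class name `FullTextBatches`, the method `partition` and the batch size are hypothetical and are not taken from the Controller's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only: splits the requested
// full-text file names into consecutive batches of at most batchSize items.
final class FullTextBatches {

    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            // Copy the sub-list so each batch is independent of the input list.
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Placeholder file names; the real names come from the WorkerReports.
        List<String> files = List.of("a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf");
        for (List<String> batch : partition(files, 2)) {
            // In the real flow, each batch would be requested from the Worker
            // and uploaded to the S3 Object Store before the next one starts.
            System.out.println(batch);
        }
    }
}
```

Processing one batch at a time keeps the amount of data held locally bounded by the batch size rather than by the total number of full-texts.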

Statistics API:

  • "getNumberOfPayloads" endpoint: http://IP:PORT/api/stats/getNumberOfPayloads
  • "getNumberOfRecordsInspected" endpoint: http://IP:PORT/api/stats/getNumberOfRecordsInspected
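
As a rough sketch, the statistics endpoints could be queried with Java's built-in HTTP client (Java 11+). The class name `StatsClient`, the default host and port, and the use of a plain GET request are assumptions; the response format is not documented here, so the body is printed as-is.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client, for illustration only.
final class StatsClient {

    // Builds the endpoint URL in the form http://IP:PORT/api/stats/<endpoint>.
    static URI statsUri(String host, int port, String endpoint) {
        return URI.create("http://" + host + ":" + port + "/api/stats/" + endpoint);
    }

    // Sends a GET request (assumed method) and returns the raw response body.
    static String fetch(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder defaults; pass the Controller's real IP and PORT as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8080;
        System.out.println(fetch(statsUri(host, port, "getNumberOfPayloads")));
    }
}
```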

To install and run the application:
- Run ```git clone``` and then ```cd UrlsController```.
- Provide the **S3 Object Store** related configurations, inside the *src/main/resources/application.properties* file.
- Execute the ```installAndRun.sh``` script, which builds and runs the app.
If you want to just run the app, then run the script with the argument "1": ```./installAndRun.sh 1```.
If you want to build and run the app on a **Docker Container**, then run the script with the argument "0" followed by the argument "1": ```./installAndRun.sh 0 1```.

Implementation notes:

  • For transferring the full-text files, we use Facebook's Zstandard compression algorithm, which offers significant gains in compression ratio and speed.
  • The names of the uploaded full-text files are of the following form: "datasourceID/recordId::fileHash.pdf"
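
The naming scheme above can be sketched as a small helper that builds and parses such names. The class `FullTextFileNames` and its methods are hypothetical. The parser assumes that the datasource ID contains no "/" and that the file hash contains no "::", so it splits on the first "/" and the last "::".

```java
// Hypothetical helper, for illustration only.
final class FullTextFileNames {

    // Builds a name of the form "datasourceID/recordId::fileHash.pdf".
    static String build(String datasourceId, String recordId, String fileHash) {
        return datasourceId + "/" + recordId + "::" + fileHash + ".pdf";
    }

    // Splits a name back into { datasourceId, recordId, fileHash }.
    static String[] parse(String name) {
        int slash = name.indexOf('/');       // first "/" ends the datasource ID
        int sep = name.lastIndexOf("::");    // last "::" starts the file hash
        if (slash < 0 || sep < slash || !name.endsWith(".pdf")) {
            throw new IllegalArgumentException("Unexpected file name: " + name);
        }
        return new String[] {
            name.substring(0, slash),
            name.substring(slash + 1, sep),
            name.substring(sep + 2, name.length() - ".pdf".length())
        };
    }

    public static void main(String[] args) {
        // Placeholder values, not real identifiers.
        String name = build("datasource1", "record1", "abc123");
        System.out.println(name);
        System.out.println(String.join(" | ", parse(name)));
    }
}
```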