The Controller app of the PDF Aggregation Service.
Lampros Smyrnaios 6226e2298d - Upgrade the results-loading process: instead of making thousands of SQL-insert requests to Impala, we now write the results to parquet files, upload them to HDFS and then import the data into the Impala tables with just 2 requests. This results in a huge performance improvement.
One side effect of using the parquet files is that the timestamps are now BIGDECIMAL numbers instead of "Timestamp" objects, but converting them to such objects is pretty easy, if we ever need to do it.
- Code polishing.
2022-11-10 17:18:21 +02:00
gradle/wrapper Update dependencies. 2022-11-10 16:50:21 +02:00
src/main - Upgrade the results-loading process: instead of making thousands of SQL-insert requests to Impala, we now write the results to parquet files, upload them to HDFS and then import the data into the Impala tables with just 2 requests. This results in a huge performance improvement. 2022-11-10 17:18:21 +02:00
.gitignore springified project 2022-01-30 22:15:13 +02:00
Dockerfile - Allow the user to build, push and run the App in Docker, straight through the "installAndRun.sh" script. 2022-02-04 15:49:56 +02:00
README.md Update the README.md 2022-02-07 21:11:03 +02:00
build.gradle - Upgrade the results-loading process: instead of making thousands of SQL-insert requests to Impala, we now write the results to parquet files, upload them to HDFS and then import the data into the Impala tables with just 2 requests. This results in a huge performance improvement. 2022-11-10 17:18:21 +02:00
installAndRun.sh Update dependencies. 2022-11-10 16:50:21 +02:00
settings.gradle - Add the "isControllerAlive"-endpoint. 2021-09-23 15:08:52 +03:00

README.md

UrlsController

The Controller application receives requests from the Workers, constructs an assignments-list with data retrieved from a database and returns the list to the Workers.
Then, it receives the "WorkerReports", requests the full-texts from the Workers in batches, and uploads them to the S3 Object Store. Finally, it writes the related reports, along with the updated file-locations, into the database.
The database used is Impala.

To install and run the application:

  • Run git clone and then cd UrlsController.
  • Provide the S3 Object Store configuration inside the src/main/resources/application.properties file.
  • Execute the installAndRun.sh script, which builds and runs the app.
    If you want to just run the app, without building it, then run the script with the argument "1": ./installAndRun.sh 1.
    If you want to build and run the app in a Docker container, then run the script with the argument "0" followed by the argument "1": ./installAndRun.sh 0 1.
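
The three ways of invoking the script described above can be summarized in a small shell helper. This is only a sketch: the function name controller_cmd and the mode names (build-and-run, run-only, docker) are hypothetical and not part of the repository; only the installAndRun.sh arguments come from the steps above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): prints the installAndRun.sh
# invocation for a given mode, following the arguments described in the steps above.
controller_cmd() {
  case "$1" in
    build-and-run) echo "./installAndRun.sh" ;;      # no arguments: build, then run
    run-only)      echo "./installAndRun.sh 1" ;;    # argument "1": just run the app
    docker)        echo "./installAndRun.sh 0 1" ;;  # arguments "0 1": build and run in Docker
    *) echo "unknown mode: $1" >&2; return 1 ;;      # reject anything else
  esac
}
```

After sourcing the function from the repository root, `controller_cmd docker` prints the command line to run, so the mode names act as a reminder of what each numeric argument combination does.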