UrlsController

The Controller application receives requests from the Workers, builds an assignments list from data retrieved from the database, and returns that list to the Workers.
Then, it receives the "WorkerReports", requests the full-texts from the Workers in batches, and uploads them to the S3 Object Store. Finally, it writes the related reports, along with the updated file locations, into the database.
The database used is Impala.
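
The batched retrieval of full-texts described above can be sketched as follows. This only illustrates splitting a list of file names into fixed-size batches. The function name, the batch size and the sample file names are hypothetical and are not taken from the application.

```python
from typing import Iterator, List, Sequence


def split_into_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_size` items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical file names; each batch would be requested from a Worker in one call.
file_names = ["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"]
batches = list(split_into_batches(file_names, batch_size=2))
```

The last batch may be smaller than the others when the number of files is not a multiple of the batch size.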

Statistics API:

  • "getNumberOfPayloads" endpoint: http://IP:PORT/api/stats/getNumberOfPayloads
  • "getNumberOfRecordsInspected" endpoint: http://IP:PORT/api/stats/getNumberOfRecordsInspected
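
A minimal sketch of querying these endpoints from Python, using only the standard library. The helper names are illustrative, and treating the response body as UTF-8 text is an assumption, since the response format is not specified here.

```python
from urllib.request import urlopen

STATS_ENDPOINTS = ("getNumberOfPayloads", "getNumberOfRecordsInspected")


def stats_url(host: str, port: int, endpoint: str) -> str:
    """Build the URL of a Statistics API endpoint (hypothetical helper)."""
    if endpoint not in STATS_ENDPOINTS:
        raise ValueError(f"Unknown stats endpoint: {endpoint}")
    return f"http://{host}:{port}/api/stats/{endpoint}"


def fetch_stat(host: str, port: int, endpoint: str, timeout: float = 10.0) -> str:
    """Send a GET request and return the raw response body as text.

    Assumption: the body can be decoded as UTF-8.
    """
    with urlopen(stats_url(host, port, endpoint), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `fetch_stat("localhost", 8080, "getNumberOfPayloads")` would query a Controller running locally; the host and port are placeholders for the IP and PORT of your deployment.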

To install and run the application:

  • Run git clone and then cd UrlsController.
  • Set the preferred values inside the application.properties file.
  • Execute the installAndRun.sh script, which builds and runs the app.
    If you only want to run the app, without building it, run the script with the argument "1": ./installAndRun.sh 1.
    If you want to build and run the app in a Docker container, run the script with the argument "0" followed by the argument "1": ./installAndRun.sh 0 1.

Implementation notes:

  • For transferring the full-text files, we use Facebook's Zstandard compression algorithm, which offers both a high compression ratio and fast compression and decompression.
  • The names of the uploaded full-text files are of the following form: "datasourceID/recordId::fileHash.pdf"
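
The naming scheme above can be illustrated with a short sketch. The function names and sample values are hypothetical. The parser assumes that the datasource ID contains no "/" and that the file hash contains no "::", so the record ID may itself contain "::".

```python
from typing import Tuple


def build_fulltext_name(datasource_id: str, record_id: str, file_hash: str) -> str:
    """Compose a name of the form 'datasourceID/recordId::fileHash.pdf'."""
    return f"{datasource_id}/{record_id}::{file_hash}.pdf"


def parse_fulltext_name(name: str) -> Tuple[str, str, str]:
    """Split a full-text file name back into (datasource_id, record_id, file_hash)."""
    if not name.endswith(".pdf"):
        raise ValueError(f"Not a .pdf full-text name: {name}")
    stem = name[: -len(".pdf")]
    # The first "/" separates the datasource ID from the rest.
    datasource_id, slash, rest = stem.partition("/")
    # The last "::" separates the record ID from the file hash.
    record_id, separator, file_hash = rest.rpartition("::")
    if not (slash and separator and datasource_id and record_id and file_hash):
        raise ValueError(f"Unexpected full-text name format: {name}")
    return datasource_id, record_id, file_hash


# Hypothetical sample values, for illustration only.
example_name = build_fulltext_name("datasource1", "prefix::record42", "abc123")
```

Splitting on the last "::" rather than the first keeps the parser correct when the record ID includes that separator.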