Lampros Smyrnaios
d8773e6ebb
- Block app-execution in case the hdfs parquet directories failed to be created.
- Code polishing.
UrlsController
The Controller application receives requests from the Workers, constructs an assignments-list with data retrieved from a database, and returns the list to the Workers.
Then, it receives the "WorkerReports", requests the full-texts from the Workers in batches, and uploads them to the S3 Object Store. Finally, it writes the related reports, along with the updated file-locations, into the database.
The database used is Impala.
Statistics API:
- "getNumberOfPayloads" endpoint: http://IP:PORT/api/stats/getNumberOfPayloads
- "getNumberOfRecordsInspected" endpoint: http://IP:PORT/api/stats/getNumberOfRecordsInspected
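As a rough illustration of how a client might call the two statistics endpoints above, here is a minimal Java sketch. The class name `StatsApiExample` and its methods are hypothetical and are not part of the project. The sketch also assumes that the endpoints answer plain HTTP GET requests with a text body, which the README does not state. The host and port are placeholders for the `IP` and `PORT` of a running Controller.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper, not part of the UrlsController code base.
final class StatsApiExample {

    // Builds the endpoint URL following the pattern http://IP:PORT/api/stats/<endpointName>.
    static URI statsUri(String host, int port, String endpointName) {
        return URI.create("http://" + host + ":" + port + "/api/stats/" + endpointName);
    }

    // Sends a GET request and returns the raw response body.
    // Assumption: the endpoint accepts GET and replies with status 200 on success.
    static String fetch(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .build();
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode() + " from " + uri);
        }
        return response.body();
    }
}
```

For example, `StatsApiExample.fetch(StatsApiExample.statsUri("localhost", 8080, "getNumberOfPayloads"))` would query a Controller assumed to be listening on `localhost:8080`.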
To install and run the application:
- Run ```git clone``` and then ```cd UrlsController```.
- Provide the **S3 Object Store** related configurations, inside the *src/main/resources/application.properties* file.
- Execute the ```installAndRun.sh``` script which builds and runs the app.
If you want to just run the app, then run the script with the argument "1": ```./installAndRun.sh 1```.
If you want to build and run the app on a **Docker Container**, then run the script with the argument "0" followed by the argument "1": ```./installAndRun.sh 0 1```.
Implementation notes:
- For transferring the full-text files, we use Facebook's Zstandard compression algorithm, which offers a high compression ratio together with fast compression and decompression.
- The names of the uploaded full-text files are of the following form: "datasourceID/recordId::fileHash.pdf"
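The file-name form described above can be sketched in Java as follows. The class `FullTextFileName` and its methods are hypothetical and only illustrate the pattern; they are not the project's actual code. The parser assumes that the datasource ID contains no "/" and that the file hash contains no "::", so it splits on the first "/" and the last "::".

```java
// Hypothetical helper illustrating the "datasourceID/recordId::fileHash.pdf" naming form.
final class FullTextFileName {

    private static final String EXTENSION = ".pdf";

    // Joins the three parts into the documented form.
    static String build(String datasourceId, String recordId, String fileHash) {
        return datasourceId + "/" + recordId + "::" + fileHash + EXTENSION;
    }

    // Splits a name back into { datasourceId, recordId, fileHash }.
    // Assumption: the first "/" ends the datasource ID and the last "::" starts the hash.
    static String[] parse(String fileName) {
        int slash = fileName.indexOf('/');
        int separator = fileName.lastIndexOf("::");
        if (slash <= 0 || separator <= slash || !fileName.endsWith(EXTENSION)) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        String datasourceId = fileName.substring(0, slash);
        String recordId = fileName.substring(slash + 1, separator);
        String fileHash = fileName.substring(separator + 2, fileName.length() - EXTENSION.length());
        return new String[] { datasourceId, recordId, fileHash };
    }
}
```

Splitting on the last "::" keeps the parser usable when a record ID itself contains "::".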