Lampros Smyrnaios
003c0bf179
- Add support for excluding specific datasources from being crawled. Such datasources may be aggregated through bulk-imports by other pieces of software; one example is "arXiv.org".
- Fix an issue where the "datasource-type" was retrieved instead of the "datasource-name".
- Polish the "findAssignmentsQuery".
2023-03-21 07:19:35 +02:00
Lampros Smyrnaios
f835a752bf
Transform the "application.properties" file to "application.yml" and optimize the property-trees.
2023-03-20 15:23:00 +02:00
Lampros Smyrnaios
17a6c120dd
Improve logs for full-texts' metrics.
2023-03-14 20:57:01 +02:00
Lampros Smyrnaios
ff13af7abb
Use a StatsService interface.
2023-03-13 12:39:39 +02:00
Lampros Smyrnaios
38643c76a3
- Code polishing.
- Update Gradle.
2023-03-07 16:55:41 +02:00
Lampros Smyrnaios
4af298a52a
Revert the version of "libthrift"-dependency to "0.17.0", as the newer version is not compatible with Java 8.
2023-03-03 12:57:30 +02:00
Lampros Smyrnaios
7b217764e0
Improve performance when downloading and decompressing the full-texts archive.
2023-03-02 17:44:53 +02:00
Lampros Smyrnaios
62a4279e3b
Update dependencies.
2023-03-02 17:40:16 +02:00
Lampros Smyrnaios
c4670073ae
- Add missing refactoring-change.
- Code polishing.
- Update Spring.
2023-02-24 23:49:04 +02:00
Lampros Smyrnaios
c8485d472e
Code polishing.
2023-02-24 13:53:09 +02:00
Lampros Smyrnaios
b7f6056032
- Improve an error-message.
- Update Gradle.
2023-02-21 15:42:07 +02:00
Lampros Smyrnaios
8893662a81
Refactor the UrlsController: a) offload the business-logic to the dedicated "UrlsService" and b) move the "checkParquetFilesSuccess()"-method to "ParquetFileUtils".
2023-02-21 15:36:35 +02:00
Lampros Smyrnaios
a1c16ffc19
- Exclude empty and null urls in the assignments.
- Update the "getFullTextsImproved"-call to "getFullTexts", now that the "improved" version is stable.
- Update Gradle.
- Code polishing.
2023-02-16 14:24:47 +02:00
Lampros Smyrnaios
2253f05bf5
Refactor the "StatsController"-code, by offloading it to a dedicated "StatsService".
2023-02-09 19:25:48 +02:00
Lampros Smyrnaios
49fefefafd
- Refactor the payloads-statistics-code and provide two endpoints: "getNumberOfPayloadsAggregatedByService", which returns the number of payloads aggregated only by the PDF-Aggregation-Service, and "getNumberOfAllPayloads", which returns the number of all payloads existing in the database, including those aggregated in the past by other pieces of software.
- Update README.md.
- Make sure the docker image is clean-built, by avoiding the use of cache.
2023-02-02 17:58:47 +02:00
Lampros Smyrnaios
c9f33d3afa
Add an extra precaution-check to allow the emptying or deletion of an S3-Object-Store bucket, only when the app runs in "TestEnvironment".
2023-02-01 16:42:22 +02:00
Lampros Smyrnaios
f89730f196
Improve documentation.
2023-01-27 14:31:07 +02:00
Lampros Smyrnaios
dc8f0f2bd1
- Heavily reduce the maximum amount of space needed, by deleting the files of each full-texts batch, right after they are uploaded to the S3 Object Store.
- Add a check for when the retrieved full-texts-batch is missing some requested files and show a warn-log.
- Update dependencies.
2023-01-23 20:23:21 +02:00
Lampros Smyrnaios
d8773e6ebb
- Make sure the test-environment uses a dedicated hdfs-parquet-directory.
- Block app-execution in case the hdfs parquet directories failed to be created.
- Code polishing.
2023-01-18 13:38:05 +02:00
Lampros Smyrnaios
b0b00c8aed
Update the minio dependency.
2023-01-11 15:46:34 +02:00
Lampros Smyrnaios
c08ba1cc89
Revert the update of the "minio" dependency, as it introduces a bug, related to the "okhttp3.HttpUrl"-class.
2023-01-10 15:58:23 +02:00
Lampros Smyrnaios
8876089022
- Use Facebook's [**Zstandard**](https://facebook.github.io/zstd/) compression algorithm, which greatly improves both compression ratio and speed.
- Update the minIO dependency.
- Code polishing.
2023-01-10 13:34:54 +02:00
Lampros Smyrnaios
d1a4c84289
- Make sure the fullPath of the baseFilesLocation is available when the user specifies a non-root directory.
- Improve error-checking and exception-handling in some "S3ObjectStore"-methods.
- Make sure the "responseCode" is "200-OK", before trying to get the InputStream in "UriBuilder.getPublicIP()".
2023-01-09 15:44:53 +02:00
Lampros Smyrnaios
9904ea5743
- Improve the stability of "UriBuilder.getPublicIP()", by using a "HttpURLConnection" to increase the connection and read timeouts and avoid timeout-exceptions.
- Update Spring.
2023-01-03 18:39:50 +02:00
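Note on 9904ea5743: a minimal sketch of setting explicit timeouts on an "HttpURLConnection", as the message describes. The class name, method name and timeout values are hypothetical; the commit does not state the values actually used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class TimeoutSketch {
    // Assumed values, for illustration only.
    static final int CONNECT_TIMEOUT_MS = 60_000;
    static final int READ_TIMEOUT_MS = 60_000;

    /** Prepares (without connecting) an HTTP GET request with explicit timeouts. */
    static HttpURLConnection open(String address) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
            conn.setConnectTimeout(CONNECT_TIMEOUT_MS); // max wait to establish the connection
            conn.setReadTimeout(READ_TIMEOUT_MS);       // max wait for data once connected
            conn.setRequestMethod("GET");
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would then check that "conn.getResponseCode()" equals "HttpURLConnection.HTTP_OK" before reading "conn.getInputStream()", which is the precaution added in d1a4c84289.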
Lampros Smyrnaios
4528d1f9be
- Fix the "baseFilesLocation" being null (there was no serious problem, but multiple directories were spawned in the project's directory).
- Make sure the given "baseFilesLocation" ends with a file-separator, before using it.
- Optimize the process of unzipping-files.
2022-12-20 18:38:11 +02:00
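Note on 4528d1f9be: the trailing file-separator check could look roughly like the sketch below. The class and method names are hypothetical and not taken from the project.

```java
import java.io.File;

final class BaseLocationSketch {
    /** Returns the location unchanged if it already ends with the platform separator, otherwise appends it. */
    static String withTrailingSeparator(String location) {
        return location.endsWith(File.separator) ? location : location + File.separator;
    }
}
```

Normalizing the value once, before it is used, means later path concatenations do not need to care whether the configured value had a trailing separator.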
Lampros Smyrnaios
e11afe5ab2
Improve performance of the hash-checking algorithm by using multithreading.
2022-12-15 18:34:28 +02:00
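Note on e11afe5ab2: the message does not describe the hash-checking algorithm itself, so the following is only a generic sketch of computing hashes in parallel with a fixed thread pool. MD5, the class name and the method names are assumptions made for illustration.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelHashSketch {
    /** Returns the lowercase hex MD5 digest of the given bytes. */
    static String md5Hex(byte[] data) {
        try {
            StringBuilder sb = new StringBuilder();
            for (byte b : MessageDigest.getInstance("MD5").digest(data)) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hashes every item on a fixed-size pool; results keep the input order. */
    static List<String> hashAll(List<byte[]> contents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (byte[] content : contents) {
                tasks.add(() -> md5Hex(content));
            }
            List<String> hashes = new ArrayList<>();
            // invokeAll returns the futures in the same order as the submitted tasks.
            for (Future<String> future : pool.invokeAll(tasks)) {
                hashes.add(future.get());
            }
            return hashes;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```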
Lampros Smyrnaios
9cdbbdea67
Refactor the files' storage location.
2022-12-15 18:29:51 +02:00
Lampros Smyrnaios
e51ee9dd27
- Add info about the Stats API usage in "README.md".
- Optimize performance in "ParquetFileUtils.createAndLoadParquetDataIntoAttemptTable()" and "ParquetFileUtils.createAndLoadParquetDataIntoPayloadTable()".
- Handle the "EmptyResultDataAccessException" inside "StatsController".
- Optimize gradle's performance.
- Code polishing.
2022-12-15 14:04:22 +02:00
Lampros Smyrnaios
bfdf06bd09
- Apply error-checking on individual CallableTasks and on the task-batches that create and upload the data for the "attempt" and "payload" tables. If no data could be uploaded for one or both tables, no "load"-queries will be executed for the affected table(s).
- Catch the more general "Exception", inside "FileUtils.mergeParquetFiles()", in order to be certain that the "SQLException" can also be caught.
- Code polishing.
2022-12-09 12:46:06 +02:00
Lampros Smyrnaios
0209d24068
- Change the parquet compression from "Snappy" to "Gzip", as there is an unhandleable exception when the app is running inside a Docker Container and uses the "Snappy" compression.
- Code polishing.
2022-12-08 16:28:41 +02:00
Lampros Smyrnaios
c8baf5a5fc
- Fix not finding the parquet-schema files when the app was run inside a Docker Container.
- Update the "namespaces" and the "names" inside the parquet schemas.
- Code polishing.
2022-12-08 12:16:05 +02:00
Lampros Smyrnaios
95c38c4a24
- Fix the "assignment" table always being created in the testDatabase.
- Code polishing.
2022-12-07 14:58:38 +02:00
Lampros Smyrnaios
3c5f4c6464
Fix bytes to MB conversion.
2022-12-07 14:32:18 +02:00
Lampros Smyrnaios
8607594f6d
- Improve exception handling.
- Code polishing.
2022-12-07 13:48:00 +02:00
Lampros Smyrnaios
f183df276b
- Move the "uploadFullTexts"-code into its own method.
- Code polishing.
2022-12-06 12:24:34 +02:00
Lampros Smyrnaios
b0c57d79a5
- When the Controller cannot retrieve any assignments from Impala (without an error), return an HTTP-"MULTI_STATUS" with an empty "AssignmentsResponse", instead of an "INTERNAL_SERVER_ERROR".
- Fix an error-message.
2022-12-05 16:44:00 +02:00
Lampros Smyrnaios
577ea983e8
- Improve some log-messages.
- Set some optimization settings for gradle.
- Fix error-handling in "installAndRun.sh".
- Update dependencies.
2022-11-30 16:28:39 +02:00
Lampros Smyrnaios
6226e2298d
- Upgrade the results-loading process: instead of making thousands of SQL-insert requests to Impala, we now write the results to parquet files, upload them to HDFS and then import the data into the Impala tables with just 2 requests. This results in a huge performance improvement.
One side effect of using the parquet-files is that the timestamps are now BIGDECIMAL numbers instead of "Timestamp" objects, but converting them to such objects is pretty easy, if we ever need to do it.
- Code polishing.
2022-11-10 17:18:21 +02:00
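Note on 6226e2298d: the numeric-to-"Timestamp" conversion mentioned in the message could be sketched as below. It assumes the stored number is a count of milliseconds since the Unix epoch, which the message does not confirm; the class and method names are hypothetical.

```java
import java.math.BigDecimal;
import java.sql.Timestamp;

final class TimestampSketch {
    /**
     * Converts a numeric timestamp to a java.sql.Timestamp.
     * Assumption: the value is whole milliseconds since 1970-01-01T00:00:00Z.
     * longValueExact() throws ArithmeticException for fractional or out-of-range values.
     */
    static Timestamp toTimestamp(BigDecimal epochMillis) {
        return new Timestamp(epochMillis.longValueExact());
    }
}
```

If the stored unit turns out to be seconds rather than milliseconds, the value would need to be multiplied by 1000 before the conversion.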
Lampros Smyrnaios
6a03103b79
Update dependencies.
2022-11-10 16:50:21 +02:00
Lampros Smyrnaios
aad37cd81e
Add the "StatsController", which brings the "getNumberOfPayloads" and "getNumberOfRecordsInspected" endpoints.
2022-10-18 15:00:26 +03:00
Lampros Smyrnaios
e2d53105d1
Fix not creating the "assignment" table in a new production database, which contains only the "publication" and "datasource" data.
2022-10-07 15:51:31 +03:00
Lampros Smyrnaios
b6340066a7
- Improve handling of the case where the full-texts were found, but the Controller could not acquire them from the Worker.
- Add/improve logs and comments.
- Code cleanup.
2022-09-28 22:34:33 +03:00
Lampros Smyrnaios
a22144bd51
- Refactor "FileUtils.getErrorMessageFromResponseBody(conn)" into "FileUtils.getMessageFromResponseBody(conn, isError)", in order to be able to retrieve either the "normal" or the "error" response.
- Add comments.
2022-09-15 23:12:05 +03:00
Lampros Smyrnaios
3e8f9c6074
Update the "UriBuilder.java" to be able to acquire the running port of the server, in case the port-number was initially set to "random" (0). Also make sure we get the "localHostAddress" and not the "localHostName", in case the public IP is not retrievable.
2022-09-12 17:04:05 +03:00
Lampros Smyrnaios
a2cd02115f
- Update the Spring-Security-code to use the "SecurityFilterChain", as the previous code was deprecated.
- Update dependencies.
2022-06-27 21:41:32 +03:00
Lampros Smyrnaios
e3b374a32f
- Optimize file-related tasks.
- Update dependencies.
- Code cleanup.
2022-05-26 15:43:59 +03:00
Lampros Smyrnaios
9096137008
Update documentation.
2022-04-14 14:42:36 +03:00
Lampros Smyrnaios
9b95eebb6c
- Remove the obsolete "parenthesis" and "increasing duplicate-num" from the full-texts' names, before sending them to the S3-Object-Store. They now end with the "file-hash", so it is guaranteed that they will be unique. The Worker continues to produce the previous kind of names, without any disturbance.
- Improve logging.
- Update MinIO dependency.
2022-04-11 21:15:22 +03:00
Lampros Smyrnaios
a81ed3c60f
- Add an "isTestEnvironment"-switch, which makes it easier to work with production and test databases.
- In case the Worker cannot be reached during a full-texts' batch request, abort the rest of the batches.
- Fix memory leaks when unzipping the batch-zip-file.
- Add explanatory comments for picking the database related to a full-text file.
2022-04-08 17:39:45 +03:00
Lampros Smyrnaios
33fc61a8d9
- Fix the fileName-ID not being directly related to the datasourceID, in the S3-ObjectStore name. Add explanatory comments.
- Add missing error-logs.
2022-04-05 16:22:02 +03:00