Lampros Smyrnaios
164245cb53
- Automatically delete the unsuccessful WorkerReports, which are more than 7 days old.
...
- Optimize the Service's startup speed, by setting "initialDelays" to the scheduled tasks.
- Optimize documentation.
2023-05-24 16:59:42 +03:00
Lampros Smyrnaios
551c4acef5
Fix property naming missmatch.
2023-05-24 14:49:29 +03:00
Lampros Smyrnaios
cd1fb0af88
- Process the WorkerReports in background Jobs and post the reportResults to the Workers.
...
- Save the workerReports to json files, until they are processed successfully.
- Show some custom metrics in prometheus.
2023-05-24 13:52:28 +03:00
Lampros Smyrnaios
0ea3e2de24
Add the "shutdownService" and "cancelShutdownService" endpoints. The Controller sends the related requests to the Workers and shutdowns gracefully, after all workers have shutdown.
2023-05-24 13:42:29 +03:00
Lampros Smyrnaios
c2a1b96069
- Rename the mounted "mnt/bulkImport/" directory to "/mnt/bulk_import/".
...
- Increase the "awaitTermination" timeout for the ExecutorService to 2 minutes.
2023-05-23 21:09:34 +03:00
Lampros Smyrnaios
c7bfd75973
- Add the "getWorkersInfo" endpoint.
...
- Improve startup speed, by using a faster remote server to get the host's machine public IP. This also reduces the risk of not being able to get the public IP at all.
- Fix the detection of a different IP for a known worker.
- Improve documentation.
2023-05-23 14:57:15 +03:00
Lampros Smyrnaios
5f75b48e95
- Increase the "read-timeout" when searching for the host's machine public-IP.
...
- Update dependencies.
- Code polishing.
2023-05-22 21:33:02 +03:00
Lampros Smyrnaios
0ab6bae93a
- Optimize the json-conversion of the "BulkImportReport".
...
- Code polishing.
2023-05-18 17:30:40 +03:00
Lampros Smyrnaios
f7f919cee1
- Make sure we set the "hasShutdown" to "false", for each known worker which was restarted.
...
- Fix markdown of urls in prometheus' readme.
2023-05-16 12:24:14 +03:00
Lampros Smyrnaios
a8eea1ccf4
Fix missing changes.
2023-05-15 13:13:24 +03:00
Lampros Smyrnaios
f51a34138f
- Use separate HDFS subdirectories for each worker in order to avoid seeing exceptions about "empty hdfs directory" when "loading" data to the database, because one worker has loaded data generated by multiple workers (since we use only 1 load operation for multiple parquet files).
...
- Store each worker's info in a hash-table, in order to efficiently know if we need to create new hdfs subdirectories. Also, this will help to issue "shutdown" requests to the workers in the future, as well as to know which worker has shutdown.
2023-05-15 13:12:20 +03:00
Lampros Smyrnaios
9412391903
- In test-environment mode, check for already existing file-hashes only in the "payload_aggregated" table, instead of the whole "payload" view. This way the investigation for false-positive docUrls is easier, as we avoid checking against the millions of "legacy" payloads.
...
- Improve performance in production, by not creating the string objects for "trace"-logs.
2023-05-15 12:44:16 +03:00
Lampros Smyrnaios
8381df70c6
- Improve performance of uploading parquet-files to HDFS.
...
- Add some logs.
- Code polishing.
2023-05-11 19:40:48 +03:00
Lampros Smyrnaios
992d4ffd5e
- Add the time-zone in the logs.
...
- Change some log-levels to "trace", although most of them are still disabled.
2023-05-11 03:10:53 +03:00
Lampros Smyrnaios
b6e8cd1889
New feature: BulkImport full-text files from compatible datasources.
2023-05-11 03:07:55 +03:00
Lampros Smyrnaios
42b93e9429
- Add the "getNumberOfAllDistinctFullTexts" stats-endpoint.
...
- Add TODOs for more stats endpoints.
- Code polishing.
2023-05-04 15:48:49 +03:00
Lampros Smyrnaios
b3196376eb
Fix a bug, which caused the full-text files to never close.
2023-05-04 13:03:28 +03:00
Lampros Smyrnaios
fd15372fd6
Add error-checks for retrieving the status-code from HttpUrlConnections.
2023-05-03 13:30:29 +03:00
Lampros Smyrnaios
49662319a1
- Simplify the creation of local directories.
...
- Improve exception messages.
2023-04-28 14:58:33 +03:00
Lampros Smyrnaios
55ea5118ac
- Update the "testDatabaseName" property.
...
- Code polishing.
2023-04-26 19:33:28 +03:00
Lampros Smyrnaios
d7797eaaf6
Add the "getNumberOfPayloadsForDatasource" endpoint.
2023-04-24 09:54:35 +03:00
Lampros Smyrnaios
4dc34429f8
- Increase the waiting-time before checking the docker containers' status, in order to catch configuration-crashes.
...
- Code polishing.
2023-04-10 22:28:53 +03:00
Lampros Smyrnaios
c39fef2654
Upgrade payload-table to payload-view which consists of three separate payload tables: "payload_legacy", "payload_aggregated" and "payload_bulk_import".
2023-04-10 15:55:50 +03:00
Lampros Smyrnaios
37363100fd
Prioritize most recent publications.
2023-04-10 15:00:23 +03:00
Lampros Smyrnaios
484cf5cefc
- Avoid requesting the remaining full-text batches in case the Worker returns a 5XX error in one of the batches.
...
- Add nullability-checks for "datasourceId" and "hash" before constructing the new filename and upload the full-text on S3.
- Improve a log-message.
2023-03-29 17:12:37 +03:00
Lampros Smyrnaios
4280f89296
- Set the default value of the "isTestEnvironment" property to "true", in order to avoid undesired outcomes in the production db.
...
- Code polishing.
2023-03-21 17:04:28 +02:00
Lampros Smyrnaios
003c0bf179
- Add support for excluding specific datasources from being crawled. These datasources may be aggregated through bulk-imports, by other pieces of software. Such a datasource is "arXiv.org".
...
- Fix an issue, where the "datasource-type" was retrieved instead of the "datasource-name".
- Polish the "findAssignmentsQuery".
2023-03-21 07:19:35 +02:00
Lampros Smyrnaios
f835a752bf
Transform the "application.properties" file to "application.yml" and optimize the property-trees.
2023-03-20 15:23:00 +02:00
Lampros Smyrnaios
17a6c120dd
Improve logs for full-texts' metrics.
2023-03-14 20:57:01 +02:00
Lampros Smyrnaios
ff13af7abb
Use a StatsService interface.
2023-03-13 12:39:39 +02:00
Lampros Smyrnaios
38643c76a3
- Code polishing.
...
- Update Gradle.
2023-03-07 16:55:41 +02:00
Lampros Smyrnaios
7b217764e0
Improve performance when downloading and decompressing the full-texts archive.
2023-03-02 17:44:53 +02:00
Lampros Smyrnaios
c4670073ae
- Add missing refactoring-change.
...
- Code polishing.
- Update Spring.
2023-02-24 23:49:04 +02:00
Lampros Smyrnaios
c8485d472e
Code polishing.
2023-02-24 13:53:09 +02:00
Lampros Smyrnaios
b7f6056032
- Improve an error-message.
...
- Update Gradle.
2023-02-21 15:42:07 +02:00
Lampros Smyrnaios
8893662a81
Refactor the UrlsController: a) offload the business-logic to the dedicated "UrlsService" and b) move the "checkParquetFilesSuccess()"-method to "ParquetFileUtils".
2023-02-21 15:36:35 +02:00
Lampros Smyrnaios
a1c16ffc19
- Exclude empty and null urls in the assignments.
...
- Update the "getFullTextsImproved"-call to "getFullTexts", now that the "improved" version is stable.
- Update Gradle.
- Code polishing.
2023-02-16 14:24:47 +02:00
Lampros Smyrnaios
2253f05bf5
Refactor the "StatsController"-code, by offloading it to a dedicated "StatsService".
2023-02-09 19:25:48 +02:00
Lampros Smyrnaios
49fefefafd
- Refactor the payloads-statistics-code and provide two endpoints: "getNumberOfPayloadsAggregatedByService", which returns the number of payloads aggregated only by the PDF-Aggregation-Service, and the "getNumberOfAllPayloads", which returns the number of all payloads existing in the database, even the ones aggregated in the past, by other pieces of software.
...
- Update README.md.
- Make sure the docker image is clean-built, by avoiding the use of cache.
2023-02-02 17:58:47 +02:00
Lampros Smyrnaios
c9f33d3afa
Add an extra precaution-check to allow the emptying or deletion of an S3-Object-Store bucket, only when the app runs in "TestEnvironment".
2023-02-01 16:42:22 +02:00
Lampros Smyrnaios
f89730f196
Improve documentation.
2023-01-27 14:31:07 +02:00
Lampros Smyrnaios
dc8f0f2bd1
- Heavily reduce the maximum amount of space needed, by deleting the files of each full-texts batch, right after they are uploaded to the S3 Object Store.
...
- Add a check for when the retrieved full-texts-batch is missing some requested files and show a warn-log.
- Update dependencies.
2023-01-23 20:23:21 +02:00
Lampros Smyrnaios
d8773e6ebb
- Make sure the test-environment uses a dedicated hdfs-parquet-directory.
...
- Block app-execution in case the hdfs parquet directories failed to be created.
- Code polishing.
2023-01-18 13:38:05 +02:00
Lampros Smyrnaios
8876089022
- Use Facebook's [**Zstandard**]( https://facebook.github.io/zstd/ ) compression algorithm, which brings very big benefits on compression rate and speed.
...
- Update the minIO dependency.
- Code polishing.
2023-01-10 13:34:54 +02:00
Lampros Smyrnaios
d1a4c84289
- Make sure the fullPath of the baseFilesLocation is available when the user specifies a non-root directory.
...
- Improve error-checking and exception-handling in some "S3ObjectStore"-methods.
- Make sure the "responseCode" is "200-OK", before trying to get the InputStream in "UriBuilder.getPublicIP()".
2023-01-09 15:44:53 +02:00
Lampros Smyrnaios
9904ea5743
- Improve the stability of "UriBuilder.getPublicIP()", by using a "HttpURLConnection" to increase the connection and read timeouts and avoid timeout-exceptions.
...
- Update Spring.
2023-01-03 18:39:50 +02:00
Lampros Smyrnaios
4528d1f9be
- Fix the "baseFilesLocation" being null (there was no serious problem, but multiple directories were spawned in the project's directory).
...
- Make sure the given "baseFilesLocation" ends with a file-separator, before using it.
- Optimize the process of unzipping-files.
2022-12-20 18:38:11 +02:00
Lampros Smyrnaios
e11afe5ab2
Improve performance of the hash-checking algorithm by using multithreading.
2022-12-15 18:34:28 +02:00
Lampros Smyrnaios
9cdbbdea67
Refactor the files' storage location.
2022-12-15 18:29:51 +02:00
Lampros Smyrnaios
e51ee9dd27
- Add info about the Stats API usage in "README.md".
...
- Optimize performance in "ParquetFileUtils.createAndLoadParquetDataIntoAttemptTable()" and "ParquetFileUtils.createAndLoadParquetDataIntoPayloadTable()".
- Handle the "EmptyResultDataAccessException" inside "StatsController".
- Optimize gradle's performance.
- Code polishing.
2022-12-15 14:04:22 +02:00