Lampros Smyrnaios
6226e2298d
- Upgrade the results-loading process: Instead of making thousands of SQL-insert requests to Impala, we now write the results to parquet files, upload them to HDFS and then import the data into the Impala tables with just 2 requests. This results in a huge performance improvement.
...
One side effect of using the parquet files is that the timestamps are now BIGDECIMAL numbers instead of "Timestamp" objects, but converting them to such objects is pretty easy, if we ever need to do it.
- Code polishing.
2022-11-10 17:18:21 +02:00
Lampros Smyrnaios
6a03103b79
Update dependencies.
2022-11-10 16:50:21 +02:00
Lampros Smyrnaios
aad37cd81e
Add the "StatsController", which brings the "getNumberOfPayloads" and "getNumberOfRecordsInspected" endpoints.
2022-10-18 15:00:26 +03:00
Lampros Smyrnaios
e2d53105d1
Fix not creating the "assignment" table in a new production database, which contains only the "publication" and "datasource" data.
2022-10-07 15:51:31 +03:00
Lampros Smyrnaios
b6340066a7
- Improve handling of the case, where the full-texts were found, but the Controller could not acquire them from the Worker.
...
- Add/improve logs and comments.
- Code cleanup.
2022-09-28 22:34:33 +03:00
Lampros Smyrnaios
a22144bd51
- Refactor "FileUtils.getErrorMessageFromResponseBody(conn)" into "FileUtils.getMessageFromResponseBody(conn, isError)", in order to be able to either retrieve the "normal" or the "error" response.
...
- Add comments.
2022-09-15 23:12:05 +03:00
Lampros Smyrnaios
3e8f9c6074
Update the "UriBuilder.java" to be able to acquire the running port of the server, in case the port-number was initially set to "random" (0). Also make sure we get the "localHostAddress" and not the "localHostName", in case the public IP is not retrievable.
2022-09-12 17:04:05 +03:00
Lampros Smyrnaios
a2cd02115f
- Update the Spring-Security-code to use the "SecurityFilterChain", as the previous code was deprecated.
...
- Update dependencies.
2022-06-27 21:41:32 +03:00
Lampros Smyrnaios
e3b374a32f
- Optimize file-related tasks.
...
- Update dependencies.
- Code cleanup.
2022-05-26 15:43:59 +03:00
Lampros Smyrnaios
9096137008
Update documentation.
2022-04-14 14:42:36 +03:00
Lampros Smyrnaios
9b95eebb6c
- Remove the obsolete "parenthesis" and "increasing duplicate-num" from the full-texts' names, before sending them to the S3-Object-Store. They now end with the "file-hash", so it is guaranteed that they will be unique. The Worker continues to produce the previous kind of names, without any disturbance.
...
- Improve logging.
- Update MinIO dependency.
2022-04-11 21:15:22 +03:00
Lampros Smyrnaios
a81ed3c60f
- Add an "isTestEnvironment"-switch, which makes it easier to work with production and test databases.
...
- In case the Worker cannot be reached during a full-texts' batch request, abort the rest of the batches.
- Fix memory leaks when unzipping the batch-zip-file.
- Add explanatory comments for picking the database related to a full-text file.
2022-04-08 17:39:45 +03:00
Lampros Smyrnaios
33fc61a8d9
- Fix the fileName-ID not being directly related to the datasourceID, in the S3-ObjectStore name. Add explanatory comments.
...
- Add missing error-logs.
2022-04-05 16:22:02 +03:00
Lampros Smyrnaios
a23c918a42
- Fix a "@JsonProperty" annotation inside "Payload.java".
...
- Fix a "@Value" annotation inside "FileUtils.java".
- Add a new database and show its name along with the initial's name in the logs.
- Code cleanup and improvement.
2022-04-05 00:01:44 +03:00
Lampros Smyrnaios
5e4fad2479
- Change the fileNames' structure in the S3-ObjectStore.
...
- Update dependencies.
2022-04-01 19:24:04 +03:00
Lampros Smyrnaios
48670f3399
- Show the percentage of the "NumFullTextsFound", in the logs.
...
- Update dependencies.
2022-03-28 14:29:31 +03:00
Lampros Smyrnaios
e587b2ca6c
Update Spring dependencies.
2022-02-25 17:41:24 +02:00
Lampros Smyrnaios
88acaae20f
- Replace the "numFullTextUrlsFound"-counter with "numFullTextsFound"-counter to reflect the end result of the actually available full-texts (which were downloaded by the Worker).
...
- Optimize the gather-fileNames loop.
- Improve a message in "installAndRun.sh"
2022-02-23 17:40:06 +02:00
Lampros Smyrnaios
ad5dbdde9b
- Improve performance when inserting records into the "attempt" table, by splitting the records equally, across more threads.
...
- Bring back the "UriBuilder", which informs us in the logs, about the Controller's url (IP, PORT, API).
- Code cleanup.
2022-02-22 13:54:16 +02:00
Lampros Smyrnaios
dfd40cb105
Insert only the records with uploaded-to-S3 full-texts, in the "payload" table.
2022-02-17 16:27:40 +02:00
Lampros Smyrnaios
71f6b46130
- In case of an error when creating the "current_assignment" table (e.g. out of memory in the backend database server), check for partial-creation and drop it. Also, in any case, before we drop this table, now check if it exists first (in general it should always exist, unless the creation results in an error and the table was not created at all).
...
- Fix an error-message.
- Update dependencies.
- Code cleanup.
2022-02-14 12:36:00 +02:00
Lampros Smyrnaios
d2ed9cd9ed
Improve efficiency and performance when processing the full-texts.
2022-02-08 15:02:13 +02:00
Lampros Smyrnaios
5819bf584b
Update the README.md
2022-02-07 21:11:03 +02:00
Lampros Smyrnaios
1111c850b9
- Add support for more than one full-text per id. Allow recognizing fileName additions: "id(1).pdf", "id(2).pdf", etc.
...
- Fix not giving the databaseName in the "ImpalaController.get10PublicationIdsTest()".
- Improve consistency in the "maxAttemptsPerRecord" value, among different threads. Also, reduce the value-increase by one.
- Check if the tableName string is empty, in the "mergeParquetFiles".
- Improve error-logging.
- Set some local variables to "final", optimizing code-execution by the JVM.
2022-02-07 13:57:09 +02:00
Lampros Smyrnaios
5d70e82504
Merge pull request 'Springify and dockerize project (fixed and improved)' ( #2 ) from springify_project into master
...
Reviewed-on: #2
2022-02-04 14:56:16 +01:00
Lampros Smyrnaios
b206114144
- Allow the user to build, push and run the App in Docker, straight through the "installAndRun.sh" script.
...
- Re-add the logback-spring configuration.
- Change the docker-app name.
2022-02-04 15:49:56 +02:00
Lampros Smyrnaios
6aab1d242b
- Improve performance when handling WorkerReports' database insertions, by using parallelism to insert into two different tables at the same time. Also, pre-cache the query-argument-types.
...
- Update the error-message and counting system, on partial insertion event.
2022-02-04 14:48:22 +02:00
Lampros Smyrnaios
be4898e43e
Bug fixes and improvements:
...
- Fix an NPE, when the "getTestUrls"-endpoint is called. It was thrown because of an absent per-thread initialization of some thread-local variables.
- Fix JdbcTemplate error when querying the "getFileLocationForHashQuery".
- Fix the "S3ObjectStore.isLocationInStore" check.
- Fix not catching/handling some exceptions.
- Fix/improve log-messages.
- Optimize the "getFileLocationForHashQuery" to return only the first row. In the latest change, without this optimization, the query-result would cause an exception, which left the same-hash cases unhandled.
- Optimize the "ImpalaConnector.databaseLock.lock()" positioning.
- Update the "getTestUrls" api-path.
- Optimize list-allocation.
- Re-add the info-message about the successful emptying of the S3-bucket.
- Code cleanup.
2022-02-02 20:19:46 +02:00
Lampros Smyrnaios
d1c86ff273
Merge pull request 'Springify project' ( #1 ) from antonis.lempesis/UrlsController:master into springify_project
...
Reviewed-on: #1
2022-02-01 19:51:50 +01:00
Antonis Lempesis
35966b6f6e
finishing touches
2022-02-01 16:57:28 +02:00
Antonis Lempesis
c093e52d15
Merge branch 'master' of https://code-repo.d4science.org/antonis.lempesis/UrlsController
2022-02-01 02:08:21 +02:00
Antonis Lempesis
e9bede5c45
more fixes
2022-02-01 02:08:02 +02:00
Antonis Lempesis
f5748434c7
Merge branch 'master' of https://code-repo.d4science.org/antonis.lempesis/UrlsController
2022-01-31 14:01:39 +02:00
Antonis Lempesis
9ac10fc4b3
fixed Value annotations
2022-01-31 14:01:26 +02:00
Antonis Lempesis
0772e9cdfb
Merge branch 'master' of https://code-repo.d4science.org/antonis.lempesis/UrlsController
2022-01-31 13:49:34 +02:00
Antonis Lempesis
1c82088a7c
fixed Value annotations
2022-01-31 13:49:14 +02:00
Antonis Lempesis
3da6fd98e9
added Dockerfile
2022-01-31 04:21:31 +02:00
Antonis Lempesis
6dde8c0faa
finished merge
2022-01-31 04:17:16 +02:00
Antonis Lempesis
e47fd8d97b
merged refactor branch
2022-01-30 23:10:06 +02:00
Antonis Lempesis
3741cce886
springified project
2022-01-30 22:15:13 +02:00
Antonis Lempesis
bf26bf955f
springified project
2022-01-30 22:14:52 +02:00
Lampros Smyrnaios
d0ab42e4fa
- Change the scheme of the file-location URI.
...
- Move the old and the current database names in the "application.properties" file.
- Improve logging.
2022-01-28 07:24:42 +02:00
Lampros Smyrnaios
92b11baf93
- Update the repository for the Impala JDBC Driver.
...
- Code cleanup.
2022-01-28 00:59:19 +02:00
Antonis Lempesis
91f460ce51
moved impala jar to omtd repository and updated build file
2022-01-28 00:41:29 +02:00
Lampros Smyrnaios
a01e11eef0
When all the data is processed, increase the number of "max-attempts" to retry some very old records, in the next requests.
2022-01-27 01:18:26 +02:00
Lampros Smyrnaios
3c9f8870d1
- Change the repository for the Impala JDBC Driver, as the previous one had networking issues.
...
- Optimize the "findAssignmentsQuery".
2022-01-26 19:52:46 +02:00
Lampros Smyrnaios
ff46839158
Fix not prioritizing the gradle version defined inside the "installAndRun.sh" script.
2022-01-21 15:45:12 +02:00
Lampros Smyrnaios
8d9336fa52
Update dependencies.
2022-01-21 15:04:29 +02:00
Lampros Smyrnaios
ab99bc6168
- Make sure the temp table "current_assignment" from a cancelled previous execution, is dropped and purged on startup.
...
- Improve logging.
- Code cleanup.
2022-01-19 01:37:47 +02:00
Lampros Smyrnaios
83f40a23d9
Bring back the prepared-statements for the insert-queries. After the fix of the "broken pipe"-error, they now work. Bringing them back increases security and solves the "SQL syntax errors" caused by the values of some URLs.
2022-01-13 00:54:21 +02:00