Lampros Smyrnaios
2d5643cb0a
Fix missing spaces in some secondary "DROP"-queries.
2023-07-07 20:51:14 +03:00
Lampros Smyrnaios
d5c139c410
Handle the case where the "stats"-queries are executed while some tables of the DB are in a "merge" state. In this case, the queries fail and the Controller retries up to 10 times.
2023-07-06 18:29:13 +03:00
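
The retry behaviour described in the commit above could be sketched roughly as follows. This is only an illustration under assumptions: the class `RetrySketch`, the method `runWithRetries` and the fixed pause between attempts are hypothetical and are not taken from the Controller's code. Only the limit of 10 attempts comes from the commit message.

```java
import java.util.concurrent.Callable;

final class RetrySketch {

    // Hypothetical helper: runs the query and, if it throws, tries again
    // until maxAttempts is reached. The last failure is rethrown.
    static <T> T runWithRetries(Callable<T> query, int maxAttempts, long pauseMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return query.call();
            } catch (Exception e) {
                lastFailure = e;  // e.g. a failure caused by a table being in a "merge" state
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);  // assumed fixed pause before the next attempt
                }
            }
        }
        throw lastFailure;
    }
}
```

A caller would wrap each "stats"-query in a `Callable` and pass `10` as `maxAttempts`. Whether the real code pauses between attempts is not stated in the log.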
Lampros Smyrnaios
e8644cb64f
- Optimize the "insertAssignmentsQuery".
...
- Add documentation about the Prometheus Metrics, in README.
- Update Dependencies.
- Code polishing.
2023-07-05 17:10:30 +03:00
Lampros Smyrnaios
a89abe3f2f
Prioritize the publications, which are specified inside the "publication_boost" table, according to their "boost-level".
2023-06-29 12:32:06 +03:00
Lampros Smyrnaios
4c3e2e6b6e
- Fix not using the actual "currentAssignmentsBatch" of the workerReport itself, when creating the parquetFileNames and when reporting to the user the initialization of the "addition of the workerReport".
...
- Code polishing.
2023-06-27 16:08:01 +03:00
Lampros Smyrnaios
0f4b63c4a9
Expose the following statistics as prometheus-metrics and create/update a stats-endpoint for each one:
...
- "numOfPayloadsAggregatedByServiceThroughCrawling"
- "numOfPayloadsAggregatedByServiceThroughBulkImport"
- "numOfPayloadsAggregatedByService"
- "numOfLegacyPayloads"
- "numOfRecordsInspectedByServiceThroughCrawling" (renamed from "numOfInspectedRecords")
2023-06-23 15:22:26 +03:00
Lampros Smyrnaios
d52601e819
- Add handling for the "EmptyResultDataAccessException" in "UrlsServiceImpl.getAssignments()", which is thrown when no assignments are returned from the query.
...
- Code polishing.
2023-06-22 12:39:11 +03:00
Lampros Smyrnaios
b9712bed85
- Expose the "numOfAllPayloads" and "numOfInspectedRecords" DB-stats to Prometheus, by using a scheduling task to request the numbers from the DB, every 6 hours.
...
- Update the "StatsServiceImpl.getNumberOfPayloadsAggregatedByService()" to use the new table "payload_aggregated", instead of casting and checking the date of the records.
- Code polishing.
2023-06-19 14:42:00 +03:00
Lampros Smyrnaios
798fa09d68
- Identify and handle a possible Worker-crash, in "UrlsServiceImpl.postReportResultToWorker()".
...
- Add/Improve some log messages.
- Update and cleanup dependencies.
- Code polishing.
2023-06-15 23:19:36 +03:00
Lampros Smyrnaios
88a74b2c41
Add support for all private addresses defined in the "RFC 1918" standard. This fixes the issue of discarding some "shutdownService" requests because they came from different local private addresses, when the Controller was running inside a docker container.
2023-06-15 13:26:27 +03:00
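
For reference, RFC 1918 reserves three IPv4 blocks for private use: 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16. A minimal sketch of such a check is shown below. The class and method names are hypothetical and the Controller's actual check may be written differently (for example with a regular expression).

```java
final class Rfc1918Sketch {

    // Returns true if the dotted-quad IPv4 string lies in one of the three RFC 1918 blocks.
    // Malformed input is treated as "not private".
    static boolean isPrivateIpv4(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            return false;
        }
        int[] octets = new int[4];
        try {
            for (int i = 0; i < 4; i++) {
                octets[i] = Integer.parseInt(parts[i]);
                if (octets[i] < 0 || octets[i] > 255) {
                    return false;
                }
            }
        } catch (NumberFormatException e) {
            return false;
        }
        if (octets[0] == 10) {
            return true;                                   // 10.0.0.0/8
        }
        if (octets[0] == 172 && octets[1] >= 16 && octets[1] <= 31) {
            return true;                                   // 172.16.0.0/12
        }
        return octets[0] == 192 && octets[1] == 168;       // 192.168.0.0/16
    }
}
```

Docker's default bridge network typically hands out addresses in the 172.16.0.0/12 block, which is consistent with the problem described in the commit.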
Lampros Smyrnaios
c37f157f51
Split the full-texts-batch's main handling-code into two separate methods, which can be used in parallel by two threads, in the future.
2023-06-14 17:16:38 +03:00
Lampros Smyrnaios
e2776c50d0
- Optimize the "WorkerReportResult" and the "ShutdownWorker" requests.
...
- Improve documentation.
2023-06-10 02:31:57 +03:00
Lampros Smyrnaios
0b1ab5b991
Attempt to recover from serious failures, by individualizing the error-handling for each of the "table-merging" queries.
2023-06-10 02:28:02 +03:00
Lampros Smyrnaios
6669dc61bf
- Increase the initialDelay for the "checkIfServiceIsReadyForShutdown" scheduled-task, in production, to 10 minutes.
...
- Code polishing.
2023-06-06 16:49:53 +03:00
Lampros Smyrnaios
54685bbe9a
- Avoid sending "cancelShutdown" requests to Workers which have already shut down.
...
- Optimize performance of the code running right before the "postShutdownOrCancelRequestToWorker".
- Show which Workers have already shut down, so that no "postShutdownOrCancelRequestToWorker" is performed on them.
2023-05-29 13:41:37 +03:00
Lampros Smyrnaios
f9c6bad768
Do not send shutDownRequests to workers which have already shut down.
2023-05-29 12:42:54 +03:00
Lampros Smyrnaios
a38d6ace79
Code polishing.
2023-05-29 12:21:48 +03:00
Lampros Smyrnaios
74ff31fc64
- Show the workerIPs in the logs.
...
- Rename the "FullTexts"-files to "BulkImport".
2023-05-29 12:12:08 +03:00
Lampros Smyrnaios
3988eb3a48
- Use a separate HDFS sub-dir for every assignments-batch, in order to avoid any discrepancies from multiple threads moving parquet-files from the same sub-dir. Multiple batches from the same worker may be processed at the same time. These sub-dirs are deleted afterwards.
...
- Treat the "contains no visible files" situation as an error, in which case the assignments-data is presumed not to have been inserted into the database tables.
- Code polishing/cleanup.
2023-05-27 02:36:05 +03:00
Lampros Smyrnaios
02cee097d4
Fix an issue which could cause some background jobs to be executed more than once. The previously executed jobs were not deleted from the global list fast enough, and they would be selected again, in case they had not finished before the scheduler started again.
2023-05-26 13:08:00 +03:00
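
One common way to avoid this kind of double selection is to remove each job from the shared collection at the moment it is selected, rather than after it finishes. The sketch below illustrates that idea with a `ConcurrentLinkedQueue`. It is an assumption about a possible approach, not the fix that was actually committed, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class JobQueueSketch {

    // Hypothetical shared list of jobs waiting to be executed.
    private final Queue<Runnable> pendingJobs = new ConcurrentLinkedQueue<>();

    void submit(Runnable job) {
        pendingJobs.add(job);
    }

    // Called on every scheduler run. poll() removes a job as it is selected,
    // so a later run cannot pick the same job again, even if it is still executing.
    List<Runnable> takePendingJobs() {
        List<Runnable> selected = new ArrayList<>();
        Runnable job;
        while ((job = pendingJobs.poll()) != null) {
            selected.add(job);
        }
        return selected;
    }
}
```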
Lampros Smyrnaios
2b50e08bf6
- Handle the case where multiple threads may load the same HDFS directory to a database table, thus causing the "directory contains no visible files"-SQLException.
...
- Improve the values of the delays for some scheduledTasks.
- Improve elapsed time precision for the "lastAccessedOn" metadata of the workerReports.
- Code polishing.
2023-05-25 00:34:36 +03:00
Lampros Smyrnaios
164245cb53
- Automatically delete the unsuccessful WorkerReports, which are more than 7 days old.
...
- Optimize the Service's startup speed, by setting "initialDelays" to the scheduled tasks.
- Optimize documentation.
2023-05-24 16:59:42 +03:00
Lampros Smyrnaios
551c4acef5
Fix property naming mismatch.
2023-05-24 14:49:29 +03:00
Lampros Smyrnaios
8b5f143b0a
Place the "workerReports" and the "bulkImportReports" dirs inside the "reports" parent-directory.
2023-05-24 14:10:57 +03:00
Lampros Smyrnaios
cd1fb0af88
- Process the WorkerReports in background Jobs and post the reportResults to the Workers.
...
- Save the workerReports to json files, until they are processed successfully.
- Show some custom metrics in prometheus.
2023-05-24 13:52:28 +03:00
Lampros Smyrnaios
0ea3e2de24
Add the "shutdownService" and "cancelShutdownService" endpoints. The Controller sends the related requests to the Workers and shuts down gracefully, after all workers have shut down.
2023-05-24 13:42:29 +03:00
Lampros Smyrnaios
c2a1b96069
- Rename the mounted "mnt/bulkImport/" directory to "/mnt/bulk_import/".
...
- Increase the "awaitTermination" timeout for the ExecutorService to 2 minutes.
2023-05-23 21:09:34 +03:00
Lampros Smyrnaios
c7bfd75973
- Add the "getWorkersInfo" endpoint.
...
- Improve startup speed, by using a faster remote server to get the host machine's public IP. This also reduces the risk of not being able to get the public IP at all.
- Fix the detection of a different IP for a known worker.
- Improve documentation.
2023-05-23 14:57:15 +03:00
Lampros Smyrnaios
5f75b48e95
- Increase the "read-timeout" when searching for the host machine's public-IP.
...
- Update dependencies.
- Code polishing.
2023-05-22 21:33:02 +03:00
Lampros Smyrnaios
0ab6bae93a
- Optimize the json-conversion of the "BulkImportReport".
...
- Code polishing.
2023-05-18 17:30:40 +03:00
Lampros Smyrnaios
f7f919cee1
- Make sure we set the "hasShutdown" to "false", for each known worker which was restarted.
...
- Fix markdown of urls in prometheus' readme.
2023-05-16 12:24:14 +03:00
Lampros Smyrnaios
b499209ce3
- Move the Prometheus and grafana configuration into a dedicated directory and docker-compose file.
...
- Add documentation about setting-up prometheus and grafana.
2023-05-15 18:52:31 +03:00
Lampros Smyrnaios
a8eea1ccf4
Fix missing changes.
2023-05-15 13:13:24 +03:00
Lampros Smyrnaios
f51a34138f
- Use separate HDFS subdirectories for each worker in order to avoid seeing exceptions about "empty hdfs directory" when "loading" data to the database, because one worker has loaded data generated by multiple workers (since we use only 1 load operation for multiple parquet files).
...
- Store each worker's info in a hash-table, in order to efficiently know if we need to create new hdfs subdirectories. Also, this will help to issue "shutdown" requests to the workers in the future, as well as to know which workers have shut down.
2023-05-15 13:12:20 +03:00
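
The hash-table idea from the commit above might look roughly like the sketch below. The class `WorkerRegistrySketch`, its methods and the `workerId` key are hypothetical. Only the "hasShutdown" flag name appears elsewhere in this log, and the real data structure may hold more fields.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class WorkerRegistrySketch {

    // Hypothetical per-worker record.
    static final class WorkerInfo {
        final String ip;
        volatile boolean hasShutdown;

        WorkerInfo(String ip) {
            this.ip = ip;
        }
    }

    private final ConcurrentMap<String, WorkerInfo> workers = new ConcurrentHashMap<>();

    // Returns true only the first time a worker is seen, which is when
    // the caller would need to create that worker's HDFS subdirectories.
    boolean registerIfNew(String workerId, String ip) {
        return workers.putIfAbsent(workerId, new WorkerInfo(ip)) == null;
    }

    void markShutdown(String workerId) {
        WorkerInfo info = workers.get(workerId);
        if (info != null) {
            info.hasShutdown = true;
        }
    }

    boolean hasShutdown(String workerId) {
        WorkerInfo info = workers.get(workerId);
        return info != null && info.hasShutdown;
    }
}
```

`putIfAbsent` makes the "is this worker new?" check and the insertion a single atomic step, so two reports arriving at once from the same worker would not both trigger directory creation.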
Lampros Smyrnaios
9412391903
- In test-environment mode, check for already existing file-hashes only in the "payload_aggregated" table, instead of the whole "payload" view. This way the investigation for false-positive docUrls is easier, as we avoid checking against the millions of "legacy" payloads.
...
- Improve performance in production, by not creating the string objects for "trace"-logs.
2023-05-15 12:44:16 +03:00
Lampros Smyrnaios
8381df70c6
- Improve performance of uploading parquet-files to HDFS.
...
- Add some logs.
- Code polishing.
2023-05-11 19:40:48 +03:00
Lampros Smyrnaios
992d4ffd5e
- Add the time-zone in the logs.
...
- Change some log-levels to "trace", although most of them are still disabled.
2023-05-11 03:10:53 +03:00
Lampros Smyrnaios
b6e8cd1889
New feature: BulkImport full-text files from compatible datasources.
2023-05-11 03:07:55 +03:00
Lampros Smyrnaios
42b93e9429
- Add the "getNumberOfAllDistinctFullTexts" stats-endpoint.
...
- Add TODOs for more stats endpoints.
- Code polishing.
2023-05-04 15:48:49 +03:00
Lampros Smyrnaios
b3196376eb
Fix a bug, which caused the full-text files to never close.
2023-05-04 13:03:28 +03:00
Lampros Smyrnaios
fd15372fd6
Add error-checks for retrieving the status-code from HttpUrlConnections.
2023-05-03 13:30:29 +03:00
Lampros Smyrnaios
49662319a1
- Simplify the creation of local directories.
...
- Improve exception messages.
2023-04-28 14:58:33 +03:00
Lampros Smyrnaios
55ea5118ac
- Update the "testDatabaseName" property.
...
- Code polishing.
2023-04-26 19:33:28 +03:00
Lampros Smyrnaios
d7797eaaf6
Add the "getNumberOfPayloadsForDatasource" endpoint.
2023-04-24 09:54:35 +03:00
Lampros Smyrnaios
4dc34429f8
- Increase the waiting-time before checking the docker containers' status, in order to catch configuration-crashes.
...
- Code polishing.
2023-04-10 22:28:53 +03:00
Lampros Smyrnaios
c39fef2654
Upgrade payload-table to payload-view which consists of three separate payload tables: "payload_legacy", "payload_aggregated" and "payload_bulk_import".
2023-04-10 15:55:50 +03:00
Lampros Smyrnaios
37363100fd
Prioritize most recent publications.
2023-04-10 15:00:23 +03:00
Lampros Smyrnaios
484cf5cefc
- Avoid requesting the remaining full-text batches in case the Worker returns a 5XX error in one of the batches.
...
- Add nullability-checks for "datasourceId" and "hash" before constructing the new filename and uploading the full-text to S3.
- Improve a log-message.
2023-03-29 17:12:37 +03:00
Lampros Smyrnaios
882c6f447b
Update the "testDatabaseName".
2023-03-21 23:10:21 +02:00
Lampros Smyrnaios
4280f89296
- Set the default value of the "isTestEnvironment" property to "true", in order to avoid undesired outcomes in the production db.
...
- Code polishing.
2023-03-21 17:04:28 +02:00