Compare commits


71 Commits
2.4 ... master

Author SHA1 Message Date
Lampros Smyrnaios 39c36f9e66 - Resolve a concurrency issue, by enforcing synchronization on the "BulkImportReport.getJsonReport()" method.
- Increase the number of stacktrace-lines to 20, for bulkImport-segment-failures.
- Improve "GenericUtils.getSelectiveStackTrace()".
2024-05-01 01:29:25 +03:00
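A minimal sketch of the synchronization idea in the entry above: both the event-appending and the JSON-serialization methods are synchronized, so the serializer never iterates the event list while another thread is still adding to it. The fields and bodies are assumptions for illustration; only the class and method names come from the commit message, and Gson is assumed because it is imported elsewhere in this diff.

```java
import com.google.gson.Gson;
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for the real BulkImportReport class.
public class BulkImportReport {

    private static final Gson gson = new Gson();

    // Events collected by multiple bulk-import threads.
    private final List<String> eventMessages = new ArrayList<>();

    // Synchronized, so no thread serializes the report while another is still appending events.
    public synchronized void addEvent(String message) {
        eventMessages.add(message);
    }

    public synchronized String getJsonReport() {
        return gson.toJson(this);
    }
}
```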
Lampros Smyrnaios 8e14d4dbe0 - Fix not counting the files from the bulkImport-segment, which failed due to an exception.
- Write segment-exception-messages to the bulkImport-report.
2024-04-30 23:43:52 +03:00
Lampros Smyrnaios 0d117743c2 - Code-optimization.
- Upload the updated gradle-wrapper and set the latest Gradle version to be used in the "installAndRun.sh" script.
2024-04-30 02:13:08 +03:00
Lampros Smyrnaios 64a1b7d4f0 - Comment-out the bucket-deletion process, in order to avoid any accidental deletion, even though it has to be explicitly allowed in the config.
- Update dependencies.
2024-04-26 12:54:00 +03:00
Lampros Smyrnaios e2d43a9af0 Upgrade the "processBulkImportedFilesSegment" code:
1) Pre-calculate the file-hashes for all files of the segment and perform a single "getHashLocationsQuery", instead of thousands of single-hash queries.
2) Write some important events to the bulkImportReport file, as soon as they are added to the list.
2024-04-02 14:35:19 +03:00
Lampros Smyrnaios bd323ad69a Avoid a very rare case, where we might get an "IllegalArgumentException" from "Lists.partition()", in case the "sizeOfUrlReports" is <= 3. 2024-03-29 18:12:52 +02:00
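Guava's Lists.partition(list, size) throws an IllegalArgumentException when the requested sub-list size is not positive, which can happen if that size is derived from a very small list by integer division. A hedged sketch of the kind of guard that prevents it (the division by 4 is an assumed example, not the project's actual formula):

```java
import com.google.common.collect.Lists;
import java.util.Arrays;
import java.util.List;

public class PartitionGuardExample {
    public static void main(String[] args) {
        List<String> urlReports = Arrays.asList("report1", "report2", "report3");
        int sizeOfUrlReports = urlReports.size();   // 3, i.e. <= 3

        // Plain integer division would give 0 here and make Lists.partition() throw
        // an IllegalArgumentException; clamping to 1 keeps the call valid.
        int sizeOfEachSubList = Math.max(1, sizeOfUrlReports / 4);

        List<List<String>> subLists = Lists.partition(urlReports, sizeOfEachSubList);
        System.out.println("Created " + subLists.size() + " sub-lists.");
    }
}
```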
Lampros Smyrnaios 08de530f03 Various improvements:
- Handle the case when "fileUtils.constructS3FilenameAndUploadToS3()" returns "null", in "processBulkImportedFile()".
- Avoid an "IllegalArgumentException" in "Lists.partition()" when the number of files to bulkImport are fewer than the number of threads available to handle them.
- Include the last directory's "/" divider in the fileDIR group of "FILEPATH_ID_EXTENSION" regex (renamed from "FILENAME_ID_EXTENSION").
- Fix an incomplete log-message.
- Provide the "fileLocation" argument in the "DocFileData" constructor, in "processBulkImportedFile()", even though it's not used after.
2024-03-29 17:23:01 +02:00
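As a rough illustration of the regex bullet above, the sketch below keeps the trailing "/" of the last directory inside the fileDir group; the pattern and group names are assumptions, not the project's actual "FILEPATH_ID_EXTENSION" regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FilePathRegexExample {

    // Illustrative pattern only: the last directory's "/" stays inside the "fileDir" group.
    private static final Pattern FILEPATH_ID_EXTENSION =
            Pattern.compile("^(?<fileDir>.*/)?(?<fileNameId>[^/]+)\\.(?<extension>[^./]+)$");

    public static void main(String[] args) {
        Matcher matcher = FILEPATH_ID_EXTENSION.matcher("/mnt/bulk_import/segment_1/od______1234::abcd.pdf");
        if (matcher.matches()) {
            System.out.println("fileDir    = " + matcher.group("fileDir"));     // "/mnt/bulk_import/segment_1/"
            System.out.println("fileNameId = " + matcher.group("fileNameId"));  // "od______1234::abcd"
            System.out.println("extension  = " + matcher.group("extension"));   // "pdf"
        }
    }
}
```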
Lampros Smyrnaios 1d821ed803 - Prepare version for next release.
- Fix a bug where the "OpenAireID" was not used in the S3 location of bulkImported files; the "fileNameID" was used instead, which equals the OpenAireID in aggregation, but not in bulk-import.
- Update dependencies.
- Code polishing.
2024-03-28 06:09:28 +02:00
Lampros Smyrnaios 8bc5cc35e2 - Optimize writing to the Bulk-import-report file.
- Show the IP of the worker which posts a "workerShutdownReport".
- Code polishing.
2024-03-22 17:50:55 +02:00
Lampros Smyrnaios b9b29dd51c Move some code from "FileUtils.getAndUploadFullTexts()" to two separate methods. 2024-03-20 16:53:03 +02:00
Lampros Smyrnaios 56d233d38e - Move the "FileUtils.mergeParquetFiles()" method to "ParquetFileUtils.mergeParquetFilesOfTable()".
- Fix a typo.
2024-03-20 15:25:19 +02:00
Lampros Smyrnaios 724eae1514 - Optimize the placement of "DatabaseConnector.databaseLock.unlock()" statements.
- Rename a maven-repository.
2024-03-20 15:08:01 +02:00
Lampros Smyrnaios 785204419d Update/cleanup the repositories in "build.gradle". 2024-03-15 12:22:13 +02:00
Lampros Smyrnaios 9b0818b535 - Add handling for additional/specific exceptions, when checking the "futures".
- Move common "ExecutionException" handling-code into its own method: "GenericUtils.getSelectedStackTraceForCausedException()".
- Avoid a double log.
- Code polishing.
2024-03-14 13:59:23 +02:00
Lampros Smyrnaios b34417dc45 Optimize the test-DB creation process:
- Use views of the "initialDatabase" view and tables to a) reduce the amount of space used by test-DBs and b) improve test-db creation performance.
- Avoid possible failures from outdated metadata.
2024-03-14 13:10:54 +02:00
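A rough JDBC-flavored illustration of the view-based approach described above; the database, table and statement details are placeholders, and only the idea of building the test-DB from views instead of copied data reflects the commit.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public class TestDbCreationSketch {

    /** Creates the test-DB as a set of views over the initial database's tables/views. */
    public static void createTestDbFromViews(Connection con, String initialDb, String testDb,
                                             List<String> tableNames) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.execute("CREATE DATABASE IF NOT EXISTS " + testDb);
            for (String table : tableNames) {
                // A view takes no extra storage and is created instantly, unlike a full data copy.
                st.execute("CREATE VIEW IF NOT EXISTS " + testDb + "." + table
                        + " AS SELECT * FROM " + initialDb + "." + table);
            }
        }
    }
}
```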
Lampros Smyrnaios f61cae41a1 - Try to get the cause of the exception of the callable-tasks which handle the bulk-import of fileSegments.
- Fix not counting the failedSegments when an exception was thrown.
- Code polishing.
2024-03-13 12:15:59 +02:00
Lampros Smyrnaios 8f9786de09 Upgrade the algorithm for finding the previously-found fulltexts, based on their md5hash:
- Use a single query with a list of the fileHashes, instead of thousands of single-md5hash-check queries (run at most 6 in parallel), which require a lot of I/O.
- Avoid checking the same fileHash multiple times, in case it is related to multiple payloads.
- In case of a database-error, avoid completely losing that worker's full-texts; instead, continue processing them.
2024-03-13 11:28:37 +02:00
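A hedged JDBC sketch of the batching idea above: one query carrying all file hashes in an IN-list, instead of one round-trip per hash. The table and column names ("payload", "hash", "location") are assumptions made for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

public class HashLocationLookup {

    /** Returns a map of fileHash -> already-known file location, using a single query. */
    public static Map<String, String> getHashLocations(Connection con, List<String> fileHashes) throws SQLException {
        // Build "?, ?, ?, ..." with one placeholder per hash.
        StringJoiner placeholders = new StringJoiner(", ");
        for (int i = 0; i < fileHashes.size(); i++)
            placeholders.add("?");

        String query = "SELECT hash, location FROM payload WHERE hash IN (" + placeholders + ")";

        Map<String, String> hashLocations = new HashMap<>();
        try (PreparedStatement ps = con.prepareStatement(query)) {
            for (int i = 0; i < fileHashes.size(); i++)
                ps.setString(i + 1, fileHashes.get(i));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next())
                    hashLocations.put(rs.getString(1), rs.getString(2));
            }
        }
        return hashLocations;
    }
}
```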
Lampros Smyrnaios e4540e7f3c Handle the case when a urlReports-sublist does not have any payloads inside. 2024-03-12 14:25:00 +02:00
Lampros Smyrnaios e20c5d2146 - Add error-handling for the case when no payloads could be associated with a specific url which should have been in the hashMultiMap in "addUrlReportsByMatchingRecordsFromBacklog".
- Fix not cloning the payload before changing it and adding it to the "prefilledPayloads"-list; previously, an object-reference was used.
2024-03-11 19:48:04 +02:00
Lampros Smyrnaios 1048463ca0 - Improve error-handling in "S3ObjectStore.emptyBucket()".
- Change some log-levels.
- Code polishing.
2024-03-11 16:17:32 +02:00
Lampros Smyrnaios 8f18008001 Avoid performing payload-related operations in case no fulltext was received from the worker, due to an error. 2024-03-11 14:57:13 +02:00
Lampros Smyrnaios ce3e149a95 Improve the "emptying/deleting" process of the S3-bucket. 2024-03-11 13:34:38 +02:00
Lampros Smyrnaios dd394f18a0 - Optimize the JOIN-order in the "findAssignmentsQuery".
- Optimize the "DOC_URL_FILTER"-regex.
- Update dependencies.
2024-03-11 11:35:38 +02:00
Lampros Smyrnaios 43ea64758d - Improve handling of the case when no fulltexts have been found or none of the found ones were requested from the worker, as they were already retrieved in the past.
- Show the number of files with problematic locations (if any of them exist).
- Code polishing.
2024-02-23 12:39:28 +02:00
Lampros Smyrnaios 749172edd8 Add the Jenkins' build-status badge in README. 2024-02-08 19:49:58 +02:00
Lampros Smyrnaios b72996c9a9 - Configure the destination of the logs in the "application.properties" file.
- Add some gradle files to be used by Jenkins.
2024-02-08 19:47:34 +02:00
Lampros Smyrnaios 3563fd6e2a - Try to get the cause of the exception of the callable-tasks which handle parquet-files.
- Update License.
- Update dependencies.
2024-02-07 18:34:28 +02:00
Lampros Smyrnaios 34d7a143e7 Add/improve documentation. 2024-02-01 14:37:29 +02:00
Lampros Smyrnaios 5dadb8ad2f - Optimize the "DOC_URL_FILTER"-regex, by using a non-capturing group.
- Remove an extra "File.separator" from the fulltexts-fullFilePath.
2024-01-19 15:46:23 +02:00
Lampros Smyrnaios bdc61c2cda When at least one worker is still active and we have to wait for the service-shutdown, show a log-message to inform the user, including that worker's IP. 2024-01-15 13:35:22 +02:00
Lampros Smyrnaios 3a70b57146 Prioritize the full-text urls over the landing-page ones. 2024-01-15 12:59:50 +02:00
Lampros Smyrnaios ee1ca8966b - Avoid requesting further workerReport-batches when, already from the 1st batch, the base-directory of that assignments-counter is not found.
- Update dependencies.
2024-01-15 12:57:33 +02:00
Lampros Smyrnaios 2e60128084 - Allow easily changing the port used by workers.
- Show the number of active background-tasks and bulkImportDirs, which delay the Service's shutdown.
- Update dependencies.
- Code polishing.
2023-12-19 23:31:42 +02:00
Lampros Smyrnaios d90ad51609 Add the "shutdownAllWorkersGracefully" and "cancelShutdownAllWorkersGracefully" endpoints, in order to be able to shut down all workers at once and update them, without shutting down the whole Service; this way, the bulk-import procedures keep working. 2023-11-29 16:45:58 +02:00
Lampros Smyrnaios d20c9a7d2e - Show the original exception thrown by the background-job, not the one thrown in the main-thread, which is useless except for its message.
- Reduce the interval for deleting the unhandled assignments to once every 3 days.
- Set the upcoming version.
- Update dependencies.
2023-11-27 18:19:53 +02:00
Lampros Smyrnaios 7f789b8ad0 - If we receive an "UnknownHostException" when uploading to the S3ObjectStore, then skip the current full-texts' batch to leave some time for the network to get unstuck.
- Code polishing.
2023-11-22 15:29:18 +02:00
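A simplified sketch of the behavior described above: when one upload of a batch fails with an UnknownHostException (a sign the S3 endpoint temporarily cannot be resolved), the rest of that batch is skipped so the network has time to recover. The uploadToS3() helper is hypothetical; the real project uses its own S3/MinIO utility.

```java
import java.net.UnknownHostException;
import java.util.List;

public class BatchUploadExample {

    /** Uploads a batch of full-text files; returns how many were uploaded before giving up. */
    public static int uploadBatch(List<String> filePaths) {
        int uploadedCount = 0;
        for (String filePath : filePaths) {
            try {
                uploadToS3(filePath);  // hypothetical helper wrapping the S3/MinIO client
                uploadedCount++;
            } catch (UnknownHostException uhe) {
                // The S3 host could not even be resolved: skip the rest of this batch
                // and let the network recover before the next batch is attempted.
                System.err.println("Skipping the rest of the batch after: " + uhe.getMessage());
                break;
            } catch (Exception e) {
                System.err.println("Failed to upload \"" + filePath + "\": " + e.getMessage());
            }
        }
        return uploadedCount;
    }

    private static void uploadToS3(String filePath) throws Exception {
        // Placeholder for the actual upload call.
    }
}
```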
Lampros Smyrnaios 9b1f2c4931 Improve performance and reduce memory usage of the "findAssignmentsQuery":
- Reorder JOINs and predicates to reduce the computational cost.
- Remove the memory-costly "pu.url" predicates from the "where" clause, as the DB has no empty urls anymore.
2023-10-31 15:59:48 +02:00
Lampros Smyrnaios db929d8931 - Add a scheduling job to delete assignments older than 7 days. These may be left behind when the worker throws a "SocketTimeoutException" before it can receive the assignments and process them. No workerReport gets created for those assignments.
- Improve some log-messages.
- Code polishing.
2023-10-30 12:29:54 +02:00
Lampros Smyrnaios 856c62887d - Make sure the "UTF_8" charset is used, when we get a message from the response-body.
- Improve some log-messages.
2023-10-26 11:44:23 +03:00
Lampros Smyrnaios bdf834c439 - Refactor the "shutdown" script to do an orderly-shutdown, by default, by calling the "shutdownService" endpoint. In case a "force-shutdown" is needed, that can be requested with a cmd-argument.
- Fix not updating the "UrlsController.numOfWorkers" correctly.
- Code polishing.
2023-10-23 17:19:29 +03:00
Lampros Smyrnaios 0c7bf6357b - Improve performance in "FileUtils.addUrlReportsByMatchingRecordsFromBacklog()".
- Make sure we remove the assignments of all old, "not-successful" worker-reports, even those which failed to be renamed to indicate success or failure, or failed to be executed by the background threads (and thus never reached the renaming stage).
2023-10-23 12:21:42 +03:00
Lampros Smyrnaios a7581335f1 - Improve the "getDataForPayloadPrefillQuery".
- Improve some error-messages.
2023-10-21 11:31:31 +03:00
Lampros Smyrnaios 44c2fe7418 - Fix the "IndexOutOfBoundsException", when checking the futures' results.
- Update dependencies.
2023-10-20 14:25:05 +03:00
Lampros Smyrnaios df0ea62a5a - Handle the case when the "webHDFSBaseUrl" does not use HTTPS.
- Improve error-reporting when uploading a file to HDFS.
2023-10-19 11:59:37 +03:00
Lampros Smyrnaios 40729c6295 Move similar code into the new "ParquetFileUtils.getPayloadParquetRecord()" method. 2023-10-17 12:50:51 +03:00
Lampros Smyrnaios f05eee7569 Improve the names of some methods. 2023-10-16 23:39:43 +03:00
Lampros Smyrnaios def21b991d Improve the UX of the "installAndRun.sh" script. 2023-10-09 17:28:22 +03:00
Lampros Smyrnaios fb2877dbe8 Upgrade the execution system for the backgroundTasks:
- Submit each task immediately for execution, instead of waiting for a scheduling thread to send all gathered tasks (up to that point) to the ExecutorService (and block until they are finished, before it can start again).
- Hold the Future of each submitted task to a synchronized-list to check the result of each task at a scheduled time.
- Reduce the cpu-time needed to verify that the Service can shut down, by checking at the same time whether there are "active" and "about-to-be-executed" tasks, instead of relying on the additional check of each worker's "shutdown"-status to verify that no active tasks exist.
- Improve the threads' shutdown procedure.
2023-10-09 17:23:59 +03:00
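A compressed sketch of the execution scheme outlined above, with illustrative names: each task is submitted to the ExecutorService as soon as it arrives, its Future is parked in a synchronized list, and a periodically scheduled job drains that list and inspects the results.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BackgroundTaskRunner {

    public static final ExecutorService backgroundExecutor = Executors.newFixedThreadPool(4);

    // Futures of already-submitted tasks, checked later by a scheduled job.
    public static final List<Future<Boolean>> futures = Collections.synchronizedList(new ArrayList<>());

    /** Submit each task immediately, instead of gathering tasks for a later bulk submission. */
    public static void submit(Callable<Boolean> task) {
        futures.add(backgroundExecutor.submit(task));
    }

    /** Runs periodically (e.g. from a @Scheduled method) to inspect the finished tasks. */
    public static void checkResults() {
        List<Future<Boolean>> snapshot;
        synchronized (futures) {   // copy and clear, so new tasks can keep being submitted meanwhile
            snapshot = new ArrayList<>(futures);
            futures.clear();
        }
        for (Future<Boolean> future : snapshot) {
            try {
                Boolean success = future.get();   // blocks only for tasks still running
                if (!Boolean.TRUE.equals(success))
                    System.err.println("A background task reported failure.");
            } catch (Exception e) {
                System.err.println("A background task threw an exception: " + e);
            }
        }
    }
}
```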
Lampros Smyrnaios a354da763d - Improve some log-messages.
- Increase app's version.
- Code polishing.
2023-10-06 17:28:54 +03:00
Lampros Smyrnaios 0c79fdea35 Update the "findAssignmentsQuery" to check the "attempt.error_class" field for the current pub_url, not the pub_id. 2023-10-06 14:59:26 +03:00
Lampros Smyrnaios ebf8896005 - Fix getter and setter methods for the "isAuthoritative" field.
- Update Gradle.
2023-10-05 16:31:52 +03:00
Lampros Smyrnaios b2ce6393c1 - Add check for remaining "bulkImportDirsUnderProcessing", before shutting down the Service.
- Code polishing.
2023-10-05 13:43:47 +03:00
Lampros Smyrnaios 96c11ba4b8 - Add a missing change.
- Code optimization and polishing.
- Update dependencies.
2023-10-04 16:17:12 +03:00
Lampros Smyrnaios 7019f7c3c7 Improve aggregation speed, by generating additional "attempt" and "payload" records for the publications which are in the back-log and whose url matches one of the urls of the current payloads. 2023-10-04 15:43:31 +03:00
Lampros Smyrnaios b702cf4484 Upgrade the "findAssignmentsQuery":
- Retrieve the assignments by checking only the publication-urls against the "attempt", "assignment" and "payload" tables, not the IDs. This change allows us to: a) avoid re-attempting urls which have already been attempted multiple times (by different id-url pairs), b) avoid aggregating urls which are already inside the "payload" or "assignment" tables, even when they are related to other IDs.
In the end, we only care about the urls when choosing which records should be aggregated.
- Improve performance by using the "anti join" operator, where it fits, in order to allow the engine to use the faster "hash" operations.
2023-10-04 13:43:15 +03:00
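To make the "anti join" idea concrete, below is a heavily simplified fragment of what such a query could look like in Impala SQL, kept as a Java string since the project builds its queries in code. Apart from the "attempt", "assignment" and "payload" table names taken from the commit message, every identifier is an assumption, and the real "findAssignmentsQuery" is considerably more involved.

```java
public class FindAssignmentsQuerySketch {

    // Simplified illustration only: pick candidate publication urls that have not been
    // attempted, assigned or already turned into a payload, matching on the url (not the id),
    // and letting the engine use hash-based anti joins.
    public static final String FIND_ASSIGNMENTS_SKETCH =
            "SELECT pu.id, pu.url\n" +
            "FROM publication_urls pu\n" +
            "LEFT ANTI JOIN attempt a ON a.original_url = pu.url\n" +
            "LEFT ANTI JOIN assignment asgn ON asgn.original_url = pu.url\n" +
            "LEFT ANTI JOIN payload pl ON pl.original_url = pu.url\n" +
            "LIMIT 10000";
}
```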
Lampros Smyrnaios c9626de120 Handle the case when the "upload-file-to-S3" operation fails with a "ConnectException". In this case, all remaining upload operations for the files of that particular batch or segment are canceled. 2023-10-04 13:01:13 +03:00
Lampros Smyrnaios 865926fbc3 - Handle the case when some results have been found from the "getAssignmentsQuery", but no data could be extracted from them.
- Code polishing.
2023-10-02 15:46:55 +03:00
Lampros Smyrnaios ede7ca5a89 - Add bulk-import support for non-Authoritative data-sources.
- Update Spring Boot.
- Code polishing.
2023-09-26 18:02:48 +03:00
Lampros Smyrnaios 90a864ea61 Add more info in bulk-import logs. 2023-09-20 17:50:10 +03:00
Lampros Smyrnaios 0f5d4dac78 Check and show warning/error message for failed payloads. 2023-09-20 17:38:22 +03:00
Lampros Smyrnaios 068b97dd60 Set Xms and Xmx Java-parameters when running the Jar, in Docker. 2023-09-15 14:19:46 +03:00
Lampros Smyrnaios 903c3e1ffc Add thread-safety when reading the bulkImportReport-files. 2023-09-15 11:54:32 +03:00
Lampros Smyrnaios 846c53913f Add LICENSE. 2023-09-14 16:05:36 +03:00
Lampros Smyrnaios 360731ba72 - Improve handling of the "NO_CONTENT" case, in "getAssignments"-endpoint.
- Code optimization and polishing.
2023-09-14 13:53:01 +03:00
Lampros Smyrnaios b4f91f188e Fix the "retries-num" appearing in log-messages. 2023-09-14 12:08:33 +03:00
Lampros Smyrnaios 02bae38885 - Improve response-time of "getAssignments"-requests, by not merging the parquet files of the "assignment" table right after acquiring the assignments from the DB; they are already merged when each assignments-batch is deleted, after a workerReport has been processed.
- Optimize code-positioning for unlocking the DB when done executing queries.
2023-09-13 17:03:11 +03:00
Lampros Smyrnaios 8fdb8e9137 Add renaming of the workerReport-file, to indicate failure, when the processing failed because no workerInfo was found for the worker-id existing in the report. This way, it can be retried by the scheduler later. 2023-09-13 16:35:41 +03:00
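The renaming-on-failure mechanism can be pictured with a small sketch like the one below; the "_failed" suffix and the ".json" report name are assumptions, and only the idea of renaming the report so that a scheduled retry can find it again comes from the commit.

```java
import java.io.File;

public class WorkerReportRenaming {

    /** Renames the report file to mark it as failed, so a scheduled retry can pick it up later. */
    public static boolean markReportAsFailed(File workerReportFile) {
        File failedFile = new File(workerReportFile.getParent(),
                workerReportFile.getName().replace(".json", "_failed.json"));
        boolean renamed = workerReportFile.renameTo(failedFile);
        if (!renamed)
            System.err.println("Could not rename \"" + workerReportFile + "\" to indicate failure!");
        return renamed;
    }
}
```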
Lampros Smyrnaios c98e8df323 Move the "getRenamedWorkerReport"-code in its own method. 2023-09-13 16:27:18 +03:00
Lampros Smyrnaios 6891c467d4 - Avoid displaying a warning for the "test" HDFS directory, when the Controller is running in PROD mode.
- Add a missing change for the optimization of reading files.
- Update dependencies.
2023-09-13 15:29:30 +03:00
Lampros Smyrnaios 3dd349dd00 Improve the "findAssignmentsQuery":
- Fix an issue where assignments with an above-zero attempt_count were finding their way into the results, just because they were prioritized based on their boost_level or pub_year. As a result, apart from old failed assignments being retried sooner, the not-yet-processed boosted publications were pushed out to the workers much more slowly.
- Simplify the query, by removing the internal "ordering" and "limit", which had performance benefits when we did not need additional ordering for "level" and "pub_year". Back then, we wanted to apply the final orderings to as few rows as possible.
2023-09-13 14:38:15 +03:00
Lampros Smyrnaios ee2df19ce1 - Allow "pretty-printing" the json response of the "getBulkImportReport" endpoint.
- Add useful log-messages for various bulk-import stages and improve the current ones.
- Optimize reading and writing the reports.
2023-09-11 17:24:39 +03:00
33 changed files with 2114 additions and 891 deletions

Dockerfile

@ -4,4 +4,4 @@ COPY build/libs/*-SNAPSHOT.jar urls_controller.jar
EXPOSE 1880
ENTRYPOINT ["java","-jar","/urls_controller.jar", "--spring.config.location=file:///mnt/config/application.yml"]
ENTRYPOINT ["java", "-Xms1024m", "-Xmx4096m", "-jar", "/urls_controller.jar", "--spring.config.location=file:///mnt/config/application.yml"]

LICENSE (new file, 201 lines)

@ -0,0 +1,201 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2021-2024 OpenAIRE AMKE
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

README.md

@ -1,13 +1,26 @@
# UrlsController
## [![Jenkins build status](https://jenkins-dnet.d4science.org/buildStatus/icon?job=UrlsController)](https://jenkins-dnet.d4science.org/job/UrlsController/)
The Controller's Application receives requests coming from the [**Workers**](https://code-repo.d4science.org/lsmyrnaios/UrlsWorker) (deployed on the cloud), constructs an assignments-list with data received from a database and returns the list to the workers.<br>
Then, it receives the "WorkerReports", requests the full-texts from the workers in batches, and uploads them to the S3-Object-Store. Finally, it writes the related reports, along with the updated file-locations, into the database.<br>
<br>
It can also process **Bulk-Import** requests, from compatible data sources, in which case it receives the full-text files immediately, without offloading crawling jobs to Workers.<br>
<br>
For interacting with the database we use [**Impala**](https://impala.apache.org/).<br>
For managing and generating data, we use [**Impala**](https://impala.apache.org/) JDBC and WebHDFS.<br>
<br>
**To install and run the application**:
- Run ```git clone``` and then ```cd UrlsController```.
- Set the preferable values inside the [__application.yml__](https://code-repo.d4science.org/lsmyrnaios/UrlsController/src/branch/master/src/main/resources/application.yml) file. Specifically, for tests, set the ***services.pdfaggregation.controller.isTestEnvironment*** property to "**true**" and make sure the "***services.pdfaggregation.controller.db.testDatabaseName***" property is set to a test-database.
- Execute the ```installAndRun.sh``` script which builds and runs the app.<br>
If you want to just run the app, then run the script with the argument "1": ```./installAndRun.sh 1```.<br>
If you want to build and run the app on a **Docker Container**, then run the script with the argument "0" followed by the argument "1": ```./installAndRun.sh 0 1```.<br>
Additionally, if you want to test/visualize the exposed metrics on Prometheus and Grafana, you can deploy their instances on docker containers,
by enabling the "runPrometheusAndGrafanaContainers" switch, inside the "./installAndRun.sh" script.<br>
<br>
**BulkImport API**:
- "**bulkImportFullTexts**" endpoint: **http://\<IP\>:\<PORT\>/api/bulkImportFullTexts?provenance=\<provenance\>&bulkImportDir=\<bulkImportDir\>&shouldDeleteFilesOnFinish={true|false}** <br>
This endpoint loads the right configuration with the help of the "provenance" parameter, delegates the processing to a background thread and immediately returns a message with useful information, including the "reportFileID", which can be used at any moment to request a report about the progress of the bulk-import procedure.<br>
@ -15,6 +28,12 @@ For interacting with the database we use [**Impala**](https://impala.apache.org/
- "**getBulkImportReport**" endpoint: **http://\<IP\>:\<PORT\>/api/getBulkImportReport?id=\<reportFileID\>** <br>
This endpoint returns the bulkImport report, which corresponds to the given reportFileID, in JSON format.
<br>
**How to add a bulk-import datasource**:
- Open the [__application.yml__](https://code-repo.d4science.org/lsmyrnaios/UrlsController/src/branch/master/src/main/resources/application.yml) file.
- Add a new object under the "bulk-import.bulkImportSources" property.
- Read the comments written in the end of the "bulk-import" property and make sure all requirements are met.
<br>
<br>
**Statistics API**:
@ -60,16 +79,6 @@ Note: The Shutdown Service API is accessible by the Controller's host machine.
<br>
<br>
**To install and run the application**:
- Run ```git clone``` and then ```cd UrlsController```.
- Set the preferable values inside the [__application.yml__](https://code-repo.d4science.org/lsmyrnaios/UrlsController/src/branch/master/src/main/resources/application.yml) file.
- Execute the ```installAndRun.sh``` script which builds and runs the app.<br>
If you want to just run the app, then run the script with the argument "1": ```./installAndRun.sh 1```.<br>
If you want to build and run the app on a **Docker Container**, then run the script with the argument "0" followed by the argument "1": ```./installAndRun.sh 0 1```.<br>
Additionally, if you want to test/visualize the exposed metrics on Prometheus and Grafana, you can deploy their instances on docker containers,
by enabling the "runPrometheusAndGrafanaContainers" switch, inside the "./installAndRun.sh" script.<br>
<br>
Implementation notes:
- For transferring the full-text files, we use Facebook's [**Zstandard**](https://facebook.github.io/zstd/) compression algorithm, which brings very big benefits in compression rate and speed.
- The uploaded full-text files follow this naming-scheme: "**datasourceID/recordID::fileHash.pdf**"
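For illustration, building an object name under that scheme could look roughly like the sketch below; the real FileUtils.constructS3FilenameAndUploadToS3() (referenced in the commit messages above) also performs the upload and its error handling.

```java
public class S3NameExample {

    /** Builds an object name following the "datasourceID/recordID::fileHash.pdf" scheme. */
    public static String buildS3ObjectName(String datasourceId, String recordId, String fileHash) {
        return datasourceId + "/" + recordId + "::" + fileHash + ".pdf";
    }

    public static void main(String[] args) {
        System.out.println(buildS3ObjectName("datasource01", "record0001", "d41d8cd98f00b204e9800998ecf8427e"));
    }
}
```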

build.gradle

@ -1,27 +1,27 @@
plugins {
id 'org.springframework.boot' version '2.7.15'
id 'io.spring.dependency-management' version '1.1.3'
id 'org.springframework.boot' version '2.7.18'
id 'io.spring.dependency-management' version '1.1.4'
id 'java'
}
java {
group = 'eu.openaire.urls_controller'
version = '2.4.0-SNAPSHOT'
version = '2.7.0-SNAPSHOT'
sourceCompatibility = JavaVersion.VERSION_1_8
}
repositories {
mavenCentral()
maven {
name "omtd"
url "https://repo.openminted.eu/content/repositories/releases/"
}
maven {
name "pentaho-repo"
url "https://public.nexus.pentaho.org/content/groups/omni/"
name "madgik"
url "https://repo.madgik.di.uoa.gr/content/repositories/thirdparty/"
}
}
ext {
hadoopVersion = '3.4.0'
}
dependencies {
runtimeOnly "org.springframework.boot:spring-boot-devtools"
@ -43,18 +43,18 @@ dependencies {
//implementation group: 'jakarta.validation', name: 'jakarta.validation-api', version: '3.0.2'
// https://mvnrepository.com/artifact/com.google.guava/guava
implementation group: 'com.google.guava', name: 'guava', version: '32.1.2-jre'
implementation group: 'com.google.guava', name: 'guava', version: '33.1.0-jre'
// https://mvnrepository.com/artifact/org.apache.commons/commons-lang3
implementation group: 'org.apache.commons', name: 'commons-lang3', version: '3.13.0'
implementation group: 'org.apache.commons', name: 'commons-lang3', version: '3.14.0'
// https://mvnrepository.com/artifact/org.apache.commons/commons-compress
implementation("org.apache.commons:commons-compress:1.23.0") {
implementation("org.apache.commons:commons-compress:1.26.1") {
exclude group: 'com.github.luben', module: 'zstd-jni'
}
implementation 'com.github.luben:zstd-jni:1.5.5-5' // Even though this is part of the above dependency, the Apache commons rarely updates it, while the zstd team makes improvements very often.
implementation 'com.github.luben:zstd-jni:1.5.6-3' // Even though this is part of the above dependency, the Apache commons rarely updates it, while the zstd team makes improvements very often.
implementation 'io.minio:minio:8.5.5'
implementation 'io.minio:minio:8.5.9'
// https://mvnrepository.com/artifact/com.cloudera.impala/jdbc
implementation("com.cloudera.impala:jdbc:2.5.31") {
@ -80,7 +80,7 @@ dependencies {
implementation('org.apache.parquet:parquet-avro:1.13.1')
// https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common
implementation('org.apache.hadoop:hadoop-common:3.3.6') {
implementation("org.apache.hadoop:hadoop-common:$hadoopVersion") {
exclude group: 'org.apache.parquet', module: 'parquet-avro'
exclude group: 'org.apache.avro', module: 'avro'
exclude group: 'org.slf4j', module: 'slf4j-api'
@ -96,7 +96,7 @@ dependencies {
}
// https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core
implementation('org.apache.hadoop:hadoop-mapreduce-client-core:3.3.6') {
implementation("org.apache.hadoop:hadoop-mapreduce-client-core:$hadoopVersion") {
exclude group: 'org.apache.parquet', module: 'parquet-avro'
exclude group: 'org.apache.avro', module: 'avro'
exclude group: 'org.slf4j', module: 'slf4j-api'
@ -110,17 +110,16 @@ dependencies {
// Add back some updated version of the needed dependencies.
implementation 'org.apache.thrift:libthrift:0.17.0' // Newer versions (>=0.18.X) are not compatible with JAVA 8.
implementation 'com.fasterxml.woodstox:woodstox-core:6.5.1'
implementation 'com.fasterxml.woodstox:woodstox-core:6.6.2'
// https://mvnrepository.com/artifact/org.json/json
implementation 'org.json:json:20230618'
implementation 'org.json:json:20240303' // This is used only in "ParquetFileUtils.createRemoteParquetDirectories()". TODO - Replace it with "gson".
// https://mvnrepository.com/artifact/com.google.code.gson/gson
implementation 'com.google.code.gson:gson:2.10.1'
// https://mvnrepository.com/artifact/io.micrometer/micrometer-registry-prometheus
runtimeOnly 'io.micrometer:micrometer-registry-prometheus:1.11.3'
runtimeOnly 'io.micrometer:micrometer-registry-prometheus:1.12.5'
testImplementation 'org.springframework.security:spring-security-test'
testImplementation "org.springframework.boot:spring-boot-starter-test"

gradle/wrapper/gradle-wrapper.jar (vendored binary file; not shown)

gradle/wrapper/gradle-wrapper.properties

@ -1,6 +1,6 @@
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-8.3-bin.zip
distributionUrl=https\://services.gradle.org/distributions/gradle-8.7-bin.zip
networkTimeout=10000
validateDistributionUrl=true
zipStoreBase=GRADLE_USER_HOME

gradlew (new vendored executable file, 249 lines)

@ -0,0 +1,249 @@
#!/bin/sh
#
# Copyright © 2015-2021 the original authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
##############################################################################
#
# Gradle start up script for POSIX generated by Gradle.
#
# Important for running:
#
# (1) You need a POSIX-compliant shell to run this script. If your /bin/sh is
# noncompliant, but you have some other compliant shell such as ksh or
# bash, then to run this script, type that shell name before the whole
# command line, like:
#
# ksh Gradle
#
# Busybox and similar reduced shells will NOT work, because this script
# requires all of these POSIX shell features:
# * functions;
# * expansions «$var», «${var}», «${var:-default}», «${var+SET}»,
# «${var#prefix}», «${var%suffix}», and «$( cmd )»;
# * compound commands having a testable exit status, especially «case»;
# * various built-in commands including «command», «set», and «ulimit».
#
# Important for patching:
#
# (2) This script targets any POSIX shell, so it avoids extensions provided
# by Bash, Ksh, etc; in particular arrays are avoided.
#
# The "traditional" practice of packing multiple parameters into a
# space-separated string is a well documented source of bugs and security
# problems, so this is (mostly) avoided, by progressively accumulating
# options in "$@", and eventually passing that to Java.
#
# Where the inherited environment variables (DEFAULT_JVM_OPTS, JAVA_OPTS,
# and GRADLE_OPTS) rely on word-splitting, this is performed explicitly;
# see the in-line comments for details.
#
# There are tweaks for specific operating systems such as AIX, CygWin,
# Darwin, MinGW, and NonStop.
#
# (3) This script is generated from the Groovy template
# https://github.com/gradle/gradle/blob/HEAD/subprojects/plugins/src/main/resources/org/gradle/api/internal/plugins/unixStartScript.txt
# within the Gradle project.
#
# You can find Gradle at https://github.com/gradle/gradle/.
#
##############################################################################
# Attempt to set APP_HOME
# Resolve links: $0 may be a link
app_path=$0
# Need this for daisy-chained symlinks.
while
APP_HOME=${app_path%"${app_path##*/}"} # leaves a trailing /; empty if no leading path
[ -h "$app_path" ]
do
ls=$( ls -ld "$app_path" )
link=${ls#*' -> '}
case $link in #(
/*) app_path=$link ;; #(
*) app_path=$APP_HOME$link ;;
esac
done
# This is normally unused
# shellcheck disable=SC2034
APP_BASE_NAME=${0##*/}
# Discard cd standard output in case $CDPATH is set (https://github.com/gradle/gradle/issues/25036)
APP_HOME=$( cd "${APP_HOME:-./}" > /dev/null && pwd -P ) || exit
# Use the maximum available, or set MAX_FD != -1 to use that value.
MAX_FD=maximum
warn () {
echo "$*"
} >&2
die () {
echo
echo "$*"
echo
exit 1
} >&2
# OS specific support (must be 'true' or 'false').
cygwin=false
msys=false
darwin=false
nonstop=false
case "$( uname )" in #(
CYGWIN* ) cygwin=true ;; #(
Darwin* ) darwin=true ;; #(
MSYS* | MINGW* ) msys=true ;; #(
NONSTOP* ) nonstop=true ;;
esac
CLASSPATH=$APP_HOME/gradle/wrapper/gradle-wrapper.jar
# Determine the Java command to use to start the JVM.
if [ -n "$JAVA_HOME" ] ; then
if [ -x "$JAVA_HOME/jre/sh/java" ] ; then
# IBM's JDK on AIX uses strange locations for the executables
JAVACMD=$JAVA_HOME/jre/sh/java
else
JAVACMD=$JAVA_HOME/bin/java
fi
if [ ! -x "$JAVACMD" ] ; then
die "ERROR: JAVA_HOME is set to an invalid directory: $JAVA_HOME
Please set the JAVA_HOME variable in your environment to match the
location of your Java installation."
fi
else
JAVACMD=java
if ! command -v java >/dev/null 2>&1
then
die "ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
Please set the JAVA_HOME variable in your environment to match the
location of your Java installation."
fi
fi
# Increase the maximum file descriptors if we can.
if ! "$cygwin" && ! "$darwin" && ! "$nonstop" ; then
case $MAX_FD in #(
max*)
# In POSIX sh, ulimit -H is undefined. That's why the result is checked to see if it worked.
# shellcheck disable=SC2039,SC3045
MAX_FD=$( ulimit -H -n ) ||
warn "Could not query maximum file descriptor limit"
esac
case $MAX_FD in #(
'' | soft) :;; #(
*)
# In POSIX sh, ulimit -n is undefined. That's why the result is checked to see if it worked.
# shellcheck disable=SC2039,SC3045
ulimit -n "$MAX_FD" ||
warn "Could not set maximum file descriptor limit to $MAX_FD"
esac
fi
# Collect all arguments for the java command, stacking in reverse order:
# * args from the command line
# * the main class name
# * -classpath
# * -D...appname settings
# * --module-path (only if needed)
# * DEFAULT_JVM_OPTS, JAVA_OPTS, and GRADLE_OPTS environment variables.
# For Cygwin or MSYS, switch paths to Windows format before running java
if "$cygwin" || "$msys" ; then
APP_HOME=$( cygpath --path --mixed "$APP_HOME" )
CLASSPATH=$( cygpath --path --mixed "$CLASSPATH" )
JAVACMD=$( cygpath --unix "$JAVACMD" )
# Now convert the arguments - kludge to limit ourselves to /bin/sh
for arg do
if
case $arg in #(
-*) false ;; # don't mess with options #(
/?*) t=${arg#/} t=/${t%%/*} # looks like a POSIX filepath
[ -e "$t" ] ;; #(
*) false ;;
esac
then
arg=$( cygpath --path --ignore --mixed "$arg" )
fi
# Roll the args list around exactly as many times as the number of
# args, so each arg winds up back in the position where it started, but
# possibly modified.
#
# NB: a `for` loop captures its iteration list before it begins, so
# changing the positional parameters here affects neither the number of
# iterations, nor the values presented in `arg`.
shift # remove old arg
set -- "$@" "$arg" # push replacement arg
done
fi
# Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
DEFAULT_JVM_OPTS='"-Xmx64m" "-Xms64m"'
# Collect all arguments for the java command:
# * DEFAULT_JVM_OPTS, JAVA_OPTS, JAVA_OPTS, and optsEnvironmentVar are not allowed to contain shell fragments,
# and any embedded shellness will be escaped.
# * For example: A user cannot expect ${Hostname} to be expanded, as it is an environment variable and will be
# treated as '${Hostname}' itself on the command line.
set -- \
"-Dorg.gradle.appname=$APP_BASE_NAME" \
-classpath "$CLASSPATH" \
org.gradle.wrapper.GradleWrapperMain \
"$@"
# Stop when "xargs" is not available.
if ! command -v xargs >/dev/null 2>&1
then
die "xargs is not available"
fi
# Use "xargs" to parse quoted args.
#
# With -n1 it outputs one arg per line, with the quotes and backslashes removed.
#
# In Bash we could simply go:
#
# readarray ARGS < <( xargs -n1 <<<"$var" ) &&
# set -- "${ARGS[@]}" "$@"
#
# but POSIX shell has neither arrays nor command substitution, so instead we
# post-process each arg (as a line of input to sed) to backslash-escape any
# character that might be a shell metacharacter, then use eval to reverse
# that process (while maintaining the separation between arguments), and wrap
# the whole thing up as a single "set" statement.
#
# This will of course break if any of these variables contains a newline or
# an unmatched quote.
#
eval "set -- $(
printf '%s\n' "$DEFAULT_JVM_OPTS $JAVA_OPTS $GRADLE_OPTS" |
xargs -n1 |
sed ' s~[^-[:alnum:]+,./:=@_]~\\&~g; ' |
tr '\n' ' '
)" '"$@"'
exec "$JAVACMD" "$@"

gradlew.bat (new vendored file, 92 lines)

@ -0,0 +1,92 @@
@rem
@rem Copyright 2015 the original author or authors.
@rem
@rem Licensed under the Apache License, Version 2.0 (the "License");
@rem you may not use this file except in compliance with the License.
@rem You may obtain a copy of the License at
@rem
@rem https://www.apache.org/licenses/LICENSE-2.0
@rem
@rem Unless required by applicable law or agreed to in writing, software
@rem distributed under the License is distributed on an "AS IS" BASIS,
@rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@rem See the License for the specific language governing permissions and
@rem limitations under the License.
@rem
@if "%DEBUG%"=="" @echo off
@rem ##########################################################################
@rem
@rem Gradle startup script for Windows
@rem
@rem ##########################################################################
@rem Set local scope for the variables with windows NT shell
if "%OS%"=="Windows_NT" setlocal
set DIRNAME=%~dp0
if "%DIRNAME%"=="" set DIRNAME=.
@rem This is normally unused
set APP_BASE_NAME=%~n0
set APP_HOME=%DIRNAME%
@rem Resolve any "." and ".." in APP_HOME to make it shorter.
for %%i in ("%APP_HOME%") do set APP_HOME=%%~fi
@rem Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
set DEFAULT_JVM_OPTS="-Xmx64m" "-Xms64m"
@rem Find java.exe
if defined JAVA_HOME goto findJavaFromJavaHome
set JAVA_EXE=java.exe
%JAVA_EXE% -version >NUL 2>&1
if %ERRORLEVEL% equ 0 goto execute
echo. 1>&2
echo ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH. 1>&2
echo. 1>&2
echo Please set the JAVA_HOME variable in your environment to match the 1>&2
echo location of your Java installation. 1>&2
goto fail
:findJavaFromJavaHome
set JAVA_HOME=%JAVA_HOME:"=%
set JAVA_EXE=%JAVA_HOME%/bin/java.exe
if exist "%JAVA_EXE%" goto execute
echo. 1>&2
echo ERROR: JAVA_HOME is set to an invalid directory: %JAVA_HOME% 1>&2
echo. 1>&2
echo Please set the JAVA_HOME variable in your environment to match the 1>&2
echo location of your Java installation. 1>&2
goto fail
:execute
@rem Setup the command line
set CLASSPATH=%APP_HOME%\gradle\wrapper\gradle-wrapper.jar
@rem Execute Gradle
"%JAVA_EXE%" %DEFAULT_JVM_OPTS% %JAVA_OPTS% %GRADLE_OPTS% "-Dorg.gradle.appname=%APP_BASE_NAME%" -classpath "%CLASSPATH%" org.gradle.wrapper.GradleWrapperMain %*
:end
@rem End local scope for the variables with windows NT shell
if %ERRORLEVEL% equ 0 goto mainEnd
:fail
rem Set variable GRADLE_EXIT_CONSOLE if you need the _script_ return code instead of
rem the _cmd.exe /c_ return code!
set EXIT_CODE=%ERRORLEVEL%
if %EXIT_CODE% equ 0 set EXIT_CODE=1
if not ""=="%GRADLE_EXIT_CONSOLE%" exit %EXIT_CODE%
exit /b %EXIT_CODE%
:mainEnd
if "%OS%"=="Windows_NT" endlocal
:omega

installAndRun.sh

@ -9,26 +9,26 @@ handle_error () {
# Change the working directory to the script's directory, when running from another location.
cd "${0%/*}" || handle_error "Could not change-dir to this script's dir!" 1
justInstall=0
justRun=0
shouldRunInDocker=0
if [[ $# -eq 1 ]]; then
justInstall=$1
justRun=$1
elif [[ $# -eq 2 ]]; then
justInstall=$1
justRun=$1
shouldRunInDocker=$2
elif [[ $# -gt 2 ]]; then
echo -e "Wrong number of arguments given: ${#}\nPlease execute it like: script.sh <justInstall: 0 | 1> <shouldRunInDocker: 0 | 1>"; exit 2
echo -e "Wrong number of arguments given: ${#}\nPlease execute it like: installAndRun.sh <justRun: 0 | 1> <shouldRunInDocker: 0 | 1>"; exit 2
fi
if [[ justInstall -eq 1 && shouldRunInDocker -eq 1 ]]; then
echo -e "Cannot run in docker without re-building the project (just to be safe). Setting \"justInstall\" to < 0 >"
justInstall=0
if [[ justRun -eq 1 && shouldRunInDocker -eq 1 ]]; then
echo -e "Cannot run in docker without re-building the project (just to be safe). Setting \"justRun\" to < 0 >"
justRun=0
fi
gradleVersion="8.3"
gradleVersion="8.7"
if [[ justInstall -eq 0 ]]; then
if [[ justRun -eq 0 ]]; then
if [[ ! -d /opt/gradle/gradle-${gradleVersion} ]]; then
wget https://services.gradle.org/distributions/gradle-${gradleVersion}-bin.zip
@ -46,7 +46,7 @@ if [[ justInstall -eq 0 ]]; then
#gradle tasks # For debugging installation
#gradle -v # For debugging installation
gradle clean build
gradle clean build # --refresh-dependencies # --info
if [[ shouldRunInDocker -eq 1 ]]; then
@ -64,11 +64,14 @@ if [[ justInstall -eq 0 ]]; then
(sudo docker compose -f prometheus/docker-compose-prometheus.yml up --build -d && echo -e "\nThe Prometheus and Grafana docker-containers started running.\n") || handle_error "Could not build and/or run the 'prometheus' and/or 'grafana' docker containers!" 5
fi
echo -e "Waiting 65 seconds before getting the status..\n"
sleep 65
echo -e "Waiting 20 seconds before getting the status..\n"
sleep 20
# 20 seconds are enough to check if there is an immediate fatal error with one of the docker images. This will cover the problematic configuration case for Prometheus and Grafana containers.
sudo docker ps -a || handle_error "Could not get the status of docker-containers!" 6 # Using -a to get the status of failed containers as well.
echo -e "\n\nGetting the logs of docker-container \"urls_controller\":\n"
sudo docker logs "$(sudo docker ps -aqf "name=^urls_controller$")" || handle_error "Could not get the logs of docker-container \"urls_controller\"!" 7 # Using "regex anchors" to avoid false-positives. Works even if the container is not running, thus showing the error-log.
sudo docker logs -f urls_controller || handle_error "Could not get the logs of docker-container \"urls_controller\"!" 7 # Works even if the container is not running, thus showing the error-log.
# Use just the container-name and the "-f" parameter to indicate that we want to follow the log updates, until we choose to stop following them (with ctrl+c).
# This way we do not need to run the "docker logs" command again and again, nor to check the container-id each time.
fi
else
export PATH=/opt/gradle/gradle-${gradleVersion}/bin:$PATH # Make sure the gradle is still accessible (it usually isn't without the "export").

shutdown script (old version, removed; see shutdownService.sh below)

@ -1,11 +0,0 @@
# This script shuts down (ONLY!) the Controller, by stopping and killing the related containers.
# It is used during testing.
# It does not shuts down the whole service! The workers will keep running and their work will be lost.
echo "Running compose down.."
sudo docker compose -f docker-compose.yml down
sudo docker compose -f ./prometheus/docker-compose-prometheus.yml down
# In case we need to hard-remove the containers, use the following commands:
#sudo docker stop $(sudo docker ps -aqf "name=^(?:urlscontroller-urls_controller|prometheus-(?:prometheus|grafana))-1$") || true # There may be no active containers
#sudo docker rm $(sudo docker ps -aqf "name=^(?:urlscontroller-urls_controller|prometheus-(?:prometheus|grafana))-1$") || true # All containers may be already removed.

shutdownService.sh (new executable file, 40 lines)

@ -0,0 +1,40 @@
# This script shuts down (ONLY!) the Controller, by stopping and killing the related containers.
# It is used during testing.
# It does not shut down the whole service! The workers will keep running and their work will be lost.
# For error-handling, we cannot use the "set -e" since: it has problems https://mywiki.wooledge.org/BashFAQ/105
# So we have our own function, for use when a single command fails.
handle_error () {
echo -e "\n\n$1\n\n"; exit $2
}
# Change the working directory to the script's directory, when running from another location.
cd "${0%/*}" || handle_error "Could not change-dir to this script's dir!" 1
forceControllerShutdown=0
if [[ $# -eq 1 ]]; then
forceControllerShutdown=$1
elif [[ $# -gt 1 ]]; then
echo -e "Wrong number of arguments given: ${#}\nPlease execute it like: shutdownService.sh <forceControllerShutdown: 0 | 1>"; exit 2
fi
# We may have no arguments, if we do not want to force the Controller to shutdown.
# Shutdown Prometheus, if it's running.
sudo docker compose -f ./prometheus/docker-compose-prometheus.yml down
if [[ forceControllerShutdown -eq 1 ]]; then
echo "Shutting down the Controller.."
sudo docker compose -f docker-compose.yml down
else
echo "Shutting down the Service.."
sudo apt install -y curl
curl -X POST -i 'http://localhost:1880/api/shutdownService'
# Follow the logs until shutdown.
sudo docker logs -f urls_controller || handle_error "Could not get the logs of docker-container \"urls_controller\"!" 3 # Works even if the container is not running, thus showing the error-log.
fi
# In case we need to hard-remove the containers, use the following commands:
#sudo docker stop $(sudo docker ps -aqf "name=^(?:urlscontroller-urls_controller|prometheus-(?:prometheus|grafana))-1$") || true # There may be no active containers
#sudo docker rm $(sudo docker ps -aqf "name=^(?:urlscontroller-urls_controller|prometheus-(?:prometheus|grafana))-1$") || true # All containers may be already removed.

UrlsControllerApplication.java (renamed from Application.java)

@ -1,5 +1,6 @@
package eu.openaire.urls_controller;
import com.zaxxer.hikari.HikariDataSource;
import eu.openaire.urls_controller.controllers.BulkImportController;
import eu.openaire.urls_controller.controllers.UrlsController;
import eu.openaire.urls_controller.services.UrlsServiceImpl;
@ -9,6 +10,7 @@ import io.micrometer.core.aop.TimedAspect;
import io.micrometer.core.instrument.MeterRegistry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
@ -29,15 +31,18 @@ import java.util.concurrent.TimeUnit;
@SpringBootApplication
@EnableScheduling
public class Application {
public class UrlsControllerApplication {
private static final Logger logger = LoggerFactory.getLogger(Application.class);
private static final Logger logger = LoggerFactory.getLogger(UrlsControllerApplication.class);
@Autowired
HikariDataSource hikariDataSource;
private static ConfigurableApplicationContext context;
public static void main(String[] args) {
context = SpringApplication.run(Application.class, args);
context = SpringApplication.run(UrlsControllerApplication.class, args);
}
@Bean
@ -53,8 +58,10 @@ public class Application {
}
public static void gentleAppShutdown()
public void gentleAppShutdown()
{
logger.info("Shutting down the app..");
shutdownThreads();
int exitCode = 0;
try {
exitCode = SpringApplication.exit(context, () -> 0); // The "PreDestroy" method will be called. (the "context" will be closed automatically (I checked it))
@ -65,34 +72,55 @@ public class Application {
}
private boolean haveThreadsShutdown = false;
@PreDestroy
public void preDestroy() {
public void shutdownThreads()
{
// Normally this method will have already been executed by "gentleAppShutdown()", just before "PreDestroy" calls this method again.
// BUT, in case the service was shut down by the OS, without the use of the "ShutdownService"-API, then this will be the 1st time this method is called.
if ( haveThreadsShutdown )
return;
logger.info("Shutting down the threads..");
shutdownThreads(UrlsServiceImpl.insertsExecutor);
shutdownThreads(FileUtils.hashMatchingExecutor);
shutdownThreads(UrlsController.backgroundExecutor);
shutdownThreads(BulkImportController.bulkImportExecutor);
shutdownThreadsForExecutorService(UrlsServiceImpl.insertsExecutor);
shutdownThreadsForExecutorService(FileUtils.hashMatchingExecutor);
shutdownThreadsForExecutorService(BulkImportController.bulkImportExecutor);
shutdownThreadsForExecutorService(UrlsController.backgroundExecutor);
haveThreadsShutdown = true;
// For some reason the Hikari Datasource cannot close properly by Spring Boot, unless we explicitly call close here.
hikariDataSource.close();
logger.info("Exiting..");
}
private void shutdownThreads(ExecutorService executorService)
private void shutdownThreadsForExecutorService(ExecutorService executorService) throws RuntimeException
{
executorService.shutdown(); // Define that no new tasks will be scheduled.
try {
if ( ! executorService.awaitTermination(2, TimeUnit.MINUTES) ) {
logger.warn("The working threads did not finish on time! Stopping them immediately..");
executorService.shutdownNow();
// Wait a while for tasks to respond to being cancelled (thus terminated).
if ( ! executorService.awaitTermination(1, TimeUnit.MINUTES) )
logger.warn("The executor " + executorService + " could not be terminated!");
}
} catch (SecurityException se) {
logger.error("Could not shutdown the threads in any way..!", se);
} catch (InterruptedException ie) {
try {
executorService.shutdownNow();
// Wait a while for tasks to respond to being cancelled (thus terminated).
if ( ! executorService.awaitTermination(1, TimeUnit.MINUTES) )
logger.warn("The executor " + executorService + " could not be terminated!");
} catch (SecurityException se) {
logger.error("Could not shutdown the threads in any way..!", se);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
}
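
The renamed "shutdownThreadsForExecutorService()" follows the standard two-phase ExecutorService shutdown idiom: stop accepting tasks, wait, force-cancel, wait again. A minimal, self-contained sketch of that idiom (class name, pool size and timeouts here are illustrative, not taken from the project):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class GracefulShutdownSketch {

    // Stop accepting new tasks, wait for running ones, then force-cancel any stragglers.
    static void shutdownGracefully(ExecutorService executor) {
        executor.shutdown(); // No new tasks will be accepted.
        try {
            if ( !executor.awaitTermination(2, TimeUnit.MINUTES) ) {
                executor.shutdownNow(); // Interrupt still-running tasks.
                if ( !executor.awaitTermination(1, TimeUnit.MINUTES) )
                    System.err.println("The executor " + executor + " could not be terminated!");
            }
        } catch (InterruptedException ie) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // Preserve the interrupt status.
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        executor.submit(() -> System.out.println("work"));
        shutdownGracefully(executor);
    }
}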

View File

@ -65,16 +65,16 @@ public class BulkImport {
public static class BulkImportSource {
String datasourceID;
String datasourcePrefix;
String pdfUrlPrefix;
String mimeType;
private String datasourceID;
private String datasourcePrefix;
private String pdfUrlPrefix;
private String mimeType;
private boolean isAuthoritative;
public BulkImportSource() {
}
public String getDatasourceID() {
return datasourceID;
}
@ -107,6 +107,13 @@ public class BulkImport {
this.mimeType = mimeType;
}
public boolean getIsAuthoritative() {
return isAuthoritative;
}
public void setIsAuthoritative(boolean isAuthoritative) {
this.isAuthoritative = isAuthoritative;
}
@Override
public String toString() {
@ -115,6 +122,7 @@ public class BulkImport {
", datasourcePrefix='" + datasourcePrefix + '\'' +
", pdfUrlPrefix='" + pdfUrlPrefix + '\'' +
", mimeType='" + mimeType + '\'' +
", isAuthoritative=" + isAuthoritative +
'}';
}
}

View File

@ -2,11 +2,13 @@ package eu.openaire.urls_controller.components;
import com.google.gson.Gson;
import com.google.gson.JsonSyntaxException;
import eu.openaire.urls_controller.Application;
import eu.openaire.urls_controller.UrlsControllerApplication;
import eu.openaire.urls_controller.configuration.DatabaseConnector;
import eu.openaire.urls_controller.controllers.BulkImportController;
import eu.openaire.urls_controller.controllers.ShutdownController;
import eu.openaire.urls_controller.controllers.StatsController;
import eu.openaire.urls_controller.controllers.UrlsController;
import eu.openaire.urls_controller.models.WorkerInfo;
import eu.openaire.urls_controller.payloads.requests.WorkerReport;
import eu.openaire.urls_controller.services.UrlsServiceImpl;
import eu.openaire.urls_controller.util.FileUtils;
@ -23,11 +25,8 @@ import org.springframework.stereotype.Component;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.text.DecimalFormat;
import java.util.*;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
@ -47,6 +46,9 @@ public class ScheduledTasks {
@Autowired
UrlsServiceImpl urlsService;
@Autowired
private UrlsControllerApplication urlsControllerApplication;
private final StatsController statsController;
private final UrlsController urlsController;
@ -54,6 +56,8 @@ public class ScheduledTasks {
@Value("${services.pdfaggregation.controller.assignmentLimit}")
private int assignmentsLimit;
public static final DecimalFormat df = new DecimalFormat("0.00");
private final String workerReportsDirPath;
public static final AtomicInteger numOfAllPayloads = new AtomicInteger(0);
@ -69,7 +73,6 @@ public class ScheduledTasks {
{
if ( !workerReportsDirPath.endsWith("/") )
workerReportsDirPath += "/";
this.workerReportsDirPath = workerReportsDirPath; // This dir will be created later.
this.statsController = statsController;
this.urlsController = urlsController;
@ -87,40 +90,52 @@ public class ScheduledTasks {
@Scheduled(initialDelay = 1_800_000, fixedDelay = 900_000) // Execute this method 30 mins from the start and 15 mins after each last execution, in order for some tasks to have been gathered.
//@Scheduled(initialDelay = 60_000, fixedDelay = 20_000) // Just for testing (every 20 secs).
// The initial delay is larger, because we have to wait some time for at least one worker to finish retrieving the full-texts from thousands of publications, whereas, later we will have a lot of workerReports waiting to be processed.
public void executeBackgroundTasks()
public void checkResultsOfBackgroundTasks()
{
List<Callable<Boolean>> tempList = new ArrayList<>(UrlsController.backgroundCallableTasks); // Copy the list in order to know what was executed.
// So the items added while this execution happens, will remain in the global-list, while the others will have already been deleted.
// Also, the "Executor.invokeAll()" requires an "unchanged" list, otherwise there will be "undefined results".
int numOfTasks = tempList.size(); // Since the temp-list is a deep-copy and not a reference, new tasks that are added will not be executed.
if ( numOfTasks == 0 )
int sizeOfFutures = UrlsController.futuresOfBackgroundTasks.size();
if ( sizeOfFutures == 0 )
return;
// Immediately delete the selected tasks form the global list, so that if these tasks are not finished before the scheduler runs again, they will not be re-processed.
for ( Callable<Boolean> selectedTask : tempList )
UrlsController.backgroundCallableTasks.remove(selectedTask);
logger.debug("Going to check the results of " + sizeOfFutures + " background tasks.");
logger.debug(numOfTasks + " background tasks were found inside the \"backgroundCallableTasks\" list and are about to be executed.");
// Execute the tasks and wait for them to finish.
try {
List<Future<Boolean>> futures = UrlsController.backgroundExecutor.invokeAll(tempList);
int sizeOfFutures = futures.size();
for ( int i = 0; i < sizeOfFutures; ++i ) {
try {
Boolean value = futures.get(i).get(); // Get and see if an exception is thrown..
// Add check for the value, if wanted.. (we don't care at the moment)
} catch (ExecutionException ee) {
String stackTraceMessage = GenericUtils.getSelectiveStackTrace(ee, null, 15); // These can be serious errors like an "out of memory exception" (Java HEAP).
logger.error("Task_" + (i+1) + " failed with: " + ee.getMessage() + "\n" + stackTraceMessage);
} catch (CancellationException ce) {
logger.error("Task_" + (i+1) + " was cancelled: " + ce.getMessage());
} catch (IndexOutOfBoundsException ioobe) {
logger.error("IOOBE for task_" + i + " in the futures-list! " + ioobe.getMessage());
}
// Calling ".get()" on each future will block the current scheduling thread until a result is returned (if the callable task succeeded or failed).
// So this scheduled-task will not be executed again until all current callableTasks have finished executing.
int numFailedTasks = 0;
List<Future<Boolean>> futuresToDelete = new ArrayList<>(sizeOfFutures);
for ( int i=0; i < sizeOfFutures; ++i ) { // Check the futures up to the current "sizeOfFutures". The size of the list may increase while we check, but here we will check only the first e.g. 10 futures.
Future<Boolean> future = null;
try {
future = UrlsController.futuresOfBackgroundTasks.get(i);
if ( ! future.get() ) // Get and see if an exception is thrown. This blocks the current thread, until the task of the future has finished.
numFailedTasks ++;
} catch (ExecutionException ee) { // These can be serious errors like an "out of memory exception" (Java HEAP).
logger.error(GenericUtils.getSelectedStackTraceForCausedException(ee, "Background_task_" + i + " failed with: ", null, 30));
numFailedTasks ++;
} catch (CancellationException ce) {
logger.error("Background_task_" + i + " was cancelled: " + ce.getMessage());
numFailedTasks ++;
} catch (InterruptedException ie) {
logger.error("Background_task_" + i + " was interrupted: " + ie.getMessage());
numFailedTasks ++;
} catch (IndexOutOfBoundsException ioobe) {
logger.error("IOOBE for background_task_" + i + " in the futures-list! " + ioobe.getMessage());
// Only here, the "future" will be null.
} finally {
if ( future != null ) // It may be null in case we have a IOBE.
futuresToDelete.add(future); // Do not delete them directly here, as the indexes will get messed up and we will get "IOOBE".
}
} catch (Exception e) {
logger.error("", e);
}
for ( Future<Boolean> future : futuresToDelete )
UrlsController.futuresOfBackgroundTasks.remove(future); // We do not need it anymore. This is the safest way to delete them without removing newly added futures as well.
if ( numFailedTasks > 0 )
logger.warn(numFailedTasks + " out of " + sizeOfFutures + " background tasks have failed!");
else if ( logger.isTraceEnabled() )
logger.trace("All of the " + sizeOfFutures + " background tasks have succeeded.");
}
@ -132,23 +147,44 @@ public class ScheduledTasks {
return; // Either the service was never instructed to shut down, or the user canceled the request.
// Check whether there are still background tasks to be processed. Either workerReport or Bulk-import requests.
if ( UrlsController.backgroundCallableTasks.size() > 0 )
int numOfFutures = UrlsController.futuresOfBackgroundTasks.size();
if ( numOfFutures > 0 ) {
logger.debug("There are still " + numOfFutures + " backgroundTasks waiting to be executed or have their status checked..");
return;
}
// Here, the check above may have returned < 0 >, but a new task may be submitted for execution right afterwards and still be awaiting execution..
// The crawling-jobs can safely finish, since we avoid shutting down as long as at least one worker is still running (waiting for the Controller to verify that its assignments-batch is completed).
// The bulk-import procedures have their bulkImport DIR registered in the "bulkImportDirsUnderProcessing", before their task is submitted for execution.
// So the Controller will not shut down as long as either of these task-types has not finished.
// Check whether there are any active bulk-import procedures.
int numOfBulkImportDirsUnderProcessing = BulkImportController.bulkImportDirsUnderProcessing.size();
if ( numOfBulkImportDirsUnderProcessing > 0 ) {
logger.debug("There are still " + numOfBulkImportDirsUnderProcessing + " bulkImportDirsUnderProcessing..");
return;
}
// Check whether the workers have not shutdown yet, which means that they are either crawling assignments and/or waiting for the Controller to process the WorkerReport, before shutting down themselves.
Set<String> workerIds = UrlsController.workersInfoMap.keySet();
if ( workerIds.size() > 0 ) {
for ( String workerId : workerIds )
if ( ! UrlsController.workersInfoMap.get(workerId).getHasShutdown() ) // The workerId is certainly inside the map and has a workerInfo value.
for ( String workerId : workerIds ) {
WorkerInfo workerInfo = UrlsController.workersInfoMap.get(workerId); // The workerId is certainly inside the map and has a workerInfo value.
if ( ! workerInfo.getHasShutdown() ) {
logger.debug("At least one worker (with IP: " + workerInfo.getWorkerIP() + ") is still active. Waiting for all workers to shutdown, before shutting down the service.");
return; // If at least 1 worker is still active, then do not shut down the Controller.
}
}
logger.info("All workers have already shutdown. Shutting down the Controller..");
} else
logger.info("No workers have participated in the service yet, so the Controller will shut-down immediately.");
logger.info("No workers have participated in the service yet. The Controller will shut-down now.");
// If one worker has crashed, then it will have not informed the Controller. So the controller will think that it is still running and will not shut down.
// IMPORTANT: If one worker has crashed, then it will not have informed the Controller. So the Controller will think that it is still running and will not shut down..!
// In this case, we have to manually shut-down the service, from Docker cli.
// Any left-over worker-reports are kept to be retried next time the Controller starts.
Application.gentleAppShutdown();
urlsControllerApplication.gentleAppShutdown();
}
@ -162,16 +198,16 @@ public class ScheduledTasks {
{
// We make sure an initial delay of some minutes is in place before this is executed, since we have to make sure all workers are up and running in order for them to be able to answer the full-texts-requests.
if ( UrlsController.numOfWorkers.get() == 0 ) {
if ( UrlsController.numOfActiveWorkers.get() == 0 ) {
long timeToWait = (isTestEnvironment ? 1_200_000 : 43_200_000); // 10 mins | 12 hours
logger.warn("None of the workers have participated in the service yet. Will wait " + ((timeToWait /1000) /60) + " minutes and try again..");
logger.warn("None of the workers is participating in the service, at the moment. Will wait " + ((timeToWait /1000) /60) + " minutes and try again..");
try {
Thread.sleep(timeToWait);
} catch (InterruptedException ie) {
logger.warn("The wait-period was interrupted! Will try either way.");
}
if ( UrlsController.numOfWorkers.get() == 0 ) {
logger.error("None of the workers have participated in the service yet again. Will not process any leftover workerReports!");
if ( UrlsController.numOfActiveWorkers.get() == 0 ) {
logger.error("None of the workers is participating in the service, yet again. Will not process any leftover workerReports!");
return;
}
}
@ -207,6 +243,25 @@ public class ScheduledTasks {
}
@Scheduled(initialDelay = 259_200_000, fixedDelay = 259_200_000) // Run every 3 days. 3 days after startup.
//@Scheduled(initialDelay = 1_200_000, fixedDelay = 1_200_000) // Just for testing (every 1200 secs).
public void checkAndDeleteUnhandledAssignments()
{
// Remove the assignments having a "date" older than 3 days.
// For some reason, sometimes, the worker cannot receive the assignments in time.
// In this case, no job is created for those assignments and no workerReport gets stored in storage.
// The assignments just remain in the table, and the urls cannot be rechecked.
Calendar calendar = Calendar.getInstance();
calendar.add(Calendar.DAY_OF_MONTH, -3); // Subtract 3 from current Date.
DatabaseConnector.databaseLock.lock();
urlsService.deleteAssignmentsWithOlderDate(calendar.getTimeInMillis()); // Any error-log is written inside.
DatabaseConnector.databaseLock.unlock();
}
// Scheduled Metrics for Prometheus.
// Prometheus scrapes for metrics usually every 15 seconds, but that is an extremely short time-period for DB-statistics.
@ -246,7 +301,7 @@ public class ScheduledTasks {
} // Any error is already logged.
// TODO - Export more complex data; <numOfAllPayloadsPerDatasource>, <numOfAllPayloadsPerYear>,
// TODO - Export more complex data; <numOfAllPayloadsPerDatasource>, <numOfAllPayloadsPerPublicationYear>,
// <numOfAggregatedPayloadsPerDatasource>, ..., <numOfBulkImportedPayloadsPerDatasource>, ...
}
@ -254,7 +309,6 @@ public class ScheduledTasks {
enum ActionForWorkerReports {process_previous_failed, process_current_failed, delete_old}
// TODO - Maybe make these numbers configurable from the "application.yml" file.
private static final double daysToWaitBeforeDeletion = 7.0;
private static final double daysToWaitBeforeProcessing = 0.5; // 12 hours
@ -267,6 +321,8 @@ public class ScheduledTasks {
private static StringBuilder jsonStringBuilder = null;
private static final int daysDivisor = (1000 * 60 * 60 * 24); // In order to get the time-difference in days. We divide with: /1000 to get seconds, /60 to get minutes, /60 to get hours and /24 to get days.
private static final double daysToWaitBeforeDeletion = 7.0;
private void inspectWorkerReportsAndTakeAction(ActionForWorkerReports actionForWorkerReports)
{
@ -293,7 +349,7 @@ public class ScheduledTasks {
for ( File workerReportSubDir : workerReportSubDirs )
{
File[] workerReportFiles = workerReportSubDir.listFiles(File::isFile);
if (workerReportFiles == null) {
if ( workerReportFiles == null ) {
logger.error("There was an error when getting the workerReports of \"workerReportSubDir\": " + workerReportSubDir);
return;
} else if (workerReportFiles.length == 0) {
@ -329,10 +385,10 @@ public class ScheduledTasks {
} else { // Deletion..
if ( elapsedDays > daysToWaitBeforeDeletion ) {
// Enough time has passed, the directory should be deleted immediately.
logger.warn("The workerReport \"" + workerReportName + "\" was accessed " + elapsedDays + " days ago (passed the " + daysToWaitBeforeDeletion + " days limit) and will be deleted.");
logger.warn("The workerReport \"" + workerReportName + "\" was accessed " + df.format(elapsedDays) + " days ago (passed the " + daysToWaitBeforeDeletion + " days limit) and will be deleted.");
numWorkerReportsToBeHandled ++;
if ( fileUtils.deleteFile(workerReportFile.getAbsolutePath()) // Either successful or failed.
&& workerReportName.contains("failed") // If this has failed, then delete the assignment-records. For the successful, they have already been deleted.
&& !workerReportName.contains("successful") // If this has failed or its state is unknown (it was never renamed), then delete the assignment-records. For the successful, they have already been deleted.
&& extractAssignmentsCounterAndDeleteRelatedAssignmentRecords(workerReportName) )
numWorkerReportsHandled ++;
}
@ -375,13 +431,13 @@ public class ScheduledTasks {
private boolean processFailedWorkerReport(File workerReportFile, String workerReportName)
{
logger.debug("Going to load and parse the workerReport: " + workerReportName);
logger.debug("Going to load and parse the failed workerReport: " + workerReportName);
// Load the file's json content into a "WorkerReport" object.
try ( BufferedReader bfRead = new BufferedReader(new FileReader(workerReportFile)) ) { // The default size is sufficient here.
try ( BufferedReader bfRead = new BufferedReader(new FileReader(workerReportFile), FileUtils.halfMb) ) {
String line;
while ( (line = bfRead.readLine()) != null ) // The line, without any line-termination-characters.
jsonStringBuilder.append(line).append("\n");
jsonStringBuilder.append(line).append(GenericUtils.endOfLine);
} catch (Exception e) {
logger.error("Problem when acquiring the contents of workerReport \"" + workerReportName + "\"");
jsonStringBuilder.setLength(0); // Reset the StringBuilder without de-allocating.
@ -431,6 +487,8 @@ public class ScheduledTasks {
return false;
}
logger.debug("Will delete the assignments of the old, not-successful, workerReport: " + workerReportName);
DatabaseConnector.databaseLock.lock();
urlsService.deleteAssignmentsBatch(curReportAssignmentsCounter); // Any error-log is written inside.
DatabaseConnector.databaseLock.unlock();
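
The new "checkResultsOfBackgroundTasks()" inspects only the futures present when the run starts and removes them afterwards by object identity, so futures added concurrently stay in the list for the next run. A condensed, self-contained sketch of that pattern (logging and the special IOOBE handling omitted; names are illustrative):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureCheckSketch {

    static final List<Future<Boolean>> futuresOfBackgroundTasks = Collections.synchronizedList(new ArrayList<>());

    static void checkFinishedTasks() {
        int sizeOfFutures = futuresOfBackgroundTasks.size(); // Snapshot; futures added after this point are checked on the next run.
        List<Future<Boolean>> futuresToDelete = new ArrayList<>(sizeOfFutures);
        int numFailedTasks = 0;
        for ( int i = 0; i < sizeOfFutures; ++i ) {
            Future<Boolean> future = futuresOfBackgroundTasks.get(i);
            try {
                if ( ! future.get() )   // Blocks until this background task has finished.
                    numFailedTasks++;
            } catch (ExecutionException | CancellationException e) {
                numFailedTasks++;
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // Preserve the interrupt status.
                numFailedTasks++;
            }
            futuresToDelete.add(future); // Defer removal, so the loop indexes are not shifted mid-iteration.
        }
        futuresOfBackgroundTasks.removeAll(futuresToDelete); // Remove only the futures that were actually inspected.
        System.out.println(numFailedTasks + " out of " + sizeOfFutures + " tasks failed.");
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        futuresOfBackgroundTasks.add(executor.submit(() -> true));
        futuresOfBackgroundTasks.add(executor.submit(() -> false));
        checkFinishedTasks();
        executor.shutdown();
    }
}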

View File

@ -58,23 +58,29 @@ public class DatabaseConnector {
logger.info("Going to create (if not exist) the TEST-database \"" + testDatabaseName + "\" and its tables. Also will fill some tables with data from the initial-database \"" + initialDatabaseName + "\".");
jdbcTemplate.execute("CREATE DATABASE IF NOT EXISTS " + testDatabaseName);
jdbcTemplate.execute("CREATE TABLE IF NOT EXISTS " + testDatabaseName + ".publication stored as parquet as select * from " + initialDatabaseName + ".publication");
jdbcTemplate.execute("COMPUTE STATS " + testDatabaseName + ".publication");
try { // Metastore takes some time to recognize the DB has been created, in order to use it later..
Thread.sleep(1000);
} catch (InterruptedException ignore) {}
jdbcTemplate.execute("CREATE TABLE IF NOT EXISTS " + testDatabaseName + ".publication_pids stored as parquet as select * from " + initialDatabaseName + ".publication_pids");
jdbcTemplate.execute("COMPUTE STATS " + testDatabaseName + ".publication_pids");
jdbcTemplate.update("INVALIDATE METADATA");
jdbcTemplate.execute("CREATE TABLE IF NOT EXISTS " + testDatabaseName + ".publication_urls stored as parquet as select * from " + initialDatabaseName + ".publication_urls");
jdbcTemplate.execute("COMPUTE STATS " + testDatabaseName + ".publication_urls");
try { // Metastore takes some time to recognize the DB has been created, in order to use it later..
Thread.sleep(1000);
} catch (InterruptedException ignore) {}
jdbcTemplate.execute("CREATE TABLE IF NOT EXISTS " + testDatabaseName + ".publication_boost stored as parquet as select * from " + initialDatabaseName + ".publication_boost");
jdbcTemplate.execute("COMPUTE STATS " + testDatabaseName + ".publication_boost");
// Create VIEWs of the original data. We just READ from it, so it's safe for our testing environment..
jdbcTemplate.execute("CREATE TABLE IF NOT EXISTS " + testDatabaseName + ".datasource stored as parquet as select * from " + initialDatabaseName + ".datasource");
jdbcTemplate.execute("COMPUTE STATS " + testDatabaseName + ".datasource");
jdbcTemplate.execute("CREATE VIEW IF NOT EXISTS " + testDatabaseName + ".publication as select * from " + initialDatabaseName + ".publication");
jdbcTemplate.execute("CREATE TABLE IF NOT EXISTS " + testDatabaseName + ".payload_legacy stored as parquet as select * from " + initialDatabaseName + ".payload_legacy");
jdbcTemplate.execute("COMPUTE STATS " + testDatabaseName + ".payload_legacy");
jdbcTemplate.execute("CREATE VIEW IF NOT EXISTS " + testDatabaseName + ".publication_pids as select * from " + initialDatabaseName + ".publication_pids");
jdbcTemplate.execute("CREATE VIEW IF NOT EXISTS " + testDatabaseName + ".publication_urls as select * from " + initialDatabaseName + ".publication_urls");
jdbcTemplate.execute("CREATE VIEW IF NOT EXISTS " + testDatabaseName + ".publication_boost as select * from " + initialDatabaseName + ".publication_boost");
jdbcTemplate.execute("CREATE VIEW IF NOT EXISTS " + testDatabaseName + ".datasource as select * from " + initialDatabaseName + ".datasource");
jdbcTemplate.execute("CREATE VIEW IF NOT EXISTS " + testDatabaseName + ".payload_legacy as select * from " + initialDatabaseName + ".payload_legacy");
databaseName = testDatabaseName;
} else {
@ -107,9 +113,21 @@ public class DatabaseConnector {
jdbcTemplate.execute("CREATE TABLE IF NOT EXISTS " + databaseName + ".payload_bulk_import (id string, original_url string, actual_url string, `date` bigint, mimetype string, size string, `hash` string, `location` string, provenance string) stored as parquet");
jdbcTemplate.execute("COMPUTE STATS " + databaseName + ".payload_bulk_import");
jdbcTemplate.execute("CREATE VIEW IF NOT EXISTS " + databaseName + ".payload " +
"AS SELECT * from " + databaseName + ".payload_legacy " +
"UNION ALL SELECT * FROM " + databaseName +".payload_aggregated " +
try { // Metastore takes some time to recognize the tables have been created, in order to use them in the view.
Thread.sleep(1000);
} catch (InterruptedException ignore) {}
jdbcTemplate.update("INVALIDATE METADATA");
try { // Metastore takes some time to recognize the tables have been created, in order to use them in the view.
Thread.sleep(1000);
} catch (InterruptedException ignore) {}
jdbcTemplate.execute("CREATE VIEW IF NOT EXISTS " + databaseName + ".payload\n" +
"AS SELECT * from " + databaseName + ".payload_legacy\n" +
"UNION ALL SELECT * FROM " + databaseName +".payload_aggregated\n" +
"UNION ALL SELECT * FROM " + databaseName + ".payload_bulk_import");
// We do not do the "compute stats" for the view, since we get the following error: "COMPUTE STATS not supported for view: pdfaggregationdatabase_payloads_view.payload".
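
The fixed Thread.sleep(1000) calls give the Metastore time to register the freshly created tables before the combined "payload" VIEW is created. An alternative, not used by the project, is a small bounded retry around the view creation; a sketch assuming a Spring JdbcTemplate wired elsewhere:

import org.springframework.dao.DataAccessException;
import org.springframework.jdbc.core.JdbcTemplate;

public class ViewCreationSketch {

    // Retry the view creation a few times, instead of relying on a fixed sleep,
    // since the Metastore may need a moment to register the freshly created tables.
    static void createPayloadViewWithRetry(JdbcTemplate jdbcTemplate, String databaseName) throws InterruptedException {
        String createViewQuery = "CREATE VIEW IF NOT EXISTS " + databaseName + ".payload\n"
                + "AS SELECT * from " + databaseName + ".payload_legacy\n"
                + "UNION ALL SELECT * FROM " + databaseName + ".payload_aggregated\n"
                + "UNION ALL SELECT * FROM " + databaseName + ".payload_bulk_import";
        for ( int attempt = 1; attempt <= 3; ++attempt ) {
            try {
                jdbcTemplate.update("INVALIDATE METADATA"); // Refresh the Metastore's view of the tables.
                jdbcTemplate.execute(createViewQuery);
                return; // Success.
            } catch (DataAccessException dae) {
                if ( attempt == 3 )
                    throw dae; // Give up after the last attempt.
                Thread.sleep(1000L * attempt); // Back off and retry.
            }
        }
    }
}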

View File

@ -1,5 +1,9 @@
package eu.openaire.urls_controller.controllers;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonParseException;
import com.google.gson.JsonParser;
import eu.openaire.urls_controller.components.BulkImport;
import eu.openaire.urls_controller.models.BulkImportReport;
import eu.openaire.urls_controller.models.BulkImportResponse;
@ -25,6 +29,7 @@ import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
@ -53,15 +58,15 @@ public class BulkImportController {
public BulkImportController(BulkImportService bulkImportService, BulkImport bulkImport)
{
String bulkImportReportLocation1;
String bulkImportReportLocationTemp;
this.baseBulkImportLocation = bulkImport.getBaseBulkImportLocation();
this.bulkImportSources = new HashMap<>(bulkImport.getBulkImportSources());
bulkImportReportLocation1 = bulkImport.getBulkImportReportLocation();
if ( !bulkImportReportLocation1.endsWith("/") )
bulkImportReportLocation1 += "/";
this.bulkImportReportLocation = bulkImportReportLocation1;
bulkImportReportLocationTemp = bulkImport.getBulkImportReportLocation();
if ( !bulkImportReportLocationTemp.endsWith("/") )
bulkImportReportLocationTemp += "/";
this.bulkImportReportLocation = bulkImportReportLocationTemp;
this.bulkImportService = bulkImportService;
@ -86,8 +91,8 @@ public class BulkImportController {
@PostMapping("bulkImportFullTexts")
public ResponseEntity<?> bulkImportFullTexts(@RequestParam String provenance, @RequestParam String bulkImportDir, @RequestParam boolean shouldDeleteFilesOnFinish) {
public ResponseEntity<?> bulkImportFullTexts(@RequestParam String provenance, @RequestParam String bulkImportDir, @RequestParam boolean shouldDeleteFilesOnFinish)
{
BulkImport.BulkImportSource bulkImportSource = bulkImportSources.get(provenance);
if ( bulkImportSource == null ) {
String errorMsg = "The provided provenance \"" + provenance + "\" is not in the list of the bulk-imported sources, so no configuration-rules are available!";
@ -185,7 +190,7 @@ public class BulkImportController {
String bulkImportReportID = provenance + "/" + relativeBulkImportDir + "report_" + GenericUtils.getRandomNumber(10000, 99999);
String bulkImportReportFullPath = this.bulkImportReportLocation + bulkImportReportID + ".json";
String msg = "The bulkImportFullTexts request for " + provenance + " procedure and bulkImportDir: " + givenBulkDir + " was accepted and will be scheduled for execution. "
String msg = "The bulkImportFullTexts request for " + provenance + " procedure and bulkImportDir: " + givenBulkDir + " was accepted and will be submitted for execution. "
+ (shouldDeleteFilesOnFinish ? "The successfully imported files will be deleted." : "All files will remain inside the directory after processing.")
+ " You can request a report at any moment, using the reportID.";
@ -203,24 +208,40 @@ public class BulkImportController {
// Add this to a background job, since it will take a lot of time to be completed, and the caller will get a "read-timeout" at least and a socket-timeout at most (in case of a network failure during those hours).
String finalBulkImportDir = bulkImportDir;
String finalRelativeBulkImportDir = relativeBulkImportDir;
UrlsController.backgroundCallableTasks.add(() ->
bulkImportService.bulkImportFullTextsFromDirectory(bulkImportReport, finalRelativeBulkImportDir, finalBulkImportDir, givenDir, provenance, bulkImportSource, shouldDeleteFilesOnFinish)
);
try {
UrlsController.futuresOfBackgroundTasks.add(
UrlsController.backgroundExecutor.submit(
() -> bulkImportService.bulkImportFullTextsFromDirectory(bulkImportReport, finalRelativeBulkImportDir, finalBulkImportDir, givenDir, provenance, bulkImportSource, shouldDeleteFilesOnFinish)
)
);
} catch (RejectedExecutionException ree) {
errorMsg = "The bulkImport request for bulkImportReportLocation \"" + bulkImportReportLocation + "\" and provenance " + provenance + " has failed to be executed!";
bulkImportReport.addEvent(errorMsg);
logger.error(errorMsg, ree);
return ResponseEntity.internalServerError().body(errorMsg);
}
// This directory, will be removed from "bulkImportDirsUnderProcessing", when the background job finishes.
return ResponseEntity.ok().body(new BulkImportResponse(msg, bulkImportReportID)); // The response is automatically serialized to json and it's of type "application/json".
return ResponseEntity.ok().body(new BulkImportResponse(msg, bulkImportReportID)); // The response is automatically serialized to json, and it has the type "application/json".
}
@GetMapping(value = "getBulkImportReport", produces = MediaType.APPLICATION_JSON_VALUE)
public ResponseEntity<?> getBulkImportReport(@RequestParam("id") String bulkImportReportId)
public ResponseEntity<?> getBulkImportReport(@RequestParam("id") String bulkImportReportId, @RequestParam(name = "pretty", defaultValue = "false") boolean prettyFormatting)
{
logger.info("Received a \"getBulkImportReport\" request for \"bulkImportReportId\": \"" + bulkImportReportId + "\"." + (prettyFormatting ? " Will return the report pretty-formatted." : ""));
// Even if the Service is set to shut down soon, we allow this endpoint to return the report up to the last minute,
// since the Service may be up for another hour, running a bulk-import procedure, for which we want to check its progress.
// Write the contents of the report-file to a string (efficiently!) and return the whole content as an HTTP-response.
StringBuilder stringBuilder = new StringBuilder(2_000);
final StringBuilder stringBuilder = new StringBuilder(25_000);
String line;
try ( BufferedReader in = new BufferedReader(new InputStreamReader(Files.newInputStream(Paths.get(this.bulkImportReportLocation, bulkImportReportId + ".json"))), FileUtils.tenMb) ) {
FileUtils.fileAccessLock.lock();
try ( BufferedReader in = new BufferedReader(new InputStreamReader(Files.newInputStream(Paths.get(this.bulkImportReportLocation, bulkImportReportId + ".json"))), FileUtils.twentyFiveKb) ) {
while ( (line = in.readLine()) != null )
stringBuilder.append(line).append("\n"); // The "readLine()" does not return the line-term char.
stringBuilder.append(line).append(GenericUtils.endOfLine); // The "readLine()" does not return the line-term char.
} catch (NoSuchFileException nsfe) {
logger.warn("The requested report-file with ID: \"" + bulkImportReportId + "\" was not found!");
return ResponseEntity.notFound().build();
@ -228,9 +249,20 @@ public class BulkImportController {
String errorMsg = "Failed to read the contents of report-file with ID: " + bulkImportReportId;
logger.error(errorMsg, e);
return ResponseEntity.internalServerError().body(errorMsg); // It's ok to give the file-path to the user, since the report already contains the file-path.
} finally {
FileUtils.fileAccessLock.unlock();
}
return ResponseEntity.ok().body(stringBuilder.toString());
String json = stringBuilder.toString().trim();
if ( prettyFormatting ) {
final Gson gson = new GsonBuilder().setPrettyPrinting().create();
try {
json = gson.toJson(JsonParser.parseString(json));
} catch (JsonParseException jpe) {
logger.error("Problem when parsing the json-string: " + jpe.getMessage() + "\nIt is not a valid json!\n" + json);
}
}
return ResponseEntity.ok().body(json);
}
}
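
The new "pretty" request-parameter relies on Gson's pretty printer: the raw report string is parsed back into a JSON tree and re-serialized with indentation. A minimal standalone usage example (the sample JSON string is made up):

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonParser;

public class PrettyJsonSketch {
    public static void main(String[] args) {
        String json = "{\"reportID\":\"12345\",\"events\":{\"2024-05-01\":[\"started\",\"finished\"]}}";
        Gson gson = new GsonBuilder().setPrettyPrinting().create();
        // Re-serialize the parsed tree with indentation; invalid input throws a JsonParseException.
        String pretty = gson.toJson(JsonParser.parseString(json));
        System.out.println(pretty);
    }
}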

View File

@ -26,6 +26,7 @@ public class ShutdownController {
ShutdownService shutdownService;
public static boolean shouldShutdownService = false;
public static boolean shouldShutdownAllWorkers = false;
@PostMapping("shutdownService")
@ -40,27 +41,20 @@ public class ShutdownController {
String endingMsg;
if ( shouldShutdownService ) {
endingMsg = "The controller has already received a \"shutdownService\" (which was not canceled afterwards).";
endingMsg = "The controller has already received a \"shutdownService\" request (which was not canceled afterwards).";
logger.info(initMsg + endingMsg);
} else {
shouldShutdownService = true;
endingMsg = "The service will shutdown, after finishing current work.";
logger.info(initMsg + endingMsg);
// Send "shutdownWorker" requests to all active Workers.
for ( String workerId : UrlsController.workersInfoMap.keySet() ) {
WorkerInfo workerInfo = UrlsController.workersInfoMap.get(workerId);
if ( ! workerInfo.getHasShutdown() ) // A worker may have shutdown on its own (by sending it a shutDown request manually), so it will have told the Controller when it shut down. In case of a Worker-crash, the Controller will not know about it.
shutdownService.postShutdownOrCancelRequestToWorker(workerId, workerInfo.getWorkerIP(), false);
else
logger.warn("Will not post ShutdownRequest to Worker \"" + workerId + "\", since is it has already shutdown.");
}
shutdownService.postShutdownOrCancelRequestsToAllWorkers(false);
// That's it for now. The workers may take some hours to finish their work (including delivering the full-text files).
// A scheduler monitors the shutdown of the workers. Once all workers have shutdown, the Controller shuts down as well.
}
return ResponseEntity.ok().body(endingMsg + "\n");
return ResponseEntity.ok().body(endingMsg + GenericUtils.endOfLine);
}
@ -78,39 +72,88 @@ public class ShutdownController {
String endingMsg = "Any previous \"shutdownService\"-request is canceled.";
logger.info(initMsg + endingMsg);
// Send "cancelShutdownWorker" requests to all active Workers.
for ( String workerId : UrlsController.workersInfoMap.keySet() ) {
WorkerInfo workerInfo = UrlsController.workersInfoMap.get(workerId);
if ( ! workerInfo.getHasShutdown() ) // A worker may have shutdown on its own (by sending it a shutDown request manually), so it will have told the Controller when it shut down. In case of a Worker-crash, the Controller will not know about it.
shutdownService.postShutdownOrCancelRequestToWorker(workerId, workerInfo.getWorkerIP(), true);
else
logger.warn("Will not post CancelShutdownRequest to Worker \"" + workerId + "\", since is it has already shutdown.");
// Cancel the shutdown of the workers, if we are able to catch up with them before they have already shutdown..
shutdownService.postShutdownOrCancelRequestsToAllWorkers(true);
return ResponseEntity.ok().body(endingMsg + GenericUtils.endOfLine);
}
// The "shutdownAllWorkers" and a "cancelShutdownAllWorkers" endpoints help when updating only the workers,
// while keeping the Controller running and accepting bulk-import requests.
@PostMapping("shutdownAllWorkers")
public ResponseEntity<?> shutdownAllWorkersGracefully(HttpServletRequest request)
{
String initMsg = "Received a \"shutdownAllWorkers\" request ";
String remoteAddr = GenericUtils.getRequestorAddress(request);
initMsg += "from [" + remoteAddr + "]. ";
ResponseEntity<?> responseEntity = shutdownService.passSecurityChecks(remoteAddr, initMsg);
if ( responseEntity != null )
return responseEntity;
String endingMsg;
if ( shouldShutdownAllWorkers ) {
endingMsg = "The controller has already received a \"shutdownAllWorkers\" request (which was not canceled afterwards).";
logger.info(initMsg + endingMsg);
} else {
shouldShutdownAllWorkers = true;
endingMsg = "All workers will shutdown, after finishing current work.";
logger.info(initMsg + endingMsg);
shutdownService.postShutdownOrCancelRequestsToAllWorkers(false);
// That's it for now. The workers may take some hours to finish their work (including delivering the full-text files).
// The service will continue to run and handle bulk-import requests.
// Once the workers are ready to work again, they can be started without any additional configuration.
}
return ResponseEntity.ok().body(endingMsg + "\n");
return ResponseEntity.ok().body(endingMsg + GenericUtils.endOfLine);
}
@PostMapping("cancelShutdownAllWorkers")
public ResponseEntity<?> cancelShutdownAllWorkersGracefully(HttpServletRequest request)
{
String initMsg = "Received a \"cancelShutdownAllWorkers\" request ";
String remoteAddr = GenericUtils.getRequestorAddress(request);
initMsg += "from [" + remoteAddr + "]. ";
ResponseEntity<?> responseEntity = shutdownService.passSecurityChecks(remoteAddr, initMsg);
if ( responseEntity != null )
return responseEntity;
shouldShutdownAllWorkers = false;
String endingMsg = "Any previous \"shutdownAllWorkers\"-request is canceled.";
logger.info(initMsg + endingMsg);
// Cancel the shutdown of the workers, if we are able to catch up with them before they have already shutdown..
shutdownService.postShutdownOrCancelRequestsToAllWorkers(true);
return ResponseEntity.ok().body(endingMsg + GenericUtils.endOfLine);
}
@PostMapping("workerShutdownReport")
public ResponseEntity<?> workerShutdownReport(@RequestParam String workerId, HttpServletRequest request)
{
String initMsg = "Received a \"workerShutdownReport\" from worker: \"" + workerId + "\".";
String remoteAddr = GenericUtils.getRequestorAddress(request);
String initMsg = "Received a \"workerShutdownReport\" from worker: \"" + workerId + "\" [IP: " + remoteAddr + "].";
WorkerInfo workerInfo = UrlsController.workersInfoMap.get(workerId);
if ( workerInfo == null ) {
String errorMsg = "The worker with id \"" + workerId + "\" has not participated in the PDF-Aggregation-Service!";
logger.warn(initMsg + "\n" + errorMsg);
logger.warn(initMsg + GenericUtils.endOfLine + errorMsg);
return ResponseEntity.badRequest().body(errorMsg);
}
String remoteAddr = GenericUtils.getRequestorAddress(request);
if ( ! remoteAddr.equals(workerInfo.getWorkerIP()) ) {
logger.error(initMsg + " The request came from another IP: " + remoteAddr + " | while this worker was registered with the IP: " + workerInfo.getWorkerIP());
logger.error(initMsg + " The request came from an IP different from the one this worker was registered with: " + workerInfo.getWorkerIP());
return ResponseEntity.status(HttpStatus.FORBIDDEN).build();
}
logger.info(initMsg);
workerInfo.setHasShutdown(true); // This will update the map.
UrlsController.numOfActiveWorkers.decrementAndGet();
// Return "HTTP-OK" to this worker. If this was part of a shutdown-service request, then wait for the scheduler to check and shutdown the service.
return ResponseEntity.ok().build();

View File

@ -21,10 +21,7 @@ import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;
@ -48,12 +45,11 @@ public class UrlsController {
public static final ConcurrentHashMap<String, WorkerInfo> workersInfoMap = new ConcurrentHashMap<>(6);
public static final AtomicInteger numOfWorkers = new AtomicInteger(0);
public static final AtomicInteger numOfActiveWorkers = new AtomicInteger(0);
public static ExecutorService backgroundExecutor;
public static final List<Callable<Boolean>> backgroundCallableTasks = Collections.synchronizedList(new ArrayList<>());
public static final List<Future<Boolean>> futuresOfBackgroundTasks = Collections.synchronizedList(new ArrayList<>());
private final String workerReportsDirPath;
@ -99,12 +95,13 @@ public class UrlsController {
if ( ShutdownController.shouldShutdownService ) {
// There might be the case that the Controller has not sent shutDown requests to the Workers yet, or it has, BUT:
// 1) A worker requests for new assignments before the shutDown request in handled by its side.
// 1) A worker requests for new assignments, before it can handle the shutDown request given to it.
// 2) A new Worker joins the Service (unexpected, but anyway).
String warnMsg = "The Service is about to shutdown, after all under-processing assignments and/or bulkImport requests are handled. No new requests are accepted!";
logger.warn(warnMsg); // It's likely not an actual error, but still it's not accepted.
return ResponseEntity.status(HttpStatus.CONFLICT).body(warnMsg); // The worker will wait 15 mins and, upon retrying, it will notice that it should not make a new request then, or it may have already shutdown in the meantime.
}
// Do not apply any check for the "ShutdownController.shouldShutdownAllWorkers", since then we would also have to make sure it is set back to false after all workers have been shutdown, in order for updated workers to be able to request assignments after they are started again..
if ( request == null ) {
logger.error("The \"HttpServletRequest\" is null!");
@ -123,12 +120,12 @@ public class UrlsController {
if ( workerInfo.getHasShutdown() ) {
logger.info("The worker with id \"" + workerId + "\" was restarted.");
workerInfo.setHasShutdown(false);
numOfWorkers.decrementAndGet();
numOfActiveWorkers.incrementAndGet();
}
} else {
logger.info("The worker \"" + workerId + "\" is requesting assignments for the first time. Going to store its IP [" + remoteAddr + "] in memory.");
workersInfoMap.put(workerId, new WorkerInfo(remoteAddr, false));
numOfWorkers.incrementAndGet();
numOfActiveWorkers.incrementAndGet();
}
return urlsService.getAssignments(workerId, assignmentsLimit);
@ -187,11 +184,19 @@ public class UrlsController {
// The above method will overwrite a possibly existing file. So in case of a crash, it's better to back up the reports before starting the Controller again (as the assignments-counter will start over, from 0).
int finalSizeOUrlReports = sizeOfUrlReports;
UrlsController.backgroundCallableTasks.add(() ->
urlsService.addWorkerReport(curWorkerId, curReportAssignmentsCounter, urlReports, finalSizeOUrlReports)
);
try {
UrlsController.futuresOfBackgroundTasks.add(
backgroundExecutor.submit(
() -> urlsService.addWorkerReport(curWorkerId, curReportAssignmentsCounter, urlReports, finalSizeOUrlReports)
)
);
} catch (RejectedExecutionException ree) {
String errorMsg = "The WorkerReport from worker \"" + curWorkerId + "\" and assignments_" + curReportAssignmentsCounter + " have failed to be executed!";
logger.error(errorMsg, ree);
return ResponseEntity.internalServerError().body(errorMsg);
}
String msg = "The 'addWorkerReport' request for worker with id: '" + curWorkerId + "' and assignments_" + curReportAssignmentsCounter + " , was accepted and will be scheduled for execution.";
String msg = "The 'addWorkerReport' request for worker with id: '" + curWorkerId + "' and assignments_" + curReportAssignmentsCounter + " , was accepted and submitted for execution.";
logger.info(msg);
return ResponseEntity.ok().body(msg);
}
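
"addWorkerReport()" and "bulkImportFullTexts()" now submit their work directly to the backgroundExecutor and keep the returned Future, instead of queueing Callables for a later invokeAll(). A minimal, self-contained sketch of the submit-and-track half of this pattern (the check-and-remove half is sketched after the ScheduledTasks diff above; pool size and names are illustrative):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;

public class SubmitAndTrackSketch {

    static final ExecutorService backgroundExecutor = Executors.newFixedThreadPool(4);
    static final List<Future<Boolean>> futuresOfBackgroundTasks = Collections.synchronizedList(new ArrayList<>());

    // Submit a background task and remember its Future, so a scheduler can inspect the result later.
    static boolean submitBackgroundTask(Callable<Boolean> task) {
        try {
            futuresOfBackgroundTasks.add(backgroundExecutor.submit(task));
            return true; // Accepted and submitted for execution.
        } catch (RejectedExecutionException ree) {
            // The executor is shutting down or saturated; report the failure to the caller.
            return false;
        }
    }

    public static void main(String[] args) {
        boolean accepted = submitBackgroundTask(() -> { System.out.println("processing a workerReport.."); return true; });
        System.out.println("accepted = " + accepted);
        backgroundExecutor.shutdown();
    }
}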

View File

@ -43,10 +43,13 @@ public class BulkImportReport {
public void addEvent(String event) {
eventsMultimap.put(GenericUtils.getReadableCurrentTimeAndZone(), event);
eventsMultimap.put(GenericUtils.getReadableCurrentTimeAndZone(), event); // This is synchronized.
}
public String getJsonReport()
/**
* Synchronize it to avoid concurrency issues when concurrent calls are made to the same bulkImport-Report object.
* */
public synchronized String getJsonReport()
{
//Convert the LinkedHashMultiMap<String, String> to Map<String, Collection<String>>, since Gson cannot serialize Multimaps.
eventsMap = eventsMultimap.asMap();
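
"getJsonReport()" is now synchronized because bulk-import threads keep appending events while HTTP requests serialize the same report object. A small self-contained sketch of the same Multimap-to-Map conversion, modeled on the fields shown above; the synchronized-multimap construction and the main() data are assumptions for illustration:

import com.google.common.collect.LinkedHashMultimap;
import com.google.common.collect.Multimap;
import com.google.common.collect.Multimaps;
import com.google.gson.Gson;
import java.util.Collection;
import java.util.Map;

public class MultimapReportSketch {

    // One timestamp can map to several event messages, so a Multimap fits better than a plain Map.
    private final Multimap<String, String> eventsMultimap =
            Multimaps.synchronizedMultimap(LinkedHashMultimap.<String, String>create());

    public void addEvent(String timestamp, String event) {
        eventsMultimap.put(timestamp, event); // Thread-safe because of the synchronized wrapper.
    }

    // Synchronized, so two concurrent report-requests do not interleave on the same object.
    public synchronized String getJsonReport() {
        // Gson cannot serialize Multimaps directly, so convert to Map<String, Collection<String>> first.
        Map<String, Collection<String>> eventsMap = eventsMultimap.asMap();
        synchronized (eventsMultimap) { // Guava requires manual synchronization when iterating the collection views.
            return new Gson().toJson(eventsMap);
        }
    }

    public static void main(String[] args) {
        MultimapReportSketch report = new MultimapReportSketch();
        report.addEvent("2024-05-01 01:00:00", "Initializing the bulkImport procedure.");
        report.addEvent("2024-05-01 01:00:00", "Going to import the files in parallel.");
        System.out.println(report.getJsonReport());
    }
}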

View File

@ -18,9 +18,9 @@ public class FileLocationData {
public FileLocationData(String fileLocation) throws RuntimeException {
// Extract and set LocationData.
Matcher matcher = FileUtils.FILENAME_ID_EXTENSION.matcher(fileLocation);
Matcher matcher = FileUtils.FILEPATH_ID_EXTENSION.matcher(fileLocation);
if ( !matcher.matches() )
throw new RuntimeException("Failed to match the \"" + fileLocation + "\" with the regex: " + FileUtils.FILENAME_ID_EXTENSION);
throw new RuntimeException("Failed to match the \"" + fileLocation + "\" with the regex: " + FileUtils.FILEPATH_ID_EXTENSION);
fileDir = matcher.group(1);
if ( (fileDir == null) || fileDir.isEmpty() )
throw new RuntimeException("Failed to extract the \"fileDir\" from \"" + fileLocation + "\".");

View File

@ -68,6 +68,14 @@ public class Payload {
this.datasourceId = datasourceId;
}
public Payload clone()
{
// Return a new object with the same values. Clone whatever objects are not immutable!
return new Payload(this.id, this.original_url, this.actual_url, (Timestamp) this.timestamp_acquired.clone(), this.mime_type, this.size, this.hash,
this.location, this.provenance, this.datasourceId);
}
public String getId() {
return id;
}

View File

@ -13,6 +13,6 @@ public interface BulkImportService {
List<String> getFileLocationsInsideDir(String directory);
String getMD5hash(String string);
String getMD5Hash(String string);
}
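
The interface method is renamed from "getMD5hash" to "getMD5Hash"; its implementation is not shown in this diff. A minimal sketch of an MD5 hex digest, using the MessageDigest and DatatypeConverter classes imported in BulkImportServiceImpl below (the method body itself is an assumption):

import javax.xml.bind.DatatypeConverter;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Sketch {

    // Returns the lowercase hex MD5 digest of the given string, or null on failure.
    static String getMD5Hash(String string) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(string.getBytes(StandardCharsets.UTF_8));
            return DatatypeConverter.printHexBinary(digest).toLowerCase();
        } catch (NoSuchAlgorithmException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(getMD5Hash("openaire"));
    }
}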

View File

@ -15,20 +15,17 @@ import org.apache.avro.generic.GenericData;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.dao.EmptyResultDataAccessException;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import javax.xml.bind.DatatypeConverter;
import java.io.File;
import java.net.ConnectException;
import java.net.UnknownHostException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.sql.Types;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.*;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
@ -42,18 +39,12 @@ public class BulkImportServiceImpl implements BulkImportService {
private static final Logger logger = LoggerFactory.getLogger(BulkImportServiceImpl.class);
@Autowired
private FileUtils fileUtils;
@Autowired
private ParquetFileUtils parquetFileUtils;
@Autowired
private JdbcTemplate jdbcTemplate;
/**
* Given a directory with full-text-files, this method imports the full-texts files in the PDF Aggregation Service.
@ -64,23 +55,29 @@ public class BulkImportServiceImpl implements BulkImportService {
String bulkImportReportLocation = bulkImportReport.getReportLocation();
// Write to bulkImport-report file.
bulkImportReport.addEvent("Initializing the bulkImport " + provenance + " procedure with bulkImportDir: " + bulkImportDirName + ".");
String msg = "Initializing the bulkImport " + provenance + " procedure with bulkImportDir: " + bulkImportDirName + ".";
logger.info(msg);
bulkImportReport.addEvent(msg);
// Do not write immediately to the file, wait for the following checks.
String additionalLoggingMsg = " | provenance: \"" + provenance + "\" | bulkImportDir: \"" + bulkImportDirName + "\"";
if ( (ParquetFileUtils.payloadsSchema == null) // Parse the schema if it's not already parsed.
&& ((ParquetFileUtils.payloadsSchema = ParquetFileUtils.parseSchema(ParquetFileUtils.payloadSchemaFilePath)) == null ) ) {
String errorMsg = "The payloadsSchema could not be parsed!";
logger.error(errorMsg);
logger.error(errorMsg + additionalLoggingMsg);
bulkImportReport.addEvent(errorMsg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), false);
BulkImportController.bulkImportDirsUnderProcessing.remove(bulkImportDirName);
return false;
}
List<String> fileLocations = getFileLocationsInsideDir(bulkImportDirName); // the error-msg has already been written
if ( fileLocations == null ) {
bulkImportReport.addEvent("Could not retrieve the files for bulk-import!");
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
String errorMsg = "Could not retrieve the files for bulk-import!";
logger.error(errorMsg + additionalLoggingMsg);
bulkImportReport.addEvent(errorMsg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), false);
BulkImportController.bulkImportDirsUnderProcessing.remove(bulkImportDirName);
return false;
}
@ -90,33 +87,33 @@ public class BulkImportServiceImpl implements BulkImportService {
String errorMsg = "No files were found inside the bulkImportDir: " + bulkImportDirName;
logger.warn(errorMsg);
bulkImportReport.addEvent(errorMsg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), false);
BulkImportController.bulkImportDirsUnderProcessing.remove(bulkImportDirName);
return false;
}
if ( logger.isTraceEnabled() )
logger.trace("fileLocations:\n" + fileLocations);
logger.trace("fileLocations: " + additionalLoggingMsg + GenericUtils.endOfLine + fileLocations);
String localParquetDir = parquetFileUtils.parquetBaseLocalDirectoryPath + "bulk_import_" + provenance + File.separator + relativeBulkImportDir; // This ends with "/".
try {
Files.createDirectories(Paths.get(localParquetDir)); // No-op if it already exists.
} catch (Exception e) {
String errorMsg = "Could not create the local parquet-directory: " + localParquetDir;
logger.error(errorMsg, e);
logger.error(errorMsg + additionalLoggingMsg, e);
bulkImportReport.addEvent(errorMsg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), false);
BulkImportController.bulkImportDirsUnderProcessing.remove(bulkImportDirName);
return false;
}
// Create a new directory on HDFS, with this bulkImportDir name, so that no "load data" operation will fail because another thread loaded that base-dir right before.
String currentBulkImportHdfsDir = parquetFileUtils.parquetHDFSDirectoryPathPayloadsBulkImport + relativeBulkImportDir;
if ( ! parquetFileUtils.applyHDFOperation(parquetFileUtils.webHDFSBaseUrl + currentBulkImportHdfsDir + parquetFileUtils.mkDirsAndParams) ) { // N0-op if it already exists. It is very quick.
if ( ! parquetFileUtils.applyHDFSOperation(parquetFileUtils.webHDFSBaseUrl + currentBulkImportHdfsDir + parquetFileUtils.mkDirsAndParams) ) { // No-op if it already exists. It is very quick.
String errorMsg = "Could not create the remote HDFS-directory: " + currentBulkImportHdfsDir;
logger.error(errorMsg);
logger.error(errorMsg + additionalLoggingMsg);
bulkImportReport.addEvent(errorMsg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), false);
BulkImportController.bulkImportDirsUnderProcessing.remove(bulkImportDirName);
return false;
}
@ -125,16 +122,20 @@ public class BulkImportServiceImpl implements BulkImportService {
List<Callable<Integer>> callableTasksForFileSegments = new ArrayList<>(numOfFiles);
int sizeOfEachSegment = (numOfFiles / BulkImportController.numOfThreadsForBulkImportProcedures);
if ( sizeOfEachSegment == 0 ) // In case the numFiles are fewer than the numOfThreads to handle them.
sizeOfEachSegment = 1;
List<List<String>> subLists = Lists.partition(fileLocations, sizeOfEachSegment); // Divide the initial list to "numOfThreadsForBulkImportProcedures" subLists. The last one may have marginally fewer files.
int subListsSize = subLists.size();
bulkImportReport.addEvent("Going to import the files in parallel, after dividing them in " + subListsSize + " segments.");
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
msg = "Going to bulk-import the " + numOfFiles + " files in parallel, after dividing them in " + subListsSize + " segments.";
logger.debug(msg + additionalLoggingMsg);
bulkImportReport.addEvent(msg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), false);
for ( int i = 0; i < subListsSize; ++i ) {
int finalI = i;
callableTasksForFileSegments.add(() -> { // Process this segment's files in a separate thread.
return processBulkImportedFilesSegment(bulkImportReport, finalI, subLists.get(finalI), bulkImportDirName, localParquetDir, currentBulkImportHdfsDir, provenance, bulkImportSource, timeMillis, shouldDeleteFilesOnFinish);
return processBulkImportedFilesSegment(bulkImportReport, finalI, subLists.get(finalI), bulkImportDirName, localParquetDir, currentBulkImportHdfsDir, provenance, bulkImportSource, timeMillis, shouldDeleteFilesOnFinish, additionalLoggingMsg);
});
}
@ -153,20 +154,30 @@ public class BulkImportServiceImpl implements BulkImportService {
numFailedSegments++;
// In case all the files failed to be bulk-imported, then we will detect it in the "numSuccessfulSegments"-check later.
// The failed-to-be-imported files, will not be deleted, even if the user specifies that he wants to delete the directory.
} catch (ExecutionException ee) {
String stackTraceMessage = GenericUtils.getSelectiveStackTrace(ee, null, 15); // These can be serious errors like an "out of memory exception" (Java HEAP).
logger.error("Task_" + (i+1) + " failed with: " + ee.getMessage() + "\n" + stackTraceMessage);
} catch (ExecutionException ee) { // These can be serious errors like an "out of memory exception" (Java HEAP).
numFailedSegments ++;
numAllFailedFiles += subLists.get(i).size(); // We assume all files of this segment failed, as all are passed through the same parts of code, so any serious exception should arise from the 1st files being processed and the rest of the files will be skipped..
logger.error(GenericUtils.getSelectedStackTraceForCausedException(ee, "Task_" + i + " failed with: ", additionalLoggingMsg, 20));
bulkImportReport.addEvent("Segment_" + i + " failed with: " + ee.getCause().getMessage());
} catch (CancellationException ce) {
logger.error("Task_" + (i+1) + " was cancelled: " + ce.getMessage());
numFailedSegments ++;
numAllFailedFiles += subLists.get(i).size(); // We assume all files have failed.
logger.error("Task_" + i + " was cancelled: " + ce.getMessage() + additionalLoggingMsg);
bulkImportReport.addEvent("Segment_" + i + " failed with: " + ce.getMessage());
} catch (InterruptedException ie) {
numFailedSegments ++;
numAllFailedFiles += (subLists.get(i).size() / 3); // In this case, only some of the files will have failed (do not know how many).
logger.error("Task_" + i + " was interrupted: " + ie.getMessage());
bulkImportReport.addEvent("Segment_" + i + " failed with: " + ie.getMessage());
} catch (IndexOutOfBoundsException ioobe) {
logger.error("IOOBE for task_" + i + " in the futures-list! " + ioobe.getMessage());
logger.error("IOOBE for task_" + i + " in the futures-list! " + ioobe.getMessage() + additionalLoggingMsg);
}
}
} catch (Exception e) {
String errorMsg = "An error occurred when trying to bulk-import data from bulkImportDir: " + bulkImportDirName;
logger.error(errorMsg, e);
bulkImportReport.addEvent(errorMsg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), false);
BulkImportController.bulkImportDirsUnderProcessing.remove(bulkImportDirName);
return false;
} finally {
@ -175,12 +186,11 @@ public class BulkImportServiceImpl implements BulkImportService {
}
// Check the results.
String msg;
if ( numAllFailedFiles == numOfFiles ) {
String errorMsg = "None of the files inside the bulkImportDir: " + bulkImportDirName + " were imported!";
logger.error(errorMsg);
bulkImportReport.addEvent(errorMsg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), false);
BulkImportController.bulkImportDirsUnderProcessing.remove(bulkImportDirName);
return false;
} else if ( numAllFailedFiles > 0 ) { // Some failed, but not all.
@ -191,24 +201,23 @@ public class BulkImportServiceImpl implements BulkImportService {
logger.info(msg);
}
bulkImportReport.addEvent(msg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), false);
// Merge the parquet files inside the table "payload_bulk_import", to improve performance of future operations.
DatabaseConnector.databaseLock.lock();
String mergeErrorMsg = fileUtils.mergeParquetFiles("payload_bulk_import", "", null); // msg is already logged
if ( mergeErrorMsg != null ) {
DatabaseConnector.databaseLock.unlock();
String mergeErrorMsg = parquetFileUtils.mergeParquetFilesOfTable("payload_bulk_import", "", null); // msg is already logged
DatabaseConnector.databaseLock.unlock();
if ( mergeErrorMsg != null ) { // the message in already logged
bulkImportReport.addEvent(mergeErrorMsg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), false);
BulkImportController.bulkImportDirsUnderProcessing.remove(bulkImportDirName);
return false;
}
DatabaseConnector.databaseLock.unlock();
String successMsg = "Finished the bulk-import procedure for " + provenance + " and bulkImportDir: " + bulkImportDirName;
logger.info(successMsg);
bulkImportReport.addEvent(successMsg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), false);
// The report-file will be overwritten every now and then, instead of appended to, since we want to write an updated JSON report-object each time.
// Also, we do not want to write the object only at the end (in its final form), since we want the user to have the ability to request the report at any time,
// after submitting the bulk-import request, to see its progress (since the number of files may be very large and the processing may take many hours).
@@ -219,15 +228,53 @@ public class BulkImportServiceImpl implements BulkImportService {
private int processBulkImportedFilesSegment(BulkImportReport bulkImportReport, int segmentCounter, List<String> fileLocationsSegment, String bulkImportDirName, String localParquetDir, String currentBulkImportHdfsDir,
String provenance, BulkImport.BulkImportSource bulkImportSource, long timeMillis, boolean shouldDeleteFilesOnFinish)
String provenance, BulkImport.BulkImportSource bulkImportSource, long timeMillis, boolean shouldDeleteFilesOnFinish, String additionalLoggingMsg)
{
// Inside this thread, process a segment of the files.
String bulkImportReportLocation = bulkImportReport.getReportLocation();
int numOfFilesInSegment = fileLocationsSegment.size();
String msg = "Going to import " + numOfFilesInSegment + " files, for segment-" + segmentCounter + ", of bulkImport procedure: " + provenance + " | dir: " + bulkImportDirName;
logger.debug(msg);
String msg = "Going to import " + numOfFilesInSegment + " files, for segment-" + segmentCounter;
logger.debug(msg + additionalLoggingMsg);
bulkImportReport.addEvent(msg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
HashMap<String, DocFileData> fileLocationsWithData = new HashMap<>(numOfFilesInSegment); // Map the fileLocations with hashes in order to quickly retrieve them later, when constructing the s3-file-location.
Set<String> fileHashes = new HashSet<>(numOfFilesInSegment);
// Get file-data for this segment's files.
for ( String fileLocation : fileLocationsSegment ) {
DocFileData docFileData = new DocFileData(new File(fileLocation), null, null, fileLocation);
docFileData.calculateAndSetHashAndSize();
String fileHash = docFileData.getHash();
if ( fileHash != null ) { // We will not work with files for which we cannot produce the hash, since the s3-file-location requires it.
fileHashes.add(fileHash);
fileLocationsWithData.put(fileLocation, docFileData);
}
}
int fileHashesSetSize = fileHashes.size();
if ( fileHashesSetSize == 0 ) {
msg = "No fileHashes were retrieved for segment_" + segmentCounter + ", no files will be processed from it.";
logger.warn(msg + additionalLoggingMsg);
bulkImportReport.addEvent(msg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
return numOfFilesInSegment; // All files of this segment failed.
}
// Get the "alreadyFound" fileLocations for the fileHashes which exist in the DB.
HashMap<String, String> hashWithExistingLocationMap = fileUtils.getHashLocationMap(fileHashes, fileHashesSetSize, segmentCounter, "segment");
int numAlreadyRetrievedFiles = hashWithExistingLocationMap.size();
if ( numAlreadyRetrievedFiles > 0 ) {
msg = numAlreadyRetrievedFiles + " files from segment_" + segmentCounter + " have already been retrieved in the past.";
logger.warn(msg + additionalLoggingMsg);
bulkImportReport.addEvent(msg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
}
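// Illustrative note (an assumption, since the body of "getHashLocationMap()" is not shown in this diff):
// the helper presumably replaces the old per-file lookups with one bulk query over the collected hashes, roughly:
//   select `hash`, `location` from <databaseName>.payload where `hash` in ('<hash_1>', '<hash_2>', ...)
// returning a <hash, existingLocation> map which is consulted per file further below, before any S3 upload.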
// Generate the payload-records to be inserted in the DB.
// The OpenAireID which is synthesised from the fileName, may be different from the ID related to the exact same file which was retrieved in the past.
// The same file may be related with multiple publications from different datasources. The deduplication is out of the scope of this service.
// All different OpenAIRE IDs should be saved, independently of the fact that their urls or files may be related with other publications as well.
List<GenericData.Record> payloadRecords = new ArrayList<>(numOfFilesInSegment);
@@ -236,20 +283,42 @@ public class BulkImportServiceImpl implements BulkImportService {
int counter = 0;
// Upload files to S3 and collect payloadRecords.
for ( String fileLocation: fileLocationsSegment ) {
GenericData.Record record = processBulkImportedFile(fileLocation, provenance, bulkImportSource, timeMillis);
if ( record != null )
payloadRecords.add(record);
else {
bulkImportReport.addEvent("An error caused the file: " + fileLocation + " to not be imported!");
for ( int i=0; i < numOfFilesInSegment; ++i ) {
String fileLocation = fileLocationsSegment.get(i);
DocFileData docFileData = fileLocationsWithData.get(fileLocation);
if ( docFileData == null ) {
failedFiles.add(fileLocation);
} else {
String alreadyRetrievedFileLocation = hashWithExistingLocationMap.get(docFileData.getHash());
GenericData.Record record = null;
try {
record = processBulkImportedFile(docFileData, alreadyRetrievedFileLocation, provenance, bulkImportSource, timeMillis, additionalLoggingMsg);
} catch (Exception e) {
String errorMsg = "Exception when uploading the files of segment_" + segmentCounter + " to the S3 Object Store. Will avoid uploading the rest of the files for this segment.. " + e.getMessage();
logger.error(errorMsg + additionalLoggingMsg);
bulkImportReport.addEvent(errorMsg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
for ( int j=i; j < numOfFilesInSegment; ++j )
failedFiles.add(fileLocationsSegment.get(j)); // The rest of the files are considered "failed".
break;
}
if ( record != null )
payloadRecords.add(record);
else {
bulkImportReport.addEvent("An error caused the file: " + fileLocation + " to not be imported!");
failedFiles.add(fileLocation);
}
}
if ( ((++counter) % 100) == 0 ) { // Every 100 files, report the status.
bulkImportReport.addEvent("Progress for segment-" + segmentCounter + " : " + payloadRecords.size() + " files have been imported and " + failedFiles.size() + " have failed, out of " + numOfFilesInSegment + " files.");
if ( ((++counter) % 150) == 0 ) { // Every 150 files, report the status for this segment and write it to the file.
msg = "Progress for segment-" + segmentCounter + " : " + payloadRecords.size() + " files have been imported and " + failedFiles.size() + " have failed, out of " + numOfFilesInSegment + " files.";
if ( logger.isTraceEnabled() )
logger.trace(msg + additionalLoggingMsg);
bulkImportReport.addEvent(msg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
}
}
} // End of processing files for this segment.
int numOfPayloadRecords = payloadRecords.size();
if ( numOfPayloadRecords == 0 ) {
@@ -262,7 +331,7 @@ public class BulkImportServiceImpl implements BulkImportService {
} else if ( numOfPayloadRecords != numOfFilesInSegment ) {
// Write this important note here, so that it is certainly included in the report, even if a parquet-file failure happens and the method exits early.
String errorMsg = failedFiles.size() + " out of " + numOfFilesInSegment + " files failed to be bulk-imported, for segment-" + segmentCounter + " !";
logger.warn(errorMsg);
logger.warn(errorMsg + additionalLoggingMsg);
bulkImportReport.addEvent(errorMsg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
}
@@ -272,11 +341,11 @@ public class BulkImportServiceImpl implements BulkImportService {
String fullLocalParquetFilePath = localParquetDir + parquetFileName;
if ( logger.isTraceEnabled() )
logger.trace("Going to write " + numOfPayloadRecords + " payload-records to the parquet file: " + fullLocalParquetFilePath); // DEBUG!
logger.trace("Going to write " + numOfPayloadRecords + " payload-records to the parquet file: " + fullLocalParquetFilePath + additionalLoggingMsg); // DEBUG!
if ( ! parquetFileUtils.writeToParquet(payloadRecords, ParquetFileUtils.payloadsSchema, fullLocalParquetFilePath) ) { // Any errorMsg is already logged inside.
String errorMsg = "Could not write the payload-records for segment-" + segmentCounter + " to the parquet-file: " + parquetFileName + " !";
logger.error(errorMsg);
logger.error(errorMsg + additionalLoggingMsg);
bulkImportReport.addEvent(errorMsg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
// None of the files of this segment will be deleted, in any case.
@@ -284,13 +353,13 @@ public class BulkImportServiceImpl implements BulkImportService {
}
if ( logger.isTraceEnabled() )
logger.trace("Going to upload the parquet file: " + fullLocalParquetFilePath + " to HDFS."); // DEBUG!
logger.trace("Going to upload the parquet file: " + fullLocalParquetFilePath + " to HDFS." + additionalLoggingMsg); // DEBUG!
// Upload and insert the data to the "payload" Impala table. (no database-locking is required)
String errorMsg = parquetFileUtils.uploadParquetFileToHDFS(fullLocalParquetFilePath, parquetFileName, currentBulkImportHdfsDir); // The returned message is already logged inside.
if ( errorMsg != null ) { // The possible error-message returned, is already logged by the Controller.
errorMsg = "Could not upload the parquet-file " + parquetFileName + " to HDFS!";
logger.error(errorMsg);
logger.error(errorMsg + additionalLoggingMsg);
bulkImportReport.addEvent(errorMsg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
// None of the files of this segment will be deleted, in any case.
@@ -298,134 +367,88 @@ public class BulkImportServiceImpl implements BulkImportService {
}
if ( logger.isTraceEnabled() )
logger.trace("Going to load the data of parquet-file: \"" + parquetFileName + "\" to the database-table: \"payload_bulk_import\"."); // DEBUG!
logger.trace("Going to load the data of parquet-file: \"" + parquetFileName + "\" to the database-table: \"payload_bulk_import\"." + additionalLoggingMsg); // DEBUG!
DatabaseConnector.databaseLock.lock();
if ( !parquetFileUtils.loadParquetDataIntoTable((currentBulkImportHdfsDir + parquetFileName), "payload_bulk_import") ) {
DatabaseConnector.databaseLock.unlock();
boolean parquetDataLoaded = parquetFileUtils.loadParquetDataIntoTable((currentBulkImportHdfsDir + parquetFileName), "payload_bulk_import");
DatabaseConnector.databaseLock.unlock();
if ( !parquetDataLoaded ) {
errorMsg = "Could not load the payload-records to the database, for segment-" + segmentCounter + "!";
logger.error(errorMsg);
logger.error(errorMsg + additionalLoggingMsg);
bulkImportReport.addEvent(errorMsg);
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
// None of the files of this segment will be deleted, in any case.
return numOfFilesInSegment; // All files of this segment have failed.
}
DatabaseConnector.databaseLock.unlock();
String segmentSuccessMsg = "Finished importing " + numOfPayloadRecords + " files, out of " + numOfFilesInSegment + ", for segment-" + segmentCounter + ".";
logger.info(segmentSuccessMsg);
logger.info(segmentSuccessMsg + additionalLoggingMsg);
bulkImportReport.addEvent(segmentSuccessMsg);
if ( shouldDeleteFilesOnFinish ) {
segmentSuccessMsg = "As the user requested, the successfully imported files of " + provenance + " procedure, of bulk-import segment-" + segmentCounter + ", from directory " + bulkImportDirName + ", will be deleted.";
logger.info(segmentSuccessMsg);
bulkImportReport.addEvent(segmentSuccessMsg);
// Delete all files except the ones in the "failedHashSet"
// Delete all files except the ones in the "failedHashSet".
for ( String fileLocation : fileLocationsSegment ) {
if ( !failedFiles.contains(fileLocation) )
if ( !fileUtils.deleteFile(fileLocation) )
if ( !fileUtils.deleteFile(fileLocation) ) // The "error-log-message" is shown inside.
bulkImportReport.addEvent("The file " + fileLocation + " could not be deleted! Please make sure you have provided the WRITE-permission.");
}
}
fileUtils.writeToFile(bulkImportReportLocation, bulkImportReport.getJsonReport(), true);
return (numOfFilesInSegment - numOfPayloadRecords); // Return the numOfFailedFiles.
}
private GenericData.Record processBulkImportedFile(String fileLocation, String provenance, BulkImport.BulkImportSource bulkImportSource, long timeMillis)
private GenericData.Record processBulkImportedFile(DocFileData docFileData, String alreadyRetrievedFileLocation, String provenance, BulkImport.BulkImportSource bulkImportSource, long timeMillis, String additionalLoggingMsg)
throws ConnectException, UnknownHostException
{
File fullTextFile = new File(fileLocation);
DocFileData docFileData = new DocFileData(fullTextFile, null, null, null);
docFileData.calculateAndSetHashAndSize();
// Check if this file is already found by crawling. Even though we started excluding this datasource from crawling, many full-texts have already been downloaded.
// Also, it may be the case that this file was downloaded by another datasource.
FileLocationData fileLocationData;
try {
fileLocationData = new FileLocationData(fileLocation);
fileLocationData = new FileLocationData(docFileData.getLocation());
} catch (RuntimeException re) {
logger.error(re.getMessage());
logger.error(re.getMessage() + additionalLoggingMsg);
return null;
}
String fileHash = docFileData.getHash();
if ( fileHash == null )
return null; // No check of past found full-text can be made nor the S3-fileName can be created.
String datasourceId = bulkImportSource.getDatasourceID();
String datasourcePrefix = bulkImportSource.getDatasourcePrefix();
String fileNameID = fileLocationData.getFileNameID();
String actualUrl = (bulkImportSource.getPdfUrlPrefix() + fileNameID); // This string-concatenation, works with urls of Arvix. A different construction may be needed for other datasources.
String originalUrl = actualUrl; // We have the full-text files from bulk-import, so let's assume the original-url is also the full-text-link.
final String getFileLocationForHashQuery = "select `location` from " + DatabaseConnector.databaseName + ".payload where `hash` = ? limit 1";
final int[] hashArgType = new int[] {Types.VARCHAR};
String alreadyFoundFileLocation = null;
DatabaseConnector.databaseLock.lock();
try {
alreadyFoundFileLocation = jdbcTemplate.queryForObject(getFileLocationForHashQuery, new Object[]{fileHash}, hashArgType, String.class);
} catch (EmptyResultDataAccessException erdae) {
// No fileLocation is found, it's ok. It will be null by default.
} catch (Exception e) {
logger.error("Error when executing or acquiring data from the the 'getFileLocationForHashQuery'!\n", e);
// Continue with bulk-importing the file and uploading it to S3.
} finally {
DatabaseConnector.databaseLock.unlock();
}
String idMd5hash = getMD5hash(fileNameID.toLowerCase());
if ( idMd5hash == null )
String openAireId = generateOpenaireId(fileNameID, bulkImportSource.getDatasourcePrefix(), bulkImportSource.getIsAuthoritative());
if ( openAireId == null )
return null;
// openaire id = <datasourcePrefix> + "::" + <md5(lowercase(arxivId))>
String openAireId = (datasourcePrefix + "::" + idMd5hash);
String fileHash = docFileData.getHash(); // It's guaranteed to NOT be null at this point.
String s3Url = null;
if ( alreadyFoundFileLocation != null ) // If the full-text of this record is already-found and uploaded.
if ( alreadyRetrievedFileLocation != null ) // If the full-text of this record is already-found and uploaded.
{
// This full-text was found to already be in the database.
// If it has the same datasourceID, then it likely was crawled before from an ID belonging to this datasource.
// If also has the same ID, then the exact same record from that datasource was retrieved previously.
// If it also has the same ID, then the exact same record from that datasource was retrieved previously.
// Else, the file was downloaded by another record of this datasource.
// Else, if the datasourceID is not the same, then the same file was retrieved from another datasource.
// The above analysis is educational; it does not need to take place and is not currently used.
s3Url = alreadyFoundFileLocation;
s3Url = alreadyRetrievedFileLocation;
} else {
try {
s3Url = fileUtils.constructFileNameAndUploadToS3(fileLocationData.getFileDir(), fileLocationData.getFileName(), openAireId, fileLocationData.getDotFileExtension(), datasourceId, fileHash); // This throws Exception, in case the uploading failed.
if ( s3Url == null )
return null; // In case the 'datasourceID' or 'hash' is null. Which should never happen here, since both of them are checked before the execution reaches here.
} catch (Exception e) {
logger.error("Could not upload the file '" + fileLocationData.getFileName() + "' to the S3 ObjectStore!", e);
s3Url = fileUtils.constructS3FilenameAndUploadToS3(fileLocationData.getFileDir(), fileLocationData.getFileName(), openAireId, fileLocationData.getDotFileExtension(), bulkImportSource.getDatasourceID(), fileHash);
if ( s3Url == null )
return null;
}
}
GenericData.Record record = new GenericData.Record(ParquetFileUtils.payloadsSchema);
record.put("id", openAireId);
record.put("original_url", originalUrl);
record.put("actual_url", actualUrl);
record.put("date", timeMillis);
record.put("mimetype", bulkImportSource.getMimeType());
Long size = docFileData.getSize();
record.put("size", ((size != null) ? String.valueOf(size) : null));
record.put("hash", fileHash); // This is already checked and will not be null here.
record.put("location", s3Url);
record.put("provenance", ("bulk:" + provenance)); // Add a prefix in order to be more clear that this record comes from bulkImport, when looking all records in the "payload" VIEW.
// TODO - If another url-schema is introduced for other datasources, have a "switch"-statement and perform the right "actualUrl"-creation based on current schema.
String actualUrl = (bulkImportSource.getPdfUrlPrefix() + fileNameID); // This string-concatenation works with urls of arXiv. A different construction may be needed for other datasources.
String originalUrl = actualUrl; // We have the full-text files from bulk-import, so let's assume the original-url is also the full-text-link.
return record;
return parquetFileUtils.getPayloadParquetRecord(openAireId, originalUrl, actualUrl, timeMillis, bulkImportSource.getMimeType(),
docFileData.getSize(), fileHash, s3Url, provenance, true); // It may return null.
}
public List<String> getFileLocationsInsideDir(String directory)
{
List<String> fileLocations = null;
try ( Stream<Path> walkStream = Files.find(Paths.get(directory), Integer.MAX_VALUE, (filePath, fileAttr) -> fileAttr.isRegularFile()) )
// In case we ever include other types of files inside the same directory, we need to add this filter: "&& !filePath.toString().endsWith("name.ext")"
{
@@ -435,12 +458,11 @@ public class BulkImportServiceImpl implements BulkImportService {
logger.error(errorMsg, e);
return null;
}
return fileLocations;
}
public String getMD5hash(String string)
public String getMD5Hash(String string)
{
String md5 = null;
try {
@@ -454,4 +476,21 @@ public class BulkImportServiceImpl implements BulkImportService {
return md5;
}
public String generateOpenaireId(String id, String datasourcePrefix, boolean isAuthoritative)
{
// If the "provenance" relates to an "authoritative" source, then its id has to be lowercase, before the md5() is applied to it.
// general_openaire_id = <datasourcePrefix> + "::" + <md5(ID)>
// authoritative_openaire_id = <datasourcePrefix> + "::" + <md5(lowercase(ID))>
if ( isAuthoritative )
id = id.toLowerCase();
String idMd5Hash = getMD5Hash(id);
if ( idMd5Hash == null )
return null;
return (datasourcePrefix + "::" + idMd5Hash);
}
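// A minimal usage sketch of the method above, with hypothetical values (illustration only):
//   String id = generateOpenaireId("1807.01234v1", "od________18", true);
//   // -> "od________18::" + md5("1807.01234v1"), i.e. the prefix, the "::" separator and the md5 of the (lowercased) fileNameID.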
}


@@ -6,6 +6,8 @@ public interface ShutdownService {
ResponseEntity<?> passSecurityChecks(String remoteAddr, String initMsg);
void postShutdownOrCancelRequestsToAllWorkers(boolean shouldCancel);
boolean postShutdownOrCancelRequestToWorker(String workerId, String workerIp, boolean shouldCancel);
}


@@ -1,9 +1,11 @@
package eu.openaire.urls_controller.services;
import eu.openaire.urls_controller.controllers.UrlsController;
import eu.openaire.urls_controller.models.WorkerInfo;
import eu.openaire.urls_controller.util.UriBuilder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Service;
@@ -24,6 +26,10 @@ public class ShutdownServiceImpl implements ShutdownService {
private static final Pattern PRIVATE_IP_ADDRESSES_RFC_1918 = Pattern.compile("(?:10.|172.(?:1[6-9]|2[0-9]|3[0-1])|192.168.)[0-9.]+");
@Value("${services.pdfaggregation.worker.port}")
private String workerPort;
public ResponseEntity<?> passSecurityChecks(String remoteAddr, String initMsg)
{
// In case the Controller is running inside a docker container, and we want to send the "shutdownServiceRequest" from the terminal (with curl), without entering inside the container,
@@ -32,16 +38,28 @@ public class ShutdownServiceImpl implements ShutdownService {
logger.error(initMsg + "The request came from another IP: " + remoteAddr + " | while the Controller has the IP: " + UriBuilder.ip);
return ResponseEntity.status(HttpStatus.FORBIDDEN).build();
}
return null; // The checks are passing.
}
public void postShutdownOrCancelRequestsToAllWorkers(boolean shouldCancel)
{
// Send "shutdownWorker" requests to all active Workers.
for ( String workerId : UrlsController.workersInfoMap.keySet() ) {
WorkerInfo workerInfo = UrlsController.workersInfoMap.get(workerId);
if ( ! workerInfo.getHasShutdown() ) // A worker may have shutdown on its own (by sending it a shutDown request manually), so it will have told the Controller when it shut down. In case of a Worker-crash, the Controller will not know about it.
postShutdownOrCancelRequestToWorker(workerId, workerInfo.getWorkerIP(), shouldCancel);
else
logger.warn("Will not post " + (shouldCancel ? "Cancel-" : "") + "ShutdownRequest to Worker \"" + workerId + "\", since is it has already shutdown.");
}
}
private static final RestTemplate restTemplate = new RestTemplate();
public boolean postShutdownOrCancelRequestToWorker(String workerId, String workerIp, boolean shouldCancel)
{
String url = "http://" + workerIp + ":1881/api/" + (shouldCancel ? "cancelShutdownWorker" : "shutdownWorker");
String url = "http://" + workerIp + ":" + workerPort + "/api/" + (shouldCancel ? "cancelShutdownWorker" : "shutdownWorker");
try {
ResponseEntity<?> responseEntity = restTemplate.postForEntity(url, null, String.class);
int responseCode = responseEntity.getStatusCodeValue();


@@ -19,7 +19,7 @@ public class StatsServiceImpl implements StatsService {
private JdbcTemplate jdbcTemplate;
// No DB-lock is required for these READ-operations.
// BUT! The is an issue.. these queries may run while a "table-merging" operation is in progress.. thus resulting in "no table reference" and "no file found (fieName.parquet)"
// BUT! There is an issue.. these queries may run while a "table-merging" operation is in progress.. thus resulting in "no table reference" and "no file found (fileName.parquet)"
// Thus, we need to have an "error-detection-and-retry" mechanism, in order to avoid returning errors that we know will appear at certain times and that we can overcome.
// The final time-to-return of the results-retrieval methods may be somewhat large, but the alternatives of returning predictable errors or locking the DB and slowing down the aggregation system are even worse.
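// Illustrative sketch of the retry-flow described above (an assumption; the full method bodies are not part of this diff):
// when a merge-related error ("no table reference" / missing parquet file) is detected, the method presumably calls "sleep1min()"
// and re-invokes itself with "retryCount + 1", giving up with a 500-response once "retryCount" exceeds 10, as the check below shows.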
@@ -27,7 +27,7 @@ public class StatsServiceImpl implements StatsService {
public ResponseEntity<?> getNumberOfPayloads(String getNumberQuery, String message, int retryCount)
{
if ( retryCount > 10 ) {
String errorMsg = "Could not find the requested payload-type table in an non-merging state, after " + retryCount + " retries!";
String errorMsg = "Could not find the requested payload-type table in an non-merging state, after " + (retryCount -1) + " retries!";
logger.error(errorMsg);
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(errorMsg);
}
@@ -59,7 +59,7 @@ public class StatsServiceImpl implements StatsService {
public ResponseEntity<?> getNumberOfRecordsInspectedByServiceThroughCrawling(int retryCount)
{
if ( retryCount > 10 ) {
String errorMsg = "Could not find the requested attempt table in an non-merging state, after " + retryCount + " retries!";
String errorMsg = "Could not find the requested attempt table in an non-merging state, after " + (retryCount -1) + " retries!";
logger.error(errorMsg);
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(errorMsg);
}
@@ -93,8 +93,10 @@ public class StatsServiceImpl implements StatsService {
}
}
// To get the human-friendly timestamp format from the BigInt in the database:
// select from_timestamp(CAST(CAST(`date` as decimal(30,0))/1000 AS timestamp), "yyyy-MM-dd HH:mm:ss.SSS") from payload
// Or simpler: select from_timestamp(CAST((`date`/1000) AS timestamp), "yyyy-MM-dd HH:mm:ss.SSS") from payload
private void sleep1min() {


@@ -6,6 +6,7 @@ import eu.openaire.urls_controller.controllers.UrlsController;
import eu.openaire.urls_controller.models.*;
import eu.openaire.urls_controller.payloads.responces.AssignmentsResponse;
import eu.openaire.urls_controller.util.FileUtils;
import eu.openaire.urls_controller.util.GenericUtils;
import eu.openaire.urls_controller.util.ParquetFileUtils;
import io.micrometer.core.annotation.Timed;
import org.slf4j.Logger;
@@ -28,7 +29,10 @@ import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.*;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.HashMap;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
@@ -54,12 +58,20 @@ public class UrlsServiceImpl implements UrlsService {
@Value("${services.pdfaggregation.controller.workerReportsDirPath}")
private String workerReportsDirPath;
@Value("${services.pdfaggregation.worker.port}")
private String workerPort;
public static final AtomicLong assignmentsBatchCounter = new AtomicLong(0);
private final AtomicInteger maxAttemptsPerRecordAtomic;
private static String excludedDatasourceIDsStringList = null;
private static final String DOC_URL_FILTER = ".+(?:pdf|download|/doc|document|(?:/|[?]|&)file|/fulltext|attachment|/paper|view(?:file|doc)|/get|cgi/viewcontent.cgi\\?|t[ée]l[ée]charger|descargar).*";
// "DOC_URL_FILTER" works for lowerCase Strings (we use the "ignore-case" indicator in the "regexp_like()" method).
public static final ExecutorService insertsExecutor = Executors.newFixedThreadPool(6);
// TODO - Unify this ExecutorService with the hash-matching executorService. Since one will ALWAYS be called after the other. So why having two ExecServices to handle?
@@ -84,22 +96,13 @@ public class UrlsServiceImpl implements UrlsService {
throw new RuntimeException("One of the bulk-imported datasourceIDs was not found! | source: " + source);
excludedIDs.add(datasourceID);
}
int exclusionListSize = excludedIDs.size(); // This list will not be empty.
// Prepare the "excludedDatasourceIDsStringList" to be used inside the "findAssignmentsQuery". Create the following string-pattern:
// ("ID_1", "ID_2", ...)
final StringBuilder sb = new StringBuilder((exclusionListSize * 46) + (exclusionListSize -1) +2 );
sb.append("(");
for ( int i=0; i < exclusionListSize; ++i ) {
sb.append("\"").append(excludedIDs.get(i)).append("\"");
if ( i < (exclusionListSize -1) )
sb.append(", ");
}
sb.append(")");
excludedDatasourceIDsStringList = sb.toString();
logger.info("The following bulkImport-datasources will be excluded from crawling: " + excludedDatasourceIDsStringList);
int stringBuilderCapacity = ((exclusionListSize * 46) + (exclusionListSize -1) +2);
excludedDatasourceIDsStringList = FileUtils.getQueryListString(excludedIDs, exclusionListSize, stringBuilderCapacity);
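// Note: the exact output of "getQueryListString()" is not shown in this diff; based on the replaced code above,
// it presumably builds the same string-pattern, i.e. ("ID_1", "ID_2", ...), within the given "stringBuilderCapacity",
// so that it can be used directly inside the "not in (...)" clause of the "findAssignmentsQuery" further below.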
logger.info("The following bulkImport data-sources will be excluded from crawling: " + excludedDatasourceIDsStringList);
}
@@ -111,34 +114,27 @@ public class UrlsServiceImpl implements UrlsService {
// Create the Assignments from the id-urls stored in the database up to the < assignmentsLimit >.
String findAssignmentsQuery =
"select pubid, url, datasourceid, datasourcename\n" + // Select the final sorted data with "assignmentsLimit".
"from (select distinct pubid, url, datasourceid, datasourcename, level, pub_year, attempt_count\n" + // Select the distinct id-url data. Beware that this will return duplicate id-url paris, wince one pair may be associated with multiple datasources.
" from (select p.id as pubid, pu.url as url, pb.level as level, attempts.counts as attempt_count, p.year as pub_year, d.id as datasourceid, d.name as datasourcename\n" + // Select all needed columns frm JOINs, order by "boost.level" and limit them to (assignmentsLimit * 10)
" from " + DatabaseConnector.databaseName + ".publication p\n" +
" join " + DatabaseConnector.databaseName + ".publication_urls pu on pu.id=p.id\n" +
" join " + DatabaseConnector.databaseName + ".datasource d on d.id=p.datasourceid\n" + // This is needed for the "d.allow_harvest=true" check later on.
" left outer join " + DatabaseConnector.databaseName + ".publication_boost pb\n" +
" on p.id=pb.id\n" +
" left outer join (select count(a.id) as counts, a.id from " + DatabaseConnector.databaseName + ".attempt a group by a.id) as attempts\n" +
" on attempts.id=p.id\n" +
" left outer join (\n" +
" select a.id, a.original_url from " + DatabaseConnector.databaseName + ".assignment a\n" +
" union all\n" +
" select pl.id, pl.original_url from " + DatabaseConnector.databaseName + ".payload pl\n" + // Here we access the payload-VIEW which includes the three payload-tables.
" ) as existing\n" +
" on existing.id=p.id and existing.original_url=pu.url\n" +
" where d.allow_harvest=true and existing.id is null\n" + // For records not found on existing, the "existing.id" will be null.
"from (select distinct p.id as pubid, pu.url as url, pb.level as level, attempts.counts as attempt_count, p.year as pub_year, d.id as datasourceid, d.name as datasourcename\n" + // Select the distinct id-url data. Beware that this will return duplicate id-url pairs, wince one pair may be associated with multiple datasources.
" from " + DatabaseConnector.databaseName + ".publication_urls pu\n" +
" join " + DatabaseConnector.databaseName + ".publication p on p.id=pu.id\n" +
" join " + DatabaseConnector.databaseName + ".datasource d on d.id=p.datasourceid and d.allow_harvest=true"+
((excludedDatasourceIDsStringList != null) ? // If we have an exclusion-list, use it below.
(" and d.id not in " + excludedDatasourceIDsStringList + "\n") : "") +
" and coalesce(attempts.counts, 0) <= " + maxAttemptsPerRecordAtomic.get() + "\n" +
" and not exists (select 1 from " + DatabaseConnector.databaseName + ".attempt a where a.id=p.id and a.error_class = 'noRetry' limit 1)\n" +
" and pu.url != '' and pu.url is not null\n" + // Some IDs have empty-string urls, there are no "null" urls, but keep the relevant check for future-proofing.
" and (p.year <= " + currentYear + " or p.year > " + (currentYear + 5) + ")\n" + // Exclude the pubs which will be published in the next 5 years. They don't provide full-texts now. (We don't exclude all future pubs, since, some have invalid year, like "9999").
" order by coalesce(level, 0) desc, coalesce(pub_year, 0) desc\n" +
" limit " + (assignmentsLimit * 10) + "\n" +
" ) as non_distinct_results\n" +
" order by coalesce(level, 0) desc, coalesce(pub_year, 0) desc, coalesce(attempt_count, 0), reverse(pubid), url\n" + // We also order by reverse "pubid" and "url", in order to get the exactly same records for consecutive runs, all things being equal.
" limit " + assignmentsLimit + "\n" +
") as findAssignmentsQuery";
(" and d.id not in " + excludedDatasourceIDsStringList + GenericUtils.endOfLine) : "") +
" left anti join (select a.original_url from " + DatabaseConnector.databaseName + ".assignment a\n" +
" union all\n" +
" select pl.original_url from " + DatabaseConnector.databaseName + ".payload pl\n" + // Here we access the payload-VIEW which includes the three payload-tables.
" ) as existing\n" +
" on existing.original_url=pu.url\n" +
" left outer join " + DatabaseConnector.databaseName + ".publication_boost pb\n" +
" on p.id=pb.id\n" +
" left outer join (select count(at.original_url) as counts, at.original_url from " + DatabaseConnector.databaseName + ".attempt at group by at.original_url) as attempts\n" +
" on attempts.original_url=pu.url\n" +
" where coalesce(attempts.counts, 0) <= " + maxAttemptsPerRecordAtomic.get() + GenericUtils.endOfLine +
" and not exists (select 1 from " + DatabaseConnector.databaseName + ".attempt a where a.original_url=pu.url and a.error_class = 'noRetry' limit 1)\n" +
" and (p.year <= " + currentYear + " or p.year > " + (currentYear + 5) + ")\n" + // Exclude the pubs which will be published in the next 5 years. They don't provide full-texts now. (We don't exclude all future pubs, since, some have invalid year, like "9999").
") as distinct_results\n" +
"order by coalesce(attempt_count, 0), coalesce(level, 0) desc, coalesce(pub_year, 0) desc, (case when regexp_like(url, '" + DOC_URL_FILTER + "', 'i') then 1 else 0 end) desc, reverse(pubid), url\n" + // We also order by reverse "pubid" and "url", in order to get the exactly same records for consecutive runs, all things being equal.
"limit " + assignmentsLimit;
// The datasourceID and datasourceName are currently not used during the processing in the Worker. They may be used by the Worker, in the future to apply a datasource-specific aggregation plugin to take the full-texts quickly, instead of using the general crawling one.
// However, the "datasourceid" is useful to be able to generate the fileNames for the S3, without needing to perform additional select queries (with JOINs) at that phase.
@@ -148,7 +144,7 @@ public class UrlsServiceImpl implements UrlsService {
final String getAssignmentsQuery = "select * from " + DatabaseConnector.databaseName + ".current_assignment";
List<Assignment> assignments = new ArrayList<>(assignmentsLimit);
final List<Assignment> assignments = new ArrayList<>(assignmentsLimit);
long curAssignmentsBatchCounter = assignmentsBatchCounter.incrementAndGet();
@@ -172,7 +168,7 @@ public class UrlsServiceImpl implements UrlsService {
assignment.setAssignmentsBatchCounter(curAssignmentsBatchCounter);
assignment.setTimestamp(timestamp);
Datasource datasource = new Datasource();
try { // For each of the 4 columns returned, do the following. The column-indexing starts from 1
try { // For each of the 4 columns returned, do the following. The column-indexing starts from 1.
assignment.setId(rs.getString(1));
assignment.setOriginalUrl(rs.getString(2));
datasource.setId(rs.getString(3));
@@ -184,37 +180,38 @@ public class UrlsServiceImpl implements UrlsService {
assignments.add(assignment);
});
} catch (EmptyResultDataAccessException erdae) {
errorMsg = "No results were returned for \"getAssignmentsQuery\":\n" + getAssignmentsQuery;
String tmpErrMsg = dropCurrentAssignmentTable();
DatabaseConnector.databaseLock.unlock();
if ( tmpErrMsg != null )
errorMsg += "\n" + tmpErrMsg;
logger.warn(errorMsg);
return ResponseEntity.status(HttpStatus.NO_CONTENT).body(errorMsg);
errorMsg = "No results retrieved from the \"getAssignmentsQuery\" for worker with id: " + workerId + ". Will increase the \"maxAttempts\" to " + maxAttemptsPerRecordAtomic.incrementAndGet() + " for the next requests.";
logger.error(errorMsg);
if ( tmpErrMsg != null ) {
errorMsg += GenericUtils.endOfLine + tmpErrMsg; // The additional error-msg is already logged.
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(errorMsg);
} else
return ResponseEntity.status(HttpStatus.NO_CONTENT).body(new AssignmentsResponse((long) -1, null));
} catch (Exception e) {
errorMsg = DatabaseConnector.handleQueryException("getAssignmentsQuery", getAssignmentsQuery, e);
String tmpErrMsg = dropCurrentAssignmentTable();
DatabaseConnector.databaseLock.unlock();
errorMsg = DatabaseConnector.handleQueryException("getAssignmentsQuery", getAssignmentsQuery, e);
if ( tmpErrMsg != null )
errorMsg += "\n" + tmpErrMsg;
errorMsg += GenericUtils.endOfLine + tmpErrMsg;
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(errorMsg);
}
int assignmentsSize = assignments.size();
if ( assignmentsSize == 0 ) {
errorMsg = "No results retrieved from the \"findAssignmentsQuery\" for worker with id: " + workerId + ". Will increase the \"maxAttempts\" to " + maxAttemptsPerRecordAtomic.incrementAndGet() + " for the next requests.";
logger.error(errorMsg);
String tmpErrMsg = dropCurrentAssignmentTable();
DatabaseConnector.databaseLock.unlock();
if ( tmpErrMsg != null ) {
errorMsg += "\n" + tmpErrMsg; // The additional error-msg is already logged.
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(errorMsg);
} else
return ResponseEntity.status(HttpStatus.MULTI_STATUS).body(new AssignmentsResponse((long) -1, null));
} else if ( assignmentsSize < assignmentsLimit )
errorMsg = "Some results were retrieved from the \"getAssignmentsQuery\", but no data could be extracted from them, for worker with id: " + workerId;
if ( tmpErrMsg != null )
errorMsg += GenericUtils.endOfLine + tmpErrMsg;
logger.error(errorMsg);
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(errorMsg);
}
else if ( assignmentsSize < assignmentsLimit )
logger.warn("The retrieved results were fewer (" + assignmentsSize + ") than the \"assignmentsLimit\" (" + assignmentsLimit + "), for worker with id: " + workerId + ". Will increase the \"maxAttempts\" to " + maxAttemptsPerRecordAtomic.incrementAndGet() + ", for the next requests.");
logger.debug("Finished gathering " + assignmentsSize + " assignments for worker with id \"" + workerId + "\". Going to insert them into the \"assignment\" table and then return them to the worker.");
logger.debug("Finished gathering " + assignmentsSize + " assignments for worker with id \"" + workerId + "\". Going to insert them into the \"assignment\" table and then return them to the worker. | assignments_" + curAssignmentsBatchCounter);
// Write the Assignment details to the assignment-table.
String insertAssignmentsQuery = "insert into " + DatabaseConnector.databaseName + ".assignment \n select pubid, url, '" + workerId + "', " + curAssignmentsBatchCounter + ", " + timestampMillis
@@ -223,29 +220,22 @@ public class UrlsServiceImpl implements UrlsService {
try {
jdbcTemplate.execute(insertAssignmentsQuery);
} catch (Exception e) {
errorMsg = DatabaseConnector.handleQueryException("insertAssignmentsQuery", insertAssignmentsQuery, e);
String tmpErrMsg = dropCurrentAssignmentTable();
if ( tmpErrMsg != null )
errorMsg += "\n" + tmpErrMsg;
DatabaseConnector.databaseLock.unlock();
errorMsg = DatabaseConnector.handleQueryException("insertAssignmentsQuery", insertAssignmentsQuery, e);
if ( tmpErrMsg != null )
errorMsg += GenericUtils.endOfLine + tmpErrMsg;
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(errorMsg);
}
errorMsg = dropCurrentAssignmentTable();
if ( errorMsg != null ) {
DatabaseConnector.databaseLock.unlock();
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(errorMsg);
}
logger.debug("Finished inserting " + assignmentsSize + " assignments into the \"assignment\"-table. Going to merge the parquet files for this table.");
String mergeErrorMsg = fileUtils.mergeParquetFiles("assignment", "", null);
if ( mergeErrorMsg != null ) {
DatabaseConnector.databaseLock.unlock();
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(mergeErrorMsg);
}
DatabaseConnector.databaseLock.unlock();
if ( errorMsg != null )
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(errorMsg);
// We do not need to "merge" the parquet files for the "assignment" table here, since this happens every time we delete the assignments of a specific batch.
logger.debug("Finished inserting " + assignmentsSize + " assignments into the \"assignment\"-table. | curAssignmentsBatchCounter: " + curAssignmentsBatchCounter + " | worker: " + workerId);
// Due to the fact that one publication with an id-url pair can be connected with multiple datasources, the results returned from the query may be duplicates.
// So, we apply a post-processing step where we collect only one instance of each id-url pair and send it to the Worker.
@@ -258,10 +248,9 @@ public class UrlsServiceImpl implements UrlsService {
final HashMap<String, Assignment> uniquePairsAndAssignments = new HashMap<>((int) (assignmentsLimit * 0.9));
for ( Assignment assignment : assignments ) {
for ( Assignment assignment : assignments )
uniquePairsAndAssignments.put(assignment.getId() + "_" + assignment.getOriginalUrl(), assignment);
// This will just update the duplicate record with another "assignment object", containing a different datasource.
}
List<Assignment> distinctAssignments = new ArrayList<>(uniquePairsAndAssignments.values());
int distinctAssignmentsSize = distinctAssignments.size();
@@ -285,23 +274,27 @@ public class UrlsServiceImpl implements UrlsService {
logger.info("Initializing the addition of the worker's (" + curWorkerId + ") report for assignments_" + curReportAssignmentsCounter);
boolean hasFulltexts = true;
// Before continuing with inserts, take and upload the fullTexts from the Worker. Also, update the file-"location".
FileUtils.UploadFullTextsResponse uploadFullTextsResponse = fileUtils.getAndUploadFullTexts(urlReports, sizeOfUrlReports, curReportAssignmentsCounter, curWorkerId);
if ( uploadFullTextsResponse == null ) {
// Nothing to post to the Worker, since we do not have the worker's info.
// Rename the worker-report-file to indicate its "failure", so that the scheduler can pick it up and retry processing it.
String workerReportBaseName = this.workerReportsDirPath + File.separator + curWorkerId + File.separator + curWorkerId + "_assignments_" + curReportAssignmentsCounter + "_report";
renameAndGetWorkerReportFile(workerReportBaseName, new File(workerReportBaseName + ".json"), "No info was found for worker: " + curWorkerId); // It may return null.
return false;
} else if ( uploadFullTextsResponse == FileUtils.UploadFullTextsResponse.databaseError ) {
postReportResultToWorker(curWorkerId, curReportAssignmentsCounter, "Problem with the Impala-database!");
return false;
}
else if ( uploadFullTextsResponse == FileUtils.UploadFullTextsResponse.unsuccessful ) {
} else if ( uploadFullTextsResponse == FileUtils.UploadFullTextsResponse.unsuccessful ) {
logger.error("Failed to get and/or upload the fullTexts for batch-assignments_" + curReportAssignmentsCounter);
// The docUrls were still found! Just update ALL the fileLocations, sizes, hashes and mimetypes, to show that the files are not available.
fileUtils.updateUrlReportsToHaveNoFullTextFiles(urlReports, false);
fileUtils.removeUnretrievedFullTextsFromUrlReports(urlReports, false);
// We write only the payloads which are connected with retrieved full-texts, uploaded to S3-Object-Store.
// We continue with writing the "attempts", as we want to avoid re-checking the failed-urls later.
// The urls which give full-text (no matter if we could not get it from the worker), are flagged as "couldRetry" anyway, so they will be picked-up to be checked again later.
} else
hasFulltexts = false;
} else if ( uploadFullTextsResponse == FileUtils.UploadFullTextsResponse.successful_without_fulltexts )
hasFulltexts = false;
else
logger.debug("Finished uploading the full-texts from batch-assignments_" + curReportAssignmentsCounter);
String localParquetPath = parquetFileUtils.parquetBaseLocalDirectoryPath + "assignments_" + curReportAssignmentsCounter + File.separator;
@@ -318,48 +311,53 @@ public class UrlsServiceImpl implements UrlsService {
// Create HDFS subDirs for these assignments. Other background threads handling other assignments will not interfere with loading of parquetFiles to the DB tables.
String endingMkDirAndParams = curReportAssignmentsCounter + "/" + parquetFileUtils.mkDirsAndParams;
if ( !parquetFileUtils.applyHDFOperation(parquetFileUtils.webHDFSBaseUrl + parquetFileUtils.parquetHDFSDirectoryPathAttempts + endingMkDirAndParams)
|| !parquetFileUtils.applyHDFOperation(parquetFileUtils.webHDFSBaseUrl + parquetFileUtils.parquetHDFSDirectoryPathPayloadsAggregated + endingMkDirAndParams) )
if ( !parquetFileUtils.applyHDFSOperation(parquetFileUtils.webHDFSBaseUrl + parquetFileUtils.parquetHDFSDirectoryPathAttempts + endingMkDirAndParams)
|| (hasFulltexts && !parquetFileUtils.applyHDFSOperation(parquetFileUtils.webHDFSBaseUrl + parquetFileUtils.parquetHDFSDirectoryPathPayloadsAggregated + endingMkDirAndParams)) )
{
postReportResultToWorker(curWorkerId, curReportAssignmentsCounter, "Error when creating the HDFS sub-directories for assignments_" + curReportAssignmentsCounter);
return false;
}
List<Callable<ParquetReport>> callableTasks = parquetFileUtils.getTasksForCreatingAndUploadingParquetFiles(urlReports, sizeOfUrlReports, curReportAssignmentsCounter, localParquetPath, uploadFullTextsResponse);
boolean hasAttemptParquetFileProblem = false;
boolean hasPayloadParquetFileProblem = false;
DatabaseConnector.databaseLock.lock();
// Lock the DB here so the prefilled-Payloads which will be generated inside the "getTasksForCreatingAndUploadingParquetFiles()" method (using a dedicated query)
// will be synchronized with the insert of all attempt and payload records to the DB. This action is NOT a callable-task, so it runs during the execution of this method.
// This is important in order to avoid having workers take these records as assignments, when we know that payloads are ready to be inserted for them.
List<Callable<ParquetReport>> callableTasks = parquetFileUtils.getTasksForCreatingAndUploadingParquetFiles(urlReports, sizeOfUrlReports, curReportAssignmentsCounter, localParquetPath, uploadFullTextsResponse);
try { // Invoke all the tasks and wait for them to finish.
List<Future<ParquetReport>> futures = insertsExecutor.invokeAll(callableTasks);
SumParquetSuccess sumParquetSuccess = parquetFileUtils.checkParquetFilesSuccess(futures);
ResponseEntity<?> errorResponseEntity = sumParquetSuccess.getResponseEntity();
if ( errorResponseEntity != null ) { // The related log is already shown.
if ( errorResponseEntity != null ) { // The related log is already shown in this case.
DatabaseConnector.databaseLock.unlock();
postReportResultToWorker(curWorkerId, curReportAssignmentsCounter, "Error when creating or uploading the parquet files!");
return false;
}
hasAttemptParquetFileProblem = sumParquetSuccess.isAttemptParquetFileProblem();
hasPayloadParquetFileProblem = sumParquetSuccess.isPayloadParquetFileProblem();
if ( hasAttemptParquetFileProblem && hasPayloadParquetFileProblem )
if ( hasAttemptParquetFileProblem && hasPayloadParquetFileProblem ) {
DatabaseConnector.databaseLock.unlock();
throw new RuntimeException("All of the parquet files failed to be created or uploaded! Will avoid to execute load-requests into the database, for batch-assignments_" + curReportAssignmentsCounter);
else {
} else {
if ( hasAttemptParquetFileProblem )
logger.error("All of the attempt-parquet files failed to be created or uploaded! Will avoid to execute load-requests into the database-table \"attempt\", for batch-assignments_" + curReportAssignmentsCounter);
else if ( hasPayloadParquetFileProblem )
logger.error("The single payload-parquet-file failed to be created or uploaded! Will avoid to execute load-requests into the database-table \"payload_aggregated\", for batch-assignments_" + curReportAssignmentsCounter);
logger.error("All of the payload-parquet files failed to be created or uploaded! Will avoid to execute load-requests into the database-table \"payload_aggregated\", for batch-assignments_" + curReportAssignmentsCounter);
else
logger.debug("Going to execute \"load\"-requests on the database, for the uploaded parquet-files.");
logger.debug("Going to execute \"load\"-requests on the database, for the uploaded parquet-files, for batch-assignments_" + curReportAssignmentsCounter);
}
// Load all the parquet files of each type into its table.
DatabaseConnector.databaseLock.lock();
if ( ! hasAttemptParquetFileProblem )
hasAttemptParquetFileProblem = ! parquetFileUtils.loadParquetDataIntoTable(parquetFileUtils.parquetHDFSDirectoryPathAttempts + curReportAssignmentsCounter + "/", "attempt");
if ( ! hasPayloadParquetFileProblem )
if ( hasFulltexts && ! hasPayloadParquetFileProblem )
hasPayloadParquetFileProblem = ! parquetFileUtils.loadParquetDataIntoTable(parquetFileUtils.parquetHDFSDirectoryPathPayloadsAggregated + curReportAssignmentsCounter + "/", "payload_aggregated");
DatabaseConnector.databaseLock.unlock();
@@ -369,17 +367,20 @@ public class UrlsServiceImpl implements UrlsService {
else if ( hasAttemptParquetFileProblem || hasPayloadParquetFileProblem )
logger.error("The data from the HDFS parquet sub-directories COULD NOT be loaded into the \"attempt\" or the \"payload_aggregated\" table, for batch-assignments_" + curReportAssignmentsCounter);
else
logger.debug("The data from the HDFS parquet sub-directories was loaded into the \"attempt\" and the \"payload_aggregated\" tables, for batch-assignments_" + curReportAssignmentsCounter);
logger.debug("The data from the HDFS parquet sub-directories was loaded into the \"attempt\" " + (hasFulltexts ? "and the \"payload_aggregated\" tables" : "table") + ", for batch-assignments_" + curReportAssignmentsCounter);
} catch (InterruptedException ie) { // In this case, any unfinished tasks are cancelled.
} catch (InterruptedException ie) { // Thrown by "insertsExecutor.invokeAll()". In this case, any unfinished tasks are cancelled.
DatabaseConnector.databaseLock.unlock();
logger.warn("The current thread was interrupted when waiting for the worker-threads to finish inserting into the tables: " + ie.getMessage());
// This is a very rare case. At the moment, we just move on with table-merging.
} catch (RuntimeException re) {
// Only one of the REs is thrown inside the DB-locked code, and there the "unlock" happens before it's thrown.
String errorMsg = re.getMessage();
logger.error(errorMsg);
postReportResultToWorker(curWorkerId, curReportAssignmentsCounter, errorMsg);
return false;
} catch (Exception e) {
} catch (Exception e) { // Thrown by "insertsExecutor.invokeAll()".
DatabaseConnector.databaseLock.unlock();
String errorMsg = "Unexpected error when inserting into the \"attempt\" and \"payload_aggregated\" tables in parallel! " + e.getMessage();
logger.error(errorMsg, e);
postReportResultToWorker(curWorkerId, curReportAssignmentsCounter, errorMsg);
@@ -389,8 +390,8 @@ public class UrlsServiceImpl implements UrlsService {
fileUtils.deleteDirectory(new File(localParquetPath));
// Delete the HDFS subDirs for this Report.
String endingRmDirAndParams = curReportAssignmentsCounter + "/" + parquetFileUtils.rmDirsAndParams;
if ( !parquetFileUtils.applyHDFOperation(parquetFileUtils.webHDFSBaseUrl + parquetFileUtils.parquetHDFSDirectoryPathAttempts + endingRmDirAndParams)
|| !parquetFileUtils.applyHDFOperation(parquetFileUtils.webHDFSBaseUrl + parquetFileUtils.parquetHDFSDirectoryPathPayloadsAggregated + endingRmDirAndParams) )
if ( !parquetFileUtils.applyHDFSOperation(parquetFileUtils.webHDFSBaseUrl + parquetFileUtils.parquetHDFSDirectoryPathAttempts + endingRmDirAndParams)
|| (hasFulltexts && !parquetFileUtils.applyHDFSOperation(parquetFileUtils.webHDFSBaseUrl + parquetFileUtils.parquetHDFSDirectoryPathPayloadsAggregated + endingRmDirAndParams)) )
{
logger.error("Error when deleting the HDFS sub-directories for assignments_" + curReportAssignmentsCounter); // A directory-specific log has already appeared.
// The failure to delete the assignments_subDirs is not that big of a problem and should not fail the whole process. So all goes as planned (the worker deletes any remaining files).
@@ -399,7 +400,7 @@ public class UrlsServiceImpl implements UrlsService {
}
// Delete the assignments each time, as they are bound to the "current" assignmentsCounter. Otherwise, they will never be deleted!
// If this method exits sooner, due tio an error, then the assignments are not deleted in order to wait for the schedulers to retry them and not be given to workers, to avoid reprocessing the urls.
// If this method exits sooner, due to an error, then the assignments are not deleted in order to wait for the schedulers to retry them and not be given to workers, to avoid reprocessing the urls.
DatabaseConnector.databaseLock.lock();
String deleteErrorMsg = deleteAssignmentsBatch(curReportAssignmentsCounter);
DatabaseConnector.databaseLock.unlock();
@@ -408,11 +409,10 @@ public class UrlsServiceImpl implements UrlsService {
return false;
}
// For every "numOfWorkers" assignment-batches that go to workers, we merge the tables, once a workerReport comes in.
// After the first few increases of "assignmentsBatchCounter" until all workers get assignment-batches,
// there will always be a time when the counter will be just before the "golden-value" and then one workerReport has to be processed here and the counter will be incremented by one and signal the merging-time.
if ( (currentNumOfWorkerReportsProcessed % UrlsController.numOfWorkers.get()) == 0 ) // The workersNum should not be zero! If a "division by zero" exception is thrown below, then there's a big bug somewhere in the design.
if ( ! mergeWorkerRelatedTables(curWorkerId, curReportAssignmentsCounter, hasAttemptParquetFileProblem, hasPayloadParquetFileProblem) )
// For every "numOfWorkers" assignment-batches that get processed, we merge the "attempts" and "payload_aggregated" tables.
if ( (currentNumOfWorkerReportsProcessed % UrlsController.numOfActiveWorkers.get()) == 0 ) // The workersNum will not be zero!
if ( ! mergeWorkerRelatedTables(curWorkerId, curReportAssignmentsCounter, hasAttemptParquetFileProblem, hasPayloadParquetFileProblem, hasFulltexts) )
// The "postReportResultToWorker()" was called inside.
return false;
if ( uploadFullTextsResponse == FileUtils.UploadFullTextsResponse.unsuccessful ) {
@@ -428,7 +428,7 @@ public class UrlsServiceImpl implements UrlsService {
private String createAndInitializeCurrentAssignmentsTable(String findAssignmentsQuery)
{
final String createCurrentAssignmentsQuery = "create table " + DatabaseConnector.databaseName + ".current_assignment as \n" + findAssignmentsQuery;
String createCurrentAssignmentsQuery = "create table " + DatabaseConnector.databaseName + ".current_assignment as \n" + findAssignmentsQuery;
final String computeCurrentAssignmentsStatsQuery = "COMPUTE STATS " + DatabaseConnector.databaseName + ".current_assignment";
try {
@@ -437,7 +437,7 @@ public class UrlsServiceImpl implements UrlsService {
String errorMsg = DatabaseConnector.handleQueryException("createCurrentAssignmentsQuery", createCurrentAssignmentsQuery, e);
String tmpErrMsg = dropCurrentAssignmentTable(); // The table may be partially created, e.g. in case of an "out of memory" error in the database-server, during the creation, resulting in an empty table (yes it has happened).
if ( tmpErrMsg != null )
errorMsg += "\n" + tmpErrMsg;
errorMsg += GenericUtils.endOfLine + tmpErrMsg;
return errorMsg;
}
@@ -447,7 +447,7 @@ public class UrlsServiceImpl implements UrlsService {
String errorMsg = DatabaseConnector.handleQueryException("computeCurrentAssignmentsStatsQuery", computeCurrentAssignmentsStatsQuery, e);
String tmpErrMsg = dropCurrentAssignmentTable();
if ( tmpErrMsg != null )
errorMsg += "\n" + tmpErrMsg;
errorMsg += GenericUtils.endOfLine + tmpErrMsg;
return errorMsg;
}
@@ -456,7 +456,7 @@ public class UrlsServiceImpl implements UrlsService {
private String dropCurrentAssignmentTable() {
String dropCurrentAssignmentsQuery = "DROP TABLE IF EXISTS " + DatabaseConnector.databaseName + ".current_assignment PURGE";
final String dropCurrentAssignmentsQuery = "DROP TABLE IF EXISTS " + DatabaseConnector.databaseName + ".current_assignment PURGE";
try {
jdbcTemplate.execute(dropCurrentAssignmentsQuery);
return null; // All good. No error-message.
@@ -466,18 +466,18 @@ public class UrlsServiceImpl implements UrlsService {
}
private boolean mergeWorkerRelatedTables(String curWorkerId, long curReportAssignmentsCounter, boolean hasAttemptParquetFileProblem, boolean hasPayloadParquetFileProblem)
private boolean mergeWorkerRelatedTables(String curWorkerId, long curReportAssignmentsCounter, boolean hasAttemptParquetFileProblem, boolean hasPayloadParquetFileProblem, boolean hasFulltexts)
{
logger.debug("Going to merge the parquet files for the tables which were altered.");
logger.debug("Going to merge the parquet files for the tables which were altered, for batch-assignments_" + curReportAssignmentsCounter);
// When the uploaded parquet files are "loaded" into the tables, they are actually moved into the directory which contains the data of the table.
// This means that over time a table may have thousands of parquet files and the search through them will be very slow. Thus, we merge them every now and then.
// This means that over time a table may have thousands of parquet files and the search through them will be very slow. Thus, we merge them regularly.
String mergeErrorMsg;
DatabaseConnector.databaseLock.lock();
if ( ! hasAttemptParquetFileProblem ) {
mergeErrorMsg = fileUtils.mergeParquetFiles("attempt", "", null);
mergeErrorMsg = parquetFileUtils.mergeParquetFilesOfTable("attempt", "", null);
if ( mergeErrorMsg != null ) {
DatabaseConnector.databaseLock.unlock();
postReportResultToWorker(curWorkerId, curReportAssignmentsCounter, mergeErrorMsg);
@ -485,8 +485,8 @@ public class UrlsServiceImpl implements UrlsService {
}
}
if ( ! hasPayloadParquetFileProblem ) {
mergeErrorMsg = fileUtils.mergeParquetFiles("payload_aggregated", "", null);
if ( hasFulltexts && ! hasPayloadParquetFileProblem ) {
mergeErrorMsg = parquetFileUtils.mergeParquetFilesOfTable("payload_aggregated", "", null);
if ( mergeErrorMsg != null ) {
DatabaseConnector.databaseLock.unlock();
postReportResultToWorker(curWorkerId, curReportAssignmentsCounter, mergeErrorMsg);
@ -495,17 +495,28 @@ public class UrlsServiceImpl implements UrlsService {
}
DatabaseConnector.databaseLock.unlock();
logger.debug("Finished merging the database tables.");
logger.debug("Finished merging the database tables, for batch-assignments_" + curReportAssignmentsCounter);
return true;
}
public String deleteAssignmentsBatch(long givenAssignmentsBatchCounter)
{
// This will delete the rows of the "assignment" table which refer to the "curWorkerId". As we have non-KUDU Impala tables, the Delete operation can only succeed through a "merge" operation of the rest of the data.
// Only the rows referring to OTHER workerIDs get stored in a temp-table, while the "assignment" table gets deleted. Then, the temp_table becomes the "assignment" table.
// We don't need to keep the assignment-info anymore, the "findAssignmentsQuery" checks the "payload_aggregated" table for previously handled tasks.
return fileUtils.mergeParquetFiles("assignment", " WHERE assignments_batch_counter != ", givenAssignmentsBatchCounter);
// This will delete the rows of the "assignment" table which refer to the "givenAssignmentsBatchCounter".
// As we have non-KUDU Impala tables, the Delete operation can only succeed through a "merge" operation of the rest of the data.
// Only the rows referring to OTHER values of the "assignments_batch_counter" get stored in a temp-table, while the "assignment" table gets deleted. Then, the temp_table becomes the "assignment" table.
// We don't need to keep the assignment-info forever, as the "findAssignmentsQuery" checks the "payload" table for previously handled tasks.
return parquetFileUtils.mergeParquetFilesOfTable("assignment", " WHERE assignments_batch_counter != ", givenAssignmentsBatchCounter);
}
public String deleteAssignmentsWithOlderDate(long givenDate)
{
// This will delete the rows of the "assignment" table which are older than "givenDate".
// As we have non-KUDU Impala tables, the Delete operation can only succeed through a "merge" operation of the rest of the data.
// Only the rows with a date NEWER than (or equal to) the "givenDate" get stored in a temp-table, while the "assignment" table gets deleted. Then, the temp_table becomes the "assignment" table.
// We don't need to keep the assignment-info forever, as the "findAssignmentsQuery" checks the "payload" table for previously handled tasks.
return parquetFileUtils.mergeParquetFilesOfTable("assignment", " WHERE `date` >= ", givenDate);
}
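// Illustrative sketch (not part of the original code): the merge-based "delete" described above effectively
// boils down to the following Impala statements (the WHERE clause is the one passed to "mergeParquetFilesOfTable()";
// "<db>" stands for the configured database name):
//   CREATE TABLE <db>.assignment_tmp STORED AS PARQUET AS
//       SELECT * FROM <db>.assignment WHERE assignments_batch_counter != <givenAssignmentsBatchCounter>;
//   DROP TABLE <db>.assignment PURGE;
//   ALTER TABLE <db>.assignment_tmp RENAME TO <db>.assignment;
//   COMPUTE STATS <db>.assignment;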
@ -516,30 +527,7 @@ public class UrlsServiceImpl implements UrlsService {
// Rename the worker-report to indicate success or failure.
String workerReportBaseName = this.workerReportsDirPath + File.separator + workerId + File.separator + workerId + "_assignments_" + assignmentRequestCounter + "_report";
File workerReport = new File(workerReportBaseName + ".json");
File renamedWorkerReport = null;
boolean wasWorkerReportRenamed = true;
try {
// Check if the workerReport does not exist under this name, as it may have been renamed previously to "failed" and now is the 2nd try.
if ( !workerReport.isFile() ) {
// Then this is the 2nd try and the report has already been renamed to "failed" the 1st time.
workerReport = new File(workerReportBaseName + "_failed.json");
if ( !workerReport.isFile() ) { // In this case, an error exists, since this file was deleted before its time.
logger.error("The workerReport file \"" + workerReport.getAbsolutePath() + "\" does not exist!");
wasWorkerReportRenamed = false; // Do not proceed on renaming the non-existing file.
// TODO - Do we need additional handling? This report may be either successful or failed but the file was deleted.
}
}
if ( wasWorkerReportRenamed ) { // Only if the file exists.
renamedWorkerReport = new File(workerReportBaseName + ((errorMsg == null) ? "_successful.json" : "_failed.json"));
if ( !workerReport.renameTo(renamedWorkerReport) ) {
logger.warn("There was a problem when renaming the workerReport: " + workerReport.getName());
wasWorkerReportRenamed = false;
}
}
} catch (Exception e) {
logger.error("There was a problem when renaming the workerReport: " + workerReport.getName(), e);
wasWorkerReportRenamed = false;
}
File renamedWorkerReport = renameAndGetWorkerReportFile(workerReportBaseName, workerReport, errorMsg); // It may return null.
// Get the IP of this worker.
WorkerInfo workerInfo = UrlsController.workersInfoMap.get(workerId);
@ -547,7 +535,7 @@ public class UrlsServiceImpl implements UrlsService {
logger.error("Could not find any info for worker with id: \"" + workerId +"\".");
return false;
}
String url = "http://" + workerInfo.getWorkerIP() + ":1881/api/addReportResultToWorker/" + assignmentRequestCounter; // This workerIP will NOT be null.
String url = "http://" + workerInfo.getWorkerIP() + ":" + workerPort + "/api/addReportResultToWorker/" + assignmentRequestCounter; // This workerIP will NOT be null.
if ( logger.isTraceEnabled() )
logger.trace("Going to \"postReportResultToWorker\": \"" + workerId + "\", for assignments_" + assignmentRequestCounter + ((errorMsg != null) ? "\nError: " + errorMsg : ""));
@ -559,13 +547,13 @@ public class UrlsServiceImpl implements UrlsService {
logger.error("HTTP-Connection problem with the submission of the \"postReportResultToWorker\" of worker \"" + workerId + "\" and assignments_" + assignmentRequestCounter + "! Error-code was: " + responseCode);
return false;
} else if ( errorMsg == null ) // If the worker was notified, then go delete the successful workerReport.
fileUtils.deleteFile(wasWorkerReportRenamed ? renamedWorkerReport.getAbsolutePath() : workerReport.getAbsolutePath());
fileUtils.deleteFile((renamedWorkerReport != null) ? renamedWorkerReport.getAbsolutePath() : workerReport.getAbsolutePath());
return true;
} catch (HttpServerErrorException hsee) {
logger.error("The Worker \"" + workerId + "\" failed to handle the \"postReportResultToWorker\", of assignments_" + assignmentRequestCounter + ": " + hsee.getMessage());
return false;
} catch (HttpClientErrorException hcee) {
logger.error("The Controller did something wrong when sending the report result to the Worker (" + workerId + "). | url: " + url + "\n" + hcee.getMessage());
logger.error("The Controller did something wrong when sending the report result to the Worker (" + workerId + "). | url: " + url + GenericUtils.endOfLine + hcee.getMessage());
return false;
} catch (Exception e) {
errorMsg = "Error for \"postReportResultToWorker\", of assignments_" + assignmentRequestCounter + " to the Worker: " + workerId;
@ -581,6 +569,34 @@ public class UrlsServiceImpl implements UrlsService {
}
private static File renameAndGetWorkerReportFile(String workerReportBaseName, File workerReport, String errorMsg)
{
File renamedWorkerReport = null;
try {
// Check if the workerReport does not exist under this name, as it may have been renamed previously to "failed" and now is the 2nd try.
if ( !workerReport.isFile() ) {
// Then this is the 2nd try and the report has already been renamed to "failed" the 1st time.
workerReport = new File(workerReportBaseName + "_failed.json");
if ( !workerReport.isFile() ) { // In this case, an error exists, since this file was deleted before its time.
logger.error("The workerReport file \"" + workerReport.getAbsolutePath() + "\" does not exist!");
// TODO - Do we need additional handling? This report may be either successful or failed but the file was deleted.
return null; // Do not proceed on renaming the non-existing file.
}
}
// Only if the file exists, proceed to renaming it.
renamedWorkerReport = new File(workerReportBaseName + ((errorMsg == null) ? "_successful.json" : "_failed.json"));
if ( ! workerReport.renameTo(renamedWorkerReport) ) {
logger.warn("There was a problem when renaming the workerReport: " + workerReport.getName());
return null;
}
} catch (Exception e) {
logger.error("There was a problem when renaming the workerReport: " + workerReport.getName(), e);
return null;
}
return renamedWorkerReport;
}
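// Illustrative note (not part of the original code): given the naming scheme used above, for a hypothetical
// worker "worker_X" and assignments_12, the report "<workerReportsDirPath>/worker_X/worker_X_assignments_12_report.json"
// gets renamed to "worker_X_assignments_12_report_successful.json" when "errorMsg" is null,
// or to "worker_X_assignments_12_report_failed.json" otherwise.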
// The "batchExecute" does not work in this Impala-Database, so this is a "giant-query" solution.
// Note: this causes an "Out of memory"-ERROR in the current version of the Impala JDBC driver. If a later version is provided, then this code should be tested.
private static PreparedStatement constructLargeInsertQuery(Connection con, String baseInsertQuery, int dataSize, int numParamsPerRow) throws RuntimeException {
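// Sketch of the approach (an assumption, since the method body is not shown in this diff):
// append one "(?, ?, ..., ?)" group per data-row to the "baseInsertQuery", ending up with
// "INSERT INTO <table> (...) VALUES (?,?,...), (?,?,...), ..." and (dataSize * numParamsPerRow)
// bind-parameters in total, then create the statement with "con.prepareStatement(builtQuery)".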


@ -1,14 +1,15 @@
package eu.openaire.urls_controller.util;
import com.google.common.collect.HashMultimap;
import com.google.common.collect.Multimaps;
import com.google.common.collect.SetMultimap;
import eu.openaire.urls_controller.configuration.DatabaseConnector;
import eu.openaire.urls_controller.controllers.UrlsController;
import eu.openaire.urls_controller.models.Error;
import eu.openaire.urls_controller.models.Payload;
import eu.openaire.urls_controller.models.UrlReport;
import eu.openaire.urls_controller.models.WorkerInfo;
import org.apache.commons.io.FileDeleteStrategy;
import org.jetbrains.annotations.NotNull;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
@ -18,25 +19,25 @@ import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Component;
import java.io.*;
import java.net.ConnectException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Types;
import java.sql.SQLException;
import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
@Component
public class FileUtils {
@ -52,8 +53,12 @@ public class FileUtils {
@Autowired
private FileDecompressor fileDecompressor;
@Value("${services.pdfaggregation.worker.port}")
private String workerPort;
public enum UploadFullTextsResponse {successful, unsuccessful, databaseError}
public enum UploadFullTextsResponse {successful, successful_without_fulltexts, unsuccessful}
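// Illustrative summary (not part of the original code) of how these values are used further below:
// "successful"                   -> full-texts were requested and at least some payloads ended up with a file-location.
// "successful_without_fulltexts" -> the report was handled fine, but no new full-texts had to be fetched
//                                   (none were retrieved by the Worker, or all of them were already uploaded in previous batches).
// "unsuccessful"                 -> none of the payloads could be handled (e.g. every batch failed).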
public String baseFilesLocation;
@ -75,117 +80,11 @@ public class FileUtils {
}
/**
* In each insertion, a new parquet-file is created, so we end up with millions of files. Parquet is great for fast-select, so we have to stick with it and merge those files.
* This method creates a clone of the original table, in order to have only one parquet file in the end. It drops the original table.
* It then renames the clone to the original's name.
* Returns the errorMsg if an error appears; otherwise it returns "null".
*/
public String mergeParquetFiles(String tableName, String whereClause, Object parameter) {
String errorMsg;
if ( (tableName == null) || tableName.isEmpty() ) {
errorMsg = "No tableName was given. Do not know the tableName for which we should merger the underlying files for!";
logger.error(errorMsg);
return errorMsg; // Return the error-msg to indicate that something went wrong and pass it down to the Worker.
}
// Make sure the following are empty strings.
whereClause = (whereClause != null) ? (whereClause + " ") : "";
if ( parameter == null )
parameter = "";
else if ( parameter instanceof String )
parameter = "'" + parameter + "'"; // This will be a "string-check", thus the single-quotes.
// Else it is a "long", it will be used as is.
// Create a temp-table as a copy of the initial table.
try {
jdbcTemplate.execute("CREATE TABLE " + DatabaseConnector.databaseName + "." + tableName + "_tmp stored as parquet AS SELECT * FROM " + DatabaseConnector.databaseName + "." + tableName + " " + whereClause + parameter);
} catch (Exception e) {
errorMsg = "Problem when copying the contents of \"" + tableName + "\" table to a newly created \"" + tableName + "_tmp\" table, when merging the parquet-files!\n";
logger.error(errorMsg, e);
try { // Make sure we delete the possibly half-created temp-table.
jdbcTemplate.execute("DROP TABLE IF EXISTS " + DatabaseConnector.databaseName + "." + tableName + "_tmp PURGE");
// We cannot move on with merging, but no harm was done, since the "table_tmp" name is still available for future use (as it was dropped immediately).
} catch (Exception e1) {
logger.error("Failed to drop the \"" + tableName + "_tmp\" table!", e1);
// TODO - Should we shutdown the service? It is highly unlikely that anyone will observe this error live (and act immediately)..!
}
return errorMsg; // We return only the initial error to the Worker, which is easily distinguished inside the "merge-queries".
}
// Drop the initial table.
try {
jdbcTemplate.execute("DROP TABLE " + DatabaseConnector.databaseName + "." + tableName + " PURGE");
} catch (Exception e) {
errorMsg = "Problem when dropping the initial \"" + tableName + "\" table, when merging the parquet-files!\n";
logger.error(errorMsg, e);
// The original table could not be dropped, so the temp-table cannot be renamed to the original..!
try { // Make sure we delete the already created temp-table, in order to be able to use it in the future. The merging has failed nevertheless.
jdbcTemplate.execute("DROP TABLE " + DatabaseConnector.databaseName + "." + tableName + "_tmp PURGE");
} catch (Exception e1) {
logger.error((errorMsg += "Failed to drop the \"" + tableName + "_tmp\" table!"), e1); // Add this error to the original, both are very important.
}
// At this point, the original table still exists.
return errorMsg;
// TODO - Should we shutdown the service? It is highly unlikely that anyone will observe this error live (and act immediately)..!
}
// Rename the temp-table to have the initial-table's name.
try {
jdbcTemplate.execute("ALTER TABLE " + DatabaseConnector.databaseName + "." + tableName + "_tmp RENAME TO " + DatabaseConnector.databaseName + "." + tableName);
} catch (Exception e) {
errorMsg = "Problem in renaming the \"" + tableName + "_tmp\" table to \"" + tableName + "\", when merging the parquet-files!\n";
logger.error(errorMsg, e);
// At this point we only have a "temp-table"; the original is already deleted..
// Try to create the original, as a copy of the temp-table. If that succeeds, then try to delete the temp-table.
try {
jdbcTemplate.execute("CREATE TABLE " + DatabaseConnector.databaseName + "." + tableName + " stored as parquet AS SELECT * FROM " + DatabaseConnector.databaseName + "." + tableName + "_tmp");
} catch (Exception e1) {
errorMsg = "Problem when copying the contents of \"" + tableName + "_tmp\" table to a newly created \"" + tableName + "\" table, when merging the parquet-files!\n";
logger.error(errorMsg, e1);
// If the original table was not created, then we have to intervene manually. If it was created but without any data, then we can safely move on handling other assignments and workerReports, but the data will be lost! So this workerReport failed to be handled.
try { // The below query normally returns a list, as it takes a "regex-pattern" as an input. BUT, we give just the table name, without wildcards. So the result is either the tableName itself or none (not any other table).
jdbcTemplate.queryForObject("SHOW TABLES IN " + DatabaseConnector.databaseName + " LIKE '" + tableName + "'", List.class);
} catch (EmptyResultDataAccessException erdae) {
// The table does not exist, so it was not even half-created by the previous query.
// Not having the original table anymore is a serious error. A manual action is needed!
logger.error((errorMsg += "The original table \"" + tableName + "\" must be created manually! Serious problems may appear otherwise!"));
// TODO - Should we shutdown the service? It is highly unlikely that anyone will observe this error live (and fix it immediately to avoid other errors in the Service)..!
}
// Here, the original-table exists in the DB, BUT without any data inside! This workerReport failed to be handled! (some of its data could not be loaded to the database, and all previous data was lost).
return errorMsg;
}
// The creation of the original table was successful. Try to delete the temp-table.
try {
jdbcTemplate.execute("DROP TABLE " + DatabaseConnector.databaseName + "." + tableName + "_tmp PURGE");
} catch (Exception e2) {
logger.error((errorMsg += "Problem when dropping the \"" + tableName + "_tmp\" table, when merging the parquet-files!\n"), e2);
// Manual deletion should be performed!
return errorMsg; // Return both errors here, as the second is so important that if it did not happen then we could move on with this workerReport.
// TODO - Should we shutdown the service? It is highly unlikely that anyone will observe this error live (and act immediately)..!
}
// Here the original table exists and the temp-table is deleted. We eventually have the same state as if the "ALTER TABLE" succeeded.
}
// Gather information to be used for queries-optimization.
try {
jdbcTemplate.execute("COMPUTE STATS " + DatabaseConnector.databaseName + "." + tableName);
} catch (Exception e) {
logger.error("Problem when gathering information from table \"" + tableName + "\" to be used for queries-optimization.", e);
// In this case the error is not so important to the whole operation.. It's only that the performance of this specific table will be less optimal, only temporarily, unless every "COMPUTE STATS" query fails for future workerReports too.
}
return null; // No errorMsg, everything is fine.
}
public static final DecimalFormat df = new DecimalFormat("0.00");
// The following regex might be useful in a future scenario. It extracts the "plain-filename" and "file-ID" and the "file-extension".
// Possible full-filenames are: "path1/path2/ID.pdf", "ID2.pdf", "path1/path2/ID(12).pdf", "ID2(25).pdf"
public static final Pattern FILENAME_ID_EXTENSION = Pattern.compile("(?:([^.()]+)/)?((([^/()]+)[^./]*)(\\.[\\w]{2,10}))$");
public static final Pattern FILEPATH_ID_EXTENSION = Pattern.compile("([^.()]+/)?((([^/()]+)[^./]*)(\\.[\\w]{2,10}))$");
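// Illustrative breakdown (not part of the original code) of what the groups of "FILEPATH_ID_EXTENSION" capture,
// for the example "path1/path2/ID(12).pdf" mentioned above:
//   group(1) -> "path1/path2/"  (the fileDir, including the trailing "/")
//   group(2) -> "ID(12).pdf"    (the fileNameWithExtension)
//   group(3) -> "ID(12)"        (the filenameWithoutExtension)
//   group(4) -> "ID"            (the fileNameID)
//   group(5) -> ".pdf"          (the dotFileExtension)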
private static final int numOfFullTextsPerBatch = 70; // The HTTP-headers cannot be too large (It failed with 100 fileNames).
@ -207,114 +106,89 @@ public class FileUtils {
workerIp = workerInfo.getWorkerIP(); // This won't be null.
// Get the file-locations.
final AtomicInteger numFullTextsFound = new AtomicInteger();
final AtomicInteger numFilesFoundFromPreviousAssignmentsBatches = new AtomicInteger();
int numValidFullTextsFound = 0;
int numFilesFoundFromPreviousAssignmentsBatches = 0;
int numFullTextsWithProblematicLocations = 0;
SetMultimap<String, Payload> allFileNamesWithPayloads = Multimaps.synchronizedSetMultimap(HashMultimap.create((sizeOfUrlReports / 5), 3)); // Holds multiple values for any key, if a fileName(key) has many IDs(values) associated with it.
HashMultimap<String, Payload> hashesWithPayloads = getHashesWithPayloads(urlReports, sizeOfUrlReports); // Holds multiple payloads for the same fileHash.
Set<String> fileHashes = hashesWithPayloads.keySet();
int fileHashesSetSize = fileHashes.size(); // Get the size of the keysSet, instead of the whole multimap.
if ( fileHashesSetSize == 0 ) {
logger.warn("No fulltexts were retrieved for assignments_" + assignmentsBatchCounter + ", from worker: \"" + workerId + "\".");
return UploadFullTextsResponse.successful_without_fulltexts; // It was handled, no error.
}
final String getFileLocationForHashQuery = "select `location` from " + DatabaseConnector.databaseName + ".payload" + (isTestEnvironment ? "_aggregated" : "") + " where `hash` = ? limit 1";
final int[] hashArgType = new int[] {Types.VARCHAR};
HashMap<String, String> hashLocationMap = getHashLocationMap(fileHashes, fileHashesSetSize, assignmentsBatchCounter, "assignments");
HashMultimap<String, Payload> allFileNamesWithPayloads = HashMultimap.create((sizeOfUrlReports / 5), 3); // Holds multiple values for any key, if a fileName(key) has many IDs(values) associated with it.
final List<Callable<Void>> callableTasks = new ArrayList<>(6);
for ( UrlReport urlReport : urlReports )
for ( String fileHash : fileHashes )
{
callableTasks.add(() -> {
Payload payload = urlReport.getPayload();
if ( payload == null )
return null;
String fileLocation = payload.getLocation();
if ( fileLocation == null )
return null; // The full-text was not retrieved, go to the next UrlReport.
// Query the payload-table FOR EACH RECORD to get the fileLocation of A PREVIOUS RECORD WITH THE SAME FILE-HASH.
// If no result is returned, then this record is not previously found, so go ahead and add it in the list of files to request from the worker.
// If a file-location IS returned (for this hash), then this file is already uploaded to the S3. Update the record to point to that file-location and do not request that file from the Worker.
String fileHash = payload.getHash();
if ( fileHash != null ) {
String alreadyFoundFileLocation = null;
try {
alreadyFoundFileLocation = jdbcTemplate.queryForObject(getFileLocationForHashQuery, new Object[] {fileHash}, hashArgType, String.class);
} catch (EmptyResultDataAccessException erdae) {
// No fileLocation is found, it's ok. It will be null by default.
} catch (Exception e) {
logger.error("Error when executing or acquiring data from the the \"getFileLocationForHashQuery\"!\n", e);
// TODO - SHOULD WE RETURN A "UploadFullTextsResponse.databaseError" AND force the caller to not even insert the payloads to the database??
// TODO - The idea is that since the database will have problems.. there is no point in trying to insert the payloads to Impala (we will handle it like: we tried to insert and got an error).
// Unless we do what it is said above, do not continue to the next UrlReport, this query-exception should not disrupt the normal full-text processing.
for ( Payload payload : hashesWithPayloads.get(fileHash) )
{
String alreadyFoundFileLocation = hashLocationMap.get(fileHash); // Only one location has been retrieved per fileHash.
if ( alreadyFoundFileLocation != null ) {
// Fill the payloads with locations from the "previously-found-hashes."
payload.setLocation(alreadyFoundFileLocation);
if ( logger.isTraceEnabled() )
logger.trace("The record with ID \"" + payload.getId() + "\" has an \"alreadyRetrieved\" file, with hash \"" + fileHash + "\" and location \"" + alreadyFoundFileLocation + "\"."); // DEBUG!
numFilesFoundFromPreviousAssignmentsBatches ++;
numValidFullTextsFound ++; // We trust the location being valid..
}
else { // This file has not been found before..
// Extract the "fileNameWithExtension" to be added in the HashMultimap.
String fileLocation = payload.getLocation();
Matcher matcher = FILEPATH_ID_EXTENSION.matcher(fileLocation);
if ( ! matcher.matches() ) {
logger.error("Failed to match the \"fileLocation\": \"" + fileLocation + "\" of id: \"" + payload.getId() + "\", originalUrl: \"" + payload.getOriginal_url() + "\", using this regex: " + FILEPATH_ID_EXTENSION);
numFullTextsWithProblematicLocations ++;
continue;
}
String fileNameWithExtension = matcher.group(2);
if ( (fileNameWithExtension == null) || fileNameWithExtension.isEmpty() ) {
logger.error("Failed to extract the \"fileNameWithExtension\" from \"fileLocation\": \"" + fileLocation + "\", of id: \"" + payload.getId() + "\", originalUrl: \"" + payload.getOriginal_url() + "\", using this regex: " + FILEPATH_ID_EXTENSION);
numFullTextsWithProblematicLocations ++;
continue;
}
if ( alreadyFoundFileLocation != null ) { // If the full-text of this record is already-found and uploaded.
payload.setLocation(alreadyFoundFileLocation); // Set the location to the older identical file, which was uploaded to S3. The other file-data is identical.
if ( logger.isTraceEnabled() )
logger.trace("The record with ID \"" + payload.getId() + "\" has an \"alreadyRetrieved\" file, with hash \"" + fileHash + "\" and location \"" + alreadyFoundFileLocation + "\"."); // DEBUG!
numFilesFoundFromPreviousAssignmentsBatches.incrementAndGet();
numFullTextsFound.incrementAndGet();
return null; // Do not request the file from the worker, it's already uploaded. Move on. The "location" will be filled by the "setFullTextForMultiplePayloads()" method, later.
}
}
// Extract the "fileNameWithExtension" to be added in the HashMultimap.
Matcher matcher = FILENAME_ID_EXTENSION.matcher(fileLocation);
if ( ! matcher.matches() ) {
logger.error("Failed to match the \"fileLocation\": \"" + fileLocation + "\" of id: \"" + payload.getId() + "\", originalUrl: \"" + payload.getOriginal_url() + "\", using this regex: " + FILENAME_ID_EXTENSION);
return null;
}
String fileNameWithExtension = matcher.group(2);
if ( (fileNameWithExtension == null) || fileNameWithExtension.isEmpty() ) {
logger.error("Failed to extract the \"fileNameWithExtension\" from \"fileLocation\": \"" + fileLocation + "\", of id: \"" + payload.getId() + "\", originalUrl: \"" + payload.getOriginal_url() + "\", using this regex: " + FILENAME_ID_EXTENSION);
return null;
}
numFullTextsFound.incrementAndGet();
allFileNamesWithPayloads.put(fileNameWithExtension, payload); // The keys and the values are not duplicate.
// Task with ID-1 might have an "ID-1.pdf" file, while a task with ID-2 can also have an "ID-1.pdf" file, as the pdf-url-2 might be the same as pdf-url-1; thus, the ID-2 file was not downloaded again.
return null;
});
}// end-for
DatabaseConnector.databaseLock.lock(); // The execution uses the database.
try { // Invoke all the tasks and wait for them to finish before moving to the next batch.
List<Future<Void>> futures = hashMatchingExecutor.invokeAll(callableTasks);
for ( Future<Void> future : futures ) {
try {
Void result = future.get(); // The result is always "null" as we have a "Void" type.
} catch (Exception e) {
logger.error("", e);
numValidFullTextsFound ++;
allFileNamesWithPayloads.put(fileNameWithExtension, payload); // The keys and the values are not duplicate.
// Task with ID-1 might have an "ID-1.pdf" file, while a task with ID-2 can also have an "ID-1.pdf" file, as the pdf-url-2 might be the same as pdf-url-1; thus, the ID-2 file was not downloaded again.
}
}
} catch (InterruptedException ie) { // In this case, any unfinished tasks are cancelled.
logger.warn("The current thread was interrupted when waiting for the worker-threads to finish checking for already-found file-hashes: " + ie.getMessage());
// This is a very rare case. At the moment, we just move on with what we have so far.
} catch (Exception e) {
logger.error("Unexpected error when checking for already-found file-hashes in parallel!", e);
return UploadFullTextsResponse.unsuccessful;
} finally {
DatabaseConnector.databaseLock.unlock(); // The remaining work of this function does not use the database.
}
logger.info("NumFullTextsFound by assignments_" + assignmentsBatchCounter + " = " + numFullTextsFound.get() + " (out of " + sizeOfUrlReports + " | about " + df.format(numFullTextsFound.get() * 100.0 / sizeOfUrlReports) + "%).");
logger.debug("NumFilesFoundFromPreviousAssignmentsBatches = " + numFilesFoundFromPreviousAssignmentsBatches.get());
if ( numFullTextsWithProblematicLocations > 0 )
logger.warn(numFullTextsWithProblematicLocations + " files had problematic names.");
ArrayList<String> allFileNames = new ArrayList<>(allFileNamesWithPayloads.keySet());
int numAllFullTexts = allFileNames.size();
if ( numAllFullTexts == 0 ) {
if ( numValidFullTextsFound == 0 ) {
logger.warn("No full-text files were retrieved for assignments_" + assignmentsBatchCounter + " | from worker: " + workerId);
return UploadFullTextsResponse.successful; // It was handled, no error.
return UploadFullTextsResponse.successful_without_fulltexts; // It's not what we want, but it's not an error either.
}
ArrayList<String> allFileNames = new ArrayList<>(allFileNamesWithPayloads.keySet()); // The number of fulltexts are lower than the number of payloads, since multiple payloads may lead to the same file.
int numFullTextsToBeRequested = allFileNames.size();
if ( numFullTextsToBeRequested == 0 ) {
logger.info(numValidFullTextsFound + " fulltexts were retrieved for assignments_" + assignmentsBatchCounter + ", from worker: \"" + workerId + "\", but all of them have been retrieved before.");
return UploadFullTextsResponse.successful_without_fulltexts; // It was handled, no error.
}
logger.info("NumFullTextsFound by assignments_" + assignmentsBatchCounter + " = " + numValidFullTextsFound + " (out of " + sizeOfUrlReports + " | about " + df.format(numValidFullTextsFound * 100.0 / sizeOfUrlReports) + "%).");
// TODO - Have a prometheus GAUGE to hold the value of the above percentage, so that we can track the success-rates over time..
logger.debug("NumFilesFoundFromPreviousAssignmentsBatches = " + numFilesFoundFromPreviousAssignmentsBatches);
// Request the full-texts in batches, compressed in a zstd tar file.
int numOfBatches = (numAllFullTexts / numOfFullTextsPerBatch);
int remainingFiles = (numAllFullTexts % numOfFullTextsPerBatch);
int numOfBatches = (numFullTextsToBeRequested / numOfFullTextsPerBatch);
int remainingFiles = (numFullTextsToBeRequested % numOfFullTextsPerBatch);
if ( remainingFiles > 0 ) { // Add an extra batch for the remaining files. This guarantees at least one batch will exist no matter how few (>0) the files are.
numOfBatches++;
logger.debug("The assignments_" + assignmentsBatchCounter + " have " + numAllFullTexts + " distinct non-already-uploaded fullTexts (total is: " + numFullTextsFound.get() + "). Going to request them from the Worker \"" + workerId + "\", in " + numOfBatches + " batches (" + numOfFullTextsPerBatch + " files each, except for the final batch, which will have " + remainingFiles + " files).");
logger.debug("The assignments_" + assignmentsBatchCounter + " have " + numFullTextsToBeRequested + " distinct, non-already-uploaded fullTexts (total is: " + numValidFullTextsFound + "). Going to request them from the Worker \"" + workerId + "\", in " + numOfBatches + " batches (" + numOfFullTextsPerBatch + " files each, except for the final batch, which will have " + remainingFiles + " files).");
} else
logger.debug("The assignments_" + assignmentsBatchCounter + " have " + numAllFullTexts + " distinct non-already-uploaded fullTexts (total is: " + numFullTextsFound.get() + "). Going to request them from the Worker \"" + workerId + "\", in " + numOfBatches + " batches (" + numOfFullTextsPerBatch + " files each).");
logger.debug("The assignments_" + assignmentsBatchCounter + " have " + numFullTextsToBeRequested + " distinct, non-already-uploaded fullTexts (total is: " + numValidFullTextsFound + "). Going to request them from the Worker \"" + workerId + "\", in " + numOfBatches + " batches (" + numOfFullTextsPerBatch + " files each).");
// Check if one full-text is left out because of the division. Put it in the last batch.
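// Worked example (illustrative, matching the log-sample further below): for 211 distinct fullTexts and
// numOfFullTextsPerBatch = 70, we get numOfBatches = 211 / 70 = 3 and remainingFiles = 211 % 70 = 1,
// so one extra batch is added and the Worker is asked for 4 batches: 70 + 70 + 70 + 1 files.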
String baseUrl = "http://" + workerIp + ":1881/api/full-texts/getFullTexts/" + assignmentsBatchCounter + "/" + numOfBatches + "/";
String baseUrl = "http://" + workerIp + ":" + workerPort + "/api/full-texts/getFullTexts/" + assignmentsBatchCounter + "/" + numOfBatches + "/";
// TODO - The worker should send the port in which it accepts requests, along with the current request.
// TODO - The least we have to do is to expose the port-assignment somewhere more obvious, like inside the "application.yml" file.
@ -343,12 +217,12 @@ public class FileUtils {
curBatchPath = Files.createDirectories(Paths.get(targetDirectory));
// The base-directory will be created along with the first batch-directory.
} catch (Exception e) {
logger.error("Could not create the \"curBatchPath\" directory: " + targetDirectory + "\n" + e.getMessage(), e); // It shows the response body (after Spring v.2.5.6).
logger.error("Could not create the \"curBatchPath\" directory: " + targetDirectory + GenericUtils.endOfLine + e.getMessage(), e); // It shows the response body (after Spring v.2.5.6).
failedBatches ++;
continue;
}
List<String> fileNamesForCurBatch = getFileNamesForBatch(allFileNames, numAllFullTexts, batchCounter);
List<String> fileNamesForCurBatch = getFileNamesForBatch(allFileNames, numFullTextsToBeRequested, batchCounter);
String zstdFileFullPath = targetDirectory + "fullTexts_" + assignmentsBatchCounter + "_" + batchCounter + ".tar.zstd";
try {
if ( ! getAndSaveFullTextBatch(fileNamesForCurBatch, baseUrl, assignmentsBatchCounter, batchCounter, numOfBatches, zstdFileFullPath, workerId) ) {
@ -364,14 +238,98 @@ public class FileUtils {
failedBatches ++;
} // End of batches.
updateUrlReportsToHaveNoFullTextFiles(urlReports, true); // Make sure all records without an S3-Url have < null > file-data (some batches or uploads might have failed).
if ( failedBatches == numOfBatches )
logger.error("None of the " + numOfBatches + " batches could be handled for assignments_" + assignmentsBatchCounter + ", for worker: " + workerId);
removeUnretrievedFullTextsFromUrlReports(urlReports, true); // Make sure all records without an S3-Url have < null > file-data (some batches or uploads might have failed).
deleteDirectory(new File(curAssignmentsBaseLocation));
if ( failedBatches == numOfBatches ) {
logger.error("None of the " + numOfBatches + " batches could be handled for assignments_" + assignmentsBatchCounter + ", for worker: " + workerId);
// Check and warn about the number of failed payloads.
// Possible reasons: we failed to check their hash in the DB, the file was not found inside the worker, a whole batch failed to be delivered from the worker, or files failed to be uploaded to S3.
// Retrieve the payloads from the existing urlReports.
long finalPayloadsCounter = urlReports.parallelStream()
.map(UrlReport::getPayload).filter(payload -> ((payload != null) && (payload.getLocation() != null)))
.count();
int numInitialPayloads = (numValidFullTextsFound + numFullTextsWithProblematicLocations);
long numFailedPayloads = (numInitialPayloads - finalPayloadsCounter);
if ( numFailedPayloads == numInitialPayloads ) {
// This will also be the case if there was no DB failure, but all the batches have failed.
logger.error("None of the " + numInitialPayloads + " payloads could be handled for assignments_" + assignmentsBatchCounter + ", for worker: " + workerId);
return UploadFullTextsResponse.unsuccessful;
} else
return UploadFullTextsResponse.successful;
} else if ( numFailedPayloads > 0 )
logger.warn(numFailedPayloads + " payloads (out of " + numInitialPayloads + ") failed to be processed for assignments_" + assignmentsBatchCounter + ", for worker: " + workerId);
return UploadFullTextsResponse.successful;
}
public HashMultimap<String, Payload> getHashesWithPayloads(List<UrlReport> urlReports, int sizeOfUrlReports)
{
HashMultimap<String, Payload> hashesWithPayloads = HashMultimap.create((sizeOfUrlReports / 5), 3); // Holds multiple payloads for the same fileHash.
// The "Hash" part of the multimap helps with avoiding duplicate fileHashes.
for ( UrlReport urlReport : urlReports )
{
Payload payload = urlReport.getPayload();
if ( payload == null )
continue;
String fileLocation = payload.getLocation();
if ( fileLocation == null )
continue; // The full-text was not retrieved for this UrlReport.
// Query the payload-table FOR EACH RECORD to get the fileLocation of A PREVIOUS RECORD WITH THE SAME FILE-HASH.
// If no result is returned, then this record is not previously found, so go ahead and add it in the list of files to request from the worker.
// If a file-location IS returned (for this hash), then this file is already uploaded to the S3. Update the record to point to that file-location and do not request that file from the Worker.
String fileHash = payload.getHash();
if ( fileHash != null )
{
hashesWithPayloads.put(fileHash, payload); // Hold multiple payloads per fileHash.
// There are 2 cases, which contribute to that:
// 1) Different publication-IDs end up giving the same full-text-url, resulting in the same file. Those duplicates are not saved; instead, the location, hash and size of the file are copied to the other payload.
// 2) Different publication-IDs end up giving different full-text-urls which point to the same file. Although very rare, in this case, the file is downloaded again by the Worker and has a different name.
// In either case, the duplicate file will not be transferred to the Controller, but in the 2nd one it takes up extra space, at least for some time.
// TODO - Implement a fileHash-check algorithm in the Worker's side ("PublicationsRetriever"), to avoid keeping those files in storage.
} else // This should never happen..
logger.error("Payload: " + payload + " has a null fileHash!");
}// end-for
return hashesWithPayloads;
}
public HashMap<String, String> getHashLocationMap(Set<String> fileHashes, int fileHashesSetSize, long batchCounter, String groupType)
{
// Prepare the "fileHashListString" to be used inside the "getHashLocationsQuery". Create the following string-pattern:
// ("HASH_1", "HASH_2", ...)
int stringBuilderCapacity = ((fileHashesSetSize * 32) + (fileHashesSetSize -1) +2);
String getHashLocationsQuery = "select distinct `hash`, `location` from " + DatabaseConnector.databaseName + ".payload where `hash` in "
+ getQueryListString(new ArrayList<>(fileHashes), fileHashesSetSize, stringBuilderCapacity);
HashMap<String, String> hashLocationMap = new HashMap<>(fileHashesSetSize/2); // No multimap is needed since only one location is returned for each fileHash.
DatabaseConnector.databaseLock.lock(); // The execution uses the database.
try {
jdbcTemplate.query(getHashLocationsQuery, rs -> {
try { // For each of the 2 columns returned, do the following. The column-indexing starts from 1.
hashLocationMap.put(rs.getString(1), rs.getString(2));
} catch (SQLException sqle) {
logger.error("No value was able to be retrieved from one of the columns of row_" + rs.getRow(), sqle);
}
});
} catch (EmptyResultDataAccessException erdae) {
logger.warn("No previously-found hash-locations where found for " + groupType + "_" + batchCounter);
} catch (Exception e) {
logger.error("Unexpected error when checking for already-found file-hashes, for " + groupType + "_" + batchCounter, e);
// We will continue with storing the files; we do not want to lose them.
} finally {
DatabaseConnector.databaseLock.unlock();
}
return hashLocationMap;
}
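// A minimal sketch (an assumption, since "getQueryListString()" is not shown in this diff) of a helper
// producing the ("HASH_1", "HASH_2", ...) pattern described above; the real implementation and quoting style may differ.
private static String getQueryListStringSketch(List<String> values, int size, int stringBuilderCapacity) {
final StringBuilder sb = new StringBuilder(stringBuilderCapacity);
sb.append("(");
for ( int i = 0; i < size; i++ ) {
sb.append("\"").append(values.get(i)).append("\""); // Quote each hash, as it is a string-value inside the query.
if ( i < (size - 1) )
sb.append(", ");
}
sb.append(")");
return sb.toString();
}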
@ -380,7 +338,7 @@ public class FileUtils {
{
HttpURLConnection conn;
try {
if ( (conn = getConnection(baseUrl, assignmentsBatchCounter, batchCounter, fileNamesForCurBatch, numOfBatches, workerId)) == null )
if ( (conn = getConnectionForFullTextBatch(baseUrl, assignmentsBatchCounter, batchCounter, fileNamesForCurBatch, numOfBatches, workerId)) == null )
return false;
} catch (RuntimeException re) {
// The "cause" was logged inside "getConnection()".
@ -403,14 +361,14 @@ public class FileUtils {
logger.error("No full-texts' fleNames where extracted from directory: " + targetDirectory);
return false;
} else if ( (extractedFileNames.length - 2) != fileNamesForCurBatch.size() ) {
logger.warn("The number of extracted files (" + (extractedFileNames.length - 2) + ") was not equal to the number of the current-batch's (" + batchCounter + ") files (" + fileNamesForCurBatch.size() + ").");
logger.warn("The number of extracted files (" + (extractedFileNames.length - 2) + ") was not equal to the number of files (" + fileNamesForCurBatch.size() + ") of the current batch_" + batchCounter);
// We do NOT have to find and cross-reference the missing files with the urlReports, in order to set their locations to <null>,
// since, at the end of each assignments-batch, an iteration will be made and the app will set the file-data of all non-retrieved and non-uploaded full-texts to null.
}
uploadFullTexts(extractedFileNames, targetDirectory, allFileNamesWithPayloads);
uploadFullTexts(extractedFileNames, targetDirectory, allFileNamesWithPayloads, batchCounter);
return true;
} catch (Exception e) {
logger.error("Could not extract and upload the full-texts for batch_" + batchCounter + " of assignments_" + assignmentsBatchCounter + "\n" + e.getMessage(), e); // It shows the response body (after Spring v.2.5.6).
logger.error("Could not extract and upload the full-texts for batch_" + batchCounter + " of assignments_" + assignmentsBatchCounter + GenericUtils.endOfLine + e.getMessage(), e); // It shows the response body (after Spring v.2.5.6).
return false;
} finally {
deleteDirectory(curBatchPath.toFile());
@ -418,7 +376,7 @@ public class FileUtils {
}
private HttpURLConnection getConnection(String baseUrl, long assignmentsBatchCounter, int batchNum, List<String> fileNamesForCurBatch, int totalBatches, String workerId) throws RuntimeException
private HttpURLConnection getConnectionForFullTextBatch(String baseUrl, long assignmentsBatchCounter, int batchNum, List<String> fileNamesForCurBatch, int totalBatches, String workerId) throws RuntimeException
{
baseUrl += batchNum + "/";
String requestUrl = getRequestUrlForBatch(baseUrl, fileNamesForCurBatch);
@ -427,14 +385,36 @@ public class FileUtils {
HttpURLConnection conn = (HttpURLConnection) new URL(requestUrl).openConnection();
conn.setRequestMethod("GET");
conn.setRequestProperty("User-Agent", "UrlsController");
// TODO - Write the fileNames in the RequestBody, so that we can include as many as we want in each request.
// Right now, we can include only up to 70-80 fileNames in the url-string.
// TODO - We need to add the fileNames in the requestBody BEFORE we connect. So we will need to refactor the code to work in that order.
/*OutputStream os = conn.getOutputStream();
OutputStreamWriter osw = new OutputStreamWriter(os, StandardCharsets.UTF_8);
osw.write(fileNamesForCurBatch_separated_by_comma);
osw.flush();
osw.close();
os.close();*/
// TODO - The above update will also enable us to improve edge-case management, like making sure we do not create a whole new batch for just a few files..
// Check this case for example, where we have one extra batch, with all the network, compression-decompression, transfer and uploading overhead:
// 2023-02-06 12:17:26.893 [http-nio-1880-exec-8] DEBUG e.o.urls_controller.util.FileUtils.getAndUploadFullTexts(@235) - The assignments_12 have 211 distinct non-already-uploaded fullTexts.
// Going to request them from the Worker "worker_X", in 4 batches (70 files each, except for the final batch, which will have 1 files).
// If we are not limited by the url-length, we can easily say that if fewer than 10 files remain for the last batch, then add them to the previous batch (e.g. the last batch will have 79 files).
// If 10 or more files remain, then we will make an extra batch.
conn.connect();
int statusCode = conn.getResponseCode();
if ( statusCode == -1 ) { // Invalid HTTP-Response.
logger.warn("Problem when getting the \"status-code\" for url: " + requestUrl);
throw new RuntimeException(); // Avoid any other batches.
} else if ( statusCode != 200 ) {
logger.warn("HTTP-" + statusCode + ": " + getMessageFromResponseBody(conn, true) + "\n\nProblem when requesting the ZstdFile of batch_" + batchNum + " from the Worker with ID \"" + workerId + "\" and requestUrl: " + requestUrl);
if ( (statusCode >= 500) && (statusCode <= 599) )
String errMsg = getMessageFromResponseBody(conn, true);
logger.warn("HTTP-" + statusCode + ": " + errMsg + "\n\nProblem when requesting the ZstdFile of batch_" + batchNum + " from the Worker with ID \"" + workerId + "\" and requestUrl: " + requestUrl);
if ( ((statusCode >= 500) && (statusCode <= 599))
|| ((statusCode == 400) && ((errMsg != null) && errMsg.contains("The base directory for assignments_" + assignmentsBatchCounter + " was not found"))) )
throw new RuntimeException(); // Throw an exception to indicate that the Worker has problems and all remaining batches will fail as well.
return null;
} else
@ -443,7 +423,7 @@ public class FileUtils {
throw re;
} catch (Exception e) {
String exMessage = e.getMessage();
logger.warn("Problem when requesting the ZstdFile of batch_" + batchNum + " of assignments_" + assignmentsBatchCounter + " from the Worker with ID \"" + workerId + "\" and requestUrl: " + requestUrl + "\n" + exMessage);
logger.warn("Problem when requesting the ZstdFile of batch_" + batchNum + " of assignments_" + assignmentsBatchCounter + " from the Worker with ID \"" + workerId + "\" and requestUrl: " + requestUrl + GenericUtils.endOfLine + exMessage);
if ( exMessage.contains("Connection refused") ) {
logger.error("Since we received a \"Connection refused\", from \"" + workerId + "\", all of the remaining batches (" + (totalBatches - batchNum) + ") will not be requested!");
throw new RuntimeException();
@ -453,11 +433,11 @@ public class FileUtils {
}
private void uploadFullTexts(String[] fileNames, String targetDirectory, SetMultimap<String, Payload> allFileNamesWithPayloads)
private void uploadFullTexts(String[] fileNames, String targetDirectory, SetMultimap<String, Payload> allFileNamesWithPayloads, int batchCounter)
{
// Iterate over the files and upload them to S3.
//int numUploadedFiles = 0;
for( String fileName : fileNames )
for ( String fileName : fileNames )
{
if ( fileName.contains(".tar") ) // Exclude the tar-files from uploading (".tar" and ".tar.zstd").
continue;
@ -470,66 +450,68 @@ public class FileUtils {
}
// Let's try to upload the file to S3 and update the payloads, either in successful file-uploads (right-away) or not (in the end).
try {
// Prepare the filename as: "datasourceid/publicationid::hash.pdf"
// All related payloads point to this exact same file, BUT, may be related with different urlIDs, which in turn may be related with different datasourceIDs.
// This file could have been found from different urlIds and thus be related to multiple datasourceIds.
// BUT, since the filename contains a specific urlID, the datasourceId should be the one related to that specific urlID.
// So, we extract this urlID, search the payload inside the "fileRelatedPayloads" and get the related datasourceID (instead of taking the first or a random datasourceID).
Matcher matcher = FILENAME_ID_EXTENSION.matcher(fileName);
if ( !matcher.matches() ) {
logger.error("Failed to match the \"" + fileName + "\" with the regex: " + FILENAME_ID_EXTENSION);
continue;
}
// The "matcher.group(3)" returns the "filenameWithoutExtension", which is currently not used.
String fileNameID = matcher.group(4);
if ( (fileNameID == null) || fileNameID.isEmpty() ) {
logger.error("Failed to extract the \"fileNameID\" from \"" + fileName + "\".");
continue;
}
String dotFileExtension = matcher.group(5);
if ( (dotFileExtension == null) || dotFileExtension.isEmpty() ) {
logger.error("Failed to extract the \"dotFileExtension\" from \"" + fileName + "\".");
continue;
}
// Prepare the filename as: "datasourceid/publicationid::hash.pdf"
// All related payloads point to this exact same file, BUT, may be related with different urlIDs, which in turn may be related with different datasourceIDs.
// This file could have been found from different urlIds and thus be related to multiple datasourceIds.
// BUT, since the filename contains a specific urlID, the datasourceId should be the one related to that specific urlID.
// So, we extract this urlID, search the payload inside the "fileRelatedPayloads" and get the related datasourceID (instead of taking the first or a random datasourceID).
// This file is related to some payloads, in the sense that these payloads have urls which lead to the same full-text url.
// These payloads might have different IDs and sourceUrls. But, in the end, the different sourceUrls give the same full-text.
// Below, we make sure we pick the "datasource" from the payload, which has the same id as the full-text's name.
// If there are multiple payloads with the same id, which point to the same file, then we can take whatever datasource we want from those payloads.
// It is possible that payloads with same IDs, but different sourceUrls pointing to the same full-text, can be related with different datasources
// (especially for IDs of type: "doiboost_____::XXXXXXXXXXXXXXXXXXXXX").
// It does not really matter, since the first-ever payload to give this full-text could very well be another one,
// since the crawling happens in multiple threads which compete with each other for CPU time.
String datasourceId = null;
String hash = null;
boolean isFound = false;
for ( Payload payload : fileRelatedPayloads ) {
if ( fileNameID.equals(payload.getId()) ) {
datasourceId = payload.getDatasourceId();
hash = payload.getHash();
isFound = true;
break;
}
}
if ( !isFound ) { // This should never normally happen. If it does, then a very bad change will have taken place.
logger.error("The \"fileNameID\" (" + fileNameID + ") was not found inside the \"fileRelatedPayloads\" for fileName: " + fileName);
continue;
}
String s3Url = constructFileNameAndUploadToS3(targetDirectory, fileName, fileNameID, dotFileExtension, datasourceId, hash);
if ( s3Url == null )
continue;
setFullTextForMultiplePayloads(fileRelatedPayloads, s3Url);
//numUploadedFiles ++;
} catch (Exception e) {
logger.error("Could not upload the file \"" + fileName + "\" to the S3 ObjectStore!", e);
Matcher matcher = FILEPATH_ID_EXTENSION.matcher(fileName);
if ( !matcher.matches() ) {
logger.error("Failed to match the \"" + fileName + "\" with the regex: " + FILEPATH_ID_EXTENSION);
continue;
}
// Else, the record will have its file-data set to "null", at the end of this method.
// The "matcher.group(3)" returns the "filenameWithoutExtension", which is currently not used.
// Use the "fileNameID" and not the "filenameWithoutExtension", as we want to avoid keeping the possible "parenthesis" with the increasing number (about the duplication of ID-fileName).
String fileNameID = matcher.group(4); // The "fileNameID" is the OpenAIRE_ID for this file.
if ( (fileNameID == null) || fileNameID.isEmpty() ) {
logger.error("Failed to extract the \"fileNameID\" from \"" + fileName + "\".");
continue;
}
String dotFileExtension = matcher.group(5);
if ( (dotFileExtension == null) || dotFileExtension.isEmpty() ) {
logger.error("Failed to extract the \"dotFileExtension\" from \"" + fileName + "\".");
continue;
}
// This file is related to some payloads, in the sense that these payloads have urls which lead to the same full-text url.
// These payloads might have different IDs and sourceUrls. But, in the end, the different sourceUrls give the same full-text.
// Below, we make sure we pick the "datasource" from the payload, which has the same id as the full-text's name.
// If there are multiple payloads with the same id, which point to the same file, then we can take whatever datasource we want from those payloads.
// It is possible that payloads with same IDs, but different sourceUrls pointing to the same full-text, can be related with different datasources
// (especially for IDs of type: "doiboost_____::XXXXXXXXXXXXXXXXXXXXX").
// It does not really matter, since the first-ever payload to give this full-text could very well be another one,
// since the crawling happens in multiple threads which compete with each other for CPU time.
String datasourceId = null;
String hash = null;
boolean isFound = false;
for ( Payload payload : fileRelatedPayloads ) {
if ( fileNameID.equals(payload.getId()) ) {
datasourceId = payload.getDatasourceId();
hash = payload.getHash();
isFound = true;
break;
}
}
if ( !isFound ) { // This should never normally happen. If it does, then a very bad change will have taken place.
logger.error("The \"fileNameID\" (" + fileNameID + ") was not found inside the \"fileRelatedPayloads\" for fileName: " + fileName);
continue;
}
try {
String s3Url = constructS3FilenameAndUploadToS3(targetDirectory, fileName, fileNameID, dotFileExtension, datasourceId, hash);
if ( s3Url != null ) {
setFullTextForMultiplePayloads(fileRelatedPayloads, s3Url);
//numUploadedFiles ++;
}
} catch (Exception e) {
logger.error("Avoid uploading the rest of the files of batch_" + batchCounter + " | " + e.getMessage());
break;
}
// Else, the record will have its file-data set to "null", at the end of the caller method (as it will not have an s3Url as its location).
}//end filenames-for-loop
//logger.debug("Finished uploading " + numUploadedFiles + " full-texts (out of " + (fileNames.length -2) + " distinct files) from assignments_" + assignmentsBatchCounter + ", batch_" + batchCounter + " on S3-ObjectStore.");
@ -537,7 +519,32 @@ public class FileUtils {
}
public String constructFileNameAndUploadToS3(String fileDir, String fileName, String openAireID, String dotFileExtension, String datasourceId, String hash) throws Exception
public String constructS3FilenameAndUploadToS3(String targetDirectory, String fileName, String openAireId,
String dotFileExtension, String datasourceId, String hash) throws ConnectException, UnknownHostException
{
String filenameForS3 = constructS3FileName(fileName, openAireId, dotFileExtension, datasourceId, hash); // This name is for the uploaded file, in the S3 Object Store.
if ( filenameForS3 == null ) // The error is logged inside.
return null;
String fileFullPath = targetDirectory + fileName; // The fullPath to the local file (which has the previous name).
String s3Url = null;
try {
s3Url = s3ObjectStore.uploadToS3(filenameForS3, fileFullPath);
} catch (ConnectException ce) {
logger.error("Could not connect with the S3 Object Store! " + ce.getMessage());
throw ce;
} catch (UnknownHostException uhe) {
logger.error("The S3 Object Store could not be found! " + uhe.getMessage());
throw uhe;
} catch (Exception e) {
logger.error("Could not upload the local-file \"" + fileFullPath + "\" to the S3 ObjectStore, with S3-filename: \"" + filenameForS3 + "\"!", e);
return null;
}
return s3Url;
}
public String constructS3FileName(String fileName, String openAireID, String dotFileExtension, String datasourceId, String hash)
{
if ( datasourceId == null ) {
logger.error("The retrieved \"datasourceId\" was \"null\" for file: " + fileName);
@ -548,19 +555,15 @@ public class FileUtils {
return null;
}
String fileFullPath = fileDir + File.separator + fileName; // The fullPath to the local file.
// Use the "fileNameID" and not the "filenameWithoutExtension", as we want to avoid keeping the possible "parenthesis" with the increasing number (about the duplication of ID-fileName).
// Now we append the file-hash, so it is guaranteed that the filename will be unique.
fileName = datasourceId + "/" + openAireID + "::" + hash + dotFileExtension; // This is the fileName to be used in the objectStore, not of the local file!
return s3ObjectStore.uploadToS3(fileName, fileFullPath);
return datasourceId + "/" + openAireID + "::" + hash + dotFileExtension; // This is the fileName to be used in the objectStore, not of the local file!
}
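// Illustrative example (with made-up placeholder values) of the S3 object-name produced by the return-statement above:
// "SOME_DATASOURCE_ID/SOME_OPENAIRE_ID::SOME_FILE_HASH.pdf"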
public String getMessageFromResponseBody(HttpURLConnection conn, boolean isError) {
public String getMessageFromResponseBody(HttpURLConnection conn, boolean isError)
{
final StringBuilder msgStrB = new StringBuilder(500);
try ( BufferedReader br = new BufferedReader(new InputStreamReader((isError ? conn.getErrorStream() : conn.getInputStream()))) ) {
try ( BufferedReader br = new BufferedReader(new InputStreamReader((isError ? conn.getErrorStream() : conn.getInputStream()), StandardCharsets.UTF_8)) ) {
String inputLine;
while ( (inputLine = br.readLine()) != null )
{
@ -578,7 +581,8 @@ public class FileUtils {
}
private List<String> getFileNamesForBatch(List<String> allFileNames, int numAllFullTexts, int curBatch) {
private List<String> getFileNamesForBatch(List<String> allFileNames, int numAllFullTexts, int curBatch)
{
int initialIndex = ((curBatch-1) * numOfFullTextsPerBatch);
int endingIndex = (curBatch * numOfFullTextsPerBatch);
if ( endingIndex > numAllFullTexts ) // This might be the case, when the "numAllFullTexts" is too small.
@ -589,14 +593,15 @@ public class FileUtils {
try {
fileNamesOfCurBatch.add(allFileNames.get(i));
} catch (IndexOutOfBoundsException ioobe) {
logger.error("IOOBE for i=" + i + "\n" + ioobe.getMessage(), ioobe);
logger.error("IOOBE for i=" + i + GenericUtils.endOfLine + ioobe.getMessage(), ioobe);
}
}
return fileNamesOfCurBatch;
}
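A small arithmetic example of the batch-indexing above (numbers are made up; in the class itself "numOfFullTextsPerBatch" is a configured field):
    // With numOfFullTextsPerBatch = 70 and numAllFullTexts = 150:
    //   batch 1 -> indices [0, 70), batch 2 -> [70, 140), batch 3 -> [140, 150) (the ending index gets capped).
    int numOfFullTextsPerBatch = 70, numAllFullTexts = 150, curBatch = 3;
    int initialIndex = ((curBatch - 1) * numOfFullTextsPerBatch);                      // 140
    int endingIndex = Math.min((curBatch * numOfFullTextsPerBatch), numAllFullTexts);  // 150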
private String getRequestUrlForBatch(String baseUrl, List<String> fileNamesForCurBatch) {
private String getRequestUrlForBatch(String baseUrl, List<String> fileNamesForCurBatch)
{
final StringBuilder sb = new StringBuilder(numOfFullTextsPerBatch * 50);
sb.append(baseUrl);
int numFullTextsCurBatch = fileNamesForCurBatch.size();
@ -609,6 +614,8 @@ public class FileUtils {
}
public static final int twentyFiveKb = 25_600; // 25 Kb
public static final int halfMb = 524_288; // 0.5 Mb = 512 Kb = 524_288 bytes
public static final int tenMb = (10 * 1_048_576);
public boolean saveArchive(HttpURLConnection conn, File zstdFile)
@ -635,7 +642,8 @@ public class FileUtils {
* @param urlReports
* @param shouldCheckAndKeepS3UploadedFiles
*/
public void updateUrlReportsToHaveNoFullTextFiles(List<UrlReport> urlReports, boolean shouldCheckAndKeepS3UploadedFiles) {
public void removeUnretrievedFullTextsFromUrlReports(List<UrlReport> urlReports, boolean shouldCheckAndKeepS3UploadedFiles)
{
for ( UrlReport urlReport : urlReports ) {
Payload payload = urlReport.getPayload();
if ( payload == null )
@ -657,12 +665,117 @@ public class FileUtils {
}
/**
* This method searches the backlog of publications for records whose "original_url" matches either the "original_url" or the "actual_url" of an existing payload.
* For each match, it generates a new "UrlReport" object whose payload is a clone of a previously generated one, with only the "id" and the "original_url" changed.
* Then, the program automatically generates "attempt" and "payload" records for these additional UrlReport-records.
* It must be executed inside the same "database-locked" block of code, along with the inserts of the attempt and payload records.
* */
public boolean addUrlReportsByMatchingRecordsFromBacklog(List<UrlReport> urlReports, List<Payload> initialPayloads, int numInitialPayloads, long assignmentsBatchCounter)
{
logger.debug("numInitialPayloads: " + numInitialPayloads + " | assignmentsBatchCounter: " + assignmentsBatchCounter);
// Create a HashMultimap, containing the "original_url" or "actual_url" as the key and the related "payload" objects as its values.
final HashMultimap<String, Payload> urlToPayloadsMultimap = HashMultimap.create((numInitialPayloads / 3), 3); // Holds multiple values for any key, if a url(key) has multiple payloads(values) associated with it.
for ( Payload payload : initialPayloads ) {
String original_url = payload.getOriginal_url();
String actual_url = payload.getActual_url();
// Link this payload with both the original and actual urls (in case they are different).
urlToPayloadsMultimap.put(original_url, payload);
if ( ! actual_url.equals(original_url) )
urlToPayloadsMultimap.put(actual_url, payload);
}
// A url may be related to different payloads in the urlReports. For example, in one payload the url was the original_url,
// but in another payload the same url was only the actual_url (the original was different).
// Gather the original and actual urls of the current payloads and add them in a list usable by the query.
List<String> urlsToRetrieveRelatedIDs = initialPayloads.parallelStream()
.flatMap(payload -> Stream.of(payload.getOriginal_url(), payload.getActual_url())) // Add both "original_url" and "actual_url" in the final results.
.collect(Collectors.toList());
// Prepare the "urlsToRetrieveRelatedIDs" to be used inside the "getDataForPayloadPrefillQuery". Create the following string-pattern: ("URL_1", "URL_2", ...)
int urlsToRetrieveRelatedIDsSize = urlsToRetrieveRelatedIDs.size();
int stringBuilderCapacity = (urlsToRetrieveRelatedIDsSize * 100);
// Get the id and url of any backlog-record whose url matches one of the urls gathered above.
String getDataForPayloadPrefillQuery = "select distinct pu.id, pu.url\n" +
"from " + DatabaseConnector.databaseName + ".publication_urls pu\n" +
// Exclude the "already-processed" pairs.
"left anti join " + DatabaseConnector.databaseName + ".attempt a on a.id=pu.id and a.original_url=pu.url\n" +
"left anti join " + DatabaseConnector.databaseName + ".payload p on p.id=pu.id and p.original_url=pu.url\n" +
"left anti join " + DatabaseConnector.databaseName + ".assignment asgn on asgn.id=pu.id and asgn.original_url=pu.url\n" +
// Limit the urls to the ones matching to the payload-urls found for the current assignments.
"where pu.url in " + getQueryListString(urlsToRetrieveRelatedIDs, urlsToRetrieveRelatedIDsSize, stringBuilderCapacity);
//logger.trace("getDataForPayloadPrefillQuery:\n" + getDataForPayloadPrefillQuery);
final List<Payload> prefilledPayloads = new ArrayList<>(1000);
try {
jdbcTemplate.query(getDataForPayloadPrefillQuery, rs -> {
String id;
String original_url;
try { // For each of the 2 columns returned, do the following. The column-indexing starts from 1.
id = rs.getString(1);
original_url = rs.getString(2);
} catch (SQLException sqle) {
logger.error("No value was able to be retrieved from one of the columns of row_" + rs.getRow(), sqle);
return; // Move to the next payload.
}
Set<Payload> foundPayloads = urlToPayloadsMultimap.get(original_url);
if ( foundPayloads == null ) {
logger.error("No payloads associated with \"original_url\" = \"" + original_url + "\"!");
return;
}
// Select a random "foundPayload" to use its data to fill the "prefilledPayload" (in a "Set" the first element is pseudo-random).
Optional<Payload> optPayload = foundPayloads.stream().findFirst();
if ( !optPayload.isPresent() ) {
logger.error("Could not retrieve any payload for the \"original_url\": " + original_url);
return; // Move to the next payload.
}
Payload prefilledPayload = optPayload.get().clone(); // We take a clone of the original payload and change just the id and the original_url.
prefilledPayload.setId(id);
prefilledPayload.setOriginal_url(original_url);
prefilledPayloads.add(prefilledPayload);
});
} catch (EmptyResultDataAccessException erdae) {
logger.error("No results retrieved from the \"getDataForPayloadPrefillQuery\", when trying to prefill payloads, from assignment_" + assignmentsBatchCounter + ".");
return false;
} catch (Exception e) {
DatabaseConnector.handleQueryException("getDataForPayloadPrefillQuery", getDataForPayloadPrefillQuery, e);
return false;
}
int numPrefilledPayloads = prefilledPayloads.size();
if ( numPrefilledPayloads == 0 ) {
logger.error("Some results were retrieved from the \"getDataForPayloadPrefillQuery\", but no data could be extracted from them, when trying to prefill payloads, from assignment_" + assignmentsBatchCounter + ".");
return false;
}
logger.debug("numPrefilledPayloads: " + numPrefilledPayloads + " | assignmentsBatchCounter: " + assignmentsBatchCounter);
// Add the prefilled payloads to the UrlReports.
final Error noError = new Error(null, null);
for ( Payload prefilledPayload : prefilledPayloads )
{
urlReports.add(new UrlReport(UrlReport.StatusType.accessible, prefilledPayload, noError));
}
logger.debug("Final number of UrlReports is " + urlReports.size() + " | assignmentsBatchCounter: " + assignmentsBatchCounter);
// In order to avoid assigning these "prefill" records to workers before they are inserted in the attempt and payload tables,
// we have to make sure this method is called inside a "DB-locked" code block and that the "DB-unlock" happens only after all records are loaded into the DB-tables.
return true;
}
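A rough sketch of the calling pattern which the javadoc above implies ("DatabaseConnector.databaseLock" exists elsewhere in this codebase; the rest of the caller is assumed, not shown in this diff):
    DatabaseConnector.databaseLock.lock();
    try {
        // 1) Prefill extra UrlReports from the backlog (cloned payloads with a new "id" and "original_url").
        fileUtils.addUrlReportsByMatchingRecordsFromBacklog(urlReports, initialPayloads, initialPayloads.size(), assignmentsBatchCounter);
        // 2) While still holding the lock, insert the records for ALL UrlReports (including the prefilled ones),
        //    so that the prefilled records cannot be re-assigned to workers in the meantime.
    } finally {
        DatabaseConnector.databaseLock.unlock();
    }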
/**
* Set the fileLocation for all those Payloads related to the File.
* @param filePayloads
* @param s3Url
*/
public void setFullTextForMultiplePayloads(Set<Payload> filePayloads, String s3Url) {
public void setFullTextForMultiplePayloads(@NotNull Set<Payload> filePayloads, String s3Url) {
for ( Payload payload : filePayloads )
if ( payload != null )
payload.setLocation(s3Url); // Update the file-location to the new S3-url. All the other file-data is already set from the Worker.
@ -696,14 +809,17 @@ public class FileUtils {
}
private static final Lock fileWriteLock = new ReentrantLock(true);
public static final Lock fileAccessLock = new ReentrantLock(true);
public String writeToFile(String fileFullPath, String stringToWrite, boolean shouldLockThreads)
{
if ( shouldLockThreads ) // In case multiple threads write to the same file, e.g. during the bulk-import procedure.
fileWriteLock.lock();
fileAccessLock.lock();
try ( BufferedWriter bufferedWriter = new BufferedWriter(Files.newBufferedWriter(Paths.get(fileFullPath)), FileUtils.tenMb) )
// TODO - Make this method synchronize per specific file, not globally.
// TODO - Right now, multiple bulkImport procedures (with different directories) are blocked while writing to DIFFERENT files.
try ( BufferedWriter bufferedWriter = new BufferedWriter(Files.newBufferedWriter(Paths.get(fileFullPath)), halfMb) )
{
bufferedWriter.write(stringToWrite); // This will overwrite the file. If the new string is smaller, then it does not matter.
} catch (Exception e) {
@ -712,9 +828,22 @@ public class FileUtils {
return errorMsg;
} finally {
if ( shouldLockThreads )
fileWriteLock.unlock();
fileAccessLock.unlock();
}
return null;
}
}
public static String getQueryListString(List<String> list, int fileHashesListSize, int stringBuilderCapacity) {
StringBuilder sb = new StringBuilder(stringBuilderCapacity);
sb.append("(");
for ( int i=0; i < fileHashesListSize; ++i ) {
sb.append("\"").append(list.get(i)).append("\"");
if ( i < (fileHashesListSize -1) )
sb.append(", ");
}
sb.append(")");
return sb.toString();
}
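An illustrative use of "getQueryListString()" (made-up urls), showing the string it builds for the SQL "in"-clause above:
    List<String> urls = Arrays.asList("https://example.org/a", "https://example.org/b");
    String inClause = FileUtils.getQueryListString(urls, urls.size(), (urls.size() * 100));
    // inClause is now: ("https://example.org/a", "https://example.org/b")  -- each element wrapped in double-quotes --
    // so it can be appended right after a "... where pu.url in " query-fragment, as done further above.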
}

View File

@ -1,13 +1,23 @@
package eu.openaire.urls_controller.util;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import javax.servlet.http.HttpServletRequest;
import java.text.SimpleDateFormat;
import java.util.Date;
public class GenericUtils {
private static final Logger logger = LoggerFactory.getLogger(GenericUtils.class);
public static final String endOfLine = "\n";
public static final String tab = "\t";
private static final SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS z");
public static String getReadableCurrentTimeAndZone() {
return (simpleDateFormat.format(new Date(System.currentTimeMillis())));
}
@ -18,15 +28,30 @@ public class GenericUtils {
}
public static String getSelectedStackTraceForCausedException(Throwable thr, String firstMessage, String additionalMessage, int numOfLines)
{
// The stacktrace of the "ExecutionException" belongs to the current code, not to the code which ran inside the background-task. So, try to get the cause.
Throwable causedThrowable = thr.getCause();
if ( causedThrowable == null ) {
logger.warn("No cause was retrieved for the \"ExecutionException\"!");
causedThrowable = thr;
}
String initialMessage = firstMessage + causedThrowable.getMessage() + ((additionalMessage != null) ? additionalMessage : "");
return getSelectiveStackTrace(causedThrowable, initialMessage, numOfLines);
}
public static String getSelectiveStackTrace(Throwable thr, String initialMessage, int numOfLines)
{
StackTraceElement[] stels = thr.getStackTrace();
StringBuilder sb = new StringBuilder(numOfLines *100);
StringBuilder sb = new StringBuilder(numOfLines *100); // This StringBuilder is thread-safe as a local-variable.
if ( initialMessage != null )
sb.append(initialMessage).append(" Stacktrace:").append("\n"); // This StringBuilder is thread-safe as a local-variable.
for ( int i = 0; (i < stels.length) && (i <= numOfLines); ++i ) {
sb.append(stels[i]);
if (i < numOfLines) sb.append("\n");
sb.append(initialMessage).append(endOfLine);
sb.append("Stacktrace:").append(endOfLine);
for ( int i = 0; (i < stels.length) && (i < numOfLines); ++i ) {
sb.append(tab).append(stels[i]);
if (i < (numOfLines -1)) sb.append(endOfLine);
}
return sb.toString();
}
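A hypothetical usage of "getSelectiveStackTrace()" (the caught exception and its handling are assumed); it returns the initial message followed by up to N tab-indented stacktrace-lines:
    try {
        riskyOperation(); // Hypothetical method which may throw.
    } catch (Exception e) {
        logger.error(GenericUtils.getSelectiveStackTrace(e, "The operation failed: " + e.getMessage(), 5));
    }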

View File

@ -23,6 +23,7 @@ import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.core.io.ClassPathResource;
import org.springframework.dao.EmptyResultDataAccessException;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.jdbc.core.JdbcTemplate;
@ -43,7 +44,10 @@ import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
@Component
public class ParquetFileUtils {
@ -66,10 +70,10 @@ public class ParquetFileUtils {
private final String hdfsUserName;
public static final String payloadSchemaFilePath = "schemas/payload.avsc";
private static final String attemptSchemaFilePath = "schemas/attempt.avsc";
public static final String payloadSchemaFilePath = "schemas/payload.avsc";
public static Schema payloadsSchema = null;
public Schema attemptsSchema;
@ -153,58 +157,89 @@ public class ParquetFileUtils {
}
/**
* The tasks created below must be executed inside a "database-lock" block.
* */
public List<Callable<ParquetReport>> getTasksForCreatingAndUploadingParquetFiles(List<UrlReport> urlReports, int sizeOfUrlReports, long curReportAssignments, String currentParquetPath, FileUtils.UploadFullTextsResponse uploadFullTextsResponse)
{
// Split the "UrlReports" into some sub-lists.
List<List<UrlReport>> subLists;
// Pre-define the tasks to run.
List<Callable<ParquetReport>> callableTasks = new ArrayList<>(6);
// One thread will handle the inserts to the "payload" table and the others to the "attempt" table. This way there will be as little blocking as possible (from the part of Impala).
int sizeOfEachSubList = (int)(sizeOfUrlReports * 0.2);
if ( sizeOfEachSubList > 10 )
{
subLists = Lists.partition(urlReports, sizeOfEachSubList); // This needs the "sizeOfEachSubList" to be above < 0 >.
// The above will create some sub-lists, each one containing 20% of total amount.
int subListsSize = subLists.size();
for ( int i = 0; i < subListsSize; ++i ) {
int finalI = i;
callableTasks.add(() -> { // Handle inserts to the "attempt" table. Insert 20% of the "attempt" queries.
return new ParquetReport(ParquetReport.ParquetType.attempt, createAndLoadParquetDataIntoAttemptTable(finalI, subLists.get(finalI), curReportAssignments, currentParquetPath));
});
}
} else {
// If the "urlReports" are so few, that we cannot get big "sublists", assign a single task to handle all the attempts.
callableTasks.add(() -> { // Handle inserts to the "attempt" table. Insert 20% of the "attempt" queries.
return new ParquetReport(ParquetReport.ParquetType.attempt, createAndLoadParquetDataIntoAttemptTable(0, urlReports, curReportAssignments, currentParquetPath));
});
}
if ( uploadFullTextsResponse == FileUtils.UploadFullTextsResponse.successful )
{
// At this point we know there was no problem with the full-texts, but we do not know if at least one full-text was retrieved.
if ( (payloadsSchema == null) // Parse the schema if it's not already parsed.
&& ((payloadsSchema = parseSchema(payloadSchemaFilePath)) == null ) ) {
logger.error("Nothing can be done without the payloadsSchema! Exiting.."); // The cause is already logged inside the above method.
System.exit(88); // Exit the whole app, as it cannot add the results to the database!
}
callableTasks.add(() -> { // Handle inserts to the "payload" table. Around 20% of the total amount.
return new ParquetReport(ParquetReport.ParquetType.payload, createAndLoadParquetDataIntoPayloadTable(urlReports, curReportAssignments, currentParquetPath, (parquetHDFSDirectoryPathPayloadsAggregated + curReportAssignments + "/")));
});
}
if ( (attemptsSchema == null) // Parse the schema if it's not already parsed.
&& ((attemptsSchema = parseSchema(attemptSchemaFilePath)) == null ) ) {
logger.error("Nothing can be done without the attemptsSchema! Exiting.."); // The cause is already logged inside the above method.
System.exit(89); // Exit the whole app, as it cannot add the results to the database!
System.exit(88); // Exit the whole app, as it cannot add the results to the database!
}
// Pre-define the tasks to run in multiple threads.
final List<Callable<ParquetReport>> callableTasks = new ArrayList<>(8); // 5 threads will handle the attempts and 3 the payloads.
if ( uploadFullTextsResponse == FileUtils.UploadFullTextsResponse.successful )
{
List<Payload> initialPayloads = urlReports.parallelStream()
.map(UrlReport::getPayload).filter(payload -> ((payload != null) && (payload.getLocation() != null)))
.collect(Collectors.toList());
int numInitialPayloads = initialPayloads.size();
if ( numInitialPayloads > 0 ) // If at least 1 payload was created by the processed records..
{ // (it's ok to have no payloads, if there were no full-texts available)
// At this point we know there was no problem with the full-texts, but we do not know if at least one full-text was retrieved FROM the worker.
if ( (payloadsSchema == null) // Parse the schema if it's not already parsed.
&& ((payloadsSchema = parseSchema(payloadSchemaFilePath)) == null ) ) {
logger.error("Nothing can be done without the payloadsSchema! Exiting.."); // The cause is already logged inside the above method.
System.exit(89); // Exit the whole app, as it cannot add the results to the database!
}
// The UrlReports for the pre-filled payloads are added only here, since we do not want attempt records for these.
fileUtils.addUrlReportsByMatchingRecordsFromBacklog(urlReports, initialPayloads, numInitialPayloads, curReportAssignments); // This adds more objects to the "urlReports" list.
// In case the above method returns an error, nothing breaks; we just have only the initial payloads to insert into the DB.
int sizeOfEachSubList = (int)(sizeOfUrlReports * 0.33); // We want 3 sub-lists for the payloads.
if ( sizeOfEachSubList == 0 ) // If the "sizeOfUrlReports" is <= 3.
sizeOfEachSubList = 1;
// There may be one more sub-list with very few elements, due to the inexact splitting. Unfortunately, we cannot set the number of splits, only the size of most splits.
if ( sizeOfEachSubList > 10 ) {
List<List<UrlReport>> finalSubLists = Lists.partition(urlReports, sizeOfEachSubList); // This needs the "sizeOfEachSubList" to be greater than 0.
int numSubListsPayload = finalSubLists.size();
// We will run <numSubListsPayload> tasks for the payloads.
// Since the payloads are not evenly distributed in the urlReports, there may be some sub-lists of urlReports without any payloads inside.
for ( int i = 0; i < numSubListsPayload; ++i ) {
int finalI = i;
callableTasks.add(() -> { // Handle inserts to the "payload" table. Each task covers roughly a third of the total amount.
return new ParquetReport(ParquetReport.ParquetType.payload, createPayloadParquetDataAndUploadToHDFS(finalI, finalSubLists.get(finalI), curReportAssignments, currentParquetPath, (parquetHDFSDirectoryPathPayloadsAggregated + curReportAssignments + "/")));
});
}
} else {
// If the "urlReports" are so few, that we cannot get big "sublists", assign a single task to handle all the payload (sizeOfEachSubList * 5).
callableTasks.add(() -> { // Handle inserts to the "payload" table. Around 20% of the total amount.
return new ParquetReport(ParquetReport.ParquetType.payload, createPayloadParquetDataAndUploadToHDFS(0, urlReports, curReportAssignments, currentParquetPath, (parquetHDFSDirectoryPathPayloadsAggregated + curReportAssignments + "/")));
});
}
}
}
int sizeOfEachSubList = (int)(sizeOfUrlReports * 0.2); // We want 5 sub-lists for the attempts.
// There may be one more sub-list with very few elements, due to the inexact splitting. Unfortunately, we cannot set the number of splits, only the size of most splits.
if ( sizeOfEachSubList > 10 ) {
List<List<UrlReport>> finalSubLists = Lists.partition(urlReports, sizeOfEachSubList); // This needs the "sizeOfEachSubList" to be greater than 0.
// The above will create some sub-lists, each one containing 20% of the total amount.
int numSubListsAttempt = finalSubLists.size();
// We will run <numSubListsAttempt> tasks for the attempts.
for ( int i = 0; i < numSubListsAttempt; ++i ) {
int finalI = i;
callableTasks.add(() -> { // Handle inserts to the "attempt" table. Insert 20% of the "attempt" queries.
return new ParquetReport(ParquetReport.ParquetType.attempt, createAttemptParquetDataAndUploadToHDFS(finalI, finalSubLists.get(finalI), curReportAssignments, currentParquetPath));
});
}
} else {
// If the "urlReports" are so few, that we cannot get big "sublists", assign a single task to handle all the attempts (sizeOfEachSubList * 5).
callableTasks.add(() -> { // Handle inserts to the "attempt" table.
return new ParquetReport(ParquetReport.ParquetType.attempt, createAttemptParquetDataAndUploadToHDFS(0, urlReports, curReportAssignments, currentParquetPath));
});
}
return callableTasks;
}
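A rough sketch (assumed caller, not shown in this diff) of how the returned tasks might be executed inside the required "database-lock" block; "insertsExecutor" and the futures-checking step are placeholders for the corresponding members of this codebase:
    DatabaseConnector.databaseLock.lock();
    try {
        List<Callable<ParquetReport>> tasks = parquetFileUtils.getTasksForCreatingAndUploadingParquetFiles(
                urlReports, urlReports.size(), curReportAssignments, currentParquetPath, uploadFullTextsResponse);
        List<Future<ParquetReport>> futures = insertsExecutor.invokeAll(tasks); // Blocks until all tasks finish.
        // ... then check the futures (as in the method further below), "load data inpath" the parquet files and merge the tables ...
    } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
    } finally {
        DatabaseConnector.databaseLock.unlock();
    }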
public boolean createAndLoadParquetDataIntoAttemptTable(int attemptsIncNum, List<UrlReport> urlReports, long curReportAssignmentsCounter, String localParquetPath)
public boolean createAttemptParquetDataAndUploadToHDFS(int attemptsIncNum, List<UrlReport> urlReports, long curReportAssignmentsCounter, String localParquetPath)
{
List<GenericData.Record> recordList = new ArrayList<>(urlReports.size());
GenericData.Record record;
@ -212,7 +247,7 @@ public class ParquetFileUtils {
for ( UrlReport urlReport : urlReports ) {
Payload payload = urlReport.getPayload();
if ( payload == null ) {
logger.warn("Payload was \"null\" for a \"urlReport\", in assignments_" + curReportAssignmentsCounter + "\n" + urlReport);
logger.warn("Payload was \"null\" for a \"urlReport\", in assignments_" + curReportAssignmentsCounter + GenericUtils.endOfLine + urlReport);
continue;
}
@ -239,7 +274,7 @@ public class ParquetFileUtils {
int recordsSize = recordList.size();
if ( recordsSize == 0 ) {
logger.warn("No attempts are available to be inserted to the database!");
logger.error("No attempt-parquet-records could be created in order to be inserted to the database!");
return false;
}
@ -261,16 +296,17 @@ public class ParquetFileUtils {
}
public boolean createAndLoadParquetDataIntoPayloadTable(List<UrlReport> urlReports, long curReportAssignmentsCounter, String localParquetPath, String parquetHDFSDirectoryPathPayloads)
public boolean createPayloadParquetDataAndUploadToHDFS(int payloadsCounter, List<UrlReport> urlReports, long curReportAssignmentsCounter, String localParquetPath, String parquetHDFSDirectoryPathPayloads)
{
List<GenericData.Record> recordList = new ArrayList<>((int) (urlReports.size() * 0.2));
GenericData.Record record;
int numPayloadsInsideUrlReports = 0;
for ( UrlReport urlReport : urlReports )
{
Payload payload = urlReport.getPayload();
if ( payload == null ) {
logger.warn("Payload was \"null\" for a \"urlReport\", in assignments_" + curReportAssignmentsCounter + "\n" + urlReport);
logger.warn("Payload was \"null\" for a \"urlReport\", in assignments_" + curReportAssignmentsCounter + GenericUtils.endOfLine + urlReport);
continue;
}
@ -278,32 +314,27 @@ public class ParquetFileUtils {
if ( fileLocation == null ) // We want only the records with uploaded full-texts in the "payload" table.
continue;
try {
record = new GenericData.Record(payloadsSchema);
record.put("id", payload.getId());
record.put("original_url", payload.getOriginal_url());
record.put("actual_url", payload.getActual_url());
Timestamp timestamp = payload.getTimestamp_acquired();
record.put("date", (timestamp != null) ? timestamp.getTime() : System.currentTimeMillis());
record.put("mimetype", payload.getMime_type());
Long size = payload.getSize();
record.put("size", ((size != null) ? String.valueOf(size) : null));
record.put("hash", payload.getHash());
record.put("location", fileLocation);
record.put("provenance", payload.getProvenance());
numPayloadsInsideUrlReports ++;
Timestamp timestamp = payload.getTimestamp_acquired();
record = getPayloadParquetRecord(payload.getId(), payload.getOriginal_url(), payload.getActual_url(),
(timestamp != null) ? timestamp.getTime() : System.currentTimeMillis(),
payload.getMime_type(), payload.getSize(), payload.getHash(), fileLocation, payload.getProvenance(), false);
if ( record != null )
recordList.add(record);
} catch (Exception e) {
logger.error("Failed to create a payload record!", e);
}
}
if ( numPayloadsInsideUrlReports == 0 )
return true; // This urlsReports-sublist does not have any payloads inside to use. That's fine.
int recordsSize = recordList.size();
if ( recordsSize == 0 ) {
logger.warn("No payloads are available to be inserted to the database!");
logger.error("No payload-parquet-records could be created in order to be inserted to the database!");
return false;
}
String fileName = curReportAssignmentsCounter + "_payloads.parquet";
String fileName = curReportAssignmentsCounter + "_payloads_" + payloadsCounter + ".parquet";
//logger.trace("Going to write " + recordsSize + " payload-records to the parquet file: " + fileName); // DEBUG!
String fullFilePath = localParquetPath + fileName;
@ -321,6 +352,30 @@ public class ParquetFileUtils {
}
public GenericData.Record getPayloadParquetRecord(String id, String original_url, String actual_url, long timeMillis, String mimetype, Long size,
String hash, String fileLocation, String provenance, boolean isForBulkImport)
{
GenericData.Record record;
try {
record = new GenericData.Record(payloadsSchema);
record.put("id", id);
record.put("original_url", original_url);
record.put("actual_url", actual_url);
record.put("date", timeMillis);
record.put("mimetype", mimetype);
record.put("size", ((size != null) ? String.valueOf(size) : null));
record.put("hash", hash);
record.put("location", fileLocation);
record.put("provenance", (isForBulkImport ? "bulk:" : "") + provenance);
// Add the "bulk:" prefix in order to be more clear that this record comes from bulkImport, when looking all records in the "payload" VIEW.
return record;
} catch (Exception e) {
logger.error("Failed to create a payload record!", e);
return null;
}
}
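For illustration (all values are made up), a bulk-import record created through "getPayloadParquetRecord()" gets its provenance prefixed with "bulk:":
    GenericData.Record record = parquetFileUtils.getPayloadParquetRecord(
            "doi_________::abc123",                 // id
            "https://arxiv.org/abs/2403.00001",     // original_url
            "https://arxiv.org/pdf/2403.00001",     // actual_url
            System.currentTimeMillis(),             // timeMillis
            "application/pdf",                      // mimetype
            123_456L,                               // size (stored as a String in the record)
            "d41d8cd98f00b204e9800998ecf8427e",     // hash
            "https://s3.example.org/bucket/arXiv_______/doi_________::abc123::d41d8cd98f00b204e9800998ecf8427e.pdf", // fileLocation
            "arXiv_______",                         // provenance -> stored as "bulk:arXiv_______"
            true);                                  // isForBulkImport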
public boolean writeToParquet(List<GenericData.Record> recordList, Schema schema, String fullFilePath)
{
OutputFile outputFile;
@ -346,7 +401,7 @@ public class ParquetFileUtils {
} catch (Throwable e) { // The simple "Exception" may not be thrown here, but an "Error" may be thrown. "Throwable" catches EVERYTHING!
String errorMsg = "Problem when creating the \"ParquetWriter\" object or when writing the records with it!";
if ( e instanceof org.apache.hadoop.fs.FileAlreadyExistsException )
logger.error(errorMsg + "\n" + e.getMessage());
logger.error(errorMsg + GenericUtils.endOfLine + e.getMessage());
else
logger.error(errorMsg, e);
@ -406,7 +461,9 @@ public class ParquetFileUtils {
}
//logger.trace("The target location is: " + location + "\nWill do a silent redirect to HTTPS."); // DEBUG!
location = StringUtils.replace(location, "http:", "https:", 1);
// In case the WebHDFS uses https, perform the offline redirect to https here (to avoid the live redirection).
if ( webHDFSBaseUrl.startsWith("https:") )
location = StringUtils.replace(location, "http:", "https:", 1);
// Unless we handle this here, we either have to complicate the process by handling the https-redirect ourselves, or take a performance hit by having one more redirect-step each time we want to upload a file.
conn = (HttpURLConnection) (new URL(location)).openConnection(); // This already contains the "user.name" parameter.
@ -445,9 +502,9 @@ public class ParquetFileUtils {
// Using the "load data inpath" command, he files are MOVED, not copied! So we don't have to delete them afterwards.
// See: https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_load_data.html
} catch (Throwable e) {
String errorMsg = "Error while uploading parquet file \"" + parquetFileFullLocalPath + "\" to HDFS!\n" + e;
logger.error(errorMsg);
return errorMsg;
String errorMsg = "Error while uploading parquet file \"" + parquetFileFullLocalPath + "\" to HDFS!\n";
logger.error(errorMsg, e);
return errorMsg + e.getMessage();
}
// The local parquet file will be deleted later.
@ -513,9 +570,9 @@ public class ParquetFileUtils {
if ( statusCode == 404 ) {
logger.info(initErrMsg + "\"" + parquetBaseRemoteDirectory + "\" does not exist. We will create it, along with its sub-directories.");
attemptCreationSuccessful = applyHDFOperation(webHDFSBaseUrl + parquetHDFSDirectoryPathAttempts + mkDirsAndParams);
payloadAggregatedCreationSuccessful = applyHDFOperation(webHDFSBaseUrl + parquetHDFSDirectoryPathPayloadsAggregated + mkDirsAndParams);
payloadBulkImportCreationSuccessful = applyHDFOperation(webHDFSBaseUrl + parquetHDFSDirectoryPathPayloadsBulkImport + mkDirsAndParams);
attemptCreationSuccessful = applyHDFSOperation(webHDFSBaseUrl + parquetHDFSDirectoryPathAttempts + mkDirsAndParams);
payloadAggregatedCreationSuccessful = applyHDFSOperation(webHDFSBaseUrl + parquetHDFSDirectoryPathPayloadsAggregated + mkDirsAndParams);
payloadBulkImportCreationSuccessful = applyHDFSOperation(webHDFSBaseUrl + parquetHDFSDirectoryPathPayloadsBulkImport + mkDirsAndParams);
return (attemptCreationSuccessful && payloadAggregatedCreationSuccessful && payloadBulkImportCreationSuccessful);
// We need all directories to be created in order for the app to function properly!
}
@ -558,7 +615,7 @@ public class ParquetFileUtils {
foundPayloadsAggregatedDir = true;
else if ( dirPath.equals("payloads_bulk_import") )
foundPayloadsBulkImportDir = true;
else
else if ( ! dirPath.equals("test") ) // The "test" directory helps with testing the service, without interfering with the production directories.
logger.warn("Unknown remote parquet HDFS-directory found: " + dirPath);
}
} catch (JSONException je) { // In case any of the above "json-keys" was not found.
@ -572,19 +629,19 @@ public class ParquetFileUtils {
// For each missing subdirectories, run the mkDirs-request.
if ( !foundAttemptsDir ) {
logger.debug(initErrMsg + "\"" + parquetHDFSDirectoryPathAttempts + "\" does not exist! Going to create it.");
attemptCreationSuccessful = applyHDFOperation(webHDFSBaseUrl + parquetHDFSDirectoryPathAttempts + mkDirsAndParams);
attemptCreationSuccessful = applyHDFSOperation(webHDFSBaseUrl + parquetHDFSDirectoryPathAttempts + mkDirsAndParams);
} else
logger.info(initErrMsg + "\"" + parquetHDFSDirectoryPathAttempts + "\" exists.");
if ( !foundPayloadsAggregatedDir ) {
logger.debug(initErrMsg + "\"" + parquetHDFSDirectoryPathPayloadsAggregated + "\" does not exist! Going to create it.");
payloadAggregatedCreationSuccessful = applyHDFOperation(webHDFSBaseUrl + parquetHDFSDirectoryPathPayloadsAggregated + mkDirsAndParams);
payloadAggregatedCreationSuccessful = applyHDFSOperation(webHDFSBaseUrl + parquetHDFSDirectoryPathPayloadsAggregated + mkDirsAndParams);
} else
logger.info(initErrMsg + "\"" + parquetHDFSDirectoryPathPayloadsAggregated + "\" exists.");
if ( !foundPayloadsBulkImportDir ) {
logger.debug(initErrMsg + "\"" + parquetHDFSDirectoryPathPayloadsBulkImport + "\" does not exist! Going to create it.");
payloadBulkImportCreationSuccessful = applyHDFOperation(webHDFSBaseUrl + parquetHDFSDirectoryPathPayloadsBulkImport + mkDirsAndParams);
payloadBulkImportCreationSuccessful = applyHDFSOperation(webHDFSBaseUrl + parquetHDFSDirectoryPathPayloadsBulkImport + mkDirsAndParams);
} else
logger.info(initErrMsg + "\"" + parquetHDFSDirectoryPathPayloadsBulkImport + "\" exists.");
@ -597,7 +654,7 @@ public class ParquetFileUtils {
}
public boolean applyHDFOperation(String hdfsOperationUrl)
public boolean applyHDFSOperation(String hdfsOperationUrl)
{
try {
URL url = new URL(hdfsOperationUrl);
@ -617,7 +674,7 @@ public class ParquetFileUtils {
return false;
}
if ( logger.isTraceEnabled() )
logger.trace("The Operation was successful for hdfs-op-url: " + hdfsOperationUrl + "\n" + fileUtils.getMessageFromResponseBody(conn, false));
logger.trace("The Operation was successful for hdfs-op-url: " + hdfsOperationUrl + GenericUtils.endOfLine + fileUtils.getMessageFromResponseBody(conn, false));
} catch (Exception e) {
logger.error("", e);
return false;
@ -634,11 +691,11 @@ public class ParquetFileUtils {
int numOfAllPayloadParquetFileCreations = 0;
int numOfFailedPayloadParquetFileCreations = 0;
for ( Future<ParquetReport> future : futures )
int sizeOfFutures = futures.size();
for ( int i = 0; i < sizeOfFutures; ++i )
{
ParquetReport parquetReport = null;
try {
parquetReport = future.get();
ParquetReport parquetReport = futures.get(i).get();
boolean hasProblems = (! parquetReport.isSuccessful());
ParquetReport.ParquetType parquetType = parquetReport.getParquetType();
if ( parquetType.equals(ParquetReport.ParquetType.attempt) ) {
@ -654,11 +711,17 @@ public class ParquetFileUtils {
logger.error(errMsg);
return new SumParquetSuccess(false, false, ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(errMsg));
}
} catch (Exception e) {
logger.error("", e);
// We do not know if the failed "future" refers to a "payload" or to a "attempt".
// So we cannot increase a specific counter. That's ok, the only drawback if that we may try to "load" the non-existent data and get an exception.
} catch (ExecutionException ee) { // These can be serious errors like an "out of memory exception" (Java HEAP).
logger.error(GenericUtils.getSelectedStackTraceForCausedException(ee, "Parquet_task_" + i + " failed with: ", null, 15));
} catch (CancellationException ce) {
logger.error("Parquet_task_" + i + " was cancelled: " + ce.getMessage());
} catch (InterruptedException ie) {
logger.error("Parquet_task_" + i + " was interrupted: " + ie.getMessage());
} catch (IndexOutOfBoundsException ioobe) {
logger.error("IOOBE for parquet_task_" + i + " in the futures-list! " + ioobe.getMessage());
}
// We do not know if the failed "future" refers to a "payload" or to a "attempt".
// So we cannot increase a specific counter. That's ok, the only drawback if that we may try to "load" the non-existent data and get an exception.
} // End-for
boolean hasAttemptParquetFileProblem = (numOfFailedAttemptParquetFileCreations == numOfAllAttemptParquetFileCreations);
@ -667,4 +730,110 @@ public class ParquetFileUtils {
return new SumParquetSuccess(hasAttemptParquetFileProblem, hasPayloadParquetFileProblem, null);
}
/**
* In each insertion, a new parquet-file is created, so we end up with millions of files. Parquet is great for fast selects, so we have to stick with it and merge those files.
* This method creates a clone of the original table, in order to end up with a single parquet file, drops the original table,
* and renames the clone to the original's name.
* It returns the errorMsg if an error appears; otherwise it returns "null".
*/
public String mergeParquetFilesOfTable(String tableName, String whereClause, Object parameter) {
String errorMsg;
if ( (tableName == null) || tableName.isEmpty() ) {
errorMsg = "No tableName was given. Do not know the tableName for which we should merger the underlying files for!";
logger.error(errorMsg);
return errorMsg; // Return the error-msg to indicate that something went wrong and pass it down to the Worker.
}
// Make sure the following are safe to append to the query (never null).
whereClause = (whereClause != null) ? (whereClause + " ") : "";
if ( parameter == null )
parameter = "";
else if ( parameter instanceof String )
parameter = "'" + parameter + "'"; // This will be a "string-check", thus the single-quotes.
// Else it is a "long", it will be used as is.
// Create a temp-table as a copy of the initial table.
try {
jdbcTemplate.execute("CREATE TABLE " + DatabaseConnector.databaseName + "." + tableName + "_tmp stored as parquet AS SELECT * FROM " + DatabaseConnector.databaseName + "." + tableName + " " + whereClause + parameter);
} catch (Exception e) {
errorMsg = "Problem when copying the contents of \"" + tableName + "\" table to a newly created \"" + tableName + "_tmp\" table, when merging the parquet-files!\n";
logger.error(errorMsg, e);
try { // Make sure we delete the possibly half-created temp-table.
jdbcTemplate.execute("DROP TABLE IF EXISTS " + DatabaseConnector.databaseName + "." + tableName + "_tmp PURGE");
// We cannot move on with merging, but no harm was done, since the "table_tmp" name is available again for future use (it was dropped immediately).
} catch (Exception e1) {
logger.error("Failed to drop the \"" + tableName + "_tmp\" table!", e1);
// TODO - Should we shutdown the service? It is highly unlikely that anyone will observe this error live (and act immediately)..!
}
return errorMsg; // We return only the initial error to the Worker, which is easily distinguished inside the "merge-queries".
}
// Drop the initial table.
try {
jdbcTemplate.execute("DROP TABLE " + DatabaseConnector.databaseName + "." + tableName + " PURGE");
} catch (Exception e) {
errorMsg = "Problem when dropping the initial \"" + tableName + "\" table, when merging the parquet-files!\n";
logger.error(errorMsg, e);
// The original table could not be dropped, so the temp-table cannot be renamed to the original..!
try { // Make sure we delete the already created temp-table, in order to be able to use it in the future. The merging has failed nevertheless.
jdbcTemplate.execute("DROP TABLE " + DatabaseConnector.databaseName + "." + tableName + "_tmp PURGE");
} catch (Exception e1) {
logger.error((errorMsg += "Failed to drop the \"" + tableName + "_tmp\" table!"), e1); // Add this error to the original, both are very important.
}
// Here, the original table still exists (it could not be dropped).
return errorMsg;
// TODO - Should we shutdown the service? It is highly unlikely that anyone will observe this error live (and act immediately)..!
}
// Rename the temp-table to have the initial-table's name.
try {
jdbcTemplate.execute("ALTER TABLE " + DatabaseConnector.databaseName + "." + tableName + "_tmp RENAME TO " + DatabaseConnector.databaseName + "." + tableName);
} catch (Exception e) {
errorMsg = "Problem in renaming the \"" + tableName + "_tmp\" table to \"" + tableName + "\", when merging the parquet-files!\n";
logger.error(errorMsg, e);
// At this point we only have a "temp-table", the original is already deleted..
// Try to create the original, as a copy of the temp-table. If that succeeds, then try to delete the temp-table.
try {
jdbcTemplate.execute("CREATE TABLE " + DatabaseConnector.databaseName + "." + tableName + " stored as parquet AS SELECT * FROM " + DatabaseConnector.databaseName + "." + tableName + "_tmp");
} catch (Exception e1) {
errorMsg = "Problem when copying the contents of \"" + tableName + "_tmp\" table to a newly created \"" + tableName + "\" table, when merging the parquet-files!\n";
logger.error(errorMsg, e1);
// If the original table was not created, then we have to intervene manually. If it was created but without any data, then we can safely move on to handling other assignments and workerReports, but this report's data will be lost! So this workerReport failed to be handled.
try { // The below query normally returns a list, as it takes a "regex-pattern" as an input. BUT, we give just the table name, without wildcards. So the result is either the tableName itself or none (not any other table).
jdbcTemplate.queryForObject("SHOW TABLES IN " + DatabaseConnector.databaseName + " LIKE '" + tableName + "'", List.class);
} catch (EmptyResultDataAccessException erdae) {
// The table does not exist, so it was not even half-created by the previous query.
// Not having the original table anymore is a serious error. A manual action is needed!
logger.error((errorMsg += "The original table \"" + tableName + "\" must be created manually! Serious problems may appear otherwise!"));
// TODO - Should we shutdown the service? It is highly unlikely that anyone will observe this error live (and fix it immediately to avoid other errors in the Service)..!
}
// Here, the original-table exists in the DB, BUT without any data inside! This workerReport failed to be handled! (some of its data could not be loaded to the database, and all previous data was lost).
return errorMsg;
}
// The creation of the original table was successful. Try to delete the temp-table.
try {
jdbcTemplate.execute("DROP TABLE " + DatabaseConnector.databaseName + "." + tableName + "_tmp PURGE");
} catch (Exception e2) {
logger.error((errorMsg += "Problem when dropping the \"" + tableName + "_tmp\" table, when merging the parquet-files!\n"), e2);
// Manual deletion should be performed!
return errorMsg; // Return both errors here, as the second is so important that if it did not happen then we could move on with this workerReport.
// TODO - Should we shutdown the service? It is highly unlikely that anyone will observe this error live (and act immediately)..!
}
// Here the original table exists and the temp-table is deleted. We eventually have the same state as if the "ALTER TABLE" succeeded.
}
// Gather information to be used for queries-optimization.
try {
jdbcTemplate.execute("COMPUTE STATS " + DatabaseConnector.databaseName + "." + tableName);
} catch (Exception e) {
logger.error("Problem when gathering information from table \"" + tableName + "\" to be used for queries-optimization.", e);
// In this case the error is not so important to the whole operation.. It's only that the performance of this specific table will be less optimal, only temporarily, unless every "COMPUTE STATS" query fails for future workerReports too.
}
return null; // No errorMsg, everything is fine.
}
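Illustrative calls of "mergeParquetFilesOfTable()" (the second table name, its column and "someAssignmentsCounter" are hypothetical); a String parameter gets wrapped in single-quotes, while a Long is used as-is:
    String error1 = parquetFileUtils.mergeParquetFilesOfTable("attempt", "", null); // Clone the whole table, no filtering.
    String error2 = parquetFileUtils.mergeParquetFilesOfTable("assignment", "WHERE assignments_batch_counter <=", someAssignmentsCounter); // Keep only a subset of rows.
    if ( (error1 != null) || (error2 != null) )
        logger.error("Merging the parquet-files failed: " + ((error1 != null) ? error1 : error2));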
}

View File

@ -48,8 +48,8 @@ public class S3ObjectStore {
boolean bucketExists = minioClient.bucketExists(BucketExistsArgs.builder().bucket(bucketName).build());
// Keep this commented-out to avoid objects-deletion by accident. The code is open-sourced, so it's easy to enable this ability if we really want it (e.g. for testing).
if ( shouldEmptyBucket && isTestEnvironment && bucketExists )
emptyBucket(bucketName, false);
/*if ( shouldEmptyBucket && isTestEnvironment && bucketExists )
emptyBucket(bucketName, false);*/
// Make the bucket, if not exist.
if ( !bucketExists ) {
@ -117,30 +117,57 @@ public class S3ObjectStore {
}
public void emptyBucket(String bucketName, boolean shouldDeleteBucket) throws Exception {
public void emptyBucket(String bucketName, boolean shouldDeleteBucket)
{
logger.warn("Going to " + (shouldDeleteBucket ? "delete" : "empty") + " bucket \"" + bucketName + "\"!");
// First list the objects of the bucket.
Iterable<Result<Item>> results = minioClient.listObjects(ListObjectsArgs.builder().bucket(bucketName).build());
Iterable<Result<Item>> results;
try {
results = minioClient.listObjects(ListObjectsArgs.builder().bucket(bucketName).build());
} catch (Exception e) {
logger.error("Could not retrieve the list of objects of bucket \"" + bucketName + "\"!");
return;
}
int countDeletedFiles = 0;
int countFilesNotDeleted = 0;
long totalSize = 0;
Item item;
// Then, delete the objects.
for ( Result<Item> resultItem : results ) {
try {
if ( !deleteFile(resultItem.get().objectName(), bucketName) ) {
logger.error("Cannot proceed with bucket deletion, since only an empty bucket can be removed!");
return;
}
item = resultItem.get();
} catch (Exception e) {
logger.warn("Could not remove " + resultItem.get().objectName());
logger.error("Could not get the item-object of one of the S3-Objects returned from the bucket!", e);
countFilesNotDeleted ++;
continue;
}
totalSize += item.size();
if ( !deleteFile(item.objectName(), bucketName) ) { // The reason and for what object, is already logged.
logger.error("Cannot proceed with bucket deletion, since only an empty bucket can be removed!");
countFilesNotDeleted ++;
} else
countDeletedFiles ++;
}
if ( shouldDeleteBucket ) {
// Lastly, delete the empty bucket.
minioClient.removeBucket(RemoveBucketArgs.builder().bucket(bucketName).build());
}
if ( countFilesNotDeleted == 0 ) {
// Lastly, delete the now-empty bucket. We need to do this last, since a non-empty bucket cannot be removed.
try {
minioClient.removeBucket(RemoveBucketArgs.builder().bucket(bucketName).build());
logger.info("Bucket \"" + bucketName + "\" was deleted!");
} catch (Exception e) {
logger.error("Bucket \"" + bucketName + "\" could not be deleted!", e);
}
} else
logger.error("Cannot execute the \"removeBucket\" command for bucket \"" + bucketName + "\", as " + countFilesNotDeleted + " files failed to be deleted!");
} else
logger.info("Bucket \"" + bucketName + "\" was emptied!");
logger.info("Bucket " + bucketName + " was " + (shouldDeleteBucket ? "deleted!" : "emptied!"));
logger.info(countDeletedFiles + " files were deleted, amounting to " + ((totalSize/1024)/1024) + " MB.");
}

View File

@ -15,7 +15,7 @@ services:
db:
initialDatabaseName: pdfaggregation_i
testDatabaseName: pdfaggregationdatabase_payloads_view_test
testDatabaseName: pdfaggregationdatabase_test_new
assignmentLimit: 10000
maxAttemptsPerRecord: 3
@ -32,9 +32,10 @@ services:
secretKey: XA
region: XA
bucketName: XA
shouldEmptyBucket: false
shouldEmptyBucket: false # For data-security reasons; in order to use this ability, the relevant code has to be uncommented in "util.S3ObjectStore.java".
shouldShowAllS3Buckets: true
worker:
port: 1881
bulk-import:
baseBulkImportLocation: /mnt/bulk_import/
@ -46,11 +47,19 @@ bulk-import:
datasourcePrefix: arXiv_______ # For PID-providing datasource, we use the PID-prefix here. (so not the datasource-prefix: "od________18")
pdfUrlPrefix: https://arxiv.org/pdf/
mimeType: application/pdf
isAuthoritative: true
# otherImport:
# datasourceID: othersource__::0123
# datasourcePrefix: other_______
# pdfUrlPrefix: https://example.org/pdf/
# mimeType: application/pdf
# isAuthoritative: false
# For "authoritative" sources, a special prefix is selected, from: https://graph.openaire.eu/docs/data-model/pids-and-identifiers/#identifiers-in-the-graph
# For the rest, the "datasource_prefix" is selected, using this query:
# select datasource.namespaceprefix.value
# from openaire_prod_<PROD_DATE>.datasource -- Here use the production-table with the latest date.
# where officialname.value = 'datasourceOfficialName';
spring:
@ -72,7 +81,7 @@ spring:
ansi:
enabled: always
lifecycle:
timeout-per-shutdown-phase: 2m
timeout-per-shutdown-phase: 5m
hdfs:
@ -107,6 +116,8 @@ management:
application: ${spring.application.name}
logging:
file:
path: logs
level:
root: INFO
eu:

View File

@ -1,10 +1,10 @@
<configuration debug="false">
<appender name="RollingFile" class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>logs/UrlsController.log</file>
<file>${LOG_PATH}/UrlsController.log</file>
<rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
<fileNamePattern>logs/UrlsController.%i.log.zip</fileNamePattern>
<fileNamePattern>${LOG_PATH}/UrlsController.%i.log.zip</fileNamePattern>
<minIndex>1</minIndex>
<maxIndex>10</maxIndex>
</rollingPolicy>