# OpenAIRE SOLR Docker
This project defines a docker-compose based Solr cluster and a content indexing procedure. The project is structured in two main components:

- The Solr cluster, defined in the `docker-compose.yml` file, which by default consists of 3 Solr nodes backed by a 3-node Zookeeper ensemble.
- The `openaire-solr-importer`, a Java application responsible for:
  - configuring the Solr cluster, loading the configuration files (schema & solrconfig) onto Zookeeper;
  - creating the data collection;
  - feeding the collection with the input documents, applying the necessary conversions.
## Step by step guide
### 1. Verify that Docker is installed

This guide was written using the following Docker version:

```
❯ docker --version
Docker version 27.2.0, build 3ab4256
```

Docker Desktop is a one-click-install application for Mac, Linux, or Windows that lets you build, share, and run containerized applications and microservices. It provides a straightforward GUI for managing containers, applications, and images directly from your machine, takes care of port mappings, file system concerns, and other default settings, and is regularly updated with bug fixes and security patches.

Get it at https://www.docker.com/products/docker-desktop
### 2. Clone this project

```
git clone https://code-repo.d4science.org/D-Net/openaire-solr-docker.git
```
### 3. Build the solr-importer Docker image

```
cd openaire-solr-docker
❯ docker build . -t openaire/solr-importer --no-cache
[+] Building 58.1s (16/16) FINISHED                              docker:desktop-linux
 => [internal] load build definition from Dockerfile                            0.0s
 => => transferring dockerfile: 537B                                            0.0s
 => [internal] load metadata for docker.io/library/openjdk:17-bullseye          1.1s
 => [internal] load metadata for docker.io/library/maven:3.9-eclipse-temurin-17 1.7s
 => [internal] load .dockerignore                                               0.0s
 => => transferring context: 2B                                                 0.0s
 => [builder 1/4] FROM docker.io/library/maven:3.9-eclipse-temurin-17@sha256:d4f3b77119ae1afcdf00276083416d58fd... 5.5s
 => => resolve docker.io/library/maven:3.9-eclipse-temurin-17@sha256:d4f3b77119ae1afcdf00276083416d58fd9c699294... 0.0s
 => => sha256:818a39a5051cd05f4158c0db68fbf2a5bf732b9b8d5ecde89db2a725a35aa731 9.77kB / 9.77kB 0.0s
 => => sha256:bc78492c245adcc54dde954f84b79e98d71c8ac11fe26c874e1b1347f88550e2 159B / 159B 0.6s
 => => sha256:d4f3b77119ae1afcdf00276083416d58fd9c69929400cea3595eba8965b6ae6f 7.86kB / 7.86kB 0.0s
 => => sha256:004549c9d7e4108958645eea52f71a0e8ba6f18d72dd6dca3a47239ce74704dd 24.16MB / 24.16MB 1.1s
 => => sha256:b3ad492a84cdc67ade7b4db435c7a511fd80135e165d0ab6683529afad3e9cf6 2.90kB / 2.90kB 0.0s
 => => sha256:225cb5fd64f3a6f155a1301661e18261d430eba8637a6f7e08c50e2774434caa 143.37MB / 143.37MB 3.4s
 => => sha256:b2ea5a28f7c23fb8f6b9e31e7612b3c542394162012defbebf1f640ad33a5b06 2.28kB / 2.28kB 0.7s
 => => sha256:878f62a07109a4bc1c6e84f338d9c7d7472bd4ab79c45e369cc350d1be3b3ba8 22.61MB / 22.61MB 1.7s
 => => extracting sha256:004549c9d7e4108958645eea52f71a0e8ba6f18d72dd6dca3a47239ce74704dd 1.0s
 => => sha256:ba63355dfc695379f7f9b29e7583aabdb617a26027ae4b37d6bb782f8bb3f14a 9.17MB / 9.17MB 1.6s
 => => sha256:30ff8964aee50bcf46c1b6a4b4c9fb180c64e917a793571ba3c5b1d1a4481dd1 849B / 849B 1.8s
 => => sha256:4af42910298d8a6180fd68a7de28cf581859551fda87acf000b0142af306c8d0 154B / 154B 1.9s
 => => sha256:b4d1d7ad89e02201ce3dd818bf534d160a5b44d1e53b47e57ed29ccd17d5ab31 360B / 360B 1.9s
 => => extracting sha256:225cb5fd64f3a6f155a1301661e18261d430eba8637a6f7e08c50e2774434caa 1.2s
 => => extracting sha256:bc78492c245adcc54dde954f84b79e98d71c8ac11fe26c874e1b1347f88550e2 0.0s
 => => extracting sha256:b2ea5a28f7c23fb8f6b9e31e7612b3c542394162012defbebf1f640ad33a5b06 0.0s
 => => extracting sha256:878f62a07109a4bc1c6e84f338d9c7d7472bd4ab79c45e369cc350d1be3b3ba8 0.7s
 => => extracting sha256:ba63355dfc695379f7f9b29e7583aabdb617a26027ae4b37d6bb782f8bb3f14a 0.1s
 => => extracting sha256:30ff8964aee50bcf46c1b6a4b4c9fb180c64e917a793571ba3c5b1d1a4481dd1 0.0s
 => => extracting sha256:b4d1d7ad89e02201ce3dd818bf534d160a5b44d1e53b47e57ed29ccd17d5ab31 0.0s
 => => extracting sha256:4af42910298d8a6180fd68a7de28cf581859551fda87acf000b0142af306c8d0 0.0s
 => [internal] load build context                                               0.0s
 => => transferring context: 261.51kB                                           0.0s
 => CACHED [stage-1 1/6] FROM docker.io/library/openjdk:17-bullseye@sha256:bd3113dee5dfa55c246067cdb20e5880003ed... 0.0s
 => [stage-1 2/6] RUN apt-get update                                            4.1s
 => [stage-1 3/6] RUN apt-get install -y zip                                    1.1s
 => [builder 2/4] COPY src /usr/src/app/src                                     0.0s
 => [builder 3/4] COPY pom.xml /usr/src/app                                     0.0s
 => [builder 4/4] RUN mvn -f /usr/src/app/pom.xml clean package -DskipTests    49.6s
 => [stage-1 4/6] COPY --from=builder /usr/src/app/target/*jar-with-dependencies.jar /usr/app/app.jar 0.3s
 => [stage-1 5/6] COPY resources/scripts /scripts/                              0.0s
 => [stage-1 6/6] RUN chmod a+x /scripts/init_solr.sh                           0.2s
 => exporting to image                                                          0.3s
 => => exporting layers                                                         0.3s
 => => writing image sha256:d216745b985f2ce75e5c3b26fe0539707573e9593ef37b89d44c7f8acc9d7bbf 0.0s
 => => naming to docker.io/openaire/solr-importer                               0.0s

View build details: docker-desktop://dashboard/build/desktop-linux/desktop-linux/wfzudmmkob4s0klc2mrj4pcdl
```

This confirms that the `openaire/solr-importer` Docker image is now ready for use.
**Important note regarding input data:** this project does not include any sample data for indexing. The solr-importer is designed to process data that must be sourced externally, for instance from the ICM's OCEAN cluster. The importer expects to find the input data files organized in the following directory structure within the container's filesystem:

```
❯ tree openaire-solr-importer/resources/prod_xml_json
solr-importer/resources/prod_xml_json
├── part-00000-016f0fea-e0d0-4210-a4d6-ece83b3e3e18-c000.json.gz
├── part-00001-016f0fea-e0d0-4210-a4d6-ece83b3e3e18-c000.json.gz
├── part-00003-016f0fea-e0d0-4210-a4d6-ece83b3e3e18-c000.json.gz
└── part-00004-016f0fea-e0d0-4210-a4d6-ece83b3e3e18-c000.json.gz
```

Ensure that your data files are appropriately mounted or copied into the container at this location for the importer to function correctly during the indexing process.
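One way to provide this layout is a read-only bind mount on the importer service. The fragment below is a hedged sketch, not this repository's actual configuration: the service name and, in particular, the container-side path are assumptions that must be checked against the `docker-compose.yml` shipped with this project.

```yaml
# Hypothetical excerpt -- verify the service name and the container path
# against this repository's docker-compose.yml before using it.
services:
  solr-importer:
    volumes:
      # Mount the locally prepared *.json.gz part files into the container
      - ./openaire-solr-importer/resources/prod_xml_json:/solr-importer/resources/prod_xml_json:ro
```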
### 4. Start the Solr cluster and the importer

Use Docker Compose to start all the defined services (the Zookeeper nodes, the Solr nodes, and the solr-importer container) in detached mode (`-d`):

```
❯ docker compose up -d
[+] Running 7/7
 ✔ Container zoo2           Started  0.3s
 ✔ Container zoo1           Started  0.3s
 ✔ Container zoo3           Started  0.3s
 ✔ Container solr3          Started  0.6s
 ✔ Container solr1          Started  0.6s
 ✔ Container solr2          Started  0.7s
 ✔ Container solr-importer  Started
```

This command starts the containers in the background. You can monitor their startup progress with `docker compose ps` or `docker compose logs -f`, or in the Docker Desktop application. A successful startup shows the status of each container as "Started".
## Interacting with Solr

The Solr and Zookeeper nodes within this Docker setup operate in a shared Docker network named `solr`, as defined in the `docker-compose.yml` file. Consequently, the internal cluster configuration managed by Zookeeper uses network addresses specific to this internal `solr` network.

When interacting with the Solr cluster from an application outside the `solr` Docker network using the standard `org.apache.solr.client.solrj.impl.CloudSolrClient`, you might therefore encounter connectivity issues: the Solr node addresses registered in Zookeeper are not directly resolvable from the external network context.
To illustrate how to interact with the Solr cluster running inside Docker from an external application without relying on Zookeeper-based node discovery, the unit test `eu.dnetlib.dhp.solr.SolrClientTest` provides a practical example. It demonstrates the use of `org.apache.solr.client.solrj.impl.LBHttpSolrClient`, a client that connects directly to the Solr nodes through their externally exposed ports, bypassing the need for Zookeeper-driven discovery in external client scenarios.
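As a rough sketch of that approach (not the actual code of `SolrClientTest`), an `LBHttpSolrClient` can be built from the externally exposed base URLs. The ports and the collection name `openaire` below are assumptions: adapt them to the port mappings in your `docker-compose.yml` and to the collection actually created by the importer.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.LBHttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ExternalSolrQuery {
    public static void main(String[] args) throws Exception {
        // Assumed externally exposed ports of the three Solr nodes; check
        // the port mappings in docker-compose.yml before using these URLs.
        try (LBHttpSolrClient client = new LBHttpSolrClient.Builder()
                .withBaseSolrUrls(
                        "http://localhost:8983/solr",
                        "http://localhost:8984/solr",
                        "http://localhost:8985/solr")
                .build()) {
            // "openaire" is a placeholder collection name.
            QueryResponse rsp = client.query("openaire", new SolrQuery("*:*"));
            System.out.println("numFound: " + rsp.getResults().getNumFound());
        }
    }
}
```

The load balancer rotates requests across the listed base URLs and skips nodes that are temporarily unreachable, which is why it works without resolving the internal Docker network addresses stored in Zookeeper.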