OpenAIRE SOLR Docker
This project defines a docker-compose based Solr cluster and a content indexing procedure. The project is structured in two main components:
- The Solr cluster, defined in the `docker-compose.yml` file. By default it consists of 3 Solr nodes backed by a 3-node ZooKeeper ensemble.
- The openaire-solr-importer, a Java application responsible for:
  - Configuring the Solr cluster by loading the configuration files (schema and solrconfig) onto ZooKeeper
  - Creating the data collection
  - Feeding the collection with the input documents, applying the necessary conversions
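To make the second responsibility listed above more concrete, here is a minimal sketch of the Solr Collections API request that creates a collection from a configset already stored in ZooKeeper. It is not taken from the importer's source: the class name, base URL, collection name, configset name, and shard/replica counts are all placeholder assumptions.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class CreateCollectionUrl {

    /** Builds the Collections API CREATE request URI for the given Solr base URL. */
    static URI build(String solrBaseUrl, String collection, String configName,
                     int numShards, int replicationFactor) {
        // LinkedHashMap keeps the parameters in insertion order, so the URL is stable.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("action", "CREATE");
        params.put("name", collection);
        params.put("collection.configName", configName);
        params.put("numShards", Integer.toString(numShards));
        params.put("replicationFactor", Integer.toString(replicationFactor));

        String query = params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
        return URI.create(solrBaseUrl + "/admin/collections?" + query);
    }

    public static void main(String[] args) {
        // All values below are hypothetical; adapt them to the actual deployment.
        System.out.println(build("http://localhost:8983/solr", "example", "example_conf", 3, 1));
    }
}
```

Sending a GET request to the resulting URI asks Solr to create the collection. The importer may well perform this step differently, for instance through SolrJ.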
Step by step guide
- Clone this project
git clone https://code-repo.d4science.org/D-Net/openaire-solr-docker.git
- Build the solr-importer Docker image
docker --version
Docker version 27.2.0, build 3ab4256
cd openaire-solr-docker/solr-importer
❯ docker build . -t openaire/solr-importer --no-cache
[+] Building 58.1s (16/16) FINISHED docker:desktop-linux
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 537B 0.0s
=> [internal] load metadata for docker.io/library/openjdk:17-bullseye 1.1s
=> [internal] load metadata for docker.io/library/maven:3.9-eclipse-temurin-17 1.7s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [builder 1/4] FROM docker.io/library/maven:3.9-eclipse-temurin-17@sha256:d4f3b77119ae1afcdf00276083416d58fd... 5.5s
=> => resolve docker.io/library/maven:3.9-eclipse-temurin-17@sha256:d4f3b77119ae1afcdf00276083416d58fd9c699294... 0.0s
=> => sha256:818a39a5051cd05f4158c0db68fbf2a5bf732b9b8d5ecde89db2a725a35aa731 9.77kB / 9.77kB 0.0s
=> => sha256:bc78492c245adcc54dde954f84b79e98d71c8ac11fe26c874e1b1347f88550e2 159B / 159B 0.6s
=> => sha256:d4f3b77119ae1afcdf00276083416d58fd9c69929400cea3595eba8965b6ae6f 7.86kB / 7.86kB 0.0s
=> => sha256:004549c9d7e4108958645eea52f71a0e8ba6f18d72dd6dca3a47239ce74704dd 24.16MB / 24.16MB 1.1s
=> => sha256:b3ad492a84cdc67ade7b4db435c7a511fd80135e165d0ab6683529afad3e9cf6 2.90kB / 2.90kB 0.0s
=> => sha256:225cb5fd64f3a6f155a1301661e18261d430eba8637a6f7e08c50e2774434caa 143.37MB / 143.37MB 3.4s
=> => sha256:b2ea5a28f7c23fb8f6b9e31e7612b3c542394162012defbebf1f640ad33a5b06 2.28kB / 2.28kB 0.7s
=> => sha256:878f62a07109a4bc1c6e84f338d9c7d7472bd4ab79c45e369cc350d1be3b3ba8 22.61MB / 22.61MB 1.7s
=> => extracting sha256:004549c9d7e4108958645eea52f71a0e8ba6f18d72dd6dca3a47239ce74704dd 1.0s
=> => sha256:ba63355dfc695379f7f9b29e7583aabdb617a26027ae4b37d6bb782f8bb3f14a 9.17MB / 9.17MB 1.6s
=> => sha256:30ff8964aee50bcf46c1b6a4b4c9fb180c64e917a793571ba3c5b1d1a4481dd1 849B / 849B 1.8s
=> => sha256:4af42910298d8a6180fd68a7de28cf581859551fda87acf000b0142af306c8d0 154B / 154B 1.9s
=> => sha256:b4d1d7ad89e02201ce3dd818bf534d160a5b44d1e53b47e57ed29ccd17d5ab31 360B / 360B 1.9s
=> => extracting sha256:225cb5fd64f3a6f155a1301661e18261d430eba8637a6f7e08c50e2774434caa 1.2s
=> => extracting sha256:bc78492c245adcc54dde954f84b79e98d71c8ac11fe26c874e1b1347f88550e2 0.0s
=> => extracting sha256:b2ea5a28f7c23fb8f6b9e31e7612b3c542394162012defbebf1f640ad33a5b06 0.0s
=> => extracting sha256:878f62a07109a4bc1c6e84f338d9c7d7472bd4ab79c45e369cc350d1be3b3ba8 0.7s
=> => extracting sha256:ba63355dfc695379f7f9b29e7583aabdb617a26027ae4b37d6bb782f8bb3f14a 0.1s
=> => extracting sha256:30ff8964aee50bcf46c1b6a4b4c9fb180c64e917a793571ba3c5b1d1a4481dd1 0.0s
=> => extracting sha256:b4d1d7ad89e02201ce3dd818bf534d160a5b44d1e53b47e57ed29ccd17d5ab31 0.0s
=> => extracting sha256:4af42910298d8a6180fd68a7de28cf581859551fda87acf000b0142af306c8d0 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 261.51kB 0.0s
=> CACHED [stage-1 1/6] FROM docker.io/library/openjdk:17-bullseye@sha256:bd3113dee5dfa55c246067cdb20e5880003ed... 0.0s
=> [stage-1 2/6] RUN apt-get update 4.1s
=> [stage-1 3/6] RUN apt-get install -y zip 1.1s
=> [builder 2/4] COPY src /usr/src/app/src 0.0s
=> [builder 3/4] COPY pom.xml /usr/src/app 0.0s
=> [builder 4/4] RUN mvn -f /usr/src/app/pom.xml clean package -DskipTests 49.6s
=> [stage-1 4/6] COPY --from=builder /usr/src/app/target/*jar-with-dependencies.jar /usr/app/app.jar 0.3s
=> [stage-1 5/6] COPY resources/scripts /scripts/ 0.0s
=> [stage-1 6/6] RUN chmod a+x /scripts/init_solr.sh 0.2s
=> exporting to image 0.3s
=> => exporting layers 0.3s
=> => writing image sha256:d216745b985f2ce75e5c3b26fe0539707573e9593ef37b89d44c7f8acc9d7bbf 0.0s
=> => naming to docker.io/openaire/solr-importer 0.0s
View build details: docker-desktop://dashboard/build/desktop-linux/desktop-linux/wfzudmmkob4s0klc2mrj4pcdl
The solr-importer Docker image is now ready.
Note that the project does not ship with any example data; the data must be retrieved from the ICM's OCEAN cluster. The import procedure expects to find the input records under the following path:
❯ tree solr-importer/resources/prod_xml_json
solr-importer/resources/prod_xml_json
├── part-00000-016f0fea-e0d0-4210-a4d6-ece83b3e3e18-c000.json.gz
├── part-00001-016f0fea-e0d0-4210-a4d6-ece83b3e3e18-c000.json.gz
├── part-00003-016f0fea-e0d0-4210-a4d6-ece83b3e3e18-c000.json.gz
└── part-00004-016f0fea-e0d0-4210-a4d6-ece83b3e3e18-c000.json.gz
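Before starting the import, it can be useful to check that the input directory is populated. The sketch below lists the `part-*.json.gz` files in a directory and counts the non-blank lines in each. It assumes one JSON record per line, which the README does not state, and the class and method names are illustrative only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPInputStream;

final class InputParts {

    /** Returns the part-*.json.gz files found directly under dir, sorted by name. */
    static List<Path> list(Path dir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "part-*.json.gz")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts);
        return parts;
    }

    /** Counts the non-blank lines of a gzip-compressed text file. */
    static long countRecords(Path gzFile) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8))) {
            return reader.lines().filter(line -> !line.isBlank()).count();
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the path shown above; pass another directory as the first argument.
        Path dir = Path.of(args.length > 0 ? args[0] : "solr-importer/resources/prod_xml_json");
        for (Path part : list(dir)) {
            System.out.println(part.getFileName() + ": " + countRecords(part) + " records");
        }
    }
}
```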
- Start the Solr cluster and the importer
❯ docker compose up -d
[+] Running 7/7
✔ Container zoo2 Started 0.3s
✔ Container zoo1 Started 0.3s
✔ Container zoo3 Started 0.3s
✔ Container solr3 Started 0.6s
✔ Container solr1 Started 0.6s
✔ Container solr2 Started 0.7s
✔ Container solr-importer Started
Hint: the Docker Desktop client offers simplified access to the running containers, their volumes, etc.
https://www.docker.com/products/docker-desktop/
Interacting with Solr
The Solr and ZooKeeper nodes run inside Docker containers and communicate over a shared network named `solr`, declared in the docker-compose file. As a consequence, the internal cluster information known to ZooKeeper refers to that network.
When the `org.apache.solr.client.solrj.impl.CloudSolrClient` is used to interact with the cluster from outside the `solr` network, it relies on the information in ZooKeeper to discover the Solr nodes, whose addresses are however not resolvable from the outside.
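The resolution problem can be observed with a small check run on the host. The sketch below assumes that the Solr nodes register in ZooKeeper under names of the form `host:port_solr` and that the hosts match the container names `solr1`, `solr2`, and `solr3`. Neither detail is confirmed by this README, so the node names in `main` are hypothetical.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class LiveNodeCheck {

    /** Extracts the host from a live-node name of the form host:port_solr. */
    static String hostOf(String liveNode) {
        int colon = liveNode.lastIndexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("unexpected node name: " + liveNode);
        }
        return liveNode.substring(0, colon);
    }

    /** Returns true if the host name can be resolved from the machine running this code. */
    static boolean resolvable(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical node names; the real ones are whatever the nodes registered in ZooKeeper.
        String[] liveNodes = {"solr1:8983_solr", "solr2:8983_solr", "solr3:8983_solr"};
        for (String node : liveNodes) {
            String host = hostOf(node);
            System.out.println(host + " resolvable from here: " + resolvable(host));
        }
    }
}
```

Run outside the `solr` network, such hosts would typically not resolve unless they were added to the local hosts file, which is why ZooKeeper-based discovery does not work from there.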
The unit test `eu.dnetlib.dhp.solr.SolrClientTest` uses the `org.apache.solr.client.solrj.impl.LBHttp2SolrClient` to showcase how to query the Solr cluster running inside Docker without relying on node discovery through ZooKeeper.
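The underlying idea, a fixed list of node URLs tried in rotation in place of ZooKeeper discovery, can be sketched with the JDK HTTP client alone. This is neither the code of `SolrClientTest` nor the SolrJ API; the class name, the host ports, and the collection name are assumptions made for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class RoundRobinSolrQuery {

    /** Minimal transport abstraction so the rotation logic can be exercised without a cluster. */
    interface Transport {
        String get(URI uri) throws IOException;
    }

    private final List<String> baseUrls;
    private final Transport transport;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinSolrQuery(List<String> baseUrls, Transport transport) {
        if (baseUrls.isEmpty()) {
            throw new IllegalArgumentException("at least one base URL is required");
        }
        this.baseUrls = List.copyOf(baseUrls);
        this.transport = transport;
    }

    /** Tries each node at most once, starting from the next one in rotation. */
    String select(String collection, String queryString) throws IOException {
        int start = Math.floorMod(next.getAndIncrement(), baseUrls.size());
        IOException last = null;
        for (int i = 0; i < baseUrls.size(); i++) {
            String base = baseUrls.get((start + i) % baseUrls.size());
            try {
                return transport.get(URI.create(base + "/" + collection + "/select?" + queryString));
            } catch (IOException e) {
                last = e; // node failed or unreachable: move on to the next one
            }
        }
        throw new IOException("no Solr node reachable", last);
    }

    /** Transport backed by the JDK HTTP client; non-200 responses are reported as failures. */
    static Transport httpTransport() {
        HttpClient client = HttpClient.newHttpClient();
        return uri -> {
            try {
                HttpResponse<String> response = client.send(
                        HttpRequest.newBuilder(uri).GET().build(),
                        HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() != 200) {
                    throw new IOException("HTTP " + response.statusCode() + " from " + uri);
                }
                return response.body();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted", e);
            }
        };
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical host ports and collection name: use what docker-compose.yml actually publishes.
        List<String> nodes = List.of(
                "http://localhost:8983/solr",
                "http://localhost:8984/solr",
                "http://localhost:8985/solr");
        RoundRobinSolrQuery solr = new RoundRobinSolrQuery(nodes, httpTransport());
        System.out.println(solr.select("example", "q=*:*&rows=0"));
    }
}
```

Because the client only ever contacts URLs that are reachable from the host, it never needs the in-network addresses stored in ZooKeeper.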