Build & Deploy LLM & RAG Applications with OpenShift

Hands-on lab guide

27 minutes recorded live demonstration

Table of Contents

1 Introduction

1.1 About this hands-on lab

2 Getting Started

2.1 Assumptions

2.2 Connect to the Bastion CLI

2.3 Login to OpenShift

2.4 Copy login command for CLI use

3 Working with the inference runtime llama.cpp

3.1 Building an inference runtime container with llama.cpp library

3.2 Pushing the built image to OpenShift in a new project

3.3 Deploy the runtime using the Mistral model in the namespace

3.4 Create a namespace and deploy the pre-built Mistral Model container

3.5 Build and deploy llama.cpp with LLM and OpenShift

4 Opening the Inference runtime UI

5 Retrieval Augmented Generation

5.1 Deploying a Vector Database

5.2 Querying the RAG model

1 Introduction

In this lab you’ll deploy a pre-trained Large Language Model on OpenShift. The deployment makes use of unique Power10 features such as the Vector Scalar Extension (VSX) as well as the newly introduced Matrix Math Accelerator (MMA) engines.

This lab makes use of the following technologies.

IBM Power10 is a high-performance microprocessor designed for IBM Power servers. It features advanced architecture and technology innovations, such as a high-bandwidth cache hierarchy, improved memory subsystem, and support for accelerated machine learning workloads. Power10 is designed to deliver high performance, scalability, and energy efficiency for enterprise and cloud computing applications.

IBM Power MMA (Matrix Math Accelerator) technology is a hardware-based solution designed to accelerate machine learning workloads on IBM Power servers. It includes specialized AI processors and software optimizations to improve the performance of machine learning tasks, such as training and inference, on these systems.

Vector Scalar Extension (VSX) is a set of instructions and architecture extensions for IBM POWER processors that enables efficient processing of vector (array-like) data types. VSX allows for faster and more efficient execution of tasks that involve large amounts of data, such as scientific computing, data analysis, and machine learning workloads.

A large language model (LLM) is an artificial intelligence model trained on a vast amount of text data, enabling it to generate human-like text based on the input it receives. These models can be used for various natural language processing tasks such as translation, summarization, and conversation.

RAG stands for Retrieval Augmented Generation. It’s a method that combines the power of retrieval systems with language models to generate more accurate and relevant responses. In this approach, the model first retrieves relevant information from a large external knowledge source (like a database or the internet), and then uses this information to generate a response.

llama.cpp is an open-source software library written mostly in C++ that performs inference on various large language models such as Llama. It is co-developed alongside the GGML project, a general-purpose tensor library.

Command-line tools are included with the library, alongside a server with a simple web interface.

MinIO is an open-source, high-performance object storage server compatible with the Amazon S3 API. It’s designed to provide reliable and scalable storage for cloud-native applications, big data, and AI workloads. MinIO is a community-driven project backed by MinIO, Inc., the company founded by its original creators.

MinIO offers several key features:

  1. High Performance: MinIO is optimized for high throughput and low latency, making it suitable for large-scale object storage deployments.
  2. Scalability: It can handle petabytes of data and thousands of concurrent users, allowing you to scale your storage needs as required.
  3. Reliability: MinIO ensures data durability and availability with built-in replication and erasure coding capabilities.
  4. Security: It supports encryption at rest and in transit, ensuring the confidentiality and integrity of your data.
  5. Integration: MinIO is compatible with the Amazon S3 API, making it easy to migrate existing applications and workflows to MinIO.

MinIO is widely used in various industries, including cloud computing, big data, AI, and machine learning, due to its high performance, scalability, reliability, and security features. It’s an excellent choice for organizations looking for a robust, open-source object storage solution.

Milvus is an open-source vector database built for AI applications, particularly those involving machine learning models that generate high-dimensional vectors for similarity search and clustering. It’s designed to handle large-scale vector data with low latency and high throughput, making it ideal for applications like recommendation systems, image and video search, and natural language processing.

Milvus offers several key features:

  1. Vector Search: Milvus enables fast and efficient vector similarity search, allowing you to find similar vectors in large datasets quickly.
  2. Scalability: It can handle petabytes of data and thousands of concurrent queries, ensuring your AI applications can scale as needed.
  3. Flexibility: Milvus supports multiple index types and distance metrics, making it suitable for a wide range of use cases.
  4. Integration: Milvus is compatible with popular machine learning frameworks like TensorFlow, PyTorch, and scikit-learn, allowing you to integrate it seamlessly into your existing workflows.
  5. Real-time Analytics: Milvus supports real-time vector search, enabling you to process and analyze data in near real-time.

Milvus is used in various industries, including e-commerce, gaming, and entertainment, to power AI applications that rely on vector similarity search and clustering. It’s an excellent choice for organizations looking for a scalable, flexible, and high-performance vector database for their AI applications.

Streamlit is an open-source Python library for creating interactive web applications for data science and machine learning, including model deployment. It enables data scientists and developers to build user-friendly interfaces for their models, making it easier to share and deploy AI solutions with non-technical users.

Streamlit offers several key features:

  1. Interactive Web Apps: Streamlit allows you to create interactive web applications that display your data, visualizations, and machine learning models in a user-friendly interface.
  2. Easy to Use: Streamlit has a simple and intuitive API, making it easy to build and deploy web applications without requiring extensive web development knowledge.
  3. Real-time Updates: Streamlit supports real-time updates, allowing users to see changes to your data or models instantly.
  4. Sharing and Deployment: Streamlit makes it easy to share your applications by serving them at a URL that others can open in a browser. You can also deploy your applications as web apps using platforms like Heroku, AWS, or Google Cloud.
  5. Integration: Streamlit is compatible with popular data science libraries like Pandas, NumPy, and Scikit-learn, making it easy to integrate machine learning models into your web applications.

Streamlit is used in various industries, including finance, healthcare, and marketing, to build user-friendly interfaces for AI solutions.
It’s an excellent choice for data scientists and developers looking to share and deploy their AI models with non-technical users.

Etcd is a distributed key-value store that provides a reliable way to store configuration data, support service discovery, and coordinate distributed systems. It’s widely used in container orchestration platforms, most notably Kubernetes, to manage the configuration and state of these systems.

Etcd offers several key features:

  1. Distributed Key-Value Store: Etcd is a distributed key-value store that stores data across multiple nodes, ensuring high availability and fault tolerance.
  2. Reliability: Etcd provides strong consistency guarantees and keeps data available and durable as long as a quorum of nodes remains healthy.
  3. Scalability: A small etcd cluster can serve the configuration and state needs of very large deployments, such as sizeable Kubernetes clusters.
  4. Security: Etcd supports encryption at rest and in transit, ensuring the confidentiality and integrity of your data.
  5. Integration: Etcd is compatible with various programming languages and platforms, making it easy to integrate into existing systems.

Etcd is used in various industries, including cloud computing, big data, and AI, to manage the configuration and state of distributed systems. It’s an excellent choice for organizations looking for a reliable, scalable, and secure key-value store for their distributed systems.

1.1 About this hands-on lab

The first part of the lab, in Section 2, focuses on how to access and interact with the lab environment and describes the following steps:

  • Connecting to the SSH console for the bastion
  • Connecting to the OpenShift Web GUI interface on a browser
  • Connecting to the OpenShift Command Line Interface (CLI) via PuTTY

The second part of the lab in Sections 3 and 4 focuses on these topics:

  • Building the llama.cpp container from scratch
  • Deploying the llama.cpp container within a project in OpenShift

The third part of the lab in Section 5 demonstrates the use of Retrieval-Augmented Generation (RAG) and focuses on these topics:

  • Deploying a vector database
  • Querying the AI model on additional data provided via RAG

Quick 13-minute silent step-by-step demo walk-through

2 Getting Started

Please Note:

Commands that you should execute are displayed in bold blue text. Left-click within the command area to copy it to your clipboard.
Example output from executed commands is displayed as white text on a black background.
  • Bullet items are required actions

Standard paragraphs are for informational purposes.

2.1 Assumptions

  1. You have access to an OpenShift Container Platform (OCP) environment running on IBM Power10
  2. You have access to a RHEL based Bastion LPAR
    • Make a note of the Bastion IP address
  3. The OCP user is assumed to be “cecuser”; if yours is different, substitute your own OCP user name wherever “cecuser” appears
    • Make a note of the OCP user name, if not using “cecuser”
    • Make a note of the “cecuser” password

2.2 Connect to the Bastion CLI

Now we will go over the steps to connect to the CLI for the environment.

I typically use the PuTTY application, but you are free to use your favourite terminal.

PuTTY is available for download from the official PuTTY website.

  • Install PuTTY if desired.
  • Open the PuTTY application.
  • Enter the IP address of your bastion in the Host Name field. The address may be found in your assigned Project Kit if using IBM TechZone resources.
  • Press Open
  • Click Accept
  • At the “login as:” prompt, type cecuser and press Enter:
  • Enter the password for the “cecuser” user. The password is the same for both the CLI and the GUI.

2.3 Login to OpenShift

To log in to the OpenShift environment from the command line, find the oc login command in your OpenShift GUI.

  • Point your browser to your OpenShift web console
  • Accept the certificate warning if certificates have not been configured on the demo equipment.
  • Click on the htpasswd option:
  • Enter the user name and password you noted in Section 2.1 and click Log in.
  • If it’s your first time, spend approximately 10 minutes familiarizing yourself with the navigation. You can easily switch between the Developer and Administrator views using the menu option located at the top left corner.

2.4 Copy login command for CLI use

If you need to log in to the CLI again for any reason, you can find the login command on the main OpenShift web console page.

  • At the top right, click the cecuser drop-down menu and then click “Copy login command”
  • Once again click on the htpasswd option:
  • Enter the user name and password you noted in Section 2.1 and click Log in.
  • Click on Display Token on the top left
  • You can use the oc login command whenever your authorization has expired. You will also need the API token to log in to the image registry.
  • As cecuser, copy and paste the oc login command from the web page into your PuTTY session.
oc login --token=[Add your own token] --server=[Add your own server]

For example:

oc login --token=sha256~8HzJyuecqujfeCXsaDnAeUUJ9VMsLafr-cJk5yn8tGk --server=https://api.p1289.cecc.ihost.com:6443

Logged into "https://api.p1289.cecc.ihost.com:6443" as "cecuser" using the token provided.

You have access to 71 projects, the list has been suppressed. You can list all projects with 'oc projects'

Using project "default".

3 Working with the inference runtime llama.cpp

Do not confuse llama.cpp with the Llama models that Facebook (Meta) made available. You can run Llama models with llama.cpp, but you can also run other models such as Mistral and Granite.

There are pre-built containers available for this lab as well. If you want to build the llama.cpp inference container from scratch, follow Sections 3.1 to 3.3 (this requires additional memory in your bastion LPAR and takes about 10 minutes of extra lab time). If you want to avoid that and use a pre-built container instead, skip Sections 3.1 to 3.3 and go to Section 3.4.

The easiest way to get up and running quickly is typically to head straight to Section 3.5, where you initiate the build and deployment from the bastion but run it within OpenShift.

TechZone and TechXchange users should move directly to Section 3.5 as you will not have enough memory on your Bastion to compile the source and build the containers.

3.1 Building an inference runtime container with llama.cpp library

By following this section, you can build the runtime container from scratch using the Dockerfile below. The Dockerfile is included in the repository that you clone with the git clone command below.

Please note that this exercise requires significant memory to complete. If you do not have enough memory on your bastion, skip directly to Section 3.4 to use the pre-built container, or to Section 3.5 to build and deploy the application within OpenShift.

Skip directly to Section 3.5 for the easiest way to build and deploy the llama.cpp application from within OpenShift. TechZone and TechXchange users should move directly to Section 3.5 as you will not have enough memory on your Bastion to compile the source and build the containers.

FROM registry.access.redhat.com/ubi9/ubi as builder

#################################################
# Creating a compiler environment for the build 
#################################################
RUN dnf update -y  && dnf -y groupinstall 'Development Tools' && dnf install -y \
  cmake git ninja-build \
  && dnf clean all

####################################################
# Downloading and compiling OpenBLAS for the build #
####################################################
RUN git clone --recursive https://github.com/DanielCasali/OpenBLAS.git && cd OpenBLAS && \
    make -j$(nproc --all) TARGET=POWER10 DYNAMIC_ARCH=1 && \
    make PREFIX=/opt/OpenBLAS install && \
    cd /

############################################################
# Downloading and compiling llama.cpp using the OpenBLAS Library we just compiled:
############################################################
RUN git clone https://github.com/DanielCasali/llama.cpp.git && cd llama.cpp && sed -i "s/powerpc64le/native -mvsx -mtune=native -D__POWER10_VECTOR__/g" ggml/src/CMakeLists.txt && \
    mkdir build; \
    cd build; \
    cmake -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DBLAS_INCLUDE_DIRS=/opt/OpenBLAS/include -G Ninja ..; \
    cmake --build . --config Release

CMD bash
########################################################
# Copying the built executable and libraries needed for the runtime on a simple ubi9 so it is a small container
########################################################
FROM registry.access.redhat.com/ubi9/ubi

COPY --from=builder --chmod=755 /llama.cpp/build/bin/llama-server /usr/local/bin
COPY --from=builder --chmod=644 /llama.cpp/build/src/libllama.so /llama.cpp/build/src/libllama.so
COPY --from=builder --chmod=644 /llama.cpp/build/ggml/src/libggml.so /llama.cpp/build/ggml/src/libggml.so

ENTRYPOINT [ "/usr/local/bin/llama-server", "--host", "0.0.0.0"]
  • Using the PuTTY console you opened in step 2.2, install git to clone the project that contains the assets to help us through the lab. Note that git may already be pre-installed in some of the lab environments.
sudo dnf -y install git
  • After git gets successfully installed, clone the project from GitHub:
git clone https://github.com/DanielCasali/mma-ai.git
Cloning into 'mma-ai'...
remote: Enumerating objects: 8, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 8 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (8/8), 6.26 KiB | 6.26 MiB/s, done.
Resolving deltas: 100% (1/1), done.
  • Enter the mma-ai/llama-runtime directory:
cd mma-ai/llama-runtime/
  • Build the runtime container using the Dockerfile shown at the beginning of this section. You will push it to the OpenShift internal registry, so the command inside “$( )” retrieves the internal registry hostname so we can tag the image accordingly:
podman build . --tag $(oc get routes -A |grep image-registry|awk '{print $3}')/ai/llama-runtime-ubi:latest
[1/2] STEP 1/5: FROM registry.access.redhat.com/ubi9/ubi AS builder
[1/2] STEP 2/5: RUN dnf update -y  && dnf -y groupinstall 'Development Tools' && dnf install -y   cmake git ninja-build   && dnf clean all
Updating Subscription Management repositories.
Unable to read consumer identity
subscription-manager is operating in container mode.
.
.
.
[2/2] STEP 1/5: FROM registry.access.redhat.com/ubi9/ubi
[2/2] STEP 2/5: COPY --from=builder --chmod=755 /llama.cpp/build/bin/llama-server /usr/local/bin
--> 2066dbbb20a8
[2/2] STEP 3/5: COPY --from=builder --chmod=644 /llama.cpp/build/src/libllama.so /llama.cpp/build/src/libllama.so
--> 06e978aead69
[2/2] STEP 4/5: COPY --from=builder --chmod=644 /llama.cpp/build/ggml/src/libggml.so /llama.cpp/build/ggml/src/libggml.so
--> 896f86c2f229
[2/2] STEP 5/5: ENTRYPOINT [ "/usr/local/bin/llama-server", "--host", "0.0.0.0"]
[2/2] COMMIT default-route-openshift-image-registry.apps.p1325.cecc.ihost.com/ai/llama-runtime-ubi:latest
--> 1f99895367aa
Successfully tagged default-route-openshift-image-registry.apps.p1325.cecc.ihost.com/ai/llama-runtime-ubi:latest

The previous step takes some time to download and compile OpenBLAS and llama.cpp.

3.2 Pushing the built image to OpenShift in a new project

  • Create the ai project where you will run the container:
oc new-project ai
Now using project "ai" on server "https://api.p1325.cecc.ihost.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=registry.k8s.io/e2e-test-images/agnhost:2.43 -- /agnhost serve-hostname
  • On the CLI, run the podman login command. The command within “$( )” retrieves the OpenShift internal registry host. When prompted, enter cecuser as the Username.
podman login $(oc get routes -A |grep image-registry|awk '{print $3}') --tls-verify=false

Use the OpenShift token tab you left open in your browser and copy the string right below “Your API token is”.

  • Paste the API token into the Password field and press Enter.
Login Succeeded!
  • Use podman push to get the container into the internal registry:
podman push $(oc get routes -A |grep image-registry|awk '{print $3}')/ai/llama-runtime-ubi:latest \
--tls-verify=false
Getting image source signatures
Copying blob 28a720baac6f skipped: already exists
Copying blob 8f522959366b skipped: already exists
Copying blob 8c672c500f73 skipped: already exists
Copying blob fe14819e82ca skipped: already exists
Copying config a85918e36b done   |
Writing manifest to image destination

3.3 Deploy the runtime using the Mistral model in the namespace

  • Apply the deployment that runs the container just built and downloads the Mistral model from Hugging Face:
oc apply -f mistral-deploy.yaml
deployment.apps/llama-cpp-server created

The above yaml file is a Kubernetes deployment configuration for an application named “llama-cpp-server”. It specifies that there should be one replica of the application running. The application consists of two containers: “fetch-model-data” and “llama-cpp”.

The “fetch-model-data” container is an init container that fetches a model file from a specified URL and saves it to a volume named “llama-models”. It runs to completion before the main application container starts.

The “llama-cpp” container is the main application container. It uses the fetched model file and runs with certain arguments and resource limits. It listens on port 8080 for HTTP requests. The container has readiness and liveness probes to check its status.

The deployment also specifies an emptyDir volume named “llama-models” for the containers to use.
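
For reference, a deployment of that shape could look like the following minimal sketch. The image reference, model URL, and container arguments here are illustrative assumptions; the mistral-deploy.yaml file in the cloned repository is the authoritative version.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-cpp-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-cpp-server
  template:
    metadata:
      labels:
        app: llama-cpp-server
    spec:
      initContainers:
      - name: fetch-model-data                # downloads the model before the main container starts
        image: registry.access.redhat.com/ubi9/ubi
        command: ["sh", "-c", "curl -L -o /models/model.gguf $MODEL_URL"]
        env:
        - name: MODEL_URL                     # illustrative placeholder for the Hugging Face download URL
          value: "https://huggingface.co/<org>/<repo>/resolve/main/<model>.gguf"
        volumeMounts:
        - name: llama-models
          mountPath: /models
      containers:
      - name: llama-cpp                       # main application container running llama-server
        image: image-registry.openshift-image-registry.svc:5000/ai/llama-runtime-ubi:latest
        args: ["-m", "/models/model.gguf"]    # point llama-server at the fetched model
        ports:
        - containerPort: 8080                 # llama-server listens for HTTP requests here
        volumeMounts:
        - name: llama-models
          mountPath: /models
      volumes:
      - name: llama-models
        emptyDir: {}                          # scratch volume shared by the init and main containers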

  • Apply the service and the route to expose the llama.cpp runtime:
oc create -f llama-svc.yaml
service/llama-service created

The above yaml file defines a Kubernetes Service named “llama-service” with the following specifications:

  1. It carries the label “app: llama-service”.
  2. The service type is “ClusterIP”, meaning it is only accessible within the cluster.
  3. It exposes port 8080 (TCP) from the selected pods.
  4. The service selector matches pods with the label “app: llama-cpp-server”; a minimal sketch of such a service follows.
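
A ClusterIP service matching that description could be as simple as the sketch below. The port name is an assumption chosen to line up with the route’s target port; llama-svc.yaml in the repository remains the authoritative definition.

apiVersion: v1
kind: Service
metadata:
  name: llama-service
  labels:
    app: llama-service
spec:
  type: ClusterIP              # only reachable from inside the cluster
  selector:
    app: llama-cpp-server      # sends traffic to the llama.cpp pods
  ports:
  - name: llama-cpp-server     # assumed port name, referenced by the route's targetPort
    protocol: TCP
    port: 8080
    targetPort: 8080
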
oc create -f llama-route.yaml
route.route.openshift.io/llama-cpp created

The above yaml file is a Route configuration for OpenShift. It creates a route named “llama-cpp” that directs traffic to the “llama-service” service, using the “llama-cpp-server” target port. The route does not use TLS (tls: null). The application associated with this route is labeled as “app: llama-service”.
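
For orientation, a route with those characteristics would look roughly like this sketch; llama-route.yaml in the repository is the authoritative version.

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: llama-cpp
  labels:
    app: llama-service
spec:
  to:
    kind: Service
    name: llama-service              # forward traffic to the ClusterIP service above
  port:
    targetPort: llama-cpp-server     # named port on the service
  tls: null                          # plain HTTP, no TLS termination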

  • You can verify the readiness of the runtime by checking the pod status and its progress, as shown below.
oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
llama-cpp-server-664bddbbcc-9fmf8   0/1     Init:0/1  0          1m3s
  • Repeat the oc get pods command until you see READY 1/1, as in the output below. It takes about 6 to 8 minutes to complete.
oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
llama-cpp-server-664bddbbcc-9fmf8   1/1     Running   0          5m43s

3.4 Create a namespace and deploy the pre-built Mistral Model container

If you built the llama.cpp inference container from scratch by executing the steps in Sections 3.1 to 3.3, skip this section and go to Section 4.

The steps below guide you to using the pre-built llama.cpp container.

  • Using the PuTTY console you opened in step 2.2, install git to clone the project with the assets to help us through the lab. Note that some lab environments might already have git installed.
sudo dnf -y install git
git clone https://github.com/DanielCasali/mma-ai.git
Cloning into 'mma-ai'...
remote: Enumerating objects: 8, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 8 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (8/8), 6.26 KiB | 6.26 MiB/s, done.
Resolving deltas: 100% (1/1), done.
  • Enter the mma-ai/llama-runtime directory:
cd mma-ai/llama-runtime/
  • Create the ai project where we will run the container:
oc new-project ai
Now using project "ai" on server "https://api.p1325.cecc.ihost.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=registry.k8s.io/e2e-test-images/agnhost:2.43 -- /agnhost serve-hostname
  • Apply the deployment for the pre-built container runtime, which pulls the Mistral model from Hugging Face:
oc apply -f mistral-deploy-ready.yaml
deployment.apps/llama-cpp-server created

The above describes a Kubernetes Deployment named “llama-cpp-server”. It has one replica and uses the latest version of the specified image for the container. The container is named “llama-cpp” and runs a command to load a model file from a volume. The container exposes port 8080 for HTTP traffic.

The deployment includes an init container named “fetch-model-data” that fetches the model file from a URL if it doesn’t already exist in the specified volume.

The container has readiness and liveness probes configured to check its status. The readiness probe waits 30 seconds before checking if the container is ready, while the liveness probe checks every 10 seconds. Both probes use an HTTP GET request to the root path of the container.
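
As a sketch, probes with those characteristics would appear in the llama-cpp container spec roughly as shown below; the exact paths and timings in mistral-deploy-ready.yaml are authoritative.

# Illustrative excerpt from the llama-cpp container spec
readinessProbe:
  httpGet:
    path: /                   # probe the web root served by llama-server
    port: 8080
  initialDelaySeconds: 30     # wait 30 seconds before the first readiness check
livenessProbe:
  httpGet:
    path: /
    port: 8080
  periodSeconds: 10           # re-check liveness every 10 seconds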

  • Apply the service and the route:
oc apply  -f llama-svc.yaml
service/llama-service created

The above YAML file defines a Kubernetes Service named “llama-service” with the type “ClusterIP”. It listens on port 8080 using TCP protocol and forwards traffic to pods selected by the service, which are those labeled as “app: llama-cpp-server”.

oc apply -f llama-route.yaml
route.route.openshift.io/llama-cpp created

The above YAML file defines a Route resource in OpenShift, which is a way to expose a Service to the internet. The Route has the following characteristics:

  1. It’s named “llama-cpp”.
  2. It belongs to the application “llama-service”.
  3. It points to the Service named “llama-service”.
  4. It doesn’t use TLS (insecure).
  5. It uses port 80 (or the port specified by “targetPort” in the Service) and forwards traffic to the “llama-cpp-server” target.
  • You can verify the readiness of the runtime by checking the pod status and its progress, as shown below.
oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
llama-cpp-server-664bddbbcc-9fmf8   0/1     Init:0/1  0          1m3s
  • Repeat the oc get pods command until you see READY 1/1. It takes about 6 to 8 minutes to complete.
  • Use this time to read more about the llama.cpp runtime we are using: https://en.wikipedia.org/wiki/Llama.cpp
oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
llama-cpp-server-664bddbbcc-9fmf8   1/1     Running   0          5m43s

3.5 Build and deploy llama.cpp with LLM and OpenShift

If you already deployed the llama.cpp container by following Sections 3.1 to 3.3 or Section 3.4, skip this section and go directly to Section 4 to access the llama.cpp application.

The steps below guide you to build and deploy the llama.cpp application within OpenShift.

  • Using the PuTTY console you opened in step 2.2, install git to clone the project that has the assets to help us through the lab. Note that some lab environments might already have git installed.
sudo dnf -y install git
Updating Subscription Management repositories.
Last metadata expiration check: 3:30:55 ago on Wed 13 Nov 2024 01:51:13 AM EST.
Dependencies resolved.
================================================================================
 Package        Arch    Version         Repository                         Size
================================================================================
Installing:
 git            ppc64le 2.43.5-1.el9_4  rhel-9-for-ppc64le-appstream-rpms  54 k
Installing dependencies:
 emacs-filesystem
                noarch  1:27.2-10.el9_4 rhel-9-for-ppc64le-appstream-rpms 9.3 k
 git-core       ppc64le 2.43.5-1.el9_4  rhel-9-for-ppc64le-appstream-rpms 4.8 M
 git-core-doc   noarch  2.43.5-1.el9_4  rhel-9-for-ppc64le-appstream-rpms 2.9 M
 perl-DynaLoader
                ppc64le 1.47-481.el9    rhel-9-for-ppc64le-appstream-rpms  26 k
 perl-Error     noarch  1:0.17029-7.el9 rhel-9-for-ppc64le-appstream-rpms  46 k
 perl-File-Find noarch  1.37-481.el9    rhel-9-for-ppc64le-appstream-rpms  26 k
 perl-Git       noarch  2.43.5-1.el9_4  rhel-9-for-ppc64le-appstream-rpms  39 k
 perl-TermReadKey
                ppc64le 2.38-11.el9     rhel-9-for-ppc64le-appstream-rpms  41 k

Transaction Summary
================================================================================
Install  9 Packages

Total download size: 8.0 M
Installed size: 41 M
Downloading Packages:
(1/9): perl-DynaLoader-1.47-481.el9.ppc64le.rpm 257 kB/s |  26 kB     00:00
(2/9): perl-Error-0.17029-7.el9.noarch.rpm      414 kB/s |  46 kB     00:00
(3/9): perl-TermReadKey-2.38-11.el9.ppc64le.rpm 336 kB/s |  41 kB     00:00
(4/9): perl-File-Find-1.37-481.el9.noarch.rpm   487 kB/s |  26 kB     00:00
(5/9): git-2.43.5-1.el9_4.ppc64le.rpm           867 kB/s |  54 kB     00:00
(6/9): perl-Git-2.43.5-1.el9_4.noarch.rpm       591 kB/s |  39 kB     00:00
(7/9): git-core-doc-2.43.5-1.el9_4.noarch.rpm    26 MB/s | 2.9 MB     00:00
(8/9): git-core-2.43.5-1.el9_4.ppc64le.rpm       28 MB/s | 4.8 MB     00:00
(9/9): emacs-filesystem-27.2-10.el9_4.noarch.rp 177 kB/s | 9.3 kB     00:00
--------------------------------------------------------------------------------
Total                                            27 MB/s | 8.0 MB     00:00
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                        1/1
  Installing       : git-core-2.43.5-1.el9_4.ppc64le                        1/9
  Installing       : git-core-doc-2.43.5-1.el9_4.noarch                     2/9
  Installing       : emacs-filesystem-1:27.2-10.el9_4.noarch                3/9
  Installing       : perl-File-Find-1.37-481.el9.noarch                     4/9
  Installing       : perl-DynaLoader-1.47-481.el9.ppc64le                   5/9
  Installing       : perl-TermReadKey-2.38-11.el9.ppc64le                   6/9
  Installing       : perl-Error-1:0.17029-7.el9.noarch                      7/9
  Installing       : git-2.43.5-1.el9_4.ppc64le                             8/9
  Installing       : perl-Git-2.43.5-1.el9_4.noarch                         9/9
  Running scriptlet: perl-Git-2.43.5-1.el9_4.noarch                         9/9
  Verifying        : perl-Error-1:0.17029-7.el9.noarch                      1/9
  Verifying        : perl-TermReadKey-2.38-11.el9.ppc64le                   2/9
  Verifying        : perl-DynaLoader-1.47-481.el9.ppc64le                   3/9
  Verifying        : perl-File-Find-1.37-481.el9.noarch                     4/9
  Verifying        : git-2.43.5-1.el9_4.ppc64le                             5/9
  Verifying        : git-core-2.43.5-1.el9_4.ppc64le                        6/9
  Verifying        : git-core-doc-2.43.5-1.el9_4.noarch                     7/9
  Verifying        : perl-Git-2.43.5-1.el9_4.noarch                         8/9
  Verifying        : emacs-filesystem-1:27.2-10.el9_4.noarch                9/9
Installed products updated.

Installed:
  emacs-filesystem-1:27.2-10.el9_4.noarch   git-2.43.5-1.el9_4.ppc64le
  git-core-2.43.5-1.el9_4.ppc64le           git-core-doc-2.43.5-1.el9_4.noarch
  perl-DynaLoader-1.47-481.el9.ppc64le      perl-Error-1:0.17029-7.el9.noarch
  perl-File-Find-1.37-481.el9.noarch        perl-Git-2.43.5-1.el9_4.noarch
  perl-TermReadKey-2.38-11.el9.ppc64le

Complete!
git clone https://github.com/DanielCasali/mma-ai.git
Cloning into 'mma-ai'...
remote: Enumerating objects: 186, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 186 (delta 2), reused 16 (delta 2), pack-reused 170 (from 1)
Receiving objects: 100% (186/186), 159.22 MiB | 51.42 MiB/s, done.
Resolving deltas: 100% (81/81), done.
Updating files: 100% (58/58), done.
  • Enter the mma-ai/llama-runtime directory:
cd mma-ai/llama-runtime/
  • Create the ai project where we will run the container:
oc new-project ai
Now using project "ai" on server "https://api.p1325.cecc.ihost.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=registry.k8s.io/e2e-test-images/agnhost:2.43 -- /agnhost serve-hostname
  • Apply the manifest that builds the runtime container within OpenShift and deploys it, pulling the Mistral model from Hugging Face:
oc apply -f ./build-deploy-mistral.yaml
deployment.apps/mma-ai created
buildconfig.build.openshift.io/mma-ai created
imagestream.image.openshift.io/mma-ai created
service/mma-ai created
service/llama-service created
route.route.openshift.io/mma-ai created

The above yaml file describes a Kubernetes Deployment named “mma-ai” with the following characteristics:

  1. It runs one replica of the container.
  2. The container uses the OpenShift image registry.
  3. The container uses the mistral-7b-instruct-v0.3.Q4_K_M.gguf LLM and exposes port 8080.
  4. The container has a health check with an HTTP GET request to the root path (“/”).
  5. The application also includes a BuildConfig for Docker strategy named “mma-ai” that uses the source code from a Git repository and builds the Docker image tagged as “mma-ai:latest”.
  6. There is an ImageStream named “mma-ai” with local lookup enabled, so the Deployment can reference the image by its ImageStream tag (a sketch of the build-related objects follows this list).
  7. The application has two services, one internal named “mma-ai” and another named “llama-service,” both of which expose port 8080 internally.
  8. There is a Route named “mma-ai” that exposes the internal service externally with HTTPS termination and insecureEdgeTerminationPolicy set to “Redirect”.
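
For orientation, the build-related objects in that file follow roughly the shape sketched below. The contextDir value is an assumption about where the Dockerfile lives in the repository; build-deploy-mistral.yaml remains the authoritative source.

apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: mma-ai
spec:
  lookupPolicy:
    local: true                 # lets the Deployment reference "mma-ai:latest" directly
---
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: mma-ai
spec:
  source:
    type: Git
    git:
      uri: https://github.com/DanielCasali/mma-ai.git   # the repository cloned earlier in the lab
    contextDir: llama-runtime                           # assumed location of the Dockerfile
  strategy:
    type: Docker
    dockerStrategy: {}
  output:
    to:
      kind: ImageStreamTag
      name: mma-ai:latest       # the image the Deployment consumes
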
  • You can verify the readiness of the runtime by checking the pod status and its progress, as shown below.
oc get pods
NAME                      READY   STATUS     RESTARTS   AGE
mma-ai-1-build            1/1     Running    0          15s
mma-ai-699d775754-wlgp4   0/1     Init:0/1   0          15s

  • Repeat the oc get pods command until you see READY 1/1. It takes about 4 to 8 minutes to complete.
  • Use this time to read more about the llama.cpp runtime we are using: https://en.wikipedia.org/wiki/Llama.cpp
watch oc get pods
Every 2.0s: oc get pods                  p1280-bastion: Wed Nov 13 06:09:26 2024

NAME                      READY   STATUS      RESTARTS   AGE
mma-ai-1-build            0/1     Completed   0          4m41s
mma-ai-699d775754-wlgp4   1/1     Running     0          4m41s
  • Press Ctrl+C to exit the watch command if you used it to monitor the pods building.

4 Opening the Inference runtime UI

Now that we have deployed the inference runtime, we can open and work with it from the OpenShift graphical user interface.

  • From the GUI click on the “Administrator” drop down:
  • Toggle to the Developer view by clicking “Developer”:
  • Click skip tour:
  • Click on the AI project:
  • Click “Topology”:
  • Click the button to open the llama-cpp-server UI:
  • For a better experience click “New UI”:
  • Test inferencing by entering a query in the “Say Something” box:

Remember, this AI lab has no access to the system clock or to the Internet, so it does not know today’s date, the time of day, or what the weather is like. If you ask about such things, it will generate hallucinations, which can be amusing to read.

One way to experiment with the prompt is to ask the model which languages it can translate sentences into, and then ask it to translate a sentence.

  • You can experiment with prompt engineering by asking the model the same question from different viewpoints. For example, tell it in the System prompt to pretend it’s an Italian chef and ask it what the best dish in the world is, as shown below. Then tell it to pretend it’s a French chef and ask the same question. If you don’t see the System prompt anymore in the GUI, just refresh the URL in the browser.

Other example questions:

Prompt

You are an Italian Chef

Question

What is the best dish in the world?

Prompt

You are a British Chef
What is the best dish in the world?

Prompt

You are a computer architect
Why should I run AI workloads close to data source?
Why should I run AI workloads on IBM Power10?
What is IBM Power MMA technology?
What is Vector Scalar Extension (VSX)?

5 Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) is an approach for using generative AI with additional or confidential data without spending money on expensive GPUs to re-train or fine-tune a model. RAG allows a pre-trained model to leverage new or updated data provided to it on the fly, generate the answer primarily from that data, and tell us where the source for the answer was found.

In the following sections, you will deploy a vector database and then use a Python application that queries it to perform Retrieval Augmented Generation.

5.1 Deploying a Vector Database

We will use the Milvus vector database as the foundation for this part of the lab.

  • Change back to the parent mma-ai directory, which contains the milvus directory:
cd ../
  • Apply all yaml files in the directory:
oc apply -f milvus/
deployment.apps/etcd-deployment created
service/etcd-service created
deployment.apps/milvus-deployment created
service/milvus-service created
configmap/milvus-config created
persistentvolumeclaim/minio-pvc created
deployment.apps/minio-deployment created
service/minio-service created
route.route.openshift.io/minio-console created

This creates several more elements in your ai project, which you can see in the Topology view.
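
As an example of the kind of objects just created, a persistent volume claim such as minio-pvc typically looks similar to this sketch; the storage size and access mode shown are assumptions, and the files under milvus/ are authoritative.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio-pvc
spec:
  accessModes:
  - ReadWriteOnce              # a single node mounts the MinIO data volume
  resources:
    requests:
      storage: 10Gi            # illustrative capacity for MinIO object storage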

  • Next, apply the streamlit definition:
oc apply -f streamlit
pod/streamlit created
route.route.openshift.io/streamlit created
service/streamlit created

This creates the Streamlit application, which also appears in the Topology view.

  • You may then click on the Streamlit route endpoint to open its interface.
  • The web GUI for the RAG application opens and loads the RAG model and supporting data (this takes about 3 minutes while the vector database is built); after that you can interact with the prompt.

5.2 Querying the RAG model

At this point, the AI model is ready to be queried and has been loaded with additional data that it was not trained on and had no prior knowledge of. Specifically, it was loaded with a PDF story, “The Forgotten Lighthouse”.

  • One interesting question to ask the model is who the starfish is and how it knows. To answer that question, the model has to “read” the book (i.e., the PDF it was given as additional input data) and infer from the information in it. In the story, Grandpa writes a letter to Sarah and calls her “my little starfish”, so the AI model should answer that the starfish is Sarah and that it knows this from the letter Grandpa writes to her.
Who is the starfish and how do you know it?

Other questions that you may ask the model about the Forgotten Lighthouse book are available below

https://raw.githubusercontent.com/DanielCasali/mma-ai/main/datasource/The_Forgotten_Lighthouse_Question.pdf.

The AI model has not been trained on any of that book’s information, so every answer it provides uses RAG and demonstrates how a pre-trained AI model can be extended to inference on data it was never trained on. In practice, this means you can run a pre-trained AI model on your own data without ever sending that data off premises to train the model, and you don’t need expensive GPU accelerators to achieve that!

Next Steps

  • Please let me know how you get on with this tutorial
  • Contact me for help deploying OpenShift and AI on IBM Power.

Clean up

  • Remove Git:
sudo dnf remove git
  • Remove local repository:
cd; rm -rdf mma-ai
  • Remove project:
oc delete project ai

Credit

Daniel de Souza Casali

Jason Liu

Rodrigo Ceron

