# NCSA.ai Docs

We're API-first; that's where all our effort goes.

👉 Try the full website (in open beta): <https://ncsa.ai>\
🛠️ Or check out [the code on GitHub](https://github.com/KastanDay/llm-server).

## API

* **OpenAI-Compatible API**. We even support [a few extra parameters](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#api-reference).
* **We support \~all of the best open LLMs**, see [supported models](#supported-models) below.
* **Hosted on NCSA supercomputers** with autoscaling to load any requested models. The default 'stay-hot' time is 15 minutes, after which your model will be removed from the GPUs.
* **We're in open beta**, see [common problems and workarounds](#common-problems-and-workarounds) below.
* Please send bugs and feature requests to me, Kastan at <kvday2@illinois.edu>, or open an issue on our [GitHub repository](https://github.com/UIUC-Chatbot/llm-serving).
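
Beyond the standard OpenAI fields, vLLM's extra parameters (such as `top_k`) can simply be added to the JSON request body. Here's a minimal sketch of building such a body with only the Python standard library; the values are illustrative, and `top_k` is a vLLM extension, not part of the core OpenAI spec:

```python
import json

# Build a ChatCompletions request body by hand. "top_k" is a vLLM-specific
# extra parameter; it is not part of the core OpenAI API.
payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "Write a small bash program."}],
    "temperature": 0.5,
    "top_k": 40,   # vLLM extension (illustrative value)
    "stream": False,
}
body = json.dumps(payload)
print(body)
```

You can send `body` with any HTTP client (for example, the `curl` commands below) to `https://api.ncsa.ai/llm/v1/chat/completions`.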

Try it now for free during beta:

{% hint style="danger" %} <mark style="color:red;">**Service is OFFLINE**</mark> due to a GPU shortage (as of May 2024). We'll be back when we can recruit enough users / scale to make persistent services viable.
{% endhint %}

```bash
# instant access from the command line 🥹🥹
curl https://api.ncsa.ai/llm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "mistralai/Mistral-7B-Instruct-v0.2",
     "messages": [{"role": "user", "content": "Write a small bash program."}],
     "temperature": 0.5,
     "stream": true
   }'
```

For a **nicely formatted markdown response (non-streaming)**, pipe this command into [`jq`](https://jqlang.github.io/jq/) and use `glow` to render the markdown:

```bash
# brew install jq glow --OR-- apt-get install jq glow
curl -s https://api.ncsa.ai/llm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "mistralai/Mistral-7B-Instruct-v0.2",
     "messages": [{"role": "user", "content": "Write a few example bash programs so I can learn bash."}],
     "stream": false
   }' | jq -r '.choices[].message.content' | glow -
```

## Model Status

Users can **check the status** of their requested models at <https://api.ncsa.ai/llm/models>.

* If the model status is `Running`, your requests will be served immediately. If the status is `Deploying`, your requests will be served once the deployment completes.
* Please note that some large models, such as `meta-llama/Meta-Llama-3-70B-Instruct`, require at least 5 minutes of deployment time, potentially longer.
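
If you're scripting against the status endpoint, a small helper can decide whether a model is ready. This is only a sketch; it assumes the JSON shape shown in the workarounds section at the bottom of this page (`hot_models` entries with `model_name` and `status` fields):

```python
def model_status(models_payload: dict, model_name: str) -> str:
    """Return the status ("Running", "Deploying", ...) of model_name,
    or "NotLoaded" if it isn't among the hot models."""
    for m in models_payload.get("hot_models", []):
        if m.get("model_name") == model_name:
            return m.get("status", "Unknown")
    return "NotLoaded"

# Example payload in the shape returned by https://api.ncsa.ai/llm/models
sample = {
    "hot_models": [
        {"model_name": "databricks/dbrx-instruct", "status": "Running"}
    ],
    "cold_models": [],
}
print(model_status(sample, "databricks/dbrx-instruct"))  # prints: Running
```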

## Recommended Models

These are the best open-source LLMs available (as of April 5, 2024):

1. `NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO` - the best fine-tune of Mixtral. Great instruction following, better than raw Mixtral-instruct.
2. `databricks/dbrx-instruct` - Large and capable. Runs on 4x A100-80GB GPUs. Read [their blog for details](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm).
3. `teknium/OpenHermes-2.5-Mistral-7B` - Small but mighty. The best fine-tune of Mistral. Best value for money, and fast to cold-start.
4. `meta-llama/Meta-Llama-3-70B-Instruct` - Large and capable.

Some large models require minutes of deployment time; see [Model Status](#model-status).

## Supported Models

We use Ray Serve + VLLM to provide an [OpenAI-compatible API](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html).

For a [list of supported models, see here](https://docs.vllm.ai/en/latest/models/supported_models.html). Note that this is an incomplete list! Many models not explicitly listed **will work as long as they use a supported LLM *architecture***; e.g., fine-tunes of Mistral/Mixtral/Llama are supported. We support exactly the models supported by VLLM.

## Usage Guide

### 🐍 Python

**We leverage the existing [OpenAI Python package](https://github.com/openai/openai-python), making our API a *drop-in replacement* for any OpenAI calls. Say goodbye to huge OpenAI bills! 💰**

{% tabs %}
{% tab title="Openai>=1.0" %}
{% hint style="warning" %}
Use the [`ChatCompletions`](https://platform.openai.com/docs/api-reference/chat/create) format only; we do not support legacy completions. Refer to [their docs](https://platform.openai.com/docs/api-reference/chat/create).
{% endhint %}

```python
from openai import OpenAI # pip install openai>=1.0

# Point requests to our NCSA LLM server instead of openai! 
client = OpenAI(
    api_key="irrelevant", # any non-empty string
    base_url = "https://api.ncsa.ai/llm/v1" ## 👈 ONLY CODE CHANGE ##
)

# view supported models here: https://docs.vllm.ai/en/latest/models/supported_models.html
completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Write a short bash program."}],
    stream=True) 

for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")
```

{% endtab %}

{% tab title="Openai<1.0" %}
{% hint style="warning" %}
Use the [`ChatCompletions`](https://platform.openai.com/docs/api-reference/chat/create) format only; we do not support legacy completions. Refer to [their docs](https://platform.openai.com/docs/api-reference/chat/create).
{% endhint %}

Todo: update this example to use ChatCompletions instead of old-style completions.

```python
import openai # pip install openai<1.0

## 👉 ONLY CODE CHANGE 👈 ##
# Point requests to our NCSA LLM server instead of openai! 
openai.api_base = "https://api.ncsa.ai/llm/v1"

# Enter any non-empty model name & API key to pass openai's client library check.
openai.api_key = "irrelevant"

# view available models with: curl https://api.ncsa.ai/llm/models
model = "irrelevant"

prompt = "Write a flask api backend starter kit. Include a full login system, and allow users to create API keys. Require these API keys for all api endpoints. Write a starter get endpoint."
stream=True
completion = openai.Completion.create(
    model=model,
    prompt=prompt,
    max_tokens=600,
    echo=True, # include prompt in the response
    stream=stream) 

if stream: 
  # ⚡️⚡️ streaming 
  for token in completion:
    print(token.choices[0].text, end='')
else:
  # 🐌 no streaming, but handy for bulk jobs
  print(completion.choices[0].text)
```

{% endtab %}
{% endtabs %}

### 🌐 Postman

<figure><img src="https://861107230-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCNxdGM2teUDf0XSxudfZ%2Fuploads%2FLUEp9IJdQ9nlHihoma8O%2FLLM%20Serving%20OpenAI%20Clone%20(1).png?alt=media&#x26;token=67f689fd-f698-4df5-b424-522dbe630713" alt=""><figcaption></figcaption></figure>

**Copy and paste this URI into Postman**

```
https://api.ncsa.ai/llm/v1/chat/completions

POST body (type: raw, JSON)
{
  "model": "mistralai/Mistral-7B-Instruct-v0.2",
  "messages": [{"role": "user", "content": "Write a short bash program."}],
  "temperature": 0.7,
  "stream": false
}
```

<figure><img src="https://861107230-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCNxdGM2teUDf0XSxudfZ%2Fuploads%2FgD6BCS84Xd5zUnLVC8TI%2FLLM%20Serving%20OpenAI%20Clone.png?alt=media&#x26;token=d6c82604-d71f-4c05-b550-95d2ae2f5373" alt=""><figcaption><p>Postman access to NCSA.ai.</p></figcaption></figure>

## Vision and Long-term Plan

Presented at the **Joint Laboratory for Extreme Scale Computing (JLESC)** supercomputing conference, here's a fast-paced intro to why we're building NCSA.ai. It's a critical piece of infrastructure for supercomputing labs like NCSA, Argonne, and Sandia, and for our global collaborators.

<div><figure><img src="https://861107230-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCNxdGM2teUDf0XSxudfZ%2Fuploads%2FpkjVF55Bmkhb7pbQ0jCB%2FSlide5.png?alt=media&#x26;token=bcc6eea4-f75a-46b6-bb03-ff580f385b75" alt=""><figcaption></figcaption></figure> <figure><img src="https://861107230-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCNxdGM2teUDf0XSxudfZ%2Fuploads%2F0zd92GClHz3WqJoSGvWF%2FSlide6.png?alt=media&#x26;token=cb7de10a-ceb0-4e15-a9ed-b07c7719de1e" alt=""><figcaption></figcaption></figure> <figure><img src="https://861107230-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCNxdGM2teUDf0XSxudfZ%2Fuploads%2FOuM01z3jf9JD7o1SgwxN%2FSlide7.png?alt=media&#x26;token=9fd2a26b-410e-4123-a34c-79e1bf3e3e21" alt=""><figcaption></figcaption></figure> <figure><img src="https://861107230-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCNxdGM2teUDf0XSxudfZ%2Fuploads%2Fe5FpVtCTejDEZriVAFKB%2FSlide8.png?alt=media&#x26;token=ea9bfeab-c7d5-4826-96a0-faade0a8bd0f" alt=""><figcaption></figcaption></figure></div>

<div><figure><img src="https://861107230-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCNxdGM2teUDf0XSxudfZ%2Fuploads%2FN1QUHDAPA895nh1WVRye%2FSlide9.png?alt=media&#x26;token=45dc45d7-8b35-43be-8389-a27729e6cc68" alt=""><figcaption></figcaption></figure> <figure><img src="https://861107230-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCNxdGM2teUDf0XSxudfZ%2Fuploads%2FfwYZeNE3w4N1uxTiMGSb%2FSlide10.png?alt=media&#x26;token=a86c043a-0ffe-470e-812a-95a026cccf53" alt=""><figcaption></figcaption></figure> <figure><img src="https://861107230-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCNxdGM2teUDf0XSxudfZ%2Fuploads%2FafH106qnkGSphq3Sr65T%2FSlide11.png?alt=media&#x26;token=358c39b3-7827-44c9-8453-c8d430201dd1" alt=""><figcaption></figcaption></figure> <figure><img src="https://861107230-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCNxdGM2teUDf0XSxudfZ%2Fuploads%2F6ZNKPJtBsFSSbFUCWGTR%2FSlide12.png?alt=media&#x26;token=ce76998f-2772-4bc6-b8ce-7f5144029a8d" alt=""><figcaption></figcaption></figure> <figure><img src="https://861107230-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCNxdGM2teUDf0XSxudfZ%2Fuploads%2FnRHxGseWTRGyQmt2zwd7%2FSlide13.png?alt=media&#x26;token=e862d4c9-b48d-4b72-9434-e2cd9370fcf8" alt=""><figcaption></figcaption></figure></div>

{% file src="<https://861107230-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCNxdGM2teUDf0XSxudfZ%2Fuploads%2FfRNZETU74DmXfoOfiYdz%2FNCSA-ai%20(at%20JLESC%202024).pdf?alt=media&token=d8bb2598-2052-47db-af31-cfef7e3a78c4>" %}
Read the full presentation here.
{% endfile %}

### Why NCSA.ai?

Today, AI research in academia is 95% training and 5% inference. **Within the next 3 years, this ratio will flip** to 95% inference, even inside academia, as we rely on synthetic data, agents, and more.

Today, NCSA's supercomputers are perfectly designed for LLM training. But they're horribly suited to LLM inference.

**Benefits of shared LLM inference:**

1. **100% uptime:** Researchers can build real applications on top of this infrastructure because it is always available, unlike separate Slurm jobs, which are not a realistic option for production applications.
2. **Cost Efficiency**: Many organizations or users have less than 50% GPU utilization, resulting in expensive hourly or contractual rental fees. Serverless platforms enable dynamic scaling of GPU resources, allowing users to pay only for what they use, significantly reducing average monthly expenses.
3. **Model Support (Multiple Frameworks)**: Users require support for various model frameworks, such as ONNX or PyTorch, depending on their organization's needs. An ideal platform should support all major frameworks, avoiding user friction caused by forced conversions or limitations.
4. **Minimal Cold Start Latency & Inference Time**: Low cold start latency and low inference time are critical aspects for optimal user experiences, except in batch processing or non-production environments. An ideal platform should offer consistently low cold start latency across all calls or loads.
5. **Effortless Scalable Infrastructure (0→1→n) and (n→0)**: Configuring and scaling GPU infrastructure can be a complex and time-consuming process. An ideal platform should be able to automate scaling, requiring minimal user input beyond setting limits or billing parameters.
6. **Comprehensive Logging & Visible Metrics**: Users need detailed logs of API calls for analyzing loads, scaling, success vs. failure rates, and general analytics. An ideal platform should offer options for exporting or connecting users' observability stacks.

### Autoscaling

**Motivation:** There are many AI models and few GPUs. We cannot, and should not, predict which models users will want. Therefore, let them choose and we will autoscale to deploy their models on-demand. It's "serverless" LLM inference as a service.

**Technology:** To support \~all the AI models in the world, we have a hierarchy of support quality. The best model of the day will be kept in GPUs always for ultra-low-latency responses. Then we use a variety of LLM serving libraries (VLLM > TGI > Pipeline > custom code) to support the long tail of models.

**Usage:** When you request inference, if the model is not already "hot" on the GPUs, we have to download it from Huggingface Hub (unless it's already cached on our local storage) and then load it into GPUs. This is slow (1-10 minutes, depending on model size).
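
Because cold starts can take minutes, clients should poll with increasing delays rather than retry in a tight loop. Here's a minimal backoff sketch; the base, factor, and cap values are arbitrary choices, not anything the service mandates:

```python
import itertools

def backoff_delays(base: float = 5.0, factor: float = 2.0, cap: float = 120.0):
    """Yield an exponentially growing sequence of wait times (seconds),
    capped so we never wait more than `cap` between polls."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor

# While a model is cold-starting (1-10 minutes), a client could poll the
# status endpoint with these delays instead of hammering the API.
delays = list(itertools.islice(backoff_delays(), 6))
print(delays)  # [5.0, 10.0, 20.0, 40.0, 80.0, 120.0]
```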

<figure><img src="https://861107230-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCNxdGM2teUDf0XSxudfZ%2Fuploads%2F1yA8Pe1luzf8C2hv2lDu%2FLLM%20Serving%20OpenAI%20Clone%20(2).png?alt=media&#x26;token=21843205-8482-44c2-b600-4cd53c5ef578" alt=""><figcaption><p>My plan for LLM serving at NCSA</p></figcaption></figure>

1. S-tier: the current SOTA LLM will be kept in GPU memory 100% of the time. This enables ultra-low-latency inference and incentivizes users to converge on the same model, lowering costs for everyone.
2. A-tier: very popular models, specifically those [supported by VLLM](https://docs.vllm.ai/en/latest/models/supported_models.html), will be fast but may take \~1 minute to load into GPUs.
3. C-tier: support for \~50% of the models on Huggingface Hub, via `Pipeline()` and `AutoModel()`.
4. D-tier: Support 100% of the models on Huggingface Hub by allowing users to submit arbitrary code for custom `load()` and `inference()` functions.
5. F-tier: Existing solutions where each scientist battles brittle SLURM scripts independently.

### Billing

Payment is in [ACCESS](https://access-ci.org/) credits, making the service accessible to the UIUC community.

**Cost:** The limiting resource is GPU memory, but using VLLM as the inference engine yields dramatic efficiency gains because many users can run inference on the same model at the same time with minimal overhead. Therefore, the more people use a model, the cheaper it becomes for everyone.

$$
\text{Cost} = \frac{\text{GPU memory-seconds}}{\text{number of simultaneous users of the model}}
$$
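
As a sanity check, the pricing formula above maps directly to a tiny function (the unit is whatever the billing system measures GPU memory-seconds in; the numbers below are illustrative):

```python
def cost(gpu_memory_seconds: float, simultaneous_users: int) -> float:
    """Per-user cost under the shared-inference pricing model:
    GPU memory-seconds divided by the number of simultaneous users."""
    if simultaneous_users < 1:
        raise ValueError("need at least one user")
    return gpu_memory_seconds / simultaneous_users

# The same GPU time gets 4x cheaper per user when 4 users share the model.
print(cost(1000.0, 1))  # 1000.0
print(cost(1000.0, 4))  # 250.0
```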

## Common problems & workarounds

### No GPUs available

If you send a request and the model isn't available, we'll try to auto-scale and load it onto GPU(s). But if our supercomputer cluster is full, this will fail and the model will not load. We're working on returning a special error code for this.

**Workaround:** Check the /models endpoint to see if any models are currently "hot" and already loaded on GPUs. You'll get near-instant responses from any models listed here.

```bash
# list models that are "hot" on GPUs
curl https://api.ncsa.ai/llm/models

# example output:
{
  "hot_models": [
    {
      "model_name": "databricks/dbrx-instruct",
      "model_type": "VLLM_OPENAI",
      "status": "Running",
      "priority": 0,
      "route_prefix": "/model-1",
      "gpus_per_replica": 4
    }
  ],
  "cold_models": []
}
```

For any other problems, please shoot me an email. I'm happy to help: <kvday2@illinois.edu>
