NCSA.ai Docs

We're API-first; that's where all our effort goes.

👉 Try the full website (in open beta): https://ncsa.ai 🛠️ Or check out the code on GitHub.

API

  • OpenAI-Compatible API. We even support a few extra parameters.

  • We support ~all of the best open LLMs; see supported models below.

  • Hosted on NCSA supercomputers with autoscaling to load any requested models. The default 'stay-hot' time is 15 minutes, after which your model will be removed from the GPUs.

  • We're in open beta; see common problems and workarounds below.

  • Please send bugs and feature requests to me, Kastan, at [email protected], or open an issue on our GitHub repository.

Try it now for free during beta:

Service is OFFLINE due to a GPU shortage (as of May 2024). It will be back when we can recruit enough users and scale to make persistent services viable.

# instant access from the command line 🥹🥹
curl https://api.ncsa.ai/llm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "mistralai/Mistral-7B-Instruct-v0.2",
     "messages": [{"role": "user", "content": "Write a small bash program."}],
     "temperature": 0.5,
     "stream": true
   }'

For a nicely formatted markdown response (non-streaming), pipe this command into jq and use glow to render the markdown:

# brew install jq glow --OR-- apt-get install jq glow
curl -s https://api.ncsa.ai/llm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "mistralai/Mistral-7B-Instruct-v0.2",
     "messages": [{"role": "user", "content": "Write a few example bash programs so I can learn bash."}],
     "stream": false
   }' | jq -r '.choices[].message.content' | glow -

Model Status

  • Users can check the status of their requested models on the Model Status page.

  • If the model status is Deploying, your requests will be served once the deployment is complete. If the model status is Running, your requests will be served immediately (a polling sketch is shown below).

  • Please note that some large models, such as meta-llama/Meta-Llama-3-70B-Instruct, require at least 5 minutes of deployment time, or potentially longer.
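
If you want to wait programmatically, here is a minimal sketch (not an official client) that polls the /models endpoint described under "No GPUs available" below and returns once your model reports Running. It assumes the model's deployment has already been triggered (e.g. by a prior request) or that the model is already hot; the wait_until_running helper and the requests dependency are my own choices.

import time
import requests  # pip install requests

def wait_until_running(model_name: str, timeout_s: int = 900) -> bool:
    """Poll https://api.ncsa.ai/llm/models until model_name is listed as Running."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get("https://api.ncsa.ai/llm/models", timeout=30)
        resp.raise_for_status()
        for m in resp.json().get("hot_models", []):
            if m["model_name"] == model_name and m["status"] == "Running":
                return True
        time.sleep(15)  # large models can take 5+ minutes to deploy
    return False

# Example: block until Llama-3-70B is hot, then send requests as usual.
if wait_until_running("meta-llama/Meta-Llama-3-70B-Instruct"):
    print("Model is Running -- requests will be served immediately.")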

Recommended Models

These are the best open-source LLMs available (as of April 5, 2024):

  1. NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO - the best fine-tune of Mixtral. Great instruction following, better than raw Mixtral-instruct.

  2. teknium/OpenHermes-2.5-Mistral-7B - Small but mighty. The best fine-tune of Mistral. Best value for money, and fast to cold-start.

  3. meta-llama/Meta-Llama-3-70B-Instruct - Large and capable.

  4. databricks/dbrx-instruct - Large and capable. Runs on 4x A100-80GB GPUs. Read their blog for details.

Supported models

We use Ray Serve + VLLM to provide an OpenAI-compatible API.

For a list of supported models, see the VLLM supported models page (https://docs.vllm.ai/en/latest/models/supported_models.html). Note that this is an incomplete list! Many models not explicitly listed there will work as long as they use a supported LLM architecture, e.g. fine-tunes of Mistral/Mixtral/Llama are supported. We support exactly the models supported by VLLM.

Usage Guide

🐍 Python

I leverage the existing OpenAI Python package, making this a drop-in replacement for any OpenAI calls. Say goodbye to huge OpenAI bills! 💰

Use the ChatCompletions format only; we do not support Completions. Refer to the OpenAI docs.

from openai import OpenAI # pip install openai>=1.0

# Point requests to our NCSA LLM server instead of openai! 
client = OpenAI(
    api_key="irrelevant", # any non-empty string
    base_url = "https://api.ncsa.ai/llm/v1" ## 👈 ONLY CODE CHANGE ##
)

# view supported models here: https://docs.vllm.ai/en/latest/models/supported_models.html
completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Write a short bash program."}],
    stream=True) 

for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")

Todo: update this example to use ChatCompletions instead of old-style completions.

import openai # pip install openai<1.0

## 👉 ONLY CODE CHANGE 👈 ##
# Point requests to our NCSA LLM server instead of openai! 
openai.api_base = "https://api.ncsa.ai/v1"

# Enter any non-empty model name & API key to pass openai's client library check.
openai.api_key = "irrelevant"

# view available models with: curl https://api.ncsa.ai/llm/models
model = "irrelevant"

prompt = "Write a flask api backend starter kit. Include a full login system, and allow users to create API keys. Require these API keys for all api endpoints. Write a starter get endpoint."
stream=True
completion = openai.Completion.create(
    model=model,
    prompt=prompt,
    max_tokens=600,
    echo=True, # include prompt in the response
    stream=stream) 

if stream: 
  # ⚡️⚡️ streaming 
  for token in completion:
    print(token.choices[0].text, end='')
else:
  # 🐌 no streaming, but handy for bulk jobs
  print(completion.choices[0].text)

🌐 Postman

Use the ChatCompletions format only; we do not support Completions. Refer to the OpenAI docs.

Copy and paste this URI into Postman

https://api.ncsa.ai/llm/v1/chat/completions?

POST body (type: raw)
{
  "model": "mistralai/Mistral-7B-Instruct-v0.2",
  "messages": [{"role": "user", "content": "Write a short bash program."}],
  "temperature": 0.7,
  "stream": false
}

Vision and Long-term Plan

Here's a fast-paced intro to why we're building NCSA.ai, "My plan for LLM serving at NCSA", presented at the Joint Laboratory for Extreme Scale Computing (JLESC) supercomputing conference. It's a critical piece of infrastructure for supercomputing labs like NCSA, Argonne, Sandia, and our global collaborators.

Read the full presentation here: NCSA-ai (at JLESC 2024).pdf (8 MB).

Why NCSA.ai?

Today, AI research in academia is 95% training and 5% inference. Within the next 3 years this ratio will flip to become 95% inference, even inside academia, as we use synthetic data, agents, etc.

Today, NCSA's supercomputers are perfectly designed for LLM training. But they're horribly designed for LLM inference.

Benefits of shared LLM inference:

  1. 100% uptime: Researchers can build real applications on top of this infrastructure because it will always be available, unlike separate Slurm jobs, which are not a realistic option for production applications.

  2. Cost Efficiency: Many organizations or users have less than 50% GPU utilization, resulting in expensive hourly or contractual rental fees. Serverless platforms enable dynamic scaling of GPU resources, allowing users to pay only for what they use, significantly reducing average monthly expenses.

  3. Model Support (Multiple Frameworks): Users require support for various model frameworks, such as ONNX or PyTorch, depending on their organization's needs. An ideal platform should support all major frameworks, avoiding user friction caused by forced conversions or limitations.

  4. Minimal Cold Start Latency & Inference Time: Low cold start latency and low inference time are critical aspects for optimal user experiences, except in batch processing or non-production environments. An ideal platform should offer consistently low cold start latency across all calls or loads.

  5. Effortless Scalable Infrastructure (0→1→n) and (n→0): Configuring and scaling GPU infrastructure can be a complex and time-consuming process. An ideal platform should be able to automate scaling, requiring minimal user input beyond setting limits or billing parameters.

  6. Comprehensive Logging & Visible Metrics: Users need detailed logs of API calls for analyzing loads, scaling, success vs. failure rates, and general analytics. An ideal platform should offer options for exporting or connecting users' observability stacks.

Autoscaling

Motivation: There are many AI models and few GPUs. We cannot, and should not, predict what models users will want. Therefore, let them choose and we will autoscale to deploy their models on-demand. It's "serverless" LLM inference as a service.

Technology: To support ~all the AI models in the world, we have a hierarchy of support quality. The best model of the day will be kept on GPUs at all times for ultra-low-latency responses. Then we use a variety of LLM serving libraries (VLLM > TGI > Pipeline > custom code) to support the long tail of models.

Usage: When you request inference, if the model is not already "hot" on the GPUs, we will have to download it from Huggingface Hub (unless it's already cached on our local storage) and then load it onto the GPUs. This is slow (1-10 minutes depending on model size); a cold-start-tolerant request sketch is shown after the tier list below.

  1. S-tier: the current SOTA LLM will be kept in GPU memory 100% of the time, enabling ultra-low-latency inference and incentivizing users to all use the same model, thus lowering costs for everyone.

  2. A-tier: very popular models, specifically those supported by VLLM, will be fast, but may have ~1 minute of loading time to move the model into GPUs.

  3. C-tier: support for ~50% of the models on Huggingface Hub, via Pipeline() and AutoModel().

  4. D-tier: support for 100% of the models on Huggingface Hub by allowing users to submit arbitrary code for custom load() and inference() functions.

  5. F-tier: existing solutions, where each scientist battles brittle SLURM scripts independently.
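
As referenced above, here is a minimal sketch of a cold-start-tolerant request, using the same OpenAI Python client as in the Usage Guide. The timeout and retry values are illustrative choices rather than official guidance; the endpoint and model name come from the examples above.

from openai import OpenAI  # pip install openai>=1.0

client = OpenAI(
    api_key="irrelevant",                   # any non-empty string
    base_url="https://api.ncsa.ai/llm/v1",
    timeout=900,                            # allow up to 15 minutes for a cold start
    max_retries=2,                          # retry transient connection errors
)

# If the model is cold, this call simply takes longer while autoscaling loads it.
completion = client.chat.completions.create(
    model="teknium/OpenHermes-2.5-Mistral-7B",  # small models cold-start fastest
    messages=[{"role": "user", "content": "Say hello."}],
)
print(completion.choices[0].message.content)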

Billing

Payment is in ACCESS credits, making it accessible to the UIUC community.

Cost: The limiting resource is GPU memory, but using VLLM as the inference engine yields dramatic efficiency gains because many users can run inference on the same model at the same time with minimal overhead. Therefore, the more people use a model, the cheaper it becomes for everyone.

$$\text{Cost} = \frac{\text{GPU memory-seconds}}{\text{number of simultaneous users of the model}}$$
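
As a hypothetical worked example (all numbers are made up for illustration, not real prices): suppose a model occupies 4x A100-80GB = 320 GB of GPU memory for 60 seconds while 8 users are running inference on it simultaneously. Then

$$\text{Cost per user} = \frac{320\ \text{GB} \times 60\ \text{s}}{8} = 2400\ \text{GB-seconds}$$

whereas a single user running the same model alone would bear the full 19,200 GB-seconds.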

Common problems & workarounds

No GPUs available

If you send a request and the model isn't available, we'll try to auto-scale and load it onto GPU(s). But if our supercomputer cluster is full, this will fail and the model will not load. We're working on returning a special error code for this.

Workaround: Check the /models endpoint to see if any models are currently "hot" and already loaded on GPUs. You'll get near-instant responses from any models listed here.

# list models that are "hot" on GPUs
curl https://api.ncsa.ai/llm/models

# example output:
{
  "hot_models": [
    {
      "model_name": "databricks/dbrx-instruct",
      "model_type": "VLLM_OPENAI",
      "status": "Running",
      "priority": 0,
      "route_prefix": "/model-1",
      "gpus_per_replica": 4
    }
  ],
  "cold_models": []
}
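
Building on this workaround, here is a minimal Python sketch (not an official client) that picks whichever model is already Running and sends a prompt to it. The pick_hot_model helper and the requests dependency are my own; the JSON field names follow the example output above.

from typing import Optional

import requests  # pip install requests
from openai import OpenAI  # pip install openai>=1.0

def pick_hot_model() -> Optional[str]:
    """Return the name of a model that is already Running on GPUs, if any."""
    resp = requests.get("https://api.ncsa.ai/llm/models", timeout=30)
    resp.raise_for_status()
    for m in resp.json().get("hot_models", []):
        if m["status"] == "Running":
            return m["model_name"]
    return None

model = pick_hot_model()
if model is None:
    print("No models are hot right now; a request will trigger a (slow) cold start.")
else:
    client = OpenAI(api_key="irrelevant", base_url="https://api.ncsa.ai/llm/v1")
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Write a short bash program."}],
    )
    print(completion.choices[0].message.content)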

For any other problems, please shoot me an email. I'm happy to help: [email protected]
