NCSA.ai Docs

We're API-first; that's where all our effort goes.

👉 Try the full website (in open beta): https://ncsa.ai 🛠️ Or check out the code on GitHub.

API

Try it now for free during beta:

Service is OFFLINE due to a GPU shortage (as of May 2024). It will be back when we can recruit enough users and scale to make persistent services viable.

# instant access from the command line 🥹🥹
curl https://api.ncsa.ai/llm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "mistralai/Mistral-7B-Instruct-v0.2",
     "messages": [{"role": "user", "content": "Write a small bash program."}],
     "temperature": 0.5,
     "stream": true
   }'

For a nicely formatted markdown response (non-streaming), pipe this command into jq and use glow to render the markdown:

# brew install jq glow --OR-- apt-get install jq glow
curl -s https://api.ncsa.ai/llm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "mistralai/Mistral-7B-Instruct-v0.2",
     "messages": [{"role": "user", "content": "Write a few example bash programs so I can learn bash."}],
     "stream": false
   }' | jq -r '.choices[].message.content' | glow -

Model Status

Users can check the status of their requested models by visiting here.

  • If the model status is Deploying, your requests will be served once the deployment is complete. If the model status is Running, your requests will be served immediately.

  • Please note that some large models, such as meta-llama/Meta-Llama-3-70B-Instruct, require at least 5 minutes of deployment time, or potentially longer.
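A client can wait out the Deploying phase by polling before it sends real traffic. The following is a minimal sketch, not part of the NCSA.ai API: `fetch_status` is a hypothetical callable that you supply (for example, one that reads the status page or the `/models` endpoint), and the timeout and polling interval are illustrative defaults. Only the `Deploying` and `Running` status strings come from this page.

```python
import time
from typing import Callable


def wait_until_running(
    fetch_status: Callable[[str], str],  # hypothetical: returns "Deploying", "Running", ...
    model: str,
    timeout_s: float = 600.0,            # assumed default; large models may need longer
    poll_interval_s: float = 10.0,       # assumed default
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the model reports "Running"; return False if the timeout elapses."""
    waited = 0.0
    while True:
        if fetch_status(model) == "Running":
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_interval_s)
        waited += poll_interval_s
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.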

These are the best open-source LLMs available (as of April 5, 2024):

  1. NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO - the best fine-tune of Mixtral. Great instruction following, better than raw Mixtral-instruct.

  2. databricks/dbrx-instruct - Large and capable. Runs on 4x A100-80GB GPUs. Read their blog for details.

  3. teknium/OpenHermes-2.5-Mistral-7B - Small but mighty. The best fine-tune of Mistral. Best value for money, and fast to cold-start.

  4. meta-llama/Meta-Llama-3-70B-Instruct - Large and capable.

Some large models require several minutes of deployment time; see Model Status.

Supported models

We use Ray Serve + VLLM to provide an OpenAI-compatible API.

For a list of supported models, see here. Note, this is an incomplete list! Many models not explicitly listed here will work as long as they use a supported LLM architecture. E.g. fine-tunes of Mistral/Mixtral/Llama are supported. We support exactly the models supported by VLLM.

Usage Guide

🐍 Python

I leverage the existing OpenAI Python package, making my version a drop-in replacement for any OpenAI calls. Say goodbye to huge OpenAI bills!💰

Use the Chat Completions format only; we do not support the legacy Completions format. Refer to the OpenAI docs.

from openai import OpenAI  # pip install "openai>=1.0" (quote it so the shell does not treat > as a redirect)

# Point requests to our NCSA LLM server instead of openai! 
client = OpenAI(
    api_key="irrelevant", # any non-empty string
    base_url = "https://api.ncsa.ai/llm/v1" ## 👈 ONLY CODE CHANGE ##
)

# view supported models here: https://docs.vllm.ai/en/latest/models/supported_models.html
completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Write a short bash program."}],
    stream=True) 

for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")

🌐 Postman

Copy and paste this URI into Postman:

https://api.ncsa.ai/llm/v1/chat/completions

POST body (type: raw)
{
  "model": "mistralai/Mistral-7B-Instruct-v0.2",
  "messages": [{"role": "user", "content": "Write a short bash program."}],
  "temperature": 0.7,
  "stream": false
}
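The same POST can be issued without Postman or the OpenAI package. Below is a sketch using only the Python standard library; `build_chat_request` and `send` are illustrative names, and the response is assumed to follow the `choices[].message.content` shape used in the jq example earlier on this page.

```python
import json
import urllib.request

API_URL = "https://api.ncsa.ai/llm/v1/chat/completions"  # same URL as the curl examples


def build_chat_request(prompt: str,
                       model: str = "mistralai/Mistral-7B-Instruct-v0.2",
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build (but do not send) a non-streaming chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "stream": False,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

Usage would look like `print(send(build_chat_request("Write a short bash program.")))`. Building and sending are kept separate so the request can be inspected without network access.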

Vision and Long-term Plan

Presented at the Joint Laboratory for Extreme Scale Computing (JLESC) supercomputing conference, here's a fast-paced intro to why we're building NCSA.ai. It's a critical piece of infrastructure for supercomputing labs like NCSA, Argonne, Sandia, and our global collaborators.

Why NCSA.ai?

Today, AI research in academia is 95% training and 5% inference. Within the next 3 years this ratio will flip to become 95% inference, even inside academia, as we use synthetic data, agents, etc.

Today, NCSA's supercomputers are perfectly designed for LLM training. But they're horribly designed for LLM inference.

Benefits of shared LLM inference:

  1. 100% uptime: Researchers can build real applications on top of this infrastructure because it will always be available, unlike separate Slurm jobs, which are not a realistic option for production applications.

  2. Cost Efficiency: Many organizations or users have less than 50% GPU utilization, resulting in expensive hourly or contractual rental fees. Serverless platforms enable dynamic scaling of GPU resources, allowing users to pay only for what they use, significantly reducing average monthly expenses.

  3. Model Support (Multiple Frameworks): Users require support for various model frameworks, such as ONNX or PyTorch, depending on their organization's needs. An ideal platform should support all major frameworks, avoiding user friction caused by forced conversions or limitations.

  4. Minimal Cold Start Latency & Inference Time: Low cold start latency and low inference time are critical aspects for optimal user experiences, except in batch processing or non-production environments. An ideal platform should offer consistently low cold start latency across all calls or loads.

  5. Effortless Scalable Infrastructure (0→1→n) and (n→0): Configuring and scaling GPU infrastructure can be a complex and time-consuming process. An ideal platform should be able to automate scaling, requiring minimal user input beyond setting limits or billing parameters.

  6. Comprehensive Logging & Visible Metrics: Users need detailed logs of API calls for analyzing loads, scaling, success vs. failure rates, and general analytics. An ideal platform should offer options for exporting or connecting users' observability stacks.

Autoscaling

Motivation: There are many AI models and few GPUs. We cannot, and should not, predict what models users will want. Therefore, let them choose and we will autoscale to deploy their models on-demand. It's "serverless" LLM inference as a service.

Technology: To support ~all the AI models in the world, we have a hierarchy of support quality. The best model of the day will be kept in GPUs always for ultra-low-latency responses. Then we use a variety of LLM serving libraries (VLLM > TGI > Pipeline > custom code) to support the long tail of models.

Usage: When you request inference, if the model is not already "hot" on the GPUs we will have to download it from Huggingface Hub (unless it's already cached on our local storage), and then load it into GPUs. This is slow (1-10 minutes depending on model size).

  1. S-tier: the current SOTA LLM will be kept in GPU memory 100% of the time. This enables ultra-low-latency inference and gives users an incentive to share the same model, lowering costs for everyone.

  2. A-tier: very popular models, specifically those supported by VLLM, will be fast but may take ~1 minute to load into GPUs.

  3. C-tier: support for ~50% of the models on Huggingface Hub, via Pipeline() and AutoModel().

  4. D-tier: Support 100% of the models on Huggingface Hub by allowing users to submit arbitrary code for custom load() and inference() functions.

  5. F-tier: Existing solutions where each scientist battles brittle SLURM scripts independently.
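The tier hierarchy above can be read as a fallback chain: use the best serving path that can handle the requested model. The sketch below illustrates that reading only; it is not the actual NCSA.ai routing code. `ModelInfo`, its fields, and `serving_tier` are hypothetical names, and the sketch ignores popularity, which the A-tier description also mentions.

```python
from dataclasses import dataclass


@dataclass
class ModelInfo:
    name: str
    vllm_supported: bool      # assumption: architecture can be served by VLLM
    pipeline_supported: bool  # assumption: loadable via Pipeline() / AutoModel()


def serving_tier(model: ModelInfo, sota_model: str) -> str:
    """Return the highest-quality tier that can serve the model."""
    if model.name == sota_model:
        return "S"  # kept in GPU memory at all times
    if model.vllm_supported:
        return "A"  # fast once loaded into GPUs
    if model.pipeline_supported:
        return "C"  # served through Pipeline() / AutoModel()
    return "D"      # needs user-supplied load() and inference() functions


tier = serving_tier(
    ModelInfo("teknium/OpenHermes-2.5-Mistral-7B", vllm_supported=True, pipeline_supported=True),
    sota_model="meta-llama/Meta-Llama-3-70B-Instruct",
)
print(tier)  # prints: A
```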

Billing

Payment is in ACCESS credits, making the service accessible to the UIUC community.

Cost: The limiting resource is GPU memory, but using VLLM as the inference engine yields dramatic efficiency gains because many users can run inference on the same model at the same time with minimal overhead. Therefore, the more people use a model, the cheaper it becomes for everyone.

$$\text{Cost} = \frac{\text{GPU memory-seconds}}{\text{number of simultaneous users of the model}}$$
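As a worked instance of the cost formula, the sketch below divides a model's GPU memory-seconds evenly among its concurrent users. The function name, the GB-second unit, and the example numbers (80 GB held for 60 seconds) are assumptions for illustration; actual billing rates are not specified on this page.

```python
def cost_per_user(gpu_memory_seconds: float, simultaneous_users: int) -> float:
    """Split a model's GPU memory-seconds evenly across its simultaneous users."""
    if simultaneous_users < 1:
        raise ValueError("simultaneous_users must be at least 1")
    return gpu_memory_seconds / simultaneous_users


# Assumed example: a model occupying 80 GB of GPU memory for 60 seconds.
usage = 80 * 60  # 4800 GB-seconds

print(cost_per_user(usage, 1))  # prints: 4800.0
print(cost_per_user(usage, 8))  # prints: 600.0
```

With eight simultaneous users, each pays one eighth of what a single user would.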

Common problems & workarounds

No GPUs available

If you send a request and the model isn't available, we'll try to auto-scale and load it onto GPU(s). But if our supercomputer cluster is full, this will fail and the model will not load. We're working on returning a special error code for this.

Workaround: Check the /models endpoint to see if any models are currently "hot" and already loaded on GPUs. You'll get near-instant responses from any models listed here.

# list models that are "hot" on GPUs
curl https://api.ncsa.ai/llm/models

# example output:
{
  "hot_models": [
    {
      "model_name": "databricks/dbrx-instruct",
      "model_type": "VLLM_OPENAI",
      "status": "Running",
      "priority": 0,
      "route_prefix": "/model-1",
      "gpus_per_replica": 4
    }
  ],
  "cold_models": []
}
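Before sending a request, a client can check this response for models that are already hot. The sketch below assumes the response always has the shape shown in the example output above; `hot_model_names` and `is_hot` are illustrative helper names, and `EXAMPLE_RESPONSE` is a trimmed copy of that output.

```python
# Trimmed copy of the example /models output, used here as offline test data.
EXAMPLE_RESPONSE = {
    "hot_models": [
        {
            "model_name": "databricks/dbrx-instruct",
            "model_type": "VLLM_OPENAI",
            "status": "Running",
            "gpus_per_replica": 4,
        }
    ],
    "cold_models": [],
}


def hot_model_names(payload: dict) -> list[str]:
    """Return names of models listed as hot whose status is "Running"."""
    return [
        m["model_name"]
        for m in payload.get("hot_models", [])
        if m.get("status") == "Running"
    ]


def is_hot(payload: dict, model: str) -> bool:
    """True if the given model should answer with near-instant responses."""
    return model in hot_model_names(payload)


print(hot_model_names(EXAMPLE_RESPONSE))
# prints: ['databricks/dbrx-instruct']
```

If your preferred model is not hot and the cluster is full, falling back to any name returned by `hot_model_names` avoids waiting on a deployment that may fail.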

For any other problems, please shoot me an email. I'm happy to help: [email protected]
