Docs
Pricing
Hosting
Products
GPU Cloud
Clusters
Enterprise
Use Cases
AI/ML Frameworks
AI Text Generation
AI Image & Video Generation
Batch Data Processing
Audio-to-Text
AI Fine Tuning
Virtual Computing
GPU Programming
3D Rendering
Company
FAQ
About
Careers
Blog
Contact Sales
Console
Blog
Hybrid AI Inference: Local LiteLLM Proxy with Remote Vast.ai GPU (July 31, 2025)
Modular MAX vs vLLM Performance Comparison on Vast.ai (July 30, 2025)
Announcing the Vast.ai Vulnerability Bounty Program (July 24, 2025)
Using LLM-Compressor to Quantize Qwen3-8B on Vast.ai (Part 2) (July 23, 2025)
Model Compression with LLM-Compressor and Deployment on Vast.ai (Part 1) (July 22, 2025)
How to Benchmark an LLM with vLLM in 10 Minutes (July 21, 2025)
Open-Source LLMs You Can Train and Deploy on Vast.ai Right Now (July 20, 2025)
© 2025 Vast.ai. All rights reserved.