Fireworks DeepSeek-V4-Flash (Global)

DeepSeek-V4-Flash is a streamlined open-source Mixture-of-Experts model optimized for fast, cost-efficient inference while preserving strong reasoning and coding performance at a 1M-token context length. It uses the same hybrid attention design as the Pro variant but is tuned for lower latency and higher throughput in real-time applications. Given a sufficient compute budget, its reasoning quality approaches that of Pro, which makes it well suited to interactive agents and high-volume production workloads.

Provider: Fireworks AI

API Endpoint

https://api.fireworks.ai/inference/v1/chat/completions
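The endpoint can also be called without a client library. The sketch below uses only the Python standard library and assumes the usual OpenAI-style conventions: a JSON body with `model` and `messages`, and an `Authorization: Bearer` header. The helper names `build_chat_request` and `send` are illustrative and not part of any Fireworks SDK.

```python
import json
import urllib.request

ENDPOINT = "https://api.fireworks.ai/inference/v1/chat/completions"
MODEL = "accounts/fireworks/models/deepseek-v4-flash"


def build_chat_request(api_key: str, prompt: str, max_tokens: int = 1024) -> urllib.request.Request:
    """Build (but do not send) a POST request for a single-turn chat completion."""
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumption: bearer-token auth, as in other OpenAI-compatible APIs.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=60) as resp:
        payload = json.load(resp)
    # Assumption: the response follows the OpenAI chat-completions shape.
    return payload["choices"][0]["message"]["content"]
```

Calling `send(build_chat_request(api_key, "Hello, how are you?"))` performs the network request. Keeping construction and sending separate makes the payload easy to inspect or log before anything leaves the machine.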

Quick Start (Python)

Install: pip install openai

from openai import OpenAI

client = OpenAI(
    api_key="your-fireworks-api-key",
    base_url="https://api.fireworks.ai/inference/v1",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v4-flash",
    messages=[
        {"role": "user", "content": "Hello, how are you?"},
    ],
    max_tokens=1024,
    temperature=0.7,
)

print(response.choices[0].message.content)

Additional examples: Basic invoke, Streaming
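For streaming, a minimal sketch follows. It assumes the `client` object from the Quick Start and the chunk shape used by the OpenAI Python client (`chunk.choices[0].delta.content`). The helpers `iter_text` and `stream_chat` are hypothetical names introduced here, and the demo at the end runs offline against hand-made chunks rather than the real API.

```python
from types import SimpleNamespace  # only used for the offline demo below

MODEL = "accounts/fireworks/models/deepseek-v4-flash"


def iter_text(stream):
    """Yield the text fragments carried by OpenAI-style streaming chunks."""
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks may carry no choices at all
        piece = chunk.choices[0].delta.content
        if piece:  # role-only or final chunks have no content
            yield piece


def stream_chat(client, prompt, max_tokens=1024):
    """Request a streamed completion and return an iterator of text fragments."""
    stream = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,  # ask the server to send chunks as they are generated
    )
    return iter_text(stream)


def _fake_chunk(text):
    """Build a stand-in object shaped like one streamed chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


# Offline demo: two text chunks surrounded by chunks that carry no text.
demo = [_fake_chunk(None), _fake_chunk("Hel"), _fake_chunk("lo"), SimpleNamespace(choices=[])]
print("".join(iter_text(demo)))  # prints "Hello"
```

With a real client, the fragments can be printed as they arrive: `for piece in stream_chat(client, "Hello, how are you?"): print(piece, end="", flush=True)`.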

Supported Parameters

Parameter	Type	Description
max_tokens	integer	Maximum tokens to generate. (≥1)
temperature	float	Controls randomness. (0–2) Default: 0.7.
top_p	float	Nucleus sampling threshold. (0–1) Default: 1.
stream	boolean	Stream response chunks as they are generated. Default: false.
stop	string or array of strings	Stop sequence or array of stop sequences.
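The ranges in the table can be checked on the client side before a request is sent. The helper below is a sketch under that assumption: `build_sampling_params` is a hypothetical name, its defaults mirror the table, and the server remains the authority on what it accepts.

```python
def build_sampling_params(max_tokens, temperature=0.7, top_p=1.0, stream=False, stop=None):
    """Validate values against the documented ranges and return request kwargs."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_tokens, bool) or not isinstance(max_tokens, int) or max_tokens < 1:
        raise ValueError("max_tokens must be an integer >= 1")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if not 0 <= top_p <= 1:
        raise ValueError("top_p must be between 0 and 1")

    params = {
        "max_tokens": max_tokens,
        "temperature": float(temperature),
        "top_p": float(top_p),
        "stream": bool(stream),
    }
    if stop is not None:
        # stop may be a single string or a sequence of strings.
        if isinstance(stop, str):
            params["stop"] = stop
        else:
            stop = list(stop)
            if not all(isinstance(s, str) for s in stop):
                raise ValueError("stop must be a string or a sequence of strings")
            params["stop"] = stop
    return params
```

The returned dictionary can be unpacked into the Quick Start call, for example `client.chat.completions.create(model=..., messages=..., **build_sampling_params(256, stop=["\n\n"]))`.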

Feature Guides

Serverless Inference

Pay per token for public open models without managing GPU deployments.


OpenAI Compatibility

Use OpenAI-compatible client libraries by changing the API base URL.
