This content originally appeared on DEV Community and was authored by Harish Kotra (he/him)
Gaia nodes provide streaming capabilities similar to OpenAI’s APIs. By default, when you request a completion from a Gaia node, the entire completion is generated before being sent back in a single response.
If you’re generating long completions, waiting for the response can take many seconds. To get responses sooner, you can ‘stream’ the completion as it’s being generated. This allows you to start printing or processing the beginning of the completion before the full completion is finished.
To stream completions, set stream=True when calling the chat completions endpoint. This returns an object that streams the response back as data-only server-sent events. Extract each chunk's text from the delta field rather than the message field.
Prerequisites
import time
from openai import OpenAI
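The examples below also assume an OpenAI client pointed at your Gaia node's OpenAI-compatible endpoint. A minimal setup might look like the following; the base URL and API key are placeholders, so substitute your own node's values:

# point the OpenAI client at your Gaia node's OpenAI-compatible endpoint
# (the URL below is a placeholder; replace it with your node's address)
client = OpenAI(
    base_url="https://YOUR_NODE_ID.gaia.domains/v1",
    api_key="gaia"  # placeholder; use a real key if your node requires one
)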
1. What a typical chat completion response looks like
With a typical ChatCompletions API call, the response is first computed and then returned all at once.
# record the time before the request is sent
start_time = time.time()

# send a ChatCompletion request to count to 100
response = client.chat.completions.create(
    model='llama',
    messages=[
        {'role': 'user', 'content': 'Count to 100, with a comma between each number and no newlines. E.g., 1, 2, 3, ...'}
    ],
    temperature=0
)
The reply can be extracted with response.choices[0].message, and the text of the reply with response.choices[0].message.content.
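For example, you can print how long the full response took and then the reply text:

# calculate and print how long the full (non-streaming) response took
print(f"Full response received {time.time() - start_time:.2f} seconds after request")

# extract and print the reply text
print(response.choices[0].message.content)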
2. How to stream a chat completion
With a streaming API call, the response is sent back incrementally in chunks via an event stream. In Python, you can iterate over these events with a for loop.
response = client.chat.completions.create(
    model='llama',
    messages=[
        {'role': 'user', 'content': "What's 1+1? Answer in one word."}
    ],
    temperature=0,
    stream=True  # this time, we set stream=True
)

for chunk in response:
    print(chunk)
    print(chunk.choices[0].delta.content)
    print("****************")
As you can see above, streaming responses have a delta field rather than a message field. The delta can contain:
- A role token (e.g., {"role": "assistant"})
- A content token (e.g., {"content": "text"})
- Nothing, when the stream is over (a minimal guard for this case is sketched below)
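In practice, that means checking for None before using a chunk's content. A minimal sketch, assuming stream is the object returned by a fresh stream=True call (the stream above has already been consumed by the loop):

# print only the text pieces, skipping the role chunk and the empty final chunk
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content is not None:
        print(delta.content, end="", flush=True)
print()  # newline once the stream is finished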
3. How much time is saved by streaming a chat completion
Let’s look at how quickly we receive content with streaming:
# record the time before the request is sent
start_time = time.time()

# send a ChatCompletion request to count to 100, with streaming enabled
response = client.chat.completions.create(
    model='llama',
    messages=[
        {'role': 'user', 'content': 'Count to 100, with a comma between each number and no newlines. E.g., 1, 2, 3, ...'}
    ],
    temperature=0,
    stream=True
)

collected_chunks = []    # store each raw chunk
collected_messages = []  # store each delta's content

for chunk in response:
    chunk_time = time.time() - start_time  # time elapsed since the request was sent
    collected_chunks.append(chunk)
    chunk_message = chunk.choices[0].delta.content  # may be None for the role chunk and the final chunk
    collected_messages.append(chunk_message)
    print(f"Message received {chunk_time:.2f} seconds after request: {chunk_message}")

# drop the None entries, join the pieces, and report the total time
collected_messages = [m for m in collected_messages if m is not None]
full_reply = ''.join(collected_messages)
print(f"Full response received {time.time() - start_time:.2f} seconds after request")
print(f"Full reply: {full_reply}")
With streaming:
- First token arrives quickly (often <0.5s)
- Subsequent tokens arrive every ~0.01-0.02s
- User sees partial responses immediately
Without streaming:
- Must wait for full response (often several seconds)
- No intermediate feedback
Choose streaming when you want to:
- Show partial results immediately
- Provide responsive user experience
- Handle long responses gracefully (see the helper sketch after this list)
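If you stream in several places, it can help to wrap the pattern in a small helper that yields only the text pieces as they arrive. A sketch; the function name and defaults are illustrative, not part of the Gaia or OpenAI APIs:

def stream_reply(client, messages, model='llama', **kwargs):
    """Yield the text pieces of a streaming chat completion as they arrive."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
        **kwargs,
    )
    for chunk in response:
        content = chunk.choices[0].delta.content
        if content is not None:
            yield content

# usage: print the reply as it streams in
for piece in stream_reply(client, [{'role': 'user', 'content': 'Count to 10.'}]):
    print(piece, end='', flush=True)
print()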
Credits
Inspired by this example.