Model ID: doubao-embedding-large-text-250515
Introduction
We introduce Seed1.5-Embedding, a powerful embedding model built on top of our pretrained LLM (Seed1.5). It stands out with the following key features:
Generalist: Excels in general-purpose embedding tasks, achieving state-of-the-art performance on the MTEB benchmark in both Chinese and English.
Reasoning Expertise: Demonstrates strong capability in complex query understanding and reasoning, delivering state-of-the-art results on the BRIGHT benchmark.
Flexibility: Supports multiple embedding dimensions (2048, 1024, 512, 256) with minimal performance degradation at lower dimensions.
import os
import torch
# pip install --upgrade "volcengine-python-sdk[ark]"
from volcenginesdkarkruntime import Ark
from typing import Optional, List


def encode(
    client, inputs: List[str], is_query: bool = False, mrl_dim: Optional[int] = None
):
    if is_query:
        # use instruction for optimal performance, feel free to tune this instruction for different tasks
        # to reproduce MTEB results, refer to https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/seed_models.py for detailed instructions per task
        inputs = [
            f"Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: {x}"
            for x in inputs
        ]
    resp = client.embeddings.create(
        model="doubao-embedding-large-text-250515",
        input=inputs,
        encoding_format="float",
    )
    embedding = torch.tensor([d.embedding for d in resp.data], dtype=torch.bfloat16)
    if mrl_dim is not None:
        # keep only the leading mrl_dim components of each embedding
        assert mrl_dim in [2048, 1024, 512, 256]
        embedding = embedding[:, :mrl_dim]
    # normalize to compute cosine sim
    embedding = torch.nn.functional.normalize(embedding, dim=1, p=2).float().numpy()
    return embedding


# gets API Key from environment variable ARK_API_KEY
client = Ark(
    api_key=os.getenv("ARK_API_KEY"),
)

print("----- embeddings -----")
inputs = ["The sky is blue and the grass is green"]
embedding = encode(client, inputs, is_query=False, mrl_dim=1024)
print(embedding)
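The sample above truncates each embedding to the requested dimension and L2-normalizes it so that cosine similarity can be computed. The sketch below illustrates what that enables: once vectors have unit length, cosine similarity is just a dot product, so ranking documents against a query takes only a few lines. It runs offline on made-up 4-dimensional vectors instead of real API output. The helper names `truncate_and_normalize` and `rank_by_cosine` are hypothetical and not part of the Ark SDK.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components, then rescale to unit L2 norm."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def rank_by_cosine(
    query: Sequence[float], docs: Sequence[Sequence[float]]
) -> List[int]:
    """Return document indices sorted from most to least similar.

    Inputs are assumed to be unit-length, so the dot product
    equals the cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query, doc)) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings (values are made up).
query_vec = truncate_and_normalize([0.9, 0.1, 0.3, 0.7], dim=2)
doc_vecs = [
    truncate_and_normalize([0.1, 0.9, 0.5, 0.2], dim=2),
    truncate_and_normalize([0.8, 0.2, 0.1, 0.9], dim=2),
]
ranking = rank_by_cosine(query_vec, doc_vecs)
print(ranking)
```

With real data, the query vector would presumably come from `encode(..., is_query=True)` and the document vectors from `encode(..., is_query=False)`, all with the same `mrl_dim`, since vectors of different lengths cannot be compared.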