Seed1.5-Embedding

ByteDance Seed

Introduction

We introduce Seed1.5-Embedding, a powerful embedding model built on top of our pretrained LLM (Seed1.5). It stands out with the following key features:

  • Generalist: Excels at general-purpose embedding tasks, achieving state-of-the-art performance on the MTEB benchmark in both Chinese and English.
  • Reasoning Expertise: Demonstrates strong capability in complex query understanding and reasoning, delivering state-of-the-art results on the BRIGHT benchmark.
  • Flexibility: Supports multiple embedding dimensions (2048, 1024, 512, 256) with minimal performance degradation at lower dimensions; see the sketch right after this list.
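
The lower dimensions are obtained by truncating the full 2048-dimensional vector and re-normalizing (this is what the mrl_dim argument does in the Usage snippet below). A minimal sketch of that step; the tensor and function names here are illustrative:

import torch

# stand-in for a (batch, 2048) matrix of full-size embeddings returned by the API
full = torch.randn(4, 2048)

def truncate(emb: torch.Tensor, dim: int) -> torch.Tensor:
    # keep the first `dim` components, then re-normalize so cosine
    # similarity is still a plain dot product
    assert dim in (2048, 1024, 512, 256)
    return torch.nn.functional.normalize(emb[:, :dim], dim=1, p=2)

short = truncate(full, 256)  # shape (4, 256), unit-norm rows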

Performance

MTEB_v2 (English)

| Model | AVG | Classification | Clustering | Pair Classification | Reranking | Retrieval | STS | Summarization |
|---|---|---|---|---|---|---|---|---|
| Seed1.5-Embedding | 74.76 | 89.88 | 60.83 | 87.39 | 50.67 | 67.45 | 87.23 | 36.44 |
| CHAIN19 | 73.97 | 89.78 | 58.85 | 88.67 | 49.11 | 66.21 | 86.12 | 38.32 |
| gemini-embedding-exp-03-07 | 73.30 | 90.05 | 59.39 | 87.70 | 48.59 | 64.35 | 85.29 | 38.28 |
| jasper-en-vision-language-v1 | 71.41 | 90.27 | 60.52 | 88.14 | 50.00 | 56.05 | 84.37 | 37.19 |
| NV-Embed-v2 | 69.81 | 87.19 | 47.66 | 88.69 | 49.61 | 62.84 | 83.82 | 35.21 |
| Linq-Embed-Mistral | 69.80 | 83.00 | 54.07 | 88.44 | 49.44 | 60.14 | 84.69 | 37.26 |

C-MTEB (Chinese)

| Model | AVG | Classification | Clustering | Pair Classification | Reranking | Retrieval | STS |
|---|---|---|---|---|---|---|---|
| Seed1.5-Embedding | 74.87 | 79.37 | 71.11 | 89.57 | 70.14 | 79.33 | 66.56 |
| Conan-embedding-v2 | 74.24 | 76.47 | 68.84 | 92.44 | 74.41 | 78.31 | 65.48 |
| Conan-embedding-v1 | 72.50 | 76.77 | 66.33 | 91.66 | 72.76 | 76.67 | 63.67 |
| xiaobu-embedding-v2 | 72.36 | 76.53 | 65.17 | 91.87 | 72.58 | 76.50 | 64.18 |
| gte-Qwen2-7B-instruct | 71.62 | 75.77 | 66.06 | 87.48 | 68.92 | 75.71 | 65.20 |
| bge-multilingual-gemma2 | 67.64 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 |

BRIGHT

| Model | AVG | Bio. | Earth. | Econ. | Psy. | Rob. | Stack. | Sus. | Leet. | Pony | AoPS | TheoQ. | TheoT. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Seed1.5-Embedding | 27.2 | 34.8 | 46.9 | 23.4 | 31.6 | 19.1 | 25.4 | 21.0 | 43.2 | 4.9 | 12.2 | 33.3 | 30.5 |
| ReasonIR-8B | 24.4 | 26.2 | 31.4 | 23.3 | 30.0 | 18.0 | 23.9 | 20.5 | 35.0 | 10.5 | 14.7 | 31.9 | 27.2 |
| gte-Qwen2-7B | 22.5 | 30.6 | 36.4 | 17.8 | 24.6 | 13.2 | 22.2 | 14.8 | 25.5 | 9.9 | 14.4 | 27.8 | 32.9 |
| GritLM | 21.0 | 24.8 | 32.3 | 18.9 | 19.8 | 17.1 | 13.6 | 17.8 | 29.9 | 22.0 | 8.8 | 25.2 | 21.2 |
| text-embedding-004 | 20.0 | 22.7 | 34.8 | 19.6 | 27.8 | 15.7 | 20.1 | 17.1 | 29.6 | 3.6 | 9.3 | 23.8 | 15.9 |
| SFR-Embedding-Mistral | 18.3 | 19.1 | 26.7 | 17.8 | 19.0 | 16.3 | 14.4 | 19.2 | 27.4 | 2.0 | 7.4 | 24.3 | 26.0 |

Usage

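The snippet below calls the model through the Volcengine Ark SDK (model ID doubao-embedding-large-text-250515). Queries are prefixed with a retrieval instruction, documents are passed as-is, and the returned vectors can optionally be truncated to a smaller dimension before L2 normalization.
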
import os
from typing import List, Optional

import torch

# pip install --upgrade "volcengine-python-sdk[ark]"
from volcenginesdkarkruntime import Ark

def encode(
    client, inputs: List[str], is_query: bool = False, mrl_dim: Optional[int] = None
):
    if is_query:
        # prepend an instruction to queries for best performance; feel free to tune it for different tasks
        # to reproduce the MTEB results, see https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/seed_models.py for the per-task instructions
        inputs = [
            f"Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: {x}"
            for x in inputs
        ]
    resp = client.embeddings.create(
        model="doubao-embedding-large-text-250515",
        input=inputs,
        encoding_format="float",
    )
    embedding = torch.tensor([d.embedding for d in resp.data], dtype=torch.bfloat16)
    if mrl_dim is not None:
        assert mrl_dim in [2048, 1024, 512, 256]
        embedding = embedding[:, :mrl_dim]
    # L2-normalize so cosine similarity can be computed as a dot product
    embedding = torch.nn.functional.normalize(embedding, dim=1, p=2).float().numpy()
    return embedding


# reads the API key from the ARK_API_KEY environment variable
client = Ark(
    api_key=os.getenv("ARK_API_KEY"),
)

print("----- embeddings -----")
inputs = ["The sky is blue and the grass is green"]
embedding = encode(client, inputs, is_query=False, mrl_dim=1024)
print(embedding)
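
For retrieval, encode queries with is_query=True and score them against document embeddings by cosine similarity, which reduces to a dot product because encode returns unit-norm vectors. A small sketch reusing the client and encode defined above; the example texts are illustrative:

print("----- query-document similarity -----")
query_emb = encode(client, ["what color is the sky"], is_query=True, mrl_dim=1024)
doc_embs = encode(
    client,
    ["The sky is blue and the grass is green", "Stock markets fell sharply today"],
    is_query=False,
    mrl_dim=1024,
)
# rows are unit-norm, so the matrix product gives cosine similarities
scores = query_emb @ doc_embs.T
print(scores)  # the first document should score higher than the second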