Seed1.5-Embedding

ByteDance Seed

Introduction

We introduce Seed1.5-Embedding, a powerful embedding model built on top of our pretrained LLM (Seed1.5). It stands out with the following key features:

  • Generalist: Excels at general-purpose embedding tasks, achieving state-of-the-art performance on the MTEB benchmark in both Chinese and English.
  • Reasoning Expertise: Demonstrates strong capability in complex query understanding and reasoning, delivering state-of-the-art results on the BRIGHT benchmark.
  • Flexibility: Supports multiple embedding dimensions (2048, 1024, 512, 256) with minimal performance degradation at lower dimensions; see the sketch right after this list.
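
The lower dimensions are obtained by truncating the full 2048-dimensional vector and re-normalizing (this is what the mrl_dim argument does in the Usage snippet below). A minimal sketch of that step; the tensor and function names here are illustrative:

import torch

# stand-in for a (batch, 2048) matrix of full-size embeddings returned by the API
full = torch.randn(4, 2048)

def truncate(emb: torch.Tensor, dim: int) -> torch.Tensor:
    # keep the first `dim` components, then re-normalize so cosine
    # similarity is still a plain dot product
    assert dim in (2048, 1024, 512, 256)
    return torch.nn.functional.normalize(emb[:, :dim], dim=1, p=2)

short = truncate(full, 256)  # shape (4, 256), unit-norm rows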

Performance

MTEB_v2 (English)

| Model | AVG | Classification | Clustering | Pair Classification | Reranking | Retrieval | STS | Summarization |
|---|---|---|---|---|---|---|---|---|
| Seed1.5-Embedding | 74.76 | 89.88 | 60.83 | 87.39 | 50.67 | 67.45 | 87.23 | 36.44 |
| CHAIN19 | 73.97 | 89.78 | 58.85 | 88.67 | 49.11 | 66.21 | 86.12 | 38.32 |
| gemini-embedding-exp-03-07 | 73.30 | 90.05 | 59.39 | 87.70 | 48.59 | 64.35 | 85.29 | 38.28 |
| jasper-en-vision-language-v1 | 71.41 | 90.27 | 60.52 | 88.14 | 50.00 | 56.05 | 84.37 | 37.19 |
| NV-Embed-v2 | 69.81 | 87.19 | 47.66 | 88.69 | 49.61 | 62.84 | 83.82 | 35.21 |
| Linq-Embed-Mistral | 69.80 | 83.00 | 54.07 | 88.44 | 49.44 | 60.14 | 84.69 | 37.26 |

C-MTEB (Chinese)

| Model | AVG | Classification | Clustering | Pair Classification | Reranking | Retrieval | STS |
|---|---|---|---|---|---|---|---|
| Seed1.5-Embedding | 74.87 | 79.37 | 71.11 | 89.57 | 70.14 | 79.33 | 66.56 |
| Conan-embedding-v2 | 74.24 | 76.47 | 68.84 | 92.44 | 74.41 | 78.31 | 65.48 |
| Conan-embedding-v1 | 72.50 | 76.77 | 66.33 | 91.66 | 72.76 | 76.67 | 63.67 |
| xiaobu-embedding-v2 | 72.36 | 76.53 | 65.17 | 91.87 | 72.58 | 76.50 | 64.18 |
| gte-Qwen2-7B-instruct | 71.62 | 75.77 | 66.06 | 87.48 | 68.92 | 75.71 | 65.20 |
| bge-multilingual-gemma2 | 67.64 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 |

BRIGHT

| Model | AVG | Bio. | Earth. | Econ. | Psy. | Rob. | Stack. | Sus. | Leet. | Pony | AoPS | TheoQ. | TheoT. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Seed1.5-Embedding | 27.2 | 34.8 | 46.9 | 23.4 | 31.6 | 19.1 | 25.4 | 21.0 | 43.2 | 4.9 | 12.2 | 33.3 | 30.5 |
| ReasonIR-8B | 24.4 | 26.2 | 31.4 | 23.3 | 30.0 | 18.0 | 23.9 | 20.5 | 35.0 | 10.5 | 14.7 | 31.9 | 27.2 |
| gte-Qwen2-7B | 22.5 | 30.6 | 36.4 | 17.8 | 24.6 | 13.2 | 22.2 | 14.8 | 25.5 | 9.9 | 14.4 | 27.8 | 32.9 |
| GritLM | 21.0 | 24.8 | 32.3 | 18.9 | 19.8 | 17.1 | 13.6 | 17.8 | 29.9 | 22.0 | 8.8 | 25.2 | 21.2 |
| text-embedding-004 | 20.0 | 22.7 | 34.8 | 19.6 | 27.8 | 15.7 | 20.1 | 17.1 | 29.6 | 3.6 | 9.3 | 23.8 | 15.9 |
| SFR-Embedding-Mistral | 18.3 | 19.1 | 26.7 | 17.8 | 19.0 | 16.3 | 14.4 | 19.2 | 27.4 | 2.0 | 7.4 | 24.3 | 26.0 |

Usage

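The snippet below calls the model through the Volcengine Ark SDK (model ID doubao-embedding-large-text-250515). Queries are prefixed with a retrieval instruction, documents are passed as-is, and the returned vectors can optionally be truncated to a smaller dimension before L2 normalization.
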
import os
from typing import List, Optional

import torch

# pip install --upgrade "volcengine-python-sdk[ark]"
from volcenginesdkarkruntime import Ark

def encode(
    client, inputs: List[str], is_query: bool = False, mrl_dim: Optional[int] = None
):
    if is_query:
        # prepend an instruction to queries for best performance; feel free to tune it for different tasks
        # to reproduce the MTEB results, see https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/seed_models.py for the per-task instructions
        inputs = [
            f"Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: {x}"
            for x in inputs
        ]
    resp = client.embeddings.create(
        model="doubao-embedding-large-text-250515",
        input=inputs,
        encoding_format="float",
    )
    embedding = torch.tensor([d.embedding for d in resp.data], dtype=torch.bfloat16)
    if mrl_dim is not None:
        assert mrl_dim in [2048, 1024, 512, 256]
        embedding = embedding[:, :mrl_dim]
    # L2-normalize so cosine similarity can be computed as a dot product
    embedding = torch.nn.functional.normalize(embedding, dim=1, p=2).float().numpy()
    return embedding


# reads the API key from the ARK_API_KEY environment variable
client = Ark(
    api_key=os.getenv("ARK_API_KEY"),
)

print("----- embeddings -----")
inputs = ["The sky is blue and the grass is green"]
embedding = encode(client, inputs, is_query=False, mrl_dim=1024)
print(embedding)
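
For retrieval, encode queries with is_query=True and score them against document embeddings by cosine similarity, which reduces to a dot product because encode returns unit-norm vectors. A small sketch reusing the client and encode defined above; the example texts are illustrative:

print("----- query-document similarity -----")
query_emb = encode(client, ["what color is the sky"], is_query=True, mrl_dim=1024)
doc_embs = encode(
    client,
    ["The sky is blue and the grass is green", "Stock markets fell sharply today"],
    is_query=False,
    mrl_dim=1024,
)
# rows are unit-norm, so the matrix product gives cosine similarities
scores = query_emb @ doc_embs.T
print(scores)  # the first document should score higher than the second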