[Open Source Recommendation] CocoIndex 🥥 Real-time data indexing for AI #2918

Open
badmonster0 opened this issue Mar 11, 2025 · 0 comments
badmonster0 commented Mar 11, 2025

Project URL

https://github.com/cocoindex-io/cocoindex

Category

Python

Project title

The world's first data indexing framework with support for custom logic and built-in incremental updates

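Project description

CocoIndex is the world's first data framework that supports custom logic and comes with built-in incremental updates. It helps you prepare data for AI use cases such as RAG and semantic search. You assemble your ETL pipeline from building blocks, like Lego, in the simplest possible form, and the engine handles incremental updates for you.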

CocoIndex is a framework plus an engine. You can plug in any custom module, including PDF parsing, chunking, and embedding components.

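🔥 Core features:

  • Build your RAG pipeline like Lego.
  • Incremental updates: when your source data changes, the CocoIndex engine minimizes recomputation and data writes, updating only the delta that needs to change.
  • Data flow programming: define your data flow in the simplest possible form.
  • Efficient and stable: the core is written in Rust 🦀, with a Python 🐍 SDK for Python users.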
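
The incremental update idea from the list above can be illustrated with a minimal sketch. It is not CocoIndex's implementation, which the post does not describe. It assumes that changes are detected by comparing a content hash per file against the hash recorded on the previous run, and the names `fingerprint` and `plan_incremental_update` are hypothetical.

```python
from __future__ import annotations

import hashlib


def fingerprint(content: str) -> str:
    """Return a stable content hash for one source document."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def plan_incremental_update(
    previous: dict[str, str], current_docs: dict[str, str]
) -> tuple[list[str], list[str], dict[str, str]]:
    """Compare this run's documents against the fingerprints of the last run.

    previous:     filename -> fingerprint recorded by the last run
    current_docs: filename -> content seen in this run
    Returns (files to reprocess, files to delete from the index, new state).
    """
    current = {name: fingerprint(text) for name, text in current_docs.items()}
    # New or modified files: fingerprint is missing or differs from last run.
    to_process = sorted(n for n, fp in current.items() if previous.get(n) != fp)
    # Files that disappeared from the source must be removed from the index.
    to_delete = sorted(n for n in previous if n not in current)
    return to_process, to_delete, current


# First run: nothing has been indexed yet, so every file is processed.
first_process, first_delete, state = plan_incremental_update(
    {}, {"a.md": "hello", "b.md": "world"}
)

# Second run: a.md is unchanged, b.md was removed, c.md is new.
second_process, second_delete, state = plan_incremental_update(
    state, {"a.md": "hello", "c.md": "new file"}
)
```

In this sketch the unchanged file `a.md` is skipped on the second run, so chunking and embedding would only run again for `c.md`.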

Highlights

Thorough documentation and beginner-friendly. Build your RAG pipeline from modules and get started in five minutes 🚀.

Example code

import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(
                language="markdown", chunk_size=300, chunk_overlap=100))

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_index=[("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
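
To show what the `chunk_size=300, chunk_overlap=100` settings in the flow above mean, here is a small stand-alone sketch of overlapping chunking. It is only an illustration: it measures size in plain characters and uses a fixed sliding window, whereas how `SplitRecursively` actually splits text and counts size is not described in this post. The function `split_with_overlap` is hypothetical and not part of CocoIndex.

```python
from __future__ import annotations


def split_with_overlap(
    text: str, chunk_size: int = 300, chunk_overlap: int = 100
) -> list[tuple[int, str]]:
    """Split text into fixed-size windows that share `chunk_overlap` characters.

    Returns a list of (start_offset, chunk_text) pairs.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[tuple[int, str]] = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# A 700-character document yields windows starting at offsets 0, 200 and 400.
sample = "".join(str(i % 10) for i in range(700))
chunks = split_with_overlap(sample)
starts = [start for start, _ in chunks]
```

With these settings each chunk repeats the last 100 characters of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.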

Screenshots or demo video

Image

badmonster0 changed the title from "[Open Source Recommendation] CocoIndex 🥥 Build a RAG pipeline like Lego" to "[Open Source Recommendation] CocoIndex 🥥 Real-time data indexing for AI" on Mar 26, 2025