Getting Started with LLM Application Development: Document Summarization with LangChain

I. Overall Approach

Long web-page text often exceeds the number of tokens a large language model can process in a single call, so we design a map-reduce pipeline that splits the text, summarizes the pieces, and merges the results:
1. Load the web page content
2. Split it into chunks of manageable size
3. Produce a preliminary summary of each chunk (map)
4. Combine all the preliminary summaries (reduce)
5. If necessary, reduce recursively until the result fits within the token limit
6. Emit the final summary

Now let's implement it in code!

II. Setup

1. Initialize the LLM

First, load the large language model with init_chat_model:

# llm_set/llm_env.py
from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")
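
Before this runs, the OpenAI provider needs OPENAI_API_KEY in the environment. A small pre-flight check (illustrative, not part of the original setup) could be added ahead of the init_chat_model call:

import getpass
import os

# Prompt for the key if OPENAI_API_KEY is not already set in the environment.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")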

III. Main Program: main.py

1. Imports and initialization
import os
import sys

sys.path.append(os.getcwd())

from langchain_community.document_loaders import WebBaseLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_text_splitters import CharacterTextSplitter
import operator
from typing import Annotated, List, Literal, TypedDict
from langchain.chains.combine_documents.reduce import collapse_docs, split_list_of_docs
from langchain_core.documents import Document
from langgraph.constants import Send
from langgraph.graph import END, START, StateGraph

from llm_set import llm_env

llm = llm_env.llm

2. Load the web page
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Artificial_intelligence")
docs = loader.load()

WebBaseLoader makes it easy to pull the page text into the docs list.
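
To sanity-check the result, you can inspect the loaded Document objects (an illustrative check, not required by the pipeline):

print(len(docs))                   # usually one Document per URL
print(docs[0].metadata["source"])  # the source URL
print(docs[0].page_content[:200])  # first 200 characters of the page text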

3. Define the prompt templates
  • Map-stage prompt
map_prompt = ChatPromptTemplate.from_messages(
    [("system", "Write a concise summary of the following: \n\n{context}")]
)
  • Reduce-stage prompt
reduce_template = """
The following is a set of summaries:
{docs}
Take these and distill it into a final, consolidated summary of the main themes.
"""

reduce_prompt = ChatPromptTemplate([("human", reduce_template)])
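
To preview exactly what the reduce step will send to the model, you can render the template with placeholder text (illustrative):

preview = reduce_prompt.invoke({"docs": "- summary one\n- summary two"})
print(preview.to_messages())  # a single HumanMessage with the filled-in template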

4. Split the document into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
split_docs = text_splitter.split_documents(docs)
print(f"Split into {len(split_docs)} chunks")

The page content is split into chunks of at most 1,000 characters so that each piece can be processed in a single call. Note that CharacterTextSplitter measures chunk_size in characters by default, not tokens.
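
If you would rather measure chunks in model tokens than characters, the splitter can count with tiktoken instead. A drop-in alternative (a sketch; token_splitter is an illustrative name, and the tiktoken package must be installed):

# Alternative: count chunk_size in tiktoken tokens rather than characters.
token_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=1000, chunk_overlap=0
)
split_docs = token_splitter.split_documents(docs)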

5. Define a token-length function
token_max = 1000

def length_function(documents: List[Document]) -> int:
    return sum(llm.get_num_tokens(d.page_content) for d in documents)

This computes the total token count of the input documents and is used to decide whether another collapse pass is needed.
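
As a quick check (illustrative), you can measure a few chunks directly:

print(length_function(split_docs[:3]))  # total token count of the first three chunks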

6. Define the state

Main state:

class OverallState(TypedDict):
    contents: List[str]
    summaries: Annotated[list, operator.add]
    collapsed_summaries: List[Document]
    final_summary: str
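
The Annotated[list, operator.add] annotation on summaries is a LangGraph reducer: when several parallel map nodes each return a "summaries" list, LangGraph merges them with operator.add. Conceptually (illustrative):

# Two parallel generate_summary nodes each return {"summaries": [...]};
# the declared reducer concatenates their lists:
merged = operator.add(["summary of chunk 1"], ["summary of chunk 2"])
# merged == ["summary of chunk 1", "summary of chunk 2"]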

Map-stage state:

class SummaryState(TypedDict):
    content: str

7. Generate preliminary summaries (map stage)
def generate_summary(state: SummaryState):
    prompt = map_prompt.invoke(state["content"])
    response = llm.invoke(prompt)
    return {"summaries": [response.content]}

8. Map dispatch logic
def map_summaries(state: OverallState):
    return [
        Send("generate_summary", {"content": content}) for content in state["contents"]
    ]

Send fans out one generate_summary task per chunk, so the map step runs over all chunks in parallel.

9. Collect the summaries
def collect_summaries(state: OverallState):
    return {
        "collapsed_summaries": [Document(summary) for summary in state["summaries"]]
    }

10. Reduce logic
  • Internal reduce function
def _reduce(docs: List[Document]) -> str:
    # Both collapse_docs and generate_final_summary pass a list of Documents;
    # it is slotted into the prompt's single {docs} variable.
    prompt = reduce_prompt.invoke(docs)
    response = llm.invoke(prompt)
    return response.content
  • Collapse the summaries
def collapse_summaries(state: OverallState):
    docs_lists = split_list_of_docs(
        state["collapsed_summaries"],
        length_function,
        token_max,
    )

    results = []
    for doc_list in docs_lists:
        combined = collapse_docs(doc_list, _reduce)
        results.append(combined)

    return {"collapsed_summaries": results}

11. Decide whether to keep collapsing
def should_collapse(
    state: OverallState,
) -> Literal["collapse_summaries", "generate_final_summary"]:
    num_tokens = length_function(state["collapsed_summaries"])
    if num_tokens > token_max:
        return "collapse_summaries"
    else:
        return "generate_final_summary"

12. Generate the final summary
def generate_final_summary(state: OverallState):
    response = _reduce(state["collapsed_summaries"])
    return {"final_summary": response}

IV. Build the Flow Graph (StateGraph)

graph = StateGraph(OverallState)

graph.add_node("generate_summary", generate_summary)
graph.add_node("collect_summaries", collect_summaries)
graph.add_node("collapse_summaries", collapse_summaries)
graph.add_node("generate_final_summary", generate_final_summary)

graph.add_conditional_edges(START, map_summaries, ["generate_summary"])
graph.add_edge("generate_summary", "collect_summaries")
graph.add_conditional_edges("collect_summaries", should_collapse)
graph.add_conditional_edges("collapse_summaries", should_collapse)
graph.add_edge("generate_final_summary", END)

app = graph.compile()
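
Optionally, you can render the compiled graph to verify the wiring (illustrative; the Mermaid text output needs no extra dependencies):

print(app.get_graph().draw_mermaid())  # Mermaid diagram of the nodes and edges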

V. Run the Summarization Pipeline

for step in app.stream(
    {"contents": [doc.page_content for doc in split_docs]},
    {"recursion_limit": 10},
):
    print(list(step.keys()))

Calling .stream() kicks off the whole pipeline: we pass in the chunked contents and stream each step's output until the final summary is produced.
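
If you only need the end result rather than step-by-step streaming, you can invoke the compiled app directly (illustrative):

result = app.invoke(
    {"contents": [doc.page_content for doc in split_docs]},
    {"recursion_limit": 10},
)
print(result["final_summary"])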

VI. Conclusion

This example shows that you can:
✅ Summarize a web page easily with LangChain and a large language model
✅ Build an automatic map-reduce flow that splits long text and reduces recursively
✅ Orchestrate the whole process flexibly with StateGraph
