
在AI工程实践中,你很快会意识到:并不存在一张可以完全照搬的“通用蓝图”。
对于检索增强生成(RAG)中的基础检索(“R”),标准流程包括文档分块、查询语义搜索、结果重排等。这套流程已广为人知。
但随着系统深入应用,一个问题会浮现:如果一个系统对一篇文档的理解仅限于几个零散的文本块,我们如何能称其为“智能”?
那么,如何确保系统获得足够的信息,从而做出“聪明”的回应?
这促使你深入探究,开始思考企业内部的实际做法——因为这类工程实践往往未被系统化记录,大家仍在各自构建专属的流程管线。
这种探索会推动你进行一系列优化:定制化分块、改写用户查询、采用多种搜索策略、利用元数据过滤、扩展上下文以包含相邻块等。

正因如此,我构建了一个略显复杂的检索系统来进行演示。我们将逐步解析其工作流程,既观察每一步的输出结果,也探讨背后的设计权衡。
为便于公开演示,系统嵌入了150篇近期在arXiv上提及RAG的学术论文(总计2,250页)。这意味着我们构建的是一个面向学术论文的检索系统,所有测试查询均与RAG主题相关。
我将部分查询在各步骤的原始输出放在了此代码仓库中,以供查阅完整细节。
技术栈方面,我使用Qdrant和Redis存储数据,调用OpenAI的模型完成嵌入与生成,并使用Cohere的模型进行重排序。整个管线未依赖任何现成框架(因为这会增加调试难度)。
检索与RAG回顾
当你构建类似Copilot的AI知识系统(即向系统输入自有文档并让其据此回答)时,你就是在构建一个RAG系统。
RAG是“检索增强生成”的缩写,包含两个核心部分:检索与生成。
- 检索:指基于用户查询,通过关键词或语义匹配来“查找”相关信息的过程。
- 生成:指大语言模型基于提供的上下文和查询来“生成答案”的部分。

对于RAG新手而言,这种构建系统的方式可能显得“琐碎”。大语言模型难道不应该自己完成大部分工作吗?
遗憾的是,大语言模型本质上是静态的。我们需要通过工程化的系统构建,在每次调用时为其提供所需的一切信息,它才能有效回答问题。
本文将更进一步,尝试在不依赖任何框架的情况下重建整个检索管线,并实现一些更高级的功能,例如构建多查询优化器、融合检索结果、扩展文本块以构建更优质的上下文等。
当然,所有这些“高级功能”都需要以“延迟增加”和“额外工程工作”为代价。
处理不同类型文档
与所有数据工程问题一样,首要挑战是设计数据存储方案。对于检索系统而言,核心在于分块策略,而如何分割文本以及为每个块附加何种信息,是构建高效系统的关键。
为了检索,我们必须将文本分割成不同的块,随后在这些块中匹配查询。
最简单的系统使用通用分块器,依据固定长度、段落或句子进行分割。

但每种文档类型都有其独特性,通用分块可能导致上下文丢失。
要理解这一点,你需要审视不同文档的结构差异。例如,人力资源文档通常有清晰的分节标题,而API文档则可能包含未编号的段落、代码块和表格。
若对所有文档应用相同的分块逻辑,很可能导致“错误分割”。这意味着大语言模型获得的文本块信息不完整,从而无法生成准确答案。
此外,对于每个文本块,你还需要考虑应为其附加哪些数据。
是否包含元数据以便系统进行过滤?是否链接相似信息以建立关联?是否携带上下文帮助大语言模型理解信息来源?
这意味着数据存储架构至关重要。如果初始存储的信息后来被发现不足,你可能需要推倒重来。如果系统变得过度复杂,同样需要重构。
本系统将处理Excel和PDF文档,重点在于为每个文本块增加上下文、关键键和邻居信息。这些设计将在后续的检索环节中发挥作用。
本次演示中,数据存储于Redis与Qdrant。我们使用Qdrant进行语义搜索、BM25搜索及混合搜索,在需要扩展上下文时再从Redis获取数据。
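下面是一个极简的写入示意(并非作者的原始实现):把文本块的稠密向量写入Qdrant,同时把完整的块负载按chunk_id存入Redis,供后续扩展上下文时查找。集合名`papers`、向量维度3072(对应后文使用的text-embedding-3-large)以及键名格式均为假设值;为简洁起见,这里省略了混合搜索所需的稀疏(BM25)索引配置。

```python
import json
import uuid

import redis
from qdrant_client import QdrantClient, models

qdrant = QdrantClient(url="http://localhost:6333")
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# 注意:recreate_collection 会清空同名集合,仅用于演示
qdrant.recreate_collection(
    collection_name="papers",
    vectors_config=models.VectorParams(size=3072, distance=models.Distance.COSINE),
)

def store_chunk(chunk: dict, embedding: list[float]) -> None:
    # Qdrant 的点 ID 只接受无符号整数或 UUID,因此由 chunk_id 派生一个稳定的 UUID
    point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, chunk["chunk_id"]))
    qdrant.upsert(
        collection_name="papers",
        points=[models.PointStruct(id=point_id, vector=embedding, payload=chunk)],
    )
    # Redis 中按 "chunk:{chunk_id}" 存完整负载(JSON),供扩展阶段直接取用
    r.set(f"chunk:{chunk['chunk_id']}", json.dumps(chunk, ensure_ascii=False))
```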
处理表格文件
首先探讨如何对表格数据进行分块、添加上下文,并使用关键键保持信息关联。
对于已结构化的表格数据(如Excel),让系统直接查询表格似乎最直接。但在面对用户模糊或不规范的查询时,语义匹配通常更为有效。
SQL或直接查找仅在熟悉数据模式及确切字段名时有效。例如,当用户查询“Mazda 2023 specs”时,对表格行进行语义匹配就能提供有价值的线索。
我曾与一些公司交流,他们希望系统能够跨多个Excel文件匹配信息。为此,我们可以将关键键与文本块一同存储(无需构建完整的知识图谱)。
例如,在处理包含采购数据的Excel表格时,可以按行摄入如下格式的数据:
```json
{
"chunk_id": "Sales_Q1_123::row::1",
"doc_id": "Sales_Q1_123:1234",
"location": {"sheet_name": "Sales Q1", "row_n": 1},
"type": "chunk",
"text": "OrderID: 1001234f67 n Customer: Alice Hemsworth n Products: Blue sweater 4, Red pants 6",
"context": "Quarterly sales snapshot",
"keys": {"OrderID": "1001234f67"}
}
```
如果我们在检索管线的后续环节决定关联信息,可以使用关键键进行标准查找,快速定位到相关的文本块。这样可以在文档间高效跳转,而无需在管线中额外增加路由步骤。
表格文档间关键键连接的简化示意图 | 作者供图
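行级数据块可以用类似下面的方式从Excel批量生成。这只是一个基于pandas的简化示意,文件名、工作表名与关键键列(OrderID)均为示例,并非作者的原始代码。

```python
import pandas as pd

def excel_rows_to_chunks(path: str, sheet_name: str, doc_id: str, key_columns: list[str]) -> list[dict]:
    df = pd.read_excel(path, sheet_name=sheet_name)
    chunks = []
    for i, row in df.iterrows():
        # 把一行拍平为 "列名: 值" 的文本,便于语义与 BM25 匹配
        text = " \n ".join(f"{col}: {row[col]}" for col in df.columns)
        chunks.append({
            "chunk_id": f"{doc_id}::row::{i + 1}",
            "doc_id": doc_id,
            "location": {"sheet_name": sheet_name, "row_n": i + 1},
            "type": "chunk",
            "text": text,
            "context": "",  # 可在此处填入由 LLM 生成的一句话上下文
            "keys": {col: str(row[col]) for col in key_columns if col in df.columns},
        })
    return chunks

chunks = excel_rows_to_chunks("Sales_Q1.xlsx", "Sales Q1", "Sales_Q1_123", ["OrderID"])
```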
我们也可以为每个文档设置一个摘要。它充当文本块的“门卫”。
```json
{
"chunk_id": "Sales_Q1::summary",
"doc_id": "Sales_Q1_123:1234",
"location": {"sheet_name": "Sales Q1"},
"type": "summary",
"text": "Sheet tracks Q1 orders for 2025, type of product, and customer names for reconciliation.",
"context": ""
}
```
“门卫”式摘要这个概念起初可能有些抽象,但将其存储在文档层级,在后续构建上下文时非常有用。
当由大语言模型生成此摘要(以及简短的上下文字符串)时,它还能建议候选的关键键列(例如订单ID等)。
顺便提醒:如果可能,尽量手动指定关键键列;如果不行,务必添加验证逻辑,防止关键键是随机或无意义的(大语言模型有时会选择非常奇怪的列,反而忽略关键列)。
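下面给出一个关键键列验证逻辑的简化示意:对LLM建议的列做基本检查(列是否存在、非空比例、唯一率)。其中的阈值只是示例,应按你的语料调整。

```python
import pandas as pd

def validate_key_columns(df: pd.DataFrame, suggested: list[str],
                         min_non_null: float = 0.9, min_unique: float = 0.5) -> list[str]:
    valid = []
    for col in suggested:
        if col not in df.columns:
            continue  # LLM 偶尔会臆造不存在的列名,直接跳过
        series = df[col]
        non_null_ratio = series.notna().mean()
        unique_ratio = series.nunique(dropna=True) / max(len(series), 1)
        # 一个可用的关键键列应当基本非空且取值足够分散
        if non_null_ratio >= min_non_null and unique_ratio >= min_unique:
            valid.append(col)
    return valid
```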
在这个 arXiv 论文系统中,我还导入了两份 Excel 文件,其中包含了论文标题和作者级别的信息。
一个典型的数据块(chunk)结构如下:
```json
{
"chunk_id": "titles::row::8817::250930134607",
"doc_id": "titles::250930134607",
"location": {
"sheet_name": "titles",
"row_n": 8817
},
"type": "chunk",
"text": "id: 2507 2114ntitle: Gender Similarities Dominate Mathematical Cognition at the Neural Level: A Japanese fMRI Study Using Advanced Wavelet Analysis and Generative AInkeywords: FMRI; Functional Magnetic Resonance Imaging; Gender Differences; Machine Learning; Mathematical Performance; Time Frequency Analysis; Waveletnabstract_url: https://arxiv.org/abs/2507.21140ncreated: 2025-07-23 00:00:00 UTCnauthor_1: Tatsuru Kikuchi",
"context": "Analyzing trends in AI and computational research articles.",
"keys": {
"id": "2507 2114",
"author_1": "Tatsuru Kikuchi"
}
}
```
严格来说,这两份 Excel 文件并非必需(PDF 文件已足够),但它们可以清晰地演示系统如何利用 keys 字段来查找和关联信息。
同时,我也为这些文件生成了摘要(summary)。
```json
{
"chunk_id": "titles::summary::250930134607",
"doc_id": "titles::250930134607",
"location": {
"sheet_name": "titles"
},
"type": "summary",
"text": "The dataset consists of articles with various attributes including ID, title, keywords, authors, and publication date. It contains a total of 2508 rows with a rich variety of topics predominantly around AI, machine learning, and advanced computational methods. Authors often contribute in teams, indicated by multiple author columns. The dataset serves academic and research purposes, enabling catego",
}
此外,我们还在 Redis 的文档层级存储了元信息,例如文档内容、来源、访问权限以及更新时间。这为未来更新过期信息提供了可能。
接下来,我们将探讨 PDF 文档的处理,这通常是数据源中最具挑战性的部分。
摄入 PDF 文档
处理 PDF 文档的流程与处理表格数据类似,但分块(chunking)过程要复杂得多。与 Excel 数据存储 keys 不同,PDF 数据块存储的是 neighbors 信息。
市面上有多种处理 PDF 的框架可供选择,如 LlamaParse、Docling 等,但它们各有局限。因此,系统设计需要具备良好的可扩展性。
PDF 处理的难点在于其缺乏统一的结构,并且常常包含图表,许多系统对此处理不佳。
尽管如此,像 Docling 这样的工具至少能帮助我们正确解析常规表格,并将每个元素准确地映射到对应的页码和元素编号。
在此基础上,我们可以编写自定义的程序化逻辑:为每个元素映射章节和子章节、为每个数据块添加标题、执行“智能合并”以确保文本片段的自然阅读(例如,避免在句子中间分割)。
我们还会按照章节对数据块进行分组,并使用一个名为 neighbors 的字段将它们链接起来,以保持其上下文关联。

通过这种方式,我们可以在保持数据块体积较小的同时,在检索后仍能重建其扩展的上下文信息。
最终结果大致如下:
```json
{
"chunk_id": "S3::C02::251009105423",
"doc_id": "2507.18910v1",
"location": {
"page_start": 2,
"page_end": 2
},
"type": "chunk",
"text": "1 Introductionnn1.1 Background and MotivationnnLarge-scale pre-trained language models have demonstrated an ability to store vast amounts of factual knowledge in their parameters, but they struggle with accessing up-to-date information and providing verifiable sources. This limitation has motivated techniques that augment generative models with information retrieval. Retrieval-Augmented Generation (RAG) emerged as a solution to this problem, combining a neural retriever with a sequence-to-sequence generator to ground outputs in external documents [52]. The seminal work of [52] introduced RAG for knowledge-intensive tasks, showing that a generative model (built on a BART encoder-decoder) could retrieve relevant Wikipedia passages and incorporate them into its responses, thereby achieving state-of-the-art performance on open-domain question answering. RAG is built upon prior efforts in which retrieval was used to enhance question answering and language modeling [48, 26, 45]. Unlike earlier extractive approaches, RAG produces free-form answers while still leveraging non-parametric memory, offering the best of both worlds: improved factual accuracy and the ability to cite sources. This capability is especially important to mitigate hallucinations (i.e., believable but incorrect outputs) and to allow knowledge updates without retraining the model [52, 33].",
"context": "Systematic review of RAG's development and applications in NLP, addressing challenges and advancements.",
"section_neighbours": {
"before": [
"S3::C01::251009105423"
],
"after": [
"S3::C03::251009105423",
"S3::C04::251009105423",
"S3::C05::251009105423",
"S3::C06::251009105423",
"S3::C07::251009105423"
]
},
"keys": {}
}
```
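neighbors字段本身的写入逻辑很简单。下面是一个示意:假设同一章节内的文本块已按阅读顺序排列,窗口大小(前1块、后5块)与上面的示例一致,但只是经验值。

```python
def link_section_neighbours(section_chunks: list[dict],
                            before_n: int = 1, after_n: int = 5) -> None:
    """为同一章节内按顺序排列的块写入 before/after 邻居的 chunk_id。"""
    ids = [c["chunk_id"] for c in section_chunks]
    for i, chunk in enumerate(section_chunks):
        chunk["section_neighbours"] = {
            "before": ids[max(0, i - before_n):i],
            "after": ids[i + 1:i + 1 + after_n],
        }
```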
当数据以此方式组织时,每个文本块可被视为一个“种子”。我们的目标是基于用户查询,定位潜在的相关信息起始点,并以此为中心向外扩展上下文。
与基础RAG系统的关键区别在于,我们尝试利用大型语言模型日益增长的上下文窗口,向其输入更多相关信息(尽管这会带来计算成本)。在后续的检索管线中,你将看到构建上下文时可能出现的“杂乱”情况。
构建检索管线
我们将采用模块化方式构建检索管线。这种设计便于分阶段测试,并有助于解释为何在将信息传递给LLM之前,需要以特定方式进行检索与转换。
管线将依次涵盖以下步骤:
1. 语义/混合/BM25搜索
2. 构建多查询优化器
3. 结果重排序
4. 扩展内容以构建上下文
5. 将最终上下文交由LLM生成答案
最后,我们还将讨论延迟、不必要的复杂性,以及为提升速度可以优化的环节。
语义搜索与混合搜索
管线的首要步骤是确保针对用户查询能返回相关文档。为此,我们同时采用语义搜索、BM25搜索以及结合二者的混合搜索。
在简单的检索系统中,通常仅使用语义搜索。进行语义搜索时,我们使用嵌入模型为每个文本块生成稠密向量。
嵌入模型简介:它将每段文本表示为高维空间中的一个点。点的位置反映了模型基于训练数据对文本“含义”的理解。语义相近的文本在向量空间中彼此靠近。

这意味着,如果模型在训练中见过大量语义相似的语言样本,它就能更准确地将相关文本聚集在一起,从而更有效地将用户查询匹配到最相关的内容。
在本系统中,我们使用OpenAI的text-embedding-3-large模型为学术论文内容生成稠密向量。该模型比text-embedding-3-small模型成本更高,可能并非所有场景下的最优选择。在实际应用中,可考虑使用针对特定领域优化的专用模型,或对通用模型进行微调。需注意,如果嵌入模型未在与你文本内容足够相似的语料上训练过,其匹配相关文档的能力会下降。
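生成稠密向量的调用本身很简单。下面是一个基于OpenAI Python SDK的最小示意,模型名与批大小均可替换:

```python
from openai import OpenAI

client = OpenAI()  # 从环境变量 OPENAI_API_KEY 读取密钥

def embed_texts(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    # 返回的向量顺序与输入文本一致
    return [item.embedding for item in resp.data]

vectors = embed_texts(["chunk text one", "chunk text two"])
```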
为了支持混合搜索与BM25搜索,我们还需要建立一个基于词项的索引(稀疏向量)。BM25算法主要匹配“精确的词元”(例如特定的“ID 826384”),而非像语义搜索那样返回“含义相似”的文本。
注意:BM25不匹配部分词元。如果用户查询一个以“8263”开头的ID,语义搜索可能找到相关结果,但BM25无法保证。
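这一行为可以用rank_bm25库快速验证(实际系统中的稀疏索引由Qdrant承担,这里仅作行为示意):

```python
from rank_bm25 import BM25Okapi

corpus = [
    "order id 826384 shipped to warehouse 7",
    "semantic retrieval with dense embeddings",
    "bm25 relies on exact token overlap",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

print(bm25.get_scores("826384".split()))  # 完整 ID 能精确命中第一条文档
print(bm25.get_scores("8263".split()))    # 部分词元 "8263" 与 "826384" 不会匹配,得分全为 0
```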
为了测试语义搜索功能,我们使用了一个预期可由已摄入论文回答的问题,例如:“为什么LLM在上下文窗口变长时性能会下降,以及该如何应对?”

[1] score=0.5071 doc=docs_ingestor/docs/arxiv/2508.15253.pdf chunk=S3::C02::251009131027
text: 1 Introduction This challenge is exacerbated when incorrect yet highly ranked contexts serve as hard negatives. Conventional RAG, i.e. , simply appending * Corresponding author 1 https://github.com/eunseongc/CARE Figure 1: LLMs struggle to resolve context-memory conflict. Green bars show the number of questions correctly answered without retrieval in a closed-book setting. Blue and yellow bars show performance when provided with a positive or negative context, respectively. Closed-book w/ Positive Context W/ Negative Context 18k 25.1% 49.1% 39.6% 47.5% 6k 4k 12k 4 Mistral-7b LLaMA3-8b GPT-4o-mini Claude-3.5 retrieved context to the prompt, struggles to discriminate between incorrect external context and correct parametric knowledge (Ren et al., 2025). This misalignment leads to overriding correct internal representations, resulting in substantial performance degradation on questions that the model initially answered correctly. As shown in Figure 1, we observed significant performance drops of25.149.1% across state-of-the-
[2] score=0.5022 doc=docs_ingestor/docs/arxiv/2508.19614.pdf chunk=S3::C03::251009132038
text: 1 Introductions Despite these advances, LLMs might underutilize accurate external contexts, disproportionately favoring internal parametric knowledge during generation [50, 40]. This overreliance risks propagating outdated information or hallucinations, undermining the trustworthiness of RAG systems. Surprisingly, recent studies reveal a paradoxical phenomenon: injecting noise-random documents or tokens-to retrieved contexts that already contain answer-relevant snippets can improve the generation accuracy [10, 49]. While this noise-injection approach is simple and effective, its underlying influence on LLM remains unclear. Furthermore, long contexts containing noise documents create computational overhead. Therefore, it is important to design more principled strategies that can achieve similar benefits without incurring excessive cost.
[3] score=0.4982 doc=docs_ingestor/docs/arxiv/2508.19614.pdf chunk=S6::C18::251009132038
text: 4 Experiments 4.3 Analysis Experiments Qualitative Study In Table 4, we analyze a case study from the NQ dataset using the Llama2-7B model, evaluating four decoding strategies: GD(0), CS, DoLA, and LFD. Despite access to groundtruth documents, both GD(0) and DoLA generate incorrect answers (e.g., ’18 minutes’), suggesting limited capacity to integrate contextual evidence. Similarly, while CS produces a partially relevant response (‘Texas Revolution’), it exhibits reduced factual consistency with the source material. In contrast, LFD demonstrates superior utilization of retrieved context, synthesizing a precise and factually aligned answer. Additional case studies and analyses are provided in Appendix F.
[4] score=0.4857 doc=docs_ingestor/docs/arxiv/2507.23588.pdf chunk=S6::C03::251009122456
text: 4 Results Figure 4: Change in attention pattern distribution in different models. For DiffLoRA variants we plot attention mass for main component (green) and denoiser component (yellow). Note that attention mass is normalized by the number of tokens ineach part of the sequence. The negative attention is shown after it is scaled by λ . DiffLoRA corresponds to the variant with learnable λ and LoRa parameters in both terms. BOS CONTEXT 1 MAGIC NUMBER CONTEXT 2 QUERY 00.20.40.6 BOS CONTEXT 1 MAGIC NUMBER CONTEXT 2 QUERY BOS CONTEXT 1 MAGIC NUMBER CONTEXT 2 QUERY BOS CONTEXT 1 MAGIC NUMBER CONTEXT 2 QUERY Llama-3.2-1B LoRA DLoRA-32 DLoRA, Tulu-3 perform similarly as the initial model, however they are outperformed by LoRA. When increasing the context length with more sample demonstrations, DiffLoRA seems to struggle even more in TREC-fine and Banking77. This might be due to the nature of instruction tuned data, and the max_sequence_length = 4096 applied during finetuning. LoRA is less impacted, likely because it diverges less
[5] score=0.4838 doc=docs_ingestor/docs/arxiv/2508.15253.pdf chunk=S3::C03::251009131027
text: 1 Introduction To mitigate context-memory conflict, existing studies such as adaptive retrieval (Ren et al., 2025; Baek et al., 2025) and the decoding strategies (Zhao et al., 2024; Han et al., 2025) adjust the influence of external context either before or during answer generation. However, due to the LLM’s limited capacity in detecting conflicts, it is susceptible to misleading contextual inputs that contradict the LLM’s parametric knowledge. Recently, robust training has equipped LLMs, enabling them to identify conflicts (Asai et al., 2024; Wang et al., 2024). As shown in Figure 2(a), it enables the LLM to dis-
[6] score=0.4827 doc=docs_ingestor/docs/arxiv/2508.05266.pdf chunk=S27::C03::251009123532
text: B. Subclassification Criteria for Misinterpretation of Design Specifications Initially, regarding long-context scenarios, we observed that directly prompting LLMs to generate RTL code based on lengthy contexts often resulted in certain code segments failing to accurately reflect high-level requirements. However, by manually decomposing the long context-retaining only the key descriptive text relevant to the erroneous segments while omitting unnecessary details-the LLM regenerated RTL code that correctly matched the specifications. As shown in Fig 23, after manual decomposition of the long context, the LLM successfully generated the correct code. This demonstrates that redundancy inlong contexts is a limiting factor in LLMs’ ability to generate accurate RTL code.
[7] score=0.4798 doc=docs_ingestor/docs/arxiv/2508.19614.pdf chunk=S3::C02::251009132038
text: 1 Introductions Figure 1: Illustration for layer-wise behavior in LLMs for RAG. Given a query and retrieved documents with the correct answer (‘Real Madrid’), shallow layers capture local context, middle layers focus on answer-relevant content, while deep layers may over-rely on internal knowledge and hallucinate (e.g., ‘Barcelona’). Our proposal, LFD fuses middle-layer signals into the final output to preserve external knowledge and improve accuracy. Shallow Layers Middle Layers Deep Layers Who has more la liga titles real madrid or barcelona? …Nine teams have been crowned champions, with Real Madrid winning the title a record 33 times and Barcelona 25 times … Query Retrieved Document …with Real Madrid winning the title a record 33 times and Barcelona 25 times … Short-context Modeling Focus on Right Answer Answer is barcelona Wrong Answer LLMs …with Real Madrid winning the title a record 33 times and Barcelona 25 times … …with Real Madrid winning the title a record 33 times and Barcelona 25 times … Internal Knowledge Confou
从上面的结果看,模型能匹配到一些能回答问题的段落。
如果用同样的查询改用 BM25(精确 token 匹配),我们会得到:
```
[1] score=22.0764 doc=docs_ingestor/docs/arxiv/2507.20888.pdf chunk=S4::C27::251009115003
text: 3 APPROACH 3.2.2 Project Knowledge Retrieval Similar Code Retrieval. Similar snippets within the same project are valuable for code completion, even if they are not entirely replicable. In this step, we also retrieve similar code snippets. Following RepoCoder, we no longer use the unfinished code as the query but instead use the code draft, because the code draft is closer to the ground truth compared to the unfinished code. We use the Jaccard index to calculate the similarity between the code draft and the candidate code snippets. Then, we obtain a list sorted by scores. Due to the potentially large differences in length between code snippets, we no longer use the top-k method. Instead, we get code snippets from the highest to the lowest scores until the preset context length is filled.
[2] score=17.4931 doc=docs_ingestor/docs/arxiv/2508.09105.pdf chunk=S20::C08::251009124222
text: C. Ablation Studies Ablation result across White-Box attribution: Table V shows the comparison result in methods of WhiteBox Attribution with Noise, White-Box Attrition with Alternative Model and our current method Black-Box zero-gradient Attribution with Noise under two LLM categories. We can know that: First, The White-Box Attribution with Noise is under the desired condition, thus the average Accuracy Score of two LLMs get the 0.8612and0.8073. Second, the the alternative models (the two models are exchanged for attribution) reach the 0.7058and0.6464. Finally, our current method Black-Box Attribution with Noise get the Accuracy of0.7008and0.6657 by two LLMs.
[3] score=17.1458 doc=docs_ingestor/docs/arxiv/2508.05100.pdf chunk=S4::C03::251009123245
text: Preliminaries Based on this, inspired by existing analyses (Zhang et al. 2024c), we measure the amount of information a position receives using discrete entropy, as shown in the following equation: which quantifies how much information t i receives from the attention perspective. This insight suggests that LLMs struggle with longer sequences whennot trained on them, likely due to the discrepancy in information received by tokens in longer contexts. Based on the previous analysis, the optimization of attention entropy should focus on two aspects: The information entropy at positions that are relatively important and likely contain key information should increase.
```
对这个查询来说,BM25 的召回结果一般,但有时查询中包含必须精确匹配的关键词,BM25 反而更合适。
我们将查询改为 “papers from Anirban Saha Anik”,并使用 BM25 进行检索。
[1] score=62.3398 doc=authors.csv chunk=authors::row::1::251009110024
text: author_name: Anirban Saha Anik n_papers: 2 article_1: 2509.01058 article_2: 2507.07307
[2] score=56.4007 doc=titles.csv chunk=titles::row::24::251009110138
text: id: 2509.01058 title: Speaking at the Right Level: Literacy-Controlled Counterspeech Generation with RAG-RL keywords: Controlled-Literacy; Health Misinformation; Public Health; RAG; RL; Reinforcement Learning; Retrieval Augmented Generation abstract_url: https://arxiv.org/abs/2509.01058 created: 2025-09-10 00:00:00 UTC author_1: Xiaoying Song author_2: Anirban Saha Anik author_3: Dibakar Barua author_4: Pengcheng Luo author_5: Junhua Ding author_6: Lingzi Hong
[3] score=56.2614 doc=titles.csv chunk=titles::row::106::251009110138
text: id: 2507.07307 title: Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation keywords: Evidence Enhancement; Health Misinformation; LLMs; Large Language Models; RAG; Response Refinement; Retrieval Augmented Generation abstract_url: https://arxiv.org/abs/2507.07307 created: 2025-07-27 00:00:00 UTC author_1: Anirban Saha Anik author_2: Xiaoying Song author_3: Elliott Wang author_4: Bryan Wang author_5: Bengisu Yarimbas author_6: Lingzi Hong
所有结果都明确包含了 “Anirban Saha Anik”,这正是我们期望的精确匹配。
如果使用语义搜索,结果不仅会包含目标作者,还会返回一些名字相似但不相关的作者。
[1] score=0.5810 doc=authors.csv chunk=authors::row::1::251009110024
text: author_name: Anirban Saha Anik n_papers: 2 article_1: 2509.01058 article_2: 2507.07307
[2] score=0.4499 doc=authors.csv chunk=authors::row::55::251009110024
text: author_name: Anand A. Rajasekar n_papers: 1 article_1: 2508.0199
[3] score=0.4320 doc=authors.csv chunk=authors::row::59::251009110024
text: author_name: Anoop Mayampurath n_papers: 1 article_1: 2508.14817
[4] score=0.4306 doc=authors.csv chunk=authors::row::69::251009110024
text: author_name: Avishek Anand n_papers: 1 article_1: 2508.15437
[5] score=0.4215 doc=authors.csv chunk=authors::row::182::251009110024
text: author_name: Ganesh Ananthanarayanan n_papers: 1 article_1: 2509.14608
这个例子清晰地展示了语义搜索的局限性:名字相似并不等同于与查询相关。
因此,不同的检索场景需要适配不同的技术:语义搜索适用于概念匹配,而 BM25 在精确词汇匹配上表现更佳。我们也可以采用混合搜索策略,将语义搜索与 BM25 的优势结合起来。
下面是对原始查询 “why do LLMs get worse with longer context windows and what to do about it?” 运行混合检索(hybrid)的结果:
[1] score=0.5000 doc=docs_ingestor/docs/arxiv/2508.15253.pdf chunk=S3::C02::251009131027
text: 1 Introduction This challenge is exacerbated when incorrect yet highly ranked contexts serve as hard negatives. Conventional RAG, i.e. , simply appending * Corresponding author 1 https://github.com/eunseongc/CARE Figure 1: LLMs struggle to resolve context-memory conflict. Green bars show the number of questions correctly answered without retrieval in a closed-book setting. Blue and yellow bars show performance when provided with a positive or negative context, respectively. Closed-book w/ Positive Context W/ Negative Context 1 8k 25.1% 49.1% 39.6% 47.5% 6k 4k 1 2k 4 Mistral-7b LLaMA3-8b GPT-4o-mini Claude-3.5 retrieved context to the prompt, struggles to discriminate between incorrect external context and correct parametric knowledge (Ren et al., 2025). This misalignment leads to overriding correct internal representations, resulting in substantial performance degradation on questions that the model initially answered correctly. As shown in Figure 1, we observed significant performance drops of 25.1-49.1% across state-of-the-
[2] score=0.5000 doc=docs_ingestor/docs/arxiv/2507.20888.pdf chunk=S4::C27::251009115003
text: 3 APPROACH 3.2.2 Project Knowledge Retrieval Similar Code Retrieval. Similar snippets within the same project are valuable for code completion, even if they are not entirely replicable. In this step, we also retrieve similar code snippets. Following RepoCoder, we no longer use the unfinished code as the query but instead use the code draft, because the code draft is closer to the ground truth compared to the unfinished code. We use the Jaccard index to calculate the similarity between the code draft and the candidate code snippets. Then, we obtain a list sorted by scores. Due to the potentially large differences in length between code snippets, we no longer use the top-k method. Instead, we get code snippets from the highest to the lowest scores until the preset context length is filled.
[3] score=0.4133 doc=docs_ingestor/docs/arxiv/2508.19614.pdf chunk=S3::C03::251009132038
text: 1 Introductions Despite these advances, LLMs might underutilize accurate external contexts, disproportionately favoring internal parametric knowledge during generation [50, 40]. This overreliance risks propagating outdated information or hallucinations, undermining the trustworthiness of RAG systems. Surprisingly, recent studies reveal a paradoxical phenomenon: injecting noise-random documents or tokens-to retrieved contexts that already contain answer-relevant snippets can improve the generation accuracy [10, 49]. While this noise-injection approach is simple and effective, its underlying influence on LLM remains unclear. Furthermore, long contexts containing noise documents create computational overhead. Therefore, it is important to design more principled strategies that can achieve similar benefits without incurring excessive cost.
[4] score=0.1813 doc=docs_ingestor/docs/arxiv/2508.19614.pdf chunk=S6::C18::251009132038
text: 4 Experiments 4.3 Analysis Experiments Qualitative Study In Table 4, we analyze a case study from the NQ dataset using the Llama2-7B model, evaluating four decoding strategies: GD(0), CS, DoLA, and LFD. Despite access to groundtruth documents, both GD(0) and DoLA generate incorrect answers (e.g., ’18 minutes’), suggesting limited capacity to integrate contextual evidence. Similarly, while CS produces a partially relevant response (‘Texas Revolution’), it exhibits reduced factual consistency with the source material. In contrast, LFD demonstrates superior utilization of retrieved context, synthesizing a precise and factually aligned answer. Additional case studies and analyses are provided in Appendix F.
我发现对于此类查询,语义搜索的效果最佳。这也引出了一个思路:可以通过并行执行多种搜索方法、针对多条查询来获取初始文本块,但这会增加系统的复杂性。
因此,我们接下来构建一个模块,其核心功能是:将原始查询改写成多个“定向优化”的版本,为每个版本匹配合适的搜索方法,并对搜索结果进行融合。
多查询优化器(Multi-query optimizer)
本模块旨在优化“原始查询”:基于用户的初始输入,生成多个更具针对性的查询变体,并为每条变体选择合适的搜索方法。这样做可以提升召回率,但也会引入额外的复杂度与权衡。
许多采用“智能体(Agent)”抽象的系统在检索时都会进行查询改写。
例如,LlamaIndex 的 QueryTool 就会利用大语言模型(LLM)来优化输入查询。

我们也可以自行实现这一步骤,但会赋予它“生成多条查询”并同时指定搜索方法的能力。当文档库规模增大时,你还可以在此阶段就引入过滤条件。
关于生成多少条查询,我建议尽量保持简单。因为这一步如果生成质量不高,会直接损害后续检索结果的有效性。系统生成的“无关查询”越多,整个处理流程中引入的噪声就越大。
我设计的函数会基于原始查询,生成 1 至 3 条具有学术风格的查询,并标注每条查询建议使用的搜索方法。
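这类函数的核心就是一次带结构化输出的LLM调用。下面是一个简化示意,其中提示词、返回的JSON结构与模型名都只是示例,并非作者的原始实现:

```python
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "把用户查询改写成 1 到 3 条更学术化的英文检索查询。"
    "为每条查询选择搜索方法:semantic、bm25 或 hybrid(含人名、ID 等需要精确匹配的词时优先 bm25)。"
    '只返回 JSON,形如 {"queries": [{"method": "hybrid", "text": "..."}]}。'
)

def optimize_query(raw_query: str) -> list[dict]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # 示例模型,可替换
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": raw_query},
        ],
    )
    return json.loads(resp.choices[0].message.content).get("queries", [])
```

下面是一个实际运行的例子: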
原始查询:why is everyone saying RAG doesn't scale? how are people fixing that?
生成的查询:
* hybrid: RAG scalability issues
* hybrid: solutions to RAG scaling challenges
检索系统返回的结果示例如下:
查询 1(混合检索) – 查询词:RAG scalability issues,返回前 20 条结果
[1] score=0.5000 doc=docs_ingestor/docs/arxiv/2507.18910.pdf chunk=S22::C05::251104142800
text: 7 Challenges of RAG 7.2.1 Scalability and Infrastructure Deploying RAG at scale requires substantial engineering to maintain large knowledge corpora and efficient retrieval indices. Systems must handle millions or billions of documents, demanding significant computational resources, efficient indexing, distributed computing infrastructure, and cost management strategies [21]. Efficient indexing methods, caching, and multi-tier retrieval approaches (such as cascaded retrieval) become essential at scale, especially in large deployments like web search engines.
[2] score=0.5000 doc=docs_ingestor/docs/arxiv/2507.07695.pdf chunk=SDOC::SUM::251104135247
text: This paper proposes the KeyKnowledgeRAG (K2RAG) framework to enhance the efficiency and accuracy of Retrieval-Augment-Generate (RAG) systems. It addresses the high computational costs and scalability issues associated with naive RAG implementations by incorporating techniques such as knowledge graphs, a hybrid retrieval approach, and document summarization to reduce training times and improve answer accuracy. Evaluations show that K2RAG significantly outperforms traditional implementations, achieving greater answer similarity and faster execution times, thereby providing a scalable solution for companies seeking robust question-answering systems.
[...]
查询 2(混合检索) – 查询词:solutions to RAG scaling challenges,返回前 20 条结果
[1] score=0.5000 doc=docs_ingestor/docs/arxiv/2507.18910.pdf chunk=S22::C05::251104142800
text: 7 Challenges of RAG 7.2.1 Scalability and Infrastructure Deploying RAG at scale requires substantial engineering to maintain large knowledge corpora and efficient retrieval indices. Systems must handle millions or billions of documents, demanding significant computational resources, efficient indexing, distributed computing infrastructure, and cost management strategies [21]. Efficient indexing methods, caching, and multi-tier retrieval approaches (such as cascaded retrieval) become essential at scale, especially in large deployments like web search engines.
[2] score=0.5000 doc=docs_ingestor/docs/arxiv/2508.05100.pdf chunk=S3::C06::251104155301
text: Introduction Empirical analyses across multiple real-world benchmarks reveal that BEE-RAG fundamentally alters the entropy scaling laws governing conventional RAG systems, which provides a robust and scalable solution for RAG systems dealing with long-context scenarios. Our main contributions are summarized as follows: We introduce the concept of balanced context entropy, a novel attention reformulation that ensures entropy invariance across varying context lengths, and allocates attention to important segments. It addresses the critical challenge of context expansion in RAG.
[...]
我们还可以通过包含特定关键词(如人名、ID)的查询来测试系统是否会优先选择 BM25 而非语义搜索。例如:
原始查询:any papers from Chenxin Diao?
系统生成的查询:
– BM25 查询: Chenxin Diao
此查询将精确返回提及“Chenxin Diao”的结果。
为了进一步提升效果,可以构建一个“根据输入生成少量示例查询”的查询优化器。当原始查询进入时,优化器先生成示例来引导其表达方式。这对于小型模型尤其有用,因为它们通常不擅长将口语化的人类查询改写为更精准的学术表达。
例如,当用户提问“为什么 LLM 在说谎”时,优化器可以将查询重写为 causes of inaccuracies in large language models,而不是直接搜索 hallucinations。
并行抓取结果后,我们进行融合(Fuse)。融合后的结果示例如下:
```
RRF Fusion top 38 for query: why is everyone saying RAG doesn’t scale? how are people fixing that?
[1] score=0.0328 doc=docs_ingestor/docs/arxiv/2507.18910.pdf chunk=S22::C05::251104142800
text: 7 Challenges of RAG 7.2.1 Scalability and Infrastructure Deploying RAG at scale requires substantial engineering to maintain large knowledge corpora and efficient retrieval indices. Systems must handle millions or billions of documents, demanding significant computational resources, efficient indexing, distributed computing infrastructure, and cost management strategies [21]. Efficient indexing methods, caching, and multi-tier retrieval approaches (such as cascaded retrieval) become essential at scale, especially in large deployments like web search engines.
[2] score=0.0313 doc=docs_ingestor/docs/arxiv/2507.18910.pdf chunk=S22::C42::251104142800
text: 7 Challenges of RAG 7.5.5 Scalability Scalability challenges arise as knowledge corpora expand. Advanced indexing, distributed retrieval, and approximate nearest neighbor techniques facilitate efficient handling of large-scale knowledge bases [57]. Selective indexing and corpus curation, combined with infrastructure improvements like caching and parallel retrieval, allow RAG systems to scale to massive knowledge repositories. Research indicates that moderate-sized models augmented with large external corpora can outperform significantly larger standalone models, suggesting parameter efficiency advantages [10].
[3] score=0.0161 doc=docs_ingestor/docs/arxiv/2507.07695.pdf chunk=SDOC::SUM::251104135247
text: This paper proposes the KeyKnowledgeRAG (K2RAG) framework to enhance the efficiency and accuracy of Retrieval-Augment-Generate (RAG) systems. It addresses the high computational costs and scalability issues associated with naive RAG implementations by incorporating techniques such as knowledge graphs, a hybrid retrieval approach, and document summarization to reduce training times and improve answer accuracy. Evaluations show that K2RAG significantly outperforms traditional implementations, achieving greater answer similarity and faster execution times, thereby providing a scalable solution for companies seeking robust question-answering systems.
[4] score=0.0161 doc=docs_ingestor/docs/arxiv/2508.05100.pdf chunk=S3::C06::251104155301
text: Introduction Empirical analyses across multiple real-world benchmarks reveal that BEE-RAG fundamentally alters the entropy scaling laws governing conventional RAG systems, which provides a robust and scalable solution for RAG systems dealing with long-context scenarios. Our main contributions are summarized as follows: We introduce the concept of balanced context entropy, a novel attention reformulation that ensures entropy invariance across varying context lengths, and allocates attention to important segments. It addresses the critical challenge of context expansion in RAG.
[…]
```
可以看到,结果中包含了相关性较高的匹配,但也混入了一些无关内容,需要进一步过滤。
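上面的融合分数来自RRF(Reciprocal Rank Fusion):每条结果的得分是它在各路检索中 1/(k+排名) 的累加。下面是一个最小实现示意,输入为各路检索返回的、按相关性排好序的chunk_id列表,k=60是常用的平滑常数:

```python
def rrf_fuse(result_lists: list[list[str]], k: int = 60, top_n: int = 38) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            # 同一个块在多路结果中出现时,分数累加
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

例如,一个块在两路检索中都排第1,得分就是 2 × 1/61 ≈ 0.0328,与上面输出中的最高分吻合。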
在继续之前需要指出,这一步在追求极低延迟的场景下,很可能会被“砍掉”或重新设计。我们发现,大语言模型在生成能够有效召回相关信息的关键查询方面并不总是可靠。如果处理不当,反而会引入噪声。因此,应避免过度生成大量相似的查询。
引入重排序器
检索系统返回的结果质量参差不齐,因此大多数系统会引入一个重排序器(Re-ranker)。
重排序器接收多个文本块(chunk)以及原始用户查询,并为每个候选结果计算一个相关性分数。你可以有多种模型选择,甚至可以使用更轻量级的模型。在本实践中,我们使用 Cohere 提供的重排序器。
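调用方式大致如下。这是一个基于Cohere Python SDK的简化示意,阈值与top_n均为示例值:

```python
import os

import cohere

co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

def rerank_chunks(query: str, docs: list[str],
                  threshold: float = 0.35, top_n: int = 10) -> list[tuple[int, float]]:
    resp = co.rerank(model="rerank-english-v3.0", query=query, documents=docs, top_n=top_n)
    # 返回 (候选在 docs 中的原始索引, 相关性分数),只保留高于阈值的结果
    return [(r.index, r.relevance_score) for r in resp.results if r.relevance_score >= threshold]
```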
我们继续使用前面的查询进行测试:“Why is everyone saying RAG doesn’t scale? How are people fixing that?”
至此,我们已经完成了用户查询的改写、语义/混合搜索、结果融合,接下来将把得到的文本块送入重排序器进行处理。
re-ranker 返回如下:
[… optimizer… retrieval… fuse…]
Rerank 摘要:
* 策略: cohere
* 模型: rerank-english-v3.0
* 候选文档数: 32
* 阈值以上合格数: 4
* 最终保留数: 4 (reranker_threshold=0.35)
重排序后相关文档 (4/32,得分 ≥ 0.35) 查询: why is everyone saying RAG doesn’t scale? how are people fixing that?
[1] 得分:0.7920
* 文档: docs_ingestor/docs/arxiv/2507.07695.pdf
* 块标识: S4::C08::251104135247
* 文本: 1 Introduction Scalability: Naive implementations of Retrieval-Augmented Generation (RAG) often rely on 16-bit floating-point large language models (LLMs) for the generation component. However, this approach introduces significant scalability challenges due to the increased memory demands required to host the LLM as well as longer inference times due to using a higher precision number type. To enable more efficient scaling, it is crucial to integrate methods or techniques that reduce the memory footprint and inference times of generator models. Quantized models offer more scalable solutions due to less computational requirements, hence when developing RAG systems we should aim to use quantized LLMs for more cost effective deployment as compared to a full fine-tuned LLM whose performance might be good but is more expensive to deploy due to higher memory requirements. A quantized LLM’s role in the RAG pipeline itself should be minimal and for means of rewriting retrieved information into a presentable fashion for the end users
[2] 得分:0.4749
* 文档: docs_ingestor/docs/arxiv/2507.18910.pdf
* 块标识: S22::C42::251104142800
* 文本: 7 Challenges of RAG 7.5.5 Scalability Scalability challenges arise as knowledge corpora expand. Advanced indexing, distributed retrieval, and approximate nearest neighbor techniques facilitate efficient handling of large-scale knowledge bases [57]. Selective indexing and corpus curation, combined with infrastructure improvements like caching and parallel retrieval, allow RAG systems to scale to massive knowledge repositories. Research indicates that moderate-sized models augmented with large external corpora can outperform significantly larger standalone models, suggesting parameter efficiency advantages [10].
[3] 得分:0.4304
* 文档: docs_ingestor/docs/arxiv/2507.18910.pdf
* 块标识: S22::C05::251104142800
* 文本: 7 Challenges of RAG 7.2.1 Scalability and Infrastructure Deploying RAG at scale requires substantial engineering to maintain large knowledge corpora and efficient retrieval indices. Systems must handle millions or billions of documents, demanding significant computational resources, efficient indexing, distributed computing infrastructure, and cost management strategies [21]. Efficient indexing methods, caching, and multi-tier retrieval approaches (such as cascaded retrieval) become essential at scale, especially in large deployments like web search engines.
[4] 得分:0.3556
* 文档: docs_ingestor/docs/arxiv/2509.13772.pdf
* 块标识: S11::C02::251104182521
* 文本: 7. Discussion and Limitations Scalability of RAGOrigin: We extend our evaluation by scaling the NQ dataset’s knowledge database to 16.7 million texts, combining entries from the knowledge database of NQ, HotpotQA, and MS-MARCO. Using the same user questions from NQ, we assess RAGOrigin’s performance under larger data volumes. As shown in Table 16, RAGOrigin maintains consistent effectiveness and performance even on this significantly expanded database. These results demonstrate that RAGOrigin remains robust at scale, making it suitable for enterprise-level applications requiring large
可以看到,该方法确实能够筛选出一些可用的“种子”文本块。请注意,当前数据集仅包含 150 篇文档。
返回结果中包含了同一文档的多个文本块。我们将在后续构建上下文时处理这一问题。若需限制“唯一文档”的数量而非文本块总数,可在此处添加相应的自定义逻辑。
我们使用另一个问题来测试重排序器:“hallucinations in RAG vs normal LLMs and how to reduce them”。
[… optimizer… retrieval… fuse…]
重排序摘要:
– 策略: cohere
– 模型: rerank-english-v3.0
– 候选文档数: 35
– 阈值以上合格数: 12
– 最终保留数: 5 (阈值=0.2)
重排序后相关文档 (5/35 保留,得分 ≥ 0.2),查询问题:hallucinations in rag vs normal llms and how to reduce them
[1] 得分=0.9965 文档=docs_ingestor/docs/arxiv/2508.19614.pdf 块=S7::C03::251104164901
文本: 5 Related Work Hallucinations in LLMs Hallucinations in LLMs refer to instances where the model generates false or unsupported information not grounded in its reference data [42]. Existing mitigation strategies include multi-agent debating, where multiple LLM instances collaborate to detect inconsistencies through iterative debates [8, 14]; self-consistency verification, which aggregates and reconciles multiple reasoning paths to reduce individual errors [53]; and model editing, which directly modifies neural network weights to correct systematic factual errors [62, 19]. While RAG systems aim to ground responses in retrieved external knowledge, recent studies show that they still exhibit hallucinations, especially those that contradict the retrieved content [50]. To address this limitation, our work conducts an empirical study analyzing how LLMs internally process external knowledge
[2] 得分=0.9342 文档=docs_ingestor/docs/arxiv/2508.05509.pdf 块=S3::C01::251104160034
文本: Introduction Large language models (LLMs), like Claude (Anthropic 2024), ChatGPT (OpenAI 2023) and the Deepseek series (Liu et al. 2024), have demonstrated remarkable capabilities in many real-world tasks (Chen et al. 2024b; Zhou et al. 2025), such as question answering (Allam and Haggag 2012), text comprehension (Wright and Cervetti 2017) and content generation (Kumar 2024). Despite the success, these models are often criticized for their tendency to produce hallucinations, generating incorrect statements on tasks beyond their knowledge and perception (Ji et al. 2023; Zhang et al. 2024). Recently, retrieval-augmented generation (RAG) (Gao et al. 2023; Lewis et al. 2020) has emerged as a promising solution to alleviate such hallucinations. By dynamically leveraging external knowledge from textual corpora, RAG enables LLMs to generate more accurate and reliable responses without costly retraining (Lewis et al. 2020; Figure 1: Comparison of three paradigms. LAG exhibits greater lightweight properties compared to GraphRAG while
[3] 得分=0.9030 文档=docs_ingestor/docs/arxiv/2509.13702.pdf 块=S3::C01::251104182000
文本: ABSTRACT Hallucination remains a critical barrier to the reliable deployment of Large Language Models (LLMs) in high-stakes applications. Existing mitigation strategies, such as Retrieval-Augmented Generation (RAG) and post-hoc verification, are often reactive, inefficient, or fail to address the root cause within the generative process. Inspired by dual-process cognitive theory, we propose D ynamic S elfreinforcing C alibration for H allucination S uppression (DSCC-HS), a novel, proactive framework that intervenes directly during autoregressive decoding. DSCC-HS operates via a two-phase mechanism: (1) During training, a compact proxy model is iteratively aligned into two adversarial roles-a Factual Alignment Proxy (FAP) and a Hallucination Detection Proxy (HDP)-through contrastive logit-space optimization using augmented data and parameter-efficient LoRA adaptation. (2) During inference, these frozen proxies dynamically steer a large target model by injecting a real-time, vocabulary-aligned steering vector (computed as the
[4] 得分=0.9007 文档=docs_ingestor/docs/arxiv/2509.09360.pdf 块=S2::C05::251104174859
文本: 1 Introduction Figure 1. Standard Retrieval-Augmented Generation (RAG) workflow. A user query is encoded into a vector representation using an embedding model and queried against a vector database constructed from a document corpus. The most relevant document chunks are retrieved and appended to the original query, which is then provided as input to a large language model (LLM) to generate the final response. Corpus Retrieved_Chunks Vectpr DB Embedding model Query Response LLM Retrieval-Augmented Generation (RAG) [17] aims to mitigate hallucinations by grounding model outputs in retrieved, up-to-date documents, as illustrated in Figure 1. By injecting retrieved text from re- a
[5] 得分=0.8986 文档=docs_ingestor/docs/arxiv/2508.04057.pdf 块=S20::C02::251104155008
文本: Parametric knowledge can generate accurate answers. Effects of LLM hallucinations. To assess the impact of hallucinations when large language models (LLMs) generate answers without retrieval, we conduct a controlled experiment based on a simple heuristic: if a generated answer contains numeric values, it is more likely to be affected by hallucination. This is because LLMs are generally less reliable when producing precise facts such as numbers, dates, or counts from parametric memory alone (Ji et al. 2023; Singh et al. 2025). We filter out all directly answered queries (DQs) whose generated answers contain numbers, and we then rerun our DPR-AIS for these queries (referred to Exclude num ). The results are reported in Tab. 5. Overall, excluding numeric DQs results in slightly improved performance. The average exact match (EM) increases from 35.03 to 35.12, and the average F1 score improves from 35.68 to 35.80. While these gains are modest, they come with an increase in the retriever activation (RA) ratio-from 75.5% to 78.1%.
这条查询同样表现良好(尤其是在查看完整文本块时)。
我们再尝试一个更“脏”的查询:“why is the llm lying and rag help with this?”
[… 查询优化器 …]
原始查询:
why is the llm lying and rag help with this?
生成的查询:
– 语义查询: explore reasons for LLM inaccuracies
– 混合查询: RAG techniques for LLM truthfulness
[… 检索 … 融合 …]
重排序摘要:
– 策略: cohere
– 模型: rerank-english-v3.0
– 候选文档数: 39
– 阈值以上合格数: 39
– 保留数: 6 (阈值=0)
重排序后的相关文档(6/39 保留 ≥ 0)前 6 条,对应查询: why is the llm lying and rag help with this?
[1] 得分=0.0293 文档=docs_ingestor/docs/arxiv/2507.05714.pdf 块=S3::C01::251104134926
文本:1 Introduction Retrieval Augmentation Generation (hereafter referred to as RAG) helps large language models (LLMs) (OpenAI et al., 2024) reduce hallucinations (Zhang et al., 2023) and access real-time data 1 *Equal contribution.
[2] 得分=0.0284 文档=docs_ingestor/docs/arxiv/2508.15437.pdf 块=S3::C01::251104164223
文本:1 Introduction Large language models (LLMs) augmented with retrieval have become a dominant paradigm for knowledge-intensive NLP tasks. In a typical retrieval-augmented generation (RAG) setup, an LLM retrieves documents from an external corpus and conditions generation on the retrieved evidence (Lewis et al., 2020b; Izacard and Grave, 2021). This setup mitigates a key weakness of LLMs-hallucination-by grounding generation in externally sourced knowledge. RAG systems now power open-domain QA (Karpukhin et al., 2020), fact verification (V et al., 2024; Schlichtkrull et al., 2023), knowledge-grounded dialogue, and explanatory QA.
[3] 得分=0.0277 文档=docs_ingestor/docs/arxiv/2509.09651.pdf 块=S3::C01::251104180034
文本:1 Introduction Large Language Models (LLMs) have transformed natural language processing, achieving state-of-the-art performance in summarization, translation, and question answering. However, despite their versatility, LLMs are prone to generating false or misleading content, a phenomenon commonly referred to as hallucination [9, 21]. While sometimes harmless in casual applications, such inaccuracies pose significant risks in domains that demand strict factual correctness, including medicine, law, and telecommunications. In these settings, misinformation can have severe consequences, ranging from financial losses to safety hazards and legal disputes.
[4] 得分=0.0087 文档=docs_ingestor/docs/arxiv/2507.07695.pdf 块=S4::C08::251104135247
文本:1 Introduction Scalability: Naive implementations of Retrieval-Augmented Generation (RAG) often rely on 16-bit floating-point large language models (LLMs) for the generation component. However, this approach introduces significant scalability challenges due to the increased memory demands required to host the LLM as well as longer inference times due to using a higher precision number type. To enable more efficient scaling, it is crucial to integrate methods or techniques that reduce the memory footprint and inference times of generator models. Quantized models offer more scalable solutions due to less computational requirements, hence when developing RAG systems we should aim to use quantized LLMs for more cost effective deployment as compared to a full fine-tuned LLM whose performance might be good but is more expensive to deploy due to higher memory requirements. A quantized LLM’s role in the RAG pipeline itself should be minimal and for means of rewriting retrieved information into a presentable fashion for the end users
在继续之前需要指出,当前使用的重排序器(re-ranker)在某些情况下表现并不理想(从返回的分数可以观察到这一点)。
有时,重排序器会判定某些文本块(chunk)无法回答用户问题,但实际上这些文本块是具备相关性的,至少作为后续扩展的“种子”是足够的。
通常,重排序器默认单个文本块自身就应能完整回答问题,但在我们的系统中,文本块只是被用作“种子”以进行扩展。因此,某些被重排序器给予低分的文本块,对我们而言可能已经足够。这也是为什么我们将相关性阈值设置得较低。
这里存在优化空间,例如,可以探索定制一个更贴合特定领域或任务需求的重排序器,以获得更精准的排序结果。
无论如何,经过此步骤,我们已经获得了一系列相关文档。接下来,我们将利用在数据摄入阶段写入的元数据(metadata),对这些种子文本块进行“扩展与外扩”,目的是为后续的大语言模型(LLM)提供充足的上下文信息,使其能够生成准确的回答。
构建上下文
现在,我们已获得种子文本块。我们将从 Redis 中提取更多相关信息,执行扩展操作,并最终构建出完整的上下文。
这一步逻辑显然更为复杂:你需要精心设计“选取哪些文本块以及如何选取”的策略(优先使用关联键,若无则使用邻近关系),并行地获取数据,并对获取到的文本块进行进一步的清洗与处理。
在汇集所有相关文本块(以及文档本身的元信息)之后,还需要进行整合操作,例如去除重复内容、为扩展过程设置最大步长限制,并明确标记哪些是“直接命中的”核心文本块,哪些是“扩展得到的”辅助文本块。
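邻居扩展与去重的核心逻辑大致如下(示意)。这里假设块按前文摄入示意中的 "chunk:{chunk_id}" 键以JSON形式存在Redis中,扩展上限也只是示例:

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def expand_seeds(seed_chunks: list[dict], max_extra: int = 10) -> list[dict]:
    seed_ids = {c["chunk_id"] for c in seed_chunks}
    seen = set(seed_ids)
    expanded = list(seed_chunks)
    for chunk in seed_chunks:
        neigh = chunk.get("section_neighbours", {})
        for cid in neigh.get("before", []) + neigh.get("after", []):
            if cid in seen or len(expanded) - len(seed_chunks) >= max_extra:
                continue
            raw = r.get(f"chunk:{cid}")
            if raw:
                expanded.append(json.loads(raw))
                seen.add(cid)
    # 标记哪些是直接命中的种子块、哪些是扩展得到的辅助块
    for c in expanded:
        c["is_seed"] = c["chunk_id"] in seed_ids
    return expanded
```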
Expanded context windows (Markdown ready):
Document #1 – Fusing Knowledge and Language: A Comparative Study of Knowledge Graph-Based Question Answering with LLMs
doc_id: doc::6371023da29b4bbe8242ffc5caf4a8cd
- Last Updated: 2025-11-04T17:44:07.300967+00:00
- Context: Comparative study on methodologies for integrating knowledge graphs in QA systems using LLMs.
- Content fetched inside document:
```text
[start on page 4]
LLMs in QA
The advent of LLMs has steered in a transformative era in NLP, particularly within the domain of QA. These models, pre-trained on massive corpora of diverse text, exhibit sophisticated capabilities in both natural language understanding and generation. Their proficiency in producing coherent, contextually relevant, and human-like responses to a broad spectrum of prompts makes them exceptionally well-suited for QA tasks, where delivering precise and informative answers is paramount. Recent advancements by models such as BERT [57] and ChatGPT [58], have significantly propelled the field forward. LLMs have demonstrated strong performance in open-domain QA scenarios-such as commonsense reasoning[20]-owing to their extensive embedded knowledge of the world. Moreover, their ability to comprehend and articulate responses to abstract or contextually nuanced queries and reasoning tasks [22] underscores their utility in addressing complex QA challenges that require deep semantic understanding. Despite their strengths, LLMs also pose challenges: they can exhibit contextual ambiguity or overconfidence in their outputs ('hallucinations')[21], and their substantial computational and memory requirements complicate deployment in resource-constrained environments.
RAG, fine tuning in QA
---------------------- this was the passage that we matched to the query -------------
LLMs also face problems when it comes to domain specific QA or tasks where they are needed to recall factual information accurately instead of just probabilistically generating whatever comes next. Research has also explored different prompting techniques, like chain-of-thought prompting[24], and sampling based methods[23] to reduce hallucinations. Contemporary research increasingly explores strategies such as fine-tuning and retrieval augmentation to enhance LLM-based QA systems. Fine-tuning on domain-specific corpora (e.g., BioBERT for biomedical text [17], SciBERT for scientific text [18]) has been shown to sharpen model focus, reducing irrelevant or generic responses in specialized settings such as medical or legal QA. Retrieval-augmented architectures such as RAG [19] combine LLMs with external knowledge bases, to try to further mitigate issues of factual inaccuracy and enable real-time incorporation of new information. Building on RAG's ability to bridge parametric and non-parametric knowledge, many modern QA pipelines introduce a lightweight re-ranking step [25] to sift through the retrieved contexts and promote passages that are most relevant to the query. However, RAG still faces several challenges. One key issue lies in the retrieval step itself-if the retriever fails to fetch relevant documents, the generator is left to hallucinate or provide incomplete answers. Moreover, integrating noisy or loosely relevant contexts can degrade response quality rather than enhance it, especially in high-stakes domains where precision is critical. RAG pipelines are also sensitive to the quality and domain alignment of the underlying knowledge base, and they often require extensive tuning to balance recall and precision effectively.
--------------------------------------------------------------------------------------
[end on page 5]
```
Document #2 – Each to Their Own: Exploring the Optimal Embedding in RAG
doc_id: doc::3b9c43d010984d4cb11233b5de905555
- Last Updated: 2025-11-04T14:00:38.215399+00:00
- Context: Enhancing Large Language Models using Retrieval-Augmented Generation techniques.
- Content fetched inside document:
1 Introduction
Large language models (LLMs) have recently accelerated the pace of transformation across multiple fields, including transportation (Lyu et al., 2025), arts (Zhao et al., 2025), and education (Gao et al., 2024), through various paradigms such as direct answer generation, training from scratch on different types of data, and fine-tuning on target domains. However, the hallucination problem (Henkel et al., 2024) associated with LLMs has confused people for a long time, stemming from multiple factors such as a lack of knowledge on the given prompt (Huang et al., 2025b) and a biased training process (Zhao, 2025).
Serving as a highly efficient solution, Retrieval-Augmented Generation (RAG) has been widely employed in constructing foundation models (Chen et al., 2024) and practical agents (Arslan et al., 2024). Compared to training methods like fine-tuning and prompt-tuning, its plug-and-play feature makes RAG an efficient, simple, and cost-effective approach. The main paradigm of RAG involves first calculating the similarities between a question and chunks in an external knowledge corpus, followed by incorporating the top K relevant chunks into the prompt to guide the LLMs (Lewis et al., 2020).
Despite the advantages of RAG, selecting the appropriate embedding models remains a crucial concern, as the quality of retrieved references directly influences the generation results of the LLM (Tu et al., 2025). Variations in training data and model architecture lead to different embedding models providing benefits across various domains. The differing similarity calculations across embedding models often leave researchers uncertain about how to choose the optimal one. Consequently, improving the accuracy of RAG from the perspective of embedding models continues to be an ongoing area of research.
To address this research gap, we propose two methods for improving RAG by combining the benefits of multiple embedding models. The first method is named Mixture-Embedding RAG, which sorts the retrieved materials from multiple embedding models based on normalized similarity and selects the top K materials as final references. The second method is named Confident RAG, where we first utilize vanilla RAG to generate answers multiple times, each time employing a different embedding model and recording the associated confidence metrics, and then select the answer with the highest confidence level as the final response. By validating our approach using multiple LLMs and embedding models, we illustrate the superior performance and generalization of Confident RAG, even though Mixture-Embedding RAG may lose to vanilla RAG. The main contributions of this paper can be summarized as follows:
We first point out that in RAG, different embedding models operate within their own prior domains. To leverage the strengths of various embedding models, we propose and test two novel RAG methods: Mixture-Embedding RAG and Confident RAG. These methods effectively utilize the retrieved results from different embedding models to their fullest extent.
While Mixture-Embedding RAG performs similarly to vanilla RAG, the Confident RAG method exhibits superior performance compared to both the vanilla LLM and vanilla RAG, with average improvements of 9.9% and 4.9%, respectively, when using the best confidence metric. Additionally, we discuss the optimal number of embedding models for the Confident RAG method based on the results.
总的上下文包含若干文档,大小约为 2–3k tokens。这可能会带来一些“冗余”,但我们选择不替 LLM 做过多“删减决策”,而是提供更完整的信息,使其能够“扫描整段文档”而非处理“孤立的片段”。
你可以在 GitHub 查看 5 条不同查询的完整处理管线。
在你的系统中,可以将此上下文作为状态,供 LLM 基于相同信息回答后续追问。
接入 LLM 生成答案(RAG 中的 “G”)
最后一步,我们观察 LLM 在获得这些上下文后如何生成答案。
我原本不打算加入此步骤,因为本文重点在于检索管线。但使用 LLM 来检验其能否将信息组织成条理清晰的内容,仍有其价值。
若使用更大的 LLM,延迟会显著增加。
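生成这一步本身只是把组装好的上下文连同引用编号交给LLM。下面是一个简化示意;提示词为示例,`reasoning_effort` 参数是“低推理模式”的假设写法,换用非推理模型时应去掉:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "你是一个检索问答助手。只依据提供的上下文回答,"
    "并用 [n] 标注引用对应的文档编号;上下文中没有的信息不要编造。"
)

def answer(query: str, context_markdown: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-5",              # 示例模型名
        reasoning_effort="low",     # 假设:低推理档位
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"上下文:\n{context_markdown}\n\n问题:{query}"},
        ],
    )
    return resp.choices[0].message.content
```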
我决定使用 GPT-5(低推理模式)进行测试,它成本较低但速度较慢。以下是针对查询 “why is the llm lying and can rag help with this?” 的处理结果:
```
[优化查询… 检索… 融合… 重排序… 扩展… 构建上下文…]
LLM 回答:
LLM 有时会“说谎”,因为它们是基于模式生成下一个可能出现的词语,而非基于已验证的事实。在没有信息来源支撑的情况下,它们可能生成自信但错误的陈述——即幻觉——并且其内置知识可能已过时或不完整 [6][3]。这在要求严格准确性的领域(如法律或电信法规)中尤其危险,微小的错误可能导致严重后果 [3]。
RAG 通过让模型首先从外部语料库中检索相关文档,然后基于该证据生成答案来提供帮助。这减少了幻觉并引入了最新信息 [2][6][1]。在实践中,针对特定领域定制的 RAG 已显著提高了准确性;例如,一个无线电法规 RAG 管线将 GPT-4o 的准确率比基础提示提高了 +11.9%,并实现了高达 97% 的检索准确率,从而支持更可靠的答案 [3]。
然而,RAG 的效果取决于其检索质量以及模型对检索文本的利用程度。静态的、现成的检索器在处理复杂任务(如多跳问题或程序性查询)时可能表现不佳,这些任务需要迭代查找或查询重构 [2]。朴素的 RAG 也可能让模型淹没在冗长、嘈杂的上下文中,或拉取语义相似但不相关的片段——即“大海捞针”问题——从而导致错误答案 [4]。
更好的 RAG 实践可以解决这些问题:针对 RAG 进行指令微调,教授模型过滤、合并多源信息,以及采用“先思考再回答”的 RAG 特定推理方法(HIRAG),可以改善对证据的利用 [1];自适应的、基于反馈的检索机制决定何时以及如何检索和重排序证据 [2];优化分块和检索的管线设计可以提高答案准确性 [4]。
如果幻觉问题仍然存在,超越 RAG 的直接解码引导技术可以进一步抑制它 [5]。
引用文档:
[1] doc::b0610cc6134b401db0ea68a77096e883 – HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation
[2] doc::53b521e646b84289b46e648c66dde56a – Test-time Corpus Feedback: From Retrieval to RAG
[3] doc::9694bd0124d0453c81ecb32dd75ab489 – Retrieval-Augmented Generation for Reliable Interpretation of Radio Regulations
[4] doc::6d7a7d88cfc04636b20931fdf22f1e61 – KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities
[5] doc::3c9a1937ecbc454b8faff4f66bdf427f – DSCC-HS: A Dynamic Self-Reinforcing Framework for Hallucination Suppression in Large Language Models
[6] doc::688cfbc0abdc4520a73e219ac26aff41 – A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions
```
可以看到,模型能够正确引用来源并有效利用所提供的信息。但由于我们使用的是 GPT-5,且上下文较大,延迟很高。
在我的测试环境中,GPT-5 生成首个 token 大约需要 9 秒(使用 GPT-5-mini 可降至 4–5 秒)。
如果整个检索管线耗时约为 3–5 秒,那么这最后的生成步骤将使总耗时再增加 2–3 倍。
有人主张减少送入上下文窗口的信息以降低延迟,但这与我们“扩展上下文”的初衷相悖。
也有人提出链式提示(chain prompting)方案:先用一个小型LLM抽取有用信息,再让一个大型LLM基于优化后的上下文进行回答——但这种方式在时间上能节省多少,我持保留态度。
还有人倾向于直接使用小型模型,以速度和成本为优先,牺牲一部分“智能”。在我们这种“将所有必要信息都提供给LLM”的场景下,小型模型或许也能胜任。
总而言之,如何优化系统是一个复杂的权衡过程。
关于延迟与成本的考量
主张将整篇文档直接送入LLM的人,往往并未对系统延迟进行“极致”优化。而延迟恰恰是用户体验的关键——用户不愿等待。
你可以通过一些UX技巧来改善等待体验,例如显示“正在搜索文档”、“正在定位相关段落”、“正在扩展上下文以提升理解”等状态提示。但如果你的检索管线耗时超过数秒,用户仍可能感到不满。
这也解释了为何业界开始转向“智能体化搜索”(Agentic search):扩大上下文窗口、使用LLM重写查询、自动路由链、子问题分解、多步智能体引擎……这些增强功能通常都会带来更高的延迟。
就我构建的这个系统(主要基于Codex搭建)而言,在Serverless环境下,检索步骤大约需要4秒。

这个速度确实偏慢(但成本低廉)。
你可以尝试优化每一步,压缩时间,并尽量保持服务“热”状态。然而,当依赖外部API时,你并不总能控制其响应速度。
有人建议自托管更小的模型来处理优化和路由任务,但这会引入额外的托管成本,每月轻松增加数百美元。
最后的LLM调用是延迟的主要贡献者(正如前文所示)。
不过,这条管线单次运行(不含缓存)的成本约为0.0121美元。如果一个组织每天提出200个问题,使用GPT-5级别的模型,每日费用约为2.42美元。
如果将主LLM替换为GPT-5-mini这类更小模型,单次管线运行成本可降至约0.0041美元,200次查询每日费用约为0.82美元。
对于重复性问题,还可以通过缓存进一步降低成本。
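一个最简单的做法是按“规范化查询的哈希”做带TTL的答案缓存,示意如下(键前缀与24小时过期时间均为示例,且只对完全相同的问题生效):

```python
import hashlib

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_key(query: str) -> str:
    normalized = " ".join(query.lower().split())
    return "answer:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def get_or_answer(query: str, answer_fn, ttl_seconds: int = 86400) -> str:
    key = cache_key(query)
    cached = r.get(key)
    if cached is not None:
        return cached           # 命中缓存,直接返回
    result = answer_fn(query)   # 未命中时走完整管线
    r.setex(key, ttl_seconds, result)
    return result
```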
至于文档向量化(embedding),我使用OpenAI的大型模型处理了约200篇PDF,花费约0.5美元。随着规模扩大,这项成本会增长,因此需要考虑是否改用更小或专用的模型,或者采用两阶段策略:先用关键词搜索缩小候选范围,再进行语义搜索。
关于本系统的几点补充说明
由于我们目前只处理了最新的RAG相关论文,当系统规模扩大后,可以引入更多机制来增强其稳健性。
首先需要提醒:许多“真实问题”只有在文档量达到一定规模后才会显现。在几百篇文档上运行良好的方案,一旦扩展到上万篇,可能立刻出现问题。
你可以让优化器设置过滤器,例如基于语义匹配主题。也可以设置日期过滤器,确保信息的新鲜度;或在重排序(re-rank)阶段引入“权威度”信号来提升某些高可信度来源的排名。
有些团队会走得更远,设计自定义的打分函数,来决定哪些文档应该被呈现以及如何排序,这完全取决于你的特定语料库。
如果你需要处理数千份文档,或许应该在文档摄入阶段尽量避免调用LLM,而是将LLM分析放在检索阶段,仅在收到查询时才对相关文档进行分析。分析结果可以缓存起来供后续查询复用。
最后,务必建立合适的评估体系(evals),以衡量检索质量和答案的“有据可依”程度,尤其是在你为了成本优化而切换模型时。
如果你读到这里,可以问问自己:构建这样一套系统是否值得?
后续,我将尝试展示“基础RAG”与“优化分块+上下文扩展”方案在输出质量上的对比,以便我们更直观地看到差异。
