

news 2025/12/26 15:32:20

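Contents

- Preparation
- Loading documents
- Splitting documents
- Embedding
- Vector store
- Querying the vector store
  - Retrieval
  - Returning scores
  - Embedding the query first, then retrieving
- Retriever
- Summary
- Code

The searches we run in Baidu, Bing, Google and other search engines are string-based: after the user types a string, the engine tokenizes the query and then looks for the best-matching results in a huge database that has already been organized as an inverted index.

Semantic retrieval differs mainly in that it searches by the actual meaning of the text. The basic idea is to turn all the content to be searched into vectors, a process also called embedding. The guiding principle of this conversion is that semantically similar content ends up closer together and therefore scores higher in similarity. When the user enters a query, the query is likewise turned into a vector, and the most similar documents are then looked up in the vector database. Results retrieved this way are ranked by semantic similarity rather than by literal wording.

This article describes how to use LangChain, a large language model and a vector database to do semantic retrieval over the content of a PDF. For vectorization it uses nomic-embed-text, a small model that embeds English text well.

The following topics are covered:

- Documents and document loaders
- Text splitters
- Embeddings
- Vector stores and retrievers

Preparation

Before writing any code, we need to set up the programming environment.

- Computer
  All code in this article can run without a GPU (no video memory is required). The machine I used:
  CPU: Intel i5-8400 2.80GHz
  Memory: 16GB
- Visual Studio Code and venv
  Visual Studio Code is a very popular development tool, and the code for this series can be developed and debugged in it. We use Python's venv to create the virtual environment; see the article "Configuring venv in Visual Studio Code".
- Ollama
  Deploying local large models on Ollama is very convenient. On top of it, LangChain can use llama3.1, qwen2.5 and many other local models; see the article "Using a locally deployed llama3.1 model in LangChain".

Loading documents

LangChain implements a Document abstraction, and files such as PDF, CSV and HTML can all be loaded as Document objects. A Document has three attributes:

- page_content: a string holding the content
- metadata: a dictionary holding arbitrary metadata
- id: (optional) a string identifier for the document

The metadata attribute can record where the document came from, how it relates to other documents, and other information. Note that a single Document does not necessarily correspond to a complete unit of the source file; it is usually only one part of it.

The code below loads a PDF document:

    def load_file(file_path):
        """Load a PDF file."""
        # Loading documents
        from langchain_community.document_loaders import PyPDFLoader

        loader = PyPDFLoader(file_path)
        docs = loader.load()
        print(f"File loaded; number of documents: {len(docs)}")
        # PyPDFLoader loads one Document object per PDF page. The first page is at index 0.
        print(f"page one:\n{docs[0].page_content[:200]}\n")
        print(f"page one metadata:\n{docs[0].metadata}")
        return docs

After running this function we can see the basic structure of the loaded documents:

    page one:
    Table of Contents UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C.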
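    20549 FORM 10-K (Mark One) ☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934 F

    page one metadata:
    {'source': 'E:\\project\\my_opensource\\programming-with-local-large-language-model-gitee\\server\\services\\practice\\assert/nke-10k-2023.pdf', 'page': 0, 'page_label': '1'}

LangChain provides a large number of document loaders; see Document loaders.

Splitting documents

If a whole Document (here, one PDF page) is used as the unit of vectorization, the granularity is usually too coarse, and in scenarios such as question answering it is hard to find a good match. We therefore split these Documents further, trying to make sure that the meaning of each part is not "diluted" by the surrounding text.

Below we use RecursiveCharacterTextSplitter, which recursively splits a document on common separators (such as newlines) until each chunk has a suitable size. Its parameters mean the following:

- chunk_size=1000
  The target size of each chunk. With 1000, each chunk after splitting is at most about 1000 characters long.
- chunk_overlap=200
  The number of characters shared between neighbouring chunks. With 200, two adjacent chunks overlap by up to 200 characters. The overlap keeps context that straddles a chunk boundary from being lost, which helps later understanding and processing of the text.
- add_start_index=True
  A boolean. When it is True, the splitter records in each chunk's metadata, under the key start_index, the position in the original document at which that chunk starts. This is useful for later processing and citation, because each chunk can be located in the original document.

    def split_text(docs):
        """Split the documents into chunks."""
        from langchain_text_splitters import RecursiveCharacterTextSplitter

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000, chunk_overlap=200, add_start_index=True
        )
        all_splits = text_splitter.split_documents(docs)
        print(f"Number of splits: {len(all_splits)}")
        return all_splits

Embedding

Vector search is a common way to store and search unstructured data such as unstructured text. The idea is to store a numeric vector associated with each piece of text. Given a query, we embed it as a vector of the same dimension and use a vector similarity metric, such as cosine similarity, to identify the relevant text.

Here we use the nomic-embed-text model served by Ollama for embedding. LangChain supports many embedding models; see Embedding models.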
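To make "vector similarity" concrete, here is a minimal sketch of cosine similarity in plain Python. It is only an illustration under stated assumptions: the function name cosine_similarity is my own, and the three-dimensional vectors are hand-written toy values, not real nomic-embed-text output (real embeddings have hundreds of dimensions).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hand-written toy vectors (hypothetical, for illustration only).
query = [1.0, 0.0, 1.0]
close_doc = [0.9, 0.1, 0.8]   # points in nearly the same direction as the query
far_doc = [0.0, 1.0, 0.0]     # orthogonal to the query

print(cosine_similarity(query, close_doc) > cosine_similarity(query, far_doc))
```

The value ranges from -1 to 1, and a vector store ranks documents by this kind of score: the closer the direction of a document vector is to the query vector, the higher it ranks.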
The embedding object is created like this:

    from langchain_ollama.embeddings import OllamaEmbeddings

    embeddings = OllamaEmbeddings(model="nomic-embed-text")

Vector store

For simplicity we use InMemoryVectorStore here, which keeps the vectors in memory. The vectors can of course also be persisted to disk and reused at any time; a later article will demonstrate that with Chroma.

    def get_vector_store():
        """Build the in-memory vector store."""
        from langchain_core.vectorstores import InMemoryVectorStore

        vector_store = InMemoryVectorStore(embeddings)
        # get_file_path() is a helper from the accompanying project; it is not shown in this article.
        file_path = get_file_path()
        docs = load_file(file_path)
        all_splits = split_text(docs)
        _ = vector_store.add_documents(documents=all_splits)
        return vector_store

Querying the vector store

Texts with similar meanings produce vectors that are geometrically close. We can retrieve relevant information simply by passing in a question, without knowing any specific keywords used in the document.

Retrieval

Define the retrieval function:

    def similarity_search(query):
        """Similarity search against the in-memory vector store."""
        vector_store = get_vector_store()
        results = vector_store.similarity_search(query)
        return results

Test the retrieval:

    results = similarity_search("How many distribution centers does Nike have in the US?")
    print(f"similarity_search results[0]:\n{results[0]}")

Output:

    similarity_search results[0]:
    page_content='direct to consumer operations sell products through the following number of retail stores in the United States: U.S. RETAIL STORES NUMBER NIKE Brand factory stores 213 NIKE Brand in-line stores (including employee-only stores) 74 Converse stores (including factory stores) 82 TOTAL 369 In the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information. 2023 FORM 10-K 2' metadata={'source': 'E:\\project\\my_opensource\\programming-with-local-large-language-model-gitee\\server\\services\\practice\\assert/nke-10k-2023.pdf', 'page': 4, 'page_label': '5', 'start_index': 3125}

Returning scores

Define the retrieval function:

    def similarity_search_with_score(query):
        """Similarity search that also returns a score; a higher score means a more similar document."""
        vector_store = get_vector_store()
        results = vector_store.similarity_search_with_score(query)
        return results

Test the retrieval:

    results = similarity_search_with_score("What was Nike's revenue in 2023?")
    doc, score = results[0]
    print(f"Score and doc: {score}\n{doc}")

Output:

    Score and doc: 0.800869769173528
    page_content='UNITED STATES MARKET For fiscal 2023, NIKE Brand and Converse sales in the United States accounted for approximately 43% of total revenues, compared to 40% and 39% for fiscal 2022 and fiscal 2021, respectively. We sell our products to thousands of retail accounts in the United States, including a mix of footwear stores, sporting goods stores, athletic specialty stores, department stores, skate, tennis and golf shops and other retail accounts. In the United States, we utilize NIKE sales offices to solicit such sales. During fiscal 2023, our three largest United States customers accounted for approximately 22% of sales in the United States. Our NIKE Direct and Converse direct to consumer operations sell our products to consumers through various digital platforms. In addition, our NIKE Direct and Converse direct to consumer operations sell products through the following number of retail stores in the United States: U.S. RETAIL STORES NUMBER NIKE Brand factory stores 213' metadata={'source': 'E:\\project\\my_opensource\\programming-with-local-large-language-model-gitee\\server\\services\\practice\\assert/nke-10k-2023.pdf', 'page': 4, 'page_label': '5', 'start_index': 2311}

Embedding the query first, then retrieving

Define the retrieval function:

    def embed_query(query):
        """Embed the query explicitly, then search by vector."""
        embedding = embeddings.embed_query(query)
        vector_store = get_vector_store()
        results = vector_store.similarity_search_by_vector(embedding)
        return results

Test the retrieval:

    results = embed_query("How were Nike's margins impacted in 2023?")
    print(f"embed_query results[0]:\n{results[0]}")

Output:

    embed_query results[0]:
    page_content='and 18% of total NIKE Brand footwear, respectively. For fiscal 2023, four footwear contract manufacturers each accounted for greater than 10% of footwear production and in the aggregate accounted for approximately 58% of NIKE Brand footwear production. As of May 31, 2023, our contract manufacturers operated 291 finished goods apparel factories located in 31 countries.
    For fiscal 2023, NIKE Brand apparel finished goods were manufactured by 55 contract manufacturers, many of which operate multiple factories. The largest single finished goods apparel factory accounted for approximately 8% of total fiscal 2023 NIKE Brand apparel production. For fiscal 2023, factories in Vietnam, China and Cambodia manufactured approximately 29%, 18% and 16% 2023 FORM 10-K 3' metadata={'source': 'E:\\project\\my_opensource\\programming-with-local-large-language-model-gitee\\server\\services\\practice\\assert/nke-10k-2023.pdf', 'page': 5, 'page_label': '6', 'start_index': 3956}

Retriever

LangChain VectorStore objects are not subclasses of Runnable. LangChain Retrievers are Runnables, so they implement a standard set of methods, for example synchronous and asynchronous invocation and batch operations. Once a VectorStore has been turned into a Retriever, vector database operations can be added to a LangChain chain, which is convenient when implementing features such as RAG (Retrieval-Augmented Generation).

    from typing import List

    from langchain_core.documents import Document
    from langchain_core.runnables import chain


    @chain
    def retriever(query: str) -> List[Document]:
        # The @chain decorator turns this function into a Runnable.
        vector_store = get_vector_store()
        return vector_store.similarity_search(query, k=1)


    def retriever_batch_1(query: List[str]):
        r = retriever.batch(query)
        return r

Let's test it:

    query = [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ]

    results = retriever_batch_1(query)
    print(f"retriever.batch 1:\n{results}")

Output:

    retriever.batch 1:
    [[Document(id='a26e4349-108c-4988-8502-ff9cce20cdf3', metadata={'source': 'E:\\project\\my_opensource\\programming-with-local-large-language-model-gitee\\server\\services\\practice\\assert/nke-10k-2023.pdf', 'page': 4, 'page_label': '5', 'start_index': 3125}, page_content='direct to consumer operations sell products through the following number of retail stores in the United States:\nU.S. RETAIL STORES NUMBER\nNIKE Brand factory stores 213 \nNIKE Brand in-line stores (including employee-only stores) 74 \nConverse stores (including factory stores) 82 \nTOTAL 369 \nIn the United States, NIKE has eight significant distribution centers. Refer to Item 2.
    Properties for further information.\n2023 FORM 10-K 2')], [Document(id='872d6f81-3aa1-4aaa-ba2d-2d4eac29e661', metadata={'source': 'E:\\project\\my_opensource\\programming-with-local-large-language-model-gitee\\server\\services\\practice\\assert/nke-10k-2023.pdf', 'page': 3, 'page_label': '4', 'start_index': 714}, page_content='and sales through our digital platforms (also referred to as NIKE Brand Digital), to retail accounts and to a mix of independent distributors, licensees and sales\nrepresentatives in nearly all countries around the world. We also offer interactive consumer services and experiences through our digital platforms. Nearly all of our\nproducts are manufactured by independent contractors. Nearly all footwear and apparel products are manufactured outside the United States, while equipment products\nare manufactured both in the United States and abroad.\nAll references to fiscal 2023, 2022, 2021 and 2020 are to NIKE, Inc.\'s fiscal years ended May 31, 2023, 2022, 2021 and 2020, respectively. Any references to other fiscal\nyears refer to a fiscal year ending on May 31 of that year.\nPRODUCTS\nOur NIKE Brand product offerings are aligned around our consumer construct focused on Men\'s, Women\'s and Kids\'. We also design products specifically for the Jordan')]]

Vector stores implement an as_retriever method that produces a Retriever. The code below does the same job as retriever_batch_1 above:

    def retriever_batch_2(query: List[str]):
        vector_store = get_vector_store()
        retriever = vector_store.as_retriever(
            search_type="similarity",
            search_kwargs={"k": 1},
        )
        r = retriever.batch(query)
        return r

Summary

In short, token-based retrieval focuses on matching the surface form of words, while semantic retrieval focuses on a deeper understanding of the intent of the query and the content of the documents. As the technology develops, semantic retrieval shows greater advantages in handling complex queries and returning more precise information.

LangChain works like glue: it easily combines the capabilities of a vector database and a large language model so that a stable application can be put together quickly.

Code

All the code and related resources for this article have been shared; see GitHub and Gitee.

Good luck!
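As a closing illustration of the retrieval flow used throughout this article (store vectors, score them against a query vector, return the top k), here is a dependency-free sketch. It is not LangChain's implementation: ToyVectorStore, its method names and the three-dimensional vectors are hypothetical stand-ins chosen for illustration, and a real store would get its vectors from an embedding model.

```python
import math
from typing import List, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Tiny stand-in for an in-memory vector store (illustrative only)."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, List[float]]] = []

    def add(self, text: str, vector: Sequence[float]) -> None:
        """Store a text together with its (precomputed) vector."""
        self._items.append((text, list(vector)))

    def search_with_score(
        self, query_vector: Sequence[float], k: int = 1
    ) -> List[Tuple[str, float]]:
        """Return the k stored texts most similar to the query, best first."""
        scored = [(text, _cosine(vec, query_vector)) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


# Hand-written toy vectors standing in for real embeddings.
store = ToyVectorStore()
store.add("distribution centers", [0.9, 0.1, 0.0])
store.add("revenue by region", [0.1, 0.9, 0.0])
store.add("apparel factories", [0.0, 0.2, 0.9])

top = store.search_with_score([1.0, 0.0, 0.1], k=2)
for text, score in top:
    print(f"{score:.3f}  {text}")
```

The k argument plays the same role as k=1 in the retriever examples: it limits how many of the best-scoring chunks are returned.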


http://www.w-s-a.com/news/251462/