Tutorial: Using Microsoft's Open-Source GraphRAG (Comprehensive and Detailed)
Founder
2024-11-30 16:05:30

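Introduction to GraphRAG

Microsoft has open-sourced the complete GraphRAG project code. For some downstream LLM tasks, you can use GraphRAG to improve the RAG performance of your own application. The project can be used in two ways:

  1. Run it as a packaged release, which is good for trying it out.
  2. Run it from source, which suits cases where you need to customize it for a downstream task.

If you need to deploy a local LLM with Ollama, see my other blog post.
Below are tutorials for both approaches, based on my own hands-on testing. If you have any questions, leave them in the comments.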

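I. Running from source (easier to modify later)

1. Prepare the environment (run in a terminal)

(1) Create a virtual environment (this assumes Anaconda is already installed). Python 3.11 is recommended:

conda create -n GraphRAG python=3.11
conda activate GraphRAG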

2. Download the source code and enter the directory

git clone https://github.com/microsoft/graphrag.git
cd graphrag

3. Install dependencies and initialize the project

(1) Install the Poetry package manager and the project dependencies:

pip install poetry
poetry install

(2) Initialize

poetry run poe index --init --root .

If this runs correctly, it generates output, prompts, .env, and settings.yaml in the graphrag directory.
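
To confirm that initialization produced everything listed above, a small check like the following can help. It is only a sketch: the four names come from this tutorial, the helper name missing_init_artifacts is made up for illustration, and other GraphRAG versions may generate a different set of files.

```python
from pathlib import Path

# Entries this tutorial says should exist after initialization.
# Assumption: other GraphRAG versions may create a different set.
EXPECTED_AFTER_INIT = ("output", "prompts", ".env", "settings.yaml")


def missing_init_artifacts(root: str = ".") -> list:
    """Return the expected entries that are absent under `root`."""
    root_path = Path(root)
    return [name for name in EXPECTED_AFTER_INIT if not (root_path / name).exists()]
```

Calling missing_init_artifacts(".") from the graphrag directory returns an empty list when all four entries are present. Remember that .env is hidden, so a plain ls will not show it.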

4. Download the documents to be indexed and put them in the ./input/ directory

mkdir ./input
curl https://www.xxx.com/xxx.txt > ./input/book.txt  # example URL; any .txt file will do
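
The settings.yaml used later in this tutorial selects input files with file_pattern ".*\\.txt$". The sketch below previews which files in ./input would match that pattern. It assumes the pattern is applied to the bare file name, which may not be exactly how GraphRAG applies it, and the helper name matching_input_files is hypothetical.

```python
import re
from pathlib import Path

# Pattern copied from the input.file_pattern entry in this tutorial's settings.yaml
# (the YAML string ".*\\.txt$" unescapes to the regex below).
FILE_PATTERN = r".*\.txt$"


def matching_input_files(input_dir: str = "input", pattern: str = FILE_PATTERN) -> list:
    """List file names in `input_dir` whose name matches `pattern`."""
    regex = re.compile(pattern)
    directory = Path(input_dir)
    if not directory.is_dir():
        return []
    # Assumption: match against the file name only, not the full path.
    return sorted(p.name for p in directory.iterdir() if p.is_file() and regex.match(p.name))
```

If this returns an empty list, the indexer would most likely find nothing to load, so check the file extension before starting a long indexing run.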

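5. Edit the configuration files

(1) Set api_key in the .env file (hidden by default)

vi .env  # open .env and replace the value with your own api_key

This is a global setting, so you do not need to change it again later.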
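
Since a wrong or missing key is the most common cause of failures later on, it can be worth checking .env before indexing. The following is a minimal sketch. The variable name GRAPHRAG_API_KEY is the one referenced in settings.yaml; the helper names and the assumption that an unedited file holds an angle-bracket placeholder are mine.

```python
from pathlib import Path


def read_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def api_key_is_set(env: dict, name: str = "GRAPHRAG_API_KEY") -> bool:
    """True when the key is non-empty and is not an angle-bracket placeholder."""
    value = env.get(name, "")
    # Assumption: an unedited .env contains something like <API_KEY>.
    return bool(value) and not (value.startswith("<") and value.endswith(">"))
```

For example, api_key_is_set(read_env_file(".env")) returns False until a real key has been filled in.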

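(2) Edit settings.yaml to set the LLM model and the corresponding api_base

A note up front: GraphRAG calls the LLM and the embedding model many times, and the default is OpenAI's GPT-4, which is extremely expensive (if cost is no concern, ignore this and leave the configuration alone). I recommend using another model or the API of a Chinese LLM provider.

I use an API key from agicto (mainly because new users get 10 yuan of free credit on sign-up). I only changed the API address and the model names. The complete settings file after my changes is shown below.

Lines with a trailing marker comment are the ones to edit. If you are also using agicto, you can use this settings.yaml as is.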

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: deepseek-chat # changed
  model_supports_json: false # recommended if this is available for your model.
  api_base: https://api.agicto.cn/v1 # changed
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_version: 2024-02-15-preview
  # organization:
  # deployment_name:
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small # changed
    api_base: https://api.agicto.cn/v1 # changed
    # api_base: https://.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization:
    # deployment_name:
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string:
  # container_name:

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string:
  # container_name:

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string:
  # container_name:

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
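
The chunks section (size: 300, overlap: 100) largely determines how many LLM calls the indexing step makes, so it helps to see what these two numbers mean. The sketch below shows a generic sliding window over a token list. It is an illustration of the idea only, not GraphRAG's own chunking code: GraphRAG counts tokens with the cl100k_base encoding, while this example treats each list item as one token, and chunk_tokens is a made-up name.

```python
def chunk_tokens(tokens: list, size: int = 300, overlap: int = 100) -> list:
    """Split `tokens` into windows of `size` items; neighbours share `overlap` items."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each new window starts this many tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the current window already reaches the end of the input
    return chunks
```

With the settings above, each window starts 200 tokens after the previous one, so a 700-token text yields three chunks under this scheme. A smaller size or a larger overlap means more chunks and therefore more API calls.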

6. Build the GraphRAG index (this takes a while, depending on the length of the documents)

poetry run poe index --root .    

On success, the output looks like this:

⠋ GraphRAG Indexer
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
├── create_final_entities
├── create_final_nodes
├── create_final_communities
├── join_text_units_to_entity_ids
├── create_final_relationships
├── join_text_units_to_relationship_ids
├── create_final_community_reports
├── create_final_text_units
├── create_base_documents
└── create_final_documents
🚀 All workflows completed successfully.
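
With the settings.yaml above, each run writes its results to output/${timestamp}/artifacts. To find the most recent run, something like the following sketch can be used. It assumes the timestamp directory names sort chronologically when sorted as text, and latest_artifacts_dir is a hypothetical helper, not part of GraphRAG.

```python
from pathlib import Path
from typing import Optional


def latest_artifacts_dir(root: str = ".") -> Optional[Path]:
    """Return <root>/output/<newest run>/artifacts, or None if no run exists."""
    output_dir = Path(root) / "output"
    if not output_dir.is_dir():
        return None
    # Assumption: run directory names (timestamps) sort in chronological order.
    runs = sorted(p for p in output_dir.iterdir() if (p / "artifacts").is_dir())
    return runs[-1] / "artifacts" if runs else None
```

Listing the returned directory, for example with sorted(p.name for p in latest_artifacts_dir(".").iterdir()), shows the files the workflows produced. A run with no artifacts directory usually means indexing failed early, in which case the reports directory next to it is the place to look.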

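7. Run queries

GraphRAG provides two query methods.
1) Global search: focuses on understanding the document as a whole

poetry run poe query --root . --method global "What is this document mainly about?"

On success, you will see the answer in the output.

2) Local search: focuses on details

poetry run poe query --root . --method local "What is this document mainly about?"

On success, you will see the answer in the output.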

8. Summary

The steps above have all been verified. If you hit an error, check that api_key and api_base are configured correctly.

II. Running from the Python package (quick trial)

1. Install the package

pip install graphrag

2. Initialize the project

Create a temporary folder, myTest, to hold the runtime data

mkdir -p ./myTest/input
curl https://www.xxx.com/xxx.txt > ./myTest/input/book.txt  # example URL; put in whatever .txt file you want to test
cd ./myTest
python -m graphrag.index --init

3. Configure the relevant files (see the configuration steps above)

4. Run and build the graph index

python -m graphrag.index 

5. Run queries

1) Global search

python -m graphrag.query --root ../myTest --method global "What is this article mainly about?"

2) Local search

python -m graphrag.query --root ../myTest --method local "What is this article mainly about?"
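
If you want to run these queries from a script instead of typing them by hand, the commands can be wrapped with subprocess. This is a rough sketch that only reuses the flags shown in this tutorial (--root, --method); build_query_command and run_query are illustrative names, and the module path may differ in other GraphRAG versions.

```python
import subprocess
import sys


def build_query_command(question: str, method: str = "global", root: str = ".") -> list:
    """Build the argument list for the query command used in this tutorial."""
    if method not in ("global", "local"):
        raise ValueError("method must be 'global' or 'local'")
    return [sys.executable, "-m", "graphrag.query",
            "--root", root, "--method", method, question]


def run_query(question: str, method: str = "global", root: str = ".") -> str:
    """Run the query and return whatever the command prints to stdout."""
    result = subprocess.run(build_query_command(question, method, root),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when the question contains quotes or non-ASCII text. For example, run_query("What is this article mainly about?", method="local", root="../myTest") corresponds to the local search command above.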

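Summary

We have now set up GraphRAG in two ways: from source and from the Python package. Pick whichever fits your needs. If you still have questions, feel free to discuss them in the comments!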
