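Introduction
When building a RAG (retrieval-augmented generation) system, improving retrieval efficiency and accuracy is an ongoing optimization task. Beyond the usual embedding-vector retrieval, adding full-text search can further improve results. Working with PostgreSQL, this article covers how to configure a Chinese text search configuration, create the indexes, and query them, and records the problems we hit in a real deployment along with their solutions.
I. Background
To improve retrieval quality in our RAG system, we explored a hybrid scheme that combines full-text search with vector search. PostgreSQL ships with capable built-in full-text search and supports extensions that add parsers for other languages. For Chinese we chose the zhparser parser extension and paired it with the pg_textsearch extension, which provides full-text search indexes based on the BM25 algorithm.
II. Environment setup: installing the extensions
Two extensions need to be installed first: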
    CREATE EXTENSION IF NOT EXISTS pg_textsearch;
    CREATE EXTENSION IF NOT EXISTS zhparser;
- pg_textsearch: provides full-text search based on the BM25 algorithm.
- zhparser: a Chinese parser that segments Chinese text into words.
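Before going further it can help to confirm that both extensions are available on the server and created in the current database. The sketch below only uses the standard catalog views pg_available_extensions and pg_extension; the version numbers it returns depend on your installation.

```sql
-- Extensions whose files are present on the server (installed_version is
-- NULL until CREATE EXTENSION has been run in the current database).
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name IN ('pg_textsearch', 'zhparser');

-- Extensions actually created in the current database.
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('pg_textsearch', 'zhparser');
```

If the first query returns no row for an extension, the package is not installed on the server and CREATE EXTENSION will fail.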
III. Configuring the Chinese tokenizer
1. Create the text search configuration
Create the Chinese configuration in the pg_catalog schema:
    CREATE TEXT SEARCH CONFIGURATION pg_catalog.chinese (PARSER = zhparser);
2. Add token mappings
Map zhparser's token types a through z (each letter is a part-of-speech tag) to the simple dictionary:
    ALTER TEXT SEARCH CONFIGURATION pg_catalog.chinese
    ADD MAPPING FOR a, b, c, d, e, f, g, h, i, j, k, l, m,
                    n, o, p, q, r, s, t, u, v, w, x, y, z
    WITH simple;
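To see what each letter in the mapping stands for, the parser can be asked for its token types. This is a small sketch using PostgreSQL's built-in ts_token_type function; the descriptions it prints come from zhparser itself.

```sql
-- List every token type the zhparser parser can emit.
-- The alias column holds the letters used in ADD MAPPING FOR.
SELECT tokid, alias, description
FROM ts_token_type('zhparser')
ORDER BY tokid;
```

PostgreSQL discards tokens whose type has no dictionary mapped, so mapping only a subset of the letters (for example nouns and verbs) keeps other parts of speech out of the index. Whether that helps or hurts recall depends on the documents.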
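3. Verify the configuration
List all text search configurations to confirm that the Chinese configuration uses the zhparser parser: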
    SELECT
        n.nspname AS schema_name,
        c.cfgname AS config_name,
        p.prsname AS parser_name
    FROM pg_ts_config c
    JOIN pg_namespace n ON n.oid = c.cfgnamespace
    JOIN pg_ts_parser p ON p.oid = c.cfgparser;
Result:
    scorpio=# SELECT
        n.nspname as schema_name,
        c.cfgname as config_name,
        p.prsname as parser_name
    FROM pg_ts_config c
    JOIN pg_namespace n ON n.oid = c.cfgnamespace
    JOIN pg_ts_parser p ON p.oid = c.cfgparser;
     schema_name | config_name | parser_name
    -------------+-------------+-------------
     pg_catalog  | simple      | default
     pg_catalog  | arabic      | default
     pg_catalog  | armenian    | default
     pg_catalog  | basque      | default
     pg_catalog  | catalan     | default
     pg_catalog  | danish      | default
     pg_catalog  | dutch       | default
     pg_catalog  | english     | default
     pg_catalog  | finnish     | default
     pg_catalog  | french      | default
     pg_catalog  | german      | default
     pg_catalog  | greek       | default
     pg_catalog  | hindi       | default
     pg_catalog  | hungarian   | default
     pg_catalog  | indonesian  | default
     pg_catalog  | irish       | default
     pg_catalog  | italian     | default
     pg_catalog  | lithuanian  | default
     pg_catalog  | nepali      | default
     pg_catalog  | norwegian   | default
     pg_catalog  | portuguese  | default
     pg_catalog  | romanian    | default
     pg_catalog  | russian     | default
     pg_catalog  | serbian     | default
     pg_catalog  | spanish     | default
     pg_catalog  | swedish     | default
     pg_catalog  | tamil       | default
     pg_catalog  | turkish     | default
     pg_catalog  | yiddish     | default
     pg_catalog  | chinese     | zhparser
    (30 rows)
The output should include the chinese configuration with zhparser as its parser.
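Seeing the configuration in the catalog does not yet show that segmentation works. A quick way to check, sketched here with the built-in to_tsvector and ts_debug functions and the sample string used in the queries in this article, is to run a short piece of text through the configuration:

```sql
-- Show the lexemes the chinese configuration produces for a sample string.
SELECT to_tsvector('chinese', '什么是RAG');

-- Show each token, its token type, and the dictionary that handled it.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('chinese', '什么是RAG');
```

If the Chinese text comes back as a single unsplit lexeme, or the result is empty, the parser or the token mappings are probably not set up as intended.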
IV. Creating the full-text search indexes
1. Create a BM25 index with the Chinese configuration
    CREATE INDEX idx_chunks_content_bm25_zh
    ON alpha.chunks
    USING bm25 (content)
    WITH (text_config = 'chinese');
Result:
    scorpio=# CREATE INDEX idx_chunks_content_bm25_zh ON alpha.chunks
    USING bm25 (content)
    WITH (text_config = 'chinese');
    NOTICE:  BM25 index build started for relation idx_chunks_content_bm25_zh
    NOTICE:  Using text search configuration: chinese
    NOTICE:  Using index options: k1=1.20, b=0.75
    NOTICE:  BM25 index build completed: 64 documents, avg_length=194.86, text_config='chinese' (k1=1.20, b=0.75)
    CREATE INDEX
The build log reports the text search configuration in use, the number of documents, the average document length, and the BM25 parameters (k1=1.20, b=0.75).
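For reference, the two parameters in the log are those of the standard Okapi BM25 scoring function. The formula below is the textbook form, not taken from pg_textsearch's source, so the extension's exact IDF variant may differ.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \,\frac{|D|}{\mathrm{avgdl}}\right)}
```

Here f(q_i, D) is the number of times query term q_i occurs in document D, |D| is the document length in tokens, and avgdl is the average document length (the avg_length=194.86 reported above). k_1 controls how quickly repeated occurrences of a term stop adding to the score, and b (between 0 and 1) controls how strongly longer documents are penalized; b = 0 turns length normalization off.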
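2. Also create an index with the English configuration (optional, for comparison)
    CREATE INDEX idx_chunks_content_bm25_en
    ON alpha.chunks
    USING bm25 (content)
    WITH (text_config = 'english');
The english configuration is built into PostgreSQL, so it needs no extra setup and the index was created without any problems.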
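To confirm that both indexes now exist on the table, the standard pg_indexes view can be queried. A small sketch using the schema and table names from above:

```sql
-- List the indexes defined on alpha.chunks together with their definitions.
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'alpha'
  AND tablename  = 'chunks'
ORDER BY indexname;
```

The indexdef column shows the full CREATE INDEX statement, including the text_config option each index was built with.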
V. Verifying full-text search results
An example Chinese full-text search query:
    SELECT
        id,
        LEFT(content, 80),
        ts_rank(to_tsvector('chinese', content),
                phraseto_tsquery('chinese', '什么是RAG')) AS score
    FROM alpha.chunks
    WHERE to_tsvector('chinese', content) @@
          phraseto_tsquery('chinese', '什么是RAG')
    ORDER BY score DESC;
Result:
    scorpio=# SELECT
        id,
        LEFT(content, 80),
        ts_rank(to_tsvector('chinese', content),
                phraseto_tsquery('chinese', '什么是RAG')) AS score
    FROM alpha.chunks
    WHERE to_tsvector('chinese', content) @@
          phraseto_tsquery('chinese', '什么是RAG')
    ORDER BY score DESC;
     id  |                                       left                                        |   score
    -----+-----------------------------------------------------------------------------------+------------
     216 | # RAG系统介绍                                                                    +| 0.51396555
         |                                                                                  +|
         | ## 什么是RAG?                                                                    +|
         |                                                                                  +|
         | RAG(Retrieval-Augmented Generation,检索增强生成)是一种结合了信息检索和文本生成    |
    (1 row)
The query returns the chunks that contain the phrase "什么是RAG", ordered by relevance. Note that the score here is computed by PostgreSQL's built-in ts_rank function, not by BM25.
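The query above ranks with ts_rank. To rank with BM25 instead, the query has to go through pg_textsearch's own scoring operator. The sketch below rests on two assumptions that should be checked against the documentation of the installed version: that the extension exposes a `<@>` operator between a text column and a BM25 query, and that a helper to_bm25query(query_text, index_name) exists for naming the index explicitly. The direction of the ordering is also an assumption.

```sql
-- Assumption: `<@>` and to_bm25query(query_text, index_name) are provided by
-- pg_textsearch. Verify both before relying on this query.
SELECT
    id,
    LEFT(content, 80) AS preview,
    content <@> to_bm25query('什么是RAG', 'idx_chunks_content_bm25_zh') AS score
FROM alpha.chunks
-- Assumed: smaller score values mean more relevant rows, so ascending order
-- returns the best matches first.
ORDER BY content <@> to_bm25query('什么是RAG', 'idx_chunks_content_bm25_zh')
LIMIT 5;
```

Naming the index matters here because the same column carries both a Chinese and an English BM25 index, and the index determines which text search configuration tokenizes the query.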
EXPLAIN shows the execution plan and whether the query uses an index scan:
    scorpio=# EXPLAIN (ANALYZE)
    SELECT
        id,
        LEFT(content, 80),
        ts_rank(to_tsvector('chinese', content),
                phraseto_tsquery('chinese', '什么是RAG')) AS score
    FROM alpha.chunks
    WHERE to_tsvector('chinese', content) @@
          phraseto_tsquery('chinese', '什么是RAG')
    ORDER BY score DESC;
                                                         QUERY PLAN
    --------------------------------------------------------------------------------------------------------------------
     Sort  (cost=33.06..33.07 rows=1 width=44) (actual time=13.349..13.351 rows=1 loops=1)
       Sort Key: (ts_rank(to_tsvector('chinese'::regconfig, content), '''什么'' <-> ''是'' <-> ''rag'''::tsquery)) DESC
       Sort Method: quicksort  Memory: 25kB
       ->  Seq Scan on chunks  (cost=0.00..33.05 rows=1 width=44) (actual time=13.211..13.315 rows=1 loops=1)
             Filter: (to_tsvector('chinese'::regconfig, content) @@ '''什么'' <-> ''是'' <-> ''rag'''::tsquery)
             Rows Removed by Filter: 63
     Planning Time: 0.482 ms
     Execution Time: 13.391 ms
    (8 rows)

    -- Execution plan when trying to force use of the bm25 index

    scorpio=# EXPLAIN (ANALYZE)
    SELECT
        id,
        LEFT(content, 80),
        ts_rank(to_tsvector('chinese', content),
                phraseto_tsquery('chinese', '什么是RAG')) AS score
    FROM alpha.chunks
    WHERE content @@ phraseto_tsquery('chinese', '什么是RAG')  -- use content directly
    ORDER BY score DESC;
                                                         QUERY PLAN
    --------------------------------------------------------------------------------------------------------------------
     Sort  (cost=32.91..32.91 rows=1 width=44) (actual time=13.940..13.941 rows=0 loops=1)
       Sort Key: (ts_rank(to_tsvector('chinese'::regconfig, content), '''什么'' <-> ''是'' <-> ''rag'''::tsquery)) DESC
       Sort Method: quicksort  Memory: 25kB
       ->  Seq Scan on chunks  (cost=0.00..32.90 rows=1 width=44) (actual time=13.723..13.723 rows=0 loops=1)
             Filter: (content @@ '''什么'' <-> ''是'' <-> ''rag'''::tsquery)
             Rows Removed by Filter: 64
     Planning Time: 65.847 ms
     Execution Time: 14.656 ms
    (8 rows)
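Both plans use a sequential scan. With only 64 rows a sequential scan is cheap, but the more important reason is that neither predicate matches the bm25 index. to_tsvector(...) @@ tsquery is PostgreSQL's native full-text match, which is normally accelerated by a GIN or GiST index on the same tsvector expression, not by a bm25 index. The second query does not force the index either, and it returns 0 rows: content @@ tsquery on a text column is shorthand for to_tsvector(content) @@ tsquery using default_text_search_config, which in this database is presumably not chinese, so the Chinese text is not segmented. As far as we can tell, the bm25 index is only used by queries written with pg_textsearch's own scoring operator.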
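If the native `@@` form of the query should be index-assisted, core PostgreSQL offers an expression index on the tsvector. The following is a minimal sketch; the index name idx_chunks_content_gin_zh is made up for illustration, and on a 64-row table the planner may still choose a sequential scan unless it is discouraged from doing so.

```sql
-- Expression index on the same tsvector expression the query filters on.
-- The index name is hypothetical.
CREATE INDEX IF NOT EXISTS idx_chunks_content_gin_zh
ON alpha.chunks
USING gin (to_tsvector('chinese', content));

-- For testing only: discourage sequential scans inside one transaction
-- so the plan shows whether the index is usable at all.
BEGIN;
SET LOCAL enable_seqscan = off;
EXPLAIN (ANALYZE)
SELECT id
FROM alpha.chunks
WHERE to_tsvector('chinese', content) @@
      phraseto_tsquery('chinese', '什么是RAG');
ROLLBACK;
```

The WHERE clause must repeat the indexed expression exactly, including the explicit 'chinese' configuration, for the planner to match it. SET LOCAL reverts at the end of the transaction, so the setting does not leak into the session.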
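VI. Key problems and solutions
🔧 The text search configuration must live in pg_catalog
If the text search configuration is created in any other schema, index creation may fail. In our setup the TEXT SEARCH CONFIGURATION had to be created in the pg_catalog schema; otherwise the pg_textsearch extension could not recognize it.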
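A quick way to spot configurations that ended up in the wrong place is to filter the catalog query from the verification step by schema. This sketch only reads system catalogs:

```sql
-- List text search configurations that are NOT in pg_catalog.
SELECT n.nspname AS schema_name,
       c.cfgname AS config_name,
       p.prsname AS parser_name
FROM pg_ts_config c
JOIN pg_namespace n ON n.oid = c.cfgnamespace
JOIN pg_ts_parser p ON p.oid = c.cfgparser
WHERE n.nspname <> 'pg_catalog';
```

An empty result means every configuration in the current database lives in pg_catalog.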
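🔧 Dropping a misconfigured configuration
If a configuration is wrong (for example, a chinese_zh configuration created in the public schema), it can be removed with: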
    DROP TEXT SEARCH CONFIGURATION IF EXISTS chinese_zh CASCADE;
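CASCADE also drops any objects that depend on the configuration. A more cautious variant, sketched here with the example name from above, is to schema-qualify the name and try the drop without CASCADE first:

```sql
-- Schema-qualify the name so search_path cannot resolve it to another object.
-- Without CASCADE the default is RESTRICT: if anything depends on the
-- configuration, the command fails and reports the dependent objects.
DROP TEXT SEARCH CONFIGURATION IF EXISTS public.chinese_zh;
```

If the command reports dependent objects, review them before deciding whether to rerun it with CASCADE.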
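🔧 Scope of the text search configuration
We compared the Chinese configuration in two databases on the same PostgreSQL instance:
- Chinese configuration in the scorpio database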
    scorpio=# \dF
                   List of text search configurations
       Schema   |    Name    |              Description
    ------------+------------+---------------------------------------
     pg_catalog | arabic     | configuration for arabic language
     pg_catalog | armenian   | configuration for armenian language
     pg_catalog | basque     | configuration for basque language
     pg_catalog | catalan    | configuration for catalan language
     pg_catalog | chinese    |
     pg_catalog | danish     | configuration for danish language
     pg_catalog | dutch      | configuration for dutch language
     pg_catalog | english    | configuration for english language
     pg_catalog | finnish    | configuration for finnish language
     pg_catalog | french     | configuration for french language
     pg_catalog | german     | configuration for german language
     pg_catalog | greek      | configuration for greek language
     pg_catalog | hindi      | configuration for hindi language
     pg_catalog | hungarian  | configuration for hungarian language
     pg_catalog | indonesian | configuration for indonesian language
     pg_catalog | irish      | configuration for irish language
     pg_catalog | italian    | configuration for italian language
     pg_catalog | lithuanian | configuration for lithuanian language
     pg_catalog | nepali     | configuration for nepali language
     pg_catalog | norwegian  | configuration for norwegian language
     pg_catalog | portuguese | configuration for portuguese language
     pg_catalog | romanian   | configuration for romanian language
     pg_catalog | russian    | configuration for russian language
     pg_catalog | serbian    | configuration for serbian language
     pg_catalog | simple     | simple configuration
     pg_catalog | spanish    | configuration for spanish language
     pg_catalog | swedish    | configuration for swedish language
     pg_catalog | tamil      | configuration for tamil language
     pg_catalog | turkish    | configuration for turkish language
     pg_catalog | yiddish    | configuration for yiddish language
    (30 rows)

    scorpio=# \dF+ chinese
    Text search configuration "pg_catalog.chinese"
    Parser: "public.zhparser"
     Token | Dictionaries
    -------+--------------
     a     | simple
     b     | simple
     c     | simple
     d     | simple
     e     | simple
     f     | simple
     g     | simple
     h     | simple
     i     | simple
     j     | simple
     k     | simple
     l     | simple
     m     | simple
     n     | simple
     o     | simple
     p     | simple
     q     | simple
     r     | simple
     s     | simple
     t     | simple
     u     | simple
     v     | simple
     w     | simple
     x     | simple
     y     | simple
     z     | simple
- Chinese configuration in the hbu database
    scorpio=# \c hbu
    You are now connected to database "hbu" as user "hbu".
    hbu=# \dF
                   List of text search configurations
       Schema   |    Name    |              Description
    ------------+------------+---------------------------------------
     pg_catalog | arabic     | configuration for arabic language
     pg_catalog | armenian   | configuration for armenian language
     pg_catalog | basque     | configuration for basque language
     pg_catalog | catalan    | configuration for catalan language
     pg_catalog | danish     | configuration for danish language
     pg_catalog | dutch      | configuration for dutch language
     pg_catalog | english    | configuration for english language
     pg_catalog | finnish    | configuration for finnish language
     pg_catalog | french     | configuration for french language
     pg_catalog | german     | configuration for german language
     pg_catalog | greek      | configuration for greek language
     pg_catalog | hindi      | configuration for hindi language
     pg_catalog | hungarian  | configuration for hungarian language
     pg_catalog | indonesian | configuration for indonesian language
     pg_catalog | irish      | configuration for irish language
     pg_catalog | italian    | configuration for italian language
     pg_catalog | lithuanian | configuration for lithuanian language
     pg_catalog | nepali     | configuration for nepali language
     pg_catalog | norwegian  | configuration for norwegian language
     pg_catalog | portuguese | configuration for portuguese language
     pg_catalog | romanian   | configuration for romanian language
     pg_catalog | russian    | configuration for russian language
     pg_catalog | serbian    | configuration for serbian language
     pg_catalog | simple     | simple configuration
     pg_catalog | spanish    | configuration for spanish language
     pg_catalog | swedish    | configuration for swedish language
     pg_catalog | tamil      | configuration for tamil language
     pg_catalog | turkish    | configuration for turkish language
     pg_catalog | yiddish    | configuration for yiddish language
     public     | chinese    |
    (30 rows)

    hbu=# \dF+ chinese
    Text search configuration "public.chinese"
    Parser: "public.zhparser"
     Token | Dictionaries
    -------+--------------
     a     | simple
     e     | simple
     i     | simple
     j     | simple
     l     | simple
     m     | simple
     n     | simple
     t     | simple
     v     | simple
     x     | simple
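A text search configuration such as chinese belongs to a single database (in the PostgreSQL sense of the word), so a configuration created in one database cannot be used from another. Within one database, however, it is shared by all schemas.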
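Because extensions and text search configurations are both per-database objects, the setup has to be repeated in every database that needs it. As a sketch, bringing the hbu database from the listing above in line with scorpio would look roughly like this; creating objects in pg_catalog typically requires a superuser.

```sql
-- psql meta-command: switch to the other database.
\c hbu

-- Extensions are created per database as well.
CREATE EXTENSION IF NOT EXISTS zhparser;
CREATE EXTENSION IF NOT EXISTS pg_textsearch;

-- Same configuration as in scorpio, this time inside hbu.
CREATE TEXT SEARCH CONFIGURATION pg_catalog.chinese (PARSER = zhparser);

ALTER TEXT SEARCH CONFIGURATION pg_catalog.chinese
ADD MAPPING FOR a, b, c, d, e, f, g, h, i, j, k, l, m,
                n, o, p, q, r, s, t, u, v, w, x, y, z
WITH simple;
```

After this, hbu contains both public.chinese and pg_catalog.chinese. With the default search_path, pg_catalog is searched first, so the unqualified name chinese resolves to the new configuration; dropping public.chinese removes the ambiguity.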
VII. Summary
With this setup we got Chinese full-text search working in PostgreSQL on top of zhparser and built BM25 indexes with pg_textsearch. The main takeaways:
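- The configuration must live in a system schema: the chinese text search configuration had to be created in pg_catalog, otherwise index creation failed.
- Chinese and English configurations can coexist: full-text search indexes for different languages can be created on the same column, which suits multilingual content.
- BM25 parameters are tunable: the index build accepts the k1 and b parameters, which can be adjusted to the characteristics of the document collection.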
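This approach gives the RAG system stable full-text search support and is particularly suited to precise recall over Chinese documents.
This article is based on an actual configuration session on PostgreSQL 17.7 with the pg_textsearch and zhparser extensions. In a real deployment, the index parameters and query structure still need to be tuned to the data volume and query patterns.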