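PostgreSQL Full-Text Search: Configuring and Tuning a Chinese Tokenizer in Practice

Introduction

When building a RAG (retrieval-augmented generation) system, improving retrieval speed and accuracy is ongoing work. Beyond the usual embedding-based vector search, adding full-text search can further improve results. This article uses PostgreSQL to walk through configuring a Chinese tokenizer for full-text search, creating the indexes, and querying them, and records the problems we hit in a real deployment along with how we solved them.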

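I. Background

To improve retrieval quality in our RAG system, we explored a hybrid scheme that combines full-text search with vector search. PostgreSQL ships with capable built-in full-text search and supports extensions that add tokenization for other languages. For Chinese, we chose the zhparser tokenizer extension and combined it with the pg_textsearch extension to build full-text indexes based on the BM25 algorithm.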

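II. Environment Setup: Installing the Extensions

First, install the two key extensions:

CREATE EXTENSION IF NOT EXISTS pg_textsearch;
CREATE EXTENSION IF NOT EXISTS zhparser;

pg_textsearch: provides full-text search based on the BM25 algorithm
zhparser: a Chinese parser that segments Chinese text into words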
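
To confirm that both extensions are actually present in the current database, the pg_extension catalog can be queried. This is a minimal sketch that uses only a core PostgreSQL catalog; the version strings it returns depend on the installation.

```sql
-- List the two extensions and their installed versions in the current database.
-- Extensions are installed per database, so this check applies only to the
-- database the session is connected to.
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('pg_textsearch', 'zhparser')
ORDER BY extname;
```

If a row is missing, the corresponding CREATE EXTENSION has not been run in this database.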

III. Configuring the Chinese Tokenizer

1. Create the text search configuration

Create the Chinese text search configuration in the pg_catalog schema:

CREATE TEXT SEARCH CONFIGURATION pg_catalog.chinese (PARSER = zhparser);

2. Add token mappings

Map all 26 part-of-speech token types (a through z) to the simple dictionary:

ALTER TEXT SEARCH CONFIGURATION pg_catalog.chinese
ADD MAPPING FOR a, b, c, d, e, f, g, h, i, j, k, l, m,
                n, o, p, q, r, s, t, u, v, w, x, y, z
WITH simple;

3. Verify the configuration

List all text search configurations to confirm that the Chinese parser is in place:

SELECT
    n.nspname as schema_name,
    c.cfgname as config_name,
    p.prsname as parser_name
FROM pg_ts_config c
JOIN pg_namespace n ON n.oid = c.cfgnamespace
JOIN pg_ts_parser p ON p.oid = c.cfgparser;

Result

scorpio=# SELECT
    n.nspname as schema_name,
    c.cfgname as config_name,
    p.prsname as parser_name
FROM pg_ts_config c
JOIN pg_namespace n ON n.oid = c.cfgnamespace
JOIN pg_ts_parser p ON p.oid = c.cfgparser;
 schema_name | config_name | parser_name 
-------------+-------------+-------------
 pg_catalog  | simple      | default
 pg_catalog  | arabic      | default
 pg_catalog  | armenian    | default
 pg_catalog  | basque      | default
 pg_catalog  | catalan     | default
 pg_catalog  | danish      | default
 pg_catalog  | dutch       | default
 pg_catalog  | english     | default
 pg_catalog  | finnish     | default
 pg_catalog  | french      | default
 pg_catalog  | german      | default
 pg_catalog  | greek       | default
 pg_catalog  | hindi       | default
 pg_catalog  | hungarian   | default
 pg_catalog  | indonesian  | default
 pg_catalog  | irish       | default
 pg_catalog  | italian     | default
 pg_catalog  | lithuanian  | default
 pg_catalog  | nepali      | default
 pg_catalog  | norwegian   | default
 pg_catalog  | portuguese  | default
 pg_catalog  | romanian    | default
 pg_catalog  | russian     | default
 pg_catalog  | serbian     | default
 pg_catalog  | spanish     | default
 pg_catalog  | swedish     | default
 pg_catalog  | tamil       | default
 pg_catalog  | turkish     | default
 pg_catalog  | yiddish     | default
 pg_catalog  | chinese     | zhparser
(30 rows)

The output should include the chinese configuration with zhparser as its parser.
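
Beyond the catalog check, it can be useful to see how a piece of text is actually tokenized. The sketch below relies only on core PostgreSQL functions (ts_token_type, ts_parse, ts_debug, to_tsvector) and assumes the chinese configuration and the zhparser parser exist as set up in this section. The sample string is the query used later in this article.

```sql
-- Token types (part-of-speech tags) that the zhparser parser can emit.
SELECT tokid, alias, description
FROM ts_token_type('zhparser');

-- Raw parser output: one row per token together with its token type id.
SELECT * FROM ts_parse('zhparser', '什么是RAG');

-- Full pipeline: parser output plus the dictionaries mapped in the chinese configuration.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('chinese', '什么是RAG');

-- The tsvector that is produced for this text under the chinese configuration.
SELECT to_tsvector('chinese', '什么是RAG');
```

If ts_debug shows empty lexemes for a token, its token type has no dictionary mapping in the configuration and the token is dropped from the tsvector.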

IV. Creating the Full-Text Indexes

1. Create a BM25 index with the Chinese configuration

CREATE INDEX idx_chunks_content_bm25_zh
ON alpha.chunks
USING bm25 (content)
WITH (text_config = 'chinese');

Result

scorpio=# CREATE INDEX idx_chunks_content_bm25_zh ON alpha.chunks
USING bm25 (content)
WITH (text_config = 'chinese');
NOTICE:  BM25 index build started for relation idx_chunks_content_bm25_zh
NOTICE:  Using text search configuration: chinese
NOTICE:  Using index options: k1=1.20, b=0.75
NOTICE:  BM25 index build completed: 64 documents, avg_length=194.86, text_config='chinese' (k1=1.20, b=0.75)
CREATE INDEX

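The build prints a detailed log: the text search configuration in use, the number of documents, the average document length, and the BM25 parameters (k1=1.20, b=0.75).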
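
The NOTICE lines suggest that k1 and b are index options with defaults of 1.2 and 0.75. As a hedged sketch, and assuming pg_textsearch accepts them in the WITH clause under the names k1 and b (the names are inferred from the log output above, so check the extension's documentation for your version), an index with non-default parameters might be declared as follows. The index name and the parameter values are purely illustrative.

```sql
-- Hypothetical variant of the index above with explicit BM25 parameters.
-- k1 controls how quickly repeated occurrences of a term stop adding to the score;
-- b controls how strongly scores are normalized by document length (0 = not at all).
-- The option names k1 and b are assumed from the NOTICE output, not verified here.
CREATE INDEX idx_chunks_content_bm25_zh_tuned
ON alpha.chunks
USING bm25 (content)
WITH (text_config = 'chinese', k1 = 1.5, b = 0.6);
```

A lower b penalizes long chunks less, and a higher k1 lets repeated terms count for more. Whether either helps depends on the corpus, so any change is best validated against real queries.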

2. Also create an index with the English configuration (optional, for comparison)

CREATE INDEX idx_chunks_content_bm25_en
ON alpha.chunks
USING bm25(content)
WITH (text_config='english');

The english configuration is built into PostgreSQL, so it needs no extra setup and the index was created without any problems.

V. Verifying Full-Text Search

An example Chinese full-text query:

SELECT
    id,
    LEFT(content, 80),
    ts_rank(to_tsvector('chinese', content),
            phraseto_tsquery('chinese', '什么是RAG')) AS score
FROM alpha.chunks
WHERE to_tsvector('chinese', content) @@
      phraseto_tsquery('chinese', '什么是RAG')
ORDER BY score DESC;

Result

scorpio=# SELECT
    id,
    LEFT(content, 80),
    ts_rank(to_tsvector('chinese', content),
            phraseto_tsquery('chinese', '什么是RAG')) AS score
FROM alpha.chunks
WHERE to_tsvector('chinese', content) @@
      phraseto_tsquery('chinese', '什么是RAG')
ORDER BY score DESC;
 id  |                                       left                                        |   score    
-----+-----------------------------------------------------------------------------------+------------
 216 | # RAG系统介绍                                                                    +| 0.51396555
     |                                                                                  +| 
     | ## 什么是RAG？                                                                   +| 
     |                                                                                  +| 
     | RAG（Retrieval-Augmented Generation，检索增强生成）是一种结合了信息检索和文本生成 | 
(1 row)

The query returns the chunks whose content contains the phrase 什么是RAG ("What is RAG"), ordered by their ts_rank relevance score.
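
The WHERE clause above is a native full-text match on to_tsvector('chinese', content). In core PostgreSQL, a predicate of this form is normally accelerated with a GIN expression index rather than a BM25 index. A minimal sketch, with a hypothetical index name that is not part of the original setup:

```sql
-- Expression index on exactly the expression used in the WHERE clause.
-- The index name is illustrative.
CREATE INDEX IF NOT EXISTS idx_chunks_content_gin_zh
ON alpha.chunks
USING gin (to_tsvector('chinese', content));

-- Refresh planner statistics after creating the index.
ANALYZE alpha.chunks;
```

The query has to use the same two-argument form, to_tsvector('chinese', content), for the planner to consider this index. On a table with only a few dozen rows the planner may still prefer a sequential scan.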

EXPLAIN shows the execution plan and whether the query uses an index scan:

scorpio=# EXPLAIN (ANALYZE)
SELECT
    id,
    LEFT(content, 80),
    ts_rank(to_tsvector('chinese', content),
            phraseto_tsquery('chinese', '什么是RAG')) AS score
FROM alpha.chunks
WHERE to_tsvector('chinese', content) @@
      phraseto_tsquery('chinese', '什么是RAG')
ORDER BY score DESC;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Sort  (cost=33.06..33.07 rows=1 width=44) (actual time=13.349..13.351 rows=1 loops=1)
   Sort Key: (ts_rank(to_tsvector('chinese'::regconfig, content), '''什么'' <-> ''是'' <-> ''rag'''::tsquery)) DESC
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on chunks  (cost=0.00..33.05 rows=1 width=44) (actual time=13.211..13.315 rows=1 loops=1)
         Filter: (to_tsvector('chinese'::regconfig, content) @@ '''什么'' <-> ''是'' <-> ''rag'''::tsquery)
         Rows Removed by Filter: 63
 Planning Time: 0.482 ms
 Execution Time: 13.391 ms
(8 rows)

-- Attempt to make the planner use the bm25 index by matching on content directly

scorpio=# EXPLAIN (ANALYZE)
SELECT
    id,
    LEFT(content, 80),
    ts_rank(to_tsvector('chinese', content),
            phraseto_tsquery('chinese', '什么是RAG')) AS score
FROM alpha.chunks
WHERE content @@ phraseto_tsquery('chinese', '什么是RAG')  -- match on content directly
ORDER BY score DESC;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Sort  (cost=32.91..32.91 rows=1 width=44) (actual time=13.940..13.941 rows=0 loops=1)
   Sort Key: (ts_rank(to_tsvector('chinese'::regconfig, content), '''什么'' <-> ''是'' <-> ''rag'''::tsquery)) DESC
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on chunks  (cost=0.00..32.90 rows=1 width=44) (actual time=13.723..13.723 rows=0 loops=1)
         Filter: (content @@ '''什么'' <-> ''是'' <-> ''rag'''::tsquery)
         Rows Removed by Filter: 64
 Planning Time: 65.847 ms
 Execution Time: 14.656 ms
(8 rows)

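Both plans use a sequential scan. With only 64 rows the planner has little reason to choose an index, and both predicates are native full-text matches with @@, which the bm25 index may simply not serve. The second query also returns 0 rows: when @@ is applied to a text column directly, PostgreSQL converts the text with default_text_search_config rather than chinese, so the content is tokenized differently. These plans therefore do not show that the bm25 index is unusable, only that these particular queries do not exercise it.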
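
Two follow-up checks may be worth running here. SHOW default_text_search_config reveals which configuration a bare content @@ tsquery falls back to. And, as far as we understand pg_textsearch, its BM25 indexes are used through the extension's own ranking operator <@> in an ORDER BY clause rather than through @@. The sketch below assumes that the <@> operator and a to_bm25query(query_text, index_name) helper exist and behave this way in the installed version; both are assumptions taken from our reading of the extension's documentation, not something the session above confirms.

```sql
-- Which configuration is used when @@ is applied to a plain text column?
SHOW default_text_search_config;

-- Assumed pg_textsearch usage: rank by BM25 through the <@> operator.
-- The operator is assumed to return a value where smaller means more relevant,
-- which is why the ORDER BY is ascending.
EXPLAIN (ANALYZE)
SELECT
    id,
    LEFT(content, 80) AS snippet,
    content <@> to_bm25query('什么是RAG', 'idx_chunks_content_bm25_zh') AS score
FROM alpha.chunks
ORDER BY content <@> to_bm25query('什么是RAG', 'idx_chunks_content_bm25_zh')
LIMIT 5;
```

If the plan for this query shows an index scan on idx_chunks_content_bm25_zh, the BM25 index is in use. If the statement fails, the operator or function name differs in your version and the extension's documentation should be consulted.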

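VI. Key Problems and Solutions

🔧 The text search configuration must live in `pg_catalog`

If the text search configuration is created in any other schema, index creation may fail. In our setup the TEXT SEARCH CONFIGURATION had to be created in the pg_catalog schema; otherwise the pg_textsearch extension did not recognize it.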
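
When diagnosing this kind of failure, it may help to see which schema a configuration lives in and whether an unqualified name such as 'chinese' resolves to it. The query below uses only core catalogs and functions. It does not explain why pg_textsearch rejects other schemas; it only shows what PostgreSQL itself sees.

```sql
-- Where does each configuration named chinese live, and is it visible
-- through the current search_path without schema qualification?
SELECT
    n.nspname AS schema_name,
    c.cfgname AS config_name,
    pg_ts_config_is_visible(c.oid) AS visible_unqualified
FROM pg_ts_config c
JOIN pg_namespace n ON n.oid = c.cfgnamespace
WHERE c.cfgname = 'chinese';

-- Schemas searched for unqualified names (pg_catalog is always searched implicitly).
SHOW search_path;
```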

🔧 Removing a wrong configuration

If a configuration was created incorrectly (for example, chinese_zh in the public schema), it can be removed with:

DROP TEXT SEARCH CONFIGURATION IF EXISTS chinese_zh CASCADE;

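🔧 Scope of the text search configuration

We compared the Chinese configuration in two databases on the same PostgreSQL instance.

Chinese configuration in database scorpio: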

scorpio=# \dF
               List of text search configurations
   Schema   |    Name    |              Description              
------------+------------+---------------------------------------
 pg_catalog | arabic     | configuration for arabic language
 pg_catalog | armenian   | configuration for armenian language
 pg_catalog | basque     | configuration for basque language
 pg_catalog | catalan    | configuration for catalan language
 pg_catalog | chinese    | 
 pg_catalog | danish     | configuration for danish language
 pg_catalog | dutch      | configuration for dutch language
 pg_catalog | english    | configuration for english language
 pg_catalog | finnish    | configuration for finnish language
 pg_catalog | french     | configuration for french language
 pg_catalog | german     | configuration for german language
 pg_catalog | greek      | configuration for greek language
 pg_catalog | hindi      | configuration for hindi language
 pg_catalog | hungarian  | configuration for hungarian language
 pg_catalog | indonesian | configuration for indonesian language
 pg_catalog | irish      | configuration for irish language
 pg_catalog | italian    | configuration for italian language
 pg_catalog | lithuanian | configuration for lithuanian language
 pg_catalog | nepali     | configuration for nepali language
 pg_catalog | norwegian  | configuration for norwegian language
 pg_catalog | portuguese | configuration for portuguese language
 pg_catalog | romanian   | configuration for romanian language
 pg_catalog | russian    | configuration for russian language
 pg_catalog | serbian    | configuration for serbian language
 pg_catalog | simple     | simple configuration
 pg_catalog | spanish    | configuration for spanish language
 pg_catalog | swedish    | configuration for swedish language
 pg_catalog | tamil      | configuration for tamil language
 pg_catalog | turkish    | configuration for turkish language
 pg_catalog | yiddish    | configuration for yiddish language
(30 rows)

scorpio=# \dF+ chinese
Text search configuration "pg_catalog.chinese"
Parser: "public.zhparser"
 Token | Dictionaries 
-------+--------------
 a     | simple
 b     | simple
 c     | simple
 d     | simple
 e     | simple
 f     | simple
 g     | simple
 h     | simple
 i     | simple
 j     | simple
 k     | simple
 l     | simple
 m     | simple
 n     | simple
 o     | simple
 p     | simple
 q     | simple
 r     | simple
 s     | simple
 t     | simple
 u     | simple
 v     | simple
 w     | simple
 x     | simple
 y     | simple
 z     | simple

Chinese configuration in database hbu:

scorpio=# \c hbu
You are now connected to database "hbu" as user "hbu".
hbu=# \dF
               List of text search configurations
   Schema   |    Name    |              Description              
------------+------------+---------------------------------------
 pg_catalog | arabic     | configuration for arabic language
 pg_catalog | armenian   | configuration for armenian language
 pg_catalog | basque     | configuration for basque language
 pg_catalog | catalan    | configuration for catalan language
 pg_catalog | danish     | configuration for danish language
 pg_catalog | dutch      | configuration for dutch language
 pg_catalog | english    | configuration for english language
 pg_catalog | finnish    | configuration for finnish language
 pg_catalog | french     | configuration for french language
 pg_catalog | german     | configuration for german language
 pg_catalog | greek      | configuration for greek language
 pg_catalog | hindi      | configuration for hindi language
 pg_catalog | hungarian  | configuration for hungarian language
 pg_catalog | indonesian | configuration for indonesian language
 pg_catalog | irish      | configuration for irish language
 pg_catalog | italian    | configuration for italian language
 pg_catalog | lithuanian | configuration for lithuanian language
 pg_catalog | nepali     | configuration for nepali language
 pg_catalog | norwegian  | configuration for norwegian language
 pg_catalog | portuguese | configuration for portuguese language
 pg_catalog | romanian   | configuration for romanian language
 pg_catalog | russian    | configuration for russian language
 pg_catalog | serbian    | configuration for serbian language
 pg_catalog | simple     | simple configuration
 pg_catalog | spanish    | configuration for spanish language
 pg_catalog | swedish    | configuration for swedish language
 pg_catalog | tamil      | configuration for tamil language
 pg_catalog | turkish    | configuration for turkish language
 pg_catalog | yiddish    | configuration for yiddish language
 public     | chinese    | 
(30 rows)

hbu=# \dF+ chinese
Text search configuration "public.chinese"
Parser: "public.zhparser"
 Token | Dictionaries 
-------+--------------
 a     | simple
 e     | simple
 i     | simple
 j     | simple
 l     | simple
 m     | simple
 n     | simple
 t     | simple
 v     | simple
 x     | simple

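A text search configuration such as chinese belongs to a single database (in the PostgreSQL sense of the word). The pg_catalog.chinese configuration created in scorpio is therefore not available in hbu, which has its own, separately created public.chinese with fewer token mappings. Within one database, however, the configuration can be shared across schemas.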
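
Because the configuration is per database, the setup has to be repeated wherever it is needed. The sketch below simply reuses the statements from sections II and III for a second database. It assumes the session is already connected to the target database (for example via \c in psql) with sufficient privileges, and that no chinese configuration exists in pg_catalog there yet.

```sql
-- Confirm which database this session is connected to before changing anything.
SELECT current_database();

-- Extensions are per database as well.
CREATE EXTENSION IF NOT EXISTS zhparser;
CREATE EXTENSION IF NOT EXISTS pg_textsearch;

-- Same configuration and mappings as in section III.
CREATE TEXT SEARCH CONFIGURATION pg_catalog.chinese (PARSER = zhparser);

ALTER TEXT SEARCH CONFIGURATION pg_catalog.chinese
ADD MAPPING FOR a, b, c, d, e, f, g, h, i, j, k, l, m,
                n, o, p, q, r, s, t, u, v, w, x, y, z
WITH simple;
```

If a configuration with the same name already exists in another schema, as public.chinese does in hbu, both can coexist. Schema-qualifying the name in queries avoids ambiguity about which one is meant.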

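VII. Summary

With this setup we got Chinese full-text search working in PostgreSQL on top of zhparser, and built BM25 indexes with pg_textsearch. The main takeaways:

The configuration must live in a system schema: the chinese text search configuration has to be created in pg_catalog; otherwise index creation fails.
Chinese and English configurations can coexist: the same column can carry full-text indexes for different languages, which suits multilingual content.
BM25 exposes tunable parameters: the index is built with the k1 and b parameters (1.20 and 0.75 in our run), which can be adjusted to fit the document collection.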

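This approach gives a RAG system stable, efficient full-text search, and is particularly suited to precise recall over Chinese documents.

This article is based on a real configuration session on PostgreSQL 17.7 with the pg_textsearch and zhparser extensions. In an actual deployment, index parameters and query structure should be tuned further to match the data volume and query patterns.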
