MySQL存储过程字符集不同索引不生效

背景

目前公司数据库使用的是腾讯云的云数据库，当时数据库压力大一直在报警，故排查一下性能瓶颈在哪。

排查思路

腾讯云自带的性能分析

云厂商的数据库基本都自带了性能分析的工具，相比自建数据库也是一点优势
数据库压力大具体表现在数据库CPU长时间100%，出现大量慢SQL
我司还使用了事件+存储过程去定时执行一些任务，这部分也是重中之重

下面主要是我分析我司具体情况的思路，可以不参考，只是记录一下

打开腾讯云慢查询日志，可以看到目前数据库执行较慢的语句，把这些语句扣出来分析，没加索引加索引，需要优化加优化
优化完显示的慢查询后，压力降低些，但仍然颇高，继续分析
通过CPU分析结合我司具体情况，基本在每个小时的特定时间点CPU使用率就会突然飙升，然后持续一段时间，慢慢回落，这个可以大概想到是定时任务导致的
慢查询中也可以查询到调用了存储过程耗时巨大，扫描行数巨多（只能看到调用的存储过程和扫描行数总数，看不到具体的语句）
打开引擎监控，可以看到更详细的数据，这里主要查看全表扫描数，占了非常多，而且时间点和CPU使用率的时间点一致，很难不让人怀疑是存储过程导致的
这里需要打开腾讯云的数据库参数log_queries_not_using_indexes，可以记录下全表扫描的SQL,帮助排查

存储过程排查思路

通过腾讯云的辅助分析，可以确认问题基本是在存储过程上产生的，花了些时间调试一下，发现了一个奇怪的问题

-- CustomerID VARCHAR(10)
-- 走索引
EXPLAIN select MessageWarnType from customer_sysconfig where CustomerID='1';
-- 不走索引
EXPLAIN select MessageWarnType from customer_sysconfig where CustomerID=1;

暂时不知道什么原因导致的，但是感觉可能是和这个问题相关，按着这个方向去查资料，发现了一些端倪

stackoverflow文章给了一个方案说是因为字符集的问题，需要手动指定CONVERT(xxx USING utf8)
想起了以前也因为字符集导致的联表无法使用索引问题（历史遗留问题，以前是utf8的表后面是utf8mb4），于是需要想办法测试一下

原先的测试方法是通过直接调用存储过程，再结合腾讯云的分析来排查是否生效，但是现在看这种方式很傻逼
目前采用新建一个测试用的存储过程

CREATE DEFINER=`root`@`%` PROCEDURE `test`()
BEGIN
	DECLARE pCustomerID varchar(10);
	SET pCustomerID = '1';
	EXPLAIN select MessageWarnType from customer_sysconfig where CustomerID=CONVERT(pCustomerID USING utf8);

END

可以测试使用CONVERT和不使用CONVERT的情况，发现确实是使用了CONVERT以后就使用索引了

这种是使用了变量的情况（参考Stack Overflow的文章），我司目前是这种，但也得测试前面的直接使用1和’1’的情况即

-- CustomerID VARCHAR(10)
EXPLAIN select MessageWarnType from customer_sysconfig where CustomerID=1;
EXPLAIN select MessageWarnType from customer_sysconfig where CustomerID='1';

发现在存储过程执行也是一样的，使用数字1不走索引，字符串’1’走索引
这里还得提一下类型转换的问题,无法走索引的原因继续分析

字符集

为什么会有不走索引的问题？
这里参考了Stack Overflow文章提到的链接，大佬FEDERICO提到了

When you use:
SELECT … WHERE field=’string’;
or:
SET @x=’string’;
…’string’ is encoded with @@session.character_set_connection and @@session.collation_connection. So if you only use latin-2 for your tables/columns, you can change those variables to make your queries use latin-2 and the proper collation, and you don’t need to change anything in the code.
I hope this helps ??

当我们使用变量的时候，变量会使用@@session.character_set_connection and @@session.collation_connection的字符集和校对集来编码，此时我们再来测一下

-- 结果是utf8mb4和utf8mb4_general_ci
SELECT @@session.character_set_connection, @@session.collation_connection;
-- 查看我的表和字段的编码
SHOW CREATE TABLE customer_sysconfig;
-- 没有看到字段有设置特殊的，也没看到表的默认校对集，使用语句直接查询
-- 查询表的校对集
SELECT TABLE_COLLATION  FROM information_schema.TABLES WHERE TABLE_NAME = 'customer_sysconfig';
-- 查询字段的字符集和校对集
SELECT CHARACTER_SET_NAME, COLLATION_NAME FROM information_schema.COLUMNS WHERE TABLE_NAME = 'customer_sysconfig';


-- 没有特定数据库时，查看默认的校对集和字符集
SHOW VARIABLES LIKE 'collation_server';
SHOW VARIABLES LIKE 'character_set_server';

-- 指定特定数据库时，查看默认的校对集和字符集
USE database;

SHOW VARIABLES LIKE 'collation_database';
SHOW VARIABLES LIKE 'character_set_database';

通过上述查询后，我得出结论就是

customerId: utf8，utf8_general_ci  
customer_sysconfig: utf8，utf8_general_ci  
@@session.character_set_connection, @@session.collation_connection: utf8mb4，utf8mb4_general_ci

-- 检验变量的字符集
SET @x = 'xxx';
-- utf8mb4,utf8mb4_general_ci,utf8mb4,utf8mb4_general_ci
SELECT CHARSET(@x),COLLATION(@x), CHARSET('xxx'), COLLATION('xxx');

这就好像可以大概窥探到为什么使用变量的时候，会出现不走索引的情况了，因为变量的字符集和校对集和表的不一致，导致无法使用索引
但是还有几个问题，

文章描述的出处在哪,最好是有官网文档考据
无法解释为什么直接使用字符串，不使用变量，可以使用索引，如下语句

-- 为什么可以走索引，按  COLLATION('xxx')查询出来的是utf8mb4_general_ci，和CustomerID不一致，应该不行的才对
EXPLAIN select * from customer_sysconfig where CustomerID='1';

这里我觉得应该是MySQL做了优化，在使用字符串的时候允许自动匹配字段的字符集，才可以走索引(其实不太对)，找找看有没有相关的文章
charset-literal
charset-collation-coercibility
Character Set Repertoire
type-conversion
utf8mb4
根据上面几篇文章，可以得到几个信息点

charset-literal

每个字符串字面量都有默认对应的字符集/校对集获取方式，也可以通过 [_charset_name]'string' [COLLATE collation_name]这种形式手动指定字符集和校对集
获取规则如下

如果_charset_name和COLLATE collation_name都指定了，那么指定的值会被使用，而且collation_name必须和charset_name匹配（比如utf8的字符集不能使用utf8mb4_0900_ai_ci校对集）
If both _charset_name and COLLATE collation_name are specified, character set charset_name and collation collation_name are used. collation_name must be a permitted collation for charset_name.

如果_charset_name指定，COLLATE没有指定，字符集会使用指定的，COLLATEE会使用字符集默认的校对集（SHOW CHARACTER SET可以查看到默认）
If _charset_name is specified but COLLATE is not specified, character set charset_name and its default collation are used. To see the default collation for each character set, use the SHOW CHARACTER SET statement or query the INFORMATION_SCHEMA CHARACTER_SETS table.

如果_charset_name没有指定，COLLATE指定了，字符集会使用character_set_connection变量指定的值，校对集使用指定的，而且collation_name必须和charset_name匹配
If _charset_name is not specified but COLLATE collation_name is specified, the connection default character set given by the character_set_connection system variable and collation collation_name are used. collation_name must be a permitted collation for the connection default character set.

如果两者都不指定，那么会取character_set_connection and collation_connection系统变量的值
Otherwise (neither _charset_name nor COLLATE collation_name is specified), the connection default character set and collation given by the character_set_connection and collation_connection system variables are used.

charset-collation-coercibility

参考下面的语句

-- 比较的时候如何确定使用哪种字符集和校对集？如果两者都一样那没有异议，如果两者都一样，那应该怎么处理
SELECT x FROM T WHERE x = 'Y';

为了解决这个问题，指定了一些规则，为每种规则指定一个coercibility值

如果手动指定了COLLATE，coercibility=0
An explicit COLLATE clause has a coercibility of 0 (not coercible at all).

如果CONCAT 的两个字符串有不同的COLLATION，取1
The concatenation of two strings with different collations has a coercibility of 1.

如果是一个字段或者是变量，取2
The collation of a column or a stored routine parameter or local variable has a coercibility of 2.

如果是系统常量，比如是通过函数返回的，取3
A “system constant” (the string returned by functions such as USER() or VERSION()) has a coercibility of 3.

如果是字面量，取4
The collation of a literal has a coercibility of 4.

如果是数字或者临时值，取5
The collation of a numeric or temporal value has a coercibility of 5.

如果是null或者null派生的表达式取6
NULL or an expression that is derived from NULL has a coercibility of 6.

MySQL会在比较时对左右两边进行一个coercibility的比较，然后优先选择最低的度的那个字符集和校对集进行比较，如果coercibility相同也会有相应的规则

使用最低的coercibility的值
Use the collation with the lowest coercibility value.

如果两边有相同的coercibility
If both sides have the same coercibility, then:
2.1 如果两者都是unicode或者都不是unicode，报错（但是还有Repertoire的存在）
If both sides are Unicode, or both sides are not Unicode, it is an error.
2.2 如果一边是unicode，一边不是unicode，那么会采用unicode字符集，因为超集的规则更宽泛，可以适用于子集
If one of the sides has a Unicode character set, and another side has a non-Unicode character set, the side with Unicode character set wins, and automatic character set conversion is applied to the non-Unicode side. For example, the following statement does not return an error:
SELECT CONCAT(utf8mb4_column, latin1_column) FROM t1;
It returns a result that has a character set of utf8mb4 and the same collation as utf8mb4_column. Values of latin1_column are automatically converted to utf8mb4 before concatenating.
2.3 如果相同字符集，但是混和了_bin和一个_ci/_cs的校对集，会采用_bin的校对集
For an operation with operands from the same character set but that mix a _bin collation and a _ci or _cs collation, the _bin collation is used. This is similar to how operations that mix nonbinary and binary strings evaluate the operands as binary strings, applied to collations rather than data types.

从我们的案例来看

-- CustomerID是字段，coercibility是2， 1是数字，coercibility是5，于是选择了CustomerID对应的字符集作为最后比较的字符集，即采用utf8，utf8_general_ci
-- 但是由于类型转换的问题，一个是字符串，一个是数字，无法使用索引，参考https://dev.mysql.com/doc/refman/5.7/en/type-conversion.html，里面提到，由于比如字符串 '1'，' 1'和'1a'都可以被转化成1，这样子索引是无法生效的，有种类似%1%无法生效索引一样（B Tree）
select MessageWarnType from customer_sysconfig where CustomerID=1;
-- CustomerID是字段，coercibility是2，'1'是字面量，coercibility是4，采用utf8，utf8_general_ci
-- 本来应该要报错的，但是由于'1'不包含超出ASCII的字符，所以Repertoire是ASCII，一边是UFT8，一边是ASCII，所以可以转为utf8，因此不会报错
select MessageWarnType from customer_sysconfig where CustomerID='1';
SET @xx='1';
-- CustomerID是字段，coercibility是2，@xx是变量，coercibility是2，两者coercibility相同
-- 依照相同的coercibility，先判断unicode，一边unicode是utf8，一边是utf8mb4，都是UNICODE字符集
-- 理论应该要报错，但是由于utf8mb4是utf8的超集，允许自动转换为utf8mb4（https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb4.html）
select MessageWarnType from customer_sysconfig where CustomerID=@xx;

解决

统一编码，或者是使用CONVERT(xxx using character_set),工作量较大，但能最大维持原项目不报错
假如从utf8改成utf8mb4，需要考虑几个方面

默认字符集修改
已有字符集修改
修改后一个字符从3个字节到4个字节，这个会导致原先比如255字节长度可以存放85个字符的varchar，目前只能存放63个字符，转换就会报错，需要相应修改容量或者类型

默认字符集修改

vi /etc/my.cnf

[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_general_ci
skip-character-set-client-handshake
#忽略应用连接自己设置的字符编码，保持与全局设置一致
[client]
default-character-set=utf8mb4
[mysql]
default-character-set=utf8mb4

-- 重启后查询默认值
SHOW VARIABLES WHERE variable_name LIKE 'character_set_%' OR variable_name LIKE 'collation%';

已有字符集修改

-- 设置数据库字符集编码
ALTER DATABASE `dbname` CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

-- 修改表字符集
alter table `tb_name` default character set utf8mb4;

-- 修改某字段字符集
alter table `tb_name` modify col_name varchar(20) character set utf8mb4;

-- 同时修改表及字段字符集
alter table `tb_name` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

-- 如果某个库里面表比较多 可以拼接出要执行的批量修改语句
SELECT
	CONCAT( 'ALTER TABLE ', TABLE_NAME, ' CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;' ) 
FROM
	information_schema.`TABLES` 
WHERE
	TABLE_SCHEMA = 'testdb';

总结

在考虑会不会使用索引时需要综合几个方面，除了常规的索引失效，还有可能是编码导致的问题
只考虑字符串比较的情况，需要确保使用的校对集是一样的才可以走索引
比较原则先看coercibility再看repertoire和特殊情况（比如utf8和utf8mb4）

参考

腾讯云log_queries_not_using_indexes
charset-literal
charset-collation-coercibility
Character Set Repertoire
type-conversion
utf8mb4
stack overflow-mysql-queries-are-fast-when-run-directly-but-really-slow-when-run-as-stored-proc
diagnosing-collation-issue-mysql-stored-procedure