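This is a PDF exported from MN (MarginNote); screenshot below:
If you copy the text into an Anki input field, you'll notice that the “足” and “而” clearly look different from the normal Chinese characters: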
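It turns out these are not the normal Chinese characters “足” and “而” but Kangxi radicals. Fonts without CJK coverage, such as the common Consolas, Courier New, and Courier, do not render them correctly and just show boxes, because they are code points outside the [\u4E00-\u9FFF] range of ordinary Chinese characters (the Kangxi Radicals block sits at U+2F00 to U+2FDF). They only look like “足” and “而”; they are not the normal characters. Most of them can be corrected with "NFKD" or "NFKC" normalization, but a few are not normalized to the expected character: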
This part can be corrected:
<|"⼀" -> "一", "⼄" -> "乙", "⼆" -> "二", "⼈" -> "人", "⼉" -> "儿",
"⼊" -> "入", "⼋" -> "八", "⼏" -> "几", "⼑" -> "刀", "⼒" -> "力",
"⼔" -> "匕", "⼗" -> "十", "⼘" -> "卜", "⼚" -> "厂", "⼜" -> "又",
"⼝" -> "口", "⼟" -> "土", "⼠" -> "士", "⼤" -> "大", "⼥" -> "女",
"⼦" -> "子", "⼨" -> "寸", "⼩" -> "小", "⼫" -> "尸", "⼭" -> "山",
"⼯" -> "工", "⼰" -> "己", "⼲" -> "干", "⼴" -> "广", "⼸" -> "弓",
"⼼" -> "心", "⼽" -> "戈", "⼿" -> "手", "⽀" -> "支", "⽂" -> "文",
"⽃" -> "斗", "⽄" -> "斤", "⽅" -> "方", "⽆" -> "无", "⽇" -> "日",
"⽈" -> "曰", "⽉" -> "月", "⽊" -> "木", "⽋" -> "欠", "⽌" -> "止",
"⽍" -> "歹", "⽏" -> "毋", "⽐" -> "比", "⽑" -> "毛", "⽒" -> "氏",
"⽓" -> "气", "⽔" -> "水", "⽕" -> "火", "⽖" -> "爪", "⽗" -> "父",
"⽚" -> "片", "⽛" -> "牙", "⽜" -> "牛", "⽝" -> "犬", "⽞" -> "玄",
"⽟" -> "玉", "⽠" -> "瓜", "⽡" -> "瓦", "⽢" -> "甘", "⽣" -> "生",
"⽤" -> "用", "⽥" -> "田", "⽩" -> "白", "⽪" -> "皮", "⽫" -> "皿",
"⽬" -> "目", "⽭" -> "矛", "⽮" -> "矢", "⽯" -> "石", "⽰" -> "示",
"⽲" -> "禾", "⽳" -> "穴", "⽴" -> "立", "⽵" -> "竹", "⽶" -> "米",
"⽸" -> "缶", "⽹" -> "网", "⽺" -> "羊", "⽻" -> "羽", "⽼" -> "老",
"⽽" -> "而", "⽿" -> "耳", "⾁" -> "肉", "⾂" -> "臣", "⾃" -> "自",
"⾄" -> "至", "⾆" -> "舌", "⾈" -> "舟", "⾉" -> "艮", "⾊" -> "色",
"⾍" -> "虫", "⾎" -> "血", "⾏" -> "行", "⾐" -> "衣", "⾓" -> "角",
"⾔" -> "言", "⾕" -> "谷", "⾖" -> "豆", "⾚" -> "赤", "⾛" -> "走",
"⾜" -> "足", "⾝" -> "身", "⾟" -> "辛", "⾠" -> "辰", "⾢" -> "邑",
"⾣" -> "酉", "⾥" -> "里", "⾦" -> "金", "⾩" -> "阜", "⾪" -> "隶",
"⾬" -> "雨", "⾮" -> "非", "⾯" -> "面", "⾰" -> "革", "⾲" -> "韭",
"⾳" -> "音", "⾷" -> "食", "⾸" -> "首", "⾹" -> "香", "⾻" -> "骨",
"⾼" -> "高", "⿁" -> "鬼", "⿅" -> "鹿", "⿇" -> "麻", "⿉" -> "黍",
"⿊" -> "黑", "⿍" -> "鼎", "⿎" -> "鼓", "⿏" -> "鼠", "⿐" -> "鼻",
"⼣" -> "夕"|>
For example, Python handles these perfectly:
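A minimal sketch of that check, using only the standard-library `unicodedata` module. The four sample code points are my own picks for illustration: two radicals that normalize to the expected character, and two that do not.

```python
import unicodedata

# Escapes keep the code points unambiguous, since the radicals
# look almost identical to the real characters.
samples = [
    "\u2F9C",  # Kangxi radical that looks like 足
    "\u2F7D",  # Kangxi radical that looks like 而
    "\u2F9E",  # Kangxi radical that looks like 車
    "\u2F1E",  # Kangxi radical that looks like 口
]

normalized = [unicodedata.normalize("NFKC", ch) for ch in samples]

for before, after in zip(samples, normalized):
    print(f"U+{ord(before):04X} -> U+{ord(after):04X} {after}")
```

The first two come out as 足 (U+8DB3) and 而 (U+800C). The last two come out as 車 (U+8ECA) and 囗 (U+56D7) rather than 车 or 口, so "NFKC" alone does not give the simplified form for every radical.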
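But this part is not corrected to the characters shown here (normalization gives the traditional or variant form instead, e.g. “⾞” becomes “車”, not “车”):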
<|"⼞" -> "口", "⾒" -> "儿", "⾞" -> "车", "⾤" -> "采", "⾧" -> "长",
"⾨" -> "门", "⾭" -> "青", "⾴" -> "页", "⾵" -> "风", "⾶" -> "飞",
"⾺" -> "马", "⿂" -> "鱼", "⿃" -> "鸟", "⿄" -> "卤", "⿒" -> "齿",
"⿓" -> "龙"|>
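That's fine, though: apart from “⼞”, which has to be added to the correction list by hand, the rest count as solved. Note that "NFKD" or "NFKC" normalization only maps a radical to the character that “looks” exactly the same, which takes care of the vast majority of cases. However, I noticed that besides the Kangxi radicals there are also some supplementary radicals that "NFKD" or "NFKC" normalization cannot fix; these can only be handled with plain string replacement. I picked the ones that are actual characters rather than bare components and put together the following replacement list: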
<|"⼞" -> "口", "⺒" -> "巳", "⺎" -> "兀", "⺏" -> "尣", "⺓" -> "幺",
"⺔" -> "彑", "⺛" -> "旡", "⺝" -> "月", "⺞" -> "歺", "⺟" -> "母",
"⺠" -> "民", "⺢" -> "氺", "⺩" -> "王", "⺬" -> "示", "⺯" -> "糹",
"⺽" -> "臼", "⺾" -> "艹", "⻁" -> "虎", "⻃" -> "覀", "⻄" -> "西",
"⻆" -> "角", "⻈" -> "讠", "⻋" -> "车", "⻑" -> "長", "⻒" -> "镸",
"⻖" -> "阝", "⻘" -> "青", "⻙" -> "韦", "⻚" -> "页", "⻛" -> "风",
"⻜" -> "飞", "⻝" -> "食", "⻢" -> "马", "⻣" -> "骨", "⻤" -> "鬼",
"⻥" -> "鱼", "⻦" -> "鸟", "⻧" -> "卤", "⻩" -> "黄", "⻪" -> "黾",
"⻫" -> "斉", "⻬" -> "齐", "⻭" -> "歯", "⻮" -> "齿", "⻯" -> "竜",
"⻰" -> "龙", "⻱" -> "龜", "⻲" -> "亀", "⻳" -> "龟"|>
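Based on "NFKC" plus the manual replacement list at the end, I built an .exe, attached below (the code is hosted on GitHub). Leave it running in the background and it automatically corrects the encoding of Chinese characters in the clipboard. That said, handling this with a background .exe is only a stopgap; I'd appreciate it if the official team could fix this issue as soon as possible. If the technical details need discussion, feel free to contact me…
MarginNote assistant.exe (44.0 MB)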
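For anyone who wants to reproduce the text-fixing step without the .exe, here is a rough Python sketch of "manual replacements first, then NFKC". It is not the code of the attached program: the table is an abbreviated subset of the manual list, `MANUAL_FIXES` and `fix_radicals` are names made up for this example, and the clipboard monitoring is left out.

```python
import unicodedata

# Abbreviated, illustrative subset of the manual replacement list;
# a real table would carry every entry.
MANUAL_FIXES = str.maketrans({
    "\u2F1E": "\u53E3",  # Kangxi radical "enclosure" -> 口
    "⻋": "车",
    "⻢": "马",
    "⻚": "页",
})


def fix_radicals(text: str) -> str:
    """Apply the manual replacements, then NFKC for the remaining radicals."""
    # The manual table runs first so that U+2F1E becomes 口
    # before NFKC would turn it into 囗 (U+56D7).
    return unicodedata.normalize("NFKC", text.translate(MANUAL_FIXES))


print(fix_radicals("\u2F9C\u2F7D\u2F1E"))
```

One thing to watch: "NFKC" folds other compatibility characters too, so fullwidth punctuation such as “,” (U+FF0C) becomes an ASCII “,”. If that matters, normalizing only the characters in the radical blocks would be a safer variant.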